min_date = df['date'].min()
max_date = df['date'].max()
print(f"\nPeriod of collected tweets: {min_date} / {max_date}\n")
Period of collected tweets: 2010-06-11 11:27:43-03:00 / 2023-03-20 08:00:26-03:00
General information about the dataset
<class 'pandas.core.frame.DataFrame'>
Index: 43709 entries, 45185 to 88893
Data columns (total 63 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 query 43709 non-null object
1 id 43709 non-null float64
2 timestamp_utc 43709 non-null int64
3 local_time 43709 non-null object
4 user_screen_name 43709 non-null object
5 text 43709 non-null object
6 possibly_sensitive 26733 non-null object
7 retweet_count 43709 non-null float64
8 like_count 43709 non-null float64
9 reply_count 43709 non-null float64
10 impression_count 365 non-null object
11 lang 43709 non-null object
12 to_username 874 non-null object
13 to_userid 874 non-null float64
14 to_tweetid 560 non-null float64
15 source_name 43709 non-null object
16 source_url 43709 non-null object
17 user_location 43709 non-null object
18 lat 0 non-null object
19 lng 0 non-null object
20 user_id 43709 non-null object
21 user_name 43709 non-null object
22 user_verified 43709 non-null float64
23 user_description 43709 non-null object
24 user_url 43709 non-null object
25 user_image 43709 non-null object
26 user_tweets 43709 non-null object
27 user_followers 43709 non-null float64
28 user_friends 43709 non-null object
29 user_likes 43709 non-null float64
30 user_lists 43709 non-null float64
31 user_created_at 43709 non-null object
32 user_timestamp_utc 43709 non-null float64
33 collected_via 43709 non-null object
34 match_query 43709 non-null float64
35 retweeted_id 0 non-null float64
36 retweeted_user 0 non-null float64
37 retweeted_user_id 0 non-null float64
38 retweeted_timestamp_utc 0 non-null object
39 quoted_id 53 non-null object
40 quoted_user 53 non-null object
41 quoted_user_id 53 non-null float64
42 quoted_timestamp_utc 53 non-null float64
43 collection_time 43709 non-null object
44 url 43709 non-null object
45 place_country_code 21 non-null object
46 place_name 21 non-null object
47 place_type 21 non-null object
48 place_coordinates 21 non-null object
49 links 20062 non-null object
50 domains 20062 non-null object
51 media_urls 10154 non-null object
52 media_files 10154 non-null object
53 media_types 10154 non-null object
54 media_alt_texts 399 non-null object
55 mentioned_names 3473 non-null object
56 mentioned_ids 3149 non-null object
57 hashtags 2265 non-null object
58 intervention_type 0 non-null float64
59 intervention_text 0 non-null float64
60 intervention_url 0 non-null float64
61 country 43709 non-null object
62 date 43709 non-null datetime64[ns, America/Sao_Paulo]
dtypes: datetime64[ns, America/Sao_Paulo](1), float64(20), int64(1), object(41)
memory usage: 21.3+ MB
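The listing above shows both a raw UNIX timestamp_utc column (int64) and a timezone-aware date column. The construction of date is not shown in this section; a minimal sketch of one way it could be derived, assuming the timestamps are epoch seconds:

import pandas as pd

# hypothetical reconstruction: parse epoch seconds as UTC, then
# convert to the America/Sao_Paulo timezone seen in the dtype above
df['date'] = (
    pd.to_datetime(df['timestamp_utc'], unit='s', utc=True)
      .dt.tz_convert('America/Sao_Paulo')
)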
Top 20 other websites mentioned in the tweets and their frequency
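The cell that produced this count is not included; entries such as facebook.com|youtube.com suggest the domains column was tallied without splitting on the | delimiter. A minimal sketch that reproduces such a table:

# count raw domain strings; piped multi-domain values remain combined
top_domains = df['domains'].value_counts().nlargest(20)
top_domains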
domains
youtu.be 7145
verdadegospel.com 3659
youtube.com 2116
goo.gl 1719
veja.abril.com.br 992
vitoriaemcristo.org 848
migre.me 696
editoracentralgospel.com 336
bit.ly 166
instagram.com 153
gospelplay.com 147
facebook.com 140
facebook.com|youtube.com 123
pastoresjuntos.com 111
eventos.vitoriaemcristo.org 103
fb.watch|instagram.com 103
escoladelideresonline.com 69
g1.globo.com 67
advec.org 67
eventospastorsilas.com.br 66
Name: count, dtype: int64
Top 20 most used hashtags and their frequency
import pandas as pd

# convert dataframe column to list
hashtags = df['hashtags'].to_list()
# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]
# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]
# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]
# count items on list
hashtags_count = pd.Series(hashtags).value_counts()
# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)
top_hashtags
silasmalafaia 203
roubalheiraepttudoaver 197
dilmavaiperderaecio45vencer 123
elessabiamdorouboforadilma 116
12anosderoubalheiradoptchega 71
povobrasileirocontradilmaept 70
chegaderoubalheiraforadilma 69
225 60
chegaderouboementiraforadilma 59
dilmanaodialoguecomterrorista 54
clamandopelobrasil 52
dilmaroubalheiraepttudoaver 52
fachinnão 46
nemcorrupçãonemptnemdilma 43
videodepregacao 43
respond 43
aovivocommalafaia 42
lulaedilmamenosodiocontramarina 42
votoaeciopelobr45il 41
marinaresistentevaiserpresidente 39
Name: count, dtype: int64
Top 20 most mentioned users in the tweets
# convert dataframe column to list
users = df['mentioned_names'].to_list()
# remove nan items from list
users = [x for x in users if not pd.isna(x)]
# split items into a list based on a delimiter
users = [x.split('|') for x in users]
# flatten list of lists
users = [item for sublist in users for item in sublist]
# count items on list
users_count = pd.Series(users).value_counts()
# return first n rows in descending order
top_users = users_count.nlargest(20)
top_users
verdadegospel 638
advecoficial 532
edcentralgospel 179
reinaldoazevedo 158
avec_oficial 135
elizetemalafaia 115
radaronline 103
eyshila1 70
cgospelmusic 60
pastormalafaiaoficial 52
pastormalafaia 45
nani_azevedo 43
drmikemurdock 41
silasmalafaia 39
veja 38
jozyanneoficial 37
magnomaltaofc 35
advecsaopaulo 33
pgm_ratinho 26
danilogentili 23
Name: count, dtype: int64
Top 20 most common tokens and their frequency
import spacy

# load the spaCy model for Portuguese
nlp = spacy.load("pt_core_news_sm")
# load stop words for Portuguese
STOP_WORDS = nlp.Defaults.stop_words

# Function to filter stop words
def filter_stopwords(text):
    # lowercase the text before tokenizing
    doc = nlp(text.lower())
    # keep alphabetic tokens that are not stop words
    tokens = [token.text for token in doc
              if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)

# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)
# count tokens across the column and keep the 20 most frequent
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]
token_counts
assista 7856
q 5381
vídeo 4499
deus 3435
programa 3030
dia 2894
bolsonaro 2840
vitória 2782
cristo 2620
brasil 2418
acesse 2256
hoje 2189
vou 2168
pt 2073
divulgue 1981
ñ 1882
sábado 1816
imprensa 1727
lula 1721
imperdível 1715
Name: count, dtype: int64
Top 10 hours with the most published tweets
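The cell behind this table is not shown; one plausible derivation from the timezone-aware date column follows (the intermediate hour column is an assumption):

# extract the local hour from each timestamp
df['hour'] = df['date'].dt.hour
# ten busiest hours by tweet volume
df['hour'].value_counts().nlargest(10)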
hour
17 3772
12 3693
14 3432
15 3366
16 3268
10 3017
11 3004
13 2849
18 2446
19 2377
Name: count, dtype: int64
Platforms from which content was published and their frequency
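No code accompanies this output; a plain frequency count over source_name reproduces it:

# frequency of each publishing platform/client
df['source_name'].value_counts()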
source_name
Twitter Web Client 11191
Postcron App 10752
Twitter for iPad 8900
mLabs - Gestão de Redes Sociais 7243
Twitter for iPhone 2183
erased3412752 723
Twitter Ads 580
Twitter for Android 466
Twitter for Android Tablets 444
TweetDeck 424
Twitter Web App 303
Postgrain 144
Periscope 106
Twitter for BlackBerry® 98
Twitter for Advertisers. 65
Dynamic Tweets 63
Twitpic 7
Twitter for Websites 3
Mobile Web 3
iOS 3
Mobile Web (M2) 2
Instagram 2
Photos on iOS 1
Twitter Media Studio 1
audioBoom 1
Twitter for Windows Phone 1
Name: count, dtype: int64
Topic modeling technique with transformers and TF-IDF
import preprocessor as p
from emoji import demojize
from bertopic import BERTopic

# remove urls, mentions and numbers (no hashtag option is set)
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(lambda x: p.clean(x))
# replace emojis with their text descriptions
df['text_pre'] = df['text_pre'].apply(lambda x: demojize(x))
# filter column
docs = df['text_pre']
# calculate topics and probabilities
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)
# training
topics, probs = topic_model.fit_transform(docs)
# visualize topics
topic_model.visualize_topics()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
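As the warning text itself indicates, the message can be silenced by setting the TOKENIZERS_PARALLELISM environment variable before the tokenizers are first used; a minimal sketch:

import os

# must be set before any huggingface tokenizer is loaded
os.environ["TOKENIZERS_PARALLELISM"] = "false"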