min_date = df['date'].min()
max_date = df['date'].max()
print(f"\nCollection period of the tweets: {min_date} / {max_date}\n")
Collection period of the tweets: 2010-02-06 20:50:46-05:00 / 2023-03-21 05:00:25-05:00
General information about the dataset
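The cell that produced the summary below is not shown in the export; a minimal sketch, assuming a plain pandas call:

# print index range, dtypes and non-null counts for every column (assumed call, not in the original export)
df.info()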
<class 'pandas.core.frame.DataFrame'>
Index: 17450 entries, 179250 to 196699
Data columns (total 63 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 query 17450 non-null object
1 id 17450 non-null float64
2 timestamp_utc 17450 non-null int64
3 local_time 17450 non-null object
4 user_screen_name 17450 non-null object
5 text 17450 non-null object
6 possibly_sensitive 15102 non-null object
7 retweet_count 17450 non-null float64
8 like_count 17450 non-null float64
9 reply_count 17450 non-null float64
10 impression_count 620 non-null object
11 lang 17450 non-null object
12 to_username 100 non-null object
13 to_userid 100 non-null float64
14 to_tweetid 87 non-null float64
15 source_name 17450 non-null object
16 source_url 17450 non-null object
17 user_location 17450 non-null object
18 lat 9 non-null object
19 lng 9 non-null object
20 user_id 17450 non-null object
21 user_name 17450 non-null object
22 user_verified 17450 non-null float64
23 user_description 17450 non-null object
24 user_url 17450 non-null object
25 user_image 17450 non-null object
26 user_tweets 17450 non-null object
27 user_followers 17450 non-null float64
28 user_friends 17450 non-null object
29 user_likes 17450 non-null float64
30 user_lists 17450 non-null float64
31 user_created_at 17450 non-null object
32 user_timestamp_utc 17450 non-null float64
33 collected_via 17450 non-null object
34 match_query 17450 non-null float64
35 retweeted_id 0 non-null float64
36 retweeted_user 0 non-null float64
37 retweeted_user_id 0 non-null float64
38 retweeted_timestamp_utc 0 non-null object
39 quoted_id 51 non-null object
40 quoted_user 51 non-null object
41 quoted_user_id 51 non-null float64
42 quoted_timestamp_utc 51 non-null float64
43 collection_time 17450 non-null object
44 url 17450 non-null object
45 place_country_code 795 non-null object
46 place_name 795 non-null object
47 place_type 795 non-null object
48 place_coordinates 795 non-null object
49 links 10194 non-null object
50 domains 10194 non-null object
51 media_urls 7526 non-null object
52 media_files 7526 non-null object
53 media_types 7526 non-null object
54 media_alt_texts 1940 non-null object
55 mentioned_names 3066 non-null object
56 mentioned_ids 2583 non-null object
57 hashtags 10072 non-null object
58 intervention_type 0 non-null float64
59 intervention_text 0 non-null float64
60 intervention_url 0 non-null float64
61 country 17450 non-null object
62 date 17450 non-null datetime64[ns, America/Bogota]
dtypes: datetime64[ns, America/Bogota](1), float64(20), int64(1), object(41)
memory usage: 8.5+ MB
Top 20 other websites mentioned in the tweets and their frequency
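The counting cell is missing from the export; a minimal sketch, assuming the pipe-separated domains column is counted as-is (so joined entries such as fb.me|ow.ly count as single items, matching the output below):

# count raw values of the pipe-separated domains column (assumed call)
df['domains'].value_counts().head(20)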
domains
fb.me 3803
instagram.com 1778
youtu.be 1280
ow.ly 1173
misionpaz.org 624
twitter.com 216
youtube.com 176
fb.me|ow.ly 150
misionpaz.org|youtu.be 135
bit.ly 98
pst.cr 90
jhonmilton.org 64
pscp.tv 58
fb.me|youtube.com 58
explosion.misionpaz.org 56
inscripciones.genesis.misionpaz.org 38
wp.me 37
congresos.misionpaz.org 23
misiónpaz.org 19
fb.me|new.livestream.com 18
Name: count, dtype: int64
Top 20 most used hashtags and their frequency
# convert dataframe column to list
hashtags = df['hashtags'].to_list()
# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]
# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]
# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]
# count items on list
hashtags_count = pd.Series(hashtags).value_counts()
# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)
top_hashtags
misionpazmicasa 1561
misionpaz 1098
mpntucasa 1012
iglesiampn 887
devocional 881
misionpazencasa 604
feparagrandesvictorias 514
20añostransformando 333
esnuestracasa 327
envivo 320
mpnenvivo 278
familiampn 253
somosfamilia 245
yosoympn 240
mpnnuestracasa 236
explosioncontundente 220
avivamiento 185
fiestademilagros 179
vive 176
mpn 157
Name: count, dtype: int64
Top 20 most mentioned users in the tweets
# convert dataframe column to list
users = df['mentioned_names'].to_list()
# remove nan items from list
users = [x for x in users if not pd.isna(x)]
# split items into a list based on a delimiter
users = [x.split('|') for x in users]
# flatten list of lists
users = [item for sublist in users for item in sublist]
# count items on list
users_count = pd.Series(users).value_counts()
# return first n rows in descending order
top_users = users_count.nlargest(20)
top_users
johnmiltonr_ 712
gerarydiana 342
jhonmiltonr 270
joelmanderfield 257
youtube 186
prjhonmilton 171
ce_palace 164
gissymander 151
profetanormasr 117
misionpaziglesia 76
misionpaz_ 75
soynormaruiz 45
marcobarrientos 45
normanormaruiz 41
fundacionmisionpaz 30
prgerardoydiana 28
pastorcashluna 25
cesarfajardosm 24
otonielfont 23
evancraft 21
Name: count, dtype: int64
Top 20 most common tokens and their frequency
import spacy

# load the spaCy model for Spanish
nlp = spacy.load("es_core_news_sm")
# load stop words for Spanish
STOP_WORDS = nlp.Defaults.stop_words
# function to drop stop words and non-alphabetic tokens
def filter_stopwords(text):
    # lowercase the text before tokenizing
    doc = nlp(text.lower())
    # keep alphabetic tokens that are not stop words
    tokens = [token.text for token in doc if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)
# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)
# count items on column
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]
token_counts
dios 6122
paz 2051
misión 1780
vida 1589
misionpazmicasa 1556
mensaje 1280
completo 1193
tiempo 1123
misionpaz 1100
conéctate 1086
devocional 1033
mpntucasa 1016
amor 1000
celebración 983
familia 926
iglesiampn 891
jesús 851
pm 809
esperamos 805
señor 793
Name: count, dtype: int64
The 10 hours with the highest number of published tweets
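The aggregation cell is not shown; a plausible sketch, assuming the hour is derived from the timezone-aware date column (the zero-padded labels such as 04 suggest strftime formatting):

# extract the two-digit hour from each tweet's datetime and count tweets per hour (assumed derivation)
df['hour'] = df['date'].dt.strftime('%H')
df['hour'].value_counts().head(10)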
hour
18 1245
20 1223
19 1119
04 1063
12 1058
11 1041
09 1039
17 1027
08 1003
10 996
Name: count, dtype: int64
Platforms from which content was published and their frequency
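Again the counting cell is absent; a minimal sketch, assuming a plain frequency count over the source_name column:

# count how many tweets were published from each client or platform (assumed call)
df['source_name'].value_counts()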
source_name
Facebook 4460
Twitter Web App 2791
Hootsuite 2423
Instagram 1614
Twitter Web Client 1341
Postcron App 971
Twitter for iPad 868
Twitter for Android 682
Twitter for iPhone 627
TweetDeck 290
SocialGest 285
Google 254
Twitter Media Studio 230
Repost.social 167
Hootsuite Inc. 165
a Ning Network 106
Restream.io 72
Periscope 57
erased9_3Ud7cuBk0y 39
erased132190 3
Ustream.TV 2
LinkedIn 1
Twitter for Advertisers. 1
erased138961 1
Name: count, dtype: int64
Topic modeling technique with transformers and TF-IDF
import preprocessor as p  # tweet-preprocessor
from emoji import demojize

# remove urls, mentions and numbers (no hashtag option is set, so hashtags are kept)
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(lambda x: p.clean(x))
# replace emojis with their text descriptions
df['text_pre'] = df['text_pre'].apply(lambda x: demojize(x))
# filter column
docs = df['text_pre']
from bertopic import BERTopic

# initialize the multilingual topic model
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)
# fit the model and extract topics and their probabilities
topics, probs = topic_model.fit_transform(docs)
# visualize topics
topic_model.visualize_topics()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
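The warning above is emitted once per forked worker; as the message itself suggests, it can be silenced by setting the environment variable before any tokenizer is loaded:

# disable tokenizers parallelism before the process forks, as the warning recommends
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"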
Map of 20% of the total topics generated
No topic was identified that speaks decisively about abortion, feminism, or gender.
# convert the text and timestamp columns to lists
tweets = df['text_pre'].to_list()
timestamps = df['local_time'].to_list()
# compute topic frequencies over time, binned into 20 periods
topics_over_time = topic_model.topics_over_time(docs=tweets,
                                                timestamps=timestamps,
                                                global_tuning=True,
                                                evolution_tuning=True,
                                                nr_bins=20)
# visualize the 20 most frequent topics over time
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)