import pandas as pd

min_date = df['date'].min()
max_date = df['date'].max()
print(f"\nPeriod of collected tweets: {min_date} / {max_date}\n")
Period of collected tweets: 2013-08-20 15:43:12-05:00 / 2023-03-21 08:54:01-05:00
General information about the dataset
<class 'pandas.core.frame.DataFrame'>
Index: 23687 entries, 21498 to 45184
Data columns (total 63 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 query 23687 non-null object
1 id 23687 non-null float64
2 timestamp_utc 23687 non-null int64
3 local_time 23687 non-null object
4 user_screen_name 23687 non-null object
5 text 23687 non-null object
6 possibly_sensitive 4337 non-null object
7 retweet_count 23687 non-null float64
8 like_count 23687 non-null float64
9 reply_count 23687 non-null float64
10 impression_count 1985 non-null object
11 lang 23687 non-null object
12 to_username 16705 non-null object
13 to_userid 16705 non-null float64
14 to_tweetid 16687 non-null float64
15 source_name 23687 non-null object
16 source_url 23687 non-null object
17 user_location 0 non-null object
18 lat 0 non-null object
19 lng 0 non-null object
20 user_id 23687 non-null object
21 user_name 23687 non-null object
22 user_verified 23687 non-null float64
23 user_description 23687 non-null object
24 user_url 0 non-null object
25 user_image 23687 non-null object
26 user_tweets 23687 non-null object
27 user_followers 23687 non-null float64
28 user_friends 23687 non-null object
29 user_likes 23687 non-null float64
30 user_lists 23687 non-null float64
31 user_created_at 23687 non-null object
32 user_timestamp_utc 23687 non-null float64
33 collected_via 23687 non-null object
34 match_query 23687 non-null float64
35 retweeted_id 0 non-null float64
36 retweeted_user 0 non-null float64
37 retweeted_user_id 0 non-null float64
38 retweeted_timestamp_utc 0 non-null object
39 quoted_id 2119 non-null object
40 quoted_user 2119 non-null object
41 quoted_user_id 2119 non-null float64
42 quoted_timestamp_utc 2119 non-null float64
43 collection_time 23687 non-null object
44 url 23687 non-null object
45 place_country_code 58 non-null object
46 place_name 58 non-null object
47 place_type 58 non-null object
48 place_coordinates 58 non-null object
49 links 1626 non-null object
50 domains 1626 non-null object
51 media_urls 3806 non-null object
52 media_files 3806 non-null object
53 media_types 3806 non-null object
54 media_alt_texts 361 non-null object
55 mentioned_names 17305 non-null object
56 mentioned_ids 16211 non-null object
57 hashtags 2499 non-null object
58 intervention_type 0 non-null float64
59 intervention_text 0 non-null float64
60 intervention_url 0 non-null float64
61 country 23687 non-null object
62 date 23687 non-null datetime64[ns, America/Guayaquil]
dtypes: datetime64[ns, America/Guayaquil](1), float64(20), int64(1), object(41)
memory usage: 11.6+ MB
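The summary above shows several columns with 0 non-null entries (`user_location`, `lat`, `lng`, the `retweeted_*` and `intervention_*` fields); these carry no information and can be dropped in one pass. A minimal sketch on a hypothetical mini-frame (`sample` stands in for the real dataframe):

```python
import numpy as np
import pandas as pd

# hypothetical mini-frame with one fully-null column, mimicking e.g. `lat`
sample = pd.DataFrame({"text": ["a", "b"], "lat": [np.nan, np.nan]})

# drop every column whose values are all missing
sample_clean = sample.dropna(axis=1, how="all")
print(sample_clean.columns.tolist())  # → ['text']
```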
Top 20 other websites mentioned in the tweets and their frequency
domains
panampost.com 321
es.panampost.com 165
youtube.com 105
youtu.be 85
twitter.com 52
bit.ly 43
instagram.com 37
gaceta.es 32
facebook.com 21
buff.ly 20
publichealth.lacounty.gov 19
eluniverso.com 18
amazon.com 14
abc.es 13
lozierinstitute.org 13
vatican.va 10
amp.milenio.com 9
lifenews.com 9
library.brown.edu 9
bbc.com 9
Name: count, dtype: int64
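The domain table above was presumably produced by splitting the pipe-delimited `domains` column and counting, the same pattern used for `hashtags` below. A sketch on hypothetical sample data (the real column holds one `|`-delimited string per tweet):

```python
import pandas as pd

# hypothetical sample mimicking the pipe-delimited `domains` column
sample = pd.DataFrame({"domains": ["panampost.com|youtube.com", None, "youtube.com"]})

domain_counts = (
    sample["domains"]
    .dropna()          # drop tweets without links
    .str.split("|")    # split the pipe-delimited string
    .explode()         # one domain per row
    .value_counts()    # frequency table
)
print(domain_counts.head(20))
```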
Top 20 most used hashtags and their frequency
# convert dataframe column to list
hashtags = df['hashtags'].to_list()
# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]
# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]
# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]
# count items on list
hashtags_count = pd.Series(hashtags).value_counts()
# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)
top_hashtags
leyabortistano 243
femeninasífeministano 87
coronavirus 87
salvemoslasdosvidas 64
blacklivesmatter 64
ecuadoresprovida 47
vetopresidencial 42
tiraníasanitaria 41
provida 33
leydelviolador 29
síalavida 28
noalaborto 27
nohablesenminombre 26
abortolegal 26
guateesvida 24
datomatarelato 24
laviolencianotienegénero 23
covid19 22
8m 21
justiciaporlucio 19
Name: count, dtype: int64
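The split/flatten/count steps above can also be expressed entirely in pandas with `str.split` plus `explode`; an equivalent sketch on hypothetical sample data:

```python
import pandas as pd

# hypothetical sample matching the pipe-delimited format of the real column
sample = pd.DataFrame({"hashtags": ["provida|noalaborto", None, "provida"]})

top_tags = (
    sample["hashtags"]
    .dropna()          # remove NaN rows
    .str.split("|")    # split each string into a list
    .explode()         # one hashtag per row
    .value_counts()    # count occurrences
    .nlargest(20)      # top 20
)
print(top_tags)
```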
Top 20 most mentioned users in the tweets
# filter column from dataframe
users = df['mentioned_names'].to_list()
# remove nan items from list
users = [x for x in users if not pd.isna(x)]
# split items into a list based on a delimiter
users = [x.split('|') for x in users]
# flatten list of lists
users = [item for sublist in users for item in sublist]
# count items on list
users_count = pd.Series(users).value_counts()
# return first n rows in descending order
top_users = users_count.nlargest(20)
top_users
agustinlaje 330
lassoguillermo 324
mamelafialloflo 271
panampost_es 239
jairbolsonaro 210
etorrescobo 203
felipeleon88 177
realdonaldtrump 171
xileone 138
pjavieror 135
vox_es 131
asambleaecuador 129
simpliciterpaco 127
fundlibre 118
pmunoziturrieta 110
gloriaalvarez85 109
freityt 109
jmilei 98
avelinaponceg 97
pontifex_es 95
Name: count, dtype: int64
Top 20 most common tokens and their frequency
import spacy

# load the spaCy model for Spanish
nlp = spacy.load("es_core_news_sm")
# load stop words for Spanish
STOP_WORDS = nlp.Defaults.stop_words
# function to filter stop words
def filter_stopwords(text):
    # lower text
    doc = nlp(text.lower())
    # filter tokens
    tokens = [token.text for token in doc if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)
# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)
# count items on column
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]
token_counts
gracias 2344
mujer 1729
vida 1548
mujeres 1394
aborto 1381
feminismo 1167
libertad 924
matar 881
derecho 701
the 666
ecuador 657
hombre 630
quieren 617
feministas 614
izquierda 603
madre 586
personas 549
dios 529
violencia 528
hombres 528
Name: count, dtype: int64
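For large corpora, `str.split(expand=True).stack()` materializes a wide intermediate frame; the same counts can be accumulated incrementally with `collections.Counter`. A sketch on hypothetical sample text:

```python
from collections import Counter

import pandas as pd

# hypothetical sample of preprocessed tweet text
sample = pd.DataFrame({"text_pre": ["gracias mujer vida", "mujer aborto", "gracias"]})

# accumulate token counts row by row instead of building a wide frame
counter = Counter()
for text in sample["text_pre"]:
    counter.update(text.split())
top_tokens = counter.most_common(20)
print(top_tokens)
```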
The 10 hours with the most published tweets
hour
08 1665
09 1659
10 1573
22 1571
07 1522
11 1338
23 1312
21 1302
19 1268
12 1237
Name: count, dtype: int64
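The hourly distribution above can be derived from the timezone-aware `date` column by formatting the hour as a zero-padded string and counting. A sketch on hypothetical timestamps:

```python
import pandas as pd

# hypothetical timezone-aware timestamps like the real `date` column
sample = pd.DataFrame({"date": pd.to_datetime([
    "2023-03-21 08:54:01-05:00",
    "2023-03-21 08:10:00-05:00",
    "2013-08-20 22:43:12-05:00",
])})

# zero-padded hour string per tweet, then a frequency table
hour_counts = sample["date"].dt.strftime("%H").value_counts()
print(hour_counts.nlargest(10))
```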
Platforms from which content was published and their frequency
Topic modeling with transformers and TF-IDF
import preprocessor as p
from emoji import demojize

# remove urls, mentions and numbers (hashtags are kept)
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(lambda x: p.clean(x))
# replace emojis with text descriptions
df['text_pre'] = df['text_pre'].apply(lambda x: demojize(x))
# filter column
docs = df['text_pre']
from bertopic import BERTopic

# calculate topics and probabilities
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)
# training
topics, probs = topic_model.fit_transform(docs)
# visualize topics
topic_model.visualize_topics()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
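The warning itself suggests the fix: set `TOKENIZERS_PARALLELISM` before the first tokenizer is loaded in the parent process. A minimal sketch:

```python
import os

# must run before the first huggingface tokenizer is created
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])  # → false
```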
Map with 20% of the total generated topics
Selection of topics covering gender issues
# selection of topics
topics = [1, 5]
keywords_list = []
for topic_ in topics:
    topic = topic_model.get_topic(topic_)
    keywords = [x[0] for x in topic]
    keywords_list.append(keywords)
# flatten list of lists
words_list = [item for sublist in keywords_list for item in sublist]
# use apply method with lambda function to filter rows
filtered_df = df[df['text_pre'].apply(lambda x: any(word in x for word in words_list))]
percentage = round(100 * len(filtered_df) / len(df), 2)
print(f"Out of {len(df)} tweets by @MamelaFialloFlo, around {len(filtered_df)} discuss gender issues, i.e. about {percentage}%")
print(f"Keyword list for topics {topics}:\n{words_list}")
Out of 23687 tweets by @MamelaFialloFlo, around 9016 discuss gender issues, i.e. about 38.06%
Keyword list for topics [1, 5]:
['feminismo', 'mujer', 'mujeres', 'feministas', 'feminista', 'hombres', 'violencia', 'hombre', 'lgbt', 'trans', 'aborto', 'pro', 'mujeres', 'provida', 'feministas', 'abortos', 'abortar', 'matar', 'leyabortistano', 'vida']
# drop rows with 0 likes or 0 retweets; copy to avoid chained-assignment warnings
filtered_df = filtered_df[(filtered_df.like_count != 0) & (filtered_df.retweet_count != 0)].copy()
# add a new column with the average of likes and retweets
filtered_df['impressions'] = (filtered_df['like_count'] + filtered_df['retweet_count'])/2
# extract year from datetime column
filtered_df['year'] = filtered_df['date'].dt.year
# remove only urls (mentions and hashtags are kept for display)
p.set_options(p.OPT.URL)
filtered_df['tweet_text'] = filtered_df['text'].apply(lambda x: p.clean(x))
import plotly.express as px

# create scatter plot of likes vs retweets, sized by engagement
fig = px.scatter(filtered_df, x='like_count',
                 y='retweet_count',
                 size='impressions',
                 color='year',
                 hover_name='tweet_text')
# Update title and axis labels
fig.update_layout(
title='Tweets talking about gender with most Likes and Retweets',
xaxis_title='Number of Likes',
yaxis_title='Number of Retweets'
)
fig.show()
# convert columns to lists
tweets = df['text_pre'].to_list()
timestamps = df['local_time'].to_list()
topics_over_time = topic_model.topics_over_time(docs=tweets,
                                                timestamps=timestamps,
                                                global_tuning=True,
                                                evolution_tuning=True,
                                                nr_bins=20)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)