# earliest and latest tweet dates in the dataset
min_date = df['date'].min()
max_date = df['date'].max()
print(f"\nPeriod of collected tweets: {min_date} / {max_date}\n")
Period of collected tweets: 2010-07-06 10:37:00-05:00 / 2023-03-21 09:58:42-05:00
General information about the dataset
<class 'pandas.core.frame.DataFrame'>
Index: 8314 entries, 10617 to 18930
Data columns (total 63 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 query 8314 non-null object
1 id 8314 non-null float64
2 timestamp_utc 8314 non-null int64
3 local_time 8314 non-null object
4 user_screen_name 8314 non-null object
5 text 8314 non-null object
6 possibly_sensitive 2818 non-null object
7 retweet_count 8314 non-null float64
8 like_count 8314 non-null float64
9 reply_count 8314 non-null float64
10 impression_count 243 non-null object
11 lang 8314 non-null object
12 to_username 2188 non-null object
13 to_userid 2188 non-null float64
14 to_tweetid 2120 non-null float64
15 source_name 8314 non-null object
16 source_url 8314 non-null object
17 user_location 8314 non-null object
18 lat 65 non-null object
19 lng 65 non-null object
20 user_id 8314 non-null object
21 user_name 8314 non-null object
22 user_verified 8314 non-null float64
23 user_description 8314 non-null object
24 user_url 8314 non-null object
25 user_image 8314 non-null object
26 user_tweets 8314 non-null object
27 user_followers 8314 non-null float64
28 user_friends 8314 non-null object
29 user_likes 8314 non-null float64
30 user_lists 8314 non-null float64
31 user_created_at 8314 non-null object
32 user_timestamp_utc 8314 non-null float64
33 collected_via 8314 non-null object
34 match_query 8314 non-null float64
35 retweeted_id 0 non-null float64
36 retweeted_user 0 non-null float64
37 retweeted_user_id 0 non-null float64
38 retweeted_timestamp_utc 0 non-null object
39 quoted_id 800 non-null object
40 quoted_user 800 non-null object
41 quoted_user_id 800 non-null float64
42 quoted_timestamp_utc 800 non-null float64
43 collection_time 8314 non-null object
44 url 8314 non-null object
45 place_country_code 672 non-null object
46 place_name 672 non-null object
47 place_type 672 non-null object
48 place_coordinates 672 non-null object
49 links 1660 non-null object
50 domains 1660 non-null object
51 media_urls 1812 non-null object
52 media_files 1812 non-null object
53 media_types 1812 non-null object
54 media_alt_texts 249 non-null object
55 mentioned_names 4040 non-null object
56 mentioned_ids 3764 non-null object
57 hashtags 1824 non-null object
58 intervention_type 0 non-null float64
59 intervention_text 0 non-null float64
60 intervention_url 0 non-null float64
61 country 8314 non-null object
62 date 8314 non-null datetime64[ns, America/Guayaquil]
dtypes: datetime64[ns, America/Guayaquil](1), float64(20), int64(1), object(41)
memory usage: 4.1+ MB
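The `date` column listed above is timezone-aware (`America/Guayaquil`), while `timestamp_utc` is a plain integer. The notebook does not show how `date` was built. The following is a minimal sketch of one possible conversion, assuming `timestamp_utc` holds Unix seconds and using a fixed UTC-5 offset in place of the named zone (mainland Ecuador does not observe daylight saving time). The name `to_local` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset stands in for America/Guayaquil
ECUADOR_TZ = timezone(timedelta(hours=-5))

def to_local(timestamp_utc):
    """Convert Unix seconds (UTC) into a timezone-aware local datetime."""
    utc_dt = datetime.fromtimestamp(timestamp_utc, tz=timezone.utc)
    return utc_dt.astimezone(ECUADOR_TZ)

# hypothetical timestamp chosen to land on the earliest date reported above
print(to_local(1278430620).isoformat())
```

With pandas, a comparable result could likely be obtained with `pd.to_datetime(df['timestamp_utc'], unit='s', utc=True).dt.tz_convert('America/Guayaquil')`, again assuming the column stores seconds.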
Top 20 other websites mentioned in the tweets and their frequency
domains
etorrescobo.com 241
tinyurl.com 141
instagram.com 120
bit.ly 85
youtu.be 64
elcomercio.com 64
youtube.com 52
abc.es 52
ft.com 46
eluniverso.com 40
elpais.com 36
ow.ly 30
facebook.com 25
medium.com 24
expreso.ec 23
wsj.com 23
twitter.com 20
internacional.elpais.com 16
hoy.com.ec 13
nyti.ms 12
Name: count, dtype: int64
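The code behind this ranking is not shown. Below is a minimal sketch of how such a count could be produced, assuming the `domains` cells use the same `|` delimiter that this notebook splits on for hashtags and mentions. The helper `count_pipe_values` and the sample rows are hypothetical.

```python
from collections import Counter

def count_pipe_values(cells, top_n=20):
    """Count values found in pipe-delimited cells, skipping missing entries."""
    counter = Counter()
    for cell in cells:
        # missing cells may be None or float NaN, so only non-empty strings are split
        if isinstance(cell, str) and cell:
            counter.update(cell.split('|'))
    return counter.most_common(top_n)

# hypothetical sample rows, not taken from the dataset
sample = ["bit.ly|youtu.be", None, "bit.ly", float("nan"), "elcomercio.com|bit.ly"]
print(count_pipe_values(sample, top_n=2))
```

On the real data, a call such as `count_pipe_values(df['domains'])` would iterate over the column's cells in the same way.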
Top 20 most used hashtags and their frequency
# convert dataframe column to list
hashtags = df['hashtags'].to_list()
# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]
# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]
# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]
# count items on list
hashtags_count = pd.Series(hashtags).value_counts()
# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)
top_hashtags
ecuador 213
ambato 163
tungurahua 116
quito 47
venezuela 43
asambleanacional 35
maduro 29
trump 26
españa 24
vota6 23
atención 21
coip 20
emprendersinobstáculos 19
usfq 16
colombia 16
brexit 15
estebantorres 14
cambio 14
ecuadorprotesta 14
toros 13
Name: count, dtype: int64
Top 20 most mentioned users in the tweets
# convert dataframe column to list
users = df['mentioned_names'].to_list()
# remove nan items from list
users = [x for x in users if not pd.isna(x)]
# split items into a list based on a delimiter
users = [x.split('|') for x in users]
# flatten list of lists
users = [item for sublist in users for item in sublist]
# count items on list
users_count = pd.Series(users).value_counts()
# return first n rows in descending order
top_users = users_count.nlargest(20)
top_users
asambleaecuador 235
estebanperezm 140
etorrescobo 133
lftorrest 86
lassoguillermo 74
bancadapsc 66
eluniversocom 62
rxandrade 57
el_pais 52
usfq_ecuador 48
jfcarpio 47
cambioec 43
abc_es 42
amandahidalgoa 39
xvillalba1 39
ecuavisa 37
cristiano 35
la6ecuador 35
youtube 33
lenin 31
Name: count, dtype: int64
Top 20 most common tokens and their frequency
import spacy

# load the spaCy model for Spanish
nlp = spacy.load("es_core_news_sm")
# load stop words for Spanish
STOP_WORDS = nlp.Defaults.stop_words

# function to filter stop words
def filter_stopwords(text):
    # lowercase the text before tokenizing
    doc = nlp(text.lower())
    # keep alphabetic tokens that are not stop words
    tokens = [token.text for token in doc if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)

# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)
# count tokens across the column and keep the 20 most frequent
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]
token_counts
ecuador 586
gracias 388
gobierno 343
vía 332
asamblea 298
saludos 280
país 274
ambato 265
presidente 259
ley 251
artículo 221
nacional 216
the 203
ecuatorianos 178
tungurahua 173
comparto 167
años 161
debate 152
vida 152
quito 150
Name: count, dtype: int64
The 10 hours of the day with the most tweets published
hour
11 633
10 587
12 578
17 571
20 552
09 503
13 502
21 490
16 469
18 451
Name: count, dtype: int64
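The cell that produced this table is hidden. As an illustration only, the sketch below counts tweets per hour of day from timezone-aware datetimes, formatting the hour as a zero-padded string to match labels such as `09` above. The helper `top_hours`, the fixed UTC-5 offset, and the sample datetimes are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def top_hours(datetimes, top_n=10):
    """Return the most frequent hours of day as (zero-padded hour, count) pairs."""
    # strftime('%H') reads the hour in each datetime's own time zone
    hours = [dt.strftime('%H') for dt in datetimes]
    return Counter(hours).most_common(top_n)

# hypothetical sample datetimes in a fixed UTC-5 offset
tz = timezone(timedelta(hours=-5))
sample = [
    datetime(2023, 3, 21, 9, 58, 42, tzinfo=tz),
    datetime(2010, 7, 6, 10, 37, 0, tzinfo=tz),
    datetime(2015, 1, 1, 9, 5, 0, tzinfo=tz),
]
print(top_hours(sample))
```

Because the hour is read in local time, the same tweets could rank differently if the counts were taken from the UTC timestamps instead.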
Platforms from which content was published and their frequency
source_name
Twitter for iPhone 4322
Twitter Web Client 1787
Twitter for iPad 704
Twitter for Android 432
Twitter for BlackBerry® 344
Twitter Web App 232
Twitter for Websites 229
Instagram 105
Kioskoymas 41
Agorapulse app 33
Mobile Web 17
Medium 17
iOS 14
Hootsuite Inc. 13
TweetChat 6
Kindle 5
Canva 5
FOX News Login 2
Photos on iOS 2
OS X 1
Instagram on iOS 1
Crowdfire Inc. 1
bitly bitlink 1
Name: count, dtype: int64
Topic modeling with transformers and TF-IDF
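# assumed imports: `p` is the tweet-preprocessor package, `demojize` comes from emoji
import preprocessor as p
from emoji import demojize
from bertopic import BERTopic

# remove URLs, mentions and numbers (hashtags are not in the option list)
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(p.clean)
# replace emojis with descriptions
df['text_pre'] = df['text_pre'].apply(demojize)
# select the preprocessed text as the documents to model
docs = df['text_pre']
# configure the model to compute topics and per-document probabilities
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)
# fit the model and assign a topic to each document
topics, probs = topic_model.fit_transform(docs)
# visualize topics
topic_model.visualize_topics()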
Map of 10 topics in the tweet content
Selection of topics that touch on gender issues
# selection of topics
topics = [7]
keywords_list = []
for topic_ in topics:
    # get_topic returns (word, score) pairs; keep only the words
    topic = topic_model.get_topic(topic_)
    keywords = [x[0] for x in topic]
    keywords_list.append(keywords)
# flatten list of lists
words_list = [item for sublist in keywords_list for item in sublist]
# keep rows whose text contains any keyword (substring match)
filtered_df = df[df['text_pre'].apply(lambda x: any(word in x for word in words_list))]
percentage = round(100 * len(filtered_df) / len(df), 2)
print(f"Of the {len(df)} tweets by @etorrescobo, about {len(filtered_df)} address gender issues, that is, roughly {percentage}%")
print(f"List of words in topics {topics}:\n{words_list}")
Of the 8314 tweets by @etorrescobo, about 542 address gender issues, that is, roughly 6.52%
List of words in topics [7]:
['aborto', 'mujeres', 'violación', 'matrimonio', 'feminismo', 'despenalización', 'mujer', 'derecho', 'vida', 'coip']
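The filter above uses `word in x`, which is a substring test: a keyword such as `vida` also matches longer words that contain it (for example `olvidar`), so the figure of 542 tweets may include some that are off topic. The sketch below contrasts the substring test with a stricter whole-token variant. `matches_any_token` and the sample texts are hypothetical, and this is not the method used to obtain the figures above.

```python
def matches_substring(text, keywords):
    """Substring test, equivalent to the lambda used in the notebook."""
    return any(word in text for word in keywords)

def matches_any_token(text, keywords):
    """Whole-token test: a keyword must equal one whitespace-separated token."""
    tokens = set(text.split())
    return any(word in tokens for word in keywords)

keywords = ['aborto', 'mujeres', 'vida']
# hypothetical preprocessed texts, not taken from the dataset
texts = ["debate sobre aborto", "imposible olvidar ambato", "saludos quito"]

for text in texts:
    print(text, matches_substring(text, keywords), matches_any_token(text, keywords))
```

The second text matches only under the substring test, because `olvidar` contains `vida`. The whole-token variant is stricter in the other direction too: it would miss inflected forms such as plurals unless they are listed as keywords.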
import plotly.express as px

# keep only rows where both like_count and retweet_count are non-zero
# (.copy() avoids modifying a view of the original dataframe)
filtered_df = filtered_df[(filtered_df.like_count != 0) & (filtered_df.retweet_count != 0)].copy()
# add a new column with the mean of likes and retweets, used as marker size
filtered_df['impressions'] = (filtered_df['like_count'] + filtered_df['retweet_count'])/2
# extract year from datetime column
filtered_df['year'] = filtered_df['date'].dt.year
# remove URLs from the original tweet text
p.set_options(p.OPT.URL)
filtered_df['tweet_text'] = filtered_df['text'].apply(p.clean)
# create scatter plot
fig = px.scatter(filtered_df, x='like_count',
                 y='retweet_count',
                 size='impressions',
                 color='year',
                 hover_name='tweet_text')
# update title and axis labels
fig.update_layout(
    title='Tweets talking about gender with most Likes and Retweets',
    xaxis_title='Number of Likes',
    yaxis_title='Number of Retweets'
)
fig.show()
# convert columns to lists
tweets = df['text_pre'].to_list()
timestamps = df['local_time'].to_list()
# compute topic representations over time, grouped into 20 time bins
topics_over_time = topic_model.topics_over_time(docs=tweets,
                                                timestamps=timestamps,
                                                global_tuning=True,
                                                evolution_tuning=True,
                                                nr_bins=20)
# plot the evolution of up to 20 topics
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)