min_date = df['date'].min()
max_date = df['date'].max()
print(f"\nPeriod of collected tweets: {min_date} / {max_date}\n")
Period of collected tweets: 2012-01-18 20:07:08-05:00 / 2023-03-21 09:59:39-05:00
General information about the dataset:
<class 'pandas.core.frame.DataFrame'>
Index: 32462 entries, 138957 to 171419
Data columns (total 63 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 query 32462 non-null object
1 id 32462 non-null float64
2 timestamp_utc 32462 non-null int64
3 local_time 32462 non-null object
4 user_screen_name 32462 non-null object
5 text 32462 non-null object
6 possibly_sensitive 15705 non-null object
7 retweet_count 32461 non-null float64
8 like_count 32461 non-null float64
9 reply_count 32461 non-null float64
10 impression_count 1206 non-null object
11 lang 32461 non-null object
12 to_username 7757 non-null object
13 to_userid 7757 non-null float64
14 to_tweetid 7498 non-null float64
15 source_name 32461 non-null object
16 source_url 32461 non-null object
17 user_location 32461 non-null object
18 lat 0 non-null object
19 lng 0 non-null object
20 user_id 32461 non-null object
21 user_name 32461 non-null object
22 user_verified 32461 non-null float64
23 user_description 32461 non-null object
24 user_url 32461 non-null object
25 user_image 32461 non-null object
26 user_tweets 32461 non-null object
27 user_followers 32461 non-null float64
28 user_friends 32461 non-null object
29 user_likes 32461 non-null float64
30 user_lists 32461 non-null float64
31 user_created_at 32461 non-null object
32 user_timestamp_utc 32461 non-null float64
33 collected_via 32461 non-null object
34 match_query 32461 non-null float64
35 retweeted_id 0 non-null float64
36 retweeted_user 0 non-null float64
37 retweeted_user_id 0 non-null float64
38 retweeted_timestamp_utc 0 non-null object
39 quoted_id 4421 non-null object
40 quoted_user 4421 non-null object
41 quoted_user_id 4421 non-null float64
42 quoted_timestamp_utc 4421 non-null float64
43 collection_time 32461 non-null object
44 url 32461 non-null object
45 place_country_code 3 non-null object
46 place_name 3 non-null object
47 place_type 3 non-null object
48 place_coordinates 3 non-null object
49 links 11585 non-null object
50 domains 11585 non-null object
51 media_urls 7326 non-null object
52 media_files 7326 non-null object
53 media_types 7326 non-null object
54 media_alt_texts 898 non-null object
55 mentioned_names 14806 non-null object
56 mentioned_ids 14239 non-null object
57 hashtags 6863 non-null object
58 intervention_type 0 non-null float64
59 intervention_text 0 non-null float64
60 intervention_url 0 non-null float64
61 country 32461 non-null object
62 date 32462 non-null datetime64[ns, America/Bogota]
dtypes: datetime64[ns, America/Bogota](1), float64(20), int64(1), object(41)
memory usage: 15.9+ MB
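One caveat in the summary above: `id` (and the other ID columns) were parsed as `float64`, but a 64-bit float cannot represent modern tweet IDs exactly, since they exceed 2^53. A minimal sketch of the problem and of forcing the column to string on load, using a hypothetical inline CSV:

```python
from io import StringIO
import pandas as pd

# Hypothetical one-row CSV with a tweet ID above 2**53
csv = StringIO("id,text\n1638199619372800001,hello\n")

as_float = pd.read_csv(csv, dtype={"id": "float64"})
csv.seek(0)
as_str = pd.read_csv(csv, dtype={"id": str})

# The float round-trip silently drops the trailing digit
print(int(as_float.loc[0, "id"]))  # 1638199619372800000 (trailing 1 lost)
print(as_str.loc[0, "id"])         # '1638199619372800001'
```

If exact IDs matter (e.g. for joining on `retweeted_id` or `quoted_id`), passing `dtype=str` for those columns at load time avoids the loss.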
Top 20 other websites mentioned in the tweets and their frequency
domains
bit.ly 1510
semana.com 1111
eltiempo.com 660
mariafernandacabal.com 544
facebook.com 467
bluradio.com 258
twitter.com 249
lafm.com.co 240
ow.ly 228
elcolombiano.com 220
youtu.be 207
youtube.com 205
centrodemocratico.com 192
ln.is 179
rcnradio.com 177
wradio.com.co 176
instagram.com 176
caracol.com.co 175
elespectador.com 175
costanoticias.com 153
Name: count, dtype: int64
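The code that produced this list is not shown; assuming the `domains` column is pipe-delimited like `hashtags`, a compact sketch using pandas' `explode` (with a small inline sample standing in for the real column) could be:

```python
import pandas as pd

# Pipe-delimited sample mimicking the assumed 'domains' format
domains = pd.Series(["bit.ly|semana.com", None, "bit.ly", "eltiempo.com|bit.ly"])

top_domains = (
    domains.dropna()          # drop tweets without links
           .str.split("|")    # split multi-domain cells
           .explode()         # one row per domain
           .value_counts()    # frequency table
           .head(20)          # top 20
)
print(top_domains)
```

The same chain reproduces the hashtag and mentioned-user counts below without the manual list flattening.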
Top 20 most used hashtags and their frequency
# convert dataframe column to list
hashtags = df['hashtags'].to_list()
# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]
# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]
# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]
# count items on list
hashtags_count = pd.Series(hashtags).value_counts()
# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)
top_hashtags
columna 481
soycabal 433
lascosascomoson 196
100porcientocabal 129
envivo 123
soyopositor 122
atención 120
votacd100cabal 118
alaire 107
farc 93
recomendado 86
restituciónsindespojo 77
urgente 73
bogotá 67
colombia 65
vocesysonidos 61
opinión 60
comunidad 57
mañanasblu 57
venezuela 55
Name: count, dtype: int64
Top 20 most mentioned users in the tweets
# filter column from dataframe
users = df['mentioned_names'].to_list()
# remove nan items from list
users = [x for x in users if not pd.isna(x)]
# split items into a list based on a delimiter
users = [x.split('|') for x in users]
# flatten list of lists
users = [item for sublist in users for item in sublist]
# count items on list
users_count = pd.Series(users).value_counts()
# return first n rows in descending order
top_users = users_count.nlargest(20)
top_users
alvarouribevel 507
jorenvilla1 393
juanmansantos 326
eltiempo 314
petrogustavo 310
drvargasquemba 301
igonima 295
cedemocratico 268
bluradioco 265
rcnlaradio 236
revistasemana 234
policiacolombia 225
elespectador 218
jflafaurie 208
ricardopuentesm 201
col_ejercito 189
alirestrepo 169
noticiasrcn 161
fiscaliacol 158
yobusgo 157
Name: count, dtype: int64
Top 20 most common tokens and their frequency
# load the spacy model for Spanish
import spacy
nlp = spacy.load("es_core_news_sm")
# load stop words for Spanish
STOP_WORDS = nlp.Defaults.stop_words
# Function to filter stop words
def filter_stopwords(text):
    # lower text
    doc = nlp(text.lower())
    # filter tokens
    tokens = [token.text for token in doc if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)
# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)
# count items on column
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]
token_counts
q 3678
farc 2630
colombia 2396
paz 2222
d 2188
gobierno 1317
país 1240
santos 1104
gracias 850
petro 841
justicia 816
venezuela 784
uribe 770
bogotá 759
libertad 716
víctimas 708
años 708
columna 704
presidente 687
colombianos 633
Name: count, dtype: int64
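Note that `q` and `d` (chat shorthand for "que" and "de") top the list because they are not in spaCy's stop-word set and survive the filter. A minimal post-filter that drops single-character tokens from a counts Series, sketched on a small sample, might look like:

```python
import pandas as pd

# Sample token counts including single-character shorthand
token_counts = pd.Series({"q": 3678, "farc": 2630, "paz": 2222, "d": 2188})

# Keep only tokens longer than one character
filtered = token_counts[token_counts.index.str.len() > 1]
print(filtered)
```

Alternatively, the shorthand forms could be added to `STOP_WORDS` before preprocessing.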
Top 10 hours with the most published tweets
hour
10 2424
12 2238
11 2224
09 2190
08 2136
20 1822
18 1815
13 1813
21 1788
14 1756
Name: count, dtype: int64
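The code for this table is not shown; the zero-padded hour labels suggest the hours were taken from `dt.strftime('%H')` rather than the integer `dt.hour`. A sketch on a small tz-aware sample standing in for `df['date']`:

```python
import pandas as pd

# Small tz-aware sample standing in for df['date']
dates = pd.Series(pd.to_datetime(
    ["2023-03-21 09:59:39", "2023-03-21 10:05:00", "2023-03-21 10:30:00"]
)).dt.tz_localize("America/Bogota")

# Zero-padded string hours, as in the table above
hour_counts = dates.dt.strftime("%H").value_counts()
print(hour_counts)  # '10' -> 2, '09' -> 1
```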
Platforms from which content was published and their frequency
source_name
Twitter for iPhone 14186
Twitter for BlackBerry® 8291
Twitter for Android 5049
Twitter Web Client 2627
Twitter for BlackBerry 841
TweetDeck 396
Twitter for iPad 239
Twitter for Android 207
Instagram 167
Twitter Web App 94
Periscope 84
Jetpack.com 75
Twitter for Android Tablets 73
Twitter Media Studio 73
Twitter for Websites 19
Twitter for Windows Phone 18
iOS 10
Twitlonger 4
erased5423693 4
Mobile Web (M2) 3
Twiffo 1
Name: count, dtype: int64
Topic modeling technique with transformers and TF-IDF
# required imports: tweet-preprocessor (p), emoji, bertopic
import preprocessor as p
from emoji import demojize
from bertopic import BERTopic

# remove urls, mentions and numbers
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(lambda x: p.clean(x))
# replace emojis with descriptions
df['text_pre'] = df['text_pre'].apply(lambda x: demojize(x))
# filter column
docs = df['text_pre']
# calculate topics and probabilities
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)
# training
topics, probs = topic_model.fit_transform(docs)
# visualize topics
topic_model.visualize_topics()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
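As the warning itself suggests, it can be silenced by setting the environment variable before the tokenizer-heavy imports run; a minimal sketch:

```python
import os

# Set before transformers/tokenizers are imported or the process forks
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])  # 'false'
```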
Map with 20% of the total generated topics
Selection of topics touching on gender themes
# selection of topics
topics = [10]
keywords_list = []
for topic_ in topics:
    topic = topic_model.get_topic(topic_)
    keywords = [x[0] for x in topic]
    keywords_list.append(keywords)
# flatten list of lists
words_list = [item for sublist in keywords_list for item in sublist]
# use apply method with lambda function to filter rows
filtered_df = df[df['text_pre'].apply(lambda x: any(word in x for word in words_list))]
percentage = round(100 * len(filtered_df) / len(df), 2)
print(f"Of the {len(df)} total tweets by @MariaFdaCabal, around {len(filtered_df)} talk about gender topics, i.e. roughly {percentage}%")
print(f"Keyword list for topics {topics}:\n{words_list}")
Of the 32462 total tweets by @MariaFdaCabal, around 725 talk about gender topics, i.e. roughly 2.23%
Keyword list for topics [10]:
['mujeres', 'negros', 'mujer', 'gay', 'aborto', 'negras', 'homosexuales', 'gays', 'comunidades', 'niñas']
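One caveat with the substring test `word in x` used above: it also matches keywords inside longer words (e.g. `gay` inside `gaya`), which can inflate the count. A stricter variant using regex word boundaries, sketched with the stdlib `re` module and a subset of the topic keywords:

```python
import re

# Subset of the topic-10 keywords above
keywords = ["mujeres", "gay", "aborto"]

# \b anchors each keyword to whole-word matches
pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b")

def mentions_keyword(text: str) -> bool:
    return bool(pattern.search(text))

print(mentions_keyword("derechos de las mujeres"))  # True
print(mentions_keyword("la gaya ciencia"))          # False
```

The same predicate drops into the `apply` call in place of the `any(...)` lambda.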
# drop rows with 0 values in two columns (copy to avoid SettingWithCopyWarning)
filtered_df = filtered_df[(filtered_df.like_count != 0) & (filtered_df.retweet_count != 0)].copy()
# add a new column with the average of likes and retweets (used as bubble size)
filtered_df['impressions'] = (filtered_df['like_count'] + filtered_df['retweet_count'])/2
# extract year from datetime column
filtered_df['year'] = filtered_df['date'].dt.year
# remove urls only
p.set_options(p.OPT.URL)
filtered_df['tweet_text'] = filtered_df['text'].apply(lambda x: p.clean(x))
# Create scatter plot
import plotly.express as px
fig = px.scatter(filtered_df, x='like_count',
                 y='retweet_count',
                 size='impressions',
                 color='year',
                 hover_name='tweet_text')
# Update title and axis labels
fig.update_layout(
    title='Tweets talking about gender with most Likes and Retweets',
    xaxis_title='Number of Likes',
    yaxis_title='Number of Retweets'
)
fig.show()
# convert column to list
tweets = df['text_pre'].to_list()
timestamps = df['local_time'].to_list()
topics_over_time = topic_model.topics_over_time(docs=tweets,
                                                timestamps=timestamps,
                                                global_tuning=True,
                                                evolution_tuning=True,
                                                nr_bins=20)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)