Analysis of tweets by @MariaFdaCabal

Data
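
The imports and data-loading step are not shown in the rendered output. A minimal setup sketch, assuming the collected tweets were exported to a CSV file (the name tweets.csv is hypothetical):

Code
# libraries used throughout this analysis
import pandas as pd
import plotly.express as px
import spacy
import preprocessor as p      # tweet-preprocessor
from emoji import demojize
from bertopic import BERTopic

# hypothetical export of the collected tweets; the actual file name may differ
df = pd.read_csv('tweets.csv')

# build a timezone-aware datetime column from the UTC timestamp
df['date'] = pd.to_datetime(df['timestamp_utc'], unit='s', utc=True).dt.tz_convert('America/Bogota')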

General information about the dataset:

Code
min_date = df['date'].min()

max_date = df['date'].max()

print(f"\nPeriodo de tweets recolectados: {min_date} / {max_date}\n")

Periodo de tweets recolectados: 2012-01-18 20:07:08-05:00 / 2023-03-21 09:59:39-05:00
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 32462 entries, 138957 to 171419
Data columns (total 63 columns):
 #   Column                   Non-Null Count  Dtype                         
---  ------                   --------------  -----                         
 0   query                    32462 non-null  object                        
 1   id                       32462 non-null  float64                       
 2   timestamp_utc            32462 non-null  int64                         
 3   local_time               32462 non-null  object                        
 4   user_screen_name         32462 non-null  object                        
 5   text                     32462 non-null  object                        
 6   possibly_sensitive       15705 non-null  object                        
 7   retweet_count            32461 non-null  float64                       
 8   like_count               32461 non-null  float64                       
 9   reply_count              32461 non-null  float64                       
 10  impression_count         1206 non-null   object                        
 11  lang                     32461 non-null  object                        
 12  to_username              7757 non-null   object                        
 13  to_userid                7757 non-null   float64                       
 14  to_tweetid               7498 non-null   float64                       
 15  source_name              32461 non-null  object                        
 16  source_url               32461 non-null  object                        
 17  user_location            32461 non-null  object                        
 18  lat                      0 non-null      object                        
 19  lng                      0 non-null      object                        
 20  user_id                  32461 non-null  object                        
 21  user_name                32461 non-null  object                        
 22  user_verified            32461 non-null  float64                       
 23  user_description         32461 non-null  object                        
 24  user_url                 32461 non-null  object                        
 25  user_image               32461 non-null  object                        
 26  user_tweets              32461 non-null  object                        
 27  user_followers           32461 non-null  float64                       
 28  user_friends             32461 non-null  object                        
 29  user_likes               32461 non-null  float64                       
 30  user_lists               32461 non-null  float64                       
 31  user_created_at          32461 non-null  object                        
 32  user_timestamp_utc       32461 non-null  float64                       
 33  collected_via            32461 non-null  object                        
 34  match_query              32461 non-null  float64                       
 35  retweeted_id             0 non-null      float64                       
 36  retweeted_user           0 non-null      float64                       
 37  retweeted_user_id        0 non-null      float64                       
 38  retweeted_timestamp_utc  0 non-null      object                        
 39  quoted_id                4421 non-null   object                        
 40  quoted_user              4421 non-null   object                        
 41  quoted_user_id           4421 non-null   float64                       
 42  quoted_timestamp_utc     4421 non-null   float64                       
 43  collection_time          32461 non-null  object                        
 44  url                      32461 non-null  object                        
 45  place_country_code       3 non-null      object                        
 46  place_name               3 non-null      object                        
 47  place_type               3 non-null      object                        
 48  place_coordinates        3 non-null      object                        
 49  links                    11585 non-null  object                        
 50  domains                  11585 non-null  object                        
 51  media_urls               7326 non-null   object                        
 52  media_files              7326 non-null   object                        
 53  media_types              7326 non-null   object                        
 54  media_alt_texts          898 non-null    object                        
 55  mentioned_names          14806 non-null  object                        
 56  mentioned_ids            14239 non-null  object                        
 57  hashtags                 6863 non-null   object                        
 58  intervention_type        0 non-null      float64                       
 59  intervention_text        0 non-null      float64                       
 60  intervention_url         0 non-null      float64                       
 61  country                  32461 non-null  object                        
 62  date                     32462 non-null  datetime64[ns, America/Bogota]
dtypes: datetime64[ns, America/Bogota](1), float64(20), int64(1), object(41)
memory usage: 15.9+ MB

Domains

Top 20 websites mentioned in the tweets and their frequency

Code
# count items on column
domains_list = df['domains'].value_counts()

# return first n rows in descending order
top_domains = domains_list.nlargest(20)

top_domains
domains
bit.ly                    1510
semana.com                1111
eltiempo.com               660
mariafernandacabal.com     544
facebook.com               467
bluradio.com               258
twitter.com                249
lafm.com.co                240
ow.ly                      228
elcolombiano.com           220
youtu.be                   207
youtube.com                205
centrodemocratico.com      192
ln.is                      179
rcnradio.com               177
wradio.com.co              176
instagram.com              176
caracol.com.co             175
elespectador.com           175
costanoticias.com          153
Name: count, dtype: int64

Hashtags

Top 20 most-used hashtags and their frequency

Code
# convert dataframe column to list
hashtags = df['hashtags'].to_list()

# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]

# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]

# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]

# count items on list
hashtags_count = pd.Series(hashtags).value_counts()

# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)

top_hashtags
columna                  481
soycabal                 433
lascosascomoson          196
100porcientocabal        129
envivo                   123
soyopositor              122
atención                 120
votacd100cabal           118
alaire                   107
farc                      93
recomendado               86
restituciónsindespojo     77
urgente                   73
bogotá                    67
colombia                  65
vocesysonidos             61
opinión                   60
comunidad                 57
mañanasblu                57
venezuela                 55
Name: count, dtype: int64

Users

Top 20 most-mentioned users in the tweets

Code
# convert dataframe column to list
users = df['mentioned_names'].to_list()

# remove nan items from list
users = [x for x in users if not pd.isna(x)]

# split items into a list based on a delimiter
users = [x.split('|') for x in users]

# flatten list of lists
users = [item for sublist in users for item in sublist]

# count items on list
users_count = pd.Series(users).value_counts()

# return first n rows in descending order
top_users = users_count.nlargest(20)

top_users
alvarouribevel     507
jorenvilla1        393
juanmansantos      326
eltiempo           314
petrogustavo       310
drvargasquemba     301
igonima            295
cedemocratico      268
bluradioco         265
rcnlaradio         236
revistasemana      234
policiacolombia    225
elespectador       218
jflafaurie         208
ricardopuentesm    201
col_ejercito       189
alirestrepo        169
noticiasrcn        161
fiscaliacol        158
yobusgo            157
Name: count, dtype: int64

Likes over time

Code
# plot the data using plotly
fig = px.line(df, 
              x='date', 
              y='like_count', 
              title='Likes over Time',
              template='plotly_white', 
              hover_data=['text'])

# show the plot
fig.show()
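
Plotting individual tweets across an eleven-year span produces a noisy line; an aggregated variant, assuming the same df, resamples likes by month before plotting:

Code
# sum likes per month for a smoother trend
monthly_likes = (df.set_index('date')['like_count']
                   .resample('M')
                   .sum()
                   .reset_index())

fig = px.line(monthly_likes,
              x='date',
              y='like_count',
              title='Likes per month',
              template='plotly_white')

fig.show()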

Tokens

Top 20 most common tokens and their frequency

Code
# load the spacy model for Spanish
nlp = spacy.load("es_core_news_sm")

# load stop words for Spanish
STOP_WORDS = nlp.Defaults.stop_words

# Function to filter stop words
def filter_stopwords(text):
    # lower text
    doc = nlp(text.lower())
    # filter tokens
    tokens = [token.text for token in doc if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)

# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)

# count items on column
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]

token_counts
q              3678
farc           2630
colombia       2396
paz            2222
d              2188
gobierno       1317
país           1240
santos         1104
gracias         850
petro           841
justicia        816
venezuela       784
uribe           770
bogotá          759
libertad        716
víctimas        708
años            708
columna         704
presidente      687
colombianos     633
Name: count, dtype: int64
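
Calling nlp() row by row on roughly 32,000 tweets is slow; a sketch of the same filtering that streams texts through nlp.pipe in batches, with the parser and named-entity recognizer disabled (assuming the nlp model loaded above):

Code
# stream the tweets through spaCy in batches, keeping non-stopword alphabetic tokens
def filter_stopwords_batch(texts):
    cleaned = []
    for doc in nlp.pipe((t.lower() for t in texts), batch_size=500, disable=['parser', 'ner']):
        tokens = [token.text for token in doc if not token.is_stop and token.is_alpha]
        cleaned.append(' '.join(tokens))
    return cleaned

# equivalent to the apply() call above, but faster on a large column
df['text_pre'] = filter_stopwords_batch(df['text'].to_list())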

Hours

The 10 hours of the day with the most published tweets

Code
# extract hour from datetime column
df['hour'] = df['date'].dt.strftime('%H')

# count items on column
hours_count = df['hour'].value_counts()

# return first n rows in descending order
top_hours = hours_count.nlargest(10)

top_hours
hour
10    2424
12    2238
11    2224
09    2190
08    2136
20    1822
18    1815
13    1813
21    1788
14    1756
Name: count, dtype: int64
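
A quick way to view the full hourly distribution, assuming the hour column created above:

Code
# bar chart of tweets per hour of day, ordered by hour
hours_all = df['hour'].value_counts().sort_index().reset_index()
hours_all.columns = ['hour', 'tweets']

fig = px.bar(hours_all,
             x='hour',
             y='tweets',
             title='Tweets per hour of day',
             template='plotly_white')

fig.show()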

Platforms

Platforms from which content was published and their frequency

Code
df['source_name'].value_counts()
source_name
Twitter for iPhone             14186
Twitter for BlackBerry®         8291
Twitter for Android             5049
Twitter Web Client              2627
Twitter for BlackBerry           841
TweetDeck                        396
Twitter for iPad                 239
Twitter for  Android             207
Instagram                        167
Twitter Web App                   94
Periscope                         84
Jetpack.com                       75
Twitter for Android Tablets       73
Twitter Media Studio              73
Twitter for Websites              19
Twitter for Windows Phone         18
iOS                               10
Twitlonger                         4
erased5423693                      4
Mobile Web (M2)                    3
Twiffo                             1
Name: count, dtype: int64

Topics

Topic modeling technique with transformers and TF-IDF

Code
# remove urls, mentions and numbers
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(lambda x: p.clean(x))

# replace emojis with descriptions
df['text_pre'] = df['text_pre'].apply(lambda x: demojize(x))

# filter column
docs = df['text_pre']

# calculate topics and probabilities
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)

# training
topics, probs = topic_model.fit_transform(docs)

# visualize topics
topic_model.visualize_topics()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
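
The warning above can be silenced by setting the environment variable it mentions before the topic model is created:

Code
import os

# silence the tokenizers fork warning
os.environ['TOKENIZERS_PARALLELISM'] = 'false'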

Topic reduction

Map with 20% of the total number of generated topics

Code
# calculate 20% of the total number of topics
num_topics = len(topic_model.get_topic_info())
per_topics = int(num_topics * 20 / 100)


# reduce the number of topics
topic_model.reduce_topics(docs, nr_topics=per_topics)

# visualize topics
topic_model.visualize_topics()

Terms per topic

Code
topic_model.visualize_barchart(top_n_topics=per_topics)

Topic analysis

Selection of topics touching on gender issues

Code
# selection of topics
topics = [10]

keywords_list = []
for topic_ in topics:
    topic = topic_model.get_topic(topic_)
    keywords = [x[0] for x in topic]
    keywords_list.append(keywords)

# flatten list of lists
words_list = [item for sublist in keywords_list for item in sublist]

# keep tweets whose preprocessed text contains any of the topic keywords
filtered_df = df[df['text_pre'].apply(lambda x: any(word in x for word in words_list))]

percentage = round(100 * len(filtered_df) / len(df), 2)
print(f"Del total de {len(df)} tweets de @MariaFdaCabal, alrededor de {len(filtered_df)} hablan sobre temas de género, es decir, cerca del {percentage}%")

print(f"Lista de palabras en tópicos {topics}:\n{words_list}")
Del total de 32462 tweets de @MariaFdaCabal, alrededor de 725 hablan sobre temas de género, es decir, cerca del 2.23%
Lista de palabras en tópicos [10]:
['mujeres', 'negros', 'mujer', 'gay', 'aborto', 'negras', 'homosexuales', 'gays', 'comunidades', 'niñas']
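
Topic 10 was chosen by inspecting the bar chart above; assuming the fitted topic_model, BERTopic can also search fitted topics by a seed keyword:

Code
# find the topics most similar to a seed term and inspect their keywords
similar_topics, similarity = topic_model.find_topics('mujeres', top_n=5)

for topic_id, score in zip(similar_topics, similarity):
    print(topic_id, round(score, 3), [word for word, _ in topic_model.get_topic(topic_id)])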
Code
# drop rows where either likes or retweets are 0, keeping an independent copy
filtered_df = filtered_df[(filtered_df.like_count != 0) & (filtered_df.retweet_count != 0)].copy()

# add a new column with the average of likes and retweets
filtered_df['impressions'] = (filtered_df['like_count'] + filtered_df['retweet_count'])/2

# extract year from datetime column
filtered_df['year'] = filtered_df['date'].dt.year

# remove urls
p.set_options(p.OPT.URL)
filtered_df['tweet_text'] = filtered_df['text'].apply(lambda x: p.clean(x))

# Create scatter plot
fig = px.scatter(filtered_df, x='like_count', 
                 y='retweet_count',
                 size='impressions', 
                 color='year',
                 hover_name='tweet_text')

# Update title and axis labels
fig.update_layout(
    title='Tweets talking about gender with most Likes and Retweets',
    xaxis_title='Number of Likes',
    yaxis_title='Number of Retweets'
)

fig.show()

Topics over time

Code
# convert column to list
tweets = df['text_pre'].to_list()
timestamps = df['local_time'].to_list()

topics_over_time = topic_model.topics_over_time(docs=tweets, 
                                                timestamps=timestamps, 
                                                global_tuning=True, 
                                                evolution_tuning=True, 
                                                nr_bins=20)

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)