Analysis of tweets from @_FamiliaEcuador

Data

General information about the dataset

Code

min_date = df['date'].min()

max_date = df['date'].max()

print(f"\nPeriod of collected tweets: {min_date} / {max_date}\n")


Period of collected tweets: 2018-07-28 18:41:21-05:00 / 2023-03-20 11:57:01-05:00

Code

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2716 entries, 196700 to 199415
Data columns (total 63 columns):
 #   Column                   Non-Null Count  Dtype                            
---  ------                   --------------  -----                            
 0   query                    2716 non-null   object                           
 1   id                       2716 non-null   float64                          
 2   timestamp_utc            2716 non-null   int64                            
 3   local_time               2716 non-null   object                           
 4   user_screen_name         2716 non-null   object                           
 5   text                     2716 non-null   object                           
 6   possibly_sensitive       1663 non-null   object                           
 7   retweet_count            2716 non-null   float64                          
 8   like_count               2716 non-null   float64                          
 9   reply_count              2716 non-null   float64                          
 10  impression_count         46 non-null     object                           
 11  lang                     2716 non-null   object                           
 12  to_username              1199 non-null   object                           
 13  to_userid                1199 non-null   float64                          
 14  to_tweetid               1187 non-null   float64                          
 15  source_name              2716 non-null   object                           
 16  source_url               2716 non-null   object                           
 17  user_location            2716 non-null   object                           
 18  lat                      0 non-null      object                           
 19  lng                      0 non-null      object                           
 20  user_id                  2716 non-null   object                           
 21  user_name                2716 non-null   object                           
 22  user_verified            2716 non-null   float64                          
 23  user_description         2716 non-null   object                           
 24  user_url                 2716 non-null   object                           
 25  user_image               2716 non-null   object                           
 26  user_tweets              2716 non-null   object                           
 27  user_followers           2716 non-null   float64                          
 28  user_friends             2716 non-null   object                           
 29  user_likes               2716 non-null   float64                          
 30  user_lists               2716 non-null   float64                          
 31  user_created_at          2716 non-null   object                           
 32  user_timestamp_utc       2716 non-null   float64                          
 33  collected_via            2716 non-null   object                           
 34  match_query              2716 non-null   float64                          
 35  retweeted_id             0 non-null      float64                          
 36  retweeted_user           0 non-null      float64                          
 37  retweeted_user_id        0 non-null      float64                          
 38  retweeted_timestamp_utc  0 non-null      object                           
 39  quoted_id                412 non-null    object                           
 40  quoted_user              412 non-null    object                           
 41  quoted_user_id           412 non-null    float64                          
 42  quoted_timestamp_utc     412 non-null    float64                          
 43  collection_time          2716 non-null   object                           
 44  url                      2716 non-null   object                           
 45  place_country_code       31 non-null     object                           
 46  place_name               31 non-null     object                           
 47  place_type               31 non-null     object                           
 48  place_coordinates        31 non-null     object                           
 49  links                    478 non-null    object                           
 50  domains                  478 non-null    object                           
 51  media_urls               1644 non-null   object                           
 52  media_files              1644 non-null   object                           
 53  media_types              1644 non-null   object                           
 54  media_alt_texts          239 non-null    object                           
 55  mentioned_names          1946 non-null   object                           
 56  mentioned_ids            1904 non-null   object                           
 57  hashtags                 1630 non-null   object                           
 58  intervention_type        0 non-null      float64                          
 59  intervention_text        0 non-null      float64                          
 60  intervention_url         0 non-null      float64                          
 61  country                  2716 non-null   object                           
 62  date                     2716 non-null   datetime64[ns, America/Guayaquil]
dtypes: datetime64[ns, America/Guayaquil](1), float64(20), int64(1), object(41)
memory usage: 1.3+ MB

Domains

Top 20 external domains linked in the tweets, with their frequency

Code

# count items on column
domains_list = df['domains'].value_counts()

# return first n rows in descending order
top_domains = domains_list.nlargest(20)

top_domains

domains
youtu.be                           63
bit.ly                             43
facebook.com                       30
instagram.com                      25
youtube.com                        19
aciprensa.com                      19
eluniverso.com                     18
ecuadorporlafamilia.org            14
arquidiocesisdeguayaquil.org.ec    11
citizengo.org                      11
familiaecuador.org                 10
twitter.com                         9
open.spotify.com                    8
liveaction.org                      6
foxnews.com                         6
expreso.ec                          5
pscp.tv                             5
buff.ly                             5
drive.google.com                    5
forms.gle                           4
Name: count, dtype: int64

Hashtags

Top 20 most-used hashtags, with their frequency

Code

# convert dataframe column to list
hashtags = df['hashtags'].to_list()

# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]

# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]

# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]

# count items on list
hashtags_count = pd.Series(hashtags).value_counts()

# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)

top_hashtags

ecuadoresprovida           307
salvemoslas2vidas          242
ecuador                    176
provida                    118
escudero                    89
abortoporviolacion          78
ecuadorporlafamilia         70
aborto                      63
coip                        58
asambleistaqueserespeta     57
leyabortistano              56
marthavillafuerte           46
escudera                    44
votoprovida2021             42
escuderos                   40
conabortonotevoto           37
prolife                     36
chantajehumanitario         35
mentirasverdes              29
juntosporlafamilia          29
Name: count, dtype: int64

Users

Top 20 most-mentioned users in the tweets

Code

# convert dataframe column to list
users = df['mentioned_names'].to_list()

# remove nan items from list
users = [x for x in users if not pd.isna(x)]

# split items into a list based on a delimiter
users = [x.split('|') for x in users]

# flatten list of lists
users = [item for sublist in users for item in sublist]

# count items on list
users_count = pd.Series(users).value_counts()

# return first n rows in descending order
top_users = users_count.nlargest(20)

top_users

asambleaecuador    332
hectoryepezm       145
justiciaan         134
etorrescobo        132
lenin              120
lourdescuestao     105
amishijoseduco      88
agustinlaje         77
amparo_medina       71
ecuadorprovida      71
cesarrohon          61
lacristifranco      57
julietasagnay       51
gomezrobertoa       51
eluniversocom       49
marthaceciliavl     47
crisvalverdej       46
polyugarteg         44
viviana_bonilla     41
corteconstecu       38
Name: count, dtype: int64
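
The hashtag and mention counts above repeat the same steps: drop missing cells, split on `|`, flatten, and count. A minimal sketch of a reusable helper follows, written with the standard library only so it runs without the dataset. The name `count_delimited` and the sample values are hypothetical and not part of the original notebook.

```python
from collections import Counter

def count_delimited(values, delimiter="|", top_n=20):
    """Count items in delimiter-joined cells and return the top_n most common."""
    counter = Counter()
    for value in values:
        # skip missing cells (None or float NaN, as pandas stores them)
        if not isinstance(value, str):
            continue
        # split the cell and ignore empty fragments
        counter.update(item for item in value.split(delimiter) if item)
    return counter.most_common(top_n)

# hypothetical sample, shaped like the 'hashtags' column
sample = ["provida|ecuador", None, "ecuador", float("nan"), "aborto|ecuador|provida"]
top = count_delimited(sample, top_n=2)
print(top)
```

With the real data this could be called as `count_delimited(df['hashtags'])` or `count_delimited(df['mentioned_names'])`. In pandas, chaining `str.split('|')`, `explode()` and `value_counts()` on the column should give equivalent counts.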

Likes over time

Code

import plotly.express as px

# plot likes over time; sort by date so the line is drawn chronologically
fig = px.line(df.sort_values('date'), 
              x='date', 
              y='like_count', 
              title='Likes over Time',
              template='plotly_white', 
              hover_data=['text'])

# show the plot
fig.show()

Tokens

Top 20 most common tokens, with their frequency

Code

import spacy

# load the spaCy model for Spanish
nlp = spacy.load("es_core_news_sm")

# load stop words for Spanish
STOP_WORDS = nlp.Defaults.stop_words

# Function to filter stop words
def filter_stopwords(text):
    # lower text
    doc = nlp(text.lower())
    # filter tokens
    tokens = [token.text for token in doc if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)

# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)

# count tokens across all tweets and keep the 20 most frequent
token_counts = df["text_pre"].str.split().explode().value_counts().head(20)

token_counts

vida                562
aborto              378
familia             354
ecuadoresprovida    316
gracias             305
ecuador             299
provida             218
apoyo               152
mujeres             139
violación           131
niños               120
nacer               119
hijos               119
concepción          118
causa               116
ley                 115
mujer               112
escudero            112
the                 108
voz                 108
Name: count, dtype: int64

Hours

The 10 hours of the day with the most tweets published

Code

# extract hour from datetime column
df['hour'] = df['date'].dt.strftime('%H')

# count items on column
hours_count = df['hour'].value_counts()

# return first n rows in descending order
top_hours = hours_count.nlargest(10)

top_hours

hour
12    206
15    193
09    183
08    175
10    175
16    173
13    172
11    168
14    156
17    149
Name: count, dtype: int64
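
The hours above come from the localized `date` column. As a cross-check, the same histogram could be derived from the raw `timestamp_utc` values. The sketch below assumes those values are epoch seconds and that Ecuador's mainland offset is a fixed UTC-5, which matches the `-05:00` offsets shown earlier. The function name and the sample timestamps are illustrative only.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# assumption: fixed UTC-5 offset, no daylight saving time
ECUADOR_TZ = timezone(timedelta(hours=-5))

def hour_histogram(epoch_seconds):
    """Count tweets per local hour of day from UTC epoch seconds."""
    hours = (datetime.fromtimestamp(ts, tz=ECUADOR_TZ).hour for ts in epoch_seconds)
    return Counter(hours)

# hypothetical sample: 1532821281 corresponds to 2018-07-28 18:41:21-05:00
sample_ts = [1532821281, 1532821281 + 3600, 1532821281 + 60]
hist = hour_histogram(sample_ts)
print(hist.most_common())
```

Unlike `strftime('%H')`, which yields zero-padded strings such as `'09'`, this keeps hours as integers, so sorting by hour is numeric.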

Platforms

Platforms from which the content was published, with their frequency

Code

df['source_name'].value_counts()

source_name
Twitter for Android    2398
Twitter Web App         151
Twitter Web Client      120
Twitter for iPhone       17
Instagram                17
TweetDeck                13
Name: count, dtype: int64

Topics

Topic modeling with transformers and TF-IDF

Code

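# imports assumed from the calls below (tweet-preprocessor, emoji, bertopic)
import preprocessor as p
from bertopic import BERTopic
from emoji import demojize

# remove urls, mentions and numbers (hashtags are kept)
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(p.clean)

# replace emojis with their text descriptions
df['text_pre'] = df['text_pre'].apply(demojize)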

# filter column
docs = df['text_pre']

# calculate topics and probabilities
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)

# training
topics, probs = topic_model.fit_transform(docs)

# visualize topics
topic_model.visualize_topics()

Terms per topic

Code

topic_model.visualize_barchart(top_n_topics=41)
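
The bar chart ranks each topic's terms by BERTopic's class-based TF-IDF (c-TF-IDF) score. As described in the BERTopic documentation, the score weights a term's normalized frequency within a topic by log(1 + A / f_t), where A is the average number of words per topic and f_t is the term's frequency across all topics. The following is a simplified pure-Python sketch of that idea, not BERTopic's actual implementation. The function name and the toy tokens are hypothetical.

```python
import math
from collections import Counter

def c_tf_idf(docs_by_topic):
    """docs_by_topic maps a topic id to a list of tokenized documents."""
    # merge all documents of a topic into one bag of words
    tf = {topic: Counter(tok for doc in docs for tok in doc)
          for topic, docs in docs_by_topic.items()}
    # frequency of each term across all topics
    total_tf = Counter()
    for counts in tf.values():
        total_tf.update(counts)
    # average number of words per topic
    avg_words = sum(total_tf.values()) / len(tf)
    scores = {}
    for topic, counts in tf.items():
        n_words = sum(counts.values())
        # normalized term frequency times the class-based idf term
        scores[topic] = {term: (freq / n_words) * math.log(1 + avg_words / total_tf[term])
                         for term, freq in counts.items()}
    return scores

# hypothetical toy corpus with two topics
toy = {
    0: [["aborto", "ley", "vida"], ["ley", "aborto"]],
    1: [["familia", "vida"], ["familia", "gracias"]],
}
scores = c_tf_idf(toy)
for topic, terms in scores.items():
    print(topic, sorted(terms, key=terms.get, reverse=True))
```

In this toy corpus, "vida" occurs in both topics and so scores lower within each of them than terms concentrated in a single topic.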

Topic analysis

Selection of topics that touch on gender issues

Code

# selected topics (named separately so the `topics` returned by fit_transform is not overwritten)
selected_topics = [4, 40]

keywords_list = []
for topic_id in selected_topics:
    topic = topic_model.get_topic(topic_id)
    keywords = [x[0] for x in topic]
    keywords_list.append(keywords)

# flatten list of lists
words_list = [item for sublist in keywords_list for item in sublist]

# keep rows whose preprocessed text contains any keyword (substring match)
filtered_df = df[df['text_pre'].apply(lambda x: any(word in x for word in words_list))]

percentage = round(100 * len(filtered_df) / len(df), 2)
print(f"Of the {len(df)} tweets from @_FamiliaEcuador, around {len(filtered_df)} talk about gender issues, that is, about {percentage}%")

print(f"List of words in topics {selected_topics}:\n{words_list}")

Of the 2716 tweets from @_FamiliaEcuador, around 1638 talk about gender issues, that is, about 60.31%
List of words in topics [4, 40]:
['aborto', 'ecuador', 'abortista', 'violación', 'vida', 'derechos', 'nacer', 'argentina', 'méxico', 'ley', 'niabusoniaborto', 'mentirasverdes', 'abortoporviolacion', 'berreado', 'prestos', 'gkecuador', 'entrevista', 'confundir', 'hipocresiaverde', 'simposio']
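
The filter above tests whether a keyword occurs anywhere in the text, so short keywords such as 'ley' or 'vida' also match inside longer words. That may inflate the share of tweets counted. A minimal sketch of a whole-word alternative follows, using word boundaries from the standard `re` module. The function name and the sample sentences are hypothetical and not taken from the dataset.

```python
import re

def matches_any_keyword(text, keywords, whole_word=True):
    """Return True if text contains any keyword, as a whole word or as a substring."""
    if whole_word:
        # \b anchors the match at word boundaries (Unicode-aware for str patterns)
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
        return re.search(pattern, text) is not None
    return any(k in text for k in keywords)

keywords = ["ley", "vida"]
# hypothetical sample texts
tweets = ["la ley fue aprobada", "leyenda urbana", "actividad convidada"]

substring_hits = [t for t in tweets if matches_any_keyword(t, keywords, whole_word=False)]
whole_word_hits = [t for t in tweets if matches_any_keyword(t, keywords)]
print(len(substring_hits), len(whole_word_hits))
```

Applied to the notebook, this could replace the lambda in the filter, for example `df['text_pre'].apply(lambda x: matches_any_keyword(x, words_list))`. The resulting percentage would likely differ from the figure reported above.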

Code

# keep only tweets with at least one like and one retweet
filtered_df = filtered_df[(filtered_df.like_count != 0) & (filtered_df.retweet_count != 0)].copy()

# add a new column with the mean of likes and retweets (an engagement proxy, not Twitter's impression count)
filtered_df['impressions'] = (filtered_df['like_count'] + filtered_df['retweet_count'])/2

# extract year from datetime column
filtered_df['year'] = filtered_df['date'].dt.year

# remove urls from the original text
p.set_options(p.OPT.URL)
filtered_df['tweet_text'] = filtered_df['text'].apply(lambda x: p.clean(x))

# Create scatter plot
fig = px.scatter(filtered_df, x='like_count', 
                 y='retweet_count',
                 size='impressions', 
                 color='year',
                 hover_name='tweet_text')

# Update title and axis labels
fig.update_layout(
    title='Tweets talking about gender with most Likes and Retweets',
    xaxis_title='Number of Likes',
    yaxis_title='Number of Retweets'
)

fig.show()

Topics over time

Code

# convert column to list
tweets = df['text_pre'].to_list()
timestamps = df['local_time'].to_list()

topics_over_time = topic_model.topics_over_time(docs=tweets, 
                                                timestamps=timestamps, 
                                                global_tuning=True, 
                                                evolution_tuning=True, 
                                                nr_bins=20)

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)