Analysis of tweets by @etorrescobo

Data

General information about the dataset

Code
min_date = df['date'].min()

max_date = df['date'].max()

print(f"\nPeriod of collected tweets: {min_date} / {max_date}\n")

Period of collected tweets: 2010-07-06 10:37:00-05:00 / 2023-03-21 09:58:42-05:00
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 8314 entries, 10617 to 18930
Data columns (total 63 columns):
 #   Column                   Non-Null Count  Dtype                            
---  ------                   --------------  -----                            
 0   query                    8314 non-null   object                           
 1   id                       8314 non-null   float64                          
 2   timestamp_utc            8314 non-null   int64                            
 3   local_time               8314 non-null   object                           
 4   user_screen_name         8314 non-null   object                           
 5   text                     8314 non-null   object                           
 6   possibly_sensitive       2818 non-null   object                           
 7   retweet_count            8314 non-null   float64                          
 8   like_count               8314 non-null   float64                          
 9   reply_count              8314 non-null   float64                          
 10  impression_count         243 non-null    object                           
 11  lang                     8314 non-null   object                           
 12  to_username              2188 non-null   object                           
 13  to_userid                2188 non-null   float64                          
 14  to_tweetid               2120 non-null   float64                          
 15  source_name              8314 non-null   object                           
 16  source_url               8314 non-null   object                           
 17  user_location            8314 non-null   object                           
 18  lat                      65 non-null     object                           
 19  lng                      65 non-null     object                           
 20  user_id                  8314 non-null   object                           
 21  user_name                8314 non-null   object                           
 22  user_verified            8314 non-null   float64                          
 23  user_description         8314 non-null   object                           
 24  user_url                 8314 non-null   object                           
 25  user_image               8314 non-null   object                           
 26  user_tweets              8314 non-null   object                           
 27  user_followers           8314 non-null   float64                          
 28  user_friends             8314 non-null   object                           
 29  user_likes               8314 non-null   float64                          
 30  user_lists               8314 non-null   float64                          
 31  user_created_at          8314 non-null   object                           
 32  user_timestamp_utc       8314 non-null   float64                          
 33  collected_via            8314 non-null   object                           
 34  match_query              8314 non-null   float64                          
 35  retweeted_id             0 non-null      float64                          
 36  retweeted_user           0 non-null      float64                          
 37  retweeted_user_id        0 non-null      float64                          
 38  retweeted_timestamp_utc  0 non-null      object                           
 39  quoted_id                800 non-null    object                           
 40  quoted_user              800 non-null    object                           
 41  quoted_user_id           800 non-null    float64                          
 42  quoted_timestamp_utc     800 non-null    float64                          
 43  collection_time          8314 non-null   object                           
 44  url                      8314 non-null   object                           
 45  place_country_code       672 non-null    object                           
 46  place_name               672 non-null    object                           
 47  place_type               672 non-null    object                           
 48  place_coordinates        672 non-null    object                           
 49  links                    1660 non-null   object                           
 50  domains                  1660 non-null   object                           
 51  media_urls               1812 non-null   object                           
 52  media_files              1812 non-null   object                           
 53  media_types              1812 non-null   object                           
 54  media_alt_texts          249 non-null    object                           
 55  mentioned_names          4040 non-null   object                           
 56  mentioned_ids            3764 non-null   object                           
 57  hashtags                 1824 non-null   object                           
 58  intervention_type        0 non-null      float64                          
 59  intervention_text        0 non-null      float64                          
 60  intervention_url         0 non-null      float64                          
 61  country                  8314 non-null   object                           
 62  date                     8314 non-null   datetime64[ns, America/Guayaquil]
dtypes: datetime64[ns, America/Guayaquil](1), float64(20), int64(1), object(41)
memory usage: 4.1+ MB
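
The `date` column is timezone-aware (America/Guayaquil), while `timestamp_utc` holds integer epoch seconds. The notebook does not show how `date` was built; the sketch below is one plausible way to derive it, offered as an assumption rather than the author's actual step. The two sample values are epoch seconds chosen to match the first and last timestamps printed above.

```python
import pandas as pd

# Hypothetical two-row sample standing in for df['timestamp_utc']
sample = pd.DataFrame({"timestamp_utc": [1278430620, 1679410722]})

# Read the integers as UTC epoch seconds, then convert to Ecuador local time
sample["date"] = (
    pd.to_datetime(sample["timestamp_utc"], unit="s", utc=True)
      .dt.tz_convert("America/Guayaquil")
)

print(sample["date"].min(), "/", sample["date"].max())
```

Keeping the column timezone-aware means later steps such as extracting the hour reflect local posting time rather than UTC.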

Domains

Top 20 external websites linked in the tweets, with their frequency

Code
# count items on column
domains_list = df['domains'].value_counts()

# return first n rows in descending order
top_domains = domains_list.nlargest(20)

top_domains
domains
etorrescobo.com             241
tinyurl.com                 141
instagram.com               120
bit.ly                       85
youtu.be                     64
elcomercio.com               64
youtube.com                  52
abc.es                       52
ft.com                       46
eluniverso.com               40
elpais.com                   36
ow.ly                        30
facebook.com                 25
medium.com                   24
expreso.ec                   23
wsj.com                      23
twitter.com                  20
internacional.elpais.com     16
hoy.com.ec                   13
nyti.ms                      12
Name: count, dtype: int64
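
`value_counts()` on the raw column counts each cell as a single item. If `domains` uses the same `|` separator as `hashtags` and `mentioned_names` (an assumption; the output above shows no combined entries), a tweet with two links would be counted under a combined string such as `elpais.com|bit.ly`. A minimal sketch of counting each domain separately, using hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; None marks a tweet without links
domains = pd.Series(["elpais.com|bit.ly", "bit.ly", None, "elpais.com"], name="domains")

per_domain = (
    domains.dropna()        # ignore tweets without links
           .str.split("|")  # one list of domains per tweet
           .explode()       # one row per domain
           .value_counts()
)

print(per_domain)
```

The same chain would work for any of the pipe-delimited columns, so it could replace the list-comprehension steps used in the following sections.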

Hashtags

Top 20 most used hashtags, with their frequency

Code
import pandas as pd

# convert dataframe column to list
hashtags = df['hashtags'].to_list()

# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]

# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]

# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]

# count items on list
hashtags_count = pd.Series(hashtags).value_counts()

# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)

top_hashtags
ecuador                   213
ambato                    163
tungurahua                116
quito                      47
venezuela                  43
asambleanacional           35
maduro                     29
trump                      26
españa                     24
vota6                      23
atención                   21
coip                       20
emprendersinobstáculos     19
usfq                       16
colombia                   16
brexit                     15
estebantorres              14
cambio                     14
ecuadorprotesta            14
toros                      13
Name: count, dtype: int64

Users

Top 20 most mentioned users in the tweets

Code
# filter column from dataframe
users = df['mentioned_names'].to_list()

# remove nan items from list
users = [x for x in users if not pd.isna(x)]

# split items into a list based on a delimiter
users = [x.split('|') for x in users]

# flatten list of lists
users = [item for sublist in users for item in sublist]

# count items on list
users_count = pd.Series(users).value_counts()

# return first n rows in descending order
top_users = users_count.nlargest(20)

top_users
asambleaecuador    235
estebanperezm      140
etorrescobo        133
lftorrest           86
lassoguillermo      74
bancadapsc          66
eluniversocom       62
rxandrade           57
el_pais             52
usfq_ecuador        48
jfcarpio            47
cambioec            43
abc_es              42
amandahidalgoa      39
xvillalba1          39
ecuavisa            37
cristiano           35
la6ecuador          35
youtube             33
lenin               31
Name: count, dtype: int64

Likes over time

Code
import plotly.express as px

# plot the data using plotly
fig = px.line(df, 
              x='date', 
              y='like_count', 
              title='Likes over Time',
              template='plotly_white', 
              hover_data=['text'])

# show the plot
fig.show()
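
A line chart with one point per tweet can be hard to read across thirteen years of data. One option, sketched here with hypothetical values, is to sort by date and aggregate likes per month before plotting. The frame `likes` and its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical sample with the same two columns used in the plot
likes = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-03", "2023-03-15"]
    ).tz_localize("America/Guayaquil"),
    "like_count": [10.0, 30.0, 5.0, 8.0],
})

# Sort chronologically, index by date, and sum likes per calendar month
monthly = (
    likes.sort_values("date")
         .set_index("date")["like_count"]
         .resample("MS")
         .sum()
)

print(monthly)
```

`monthly.reset_index()` returns a frame with `date` and `like_count` columns that can be passed to `px.line` in the same way as `df`.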

Tokens

Top 20 most common tokens, with their frequency

Code
import spacy

# load the spacy model for Spanish
nlp = spacy.load("es_core_news_sm")

# load stop words for Spanish
STOP_WORDS = nlp.Defaults.stop_words

# Function to filter stop words
def filter_stopwords(text):
    # lower text
    doc = nlp(text.lower())
    # filter tokens
    tokens = [token.text for token in doc if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)

# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)

# count items on column
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]

token_counts
ecuador         586
gracias         388
gobierno        343
vía             332
asamblea        298
saludos         280
país            274
ambato          265
presidente      259
ley             251
artículo        221
nacional        216
the             203
ecuatorianos    178
tungurahua      173
comparto        167
años            161
debate          152
vida            152
quito           150
Name: count, dtype: int64
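
The token "the" appears in the list above, which suggests that some tweets are in English and that a Spanish stop-word list alone does not filter them. The sketch below shows the idea of combining stop-word sets for both languages. It uses plain Python and tiny hypothetical word sets instead of spaCy, so it illustrates the filtering logic only.

```python
from collections import Counter

# Tiny illustrative stop-word sets (hypothetical; real lists are much longer)
STOP_ES = {"el", "la", "de", "y", "en"}
STOP_EN = {"the", "of", "and", "in"}

# Hypothetical tweets mixing Spanish and English
texts = ["la asamblea de ecuador", "the president of ecuador", "ecuador y la ley"]

def count_tokens(texts, stop_words):
    """Count lowercase alphabetic tokens that are not stop words."""
    counts = Counter()
    for text in texts:
        counts.update(
            t for t in text.lower().split()
            if t.isalpha() and t not in stop_words
        )
    return counts

counts = count_tokens(texts, STOP_ES | STOP_EN)
print(counts.most_common(3))
```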

Hours

The 10 hours of the day with the most published tweets

Code
# extract hour from datetime column
df['hour'] = df['date'].dt.strftime('%H')

# count items on column
hours_count = df['hour'].value_counts()

# return first n rows in descending order
top_hours = hours_count.nlargest(10)

top_hours
hour
11    633
10    587
12    578
17    571
20    552
09    503
13    502
21    490
16    469
18    451
Name: count, dtype: int64
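
The ranking above shows the busiest hours but not the shape of the whole day. A chronological view over all 24 hours, including hours with no tweets, can be sketched as follows with hypothetical timestamps:

```python
import pandas as pd

# Hypothetical posting times standing in for df['date']
dates = pd.Series(pd.to_datetime([
    "2023-01-01 09:15", "2023-01-01 11:30",
    "2023-01-02 11:45", "2023-01-03 20:05",
]))

# Count tweets per hour, then reindex so every hour 0..23 is present in order
by_hour = (
    dates.dt.hour
         .value_counts()
         .reindex(range(24), fill_value=0)
)

print(by_hour)
```

Using `dt.hour` yields integers rather than the zero-padded strings produced by `strftime('%H')`, which keeps the index sortable as numbers.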

Platforms

Platforms from which the content was published, with their frequency

Code
df['source_name'].value_counts()
source_name
Twitter for iPhone         4322
Twitter Web Client         1787
Twitter for iPad            704
Twitter for Android         432
Twitter for BlackBerry®     344
Twitter Web App             232
Twitter for Websites        229
Instagram                   105
Kioskoymas                   41
Agorapulse app               33
Mobile Web                   17
Medium                       17
iOS                          14
Hootsuite Inc.               13
TweetChat                     6
Kindle                        5
Canva                         5
FOX News Login                2
Photos on iOS                 2
OS X                          1
Instagram on iOS              1
Crowdfire Inc.                1
bitly bitlink                 1
Name: count, dtype: int64
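
Raw counts are easier to compare as shares of the total. The sketch below takes the three largest counts from the output above and the total of 8314 rows reported by `df.info()`. On the full column, `df['source_name'].value_counts(normalize=True)` would give the same proportions directly.

```python
import pandas as pd

# Top three counts copied from the output above
counts = pd.Series({
    "Twitter for iPhone": 4322,
    "Twitter Web Client": 1787,
    "Twitter for iPad": 704,
})
total = 8314  # number of rows reported by df.info()

# Percentage of all tweets published from each platform
share = (counts / total * 100).round(2)

print(share)
```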

Topics

Topic modeling with transformer embeddings and TF-IDF

Code
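import preprocessor as p  # provided by the tweet-preprocessor package
from bertopic import BERTopic
from emoji import demojize

# remove urls, mentions and numbers (hashtags are not in the option list)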
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(lambda x: p.clean(x))

# replace emojis with descriptions
df['text_pre'] = df['text_pre'].apply(lambda x: demojize(x))

# filter column
docs = df['text_pre']

# calculate topics and probabilities
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)

# training
topics, probs = topic_model.fit_transform(docs)

# visualize topics
topic_model.visualize_topics()
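
The TF-IDF part of this technique is class-based: all documents in a topic are treated as one long document, and each term is weighted by how frequent it is in that topic relative to its frequency across all topics. The sketch below is a simplified pure-Python version of that weighting, written from the formula in the BERTopic documentation. It is not the library's implementation, which works on sparse matrices and may differ in detail, and the term counts are hypothetical.

```python
import math

def c_tf_idf(counts):
    """counts[c][t] is how often term t occurs in all documents of topic c.

    Assumes every term occurs at least once overall.
    """
    n_terms = len(counts[0])
    words_per_topic = [sum(row) for row in counts]
    avg_words = sum(words_per_topic) / len(counts)
    term_totals = [sum(row[t] for row in counts) for t in range(n_terms)]
    # Terms that are rare across all topics receive a larger weight
    idf = [math.log(1 + avg_words / total) for total in term_totals]
    return [
        [(row[t] / words_per_topic[c]) * idf[t] for t in range(n_terms)]
        for c, row in enumerate(counts)
    ]

terms = ["aborto", "ley", "ambato"]
counts = [
    [8, 2, 0],  # hypothetical topic 0
    [0, 3, 7],  # hypothetical topic 1
]
weights = c_tf_idf(counts)

for c, row in enumerate(weights):
    best = max(range(len(terms)), key=lambda t: row[t])
    print(f"topic {c}: top term = {terms[best]}")
```

In this toy example "ley" occurs in both topics, so it is down-weighted in favor of the term that is specific to each topic.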

Topic reduction

Map of the topics in the tweets after reducing the model to 31 topics

Code
# reduce the number of topics
topic_model.reduce_topics(docs, nr_topics=31)

# visualize topics
topic_model.visualize_topics()

Terms per topic

Code
topic_model.visualize_barchart(top_n_topics=31)

Topic analysis

Selection of topics related to gender issues

Code
# selection of topics
topics = [7]

keywords_list = []
for topic_ in topics:
    topic = topic_model.get_topic(topic_)
    keywords = [x[0] for x in topic]
    keywords_list.append(keywords)

# flatten list of lists
words_list = [item for sublist in keywords_list for item in sublist]

# use apply method with lambda function to filter rows
filtered_df = df[df['text_pre'].apply(lambda x: any(word in x for word in words_list))]

percentage = round(100 * len(filtered_df) / len(df), 2)
print(f"Of the {len(df)} tweets by @etorrescobo, about {len(filtered_df)} deal with gender issues, that is, roughly {percentage}%")

print(f"List of words in topics {topics}:\n{words_list}")
Of the 8314 tweets by @etorrescobo, about 542 deal with gender issues, that is, roughly 6.52%
List of words in topics [7]:
['aborto', 'mujeres', 'violación', 'matrimonio', 'feminismo', 'despenalización', 'mujer', 'derecho', 'vida', 'coip']
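
The filter above tests `word in x` on the whole string, so a keyword also matches when it is only part of a longer word: "vida" is a substring of "actividades", for example. Generic keywords such as "vida" and "derecho" can therefore inflate the count. A stricter alternative, sketched here as a suggestion and not as the method used for the figure above, is to match whole words with a regular expression:

```python
import re

# A few keywords taken from the topic's word list above
keywords = ["aborto", "mujer", "vida"]

# \b marks a word boundary, so only complete words match
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b")

def mentions_keyword(text):
    """Return True if the text contains any keyword as a whole word."""
    return pattern.search(text) is not None

print(mentions_keyword("la vida es bella"))        # whole-word match
print(mentions_keyword("actividades en ambato"))   # substring only
```

With `df['text_pre'].apply(mentions_keyword)` as the row filter, the reported percentage would likely be lower than with substring matching.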
Code
# drop rows where like_count or retweet_count is 0
# (copy so the column assignments below do not trigger SettingWithCopyWarning)
filtered_df = filtered_df[(filtered_df.like_count != 0) & (filtered_df.retweet_count != 0)].copy()

# add a new column with the mean of likes and retweets (used as marker size;
# this is not the impression_count reported by Twitter)
filtered_df['impressions'] = (filtered_df['like_count'] + filtered_df['retweet_count'])/2

# extract year from datetime column
filtered_df['year'] = filtered_df['date'].dt.year

# remove urls from the original tweet text
p.set_options(p.OPT.URL)
filtered_df['tweet_text'] = filtered_df['text'].apply(lambda x: p.clean(x))

# Create scatter plot
fig = px.scatter(filtered_df, x='like_count', 
                 y='retweet_count',
                 size='impressions', 
                 color='year',
                 hover_name='tweet_text')

# Update title and axis labels
fig.update_layout(
    title='Tweets talking about gender with most Likes and Retweets',
    xaxis_title='Number of Likes',
    yaxis_title='Number of Retweets'
)

fig.show()

Topics over time

Code
# convert column to list
tweets = df['text_pre'].to_list()
timestamps = df['local_time'].to_list()

topics_over_time = topic_model.topics_over_time(docs=tweets, 
                                                timestamps=timestamps, 
                                                global_tuning=True, 
                                                evolution_tuning=True, 
                                                nr_bins=20)

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)
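
With `nr_bins=20`, the timestamps are grouped into 20 intervals and topic frequencies are reported per interval. The sketch below illustrates only that binning and counting idea with pandas, using hypothetical numeric timestamps and two bins. The library additionally recomputes each topic's representative words per interval, which is not shown here.

```python
import pandas as pd

# Hypothetical documents: a numeric timestamp and an assigned topic id
docs_sample = pd.DataFrame({
    "timestamp": [0, 10, 20, 30, 40, 50, 60, 70],
    "topic":     [0,  0,  1,  0,  1,  1,  1,  0],
})

# Split the time range into equal-width bins (2 here; the notebook uses 20)
docs_sample["bin"] = pd.cut(docs_sample["timestamp"], bins=2, labels=False)

# Count documents per (bin, topic) and pivot topics into columns
freq = docs_sample.groupby(["bin", "topic"]).size().unstack(fill_value=0)

print(freq)
```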