min_date = df['date'].min()
max_date = df['date'].max()
print(f"\nCollection period of the tweets: {min_date} / {max_date}\n")
Collection period of the tweets: 2010-02-06 20:50:46-05:00 / 2023-03-21 05:00:25-05:00
General information about the dataset
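The cell that produced the summary below is not shown in the export; a minimal sketch, assuming a plain pandas call:

# print index range, dtypes and non-null counts for every column (assumed call, not in the original export)
df.info()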
<class 'pandas.core.frame.DataFrame'>
Index: 17450 entries, 179250 to 196699
Data columns (total 63 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 query 17450 non-null object
1 id 17450 non-null float64
2 timestamp_utc 17450 non-null int64
3 local_time 17450 non-null object
4 user_screen_name 17450 non-null object
5 text 17450 non-null object
6 possibly_sensitive 15102 non-null object
7 retweet_count 17450 non-null float64
8 like_count 17450 non-null float64
9 reply_count 17450 non-null float64
10 impression_count 620 non-null object
11 lang 17450 non-null object
12 to_username 100 non-null object
13 to_userid 100 non-null float64
14 to_tweetid 87 non-null float64
15 source_name 17450 non-null object
16 source_url 17450 non-null object
17 user_location 17450 non-null object
18 lat 9 non-null object
19 lng 9 non-null object
20 user_id 17450 non-null object
21 user_name 17450 non-null object
22 user_verified 17450 non-null float64
23 user_description 17450 non-null object
24 user_url 17450 non-null object
25 user_image 17450 non-null object
26 user_tweets 17450 non-null object
27 user_followers 17450 non-null float64
28 user_friends 17450 non-null object
29 user_likes 17450 non-null float64
30 user_lists 17450 non-null float64
31 user_created_at 17450 non-null object
32 user_timestamp_utc 17450 non-null float64
33 collected_via 17450 non-null object
34 match_query 17450 non-null float64
35 retweeted_id 0 non-null float64
36 retweeted_user 0 non-null float64
37 retweeted_user_id 0 non-null float64
38 retweeted_timestamp_utc 0 non-null object
39 quoted_id 51 non-null object
40 quoted_user 51 non-null object
41 quoted_user_id 51 non-null float64
42 quoted_timestamp_utc 51 non-null float64
43 collection_time 17450 non-null object
44 url 17450 non-null object
45 place_country_code 795 non-null object
46 place_name 795 non-null object
47 place_type 795 non-null object
48 place_coordinates 795 non-null object
49 links 10194 non-null object
50 domains 10194 non-null object
51 media_urls 7526 non-null object
52 media_files 7526 non-null object
53 media_types 7526 non-null object
54 media_alt_texts 1940 non-null object
55 mentioned_names 3066 non-null object
56 mentioned_ids 2583 non-null object
57 hashtags 10072 non-null object
58 intervention_type 0 non-null float64
59 intervention_text 0 non-null float64
60 intervention_url 0 non-null float64
61 country 17450 non-null object
62 date 17450 non-null datetime64[ns, America/Bogota]
dtypes: datetime64[ns, America/Bogota](1), float64(20), int64(1), object(41)
memory usage: 8.5+ MB
Top 20 other websites mentioned in the tweets and their frequency
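The counting cell is missing from the export; a minimal sketch, assuming the pipe-separated domains column is counted as-is (so joined entries such as fb.me|ow.ly count as single items, matching the output below):

# count raw values of the pipe-separated domains column (assumed call)
df['domains'].value_counts().head(20)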
domains
fb.me 3803
instagram.com 1778
youtu.be 1280
ow.ly 1173
misionpaz.org 624
twitter.com 216
youtube.com 176
fb.me|ow.ly 150
misionpaz.org|youtu.be 135
bit.ly 98
pst.cr 90
jhonmilton.org 64
pscp.tv 58
fb.me|youtube.com 58
explosion.misionpaz.org 56
inscripciones.genesis.misionpaz.org 38
wp.me 37
congresos.misionpaz.org 23
misiónpaz.org 19
fb.me|new.livestream.com 18
Name: count, dtype: int64
Top 20 most used hashtags and their frequency
# convert dataframe column to list
hashtags = df['hashtags'].to_list()
# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]
# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]
# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]
# count items on list
hashtags_count = pd.Series(hashtags).value_counts()
# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)
top_hashtags
misionpazmicasa 1561
misionpaz 1098
mpntucasa 1012
iglesiampn 887
devocional 881
misionpazencasa 604
feparagrandesvictorias 514
20añostransformando 333
esnuestracasa 327
envivo 320
mpnenvivo 278
familiampn 253
somosfamilia 245
yosoympn 240
mpnnuestracasa 236
explosioncontundente 220
avivamiento 185
fiestademilagros 179
vive 176
mpn 157
Name: count, dtype: int64
Top 20 most mentioned users in the tweets
# convert dataframe column to list
users = df['mentioned_names'].to_list()
# remove nan items from list
users = [x for x in users if not pd.isna(x)]
# split items into a list based on a delimiter
users = [x.split('|') for x in users]
# flatten list of lists
users = [item for sublist in users for item in sublist]
# count items on list
users_count = pd.Series(users).value_counts()
# return first n rows in descending order
top_users = users_count.nlargest(20)
top_users
johnmiltonr_ 712
gerarydiana 342
jhonmiltonr 270
joelmanderfield 257
youtube 186
prjhonmilton 171
ce_palace 164
gissymander 151
profetanormasr 117
misionpaziglesia 76
misionpaz_ 75
soynormaruiz 45
marcobarrientos 45
normanormaruiz 41
fundacionmisionpaz 30
prgerardoydiana 28
pastorcashluna 25
cesarfajardosm 24
otonielfont 23
evancraft 21
Name: count, dtype: int64
Top 20 most common tokens and their frequency
import spacy

# load the spaCy model for Spanish
nlp = spacy.load("es_core_news_sm")
# load stop words for Spanish
STOP_WORDS = nlp.Defaults.stop_words
# function to drop stop words and non-alphabetic tokens
def filter_stopwords(text):
    # lowercase the text before tokenizing
    doc = nlp(text.lower())
    # keep alphabetic tokens that are not stop words
    tokens = [token.text for token in doc if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)
# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)
# count items on column
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]
token_counts
dios 6122
paz 2051
misión 1780
vida 1589
misionpazmicasa 1556
mensaje 1280
completo 1193
tiempo 1123
misionpaz 1100
conéctate 1086
devocional 1033
mpntucasa 1016
amor 1000
celebración 983
familia 926
iglesiampn 891
jesús 851
pm 809
esperamos 805
señor 793
Name: count, dtype: int64
The 10 hours with the highest number of published tweets
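The aggregation cell is not shown; a plausible sketch, assuming the hour is derived from the timezone-aware date column (the zero-padded labels such as 04 suggest strftime formatting):

# extract the two-digit hour from each tweet's datetime and count tweets per hour (assumed derivation)
df['hour'] = df['date'].dt.strftime('%H')
df['hour'].value_counts().head(10)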
hour
18 1245
20 1223
19 1119
04 1063
12 1058
11 1041
09 1039
17 1027
08 1003
10 996
Name: count, dtype: int64
Platforms from which content was published and their frequency
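Again the counting cell is absent; a minimal sketch, assuming a plain frequency count over the source_name column:

# count how many tweets were published from each client or platform (assumed call)
df['source_name'].value_counts()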
source_name
Facebook 4460
Twitter Web App 2791
Hootsuite 2423
Instagram 1614
Twitter Web Client 1341
Postcron App 971
Twitter for iPad 868
Twitter for Android 682
Twitter for iPhone 627
TweetDeck 290
SocialGest 285
Google 254
Twitter Media Studio 230
Repost.social 167
Hootsuite Inc. 165
a Ning Network 106
Restream.io 72
Periscope 57
erased9_3Ud7cuBk0y 39
erased132190 3
Ustream.TV 2
LinkedIn 1
Twitter for Advertisers. 1
erased138961 1
Name: count, dtype: int64
Topic modeling technique with transformers and TF-IDF
import preprocessor as p  # tweet-preprocessor
from emoji import demojize

# remove urls, mentions and numbers (no hashtag option is set, so hashtags are kept)
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(lambda x: p.clean(x))
# replace emojis with their text descriptions
df['text_pre'] = df['text_pre'].apply(lambda x: demojize(x))
# filter column
docs = df['text_pre']
from bertopic import BERTopic

# initialize the multilingual topic model
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)
# fit the model and extract topics and their probabilities
topics, probs = topic_model.fit_transform(docs)
# visualize topics
topic_model.visualize_topics()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
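The warning above is emitted once per forked worker; as the message itself suggests, it can be silenced by setting the environment variable before any tokenizer is loaded:

# disable tokenizers parallelism before the process forks, as the warning recommends
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"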
Map of 20% of the total topics generated
No topic was identified that speaks decisively about abortion, feminism, or gender.
# convert the text and timestamp columns to lists
tweets = df['text_pre'].to_list()
timestamps = df['local_time'].to_list()
# compute topic frequencies over time, binned into 20 periods
topics_over_time = topic_model.topics_over_time(docs=tweets,
                                                timestamps=timestamps,
                                                global_tuning=True,
                                                evolution_tuning=True,
                                                nr_bins=20)
# visualize the 20 most frequent topics over time
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)