Analysis of tweets from @misionpaz_


General information about the dataset

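import pandas as pd

# df is assumed to be a DataFrame of the collected tweets, loaded earlier,
# with 'date' parsed as a timezone-aware datetime column
min_date = df['date'].min()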

max_date = df['date'].max()

print(f"\nTweet collection period: {min_date} / {max_date}\n")

Tweet collection period: 2010-02-06 20:50:46-05:00 / 2023-03-21 05:00:25-05:00

Top 20 other websites mentioned in the tweets, with their frequency

# count items on column
domains_list = df['domains'].value_counts()

# return first n rows in descending order
top_domains = domains_list.nlargest(20)

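domains
[name missing]    3803
[name missing]    1778
[name missing]    1280
[name missing]    1173
[name missing]     624
[name missing]     216
[name missing]     176
[name missing]     150
[name missing]     135
[name missing]      98
[name missing]      90
[name missing]      64
[name missing]      58
[name missing]      58
[name missing]      56
[name missing]      38
[name missing]      37
[name missing]      23
misió               19
[name missing]      18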
Name: count, dtype: int64


Top 20 most used hashtags, with their frequency

# convert dataframe column to list
hashtags = df['hashtags'].to_list()

# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]

# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]

# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]

# count items on list
hashtags_count = pd.Series(hashtags).value_counts()

# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)

misionpazmicasa           1561
misionpaz                 1098
mpntucasa                 1012
iglesiampn                 887
devocional                 881
misionpazencasa            604
feparagrandesvictorias     514
20añostransformando        333
esnuestracasa              327
envivo                     320
mpnenvivo                  278
familiampn                 253
somosfamilia               245
yosoympn                   240
mpnnuestracasa             236
explosioncontundente       220
avivamiento                185
fiestademilagros           179
vive                       176
mpn                        157
Name: count, dtype: int64


Top 20 most mentioned users in the tweets

# filter column from dataframe
users = df['mentioned_names'].to_list()

# remove nan items from list
users = [x for x in users if not pd.isna(x)]

# split items into a list based on a delimiter
users = [x.split('|') for x in users]

# flatten list of lists
users = [item for sublist in users for item in sublist]

# count items on list
users_count = pd.Series(users).value_counts()

# return first n rows in descending order
top_users = users_count.nlargest(20)

johnmiltonr_          712
gerarydiana           342
jhonmiltonr           270
joelmanderfield       257
youtube               186
prjhonmilton          171
ce_palace             164
gissymander           151
profetanormasr        117
misionpaziglesia       76
misionpaz_             75
soynormaruiz           45
marcobarrientos        45
normanormaruiz         41
fundacionmisionpaz     30
prgerardoydiana        28
pastorcashluna         25
cesarfajardosm         24
otonielfont            23
evancraft              21
Name: count, dtype: int64
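
The hashtag and mention counts above follow the same steps: drop missing values, split on '|', flatten, and count. A minimal sketch of that pattern as a single reusable function follows. The name count_delimited and its parameters are hypothetical and not part of the original notebook, and it uses collections.Counter rather than pandas so that it accepts any iterable, including a DataFrame column.

```python
from collections import Counter


def count_delimited(values, sep="|", top=20):
    """Count items in delimiter-joined strings, skipping missing values.

    Hypothetical helper (not in the original notebook).
    """
    counter = Counter()
    for value in values:
        # NaN and None are not strings, so this skips missing cells
        if not isinstance(value, str):
            continue
        # split each cell and drop empty fragments such as in 'a||b'
        counter.update(item for item in value.split(sep) if item)
    # most_common returns (item, count) pairs in descending order of count
    return counter.most_common(top)


# made-up sample values, not the collected tweets
sample = ["mpn|envivo", float("nan"), "mpn", None, "envivo|mpn"]
top_items = count_delimited(sample, top=2)
```

With the notebook's data this could be called as count_delimited(df['hashtags']) or count_delimited(df['mentioned_names']).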

Likes over time

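import plotly.express as px

# plot the data using plotly
# NOTE: the original call is truncated; x='date' and y='like_count' are assumed column names
fig = px.line(df,
              x='date',
              y='like_count',
              title='Likes over Time')

# show the plot
fig.show()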

Top 20 most common tokens, with their frequency

import spacy

# load the spacy model for Spanish
nlp = spacy.load("es_core_news_sm")

# load stop words for Spanish
STOP_WORDS = nlp.Defaults.stop_words

# Function to filter stop words
def filter_stopwords(text):
    # lower text
    doc = nlp(text.lower())
    # filter tokens
    tokens = [token.text for token in doc if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)

# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)

# count items on column
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]

dios               6122
paz                2051
misión             1780
vida               1589
misionpazmicasa    1556
mensaje            1280
completo           1193
tiempo             1123
misionpaz          1100
conéctate          1086
devocional         1033
mpntucasa          1016
amor               1000
celebración         983
familia             926
iglesiampn          891
jesús               851
pm                  809
esperamos           805
señor               793
Name: count, dtype: int64


Top 10 hours with the most tweets published

# extract hour from datetime column
df['hour'] = df['date'].dt.strftime('%H')

# count items on column
hours_count = df['hour'].value_counts()

# return first n rows in descending order
top_hours = hours_count.nlargest(10)

18    1245
20    1223
19    1119
04    1063
12    1058
11    1041
09    1039
17    1027
08    1003
10     996
Name: count, dtype: int64


Platforms from which content was published, with their frequency

Facebook                    4460
Twitter Web App             2791
Hootsuite                   2423
Instagram                   1614
Twitter Web Client          1341
Postcron App                 971
Twitter for iPad             868
Twitter for Android          682
Twitter for iPhone           627
TweetDeck                    290
SocialGest                   285
Google                       254
Twitter Media Studio         230
[name missing]               167
Hootsuite Inc.               165
a Ning Network               106
[name missing]                72
Periscope                     57
erased9_3Ud7cuBk0y            39
erased132190                   3
Ustream.TV                     2
LinkedIn                       1
Twitter for Advertisers.       1
erased138961                   1
Name: count, dtype: int64
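
No code accompanies this table. A minimal sketch of how such counts could be obtained with pandas follows, assuming the publishing platform is stored in a column named source_name. That column name is an assumption, and the small DataFrame below is made-up sample data, not the collected tweets.

```python
import pandas as pd

# made-up sample rows; 'source_name' is an assumed column name
sample_df = pd.DataFrame(
    {"source_name": ["Facebook", "Hootsuite", "Facebook", None, "Instagram", "Facebook"]}
)

# value_counts skips missing values and sorts by count in descending order
platform_counts = sample_df["source_name"].value_counts()
```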


Topic modeling with transformers and TF-IDF

import preprocessor as p
from emoji import demojize
from bertopic import BERTopic

# remove urls, mentions and numbers
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(lambda x: p.clean(x))

# replace emojis with descriptions
df['text_pre'] = df['text_pre'].apply(lambda x: demojize(x))

# filter column
docs = df['text_pre']

# calculate topics and probabilities
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)

# training
topics, probs = topic_model.fit_transform(docs)

# visualize topics
topic_model.visualize_topics()

Topic reduction

Map with 20% of the total topics generated

# calculate the 20% from the total of topics
num_topics = len(topic_model.get_topic_info())
per_topics = int(num_topics * 20 / 100)

# reduce the number of topics
topic_model.reduce_topics(docs, nr_topics=per_topics)

# visualize topics
topic_model.visualize_topics()
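
The 20% target above is computed from every row of get_topic_info(). BERTopic typically reports documents without a topic under the id -1, so that row can inflate the count. A minimal sketch of the same arithmetic with that row excluded follows. The function name target_topic_count and the sample ids are hypothetical, and whether to exclude -1 is a judgment call, not something the notebook specifies.

```python
def target_topic_count(topic_ids, percent=20):
    """Return the number of topics to keep, ignoring the outlier id -1.

    Hypothetical helper (not in the original notebook).
    """
    # count distinct topics, leaving out -1 (documents without a topic)
    real_topics = {topic_id for topic_id in topic_ids if topic_id != -1}
    # integer arithmetic truncates like int() in the notebook;
    # max() keeps at least one topic
    return max(1, len(real_topics) * percent // 100)


# made-up sample ids: one outlier row plus topics 0..49
sample_ids = [-1] + list(range(50))
per_topics_example = target_topic_count(sample_ids)
```

With the notebook's model, the ids could come from topic_model.get_topic_info()['Topic'].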

Terms per topic


Topic analysis

No topic was identified that clearly addresses abortion, feminism, or gender.

Topics over time

# convert column to list
tweets = df['text_pre'].to_list()
timestamps = df['local_time'].to_list()

topics_over_time = topic_model.topics_over_time(docs=tweets, timestamps=timestamps)

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)