Analysis of tweets from @PastorMalafaia

Data

General information about the dataset

Code
min_date = df['date'].min()

max_date = df['date'].max()

print(f"\nPeriod of collected tweets: {min_date} / {max_date}\n")

Period of collected tweets: 2010-06-11 11:27:43-03:00 / 2023-03-20 08:00:26-03:00
Code
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 43709 entries, 45185 to 88893
Data columns (total 63 columns):
 #   Column                   Non-Null Count  Dtype                            
---  ------                   --------------  -----                            
 0   query                    43709 non-null  object                           
 1   id                       43709 non-null  float64                          
 2   timestamp_utc            43709 non-null  int64                            
 3   local_time               43709 non-null  object                           
 4   user_screen_name         43709 non-null  object                           
 5   text                     43709 non-null  object                           
 6   possibly_sensitive       26733 non-null  object                           
 7   retweet_count            43709 non-null  float64                          
 8   like_count               43709 non-null  float64                          
 9   reply_count              43709 non-null  float64                          
 10  impression_count         365 non-null    object                           
 11  lang                     43709 non-null  object                           
 12  to_username              874 non-null    object                           
 13  to_userid                874 non-null    float64                          
 14  to_tweetid               560 non-null    float64                          
 15  source_name              43709 non-null  object                           
 16  source_url               43709 non-null  object                           
 17  user_location            43709 non-null  object                           
 18  lat                      0 non-null      object                           
 19  lng                      0 non-null      object                           
 20  user_id                  43709 non-null  object                           
 21  user_name                43709 non-null  object                           
 22  user_verified            43709 non-null  float64                          
 23  user_description         43709 non-null  object                           
 24  user_url                 43709 non-null  object                           
 25  user_image               43709 non-null  object                           
 26  user_tweets              43709 non-null  object                           
 27  user_followers           43709 non-null  float64                          
 28  user_friends             43709 non-null  object                           
 29  user_likes               43709 non-null  float64                          
 30  user_lists               43709 non-null  float64                          
 31  user_created_at          43709 non-null  object                           
 32  user_timestamp_utc       43709 non-null  float64                          
 33  collected_via            43709 non-null  object                           
 34  match_query              43709 non-null  float64                          
 35  retweeted_id             0 non-null      float64                          
 36  retweeted_user           0 non-null      float64                          
 37  retweeted_user_id        0 non-null      float64                          
 38  retweeted_timestamp_utc  0 non-null      object                           
 39  quoted_id                53 non-null     object                           
 40  quoted_user              53 non-null     object                           
 41  quoted_user_id           53 non-null     float64                          
 42  quoted_timestamp_utc     53 non-null     float64                          
 43  collection_time          43709 non-null  object                           
 44  url                      43709 non-null  object                           
 45  place_country_code       21 non-null     object                           
 46  place_name               21 non-null     object                           
 47  place_type               21 non-null     object                           
 48  place_coordinates        21 non-null     object                           
 49  links                    20062 non-null  object                           
 50  domains                  20062 non-null  object                           
 51  media_urls               10154 non-null  object                           
 52  media_files              10154 non-null  object                           
 53  media_types              10154 non-null  object                           
 54  media_alt_texts          399 non-null    object                           
 55  mentioned_names          3473 non-null   object                           
 56  mentioned_ids            3149 non-null   object                           
 57  hashtags                 2265 non-null   object                           
 58  intervention_type        0 non-null      float64                          
 59  intervention_text        0 non-null      float64                          
 60  intervention_url         0 non-null      float64                          
 61  country                  43709 non-null  object                           
 62  date                     43709 non-null  datetime64[ns, America/Sao_Paulo]
dtypes: datetime64[ns, America/Sao_Paulo](1), float64(20), int64(1), object(41)
memory usage: 21.3+ MB

Domains

Top 20 list of other websites mentioned in the tweets and their frequency

Code
# count items on column
domains_list = df['domains'].value_counts()

# return first n rows in descending order
top_domains = domains_list.nlargest(20)

top_domains
domains
youtu.be                       7145
verdadegospel.com              3659
youtube.com                    2116
goo.gl                         1719
veja.abril.com.br               992
vitoriaemcristo.org             848
migre.me                        696
editoracentralgospel.com        336
bit.ly                          166
instagram.com                   153
gospelplay.com                  147
facebook.com                    140
facebook.com|youtube.com        123
pastoresjuntos.com              111
eventos.vitoriaemcristo.org     103
fb.watch|instagram.com          103
escoladelideresonline.com        69
g1.globo.com                     67
advec.org                        67
eventospastorsilas.com.br        66
Name: count, dtype: int64
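
Note that rows such as `facebook.com|youtube.com` show that the `domains` column can store several pipe-joined domains per tweet, so `value_counts()` over the raw column counts each combination as its own entry. A minimal sketch (with hypothetical sample values) of splitting on the delimiter before counting, so combined entries contribute to each individual domain:

```python
import pandas as pd

# hypothetical sample mimicking the pipe-delimited 'domains' column
domains = pd.Series(["youtu.be", "facebook.com|youtube.com", "youtube.com", None])

# split on the delimiter and expand each domain to its own row before counting
per_domain_counts = domains.dropna().str.split("|").explode().value_counts()
print(per_domain_counts)
```

With this approach `youtube.com` would absorb the counts currently listed under the combined `facebook.com|youtube.com` entry.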

Hashtags

Top 20 list of the most used hashtags and their frequency

Code
# convert dataframe column to list
hashtags = df['hashtags'].to_list()

# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]

# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]

# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]

# count items on list
hashtags_count = pd.Series(hashtags).value_counts()

# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)

top_hashtags
silasmalafaia                       203
roubalheiraepttudoaver              197
dilmavaiperderaecio45vencer         123
elessabiamdorouboforadilma          116
12anosderoubalheiradoptchega         71
povobrasileirocontradilmaept         70
chegaderoubalheiraforadilma          69
225                                  60
chegaderouboementiraforadilma        59
dilmanaodialoguecomterrorista        54
clamandopelobrasil                   52
dilmaroubalheiraepttudoaver          52
fachinnão                            46
nemcorrupçãonemptnemdilma            43
videodepregacao                      43
respond                              43
aovivocommalafaia                    42
lulaedilmamenosodiocontramarina      42
votoaeciopelobr45il                  41
marinaresistentevaiserpresidente     39
Name: count, dtype: int64
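
The four-step list pipeline above (drop NaN, split on `|`, flatten, count) can also be written as a single chained pandas expression. A sketch with hypothetical hashtag values:

```python
import pandas as pd

# hypothetical pipe-delimited hashtag values, as stored in the dataset
hashtags = pd.Series(["silasmalafaia|clamandopelobrasil", None, "silasmalafaia"])

# dropna -> split -> explode -> value_counts mirrors the list-based pipeline above
hashtags_count = hashtags.dropna().str.split("|").explode().value_counts()
print(hashtags_count)
```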

Users

Top 20 most mentioned users in the tweets

Code
# convert dataframe column to list
users = df['mentioned_names'].to_list()

# remove nan items from list
users = [x for x in users if not pd.isna(x)]

# split items into a list based on a delimiter
users = [x.split('|') for x in users]

# flatten list of lists
users = [item for sublist in users for item in sublist]

# count items on list
users_count = pd.Series(users).value_counts()

# return first n rows in descending order
top_users = users_count.nlargest(20)

top_users
verdadegospel            638
advecoficial             532
edcentralgospel          179
reinaldoazevedo          158
avec_oficial             135
elizetemalafaia          115
radaronline              103
eyshila1                  70
cgospelmusic              60
pastormalafaiaoficial     52
pastormalafaia            45
nani_azevedo              43
drmikemurdock             41
silasmalafaia             39
veja                      38
jozyanneoficial           37
magnomaltaofc             35
advecsaopaulo             33
pgm_ratinho               26
danilogentili             23
Name: count, dtype: int64

Likes over time

Code
import plotly.express as px

# plot the data using plotly
fig = px.line(df, 
              x='date', 
              y='like_count', 
              title='Likes over Time',
              template='plotly_white', 
              hover_data=['text'])

# show the plot
fig.show()
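
Plotting each of the 43,709 tweets as its own point over a 13-year span tends to be noisy; aggregating likes per month before plotting makes the trend easier to read. A sketch, using a hypothetical miniature stand-in for `df[['date', 'like_count']]`:

```python
import pandas as pd

# hypothetical miniature stand-in for df[['date', 'like_count']]
df_small = pd.DataFrame({
    "date": pd.to_datetime(["2010-06-11", "2010-06-20", "2010-07-01"]),
    "like_count": [5, 3, 10],
})

# sum likes per month ("MS" = month start) to smooth the per-tweet noise
monthly_likes = df_small.set_index("date")["like_count"].resample("MS").sum()
print(monthly_likes)
```

The resulting series can be fed to `px.line` in place of the raw dataframe.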

Tokens

Top 20 list of the most common tokens and their frequency

Code
import spacy

# load the spacy model for Portuguese
nlp = spacy.load("pt_core_news_sm")

# load stop words for Portuguese
STOP_WORDS = nlp.Defaults.stop_words

# Function to filter stop words
def filter_stopwords(text):
    # lower text
    doc = nlp(text.lower())
    # filter tokens
    tokens = [token.text for token in doc if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)

# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)

# count items on column
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]

token_counts
assista       7856
q             5381
vídeo         4499
deus          3435
programa      3030
dia           2894
bolsonaro     2840
vitória       2782
cristo        2620
brasil        2418
acesse        2256
hoje          2189
vou           2168
pt            2073
divulgue      1981
ñ             1882
sábado        1816
imprensa      1727
lula          1721
imperdível    1715
Name: count, dtype: int64
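
Running the full spaCy pipeline on every tweet is relatively heavy when only lowercasing, stop-word removal, and alphabetic filtering are needed. A lightweight regex-based alternative sketch (the tiny stop-word set here is a hypothetical stand-in for spaCy's Portuguese list):

```python
import re

# tiny stand-in stop-word set; the notebook uses spaCy's Portuguese list
STOP_WORDS = {"de", "a", "o", "que", "e"}

def filter_stopwords(text: str) -> str:
    # keep lowercase alphabetic tokens (accented letters included) that are not stop words
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(filter_stopwords("Assista o vídeo de hoje!"))
```

This trades spaCy's tokenization quality for speed, so it is only a reasonable shortcut when no lemmatization or POS information is needed.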

Hours

Top 10 hours of the day with the most tweets published

Code
# extract hour from datetime column
df['hour'] = df['date'].dt.strftime('%H')

# count items on column
hours_count = df['hour'].value_counts()

# return first n rows in descending order
top_hours = hours_count.nlargest(10)

top_hours
hour
17    3772
12    3693
14    3432
15    3366
16    3268
10    3017
11    3004
13    2849
18    2446
19    2377
Name: count, dtype: int64
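
A small caveat: `strftime('%H')` returns strings, which sort lexicographically if the counts are later ordered by hour. `dt.hour` returns integers and sorts naturally. A sketch with hypothetical timestamps standing in for `df['date']`:

```python
import pandas as pd

# hypothetical timestamps standing in for df['date']
dates = pd.Series(pd.to_datetime([
    "2023-03-20 08:00:26", "2023-03-20 17:15:00", "2023-03-19 17:45:00",
]))

# dt.hour gives integer hours, avoiding lexicographic sorting of '%H' strings
hours_count = dates.dt.hour.value_counts()
print(hours_count)
```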

Platforms

Platforms from which content was published and their frequency

Code
df['source_name'].value_counts()
source_name
Twitter Web Client                 11191
Postcron App                       10752
Twitter for iPad                    8900
mLabs - Gestão de Redes Sociais     7243
Twitter for iPhone                  2183
erased3412752                        723
Twitter Ads                          580
Twitter for Android                  466
Twitter for Android Tablets          444
TweetDeck                            424
Twitter Web App                      303
Postgrain                            144
Periscope                            106
Twitter for BlackBerry®               98
Twitter for Advertisers.              65
Dynamic Tweets                        63
Twitpic                                7
Twitter for Websites                   3
Mobile Web                             3
iOS                                    3
Mobile Web (M2)                        2
Instagram                              2
Photos on iOS                          1
Twitter Media Studio                   1
audioBoom                              1
Twitter for Windows Phone              1
Name: count, dtype: int64

Topics

Topic modeling technique with transformers and TF-IDF

Code
import preprocessor as p  # tweet-preprocessor
from emoji import demojize

# remove urls, mentions and numbers
p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.NUMBER)
df['text_pre'] = df['text_pre'].apply(lambda x: p.clean(x))

# replace emojis with descriptions
df['text_pre'] = df['text_pre'].apply(lambda x: demojize(x))

# filter column
docs = df['text_pre']

from bertopic import BERTopic

# set up the topic model; topics and probabilities are calculated during fit
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)

# training
topics, probs = topic_model.fit_transform(docs)

# visualize topics
topic_model.visualize_topics()
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
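
The warning itself suggests the remedy: set `TOKENIZERS_PARALLELISM` before any HuggingFace tokenizer is imported or loaded (e.g. in the first cell of the notebook). A minimal sketch:

```python
import os

# set before any HuggingFace tokenizer is imported/loaded, as the warning advises;
# this disables tokenizers' thread parallelism so forked workers don't deadlock
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```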

Topic reduction

Map with 20% of the total number of generated topics

Code
# calculate 20% of the total number of topics
num_topics = len(topic_model.get_topic_info())
per_topics = int(num_topics * 20 / 100)

# reduce the number of topics
topic_model.reduce_topics(docs, nr_topics=per_topics)

# visualize topics
topic_model.visualize_topics()

Terms per topic

Code
topic_model.visualize_barchart(top_n_topics=per_topics)

Topic analysis

Selection of topics that touch on gender issues

Code
# selection of topics
topics = [4, 30, 40, 120]

keywords_list = []
for topic_ in topics:
    topic = topic_model.get_topic(topic_)
    keywords = [x[0] for x in topic]
    keywords_list.append(keywords)

# flatten list of lists
words_list = [item for sublist in keywords_list for item in sublist]

# use apply method with lambda function to filter rows
filtered_df = df[df['text_pre'].apply(lambda x: any(word in x for word in words_list))]

percentage = round(100 * len(filtered_df) / len(df), 2)
print(f"Out of the {len(df)} tweets by @PastorMalafaia, around {len(filtered_df)} talk about gender issues, i.e. close to {percentage}%")

print(f"List of words in topics {topics}:\n{words_list}")
Out of the 43709 tweets by @PastorMalafaia, around 7590 talk about gender issues, i.e. close to 17.36%
List of words in topics [4, 30, 40, 120]:
['gay', 'ativismo', 'gays', 'ativistas', 'homofobia', 'parada', 'causa', 'kit', 'jurídica', 'manobra', 'aborto', 'anencéfalos', 'útero', 'mobilização', 'bebê', 'envie', 'prolongamento', 'concepção', 'emails', 'mãe', 'gênero', 'ideologia', 'mulheres', 'prefeitura', 'constrangimento', 'mulher', 'implantada', 'rio', 'feministas', 'estupro', 'trans', 'transexual', 'liga', 'mamaria', 'feminina', 'cirurgia', 'prótese', 'reais', 'vôlei', 'mulher']
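
One caveat on the filter above: `word in x` is a substring test, so short keywords like `rio` also match inside unrelated words such as `glorioso`, which can inflate the 17.36% estimate. A word-boundary regex is stricter; the sketch below uses hypothetical mini-data to show the difference. Note the boundary match is also stricter in the other direction (`mulher` no longer matches `mulheres`), so lemmatized text or explicit inflected forms may be needed.

```python
import re

# hypothetical keywords and texts illustrating the substring pitfall
words_list = ["rio", "mulher"]
texts = ["culto glorioso hoje", "mulheres de fé", "no rio de janeiro"]

# one alternation with word boundaries: 'rio' no longer matches inside 'glorioso'
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words_list)) + r")\b")

matches = [bool(pattern.search(t)) for t in texts]
print(matches)
```
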
Code
# drop rows with 0 values in likes or retweets (copy to avoid chained-assignment warnings)
filtered_df = filtered_df[(filtered_df.like_count != 0) & (filtered_df.retweet_count != 0)].copy()

# add a new column with the average of likes and retweets
filtered_df['impressions'] = (filtered_df['like_count'] + filtered_df['retweet_count'])/2

# extract year from datetime column
filtered_df['year'] = filtered_df['date'].dt.year

# remove urls
p.set_options(p.OPT.URL)
filtered_df['tweet_text'] = filtered_df['text'].apply(lambda x: p.clean(x))

# Create scatter plot
fig = px.scatter(filtered_df, x='like_count', 
                 y='retweet_count',
                 size='impressions', 
                 color='year',
                 hover_name='tweet_text')

# Update title and axis labels
fig.update_layout(
    title='Tweets talking about gender with most Likes and Retweets',
    xaxis_title='Number of Likes',
    yaxis_title='Number of Retweets'
)

fig.show()

Topics over time

Code
# convert column to list
tweets = df['text_pre'].to_list()
timestamps = df['local_time'].to_list()

topics_over_time = topic_model.topics_over_time(docs=tweets, 
                                                timestamps=timestamps, 
                                                global_tuning=True, 
                                                evolution_tuning=True, 
                                                nr_bins=20)

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20)