Exploring Twitter Dynamics Using R and Gephi: Analyzing Text and Centrality

This is the second part of a three-part series about analyzing Twitter clusters using R and Gephi. Part one established the foundation for the example we’ll explore in detail here; part three applies cluster analysis to polarized posts about US politics and draws conclusions.

Importance of Nodes in Social Networks

Before we begin, it’s essential to understand the concept of centrality. In the realm of network science, centrality refers to nodes that hold significant sway over the network. However, “influence” itself can be interpreted in numerous ways. Is a node with numerous connections inherently more influential than one with fewer but potentially more significant connections? What defines a “significant connection” within a social network?

To tackle these complexities, network scientists have devised various centrality measures. We’ll delve into four frequently used measures, noting that many more exist.

Degree Centrality

The most prevalent and easily grasped measure is degree centrality. Its premise is straightforward: a node’s influence is gauged by the number of connections it possesses. Directed graphs offer two variants: indegree, the number of incoming connections (akin to an authority score), and outdegree, the number of outgoing connections (akin to a hub score).

Our previous exploration utilized the undirected approach. This time, we’ll concentrate on the indegree approach. This allows for a more precise analysis by prioritizing users who are frequently retweeted over those who simply retweet often.
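
As a quick illustration (a minimal sketch on a hypothetical toy retweet graph, not our dataset), igraph’s degree function exposes both variants through its mode argument:

library(igraph)

# Toy retweet network: an edge A -> B means "A retweeted B"
g <- graph_from_literal(alice -+ carol, bob -+ carol, dave -+ carol, carol -+ bob)

degree(g, mode = "in")   # carol is retweeted most often (indegree 3)
degree(g, mode = "out")  # alice, bob, and dave each retweet once, as does carol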

Eigenvector Centrality

The eigenvector measure expands on degree centrality. The more influential nodes that link to a particular node, the higher its score. We start with an adjacency matrix, where rows and columns denote nodes. A 1 or 0 signifies whether the corresponding nodes in a given row and column are linked. The primary calculation determines the eigenvectors of this matrix. The principal eigenvector houses the desired centrality measures, with position i containing the centrality score of node i.
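
To make this concrete, here is a minimal sketch (on a small toy graph, not our data) verifying that the principal eigenvector of the adjacency matrix matches igraph’s built-in eigen_centrality:

library(igraph)

# Small undirected toy graph and its adjacency matrix:
g <- make_graph(~ A-B, A-C, B-C, C-D)
adj <- as_adjacency_matrix(g, sparse = FALSE)

# Principal eigenvector of the adjacency matrix, rescaled so the maximum is 1:
principal <- abs(eigen(adj)$vectors[, 1])
principal <- principal / max(principal)

# igraph computes the same scores:
cbind(manual = principal, igraph = eigen_centrality(g)$vector)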

PageRank

PageRank is a variant of the eigenvector measure that lies at the heart of Google’s search algorithm. While Google’s exact method remains undisclosed, the fundamental idea involves each node starting with a score of 1 and then distributing this score evenly among its outgoing edges. For instance, a node with three outgoing edges would “send” one-third of its score through each edge. Simultaneously, a node’s importance is boosted by the edges that point towards it. This results in a solvable system of N equations with N unknowns.
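
The following sketch (a hypothetical four-node graph, not part of our dataset) solves that system by simple power iteration with the usual 0.85 damping factor and checks the result against igraph’s page_rank:

library(igraph)

# Toy directed graph; every node has at least one outgoing edge:
g <- make_graph(c(1,2, 1,3, 2,3, 3,1, 3,4, 4,3), directed = TRUE)
adj <- as_adjacency_matrix(g, sparse = FALSE)

# Each node splits its score evenly among its outgoing edges:
M <- adj / rowSums(adj)
d <- 0.85                 # damping factor used in the classic formulation
n <- nrow(M)
pr <- rep(1 / n, n)       # start with equal scores
for (i in 1:100) {
  pr <- (1 - d) / n + d * as.vector(t(M) %*% pr)
}

cbind(power_iteration = pr, igraph = page_rank(g, damping = d)$vector)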

Betweenness Centrality

The fourth measure, betweenness centrality, employs a distinct approach. Here, a node’s influence is determined by its presence on numerous short paths connecting other nodes. Essentially, it plays a crucial role in connecting various parts of the network.

In social network analysis, such nodes could represent individuals who excel at helping others secure new jobs or establish novel connections—they act as gateways to previously unexplored social circles.
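
A minimal sketch (a toy “two groups joined by a bridge” graph, unrelated to our dataset) makes the idea visible: the bridging node tops the betweenness ranking even though its degree is unremarkable:

library(igraph)

# Two tight triangles connected only through the node "bridge":
g <- make_graph(~ a-b, a-c, b-c, x-y, x-z, y-z, c-bridge, bridge-x)

sort(betweenness(g), decreasing = TRUE)  # "bridge" lies on every path between the two groups
sort(degree(g), decreasing = TRUE)       # ...despite having only two connections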

Choosing the Right Centrality Measure

The ideal centrality measure hinges on your analytical objective. Do you aim to identify users frequently highlighted by others in terms of sheer volume? Degree centrality might be the optimal choice. If you prioritize a quality-focused measure, eigenvector or PageRank would be more suitable. If pinpointing users who effectively bridge different communities is paramount, betweenness centrality is your best bet.

When employing multiple similar measures, like eigenvector and PageRank, calculate both and compare their rankings. Discrepancies can prompt further analysis or the creation of a new measure by combining their scores.
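
For instance, a quick rank comparison can be sketched with a Spearman correlation (assuming net is the igraph object built in part one and loaded below):

# Compute both measures on the same graph:
eig   <- centr_eigen(net)$vector
prank <- page_rank(net)$vector

# Spearman correlation works on ranks, so it directly measures ranking agreement:
cor(eig, prank, method = "spearman")

# Nodes whose ranks differ most between the two measures deserve a closer look:
rank_gap <- abs(rank(-eig) - rank(-prank))
head(sort(rank_gap, decreasing = TRUE), 10)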

Alternatively, utilize principal component analysis to determine which measure provides deeper insights into the true influence wielded by nodes within your network.
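
As a minimal sketch of that idea (using base R’s prcomp and assuming the cent data frame built in the next section, with columns bet, eig, prank, and degr), the first principal component can serve as a combined influence score:

# Scale the four measures so no single one dominates just because of its range:
measures <- cent[, c("bet", "eig", "prank", "degr")]
pca <- prcomp(measures, center = TRUE, scale. = TRUE)

summary(pca)  # Shows how much variance the first component captures

# The first principal component as a combined score; its sign is arbitrary,
# so flip it if the highest-degree accounts come out negative:
cent$combined <- pca$x[, 1]
head(cent[order(cent$combined, decreasing = TRUE), c("account", "combined")], 10)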

Practical Centrality Calculation

Let’s explore how to calculate these measures using R and RStudio (achievable with Gephi as well).

Begin by loading the necessary libraries:

library("plyr")
library(igraph)
library(tidyverse)
library(NLP)
library("tm")
library(RColorBrewer)
library(wordcloud)
library(topicmodels)
library(SnowballC)
library("textmineR")

Next, remove isolated nodes from our existing data, as they are inconsequential for this analysis. Then use the igraph functions betweenness, centr_eigen, page_rank, and degree to compute the centrality measures, and store the scores in a data frame so we can identify the most central users.

load("art1_tweets.RData")
Isolated = which(degree(net)==0)
net_clean = delete.vertices(net, Isolated)

cent<-data.frame(bet=betweenness(net_clean),eig=centr_eigen(net_clean)$vector,prank=(page_rank(net_clean)$vector),degr=degree(net_clean, mode="in"))
cent <- cbind(account = rownames(cent), cent)

Now, let’s examine the top 10 most central users based on each measure:

Degree: top_n(cent, 10, degr) %>% arrange(desc(degr)) %>% select(degr)
Eigenvector: top_n(cent, 10, eig) %>% arrange(desc(eig)) %>% select(eig)
PageRank: top_n(cent, 10, prank) %>% arrange(desc(prank)) %>% select(prank)
Betweenness: top_n(cent, 10, bet) %>% arrange(desc(bet)) %>% select(bet)

Here are the results:

| Degree | Eigenvector | PageRank | Betweenness |
|---|---|---|---|
| ESPNFC (5892) | PSG_inside (1) | mundodabola (0.037) | viewsdey (77704) |
| TrollFootball (5755) | CrewsMat19 (0.51) | AleLiparoti (0.026) | EdmundOris (76425) |
| PSG_inside (5194) | eh01195991 (0.4) | PSG_inside (0.017) | ba*****lla (63799) |
| CrewsMat19 (4344) | mohammad135680 (0.37) | RoyNemer (0.016) | FranciscoGaius (63081) |
| brfootball (4054) | ActuFoot_ (0.34) | TrollFootball (0.013) | Yemihazan (62534) |
| PSG_espanol (3616) | marttvall (0.34) | ESPNFC (0.01) | hashtag2weet (61123) |
| IbaiOut (3258) | ESPNFC (0.3) | PSG_espanol (0.007) | Angela_FCB (60991) |
| ActuFoot_ (3175) | brfootball (0.25) | lnstantFoot (0.007) | Zyyon_ (57269) |
| FootyHumour (2976) | SaylorMoonArmy (0.22) | IbaiOut (0.006) | CrewsMat19 (53758) |
| mundodabola (2778) | JohnsvillPat (0.22) | 010MisterChip (0.006) | MdeenOlawale (49572) |

Observe that the first three measures share users such as PSG_inside, ESPNFC, CrewsMat19, and TrollFootball, implying that they exerted significant influence over the discussion. Betweenness, with its different approach, shows less overlap with the others.
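
To quantify that overlap rather than eyeball it, a short sketch (reusing the cent data frame built above) intersects the top 10 lists:

# Top 10 accounts according to each measure:
top10_by <- function(column) {
  head(cent[order(cent[[column]], decreasing = TRUE), "account"], 10)
}
tops <- lapply(c(degr = "degr", eig = "eig", prank = "prank", bet = "bet"), top10_by)

# Accounts shared by the degree, eigenvector, and PageRank top 10s:
Reduce(intersect, tops[c("degr", "eig", "prank")])
# Overlap between that shared set and the betweenness top 10:
intersect(Reduce(intersect, tops[c("degr", "eig", "prank")]), tops$bet)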

Note: Opinions expressed by the mentioned Twitter accounts do not represent those of Toptal or the author.

Below are visualizations of our original color-coded network graph with user label overlays. The first highlights nodes by PageRank scores; the second, by betweenness scores:

An image showing a colored PageRank plot, with the top 10 users and their networks highlighted. The three biggest users are PSG_inside, TrollFootball, and ESPNFC. ESPNFC is located on the left of the plot and colored purple, while PSG_inside is placed to the right of it, colored red. TrollFootball is located higher and to the right of them, between green-, blue-, and orange-colored users.
Messi discussion with the top 10 PageRank users highlighted
An image showing a colored betweenness plot, with the top 10 users and their networks labeled and highlighted. All of the top 10 users, which are more similar in size than in the previous image, are located in the lower-left corner of the image, which is colored purple. They are grouped together tightly.
Messi discussion with the top 10 betweenness users highlighted

These visualizations can be recreated using Gephi: compute betweenness scores with the Network Diameter statistic (PageRank has its own entry in the statistics panel), then display node names using attributes as demonstrated in the first part of this series.

Analyzing Text with R and LDA

Social network discussions can be analyzed to uncover conversation topics. One approach is topic modeling using Latent Dirichlet Allocation (LDA), an unsupervised machine learning technique that helps identify sets of co-occurring words. We can then infer the discussed topic from these word sets.

First, we need to clean up the text using the following function:

# This function normalizes text by removing Twitter-related terms and noisy characters
sanitize_text <- function(text) {
  # Convert to ASCII to remove accented characters:
  text <- iconv(text, to = "ASCII", sub = " ")
  # Convert to lowercase and delete the "rt" keyword Twitter prepends to retweets
  # (note: this simple pattern also strips "rt" when it appears inside words):
  text <- gsub("rt", " ", tolower(text))
  # Delete links and user names:
  text <- gsub("@\\w+", " ", gsub("http.+ |http.+$", " ", text))
  # Delete tabs and punctuation:
  text <- gsub("[ |\t]{2,}", " ", gsub("[[:punct:]]", " ", text))
  text <- gsub("amp", " ", text)  # Remove HTML special character
  # Delete leading and lagging blanks:
  text <- gsub("^ ", "", gsub(" $", "", text))
  text <- gsub(" +", " ", text) # Delete extra spaces
  return(text)
}

We also need to eliminate stop words, duplicate entries, and empty entries. Subsequently, we convert the text into a format suitable for LDA processing: a document-term matrix (DTM).

Our dataset includes users communicating in various languages (English, Spanish, French, etc.). For optimal LDA performance, focusing on a single language is recommended. We’ll apply it to users within the largest community identified in the previous part, primarily consisting of English-speaking accounts.

# Detect communities:
my.com.fast <-cluster_louvain(as.undirected(simplify(net)))
largestCommunities <- order(sizes(my.com.fast), decreasing=TRUE)[1:3]
# Save the usernames of the biggest community:
community1 <- names(which(membership(my.com.fast) == largestCommunities[1]))

# Sanitize the text of the users of the biggest community:
text <- unique(sanitize_text(tweets.df[which(tweets.df$screen_name %in% community1),]$text))
text = text[text!=''] # Delete empty entries
# Remove Spanish stopwords:
stopwords_regex = paste(stopwords('es'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
text = stringr::str_replace_all(text, stopwords_regex, '')
# Create the document-term matrix:
dtm <- CreateDtm(text,
                 doc_names = seq_along(text),
                 ngram_window = c(1, 2))

Determining Optimal Topic Count

A crucial LDA hyperparameter is the number (k) of topics to estimate. A common approach involves training LDA models with varying k values and evaluating the coherence of each model. We’ll test k values from 3 to 20, as values outside this range are generally less insightful based on experience:

tf <- TermDocFreq(dtm = dtm)
# Keep terms that appear more than once and in fewer than half of the documents
# (in this script the list is only used to generate a unique model folder name below):
tf_trimmed = tf$term[ tf$term_freq > 1 & tf$doc_freq < nrow(dtm) / 2 ]

# Create a folder to store trained models:
model_dir <- paste0("models_", digest::digest(tf_trimmed, algo = "sha1"))
if (!dir.exists(model_dir)) dir.create(model_dir)

# Define a function to infer LDA topics:
train_lda_model <- function(number_of_topics){
    filename = file.path(model_dir, paste0(number_of_topics, "_topics.rda"))
    # Check if the model already exists:
    if (!file.exists(filename)) {
        # To get exactly the same output on each run, use a constant seed:
        set.seed(12345)
        lda_model = FitLdaModel(dtm = dtm, k = number_of_topics, iterations = 500)
        lda_model$k = number_of_topics
        lda_model$coherence = CalcProbCoherence(phi = lda_model$phi, dtm = dtm, M = 5)
        save(lda_model, file = filename)
    } else {
        load(filename)
    }
    
    lda_model
}
# The numbers of topics to try, one LDA training run each:
topic_count = seq(3, 20, by = 1)
# Train the models in parallel with textmineR's TmParallelApply:
models = TmParallelApply(X = topic_count,
                         FUN = train_lda_model,
                         export = c("dtm", "model_dir"))

Now, let’s visualize the coherence score for each k:

coherence_by_topics_quantity = data.frame(
  topic_number = sapply(models, function(model_instance) nrow(model_instance$phi)),
  score_coherence = sapply(models, function(model_instance) mean(model_instance$coherence)),
  stringsAsFactors = FALSE)

ggplot(coherence_by_topics_quantity, aes(x = topic_number, y = score_coherence)) +
  geom_point() +
  geom_line(group = 1) +
  ggtitle("Coherence by Topic") +
  theme_minimal() +
  scale_x_continuous(breaks = seq(1, 20, 1)) +
  ylab("Coherence Score") +
  xlab("Number of topics")

A higher coherence value signifies better topic segmentation within the text:

A graph showing the coherence score for different topics. The coherence score varies from slightly over 0.05 on six to seven topics, with three to 12 topics all having a score below 0.065. The score suddenly peaks at about 0.105 for 13 topics. Then it goes below 0.06 for 17 topics, up to nearly 0.09 for 19 topics, and finishes at just above 0.07 for 20 topics.

With k = 13 yielding the peak coherence score, we’ll proceed with the LDA model trained on 13 topics. Using the GetTopTerms function, we can extract the top words for each topic and infer each topic’s meaning from them:

# Select the model with the highest mean coherence:
best_model <- models[[which.max(coherence_by_topics_quantity$score_coherence)]]

# Most important terms by topic:
best_model$top_terms <- GetTopTerms(phi = best_model$phi, M = 20)
top10 <- as.data.frame(best_model$top_terms)
top10

The following table presents the five most prominent topics detected, along with the 10 most representative words for each:

| | t_1 | t_2 | t_3 | t_4 | t_5 |
|---|---|---|---|---|---|
| 1 | messi | messi | messi | messi | messi |
| 2 | lionel | instagram | league | est | psg |
| 3 | lionel_messi | post | win | il | leo |
| 4 | psg | million | goals | au | leo_messi |
| 5 | madrid | likes | ch | pour | ahora |
| 6 | real | spo | ions | pas | compa |
| 7 | barcelona | goat | ch_ions | avec | va |
| 8 | paris | psg | ucl | du | user |
| 9 | real_madrid | bar | ballon | qui | jugador |
| 10 | mbapp | bigger | world | je | mejor |

While English dominates this community, French and Spanish speakers are also present (t_4 and t_5). We can deduce that the first topic revolves around Messi’s former team (FC Barcelona), the second concerns Messi’s Instagram post, and the third centers on Messi’s accomplishments. (Truncated tokens such as spo and ch_ions are side effects of the simple text scrubbing: stripping every “rt” turns sport into spo, and stripping “amp” splits champions into ch and ions.)

Having identified the topics, we can determine the most discussed one. Begin by concatenating tweets by user (again, within the largest community):

# Keep only the tweets of the largest community and concatenate them by user:
tweets.df.com1 = tweets.df[which(tweets.df$screen_name %in% community1),]
users_text <- ddply(tweets.df.com1,
                    ~screen_name,
                    summarise,
                    text = paste(text, collapse = " "))

Next, sanitize the text and generate the DTM. Then, call the predict function, supplying our LDA model and the new DTM as arguments. We use the gibbs method, capped at 100 iterations, to keep prediction time manageable given the substantial volume of text:

users_text$text <- sanitize_text(users_text$text) # Clean the concatenated text
# Remove English stopwords:
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
users_text$text = stringr::str_replace_all(users_text$text, stopwords_regex, '')

# Build a DTM with one document per user and predict each user's topic mix:
dtm.users.com1 <- CreateDtm(users_text$text,
                            doc_names = users_text$screen_name,
                            ngram_window = c(1, 2))
com1.users.topics = predict(best_model, dtm.users.com1, method = "gibbs", iterations = 100)

The com1.users.topics data frame now contains information on each user’s engagement with each topic:

| Account | t_1 | t_2 | t_3 | t_4 | t_5 | […] |
|---|---|---|---|---|---|---|
| ___99th | 0.02716049 | 0.86666666 | 0.00246913 | 0.00246913 | 0.00246913 | |
| Boss__ | 0.05185185 | 0.84197530 | 0.00246913 | 0.00246913 | 0.00246913 | |
| Memphis | 0.00327868 | 0.00327868 | 0.03606557 | 0.00327868 | 0.00327868 | |
| ___Alex1 | 0.00952380 | 0.00952380 | 0.00952380 | 0.00952380 | 0.00952380 | |
| […] | | | | | | |

Finally, leverage this information to create a new node attribute in the graph indicating the topic each user discussed most, then generate a new GML file for visualization in Gephi:

# Get the subgraph of the first community:
net.com1 = induced_subgraph(net,community1)
# Estimate the topic with the max score for each user:
com1.users.maxtopic = cbind(users_text$screen_name,
                            colnames(com1.users.topics)[apply(com1.users.topics,
                                                              1,
                                                              which.max)])
# Order the users topic data frame by the users' order in the graph:
com1.users.maxtopic = com1.users.maxtopic[match(V(net.com1)$name,
                                          com1.users.maxtopic[,1]),]
# Add a node attribute holding the topic each user discussed most:
V(net.com1)$topic = com1.users.maxtopic[,2]
# Write the annotated graph to a new GML file:
write_graph(simplify(net.com1), "messi_graph_topics.gml", format = "gml")
A colored node graph generated using Gephi, showing ESPNFC as the highest-ranking user by PageRank centrality. ESPNFC is located near the bottom of the image, with many purple nodes grouped below it.
Largest community of Messi discussion colored by topic and with users highlighted by PageRank centrality
An image showing the percentage of users highlighted by each color used in the graph, with the purple "t 6" being the most-used color (40.53% of all users in the graph), followed by the green "t 13" at 11.02%, and blue/cyan "t 10" at 9.68%. A gray "NA," in second-to-last position of this list of 11, makes up 2.25%.
Topic labels and percentage of users for each color used in the graph

Summary and Next Steps

In this installment, we built upon our initial exploration by introducing additional criteria for identifying influential users. We also learned to detect and interpret conversation topics and visualize them within the network.

The next article delves further into analyzing clustered social media data, providing users with valuable tools for deeper exploration.

Also in This Series:

  • Social Network Analysis in R and Gephi: Digging Into Twitter

  • Mining for Twitter Clusters: Social Network Analysis With R and Gephi

Licensed under CC BY-NC-SA 4.0