Extracting Data Clusters: Analyzing Social Networks Using R and Gephi

This article concludes our three-part series exploring X (formerly Twitter) cluster analyses using R and Gephi. In the first installment, we analyzed heated online discussion surrounding famed Argentine footballer Lionel Messi; part 2 deepened the analysis by focusing on key players and topic dissemination. While Twitter rebranded as X in July 2023, replacing “tweet” and “retweet” with other terms, the underlying data science principles and strategies we cover remain relevant.

Political discourse often leads to polarization. When distinct communities with widely divergent viewpoints emerge online, their Twitter activity tends to cluster tightly around two main groups, with minimal interaction between them. This phenomenon exemplifies homophily, the human inclination to engage primarily with like-minded individuals.

Our previous article delved into computational methods for analyzing Twitter data, using Gephi (https://gephi.org/) to generate insightful visualizations. Now, let’s employ cluster analysis to draw meaningful conclusions from these techniques and pinpoint the most informative aspects of social data.

To highlight clustering, we’ll shift our focus to US political data from May 10-20, 2020, employing the same Twitter data acquisition process as before but targeting the then-president’s name instead of “Messi.”

The following visualization represents the interaction graph of this political conversation. Similar to our first article, we used Gephi’s ForceAtlas2 layout to plot the data and color-coded the communities identified by the Louvain algorithm.

A non-identified binary data cluster interaction graph generated within Gephi — Data Cluster Interaction Graph
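To make the community detection step more concrete, here is a minimal sketch of Louvain on a toy graph. The graph is invented for illustration (two tightly knit groups joined by a single edge) and is not the Twitter network; it only assumes that the igraph package is installed.

```r
library(igraph)

# Toy graph (made up for illustration): two 5-node cliques joined by one edge
g <- make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1, 6))

# Louvain groups nodes so that most edges fall inside communities
comm <- cluster_louvain(g)

sizes(comm)        # number of nodes in each detected community
membership(comm)   # community label assigned to each node
modularity(comm)   # higher values mean more clearly separated communities
```

On a polarized conversation, we would expect a similar picture at scale: a few large communities with comparatively few edges between them.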

Let’s delve deeper into the nuances of this data.

Unpacking Cluster Composition

As emphasized throughout this series, we can characterize clusters by their influential figures. However, Twitter offers a wealth of additional data for analysis. Take, for instance, the user description field, a space for brief user biographies. Word clouds enable us to uncover common self-descriptors within these descriptions. The code snippet below generates two word clouds, one for each cluster, based on word frequency in user descriptions, revealing how these self-portrayals can be collectively informative:

# Load necessary libraries
library(rtweet)
library(igraph)
library(tidyverse)
library(wordcloud)
library(NLP)
library(tm)
library(RColorBrewer)

# First, identify the communities through Louvain
# (net is the igraph interaction graph, assumed to be built as in the earlier articles)
my.com.fast = cluster_louvain(as.undirected(simplify(net)), resolution=0.4)

# Next, get the users that belong to the two biggest clusters
largestCommunities <- order(sizes(my.com.fast), decreasing=TRUE)[1:2]
community1 <- names(which(membership(my.com.fast) == largestCommunities[1]))
community2 <- names(which(membership(my.com.fast) == largestCommunities[2]))

# Now, split the tweets’ data frames by their communities
# (i.e., 'republicans' and 'democrats')

republicans = tweets.df[which(tweets.df$screen_name %in% community1),]
democrats = tweets.df[which(tweets.df$screen_name %in% community2),]

# Next, given that we have one row per tweet and we want to analyze users, 
# let’s keep only one row by user
accounts_r = republicans[!duplicated(republicans[,c('screen_name')]),]
accounts_d = democrats[!duplicated(democrats[,c('screen_name')]),]

# Finally, plot the word clouds of the user’s descriptions by cluster

## Generate the Republican word cloud
## First, convert descriptions to tm corpus
corpus <- Corpus(VectorSource(unique(accounts_r$description)))

### Remove English stop words
corpus <- tm_map(corpus, removeWords, stopwords("en"))

### Remove numbers because they are not meaningful at this step
corpus <- tm_map(corpus, removeNumbers)

### Plot the word cloud showing a maximum of 30 words
### Also, filter out words that appear only once
pal <- brewer.pal(8, "Dark2")
wordcloud(corpus, min.freq=2, max.words = 30, random.order = TRUE, col = pal)

## Generate the Democratic word cloud
## Same steps as above: build the corpus, remove stop words and numbers, plot
corpus <- Corpus(VectorSource(unique(accounts_d$description)))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removeNumbers)
pal <- brewer.pal(8, "Dark2")
wordcloud(corpus, min.freq=2, max.words = 30, random.order = TRUE, col = pal)
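A word cloud is a picture of a frequency table, so it can help to look at the counts directly. The sketch below uses base R only, with made-up descriptions and a tiny hand-picked stop-word list (both are assumptions for illustration; the code above relies on tm's much longer list).

```r
# Toy descriptions (made up for illustration, not from the data set)
descriptions <- c(
  "Proud mom and teacher",
  "Teacher, runner, proud dog owner",
  "Retired teacher. Dog lover"
)

# A tiny hand-picked stop-word list; tm's stopwords("en") is far longer
stop_words <- c("and", "the", "a", "of")

# Lowercase, replace anything that is not a letter or space, split on whitespace
clean <- gsub("[^a-z ]", " ", tolower(descriptions))
words <- unlist(strsplit(clean, "\\s+"))
words <- words[nchar(words) > 0 & !(words %in% stop_words)]

# Count occurrences and keep words that appear at least twice
freq <- sort(table(words), decreasing = TRUE)
top_terms <- freq[freq >= 2]
top_terms
```

The `freq >= 2` filter plays the same role as `min.freq=2` in the `wordcloud()` calls: words that appear only once are dropped.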

Past US election data underscores the strong geographical segregation of voters (https://www.nytimes.com/interactive/2021/upshot/2020-election-map.html). Building on our identity analysis, let’s examine the place_name field, where users often indicate their location. The following R code generates word clouds based on this field:

# Convert place names to tm corpus
corpus <- Corpus(VectorSource(accounts_d[!is.na(accounts_d$place_name),]$place_name))

# Remove English stop words
corpus <- tm_map(corpus, removeWords, stopwords("en"))

# Remove numbers
corpus <- tm_map(corpus, removeNumbers)

# Plot
pal <- brewer.pal(8, "Dark2")
wordcloud(corpus, min.freq=2, max.words = 30, random.order = TRUE, col = pal)

## Do the same for accounts_r

The RStudio-generated word clouds for each data cluster — Word Clouds

While some locations might appear in both word clouds due to the presence of both Republican and Democratic voters in most areas, certain states like Texas, Colorado, Oklahoma, and Indiana exhibit a strong Republican affiliation. Conversely, cities like New York, San Francisco, and Philadelphia show a strong Democratic correlation.
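One way to put numbers behind such a comparison is to tabulate places by cluster and convert the counts to within-cluster shares. This is a rough sketch on invented rows; the `cluster` column is a hypothetical label that does not exist in the original data frames.

```r
# Toy accounts (made up for illustration): one row per user
accounts <- data.frame(
  cluster    = c("R", "R", "R", "D", "D", "D", "D"),
  place_name = c("Texas", "Texas", "New York", "New York", "New York", "Texas", NA),
  stringsAsFactors = FALSE
)

# Drop users without a location, then count places per cluster
located <- accounts[!is.na(accounts$place_name), ]
counts <- table(located$place_name, located$cluster)

# Share of each place within its cluster (each column sums to 1)
shares <- prop.table(counts, margin = 2)
round(shares, 2)
```

Places whose share differs sharply between the two columns are the ones that stand out in only one of the word clouds.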

Deciphering User Behavior

Shifting our focus to user behavior, let’s investigate the distribution of account creation dates within each cluster. A uniform distribution would suggest no correlation between creation date and cluster affiliation.

The histogram below illustrates this distribution:

# First, format the account creation field so that it is read as a Date
## Note that we use the accounts_r and accounts_d data frames because we want to focus on unique users and not distort the plot with the number of tweets each user has submitted

accounts_r$date_account <- as.Date(format(as.POSIXct(accounts_r$account_created_at,format='%Y-%m-%d %H:%M:%S'),format='%Y-%m-%d'))

# Now we plot the histogram (one bar per creation date)
ggplot(accounts_r, aes(date_account)) + geom_bar() + scale_x_date(date_breaks = "1 year", date_labels = "%b %Y")

## Do the same for accounts_d

A histogram generated with RStudio showing the number of Republican users created for each date within the data set — Number of Republican Users Created by Date

A histogram generated with RStudio showing the number of Democrat users created for each date within the data set — Number of Democratic Users Created by Date

Clearly, Republican and Democratic users are not uniformly distributed. Both groups experienced surges in new accounts in January 2009 and January 2017, coinciding with presidential inaugurations following the November elections of the preceding years. This pattern suggests a potential link between proximity to these events and heightened political engagement, aligning with our focus on political tweets.
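Because the two clusters differ in size, raw counts are hard to compare side by side. A possible complement to the histograms, sketched here on invented dates with a hypothetical `cluster` column, is to tabulate accounts by creation year and express each year as a share of its cluster.

```r
# Toy creation dates (made up for illustration)
accounts <- data.frame(
  cluster = c("R", "R", "R", "R", "D", "D", "D", "D"),
  created = as.Date(c("2009-01-20", "2017-01-15", "2019-08-01", "2020-02-10",
                      "2009-01-25", "2012-06-30", "2017-01-22", "2019-11-05"))
)

# Extract the creation year and count accounts per year and cluster
accounts$year <- as.integer(format(accounts$created, "%Y"))
by_year <- table(accounts$year, accounts$cluster)

# Convert counts to within-cluster shares so clusters of different sizes are comparable
year_shares <- prop.table(by_year, margin = 2)
year_shares
```

Years in which one column is clearly larger than the other are the ones where account creation and cluster membership appear to be related.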

Intriguingly, the most significant peak for Republicans arises in mid-2019, culminating in early 2020. Could this behavioral shift reflect digital habits shaped by the pandemic?

Democrats also saw a spike during this period, albeit less pronounced. Perhaps Republican supporters exhibited a larger surge due to stronger sentiments regarding COVID lockdowns? Delving into political science knowledge, theories, and findings would be necessary to formulate more robust hypotheses. Nevertheless, this data reveals fascinating trends worthy of political analysis.

Another avenue for comparing behavior is analyzing how users retweet and reply. Retweets amplify messages, while replies contribute to specific conversations or debates. A high reply count often indicates a tweet’s contentious, unpopular, or divisive nature, while favoriting signifies agreement. Let’s examine the ratio measure between tweet favorites and replies.

Given the principle of homophily, we’d anticipate users to predominantly retweet those within their community. We can verify this using R:

# Keep only the rows that are retweets, separately for each side
rt_d = democrats[which(!is.na(democrats$retweet_screen_name)),]
rt_r = republicans[which(!is.na(republicans$retweet_screen_name)),]

# Retweets from democrats to republicans
rt_d_unique = rt_d[!duplicated(rt_d[,c('retweet_screen_name')]),]
rt_dem_to_rep = dim(rt_d_unique[which(rt_d_unique$retweet_screen_name %in% unique(republicans$screen_name)),])[1]/dim(rt_d_unique)[1]

# Retweets from democrats to democrats

rt_dem_to_dem = dim(rt_d_unique[which(rt_d_unique$retweet_screen_name %in% unique(democrats$screen_name)),])[1]/dim(rt_d_unique)[1]

# The remainder
rest = 1 - rt_dem_to_dem - rt_dem_to_rep

# Create a dataframe to make the plot
data <- data.frame(
  category=c( "Democrats","Republicans","Others"),
  count=c(round(rt_dem_to_dem*100,1),round(rt_dem_to_rep*100,1),round(rest*100,1))
)
 
# Compute percentages
data$fraction <- data$count / sum(data$count)

# Compute the cumulative percentages (top of each rectangle)
data$ymax <- cumsum(data$fraction)

# Compute the bottom of each rectangle
data$ymin <- c(0, head(data$ymax, n=-1))

# Compute label position
data$labelPosition <- (data$ymax + data$ymin) / 2

# Compute a good label
data$label <- paste0(data$category, "\n ", data$count)

# Make the plot

# scale_*_identity() makes ggplot use the literal color names given in aes()
ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=c('red','blue','green'))) +
  geom_rect() +
  geom_text( x=1, aes(y=labelPosition, label=label, color=c('red','blue','green')), size=6) + # x here controls label position (inner / outer)
  scale_fill_identity() +
  scale_color_identity() +
  coord_polar(theta="y") +
  xlim(c(-1, 4)) +
  theme_void() +
  theme(legend.position = "none")

# Do the same for rt_r

Two ring graphs showing which user types retweet tweets from each cluster. Looking at Republican retweets, 76.3% are from other Republicans and 1.3% are from Democrats, while 22.4% are from nonclustered users. When looking at Democratic retweets, 75.3% are from other Democrats and 2.4% are from Republicans, while 22.3% are from nonclustered users. — User Type Retweet Distribution

As expected, Republicans tend to retweet fellow Republicans, and the same holds true for Democrats. Now, let’s observe how party affiliation influences tweet replies.
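Since the same calculation has to be repeated for each side, it could be wrapped in a small helper. The function below is a sketch rather than the code used for the charts: `retweet_shares` is a hypothetical name, and the toy data frames are invented, keeping only the two columns the calculation needs.

```r
# Share of distinct retweeted accounts that fall in the retweeter's own
# cluster, the other cluster, or neither
retweet_shares <- function(own, other) {
  retweeted <- unique(own$retweet_screen_name[!is.na(own$retweet_screen_name)])
  in_own   <- mean(retweeted %in% own$screen_name)
  in_other <- mean(retweeted %in% other$screen_name)
  c(own = in_own, other = in_other, rest = 1 - in_own - in_other)
}

# Toy tweets (made up for illustration)
democrats <- data.frame(
  screen_name         = c("d1", "d2", "d3", "d3"),
  retweet_screen_name = c("d2", "d1", "r1", NA),
  stringsAsFactors = FALSE
)
republicans <- data.frame(
  screen_name         = c("r1", "r2"),
  retweet_screen_name = c("r2", "x9"),
  stringsAsFactors = FALSE
)

retweet_shares(democrats, republicans)
retweet_shares(republicans, democrats)
```

Like the longer version above, this counts each retweeted account once, so the shares describe who gets retweeted rather than how often. If a cluster contains no retweets at all, the shares come out as NaN.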

Two ring graphs showing which user types reply to tweets from each cluster. Looking at replies to Republican tweets, 36.5% are from Republicans and 16.2% are from Democrats, while 47.3% are from nonclustered users. When looking at replies to Democratic tweets, 28% are from Democrats and 20.6% are from Republicans, while 51.5% are from nonclustered users. — User Type Tweet Reply Distribution

A strikingly different pattern emerges. While users are more likely to reply to tweets from those who share their political leaning, retweeting within their group remains significantly more prevalent. Additionally, individuals outside these two primary clusters seem more inclined to engage in replies.
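The reply breakdown can be approximated with a cross-tabulation of who replies to whom. The sketch below is illustrative only: it assumes a `reply_to_screen_name` column such as the one older rtweet versions return, the helper `label_cluster` is a hypothetical name, and the membership lists and replies are invented.

```r
# Map a vector of screen names to cluster labels
label_cluster <- function(users, community1, community2) {
  ifelse(users %in% community1, "Republicans",
         ifelse(users %in% community2, "Democrats", "Others"))
}

# Toy membership lists and replies (made up for illustration)
community1 <- c("r1", "r2")
community2 <- c("d1", "d2")
replies <- data.frame(
  screen_name          = c("r1", "d1", "x1", "d2", "r2", "x2"),
  reply_to_screen_name = c("r2", "r2", "r1", "d1", "d1", "d2"),
  stringsAsFactors = FALSE
)

replies$from <- label_cluster(replies$screen_name, community1, community2)
replies$to   <- label_cluster(replies$reply_to_screen_name, community1, community2)

# Rows: who received the reply; columns: who wrote it; each row sums to 1
reply_shares <- prop.table(table(replies$to, replies$from), margin = 1)
round(reply_shares, 2)
```

Each row of `reply_shares` corresponds to one ring chart: the distribution of repliers for tweets authored by that cluster.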

Employing the topic modeling technique outlined in part two, we can predict the conversation topics users are likely to engage in with in-group versus out-group members.

The following table highlights the two most prominent topics for each interaction type:

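| Democrats to Democrats: Topic 1 | Democrats to Democrats: Topic 2 | Democrats to Republicans: Topic 1 | Democrats to Republicans: Topic 2 | Republicans to Democrats: Topic 1 | Republicans to Democrats: Topic 2 | Republicans to Republicans: Topic 1 | Republicans to Republicans: Topic 2 |
|---|---|---|---|---|---|---|---|
| fake | people | trump | americans | news | biden | people | china |
| putin | covid | news | trump | fake | obama | money | news |
| election | virus | fake | dead | cnn | obamagate | country | people |
| money | taking | lies | people | read | joe | open | media |
| trump | dead | fox | deaths | fake_news | evidence | back | fake |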

Fake news appears to be a hot-button issue in replies. Regardless of their affiliation, users replying to those from the opposing party tended to discuss news sources favored by their respective sides. Within their groups, Democrats focused on Putin, election integrity, and COVID, while Republicans emphasized ending lockdowns and Chinese disinformation.

Polarization in Action

Polarization is a pervasive pattern in social media, extending far beyond the US. We’ve explored how to analyze community identity and behavior within a polarized context. These tools empower anyone to conduct cluster analysis on data sets of interest, uncovering patterns and generating insights that can both inform and inspire further exploration.

Also in This Series:

Social Network Analysis in R and Gephi: Digging Into Twitter
Understanding Twitter Dynamics With R and Gephi: Text Analysis and Centrality