Analyzing Social Networks on Twitter using R and Gephi

This article is the first in a series of three that explores Twitter cluster analyses using R and Gephi. In the next installment, we will delve deeper into the analysis initiated today, aiming to pinpoint key players and comprehend topic dissemination. The final part will utilize cluster analysis to extract insights from polarized discussions surrounding US politics.

Social network analysis originated in 1934 with Jacob Levy Moreno’s creation of sociograms, visual representations of social connections. In essence, a sociogram is a graphical representation where each point signifies an individual, and connecting lines depict interactions between them. Moreno employed sociograms to examine the dynamics of small groups.

Why small groups? Because during his time, accessing detailed information about numerous personal interactions was challenging. However, the advent of online platforms like Twitter changed this. Today, anyone can readily obtain substantial Twitter data without charge, paving the way for insightful analyses that enhance our understanding of human behavior and its societal ramifications.

This initial installment of our social network analysis series will guide you through conducting such analyses using the R language for data acquisition and preprocessing, and Gephi for generating compelling visualizations. Gephi is a freely available application specifically designed to visualize various networks. It empowers users to effortlessly customize visualizations based on a range of criteria and attributes.

Acquiring Twitter Data for Social Network Analysis in R

If you haven’t already, set up a Twitter developer account and request Essential access. To download data, create an app within the Twitter Developer Portal. Subsequently, in the Projects & Apps section, choose your app and navigate to the Keys & Tokens tab to generate your credentials. These credentials grant you access to the Twitter API for downloading data.

With your credentials in place, you’re ready to begin. Our analysis will utilize three R libraries:

  1. igraph, responsible for constructing the interaction graph.
  2. tidyverse, used for data preparation.
  3. rtweet, which facilitates communication with the Twitter Dev API.

Installation of these libraries can be done using the install.packages() function in R. We’ll assume you have R and RStudio installed, along with a basic grasp of their functionality.
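As a minimal sketch of that installation step, the helper below installs only the packages that are not yet present. The name install_missing is hypothetical (it is not part of any package), and the call itself is left commented out so nothing is downloaded until you run it yourself.

```r
# Hypothetical helper: install only the packages that are not already available
install_missing <- function(pkgs) {
  missing <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing) > 0) {
    install.packages(missing)
  }
  invisible(missing)  # return the names that had to be installed
}

# Uncomment to install the three libraries used in this article
# install_missing(c("igraph", "tidyverse", "rtweet"))
```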

Our demonstration focuses on analyzing the vibrant online discourse surrounding renowned Argentine footballer Lionel Messi during his inaugural week with Paris Saint-Germain (PSG) Football Club. Keep in mind that the free Twitter API restricts data retrieval to seven days preceding the current date. While replicating our exact dataset isn’t possible, you can apply the process to current discussions.

Let’s start with data acquisition. We’ll load the necessary libraries, create an authorization token using your credentials, and define the download criteria.

The following code snippet illustrates these three steps:

## Load libraries
library(rtweet)
library(igraph)
library(tidyverse)

## Create Twitter token (replace each <...> placeholder, keeping the quotes)
token <- create_token(
  app = "<YOUR_APP_NAME>",
  consumer_key = "<YOUR_CONSUMER_KEY>",
  consumer_secret = "<YOUR_CONSUMER_SECRET>",
  access_token = "<YOUR_ACCESS_TOKEN>",
  access_secret = "<YOUR_ACCESS_SECRET>")

## Download Tweets
tweets.df <- search_tweets("messi", n = 250000, token = token,
                           retryonratelimit = TRUE, until = "2021-08-13")

## Save R context image
save.image("filename.RData")

Important: Remember to replace all the tags enclosed in <> with the information obtained during the credential creation process.

This code queried the Twitter API, retrieving all tweets (capped at 250,000) containing the keyword “messi” posted between August 8, 2021, and August 13, 2021. We set a limit of 250,000 tweets due to Twitter’s requirement for a quantity value, and because this number provides a substantial dataset for analysis.

Twitter’s download rate is 45,000 tweets per 15 minutes; therefore, retrieving 250,000 tweets took over an hour.
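The arithmetic behind that estimate can be sketched as follows. It assumes, for illustration only, that the first batch of 45,000 tweets is available immediately and that every further batch requires a full 15-minute wait; actual timings depend on network speed and on how the API resets its quota.

```r
tweets_wanted <- 250000   # the n passed to search_tweets()
tweets_per_window <- 45000
window_minutes <- 15

# Number of rate-limit windows needed to reach the cap
windows <- ceiling(tweets_wanted / tweets_per_window)

# Assumption: the first window needs no waiting, each later one a full wait
wait_minutes <- (windows - 1) * window_minutes

windows       # 6
wait_minutes  # 75
```

Under these assumptions the waiting time alone is 75 minutes, which is consistent with the download taking over an hour.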

Finally, we stored all contextual variables in an RData file for easy restoration if we need to close RStudio or restart our machine.
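Restoring the saved context later is done with load(). The sketch below shows the round trip on a tiny stand-in data frame written to a temporary file; it uses save() on one named object rather than save.image() so the example stays self-contained, and the object name example_df is purely illustrative.

```r
# Stand-in for the downloaded tweets
example_df <- data.frame(id = 1:3, text = c("a", "b", "c"))

# Write the object to a temporary .RData file
snapshot <- tempfile(fileext = ".RData")
save(example_df, file = snapshot)

# Simulate a fresh session by removing the object, then restore it
rm(example_df)
load(snapshot)

nrow(example_df)  # 3
```

For the file created in this article, the equivalent call in a new session would be load("filename.RData").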

Constructing the Interaction Graph

Once the download is complete, the tweets.df dataframe will contain our tweets. Each row represents a tweet and each column represents a specific tweet characteristic. Our first step is to use this dataframe to create the interaction graph, where each point symbolizes a user and connecting lines represent interactions like retweets or mentions. With tidyverse and igraph, we can generate this graph with a single pipeline:

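## Create graph
net <- tweets.df %>%
  filter(retweet_count > 0) %>%                 # keep tweets that were retweeted at least once
  select(screen_name, mentions_screen_name) %>% # author and the accounts they mention
  unnest(mentions_screen_name) %>%              # one row per (author, mentioned account) pair
  filter(!is.na(mentions_screen_name)) %>%      # drop tweets without mentions
  graph_from_data_frame()                       # each remaining row becomes a directed edge

Running this pipeline creates a graph, stored in the net variable, ready for analysis. For instance, to determine the number of nodes and edges: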

summary(net) # IGRAPH fd955b4 DN-- 138963 217362 --

Our sample data contains roughly 139,000 nodes and 217,000 edges, a considerable graph. The DN flags in the summary indicate that the graph is directed and that its vertices are named. While R offers visualization possibilities, they tend to be computationally intensive at this scale and lack the visual appeal of Gephi. Therefore, we’ll proceed with Gephi for visualization.
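To see what the graph-building pipeline does to each row, here is a small sketch on a hand-made table. The three users and their mentions are invented for illustration; only the column names match the ones used on tweets.df.

```r
library(igraph)
library(dplyr)
library(tidyr)

# Toy stand-in for tweets.df: mentions are stored as a list column
toy <- tibble(
  screen_name = c("ana", "ben", "cara"),
  retweet_count = c(3, 0, 5),
  mentions_screen_name = list(c("ben", "cara"), "ana", NA_character_)
)

toy_net <- toy %>%
  filter(retweet_count > 0) %>%                 # drops ben's tweet (no retweets)
  select(screen_name, mentions_screen_name) %>%
  unnest(mentions_screen_name) %>%              # ana -> ben, ana -> cara, cara -> NA
  filter(!is.na(mentions_screen_name)) %>%      # drops cara's tweet (no mentions)
  graph_from_data_frame()

vcount(toy_net)       # 3 users
ecount(toy_net)       # 2 interactions
is_directed(toy_net)  # TRUE
```

Each edge points from the author of a tweet to an account that tweet mentions, which is why the resulting graph is directed.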

Visualizing the Graph in Gephi

First, we need a file format compatible with Gephi. This is straightforward, as we can generate a .gml file using the write_graph function. Before writing, simplify() removes duplicate edges and self-loops from the graph:

write_graph(simplify(net), "messi_network.gml", format = "gml")
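The effect of simplify() before the export can be checked on a small example. The edge list below is made up and the file is written to a temporary path, so this sketch does not touch messi_network.gml.

```r
library(igraph)

# Toy edge list with a duplicated edge (a -> b twice) and a self-loop (b -> b)
edges <- data.frame(from = c("a", "a", "b", "b"),
                    to   = c("b", "b", "c", "b"))
g <- graph_from_data_frame(edges)
ecount(g)  # 4

# simplify() collapses duplicate edges and removes self-loops
s <- simplify(g)
ecount(s)  # 2

# Export in GML format, which Gephi can open
out <- tempfile(fileext = ".gml")
write_graph(s, out, format = "gml")
file.exists(out)  # TRUE
```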

Next, open Gephi, navigate to “Open graph file,” locate and open the messi_network.gml file. You’ll see a window summarizing the graph information; click Accept. The following will be displayed:

A screenshot showing Gephi's user interface from which users can open a new graph file.
Opening a new graph file in Gephi

Admittedly, this isn’t very informative yet, as we haven’t applied a layout.

Network Layout

In graphs containing thousands of nodes and edges, arranging the nodes effectively is crucial. Layouts serve this purpose by strategically positioning nodes based on predefined criteria.

For our social network analysis tutorial, we’ll employ the ForceAtlas2 layout, a common choice for such analyses. It simulates attractive and repulsive forces between nodes. Connected nodes are placed closer together, while unconnected nodes are farther apart. This approach effectively reveals communities within the graph, as users belonging to the same community will be grouped together.
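For intuition only, the same force-directed idea can be tried in R on a small graph. Note that igraph does not provide ForceAtlas2; this sketch swaps in the Fruchterman-Reingold layout (layout_with_fr), a different force-directed algorithm, and the graph of two small groups joined by one edge is invented for the example.

```r
library(igraph)

set.seed(42)  # the layout starts from random positions

# Two tightly connected groups of four nodes, joined by a single edge
g <- disjoint_union(make_full_graph(4), make_full_graph(4))
g <- add_edges(g, c(1, 5))

# Fruchterman-Reingold: one x/y coordinate pair per node
coords <- layout_with_fr(g)
dim(coords)  # 8 rows (nodes), 2 columns (x, y)

# Uncomment to draw the graph with these positions
# plot(g, layout = coords)
```

When plotted, the two groups typically appear as two separate clumps, which is the effect the Gephi layout produces at a much larger scale.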

To apply ForceAtlas2 in Gephi, go to the Layout window (bottom left), select ForceAtlas 2, and click Run. You’ll observe the nodes dynamically repositioning themselves, forming numerous “clouds.” After a short period, a stable pattern will emerge, at which point you can click Stop (note that automatic stopping might take longer).

As this algorithm involves randomness, each run will produce slightly different outputs. Your result should resemble this:

An image showing the output of the ForceAtlas2 layout in black and white, resulting in a graph with no colors or shades of gray.
Monochrome Gephi output using the ForceAtlas2 layout

The graph is becoming visually engaging. Let’s enhance it further with color.

Identifying Communities

Nodes can be colored based on various criteria; a standard method is by community. For instance, if our graph contains four communities, we’ll use four colors. This color-coding facilitates the understanding of group interactions within your data.

Before coloring, we need to identify the communities. In Gephi, under the Statistics tab, click Run next to Modularity. This applies the Louvain community detection algorithm, which is widely used because it is fast enough for graphs of this size. In the pop-up window, click Accept. Another window will appear with a plot of the size distribution of the detected communities. This process adds a new attribute called “Modularity Class” to each node, indicating the community to which the user belongs.
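The same kind of community detection can be sketched in R with igraph's cluster_louvain(). This is an illustration on an invented toy graph, not a reproduction of Gephi's Modularity run: Gephi's settings may differ, and cluster_louvain() only accepts undirected graphs, so the directed net graph would first have to be converted.

```r
library(igraph)

set.seed(1)  # Louvain visits nodes in random order

# Two groups of five fully connected users, joined by a single edge
g <- disjoint_union(make_full_graph(5), make_full_graph(5))
g <- add_edges(g, c(1, 6))

comm <- cluster_louvain(g)

length(comm)                     # number of communities found: 2
membership(comm)                 # community id per node, like "Modularity Class"
100 * sizes(comm) / vcount(g)    # community sizes as a percentage of all nodes
modularity(comm)                 # quality score of the partition
```

The percentage line mirrors the community shares that Gephi reports once the partition is applied.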

Now we can color the graph according to these clusters. Under the Appearance tab, select Nodes, then Color, then Partition, choose Modularity Class as the attribute, and click Apply.

A screenshot displaying the "Appearance" tab in a Gephi workspace. The image shows a range of colors used in the graph.
Using Gephi’s Appearance to add color

This view reveals the size (as a percentage of users) of each community. In our example, the dominant communities (violet and green) comprise 11.34% and 9.29% of the total user population, respectively.

With the current layout and color scheme, the graph will look like this:

An image of a colored graph. The shape is identical to the previous monochrome graph, but colors help identify specific communities, with the largest community (violet) in the lower-left corner and the second-largest community (green) in the upper-right corner. Between them, smaller communities are represented by other colors, including cyan, orange, red, and black.
A colored graph allows us to easily identify different communities.

Identifying Influential Twitter Users

Lastly, let’s pinpoint key participants in the discussion, perhaps to discern their community affiliations. User influence can be gauged using various metrics; one such metric is degree, the number of connections a user has in the graph. Gephi’s Degree property counts edges in both directions; the incoming ones (in-degree) show how many other users retweeted or mentioned a particular user.
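Degree can also be computed directly in R. The sketch below uses an invented edge list in which, as in the interaction graph built earlier, each edge points from the author of a tweet to the account it mentions.

```r
library(igraph)

# Three fans mention "club"; fan1 also mentions fan2
edges <- data.frame(from = c("fan1", "fan2", "fan3", "fan1"),
                    to   = c("club", "club", "club", "fan2"))
g <- graph_from_data_frame(edges)

in_deg  <- degree(g, mode = "in")   # times each account was mentioned
out_deg <- degree(g, mode = "out")  # mentions each account made
total   <- degree(g, mode = "all")  # both directions combined

in_deg[["club"]]                # 3
total[["fan1"]]                 # 2
names(which.max(total))         # "club"
sort(total, decreasing = TRUE)  # ranking from most to least connected
```

Here "club" never mentions anyone, yet it ranks first because three accounts point to it, which is the pattern expected of an influential account.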

To visually emphasize users with high interaction levels, we’ll adjust node size based on the Degree property:

A screenshot showing how to change the Degree property in Gephi, under the same Appearance tab mentioned in the previous figures. The minimum size is set to 0.1 and the maximum size is set to 10.
Changing the Degree property in Gephi

The graph will now represent influential users as larger circles:

An image of a colored graph similar to the previous one but with the addition of circles representing influencer nodes. Each color group features a handful of such nodes.
Colored output showing influencers as larger nodes

Having identified highly interactive users, let’s reveal their names. Click the black arrow in the bottom bar of the screen:

A screenshot of the Gephi workspace, showing the black arrow in the lower-right corner of the UI.
Accessing label configuration in Gephi

Select “Labels” and then “Configuration.” In the pop-up window, check “Name” and click Accept. Next, check “Nodes.” Small black lines, representing usernames, will appear on the graph. However, we only want to display the most significant ones.

To achieve this, we’ll again adjust label size based on node degree. In the same window used for node size, increase the minimum size from 0.1 to 10 and the maximum size from 10 to 300.

Adding names makes the graph significantly more insightful, as it now illustrates how different communities engage with influencers:

An image of a colored graph with circles representing significant users, with the names of the most important users overlaid on top. The size of the text corresponds to the size (influence) of each user, with some of the biggest being ESPNFC in purple, TrollFootball in gray, and PSG_inside in pink.
Adding names allows us to see how different communities interact with influencers.

We’ve gained a deeper understanding of this Twitter discussion. For instance, the presence of accounts like mundodabola and neymarjrdepre within the green community points to its Brazilian user base. The orange and gray communities include Spanish-speaking users like sc_espn and InvictosSomos. The gray and black communities, in particular, seem predominantly Spanish-speaking, as evidenced by users like IbaiOut, LaScaloneta, and the popular streamer IbaiLlanos. Lastly, the violet and red communities appear English-speaking, featuring accounts like ESPNFC and brfootball.

We can now better grasp why these communities, determined through graph computation, align with sociological factors: they speak different languages. While all of them are discussing Messi and his new team, it’s natural for Spanish speakers to interact more amongst themselves than with Portuguese or English speakers. Furthermore, we observe that even within the Spanish-speaking gray and orange communities, different perspectives exist. The gray community’s more humorous approach likely explains why its members interact more with each other than with official football or journalist accounts.

Harnessing the Power of R and Gephi

While R’s ggplot2 library offers an alternative for graph visualization, it’s arguably more limited than Gephi in this context. Gephi provides an interactive environment that’s easier to configure and yields clearer visualizations than static ggplot2 output.

In the subsequent parts of this series, we’ll delve deeper into this analysis. We’ll perform topic modeling and sentiment analysis to understand user discussion topics and their sentiment (positive or negative). Additionally, we’ll conduct further graph analysis to examine Twitter’s most influential users.

You can apply these steps to analyze new Twitter conversations and glean valuable insights from your own plotted graphs.

Also in This Series:

  • Understanding Twitter Dynamics With R and Gephi: Text Analysis and Centrality

  • Mining for Twitter Clusters: Social Network Analysis With R and Gephi

Licensed under CC BY-NC-SA 4.0