Twitter Clustering Analysis

Social media outlets such as Twitter can provide insight into the conversations happening around specific topics. By looking at what is being tweeted and by whom, it can be possible to better understand attitudes about important topics which can impact health. Twitter can also be used as a tool to better understand the effectiveness of awareness campaigns.
There are some caveats, however: Twitter data has limitations. The first is that it is inherently biased, in that it only shows conversations among people who use Twitter, who may or may not be representative of the population of interest. There are also geographic biases: while use is growing throughout the world, there are still areas where it is limited.
Regardless, there are still valuable insights to be gained. A clustering analysis such as k-means can be useful in understanding the influence of a specific user. High-influence posters can, for better or worse, spread information to a considerable number of Twitter users. This analysis is based on work from https://www.r-bloggers.com/cluster-your-twitter-data-with-r-and-k-means/

The following R script queries Twitter for the followers and friends (the accounts followed) of a Twitter account, then clusters those accounts based on how many followers and friends each of them has.

Requirements

In order to run this script, there are some preparatory steps:

  1. You must have registered for a developer account at Twitter. Details on how to do that can be found at: http://docs.inboundnow.com/guide/create-twitter-application/
  2. Five packages must be installed: twitteR, rCharts, httr, ROAuth, RCurl

In [ ]:
options(jupyter.plot_mimetypes = c("text/plain", "image/png" ))
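library(twitteR)  #Twitter API client
library(rCharts)  #interactive charts
library(httr)
library(ROAuth)
library(RCurl)

#Point RCurl at the CA certificate file bundled with the package for HTTPS requests
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))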

#Parameters for the request to Twitter
#Note: only the four values passed to setup_twitter_oauth() below are used for authentication
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
accessToken <- "YOUR ACCESS TOKEN GOES HERE"
accessTokenSecret <- "YOUR ACCESS TOKEN SECRET GOES HERE"
authURL <- "https://api.twitter.com/oauth/authorize"
apiKey <- "YOUR API KEY GOES HERE"
apiSecret <- "YOUR API SECRET GOES HERE"
#Twitter refers to the next two values as the consumer key and consumer secret
CUSTOMER_KEY <- "YOUR CUSTOMER KEY GOES HERE"
CUSTOMER_SECRET <- "YOUR CUSTOMER SECRET GOES HERE"

#Authenticate this R session with the consumer key/secret and access token/secret
setup_twitter_oauth(CUSTOMER_KEY, CUSTOMER_SECRET, accessToken, accessTokenSecret)

Now that the parameters have been set, we can query Twitter and get results.
Warning: Twitter may apply rate limits, which is most likely to affect queries for particularly popular users.
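
One way to cope with a rate-limited request is to wait and try again. The sketch below is only an illustration: `withRetry` and `flakyRequest` are hypothetical names that are not part of twitteR, and the failing request is simulated so the example runs without Twitter credentials.

```r
# Hypothetical helper (not part of twitteR): call fn() and, if it raises an
# error, pause and try again, up to maxAttempts times.
withRetry <- function(fn, maxAttempts = 3, waitSeconds = 1) {
  for (attempt in seq_len(maxAttempts)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt < maxAttempts) {
      message("Attempt ", attempt, " failed: ", conditionMessage(result))
      Sys.sleep(waitSeconds)
    }
  }
  stop(result)  # give up and re-raise the last error
}

# Simulated stand-in for a rate-limited request: fails twice, then succeeds
calls <- 0
flakyRequest <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("rate limit exceeded")
  c("follower_a", "follower_b")
}

followers <- withRetry(flakyRequest, maxAttempts = 5, waitSeconds = 0)
print(followers)
```

With a real account, the function passed in could wrap a call such as `user$getFollowers()`, with a much longer `waitSeconds`; the appropriate wait depends on the limits Twitter applies to the account.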

In [3]:
user <- getUser("MEASURE_EVAL") #Set the username
#print ("Got the Username")

userFriends <- user$getFriends()
#print ("Got the User Friends")

userFollowers <- user$getFollowers()
#print ("Got the User Followers")

userNeighbors <- union(userFollowers, userFriends) #merge followers and friends
#print ("Merged Followers and Friends")

userNeighbors.df = twListToDF(userNeighbors) #create the dataframe
#print ("made the dataframe")

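#Replace zero counts with 1 so that log() returns 0 rather than -Inf
userNeighbors.df$followersCount[userNeighbors.df$followersCount == 0] <- 1
userNeighbors.df$friendsCount[userNeighbors.df$friendsCount == 0] <- 1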

userNeighbors.df$logFollowersCount <-log(userNeighbors.df$followersCount)
#print("Made Follower Count")

userNeighbors.df$logFriendsCount <-log(userNeighbors.df$friendsCount)
#print("Made Friend Count")

kObject.log <- data.frame(userNeighbors.df$logFriendsCount,userNeighbors.df$logFollowersCount)
#print("made kObject.log")
mydata <- kObject.log
#print("made my data")

#Total sum of squares, i.e. the within-cluster sum of squares when all points form one cluster
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
print("Data collection complete")
[1] "Data collection complete"
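
The `wss` value computed above is the within-cluster sum of squares for a single cluster. A common continuation, which the script itself does not include, is to compute the same quantity for larger numbers of clusters and look for an "elbow" in the curve when choosing how many clusters to use. The sketch below assumes synthetic data in place of the Twitter results, and `maxK` is a hypothetical parameter.

```r
# Synthetic stand-in for the log friend/follower counts (not real Twitter data)
set.seed(42)
mydata <- data.frame(
  logFriendsCount   = c(rnorm(50, 7, 0.5), rnorm(50, 5, 0.5), rnorm(50, 4, 0.5)),
  logFollowersCount = c(rnorm(50, 3, 0.5), rnorm(50, 5, 0.5), rnorm(50, 9, 0.5))
)

maxK <- 8  # hypothetical upper bound on the number of clusters to try

# k = 1: total sum of squares, the same expression the script uses
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))

# k = 2..maxK: total within-cluster sum of squares reported by kmeans()
for (k in 2:maxK) {
  wss[k] <- kmeans(mydata, centers = k, nstart = 25)$tot.withinss
}

# The curve typically flattens once extra clusters stop explaining much variance
plot(1:maxK, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Within-cluster sum of squares")
```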

Now that data collection is complete, we can cluster the results. We'll use 3 clusters to break down followers into three groups:

  1. Low Influencers: followers who follow a lot of people but are not followed by many others
  2. Moderate Influencers: followers who follow a moderate number of people and are in turn followed by a moderate number
  3. Strong Influencers: followers who are followed by a lot of people but do not follow a lot

The results of this clustering open the door to multiple types of analysis using techniques such as sentiment analysis or network analysis.

  • Which users do strong influencers follow? This can provide insight into who influences the influencers. Tweets from these strong influencers can be an effective way to reach many people with targeted messages for awareness campaigns.
  • What gets moderate influencers to engage? Tweets from these users can be analyzed to determine whether awareness campaigns are effective or to identify topics of conversation that can inform thinking about effective ways to design interventions. For instance, what are people saying about blessers or age-discordant relationships? A critical mass of moderate influencers can be as effective at spreading information as strong influencers.
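
The cluster numbers returned by k-means are arbitrary, so mapping them onto the three groups described above requires inspecting the cluster centers. The following is a minimal sketch of one way to do that, assuming synthetic counts rather than real Twitter data; the names `neighbors`, `influenceLabels` and `influence` are illustrative and do not appear in the script.

```r
# Synthetic friend/follower counts for three groups of accounts (not real data)
set.seed(1)
n <- 50  # accounts per synthetic group
neighbors <- data.frame(
  friendsCount   = round(exp(c(rnorm(n, 7, 0.4), rnorm(n, 5, 0.4), rnorm(n, 4, 0.4)))),
  followersCount = round(exp(c(rnorm(n, 3, 0.4), rnorm(n, 5, 0.4), rnorm(n, 9, 0.4))))
)

# Treat zero counts as 1 so that log() never returns -Inf
neighbors$logFriendsCount   <- log(pmax(neighbors$friendsCount, 1))
neighbors$logFollowersCount <- log(pmax(neighbors$followersCount, 1))

# Same k-means settings as the script: 3 centers, 100 random starts
fit <- kmeans(neighbors[, c("logFriendsCount", "logFollowersCount")],
              centers = 3, iter.max = 10, nstart = 100)

# Rank the centers by log(followers) - log(friends): the lowest value is the
# group that follows many but is followed by few, the highest is the opposite
centers <- as.data.frame(fit$centers)
ratio <- centers$logFollowersCount - centers$logFriendsCount
influenceLabels <- c("Low", "Moderate", "Strong")[rank(ratio)]

# Attach a readable label to every account
neighbors$influence <- factor(influenceLabels[fit$cluster],
                              levels = c("Low", "Moderate", "Strong"))
print(table(neighbors$influence))
```

Ranking by the difference of the logs is a design choice made for this sketch; with real data, the centers may not separate as cleanly and are worth inspecting directly via `fit$centers`.
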
In [5]:
#Run the K Means algorithm, specifying 3 centers
user2Means.log <- kmeans(kObject.log, centers=3, iter.max=10, nstart=100)

#Add the vector of assigned clusters back to the original data frame as a factor

userNeighbors.df$cluster <- factor(user2Means.log$cluster)

p2 <- nPlot(logFollowersCount ~ logFriendsCount, group = 'cluster', data = userNeighbors.df, type = 'scatterChart')

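#In the formula above, logFriendsCount is on the x axis and logFollowersCount is on the y axis
p2$xAxis(axisLabel = 'Log Friends Count')

p2$yAxis(axisLabel = 'Log Followers Count')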

p2$chart(tooltipContent = "#! function(key, x, y, e){
 return e.point.screenName + ' Followers: ' + e.point.followersCount +' Friends: ' + e.point.friendsCount
} !#")

p2

An example of a Strong Influencer that has many followers but follows few

Gates.png

An example of a Strong Influencer who has many followers and also follows many others