Mapping the City: Twitter Data

Introduction

Our dataset is scraped from Twitter and covers tweets from January 1, 2018 through December 31, 2018. It contains 234,487 observations (rows) and 16 columns. The feature columns include the tweet text and related information such as the nature of the tweet, the number of media items/URLs/hashtags, and user information.
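For reference, a minimal sketch of loading and inspecting the data (the file name twitter_boston_2018.csv is a placeholder; the dataset is referenced later as twitter_merge):

# Hypothetical load; the actual file name and format may differ
twitter_merge <- read.csv("twitter_boston_2018.csv", stringsAsFactors = FALSE)
dim(twitter_merge)   # expect 234487 rows and 16 columns
str(twitter_merge)   # inspect the feature columns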

Goal

For this assignment I am creating a map of a latent variable: local sentiment. Through this mapping I observe the distribution of positive, negative, and neutral sentiments across neighborhoods and draw insights into the kinds of emotions people express over Twitter. Ultimately, it would be interesting to see what share of the tweets generated in each neighborhood is positive, negative, or neutral.

Analysis

library(maps)
library(rgdal)
library(sf)
library(ggmap)
library(sqldf)
library(stringr)
library(dplyr)
library(ggthemes)

shp <- readOGR("C://SHARYU//NEU//Big Data For Cities//Module 7//Base+Layers//NSAs.shp")

## OGR data source with driver: ESRI Shapefile 
## Source: "C:\SHARYU\NEU\Big Data For Cities\Module 7\Base+Layers\NSAs.shp", layer: "NSAs"
## with 69 features
## It has 8 fields
## Integer64 fields read as strings:  OBJECTID
summary(shp@data)

##     OBJECTID        ID              NSA_NAME    SHAPE_area      
##  1      : 1   1      : 1   Allston      : 1   Min.   : 2814697  
##  10     : 1   10     : 1   Ashmont      : 1   1st Qu.: 8810733  
##  11     : 1   11     : 1   Back Bay East: 1   Median :13729602  
##  12     : 1   12     : 1   Back Bay West: 1   Mean   :19665537  
##  13     : 1   13     : 1   Beacon Hill  : 1   3rd Qu.:24150081  
##  14     : 1   14     : 1   Bellevue Hill: 1   Max.   :76705631  
##  (Other):63   (Other):63   (Other)      :63                     
##    SHAPE_len       HOODS_PD_I                   Nbhd   
##  Min.   : 8515   Min.   : 1.000   Mattapan        :12  
##  1st Qu.:16034   1st Qu.: 2.000   Dorchester      : 8  
##  Median :21105   Median : 5.000   Roxbury         : 8  
##  Mean   :26044   Mean   : 6.696   South End       : 7  
##  3rd Qu.:32623   3rd Qu.:11.000   Allston-Brighton: 6  
##  Max.   :90692   Max.   :17.000   Roslindale      : 6  
##                                   (Other)         :22  
##              NbhdCRM  
##  Boston          :13  
##  Mattapan        :12  
##  Dorchester      : 8  
##  Roxbury         : 8  
##  Allston-Brighton: 6  
##  Roslindale      : 6  
##  (Other)         :16
shp_df <- broom::tidy(shp, region = "Nbhd")
lapply(shp_df, class)

## $long
## [1] "numeric"
## 
## $lat
## [1] "numeric"
## 
## $order
## [1] "integer"
## 
## $hole
## [1] "logical"
## 
## $piece
## [1] "character"
## 
## $group
## [1] "character"
## 
## $id
## [1] "character"
# Mean vertex position per neighborhood, used later to place labels
cnames <- aggregate(cbind(long, lat) ~ id, data=shp_df, FUN=mean)

In the above code chunk, I use the shapefile of geographical infrastructure from the BARI Dataverse, which provides latitude and longitude coordinates for the city of Boston. The NSAs.shp file provides the neighborhood boundaries. Based on this, I draw each neighborhood with geom_polygon and label it with geom_text, as in the sketch below.
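As a minimal sketch (the sentiment-filled version appears further below), the bare labeled neighborhood map can be drawn from shp_df and cnames like this:

# Bare neighborhood map: boundaries from shp_df, labels positioned at cnames
base_map <- ggplot() +
  geom_polygon(data = shp_df, aes(x = long, y = lat, group = group),
               fill = "white", colour = "black") +
  geom_text(data = cnames, aes(x = long, y = lat, label = id), size = 3) +
  theme_void()
base_map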

To perform sentiment analysis, I create a function, score.sentiment, that takes as input the text data and lists of positive and negative words. Each word in the text is matched against these lists, and the sentiment of the tweet is determined from the matches. The word lists are fairly exhaustive: the positive list contains around 2,700 words and the negative list approximately 4,500.

The function first cleans the text by removing punctuation, control characters, and digits and converting everything to lowercase, then matches each remaining word against the positive and negative word lists. From these matches a score is generated for each tweet.
These scores determine the corresponding sentiment, which is recorded in a new column: if the score is greater than zero, the sentiment of that tweet is positive; if it is less than zero, negative; otherwise, neutral.

# Sentiment Analysis

score.sentiment <- function(sentences, pos.words, neg.words)
{
  require(plyr)
  require(stringr)
  scores <- laply(sentences, function(sentence, pos.words, neg.words){
    # Clean the text: strip punctuation, control characters, and digits
    sentence <- gsub('[[:punct:]]', "", sentence)
    sentence <- gsub('[[:cntrl:]]', "", sentence)
    sentence <- gsub('\\d+', "", sentence)
    sentence <- tolower(sentence)
    # Split into individual words
    word.list <- str_split(sentence, '\\s+')
    words <- unlist(word.list)
    # TRUE where a word appears in the positive/negative word lists
    pos.matches <- !is.na(match(words, pos.words))
    neg.matches <- !is.na(match(words, neg.words))
    # Net score: count of positive matches minus count of negative matches
    score <- sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words)
  scores.df <- data.frame(score=scores, text=sentences)
  return(scores.df)
}


# Load the positive and negative word lists
pos <- scan('C://SHARYU//NEU//Big Data For Cities//Module 6//positive.txt', what='character', comment.char=';')
neg <- scan('C://SHARYU//NEU//Big Data For Cities//Module 6//negative.txt', what='character', comment.char=';')

# Strip a mis-encoded emoji sequence that appears verbatim in the raw text
twitter_merge$text <- str_replace_all(twitter_merge$text, "í ½í²¸í ½í²°' ", " ")
scores <- score.sentiment(twitter_merge$text, pos, neg)


stat <- scores
# Label each tweet from its score: > 0 positive, < 0 negative, else neutral
stat <- mutate(stat, sentiment = ifelse(score > 0, 'positive', ifelse(score < 0, 'negative', 'neutral')))
stat <- mutate(stat, Neighborhood = twitter_merge$ISD_Nbhd)
plotdata <- stat %>% group_by(sentiment) %>% tally()
plotdata

## # A tibble: 3 x 2
##   sentiment      n
##   <chr>      <int>
## 1 negative   16490
## 2 neutral   105853
## 3 positive   55013

Next, I overlay the sentiments on the map created from the NSA shapefile to give a clear, neighborhood-by-neighborhood picture of sentiment. To do this, I merge the neighborhood shapefile with the twitter data that now carries sentiments. I first standardize the spelling of a few neighborhood names so they match between the two datasets; the merge is then performed on the Neighborhood column in the twitter data and the id column in the shapefile, which represents the neighborhood.

plotdata <- stat %>% group_by(Neighborhood, sentiment) %>% tally()

# Add the shapefile's spellings as factor levels, then recode the mismatched names
levels(plotdata$Neighborhood) <- c(levels(plotdata$Neighborhood),
                                   "Allston-Brighton", "Fenway", "Financial District")

plotdata$Neighborhood[plotdata$Neighborhood == "Allston/Brighton"] <- "Allston-Brighton"
plotdata$Neighborhood[plotdata$Neighborhood == "Fenway/Kenmore"] <- "Fenway"
plotdata$Neighborhood[plotdata$Neighborhood == "Financial District/Downtown"] <- "Financial District"

map_data <- merge(plotdata, shp_df, by.x = "Neighborhood", by.y = "id", all.x = TRUE)
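As a quick sanity check of my own (not part of the original workflow), any neighborhood names that still fail to match the shapefile can be listed:

# Names in the tweet data with no counterpart among the shapefile ids
setdiff(unique(as.character(plotdata$Neighborhood)), unique(shp_df$id))
# character(0) means every neighborhood matched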

Finally, the sentiments are mapped across the neighborhoods. This map broadly shows which parts of Boston the most positive, negative, and neutral sentiments come from.

cbPalette <- c("#999999", "#E69F00", "#56B4E9")
map <- ggplot() +
  geom_polygon(data = map_data,
               aes(x = long, y = lat, group = group, fill = sentiment),
               colour = "black") +
  scale_fill_manual(values = cbPalette) +
  geom_text(data = cnames, aes(x = long, y = lat, label = id), size = 3) +
  theme_void()
map

It can be observed that most of the positive sentiments come from the neighborhoods of Back Bay, South End, Fenway, Roslindale, Hyde Park, the Financial District (which also encompasses Government Center), and Beacon Hill. More of the negative tweets come from Mission Hill, Roxbury, Mattapan, West Roxbury, and Dorchester, whereas Allston, Charlestown, East Boston, and South Boston skew more neutral. This analysis could give rise to hypotheses about living conditions, safety, and other factors in these neighborhoods. It would be interesting to dig into the negatives and positives during this week's city exploration and see how much of this analysis holds up.
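To quantify these patterns rather than reading them off the map, the per-neighborhood sentiment shares mentioned in the goal can be computed directly from stat; a minimal sketch:

# Percentage of positive/negative/neutral tweets within each neighborhood
sentiment_shares <- stat %>%
  group_by(Neighborhood, sentiment) %>%
  tally() %>%
  group_by(Neighborhood) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  arrange(Neighborhood, desc(pct))
head(sentiment_shares)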

In the coming week, I plan to extract the topics of the tweets through topic modeling in R and map them at the neighborhood level. The topics would give an idea of what people talk about most and least on Twitter, and it would be interesting to see how that information corresponds to the physical attributes of the neighborhoods.
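As a rough starting point for that follow-up, a sketch assuming the tm and topicmodels packages (k = 5 and the cleaning steps are placeholder choices, not final decisions):

library(tm)
library(topicmodels)

# Build a document-term matrix from the tweet text
corpus <- VCorpus(VectorSource(twitter_merge$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
dtm <- dtm[slam::row_sums(dtm) > 0, ]  # LDA requires non-empty documents

# Fit a small LDA model and inspect the top terms per topic
lda_fit <- LDA(dtm, k = 5, control = list(seed = 1234))
terms(lda_fit, 10)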

