Introduction
Our dataset was scraped from Twitter and covers tweets posted in 2018, from January 1 through December 31.
It contains 234,487 observations (rows) and 16 columns. The columns hold the tweet text and related information, such as the nature of the tweet, the number of media items, URLs, and hashtags, and user information.
Goal
For this assignment I create a map of a latent variable: local sentiment. With this map I observe how positive, negative, and neutral sentiments are distributed across the neighborhoods and draw insights into the kinds of emotions people express on Twitter. Ultimately, it would be interesting to see how many positive and negative tweets each neighborhood generates, measured as a percentage of that neighborhood's tweets.
Analysis
library(maps)
library(rgdal)
library(sf)
library(ggmap)
library(sqldf)
library(stringr)
library(dplyr)
library(ggthemes)
shp <- readOGR("C://SHARYU//NEU//Big Data For Cities//Module 7//Base+Layers//NSAs.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\SHARYU\NEU\Big Data For Cities\Module 7\Base+Layers\NSAs.shp", layer: "NSAs"
## with 69 features
## It has 8 fields
## Integer64 fields read as strings: OBJECTID
summary(shp@data)
##     OBJECTID        ID              NSA_NAME    SHAPE_area
##  1      : 1   1      : 1   Allston      : 1   Min.   : 2814697
##  10     : 1   10     : 1   Ashmont      : 1   1st Qu.: 8810733
##  11     : 1   11     : 1   Back Bay East: 1   Median :13729602
##  12     : 1   12     : 1   Back Bay West: 1   Mean   :19665537
##  13     : 1   13     : 1   Beacon Hill  : 1   3rd Qu.:24150081
##  14     : 1   14     : 1   Bellevue Hill: 1   Max.   :76705631
##  (Other):63   (Other):63   (Other)      :63
##    SHAPE_len       HOODS_PD_I                   Nbhd
##  Min.   : 8515   Min.   : 1.000   Mattapan        :12
##  1st Qu.:16034   1st Qu.: 2.000   Dorchester      : 8
##  Median :21105   Median : 5.000   Roxbury         : 8
##  Mean   :26044   Mean   : 6.696   South End       : 7
##  3rd Qu.:32623   3rd Qu.:11.000   Allston-Brighton: 6
##  Max.   :90692   Max.   :17.000   Roslindale      : 6
##                                   (Other)         :22
##              NbhdCRM
##  Boston          :13
##  Mattapan        :12
##  Dorchester      : 8
##  Roxbury         : 8
##  Allston-Brighton: 6
##  Roslindale      : 6
##  (Other)         :16
shp_df <- broom::tidy(shp, region = "Nbhd")
lapply(shp_df, class)
## $long
## [1] "numeric"
##
## $lat
## [1] "numeric"
##
## $order
## [1] "integer"
##
## $hole
## [1] "logical"
##
## $piece
## [1] "character"
##
## $group
## [1] "character"
##
## $id
## [1] "character"
cnames <- aggregate(cbind(long, lat) ~ id, data=shp_df, FUN=mean)
In the code chunk above, I read the geographical infrastructure shapefile from the BARI Dataverse. The NSAs.shp file provides the neighborhood boundaries for the city of Boston, and tidying it with region = "Nbhd" yields the longitude and latitude of every point on each neighborhood's boundary. From these coordinates I draw each neighborhood with geom_polygon and label it at the mean of its boundary coordinates, which is stored in cnames.
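The label positions can be illustrated on a small example. The sketch below is a minimal illustration, assuming two made-up square "neighborhoods" (toy_shp_df and its coordinates are hypothetical, not taken from NSAs.shp); it applies the same aggregate call used for cnames. Averaging boundary vertices only approximates a centroid, which is usually good enough for placing a label.

```r
# Hypothetical tidied polygons: two squares with ids "A" and "B"
toy_shp_df <- data.frame(
  long = c(0, 2, 2, 0, 4, 6, 6, 4),
  lat  = c(0, 0, 2, 2, 1, 1, 3, 3),
  id   = rep(c("A", "B"), each = 4),
  stringsAsFactors = FALSE
)

# Mean longitude and latitude of the boundary vertices, one row per id
toy_cnames <- aggregate(cbind(long, lat) ~ id, data = toy_shp_df, FUN = mean)
print(toy_cnames)
```

Each row of toy_cnames holds one label position, which is the shape geom_text expects when it is given x, y, and label aesthetics.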
To perform sentiment analysis, I create a function, score.sentiment, that takes the tweet text and lists of positive and negative words as input. Each tweet is checked against these lists to see which of the words it contains, and that determines its sentiment. The lists are extensive: the positive list contains around 2,700 words and the negative list approximately 4,500.
The function first cleans the text by removing punctuation, control characters, and digits and by converting everything to lowercase. It then splits each tweet into words on whitespace and matches each word against the positive and negative lists. The score for a tweet is the number of positive matches minus the number of negative matches.
These scores determine the sentiment, which is stored in a new column: a score greater than zero is positive, a score less than zero is negative, and a score of zero is neutral.
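The scoring rule can be traced by hand on a few sentences. The sketch below is a minimal base R illustration, assuming tiny made-up word lists and tweets (toy_pos, toy_neg, toy_tweets, score_one, and label_sentiment are hypothetical names, not part of the assignment code); it follows the same cleaning and matching steps but is not the function used in the analysis.

```r
# Hypothetical word lists; the real ones are read from positive.txt and negative.txt
toy_pos <- c("good", "great", "love")
toy_neg <- c("bad", "awful", "hate")

# Score one sentence: positive matches minus negative matches
score_one <- function(sentence, pos.words, neg.words) {
  sentence <- gsub("[[:punct:]]", "", sentence)  # drop punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)  # drop control characters
  sentence <- gsub("\\d+", "", sentence)         # drop digits
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

# Map a numeric score to a sentiment label
label_sentiment <- function(score) {
  ifelse(score > 0, "positive", ifelse(score < 0, "negative", "neutral"))
}

toy_tweets <- c("I love Boston, great city!",
                "Awful traffic today",
                "Heading to the office")
toy_scores <- vapply(toy_tweets, score_one, integer(1),
                     pos.words = toy_pos, neg.words = toy_neg,
                     USE.NAMES = FALSE)
toy_labels <- label_sentiment(toy_scores)
print(data.frame(text = toy_tweets, score = toy_scores, sentiment = toy_labels))
```

The first tweet matches two positive words, the second one negative word, and the third matches nothing, so a tweet with no lexicon words is labeled neutral in the same way as a tweet whose positive and negative words cancel out.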
# Sentiment analysis
score.sentiment <- function(sentences, pos.words, neg.words)
{
  require(plyr)
  require(stringr)
  # Score every tweet; laply returns one score per sentence
  scores <- laply(sentences, function(sentence, pos.words, neg.words){
    # Clean the text: drop punctuation, control characters, and digits
    sentence <- gsub('[[:punct:]]', "", sentence)
    sentence <- gsub('[[:cntrl:]]', "", sentence)
    sentence <- gsub('\\d+', "", sentence)
    sentence <- tolower(sentence)
    # Split the tweet into words on whitespace
    word.list <- str_split(sentence, '\\s+')
    words <- unlist(word.list)
    # match() returns NA for words that are not in the list
    pos.matches <- match(words, pos.words)
    neg.matches <- match(words, neg.words)
    pos.matches <- !is.na(pos.matches)
    neg.matches <- !is.na(neg.matches)
    # Score = number of positive words minus number of negative words
    score <- sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words)
  scores.df <- data.frame(score=scores, text=sentences)
  return(scores.df)
}
pos <- scan('C://SHARYU//NEU//Big Data For Cities//Module 6//positive.txt', what='character', comment.char=';')
neg <- scan('C://SHARYU//NEU//Big Data For Cities//Module 6//negative.txt', what='character', comment.char=';')
twitter_merge$text<- str_replace_all(twitter_merge$text,"í ½í²¸í ½í²°' "," ")
scores <- score.sentiment(twitter_merge$text, pos, neg)
stat <- scores
stat <- mutate(stat, sentiment=ifelse(stat$score > 0, 'positive', ifelse(stat$score < 0, 'negative', 'neutral')))
stat <- mutate(stat, Neighborhood = twitter_merge$ISD_Nbhd)
stat <- as.data.frame(stat)
plotdata <- stat %>% group_by(sentiment) %>% tally()
plotdata
## # A tibble: 3 x 2
##   sentiment      n
##   <chr>      <int>
## 1 negative   16490
## 2 neutral   105853
## 3 positive   55013
Next, I place the sentiments on the map built from the NSA shapefile to get a neighborhood-level picture. To do this, I merge the tidied shapefile data with the sentiment counts from the Twitter data. A few neighborhood names are spelled differently in the two sources, so I first rename them in the Twitter data to match the shapefile. The merge joins the Neighborhood column of the Twitter data to the id column of the shapefile data, which holds the neighborhood name.
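Because the merge only joins rows whose names agree exactly, it can help to check for unmatched names before merging. The sketch below is a minimal illustration on made-up data (toy_counts, toy_shape, and name_lookup are hypothetical, and the counts and coordinates are invented); it renames labels through a lookup vector and then lists any names that still have no counterpart.

```r
# Hypothetical sentiment counts keyed by the Twitter spelling of each neighborhood
toy_counts <- data.frame(
  Neighborhood = c("Allston/Brighton", "Fenway/Kenmore", "Roxbury"),
  n = c(10L, 20L, 30L),
  stringsAsFactors = FALSE
)

# Hypothetical shapefile data keyed by the shapefile spelling (one point each)
toy_shape <- data.frame(
  id   = c("Allston-Brighton", "Fenway", "Roxbury"),
  long = c(1, 2, 3),
  lat  = c(4, 5, 6),
  stringsAsFactors = FALSE
)

# Twitter spelling -> shapefile spelling
name_lookup <- c("Allston/Brighton" = "Allston-Brighton",
                 "Fenway/Kenmore" = "Fenway",
                 "Financial District/Downtown" = "Financial District")
needs_rename <- toy_counts$Neighborhood %in% names(name_lookup)
toy_counts$Neighborhood[needs_rename] <-
  unname(name_lookup[toy_counts$Neighborhood[needs_rename]])

# Names in the counts that have no matching id; empty when every name lines up
unmatched <- setdiff(toy_counts$Neighborhood, toy_shape$id)
print(unmatched)

toy_merged <- merge(toy_counts, toy_shape,
                    by.x = "Neighborhood", by.y = "id", all.x = TRUE)
```

With all.x = TRUE, any name left in unmatched would still appear in the merged table but with missing coordinates, so it would silently drop off the map.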
plotdata <- stat %>% group_by(Neighborhood, sentiment) %>% tally()
levels(plotdata$Neighborhood) <- c(levels(plotdata$Neighborhood), "Allston-Brighton")
levels(plotdata$Neighborhood) <- c(levels(plotdata$Neighborhood), "Fenway")
levels(plotdata$Neighborhood) <- c(levels(plotdata$Neighborhood), "Financial District")
plotdata$Neighborhood[plotdata$Neighborhood == "Allston/Brighton"] <- "Allston-Brighton"
plotdata$Neighborhood[plotdata$Neighborhood == "Fenway/Kenmore"] <- "Fenway"
plotdata$Neighborhood[plotdata$Neighborhood == "Financial District/Downtown"] <- "Financial District"
map_data <- merge(plotdata, shp_df, by.x = "Neighborhood", by.y = "id", all.x = TRUE)
Finally, the sentiments are mapped across the neighborhoods. The map gives a broad view of which parts of Boston the positive, negative, and neutral tweets come from.
cbPalette <- c("#999999", "#E69F00", "#56B4E9")
map <- ggplot() +
  geom_polygon(data = map_data,
               aes(x = long, y = lat, group = group, fill = sentiment),
               colour = "black") +
  scale_fill_manual(values = cbPalette)
map + geom_text(data = cnames, aes(x = long, y = lat, label = id), size = 3) + theme_void()
map
The map suggests that most of the positive sentiment comes from Back Bay, South End, Fenway, Roslindale, Hyde Park, Financial District (which also includes Government Center), and Beacon Hill. More of the negative tweets come from Mission Hill, Roxbury, Mattapan, West Roxbury, and Dorchester, whereas Allston, Charlestown, East Boston, and South Boston are mostly neutral. This analysis could give rise to hypotheses about living conditions, safety, and other factors in these neighborhoods. It would be interesting to dig deeper into the negatives and positives during this week’s city exploration and see how much of this analysis holds up.
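To compare neighborhoods by percentage, as set out in the goal, the counts can be converted to within-neighborhood shares. The sketch below is a minimal illustration on made-up counts (toy_counts and its values are hypothetical, not results from this dataset). It also reduces the data to one row per neighborhood, which may be a more convenient shape to merge with the polygon coordinates, since a table with one row per sentiment repeats every polygon vertex once per sentiment.

```r
# Hypothetical counts of tweets per neighborhood and sentiment
toy_counts <- data.frame(
  Neighborhood = rep(c("Back Bay", "Roxbury"), each = 3),
  sentiment    = rep(c("negative", "neutral", "positive"), times = 2),
  n            = c(10, 60, 30, 25, 55, 20),
  stringsAsFactors = FALSE
)

# Share of each sentiment within its neighborhood
neighborhood_total <- ave(toy_counts$n, toy_counts$Neighborhood, FUN = sum)
toy_counts$share <- toy_counts$n / neighborhood_total

# One row per neighborhood: positive share minus negative share
pos_share <- toy_counts[toy_counts$sentiment == "positive", c("Neighborhood", "share")]
neg_share <- toy_counts[toy_counts$sentiment == "negative", c("Neighborhood", "share")]
toy_net <- merge(pos_share, neg_share, by = "Neighborhood",
                 suffixes = c("_pos", "_neg"))
toy_net$net_share <- toy_net$share_pos - toy_net$share_neg
print(toy_net)
```

A net share above zero means a neighborhood produced proportionally more positive than negative tweets. Because neutral tweets outnumber the other two classes in the overall tally, a net share may separate neighborhoods more clearly than the single most frequent label.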
In the coming week, I plan to extract topics from the tweets through topic modeling in R and map them at the neighborhood level. The topics would show what people talk about most and least on Twitter, and it would be interesting to see how that information corresponds to the physical attributes of the neighborhoods.