Noise Interfering with the Pulse of the City

INTRODUCTION

Following last week's assignment, “I would Tell a Data Story.. But it was lost in translation”, this week I explore more specifically the patterns that can be seen in the variables of the Google Places of Interest dataset. This analysis continues my exploration of whether keywords can create noise in an equitable analysis of the dataset when the data is produced in a multicultural setting. In this post I identify the role of the various tags more precisely. I also look at odd outcomes that point to data that should be cleaned from the dataset, so as to rule out alternative interpretations of the results on the “equity” question. More specifically, I am exploring:

  • the interrelation of the tags, depending on the frequencies observed
  • the data that might be the result of double counting, judging from the patterns that appear when the most commonly used keywords are combined with location patterns.

DATA EXPLORATION

library(tidyverse)  # loads readr, dplyr and ggplot2, which are used below
Google <- read_csv("/Users/Danai_Tr/Desktop/files of big data/data assignment/dataverse_files (1)/GooglePlaces.POI.csv")
View(Google)
nrow(Google)
## [1] 36354
ncol(Google)
## [1] 18

It is a dataset with 18 columns and 36354 rows. It has a clear structure, defining each place by its location, name and id, followed by its tags.

According to the data documentation: 

“Google Places measures land usage using a hierarchical labelling system that emphasizes the primary land usage of a parcel. For each POI, Google provides up to 10 ‘tags’ that describe land usage, with lowered numbered tags being more indicative of the central type of land use at that parcel. We have assigned each tag a unique variable, with the variable suffix (X) indicating the tag’s placement in Google’s categorization scheme.”

This analysis examines how the primary land use relates to the rest of the tags, so as to see what patterns and information in the dataset emerge from that relationship. It also explores what happens when the geolocation (GIS_ID) is related to the primary tags, so as to detect probable double counting and noise in the dataset.

When requesting the summary of the dataset, we notice that Tag_6 to Tag_10 are read as logical and contain only missing values (NA's: 36354, which is the full row count), while Tag_1 to Tag_5 are read as character. Thus, no place in the data has 6 or more tags.

summary(Google)
##      GIS_ID            place_id             name              Tag_1              Tag_2              Tag_3              Tag_4              Tag_5            Tag_6          Tag_7          Tag_8        
##  Min.   :1.001e+08   Length:36354       Length:36354       Length:36354       Length:36354       Length:36354       Length:36354       Length:36354       Mode:logical   Mode:logical   Mode:logical  
##  1st Qu.:3.041e+08   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   Class :character   NA's:36354     NA's:36354     NA's:36354    
##  Median :5.000e+08   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character   Mode  :character                                               
##  Mean   :7.926e+08                                                                                                                                                                                    
##  3rd Qu.:1.201e+09                                                                                                                                                                                    
##  Max.   :2.206e+09                                                                                                                                                                                    
##                                                                                                                                                                                                       
##   Tag_9          Tag_10          vicinity               X                Y           Blk_ID_10            CT_ID_10        
##  Mode:logical   Mode:logical   Length:36354       Min.   :-71.18   Min.   :42.23   Min.   :2.502e+14   Min.   :2.503e+10  
##  NA's:36354     NA's:36354     Class :character   1st Qu.:-71.10   1st Qu.:42.33   1st Qu.:2.502e+14   1st Qu.:2.503e+10  
##                                Mode  :character   Median :-71.07   Median :42.35   Median :2.503e+14   Median :2.503e+10  
##                                                   Mean   :-71.08   Mean   :42.34   Mean   :2.503e+14   Mean   :2.503e+10  
##                                                   3rd Qu.:-71.06   3rd Qu.:42.36   3rd Qu.:2.503e+14   3rd Qu.:2.503e+10  
##                                                   Max.   :-70.93   Max.   :42.40   Max.   :2.503e+14   Max.   :2.503e+10  
##                                                                                    NA's   :8           NA's   :8
View(arrange(Google,Tag_6))

I also check this statement visually: arrange() sorts missing values last, so any existing Tag_6 value would appear at the top of the view, and none does.
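The same check can be done by counting rather than by eye. Below is a minimal sketch on a toy table; the data frame `poi` and its values are hypothetical stand-ins, not the real dataset.

```r
# Toy stand-in for the POI table (hypothetical values, not the real data)
poi <- data.frame(
  Tag_1 = c("pharmacy", "bar", "cafe"),
  Tag_2 = c("health", NA, "food"),
  Tag_6 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# All columns whose name starts with "Tag_"
tag_cols <- grep("^Tag_", names(poi), value = TRUE)

# Number of non-missing values in each tag column
non_missing <- colSums(!is.na(poi[tag_cols]))
print(non_missing)
```

On the real table the last two statements would run on `Google` instead of `poi`; a count of 0 for Tag_6 to Tag_10 would confirm the statement. Since readr guesses each column's type from the first rows it reads (the `guess_max` argument of `read_csv`), it may also be worth calling `problems(Google)` after the import, to confirm that no values were dropped during parsing.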

#I am keeping the columns I need. The tags to compare, and the GIS IDs and vicinity to indicate the places
clean<-select(Google,GIS_ID,Tag_1,Tag_2,Tag_3,Tag_4,Tag_5, vicinity)
View(clean)

#I am keeping only the rows that have a Tag_3 (a missing tag is read as NA)
Tag3<-filter(clean,!is.na(Tag_3))
#I am counting the rows that have 3+ tags
nrow(Tag3)
## [1] 3947

In my dataset, I keep only the rows with 3 or more tags. These cases form a set of 3947 rows (that is, 10.9% of my data) and allow me to focus on the places that have multiple tags, so as to examine how the tags relate to each other. For this purpose I am not interested in places that have fewer than 3 tags.
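The number of tags per row and the share of rows that pass the filter can be computed directly. This is a minimal sketch; the toy data frame `poi` is hypothetical, and only the two counts at the end come from this post.

```r
# Toy rows (hypothetical values): NA marks a missing tag
poi <- data.frame(
  Tag_1 = c("pharmacy", "bar", "cafe", "atm"),
  Tag_2 = c("health", NA, "food", "finance"),
  Tag_3 = c("store", NA, NA, "establishment"),
  stringsAsFactors = FALSE
)

# Number of tags per row = number of non-missing tag cells
n_tags <- rowSums(!is.na(poi[c("Tag_1", "Tag_2", "Tag_3")]))

# Share of toy rows with at least 3 tags, in percent
share_toy <- 100 * mean(n_tags >= 3)

# The same arithmetic on the counts reported in this post
share_reported <- round(100 * 3947 / 36354, 1)
print(share_reported)
```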

INTERRELATION OF TAGS

Within this dataset, I am exploring the most frequently used tags.

#I create a dataframe with the frequency of Tag_1 and spot the 3 most frequent tags
freqTag1<-data.frame(table(Tag3$Tag_1))
names(freqTag1)[1]<-'Tag'
head(arrange(freqTag1,desc(Freq)), n = 3)
##                 Tag Freq
## 1          pharmacy  356
## 2               bar  318
## 3 convenience_store  301
#I create a dataframe with the frequency of Tag_2 and spot the 3 most frequent tags
freqTag2<-data.frame(table(Tag3$Tag_2))
names(freqTag2)[1]<-'Tag2'
head(arrange(freqTag2,desc(Freq)), n = 3)
##         Tag2 Freq
## 1       food  752
## 2 restaurant  473
## 3     health  450

#I create a dataframe with the frequency of Tag_3 and spot the 3 most frequent tags
freqTag3<-data.frame(table(Tag3$Tag_3))
names(freqTag3)[1]<-'Tag3'
head(arrange(freqTag3,desc(Freq)), n = 3)
##     Tag3 Freq
## 1  store 1646
## 2   food  726
## 3 health  428

Through this analysis we can observe the following:

-In Tag_1 we observe smaller frequencies. Thus, besides signifying the main use (as stated in the dataset documentation), Tag_1 is also the most specific categorisation, while Tag_2 and Tag_3 become progressively more general.

-The most frequent values of Tag_1 and Tag_2 (pharmacy and food) point to completely different uses. The most frequent value of Tag_3 (store) can be associated with both of them.

Thus, we can assume that diverse words are describing the same space, either generalising its use as the tags progress or signifying diverse ways of interpreting the same space.
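How general a tag position is can also be measured more directly than through its top frequencies: by the number of distinct values it holds and by the share of rows taken by its most frequent value. The sketch below runs on hypothetical toy rows; `tag_profile` is a helper name introduced here for illustration only.

```r
# Toy rows (hypothetical values, not the real data)
poi <- data.frame(
  Tag_1 = c("pharmacy", "bar", "bakery", "cafe"),
  Tag_2 = c("health", "food", "food", "food"),
  Tag_3 = c("store", "store", "store", "store"),
  stringsAsFactors = FALSE
)

# For one tag column: number of distinct tags, and the share of rows
# taken by the single most frequent tag
tag_profile <- function(x) {
  counts <- table(x)  # table() leaves out NA by default
  c(distinct = length(counts), top_share = max(counts) / sum(counts))
}

# One column of results per tag position
profile <- sapply(poi[c("Tag_1", "Tag_2", "Tag_3")], tag_profile)
print(profile)
```

Fewer distinct values and a higher top share from Tag_1 to Tag_3 would support reading the later tags as more general.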

Based on the first 3 tags alone, we would assume that, according to Google Places, Boston has more pharmacies and food stores than any other use. There are alternative ways of interpreting these results, though. To explore them, I will examine the patterns of GIS_ID and relate them to the tags.

INTERRELATION WITH THE GEOLOCATION AND ODD RESULTS

First I will explore whether GIS_ID is repeated within the data frame, which could be interpreted either as a multiplicity of uses under the same id, or as a double count of the data.

#I create a dataframe with the frequency of GIS_ID and spot the 3 most frequent IDs
freqgis<-data.frame(table(Tag3$GIS_ID))
names(freqgis)[1]<-'GIS_ID'
head(arrange(freqgis,desc(Freq)), n = 3)
##       GIS_ID Freq
## 1  104126000   57
## 2 1001670000   43
## 3  401870000   40

The first 3 results, as located in the zoning viewer (http://maps.bostonplans.org/zoningviewer/):

  • 104126000 (Airport) 57
  • 1001670000 (Jamaica Plain medical centre) 43
  • 401870000 (Boston Children’s Hospital) 40

I am combining the patterns spotted, so as to open a discussion on whether there is double counting in the data that would alter any conclusion I attempt to draw in the upcoming weeks.

#I interrelate the frequency of the most common Tag_1 and Tag_2 with the frequency of GIS_ID

Tag3%>%
  filter(Tag_1=='pharmacy') %>%
  ggplot(aes(x=as.factor(GIS_ID)))+geom_bar()

We can observe that a large share of the pharmacy rows is gathered under the same GIS_ID.

Tag3%>%
  filter(Tag_1=='pharmacy') %>%
  {table(.$GIS_ID)} %>%
  as.data.frame %>%
  arrange(.,desc(Freq)) %>%
  head(n = 3)
##         Var1 Freq
## 1 1001670000   41
## 2  401894000   15
## 3  801298000   13

The most frequent GIS_ID is the Jamaica Plain medical centre, with 41 of the 356 pharmacy rows.

I follow the same procedure for the most frequent Tag_2, which is the word “food”.

Tag3%>%
  filter(Tag_2=='food') %>%
  ggplot(aes(x=as.factor(GIS_ID)))+geom_bar()

We can observe less repetition of the same ID, which is consistent with Tag_2 being a more general categorisation: the repetition of the tag is less tied to the repetition of the data itself.

Tag3%>%
  filter(Tag_2=='food') %>%
  {table(.$GIS_ID)} %>%
  as.data.frame %>%
  arrange(.,desc(Freq)) %>%
  head(n = 3)
##         Var1 Freq
## 1  104126000   17
## 2 2100360000    4
## 3  101662000    3

We can observe that the first GIS_ID refers to the airport.
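To compare the two cases, the share of each tag that falls under its most frequent GIS_ID can be worked out from the counts in the tables above. The variable names are introduced here for illustration only.

```r
# Counts taken from the tables in this post
pharmacy_total <- 356  # rows with Tag_1 == "pharmacy"
pharmacy_top   <- 41   # of those, rows at GIS_ID 1001670000
food_total     <- 752  # rows with Tag_2 == "food"
food_top       <- 17   # of those, rows at GIS_ID 104126000

# Share of each tag concentrated in its top GIS_ID, in percent
pharmacy_share <- round(100 * pharmacy_top / pharmacy_total, 1)
food_share     <- round(100 * food_top / food_total, 1)
print(c(pharmacy = pharmacy_share, food = food_share))
```

About 11.5% of the pharmacy rows sit under one GIS_ID, against about 2.3% of the food rows, which matches the lower repetition seen in the food plot.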

Thus, the most frequent Tag_2 (food) is concentrated in the most frequent GIS_ID (the airport), and the most frequent Tag_1 (pharmacy) is concentrated in the second most frequent GIS_ID (the Jamaica Plain medical centre).

From this analysis we can conclude that there are double-counted data in the dataset. Bearing in mind the reality in Boston, it is not plausible that so many pharmacies are inside the Jamaica Plain medical centre, or that the airport is the main location of food stores.
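One possible way to test this on the data is to compare, for each GIS_ID, the number of rows with the number of distinct places. This is a minimal sketch under the assumption that the place_id column identifies a single Google place; the toy rows are hypothetical.

```r
# Toy rows (hypothetical values): place "a" appears twice under GIS_ID 1
poi <- data.frame(
  GIS_ID   = c(1, 1, 1, 2, 2, 3),
  place_id = c("a", "a", "b", "c", "d", "e"),
  Tag_1    = c("pharmacy", "pharmacy", "pharmacy", "bar", "cafe", "atm"),
  stringsAsFactors = FALSE
)

# Rows and distinct places under each GIS_ID
rows_per_id   <- tapply(poi$place_id, poi$GIS_ID, length)
places_per_id <- tapply(poi$place_id, poi$GIS_ID,
                        function(x) length(unique(x)))

check <- data.frame(
  GIS_ID          = names(rows_per_id),
  rows            = as.vector(rows_per_id),
  distinct_places = as.vector(places_per_id)
)

# Rows beyond the first occurrence of a place under the same GIS_ID
check$repeated_rows <- check$rows - check$distinct_places
print(check)

# Keep the first row for each GIS_ID and place_id pair
deduplicated <- poi[!duplicated(poi[c("GIS_ID", "place_id")]), ]
```

A GIS_ID where repeated_rows is above 0 holds the same place more than once, while a GIS_ID with many rows and no repeats would rather point to many different places under the same id.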

DISCUSSION

To close this post, I check how Tag_1 is distributed among the rows that share the most used Tag_2 (food), so as to underline the importance of spotting and correcting the double-counted data.

Tag3 %>%
  filter(Tag_2=='food') %>%
  ggplot(aes(x=as.factor(Tag_1)))+geom_bar()

We can observe that some Tag_1 values are related to food, like the grocery store, while others are not, like atm or health. The ones that are more related also appear more often. Without the previous analysis, we would assume that people define spaces in the same way and differ only in how specifically they describe them.
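The bar chart can be complemented with a sorted table of the same counts, which is easier to read when there are many Tag_1 values. A minimal sketch on hypothetical toy rows:

```r
# Toy rows (hypothetical values, not the real data)
poi <- data.frame(
  Tag_1 = c("grocery_or_supermarket", "grocery_or_supermarket",
            "grocery_or_supermarket", "bakery", "bakery", "atm", "bar"),
  Tag_2 = c("food", "food", "food", "food", "food", "food", "restaurant"),
  stringsAsFactors = FALSE
)

# Rows whose second tag is "food" (rows with a missing Tag_2 are left out)
food_rows <- poi[!is.na(poi$Tag_2) & poi$Tag_2 == "food", ]

# Tag_1 counts within those rows, most frequent first
tag1_counts <- sort(table(food_rows$Tag_1), decreasing = TRUE)
print(tag1_counts)
```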

Now, though, we should rethink this statement. A large part of this data is the result of double counting. Excluding it from the dataset would reveal a greater diversity of patterns associating the tags with each other. This would lead to more interesting results on the topic of this exploration: the way that keywords affect equitable results in multicultural environments.

In this post I defined more precisely the interrelation among the different tags, as well as peculiar cases that create noise in the attempt to understand the pulse of the city through the Google Places dataset. The outcomes will be used in the continuation of my project, to improve the dataset and combine it with other sources, with the aim of drawing equitable conclusions on the impacts of COVID-19.

