INTRODUCTION
Following last week's assignment, “I would Tell a Data Story.. But it was lost in translation”, this week I look more closely at the patterns in the variables of the Google Places of Interest dataset. This analysis extends my exploration of whether keywords can add noise to an equitable analysis of the dataset when the data is produced in a multicultural setting. In this post I identify the role of the various tags more precisely. I also look at odd outcomes that point to data that should be cleaned from the dataset, so as to rule out alternative interpretations of the results on the “equity” question. More specifically, I am exploring:
- the interrelation of the tags, based on the different frequencies observed
- the data that might be the result of double counting, based on the patterns that appear when the most commonly used keywords are combined with location patterns.
DATA EXPLORATION
#tidyverse attaches readr, dplyr and ggplot2, which are all used below
library(tidyverse)
Google <- read_csv("/Users/Danai_Tr/Desktop/files of big data/data assignment/dataverse_files (1)/GooglePlaces.POI.csv")
View(Google)
nrow(Google)
## [1] 36354
ncol(Google)
## [1] 18
It is a dataset with 18 different columns and 36354 rows. It has a clear structure, defining the places by their location, name and id and the multiple tags.
According to the data documentation:
“Google Places measures land usage using a hierarchical labelling system that emphasizes the primary land usage of a parcel. For each POI, Google provides up to 10 ‘tags’ that describe land usage, with lowered numbered tags being more indicative of the central type of land use at that parcel. We have assigned each tag a unique variable, with the variable suffix (X) indicating the tag’s placement in Google’s categorization scheme.”
This analysis examines the interrelation of the primary land use with the rest of the tags, so as to see what patterns and information their interrelation reveals about the dataset. It also explores what happens when the geolocation (GIS_ID) is combined with the primary tags, so as to detect probable double counting and noise in our dataset.
When requesting the summary of the dataset, we notice that Tag_6-Tag_10 are read as logical, while Tag_1-Tag_5 are read as character. The summary reports 36354 NA's for each of Tag_6-Tag_10, which equals the number of rows, so these columns are entirely empty. Thus, 0% of the data have 6 or more tags.
summary(Google)
## GIS_ID place_id name Tag_1 Tag_2 Tag_3 Tag_4 Tag_5 Tag_6 Tag_7 Tag_8
## Min. :1.001e+08 Length:36354 Length:36354 Length:36354 Length:36354 Length:36354 Length:36354 Length:36354 Mode:logical Mode:logical Mode:logical
## 1st Qu.:3.041e+08 Class :character Class :character Class :character Class :character Class :character Class :character Class :character NA's:36354 NA's:36354 NA's:36354
## Median :5.000e+08 Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
## Mean :7.926e+08
## 3rd Qu.:1.201e+09
## Max. :2.206e+09
##
## Tag_9 Tag_10 vicinity X Y Blk_ID_10 CT_ID_10
## Mode:logical Mode:logical Length:36354 Min. :-71.18 Min. :42.23 Min. :2.502e+14 Min. :2.503e+10
## NA's:36354 NA's:36354 Class :character 1st Qu.:-71.10 1st Qu.:42.33 1st Qu.:2.502e+14 1st Qu.:2.503e+10
## Mode :character Median :-71.07 Median :42.35 Median :2.503e+14 Median :2.503e+10
## Mean :-71.08 Mean :42.34 Mean :2.503e+14 Mean :2.503e+10
## 3rd Qu.:-71.06 3rd Qu.:42.36 3rd Qu.:2.503e+14 3rd Qu.:2.503e+10
## Max. :-70.93 Max. :42.40 Max. :2.503e+14 Max. :2.503e+10
## NA's :8 NA's :8
I also check the validity of this statement by sorting the data on Tag_6 and inspecting it with the View command.
View(arrange(Google,Tag_6))
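The same check can be done without scrolling through the viewer. The sketch below counts the non-missing values in each tag column. It runs on a small made-up data frame (`toy`, with hypothetical values) rather than on the real file, so the names and numbers are only illustrative.

```r
# Toy stand-in for the Google data frame (hypothetical values, not the real data)
toy <- data.frame(
  Tag_1 = c("pharmacy", "bar", "cafe"),
  Tag_5 = c("store", NA, NA),
  Tag_6 = c(NA, NA, NA),
  Tag_7 = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# Count the non-missing values in every column whose name starts with "Tag_"
tag_cols <- grep("^Tag_", names(toy), value = TRUE)
non_missing <- colSums(!is.na(toy[tag_cols]))
non_missing

# Columns that never carry a value
empty_tags <- names(non_missing)[non_missing == 0]
empty_tags
```

Replacing `toy` with `Google` should list Tag_6 to Tag_10 as empty if the summary above is right.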
#I am keeping the columns I need. The tags to compare, and the GIS IDs and vicinity to indicate the places
clean<-select(Google,GIS_ID,Tag_1,Tag_2,Tag_3,Tag_4,Tag_5, vicinity)
View(clean)
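#I am excluding the rows that do not have Tag_3
#Tag_3 is a character column: rows where it is NA are dropped, as is the literal value "99"
Tag3<-filter(clean,!is.na(Tag_3),Tag_3!="99")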
#I am counting the rows that have 3+ tags
nrow(Tag3)
## [1] 3947
In my dataset, I keep only the rows with 3 or more tags. These cases form a set of 3947 rows (that is, 10.9% of my data) and allow me to focus on the places that have multiple tags, so as to specify how the tags relate to each other. For this purpose I am not interested in cases that have fewer than 3 tags.
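A minimal sketch of how the number of tags per row and the share of rows with 3 or more tags could be computed directly. It uses a made-up data frame (`toy`) with hypothetical values, and it assumes that tags are filled from Tag_1 onwards without gaps, so that "3 or more tags" and "Tag_3 is present" mean the same thing.

```r
# Toy stand-in for the cleaned data (hypothetical values, not the real data)
toy <- data.frame(
  GIS_ID = c(1, 2, 3, 4),
  Tag_1 = c("pharmacy", "bar", "cafe", "atm"),
  Tag_2 = c("health", "food", NA, "finance"),
  Tag_3 = c("store", NA, NA, "establishment"),
  stringsAsFactors = FALSE
)

# Number of non-missing tags in each row
tag_cols <- grep("^Tag_", names(toy), value = TRUE)
toy$n_tags <- rowSums(!is.na(toy[tag_cols]))

# Keep the rows with 3 or more tags and compute their share of all rows
three_plus <- toy[toy$n_tags >= 3, ]
share <- round(100 * nrow(three_plus) / nrow(toy), 1)
share
```

With the real counts reported above, `round(100 * 3947 / 36354, 1)` gives the 10.9 quoted in the text.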
INTERRELATION OF TAGS
Within this dataset, I am exploring the most frequently used tags.
#I create a dataframe with the frequency of Tag_1 and spot the 3 most frequent tags
freqTag1<-data.frame(table(Tag3$Tag_1))
names(freqTag1)[1]<-'Tag'
head(arrange(freqTag1,desc(Freq)), n = 3)
## Tag Freq
## 1 pharmacy 356
## 2 bar 318
## 3 convenience_store 301
#I create a dataframe with the frequency of Tag_2 and spot the 3 most frequent tags
freqTag2<-data.frame(table(Tag3$Tag_2))
names(freqTag2)[1]<-'Tag2'
head(arrange(freqTag2,desc(Freq)), n = 3)
## Tag2 Freq
## 1 food 752
## 2 restaurant 473
## 3 health 450
#I create a dataframe with the frequency of Tag_3 and spot the 3 most frequent tags
freqTag3<-data.frame(table(Tag3$Tag_3))
names(freqTag3)[1]<-'Tag3'
head(arrange(freqTag3,desc(Freq)), n = 3)
## Tag3 Freq
## 1 store 1646
## 2 food 726
## 3 health 428
Through this analysis we can observe the following:
-In Tag_1 the top frequencies are smaller, which suggests that the rows are spread over more categories. Thus, besides signifying the main use (as stated in the dataset documentation), Tag_1 is also a more specific categorisation than Tag_2 and Tag_3, which become progressively more general.
-The most frequent values in Tag_1 and Tag_2 (pharmacy and food) describe completely different uses. The most frequent value in Tag_3 (store) can be associated with the most frequent values of the other 2 tags.
Thus, we can assume that different words are describing the same space, either generalising its use as the tags progress or signifying different ways of interpreting the same space.
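One way to test the idea that the tags get more general could be to compare, for each tag column, how many distinct values it holds and how large the share of its most frequent value is. The sketch below does this on a made-up data frame (`toy`, hypothetical values). The two measures are my own choice for illustration, not something taken from the dataset documentation.

```r
# Toy stand-in for the Tag3 data frame (hypothetical values, not the real data)
toy <- data.frame(
  Tag_1 = c("pharmacy", "bar", "convenience_store", "bakery", "cafe"),
  Tag_2 = c("health", "food", "food", "food", "food"),
  Tag_3 = c("store", "store", "store", "store", "store"),
  stringsAsFactors = FALSE
)

# Number of distinct non-missing values per tag column:
# fewer distinct values means a more general column
distinct_per_tag <- sapply(toy, function(col) length(unique(na.omit(col))))
distinct_per_tag

# Share of the most frequent value per tag column:
# a higher share means the column is dominated by one broad label
top_share <- sapply(toy, function(col) max(table(col)) / sum(!is.na(col)))
top_share
```

If Tag_1 is the most specific column, it should show the most distinct values and the lowest top share when this is run on `Tag3[c("Tag_1", "Tag_2", "Tag_3")]`.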
From the analysis of the first 3 tags, we would assume that, according to Google Places, Boston has more pharmacies and food stores than any other use. There are various alternative ways of interpreting the results, though. To explore them, I will examine the patterns of GIS_ID and interrelate them with the tags.
INTERRELATION WITH THE GEOLOCATION AND ODD RESULTS
First I will explore whether GIS_ID is being repeated within the data frame, which would be interpreted either as a multiplicity of uses under the same id, or as a double count of the data.
#I create a dataframe with the frequency of GIS_ID and spot the 3 most frequent IDs
freqgis<-data.frame(table(Tag3$GIS_ID))
names(freqgis)[1]<-'GIS_ID'
head(arrange(freqgis,desc(Freq)), n = 3)
## GIS_ID Freq
## 1 104126000 57
## 2 1001670000 43
## 3 401870000 40
The first 3 results, geolocated in the zoning viewer (http://maps.bostonplans.org/zoningviewer/):
- 104126000 (Airport) 57
- 1001670000 (Jamaica Plain medical centre) 43
- 401870000 (Boston Children’s Hospital) 40
I am combining the patterns spotted, so as to open a discussion on whether there is double counting of data that would alter any conclusion that I will attempt to draw in the upcoming weeks.
#I interrelate the frequency of the most common Tag_1 and Tag_2 with the frequency of GIS_ID
Tag3%>%
filter(Tag_1=='pharmacy') %>%
ggplot(aes(x=as.factor(GIS_ID)))+geom_bar()
We can observe that a large share of the pharmacy rows is gathered under a single GIS_ID.
Tag3%>%
filter(Tag_1=='pharmacy') %>%
{table(.$GIS_ID)} %>%
as.data.frame %>%
arrange(.,desc(Freq)) %>%
head(n = 3)
## Var1 Freq
## 1 1001670000 41
## 2 401894000 15
## 3 801298000 13
The most frequent GIS_ID, 1001670000, is the Jamaica Plain medical centre. It holds 41 of the 356 pharmacy rows (about 11.5%).
I follow the same procedure for the most frequent Tag_2, which is the word “food”.
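The concentration of a tag at its most frequent parcel can also be expressed as a share. The helper below, `top_parcel_share`, is a hypothetical function written for this post, and it is shown on a made-up data frame (`toy`) rather than on the real data.

```r
# Hypothetical helper: for one tag value, find the GIS_ID that holds the most
# rows and the share of all rows with that tag that sit at this GIS_ID
top_parcel_share <- function(df, tag_col, tag_value) {
  ids <- df$GIS_ID[!is.na(df[[tag_col]]) & df[[tag_col]] == tag_value]
  counts <- sort(table(ids), decreasing = TRUE)
  data.frame(
    GIS_ID = names(counts)[1],
    n = as.integer(counts[1]),
    share = as.integer(counts[1]) / length(ids),
    stringsAsFactors = FALSE
  )
}

# Toy stand-in for the Tag3 data frame (hypothetical values, not the real data)
toy <- data.frame(
  GIS_ID = c(10, 10, 10, 20, 30, 30),
  Tag_1 = c("pharmacy", "pharmacy", "pharmacy", "pharmacy", "bar", "pharmacy"),
  stringsAsFactors = FALSE
)

result <- top_parcel_share(toy, "Tag_1", "pharmacy")
result
```

On the real data, the counts reported in this post (41 rows at the top parcel out of 356 pharmacy rows) correspond to a share of about 0.115.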
Tag3%>%
filter(Tag_2=='food') %>%
ggplot(aes(x=as.factor(GIS_ID)))+geom_bar()
We can observe less repetition of the same ID, which supports the idea that Tag_2 refers to a more general categorisation and that, thus, the repetition of the tag is less related to the repetition of the data itself.
Tag3%>%
filter(Tag_2=='food') %>%
{table(.$GIS_ID)} %>%
as.data.frame %>%
arrange(.,desc(Freq)) %>%
head(n = 3)
## Var1 Freq
## 1 104126000 17
## 2 2100360000 4
## 3 101662000 3
We can observe that the first GIS_ID refers to the airport.
Thus, the most frequent Tag_2 (food) is concentrated at the most frequent GIS_ID (airport). The most frequent Tag_1 (pharmacy) is concentrated at the second most frequent GIS_ID (Jamaica Plain medical centre).
From this analysis we can conclude that there are double-counted data in the dataset. Bearing in mind the reality in Boston, it is not plausible that so many of the pharmacies are inside the Jamaica Plain medical centre and so many of the food stores are in the airport.
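A repeated GIS_ID can mean several different places on one parcel or the same place counted more than once. A possible way to tell the two apart is to compare the number of rows per GIS_ID with the number of distinct place_id values. The sketch below assumes that place_id uniquely identifies a Google place and that this column is kept in the working data frame. I have not verified this against the documentation. It uses a made-up data frame (`toy`) with hypothetical values.

```r
# Toy stand-in for the data (hypothetical values, not the real data)
toy <- data.frame(
  GIS_ID = c(10, 10, 10, 20, 20),
  place_id = c("a", "a", "b", "c", "d"),
  Tag_1 = c("pharmacy", "pharmacy", "pharmacy", "bar", "cafe"),
  stringsAsFactors = FALSE
)

# Rows whose place_id already appeared earlier are candidate double counts
toy$is_repeat <- duplicated(toy$place_id)

# Rows per parcel versus distinct places per parcel
rows_per_parcel <- tapply(toy$place_id, toy$GIS_ID, length)
places_per_parcel <- tapply(toy$place_id, toy$GIS_ID,
                            function(x) length(unique(x)))
summary_tbl <- data.frame(
  GIS_ID = names(rows_per_parcel),
  rows = as.integer(rows_per_parcel),
  distinct_places = as.integer(places_per_parcel),
  stringsAsFactors = FALSE
)
summary_tbl

# Data with the repeated place_id rows removed
deduped <- toy[!toy$is_repeat, ]
nrow(deduped)
```

A parcel where `rows` is larger than `distinct_places` points to double counting. A parcel where the two are equal points to genuinely different places sharing the same GIS_ID.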
DISCUSSION
At the end of this post, I check how often each Tag_1 appears among the rows with the most used Tag_2, so as to underline the importance of spotting and correcting the double-counted data.
Tag3 %>%
filter(Tag_2=='food') %>%
ggplot(aes(x=as.factor(Tag_1)))+geom_bar()
We can observe that some values are related to food, like the grocery store, while others are not, like the atm or health. The ones that are more related also appear more often. Without the previous analysis, we would assume that people define spaces in the same way and differ only in how specifically they describe them.
Now, though, we should rethink this statement. A large amount of this data is the result of double counting. Excluding it from our dataset would result in a greater diversity of patterns associating the tags with each other. This would lead to more interesting results on the topic of this exploration: the way that keywords affect equitable results in multicultural environments.
In this post I defined more precisely the interrelation among the different tags, as well as peculiar cases that create noise in the attempt to understand the pulse of the city through the Google Places dataset. The outcomes will be used in the continuation of my project to improve the dataset and combine it with other sources, so as to draw equitable conclusions on the impacts of COVID-19.