Principal Component Analysis (PCA) is a statistical method for reducing the dimensionality of a large dataset with many variables: it lowers the number of variables while retaining most of the information in the original data. Geometrically, each principal component represents a direction in the data that explains the maximum remaining variance, with the eigenvectors of the covariance matrix giving the direction of each component and the eigenvalues giving the amount of variance each component explains. It is possible to conduct a PCA with categorical variables using a nonlinear PCA methodology, but this approach is 1) not widely used in statistics and therefore hasn’t been tested extensively, and 2) somewhat beyond my current skillset.
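To make the geometry above concrete, here is a small sketch on a built-in dataset (mtcars, not my Yelp/inspection data): the principal directions are the eigenvectors of the covariance matrix of the standardized data, and the eigenvalues equal the variances of the components that prcomp reports.

```r
# Toy illustration on built-in data, not my dataset:
X <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])  # standardize
eig <- eigen(cov(X))   # eigendecomposition of the covariance matrix
pca <- prcomp(X)       # the same analysis via prcomp
# Eigenvalues equal the squared standard deviations of the components:
all.equal(eig$values, pca$sdev^2)
```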
For this week, I am attempting to perform a PCA on the dataset using select continuous variables. The 11 variables I have chosen are: Inspection Pass Count, Inspection Fail Count, Level 1 Violation Count, Level 2 Violation Count, Level 3 Violation Count, Violation Pass Count, Violation Fail Count, Total Review Count, Year Count, Average Rating, and Price (after converting it to numeric).
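For reference, the price conversion can be sketched as follows; the names here are stand-ins for my actual columns. Yelp reports price as "$" through "$$$$", so counting characters yields a 1–4 numeric scale:

```r
# Hypothetical column values standing in for my Yelp price field:
price <- c("$", "$$", "$$$", "$$$$")
PriceNum <- nchar(price)  # "$" -> 1, "$$" -> 2, ..., "$$$$" -> 4
PriceNum
```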
I selected these variables for the PCA because my goal is to identify one or more factors with an eigenvalue above the conventional cutoff of 1.0, which would help me establish whether my variables are essentially unidimensional across the dataset.
Once my PC is identified, I will use the loading scores of my 11 variables on that component to create a new variable, ‘performance’. Performance is a latent construct that attempts to capture the relationship between Yelp performance and food inspection performance in a single variable, by means of the dimension reduction given by PCA.
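As a point of comparison, prcomp already returns each observation's score on PC1 directly (the x matrix), which is the standard dimension-reduced variable; my construct instead re-weights the raw variables by their contributions to PC1. A sketch on built-in data:

```r
# Stand-in for final.pca2; pca$x[, 1] holds each row's PC1 score
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
pc1_scores <- pca$x[, 1]
length(pc1_scores) == nrow(mtcars)  # one score per observation
```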
The following code outlines the various steps I undertook in this process:
final.pca2 <- prcomp(final[, c(3:9, 22, 24, 25, 40)], center = TRUE, scale. = TRUE)
summary(final.pca2)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4504 1.2903 1.03385 0.84438 0.70339 0.59114 0.54343
## Proportion of Variance 0.5458 0.1513 0.09717 0.06482 0.04498 0.03177 0.02685
## Cumulative Proportion 0.5458 0.6972 0.79435 0.85917 0.90414 0.93591 0.96276
## PC8 PC9 PC10 PC11
## Standard deviation 0.53111 0.34397 0.09527 0.01354
## Proportion of Variance 0.02564 0.01076 0.00083 0.00002
## Cumulative Proportion 0.98840 0.99916 0.99998 1.00000
eig.val <- get_eigenvalue(final.pca2)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 6.0042425176 54.584022887 54.58402
## Dim.2 1.6647564773 15.134149794 69.71817
## Dim.3 1.0688362600 9.716693273 79.43487
## Dim.4 0.7129828491 6.481662265 85.91653
## Dim.5 0.4947639115 4.497853741 90.41438
## Dim.6 0.3494490528 3.176809571 93.59119
## Dim.7 0.2953111381 2.684646710 96.27584
## Dim.8 0.2820799301 2.564363001 98.84020
## Dim.9 0.1183179564 1.075617786 99.91582
## Dim.10 0.0090765437 0.082514034 99.99833
## Dim.11 0.0001833633 0.001666939 100.00000
#PC1 has the highest eigenvalue, but PC2 and PC3 are also above 1. Please correct me if I am wrong, but I am still eliminating PC2 and PC3 because they sit so close to the threshold of 1. This lets me select PC1 as the factor whose loadings will determine the creation of my latent variable. Additionally, at 54.58%, PC1 explains the majority of the variance within this subset of the original data, leading me to conclude that this is a useful path forward.
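A scree plot makes the eigenvalue-1 cutoff easy to judge visually. This is a base-R sketch on stand-in data; substituting final.pca2 for pca reproduces the plot for my analysis (fviz_eig() from factoextra, the same package as get_eigenvalue(), is the fancier alternative):

```r
# Stand-in PCA object; replace with final.pca2 for the real plot
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
plot(pca$sdev^2, type = "b", xlab = "Component", ylab = "Eigenvalue")
abline(h = 1, lty = 2)  # the eigenvalue-1 cutoff
```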
#The following are some illustrations of the PCA results. An interesting grouping emerged between the two types of establishments, eating and drinking versus eating and drinking w/ takeout; I plan on further exploring this difference next week.
description<-final[,39]
gdescription <- ggbiplot(final.pca2, obs.scale = 1, var.scale = 1,
                         groups = description, ellipse = TRUE, circle = TRUE)
gdescription <- gdescription + scale_color_discrete(name = '')
gdescription <- gdescription + theme(legend.direction = 'horizontal',
                                     legend.position = 'top')
gdescription
firstreview<-as.character(final[,27])
gfirst <- ggbiplot(final.pca2, obs.scale = 1, var.scale = 1,
                   groups = firstreview, ellipse = TRUE, circle = TRUE)
gfirst <- gfirst + scale_color_discrete(name = '')
gfirst <- gfirst + theme(legend.direction = 'horizontal',
                         legend.position = 'top')
gfirst
#Now that I have seen some illustrations, the variance along PC1 becomes clearer and more important for my new latent construct of performance. The first step is to isolate each variable's contribution to PC1.
res.var <- get_pca_var(final.pca2)
result<-as.data.frame(res.var$contrib)
result
## Dim.1 Dim.2 Dim.3 Dim.4
## Inspection.Fail.Count 13.9989440 0.447365376 0.44906546 0.23731764
## Inspection.Pass.Count 11.4092031 0.073554576 1.17563440 4.69284523
## Violation.Level1.Count 15.2213908 0.004481036 0.83967582 0.40790806
## Violation.Level2.Count 11.7558760 0.987285272 3.68353070 1.38054569
## Violation.Level3.Count 12.4119471 0.193642133 1.40075737 0.06848382
## Violation.Pass.Count 15.9783682 0.058619331 1.22474997 0.09627494
## Violation.Fail.Count 15.6280642 0.004928677 1.31790593 0.83264729
## review_count_total 0.1586208 37.393161626 11.62846220 0.30934711
## year_count 2.8578861 0.230818652 60.93750758 9.98755861
## ave_rating 0.3535249 23.103255290 17.30750388 55.37379139
## PriceNum 0.2261749 37.502888030 0.03520667 26.61328023
## Dim.5 Dim.6 Dim.7 Dim.8
## Inspection.Fail.Count 0.042996957 4.57640087 0.79050582 22.61915838
## Inspection.Pass.Count 9.757291217 52.50366101 5.56377970 3.58654292
## Violation.Level1.Count 0.611211222 3.62554775 0.30835393 14.77286219
## Violation.Level2.Count 0.003167978 10.40097006 30.44204797 33.23057659
## Violation.Level3.Count 0.227288750 0.93667210 60.22257946 17.77251521
## Violation.Pass.Count 0.077391598 0.86668596 0.02637375 0.26788458
## Violation.Fail.Count 1.035465471 4.39887794 0.91123944 4.74623335
## review_count_total 45.095943106 4.65677194 0.61089565 0.06566162
## year_count 8.214899506 16.59550588 0.66341690 0.50687467
## ave_rating 1.318465266 1.41716076 0.06485748 0.89088105
## PriceNum 33.615878930 0.02174573 0.39594990 1.54080944
## Dim.9 Dim.10 Dim.11
## Inspection.Fail.Count 53.119957626 3.703458e+00 1.482974e-02
## Inspection.Pass.Count 7.687961530 3.544928e+00 4.597932e-03
## Violation.Level1.Count 13.293377016 1.790117e-01 5.073618e+01
## Violation.Level2.Count 7.182046053 2.430196e-02 9.096517e-01
## Violation.Level3.Count 3.117336772 3.788341e-02 3.610894e+00
## Violation.Pass.Count 14.645400910 4.690720e+01 1.985105e+01
## Violation.Fail.Count 0.668584745 4.558355e+01 2.487250e+01
## review_count_total 0.080563976 5.699533e-04 2.048936e-06
## year_count 0.004240489 1.275798e-03 1.585510e-05
## ave_rating 0.156099866 1.419189e-02 2.682292e-04
## PriceNum 0.044431017 3.630519e-03 4.605408e-06
onefactor<-result[,1, drop=FALSE]
onefactor
## Dim.1
## Inspection.Fail.Count 13.9989440
## Inspection.Pass.Count 11.4092031
## Violation.Level1.Count 15.2213908
## Violation.Level2.Count 11.7558760
## Violation.Level3.Count 12.4119471
## Violation.Pass.Count 15.9783682
## Violation.Fail.Count 15.6280642
## review_count_total 0.1586208
## year_count 2.8578861
## ave_rating 0.3535249
## PriceNum 0.2261749
#Adding a new variable, performance, as a weighted sum of the 11 selected variables, using their percent contributions to PC1 as weights
final$performance<-final$Inspection.Fail.Count*onefactor[1,1] +
final$Inspection.Pass.Count*onefactor[2,1] +
final$Violation.Level1.Count*onefactor[3,1] +
final$Violation.Level2.Count*onefactor[4,1] +
final$Violation.Level3.Count*onefactor[5,1] +
final$Violation.Pass.Count*onefactor[6,1] +
final$Violation.Fail.Count*onefactor[7,1] +
final$review_count_total*onefactor[8,1] +
final$year_count*onefactor[9,1] +
final$ave_rating*onefactor[10,1] +
final$PriceNum*onefactor[11,1]
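The long sum above is equivalent to a single matrix product of the data with the scaled contribution vector; with my objects it would be roughly `as.matrix(final[, vars]) %*% (onefactor[, 1] * 0.01)`. A self-contained sketch on built-in data (for a standardized PCA, each variable's percent contribution to PC1 is its squared loading times 100):

```r
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
contrib <- pca$rotation[, 1]^2 * 100           # % contributions to PC1
weighted <- as.matrix(mtcars) %*% (contrib * 0.01)
# Same result as adding the variable*weight terms one by one:
byhand <- numeric(nrow(mtcars))
for (j in seq_along(contrib)) byhand <- byhand + mtcars[[j]] * contrib[j] * 0.01
all.equal(as.numeric(weighted), byhand)
```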
#Rescaling the variable – the contributions were reported as percentages, so this is only a conversion to proportions
final$performance<-final$performance*0.01
#Summary
summary(final$performance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4793 15.3773 31.1025 39.4375 55.1682 277.6467
final[which.max(final$performance),]
## Restaurant.Name Address
## 4089 The Real Deal 736 Centre StJamaica Plain, MA 02130
## nsa descript PriceNum performance
## 4089 Jamaica Central/South/Sumner Eating & Drinking 1 277.6467
final[which.min(final$performance),]
## Restaurant.Name Address
## 1380 Great Sushi Express 48 Winter StBoston, MA 02108
## descript PriceNum performance
## 1380 Eating & Drinking w/ Take Out 1 0.4793246
#Now it is time to do some aggregations to see what ‘performance’ reveals about establishments and the dataset as a whole. Note that I am basing all my aggregations on the NSA level, per Prof. O’Brien’s feedback.
agg1 <-as.data.frame(aggregate(final, by=list(final$nsa, final$performance),
FUN=mean, na.rm=TRUE))
aggPerformance<-agg1[,1:2, drop = FALSE]
aggPerformance[which.max(aggPerformance$Group.2),]
## Group.1 Group.2
## 423 Jamaica Central/South/Sumner 277.6467
aggPerformance[which.min(aggPerformance$Group.2),]
## Group.1 Group.2
## 1 Downtown/Central/West End 0.4793246
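One caution for myself here: because performance also appears in by=, each (nsa, performance) pair forms its own group, so Group.2 holds individual values rather than true per-NSA means. Grouping by NSA alone would give the per-NSA averages; a toy sketch (with my data the call would be `aggregate(performance ~ nsa, data = final, FUN = mean)`):

```r
# Toy data standing in for final; two restaurants in NSA "A", one in "B"
toy <- data.frame(nsa = c("A", "A", "B"), performance = c(1, 3, 5))
aggregate(performance ~ nsa, data = toy, FUN = mean)
#   nsa performance
# 1   A           2
# 2   B           5
```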
#It becomes clear that the highest performance score falls in the Jamaica Central/South/Sumner NSA, and the lowest in the Downtown/Central/West End NSA. How is this reflected in another variable, failed inspections?
agg2 <-as.data.frame(aggregate(final, by=list(final$nsa, final$Inspection.Fail.Count),
FUN=mean, na.rm=TRUE))
aggInspectionFail<-agg2[,1:2, drop = FALSE]
aggInspectionFail[which.max(aggInspectionFail$Group.2),]
## Group.1 Group.2
## 299 Jamaica Central/South/Sumner 49
aggInspectionFail[which.min(aggInspectionFail$Group.2),]
## Group.1 Group.2
## 1 Chinatown/SCove/Bay Village 1
#Downtown/Central/West End turns out to have the second-lowest average number of failed inspections (Chinatown/SCove/Bay Village is lowest), which is broadly consistent with the first aggregation.
Based on the above, I am very tentatively concluding that my latent construct does capture the overall performance of an establishment. But there are some problems with this process: the variance among the Yelp-derived indicators is too low, which inflates the weight of the food inspection variables in the loadings of this factor. A potential solution would be to scrape Yelp for the most common words in each individual establishment’s reviews. That would add another variable that might significantly alter both the distribution of variance proportions and the loadings on the new PCs created by a new PCA. But as it stands, ‘performance’ as a latent construct is an interesting idea that needs more data to be fully realized and implemented.