Latent Construct – Food Inspections

Principal Components Analysis (PCA) is a statistical method for reducing the dimensionality of a large, multivariate dataset: it lowers the number of variables while retaining a significant amount of the information in the original data. Geometrically, each principal component represents a direction in the data that explains the maximum remaining variance, with the eigenvectors of the covariance matrix giving the directions of the components and the eigenvalues giving the amount of variance each component explains. It is possible to conduct a PCA on categorical variables using a nonlinear PCA methodology, but that approach is 1) not widely used in statistics, and therefore not extensively tested, and 2) somewhat beyond my current skillset.
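The eigenvector/eigenvalue relationship described above can be verified directly in R. This is a minimal sketch on simulated data; the matrix `X` is a hypothetical stand-in for any continuous dataset:

```r
# Sketch: with centering and scaling, prcomp() amounts to an
# eigendecomposition of the correlation matrix; 'X' is simulated data.
set.seed(1)
X <- matrix(rnorm(200), ncol = 4)

pca <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- eigen(cor(X))

# The squared component standard deviations are the eigenvalues
# (variance explained); the rotation matrix holds the eigenvectors.
all.equal(pca$sdev^2, eig$values)  # TRUE
```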

For this week, I am attempting to perform a PCA on the dataset using select continuous variables. The 11 variables I have chosen are:

- Inspection Pass Count
- Inspection Fail Count
- Level 1 Violation Count
- Level 2 Violation Count
- Level 3 Violation Count
- Violation Pass Count
- Violation Fail Count
- Total Review Count
- Year Count
- Average Rating
- Price (after converting to numerical)

I selected these variables for the PCA because my goal is to identify one or more factors that clear the standard eigenvalue cutoff of 1.0, helping me determine whether my variables share a small number of underlying dimensions throughout the dataset.
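The eigenvalue cutoff of 1.0 (the Kaiser criterion) can be applied mechanically once a PCA is fitted. A minimal sketch on simulated data, where `pca` is a hypothetical stand-in for the real fit:

```r
# Sketch: retain components whose eigenvalue (sdev^2) is >= 1, i.e.
# components explaining more variance than a single standardized
# variable; 'pca' is fit on simulated stand-in data.
set.seed(2)
pca  <- prcomp(matrix(rnorm(300), ncol = 6), center = TRUE, scale. = TRUE)
keep <- which(pca$sdev^2 >= 1)
pca$sdev[keep]^2  # eigenvalues of the retained components
```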

Once my PC is identified, I will use the loading scores of my 11 variables on that component to create a new variable, 'performance'. Performance is a latent construct that attempts to capture the relationship between Yelp performance and food inspection performance in one single variable, by means of the dimension reduction provided by PCA.

The following code outlines the various steps I undertook in this process:


final.pca2 <- prcomp(final[,c(3:9, 22, 24, 25, 40)], center = TRUE, scale. = TRUE)
summary(final.pca2)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4504 1.2903 1.03385 0.84438 0.70339 0.59114 0.54343
## Proportion of Variance 0.5458 0.1513 0.09717 0.06482 0.04498 0.03177 0.02685
## Cumulative Proportion  0.5458 0.6972 0.79435 0.85917 0.90414 0.93591 0.96276
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.53111 0.34397 0.09527 0.01354
## Proportion of Variance 0.02564 0.01076 0.00083 0.00002
## Cumulative Proportion  0.98840 0.99916 0.99998 1.00000

eig.val <- get_eigenvalue(final.pca2)
eig.val

##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  6.0042425176     54.584022887                    54.58402
## Dim.2  1.6647564773     15.134149794                    69.71817
## Dim.3  1.0688362600      9.716693273                    79.43487
## Dim.4  0.7129828491      6.481662265                    85.91653
## Dim.5  0.4947639115      4.497853741                    90.41438
## Dim.6  0.3494490528      3.176809571                    93.59119
## Dim.7  0.2953111381      2.684646710                    96.27584
## Dim.8  0.2820799301      2.564363001                    98.84020
## Dim.9  0.1183179564      1.075617786                    99.91582
## Dim.10 0.0090765437      0.082514034                    99.99833
## Dim.11 0.0001833633      0.001666939                   100.00000

#PC1 has the highest eigenvalue, but PC2 and PC3 are also above 1. Please correct me if I am wrong, but I am still eliminating PC2 and PC3 because they sit close to the threshold of 1. This lets me select PC1 as the factor whose loadings will determine the creation of my latent variable. Additionally, at 54.58%, PC1 explains the majority of the variance within this subset of the original data, leading me to conclude that this is a useful path forward.

#The following are some illustrations of the PCA results. An interesting grouping was of the two types of establishments, eating & drinking and eating & drinking w/ takeout; I plan on exploring this difference further next week.
description<-final[,39]
gdescription<- ggbiplot(final.pca2, obs.scale = 1, var.scale = 1,
              groups = description, ellipse = TRUE, circle = TRUE)
gdescription <- gdescription + scale_color_discrete(name = '')
gdescription <- gdescription + theme(legend.direction = 'horizontal',
              legend.position = 'top')
gdescription

firstreview<-as.character(final[,27])
gfirst<- ggbiplot(final.pca2, obs.scale = 1, var.scale = 1,
              groups = firstreview, ellipse = TRUE, circle = TRUE)
gfirst <- gfirst + scale_color_discrete(name = '')
gfirst <- gfirst + theme(legend.direction = 'horizontal',
              legend.position = 'top')
gfirst

#Now that I have seen some illustrations, the variance along PC1 becomes clearer and more important for my new latent construct of performance. The first step is to isolate the variable contributions to PC1
res.var <- get_pca_var(final.pca2)
result<-as.data.frame(res.var$contrib)
result

##                             Dim.1        Dim.2       Dim.3       Dim.4
## Inspection.Fail.Count  13.9989440  0.447365376  0.44906546  0.23731764
## Inspection.Pass.Count  11.4092031  0.073554576  1.17563440  4.69284523
## Violation.Level1.Count 15.2213908  0.004481036  0.83967582  0.40790806
## Violation.Level2.Count 11.7558760  0.987285272  3.68353070  1.38054569
## Violation.Level3.Count 12.4119471  0.193642133  1.40075737  0.06848382
## Violation.Pass.Count   15.9783682  0.058619331  1.22474997  0.09627494
## Violation.Fail.Count   15.6280642  0.004928677  1.31790593  0.83264729
## review_count_total      0.1586208 37.393161626 11.62846220  0.30934711
## year_count              2.8578861  0.230818652 60.93750758  9.98755861
## ave_rating              0.3535249 23.103255290 17.30750388 55.37379139
## PriceNum                0.2261749 37.502888030  0.03520667 26.61328023
##                               Dim.5       Dim.6       Dim.7       Dim.8
## Inspection.Fail.Count   0.042996957  4.57640087  0.79050582 22.61915838
## Inspection.Pass.Count   9.757291217 52.50366101  5.56377970  3.58654292
## Violation.Level1.Count  0.611211222  3.62554775  0.30835393 14.77286219
## Violation.Level2.Count  0.003167978 10.40097006 30.44204797 33.23057659
## Violation.Level3.Count  0.227288750  0.93667210 60.22257946 17.77251521
## Violation.Pass.Count    0.077391598  0.86668596  0.02637375  0.26788458
## Violation.Fail.Count    1.035465471  4.39887794  0.91123944  4.74623335
## review_count_total     45.095943106  4.65677194  0.61089565  0.06566162
## year_count              8.214899506 16.59550588  0.66341690  0.50687467
## ave_rating              1.318465266  1.41716076  0.06485748  0.89088105
## PriceNum               33.615878930  0.02174573  0.39594990  1.54080944
##                               Dim.9       Dim.10       Dim.11
## Inspection.Fail.Count  53.119957626 3.703458e+00 1.482974e-02
## Inspection.Pass.Count   7.687961530 3.544928e+00 4.597932e-03
## Violation.Level1.Count 13.293377016 1.790117e-01 5.073618e+01
## Violation.Level2.Count  7.182046053 2.430196e-02 9.096517e-01
## Violation.Level3.Count  3.117336772 3.788341e-02 3.610894e+00
## Violation.Pass.Count   14.645400910 4.690720e+01 1.985105e+01
## Violation.Fail.Count    0.668584745 4.558355e+01 2.487250e+01
## review_count_total      0.080563976 5.699533e-04 2.048936e-06
## year_count              0.004240489 1.275798e-03 1.585510e-05
## ave_rating              0.156099866 1.419189e-02 2.682292e-04
## PriceNum                0.044431017 3.630519e-03 4.605408e-06
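For reference, the contribution percentages that get_pca_var() reports are just the squared entries of each component's eigenvector, times 100; since each rotation column is a unit vector, the contributions to a component sum to 100. A sketch on simulated data:

```r
# Sketch: variable contributions (%) to PC1 equal 100 * squared
# rotation entries, since each rotation column has unit length.
set.seed(3)
pca <- prcomp(matrix(rnorm(300), ncol = 5), center = TRUE, scale. = TRUE)
contrib_pc1 <- 100 * pca$rotation[, 1]^2
sum(contrib_pc1)  # 100
```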

onefactor<-result[,1, drop=FALSE]
onefactor

##                             Dim.1
## Inspection.Fail.Count  13.9989440
## Inspection.Pass.Count  11.4092031
## Violation.Level1.Count 15.2213908
## Violation.Level2.Count 11.7558760
## Violation.Level3.Count 12.4119471
## Violation.Pass.Count   15.9783682
## Violation.Fail.Count   15.6280642
## review_count_total      0.1586208
## year_count              2.8578861
## ave_rating              0.3535249
## PriceNum                0.2261749

#Adding a new variable, performance, as a weighted sum of the 11 selected variables, weighted by their contributions towards PC1
final$performance<-final$Inspection.Fail.Count*onefactor[1,1] +
  final$Inspection.Pass.Count*onefactor[2,1] +
  final$Violation.Level1.Count*onefactor[3,1] +
  final$Violation.Level2.Count*onefactor[4,1] +
  final$Violation.Level3.Count*onefactor[5,1] +
  final$Violation.Pass.Count*onefactor[6,1] +
  final$Violation.Fail.Count*onefactor[7,1] +
  final$review_count_total*onefactor[8,1] +
  final$year_count*onefactor[9,1] +
  final$ave_rating*onefactor[10,1] +
  final$PriceNum*onefactor[11,1]


#Rescaling the variable: the contributions were presented as percentages, so this simply converts the weights to proportions
final$performance<-final$performance*0.01
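The term-by-term sum above can also be written as a single matrix product. A toy sketch, where the data frame `df` and the named weight vector `w` are hypothetical stand-ins for `final` and the Dim.1 column of `onefactor`:

```r
# Sketch: composite score as a matrix product of the data columns and
# their PC1 contribution weights (converted from % to proportions).
df <- data.frame(a = c(1, 2), b = c(3, 4))
w  <- c(a = 10, b = 20) * 0.01
df$performance <- as.numeric(as.matrix(df[, names(w)]) %*% w)
df$performance  # 0.7 1.0
```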

#Summary
summary(final$performance)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##   0.4793  15.3773  31.1025  39.4375  55.1682 277.6467

final[which.max(final$performance),]

##      Restaurant.Name                              Address
## 4089   The Real Deal 736 Centre StJamaica Plain, MA 02130                   
##                               nsa          descript PriceNum performance
## 4089 Jamaica Central/South/Sumner Eating & Drinking        1    277.6467

final[which.min(final$performance),]

##          Restaurant.Name                      Address
## 1380 Great Sushi Express 48 Winter StBoston, MA 02108                    
##                           descript PriceNum performance
## 1380 Eating & Drinking w/ Take Out        1   0.4793246

#Now it is time to do some aggregations to see what 'performance' reveals about establishments and the dataset as a whole. Note: I am basing all my aggregations on the NSA level, per Prof. O'Brien's feedback.
agg1 <-as.data.frame(aggregate(final, by=list(final$nsa, final$performance),
  FUN=mean, na.rm=TRUE))

aggPerformance<-agg1[,1:2, drop = FALSE]

aggPerformance[which.max(aggPerformance$Group.2),]

##                          Group.1  Group.2
## 423 Jamaica Central/South/Sumner 277.6467

aggPerformance[which.min(aggPerformance$Group.2),]

##                     Group.1   Group.2
## 1 Downtown/Central/West End 0.4793246

#It becomes clear that the highest average performance score is in the Jamaica Central/South/Sumner NSA, and the lowest is in the Downtown/Central/West End NSA. How is this reflected in another variable, failed inspections?
agg2 <-as.data.frame(aggregate(final, by=list(final$nsa, final$Inspection.Fail.Count),
FUN=mean, na.rm=TRUE))

aggInspectionFail<-agg2[,1:2, drop = FALSE]

aggInspectionFail[which.max(aggInspectionFail$Group.2),]

##                          Group.1 Group.2
## 299 Jamaica Central/South/Sumner      49

aggInspectionFail[which.min(aggInspectionFail$Group.2),]

##                       Group.1 Group.2
## 1 Chinatown/SCove/Bay Village       1

#We find that the Downtown/Central/West End NSA has the second-lowest average number of failed inspections, which is broadly consistent with the first aggregation.
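Since the aggregations are meant to be at the NSA level, grouping by the neighborhood column alone yields exactly one mean per NSA. A toy sketch (the data frame `toy` is hypothetical):

```r
# Sketch: aggregating by only the neighborhood column gives a single
# mean performance value per NSA.
toy <- data.frame(nsa = c("A", "A", "B"), performance = c(1, 3, 5))
aggregate(performance ~ nsa, data = toy, FUN = mean)
```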

Based on the above, I am very tentatively concluding that my latent construct does capture the overall performance of an establishment. But there are some problems with this process: the variance among the Yelp-derived indicators is too low, which inflates the contribution of the food inspection variables to the loadings of this factor. A potential solution would be to scrape Yelp for the most common words in each individual establishment's reviews. That would add another variable that might significantly alter both the distribution of variance proportions and the loadings on the new components created by a new PCA. But as it stands, 'performance' as a latent construct is an interesting idea that needs more data to be fully realized and implemented.

