Comparing Groups in Cambridge

This week, I examined some of my constructs in more detail at the parcel level. I am working at the parcel level for now because it gives a larger sample size for quantitative analysis. The data for this week combine the building permit data with the year built and total home value for each parcel.

```{r include=FALSE}
library(readr)
library(dplyr)
library(lubridate)
library(curl)
library(devtools)
library(ggplot2)
library(sqldf)
library(stringr)
library(easyGgplot2)
library(aCRM)
require(rgdal)
require(sp)
require(ggmap)
library(sf)
require(Hmisc)
library(corrplot)
library(data.table)
require(reshape2)
blockc <- read.csv('~/Desktop/Big Cities/blockc.csv')
names(blockc)
```

First, I want to test whether parcels with and without energy-related permits differ in year built, total value, and total cost of construction.

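```{r}
# Welch two-sample t-tests: parcels with vs. without any energy-related permit
t.test(total_cost ~ energy_inv, data = blockc)
t.test(TOTAL_VAL ~ energy_inv, data = blockc)
t.test(YEAR_BUILT ~ energy_inv, data = blockc)

# Same comparisons for the solar permit indicator (solar_iv)
t.test(TOTAL_VAL ~ solar_iv, data = blockc)
t.test(YEAR_BUILT ~ solar_iv, data = blockc)
t.test(total_cost ~ solar_iv, data = blockc)
```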

Surprisingly, none of these tests shows a statistically significant difference between parcels with and without energy-related permits. I also broke this down by type of energy permit to examine group differences further, but again found no significant differences. The output for the solar permit comparisons is shown below.

	Welch Two Sample t-test

data:  TOTAL_VAL by solar_iv
t = 1.4409, df = 2370.4, p-value = 0.1498
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -175802.2 1149901.9
sample estimates:
mean in group 0 mean in group 1 
      1372496.7        885446.9 


	Welch Two Sample t-test

data:  YEAR_BUILT by solar_iv
t = 0.66792, df = 275.84, p-value = 0.5047
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -14.42075  29.23118
sample estimates:
mean in group 0 mean in group 1 
       1896.019        1888.614 


	Welch Two Sample t-test

data:  total_cost by solar_iv
t = -1.0054, df = 254.08, p-value = 0.3156
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1200806.6   389100.2
sample estimates:
mean in group 0 mean in group 1 
       104801.2        510654.4 
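Running six separate tests makes the results hard to compare at a glance. The sketch below shows one way to collect the same Welch t-tests into a single table. It runs on a small simulated data frame (`toy`) rather than on `blockc`, so every number it prints is made up; only the column names are borrowed from the analysis above, and `run_test` is a hypothetical helper.

```r
# Simulated stand-in for blockc; all values here are made up for illustration.
set.seed(1)
n <- 200
toy <- data.frame(
  energy_inv = rbinom(n, 1, 0.3),
  solar_iv   = rbinom(n, 1, 0.1),
  TOTAL_VAL  = rlnorm(n, meanlog = 13.5, sdlog = 1),
  YEAR_BUILT = round(rnorm(n, 1900, 30)),
  total_cost = rlnorm(n, meanlog = 10, sdlog = 1.5)
)

# One row per (outcome, grouping variable) pair.
grid <- expand.grid(
  outcome = c("total_cost", "TOTAL_VAL", "YEAR_BUILT"),
  group   = c("energy_inv", "solar_iv"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: run one Welch t-test and return its key numbers.
run_test <- function(outcome, group, data) {
  fit <- t.test(reformulate(group, response = outcome), data = data)
  data.frame(
    outcome = outcome,
    group   = group,
    mean_0  = unname(fit$estimate[1]),
    mean_1  = unname(fit$estimate[2]),
    t       = unname(fit$statistic),
    df      = unname(fit$parameter),
    p_value = fit$p.value
  )
}

results <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  run_test(grid$outcome[i], grid$group[i], toy)
}))
print(results)
```

Swapping `toy` for `blockc` should give the same tests as above in one data frame, assuming both indicator columns are coded 0/1.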

Next, I examined whether census tracts differ in the number of energy permits per parcel. A one-way ANOVA indicates that they do (F(32, 2541) = 1.599, p = 0.018).

```{r}
# Treat the tract code as a categorical grouping variable
blockc$TRACTCE10 <- as.factor(blockc$TRACTCE10)
anova <- aov(num_perm_energy ~ TRACTCE10, data = blockc)
class(anova)
summary(anova)
```

[1] "aov" "lm" 
              Df Sum Sq Mean Sq F value Pr(>F)  
TRACTCE10     32   13.8  0.4303   1.599  0.018 *
Residuals   2541  683.9  0.2691                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
2 observations deleted due to missingness
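As a quick check on how to read this table, the F value is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution with the two degrees of freedom. The sketch below recomputes both from the rounded numbers printed above, so small rounding differences are expected.

```r
# Values copied from the ANOVA table above (already rounded by R).
df_tract <- 32
df_resid <- 2541
ms_tract <- 0.4303   # mean square between tracts
ms_resid <- 0.2691   # mean square within tracts (residual)

# F is the between-group mean square divided by the residual mean square.
f_value <- ms_tract / ms_resid

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_tract, df_resid, lower.tail = FALSE)

print(round(f_value, 3))
print(round(p_value, 3))
```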

 

Finally, for the next set of analyses, I used the proportion of energy-related permits as a continuous dependent variable. I tested whether the mean proportion of energy permits differs across total cost of construction, which is split into four categorical groups (TC).
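For readers who want to build a similar grouping variable, here is a minimal sketch of how a continuous cost column could be binned with `cut()`. The category labels are the ones that appear in the output further down, but the exact break points and which side of each interval is closed are assumptions, and the cost values are made up.

```r
# Hypothetical construction costs in dollars (made-up values).
total_cost <- c(0, 25000, 99999, 100000, 350000, 500000, 999999, 1000000, 4200000)

# Assumed break points; right = FALSE makes each interval [lower, upper).
TC <- cut(
  total_cost,
  breaks = c(0, 100000, 500000, 1000000, Inf),
  labels = c("0 and 99,999", "100,000 and 499,999",
             "500,000 and 1MIL", "greater than one mill"),
  right = FALSE,
  include.lowest = TRUE
)

# Number of parcels in each cost category.
print(table(TC))
```

With these assumed breaks, a cost of exactly 1,000,000 falls in the top category.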

```{r}
anova <- aov(prop_energy ~ TC, data = blockc)
class(anova)
summary(anova)
```

[1] "aov" "lm" 
              Df Sum Sq Mean Sq F value Pr(>F)  
TC             3   0.91  0.3026   2.829 0.0372 *
Residuals   2570 274.91  0.1070                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
2 observations deleted due to missingness

The ANOVA finds a significant difference in the mean proportion of energy permits across the construction cost categories (F(3, 2570) = 2.829, p = 0.037).
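With more than 2,500 parcels, even a small difference can reach significance, so it may help to look at how much of the variance the cost categories account for. As a rough side check that is not part of the analysis above, the sketch below computes eta squared by hand from the sums of squares in the table. The inputs are rounded, so the result is approximate.

```r
# Sums of squares copied from the ANOVA table above (rounded by R).
ss_tc    <- 0.91     # between cost categories
ss_resid <- 274.91   # residual (within categories)

# Eta squared: share of total variance attributed to the grouping variable.
eta_sq <- ss_tc / (ss_tc + ss_resid)
print(round(eta_sq, 4))
```

A value this small would suggest that cost category explains well under one percent of the variation in the proportion of energy permits.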

```{r}
# Post hoc pairwise comparisons between the cost categories
TukeyHSD(anova)
```

Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = prop_energy ~ TC, data = blockc)

$TC
                                                  diff         lwr          upr     p adj
100,000 and 499,999-0 and 99,999          -0.045695618 -0.08818622 -0.003205019 0.0292937
500,000 and 1MIL-0 and 99,999             -0.040398627 -0.14060169  0.059804438 0.7280345
greater than one mill-0 and 99,999        -0.035326424 -0.18515309  0.114500241 0.9301587
500,000 and 1MIL-100,000 and 499,999       0.005296991 -0.10021323  0.110807214 0.9992325
greater than one mill-100,000 and 499,999  0.010369194 -0.14305761  0.163795993 0.9981370
greater than one mill-500,000 and 1MIL     0.005072203 -0.17318319  0.183327598 0.9998597
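When there are many pairs, it can be easier to pull the Tukey results out as a matrix and keep only the rows below a chosen threshold. The sketch below does this on simulated data; the group labels (`low`, `mid`, `high`), the data, and the 0.05 cutoff are all assumptions for illustration.

```r
# Simulated data with three hypothetical groups; values are made up.
set.seed(42)
levels_tc <- c("low", "mid", "high")
toy <- data.frame(
  TC = factor(rep(levels_tc, each = 50), levels = levels_tc),
  prop_energy = c(rnorm(50, 0.30, 0.1), rnorm(50, 0.20, 0.1), rnorm(50, 0.21, 0.1))
)

fit <- aov(prop_energy ~ TC, data = toy)

# TukeyHSD() returns a list with one matrix per factor; rows are group pairs.
tuk <- TukeyHSD(fit)$TC

# Keep only the pairs whose adjusted p-value is below 0.05.
sig <- tuk[tuk[, "p adj"] < 0.05, , drop = FALSE]
print(sig)
```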

I then plotted the group means with ggplot2 to visualize the differences between groups.

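```{r}
# Columns 11 and 22 of blockc are selected by position; one of them is TC
melted <- melt(blockc[c(11, 22)], id.vars = c("TC"))
means <- aggregate(value ~ TC, data = melted, mean)
names(means)[2] <- "mean"
ggplot(data = means, aes(x = TC, y = mean)) +
  geom_bar(stat = "identity", position = "dodge", fill = "blue") + ylab("Mean")
```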

[Figure: bar chart of the mean by construction cost category (TC)]
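A bar chart of means hides how uncertain each mean is, which matters here because the higher cost categories have wide confidence intervals in the Tukey output. The sketch below computes the mean, standard error, and group size per category directly with `aggregate()` (no `melt()` step) and, if ggplot2 is installed, adds error bars. The data and the group labels `A`, `B`, `C` are simulated placeholders.

```r
# Simulated data with three hypothetical categories; values are made up.
set.seed(7)
toy <- data.frame(
  TC = factor(rep(c("A", "B", "C"), times = c(40, 30, 20))),
  prop_energy = runif(90)
)

# Mean, standard error, and group size per category.
summ <- aggregate(
  prop_energy ~ TC, data = toy,
  FUN = function(x) c(mean = mean(x), se = sd(x) / sqrt(length(x)), n = length(x))
)
summ <- do.call(data.frame, summ)   # flatten the matrix column into columns
names(summ) <- c("TC", "mean", "se", "n")
print(summ)

# Plot only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(summ, aes(x = TC, y = mean)) +
    geom_col(fill = "blue") +
    geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.2) +
    ylab("Mean proportion of energy permits")
  print(p)
}
```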

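Together with the Tukey results, this shows that the mean proportion of energy-related permits is highest in the lowest cost category (0 to 99,999) and lower in the 100,000 to 499,999 category. This is the only pairwise difference that is significant (p adj = 0.029). In other words, energy-related permits appear to be concentrated in less expensive projects rather than in the 100,000 to 500,000 range. This is an interesting finding!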

