Comparing Groups for Cambridge

During the second midterm, I created a new dataset for Cambridge by merging the
“Master Address List” with the “Building Permits” datasets. In previous posts, I’ve
explored climate adaptation and housing affordability. Including a master address
list means we can compare groups that are spatially related. I’d like to tie this
week’s blog post into the analysis for my final, which involves predicting the
adoption of property adaptations over time.

The robustness of my measures matters. I’m going to apply my predictive model to two property
adaptations that I’m assuming are unrelated: solar panel installation and parking space
installation. Because my analysis treats them as statistically independent, it’s important
to verify that assumption. We can use spatial aggregations to compare these building
adaptations with a t-test and an ANOVA. Since ANOVA generalizes the t-test to more than two
groups, I’ll also include “Gross Square Footage” to see if building size is endogenous to
parking space or solar panel installation.

## Aggregating groups

library(dplyr)  # provides %>%, group_by(), and summarize()

# Census Block 2010
census <- master %>%
    group_by(`Census Block 2010`) %>%
    summarize(
        total_solar_panels = sum(SolarPanels, na.rm=TRUE),
        total_parking_spaces = sum(ParkingSpaces, na.rm=TRUE),
        total_sqfootage = sum(as.numeric(GrossSquareFootage), na.rm=TRUE)
    )

# Neighborhood
neighborhood <- master %>%
    group_by(Neighborhood) %>%
    summarize(
        total_solar_panels = sum(SolarPanels, na.rm=TRUE),
        total_parking_spaces = sum(ParkingSpaces, na.rm=TRUE),
        total_sqfootage = sum(as.numeric(GrossSquareFootage), na.rm=TRUE)
    )

# Election Ward
election_ward <- master %>%
    group_by(`Election Ward`) %>%
    summarize(
        total_solar_panels = sum(SolarPanels, na.rm=TRUE),
        total_parking_spaces = sum(ParkingSpaces, na.rm=TRUE),
        total_sqfootage = sum(as.numeric(GrossSquareFootage), na.rm=TRUE)
    )
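The three aggregations above differ only in the grouping column, so they could be folded into one helper. Here is a minimal sketch, assuming dplyr 1.0 or later; `aggregate_by` and the `toy_master` data frame are hypothetical names used for illustration and are not part of my original analysis.

```r
library(dplyr)

# Hypothetical helper: sum the three measures within each level of `group_col`.
# `group_col` is the grouping column's name as a string.
aggregate_by <- function(data, group_col) {
    data %>%
        group_by(across(all_of(group_col))) %>%
        summarize(
            total_solar_panels = sum(SolarPanels, na.rm = TRUE),
            total_parking_spaces = sum(ParkingSpaces, na.rm = TRUE),
            total_sqfootage = sum(as.numeric(GrossSquareFootage), na.rm = TRUE),
            .groups = "drop"
        )
}

# Toy stand-in for `master`; these values are made up.
toy_master <- data.frame(
    Neighborhood = c("A", "A", "B"),
    SolarPanels = c(2, NA, 5),
    ParkingSpaces = c(1, 3, 0),
    GrossSquareFootage = c("1000", "2500", "800"),
    stringsAsFactors = FALSE
)

by_neighborhood <- aggregate_by(toy_master, "Neighborhood")
print(by_neighborhood)
```

On the toy data, neighborhood A ends up with 2 solar panels, 4 parking spaces, and 3500 square feet. With the real data, the same call would take "Census Block 2010" or "Election Ward" as the column name.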

T-Test Results

We can now report t-test statistics for parking spaces and solar panels at each
spatial aggregation. My alternative hypothesis is that parking space and
solar panel installation are not statistically related at any geographical
level of aggregation, which I operationalize as a difference in their mean
counts. More formally,

\[H_a: \mu_1 \neq \mu_2\]

where

\[H_a = \text{ the alternative hypothesis}\]

\[\mu_1 = \text{ the mean of parking spaces for a spatial level of aggregation}\]

\[\mu_2 = \text{ the mean of solar panels for a spatial level of aggregation}\]

and

\[\alpha = 0.05\]
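By default, `t.test()` in R runs Welch's unequal-variance version of the test. To make clear what that statistic is, here is a rough sketch that computes it by hand and compares it with the built-in result. The vectors `x` and `y` are made-up values for illustration, not the Cambridge totals.

```r
# Made-up counts for illustration only; these are not the Cambridge totals.
x <- c(12, 40, 7, 95, 33, 18)
y <- c(3, 9, 1, 14, 6, 4)

# Squared standard error of each sample mean.
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch's t statistic: difference in means over its estimated standard error.
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom.
df <- (se2_x + se2_y)^2 /
    (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# Built-in Welch test for comparison.
welch <- t.test(x, y)
c(by_hand = t_stat, built_in = unname(welch$statistic))
```

The fractional degrees of freedom in the outputs that follow come from this approximation rather than from a simple count of observations.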

Census Block

t.test(census$total_solar_panels, census$total_parking_spaces)

## 
##  Welch Two Sample t-test
## 
## data:  census$total_solar_panels and census$total_parking_spaces
## t = 2.0866, df = 171.38, p-value = 0.0384
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##    4.935282 177.761147
## sample estimates:
## mean of x mean of y 
## 119.35714  28.00893

\(p \leq \alpha \Rightarrow \text{ reject } H_0 \Rightarrow \text{ suggest } H_a\)

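We can reject the null hypothesis of equal means because \(p \leq \alpha\): at the
Census Block level, the mean number of solar panels differs from the mean number
of parking spaces. Strictly, this shows a difference in means; it does not by
itself establish my assertion that parking spot and solar panel installation are
statistically independent at the Census Block level.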

Neighborhood

t.test(neighborhood$total_solar_panels, neighborhood$total_parking_spaces)

## 
##  Welch Two Sample t-test
## 
## data:  neighborhood$total_solar_panels and neighborhood$total_parking_spaces
## t = 2.2037, df = 20.801, p-value = 0.03895
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##    40.75875 1420.81268
## sample estimates:
## mean of x mean of y 
##  954.8571  224.0714

\(p \leq \alpha \Rightarrow \text{ reject } H_0 \Rightarrow \text{ suggest } H_a\)

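We can reject the null hypothesis of equal means because \(p \leq \alpha\): at the
Neighborhood level, the mean number of solar panels differs from the mean number
of parking spaces. As with any comparison of means, this speaks to a difference
in levels and does not by itself establish that parking spot and solar panel
installation are statistically independent at the Neighborhood level.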

Election Ward

t.test(election_ward$total_solar_panels, election_ward$total_parking_spaces)

## 
##  Welch Two Sample t-test
## 
## data:  election_ward$total_solar_panels and election_ward$total_parking_spaces
## t = 2.2469, df = 17.795, p-value = 0.03758
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##    54.72596 1650.44071
## sample estimates:
## mean of x mean of y 
## 1114.0000  261.4167

\(p \leq \alpha \Rightarrow \text{ reject } H_0 \Rightarrow \text{ suggest } H_a\)

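We can reject the null hypothesis of equal means because \(p \leq \alpha\): at the
Election Ward level, the mean number of solar panels differs from the mean number
of parking spaces. As with any comparison of means, this speaks to a difference
in levels and does not by itself establish that parking spot and solar panel
installation are statistically independent at the Election Ward level.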
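The t-tests above compare means. A complementary way to probe whether the two counts move together is a correlation test. This is a swapped-in technique that I did not run for this post, shown here as a hedged sketch on a hypothetical `toy_groups` data frame rather than on the Cambridge aggregates.

```r
# Hypothetical per-area totals; not the Cambridge aggregates.
toy_groups <- data.frame(
    total_solar_panels = c(10, 25, 40, 5, 60, 33, 18, 72),
    total_parking_spaces = c(4, 9, 2, 7, 11, 3, 8, 6)
)

# Spearman's rank correlation is less sensitive to outliers than Pearson's r.
assoc <- cor.test(toy_groups$total_solar_panels,
                  toy_groups$total_parking_spaces,
                  method = "spearman")

assoc$estimate  # rho: strength and direction of the monotone association
assoc$p.value   # small values are evidence against "no association"
```

A correlation near zero would be consistent with independence without proving it, since two variables can be uncorrelated and still dependent.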

ANOVA Results

We can also review a comparison of multiple groups with ANOVA for parking space and
solar panel installation, as well as total square footage.

\[H_a: \mu_3 \neq \mu_1, \mu_2\]

where

\[\mu_3 = \text{ the mean total square footage for a spatial level of aggregation}\]

and

\[\alpha = 0.05\]

Census Block

summary(aov(total_sqfootage ~ total_solar_panels + total_parking_spaces,
            data = census))

##                       Df    Sum Sq   Mean Sq F value   Pr(>F)    
## total_solar_panels     1 1.372e+14 1.372e+14  12.554 0.000583 ***
## total_parking_spaces   1 7.454e+13 7.454e+13   6.822 0.010273 *  
## Residuals            109 1.191e+15 1.093e+13                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can reject \(H_0\) and suggest \(H_a\) because \(p \leq \alpha\).
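Because both predictors here are numeric totals rather than factors, `aov()` is fitting a linear regression and reporting sequential (Type I) sums of squares, so the order of the terms matters. A small sketch on simulated data (the `sim` data frame is a hypothetical stand-in, not the Cambridge data) shows that `anova()` on the matching `lm()` fit returns the same F tests.

```r
set.seed(1)

# Simulated stand-in for an aggregated table; not the Cambridge data.
sim <- data.frame(
    total_solar_panels = rpois(30, 20),
    total_parking_spaces = rpois(30, 8)
)
sim$total_sqfootage <- 500 * sim$total_solar_panels + rnorm(30, sd = 2000)

# Same formula fitted two ways.
fit_aov <- aov(total_sqfootage ~ total_solar_panels + total_parking_spaces,
               data = sim)
fit_lm <- lm(total_sqfootage ~ total_solar_panels + total_parking_spaces,
             data = sim)

# Both tables hold Df, Sum Sq, Mean Sq, F value, and Pr(>F) per term.
aov_table <- summary(fit_aov)[[1]]
lm_table <- anova(fit_lm)

all.equal(aov_table[["F value"]], lm_table[["F value"]])
```

Swapping the order of the two predictors in the formula would be a quick check on how much the sequential sums of squares depend on that choice.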

Neighborhood

summary(aov(total_sqfootage ~ total_solar_panels + total_parking_spaces,
            data = neighborhood))

##                      Df    Sum Sq   Mean Sq F value Pr(>F)
## total_solar_panels    1 3.116e+14 3.116e+14   1.166  0.303
## total_parking_spaces  1 2.849e+13 2.849e+13   0.107  0.750
## Residuals            11 2.939e+15 2.672e+14

We fail to reject \(H_0\) because \(p > \alpha\).

Election Ward

summary(aov(total_sqfootage ~ total_solar_panels + total_parking_spaces,
            data = election_ward))

##                      Df    Sum Sq   Mean Sq F value Pr(>F)
## total_solar_panels    1 5.318e+14 5.318e+14   0.825  0.387
## total_parking_spaces  1 3.316e+14 3.316e+14   0.515  0.491
## Residuals             9 5.798e+15 6.442e+14

We fail to reject \(H_0\) because \(p > \alpha\).

Visualizing the Distributions

It’s informative to visualize the distributions we have been reviewing with the ANOVA
and t-test.

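library(ggplot2)

# Bars show parking spaces against solar panels per neighborhood, shaded by
# total square footage. Alpha is a fixed transparency, not a mapped variable,
# so it is set on the geoms instead of inside aes().
f1 <- ggplot(neighborhood, aes(total_solar_panels, total_parking_spaces,
                               label=Neighborhood, fill=total_sqfootage)) +
    geom_col(alpha=0.5) +
    geom_label(alpha=0.5) +
    labs(title = "Fig. 1: Visualizing Relationships Between Cambridge Neighborhoods",
         x = "Total Number of Solar Panels",
         y = "Total Number of Parking Spaces",
         fill = "Total Square Footage")
f1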

We can see the presence of outliers in this figure. These outliers should be removed
when doing further analysis for my final.
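One common way to flag outliers like these is the 1.5 × IQR rule that boxplots use. The sketch below applies it to a made-up vector, `toy_totals`, which is hypothetical and not taken from the Cambridge data; whether this rule suits my final analysis is still an open question.

```r
# Hypothetical neighborhood totals; the last value is an obvious outlier.
toy_totals <- c(120, 95, 140, 110, 130, 105, 2400)

# boxplot.stats() returns, in $out, the points lying more than
# 1.5 times the hinge spread beyond the lower or upper hinge.
outliers <- boxplot.stats(toy_totals)$out

# Keep only the values that were not flagged.
trimmed <- toy_totals[!toy_totals %in% outliers]

outliers
trimmed
```

Here the rule flags only the value 2400 and leaves the other six totals in place.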

Discussion

When reviewing the relationship between solar panels, parking spots, and total square footage
of a building for Cambridge, I first note the outliers we saw when visualizing the distribution
in Figure 1. These outliers are severe enough that the t-test and ANOVA results need
revisiting. I will need to remove outliers or use a transformation, possibly logarithmic,
to make reasonable inferences about the statistical independence of these groups.
I won’t try to interpret the statistical results further here, because outliers this extreme
suggest the normality assumption has been ‘super violated’. Despite the tenuous nature of
these statistical results, I was at least able to get some practice doing hypothesis testing
using a t-test and ANOVA.
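As a starting point for that follow-up, here is a hedged sketch of a logarithmic transformation paired with a Shapiro-Wilk normality check. The `toy_totals` vector is made up to mimic a right-skewed column; it is not one of the Cambridge aggregates, and the Shapiro-Wilk test is only one of several possible diagnostics.

```r
# Hypothetical right-skewed totals standing in for one aggregated column.
toy_totals <- c(3, 8, 12, 15, 22, 30, 41, 65, 120, 950)

# log1p() computes log(1 + x), so zero counts stay defined.
log_totals <- log1p(toy_totals)

# Shapiro-Wilk test: W close to 1 suggests approximate normality,
# and a small p-value indicates a departure from it.
raw_check <- shapiro.test(toy_totals)
log_check <- shapiro.test(log_totals)

c(raw_W = unname(raw_check$statistic), log_W = unname(log_check$statistic))
```

If the transformed totals look closer to normal, the t-test and ANOVA could be rerun on the log scale, keeping in mind that the hypotheses would then concern means of the logged counts.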

Seeing Boston Neighborhoods through Administrative Data

The Course Blog for "Big Data for Cities" at Northeastern University (PPUA5262)