I began by making a new binary variable for whether a neighborhood had greater than 1 cultural site, or had 1 or fewer.

TADCT <- read.csv("/TADCT.csv")

TADCT$CULT_YES <- ifelse(TADCT$Cult_count > 1, 1, 0)

summary(TADCT$CULT_YES)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.0000  1.0000  0.5899  1.0000  1.0000

The summary shows that about 59% of census tracts in Boston have more than one cultural site.

I next ran a t-test to see whether median residential value differs between tracts with and without multiple cultural sites.

t.test(resValue_median~CULT_YES, data=TADCT)

data: resValue_median by CULT_YES

t = 1.3661, df = 78.846, p-value = 0.1758

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-43396.78 233278.77

sample estimates:

mean in group 0 mean in group 1

512173.2 417232.2

The test shows a difference in means between census tracts with more than one cultural site and those with one or fewer. Confirming earlier analysis, tracts with more cultural sites have lower median home values. However, the t-test has a p-value of 0.18, so the difference is not statistically significant.

Next I conducted an ANOVA to compare mean residential parcel values by neighborhood.

Anova <- aov(resValue_median~BRA_PD, data=TADCT)

summary(Anova)

            Df         Sum Sq      Mean Sq F value   Pr(>F)
BRA_PD      16  6540283795839 408767737240   3.274 0.000065 ***
Residuals  156 19475545536842 124843240621

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

5 observations deleted due to missingness

The summary of the ANOVA shows statistically significant differences in mean residential property values between neighborhoods. The Tukey test conducts pairwise comparisons between neighborhoods; the sample below shows the difference in mean residential property value between each listed neighborhood and East Boston.
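The pairwise differences listed below would typically be produced with R's `TukeyHSD()` on the fitted `aov` object; here is a minimal sketch on invented data (the toy data frame is an assumption, standing in for the real model):

```r
# Toy illustration of Tukey's HSD after a one-way ANOVA;
# groups and values are invented for demonstration.
set.seed(1)
toy <- data.frame(
  value = c(rnorm(20, mean = 100), rnorm(20, mean = 110), rnorm(20, mean = 95)),
  group = rep(c("A", "B", "C"), each = 20)
)
fit <- aov(value ~ group, data = toy)
TukeyHSD(fit)  # pairwise mean differences with family-wise adjusted p-values
```

Each row of the output is one pair of groups, with the difference in means, a confidence interval, and an adjusted p-value.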

Fenway/Kenmore-East Boston 119815.833

Hyde Park-East Boston -16012.500

Jamaica Plain-East Boston 117075.449

Mattapan-East Boston -8616.667

North Dorchester-East Boston -21960.417

Roslindale-East Boston 31747.756

Roxbury-East Boston -14513.640

As expected, Fenway has higher value residences than East Boston, as does Jamaica Plain. Mattapan and Roxbury have lower values than East Boston.

I next constructed a chart to compare the residential home values.

melted <- melt(TADCT[c(15,13)], id.vars=c("BRA_PD"))

means <- aggregate(value~BRA_PD, data=melted, mean)

names(means)[2] <- "mean"

ggplot(data=means, aes(x=BRA_PD, y=mean)) + geom_bar(stat="identity", position="dodge", fill="blue") + ylab("Mean")

ses <- aggregate(value~BRA_PD, data=melted, function(x) sd(x, na.rm=TRUE)/sqrt(sum(!is.na(x))))

names(ses)[2] <- "se"

means <- merge(means, ses, by="BRA_PD")

means <- transform(means, lower=mean-se, upper=mean+se)

bar <- ggplot(data=means, aes(x=BRA_PD, y=mean)) + geom_bar(stat="identity", position="dodge", fill="blue") + xlab("Neighborhood") + ylab("Mean") + ggtitle("Residential Parcel Values by Neighborhood") + theme(axis.text.x = element_text(angle = 90, hjust = 1))

bar + geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(.9))

This graph shows the means of residential parcel values by neighborhood. The South End and Back Bay each stand out for having very high mean values but also wide variance.

---

In order to do that, I have to go back some steps and do some cleaning:

- I removed the cases that failed any of the following criteria:
- Removed all cases that lacked accurate information in the “Boston Redevelopment Agency Planning District” variable (“BRA_PD”) and in the census tract variable (“CT_ID_10”);
- Removed all cases that did not have a valid Issued or Expiration date;
- Removed all cases with invalid date ranges (an expiration date before the issue date);
- Removed all cases that refer to “late” fees.

And added new variables:

- Created a new variable to identify unique licenses by concatenating the LAST NAME and ADDRESSKEY variables. With this I created a new data set that contains only unique licenses.
- Created two new variables, “SAME.STATE” and “SAME.ZIP”, to identify licenses held by in-state owners and in-city owners.
- Created a new variable for each valid year of each observation, where a valid year is a year between the issued and expiration dates.
- Created a new variable called “Coffee” that flags businesses that are coffee shops, and a new data set that contains only coffee-related businesses.
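A rough sketch of those cleaning steps in base R, on a tiny invented licenses table (the column names, the ZIP list, and the matching rules here are my assumptions, not the original code):

```r
# Toy license records; column names follow the description above (assumed).
lic <- data.frame(
  LAST.NAME  = c("Smith", "Smith", "Jones"),
  ADDRESSKEY = c("A1", "A1", "B2"),
  STATE      = c("MA", "MA", "NY"),
  ZIP        = c("02115", "02115", "10001"),
  ISSUED     = as.Date(c("2012-03-01", "2013-03-01", "2014-06-15")),
  EXPIRES    = as.Date(c("2013-02-28", "2014-02-28", "2015-06-14")),
  DESCRIPT   = c("COFFEE HOUSE", "COFFEE HOUSE", "RESTAURANT"),
  stringsAsFactors = FALSE
)

# Unique-license ID: concatenate LAST NAME and ADDRESSKEY, then deduplicate
lic$LIC_ID <- paste(lic$LAST.NAME, lic$ADDRESSKEY, sep = "_")
lic.unique <- lic[!duplicated(lic$LIC_ID), ]

# In-state / in-city ownership flags (reference values are placeholders)
lic$SAME.STATE <- ifelse(lic$STATE == "MA", 1, 0)
lic$SAME.ZIP   <- ifelse(lic$ZIP %in% c("02115"), 1, 0)  # Boston ZIPs would go here

# Coffee flag from the license description
lic$Coffee <- ifelse(grepl("COFFEE", lic$DESCRIPT), 1, 0)

# Valid years: expand each license into one row per year it was active
years <- lapply(seq_len(nrow(lic)), function(i)
  data.frame(LIC_ID = lic$LIC_ID[i],
             year = seq(as.integer(format(lic$ISSUED[i], "%Y")),
                        as.integer(format(lic$EXPIRES[i], "%Y")))))
lic.years <- do.call(rbind, years)
```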

This new data set, called “Buss.Lic.Clean.coffee”, was aggregated by census tract and by year, and that information was merged with:

- the “CT_All_Ecometrics_2014” data set, to add information on land area and number of parcels;
- the “Building Permits” data set, to add the total active licenses each year from 2012 to 2015;
- the “Boston.ACS.SES” data set, to add socioeconomic indicators from the ACS for each data frame.
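The aggregation and merges described above might look roughly like this (toy data frames stand in for the real files; all column names are assumptions):

```r
# Toy coffee-business counts per census tract and year (names assumed)
coffee <- data.frame(CT_ID_10 = c("T1", "T1", "T2"),
                     year     = c(2012, 2015, 2012),
                     Coffee   = c(1, 1, 1))
agg <- aggregate(Coffee ~ CT_ID_10 + year, data = coffee, sum)

# Toy stand-in for the ecometrics file (land area, parcel counts)
eco <- data.frame(CT_ID_10 = c("T1", "T2"),
                  LandArea = c(1.2, 0.8),
                  Parcels  = c(500, 300))

# Left join: keep every tract-year row even if ecometrics data is missing
merged <- merge(agg, eco, by = "CT_ID_10", all.x = TRUE)
```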

The resulting data set contains 83 observations with 33 variables. Since the data set has incomplete information, I restrict attention to the period from 2012 to 2015, over which the reliability of the index is not compromised. I wanted to analyze the percentage change in the total number of coffee businesses per census tract from 2012 to 2015 and its relationship with the percentage change in building permits and with socioeconomic indicators over the same period.

For this assignment, I will compare the percentage change in the total number of coffee businesses between residential and non-residential neighborhoods. I would expect the mean of the second group to be higher than the mean of the first, and the difference to be significant.

Running the t-test gives the results shown below. While the mean for residential neighborhoods is 0.175, the mean for non-residential neighborhoods is 0.526. The p-value is 0.03, which is small, though I would not consider it small enough to rule out that the result is due to chance.
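The test described above would be a two-sample `t.test` of the percentage change grouped by neighborhood type; a sketch on simulated data (the variable names and distributions here are invented):

```r
set.seed(42)
# Simulated percentage change in coffee businesses, 2012-2015
chg <- data.frame(
  pct_change  = c(rnorm(40, mean = 0.175, sd = 0.5),   # residential tracts
                  rnorm(40, mean = 0.526, sd = 0.5)),  # non-residential tracts
  residential = rep(c(1, 0), each = 40)
)
t.test(pct_change ~ residential, data = chg)  # Welch two-sample t-test
```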

When we run an ANOVA test to see if there are differences between types of neighborhoods, we get a high p-value, indicating that this result is not significant.
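The ANOVA follows the same pattern with a multi-level factor; a sketch on invented data for four area types (the data here are simulated, not the author's):

```r
set.seed(7)
# Simulated percentage change for four invented area types
area <- factor(rep(c("Downtown", "Industrial/Institutional",
                     "Park", "Residential"), each = 20))
chg  <- c(rnorm(20, 0.5, 0.3), rnorm(20, 0.5, 0.3),
          rnorm(20, 0.3, 0.3), rnorm(20, 0.2, 0.3))
summary(aov(chg ~ area))  # F test of mean differences across the four types
```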

Looking at the differences among the four groups (Downtown, Industrial/Institutional, Park, and Residential areas), we see that the first two had an increase of about 50% in coffee shops over 2012-2015, while residential areas were below 25%.

---

**Overview**

Local Sense Lab released extended sensor data spanning ten days, which is helpful because the larger sample size enables more robust analysis. In this post, I test the difference in temperature between two sensor locations to further my previous analysis of the urban heat island (UHI) effect in Downtown Crossing (DTX).

I used a t-test to determine whether the temperature levels at city1 and city2 are significantly different, and an ANOVA to support the t-test findings.

**T Test**

The mean temperature at city1 is 26.65 degrees Celsius; at city2, 25.74. The t value is 26.114, which is very large, and the p-value is essentially zero, confirming that the difference is significant.

**ANOVA Test**

The F value came out to be 692, which means that the variance between the two locations is 692 times greater than the variance within each location. With a significant p-value, the test confirms that the two locations have significantly different temperature levels.
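With only two groups, a one-way ANOVA is equivalent to an equal-variance t-test: the F statistic is the square of t. The reported F of 692 differs slightly from 26.114² ≈ 682 presumably because `t.test` defaults to the Welch (unequal-variance) version. A quick check on simulated data:

```r
set.seed(3)
# Two simulated sensor locations with different mean temperatures
d <- data.frame(value   = c(rnorm(50, 26.6, 1), rnorm(50, 25.7, 1)),
                id_wasp = rep(c("city1", "city2"), each = 50))
tt <- t.test(value ~ id_wasp, data = d, var.equal = TRUE)  # pooled-variance t
f  <- summary(aov(value ~ id_wasp, data = d))[[1]]$`F value`[1]
all.equal(f, unname(tt$statistic)^2)  # TRUE: F equals t squared for two groups
```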

**Visualization**

The line graph shows that city1 has a higher temperature level most of the time, and the difference becomes greater during the hottest times of the day.

## Plot line graph

baseTemp <- ggplot(temp, aes(x=timestamp, y=value))

baseTemp+geom_line(aes(color=factor(id_wasp), group=id_wasp))+scale_color_discrete(name="Sensor")+labs(title="Temperature (Celsius)")

**R Code**

##—– Reading two CSV files from Local Sense Lab —–##

city <- read.csv("C:/Users/Jun Kim/Documents/RData/Data_assignment/Module10/sensor.data.city_extended.csv")

envi <- read.csv("C:/Users/Jun Kim/Documents/RData/Data_assignment/Module10/sensor.data.environment_extended.csv")

#remove columns that were used in the data transmission and do not provide analytical value for the city dataset

city2<-city[c(1:2,6:8)]

#remove columns that were used in the data transmission and do not provide analytical value for the environment dataset

envi2<-envi[c(1:2,6:8)]

# Mark all duplicates in the city dataset

city2$un_flg <- as.numeric(duplicated(city2[2:5]))

# Mark all duplicates in the environment dataset

envi2$un_flg <- as.numeric(duplicated(envi2[2:5]))

# Order the city data by id_wasp, sensor, and timestamp

city3 <- city2[order(city2$id_wasp, city2$sensor, city2$timestamp),]

# Order the environment data by id_wasp, sensor, and timestamp

envi3 <- envi2[order(envi2$id_wasp, envi2$sensor, envi2$timestamp),]

## Convert timestamp to POSIxct

city4 <- city3

# Changing the data type

newTimeEnv <- as.POSIXct(city4$timestamp, format="%Y-%m-%d %H:%M:%S")

# Assigning the new data to a new data frame

city4$timestamp <- newTimeEnv

# Verifying that the data type of timestamp column (should return POSIXct)

class(city4$timestamp[1])

# Repeat for envi

envi4 <- envi3

# Changing the data type

newTimeEnv <- as.POSIXct(envi4$timestamp, format="%Y-%m-%d %H:%M:%S")

# Assigning the new data to a new data frame

envi4$timestamp <- newTimeEnv

# Verifying that the data type of timestamp column (should return POSIXct)

class(envi4$timestamp[1])

# Remove Aug. 10 and Aug. 19 data for city (to compare each date)

city5 <- city4[!(city4$timestamp < "2016-08-11 00:00:00 EDT"),]

city5 <- city5[!(city5$timestamp > "2016-08-18 23:59:59 EDT"),]

# Remove duplicates (using the date-filtered city5 rather than city4, so the filter above is not discarded)

city6 <- city5[!(city5$un_flg==1),]

envi6 <- envi4[!(envi4$un_flg==1),]

# Subset by sensor and id_wasp

co <- envi6[which(envi6$sensor == “CO”), ]

coEnvi1 <- co[which(co$id_wasp == “environ1”), ]

coEnvi2 <- co[which(co$id_wasp == “environ2”), ]

no2 <- envi6[which(envi6$sensor == “NO2”), ]

no2Envi1 <- no2[which(no2$id_wasp == “environ1”), ]

no2Envi2 <- no2[which(no2$id_wasp == “environ2”), ]

o2 <- envi6[which(envi6$sensor == “O2”), ]

o2Envi1 <- o2[which(o2$id_wasp == “environ1”), ]

o2Envi2 <- o2[which(o2$id_wasp == “environ2”), ]

co2 <- envi6[which(envi6$sensor == “CO2”), ]

co2Envi1 <- co2[which(co2$id_wasp == “environ1”), ]

co2Envi2 <- co2[which(co2$id_wasp == “environ2”), ]

temp <- city6[which(city6$sensor == “GP_TC”), ]

tempCity1 <- temp[which(temp$id_wasp == “city1”), ]

tempCity2 <- temp[which(temp$id_wasp == “city2”), ]

humid <- city6[which(city6$sensor == “GP_HUM”), ]

humidCity1 <- humid[which(humid$id_wasp == “city1”), ]

humidCity2 <- humid[which(humid$id_wasp == “city2”), ]

decib <- city6[which(city6$sensor == “MCP”), ]

decibCity1 <- decib[which(decib$id_wasp == “city1”), ]

decibCity2 <- decib[which(decib$id_wasp == “city2”), ]

lumino <- city6[which(city6$sensor == “LUM”), ]

luminoCity1 <- lumino[which(lumino$id_wasp == “city1”), ]

luminoCity2 <- lumino[which(lumino$id_wasp == “city2”), ]

# Create new data frame for environ1

envi6_1 <- data.frame(coEnvi1$timestamp, coEnvi1$value, no2Envi1$value, o2Envi1$value, co2Envi1$value)

names(envi6_1)[1] <- “Time”

names(envi6_1)[2] <- “COenv1”

names(envi6_1)[3] <- “NO2env1”

names(envi6_1)[4] <- “O2env1”

names(envi6_1)[5] <- “CO2env1”

# Create new data frame for environ2

envi6_2 <- data.frame(coEnvi2$timestamp, coEnvi2$value, no2Envi2$value, o2Envi2$value, co2Envi2$value)

names(envi6_2)[1] <- “Time”

names(envi6_2)[2] <- “COenv2”

names(envi6_2)[3] <- “NO2env2”

names(envi6_2)[4] <- “O2env2”

names(envi6_2)[5] <- “CO2env2”

# List names for both data frames

names(envi6_1)

names(envi6_2)

## T test

t.test(tempCity1$value, tempCity2$value)

## ANOVA test

anova <- aov(value~id_wasp, data=temp)

summary(anova)

---

To begin I wanted to see if there was a relationship between public service amenities and education amenities. There are a number of different ways to check the relationship between these amenities; one of the most effective is a t-test. A t-test can only compare two variables at a time, so by looking at education and public service amenities aggregated by sub district, we can see whether there is a statistically significant difference between these two amenity types:
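The output that follows is consistent with a paired call on the two aggregated totals; a sketch with invented per-sub-district totals standing in for `edag` and `puag`:

```r
# Invented totals for ten sub districts (stand-ins for the real edag/puag)
edag <- data.frame(totedu = c(5, 12, 3, 8, 20, 1, 7, 9, 4, 11))
puag <- data.frame(totpub = c(6, 10, 4, 9, 25, 2, 6, 8, 5, 15))
t.test(edag$totedu, puag$totpub, paired = TRUE)  # df = 10 - 1 = 9
```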

*Paired t-test*

*data: edag$totedu and puag$totpub*

*t = -0.25993, df = 9, p-value = 0.8008*

*95 percent confidence interval:*

*-106.7314 84.7314*

*mean of the differences -11*

These results reveal that there is no statistically significant variation between them. A p-value of .8 firmly demonstrates an insignificant relationship between the two variables. The t-test is not our only tool when looking at sub districts and the different amenities.

Another tool we can use is the chi squared function which determines whether these categories of amenities are independent of sub district type. Looking at public service amenities we have an interesting result:
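`chisq.test` can be given a contingency table built from two vectors, which R cross-tabulates; a sketch on invented data (the real call presumably crossed `puag$totpub` with `puag$subdistric`):

```r
set.seed(5)
# Invented amenity counts and sub district types
amen <- sample(0:3, 40, replace = TRUE)
subd <- sample(c("residential", "commercial", "mixed"), 40, replace = TRUE)
# suppressWarnings: sparse cells trigger a small-expected-count warning
res <- suppressWarnings(chisq.test(table(amen, subd)))
res
```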

*Pearson’s Chi-squared test*

*data: puag$totpub and puag$subdistric*

*X-squared = 80, df = 72, p-value = 0.2424*

With a p value of .24 the relationship is insignificant and we find these variables to be independent of one another. This is surprising as one might expect public service amenities to exist in certain sub district types such as commercial or mixed use spaces but this does not appear to be the case. One category of amenity is residential and one of the sub district types is residential. For this result one would logically expect a dependent relationship but when we run a chi-squared test we find:

*Pearson’s Chi-squared test*

*data: resagg$rtot and resagg$subdistric*

*X-squared = 90, df = 81, p-value = 0.2313*

The p-value of .23 shows us once again that the variables are independent. There may be a number of reasons why this may be the case but first it may be helpful to look at this relationship graphically so we can visually observe trends.

subdismelt <- melt(resagg[2:10,], id.vars = 'subdistric')

ggplot(subdismelt, aes(variable, value)) + geom_bar(aes(fill = subdistric), position = "dodge", stat="identity")

This graph shows that it is not just residential sub district types that have residential amenities. Residential is also the most common sub district type by a large margin. So despite the fact that in the past I have observed a relative restrictiveness when it comes to sub district types, this does not necessarily play out when aggregating by amenity types. While we can run ANOVA using this data and get a result:

educsubdis <- aov(education~subdistric, data=zone)

            Df Sum Sq Mean Sq F value    Pr(>F)
subdistric   9 43.545  4.8383   83.44 < 2.2e-16 ***
Residuals 1627 94.342  0.0580

The results are essentially meaningless, however, because we are comparing two categorical variables, which ANOVA is not designed to handle. This is why the chi-squared test was used, as it is actually suited to this data set, which is almost entirely categorical.

---

> #add in ecometric and demographic data
> locate <- "C:/Users/Corinne/OneDrive/classes/big data for cities/unit 3/"
> ecometrics <- read.csv(paste(locate, "CT_All_Ecometrics_2014.csv", sep = ""))
> demog <- read.csv(paste(locate, "Tract Census Data.csv", sep = ""))

The goal of this week’s assignment was to familiarize myself with two different types of statistical tests, the t-test and the ANOVA test. Both of these compare the means of outdoor activity across a categorical variable such as neighborhood type. In the BARI ecometrics dataset, there is a variable called “Type” that categorizes census tracts based on whether they are primarily downtown, residential, parks, or institutional/industrial. However, I also wanted to analyze outdoor activity based on two categorical variables in the building permits dataset. The first of these is industry category: upper education, healthcare, religious, government, or civic. Categories are assigned to specific permits rather than to the census tract as a whole, so I aggregated them and assigned a category to each census tract based on the maximum number of permits of each type. For example, if a given census tract had more building permits identified as performing “religious” work than work in any other industry, I designated it as a primarily religious census tract.

> #create an aggregate dataframe for industry category
> industry1 <- aggregate(uppereducation~CT_ID_10, data = permits, sum)
> industry2 <- aggregate(healthcare~CT_ID_10, data = permits, sum)
> industry3 <- aggregate(religious~CT_ID_10, data = permits, sum)
> industry4 <- aggregate(government~CT_ID_10, data = permits, sum)
> industry5 <- aggregate(civic~CT_ID_10, data = permits, sum)
> total <- as.data.frame(table(permits$CT_ID_10))
> names(total) <- c("CT_ID_10", "Permits")
> industry <- merge(industry1, industry2, by = 'CT_ID_10')
> industry <- merge(industry, industry3, by = 'CT_ID_10')
> industry <- merge(industry, industry4, by = 'CT_ID_10')
> industry <- merge(industry, industry5, by = 'CT_ID_10')
> industry <- merge(industry, total, by = 'CT_ID_10')
> rm(industry1, industry2, industry3, industry4, industry5)
> #find the most common industry category in each census tract
> industry$industryval <- NA
> for (x in 1:nrow(industry)) {industry[x,8] <- names(which.max(industry[x,2:6]))}
> industry$industryval <- ifelse(rowSums(industry[2:6]) == 0, "other", industry$industryval)
> #control for excessive "goverment" values
> industry$industryval <- ifelse(((industry$industryval == "government") &
+     (industry$government/industry$Permits >= .01)),
+     industry$industryval,
+     ifelse(industry$industryval == "government", "other", industry$industryval))
> industry$industryval <- as.factor(industry$industryval)
> table(industry$industryval)

    government     healthcare          other      religious uppereducation
            53             12             75             23             15

The table above shows the distribution of industry categories among census tracts. If the census tract had no categorized permits, or if the maximum category represented less than 1% of permits in that census tract, I labeled it as “other.” This is not the strongest aggregate variable, because the “other” category represents the highest number of census tracts. Overall only 5.9% of permits have a category assigned to them, so I do not have high hopes that this measure will have a strong positive or negative effect on the presence of outdoor activity.

The second variable that I wanted to analyze within the building permits dataset was work category. Building permits are classified as either new construction, addition, demolition, renovation, moving, or special events. Since the number of permits in each category is extremely uneven (67% of building permits are classified as renovation), I constructed a dataframe based on relative frequency. Instead of basing my aggregate measure on the number of permits concerning renovation in a given tract, I considered the percentage out of all renovation permits that happened to fall in a specific tract. The work category assigned to each census tract is the one with the highest percentage value in that tract.

> #create an aggregate dataframe for work category
> work1 <- aggregate(newcon~CT_ID_10, data = permits, sum)
> work1[,2] <- work1[,2]/sum(permits$newcon)
> work2 <- aggregate(addition~CT_ID_10, data = permits, sum)
> work2[,2] <- work2[,2]/sum(permits$addition)
> work3 <- aggregate(demo~CT_ID_10, data = permits, sum)
> work3[,2] <- work3[,2]/sum(permits$demo)
> work4 <- aggregate(reno~CT_ID_10, data = permits, sum)
> work4[,2] <- work4[,2]/sum(permits$reno)
> work5 <- aggregate(moving~CT_ID_10, data = permits, sum)
> work5[,2] <- work5[,2]/sum(permits$moving)
> work6 <- aggregate(specialevents~CT_ID_10, data = permits, sum)
> work6[,2] <- work6[,2]/sum(permits$specialevents)
> work <- merge(work1, work2, by = 'CT_ID_10')
> work <- merge(work, work3, by = 'CT_ID_10')
> work <- merge(work, work4, by = 'CT_ID_10')
> work <- merge(work, work5, by = 'CT_ID_10')
> work <- merge(work, work6, by = 'CT_ID_10')
> rm(work1, work2, work3, work4, work5, work6)
> #find the most common work category in each census tract
> work$workval <- NA
> for (x in 1:nrow(work)) {work[x,8] <- names(which.max(work[x,2:7]))}
> work$workval <- as.factor(work$workval)
> table(work$workval)

     addition          demo        moving        newcon          reno specialevents
           13            10            31            70            29            25

The table above shows the distribution of work categories among census tracts. At this point I have created aggregate measures for both categories I am hoping to measure, so I am ready to run some statistical tests.

The first test that I run was to determine whether outdoor activity has a relationship with community engagement. Engagement is one of the ecometrics drawn from BARI’s work with the Boston CRM system (311 data), and it measures “the likelihood that an individual living in a neighborhood knows of and would be willing to use the CRM system.” I split the census tracts into two categories; those with above average engagement (>0) and those with below average engagement (<0). I theorize that the census tracts with more community engagement will also exhibit higher levels of outdoor activity.

> ## T-TESTS ##
> # 1. check outdoor energy vs. engagement
> tracts$engagement_binary <- ifelse(tracts$Engagement_2014 >= 0, 1, 0)
> t.test(OutdoorEnergy~engagement_binary, data=tracts)

	Welch Two Sample t-test

data:  OutdoorEnergy by engagement_binary
t = -0.70995, df = 171.2, p-value = 0.4787
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.564327  3.091470
sample estimates:
mean in group 0 mean in group 1
       10.67094        12.40737

Unfortunately, in this case the analysis does not support my theory. While there is a slight difference between the two groups (about +1.7 for the more engaged tracts), there is almost a 50-50 chance that a split this size in this particular dataset was due to chance. In fact, the 95% confidence interval states that the mean of outdoor activity in census tracts with more engagement could be anywhere from about 6.6 points higher to 3.1 points lower than the mean of outdoor activity in less engaged census tracts.

Next I analyzed the breakdown of outdoor activity between multiple groups. The first group I examined was the “Type” variable, where census tracts are labeled as either Downtown, Residential, Parks, or Industrial/Institutional. I expect to find that downtown has the highest levels of outdoor activity, while parks have the lowest. This is due to the formulation of my metric; while parks do have a significant amount of outdoor activity, my metric is formed from building permits, and there is very little by the way of investments or physical changes in a park that requires a building permit. Therefore, my metric does not effectively capture the energy of parks and thus this category will be lower than the others.

> ## ANOVA TESTS ##
> # 2. check outdoor energy vs. Type
> anova<-aov(OutdoorEnergy~Type, data=tracts)
> summary(anova)
             Df Sum Sq Mean Sq F value  Pr(>F)
Type          3  10741    3580    15.7 4.4e-09 ***
Residuals   174  39670     228
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> TukeyHSD(anova)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = OutdoorEnergy ~ Type, data = tracts)

$Type
           diff       lwr        upr     p adj
I-D -17.9491040 -31.26578  -4.632424 0.0033253
P-D -28.7113339 -44.12001 -13.302661 0.0000174
R-D -28.5385528 -40.39285 -16.684258 0.0000000
P-I -10.7622299 -23.37454   1.850079 0.1236046
R-I -10.5894488 -18.47408  -2.704816 0.0034698
R-P   0.1727811 -10.88437  11.229936 0.9999760

> #graph outdoor energy vs. Type
> require(reshape2)
> require(ggplot2)
> melted<-melt(tracts[c(42,14)],id.vars=c("Type"))
> means<-aggregate(value~Type,data=melted,mean)
> names(means)[2]<-"mean"
> ses<-aggregate(value~Type, data=melted, function(x) sd(x, na.rm=TRUE)/sqrt(length(!is.na(x))))
> names(ses)[2]<-'se'
> means<-merge(means,ses,by='Type')
> means <- transform(means, lower=mean-se, upper=mean+se)
> levels(means$Type)<-c("Downtown","Industrial/Institutional","Park","Residential")
> ggplot(data=means, aes(x=Type, y=mean)) +
+   geom_bar(stat="identity", position="dodge", fill="blue") + ylab("Mean") +
+   geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(.9))

The ANOVA test shows a significant difference in outdoor activity between the types, and a Tukey’s test comparing categories individually mostly supports my hypothesis. Downtown census tracts have higher levels of outdoor activity than any of the other categories. Industrial/institutional areas have higher outdoor activity than residential areas and no significant difference from parks. Residential areas and parks are essentially tied. Moderately high levels of outdoor activity in industrial/institutional areas make sense, as these census tracts contain the city’s universities, many of which hold outdoor events regularly.

The graph also points out some key differences between the categories that the Tukey’s test does not, namely regarding their levels of variation. The census tracts classified as downtown areas have a wide range of outdoor activity levels. Conversely, residential areas are extremely consistent in their levels of outdoor activity. This could be due to actual variation or to the number of cases; since there are fewer downtown census tracts, the uncertainty surrounding those results is much higher.

I also want to determine whether significant differences exist between the industry categories.

> # 3. check outdoor energy vs. industryval
> anova<-aov(OutdoorEnergy~industryval, data=tracts)
> summary(anova)
             Df Sum Sq Mean Sq F value Pr(>F)
industryval   4   1938   484.6   1.729  0.146
Residuals   173  48473   280.2
> #graph outdoor energy vs. industryval
> melted<-melt(tracts[c(90,14)],id.vars=c("industryval"))
> means<-aggregate(value~industryval,data=melted,mean)
> names(means)[2]<-"mean"
> ses<-aggregate(value~industryval,data=melted, function(x) sd(x, na.rm=TRUE)/sqrt(length(!is.na(x))))
> names(ses)[2]<-'se'
> means<-merge(means,ses,by='industryval')
> means <- transform(means, lower=mean-se, upper=mean+se)
> ggplot(data=means, aes(x=industryval, y=mean)) +
+   geom_bar(stat="identity", position="dodge", fill="blue") + ylab("Mean") +
+   geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(.9))

The graph shows some slight differences between the levels of outdoor activity in each of the industries. Unfortunately, with a p-value of 0.146, differences of this size could plausibly have arisen by chance, rendering the findings insignificant. The substantial overlap between many of the standard error bars also suggests that the same test on a different sample of data could yield divergent results.

The last categorical variable I want to examine is the set of work categories.

> # 4. check outdoor energy vs. workval
> anova<-aov(OutdoorEnergy~workval, data=tracts)
> summary(anova)
             Df Sum Sq Mean Sq F value     Pr(>F)
workval       5   8732  1746.5   7.207 0.00000377 ***
Residuals   172  41679   242.3
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> TukeyHSD(anova)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = OutdoorEnergy ~ workval, data = tracts)

$workval
                              diff         lwr       upr     p adj
demo-addition           25.9921373   7.1210644 44.863210 0.0014579
moving-addition          5.9544424  -8.8699931 20.778878 0.8561303
newcon-addition          1.3572054 -12.1922768 14.906688 0.9997264
reno-addition            0.3633778 -14.6113288 15.338084 0.9999998
specialevents-addition  14.8501525  -0.4908757 30.191181 0.0639081
moving-demo            -20.0376949 -36.3537600 -3.721630 0.0067365
newcon-demo            -24.6349319 -39.8019541 -9.467910 0.0000835
reno-demo              -25.6287595 -42.0814775 -9.176042 0.0001873
specialevents-demo     -11.1419848 -27.9288022  5.644833 0.3977999
newcon-moving           -4.5972370 -14.2763413  5.081867 0.7455209
reno-moving             -5.5910646 -17.1815101  5.999381 0.7328383
specialevents-moving     8.8957101  -3.1642955 20.955716 0.2789618
reno-newcon             -0.9938276 -10.9015523  8.913897 0.9997245
specialevents-newcon    13.4929471   3.0397984 23.946096 0.0036116
specialevents-reno      14.4867747   2.2425237 26.731026 0.0103403

This time the p-value is vanishingly small, indicating that any distinctions between categories are far more likely to be due to actual differing levels of outdoor activity than chance. When we examine the direct relationships individually, we find that census tracts with high levels of demolition have significantly more outdoor activity than those with a high percentage of building permits indicating addition, moving, new construction, or renovation. Census tracts with a high percentage of special events permits had more outdoor activity than census tracts with a high percentage of new construction or renovation permits. These are interesting findings, but not the most useful. In future assignments, I can dig deeper into the types of work being described to find a logical explanation for each of these results.

Since census tract type was the most informative categorical variable, let’s return there for the last portion of my analysis. The outdoor activity metric was formulated by combining the effects of two types of energy: informal interactions and formal events. Informal interactions are measured from building permits that indicate street-level additions and renovations that improve the outdoor experience, such as the installation of shade awnings or the addition of new outdoor seating. Formal events are permitted outdoor community gatherings, such as a cultural festival, road race, or outdoor concert series. My goal is to see how the levels of formal vs. informal permits compare across the different neighborhood types.

> # graphing formal vs. informal
> melted<-melt(tracts[c(7, 13, 42)],id.vars=c("Type"))
> means2<-aggregate(value~Type+variable,data=melted,mean)
> names(means2)[3]<-"mean"
> ses2<-aggregate(value~Type+variable,data=melted, function(x) sd(x, na.rm=TRUE)/sqrt(length(!is.na(x))))
> names(ses2)[3]<-'se'
> means2<-merge(means2,ses2,by=c('Type','variable'))
> means2<-transform(means2, lower=mean-se, upper=mean+se)
> levels(means2$Type)<-c("Downtown","Industrial/Institutional","Park","Residential")
> levels(means2$variable)<-c("Informal Interactions","Formal Events")
> ggplot(data=means2, aes(x=Type, y=mean, fill=variable)) +
+   geom_bar(stat="identity", position="dodge") +
+   geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(.9)) +
+   ylab("Mean") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

As you can see, disaggregating the effects of formal vs. informal indicators on my overall metric does not make much of a difference in parks, residential areas, or even in industrial/institutional areas. In downtown areas, however, the mean number of permits indicating formal events is significantly larger than the mean number of permits indicating informal interactions. It also exhibits far more variation; the standard error of formal events in downtown areas is 37.3, well over the standard error of informal events (11.8). No other standard error tops 10. This visualization offers some insight into the potentially skewing effect of formal permits in downtown areas.

---

> View(propwhiteNSA)
> propwhiteNSA$majority <- ifelse(propwhiteNSA$propwhite > .50, c("white"), c("nonwhite"))
> NSAmixeduse <- merge(NSAmixeduse, propwhiteNSA, by="NSA_NAME")
> View(NSAmixeduse)
> t.test(TotalMean~majority, data=NSAmixeduse)

Welch Two Sample t-test

data: TotalMean by majority

t = -3.1923, df = 38.323, p-value = 0.002818

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.18499866 -0.04144143

sample estimates:

mean in group nonwhite mean in group white

0.1621425 0.2753625

The two means indicate that majority-white neighborhoods have more mixed use zoning than majority-nonwhite neighborhoods, with mean mixed use scores differing by about 0.11. For reference, the median mixed use score across all neighborhoods is 0.18, with a third quartile of 0.24, meaning the score's distribution is heavily right-skewed. The p-value of .0028 indicates that this difference in scores is very unlikely to be due to chance.
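The right-skew claim can be checked directly from the scores themselves; a minimal sketch, using a made-up vector in place of the actual `NSAmixeduse$TotalMean` column:

```r
# Hypothetical mixed-use scores standing in for NSAmixeduse$TotalMean
scores <- c(0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.24, 0.45, 0.60)

# In a right-skewed distribution the mean sits above the median
mean(scores) > median(scores)
quantile(scores, c(0.25, 0.50, 0.75))
```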

These results provide evidence of increased mixed use zoning in majority-white neighborhoods, and because last week's analysis found a positive correlation between home values and mixed use zoning, this could illustrate socioeconomic barriers, as well as a history of segregation by both race and income in Boston. While the factors leading to this divide are complex, there certainly seems to be a racial component to how amenities are zoned in Boston.

The next test I will be utilizing is ANOVA, or analysis of variance, in order to compare more than two groups with each other through the mixed use zoning scores. To find new categories, I moved back to the subdistrict-level data, which classifies each subdistrict as Business, Harborpark, Industrial, Miscellaneous, Mixed Use, Open Space, Other, Residential, or Waterfront. Comparing scores across these categories also serves as a sanity check on the metric itself, since certain types should score differently, e.g. Mixed Use versus Residential.

typeanova <- aov(TotalMean~subdistric, data=mixedmeans)
summary(typeanova)

Df Sum Sq Mean Sq F value Pr(>F)

subdistric 9 70.48 7.831 359.6 <2e-16 ***

Residuals 1627 35.43 0.022

—

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The above F value indicates about 360 times as much between-group variation as would be expected by pure chance, with an essentially infinitesimal probability of occurring under the null. These results show strong evidence of differences in mixed use zoning scores across subdistrict types. To further explore these types, I then melted and aggregated the categories to get the mixed use mean for each category. After calculating the standard error to show each mean's variation, I plotted the results as a bar graph with error bars.

melttype <- melt(mixedmeans[c(7,18)], id.vars=c("subdistric"))
typemeans <- aggregate(value~subdistric, data=melttype, mean)
names(typemeans)[2] <- "mean"
ses <- aggregate(value~subdistric, data=melttype,
                 function(x) sd(x, na.rm=TRUE)/sqrt(sum(!is.na(x))))  # sum() counts the non-missing values
names(ses)[2] <- 'se'
typemeans <- merge(typemeans, ses, by='subdistric')
typemeans <- transform(typemeans, lower=mean-se, upper=mean+se)
typebar <- ggplot(data=typemeans, aes(x=subdistric, y=mean)) +
  geom_bar(stat="identity", position="dodge", fill="blue") +
  ylab("Mixed Use Zoning Mean") + xlab("Neighborhood Type")
typebar + geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(.9))

The graph shows the variation of mixed use scores across the neighborhood type categories. Predictably, Mixed Use, Other, and Business show the most diversity in their zoned amenities, as these are areas with varied activity, while Waterfront, Residential, and Open Space do not.

Matt Dwyer


In a previous post we saw graphically, through scatter plots, how similar the two sensors' measurements are. We also saw that, in order to test the feasibility of creating the Livability Index, it is important to establish whether one sensor's measurements differ from the other's.

In this post we go further in this direction and test whether the differences observed in the scatter plots are significant, by running a t-test for every single measurement on both sensors. We also test whether these measurements are constant across days.

For this analysis, we used an extended dataset with three days of data for one family of variables (10 to 12 August 2016) and ten days for the other (10 to 19 August 2016). The data was aggregated every five minutes.
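The five-minute aggregation can be sketched as follows; the data frame and column names here are hypothetical stand-ins for the real sensor feed:

```r
# Hypothetical raw feed: one reading every 30 seconds for an hour
raw <- data.frame(
  timestamp = as.POSIXct("2016-08-10 00:00:00", tz = "UTC") + seq(0, 3599, by = 30),
  value = runif(120)
)

# Floor each timestamp to its 5-minute (300 s) bin, then average within bins
raw$bin <- as.POSIXct(floor(as.numeric(raw$timestamp) / 300) * 300,
                      origin = "1970-01-01", tz = "UTC")
agg5min <- aggregate(value ~ bin, data = raw, mean)
nrow(agg5min)  # 12 five-minute bins in one hour
```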

The first step is to produce bar charts to get a better idea of where the similarities and differences between the sensors lie. In the following figures we plot, for each measurement, a bar chart of the mean values on every available day.

Figure 1. Comparison between mean values for sensors Envi1 and Envi2 (R code in annex 1)

Figure 2. Comparison between mean values for sensors City1 and City2 (R code in annex 2)

From figures 1 and 2 we can see that most of the measurements differ from one sensor to the other, with the exception of NO2, which seems very similar. To test these differences, we carry out a t-test between them. The results are shown in table 1.

Table 1 shows the mean difference between the measurements taken by the two sensors, the t statistic, and the p-value of the t-test on that difference.

Table 1: t-tests between measurements of City1 vs City2 and Env1 vs Env2. If the measurements from one sensor matched the other, we would expect their differences (MeanDif) to be close to zero (R code in annex 3)

As suggested above, from table 1 we conclude that only NO2 presents statistically similar values between the two sensors. This means that although the distance between the sensors is short, the environmental and atmospheric conditions change significantly from one spot to the other.

As we saw in previous posts, there could be subtle demographic differences in the composition of the blocks where the sensors are located. For example, the block where City1 and Environment1 sit is surrounded by department stores, so a more intense pedestrian flow is expected, while the spot where Environment2 and City2 sit has a lower intensity of pedestrians. These demographic variables might be responsible for some of the differences in the sensors' measurements. To prove that, we would need to add external demographic variables and see whether they can be explained or predicted through the sensor measurements.

So far, we have seen that there are significant differences between the sensors, but the question is: are those differences stable over time?

To address this question we carried out an F test across days for each measurement, and another one for the difference between the two sensors' measurements. The results are shown in table 2 for the city measurements and table 3 for the environmental measurements.

Table 2: p-values for F tests: City1/City2 mean values across days. Dif: difference between City1 and City2 values across days. Ref: values less than 0.05 indicate differences across days.
Table 3: p-values for F tests: Env1/Env2 mean values across days. Dif: difference between Env1 and Env2 values across days. Ref: values less than 0.05 indicate differences across days.

From table 2 we can see that the difference between the two sensors remains stable across days for Luminosity and Humidity, but not for Temperature and Noise. From table 3 we can see that the differences in the environmental measurements are not stable across days.

In this post we tested the differences between the sensors in order to examine whether the distance between them, small as it is, is sufficient to register behavioral differences between two spots in the city. This supports the feasibility of constructing an index based on these variables. We also note the following sub-conclusions:

- NO2 is the only variable that does not present significant changes between the sensors.
- Luminosity and Humidity are the only two variables whose between-sensor differences remain fixed across days.
- Only Luminosity from both sensors, and Noise from City1, keep the same values across days.

1 - Bar charts: environmental sensors

graph <- function(melted, name, sensorname) {
  names(melted) <- c("day", "sensor", "value")
  AggDaySensorVal <- aggregate(value~day+sensor, melted, mean)
  se <- aggregate(value~day+sensor, melted,
                  function(x) sd(x, na.rm = TRUE)/sqrt(sum(!is.na(x))))  # sum() counts the non-missing values
  AggDaySensorVal <- merge(AggDaySensorVal, se, by = c("day","sensor"), suffixes = c("","Se"))
  AggDaySensorVal <- transform(AggDaySensorVal, upper=value+valueSe, lower=value-valueSe)
  levels(AggDaySensorVal$sensor) <- sensorname
  AggDaySensorVal$day <- as.factor(AggDaySensorVal$day)
  ggplot(data = AggDaySensorVal, aes(x=day, y=value, fill=sensor)) +
    geom_bar(stat="identity", position="dodge") +
    geom_errorbar(aes(ymax=upper, ymin=lower), position = position_dodge(.9)) +
    ylab("Mean") + ggtitle(name)
}

melted <- melt(combine[,c("day","Envi1CO","Envi2CO")], id.vars="day")
EnviGraph1 <- graph(melted, "CO", c("Envi1","Envi2"))
melted <- melt(combine[,c("day","Envi1CO2","Envi2CO2")], id.vars="day")
EnviGraph2 <- graph(melted, "CO2", c("Envi1","Envi2"))
melted <- melt(combine[,c("day","Envi1NO2","Envi2NO2")], id.vars="day")
EnviGraph3 <- graph(melted, "NO2", c("Envi1","Envi2"))
melted <- melt(combine[,c("day","Envi1O2","Envi2O2")], id.vars="day")
EnviGraph4 <- graph(melted, "O2", c("Envi1","Envi2"))

2 - Bar charts: city sensors

melted <- melt(combine[,c("day","City1LUM","City2LUM")], id.vars="day")
CityGraph1 <- graph(melted, "Luminosity", c("City1","City2"))
melted <- melt(combine[,c("day","City1HUM","City2HUM")], id.vars="day")
CityGraph2 <- graph(melted, "Humidity", c("City1","City2"))
melted <- melt(combine[,c("day","City1TC","City2TC")], id.vars="day")
CityGraph3 <- graph(melted, "Temperature", c("City1","City2"))
melted <- melt(combine[,c("day","City1MCP","City2MCP")], id.vars="day")
CityGraph4 <- graph(melted, "Noise", c("City1","City2"))

3 - t-tests between the measurements

options(digits = 4)

# Run a paired t-test for each matched pair of columns and collect the results
ttest.row <- function(measurement, col1, col2) {
  temp <- t.test(combine[[col1]], combine[[col2]], paired = TRUE)
  data.frame(measurement = measurement, MeanDif = temp$estimate[[1]],
             statistic = temp$statistic[[1]], pvalue = temp$p.value)
}

results <- rbind(
  ttest.row("Luminosity",  "City1LUM", "City2LUM"),
  ttest.row("Humidity",    "City1HUM", "City2HUM"),
  ttest.row("Temperature", "City1TC",  "City2TC"),
  ttest.row("Noise",       "City1MCP", "City2MCP"),
  ttest.row("CO",          "Envi1CO",  "Envi2CO"),
  ttest.row("CO2",         "Envi1CO2", "Envi2CO2"),
  ttest.row("NO2",         "Envi1NO2", "Envi2NO2"),
  ttest.row("O2",          "Envi1O2",  "Envi2O2")
)

stargazer(results, header = FALSE, type = "latex", summary = FALSE,
          title = "t.test between measurements of City1 vs City2 and Env1 vs Env2")

4 - F tests across days

# Differences between paired sensors
combine$TempDif <- combine$City1TC - combine$City2TC
combine$HumDif <- combine$City1HUM - combine$City2HUM
combine$LumDif <- combine$City1LUM - combine$City2LUM
combine$McpDif <- combine$City1MCP - combine$City2MCP
combine$CODif <- combine$Envi1CO - combine$Envi2CO
combine$CO2Dif <- combine$Envi1CO2 - combine$Envi2CO2
combine$NO2Dif <- combine$Envi1NO2 - combine$Envi2NO2
combine$O2Dif <- combine$Envi1O2 - combine$Envi2O2

# Helper: p-value of the one-way ANOVA (F test) of a variable across days
pvalF <- function(var) {
  summary(aov(as.formula(paste(var, "~ day")), data = combine))[[1]]$`Pr(>F)`[1]
}

cityVars <- rbind(c("City1LUM", "City2LUM", "LumDif"),
                  c("City1HUM", "City2HUM", "HumDif"),
                  c("City1TC",  "City2TC",  "TempDif"),
                  c("City1MCP", "City2MCP", "McpDif"))
envVars  <- rbind(c("Envi1CO",  "Envi2CO",  "CODif"),
                  c("Envi1CO2", "Envi2CO2", "CO2Dif"),
                  c("Envi1O2",  "Envi2O2",  "O2Dif"),
                  c("Envi1NO2", "Envi2NO2", "NO2Dif"))

resultsCity <- apply(cityVars, c(1, 2), pvalF)
resultsEnv <- apply(envVars, c(1, 2), pvalF)

resultsCity <- data.frame(round(resultsCity, 3), row.names = c("Lum","Hum","Tc","Noise"))
resultsEnv <- data.frame(round(resultsEnv, 3), row.names = c("CO","CO2","O2","NO2"))
names(resultsCity) <- c("City1", "City2", "Dif")
names(resultsEnv) <- c("Env1", "Env2", "Dif")

stargazer(resultsCity, header = FALSE, type = "latex", summary = FALSE,
          title = "F test p-values across days: City1 vs City2")
stargazer(resultsEnv, header = FALSE, type = "latex", summary = FALSE,
          title = "F test p-values across days: Env1 vs Env2")


To test differences in mean residential and public USFPP among neighborhoods that are relatively segregated or integrated, t-tests were used. Following Peterson and Krivo (2010), neighborhood segregation was defined as a dichotomous variable: neighborhoods with over 70% of a single racial group were considered segregated, and those below that threshold were labeled integrated. Out of 174 census tracts in Boston, 110 were integrated and 63 were segregated (one was missing relevant data). Interestingly, integrated neighborhoods have a mean residential USFPP of 538, compared to 740 for segregated neighborhoods, a difference that was statistically significant (p < 0.001). Considering that 83% of segregated neighborhoods were overwhelmingly white, this finding is consistent with those of Peterson and Krivo (2010). The difference in means for public USFPP (554 for integrated, 607 for segregated), however, was not significant, suggesting that the racial composition of a neighborhood is related more to residential than to public USFPP.

Peterson and Krivo (2010) extended their results on neighborhood segregation by examining the effects of what they term "hypersegregation," defined as neighborhoods composed of 90% or more of a single race. Following their work, neighborhood segregation was separated into three groups: integrated (70% single race or below, N = 110), segregated (70%-90% single race, N = 50), and hypersegregated (>90% single race, N = 13). A one-way ANOVA was used to test for differences in residential and public USFPP among these three groups. Mean residential space was significantly different among the levels of neighborhood segregation (F = 19.93, p < 0.001); post hoc Tukey tests indicate that both levels of segregation differ from integrated neighborhoods, yet no significant difference exists between segregated and hypersegregated neighborhoods. While this may be an issue of low power (hypersegregated neighborhoods had a very small sample size), it is also possible that the effects of segregation are nonlinear and thus diminish after reaching a certain threshold. Consistent with the results of the t-test, there were no significant differences among segregation categories for public USFPP (F = 0.39, p = 0.675). The graph below clearly illustrates the differences (or lack thereof) among means for both residential and public USFPP. While public space remains similar across neighborhoods of differing levels of segregation, residential space increases for more segregated neighborhoods (probably because the dominant race in most of these areas is white).

References:

Peterson, R. and Krivo, L. (2010). *Divergent social worlds: Neighborhood crime and the racial-spatial divide.* American Sociological Association.

R code:

#Read in data

All <- read.csv("~/Desktop/PPUA/Aggregated_CT_11_6.csv", header=TRUE)

#Create categorical measures of neighborhood segregation and hypersegregation

All$Segregated <- ifelse((All$White >.70 | All$Black >.70 | All$Hispanic >.70 | All$Asian >.70),1,0)

All$Hypersegregated <- ifelse((All$White >.90 | All$Black >.90 | All$Hispanic >.90 | All$Asian >.90),1,0)

All$Neighborhood_segregation <- ifelse(All$Segregated == 1, 1, 0)

All$Neighborhood_segregation <- ifelse(All$Hypersegregated == 1, 2, All$Neighborhood_segregation)

All$Neighborhood_segregation <- as.factor(All$Neighborhood_segregation)

class(All$Neighborhood_segregation)

summary(as.factor(All$Segregated))

summary(as.factor(All$Hypersegregated))

summary(as.factor(All$Neighborhood_segregation))

#Perform t-test for USFPP measures based on segregation variable

t.test(Residential_USFPP~Segregated, data=All)

t.test(Public_USFPP~Segregated, data=All)

white <- All[which(All$Segregated==1 & All$White >.70),]

nrow(white)

View(white)

#Perform anova for USFPP measures based on neighborhood_segregation variable

anova <-aov(Residential_USFPP~Neighborhood_segregation, data=All)

anova2 <-aov(Public_USFPP~Neighborhood_segregation, data=All)

summary(anova)

summary(anova2)

#Posthoc tests

TukeyHSD(anova)

#Visualize differences

require(ggplot2)

require(reshape2)

names(All)

melted3 <- melt(All[c(26,28,37)], id.vars=c("Neighborhood_segregation"))

View(melted3)

means3 <- aggregate(value~Neighborhood_segregation+variable, data=melted3, mean, na.rm=TRUE)

names(means3)[3] <- "mean"

View(means3)

ses <- aggregate(value~Neighborhood_segregation+variable, data=melted3, function(x) sd(x, na.rm=TRUE)/sqrt(sum(!is.na(x))))

names(ses)[3] <- 'se'

View(ses)

merge <- merge(means3, ses, by=c('Neighborhood_segregation','variable'))

View(merge)

merge <- transform(merge, lower=mean-se, upper=mean+se)

levels(merge$Neighborhood_segregation) <- c("Integrated","Segregated","Hypersegregated")

levels(merge$variable) <- c("Residential USFPP","Public USFPP")

graph <- ggplot(data=merge, aes(x=Neighborhood_segregation, y=mean, fill=variable)) + geom_bar(stat="identity", position="dodge") + geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(.9)) + ylab("Mean")

graph + labs(x = "Level of Neighborhood Segregation", y = "Mean Unit Square Foot-Per-Person", title = "Comparison of Residential and Public USFPP by Neighborhood Segregation")


For this task, I associate each liquor license with a few selected socioeconomic metrics at the census block group level, i.e. I assign each license the value of the indicator for the census block group in which it is located. The metrics are then aggregated by census block group, and finally I use t-tests and ANOVA to analyze the differences in these metrics between liquor license types.

I have picked median house income and the percentage of white residents as the two indicators for the social fabric of the neighborhoods.

source("src/preparation.R")
source("src/alcohol.R")
bizz.cols <- c("CT_ID_10", "ALC_BIZ_TYPE", "ALC_TYPE")
acs.cols <- c("CT_ID_10", "MedHouseIncome", "White")
dat <- merge(bizz.alc[, bizz.cols], acs.ct[, acs.cols], all.x = TRUE)

When categorized by the type of alcohol allowed, the licenses divide into two groups: malt & wine, and all alcoholic beverages. I used a Welch t-test (the R default) to see whether the census indicators for these two groups differ significantly from each other.

First, the median house income:

library(broom)
library(knitr)
library(magrittr)  # provides the %>% pipe used below
dat.t <- t.test(MedHouseIncome ~ ALC_TYPE, dat) %>% tidy()
dat.t <- dat.t[, c(1:6)]
colnames(dat.t) <- c(
  "mean difference",
  "mean in group (All Alcohol)",
  "mean in group (Malt & Wine)",
  "t statistic",
  "p value",
  "df"
)
row.names(dat.t) <- c("value")
round(dat.t, 2) %>% t() %>% kable()

| | value |
|---|---|
| mean difference | 10765.01 |
| mean in group (All Alcohol) | 76030.47 |
| mean in group (Malt & Wine) | 65265.46 |
| t statistic | 5.76 |
| p value | 0.00 |
| df | 801.18 |

We get a t statistic of 5.76, and the p value is as low as 1.1e-8, indicating a very strong difference. On average, census block groups containing full liquor licenses have a median house income about $10,765 higher than those with malt & wine only licenses. Note that the two groups contain different numbers of members (liquor licenses).
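The reported p value can be recovered from the t statistic and the Welch degrees of freedom with `pt()`; a quick sketch using the numbers from the table above:

```r
# Reported statistics from the t-test above
t_stat <- 5.76
df <- 801.18

# Two-sided p-value of the Welch t-test
p <- 2 * pt(-abs(t_stat), df)
p
```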

This makes sense: full liquor licenses are much harder to get and command higher prices on the market, so they would naturally be concentrated in richer neighborhoods.

Then let’s look at race:

dat.t <- t.test(White ~ ALC_TYPE, dat) %>% tidy()
dat.t <- dat.t[, c(1:6)]
colnames(dat.t) <- c(
  "mean difference",
  "mean in group (All Alcohol)",
  "mean in group (Malt & Wine)",
  "t statistic",
  "p value",
  "df"
)
row.names(dat.t) <- c("value")
round(dat.t, 2) %>% t() %>% kable()

| | value |
|---|---|
| mean difference | 0.02 |
| mean in group (All Alcohol) | 0.66 |
| mean in group (Malt & Wine) | 0.64 |
| t statistic | 1.23 |
| p value | 0.22 |
| df | 714.32 |

The p value is 0.22, which is not significant. Statistically speaking, we cannot conclude that the racial composition of a block group is related to the distribution of alcohol license types.

We have 6 different business types designated in the categorization of liquor licenses. An ANOVA test can be used to check whether the differences among these business types are statistically significant.

dat.anova <- aov(MedHouseIncome ~ ALC_BIZ_TYPE, dat)
broom::tidy(dat.anova) %>% knitr::kable()

| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| ALC_BIZ_TYPE | 5 | 17121124445 | 3424224889 | 3.806447 | 0.0020108 |
| Residuals | 1097 | 986845509189 | 899585697 | NA | NA |

With the F statistic's p value as low as 0.00201, we can conclude that at least one business type differs significantly from the others.
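That p value can be reproduced directly from the F statistic and the two degrees-of-freedom entries in the ANOVA table:

```r
# F statistic and degrees of freedom straight from the ANOVA table above
p <- pf(3.806447, df1 = 5, df2 = 1097, lower.tail = FALSE)
round(p, 7)  # matches the reported 0.0020108
```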

And based on Tukey’s “Honest Significant Difference” method, three group combinations were identified with significant differences.

broom::tidy(TukeyHSD(dat.anova))[2:6] %>% knitr::kable()

| comparison | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|
| Common Vectualler-Club | 4563.333 | -7309.544 | 16436.210 | 0.8825740 |
| Farmer Distillery-Club | -4402.655 | -66038.275 | 57232.966 | 0.9999516 |
| General On-Premises-Club | -10090.611 | -31352.054 | 11170.831 | 0.7539779 |
| Hotel-Club | 17154.345 | 1467.223 | 32841.468 | 0.0226704 |
| Tavern-Club | -12757.655 | -74393.275 | 48877.966 | 0.9916508 |
| Farmer Distillery-Common Vectualler | -8965.987 | -69573.924 | 51641.949 | 0.9982903 |
| General On-Premises-Common Vectualler | -14653.944 | -32721.074 | 3413.186 | 0.1886058 |
| Hotel-Common Vectualler | 12591.013 | 1615.674 | 23566.351 | 0.0138650 |
| Tavern-Common Vectualler | -17320.987 | -77928.924 | 43286.949 | 0.9646641 |
| General On-Premises-Farmer Distillery | -5687.957 | -68810.105 | 57434.192 | 0.9998475 |
| Hotel-Farmer Distillery | 21557.000 | -39912.037 | 83026.037 | 0.9176134 |
| Tavern-Farmer Distillery | -8355.000 | -93978.049 | 77268.049 | 0.9997743 |
| Hotel-General On-Premises | 27244.957 | 6471.373 | 48018.540 | 0.0026154 |
| Tavern-General On-Premises | -2667.043 | -65789.192 | 60455.105 | 0.9999964 |
| Tavern-Hotel | -29912.000 | -91381.037 | 31557.037 | 0.7336351 |

They are Hotel-Club, Hotel-Common Vectualler, and Hotel-General On-Premises.

broom::tidy(TukeyHSD(dat.anova))[c(4, 8, 13), 2:6] %>% knitr::kable()

| | comparison | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|---|
| 4 | Hotel-Club | 17154.35 | 1467.223 | 32841.47 | 0.0226704 |
| 8 | Hotel-Common Vectualler | 12591.01 | 1615.674 | 23566.35 | 0.0138650 |
| 13 | Hotel-General On-Premises | 27244.96 | 6471.373 | 48018.54 | 0.0026154 |

They all involve hotels, which suggests that liquor licenses issued to hotels may be the outlier here; i.e., hotels offering on-premises alcohol consumption tend to be located in richer neighborhoods.

The differences between Hotel and the remaining two categories, Farmer Distillery and Tavern, are not statistically significant. However, this may simply reflect the very small number of licenses in those two categories (just 2 and 6, respectively); with so few observations, a significant result is very unlikely.
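The low-power point can be made concrete with `power.t.test`. A sketch with hypothetical inputs: a $20,000 income difference (roughly the size of the Hotel contrasts above) and the ~$30,000 residual SD implied by the ANOVA mean square (sqrt(899585697) is about 29993):

```r
# Rough power to detect a $20,000 difference with only n = 2 licenses per group,
# assuming a residual SD of about $30,000 (hypothetical round numbers)
pw <- power.t.test(n = 2, delta = 20000, sd = 30000, sig.level = 0.05)$power
pw  # far below the conventional 0.8 target
```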

Using a bar chart with error bars, the mean median house income for the different liquor license types can be plotted as follows:

library(reshape2)
dat.melted <- melt(dat[, c("ALC_BIZ_TYPE", "MedHouseIncome")], id.vars = "ALC_BIZ_TYPE")
dat.means <- aggregate(value ~ ALC_BIZ_TYPE + variable, dat.melted, FUN = mean)
names(dat.means)[3] <- "mean"
dat.ses <- aggregate(value ~ ALC_BIZ_TYPE + variable, dat.melted,
                     function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x))))  # sum() counts the non-missing values
names(dat.ses)[3] <- "se"
dat.means <- merge(dat.means, dat.ses, by = c("ALC_BIZ_TYPE", "variable"))
dat.means <- transform(dat.means, lower = mean - se, upper = mean + se)
ggplot(data = dat.means, aes(x = ALC_BIZ_TYPE, y = mean)) +
  geom_bar(stat = "identity", position = "dodge", fill = "tomato2") +
  geom_errorbar(aes(ymax = upper, ymin = lower), position = position_dodge(.9)) +
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
  ylab("Mean") + xlab("Business Type")

The high income level of the block groups hosting hotel liquor licenses stands out clearly.


library("dplyr")

library("ggplot2")

library("gridExtra")

envi <- read.csv("sensor_data_environment-MT.csv")

envi <- distinct(envi, id_wasp, sensor, value, timestamp)

envi <- envi[,c(3,4,11,5)]

First, we will consider the two locations of the Environment sensors, and we’ll investigate each chemical separately. In this case, we are simply interested in whether the binary variable *id_wasp* influences the value of the chemical concentration, and so, a simple t-test will provide the information we need.

t.no2 <- t.test(envi[envi$sensor=="NO2",]$value~envi[envi$sensor=="NO2",]$id_wasp)

t.co <- t.test(envi[envi$sensor=="CO",]$value~envi[envi$sensor=="CO",]$id_wasp)

t.o2 <- t.test(envi[envi$sensor=="O2",]$value~envi[envi$sensor=="O2",]$id_wasp)

t.co2 <- t.test(envi[envi$sensor=="CO2",]$value~envi[envi$sensor=="CO2",]$id_wasp)

The results of these tests are summarized below:

| | NO2 | CO | O2 | CO2 |
|---|---|---|---|---|
| T statistic | 4.5403 | -22.8262 | 48.8681 | -16.5273 |
| degrees of freedom | 1178.709 | 1010.852 | 1158.722 | 894.672 |
| p-Value | 6.195e-06 | < 2.2e-16 | < 2.2e-16 | < 2.2e-16 |
| Environ1 mean | 2.010362 | 1.913590 | 0.5815106 | 0.8431141 |
| Environ2 mean | 2.005268 | 2.404169 | 0.5425828 | 0.9847937 |

Note that the differences between the Environ1 and Environ2 means in the NO2 data seem very small, especially compared to the CO data. Still, the t statistics, which reflect the difference in means relative to the variation in the data, are far from zero, and the p-Values in all four tests are exceedingly small. So we can conclude that the atmospheric concentrations at the two locations differ significantly; even for NO2, the gap between the means is too consistent to have occurred by chance.

What about the variability of each chemical throughout the day? Is there a significant difference between NO2 concentration in the middle of the day versus the night? We previously added a flag to our dataset that indicates whether the measurement was taken in the morning, afternoon/evening, or at night. We can use this flag with the ANOVA test to determine if there is a significant difference in the measurements in these three times of day.

aov.no2 <- aov(envi[envi$sensor=="NO2",]$value~envi[envi$sensor=="NO2",]$day_flg)

aov.co <- aov(envi[envi$sensor=="CO",]$value~envi[envi$sensor=="CO",]$day_flg)

aov.o2 <- aov(envi[envi$sensor=="O2",]$value~envi[envi$sensor=="O2",]$day_flg)

aov.co2 <- aov(envi[envi$sensor=="CO2",]$value~envi[envi$sensor=="CO2",]$day_flg)

The results are summarized below:

| | F Value | Probability > F |
|---|---|---|
| NO2 | 27.34 | 2.47e-12 |
| CO | 305.4 | <2e-16 |
| O2 | 6.55 | 0.00148 |
| CO2 | 70.31 | <2e-16 |

In all four cases, the F value is greater than 1, and in the CO and CO2 cases in particular, it is much larger than 1. This indicates that the variability between the groups is larger than the variability within each group. Therefore, we can conclude that there is a statistically significant difference between the concentrations in the morning, late-day, and nighttime hours. Another way to frame this is that, for NO2, the time of day accounts for 27 times as much variability as we would expect to see by pure chance. The probability > F is the probability that we would see this much variability by chance (i.e., not attributable to the time of day), and in all four cases this probability is again exceedingly small.

The F test does not provide any information about which times of day show high or low concentrations of these chemicals. To dig into this, we will visualize the average atmospheric concentration at each time period in the day. First we'll calculate the means and standard errors for each case (sensor location, chemical, and time of day), and then we'll create bar plots for each chemical so that we can compare both the sensor location and the time of day. Interestingly, these plots appear to show very little variability (the standard error bars are almost too small to see because the data are so tightly centered around the mean value). This reaffirms the results of the ANOVA test, demonstrating that the within-group variance is quite small. Meanwhile, the difference in the means (the difference in bar height for equivalent times of day), which approximately represents the between-group variance, is substantially larger. This is most apparent for the CO and CO2 data. For the NO2 data, both the between-group and within-group variance are quite difficult to see because all the data are so tightly grouped. The ANOVA and t tests concluded that the differences between times of day and between sensor locations were significant, but the individual t statistic and F value were rather small compared with CO and CO2. For the O2 data, the difference between the sensor locations is notable, but the difference between the times of day is very small, as shown by the high t statistic but very low F statistic for O2.

means <- aggregate(value~id_wasp+sensor+day_flg, data=envi, mean)

names(means)[4] <- 'Mean'

ses <- aggregate(value~id_wasp+sensor+day_flg, data=envi, function(x) sd(x, na.rm=TRUE)/sqrt(sum(!is.na(x))))

names(ses)[4] <- 'SE'

means <- merge(means, ses, by=c("id_wasp", "sensor", "day_flg"))

means <- transform(means, lower=Mean-SE, upper=Mean+SE)

library(ggplot2)

bar1 <- ggplot(data=means[means$sensor=='CO',], aes(x=id_wasp, y=Mean, fill=day_flg)) +

geom_bar(stat="identity", position="dodge") +

geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(0.9)) +

guides(fill=guide_legend(title="Time of Day")) +

labs(title = "CO", x=NULL, y="Mean")

bar2 <- ggplot(data=means[means$sensor=='NO2',], aes(x=id_wasp, y=Mean, fill=day_flg)) +

geom_bar(stat="identity", position="dodge") +

geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(0.9)) +

guides(fill=guide_legend(title="Time of Day")) +

labs(title = "NO2", x=NULL, y="Mean")

bar3 <- ggplot(data=means[means$sensor=='CO2',], aes(x=id_wasp, y=Mean, fill=day_flg)) +

geom_bar(stat="identity", position="dodge") +

geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(0.9)) +

guides(fill=guide_legend(title="Time of Day")) +

labs(title = "CO2", x=NULL, y="Mean")

bar4 <- ggplot(data=means[means$sensor=='O2',], aes(x=id_wasp, y=Mean, fill=day_flg)) +

geom_bar(stat="identity", position="dodge") +

geom_errorbar(aes(ymax=upper, ymin=lower), position=position_dodge(0.9)) +

guides(fill=guide_legend(title="Time of Day")) +

labs(title = "O2", x=NULL, y="Mean")

library(gridExtra)

grid.arrange(bar1, bar2, bar3, bar4, ncol=2)

We can also visualize each period's average as a departure from the daily average. This helps us detect differences in variability throughout the day at one location versus the other. The process is similar to the previous visualization, but this time we'll calculate the daily mean for each sensor and location and create bar plots showing the departure from that daily mean for each time period. For NO2, CO2, and O2, the distinctive feature is that both locations follow approximately the same pattern throughout the day, just scaled differently by location (e.g., NO2 is above average in the afternoon, well below average in the morning, and below but closer to average overnight, and both locations exhibit this pattern except that Environ2 stays closer to its average overall). This effect is muted in the O2 plot because all of the O2 values are so similar. In the CO plot, however, note that the two locations behave differently during the overnight hours: at Environ1 the concentrations are below average overnight, while at Environ2 they are above average. This suggests an interaction between the influence of location and the influence of time of day, which could reflect a physical process such as diffusion or some other type of chemical transfer.

dailymean <- aggregate(value~id_wasp+sensor, data=envi, mean)

names(dailymean)[3] <- 'DailyMean'

plotdata <- merge(means, dailymean, by=c("id_wasp", "sensor"))

bar1 <- ggplot(data=plotdata[plotdata$sensor=='CO',], aes(x=id_wasp, y=(Mean - DailyMean), fill=day_flg)) +

geom_bar(stat="identity", position="dodge") +

geom_errorbar(aes(ymax=(upper-DailyMean), ymin=(lower-DailyMean)), position=position_dodge(0.9)) +

guides(fill=guide_legend(title="Time of Day")) +

labs(title = "CO", x=NULL, y="Departure from Mean")

bar2 <- ggplot(data=plotdata[plotdata$sensor=='NO2',], aes(x=id_wasp, y=(Mean - DailyMean), fill=day_flg)) +

geom_bar(stat="identity", position="dodge") +

geom_errorbar(aes(ymax=(upper-DailyMean), ymin=(lower-DailyMean)), position=position_dodge(0.9)) +

guides(fill=guide_legend(title="Time of Day")) +

labs(title = "NO2", x=NULL, y="Departure from Mean")

bar3 <- ggplot(data=plotdata[plotdata$sensor=='CO2',], aes(x=id_wasp, y=(Mean - DailyMean), fill=day_flg)) +

geom_bar(stat="identity", position="dodge") +

geom_errorbar(aes(ymax=(upper-DailyMean), ymin=(lower-DailyMean)), position=position_dodge(0.9)) +

guides(fill=guide_legend(title="Time of Day")) +

labs(title = "CO2", x=NULL, y="Departure from Mean")

bar4 <- ggplot(data=plotdata[plotdata$sensor=='O2',], aes(x=id_wasp, y=(Mean - DailyMean), fill=day_flg)) +

geom_bar(stat="identity", position="dodge") +

geom_errorbar(aes(ymax=(upper-DailyMean), ymin=(lower-DailyMean)), position=position_dodge(0.9)) +

guides(fill=guide_legend(title="Time of Day")) +

labs(title = "O2", x=NULL, y="Departure from Mean")

grid.arrange(bar1,bar2,bar3,bar4,ncol=2)

We can test for this interaction between factors (location and time of day) using a two-way ANOVA test, as follows:

aov2.no2 <- aov(value ~ day_flg * id_wasp, data = envi[envi$sensor == "NO2", ])

aov2.co <- aov(value ~ day_flg * id_wasp, data = envi[envi$sensor == "CO", ])

aov2.o2 <- aov(value ~ day_flg * id_wasp, data = envi[envi$sensor == "O2", ])

aov2.co2 <- aov(value ~ day_flg * id_wasp, data = envi[envi$sensor == "CO2", ])

Chemical | F Value of Interaction | Probability >F for Interaction
NO2 | 1.513 | 0.221
CO | 17.09 | 4.8e-08
O2 | 3.029 | 0.0488
CO2 | 46.62 | <2e-16

The test confirms our hypothesis that the location and time of day interact to influence the concentration values, with a particularly large F value for CO2 and a smaller but still highly significant value for CO. For NO2 the interaction is clearly not significant (p = 0.221), and for O2 it is only marginally significant (p = 0.0488), so there is little evidence that the two factors interact to influence concentration for those gases.
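A quick way to eyeball such an interaction is base R's interaction.plot(), where non-parallel traces indicate that the effect of time of day differs by location. The sketch below runs on simulated data shaped like envi (the column and location names match those used above, but the numbers are invented) so that it is self-contained; the real check would substitute the CO subset of envi:

```r
set.seed(2)
demo <- expand.grid(id_wasp = c("Environ1", "Environ2"),
                    day_flg = c("morning", "lateday", "night"),
                    rep = 1:30)
# Build in an interaction: overnight, Environ1 dips below its own average
# while Environ2 rises above its average (mirroring the CO behavior above).
base  <- ifelse(demo$id_wasp == "Environ1", 5, 7)
shift <- ifelse(demo$day_flg == "night",
                ifelse(demo$id_wasp == "Environ1", -1, 1), 0)
demo$value <- rnorm(nrow(demo), mean = base + shift, sd = 0.5)

# Non-parallel lines across the time-of-day levels suggest an interaction.
interaction.plot(demo$day_flg, demo$id_wasp, demo$value,
                 xlab = "Time of Day", trace.label = "Location")

# The two-way ANOVA on the same data recovers the built-in interaction term.
summary(aov(value ~ day_flg * id_wasp, data = demo))
```

When the traces are parallel (as they roughly are for NO2), the day_flg:id_wasp row of the ANOVA table carries a small F value and a large Pr(>F), matching the table above.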
