Correlation

by Alan Cheung

Background

In this week’s blog, we will learn how to use correlations when analyzing Boston’s food establishment inspection data. According to Daniel O’Brien’s Urban Informatics: Using Big Data to Understand and Serve Communities, “in statistics, a correlation is used to evaluate the relationship between two numerical variables, specifically whether they rise and fall together or whether one falls as the other rises and vice versa. The former is referred to as a positive correlation, the latter is a negative correlation.”

In previous blog posts, we discussed about developing the latent construct, gentrification; coding the manifest variables; and creating the gentrification score. We will run correlations on the gentrification variables (median active restaurant age, restaurant closure rate, recent restaurant remodel rate, gentrification score) with the variables from the Massachusetts Census Indicators. The database contains 52 indicators organized into 7 main categories: 1) Sex, Age, and Density; 2) Racial and Ethnic Composition; 3) Economic measures; 4) Family and Household information; 5) Highest Level of Education completed; 6) Method of Transportation to Work; 7) Housing. The datasets that we will use are from the ACS tract indicator from years 2007-2011, 2011-2015, and 2015-2019. The goal is to find the correlations between the gentrification variables and the census indicators that are common in all three datasets and understand their implications.

Analysis

First, we set up our notebook with tidyverse and load our datasets into dataframes.

library(tidyverse)
df <- read.csv("/Users/alancheung/Library/CloudStorage/GoogleDrive-a.cheung.273@gmail.com/My Drive/Classes/Northeastern/PPUA 5262 Big Data for Cities/Project/spacial_data.csv")
df07_11 <- read.csv("/Users/alancheung/Library/CloudStorage/GoogleDrive-a.cheung.273@gmail.com/My Drive/Classes/Northeastern/PPUA 5262 Big Data for Cities/Project/dataverse_files/ACS_0711_TRACT.csv")
df11_15 <- read.csv("/Users/alancheung/Library/CloudStorage/GoogleDrive-a.cheung.273@gmail.com/My Drive/Classes/Northeastern/PPUA 5262 Big Data for Cities/Project/dataverse_files/ACS_1115_TRACT.csv")
df15_19 <- read.csv("/Users/alancheung/Library/CloudStorage/GoogleDrive-a.cheung.273@gmail.com/My Drive/Classes/Northeastern/PPUA 5262 Big Data for Cities/Project/dataverse_files/ACS_1519_TRACT.csv")

The next chunk creates the variables that the gentrification score is composed of – median active restaurant age, restaurant closure rate, and recent restaurant remodel rate. See my previous blog post for an explanation of the code.

restaurant <- df %>%
  group_by(businessname,address,BG_ID_10, CT_ID_10, issdttm, expdttm, licstatus, YR_BUILT, YR_REMOD) %>%
  count(businessname,address,BG_ID_10, CT_ID_10, issdttm, expdttm, licstatus, YR_BUILT, YR_REMOD) %>%
  rename('violation_count' = 'n')

restaurant$issdttm <- as.POSIXct(restaurant$issdttm, format="%Y-%m-%d %H:%M:%S")
restaurant$expdttm <- as.POSIXct(restaurant$expdttm, format="%Y-%m-%d %H:%M:%S")

closure_rate <- restaurant %>%
  group_by(address, CT_ID_10) %>%
  summarise(total_businesses = n(),
            closed_businesses = sum(licstatus =='Inactive'),
            closure_rate = closed_businesses / total_businesses) %>%
  arrange(CT_ID_10)

closure_rate_CT <- closure_rate %>%
  group_by(CT_ID_10) %>%
  summarize(
            total_closed_count = sum(closed_businesses),
            total_restaurant_count = sum(total_businesses),
            avg_close_rate = total_closed_count / total_restaurant_count) %>%
  arrange(CT_ID_10)

current_date <- Sys.Date()

df_with_active_age <- restaurant %>%
  filter(licstatus =="Active") %>%
  summarize(age_of_active_days = as.numeric(difftime(current_date, issdttm, units = "days")),
         age_of_active_years = age_of_active_days / 365.25)

active_age_CT <- df_with_active_age %>%
  group_by(CT_ID_10) %>%
  summarize(active_restaurant_count = sum(licstatus=="Active"),
            median_age_yr = median(age_of_active_years))

active_age_closure_rate_CT <- closure_rate_CT %>%
  left_join(active_age_CT, by='CT_ID_10') 

recent_rest_remod_CT <- restaurant %>%
  filter(!is.na(YR_REMOD)) %>%
  filter(YR_REMOD >=2007) %>%
  group_by(CT_ID_10) %>%
  summarize(recent_remod_rest_count = n()) %>%
  arrange(desc(recent_remod_rest_count))

merged_df <- active_age_closure_rate_CT %>%
  left_join(recent_rest_remod_CT, by = "CT_ID_10") %>%
  mutate(new_remod_percent = recent_remod_rest_count / total_restaurant_count)

Here, the median active restaurant age is normalized by dividing the median age by the max median age in dataset and then subtracting 1 by the value. Now, all the variables theoretically range from 0 to 1, and a higher score means that the area is more likely to experience gentrification.

merged_df$median_age_score <- 1 - (merged_df$median_age_yr / max(merged_df$median_age_yr, na.rm = TRUE))

We then create the gentrification score. A smaller weight of 20% is assigned to closure rate because the data could misrepresent gentrification, such as restaurants closing because of the pandemic. The remodel rate and the normalized median age were each given a weight of 40%. The gentrification score ranges theoretically from 0 to 1.

merged_df$gentrification_score <- (0.4*(if_else(is.na(merged_df$new_remod_percent), 0, merged_df$new_remod_percent)) + 0.2*(if_else(is.na(merged_df$avg_close_rate), 0, merged_df$avg_close_rate)) + 0.4*(if_else(is.na(merged_df$median_age_score), 0, merged_df$median_age_score)))

We create new dataframes by merging the the gentrification dataframe with the census dataframes on the census tract level. We use select_if(~is.numeric(.)) to select columns that have a numerical data type because correlations can only be run with numerical variables.

combined_df07_11 <- merged_df %>%
  left_join(df07_11, by="CT_ID_10") %>%
  select_if(~is.numeric(.))
  
combined_df11_15 <- merged_df %>%
  left_join(df11_15, by="CT_ID_10") %>%
  select_if(~is.numeric(.))

combined_df15_19 <- merged_df %>%
  left_join(df15_19, by="CT_ID_10") %>%
  select_if(~is.numeric(.))

We create a Gally plot of the gentrification variables to see if there are correlations between them. We see that the gentrification score is correlated with the three other variables; this makes sense because the score is composed of those variables. We see a minor negative correlation (r=-0.36) between restaurant closure rate; this is not surprising because they are aspects of business turnover.

require(GGally)

ggpairs(combined_df15_19, columns= c(4,6,8,10))

Instead of writing code for each pair of variables to run correlations, we create a function that runs the correlation between all the variables to save us a lot of time. A function is sequence of instructions that performs a specific task, and it is reusable. The inputs of the function are the dataframe, the maximum p-value that we can tolerate, and the minimum magnitude of the correlation coefficient (r) that we can tolerate. The function outputs a dataframe that has the p-values and correlation coefficient between the two variables that meet the p-value and r thresholds.

library(Hmisc)

# Define the function
filter_correlation_pvalue <- function(df1, p_value_threshold, r_threshold) {
  
  # Merge and prepare the data
  combined_df <- df1 %>%
    select(-"CT_ID_10") %>%
    na.omit() %>%
    as.matrix()
  
  # Compute the correlation and p-value matrix
  rcorr_matrix <- rcorr(combined_df)

  # Extract the correlation matrix and the p-value matrix
  correlation_matrix <- rcorr_matrix$r
  p_value_matrix <- rcorr_matrix$P

  # Get variable names
  var_names <- colnames(correlation_matrix)

  # Create a data frame from the matrices
  cor_pval_df <- expand.grid(Var1 = var_names, Var2 = var_names, stringsAsFactors = FALSE)
  cor_pval_df$Correlation <- as.vector(correlation_matrix)
  cor_pval_df$P_Value <- as.vector(p_value_matrix)

  # Filter the data frame
  filtered_cor_pval_df <- cor_pval_df[cor_pval_df$P_Value < p_value_threshold & (cor_pval_df$Correlation > r_threshold | cor_pval_df$Correlation < -1*r_threshold), ]

  return(filtered_cor_pval_df)
}

We use the function on each of the combined dataframes to get the correlated variables. We set the p-value threshold to be 0.05, which is the most common threshold, and r threshold to be 0.2 to see any pairs with at least a weak relationship. We want to compare the gentrification variables to the census variables, so we filter the data by searching for gentrification variables in Var1 and excluding them in Var2.

result07_11 <- filter_correlation_pvalue(combined_df07_11, 0.05, 0.2) %>%
  filter(Var1 == "avg_close_rate" | Var1 == "new_remod_percent" |
           Var1 == "median_age_yr" | Var1 == "gentrification_score") %>%
  filter(Var2 != "avg_close_rate" & Var2 != "new_remod_percent"
          & Var2 != "median_age_yr" & Var2 != "gentrification_score"
         & Var2 !="median_age_score") %>%
  arrange(desc(abs(Correlation)))

result11_15 <- filter_correlation_pvalue(combined_df11_15, 0.05, 0.2) %>%
  filter(Var1 == "avg_close_rate" | Var1 == "new_remod_percent" |
           Var1 == "median_age_yr" | Var1 == "gentrification_score") %>%
  filter(Var2 != "avg_close_rate" & Var2 != "new_remod_percent"
          & Var2 != "median_age_yr" & Var2 != "gentrification_score"
         & Var2 !="median_age_score") %>%
  arrange(desc(abs(Correlation)))

result15_19 <- filter_correlation_pvalue(combined_df15_19, 0.05, 0.2) %>%
  filter(Var1 == "avg_close_rate" | Var1 == "new_remod_percent" |
           Var1 == "median_age_yr" | Var1 == "gentrification_score") %>%
  filter(Var2 != "avg_close_rate" & Var2 != "new_remod_percent"
          & Var2 != "median_age_yr" & Var2 != "gentrification_score"
         & Var2 !="median_age_score" & Var2 != "TOWN_ID" & Var2 != "FIPS_STCO" & Var2 != "recent_remod_rest_count") %>%
  arrange(desc(abs(Correlation)))

After filtering the data, we merge the three correlation dataframes into one with full_join(). This allows us to see which correlations are significant across the three different time periods.

results_merged <- full_join(result07_11,result11_15, by = join_by(Var1, Var2))
results_merged2 <-full_join(results_merged, result15_19, by=join_by(Var1, Var2))

We want to see which correlations appear in all three time periods, so we filter where there are no “NA” values in the correlations.

results_merged3 <- results_merged2 %>%
  filter(!is.na(Correlation.x) & !(is.na(Correlation.y)) & !is.na(Correlation)) %>%
  arrange(Var1, desc(abs(Correlation.x)), desc((abs(Correlation.y))), desc((abs(Correlation))))

results_merged3 %>%
  select(Var1, Var2, Correlation.x, P_Value.x)

Looking at the results_merged3 dataframe, we see that there are 14 census variables that the gentrification variables correlate with. Four of them are for the closure rate, and the other ten are for the recent restaurant remodeling rate. The gentrification score and the median active restaurant age did not have any significant correlations. All the correlations are statistically significant but weak; the highest r magnitude is r=-0.35 between Black (proportion of black residents) and new_remod_percent.

We see that White (proportion of white residents) is positively correlated with the remodel percentage and negatively correlated with closure rate. On the other hand, we see Black negatively correlated with remodel percentage. This could indicate racial economic inequality, where areas with more white residents could have more wealth to sustain restaurants and renovate them.

library(gridExtra)

plot1 <- ggplot(combined_df07_11, aes(x = White, y = new_remod_percent)) +
  geom_point() +  
  geom_smooth(method = "lm", color = "blue") +
  theme_minimal() +
  labs(y = "Proportion of New Restaurant Remodel", x = "Proportion of White Residents",
       title = "White Residents vs New Restaurant Remodel and Closure")

plot2 <- ggplot(combined_df07_11, aes(x = White, y = avg_close_rate)) +
  geom_point() +  
  geom_smooth(method = "lm", color = "blue") +
  theme_minimal() +
  labs(y = "Proportion of Restaurant Closure", x = "Proportion of White Residents",
       title = "")

grid.arrange(plot1, plot2, ncol = 2)

Another interesting relationship is between highest education level and recent restaurant remodel rate. We see Master (proportion of residents with Master’s degree) and Doc (proportion with doctoral degree) positively correlated with remodel rate, while LessThanHS (proportion without high school degree) and HSGrad (proportion with high school degree) negatively correlated with remodel rate. This makes sense because those with college degrees tend to earn more than those without advanced degrees, and nicer-looking restaurants tend to be in wealthier neighborhoods. We also see that PubAssist (proportion receiving public assistance), which is a population with low income, is negatively correlated with remodel rate.

Something to note is that the variable FemHeadPer (proportion of family household that are female-headed) is positively correlated with closure rate and negatively correlated with remodel percentage. It would be interesting to understand why this is the case.

Conclusion

In this blog post, we’ve analyzed Boston’s changing restaurant scene, focusing on gentrification factors. We then correlated these factors with data from the ACS to identify trends and effects in various parts of the city.

The correlations between the gentrification variables and the census indicators may reveal some trends of Boston. People with higher education and income tend to live in wealthier areas with newly remodeled restaurants, and the inverse is true. There may be economic inequity in some parts of the city for black residents. It would be interesting to look at the correlations between the census indicators to further examine these trends.


Leave a comment