BlueBikes and Census correlations

For this week's analysis into correlations, I wanted to focus mainly on census data, as it provides interesting continuous variables. In previous assignments I attempted to include census data, but I struggled with the complexity of my approach. This week I decided to simplify the merging of data sets by using spatial functions that are already available!

This post covers the code, the choice of variables, and the significance of the results.

Using census data

Using the load_variables function provided by tidycensus, I chose the following three variables to explore from the Decennial Census and the American Community Survey:

  • P001001 – Total population. I chose this to test the hypothesis that having more people, or a higher density of people, would have a positive correlation with BlueBikes usage.
  • B08101_009 – Total who drove alone to work in a car, truck, or van. I chose this to test the hypothesis that if more people choose to drive, they are less likely to opt into a bike-share program, hence a negative correlation.
  • B07011_001 – Median income. I chose this to explore a socioeconomic indicator. I do not have a formed hypothesis for this variable yet, as the results could plausibly go either way. For example, lower-income tracts may choose to bike more because car ownership is expensive; on the other hand, lower-income areas are often further from city centers and have less infrastructure to support alternative forms of transportation.

I fetched each of these variables and combined them into a single data frame, adding two density variables: P001001_by_area and B08101_009_by_area.
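In rough terms, the fetching and combining step looks like the sketch below. The counties, survey years, and object names here are illustrative assumptions rather than the exact ones from my script:

```r
library(tidycensus)
library(dplyr)
library(sf)

# Browse the variable tables to find the codes referenced above
dec_vars <- load_variables(2010, "sf1", cache = TRUE)
acs_vars <- load_variables(2019, "acs5", cache = TRUE)

# Total population from the 2010 Decennial Census, with tract geometry
pop <- get_decennial(
  geography = "tract", variables = "P001001",
  state = "MA", county = c("Suffolk", "Middlesex"),   # assumed counties
  year = 2010, geometry = TRUE, output = "wide"
)

# Drove-alone commuters and median income from the 5-year ACS
acs <- get_acs(
  geography = "tract",
  variables = c("B08101_009", "B07011_001"),
  state = "MA", county = c("Suffolk", "Middlesex"),
  year = 2019, output = "wide"
)

# Combine into one data frame and add density versions of the two counts
census <- pop %>%
  left_join(select(acs, -NAME), by = "GEOID") %>%
  rename(B08101_009 = B08101_009E, B07011_001 = B07011_001E) %>%
  mutate(
    tract_area         = as.numeric(st_area(geometry)),
    P001001_by_area    = P001001 / tract_area,
    B08101_009_by_area = B08101_009 / tract_area
  )
```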

However, when fetching these data, tidycensus returns every census tract in the specified counties. To filter them down, I merged the census data with the stations data using st_join, which performs a spatial join on the geometry columns. The following code chunks perform this merge and display the output on a map:
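A minimal sketch of that join, assuming the station points (with their start_total counts) already live in an sf object called stations_sf; the object names are assumptions. Joining the stations to the tracts means each station picks up its tract's census values, and tracts without a station simply drop out of the analysis:

```r
library(sf)
library(ggplot2)

# Match coordinate systems, then attach each station's tract attributes
stations_sf <- st_transform(stations_sf, st_crs(census))
stations_census <- st_join(stations_sf, census)

# Quick map: tracts in the background, stations coloured by the
# population density of the tract they fall in
ggplot() +
  geom_sf(data = census, fill = "grey95", colour = "grey70") +
  geom_sf(data = stations_census, aes(colour = P001001_by_area), size = 1) +
  labs(colour = "Population density")
```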

Finding Correlations – Analysis

Before jumping into the correlation significance analysis, I wanted to visualize a few variables. I found that the start_total values in my data set (the total number of trips started at each station) were significantly skewed, which can be seen in the distribution of values. I plotted both the raw values and their log to see which is closer to a normal distribution.

While both distributions are skewed, applying a log did improve the shape. This will help when visualizing correlations later on.
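A sketch of that distribution check, reusing the joined station-level object from above (hypothetical names carried over):

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(sf)

dist_df <- stations_census %>%
  st_drop_geometry() %>%
  mutate(log_start = log(start_total)) %>%  # stations with zero starts would need log1p() or filtering
  pivot_longer(c(start_total, log_start), names_to = "scale", values_to = "value")

# Side-by-side histograms of the raw and logged trip counts
ggplot(dist_df, aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ scale, scales = "free")
```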

Using the rcorr function from the Hmisc package, I can quickly check whether my hypotheses hold any weight. I picked the following variables from my data set to look at:

  • P001001
  • P001001_by_area
  • B08101_009
  • B08101_009_by_area
  • B07011_001
  • start_total
  • log_start

I calculated densities for two of the variables, P001001 and B08101_009, because they are estimated totals for each census tract. Tract areas are not consistent, so it made sense to include a density metric as well.
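The significance check itself is short; a sketch, again using the assumed column names from the join above:

```r
library(dplyr)
library(sf)
library(Hmisc)

cor_vars <- stations_census %>%
  st_drop_geometry() %>%
  mutate(log_start = log(start_total)) %>%
  select(P001001, P001001_by_area, B08101_009, B08101_009_by_area,
         B07011_001, start_total, log_start)

# rcorr() returns the pairwise r matrix, the n matrix, and the p-value matrix
cor_res <- rcorr(as.matrix(cor_vars))
cor_res$r  # correlation coefficients
cor_res$P  # p values for each pair
```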

Viewing the r and p values for the correlations between these variables, we see a few interesting things:

Looking at the r values for log_start, we can see that it is reasonably correlated with B08101_009 (the estimate for driving alone) and P001001_by_area (population density). The p values for these two relationships are almost zero, indicating that the correlations are very unlikely to be due to chance. B07011_001 (median income) does not appear to be correlated with log_start or start_total, and the associated p values indicate that any apparent relationship could be due to chance.

To visualize the promising variables, I created scatter plots with regression lines. The relationship is present but not as stark as I expected:
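One of those plots, sketched with ggplot2 (the variable pair is just an example):

```r
library(ggplot2)

ggplot(cor_vars, aes(x = P001001_by_area, y = log_start)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +  # linear regression line with a confidence band
  labs(x = "Population density (P001001_by_area)", y = "log(start_total)")
```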

The GGally pair plot shows that the log_start variable has a better linear correlation with the census variables than start_total. However, start_total still has relatively good r and p values.
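The pair plot comes almost directly from GGally; a sketch over the same variable set:

```r
library(GGally)

# Scatter plots, densities, and correlation coefficients for every pair
ggpairs(cor_vars,
        columns = c("P001001_by_area", "B08101_009", "B07011_001",
                    "start_total", "log_start"))
```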

Interpreting in the Boston Context

These results supported the hypotheses. I'd like to reflect on each hypothesis separately.

Having more people or a higher density of people would have a positive correlation with BlueBikes usage.
Looking at population density rather than total population adds clarity to possible explanations. High population density usually means denser living and less rigid zoning. Areas with higher density also tend to have businesses, shops, education, and other services. Biking becomes a more viable option in these areas because people do not have to travel as far to take advantage of amenities. Cities also concentrate public transit stations in these areas, as they are generally the destinations people want to reach (for work or fun). I think this is an important insight: encouraging people to use more eco-friendly transportation methods goes hand in hand with how cities are designed and developed.

If more people are choosing to drive then they are less likely to opt into a bike share program
During my latent construct analysis, I was specifically interested in ways to encourage residents to use alternative forms of transportation (other than driving). The B08101_009 metric (total who drove alone to work in a car, truck, or van) is an interesting supplement. Since I did in fact find a negative correlation between this variable and station usage, I think this implies that efforts to encourage biking do affect driving habits.


One thought on “BlueBikes and Census correlations”

  1. Thank you for your post! I completely understand your frustration with incorporating census data into your data set. For the BOS311 data, I had a constant issue plotting the 2020 Boston Census data, as opposed to the 2010 Boston Census data available through BARI/Urban Informatics; the same was true for the t-test. The variables you created are essential to understanding the relevance of policies seeking to reduce vehicle usage. As you indicated, the values are not normally distributed, which may be due to the large sample size. Your evaluation of the correlations helps us see that a relationship between bike sharing and driving exists. Overall, this was a well-done interpretation of which relationships are significant (p < .05) and which are not (p > .05), including the correlations that bear on your hypotheses.
