Correlation: Blue Bikes

Mia Scholes. November 15th, 2023

Background

This week builds on last week’s analysis of station use as it relates to the roadway bike network. There, I tested the hypothesis that Blue Bike stations near safer bike infrastructure outperform those nearest to lower-quality infrastructure or no bike lanes at all. From the mapping and visualization, there did appear to be a relationship between proximity to good bike lanes, particularly protected bike lanes, and how much use a Blue Bike station gets. Now that we have the tools to test actual correlation, I wanted to see how strong this relationship really is.

Setup

The following is all code to set up the data frames and maps that I have been using for analysis:

library(nngeo)

library(sf)

library(ggplot2)

library(ggmap)

bikeNetwork<-st_read("/Users/miascholes/Desktop/Existing_Bike_Network_2023/Existing_Bike_Network_2023.shp")

massMap<-st_read("/Users/miascholes/Desktop/townssurvey_shp/TOWNSSURVEY_POLY.shp")

AugBikeObs=read.csv('/Users/miascholes/Desktop/202208-bluebikes-tripdata.csv')

stationSpatial=read.csv('/Users/miascholes/Desktop/Bluebikes_Stations_spatial.csv')

Code to build data frame bStations, from previous analysis:

*** The code highlighted in yellow was giving me the following error when trying to knit. It worked perfectly in console, and I checked over and over to make sure that there weren’t actually any missing coordinates. Any help here would be hugely appreciated!

#aggregating/merging to get to easy to analyze forms

stations<-stationSpatial[,2:3]

names(stations)[1:(ncol(stations)-1)] <- names(stations)[2:ncol(stations)]

names(stations)[2]<-"District"

mapTrips<-AugBikeObs[,5:7]

stationList<-merge(stations,mapTrips,by.x = 'Name',by.y = 'start.station.name')

stationTrips<-aggregate(District~Name,data=stationList,length)

names(stationTrips)[2]<-"trip.count"

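#base R's mode() returns the storage type (e.g. "numeric"), not the most common value, so in a fresh knit session the coordinates would turn into text, which may explain a missing-coordinates error later on; statMode is a helper name introduced here for the statistical mode

statMode<-function(x) as.numeric(names(which.max(table(x))))

mapTrips2<-aggregate(cbind(start.station.latitude,start.station.longitude)~start.station.name,data = mapTrips,statMode)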

merge1<-merge(mapTrips2,stationTrips,by.x = 'start.station.name',by.y = 'Name')

stationInfo<-merge(merge1,stations,by.x='start.station.name',by.y = 'Name')

stationInfo$District<-toupper(stationInfo$District)

#user proportion stuff

AugBikeObs[15]<-ifelse(AugBikeObs$usertype=='Subscriber',1,0)

names(AugBikeObs)[15]<-'userBool'

userProp<-aggregate(userBool~start.station.name,data=AugBikeObs,mean)

StationInfo<-merge(stationInfo,userProp,by='start.station.name')

#bike network stuff

protectedBN<-bikeNetwork[bikeNetwork$ExisFacil %in% c('BFBL','SBL','SBLSL','SBLBL','CFSBL','PED','SUP','SUPN','SUPM'),]

bosStations<-StationInfo[StationInfo$District=='BOSTON',]

bosStationsgeo<-st_as_sf(bosStations, coords=c('start.station.longitude','start.station.latitude'), crs=4269)

bikeNetwork<-st_transform(bikeNetwork,crs = 4269)

protectedBN<-st_transform(protectedBN,crs = 4269)

proxtoPBN<-st_nearest_feature(bosStationsgeo,protectedBN)

proxtoBN<-st_nearest_feature(bosStationsgeo,bikeNetwork)

distProtected<-st_distance(bosStationsgeo, protectedBN[proxtoPBN,], by_element=TRUE)

distNet<-st_distance(bosStationsgeo, bikeNetwork[proxtoBN,], by_element=TRUE)

bStations<-data.frame(StationInfo[StationInfo$District=="BOSTON",1])

bStations[2]<-distNet

bStations[3]<-distProtected

names(bStations)[1]<-'start.station.name'

names(bStations)[2]<-'dist.to.network'

names(bStations)[3]<-'dist.to.prot.network'

trial<-(bStations$dist.to.network)+2*(bStations$dist.to.prot.network)

connectivityScore<-scale(trial)

connectivityScore<-4-connectivityScore

bStations[4]<-connectivityScore

names(bStations)[4]<-'connectivity'

bStations[5]<-trial

names(bStations)[5]<-'pre.scale.score'
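To make the scoring steps above concrete, here is a minimal sketch on made-up distances (the station labels and numbers are invented for illustration, not taken from the Blue Bikes data). It repeats the same three steps: a weighted sum that counts distance to the protected network twice, standardization with scale(), and a flip around 4 so that stations closer to the network score higher.

```r
# Toy data: hypothetical stations and distances in meters (invented values)
toy <- data.frame(
  station = c("A", "B", "C", "D"),
  dist.to.network = c(0, 10, 50, 200),
  dist.to.prot.network = c(20, 300, 100, 800)
)

# Weighted sum: distance to the protected network counts double
toy$pre.scale.score <- toy$dist.to.network + 2 * toy$dist.to.prot.network

# scale() returns a one-column matrix of z-scores; as.numeric() flattens it
z <- as.numeric(scale(toy$pre.scale.score))

# Flip the sign so that shorter distances give higher connectivity
toy$connectivity <- 4 - z

# Stations ordered from most to least connected
toy[order(-toy$connectivity), c("station", "pre.scale.score", "connectivity")]
```

Because the score is a shifted z-score, it is centered on 4 with a standard deviation of 1, so a value above 4 means a station is better connected than average by this measure.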

Analysis

I first used ggpairs() to see the relationships among distance to the two defined networks, connectivity, and trip count. Unsurprisingly, the two distance measures correlated highly with connectivity, since the score is built from them. More interesting was how each of the three variables correlated with trip count, shown in the bottom row.

require(GGally)

bosPlotting<-bStations

bosPlotting[,2]<-as.numeric(bStations[,2])

bosPlotting[,3]<-as.numeric(bStations[,3])

bosPlotting[,4]<-as.numeric(bStations[,4])

bosPlotting[,5]<-bosStations[,4]

names(bosPlotting)[5]<-"trip.count"

ggpairs(bosPlotting,columns = 2:5)

I wanted to know more about the correlation between trip count and distance to the network, distance to the protected network, and the connectivity score, so I used cor.test() to see p-values and confidence intervals in addition to effect sizes. All three correlations were significant! More interestingly, proximity to the protected network had a much higher correlation (a 40% larger effect size) than proximity to the network overall, suggesting that the type and quality of cycling infrastructure is tied to how much it gets used. Also excitingly, my created measure of “connectivity” had the highest correlation and lowest p-value, lending some credibility to the measure itself!

cor.test(as.numeric(bosPlotting$connectivity),bosPlotting$trip.count)

cor.test(as.numeric(bosPlotting$dist.to.network),bosPlotting$trip.count)

cor.test(as.numeric(bosPlotting$dist.to.prot.network),bosPlotting$trip.count)
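Each cor.test() call returns a list, so the effect size, p-value, and confidence interval can be pulled out and compared side by side instead of being read off the printed output. The sketch below does that on simulated data (the data frame sim, its columns, and the helper corSummary are invented stand-ins, not the Blue Bikes results) and computes the relative difference between two effect sizes, which is the kind of comparison behind the percentage quoted above.

```r
set.seed(1)

# Simulated stand-in data: trip counts fall with distance, more strongly for x2
n <- 200
sim <- data.frame(x1 = runif(n, 0, 500), x2 = runif(n, 0, 500))
sim$trips <- 3000 - 1 * sim$x1 - 3 * sim$x2 + rnorm(n, sd = 600)

# Helper (hypothetical name) that collects the key parts of a cor.test() result
corSummary <- function(x, y) {
  ct <- cor.test(x, y)  # Pearson correlation by default
  c(r = unname(ct$estimate),
    p = ct$p.value,
    ci.low = ct$conf.int[1],
    ci.high = ct$conf.int[2])
}

res <- rbind(x1 = corSummary(sim$x1, sim$trips),
             x2 = corSummary(sim$x2, sim$trips))
res

# Relative difference in effect size, comparing absolute correlations
relDiff <- abs(res["x2", "r"]) / abs(res["x1", "r"]) - 1
relDiff
```

Comparing absolute values keeps the sign of the correlation (negative for distance, positive for connectivity) from confusing the comparison.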

From the ggpairs() plot, proximity to the network seemed to have a much steeper curve than proximity to the protected network, along with a smaller effect size and a larger p-value, so I isolated those two relationships into scatter plots using ggplot.

histo21<-ggplot(data = bosPlotting,aes(y=trip.count,x=as.numeric(dist.to.network)))

histo21+geom_point()+labs(y="1 Month Trip Count / Station",x="Distance to Network (m)")

histo31<-ggplot(data = bosPlotting,aes(y=trip.count,x=as.numeric(dist.to.prot.network)))

histo31+geom_point()+labs(y="1 Month Trip Count / Station",x="Distance to Protected Network (m)")

Distance to the bike network as a whole had a very big cluster right at zero, meaning that most Blue Bike stations are likely right on a street with a bike lane. This makes perfect sense for their business.

However, not all bike lanes are created equal, and the distances between stations and the protected bike network were much more spread out. As a result, the relationship between distance and a station’s maximum potential use looks roughly linear, which is easier to interpret than the general network plot.

This matters because distance to the protected network was a better indicator of station use than distance to the general network. That is useful information both when planning Blue Bike stations and when inferring how much use different types of bike infrastructure might get from all types of cyclists.
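Because so many stations sit at a distance of zero from the general network, one optional check is whether a rank-based correlation tells the same story as the Pearson correlation. The sketch below uses simulated data (the vectors dist and trips are invented, not the Blue Bikes data) with a similar pile-up at zero and runs cor.test() both ways. It is only an illustration of the check, not a result from this analysis.

```r
set.seed(2)

# Simulated stand-in: most distances are exactly zero, a few are large
n <- 150
dist <- c(rep(0, 100), runif(50, 10, 400))
trips <- 2500 - 4 * dist + rnorm(n, sd = 700)

# Pearson measures linear association
pearson <- cor.test(dist, trips)

# Spearman works on ranks; exact = FALSE avoids the warning caused by ties at zero
spearman <- cor.test(dist, trips, method = "spearman", exact = FALSE)

c(pearson = unname(pearson$estimate), spearman = unname(spearman$estimate))
```

If the two estimates point in the same direction and are of similar size, the cluster at zero is probably not driving the result on its own.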

Next Steps

I’m interested in further isolating different types of bike infrastructure and how they affect Blue Bike use. In particular, I want to get a sense of what it takes to make the most people feel comfortable biking, so that the findings can be applied to broader populations and inform the design of new infrastructure.


Seeing Boston Neighborhoods through Administrative Data

The Course Blog for "Big Data for Cities" at Northeastern University (PPUA5262)