Comparing Groups – Blue Bikes – Seeing Boston Neighborhoods through Administrative Data

Introduction

In this analysis, we delve into the dynamics of bike-sharing activities, utilizing the total rides (TR) as a key metric to gauge the community’s usage and demand for bike-sharing services. Understanding the factors influencing total rides is essential for optimizing bike-sharing systems and tailoring services to meet varying user needs. Two distinct categorical variables, Touristic Station and Hour Category, were explored to uncover patterns and insights within the bike-sharing data.

The first categorical variable, Touristic Station, classified stations based on a threshold for Casual Ridership Proportion (CRP). This approach aimed to investigate whether stations attracting a higher proportion of casual riders, potentially linked to tourist attractions, exhibit distinct characteristics in terms of total rides. The second categorical variable, Hour Category, grouped hours of the day into specific intervals, enabling an exploration of temporal patterns in bike-sharing activities.

T-test

Outcome Variable: Total Rides (TR)

Reason for Interest: Total rides serve as a fundamental metric to gauge the overall usage and demand for bike-sharing services in a given area. By focusing on total rides, we aim to gain insights into the popularity and efficiency of the bike-sharing system, allowing us to understand how extensively the community utilizes this mode of transportation.

Categorical Variable: Touristic Station

Reason for Interest: Touristic Station is an intriguing categorical variable created to categorize stations based on a threshold for CRP (e.g., CRP > 0.5). This variable is interesting as it allows for the exploration of whether stations that attract a higher proportion of casual riders, potentially indicative of tourist attractions, exhibit distinct characteristics compared to those with lower casual ridership. The interest lies in uncovering any spatial or contextual patterns related to tourist preferences in bike-sharing, providing valuable insights for urban planning and optimizing bike-sharing services in areas with different levels of touristic appeal.

Interpretation

The t-test results reveal a substantial difference in Total Rides (TR) between stations categorized as Touristic (TRUE) and non-Touristic (FALSE). Stations classified as Touristic have a significantly lower mean TR of 324.97, compared with 1040.88 for non-Touristic stations. The difference between these two means is 715.91, suggesting that areas attracting more casual riders, potentially associated with tourist attractions, experience lower overall bike-sharing activity. The 95% confidence interval for the mean difference ranges from 545 to 887, and the p-value is very low at 1.167511e-14, providing strong evidence to reject the null hypothesis of equal means. The positive sign of the t-statistic (8.25) arises because the test subtracts the Touristic mean from the non-Touristic mean, so it reflects that the mean TR of Touristic stations is lower than that of non-Touristic stations.
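To make the direction of the comparison concrete, here is a minimal sketch on made-up station totals (not the Blue Bikes results). It assumes the same formula interface used in this post; with a logical grouping variable, R orders the groups FALSE then TRUE, so the reported difference is non-Touristic minus Touristic.

```r
# Toy station-level data (made-up numbers, not the Blue Bikes results)
toy <- data.frame(
  TR = c(900, 1100, 1250, 980, 1020, 300, 350, 280, 410, 330),
  TouristicStation = rep(c(FALSE, TRUE), each = 5)
)

# Welch two-sample t-test; the groups are ordered FALSE, then TRUE
res <- t.test(TR ~ TouristicStation, data = toy)

# Difference in means in the direction the test uses (FALSE minus TRUE)
mean_diff <- unname(res$estimate[1] - res$estimate[2])

print(res$estimate)   # the two group means
print(mean_diff)      # positive when non-Touristic stations have more rides
print(res$statistic)  # the t-statistic carries the same sign as mean_diff
print(res$conf.int)   # 95% confidence interval for the FALSE-minus-TRUE difference
```

The confidence interval is centered on the difference in means, which is a quick way to sanity-check a reported difference against a reported interval.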

# Load dplyr for the pipe and the data-manipulation verbs
library(dplyr)
# Read data
BlueBikes <- read.csv("202307-bluebikes-tripdata.csv")
# Total rides by start station
TotalRides_st <- BlueBikes %>%
  mutate(start_station_name = trimws(start_station_name)) %>%
  group_by(start_station_name) %>%
  summarise(TR = n())
# Casual ridership by start station, using only rides with a recorded end station
CasRider_st <- BlueBikes %>%
  mutate(start_station_name = trimws(start_station_name)) %>%
  filter(!is.na(start_station_name) & end_station_name != "") %>%
  group_by(start_station_name) %>%
  summarise(CasRid = sum(member_casual == "casual"), Comp_Rides = n())
# Casual Ridership Proportion (CRP): casual rides as a share of completed rides
CasRider_st$CRP <- CasRider_st$CasRid / CasRider_st$Comp_Rides
# Merge data frames
merged_data_st <- left_join(TotalRides_st, CasRider_st, by = "start_station_name")
# Create the categorical variable: TRUE when more than half of completed rides are casual
merged_data_st$TouristicStation <- merged_data_st$CRP > 0.5
# Welch two-sample t-test (independent samples is the default)
t_test_result <- t.test(TR ~ TouristicStation, data = merged_data_st)
# Summary
print("T-Test Results:")
print(t_test_result)

# Visualization (boxplot)
boxplot(TR ~ TouristicStation, data = merged_data_st, col = c("lightblue", "lightgreen"), 
        main = "Total Rides across Touristic Stations", ylab = "Total Rides (TR)")

The box plot of total rides for Touristic and non-Touristic stations shows how differently the two groups are distributed. The plot for non-Touristic stations is much taller, extending from 0 to around 1500, which indicates a wider range of total rides among these stations. The 8 dots beyond the whiskers are outliers: individual stations with exceptionally high total ride counts.

Touristic stations, on the other hand, have a more compressed box plot, indicating a narrower range of total rides. The 11 dots beyond the whiskers show a higher number of outliers than among non-Touristic stations. Because each dot is a station, this means a handful of Touristic stations record far more rides than is typical for the group, possibly because of their location near major attractions or other local factors.

In summary, the differences in the box plots and the number of outliers point to different patterns in total rides between the two groups. Non-Touristic stations have a more spread-out distribution of total rides, while most Touristic stations have low totals with a few high-volume exceptions. These patterns could be useful for system managers and urban planners in optimizing bike-sharing services and anticipating where demand concentrates.
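The dots in a box plot follow a fixed rule, so they can be counted rather than read off the figure. The sketch below uses made-up ride counts and base R's boxplot.stats(), which applies the same default rule as boxplot(): a point is drawn separately when it lies more than 1.5 box heights beyond the box.

```r
# Toy ride counts for one group of stations (made-up values)
tr <- c(120, 180, 200, 240, 260, 300, 310, 350, 400, 1500, 2200)

# boxplot.stats() uses the same default rule as boxplot() (coef = 1.5)
stats <- boxplot.stats(tr)

# Upper fence by hand: upper hinge plus 1.5 times the height of the box
hinges <- stats$stats[c(2, 4)]
upper_fence <- hinges[2] + 1.5 * diff(hinges)

print(upper_fence)        # values above this are drawn as dots
print(stats$out)          # the outlying values themselves
print(length(stats$out))  # number of outliers
```

Applied to the real data, the same call could be run once per group, for example on the TR values where TouristicStation is TRUE and again where it is FALSE.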

ANOVA

Outcome Variable of Interest: Total Rides (TR)

Reason for Interest: Total rides serve as a fundamental metric to gauge the overall usage and demand for bike-sharing services throughout the day. Analyzing Total Rides allows us to understand the temporal patterns of bike-sharing activities and identify peak usage hours.

Categorical Variable: Hour Category

Reason for Interest: The creation of an Hour Category variable, categorizing hours of the day into specific groups [Morning (6:00 AM – 11:59 AM) – Capture the morning commuting hours, Afternoon (12:00 PM – 5:59 PM) – Represent the midday hours, Evening (6:00 PM – 11:59 PM) – Encompass the evening hours, Night (12:00 AM – 5:59 AM) – Cover the late-night and early morning periods], is intriguing as it enables the exploration of whether the time of day influences Total Rides. Hour categories may capture variations in commuter patterns, leisure rides, or other temporal factors affecting bike-sharing usage.
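As a quick check of these boundaries, the sketch below applies the same cut() settings used later in this post to the hours 0 through 23. It assumes the hour is stored as a whole number from 0 to 23; with right = FALSE each interval includes its lower bound, so hour 6 falls in Morning and hour 18 in Evening.

```r
# Hours of the day as whole numbers (assumed representation)
hours <- 0:23

# Same breaks and labels as the Hour Category definition
hour_category <- cut(hours, breaks = c(0, 6, 12, 18, 24),
                     labels = c("Night", "Morning", "Afternoon", "Evening"),
                     include.lowest = TRUE, right = FALSE)

# Each category should contain exactly six hours
print(table(hour_category))

# Inspect the hours that sit on either side of each boundary
edge_hours <- c(5, 6, 11, 12, 17, 18, 23)
print(data.frame(hour = edge_hours,
                 category = hour_category[edge_hours + 1]))
```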

Interpretation

The ANOVA results indicate a statistically significant effect of the “HourCategory” variable on total rides (TR) (F(3, 20) = 10.84, p = 0.000192). This suggests that mean total rides differ significantly across the four time categories: Morning, Afternoon, Evening, and Night.

Post-hoc Tukey tests were conducted to identify specific differences between pairs of HourCategory levels:

Morning vs. Night: There is a significant difference in total rides between Morning and Night (p = 0.011). On average, total rides are higher in the Morning compared to Night.

Afternoon vs. Night: A significant difference is observed between Afternoon and Night (p = 0.0001327). Total rides are higher in the Afternoon compared to Night.

Evening vs. Night: There is a significant difference in total rides between Evening and Night (p = 0.0027309). Total rides are higher in the Evening compared to Night.

No significant differences were found between Afternoon and Morning (p = 0.2387252), Morning and Evening (p = 0.9246483), and Afternoon and Evening (p = 0.5534375).

In summary, the time of day significantly influences total rides: Night differs significantly from each of the other three periods, while Morning, Afternoon, and Evening do not differ significantly from one another. These findings provide valuable insights into temporal patterns of bike-sharing activities, which can inform service optimization and resource allocation strategies.
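The ANOVA works on hourly totals, one row per hour of the day, which is consistent with the 3 and 20 degrees of freedom reported above (24 rows in 4 groups). The step that builds that table is not shown in this post, so the following is only a sketch of one way it could be done on made-up trips. The column name started_at, its timestamp format, and the name hourly_toy are assumptions for illustration.

```r
# Toy trips; `started_at` and its format are assumptions about the CSV
trips <- data.frame(
  started_at = c("2023-07-01 00:15:00", "2023-07-01 08:05:12",
                 "2023-07-01 08:47:30", "2023-07-02 08:20:00",
                 "2023-07-02 17:59:59", "2023-07-02 23:10:45")
)

# Parse the timestamp and keep the hour of day as an integer from 0 to 23
start_time <- as.POSIXct(trips$started_at,
                         format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
trips$time <- as.integer(format(start_time, "%H"))

# Count rides per hour of day, keeping hours that have no rides
counts <- table(factor(trips$time, levels = 0:23))
hourly_toy <- data.frame(time = 0:23, TR = as.integer(counts))

# Show only the hours with at least one ride
print(hourly_toy[hourly_toy$TR > 0, ])
```

A table shaped like hourly_toy, with a numeric time column and a TR column, is what the cut() and aov() calls expect.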

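# TotalRides is assumed to hold one row per hour of day: time (0-23) and TR (total rides)
# Intervals are closed on the left, so hour 6 is Morning and hour 18 is Evening
TotalRides$HourCategory <- cut(TotalRides$time, breaks = c(0, 6, 12, 18, 24),
                               labels = c("Night", "Morning", "Afternoon", "Evening"),
                               include.lowest = TRUE, right = FALSE)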
# ANOVA
anova_result <- aov(TR ~ HourCategory, data = TotalRides)
# Post-hoc test (Tukey HSD for ANOVA)
posthoc_result <- TukeyHSD(anova_result)

print("ANOVA Results:")
print(summary(anova_result))


print("Post-hoc Test Results:")
print(posthoc_result)

# summarySE() is assumed to come from the Rmisc package; ggplot2 provides the plotting functions
library(Rmisc)
library(ggplot2)
# Mean and standard error of TR for each hour category
means <- summarySE(TotalRides, measurevar = "TR", groupvars = "HourCategory")
# Upper and lower error-bar limits: mean plus or minus one standard error
means$upper <- means$TR + means$se
means$lower <- means$TR - means$se
# Bar plot of mean total rides with standard-error bars
bar <- ggplot(data = means, aes(x = HourCategory, y = TR)) +
  geom_bar(stat = "identity", position = "dodge", fill = "blue") +
  ylab('Total Rides') +
  geom_errorbar(aes(ymax = upper, ymin = lower),
                position = position_dodge(.9)) +
  coord_cartesian(ylim = c(0, max(means$upper)))
# Display the plot
print(bar)

The bar plot with error bars shows the mean Total Rides (TR) for each Hour Category (Morning, Afternoon, Evening, and Night), together with the uncertainty around each mean. It offers a compact view of the temporal patterns of bike-sharing activities.

Total Rides Variation Across Time

The Night category has by far the lowest bar, with a mean of 2465 total rides, indicating much less bike-sharing activity during the night hours.

Morning has a mean of 18187 total rides, reflecting the morning commuting hours.

Afternoon has the highest mean at 26949, followed by Evening at 20967.

Error Bars and Standard Errors

The error bars extend one standard error above and below each mean. The narrow error bars for the Night category (ranging from 1782 to 3148) indicate little hour-to-hour variability in Total Rides during this time and a more precise estimate of the mean.

Morning and Afternoon have wider error bars (ranging from 15358 to 21016 and 23473 to 30425, respectively), indicating higher variability in bike-sharing activities during these periods.

The Evening category has the widest error bars of the four (ranging from 16527 to 25406), so ride totals vary most from hour to hour during the evening.
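For readers who want to see what the error-bar limits are made of, here is a sketch on made-up hourly totals that computes each group's mean and standard error in base R (standard deviation divided by the square root of the group size). The 95% confidence half-width is added only for comparison and is not what the bars in this post show.

```r
# Toy hourly totals for two categories (made-up values, six hours each)
toy <- data.frame(
  HourCategory = rep(c("Night", "Morning"), each = 6),
  TR = c(100, 150, 90, 120, 200, 180, 900, 1500, 2100, 1800, 1200, 1500)
)

# Group-wise mean, standard deviation, and size
groups   <- split(toy$TR, toy$HourCategory)
grp_mean <- sapply(groups, mean)
grp_sd   <- sapply(groups, sd)
grp_n    <- sapply(groups, length)

# Standard error of the mean, and the limits drawn as error bars
grp_se <- grp_sd / sqrt(grp_n)
lower  <- grp_mean - grp_se
upper  <- grp_mean + grp_se

# For comparison: half-width of a 95% confidence interval
ci_half <- qt(0.975, df = grp_n - 1) * grp_se

print(data.frame(mean = grp_mean, se = grp_se,
                 lower = lower, upper = upper, ci_half = ci_half))
```

Because every hour category contains the same number of hours, a wider error bar here directly reflects a larger spread of hourly totals within that category.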

Interpretation and Implications

The observed variations in Total Rides across Hour Categories emphasize the importance of considering temporal dynamics in bike-sharing service planning. The wider error bars for Morning, Afternoon, and especially Evening suggest that ride totals change considerably from hour to hour within those periods, while the narrow error bars for Night indicate consistently low activity.

Conclusion

The results of the t-test comparing Touristic and non-Touristic stations revealed a substantial difference in total rides. Touristic stations had a significantly lower mean TR of 324.97, compared with 1040.88 for non-Touristic stations. This suggests that areas attracting more casual riders, potentially associated with tourist attractions, experience lower overall bike-sharing activity. The very low p-value of 1.167511e-14 provides strong evidence for rejecting the null hypothesis.

Moving to ANOVA, we explored the relationship between total rides and Hour Category. The analysis identified a significant effect of Hour Category on total rides, indicating that mean total rides significantly differ across Morning, Afternoon, Evening, and Night periods. Post-hoc Tukey tests unveiled specific differences between Morning vs. Night, Afternoon vs. Night, and Evening vs. Night, providing valuable insights into temporal patterns of bike-sharing activities. These findings contribute to our understanding of how the time of day influences total rides, facilitating informed decisions in service optimization and resource allocation for bike-sharing systems.

Understanding the dynamics of bike-sharing activities, particularly the influence of tourist attractions and temporal patterns, holds significant implications for Boston’s communities. The observed lower total rides in Touristic stations may indicate the need for targeted efforts to enhance bike-sharing engagement in these areas. Furthermore, recognizing the temporal variations in total rides allows for tailored strategies to meet the unique demands of different periods during the day.

The methodologies employed in this study, including t-tests and ANOVA, can serve as valuable tools in investigating various aspects of bike-sharing systems or other urban phenomena. The transferability lies in adapting these statistical approaches to address different questions, such as exploring the impact of specific events, weather conditions, or infrastructure changes on bike-sharing behaviors. By applying similar analytical frameworks, cities and researchers can uncover insights that inform urban planning, promote sustainable transportation, and enhance the overall livability of communities.

Seeing Boston Neighborhoods through Administrative Data

The Course Blog for "Big Data for Cities" at Northeastern University (PPUA5262)
