CT Coffee Index – Differences among groups

Last week I said I was going to change the scope of my project and build a “CT Coffee Index”. The idea behind this change is that if the behavior of coffee shops turns out to carry information about the potential economic development of census tracts, I could later extend the index to other businesses.

To do that, I had to go back a few steps and do some cleaning:

  • I removed the cases that failed any of the following criteria:
    • Removed all cases that didn’t have accurate information in the “Boston Redevelopment Agency Planning District” variable (variable “BRA_PD”) or in the census tract variable (variable “CT_ID_10”);
    • Removed all cases that did not have a valid issued or expiration date;
    • Removed all cases with inconsistent dates (the expiration date was before the issue date);
    • Removed all cases that refer to “late” fees.
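
The cleaning rules above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the field names `issued`, `expires`, and `description` are hypothetical, "accurate information" is approximated as "not empty", and the sample records are made up.

```python
from datetime import date

def is_clean(record):
    """Return True if a license record passes all four cleaning rules."""
    # Rule 1: planning district and census tract must be present
    # (assumption: "accurate" is approximated here as "non-empty").
    if not record.get("BRA_PD") or not record.get("CT_ID_10"):
        return False
    issued = record.get("issued")      # hypothetical field name
    expires = record.get("expires")    # hypothetical field name
    # Rule 2: both dates must be present.
    if issued is None or expires is None:
        return False
    # Rule 3: the expiration date must not come before the issue date.
    if expires < issued:
        return False
    # Rule 4: drop entries that refer to late fees
    # (assumption: they can be spotted in a free-text description).
    if "late" in record.get("description", "").lower():
        return False
    return True

# Made-up sample records, for illustration only.
records = [
    {"BRA_PD": "A", "CT_ID_10": "T1", "issued": date(2012, 1, 1),
     "expires": date(2013, 1, 1), "description": "License"},
    {"BRA_PD": "", "CT_ID_10": "T1", "issued": date(2012, 1, 1),
     "expires": date(2013, 1, 1), "description": "License"},
    {"BRA_PD": "A", "CT_ID_10": "T2", "issued": date(2014, 1, 1),
     "expires": date(2013, 1, 1), "description": "License"},
    {"BRA_PD": "A", "CT_ID_10": "T2", "issued": date(2012, 1, 1),
     "expires": date(2013, 1, 1), "description": "Late fee"},
]

clean = [r for r in records if is_clean(r)]
```

Only the first sample record survives: the others fail the district, date-order, and late-fee rules respectively.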

I also added new variables:

  • I created a new variable that identifies unique licenses by concatenating the variables LAST NAME and ADDRESSKEY. With it I built a new data set that contains only unique licenses.
  • I created two new variables, “SAME.STATE” and “SAME.ZIP”, to identify licenses held by in-state owners and by in-city owners.
  • I created a new variable for each valid year of each observation. A valid year is defined as a year between the issued and expiration dates.
  • I created a new variable called “Coffee” that flags businesses that are coffee shops, and a new data set that contains only coffee-related businesses.
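
Two of these derived variables, the unique-license key and the valid years, could look roughly like this in Python. The `|` separator, the inclusive treatment of the issue and expiration years, and the sample records are assumptions made for the sketch.

```python
from datetime import date

def license_key(record):
    """Concatenate LAST NAME and ADDRESSKEY into one identifier.

    The "|" separator is an assumption; any unambiguous separator works.
    """
    return f"{record['LAST NAME']}|{record['ADDRESSKEY']}"

def valid_years(issued, expires):
    """List every calendar year from the issue year to the expiration year.

    Assumption: both end years count as valid (inclusive range).
    """
    return list(range(issued.year, expires.year + 1))

# Made-up sample records, for illustration only.
licenses = [
    {"LAST NAME": "Smith", "ADDRESSKEY": "101"},
    {"LAST NAME": "Smith", "ADDRESSKEY": "101"},  # duplicate license
    {"LAST NAME": "Lopez", "ADDRESSKEY": "202"},
]

# Keep one record per key; later duplicates overwrite earlier ones.
unique = {license_key(r): r for r in licenses}

years = valid_years(date(2012, 6, 1), date(2014, 5, 31))
```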

This new data set, called “Buss.Lic.Clean.coffee”, was aggregated by census tract and by year, and then merged with:

  • the “CT_All_Ecometrics_2014” data set, to add information regarding land area and number of parcels;
  • the “Building Permits” data set, to add information about the total active licenses each year from 2012 to 2015;
  • the “Boston.ACS.SES” data set, to add socioeconomic indicators from the ACS for each data frame.
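
The aggregate-then-merge step can be sketched in Python as below. It counts coffee licenses per (census tract, year) and then attaches tract-level attributes by `CT_ID_10`. The `years`, `land_area`, and `parcels` field names and all values are hypothetical placeholders, not the real data.

```python
from collections import Counter

# Made-up coffee licenses, each with its list of valid years.
coffee = [
    {"CT_ID_10": "T1", "years": [2012, 2013]},
    {"CT_ID_10": "T1", "years": [2013]},
    {"CT_ID_10": "T2", "years": [2012]},
]

# Aggregate: number of active coffee licenses per (tract, year).
counts = Counter()
for lic in coffee:
    for year in lic["years"]:
        counts[(lic["CT_ID_10"], year)] += 1

# Made-up tract-level attributes standing in for the ecometrics data.
tract_info = {
    "T1": {"land_area": 1.5, "parcels": 200},
    "T2": {"land_area": 0.8, "parcels": 90},
}

# Merge: one row per (tract, year) with the tract attributes attached.
# Tracts missing from tract_info simply get no extra columns.
merged = [
    {"CT_ID_10": tract, "year": year, "coffee_count": n,
     **tract_info.get(tract, {})}
    for (tract, year), n in sorted(counts.items())
]
```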

The resulting data set contains 83 observations with 33 variables. Since the data set has incomplete information, I consider only the period from 2012 to 2015, so that the reliability of the index is not compromised. I wanted to analyze the percentage variation in the total number of coffee businesses per census tract from 2012 to 2015, and its relationship with the percentage variation in building permits and in the socioeconomic indicators over the same period.
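
The percentage variation used here can be written as a small helper. This sketch assumes the variation is expressed as a fraction of the 2012 value (so 0.5 means a 50% increase) and returns `None` when the starting count is zero, since the ratio is undefined in that case; how such tracts should really be handled is a separate decision.

```python
def pct_variation(start, end):
    """Relative change from start to end, as a fraction of start."""
    if start == 0:
        # Undefined when there were no coffee shops at the start.
        return None
    return (end - start) / start

# Hypothetical tract: 4 coffee shops in 2012, 6 in 2015.
change = pct_variation(4, 6)
```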

For this assignment, I will compare the percentage variation in the total number of coffee businesses between residential and non-residential neighborhoods. I expect the mean of the second group to be higher than the mean of the first, and the difference to be significant.

The results of the t-test are shown below. The mean for the residential neighborhoods is 0.175, while the mean for the non-residential neighborhoods is 0.526. The p-value is 0.03, which is small, but I would not consider it small enough to rule out that the result is due to chance.
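
For reference, the statistic behind a two-sample t-test can be computed with the Python standard library. The sketch below assumes the unequal-variance (Welch) form, which may or may not be the variant used for the numbers above, and the sample values are made up. It returns the t statistic and the degrees of freedom; turning those into a p-value needs a t distribution, for example via `scipy.stats.ttest_ind` with `equal_var=False`.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    # Squared standard error of each group mean.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up percentage variations, for illustration only.
residential = [0.0, 0.2, 0.1, 0.3]
non_residential = [0.4, 0.6, 0.5, 0.7]

t_stat, dof = welch_t(residential, non_residential)
```

A negative t statistic here means the first group (residential) has the lower mean.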


When we run an ANOVA test to check for differences between types of neighborhoods, we get a high p-value, which indicates that the differences are not significant.
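
A one-way ANOVA compares the variation between group means with the variation inside the groups. The following is a minimal standard-library sketch of the F statistic, using made-up numbers; the p-value would come from an F distribution with the returned degrees of freedom (for example via `scipy.stats.f_oneway`).

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Made-up values for three hypothetical neighborhood types.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

An F statistic close to 1 means the group means differ about as much as chance alone would produce, which corresponds to a high p-value.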


Looking at the differences among the four groups (Downtown, Industrial/Institutional, Park and Residential areas), we see that the first two had an increase of 50% in coffee shops over the 2012-2015 period, while Residential areas are below 25%.


