Pulse of the City: Understand the Vibrancy of Boston’s Business Constituents

Last time we talked about food establishments in Boston, today we turn our focus onto the largest data set in the database–“Licenses.Old.Allowed.tab”, which contains records of all business licenses in Boston issued from 2007 to 2015, including those that have expired. This will give us a more holistic view on the vibrancy of Boston’s business constituents.

Basic facts

The raw data (bizz) contains 29,131 records, however, many of the rows are duplicates with only differences in information related to the license application fee charges (the APFEETYPEKEY, FEEDESC and AMT column). Because this data set keeps track of all charges related to a business license, including renewals and late fees. In order to calculate an accurate number of business licenses ever existed, I removed the duplicated items based on unique license IDs and created a new data frame to work on 1.

The cleaned data (bizz.clean) showed some different numbers: from Jan 2007 to Aug 2015, there were in total 12,289 business licences issued in Boston; 3,766 of which may have expired by the end of 2015 (if the business didn’t renew it).

The most vibrant neighborhood

A lot of the columns in this data set are about a business’s location info, including the address, GPS coordinate, zip code, street and neighborhood names. With such comprehensive information, one would naturally wonder, what’s the geographic distribution of the businesses in Boston? Will the number of business licenses issued be proportioned to the population, the size, or any other characteristics of each neighborhood?2


With more than 2500 business licenses issued in the past 7 years, the Central district is unarguably the busiest neighborhood. The number is more than twice of the district in the second place. We are not entirely sure of the boundaries and the demographics of the districts, and the divisions used by this data set are not in sync with Boston Redevelopment Authority’s latest definition of “neighborhoods”. The chart may look different if we re-map each business into the 26 neighborhoods as displayed on the official website of BRA and the City of Boston.

Nevertheless, we do get a general picture on which are the hot areas of Boston. Although if we want to conduct further studies on the correlation between the demographics/population and the business vibrancy, we’d need to obtain those data from other data sources first.

Top business types

What we can do now is to go deeper with existing data and look into each neighborhood’s business constituents, so to have a better understanding of the city by knowing what types of businesses thrive where.

Let’s take the top 10 neighborhoods for example.3


As you may have noticed, a large portion of the records has the license category as “not set”. Based on the “FEEDESC” column, they all seem to be related to different permits required by different government departments, such as Certificate of Inspection, a permit for construction projects, and BFD Permit, the safety permit by the Boston Fire Department. They are counted as new licenses, but like renewals and late fee records, they are just additional records for the same business, thus cannot be used to calculate the density of businesses.

When removed these records (bizz.super.clean), the graph looks moderately different4:


It’s needless to say that food and drinks played a major part in the business world. Every neighborhood has more than 3/4 of its business licenses issued to food and drinks establishments. But some more subtle facts are also reflected in this graph: despite similar total numbers of businesses, more liquid stores can be found in the Bay/Beacon Hill area than in Alston/Brighton, but it seems far more convenient to buy takeout food in Alston. And comparing to the city average, Roxbury seems to be barren of entertainment.

The cost to maintain the licenses

With so many different licenses required to run a business, how much do they actually cost? It is reported that only half of the businesses would survive more than 5 years. Keeping the cost of low is crucial to the survival of small businesses.

To answer this question, we must return to the raw data set. The first thing to do is to find a unique identifier for businesses. We can’t use LICENSENO as the identifier any more, because we are calculating for each business the total cost of all licenses.

Some other columns were considered for this purpose:

  • ADDR1, ADDRKEY, and Property_ID – Address related info, looks unique, but there can be multiple businesses in one address, and the owner of the property may change.
  • NAMEFIRST + NAMELAST – It’s possible to get duplicates, and some business owners may be in charge of multiple businesses.
  • ADDRKEY + NAMELAST – Unique to the business and address, although some businesses may operate on different addresses.

Weighing each identifier’s pros and cons, “ADDRKEY + NAMELAST” looks like the most appropriate candidate. Because the majority of the data would be about small business owners who don’t really operate on multiple addresses or own multiple businesses.5


A little less than 80% of all businesses had only 1 license, but 70% of them have made more than 2 payments6. The median of the license fee is $200, meaning the majority of businesses pays this amount for their licenses.7


Above graph shows the distribution of the price paid for obtaining or maintaining a business license (up to $3,000), including renewals, late fees and fines. Prices larger than $3,000 were excluded, because their numbers are small anyway. The number go down as the price goes up, but there is an exception at $2,800, which is the exact price for a “7 Day All Alcoholic Beverages License”. All restaurants and clubs that sell alcohols are required to keep this license.

It seems that as long as your business doesn’t involve alcohol, the cost of maintaining the business license should not be a burden to you.

Corrupted Data

A few unwanted abnormalities were found in the data while working on this article:

  1. 1,684 rows (5.8%) of data have an issue date later than the expiration date, which doesn’t make much sense. We’d have to brush them off if we’d like to do any analysis based on these dates.
  2. Some businesses, such as B B Kitchen Inc., have multiple records of paying different amount for various fees, but some other important information, such as location and phone number, is missing.
  3. Some company, namely “Gilbert Realty Trust”, has a lot of records with license numbers in the format of “RINS-XXXXX”, but I couldn’t find any information about this specific license type. The FEEDESC column says “Filing fee”, but what is this filing fee?

To rectify these records, we’d have to go back to the original database, or pull from other data sources, so to cross-check their validity.

  1. Sanitation
    cleanbiz <- function() {
     # filter out bad license numbers
     dat <- bizz[grep("^[a-zA-Z0-9\\-]+$", bizz$LICENSENO), ]
     # sort empty license cat to the end, just so to let as many as
     # possible rows have the category info
     dat <- dat[order(dat$LICENSECAT, decreasing = TRUE), order(names(dat))]
     # get unique rows by license number
     uniq.no <- unique(dat$LICENSENO)
     # the index of the first appearances of uniq elements
     uniq.index <- match(uniq.no, dat$LICENSENO)
     dat[uniq.index, ]
    bizz.clean <- cleanbiz()

  2. Plot the top neighborhood
    biz.plotTopNH <- function(df) {
     # the first row contains an empty string, `[-1, ]` removes it
     df <- as.data.frame(sort(table(df$BRA_PD)))[-1, ]
     names(df) <- c("Neighborhood", "Count")
     top.nh <<- df # save top neighborhoods in a global variable
     ggplot(data = df, aes(x = Neighborhood, y = Count, fill = "tomato2")) +
     geom_bar(stat = "identity") +
     legend.position = "none",
     plot.title = element_text(
     face = "plain", size = 11,
     margin = margin(10, 0, 10, 0)
     )) +
     coord_flip() +
     ggtitle("Number of Business Licenses Issued in Boston") +
     xlab("") + ylab("")

  3. Plot the license category breakdown for top neighborhoods
    biz.plotTopCatByNH <- function(df) {
     # get records in the top neighborhoods
     top.nh.names <- tail(top.nh$Neighborhood, n = 10)
     df <- df[df$BRA_PD %in% top.nh.names, ]
     # Consolidate Categories
     df$LICENSECAT <- as.character(df$LICENSECAT)
     df$LICENSECAT[df$LICENSECAT == " "] <- "NotSet"
     top.cat <- sort(table(df$LICENSECAT), decreasing = TRUE)
     top.cat <- names(head(top.cat, n = 9))
     df[!(df$LICENSECAT %in% top.cat), ]$LICENSECAT <- "Other"
     dat <- as.data.frame(xtabs(
     data = df[, c("LICENSECAT", "BRA_PD")],
     drop.unused.levels = TRUE))
     # Give categories readable names
     dat <- merge(dat, licensecat)
     ggplot(dat, aes(x = BRA_PD, y = Freq, fill = CATDESC)) +
     geom_bar(stat = "identity") +
     theme(axis.text.x = element_text(angle = 90)) +
     scale_fill_discrete("License Category") +
     ggtitle("License Categories in Top Neighborhoods") +
     plot.title = element_text(
     face = "plain", size = 11,
     margin = margin(10, 0, 12, 0)
     legend.key.size = unit(12, "pt")) +
     xlab("") + ylab("")

  4. Plot the license category breakdown with “Not set” data removed
    bizz.super.clean <- bizz.clean[bizz.clean$LICENSECAT != " ",]

  5. Plot the distribution of payments and licenses of each business
    bizz.cost.clean <- bizz[grep("^[a-zA-Z0-9\\-]+$", bizz$LICENSENO), ]
    bizz.cost.clean$BIZID <- paste(bizz.cost.clean$ADDRKEY,
    bizz.clean$BIZID <- paste(bizz.clean$ADDRKEY,
    dat1 <- as.data.frame(table(bizz.cost.clean$BIZID))
    dat2 <- as.data.frame(table(bizz.clean$BIZID))
    p1 <- ggplot(dat1[dat1$Freq <= 20, ],
     aes(x = Freq, fill = "tomato2")) + 
     geom_histogram(binwidth = 1) +
     panel.background = element_rect(fill = "grey95"),
     legend.position = "none",
     plot.title = element_text(
     face = "plain",
     size = 11,
     margin = margin(10, 0, 15, 0)
     )) +
     ggtitle("Number of Payments") +
     xlab("") + ylab("Count")
    p2 <- ggplot(dat2[dat2$Freq <= 15, ],
     aes(x = Freq, fill = "tomato2")) + 
     geom_histogram(binwidth = 1) +
     panel.background = element_rect(fill = "grey95"),
     legend.position = "none",
     plot.title = element_text(
     face = "plain",
     size = 11,
     margin = margin(10, 0, 10, 0))) +
     ggtitle("Number of Licenses") +
     xlab("") + ylab("Count")
    grid.arrange(p1, p2, ncol = 2)
    # Calculates the two numbers of percentages
    p1 <- sum(dat1[dat1$Freq > 2, ]$Freq) / sum(dat1$Freq)
    p2 <- sum(dat2[dat2$Freq <= 1, ]$Freq) / sum(dat2$Freq)

  6. Calculate the percentages
    p1 <- sum(dat1[dat1$Freq > 2, ]$Freq) / sum(dat1$Freq)
    p2 <- sum(dat2[dat2$Freq <= 1, ]$Freq) / sum(dat2$Freq)

  7. Plot the distribution of the number of payment by price
    ggplot(bizz.cost.clean[bizz.cost.clean$AMT < 3000, ], aes(x = AMT)) +
     geom_histogram(binwidth = 50, fill = "tomato2") +
     ggtitle("Number of Business License Payment by Price (< $3000)") +
     theme(panel.background = element_rect(fill = "grey95")) +
     xlab("Amount (USD)") + ylab("Count")


One thought on “Pulse of the City: Understand the Vibrancy of Boston’s Business Constituents

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s