2015 Boston Tax Assessor: Data stories

We will attempt to “tell a story”, using data from the City of Boston Assessing Department. To accomplish this, our tool of choice will be the R statistical programming language.

And how do we do that? Finding some interesting facts buried within the dataset, and expressing them with clarity and precision in an easy-to-read way, will be a good start.

We’ll explore the 2015 Boston Tax Assessor data, a snapshot of the City of Boston Assessing Department’s centralized database for parcel data for every identifiable parcel in the city for the year 2015. This dataset was made available to the public as part of the City of Boston’s open data initiative.

So let’s get to work!

First, we’ll briefly describe the structure of our dataset, and then look in detail at a few of the cases contained within.

We begin by loading the dataset and assigning it to a variable called “TAdata”:

TAdata <- read.csv('data/Tax Assessor 2015 - Data.csv', stringsAsFactors = FALSE)

Then we find out how many observations (rows) are included in the dataset:

nrow(TAdata)
## [1] 168146

How many variables are available for analysis…

ncol(TAdata)
## [1] 65

…and their names:

names(TAdata)
##  [1] "PID"             "CM_ID"           "GIS_ID"         
##  [4] "ST_NUM"          "ST_NAME"         "ST_NAME_SUF"    
##  [7] "UNIT_NUM"        "ZIPCODE"         "PTYPE"          
## [10] "LU"              "OWN_OCC"         "OWNER"          
## [13] "MAIL_ADDRESSEE"  "MAIL_ADDRESS"    "MAIL.CS"        
## [16] "MAIL_ZIPCODE"    "AV_LAND"         "AV_BLDG"        
## [19] "AV_TOTAL"        "GROSS_TAX"       "LAND_SF"        
## [22] "YR_BUILT"        "YR_REMOD"        "GROSS_AREA"     
## [25] "LIVING_AREA"     "NUM_FLOORS"      "STRUCTURE_CLASS"
## [28] "R_BLDG_STYL"     "R_ROOF_TYP"      "R_EXT_FIN"      
## [31] "R_TOTAL_RMS"     "R_BDRMS"         "R_FULL_BTH"     
## [34] "R_HALF_BTH"      "R_KITCH"         "R_HEAT_TYP"     
## [37] "R_AC"            "R_FPLACE"        "S_NUM_BLDG"     
## [40] "S_BLDG_STYL"     "S_UNIT_RES"      "S_UNIT_COM"     
## [43] "S_UNIT_RC"       "S_EXT_FIN"       "U_BASE_FLOOR"   
## [46] "U_NUM_PARK"      "U_CORNER"        "U_ORIENT"       
## [49] "U_TOT_RMS"       "U_BDRMS"         "U_FULL_BTH"     
## [52] "U_HALF_BTH"      "U_KIT_TYPE"      "U_HEAT_TYP"     
## [55] "U_AC"            "U_FPLACE"        "Blk_ID"         
## [58] "BG_ID_10"        "CT_ID_10"        "X"              
## [61] "Y"               "LocationID"      "TLID"           
## [64] "BRA_PD"          "NSA_NAME"

With the help of the dataset’s codebook (available at the Harvard Dataverse), which defines each variable and its range of values, we can choose one or more variables for further investigation. “LU” (land use), “YR_BUILT”, “OWN_OCC” (a variable that indicates whether the owner of the parcel lives in it) and “NUM_FLOORS” sound interesting.

Focusing on these variables, we’ll take a look at three cases, selected for no particular reason other than their ordinal positions in our dataset: the first one, the one at the middle, and the last one.

TAdata[c(1,nrow(TAdata)/2, nrow(TAdata)), c("LU", "YR_BUILT","OWN_OCC", "NUM_FLOORS")]
##        LU YR_BUILT OWN_OCC NUM_FLOORS
## 1      R3     1900       Y          3
## 84073  R3     1905       N          3
## 168146  E     1900       N          1

What have we learnt about these three individual cases?

We have:

  • A three-story house from 1900, occupied by its owner
  • A three-story house from 1905, not occupied by its owner
  • A single-story tax-exempt site, dating from 1900, not occupied by its owner

Comparing the two houses and noting how little they differ, we may suspect that the early-1900s three-family residential building (land use code “R3”) is widespread in Boston.

These few cases are not enough to generalize, but they illustrate characteristics of the City of Boston housing stock that we can analyse further to determine whether they are predominant.

The data does not specify what’s on the tax-exempt parcel, but we can tentatively infer that there’s no building on it: the empty “roof type” variable seems to support this conclusion.

So, let’s propose some hypotheses about Boston’s built environment:

  • Many parcels are not occupied by their owners
  • Three-storied buildings are common
  • There are plenty of structures built during the first decade of the 20th century (was it a “boom time” for the city?)

We can investigate further, and see if the hypotheses hold up.

Let’s start by finding out the percentage of cases in which the occupant is the owner of the parcel.

To do that, first we select a subset where the OWN_OCC variable equals “Y”:

owner_occupants <- TAdata[TAdata$OWN_OCC == "Y",]
nrow(owner_occupants)
## [1] 75281

Now we know there are 75281 properties where the owner is the occupant.

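Just in case, we make sure that we didn’t leave out any rows with an unexpected value (i.e. anything other than “Y” or “N”). We count the rows whose value is either “Y” or “N”; if that count equals the total number of rows, there are no unexpected values:

# Rows where OWN_OCC is one of the two expected values
nrow(TAdata[TAdata$OWN_OCC == "Y" | TAdata$OWN_OCC == "N",])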
## [1] 168146
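A complementary way to run this kind of sanity check is to tabulate every distinct value in the column. The following is only a minimal sketch on a made-up vector (the values of `own_occ` are invented for illustration and are not taken from the assessor data); it uses base R's `table()` and `setdiff()`.

```r
# Toy stand-in for TAdata$OWN_OCC; these values are invented for illustration.
own_occ <- c("Y", "N", "N", "Y", "", "N")

# Count each distinct value, listing NA separately if any are present.
value_counts <- table(own_occ, useNA = "ifany")
print(value_counts)

# Any value other than the two expected ones (here, a single blank entry).
unexpected <- setdiff(unique(own_occ), c("Y", "N"))
print(unexpected)
```

On the real dataset, the equivalent call would be `table(TAdata$OWN_OCC, useNA = "ifany")`; an empty `unexpected` vector would mean every parcel is coded as either “Y” or “N”.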

Knowing that owner occupancy is defined for every parcel, we can obtain the owner occupancy rate across all parcels in the City of Boston:

owner_occupancy_rate <- nrow(owner_occupants) / nrow(TAdata)
owner_occupancy_rate
## [1] 0.4477121
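The same kind of rate can also be computed without building a subset first, because R treats TRUE as 1 and FALSE as 0 when averaging a logical vector. This is a small sketch on invented values (the `own_occ` vector below is hypothetical), not a rerun of the analysis.

```r
# Toy stand-in for TAdata$OWN_OCC; invented values.
own_occ <- c("Y", "N", "N", "Y", "N")

# The mean of a logical vector is the share of TRUE values: 2 out of 5 here.
rate <- mean(own_occ == "Y")
print(rate)

# prop.table() turns the counts from table() into shares for every value.
shares <- prop.table(table(own_occ))
print(shares)
```

Applied to the real data, `mean(TAdata$OWN_OCC == "Y")` should match the ratio of the two `nrow()` calls, provided the column contains no missing values.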

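And now we know that 44.8% of the parcels in Boston are occupied by their owners, which leaves 55.2% that are not. (By the way, that supports our first hypothesis: many parcels are not occupied by their owners, although owner occupancy is far from rare.)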

Now, let’s see how common three-storied buildings are.

This is an opportunity to make use of ggplot2, an R package with powerful graphics capabilities. We’ll tell R that we want to start using ggplot2:

library(ggplot2)

Now, let’s plot the distribution of buildings, by # of floors:

# geom_histogram() is the geom that accepts binwidth; one bin per floor count
ggplot(data=TAdata, aes(x=NUM_FLOORS)) + geom_histogram(binwidth = 1) + ggtitle("Boston: Parcels by floor #")

Boston TA data - Telling data stories (1)

The answer is clear: three-story buildings are common in Boston, surpassed in number only by two-story and one-story structures.
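A ranking read off a plot can be double-checked numerically by counting the values and sorting the counts. The sketch below uses an invented vector of floor counts (`num_floors` is hypothetical, chosen only so that the ordering is easy to follow) and base R's `table()` and `sort()`.

```r
# Toy stand-in for TAdata$NUM_FLOORS; invented values.
num_floors <- c(2, 2, 2, 2, 1, 1, 1, 3, 3, 4)

# Count parcels per floor number, most frequent first.
floor_counts <- sort(table(num_floors), decreasing = TRUE)
print(floor_counts)
```

On the real dataset, `sort(table(TAdata$NUM_FLOORS), decreasing = TRUE)` would list the floor counts in order of frequency, which makes it easy to confirm where three-story buildings rank.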

With just a slight adjustment, we can use a similar histogram to find out whether the early 1900s were a boom period for the City of Boston, as evidenced by a particularly high number of structures being built:

# Same plot as before, with one bin per year of construction
ggplot(data=TAdata, aes(x=YR_BUILT)) + geom_histogram(binwidth = 1) + ggtitle("Boston: Structures built by year")

Boston TA data - Telling data stories (2)

The result does not look quite right. Our plot’s X axis, which represents year of construction, starts at 0, probably because unknown building dates are listed as “0” in our dataset.

Let’s take a look at how many cases show a year earlier than the founding of the city in 1630:

# Compare against a number, not a string, so that years are ordered numerically
nrow(TAdata[TAdata$YR_BUILT < 1630 ,])
## [1] 20974

And how many cases show 0 as their building year:

nrow(TAdata[TAdata$YR_BUILT == 0 ,])
## [1] 20974

It’s the same number! That’s a good sign: if every suspicious date is 0, we can safely assume that such a value simply means that the year was unknown at the time the parcel was added to the database.

We remove the “year zero” cases:

TAdata_no_year_zero <- TAdata[TAdata$YR_BUILT != 0 ,]
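An alternative to dropping the rows is to keep them and mark the unknown years as missing, so the parcels remain available for other analyses. The following is a sketch on a small invented data frame (the `parcels` object and its values are hypothetical), not the approach used in this post.

```r
# Toy stand-in for TAdata; invented rows.
parcels <- data.frame(
  PID = c("a", "b", "c", "d"),
  YR_BUILT = c(1900, 0, 1905, 0)
)

# Recode the placeholder year 0 as NA (missing) instead of deleting the rows.
parcels$YR_BUILT[parcels$YR_BUILT == 0] <- NA

# How many years are unknown, and which rows still have a known year.
n_unknown <- sum(is.na(parcels$YR_BUILT))
known <- parcels[!is.na(parcels$YR_BUILT), ]
print(n_unknown)
print(known)
```

With this encoding, a ggplot2 histogram of the year would leave out the missing values and emit a warning about the removed rows, which serves as a reminder of how much data is not shown.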

We plot once again:

ggplot(data=TAdata_no_year_zero, aes(x=YR_BUILT)) + geom_histogram(binwidth = 1) + ggtitle("Boston: Structures built by year")

Boston TA data - Telling data stories (3)

Now the result looks better, and also supports our guess: the turn of the 20th century was a period of intense building activity in Boston, probably* unsurpassed before or since.

* Two caveats. First, all or most of the buildings that we left out of the plot could have been built in one specific year (say, 1960), in which case the early 1900s would not be the period of highest construction activity. Second, there may have been a period of record building activity in the remote past that we can’t detect with our data, because most of those structures have since been demolished. (Of course, this is not likely.)
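To put a number on the “boom period” reading of the plot, the years can be grouped into decades and counted. This is a minimal sketch on invented years (`yr_built` is hypothetical); integer division with `%/%` maps each year to the start of its decade.

```r
# Toy stand-in for TAdata_no_year_zero$YR_BUILT; invented values.
yr_built <- c(1899, 1900, 1905, 1905, 1910, 1925, 1960, 1987)

# Map each year to its decade, e.g. 1905 -> 1900.
decade <- (yr_built %/% 10) * 10

# Count structures per decade and pick the decade with the most.
decade_counts <- table(decade)
busiest <- names(decade_counts)[which.max(decade_counts)]
print(decade_counts)
print(busiest)
```

Running the same steps on the filtered dataset would show which decade accounts for the most surviving structures, subject to the caveats above.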
