Pulse of the city: Craigslist Posts

As described in the previous post, the data contains 13 variables as below:

Part 1

First, to understand how rental prices changed during the pandemic, we need a basic picture of the price distribution over this period.

The median rent across all posts is 1995 dollars, and the plot shows that the most common price range lies between 10^3 and 10^3.5 (1000 and 3162 dollars). However, there are some implausible prices, such as the $0 posts at the left of the histogram and posts near 10^7 (10 million dollars) at the right.
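As a minimal sketch of how prices map onto log-scale bins like the ones described above, the snippet below assigns each price to a half-decade bin and computes the median. The sample prices, the `log10_bin` helper, and the bin width of 0.5 are illustrative assumptions, not the data or code behind the original plot.

```python
import math
from collections import Counter
from statistics import median

# Hypothetical rents in dollars; these are NOT values from the real data set.
prices = [0, 950, 1500, 2100, 2400, 3100, 10_000_000]

def log10_bin(price, width=0.5):
    """Lower edge (in log10 units) of the bin holding `price`; None if price <= 0."""
    if price <= 0:
        return None  # log10 is undefined at 0, so $0 posts get their own bucket
    return math.floor(math.log10(price) / width) * width

# Count how many posts fall into each half-decade bin.
bin_counts = Counter(log10_bin(p) for p in prices)
median_price = median(prices)

# The bin starting at 3.0 spans 10^3 to 10^3.5, i.e. about 1000 to 3162 dollars.
print(bin_counts[3.0], median_price, round(10 ** 3.5))
```

Keeping the $0 posts in a separate bucket makes them easy to count and inspect before deciding whether to drop them.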

To examine the price distribution without the outliers, we set a filter to exclude the extremely high prices. The remaining distribution looks roughly normal, though slightly skewed to the right.
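The filtering step might look like the sketch below. The cutoff of 10,000 dollars is an assumption chosen for illustration, since the exact threshold is not stated here, and the sample prices are made up. Comparing the mean with the median is one quick, rough check for right skew.

```python
from statistics import mean, median

# Hypothetical rents in dollars; NOT values from the real data set.
prices = [900, 1400, 1800, 2000, 2300, 2900, 5200, 10_000_000]

MAX_PRICE = 10_000  # assumed cutoff; the actual threshold used is not stated

# Keep only posts at or below the cutoff.
filtered = [p for p in prices if p <= MAX_PRICE]

# In a right-skewed sample the long upper tail pulls the mean above the median.
is_right_skewed = mean(filtered) > median(filtered)
print(len(filtered), median(filtered), is_right_skewed)
```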

We also found far fewer posts in February, March, and April than in later months. Whether this is caused by seasonal patterns or by Covid-19 is worth further investigation. In addition to the monthly variation, the counts appear to fluctuate within each week, which may be worth exploring to uncover people's posting habits.
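Monthly and weekly posting patterns can be tallied as in the sketch below. The timestamps and their `YYYY-MM-DD HH:MM` format are assumptions for illustration; the real column may be stored differently.

```python
from collections import Counter
from datetime import datetime

# Hypothetical posting timestamps; the format is an assumption about the data.
timestamps = [
    "2020-02-03 14:05",
    "2020-03-15 16:40",
    "2020-06-01 13:10",
    "2020-06-02 18:30",
    "2020-06-06 09:45",
]
parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in timestamps]

# Number of posts in each calendar month (1 = January).
posts_per_month = Counter(d.month for d in parsed)

# weekday() returns 0 for Monday through 6 for Sunday.
posts_per_weekday = Counter(d.weekday() for d in parsed)

print(dict(posts_per_month), dict(posts_per_weekday))
```

Separating seasonal effects from Covid-19 effects would need the same counts for an earlier year, which this sketch does not attempt.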

As for time of day, people post on Craigslist mostly during the afternoon (1pm to 7pm), and almost no one posts during the early morning between 5am and 9am.
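One way to quantify the time-of-day pattern is to compute the share of posts in each window, as sketched here. The posting hours are invented for illustration, and `share_in_window` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical posting hours on a 24-hour clock; NOT from the real data set.
post_hours = [13, 14, 14, 15, 16, 18, 10, 21, 23, 2]

posts_per_hour = Counter(post_hours)

def share_in_window(hours, start, end):
    """Fraction of posts whose hour h satisfies start <= h < end."""
    if not hours:
        return 0.0
    return sum(start <= h < end for h in hours) / len(hours)

afternoon_share = share_in_window(post_hours, 13, 19)     # 1pm up to 7pm
early_morning_share = share_in_window(post_hours, 5, 9)   # 5am up to 9am
print(afternoon_share, early_morning_share)
```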

Now that we have converted all months into integers, it makes sense to look deeper into the pricing pattern of the posts. Contrary to our intuition, the percentage of higher-priced posts seemed to increase during the Covid months, while the lowest-priced posts generally stayed the same. This trend is certainly worth looking into in the next stage, as we currently have no convincing hypothesis for why people would raise their asking prices.
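The share of higher-priced posts per month could be computed along the lines of the sketch below. The 3000-dollar threshold, the sample `(month, price)` pairs, and the helper name are all assumptions; the price tiers used in the analysis above are not specified.

```python
from collections import defaultdict

# Hypothetical (month, price) pairs; NOT values from the real data set.
posts = [
    (2, 1800), (2, 2600),
    (3, 1900), (3, 3400),
    (4, 3600), (4, 3900), (4, 1500), (4, 3200),
]

HIGH_PRICE = 3000  # assumed threshold for a "higher-priced" post

def high_price_share_by_month(posts, threshold):
    """Map each month to the fraction of its posts priced at or above threshold."""
    totals = defaultdict(int)
    high = defaultdict(int)
    for month, price in posts:
        totals[month] += 1
        if price >= threshold:
            high[month] += 1
    return {m: high[m] / totals[m] for m in sorted(totals)}

shares = high_price_share_by_month(posts, HIGH_PRICE)
print(shares)
```

Working with shares rather than raw counts matters here, because the number of posts per month varies a lot and raw counts would mostly reflect posting volume.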

Part 2

In an effort to find patterns within the posts, we also investigated the location field in the dataset.

The data set contains 79419 records of rental posts in the Greater Boston area. The location column, however, contains 7159 unique values, far more than the number of areas near Boston.

Manual inspection suggests that the problem comes from the free-text location field: there is little control over what people enter, so they type in customized location information. In the next step, keyword extraction might help us determine the exact area of each post.
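A very simple form of keyword extraction is to look for known area names inside the free-text entry, as in the sketch below. The keyword list, the sample entries, and the `match_area` helper are hypothetical; a real list would need to cover every neighborhood and town of interest.

```python
# Hypothetical area keywords; more specific names are listed before "boston"
# so that an entry like "Allston, Boston" maps to the neighborhood.
AREA_KEYWORDS = ["cambridge", "somerville", "brookline", "allston", "boston"]

def match_area(raw_location):
    """Return the first known area keyword found in the entry, or None."""
    text = raw_location.lower()
    for keyword in AREA_KEYWORDS:
        if keyword in text:
            return keyword
    return None

# Hypothetical free-text entries; NOT values from the real data set.
raw_locations = [
    "Cambridge near MIT",
    "CAMBRIDGE - Harvard Sq",
    "Somerville/Davis",
    "5 min to Boston",
    "great deal!!",
]

matched = [match_area(loc) for loc in raw_locations]
unique_raw = len(set(raw_locations))
unique_matched = len({m for m in matched if m is not None})
print(unique_raw, unique_matched, matched)
```

Plain substring matching is crude: it misses misspellings and can pick up place names mentioned only as a reference point (for example "5 min to Boston"), so the matches would still need spot checks.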

Moreover, there are 14402 records with empty location information. Because location is an important variable for our analysis, we plan to extract area information from the post body in the next stage.
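The planned fallback could start from something like the sketch below, which finds records with an empty location and searches the post body for a known area name. The field names `location` and `body`, the area list, and the sample records are assumptions for illustration.

```python
import re

KNOWN_AREAS = ["Cambridge", "Somerville", "Brookline"]  # hypothetical list

# Match any known area as a whole word, ignoring case.
AREA_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in KNOWN_AREAS) + r")\b",
    re.IGNORECASE,
)

# Hypothetical records; the field names are assumptions about the data.
posts = [
    {"location": "Cambridge", "body": "Sunny 2BR near the river."},
    {"location": "", "body": "Spacious unit in somerville, close to the T."},
    {"location": None, "body": "Call now for a showing."},
]

def is_missing(value):
    """True for None, empty strings, and whitespace-only strings."""
    return value is None or not value.strip()

def area_from_body(body):
    """Return the first known area mentioned in the body, or None."""
    match = AREA_PATTERN.search(body)
    return match.group(1).title() if match else None

missing = [p for p in posts if is_missing(p["location"])]
recovered = [area_from_body(p["body"]) for p in missing]
print(len(missing), recovered)
```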

From the histogram, we can see a very small number of posts with extremely large areas. After examining some of these cases, we concluded that they are not legitimate posts, so we filtered out all posts with an area larger than 5000 sqft and re-plotted the histogram.

The distribution of area now looks more reasonable, with a bell-shaped curve centered around 900-1000 sqft, which is about the size of a typical 2-bedroom apartment. In the process, we also noticed that some people post very precise area values, so we may need to check whether people actually post the correct square footage.
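As a rough sketch, the area filter and a first pass at flagging unusually precise values might look like this. The 5000 cutoff comes from the text above; treating any value that is not a multiple of 10 as "very precise" is purely an assumed heuristic, and the sample areas are invented.

```python
# Hypothetical areas in sqft; NOT values from the real data set.
areas = [650, 900, 937, 1000, 1213, 1500, 40000]

MAX_AREA = 5000  # cutoff described in the text
ROUND_TO = 10    # assumption: values that are not multiples of 10 count as "precise"

# Drop posts with implausibly large areas.
plausible = [a for a in areas if a <= MAX_AREA]

# Flag values that look measured rather than rounded.
precise = [a for a in plausible if a % ROUND_TO != 0]
print(len(plausible), precise)
```

A precise number is not necessarily a wrong one, so the flagged posts are candidates for manual review rather than removal.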

