Emil Fine

Personal Website

Data Visualization Example for Machine Learning

Here we’ll go over a few Python methods for exploring the data visually before getting into machine learning techniques. This page builds upon the preprocessing article for our King County real estate dataset. Since preprocessing is already finished, all of our data is in numeric format. Note: Columns are truncated in the images below.

We can always start with the describe() function to get a high-level statistical overview of our data. This gives us a basic idea of what we’re dealing with, such as an average of roughly 4 bedrooms, 3 bathrooms, and ~2,000 sq ft of living space per home.

print(data.describe())
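
As a small, self-contained illustration of what describe() reports, the sketch below runs it on a hypothetical three-row frame. The column names mirror the dataset, but the values are made up for the example and the variable name toy is my own.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from the King County dataset.
toy = pd.DataFrame({
    "bedrooms": [3, 4, 5],
    "bathrooms": [2.0, 2.5, 3.0],
    "sqft_living": [1500, 2000, 2500],
})

# describe() returns count, mean, std, min, quartiles, and max for each numeric column.
summary = toy.describe()

# Pick out just the rows we care about for a quick sanity check.
print(summary.loc[["mean", "min", "max"]])
```

Reading the mean row next to the min and max rows is a quick way to spot columns with suspicious outliers before plotting anything.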

One of my favorite charts is a heatmap created using the Seaborn library. It can visually show you the correlations between the features.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
f, ax = plt.subplots(figsize=(12, 12))
sns.set(font_scale=.7)
ax = sns.heatmap(data.corr(), square=True, annot=True, fmt='.2f', cmap='winter', linewidths=0.0)
# Manually set the y-limits: matplotlib v3.1.1 does not work properly with Seaborn and otherwise cuts off the first and last rows.
ax.set_ylim(17.0,0)
plt.show()

As expected, living space is highly correlated with price, along with the grade given by the local community. Nothing here is very surprising, but the heatmap does show which features have little relationship with price and could be removed from our price prediction model to improve speed. For example, anything whose correlation with price is under 9% (0.09) in absolute value can be removed: sqft_lot, condition, yr_built, zipcode, sqft_lot15.
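
The same cut-off can be applied in code instead of reading values off the heatmap. The following is a minimal sketch, assuming the threshold is meant as an absolute correlation with price. The helper name low_correlation_features and the demo frame are hypothetical and use made-up values, not the real dataset.

```python
import numpy as np
import pandas as pd

def low_correlation_features(df, target="price", threshold=0.09):
    """Return columns whose absolute correlation with `target` is below `threshold`."""
    corr = df.corr()[target].drop(target)   # correlation of every feature with the target
    return corr[corr.abs() < threshold].index.tolist()

# Deterministic demo data (hypothetical values).
n = 100
sqft_living = np.arange(1000, 1000 + 10 * n, 10)
demo = pd.DataFrame({
    "sqft_living": sqft_living,
    "unrelated": np.tile([1, -1], n // 2),   # alternating pattern, nearly uncorrelated with price
    "price": 200.0 * sqft_living + 50000,    # price depends only on living space here
})

to_drop = low_correlation_features(demo)
reduced = demo.drop(columns=to_drop)
print(to_drop)
```

Taking the absolute value matters because a feature with a strong negative correlation is still informative. A low linear correlation also does not prove a feature is useless to a non-linear model, so treat the list as candidates for removal rather than a final answer.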

Next we use a pie chart and a histogram to view the cumulative monthly sales. From these two graphs we can see that the months of May to July have the highest sales by volume, while January has the lowest.

# Create pie chart for Sales/Month -------------------------------------------------
plt.figure(figsize=(10, 10))
plt.subplot(221) # Create 2x2 grid, insert graph into slot 1
labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
plt.title('Pie Chart: Cumulative Sales Per Month')
colors = ["#ff4d4a","#ffaa6d", "#FF9999", "#99ffff", "#4aff4d", "#6dc2ff", "#aa6dff", "#c2ff6d", "#fff67b", "#7b84ff", "#f67bff", "#84ff7b"]
data['month'].value_counts().sort_index().plot.pie(autopct='%1.2f%%', # Display pie percentages
                                        labels=labels,
                                        fontsize=8,
                                        counterclock=False, # Draw the wedges clockwise
                                        startangle=150, # Rotate the start of the pie by 150 degrees
                                        pctdistance=0.85, # Distance of the percentage text from the center
                                        colors=colors,
                                        explode=[0.1]*12) # Offset each wedge from the center
# Create donut pie chart by centering white circle --------------------------------
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
# Create histogram for Sales/Month -------------------------------------------------
plt.subplot(222) # Use the same 2x2 grid, insert graph into slot 2
ind = np.arange(1, 13)
data['month'].hist(bins=12, label=labels, range=(.5, 12.5))
plt.ylabel('Sales Count')
plt.xticks(ind, labels=labels, rotation=45)
plt.title('Histogram: Cumulative Sales Per Month')
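
The busiest and quietest months can also be read directly from the counts that feed both charts. This is a small sketch on hypothetical month numbers (the sales frame below is made up), assuming the month column holds integers from 1 to 12 as in the plotting code above.

```python
import pandas as pd

# Hypothetical month numbers (1-12) for a handful of sales; made-up values.
sales = pd.DataFrame({"month": [1, 5, 5, 5, 6, 6, 7, 7, 12, 12]})

# Sales per month in calendar order: the same series the pie chart is drawn from.
counts = sales["month"].value_counts().sort_index()

# Percentage share of each month, matching what autopct prints on the wedges.
share = (counts / counts.sum() * 100).round(2)

busiest = counts.idxmax()    # month number with the most sales
quietest = counts.idxmin()   # month number with the fewest sales
print(busiest, quietest)
```

Note that idxmax() and idxmin() return only the first month on a tie, so it is worth glancing at the full counts series as well.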

Looking further at the median price per month, we notice that May shows a significant increase over April. This suggests that listing your property towards the end of April could be ideal if you were looking to sell within a month. Had this data included the average time on market, we could recommend a more precise time frame to list a property.

plt.subplot(224) # Use the same 2x2 grid, insert graph into slot 4
price_groups = data.groupby(['month'], as_index=False)['price'].median()
ind = np.arange(0, 12)
sns.barplot(y="price", x="month", data=price_groups, label="Total", palette="RdBu")
plt.xticks(ind, labels=labels, rotation=45)
plt.ylim(380000, 480000)
plt.title('Monthly Median Price', fontsize=16)
plt.tight_layout() # Adjust spacing so titles and tick labels do not overlap
plt.show()
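
To put a number on a month-over-month jump such as April to May, the medians can be compared directly. The sketch below uses hypothetical prices (not values from the dataset) and the same groupby-then-median approach as the bar chart.

```python
import pandas as pd

# Hypothetical sale prices for April (4) and May (5); made-up values.
sales = pd.DataFrame({
    "month": [4, 4, 4, 5, 5, 5],
    "price": [400000, 420000, 440000, 430000, 450000, 520000],
})

# Median price per month, indexed by month number.
median_by_month = sales.groupby("month")["price"].median()

# Percentage change of each month's median relative to the previous month.
pct_change = median_by_month.pct_change() * 100

print(median_by_month)
print(pct_change.round(2))
```

The median is used rather than the mean because a few very expensive sales can pull the mean up sharply; in this toy data the single 520,000 sale moves May's mean much more than its median.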