Before applying a machine learning algorithm, the data needs to be preprocessed, structured, and analyzed. Below are some useful Python functions to accomplish this. I will be using a King County real estate dataset from Kaggle to predict house prices. Note: I'm using a subset of the actual dataset in the examples below to speed up processing. In the next article I will cover data visualization.
# Import csv data using the pandas library, with comma-separated values,
# the python engine, and the existing column headers
import pandas as pd

data = pd.read_csv('data/housedata2.csv', sep=r'\s,\s', engine='python', header=0)
# head() returns the first 5 rows. Columns below were shortened for the webpage.
print(data.head())
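Beyond head(), two other pandas methods are handy for a first look at a new dataset: info() for column dtypes and null counts, and describe() for summary statistics. A minimal sketch, assuming the same data frame:

# Column names, dtypes, and non-null counts (info() prints directly)
data.info()

# Summary statistics (count, mean, std, quartiles) for numeric columns
print(data.describe())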
Since the bedrooms column holds categorical values, we need to convert it to numerical. There are several ways to do this. The first method uses a list and leverages the inherent mapping provided by indexing.
# Check all unique values in the column
print('Bedroom values: ', data.bedrooms.unique())

bdroom_list = ['No', 'One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven', 'Eight']

# Clean short-hand version: each word's list index becomes its numeric value
data['bedrooms'] = [bdroom_list.index(room) for room in data['bedrooms']]

# Equivalent long-hand version (run instead of the line above;
# .loc avoids pandas' chained-assignment warning)
i = 0
while i < len(data['bedrooms']):
    data.loc[i, 'bedrooms'] = bdroom_list.index(data['bedrooms'][i])
    i += 1

print(data['bedrooms'])
However, the inherent mapping provided by indexing requires the data to align with the mapping (i.e., an ordered list). A more flexible approach is a dictionary data structure, where you create a custom text-to-integer mapping. The result will be the same.
# Create the dictionary mapping
print('Bedroom values: ', data.bedrooms.unique())

bdroom_dict = {'No': 0, 'One': 1, 'Two': 2, 'Three': 3, 'Four': 4,
               'Five': 5, 'Six': 6, 'Seven': 7, 'Eight': 8}

# Replace bedroom text with numeric values via the dict mapping
data['bedrooms'] = data['bedrooms'].replace(bdroom_dict)
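As an aside, pandas' map() method is an equivalent one-liner here, with one useful difference: any value missing from the dictionary becomes NaN rather than being left in place, which makes typos in the data easy to spot. A minimal sketch, to be run instead of (not after) the replace() above:

# map() applies the dict mapping; unmapped values become NaN
data['bedrooms'] = data['bedrooms'].map(bdroom_dict)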
I’m using another dataset for the next example.
Another method is the LabelEncoder() function in the sklearn preprocessing library. Because it assigns integer codes, it is generally used for data with an inherent order, but binary values are fine too and very simple to implement. Here we change the Gender column from M and F to 1s and 0s.
from sklearn import preprocessing

marathon_data = pd.read_csv('data/marathon.csv', quotechar='"')
print(marathon_data.head())

le = preprocessing.LabelEncoder()
marathon_data['Gender'] = le.fit_transform(marathon_data['Gender'].astype(str))
print(marathon_data.head())
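LabelEncoder stores the mapping it learned, which is handy for checking which label became which code and for reversing the encoding later. A short sketch, assuming the Gender column contains only 'M' and 'F' (classes are sorted alphabetically, so 'F' maps to 0 and 'M' to 1):

# The learned classes, in encoded order: index 0 -> 'F', index 1 -> 'M'
print(le.classes_)

# inverse_transform recovers the original labels from the integer codes
print(le.inverse_transform([0, 1]))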
The third method is one-hot encoding, used for categorical data with no intrinsic order. You can see that it takes the 79 unique country values, creates a matrix with a column for each unique country, and places a '1' in each row for that person's location.
print('Countries: ', len(marathon_data['Country'].unique()))

marathon_data = pd.get_dummies(marathon_data, columns=['Country'])
print(marathon_data.head())
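One thing to be aware of: for linear models the full set of dummy columns is perfectly collinear, since every row's dummies sum to 1. If that matters for your model, get_dummies() can drop one column per feature. A minimal sketch, to be run in place of the call above:

# drop_first=True keeps k-1 dummy columns for k categories,
# avoiding the collinearity (dummy variable trap)
marathon_data = pd.get_dummies(marathon_data, columns=['Country'], drop_first=True)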
Going back to our original real estate dataset, we can immediately drop columns that we already know will be irrelevant to our price prediction.
print(len(data.columns))

data = data.drop(['id', 'lat', 'long'], axis=1)
print(len(data.columns))
Next we replace any '?'s with NaNs using the numpy library and check whether any rows have null values that need removing. Since there are no null values, we do not need to drop any rows.
import numpy as np

data = data.replace('?', np.nan)

# Count the null values in each column
print("Number of null values per column:")
print(data.isnull().sum())

# dropna() drops rows with null values; not needed here, but harmless
data = data.dropna()
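Had the dataset contained nulls, dropping rows is not the only option; numeric columns can be imputed instead. A minimal hypothetical sketch, filling each null with its column's mean:

# Impute numeric nulls with the column mean instead of dropping the row
data = data.fillna(data.mean(numeric_only=True))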
The last column to clean up is the date column. For that we use pandas' apply() together with the parse() function from the dateutil library.
from dateutil.parser import parse

def clean_date(date_str):
    return str(parse(date_str).date())

data['date'] = data['date'].apply(clean_date)
print(data.head())
I also wanted to add a new column containing only the month, to use later.
# Extract month values and add them to a new column
def clean_month(date_str):
    return int(parse(date_str).date().month)

data['month'] = data['date'].apply(clean_month)
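Pandas can also handle both steps natively, as an alternative to the dateutil approach: pd.to_datetime() parses the column into a proper datetime dtype, and the .dt accessor then exposes components like the month directly. A minimal equivalent sketch:

# Parse once into datetime64 dtype, then pull components via .dt
data['date'] = pd.to_datetime(data['date'])
data['month'] = data['date'].dt.month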
Giving your data a random shuffle is a good habit; it ensures the algorithm doesn't pick up any patterns related to the order in which the data was read into the system.
# Shuffle all rows and reset the index
data = data.sample(frac=1).reset_index(drop=True)
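If you need the shuffle to be reproducible between runs (for example, to compare models on identical data), sample() accepts a random_state seed; the value 42 below is an arbitrary choice:

# A fixed seed makes the shuffle deterministic across runs
data = data.sample(frac=1, random_state=42).reset_index(drop=True)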
After a few lines of code, we are now ready to start analyzing the data.