“Selling to people who actually want to hear from you is more effective than interrupting strangers who don’t.” – Seth Godin (Yahoo VP of Direct Marketing, Entrepreneur, Author)
Goal
In this article I will work through a simple customer segmentation problem using Machine Learning. The approach broadly follows the kernels submitted on kaggle.com, except where I think other methods work better. The result will be clearly identified customer segments based on the given sample data. While simple, this approach can be merged with more data to provide additional insights.
What is Customer Segmentation and Why is it important?
Customer market segmentation attempts to identify groups of customers (segment, cluster, and group are used interchangeably here) who differ from one another within a given population. It is used by product, marketing, and sales teams to tell the right story, to the right people, at the right time by understanding the needs of different groups.
Traditional Approach
Traditional segmentation is often manual and works with limited data. This makes it difficult to understand personas on a deeper level, and adding more data makes the process far more expensive without a clear benefit.
Machine Learning: Turn Data into Gold
This is where Machine Learning can help by leveraging:
- Big data, to take in more data about your customers than a single person or team of people could ever analyze manually
- Automation, to manage the large amounts of data quickly, and
- Algorithms, to make sense of the data you possess and turn it into meaningful insights
History & Technical details
I will be using the quick and efficient ‘K-Means’ algorithm to work through a very simple customer segmentation example. The simplified idea is that if a point belongs to a cluster, it should be near lots of other points in that cluster. The algorithm was proposed by Stuart Lloyd of Bell Labs in 1957 for pulse-code modulation and published in 1982[1]. It is an unsupervised learning algorithm that finds a fixed number of clusters. An improved variant, K-Means++, was proposed by David Arthur and Sergei Vassilvitskii and published in 2007[2]. It uses a smarter approach to choosing the starting locations for the algorithm, which reduces the time needed to find a good solution. I’ll be using the K-Means algorithm in combination with the silhouette score to determine how many clusters are optimal. Note: computing the silhouette score is computationally more expensive, but our data sample isn’t large enough for that to matter, and it yields more concrete results, as you will see.
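To make the idea concrete, here is a minimal sketch of fitting K-Means with the K-Means++ seeding using scikit-learn. The data below is synthetic stand-in data I generated for illustration; the article’s actual code lives on the author’s GitHub.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer data: three loose blobs of
# (income, spending) points, purely for illustration
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[25, 80], scale=4, size=(50, 2)),  # low income, high spend
    rng.normal(loc=[80, 80], scale=4, size=(50, 2)),  # high income, high spend
    rng.normal(loc=[55, 50], scale=4, size=(50, 2)),  # medium income, medium spend
])

# init="k-means++" is the improved seeding from Arthur & Vassilvitskii
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)
print(km.cluster_centers_.round(1))
```

Each point ends up assigned to the cluster whose center it is nearest to, and the centers are refined iteratively until they stop moving.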
Given Data and Conclusions
We’re given a dataset containing the following attributes: CustomerID, Gender, Age, Income, and Spending Score. After running through the algorithms, we have the following main graphs:
The first graph shows 5 clear clusters based on Annual Income vs Spending.
- Careful = High Income, Low Spenders
- Standard = Medium Income, Medium Spenders
- Target = High Income, High Spenders
- Careless = Low Income, High Spenders
- Sensible = Low Income, Low Spenders
The second graph shows 4 clear clusters based on Age vs Spending.
- Target Young Customers = Young, Medium – High Spenders
- Target / Priority = Young, High Spenders
- Usual Customers = All Ages, Low – Medium Spenders
- Target Old Customers = Old, Medium – High Spenders
Application of ML to Arrive at Conclusions
How did we arrive at those customer segments?
First we do a quick analysis of the given data to see if anything pops out.
We can see that there are clear groupings in Income vs Spending, and those are the important metrics for a store. Age vs Spending doesn’t have groupings that are as easy to identify, but that’s what we have the algorithms for.
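A quick first look can be done in pandas. The rows below are made-up values that only mirror the columns named earlier (CustomerID, Gender, Age, Income, Spending Score); the real file name and values depend on your Kaggle download.

```python
import pandas as pd

# Toy rows mirroring the dataset's columns; values are invented
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 52, 64],
    "Income": [15, 54, 86, 42],          # annual income, e.g. in $k
    "SpendingScore": [39, 61, 17, 55],   # 1-100 score assigned by the store
})

# Quick checks for obvious structure before any clustering
print(df.describe())
print(df.groupby("Gender")[["Income", "SpendingScore"]].mean())
```

Scatter plots of Income vs Spending Score and Age vs Spending Score are the usual next step before reaching for an algorithm.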
Next we determine how many clusters our data actually contains by calculating the silhouette score, which is the mean silhouette coefficient over all instances (data points). An instance’s silhouette coefficient is (b − a) / max(a, b), where
a = mean distance to the other instances in the same cluster
b = mean distance to the instances in the nearest neighboring cluster
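The search over candidate cluster counts can be sketched as follows, again on synthetic blobs rather than the article’s actual data. scikit-learn’s `silhouette_score` computes the mean (b − a) / max(a, b) for us.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Five well-separated synthetic blobs standing in for customer segments
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=3, size=(40, 2))
               for c in ([20, 20], [20, 80], [80, 20], [80, 80], [50, 50])])

# Fit K-Means for each candidate k and record the mean silhouette coefficient
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The k with the highest score is the number of clusters the data supports best; on this toy data that recovers the five blobs.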
The silhouette score is highest at k = 5, which identifies the ideal number of customer segments for Income vs Spending.
Another method, called the ‘Elbow Method’, is possible and more efficient, but as you can see it is coarser: you have to eyeball where the ‘elbow’ is. As we are not bound by resources in this example, I focused on the silhouette score. Interestingly, following the discussions of others, I did not notice many utilizing the more precise silhouette method.
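For comparison, the Elbow Method can be sketched like this: fit K-Means for a range of k values, record the inertia (within-cluster sum of squared distances), and look for where the curve flattens. Same synthetic data as before, purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same style of synthetic blobs as the silhouette example
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=3, size=(40, 2))
               for c in ([20, 20], [20, 80], [80, 20], [80, 80], [50, 50])])

# Inertia always decreases as k grows; the "elbow" is where the
# decrease flattens out, which is the coarse judgement call noted above
inertias = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
            for k in range(1, 9)]

for k, inertia in zip(range(1, 9), inertias):
    print(k, round(inertia))
```

Because inertia keeps shrinking with every extra cluster, you must judge the elbow visually, which is why the silhouette score gives a more concrete answer.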
Additions
At this point I have achieved the goal of creating a number of customer segments from which the marketing team can create strategies. There are still many possible improvements to this example, such as:
- More charts to dive into more details
- Splitting male vs female into 2 data sets if your product differentiates between gender
- Salary & Spending by Gender. Do males or females earn/spend more?
- Note: Males have a higher earnings cap but females have a higher spending cap by about $5k based on this data set
- This data set was very simple. Combining additional data attributes such as CRM data and Google Analytics data can provide more and deeper personas, as can combining Age vs Income vs Spending in a 3D graph
- Testing other algorithms such as Agglomerative Clustering (AGNES), DBSCAN, or Mean-Shift Clustering
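As a starting point for that last item, here is a minimal sketch of running two of those alternatives with scikit-learn on synthetic data. Note that DBSCAN does not take a cluster count; its `eps` and `min_samples` parameters (values here are assumptions tuned to this toy data) control what counts as a dense region.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

# Three synthetic blobs standing in for customer segments
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=3, size=(40, 2))
               for c in ([20, 20], [80, 80], [50, 50])])

# AGNES-style bottom-up clustering with a fixed cluster count
agg = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# DBSCAN discovers the count itself; label -1 marks noise points
db = DBSCAN(eps=5, min_samples=5).fit_predict(X)

print(len(set(agg.tolist())), len(set(db.tolist()) - {-1}))
```

Comparing how each algorithm carves up the same customers is a cheap sanity check on whether the K-Means segments are an artifact of the algorithm or a real pattern in the data.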
References
[1] Stuart P. Lloyd, “Least Squares Quantization in PCM,” IEEE Transactions on Information Theory 28, no. 2 (1982): 129-137.
[2] David Arthur and Sergei Vassilvitskii, “k-Means++: The Advantages of Careful Seeding,” Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (2007): 1027-1035.
All code can be found on my GitHub account.
Thanks for visiting my page!