Clustering With K-Means

Clustering is the task of dividing data points into groups such that points in the same group are more similar to each other than to points in other groups. In other words, the aim is to find groups with similar traits and assign them to clusters. The goal is to maximise similarity within each cluster and minimise similarity between clusters.
Clustering is a form of unsupervised learning, which can be used to discover the underlying structure of our data. K-Means clustering can provide valuable insights into our data and can be useful in the following areas:


- Recommendation engines
- Customer market segmentation
- Social network analysis
- Image segmentation
- Search results grouping
- Anomaly detection
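
To make the idea of grouping by similarity concrete, here is a minimal sketch of how K-Means works internally (the standard assign-then-update loop, often called Lloyd's algorithm). It is written in plain Python for illustration only; the function names and the toy (age, spending score) points are made up for this example and are not part of the dataset or of scikit-learn, which is what the rest of this walkthrough uses.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def simple_kmeans(points, k, n_iter=100, seed=0):
    """Illustrative K-Means: alternate an assignment step and an update step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep as is
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical (age, spending score) pairs, invented for this sketch
points = [(20, 80), (22, 85), (25, 78), (60, 20), (62, 15), (65, 25)]
labels, centroids = simple_kmeans(points, k=2)
print(labels)
print(centroids)
```

On this toy data the three younger, high-spending points end up in one cluster and the three older, low-spending points in the other. Which cluster is numbered 0 or 1 depends on the random starting centroids, so the label numbers themselves carry no meaning.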

The dataset used is from Kaggle.com and provides information on shoppers at a shopping mall. The information available includes customer ID, age, gender, annual income and spending score. The spending score has been assigned to each customer based on defined parameters, e.g. customer behaviour and purchasing data.

Getting Started

The first step is to import the libraries used for analysis and charting. The pandas and numpy libraries are used for data analysis, while seaborn and matplotlib are used for plotting charts.

In [1]:
%matplotlib inline 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Optional: render sharper figures on high-DPI (retina) displays
%config InlineBackend.figure_format = 'retina'
# Apply seaborn's default theme to all charts
sns.set()

After importing our libraries, we need to prepare our dataset. For this purpose, we will load the data from a CSV file.

After loading our dataset, we can perform a sample check of our data by using "df.head(10)" to view the first 10 rows.

In [2]:
df = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/3_Clustering/datasets_Mall_Customers_V1.csv')
In [3]:
df.head(10)
Out[3]:
   CustomerID  Gender  Age  Annual_Income  Spending_Score
0           1    Male   19             15              39
1           2    Male   21             15              81
2           3  Female   20             16               6
3           4  Female   23             16              77
4           5  Female   31             17              40
5           6  Female   22             17              76
6           7  Female   35             18               6
7           8  Female   23             18              94
8           9    Male   64             19               3
9          10  Female   30             19              72

As some of the data are in text format, we can convert them to numeric codes for analysis. After conversion, let's view the data again.
We will use the following encoding:
Male = 1 and Female = 0.

In [4]:
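# Encode Gender as integers (Male = 1, Female = 0). Assigning the result back
# avoids inplace changes on a column, which newer pandas versions may not apply.
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})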
df.head(10)
Out[4]:
   CustomerID  Gender  Age  Annual_Income  Spending_Score
0           1       1   19             15              39
1           2       1   21             15              81
2           3       0   20             16               6
3           4       0   23             16              77
4           5       0   31             17              40
5           6       0   22             17              76
6           7       0   35             18               6
7           8       0   23             18              94
8           9       1   64             19               3
9          10       0   30             19              72

Next, we can plot some graphs to analyse the data patterns and decide which features to shortlist for clustering.
Let's draw a scatter plot of each feature against spending score.
1) Gender and spending score

In [5]:
plt.figure(figsize=(10,6))
# s - marker size (no palette here: it only takes effect together with hue)
s = sns.scatterplot(x="Gender", y="Spending_Score", data=df, s=200)

Observations :
Both female and male shoppers span the full range from low to high spending scores.
The highest spenders are female, but only by a small margin over male shoppers.

2) Age and spending score

In [6]:
plt.figure(figsize=(10,6))
# s - marker size
s = sns.scatterplot(x="Age", y="Spending_Score", data=df, s=200)

Observations :
Shoppers below the age of 40 have higher spending scores. This group could be targeted for promotions or campaigns.
Let's try to group the shoppers into clusters, starting with an initial choice of 4 clusters.

In [7]:
k = 4
from sklearn.cluster import KMeans
X = ['Age', 'Spending_Score']
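# random_state (an arbitrary seed) makes the cluster labels reproducible between runs;
# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df[X])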

centroids = kmeans.cluster_centers_
df["label"] = kmeans.labels_.astype(str)

plt.figure(figsize=(10, 6))
s = sns.scatterplot(x="Age", y="Spending_Score", hue='label',  
                    data=df, s=150, palette="Set2")

for i in range(k):
    plt.annotate(str(i), 
                 centroids[i],
                 horizontalalignment='center',
                 verticalalignment='center',
                 size=20, weight='bold',
                 color='black')

Observations :
Some data points are quite far from their respective centroids (the centre points of the clusters).
To test for the ideal number of clusters, we can use the Elbow Method.

In [8]:
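# Fit K-Means for k = 1..9 and record each model's inertia
# (sum of squared distances from every point to its nearest centroid)
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeanModel.fit(df[X])
    distortions.append(kmeanModel.inertia_)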
plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method Showing The Optimal k')
plt.show()
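
The quantity plotted on the y-axis is scikit-learn's inertia_: the sum of squared distances from each point to its nearest centroid. As a rough illustration of why it falls as clusters are added, the sketch below computes it by hand in plain Python. The helper name inertia, the toy points and the centroid positions are all invented for this example.

```python
def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total


# Hypothetical (age, spending score) pairs forming two obvious groups
points = [(20, 80), (22, 84), (60, 20), (62, 16)]

one_cluster = [(41, 50)]             # the mean of all four points
two_clusters = [(21, 82), (61, 18)]  # the mean of each group

print(inertia(points, one_cluster))   # 5716.0
print(inertia(points, two_clusters))  # 20.0
```

Going from one centroid to two removes almost all of the distortion here, because the data really does contain two groups. Adding a third centroid could only reduce the remaining 20.0 a little further, and that flattening is what shows up as the elbow in the chart.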

To determine the optimal number of clusters, we select the value of k at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a roughly linear fashion. Beyond this point, adding more clusters brings only a small further reduction in the distance between data points and their centroids.
Thus for the given data, we conclude that the optimal number of clusters is 5.
Let's plot the scatter graph again with 5 clusters.

In [9]:
k = 5
from sklearn.cluster import KMeans
X = ['Age', 'Spending_Score']
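# random_state (an arbitrary seed) fixes the initialisation so the labels are reproducible
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df[X])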

centroids = kmeans.cluster_centers_
df["label"] = kmeans.labels_.astype(str)

plt.figure(figsize=(10, 6))
s = sns.scatterplot(x="Age", y="Spending_Score", hue='label',  
                    data=df, s=150, palette="Set2")

for i in range(k):
    plt.annotate(str(i), 
                 centroids[i],
                 horizontalalignment='center',
                 verticalalignment='center',
                 size=20, weight='bold',
                 color='black') 

Based on the new 5 clusters, we can categorise the shoppers in our original dataset into 5 groups by age and spending habits.
These groupings can be used for age-related promotion or marketing activities.

Let's analyse the relationship between annual income and spending score.

3) Annual income and spending score
Plot scatter graph for annual income and spending score.

In [10]:
plt.figure(figsize=(10,6))
# s - marker size
s = sns.scatterplot(x="Annual_Income", y="Spending_Score", data=df, s=200)

Using the same k = 5 suggested by the Elbow Method above, let's group this data into 5 clusters.

In [11]:
k = 5
from sklearn.cluster import KMeans
X = ['Annual_Income', 'Spending_Score']
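# random_state (an arbitrary seed) fixes the initialisation so the labels are reproducible
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df[X])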

centroids = kmeans.cluster_centers_
df["label"] = kmeans.labels_.astype(str)

plt.figure(figsize=(10, 6))
s = sns.scatterplot(x="Annual_Income", y="Spending_Score", hue='label',  
                    data=df, s=150, palette="Set2")

for i in range(k):
    plt.annotate(str(i), 
                 centroids[i],
                 horizontalalignment='center',
                 verticalalignment='center',
                 size=20, weight='bold',
                 color='black') 

For targeted marketing based on income (e.g. luxury goods), this grouping can be used.
The label column in our dataset now holds these income-based cluster assignments, replacing the earlier age-based labels.

In [12]:
df
Out[12]:
     CustomerID  Gender  Age  Annual_Income  Spending_Score  label
0             1       1   19             15              39      2
1             2       1   21             15              81      0
2             3       0   20             16               6      2
3             4       0   23             16              77      0
4             5       0   31             17              40      2
...         ...     ...  ...            ...             ...    ...
195         196       0   35            120              79      3
196         197       0   45            126              28      1
197         198       1   32            126              74      3
198         199       1   32            137              18      1
199         200       1   30            137              83      3

200 rows × 6 columns
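
With the labels stored in the dataset, one possible next step is to summarise each cluster so the groups can be described in business terms. The sketch below does this with a pandas groupby. To stay self-contained it uses only the ten rows visible in the output above rather than the full file, so the averages are illustrative and not the real cluster profiles. The names sample and profile are made up for this example.

```python
import pandas as pd

# The ten rows visible in the output above (first five and last five)
sample = pd.DataFrame({
    "Age":            [19, 21, 20, 23, 31, 35, 45, 32, 32, 30],
    "Annual_Income":  [15, 15, 16, 16, 17, 120, 126, 126, 137, 137],
    "Spending_Score": [39, 81, 6, 77, 40, 79, 28, 74, 18, 83],
    "label":          ["2", "0", "2", "0", "2", "3", "1", "3", "1", "3"],
})

# Average feature values per cluster, plus the number of customers in each
profile = sample.groupby("label")[["Age", "Annual_Income", "Spending_Score"]].mean()
profile["customers"] = sample.groupby("label").size()
print(profile.round(1))
```

Running the same two groupby lines on the full df would give one row per cluster. In this small sample, cluster 0 already reads as low income with high spending and cluster 1 as high income with low spending.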

This exercise shows that K-Means clustering is a simple and useful method for segmenting data points based on their similarity.