Clustering With K-Means

K-Means clustering is a popular machine learning algorithm used for unsupervised learning. Unsupervised learning refers to scenarios where you have unlabelled data and the outcomes or results are unknown.

Clustering is the task of dividing data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups. In other words, the aim is to collect data points with similar traits into the same cluster. The goal is to maximise similarity within each cluster (intra-cluster) and minimise similarity between clusters (inter-cluster).
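To make the idea concrete, here is a minimal NumPy sketch of the usual K-Means loop (often called Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The function name `kmeans_sketch`, its parameters and the toy data are made up for illustration; this is not the scikit-learn implementation used later on.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10, seed=0):
    """Minimal K-Means sketch (hypothetical helper, for illustration only)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old centroid
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Toy data (made up): two well-separated groups of three points each
toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

The first three points should end up sharing one label and the last three the other; which group is called 0 and which is called 1 depends on the random starting centroids.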

Unsupervised learning can be used to discover the underlying structure of our data. K-Means clustering can provide valuable insights into our data and can be useful in the following areas:

- Recommendation engines
- Customer market segmentation
- Social network analysis
- Image segmentation
- Search results grouping
- Anomaly detection

The dataset used is from Kaggle.com and provides information on shoppers at a shopping mall. The information available includes customer ID, age, gender, annual income and spending score. The spending score has been assigned to each customer based on defined parameters, e.g. customer behavior and purchasing data.

Getting Started

The first step is to import the libraries used for analysis and charting. pandas and NumPy are used for data analysis, while seaborn and matplotlib are used for plotting charts.

%matplotlib inline 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Run this line if you have a high-resolution (retina) display
%config InlineBackend.figure_format = 'retina'
# Apply seaborn's default plot theme
sns.set()

After importing our libraries, we need to prepare our dataset by loading it from a CSV file.

After loading the dataset, we can spot-check the data with "df.head(10)", which shows the first 10 rows.

df = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/3_Clustering/datasets_Mall_Customers_V1.csv')

df.head(10)
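Beyond eyeballing the first rows, a few quick checks on shape, column types and missing values are often worthwhile before clustering. The sketch below rebuilds the first ten rows shown by df.head(10) as a small in-memory frame (the name `sample` is just for illustration) so it runs without the CSV file; on the real data the same calls would be made on `df`.

```python
import pandas as pd

# First ten rows of the dataset, copied from the df.head(10) output
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Gender": ["Male", "Male", "Female", "Female", "Female",
               "Female", "Female", "Female", "Male", "Female"],
    "Age": [19, 21, 20, 23, 31, 22, 35, 23, 64, 30],
    "Annual_Income": [15, 15, 16, 16, 17, 17, 18, 18, 19, 19],
    "Spending_Score": [39, 81, 6, 77, 40, 76, 6, 94, 3, 72],
})

print(sample.shape)           # number of rows and columns
print(sample.dtypes)          # Gender is text, the other columns are integers
print(sample.isnull().sum())  # count of missing values per column
print(sample.describe())      # summary statistics for the numeric columns
```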

As the Gender column is stored as text, we can convert it to a numeric code for analysis. After conversion, let's view the data again.
We will use the following encoding:
Male = 1 and Female = 0.

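# Encode Gender as integers and assign the result back to the column
# (a chained inplace replace may not update df in recent pandas versions)
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df.head(10)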
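When encoding text categories by hand, it can help to confirm that every value was covered. The sketch below uses pandas' `Series.map` on a few made-up values, one of them deliberately misspelled, and lists the rows that were left unmapped.

```python
import pandas as pd

# Made-up values for illustration, including one unexpected spelling
genders = pd.Series(["Male", "Female", "female", "Male"], name="Gender")
encoded = genders.map({"Male": 1, "Female": 0})

# Values that are not in the mapping come back as NaN
unmapped = genders[encoded.isna()]
print(unmapped.tolist())  # ['female']
```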

Next, we can plot some graphs to analyse patterns in the data and decide which features to shortlist for clustering.
Let's draw a scatter plot of each feature against spending score.
1) Gender and spending score

plt.figure(figsize=(10,6))
# s sets the marker size; no palette is passed because there is no hue variable
s = sns.scatterplot(x="Gender", y="Spending_Score", data=df, s=200)

Observations:
Both female and male shoppers span the full range from low to high spending scores.
The highest spenders are female, but only by a small margin over male shoppers.

2) Age and spending score

plt.figure(figsize=(10,6))
# s sets the marker size; no palette is passed because there is no hue variable
s = sns.scatterplot(x="Age", y="Spending_Score", data=df, s=200)

Observations:
Shoppers below the age of 40 have higher spending scores. This group could be targeted for promotions or campaigns.
Let's try grouping the shoppers into clusters, starting with an initial choice of 4 clusters.

k = 4
from sklearn.cluster import KMeans
X = ['Age', 'Spending_Score']
kmeans = KMeans(n_clusters=k).fit(df[X])

centroids = kmeans.cluster_centers_
df["label"] = kmeans.labels_.astype(str)

plt.figure(figsize=(10, 6))
s = sns.scatterplot(x="Age", y="Spending_Score", hue='label',  
                    data=df, s=150, palette="Set2")

for i in range(k):
    plt.annotate(str(i), 
                 centroids[i],
                 horizontalalignment='center',
                 verticalalignment='center',
                 size=20, weight='bold',
                 color='black')

Observations:
Some data points are quite far from their respective centroids (the centre points of the clusters).
To find a suitable number of clusters, we can use the Elbow Method.

distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(df[X])
    distortions.append(kmeanModel.inertia_)
plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method Showing The Optimal k')
plt.show()
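The distortion plotted above is scikit-learn's `inertia_`: the sum of squared distances from each point to the centroid of its assigned cluster. As a small check of that definition, the sketch below recomputes it by hand on the Age and Spending_Score values of the first ten shoppers from df.head(10). The choices of 3 clusters, `n_init=10` and `random_state=0` are assumptions made only to keep the example small and repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Age and Spending_Score for the first ten shoppers shown in df.head(10)
points = np.array([[19, 39], [21, 81], [20, 6], [23, 77], [31, 40],
                   [22, 76], [35, 6], [23, 94], [64, 3], [30, 72]], dtype=float)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((points - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)  # the two values should agree
```

Because adding a cluster can only bring points closer to some centroid, this quantity tends to fall as k grows, which is why the elbow is judged by where the decrease flattens rather than by the minimum.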

To determine the optimal number of clusters, we select the value of k at the “elbow”, i.e. the point after which the distortion (inertia) starts decreasing roughly linearly, so that adding more clusters gives only small further reductions.
For the given data, we conclude that the optimal number of clusters is 5.
Let's plot the scatter graph again with 5 clusters.

k = 5
from sklearn.cluster import KMeans
X = ['Age', 'Spending_Score']
kmeans = KMeans(n_clusters=k).fit(df[X])

centroids = kmeans.cluster_centers_
df["label"] = kmeans.labels_.astype(str)

plt.figure(figsize=(10, 6))
s = sns.scatterplot(x="Age", y="Spending_Score", hue='label',  
                    data=df, s=150, palette="Set2")

for i in range(k):
    plt.annotate(str(i), 
                 centroids[i],
                 horizontalalignment='center',
                 verticalalignment='center',
                 size=20, weight='bold',
                 color='black')

Based on the new 5 clusters, we can categorise the shoppers in the original dataset into 5 groups by age and spending habits.
These groupings can be used for age-related promotions or marketing activities.
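One way to turn cluster labels into usable groups is to summarise each cluster with a group-by. The sketch below is illustrative only: it clusters the first ten shoppers from df.head(10) into 2 groups (an arbitrary choice for such a small sample, with `n_init` and `random_state` fixed for repeatability) and reports the average age and spending score per cluster along with the cluster sizes.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Age and Spending_Score for the first ten shoppers shown in df.head(10)
sample = pd.DataFrame({
    "Age": [19, 21, 20, 23, 31, 22, 35, 23, 64, 30],
    "Spending_Score": [39, 81, 6, 77, 40, 76, 6, 94, 3, 72],
})
features = ["Age", "Spending_Score"]

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sample[features])
sample["label"] = model.labels_.astype(str)

# Average age and spending score per cluster, plus how many shoppers each holds
profile = sample.groupby("label")[features].mean()
sizes = sample["label"].value_counts()
print(profile)
print(sizes)
```

On the full dataset the same two group-by lines could be applied to `df` after the 5-cluster fit.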

Let's analyse the relationship between annual income and spending score.

3) Annual income and spending score
Plot scatter graph for annual income and spending score.

plt.figure(figsize=(10,6))
# s sets the marker size; no palette is passed because there is no hue variable
s = sns.scatterplot(x="Annual_Income", y="Spending_Score", data=df, s=200)
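The elbow plot earlier was computed on Age and Spending_Score, so it may be worth repeating the check on this pair of features before fixing k. A small helper makes that easy to rerun; `elbow_inertias` and its `seed` parameter are hypothetical names, and the demonstration uses only the first ten shoppers from df.head(10), so its numbers say nothing about the best k for the full dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(data, k_values, seed=0):
    """Return the K-Means inertia for each candidate k (hypothetical helper)."""
    return [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data).inertia_
            for k in k_values]

# Annual_Income and Spending_Score for the first ten shoppers in df.head(10)
points = np.array([[15, 39], [15, 81], [16, 6], [16, 77], [17, 40],
                   [17, 76], [18, 6], [18, 94], [19, 3], [19, 72]], dtype=float)

k_values = range(1, 6)
inertias = elbow_inertias(points, k_values)
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

On the real data, `elbow_inertias(df[['Annual_Income', 'Spending_Score']], range(1, 10))` would give the values to plot in the same way as the earlier elbow chart.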

Based on the Elbow Method, let's group the above data into 5 clusters.

k = 5
from sklearn.cluster import KMeans
X = ['Annual_Income', 'Spending_Score']
kmeans = KMeans(n_clusters=k).fit(df[X])

centroids = kmeans.cluster_centers_
df["label"] = kmeans.labels_.astype(str)

plt.figure(figsize=(10, 6))
s = sns.scatterplot(x="Annual_Income", y="Spending_Score", hue='label',  
                    data=df, s=150, palette="Set2")

for i in range(k):
    plt.annotate(str(i), 
                 centroids[i],
                 horizontalalignment='center',
                 verticalalignment='center',
                 size=20, weight='bold',
                 color='black')

For targeted marketing based on income (e.g. luxury goods), this grouping can be used.
The cluster labels are stored in the "label" column of our dataset, which we can now view.

df
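Note that each clustering run above writes to the same "label" column, so the income-based labels replace the age-based ones. If both groupings are needed, one option is to store them under separate column names. The sketch below assumes hypothetical names (`add_cluster_column`, `age_cluster`, `income_cluster`) and uses the first ten shoppers from df.head(10) with 3 clusters purely for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Age, Annual_Income and Spending_Score for the first ten shoppers in df.head(10)
sample = pd.DataFrame({
    "Age": [19, 21, 20, 23, 31, 22, 35, 23, 64, 30],
    "Annual_Income": [15, 15, 16, 16, 17, 17, 18, 18, 19, 19],
    "Spending_Score": [39, 81, 6, 77, 40, 76, 6, 94, 3, 72],
})

def add_cluster_column(frame, features, column, k, seed=0):
    """Fit K-Means on `features` and store the labels in `column` (hypothetical helper)."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(frame[features])
    frame[column] = model.labels_.astype(str)
    return model

# Each grouping gets its own column, so neither overwrites the other
age_model = add_cluster_column(sample, ["Age", "Spending_Score"], "age_cluster", k=3)
income_model = add_cluster_column(sample, ["Annual_Income", "Spending_Score"],
                                  "income_cluster", k=3)
print(sample)
```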

This exercise shows that K-Means clustering is a simple and useful method for segmenting data points based on their similarity.

Output of the first df.head(10), before Gender is encoded:
	CustomerID	Gender	Age	Annual_Income	Spending_Score
0	1	Male	19	15	39
1	2	Male	21	15	81
2	3	Female	20	16	6
3	4	Female	23	16	77
4	5	Female	31	17	40
5	6	Female	22	17	76
6	7	Female	35	18	6
7	8	Female	23	18	94
8	9	Male	64	19	3
9	10	Female	30	19	72

Output of the second df.head(10), after Gender is encoded:
	CustomerID	Gender	Age	Annual_Income	Spending_Score
0	1	1	19	15	39
1	2	1	21	15	81
2	3	0	20	16	6
3	4	0	23	16	77
4	5	0	31	17	40
5	6	0	22	17	76
6	7	0	35	18	6
7	8	0	23	18	94
8	9	1	64	19	3
9	10	0	30	19	72

Output of df, including the cluster label column:
	CustomerID	Gender	Age	Annual_Income	Spending_Score	label
0	1	1	19	15	39	2
1	2	1	21	15	81	0
2	3	0	20	16	6	2
3	4	0	23	16	77	0
4	5	0	31	17	40	2
...	...	...	...	...	...	...
195	196	0	35	120	79	3
196	197	0	45	126	28	1
197	198	1	32	126	74	3
198	199	1	32	137	18	1
199	200	1	30	137	83	3
