- Recommendation engines
- Customer market segmentation
- Social network analysis
- Image segmentation
- Search results grouping
- Anomaly detection
The first step is to import the libraries used for analysis and charting. pandas and NumPy handle the data analysis, while seaborn and matplotlib are used for plotting.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Optional: render sharper figures on a high-resolution (HiDPI/retina) display
%config InlineBackend.figure_format = 'retina'
# Initialize our chart engine
sns.set()
After importing our libraries, we need to prepare our dataset by loading it from a CSV file.
Once the dataset is loaded, we can do a quick check with "df.head(10)", which shows the first 10 rows.
df = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/3_Clustering/datasets_Mall_Customers_V1.csv')
df.head(10)
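Beyond looking at the first rows, it can help to confirm the shape, column types and missing values before going further. The following is a minimal sketch on a small made-up DataFrame: the values are invented, and only the column names follow the ones used in this notebook. The same calls could be run on "df" instead of "sample".

```python
import pandas as pd

# Toy stand-in for the mall customers file; the values are made up,
# only the column names follow the ones used in this notebook.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 52, 27],
    "Annual_Income": [15, 40, 78, 60],
    "Spending_Score": [39, 81, 6, 55],
})

# Count missing values per column; K-Means cannot handle NaN inputs.
missing_per_column = sample.isna().sum()

# Data type of each column; text columns such as Gender are not numeric.
column_types = sample.dtypes

print(sample.shape)
print(missing_per_column)
print(column_types)
```

If any column reports missing values, those rows would need to be dropped or filled before clustering.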
As the Gender column is stored as text, we can convert it to numeric codes for analysis. After conversion, let's view the data again.
We will use the following encoding:
Male = 1 and Female = 0.
# Map the text labels to integers and assign the result back (any other value would become NaN)
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})
df.head(10)
Next, we can plot the data to look for patterns and decide which features to shortlist for clustering.
Let's draw a scatter plot of each feature against spending score.
1) Gender and spending score
plt.figure(figsize=(10, 6))
# s sets the marker size; no palette is passed because there is no hue variable
ax = sns.scatterplot(x="Gender", y="Spending_Score", data=df, s=200)
Observations :
Both female and male shoppers span the full range from low to high spending scores.
The highest spenders are female, but only by a small margin over the highest-spending male shoppers.
2) Age and spending score
plt.figure(figsize=(10, 6))
# s sets the marker size; no palette is passed because there is no hue variable
ax = sns.scatterplot(x="Age", y="Spending_Score", data=df, s=200)
Observations :
Shoppers below the age of 40 tend to have higher spending scores. This group could be targeted for promotions or campaigns.
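An observation like this one can be cross-checked numerically by comparing average spending scores per age band. The sketch below uses invented values and a hypothetical "Age_Band" column, so its numbers say nothing about the real dataset; it only illustrates the comparison.

```python
import pandas as pd

# Made-up values for illustration only.
sample = pd.DataFrame({
    "Age": [19, 23, 31, 38, 45, 52, 60, 67],
    "Spending_Score": [80, 75, 90, 60, 40, 35, 50, 20],
})

# Split shoppers into two bands: below 40, and 40 or older.
# right=False makes each bin include its left edge, so age 40 falls in "40+".
sample["Age_Band"] = pd.cut(
    sample["Age"], bins=[0, 40, 200], right=False, labels=["<40", "40+"]
)

# Average spending score and number of shoppers per band.
band_summary = (
    sample.groupby("Age_Band", observed=True)["Spending_Score"]
    .agg(["mean", "count"])
)
print(band_summary)
```

Running the same grouping on "df" would show whether the gap seen in the scatter plot also holds on average.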
Let's try grouping the shoppers into clusters, starting with an initial choice of 4 clusters.
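k = 4
from sklearn.cluster import KMeans
X = ['Age', 'Spending_Score']  # feature columns used for clustering
# n_init and random_state are set explicitly so that reruns give the same clusters
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df[X])
centroids = kmeans.cluster_centers_
df["label"] = kmeans.labels_.astype(str)  # string labels are treated as categories by seaborn
plt.figure(figsize=(10, 6))
ax = sns.scatterplot(x="Age", y="Spending_Score", hue='label',
                     data=df, s=150, palette="Set2")
# Write each cluster number at its centroid
for i in range(k):
    plt.annotate(str(i),
                 centroids[i],
                 horizontalalignment='center',
                 verticalalignment='center',
                 size=20, weight='bold',
                 color='black')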
Observations :
Some data points lie quite far from their respective centroids (the centre points of the clusters).
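How far each point lies from its centroid can also be measured rather than judged by eye. The sketch below is one possible way to do this, using invented (Age, Spending_Score) pairs and an arbitrary choice of 2 clusters. It computes the Euclidean distance from every point to its assigned centroid and picks out the farthest one.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (Age, Spending_Score) pairs: two tight groups plus one stray point.
points = np.array([
    [20, 80], [22, 78], [21, 82],
    [60, 20], [62, 18], [61, 22],
    [40, 95],
], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Look up the centroid assigned to each point, then take the Euclidean distance to it.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(points - assigned_centroids, axis=1)

# Index of the point that lies farthest from its own centroid.
farthest = int(np.argmax(distances))
print(farthest, distances.round(2))
```

Large distances relative to the rest of a cluster suggest that the chosen number of clusters may not fit the data well.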
To find a suitable number of clusters, we can use the Elbow Method.
distortions = []
K = range(1, 10)
# Fit K-Means for each candidate number of clusters and record its inertia
for n_clusters in K:
    kmeanModel = KMeans(n_clusters=n_clusters)
    kmeanModel.fit(df[X])
    distortions.append(kmeanModel.inertia_)
plt.figure(figsize=(16, 8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method Showing The Optimal k')
plt.show()
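The "distortion" plotted here is scikit-learn's inertia_ attribute: the sum of squared distances from each point to its assigned centroid. As a small check of that definition, the sketch below recomputes the value by hand on invented points and compares it with the fitted model's inertia_.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points for illustration.
points = np.array([
    [20, 80], [22, 78], [21, 82],
    [60, 20], [62, 18], [61, 22],
], dtype=float)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Sum of squared distances from each point to its assigned centroid.
assigned = model.cluster_centers_[model.labels_]
manual_inertia = float(np.sum((points - assigned) ** 2))

print(manual_inertia, model.inertia_)
```

Because every extra cluster gives points a closer centroid to attach to, this value generally falls as k grows, which is why the elbow plot looks for where the drop levels off rather than for a minimum.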
To determine the optimal number of clusters, we select the value of k at the “elbow”, i.e. the point after which the distortion (inertia) decreases only gradually and roughly linearly. Beyond this point, adding more clusters brings the data points only slightly closer to their centroids.
Thus, for the given data, we conclude that the optimal number of clusters is 5.
Let's plot the scatter graph again with 5 clusters.
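k = 5
from sklearn.cluster import KMeans
X = ['Age', 'Spending_Score']  # feature columns used for clustering
# n_init and random_state are set explicitly so that reruns give the same clusters
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df[X])
centroids = kmeans.cluster_centers_
df["label"] = kmeans.labels_.astype(str)  # string labels are treated as categories by seaborn
plt.figure(figsize=(10, 6))
ax = sns.scatterplot(x="Age", y="Spending_Score", hue='label',
                     data=df, s=150, palette="Set2")
# Write each cluster number at its centroid
for i in range(k):
    plt.annotate(str(i),
                 centroids[i],
                 horizontalalignment='center',
                 verticalalignment='center',
                 size=20, weight='bold',
                 color='black')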
Based on the new 5 clusters, we can categorise the shoppers in the original dataset into 5 groups by age and spending habits.
These groupings can be used for age-related promotions or marketing activities.
Let's analyse the relationship between annual income and spending score.
3) Annual income and spending score
Plot scatter graph for annual income and spending score.
plt.figure(figsize=(10, 6))
# s sets the marker size; no palette is passed because there is no hue variable
ax = sns.scatterplot(x="Annual_Income", y="Spending_Score", data=df, s=200)
As with the previous features, let's group the above data into 5 clusters. The Elbow Method can be repeated on these two features to confirm this choice.
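k = 5
from sklearn.cluster import KMeans
X = ['Annual_Income', 'Spending_Score']  # feature columns used for clustering
# n_init and random_state are set explicitly so that reruns give the same clusters
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df[X])
centroids = kmeans.cluster_centers_
df["label"] = kmeans.labels_.astype(str)  # string labels are treated as categories by seaborn
plt.figure(figsize=(10, 6))
ax = sns.scatterplot(x="Annual_Income", y="Spending_Score", hue='label',
                     data=df, s=150, palette="Set2")
# Write each cluster number at its centroid
for i in range(k):
    plt.annotate(str(i),
                 centroids[i],
                 horizontalalignment='center',
                 verticalalignment='center',
                 size=20, weight='bold',
                 color='black')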
For targeted marketing based on income (e.g. luxury goods), this grouping can be used.
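The number of clusters for the income features can be checked with the same elbow computation used earlier for age. The sketch below runs it on synthetic data that was deliberately generated with five groups, so it only illustrates the procedure and does not confirm anything about the real dataset. The group centres, spread and seeds are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in data: five loose groups in the income/score plane.
rng = np.random.default_rng(0)
centres = [(25, 20), (25, 80), (55, 50), (90, 20), (90, 80)]
rows = [rng.normal(loc=c, scale=4, size=(30, 2)) for c in centres]
sample = pd.DataFrame(np.vstack(rows), columns=["Annual_Income", "Spending_Score"])

features = ["Annual_Income", "Spending_Score"]
inertias = {}
for n_clusters in range(1, 10):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    model.fit(sample[features])
    inertias[n_clusters] = model.inertia_

# Print inertia per candidate k; the elbow is where the drop levels off.
for n_clusters, inertia in inertias.items():
    print(n_clusters, round(inertia, 1))
```

Replacing "sample" with "df" would give the inertia values for the actual income and spending data.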
The cluster assignments are already stored in the "label" column of our dataset, so we can view the updated data.
df
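Once each shopper carries a cluster label, a per-cluster summary makes the segments easier to describe. The following is a minimal sketch on invented rows; the output column names "customers", "mean_income" and "mean_score" are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Made-up rows that mimic the dataset after clustering; "label" holds
# the cluster id as a string, as in the notebook.
sample = pd.DataFrame({
    "Annual_Income": [15, 18, 80, 85, 50, 55],
    "Spending_Score": [75, 80, 15, 20, 50, 45],
    "label": ["0", "0", "1", "1", "2", "2"],
})

# Number of customers and average income/score in each cluster.
profile = sample.groupby("label").agg(
    customers=("Annual_Income", "count"),
    mean_income=("Annual_Income", "mean"),
    mean_score=("Spending_Score", "mean"),
)
print(profile)
```

Applied to "df", a table like this would show, for example, which cluster combines high income with a low spending score.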
This exercise shows that K-Means clustering is a simple and useful method for segmenting data based on similarities between data points.