Building Customer Segmentation with GMM from Scratch
Customer segmentation is a perennial topic for marketers who want to understand their customers and what those customers look like. By understanding the customer base, marketers can take a different strategic approach for each group of users. In this post, I'll show you how to build a segmentation model from scratch and help you understand the basic principles behind the algorithm. A Jupyter notebook for this exercise can be found here
First and foremost, what is a Gaussian Mixture Model (GMM), and why should we use GMM instead of K-means?
A Gaussian Mixture Model (GMM) is a probabilistic clustering model for representing normally distributed subpopulations within an overall population. GMMs give us more flexibility than K-means. In K-means, we assign each data point to a specific cluster (hard assignment); in the next iteration, we might revise that assignment based on the distance between the data point and each cluster's centroid and then recompute the centroids. GMM uses soft assignment (a probability of belonging to each cluster), which makes the model more flexible.
Steps of GMM (a minimal from-scratch sketch of this loop follows the list):
- Guess the initial cluster parameters (means, covariances, and mixing weights)
- Repeat until converged:
- E-Step: for every point, compute the probability (responsibility) of belonging to each cluster
- M-Step: re-estimate each cluster's mean, covariance, and weight from those responsibilities
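For readers who want the "from scratch" view, here is a minimal, illustrative NumPy/SciPy sketch of that EM loop. It assumes X is an (n_samples, n_features) array and k is the number of clusters; it is not the implementation used later in this post (we rely on scikit-learn there).
import numpy as np
from scipy.stats import multivariate_normal
def gmm_em(X, k, n_iter=100, seed=42):
    """Illustrative EM loop for a Gaussian mixture with full covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # initial guesses: random points as means, data covariance for each cluster, equal weights
    means = X[rng.choice(n, k, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each point (soft assignment)
        resp = np.column_stack([
            w * multivariate_normal.pdf(X, mean=m, cov=c)
            for w, m, c in zip(weights, means, covs)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means and covariances from the responsibilities
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        covs = np.array([
            (resp[:, j, None] * (X - means[j])).T @ (X - means[j]) / nk[j]
            + 1e-6 * np.eye(d)
            for j in range(k)
        ])
    return weights, means, covs, resp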
Pros and Cons of GMM:
Advantages:
- Does not bias cluster sizes toward specific structures and does not assume clusters have a particular geometry, as K-means does.
- GMM is a lot more flexible in terms of cluster covariance.
- Because GMM takes variance into account when measuring cluster membership, it can fit elongated or elliptical clusters that K-means tends to fit poorly.
Disadvantages:
- It uses all the components it has access to, so initializing the clusters is difficult when the dimensionality of the data is high.
- Difficult to interpret.
Real-world application:
The dataset includes 35 features, each representing a behavior of a customer (it can be downloaded from the git link here).
Data Preparation
import numpy as np
import pandas as pd
from matplotlib import colors
from tqdm import tqdm as tqdm
%matplotlib inline
import plotly as pl
import plotly.graph_objs as go
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
# gmm
from sklearn.mixture import GaussianMixture
# load data
data = pd.read_csv('data_segmentation.csv')
categerical_col = data.select_dtypes(include=['object']).columns
numberical_col = data._get_numeric_data().columns
data[numberical_col] = data[numberical_col].astype(float)
# Clean data
# missing percentage
def missing_percentage(data):
    """Takes a DataFrame as input and returns two columns: total missing values and missing value percentage."""
    total = data.isnull().sum().sort_values(ascending=False)
    percent = round(data.isnull().sum().sort_values(ascending=False) / len(data) * 100, 2)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_percentage(data)
data[categerical_col] = data[categerical_col].fillna("No infor")
data[numberical_col] = data[numberical_col].fillna(0)
# Dealing with outliers
std = np.std(data[numberical_col])
mean = np.mean(data[numberical_col])
# check outliers
def cnt_outlier(data, sd, inc_cols=None):
    """Count, per column, the values lying more than `sd` standard deviations from the column mean."""
    cols = data.select_dtypes(include=[np.number]).columns
    if inc_cols is not None:
        cols = [c for c in cols if c in inc_cols]
    outlier = (data[cols] - data[cols].mean()).abs() > sd * data[cols].std()
    return outlier.sum()
# Remove outliers
from scipy import stats
z = np.abs(stats.zscore(data[numberical_col]))
threshold = 4
data_work = data[(z <= threshold).all(axis=1)]
data_work.head()
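As a quick sanity check (not in the original notebook), the helper above can be called on the numeric columns, and the row counts before and after the z-score filter can be compared:
# Hypothetical usage: count values more than 4 standard deviations from the column mean
cnt_outlier(data, sd=4, inc_cols=numberical_col)
# rows before vs. after removing outliers
print(len(data), '->', len(data_work))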
Factor Analysis:
Why Factor Analysis?
Factor analysis is related to mixture models.
One limitation of mixture models is that they use only a single latent variable to generate each observation, whereas real-world data are driven by multiple underlying variables.
Factor analysis is a method for dimensionality reduction: it reduces the dimension of the original space.
Factor analysis is also an exploratory data analysis method: it discovers a small set of components that underlie a high-dimensional dataset (DataCamp).
Adequacy Test
Before you perform factor analysis, you need to evaluate the "factorability" of the dataset. Factorability means "can we find factors in the dataset?". There are two methods to check factorability, or sampling adequacy:
1. Bartlett’s Test
2. Kaiser-Meyer-Olkin Test (KMO)
Bartlett’s test of sphericity checks whether the observed variables intercorrelate at all, by testing the observed correlation matrix against the identity matrix. If the test is statistically insignificant, you should not employ factor analysis.
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value,p_value=calculate_bartlett_sphericity(data_work[numberical_col])
chi_square_value, p_value
(166361.1684484173, 0.0)
In this Bartlett’s test, the p-value is 0. The test is statistically significant, indicating that the observed correlation matrix is not an identity matrix.
The Kaiser-Meyer-Olkin (KMO) test measures the suitability of the data for factor analysis. It determines the adequacy for each observed variable and for the complete model. KMO estimates the proportion of variance among all the observed variables; a lower proportion is more suitable for factor analysis. KMO values range between 0 and 1, and a value below 0.6 is considered inadequate. (Source: DataCamp)
from factor_analyzer.factor_analyzer import calculate_kmo
# `a` is the numeric feature matrix used from here on (assumed to be data_work[numberical_col])
a = data_work[numberical_col]
kmo_all, kmo_model = calculate_kmo(a)
kmo_model
0.8211935165487708
The KMO of 0.82 indicates that we can proceed with the planned factor analysis.
Choosing the Number of Factors:
For choosing the number of factors, you can use the Kaiser criterion and scree plot. Both are based on eigenvalues.
fa = FactorAnalyzer()
fa.fit(a)
eigen_values, vectors = fa.get_eigenvalues()
x = np.arange(1, a.shape[1] + 1)
fig = go.Figure(data=go.Scatter(x=x, y=eigen_values))
fig.update_layout(title='Eigenvalue by number of factors',
                  xaxis_title='Factors',
                  yaxis_title='Eigenvalues')
fig.show()
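The Kaiser criterion mentioned above keeps only factors whose eigenvalue exceeds 1; a small check like this (an addition, not in the original notebook) can complement the scree plot:
# Kaiser criterion: count the factors whose eigenvalue exceeds 1
n_factors_kaiser = int((eigen_values > 1).sum())
print('Factors with eigenvalue > 1:', n_factors_kaiser)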
Variance of the factors
_, _, cumvar = fa.get_factor_variance()
fig = go.Figure(data=go.Scatter(x=x, y=cumvar))
fig.update_layout(title='Cumulative variance by number of factors',
                  xaxis_title='Number of Factors',
                  yaxis_title='Variance %')
fig.show()
The factor loading matrix shows the relationship of each variable to the underlying factors: it contains the correlation coefficients between the observed variables and the factors, and thus the variance in each variable explained by the factors. The higher the value, the more strongly that variable is related to the factor.
fa.set_params(n_factors=10, rotation='varimax')
fa.fit(a)
loadings = fa.loadings_
factor_loading = pd.DataFrame(loadings)
factor_loading.head()
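To make the loadings easier to read, one option (an illustrative addition, assuming the rows of factor_loading correspond to the numeric columns in `a`) is to list the variables with the largest absolute loading on each factor:
# label rows with the original feature names, then list the top-3 variables per factor
factor_loading.index = numberical_col
top_vars = {col: factor_loading[col].abs().nlargest(3).index.tolist()
            for col in factor_loading.columns}
top_vars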
# Get the variance of each factor
fa.get_factor_variance()
factor_variance = pd.DataFrame(fa.get_factor_variance())
factor_variance
fa_final = FactorAnalyzer(n_factors=10, rotation="varimax")
fa_final.fit(a)
Create a new DataFrame with the 10 factors we found earlier
col_name = []
for i in range(10):
    col_name.append('Factor ' + str(i + 1))
df_factors = fa_final.transform(a)
df_factors = pd.DataFrame(df_factors, columns=col_name)
Using GMM for segmentation:
Choosing the number of clusters
def gmm_bic(a, lower=2, upper=10):
    """Fit a GMM for each number of clusters in [lower, upper) and plot the BIC."""
    bic = []
    for i in tqdm(range(lower, upper)):
        gm = GaussianMixture(n_components=i, covariance_type="full", max_iter=300, random_state=42)
        gm.fit(a)
        b = gm.bic(a)
        bic.append(b)
        print('Convergence? {} at iteration {}'.format(gm.converged_, gm.n_iter_))
    fig = go.Figure(data=go.Scatter(x=np.arange(lower, upper), y=bic))
    fig.update_layout(title='BIC',
                      xaxis_title='Number of Clusters',
                      yaxis_title='BIC')
    fig.show()
Run GMM with different numbers of clusters on the factor DataFrame and compare the BIC
gmm_bic(df_factors)
The lower the BIC, the better the model predicts the data we have. According to the graph above, we choose 4 clusters, since beyond 4 clusters the BIC no longer changes significantly as the number of clusters increases.
Run GMM on the factor DataFrame
gmm = GaussianMixture(n_components=4, max_iter=300, random_state=42)
gmm.fit(df_factors)
a['Segment'] = gmm.predict(df_factors)
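Since GMM makes soft assignments, it can also be useful to look at the membership probabilities behind the hard labels (a small illustrative addition, not in the original notebook):
# Soft assignment: probability of each customer belonging to each of the 4 clusters
probs = pd.DataFrame(gmm.predict_proba(df_factors),
                     columns=['Cluster ' + str(i) for i in range(4)])
probs.head()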
Outputting
# display
pd.options.display.float_format = '{:,.2f}'.format
# `features` is assumed to be the list of behavioral columns to profile, plus 'Segment'
output = a[features].groupby('Segment').mean().T
output.to_csv('result.csv')
# show gradient
def background_gradient(s, m, M, cmap='PuBu', low=0, high=0):
    rng = M - m
    norm = colors.Normalize(m - (rng * low),
                            M + (rng * high))
    normed = norm(s.values)
    c = [colors.rgb2hex(x) for x in plt.cm.get_cmap(cmap)(normed)]
    return ['background-color: %s' % color for color in c]
# display output with gradient
output.style.apply(background_gradient,
                   # highlight_max(output),
                   cmap='PuBu',
                   m=output.min().min(),
                   M=output.max().max(),
                   low=0,
                   high=0.2).format('{:,.2f}')
a.groupby('Segment').size()
Segment
0 1085
1 1088
2 767
3 670
dtype: int64
Profiling the output
With the results in hand, we can interpret each feature and its value to identify who our customers are and the characteristics of each group. Examples are below (a hypothetical mapping back to segment ids follows the list).
Group 1 (1,085): Promotion hunter
Group 2 (670): Neglector
Group 3 (1,088): Hopeful fan
Group 4 (767): The royalty
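If the group numbering above follows the segment sizes printed earlier (Group 1 = segment 0, Group 2 = segment 3, Group 3 = segment 1, Group 4 = segment 2), the profile names can be attached back to the data, for example:
# Hypothetical mapping of segment ids to profile names, matched by segment size
segment_names = {0: 'Promotion hunter', 3: 'Neglector', 1: 'Hopeful fan', 2: 'The royalty'}
a['Profile'] = a['Segment'].map(segment_names)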
References: