As stated in our previous blog on Predicting Customer Churn in Banking, churn is a phenomenon in which customers stop doing business with an entity. It is most commonly expressed as the percentage of subscribers who discontinue their subscriptions within a given time period. Churn applies to subscription-based businesses as well as to the number of employees who leave a firm. The churn rate and the growth rate are diametrically opposed factors: the former measures the loss of customers, the latter their acquisition. Predicting churn is therefore one of the fundamental use cases companies seek to implement.
Nowadays, machine learning models provide the most effective means to measure, understand, and predict customer churn. This blog is one such effort in that direction and will help you in the following ways:
- How to formulate the problem.
- How to use XGBoost to predict customer churn.
- How to use a custom loss function.
- How to estimate model efficacy.
Data Acquisition & Wrangling
The data used in this study is taken from Kaggle, published under the name Telco customer churn (11.1.3+). All rights are owned by the original publisher. This is one of the most extensive datasets on customer churn at Kaggle. The dataset is composed of five tables, each carrying information on a specific segment:
- Demographics: customer demographic data such as gender, age, marital status, etc.
- Location: customer address information such as country, state, ZIP code, etc.
- Population: population figures on a per-ZIP-code basis.
- Services: detailed information on services and charges, e.g., monthly bill, number of contracts, contract period, and subscription types such as TV, music, or video streaming.
- Status: customer satisfaction score, lifetime value, and contract state (out of contract, running out of contract, or in contract), etc.
A quick glance at the main churn data table follows:
| CustomerID | Country | State | Gender | Paperless Billing | Total Charges | ... | Churn Label | Churn Score | CLTV | Churn Reason |
|---|---|---|---|---|---|---|---|---|---|---|
| 3668-QPYBK | United States | California | Male | Yes | 108.15 | ... | Yes | 86 | 3239 | Competitor made better offer |
| 9237-HQITU | United States | California | Female | Yes | 151.65 | ... | Yes | 67 | 2701 | Moved |
| 9305-CDSKC | United States | California | Female | Yes | 820.5 | ... | Yes | 86 | 5372 | Moved |
Apart from a few columns that are only populated for a subset of customers (Churn Reason, Churn Category, ID, and Population), there are no null values, so there is little need to fill NAs. All five tables carry information on a per-customer basis and are therefore joined on the CustomerID column.
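For illustration, here is a minimal join sketch; the CSV file names are our own assumptions (not the original file names), and we assume each customer-level table exposes a CustomerID column while the population table is keyed by ZIP code:

```python
import pandas as pd
from functools import reduce

# Hypothetical file names for the five source tables.
names = ['demographics', 'location', 'services', 'status']
tables = [pd.read_csv(f'telco_{n}.csv') for n in names]

# Fold the customer-level tables together with left joins on CustomerID.
df_merged = reduce(
    lambda left, right: left.merge(right, on='CustomerID', how='left'),
    tables)

# The population table is keyed by ZIP code rather than by customer.
df_population = pd.read_csv('telco_population.csv')
df_merged = df_merged.merge(df_population, on='Zip Code', how='left')
```

After these joins, df_merged.info() shows the final dataset: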
```
Int64Index: 7043 entries, 0 to 7042
Data columns (total 60 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 7043 non-null object
1 Count 7043 non-null int64
2 Country 7043 non-null object
3 State 7043 non-null object
4 City 7043 non-null object
5 Zip Code 7043 non-null int64
6 Lat Long 7043 non-null object
7 Latitude 7043 non-null float64
8 Longitude 7043 non-null float64
9 Gender 7043 non-null object
10 Senior Citizen 7043 non-null object
11 Partner 7043 non-null object
12 Dependents 7043 non-null object
13 Tenure Months 7043 non-null int64
14 Phone Service 7043 non-null object
15 Multiple Lines 7043 non-null object
16 Internet Service 7043 non-null object
17 Online Security 7043 non-null object
18 Online Backup 7043 non-null object
19 Device Protection 7043 non-null object
20 Tech Support 7043 non-null object
21 Streaming TV 7043 non-null object
22 Streaming Movies 7043 non-null object
23 Contract 7043 non-null object
24 Paperless Billing 7043 non-null object
25 Payment Method 7043 non-null object
26 Monthly Charges 7043 non-null float64
27 Total Charges 7043 non-null object
28 Churn Label 7043 non-null object
29 Churn Value 7043 non-null int64
30 Churn Score 7043 non-null int64
31 CLTV 7043 non-null int64
32 Churn Reason 1869 non-null object
33 Customer ID 7043 non-null object
34 Age 7043 non-null int64
35 Under 30 7043 non-null object
36 Married 7043 non-null object
37 Number of Dependents 7043 non-null int64
38 Quarter 7043 non-null object
39 Referred a Friend 7043 non-null object
40 Number of Referrals 7043 non-null int64
41 Tenure in Months 7043 non-null int64
42 Offer 7043 non-null object
43 Avg Monthly Long Distance Charges 7043 non-null float64
44 Internet Type 7043 non-null object
45 Avg Monthly GB Download 7043 non-null int64
46 Device Protection Plan 7043 non-null object
47 Premium Tech Support 7043 non-null object
48 Streaming Music 7043 non-null object
49 Unlimited Data 7043 non-null object
50 Monthly Charge 7043 non-null float64
51 Total Refunds 7043 non-null float64
52 Total Extra Data Charges 7043 non-null int64
53 Total Long Distance Charges 7043 non-null float64
54 Total Revenue 7043 non-null float64
55 Satisfaction Score 7043 non-null int64
56 Customer Status 7043 non-null object
57 Churn Category 1869 non-null object
58 ID 1671 non-null float64
59 Population 1671 non-null float64
dtypes: float64(10), int64(13), object(37)
memory usage: 3.3+ MB
```
We drop the duplicate and seemingly useless columns, use 'Churn Value' as the target variable, and perform label encoding to convert the categorical variables into integers so that they can be passed to the XGBoost train function. This takes only a few lines:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Build the X and y variables.
df_merged['Churn Category'] = df_merged['Churn Category'].fillna('NA')

# Keep 'Internet Type' instead of 'Internet Service', as the former has more
# detail. Drop 'Customer Status' as it is somewhat dubious, and 'Churn Reason'
# as it is covered by 'Churn Category'. 'CustomerID' is also not needed at
# this point.
df_data = df_merged.drop(['Count', 'Country', 'State', 'Lat Long', 'Latitude',
                          'Churn Reason', 'Customer ID', 'ID', 'CustomerID',
                          'Population', 'Longitude', 'Churn Label',
                          'Tenure in Months', 'Internet Service',
                          'Customer Status'], axis=1)

# 'Churn Value' (0/1) is the target variable.
df_y = pd.DataFrame(LabelEncoder().fit_transform(df_data['Churn Value']))
df_y = df_y.reset_index(drop=True)

# Label-encode every categorical (object) feature column.
df_X_raw = df_data.drop(['Churn Value'], axis=1)
columns = df_X_raw.columns
for f in df_X_raw.columns:
    if df_X_raw[f].dtype == 'object':
        df_X_raw[f] = LabelEncoder().fit_transform(list(df_X_raw[f]))
```
This dataset contains features on widely different scales, which makes data normalization necessary. A power transform helps make the data more Gaussian-like; this makes the search space more convex, so that a local extremum is also a global extremum. It can be achieved as follows:
```python
from sklearn.preprocessing import PowerTransformer

scaler = PowerTransformer()
scaled_df = scaler.fit_transform(df_X_raw)
df_X = pd.DataFrame(scaled_df, columns=columns)
df_X = df_X.reset_index(drop=True)
```
Churn modeling invariably leads to the problem of imbalanced classes, and the best-known approach to deal with it is to explicitly assign class weights (in our experience, over/under-sampling with SMOTE has shown no benefit in practice), calculated as follows:
```python
import numpy as np
from sklearn.utils import class_weight

class_labels = np.unique(df_y)
class_weights = class_weight.compute_class_weight(
    class_weight='balanced', classes=class_labels, y=df_y.values.ravel())
class_weights = dict(zip(class_labels, class_weights))
class_weights
```
The weights above give a clear indication of the class imbalance. One way to deal with such a problem is to use a custom loss function that puts different emphasis on the classes according to their weights. The focal loss is exactly such a function; for more information, please read this blog post.
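To make this concrete, below is a minimal sketch of what such an objective could look like for xgboost.train. This is our own illustrative version, not the implementation from the linked post; the function name and the default gamma and alpha values are assumptions that would normally be tuned. Here alpha plays the role of the class weights computed above, while gamma down-weights easy, well-classified examples:

```python
import numpy as np
import xgboost as xgb

def focal_loss_objective(gamma=2.0, alpha=0.75):  # hypothetical defaults
    """Binary focal loss, FL = -alpha_t * (1 - pt)^gamma * log(pt),
    as an XGBoost custom objective returning gradient and Hessian."""
    def objective(preds, dtrain):
        y = dtrain.get_label()
        p = 1.0 / (1.0 + np.exp(-preds))            # raw margins -> probabilities
        pt = np.where(y == 1, p, 1.0 - p)           # probability of the true class
        at = np.where(y == 1, alpha, 1.0 - alpha)   # class-dependent weight
        s = 2.0 * y - 1.0                           # +1 for positives, -1 for negatives
        log_pt = np.log(np.clip(pt, 1e-12, 1.0))
        # First derivative w.r.t. the raw score, using d(pt)/d(score) = s*pt*(1-pt).
        grad = s * at * (gamma * pt * (1.0 - pt) ** gamma * log_pt
                         - (1.0 - pt) ** (gamma + 1.0))
        # Second derivative, via the chain rule on the expression above.
        g_prime = (gamma * ((1.0 - pt) ** gamma * log_pt
                            - gamma * pt * (1.0 - pt) ** (gamma - 1.0) * log_pt
                            + (1.0 - pt) ** gamma)
                   + (gamma + 1.0) * (1.0 - pt) ** gamma)
        hess = at * pt * (1.0 - pt) * g_prime
        # Focal loss is non-convex in the margin, so clip the Hessian positive.
        return grad, np.maximum(hess, 1e-12)
    return objective
```

Training with it is then straightforward (the booster parameters here are placeholders):

```python
dtrain = xgb.DMatrix(df_X, label=df_y.values.ravel())
params = {'max_depth': 4, 'eta': 0.1, 'eval_metric': 'aucpr'}
booster = xgb.train(params, dtrain, num_boost_round=200,
                    obj=focal_loss_objective())
```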
Hyper-parameter Optimization
After completing the above data wrangling, we now proceed to an altogether different approach to model fitting, i.e., hyper-parameter optimization. This is explained in our next blog post here. Interested readers are encouraged to use the approach presented in the blog on Predicting Customer Churn in Banking to do simple model fitting and performance evaluation.