As stated in our previous blog on Predicting Customer Churn in Banking, churn is a phenomenon in which customers stop doing business with an entity. It is most commonly expressed as the percentage of subscribers who discontinue their subscriptions within a given time period. Churn applies to subscription-based businesses as well as to the number of employees who leave a firm. The churn rate and the growth rate are diametrically opposed: the former measures the loss of customers, the latter their acquisition. Churn prediction is therefore one of the fundamental use cases companies seek to implement.
Nowadays, machine learning models provide the most effective means to measure, understand, and predict customer churn. This blog is one such effort in that direction and will help you in the following ways:
- How to formulate the problem.
- Use of XGBoost to predict customer churn.
- Use of a custom loss function.
- How to estimate model efficacy.
Data Acquisition & Wrangling
The data used in this study is taken from Kaggle, published under the name Telco customer churn (11.1.3+). All rights are owned by the original publisher. This is one of the most extensive customer-churn datasets on Kaggle. The dataset is composed of 5 tables, each carrying information on a specific segment:
- Demographics: customer demographic data such as gender, age, marital status, etc.
- Location: customer address information such as country, state, ZIP code, etc.
- Population: population counts on a per-ZIP-code basis.
- Services: detailed information about services and charges, i.e., monthly bill, number of contracts, contract period, and subscription types such as TV, music, or video streaming.
- Status: customer satisfaction score, lifetime value, and contract state, i.e., out of contract, running out of contract, or in contract.
A quick glance at the main churn data table follows:
CustomerID | Country | State | Gender | Paperless Billing | Total Charges | ... | Churn Label | Churn Score | CLTV | Churn Reason |
---|---|---|---|---|---|---|---|---|---|---|
3668-QPYBK | United States | California | Male | Yes | 108.15 | ... | Yes | 86 | 3239 | Competitor made better offer |
9237-HQITU | United States | California | Female | Yes | 151.65 | ... | Yes | 67 | 2701 | Moved |
9305-CDSKC | United States | California | Female | Yes | 820.5 | ... | Yes | 86 | 5372 | Moved |
The main table contains no null values, so there is no need to fill NAs. All five tables carry information on a per-customer basis and are therefore joined on the CustomerID column, as sketched below.
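A minimal sketch of these joins, assuming the five tables have already been loaded into pandas DataFrames (the variable names below are hypothetical, and the key spelling varies between the raw tables, e.g. 'CustomerID' vs. 'Customer ID', so some per-table adjustment may be needed):

```python
import pandas as pd

# Hypothetical names: one DataFrame per table from the Kaggle dataset
tables = [df_demographics, df_location, df_population,
          df_services, df_status]

df_merged = tables[0]
for table in tables[1:]:
    # left-join each table onto the growing frame by the customer key
    df_merged = df_merged.merge(table, on='CustomerID', how='left')
```

After these joins, the final dataset looks as follows: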
```
Int64Index: 7043 entries, 0 to 7042
Data columns (total 60 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   CustomerID                         7043 non-null   object
 1   Count                              7043 non-null   int64
 2   Country                            7043 non-null   object
 3   State                              7043 non-null   object
 4   City                               7043 non-null   object
 5   Zip Code                           7043 non-null   int64
 6   Lat Long                           7043 non-null   object
 7   Latitude                           7043 non-null   float64
 8   Longitude                          7043 non-null   float64
 9   Gender                             7043 non-null   object
 10  Senior Citizen                     7043 non-null   object
 11  Partner                            7043 non-null   object
 12  Dependents                         7043 non-null   object
 13  Tenure Months                      7043 non-null   int64
 14  Phone Service                      7043 non-null   object
 15  Multiple Lines                     7043 non-null   object
 16  Internet Service                   7043 non-null   object
 17  Online Security                    7043 non-null   object
 18  Online Backup                      7043 non-null   object
 19  Device Protection                  7043 non-null   object
 20  Tech Support                       7043 non-null   object
 21  Streaming TV                       7043 non-null   object
 22  Streaming Movies                   7043 non-null   object
 23  Contract                           7043 non-null   object
 24  Paperless Billing                  7043 non-null   object
 25  Payment Method                     7043 non-null   object
 26  Monthly Charges                    7043 non-null   float64
 27  Total Charges                      7043 non-null   object
 28  Churn Label                        7043 non-null   object
 29  Churn Value                        7043 non-null   int64
 30  Churn Score                        7043 non-null   int64
 31  CLTV                               7043 non-null   int64
 32  Churn Reason                       1869 non-null   object
 33  Customer ID                        7043 non-null   object
 34  Age                                7043 non-null   int64
 35  Under 30                           7043 non-null   object
 36  Married                            7043 non-null   object
 37  Number of Dependents               7043 non-null   int64
 38  Quarter                            7043 non-null   object
 39  Referred a Friend                  7043 non-null   object
 40  Number of Referrals                7043 non-null   int64
 41  Tenure in Months                   7043 non-null   int64
 42  Offer                              7043 non-null   object
 43  Avg Monthly Long Distance Charges  7043 non-null   float64
 44  Internet Type                      7043 non-null   object
 45  Avg Monthly GB Download            7043 non-null   int64
 46  Device Protection Plan             7043 non-null   object
 47  Premium Tech Support               7043 non-null   object
 48  Streaming Music                    7043 non-null   object
 49  Unlimited Data                     7043 non-null   object
 50  Monthly Charge                     7043 non-null   float64
 51  Total Refunds                      7043 non-null   float64
 52  Total Extra Data Charges           7043 non-null   int64
 53  Total Long Distance Charges        7043 non-null   float64
 54  Total Revenue                      7043 non-null   float64
 55  Satisfaction Score                 7043 non-null   int64
 56  Customer Status                    7043 non-null   object
 57  Churn Category                     1869 non-null   object
 58  ID                                 1671 non-null   float64
 59  Population                         1671 non-null   float64
dtypes: float64(10), int64(13), object(37)
memory usage: 3.3+ MB
```
We drop the duplicate and seemingly useless columns, use 'Churn Value' as the target variable, and label-encode the categorical variables into integers so that they can be passed to the XGBoost train function. This takes only a few lines:
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

### Let us now build the X and y variables
df_merged['Churn Category'] = df_merged['Churn Category'].fillna('NA')

# 'Internet Type' is kept instead of 'Internet Service' as the former has
# more detail. 'Customer Status' is dropped as it is somewhat dubious, and
# 'Churn Reason' is dropped as it is covered by 'Churn Category'.
# 'CustomerID' is also not needed at this point.
df_data = df_merged.drop(['Count', 'Country', 'State', 'Lat Long',
                          'Latitude', 'Churn Reason', 'Customer ID', 'ID',
                          'CustomerID', 'Population', 'Longitude',
                          'Churn Label', 'Tenure in Months',
                          'Internet Service', 'Customer Status'], axis=1)

df_y = df_data['Churn Value']
df_y = pd.DataFrame(LabelEncoder().fit_transform(df_y))
df_y = df_y.reset_index(drop=True)

df_X_raw = df_data.drop(['Churn Value'], axis=1)
columns = df_X_raw.columns

# Label-encode all categorical (object) columns so XGBoost can consume them
for f in df_X_raw.columns:
    if df_X_raw[f].dtype == 'object':
        df_X_raw[f] = LabelEncoder().fit_transform(list(df_X_raw[f]))
```
This dataset contains features on very different scales, which makes data normalization necessary. A power transform helps make the data more Gaussian-like; the intent is to make the search space more convex, where a local extremum is also the global extremum. This can be achieved as follows:
```python
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson by default; maps each feature to a more Gaussian-like shape
scaler = PowerTransformer()
scaled_df = scaler.fit_transform(df_X_raw)
df_X = pd.DataFrame(scaled_df, columns=columns)
df_X = df_X.reset_index(drop=True)
```
Churn modeling always leads to the problem of imbalanced classes, and the best-known approach to deal with it is to explicitly assign class weights (over/under-sampling with SMOTE has shown no benefit in our practice), calculated as follows:
```python
import numpy as np
from sklearn.utils import class_weight

class_labels = np.unique(df_y)
# compute_class_weight expects a 1-D array, hence the ravel()
class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                  classes=class_labels,
                                                  y=df_y.values.ravel())
class_weights = dict(zip(class_labels, class_weights))
class_weights
```
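Given the 1,869 recorded churn reasons above (one per churned customer out of 7,043), the balanced weights work out to roughly {0: 0.68, 1: 1.88}, i.e., a churner carries almost three times the weight of a non-churner.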
The weights above give a clear indication of the class imbalance. One way to deal with such a problem is a custom loss function that puts different emphasis on the classes according to their weights. The focal loss is exactly such a function; for more information, please read this blog post.
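To make this concrete, here is a minimal sketch of how a binary focal loss could be plugged into XGBoost as a custom objective. The gamma and alpha values, the finite-difference derivatives, and the focal_loss/focal_obj names are illustrative assumptions, not the exact implementation from the referenced post:

```python
import numpy as np
import xgboost as xgb

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def focal_loss(z, y, gamma=2.0, alpha=0.75):
    # Per-sample binary focal loss on the raw margin z
    # (gamma and alpha values are illustrative assumptions)
    p = np.clip(sigmoid(z), 1e-9, 1 - 1e-9)
    return -(alpha * y * (1 - p) ** gamma * np.log(p)
             + (1 - alpha) * (1 - y) * p ** gamma * np.log(1 - p))

def focal_obj(preds, dtrain, eps=1e-4):
    # Gradient and Hessian approximated by central finite differences,
    # a simple (if slower) alternative to deriving them analytically
    y = dtrain.get_label()
    grad = (focal_loss(preds + eps, y) - focal_loss(preds - eps, y)) / (2 * eps)
    hess = (focal_loss(preds + eps, y) - 2 * focal_loss(preds, y)
            + focal_loss(preds - eps, y)) / eps ** 2
    return grad, np.maximum(hess, 1e-6)  # keep the Hessian positive

# Usage: booster = xgb.train(params, dtrain, num_boost_round=200, obj=focal_obj)
```

When gamma is 0 this reduces to a class-weighted log loss; larger gamma down-weights easy examples so the booster concentrates on the hard, typically minority-class, ones.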
Hyper-parameter Optimization
After completing the above data wrangling, we now proceed to a totally new approach to model fitting, i.e., hyper-parameter optimization. This is explained in our next blog post here. Interested readers are encouraged to use the approach presented in the blog on Predicting Customer Churn in Banking to do simple model fitting and performance evaluation.
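For completeness, here is a minimal sketch of such a simple fit and evaluation, assuming the df_X, df_y, and class_weights objects built above (the split ratio, model parameters, and metric choices are illustrative, not tuned values):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
from xgboost import XGBClassifier

y = df_y.values.ravel()
X_train, X_test, y_train, y_test = train_test_split(
    df_X, y, test_size=0.2, stratify=y, random_state=42)

# scale_pos_weight offsets the class imbalance computed earlier
model = XGBClassifier(n_estimators=300, max_depth=4,
                      scale_pos_weight=class_weights[1] / class_weights[0])
model.fit(X_train, y_train)

print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print(classification_report(y_test, model.predict(X_test)))
```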