image

Blog - Predicting Customer Churn in Telecom Industry

As state in our previous blog on Predicting Customer Churn in Banking , churn is a phenomenon in which customer stop doing business with an entity. It is most commonly expressed as the percentage at which subscribers discontinue their subscriptions within a given time period. Churn can be applied to subscription-based businesses as well to the number of employees that leave a firm. The churn rate and growth rate are diametrically opposite factors, as the former measures the loss of customers and the other measures the acquisition of customers. It is therefore, one of the fundamental use cases companies seek to implement.

Now-a-days, machine learning models provide the most affective means to measure, understand and predict customer churn. This blog is one such effort in this direction and will help you with in the follows ways:

  • How to formulate the problem.
  • Use of XGBoost to predict customer churn.
  • Use of custom loss function
  • How to estimate model efficacy.

Data Acquisition & Wrangling

The data used in this study is taken from Kaggle, published under the name Telco customer churn (11.1.3+). All the rights are owned by the original publisher. This is one of the most extensive datasets on customer churn at Kaggle. The dataset module is composed of 5 tables each carrying information on with specific segments :

  • Demographics
    • includes customer demographics data such as gender, age, marital status, etc.
  • Location
    • includes customer address information such as country, state, zip, etc.
  • Population
    • contains info about population per zipcode basis
  • Services
    • contains detailed info about services and charges, i.e., monthly bill, number of contracts, contract period, types of subscription like TV, music or video streaming, etc.
  • Status
    • includes info about customer satisfaction score, lifetime value, contract state, i.e., out of contractn, running out of contract or in contract, etc.

A quick glance of the main churn data table is follows:

CustomerID Country State Gender Paperless Billing Total Charges ... Churn Label Churn Score CLTV Churn Reason
3668-QPYBK United States California Male Yes 108.15 ... Yes 86 3239 Competitor made better offer
9237-HQITU United States California Female Yes 151.65 ... Yes 67 2701 Moved
9305-CDSKC United States California Female Yes 820.5 ... Yes 86 5372 Moved

There are no null values within therefore, no need to fill NAs. All the above five tables contain information per customer basis and therefore, are joined by CustomerID column. After the several joins, he final dataset looks as follows:

    Int64Index: 7043 entries, 0 to 7042
    Data columns (total 60 columns):
     #   Column                             Non-Null Count  Dtype
    ---  ------                             --------------  -----
     0   CustomerID                         7043 non-null   object
     1   Count                              7043 non-null   int64
     2   Country                            7043 non-null   object
     3   State                              7043 non-null   object
     4   City                               7043 non-null   object
     5   Zip Code                           7043 non-null   int64
     6   Lat Long                           7043 non-null   object
     7   Latitude                           7043 non-null   float64
     8   Longitude                          7043 non-null   float64
     9   Gender                             7043 non-null   object
     10  Senior Citizen                     7043 non-null   object
     11  Partner                            7043 non-null   object
     12  Dependents                         7043 non-null   object
     13  Tenure Months                      7043 non-null   int64
     14  Phone Service                      7043 non-null   object
     15  Multiple Lines                     7043 non-null   object
     16  Internet Service                   7043 non-null   object
     17  Online Security                    7043 non-null   object
     18  Online Backup                      7043 non-null   object
     19  Device Protection                  7043 non-null   object
     20  Tech Support                       7043 non-null   object
     21  Streaming TV                       7043 non-null   object
     22  Streaming Movies                   7043 non-null   object
     23  Contract                           7043 non-null   object
     24  Paperless Billing                  7043 non-null   object
     25  Payment Method                     7043 non-null   object
     26  Monthly Charges                    7043 non-null   float64
     27  Total Charges                      7043 non-null   object
     28  Churn Label                        7043 non-null   object
     29  Churn Value                        7043 non-null   int64
     30  Churn Score                        7043 non-null   int64
     31  CLTV                               7043 non-null   int64
     32  Churn Reason                       1869 non-null   object
     33  Customer ID                        7043 non-null   object
     34  Age                                7043 non-null   int64
     35  Under 30                           7043 non-null   object
     36  Married                            7043 non-null   object
     37  Number of Dependents               7043 non-null   int64
     38  Quarter                            7043 non-null   object
     39  Referred a Friend                  7043 non-null   object
     40  Number of Referrals                7043 non-null   int64
     41  Tenure in Months                   7043 non-null   int64
     42  Offer                              7043 non-null   object
     43  Avg Monthly Long Distance Charges  7043 non-null   float64
     44  Internet Type                      7043 non-null   object
     45  Avg Monthly GB Download            7043 non-null   int64
     46  Device Protection Plan             7043 non-null   object
     47  Premium Tech Support               7043 non-null   object
     48  Streaming Music                    7043 non-null   object
     49  Unlimited Data                     7043 non-null   object
     50  Monthly Charge                     7043 non-null   float64
     51  Total Refunds                      7043 non-null   float64
     52  Total Extra Data Charges           7043 non-null   int64
     53  Total Long Distance Charges        7043 non-null   float64
     54  Total Revenue                      7043 non-null   float64
     55  Satisfaction Score                 7043 non-null   int64
     56  Customer Status                    7043 non-null   object
     57  Churn Category                     1869 non-null   object
     58  ID                                 1671 non-null   float64
     59  Population                         1671 non-null   float64
    dtypes: float64(10), int64(13), object(37)
    memory usage: 3.3+ MB
                                

We dropped the duplicate and seemingly useless columns, use 'Churn Label' as the target variable and perform label encoding to convert categorical variables into integers so that they can be passed to XGBoost train function. This is done in a very few lines as follows:


    ### Let us now build the X and y variables
    df_merged['Churn Category'] = df_merged['Churn Category'].fillna('NA')

    # use instead of Internet Service as the former has more details,
    # dropping Customer Status columns as it is
    # somehow dubious, also dropping Churn Reason as it is covered by Churn
    # category. CustomerID is also not needed at this point

    df_data =  df_merged.drop(['Count', 'Country', 'State', 'Lat Long', 'Latitude',
                                    'Churn Reason', 'Customer ID', 'ID', 'CustomerID', 'Population',
                                    'Longitude', 'Churn Label', 'Tenure in Months', 'Internet Service',
                                    'Customer Status' ], axis=1)


    df_y = df_data['ChurnValue']
    df_y = pd.DataFrame(LabelEncoder().fit_transform(df_y))
    df_y = df_y.reset_index(drop=True)

    df_X_raw = df_data.drop(['ChurnValue'], axis=1)
    #df_X_raw = p_df_raw_train.drop(['Name'], axis=1)

    columns=df_X_raw.columns
    for f in df_X_raw.columns:
        if df_X_raw[f].dtype == 'object':
            df_X_raw[f] = LabelEncoder().fit_transform(list(df_X_raw[f]))
                                

This dataset contains several features which dictates the necessity for data normalization. Power Transform helps make data look more Gaussian-like. This is to make the search space more convex where local extrema is also global extrema and can be achieved as follows:


    scaler = PowerTransformer()
    scaled_df = scaler.fit_transform(df_X_raw)
    df_X = pd.DataFrame(scaled_df, columns=columns)
    df_X = df_X.reset_index(drop=True)
                                

Churn modeling always leads to the problem of imbalance classes and best known approach to deal with it is by explicitly assigning class weights (use of over/under-sample using smote has proven no attribution in practice), calculated as follows:


    class_labels = np.unique(df_y)
    class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=class_labels, y=df_y)
    class_weights = dict(zip(class_labels, class_weights))
    class_weights
                                

The above function gives a clear indication of class imbalances. One way to deal with such a problem is via using custom loss function which put different emphasis on classes w.r.t their weights. Focal loss function is exact such function, for more info, please read this blog post.

Hyper-parameter Optimization

After completing the above data wrangling, we now proceed to a totally new approach of model fitting, i.e., hyper-parameters optimization. This is explained in our next blog post here . Interested reader are encouraged to use the same approach presented in the blog on Predicting Customer Churn in Banking , to do simple model fitting and performance evaluation.

Leave a Comment

Your email address will not be published.