
Blog - Hyper-parameter Optimization using Scikit-Optimize on Telco Churn Data

After laying the theoretical foundations of model tuning using Bayesian optimization in our post BLOG - TOWARDS HYPER-PARAMETER OPTIMIZATION FOR MODEL TUNING, we now give a hands-on example of how to achieve this using the well-known scikit-optimize library. Data acquisition and wrangling are already explained in one of our previous posts, BLOG - PREDICTING CUSTOMER CHURN IN TELECOM INDUSTRY. The final data attributes used here are the following:


    Index(['City', 'ZipCode', 'Gender', 'SeniorCitizen', 'Partner', 'Dependents',
           'TenureMonths', 'PhoneService', 'MultipleLines', 'OnlineSecurity',
           'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
           'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
           'MonthlyCharges', 'TotalCharges', 'ChurnScore', 'CLTV', 'Age',
           'Under30', 'Married', 'NumberofDependents', 'Quarter',
           'ReferredaFriend', 'NumberofReferrals', 'Offer',
           'AvgMonthlyLongDistanceCharges', 'InternetType', 'AvgMonthlyGBDownload',
           'DeviceProtectionPlan', 'PremiumTechSupport', 'StreamingMusic',
           'UnlimitedData', 'MonthlyCharge', 'TotalRefunds',
           'TotalExtraDataCharges', 'TotalLongDistanceCharges', 'TotalRevenue',
           'SatisfactionScore', 'ChurnCategory'],
          dtype='object')
                                

Let us now do the Bayesian magic. First, we need to define the hyper-parameter search space. Here we have used only the main, most influential hyper-parameters; the results later show that even with such a subset an F1-score of more than 80% is achieved. Interested readers are encouraged to extend the search space to improve the results further.

    from skopt.space import Integer, Real

    space  = [ Integer(3, 30, name='max_depth')
              ,Integer(6, 50, name='num_leaves')
              ,Integer(50, 200, name='min_child_samples')
              ,Real(1, 500,  name='scale_pos_weight')
              ,Real(0.6, 0.9, name='subsample')
              ,Real(0.6, 0.9, name='colsample_bytree')
              ,Real(0.0001, 1,  name='learning_rate', prior='log-uniform')
              ,Integer(2, 200, name='max_bin')
              ,Integer(100, 2000, name='num_iterations')
              ,Integer(20, 100, name='min_child_samples')   # duplicate name: this dimension overrides the one three lines above in the params dict
             ]
                                

We are again using a customized loss function, namely the focal loss, an example of which can be found in one of our previous posts, BLOG - PREDICTING CUSTOMER CHURN IN BANKING.
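
For completeness, here is a minimal sketch of what such a custom eval function (together with the sigmoid helper used below) might look like; the alpha/gamma values and implementation details are assumptions for illustration, the actual definition is given in the referenced post.

    import numpy as np

    def sigmoid(x):
        # plain logistic function; maps raw model scores to probabilities
        return 1.0 / (1.0 + np.exp(-x))

    def focal_loss_eval(preds, train_data, alpha=0.25, gamma=2.0):
        # Focal loss as a LightGBM eval metric, returned in the
        # (name, value, is_higher_better) format that feval expects.
        # Assumes preds are raw scores, matching the sigmoid(predict(...)) usage below.
        y = train_data.get_label()
        p = np.clip(sigmoid(preds), 1e-9, 1 - 1e-9)
        loss = -(alpha * y * (1.0 - p) ** gamma * np.log(p)
                 + (1.0 - alpha) * (1.0 - y) * p ** gamma * np.log(1.0 - p))
        return 'focal_loss', float(np.mean(loss)), False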

Finally, the objective function is defined as follows:


    import gc

    import numpy as np
    import lightgbm as lgb
    from sklearn.metrics import f1_score
    from sklearn.model_selection import StratifiedKFold


    def objective(values):

        # df_X_sub, df_y_sub and categorical_feature are prepared in the data-wrangling step;
        # sigmoid and focal_loss_eval come from the focal-loss post referenced above
        params = {
                  'max_depth': values[0]
                  ,'num_leaves': values[1]
                  ,'min_child_samples': values[2]
                  ,'scale_pos_weight': values[3]
                  ,'subsample': values[4]
                  ,'colsample_bytree': values[5]
                  ,'learning_rate': values[6]
                  ,'max_bin': values[7]
                  ,'num_iterations': values[8]
                  ,'min_child_samples': values[9]     # duplicate key: overrides values[2] above
                  #,'lambda_l1': values[9]
                  #,'lambda_l2': values[10]
                  ,'metric':'auc'
                  ,'nthread': 20
                  ,'boosting_type': 'gbdt'
                  ,'objective': 'binary'
                  ,'min_child_weight': 0
                  ,'min_split_gain': 0
                  ,'subsample_freq': 1
                  }

        train_f1_list = []
        valid_f1_list = []
        early_stopping_rounds = 50
        num_boost_round       = 100            ### increasing this number can help achieve a better fit

        # Fit the model fold by fold and track the train/validation F1-score

        cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

        for train_index, test_index in cv.split(df_X_sub, df_y_sub):
            X_train, X_valid = df_X_sub.loc[df_X_sub.index[train_index]], df_X_sub.loc[df_X_sub.index[test_index]]
            y_train, y_valid = df_y_sub.loc[df_y_sub.index[train_index]], df_y_sub.loc[df_y_sub.index[test_index]]

            train_data = lgb.Dataset(data=X_train, label=y_train,
                                     categorical_feature=categorical_feature,
                                     free_raw_data=False)

            valid_data = lgb.Dataset(X_valid, label=y_valid,
                                     categorical_feature=categorical_feature,
                                     free_raw_data=False)

            evals_results = {}
            model_lgb     = lgb.train(params,
                                      train_data,
                                      valid_sets=valid_data,
                                      evals_result=evals_results,
                                      num_boost_round=num_boost_round,
                                      #early_stopping_rounds=early_stopping_rounds,
                                      verbose_eval=False,
                                      feval=focal_loss_eval
                                     )

            train_preds = sigmoid(model_lgb.predict(X_train))
            train_binary_preds = [int(p > 0.5) for p in train_preds]

            valid_preds = sigmoid(model_lgb.predict(X_valid))
            valid_binary_preds = [int(p > 0.5) for p in valid_preds]

            train_f1 = f1_score(y_train, train_binary_preds)
            valid_f1 = f1_score(y_valid, valid_binary_preds)

            train_f1_list.append(train_f1)
            valid_f1_list.append(valid_f1)


        train_valid_diff = np.abs(np.mean(train_f1_list) - np.mean(valid_f1_list))

        ### this is a good criterion to control over-fitting
        if train_valid_diff > 0.05:                    # gap of more than 5% between train and validation F1
            train_valid_diff = 0.05                    # assign a very small gain so gp_minimize ignores this parameter set
        else:
            train_valid_diff = np.mean(train_f1_list)

        print('TRAIN.....', np.mean(train_f1_list))
        print('VALID.....', np.mean(valid_f1_list))
        print('final score ..... ', train_valid_diff)

        gc.collect()

        #return  np.mean(train_f1_list)
        return -train_valid_diff

    from skopt import gp_minimize

    res_gp = gp_minimize(objective, space, n_calls=50, random_state=0, n_random_starts=10)
    "Best score=%.4f" % res_gp.fun
                            

There is a lot happening in the above two code snippets, so let us break it down step by step.

In this example, gp_minimize from the scikit-optimize library is used, in which the probabilistic regression (surrogate) model is a Gaussian Process; for more details please refer to BLOG - TOWARDS HYPER-PARAMETER OPTIMIZATION FOR MODEL TUNING. The first argument is the function to be minimized. In our case, it is the LightGBM model performing three-fold cross-validation on the train and validation splits, so the model is fitted three times (once per fold) for each set of parameters. The range of values for each hyper-parameter is specified in the space list defined earlier. At the end of every execution of the objective function, the difference between the train and validation F1-scores is measured. This is a more hands-on approach to avoid over-fitting. In this example, we have taken an arbitrary but relatively small allowed difference of 5% between the F1-scores on the train and validation sets. If the difference exceeds this threshold, the run is regarded as over-fit and is assigned a very small gain of 0.05, forcing gp_minimize to ignore this particular set of hyper-parameters.

The second argument is the search space for the hyper-parameters of LightGBM. These are simply the input parameters that LightGBM takes, with a range given for each of them. For example, the parameter 'max_depth' can only take integer values between 3 and 30. The range of each hyper-parameter can then be widened or narrowed depending on the performance of the algorithm. If an algorithm other than LightGBM is used, the parameter space changes accordingly.

The next argument, although not specified here, is the acquisition function, details of which can be found in BLOG - TOWARDS HYPER-PARAMETER OPTIMIZATION FOR MODEL TUNING. gp_minimize supports three acquisition functions, namely Expected Improvement (EI), Lower Confidence Bound (LCB) and Probability of Improvement (PI), any of which can be selected through the acq_func keyword argument; the default, 'gp_hedge', probabilistically chooses among them at every iteration. Remember that the acquisition function provides the means for controlling the exploration-exploitation trade-off, i.e., keep searching in the current region and slowly move towards an optimum, which could very well be a local one (exploitation), or explore new regions with high uncertainty (exploration).
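
For illustration only (reusing the objective and space defined above), a specific acquisition function can be selected by passing the acq_func keyword; kappa controls how strongly LCB favours exploration:

    # Illustrative only: explicitly choosing the acquisition function.
    res_lcb = gp_minimize(objective, space,
                          acq_func='LCB',      # or 'EI', 'PI'; default is 'gp_hedge'
                          kappa=1.96,          # larger kappa -> more exploration
                          n_calls=50, n_random_starts=10, random_state=0)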

The final argument worth emphasizing is n_calls, which simply states how many times the objective function is to be executed. This parameter provides the trade-off between time and accuracy, i.e., the higher the number of calls, the better the results will be, but at the expense of a longer convergence time. There is no silver bullet for this number; the best approach I have found is to set it large enough that the search runs the whole night and the results are ready for you first thing in the morning.
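
Because such a run can take many hours, it is often worth checkpointing intermediate results so an interrupted search is not lost. A minimal sketch using scikit-optimize's CheckpointSaver callback (the file name is just a placeholder) could look like this:

    from skopt import gp_minimize, load
    from skopt.callbacks import CheckpointSaver

    # save the optimization state to disk after every evaluation of the objective
    checkpoint_saver = CheckpointSaver('./gp_checkpoint.pkl', compress=9)

    res_gp = gp_minimize(objective, space, n_calls=50, random_state=0,
                         n_random_starts=10, callback=[checkpoint_saver])

    # later (e.g. the next morning) the partial result can be inspected with:
    # res_so_far = load('./gp_checkpoint.pkl')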


    /opt/conda/lib/python3.7/site-packages/lightgbm/engine.py:148: UserWarning: Found `num_iterations` in params. Will use it instead of argument
      warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
    /opt/conda/lib/python3.7/site-packages/lightgbm/basic.py:1291: UserWarning: Using categorical_feature in Dataset.
      warnings.warn('Using categorical_feature in Dataset.')

    TRAIN..... 0.4130507655413463
    VALID..... 0.41305069086609986
    final score .....  0.4130507655413463

    /opt/conda/lib/python3.7/site-packages/lightgbm/engine.py:148: UserWarning: Found `num_iterations` in params. Will use it instead of argument
      warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
    /opt/conda/lib/python3.7/site-packages/lightgbm/basic.py:1291: UserWarning: Using categorical_feature in Dataset.
      warnings.warn('Using categorical_feature in Dataset.')

    TRAIN..... 0.4130507655413463
    VALID..... 0.41305069086609986
    final score .....  0.4130507655413463
    ...
    ...
    ...

    /opt/conda/lib/python3.7/site-packages/lightgbm/engine.py:148: UserWarning: Found `num_iterations` in params. Will use it instead of argument
      warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
    /opt/conda/lib/python3.7/site-packages/lightgbm/basic.py:1291: UserWarning: Using categorical_feature in Dataset.
      warnings.warn('Using categorical_feature in Dataset.')

    TRAIN..... 1.0
    VALID..... 0.8991131689120596
    final score .....  1.0

    'Best score=-1.0000'
                            

At the end of the execution, the best set of parameters is extracted and used to train the final model as follows:


    res_gp.x
                                

    [3, 6, 200, 1.0, 0.9, 0.9, 1.0, 200, 2000, 20]
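
Since res_gp.x is just a plain list ordered exactly like the space definition, it can be convenient (an optional extra, not part of the original snippet) to pair each value with its dimension name before building the final parameter dictionary:

    # Convenience sketch: map every tuned value to its dimension name, which makes
    # the index-based assignments in the next cell easier to verify.
    best_params = dict(zip([dim.name for dim in space], res_gp.x))
    print(best_params)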
                                

    params = {'max_depth': res_gp.x[0]
              ,'num_leaves': res_gp.x[1]
              ,'min_child_samples': res_gp.x[2]
              ,'scale_pos_weight': res_gp.x[3]
              ,'subsample': res_gp.x[4]
              ,'colsample_bytree': res_gp.x[5]
              ,'learning_rate': res_gp.x[6]
              ,'max_bin':  res_gp.x[7]
              #,'num_iterations': res_gp.x[8]
              ,'min_child_samples': res_gp.x[9]     # duplicate key: overrides res_gp.x[2] above
              #,'lambda_l1': values[9]
              #,'lambda_l2': values[10]
              ,'metric':'auc'
              ,'nthread': 8
              ,'boosting_type': 'gbdt'
              ,'objective': 'binary'
              ,'min_child_weight': 0
              ,'min_split_gain': 0
              ,'subsample_freq': 1
              }

    model_lgb = lgb.LGBMClassifier(**params, class_weight = class_weights).fit(df_X_sub, df_y_sub)
                                

The above approach gave a huge boost to model tuning, with a final F1-score of 82%. By industry standards this is a very high number, which is achievable mainly because of the extensive set of data attributes used in this model. Below is the list of the remaining performance stats:


    from sklearn.metrics import (roc_auc_score, accuracy_score, f1_score,
                                 confusion_matrix, classification_report)

    clf_train_pred = model_lgb.predict(df_X_sub)     # hard class labels; predict_proba would give probability scores
    clf_test_pred = model_lgb.predict(X_test)
    auc_train = roc_auc_score(df_y_sub, clf_train_pred)
    print('\nAUROC.....', auc_train)
                                

    AUROC..... 0.910429052761263
                                

    auc_test = roc_auc_score(y_test, clf_test_pred)
    print('\nTEST AUROC.....', auc_test)
                                

    AUROC..... 0.910429052761263
                                

    print('Accuracy')
    print(accuracy_score(y_test, clf_test_pred))
    print('F1_score')
    print(f1_score(y_test, clf_test_pred))
    print('Confusion Matrix')
    print(confusion_matrix(y_test, clf_test_pred))
    print('Classification Report')
    print(classification_report(y_test, clf_test_pred))
                                

    Accuracy
    0.8924731182795699
    F1_score
    0.8290013679890561
    Confusion Matrix
    [[1469  215]
     [  35  606]]
    Classification Report
                  precision    recall  f1-score   support

               0       0.98      0.87      0.92      1684
               1       0.74      0.95      0.83       641

        accuracy                           0.89      2325
       macro avg       0.86      0.91      0.88      2325
    weighted avg       0.91      0.89      0.90      2325
                                
