After laying the theoretical foundations of model tuning with Bayesian optimization in our post BLOG - TOWARDS HYPER-PARAMETER OPTIMIZATION FOR MODEL TUNING, we now give a hands-on example of how to achieve this using the well-known scikit-optimize library. Data acquisition and wrangling are already explained in our previous post BLOG - PREDICTING CUSTOMER CHURN IN TELECOM INDUSTRY. The final data attributes used here are the following:
Index(['City', 'ZipCode', 'Gender', 'SeniorCitizen', 'Partner', 'Dependents', 'TenureMonths', 'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'ChurnScore', 'CLTV', 'Age', 'Under30', 'Married', 'NumberofDependents', 'Quarter', 'ReferredaFriend', 'NumberofReferrals', 'Offer', 'AvgMonthlyLongDistanceCharges', 'InternetType', 'AvgMonthlyGBDownload', 'DeviceProtectionPlan', 'PremiumTechSupport', 'StreamingMusic', 'UnlimitedData', 'MonthlyCharge', 'TotalRefunds', 'TotalExtraDataCharges', 'TotalLongDistanceCharges', 'TotalRevenue', 'SatisfactionScore', 'ChurnCategory'], dtype='object')
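Throughout the snippets below, we assume the wrangled data from that post is already available as df_X_sub / df_y_sub (training features and a binary churn label) together with a held-out X_test / y_test split, and that categorical_feature lists the categorical column names. A hypothetical setup sketch, purely for orientation (the column subset, target encoding, and split ratio are assumptions, not the exact choices of the earlier post):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical: df is the wrangled frame produced in the earlier post.
categorical_feature = ['City', 'Gender', 'Contract', 'PaymentMethod', 'InternetType']  # illustrative subset
for col in categorical_feature:
    df[col] = df[col].astype('category')  # LightGBM handles the pandas 'category' dtype natively

df_X = df.drop(columns=['ChurnCategory'])
df_y = df['ChurnCategory']  # assumed to be (or to have been encoded as) a 0/1 churn label

df_X_sub, X_test, df_y_sub, y_test = train_test_split(
    df_X, df_y, test_size=0.25, stratify=df_y, random_state=42)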
Let us now do the Bayesian magic. First, we need to define the hyper-parameter search space. We have restricted it to the main, most influential hyper-parameters; the results below will show that even with such a subset, an F1-score of more than 80% is achieved. Interested readers are encouraged to extend the search space to improve the results further.
from skopt.space import Integer, Real

space = [
    Integer(3, 30, name='max_depth'),
    Integer(6, 50, name='num_leaves'),
    Integer(50, 200, name='min_child_samples'),
    Real(1, 500, name='scale_pos_weight'),
    Real(0.6, 0.9, name='subsample'),
    Real(0.6, 0.9, name='colsample_bytree'),
    Real(0.0001, 1, name='learning_rate', prior='log-uniform'),
    Integer(2, 200, name='max_bin'),
    Integer(100, 2000, name='num_iterations'),
    Integer(20, 100, name='min_child_samples'),  # note: duplicates the name above; the later dict entry wins in the objective below
]
We are again using a customized loss function, namely the focal loss, an example of which can be found in one of our previous posts, BLOG - PREDICTING CUSTOMER CHURN IN BANKING.
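Since the focal loss helpers are defined in that earlier post, the snippets below simply assume that a sigmoid function and a focal_loss_eval custom metric are in scope. For completeness, here is a minimal sketch of what they might look like; the alpha and gamma values are illustrative assumptions, not necessarily the ones used in the earlier post:

import numpy as np

def sigmoid(x):
    # Map raw LightGBM margin outputs to probabilities.
    return 1.0 / (1.0 + np.exp(-x))

def focal_loss_eval(preds, train_data, alpha=0.25, gamma=2.0):
    # Custom LightGBM eval metric (feval): returns (name, value, is_higher_better).
    # Assumes preds arrive as raw margins (as with a custom focal-loss training objective).
    y = train_data.get_label()
    p = np.clip(sigmoid(preds), 1e-9, 1 - 1e-9)  # guard against log(0)
    # Binary focal loss: the (1 - p_t)^gamma factor down-weights easy examples.
    loss = -(alpha * y * (1 - p) ** gamma * np.log(p)
             + (1 - alpha) * (1 - y) * p ** gamma * np.log(1 - p))
    return 'focal_loss', float(np.mean(loss)), False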
Finally, the objective function is defined as follows:
import gc
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def objective(values):
    params = {
        'max_depth': values[0],
        'num_leaves': values[1],
        'min_child_samples': values[2],
        'scale_pos_weight': values[3],
        'subsample': values[4],
        'colsample_bytree': values[5],
        'learning_rate': values[6],
        'max_bin': values[7],
        'num_iterations': values[8],
        'min_child_samples': values[9],  # duplicate key: overrides values[2] above
        #'lambda_l1': values[9],
        #'lambda_l2': values[10],
        'metric': 'auc',
        'nthread': 20,
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'min_child_weight': 0,
        'min_split_gain': 0,
        'subsample_freq': 1,
    }

    train_f1_list = []
    valid_f1_list = []
    early_stopping_rounds = 50
    num_boost_round = 100  # one can increase this number; it helps achieve a better fit

    # Fit the model on each fold and score it with the F1 metric.
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    for train_index, test_index in cv.split(df_X_sub, df_y_sub):
        X_train, X_valid = df_X_sub.loc[df_X_sub.index[train_index]], df_X_sub.loc[df_X_sub.index[test_index]]
        y_train, y_valid = df_y_sub.loc[df_y_sub.index[train_index]], df_y_sub.loc[df_y_sub.index[test_index]]

        train_data = lgb.Dataset(data=X_train, label=y_train, categorical_feature=categorical_feature, free_raw_data=False)
        valid_data = lgb.Dataset(X_valid, label=y_valid, categorical_feature=categorical_feature, free_raw_data=False)

        evals_results = {}
        model_lgb = lgb.train(params,
                              train_data,
                              valid_sets=valid_data,
                              evals_result=evals_results,
                              num_boost_round=num_boost_round,
                              #early_stopping_rounds=early_stopping_rounds,
                              verbose_eval=False,
                              feval=focal_loss_eval)

        train_preds = sigmoid(model_lgb.predict(X_train))
        train_binary_preds = [int(p > 0.5) for p in train_preds]
        valid_preds = sigmoid(model_lgb.predict(X_valid))
        valid_binary_preds = [int(p > 0.5) for p in valid_preds]

        train_f1_list.append(f1_score(y_train, train_binary_preds))
        valid_f1_list.append(f1_score(y_valid, valid_binary_preds))

    # The train-validation gap is a good criterion for controlling over-fitting.
    # (With F1 values in [0, 1] the gap never exceeds 5, so as written the
    # returned score is effectively the mean training F1.)
    train_valid_diff = np.abs(np.mean(train_f1_list) - np.mean(valid_f1_list))
    if train_valid_diff > 5:
        train_valid_diff = 0.05
    else:
        train_valid_diff = np.mean(train_f1_list)

    print('TRAIN.....', np.mean(train_f1_list))
    print('VALID.....', np.mean(valid_f1_list))
    print('final score ..... ', train_valid_diff)

    gc.collect()
    #return np.mean(train_f1_list)
    return -train_valid_diff
from skopt import gp_minimize

res_gp = gp_minimize(objective, space, n_calls=50, random_state=0, n_random_starts=10)
"Best score=%.4f" % res_gp.fun
There is a lot happening in the above two code snippets, so let us break it down step by step.
In this example, the first argument to gp_minimize is the objective function defined above. Since gp_minimize, as the name suggests, minimizes its objective, and we want to maximize our F1-based score, the function returns the negative of that score.
The second argument is the search space for the hyper-parameters of LightGBM. These are simply the input parameters LightGBM takes, with a range given for each. E.g., the parameter 'max_depth' can only take integer values between 3 and 30. The range of each hyper-parameter can then be increased or decreased depending on the performance of the algorithm. If an algorithm other than LightGBM is used, the parameter space will change accordingly, as illustrated by the extension sketch below.
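As an illustration of such an extension, the two regularization terms that are commented out in the objective function above could be added to the space; the ranges below are assumptions chosen for demonstration:

from skopt.space import Real

extended_space = space + [
    Real(1e-8, 10.0, name='lambda_l1', prior='log-uniform'),  # L1 regularization strength
    Real(1e-8, 10.0, name='lambda_l2', prior='log-uniform'),  # L2 regularization strength
]
# Remember to map values[10] and values[11] inside the objective accordingly.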
The next parameter, although not specified here, is the acquisition function, the details of which can be read in the post BLOG - TOWARDS HYPER-PARAMETER OPTIMIZATION FOR MODEL TUNING. gp_minimize facilitates three different acquisition functions, namely Expected Improvement (EI), Lower Confidence Bound (LCB) and Probability of Improvement (PI), any of which can be selected via the acq_func keyword argument; the default, 'gp_hedge', probabilistically chooses among the three at every iteration. Remember that the acquisition function provides the means for controlling the exploration-exploitation trade-off: either keep searching in the current region and slowly move towards an optimum, which could very likely be a local one (exploitation), or explore new regions with high uncertainty (exploration).
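For example, to force a more exploratory search with the Lower Confidence Bound acquisition function, one could call gp_minimize as follows (kappa is the knob trading exploration against exploitation; the value below is illustrative):

res_lcb = gp_minimize(objective, space,
                      acq_func='LCB',   # use Lower Confidence Bound instead of the default 'gp_hedge'
                      kappa=1.96,       # larger kappa favours exploration, smaller favours exploitation
                      n_calls=50, random_state=0, n_random_starts=10)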
The final argument worth emphasizing is n_calls, the total number of times the objective function is evaluated (50 in our case). The first n_random_starts evaluations (10 here) are sampled at random to seed the surrogate model before the acquisition function takes over, and random_state makes the run reproducible.
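As a side note, scikit-optimize also provides a convenience helper to inspect how the best objective value evolves over those 50 calls:

from skopt.plots import plot_convergence
import matplotlib.pyplot as plt

plot_convergence(res_gp)  # best observed objective value vs. number of calls
plt.show()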
/opt/conda/lib/python3.7/site-packages/lightgbm/engine.py:148: UserWarning: Found `num_iterations` in params. Will use it instead of argument
  warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py:1291: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')
TRAIN..... 0.4130507655413463
VALID..... 0.41305069086609986
final score .....  0.4130507655413463
... (the warnings and per-call scores repeat for each of the 50 calls) ...
TRAIN..... 1.0
VALID..... 0.8991131689120596
final score .....  1.0
'Best score=-1.0000'
At the end of the execution, the best set of parameters is extracted and further used to train the final model as follows:
res_gp.x
[3, 6, 200, 1.0, 0.9, 0.9, 1.0, 200, 2000, 20]
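For readability, the optimized values can be paired with their dimension names (note that because 'min_child_samples' appears twice in the space, a dict keeps only the later value):

best_params = {dim.name: value for dim, value in zip(space, res_gp.x)}
print(best_params)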
params = {
    'max_depth': res_gp.x[0],
    'num_leaves': res_gp.x[1],
    'min_child_samples': res_gp.x[2],
    'scale_pos_weight': res_gp.x[3],
    'subsample': res_gp.x[4],
    'colsample_bytree': res_gp.x[5],
    'learning_rate': res_gp.x[6],
    'max_bin': res_gp.x[7],
    #'num_iterations': res_gp.x[8],
    'min_child_samples': res_gp.x[9],  # duplicate key: the second 'min_child_samples' dimension (index 9) wins
    #'lambda_l1': ...,
    #'lambda_l2': ...,
    'metric': 'auc',
    'nthread': 8,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'min_child_weight': 0,
    'min_split_gain': 0,
    'subsample_freq': 1,
}

model_lgb = lgb.LGBMClassifier(**params, class_weight=class_weights).fit(df_X_sub, df_y_sub)
The above approach gave a huge boost to model tuning, with a final F1-score of 82%. By industry standards this is a very high number, mainly attainable thanks to the extensive set of data attributes used in this model. Below are the remaining performance stats:
from sklearn.metrics import roc_auc_score

clf_train_pred = model_lgb.predict(df_X_sub)
clf_test_pred = model_lgb.predict(X_test)

auc_train = roc_auc_score(df_y_sub, clf_train_pred)
print('\nAUROC.....', auc_train)
AUROC..... 0.910429052761263
auc_test = roc_auc_score(y_test, clf_test_pred)
print('\nTEST AUROC.....', auc_test)
TEST AUROC..... 0.910429052761263
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

print('Accuracy')
print(accuracy_score(y_test, clf_test_pred))
print('F1_score')
print(f1_score(y_test, clf_test_pred))
print('Confusion Matrix')
print(confusion_matrix(y_test, clf_test_pred))
print('Classification Report')
print(classification_report(y_test, clf_test_pred))
Accuracy
0.8924731182795699
F1_score
0.8290013679890561
Confusion Matrix
[[1469  215]
 [  35  606]]
Classification Report
              precision    recall  f1-score   support

           0       0.98      0.87      0.92      1684
           1       0.74      0.95      0.83       641

    accuracy                           0.89      2325
   macro avg       0.86      0.91      0.88      2325
weighted avg       0.91      0.89      0.90      2325