
• A testing scheme for date logic was developed, for example checking that the birth date precedes all other dates.
• Categorical columns were converted to factors, with factor levels ordered by their count in the data (ordering by exposure was considered, but showed no difference in the ordering of levels).
• A Group ID was constructed, since some policies had the same information aside from claim-related data. The Group ID was used to form the training and testing sets so that data leakage between splits was avoided.

Note that missing values were present for most numeric variables and for some of the categorical variables, and these missing values were not altered. This was intentional, since the gradient boosting machine implementation in the R package gbm (Greenwell et al., 2022) allows for a separate direction for missing values, which makes it possible to model the missing data as well.

Although some machine learning techniques are available in SPARK, the available implementations were not up to par in terms of possible loss functions, weighting, etc. The implementations in the R packages gbm and xgboost (Chen et al., 2022) were used instead. This meant that the computation had to be done in-machine, so training on the whole data set was impossible. Based on proposals from If and prior modelling papers such as Henckaerts et al. (2017) or Wüthrich (2019), the policies of a single financial year (2018) were chosen as the data to be modelled. The latest financial year was chosen since less data seemed to be missing for later years.

The final data that was modelled and analysed contained 1 644 800 rows with 19 explanatory variables, an exposure variable, the number of claims as the response variable and the Group ID column. The data was split based on the unique values of the Group ID column into three parts: training (proportion 0.6), validation (proportion 0.2) and testing (proportion 0.2).

4.2 Baseline models

To compare different approaches, a baseline model is needed.
One such baseline is the trivial model, which predicts the historic average response for new observations. In the insurance industry, however, GLM models are widely used, which makes a GLM a good baseline to compare against. This subchapter focuses on building the GLM models that serve as baselines for the other models. The models are fitted using the structure specified in Subchapter 1.1. Parameters

of the models are estimated using maximum likelihood estimation following the setup given in Subchapter 1.3.

Note that a GLM is not able to work with missing values, so missing data must either be imputed or deleted. As discussed at the end of the previous subchapter, the gradient boosting model is able to work with missing data; imputation, although industry standard, is therefore forgone to showcase one of the possible advantages of machine learning methods. This means that the GLM models were fitted on data from which all observations with missing values had been removed. Since a GLM can easily be fitted using the likelihood of the data, there is no need for a validation split to assess the fit of the model, and a bigger dataset allows for a better likelihood fit of the model parameters. The GLM models were fitted using 1 117 001 observations (about 10.39% fewer observations than the full training and validation set of 1 315 720).

A bidirectional step-wise search based on AIC (defined in Formula (3.2)) was used to find the model. The rationale for the step-wise search was that optimising AIC selects the model with the best trade-off between fit to the training data and the number of parameters, which will hopefully also generalise well. Additionally, the step-wise search can be considered an automatic modelling procedure, similar to machine learning models. The model with the best AIC used these variables: FullMass, BmSubjectAge, Vehyearnew, Mileage, Drivexpbnew, AgreementTypeCode, CntNotice, CntPenaltyPoints, FuelType, BmClassAas, Ifregion, Policylength. The variables Vehyearnew and Drivexpbnew are transformations of the variables VehicleAge and DrivExpB with corrections from date information. This model will be called the "AIC model". The description of the variables can be seen in Table 3 in Appendix B.1.

Note that no interactions were considered. This was done for three reasons. Firstly, a simple model to compare against is desirable, and interactions introduce a lot of complexity.
Secondly, aside from an exhaustive search, there are no easy ways to automate the search for interactions using just the GLM framework. Lastly, computing interactions for some combinations of the variables was not feasible (or in some cases not possible) on the hardware used.

A second baseline model was also proposed, using the previously described GLM model as the starting point. Following some historical findings about driver age and experience, a polynomial relation for age and experience was considered (Valecký, 2016). A backward step-wise search (based on AIC) was run, starting from a sixth-order polynomial term for each variable. The addition of polynomial terms decreased AIC further; the lowest AIC was achieved by the model in which both variables had polynomial terms up to the fifth order. This model will

be called "AIC model + poly". Both of these GLM models will be used as benchmarks against which the performance of the machine learning models and the improvement methods described in Chapter 3 is compared. Appendix C gives an overview of both models, presenting their structure, coefficients and interpretation.

4.3 Modelling with GBM and XGBoost

The gathered insight can only be as good as the underlying machine learning model is at predicting the response variable. This subchapter gives an overview of the tuning procedure for the gradient boosting machine (GBM) and XGBoost models. The models follow the corresponding frameworks described in chapters 2.2.2 and 2.2.3. Note that these models share the same structure; however, XGBoost can be considered the more advanced and modern algorithm. The GBM model is used to showcase the maidrr methods, since models from the R package gbm work natively with the implementation in the package maidrr (Henckaerts, 2020). The XGBoost model is trained using the R package xgboost. The rule ensemble implementation in the R package xrf (Holub, 2022) expects an XGBoost model as the underlying tree model.

For both models, the procedure for hyperparameter tuning was similar. First, a grid of appropriate hyperparameters was created using the R package dials (Kuhn and Frick, 2022). The dials package allows the user to specify ranges for the hyperparameters and then apply a complete grid selection, random selection or maximal entropy selection. Maximal entropy selection chooses parameter combinations so that the whole space specified by the ranges is covered as well as possible. This was used to generate 20 combinations of hyperparameters. Then, models were fitted with these parameters on the training portion of the data, and their performance was evaluated on the validation portion of the data, using (3.1) as the error metric.
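The selection step described above can be illustrated with a minimal Python sketch. Formula (3.1) is not reproduced in this section, so the sketch assumes the usual unweighted mean Poisson deviance; the thesis metric may additionally account for exposure. The names `poisson_deviance` and `best_candidate`, and the toy predictions, are hypothetical and only stand in for fitted models.

```python
import math


def poisson_deviance(y, mu):
    """Mean Poisson deviance between observed counts y and predicted means mu.

    Assumed form: (2 / n) * sum(y * log(y / mu) - (y - mu)), with mu > 0.
    """
    total = 0.0
    for yi, mi in zip(y, mu):
        # The y * log(y / mu) term is taken as 0 when y == 0 (its limit).
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (log_term - (yi - mi))
    return total / len(y)


def best_candidate(y_valid, predictions_by_candidate):
    """Return the candidate with the smallest validation deviance, plus all scores."""
    scores = {
        name: poisson_deviance(y_valid, mu)
        for name, mu in predictions_by_candidate.items()
    }
    return min(scores, key=scores.get), scores


# Toy validation counts and predictions from two hypothetical candidates.
y_valid = [0, 1, 0, 2, 0, 0, 1, 0]
candidates = {
    "constant_mean": [0.5] * 8,
    "closer_fit": [0.2, 0.9, 0.2, 1.6, 0.2, 0.2, 0.9, 0.2],
}
best_name, scores = best_candidate(y_valid, candidates)
print(best_name, scores)
```

In the actual procedure, each of the 20 hyperparameter combinations would contribute one entry of validation predictions, and the combination with the smallest deviance would be kept.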
For the GBM model, 4 hyperparameters were tuned using the grid: the maximum depth of each tree in the ensemble (range 3 to 6), the ensemble learning rate (range 0.001 to 0.2), the minimum number of observations in a leaf node (range 1 to 987) and the observation sampling rate for trees in the ensemble (range 0.5 to 0.8). The optimal number of trees in the ensemble was found by using a fraction (20%) of the training data as additional validation data; the best number of trees was the one with the smallest validation error.

Using this tuning procedure, I found that the GBM model with a tree depth of 4, an ensemble learning rate of 0.03311877, a minimum number of observations in a leaf node of 339, an observation sampling rate of 0.6757881 and 691 trees gave the smallest validation error on the validation portion of the data.

Table 1: Baseline models and machine learning models performance comparison. The test set had all rows with missing values dropped to make model performance comparable (GBM could also predict with missing values). The trivial model predicts the training data average frequency for all observations.

| Model | AIC | Poisson Deviance (Test) | Number of parameters | Deviance proportion |
|---|---|---|---|---|
| Trivial model | | 0.11695364 | | 104.3842% |
| AIC model | 167150.8 | 0.11204148 | 91 | 100% |
| AIC model + poly | 166756 | 0.11159231 | 99 | 99.5991% |
| GBM | | 0.11146033 | | 99.4813% |
| XGBoost | | 0.11158860 | | 99.5958% |

XGBoost is the more advanced version of gradient tree boosting and thus allows more hyperparameters to be tuned. For XGBoost, 7 hyperparameters were tuned: the number of variables used to fit a tree (range $\lceil \sqrt{108} \rceil$ to $108/3$, that is, 11 to 36), the maximum depth of trees in the ensemble (range 3 to 6), the ensemble learning rate (range 0.1 to 0.3), the number of trees (range 1 to 100), the minimum number of observations in a leaf node (range 1 to 987), the reduction in loss required to allow further splits (range 0 to 0.2) and the observation sampling rate for trees in the ensemble (range 0.5 to 0.75).

It is important to note that the XGBoost model, as implemented in package xgboost, did not allow for missing values in the data and required all data to be numerical; thus one-hot encoding (splitting all levels of categorical variables into separate binary variables) needed to be applied. Doing so takes the number of variables to 108. Note also that only a small number of trees (up to 100) was considered, since more trees produce more rules and thus increase the computational intensity of the rule ensemble methods. The optimal XGBoost model had 35 variables for each tree, a depth of 4, a learning rate of 0.28255206, 91 trees in the ensemble, at least 496 observations in the leaf nodes, a reduction in loss of at least 0.06351081 and a sampling rate of 0.68568611 for each tree.

A comparison between the resulting models can be seen in Table 1. Since all models use variables and data differently, a unifying test set needed to be created. This meant removing all observations with missing values from the testing set, since neither the GLM nor the XGBoost models can work with missing values. Based on the test set performance (again using (3.1)), we can see that the GBM model is clearly best, followed by the XGBoost model and the AIC model with polynomial terms. The base AIC model is worse than the other non-trivial models.
Taking now the Poisson deviance of the AIC model as 100%, the polynomial terms improve the Poisson deviance by 0.4009%, XGBoost by 0.4042% and GBM by 0.5187%. The trivial model is about 4.3842% worse than the AIC model, showing that the variables are able to describe the response to some extent.

The accuracy improvements provided by the machine learning methods were much smaller than expected. However, it is important to remember that a better choice and tuning of hyperparameters and some restrictions on the model structure could further improve these models. As this is not the main aim of the thesis, no further search for better models is done.
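The deviance proportions quoted above follow directly from the test-set deviances in Table 1. The short Python sketch below recomputes them; the function name `deviance_proportions` is hypothetical, and the only inputs are the values reported in the table.

```python
# Test-set Poisson deviances as reported in Table 1.
test_deviance = {
    "Trivial model": 0.11695364,
    "AIC model": 0.11204148,
    "AIC model + poly": 0.11159231,
    "GBM": 0.11146033,
    "XGBoost": 0.11158860,
}


def deviance_proportions(deviances, reference="AIC model"):
    """Express each deviance as a percentage of the reference model's deviance."""
    ref = deviances[reference]
    return {name: 100.0 * value / ref for name, value in deviances.items()}


proportions = deviance_proportions(test_deviance)
for name, prop in proportions.items():
    # A negative change means a lower (better) deviance than the AIC model.
    print(f"{name:18s} {prop:9.4f}%  change vs AIC model: {prop - 100.0:+.4f}%")
```

A proportion below 100% means the model attains a lower test deviance than the AIC model; the improvement is 100% minus the proportion.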

5 Machine learning applications

In the previous chapter, we fitted machine learning models that were able to predict the response variable better than the baseline models, based on the Poisson deviance in Formula (3.1). This chapter focuses on extracting insight from the machine learning models and building interpretable models based on this insight. We will use the maidrr approach and rule ensembles for this.

5.1 maidrr modelling

We will apply the maidrr method to the GBM model produced in the previous chapter. The GBM model from package gbm works natively with package maidrr, since the authors of maidrr developed it with the gbm package in mind, although maidrr is model agnostic and any machine learning model can, in theory, be used.

Using the GBM model, we first extract the partial dependence for all variables with non-zero importance. The relative variable importance in Figure 4 corresponds to the GBM model trained in the previous chapter. Based on this, all variables besides AgrStatus, IsTrainingEquipment and Terminated had non-zero variable importance, and thus partial dependence was computed for them. Partial dependence was computed using a sample of 100 000 observations from the training set due to computational intensity. This sample is, however, 10 times bigger than the sample used by the maidrr package by default.

Then, using the implementation of Algorithm A.2, the optimal grouping penalties were selected. Optimal penalties are the solution to the problem in Formula (3.5), where the optimal number of groups $k_j^*$ is unknown. This problem is solved separately for main effects (single variables) and interaction effects. It involves taking a set of potential grouping penalties, applying each penalty and assessing how well the surrogate GLM performs using cross-validation. Several runs with different sets of potential grouping penalties were done.
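The partial dependence step described above can be sketched in a few lines of Python. This is a generic illustration of partial dependence on a sample of observations, not the maidrr implementation: `toy_predict` is a hypothetical stand-in for the fitted GBM, and the column names and coefficients are invented for the example.

```python
import random


def partial_dependence(predict, rows, feature, grid):
    """Average prediction over `rows` with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # overwrite only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row):
    """Hypothetical claim-frequency model standing in for the fitted GBM."""
    return 0.05 + 0.002 * row["penalty_points"] + 0.00001 * row["mileage"]


random.seed(0)
data = [
    {"penalty_points": random.randint(0, 8), "mileage": random.uniform(0, 30000)}
    for _ in range(1000)
]
# Partial dependence is computed on a random sample to limit the computation.
sample = random.sample(data, 100)
pd_curve = partial_dependence(toy_predict, sample, "penalty_points", range(0, 9))
print(pd_curve)
```

The cost grows with the sample size times the number of grid values, which is why a sample rather than the full training set is used.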
For main effects (single variable), the optimal penalty was $\lambda_{\text{main}} = 5 \cdot 10^{-7}$, and for interaction effects, the optimal penalty was $\lambda_{\text{intr}} = 9 \cdot 10^{-6}$. The search range for both was $10^{-12}$ to $10^{-3}$, so both penalties were around the middle of the search range on a logarithmic scale. Note that smaller penalties mean more grouping levels.

One key advantage of maidrr is that it can automatically perform feature and interaction selection. For the optimal penalties $\lambda_{\text{main}}$ and $\lambda_{\text{intr}}$, 15 variables were selected, and in addition, 6 interactions using these variables were selected. The selected variables with optimal groupings were Ifregion (17 groups), CntPenaltyPoints (9), VehMake (15), BmClassAas (14), Drivexpbnew
