Magistritöö_lisadeta_Tuttar.pdf

(11), BmSubjectAge (24), Mileage (8), Vehyearnew (7), FullMass (6), FuelType (3), CntNotice (3), Policylength (2), EnginePower (3), AgreementTypeCode (2), SeatCount (2). The interactions selected for the surrogate model were CntPenaltyPoints and BmSubjectAge (13 groups), Ifregion and CntPenaltyPoints (11), Ifregion and BmSubjectAge (13), VehMake and BmSubjectAge (4), Drivexpbnew and BmSubjectAge (5), Ifregion and BmClassAas (7). This model will be called the maidrr surrogate (encoded as "Surrogate" in tables). Appendix D has plots of grouped partial dependence for all of these variables.

It is important to note that the underlying GBM model was able to deal with missing values, and this ability is also present in the surrogate. Namely, the surrogate model either has a separate group containing missing values where appropriate, or groups missing values together with some group of observed values. Sometimes larger values of a variable are also attached to the group with missing values, indicating that missing values act similarly to those larger values. One such example is the FullMass variable, where the group [NA, NA] also contains vehicles with a full mass of 3450 or greater. Detailed output and interpretation of the surrogate model can be found in Appendix E.1.

Now that we know what kind of groups can be used to imitate the machine learning model, augmentation of the underlying AIC model can be tested. First, a forward search based on AIC was carried out, starting from the AIC model with polynomial terms. For each grouping augmentation (for example, the grouping of the Ifregion variable into 17 groups), the equivalent variable in the AIC model was replaced by the grouped variable coming from the maidrr surrogate model, and the AIC of the resulting model was calculated. All of the different groupings were tested, and the augmentation providing the largest gain in AIC was adopted. The process was then repeated until no augmentation gave a further gain in AIC.
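The forward search described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code used in the thesis: the function `forward_aic_search`, the `"original"`/`"grouped"` labels, and the toy AIC gains are all hypothetical, and in practice `aic_of` would refit the GLM for each candidate specification and return its AIC.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model specification: variable name -> "original" or "grouped".
Spec = Dict[str, str]


def forward_aic_search(
    base: Spec,
    candidates: List[str],
    aic_of: Callable[[Spec], float],
) -> Tuple[Spec, List[str]]:
    """Greedily adopt the grouping that lowers AIC the most, until none helps."""
    current = dict(base)
    current_aic = aic_of(current)
    remaining = list(candidates)
    adopted: List[str] = []
    while remaining:
        # Try every remaining grouping on top of the current model.
        trials = []
        for var in remaining:
            trial = dict(current)
            trial[var] = "grouped"
            trials.append((aic_of(trial), var))
        best_aic, best_var = min(trials)
        if best_aic >= current_aic:
            break  # no remaining augmentation improves AIC
        current[best_var] = "grouped"
        current_aic = best_aic
        remaining.remove(best_var)
        adopted.append(best_var)
    return current, adopted


# Toy stand-in for refitting a GLM: invented AIC gains, NOT values from the thesis.
TOY_GAIN = {"Mileage": 30.0, "Drivexpbnew": 20.0, "SeatCount": -5.0}


def toy_aic(spec: Spec) -> float:
    return 1000.0 - sum(TOY_GAIN[v] for v, rep in spec.items() if rep == "grouped")


base = {name: "original" for name in TOY_GAIN}
final_spec, adopted_order = forward_aic_search(base, list(base), toy_aic)
```

With the invented gains above, the search adopts the Mileage grouping first, then Drivexpbnew, and stops before SeatCount because that grouping would increase the toy AIC. Interaction terms, which are added to the model rather than replacing a variable, would fit the same loop as extra candidates.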
With this procedure, the groupings were adopted in the following order: grouping for Mileage, then Drivexpbnew, then the interaction between CntPenaltyPoints and BmSubjectAge, then VehMake grouping, then the interaction between Ifregion and CntPenaltyPoints, then Vehyearnew, then CntNotice, then FullMass, then the interaction between Ifregion and BmClassAas, then the interaction between Ifregion and BmSubjectAge, then the interaction between VehMake and BmSubjectAge, then EnginePower grouping, then FuelType grouping and lastly SeatCount grouping. This model will be called the maidrr grouping augmented AIC model (encoded as "AIC model + group").

In addition to maidrr grouping, a second augmentation was proposed and tested. In some cases, fitting a constant to a grouped value does not seem appropriate. For example, for the Drivexpbnew variable (Figure 6), fitting a constant value for the group formed between 0 and 4 is not appropriate, since there is a clear non-constant drop in partial dependence over this range. To fix this, a piecewise linear approximation can be used. Starting from the maidrr grouping augmented AIC model, each of the grouped ordered and numeric variables was replaced by its piecewise linear alternative, one by one. If the piecewise alternative improved the AIC of the model, it was adopted. In the end, only one piecewise alternative improved AIC: Drivexpbnew. This model will be called the maidrr grouping and spline augmented AIC model (encoded as "AIC model + spline"). A more detailed look into the maidrr grouping augmented and the grouping and spline augmented AIC models is presented in Appendix E.2 and E.3, respectively.

5.2 Rule ensemble modelling

For the rule ensemble, an implementation available in the R package xrf was used. The procedure is carried out as described in Subchapter 3.3; however, the author of xrf built the package around fitting an XGBoost model and extracting rules from it. Additionally, the author implemented rule duplication removal and rule deoverlapping.

Rule deoverlapping means fixing the structure of the rules so that the data segments they produce have no or minimal overlap. In practice, this means introducing additional rules and altering the original extracted rules to make disjoint segments of data. In this case, deoverlapping of rules proved to be computationally infeasible and thus was not used.

It is important to note that the out-of-the-box package xrf was not able to produce a model with satisfactory assumptions. To fix this, three things were done: a way to supply an XGBoost model trained outside of the package was implemented, Lasso regression model options such as a Poisson modelling family were implemented, and linear term normalisation was added. Extracted rules remain on the original scale, since such rules can be interpreted more easily.

The XGBoost model trained in the previous chapter was used to make the rule ensemble.
From this model, 1173 non-duplicate rules were extracted. Using these rules and the linear terms, a penalty parameter search was conducted. The penalty parameters considered were the default values from the glmnet package (the Lasso regression backend) (Friedman, Hastie, and Tibshirani, 2010): 100 logarithmically uniform values from Λmax down to Λmin = 0.0001 · Λmax, where Λmax is the smallest penalty value for which all coefficients are 0. For this data, Λmax = 1.510447 · 10⁻². These penalty parameter values correspond to the penalty Λ in Formula 3.8.

The fitted model finds two "optimal" penalty parameters, Λmin = 0.00040119 and Λ1se = 0.00111633. Here Λmin denotes the cross-validation optimum, not the lower end of the search grid. The first corresponds to the penalty parameter achieving the lowest cross-validation error, while Λ1se is the largest penalty parameter whose cross-validation error is within 1 standard error (about 0.00107538, or 0.3% of the smallest cross-validation Poisson deviance (Formula (3.1))) of the minimum. These models will be called the rule fit model with minimum parameter (encoded as "RuleFit min") and the rule fit model with 1se parameter (encoded as "RuleFit 1se"), respectively.

The rule fit model with minimum parameter had 377 non-zero coefficients for terms in the regression. Of these, 341 were for rules and 35 for original terms. For the rule fit model with 1se parameter, 146 terms had non-zero coefficients, with 137 for rules and 8 for original terms. We can therefore say that the 231 additional non-zero terms of the minimum-parameter model, more than half of its terms, improve the cross-validation deviance by only about 1 standard error, showing that adding more terms gives only a small gain in deviance. It is also important to note that the original numeric terms were normalised, as suggested in (Friedman and Popescu, 2008). A more detailed overview of both rule fit models, together with a sample of their rules and coefficients, is available in Appendix F.1 and F.2.

5.3 Model comparison

This subchapter focuses on showcasing the differences in model accuracy metrics. Two metrics will be used to compare the models: Poisson deviance from Formula (3.1) on the test set as the performance measure, and AIC as a goodness of fit (GOF) measure. Comparison of accuracy measures based on previously unseen data is a standard approach for machine learning models. However, since one of the points of interest is to see whether the proposed augmentations improve the underlying model, AIC is also used as a goodness of fit metric. For this, all the different GLM models will be trained on the same data, and their AIC values will be computed to assess their future performance.

To compare the models, a unified test set is necessary.
Since the models need the variables in different forms, two additional copies of the test set needed to be made. All of these test sets are identical in terms of observation ordering etc., but their underlying structure differs, for example in one-hot encoding or normalisation of terms. The resulting test data sets have 294 777 observations.

Additionally, it is important to note that the maidrr surrogate model interaction terms are prone to producing additional missing values due to previously unseen combinations of interaction variables present in the test data. This cannot be avoided; thus, the observations producing these errors were omitted. The final test data sets had 241 020 observations. This data was acceptable for all models; thus, deviance comparison was possible and fair.

In Table 2, a comparison between models is presented. As discussed earlier, the test set was made fair and comparable for all models. The AIC value for likelihood-based models was found by retraining all models on the training data with all missing data removed. This makes the AIC values comparable. The number of parameters for the rule fit models was chosen to be the number of non-zero coefficients, since Lasso is used for feature selection in this model.

Table 2: Model comparison based on AIC on the training set and Poisson deviance on the test set. The test set was unified to be fair for all models, as different models require different data structures. The test set contains 241 020 observations. AIC was found with all models retrained on training data where all missing values were omitted.

Model name                 AIC (training)   Poisson deviance (test)   Number of parameters
Trivial model                               0.11695364
AIC model                  167150.8         0.11204148                91
AIC model + poly           166756           0.11159231                99
GBM                                         0.11146036
Surrogate                  166279.7         0.11148752                156
Surrogate No interaction   166432.0         0.11143346                109
AIC model + group          166300.3         0.11145320                175
AIC model + spline         166294.6         0.11144887                181
XGBoost                                     0.11158860
RuleFit min                165945.1         0.11143408                377
RuleFit 1se                166479.5         0.11169946                146

Based on the results from Table 2, we can see that all considered "augmented" models perform better than the basic AIC model. This is good to see, since we are considering somewhat more complex models. Looking closer, we can see that the surrogate model, which contains only categorical variables, is much better in terms of Poisson deviance than the AIC model with polynomial terms (0.11148752 against 0.11159231). Based on AIC, the surrogate model is also clearly better (by 476) than the AIC model with polynomial terms.
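The two metrics used in Table 2 can be illustrated with a short sketch. Formula (3.1) is not reproduced in this section, so the code assumes it is the usual unweighted mean Poisson deviance; the thesis may additionally use exposure weights. The function names are hypothetical, and only the two AIC values at the end are taken from Table 2.

```python
import math


def mean_poisson_deviance(y, mu):
    """Mean Poisson deviance: (2/n) * sum(y*log(y/mu) - (y - mu)).

    ASSUMPTION: unweighted form; the term y*log(y/mu) is taken as 0 when y == 0.
    """
    total = 0.0
    for yi, mi in zip(y, mu):
        term = mi - yi
        if yi > 0:
            term += yi * math.log(yi / mi)
        total += 2.0 * term
    return total / len(y)


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*logL (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


# AIC values copied from Table 2.
aic_poly = 166756.0       # AIC model + poly
aic_surrogate = 166279.7  # Surrogate
aic_gain = aic_poly - aic_surrogate  # roughly 476.3, the gap quoted in the text
```

Because AIC adds a penalty of 2 per parameter, the surrogate model reaches its lower AIC despite using 156 parameters against 99 for the AIC model with polynomial terms.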

However, something rather odd is also happening. We can see that the surrogate model without interactions is the best model based on Poisson deviance. This is odd, since the added interactions should, in theory, improve the fit of the model. There is no clear explanation why this model performs best on the test data. The fact that this model performs better than the underlying machine learning model is also strange, since the underlying structure of both models is similar (a constant prediction for a given segment of data). This might hint that the GBM model could be further improved.

Another strange thing is that the AIC model with maidrr grouping augmentation and the AIC model with grouping and spline augmentation perform better than GBM, the underlying machine learning model they were built on. One possible explanation is one of the disadvantages mentioned in Subchapter 2.1.3, related to the approximation of linear relations using splits. Both the GBM and XGBoost models are very shallow (depth of 4 for both) and have quite a low number of iterations (691 for GBM, 91 for XGBoost). With a low number of iterations, it is hard to have enough splits related to linear terms; thus, only a rough sense of the linear relation is captured. I believe this is also supported by the fact that some of the terms are kept linear in both the grouping augmented and the grouping and spline augmented models.

A similar case holds for XGBoost and the corresponding rule fit models. The rule fit model with minimum parameter outperforms the machine learning model it is based on, and it also contains the linear terms, thus capturing the linear relation not captured by the tree ensemble method. It has similar deviance to the surrogate model with no interactions, showing that rules are able to describe the data in a similar way.
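The point about splits capturing a linear relation only roughly can be illustrated with a toy calculation. The sketch below is an idealisation and not the thesis's GBM or XGBoost: it approximates f(x) = x on [0, 1) by a step function with equal-width bins, which stands in for the piecewise-constant predictions of a tree ensemble. The equal-width bins and the function names are assumptions for illustration only.

```python
def step_approximation(x, n_bins):
    """Piecewise-constant approximation of f(x) = x on [0, 1).

    ASSUMPTION: equal-width bins, each predicting its midpoint; a real tree
    ensemble chooses split points from the data instead.
    """
    idx = min(int(x * n_bins), n_bins - 1)
    return (idx + 0.5) / n_bins


def max_abs_error(n_bins, grid_size=10_000):
    """Largest gap between f(x) = x and its step approximation on a grid."""
    xs = [i / grid_size for i in range(grid_size)]
    return max(abs(x - step_approximation(x, n_bins)) for x in xs)


coarse_error = max_abs_error(4)   # few splits on the variable
fine_error = max_abs_error(16)    # more splits on the same variable
```

In this idealised setting the worst-case error is half a bin width, so it shrinks only in proportion to the number of splits spent on the variable, whereas a single linear term reproduces the relation exactly.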
However, we can see that the rule fit model is not that stable, since 1 standard error of cross-validation Poisson deviance is enough to make the model worse than the AIC model with polynomial terms (based on test deviance). This indicates that the underlying structure, although good in some cases, is very reliant on the complexity allowed by the penalty parameter. I believe that in order to make this model more stable, a better machine learning model and better rules are needed. However, looking into the rules that have non-zero coefficients has proven very insightful, showcasing which combinations of variables might be used in ratemaking. More on this is discussed in Subchapter 5.4.

Since all non-machine-learning models are ultimately likelihood-based GLMs, they can also be compared by AIC, which takes the number of parameters into account. On this basis, the rule fit model with minimum parameter is actually the best, followed by the maidrr surrogate model. The difference in AIC between