Figure 9: Example of grouping numeric variable partial dependence into 5 groups (NA values as a separate group).

Figure 10: Example of grouping ordered categorical variable partial dependence into 6 groups.

In addition to main effects, interactions can also be modelled using the H-statistic described in the previous subchapter. The H-statistic is used to find possible interactions, and the partial dependence of each interaction with sufficient strength (above a value defined by the modeller) is grouped using the same procedure as above. The effect of the corresponding interaction variable is then interpreted as an additional effect on top of the main effects (the interaction effect is centred at 0, which indicates no effect). Note that in this case, the main effect penalty and the interaction effect penalty can differ; they are denoted by $\lambda_{\text{main}}$ and $\lambda_{\text{intr}}$.

To find the best possible surrogate, maidrr focuses on 4 hyperparameters: the penalties $\lambda_{\text{main}}$ and $\lambda_{\text{intr}}$, the maximum number of groups $k$ and the interaction strength cut-off value $h$. The penalties are tuned by a grid search using K-fold cross-validation (the data is split into K parts; each part in turn is left out as the validation part, and the model is trained on the rest of the data) by finding the GLM that minimises the desired loss function with respect to the original response variable (rather than the machine learning predictions, as in other surrogate techniques). First, $\lambda_{\text{main}}$ is tuned; then, depending on the features selected (if $\hat{k}_{\{j\}} = 1$, the feature is excluded from the surrogate model), $\lambda_{\text{intr}}$ is tuned.

The hyperparameters $k$ and $h$ control the complexity of the surrogate and depend on the desired outcome. If a more complex model is suitable, a high value of $k$ and a low value of $h$ allow for smoother main effects and more interactions. The opposite choice, a low value of $k$ and a high value of $h$, gives a coarser surrogate. The algorithm for the maidrr surrogate is presented in Appendix (A.1) and the algorithm for penalty tuning in Appendix (A.2).
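The cross-validated grid search described above can be sketched in a few lines of Python. This is only an illustration of the tuning mechanics, not the maidrr implementation: the helper names (`k_fold_indices`, `cv_grid_search`) are hypothetical, and the "model" is a toy shrunken mean standing in for the penalised GLM surrogate. As in the text, the loss is evaluated against the observed response.

```python
import random


def k_fold_indices(n, K, seed=0):
    """Shuffle the indices 0..n-1 and split them into K roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::K] for f in range(K)]


def cv_grid_search(y, grid, fit, loss, K=5, seed=0):
    """Return the penalty in `grid` with the lowest mean validation loss."""
    folds = k_fold_indices(len(y), K, seed)
    scores = {}
    for lam in grid:
        fold_losses = []
        for f in range(K):
            held_out = set(folds[f])
            # Train on every observation that is not in the held-out fold.
            train = [y[i] for i in range(len(y)) if i not in held_out]
            valid = [y[i] for i in folds[f]]
            model = fit(train, lam)
            fold_losses.append(sum(loss(v, model) for v in valid) / len(valid))
        scores[lam] = sum(fold_losses) / K
    best = min(scores, key=scores.get)
    return best, scores


def fit_shrunken_mean(train, lam):
    """Toy stand-in for the surrogate: a mean shrunk towards 0 by penalty lam."""
    return sum(train) / (len(train) + lam)


def squared_error(observed, predicted):
    """Loss evaluated against the observed response."""
    return (observed - predicted) ** 2


# Illustrative data and grid (made up for this sketch).
response = [0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 0.0, 1.0, 3.0, 0.0]
best_lambda, cv_scores = cv_grid_search(
    response, [0, 1, 5, 20], fit_shrunken_mean, squared_error, K=5
)
```

In the two-stage scheme of the text, the same search would be run once for $\lambda_{\text{main}}$ and then again for $\lambda_{\text{intr}}$ with the selected main effects held fixed.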
3.3 Rule ensemble

This subchapter is based on (Friedman and Popescu, 2008).
There are almost countless ways to model a response variable, stemming from the choice of methods and of the parameters for these methods. Seldom are these methods easily interpretable. In some industries, however, models must be interpretable so that general business decisions can be based on them. Thus, an interpretable machine learning method is needed.

Decision trees, as shown in Chapter 2, are rule-based methods which, on their own, are very interpretable but also highly unstable. To fix this, tree ensembles can be used at the cost of interpretability. In 2008, Friedman and Popescu proposed a way to leverage the highly interpretable rules of tree ensembles to build a model with predictive power comparable to that of the ensemble methods.

When working with ensembles, we are searching for a predictive function $F(x)$ as an additive expansion

$$F(x) = \beta_0 + \sum_{t=1}^{T} \beta_t h(x, a_t), \tag{3.6}$$

where $h(x, a_t)$ is the "base learner" prediction function with parameters $a_t$, and $\beta_t$ are the combination parameters. When working with tree ensembles, the "base learner" $h(x, a_t)$ is a decision tree with $M$ terminal nodes, and the parameters $a_t$ are the splitting variables with indices $j \in \{1, \dots, p\}$ and the splitting points $s$, which are enough to define the $M$ terminal regions.

In their paper (Friedman and Popescu, 2008), the authors proposed that letting go of the underlying decision tree branching structure and using just the rules produced from the decision trees as the "base learners" could produce a highly interpretable method. Denote by $X_{\{j\}}$ the set of all possible values of variable $X_j$ and let $v_{j,k}$ be a subset of these values, $v_{j,k} \subseteq X_{\{j\}}$. A rule base learner is then

$$r_k(x) = \prod_{j=1}^{p} I(x_j \in v_{j,k}),$$

where $I(\cdot)$ is the indicator function and $k$ is the index of the rule. Taking the product over all variables results in a two-valued base learner ($r_k(x) \in \{0, 1\}$), which takes a non-zero value only if all variables $X_j$, $j = 1, \dots, p$, belong to their specified subsets of values $v_{j,k}$.
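As an illustration, the rule base learner can be written directly as a product of indicators. The following Python sketch is not taken from Friedman and Popescu: the helper names (`make_rule`, `interval`, `levels`) and the example rule are hypothetical. Each subset $v_{j,k}$ is represented by a membership test, and variables without a test are treated as unrestricted.

```python
def make_rule(conditions):
    """Build r_k(x) as a product of indicators I(x_j in v_{j,k}).

    `conditions` maps a variable index j to a membership test for v_{j,k}.
    Variables that are not listed are unrestricted, so their indicator is 1.
    """
    def rule(x):
        # all(...) over 0/1 indicators equals their product.
        return int(all(test(x[j]) for j, test in conditions.items()))
    return rule


def interval(lower, upper):
    """Membership test for the half-open interval (lower, upper]."""
    return lambda value: lower < value <= upper


def levels(allowed):
    """Membership test for an explicit subset of categorical levels."""
    allowed = frozenset(allowed)
    return lambda value: value in allowed


# Hypothetical rule on variables 0 and 2; variable 1 is left out of the rule.
rule = make_rule({0: interval(0.3, 1.0), 2: levels({"petrol", "diesel"})})

observation = [0.5, 42, "petrol"]
fires = rule(observation)  # 1 if every listed condition holds, otherwise 0
```

Representing the subsets as functions keeps the sketch agnostic about the variable type; a production implementation would more likely store the interval limits and level sets explicitly so that the rules can be printed and inspected.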
For orderable variables (numeric and ordered categorical variables), the subset of values is an interval $v_{j,k} = (t_{j,k}, u_{j,k}]$, where $t_{j,k}$ and $u_{j,k}$ are the lower and upper limit values (or categorical levels), respectively. For unorderable categorical variables, $v_{j,k}$ is an explicit subset of the possible categorical levels. Note that if $v_{j,k} = X_{\{j\}}$, then variable $X_j$ can be omitted from the rule since $I(x_j \in X_{\{j\}}) = 1$ for all values of $X_j$. In practice, "simple" rules are desirable, as they leave most of the variables $X_j$ out and focus on a few "important" variable segments.

An example of a rule generated from the CART decision tree in Figure 3, corresponding to region $R_4$ of the tree, is

$$r_4(x) = I(x_1 \in (t_3, 1]) \cdot I(x_2 \in [0, t_4]).$$

As shown above, consecutive decision tree splits fit the desired base learner structure and can thus easily be extracted and used on their own as base learners. Note that not only terminal node rules can be used: any combination of splits leading to any node in the tree can be used as a rule.

Suppose now that we have $T$ binary trees $\{T_t\}_{1}^{T}$. This results in

$$K = \sum_{t=1}^{T} 2(|T_t| - 1)$$

rules $\{r_k(x)\}_{1}^{K}$, where $|T_t|$ is the number of terminal nodes of the $t$-th tree. Then we can define the rule ensemble as

$$F(x) = a_0 + \sum_{k=1}^{K} a_k r_k(x), \tag{3.7}$$

where the rules $\{r_k(x)\}_{1}^{K}$ serve as the base learners and $a_k$, $k \in \{0, \dots, K\}$, serve as the combination parameters from Formula (3.6).

Using the importance sampled learning ensemble (ISLE) methodology described in (Friedman and Popescu, 2003), one possible way to estimate the combination parameters $\{\hat{a}_k\}_{0}^{K}$ is regularised linear regression on the training data,

$$\{\hat{a}_k\}_{0}^{K} = \underset{\{a_k\}_{0}^{K}}{\arg\min} \sum_{i=1}^{n} L\left(y_i,\; a_0 + \sum_{k=1}^{K} a_k r_k(x_i)\right) + \Lambda \cdot \sum_{k=1}^{K} |a_k|,$$

where $L(\cdot)$ is the loss function we want to minimise. The regularisation used here is Lasso regression, which combines the prediction risk $\sum_{i=1}^{n} L\left(y_i, a_0 + \sum_{k=1}^{K} a_k r_k(x_i)\right)$ with a penalty on the absolute size of the parameters, $\Lambda \cdot \sum_{k=1}^{K} |a_k|$. It can be shown that in Lasso regression, larger penalty values $\Lambda$ produce more shrinkage, often setting many "unimportant" parameters $\{a_k\}_{1}^{K}$ to zero and effectively excluding them from the regression (Tibshirani, 1996). This is favourable, since Lasso regression thereby performs feature selection among the many possible rules in the ensemble as a by-product of the fit.

In their research, Friedman and Popescu showed that a rule ensemble generated from trees of random size performed well when compared to other ensemble methods. However, an additional augmentation was proposed. Friedman and Popescu argued that the linear function is among the most difficult functions to approximate using rules (and decision trees), requiring a large number of iterations and rules to estimate accurately. They suggested that including additional linear components in the additive expansion can help to deal with linear dependence without sacrificing much predictive power, because Lasso regression can eliminate unnecessary linear components. The linear augmented rule ensemble is then

$$F(x) = a_0 + \sum_{k=1}^{K} a_k r_k(x) + \sum_{j=1}^{p} \gamma_j x_j,$$

where the rule ensemble additive expansion from Formula (3.7) has additional linear terms as base learners, with corresponding combination parameters $\gamma_j$.

Again using the ISLE approach, the combination parameters $\{\hat{a}_k\}_{0}^{K}$ and $\{\hat{\gamma}_j\}_{1}^{p}$ can be estimated using Lasso regression,

$$\left(\{\hat{a}_k\}_{0}^{K}, \{\hat{\gamma}_j\}_{1}^{p}\right) = \underset{\{a_k\}_{0}^{K},\, \{\gamma_j\}_{1}^{p}}{\arg\min} \sum_{i=1}^{n} L\left(y_i,\; a_0 + \sum_{k=1}^{K} a_k r_k(x_i) + \sum_{j=1}^{p} \gamma_j x_{i,j}\right) + \Lambda \left(\sum_{k=1}^{K} |a_k| + \sum_{j=1}^{p} |\gamma_j|\right). \tag{3.8}$$

It is important to keep in mind that Lasso regression is very sensitive to the scale of the predictors. Thus, the predictors should be normalised prior to fitting the model. The authors suggested normalising the variables $X_j$, $j = 1, \dots, p$, using

$$x_j \leftarrow 0.4 \cdot \frac{x_j}{\operatorname{std}(x_j)},$$

where $\operatorname{std}(x_j)$ is the standard deviation of the variable $X_j$ in the data. The coefficient 0.4 scales the variable to the same influence as an average rule whose support (the proportion of observations satisfying the rule) is uniformly distributed on the unit interval. The authors note that the rules can also be scaled, but this is seldom required, since rules with very large or very small support are ultimately defined by a small number of training observations, which is undesirable.

Additionally, since linear terms might suffer from outliers, a "Winsorized" version of the linear term should be used,

$$W(x_j) = \min\left(\delta_j^{+}, \max\left(\delta_j^{-}, x_j\right)\right),$$

where $\delta_j^{-}$ and $\delta_j^{+}$ are the $\alpha$ and $1 - \alpha$ quantiles of the distribution of $X_j$, respectively, with $\alpha \in (0, 0.5)$.
4 Claim frequency modelling

This chapter focuses on modelling motor third-party liability (MTPL) claim frequency using generalized linear models, gradient boosting machine and XGBoost. The modelling is done on Latvian MTPL data provided by If P&C Insurance AS. Note that in Latvia, MTPL policies are part of a shared market; thus, the same information is available to all insurance providers in the area at all times.

Motor third-party liability is compulsory insurance required for all vehicles registered with the Latvian Motor Vehicle Register. In case of an accident, this insurance covers the cost of damage done to a third party's property or health. Claim frequency modelling should, in theory, be the only approach capable of quantifying risk for the insurance company, since claim amounts should have little to no relation to the policyholder or the vehicle specified, as only third-party damages are compensated. Predicting claim frequency accurately allows the insurance company to capture better risks from the population by offering better prices and to drive away unwanted risks by setting a more appropriate price of insurance.

The full dataset contains 12 847 035 policies issued in 2012−2018 to private clients. The dataset has 84 columns that can be split into 3 categories:

• Policy information columns: agreement type, agreement status, policy start and end dates, policy duration, estimated policy issue region, number of claims and claim amounts linked to the policy.
• Policy owner information: policy subject birth date, subject age and driving experience, national driving penalty points and penalty notices, start and end dates for license by vehicle category, bonus-malus class, previous bonus-malus class and the number of previous claims.
• Policy vehicle information: vehicle age, type, seat count, mass, make, model, body type, fuel type, mileage, engine power and volume, date of first registration and date of last inspection.
Several columns were either duplicates or contained information that was unusable or missing. Of these 84 columns, 45 were selected for further analysis and preprocessing. The full table of selected columns, column types and column descriptions can be seen in Appendix B.1.
4.1 Data and preprocessing

The initial dataset provided by If had several data quality issues. This is common with real-life data, as data aggregation is done by combining data from several sources that seldom share a structure. To use this data for any kind of analysis, it needs to be preprocessed.

The dataset was quite big, just barely fitting into the memory of the machine used (a laptop provided by If). This meant that out-of-the-box data manipulation could not be done, and additional software capable of dealing with big datasets needed to be installed. To solve this, a local computation cluster using the Apache Spark backend and the Apache Hadoop distributed file system was set up (Apache Software Foundation, 2020; Apache Software Foundation, 2018). To communicate with the cluster, the RStudio package sparklyr (Luraschi et al., 2022) was used as the frontend to send R commands to the cluster. This system made it possible to use pipelines from the package dplyr (Wickham et al., 2022) to select, filter and modify all of the data at once, and to do so quite fast.

Preprocessing was done on a column-by-column basis. Key preprocessing steps are listed below:

• All of the dates were converted from text (usually, but not always, in the format YYYY/MM/DD) to date objects in R.
• Columns VehMake and VehModel had some of their levels aggregated into the level "Other" based on the count of observations with that level (levels with fewer than 5000 observations were aggregated). VehModel was dropped due to too many categorical levels.
• VehMake and Ifregion were encoded to numeric levels to ease the visualisation and model printouts.
• Missing values in categorical columns were merged into the corresponding "not available" levels ("n/z", "n/a", etc.).
• Some of the systematic errors were fixed. For example, Mileage was sometimes calculated in the wrong way (StartMileage - LastMileage).
• Date-based values like age, driving experience, vehicle age, etc., were calculated and compared against the values already present in the data.
• Ranges of most numeric variables (excluding the response) were capped at their 0.995 quantiles to exclude very large outliers.
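Two of the preprocessing steps listed above, aggregating rare categorical levels into "Other" and capping numeric variables at an upper quantile, can be sketched as follows. The thesis pipeline itself was written in R with sparklyr and dplyr; this pure-Python version only illustrates the logic on in-memory lists. The function names, the interpolating quantile definition and the toy data are assumptions made for this sketch.

```python
from collections import Counter


def aggregate_rare_levels(values, min_count=5000, other="Other"):
    """Replace levels observed fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (position - low) * (ordered[high] - ordered[low])


def cap_at_quantile(values, q=0.995):
    """Cap every value at the q-quantile of the variable."""
    cap = quantile(values, q)
    return [min(v, cap) for v in values]


# Toy data; a much smaller threshold than 5000 is used so the effect is visible.
veh_make = ["VW", "VW", "VW", "Audi", "Audi", "Lada"]
veh_make_grouped = aggregate_rare_levels(veh_make, min_count=2)

engine_power = [55.0, 66.0, 77.0, 81.0, 90.0, 103.0, 110.0, 900.0]
engine_power_capped = cap_at_quantile(engine_power, q=0.995)
```

On a Spark cluster, the same two steps would typically be expressed as a grouped count joined back to the data and an approximate quantile computation, since sorting a full column in memory is not feasible at that scale.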