Magistrit%C3%B6%C3%B6_lisadeta_Tuttar.pdf - Page 4

3 Machine learning insights In this work, we want to showcase a way to leverage the good predictive power of machine learning models in augmenting a standard GLM. Machine learning has been able to uncover a deeper understanding of the underlying trends in the dataset being modelled. These deeper trends and the relationships between independent variables and response variables are most often obfuscated and there is no clear way to showcase how a prediction from a machine learning method comes about. Recently, a need to understand the underlying decision-making of machine learning methods has become relevant, and research into interpretable machine learning has taken off (Murdoch et al., 2019). In this chapter, we will discuss ways to learn from machine learning methods by visualising the use of variables and applying this knowledge in a semi-automatic way to augment the underlying GLM. Two methods are discussed: one focuses on insight extraction through partial dependence, and the other leverages the underlying structure of decision trees, extracting the splitting rules and modelling the response using those rules. 3.1 Measures and statistics This subchapter is based on (Molnar, 2022) and (Hastie, Tibshirani, and Friedman, 2009). This subchapter discusses several measures and metrics used for the insight methods and model comparison. 3.1.1 Model performance metrics In machine learning, many different models can solve a single problem. Many hyper parameters can and should be tuned inside these machine learning models to find the best fit for the data. There is a need for some metric to quantify how well a model is performing. One of the most popular metrics used is the model’s test set loss or deviance. When training a machine learning model, a common approach is splitting the data into two distinct parts: training and test data. The aim is to have a previously unseen bit of data, the test set, that acts as an equal playing field for all models. First, the model prediction function F(x) is estimated based on the training data, and then an appropriate (based on the modelling problem) loss function L(y, F(x)) is used to calculate the total loss for the test set, Pnt i=1(L(yi, F(xi)). Note that for some problems, the total loss is not the most appropriate metric to minimise, so deviance can 26

be used instead. For example for a Poisson distributed response variable, Poisson deviance for an i.i.d. sample of observations {(yi, xi)}n 1 can be calculated as DevP(F) = 2 n X i=1 yi ln yi F(xi) −(yi −F(xi)) . (3.1) Total loss or deviance based model comparison has one main issue. Namely, as the complexity of the underlying model increases, the total loss or deviance of that model will stay the same or decrease. This is similar to the case where increasing the number of linear terms always lowers or keeps the deviance of the linear model. For machine learning models, this is not a significant issue since most of the models have basically an infinite (saturated) number of parameters (splitting variables, splitting values, etc.). However, this issue persists for parametric models like generalized linear regression. A way to get around this for parametric models is to introduce a cost for the number of parameters in the model. For in-sample error prediction, the Akaike information criterion (AIC) can be used. For parametric models using likelihood, we have AIC = −2 · l(β, ϕ, y) + 2p. (3.2) The best model can be again chosen by having the lowest AIC. As the number of model parameters is penalised, the lower AIC ensures that predictive power of the model increased more than just the improvement due to the increased complexity. 3.1.2 Variable importance For models with p independent predictor variables, seldom are all variables equally relevant for predicting. For generalized linear models, an estimate of a variable’s coefficient βi and its variance D (βi) is enough to tell if that variable is statistically significant for predicting the response variable through the use of a test such as the likelihood ratio test. However, for machine learning methods, this kind of setup is missing. One possible solution to determine what predictors are relevant to predicting the response is called variable importance. There are several types of variable importance, but in this work, we look at relative variable importance and permutation variable importance. In the case of relative variable importance, we want to know how "relevant" different values of variable Xp are in predicting the response variable. For a single decision tree, the squared 27

Figure 4: Relative variable importance for gradient boosting machine computed on training data. Figure 5: Permutation variable importance for gradient boosting machine computed on training data. relative relevance of variable Xp can be expressed as the sum of squared improvements in the loss for all splits in a tree using this variable. In other words, we can write IR2 p(T) = J−1 X j=1 i2 jI(v(j) = p), where J is the number of internal nodes in the tree, v(j) is the index of the splitting variable for node j, and i2 j is the squared improvement in the loss for adopting a given split. This single decision tree relevance generalises well to additive tree expansion models since an average across all trees in the expansion shows the same phenomena. Then we can write that for tree ensembles IR2 p = 1 T T X t=1 IR2 p(Tt), where T is the number of decision trees in the additive expansion. Relevance is simply the square root of the corresponding squared relevance. As this measure is relative to the data it was fitted on, it is usually given as proportion (sums to 100). An example of scaled relative variable importance for a gradient boosting machine is shown in Figure 4. Permutation variable importance is a little different from relative variable importance even though they try to quantify the same thing. For permutation variable importance, we take the variable Xp, permutate the realised values xp = (x1,p, x2,p, . . . , xn,p)T in the dataset and see how much the loss of the dataset changed. Based on the increase in loss, we can see how important the given variable value was in making the prediction for a given observation. To calculate permutation variable importance, we take the prediction function F(x) and the loss function L(y, F(x)) and compute the sum of losses on the dataset Lorig = Pn i=1 L(yi, F(xi)). 28

Then permutate the variable Xp in the dataset to get Xpperm and compute the sum of losses with the permuted variable Lp = Pn i=1 L(yi, F(xiperm)), xiperm denotes the ith observation with the permutated variable Xpperm. Permutation importance is then their ratio IPp = Lp Lorig , or their difference IPp = Lp −Lorig. Similarly to relative importance, this value is usually presented as a proportion. An example of permutation importance for a gradient boosting machine is shown in Figure 5. Note that the order of variable importance is different for different measures, but overall the same variables are considered important. 3.1.3 Partial dependence Now that we have tools to choose a good model and to see what variables the model relies on to make predictions, we still need a way to visualise how the different values of a given variable or even a collection of variables affect the prediction. For a linear regression model, this boils down to interpreting the different regression coefficients βj. For example, in the case of modelling claim frequency using GLM with Poisson distribution and log-link function, we could say something along the lines: "Having two subjects for which all variables expect Xj are the same and numeric variable Xj differ by 1 unit, then the estimated claim frequencies differ by a factor of exp(βj)". For categorical variable Xk, the estimated claim frequency would differ by a factor of exp(βk) when compared to the baseline categorical level. As evident from previous subchapters, here again, machine learning does not have such simple interpretable tools. However, one somewhat interpretable tool exists - partial dependence. Partial dependence shows the average prediction for different values or unique combinations of values of fixed variables where the effect of not fixed variables is averaged out. Take a vector of indices for fixed variables S ⊂{1, 2, . . . , p}. Let C be its complement set, so that S S C = {1, 2, . . . , p}. Let XS denote the variable space defined by the variables with indices S and let XC be the complement variable space such that they can be combined to make original variable space X. The prediction function can be written as F(x) = F(xS, xC), where x ∈X, xS ∈XS and xC ∈XC. Partial dependence for values xS ∈XS can then be defined as PD(xS) = E (F(xS, XC)) , 29

Figure 6: GBM partial dependence for a numeric vari- able estimated from a sample of 105 rows of training data Figure 7: GBM partial dependence for ordered cate- gorical variable estimated from a sample of 105 rows of training data where XC is a random vector from complement space XC. In other words, partial dependence of fixed variable values xS is the expected value of the prediction function F(xS, XC) in the probability space defined by complement variables XC. Using real data, partial dependence can easily be estimated by ˆ PD(xS) = 1 n n X i=1 F(xS, xi,C), where xS is an element of observed combinations from space XS and xi,C denotes the values of the complement set variables for the i-th observation. Note that partial dependence is usually calculated sequentially over unique values of the fixed variables present in the data, then the particular values of partial dependence can be aggregated into a vector or easily plotted. Single variable partial dependence uses all unique values of that variable present in the data (grid of values) and calculates the corresponding average prediction over all other variables. For numeric variables, plotting partial dependence produces a trailing plot showcasing the change in partial dependence value based on values of numeric variables. For categorical variables, plotting partial dependence produces a scatter plot showcasing the partial dependence value for each level of categorical variable. For two variable partial dependence, a heat map (if both are numeric) or multiple line plot (if one is numeric and the other is categorical) or level faceted point plots (both are categorical) can be used. There is no easy or interpretable way to visualise higher dimensional partial dependence. Examples of one variable partial dependence plots are presented in Figures 6 and 7. From the averaging, we can see that if we fix only one variable Xj (j ∈{1, 2 . . . , p}) then, for all unique values of variable Xj, we go through all values from the dataset therefore, the partial dependence calculation is at least O(n2) complexity. Considering more than one variable, the 30

number of possible combinations grows by up to another factor ofn, so in total complexity would be O(n3). In general, the complexity for partial dependence ofs variables isO(ns). This is very computationally intensive, even for moderately sized datasets. 3.1.4 Friedman’s H-statistic This subchapter is based on (Friedman and Popescu, 2008). One of the key ways to model non-linear relations in linear regression is through the use of interaction. Interaction allows the value of one variable to affect the effect of another variable or even variables. In predictive modelling, interactions add a lot of complexity to a model but can also significantly improve the predictive power of a model. The decision trees are able to model interaction effects through consecutive splits on a single tree, thus leading to a structure that inherently models interactions. This property carries over to any other tree-based models (but not exclusively). In their paper (Friedman and Popescu, 2008), the authors defined a statistic that quantifies the presence and/or strength of an interaction between variables used in predictive functionF (x). Let all partial dependence be centred (mean removed). They showed that the partial dependence of two variablesXj and Xk could be decomposed into P D(x{j,k}) = P D(x{j}) + P D(x{k}), if variables do not interact. Herex{j,k} ∈ X{j,k}, x{j} ∈ X{j} and x{k} ∈ X{k}. Then, a statistic can be defined and empirically estimated using the empirical estimates of partial dependence on the dataset, H 2 {j,k} = Pn i=1 h ˆP D(xi,{j,k}) − ˆP D(xi,{j}) − ˆP D(xi,{k}) i Pn i=1( ˆP D(xi,{j,k}))2 , (3.3) where ˆP D(xi,{j,k}) is the partial dependence value for interaction combination present for ob- servation xi and ˆP D(xi,{j}) and ˆP D(xi,{k}) are single variable partial dependence values for variable values present for observationxi. This H-statistic showcases the fraction of variance not explained by a single variable partial dependenciesP D(x{j}) and P D(x{k}). If the value is 0, then no interaction between variablesXj and Xk is present in prediction functionF (x). Otherwise, larger values correspond to a "stronger" interaction between these variables. For a single variableXj, we can also check if any interactions with this variable are present in 31 F (x) by using another version ofH-statistic (3.3), H 2 j = Pn i=1 h F (xi) − ˆP D(xi,{j}) − ˆP D(xi,\j) i Pn i=1(F (xi))2 , (3.4) where ˆP D(xi,\j) denotes the interaction partial dependence value for all variables expectjth variable for combination present in observationxi. Friedman and Popescu also proposed to combine the numerator of (3.3) and denominator of (3.4) to get a version ofH-statistic showcasing the importance of a particular interaction in the prediction model. Additionally, the authors showed that this statistic could easily be expanded to higher dimension interactions. In this work, we only focus on two variable interactions. 3.2 Model-Agnostic Interpretable Data-driven suRRogates (maidrr) This subchapter is based on (Henckaerts, Antonio, and Côté, 2020). Partial dependence is one way to quantify the way the machine learning model uses one variable. In 2020, Roel Henckaerts et al. proposed a method using partial dependence, segmenting and clustering it, to produce a surrogate model capable of mimicking the prediction functionF (x) of any machine learning model, thus being a model-agnostic approach. A workflow of the maidrr process can be seen in Figure 8. Figure 8: maidrr process for transforming black box methods into interpretable GLM. Taken from (Henckaerts, Antonio, and Côté, 2020) Figure 2. Assuming a satisfactory machine learning "black box" method is fit to the data, we first extract the partial dependence for independent variables in the data. The same calculation is used as described previously, but since the algorithm is computationally very demanding, this is usually done over a smaller sample of training data. Note that since we want to average out the effect of other variables, the Monte-Carlo approach of oversampling for smaller initial datasets might lead to a better estimation of the distribution of complement variables, thus reducing the variance of partial dependence. 32 After partial dependence is calculated, a dynamic programming clustering K-means algorithm is used to make optimal (for a given K) and reproducible clusters of partial dependence effects (Wang and Song, 2011). These clusters are bins of values for numeric variables or sets of cate- gorical levels for factor variables. Note that only sequential factor variable levels can be grouped for ordered factor variables. The clustering produces a segmentation of independent variables allowing for a constant to be fit to a given segment, representing its effect on the response variable. To find an optimal number of groupings for all independent variables would be a p-dimensional problem. However, maidrr authors proposed the use of penalised loss function to find the optimal number of groups ˆk{j}, j = 1, . . . , p. The process is simple and starts by taking a random guess for the number of groups for a variable. Suppose variable Xj was grouped into k{j} groups. Let zi∗,{j} = ˆ PD(xi∗,{j}) denote the partial dependence value of i∗-th unique value of variable Xj, i∗∈{1, . . . , m{j}}, where m{j} is the number of unique values, and let ˜zi∗,{j} be the average partial dependence value of the group the unique value xi∗,{j} belongs to. The optimal number of groups ˆk{j} can be found by solving arg min k{j} m{j} X i∗=1 wi∗,{j}(zi∗,{j} −˜zi∗,{j})2 + λ ln(k{j}), (3.5) where wi∗,{j} is the proportion of observations that have value xi∗,{j}. Using the weighted mean squared error in Formula (3.5) encourages good groupings for frequently occurring feature values. Low (high) values of penalty parameter λ allow for more (less) groups for a given feature, resulting in a smoother (coarser) approximation of the underlying partial dependence effect. Examples of possible groupings for a numeric variable are presented in Figure 9, and for ordered categorical variable is presented in Figure 10. The figures indicate grouping by the vertical lines for numeric variables and as different symbols for categorical variables. Additionally, the weight of given feature values is displayed as the line colour for the numeric variable (darker means more) and the size of the symbol for the categorical variable (bigger means more). Through grouping of partial dependence, we find a possible segmentation of independent variables in space X, thus turning all variables into categorical form by introducing additional parameters for numeric variables and taking away some "not needed" parameters for categorical variables. After grouping, a surrogate model (GLM) is fit. Since all variables end up as categorical levels, the fitted GLM can easily be transformed into a fixed-size decision table which is very transparent and interpretable. 33

Page 4 of 9