University of Tartu
Faculty of Science and Technology
Institute of Mathematics and Statistics

Artur Tuttar

Extending generalized linear models in insurance with machine learning techniques

Actuarial and Financial Engineering
Master’s thesis (30 ECTS)

Supervisors: Meelis Käärik (PhD), Julius Pau (MSc)

Tartu 2023

Extending generalized linear models in insurance with machine learning techniques
Master’s thesis
Artur Tuttar

Abstract. Machine learning models have shown promising results regarding their predictive power. However, they provide little to no information about how they use variables. The aim of this thesis is to introduce and put into practice two ways of extracting this insight about variable use. This insight is applied to produce interpretable models that predict in a similar way to the underlying machine learning models. The first three chapters give a theoretical overview of the methods used to build models and extract insight, and the last two chapters focus on applying these methods to predict claim frequency using real-life insurance data.

Keywords: motor vehicle insurance, generalized linear models, interpretable machine learning.
CERCS research specialisation: P160 Statistics, operations research, programming, actuarial mathematics.

Üldistatud lineaarsete mudelite edasiarendus kindlustusandmetel masinõppe meetodite abil (Extending generalized linear models on insurance data using machine learning methods)
Master’s thesis
Artur Tuttar

Summary. Machine learning models have recently stood out for their predictive power. Unfortunately, the structure of machine learning models does not make it possible to understand how these models use different variables. This master’s thesis introduces two methods that attempt to create interpretable models from a machine learning model and applies them to real-life data. The first three chapters of the thesis give a theoretical overview of the models and methods, and the last two chapters apply the introduced methods to estimate claim frequency on real-life insurance data.

Keywords: motor vehicle insurance, generalized linear models, interpretable machine learning.
CERCS research specialisation: P160 Statistics, operations research, programming, financial and actuarial mathematics.
Contents

Introduction
1 Generalized linear models
    1.1 Model structure
    1.2 Modelling using GLM
    1.3 Parameter estimation
    1.4 Count and frequency data modelling
2 Tree models
    2.1 Decision trees
        2.1.1 Data partitioning
        2.1.2 Regression trees
        2.1.3 Advantages and disadvantages of trees
    2.2 Boosting
        2.2.1 Gradient descent
        2.2.2 Gradient Boosting
        2.2.3 XGBoost
3 Machine learning insights
    3.1 Measures and statistics
        3.1.1 Model performance metrics
        3.1.2 Variable importance
        3.1.3 Partial dependence
        3.1.4 Friedman’s H-statistic
    3.2 Model-Agnostic Interpretable Data-driven suRRogates (maidrr)
    3.3 Rule ensemble
4 Claim frequency modelling
    4.1 Data and preprocessing
    4.2 Baseline models
    4.3 Modelling with GBM and XGBoost
5 Machine learning applications
    5.1 maidrr modelling
    5.2 Rule ensemble modelling
    5.3 Model comparison
    5.4 Discussion
Conclusion
References
Appendix
A maidrr algorithms
    A.1 maidrr surrogate model algorithm
    A.2 maidrr penalty tuning algorithm
Introduction

Insurance companies focus on evaluating risks and providing coverage for them. The price of the coverage should be fair and correspond to the underlying risk. The fair price is usually given through rates applied to a given client. Currently, an interpretable statistical model is built and used to estimate the rates. However, this approach is being challenged by machine learning.

Risk evaluation is usually split into two parts: estimating the number of claims (or claim frequency) and estimating the claim amounts. This thesis focuses on the former. Currently, the main tool for this task is an interpretable generalized linear model (GLM). However, several machine learning algorithms have been shown to outperform this classical approach (Henckaerts et al., 2019; Wüthrich, 2019). These machine learning models are inherently opaque, so their value lies only in their accuracy, and no insight applicable to ratemaking can easily be extracted from them.

This thesis aims to introduce and put into practice two ways to extract insights from machine learning models and produce interpretable counterparts to these models. Classic generalized linear models will be compared to machine learning models and the corresponding interpretable models by modelling claim frequency for motor third-party liability insurance.

This thesis is split into five chapters. The first chapter focuses on the model setup and model structure for generalized linear models. An overview of parameter estimation and claim frequency modelling using the Poisson distribution is also provided. The second chapter introduces decision trees and tree-based boosting ensemble methods such as gradient boosting machines (gradient tree boosting) and XGBoost. The third chapter focuses on model metrics and introduces two ways to extract insight from machine learning models: maidrr and rule ensemble.
In the fourth chapter, several models, including GLMs, a gradient boosting machine (GBM) and XGBoost models, are used to predict the claim frequency for Latvian motor third-party liability data provided by If P&C Insurance AS. General model training procedures are also given. The last chapter introduces models that use the insights extracted from the machine learning models developed in Chapter 4, and all models are compared using Poisson deviance and AIC. A short discussion about working on this thesis is also given.

All data manipulation and model training are done using the corresponding packages for the statistical computing software R (R Core Team, 2022). This thesis was written using Overleaf, an online compiler for the LaTeX typesetting system (Lamport, 1994).
The author would like to thank Julian Trufin and Roel Henckaerts for their correspondence regarding references and suggestions. The author is also extremely grateful for the advice and expertise provided by supervisors Meelis Käärik and Julius Pau. Lastly, the author is grateful for the help of his peers: Joseph Haske, Mihkel Lepson and Nicholas Lupul.

Additionally, this version of the thesis is made publicly available in the spirit of sharing research and showcasing the application of the methods developed. However, Appendices B, C, D, E and F contain information and results that are considered a trade secret of If P&C Insurance AS and could be used by other businesses to adjust and improve their ratemaking and pricing processes. This version of the thesis does not include these appendices; any references and text hyperlinks linking to these appendices have been altered to plain text.
1 Generalized linear models

One of the simplest ways to model a relation between independent variables $(X_1, X_2, \dots, X_p)$ and a response variable $Y$ is to assume a linear relation (in terms of parameters) between them and fit an ordinary linear regression of the form

\[
\mu_i = E(Y_i) := E(Y \mid X_1 = x_{i,1}, X_2 = x_{i,2}, \dots, X_p = x_{i,p}) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j},
\]

where $x_{i,j}$ is the realisation of the corresponding independent variable $X_j$ for the $i$-th observation.

In addition to the linearity assumption, ordinary linear regression assumes that the residuals of the model are normally distributed with constant variance. Therefore, we have $Y_i - \mu_i = \varepsilon_i \sim N(0, \sigma)$, where $\sigma$ is constant.

When looking at insurance data, we seldom observe the normal distribution. For example, the number of claims is a non-negative integer, while claim size can only be positive and tends to have a heavy tail. Thus the normal distribution assumption does not apply, and other ways of fitting a relationship between independent and dependent variables are needed.

A step up from ordinary linear regression was proposed in 1972. The generalized linear model (GLM) introduced by Nelder and Wedderburn showed a way to compute maximum likelihood estimates of the parameters $\beta_j$, $j \in \{0, 1, 2, \dots, p\}$, for observations conditionally distributed according to an exponential family distribution (Nelder and Wedderburn, 1972). These models remain a staple tool of the insurance industry to this day because they are simple to understand and easy to interpret.

1.1 Model structure

This subchapter is based on de Jong and Heller (2008).

A random variable $Y$ is from the exponential family if its probability density function $f_Y(x)$ is of the form

\[
f_Y(x) = c(x, \phi) \exp\left( \frac{x\theta - a(\theta)}{\phi} \right),
\]

where $\theta$ and $\phi$ are the canonical and dispersion parameters of the exponential family, respectively.
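As an illustration of the exponential family form above, the following minimal Python sketch checks numerically that the Poisson probability mass function can be written in this form. The choices $\theta = \ln\lambda$, $a(\theta) = e^{\theta}$, $\phi = 1$ and $c(x, \phi) = 1/x!$ are the standard Poisson parametrisation; the symbol `lam`, the function names and the value 2.5 are assumptions made here for illustration and do not come from the thesis, which uses R rather than Python.

```python
import math


def poisson_pmf(x, lam):
    """Textbook Poisson probability mass function."""
    return lam**x * math.exp(-lam) / math.factorial(x)


def exp_family_density(x, theta, phi, a, c):
    """Generic exponential family form c(x, phi) * exp((x*theta - a(theta)) / phi)."""
    return c(x, phi) * math.exp((x * theta - a(theta)) / phi)


# Assumed illustrative rate; any positive value works.
lam = 2.5

# Poisson as an exponential family member:
theta = math.log(lam)                       # canonical parameter
a = math.exp                                # a(theta) = exp(theta)
c = lambda x, phi: 1.0 / math.factorial(x)  # normalising term, phi = 1

# Largest absolute difference between the two forms over x = 0, ..., 19.
max_abs_diff = max(
    abs(poisson_pmf(x, lam) - exp_family_density(x, theta, 1.0, a, c))
    for x in range(20)
)
print(max_abs_diff)
```

The two expressions agree up to floating-point rounding, which is all the sketch is meant to show.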
For distributions from the exponential family, it holds that

\begin{align*}
E(Y) &= a'(\theta), \tag{1.1} \\
D(Y) &= \phi a''(\theta),
\end{align*}

where $E(\cdot)$ and $D(\cdot)$ denote the mean and the variance of a random variable, respectively.

The exponential family contains several distributions that are prevalent in insurance, including the exponential, gamma, inverse Gaussian, Poisson, binomial and negative binomial distributions.

The aim of the generalized linear model is the same as that of the ordinary linear regression model: to describe the response variable $Y$ in terms of independent variables $(X_1, X_2, \dots, X_p)$ and coefficients $(\beta_0, \beta_1, \dots, \beta_p)$. However, the models have some key differences. Firstly, for GLMs, we allow the response variable $Y$ to follow any exponential family distribution, whilst for ordinary linear regression, the response is assumed to be normally distributed. Secondly, ordinary linear regression models a linear relationship between the independent variables and the conditional mean $\mu_i$ of the response, whereas a GLM models the transformed conditional mean $g(\mu_i)$, where $g(\cdot)$ is called the link function. So for GLMs, we have that

\[
g(\mu_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{i,j} =: \eta_i,
\]

where $\eta_i$ is called the linear predictor for the $i$-th observation. Note that in the next two paragraphs, the index $i$ is omitted since a general discussion about the model structure is given.

The link function $g(\cdot)$ acts as the mediator between the linear predictor $\eta$ and the response variable $Y$. No single link function is prescribed for a given exponential family distribution. Rather, the choice is dictated by the data and the problem at hand. However, for every exponential family distribution, a canonical link function can be found. The canonical link function is a link function for which it holds that $g(\mu) = \theta$, where $\theta$ is the canonical parameter of the exponential family distribution.
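As a worked example of the mean and variance identities and of the canonical link, consider the Poisson distribution. The derivation below is a sketch under the usual parametrisation with rate $\lambda$; the symbol $\lambda$ is introduced here for illustration and is not part of the text above.

```latex
\begin{align*}
P(Y = x) &= \frac{\lambda^{x} e^{-\lambda}}{x!}
          = \frac{1}{x!} \exp\bigl( x \ln\lambda - \lambda \bigr), \\
\theta &= \ln\lambda, \qquad a(\theta) = e^{\theta}, \qquad \phi = 1, \qquad c(x, \phi) = \frac{1}{x!}, \\
E(Y) &= a'(\theta) = e^{\theta} = \lambda = \mu, \\
D(Y) &= \phi\, a''(\theta) = e^{\theta} = \lambda, \\
g(\mu) &= \theta = \ln\mu .
\end{align*}
```

The mean and the variance both equal $\lambda$, as expected for the Poisson distribution, and solving $\mu = a'(\theta)$ for $\theta$ shows that the canonical link in this case is the logarithm.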
The most common link functions are
• identity link: $g(\mu) = \mu$, which is the canonical link of the normal distribution,
• log-link: $g(\mu) = \ln(\mu)$, which is the canonical link of the Poisson distribution,
• power-link: $g(\mu) = \mu^{p}$, which is the canonical link of the gamma distribution if $p = -1$ and of the inverse Gaussian distribution if $p = -2$,
• logit-link: $g(\mu) = \ln\frac{\mu}{1-\mu}$, which is the canonical link of the binomial distribution.

Usually, the canonical link function is taken to equal the canonical parameter only up to a constant factor. For example, for the gamma distribution, we have that $\theta = -\frac{1}{\mu}$, but the canonical link function for this distribution is $g(\mu) = \frac{1}{\mu}$, so the constant $-1$ is omitted (de Jong and Heller, 2008).

Sometimes when modelling claim size or frequency, a need to adjust for group size or time period arises. Consider the number of claims: to estimate the average number of claims per unit of time, that is, the claim frequency, the number of claims should be divided by the exposure time of the policyholder, since longer exposure to risk means more time to have claims. In this case, using a log-link function yields

\[
g\left(\frac{\mu}{n}\right) = \ln\left(\frac{\mu}{n}\right) = \eta \iff \ln(\mu) = \ln(n) + \eta,
\]

where the variable $n$ is called the exposure and $\ln(n)$ is called the offset.

1.2 Modelling using GLM

This subchapter is based on de Jong and Heller (2008).

The following steps are done when fitting a GLM:
- Choose a response distribution with probability density/mass function $f_Y(\cdot)$ from the exponential family. The aim is to choose a response distribution tailored to the situation or modelling problem at hand.
- Choose a link function $g(\mu)$. As discussed in the previous subchapter, no single link function is prescribed for a given situation, since the data and the problem at hand dictate the appropriate choice, but the canonical link is a good starting point.
- Choose independent variables $X_1, X_2, \dots, X_p$. The choice stems from the problem at hand and can vary based on domain knowledge.
- Collect the data: the observed values of the response variable $y = (y_1, y_2, \dots)^T$ and of the independent variables $(x_{1,1}, x_{2,1}, \dots)^T, (x_{1,2}, x_{2,2}, \dots)^T, \dots, (x_{1,p}, x_{2,p}, \dots)^T$ (here $x_{i,j}$ refers to the value of variable $X_j$ for the $i$-th observation).