- Estimate β = (β0, β1, . . . , βp)T and ϕ (if unknown prior) through maximum likelihood or its variant. A more detailed explanation of this is given in the next subchapter.
- Assess the fit of the model by examining the predictions of the model and other model diagnostics. The estimates of coefficients also show whether the independent variable helps to determine the value of the conditional mean of the response. This step usually restarts the fitting process as more information on the relation between the response and independent variables is gained. Usually, the data is already collected and the modelling is aimed at selecting the appropriate dis- tribution, link function and explanatory variables. Fitting the model is done through maximum likelihood methods, which are implemented in most statistical software. 1.3 Parameter estimation This subchapter is based on (de Jong and Heller, 2008) and (Hardin and Hilbe, 2018). The author has filled in some of the details missing from these references. The maximum likelihood estimate (MLE) of a parameter θ is such a value of θ that the likelihood of observing a given data is the biggest. We will now show how to get maximum likelihood estimates for the coefficients (β0, β1, . . . , βp) of the generalized linear model. Suppose we have a random variable Y belonging to the exponential family. Let y = (y1, y2, . . . , yn) be an independant sample from random variable Y . Our aim is to find parameters β = (β0, β1, β2, . . . , βp)T , such that the likelihood of seeing this sample y is highest. As the real- isation of Y in the sample are independent, then the likelihood L(β, ϕ, y) of getting this sample can be written as the product of the values of the probability density or mass function evaluated at sample observations L := L(β, ϕ, y) = n Y i=1 f(θi, ϕ, yi) = n Y i=1 c(yi, ϕ) exp yiθi −a(θi) ϕ , here θi = θ(µi), µi = µ(θi) and g(µi) = g(µ(θi)) = ηi, where µ(·) and θ(·) are functions fixed by the distribution (how the mean of distribution relates to the parameter of the distribution), and ηi is the linear predictor for the ith observation. Using this and the fact the logarithm is a monotonically increasing function (then the maximum of the function remains in the same place), we can get the log-likelihood of the sample ℓ:= l(β, ϕ, y) = ln(L(β, ϕ, y)) = n X i=1 yiθi −a(θi) ϕ
- ln(c(yi, ϕ)) . 9
Now, to find the set of parameters that maximise the log-likelihood, we have to take the derivative in terms of parameters β. From this, we get that ∂ℓ ∂β = ∂ℓ ∂β0 , ∂ℓ ∂β1 , . . . , ∂ℓ ∂βp T . Now, using the chain rule for derivatives, we have the following expression for each βj: ∂ℓ ∂βj
n X i=1 ∂ℓ ∂θi ∂θ(µi) ∂µi ∂g−1(ηi) ∂ηi ∂ηi ∂βj . (1.2) Taking the partial derivatives of ℓwith respect to θi, we get ∂ℓ ∂θi = yi −a′(θi) ϕ , (1.3) for all observations i = 1, . . . , n. Now, since µi = E (Yi), then taking derivative of (1.1) with respect to the θi, we get a′′(θi) = ∂µi ∂θi . (1.4) Using Formula (1.4) and the fact that θi = θ(µi) we get that ∂θ(µi) ∂µi
1 a′′(θi). (1.5) Lastly, for a monotone link function g(µ) = η, there exists inverse function g−1(·) and then, we get that µi = g−1(ηi) ∂ηi ∂βj = xi,j, (1.6) for all observations (i = 1, . . . , n) in the sample. Using Formulas (1.3), (1.1), (1.5) and (1.6) in Formula (1.2), we get that ∂ℓ ∂βj
n X i=1 yi −a′(θi) ϕ 1 a′′(θi) ∂g−1(ηi) ∂ηi xi,j
n X i=1 yi −µi ϕa′′(θi) ∂g−1(ηi) ∂ηi xi,j, 10
for all j = 0, . . . , p. Additionally, if the model includes the intercept term (β0) then xi,0 = 1 for all i = 1, . . . , n, otherwise xi,0 = 0. Note that here∂g −1(ηi) ∂ηi is the derivative of the inverse of the link function with respect to linear predictorηi. Now solving this set of derivative equations ∂ℓ ∂βββ = 0, gives us an estimate for the unknown coefficientsˆβββ. Since an analytical solution is difficult to find, an approximation method, such as the Newton-Rapshon method, can be used. 1.4 Count and frequency data modelling This subchapter is based on (Hardin and Hilbe, 2018). In this work, the practical part will focus on estimating claim frequency from the data. This will, in part, be done using GLM models with Poisson distribution. A short overview and details to keep in mind when dealing with Poisson distribution are provided below. The Poisson distribution is a discrete distribution which is often used to describe counts. The probability mass function for Poisson distribution is f (y, ξ) = e−ξξy y! = 1 y! exp (y ln(ξ) − ξ) , where ξ is a parameter with an intuitive meaning of frequency at which the event that is being tracked happens. From this form, we can clearly see that it is a part of the exponential family since we can choose θ = ln( ξ), a(θ) = ξ = exp( θ), ϕ = 1 according to the definition of the exponential family. A distinct property of Poisson distribution is that the expected valueE (Y ) and variance D (Y ) are equal (E (Y ) = a′(θ) = ξ, D (Y ) = ϕa′′(θ) = ξ). This is one of the most important properties to check when modelling with the Poisson distribution. If D (Y ) < E (Y ), then there is underdispersion in the data. Although not ideal, this is rarely taken into account when modelling. If D (Y ) > E (Y ), then there is overdispersion in the data and usage of regular Poisson distribution might lead to wrong conclusions. Other models like negative binomial or quasi Poisson can be used in this case. 11 2 Tree models Most machine learning techniques are non-parametric and usually quite complex "black-box" methods, where data goes in and predictions come out. However, decision trees are not a black- box technique. In fact, they produce an easily interpretable model. A decision tree is a non-parametric supervised learning method which tries to predict the response variable through a set of splits. These splits can be easily interpreted since they try to segment the data into homogeneous subsets. A chain of these splits is called a rule and can easily showcase a collection of data points that have somewhat similar independent variables and the response variable (Molnar, 2022). Decision trees are used as the key method for several more advanced machine learning methods. These methods aim to use the predictive power of a tree but lower the variance of the trees through ensembling. The ensembles use the decision trees as base learners and build a structure containing these trees to augment the predicting power of a single tree. These tree-based en- semble methods can be classified into two groups: boosting and bagging tree ensembles (Hastie, Tibshirani, and Friedman, 2009). In this chapter, we will first introduce a decision tree building method CART (Classification And Regression Trees), and then showcase a way these trees can be used as an ensemble to fix some of the shortcomings of a single decision tree. This work focuses on boosting ensembles of trees through gradient boosting and its further development XGBoost (eXtreme Gradient Boosting). 2.1 Decision trees This subchapter is based on (Hastie, Tibshirani, and Friedman, 2009). To understand how decision trees work, we will first start with a tree-growing algorithm with two independent variables. Then, we will generalise the algorithm to any number of variables. The splitting for both categorical and numeric variables will be discussed, but only the regression problem is explained, as this is the core problem of this work. 2.1.1 Data partitioning Suppose we have a continuous response variable Y and two independent variables X1 and X2 taking values in the unit interval. Our aim is to partition the space X := X{1} × X{2} (where X{j} is the set of possible values Xj can take; in this case, X{1} = X{2} = [0, 1]) into regions, 12
Figure 1: CART "general" partitioning using straight line boundaries. Taken from (Hastie, Tibshirani, and Friedman, 2009), Figure 9.2. Figure 2: CART recursive binary partitioning. Taken from (Hastie, Tibshirani, and Friedman, 2009), Figure 9.2. so the predicted response ˆyi in a given region is close to the actual response value yi. There are infinitely many and infinitely complex ways of splitting the space X up into such regions. Decision trees use splits done using straight lines like xj = c. However, even using straight lines, the partitioned regions can be hard to describe. An example of a "general" partitioning is given in Figure 1. To solve this, a restriction is put in place. Each split has to divide the space or subspace into two distinct parts minimising the loss with respect to the response. This way of splitting is called recursive binary split. The algorithm for recursive binary splitting is quite simple. At the start, all possible splits of space X are considered. Since each region predicts a constant value – usually region average – as the prediction for the response, the best split minimising the loss to the response value Y is adopted, and the space is divided along this split. Now two separate regions (subspaces) of X are formed, usually denoted as R1 and R2. However, as more splits are done, the indices are renumbered for all regions. The previous step is repeated for the new regions, and again the "best" split is chosen. The splitting takes place until some stopping rule is triggered. For example, the maximum number of regions is reached. In Figure 2, an example of a two-dimensional binary recursive split is given. First, a split along X1 at value x1 = t1 is done. This results in two regions R1 = {(x1, x2) ∈X|x1 ≤ t1} and R2 = {(x1, x2) ∈X|x1 > t1}. The next split is done in region R1 along variable X2 at value x2 = t2 producing two new regions R1 = {(x1, x2) ∈X|x1 ≤t1, x2 ≤t2} and 13
Figure 3: CART partitioning as a recursive binary tree. Taken from (Hastie, Tibshirani, and Friedman, 2009), Figure 9.2. R2 = {(x1, x2) ∈ X|x1 ≤ t1, x2 > t2} and previous R2 region becomes R3. After this, region R3 is split along X1 at x1 = t3 resulting in regions R3 = {(x1, x2) ∈ X|x1 > t1, x1 ≤ t3} and R4 = {(x1, x2) ∈ X|x1 > t1, x1 > t3}. Lastly, region R4 is split along variable X2 at value x2 = t4, where resulting regions areR4 = {(x1, x2) ∈ X|x1 > t1, x1 > t3, x2 ≤ t4} and R5 = {(x1, x2) ∈ X|x1 > t1, x1 > t3, x2 > t4}. The resulting regression model forY can be written as ˆy = ftree(x) = 5X m=1 cmI[x ∈ Rm], (2.1) where cm denotes the predicted value for the regionRm and x = (x1, x2) or the vector of realised values of variablesX1 and X2. The dimensional partitioning plots work well for one and two-dimensional data. However, for higher dimensional data, the space partitioning is not very informative. Another way to describe this partitioning is by using a binary tree. The same two-dimensional example as for Figure 2 is represented as a binary tree in Figure 3. A splitting condition is displayed at each tree node; if the observations abide by this condition, they go to the left child node. Otherwise, the observation goes to the right child node. The tree plot also generalises well to higher dimensions since, usually, at each node, only one variable is inspected. 14 2.1.2 Regression trees In the previous subchapter, we looked into how decision trees form the regression model by partitioning the data space into regions. However, the choice of best split and, in general, the growing of trees was not considered in detail. In this subchapter, we will remedy that. We are considering a regression problem for the response variableY. Let now our data containp independent variables X1, X2, . . . , Xp for n observations. The partitioning algorithm described in the previous subchapter should be able to decide what splitting variable and split point to use automatically. One way to do this would be to consider the regression model withM regions R1, R2, . . . , RM with constant predictionscm in each region f (x) = MX m=1 cmI(x ∈ Rm), where x = (x1, x2, . . . , xp). Using this regression, we choose a loss function and minimise it. The choice of this function should stem from the problem at hand. When dealing with count data, the Poisson deviance or Poisson log loss function should be minimised. As a simple example, take the loss function to be the sum of squares,P(yi − f (xi))2. Using this loss function, it’s easy to see that for any regionRm, the best prediction is the average of the response variable in the region Rm, ˆcm = avg(yi|xi ∈ Rm). But now we still have the issue of choosing the regions, as searching for the best binary partition with minimum possible loss is generally computationally infeasible. A greedy algorithm is used to choose the splitting variable Xj and splitting value s. We want to partition space X = X{1} × X{2} × · · · ×X{p}, whereX{j} is the set of values variableXj can have. Let us denote the regions corresponding to split along variableXj at splitting values as Rm1(j, s) and Rm2(j, s) (at the start of fitting, region indicesm1 and m2 are not fully determined as further splits might change them). Then we can write that Rm1(j, s) = {(x1, . . . , xp) ∈ X|xj ≤ s} and Rm2(j, s) = {(x1, . . . , xp) ∈ X|xj > s}, if variable Xj is a numeric variable and thens ∈ R or Rm1(j, s) = {(x1, . . . , xp) ∈ X|xj ∈ s} and Rm2(j, s) = {(x1, . . . , xp) ∈ X|xj ̸∈ s}, if variable Xj is categorical then s is a subset of values categorical variableXj can take or, in 15 other words, s ⊂ X{j}. In both cases, we solve a minimisation problem min j,s X i,xi∈Rm1 (j,s) L(yi, ˆcm1) + X i,xi∈Rm2 (j,s) L(yi, ˆcm2) , where L(y, c) is a loss function andˆcm is the solution to minimisation problem ˆcm = arg min c X i,xi∈Rm L(yi, c), (2.2) where m ∈ {m1, m2}. In case of sum of squares,L(y, c) = (y − c)2 and ˆcm = avg(yi|xi ∈ Rm). Finding the best splitting variableXj and splitting points, we adopt this split on the data and get two new regions from this. The same procedure is repeated on the resulting regions. This process, if left unchecked, can lead to severe overfitting since as the tree grows in size (depth), the tree relies on smaller and smaller subsets of the data. Overfitting leads to making predictions that most likely capture the local effect of the independent variables on the response. The tree structure, in that case, is too granular. One way to combat overfitting is by introducing a complexity parameter. This parameter aims to find the optimal tree size based on observed data. One of the possible complexity parameters is the minimum decrease in the loss for a split. So if no split lowers the loss by the amount fixed by this parameter, no further splits are done. This parameter has the downside of being short-sighted since sometimes a seemingly useless split might lead to a good split down the line. Another possible complexity parameter is cost complexity. In this case, we build the decision tree to its highest depth, stopping when we reach a stopping rule like minimum node size. This deep tree T0 is then pruned (internal nodes are collapsed) based on the value of this cost complexity. Let T be a subtree ofT0 obtainable through pruning. Then define Nm(T ) = X i,xi∈Rm 1 (node support), Qm(T ) = 1 Nm(T ) X i,xi∈Rm L(yi, ˆcm) (node complexity). Now we can define the cost complexity of treeT as Cα(T ) = |T |X m=1 Nm(T )Qm(T ) + α|T |, where |T | is the number of leafs in the treeT. Now the parameterα acts as a tradeoff between the tree size and its associated goodness of fit. It can be shown that through pruningT0, a unique 16 small subtree Tα that minimises Cα(T ) exists and can be found through weakest link pruning. That is, collapsing a node that produces the smallest per-node increase inP m Nm(T )Qm(T ). 2.1.3 Advantages and disadvantages of trees This subchapter is based on (Hastie, Tibshirani, and Friedman, 2009), (Molnar, 2022) and a scikit-learn post (1.10. Decision Trees 2023). In the previous two subchapters, a brief overview of the CART decision tree building algorithm was given. Like with any model, this model also has its advantages and disadvantages. In the case of the CART model, the decision trees are easy to visualise and, thanks to the building process, also easy to interpret. For CART models, the categorical variables can be used without any encoding (like dummy encoding) since CART uses a subset of variable levels to build the splitting condition. Modelling the data set with rules allows for easy interaction modelling. However, decision trees without proper pruning can easily overfit to the training data. Decision trees are also highly unstable since a small change in the dataset might lead to a different tree altogether. Although the piecewise constant approximation can, in theory, approximate any function, to approximate linare relation a lot of splits are required. Finding a single globally optimal tree can be computationally infeasible. 2.2 Boosting The previous chapter showcased a powerful predictive method, but it had some shortcomings. To combat some of the disadvantages present in the decision tree algorithm – chiefly the instability – and to augment the predictive power of a single predictor, an ensemble of predictors can be used. An ensemble is a collection of weak learners, such as low depth decision trees, all trained and arranged in a structure to lower the variance of predictions. Although an ensemble of any methods is possible, we will look at decision tree ensembles. Two strategies are commonly employed for creating decision tree ensembles: bagging and boosting. In the case of bagging, a lot of decision trees are built on bootstrapped samples of the training data and the predictions of each tree are averaged to get an overall prediction. In boosting, weak learners are used to learn from the predictions of the previous iteration sequentially. This work focuses on boosting methods for decision trees. 17 Boosting can be done in several ways, like learning from prediction errors (Adaboost), gradient descent or gradient boosting. In general, we can say that for boosting, we are searching for prediction function F (x) in the form of an additive expansion F (x) = β0 + TX t=1 βtft(x), where ft(x) is called the base learner,βt is called the expansion coefficient andT is the number of learners or iterations. Ultimately, we want to minimise the within-sample loss and find all of the base learners such that the sum of losses in the sample is minimal. In other words, we want to find F ∗(x) = arg min F (x) L(F ), where L(F ) = Pn i=1 L(yi, F(xi)). This can seldom be done all at once so usually an iterative process is used instead. The following subchapters will describe a boosting technique called gradient boosting in greater detail. 2.2.1 Gradient descent This subchapter is based on (Friedman, 2001). In order to understand gradient boosting, we first need to have an overview of gradient descent. Since machine learning algorithms are made to optimise some loss or error function, then a way to find a global minimum is needed. Since minimisation requires us to find partial derivatives, set them to 0 and solve the equation, then usually, the equations tend to be quite complex. Gradient descent aims to approximate the solution to this kind of problem. The idea of gradient descent is simple. First, we take a random guess of what a parameter, in terms of which we want to optimise, is. Next, calculate the negative gradient and update the chosen parameter values so the function value is smaller, and hopefully closer to the global minimum. In our case, consider the single observation loss function L(yi, ˆyi), where ˆyi = F (xi) is the prediction for ith observation (i = 1, . . . , n). Denote the sum of loss for the whole dataset as L(F ) := nX i=1 L(yi, F(xi)). We want to find the functionF ∗ that minimises the sum of lossesL. Note that in the upcoming discussion the subscript, i from xi, is omitted since the gradient is calculated and adjusted for each data point separately. 18 Let us search for the function F ∗in the form of F ∗(x) = T X t=0 ft(x), where f0(x) is a fixed initial guess and ft(x), 0 < t ≤T are the incremental improvements. Let Fτ(x) = Pτ t=0 ft(x) be the incremental prediction function at step τ. Define then ft(x) = −ρtgt(x), where ρt ∈R is the learning rate and gt(x) = ∂L ∂ˆy (y, Fτ−1(x)) is the gradient of the loss function evaluated at the previous incremental step Fτ−1. The learning rate ρt can be given in several forms. A constant learning rate ρt = ρ ∈R, ∀t ≥0 is one of the most widely adopted forms in software, but setting the constant value too low might lead to very slow convergence while setting it too high might lead to a method that does not converge or does not converge to the global minimum. Another way to fix the learning rate is to reduce it exponentially with each iteration. In this case, you would start with a constant learning rate ρ0 = ρ and at each step t you reduce it by some coefficient α ∈(0, 1). So we get the generic formula ρt = ραt. Overall, we can write that the resulting approximation to the solution function F ∗(x) is in the form of Ft(x) = Ft−1(x) −ρtgt(x). As each step reduces the gradient by some amount, a way to stop the process is required. The usual stopping rules are a maximum number of boosting iteration T or a threshold for the size of incremental step ft(x). 2.2.2 Gradient Boosting This subchapter is based on (Friedman, 2002). The setup for gradient boosting is the same as for gradient descent. Namely we want to find a function F ∗(x) such that sum of loss functions L(F ∗) = Pn i=1 L(yi, ˆy∗ i ) is minimised, where ˆy∗ i = F ∗(xi). So, in other words F ∗(x) = arg min F(x) L(F). 19