For gradient descent, the loss function for each observation was sequentially optimised by the gradient produced on the loss function. Gradient boosting takes another approach again using an additive expansion, but in this case, we search for the function F(x) in the form of F(x) = T X t=0 βth(x, at), where h(x, a), is a simple, weak learner function of independent variables x with parameters a = (a1, a2, . . . ). The expansion coefficients βt and function parameters at are derived from the training data in a stage-wise manner. Again an initial guess for the parameters is given at step t = 0 and then, at each iteration step t = 1, 2, . . . , T , the next set of parameters is found as a solution to (βt, at) = arg min β,a L(Ft−1(x) + βh(x, a)). (2.3) The solution to this function is found through a two-step procedure. First, a least squares approach is used to estimate the parameters of the weak learner, at = arg min a,ρ n X i=1 [˜yi,t −ρh(xi, a)]2, where ˜yi,t is the negative gradient evaluated at the previous iteration, ˜yi,t = −∂L ∂ˆy (yi, Ft−1(x)), and ρ ∈R is the weight of the given base learner in the ensemble. Secondly, the resulting optimal at is then used in (2.3) to form a one parameter optimisation problem βt = arg min β L(Ft−1(x) + βh(x, at)). (2.4) Gradient tree boosting (gradient boosting machine) is a special case of the method described above. It fixes the structure of a weak learner to be a tree with M terminal nodes splitting the data into regions R1, R2, . . . , RM. As specified in the previous subchapter, the regression tree predicts a constant ˆcm,t for the whole region that is the solution to (2.2). Then we can write h(x, {Rm,t}M m=1) = M X m=1 ˆcm,tI(x ∈Rm,t), where Rm,t is the m-th region of the t-th decision tree (our weak learner). Note that in this case, the model parameters would be the set of splitting variables with indices j ⊂{1, 2, . . . , p} and 20
their split points s, which can be calculated in the same manner as discussed previously. These parameters are enough to determine the corresponding regions Rm,t for t-th iteration. As the regions are disjointed, then the solution to (2.4) simplifies to a location estimation problem within each region Rm,t for loss L γm,t = arg min γ X i,xi∈Rm,t L(yi, Ft−1(xi) + γ). The updated prediction function for the t-th iteration in region Rm,t then becomes Fm,t(x) = Ft−1(x) + ν · γm,tI(x ∈Rm,t), where ν ∈(0, 1] acts similarly as the learning rate ρ from gradient descent, fixing the amount a given tree can learn from its previous errors. Putting all of these steps together gives us the algorithm for generalized boosting with decision trees (Algorithm 1). Algorithm 1 Generalized gradient tree boosting F0(x) = arg minγ Pn i=1 L(yi, γ) for t = 1 to T do ˜yi,t = −∂L ∂ˆy (yi, Ft−1(x)), i = 1, . . . , n Find regions R1,t, . . . , RM,t for decision tree with M terminal nodes using ˜yt = (˜y1,t, . . . , ˜yn,t) as the vector of response values γm,t = arg minγ P i,xi∈Rm,t L(yi, Ft−1(xi) + γ), for m = 1, . . . , M Ft(x) = Ft−1(x) + ν · PM m=1 γm,tI(x ∈Rm,t) end for This algorithm was first introduced by Friedman in 1999 and has since been used and studied heavily (Friedman, 2001). The original algorithm was somewhat computationally and memory intensive, depending on the selected decision tree. But over the years, several updates to the algorithm have been made. One of these updates is inspired by the work of Breiman on "bagging" (Breiman, 1996). Breiman proposed a mix of bagging and boosting in 1999 (Breiman, 1999). This would include a random sample of full training data as the starting point of the decision tree building and out-of-bag residuals for boosting steps. Friedman adopted these ideas and updated the gradient boosting algorithm (Algorithm 1) accordingly, calling the new algorithm "stochastic gradient boosting". The full algorithm is available in (Friedman, 2002). This allowed for a reduction in computation steps and memory usage proportional to the sampling fraction used. 21
2.2.3 XGBoost This subchapter is based on (Chen and Guestrin, 2016). A bigger set of upgrades to gradient boosting was introduced in (Chen and Guestrin, 2016), with the introduction of XGBoost. XGBoost can be considered an overhaul of the idea and algorithm proposed by Friedman. Chen and Guestrin modernised the method by optimising the tree-building process. The proposed upgrades stem from two sides, the underlying mathematical and algorithmic changes and hardware-related optimisation. The notation for this chapter closely follows the notation presented in the previous chapters, where Ft(x) is the prediction function for gradient tree boosting, cm is the prediction for the region Rm, L(y, ˆy) is the chosen loss function and M is the number of terminal nodes. A list of mathematical and algorithmic improvements proposed is given below. • The regularisation of the weak learner in the objective function. When building trees, in addition to a differentiable convex loss function L(y, ˆy), a penalty term is included. The objective function, in this case, is L(F(x)) = n X i=1 L(yi, ˆyi) + T X t=1 Ω(Ft), where ˆyi = F(xi) and the penalty function is defined as Ω(Ft) = κM + 1 2Λ M X m=1 ˆc2 m, with penalty parameters κ and Λ . • Using a Taylor series approximation for the objective function. The objective function for the tree for iteration t (denoted Tt) is defined as Lt = n X i=1 L(yi, Ft−1(xi) + Ft(xi)) + Ω(Ft), but a second-order Taylor approximation can be used to optimise the new objective quickly L(t) = n X i=1 [L(yi, Ft−1(xi)) + giFt(xi) + 1 2hiF 2 t (xi)] + Ω(Ft), where gi = ∂L ∂ˆy (yi, Ft−1(xi)) and hi = ∂2L (∂ˆy)2 (yi, Ft−1(xi)). (2.5) 22
Since L(yi, Ft−1(xi)) is a constant in terms of Ft(x), it can be omitted from the objective function. Additionally, substituting Ω(Ft) and rearranging the sums we then get ˜L(t) = M X m=1 X i,xi∈Rm,t gi ˆcm,t + 1 2 X i,xi∈Rm,t hi + Λ ˆc2 m,t + κM, where Rm,t is the m-th region from t-th iteration tree. From this, we can get that the optimal leaf prediction ˆc∗ m,t should approximately be ˆc∗ m,t ≈− X i,xi∈Rm,t gi X i,xi∈Rm,t hi + Λ . From this, a new scoring function for a tree Tt is proposed using the approximated objective function ˜L(t) and the optimal leaf prediction ˆc∗ m,t L(t)(Tt) = −1 2 M X m=1 X i,xi∈Rm,t gi 2 X i,xi∈Rm,t hi + Λ
- κM. To evaluate a split, a greedy algorithm starts from a single leaf and adds branches to the tree. The objective function for a split can be written as Lsplit = 1 2 X i,xi∈RR gi 2 X i,xi∈RR hi + Λ
X i,xi∈RL gi 2 X i,xi∈RL hi + Λ − X i,xi∈RR S RL gi 2 X i,xi∈RR S RL hi + Λ −κ, where RR and RL are the corresponding right and left regions for a considered split. (RR S RL) is then the original leaf the split was considered for. • Column subsampling is implemented. To combat overfitting, a given tree is allowed to choose only from a random subset of available columns to do a split with. This forces the model to sometimes build a tree with variables it would otherwise never use. • Approximate splitting algorithm. For generic gradient boosting, to find the best splitting value s for the independent variable Xj, all possible splits are considered. This 23
is a called an exact greedy algorithm, and it is computationally demanding as efficiency is tied to the available memory. Instead of the exact search, an approximation – percentiles of variable values – of the possible candidate values is produced and evaluated. XGBoost has two variants of this algorithm: the global variant makes all of the candidate splits at the start of fitting and the local variant that re-proposes the candidates after each split. • A weighted quantile sketch is used. This allows the subsetting of the independent variable values using the second-order gradient values hi from Formula (2.5) for each observation. Let Dj = {(x1,j, h1), (x2,j, h2), . . . , (xn,j, hn)}. This denotes the pair of values for the variable Xj and second-order gradient hi in the training set. Now, define rj(z) = 1 X i,(xi,j,hi)∈Dj hi X i,(xi,j,hi)∈Dj,x<z hi, which represents the proportion of the sum of second-order gradients hi for instances where the value of variable Xj is less than z. The aim is to find such candidates (sj,1, sj,2, . . . , sj,l) that satisfy |rj(sj,k) −rj(sj,k+1)| < ε, ∀k = 1, . . . , l −1, where 1 ε ≈l (maximum number of splits considered). Note that sj,1 = min i xi,j and sj,l = max i xi,j. This complements the objective function since the values of hi can then be considered as weights ˜L(t) = n X i=1 1 2hi Ft(xi) −gi hi 2
- Ω(Ft) + constant, giving us the weighted squared loss with the label gi hi and weights hi • The fitting procedure is sparsity aware. Real-world data is often sparse due to missing data, frequent zero entries or variable encoding. This kind of missing data is not useful when building trees but still has to be assigned to a node in a tree. To do this, a default direction for sparse data is used where the optimal direction is learnt from the data. In addition to the mathematical and algorithmic improvements, some hardware optimisations are also implemented. • Column blocks are used. Column value sorting and learning is one of the most memory- intense parts of the algorithm. For an exact greedy search, all values of the column have 24
to be read in memory, potentially requiring a lot of memory in the case of large datasets. To alleviate this, the column data is stored and processed in "blocks" since the augmented weighted quantile function allows for the processing of the column in parts. The usage of column blocks also allows for parallel computation of these blocks, speeding up the search for optimal splits. • The algorithm is cache-aware. The column blocks allowed for parallel computation for split finding, but this results in non-continuous memory access to retrieve row-specific gradient statistics. Non-continuous memory access slows down the program since the read/write cycle of other computations would need to be queued. The solution to this is easy – using an appropriately sized column block that allows the gradient statistics to be fit in the CPU cache. • Blocks for Out-of-core computations to effectively use the machine’s resources. A block reading and writing system is used when the data does not fit into the main memory. All the data is stored on disk in blocks and accessed sequentially. This is, however, limited by the disk read and write speeds. To speed this up, the stored data blocks are compressed and decompressed on the fly, exchanging the disk usage for CPU usage. Additionally, the data blocks are stored on multiple disks (where available), and different threads of the CPU are assigned to access the data allowing for more streamlined read-and-write sequencing. All of these augmentations combine to make the methods an order of magnitude faster, allow- ing for excellent out-of-core and distributed system performance with basically no predictive performance drawbacks. 25