these two models is 334.6, which is quite a significant decrease. It is interesting that the rules with some linear terms beat the piecewise constant fit. Rules are able to describe deeper interactions, but at the cost of losing sight of simple constant or non-linear relations.

It is important to note that the best model based on deviance is not the best model based on AIC. The surrogate model without interactions was just barely able to beat the rule fit model with the minimum parameter based on test deviance. It might be that the interactions were less present in the test data, but in the training data it is clear that interactions are important and thus should be included if the best fit based on likelihood is desired.

Comparing the two approaches of gathering insight, we can see that maidrr is able to produce a more stable model based on purely categorical variables. However, the interpretation of interactions is much better for rule fit models. Rules give a concrete idea of how a combination of different variables affects the claim frequency, while with surrogates one has to be very precise when interpreting the interaction for a particular observation.

Overall, it appears that we were able to gather some insight from the way machine learning models use variables and to produce meaningful and, most importantly, interpretable augmentations for industry-standard models.

5.4 Discussion

This subchapter gives a general discussion of machine learning modelling and of the results of this thesis. It is based on the author's experience working with this data and the problem at hand.

To start with, all of this work has been plagued by long waiting times. Starting out, I knew machine learning would take considerably more time than classic statistical models like GLM; however, I severely underestimated the time required to run some of the procedures and methods.
As evident in the method and model descriptions, most of the algorithms are quite computationally intense (at least O(n^2) for most), and it does not help that their implementations seldom allow for parallelisation and run on a single core of the machine by default. With my limited knowledge and skills, I was, however, able to improve some of the underlying code of the maidrr package, which allowed me to run some of the code in parallel. This shows that working with machine learning requires additional skills in writing code and in understanding the underlying structure of the method being used.

Continuing on the topic of computational difficulties, I was able to leverage the high-performance computing cluster available to the University of Tartu, which allowed me to run several model training sessions at the same time (University of Tartu, 2018). The cluster also made it easier to run the parallel computations, since parallelisation in Windows (Bott and Stinson, 2019) is not allowed by default. Overall, even though all of the code could be run on a single core on a machine with only 16 GB of RAM, it would have taken significantly longer, since every job would need to be carefully planned and executed.

In general, working with machine learning on insurance data has proven difficult, since most of the methods and implementations I found did not have the capabilities needed to model insurance data. They did not allow for the use of a Poisson or custom objective (loss) function; thus, count or claim frequency modelling would not be optimal. The same goes for claim amount modelling using Gamma or other heavy-tailed distributions. However, simply having a custom or Poisson objective was also not enough, since modelling claim frequency requires either an offset for the response (given the right link between response and prediction function) or the ability to use weights for model fitting. Implementations meeting both requirements (a suitable objective together with support for an offset or weights) proved to be very rare. In my experience, working in R helped with this, since more implementations meeting both requirements were available (compared to Python libraries).

This shows that machine learning, although powerful in cases where a normal or binomial distribution is used, is still not mature enough to be freely applied to insurance problems. There were no out-of-the-box solutions, and every approach and method needed a lot of additional work and prior knowledge before it could be used to predict insurance data.

Regarding the results of the applied methods, I feel quite hopeful. I was able to discover groupings similar to those presently used for ratemaking variables at the company, and I did this in a few months, while they have been developing their pricing structure for years.
This is one of the key developments I achieved with maidrr and presented to the pricing unit at If.

In addition to the grouping structure, If was also very interested in the rules found through rule ensembles. Rules allow us to find a segment of the data that has historically performed abnormally compared to the rest of the observations, thus giving a reason to put additional restrictions in place when pricing this part of the population. With additional analysis, these combinations of variables might allow an insurer to gain an edge in insurance pricing in a market where all of the data is shared. Something different and more advanced has to be done to gain an edge in such a competitive environment, and my supervisor at If and I feel that this might be one of these things.

Lastly, I would like to comment that this research and thesis is by no means perfect, but it is a proof of concept that showcases what possible alternatives and additions machine learning can provide in the age of computing. It is clear that the machine learning field will keep on evolving.
However, I do not see the requirement of model and result interpretability being lifted any time soon in the insurance field. Thus, the need for tools that make machine learning interpretable, and for interpretable machine learning in general, will only grow over time. Machine learning is the future, but interpretable machine learning will help us to get there and understand it.

I would like to finish this discussion with a quote by Christoph Molnar:

"When opaque machine learning models are used in research, scientific findings remain completely hidden if the model only gives predictions without explanations. To facilitate learning and satisfy curiosity as to why certain predictions or behaviours are created by machines, interpretability and explanations are crucial. Of course, humans do not need explanations for everything that happens. For most people, it is okay that they do not understand how a computer works. Unexpected events make us curious." (Molnar, 2022).
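As an illustration of the offset and weights requirement described in this discussion, the following minimal sketch fits a Poisson claim frequency model in two ways: on claim counts with log-exposure as an offset, and on observed frequencies (claims divided by exposure) with exposure as weights. It is written in Python purely for illustration (the work in this thesis was done in R), the data are made up, and the function `fit_poisson` and its arguments are hypothetical names rather than part of any package. The model is deliberately tiny (an intercept and one binary feature, fitted with Newton's method) so that the result can be checked by hand.

```python
import math


def fit_poisson(claims, exposure, x, mode, n_iter=25):
    """Fit log(rate) = b0 + b1 * x by Newton's method (illustrative only).

    mode="offset":  response = claim count, offset = log(exposure), weight = 1
    mode="weights": response = claims / exposure, no offset, weight = exposure
    """
    if mode == "offset":
        y = [float(n) for n in claims]
        offset = [math.log(e) for e in exposure]
        w = [1.0] * len(claims)
    elif mode == "weights":
        y = [n / e for n, e in zip(claims, exposure)]
        offset = [0.0] * len(claims)
        w = [float(e) for e in exposure]
    else:
        raise ValueError("mode must be 'offset' or 'weights'")

    # Start from the overall claim frequency, which keeps Newton steps stable.
    b0 = math.log(sum(claims) / sum(exposure))
    b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, oi, wi, xi in zip(y, offset, w, x):
            mu = math.exp(oi + b0 + b1 * xi)  # fitted mean on the response scale
            r = wi * (yi - mu)                # weighted score contribution
            g0 += r
            g1 += r * xi
            h00 += wi * mu                    # negative Hessian entries
            h01 += wi * mu * xi
            h11 += wi * mu * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = g explicitly.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
    return b0, b1


# Made-up policies: claim counts, exposure in years, one binary risk factor.
claims = [0, 1, 0, 2, 1, 0]
exposure = [0.5, 1.0, 0.25, 1.0, 0.75, 1.0]
x = [0, 0, 1, 1, 1, 0]

beta_offset = fit_poisson(claims, exposure, x, mode="offset")
beta_weights = fit_poisson(claims, exposure, x, mode="weights")
print("offset formulation: ", beta_offset)
print("weights formulation:", beta_weights)
```

Both formulations lead to the same coefficients, because their log-likelihoods differ only by a term that does not depend on the coefficients. In this toy example the fitted group frequencies are simply total claims divided by total exposure in each group (1/2.5 for x = 0 and 3/2.0 for x = 1). An implementation that supports neither an offset nor weights cannot account for unequal exposure in this way.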
Conclusion

The purpose of this thesis was to introduce and showcase two ways to extract insight from machine learning models trained to evaluate risk in insurance pricing. To do this, the first two chapters of the thesis focused on introducing the current main statistical model, the generalized linear model, and on giving an overview of decision trees and tree-based boosting ensembles such as the gradient boosting machine and XGBoost. In Chapter 3, model metrics and insight statistics were discussed. After that, model-agnostic data-driven surrogate models (maidrr) and rule ensemble methods were introduced and explained.

In the last two chapters, all models and methods were applied to motor third party liability data from Latvia. The models were trained on the training split of the data, the resulting models were briefly explained, and their performance was assessed using the testing split of the data. First, four models were fit to the training data: a GLM based on a step-wise AIC search and the AIC search model with additional polynomial terms (the baseline models), and a gradient boosted machine and XGBoost as the machine learning models. The resulting machine learning models were better than the baseline models in terms of test set deviance. This gave reason to extract insight from these machine learning models, and an additional six models were proposed: a surrogate model, a surrogate model without interactions, a grouping augmented AIC search model, a grouping and spline augmented AIC search model, and two rule ensemble models with different penalty parameters.

All of the proposed machine learning insight augmented models ended up being better than or comparable in accuracy to the baseline generalized linear models. Based on likelihood metrics, the proposed augmentations proved to produce models that are better able to capture the likelihood of the data.
Some strange behaviour was also observed, where some augmentations performed better on the test set than the original machine learning models. However, both ways of using machine learning models proved to be useful in different cases: maidrr is good for feature selection and for grouping these features, and rule ensembles are good for finding combinations of variables that can be priced differently.

This thesis was able to showcase two possible ways in which machine learning could be used to augment current ratemaking practices. Both of the methods show potential in their respective strengths: feature grouping for maidrr and combination search for rule ensembles. In the case of maidrr, the approach can be used with any machine learning model, and thus advancements in machine learning models applicable to the insurance field will also improve the surrogate model that can be derived from them. Rule ensembles help to gather ideas for further analysis and strategies.
References

1.10. Decision Trees (Jan. 2023). [Online; accessed 12. Jan. 2023]. url: https://scikit-learn.org/stable/modules/tree.html.

Apache Software Foundation (July 20, 2018). Hadoop. Version 2.7.7. url: https://hadoop.apache.org.

– (June 18, 2020). Spark. Version 2.3.4, 3.0.3. url: https://spark.apache.org.

Bott, Ed and Craig Stinson (2019). Windows 10 inside out. Microsoft Press.

Breiman, Leo (Aug. 1996). “Bagging predictors”. In: Mach. Learn. 24.2, pp. 123–140. issn: 1573-0565. doi: 10.1007/BF00058655.

– (Feb. 1999). Using Adaptive Bagging to Debias Regressions. Technical Report 547. Berkeley, CA 94720: University of California at Berkeley, Statistics Department.

Chen, Tianqi and Carlos Guestrin (Mar. 2016). “XGBoost: A Scalable Tree Boosting System”. In: arXiv. doi: 10.1145/2939672.2939785. eprint: 1603.02754.

Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, Mu Li, Junyuan Xie, Min Lin, Yifeng Geng, Yutian Li, and Jiaming Yuan (2022). xgboost: Extreme Gradient Boosting. R package version 1.6.0.1. url: https://CRAN.R-project.org/package=xgboost.

de Jong, Piet and Gillian Z. Heller (Feb. 2008). Generalized Linear Models for Insurance Data. Cambridge, England, UK: Cambridge University Press. isbn: 978-0-52187914-9. url: https://www.cambridge.org/lv/academic/subjects/statistics-probability/statistics-econometrics-finance-and-insurance/generalized-linear-models-insurance-data?format=HB&isbn=9780521879149.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2010). “Regularization Paths for Generalized Linear Models via Coordinate Descent”. In: Journal of Statistical Software 33.1, pp. 1–22. doi: 10.18637/jss.v033.i01. url: https://www.jstatsoft.org/v33/i01/.

Friedman, Jerome H. (2001). “Greedy function approximation: A gradient boosting machine.” In: The Annals of Statistics 29.5, pp. 1189–1232. doi: 10.1214/aos/1013203451. url: https://doi.org/10.1214/aos/1013203451.

– (2002). “Stochastic gradient boosting”. In: Computational Statistics & Data Analysis 38.4. Nonlinear Methods and Data Mining, pp. 367–378. issn: 0167-9473. doi: 10.1016/S0167-9473(01)00065-2. url: https://www.sciencedirect.com/science/article/pii/S0167947301000652.

Friedman, Jerome H. and Bogdan E. Popescu (Oct. 2003). “Importance Sampled Learning Ensembles”. In: ResearchGate. url: https://www.researchgate.net/publication/2888930_Importance_Sampled_Learning_Ensembles.

– (Sept. 2008). “Predictive learning via rule ensembles”. In: The Annals of Applied Statistics 2.3. doi: 10.1214/07-aoas148. url: https://doi.org/10.1214%2F07-aoas148.

Greenwell, Brandon, Bradley Boehmke, Jay Cunningham, and GBM Developers (2022). gbm: Generalized Boosted Regression Models. R package version 2.1.8.1. url: https://CRAN.R-project.org/package=gbm.

Hardin, James W. and Joseph M. Hilbe (Apr. 2018). Generalized Linear Models and Extensions: Fourth Edition. Stata Press. isbn: 978-1-59718225-6. url: https://www.amazon.com/Generalized-Linear-Models-Extensions-Fourth/dp/1597182257.

Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY, USA: Springer. isbn: 978-0-38784884-6. url: https://hastie.su.domains/ElemStatLearn/.

Henckaerts, Roel (2020). maidrr: Model-Agnostic Interpretable Data-driven suRRogate. https://henckr.github.io/maidrr/, https://github.com/henckr/maidrr.

Henckaerts, Roel, Katrien Antonio, Maxime Clijsters, and Roel Verbelen (Jan. 2017). “A Data Driven Binning Strategy for the Construction of Insurance Tariff Classes”. In: SSRN Electronic Journal. issn: 1556-5068. doi: 10.2139/ssrn.3052174.

Henckaerts, Roel, Katrien Antonio, and Marie-Pier Côté (2020). When stakes are high: balancing accuracy and transparency with Model-Agnostic Interpretable Data-driven suRRogates. doi: 10.48550/ARXIV.2007.06894. url: https://arxiv.org/abs/2007.06894.

Henckaerts, Roel, Marie-Pier Côté, Katrien Antonio, and Roel Verbelen (Apr. 2019). “Boosting insights in insurance tariff plans with tree-based machine learning methods”. In: arXiv. doi: 10.48550/arXiv.1904.10890. eprint: 1904.10890.

Holub, Karl (2022). xrf: eXtreme RuleFit. R package version 0.2.2. url: https://CRAN.R-project.org/package=xrf.

Kuhn, Max and Hannah Frick (2022). dials: Tools for Creating Tuning Parameter Values. R package version 1.1.0. url: https://CRAN.R-project.org/package=dials.

Lamport, Leslie (1994). LaTeX: a Document Preparation System. 2nd ed. Massachusetts: Addison Wesley.

Luraschi, Javier, Kevin Kuo, Kevin Ushey, JJ Allaire, Hossein Falaki, Lu Wang, Andy Zhang, Yitao Li, Edgar Ruiz, and The Apache Software Foundation (2022). sparklyr: R Interface to Apache Spark. R package version 1.7.8. url: https://CRAN.R-project.org/package=sparklyr.

Molnar, Christoph (2022). Interpretable Machine Learning. A Guide for Making Black Box Models Explainable. 2nd ed. url: https://christophm.github.io/interpretable-ml-book.

Murdoch, W. James, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu (Oct. 2019). “Definitions, methods, and applications in interpretable machine learning”. In: Proc. Natl. Acad. Sci. U.S.A. 116.44, pp. 22071–22080. doi: 10.1073/pnas.1900654116.

Nelder, J. A. and R. W. Wedderburn (1972). “Generalized linear models”. In: Journal of the Royal Statistical Society. Series A (General) 135.3, pp. 370–384. doi: 10.2307/2344614.

R Core Team (2022). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. url: https://www.R-project.org/.

Tibshirani, Robert (1996). “Regression Shrinkage and Selection via the Lasso”. In: Journal of the Royal Statistical Society. Series B (Methodological) 58.1, pp. 267–288. issn: 00359246. url: http://www.jstor.org/stable/2346178 (visited on 01/31/2023).

University of Tartu (2018). UT Rocket. doi: 10.23673/PH6N-0144.

Valecký, Jiří (May 2016). “Modelling Claim Frequency in Vehicle Insurance”. In: Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis 64.2, pp. 683–689. issn: 1211-8516. doi: 10.11118/actaun201664020683.

Wang, Haizhou and Mingzhou Song (Dec. 2011). “Ckmeans.1d.dp: Optimal k-means Clustering in One Dimension by Dynamic Programming”. In: R Journal 3.2, pp. 29–33. issn: 2073-4859. doi: 10.32614/RJ-2011-015.