FIRM-LEVEL%20PREDICTORS%20OF%20LABOUR%20TAX%20EVASION.%20Alice%20Mikk.pdf - Page 2

age in years was calculated by subtracting date of birth from year end date 3;
work experience in years was calculated by subtracting the date of employment from year end date.

Step 1) was necessary due to the nature of monthly wage data. This calculation smooths the fluctuations in monthly salary compared to using only the working time rate adjusted wage from one particular month, as the selected month could have employees receiving an abnormally high salary due to bonuses or receiving no salary at all. Incorporating a month with extremely high salaries due to bonuses would incorrectly reflect a person’s income and contribute to upward bias in the results. For example, year-end or Christmas bonuses are common among businesses. On the other hand, if a month is chosen in which individuals report no administrative salary, those individuals are omitted from the analysis. This could happen, for example, with many construction workers, as the execution of various construction stages is highly seasonal.

Therefore, in order to take into account all employees in the dataset without their average wage being dependent on how many months they were engaged in employment, the annual summarised payments were divided by the number of months engaged in work. Otherwise, if divided by twelve months for all employees, the monthly average salary would be underestimated for those who were not engaged in the labour market for one or more months of the year. Calculating the average working time rate adjusted monthly salary based on annual income eliminates these issues.
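The averaging described above can be illustrated with a minimal sketch. The function name and the example amounts are hypothetical and only show why dividing by the months worked, rather than by twelve, avoids underestimating the wage.

```python
def average_monthly_wage(annual_payments, months_worked):
    """Average monthly wage over the months actually worked.

    annual_payments: summed working time rate adjusted payments for one
        employer-employee pair over the year (hypothetical input).
    months_worked: number of months with employment in that year (1 to 12).
    """
    if not 1 <= months_worked <= 12:
        raise ValueError("months_worked must be between 1 and 12")
    return annual_payments / months_worked


# Illustrative employee who earned 9 000 euros during 6 months of work
by_months_worked = average_monthly_wage(9000, 6)  # 1500.0
by_calendar_year = 9000 / 12                      # 750.0, understates the wage
```

Dividing by twelve halves the measured wage of this illustrative employee, which is the underestimation the paragraph above refers to.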

The calculation was done separately for different employers, thereby accounting for individuals engaged in several employment contracts during a year. As presented in table 2, individuals were mainly working for one employer only, but there were also individuals who worked for 2-8 different employers during a year. These individuals are not omitted from the analysis, but the wage calculations take into account their rather “jumpy employment”, as the main focus is on the labour tax evasion of the employer. Therefore, including only one employment and omitting others would not be reasonable and could potentially exclude tax evading employers from the analysis.

3 Year end date refers to the year for which the regression analysis was carried out (Dec. 31, 2021 or 2022).

Table 2. Unique employer-employee combinations

Year   Number of unique employer-employee pairs   Number of unique employees   Number of employees employed by several employers
2021   316 679                                    289 502                      27 177
2022   318 542                                    292 305                      26 237

Source: Author’s calculations
Note: The total number of unique employer-employee pairs in the dataset presents the observations from the four NACE sectors selected for analysis purposes.

The age calculation in step 2) resulted in values ranging from 11 to 91 years in the dataset. Only individuals aged 16-65 years are kept in the analysis. Individuals younger than 16 are omitted because they are still subject to the mandatory minimum general education requirement, and individuals older than 65 are omitted due to the old-age pension. Reaching the retirement age does not automatically result in complete exclusion from the labour market, but rather in part-time participation or irregular activities. The workforce past the state pension age works fewer hours than younger workers, and the gap in hours is greater for men than for women (Smeaton & McKay, 2003). The Estonian labour market policies encourage the elderly to stay longer in paid employment, increasing labour market participation after they reach the state pension age. A total of 18 066 individuals were omitted in 2021 and 14 752 individuals were omitted in 2022, mainly elderly.

Only full-time employees according to working time rate were included in the analysis. However, this does not account for individuals who may have worked for only part of the month (for example, quit work in the middle of a month), which is also not visible in the data source. One indicator suggesting a shorter span of work is receiving a working time adjusted monthly salary lower than the official minimum wage. Additionally, receiving markedly less than the minimum wage could point to data irregularities or errors. To omit individuals whose salary is downward biased due to their irregular inclusion in employment during one or several months, all individuals receiving an average working time rate adjusted monthly salary less than the statutory minimum wage for that year (584 euros for 2021 and 654 euros for 2022) are omitted. As a result, 31 927 individuals were omitted in 2021 and 35 330 individuals were omitted in 2022.
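The age and minimum wage filters can be sketched as a single predicate. This is an illustration only: the function and constant names are hypothetical, and the thresholds are the ones quoted in the text (ages 16-65, 584 euros for 2021 and 654 euros for 2022).

```python
# Monthly minimum wage in euros, as quoted in the text
MINIMUM_WAGE = {2021: 584, 2022: 654}


def keep_observation(age, avg_monthly_wage, year):
    """True if an observation passes the age and minimum wage filters.

    Observations with a wage strictly below the minimum wage are dropped,
    so a wage exactly equal to the minimum wage is kept.
    """
    in_age_range = 16 <= age <= 65
    at_least_minimum = avg_monthly_wage >= MINIMUM_WAGE[year]
    return in_age_range and at_least_minimum


observations = [
    {"age": 30, "wage": 1200.0, "year": 2021},  # kept
    {"age": 67, "wage": 1500.0, "year": 2021},  # dropped: older than 65
    {"age": 40, "wage": 500.0, "year": 2022},   # dropped: below minimum wage
]
kept = [o for o in observations if keep_observation(o["age"], o["wage"], o["year"])]
```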

It is important to note that this exclusion could possibly omit individuals who have received benefits for temporary work incapacity from the Social Insurance Board, such as sickness benefit or care benefit. Individuals earning the minimum wage are more sensitive to a decrease in income, as experiencing a loss of income in even one month of the period under investigation would shift them to earning less than the statutory minimum wage.

The calculation in step 3) makes an exception for employees whose employment contract ended during the observed timespan: in this case, the work experience in years was calculated by subtracting the date of employment from the end date of employment. In other cases, the experience is calculated by subtracting the date of employment from the end of the year for which the analysis was carried out. The experience is rounded down to a full year.
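A minimal sketch of the experience calculation follows. It assumes that "rounded down to a full year" means counting completed calendar years between two dates; the function names and example dates are hypothetical, and the exact day-count convention used in the thesis is not stated.

```python
from datetime import date


def full_years_between(start, end):
    """Completed years from start to end, rounded down."""
    years = end.year - start.year
    # Subtract one year if the anniversary has not yet been reached
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


def experience_years(employment_start, year_end, employment_end=None):
    """Experience up to the contract end if it ended within the year,
    otherwise up to the year end date."""
    if employment_end is not None and employment_end <= year_end:
        reference = employment_end
    else:
        reference = year_end
    return full_years_between(employment_start, reference)


# Contract still running at the end of 2021
ongoing = experience_years(date(2015, 3, 1), date(2021, 12, 31))
# Contract ended in February 2021, before the seventh anniversary
ended = experience_years(date(2015, 3, 1), date(2021, 12, 31), date(2021, 2, 15))
```

The same `full_years_between` helper would apply to age in step 2), with the date of birth as the start date.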

The variable for occupation is not available for all employees in TÖR; the missing observations are, however, only a small proportion of all observations, amounting to 982 observations in 2021 and 676 observations in 2022. Due to the low magnitude, the observations with missing occupation were dropped. Moreover, data on education matched from the population dataset is not available for all observations: 72 910 observations were missing for 2021 and 81 212 observations were missing for 2022, making up approximately a quarter of the observations.

To understand whether the issue of missing educational attainment is systematic or random, observations with missing values were further investigated. The average salary of the individuals with missing educational attainment in 2021 was 1368 euros and the median salary amounted to 1193 euros. In comparison, in the dataset including observations both with and without the educational attainment variable, the average salary was 1445 euros and the median salary was 1213 euros. The shares of the different occupation, region, NACE and gender categories follow the distribution of the initial dataset. Therefore, omitting observations with missing educational attainment should not severely affect the results of the analysis. However, robustness checks are carried out to confirm that the results are not compromised by omitting observations with missing education.

Table 3 presents the number of unique employer-employee pairs and the observations omitted at each step, as well as the final sample used for thesis purposes.

Table 3. Observations dropped and final sample

Year   Number of unique employer-employee pairs   Younger than 16 or older than 65   Wage below minimum wage   No occupation   Armed forces   No educational attainment   Final sample size
2021   316 679                                    18 066                             31 927                    982             3              72 910                      192 791
2022   318 542                                    14 752                             35 330                    676             0              81 212                      186 572

Source: Author’s calculations
Note: The monthly minimum wage was 584 euros for 2021 and 654 euros for 2022.

The final sample includes 192 791 observations for 2021 and 186 572 observations for 2022, so a total of 123 888 observations were dropped for 2021 and 131 970 observations were dropped for 2022. The descriptive statistics of the final sample variables are included in table 4.
Table 4. Descriptive statistics

              2021                               2022
Variable      mean    st.dev   min    max        mean    st.dev   min    max
Wage          7.16    0.50     6.37   10.89      7.27    0.49     6.48   11.20
Woman         0.40    -        0      1          0.40    -        0      1
Age           46.80   10.20    17     65         46.60   10.30    16     65
Education     -       -        1      4          -       -        1      4
Experience    8.20    6.67     0      40         7.60    6.86     0      40
NACE          -       -        1      4          -       -        1      4
Occupation    -       -        1      8          -       -        1      8
Region        -       -        1      5          -       -        1      5

Source: Author’s calculations
Note: NACE, Statistical Classification of Economic Activities.

The mean wage is higher in 2022 compared to 2021, which is expected as the average salary has increased. The mean of 0.40 for gender shows that there are slightly more men in the dataset than women. The average age of the individuals is 46.8 years for 2021 and 46.6 years for 2022. The average experience of the individuals is 8.2 years in 2021 and 7.6 years in 2022. Regarding categorical variables, the distribution across categories is shown in Appendix 1.

2.1.2. Annual report data

Because the wage data are matched employer-employee data, meaning that the registry code of the employer is available for each employee, they can later be linked to various firm-level financial and non-financial variables from the annual reports. For each employer, the balance sheet and profit and loss statement as well as annexes are available. Therefore, different balance sheet and profit and loss statement values are retrieved, such as assets, liabilities, revenues, expenditures and profit. In addition, a more detailed allocation of balance sheet and profit and loss statement items in the annexes, the average number of employees reduced to full-time, as well as NACE are available.

Following previous studies on tax evasion and financial fraud by Hajek & Henriques (2017), Gavoille & Zasova (2023) and Benkovskis & Fadejeva (2022), different ratios are calculated using various annual report items. Preprocessing of the data also included removing observations with missing values, as not all companies have all balance sheet or profit and loss statement values available. Benkovskis & Fadejeva (2022) keep the set of independent variables short and include ratios based on the most common financial indicators. The same approach is followed in this thesis, as the aim is to also investigate tax evasion in micro enterprises. Due to their simplified reporting, few financial indicators are included in their annual reports, and including very specific indicators in the analysis could result in the exclusion of a large proportion of small firms, as only observations where all the variables included in the final model were present were kept. The number of unique firms in the dataset, the omitted observations and the final dataset are presented in table 5.

Table 5. Observations dropped and final sample size

Year   Number of unique firms   No employees   No profit or turnover   No cash or assets   No liabilities   No COGS   Infinite values   Final sample
2021   65 187                   32 255         139                     5 673               242              2 553     534               23 791
2022   66 207                   32 721         152                     5 783               202              2 519     516               24 314

Source: Author’s calculations
Notes:

Missing values in assets include both total assets and current assets.
Missing values in liabilities include both total liabilities and short-term liabilities.
COGS, cost of goods sold.
Infinite values arose from division by zero.

Approximately half of the companies in the annual report dataset reported having zero employees. Looking into these firms, several patterns can be detected. Some of the firms could have been inactive during the year, as turnover is zero or relatively small. Others seem to have zero employees due to error or potential engagement in labour tax evasion at the extensive margin (unreported employees): given their field of activity and the turnover arising from their activities, these firms would be expected to have employees on their payroll. The companies for which the number of employees is zero are cross-checked with data on administrative wages to see how many employees received a reported salary in the corresponding year. As a result, 893 firms in 2021 (868 in 2022) that show zero employees in the annual report data paid and reported salaries to employees, while 32 255 firms (32 721 in 2022) did not report any salary payments. For the firms for which the zero number of employees was erroneous, the number of different employees who received a salary according to the wage data is used as a proxy. The firms that did not report any salary payments are omitted from the analysis, as the aim is to investigate labour tax evasion on the intensive margin and not on the extensive margin. The same approach is applied to the data from 2022.
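The cross-check between the annual report data and the wage data can be sketched as follows. The function name, arguments and identifiers are hypothetical; the logic only mirrors the rule described above (use the distinct number of paid employees as a proxy, drop firms with no reported salary payments).

```python
def resolve_employee_count(reported_employees, paid_employee_ids):
    """Employee count to use for a firm, or None if the firm is dropped.

    reported_employees: number of employees from the annual report.
    paid_employee_ids: identifiers of employees with a reported salary
        in the wage data for the same year (may contain repeats).
    """
    if reported_employees > 0:
        return reported_employees
    # Zero employees reported: fall back to distinct paid employees
    distinct_paid = len(set(paid_employee_ids))
    if distinct_paid > 0:
        return distinct_paid
    return None  # no salary payments reported, firm is omitted


proxy = resolve_employee_count(0, ["e1", "e2", "e1"])  # two distinct employees
dropped = resolve_employee_count(0, [])                # firm omitted
```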

Additionally, some ratio calculations produced infinite values (division by zero). Cecchini et al. (2010) tackle the issue of division by zero by replacing zero values with 0.001; however, this should be approached with caution and the effect on the analysis results should be clear. Therefore, to avoid biases and an unclear effect on interpretation, the infinite values were omitted from the analysis. The descriptive statistics of the final sample variables are included in table 6.
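How a zero denominator turns into an infinite ratio, and how such observations can be filtered out, is sketched below. The helper name and the example firms are made up; treating 0/0 as "not a number" is an assumption of this sketch and is dropped by the same finiteness check.

```python
import math


def ratio(numerator, denominator):
    """Financial ratio that is infinite (or NaN for 0/0) when the denominator is zero."""
    if denominator == 0:
        if numerator == 0:
            return math.nan
        return math.copysign(math.inf, numerator)
    return numerator / denominator


# Made-up firms for illustration only
firms = [
    {"id": "A", "total_liabilities": 50.0, "total_assets": 100.0},
    {"id": "B", "total_liabilities": 20.0, "total_assets": 0.0},
]

# Keep only firms whose debt to assets ratio is a finite number
kept_ids = [
    f["id"]
    for f in firms
    if math.isfinite(ratio(f["total_liabilities"], f["total_assets"]))
]
```

Replacing zero denominators with 0.001, as in the alternative mentioned above, would instead keep firm B with a ratio of 20 000, which illustrates why the effect of such a substitution on the results needs care.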
Table 6. Descriptive statistics

                                    2021                                    2022
Variable                            Mean    St.dev   Min     Max            Mean    St.dev   Min     Max
NACE                                -       -        1       4              -       -        1       4
Size                                8.30    42.74    1       3 115          8.20    42.50    1       3 010
Log of turnover                     12.39   1.77     3.85    20.22          12.47   1.77     4.09    20.79
Debt to assets                      1.08    51.65    0       7 661.33       0.65    10.70    0       1 377.00
Short-term debt to current assets   2.29    66.38    0       6 663.40       1.68    39.80    0       4 711.00
Cash to assets                      0.28    0.27     -2.33   1.00           0.28    0.28     -0.06   4.26
Turnover to assets                  3.25    34.70    0       4 404.00       3.57    35.40    0       3 270.00
COGS to turnover                    0.66    6.34     0       926.79         0.65    4.83     0       458.56

Source: Author’s calculations
Notes:

NACE, Statistical Classification of Economic Activities.
COGS, cost of goods sold.

The descriptive statistics show that the enterprise data exhibit a lot of variance. The mean size of a company is 8.30 employees in 2021 and 8.20 employees in 2022, showing that the number of employees has slightly decreased. The mean of log turnover increased from 2021 to 2022. Debt to assets or short-term debt to current assets is extremely high, for example, where a short-term loan liability has been taken on but the firm owns very few assets. The same situation arises regarding turnover to assets. The cash to assets ratio is negative where the firm has a credit account in use. The COGS to turnover ratio is high when a company spends more on intermediate consumption to provide products and services than it receives in taxable revenue. It is important to note that turnover comprises the sales revenue of a company only, and a high ratio could be due to the fact that the company has reported other revenue. Regarding the categorical variable NACE, the number of observations falling into each category is presented in table 7.

Table 7. Number of observations per category for NACE

NACE                                                                    2021     2022
Manufacturing                                                           4 375    4 421
Construction                                                            7 066    7 479
Wholesale and retail trade; repair of motor vehicles and motorcycles    9 116    9 151
Transportation and storage                                              3 234    3 263
Total                                                                   23 791   24 314

Source: Author’s calculations

The number of observations falling into each NACE category is relatively similar for both 2021 and 2022. Approximately 38% of firms are active in wholesale and retail trade, 30% in construction, 18% in manufacturing and 14% in transportation and storage.
2.2. Methodology

This section gives an overview of the methodology used in the empirical analysis to obtain subsets of tax evading and tax compliant firms as well as to analyse the firm-level predictors contributing to the probability of being engaged in labour tax evasion. The methodology of the empirical analysis is based on previous research in this field, implementing Ordinary Least Squares (hereinafter OLS) and logistic regression. In subsection 2.2.1, the methodology to obtain samples of tax compliant or tax evading firms is described in detail. In subsection 2.2.2, the methodology to model the relationship between firm-level financial and non-financial variables and labour tax evasion is explained. What is more, classification of the remaining firms as tax evading or compliant is discussed.

2.2.1. Obtaining subsets of tax evading and tax compliant firms

Defining the treated group of firms could be approached differently. Kukk & Staehr (2014) find that households with business income underreport 62% of their actual total income, and Kukk et al. (2019) find income underreporting to be more than 40% of self-employed household income on average. Braguinsky & Mityakov (2015) show that small enterprises are more eager to engage in tax evasion. Gavoille & Zasova (2023) and Benkovskis & Fadejeva (2022) apply wage regression to pinpoint firms with “suspiciously low wages”.

To obtain a subset of firms classified as tax evading, the Mincer (1975) wage regression model is employed as a starting point in developing an empirical earnings model, following Gavoille & Zasova (2023) and Benkovskis & Fadejeva (2022). The Mincer wage regression model is an extensively used model employing the OLS method to analyse the relationship between an individual’s earnings and various human capital variables, primarily education and experience. The natural logarithm of earnings is used to address issues related to the distribution of earnings, and the model aims to estimate how changes in education, experience and other individual characteristics affect the individual’s earnings. To spot firms paying suspiciously low wages to their employees, Gavoille & Zasova (2023) regress the log of wage for an individual employee against characteristics such as age, experience, gender, field of activity, location of workplace, occupation and educational attainment. They consider employees in the bottom 10% of the residual distribution as receiving abnormally low wages, as the wage regression predicts a significantly higher salary taking into account the individual characteristics of the employee. They classify firms with at least one employee in the bottom of the residual distribution in a given year as tax evaders.

However, as the size of a company, and therefore the number of individuals it employs, varies a lot, the bottom 10% of a residual distribution should be approached with caution. For example, if one employee from a company with more than 100 employees falls in the bottom 10% of the residual distribution, the firm would be classified as tax evading. On the other hand, a firm with ten employees who all fall into the bottom 10% of the residual distribution is classified as tax evading as well. The scale of the aforementioned cases is not readily comparable, and an individual falling in the bottom due to a large discrepancy between actual and predicted salary could be there for reasons other than the employer being engaged in labour tax evasion. Benkovskis & Fadejeva (2022) also discuss that for some individuals seemingly low wages could be due to unobserved worker characteristics. They require two conditions to be met in order for a firm to be classified into the treated group. Firstly, the share of employees with “suspiciously low wages” in a given year is equal to 50% or more, and secondly, occupation data should be available for one third or at least 10 employees of that firm. It is therefore beneficial to take into account the population of employees and the share of employees falling into the bottom 10% of the distribution. Accordingly, it is assumed that firms for which 50% or more of employees fall into the bottom 10% of the residual distribution pay a suspiciously low wage, and they are classified as tax evading.
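The 50% rule can be sketched in a few lines. This is an illustrative sketch only: the function name, the input layout (one firm identifier and residual per employee) and the handling of ties at the cut-off are assumptions, and the residuals are made up.

```python
from collections import defaultdict


def classify_evading_firms(employee_residuals, bottom_percent=10, min_share=0.5):
    """Firms whose share of employees in the bottom of the residual
    distribution is at least min_share.

    employee_residuals: list of (firm_id, residual) pairs, one per employee.
    """
    ordered = sorted(res for _, res in employee_residuals)
    n_low = len(ordered) * bottom_percent // 100
    if n_low == 0:
        return set()
    cutoff = ordered[n_low - 1]  # largest residual still in the bottom group

    total = defaultdict(int)
    low = defaultdict(int)
    for firm, res in employee_residuals:
        total[firm] += 1
        if res <= cutoff:
            low[firm] += 1
    return {firm for firm in total if low[firm] / total[firm] >= min_share}


# 30 made-up employees: the bottom 10% are the three lowest residuals
sample = (
    [("A", -1.0), ("A", -0.9)]
    + [("C", -0.8), ("C", 0.5), ("C", 0.6), ("C", 0.7)]
    + [("B", 0.01 * i) for i in range(24)]
)
evading = classify_evading_firms(sample)
```

In this example both employees of firm A are in the bottom group, so A is classified as evading, while firm C, with one of four employees in the bottom group, is not. Under a rule requiring only one such employee, C would be flagged as well.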

Harmon & Walker (1995) also investigate whether the instrumental variables (hereinafter IV) approach should be preferred over OLS due to endogeneity and biases. They conclude that, even though IV estimates are nearly double the estimates for OLS, IV estimation provides less precise estimates, and the differences from OLS are not statistically significant. Therefore, IV is discarded for the purpose of this analysis and OLS is chosen. The errors are assumed to be heteroscedastic, and robust standard errors are therefore applied. The final wage regression model is presented as follows:

$$\ln(wage_i) = \beta_0 + \beta_1 age_i + \beta_2 age_i^2 + \beta_3 exp_i + \beta_4 exp_i^2 + \beta_5 gndr_i + \beta_6 reg_i + \beta_7 nace_i + \beta_8 occup_i + \beta_9 educ_i + \varepsilon_i \tag{1}$$

The list of variables is included in Appendix 2. The dependent variable is the natural logarithm of the average monthly wage and the independent variables included in the wage regression are:

Individual characteristics such as age (age), age², experience (exp), experience² (expressed in years) and a dummy variable taking the value 1 if the respondent is a woman (gndr);
A set of dummy variables (reg) indicating the NUTS region in which the respondent works (reference group is Northern Estonia);
A set of dummy variables (nace) indicating the NACE of the employer of the respondent (reference group is manufacturing);
A set of dummy variables (occup) indicating the type of occupation of the respondent (reference group is managers);
A set of dummy variables (educ) indicating the education of the respondent (reference group is pre-school education);
Error term ε.
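To make the role of the residuals concrete, the following sketch fits a heavily reduced version of the wage regression (constant, age and experience only) by OLS and computes the residuals. It is not the estimation code of the thesis: the helper names and the five observations are made up, the normal equations are solved directly for illustration, and robust standard errors are not computed.

```python
def ols(X, y):
    """OLS coefficients from the normal equations X'X b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("regressors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def residuals(X, y, coefficients):
    """Actual minus predicted value for each observation."""
    return [
        yi - sum(b * x for b, x in zip(coefficients, row))
        for row, yi in zip(X, y)
    ]


# Made-up observations: [constant, age, experience] and ln(wage)
X = [[1, 25, 1], [1, 35, 8], [1, 45, 5], [1, 55, 20], [1, 30, 3]]
log_wage = [6.9, 7.3, 7.2, 7.5, 7.0]
beta = ols(X, log_wage)
resid = residuals(X, log_wage, beta)
```

A strongly negative residual means that the observed log wage is well below what the individual characteristics predict, which is what the "suspiciously low wage" classification builds on.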

Regarding tax-compliant firms to be used as a control group, Gavoille & Zasova (2023) assume that firms owned by Nordic companies are less likely to engage in illicit corporate activities. They have obtained similar results to Braguinsky & Mityakov (2015) in their previous research regarding transparency and law-abiding cultural norms (Gavoille & Zasova, 2021). Benkovskis & Fadejeva (2022) have identified “definitely compliant” companies to be state-owned firms and companies whose owners are located in low-corruption countries. DeBacker et al. (2015) have found that firms owned by parents operating in countries with higher corruption levels evade more tax in the USA, supporting these findings and this approach. It could also be argued that the wage regression could be used to identify both tax evading and tax compliant companies. However, companies not falling into the bottom 10% of the residual distribution can still be engaged in labour tax evasion and should therefore not be classified as “definitely compliant”. The aim is to find patterns indicating labour tax evasion also for those who are not in the bottom 10% of the residual distribution.

This thesis follows Gavoille & Zasova (2021, 2023), considering companies whose owners reside in Nordic countries (Iceland, Norway, Sweden, Finland or Denmark) as tax compliant. If the parent company and the group parent company are registered in different countries, the parent company is considered more influential in importing corporate culture and conduct than the group parent company. However, it should be noted that even though the findings by Gavoille & Zasova (2021, 2023) and Braguinsky & Mityakov (2015) support the approach, selecting Nordic owned companies as tax compliant relies on strong assumptions.
2.2.2. Firm-level predictors of tax evasion

After obtaining a sample of firms for which the true type (classification) is known, it is important to classify the other firms as tax compliant or tax evading using different firm-level non-financial and financial indicators. The accounting and computer science literature on fraud detection has shown good prediction performance using different variables from firms’ annual financial reports, although no formal model in economic theory has been provided and the approach rather relies on identifying patterns (Gavoille & Zasova, 2023). Therefore, different balance sheet and income statement values have been used to calculate ratios that could indicate patterns associated with labour tax evasion. The chosen variables mainly follow Benkovskis & Fadejeva (2022) and are presented in Appendix 3.

Gavoille & Zasova (2023) proceed by splitting the sample of firms for which they know the true type into a training (80%) and a test (20%) sample. They then train a gradient boosting algorithm to distinguish between tax compliant and tax evading firms using the training sample and the financial variables of the firms as input variables. Afterwards, the model is applied to the firms in the test sample and, as a final step, they classify all the firms in the analysis. Benkovskis & Fadejeva (2022), on the other hand, employ a probit model to model the relationship and predict the probability that each firm is engaged in tax evasion. They argue that, despite potential losses in predictive power, a probit model is transparent and allows the sign and significance of the coefficients on the firm-level predictors to be reported. What is more, probit results in an estimate of the probability of a firm being involved in tax evasion and not a binary classification. First, they estimate a probit model on firms for which the true type is known. After this, the probit model is used to predict the out-of-sample probability of being engaged in tax evasion and finally, the goodness of the model is evaluated.

To analyse the factors that contribute to the probability of tax evasion, a probit or a logit model is considered, as both are designed for binary dependent variables taking the value 0 or 1 and are therefore suitable for analysing the relationship between different firm-level financial and non-financial predictors and the event of tax evasion occurring. What is more, the interpretability of the model is important to further analyse the relationship between labour tax evasion and firm-level predictors. The choice between the probit and logit model depends on the characteristics of the data and the underlying assumptions; however, both models produce similar results in many cases. As for interpretability, logit model coefficients are often found more straightforward to interpret due to the simplicity of the log-odds scale compared to changes in the standard normal distribution (Gujarati, 2003). Therefore, the logistic regression model is employed for the purpose of this thesis.

After combining the two subsets of firms for which the binary classification as tax compliant or tax evading was assumed and merging them with the firms’ financial data, a logistic regression model is used to model the relationship between the binary outcome and the predictor variables. The final logistic regression model takes the following form:

$$Y_i = \ln\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + X_i \beta + \varepsilon_i \tag{2}$$

Yᵢ is the dependent variable for firm i, expressed as the log-odds of the probability Pᵢ of being engaged in labour tax evasion. β₀ is the intercept and β is the parameter estimate. Xᵢ is a vector of explanatory variables of firm i, including different firm-level financial and non-financial variables that may correlate with tax evasion. Lastly, εᵢ is the error term.
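The link between a probability and the log-odds scale used by the logit model can be checked numerically. The sketch below is illustrative only; the function names are hypothetical and the values are not estimates from the thesis.

```python
import math


def log_odds(p):
    """Log-odds (logit) of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def logistic(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


even_odds = log_odds(0.5)         # 0.0: evasion and compliance equally likely
high = log_odds(0.9)              # about 2.197
round_trip = logistic(log_odds(0.9))
```

A linear index of zero corresponds to a probability of 0.5, and each unit increase in the index adds one to the log-odds rather than a fixed amount to the probability.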

A logistic distribution of the errors is assumed and the maximum likelihood estimation method is used to estimate the parameters. The errors are assumed to be heteroscedastic, and robust standard errors are therefore applied. As the logit coefficients directly indicate only the direction of the effect of the independent variables on the dependent variable, the marginal effects are also calculated to better understand the extent of the impact of the independent variables on the dependent variable. However, it is important to note that the estimated coefficients from logistic regression present conditional correlations between the dependent and independent variables and should not be considered as causal relationships when interpreting and discussing the results (Gujarati, 2003).
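For a continuous explanatory variable, the marginal effect in a logit model follows from differentiating the logistic function. The derivation below is the standard textbook result; the notation (x_{ij} for the j-th variable of firm i, β_j for its coefficient) and the numerical values are illustrative assumptions, and whether the thesis reports effects at the mean or averaged over firms is not stated here.

```latex
P_i = \frac{1}{1 + e^{-(\beta_0 + X_i \beta)}}
\quad\Longrightarrow\quad
\frac{\partial P_i}{\partial x_{ij}} = \beta_j \, P_i \, (1 - P_i)

% Illustration with \beta_j = 0.8:
%   at P_i = 0.5:  0.8 \cdot 0.5 \cdot 0.5 = 0.200
%   at P_i = 0.9:  0.8 \cdot 0.9 \cdot 0.1 = 0.072
```

The same coefficient thus implies a different change in probability depending on where the firm sits on the probability scale, which is why the coefficient alone conveys only the direction of the effect.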

There are several ways to measure the prediction performance of a logistic regression model. Following Hajek & Henriques (2017), the confusion matrix and different performance metrics are presented, i.e. recall, type I error, specificity, type II error, accuracy, F-measure and AUC-ROC (Area Under the Receiver Operating Characteristic Curve).

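In the formulas below, TP and FN denote the evading firms classified as evading and as compliant, respectively, TN and FP denote the compliant firms classified as compliant and as evading, respectively, P is the number of all evading firms and N is the number of all compliant firms.

Recall or true positive rate is the number of firms correctly classified as evading as a percentage of all evading firms (Ibid.):

$$TPR = \frac{TP}{P} \tag{3}$$

Type I error or false positive rate is the number of firms incorrectly classified as evading as a percentage of all compliant firms (Ibid.):

$$FPR = \frac{FP}{N} \tag{4}$$

Specificity or true negative rate is the number of firms correctly classified as compliant as a percentage of all compliant firms (Ibid.):

$$TNR = \frac{TN}{N} \tag{5}$$

Type II error or false negative rate is the number of firms incorrectly classified as compliant as a percentage of all evading firms (Ibid.):

$$FNR = \frac{FN}{P} \tag{6}$$

Accuracy is defined as the percentage of observations correctly classified (Ibid.):

$$ACC = \frac{TP + TN}{P + N} \tag{7}$$

F-measure is the harmonic mean of precision and TP rate (Ibid.):

$$F\text{-}measure = 2 \cdot \frac{Precision \cdot TP\ rate}{Precision + TP\ rate} \tag{8}$$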
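The metrics above can be computed directly from the four cells of a confusion matrix. The sketch below uses a hypothetical function name and made-up counts, and returns fractions rather than percentages.

```python
def classification_metrics(tp, fp, tn, fn):
    """Performance metrics from confusion matrix counts (as fractions)."""
    positives = tp + fn  # all evading firms
    negatives = tn + fp  # all compliant firms
    recall = tp / positives
    precision = tp / (tp + fp)
    return {
        "recall": recall,
        "type_i_error": fp / negatives,
        "specificity": tn / negatives,
        "type_ii_error": fn / positives,
        "accuracy": (tp + tn) / (positives + negatives),
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Made-up example: 100 evading and 100 compliant firms
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

Recall and the type II error sum to one, as do specificity and the type I error, so each pair carries the same information from two angles.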

The AUC-ROC score is used to evaluate the performance of the binary classification of the logistic model. The AUC-ROC score ranges from 0 to 1, where 0 indicates a poor model and 1 indicates a perfect model that makes all predictions correctly (Ibid.).
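One way to read the AUC-ROC score is as the share of (evading, compliant) pairs in which the evading firm receives the higher predicted probability, with ties counted as one half. The sketch below computes the score this way; the function name and the example scores are made up, and statistical packages typically obtain the same number from the ROC curve itself.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the share of positive-negative pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = evading) and predicted probabilities
example_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the example, three of the four evading-compliant pairs are ordered correctly, giving a score of 0.75. A model that assigns scores at random would be expected to score around 0.5.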

Lastly, the logit model is employed to predict the out-of-sample probability of being engaged in tax evasion for all firms:

$$\hat{P}_i = \gamma(X_i \hat{\beta}) \tag{9}$$

Here, $\hat{P}_i$ on the left-hand side denotes the predicted probability of labour tax evasion for firm i. Xᵢ is a vector of explanatory variables for firm i, including various firm-level financial and non-financial variables. $\hat{\beta}$ signifies the estimated coefficients and γ represents the logistic function. The probability is computed separately for each year.

Setting the probability threshold is crucial for the outcome, as the absolute share of the firms classified as tax evading is heavily dependent on the subjective threshold. Benkovskis & Fadejeva (2022) set the probit estimation threshold at 0.84, so that firms with a predicted probability above 84% are classified as evading and the others as compliant. The threshold is rather high and it can result in a model that is conservative in its true positive predictions. This leads to fewer false positives, but could potentially miss many true positives. For the purpose of this thesis, the predicted probability threshold is set at 0.65 and a robustness analysis is done using the probability threshold of 0.84, following Benkovskis & Fadejeva (2022).
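The effect of the threshold choice can be illustrated with a short sketch. The function names, the coefficient values and the firm characteristics are made up; only the two thresholds (0.65 and 0.84) come from the text.

```python
import math


def predicted_probability(intercept, coefficients, features):
    """Logistic function applied to the linear index of one firm."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.65):
    """Label a firm as evading if its probability exceeds the threshold."""
    return "evading" if probability > threshold else "compliant"


# Hypothetical firm with a linear index of 1.0, i.e. probability of about 0.73
p_hat = predicted_probability(0.0, [1.0], [1.0])
label_main = classify(p_hat)            # threshold 0.65
label_robust = classify(p_hat, 0.84)    # threshold 0.84
```

The same firm is labelled evading under the 0.65 threshold and compliant under the 0.84 threshold, which is the sensitivity that the robustness analysis addresses.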

EMPIRICAL ANALYSIS

This chapter presents the main results from the wage regression and logistic regression and provides an overview of the robustness checks carried out to confirm the reliability of the results. Furthermore, a discussion of the results is presented in section 3.3., as well as the shortcomings of the analysis and suggestions for improvements and further research.
3.1. Main results

This section gives an overview of the main results of the empirical analysis to obtain subsets of tax evading and tax compliant firms as well as to analyse the firm-level predictors contributing to the probability of being engaged in labour tax evasion.
3.1.1. Wage regression

To obtain the set of tax evading firms, i.e. the firms which pay “suspiciously low wages” to their employees, wage regression is performed. The data used is combined from matched employer-employee wage data and population data, as the wage data do not include a variable for educational attainment. The reference group for NACE is manufacturing, for education is pre-school education, for region is Northern Estonia and for occupation is managers. The results of the wage regression for 2021 and 2022 separately are presented in table 8.

Table 8. Wage regression

                                                      2021                 2022
                                                      ln(wage)             ln(wage)
Intercept                                             7.073*** (0.020)     7.270*** (0.018)
Gender                                                -0.221*** (0.002)    -0.220*** (0.002)
Age                                                   0.021*** (0.001)     0.021*** (0.001)
Age²/100                                              -0.028*** (0.001)    -0.029*** (0.001)
Experience                                            0.026*** (0.001)     0.023*** (0.001)
Experience²/100                                       -0.063*** (0.002)    -0.058*** (0.002)
Construction                                          -0.133*** (0.003)    -0.132*** (0.003)
Wholesale and retail trade                            -0.066*** (0.003)    -0.050*** (0.003)
Transportation and storage                            -0.101*** (0.003)    -0.080*** (0.003)
Basic education                                       0.045* (0.015)       0.014 (0.013)
Secondary education                                   0.071*** (0.015)     0.044** (0.013)
Tertiary education                                    0.142*** (0.015)     0.113*** (0.013)
Central Estonia                                       -0.109*** (0.003)    -0.117*** (0.003)
North-Eastern Estonia                                 -0.265*** (0.003)    -0.255*** (0.003)
Western Estonia                                       -0.158*** (0.003)    -0.158*** (0.003)
Southern Estonia                                      -0.120*** (0.002)    -0.127*** (0.003)
Professionals                                         0.119*** (0.006)     0.101*** (0.006)
Technicians and associate professionals               -0.053*** (0.005)    -0.078*** (0.005)
Clerical support workers                              -0.250*** (0.005)    -0.281*** (0.005)
Services and sales workers                            -0.438*** (0.005)    -0.451*** (0.005)
Skilled agricultural, forestry and fishery workers    -0.482*** (0.033)    -0.451*** (0.032)
Craft and related trades workers                      -0.390*** (0.005)    -0.413*** (0.005)
Plant and machine operators and assemblers            -0.382*** (0.005)    -0.420*** (0.005)
Elementary occupations                                -0.498*** (0.005)    -0.530*** (0.005)
Observations                                          192 791              186 572
R²                                                    0.316                0.336

Note: Results are based on Eq. 1. Significance level * p < 0.05, ** p < 0.01, *** p < 0.001. Robust standard errors in parentheses.

The R² is 31.6% for 2021 and 33.6% for 2022, which resembles that of previous studies estimating wage equations. The wage equation by Harmon & Walker (1995) explains 27% of the variance in the dependent variable and that of Gavoille & Zasova (2023) accounts for 25%. What is more, the results of the wage regression for 2021 and 2022 are relatively similar.
