17 Sensitivity analysis

Building statistical models and indices involves stages where subjective judgements have to be made. These can include the selection of individual data sets, the treatment of missing values, and the approach to weighting and aggregation. Good modelling practice means we should evaluate our model, testing the assumptions and judgements made in its building and analysing the uncertainties associated with the modelling process. Sensitivity analysis is one way to undertake such an assessment.

To test the robustness and uncertainty of the modelling approach used by InCiSE, five types of sensitivity analysis have been undertaken:

Varying the set of countries selected for results to be produced;
Excluding out-of-date data;
Alternative approaches to weighting;
Using the ranks of source data; and,
Alternative approaches to imputation.

This chapter summarises the approach and results of these different analyses, while detailed results can be found in Appendix B.

17.1 Country selection

Section 3.3 discusses how the approach to country selection for the 2019 edition of InCiSE differs from the 2017 Pilot, as it now uses the results of the data quality assessment (DQA) to identify countries for inclusion. The DQA produces a score for each country that summarises the quality of the data within the InCiSE model about that country (before imputation of missing values). The threshold for inclusion in the 2019 edition of InCiSE is an overall DQA score of 0.50 or greater.

The three countries included in the InCiSE Index with the lowest data quality scores have markedly poorer data quality by indicator than other countries (see Table 2.8.A). For each of these three countries only two or three of the 12 InCiSE indicators are rated green, a further two or three indicators are rated as amber, while five or six are rated as red, and one indicator is fully imputed.

Section 3.8 also outlines an approach to ‘grading’ countries based on their data quality scores. DQA scores of 0.75 are given an ‘A+’ grade, while those below 0.6 are given a ‘D’ grade. In this ‘D’ group there are four more countries in addition to the three discussed above.

The 2017 Pilot used a simpler approach to country inclusion, with a threshold of having at least 75% of metrics available, which produced a set of 31 countries¹. For the 2019 edition’s set of metrics, 31 countries also achieve the 75% threshold, but the country coverage differs from the set of countries in the 2017 Pilot.

¹ One further country in 2017 met this criterion but was not an OECD member, so was excluded to simplify interpretation of results.

The first two sensitivity tests for country coverage altered the DQA threshold used to determine country inclusion. The first test used a DQA score of 0.55 or higher, excluding the three countries in the 2019 set with the lowest data quality, while the second test used a DQA score of 0.6 or higher. The third test used the 2017 Pilot’s threshold of countries with 75% of data being available. The fourth test used the 31 countries included in the 2017 Pilot.
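The two threshold-based tests amount to re-running the country filter with a higher cut-off. The following is a minimal sketch of that filter, not the InCiSE code: the function name, the country labels and the DQA scores are hypothetical placeholders chosen only to show the mechanics.

```python
def select_countries(dqa_scores, threshold=0.50):
    """Return the countries whose overall DQA score meets the threshold."""
    return sorted(c for c, score in dqa_scores.items() if score >= threshold)

# Hypothetical DQA scores, for illustration only (not InCiSE data).
example_scores = {"AAA": 0.82, "BBB": 0.58, "CCC": 0.52, "DDD": 0.47}

baseline = select_countries(example_scores)        # 2019 rule: 0.50 or greater
test_one = select_countries(example_scores, 0.55)  # first sensitivity test
test_two = select_countries(example_scores, 0.60)  # second sensitivity test
```

In this toy example the baseline keeps three of the four placeholder countries, the first test keeps two and the second keeps one.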

17.2 Reference date

The reference dates of the source data for the 2019 edition of InCiSE range from 2012 to 2018. However, as shown in Table 17.1, the reference dates vary across indicators. A third of the metrics have a reference date of 2017 or 2018, around half of the metrics have a reference date of 2015 or 2016, while just 18 out of the 116 metrics have a reference date between 2012 and 2014.

Of these 18 metrics, 14 are the metrics for the capabilities indicator, all of which have a reference year of 2012. This is the only indicator with 100% of its data with a reference date from before 2015². Only two other indicators have data from before 2015, but in both cases this is a small number of their constituent metrics.

² The lack of recency of the data source for the capabilities indicator (the OECD’s Survey of Adult Skills) is discussed in Chapter 6.

Table 17.1: Reference year of InCiSE metrics by indicator
Indicator	Number of metrics per year							Percent within period...
Indicator	2012	2013	2014	2015	2016	2017	2018	2012-14	2015-16	2017-18
Capabilities	14							100%
Crisis and risk management				8	5				100%
Digital services					7	6			54%	46%
Fiscal and financial management	1				1	4		17%	17%	67%
HR management				5	4				100%
Inclusiveness				3	2				100%
Integrity		1	2	11		2	1	18%	65%	18%
Openness				1	3	4	2		40%	60%
Policy making							8			100%
Procurement					6				100%
Regulation						6	3			100%
Tax administration				5		1			83%	17%
Total	15	1	2	33	28	23	14	16%	53%	32%
Table 5.2.A in the original PDF publication

The first two sensitivity tests for recency exclude the capabilities indicator. In the first analysis the capabilities indicator is excluded but the weightings of the other indicators are not adjusted. In the second analysis the weightings are recalculated to account for the removal of the capabilities indicator.

In the third test, only data with a reference year of 2015 or later is included in the model; the four other metrics from before 2015 are excluded in addition to the 14 capabilities metrics. In the fourth test, only data with a reference year of 2016 or later is included in the model; the 51 metrics with a reference date of 2015 or earlier are therefore excluded. For both these analyses there is no adjustment to the weightings, either to calculate the indicators from their constituent metrics or to calculate the index from the indicators.
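The number of metrics dropped by each cut-off can be checked against the Total row of Table 17.1. The sketch below is only an arithmetic check; the dictionary and function names are illustrative and not part of the InCiSE model.

```python
# Number of metrics per reference year, from the Total row of Table 17.1.
METRICS_PER_YEAR = {
    2012: 15, 2013: 1, 2014: 2, 2015: 33, 2016: 28, 2017: 23, 2018: 14,
}

def count_excluded(first_year_kept, counts=METRICS_PER_YEAR):
    """Count the metrics dropped when only data from first_year_kept onwards is used."""
    return sum(n for year, n in counts.items() if year < first_year_kept)

total_metrics = sum(METRICS_PER_YEAR.values())  # 116
excluded_third_test = count_excluded(2015)      # 18: 14 capabilities metrics + 4 others
excluded_fourth_test = count_excluded(2016)     # 51
```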

17.3 Alternative approaches to weighting

The InCiSE Index is a weighted aggregation of the InCiSE indicators, which themselves are weighted aggregations of the InCiSE metrics. Section 3.7 sets out the approach to weighting the InCiSE indicators to calculate the InCiSE Index. Two-thirds of an indicator’s weight is based on an ‘equal share’ approach (i.e. 1/12), while one-third is based on the results of the data quality assessment. Section 3.6 and Chapters 3-14 outline how the individual metrics are weighted to produce each of the 12 indicator scores.

The first three sensitivity tests for alternative weighting look at the proportion of indicator weighting that is assigned to the ‘equal share’ and the data quality assessment. The first test uses a 50:50 split rather than the 67:33 split. The second test uses solely an ‘equal share’ approach (i.e. indicator weights set to 1/12 each). The third test uses solely the results of the data quality assessment to determine the weighting.
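The three splits can be expressed as one blending function with a different ‘equal share’ proportion. The sketch below is illustrative only. It assumes that the data quality component of a weight is the indicator’s DQA score divided by the sum of DQA scores across indicators, which this chapter does not confirm, and it uses three hypothetical indicators rather than the 12 in the model.

```python
def indicator_weights(dqa_scores, equal_share=2 / 3):
    """Blend an equal-share weight with a data-quality weight per indicator.

    Assumption: the data-quality component is each indicator's DQA score
    divided by the sum of DQA scores across all indicators.
    """
    n = len(dqa_scores)
    total = sum(dqa_scores.values())
    return {
        name: equal_share * (1 / n) + (1 - equal_share) * (score / total)
        for name, score in dqa_scores.items()
    }

# Hypothetical indicator-level DQA scores, for illustration only.
example = {"A": 0.9, "B": 0.6, "C": 0.3}

standard = indicator_weights(example)          # 67:33 split
test_one = indicator_weights(example, 0.5)     # 50:50 split
test_two = indicator_weights(example, 1.0)     # equal share only
test_three = indicator_weights(example, 0.0)   # data quality only
```

Under this assumption the weights sum to 1 for any split, so the index stays on the same scale across the three tests.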

The fourth and fifth tests focus on metric weighting: the fourth does not apply weighting to metrics within indicators (i.e. all metrics contribute equally to the calculation of their indicator), and the fifth is a simple summation of the metrics, then normalised as per the standard calculations of the indicators and index (as set out in Section 3.5).

17.4 Adjusting the base data

In the InCiSE model, metrics are normalised after missing data is imputed. An alternative approach would be to normalise the data before it is imputed.

Three sensitivity tests were done where normalisation of the data occurred before the imputation. In the first test the data was ranked, in the second test the data was rescaled using the same min-max normalisation applied to the outputs of the model, and in the third test the data was converted to z-scores with a mean of 0 and a standard deviation of 1.
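The three transformations can be sketched as follows. This is an illustration rather than the InCiSE implementation: the function names and the example values are hypothetical, the ranking direction (rank 1 for the largest value) and the use of the population standard deviation are assumptions, and missing values are simply left in place for the later imputation step.

```python
from statistics import mean, pstdev

def apply_to_observed(transform, values):
    """Apply transform to the non-missing values, leaving None in place."""
    observed = [v for v in values if v is not None]
    transformed = iter(transform(observed))
    return [None if v is None else next(transformed) for v in values]

def to_ranks(observed):
    """Rank 1 is the largest value; tied values share the best rank."""
    ordered = sorted(observed, reverse=True)
    return [ordered.index(v) + 1 for v in observed]

def to_min_max(observed):
    """Rescale to the 0-1 range (requires at least two distinct values)."""
    lo, hi = min(observed), max(observed)
    return [(v - lo) / (hi - lo) for v in observed]

def to_z_scores(observed):
    """Centre on 0 with a standard deviation of 1 (population formula)."""
    mu, sigma = mean(observed), pstdev(observed)
    return [(v - mu) / sigma for v in observed]

# Hypothetical metric with one missing value, for illustration only.
raw = [2.0, None, 4.0, 6.0]
ranked = apply_to_observed(to_ranks, raw)            # [3, None, 2, 1]
rescaled = apply_to_observed(to_min_max, raw)        # [0.0, None, 0.5, 1.0]
standardised = apply_to_observed(to_z_scores, raw)
```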

17.5 Alternative imputation methods

As discussed in section 2.4, missing data in the InCiSE base data is handled through multiple imputation, in particular using the predictive mean matching method.

Four sensitivity tests were carried out using different approaches to imputation. Section 3.4 outlines how the imputation of missing data is handled on a per-indicator basis. The first test changes this to adopt a “kitchen sink” or “all-in-one” approach in which the full dataset of all 116 metrics (and two external predictor variables) is supplied to the imputation function. The second test uses a modified form of predictive mean matching called ‘midas touch’ to generate imputed values. The third test uses the ‘random forest’ method, a machine learning approach, to generate imputed values. The fourth test uses mean imputation, where missing data is replaced with the simple arithmetic mean of the observed data.
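Of the four alternatives, mean imputation is simple enough to show in a few lines. The sketch below covers only that fourth test; the function name and example values are hypothetical, and the predictive mean matching, ‘midas touch’ and random forest methods are not reproduced here.

```python
def mean_impute(values):
    """Replace missing values (None) with the arithmetic mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

# Hypothetical metric with one missing value, for illustration only.
imputed = mean_impute([1.0, None, 3.0])  # [1.0, 2.0, 3.0]
```

Because every missing value receives the same fill value, this approach reduces the variability of the metric, which is one reason it is useful as a contrast to the multiple imputation used in the main model.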

17.6 Results of the sensitivity analysis

Table 17.2 shows the results of the 2019 InCiSE model for each country and the range of ranks across the five different sets of sensitivity analysis, while Figure 17.1 shows how the InCiSE Index score varies by country for each of the sensitivity tests carried out. The results of the five sets of sensitivity analysis demonstrate general stability in the model, with country ranks either unchanged or changed by only one or two places on average, and the same groupings of countries at the top and bottom of the rankings. Full results from the sensitivity analysis are provided in Appendix B.

Figure 17.1: Sensitivity analysis results

Figures 5.1 to 5.5 in the original 2019 publication.

In the country coverage sensitivity analysis, the main driver of change in rankings is the exclusion of countries: Figure 5.1 shows that the scores of individual countries do not substantially change as a result of the exclusion of different countries. When varying the reference date there are some changes as a result of the exclusion of the capabilities indicator, and further changes as a result of excluding data with a reference year of 2015 and earlier.

Altering the weighting schemes for the calculation of the index and indicators does not result in many changes, except when calculating the index as a simple sum of all metrics (i.e. applying no weighting at all). Similarly making alterations to the metrics (e.g. ranking, rescaling, standardisation) before they are imputed does not result in many changes to country scores or rankings.

Varying the imputation methodology results in slightly more variation of country scores and ranks than the previous sensitivity checks. Only three countries see no change in their ranking; however, for those that do change, the difference in ranks is still small, at around one or two places.

One way to consider the effectiveness of the sensitivity analysis is to calculate the Mean Absolute Error (MAE) arising from the analysis. MAE is a common technique for assessing the quality of statistical models by comparing the difference of the model’s estimates/predictions with the original data. It is calculated as the sum of the absolute errors divided by the number of cases. In the case of the InCiSE sensitivity analysis, ‘error’ is calculated as the difference between the 2019 InCiSE Index results and the results from each of the sensitivity tests.
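As a minimal sketch of this calculation, the function below averages the absolute differences between the baseline scores and the scores from one sensitivity test. The baseline scores are taken from Table 17.2, but the alternative scores are hypothetical, and skipping countries that a test excludes is an assumption rather than a documented InCiSE rule.

```python
def mean_absolute_error(baseline, alternative):
    """Mean of |baseline - alternative| over the countries present in both."""
    shared = baseline.keys() & alternative.keys()
    return sum(abs(baseline[c] - alternative[c]) for c in shared) / len(shared)

# 2019 index scores for three countries, from Table 17.2.
baseline_scores = {"GBR": 1.000, "NZL": 0.980, "CAN": 0.916}

# Hypothetical scores from a single sensitivity test, for illustration only.
hypothetical_test = {"GBR": 1.000, "NZL": 0.970, "CAN": 0.936}

mae = mean_absolute_error(baseline_scores, hypothetical_test)  # about 0.010
```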

The overall MAE figure for the sensitivity analysis, that is the mean level of ‘error’ across all 20 sensitivity tests for all 38 countries, is ±0.017. The MAE can also be calculated for each sensitivity test or each set of tests. The per-set MAE figures are presented in Table 17.3, while the per-test MAE figures are presented in the tables in Appendix B. Across the different sets of methodological sensitivity tests, the smallest MAE is ±0.007 for the set of tests varying country selection while the largest MAE is ±0.023 for the set of tests changing the reference date.

Finally, the MAE can also be calculated by country; these figures are included in Table 17.2 and range from ±0.001 to ±0.032. However, because the same two countries place highest and lowest across most tests, the minimum per-country MAE is skewed by the limited variability in these two countries’ scores. When these countries are excluded, the minimum MAE rises from ±0.001 to ±0.009.

Table 17.2: Variation in country ranking across sensitivity analyses
Country	2019 results		Range of country's ranks in sensitivity analysis					Mean absolute error (MAE)
Country	Score	Rank	Country coverage	Reference date	Alternative weightings	Adjust base data	Imputation method	Mean absolute error (MAE)
GBR	1.000	1	1	1	1-2	1	1-2	0.003
NZL	0.980	2	2	2	1-2	2	1-2	0.019
CAN	0.916	3	3	3	3	3	3-5	0.021
FIN	0.883	4	4	4-5	4-5	4	3-4	0.013
AUS	0.863	5	5	4-5	4-5	5-6	4-7	0.014
DNK	0.832	6	5-6	7-9	6-8	5-7	5-7	0.021
NOR	0.830	7	6-7	6	6-7	6-10	5-7	0.010
NLD	0.794	8	7-8	8-9	8-10	8-9	8-9	0.014
KOR	0.785	9	8-10	9-11	6-11	7-11	10	0.019
SWE	0.785	10	9-10	7-10	8-10	8-9	8-9	0.009
USA	0.765	11	11	10-11	10-11	10-11	11	0.029
EST	0.674	12	10-12	12-17	12	12-13	12-15	0.023
CHE	0.650	13	11-13	13-14	13-14	12-15	12-15	0.020
IRL	0.625	14	14-16	15-16	14-17	14-15	16-17	0.021
FRA	0.619	15	12-15	12-14	13-16	13-15	12-15	0.012
AUT	0.617	16	13-15	15-16	13-16	16-17	13-15	0.014
ESP	0.599	17	15-17	13-17	15-17	16-17	16-17	0.010
MEX	0.507	18	17-19	19-20	18-24	18-23	18-20	0.020
DEU	0.505	19	16-19	18-21	18-19	19-21	18-20	0.010
LTU	0.487	20	18-20	18-20	20-22	20-21	20-22	0.018
BEL	0.485	21	19-22	18-22	20-21	19-20	18-21	0.017
JPN	0.472	22	17-21	21-22	19-24	18-23	21-24	0.020
LVA	0.466	23	20-23	23-26	20-24	24	24-26	0.031
CHL	0.454	24	21-24	23-25	22-24	22-23	21-23	0.014
ITA	0.419	25	22-25	23-25	25-26	25	23-25	0.014
SVN	0.369	26	23-26	26-28	25-26	26	25-26	0.018
ISR	0.315	27	27	24-27	27	27	27-29	0.022
POL	0.282	28	24-28	28-36	28-29	28-29	27-29	0.025
PRT	0.259	29	25-29	29-30	28-29	31	28-31	0.015
CZE	0.245	30	26-30	27-32	30-32	28-30	30-31	0.018
ISL	0.228	31	31	30-32	30-32	29-30	28-31	0.019
TUR	0.189	32	27-32	28-32	30-35	32	32-33	0.026
SVK	0.172	33	28-33	31-34	32-35	33	32-34	0.015
BGR	0.147	34	—	34-35	33-34	35	35-36	0.016
HRV	0.140	35	—	36-37	34-36	34	33-34	0.019
ROU	0.127	36	—	35-37	36-37	36	35-37	0.022
GRC	0.107	37	29-34	33-35	34-38	37	36-37	0.027
HUN	0.000	38	30-35	38	37-38	38	38	0.001
Table 5.6.A in the original 2019 publication

Table 17.3: Summary of variation in ranking changes across sensitivity analysis sets
	Country coverage	Reference date	Alternative weightings	Adjust base data	Imputation method
Mean absolute error (MAE)	0.007	0.023	0.018	0.014	0.022
Countries with no change in rank	8	5	3	16	3
Largest difference in rank	5	8	6	5	3
Average difference in rank	2	2	2	1	2
Table 5.6.A in the original 2019 publication

Cross-referencing note

This was presented as chapter 5 in the original 2019 publication.