16  Sensitivity analysis

Building statistical models and indices involves stages where subjective judgements have to be made. These can include the selection of individual data sets, the treatment of missing values, and the approach to weighting and aggregation. Good modelling practice means we should evaluate our model, testing the assumptions and judgements made in its building and analysing the uncertainties associated with the modelling process. Sensitivity analysis is one way to undertake such an assessment.

To test the robustness and uncertainty of the modelling approach used by InCiSE, five types of sensitivity analysis have been undertaken:

  • country selection
  • reference date of the source data
  • alternative approaches to weighting
  • adjusting the base data
  • alternative imputation methods

This chapter summarises the approach and results of these different analyses, while detailed results can be found in Appendix B.

16.1 Country selection

Section 2.3 discusses how the approach to country selection for the 2019 edition of InCiSE differs from the 2017 Pilot, as it now uses the results of the data quality assessment (DQA) to identify countries for inclusion. The DQA produces a score for each country that summarises the quality of the data within the InCiSE model about that country (before imputation of missing values). The threshold for inclusion in the 2019 edition of InCiSE is an overall DQA score of 0.50 or greater.

The three countries included in the InCiSE Index with the lowest data quality scores have markedly poorer data quality by indicator than other countries (see Table 2.8.A). For each of these three countries only two or three of the 12 InCiSE indicators are rated green, a further two or three indicators are rated as amber, while five or six are rated as red, and one indicator is fully imputed.

Section 2.8 also outlines an approach to ‘grading’ countries based on their data quality scores. DQA scores of 0.75 or higher are given an ‘A+’ grade, while those below 0.6 are given a ‘D’ grade. The ‘D’ group contains four more countries in addition to the three discussed above.

The 2017 Pilot used a simpler approach to country inclusion, with a threshold of at least 75% of metrics being available, which produced a set of 31 countries¹. For the 2019 edition’s set of metrics, 31 countries also meet the 75% threshold, but the country coverage differs from the set of countries included in the 2017 Pilot.

  • ¹ One further country met this criterion in 2017 but was not an OECD member, so it was excluded to simplify interpretation of results.

The first two sensitivity tests for country coverage altered the DQA threshold used to determine country inclusion. The first test used a DQA score of 0.55 or higher, excluding the three countries in the 2019 set with the lowest data quality, while the second test used a DQA score of 0.6 or higher. The third test used the 2017 Pilot’s threshold of countries with 75% of data being available. The fourth test used the 31 countries included in the 2017 Pilot.
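These selection rules can all be thought of as simple threshold filters over the DQA scores. The sketch below is purely illustrative and is not the InCiSE implementation; the scores shown are invented placeholders, not the published DQA results.

```python
# Hypothetical sketch of threshold-based country selection.
# The scores below are invented placeholders, not the published DQA results.
dqa_scores = {"GBR": 0.95, "NZL": 0.92, "FIN": 0.88, "XXX": 0.53, "YYY": 0.48}

def select_countries(scores, threshold=0.50):
    """Return the countries whose overall DQA score meets or exceeds the threshold."""
    return sorted(country for country, score in scores.items() if score >= threshold)

countries_2019 = select_countries(dqa_scores, 0.50)  # 2019 edition rule
test_1 = select_countries(dqa_scores, 0.55)          # first sensitivity test
test_2 = select_countries(dqa_scores, 0.60)          # second sensitivity test
```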

    16.2 Reference date

The reference dates of the source data for the 2019 edition of InCiSE range from 2012 to 2018. However, as shown in Table 16.1, the reference dates vary across indicators. A third of the metrics have a reference date of 2017 or 2018, around half of the metrics have a reference date of 2015 or 2016, while just 18 of the 116 metrics have a reference date before 2015.

Of these 18 metrics, 14 are the metrics for the capabilities indicator. This is the only indicator with 100% of its data having a reference date before 2015². The capabilities indicator is composed solely of data with a reference year of 2012. Only two other indicators have data from before 2015, and in both cases this applies to only a small number of their constituent metrics.

  • ² The lack of recency of the data source for the capabilities indicator (the OECD’s Survey of Adult Skills) is discussed in Chapter 5.

Table 16.1: Reference year of InCiSE metrics by indicator

Indicator | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2012-14 | 2015-16 | 2017-18
Capabilities | 14 | – | – | – | – | – | – | 100% | – | –
Crisis and risk management | – | – | – | 8 | 5 | – | – | – | 100% | –
Digital services | – | – | – | – | 7 | 6 | – | – | 54% | 46%
Fiscal and financial management | 1 | – | – | – | 1 | 4 | – | 17% | 17% | 67%
HR management | – | – | – | 5 | 4 | – | – | – | 100% | –
Inclusiveness | – | – | – | 3 | 2 | – | – | – | 100% | –
Integrity | – | 1 | 2 | 11 | – | 2 | 1 | 18% | 65% | 18%
Openness | – | – | – | 1 | 3 | 4 | 2 | – | 40% | 60%
Policy making | – | – | – | – | – | – | 8 | – | – | 100%
Procurement | – | – | – | – | 6 | – | – | – | 100% | –
Regulation | – | – | – | – | – | 6 | 3 | – | – | 100%
Tax administration | – | – | – | 5 | – | 1 | – | – | 83% | 17%
Total | 15 | 1 | 2 | 33 | 28 | 23 | 14 | 16% | 53% | 32%

Year columns show the number of metrics with that reference year; the final three columns show the percentage of each indicator’s metrics falling within each period.
Table 5.2.A in the original PDF publication

    The first two sensitivity tests for recency exclude the capabilities indicator. In the first analysis the capabilities indicator is excluded but the weightings of the other indicators are not adjusted. In the second analysis the weightings are recalculated to account for the removal of the capabilities indicator.
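The reweighting in the second test can be thought of as rescaling the remaining indicator weights so that they again sum to one once the capabilities indicator is removed. A minimal sketch of that step (not the InCiSE code):

```python
def reweight_after_removal(weights, dropped):
    """Rescale the remaining indicator weights to sum to 1 after dropping one indicator."""
    remaining = {name: w for name, w in weights.items() if name != dropped}
    total = sum(remaining.values())
    return {name: w / total for name, w in remaining.items()}

# e.g. reweight_after_removal(indicator_weights, "capabilities")
```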

In the third test, only data with a reference year of 2015 or later is included in the model; the four other metrics from before 2015 are excluded in addition to the 14 capabilities metrics. In the fourth test, only data with a reference year of 2016 or later is included in the model; the 51 metrics with a reference date of 2015 or earlier are therefore excluded. For both of these analyses there is no adjustment to the weightings – either to calculate the indicators from their constituent metrics or to calculate the index from the indicators.

    16.3 Alternative approaches to weighting

The InCiSE Index is a weighted aggregation of the InCiSE indicators, which themselves are weighted aggregations of the InCiSE metrics. Section 2.7 sets out the approach to weighting the InCiSE indicators to calculate the InCiSE Index. Two-thirds of an indicator’s weight is based on an ‘equal share’ approach (i.e. 1/12), while one-third is based on the results of the data quality assessment. Section 2.6 and Chapters 3-14 outline how the individual metrics are weighted to produce each of the 12 indicator scores.

    The first three sensitivity tests for alternative weighting look at the proportion of indicator weighting that is assigned to the ‘equal share’ and the data quality assessment. The first test uses a 50:50 split rather than the 67:33 split. The second test uses solely an ‘equal share’ approach (i.e. indicator weights set to 1/12 each). The third test uses solely the results of the data quality assessment to determine the weighting.
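One way to express this family of weighting schemes is sketched below. It assumes the DQA-based portion of each indicator’s weight is proportional to its DQA score, which is a simplification for illustration; the precise use of the DQA results in the InCiSE model is set out in Section 2.7.

```python
def indicator_weights(dqa_scores, equal_share=2/3):
    """Blend an equal-share weight with a weight proportional to the DQA scores.

    dqa_scores maps indicator name -> DQA score. equal_share is the fraction of
    the weight taken from the equal-share component: 2/3 reproduces the 67:33
    split, 0.5 the 50:50 test, 1.0 the 'equal share only' test and 0.0 the
    'DQA only' test.
    """
    n = len(dqa_scores)
    dqa_total = sum(dqa_scores.values())
    return {name: equal_share / n + (1 - equal_share) * score / dqa_total
            for name, score in dqa_scores.items()}
```

By construction the weights sum to one for any value of equal_share, so the same function covers the baseline model and the first three weighting tests.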

The fourth and fifth tests focus on the weighting of metrics: the fourth does not apply weighting to metrics within indicators (i.e. all metrics contribute equally to the calculation of their indicator), while the fifth is a simple summation of all the metrics, which is then normalised as per the standard calculations of the indicators and index (as set out in Section 2.5).

    16.4 Adjusting the base data

    In the InCiSE model, metrics are normalised after missing data is imputed. An alternative approach would be to normalise the data before it is imputed.

    Three sensitivity tests were done where normalisation of the data occurred before the imputation. In the first test the data was ranked, in the second test the data was rescaled using the same min-max normalisation applied to the outputs of the model, and in the third test the data was converted to z-scores with a mean of 0 and a standard deviation of 1.
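A sketch of the three transformations, applied to a single metric held as a numeric array with missing values as NaN, is shown below. The handling of ties and missing values is simplified here and may differ from the InCiSE model itself.

```python
import numpy as np

def rank_transform(x):
    """Rank the observed values (1 = lowest); missing values stay missing. Ties are broken arbitrarily."""
    x = np.asarray(x, dtype=float)
    ranks = np.full_like(x, np.nan)
    observed = ~np.isnan(x)
    ranks[observed] = x[observed].argsort().argsort() + 1
    return ranks

def min_max(x):
    """Rescale observed values to the 0-1 range."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmin(x)) / (np.nanmax(x) - np.nanmin(x))

def z_score(x):
    """Standardise observed values to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)
```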

    16.5 Alternative imputation methods

As discussed in Section 2.4, missing data in the InCiSE base data is handled through multiple imputation, in particular using the predictive mean matching method.

Four sensitivity tests were carried out using different approaches to imputation. Section 2.4 outlines how the imputation of missing data is handled on a per-indicator basis; the first test changes this to adopt a “kitchen sink”/“all-in-one” approach in which the full dataset of all 116 metrics (and two external predictor variables) is supplied to the imputation function. The second test uses a modified form of predictive mean matching called ‘midas touch’ to generate imputed values. The third test uses the ‘random forest’ method, a machine learning approach, to generate imputed values. The fourth test uses mean imputation, where missing data is replaced with the simple arithmetic mean of the observed data.
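The idea behind predictive mean matching can be illustrated with the simplified, single-imputation sketch below: fit a regression on the complete cases, predict for every case, and for each missing value borrow the observed value of a nearby ‘donor’. The InCiSE model uses the full multiple-imputation machinery and works per indicator; the choice of five donors and the assumption of fully observed predictors here are illustrative only.

```python
import numpy as np

def pmm_impute(y, X, donors=5, seed=0):
    """Impute missing values of y by predictive mean matching against predictors X.

    Assumes X is fully observed. Fits a linear regression on the complete cases,
    predicts for every case, and for each missing value copies the observed value
    of a randomly chosen donor whose prediction is among the closest.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])  # add intercept
    missing = np.isnan(y)
    beta, *_ = np.linalg.lstsq(X[~missing], y[~missing], rcond=None)    # fit on observed cases
    predicted = X @ beta                                                  # predictions for all cases
    observed_y = y[~missing]
    observed_pred = predicted[~missing]
    imputed = y.copy()
    for i in np.where(missing)[0]:
        nearest = np.argsort(np.abs(observed_pred - predicted[i]))[:donors]
        imputed[i] = rng.choice(observed_y[nearest])                      # borrow a donor's observed value
    return imputed
```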

    16.6 Results of the sensitivity analysis

Table 16.2 shows the results of the 2019 InCiSE model for each country and the range of ranks across the five different sets of sensitivity analysis, while Figure 16.1 shows how the InCiSE Index score varies by country for each of the sensitivity tests carried out. The results of the five sets of sensitivity analysis demonstrate general stability in the model, with country ranks either unchanged or changed by only one or two places on average, and the same groupings of countries at the top and bottom of the rankings. Full results from the sensitivity analysis are provided in Appendix B.

    Figure 16.1: Sensitivity analysis results
Five line graphs showing the detailed results of the sensitivity analysis. Each graph shows the InCiSE 2019 final index scores for each country compared with the results of one set of tests conducted in the sensitivity analysis: country selection, reference year, alternative weighting, adjusting the base data, and alternative imputation methods. Index scores are plotted on a scale from 0.00 to 1.00.
    Figures 5.1 to 5.5 in the original 2019 publication.

In the country coverage sensitivity analysis, the main driver of change in rankings is the exclusion of countries itself: Figure 16.1 shows that the scores of individual countries do not substantially change as a result of the exclusion of different countries. When varying the reference date there are some changes as a result of the exclusion of the capabilities indicator, and further changes as a result of excluding data with a reference year of 2015 and earlier.

Altering the weighting schemes for the calculation of the index and indicators does not result in many changes, except when calculating the index as a simple sum of all metrics (i.e. applying no weighting at all). Similarly, making alterations to the metrics (e.g. ranking, rescaling, standardisation) before they are imputed does not result in many changes to country scores or rankings.

Varying the imputation methodology results in slightly more variation in country scores and ranks than the previous sensitivity checks. Only three countries see no change in their ranking; however, for those that do change, the difference in ranks is still small, at around one or two places.

One way to consider the effectiveness of the sensitivity analysis is to calculate the Mean Absolute Error (MAE) arising from the analysis. MAE is a common technique for assessing the quality of statistical models by comparing the model’s estimates or predictions with the original data. It is calculated as the sum of the absolute errors divided by the number of cases. In the case of the InCiSE sensitivity analysis, the ‘error’ is the difference between the 2019 InCiSE Index results and the results from each of the sensitivity tests.
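As a worked illustration, the calculation amounts to the following (the baseline scores are taken from Table 16.2, the test scores are made up):

```python
def mean_absolute_error(baseline_scores, test_scores):
    """MAE between the 2019 index scores and one sensitivity test, averaged over countries."""
    return sum(abs(b - t) for b, t in zip(baseline_scores, test_scores)) / len(baseline_scores)

# Made-up example with three countries' index scores under the 2019 model and one test.
mean_absolute_error([1.000, 0.980, 0.916], [0.995, 0.970, 0.930])  # ~0.0097
```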

The overall MAE figure for the sensitivity analysis, that is the mean level of ‘error’ across all 20 sensitivity tests for all 38 countries, is ±0.017. The MAE can also be calculated for each sensitivity test or each set of tests. The per-set MAE figures are presented in Table 16.3, while the per-test MAE figures are presented in the tables in Appendix B. Across the different sets of methodological sensitivity tests, the smallest MAE is ±0.007 for the set of tests varying country selection, while the highest MAE is ±0.023 for the set of tests changing the reference date.

Finally, the MAE can also be calculated by country; these figures are included in Table 16.2 and range from ±0.001 to ±0.032. However, given that the same two countries place highest and lowest across most tests, the minimum per-country MAE is skewed by the limited variability in these two countries’ scores; when these countries are excluded, the minimum MAE rises from ±0.001 to ±0.009.

    Table 16.2: Variation in country ranking across sensitivity analyses
    Country 2019 results Range of country's ranks in sensitivity analysis Mean absolute error (MAE)
    Score Rank Country coverage Reference date Alternative weightings Adjust base data Imputation method
    GBR 1.000 1 1 1 1-2 1 1-2 0.003
    NZL 0.980 2 2 2 1-2 2 1-2 0.019
    CAN 0.916 3 3 3 3 3 3-5 0.021
    FIN 0.883 4 4 4-5 4-5 4 3-4 0.013
    AUS 0.863 5 5 4-5 4-5 5-6 4-7 0.014
    DNK 0.832 6 5-6 7-9 6-8 5-7 5-7 0.021
    NOR 0.830 7 6-7 6 6-7 6-10 5-7 0.010
    NLD 0.794 8 7-8 8-9 8-10 8-9 8-9 0.014
    KOR 0.785 9 8-10 9-11 6-11 7-11 10 0.019
    SWE 0.785 10 9-10 7-10 8-10 8-9 8-9 0.009
    USA 0.765 11 11 10-11 10-11 10-11 11 0.029
    EST 0.674 12 10-12 12-17 12 12-13 12-15 0.023
    CHE 0.650 13 11-13 13-14 13-14 12-15 12-15 0.020
    IRL 0.625 14 14-16 15-16 14-17 14-15 16-17 0.021
    FRA 0.619 15 12-15 12-14 13-16 13-15 12-15 0.012
    AUT 0.617 16 13-15 15-16 13-16 16-17 13-15 0.014
    ESP 0.599 17 15-17 13-17 15-17 16-17 16-17 0.010
    MEX 0.507 18 17-19 19-20 18-24 18-23 18-20 0.020
    DEU 0.505 19 16-19 18-21 18-19 19-21 18-20 0.010
    LTU 0.487 20 18-20 18-20 20-22 20-21 20-22 0.018
    BEL 0.485 21 19-22 18-22 20-21 19-20 18-21 0.017
    JPN 0.472 22 17-21 21-22 19-24 18-23 21-24 0.020
    LVA 0.466 23 20-23 23-26 20-24 24 24-26 0.031
    CHL 0.454 24 21-24 23-25 22-24 22-23 21-23 0.014
    ITA 0.419 25 22-25 23-25 25-26 25 23-25 0.014
    SVN 0.369 26 23-26 26-28 25-26 26 25-26 0.018
    ISR 0.315 27 27 24-27 27 27 27-29 0.022
    POL 0.282 28 24-28 28-36 28-29 28-29 27-29 0.025
    PRT 0.259 29 25-29 29-30 28-29 31 28-31 0.015
    CZE 0.245 30 26-30 27-32 30-32 28-30 30-31 0.018
    ISL 0.228 31 31 30-32 30-32 29-30 28-31 0.019
    TUR 0.189 32 27-32 28-32 30-35 32 32-33 0.026
    SVK 0.172 33 28-33 31-34 32-35 33 32-34 0.015
BGR 0.147 34 – 34-35 33-34 35 35-36 0.016
HRV 0.140 35 – 36-37 34-36 34 33-34 0.019
ROU 0.127 36 – 35-37 36-37 36 35-37 0.022
    GRC 0.107 37 29-34 33-35 34-38 37 36-37 0.027
    HUN 0.000 38 30-35 38 37-38 38 38 0.001
    Table 5.6.A in the original 2019 publication
    Table 16.3: Summary of variation in ranking changes across sensitivity analysis sets
    Country coverage Reference date Alternative weightings Adjust base data Imputation method
    Mean absolute error (MAE) 0.007 0.023 0.018 0.014 0.022
    Countries with no change in rank 8 5 3 16 3
    Largest difference in rank 5 8 6 5 3
    Average difference in rank 2 2 2 1 2
    Table 5.6.A in the original 2019 publication
    Cross-referencing note

    This was presented as chapter 5 in the original 2019 publication.