CROS

What are some methodological issues in an S-DWH?

Recycling data in a statistical data warehouse (S-DWH), i.e. "store data once and re-use many times", has wide benefits in terms of version control, storage, auditing, accountability and some dimensions of quality (coherence and comparability, accessibility and clarity, timeliness and punctuality). However, for the other quality dimensions (accuracy and reliability, and relevance) the re-use of data causes issues. These issues are essentially methodological in nature; two simple examples concern data cleaning and outliering (the identification and treatment of outliers).

The first example is data cleaning. Within a silo system, data cleaning can target the errors that have the greatest impact on the output(s) of interest, for example through selective editing. With an S-DWH, however, the outputs are not known in advance, as the data can be re-used many times. In theory, therefore, all data should be cleaned. In practice, the cost savings associated with an S-DWH would be lost through a universal data cleaning strategy, so it is more likely that data cleaning will target the key outputs. For other outputs, untreated errors may then feed into the results and reduce their accuracy.
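To make the dependence on the chosen output concrete, here is a minimal sketch of a selective editing score, written in Python. It is an illustration under stated assumptions, not a method prescribed by the S-DWH: the `Record` fields, the score formula (weighted absolute deviation from an expected value, relative to the expected output total), the threshold and the example figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    unit_id: str
    region: str
    reported: float   # value as reported by the respondent
    expected: float   # anticipated value, e.g. from the previous period (assumption)
    weight: float     # design weight


def units_to_review(records, threshold, region: Optional[str] = None):
    """Return the ids of units whose potential error matters for one output.

    The output is the weighted total over `region` (or over all records when
    region is None). A unit's score is its weighted absolute deviation from
    the expected value, divided by the expected output total.
    """
    in_scope = [r for r in records if region is None or r.region == region]
    expected_total = sum(r.weight * r.expected for r in in_scope)
    return sorted(
        r.unit_id
        for r in in_scope
        if r.weight * abs(r.reported - r.expected) / expected_total > threshold
    )


# Hypothetical data: B looks like a gross error, C is a modest deviation.
records = [
    Record("A", "north", reported=1050.0, expected=1000.0, weight=1.0),
    Record("B", "north", reported=9000.0, expected=900.0, weight=1.0),
    Record("C", "south", reported=80.0, expected=50.0, weight=1.0),
]

national_review = units_to_review(records, threshold=0.05)
south_review = units_to_review(records, threshold=0.05, region="south")
print(national_review, south_review)
```

In this sketch, unit C is not selected when the national total is the output of interest, but it is selected when the output is the total for the "south" region. If cleaning is driven only by the national output, the deviation in C stays untreated and feeds into any later regional output.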

The second example is outliering. Within a silo system, outlier treatment can target the outliers that have the greatest impact on the output(s) of interest, for example through Winsorisation. With an S-DWH, however, the outputs are not known in advance, as the data can be re-used many times. In theory, therefore, no data point can be identified and treated as an outlier, since in some domains it will be the norm. In practice, outliers need to be identified and treated, so it is more likely that outliers will be identified and treated for specific outputs and then re-instated as normal data points, i.e. no 'outlier flag' will be retained. This will lead to inconsistencies between different outputs, and even within the same output over different periods. The alternative, declaring outliers based on key outputs, would lead to biases in other outputs and would essentially throw away useful information that has cost a producer to gather and a respondent to provide.
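The inconsistency between outputs can be illustrated with a short Python sketch. It assumes the simplest form of one-sided Winsorisation (values above a cutoff are replaced by the cutoff); the cutoffs, the variable names and the turnover figures are hypothetical, and survey practice often uses weight-dependent variants instead.

```python
def winsorise(values, cutoff):
    """One-sided Winsorisation: pull values above the cutoff down to it.

    Returns the treated values and a parallel list of outlier flags.
    The input list is left unchanged, mirroring "store data once".
    """
    treated = [min(v, cutoff) for v in values]
    flags = [v > cutoff for v in values]
    return treated, flags


# Raw values stored once in the warehouse (hypothetical figures).
turnover = [120.0, 135.0, 110.0, 5000.0]

# Output 1: a domain of small businesses, where 5000 is extreme.
small_treated, small_flags = winsorise(turnover, cutoff=500.0)

# Output 2: a domain where values of this size are the norm.
all_treated, all_flags = winsorise(turnover, cutoff=10000.0)

print(sum(small_treated), small_flags)
print(sum(all_treated), all_flags)
```

The same stored value is flagged and reduced for the first output and left untouched for the second, so totals built from the two treated series no longer agree. Because the flags exist only inside each output's processing, nothing in the stored data records that the value was ever treated as an outlier.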