Download An Approach for Testing the Extract-Transform-Load Process in Data Warehouse Systems PDF

TitleAn Approach for Testing the Extract-Transform-Load Process in Data Warehouse Systems
File Size658.5 KB
Total Pages96
Table of Contents
List of Tables
List of Figures
	Problem Description
Literature Survey
	Data Warehouse Components
		Sources and Target Data Warehouse
		Extract, Transform, Load (ETL)
		Front-end Applications
	Testing Data Warehouse Components
	Testing Source Area and Target Data Warehouse
		Testing Underlying Data
		Testing the Data Model
		Testing Data Management Product
	Testing ETL Process
		Functional Testing of ETL Process
		Performance, Stress, and Scalability Testing of ETL Process
		Reliability Testing of ETL Process
		Regression Testing of ETL Process
		Usability Testing of ETL Process
	Testing Front-end Applications
		Functional Testing of Front-end Applications
		Usability Testing of Front-end Applications
		Performance and Stress Testing of Front-end Applications
Motivating Example
	One-to-one mappings
	Many-to-one mappings
	Many-to-many mappings
	Need for balancing tests
Balancing Properties
		Record count match
		Distinct record count match
		Attribute value match
		Attribute constraint match
		Outliers match
		Average match
	Syntactic validity
		Attribute data type match
		Attribute length match
		Attribute boundary match
	Completeness of the Properties
	Identify Source-To-Target Mappings
		One-to-one table mapping
		One-to-one attribute mapping
		Many-to-one table mapping
		Many-to-one attribute mapping
	Generate Balancing Tests
		Generate Analysis Queries
		Generate Test Assertions
Demonstration and Evaluation
	Validation of ETL Scripts
	Evaluation of Fault Finding Ability of Assertions
	Threats to Validity
Conclusions and Future Work
Document Text Contents
Page 1




Submitted by

Hajar Homayouni

Department of Computer Science

In partial fulfillment of the requirements

For the Degree of Master of Science

Colorado State University

Fort Collins, Colorado

Fall 2017

Master’s Committee:

Advisor: Sudipto Ghosh
Co-Advisor: Indrakshi Ray

James M. Bieman
Leo R. Vijayasarathy

Page 2




Enterprises use data warehouses to accumulate data from multiple sources for data analysis and

research. Since organizational decisions are often made based on the data stored in a data ware-

house, all its components must be rigorously tested. In this thesis, we first present a comprehensive

survey of data warehouse testing approaches, and then develop and evaluate an automated testing

approach for validating the Extract-Transform-Load (ETL) process, which is a common activity in

data warehousing.

In the survey we present a classification framework that categorizes the testing and evaluation

activities applied to the different components of data warehouses. These approaches include both

dynamic analysis as well as static evaluation and manual inspections. The classification frame-

work uses information related to what is tested in terms of the data warehouse component that

is validated, and how it is tested in terms of various types of testing and evaluation approaches.

We discuss the specific challenges and open problems for each component and propose research


The ETL process involves extracting data from source databases, transforming it into a form

suitable for research and analysis, and loading it into a data warehouse. ETL processes can use

complex one-to-one, many-to-one, and many-to-many transformations involving sources and tar-

gets that use different schemas, databases, and technologies. Since faulty implementations in any

of the ETL steps can result in incorrect information in the target data warehouse, ETL processes

must be thoroughly validated. In this thesis, we propose automated balancing tests that check for

discrepancies between the data in the source databases and that in the target warehouse. Balancing

tests ensure that the data obtained from the source databases is not lost or incorrectly modified by


Page 48

literature even though they are critical for a comprehensive testing of the process. We identified the

following open problems in ETL testing. We summarize areas and ideas for future investigation.

� In the functional testing of ETL, there is not a systematic way to assure the completeness of

the test cases written/generated as a set of queries. As with the functional testing of under-

lying data, we can use appropriate test adequacy criteria to evaluate and create a thorough


� There is a lack of systematic techniques to generate mock test inputs for the functional testing

of the ETL process. We can use input space partitioning techniques to generate test data for

all the equivalent classes of data. Current tools generate random test data with not much

similarity with the characteristics of real data.

� The fault finding ability of the balancing tests is not evaluated in the surveyed approaches.

We can use mutation analysis for this evaluation.

� In the performance, stress, and scalability testing of ETL, the existing approaches test the en-

tire ETL process under different workloads. We can apply tests to the individual components

of ETL to determine the areas of weaknesses.

� The heterogeneous data involved in the data warehousing systems make the performance,

stress, and scalability testing of the ETL process difficult. Testers must use large heteroge-

neous datasets in order to perform tests.

� The existing ETL implementations rely on ACID properties of transaction systems, and ig-

nore the reliability testing of the ETL process. We can perform the balancing tests proposed

in Chapter 5 to compare the results of the ETL process under normal conditions with the

ones under abnormal conditions to verify the properties, namely, completeness, consistency,

and syntactic validity.

� No approach has been proposed to test the usability of the ETL process. We define this testing

activity as the process of determining whether or not the ETL process is easy to use by the


Page 49

data warehouse implementer. One can test the usability of the ETL process by assessing the

manual effort involved in configuring ETL that is measured in terms of time.

2.5 Testing Front-end Applications
Front-end applications in data warehousing are used by data analyzers and researchers to per-

form various types of analysis on data and generate reports. Thus, it is important to test these

applications to make sure the data is correctly, effectively, and efficiently presented to the users.

2.5.1 Functional Testing of Front-end Applications

This testing activity ensures that the data is correctly selected and displayed by the applications

to the end-users. The goal of testing the functionality of the front-end applications is to recognize

whether the analysis or end result in a report is incorrect, and whether the cause of the problem is

the front-end application rather than the other components or processes in the data warehouse.

Golfarelli and Rizzi [7] compared the results of analyses queries displayed by the application

with the results obtained by executing the same queries directly (i.e., without using the application

as an interface) on the target data warehouse. They suggested two different ways to create test

cases as queries for functionality testing. In a black-box approach, test cases are a set of queries

based on user requirements. In a white-box approach, the test cases are determined by defining

appropriate coverage criteria for the dimensional data. For example, test cases are created to test

all the facts, dimensions, and attributes of the dimensional data.

The approaches proposed by Golfarelli and Rizzi are promising. In our project, we have used

Achilles, which is a front-end application that performs quality assurance and analysis checks on

health data warehouses in the OMOP [10] data model. The queries in Achilles are executable on


The functional testing of the front-end applications must consider all possible test inputs for

adequate testing. As it is impossible to test the functionality of the front-end applications with all


Page 95

[78] Byung-Kwon Park, Hyoil Han, and Il-Yeol Song. XML-OLAP: A Multidimensional Analysis

Framework for XML Warehouses. In Data Warehousing and Knowledge Discovery, pages

32–42, Berlin, Heidelberg, August 2005.

[79] Xin Bai. Testing the Performance of an SSAS Cube Using VSTS. In 7th International

Conference on Information Technology: New Generations, pages 986–991, Las Vegas, NV,

USA, April 2010.

[80] James F. Allen. Natural Language Processing. In Encyclopedia of Computer Science, pages

1218–1222. John Wiley and Sons Ltd., Chichester, UK, 2003.

[81] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The Ora-

cle Problem in Software Testing: A Survey. IEEE Transactions on Software Engineering,

41(5):507–525, 2015.

[82] SqlRender. (Accessed 2017-09-20).

[83] Sqlrender With BigQuery Supports. (Ac-

cessed 2017-09-20).

[84] Querysurge. (Accessed 2017-10-15).

[85] Databene benerator. (Accessed 2017-05-15).

[86] Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. The Ora-

cle Problem in Software Testing: A Survey. IEEE Transactions on Software Engineering,

41(5):507–525, 2015.

[87] Des Greer and Yann Hamon. Agile Software Development. Software: Practice and Experi-

ence, 41(9):943–944, 2011.

[88] Lawrence Corr and Jim Stagnitto. Agile Data Warehouse Design: Collaborative Dimensional

Modeling, from Whiteboard to Star Schema. DecisionOne Press, 3rd edition, 2011.


Page 96

[89] Roy P. Pargas, Mary Jean Harrold, and Robert R. Peck. Test-Data Generation Using Genetic

Algorithms. Software Testing, Verification And Reliability, 9:263–282, 1999.

[90] Henk C. A. Van Tilborg and Sushil Jajodia, editors. Encyclopedia of Cryptography and

Security. Springer, 2nd edition, 2011.

[91] L. P. Carloni, K. L. McMillan, A. Saldanha, and A. L. Sangiovanni-Vincentelli. A Method-

ology for Correct-by-construction Latency Insensitive Design. In IEEE/ACM International

Conference on Computer-Aided Design. Digest of Technical Papers (Cat. No.99CH37051),

pages 309–315, November 1999.


Similer Documents