In the early 1970s, organizations began to depend more on data from heterogeneous sources for many of their projects and developments. During this period ETL tools started gaining a lot of popularity: they Extract data from heterogeneous sources, Transform the information into consistent data types, and finally Load the data into a single data repository or a central data warehouse. ETL tools also validate and verify the data, checking for duplicate records and data loss.
When we talk about Business Intelligence quality assurance, we come across the terms Data Warehouse testing and ETL testing, which are often used interchangeably as one and the same. But when we take a deeper look at both, we see that ETL testing is a sub-component of overall DWH testing. A data warehouse is essentially built using data extractions, data transformations, and data loads: ETL processes extract data from sources, transform the data according to BI reporting requirements, then load the data into a target data warehouse.
ETL testing applies to data warehouses or data integration projects, while Database testing applies to any database holding data (typically transactional systems) that supports fast, efficient data processing for quick reporting.
Data Reconciliation (DR) describes a verification phase during a data migration where the target data is compared against the original source data to ensure that the migration architecture has transferred the data correctly. In Database testing the reconciliation is simple, since it runs directly from source table to target table for transactional (OLTP) purposes, whereas in ETL testing the reconciliation is applied to data warehouse systems used to obtain relevant information for analytics and business intelligence (OLAP).
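As a rough illustration of what such a reconciliation check can look like, here is a minimal sketch in Python/pandas; the key column (cid), the other column names, and the sample frames are made up for the example.

```python
import pandas as pd

def reconcile(source_df: pd.DataFrame, target_df: pd.DataFrame, key: str) -> dict:
    """Compare a source extract against the loaded target on simple metrics."""
    report = {
        # Row counts should match unless rows were intentionally filtered.
        "source_rows": len(source_df),
        "target_rows": len(target_df),
        # Keys present in the source but missing from the target indicate data loss.
        "missing_keys": set(source_df[key]) - set(target_df[key]),
        # Duplicate keys in the target indicate duplicate records.
        "duplicate_target_keys": int(target_df[key].duplicated().sum()),
    }
    report["row_count_match"] = report["source_rows"] == report["target_rows"]
    return report

# Hypothetical usage: customer records migrated from an OLTP table to the warehouse.
source = pd.DataFrame({"cid": [1, 2, 3], "cname": ["A", "B", "C"]})
target = pd.DataFrame({"cid": [1, 2], "cname": ["A", "B"]})
print(reconcile(source, target, key="cid"))
```

Real reconciliation suites add column-level checksums, null checks and business-rule checks on top of these basic counts.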
After selecting data from the sources, ETL procedures resolve problems in the data, convert data into a common model appropriate for research and analysis, and write the data to staging and cleansing areas—then finally to the target data warehouse.
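A bare-bones sketch of that flow, assuming a hypothetical customers.csv export as the source, a made-up common model (cid, cname, cnumber), and SQLite standing in for the staging and warehouse tables:

```python
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: read raw records from a source system (a CSV export here).
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: resolve problems in the data and convert it to a common model.
    cleaned = raw.drop_duplicates(subset="cid")          # remove duplicate records
    cleaned = cleaned.dropna(subset=["cid", "cname"])    # drop incomplete rows
    cleaned["cname"] = cleaned["cname"].str.strip().str.title()
    return cleaned[["cid", "cname", "cnumber"]]          # consistent column set

def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # Load: write to a staging table first, then into the warehouse table.
    df.to_sql("stg_customers", conn, if_exists="replace", index=False)
    df.to_sql("dwh_customers", conn, if_exists="append", index=False)

conn = sqlite3.connect("warehouse.db")
load(transform(extract("customers.csv")), conn)
```

In a real project the staging and warehouse tables live on a proper DWH platform and the transformations follow the BI reporting requirements, but the extract, transform, load shape stays the same.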
An organization collects a lot of data for its business purposes, which can come in any of the following forms (a small ingestion sketch follows the list): -
1. Structured format (RDBMS data that can easily be queried with SQL), e.g. CRM (customer relationship management) information such as cname, cid, cnumber, etc., or organizational data in table form (number of employees, eid, salary, etc.).
2. XML files, which are both human-readable and machine-readable, often coming from sub-branches or subsidiaries.
3. Unstructured or semi-structured data (JSON documents, typically handled by NoSQL stores such as MongoDB), like social media information from web pages or social media handles; most analytics and recommendations can be built from this.
4. Flat files (text files) from customers filling in application forms, call logs, complaints or other details captured as text.
5. Many other formats of data from multiple other sources.
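To make the variety concrete, here is a minimal ingestion sketch that pulls each of these formats into pandas DataFrames so they can later be mapped to a common model; every file, table and column name below is hypothetical.

```python
import json
import sqlite3
import pandas as pd
import xml.etree.ElementTree as ET

# 1. Structured data: a relational table queried with SQL (database and table are made up).
conn = sqlite3.connect("crm.db")
crm = pd.read_sql_query("SELECT cid, cname, cnumber FROM customers", conn)

# 2. XML: a human- and machine-readable export from a subsidiary.
root = ET.parse("branch_sales.xml").getroot()
branch = pd.DataFrame([{child.tag: child.text for child in record} for record in root])

# 3. Semi/unstructured JSON, e.g. social-media records.
with open("social_feed.json") as fh:
    social = pd.json_normalize(json.load(fh))

# 4. Flat text files, e.g. call logs, one entry per line.
with open("call_logs.txt") as fh:
    calls = pd.DataFrame({"log_line": [line.rstrip("\n") for line in fh]})
```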
Based on the business goals there are 4 general categories of ETL testing: -
1. New System testing - tests the data obtained from various sources.
2. Migration testing - tests the data transferred from source to DWH.
3. Change testing - involves the testing of new data added to DWH.
4. Report testing - validates the final data and verifies the calculations behind it (a small check of this kind is sketched below).
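For report testing in particular, one common technique is to recompute a reported measure independently and compare it with the value the ETL job actually loaded. Below is a minimal sketch; the fact table and the revenue = qty * price rule are made-up examples.

```python
import pandas as pd

# Hypothetical warehouse extract behind a revenue report.
fact_sales = pd.DataFrame({
    "order_id": [101, 102, 103],
    "qty":      [2, 1, 5],
    "price":    [10.0, 99.5, 3.0],
    "revenue":  [20.0, 99.5, 15.0],   # value stored by the ETL job
})

# Report test: recompute the measure and compare with what was loaded.
expected = fact_sales["qty"] * fact_sales["price"]
mismatches = fact_sales[~expected.eq(fact_sales["revenue"])]

assert mismatches.empty, f"Revenue calculation is wrong for:\n{mismatches}"
print("Report figures match the recomputed values.")
```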
ETL processes present many challenges, such as extracting data from multiple heterogeneous sources with different data models, detecting and fixing a wide variety of errors and issues in the data, and then transforming the data into the formats required by the target data warehouse.
Top 7 ETL Tools for 2021 are: -
1. Xplenty
2. Talend
3. Informatica Power Center
4. Stitch
5. Oracle Data Integrator
6. Skyvia
7. Fivetran
Below are some of the major differences between Database testing and ETL testing: -
1. Database testing is performed on transactional (OLTP) systems, whereas ETL testing is performed on data warehouse / data integration (OLAP) systems.
2. In Database testing, data is verified directly from source table to target table; in ETL testing, data is verified as it moves from heterogeneous sources through transformations into the data warehouse.
3. Database testing focuses on data integrity and CRUD operations on normalized schemas, whereas ETL testing focuses on data movement, transformation logic, counts and duplicates on denormalized (dimensional) models.
4. Database testing supports day-to-day operations and quick reporting, whereas ETL testing supports analytics and business intelligence.
Since a traditional ETL tool runs on a single machine, it has some restrictions: -
1. Storage issues, given the huge amount of data coming from so many sources and growing drastically in a very short time.
2. Data coming into ETL is not real time, which delays customization and recommendations to customers.
3. The DWH setup involves a lot of money, as it is very expensive, and some of the ETL tools are very expensive too.
Because of these drawbacks, a concept called Massively Parallel Processing (MPP) was initially introduced, where ETL and DWH tools were installed on multiple machines to run processes in parallel. But this was also a very tedious and hectic process, involving merging and joining data from different countries or sources.
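Very roughly, the MPP idea is to split the work by source or partition, run the extract/transform steps in parallel, and then merge the results. The sketch below uses Python's multiprocessing on a single machine purely to illustrate that partition-and-merge pattern; the per-country file names are hypothetical.

```python
from multiprocessing import Pool
import pandas as pd

# Hypothetical per-country source files processed by separate workers.
PARTITIONS = ["sales_us.csv", "sales_uk.csv", "sales_in.csv"]

def process_partition(path: str) -> pd.DataFrame:
    # Each worker extracts and transforms its own slice of the data.
    df = pd.read_csv(path)
    df["source_file"] = path
    return df.drop_duplicates()

if __name__ == "__main__":
    with Pool(processes=len(PARTITIONS)) as pool:
        parts = pool.map(process_partition, PARTITIONS)
    # The merge/join step the post calls tedious: combine all partitions.
    combined = pd.concat(parts, ignore_index=True)
    print(len(combined), "rows after merging all partitions")
```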
Polyglot persistence is a term that refers to using multiple data storage technologies for varying data storage needs across an application or within smaller components of an application.
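A toy illustration of the idea, with each kind of record routed to the store that suits it; the mapping below is an arbitrary example, not a recommendation.

```python
# Hypothetical routing table: each kind of data goes to the store that suits it.
STORAGE_ROUTES = {
    "orders":        "relational database (RDBMS)",   # transactional, needs joins/ACID
    "product_pages": "document store",                # flexible, nested JSON documents
    "user_sessions": "in-memory key-value store",     # short-lived, read/write heavy
    "clickstream":   "distributed file system",       # huge append-only logs for analytics
}

def route(record_type: str) -> str:
    """Pick a storage technology for a given kind of record."""
    return STORAGE_ROUTES.get(record_type, "default relational database")

print(route("clickstream"))
```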
Hence Hadoop and its ecosystem were introduced to deal with the problems of Big Data storage and large-scale processing of data in faster and more effective ways.
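As a small taste of what distributed processing looks like in practice, here is a minimal PySpark sketch (Spark commonly runs on top of a Hadoop/HDFS cluster); the file name and the country column are assumptions for the example.

```python
from pyspark.sql import SparkSession

# Start a Spark session; on a real cluster this would connect to YARN/HDFS.
spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Hypothetical large file; on a cluster this could be an HDFS path such as
# hdfs:///data/sales.csv, read in parallel across worker nodes.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A distributed aggregation: the group-by runs across the cluster's workers.
sales.groupBy("country").count().show()

spark.stop()
```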
You can connect with me on -
LinkedIn - https://www.linkedin.com/in/harish-singh-166b63118
Twitter - @harisshh_singh
Gmail - hs02863@gmail.com
Hope this was useful for beginners in the field of Data Science.
See you guys until next time.






