In the early 1970s, organizations began to depend more on data from heterogeneous sources for many of their projects and developments. During this period ETL tools started gaining a lot of popularity: they Extract data from heterogeneous sources, Transform the information into consistent data types, and finally Load the data into a single data repository or a central data warehouse. ETL tools also validate and verify the data, checking for duplicate records and data loss.
When we talk about Business Intelligence quality assurance, we come across the terms Data Warehouse testing and ETL testing, which are often used interchangeably as one and the same. But when we take a deeper look at both, we see that ETL testing is a sub-component of overall DWH testing. A data warehouse is essentially built using data extractions, data transformations, and data loads: ETL processes extract data from sources, transform the data according to BI reporting requirements, then load the data into a target data warehouse.
ETL testing applies to data warehouses or data integration projects, while Database testing applies to any database holding data (typically transactional systems) that supports fast, efficient data processing for quick reporting.
Data Reconciliation (DR) describes a verification phase during a data migration where the target data is compared against the original source data to ensure that the migration architecture has transferred the data correctly. In Database testing the reconciliation is simple, since it runs directly from source table to target table for transactional (OLTP) purposes, whereas in ETL testing the reconciliation is applied to data warehouse systems used to obtain relevant information for analytics and business intelligence (OLAP).
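As a rough illustration of what such a reconciliation check can look like, here is a minimal sketch in Python/pandas; the key column (cid), the other column names, and the sample frames are made up for the example.

```python
import pandas as pd

def reconcile(source_df: pd.DataFrame, target_df: pd.DataFrame, key: str) -> dict:
    """Compare a source extract against the loaded target on simple metrics."""
    report = {
        # Row counts should match unless rows were intentionally filtered.
        "source_rows": len(source_df),
        "target_rows": len(target_df),
        # Keys present in the source but missing from the target indicate data loss.
        "missing_keys": set(source_df[key]) - set(target_df[key]),
        # Duplicate keys in the target indicate duplicate records.
        "duplicate_target_keys": int(target_df[key].duplicated().sum()),
    }
    report["row_count_match"] = report["source_rows"] == report["target_rows"]
    return report

# Hypothetical usage: customer records migrated from an OLTP table to the warehouse.
source = pd.DataFrame({"cid": [1, 2, 3], "cname": ["A", "B", "C"]})
target = pd.DataFrame({"cid": [1, 2], "cname": ["A", "B"]})
print(reconcile(source, target, key="cid"))
```

Real reconciliation suites add column-level checksums, null checks and business-rule checks on top of these basic counts.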
After selecting data from the sources, ETL procedures resolve problems in the data, convert data into a common model appropriate for research and analysis, and write the data to staging and cleansing areas—then finally to the target data warehouse.
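A bare-bones sketch of that flow, assuming a hypothetical customers.csv export as the source, a made-up common model (cid, cname, cnumber), and SQLite standing in for the staging and warehouse tables:

```python
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: read raw records from a source system (a CSV export here).
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Transform: resolve problems in the data and convert it to a common model.
    cleaned = raw.drop_duplicates(subset="cid")          # remove duplicate records
    cleaned = cleaned.dropna(subset=["cid", "cname"])    # drop incomplete rows
    cleaned["cname"] = cleaned["cname"].str.strip().str.title()
    return cleaned[["cid", "cname", "cnumber"]]          # consistent column set

def load(df: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # Load: write to a staging table first, then into the warehouse table.
    df.to_sql("stg_customers", conn, if_exists="replace", index=False)
    df.to_sql("dwh_customers", conn, if_exists="append", index=False)

conn = sqlite3.connect("warehouse.db")
load(transform(extract("customers.csv")), conn)
```

In a real project the staging and warehouse tables live on a proper DWH platform and the transformations follow the BI reporting requirements, but the extract, transform, load shape stays the same.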
An organization collects a lot of data for its business purposes, which can come in any of the following forms (a small ingestion sketch follows the list): -
1. Structured format (RDBMS data that can easily be queried with SQL), e.g. CRM (customer relationship management) information such as cname, cid, cnumber, etc., or organizational data in table form (number of employees, eid, salary, etc.).
2. XML files, which are both human-readable and machine-readable, often coming from sub-branches or subsidiaries.
3. Unstructured or semi-structured data (JSON documents, typically handled by NoSQL stores such as MongoDB), like social media information from web pages or social media handles; most analytics and recommendations can be built from this.
4. Flat files (text files) from customers filling in application forms, call logs, complaints or other details captured as text.
5. Many other formats of data from multiple other sources.
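To make the variety concrete, here is a minimal ingestion sketch that pulls each of these formats into pandas DataFrames so they can later be mapped to a common model; every file, table and column name below is hypothetical.

```python
import json
import sqlite3
import pandas as pd
import xml.etree.ElementTree as ET

# 1. Structured data: a relational table queried with SQL (database and table are made up).
conn = sqlite3.connect("crm.db")
crm = pd.read_sql_query("SELECT cid, cname, cnumber FROM customers", conn)

# 2. XML: a human- and machine-readable export from a subsidiary.
root = ET.parse("branch_sales.xml").getroot()
branch = pd.DataFrame([{child.tag: child.text for child in record} for record in root])

# 3. Semi/unstructured JSON, e.g. social-media records.
with open("social_feed.json") as fh:
    social = pd.json_normalize(json.load(fh))

# 4. Flat text files, e.g. call logs, one entry per line.
with open("call_logs.txt") as fh:
    calls = pd.DataFrame({"log_line": [line.rstrip("\n") for line in fh]})
```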
Based on the business goals there are 4 general categories of ETL testing: -
1. New System testing - tests the data obtained from various sources.
2. Migration testing - tests the data transferred from source to DWH.
3. Change testing - involves the testing of new data added to DWH.
4. Report testing - validates the final data and verifies the calculations behind it (a small check of this kind is sketched below).
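For report testing in particular, one common technique is to recompute a reported measure independently and compare it with the value the ETL job actually loaded. Below is a minimal sketch; the fact table and the revenue = qty * price rule are made-up examples.

```python
import pandas as pd

# Hypothetical warehouse extract behind a revenue report.
fact_sales = pd.DataFrame({
    "order_id": [101, 102, 103],
    "qty":      [2, 1, 5],
    "price":    [10.0, 99.5, 3.0],
    "revenue":  [20.0, 99.5, 15.0],   # value stored by the ETL job
})

# Report test: recompute the measure and compare with what was loaded.
expected = fact_sales["qty"] * fact_sales["price"]
mismatches = fact_sales[~expected.eq(fact_sales["revenue"])]

assert mismatches.empty, f"Revenue calculation is wrong for:\n{mismatches}"
print("Report figures match the recomputed values.")
```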
ETL processes present many challenges, such as extracting data from multiple heterogeneous sources with different data models, detecting and fixing a wide variety of errors and issues in the data, and then transforming the data into the formats required by the target data warehouse.
Top 7 ETL Tools for 2021 are: -
1. Xplenty
2. Talend
3. Informatica Power Center
4. Stitch
5. Oracle Data Integrator
6. Skyvia
7. Fivetran
Below are some of the major differences between Database testing and ETL testing: -
1. Database testing is performed on transactional (OLTP) systems, whereas ETL testing is performed on data warehouse / data integration (OLAP) systems.
2. In Database testing, data is verified directly from source table to target table; in ETL testing, data is verified as it moves from heterogeneous sources through transformations into the data warehouse.
3. Database testing focuses on data integrity and CRUD operations on normalized schemas, whereas ETL testing focuses on data movement, transformation logic, counts and duplicates on denormalized (dimensional) models.
4. Database testing supports day-to-day operations and quick reporting, whereas ETL testing supports analytics and business intelligence.
Since a traditional ETL tool runs on a single machine, it has some restrictions: -
1. Storage issues, given the huge amount of data coming from so many sources and growing drastically in a very short time.
2. Data coming into ETL is not real time, which delays customization and recommendations to customers.
3. The DWH setup involves a lot of money, as it is very expensive, and some of the ETL tools are very expensive too.
Because of these drawbacks, a concept called Massively Parallel Processing (MPP) was initially introduced, where ETL and DWH tools were installed on multiple machines to run processes in parallel. But this was also a very tedious and hectic process, involving merging and joining data from different countries or sources.
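Very roughly, the MPP idea is to split the work by source or partition, run the extract/transform steps in parallel, and then merge the results. The sketch below uses Python's multiprocessing on a single machine purely to illustrate that partition-and-merge pattern; the per-country file names are hypothetical.

```python
from multiprocessing import Pool
import pandas as pd

# Hypothetical per-country source files processed by separate workers.
PARTITIONS = ["sales_us.csv", "sales_uk.csv", "sales_in.csv"]

def process_partition(path: str) -> pd.DataFrame:
    # Each worker extracts and transforms its own slice of the data.
    df = pd.read_csv(path)
    df["source_file"] = path
    return df.drop_duplicates()

if __name__ == "__main__":
    with Pool(processes=len(PARTITIONS)) as pool:
        parts = pool.map(process_partition, PARTITIONS)
    # The merge/join step the post calls tedious: combine all partitions.
    combined = pd.concat(parts, ignore_index=True)
    print(len(combined), "rows after merging all partitions")
```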
Polyglot persistence is a term that refers to using multiple data storage technologies for varying data storage needs across an application or within smaller components of an application.
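A toy illustration of the idea, with each kind of record routed to the store that suits it; the mapping below is an arbitrary example, not a recommendation.

```python
# Hypothetical routing table: each kind of data goes to the store that suits it.
STORAGE_ROUTES = {
    "orders":        "relational database (RDBMS)",   # transactional, needs joins/ACID
    "product_pages": "document store",                # flexible, nested JSON documents
    "user_sessions": "in-memory key-value store",     # short-lived, read/write heavy
    "clickstream":   "distributed file system",       # huge append-only logs for analytics
}

def route(record_type: str) -> str:
    """Pick a storage technology for a given kind of record."""
    return STORAGE_ROUTES.get(record_type, "default relational database")

print(route("clickstream"))
```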
Hence Hadoop and its ecosystem were introduced to deal with the problems of Big Data storage and large-scale processing of data in faster and more effective ways.
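As a small taste of what distributed processing looks like in practice, here is a minimal PySpark sketch (Spark commonly runs on top of a Hadoop/HDFS cluster); the file name and the country column are assumptions for the example.

```python
from pyspark.sql import SparkSession

# Start a Spark session; on a real cluster this would connect to YARN/HDFS.
spark = SparkSession.builder.appName("etl-demo").getOrCreate()

# Hypothetical large file; on a cluster this could be an HDFS path such as
# hdfs:///data/sales.csv, read in parallel across worker nodes.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A distributed aggregation: the group-by runs across the cluster's workers.
sales.groupBy("country").count().show()

spark.stop()
```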
You can connect with me on -
LinkedIn - https://www.linkedin.com/in/harish-singh-166b63118
Twitter - @harisshh_singh
Gmail - hs02863@gmail.com
Hope this was useful for beginners in the field of Data Science.
See you guys until next time.






