
Before we get into what an Ensemble technique in Machine Learning is, let us first understand a few challenges associated with building an efficient and accurate Machine Learning model:

1. Bias - If our model is skewed towards some data points, it cannot capture the true relationship in the data. Bias is a type of error in which the model's weights are not properly estimated, leading to skewed, less accurate results and larger systematic errors.

"Higher the Bias less accurate will be our model".

2. Variance - The difference between a model's accuracy on the training data and its accuracy on the test data is called 'Variance'. A model with high variance fits the training data very closely, including its noise, and therefore fails to generalize to the test data, which is the classic sign of Overfitting.

"Higher the Variance less accurate will be our model".

3. Overfitting - When we train our model on a lot of data, there is a chance that it learns from noise and inaccurate data points in the dataset. The model then fails to categorize new data correctly because it has memorized too much noise and detail.

"Overfitting is when High Variance and Low Bias is present in a model".

4. Underfitting - In this situation the model fails to capture the underlying trend at all, which destroys the accuracy of our Machine Learning model. It usually happens when the model is too simple or has not been trained on sufficient data points, just like trying to fit a linear model to non-linear data.

"Underfitting is when Low Variance and High Bias is present in a model".

To deal with this problem we need a model with "Low Variance and Low Bias", which is considered the ideal good-fit model for making better predictions and extracting the best insights from our dataset. The point where both Variance and Bias are low is called the "sweet spot between a simple model and a complex model", and we can look for it using Regularization, Bagging, Boosting and Stacking.
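
Here is a minimal sketch, assuming scikit-learn is installed, of how that sweet spot can be located in practice: we compare training and test accuracy for decision trees of increasing depth. A large gap between the two scores signals high variance, while low scores on both signal high bias.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in [1, 3, 5, 10, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)   # big gap to test_acc -> high variance
    test_acc = model.score(X_test, y_test)      # low on both sets -> high bias
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")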


Ensemble Machine Learning is a technique of combining the predictions of multiple Machine Learning models (also known as classifiers) trained on the same dataset to achieve better accuracy. It is one of the most efficient ways of building a Machine Learning model.

1. Strong Classifiers: models whose predictions perform really well on the given regression or classification task.

2. Weak Classifiers: models whose predictions perform only slightly better than random chance. We can use a single weak learner or combine many weak learners.

We can divide Ensemble learning techniques into Simple and Advanced Ensemble learning techniques (a short sketch of the simple ones follows this list): -
1. Simple
        a. Max Voting.
        b. Averaging.
        c. Weighted Averaging.
2. Advanced
        a. Stacking.
        b. Blending.
        c. Bagging.
        d. Boosting.
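
As a minimal sketch of the three simple techniques, assuming scikit-learn: hard voting implements Max Voting, soft voting averages the predicted class probabilities, and the weights argument turns that average into a Weighted Average.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)
base_models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=3)),
    ("knn", KNeighborsClassifier()),
]

max_voting = VotingClassifier(base_models, voting="hard")                    # majority vote
averaging = VotingClassifier(base_models, voting="soft")                     # mean of probabilities
weighted = VotingClassifier(base_models, voting="soft", weights=[2, 1, 1])   # weighted mean

for name, clf in [("max voting", max_voting), ("averaging", averaging), ("weighted averaging", weighted)]:
    print(name, clf.fit(X, y).score(X, y))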

Regularization: 

Let's begin by understanding the Regularization techniques used to improve a model's accuracy and to control overfitting (basically controlling high Variance). Regularization does not improve the model's fit on the training data, but as an advantage it can improve how well the model generalizes to new and unseen data.

The three main Regularization techniques are: -

1.  L2 penalty/L2 Norm - Ridge Regression method. 

2. L1 penalty/L1 Norm - Lasso Regression.

3. Dropout.

We can use Ridge and Lasso with any algorithm that involves weighted parameters, including Neural Networks, whereas Dropout is used primarily for Neural Networks such as ANNs, CNNs, DNNs and RNNs to moderate their learning.


1. Ridge Regularization (L2 Norm): The main purpose of Ridge Regularization is to find a new line that does not completely overfit the training data, which means we introduce a small amount of bias into how the new line fits the data. By accepting a slightly worse fit with some bias, we gain a significant improvement in the long-run predictions and accuracy on the test data.

Fig: - Ridge Regression Formulation

Ridge Regression penalizes the sum of the squared coefficients. The Ridge penalty (Lambda * slope^2) is added to the sum of squared residuals of the Least Squares (regression) line, so the Ridge Regression line still fits most of the data points without chasing every one of them. The value of 'Lambda' is chosen by trying values from 0 upwards over a range of positive numbers and picking the best one using Cross Validation.

If Lambda = 0, the Ridge line is the same as the ordinary Least Squares regression line. The larger the value of Lambda, the less steep the slope of the fitted line becomes.
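
A minimal sketch of this, assuming scikit-learn (where Lambda is called "alpha"): RidgeCV tries a range of penalty values and keeps the one that cross-validates best.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Try lambdas from (almost) 0 upwards; cross validation picks the best one.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5).fit(X, y)
print("best lambda (alpha):", ridge.alpha_)
print("shrunken coefficients:", ridge.coef_)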


2. Lasso Regularization (L1 Norm): penalizes the absolute values of the coefficients. Lasso Regression is very similar to the Ridge Regression technique, but along with improving the predictions it also helps in performing feature selection, because the Lasso penalty can shrink coefficients to exactly zero.


Fig: Lasso Regression Formulation
Lasso Regression can exclude useless variables from an equation, making the final equation easier and simpler to work with - in effect performing feature selection. Hence Lasso Regression is best to use when we have lots of useless parameters.
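
A minimal sketch, assuming scikit-learn, of how the lasso penalty drops useless features by driving their coefficients exactly to zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 3 of the 15 features actually carry signal.
X, y = make_regression(n_samples=200, n_features=15, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
print("features kept by lasso:", np.flatnonzero(lasso.coef_))        # roughly the 3 informative ones
print("features dropped:", np.flatnonzero(lasso.coef_ == 0))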

3. Elastic Net Regression: is a hybrid of the Lasso and Ridge Regression techniques. It is used when multiple features are correlated with one another.

Fig: Elastic Net Regression
1. When Lambda1 and Lambda2 are both 0, it reduces to the Least Squares regression line.
2. If Lambda1 = 0, it becomes Ridge Regression.
3. If Lambda2 = 0, it becomes Lasso Regression.
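
A minimal sketch, assuming scikit-learn: ElasticNetCV cross-validates both the overall penalty strength and the L1/L2 mix (l1_ratio=1 is pure lasso, l1_ratio close to 0 behaves like ridge).

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Cross-validate the penalty strength and the L1/L2 mix together.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print("best l1_ratio:", enet.l1_ratio_)
print("best alpha:", enet.alpha_)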

4. Dropout: Dropout is a Regularization technique we mostly use while building a Neural Network model, as it prevents complex co-adaptations between neurons. In Neural Networks the fully connected layers are the most prone to overfitting the training dataset. With Dropout, during training each unit in the specified layers is dropped with probability 1 - p (a probability parameter that needs to be tuned), so every training step effectively sees a reduced network; at test time the full network is used again.

Fig: Dropout Regularization
Dropout is cheap to apply and helps the network learn more robust internal features that generalize better to random and unseen data.
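
A minimal sketch, assuming TensorFlow/Keras is available: Dropout layers randomly zero out units during training only, and the full network is used at prediction time.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # drop each unit with probability 0.5 during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])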

Now let us discuss some of the advanced Ensemble Machine Learning Techniques: -

1. Bagging: short for Bootstrap Aggregating, is a Machine Learning technique designed to reduce variance and improve the accuracy and stability of algorithms used in statistical classification and regression. Bagging reduces variance and helps prevent overfitting to the training data.
Bagging works by creating multiple bootstrap samples of the training dataset, fitting a model (typically a decision tree) to each sample independently and in parallel, and then combining these weak classifiers using a deterministic averaging (or voting) process.

Fig: - Bagging Ensemble technique

Random Forest is an extension of the Bagging method in which, while constructing each decision tree, every split considers only a small random subset of the features instead of all the features together.
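
A minimal sketch, assuming scikit-learn: BaggingClassifier averages trees fitted on bootstrap samples (its default base estimator is a decision tree), while RandomForestClassifier additionally restricts each split to a random subset of features via max_features.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bagging = BaggingClassifier(n_estimators=100, random_state=0)                      # bagged decision trees
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)  # + random feature subsets

print("bagging:", cross_val_score(bagging, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())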

2. Boosting: is an Ensemble Machine Learning technique used in supervised learning to reduce bias (and also variance) by adding models sequentially to the ensemble, where each new model attempts to correct the errors made by the prior models. Models keep being added until the error stops improving; adding too many, however, can eventually overfit the training data.
Fig: - Boosting Ensemble Technique
AdaBoost, Gradient Boosting and XGBoost are some of the most successful Boosting algorithms used in Ensemble Machine Learning.
AdaBoost uses very simple trees that make a single decision on one input variable before making a prediction; these short trees are referred to as Decision Stumps.
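
A minimal sketch, assuming scikit-learn: AdaBoostClassifier's default base learner is already a depth-1 tree (a decision stump), and each new stump focuses on the examples the previous ones got wrong.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Sequentially adds 200 decision stumps, each reweighting the misclassified points.
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
print("AdaBoost accuracy:", cross_val_score(ada, X, y, cv=5).mean())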

We can see the comparison between Bagging and Boosting as follows:
Fig: - Bagging vs Boosting

3. Stacking: like Bagging and Boosting, combines the predictions of multiple Machine Learning models trained on the same dataset, but instead of simple voting, another Machine Learning model learns how to combine the base models' predictions. This meta-model is most often a linear model such as Linear Regression (or Logistic Regression for classification), but we can still use any Machine Learning model.
Fig: - Stacking Ensemble Technique
Stacking typically uses K-fold Cross Validation (or a train-test split) to generate and store out-of-fold predictions from each base model; these predictions become the training data for the meta-model, the base models are then retrained on the entire training dataset, and the meta-model learns which base model to trust under which circumstances.
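
A minimal sketch, assuming scikit-learn: StackingClassifier trains the base models with internal cross validation and fits a logistic regression meta-model on their out-of-fold predictions.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
    ("knn", KNeighborsClassifier()),
]
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=5)   # out-of-fold predictions feed the meta-model
print("stacking accuracy:", cross_val_score(stack, X, y, cv=5).mean())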

Check out my GitHub repository about the XGBoost algorithm, one of the most widely used Ensemble Learning methods - https://github.com/HarishSingh2095/Ensemble-Learning_XGBoost-Algorithm


References:

For more references on Ensemble Learning you can visit https://machinelearningmastery.com/ensemble-machine-learning-with-python-7-day-mini-course/


Also do visit - https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205 for further references.


You can connect with me on - 

Linkedin - https://www.linkedin.com/in/harish-singh-166b63118

Twitter - @harisshh_singh

Gmail - hs02863@gmail.com

 

End notes:

Hope this was useful for beginners in the field of Data Science. 

See you guys until next time.



Early in the 1970s, organizations began to depend more on data from heterogeneous sources for many of their projects and developments. During this period ETL tools started gaining a lot of popularity: they Extract data from heterogeneous sources, Transform the information into consistent data types and then finally Load the data into a single data repository or central Data Warehouse. An ETL tool also validates and verifies the data for duplicate records or data loss.


When referring to Business Intelligence quality assurance we come across the terms Data Warehouse testing and ETL testing, which are often used interchangeably as one and the same. But when we take a deeper look into both of these methods, we see that ETL testing is a sub-component of overall DWH testing. A data warehouse is essentially built using data extractions, data transformations and data loads. ETL processes extract data from sources, transform the data according to BI reporting requirements, then load the data to a target data warehouse.


ETL testing applies to Data Warehouses or data integration projects, while Database Testing applies to any database holding data (typically transactional systems) and focuses on fast, correct and efficient data processing for quicker reporting.


Data Reconciliation (DR) describes a verification phase during a data migration where the target data is compared against the original source data to ensure that the migration architecture has transferred the data correctly. In Database testing the reconciliation is simple, as it runs directly from source table to target table for transactional (OLTP) purposes, whereas in ETL testing the reconciliation is applied to data warehouse systems and used to obtain relevant information for analytics and business intelligence (OLAP).
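
A minimal reconciliation sketch, assuming pandas and two hypothetical CSV extracts (the file names and the 'amount' column are illustrative, not from a real system): compare row counts and a simple aggregate between source and target after a migration.

import pandas as pd

source = pd.read_csv("source_extract.csv")   # hypothetical OLTP extract
target = pd.read_csv("target_extract.csv")   # hypothetical warehouse extract

assert len(source) == len(target), "row counts differ - records lost or duplicated"

# Compare a simple aggregate on a hypothetical numeric column.
if round(source["amount"].sum(), 2) != round(target["amount"].sum(), 2):
    print("amount totals differ - investigate the transformation rules")
else:
    print("row counts and amount totals reconcile")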

Fig: - The ETL Architecture


After selecting data from the sources, ETL procedures resolve problems in the data, convert data into a common model appropriate for research and analysis, and write the data to staging and cleansing areas, then finally to the target data warehouse.
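
A minimal end-to-end ETL sketch, assuming pandas and Python's built-in sqlite3, with a hypothetical 'customers.csv' source and column names chosen purely for illustration: extract the raw records, transform them into a clean common model, then load them into a target table standing in for the warehouse.

import sqlite3
import pandas as pd

# Extract: pull raw records from a source system (hypothetical CSV export).
raw = pd.read_csv("customers.csv")

# Transform: drop duplicates, fix types and standardise values.
clean = (raw.drop_duplicates(subset="customer_id")
            .assign(signup_date=lambda df: pd.to_datetime(df["signup_date"]),
                    country=lambda df: df["country"].str.upper()))

# Load: write the conformed data to the target warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("dim_customer", conn, if_exists="replace", index=False)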


An organization collects lots of data for its business purposes, which can arrive in any of the following forms -


1.    Structured format (RDBMS data that SQL can query easily), like CRM (customer relationship management info. - cname, cid, cnumber, etc...) or organizational data in table form (no. of employees, eid, salary, etc...).

2.    XML files, which are both human-readable and machine-readable, perhaps coming from many of their sub-branches or subsidiaries.

3.    Semi-structured and unstructured data (JSON files, typically handled by NoSQL databases such as MongoDB), like social media information from web pages or social media handles - most analytics and recommendations can be drawn from this.

4.    Flat files (text files) from customers filling in application forms, call logs, complaints or other details captured as text.

5.    Many other formats of data from multiple other sources.



All of this data from heterogeneous sources is run through the ETL process (Extract, Transform and Load) and then loaded into a Data Warehouse (to which only the company has access). Some of the best data warehouse platforms for Online Analytical Processing (OLAP), typically queried with tools like Power BI and Tableau, are: Amazon Redshift, Microsoft Azure, Google BigQuery, Snowflake, Micro Focus Vertica, Teradata, Amazon DynamoDB, PostgreSQL, etc.
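
A minimal extraction sketch for the heterogeneous formats listed above, assuming pandas (read_xml additionally needs the lxml package) and hypothetical file names chosen only for illustration:

import pandas as pd

crm = pd.read_csv("crm_customers.csv")               # structured RDBMS export
branches = pd.read_xml("branch_report.xml")          # XML from subsidiaries
social = pd.read_json("mentions.json", lines=True)   # semi-structured JSON
calls = pd.read_fwf("call_logs.txt")                 # fixed-width flat file

# Each source now sits in a DataFrame, ready to be transformed and loaded.
print(crm.shape, branches.shape, social.shape, calls.shape)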


A data warehouse keeps data gathered and integrated from different sources and stores the large number of records needed for long-term analysis. Implementations of data warehouses use various data models (such as dimensional or normalized models), and technologies (such as DBMS, Data Warehouse Appliance (DWA), and cloud data warehouse appliances).


Based on the business goals there are 4 general categories of ETL testing: -

    1. New System testing  - tests the data obtained from various sources.

    2. Migration testing - tests the data transferred from source to DWH.

    3. Change testing - involves the testing of new data added to DWH.

    4. Report testing - validates the final data and makes necessary calculations.


ETL’s processes present many challenges, such as extracting data from multiple heterogeneous sources involving different data models, detecting and fixing a wide variety of errors/issues in data, then transforming the data into different formats that match the requirements of the target data warehouse.


Top 7 ETL Tools for 2021 are: -

1. Xplenty

2. Talend

3. Informatica Power Center

4. Stitch

5. Oracle Data Integrator

6. Skyvia

7. Fivetran


Below are some of the major differences between Database testing and ETL testing: -



Since an ETL tool typically runs on a single machine, it has some restrictions: -


1. Storage becomes an issue for the huge amounts of data coming from so many sources, which can vary drastically within a very short time.


2. Data arrives at the ETL tool in batches rather than in real time, which delays customization and recommendations for customers.


3. Setting up a DWH involves a lot of money, as the warehouse itself is very expensive and many ETL tools are very expensive too.



Because of these drawbacks, a concept called Massively Parallel Processing (MPP) was initially introduced, where ETL and DWH tools were installed on multiple machines to run processes in parallel. But this was also a very tedious and hectic process, involving merging and joining data from different countries and sources.


Polyglot persistence is a term that refers to using multiple data storage technologies for varying data storage needs across an application or within smaller components of an application.

 

Hence the Hadoop ecosystem was introduced to deal with the problems of Big Data storage and real-time processing of data in faster and more effective ways.




You can connect  with me on -


Linkedin - https://www.linkedin.com/in/harish-singh-166b63118


Twitter - @harisshh_singh


Gmail - hs02863@gmail.com


End notes: -

Hope this was useful for beginners in the field of Data Science. 

See you guys until next time.