
Data Science - Exploratory Data Analysis

So here's what Harish found online when he searched for “What is Data Science?”


Data science is an interdisciplinary field that is related to computer science but is a distinct field of study. It deals with rapidly growing volumes of data retrieved from many sources, such as social media, devices connected to the Internet of Things, search queries, reviews and customer service inquiries, and reports collected from multiple places.

A data scientist needs to know how to store such huge amounts of information so that it can later be analyzed and useful information can be retrieved from it. The goal of data science is to gain insights and knowledge from any type of data, both structured and unstructured.


Since the end goal is to analyze the data we have stored, data science is closely related to statistics. It also relies on visualization to present the useful information in a form that is easy to understand.


From this we can conclude that data science requires skills in the collection, organization, analysis, and presentation of data.


Data science uses machine learning algorithms to build predictive models, and uses modern tools and techniques to derive meaningful information, uncover unseen patterns, and support business decisions.

It is true: if somebody already knows your likes and dislikes, delivering relevant content becomes easier, which makes for a more engaging shopping experience.


A data scientist spends most of their time on Exploratory Data Analysis (EDA) and on cleaning the dataset, as these steps play a key role in making sense of the raw data.


Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.


Below are the steps involved in EDA to understand, clean, and prepare the dataset for extracting useful information from the raw data:

1.    Variable Identification.

2.    Univariate Analysis.

3.    Bi-variate Analysis.

4.    Missing values treatment.

5.    Outlier treatment.

6.    Variable transformation.

7.    Variable creation.

fig: - EDA Step by Step
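To make steps 4 to 7 a little more concrete, here is a minimal sketch on a tiny made-up used-car table. The column names, the values, and the reference year are hypothetical, and the specific choices (median imputation, the 1.5 × IQR rule with clipping, a log transform) are just one common option for each step, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical used-car data; names and values are illustrative only.
df = pd.DataFrame({
    "odometer": [42000.0, 58000.0, np.nan, 61000.0, 75000.0, 950000.0],
    "price": [15500.0, 12900.0, 11000.0, np.nan, 9800.0, 9500.0],
    "year": [2018, 2016, 2015, 2015, 2013, 2012],
})

# Step 4, missing values treatment (assumption: fill gaps with the column median).
for col in ["odometer", "price"]:
    df[col] = df[col].fillna(df[col].median())

# Step 5, outlier treatment (assumption: 1.5 * IQR rule, clipping instead of dropping).
q1, q3 = df["odometer"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["odometer_clipped"] = df["odometer"].clip(lower=lower, upper=upper)

# Step 6, variable transformation (assumption: log transform to reduce skew).
df["log_price"] = np.log1p(df["price"])

# Step 7, variable creation (assumption: car age relative to a hypothetical year).
REFERENCE_YEAR = 2024
df["age"] = REFERENCE_YEAR - df["year"]

print(df)
```

Whether to clip, drop, or keep an outlier depends on the dataset, so treat the thresholds here as a starting point.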


Throughout these steps we deal with missing values, outliers, mean, median, mode, relationships, correlation, variance, covariance, important features, and the target variable in our dataset.
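As a rough illustration, the quantities mentioned above can be computed with pandas as follows. The table and its column names are made up for this sketch.

```python
import pandas as pd

# Hypothetical data; odometer and price are perfectly linearly related here.
df = pd.DataFrame({
    "odometer": [10000, 20000, 30000, 40000, 50000],
    "price": [20000, 18000, 16000, 14000, 12000],
    "doors": [4, 4, 2, 4, 2],
})

mean_price = df["price"].mean()          # arithmetic mean
median_price = df["price"].median()      # middle value
mode_doors = df["doors"].mode()[0]       # most frequent value
var_price = df["price"].var()            # sample variance (ddof=1 by default)
cov_odo_price = df["odometer"].cov(df["price"])    # sample covariance
corr_odo_price = df["odometer"].corr(df["price"])  # Pearson correlation

print(mean_price, median_price, mode_doors)
print(var_price, cov_odo_price, corr_odo_price)
```

Note that pandas uses the sample (n - 1) definition for variance and covariance by default.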


Let’s list some powerful Python modules that we will use for EDA:


    Mainly, we need to import four modules (libraries) in Python to do EDA:

o   Pandas - for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas allows importing data from various file formats such as comma-separated values, JSON, SQL, Microsoft Excel. Pandas allows various data manipulation operations such as merging, reshaping, selecting, as well as data cleaning, and data wrangling features.

o   NumPy - a Python library used for working with arrays. It also has functions for linear algebra, Fourier transforms, and matrices.

o   Matplotlib - a plotting library for the Python programming language and its numerical mathematics extension NumPy. Each pyplot function makes some change to a figure.

o   Seaborn - a library that uses Matplotlib underneath to plot graphs. We will use it to visualize distributions.
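The four libraries are usually imported under the short aliases shown below. In a real project the data would typically come from a file (for example via `pd.read_csv`), but so that this sketch runs on its own, it builds a small made-up DataFrame in memory. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for a dataset that would normally be loaded from a file,
# e.g. df = pd.read_csv("cars.csv")  (the file name is an assumption).
df = pd.DataFrame({
    "odometer": [42000, 58000, 61000, 75000, 98000],
    "price": [15500, 12900, 11000, 9800, 7200],
    "fuel": ["gas", "diesel", "gas", "gas", "diesel"],
})

print(df)
```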


    Once we have imported the necessary libraries and loaded the dataset into Python, we can run the following commands as needed to explore and clean the dataset:


1.    df.head() and df.tail() – print the first and the last few rows of our dataset.


2.    df.describe() - to get various summary statistics (like count, min, max, std, etc.) excluding the NaN values.


3.    df.shape – number of rows, columns.


4.    df.columns – All the column names.


5.    df.sample(5) - takes a random sample of five rows, an easy way to get a quick feel for our data.


6.    df.info() - prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.


7.    df.describe() – Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.


8.    df.isnull().sum() - to check for null or missing values.


9.    df.dtypes - The data type of each column.


10.  df.nunique(axis=0, dropna = True) - Return Series with number of distinct observations. Can ignore NaN values.


11.  df.coln.unique() – each unique value present in the column coln (coln stands for any column name).


12.  df.copy(deep = True/False) - When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object.

When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).
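Here is a small sketch that runs the inspection commands from the list above on a made-up DataFrame (the columns and values are hypothetical). The deep/shallow behaviour described in item 12 is that of pandas' `DataFrame.copy`, so the last part shows a deep copy being modified without affecting the original.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a couple of missing values.
df = pd.DataFrame({
    "odometer": [42000.0, 58000.0, np.nan, 61000.0, 75000.0],
    "price": [15500, 12900, 11000, 10400, 9800],
    "fuel": ["gas", "diesel", "gas", "gas", None],
})

print(df.head(2))                     # first rows
print(df.tail(2))                     # last rows
print(df.sample(3, random_state=0))   # random rows (seed fixed for repeatability)
print(df.shape)                       # (rows, columns)
print(df.columns.tolist())            # column names
print(df.dtypes)                      # data type of each column
df.info()                             # dtypes, non-null counts, memory usage
print(df.describe())                  # summary statistics of numeric columns

missing = df.isnull().sum()                   # missing values per column
distinct = df.nunique(axis=0, dropna=True)    # distinct values per column
fuel_values = df["fuel"].unique()             # unique values, NaN/None included
print(missing, distinct, fuel_values, sep="\n")

# A deep copy is independent: editing it leaves the original untouched.
backup = df.copy(deep=True)
backup.loc[0, "price"] = 0
print(df.loc[0, "price"])
```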


So far we have seen some variable identification and cleaning commands in Python. Now let’s look at some visualization techniques used in EDA.


1.    df.corr() – prints the correlation matrix of the DataFrame, once all the necessary cleaning and variable identification are done.


2.    df.plot(kind='scatter', x='odometer', y='price') – plotting scatter plot to identify the relations.


3.    sns.pairplot(df) – one of the best ways to identify relationships between the variables using scatterplots.


4.    df[coln].plot(kind='hist', bins=50, figsize=(12,6), facecolor='grey',edgecolor='black') – plotting Histogram to identify frequencies.


5.  Boxplot - one of the best ways to identify the outliers in the data, along with the median and the quartiles of the distribution.
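The sketch below puts the correlation matrix, scatter plot, histogram, and boxplot from the list above together on a small made-up table. The data and the output file name are hypothetical, and the non-interactive "Agg" backend is selected only so that the example also runs on a machine without a display; in a notebook you would normally skip that line and call plt.show() instead of saving.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; the last row is a deliberate outlier.
df = pd.DataFrame({
    "odometer": [12000, 25000, 38000, 51000, 64000, 77000, 90000, 250000],
    "price": [21000, 19500, 17800, 16000, 14500, 12900, 11200, 3000],
})

# Correlation matrix of the numeric columns.
corr_matrix = df.corr()
print(corr_matrix)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Scatter plot to look at the relation between two variables.
df.plot(kind="scatter", x="odometer", y="price", ax=axes[0], title="Scatter")

# Histogram to look at the frequency distribution of one variable.
df["price"].plot(kind="hist", bins=5, color="grey", edgecolor="black",
                 ax=axes[1], title="Histogram")

# Boxplot to spot outliers; the 250000 reading shows up beyond the whiskers.
df.boxplot(column="odometer", ax=axes[2])
axes[2].set_title("Boxplot")

fig.tight_layout()
fig.savefig("eda_plots.png")  # hypothetical output file name
```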

 

To give a practical feel for what we are discussing here, you can also visit one of my GitHub repositories, which covers EDA and building a simple ML model once the EDA is complete: https://github.com/HarishSingh2095/LinearRegression_Predictions_X


References:


For complete details on all the steps involved in EDA you can visit A Comprehensive Guide to Data Exploration https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/


Also do visit - https://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e?gi=4e05b6334126 for further references.


You can connect with me on - 


Linkedin - https://www.linkedin.com/in/harish-singh-166b63118


Twitter - @harisshh_singh


Gmail - hs02863@gmail.com

 

End notes:


Hope this was useful for beginners in the field of Data Science. 

See you next time.

 
