So here's what Harish found online when he searched for “What is Data Science?”
Data science is an interdisciplinary field. It is related to computer science but is a distinct field of study, concerned with rapidly growing volumes of data retrieved from many sources: social media, gadgets connected to the Internet of Things, search queries, reviews and customer service inquiries, reports collected from multiple places, and so on.
A data scientist needs to know how to store such huge amounts of information so that it can later be analyzed and useful information retrieved from it. The goal of data science is to gain insights and knowledge from any type of data, both structured and unstructured.
Since the analysis is ultimately done on the data we have stored, data science is closely related to statistics, the branch of mathematics used to draw more insightful information from data, and to visualization, which presents the useful information in a form that is easy to understand.
From this we can conclude that data science requires skills in the collection, organization, analysis, and presentation of data.
Data science uses machine learning algorithms to build predictive models, and modern tools and techniques to uncover meaningful information and hidden patterns that support business decisions.
A data scientist spends most of the time on exploratory data analysis (EDA) and on cleaning the dataset, because these steps play a key role in making sense of the raw data.
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.
Below are the steps involved in EDA to understand, clean, and prepare the dataset for extracting useful information from the raw data:
1. Variable identification.
2. Univariate analysis.
3. Bivariate analysis.
4. Missing values treatment.
5. Outlier treatment.
6. Variable transformation.
7. Variable creation.
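To make a few of these steps concrete, here is a minimal sketch of variable identification, missing values treatment, outlier treatment, variable transformation, and variable creation. The toy data, the column names, the median fill, the 1.5 * IQR rule, and the reference year are all assumptions chosen for illustration, not the only way to carry out these steps.

```python
import numpy as np
import pandas as pd

# Toy used-car data invented for illustration; the column names are assumptions.
df = pd.DataFrame({
    "price": [5000, 7500, np.nan, 6200, 250000, 4800],
    "odometer": [90000, 60000, 75000, np.nan, 82000, 110000],
    "year": [2010, 2014, 2012, 2013, 2011, 2008],
})

# Variable identification: pick the target and treat the rest as predictors.
target = "price"
predictors = [c for c in df.columns if c != target]

# Missing values treatment: fill numeric gaps with each column's median.
df = df.fillna(df.median(numeric_only=True))

# Outlier treatment: keep rows whose price lies within 1.5 * IQR of the quartiles.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()

# Variable transformation: log-transform the skewed price column.
df["log_price"] = np.log(df["price"])

# Variable creation: derive the car's age from its year (reference year assumed).
REFERENCE_YEAR = 2024
df["age"] = REFERENCE_YEAR - df["year"]

print(df)
```

Whether to fill, drop, or cap values depends on the dataset, so treat each choice here as a starting point to question rather than a rule.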
Along the way we deal with missing values, outliers, the mean, median, and mode, relationships between variables (correlation, variance, covariance), the important features, and the target variable of our dataset.
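As a small illustration of these quantities, the sketch below computes them with pandas on a toy table. The values and the column names are made up for the example.

```python
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "price": [5000, 7500, 6200, 6200, 4800],
    "odometer": [90000, 60000, 75000, 82000, 110000],
})

mean_price = df["price"].mean()
median_price = df["price"].median()
mode_price = df["price"].mode().iloc[0]   # mode() returns a Series; take the first mode
var_price = df["price"].var()             # sample variance (ddof=1 by default)

# Covariance and Pearson correlation between two variables.
cov_price_odo = df["price"].cov(df["odometer"])
corr_price_odo = df["price"].corr(df["odometer"])

print(mean_price, median_price, mode_price, var_price)
print(cov_price_odo, corr_price_odo)
```

In this toy table the covariance and correlation come out negative: cars with more kilometres on the odometer tend to have a lower price.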
Let’s list the powerful Python modules we will use for EDA. Mainly, four libraries need to be imported:
o Pandas - for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas can import data from various file formats such as comma-separated values (CSV), JSON, SQL, and Microsoft Excel, and supports data manipulation operations such as merging, reshaping, and selecting, as well as data cleaning and data wrangling.
o NumPy - a Python library for working with arrays. It also has functions for linear algebra, Fourier transforms, and matrices.
o Matplotlib - a plotting library for Python and its numerical mathematics extension NumPy. Plots are built up through its pyplot interface, where each pyplot function makes some change to a figure.
o Seaborn - a library that uses Matplotlib underneath to plot graphs. We will use it to visualize the distributions of our variables and the relationships between them.
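A minimal sketch of importing these libraries under their conventional aliases and loading a dataset follows. The CSV text is an invented stand-in read from memory so the example is self-contained; with a real file you would pass its path to pd.read_csv instead. The Seaborn import is left commented so the sketch runs even where Seaborn is not installed.

```python
import io

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import seaborn as sns   # conventional alias; uncomment if Seaborn is installed

# Invented CSV content standing in for a real file on disk.
csv_text = """price,odometer,year
5000,90000,2010
7500,60000,2014
,75000,2012
6200,,2013
"""

# Empty fields are read as NaN (missing values).
df = pd.read_csv(io.StringIO(csv_text))

# The underlying data is available as a NumPy array when needed.
odometer_values = df["odometer"].to_numpy()

print(df)
print(type(odometer_values))
```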
Once we have imported the necessary libraries and loaded the dataset into Python, we can run the following commands as needed to explore and clean it:
1. df.head() and df.tail() - print the first and last few rows of the dataset.
2. df.describe() - gives summary statistics (count, mean, std, min, max, quartiles) that describe the central tendency, dispersion, and shape of each numeric column's distribution, excluding NaN values.
3. df.shape - the number of rows and columns.
4. df.columns - all the column names.
5. df.sample(5) - takes a random sample of rows, an easy way to get a quick feel for the data.
6. df.info() - prints information about a DataFrame, including the index dtype and columns, non-null counts, and memory usage.
7. df.isnull().sum() - counts the null or missing values in each column.
8. df.dtypes - the data type of each column.
9. df.nunique(axis=0, dropna=True) - returns a Series with the number of distinct observations per column; NaN values are ignored when dropna=True.
10. df.coln.unique() - each unique value present in the column coln.
11. df.copy(deep=True/False) - when deep=True (the default), a new object is created with a copy of the calling object's data and indices, so modifications to the copy are not reflected in the original object. When deep=False, a new object is created without copying the calling object's data or index (only references to the data and index are copied). In pandas versions without Copy-on-Write enabled, any change to the data of the original is then reflected in the shallow copy (and vice versa).
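The commands above can be tried on a small DataFrame like this. The data is invented for the example, and the column names (price, odometer, fuel) are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "price": [5000.0, 7500.0, np.nan, 6200.0, 4800.0],
    "odometer": [90000, 60000, 75000, 82000, 110000],
    "fuel": ["gas", "diesel", "gas", "gas", "diesel"],
})

print(df.head(2))                     # first two rows
print(df.tail(2))                     # last two rows
print(df.shape)                       # (rows, columns)
print(list(df.columns))               # column names
print(df.sample(3, random_state=0))   # reproducible random sample
df.info()                             # dtypes, non-null counts, memory usage
print(df.describe())                  # summary statistics of numeric columns
print(df.dtypes)                      # data type of each column

missing = df.isnull().sum()                   # missing values per column
distinct = df.nunique(axis=0, dropna=True)    # distinct values per column
fuels = df["fuel"].unique()                   # unique values in one column

# A deep copy is independent of the original DataFrame.
backup = df.copy(deep=True)
backup.loc[0, "price"] = 0.0
print(df.loc[0, "price"])             # the original value is unchanged
```

Taking a deep copy before cleaning is a cheap way to keep the raw data around for comparison.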
So far we have seen some variable identification and cleaning commands in Python. Now let’s look at some visualization techniques used in EDA.
1. df.corr() - prints the correlation matrix of the DataFrame, once all the necessary cleaning and variable identification is done.
2. df.plot(kind='scatter', x='odometer', y='price') - draws a scatter plot to identify the relationship between two variables.
3. sns.pairplot(df) - one of the best ways to identify relationships between variables, using a grid of scatter plots.
4. df[coln].plot(kind='hist', bins=50, figsize=(12,6), facecolor='grey', edgecolor='black') - plots a histogram to identify frequencies.
5. Boxplot, e.g. df.boxplot(column=coln) - one of the best ways to identify outliers, along with the median and quartiles of the data.
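Here is a minimal sketch of the correlation matrix, scatter plot, histogram, and boxplot on toy data. The values are invented, and the non-interactive Agg backend is an assumption made so the script also runs without a display. sns.pairplot(df) is left out only to keep the dependencies to pandas and Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data invented for illustration.
df = pd.DataFrame({
    "price": [5000, 7500, 6200, 6200, 4800, 15000],
    "odometer": [90000, 60000, 75000, 82000, 110000, 20000],
})

# Correlation matrix of the numeric columns.
corr_matrix = df.corr()
print(corr_matrix)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Scatter plot: relationship between two variables.
df.plot(kind="scatter", x="odometer", y="price", ax=axes[0])

# Histogram: frequency of values in one column.
df["price"].plot(kind="hist", bins=5, facecolor="grey",
                 edgecolor="black", ax=axes[1])

# Boxplot: median, quartiles, and points beyond the whiskers (candidate outliers).
df.boxplot(column="price", ax=axes[2])

fig.tight_layout()
# Call fig.savefig("eda_plots.png") or plt.show() to view the figure.
```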
For some practical understanding of what we are discussing here, you can also visit my GitHub repository on EDA and on building a simple ML model after completing EDA: https://github.com/HarishSingh2095/LinearRegression_Predictions_X
References:
For complete details on all the steps involved in EDA you can visit A Comprehensive Guide to Data Exploration - https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
Also do visit - https://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e?gi=4e05b6334126 for further references.
You can connect with me on -
Linkedin - https://www.linkedin.com/in/harish-singh-166b63118
Twitter - @harisshh_singh
Gmail - hs02863@gmail.com
End notes:
Hope this was useful for beginners in the field of Data Science.
See you guys next time.



