
Database Testing is a type of software testing that checks the schema, tables, triggers and other objects of the database under test. It may involve creating complex queries to load/stress test the database and check its responsiveness. It also checks data integrity and consistency.


DB testing is back-end testing, whereas GUI testing is front-end testing.


GUI Testing is a type of software testing that checks the Graphical User Interface (GUI) of the software before the UI or application is released for public use.


The purpose of GUI testing is to ensure that the functionality of the software application works as per the specifications by checking screens, controls such as menus, buttons and icons, the application's responses, etc.


Types of DB testing are: -


1.       Structural Testing.

2.       Functional Testing.

3.       Non-Functional Testing. 

1.       Structural Testing involves

 

a.       Schema testing / Mapping testing: -

-          To check if there is proper mapping between the tables/views/columns of the DB and GUI.

-          To verify that the mapping is consistent across a heterogeneous DB environment within the overall application.

-          Tools used are: - DbUnit, Microsoft SQL Server, etc.

 

b.       DB tables and columns testing: -

-          To check the mapping between DB fields and columns in the back end.

-          Check whether there are any unused/unmapped DB columns or tables.

-          Check key constraints and indexes – primary key and foreign key constraints and their data types in all the tables (a scripted check is sketched at the end of this section).

-          Whether clustered and non-clustered indexes have been created on the required tables as specified by the business requirements.

 

c.       Stored procedures: -

-          To check that coding standards, exception handling and error handling conventions are followed in all the stored procedures of the modules.

-          Checking for conditions/loops applied to the required input data.

-          Executing the TRIM() function to remove unnecessary space characters, or other specified characters, from the beginning (leading) and end (trailing) of a string.

-          Checking for the presence of manual exceptions, NULL conditions and unused stored procedures.

-          Overall integration of stored procedures and function modules.

-          Tools used are: - LINQ, Stored Procedures Test tool.

 

d.       Trigger Testing involves: -

-          Coding conventions.

-          DML transactions – SELECT, INSERT, UPDATE, DELETE, etc.

-          Check that all the expected updates happen when the trigger fires.


e.       DB Server Validation involves: -

-          DB server configuration check.

-          Verifying that only authorized users can perform the different levels of actions in the application.

-          It can also verify the maximum number of users allowed to perform transactions at the same time.
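To make the structural checks above more concrete, here is a minimal sketch (assuming a SQLite database; the file name, table names and expected columns are hypothetical placeholders) of how columns and key constraints can be verified from a small Python script instead of manually:

import sqlite3

EXPECTED_COLUMNS = {"id", "customer_name", "created_at"}   # assumed from the spec
EXPECTED_PRIMARY_KEY = {"id"}                               # assumed from the spec

conn = sqlite3.connect("app_under_test.db")                 # hypothetical DB file
cur = conn.cursor()

# PRAGMA table_info returns one row per column: (cid, name, type, notnull, default, pk)
rows = cur.execute("PRAGMA table_info(customers)").fetchall()
actual_columns = {row[1] for row in rows}
actual_pk = {row[1] for row in rows if row[5]}              # pk flag is non-zero for key columns

assert actual_columns == EXPECTED_COLUMNS, f"Unexpected columns: {actual_columns}"
assert actual_pk == EXPECTED_PRIMARY_KEY, f"Unexpected primary key: {actual_pk}"

# PRAGMA foreign_key_list lists the foreign keys declared on a table
fks = cur.execute("PRAGMA foreign_key_list(orders)").fetchall()
assert fks, "orders table should declare at least one foreign key"

conn.close()
print("Structural checks passed.")

The same idea extends to views, indexes and data types: structural expectations from the requirements can be encoded once and re-run after every schema change.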

 


2.       Functional DB Testing - checks whether each field allows NULL values only where required, verifies the length and size of each field, and checks whether similar fields across multiple tables have the same name.


a.       Black Box Testing involves checking data integrity and consistency: -

-          If data is logically well organized and stored in tables.

-          If any unnecessary data is present.

-          To check that the data shown and updated on the UI is consistent with the DB.

-          Applying TRIM() on data used in DML operations.

-          Whether transactions based on business requirements commit and roll back properly.

-          Checking all conditions applied across multiple heterogeneous DBs.

-          Checking all the transactions executed.


b.      White Box Testing involves login and user security: -

-          Whether the application prevents unknown users from logging in (the four combinations below are sketched as a test at the end of this section).

o   Invalid Username Valid Password.

o   Invalid Username Invalid Password.

o   Valid Username Invalid Password.

o   Valid Username Valid Password.

-          Data security and the level of access.

-          Assignment of different roles/permissions as per eligibility and encrypting sensitive passwords, card details, etc.
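The four username/password combinations listed above map naturally onto a small parametrized test. The sketch below uses pytest and a hypothetical login() helper; in a real suite that helper would call the application's actual authentication API, and the credentials shown are placeholders:

import pytest

def login(username, password):
    # Placeholder: a real test would call the application's authentication API here.
    return username == "valid_user" and password == "valid_pass"

@pytest.mark.parametrize(
    "username, password, expected",
    [
        ("unknown_user", "valid_pass", False),   # invalid username, valid password
        ("unknown_user", "wrong_pass", False),   # invalid username, invalid password
        ("valid_user",   "wrong_pass", False),   # valid username, invalid password
        ("valid_user",   "valid_pass", True),    # valid username, valid password
    ],
)
def test_login_combinations(username, password, expected):
    assert login(username, password) == expected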

 


3.       Non-Functional Testing involves -

-          Risk quantification and determining the minimum system and equipment requirements.


-          Load Testing: -

o   To check whether the running transactions have a performance impact on the DB.

o   Response time for executing the transactions from multiple remote users.

o   Time taken by the DB to fetch specific records (a tiny timing sketch appears at the end of this section).


-          Stress Testing (Fatigue Testing): -

o   To identify the system breakpoint.

o   The breakpoint of a system is the point at which the system fails when more load or data is pushed onto the application.

o   For example, if a CRM application supports a maximum load of 50,000 users, stress testing would push the load to 51,000 users to observe how the system behaves at the breakpoint – for instance, whether pending transactions are synced to the DB so that new transactions can still be accepted.

o   Proper planning is required to avoid time and cost based issues.

o   Tools used: - LoadRunner and WinRunner.


-          Security Testing.


-          Compatibility Testing.


-          Usability Testing.
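As a tiny illustration of the load-testing idea above, the sketch below fires the same query from several simulated users and records each response time. The SQLite file, the orders table and the user count are hypothetical placeholders; a real load test would target the actual DB server, typically with a tool such as the LoadRunner mentioned above:

import sqlite3
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_orders(_):
    conn = sqlite3.connect("app_under_test.db")          # hypothetical DB file
    try:
        start = time.perf_counter()
        conn.execute("SELECT * FROM orders LIMIT 100").fetchall()
        return time.perf_counter() - start                # response time of this call
    finally:
        conn.close()

SIMULATED_USERS = 50
with ThreadPoolExecutor(max_workers=SIMULATED_USERS) as pool:
    timings = list(pool.map(fetch_orders, range(SIMULATED_USERS)))

print(f"average response: {sum(timings) / len(timings):.4f}s, worst: {max(timings):.4f}s")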


You can connect with me on -


LinkedIn - https://www.linkedin.com/in/harish-singh-166b63118


Twitter - @harisshh_singh


Gmail - hs02863@gmail.com


End notes: -

Hope this was useful for beginners in the field of Database Testing. 

See you guys until next time.



So here's what Harish found online when he searched for “What is Data Science?” 


Data science is an interdisciplinary science of data. It is related to computer science but is a distinct field of study, dealing with the vastly increasing volume of data retrieved from multiple sources such as social media, gadgets connected to the Internet of Things, search queries, reviews and customer-service inquiries, reports collected from multiple places, and so on.

A data scientist needs to know how to store such huge amounts of information so that it can later be analyzed and useful information retrieved from it. The goal of data science is to gain insights and knowledge from any type of data, both structured and unstructured.


Since we ultimately analyze the data we have stored, data science is also closely related to the mathematical field of statistics, which helps draw more insightful information from the data and visualize the useful parts for easy understanding and better presentation.


Here we can conclude that Data science requires the skillset of collection, organization, analysis, and presentation of data.


Data science uses complex machine learning algorithms to build predictive models, using modern tools and techniques to derive meaningful information and unseen patterns, and also to make business decisions.

Yes, it is true: if somebody already knows your likes and dislikes, delivering relevant content becomes easier, making for a more engaging shopping experience.


Most of the time spent by a data scientist goes into exploratory data analysis (EDA) and cleaning the dataset, as these play a major role in making sense of the raw data. 


Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.


Below are the steps involved in EDA to understand, clean and prepare the dataset for extracting useful information from the raw data:

1.    Variable Identification.

2.    Univariate Analysis.

3.    Bi-variate Analysis.

4.    Missing values treatment.

5.    Outlier treatment.

6.    Variable transformation.

7.    Variable creation.

Fig: EDA step by step


Here we deal with missing values, outliers, mean, median, mode, relations, correlation, variance, covariance, important features and the target variable in our dataset.


Let’s list some powerful Python modules that we will be using to do EDA: -


    Mainly we will need four modules or libraries to be imported in Python for EDA –

o   Pandas - for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Pandas allows importing data from various file formats such as comma-separated values, JSON, SQL, Microsoft Excel. Pandas allows various data manipulation operations such as merging, reshaping, selecting, as well as data cleaning, and data wrangling features.

o   Numpy - NumPy is a Python library used for working with arrays. It also has functions for working in domain of linear algebra, Fourier transform, and matrices.

o   Matplotlib - Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. Each pyplot function makes some change to a figure.

o   Seaborn - Seaborn is a library that uses Matplotlib underneath to plot graphs. It will be used to visualize random distributions.


    Once we have imported the necessary libraries and loaded the dataset in Python, we will run the following commands as needed to do EDA and clean our dataset –
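As a quick sketch, the four imports and the dataset load typically look like this (the file name dataset.csv is just a placeholder for whatever file you are working with):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("dataset.csv")   # load the raw data into a DataFrame called df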


1.    df.head() and df.tail() – will print the first and the last few rows of our dataset.


2.    df.describe() - to get various summary statistics (like count, min, max, std, etc.) excluding the NaN values.


3.    df.shape – number of rows, columns.


4.    df.columns – All the column names.


5.    df.sample(5) - takes a random sample of our data as an easy way to get a quick feel for it.


6.    df.info() - prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.


7.    df.describe() – Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.


8.    df.isnull().sum() - to check for null or missing values.


9.    df.dtypes - The data type of each column.


10.  df.nunique(axis=0, dropna=True) - returns a Series with the number of distinct observations per column; NaN values are ignored when dropna=True.


11.  df.coln.unique() – Each unique term present in coln.


12.  df.copy(deep=True/False) - When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object.

When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).
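Putting a few of these commands together, a typical first pass over the freshly loaded DataFrame might look like the sketch below (df is the DataFrame loaded earlier):

print(df.shape)            # number of rows and columns
print(df.dtypes)           # data type of each column
print(df.head())           # peek at the first few rows
df.info()                  # index dtype, columns, non-null counts, memory usage (prints directly)
print(df.describe())       # summary statistics for the numeric columns
print(df.isnull().sum())   # missing-value count per column
print(df.nunique())        # number of distinct values per column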


Till now we have seen some variable identification and cleaning commands in Python. Now let’s look at some visualization techniques in Python under EDA.


1.    df.corr() – to print the correlation Matrix of the data-frame after we have done all the necessary Cleaning and Variable Identification.


2.    df.plot(kind='scatter', x='odometer', y='price') – plotting scatter plot to identify the relations.


3.    sns.pairplot(df) – One of the best ways to identify relations between the variables using scatter plots.


4.    df[coln].plot(kind='hist', bins=50, figsize=(12,6), facecolor='grey',edgecolor='black') – plotting Histogram to identify frequencies.


5.  Boxplot - One of the best ways to identify outliers in the data, along with information about the median, quartiles and overall spread. 
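As a short sketch combining the plots above (again using the df loaded earlier; the column name 'price' is just a placeholder for a numeric column in your data):

import matplotlib.pyplot as plt
import seaborn as sns

print(df.select_dtypes("number").corr())    # correlation matrix of the numeric columns

sns.pairplot(df)                            # pairwise scatter plots to spot relations
plt.show()

df["price"].plot(kind="hist", bins=50)      # histogram of a single column to see its frequencies
plt.show()

sns.boxplot(x=df["price"])                  # boxplot to spot outliers and the spread of the data
plt.show()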

 

To give a practical understanding of what we are discussing here, you can also visit one of my GitHub repositories about EDA and building a simple ML model after completing EDA, through this link - https://github.com/HarishSingh2095/LinearRegression_Predictions_X


References:


For complete details on all the steps involved in EDA, you can visit A Comprehensive Guide to Data Exploration - https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/


Also do visit - https://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e?gi=4e05b6334126 for further references.


You can connect with me on - 


Linkedin - https://www.linkedin.com/in/harish-singh-166b63118


Twitter - @harisshh_singh


Gmail - hs02863@gmail.com

 

End notes:


Hope this was useful for beginners in the field of Data Science. 

See you guys until next time.