
Data Science and Big Data Analytics Laboratory
Third Year 2019 Course

Prof. K. B. Sadafale
Assistant Professor
Computer Dept., GCOEAR, Avasari
Data Wrangling: II
Create an “Academic performance” dataset of students and
perform the following operations using Python.
✓ Scan all variables for missing values and inconsistencies. If
there are missing values and/or inconsistencies, use any of
the suitable techniques to deal with them.
✓ Scan all numeric variables for outliers. If there are outliers,
use any of the suitable techniques to deal with them.
✓ Apply data transformations on at least one of the variables.
The purpose of this transformation should be one of the
following reasons: to change the scale for better
understanding of the variable, to convert a non-linear
relation into a linear one, or to decrease the skewness and
convert the distribution into a normal distribution.
Reason and document your approach properly.
Select all Rows with NaN Values in Pandas DataFrame

➢ Here are 4 ways to select all rows with NaN values in a Pandas
DataFrame:

➢ (1) Using isna() to select all rows with NaN under
a single DataFrame column:
✓ df[df['column name'].isna()]
➢ (2) Using isnull() to select all rows with NaN under
a single DataFrame column:
✓ df[df['column name'].isnull()]
➢ (3) Using isna() to select all rows with NaN under
an entire DataFrame:
✓ df[df.isna().any(axis=1)]
➢ (4) Using isnull() to select all rows with NaN under
an entire DataFrame:
✓ df[df.isnull().any(axis=1)]
Step 1: Create a DataFrame

✓ Numeric values with NaN
✓ String/text values with NaN
✓ The goal is to select all rows with NaN values under the ‘first_set‘
column.
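✓ As one possible sketch (the column names follow the ‘first_set‘ example in the text; the values themselves are illustrative), such a DataFrame could be built as:

```python
import numpy as np
import pandas as pd

# Illustrative data: a numeric column and a string column, both with NaNs
df = pd.DataFrame({
    'first_set': [1, 2, 3, 4, 5, np.nan, 6, 7, np.nan, np.nan],
    'second_set': ['a', 'b', np.nan, np.nan, 'c', 'd', 'e', np.nan, np.nan, 'f'],
})
print(df)
```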

Step 2: Select all rows with NaN under a single DataFrame column

✓ You may use the isna() approach to select the NaNs:
✓ df[df['column name'].isna()]
✓ You’ll get the same results using isnull():
Select all rows with NaN under the entire DataFrame

✓ To find all rows with NaN under the entire DataFrame, you
may apply this syntax:
✓ df[df.isna().any(axis=1)]
Optionally, you’ll get the same results using isnull():
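✓ A minimal sketch of both selections on a small illustrative DataFrame (isna() and isnull() are aliases, so either works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_set': [1, 2, np.nan, 4],
    'second_set': ['a', np.nan, 'c', 'd'],
})

# Rows with NaN under a single column
single = df[df['first_set'].isna()]

# Rows with NaN anywhere in the DataFrame
anywhere = df[df.isna().any(axis=1)]

print(single)
print(anywhere)
```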
How to Drop Rows with NaN Values in Pandas DataFrame

✓ The syntax that you may apply in order to drop rows with NaN
values in your DataFrame:
✓ df.dropna()

➢ Let’s say that you have the following dataset:

values_1 values_2
700 DDD
ABC 150
500 350
XYZ 400
1200 5000
Notice that the DataFrame contains both:
✓ Numeric data: 700, 500, 1200, 150, 350, 400, 5000
✓ Non-numeric values: ABC, XYZ, DDD
➢ You can then use to_numeric in order to convert the values in
the dataset into a float format.
➢ But since 3 of those values are non-numeric, you’ll get ‘NaN’
for those 3 values.
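✓ A sketch of this conversion on the dataset above; errors='coerce' is what turns the 3 non-numeric strings into NaN:

```python
import pandas as pd

df = pd.DataFrame({
    'values_1': ['700', 'ABC', '500', 'XYZ', '1200'],
    'values_2': ['DDD', '150', '350', '400', '5000'],
})

# Non-numeric entries become NaN; numeric strings become floats
df['values_1'] = pd.to_numeric(df['values_1'], errors='coerce')
df['values_2'] = pd.to_numeric(df['values_2'], errors='coerce')
print(df)
```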
Step 2: Drop the Rows with NaN Values in Pandas DataFrame

✓ To drop all the rows with NaN values, you may use df.dropna().

✓ You’ll then see only the two rows without any NaN values:


Step 3 (Optional): Reset the Index

Syntax:
df.reset_index(drop=True)
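✓ Putting the two steps together on the dataset above (values already coerced to numeric, so the 3 non-numeric cells are NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'values_1': [700, np.nan, 500, np.nan, 1200],
    'values_2': [np.nan, 150, 350, 400, 5000],
})

df = df.dropna()                 # keep only rows without any NaN
df = df.reset_index(drop=True)   # renumber the remaining rows from 0
print(df)
```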

Replace NaN Values with Zeros in Pandas DataFrame

✓ The 4 approaches below replace NaN values with zeros in a
Pandas DataFrame:

(1) For a single column using Pandas:
df['DataFrame Column'] = df['DataFrame Column'].fillna(0)

(2) For a single column using NumPy:
df['DataFrame Column'] = df['DataFrame Column'].replace(np.nan, 0)

(3) For an entire DataFrame using Pandas:
df.fillna(0)

(4) For an entire DataFrame using NumPy:
df.replace(np.nan, 0)
Case 1: Replace NaN values with zeros for a column using Pandas

✓ Suppose that you have a single column with the following
data that contains NaN values:

values
700
NaN
500
NaN

✓ Python code to replace the NaN values with 0’s:
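✓ A minimal sketch of Case 1, using the column above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'values': [700, np.nan, 500, np.nan]})

# Pandas approach for a single column: fill NaN with 0
df['values'] = df['values'].fillna(0)
print(df)
```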
How to Transpose Pandas DataFrame
You can use the following syntax to transpose Pandas DataFrame:
df = df.transpose()

How to apply the above syntax by reviewing 3 cases of:

✓ Transposing a DataFrame with a default index

✓ Transposing a DataFrame with a tailored index

✓ Importing a CSV file and then transposing the DataFrame
Case 1: Transpose Pandas DataFrame with a Default Index

• Let’s create a DataFrame with 3 columns:

Get the DataFrame (with a default numeric index that starts from 0 ):
➢ You can then add df = df.transpose() to the code in order to
transpose the DataFrame:
Case 2: Transpose Pandas DataFrame with a Tailored Index

✓ What if you want to assign your own tailored index, and then
transpose the DataFrame?
✓ For example, let’s add the following index to the DataFrame:
✓ index = ['X', 'Y', 'Z']
✓ Now add df = df.transpose() in order to transpose the
DataFrame:
Case 3: Import a CSV File and then Transpose the Results
• For example, let’s say that you have the following data
saved in a CSV file:

A B C
11 44 77
22 55 88
33 66 99

• You can then use the code to import the data into Python
(note that you’ll need to modify the path to reflect the
location where the CSV file is stored on your computer):
✓ Optionally, you can rename the index values before
transposing the DataFrame:
✓ df = df.rename(index = {0:'X', 1:'Y', 2:'Z'})
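✓ A sketch combining Case 3’s data with the rename-then-transpose step (the DataFrame is built inline here instead of being read from a CSV):

```python
import pandas as pd

# Same values as the CSV table above
df = pd.DataFrame({'A': [11, 22, 33], 'B': [44, 55, 66], 'C': [77, 88, 99]})

# Rename the default index before transposing
df = df.rename(index={0: 'X', 1: 'Y', 2: 'Z'})
df = df.transpose()
print(df)
```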
Using StandardScaler() Function to Standardize Python Data

➢ Scaling of features is an essential step in modeling
algorithms with datasets.
➢ The data that is usually used for the purpose of modeling is
derived through various means such as:
✓ Questionnaire
✓ Surveys
✓ Research
✓ etc.
➢ So, the data obtained contains features of various dimensions
and scales altogether.
➢ Features on different scales adversely affect the modeling of a
dataset.
➢ This leads to biased predictions, seen in higher
misclassification error and lower accuracy rates.
➢ Thus, it is necessary to scale the data prior to modeling.
➢ Standardization is a scaling technique that makes the
data scale-free by converting its statistical distribution to:

✓ mean = 0 (zero)
✓ standard deviation = 1

✓ The Python sklearn library offers the StandardScaler()
function to standardize data values into a standard format.
✓ Syntax:
✓ object = StandardScaler()
✓ object.fit_transform(data)

✓ According to the above syntax, we initially create an object of
the StandardScaler() class.
✓ Then we use fit_transform() with the assigned object
to transform the data and standardize it.
Standard Scaler
➢ Standard Scaler helps to get standardized distribution, with a
zero mean and standard deviation of one (unit variance).
➢ It standardizes features by subtracting the mean value from
the feature and then dividing the result by feature standard
deviation.
✓ The standard scaling is calculated as:
✓ z = (x - u) / s
Where,
➢ z is the scaled value,
➢ x is the value to be scaled,
➢ u is the mean of the training samples,
➢ s is the standard deviation of the training samples.
✓ Sklearn preprocessing supports StandardScaler() method to
achieve this directly in merely 2-3 steps.
Syntax:

class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)

Parameters:

• copy: If False, scaling is done in place. If True, a copy is created
instead of in-place scaling.
• with_mean: If True, data is centered before scaling.
• with_std: If True, data is scaled to unit variance.
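✓ A small sketch of the fit_transform() workflow on illustrative data (the two columns deliberately sit on very different scales):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: e.g. an age-like and an income-like feature
data = np.array([[25, 50000],
                 [30, 60000],
                 [35, 70000],
                 [40, 80000]], dtype=float)

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

print(scaled.mean(axis=0))  # each column now has mean ~0
print(scaled.std(axis=0))   # each column now has std 1
```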
MinMax Scaler
✓ This is another way of data scaling, where the minimum of feature is
made equal to zero and the maximum of feature equal to one.
✓ MinMax Scaler shrinks the data within the given range, usually of 0 to 1.
✓ It transforms data by scaling features to a given range.
✓ It scales the values to a specific value range without changing the shape
of the original distribution.

The MinMax scaling is done using:

x_std = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

x_scaled = x_std * (max - min) + min

Where,
min, max = feature_range
x.min(axis=0): minimum feature value
x.max(axis=0): maximum feature value
Syntax:

class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)

Parameters:

• feature_range: Desired range of the scaled data.
• The default range returned by MinMaxScaler is 0 to 1.
• The range is provided in tuple form as (min, max).
• copy: If False, scaling is done in place. If True, a copy is created
instead of in-place scaling.
• clip: If True, scaled data is clipped to the provided feature range.
✓ Only two column names (Income, Age) are scaled, because the
Department column contains character data, which should not be scaled.
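✓ The same illustrative two-column data as above, scaled into the 0–1 range (stand-ins for the Age and Income columns mentioned in the slide):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative stand-ins for the Age and Income columns
data = np.array([[25, 50000],
                 [30, 60000],
                 [35, 70000],
                 [40, 80000]], dtype=float)

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)
print(scaled)  # each column's minimum maps to 0 and maximum to 1
```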
Z score for Outlier Detection – Python

➢ Z score is an important concept in statistics.
➢ Z score is also called standard score.
➢ This score helps to understand if a data value is greater or
smaller than the mean and how far away it is from the mean.
➢ More specifically, Z score tells how many standard deviations
away a data point is from the mean.

Z score = (x - mean) / std. deviation


Z score and Outliers:

✓ If the z score of a data point is more than 3, it indicates that the
data point is quite different from the other data points.
✓ Such a data point can be an outlier.
✓ For example, in a survey, it was asked how many children a
person had.
✓ Suppose the data obtained from people is:
✓ 1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2
✓ Clearly, 15 is an outlier in this dataset.

Let us calculate the Z score using Python to find this outlier.

Step 1: Import necessary libraries

import numpy as np

Step 2: Calculate mean, standard deviation

data = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]


mean = np.mean(data)
std = np.std(data)
print('mean of the dataset is', mean)
print('std. deviation is', std)

Output:
mean of the dataset is 2.6666666666666665
std. deviation is 3.3598941782277745
Step 3: Calculate Z score. If Z score>3, print it as an outlier.

threshold = 3
outlier = []
for i in data:
    z = (i - mean) / std
    if z > threshold:
        outlier.append(i)
print('outlier in dataset is', outlier)

Output:

outlier in dataset is [15]

Conclusion: Z score helps us identify outliers in the data.

Example: Approach for Outliers

We will approach dealing with bad data using the Z-Score:

1. The very first step is setting the upper and lower limits.
Every data point outside this range will be regarded as
an outlier.

Let’s see the formulae for both upper and lower limits.

Upper: mean + 3 * standard deviation
Lower: mean - 3 * standard deviation

✓ In the output, we see that the upper limit is 13.06 while the
lower limit is 2.12.
✓ Hence any value out of this range is a bad data point.

2. The second step is to detect how many outliers there are in
the dataset, based on the upper and lower limits we just set:

df[(df['cgpa'] > 13.06) | (df['cgpa'] < 2.12)]
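✓ The two steps above can be sketched end-to-end. The ‘cgpa’ values below are illustrative (the 13.06 / 2.12 limits in the slide come from the instructor’s own dataset), so the limits are recomputed from this sample:

```python
import pandas as pd

# Illustrative cgpa-like data: 20 typical values plus one obvious outlier
df = pd.DataFrame({'cgpa': [6.5, 6.8, 7.0, 7.2, 6.9, 7.1, 6.7, 7.3, 7.0, 6.6,
                            7.4, 6.8, 7.1, 6.9, 7.2, 7.0, 6.8, 7.3, 6.7, 7.1,
                            25.0]})

# Step 1: set the upper and lower limits from mean +/- 3 standard deviations
upper = df['cgpa'].mean() + 3 * df['cgpa'].std()
lower = df['cgpa'].mean() - 3 * df['cgpa'].std()

# Step 2: detect the rows falling outside the limits
outliers = df[(df['cgpa'] > upper) | (df['cgpa'] < lower)]
print(outliers)
```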
