Dsbda Ass2
Data Analytics
Laboratory
Third Year 2019 Course
Prof.K.B.Sadafale
Assistant Professor
Computer Dept. GCOEAR, Avasari
Data Wrangling: II
Create an “Academic performance” dataset of students and
perform the following operations using Python.
✓ Scan all variables for missing values and inconsistencies. If
there are missing values and/or inconsistencies, use any of
the suitable techniques to deal with them.
✓ Scan all numeric variables for outliers. If there are outliers,
use any of the suitable techniques to deal with them.
✓ Apply data transformations on at least one of the variables.
The purpose of this transformation should be one of the
following reasons: to change the scale for better
understanding of the variable, to convert a non-linear
relation into a linear one, or to decrease the skewness and
convert the distribution into a normal distribution.
Reason and document your approach properly.
Select all Rows with NaN Values in Pandas DataFrame
➢ Here are 4 ways to select all rows with NaN values in Pandas
DataFrame:
✓ To find all rows with NaN under the entire DataFrame, you
may apply this syntax:
✓ df[df.isna().any(axis=1)]
Optionally, you’ll get the same results using isnull():
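A minimal sketch of both forms on a small made-up DataFrame (the column names and values are illustrative assumptions, not the assignment dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with some missing entries
df = pd.DataFrame({
    'name': ['Asha', 'Ravi', None, 'Meera'],
    'marks': [78.0, np.nan, 65.0, 90.0],
})

# Rows that contain at least one missing value in any column
rows_isna = df[df.isna().any(axis=1)]

# isnull() is an alias of isna(), so this selects the same rows
rows_isnull = df[df.isnull().any(axis=1)]

print(rows_isna)
```

Here any(axis=1) checks each row across its columns, so a row is selected as soon as one of its cells is missing.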
How to Drop Rows with NaN Values in Pandas DataFrame
✓ The syntax that you may apply in order to drop rows with NaN
values in your DataFrame:
✓ df.dropna()
values_1 values_2
700 DDD
ABC 150
500 350
XYZ 400
1200 5000
Notice that the DataFrame contains both:
✓ Numeric data: 700, 500, 1200, 150, 350, 400, 5000
✓ Non-numeric values: ABC, XYZ, DDD
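➢ You can then use pd.to_numeric with errors='coerce' in order to
convert the values in the dataset into a float format.
➢ Since 3 of those values are non-numeric, you'll get NaN for
those 3 values. (Without errors='coerce', to_numeric raises an
error when it meets a non-numeric string.)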
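The conversion described above can be sketched as follows, assuming the table is first loaded as strings. Applying to_numeric column by column with df.apply is one common way to do it, not the only one:

```python
import pandas as pd

# The table from the slide, stored as strings
df = pd.DataFrame({
    'values_1': ['700', 'ABC', '500', 'XYZ', '1200'],
    'values_2': ['DDD', '150', '350', '400', '5000'],
})

# errors='coerce' turns every value that cannot be parsed into NaN
df = df.apply(pd.to_numeric, errors='coerce')

print(df)
print(df.isna().sum())  # number of NaN cells in each column
```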
Step 2: Drop the Rows with NaN Values in Pandas DataFrame
✓ To drop all the rows with the NaN values, you may use df.dropna().
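A short sketch of this step, starting from the numeric version of the table above (NaN where the conversion failed). The variable name cleaned is only illustrative:

```python
import numpy as np
import pandas as pd

# Numeric version of the slide table, with NaN where conversion failed
df = pd.DataFrame({
    'values_1': [700, np.nan, 500, np.nan, 1200],
    'values_2': [np.nan, 150, 350, 400, 5000],
})

# Keep only the rows that have no NaN in any column
cleaned = df.dropna()

# The surviving rows keep their old labels (2 and 4); renumber them from 0
cleaned = cleaned.reset_index(drop=True)

print(cleaned)
```

Note that dropna() returns a new DataFrame and leaves df unchanged, so the result has to be assigned to a variable.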
Syntax to reset the index after dropping rows:
df.reset_index(drop=True)
Syntax to replace NaN values with 0:
df.replace(np.nan, 0)
values
700
NaN
500
NaN
✓ Python code to replace the NaN values with 0’s:
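A minimal sketch using the four values shown above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'values': [700, np.nan, 500, np.nan]})

# replace() returns a new DataFrame, so assign the result back
df = df.replace(np.nan, 0)

print(df)
```

df.fillna(0) gives the same result here and is the more common spelling when the only goal is to fill missing values.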
How to Transpose Pandas DataFrame
You can use the following syntax to transpose Pandas DataFrame:
df = df.transpose()
Get the DataFrame (with a default numeric index that starts from 0):
➢ You can then add df = df.transpose() to the code in order to
transpose the DataFrame:
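As an illustration, the sketch below builds a small DataFrame and transposes it. The values are borrowed from the CSV example in Case 3, since the original data for this case is not shown:

```python
import pandas as pd

# Assumed sample data (same numbers as the CSV example in Case 3)
data = {'A': [11, 22, 33], 'B': [44, 55, 66], 'C': [77, 88, 99]}

df = pd.DataFrame(data)  # default index 0, 1, 2
print(df)

df = df.transpose()      # column labels A, B, C become the row labels
print(df)
```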
Case 2: Transpose Pandas DataFrame with a Tailored Index
✓ What if you want to assign your own tailored index, and then
transpose the DataFrame?
✓ For example, let’s add the following index to the DataFrame:
✓ index = ['X', 'Y', 'Z']
✓ Now add df = df.transpose() in order to transpose the
DataFrame:
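A sketch of this case, again with assumed sample values. The index is passed when the DataFrame is created, and after transposing it becomes the column labels:

```python
import pandas as pd

# Assumed sample data (same numbers as the CSV example in Case 3)
data = {'A': [11, 22, 33], 'B': [44, 55, 66], 'C': [77, 88, 99]}

# Assign the tailored index X, Y, Z instead of the default 0, 1, 2
df = pd.DataFrame(data, index=['X', 'Y', 'Z'])

df = df.transpose()  # X, Y, Z are now the column labels
print(df)
```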
Case 3: Import a CSV File and then Transpose the Results
• For example, let’s say that you have the following data
saved in a CSV file:
A B C
11 44 77
22 55 88
33 66 99
• You can then use pd.read_csv to import the data into Python
(note that you'll need to modify the path to reflect the
location where the CSV file is stored on your computer):
✓ Optionally, you can rename the index values before
transposing the DataFrame:
✓ df = df.rename(index = {0:'X', 1:'Y', 2:'Z'})
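The whole of Case 3 can be sketched as below. Because the file path is not given in the slides, an in-memory string (io.StringIO) stands in for the CSV file. This is an assumption made so that the example runs anywhere; with a real file you would pass its path to read_csv instead.

```python
import io
import pandas as pd

# Stand-in for the CSV file on disk; with a real file, pass its path to read_csv
csv_text = "A,B,C\n11,44,77\n22,55,88\n33,66,99\n"
df = pd.read_csv(io.StringIO(csv_text))

# Optional: rename the default index 0, 1, 2 before transposing
df = df.rename(index={0: 'X', 1: 'Y', 2: 'Z'})

df = df.transpose()
print(df)
```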
Using StandardScaler() Function to Standardize Python Data
After standardization, each feature has:
✓ mean = 0 (zero)
✓ standard deviation = 1
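To make the two properties concrete, the sketch below writes out the standardization formula z = (x - mean) / std with NumPy on a made-up column of marks. scikit-learn's StandardScaler applies the same calculation column by column, using the population standard deviation. The NumPy version is used here only to show the arithmetic, and the sample values are assumptions.

```python
import numpy as np

# Hypothetical marks of five students
marks = np.array([45.0, 60.0, 75.0, 80.0, 90.0])

# z = (x - mean) / std, with the population standard deviation (ddof=0)
scaled = (marks - marks.mean()) / marks.std()

print(scaled)
print('mean after scaling:', scaled.mean())  # very close to 0
print('std after scaling:', scaled.std())    # very close to 1
```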
Parameters:
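✓ Sample dataset: 1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2
import numpy as np
data = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]
mean = np.mean(data)
std = np.std(data)  # population standard deviation
print('mean of the dataset is', mean)
print('std. deviation is', std)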
Output:
mean of the dataset is 2.6666666666666665
std. deviation is 3.3598941782277745
Step 3: Calculate Z score. If Z score > 3, print it as an outlier.
threshold = 3
outlier = []
for i in data:
    z = (i - mean) / std
    if z > threshold:
        outlier.append(i)
print('outlier in dataset is', outlier)
Output:
outlier in dataset is [15]
1. The very first step is to set the upper and lower limits.
Any data point that falls outside this range is regarded as
an outlier.
Let’s see the formulae for both upper and lower limits.
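The formulae themselves did not survive in these slides. The sketch below assumes the conventional interquartile range (IQR) rule with a multiplier of 1.5:

lower limit = Q1 - 1.5 * IQR
upper limit = Q3 + 1.5 * IQR, where IQR = Q3 - Q1

It is applied to the same sample dataset used for the Z score example.

```python
import numpy as np

data = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]

# First and third quartiles (25th and 75th percentiles)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Assumed conventional multiplier of 1.5
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Anything outside [lower_limit, upper_limit] is treated as an outlier
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print('lower limit:', lower_limit)
print('upper limit:', upper_limit)
print('outliers:', outliers)
```

Unlike the Z score check, this rule does not depend on the mean and standard deviation, which are themselves pulled by extreme values.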