
Data Science and Big Data Analytics Laboratory
Third Year 2019 Course

Prof. K. B. Sadafale
Assistant Professor
Computer Dept., GCOEAR, Avasari
Data Wrangling: II
Create an “Academic performance” dataset of students and
perform the following operations using Python.
✓ Scan all variables for missing values and inconsistencies. If
there are missing values and/or inconsistencies, use any of
the suitable techniques to deal with them.
✓ Scan all numeric variables for outliers. If there are outliers,
use any of the suitable techniques to deal with them.
✓ Apply data transformations on at least one of the variables.
The purpose of this transformation should be one of the
following reasons: to change the scale for better
understanding of the variable, to convert a non-linear
relation into a linear one, or to decrease the skewness and
convert the distribution into a normal distribution.
Reason and document your approach properly.
Select all Rows with NaN Values in Pandas DataFrame

➢ Here are 4 ways to select all rows with NaN values in a Pandas
DataFrame:

➢ (1) Using isna() to select all rows with NaN under
a single DataFrame column:
✓ df[df['column name'].isna()]
➢ (2) Using isnull() to select all rows with NaN under
a single DataFrame column:
✓ df[df['column name'].isnull()]
➢ (3) Using isna() to select all rows with NaN under
an entire DataFrame:
✓ df[df.isna().any(axis=1)]
➢ (4) Using isnull() to select all rows with NaN under
an entire DataFrame:
✓ df[df.isnull().any(axis=1)]
Step 1: Create a DataFrame

✓ Numeric values with NaN
✓ String/text values with NaN
✓ The goal is to select all rows with NaN values under the ‘first_set‘
column.
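✓ As one possible sketch (the column names follow the ‘first_set‘ example in the text; the values themselves are illustrative), such a DataFrame could be built as:

```python
import numpy as np
import pandas as pd

# Illustrative data: a numeric column and a string column, both with NaNs
df = pd.DataFrame({
    'first_set': [1, 2, 3, 4, 5, np.nan, 6, 7, np.nan, np.nan],
    'second_set': ['a', 'b', np.nan, np.nan, 'c', 'd', 'e', np.nan, np.nan, 'f'],
})
print(df)
```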

Step 2: Select all rows with NaN under a single DataFrame column

✓ You may use the isna() approach to select the NaNs:
✓ df[df['column name'].isna()]
✓ You’ll get the same results using isnull():
Select all rows with NaN under the entire DataFrame

✓ To find all rows with NaN under the entire DataFrame, you
may apply this syntax:
✓ df[df.isna().any(axis=1)]
Optionally, you’ll get the same results using isnull():
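✓ A minimal sketch of both selections on a small illustrative DataFrame (isna() and isnull() are aliases, so either works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_set': [1, 2, np.nan, 4],
    'second_set': ['a', np.nan, 'c', 'd'],
})

# Rows with NaN under a single column
single = df[df['first_set'].isna()]

# Rows with NaN anywhere in the DataFrame
anywhere = df[df.isna().any(axis=1)]

print(single)
print(anywhere)
```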
How to Drop Rows with NaN Values in Pandas DataFrame

✓ The syntax that you may apply in order to drop rows with NaN
values in your DataFrame:
✓ df.dropna()

➢ Let’s say that you have the following dataset:

values_1 values_2
700 DDD
ABC 150
500 350
XYZ 400
1200 5000
Notice that the DataFrame contains both:
✓ Numeric data: 700, 500, 1200, 150, 350, 400, 5000
✓ Non-numeric values: ABC, XYZ, DDD
➢ You can then use to_numeric in order to convert the values in
the dataset into a float format.
➢ But since 3 of those values are non-numeric, you’ll get ‘NaN’
for those 3 values.
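✓ A sketch of this conversion on the dataset above; errors='coerce' is what turns the 3 non-numeric strings into NaN:

```python
import pandas as pd

df = pd.DataFrame({
    'values_1': ['700', 'ABC', '500', 'XYZ', '1200'],
    'values_2': ['DDD', '150', '350', '400', '5000'],
})

# Non-numeric entries become NaN; numeric strings become floats
df['values_1'] = pd.to_numeric(df['values_1'], errors='coerce')
df['values_2'] = pd.to_numeric(df['values_2'], errors='coerce')
print(df)
```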
Step 2: Drop the Rows with NaN Values in Pandas DataFrame

✓ To drop all the rows with NaN values, you may use df.dropna().

✓ You’ll then see only the two rows without any NaN values:


Step 3 (Optional): Reset the Index

Syntax:
df.reset_index(drop=True)
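✓ Putting the two steps together on the dataset above (values already coerced to numeric, so the 3 non-numeric cells are NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'values_1': [700, np.nan, 500, np.nan, 1200],
    'values_2': [np.nan, 150, 350, 400, 5000],
})

df = df.dropna()                 # keep only rows without any NaN
df = df.reset_index(drop=True)   # renumber the remaining rows from 0
print(df)
```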

Replace NaN Values with Zeros in Pandas DataFrame

✓ The 4 approaches below replace NaN values with zeros in a
Pandas DataFrame:

(1) For a single column using Pandas:
df['DataFrame Column'] = df['DataFrame Column'].fillna(0)

(2) For a single column using NumPy:
df['DataFrame Column'] = df['DataFrame Column'].replace(np.nan, 0)

(3) For an entire DataFrame using Pandas:
df.fillna(0)

(4) For an entire DataFrame using NumPy:
df.replace(np.nan, 0)
Case 1: Replace NaN values with zeros for a column using Pandas

✓ Suppose that you have a single column with the following
data that contains NaN values:

values
700
NaN
500
NaN

✓ Python code to replace the NaN values with 0’s:
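✓ A minimal sketch of Case 1, using the column above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'values': [700, np.nan, 500, np.nan]})

# Pandas approach for a single column: fill NaN with 0
df['values'] = df['values'].fillna(0)
print(df)
```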
How to Transpose Pandas DataFrame
You can use the following syntax to transpose Pandas DataFrame:
df = df.transpose()

How to apply the above syntax by reviewing 3 cases of:

✓ Transposing a DataFrame with a default index

✓ Transposing a DataFrame with a tailored index

✓ Importing a CSV file and then transposing the DataFrame
Case 1: Transpose Pandas DataFrame with a Default Index

• Let’s create a DataFrame with 3 columns:

Get the DataFrame (with a default numeric index that starts from 0 ):
➢ You can then add df = df.transpose() to the code in order to
transpose the DataFrame:
Case 2: Transpose Pandas DataFrame with a Tailored Index

✓ What if you want to assign your own tailored index, and then
transpose the DataFrame?
✓ For example, let’s add the following index to the DataFrame:
✓ index = ['X', 'Y', 'Z']
✓ Now add df = df.transpose() in order to transpose the
DataFrame:
Case 3: Import a CSV File and then Transpose the Results
• For example, let’s say that you have the following data
saved in a CSV file:

A B C
11 44 77
22 55 88
33 66 99

• You can then use the code to import the data into Python
(note that you’ll need to modify the path to reflect the
location where the CSV file is stored on your computer):
✓ Optionally, you can rename the index values before
transposing the DataFrame:
✓ df = df.rename(index = {0:'X', 1:'Y', 2:'Z'})
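✓ A sketch combining Case 3’s data with the rename-then-transpose step (the DataFrame is built inline here instead of being read from a CSV):

```python
import pandas as pd

# Same values as the CSV table above
df = pd.DataFrame({'A': [11, 22, 33], 'B': [44, 55, 66], 'C': [77, 88, 99]})

# Rename the default index before transposing
df = df.rename(index={0: 'X', 1: 'Y', 2: 'Z'})
df = df.transpose()
print(df)
```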
Using StandardScaler() Function to Standardize Python Data

➢ Scaling of features is an essential step in modeling
algorithms with datasets.
➢ The data that is usually used for the purpose of modeling is
derived through various means such as:
✓ Questionnaire
✓ Surveys
✓ Research
✓ etc.
➢ So, the data obtained contains features of various dimensions
and scales altogether.
➢ Features on different scales adversely affect the modeling of a
dataset.
➢ This leads to biased predictions, seen in higher
misclassification error and lower accuracy rates.
➢ Thus, it is necessary to scale the data prior to modeling.
➢ Standardization is a scaling technique that makes the
data scale-free by converting its statistical distribution to:

✓ mean = 0 (zero)
✓ standard deviation = 1

✓ The Python sklearn library offers the StandardScaler()
function to standardize data values into a standard format.
✓ Syntax:
✓ object = StandardScaler()
✓ object.fit_transform(data)

✓ According to the above syntax, we initially create an object of
the StandardScaler() class.
✓ Then we use fit_transform() with the assigned object
to transform the data and standardize it.
Standard Scaler
➢ Standard Scaler helps to get standardized distribution, with a
zero mean and standard deviation of one (unit variance).
➢ It standardizes features by subtracting the mean value from
the feature and then dividing the result by feature standard
deviation.
✓ The standard scaling is calculated as:
✓ z = (x - u) / s
Where,
➢ z is the scaled value,
➢ x is the value to be scaled,
➢ u is the mean of the training samples,
➢ s is the standard deviation of the training samples.
✓ Sklearn preprocessing supports StandardScaler() method to
achieve this directly in merely 2-3 steps.
Syntax:

class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)

Parameters:

• copy: If False, scaling is done in place. If True, a copy is created
instead of in-place scaling.
• with_mean: If True, data is centered before scaling.
• with_std: If True, data is scaled to unit variance.
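✓ A small sketch of the fit_transform() workflow on illustrative data (the two columns deliberately sit on very different scales):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: e.g. an age-like and an income-like feature
data = np.array([[25, 50000],
                 [30, 60000],
                 [35, 70000],
                 [40, 80000]], dtype=float)

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

print(scaled.mean(axis=0))  # each column now has mean ~0
print(scaled.std(axis=0))   # each column now has std 1
```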
MinMax Scaler
✓ This is another way of data scaling, where the minimum of feature is
made equal to zero and the maximum of feature equal to one.
✓ MinMax Scaler shrinks the data within the given range, usually of 0 to 1.
✓ It transforms data by scaling features to a given range.
✓ It scales the values to a specific value range without changing the shape
of the original distribution.

The MinMax scaling is done using:

x_std = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

x_scaled = x_std * (max - min) + min

Where,
min, max = feature_range
x.min(axis=0): minimum feature value
x.max(axis=0): maximum feature value
Syntax:

class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)

Parameters:

• feature_range: Desired range of the scaled data.
• The default range returned by MinMaxScaler is 0 to 1.
• The range is provided in tuple form as (min, max).
• copy: If False, scaling is done in place. If True, a copy is created
instead of in-place scaling.
• clip: If True, scaled data is clipped to the provided feature range.
✓ Only two column names (Income, Age) are scaled, because the
Department column contains character data, which should not be scaled.
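✓ The same illustrative two-column data as above, scaled into the 0–1 range (stand-ins for the Age and Income columns mentioned in the slide):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative stand-ins for the Age and Income columns
data = np.array([[25, 50000],
                 [30, 60000],
                 [35, 70000],
                 [40, 80000]], dtype=float)

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)
print(scaled)  # each column's minimum maps to 0 and maximum to 1
```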
Z score for Outlier Detection – Python

➢ Z score is an important concept in statistics.
➢ Z score is also called standard score.
➢ This score helps to understand if a data value is greater or
smaller than the mean and how far away it is from the mean.
➢ More specifically, Z score tells how many standard deviations
away a data point is from the mean.

Z score = (x - mean) / std. deviation


Z score and Outliers:

✓ If the z score of a data point is more than 3, it indicates that the
data point is quite different from the other data points.
✓ Such a data point can be an outlier.
✓ For example, in a survey, it was asked how many children a
person had.
✓ Suppose the data obtained from people is:
✓ 1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2
✓ Clearly, 15 is an outlier in this dataset.

Let us calculate the Z score using Python to find this outlier.

Step 1: Import necessary libraries

import numpy as np

Step 2: Calculate mean, standard deviation

data = [1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2]


mean = np.mean(data)
std = np.std(data)
print('mean of the dataset is', mean)
print('std. deviation is', std)

Output:
mean of the dataset is 2.6666666666666665
std. deviation is 3.3598941782277745
Step 3: Calculate Z score. If Z score>3, print it as an outlier.

threshold = 3
outlier = []
for i in data:
    z = (i - mean) / std
    if z > threshold:
        outlier.append(i)
print('outlier in dataset is', outlier)

Output:

outlier in dataset is [15]

Conclusion: Z score helps us identify outliers in the data.

Example: Approach for Outliers

We will approach dealing with bad data using the Z-Score:

1. The very first step is setting the upper and lower limits.
Every data point outside this range will be regarded as
an outlier.

Let’s see the formulae for both upper and lower limits.

Upper: mean + 3 * standard deviation
Lower: mean - 3 * standard deviation

✓ In the output, we see that the upper limit is 13.06 while the
lower limit is 2.12.
✓ Hence any value out of this range is a bad data point.

2. The second step is to detect how many outliers there are in
the dataset, based on the upper and lower limits we just set:

df[(df['cgpa'] > 13.06) | (df['cgpa'] < 2.12)]
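✓ The two steps above can be sketched end-to-end. The ‘cgpa’ values below are illustrative (the 13.06 / 2.12 limits in the slide come from the instructor’s own dataset), so the limits are recomputed from this sample:

```python
import pandas as pd

# Illustrative cgpa-like data: 20 typical values plus one obvious outlier
df = pd.DataFrame({'cgpa': [6.5, 6.8, 7.0, 7.2, 6.9, 7.1, 6.7, 7.3, 7.0, 6.6,
                            7.4, 6.8, 7.1, 6.9, 7.2, 7.0, 6.8, 7.3, 6.7, 7.1,
                            25.0]})

# Step 1: set the upper and lower limits from mean +/- 3 standard deviations
upper = df['cgpa'].mean() + 3 * df['cgpa'].std()
lower = df['cgpa'].mean() - 3 * df['cgpa'].std()

# Step 2: detect the rows falling outside the limits
outliers = df[(df['cgpa'] > upper) | (df['cgpa'] < lower)]
print(outliers)
```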
