
Assignment 2 Ds

Uploaded by

siddhigupta1310

ASSIGNMENT-2

Q1. Create a data frame in Python that contains missing values. The data is
printed and missing values are indicated by the value NaN. Implement the
methods below for handling missing values.
Method 1: Replace missing values with zeros
Method 2: Dropping rows with missing values
Method 3: Replace missing values with Mean/Median/Mode
Method 4: Fill NaN values with the value from the previous rows
Method 5: Fill NaN values with the value from the next rows
Method 6: Fill missing values using interpolation method: Linear Interpolation
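
Before the full solution, the difference between Methods 4, 5 and 6 can be seen on a single column. This is a minimal sketch on a made-up Series (the name s and its values are illustrative assumptions, not part of the assignment data).

```python
import numpy as np
import pandas as pd

# Hypothetical column with two consecutive missing values
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()                      # copy the last valid value downwards
backward = s.bfill()                     # copy the next valid value upwards
linear = s.interpolate(method='linear')  # evenly spaced between neighbours

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
print(linear.tolist())    # [1.0, 2.0, 3.0, 4.0]
```

Forward and backward fill repeat an existing value, while linear interpolation estimates new values between the two known neighbours.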

import pandas as pd
import numpy as np
df=pd.DataFrame( {
'c1': [1,2,np.nan,4],
'c2': [5,6,7,8],
'c3': [9,10,np.nan,12]} ,dtype='f')
print("Dataframe is:\n",df)
print()
method_1=df.fillna(0)
method_2=df.dropna()
method_3=df.fillna(df.mean())
# fillna(method=...) is deprecated in recent pandas; ffill()/bfill() do the same job
method_4=df.ffill()
method_5=df.bfill()
method_6=df.interpolate(method='linear')
print("Method 1: Replace missing values with zeros")
print(method_1)
print("\nMethod 2: Dropping rows with missing values")
print(method_2)
print("\nMethod 3: Replace missing values with Mean")
print(method_3)
print("\nMethod 4: Fill NaN values with the value from the previous rows")
print(method_4)
print("\nMethod 5: Fill NaN values with the value from the next rows")
print(method_5)

print("\nMethod 6: Fill missing values using interpolation method: Linear Interpolation")
print(method_6)

Taranpreet Kaur 15417702022
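
Method 3 also names the median and the mode, while the code above fills with the mean only. A possible sketch of the other two variants follows; the frame demo and its values are assumptions chosen so that the median and mode differ from the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are illustrative only
demo = pd.DataFrame({
    'c1': [1, 2, np.nan, 2, 7],
    'c3': [9, 10, np.nan, 12, 9],
})

# Fill each column with its own median
filled_median = demo.fillna(demo.median())

# mode() can return several rows when values tie, so take the first row
filled_mode = demo.fillna(demo.mode().iloc[0])

print(filled_median)
print(filled_mode)
```

Taking the first row of mode() is one way to break ties; a different tie-breaking rule may suit other data better.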

Q2. Write Python code to carry out equal-width binning for the prices
of nine items stored in a data frame. For equi-width binning, the
minimum and maximum price values are used to form three equal-width bins
named Low, Medium, and High. Plot a histogram for the three bins based
on the price range.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = {

'Item': ['Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8', 'Item9'],
'Price': [10, 20, 30, 40, 50, 60, 70, 80, 90]
}
df = pd.DataFrame(data)
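# Define three equal-width bin edges from the minimum and maximum price
min_price = df['Price'].min()
max_price = df['Price'].max()
bin_edges = np.linspace(min_price, max_price, 4)
labels = ['Low', 'Medium', 'High']
df['Price_Bin'] = pd.cut(df['Price'], bins=bin_edges, labels=labels, include_lowest=True)
print(df)
# Count the items in each bin and draw one bar per bin in Low/Medium/High order
bin_counts = df['Price_Bin'].value_counts().reindex(labels)
plt.bar(labels, bin_counts.values, edgecolor='black')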
plt.title('Histogram of Price Bins')
plt.xlabel('Price Range')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
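
The arithmetic behind equal-width binning can also be written out by hand. The sketch below is an assumption-laden illustration: the helper equal_width_label is hypothetical, and it uses left-closed intervals, whereas pd.cut uses right-closed intervals by default, so values that fall exactly on an inner edge could land in a different bin.

```python
def equal_width_label(value, lo, hi, labels):
    """Return the label of the equal-width bin that contains value."""
    width = (hi - lo) / len(labels)
    # Integer position of the bin; the maximum value is clamped into the last bin
    idx = min(int((value - lo) // width), len(labels) - 1)
    return labels[idx]

prices = [10, 20, 30, 40, 50, 60, 70, 80, 90]
labels = ['Low', 'Medium', 'High']
binned = [equal_width_label(p, min(prices), max(prices), labels) for p in prices]
print(binned)
```

With a range of 10 to 90 and three bins, each bin is about 26.67 wide, so three prices fall into each of Low, Medium and High.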

Q3. Write Python code for outlier detection using the standard
deviation method. For the randomly generated dataset values, the
mean and standard deviation are calculated, and the cut-off value for
identifying outliers is then found by taking three times the standard
deviation as the threshold. The outliers can be represented pictorially
in the form of a histogram.

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
data = np.random.normal(loc=0, scale=1, size=1000)
mean = np.mean(data)
std_dev = np.std(data)

# Set threshold for outliers


threshold = 3 * std_dev
outliers = data[abs(data - mean) > threshold]
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, color='blue', alpha=0.7, edgecolor='black', label='Data')
plt.hist(outliers, bins=30, color='red', alpha=0.7, edgecolor='black',
label='Outliers')
plt.axvline(mean, color='k', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(mean + threshold, color='r', linestyle='dashed', linewidth=1,
label='Threshold')
plt.axvline(mean - threshold, color='r', linestyle='dashed', linewidth=1)
plt.title('Histogram of Data with Outliers Detected by Standard Deviation Method')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()
plt.show()
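
Because the random dataset changes with the seed, the rule is easier to check on a small fixed sample. This is a sketch only: the helper std_outliers and the sample values are assumptions made for illustration.

```python
import numpy as np

def std_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    arr = np.asarray(values, dtype=float)
    mean = arr.mean()
    cutoff = k * arr.std()  # population standard deviation, the np.std default
    return arr[np.abs(arr - mean) > cutoff]

# Hypothetical sample: twenty ordinary values and one extreme value
sample = [10] * 10 + [12] * 10 + [100]
found = std_outliers(sample)
print(found)  # [100.]
```

Here the mean is about 15.2 and three standard deviations is about 56.9, so only 100 lies beyond the cut-off. On very small samples the rule can miss outliers, because a single extreme value inflates the standard deviation itself.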

Q4. Write Python code to represent outliers pictorially in a histogram.
The dataset consists of 94 numerical values containing 2 outliers (the values
10 and 12). The outliers are to be removed from the list so that the final
list of numerical values contains no outliers.

import matplotlib.pyplot as plt

data = [5, 8, 6, 10, 12, 7, 8, 15, 20, 22, 25, 28, 30, 32, 35, 38, 40, 42, 45, 50, 55,
60, 65, 70, 75, 80, 85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150,
155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220, 225, 230,
235, 240, 245, 250, 255, 260, 265, 270, 275, 280, 285, 290, 295, 10, 12]
plt.hist(data, bins=20, color='blue', edgecolor='black')
plt.title('Histogram with Outliers')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
# Keep every value except the two designated outliers
data_clean = [x for x in data if x not in [10, 12]]
plt.hist(data_clean, bins=20, color='green', edgecolor='black')
plt.title('Histogram without Outliers')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
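
Filtering by value removes every occurrence of the listed values, not just one. The short sketch below illustrates this on a made-up list; the names values and to_remove are assumptions for illustration.

```python
# Hypothetical list in which the unwanted values appear twice each
values = [5, 10, 12, 7, 10, 12]
to_remove = {10, 12}

cleaned = [x for x in values if x not in to_remove]
removed_count = len(values) - len(cleaned)

print(cleaned)        # [5, 7]
print(removed_count)  # 4
```

Comparing the list length before and after filtering is a quick way to confirm how many entries were dropped.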

Q5. Write Python code for outlier detection and removal for a given set of
data points using the interquartile method. Equi-width bins are created for
displaying the data values using a histogram. Q1 and Q3 are calculated using
the percentile() function, and the interquartile range (IQR) is the
difference between Q3 and Q1. Next, the lower bound (LB) and upper bound (UB)
are found using the formulas LB = Q1 - (1.5 * IQR) and UB = Q3 + (1.5 * IQR).
The final dataset contains no outliers because only data values within
the LB and UB are kept.

import numpy as np
import matplotlib.pyplot as plt

data = [5, 8, 6, 10, 12, 7, 8, 15, 20, 22, 25, 28, 30, 32, 35, 38, 40, 42, 45, 50, 55,
60, 65, 70, 75, 80,
85, 90, 95, 100, 105, 110, 115, 120, 125, 130, 135, 140, 145, 150, 155, 160,
165, 170, 175, 180, 185,
190, 195, 200, 205, 210, 215, 220, 225, 230, 235, 240, 245, 250, 255, 260,
265, 270, 275, 280, 285, 290,
295, 10, 12]
plt.hist(data, bins=20, color='red',ec='black')
plt.title('Histogram with Equi-width Bins')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
# percentile() as named in the question; equivalent to np.quantile at 0.25 and 0.75
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
LB = Q1 - (1.5 * IQR)
UB = Q3 + (1.5 * IQR)
# Keep only the values that lie within the lower and upper bounds
data_new = [x for x in data if LB <= x <= UB]
plt.hist(data_new, bins=20, color='yellow', ec='black')
plt.title('Histogram without Outliers (Interquartile Method)')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
print("Final dataset without outliers:")
print(data_new)
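
The bounds are easier to follow on a small worked example. The sketch below uses a made-up list (small is an assumption, not the assignment data) and the default linear interpolation of np.percentile.

```python
import numpy as np

# Hypothetical sample with one obvious extreme value
small = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

q1 = np.percentile(small, 25)   # 3.25
q3 = np.percentile(small, 75)   # 7.75
iqr = q3 - q1                   # 4.5
lb = q1 - 1.5 * iqr             # -3.5
ub = q3 + 1.5 * iqr             # 14.5

kept = [x for x in small if lb <= x <= ub]
print(lb, ub)
print(kept)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Only 100 falls outside the interval from -3.5 to 14.5, so it is the single value removed.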
