Final DA LAB1 Merged

The document outlines a lab exercise on Exploratory Data Analysis (EDA) conducted by students at Bharatiya Vidya Bhavan's Sardar Patel Institute of Technology. It details the objectives of EDA, including understanding data structure, identifying data quality issues, and preparing data for modeling, while also providing a practical application using a dataset. The conclusion emphasizes the importance of EDA in uncovering insights and preparing data for further analysis.

Uploaded by

anandkrishna1511

BHARATIYA VIDYA BHAVAN’S

SARDAR PATEL INSTITUTE OF TECHNOLOGY


(Empowered Autonomous Institute Affiliated to University of Mumbai)
[Knowledge is Nectar]

Department of Computer Science and Engineering

Course - Data Analytics


UID 2021300066, 2021300030

Name Anand Krishna, Om Doshi
Class and Batch BE Comps A

Date 29 August 2024

Lab # 1

Aim Perform EDA: determine the number of data samples, features, classes, and data
samples per class; remove missing values; convert non-numeric data to numbers; and
plot different graphs using the seaborn library

Objective To apply EDA to a given dataset

Theory
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process, focusing on
summarizing and understanding the main characteristics of a dataset. EDA involves examining
data sets to uncover patterns, spot anomalies, test hypotheses, and check assumptions using
statistical graphics and other data visualization techniques.

Objectives of EDA

1. Understand Data Structure
2. Identify Data Quality Issues
3. Discover Patterns and Relationships
4. Prepare Data for Modeling

1. Number of Data Samples


● Definition: The total number of individual records or rows in the dataset.
● Purpose: Helps understand the size of the dataset and ensure there is enough data for
analysis and model training.

2. Number of Features

● Definition: The total number of attributes or columns in the dataset.

3. Number of Classes

● Definition: The number of unique categories or labels in the target variable (for
classification problems).

4. Number of Data Samples per Class

● Definition: The distribution of data samples across different classes.
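
The class count and the per-class sample counts can be read directly from the target column. The sketch below assumes a hypothetical toy DataFrame with a target column named `label`; neither the data nor the column name comes from the lab dataset.

```python
import pandas as pd

# Hypothetical toy dataset with a categorical target column named "label".
toy = pd.DataFrame({
    "feature": [1.0, 2.5, 3.1, 4.8, 5.2, 6.0],
    "label": ["cat", "dog", "cat", "bird", "cat", "dog"],
})

# Number of classes: distinct values in the target column.
num_classes = toy["label"].nunique()

# Samples per class: number of rows carrying each label, largest class first.
samples_per_class = toy["label"].value_counts()

print(f"Number of classes: {num_classes}")
print(samples_per_class)
```

A strongly uneven `value_counts()` result is a first hint of class imbalance.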

5. Removing Missing Values

● Definition: The process of handling missing or NaN values in the dataset.
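
Two common ways of handling missing values are dropping the affected rows and filling the gaps. The sketch below shows both on a hypothetical toy DataFrame; the column names and the choice of median and mode as fill values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "area": [1200.0, np.nan, 950.0, 1100.0],
    "zone": ["RL", "RH", None, "RL"],
})

# Count missing entries per column.
missing_per_column = toy.isna().sum()

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: fill instead of dropping
# (median for the numeric column, most frequent value for the categorical one).
filled = toy.copy()
filled["area"] = filled["area"].fillna(filled["area"].median())
filled["zone"] = filled["zone"].fillna(filled["zone"].mode()[0])

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping is simple but discards whole rows; filling keeps every row at the cost of introducing estimated values.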

6. Conversion to Numbers

● Definition: Transforming categorical or non-numeric data into numerical format.
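
Which conversion fits depends on whether the categories have a natural order. The sketch below contrasts an ordinal mapping with one-hot encoding on a hypothetical toy DataFrame; the category values and the ordering `TA < Gd < Ex` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: one unordered and one ordered categorical column.
toy = pd.DataFrame({
    "street": ["Pave", "Grvl", "Pave"],
    "quality": ["Gd", "TA", "Ex"],
})

# Ordinal mapping: suitable when categories have a natural order.
# The ordering below is an assumption for this example.
quality_order = {"TA": 0, "Gd": 1, "Ex": 2}
toy["quality_num"] = toy["quality"].map(quality_order)

# One-hot encoding: suitable when categories have no order.
one_hot = pd.get_dummies(toy["street"], prefix="street", dtype=int)

print(toy)
print(one_hot)
```

One-hot encoding avoids implying an order between categories, but it adds one column per category.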

7. Using Seaborn Library to Plot Different Graphs

● Definition: Utilizing the Seaborn library to create visualizations such as histograms, scatter
plots, pair plots, etc.
● Purpose: Helps in understanding the data distribution, relationships between features, and
patterns or anomalies in the data. Seaborn simplifies the creation of aesthetically pleasing
and informative plots.

Conclusion EDA is an essential process in data analytics that involves summarizing, visualizing, and
cleaning data to uncover insights and prepare it for further analysis or modeling. By
applying EDA techniques, we can better understand the dataset's structure, address data
quality issues, explore relationships, and make informed decisions about the next steps in
data analysis.

References https://www.geeksforgeeks.org/matplotlib-tutorial/
da-exp1

August 29, 2024

[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

[ ]: data = pd.read_csv("/content/AmesHousing.csv")

[ ]: data.head()

[ ]: Order PID MS SubClass MS Zoning Lot Frontage Lot Area Street \


0 1 526301100 20 RL 141.0 31770 Pave
1 2 526350040 20 RH 80.0 11622 Pave
2 3 526351010 20 RL 81.0 14267 Pave
3 4 526353030 20 RL 93.0 11160 Pave
4 5 527105010 60 RL 74.0 13830 Pave

Alley Lot Shape Land Contour … Pool Area Pool QC Fence Misc Feature \
0 NaN IR1 Lvl … 0 NaN NaN NaN
1 NaN Reg Lvl … 0 NaN MnPrv NaN
2 NaN IR1 Lvl … 0 NaN NaN Gar2
3 NaN Reg Lvl … 0 NaN NaN NaN
4 NaN IR1 Lvl … 0 NaN MnPrv NaN

Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice


0 0 5 2010 WD Normal 215000
1 0 6 2010 WD Normal 105000
2 12500 6 2010 WD Normal 172000
3 0 4 2010 WD Normal 244000
4 0 3 2010 WD Normal 189900

[5 rows x 82 columns]

[ ]: data.describe()

[ ]: Order PID MS SubClass Lot Frontage Lot Area \
count 2930.00000 2.930000e+03 2930.000000 2440.000000 2930.000000
mean 1465.50000 7.144645e+08 57.387372 69.224590 10147.921843
std 845.96247 1.887308e+08 42.638025 23.365335 7880.017759
min 1.00000 5.263011e+08 20.000000 21.000000 1300.000000
25% 733.25000 5.284770e+08 20.000000 58.000000 7440.250000
50% 1465.50000 5.354536e+08 50.000000 68.000000 9436.500000
75% 2197.75000 9.071811e+08 70.000000 80.000000 11555.250000
max 2930.00000 1.007100e+09 190.000000 313.000000 215245.000000

Overall Qual Overall Cond Year Built Year Remod/Add Mas Vnr Area \
count 2930.000000 2930.000000 2930.000000 2930.000000 2907.000000
mean 6.094881 5.563140 1971.356314 1984.266553 101.896801
std 1.411026 1.111537 30.245361 20.860286 179.112611
min 1.000000 1.000000 1872.000000 1950.000000 0.000000
25% 5.000000 5.000000 1954.000000 1965.000000 0.000000
50% 6.000000 5.000000 1973.000000 1993.000000 0.000000
75% 7.000000 6.000000 2001.000000 2004.000000 164.000000
max 10.000000 9.000000 2010.000000 2010.000000 1600.000000

… Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch \


count … 2930.000000 2930.000000 2930.000000 2930.000000
mean … 93.751877 47.533447 23.011604 2.592491
std … 126.361562 67.483400 64.139059 25.141331
min … 0.000000 0.000000 0.000000 0.000000
25% … 0.000000 0.000000 0.000000 0.000000
50% … 0.000000 27.000000 0.000000 0.000000
75% … 168.000000 70.000000 0.000000 0.000000
max … 1424.000000 742.000000 1012.000000 508.000000

Screen Porch Pool Area Misc Val Mo Sold Yr Sold \


count 2930.000000 2930.000000 2930.000000 2930.000000 2930.000000
mean 16.002048 2.243345 50.635154 6.216041 2007.790444
std 56.087370 35.597181 566.344288 2.714492 1.316613
min 0.000000 0.000000 0.000000 1.000000 2006.000000
25% 0.000000 0.000000 0.000000 4.000000 2007.000000
50% 0.000000 0.000000 0.000000 6.000000 2008.000000
75% 0.000000 0.000000 0.000000 8.000000 2009.000000
max 576.000000 800.000000 17000.000000 12.000000 2010.000000

SalePrice
count 2930.000000
mean 180796.060068
std 79886.692357
min 12789.000000
25% 129500.000000
50% 160000.000000
75% 213500.000000
max 755000.000000

[8 rows x 39 columns]

[ ]: data.columns

[ ]: Index(['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
'2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
'Garage Finish', 'Garage Cars', 'Garage Area', 'Garage Qual',
'Garage Cond', 'Paved Drive', 'Wood Deck SF', 'Open Porch SF',
'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Pool QC',
'Fence', 'Misc Feature', 'Misc Val', 'Mo Sold', 'Yr Sold', 'Sale Type',
'Sale Condition', 'SalePrice'],
dtype='object')

[ ]: # Number of data samples
num_samples = data.shape[0]

# Number of features
num_features = data.shape[1]

print(f"Number of data samples: {num_samples}")
print(f"Number of features: {num_features}")

Number of data samples: 2930
Number of features: 82

[ ]: # Returns a tuple of (rows, columns)
data.shape

[ ]: (2930, 82)

[ ]: data.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
# Column Non-Null Count Dtype

0 Order 2930 non-null int64


1 PID 2930 non-null int64
2 MS SubClass 2930 non-null int64
3 MS Zoning 2930 non-null object
4 Lot Frontage 2440 non-null float64
5 Lot Area 2930 non-null int64
6 Street 2930 non-null object
7 Alley 198 non-null object
8 Lot Shape 2930 non-null object
9 Land Contour 2930 non-null object
10 Utilities 2930 non-null object
11 Lot Config 2930 non-null object
12 Land Slope 2930 non-null object
13 Neighborhood 2930 non-null object
14 Condition 1 2930 non-null object
15 Condition 2 2930 non-null object
16 Bldg Type 2930 non-null object
17 House Style 2930 non-null object
18 Overall Qual 2930 non-null int64
19 Overall Cond 2930 non-null int64
20 Year Built 2930 non-null int64
21 Year Remod/Add 2930 non-null int64
22 Roof Style 2930 non-null object
23 Roof Matl 2930 non-null object
24 Exterior 1st 2930 non-null object
25 Exterior 2nd 2930 non-null object
26 Mas Vnr Type 1155 non-null object
27 Mas Vnr Area 2907 non-null float64
28 Exter Qual 2930 non-null object
29 Exter Cond 2930 non-null object
30 Foundation 2930 non-null object
31 Bsmt Qual 2850 non-null object
32 Bsmt Cond 2850 non-null object
33 Bsmt Exposure 2847 non-null object
34 BsmtFin Type 1 2850 non-null object
35 BsmtFin SF 1 2929 non-null float64
36 BsmtFin Type 2 2849 non-null object
37 BsmtFin SF 2 2929 non-null float64
38 Bsmt Unf SF 2929 non-null float64
39 Total Bsmt SF 2929 non-null float64
40 Heating 2930 non-null object
41 Heating QC 2930 non-null object
42 Central Air 2930 non-null object
43 Electrical 2929 non-null object
44 1st Flr SF 2930 non-null int64
45 2nd Flr SF 2930 non-null int64
46 Low Qual Fin SF 2930 non-null int64
47 Gr Liv Area 2930 non-null int64
48 Bsmt Full Bath 2928 non-null float64
49 Bsmt Half Bath 2928 non-null float64
50 Full Bath 2930 non-null int64
51 Half Bath 2930 non-null int64
52 Bedroom AbvGr 2930 non-null int64
53 Kitchen AbvGr 2930 non-null int64
54 Kitchen Qual 2930 non-null object
55 TotRms AbvGrd 2930 non-null int64
56 Functional 2930 non-null object
57 Fireplaces 2930 non-null int64
58 Fireplace Qu 1508 non-null object
59 Garage Type 2773 non-null object
60 Garage Yr Blt 2771 non-null float64
61 Garage Finish 2771 non-null object
62 Garage Cars 2929 non-null float64
63 Garage Area 2929 non-null float64
64 Garage Qual 2771 non-null object
65 Garage Cond 2771 non-null object
66 Paved Drive 2930 non-null object
67 Wood Deck SF 2930 non-null int64
68 Open Porch SF 2930 non-null int64
69 Enclosed Porch 2930 non-null int64
70 3Ssn Porch 2930 non-null int64
71 Screen Porch 2930 non-null int64
72 Pool Area 2930 non-null int64
73 Pool QC 13 non-null object
74 Fence 572 non-null object
75 Misc Feature 106 non-null object
76 Misc Val 2930 non-null int64
77 Mo Sold 2930 non-null int64
78 Yr Sold 2930 non-null int64
79 Sale Type 2930 non-null object
80 Sale Condition 2930 non-null object
81 SalePrice 2930 non-null int64
dtypes: float64(11), int64(28), object(43)
memory usage: 1.8+ MB

[ ]: data.isnull().sum().sort_values(ascending=False)

[ ]: Pool QC 2917
Misc Feature 2824
Alley 2732
Fence 2358
Mas Vnr Type 1775

PID 0
Central Air 0
1st Flr SF 0
2nd Flr SF 0
SalePrice 0
Length: 82, dtype: int64

[ ]: missing_values = data.isnull().sum().sort_values(ascending=False)
missing_values_percentage = (missing_values / len(data)) * 100
missing_values_percentage.to_dict()

[ ]: {'Pool QC': 99.55631399317406,


'Misc Feature': 96.38225255972696,
'Alley': 93.24232081911262,
'Fence': 80.4778156996587,
'Mas Vnr Type': 60.580204778157,
'Fireplace Qu': 48.532423208191126,
'Lot Frontage': 16.723549488054605,
'Garage Cond': 5.426621160409556,
'Garage Finish': 5.426621160409556,
'Garage Yr Blt': 5.426621160409556,
'Garage Qual': 5.426621160409556,
'Garage Type': 5.3583617747440275,
'Bsmt Exposure': 2.832764505119454,
'BsmtFin Type 2': 2.7645051194539247,
'Bsmt Qual': 2.7303754266211606,
'Bsmt Cond': 2.7303754266211606,
'BsmtFin Type 1': 2.7303754266211606,
'Mas Vnr Area': 0.7849829351535836,
'Bsmt Full Bath': 0.06825938566552901,
'Bsmt Half Bath': 0.06825938566552901,
'BsmtFin SF 1': 0.034129692832764506,
'Garage Cars': 0.034129692832764506,
'Electrical': 0.034129692832764506,
'Total Bsmt SF': 0.034129692832764506,
'Bsmt Unf SF': 0.034129692832764506,
'BsmtFin SF 2': 0.034129692832764506,
'Garage Area': 0.034129692832764506,
'Paved Drive': 0.0,
'Full Bath': 0.0,
'Half Bath': 0.0,
'Bedroom AbvGr': 0.0,
'Kitchen AbvGr': 0.0,
'Kitchen Qual': 0.0,
'TotRms AbvGrd': 0.0,
'Sale Condition': 0.0,
'Sale Type': 0.0,
'Yr Sold': 0.0,
'Mo Sold': 0.0,
'Misc Val': 0.0,
'Functional': 0.0,
'Fireplaces': 0.0,
'Pool Area': 0.0,
'Screen Porch': 0.0,
'3Ssn Porch': 0.0,
'Enclosed Porch': 0.0,
'Open Porch SF': 0.0,
'Wood Deck SF': 0.0,
'Order': 0.0,
'Heating QC': 0.0,
'Gr Liv Area': 0.0,
'Overall Qual': 0.0,
'MS SubClass': 0.0,
'MS Zoning': 0.0,
'Lot Area': 0.0,
'Street': 0.0,
'Lot Shape': 0.0,
'Land Contour': 0.0,
'Utilities': 0.0,
'Lot Config': 0.0,
'Land Slope': 0.0,
'Neighborhood': 0.0,
'Condition 1': 0.0,
'Condition 2': 0.0,
'Bldg Type': 0.0,
'House Style': 0.0,
'Overall Cond': 0.0,
'Low Qual Fin SF': 0.0,
'Year Built': 0.0,
'Year Remod/Add': 0.0,
'Roof Style': 0.0,
'Roof Matl': 0.0,
'Exterior 1st': 0.0,
'Exterior 2nd': 0.0,
'Exter Qual': 0.0,
'Exter Cond': 0.0,
'Foundation': 0.0,
'Heating': 0.0,
'PID': 0.0,
'Central Air': 0.0,
'1st Flr SF': 0.0,
'2nd Flr SF': 0.0,
'SalePrice': 0.0}
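
Missing-value percentages like these can feed a rule for choosing which columns to drop. The sketch below is one possible rule on a hypothetical toy DataFrame; the 60% cut-off and the column names are assumptions, not the criterion used in this notebook, which lists the dropped columns by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missingness per column.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "half_missing": [np.nan, 2.0, np.nan, 4.0],
    "complete": [1, 2, 3, 4],
})

# Percentage of missing entries per column.
missing_pct = toy.isna().mean() * 100

# Assumed cut-off: drop columns with more than 60% missing values.
threshold = 60.0
to_drop = missing_pct[missing_pct > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_pct)
print(f"Dropped columns: {to_drop}")
```

A threshold keeps the decision reproducible, though a column with many gaps may still be worth keeping when the gaps themselves carry meaning.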

[ ]: unique_values = data.nunique().sort_values(ascending=False)
unique_values_percentage = unique_values / len(data) * 100
unique_values_percentage.to_dict()

[ ]: {'Order': 100.0,
'PID': 100.0,
'Lot Area': 66.89419795221842,
'Gr Liv Area': 44.09556313993174,
'Bsmt Unf SF': 38.80546075085324,
'1st Flr SF': 36.96245733788396,
'Total Bsmt SF': 36.10921501706484,
'SalePrice': 35.221843003412964,
'BsmtFin SF 1': 33.95904436860068,
'2nd Flr SF': 21.672354948805463,
'Garage Area': 20.580204778156997,
'Mas Vnr Area': 15.187713310580206,
'Wood Deck SF': 12.969283276450511,
'BsmtFin SF 2': 9.351535836177474,
'Open Porch SF': 8.600682593856655,
'Enclosed Porch': 6.2457337883959045,
'Lot Frontage': 4.368600682593857,
'Screen Porch': 4.129692832764506,
'Year Built': 4.027303754266211,
'Garage Yr Blt': 3.515358361774744,
'Year Remod/Add': 2.0819112627986347,
'Misc Val': 1.2969283276450512,
'Low Qual Fin SF': 1.228668941979522,
'3Ssn Porch': 1.0580204778156996,
'Neighborhood': 0.955631399317406,
'Exterior 2nd': 0.5802047781569966,
'MS SubClass': 0.5460750853242321,
'Exterior 1st': 0.5460750853242321,
'TotRms AbvGrd': 0.477815699658703,
'Pool Area': 0.477815699658703,
'Mo Sold': 0.40955631399317405,
'Overall Qual': 0.3412969283276451,
'Sale Type': 0.3412969283276451,
'Overall Cond': 0.3071672354948805,
'Condition 1': 0.3071672354948805,
'Bedroom AbvGr': 0.27303754266211605,
'Functional': 0.27303754266211605,
'Condition 2': 0.27303754266211605,
'Roof Matl': 0.27303754266211605,
'House Style': 0.27303754266211605,
'MS Zoning': 0.2389078498293515,
'BsmtFin Type 1': 0.20477815699658702,
'Roof Style': 0.20477815699658702,
'Garage Type': 0.20477815699658702,
'Garage Cars': 0.20477815699658702,
'BsmtFin Type 2': 0.20477815699658702,
'Sale Condition': 0.20477815699658702,
'Foundation': 0.20477815699658702,
'Heating': 0.20477815699658702,
'Garage Qual': 0.17064846416382254,
'Garage Cond': 0.17064846416382254,
'Fireplace Qu': 0.17064846416382254,
'Yr Sold': 0.17064846416382254,
'Fireplaces': 0.17064846416382254,
'Misc Feature': 0.17064846416382254,
'Heating QC': 0.17064846416382254,
'Kitchen Qual': 0.17064846416382254,
'Lot Config': 0.17064846416382254,
'Full Bath': 0.17064846416382254,
'Bldg Type': 0.17064846416382254,
'Electrical': 0.17064846416382254,
'Exter Cond': 0.17064846416382254,
'Bsmt Qual': 0.17064846416382254,
'Bsmt Cond': 0.17064846416382254,
'Kitchen AbvGr': 0.13651877133105803,
'Mas Vnr Type': 0.13651877133105803,
'Exter Qual': 0.13651877133105803,
'Fence': 0.13651877133105803,
'Pool QC': 0.13651877133105803,
'Bsmt Exposure': 0.13651877133105803,
'Land Contour': 0.13651877133105803,
'Bsmt Full Bath': 0.13651877133105803,
'Lot Shape': 0.13651877133105803,
'Paved Drive': 0.10238907849829351,
'Garage Finish': 0.10238907849829351,
'Bsmt Half Bath': 0.10238907849829351,
'Land Slope': 0.10238907849829351,
'Half Bath': 0.10238907849829351,
'Utilities': 0.10238907849829351,
'Alley': 0.06825938566552901,
'Street': 0.06825938566552901,
'Central Air': 0.06825938566552901}

[ ]: unique_values = data.nunique().sort_values(ascending=True).to_dict()
unique_values

[ ]: {'Central Air': 2,
'Street': 2,
'Alley': 2,
'Bsmt Half Bath': 3,
'Paved Drive': 3,
'Half Bath': 3,
'Utilities': 3,
'Garage Finish': 3,
'Land Slope': 3,
'Kitchen AbvGr': 4,
'Fence': 4,
'Mas Vnr Type': 4,
'Exter Qual': 4,
'Pool QC': 4,
'Land Contour': 4,
'Lot Shape': 4,
'Bsmt Exposure': 4,
'Bsmt Full Bath': 4,
'Electrical': 5,
'Misc Feature': 5,
'Bsmt Qual': 5,
'Bsmt Cond': 5,
'Exter Cond': 5,
'Fireplaces': 5,
'Kitchen Qual': 5,
'Heating QC': 5,
'Bldg Type': 5,
'Fireplace Qu': 5,
'Lot Config': 5,
'Yr Sold': 5,
'Garage Qual': 5,
'Garage Cond': 5,
'Full Bath': 5,
'Garage Type': 6,
'Garage Cars': 6,
'Sale Condition': 6,
'Heating': 6,
'BsmtFin Type 2': 6,
'BsmtFin Type 1': 6,
'Foundation': 6,
'Roof Style': 6,
'MS Zoning': 7,
'Condition 2': 8,
'House Style': 8,
'Functional': 8,
'Roof Matl': 8,
'Bedroom AbvGr': 8,
'Condition 1': 9,
'Overall Cond': 9,
'Sale Type': 10,
'Overall Qual': 10,
'Mo Sold': 12,
'Pool Area': 14,
'TotRms AbvGrd': 14,
'Exterior 1st': 16,
'MS SubClass': 16,
'Exterior 2nd': 17,
'Neighborhood': 28,
'3Ssn Porch': 31,
'Low Qual Fin SF': 36,
'Misc Val': 38,
'Year Remod/Add': 61,
'Garage Yr Blt': 103,
'Year Built': 118,
'Screen Porch': 121,
'Lot Frontage': 128,
'Enclosed Porch': 183,
'Open Porch SF': 252,
'BsmtFin SF 2': 274,
'Wood Deck SF': 380,
'Mas Vnr Area': 445,
'Garage Area': 603,
'2nd Flr SF': 635,
'BsmtFin SF 1': 995,
'SalePrice': 1032,
'Total Bsmt SF': 1058,
'1st Flr SF': 1083,
'Bsmt Unf SF': 1137,
'Gr Liv Area': 1292,
'Lot Area': 1960,
'PID': 2930,
'Order': 2930}

[ ]: # Define the list of column names to drop
columns_to_drop = ['Order', 'PID', 'Pool QC', 'Misc Feature', 'Alley', 'Mas Vnr Type']

# Drop the columns
data = data.drop(columns=columns_to_drop)

[ ]: data.describe()

[ ]: MS SubClass Lot Frontage Lot Area Overall Qual Overall Cond \


count 2930.000000 2440.000000 2930.000000 2930.000000 2930.000000
mean 57.387372 69.224590 10147.921843 6.094881 5.563140
std 42.638025 23.365335 7880.017759 1.411026 1.111537
min 20.000000 21.000000 1300.000000 1.000000 1.000000
25% 20.000000 58.000000 7440.250000 5.000000 5.000000
50% 50.000000 68.000000 9436.500000 6.000000 5.000000
75% 70.000000 80.000000 11555.250000 7.000000 6.000000
max 190.000000 313.000000 215245.000000 10.000000 9.000000

Year Built Year Remod/Add Mas Vnr Area BsmtFin SF 1 BsmtFin SF 2 \


count 2930.000000 2930.000000 2907.000000 2929.000000 2929.000000
mean 1971.356314 1984.266553 101.896801 442.629566 49.722431
std 30.245361 20.860286 179.112611 455.590839 169.168476
min 1872.000000 1950.000000 0.000000 0.000000 0.000000
25% 1954.000000 1965.000000 0.000000 0.000000 0.000000
50% 1973.000000 1993.000000 0.000000 370.000000 0.000000
75% 2001.000000 2004.000000 164.000000 734.000000 0.000000
max 2010.000000 2010.000000 1600.000000 5644.000000 1526.000000

… Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch \


count … 2930.000000 2930.000000 2930.000000 2930.000000
mean … 93.751877 47.533447 23.011604 2.592491
std … 126.361562 67.483400 64.139059 25.141331
min … 0.000000 0.000000 0.000000 0.000000
25% … 0.000000 0.000000 0.000000 0.000000
50% … 0.000000 27.000000 0.000000 0.000000
75% … 168.000000 70.000000 0.000000 0.000000
max … 1424.000000 742.000000 1012.000000 508.000000

Screen Porch Pool Area Misc Val Mo Sold Yr Sold \


count 2930.000000 2930.000000 2930.000000 2930.000000 2930.000000
mean 16.002048 2.243345 50.635154 6.216041 2007.790444
std 56.087370 35.597181 566.344288 2.714492 1.316613
min 0.000000 0.000000 0.000000 1.000000 2006.000000
25% 0.000000 0.000000 0.000000 4.000000 2007.000000
50% 0.000000 0.000000 0.000000 6.000000 2008.000000
75% 0.000000 0.000000 0.000000 8.000000 2009.000000
max 576.000000 800.000000 17000.000000 12.000000 2010.000000

SalePrice
count 2930.000000
mean 180796.060068
std 79886.692357
min 12789.000000
25% 129500.000000
50% 160000.000000
75% 213500.000000
max 755000.000000

[8 rows x 37 columns]

[ ]: for column in data.columns:
    null_count = data[column].isna().sum()
    if null_count > 0:  # Check if there are any null values in the column
        print(f"Column: {column}")
        print(f"Data Type: {data[column].dtype}")  # Print the data type of the column
        print(f"Number of NA values: {null_count}")  # Print the number of NA values
        print()  # Adds an empty line for better readability

Column: Lot Frontage


Data Type: float64
Number of NA values: 490

Column: Mas Vnr Area


Data Type: float64
Number of NA values: 23

Column: Bsmt Qual


Data Type: object
Number of NA values: 80

Column: Bsmt Cond


Data Type: object
Number of NA values: 80

Column: Bsmt Exposure


Data Type: object
Number of NA values: 83

Column: BsmtFin Type 1


Data Type: object
Number of NA values: 80

Column: BsmtFin SF 1
Data Type: float64
Number of NA values: 1

Column: BsmtFin Type 2


Data Type: object
Number of NA values: 81

Column: BsmtFin SF 2
Data Type: float64
Number of NA values: 1

Column: Bsmt Unf SF


Data Type: float64
Number of NA values: 1

Column: Total Bsmt SF


Data Type: float64
Number of NA values: 1

Column: Electrical
Data Type: object
Number of NA values: 1

Column: Bsmt Full Bath


Data Type: float64
Number of NA values: 2

Column: Bsmt Half Bath


Data Type: float64
Number of NA values: 2

Column: Fireplace Qu
Data Type: object
Number of NA values: 1422

Column: Garage Type


Data Type: object
Number of NA values: 157

Column: Garage Yr Blt


Data Type: float64
Number of NA values: 159

Column: Garage Finish


Data Type: object
Number of NA values: 159

Column: Garage Cars


Data Type: float64
Number of NA values: 1

Column: Garage Area


Data Type: float64
Number of NA values: 1

Column: Garage Qual


Data Type: object
Number of NA values: 159

Column: Garage Cond


Data Type: object
Number of NA values: 159

Column: Fence
Data Type: object
Number of NA values: 2358

[ ]: for column in data.columns:
    if pd.api.types.is_numeric_dtype(data[column]) and data[column].isna().sum() > 0:
        skew_value = data[column].skew()
        print(f"Skewness of {column}: {skew_value}")

        # If data is skewed, use median, otherwise use mean
        if abs(skew_value) > 0.5:  # Adjust the threshold if needed
            median_value = data[column].median()
            # Assign the result back: fillna(..., inplace=True) on a selected
            # column is a chained assignment that recent pandas versions warn about
            data[column] = data[column].fillna(median_value)
            print(f"Filled NA in {column} with median: {median_value}")
        else:
            mean_value = data[column].mean()
            data[column] = data[column].fillna(mean_value)
            print(f"Filled NA in {column} with mean: {mean_value}")

print("Missing value replacement complete.")

Skewness of Lot Frontage: 1.499067354883421


Filled NA in Lot Frontage with median: 68.0
Skewness of Mas Vnr Area: 2.606984784742485
Filled NA in Mas Vnr Area with median: 0.0
Skewness of BsmtFin SF 1: 1.416182206786989
Filled NA in BsmtFin SF 1 with median: 370.0
Skewness of BsmtFin SF 2: 4.139978473979118
Filled NA in BsmtFin SF 2 with median: 0.0
Skewness of Bsmt Unf SF: 0.9230527428629574
Filled NA in Bsmt Unf SF with median: 466.0
Skewness of Total Bsmt SF: 1.156204321548864
Filled NA in Total Bsmt SF with median: 990.0
Skewness of Bsmt Full Bath: 0.6166390019959825
Filled NA in Bsmt Full Bath with median: 0.0
Skewness of Bsmt Half Bath: 3.940795464335767
Filled NA in Bsmt Half Bath with median: 0.0
Skewness of Garage Yr Blt: -0.38467176161174854
Filled NA in Garage Yr Blt with mean: 1978.1324431613136
Skewness of Garage Cars: -0.2198363641384971
Filled NA in Garage Cars with mean: 1.7668146124957322
Skewness of Garage Area: 0.2419942395445727
Filled NA in Garage Area with mean: 472.8197336975077
Missing value replacement complete.
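
The reason for preferring the median on skewed columns can be seen on a small example. The sketch below uses a hypothetical right-skewed sample, not values from the dataset, to show how one large value pulls the mean away from the bulk of the data while the median stays put.

```python
import pandas as pd

# Hypothetical right-skewed sample: one very large value among small ones.
values = pd.Series([50.0, 60.0, 65.0, 70.0, 300.0])

mean_value = values.mean()      # pulled upward by the single large value
median_value = values.median()  # stays with the bulk of the data
skewness = values.skew()        # positive for a right-skewed sample

print(f"Mean: {mean_value}, Median: {median_value}, Skewness: {skewness:.2f}")
```

Filling gaps with the mean here would insert a value larger than four of the five observations, which is why a skewness threshold is used to switch to the median.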

[ ]: from sklearn.preprocessing import LabelEncoder

# Create a label encoder object
label_encoder = LabelEncoder()

# LabelEncoder assigns one integer code per category (label encoding)
for column in data.columns:
    if not pd.api.types.is_numeric_dtype(data[column]):
        data[column] = label_encoder.fit_transform(data[column])
        print(f"{column} has been label encoded.")

# Check the structure and first few rows of the DataFrame after encoding
print(data.info())
print(data.head())

MS Zoning has been label encoded.
Street has been label encoded.
Lot Shape has been label encoded.
Land Contour has been label encoded.
Utilities has been label encoded.
Lot Config has been label encoded.
Land Slope has been label encoded.
Neighborhood has been label encoded.
Condition 1 has been label encoded.
Condition 2 has been label encoded.
Bldg Type has been label encoded.
House Style has been label encoded.
Roof Style has been label encoded.
Roof Matl has been label encoded.
Exterior 1st has been label encoded.
Exterior 2nd has been label encoded.
Exter Qual has been label encoded.
Exter Cond has been label encoded.
Foundation has been label encoded.
Bsmt Qual has been label encoded.
Bsmt Cond has been label encoded.
Bsmt Exposure has been label encoded.
BsmtFin Type 1 has been label encoded.
BsmtFin Type 2 has been label encoded.
Heating has been label encoded.
Heating QC has been label encoded.
Central Air has been label encoded.
Electrical has been label encoded.
Kitchen Qual has been label encoded.
Functional has been label encoded.
Fireplace Qu has been label encoded.
Garage Type has been label encoded.
Garage Finish has been label encoded.
Garage Qual has been label encoded.
Garage Cond has been label encoded.
Paved Drive has been label encoded.
Fence has been label encoded.
Sale Type has been label encoded.
Sale Condition has been label encoded.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 76 columns):
# Column Non-Null Count Dtype

0 MS SubClass 2930 non-null int64


1 MS Zoning 2930 non-null int64
2 Lot Frontage 2930 non-null float64
3 Lot Area 2930 non-null int64
4 Street 2930 non-null int64
5 Lot Shape 2930 non-null int64
6 Land Contour 2930 non-null int64
7 Utilities 2930 non-null int64
8 Lot Config 2930 non-null int64
9 Land Slope 2930 non-null int64
10 Neighborhood 2930 non-null int64
11 Condition 1 2930 non-null int64
12 Condition 2 2930 non-null int64
13 Bldg Type 2930 non-null int64
14 House Style 2930 non-null int64
15 Overall Qual 2930 non-null int64
16 Overall Cond 2930 non-null int64
17 Year Built 2930 non-null int64
18 Year Remod/Add 2930 non-null int64
19 Roof Style 2930 non-null int64
20 Roof Matl 2930 non-null int64
21 Exterior 1st 2930 non-null int64
22 Exterior 2nd 2930 non-null int64
23 Mas Vnr Area 2930 non-null float64
24 Exter Qual 2930 non-null int64
25 Exter Cond 2930 non-null int64
26 Foundation 2930 non-null int64
27 Bsmt Qual 2930 non-null int64
28 Bsmt Cond 2930 non-null int64
29 Bsmt Exposure 2930 non-null int64
30 BsmtFin Type 1 2930 non-null int64
31 BsmtFin SF 1 2930 non-null float64
32 BsmtFin Type 2 2930 non-null int64
33 BsmtFin SF 2 2930 non-null float64
34 Bsmt Unf SF 2930 non-null float64
35 Total Bsmt SF 2930 non-null float64
36 Heating 2930 non-null int64
37 Heating QC 2930 non-null int64
38 Central Air 2930 non-null int64
39 Electrical 2930 non-null int64
40 1st Flr SF 2930 non-null int64
41 2nd Flr SF 2930 non-null int64
42 Low Qual Fin SF 2930 non-null int64
43 Gr Liv Area 2930 non-null int64
44 Bsmt Full Bath 2930 non-null float64
45 Bsmt Half Bath 2930 non-null float64
46 Full Bath 2930 non-null int64
47 Half Bath 2930 non-null int64
48 Bedroom AbvGr 2930 non-null int64
49 Kitchen AbvGr 2930 non-null int64
50 Kitchen Qual 2930 non-null int64
51 TotRms AbvGrd 2930 non-null int64
52 Functional 2930 non-null int64
53 Fireplaces 2930 non-null int64
54 Fireplace Qu 2930 non-null int64
55 Garage Type 2930 non-null int64
56 Garage Yr Blt 2930 non-null float64
57 Garage Finish 2930 non-null int64
58 Garage Cars 2930 non-null float64
59 Garage Area 2930 non-null float64
60 Garage Qual 2930 non-null int64
61 Garage Cond 2930 non-null int64
62 Paved Drive 2930 non-null int64
63 Wood Deck SF 2930 non-null int64
64 Open Porch SF 2930 non-null int64
65 Enclosed Porch 2930 non-null int64
66 3Ssn Porch 2930 non-null int64
67 Screen Porch 2930 non-null int64
68 Pool Area 2930 non-null int64
69 Fence 2930 non-null int64
70 Misc Val 2930 non-null int64
71 Mo Sold 2930 non-null int64
72 Yr Sold 2930 non-null int64
73 Sale Type 2930 non-null int64
74 Sale Condition 2930 non-null int64
75 SalePrice 2930 non-null int64
dtypes: float64(11), int64(65)

memory usage: 1.7 MB
None
MS SubClass MS Zoning Lot Frontage Lot Area Street Lot Shape \
0 20 5 141.0 31770 1 0
1 20 4 80.0 11622 1 3
2 20 5 81.0 14267 1 0
3 20 5 93.0 11160 1 3
4 60 5 74.0 13830 1 0

Land Contour Utilities Lot Config Land Slope … 3Ssn Porch \


0 3 0 0 0 … 0
1 3 0 4 0 … 0
2 3 0 0 0 … 0
3 3 0 0 0 … 0
4 3 0 4 0 … 0

Screen Porch Pool Area Fence Misc Val Mo Sold Yr Sold Sale Type \
0 0 0 4 0 5 2010 9
1 120 0 2 0 6 2010 9
2 0 0 4 12500 6 2010 9
3 0 0 4 0 4 2010 9
4 0 0 2 0 3 2010 9

Sale Condition SalePrice


0 4 215000
1 4 105000
2 4 172000
3 4 244000
4 4 189900

[5 rows x 76 columns]

[ ]: # Set the aesthetic style of the plots


sns.set(style="whitegrid")

# Histogram of Sale Prices


plt.figure(figsize=(12, 6))
sns.histplot(data['SalePrice'], kde=True)
plt.title('Distribution of Sale Prices')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.show()

# Box Plot of Sale Prices


plt.figure(figsize=(12, 6))
sns.boxplot(x='SalePrice', data=data)
plt.title('Box Plot of Sale Prices')

plt.xlabel('Sale Price')
plt.show()
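The points that a box plot draws beyond its whiskers follow the 1.5 × IQR rule, which is seaborn's default whisker setting. The sketch below applies that rule by hand. The prices are made-up values for illustration, not rows from AmesHousing.csv, and the names `lower_fence` and `upper_fence` are chosen here.

```python
import pandas as pd

# Made-up prices for illustration only; not taken from AmesHousing.csv.
prices = pd.Series([100_000, 120_000, 130_000, 150_000, 160_000, 175_000, 600_000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers.tolist())
```

On the real data the same lines would run with `data['SalePrice']` in place of `prices`.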

[ ]: # Set the aesthetic style of the plots


sns.set(style="whitegrid")

# Scatter Plot of Lot Frontage vs Lot Area
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Lot Frontage', y='Lot Area', data=data, alpha=0.6)
plt.title('Scatter Plot of Lot Frontage vs. Lot Area')
plt.xlabel('Lot Frontage (feet)')
plt.ylabel('Lot Area (square feet)')
plt.show()

# If you have neighborhood data and want to visualize Sale Price in different neighborhoods
# This assumes 'Neighborhood' is a column in your dataframe

if 'Neighborhood' in data.columns:
    plt.figure(figsize=(12, 8))
    sns.scatterplot(x='Neighborhood', y='SalePrice', data=data)
    plt.title('Sale Price Distribution Across Neighborhoods')
    plt.xlabel('Neighborhood')
    plt.ylabel('Sale Price')
    plt.xticks(rotation=45)  # Rotate the x-axis labels for better visibility
    plt.show()
else:
    print("Neighborhood data not available for plotting.")

[ ]: sns.set(style="whitegrid")

# Group by 'Yr Sold' and calculate the average 'SalePrice'


yearly_avg = data.groupby('Yr Sold')['SalePrice'].mean()

# Plotting the line graph


plt.figure(figsize=(12, 6))
sns.lineplot(x=yearly_avg.index, y=yearly_avg.values)
plt.title('Average Sale Price Over the Years')
plt.xlabel('Year Sold')
plt.ylabel('Average Sale Price')
plt.show()
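A mean per year can hide how many sales sit behind each point. As a rough sketch, the same `groupby` can return the mean, median and count together. The frame below is a tiny made-up sample that only reuses the notebook's column names.

```python
import pandas as pd

# Tiny made-up sample; the column names mirror the notebook's, the values do not.
sales = pd.DataFrame({
    "Yr Sold": [2006, 2006, 2007, 2007, 2007],
    "SalePrice": [100_000, 200_000, 150_000, 150_000, 300_000],
})

# One row per year, one column per aggregate.
yearly = sales.groupby("Yr Sold")["SalePrice"].agg(["mean", "median", "count"])
print(yearly)
```

Comparing the mean with the median gives a quick hint of whether a few expensive sales are pulling a year's average upward.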

[ ]: # Create a new column combining year and month for detailed trend analysis
data['Year_Month'] = data['Yr Sold'].astype(str) + '-' + data['Mo Sold'].astype(str).str.zfill(2)

# Count the number of sales per 'Year_Month'


monthly_sales = data['Year_Month'].value_counts().sort_index()

# Plotting the bar graph


plt.figure(figsize=(18, 8))
sns.barplot(x=monthly_sales.index, y=monthly_sales.values, palette="viridis")
plt.title('Number of Properties Sold Each Month')
plt.xlabel('Month and Year')
plt.ylabel('Number of Sales')
plt.xticks(rotation=90) # Rotate the labels for better visibility
plt.show()
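The `zfill(2)` step matters because `sort_index()` sorts the `Year_Month` labels as strings. The sketch below, on made-up dates, compares the ordering with and without zero-padding.

```python
import pandas as pd

# Made-up sale dates for illustration.
sales = pd.DataFrame({"Yr Sold": [2006, 2006, 2006], "Mo Sold": [10, 2, 9]})

year = sales["Yr Sold"].astype(str)
month = sales["Mo Sold"].astype(str)

padded = year + "-" + month.str.zfill(2)   # e.g. '2006-02'
unpadded = year + "-" + month              # e.g. '2006-2'

print(sorted(padded))    # chronological order
print(sorted(unpadded))  # October sorts before February
```

With padding, string order and calendar order agree, so the bars appear in chronological order.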

da-lab1

August 22, 2024

Name: Shruti Kedari


Class: AIML
UID: 2021600033
Exp No: 1
Part 1
[1]: import numpy as np
import pandas as pd

[2]: data = pd.read_csv('AmesHousing.csv')

[3]: data.shape

[3]: (2930, 82)

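[4]: data.info  # Without parentheses this shows the bound method (with the DataFrame repr); call data.info() for the column summary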

[4]: <bound method DataFrame.info of Order PID MS SubClass MS Zoning Lot Frontage Lot Area Street \
0 1 526301100 20 RL 141.0 31770 Pave
1 2 526350040 20 RH 80.0 11622 Pave
2 3 526351010 20 RL 81.0 14267 Pave
3 4 526353030 20 RL 93.0 11160 Pave
4 5 527105010 60 RL 74.0 13830 Pave
… … … … … … … …
2925 2926 923275080 80 RL 37.0 7937 Pave
2926 2927 923276100 20 RL NaN 8885 Pave
2927 2928 923400125 85 RL 62.0 10441 Pave
2928 2929 924100070 20 RL 77.0 10010 Pave
2929 2930 924151050 60 RL 74.0 9627 Pave

Alley Lot Shape Land Contour … Pool Area Pool QC Fence Misc Feature \
0 NaN IR1 Lvl … 0 NaN NaN NaN
1 NaN Reg Lvl … 0 NaN MnPrv NaN
2 NaN IR1 Lvl … 0 NaN NaN Gar2
3 NaN Reg Lvl … 0 NaN NaN NaN

4 NaN IR1 Lvl … 0 NaN MnPrv NaN
… … … … … … … … …
2925 NaN IR1 Lvl … 0 NaN GdPrv NaN
2926 NaN IR1 Low … 0 NaN MnPrv NaN
2927 NaN Reg Lvl … 0 NaN MnPrv Shed
2928 NaN Reg Lvl … 0 NaN NaN NaN
2929 NaN Reg Lvl … 0 NaN NaN NaN

Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice


0 0 5 2010 WD Normal 215000
1 0 6 2010 WD Normal 105000
2 12500 6 2010 WD Normal 172000
3 0 4 2010 WD Normal 244000
4 0 3 2010 WD Normal 189900
… … … … … … …
2925 0 3 2006 WD Normal 142500
2926 0 6 2006 WD Normal 131000
2927 700 7 2006 WD Normal 132000
2928 0 4 2006 WD Normal 170000
2929 0 11 2006 WD Normal 188000

[2930 rows x 82 columns]>

[5]: # Number of Features


num_features = len(data.columns)
num_features

[5]: 82

[6]: # Number of data samples/ rows


num_samples = data.shape[0]
num_samples

[6]: 2930

[7]: # Unique classes in the 'Sale Condition' column
data['Sale Condition'].unique()

[7]: array(['Normal', 'Partial', 'Family', 'Abnorml', 'Alloca', 'AdjLand'],
      dtype=object)
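`unique()` lists the classes. The lab aim also asks for the number of classes, the number of samples per class, and conversion to numbers. A minimal sketch of those three steps follows. It uses a made-up Series that borrows three of the category names above, so the counts are illustrative only, and category codes are one possible encoding among several.

```python
import pandas as pd

# Made-up labels reusing category names from the output above; counts are illustrative.
cond = pd.Series(
    ["Normal", "Partial", "Normal", "Family", "Normal", "Partial"],
    name="Sale Condition",
)

num_classes = cond.nunique()             # number of distinct classes
samples_per_class = cond.value_counts()  # number of samples in each class

# Conversion to numbers: categories are sorted alphabetically and numbered from 0.
codes = cond.astype("category").cat.codes

print(num_classes)
print(samples_per_class)
print(codes.tolist())
```

These integer codes carry no natural order, so treating them as quantities in a model is a separate decision.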

[8]: # Number of NaN values


nullF = data.isnull().sum().sum()
nullF

[8]: 13997

[9]: # Number of null values per column
nullrows = data.isnull().sum()
nullrows

[9]: Order 0
PID 0
MS SubClass 0
MS Zoning 0
Lot Frontage 490

Mo Sold 0
Yr Sold 0
Sale Type 0
Sale Condition 0
SalePrice 0
Length: 82, dtype: int64
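With 82 columns the per-column output is truncated, so the columns that actually contain nulls are easy to miss. One possible way to list only those columns, with their share of missing rows, is sketched below. The frame is made up, and `summary` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Made-up frame; only the column names echo the notebook.
demo = pd.DataFrame({
    "Lot Frontage": [141.0, np.nan, 81.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "SalePrice": [215000, 105000, 172000, 244000],
})

missing = demo.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)  # keep affected columns only

summary = pd.DataFrame({
    "missing": missing,
    "percent": 100 * missing / len(demo),
})
print(summary)
```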

[10]: # Deleting rows with null values in a specific column


data.dropna(subset = ['Lot Frontage'], inplace = True)

[11]: data.isnull().sum()

[11] : Order 0
PID 0
MS SubClass 0
MS Zoning 0
Lot Frontage 0
..
Mo Sold 0
Yr Sold 0
Sale Type 0
Sale Condition 0
SalePrice 0
Length: 82, dtype: int64
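Dropping rows is one way to handle missing values. Filling them in is an alternative that keeps every sample. The sketch below compares the two on made-up data. Median imputation is used here only as an example, not as the method the notebook applies.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration.
demo = pd.DataFrame({
    "Lot Frontage": [60.0, np.nan, 80.0, 100.0],
    "SalePrice": [150000, 160000, 170000, 180000],
})

# Option 1: drop rows where 'Lot Frontage' is missing (what the notebook does).
dropped = demo.dropna(subset=["Lot Frontage"])

# Option 2: keep all rows and fill the gap with the column median.
filled = demo.copy()
filled["Lot Frontage"] = filled["Lot Frontage"].fillna(demo["Lot Frontage"].median())

print(len(dropped), len(filled))
print(filled["Lot Frontage"].tolist())
```

Dropping shrinks the sample, while imputing keeps the sample size but adds values that were never observed.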

[12]: data['PID'].dtypes

[12] : dtype('int64')

[13]: data.PID = data.PID.astype(str)

[14]: import seaborn as sns


import matplotlib.pyplot as plt

[15]: data.head(3)

[15]: Order PID MS SubClass MS Zoning Lot Frontage Lot Area Street \
0 1 526301100 20 RL 141.0 31770 Pave
1 2 526350040 20 RH 80.0 11622 Pave
2 3 526351010 20 RL 81.0 14267 Pave

Alley Lot Shape Land Contour … Pool Area Pool QC Fence Misc Feature \
0 NaN IR1 Lvl … 0 NaN NaN NaN
1 NaN Reg Lvl … 0 NaN MnPrv NaN
2 NaN IR1 Lvl … 0 NaN NaN Gar2

Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice


0 0 5 2010 WD Normal 215000
1 0 6 2010 WD Normal 105000
2 12500 6 2010 WD Normal 172000

[3 rows x 82 columns]

[16]: # Number of NaN values in the 'Alley' column
nullF = data['Alley'].isnull().sum()
nullF

[16]: 2255

Histogram
Visualize the distribution of a single numerical variable
[17]: sns.histplot(data['SalePrice'], kde=True)
plt.title('Distribution of Sale Prices')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.show()

[18]: # Histogram
plt.hist(data['SalePrice'], edgecolor='black', bins=9, color='skyblue', alpha=0.7)
plt.title('Histogram of Sale Prices')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.show()

[19]: sns.countplot(x='SalePrice', data = data)

[19]: <Axes: xlabel='SalePrice', ylabel='count'>

Box Plot
Compare the distribution of a numerical variable across different categories.
Visualizing SalePrice distribution across different MS Zoning categories.
[20]: sns.boxplot(x='MS Zoning', y='SalePrice', data=data)
plt.title('Sale Price Distribution by MS Zoning')
plt.xlabel('MS Zoning')
plt.ylabel('Sale Price')
plt.show()


Scatter Plot
Visualize the relationship between two numerical variables.
Visualizing the relationship between Lot Area and SalePrice

[21]: sns.scatterplot(x='Lot Area', y='SalePrice', data=data)


plt.title('Lot Area vs Sale Price')
plt.xlabel('Lot Area')
plt.ylabel('Sale Price')
plt.show()

Visualize pairwise relationships between several numerical variables.
[22]: sns.pairplot(data[['SalePrice', 'Lot Area', 'Lot Frontage']])
plt.suptitle('Pair Plot of SalePrice, Lot Area, and Lot Frontage', y=1.02)
plt.show()

Visualize the correlation matrix between numerical variables
[23]: corr_matrix = data[['SalePrice', 'Lot Area', 'Lot Frontage', 'Pool Area']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

PART 2
[24]: df = pd.read_csv('Iris.csv')

[25]: df.shape

[25]: (150, 6)

[26]: df.head(3)

[26]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species


0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa

[27]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype

0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB

[28]: df.describe()

[28]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm


count 150.000000 150.000000 150.000000 150.000000 150.000000
mean 75.500000 5.843333 3.054000 3.758667 1.198667
std 43.445368 0.828066 0.433594 1.764420 0.763161
min 1.000000 4.300000 2.000000 1.000000 0.100000
25% 38.250000 5.100000 2.800000 1.600000 0.300000
50% 75.500000 5.800000 3.000000 4.350000 1.300000
75% 112.750000 6.400000 3.300000 5.100000 1.800000
max 150.000000 7.900000 4.400000 6.900000 2.500000

[29]: df.shape

[29]: (150, 6)

[30]: duplicate_rows_df = df[df.duplicated()]

[31]: print("Number of duplicate rows:", len(duplicate_rows_df))

Number of duplicate rows: 0
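When a table has a per-row identifier such as `Id`, `duplicated()` on the full frame compares the identifier too, so repeated measurements would not be flagged. A hedged sketch of checking only the measurement columns through the `subset` parameter follows. The rows are made up and are not taken from Iris.csv.

```python
import pandas as pd

# Made-up rows; rows 0 and 2 repeat the same measurement under different Ids.
demo = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "SepalLengthCm": [5.1, 4.9, 5.1, 5.1],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# Whole-row comparison: the unique Id makes every row distinct.
full_row_duplicates = int(demo.duplicated().sum())

# Compare the measurement columns only, ignoring Id.
measurement_cols = ["SepalLengthCm", "Species"]
measurement_duplicates = int(demo.duplicated(subset=measurement_cols).sum())

# Keep the first occurrence of each repeated measurement.
deduped = demo.drop_duplicates(subset=measurement_cols)

print(full_row_duplicates, measurement_duplicates, len(deduped))
```

Whether repeated measurements should be removed depends on the data: two flowers can legitimately have identical measurements.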

[32]: df.count()

[32]: Id 150
SepalLengthCm 150
SepalWidthCm 150
PetalLengthCm 150
PetalWidthCm 150
Species 150
dtype: int64

[33]: print(df.isnull().sum())

Id 0
SepalLengthCm 0
SepalWidthCm 0

PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
[34]: # Num of features
num_features = len(df.columns)

[35]: num_features

[35]: 6

[36]: # Number of data samples/ rows


num_samples = df.shape[0]

[37]: num_samples

[37]: 150

[38]: # Number of NaN values


nullF = df.isnull().sum().sum()
nullF

[38]: 0

[39]: import matplotlib.pyplot as plt

[40]: # Histogram
plt.hist(df['SepalLengthCm'], edgecolor='black', bins=9, color='skyblue', alpha=0.7)
plt.title('Histogram for SepalLength in cm')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

[41]: # Histogram
plt.hist(df['Species'], edgecolor='black', bins=9, color='skyblue', alpha=0.7)
plt.title('Histogram of Species')
plt.xlabel('Species')
plt.ylabel('Frequency')
plt.show()

[42]: import seaborn as sns

A scatter plot shows the relationship between two continuous variables in a dataset.
x, y: the input variables; both should be numeric.
hue='Species': colors the points by species.
style='Species': uses a different marker for each species.
palette='viridis': applies the 'viridis' color palette to the plot.

[43]: sns.scatterplot(x="SepalLengthCm",
y="SepalWidthCm",
data = df,
hue='Species',
style='Species',
palette='viridis'
)
plt.title('Scatter Plot of Sepal Length vs Sepal Width')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')

plt.show()


[44]: sns.lineplot(data=df, x='Id', y='SepalLengthCm', label='Sepal Length (cm)', marker='o')
sns.lineplot(data=df, x='Id', y='SepalWidthCm', label='Sepal Width (cm)', marker='o')
plt.title('Line Plot of Sepal Length and Sepal Width Over Id')
plt.xlabel('Id')
plt.ylabel('Measurements (cm)')
plt.legend()
plt.show()


A bar plot compares a numerical summary across the categories of a categorical variable.
x='Species': places the species on the x-axis as categories.
y='SepalLengthCm': the bar height is the mean sepal length for each species (the mean is seaborn's default estimator).

[45]: # BarPlot
sns.barplot(x = 'Species', y='SepalLengthCm', data = df, palette='Set2')
plt.title('Average Sepal Length by Species')
plt.xlabel('Species')
plt.ylabel('Average Sepal Length (cm)')

[45] : Text(0, 0.5, 'Average Sepal Length (cm)')
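The bar heights can be cross-checked without plotting, because `sns.barplot` aggregates with the mean by default. A minimal sketch on made-up measurements follows. `bar_heights` is a name introduced here.

```python
import pandas as pd

# Made-up measurements for two species; not rows from Iris.csv.
demo = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    "SepalLengthCm": [5.0, 5.2, 6.4, 6.6],
})

# Mean sepal length per species: the value each bar would show.
bar_heights = demo.groupby("Species")["SepalLengthCm"].mean()
print(bar_heights)
```

Running the same `groupby` on `df` should reproduce the values that the plot above displays as bars.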

Count Plot: shows the number of observations in each category of a single categorical variable.
[46]: sns.countplot(x='Species', data = df)

[46] : <Axes: xlabel='Species', ylabel='count'>


PairPlot
A pair plot (also known as a scatterplot matrix) shows the relationships between several variables
at once. It draws a scatter plot for every pair of variables and the distribution of each variable
on the diagonal.
hue='Species': colors the points in the scatter plots by the Species category, so you can see how
the species are distributed across each pair of variables.
vars=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']: specifies the
columns to include in the pair plot. If you omit this, all numerical columns are used.

[47]: sns.pairplot(df, hue='Species', vars=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'])

[47] : <seaborn.axisgrid.PairGrid at 0x1b14a6fe3d0>

Heatmaps are often used to visualize correlation matrices, where each cell shows the correlation
between two variables. This helps identify strongly correlated variables.
corr_matrix: the correlation matrix computed with .corr() for the selected columns.
sns.heatmap(): creates the heatmap from the correlation matrix.
annot=True: writes the correlation coefficients inside the cells.
cmap='coolwarm': specifies the color map, with blue for low values and red for high values.
linewidths=0.5: adds a thin line between cells for better readability.
[48]: corr_matrix = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)

[48] : <Axes: >
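Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. As a rough sketch, the coefficient for one pair of columns can be computed by hand and compared with the pandas result. The five rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements; not rows from Iris.csv.
demo = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.3, 4.7, 4.5, 6.0],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5],
})

x = demo["PetalLengthCm"]
y = demo["PetalWidthCm"]

# Pearson r: sum of products of deviations, scaled by both spreads.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same cell taken from the correlation matrix.
r_pandas = demo.corr().loc["PetalLengthCm", "PetalWidthCm"]

print(r_manual, r_pandas)
```

A value close to 1 means the two columns rise together almost linearly. It says nothing about which one drives the other.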
