Final DA LAB1 Merged
Name Anand
Krishna
Om Doshi
Class and Batch BE Comps A
Lab # 1
Aim Perform EDA on a dataset: find the number of data samples, features, and classes and the
number of data samples per class, remove missing values, convert categorical values to numbers,
and plot different graphs using the seaborn library
Theory
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process, focusing on
summarizing and understanding the main characteristics of a dataset. EDA involves examining
data sets to uncover patterns, spot anomalies, test hypotheses, and check assumptions using
statistical graphics and other data visualization techniques.
Objectives of EDA
2. Number of Features
3. Number of Classes
BHARATIYA VIDYA BHAVAN’S
SARDAR PATEL INSTITUTE OF TECHNOLOGY
(Empowered Autonomous Institute Affiliated to University of Mumbai)
[Knowledge is Nectar]
● Definition: The number of unique categories or labels in the target variable (for
classification problems).
6. Conversion to Numbers
7. Plotting Graphs with Seaborn
● Definition: Using the Seaborn library to create visualizations such as histograms, scatter
plots, and pair plots.
● Purpose: Helps in understanding the data distribution, relationships between features, and
patterns or anomalies in the data. Seaborn simplifies the creation of aesthetically pleasing
and informative plots.
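The counting objectives above can be sketched in a few lines of pandas. This is a minimal illustration on a hand-made toy frame, not the notebook's own code: the column names (`length`, `width`, `label`) are hypothetical, and treating every column except the target as a feature is an assumption (the notebook below counts all columns via `data.shape[1]`).

```python
import pandas as pd

# Toy data standing in for a real dataset; column names are hypothetical.
toy = pd.DataFrame({
    "length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
    "width": [3.5, 3.0, 3.3, 2.7, 3.0, 3.0],
    "label": ["a", "a", "b", "b", "c", "b"],
})

num_samples, num_columns = toy.shape             # rows, columns
num_features = num_columns - 1                   # assumption: all columns except the target
num_classes = toy["label"].nunique()             # distinct labels in the target
samples_per_class = toy["label"].value_counts()  # number of rows for each label

print(num_samples, num_features, num_classes)
print(samples_per_class.to_dict())
```

For a regression target such as a sale price, "number of classes" has no direct meaning; `nunique()` on a categorical column is the closest equivalent.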
Conclusion EDA is an essential process in data analytics that involves summarizing, visualizing, and
cleaning data to uncover insights and prepare it for further analysis or modeling. By
applying EDA techniques, we can better understand the dataset's structure, address data
quality issues, explore relationships, and make informed decisions about the next steps in
data analysis.
References https://www.geeksforgeeks.org/matplotlib-tutorial/
da-exp1
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')
[ ]: data = pd.read_csv("/content/AmesHousing.csv")
[ ]: data.head()
Alley Lot Shape Land Contour … Pool Area Pool QC Fence Misc Feature \
0 NaN IR1 Lvl … 0 NaN NaN NaN
1 NaN Reg Lvl … 0 NaN MnPrv NaN
2 NaN IR1 Lvl … 0 NaN NaN Gar2
3 NaN Reg Lvl … 0 NaN NaN NaN
4 NaN IR1 Lvl … 0 NaN MnPrv NaN
[5 rows x 82 columns]
[ ]: data.describe()
[ ]: Order PID MS SubClass Lot Frontage Lot Area \
count 2930.00000 2.930000e+03 2930.000000 2440.000000 2930.000000
mean 1465.50000 7.144645e+08 57.387372 69.224590 10147.921843
std 845.96247 1.887308e+08 42.638025 23.365335 7880.017759
min 1.00000 5.263011e+08 20.000000 21.000000 1300.000000
25% 733.25000 5.284770e+08 20.000000 58.000000 7440.250000
50% 1465.50000 5.354536e+08 50.000000 68.000000 9436.500000
75% 2197.75000 9.071811e+08 70.000000 80.000000 11555.250000
max 2930.00000 1.007100e+09 190.000000 313.000000 215245.000000
Overall Qual Overall Cond Year Built Year Remod/Add Mas Vnr Area \
count 2930.000000 2930.000000 2930.000000 2930.000000 2907.000000
mean 6.094881 5.563140 1971.356314 1984.266553 101.896801
std 1.411026 1.111537 30.245361 20.860286 179.112611
min 1.000000 1.000000 1872.000000 1950.000000 0.000000
25% 5.000000 5.000000 1954.000000 1965.000000 0.000000
50% 6.000000 5.000000 1973.000000 1993.000000 0.000000
75% 7.000000 6.000000 2001.000000 2004.000000 164.000000
max 10.000000 9.000000 2010.000000 2010.000000 1600.000000
SalePrice
count 2930.000000
mean 180796.060068
std 79886.692357
min 12789.000000
25% 129500.000000
50% 160000.000000
75% 213500.000000
max 755000.000000
[8 rows x 39 columns]
[ ]: data.columns
[ ]: Index(['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
'2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
'Garage Finish', 'Garage Cars', 'Garage Area', 'Garage Qual',
'Garage Cond', 'Paved Drive', 'Wood Deck SF', 'Open Porch SF',
'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Pool QC',
'Fence', 'Misc Feature', 'Misc Val', 'Mo Sold', 'Yr Sold', 'Sale Type',
'Sale Condition', 'SalePrice'],
dtype='object')
# Number of features
num_features = data.shape[1]
[ ]: (2930, 82)
[ ]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
# Column Non-Null Count Dtype
44 1st Flr SF 2930 non-null int64
45 2nd Flr SF 2930 non-null int64
46 Low Qual Fin SF 2930 non-null int64
47 Gr Liv Area 2930 non-null int64
48 Bsmt Full Bath 2928 non-null float64
49 Bsmt Half Bath 2928 non-null float64
50 Full Bath 2930 non-null int64
51 Half Bath 2930 non-null int64
52 Bedroom AbvGr 2930 non-null int64
53 Kitchen AbvGr 2930 non-null int64
54 Kitchen Qual 2930 non-null object
55 TotRms AbvGrd 2930 non-null int64
56 Functional 2930 non-null object
57 Fireplaces 2930 non-null int64
58 Fireplace Qu 1508 non-null object
59 Garage Type 2773 non-null object
60 Garage Yr Blt 2771 non-null float64
61 Garage Finish 2771 non-null object
62 Garage Cars 2929 non-null float64
63 Garage Area 2929 non-null float64
64 Garage Qual 2771 non-null object
65 Garage Cond 2771 non-null object
66 Paved Drive 2930 non-null object
67 Wood Deck SF 2930 non-null int64
68 Open Porch SF 2930 non-null int64
69 Enclosed Porch 2930 non-null int64
70 3Ssn Porch 2930 non-null int64
71 Screen Porch 2930 non-null int64
72 Pool Area 2930 non-null int64
73 Pool QC 13 non-null object
74 Fence 572 non-null object
75 Misc Feature 106 non-null object
76 Misc Val 2930 non-null int64
77 Mo Sold 2930 non-null int64
78 Yr Sold 2930 non-null int64
79 Sale Type 2930 non-null object
80 Sale Condition 2930 non-null object
81 SalePrice 2930 non-null int64
dtypes: float64(11), int64(28), object(43)
memory usage: 1.8+ MB
[ ]: data.describe()
min 1.00000 5.263011e+08 20.000000 21.000000 1300.000000
25% 733.25000 5.284770e+08 20.000000 58.000000 7440.250000
50% 1465.50000 5.354536e+08 50.000000 68.000000 9436.500000
75% 2197.75000 9.071811e+08 70.000000 80.000000 11555.250000
max 2930.00000 1.007100e+09 190.000000 313.000000 215245.000000
Overall Qual Overall Cond Year Built Year Remod/Add Mas Vnr Area \
count 2930.000000 2930.000000 2930.000000 2930.000000 2907.000000
mean 6.094881 5.563140 1971.356314 1984.266553 101.896801
std 1.411026 1.111537 30.245361 20.860286 179.112611
min 1.000000 1.000000 1872.000000 1950.000000 0.000000
25% 5.000000 5.000000 1954.000000 1965.000000 0.000000
50% 6.000000 5.000000 1973.000000 1993.000000 0.000000
75% 7.000000 6.000000 2001.000000 2004.000000 164.000000
max 10.000000 9.000000 2010.000000 2010.000000 1600.000000
SalePrice
count 2930.000000
mean 180796.060068
std 79886.692357
min 12789.000000
25% 129500.000000
50% 160000.000000
75% 213500.000000
max 755000.000000
[8 rows x 39 columns]
[ ]: data.isnull().sum().sort_values(ascending=False)
[ ]: Pool QC 2917
Misc Feature 2824
Alley 2732
Fence 2358
Mas Vnr Type 1775
…
PID 0
Central Air 0
1st Flr SF 0
2nd Flr SF 0
SalePrice 0
Length: 82, dtype: int64
[ ]: missing_values = data.isnull().sum().sort_values(ascending=False)
'Paved Drive': 0.0,
'Full Bath': 0.0,
'Half Bath': 0.0,
'Bedroom AbvGr': 0.0,
'Kitchen AbvGr': 0.0,
'Kitchen Qual': 0.0,
'TotRms AbvGrd': 0.0,
'Sale Condition': 0.0,
'Sale Type': 0.0,
'Yr Sold': 0.0,
'Mo Sold': 0.0,
'Misc Val': 0.0,
'Functional': 0.0,
'Fireplaces': 0.0,
'Pool Area': 0.0,
'Screen Porch': 0.0,
'3Ssn Porch': 0.0,
'Enclosed Porch': 0.0,
'Open Porch SF': 0.0,
'Wood Deck SF': 0.0,
'Order': 0.0,
'Heating QC': 0.0,
'Gr Liv Area': 0.0,
'Overall Qual': 0.0,
'MS SubClass': 0.0,
'MS Zoning': 0.0,
'Lot Area': 0.0,
'Street': 0.0,
'Lot Shape': 0.0,
'Land Contour': 0.0,
'Utilities': 0.0,
'Lot Config': 0.0,
'Land Slope': 0.0,
'Neighborhood': 0.0,
'Condition 1': 0.0,
'Condition 2': 0.0,
'Bldg Type': 0.0,
'House Style': 0.0,
'Overall Cond': 0.0,
'Low Qual Fin SF': 0.0,
'Year Built': 0.0,
'Year Remod/Add': 0.0,
'Roof Style': 0.0,
'Roof Matl': 0.0,
'Exterior 1st': 0.0,
'Exterior 2nd': 0.0,
'Exter Qual': 0.0,
'Exter Cond': 0.0,
'Foundation': 0.0,
'Heating': 0.0,
'PID': 0.0,
'Central Air': 0.0,
'1st Flr SF': 0.0,
'2nd Flr SF': 0.0,
'SalePrice': 0.0}
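The cell that produced the percentage dictionary above is not preserved in this export. A minimal sketch of how such per-column missing percentages could be computed is given here on a toy frame; the 50% drop threshold and the names `toy`, `threshold`, and `to_drop` are assumptions for illustration, not the notebook's actual choices.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, "x"],
    "c": [1, 2, 3, 4],
})

# Count missing values per column, then express them as a share of all rows.
missing_counts = toy.isnull().sum().sort_values(ascending=False)
missing_percentage = missing_counts / len(toy) * 100
print(missing_percentage.to_dict())

# Optionally drop columns that are mostly empty (threshold is an assumption).
threshold = 50.0
to_drop = missing_percentage[missing_percentage > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)
print(list(reduced.columns))
```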
[ ]: unique_values = data.nunique().sort_values(ascending=False)
unique_values_percentage = unique_values / len(data) * 100
unique_values_percentage.to_dict()
[ ]: {'Order': 100.0,
'PID': 100.0,
'Lot Area': 66.89419795221842,
'Gr Liv Area': 44.09556313993174,
'Bsmt Unf SF': 38.80546075085324,
'1st Flr SF': 36.96245733788396,
'Total Bsmt SF': 36.10921501706484,
'SalePrice': 35.221843003412964,
'BsmtFin SF 1': 33.95904436860068,
'2nd Flr SF': 21.672354948805463,
'Garage Area': 20.580204778156997,
'Mas Vnr Area': 15.187713310580206,
'Wood Deck SF': 12.969283276450511,
'BsmtFin SF 2': 9.351535836177474,
'Open Porch SF': 8.600682593856655,
'Enclosed Porch': 6.2457337883959045,
'Lot Frontage': 4.368600682593857,
'Screen Porch': 4.129692832764506,
'Year Built': 4.027303754266211,
'Garage Yr Blt': 3.515358361774744,
'Year Remod/Add': 2.0819112627986347,
'Misc Val': 1.2969283276450512,
'Low Qual Fin SF': 1.228668941979522,
'3Ssn Porch': 1.0580204778156996,
'Neighborhood': 0.955631399317406,
'Exterior 2nd': 0.5802047781569966,
'MS SubClass': 0.5460750853242321,
'Exterior 1st': 0.5460750853242321,
'TotRms AbvGrd': 0.477815699658703,
'Pool Area': 0.477815699658703,
'Mo Sold': 0.40955631399317405,
'Overall Qual': 0.3412969283276451,
'Sale Type': 0.3412969283276451,
'Overall Cond': 0.3071672354948805,
'Condition 1': 0.3071672354948805,
'Bedroom AbvGr': 0.27303754266211605,
'Functional': 0.27303754266211605,
'Condition 2': 0.27303754266211605,
'Roof Matl': 0.27303754266211605,
'House Style': 0.27303754266211605,
'MS Zoning': 0.2389078498293515,
'BsmtFin Type 1': 0.20477815699658702,
'Roof Style': 0.20477815699658702,
'Garage Type': 0.20477815699658702,
'Garage Cars': 0.20477815699658702,
'BsmtFin Type 2': 0.20477815699658702,
'Sale Condition': 0.20477815699658702,
'Foundation': 0.20477815699658702,
'Heating': 0.20477815699658702,
'Garage Qual': 0.17064846416382254,
'Garage Cond': 0.17064846416382254,
'Fireplace Qu': 0.17064846416382254,
'Yr Sold': 0.17064846416382254,
'Fireplaces': 0.17064846416382254,
'Misc Feature': 0.17064846416382254,
'Heating QC': 0.17064846416382254,
'Kitchen Qual': 0.17064846416382254,
'Lot Config': 0.17064846416382254,
'Full Bath': 0.17064846416382254,
'Bldg Type': 0.17064846416382254,
'Electrical': 0.17064846416382254,
'Exter Cond': 0.17064846416382254,
'Bsmt Qual': 0.17064846416382254,
'Bsmt Cond': 0.17064846416382254,
'Kitchen AbvGr': 0.13651877133105803,
'Mas Vnr Type': 0.13651877133105803,
'Exter Qual': 0.13651877133105803,
'Fence': 0.13651877133105803,
'Pool QC': 0.13651877133105803,
'Bsmt Exposure': 0.13651877133105803,
'Land Contour': 0.13651877133105803,
'Bsmt Full Bath': 0.13651877133105803,
'Lot Shape': 0.13651877133105803,
'Paved Drive': 0.10238907849829351,
'Garage Finish': 0.10238907849829351,
'Bsmt Half Bath': 0.10238907849829351,
'Land Slope': 0.10238907849829351,
'Half Bath': 0.10238907849829351,
'Utilities': 0.10238907849829351,
'Alley': 0.06825938566552901,
'Street': 0.06825938566552901,
'Central Air': 0.06825938566552901}
[ ]: unique_values = data.nunique().sort_values(ascending=True).to_dict()
unique_values
[ ]: {'Central Air': 2,
'Street': 2,
'Alley': 2,
'Bsmt Half Bath': 3,
'Paved Drive': 3,
'Half Bath': 3,
'Utilities': 3,
'Garage Finish': 3,
'Land Slope': 3,
'Kitchen AbvGr': 4,
'Fence': 4,
'Mas Vnr Type': 4,
'Exter Qual': 4,
'Pool QC': 4,
'Land Contour': 4,
'Lot Shape': 4,
'Bsmt Exposure': 4,
'Bsmt Full Bath': 4,
'Electrical': 5,
'Misc Feature': 5,
'Bsmt Qual': 5,
'Bsmt Cond': 5,
'Exter Cond': 5,
'Fireplaces': 5,
'Kitchen Qual': 5,
'Heating QC': 5,
'Bldg Type': 5,
'Fireplace Qu': 5,
'Lot Config': 5,
'Yr Sold': 5,
'Garage Qual': 5,
'Garage Cond': 5,
'Full Bath': 5,
'Garage Type': 6,
'Garage Cars': 6,
'Sale Condition': 6,
'Heating': 6,
'BsmtFin Type 2': 6,
'BsmtFin Type 1': 6,
'Foundation': 6,
'Roof Style': 6,
'MS Zoning': 7,
'Condition 2': 8,
'House Style': 8,
'Functional': 8,
'Roof Matl': 8,
'Bedroom AbvGr': 8,
'Condition 1': 9,
'Overall Cond': 9,
'Sale Type': 10,
'Overall Qual': 10,
'Mo Sold': 12,
'Pool Area': 14,
'TotRms AbvGrd': 14,
'Exterior 1st': 16,
'MS SubClass': 16,
'Exterior 2nd': 17,
'Neighborhood': 28,
'3Ssn Porch': 31,
'Low Qual Fin SF': 36,
'Misc Val': 38,
'Year Remod/Add': 61,
'Garage Yr Blt': 103,
'Year Built': 118,
'Screen Porch': 121,
'Lot Frontage': 128,
'Enclosed Porch': 183,
'Open Porch SF': 252,
'BsmtFin SF 2': 274,
'Wood Deck SF': 380,
'Mas Vnr Area': 445,
'Garage Area': 603,
'2nd Flr SF': 635,
'BsmtFin SF 1': 995,
'SalePrice': 1032,
'Total Bsmt SF': 1058,
'1st Flr SF': 1083,
'Bsmt Unf SF': 1137,
'Gr Liv Area': 1292,
'Lot Area': 1960,
'PID': 2930,
'Order': 2930}
[ ]: data.describe()
SalePrice
count 2930.000000
mean 180796.060068
std 79886.692357
min 12789.000000
25% 129500.000000
50% 160000.000000
75% 213500.000000
max 755000.000000
[8 rows x 37 columns]
Column: BsmtFin SF 1
Data Type: float64
Number of NA values: 1
Number of NA values: 81
Column: BsmtFin SF 2
Data Type: float64
Number of NA values: 1
Column: Electrical
Data Type: object
Number of NA values: 1
Column: Fireplace Qu
Data Type: object
Number of NA values: 1422
Number of NA values: 1
Column: Fence
Data Type: object
Number of NA values: 2358
skew_value = data[column].skew()
print(f"Skewness of {column}: {skew_value}")
Filled NA in Bsmt Half Bath with median: 0.0
Skewness of Garage Yr Blt: -0.38467176161174854
Filled NA in Garage Yr Blt with mean: 1978.1324431613136
Skewness of Garage Cars: -0.2198363641384971
Filled NA in Garage Cars with mean: 1.7668146124957322
Skewness of Garage Area: 0.2419942395445727
Filled NA in Garage Area with mean: 472.8197336975077
Missing value replacement complete.
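The log above suggests that numeric columns were filled with the mean when their skewness was small and with the median otherwise. The full loop is not preserved in this export, so the following is only a sketch of that idea: the cut-off `SKEW_THRESHOLD = 0.5` and the toy column names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame: one roughly symmetric column and one strongly right-skewed column.
toy = pd.DataFrame({
    "roughly_symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "right_skewed": [0.0, 0.0, 0.0, 1.0, 50.0, np.nan],
})

SKEW_THRESHOLD = 0.5  # assumed cut-off; the notebook's value is not shown

for column in toy.select_dtypes(include="number").columns:
    if toy[column].isnull().sum() == 0:
        continue  # nothing to fill in this column
    skew_value = toy[column].skew()
    print(f"Skewness of {column}: {skew_value}")
    if abs(skew_value) < SKEW_THRESHOLD:
        # Near-symmetric distribution: the mean is a reasonable centre.
        fill_value, method = toy[column].mean(), "mean"
    else:
        # Skewed distribution: the median is less affected by extreme values.
        fill_value, method = toy[column].median(), "median"
    toy[column] = toy[column].fillna(fill_value)
    print(f"Filled NA in {column} with {method}: {fill_value}")

print("Missing value replacement complete.")
```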
# Check the structure and first few rows of the DataFrame after encoding
print(data.info())
print(data.head())
Heating QC has been binary encoded.
Central Air has been binary encoded.
Electrical has been binary encoded.
Kitchen Qual has been binary encoded.
Functional has been binary encoded.
Fireplace Qu has been binary encoded.
Garage Type has been binary encoded.
Garage Finish has been binary encoded.
Garage Qual has been binary encoded.
Garage Cond has been binary encoded.
Paved Drive has been binary encoded.
Fence has been binary encoded.
Sale Type has been binary encoded.
Sale Condition has been binary encoded.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 76 columns):
# Column Non-Null Count Dtype
29 Bsmt Exposure 2930 non-null int64
30 BsmtFin Type 1 2930 non-null int64
31 BsmtFin SF 1 2930 non-null float64
32 BsmtFin Type 2 2930 non-null int64
33 BsmtFin SF 2 2930 non-null float64
34 Bsmt Unf SF 2930 non-null float64
35 Total Bsmt SF 2930 non-null float64
36 Heating 2930 non-null int64
37 Heating QC 2930 non-null int64
38 Central Air 2930 non-null int64
39 Electrical 2930 non-null int64
40 1st Flr SF 2930 non-null int64
41 2nd Flr SF 2930 non-null int64
42 Low Qual Fin SF 2930 non-null int64
43 Gr Liv Area 2930 non-null int64
44 Bsmt Full Bath 2930 non-null float64
45 Bsmt Half Bath 2930 non-null float64
46 Full Bath 2930 non-null int64
47 Half Bath 2930 non-null int64
48 Bedroom AbvGr 2930 non-null int64
49 Kitchen AbvGr 2930 non-null int64
50 Kitchen Qual 2930 non-null int64
51 TotRms AbvGrd 2930 non-null int64
52 Functional 2930 non-null int64
53 Fireplaces 2930 non-null int64
54 Fireplace Qu 2930 non-null int64
55 Garage Type 2930 non-null int64
56 Garage Yr Blt 2930 non-null float64
57 Garage Finish 2930 non-null int64
58 Garage Cars 2930 non-null float64
59 Garage Area 2930 non-null float64
60 Garage Qual 2930 non-null int64
61 Garage Cond 2930 non-null int64
62 Paved Drive 2930 non-null int64
63 Wood Deck SF 2930 non-null int64
64 Open Porch SF 2930 non-null int64
65 Enclosed Porch 2930 non-null int64
66 3Ssn Porch 2930 non-null int64
67 Screen Porch 2930 non-null int64
68 Pool Area 2930 non-null int64
69 Fence 2930 non-null int64
70 Misc Val 2930 non-null int64
71 Mo Sold 2930 non-null int64
72 Yr Sold 2930 non-null int64
73 Sale Type 2930 non-null int64
74 Sale Condition 2930 non-null int64
75 SalePrice 2930 non-null int64
dtypes: float64(11), int64(65)
memory usage: 1.7 MB
None
MS SubClass MS Zoning Lot Frontage Lot Area Street Lot Shape \
0 20 5 141.0 31770 1 0
1 20 4 80.0 11622 1 3
2 20 5 81.0 14267 1 0
3 20 5 93.0 11160 1 3
4 60 5 74.0 13830 1 0
Screen Porch Pool Area Fence Misc Val Mo Sold Yr Sold Sale Type \
0 0 0 4 0 5 2010 9
1 120 0 2 0 6 2010 9
2 0 0 4 12500 6 2010 9
3 0 0 4 0 4 2010 9
4 0 0 2 0 3 2010 9
[5 rows x 76 columns]
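The encoding cell itself is not preserved here. Although the log says "binary encoded", the resulting integer codes (for example 5 for MS Zoning and 9 for Sale Type) look like one integer per category, that is, label encoding. As a sketch under that assumption, the conversion could be done with pandas category codes; the toy column names are hypothetical.

```python
import pandas as pd

# Toy frame with two text columns and one numeric column.
toy = pd.DataFrame({
    "zoning": ["RL", "RH", "RL", "FV"],
    "street": ["Pave", "Pave", "Grvl", "Pave"],
    "area": [31770, 11622, 14267, 11160],
})

for column in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[column]):
        continue  # leave numeric columns untouched
    # Categories are sorted alphabetically and numbered 0, 1, 2, ...
    toy[column] = toy[column].astype("category").cat.codes.astype("int64")
    print(f"{column} has been label encoded.")

print(toy)
```

Integer codes impose an arbitrary order on the categories, so for models that interpret magnitude, one-hot encoding (for example `pd.get_dummies`) may be a better fit.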
plt.xlabel('Sale Price')
plt.show()
# Scatter Plot of Lot Frontage vs Lot Area
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Lot Frontage', y='Lot Area', data=data, alpha=0.6)
plt.title('Scatter Plot of Lot Frontage vs. Lot Area')
plt.xlabel('Lot Frontage (feet)')
plt.ylabel('Lot Area (square feet)')
plt.show()
# If you have neighborhood data and want to visualize Sale Price in different neighborhoods
[ ]: sns.set(style="whitegrid")
[ ]: # Create a new column combining year and month for detailed trend analysis
data['Year_Month'] = data['Yr Sold'].astype(str) + '-' + data['Mo Sold'].astype(str).str.zfill(2)
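The rest of this trend-analysis cell is not preserved. As a sketch of what the combined column makes possible, the toy example below builds the same `Year_Month` key and averages the sale price per month; the prices are made-up illustrative values and `monthly_mean` is a hypothetical name.

```python
import pandas as pd

# Made-up values; only the column names follow the Ames dataset.
toy = pd.DataFrame({
    "Yr Sold": [2010, 2010, 2010, 2009],
    "Mo Sold": [5, 5, 6, 12],
    "SalePrice": [100000, 200000, 150000, 120000],
})

# zfill(2) pads single-digit months, so "2010-05" sorts before "2010-10".
toy["Year_Month"] = (
    toy["Yr Sold"].astype(str) + "-" + toy["Mo Sold"].astype(str).str.zfill(2)
)

# Average sale price per month, in chronological order.
monthly_mean = toy.groupby("Year_Month")["SalePrice"].mean().sort_index()
print(monthly_mean)
```

Because the key is zero-padded, sorting it as a string gives chronological order, which keeps a line plot of `monthly_mean` in time sequence.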
da-lab1
[3]: data.shape
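[4]: data.info  # parentheses are missing, so this shows the bound method (with the frame's repr); data.info() prints the column summary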
Alley Lot Shape Land Contour … Pool Area Pool QC Fence Misc Feature \
0 NaN IR1 Lvl … 0 NaN NaN NaN
1 NaN Reg Lvl … 0 NaN MnPrv NaN
2 NaN IR1 Lvl … 0 NaN NaN Gar2
3 NaN Reg Lvl … 0 NaN NaN NaN
4 NaN IR1 Lvl … 0 NaN MnPrv NaN
… … … … … … … … …
2925 NaN IR1 Lvl … 0 NaN GdPrv NaN
2926 NaN IR1 Low … 0 NaN MnPrv NaN
2927 NaN Reg Lvl … 0 NaN MnPrv Shed
2928 NaN Reg Lvl … 0 NaN NaN NaN
2929 NaN Reg Lvl … 0 NaN NaN NaN
[5]: 82
[6]: 2930
[7]: #
data['Sale Condition'].unique()
[8]: 13997
[9]: # Number of null values in each column
nullrows = data.isnull().sum()
nullrows
[9]: Order 0
PID 0
MS SubClass 0
MS Zoning 0
Lot Frontage 490
…
Mo Sold 0
Yr Sold 0
Sale Type 0
Sale Condition 0
SalePrice 0
Length: 82, dtype: int64
[11]: data.isnull().sum()
[11] : Order 0
PID 0
MS SubClass 0
MS Zoning 0
Lot Frontage 0
..
Mo Sold 0
Yr Sold 0
Sale Type 0
Sale Condition 0
SalePrice 0
Length: 82, dtype: int64
[12]: data['PID'].dtypes
[12] : dtype('int64')
[15]: data.head(3)
[15]: Order PID MS SubClass MS Zoning Lot Frontage Lot Area Street \
0 1 526301100 20 RL 141.0 31770 Pave
1 2 526350040 20 RH 80.0 11622 Pave
2 3 526351010 20 RL 81.0 14267 Pave
Alley Lot Shape Land Contour … Pool Area Pool QC Fence Misc Feature \
0 NaN IR1 Lvl … 0 NaN NaN NaN
1 NaN Reg Lvl … 0 NaN MnPrv NaN
2 NaN IR1 Lvl … 0 NaN NaN Gar2
[3 rows x 82 columns]
[16]: 2255
Histogram
Visualize the distribution of a single numerical variable
[17]: sns.histplot(data['SalePrice'], kde=True)
plt.title('Distribution of Sale Prices')
plt.xlabel('Sale Price')
plt.ylabel('Frequency')
plt.show()
[18]: # Histogram
plt.hist(data['SalePrice'], edgecolor='black', bins=9, color='skyblue', alpha=0.7)
[19]: sns.countplot(x='SalePrice', data = data)
Box Plot
Compare the distribution of a numerical variable across different categories.
Visualizing SalePrice distribution across different MS Zoning categories.
[20]: sns.boxplot(x='MS Zoning', y='SalePrice', data=data)
plt.title('Sale Price Distribution by MS Zoning')
plt.xlabel('MS Zoning')
plt.ylabel('Sale Price')
plt.show()
Scatter Plot
Visualize the relationship between two numerical variables.
Visualizing the relationship between Lot Area and SalePrice
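The code cell for this plot is not preserved in the export. A minimal sketch of such a scatter plot is given below on a small stand-in frame: the Lot Area values are taken from the first rows shown earlier, while the SalePrice values are made up for illustration. In the notebook the real `data` frame would be passed instead of `toy`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in data: SalePrice values here are illustrative, not from the dataset.
toy = pd.DataFrame({
    "Lot Area": [31770, 11622, 14267, 11160, 13830],
    "SalePrice": [200000, 110000, 170000, 240000, 190000],
})

fig, ax = plt.subplots(figsize=(8, 5))
# alpha < 1 makes overlapping points visible when many rows are plotted.
sns.scatterplot(x="Lot Area", y="SalePrice", data=toy, alpha=0.6, ax=ax)
ax.set_title("Scatter Plot of Lot Area vs. Sale Price")
ax.set_xlabel("Lot Area (square feet)")
ax.set_ylabel("Sale Price")
```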
Pair Plot
Visualize pairwise relationships between several numerical variables.
[22]: sns.pairplot(data[['SalePrice', 'Lot Area', 'Lot Frontage']])
plt.suptitle('Pair Plot of SalePrice, Lot Area, and Lot Frontage', y=1.02)
plt.show()
Heatmap
Visualize the correlation matrix between numerical variables.
[23]: corr_matrix = data[['SalePrice', 'Lot Area', 'Lot Frontage', 'Pool Area']].corr()
PART 2
[24]: df = pd.read_csv('Iris.csv')
[25]: df.shape
[25]: (150, 6)
[26]: df.head(3)
[27]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB
[28]: df.describe()
[29]: df.shape
[29]: (150, 6)
[32]: df.count()
[32]: Id 150
SepalLengthCm 150
SepalWidthCm 150
PetalLengthCm 150
PetalWidthCm 150
Species 150
dtype: int64
[33]: print(df.isnull().sum())
Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64
[34]: # Num of features
num_features = len(df.columns)
[35]: num_features
[35]: 6
[37]: num_samples
[37]: 150
[38]: 0
[40]: # Histogram
plt.hist(df['SepalLengthCm'], edgecolor='black', bins=9, color='skyblue', alpha=0.7)
[41]: # Histogram
plt.hist(df['Species'], edgecolor='black', bins=9, color='skyblue', alpha=0.7)
[42]: import seaborn as sns
Scatter Plot
A scatter plot shows the relationship between two continuous variables in a dataset.
x, y: the input variables, which should be numeric.
hue='Species': colors the points by species.
style='Species': uses a different marker for each species.
palette='viridis': applies the 'viridis' color palette to the plot.
[43]: sns.scatterplot(x="SepalLengthCm",
y="SepalWidthCm",
data = df,
hue='Species',
style='Species',
palette='viridis'
)
plt.title('Scatter Plot of Sepal Length vs Sepal Width')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.show()
A bar plot is used to visualize and compare categorical data, in particular to compare a numerical
summary across different categories.
x='Species': puts the species on the x-axis as the categories.
y='SepalLengthCm': sns.barplot averages the sepal length within each species (its default
estimator is the mean) and draws the result as bars on the y-axis.
[45]: # BarPlot
sns.barplot(x = 'Species', y='SepalLengthCm', data = df, palette='Set2')
plt.title('Average Sepal Length by Species')
plt.xlabel('Species')
plt.ylabel('Average Sepal Length (cm)')
Count Plot: shows the number of observations in each category of a single categorical variable.
[46]: sns.countplot(x='Species', data = df)
PairPlot
A pair plot (also known as a scatterplot matrix) shows the relationships between several variables
in a dataset at once by drawing a scatter plot for every pair of variables.
hue='Species': colors the points by the Species category, showing how the different species are
distributed across each pair of variables.
vars=['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']: specifies the
columns to include in the pair plot. If omitted, all numerical columns are used (here that would
also include Id).
Heatmaps are often used to visualize correlation matrices, where each cell shows the correlation
between two variables. This helps in identifying strongly correlated variables.
corr_matrix: Calculates the correlation matrix for the selected columns.
sns.heatmap(): Creates the heatmap from the correlation matrix.
annot=True: Adds the correlation coefficients inside the cells.
cmap=‘coolwarm’: Specifies the color map, with blue for low values and red for high values.
linewidths=0.5: Adds a thin line between cells for better readability.
[48]: corr_matrix = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].corr()
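Only the first line of this cell survives in the export. A sketch of the remaining heatmap call, using the parameters described above, is shown on a six-row stand-in frame with the Iris.csv column names; the title and figure size are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Six hand-typed rows standing in for the full Iris table.
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "PetalWidthCm": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# Pairwise Pearson correlations between the four measurements.
corr_matrix = toy.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# annot=True writes each coefficient in its cell; linewidths separates the cells.
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", linewidths=0.5, ax=ax)
ax.set_title("Correlation Heatmap")
```

Passing `vmin=-1, vmax=1` to `sns.heatmap` would pin the color scale to the full correlation range, which makes heatmaps from different datasets easier to compare.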