Assignment CSE-520
Answer:
Code Snippet:
import pandas as pd
Output:
C:\Python38\python.exe "C:/Users/Tarik Adnan/PycharmProjects/CSE520/read_csv.py"
Qualitative variables:
1. vs (Engine (0 = V-shaped, 1 = straight))
2. am (Transmission (0 = automatic, 1 = manual))
Q2. Give the appropriate graphical representation for qualitative variables. Comment on the
graphs.
Answer:
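Code Snippet:
import pandas as pd
import matplotlib.pyplot as plt
# Read the 'am' column and map its 0/1 codes to readable labels
data = pd.read_csv('F:/File/Car_data.csv')['am']
labels = data.map({0: 'Automatic', 1: 'Manual'})
labels.value_counts().plot.pie(autopct='%1.2f%%', startangle=90,
                               counterclock=True)
plt.ylabel('')
plt.xlabel('Feature Identifier: ‘am’')
plt.show()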
Output:
Comment on the plot:
The pie chart shows that about 60% of the cars in the data set have an automatic
transmission and about 40% have a manual transmission.
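The percentages printed on a pie chart are each category's share of all rows. A minimal plain-Python sketch of that calculation follows. Here category_shares is a hypothetical helper, and the five-value sample is made up for illustration; it is not taken from Car_data.csv.

```python
from collections import Counter


def category_shares(values):
    """Return {category: percentage of all values} for a list of coded categories."""
    counts = Counter(values)
    total = len(values)
    return {category: 100 * count / total for category, count in counts.items()}


# Made-up sample of 'am' codes (0 = automatic, 1 = manual), for illustration only.
sample_am = [0, 0, 0, 1, 1]
shares = category_shares(sample_am)
print(shares)
```

Applied to the real ‘am’ column, the same calculation should give the two percentages shown on the chart.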
Q3. Give a histogram for the variable ‘mpg (Miles/(US) gallon)’ and comment on the plot.
Answer:
Code Snippet:
import pandas as pd
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(6, 4))
data = pd.read_csv('F:/File/Car_data.csv', quoting=3)['mpg']
histogram = plt.hist(data, width=2.2, color='#0504aa', alpha=0.7)
plt.xlim([8, 35])
plt.grid(axis='y', alpha=0.75)
plt.ylabel('Frequency')
plt.xlabel('‘mpg (Miles/(US) gallon)’')
plt.title('Histogram of Feature: ‘mpg (Miles/(US) gallon)’')
plt.show()
Output:
Comment on the plot:
The histogram shows that the majority of the cars have a mileage between 12 and 24
miles per gallon.
The histogram is unimodal, with a hump between 15 and 20.
The histogram is not symmetrical.
The upper tail is longer than the lower tail, so the distribution is positively skewed.
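The skew can also be checked numerically. The rough sketch below uses only the mean, median (50%) and standard deviation that describe() reports for ‘mpg’ in the Q6 answer. Pearson's second skewness coefficient, 3 × (mean − median) / std, is positive when the upper tail is the longer one. The function name is a hypothetical helper.

```python
def pearson_second_skewness(mean, median, std):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean - median) / std


# Summary statistics of 'mpg' as printed by describe() in the Q6 answer.
mpg_mean, mpg_median, mpg_std = 20.090625, 19.2, 6.026948

skew = pearson_second_skewness(mpg_mean, mpg_median, mpg_std)
print(f"Pearson skewness of mpg: {skew:.2f}")  # a positive value means right skew
```

This is only a summary-statistic check; it does not replace looking at the histogram itself.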
Q4. Give a stem-and-leaf plot for all numerical variables and comment on the plots.
Answer:
1. Stem-and-leaf plot for feature ‘mpg (Miles/(US) gallon)’
Code Snippet:
import pandas as pd
import stemgraphic
import matplotlib.pyplot as plt
col_name = input("Type the feature name to plot stem-leaf: ")
input_scale = float(input("Input scale to plot stem-leaf: "))
data = pd.read_csv('F:/File/Car_data.csv')[col_name]
# stem_graphic returns the figure and axes of the stem-and-leaf plot
fig, ax = stemgraphic.stem_graphic(data, scale=input_scale)
fig.set_figheight(8)
fig.set_figwidth(6)
plt.show()
Input:
Output:
Comment on the plot:
Skewed data
When data are skewed, the majority of the data are located on the high or low side
of the graph. Skewness indicates that the data may not be normally
distributed. Often, skewness is easiest to detect with a histogram or a boxplot.
Outliers
Outliers, which are data values that are far away from other data values, can strongly
affect the results. Often, outliers are easiest to identify on a boxplot. On a
stem-and-leaf plot, isolated values at the ends identify possible outliers.
The stem-and-leaf plot of feature ‘mpg’ above does not indicate any outliers.
Multi-modal data
Multi-modal data have multiple peaks, also called modes. Multi-modal data often
indicate that important variables are not yet accounted for.
The stem-and-leaf plot above shows 2 peaks.
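How a stem-and-leaf display is built can be sketched without stemgraphic. The plain-Python example below is an illustration, not the stemgraphic algorithm: it takes the tens digit as the stem and the units digit as the leaf, dropping the decimals. Here stem_and_leaf is a hypothetical helper, and the input is only the 15 ‘mpg’ values quoted in the Q6 answer, not the full column.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group values by tens digit (stem); leaves are the units digits."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(int(value), 10)  # int() drops the decimal part
        rows[stem].append(leaf)
    return dict(rows)


# The 15 'mpg' values listed in the Q6 answer (below Q1 and above Q3 only).
mpg_subset = [14.3, 15.2, 10.4, 10.4, 14.7, 15.2, 13.3, 15.0,
              24.4, 32.4, 30.4, 33.9, 27.3, 26.0, 30.4]

rows = stem_and_leaf(mpg_subset)
for stem in sorted(rows):
    print(f"{stem} | {' '.join(str(leaf) for leaf in rows[stem])}")
```

Because the middle of the distribution is missing from this subset, the printed display should not be used to judge peaks or skew of the full variable.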
2. Stem-and-leaf plot for feature ‘cyl (Number of cylinders)’
Code Snippet: the same script as for ‘mpg’ in part 1, run with the feature name cyl.
Output:
Comment on the plot:
Spread
The spread shows how much the data vary. The following stem-and-leaf plot shows
‘cyl (Number of cylinders)’ of the cars. The values range from 4 to 8.
Skewed data
When data are skewed, the majority of the data are located on the high or low side
of the graph. Skewness indicates that the data may not be normally
distributed. Often, skewness is easiest to detect with a histogram or a boxplot.
Outliers
Outliers, which are data values that are far away from other data values, can strongly
affect the results. Often, outliers are easiest to identify on a boxplot. On a
stem-and-leaf plot, isolated values at the ends identify possible outliers.
The stem-and-leaf plot of feature ‘cyl’ above does not indicate any outliers.
Multi-modal data
Multi-modal data have multiple peaks, also called modes. Multi-modal data often
indicate that important variables are not yet accounted for.
The stem-and-leaf plot above shows 3 peaks.
3. Stem-and-leaf plot for feature ‘disp (Displacement (cu.in.))’
Code Snippet: the same script as for ‘mpg’ in part 1, run with the feature name disp.
Output:
Comment on the plot:
Spread
The spread shows how much the data vary. The following stem-and-leaf plot shows
‘disp (Displacement (cu.in.))’ of the cars. The values range from 71.1 to 472.0.
Skewed data
When data are skewed, the majority of the data are located on the high or low side
of the graph. Skewness indicates that the data may not be normally
distributed. Often, skewness is easiest to detect with a histogram or a boxplot.
Outliers
Outliers, which are data values that are far away from other data values, can strongly
affect the results. Often, outliers are easiest to identify on a boxplot. On a
stem-and-leaf plot, isolated values at the ends identify possible outliers.
The stem-and-leaf plot of feature ‘disp’ above does not indicate any outliers.
Multi-modal data
Multi-modal data have multiple peaks, also called modes. Multi-modal data often
indicate that important variables are not yet accounted for.
The stem-and-leaf plot above shows 2 peaks.
4. Stem-and-leaf plot for feature ‘hp (Gross horsepower)’
Code Snippet: the same script as for ‘mpg’ in part 1, run with the feature name hp.
Output:
Comment on the plot:
Spread
The spread shows how much the data vary. The following stem-and-leaf plot shows
‘hp (Gross horsepower)’ of the cars. The values range from 52 to 335.
Skewed data
When data are skewed, the majority of the data are located on the high or low side
of the graph. Skewness indicates that the data may not be normally
distributed. Often, skewness is easiest to detect with a histogram or a boxplot.
Outliers
Outliers, which are data values that are far away from other data values, can strongly
affect the results. Often, outliers are easiest to identify on a boxplot. On a
stem-and-leaf plot, isolated values at the ends identify possible outliers.
The stem-and-leaf plot of feature ‘hp’ above indicates one outlier, 335.
Multi-modal data
Multi-modal data have multiple peaks, also called modes. Multi-modal data often
indicate that important variables are not yet accounted for.
The stem-and-leaf plot above shows 4 peaks.
5. Stem-and-leaf plot for feature ‘drat (Rear axle ratio)’
Code Snippet: the same script as for ‘mpg’ in part 1, run with the feature name drat.
Output:
Comment on the plot:
Spread
The spread shows how much the data vary. The following stem-and-leaf plot shows
‘drat (Rear axle ratio)’ of the cars. The values range from 2.76 to 4.93.
Skewed data
When data are skewed, the majority of the data are located on the high or low side
of the graph. Skewness indicates that the data may not be normally
distributed. Often, skewness is easiest to detect with a histogram or a boxplot.
Outliers
Outliers, which are data values that are far away from other data values, can strongly
affect the results. Often, outliers are easiest to identify on a boxplot. On a
stem-and-leaf plot, isolated values at the ends identify possible outliers.
The stem-and-leaf plot of feature ‘drat’ above indicates one outlier, 4.93.
Multi-modal data
Multi-modal data have multiple peaks, also called modes. Multi-modal data often
indicate that important variables are not yet accounted for.
The stem-and-leaf plot above shows 2 peaks.
6. Stem-and-leaf plot for feature ‘wt (Weight (1000 lbs))’
Code Snippet: the same script as for ‘mpg’ in part 1, run with the feature name wt.
Output:
Comment on the plot:
Skewed data
When data are skewed, the majority of the data are located on the high or low side
of the graph. Skewness indicates that the data may not be normally
distributed. Often, skewness is easiest to detect with a histogram or a boxplot.
Outliers
Outliers, which are data values that are far away from other data values, can strongly
affect the results. Often, outliers are easiest to identify on a boxplot. On a
stem-and-leaf plot, isolated values at the ends identify possible outliers.
The stem-and-leaf plot of feature ‘wt’ above does not indicate any outliers.
Multi-modal data
Multi-modal data have multiple peaks, also called modes. Multi-modal data often
indicate that important variables are not yet accounted for.
The stem-and-leaf plot above shows 1 peak.
7. Stem-and-leaf plot for feature ‘qsec’
Code Snippet: the same script as for ‘mpg’ in part 1, run with the feature name qsec.
Input:
Output:
Comment on the plot:
Skewed data
When data are skewed, the majority of the data are located on the high or low side
of the graph. Skewness indicates that the data may not be normally
distributed. Often, skewness is easiest to detect with a histogram or a boxplot.
Outliers
Outliers, which are data values that are far away from other data values, can strongly
affect the results. Often, outliers are easiest to identify on a boxplot. On a
stem-and-leaf plot, isolated values at the ends identify possible outliers.
The stem-and-leaf plot of feature ‘qsec’ above indicates one outlier, 22.9.
Multi-modal data
Multi-modal data have multiple peaks, also called modes. Multi-modal data often
indicate that important variables are not yet accounted for.
The stem-and-leaf plot above shows 1 peak.
8. Stem-and-leaf plot for feature ‘gear’
Code Snippet: the same script as for ‘mpg’ in part 1, run with the feature name gear.
Input:
Output:
Comment on the plot:
Skewed data
When data are skewed, the majority of the data are located on the high or low side
of the graph. Skewness indicates that the data may not be normally
distributed. Often, skewness is easiest to detect with a histogram or a boxplot.
Outliers
Outliers, which are data values that are far away from other data values, can strongly
affect the results. Often, outliers are easiest to identify on a boxplot. On a
stem-and-leaf plot, isolated values at the ends identify possible outliers.
The stem-and-leaf plot of feature ‘gear’ above does not indicate any outliers.
Multi-modal data
Multi-modal data have multiple peaks, also called modes. Multi-modal data often
indicate that important variables are not yet accounted for.
The stem-and-leaf plot above shows 1 peak.
9. Stem-and-leaf plot for feature ‘carb’
Code Snippet: the same script as for ‘mpg’ in part 1, run with the feature name carb.
Input:
Output:
Comment on the plot:
Skewed data
When data are skewed, the majority of the data are located on the high or low side
of the graph. Skewness indicates that the data may not be normally
distributed. Often, skewness is easiest to detect with a histogram or a boxplot.
Outliers
Outliers, which are data values that are far away from other data values, can strongly
affect the results. Often, outliers are easiest to identify on a boxplot. On a
stem-and-leaf plot, isolated values at the ends identify possible outliers.
The stem-and-leaf plot of feature ‘carb’ above indicates one outlier, 8.
Multi-modal data
Multi-modal data have multiple peaks, also called modes. Multi-modal data often
indicate that important variables are not yet accounted for.
The stem-and-leaf plot above shows 2 peaks.
Q5. Draw the boxplot for the variable ‘wt (Weight (1000 lbs))’. Do you identify outliers in
this data? If so, how will you interpret those car models?
Answer:
Code Snippet:
import pandas
import matplotlib.pyplot as plt
import glob
files = glob.glob('F:/File/*.csv')
fig = plt.figure(figsize=(10, 4))
Output:
Output of describe() function:
C:\Python38\python.exe "C:/Users/Tarik Adnan/PycharmProjects/CSE520/Box_Whiskers_Plot_WT.py"
count 32.000000
mean 3.217250
std 0.978457
min 1.513000
25% 2.581250
50% 3.325000
75% 3.610000
max 5.424000
Name: wt, dtype: float64
The box-and-whisker plot shows 3 outliers. The car models that are outliers are listed
below:
a. Cadillac Fleetwood (Weight (1000 lbs): 5.250)
b. Lincoln Continental (Weight (1000 lbs): 5.424)
c. Chrysler Imperial (Weight (1000 lbs): 5.345)
75% of the cars weigh 3.610 (1000 lbs) or less.
The mean of the variable ‘wt’ is 3.217250.
The three outliers are much heavier than the rest of the cars: each lies above the upper
fence of about 5.153, that is Q3 + 1.5 × IQR = 3.610 + 1.5 × (3.610 − 2.58125).
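By default, matplotlib's box plot flags points that lie more than 1.5 × IQR beyond the quartiles. The short sketch below applies that rule by hand to the 25% and 75% values from the describe() output above. Here iqr_fences is a hypothetical helper name.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of 'wt' from the describe() output.
lower, upper = iqr_fences(2.58125, 3.610)

# Weights (1000 lbs) of the three cars listed as outliers.
heavy_cars = {
    "Cadillac Fleetwood": 5.250,
    "Lincoln Continental": 5.424,
    "Chrysler Imperial": 5.345,
}
flagged = [name for name, wt in heavy_cars.items() if wt < lower or wt > upper]
print(f"fences: {lower:.3f} to {upper:.3f}")
print(flagged)
```

All three listed weights fall above the upper fence, which agrees with the three points drawn beyond the whisker.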
Q6. Which car models have an ‘mpg’ below the first quartile? Also, find the
car models that have an ‘mpg’ above the third quartile.
Answer:
Code Snippet:
import pandas
import matplotlib.pyplot as plt
import glob
files = glob.glob('F:/File/*.csv')
fig = plt.figure(figsize=(6, 4))
Output:
Output of describe() function:
C:\Python38\python.exe "C:\Users\Tarik Adnan\PycharmProjects\CSE520\Box_Whiskers_Plot_MPG.py"
count 32.000000
mean 20.090625
std 6.026948
min 10.400000
25% 15.425000
50% 19.200000
75% 22.800000
max 33.900000
Name: mpg, dtype: float64
The box-and-whisker plot shows 1 outlier. The car model that is an outlier is listed
below:
a. Toyota Corolla (Miles/(US) gallon: 33.9)
The first quartile (25%) of ‘mpg’ is 15.425. A total of 8 car models have an ‘mpg’ below
the first quartile:
a. Duster 360 (Miles/(US) gallon: 14.3)
b. Merc 450SLC (Miles/(US) gallon: 15.2)
c. Cadillac Fleetwood (Miles/(US) gallon: 10.4)
d. Lincoln Continental (Miles/(US) gallon: 10.4)
e. Chrysler Imperial (Miles/(US) gallon: 14.7)
f. AMC Javelin (Miles/(US) gallon: 15.2)
g. Camaro Z28 (Miles/(US) gallon: 13.3)
h. Maserati Bora (Miles/(US) gallon: 15)
The third quartile (75%) of ‘mpg’ is 22.80. A total of 7 car models have an ‘mpg’ above
the third quartile:
a. Merc 240D (Miles/(US) gallon: 24.4)
b. Fiat 128 (Miles/(US) gallon: 32.4)
c. Honda Civic (Miles/(US) gallon: 30.4)
d. Toyota Corolla (Miles/(US) gallon: 33.9)
e. Fiat X1-9 (Miles/(US) gallon: 27.3)
f. Porsche 914-2 (Miles/(US) gallon: 26.0)
g. Lotus Europa (Miles/(US) gallon: 30.4)
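The two lists can be reproduced with a quartile filter. The sketch below is plain Python so that it runs without the CSV file. Here linear_quantile is a hypothetical helper that interpolates linearly between sorted values, which is also the default behaviour of the pandas quantile method. The name and mpg pairs are the standard mtcars values and are assumed to match Car_data.csv.

```python
def linear_quantile(values, p):
    """Quantile with linear interpolation between the two nearest sorted values."""
    data = sorted(values)
    position = (len(data) - 1) * p
    low = int(position)
    high = min(low + 1, len(data) - 1)
    return data[low] + (data[high] - data[low]) * (position - low)


# Assumed contents of the 'mpg' column (standard mtcars values).
mpg = {
    "Mazda RX4": 21.0, "Mazda RX4 Wag": 21.0, "Datsun 710": 22.8,
    "Hornet 4 Drive": 21.4, "Hornet Sportabout": 18.7, "Valiant": 18.1,
    "Duster 360": 14.3, "Merc 240D": 24.4, "Merc 230": 22.8,
    "Merc 280": 19.2, "Merc 280C": 17.8, "Merc 450SE": 16.4,
    "Merc 450SL": 17.3, "Merc 450SLC": 15.2, "Cadillac Fleetwood": 10.4,
    "Lincoln Continental": 10.4, "Chrysler Imperial": 14.7, "Fiat 128": 32.4,
    "Honda Civic": 30.4, "Toyota Corolla": 33.9, "Toyota Corona": 21.5,
    "Dodge Challenger": 15.5, "AMC Javelin": 15.2, "Camaro Z28": 13.3,
    "Pontiac Firebird": 19.2, "Fiat X1-9": 27.3, "Porsche 914-2": 26.0,
    "Lotus Europa": 30.4, "Ford Pantera L": 15.8, "Ferrari Dino": 19.7,
    "Maserati Bora": 15.0, "Volvo 142E": 21.4,
}

q1 = linear_quantile(mpg.values(), 0.25)
q3 = linear_quantile(mpg.values(), 0.75)

# Strict comparisons: cars exactly at a quartile are not included.
below_q1 = [name for name, value in mpg.items() if value < q1]
above_q3 = [name for name, value in mpg.items() if value > q3]

print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}")
print(below_q1)
print(above_q3)
```

The comparisons are strict, so the two cars with an ‘mpg’ of exactly 22.8 are not counted as above the third quartile.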