ADS EXP Assignments

The document describes seven experiments performed on a household income and expenditure dataset: exploring descriptive and inferential statistics, applying data cleaning techniques such as data imputation, exploring data visualization techniques, implementing performance evaluation metrics for supervised and unsupervised models, generating synthetic data with SMOTE to address class imbalance, detecting outliers with distance-based and density-based methods, and exploring time series forecasting on an airline passenger dataset.


EXPERIMENT 01

Aim : Explore descriptive and inferential statistics on the given dataset

Theory :

● Descriptive Statistics - Descriptive statistics describe, show, and summarise the basic features of a dataset
found in a given study, presented in a summary that describes the data sample and its measurements. It
helps analysts to understand the data better.
Example : A good example of descriptive statistics is a student's grade point average (GPA). A GPA gathers the data points created across many grades, classes, and exams, averages them, and presents a general idea of the student's mean academic performance. Note that the GPA doesn't predict future performance or draw any conclusions. Instead, it provides a straightforward summary of the student's academic record based on values pulled from the data.
● Types of Descriptive Statistics - All descriptive statistics are either measures of central tendency or
measures of variability, also known as measures of dispersion.
● Inferential Statistics - Inferential statistics are often used to compare the differences between the treatment
groups. Inferential statistics use measurements from the sample of subjects in the experiment to compare
the treatment groups and make generalizations about the larger population of subjects.
Example : A coach wants to find out how many cartwheels, on average, sophomores at his college can do without stopping. A sample of students is asked to perform cartwheels and the sample average is calculated. Inferential statistics then uses this sample to draw a conclusion about how many cartwheels sophomores in general can perform on average.
● Types of Inferential Statistics - Inferential statistics can be classified into hypothesis testing and regression analysis. Hypothesis testing also includes the use of confidence intervals to test the parameters of a population.
○ Hypothesis testing - Hypothesis testing is a type of inferential statistics that is used to test
assumptions and draw conclusions about the population from the available sample data. It
involves setting up a null hypothesis and an alternative hypothesis followed by conducting
a statistical test of significance. A conclusion is drawn based on the value of the test
statistic, the critical value, and the confidence intervals. A hypothesis test can be left-tailed,
right-tailed, and two-tailed. The most common types of hypothesis testing are Z test, F test,
and T test.
○ Regression Analysis - Regression analysis is used to quantify how one variable changes with respect to another variable. There are many types of regression, such as simple linear, multiple linear, nominal, logistic, and ordinal regression. The most commonly used regression in inferential statistics is linear regression, which estimates the effect of a unit change in the independent variable on the dependent variable.

Conclusion : We have successfully explored descriptive and inferential statistics on the given dataset.
import pandas as pd

# load the dataset


data = pd.read_csv("Inc_Exp_Data.csv")

# display the first 5 rows of the dataset


print(data.head())

# display the number of rows and columns in the dataset


print("\n Rows:", data.shape[0])
print("\n Columns:", data.shape[1])

# display the column names


print("\n Column names:", data.columns)

# display the data types of each column


print("\n Data types:", data.dtypes)

# display basic statistics for numerical columns


print("\n Statistics:", data.describe())

# display the number of missing values in each column


print("\n Missing values:", data.isnull().sum())
import pandas as pd
from scipy import stats

# load the dataset


data = pd.read_csv("Inc_Exp_Data.csv")

# Select two columns (or variables) for inferential analysis


x = data['Mthly_HH_Expense']
y = data['Mthly_HH_Income']

# Perform a t-test to determine if there is a significant
# difference between the means of the two variables
t, p = stats.ttest_ind(x, y)

# Print the results of the t-test


print("t = ", t)
print("p = ", p)

# Set a significance level


alpha = 0.05

# Interpret the results


if p > alpha:
    print("The means of the two variables are not significantly different (fail to reject H0)")
else:
    print("The means of the two variables are significantly different (reject H0)")
EXPERIMENT 02

Aim : Apply data cleaning techniques (e.g. Data Imputation).

Theory :

Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and
inaccuracies in data. One common data cleaning technique is data imputation, which is the process of
filling in missing data with estimated values based on other available data.
Here are some steps you can follow to apply data cleaning techniques such as data imputation:

○ Identify missing data: The first step in data imputation is to identify missing data. This can
be done by examining the dataset and looking for cells or fields with missing values.
○ Choose an imputation method: Once you have identified the missing data, you need to
choose an appropriate imputation method. There are several imputation methods available,
including mean imputation, median imputation, mode imputation, and regression
imputation.
○ Perform imputation: Once you have chosen an imputation method, you can perform the
imputation by filling in the missing data with estimated values. For example, if you choose
mean imputation, you would calculate the mean of the available data for that variable and
replace the missing data with that value.
○ Evaluate the results: After imputing the missing data, it is important to evaluate the results
to ensure that the imputed data makes sense and does not introduce bias or errors into the
dataset.
○ Repeat as necessary: If you find that the imputed data is not satisfactory, you may need to
repeat the process with a different imputation method or adjust the parameters of the
imputation method.
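A short sketch of these steps on a numeric column (a hypothetical illustration; the Mthly_HH_Expense column of the dataset used in these experiments is assumed here):

import pandas as pd

df = pd.read_csv("Inc_Exp_Data.csv")

# Step 1: identify missing data
print(df.isnull().sum())

# Steps 2-3: choose a method and impute; mean imputation is shown, with alternatives commented out
df['Mthly_HH_Expense'] = df['Mthly_HH_Expense'].fillna(df['Mthly_HH_Expense'].mean())
# df['Mthly_HH_Expense'] = df['Mthly_HH_Expense'].fillna(df['Mthly_HH_Expense'].median())
# df['Mthly_HH_Expense'] = df['Mthly_HH_Expense'].fillna(df['Mthly_HH_Expense'].mode()[0])

# Step 4: evaluate the result
print(df['Mthly_HH_Expense'].describe())
print(df.isnull().sum())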

In addition to data imputation, there are many other data cleaning techniques that you can use to
improve the quality of your data. Some of these techniques include removing duplicate data,
correcting inconsistencies and errors, and standardizing data formats. The specific techniques you
use will depend on the nature of your data and the goals of your analysis.

Conclusion : We have successfully applied data cleaning techniques like Data Imputation.
import pandas as pd
df = pd.read_csv('Inc_Exp_Data.csv')
df.head()

x = df.iloc[:, 1:5].values
y = df.iloc[:, -1].values

from sklearn.model_selection import train_test_split


xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.20)
xtest

array([[  30000,       6,       0, 1404000],
       [   2000,       1,       0,   97200],
       [   4500,       2,       0,  112800],
       [  10000,       4,       0,  244800],
       [   7000,       2,    3000,   79920],
       [   9000,       2,       0,  218880],
       [  25000,       5,    5000,  351360],
       [  16000,       3,   35000,  167400],
       [  10000,       3,       0,  590400],
       [  10000,       2,    1000,  437400]])

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
xtrain = sc.fit_transform(xtrain)
xtest = sc.transform(xtest)   # apply the scaler fitted on the training data
xtrain

(standardized training features, shape (40, 4); full array output omitted)

ytrain
array([1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 2, 1, 1, 3, 1, 1, 1, 1, 4,
1, 1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1])

to_drop = ['Highest_Qualified_Member']
df.drop(to_drop, inplace=True, axis=1)
# Alternatively:
# df.drop(columns=to_drop, inplace=True)
df.head()

df['No_of_Earning_Members'].is_unique
False

df.set_index('No_of_Fly_Members', inplace = True)


df.head()

df.iloc[1]

df['Mthly_HH_Expense'].head(30)
df.loc[1:, 'Annual_HH_Income'].head(10)

# Note: the remaining lines of this experiment operate on a different (book catalogue) dataset
# that contains 'Date of Publication' and 'Place of Publication' columns; they are kept here to
# illustrate further cleaning steps. 'extr' is assumed to hold the publication year extracted
# from the raw 'Date of Publication' strings, e.g.:
# extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

df['Date of Publication'] = pd.to_numeric(extr)
df['Date of Publication'].dtype
dtype('float64')

df['Date of Publication'].isnull().sum() / len(df)
0.11717147339205986

df['Place of Publication'].head(10)

df.loc[4157862]

pub = df['Place of Publication']
london = pub.str.contains('London')
london[:5]

# standardize the place of publication using nested np.where conditions
import numpy as np
pub = df['Place of Publication']
df['Place of Publication'] = np.where(pub.str.contains('London'), 'London',
                             np.where(pub.str.contains('Oxford'), 'Oxford',
                             np.where(pub.eq('Newcastle upon Tyne'), 'Newcastle-upon-Tyne',
                                      df['Place of Publication'])))

df['Place of Publication']
EXPERIMENT 03

Aim : Explore data visualization techniques.

Theory :

Data visualization is the graphical representation of data to provide a better understanding of complex information. It is an essential part of data analysis and plays a vital role in the decision-making process. In this documentation, we will explore various data visualization techniques that can help you to gain insights from your data.

○ Line Chart

A line chart is a graphical representation of data in which data points are plotted and
connected by lines. Line charts are used to show trends over time and to compare data from
different categories.

○ Bar Chart

A bar chart is a graphical representation of data in which data is presented as bars. Bar
charts are used to compare data from different categories and to show changes in data over
time.

○ Pie Chart

A pie chart is a graphical representation of data in which data is presented as slices of a pie.
Pie charts are used to show the percentage breakdown of data.

○ Scatter Plot

A scatter plot is a graphical representation of data in which data points are plotted as
individual points. Scatter plots are used to show the relationship between two variables.

○ Treemap

A treemap is a graphical representation of data in which data is presented as rectangles of different sizes. Treemaps are used to show the hierarchical structure of data.

○ Histogram
A histogram is a graphical representation of data in which data is presented as a series of
bars. Histograms are used to show the distribution of data.

These are some of the commonly used data visualization techniques that can help you gain insights from
your data. The choice of visualization technique depends on the type of data you have and the insights you
want to gain from it.
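As a small illustrative sketch, two of the chart types above (a bar chart and a scatter plot) can be drawn on the household dataset used in the code below; the column names are taken from that dataset:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Inc_Exp_Data.csv")

# Bar chart: average monthly expense for each family size
df.groupby('No_of_Fly_Members')['Mthly_HH_Expense'].mean().plot(kind='bar')
plt.xlabel('No_of_Fly_Members')
plt.ylabel('Average Mthly_HH_Expense')
plt.show()

# Scatter plot: relationship between monthly income and monthly expense
plt.scatter(df['Mthly_HH_Income'], df['Mthly_HH_Expense'])
plt.xlabel('Mthly_HH_Income')
plt.ylabel('Mthly_HH_Expense')
plt.show()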

Conclusion : We have successfully explored data visualization techniques.


exploratory-data-analysis-on-Income_Expenditure-dataset

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Data Loading
df = pd.read_csv("Inc_Exp_Data.csv")
df.head()

df.shape
(50, 7)

df.info()

Data Cleaning: If the data contains "?", replace it with NaN

df = df.replace('?', np.nan)
df.isnull().sum()

# impute any missing values in the numeric columns with the column mean
df['Mthly_HH_Expense'].fillna(df['Mthly_HH_Expense'].mean(), inplace=True)
df['Mthly_HH_Income'].fillna(df['Mthly_HH_Income'].mean(), inplace=True)
df.isnull().sum()

Summary statistics of variable


df.describe()

plt.figure(figsize=(10,8))
df[['Mthly_HH_Income','Mthly_HH_Expense','No_of_Fly_Members','Emi_or_Rent_Amt','Annual_HH_Income','Highest_Qualified_Member','No_of_Earning_Members']].hist(figsize=(10,10), bins=6, color='Y')
plt.tight_layout()
plt.show()

Findings
• Most people have a monthly income of 30,000-40,000
• The average expense of most people is 15,000-35,000
• The EMI or rent amount is 0-5,000 for most people
• Most people have an annual income of less than 2 lakhs
• Most families have only 1 earning member
plt.figure(1)
plt.subplot(221)
df['Annual_HH_Income'].value_counts(normalize=True).plot(figsize=(10,8), kind='line', color='red')
plt.title("Annual Income frequency diagram")
plt.ylabel('No_of_fly_Members')
plt.xlabel('Annual Income');

plt.subplot(222)
df['Mthly_HH_Income'].value_counts(normalize=True).plot(figsize=(10,8),kind='pie')
plt.title("Monthly income frequency diagram")
plt.xlabel('Mthly_HH_Income')

Findings
• Among 4-member families, the highest annual income is 6 lakhs
• The maximum monthly income in this dataset is 1 lakh
• There are many 2-member families in this dataset with various annual incomes
import seaborn as sns
corr = df.corr()
plt.figure(figsize=(20,9))
a = sns.heatmap(corr,cmap='brg', annot=True, fmt='.2f')

Bivariate Analysis: Emi or Rent Analysis


plt.rcParams['figure.figsize']=(18,9)
ax = sns.boxplot(x="Mthly_HH_Income", y="Mthly_HH_Expense", data=df)
plt.rcParams['figure.figsize']=(19,7)
ax = sns.boxplot(x="Mthly_HH_Income", y="No_of_Fly_Members", data=df)

Positive linear relationship


plt.rcParams['figure.figsize']=(10,5)
ax = sns.boxplot(x="Annual_HH_Income", y="No_of_Fly_Members", data=df)

# EMI or rent amount as a potential predictor of family size


sns.regplot(x="Emi_or_Rent_Amt", y="No_of_Fly_Members", data=df)
plt.ylim(0,)
df[["Mthly_HH_Income", "Annual_HH_Income"]].corr()

sns.regplot(x="Mthly_HH_Income", y="Emi_or_Rent_Amt", data=df)


plt.ylim(0,)

df[["Mthly_HH_Income","Emi_or_Rent_Amt"]].corr()

EMI or Rent Analysis

• The EMI or rent of a given person increases with their income
• The EMI or rent of a given person decreases with their expenses
EXPERIMENT 04

Aim : Implement and explore performance evaluation metrics for Data Models (Supervised/Unsupervised
Learning)

Theory :

Performance evaluation metrics are essential for measuring the effectiveness and efficiency of data models.
These metrics are used to compare different models, select the best model for a specific problem, and
optimize model parameters. In supervised learning, some of the commonly used performance evaluation
metrics include accuracy, precision, recall, F1 score, ROC curve, and AUC score. In unsupervised learning,
some commonly used metrics include clustering accuracy, silhouette score, and Davies-Bouldin index.

Accuracy is the most widely used metric in supervised learning, and it measures the proportion of correct
predictions made by the model. Precision measures the fraction of true positive predictions among all
positive predictions, while recall measures the fraction of true positive predictions among all actual
positives. F1 score is a harmonic mean of precision and recall and is often used when precision and recall
are equally important. ROC curve and AUC score are used to evaluate the performance of binary
classifiers, where ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at
various threshold settings, and AUC score measures the area under the ROC curve.
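As an illustration, these classification metrics can be computed with scikit-learn; the label and probability vectors below are hypothetical:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# hypothetical binary ground truth, predictions, and predicted probabilities of class 1
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.9]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))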

In unsupervised learning, clustering accuracy measures the extent to which the predicted clusters match the
true labels. Silhouette score measures the similarity of data points within a cluster compared to points in
other clusters, and Davies-Bouldin index measures the average similarity between each cluster and its most
similar cluster.

In addition to these commonly used metrics, there are many other performance evaluation metrics available
for data models, depending on the type of problem and the nature of the data. For instance, in regression
problems, mean absolute error (MAE), mean squared error (MSE), and R-squared are commonly used
metrics to evaluate the performance of a model. MAE measures the average magnitude of the errors in the
predictions made by the model, while MSE measures the average squared magnitude of the errors.
R-squared measures the proportion of variance in the dependent variable that is explained by the
independent variables in the model.
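A corresponding sketch for the regression metrics mentioned above, again on hypothetical values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# hypothetical true and predicted values from a regression model
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mean_squared_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R2  :", r2_score(y_true, y_pred))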

In multi-class classification problems, there are several evaluation metrics available, including
macro-averaged precision, macro-averaged recall, macro-averaged F1 score, and micro-averaged F1 score.
Macro-averaged precision, recall, and F1 score calculate the performance of each class separately and then
take an average of these scores, while micro-averaged F1 score treats the multi-class problem as a binary
classification problem and calculates the F1 score based on the true positive, false positive, and false
negative rates of all classes combined.
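A small multi-class sketch (hypothetical labels) contrasting macro- and micro-averaging:

from sklearn.metrics import precision_score, recall_score, f1_score

# hypothetical three-class labels and predictions
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print("Macro precision:", precision_score(y_true, y_pred, average='macro'))
print("Macro recall   :", recall_score(y_true, y_pred, average='macro'))
print("Macro F1       :", f1_score(y_true, y_pred, average='macro'))
print("Micro F1       :", f1_score(y_true, y_pred, average='micro'))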

It is important to note that no single metric can provide a complete picture of the performance of a data model, and different metrics may emphasize different aspects of performance. Therefore, it is recommended to use a combination of different metrics to evaluate the performance of a model thoroughly.

Conclusion : Performance evaluation metrics play a crucial role in selecting the best data model for a particular
problem. Choosing the right metric is important, as it determines the effectiveness and efficiency of the model. A
combination of different metrics is often used to evaluate the performance of a model.
Supervised Learning

import pandas as pd
df = pd.read_csv("Inc_Exp_Data.csv")
df.head()

x=df.iloc[:,1:5].values
y=df.iloc[:,-1].values

from sklearn.model_selection import train_test_split


xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.30, random_state=42)
xtest
array([[  10500,       6,       0,  316800],
       [  10000,       3,       0,  590400],
       [  25000,       6,       0,  523800],
       [  48000,       7,       0,  885600],
       [  10000,       6,       0,  258000],
       [  50000,       4,   20000, 1032000],
       [   8000,       4,       0,  556920],
       [  25000,       4,       0,  449400],
       [  10000,       2,    1000,  437400],
       [  13000,       4,       0,  385200],
       [   5000,       3,       0,  292032],
       [  12000,       2,    3000,  147000],
       [  20000,       3,       0,  581760],
       [   9000,       2,       0,  218880],
       [   2000,       1,       0,   97200]])

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
xtrain = sc.fit_transform(xtrain)
xtest = sc.transform(xtest)   # apply the scaler fitted on the training data
xtrain
(standardized training features, shape (35, 4); full array output omitted)

ytrain
array([1, 2, 2, 3, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1,
2, 1, 1, 1, 1, 1, 1, 2, 1, 2])

from sklearn.linear_model import LogisticRegression


# Train a logistic regression model
lr = LogisticRegression()
lr.fit(xtrain, ytrain)
LogisticRegression()

from sklearn.metrics import classification_report

# predict on the test set and evaluate against the true labels
ypred = lr.predict(xtest)
print(classification_report(ytest, ypred))

Unsupervised Learning

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
df = pd.read_csv('Inc_Exp_Data.csv')
df = df.replace(' ?', pd.NaT)
df = df.dropna()
scaler = StandardScaler()
num_cols = ["Mthly_HH_Income", "Mthly_HH_Expense", "No_of_Fly_Members", "Emi_or_Re
nt_Amt", "Annual_HH_Income","No_of_Earning_Members"]
df[num_cols] = scaler.fit_transform(df[num_cols])
X = df[num_cols]

# Apply K-Means clustering


kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
# Compute the silhouette score
silhouette = silhouette_score(X, kmeans.labels_)
print("Silhouette score:", silhouette)
Silhouette score: 0.40946819295341863

Elbow Method

import matplotlib.pyplot as plt

# Compute the inertia for different numbers of clusters


inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Plot the inertia vs. the number of clusters


plt.plot(range(1, 11), inertia)
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()
EXPERIMENT 05

Aim : Use SMOTE technique to generate synthetic data. (to solve the problem of class imbalance)

Theory :

Class imbalance is a common problem in many machine learning applications. In some datasets, one class may be
significantly underrepresented compared to others, making it difficult for a machine learning model to learn from
the available data. This can lead to biased models that have high accuracy for the majority class and poor accuracy
for the minority class.

One approach to addressing this problem is to use the Synthetic Minority Over-sampling Technique (SMOTE).
SMOTE is a data augmentation technique that generates synthetic samples for the minority class by interpolating
between the existing minority samples.

The SMOTE algorithm works as follows:


● Choose a minority sample at random.
● Select one of its k nearest neighbors at random.
● Generate a new sample by interpolating between the chosen sample and the selected neighbor.
● Repeat steps 1-3 to generate additional synthetic samples.
● The value of k determines the number of neighbors used to interpolate between samples. The SMOTE
algorithm is typically used with k=5, which means that five nearest neighbors are used to generate each
new sample.

SMOTE can be applied to the original dataset to generate synthetic samples for the minority class, effectively
increasing the size of the minority class and reducing the class imbalance in the dataset. This can lead to more
accurate machine learning models that perform well on both the majority and minority classes.
One important consideration when using SMOTE is the choice of the value of k. The value of k determines the
level of interpolation between samples and can have a significant impact on the quality of the generated synthetic
data. In general, a larger value of k will result in more conservative interpolation and may produce higher quality
synthetic data. However, larger values of k may also result in overfitting and reduced generalization performance.

Another consideration when using SMOTE is the potential for introducing synthetic data artifacts or biases. Since
the generated synthetic data is based on the existing minority samples, it may inherit any biases or limitations of
the original data. Additionally, the synthetic data may not accurately capture the full distribution of the minority
class, which can lead to overfitting and reduced generalization performance.

Conclusion :

In conclusion, SMOTE is a powerful technique for addressing the problem of class imbalance in machine learning.
By generating synthetic data for the minority class, SMOTE can help to balance the class distribution and improve
the performance of machine learning models. It is easy to use and can be applied to a wide range of datasets,
making it a valuable tool for data scientists and machine learning practitioners. However, it is important to use
SMOTE carefully and to evaluate the performance of the resulting model to ensure that it is accurate and reliable.
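Note that the code below applies a one-way ANOVA to the dataset columns rather than SMOTE itself. A minimal SMOTE sketch is given first; it assumes the imbalanced-learn package is installed and uses a synthetically generated imbalanced dataset purely for illustration:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# create a hypothetical imbalanced binary dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)
print("Class counts before SMOTE:", Counter(y))

# k_neighbors corresponds to the k discussed above (the default is 5)
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("Class counts after SMOTE :", Counter(y_res))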
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import f_oneway

dataset =pd.read_csv("Inc_Exp_Data.csv")
dataset.head()

A=dataset["Mthly_HH_Income"]
A

B=dataset["Mthly_HH_Expense"]
B
0 8000
1 7000
2 4500
3 2000
4 12000
5 8000
6 16000
7 20000
8 9000

C=dataset["No_of_Fly_Members"]
C
0 3
1 2
2 2
3 1
4 2
5 2
6 3
7 5
8 2
9 4
10 4

D=dataset["Emi_or_Rent_Amt"]
D
0 2000
1 3000
2 0
3 0
4 3000
5 0
6 35000
7 8000
8 0
9 0
10 8000

E=dataset["Annual_HH_Income"]
E
0 64200
1 79920
2 112800
3 97200
4 147000
5 196560
6 167400
7 216000
8 218880
9 220800
10 278400

F=dataset["No_of_Earning_Members"]
F
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 2
10 2

f_oneway(A,B,C,D,E,F)
F_onewayResult(statistic=110.97258263216287, pvalue=1.4660303003601035e-65)

In this case, the F statistic is 110.97258263216287 and the p-value is 1.4660303003601035e-65 (a very small number), indicating that there is a significant difference between the means of the groups being compared. Since the p-value is much smaller than the significance level (usually 0.05), we can reject the null hypothesis and conclude that there is a significant difference between the means of the groups being compared. This means that at least one of the groups has a mean that differs from the others. We cannot determine which group(s) differ from this output alone, but it suggests that further investigation is warranted.
EXPERIMENT 06

Aim : Outlier detection using distance based/density based method

Theory:

Distance-based Outlier Detection:


A distance-based outlier detection method consults the neighbourhood of an object, which is defined by a given radius. An object is considered an outlier if its neighbourhood does not contain enough other points; the distance threshold defines what counts as a reasonable neighbourhood, and for each object o we check whether it has a sufficient number of neighbours within that threshold.
Distance-based methods usually depend on a multi-dimensional index, which is used to retrieve the neighbourhood of each object and check whether it contains sufficient points. If there are insufficient points, the object is flagged as an outlier.
Distance-based methods scale better to multi-dimensional space and can be computed more efficiently than statistical methods. Identifying distance-based outliers is an important and useful data mining activity. Their main disadvantage is that detection relies on a single, user-chosen distance parameter, which can cause significant problems if the dataset contains both dense and sparse regions.
Outlier detection methods can be categorised according to whether the sample of data for analysis is given with
expert-provided labels that can be used to build an outlier detection model. In this case, the detection methods are
supervised, semi-supervised, or unsupervised. Alternatively, outlier detection methods may be organised according
to their assumptions regarding normal objects versus outliers. This categorization includes statistical methods,
proximity-based methods, and clustering-based methods.
Algorithms for mining distance-based outliers:
● Index-based algorithm
● Nested-loop algorithm
● Cell-based algorithm
Density-based Outlier Detection:
A density-based outlier detection method investigates the density of an object and that of its neighbours. An object is identified as an outlier if its density is relatively much lower than that of its neighbours. Many real-world data sets have a complex structure, in which objects may be outliers with respect to their local neighbourhood rather than with respect to the global data distribution.
Density-based methods are used in many applications, including malware detection, behaviour analysis, and network intrusion detection. They have limitations: points flagged as outliers may in fact simply belong to a larger, more spread-out part of the data distribution, and the density measure and its parameters must be defined and clearly understood before the method is applied.
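As an additional density-based illustration (not part of the code below), a sketch using scikit-learn's Local Outlier Factor on the income and expense columns of the dataset might look like this:

import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

df = pd.read_csv('Inc_Exp_Data.csv')
X = df[['Mthly_HH_Income', 'Mthly_HH_Expense']]

# LOF compares the local density of each point with the densities of its neighbours;
# points whose density is much lower than their neighbours' are labelled -1 (outliers)
lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)

print("Number of outliers detected:", (labels == -1).sum())
print(df[labels == -1])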

Conclusion : We have successfully performed outlier detection using distance based/density based method
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
df=pd.read_csv('Inc_Exp_Data.csv')

def remove_sign(x, sign):
    # strip a currency sign and thousands separators, then convert to float
    if type(x) is str:
        x = float(x.replace(sign, '').replace(',', ''))
    return x

df = df[['Mthly_HH_Income', 'Mthly_HH_Expense']]
df = pd.DataFrame(df)
df['Mthly_HH_Expense'] = df.Mthly_HH_Expense.apply(remove_sign, sign='$')
sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income', data=df)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense (Rs.)')

IQR Method
def remove_outlier_IQR(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    df_final = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))]
    return df_final

df_outlier_removed=remove_outlier_IQR(df.Mthly_HH_Expense)
df_outlier_removed=pd.DataFrame(df_outlier_removed)
ind_diff=df.index.difference(df_outlier_removed.index)

for i in range(0, len(ind_diff), 1):
    df_final = df.drop([ind_diff[i]])
    df = df_final

sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income', data=df_final)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense (Rs.)')

len(ind_diff)
3

HAMPEL METHOD
def remove_outlier_Hampel(df):
    med = df.median()
    List = abs(df - med)
    cond = List.median() * 4.5
    good_list = List[~(List > cond)]
    return good_list

df_outlier_removed=remove_outlier_Hampel(df.Mthly_HH_Expense)
df_outlier_removed=pd.DataFrame(df_outlier_removed)
ind_diff=df.index.difference(df_outlier_removed.index)
for i in range(0, len(ind_diff), 1):
    df_final = df.drop([ind_diff[i]])
    df = df_final

sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income',data=df_final)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense(Rs.)')
len(ind_diff)
1

DBSCAN Method

from sklearn.cluster import DBSCAN


def remove_outliers_DBSCAN(df, eps, min_samples):
    outlier_detection = DBSCAN(eps=eps, min_samples=min_samples)
    clusters = outlier_detection.fit_predict(df.values.reshape(-1, 1))
    data = pd.DataFrame()
    data['cluster'] = clusters
    return data['cluster']

clusters=remove_outliers_DBSCAN((df['Mthly_HH_Expense']),1,1)
clusters.value_counts().sort_values(ascending=False)

df_cluster=pd.DataFrame(clusters)
ind_outlier=df_cluster.index[df_cluster['cluster']==-1]
ind_outlier
plt.plot(clusters)
df_cluster=pd.DataFrame(clusters)
ind_outlier=df_cluster.index[df_cluster['cluster']==-1]
ind_outlier

for i in range(0, len(ind_outlier), 1):
    df_final = df.drop([ind_outlier[i]])
    df = df_final

sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income',data=df_final)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense (Rs.)')

len(ind_outlier)
0
EXPERIMENT 07

Aim : Explore time series forecasting on the given dataset.

Theory:

Time series forecasting occurs when you make scientific predictions based on historical time-stamped data. It involves building models through historical analysis and using them to make observations and drive future strategic decision-making. An important distinction in forecasting is that, at the time of the work, the future outcome is completely unavailable and can only be estimated through careful analysis and evidence-based priors. Time series forecasting is the process of analyzing time series data using statistics and modeling to make predictions and inform strategic decision-making. It is not always an exact prediction, and the likelihood of forecasts can vary wildly, especially when dealing with the commonly fluctuating variables in time series data as well as factors outside our control. However, forecasting gives insight into which outcomes are more likely, or less likely, to occur than other potential outcomes. Often, the more comprehensive the data we have, the more accurate the forecasts can be. While forecasting and "prediction" generally mean the same thing, there is a notable distinction: in some industries, forecasting might refer to data at a specific future point in time, while prediction refers to future data in general. Time series forecasting is often used in conjunction with time series analysis. Time series analysis involves developing models to gain an understanding of the data and its underlying causes; it provides the "why" behind the outcomes you are seeing. Forecasting then takes the next step of deciding what to do with that knowledge and making predictable extrapolations of what might happen in the future.

Applications of time series forecasting

Forecasting has a range of applications in various industries. It has many practical applications, including weather forecasting, climate forecasting, economic forecasting, healthcare forecasting, engineering forecasting, finance forecasting, retail forecasting, business forecasting, environmental studies forecasting, social studies forecasting, and more. Basically, anyone who has consistent historical data can analyze it with time series analysis methods and then model, forecast, and predict. For some industries, the entire point of time series analysis is to facilitate forecasting. Some technologies, such as augmented analytics, can even automatically select forecasting from among other statistical algorithms if it offers the most certainty.

When time series forecasting should be used

Naturally, there are limitations when dealing with the unpredictable and the unknown. Time series
forecasting isn’t infallible and isn’t appropriate or useful for all situations. Because there really is no
explicit set of rules for when you should or should not use forecasting, it is up to analysts and data teams to
know the limitations of analysis and what their models can support. Not every model will fit every data set
or answer every question. Data teams should use time series forecasting when they understand the business
question and have the appropriate data and forecasting capabilities to answer that question. Good
forecasting works with clean, time stamped data and can identify the genuine trends and patterns in
historical data. Analysts can tell the difference between random fluctuations or outliers, and can separate
genuine insights from seasonal variations. Time series analysis shows how data changes over time, and
good forecasting can identify the direction in which the data is changing.

Conclusion : We have successfully explored time series forecasting on the given dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

data = pd.read_csv('airline-passengers.csv')
data.columns = ['Month','Passengers']
data['Month'] = pd.to_datetime(data['Month'], format='%Y-%m')
data = data.set_index('Month')
data.head()

data.plot(figsize=(20, 4))
plt.grid()
plt.legend(loc='best')
plt.title('Airline passenger traffic')
plt.show(block=False)

data = data.assign(Passengers_Linear_Interpolation=data.Passengers.interpolate(method='linear'))
data[['Passengers_Linear_Interpolation']].plot(figsize=(20, 4))
plt.grid()
plt.legend(loc='best')
plt.title('Airline passenger traffic: Linear interpolation')
plt.show(block=False)
data['Passengers'] = data['Passengers_Linear_Interpolation']
data.drop(columns=['Passengers_Linear_Interpolation'],inplace=True)
data.head()

import seaborn as sns


fig = plt.subplots(figsize=(20, 5))
ax = sns.boxplot(x=data['Passengers'],whis=1.5)

fig = data.Passengers.hist(figsize = (20,5))

from pylab import rcParams


import statsmodels.api as sm
rcParams['figure.figsize'] = 20,24
decomposition = sm.tsa.seasonal_decompose(data.Passengers, model='additive')  # additive seasonal index
fig = decomposition.plot()
plt.show()
decomposition = sm.tsa.seasonal_decompose(data.Passengers, model='multiplicative')  # multiplicative seasonal index
fig = decomposition.plot()
plt.show()
train_len = 120
train = data[0:train_len] # first 120 months as training set
test = data[train_len:] # last 24 months as out-of-time test set
y_hat_sma = data.copy()
ma_window = 12

y_hat_sma['sma_forecast'] = data['Passengers'].rolling(ma_window).mean()
y_hat_sma['sma_forecast'][train_len:] = y_hat_sma['sma_forecast'][train_len-1]

plt.figure(figsize=(20,5))
plt.grid()
plt.plot(train['Passengers'], label='Train')
plt.plot(test['Passengers'], label='Test')
plt.plot(y_hat_sma['sma_forecast'], label='Simple moving average forecast')
plt.legend(loc='best')
plt.title('Simple Moving Average Method')
plt.show()

from sklearn.metrics import mean_squared_error


rmse = np.sqrt(mean_squared_error(test['Passengers'], y_hat_sma['sma_forecast'][train_len:])).round(2)
mape = np.round(np.mean(np.abs(test['Passengers'] - y_hat_sma['sma_forecast'][train_len:]) / test['Passengers']) * 100, 2)

results = pd.DataFrame({'Method': ['Simple moving average forecast'], 'MAPE': [mape], 'RMSE': [rmse]})
results = results[['Method', 'RMSE', 'MAPE']]
results

ARIMA

data = pd.read_csv("airline-passenger-traffic(1).csv", header=None)


data.head()

data.columns = ['Month', 'Passengers']


data.Month = pd.to_datetime(data.Month, format='%Y-%m')
data.Passengers = data.Passengers.astype("float64")
data = data.set_index('Month')
data.head()

data.plot(figsize=(14,6))
plt.title('Airline Passenger Traffic Data')
plt.show(block=False)

data['Passengers_Mean_Imputation'] = data.Passengers.fillna(data.Passengers.mean())
plt.figure(figsize=(16,4))
plt.plot(data.Passengers_Mean_Imputation, label='Passengers_Mean_Imputation')
plt.plot(data.Passengers, label='Passengers')
plt.legend(loc='best')
plt.title('Missing Value Treatment: Mean Imputation')
plt.show(block=False)

data.head()

data["Passengers"]=data["Passengers_Mean_Imputation"]
data.drop(columns=['Passengers_Mean_Imputation'],inplace=True)
data.head()

from pylab import rcParams


import statsmodels.api as sm
rcParams['figure.figsize'] = (14,8)
decomposition = sm.tsa.seasonal_decompose(data.Passengers, model='additive')
fig = decomposition.plot()
plt.show()
from statsmodels.tsa.arima.model import ARIMA
from scipy.stats import boxcox
data_boxcox = pd.Series(boxcox(data.Passengers, lmbda=0), index=data.index)
train_data_boxcox = data_boxcox[:130]
model = ARIMA(train_data_boxcox, order=(1, 1, 1))
model_fit = model.fit()
print(model_fit.params)

sigma2 represents the variance of the residual values. ar.L1 refers to the autoregressive term with a lag of 1, and ma.L1 refers to the moving average term with a lag of 1; a higher-order model would additionally report ar.L2, ma.L2, and so on.
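A short follow-up sketch (a continuation of the script above; recall that the Box-Cox transform with lmbda=0 is a log transform) that forecasts the held-out period and converts the result back to the original scale:

import numpy as np

# forecast the observations after the training window, still on the Box-Cox (log) scale
n_test = len(data_boxcox) - len(train_data_boxcox)
forecast_boxcox = model_fit.forecast(steps=n_test)

# invert the log transform to return to the original passenger scale
forecast = np.exp(forecast_boxcox)
print(forecast.head())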
