ADS EXP Assignments

The document describes seven experiments performed on a household income and expenditure dataset: exploring descriptive and inferential statistics, applying data cleaning techniques such as data imputation, exploring data visualization techniques, implementing performance evaluation metrics for supervised and unsupervised models, generating synthetic data with SMOTE to address class imbalance, detecting outliers with distance-based and density-based methods, and exploring time series forecasting on an airline passenger dataset.


EXPERIMENT 01

Aim : Explore descriptive and inferential statistics on the given dataset

Theory :

● Descriptive Statistics - Descriptive statistics describe, show, and summarise the basic features of a dataset
found in a given study, presented in a summary that describes the data sample and its measurements. It
helps analysts to understand the data better.
Example : A good example of descriptive statistics is a student's grade point average (GPA). A GPA gathers the data points created across many grades, classes, and exams, averages them, and presents a general idea of the student's mean academic performance. Note that the GPA doesn't predict future performance or draw any conclusions. Instead, it provides a straightforward summary of the student's academic record based on values pulled from the data.
● Types of Descriptive Statistics - All descriptive statistics are either measures of central tendency or
measures of variability, also known as measures of dispersion.
● Inferential Statistics - Inferential statistics are often used to compare the differences between the treatment
groups. Inferential statistics use measurements from the sample of subjects in the experiment to compare
the treatment groups and make generalizations about the larger population of subjects.
Example : A coach wants to find out how many cartwheels, on average, sophomores at his college can do without stopping. A sample of students is asked to perform cartwheels and the sample average is calculated. Inferential statistics then uses this sample to draw a conclusion about how many cartwheels sophomores in general can perform on average.
● Types of Inferential Statistics - Inferential statistics can be classified into hypothesis testing and regression analysis. Hypothesis testing also includes the use of confidence intervals to test the parameters of a population.
○ Hypothesis testing - Hypothesis testing is a type of inferential statistics that is used to test
assumptions and draw conclusions about the population from the available sample data. It
involves setting up a null hypothesis and an alternative hypothesis followed by conducting
a statistical test of significance. A conclusion is drawn based on the value of the test
statistic, the critical value, and the confidence intervals. A hypothesis test can be left-tailed,
right-tailed, and two-tailed. The most common types of hypothesis testing are Z test, F test,
and T test.
○ Regression Analysis - Regression analysis is used to quantify how one variable changes with respect to another variable. There are many types of regression, such as simple linear, multiple linear, nominal, logistic, and ordinal regression. The most commonly used regression in inferential statistics is linear regression, which estimates the effect of a unit change in the independent variable on the dependent variable.

Conclusion : We have successfully explored descriptive and inferential statistics on the given dataset.
import pandas as pd

# load the dataset


data = pd.read_csv("Inc_Exp_Data.csv")

# display the first 5 rows of the dataset


print(data.head())

# display the number of rows and columns in the dataset


print("\n Rows:", data.shape[0])
print("\n Columns:", data.shape[1])

# display the column names


print("\n Column names:", data.columns)

# display the data types of each column


print("\n Data types:", data.dtypes)

# display basic statistics for numerical columns


print("\n Statistics:", data.describe())

# display the number of missing values in each column


print("\n Missing values:", data.isnull().sum())
import pandas as pd
from scipy import stats

# load the dataset


data = pd.read_csv("Inc_Exp_Data.csv")

# Select two columns (or variables) for inferential analysis


x = data['Mthly_HH_Expense']
y = data['Mthly_HH_Income']

# Perform a t-test to determine if there is a significant
# difference between the means of the two variables
t, p = stats.ttest_ind(x, y)

# Print the results of the t-test


print("t = ", t)
print("p = ", p)

# Set a significance level


alpha = 0.05

# Interpret the results


if p > alpha:
    print("The means of the two variables are not significantly different (fail to reject H0)")
else:
    print("The means of the two variables are significantly different (reject H0)")
EXPERIMENT 02

Aim : Apply data cleaning techniques (e.g. Data Imputation).

Theory :

Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and
inaccuracies in data. One common data cleaning technique is data imputation, which is the process of
filling in missing data with estimated values based on other available data.
Here are some steps you can follow to apply data cleaning techniques such as data imputation:

○ Identify missing data: The first step in data imputation is to identify missing data. This can
be done by examining the dataset and looking for cells or fields with missing values.
○ Choose an imputation method: Once you have identified the missing data, you need to
choose an appropriate imputation method. There are several imputation methods available,
including mean imputation, median imputation, mode imputation, and regression
imputation.
○ Perform imputation: Once you have chosen an imputation method, you can perform the
imputation by filling in the missing data with estimated values. For example, if you choose
mean imputation, you would calculate the mean of the available data for that variable and
replace the missing data with that value.
○ Evaluate the results: After imputing the missing data, it is important to evaluate the results
to ensure that the imputed data makes sense and does not introduce bias or errors into the
dataset.
○ Repeat as necessary: If you find that the imputed data is not satisfactory, you may need to
repeat the process with a different imputation method or adjust the parameters of the
imputation method.
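A short sketch of these steps on a numeric column (a hypothetical illustration; the Mthly_HH_Expense column of the dataset used in these experiments is assumed here):

import pandas as pd

df = pd.read_csv("Inc_Exp_Data.csv")

# Step 1: identify missing data
print(df.isnull().sum())

# Steps 2-3: choose a method and impute; mean imputation is shown, with alternatives commented out
df['Mthly_HH_Expense'] = df['Mthly_HH_Expense'].fillna(df['Mthly_HH_Expense'].mean())
# df['Mthly_HH_Expense'] = df['Mthly_HH_Expense'].fillna(df['Mthly_HH_Expense'].median())
# df['Mthly_HH_Expense'] = df['Mthly_HH_Expense'].fillna(df['Mthly_HH_Expense'].mode()[0])

# Step 4: evaluate the result
print(df['Mthly_HH_Expense'].describe())
print(df.isnull().sum())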

In addition to data imputation, there are many other data cleaning techniques that you can use to
improve the quality of your data. Some of these techniques include removing duplicate data,
correcting inconsistencies and errors, and standardizing data formats. The specific techniques you
use will depend on the nature of your data and the goals of your analysis.

Conclusion : We have successfully applied data cleaning techniques like Data Imputation.
import pandas as pd
df = pd.read_csv('Inc_Exp_Data.csv')
df.head()

x = df.iloc[:, 1:5].values
y = df.iloc[:, -1].values

from sklearn.model_selection import train_test_split


xtrain,xtest,ytrain,ytest = train_test_split(x,y,test_size=0.20)
xtest

array([[  30000,       6,       0, 1404000],
       [   2000,       1,       0,   97200],
       [   4500,       2,       0,  112800],
       [  10000,       4,       0,  244800],
       [   7000,       2,    3000,   79920],
       [   9000,       2,       0,  218880],
       [  25000,       5,    5000,  351360],
       [  16000,       3,   35000,  167400],
       [  10000,       3,       0,  590400],
       [  10000,       2,    1000,  437400]])

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
xtrain = sc.fit_transform(xtrain)
xtest = sc.transform(xtest)   # apply the scaler fitted on the training data
xtrain

(standardized training features, shape (40, 4); full array output omitted)

ytrain
array([1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 2, 1, 1, 3, 1, 1, 1, 1, 4,
1, 1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1])

to_drop = ['Highest_Qualified_Member']
df.drop(to_drop, inplace=True, axis=1)
# Alternatively:
# df.drop(columns=to_drop, inplace=True)
df.head()

df['No_of_Earning_Members'].is_unique
False

df.set_index('No_of_Fly_Members', inplace = True)


df.head()

df.iloc[1]

df['Mthly_HH_Expense'].head(30)
df.loc[1:, 'Annual_HH_Income'].head(10)

# Note: the remaining lines of this experiment operate on a different (book catalogue) dataset
# that contains 'Date of Publication' and 'Place of Publication' columns; they are kept here to
# illustrate further cleaning steps. 'extr' is assumed to hold the publication year extracted
# from the raw 'Date of Publication' strings, e.g.:
# extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

df['Date of Publication'] = pd.to_numeric(extr)
df['Date of Publication'].dtype
dtype('float64')

df['Date of Publication'].isnull().sum() / len(df)
0.11717147339205986

df['Place of Publication'].head(10)

df.loc[4157862]

pub = df['Place of Publication']
london = pub.str.contains('London')
london[:5]

# standardize the place of publication using nested np.where conditions
import numpy as np
pub = df['Place of Publication']
df['Place of Publication'] = np.where(pub.str.contains('London'), 'London',
                             np.where(pub.str.contains('Oxford'), 'Oxford',
                             np.where(pub.eq('Newcastle upon Tyne'), 'Newcastle-upon-Tyne',
                                      df['Place of Publication'])))

df['Place of Publication']
EXPERIMENT 03

Aim : Explore data visualization techniques.

Theory :

Data visualization is the graphical representation of data to provide a better understanding of complex information. It is an essential part of data analysis and plays a vital role in the decision-making process. In this documentation, we will explore various data visualization techniques that can help you to gain insights from your data.

○ Line Chart

A line chart is a graphical representation of data in which data points are plotted and
connected by lines. Line charts are used to show trends over time and to compare data from
different categories.

○ Bar Chart

A bar chart is a graphical representation of data in which data is presented as bars. Bar
charts are used to compare data from different categories and to show changes in data over
time.

○ Pie Chart

A pie chart is a graphical representation of data in which data is presented as slices of a pie.
Pie charts are used to show the percentage breakdown of data.

○ Scatter Plot

A scatter plot is a graphical representation of data in which data points are plotted as
individual points. Scatter plots are used to show the relationship between two variables.

○ Treemap

A treemap is a graphical representation of data in which data is presented as rectangles of different sizes. Treemaps are used to show the hierarchical structure of data.

○ Histogram
A histogram is a graphical representation of data in which data is presented as a series of
bars. Histograms are used to show the distribution of data.

These are some of the commonly used data visualization techniques that can help you gain insights from
your data. The choice of visualization technique depends on the type of data you have and the insights you
want to gain from it.
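As a small illustrative sketch, two of the chart types above (a bar chart and a scatter plot) can be drawn on the household dataset used in the code below; the column names are taken from that dataset:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Inc_Exp_Data.csv")

# Bar chart: average monthly expense for each family size
df.groupby('No_of_Fly_Members')['Mthly_HH_Expense'].mean().plot(kind='bar')
plt.xlabel('No_of_Fly_Members')
plt.ylabel('Average Mthly_HH_Expense')
plt.show()

# Scatter plot: relationship between monthly income and monthly expense
plt.scatter(df['Mthly_HH_Income'], df['Mthly_HH_Expense'])
plt.xlabel('Mthly_HH_Income')
plt.ylabel('Mthly_HH_Expense')
plt.show()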

Conclusion : We have successfully explored data visualization techniques.


exploratory-data-analysis-on-Income_Expenditure-dataset

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Data Loading
df = pd.read_csv("Inc_Exp_Data.csv")
df.head()

df.shape
(50, 7)

df.info()

Data Cleaning: If the data contains "?", replace it with NaN

df = df.replace('?', np.nan)
df.isnull().sum()

# impute any missing values in the numeric columns with the column mean
df['Mthly_HH_Expense'].fillna(df['Mthly_HH_Expense'].mean(), inplace=True)
df['Mthly_HH_Income'].fillna(df['Mthly_HH_Income'].mean(), inplace=True)
df.isnull().sum()

Summary statistics of variable


df.describe()

plt.figure(figsize=(10,8))
df[['Mthly_HH_Income','Mthly_HH_Expense','No_of_Fly_Members','Emi_or_Rent_Amt','Annual_HH_Income','Highest_Qualified_Member','No_of_Earning_Members']].hist(figsize=(10,10), bins=6, color='Y')
plt.tight_layout()
plt.show()

Findings
• Most people have a monthly income of 30,000-40,000
• The average expense of most people is 15,000-35,000
• The EMI or rent amount is 0-5,000 for most people
• Most people have an annual income of less than 2 lakhs
• Most families have only 1 earning member
plt.figure(1)
plt.subplot(221)
df['Annual_HH_Income'].value_counts(normalize=True).plot(figsize=(10,8), kind='line', color='red')
plt.title("Annual Income frequency diagram")
plt.ylabel('No_of_fly_Members')
plt.xlabel('Annual Income');

plt.subplot(222)
df['Mthly_HH_Income'].value_counts(normalize=True).plot(figsize=(10,8),kind='pie')
plt.title("Monthly income frequency diagram")
plt.xlabel('Mthly_HH_Income')

Findings
• Among 4-member families, the highest annual income is 6 lakhs
• The maximum monthly income in this dataset is 1 lakh
• There are many 2-member families in this dataset with various annual incomes
import seaborn as sns
corr = df.corr()
plt.figure(figsize=(20,9))
a = sns.heatmap(corr,cmap='brg', annot=True, fmt='.2f')

Bivariate Analysis: Emi or Rent Analysis


plt.rcParams['figure.figsize']=(18,9)
ax = sns.boxplot(x="Mthly_HH_Income", y="Mthly_HH_Expense", data=df)
plt.rcParams['figure.figsize']=(19,7)
ax = sns.boxplot(x="Mthly_HH_Income", y="No_of_Fly_Members", data=df)

Positive linear relationship


plt.rcParams['figure.figsize']=(10,5)
ax = sns.boxplot(x="Annual_HH_Income", y="No_of_Fly_Members", data=df)

# EMI or rent amount as a potential predictor of family size


sns.regplot(x="Emi_or_Rent_Amt", y="No_of_Fly_Members", data=df)
plt.ylim(0,)
df[["Mthly_HH_Income", "Annual_HH_Income"]].corr()

sns.regplot(x="Mthly_HH_Income", y="Emi_or_Rent_Amt", data=df)


plt.ylim(0,)

df[["Mthly_HH_Income","Emi_or_Rent_Amt"]].corr()

EMI or Rent Analysis

• The EMI or rent of a given person increases with their income
• The EMI or rent of a given person decreases with their expenses
EXPERIMENT 04

Aim : Implement and explore performance evaluation metrics for Data Models (Supervised/Unsupervised
Learning)

Theory :

Performance evaluation metrics are essential for measuring the effectiveness and efficiency of data models.
These metrics are used to compare different models, select the best model for a specific problem, and
optimize model parameters. In supervised learning, some of the commonly used performance evaluation
metrics include accuracy, precision, recall, F1 score, ROC curve, and AUC score. In unsupervised learning,
some commonly used metrics include clustering accuracy, silhouette score, and Davies-Bouldin index.

Accuracy is the most widely used metric in supervised learning, and it measures the proportion of correct
predictions made by the model. Precision measures the fraction of true positive predictions among all
positive predictions, while recall measures the fraction of true positive predictions among all actual
positives. F1 score is a harmonic mean of precision and recall and is often used when precision and recall
are equally important. ROC curve and AUC score are used to evaluate the performance of binary
classifiers, where ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at
various threshold settings, and AUC score measures the area under the ROC curve.
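As an illustration, these classification metrics can be computed with scikit-learn; the label and probability vectors below are hypothetical:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# hypothetical binary ground truth, predictions, and predicted probabilities of class 1
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.9]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_prob))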

In unsupervised learning, clustering accuracy measures the extent to which the predicted clusters match the
true labels. Silhouette score measures the similarity of data points within a cluster compared to points in
other clusters, and Davies-Bouldin index measures the average similarity between each cluster and its most
similar cluster.

In addition to these commonly used metrics, there are many other performance evaluation metrics available
for data models, depending on the type of problem and the nature of the data. For instance, in regression
problems, mean absolute error (MAE), mean squared error (MSE), and R-squared are commonly used
metrics to evaluate the performance of a model. MAE measures the average magnitude of the errors in the
predictions made by the model, while MSE measures the average squared magnitude of the errors.
R-squared measures the proportion of variance in the dependent variable that is explained by the
independent variables in the model.
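A corresponding sketch for the regression metrics mentioned above, again on hypothetical values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# hypothetical true and predicted values from a regression model
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mean_squared_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R2  :", r2_score(y_true, y_pred))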

In multi-class classification problems, there are several evaluation metrics available, including
macro-averaged precision, macro-averaged recall, macro-averaged F1 score, and micro-averaged F1 score.
Macro-averaged precision, recall, and F1 score calculate the performance of each class separately and then
take an average of these scores, while micro-averaged F1 score treats the multi-class problem as a binary
classification problem and calculates the F1 score based on the true positive, false positive, and false
negative rates of all classes combined.
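A small multi-class sketch (hypothetical labels) contrasting macro- and micro-averaging:

from sklearn.metrics import precision_score, recall_score, f1_score

# hypothetical three-class labels and predictions
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print("Macro precision:", precision_score(y_true, y_pred, average='macro'))
print("Macro recall   :", recall_score(y_true, y_pred, average='macro'))
print("Macro F1       :", f1_score(y_true, y_pred, average='macro'))
print("Micro F1       :", f1_score(y_true, y_pred, average='micro'))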

It is important to note that no single metric can provide a complete picture of the performance of a data model, and different metrics may emphasize different aspects of performance. Therefore, it is recommended to use a combination of different metrics to evaluate the performance of a model thoroughly.

Conclusion : Performance evaluation metrics play a crucial role in selecting the best data model for a particular
problem. Choosing the right metric is important, as it determines the effectiveness and efficiency of the model. A
combination of different metrics is often used to evaluate the performance of a model.
Supervised Learning

import pandas as pd
df = pd.read_csv("Inc_Exp_Data.csv")
df.head()

x=df.iloc[:,1:5].values
y=df.iloc[:,-1].values

from sklearn.model_selection import train_test_split


xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.30, random_state=42)
xtest
array([[  10500,       6,       0,  316800],
       [  10000,       3,       0,  590400],
       [  25000,       6,       0,  523800],
       [  48000,       7,       0,  885600],
       [  10000,       6,       0,  258000],
       [  50000,       4,   20000, 1032000],
       [   8000,       4,       0,  556920],
       [  25000,       4,       0,  449400],
       [  10000,       2,    1000,  437400],
       [  13000,       4,       0,  385200],
       [   5000,       3,       0,  292032],
       [  12000,       2,    3000,  147000],
       [  20000,       3,       0,  581760],
       [   9000,       2,       0,  218880],
       [   2000,       1,       0,   97200]])

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
xtrain = sc.fit_transform(xtrain)
xtest = sc.transform(xtest)   # apply the scaler fitted on the training data
xtrain
(standardized training features, shape (35, 4); full array output omitted)

ytrain
array([1, 2, 2, 3, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1,
2, 1, 1, 1, 1, 1, 1, 2, 1, 2])

from sklearn.linear_model import LogisticRegression


# Train a logistic regression model
lr = LogisticRegression()
lr.fit(xtrain, ytrain)
LogisticRegression()

from sklearn.metrics import classification_report

# predict on the test set and evaluate against the true labels
ypred = lr.predict(xtest)
print(classification_report(ytest, ypred))

Unsupervised Learning

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
df = pd.read_csv('Inc_Exp_Data.csv')
df = df.replace(' ?', pd.NaT)
df = df.dropna()
scaler = StandardScaler()
num_cols = ["Mthly_HH_Income", "Mthly_HH_Expense", "No_of_Fly_Members", "Emi_or_Re
nt_Amt", "Annual_HH_Income","No_of_Earning_Members"]
df[num_cols] = scaler.fit_transform(df[num_cols])
X = df[num_cols]

# Apply K-Means clustering


kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)
# Compute the silhouette score
silhouette = silhouette_score(X, kmeans.labels_)
print("Silhouette score:", silhouette)
Silhouette score: 0.40946819295341863

Elbow Method

import matplotlib.pyplot as plt

# Compute the inertia for different numbers of clusters


inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

# Plot the inertia vs. the number of clusters


plt.plot(range(1, 11), inertia)
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()
EXPERIMENT 05

Aim : Use SMOTE technique to generate synthetic data. (to solve the problem of class imbalance)

Theory :

Class imbalance is a common problem in many machine learning applications. In some datasets, one class may be
significantly underrepresented compared to others, making it difficult for a machine learning model to learn from
the available data. This can lead to biased models that have high accuracy for the majority class and poor accuracy
for the minority class.

One approach to addressing this problem is to use the Synthetic Minority Over-sampling Technique (SMOTE).
SMOTE is a data augmentation technique that generates synthetic samples for the minority class by interpolating
between the existing minority samples.

The SMOTE algorithm works as follows:


● Choose a minority sample at random.
● Select one of its k nearest neighbors at random.
● Generate a new sample by interpolating between the chosen sample and the selected neighbor.
● Repeat steps 1-3 to generate additional synthetic samples.
● The value of k determines the number of neighbors used to interpolate between samples. The SMOTE
algorithm is typically used with k=5, which means that five nearest neighbors are used to generate each
new sample.

SMOTE can be applied to the original dataset to generate synthetic samples for the minority class, effectively
increasing the size of the minority class and reducing the class imbalance in the dataset. This can lead to more
accurate machine learning models that perform well on both the majority and minority classes.
One important consideration when using SMOTE is the choice of the value of k. The value of k determines the
level of interpolation between samples and can have a significant impact on the quality of the generated synthetic
data. In general, a larger value of k will result in more conservative interpolation and may produce higher quality
synthetic data. However, larger values of k may also result in overfitting and reduced generalization performance.

Another consideration when using SMOTE is the potential for introducing synthetic data artifacts or biases. Since
the generated synthetic data is based on the existing minority samples, it may inherit any biases or limitations of
the original data. Additionally, the synthetic data may not accurately capture the full distribution of the minority
class, which can lead to overfitting and reduced generalization performance.

Conclusion :

In conclusion, SMOTE is a powerful technique for addressing the problem of class imbalance in machine learning.
By generating synthetic data for the minority class, SMOTE can help to balance the class distribution and improve
the performance of machine learning models. It is easy to use and can be applied to a wide range of datasets,
making it a valuable tool for data scientists and machine learning practitioners. However, it is important to use
SMOTE carefully and to evaluate the performance of the resulting model to ensure that it is accurate and reliable.
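Note that the code below applies a one-way ANOVA to the dataset columns rather than SMOTE itself. A minimal SMOTE sketch is given first; it assumes the imbalanced-learn package is installed and uses a synthetically generated imbalanced dataset purely for illustration:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# create a hypothetical imbalanced binary dataset (roughly 90% / 10%)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=42)
print("Class counts before SMOTE:", Counter(y))

# k_neighbors corresponds to the k discussed above (the default is 5)
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("Class counts after SMOTE :", Counter(y_res))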
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import f_oneway

dataset =pd.read_csv("Inc_Exp_Data.csv")
dataset.head()

A=dataset["Mthly_HH_Income"]
A

B=dataset["Mthly_HH_Expense"]
B
0 8000
1 7000
2 4500
3 2000
4 12000
5 8000
6 16000
7 20000
8 9000

C=dataset["No_of_Fly_Members"]
C
0 3
1 2
2 2
3 1
4 2
5 2
6 3
7 5
8 2
9 4
10 4

D=dataset["Emi_or_Rent_Amt"]
D
0 2000
1 3000
2 0
3 0
4 3000
5 0
6 35000
7 8000
8 0
9 0
10 8000

E=dataset["Annual_HH_Income"]
E
0 64200
1 79920
2 112800
3 97200
4 147000
5 196560
6 167400
7 216000
8 218880
9 220800
10 278400

F=dataset["No_of_Earning_Members"]
F
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 2
10 2

f_oneway(A,B,C,D,E,F)
F_onewayResult(statistic=110.97258263216287, pvalue=1.4660303003601035e-65)

In this case, the F statistic is 110.97258263216287 and the p-value is 1.4660303003601035e-65 (a very small number), indicating that there is a significant difference between the means of the groups being compared. Since the p-value is much smaller than the significance level (usually 0.05), we can reject the null hypothesis and conclude that there is a significant difference between the means of the groups being compared. This means that at least one of the groups has a mean that differs from the others. We cannot determine which group(s) differ from this output alone, but it suggests that further investigation is warranted.
EXPERIMENT 06

Aim : Outlier detection using distance based/density based method

Theory:

Distance-based Outlier Detection:


A distance-based outlier detection method consults the neighbourhood of an object, which is defined by a given radius. An object is considered an outlier if its neighbourhood does not contain enough other points; the distance threshold defines what counts as a reasonable neighbourhood, and for each object o we check whether it has a sufficient number of neighbours within that threshold.
Distance-based methods usually depend on a multi-dimensional index, which is used to retrieve the neighbourhood of each object and check whether it contains sufficient points. If there are insufficient points, the object is flagged as an outlier.
Distance-based methods scale better to multi-dimensional space and can be computed more efficiently than statistical methods. Identifying distance-based outliers is an important and useful data mining activity. Their main disadvantage is that detection relies on a single, user-chosen distance parameter, which can cause significant problems if the dataset contains both dense and sparse regions.
Outlier detection methods can be categorised according to whether the sample of data for analysis is given with
expert-provided labels that can be used to build an outlier detection model. In this case, the detection methods are
supervised, semi-supervised, or unsupervised. Alternatively, outlier detection methods may be organised according
to their assumptions regarding normal objects versus outliers. This categorization includes statistical methods,
proximity-based methods, and clustering-based methods.
Algorithms for mining distance-based outliers:
● Index-based algorithm
● Nested-loop algorithm
● Cell-based algorithm
Density-based Outlier Detection:
A density-based outlier detection method investigates the density of an object and that of its neighbours. An object is identified as an outlier if its density is relatively much lower than that of its neighbours. Many real-world data sets have a complex structure, in which objects may be outliers with respect to their local neighbourhood rather than with respect to the global data distribution.
Density-based methods are used in many applications, including malware detection, behaviour analysis, and network intrusion detection. They have limitations: points flagged as outliers may in fact simply belong to a larger, more spread-out part of the data distribution, and the density measure and its parameters must be defined and clearly understood before the method is applied.
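As an additional density-based illustration (not part of the code below), a sketch using scikit-learn's Local Outlier Factor on the income and expense columns of the dataset might look like this:

import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

df = pd.read_csv('Inc_Exp_Data.csv')
X = df[['Mthly_HH_Income', 'Mthly_HH_Expense']]

# LOF compares the local density of each point with the densities of its neighbours;
# points whose density is much lower than their neighbours' are labelled -1 (outliers)
lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)

print("Number of outliers detected:", (labels == -1).sum())
print(df[labels == -1])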

Conclusion : We have successfully performed outlier detection using distance based/density based method
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
df=pd.read_csv('Inc_Exp_Data.csv')

def remove_sign(x, sign):
    # strip a currency sign and thousands separators, then convert to float
    if type(x) is str:
        x = float(x.replace(sign, '').replace(',', ''))
    return x

df = df[['Mthly_HH_Income', 'Mthly_HH_Expense']]
df = pd.DataFrame(df)
df['Mthly_HH_Expense'] = df.Mthly_HH_Expense.apply(remove_sign, sign='$')
sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income', data=df)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense (Rs.)')

IQR Method
def remove_outlier_IQR(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    df_final = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))]
    return df_final

df_outlier_removed=remove_outlier_IQR(df.Mthly_HH_Expense)
df_outlier_removed=pd.DataFrame(df_outlier_removed)
ind_diff=df.index.difference(df_outlier_removed.index)

for i in range(0, len(ind_diff), 1):
    df_final = df.drop([ind_diff[i]])
    df = df_final

sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income', data=df_final)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense (Rs.)')

len(ind_diff)
3

HAMPEL METHOD
def remove_outlier_Hampel(df):
    med = df.median()
    List = abs(df - med)
    cond = List.median() * 4.5
    good_list = List[~(List > cond)]
    return good_list

df_outlier_removed=remove_outlier_Hampel(df.Mthly_HH_Expense)
df_outlier_removed=pd.DataFrame(df_outlier_removed)
ind_diff=df.index.difference(df_outlier_removed.index)
for i in range(0, len(ind_diff), 1):
    df_final = df.drop([ind_diff[i]])
    df = df_final

sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income',data=df_final)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense(Rs.)')
len(ind_diff)
1

DBSCAN Method

from sklearn.cluster import DBSCAN


def remove_outliers_DBSCAN(df, eps, min_samples):
    outlier_detection = DBSCAN(eps=eps, min_samples=min_samples)
    clusters = outlier_detection.fit_predict(df.values.reshape(-1, 1))
    data = pd.DataFrame()
    data['cluster'] = clusters
    return data['cluster']

clusters=remove_outliers_DBSCAN((df['Mthly_HH_Expense']),1,1)
clusters.value_counts().sort_values(ascending=False)

df_cluster=pd.DataFrame(clusters)
ind_outlier=df_cluster.index[df_cluster['cluster']==-1]
ind_outlier
plt.plot(clusters)
df_cluster=pd.DataFrame(clusters)
ind_outlier=df_cluster.index[df_cluster['cluster']==-1]
ind_outlier

for i in range(0, len(ind_outlier), 1):
    df_final = df.drop([ind_outlier[i]])
    df = df_final

sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income',data=df_final)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense (Rs.)')

len(ind_outlier)
0
EXPERIMENT 07

Aim : Explore time series forecasting on the given dataset.

Theory:

Time series forecasting occurs when you make scientific predictions based on historical time-stamped data. It involves building models through historical analysis and using them to make observations and drive future strategic decision-making. An important distinction in forecasting is that, at the time of the work, the future outcome is completely unavailable and can only be estimated through careful analysis and evidence-based priors. Time series forecasting is the process of analyzing time series data using statistics and modeling to make predictions and inform strategic decision-making. It is not always an exact prediction, and the likelihood of forecasts can vary wildly, especially when dealing with the commonly fluctuating variables in time series data as well as factors outside our control. However, forecasting gives insight into which outcomes are more likely, or less likely, to occur than other potential outcomes. Often, the more comprehensive the data we have, the more accurate the forecasts can be. While forecasting and "prediction" generally mean the same thing, there is a notable distinction: in some industries, forecasting might refer to data at a specific future point in time, while prediction refers to future data in general. Time series forecasting is often used in conjunction with time series analysis. Time series analysis involves developing models to gain an understanding of the data and its underlying causes; it provides the "why" behind the outcomes you are seeing. Forecasting then takes the next step of deciding what to do with that knowledge and making predictable extrapolations of what might happen in the future.

Applications of time series forecasting

Forecasting has a range of applications in various industries. It has many practical applications, including weather forecasting, climate forecasting, economic forecasting, healthcare forecasting, engineering forecasting, finance forecasting, retail forecasting, business forecasting, environmental studies forecasting, social studies forecasting, and more. Basically, anyone who has consistent historical data can analyze it with time series analysis methods and then model, forecast, and predict. For some industries, the entire point of time series analysis is to facilitate forecasting. Some technologies, such as augmented analytics, can even automatically select forecasting from among other statistical algorithms if it offers the most certainty.

When time series forecasting should be used

Naturally, there are limitations when dealing with the unpredictable and the unknown. Time series
forecasting isn’t infallible and isn’t appropriate or useful for all situations. Because there really is no
explicit set of rules for when you should or should not use forecasting, it is up to analysts and data teams to
know the limitations of analysis and what their models can support. Not every model will fit every data set
or answer every question. Data teams should use time series forecasting when they understand the business
question and have the appropriate data and forecasting capabilities to answer that question. Good
forecasting works with clean, time stamped data and can identify the genuine trends and patterns in
historical data. Analysts can tell the difference between random fluctuations or outliers, and can separate
genuine insights from seasonal variations. Time series analysis shows how data changes over time, and
good forecasting can identify the direction in which the data is changing.

Conclusion : We have successfully explored time series forecasting on the given dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

data = pd.read_csv('airline-passengers.csv')
data.columns = ['Month','Passengers']
data['Month'] = pd.to_datetime(data['Month'], format='%Y-%m')
data = data.set_index('Month')
data.head()

data.plot(figsize=(20, 4))
plt.grid()
plt.legend(loc='best')
plt.title('Airline passenger traffic')
plt.show(block=False)

data = data.assign(Passengers_Linear_Interpolation=data.Passengers.interpolate(method='linear'))
data[['Passengers_Linear_Interpolation']].plot(figsize=(20, 4))
plt.grid()
plt.legend(loc='best')
plt.title('Airline passenger traffic: Linear interpolation')
plt.show(block=False)
data['Passengers'] = data['Passengers_Linear_Interpolation']
data.drop(columns=['Passengers_Linear_Interpolation'],inplace=True)
data.head()

import seaborn as sns


fig = plt.subplots(figsize=(20, 5))
ax = sns.boxplot(x=data['Passengers'],whis=1.5)

fig = data.Passengers.hist(figsize = (20,5))

from pylab import rcParams


import statsmodels.api as sm
rcParams['figure.figsize'] = 20,24
decomposition = sm.tsa.seasonal_decompose(data.Passengers, model='additive')  # additive seasonal index
fig = decomposition.plot()
plt.show()
decomposition = sm.tsa.seasonal_decompose(data.Passengers, model='multiplicative')  # multiplicative seasonal index
fig = decomposition.plot()
plt.show()
train_len = 120
train = data[0:train_len] # first 120 months as training set
test = data[train_len:] # last 24 months as out-of-time test set
y_hat_sma = data.copy()
ma_window = 12

y_hat_sma['sma_forecast'] = data['Passengers'].rolling(ma_window).mean()
y_hat_sma['sma_forecast'][train_len:] = y_hat_sma['sma_forecast'][train_len-1]

plt.figure(figsize=(20,5))
plt.grid()
plt.plot(train['Passengers'], label='Train')
plt.plot(test['Passengers'], label='Test')
plt.plot(y_hat_sma['sma_forecast'], label='Simple moving average forecast')
plt.legend(loc='best')
plt.title('Simple Moving Average Method')
plt.show()

from sklearn.metrics import mean_squared_error


rmse = np.sqrt(mean_squared_error(test['Passengers'], y_hat_sma['sma_forecast'][train_len:])).round(2)
mape = np.round(np.mean(np.abs(test['Passengers'] - y_hat_sma['sma_forecast'][train_len:]) / test['Passengers']) * 100, 2)

results = pd.DataFrame({'Method': ['Simple moving average forecast'], 'MAPE': [mape], 'RMSE': [rmse]})
results = results[['Method', 'RMSE', 'MAPE']]
results

ARIMA

data = pd.read_csv("airline-passenger-traffic(1).csv", header=None)


data.head()

data.columns = ['Month', 'Passengers']


data.Month = pd.to_datetime(data.Month, format='%Y-%m')
data.Passengers = data.Passengers.astype("float64")
data = data.set_index('Month')
data.head()

data.plot(figsize=(14,6))
plt.title('Airline Passenger Traffic Data')
plt.show(block=False)

data['Passengers_Mean_Imputation'] = data.Passengers.fillna(data.Passengers.mean())
plt.figure(figsize=(16,4))
plt.plot(data.Passengers_Mean_Imputation, label='Passengers_Mean_Imputation')
plt.plot(data.Passengers, label='Passengers')
plt.legend(loc='best')
plt.title('Missing Value Treatment: Mean Imputation')
plt.show(block=False)

data.head()

data["Passengers"]=data["Passengers_Mean_Imputation"]
data.drop(columns=['Passengers_Mean_Imputation'],inplace=True)
data.head()

from pylab import rcParams


import statsmodels.api as sm
rcParams['figure.figsize'] = (14,8)
decomposition = sm.tsa.seasonal_decompose(data.Passengers, model='additive')
fig = decomposition.plot()
plt.show()
from statsmodels.tsa.arima.model import ARIMA
from scipy.stats import boxcox
data_boxcox = pd.Series(boxcox(data.Passengers, lmbda=0), index=data.index)
train_data_boxcox = data_boxcox[:130]
model = ARIMA(train_data_boxcox, order=(1, 1, 1))
model_fit = model.fit()
print(model_fit.params)

sigma2 represents the variance of the residual values. ar.L1 refers to the autoregressive term with a lag of 1, and ma.L1 refers to the moving average term with a lag of 1; a higher-order model would additionally report ar.L2, ma.L2, and so on.
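A short follow-up sketch (a continuation of the script above; recall that the Box-Cox transform with lmbda=0 is a log transform) that forecasts the held-out period and converts the result back to the original scale:

import numpy as np

# forecast the observations after the training window, still on the Box-Cox (log) scale
n_test = len(data_boxcox) - len(train_data_boxcox)
forecast_boxcox = model_fit.forecast(steps=n_test)

# invert the log transform to return to the original passenger scale
forecast = np.exp(forecast_boxcox)
print(forecast.head())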
