ADS EXP Assignments
Theory :
● Descriptive Statistics - Descriptive statistics describe, show, and summarise the basic features of a dataset
found in a given study, presented in a summary that describes the data sample and its measurements. They
help analysts understand the data better.
Example : If you want a good example of descriptive statistics, look no further than a student’s grade
point average (GPA). A GPA gathers the data points created through a large selection of grades,
classes, and exams, then averages them together to present a general idea of the student's mean
academic performance. Note that the GPA doesn’t predict future performance or present any
conclusions. Instead, it provides a straightforward summary of students’ academic success based on
values pulled from data.
● Types of Descriptive Statistics - All descriptive statistics are either measures of central tendency or
measures of variability, also known as measures of dispersion.
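As a quick illustration, the common measures can be computed directly in Python; the marks below are hypothetical values used only for this sketch.
import pandas as pd

# hypothetical marks used only for illustration
marks = pd.Series([72, 85, 60, 91, 78, 85, 66])

# measures of central tendency
print("Mean:", marks.mean())
print("Median:", marks.median())
print("Mode:", marks.mode().tolist())

# measures of dispersion (variability)
print("Range:", marks.max() - marks.min())
print("Variance:", marks.var())
print("Standard deviation:", marks.std())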
● Inferential Statistics - Inferential statistics are often used to compare the differences between the treatment
groups. Inferential statistics use measurements from the sample of subjects in the experiment to compare
the treatment groups and make generalizations about the larger population of subjects.
Example : A coach wants to find out how many average cartwheels sophomores at his college can do
without stopping. A sample of a few students will be asked to perform cartwheels and the average
will be calculated. Inferential statistics will use this data to draw a conclusion about how many
cartwheels sophomores can perform on average.
● Types of Inferential Statistics - Inferential statistics can be classified into hypothesis testing and
regression analysis. Hypothesis testing also includes the use of confidence intervals to test the
parameters of a population.
○ Hypothesis testing - Hypothesis testing is a type of inferential statistics that is used to test
assumptions and draw conclusions about the population from the available sample data. It
involves setting up a null hypothesis and an alternative hypothesis and then conducting
a statistical test of significance. A conclusion is drawn based on the value of the test
statistic, the critical value, and the confidence intervals. A hypothesis test can be left-tailed,
right-tailed, or two-tailed. The most common types of hypothesis tests are the Z-test, F-test,
and t-test (a small worked example follows this list).
○ Regression Analysis - Regression analysis is used to quantify how one variable changes
with respect to another variable. There are many types of regression available, such as
simple linear, multiple linear, nominal, logistic, and ordinal regression. The most
commonly used regression in inferential statistics is linear regression. Linear regression
estimates the effect of a unit change in the independent variable on the dependent variable,
as illustrated in the sketch below.
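A minimal worked sketch of both ideas (hypothetical data; scipy and scikit-learn are assumed to be available):
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# hypothetical sample of cartwheel counts from 8 sophomores
sample = np.array([4, 6, 5, 7, 3, 6, 5, 4])

# one-sample t-test: H0 says the population mean is 5 cartwheels
t_stat, p_value = stats.ttest_1samp(sample, popmean=5)
print("t =", t_stat, ", p =", p_value)

# simple linear regression: hours of practice vs cartwheels (hypothetical pairing)
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
model = LinearRegression().fit(hours, sample)
print("Estimated effect of one extra hour of practice:", model.coef_[0])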
Conclusion : We have successfully explored descriptive and inferential statistics on the given dataset.
import pandas as pd
Theory :
Data cleaning is the process of identifying and correcting or removing errors, inconsistencies, and
inaccuracies in data. One common data cleaning technique is data imputation, which is the process of
filling in missing data with estimated values based on other available data.
Here are some steps you can follow to apply data cleaning techniques such as data imputation:
○ Identify missing data: The first step in data imputation is to identify missing data. This can
be done by examining the dataset and looking for cells or fields with missing values.
○ Choose an imputation method: Once you have identified the missing data, you need to
choose an appropriate imputation method. There are several imputation methods available,
including mean imputation, median imputation, mode imputation, and regression
imputation.
○ Perform imputation: Once you have chosen an imputation method, you can perform the
imputation by filling in the missing data with estimated values. For example, if you choose
mean imputation, you would calculate the mean of the available data for that variable and
replace the missing data with that value.
○ Evaluate the results: After imputing the missing data, it is important to evaluate the results
to ensure that the imputed data makes sense and does not introduce bias or errors into the
dataset.
○ Repeat as necessary: If you find that the imputed data is not satisfactory, you may need to
repeat the process with a different imputation method or adjust the parameters of the
imputation method.
In addition to data imputation, there are many other data cleaning techniques that you can use to
improve the quality of your data. Some of these techniques include removing duplicate data,
correcting inconsistencies and errors, and standardizing data formats. The specific techniques you
use will depend on the nature of your data and the goals of your analysis.
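A minimal sketch of these steps in pandas; the DataFrame and column names below are purely illustrative.
import pandas as pd
import numpy as np

# illustrative data with missing values and a duplicate row
df = pd.DataFrame({
    'income': [40000, 52000, np.nan, 61000, 52000],
    'city': ['pune', 'Mumbai', 'Pune', None, 'Mumbai'],
})

# 1. identify missing data
print(df.isnull().sum())

# 2-3. choose and perform imputation: mean for numeric, mode for categorical
df['income'] = df['income'].fillna(df['income'].mean())
df['city'] = df['city'].fillna(df['city'].mode()[0])

# other cleaning steps: standardise formats and remove duplicates
df['city'] = df['city'].str.title()
df = df.drop_duplicates()

# 4. evaluate the result
print(df.describe(include='all'))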
Conclusion : We have successfully applied data cleaning techniques like Data Imputation.
import pandas as pd
data = pd.read_csv('Inc_Exp_Data.csv')
data.head()
x = data.iloc[:, 1:5].values   # feature columns
y = data.iloc[:, -1].values    # target column (No_of_Earning_Members)
# ytrain comes from a train/test split (an assumed 80/20 split)
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=0)
ytrain
array([1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 1, 2, 1, 1, 1, 2, 2, 1, 1, 3, 1, 1, 1, 1, 4,
1, 1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1])
df = data.copy()                         # work on a copy of the loaded data
to_drop = ['Highest_Qualified_Member']   # categorical column to remove
df.drop(to_drop, inplace=True, axis=1)
# Alternatively:
#df.drop(columns=to_drop, inplace=True)
df.head()
df['No_of_Earning_Members'].is_unique
False
df.iloc[1]
df['Mthly_HH_Expense'].head(30)
df.loc[1:, 'Annual_HH_Income'].head(10)
# inspect individual columns of this dataset
df['Mthly_HH_Income'].head(10)
df.loc[10]
df['Mthly_HH_Income']
EXPERIMENT 03
Theory :
○ Line Chart
A line chart is a graphical representation of data in which data points are plotted and
connected by lines. Line charts are used to show trends over time and to compare data from
different categories.
○ Bar Chart
A bar chart is a graphical representation of data in which data is presented as bars. Bar
charts are used to compare data from different categories and to show changes in data over
time.
○ Pie Chart
A pie chart is a graphical representation of data in which data is presented as slices of a pie.
Pie charts are used to show the percentage breakdown of data.
○ Scatter Plot
A scatter plot is a graphical representation of data in which data points are plotted as
individual points. Scatter plots are used to show the relationship between two variables.
○ Treemap
A treemap is a graphical representation of hierarchical data in which categories are shown as
nested rectangles whose areas are proportional to their values. Treemaps are used to compare
the parts of a whole across many categories.
○ Histogram
A histogram is a graphical representation of data in which data is presented as a series of
bars. Histograms are used to show the distribution of data.
These are some of the commonly used data visualization techniques that can help you gain insights from
your data. The choice of visualization technique depends on the type of data you have and the insights you
want to gain from it.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Data Loading
df = pd.read_csv("Inc_Exp_Data.csv")
df.head()
df.shape
(50, 7)
df.info()
# Highest_Qualified_Member is categorical, so only the numeric columns are plotted
df[['Mthly_HH_Income','Mthly_HH_Expense','No_of_Fly_Members','Emi_or_Rent_Amt',
    'Annual_HH_Income','No_of_Earning_Members']].hist(figsize=(10,10), bins=6, color='y')
plt.tight_layout()
plt.show()
Findings
• Maximum people have an income of 30000-40000
• The average expense of most people is in the range 15000-35000
• There is 0-5000 EMI or rent value for most people
• Most people have an annual income less than 2 lakhs
• Most families only have 1 earning family member
plt.figure(1)
plt.subplot(221)
df['Annual_HH_Income'].value_counts(normalize=True).plot(figsize=(10,8), kind='line', color='red')
plt.title("Annual Income frequency diagram")
plt.ylabel('Relative frequency')
plt.xlabel('Annual Income');
plt.subplot(222)
df['Mthly_HH_Income'].value_counts(normalize=True).plot(figsize=(10,8),kind='pie')
plt.title("Monthly income frequency diagram")
plt.xlabel('Mthly_HH_Income')
Findings
• Among 4-member families, the highest annual income is 6 lakhs
• The maximum monthly income in this dataset is 1 lakh
• There are many 2-member families in this dataset with various annual incomes
import seaborn as sns
corr = df.corr(numeric_only=True)   # correlate only the numeric columns
plt.figure(figsize=(20,9))
a = sns.heatmap(corr,cmap='brg', annot=True, fmt='.2f')
df[["Mthly_HH_Income","Emi_or_Rent_Amt"]].corr()
Aim : Implement and explore performance evaluation metrics for Data Models (Supervised/Unsupervised
Learning)
Theory :
Performance evaluation metrics are essential for measuring the effectiveness and efficiency of data models.
These metrics are used to compare different models, select the best model for a specific problem, and
optimize model parameters. In supervised learning, some of the commonly used performance evaluation
metrics include accuracy, precision, recall, F1 score, ROC curve, and AUC score. In unsupervised learning,
some commonly used metrics include clustering accuracy, silhouette score, and Davies-Bouldin index.
Accuracy is the most widely used metric in supervised learning, and it measures the proportion of correct
predictions made by the model. Precision measures the fraction of true positive predictions among all
positive predictions, while recall measures the fraction of true positive predictions among all actual
positives. F1 score is a harmonic mean of precision and recall and is often used when precision and recall
are equally important. ROC curve and AUC score are used to evaluate the performance of binary
classifiers, where ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at
various threshold settings, and AUC score measures the area under the ROC curve.
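A minimal sketch of these classification metrics using scikit-learn; the true labels, predicted labels, and scores below are hypothetical.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# hypothetical binary classification results
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35]  # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))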
In unsupervised learning, clustering accuracy measures the extent to which the predicted clusters match the
true labels. Silhouette score measures the similarity of data points within a cluster compared to points in
other clusters, and Davies-Bouldin index measures the average similarity between each cluster and its most
similar cluster.
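These clustering metrics can be sketched as follows (synthetic data generated only for illustration):
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# cluster synthetic data and score the result
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print("Silhouette score    :", silhouette_score(X, labels))      # higher is better
print("Davies-Bouldin index:", davies_bouldin_score(X, labels))  # lower is better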
In addition to these commonly used metrics, there are many other performance evaluation metrics available
for data models, depending on the type of problem and the nature of the data. For instance, in regression
problems, mean absolute error (MAE), mean squared error (MSE), and R-squared are commonly used
metrics to evaluate the performance of a model. MAE measures the average magnitude of the errors in the
predictions made by the model, while MSE measures the average squared magnitude of the errors.
R-squared measures the proportion of variance in the dependent variable that is explained by the
independent variables in the model.
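A minimal sketch of the regression metrics with scikit-learn (hypothetical true and predicted values):
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# hypothetical regression results
y_true = [3.0, 5.0, 2.5, 7.0, 4.2]
y_pred = [2.8, 5.4, 2.9, 6.5, 4.0]

print("MAE      :", mean_absolute_error(y_true, y_pred))
print("MSE      :", mean_squared_error(y_true, y_pred))
print("R-squared:", r2_score(y_true, y_pred))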
In multi-class classification problems, there are several evaluation metrics available, including
macro-averaged precision, macro-averaged recall, macro-averaged F1 score, and micro-averaged F1 score.
Macro-averaged precision, recall, and F1 score calculate the performance of each class separately and then
take an average of these scores, while micro-averaged F1 score treats the multi-class problem as a binary
classification problem and calculates the F1 score based on the true positive, false positive, and false
negative rates of all classes combined.
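A minimal sketch of macro and micro averaging with scikit-learn (hypothetical 3-class labels):
from sklearn.metrics import precision_score, recall_score, f1_score

# hypothetical multi-class predictions
y_true = [0, 1, 2, 2, 1, 0, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2, 0, 2]

print("Macro precision:", precision_score(y_true, y_pred, average='macro'))
print("Macro recall   :", recall_score(y_true, y_pred, average='macro'))
print("Macro F1       :", f1_score(y_true, y_pred, average='macro'))
print("Micro F1       :", f1_score(y_true, y_pred, average='micro'))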
It is important to note that no single metric can provide a complete picture of the performance of a data
model, and different metrics may emphasize different aspects of performance. Therefore, it is recommended
to use a combination of different metrics to evaluate the performance of a model thoroughly.
Conclusion : Performance evaluation metrics play a crucial role in selecting the best data model for a particular
problem. Choosing the right metric is important, as it determines the effectiveness and efficiency of the model. A
combination of different metrics is often used to evaluate the performance of a model.
Supervised Learning
import pandas as pd
df = pd.read_csv("Inc_Exp_Data.csv")
df.head()
x = df.iloc[:, 1:5].values   # feature columns
y = df.iloc[:, -1].values    # target column (No_of_Earning_Members)
# ytrain comes from a train/test split (an assumed 70/30 split)
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=0)
ytrain
array([1, 2, 2, 3, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1,
2, 1, 1, 1, 1, 1, 1, 2, 1, 2])
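To complete the evaluation, a minimal sketch of fitting a classifier on this split and computing the metrics discussed above (the choice of a decision tree is an assumption; any classifier would work):
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# fit on the training split and evaluate on the held-out test split
clf = DecisionTreeClassifier(random_state=0).fit(xtrain, ytrain)
ypred = clf.predict(xtest)

print("Accuracy:", accuracy_score(ytest, ypred))
print(classification_report(ytest, ypred, zero_division=0))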
Unsupervised Learning
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
df = pd.read_csv('Inc_Exp_Data.csv')
df = df.replace(' ?', pd.NA)   # treat ' ?' placeholders as missing values
df = df.dropna()
scaler = StandardScaler()
num_cols = ["Mthly_HH_Income", "Mthly_HH_Expense", "No_of_Fly_Members",
            "Emi_or_Rent_Amt", "Annual_HH_Income", "No_of_Earning_Members"]
df[num_cols] = scaler.fit_transform(df[num_cols])
X = df[num_cols]
Elbow Method
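The elbow-method code is not shown above, so the following is a minimal sketch using the scaled feature matrix X and the KMeans and silhouette_score imports from the previous cell:
# elbow method: plot within-cluster sum of squares (inertia) against k
import matplotlib.pyplot as plt

inertias = []
k_values = range(2, 9)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

# silhouette score for a chosen k (higher is better)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("Silhouette score:", silhouette_score(X, labels))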
Aim : Use SMOTE technique to generate synthetic data. (to solve the problem of class imbalance)
Theory :
Class imbalance is a common problem in many machine learning applications. In some datasets, one class may be
significantly underrepresented compared to others, making it difficult for a machine learning model to learn from
the available data. This can lead to biased models that have high accuracy for the majority class and poor accuracy
for the minority class.
One approach to addressing this problem is to use the Synthetic Minority Over-sampling Technique (SMOTE).
SMOTE is a data augmentation technique that generates synthetic samples for the minority class by interpolating
between the existing minority samples.
SMOTE can be applied to the original dataset to generate synthetic samples for the minority class, effectively
increasing the size of the minority class and reducing the class imbalance in the dataset. This can lead to more
accurate machine learning models that perform well on both the majority and minority classes.
One important consideration when using SMOTE is the choice of the value of k. The value of k determines the
level of interpolation between samples and can have a significant impact on the quality of the generated synthetic
data. In general, a larger value of k will result in more conservative interpolation and may produce higher quality
synthetic data. However, larger values of k may also result in overfitting and reduced generalization performance.
Another consideration when using SMOTE is the potential for introducing synthetic data artifacts or biases. Since
the generated synthetic data is based on the existing minority samples, it may inherit any biases or limitations of
the original data. Additionally, the synthetic data may not accurately capture the full distribution of the minority
class, which can lead to overfitting and reduced generalization performance.
Conclusion :
In conclusion, SMOTE is a powerful technique for addressing the problem of class imbalance in machine learning.
By generating synthetic data for the minority class, SMOTE can help to balance the class distribution and improve
the performance of machine learning models. It is easy to use and can be applied to a wide range of datasets,
making it a valuable tool for data scientists and machine learning practitioners. However, it is important to use
SMOTE carefully and to evaluate the performance of the resulting model to ensure that it is accurate and reliable.
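SMOTE itself is not applied in the code that follows, so here is a minimal, self-contained sketch using the imbalanced-learn library (assumed to be installed); the dataset is synthetic and purely illustrative.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# synthetic imbalanced dataset: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           weights=[0.9, 0.1], random_state=42)
print("Before SMOTE:", Counter(y))

# k_neighbors is the k discussed above; it controls the interpolation
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("After SMOTE :", Counter(y_res))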
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import f_oneway
dataset =pd.read_csv("Inc_Exp_Data.csv")
dataset.head()
A=dataset["Mthly_HH_Income"]
A
B=dataset["Mthly_HH_Expense"]
B
0 8000
1 7000
2 4500
3 2000
4 12000
5 8000
6 16000
7 20000
8 9000
C=dataset["No_of_Fly_Members"]
C
0 3
1 2
2 2
3 1
4 2
5 2
6 3
7 5
8 2
9 4
10 4
D=dataset["Emi_or_Rent_Amt"]
D
0 2000
1 3000
2 0
3 0
4 3000
5 0
6 35000
7 8000
8 0
9 0
10 8000
E=dataset["Annual_HH_Income"]
E
0 64200
1 79920
2 112800
3 97200
4 147000
5 196560
6 167400
7 216000
8 218880
9 220800
10 278400
F=dataset["No_of_Earning_Members"]
F
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 2
10 2
f_oneway(A,B,C,D,E,F)
F_onewayResult(statistic=110.97258263216287, pvalue=1.4660303003601035e-65)
Theory :
Outlier detection identifies observations that deviate markedly from the rest of the data. Distance-based rules
such as the IQR method and the Hampel filter flag points that lie too far from the quartiles or the median,
while density-based methods such as DBSCAN treat points that fall in low-density regions (cluster label -1) as
outliers. The code below applies these approaches to the monthly household expense column.
Conclusion : We have successfully performed outlier detection using distance-based and density-based methods.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
df=pd.read_csv('Inc_Exp_Data.csv')
def remove_sign(x, sign):
    # strip a currency sign and thousands separators, then convert to float
    if type(x) is str:
        x = float(x.replace(sign, '').replace(',', ''))
    return x
df=df[['Mthly_HH_Income','Mthly_HH_Expense']]
df=pd.DataFrame(df)
df['Mthly_HH_Expense'] = df['Mthly_HH_Expense'].apply(remove_sign, sign='$')
sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income',data=df)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense (Rs.)')
IQR Method
def remove_outlier_IQR(df):
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    df_final = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))]
    return df_final
df_outlier_removed=remove_outlier_IQR(df.Mthly_HH_Expense)
df_outlier_removed=pd.DataFrame(df_outlier_removed)
ind_diff=df.index.difference(df_outlier_removed.index)
len(ind_diff)
3
HAMPEL METHOD
def remove_outlier_Hampel(df):
    med = df.median()
    List = abs(df - med)
    cond = List.median() * 4.5
    good_list = List[~(List > cond)]
    return good_list
df_outlier_removed=remove_outlier_Hampel(df.Mthly_HH_Expense)
df_outlier_removed=pd.DataFrame(df_outlier_removed)
ind_diff=df.index.difference(df_outlier_removed.index)
for i in range(0, len(ind_diff), 1):
    df_final = df.drop([ind_diff[i]])
    df = df_final
sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income',data=df_final)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense(Rs.)')
len(ind_diff)
1
DBSCAN Method
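The helper remove_outliers_DBSCAN is called below but is not defined above; a minimal sketch of how it could be implemented with scikit-learn's DBSCAN (an assumption about the original helper) is:
from sklearn.cluster import DBSCAN

def remove_outliers_DBSCAN(values, eps, min_samples):
    # cluster a single feature; DBSCAN labels points in low-density regions as -1 (outliers)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        values.values.reshape(-1, 1))
    return pd.Series(labels, index=values.index, name='cluster')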
clusters=remove_outliers_DBSCAN((df['Mthly_HH_Expense']),1,1)
clusters.value_counts().sort_values(ascending=False)
df_cluster=pd.DataFrame(clusters)
ind_outlier=df_cluster.index[df_cluster['cluster']==-1]
ind_outlier
plt.plot(clusters)
sns.boxplot(y='Mthly_HH_Expense', x='Mthly_HH_Income',data=df_final)
plt.xticks(rotation=90)
plt.ylabel('Mthly_HH_Expense (Rs.)')
len(ind_outlier)
0
EXPERIMENT 07
Theory:
Time series forecasting occurs when you make scientific predictions based on historical time stamped data.
It involves building models through historical analysis and using them to make observations and drive
future strategic decision-making. An important distinction in forecasting is that at the time of the work, the
future outcome is completely unavailable and can only be estimated through careful analysis and
evidence-based priors. Time series forecasting is the process of analyzing time series data using statistics
and modeling to make predictions and inform strategic decision-making. It’s not always an exact prediction,
and the reliability of forecasts can vary widely, especially when dealing with the commonly fluctuating
variables in time series data as well as factors outside our control. However, forecasting provides insight into
which outcomes are more likely, or less likely, to occur than other potential outcomes. Often, the more
comprehensive the data we have, the more accurate the forecasts can be. While forecasting and “prediction”
generally mean the same thing, there is a notable distinction. In some industries, forecasting might refer to
data at a specific future point in time, while prediction refers to future data in general. Time series forecasting is
often used in conjunction with time series analysis. Time series analysis involves developing models to gain
an understanding of the data to understand the underlying causes. Analysis can provide the “why” behind
the outcomes you are seeing. Forecasting then takes the next step of what to do with that knowledge and the
predictable extrapolations of what might happen in the future.
Forecasting has a range of applications in various industries. It has tons of practical applications including:
weather forecasting, climate forecasting, economic forecasting, healthcare forecasting, engineering
forecasting, finance forecasting, retail forecasting, business forecasting, environmental studies forecasting,
social studies forecasting, and more. Basically anyone who has consistent historical data can analyze that
data with time series analysis methods and then model, forecast, and predict. For some industries, the
entire point of time series analysis is to facilitate forecasting. Some technologies, such as augmented
analytics, can even automatically select forecasting from among other statistical algorithms if it offers the
most certainty.
Naturally, there are limitations when dealing with the unpredictable and the unknown. Time series
forecasting isn’t infallible and isn’t appropriate or useful for all situations. Because there really is no
explicit set of rules for when you should or should not use forecasting, it is up to analysts and data teams to
know the limitations of analysis and what their models can support. Not every model will fit every data set
or answer every question. Data teams should use time series forecasting when they understand the business
question and have the appropriate data and forecasting capabilities to answer that question. Good
forecasting works with clean, time stamped data and can identify the genuine trends and patterns in
historical data. Analysts can tell the difference between random fluctuations or outliers, and can separate
genuine insights from seasonal variations. Time series analysis shows how data changes over time, and
good forecasting can identify the direction in which the data is changing.
Conclusion : We have successfully explored time series forecasting on the given dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv('airline-passengers.csv')
data.columns = ['Month','Passengers']
data['Month'] = pd.to_datetime(data['Month'], format='%Y-%m')
data = data.set_index('Month')
data.head()
data.plot(figsize=(20, 4))
plt.grid()
plt.legend(loc='best')
plt.title('Airline passenger traffic')
plt.show(block=False)
data = data.assign(Passengers_Linear_Interpolation=data.Passengers.interpolate(method='linear'))
data[['Passengers_Linear_Interpolation']].plot(figsize=(20, 4))
plt.grid()
plt.legend(loc='best')
plt.title('Airline passenger traffic: Linear interpolation')
plt.show(block=False)
data['Passengers'] = data['Passengers_Linear_Interpolation']
data.drop(columns=['Passengers_Linear_Interpolation'],inplace=True)
data.head()
# assumed setup: hold out the last 20% of the series as a test set and use a 12-month window
train_len = int(len(data) * 0.8)
train, test = data[0:train_len], data[train_len:]
ma_window = 12
y_hat_sma = data.copy()
y_hat_sma['sma_forecast'] = data['Passengers'].rolling(ma_window).mean()
y_hat_sma['sma_forecast'][train_len:] = y_hat_sma['sma_forecast'][train_len-1]
plt.figure(figsize=(20,5))
plt.grid()
plt.plot(train['Passengers'], label='Train')
plt.plot(test['Passengers'], label='Test')
plt.plot(y_hat_sma['sma_forecast'], label='Simple moving average forecast')
plt.legend(loc='best')
plt.title('Simple Moving Average Method')
plt.show()
ARIMA
data.plot(figsize=(14,6))
plt.title('Airline Passenger Traffic Data')
plt.show(block=False)
data['Passengers_Mean_Imputation'] = data.Passengers.fillna(data.Passengers.mean())
plt.figure(figsize=(16,4))
plt.plot(data.Passengers_Mean_Imputation, label='Passengers_Mean_Imputation')
plt.plot(data.Passengers, label='Passengers')
plt.legend(loc='best')
plt.title('Missing Value Treatment: Mean Imputation')
plt.show(block=False)
data.head()
data["Passengers"]=data["Passengers_Mean_Imputation"]
data.drop(columns=['Passengers_Mean_Imputation'],inplace=True)
data.head()
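The ARIMA fitting step is not shown, so the following is a minimal sketch with statsmodels; the order (2, 1, 2) is an assumption chosen to match the ar.L1/ar.L2 and ma.L1/ma.L2 terms discussed below.
from statsmodels.tsa.arima.model import ARIMA

# fit an ARIMA(p=2, d=1, q=2) model on the passenger series
model = ARIMA(data['Passengers'], order=(2, 1, 2))
result = model.fit()
print(result.summary())

# forecast the next 12 months
forecast = result.forecast(steps=12)
print(forecast)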
In the fitted model's summary, sigma2 is the variance of the residuals; ar.L1 refers to the autoregressive
term with a lag of 1 and ar.L2 to the autoregressive term with a lag of 2, while ma.L1 and ma.L2 refer to the
moving-average terms with lags of 1 and 2.