
Institute’s Vision

To be an organisation with potential for excellence in engineering and management for the advancement of society and humankind.

Institute’s Mission

To excel in academics, practical engineering, management and to commence research endeavours.

To prepare students for future opportunities.

To nurture students with social and ethical responsibilities.


Department’s Vision

To create IT graduates with ethical and employable skills.

Department’s Mission

To imbibe problem-solving and analytical skills through the teaching-learning process.

To impart technical and managerial skills to meet the industry requirement.

To encourage ethical and value based education.


Excelssior’s Education Society

K. C. COLLEGE OF ENGINEERING
AND MANAGEMENT STUDIES AND
RESEARCH THANE (EAST).

Certificate
This is to certify that Mr. / Ms. ___________________________________

of Semester ________ Branch ____________ Roll No. _________

has performed and successfully completed all the practicals in the subject
of ______________________________________________ for the
academic year 20___ to 20___ as prescribed by University of Mumbai.

DATE :- ____________

_____________________________ _____________________________

Practical Incharge Internal Examiner


____________________________ _____________________________

Head of Department External Examiner

COLLEGE SEAL
Lab Objectives:

The lab experiments aim:
1. To know the fundamental concepts of data science and analytics
2. To learn data collection, preprocessing and visualization techniques for data science
3. To understand and practice analytical methods for solving real-life problems based on statistical analysis
4. To learn various machine learning techniques to solve complex real-world problems
5. To learn streaming and batch data processing using Apache Spark
6. To map the elements of data science to perceive information
Lab Outcomes (cognitive levels of attainment as per Bloom's Taxonomy):

On successful completion of the course, the learner/student will be able to:

1. Understand the concept of the data science process and associated terminologies to solve real-world problems. (L1)
2. Analyze data using different statistical techniques and visualize the outcome using different types of plots. (L1, L2, L3, L4)
3. Analyze and apply supervised machine learning techniques like Classification, Regression or Support Vector Machine on data to build models and solve problems. (L1, L2, L3, L4)
4. Apply different unsupervised machine learning algorithms like Clustering, Decision Trees, Random Forests or Association to solve problems. (L1, L2, L3)
5. Design and build an application that performs exploratory data analysis using Apache Spark. (L1, L2, L3, L4, L5, L6)
6. Design and develop a data science application with data acquisition, processing, visualization and statistical analysis methods, along with a supported machine learning technique, to solve a real-world problem. (L1, L2, L3, L4, L5, L6)

Prerequisite: Basics of Python programming and Database management system.

DETAILED SYLLABUS:

Module I: Introduction to Data Science and Data Processing using Pandas (04 hours, LO1)
i. Introduction, benefits and uses of data science
ii. Data science tasks
iii. Introduction to Pandas
iv. Data preparation: data cleansing, data transformation, combine/merge/join data, data loading and preprocessing with pandas
v. Data aggregation
vi. Querying data in Pandas
vii. Statistics with Pandas DataFrames
viii. Working with categorical and text data
ix. Data indexing and selection
x. Handling missing data

Module II: Data Visualization and Statistics (04 hours, LO2)
i. Visualization with Matplotlib and Seaborn
ii. Plotting line plots, bar plots, histograms, density plots, paths, 3D plots, stream plots, logarithmic plots, pie charts, scatter plots and image visualization using Matplotlib
iii. Plotting scatter plots, box plots, violin plots, swarm plots, heatmaps and bar plots using Seaborn
iv. Introduction to scikit-learn and SciPy
v. Statistics using Python: linear algebra, eigenvalues, eigenvectors, determinant, singular value decomposition, integration, correlation, central tendency, variability, hypothesis testing, ANOVA, z-test, t-test and chi-square test

Module III: Machine Learning (05 hours, LO3)
i. What is Machine Learning?
ii. Applications of Machine Learning
iii. Introduction to Supervised Learning
iv. Overview of Regression
v. Support Vector Machine
vi. Classification algorithms

Program Outcomes

Engineering Graduates will be able to:

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.

2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.

3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.

4. Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.

5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modelling to complex engineering activities
with an understanding of the limitations.

6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.

7. Environment and sustainability: Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.

8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.

9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.

10. Communication: Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.

11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.

12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
Department of Information Technology

Subject : Artificial Intelligence & Data Science

Semester : VI

Class : TE

Course Outcomes / Lab Outcomes

Course Code (ITL703) - Lab Outcomes

At the end of the experiments, the student will be able to:

ITL605.1  Understand the concept of the data science process and associated terminologies to solve real-world problems.

ITL605.2  Analyze data using different statistical techniques and visualize the outcome using different types of plots.

ITL605.3  Analyze and apply supervised machine learning techniques like Classification, Regression or Support Vector Machine on data to build models and solve problems.

ITL605.4  Apply different unsupervised machine learning algorithms like Clustering, Decision Trees, Random Forests or Association to solve problems.

ITL605.5  Design and build an application that performs exploratory data analysis using Apache Spark.

ITL605.6  Design and develop a data science application with data acquisition, processing, visualization and statistical analysis methods, along with a supported machine learning technique, to solve a real-world problem.
Rubrics for Practical

Total mark bands: 15-12 / 12-9 / 9-6 / 6-0 (maximum 15 marks).

Implementation (R1), weight 5 marks:
Successful completion with accurate output (5-4); output correct but not precise (4-3); few errors in the output (3-2); incorrect output (2-0).

Understanding (R2), weight 5 marks:
Understanding of the experiment and correct conclusion (5-4); understands the experiment but the conclusion drawn is less accurate (4-3); improper conclusion (3-2); no conclusion (2-0).

Punctuality and Discipline (R3), weight 5 marks:
Submission within a week (5-4); submission after one week (4-3); submission after two weeks (3-2); submission after three weeks or more (2-0).
TABLE OF CONTENTS

Sr. No.    Name of Experiment    Date of Conduction    Date of Submission    Page No.    Grade / Marks    Sign

Total Grade / Marks :-

Avg. marks of Experiments (A): Obtained ______  Out of ______
Avg. marks of Assignments (B): Obtained ______  Out of ______
Total Marks (A+B): ______

__________________                    __________________
Practical Incharge                    Date

EXPERIMENT NO. - 01

Aim of the Experiment :- Data preparation using NumPy and Pandas

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)

____________________________
Practical Incharge
EXPERIMENT NO. - 01

AIM :   Data preparation using NumPy and Pandas

THEORY:
Data Preprocessing:
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
 
Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle this; it involves handling of missing data, noisy data, etc.

(a) Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways; some of them are:

1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a tuple.

2. Fill the missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean, or by the most probable value (see the short pandas sketch below).
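As an illustration (not part of the original write-up), a minimal pandas sketch of both options on a made-up DataFrame; the column names and the fill strategy are assumptions for the example.

import pandas as pd
import numpy as np

# Illustrative data with missing entries
df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan],
                   "salary": [30000, 42000, np.nan, 52000, 61000]})

# Option 1: ignore (drop) tuples that contain missing values
dropped = df.dropna()

# Option 2: fill missing values, here with the attribute mean
filled = df.fillna(df.mean(numeric_only=True))

print(dropped)
print(filled)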
 
(b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:

1. Binning method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used.

2. Regression:
Here the data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected or fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following:

1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).

2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.

3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.

4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
 
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder as the volume of data grows. Data reduction techniques address this: they aim to increase storage efficiency and reduce data storage and analysis costs.

The various approaches to data reduction are:

1. Data Cube Aggregation:
An aggregation operation is applied to the data to construct the data cube.

2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: an attribute having a p-value greater than the significance level can be discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression models.

4. Dimensionality Reduction:
This reduces the size of the data by an encoding mechanism. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
 
Feature Scaling:
Feature scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data preprocessing to handle highly varying magnitudes, values or units. If feature scaling is not done, a machine learning algorithm tends to give greater weight to larger values and to treat smaller values as less important, regardless of the unit of the values.
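As a hedged illustration of feature scaling (not part of the original text), the sketch below standardizes and min-max scales a couple of made-up columns with scikit-learn; the data values are assumptions.

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({"age": [25, 28, 31, 40, 36],
                   "salary": [30000, 42000, 45000, 52000, 61000]})

# Standardization: rescale each feature to zero mean and unit variance
standardized = StandardScaler().fit_transform(df)

# Min-max scaling: squeeze each feature into the range [0, 1]
normalized = MinMaxScaler().fit_transform(df)

print(standardized)
print(normalized)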
OUTPUT :

CONCLUSION:
EXPERIMENT NO. - 02

Aim of the Experiment :- Data Visualization / Exploratory Data Analysis for the selected data set
using Matplotlib and Seaborn
a. Create a bar graph, contingency table using any 2 variables.
b. Create normalized histogram.
c. Describe what these graphs and tables indicate.

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)

____________________________

Practical Incharge
EXPERIMENT NO. - 02
AIM : Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and
Seaborn
a. Create a bar graph, contingency table using any 2 variables.
b. Create normalized histogram.
c. Describe what these graphs and tables indicate.

THEORY: A bar graph is the graphical representation of categorical data using rectangular bars, where the length of each bar is proportional to the value it represents. A histogram is the graphical representation of data where the data is grouped into continuous number ranges and each range corresponds to a vertical bar.

Contingency Table is one of the techniques for exploring two or even more variables.
It is basically a tally of counts between two or more categorical variables.

seaborn.barplot() method in Python

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

A barplot is used to aggregate categorical data according to some method, the mean by default. It can also be understood as a visualization of a group-by operation. To use this plot we choose a categorical column for the x-axis and a numerical column for the y-axis, and it creates a plot showing the aggregated (mean) value per category.

Syntax: seaborn.barplot(x=None, y=None, hue=None, data=None, order=None,
hue_order=None, estimator=mean, ci=95, n_boot=1000, units=None, orient=None,
color=None, palette=None, saturation=0.75, errcolor='.26', errwidth=None,
capsize=None, dodge=True, ax=None, **kwargs)

Parameters:

x, y, hue : names of variables in data or vector data, optional
    Inputs for plotting long-form data. See examples for interpretation.

data : DataFrame, array, or list of arrays, optional
    Dataset for plotting. If x and y are absent, this is interpreted as
    wide-form. Otherwise it is expected to be long-form.

order, hue_order : lists of strings, optional
    Order to plot the categorical levels in; otherwise the levels are
    inferred from the data objects.

estimator : callable that maps vector -> scalar, optional
    Statistical function to estimate within each categorical bin.

ci : float or "sd" or None, optional
    Size of confidence intervals to draw around estimated values. If "sd",
    skip bootstrapping and draw the standard deviation of the observations.
    If None, no bootstrapping will be performed and error bars will not be
    drawn.

n_boot : int, optional
    Number of bootstrap iterations to use when computing confidence
    intervals.

units : name of variable in data or vector data, optional
    Identifier of sampling units, which will be used to perform a
    multilevel bootstrap and account for repeated measures design.

orient : "v" | "h", optional
    Orientation of the plot (vertical or horizontal). This is usually
    inferred from the dtype of the input variables, but it can be used to
    specify when the "categorical" variable is numeric or when plotting
    wide-form data.

color : matplotlib color, optional
    Color for all of the elements, or seed for a gradient palette.

palette : palette name, list, or dict, optional
    Colors to use for the different levels of the hue variable. Should be
    something that can be interpreted by color_palette(), or a dictionary
    mapping hue levels to matplotlib colors.

saturation : float, optional
    Proportion of the original saturation to draw colors at. Large patches
    often look better with slightly desaturated colors, but set this to 1
    if you want the plot colors to perfectly match the input color spec.

errcolor : matplotlib color
    Color for the lines that represent the confidence interval.

errwidth : float, optional
    Thickness of error bar lines (and caps).

capsize : float, optional
    Width of the "caps" on error bars.

dodge : bool, optional
    When hue nesting is used, whether elements should be shifted along the
    categorical axis.

ax : matplotlib Axes, optional
    Axes object to draw the plot onto; otherwise uses the current Axes.

kwargs : key, value mappings
    Other keyword arguments are passed through to plt.bar at draw time.
The following steps are used (a sketch follows):

Import Seaborn.

Load a dataset from Seaborn, as it contains a good collection of datasets.

Plot the bar graph using the seaborn.barplot() method.
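A small sketch of these steps, using the seaborn "tips" sample dataset as an assumed example; pd.crosstab builds the contingency table and seaborn.barplot draws the bar graph.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load a sample dataset shipped with seaborn
tips = sns.load_dataset("tips")

# Contingency table (tally of counts) between two categorical variables
table = pd.crosstab(tips["day"], tips["sex"])
print(table)

# Bar graph: mean total bill per day (barplot aggregates with the mean by default)
sns.barplot(x="day", y="total_bill", data=tips)
plt.show()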

Normalised Histogram using Matplotlib

To normalize a histogram in Python, we can use the hist() method with density=True. In a normalized histogram, the total area underneath the plot is 1.

Steps:

 Make a list of numbers.

 Plot a histogram with density=True.

 To display the figure, use the show() method.

Example

import matplotlib.pyplot as plt

# figure size and automatic layout
plt.rcParams["figure.figsize"] = [7.00, 3.50]
plt.rcParams["figure.autolayout"] = True

# data to plot; density=True normalizes so that the area under the histogram is 1
k = [5, 5, 5, 5]
counts, bins, patches = plt.hist(k, density=True)

plt.show()

OUTPUT:

CONCLUSION:
EXPERIMENT NO. - 03

Aim of the Experiment :- Data Modeling : Validating partition by performing a two-sample Z-test.

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)

____________________________

Practical Incharge
Experiment No. 3
AIM : Data Modeling : Validating partition by performing a two‐sample Z‐test.

THEORY: Data Modeling

Data modeling is the process of creating a simplified diagram of a software system and the data
elements it contains, using text and symbols to represent the data and how it flows. Data models
provide a blueprint for designing a new database or reengineering a legacy application.

Z-test
Z-test is a statistical method to determine whether the distribution of the test statistics can be
approximated by a normal distribution. It is the method to determine whether two sample means
are approximately the same or different when their variance is known and the sample size is large
(should be >= 30).

When to Use a Z-test:

 The sample size should be greater than 30. Otherwise, we should use the t-test.
 Samples should be drawn at random from the population.
 The standard deviation of the population should be known.
 Samples that are drawn from the population should be independent of each other.
 The data should be normally distributed; however, for a large sample size, it is assumed to have a normal distribution.

Hypothesis Testing

A hypothesis is an educated guess/claim about a particular property of an object. Hypothesis testing is a way to validate the claim of an experiment.

 Null Hypothesis: The null hypothesis is a statement that the value of a population
parameter (such as proportion, mean, or standard deviation) is equal to some claimed
value. We either reject or fail to reject the null hypothesis. Null Hypothesis is denoted by
H0.
 Alternate Hypothesis: The alternative hypothesis is the statement that the parameter has
a value that is different from the claimed value. It is denoted by HA.

Steps to perform a Z-test:

 First, identify the null and alternate hypotheses.
 Determine the level of significance (α).
 Find the critical value of z from the standard normal (z) table for the chosen level of significance.
 Calculate the z-test statistic. For a one-sample test the formula is

z = (X̄ - µ) / (σ / √n)

where,
o X̄ : mean of the sample
o µ : mean of the population
o σ : standard deviation of the population
o n : sample size

Two-sample z-test:
In this test, we are given two normally distributed and independent populations, and we draw samples at random from both populations. Let µ1 and µ2 be the population means, and let X̄1 and X̄2 be the observed sample means. Our null hypothesis is:

H0 : µ1 - µ2 = 0

and the alternative hypothesis is

H1 : µ1 - µ2 ≠ 0

The formula for calculating the z-test statistic is:

z = (X̄1 - X̄2) / √(σ1²/n1 + σ2²/n2)

where σ1 and σ2 are the population standard deviations and n1 and n2 are the sample sizes of the populations corresponding to µ1 and µ2.

Type I error and Type II error:

 Type I error: A Type I error occurs when we reject the null hypothesis even though it is true. The probability of this error is denoted by alpha.
 Type II error: A Type II error occurs when we fail to reject the null hypothesis even though it is false. The probability of this error is denoted by beta.
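A minimal sketch of the two-sample z-test described above; the two samples and the 0.05 significance level are assumptions for the example, and the population variances are approximated by the sample variances.

import numpy as np
from scipy.stats import norm

# Two illustrative independent samples (e.g. two partitions of a dataset)
sample1 = np.array([88, 92, 79, 85, 90, 95, 87, 84, 91, 83, 89, 86, 93, 82, 88,
                    90, 85, 87, 94, 86, 91, 83, 89, 88, 92, 85, 87, 90, 84, 86])
sample2 = np.array([80, 85, 78, 83, 82, 88, 79, 81, 84, 80, 86, 77, 83, 85, 82,
                    79, 84, 81, 80, 83, 85, 78, 82, 84, 81, 86, 80, 83, 79, 82])

# z = (X̄1 - X̄2) / sqrt(σ1²/n1 + σ2²/n2)
z = (sample1.mean() - sample2.mean()) / np.sqrt(
    sample1.var(ddof=1) / len(sample1) + sample2.var(ddof=1) / len(sample2))

p_value = 2 * (1 - norm.cdf(abs(z)))   # two-tailed p-value
alpha = 0.05

print("z =", z, ", p =", p_value)
print("Reject H0" if p_value < alpha else "Fail to reject H0")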

OUTPUT:
CONCLUSION:

EXPERIMENT NO. - 04

Aim of the Experiment :- Implementation of Statistical Hypothesis Test using Scipy and Sci-kit
learn.

Correlation Tests : Chi-Squared Test

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)

____________________________

Practical In charge

Experiment No. 4
AIM : Implementation of Statistical Hypothesis Test using Scipy and Sci-kit learn.

Correlation Tests : Chi-Squared Test

THEORY:

The Pearson's chi-square test is a statistical hypothesis test for independence between categorical variables. Here, we will perform the test using a mathematical approach and then using Python's SciPy module.

The Contingency Table :

A contingency table (also called a crosstab) is used in statistics to summarise the relationship between several categorical variables. Here, we take a table that shows the number of men and women buying different types of pets.

           dog    cat    bird   total
men        207    282    241     730
women      234    242    232     708
total      441    524    473    1438

The aim of the test is to conclude whether the two variables( gender and choice of pet ) are
related to each other.

Null hypothesis:

We start by defining the null hypothesis (H0) which states that there is no relation between the
variables. An alternate hypothesis would state that there is a significant relation between the two.

We can verify the hypothesis by these methods:


 Using p-value:

We define a significance factor to determine whether the relation between the variables is of considerable significance. Generally a significance factor or alpha value of 0.05 is chosen. This alpha value denotes the probability of erroneously rejecting H0 when it is true. A lower alpha value is chosen in cases where we expect more precision. If the p-value for the test comes out to be strictly greater than the alpha value, then we fail to reject H0 (H0 holds).

 Using chi-square value:

If our calculated value of chi-square is less than or equal to the tabular (also called critical) value of chi-square, then H0 holds true.

Expected Values Table :

Next, we prepare a similar table of calculated (or expected) values. To do this we need to calculate each item in the new table as:

expected value = (row total × column total) / grand total

The expected values table:

           dog             cat             bird            total
men        223.87343533    266.00834492    240.11821975     730
women      217.12656467    257.99165508    232.88178025     708
total      441             524             473             1438

Chi-Square Table :

We prepare this table by calculating for each item the following:

(Observed_value – Calculated_value)^2 / Calculated_value

The chi-square table:

observed (o)    calculated (c)    (o - c)^2 / c
207             223.87343533      1.2717579435607573
282             266.00834492      0.9613722161954465
241             240.11821975      0.003238139990850831
234             217.12656467      1.3112758457617977
242             257.99165508      0.991245364156322
232             232.88178025      0.0033387601600580606
Total                             4.542228269825232

From this table, we obtain the total of the last column, which gives us the calculated value of chi-square. Hence the calculated value of chi-square is 4.542228269825232.

Now, we need to find the critical value of chi-square. We can obtain this from a chi-square table. To use this table, we need to know the degrees of freedom for the dataset. The degrees of freedom is defined as (no. of rows - 1) * (no. of columns - 1).
Hence, the degrees of freedom is (2-1) * (3-1) = 2.

Now, look at the table and find the value corresponding to 2 degrees of freedom and a 0.05 significance factor:
The tabular or critical value of chi-square here is 5.991.

Hence,

critical value of χ² (5.991) >= calculated value of χ² (4.542)

Therefore, H0 is accepted; that is, the variables do not have a significant relation.

Performing the test using Python (scipy.stats) :

SciPy is an Open Source Python library, which is used in mathematics, engineering, scientific
and technical computing. 

Installation:

pip install scipy

The chi2_contingency() function of scipy.stats module takes as input, the contingency table in
2d array format. It returns a tuple containing test statistics, the p-value, degrees of freedom and
expected table(the one we created from the calculated values) in that order. 

Hence, we need to compare the obtained p-value with alpha value of 0.05.

from scipy.stats import chi2_contingency


  
# defining the table
data = [[207, 282, 241], [234, 242, 232]]
stat, p, dof, expected = chi2_contingency(data)
  
# interpret p-value
alpha = 0.05
print("p value is " + str(p))
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (H0 holds true)')

OUTPUT:
Chi-square Test for feature selection

Feature selection, also known as attribute selection, is the process of extracting the most relevant features from the dataset and then applying machine learning algorithms for better performance of the model. A large number of irrelevant features increases the training time exponentially and increases the risk of overfitting.

Chi-square Test for Feature Extraction:

The chi-square test is used for categorical features in a dataset. We calculate chi-square between each feature and the target and select the desired number of features with the best chi-square scores. It determines whether the association between two categorical variables of the sample would reflect their real association in the population.

The chi-square score is given by:

χ² = Σ (O - E)² / E

where
O (observed frequency) = number of observations of a class, and
E (expected frequency) = number of expected observations of the class if there were no relationship between the feature and the target.
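As a hedged illustration (the manual describes only the idea), the sketch below performs chi-square based feature selection with scikit-learn's SelectKBest on the Iris dataset; the dataset and k=2 are assumptions for the example.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)   # all features are non-negative, as chi2 requires

# Keep the two features with the best chi-square scores against the target
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print("Chi-square scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))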

OUTPUT:

CONCLUSION:
EXPERIMENT NO. - 05

Aim of the Experiment :- Apply regression model techniques to predict the data on a house prices dataset, and prediction of loan using multivariable regression in Python.

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)

____________________________

Practical In charge
Experiment No. 5
AIM:- Apply regression model techniques to predict the data on a house prices dataset, and prediction of loan using multivariable regression in Python.

THEORY: Linear Regression:


Linear regression is probably one of the most important and widely used regression techniques.
It’s among the simplest regression methods. One of its main advantages is the ease of interpreting
results.

When implementing linear regression of some dependent variable 𝑦 on the set of independent
variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, you assume a linear relationship
between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is the regression equation. 𝛽₀,
𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is the random error.

Linear regression calculates the estimators of the regression coefficients, or simply the predicted weights, denoted by 𝑏₀, 𝑏₁, …, 𝑏ᵣ. They define the estimated regression function 𝑓(𝐱) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ. This function should capture the dependencies between the inputs and output sufficiently well.

The estimated or predicted response, 𝑓(𝐱ᵢ), for each observation 𝑖 = 1, …, 𝑛, should be as close as possible to the corresponding actual response 𝑦ᵢ. The differences 𝑦ᵢ - 𝑓(𝐱ᵢ) for all observations 𝑖 = 1, …, 𝑛, are called the residuals. Regression is about determining the best predicted weights, that is, the weights corresponding to the smallest residuals.

To get the best weights, you usually minimize the sum of squared residuals (SSR) for all
observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))². This approach is called the method of ordinary
least squares.

Multiple Linear Regression:


Multiple or multivariate linear regression is a case of linear regression with two or more
independent variables.

If there are just two independent variables, the estimated regression function is 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂. It represents a regression plane in a three-dimensional space. The goal of regression is to determine the values of the weights 𝑏₀, 𝑏₁, and 𝑏₂ such that this plane is as close as possible to the actual responses and yields the minimal SSR.

The case of more than two independent variables is similar, but more general. The estimated regression function is 𝑓(𝑥₁, …, 𝑥ᵣ) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, and there are 𝑟 + 1 weights to be determined when the number of inputs is 𝑟.
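A minimal sketch of multivariable linear regression with scikit-learn; the tiny house-price style data (area in sq. ft and number of bedrooms against price in thousands) is purely illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Illustrative features and target
X = np.array([[1000, 2], [1500, 3], [1800, 3], [2400, 4], [3000, 4],
              [3500, 5], [1200, 2], [2000, 3], [2700, 4], [3200, 5]])
y = np.array([200, 300, 340, 450, 540, 620, 230, 370, 500, 580])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit y = b0 + b1*x1 + b2*x2 by ordinary least squares
model = LinearRegression().fit(X_train, y_train)
print("Intercept b0:", model.intercept_)
print("Coefficients b1..br:", model.coef_)
print("R^2 on test data:", r2_score(y_test, model.predict(X_test)))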

OUTPUT:

CONCLUSION:
EXPERIMENT NO. - 06
Aim of the Experiment :- Classification modelling
a. Choose classifier for classification problem.
b. Evaluate the performance of classifier

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)

____________________________

Practical Incharge
Experiment No. 6
Aim : Classification modelling
a. Choose classifier for classification problem.
b. Evaluate the performance of classifier

THEORY: Ensemble learning is a machine learning paradigm where multiple models (often
called “weak learners”) are trained to solve the same problem and combined to get better results.
The main hypothesis is that when weak models are correctly combined we can obtain more
accurate and/or robust models.

Bagging combines homogeneous weak learners that are trained independently of each other in parallel and averages them to determine the final model. Bagging is an acronym for 'Bootstrap Aggregation' and is used to decrease the variance of the prediction model. Bagging is a parallel method: it fits the considered learners independently of each other, making it possible to train them simultaneously.

Bagging generates additional data for training from the dataset. This is achieved by random sampling with replacement from the original dataset. Sampling with replacement may repeat some observations in each new training data set. Every element is equally likely to appear in a new dataset.

These multiple datasets are used to train multiple models in parallel. For regression, the average of all the predictions from the different ensemble models is calculated; for classification, the majority vote obtained from the voting mechanism is used. Bagging decreases the variance and tunes the prediction towards the expected outcome.

Example of Bagging: The Random Forest model uses bagging, where decision tree models with higher variance are present. It makes random feature selections to grow the trees. Several random trees make a Random Forest.
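A short sketch of choosing and evaluating a bagging classifier; the Iris dataset, the 70/30 split and the number of estimators are assumptions for the example (scikit-learn's default base learner for BaggingClassifier is a decision tree).

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bagging: bootstrap samples, independently trained learners, majority vote
bagging = BaggingClassifier(n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)

# Evaluate the classifier on held-out data
print("Bagging accuracy:", accuracy_score(y_test, bagging.predict(X_test)))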

OUTPUT:

CONCLUSION:
EXPERIMENT NO. - 07

Aim of the Experiment :- Clustering


a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data.

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)

____________________________
Practical Incharge
Experiment No. 7
AIM :    Clustering
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data.

THEORY: K-Means clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science. In this topic, we will learn what the K-Means clustering algorithm is and how it works, along with the Python implementation of K-Means clustering.

What is the K-Means Algorithm?

K-Means clustering is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on. It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a way that each data point belongs to only one group with similar properties. It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training. It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into K clusters, and repeats the process until it finds the best clusters. The value of K should be predetermined in this algorithm.

The K-Means clustering algorithm mainly performs two tasks:

o Determines the best values for the K centre points or centroids by an iterative process.
o Assigns each data point to its closest K-centre. The data points which are near a particular K-centre create a cluster.

Hence each cluster has data points with some commonalities, and it is away from other clusters. The steps below explain the working of the K-Means clustering algorithm.

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the steps below:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (These can be points other than those from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, then go to Step-4, else go to FINISH.

Step-7: The model is ready (see the sketch below).
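A compact sketch of K-Means clustering and plotting the clusters with scikit-learn and matplotlib; the synthetic blob data and K = 3 are assumptions for the example.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-Means with K = 3 clusters and get the cluster label of each point
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the cluster assignments and the final centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100, label="centroids")
plt.legend()
plt.show()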

OUTPUT:

CONCLUSION:
EXPERIMENT NO. - 08

Aim of the experiment :- Using any machine learning techniques using available data set to
develop a recommendation system.

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)

____________________________

Practical Incharge
EXPERIMENT NO.: 8

AIM: Using any machine learning techniques using available data set to develop a recommendation
system.

THEORY: Practically, recommender systems encompass a class of techniques and algorithms which are able to suggest "relevant" items to users. Ideally, the suggested items are as relevant to the user as possible, so that the user can engage with those items: YouTube videos, news articles, online products, and so on.

Items are ranked according to their relevancy, and the most relevant ones are shown to the user. The relevancy is something that the recommender system must determine and is mainly based on historical data. If you've recently watched YouTube videos about elephants, then YouTube is going to start showing you a lot of elephant videos with similar titles and themes!

Recommender systems are generally divided into two main categories: collaborative filtering and content-based systems.

Figure 1: A tree of the different types of recommender systems.

Collaborative Filtering Systems

Collaborative filtering methods for recommender systems are methods that are solely based on the past interactions between users and the target items. Thus, the input to a collaborative filtering system will be all historical data of user interactions with target items. This data is typically stored in a matrix where the rows are the users and the columns are the items.

The core idea behind such systems is that the historical data of the users should be enough to make a prediction. That is, we don't need anything more than that historical data: no extra push from the user, no presently trending information, etc.

Beyond this, collaborative filtering methods are further divided into two sub-groups: memory-based and model-based methods.

Memory-based methods are the most simplistic as they use no model whatsoever. They assume that predictions can be made on pure "memory" of past data and usually just employ a simple distance-measurement approach, like nearest neighbour.

Model-based approaches, on the other hand, always assume some kind of underlying model and basically try to make sure that whatever predictions come out will fit the model well.

Steps:

1. Load up the data with pandas.

2. Convert the pandas dataframes to GraphLab SFrames.

3. Train the model.

4. Make recommendations (see the sketch below).
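GraphLab Create (SFrames) is no longer widely available, so as an assumed substitute the sketch below shows a simple memory-based recommender with pandas and cosine similarity on a tiny made-up user-item rating matrix; all names and ratings are illustrative.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative user-item rating matrix (rows = users, columns = items, 0 = not rated)
ratings = pd.DataFrame(
    {"item_a": [5, 4, 0, 1], "item_b": [4, 5, 1, 0],
     "item_c": [0, 1, 5, 4], "item_d": [0, 0, 4, 5]},
    index=["u1", "u2", "u3", "u4"])

# Item-item similarity computed purely from the historical interactions
item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)

# Recommend for user u1: score unseen items by similarity to the items u1 rated
user = ratings.loc["u1"]
scores = item_sim.dot(user)[user == 0].sort_values(ascending=False)
print("Recommendations for u1:")
print(scores)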

Principal component analysis (PCA) is a statistical procedure that is used to reduce dimensionality. It uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. It is often used as a dimensionality reduction technique.

Steps Involved in PCA

Step 1: Standardize the dataset.

Step 2: Calculate the covariance matrix for the features in the dataset.

Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.

Step 4: Sort the eigenvalues and their corresponding eigenvectors.

Step 5: Pick k eigenvalues and form a matrix of eigenvectors.

Step 6: Transform the original matrix (see the sketch below).
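These steps can be followed by hand with NumPy; the sketch below, on a small made-up matrix, mirrors steps 1-6 (scikit-learn's PCA class wraps all of this).

import numpy as np

# Step 1: standardize a small illustrative dataset (rows = samples, columns = features)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the features
cov = np.cov(X_std.T)

# Steps 3-4: eigenvalues/eigenvectors of the covariance matrix, sorted in decreasing order
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Step 5: pick k = 1 principal component
W = eig_vecs[:, :1]

# Step 6: transform (project) the standardized matrix
X_pca = X_std.dot(W)
print(X_pca)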

OUTPUT:

CONCLUSION:
EXPERIMENT NO. - 09
Aim of the Experiment :- Exploratory data analysis using Apache Spark and Pandas

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)

__________________________

Practical Incharge
Experiment No-9

AIM: Exploratory data analysis using Apache Spark and Pandas

THEORY:

Exploratory Data Analysis In Python

Exploratory Data Analysis (EDA) in Python is the first step in the data analysis process; the approach was developed by John Tukey in the 1970s. In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

For example, suppose you are planning a trip to location "X". Things you do before taking a decision:

 You explore what places, waterfalls, treks, beaches and restaurants the location has on Google, Instagram, Facebook and other social websites.

 Calculate whether it is in your budget or not.

 Check the time needed to cover all the places.

 Decide the type of travel method.

Similarly, when you are trying to build a machine learning model you need to be pretty sure whether your data is making sense or not. The main aim of exploratory data analysis is to obtain confidence in your data to the extent that you are ready to apply a machine learning algorithm to it.

Need For Exploratory Data Analysis

Exploratory Data Analysis is a crucial step before jumping to machine learning or modeling of the data. By doing this you can get to know whether the selected features are good enough to model, whether all the features are required, and whether there are any correlations, based on which we can either go back to the data pre-processing step or move on to modeling.

Once Exploratory Data Analysis is complete, its features can be used for supervised and unsupervised machine learning modeling.

In every machine learning workflow, the last step is reporting or providing the insights to the stakeholders. By completing the Exploratory Data Analysis, many plots, heat maps, frequency distributions, graphs and correlation matrices can be drawn along with the hypothesis, by which any individual can understand what the data is all about and what insights can be obtained from exploring the data set.

In the trip example, all the exploration of the selected place is done, based on which we get the confidence to plan the trip and even share with our friends the insights we got regarding the place so that they can also join.

What Are The Steps In Exploratory Data Analysis In Python?

There are many steps for conducting exploratory data analysis:

 Description of data

 Handling missing data

 Handling outliers

 Understanding relationships and new insights through plots

a) Description of data:
We need to know the different kinds of data and other statistics of our data before we can move on to the other steps. A good way is to start with the describe() function in Python. In Pandas, we can apply describe() on a DataFrame, which helps in generating descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

The result's index will include count, mean, std, min and max as well as the lower, 50th and upper percentiles. By default, the lower percentile is the 25th and the upper percentile is the 75th. The 50th percentile is the same as the median.

Loading the Dataset:

import pandas as pd
from sklearn.datasets import load_boston

# Note: load_boston was removed in scikit-learn 1.2, so this snippet needs an older version
boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names

# create the dataframe and generate descriptive statistics
boston_df = pd.DataFrame(boston.data)
boston_df.columns = columns
boston_df.describe()

b) Handling missing data:

Data in the real world is rarely clean and homogeneous. Data can be missing during data extraction or collection due to several reasons. Missing values need to be handled carefully because they reduce the quality of any of our performance metrics. They can also lead to wrong predictions or classifications and can cause high bias for any given model being used. There are several options for handling missing values; however, the choice of what should be done largely depends on the nature of our data and of the missing values. Below are some of the techniques:

 Drop NULL or missing values

 Fill missing values

 Predict missing values with an ML algorithm

Drop NULL or missing values:

This is the fastest and easiest way to handle missing values. However, it is not generally advised. This method reduces the quality of our model as it reduces the sample size, because it works by deleting all other observations where any of the variables is missing.
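A minimal check and drop of this kind on the boston_df frame created earlier might look like the following sketch.

# Count missing values per column; all zeros means no nulls in the data set
print(boston_df.isnull().sum())

# Drop rows that contain any missing value (this shrinks the sample size)
boston_df_drop = boston_df.dropna()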

If the check above prints zeros for every column, it indicates that there are no null values in our data set.

Fill Missing Values:

This is the most common method of handling missing values. It is a process whereby missing values are replaced with a test statistic like the mean, median or mode of the particular feature the missing value belongs to. Let's suppose we have missing values of age in the Boston data set. Then the code below will fill those missing values with 30.
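A minimal version of that fill (assuming the Boston AGE column stands in for age) might be:

# Fill missing entries in the AGE column with the constant 30
boston_df['AGE'] = boston_df['AGE'].fillna(30)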

Predict Missing values with an ML Algorithm:


This is by far one of the best and most efficient methods for handling missing data. Depending on
the class of data that is missing, one can either use a regression or classification model to predict
missing data.

c) Handling outliers:
An outlier is something which is separate or different from the crowd. Outliers can be a result of
a mistake during data collection or it can be just an indication of variance in your data. Some of
the methods for detecting and handling outliers:

 BoxPlot

 Scatterplot

 Z-score
 IQR(Inter-Quartile Range)

BoxPlot:
A box plot is a method for graphically depicting groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2).
The whiskers extend from the edges of the box to show the range of the data. Outlier points are
those past the end of the whiskers. Boxplots show robust measures of location and spread as well
as providing information about symmetry and outliers.

import seaborn as sns

sns.boxplot(x=boston_df['DIS'])

OUTPUT:

CONCLUSION:
EXPERIMENT NO. - 10
Aim of the Experiment :- Batch and Streamed Data Analysis using Spark.

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)

__________________________

Practical Incharge
EXPERIMENT NO. - 10
AIM :- Batch and Streamed Data Analysis using Spark.

THEORY:

Datasets are becoming huge. In fact, data is growing faster than processing speeds. Therefore, algorithms involving large data and a high amount of computation are often run on a distributed computing system. A distributed computing system involves nodes (networked computers) that run processes in parallel and communicate (if necessary).

MapReduce – The programming model that is used for distributed computing is known as MapReduce. The MapReduce model involves two stages, Map and Reduce.

1. Map – The mapper processes each line of the input data (it is in the form of a file) and produces key-value pairs.
   Input data → Mapper → list([key, value])
2. Reduce – The reducer processes the list of key-value pairs (after the Mapper's function). It outputs a new set of key-value pairs.
   list([key, value]) → Reducer → list([key, list(values)])

Spark – Spark (an open source big-data processing engine by Apache) is a cluster computing system. It is faster as compared to other cluster computing systems (such as Hadoop). It provides high level APIs in Python, Scala, and Java. Parallel jobs are easy to write in Spark. We will cover PySpark (Python + Apache Spark), because this will make the learning curve flatter. We will see how to create RDDs (the fundamental data structure of Spark).

RDDs (Resilient Distributed Datasets) – RDDs are immutable collections of objects. Since we are using PySpark, these objects can be of multiple types. This will become clearer further on.

SparkContext – For creating a standalone application in Spark, we first define a SparkContext.
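A minimal sketch of that setup; the application name and the local master URL are assumptions for the example.

from pyspark import SparkConf, SparkContext

# Configure and create the SparkContext for a standalone application
conf = SparkConf().setAppName("DegreeCount").setMaster("local[*]")
sc = SparkContext(conf=conf)

# lines is only a pointer to the file; nothing is loaded into memory yet
lines = sc.textFile("file_name.txt")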

RDD transformations – Now that a SparkContext object is created, we will create RDDs and see some transformations on them. One major advantage of using Spark is that it does not load the dataset into memory; lines is only a pointer to the 'file_name.txt' file.

Steps:
1. Our text file is in the following format (each line represents an edge of a directed graph):
   1    2
   1    3
   2    3
   3    4
   .    .
   .    .
2. Large datasets may contain millions of nodes and edges.
3. The first few lines set up the SparkContext. We create an RDD lines from it.
4. Then, we transform the lines RDD to an edges RDD. The function conv acts on each line, and key-value pairs of the form (1, 2), (1, 3), (2, 3), (3, 4), … are stored in the edges RDD.
5. After this, reduceByKey aggregates all the key pairs corresponding to a particular key, and the numNeighbours function is used for generating each vertex's degree in a separate RDD Adj_list, which has the form (1, 2), (2, 1), (3, 1), … (see the sketch below).
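A hedged sketch of steps 4 and 5; the bodies of conv and numNeighbours are reconstructions based on the description above, not the original code.

# Step 4: turn each line "u v" into a (u, v) key-value pair
def conv(line):
    u, v = line.split()
    return (int(u), int(v))

edges = lines.map(conv)

# Step 5: count each vertex's out-degree with reduceByKey
def numNeighbours(a, b):
    return a + b

Adj_list = edges.map(lambda kv: (kv[0], 1)).reduceByKey(numNeighbours)
print(Adj_list.collect())   # e.g. [(1, 2), (2, 1), (3, 1), ...]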

OUTPUT:

CONCLUSION:
EXPERIMENT NO. - 11

Aim :- Implementation of a mini project based on a case study taken from a given dataset, using data science and machine learning.
Each group has to select a problem on which the ML project is done. Attach the same here.
The following steps should be outlined.

a) Problem definition, identifying which data set can be implemented.


b) Identify and use a standard data mining dataset available for the problem. Some links for data
science datasets are: Kaggle, UCI Machine Learning Repository etc.
c) Implement appropriate machine learning algorithm.
d) Interpret and visualize the results.

Lab Outcome :-

Date of Conduction : ____________ Date of Submission :______________

Implementation (5)        Understanding (5)        Punctuality & Discipline (5)        Total (15)
____________________________

Practical In charge

EXPERIMENT NO. - 11
AIM: Implementation of a mini project based on a case study taken from a given dataset, using data science and machine learning.
Each group has to select a problem on which the ML project is done. Attach the same here.
The following steps should be outlined.

a) Problem definition, identifying which data set can be implemented.


b) Identify and use a standard data mining dataset available for the problem. Some links for data
science datasets are: Kaggle, UCI Machine Learning Repository etc.
c) Implement appropriate machine learning algorithm.
d) Interpret and visualize the results.

PROJECT DETAILS:

CONCLUSION:
