Institute's Vision
Institute’s Mission
Department’s Mission
K. C. COLLEGE OF ENGINEERING
AND MANAGEMENT STUDIES AND
RESEARCH THANE (EAST).
Certificate
This is to certify that Mr. / Ms. ___________________________________
has performed and successfully completed all the practicals in the subject
of ______________________________________________ for the
academic year 20___ to 20___ as prescribed by the University of Mumbai.
DATE :- ____________
_____________________________ _____________________________
COLLEGE SEAL
Lab Objectives:
The lab experiments aim:
1. To know the fundamental concepts of data science and analytics
2. To learn data collection, preprocessing and visualization techniques for data science
3. To understand and practice analytical methods for solving real-life problems based on statistical analysis
4. To learn various machine learning techniques to solve complex real-world problems
5. To learn streaming and batch data processing using Apache Spark
6. To map the elements of data science to perceive information
Lab Outcomes:
Sr. No. | Lab Outcomes | Cognitive levels of attainment as per Bloom's Taxonomy
DETAILED SYLLABUS:
Sr. No. | Module | Detailed Content | Hours | LO Mapping

I. Introduction to Data Science and Data Processing using Pandas (04 hours, LO1)
i. Introduction, benefits and uses of data science
ii. Data science tasks
iii. Introduction to Pandas
iv. Data preparation: data cleansing, data transformation, combine/merge/join data, data loading and preprocessing with pandas
v. Data aggregation
vi. Querying data in Pandas
vii. Statistics with Pandas DataFrames
viii. Working with categorical and text data
ix. Data indexing and selection
x. Handling missing data

II. Data Visualization and Statistics (04 hours, LO2)
i. Visualization with Matplotlib and Seaborn
ii. Plotting line plots, bar plots, histograms, density plots, paths, 3D plots, stream plots, logarithmic plots, pie charts, scatter plots and image visualization using Matplotlib
iii. Plotting scatter plot, box plot, violin plot, swarm plot, heatmap and bar plot using Seaborn
iv. Introduction to scikit-learn and SciPy
v. Statistics using Python: linear algebra, eigenvalues, eigenvectors, determinant, singular value decomposition, integration, correlation, central tendency, variability, hypothesis testing, ANOVA, z-test, t-test and chi-square test
Program Outcomes
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modelling to complex engineering activities
with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant
to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional
engineering solutions in societal and environmental contexts, and demonstrate the knowledge of,
and need for, sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological change.
Department of Information Technology
Semester: VI
Class: TE
Index: Sr. No. | Date of Performance | Date of Submission | Page No. | Grade (A+B)
__________________ __________________
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
____________________________
Practical Incharge
EXPERIMENT NO. - 01
THEORY:
Data Preprocessing:
Data preprocessing is a data mining technique used to transform raw data into a useful and
efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle this.
It involves handling missing data, noisy data, etc.
(a) Missing Data:
This situation arises when some values are missing in the data. It can be handled in various
ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are
missing within a tuple.
2. Fill the Missing values:
There are various ways to do this. You can choose to fill the missing values manually,
with the attribute mean, or with the most probable value.
(b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated
by faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into
segments of equal size, and each segment is handled separately. One can replace all data
in a segment by its mean, or boundary values can be used to complete the task.
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used
may be linear (having one independent variable) or multiple (having multiple independent
variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they may
fall outside the clusters.
2. Data Transformation:
This step transforms the data into forms appropriate for the mining process. It involves
the following:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.
3. Discretization:
This is done to replace the raw values of numeric attribute by interval levels or conceptual
levels.
4. Concept Hierarchy Generation:
Here attributes are converted from lower level to higher level in hierarchy. For Example-The
attribute “city” can be converted to “country”.
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes
harder as the volume of data grows. Data reduction techniques address this: they aim to
increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing
attribute selection, one can use the significance level and the p-value of the attribute: an
attribute with a p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression
models.
4. Dimensionality Reduction:
This reduces the size of the data using an encoding mechanism. It can be lossy or lossless.
If the original data can be retrieved after reconstruction from the compressed data, the
reduction is called lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal Component
Analysis).
Feature Scaling:
Feature scaling is a technique to standardize the independent features present in the data to a
fixed range. It is performed during data preprocessing to handle highly varying magnitudes,
values or units. If feature scaling is not done, a machine learning algorithm tends to weigh
greater values higher and treat smaller values as lower, regardless of the unit of the values.
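The steps above can be exercised directly in pandas and scikit-learn. The following is a minimal sketch, assuming scikit-learn is available; the column names and values are hypothetical illustration data, not a prescribed dataset.

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw data with a missing value and widely varying magnitudes
df = pd.DataFrame({
    "age": [25, 30, np.nan, 45],
    "salary": [30000, 52000, 48000, 90000],
    "city": ["Thane", "Mumbai", "Thane", "Pune"],
})

# Data cleaning: fill the missing age with the attribute mean
df["age"] = df["age"].fillna(df["age"].mean())

# Data transformation: min-max scaling of the numeric columns into the range 0.0 to 1.0
scaler = MinMaxScaler()
df[["age", "salary"]] = scaler.fit_transform(df[["age", "salary"]])

# Data aggregation: mean scaled salary per city
print(df.groupby("city")["salary"].mean())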
OUTPUT :
CONCLUSION:
EXPERIMENT NO. - 02
Aim of the Experiment :- Data Visualization / Exploratory Data Analysis for the selected data set
using Matplotlib and Seaborn
a. Create a bar graph, contingency table using any 2 variables.
b. Create normalized histogram.
c. Describe what these graphs and tables indicate.
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
____________________________
Practical Incharge
EXPERIMENT NO. - 02
AIM : Data Visualization / Exploratory Data Analysis for the selected data set using Matplotlib and
Seaborn
a. Create a bar graph, contingency table using any 2 variables.
b. Create normalized histogram.
c. Describe what these graphs and tables indicate.
THEORY: A bar graph is the graphical representation of categorical data using rectangular bars where
the length of each bar is proportional to the value they represent. A histogram is the graphical
representation of data where data is grouped into continuous number ranges and each range corresponds
to a vertical bar.
Contingency Table is one of the techniques for exploring two or even more variables.
It is basically a tally of counts between two or more categorical variables.
A barplot is basically used to aggregate the categorical data according to some method, and
by default it is the mean. It can also be understood as a visualization of a group-by action.
To use this plot, we choose a categorical column for the x-axis and a numerical column for
the y-axis, and the plot shows the mean of the numerical column per category.
Important parameters of seaborn.barplot():
data (DataFrame, array, or list of arrays, optional): dataset for plotting. If x and y are absent, this is interpreted as wide-form; otherwise it is expected to be long-form.
order, hue_order (lists of strings, optional): order to plot the categorical levels in; otherwise the levels are inferred from the data objects.
color (matplotlib color, optional): color for all of the elements, or seed for a gradient palette.
palette (palette name, list, or dict, optional): colors to use for the different levels of the hue variable. Should be something that can be interpreted by color_palette(), or a dictionary mapping hue levels to matplotlib colors.
errcolor (matplotlib color): color for the lines that represent the confidence interval.
ax (matplotlib Axes, optional): Axes object to draw the plot onto; otherwise uses the current Axes.
Steps: import Seaborn and Matplotlib, then plot.
Example:
import matplotlib.pyplot as plt

k = [5, 5, 5, 5]
# density=True normalizes the histogram so that the total area of the bars is 1
x, bins, p = plt.hist(k, density=True)
plt.show()
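For parts (a) and (b) of the aim, a contingency table, a bar graph and a normalized histogram can be produced as in the sketch below; the gender and pet columns are hypothetical illustration data, not a prescribed dataset.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical data (replace with the selected data set)
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
    "pet":    ["dog", "cat", "dog", "cat", "bird", "dog", "cat", "dog"],
})

# (a) Contingency table (crosstab) of the two categorical variables
table = pd.crosstab(df["gender"], df["pet"])
print(table)

# (a) Bar graph drawn from the contingency table
table.plot(kind="bar")
plt.show()

# (b) Normalized histogram of a numeric variable (hypothetical values)
values = [1, 2, 2, 3, 3, 3, 4, 4, 5]
plt.hist(values, bins=5, density=True)   # density=True normalizes the histogram
plt.show()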
OUTPUT:
CONCLUSION:
EXPERIMENT NO. - 03
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
____________________________
Practical Incharge
Experiment No. 3
AIM : Data Modeling : Validating partition by performing a two‐sample Z‐test.
Data modeling is the process of creating a simplified diagram of a software system and the data
elements it contains, using text and symbols to represent the data and how it flows. Data models
provide a blueprint for designing a new database or reengineering a legacy application.
Z-test
Z-test is a statistical method to determine whether the distribution of the test statistics can be
approximated by a normal distribution. It is the method to determine whether two sample means
are approximately the same or different when their variance is known and the sample size is large
(should be >= 30).
The sample size should be greater than 30. Otherwise, we should use the t-test.
Samples should be drawn at random from the population.
The standard deviation of the population should be known.
Samples that are drawn from the population should be independent of each other.
The data should be normally distributed, however for large sample size, it is assumed to
have a normal distribution.
Hypothesis Testing
Null Hypothesis: The null hypothesis is a statement that the value of a population
parameter (such as proportion, mean, or standard deviation) is equal to some claimed
value. We either reject or fail to reject the null hypothesis. Null Hypothesis is denoted by
H0.
Alternate Hypothesis: The alternative hypothesis is the statement that the parameter has
a value that is different from the claimed value. It is denoted by HA.
Two-sampled z-test:
In this test, we have two normally distributed and independent populations, and we have drawn
samples at random from both populations. Here, let µ1 and µ2 be the population means, and X1
and X2 the observed sample means. Our null hypothesis is:
H0 : µ1 - µ2 = 0
and the alternative hypothesis is
H1 : µ1 - µ2 ≠ 0
The z-test statistic is calculated as:
z = ((X1 - X2) - (µ1 - µ2)) / √(σ1²/n1 + σ2²/n2)
where σ1 and σ2 are the population standard deviations and n1 and n2 are the sample sizes of the
populations corresponding to µ1 and µ2.
Type I error: A Type I error occurs when we reject the null hypothesis even though it is
true. This error is denoted by alpha.
Type II error: A Type II error occurs when we fail to reject the null hypothesis even
though it is false. This error is denoted by beta.
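A minimal sketch of a two-sample z-test in Python, assuming the statsmodels library is installed; the two samples below are randomly generated purely for illustration and stand in for the two partitions being validated.

import numpy as np
from statsmodels.stats.weightstats import ztest

# Hypothetical partition of a dataset into two samples (n >= 30 each)
np.random.seed(0)
sample1 = np.random.normal(loc=100, scale=15, size=50)
sample2 = np.random.normal(loc=100, scale=15, size=50)

# H0: mu1 - mu2 = 0, H1: mu1 - mu2 != 0 (two-sided test)
z_stat, p_value = ztest(sample1, sample2, value=0)
print("z statistic:", z_stat, "p-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the two partitions differ significantly.")
else:
    print("Fail to reject H0: the partitions are statistically similar.")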
OUTPUT:
CONCLUSION:
EXPERIMENT NO. - 04
Aim of the Experiment :- Implementation of Statistical Hypothesis Test using SciPy and
scikit-learn.
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
____________________________
Practical In charge
Experiment No. 4
AIM : Implementation of Statistical Hypothesis Test using SciPy and scikit-learn.
THEORY:
Pearson's Chi-Square statistical hypothesis test is a test for independence between categorical
variables. In this experiment, we will perform the test using a mathematical approach and then
using Python's SciPy module.
A Contingency table (also called crosstab) is used in statistics to summarise the relationship between
several categorical variables. Here, we take a table that shows the number of men and women buying
different types of pets.
The aim of the test is to conclude whether the two variables (gender and choice of pet) are
related to each other.
Null hypothesis:
We start by defining the null hypothesis (H0) which states that there is no relation between the
variables. An alternate hypothesis would state that there is a significant relation between the two.
We define a significance factor to determine whether the relation between the variables is of
considerable significance. Generally a significance factor or alpha value of 0.05 is chosen. This
alpha value denotes the probability of erroneously rejecting H0 when it is true. A lower alpha
value is chosen in cases where we expect more precision. If the p-value for the test comes out to
be strictly greater than the alpha value, then H0 holds true.
If our calculated value of chi-square is less than or equal to the tabular (also called critical)
value of chi-square, then H0 holds true.
Chi-Square Table :
From this table, we obtain the total of the last column, which gives us the calculated value of chi-
square. Hence the calculated value of chi-square is 4.542228269825232
Now, we need to find the critical value of chi-square. We can obtain this from a table. To use this
table, we need to know the degrees of freedom for the dataset. The degrees of freedom is defined
as : (no. of rows – 1) * (no. of columns – 1).
Hence, the degrees of freedom is (2-1) * (3-1) = 2
Now, look at the table and find the value corresponding to 2 degrees of freedom and 0.05
significance factor :
The tabular or critical value of chi-square here is 5.991
Hence, since the calculated value of chi-square (4.542) is less than the critical value (5.991),
H0 is accepted; that is, the variables do not have a significant relation.
SciPy is an open-source Python library which is used in mathematics, engineering, and
scientific and technical computing.
Installation: SciPy can be installed with pip (pip install scipy).
The chi2_contingency() function of the scipy.stats module takes the contingency table in
2D array format as input. It returns a tuple containing the test statistic, the p-value, the degrees
of freedom and the expected table (the one we created from the calculated values), in that order.
Hence, we need to compare the obtained p-value with the alpha value of 0.05.
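A minimal sketch of the SciPy approach described above, using a hypothetical 2x3 contingency table; the counts are made up for illustration and do not reproduce the worked example above.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = pet choice
observed = np.array([[207, 282, 241],
                     [234, 242, 232]])

stat, p_value, dof, expected = chi2_contingency(observed)
print("chi-square statistic:", stat)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected frequencies:\n", expected)

alpha = 0.05
if p_value <= alpha:
    print("Reject H0: the variables are related.")
else:
    print("Fail to reject H0: the variables are independent.")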
OUTPUT:
Chi-square Test for feature selection
Feature selection, also known as attribute selection, is the process of extracting the most relevant
features from the dataset and then applying machine learning algorithms for better
performance of the model. A large number of irrelevant features increases the training time
exponentially and increases the risk of overfitting.
Chi-square test is used for categorical features in a dataset. We calculate Chi-square between
each feature and the target and select the desired number of features with best Chi-square scores.
It determines if the association between two categorical variables of the sample would reflect
their real association in the population.
χ² = Σ (O - E)² / E
where O is the observed frequency and E is the expected frequency.
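For feature selection, scikit-learn's SelectKBest can apply the chi-square test between each feature and the target. A minimal sketch follows, using the Iris dataset purely as an illustrative stand-in:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Select the 2 features with the highest chi-square scores against the target
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print("chi-square scores:", selector.scores_)
print("selected feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_new.shape)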
OUTPUT:
CONCLUSION:
EXPERIMENT NO. - 05
Aim of the Experiment :- Apply regression model techniques to predict the data on the House
Prices dataset, and prediction of loan using multivariable regression in Python.
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
____________________________
Practical In charge
Experiment No. 5
AIM:- Apply regression model techniques to predict the data on the House Prices dataset,
and prediction of loan using multivariable regression in Python.
When implementing linear regression of some dependent variable 𝑦 on the set of independent
variables 𝐱 = (𝑥₁, …, 𝑥ᵣ), where 𝑟 is the number of predictors, you assume a linear relationship
between 𝑦 and 𝐱: 𝑦 = 𝛽₀ + 𝛽₁𝑥₁ + ⋯ + 𝛽ᵣ𝑥ᵣ + 𝜀. This equation is the regression equation. 𝛽₀,
𝛽₁, …, 𝛽ᵣ are the regression coefficients, and 𝜀 is the random error.
To get the best weights, you usually minimize the sum of squared residuals (SSR) for all
observations 𝑖 = 1, …, 𝑛: SSR = Σᵢ(𝑦ᵢ - 𝑓(𝐱ᵢ))². This approach is called the method of ordinary
least squares.
If there are just two independent variables, the estimated regression function is 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ +
𝑏₁𝑥₁ + 𝑏₂𝑥₂. It represents a regression plane in a three-dimensional space. The goal of
regression is to determine the values of the weights 𝑏₀, 𝑏₁, and 𝑏₂ such that this plane is as close
as possible to the actual responses and yields the minimal SSR.
The case of more than two independent variables is similar, but more general. The estimated
regression function is 𝑓(𝑥₁, …, 𝑥ᵣ) = 𝑏₀ + 𝑏₁𝑥₁ + ⋯ + 𝑏ᵣ𝑥ᵣ, and there are 𝑟 + 1 weights to be
determined when the number of inputs is 𝑟.
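A minimal sketch of multivariable linear regression with scikit-learn; the columns area, bedrooms, age and price below are hypothetical stand-ins for the actual house-prices and loan datasets.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Hypothetical house-prices data (replace with the actual dataset)
df = pd.DataFrame({
    "area":     [1000, 1500, 1800, 2400, 3000, 3500],
    "bedrooms": [2, 3, 3, 4, 4, 5],
    "age":      [20, 15, 10, 8, 5, 2],
    "price":    [40, 55, 65, 90, 110, 130],   # illustrative values only
})

X = df[["area", "bedrooms", "age"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

model = LinearRegression()
model.fit(X_train, y_train)

print("intercept b0:", model.intercept_)
print("coefficients b1..br:", model.coef_)
print("R^2 on test data:", r2_score(y_test, model.predict(X_test)))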
OUTPUT:
CONCLUSION:
EXPERIMENT NO. - 06
Aim of the Experiment :- Classification modelling
a. Choose a classifier for a classification problem.
b. Evaluate the performance of the classifier.
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
____________________________
Practical Incharge
Experiment No. 6
Aim : Classification modelling
a. Choose a classifier for a classification problem.
b. Evaluate the performance of the classifier.
THEORY: Ensemble learning is a machine learning paradigm where multiple models (often
called “weak learners”) are trained to solve the same problem and combined to get better results.
The main hypothesis is that when weak models are correctly combined we can obtain more
accurate and/or robust models.
Bagging is a homogeneous weak learners' model in which the learners are trained independently
of each other in parallel and combined to determine the model average. Bagging is an acronym for
'Bootstrap Aggregation' and is used to decrease the variance in the prediction model. Bagging is
a parallel method that fits the considered learners independently from each other, making it
possible to train them simultaneously.
Bagging generates additional data for training from the dataset. This is achieved by random
sampling with replacement from the original dataset. Sampling with replacement may repeat
some observations in each new training data set. Every element in bagging is equally likely to
appear in a new dataset.
These multiple datasets are used to train multiple models in parallel. The average of all the
predictions from the different ensemble models is calculated. For classification, the majority
vote obtained from the voting mechanism is used. Bagging decreases the variance and tunes
the prediction to an expected outcome.
Example of Bagging: The Random Forest model uses Bagging, where decision tree models with
higher variance are present. It makes random feature selection to grow trees. Several random
trees make a Random Forest.
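A hedged sketch of choosing a bagging-based classifier (Random Forest) and evaluating it; the breast cancer dataset bundled with scikit-learn is used only as a stand-in for the actual classification problem.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Random Forest = bagging of decision trees with random feature selection
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate the classifier on the held-out test set
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))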
OUTPUT:
CONCLUSION:
EXPERIMENT NO. - 07
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
____________________________
Practical Incharge
Experiment No. 7
AIM : Clustering
a. Clustering algorithms for unsupervised classification.
b. Plot the cluster data.
THEORY: K-Means Clustering is an unsupervised learning algorithm that is used to solve
clustering problems in machine learning or data science. In this topic, we will learn what the
K-means clustering algorithm is and how it works, along with the Python implementation of
K-means clustering.
K-Means Clustering is an unsupervised learning algorithm which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into K different clusters in such a
way that each data point belongs to only one group with similar properties. It allows us to
cluster the data into different groups and is a convenient way to discover the categories of
groups in the unlabeled dataset on its own, without the need for any training. It is a
centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between the data points and their corresponding
cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into K clusters, and
repeats the process until it finds the best clusters. The value of K should be predetermined in
this algorithm.
The K-means algorithm mainly performs two tasks:
o Determines the best values for the K centre points or centroids by an iterative process.
o Assigns each data point to its closest K-centre. The data points which are near a particular
K-centre form a cluster.
Hence each cluster has data points with some commonalities, and it is away from other clusters.
The K-means clustering algorithm works as follows (a sketch is given below):
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the input
dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the new centroid of each cluster.
Step-5: Repeat the third step, that is, reassign each data point to the new closest centroid
of each cluster.
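A minimal sketch of K-means clustering and plotting the resulting clusters; the two-dimensional blobs are synthetic data generated only for illustration.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the cluster assignments and the final centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100, label="centroids")
plt.legend()
plt.show()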
OUTPUT:
CONCLUSION:
EXPERIMENT NO. - 08
Aim of the experiment :- Using any machine learning technique on an available data set,
develop a recommendation system.
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
____________________________
Practical Incharge
EXPERIMENT NO.: 8
AIM: Using any machine learning technique on an available data set, develop a recommendation
system.
Recommender systems are algorithms which are able to suggest "relevant" items to users.
Ideally, the suggested items are as relevant to the user as possible, so that the user can engage
with those items: YouTube videos, news articles, and so on. Items are ranked according to their
relevancy, and the most relevant ones are shown to the user. The relevancy is something that
the recommender system must determine, and it is mainly based on historical data. If you have
recently watched YouTube videos about elephants, then YouTube is going to start showing you
a lot of elephant videos with similar titles and themes.
Recommender systems are generally divided into two main categories: collaborative filtering and
content-based systems.
Collaborative filtering methods for recommender systems are methods that are solely based on
the past interactions between users and the target items. Thus, the input to a collaborative
filtering system will be all historical data of user interactions with target items. This data is
typically stored in a matrix where the rows are the users, and the columns are the items.
The core idea behind such systems is that the historical data of the users should be enough to
make a prediction; that is, we do not need anything more than that historical data, and no extra
input from the user.
Beyond this, collaborative filtering methods are further divided into two sub-groups:
memory-based and model-based methods.
Memory-based methods are the most simplistic as they use no model whatsoever. They assume
that predictions can be made on pure "memory" of past data, usually by employing a simple
similarity or distance measure between users or items.
Model-based approaches, on the other hand, always assume some kind of underlying model and
basically try to make sure that whatever predictions come out will fit the model well.
Steps:
4. Make recommendations.
PCA (Principal Component Analysis) converts a set of observations of possibly correlated
variables into a set of values of linearly uncorrelated variables called principal components.
Step 3: Calculate the eigenvalues and eigenvectors for the covariance matrix.
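A hedged sketch of a memory-based (item-item) collaborative filtering recommender; the user names, item names and ratings below are entirely hypothetical.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical user-item rating matrix (0 = not rated)
ratings = pd.DataFrame(
    {"Movie A": [5, 4, 0, 1],
     "Movie B": [4, 5, 1, 0],
     "Movie C": [1, 0, 5, 4],
     "Movie D": [0, 1, 4, 5]},
    index=["user1", "user2", "user3", "user4"],
)

# Item-item similarity computed from the rating columns
item_sim = pd.DataFrame(cosine_similarity(ratings.T),
                        index=ratings.columns, columns=ratings.columns)

def recommend(user, top_n=2):
    # Score unseen items by their similarity to items the user has already rated
    user_ratings = ratings.loc[user]
    scores = item_sim.dot(user_ratings)
    scores = scores[user_ratings == 0]   # keep only items the user has not rated
    return scores.sort_values(ascending=False).head(top_n)

print(recommend("user1"))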
OUTPUT:
CONCLUSION:
EXPERIMENT NO. - 09
Aim of the Experiment :- Exploratory data analysis using Apache Spark and Pandas
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
__________________________
Practical Incharge
Experiment No-9
THEORY:
Exploratory Data Analysis (EDA) is the first step in the data analysis process; the approach was
developed by John Tukey in the 1970s. In statistics, exploratory data analysis is an approach to
analyzing data sets to summarize their main characteristics, often with visual methods.
For example, suppose you are planning a trip to location "X". Before taking a decision, you will
explore what places, waterfalls, treks, beaches and restaurants the location has on Google,
Instagram, Facebook and other social websites.
Similarly, when you are trying to build a machine learning model you need to be pretty sure
whether your data makes sense or not. The main aim of exploratory data analysis is to obtain
confidence in your data to the extent that you are ready to apply a machine learning algorithm to it.
Exploratory Data Analysis is a crucial step before jumping to machine learning or modeling of
the data. By doing this you can get to know whether the selected features are good enough to
model, are all the features required, are there any correlations based on which we can either go
back to the Data Pre-processing step or move on to modeling.
Once Exploratory Data Analysis is complete, its features can be used for supervised and
unsupervised machine learning modeling.
In every machine learning workflow, the last step is reporting or providing the insights to the
stakeholders. By completing exploratory data analysis, many plots, heat maps, frequency
distributions, graphs and correlation matrices can be produced, along with the hypothesis, from
which any individual can understand what the data is all about and what insights can be drawn
from exploring the data set.
In the trip example, all the exploration of the selected place is done first; based on it we gain the
confidence to plan the trip, and we can even share the insights we got regarding the place with
our friends so that they can also join.
a) Description of data:
We need to know the different kinds of data and other statistics of our data before we can move
on to the other steps. A good way to start is the describe() function in Python. In Pandas, we
can apply describe() to a DataFrame, which generates descriptive statistics that summarize the
central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
The result’s index will include count, mean, std, min, max as well as lower, 50 and upper
percentiles. By default, the lower percentile is 25 and the upper percentile is 75. The 50
percentile is the same as the median.
import pandas as pd
from sklearn.datasets import load_boston

# Note: load_boston was removed in scikit-learn 1.2, so this snippet requires an
# older scikit-learn version (< 1.2).
boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names

# creating the dataframe
boston_df = pd.DataFrame(boston.data)
boston_df.columns = columns
boston_df.describe()
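A minimal sketch of a missing-value check on the same DataFrame, assuming isnull() is the intended approach for the statement that follows:

# Count missing values per column; all zeros means there are no nulls in the dataset
print(boston_df.isnull().sum())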
The above code indicates that there are no null values in our data set.
c) Handling outliers:
An outlier is something which is separate or different from the crowd. Outliers can be a result of
a mistake during data collection or it can be just an indication of variance in your data. Some of
the methods for detecting and handling outliers:
BoxPlot
Scatterplot
Z-score
IQR(Inter-Quartile Range)
BoxPlot:
A box plot is a method for graphically depicting groups of numerical data through their quartiles.
The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2).
The whiskers extend from the edges of the box to show the range of the data. Outlier points are
those past the end of the whiskers. Boxplots show robust measures of location and spread as well
as providing information about symmetry and outliers.
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot of the DIS column to visualize potential outliers
sns.boxplot(x=boston_df['DIS'])
plt.show()
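The IQR and Z-score methods listed above can be applied to the same column as in the following sketch, continuing with the boston_df DataFrame created earlier:

import numpy as np
from scipy import stats

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
Q1 = boston_df["DIS"].quantile(0.25)
Q3 = boston_df["DIS"].quantile(0.75)
IQR = Q3 - Q1
iqr_outliers = boston_df[(boston_df["DIS"] < Q1 - 1.5 * IQR) |
                         (boston_df["DIS"] > Q3 + 1.5 * IQR)]
print("IQR outliers:", len(iqr_outliers))

# Z-score method: flag points more than 3 standard deviations from the mean
z = np.abs(stats.zscore(boston_df["DIS"]))
print("Z-score outliers:", int((z > 3).sum()))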
OUTPUT:
CONCLUSION:
EXPERIMENT NO. - 10
Aim of the Experiment :- Batch and Streamed Data Analysis using Spark.
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
__________________________
Practical Incharge
EXPERIMENT NO. - 10
AIM :- Batch and Streamed Data Analysis using Spark.
THEORY:
Datasets are becoming huge. In fact, data is growing faster than processing speeds.
Therefore, algorithms involving large data and a high amount of computation are often
run on a distributed computing system. A distributed computing system involves
nodes (networked computers) that run processes in parallel and communicate (if
necessary).
MapReduce – The programming model that is used for Distributed computing is
known as MapReduce. The MapReduce model involves two stages, Map and
Reduce.
1. Map – The mapper processes each line of the input data (it is in the form of a
file), and produces key – value pairs.
Input data → Mapper → list([key, value])
2. Reduce – The reducer processes the list of key – value pairs (after the
Mapper’s function). It outputs a new set of key – value pairs.
list([key, value]) → Reducer → list([key, list(values)])
Spark – Spark (an open-source big-data processing engine by Apache) is a cluster
computing system. It is faster compared to other cluster computing systems (such
as Hadoop). It provides high-level APIs in Python, Scala, and Java. Parallel jobs are
easy to write in Spark. We will cover PySpark (Python + Apache Spark), because
this makes the learning curve flatter. Installing Spark on a Linux system and running
it on a multi-node cluster are described in the Spark documentation. We will see how to
create RDDs (the fundamental data structure of Spark).
RDDs (Resilient Distributed Datasets) – RDDs are immutable collections of
objects. Since we are using PySpark, these objects can be of multiple types. This
will become clearer below.
SparkContext – For creating a standalone application in Spark, we first define a
SparkContext, as sketched below.
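A minimal sketch of defining a SparkContext for a standalone PySpark application, assuming PySpark is installed; the application name is arbitrary.

from pyspark import SparkConf, SparkContext

# Configure and create a SparkContext for a local standalone application
conf = SparkConf().setAppName("DegreeCount").setMaster("local[*]")
sc = SparkContext(conf=conf)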
Steps:
1. Our text file is in the following format (each line represents an edge of a
directed graph):
1 2
1 3
2 3
3 4
...
2. Large datasets may contain millions of nodes and edges.
3. The first few lines set up the SparkContext. We create an RDD lines from the input text file.
4. Then we transform the lines RDD into an edges RDD. The function conv acts on
each line, and key-value pairs of the form (1, 2), (1, 3), (2, 3), (3, 4), … are stored
in the edges RDD.
5. After this, reduceByKey aggregates all the pairs corresponding to a particular key,
and the numNeighbours function is used for generating each vertex's degree in a
separate RDD Adj_list, which has the form (1, 2), (2, 1), (3, 1), … (see the sketch below).
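A hedged sketch of steps 3–5, assuming the edge list is stored in a text file named edges.txt and with assumed implementations of conv and numNeighbours (the originals were not shown):

from pyspark import SparkConf, SparkContext

# Reuse the SparkContext from the sketch above if it exists, otherwise create one
conf = SparkConf().setAppName("DegreeCount").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

def conv(line):
    # Assumed implementation: turn a line such as "1 2" into the pair (1, 2)
    src, dst = line.split()
    return (int(src), int(dst))

def numNeighbours(a, b):
    # Assumed implementation: used with reduceByKey after mapping each edge to
    # (vertex, 1); summing the ones gives each vertex's out-degree
    return a + b

lines = sc.textFile("edges.txt")                  # RDD of raw lines
edges = lines.map(conv)                           # RDD of (source, destination) pairs
Adj_list = edges.map(lambda e: (e[0], 1)) \
                .reduceByKey(numNeighbours)       # RDD of (vertex, degree) pairs

print(Adj_list.collect())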
OUTPUT:
CONCLUSION:
EXPERIMENT NO. - 11
Aim :- Implementation of a mini project based on a case study taken from a given dataset, using
data science and machine learning.
Each group has to select a problem on which the ML project is based. Attach the same here.
The following steps should be outlined.
Lab Outcome :-
Implementation (5) | Understanding (5) | Punctuality & Discipline (5) | Total (15)
____________________________
Practical In charge
EXPERIMENT NO. - 11
AIM: Implementation of a mini project based on a case study taken from a given dataset, using
data science and machine learning.
Each group has to select a problem on which the ML project is based. Attach the same here.
The following steps should be outlined.
PROJECT DETAILS:
CONCLUSION: