DSBA Manual 2025
Student Name:
Branch:
Certificate
This is to certify that………………………………………………………………………….....
Program………………………………………………………………………………………of
Institute………………………………………………………………………………….……..
has successfully completed the Term Work / Assignments satisfactorily in the Course
Empowering society through quality education and research for the socio-economic
development of the region.
PSO1
Professional Skills- The ability to understand, analyze and develop computer programs in the
areas related to algorithms, system software, multimedia, web design, big data analytics, and
networking for efficient design of computer-based systems of varying complexities.
PSO2
Problem-Solving Skills- The ability to apply standard practices and strategies in software
project development using open-ended programming environments to deliver a quality product
for business success.
PSO3
Successful Career and Entrepreneurship- The ability to employ modern computer
languages, environments and platforms in creating innovative career paths to be an
entrepreneur and to have a zest for higher studies.
SAVITRIBAI PHULE PUNE UNIVERSITY
Course Objectives:
• To understand the principles of Data Science for the analysis of real-time problems
• To develop an in-depth understanding and implementation of the key technologies in Data Science and
Big Data Analytics
• To analyze and demonstrate knowledge of statistical data analysis techniques for decision-making
• To gain practical, hands-on experience with statistical programming languages and Big Data tools
Course Outcomes:
On completion of the course, the learner will be able to:
CO1: Apply principles of Data Science for the analysis of real-time problems
CO2: Implement data representation using statistical methods
CO3: Implement and evaluate data analytics algorithms
CO4: Perform text preprocessing
CO5: Implement data visualization techniques
CO6: Use cutting edge tools and technologies to analyze Big Data
Suggested List of Laboratory Experiments/Assignments
Group A
1. Data Wrangling I
Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate open source data on the web (e.g., https://www.kaggle.com). Provide a
clear description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas dataframe.
4. Data Preprocessing: check for missing values in the data using the pandas isnull()
and describe() functions to get some initial statistics. Provide variable descriptions,
types of variables, etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by
checking the data types (i.e., character, numeric, integer, factor, and logical) of the
variables in the data set. If variables are not in the correct data type, apply proper type
conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the codes and outputs, explain every operation that you do in the above steps
and explain everything that you do to import/read/scrape the data set.
2. Data Wrangling II
Create an “Academic performance” dataset of students and perform the following
operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing
values and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.
Reason and document your approach properly.
3. Descriptive Statistics - Measures of Central Tendency and Variability
Perform the following operations on any open source dataset (e.g., data.csv)
1. Provide summary statistics (mean, median, minimum, maximum, standard
deviation) for a dataset (age, income etc.) with numeric variables grouped by one of
the qualitative (categorical) variables. For example, if your categorical variable is age
groups and quantitative variable is income, then provide summary statistics of income
grouped by the age groups. Create a list that contains a numeric value for each
response to the categorical variable.
2. Write a Python program to display some basic statistical details like percentile,
mean, standard deviation, etc., of the species ‘Iris-setosa’, ‘Iris-versicolor’ and
‘Iris-virginica’ of the iris.csv dataset.
Provide the codes with outputs and explain everything that you do in this step.
4. Data Analytics I
Create a Linear Regression Model using Python/R to predict home prices using the Boston
Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset
contains information about various houses in Boston through different parameters. There
are 506 samples and 14 feature variables in this dataset.
The objective is to predict the value of prices of the house using the given features.
5 Data Analytics II
1. Implement logistic regression using Python/R to perform classification on
Social_Network_Ads.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision,
Recall on the given dataset.
6 Data Analytics III
1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv
dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision,
Recall on the given dataset.
7 Text Analytics
1. Extract Sample document and apply following document preprocessing methods:
Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse
Document Frequency.
8 Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains
information about the passengers who boarded the unfortunate Titanic ship. Use the
Seaborn library to see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each
passenger is distributed by plotting a histogram.
9 Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for
distribution of age with respect to each gender along with the information about whether
they survived or not. (Column names : 'sex' and 'age')
2. Write observations on the inference from the above statistics.
Group B
3. Locate a dataset (e.g., sample_weather.txt) for working on weather data which reads the
text input files and finds the average for temperature, dew point, and wind speed.
4. Write a simple program in SCALA using the Apache Spark framework.
Group C
1. Write a case study on Global Innovation Network and Analysis (GINA). Components of the
analytic plan are: 1. Discovery (business problem framed), 2. Data, 3. Model planning (analytic
technique), and 4. Results and key findings.
2. Use the following dataset and classify tweets into positive and negative tweets:
https://www.kaggle.com/ruchi798/data-science-tweets
3. Develop a movie recommendation model using the scikit-learn library in Python. Refer to
the dataset https://github.com/rashida048/Some-NLP-Projects/blob/master/movie_dataset.csv
5. Write a case study on data-driven processing for Digital Marketing OR Health care systems
with the Hadoop Ecosystem components shown below. (Mandatory)
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming-based data processing
● Spark: In-memory data processing
● PIG, HIVE: Query-based processing of data services
● HBase: NoSQL database (provides real-time reads and writes)
● Mahout, Spark MLlib: Machine Learning algorithm libraries (provide analytical tools)
● Solr, Lucene: Searching and indexing
INDEX
Sr. No | Name of the Experiment | Date of Performance | Date of Completion | Marks and Sign
1. Data Wrangling I
Perform the following operations using Python on any open
source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate open source data on the web (e.g.,
https://www.kaggle.com). Provide a clear description
of the data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas data frame.
4. Data Pre-processing: check for missing values in the
data using the pandas isnull() and describe() functions to
get some initial statistics. Provide variable descriptions,
types of variables, etc. Check the dimensions of the
data frame.
5. Data Formatting and Data Normalization: Summarize
the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical) of the
variables in the data set. If variables are not in the
correct data types, apply proper type conversions.
6. Turn categorical variables into quantitative
variables in python.
In addition to the codes and outputs, explain every operation that
you do in the above steps and explain everything that you do to
import/read/scrape the data set.
2 Data Wrangling II
Create an “Academic performance” dataset of students and
perform the following operations using Python.
1. Scan all variables for missing values and inconsistencies.
If there are missing values and/or inconsistencies, use any
of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers,
use any of the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables.
The purpose of this transformation should be one of the
following reasons: to change the scale for better
understanding of the variable, to convert a non-linear
relation into a linear one, or to decrease the skewness and
convert the distribution into a normal distribution.
Reason and document your approach properly.
5 Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above
problem. Plot a box plot for distribution of age with
respect to each gender along with the information about
whether they survived or not. (Column names: 'sex' and
'age')
2. Write observations on the inference from the above statistics.
6 Data Visualization III
Download the Iris flower dataset or any other dataset into a
Data Frame (e.g., https://archive.ics.uci.edu/ml/datasets/Iris). Scan
the dataset and give the inference as:
1. List down the features and their types (e.g., numeric,
nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to
illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distribution and identify outliers.
7 Data Analytics-I
Create a Linear Regression Model using Python/R to predict home
prices using Boston Housing Dataset
(https://www.kaggle.com/c/boston-housing). The Boston
Housing dataset contains information about various houses in
Boston through different parameters. There are 506 samples and 14
feature variables in this dataset.
The objective is to predict the value of prices of the
house using the given features.
8 Data Analytics-II
Implement logistic regression using Python/R to perform
classification on Social_Network_Ads.csv.
9 Data Analytics-III
Implement Simple Naïve Bayes classification algorithm using
Python/R on iris.csv dataset
10 Text Analytics:
1. Extract Sample document and apply following document
preprocessing methods: Tokenization, POS Tagging, stop words
removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term
Frequency and Inverse Document Frequency.
14 Mini Project 1
15 Mini Project 2
Practical No – 01
Perform the following operations using Python on any open-source dataset (e.g., data.csv)
Theory Concepts:
NumPy:
NumPy is a Python library used for working with arrays. It also has functions for working in
the domains of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by
Travis Oliphant. It is an open-source project and you can use it freely. In Python we have lists
that serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array
object that is up to 50x faster than traditional Python lists. The array object in NumPy is called
ndarray; it provides a lot of supporting functions that make working with ndarray very easy.
Arrays are very frequently used in data science, where speed and resources are very important.
Pandas:
Pandas is a Python library for data analysis. Started by Wes McKinney in 2008 out of a need
for a powerful and flexible quantitative analysis tool, pandas has grown into one of the most
popular Python libraries. It has an extremely active community of contributors. Pandas is built
on top of two core Python libraries: matplotlib for data visualization and NumPy for
mathematical operations. Pandas acts as a wrapper over these libraries, allowing you to access
many of matplotlib's and NumPy's methods with less code. For instance, pandas' .plot()
combines multiple matplotlib methods into a single method, enabling you to plot a chart in a
few lines.
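A minimal Python sketch of the wrangling steps described above is given below. The file name data.csv is a placeholder for whichever open-source dataset is chosen, and any column names used are assumptions.

import pandas as pd

# Step 2-3: load the chosen open-source dataset (file name is a placeholder)
df = pd.read_csv("data.csv")

# Step 4: missing values, initial statistics, variable types and dimensions
print(df.isnull().sum())
print(df.describe())
print(df.dtypes)
print(df.shape)

# Step 5: data formatting - apply type conversions where needed, e.g. a numeric
# column that was read as text (the column name here is purely illustrative)
# df["some_column"] = pd.to_numeric(df["some_column"], errors="coerce")

# Step 6: turn categorical (object) variables into quantitative dummy variables
df_encoded = pd.get_dummies(df, columns=df.select_dtypes(include="object").columns)
print(df_encoded.head())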
Conclusion:
Hence, we can perform all the data wrangling steps on an open-source dataset.
Practical No – 02
Create an “Academic performance” dataset of students and perform the following operations
using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing values
and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.
Reason and document your approach properly.
Theory Concepts:
Data Transformation:
Data transformation is the process of converting data from one format to another, typically
from the format of a source system into the required format of a destination system. Data
transformation is a component of most data integration, data wrangling, and data
warehousing activities.
Data transformation is a technique used to convert raw data into a suitable format that
efficiently eases data mining and the retrieval of strategic information. Data transformation
includes data cleaning techniques to convert the data into the appropriate form.
The broad categories of data transformation are:
1. Constructive: adds, copies, or replicates data.
2. Destructive: deletes fields or records.
3. Aesthetic: standardizes values to meet formatting requirements.
4. Structural: reorganizes the data by renaming, moving, or combining columns.
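A minimal Python sketch of the three required operations is given below; the "Academic performance" data frame and its score columns are invented for illustration.

import pandas as pd
import numpy as np

# Hypothetical "Academic performance" data; the column names are assumptions
df = pd.DataFrame({
    "math_score": [65, 70, np.nan, 88, 95, 300, 72, 60],     # 300 is an inconsistent value
    "reading_score": [70, 68, 75, np.nan, 90, 85, 66, 58],
})

# 1. Missing values: fill with the column median
df = df.fillna(df.median(numeric_only=True))

# 2. Outliers: detect with the IQR rule and cap them at the whisker limits
for col in ["math_score", "reading_score"]:
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Transformation: log transform to reduce the skewness of a score column
df["math_score_log"] = np.log1p(df["math_score"])
print(df)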
Conclusion:
Hence, we can perform all the data wrangling steps, including data transformation, on an
open-source dataset.
Practical No – 03
Perform the following operations on any open source dataset (e.g., data.csv)
Theory Concepts:
Mean:
The arithmetic mean, or simply the mean or the average, is the sum of a set of numbers divided
by the count of numbers in the collection.
Median:
In a sorted, ascending or descending, list of numbers, the median is the middle number and
may be more representative of that data set than the average.
Mode:
The mode is the value that most frequently appears in a data value set.
Standard Deviation:
The standard deviation is a measure of the amount of variation or dispersion of a set of values.
Variance:
The variance is the expectation of the squared deviation of a random variable from its mean.
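A short Python sketch of both parts is given below; the column names age_group, income, and Species are assumptions about the chosen dataset and the iris.csv file.

import pandas as pd

# Part 1: summary statistics of a numeric variable grouped by a categorical one
df = pd.read_csv("data.csv")
summary = df.groupby("age_group")["income"].agg(["mean", "median", "min", "max", "std"])
print(summary)
# a list with one numeric summary value per category of the categorical variable
print(df.groupby("age_group")["income"].mean().tolist())

# Part 2: basic statistical details for each species in the iris dataset
iris = pd.read_csv("iris.csv")
for species, group in iris.groupby("Species"):
    print(species)
    print(group.describe())   # percentiles, mean, std, etc. per species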
Practical No – 04
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information
about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library
to see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each
passenger is distributed by plotting a histogram.
Theory Concepts:
Seaborn library:
Seaborn is a Python data visualization library built on top of matplotlib. It provides a high-level
interface for drawing attractive and informative statistical graphics and works well with pandas
data frames.
Histogram:
A histogram is basically used to represent data provided in the form of groups. It is an accurate
method for the graphical representation of numerical data distribution. It is a type of bar plot
where the X-axis represents the bin ranges while the Y-axis gives information about frequency.
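A short sketch using Seaborn's built-in 'titanic' dataset is given below (histplot is assumed to be available, i.e. seaborn 0.11 or newer).

import seaborn as sns
import matplotlib.pyplot as plt

# Load the built-in Titanic dataset (891 rows)
titanic = sns.load_dataset("titanic")

# Look for a simple pattern, e.g. survival counts split by passenger sex
sns.countplot(x="sex", hue="survived", data=titanic)
plt.show()

# Histogram of the ticket fare distribution
sns.histplot(titanic["fare"], bins=40)
plt.xlabel("fare")
plt.show()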
Conclusion:
Hence, we have used the Seaborn library on the 'titanic' dataset and observed patterns in the
output. We have also plotted a histogram of the ticket prices paid by the passengers.
Practical No – 05
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for
distribution of age with respect to each gender along with the information about whether
they survived or not. (Column names: 'sex' and 'age')
2. Write observations on the inference from the above statistics.
Theory Concepts:
Data Visualization:
Data visualization represents text or numerical data in a visual format, which makes it easy
to grasp the information the data expresses. We humans remember pictures more easily than
readable text, so Python provides us various libraries for data visualization like matplotlib,
seaborn, plotly, etc. In this tutorial, we will use Matplotlib and Seaborn for performing various
techniques to explore data using various plots.
Creating hypotheses and testing various business assumptions while dealing with any machine
learning problem statement is very important, and this is what exploratory data analysis (EDA)
helps to accomplish. There are various tools and techniques to understand your data, and the
basic need is that you should have knowledge of NumPy for mathematical operations and
Pandas for data manipulation.
Univariate Analysis:
Univariate analysis is the simplest form of analysis, where we explore a single variable.
Univariate analysis is performed to describe the data in a better way. We perform univariate
analysis of numerical and categorical variables differently because they use different plots.
Categorical Data:
A variable that has text-based information is referred to as a categorical variable. Let's look at
the various plots which we can use for visualizing categorical data.
Titanic Dataset:
It is one of the most popular datasets used for understanding machine learning basics. It
contains information of all the passengers aboard the RMS Titanic, which unfortunately was
shipwrecked. This dataset can be used to predict whether a given passenger survived or not.
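A short sketch of the required box plot, using the same built-in 'titanic' dataset, might look as follows.

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset("titanic")

# Box plot of age for each gender, split by survival status
sns.boxplot(x="sex", y="age", hue="survived", data=titanic)
plt.title("Age distribution by gender and survival")
plt.show()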
Conclusion:
The columns that can be dropped are: Passenger Id, Name, Ticket, Cabin: They are strings,
cannot be categorized and don’t contribute much to the outcome. Age, Fare: Instead, the
respective range columns are retained. The titanic data can be analyzed using many more graph
techniques and also more column correlations, than, as described in this article
Practical No – 06
Download the Iris flower dataset or any other dataset into a Data Frame (e.g.,
https://archive.ics.uci.edu/ml/datasets/Iris). Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distribution and identify outliers.
Theory Concepts:
The Iris flower data set is a multivariate data set introduced by the British statistician and
biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic
problems. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the
data to quantify the morphologic variation of Iris flowers of three related species. The data set
consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris
versicolor). Four features were measured from each sample: the length and the width of the
sepals and petals, in centimeters. This dataset became a typical test case for many statistical
classification techniques in machine learning, such as support vector machines.
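A minimal Python sketch of the four tasks is given below; the file name iris.csv and its column layout (four numeric features plus a species label) are assumptions.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

iris = pd.read_csv("iris.csv")

# 1. Features and their types (numeric measurements, nominal species label)
print(iris.dtypes)

# 2. Histogram for each numeric feature
iris.hist(figsize=(8, 6))
plt.show()

# 3. Box plot for each numeric feature
sns.boxplot(data=iris.select_dtypes(include="number"))
plt.show()

# 4. Outliers appear as points beyond the box-plot whiskers
print(iris.describe())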
Conclusion:
Hence, we listed the features of the Iris dataset, plotted histograms and box plots for each
feature, and compared the distributions to identify outliers.
Practical No-07
Create a Linear Regression Model using Python/R to predict home prices using the Boston
Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset
contains information about various houses in Boston through different parameters. There are
506 samples and 14 feature variables in this dataset.
The objective is to predict the value of the prices of the houses using the given features.
Theory Concepts:
Linear Regression
In statistics, linear regression is a linear approach for modelling the relationship between a
scalar response and one or more explanatory variables. The case of one explanatory variable is
called simple linear regression; for more than one, the process is called multiple linear
regression.
Linear regression analysis is used to predict the value of a variable based on the value of
another variable. The variable you want to predict is called the dependent variable. The variable
you are using to predict the other variable's value is called the independent variable.
More precisely, linear regression is used to determine the character and strength of the
association between a dependent variable and a series of other independent variables. It helps
create models to make predictions, such as predicting a company's stock price.
Code & Output:
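A minimal scikit-learn sketch is given below. It assumes the Boston Housing data has been downloaded from Kaggle as boston.csv with the median home value in a column named medv; adjust the file and column names to the actual download.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset (file and target column names are assumptions)
df = pd.read_csv("boston.csv")
X = df.drop(columns=["medv"])
y = df["medv"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear regression model and evaluate it on the held-out data
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
print("R^2 :", r2_score(y_test, y_pred))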
Conclusion:
Hence, we understand the concept of data analytics and implement linear regression on the
Boston Housing dataset.
Practical No – 08
Implement logistic regression using Python/R to perform classification on the
Social_Network_Ads.csv dataset.
Theory Concepts:
Logistic Regression
In statistics, the logistic model is a statistical model that models the probability of an event
taking place by having the log-odds for the event be a linear combination of one or more
independent variables. In regression analysis, logistic regression estimates the parameters
of a logistic model.
1. Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
2. Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
True or False, etc., but instead of giving the exact values 0 and 1, it gives
probabilistic values which lie between 0 and 1.
3. Logistic regression is very similar to linear regression except in how they are
used. Linear regression is used for solving regression problems, whereas logistic
regression is used for solving classification problems.
4. In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
5. The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
6. Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
7. Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
[Figure: the S-shaped logistic (sigmoid) function]
Code & Output:
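A minimal scikit-learn sketch is given below; it assumes the usual Social_Network_Ads.csv layout with Age, EstimatedSalary, and Purchased columns.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

df = pd.read_csv("Social_Network_Ads.csv")
X = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling helps the logistic regression converge
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Confusion matrix and the derived metrics
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Error rate:", 1 - accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))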
Conclusion:
Hence, we understand the concept of data analytics and perform logistic regression on the
given dataset.
Practical No – 09
Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv dataset
Theory Concepts:
Naïve Bayes:
Naïve Bayes is a supervised classification algorithm based on Bayes' theorem with the "naïve"
assumption that the features are conditionally independent of each other given the class.
Despite this simplifying assumption, it works well for many problems and is fast to train; the
Gaussian variant is commonly used for continuous features such as the iris measurements.
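Code & Output:
A minimal scikit-learn sketch using the Gaussian variant of Naïve Bayes is given below; it assumes iris.csv has four numeric feature columns and a Species label column.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

df = pd.read_csv("iris.csv")
X = df.drop(columns=["Species"])
y = df["Species"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

# 3x3 confusion matrix for the three species; macro-averaged precision/recall
print(confusion_matrix(y_test, y_pred))
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Error rate:", 1 - accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred, average="macro"))
print("Recall    :", recall_score(y_test, y_pred, average="macro"))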
Conclusion:
Hence, we understand the concept of data analytics and implement the Naïve Bayes
classification algorithm on the iris dataset.
Practical No – 10
1. Extract a sample document and apply the following document preprocessing methods:
Tokenization, POS Tagging, stop word removal, Stemming and Lemmatization.
2. Create a representation of the document by calculating Term Frequency and Inverse
Document Frequency.
Theory Concepts:
Text mining is also referred to as text analytics. Text mining is the process of exploring
sizable textual data and finding patterns. Text mining processes the text itself, while NLP
processes the underlying metadata. Finding frequency counts of words, the length of a
sentence, or the presence/absence of specific words is known as text mining. Natural language
processing is one of the components of text mining. NLP helps identify sentiment,
find entities in a sentence, and categorize a blog/article. Text mining provides preprocessed
data for text analytics. In text analytics, statistical and machine learning algorithms are
used to classify information.
2.1. Tokenization:
Tokenization is the first step in text analytics. The process of breaking down a text
paragraph into smaller chunks such as words or sentences is called Tokenization.
A token is a single entity that is a building block of a sentence or paragraph. NLTK provides
the sent_tokenize() method for sentence tokenization and the word_tokenize() method for
word tokenization.
Lemmatization vs. Stemming
A stemming algorithm works by cutting the suffix from the word; in a broader sense, it cuts
either the beginning or the end of the word. Lemmatization, in contrast, reduces a word to its
dictionary base form (lemma) using vocabulary and morphological analysis, so it usually
produces valid words.
Example: the words 'connects', 'connected' and 'connecting' are all reduced to the stem 'connect'.
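A short NLTK-based sketch of these preprocessing steps is given below; the sample text is a placeholder document and the nltk.download() calls fetch the required resources on first use.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

text = "Text analytics helps in finding useful patterns. Tokenization is the first step."

sentences = sent_tokenize(text)            # sentence tokenization
words = word_tokenize(text)                # word tokenization
pos_tags = nltk.pos_tag(words)             # POS tagging
print(sentences)
print(pos_tags)

# Stop word removal, then stemming and lemmatization of the remaining tokens
stop_words = set(stopwords.words("english"))
filtered = [w for w in words if w.isalpha() and w.lower() not in stop_words]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(w) for w in filtered])
print([lemmatizer.lemmatize(w) for w in filtered])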
3. Term Frequency - Inverse Document Frequency (TF-IDF)
● Term Frequency (TF)
The initial step is to make a vocabulary of unique words and calculate the TF for each
document. TF will be higher for words that frequently appear in a document and
lower for rare words in a document.
● Inverse Document Frequency (IDF)
It is a measure of the importance of a word. Term frequency (TF) does not
consider the importance of words. Some words, such as 'of', 'and', etc., can be
very frequent but are of little significance. IDF provides a weight for
each word based on its frequency in the corpus D.
After applying TF-IDF, the text in documents A and B can be represented as a TF-IDF vector of
dimension equal to the size of the vocabulary. The value corresponding to each word represents
the importance of that word in a particular document.
TF-IDF is the product of TF and IDF. Since TF values lie between 0 and 1, not using the
logarithm (ln) in IDF can result in very high IDF values for some words, which would
dominate the TF-IDF. We do not want that, and therefore we use ln so that the IDF does not
completely dominate the TF-IDF.
● Disadvantage of TFIDF
It is unable to capture the semantics. For example, funny and humorous are synonyms,
but TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if
the vocabulary is vast.
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text
must be converted into vectors of numbers. In natural language processing, a
common technique for extracting features from text is to place all of the words that
occur in the text in a bucket. This approach is called a bag of words model or BoW
for short. It’s referred to as a “bag” of words because any information about the
structure of the sentence is lost.
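A short scikit-learn sketch of both representations is given below; the two sample documents are placeholders. Conceptually, tfidf(t, d) = tf(t, d) x idf(t) with idf(t) based on ln(N / df(t)), although TfidfVectorizer applies a smoothed variant of this formula by default.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "Data science uses statistics and programming.",
    "Big data analytics uses distributed processing.",
]

# Bag of Words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(documents).toarray())
print(bow.get_feature_names_out())

# TF-IDF: term counts weighted down for words that appear in many documents
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(documents).toarray())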
Practical No – 11
Title: Write a code in Java for a simple Word Count application that counts the number of
occurrences of each word in a given input set using the Hadoop MapReduce framework on a
local standalone setup.
Theory Concepts:
Hadoop is an open-source framework from Apache and is used to store, process, and analyze
data which are very huge in volume. Hadoop is written in Java and is not OLAP (online
analytical processing). It is used for batch/offline processing. It is being used by Facebook,
Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding
nodes in the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks and
stored in nodes over the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and managing the
cluster.
3. MapReduce: This is a framework which helps Java programs to do parallel
computation on data using key-value pairs. The Map task takes input data and converts
it into a data set which can be computed as key-value pairs. The output of the Map task is
consumed by the Reduce task, and then the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
MapReduce:
MapReduce is a processing technique and a program model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). Secondly, reduce task, which takes the output from
a map as an input and combines those data tuples into a smaller set of tuples. As the sequence
of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in the
MapReduce form, scaling the application to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is merely a configuration change. This simple scalability is
what has attracted many programmers to use the MapReduce model.
MapReduce Phase:
Input Splits:
An input in the MapReduce model is divided into small fixed-size parts called input splits. This
part of the input is consumed by a single map. The input data is generally a file or directory
stored in the HDFS.
Mapping:
This is the first phase in the map-reduce program execution where the data in each split is
passed line by line, to a mapper function to process it and produce the output values.
Shuffling:
It is a part of the output phase of Mapping where the relevant records are consolidated from
the output. It consists of merging and sorting. So, all the key-value pairs which have the same
keys are combined. In sorting, the inputs from the merging step are taken and sorted. It returns
key-value pairs, sorting the output.
Reduce:
All the values from the shuffling phase are combined and a single output value is returned.
Thus, summarizing the entire dataset.
Code:
// WC_Runner.java (driver class)
package com.wc;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WC_Runner {
    public static void main(String[] args) throws IOException {
        // Configure the job: mapper, combiner, reducer and the I/O formats
        JobConf conf = new JobConf(WC_Runner.class);
        conf.setJobName("WordCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WC_Mapper.class);
        conf.setCombinerClass(WC_Reducer.class);
        conf.setReducerClass(WC_Reducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
// WC_Mapper.java
package com.wc;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class WC_Mapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Emit (word, 1) for every token in the input line
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
// WC_Reducer.java
package com.wc;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class WC_Reducer extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum the counts emitted by the mappers for each word
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Output:
Input:
Result:
HDFS 1
Hadoop 2
MapReduce 1
a 2
for 1
is 2
of 1
processing 1
storage 1
tool 1
unit 1
Conclusion:
The MapReduce framework is a powerful tool for processing large-scale datasets in a
distributed manner. It provides a simple and efficient way to analyze big data using commodity
hardware. The key to using MapReduce effectively is to design the map and reduce functions
carefully to take advantage of the distributed nature of the framework.
Practical No-12
Title: Design a distributed application using MapReduce which processes a log file of a system.
Theory Concepts:
MapReduce:
The reduce job takes the output from a map as input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always
performed after the map job.
MapReduce programming offers several benefits that help you gain valuable insights from
your big data.
Code:
// SalesCountryDriver.java
package SalesCountry;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class SalesCountryDriver {
    public static void main(String[] args) {
        JobClient my_client = new JobClient();
        // Create a configuration object for the job
        JobConf job_conf = new JobConf(SalesCountryDriver.class);
        // Set a name of the Job
        job_conf.setJobName("SalePerCountry");
        // Specify data type of output key and value
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);
        // Specify names of Mapper and Reducer Class
        job_conf.setMapperClass(SalesCountry.SalesMapper.class);
        job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);
        // Specify formats of the data type of Input and output
        job_conf.setInputFormat(TextInputFormat.class);
        job_conf.setOutputFormat(TextOutputFormat.class);
        // Set input and output directories using command line arguments:
        // args[0] = input directory on HDFS, args[1] = output directory on HDFS
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));
        my_client.setConf(job_conf);
        try {
            // Run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
// SalesMapper.java
package SalesCountry;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class SalesMapper extends MapReduceBase implements Mapper<LongWritable,
Text, Text,
IntWritable> {
private final static IntWritable one = new IntWritable(1);
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException {
String valueString = value.toString();
String[] SingleCountryData = valueString.split(",");
output.collect(new Text(SingleCountryData[7]), one);
}
}
// SalesCountryReducer.java
package SalesCountry;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class SalesCountryReducer extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text t_key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Sum the per-record counts for each country key
        int frequencyForCountry = 0;
        while (values.hasNext()) {
            // replace type of value with the actual type of our value
            IntWritable value = (IntWritable) values.next();
            frequencyForCountry += value.get();
        }
        output.collect(t_key, new IntWritable(frequencyForCountry));
    }
}
Output:
Argentina 1
Australia 38
Austria 7
Bahrain 1
Belgium 8
Bermuda 1
Brazil 5
Bulgaria 1
CO 1
Canada 76
Cayman Isls 1
China 1
Costa Rica 1
Country 1
Czech Republic 3
Denmark 15
Dominican Republic 1
Finland 2
France 27
Germany 25
Greece 1
Guatemala 1
Hong Kong 1
Hungary 3
Iceland 1
India 2
Ireland 49
Israel 1
Italy 15
Japan 2
Jersey 1
Kuwait 1
Latvia 1
Luxembourg 1
Malaysia 1
Malta 2
Mauritius 1
Moldova 1
Monaco 2
Netherlands 22
New Zealand 6
Norway 16
Philippines 2
Poland 2
Romania 1
Russia 1
South Africa 5
South Korea 1
Spain 12
Sweden 13
Switzerland 36
Thailand 2
The Bahamas 2
Turkey 6
Ukraine 1
Conclusion:
MapReduce is an effective tool for processing large log files, requiring careful consideration of
data partitioning, fault tolerance, scalability, performance optimization, and data processing.
Practical No-13
Title: Locate a dataset for working on weather data which reads the text input files and finds
the average for temperature, dew point, and wind speed.
Theory Concepts:
Hadoop is an Apache open-source framework written in Java that allows distributed processing
of large datasets across clusters of computers using simple programming models. The Hadoop
framework application works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from a single server
to thousands of machines, each offering local computation and storage.
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed
on low-cost hardware. It provides high throughput access to application data and is suitable for
applications having large datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
• Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
Code:
// MaxTemperatureDriver.java
package MaxMinTemp;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MaxTemperatureDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperatureDriver <inputpath> <outputpath>");
            System.exit(-1);
        }
        // Configure the job: input/output paths, mapper, reducer and output types
        Job job = Job.getInstance(getConf());
        job.setJarByClass(MaxTemperatureDriver.class);
        job.setJobName("Max Temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
        System.exit(exitCode);
    }
}
// MaxTemperatureMapper.java
package MaxMinTemp;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse one fixed-width NCDC weather record: year and air temperature
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') {   // parseInt does not accept a leading plus sign
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
// MaxTemperatureReducer.java
package MaxMinTemp;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Emit the maximum temperature observed for each year
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
Output:
1901 317
1902 244
1903 289
1904 256
1905 283
1906 294
1907 283
1908 289
1909 278
1910 294
1911 306
1912 322
1913 300
1914 333
1915 294
1916 278
1917 317
1918 322
1919 378
1920 294
Conclusion:
The weather data analysis application built with Hadoop and MapReduce provides an efficient
and scalable way to process and analyze a large amount of weather data.