
Practical Lab Manual

Data Science & Big Data Analytics


(310256)
For Third Year Computer Engineering (2019 Course)

Student Name:

Seat No / Roll No: Class:

Branch:

DEPARTMENT OF COMPUTER ENGINEERING

Al Jamia Mohammadiyah Education Society’s


MAULANA MUKHTAR AHMAD NADVI TECHNICAL CAMPUS
Mansoora Campus, Malegaon (Nashik)
Academic Year 2024-25
Department of Computer Engineering

Certificate
This is to certify that………………………………………………………………………….....

Roll No: ……………Exam Seat No……………...………. Class: ……………………………

Program………………………………………………………………………………………of

Institute………………………………………………………………………………….……..

has successfully completed the Term Work / Assignments satisfactorily in the Course

………………………………………………...……Course Code……...…………… for the

Academic Year 2024-25 as prescribed in the curriculum of Third Year Computer

Engineering (2019 Course), Savitribai Phule Pune University.

Subject Teacher Head of Department Principal


Al Jamia Mohammadiyah Education Society’s
MAULANA MUKHTAR AHMAD NADVI TECHNICAL CAMPUS
Mansoora Campus, Malegaon (Nashik)

VISION OF THE INSTITUTE

Empowering society through quality education and research for the socio-economic
development of the region.

MISSION OF THE INSTITUTE


● Inspire students to achieve excellence in science and engineering.
● Commit to making quality education accessible and affordable to serve society.
● Provide transformative, holistic, and value-based immersive learning experiences
for students.
● Transform into an institution of global standards that contributes to nation-
building.
● Develop sustainable, cost-effective solutions through innovation and research.
● Promote quality education in rural areas.

VISION OF THE DEPARTMENT


To build strong research and learning environment producing globally competent
professionals and innovators who will contribute to the betterment of the society.

MISSION OF THE DEPARTMENT


● To create and sustain an academic environment conducive to the highest
level of research and teaching.
● To provide state-of-the-art laboratories which will be up to date with the
new developments in the area of computer engineering.
● To organize competitive events, industry interactions and global
collaborations that provide a nurturing environment for students to
prepare for successful careers and the ability to tackle lifelong challenges in
global industrial needs.
● To educate students to be socially and ethically responsible citizens in view
of national and global development.
DEPARTMENT OF COMPUTER ENGINEERING

Program Outcomes (POs)


Learners are expected to know and be able to:

PO1  Engineering Knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering problems.

PO2  Problem Analysis: Identify, formulate, review research literature and analyze complex
engineering problems, reaching substantiated conclusions using first principles of mathematics,
natural sciences and engineering sciences.

PO3  Design / Development of Solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for public health and safety, and cultural, societal, and environmental considerations.

PO4  Conduct Investigations of Complex Problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of
the information to provide valid conclusions.

PO5  Modern Tool Usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools, including prediction and modeling, to complex engineering activities
with an understanding of the limitations.

PO6  The Engineer and Society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
professional engineering practice.

PO7  Environment and Sustainability: Understand the impact of professional engineering solutions
in societal and environmental contexts, and demonstrate the knowledge of, and need for,
sustainable development.

PO8  Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of engineering practice.

PO9  Individual and Team Work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.

PO10 Communication Skills: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.

PO11 Project Management and Finance: Demonstrate knowledge and understanding of engineering
and management principles and apply these to one's own work, as a member and leader in a
team, to manage projects and in multidisciplinary environments.

PO12 Life-long Learning: Recognize the need for, and have the preparation and ability to engage in,
independent and life-long learning in the broadest context of technological change.
DEPARTMENT OF COMPUTER ENGINEERING

Program Specific Outcomes (PSOs)


A graduate of the Computer Engineering Program will demonstrate

PSO1
Professional Skills- The ability to understand, analyze and develop computer programs in the
areas related to algorithms, system software, multimedia, web design, big data analytics, and
networking for efficient design of computer-based systems of varying complexities.

PSO2
Problem-Solving Skills- The ability to apply standard practices and strategies in software
project development using open-ended programming environments to deliver a quality product
for business success.

PSO3
Successful Career and Entrepreneurship- The ability to employ modern computer
languages, environments and platforms in creating innovative career paths to be an
entrepreneur and to have a zest for higher studies.
SAVITRIBAI PHULE PUNE UNIVERSITY

THIRD YEAR OF COMPUTER ENGINEERING (2019 COURSE)


310256: DATA SCIENCE & BIG DATA ANALYTICS
LABORATORY

Course Objectives:
• To understand principles of Data Science for the analysis of real time problems
• To develop in depth understanding and implementation of the key technologies in Data Science and
Big Data Analytics
• To analyze and demonstrate knowledge of statistical data analysis techniques for decision-making
• To gain practical, hands-on experience with statistics programming languages and Big Data tools

Course Outcomes:
On completion of the course, learner will be able to
CO1: Apply principles of Data Science for the analysis of real time problems
CO2: Implement data representation using statistical methods
CO3: Implement and evaluate data analytics algorithms
CO4: Perform text preprocessing
CO5: Implement data visualization techniques
CO6: Use cutting edge tools and technologies to analyze Big Data
Suggested List of Laboratory Experiments/Assignments
Sr. No. Group A
Data Wrangling, I
1 Perform the following operations using Python on any open source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open source data from the web (e.g., https://www.kaggle.com). Provide a
clear description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas dataframe.
4. Data Preprocessing: check for missing values in the data using pandas isnull(),
describe() function to get some initial statistics. Provide variable descriptions. Types of
variables etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by
checking the data types (i.e., character, numeric, integer, factor, and logical) of the
variables in the data set. If variables are not in the correct data type, apply proper type
conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the codes and outputs, explain every operation that you do in the above steps
and explain everything that you do to import/read/scrape the data set.
Data Wrangling II
2 Create an “Academic performance” dataset of students and perform the following
operations using Python.
1. Scan all variables for missing values and inconsistencies. If there are missing
values and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.
Reason and document your approach properly.
Descriptive Statistics - Measures of Central Tendency and variability
3 Perform the following operations on any open source dataset (e.g., data.csv)
1. Provide summary statistics (mean, median, minimum, maximum, standard
deviation) for a dataset (age, income etc.) with numeric variables grouped by one of
the qualitative (categorical) variable. For example, if your categorical variable is age
groups and quantitative variable is income, then provide summary statistics of income
grouped by the age groups. Create a list that contains a numeric value for each
response to the categorical variable.
2. Write a Python program to display some basic statistical details like percentile,
mean, standard deviation etc. of the species of ‘Iris-setosa’, ‘Iris-versicolor’ and ‘Iris-
virginica’ of the iris.csv dataset.
Provide the codes with outputs and explain everything that you do in this step.
Data Analytics I
Create a Linear Regression Model using Python/R to predict home prices using Boston
Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset
4 contains information about various houses in Boston through different parameters. There
are 506 samples and 14 feature variables in this dataset.
The objective is to predict the value of prices of the house using the given features.
5 Data Analytics II
1. Implement logistic regression using Python/R to perform classification on
Social_Network_Ads.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision,
Recall on the given dataset.
6 Data Analytics III
1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv
dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision,
Recall on the given dataset.
7 Text Analytics
1. Extract Sample document and apply following document preprocessing methods:
Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse
Document Frequency.

8 Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains
information about the passengers who boarded the unfortunate Titanic ship. Use the
Seaborn library to see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each
passenger is distributed by plotting a histogram.
9 Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for
distribution of age with respect to each gender along with the information about whether
they survived or not. (Column names : 'sex' and 'age')
2. Write observations on the inference from the above statistics.

10 Data Visualization III


Download the Iris flower dataset or any other dataset into a DataFrame. (e.g.,
https://archive.ics.uci.edu/ml/datasets/Iris). Scan the dataset and give the inference as:
1. List down the features and their types (e.g., numeric, nominal) available in the
dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a boxplot for each feature in the dataset.
4. Compare distributions and identify outliers.
Group B
Write a code in JAVA for a simple WordCount application that counts the number of
1 occurrences of each word in a given input set using the Hadoop MapReduce framework
on local-standalone set-up.
Design a distributed application using MapReduce which processes a log file of a system.
2

Locate dataset (e.g., sample_weather.txt) for working on weather data which reads the
3
text input files and finds average for temperature, dew point and wind speed.
4 Write a simple program in SCALA using Apache Spark framework

Group C
Write a case study on Global Innovation Network and Analysis (GINA). Components of the
1 analytic plan are: 1. Discovery (business problem framed), 2. Data, 3. Model planning (analytic
technique), and 4. Results and key findings.
Use the following dataset and classify tweets into positive and negative tweets.
2 https://www.kaggle.com/ruchi798/data-science-tweets

3 Develop a movie recommendation model using the scikit-learn library in Python. Refer to the
dataset https://github.com/rashida048/Some-NLP-Projects/blob/master/movie_dataset.csv

4  Use the following covid_vaccine_statewise.csv dataset and perform the following analytics on
the given dataset (https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_vaccine_statewise.csv):
   a. Describe the dataset
   b. Number of persons state-wise vaccinated for first dose in India
   c. Number of persons state-wise vaccinated for second dose in India
   d. Number of males vaccinated
   e. Number of females vaccinated

Write a case study on data-driven processing for Digital Marketing OR Health care systems
with the Hadoop Ecosystem components listed below. (Mandatory)
● HDFS: Hadoop Distributed File System
5 ● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database (Provides real-time reads and writes)
● Mahout, Spark MLlib: Machine Learning algorithm libraries (provide analytical tools)
● Solr, Lucene: Searching and Indexing
INDEX
Sr. No    Name Of The Experiment    Date Of Performance    Date Of Completion    Marks and Sign

Group A: Data Science

1. Data Wrangling I
Perform the following operations using Python on any open
source dataset (e.g., data.csv)
1. Import all the required Python Libraries.
2. Locate an open source data from the web (e.g.,
https://www.kaggle.com). Provide a clear description
of the data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas data frame.
4. Data Pre-processing: check for missing values in the
data using the pandas isnull() and describe() functions to get
some initial statistics. Provide variable descriptions.
Type of variables etc. Check the dimensions of the
data frame.
5. Data Formatting and Data Normalization: Summarize
the types of variables by checking the data types (i.e.,
character, numeric, integer, factor, and logical) of the
variables in the data set. If variables are not in the
correct data types, apply proper type conversions.
6. Turn categorical variables into quantitative
variables in python.
In addition to the codes and outputs, explain every operation that
you do in the above steps and explain everything that you do to
import/read/scrape the data set.
2 Data Wrangling II
Create an “Academic performance” dataset of student and
perform the following operations using python.
1. Scan all variables for missing values and inconsistencies.
If there are missing values and/or inconsistencies, use any
of the suitable techniques to deal with them.
2. Scan all numeric variable for outliers. If there are outliers,
use any of the suitable techniques to deal with them.
3. Apply data transformation on at least one of the variable.
The purpose this transformation should be one of the
following reasons: to change the scale for better
understanding of the variable, to convert a non-linear
relation into a linear one, or to decrease the skewness and
convert the distribution into a normal distribution.
Reason and document your approach properly.

3 Descriptive Statistics – Measures of Central Tendency and


Variability
Perform the following operations on any open source dataset
(e.g., data.csv)
1. provide summary statistics (Mean, Median, Minimum,
Maximum, Standard Deviation) for a dataset (age,
income etc.) with numeric variable grouped by one of the
qualitative (categorical) variable. For example, if your
categorical variable is age groups and quantitative
variable is income
, then provide summary statistics of income grouped by
the age groups. Create a list that contains a numeric
value for each response to the categorical variable
2. Write a python program to display some basic statistical
details like percentile, mean, standard deviation etc. of the
species of ‘Iris-Setosa’, ‘Iris- Versicolor’ and ‘Iris-
Virginica’ of iris.csv dataset.
Provide the codes with outputs and explain everything
that you do in this step.
4 Data Visualization I
1. Use the inbuilt dataset 'titanic'. The dataset contains 891
rows and contains information about the passengers who
boarded the unfortunate Titanic ship. Use the Seaborn
library to see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket
(column name: 'fare') for each passenger is distributed by
plotting a histogram.

5 Data Visualization II
1. Use the inbuilt dataset 'titanic' as used in the above
problem. Plot a box plot for distribution of age with
respect to each gender along with the information about
whether they survived or not. (Column names: 'sex' and
'age')
2. Write observations on the inference from the
above statistics.
6 Data Visualization III
Download the Iris flower dataset or any other dataset into a
Data Frame (e.g., https://archive.ics.uci.edu/ml/datasets/Iris). Scan
the dataset and give the inference as:
1. List down the features and their types (e.g., numeric,
nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to
illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distribution and identify outliers.

7 Data Analytics-I
Create a Linear Regression Model using Python/R to predict home
prices using Boston Housing Dataset
(https://www.kaggle.com/c/boston-housing). The Boston
Housing dataset contains information about various houses in
Boston through different parameters. There are 506 samples & 14
feature variable in this dataset.
The objective is to predict the value of prices of the
house using the given features.
8 Data Analytics-II
Implement logistic regression using Python/R to perform
classification on Social_Network_Ads.csv.
9 Data Analytics-III
Implement Simple Naïve Bayes classification algorithm using
Python/R on iris.csv dataset
10 Text Analytics:
1. Extract Sample document and apply following document
preprocessing methods: Tokenization, POS Tagging, stop words
removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term
Frequency and Inverse Document Frequency.

Group B: Big Data Analytics – JAVA/SCALA

11 Write a code in Java for a simple Word Count application that


counts the number of occurrences of each word in a given input set
using the Hadoop MapReduce framework on local standalone
setup.

12 Design a distributed application using MapReduce which processes


a log file of a system.
13 Locate dataset for working on weather data which reads the text
input files & finds the average for temperature, dew point &
wind speed.

Group C: Mini Projects/ Case Study – PYTHON/R

14 Mini Project 1

15 Mini Project 2
Practical No – 01

Title: Data Wrangling I

Perform the following operations using Python on any open-source dataset (e.g., data.csv)

1. Import all the required Python Libraries.


2. Locate an open source data from the web (e.g., https://www.kaggle.com). Provide a
clear description of the data and its source (i.e., URL of the web site).
3. Load the Dataset into pandas data frame.
4. Data Pre-processing: check for missing values in the data using the pandas isnull() and
describe() functions to get some initial statistics. Provide variable descriptions, types of
variables etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by
checking the data types (i.e., character, numeric, integer, factor, and logical) of the
variables in the data set. If variables are not in the correct data types, apply proper type
conversions.
6. Turn categorical variables into quantitative variables in python.
Explain everything that you do to import/read/scrape the data set.

Requirements: Python, Anaconda, Jupyter Notebook.

Theory Concepts:

NumPy:

NumPy is a Python library used for working with arrays. It also has functions for working in
the domain of linear algebra, Fourier transforms, and matrices. NumPy was created in 2005 by
Travis Oliphant. It is an open source project and you can use it freely. In Python we have lists
that serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array
object that is up to 50x faster than traditional Python lists. The array object in NumPy is called
ndarray; it provides a lot of supporting functions that make working with ndarray very easy.
Arrays are very frequently used in data science, where speed and resources are very important.

Pandas:

Pandas is a Python library for data analysis. Started by Wes McKinney in 2008 out of a need
for a powerful and flexible quantitative analysis tool, pandas has grown into one of the most
popular Python libraries. It has an extremely active community of contributors. Pandas is built
on top of two core Python libraries matplotlib for data visualization and NumPy for
mathematical operations. Pandas acts as a wrapper over these libraries, allowing you to access
many of matplotlib's and NumPy's methods with less code. For instance, pandas' .plot()
combines multiple matplotlib methods into a single method, enabling you to plot a chart in a
few lines.

Code & Outputs:
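The original code screenshots are not reproduced here. The following is a minimal Python sketch of the steps above; the file name data.csv and the column names used in the commented lines ('age', 'category') are placeholders to be replaced with those of the chosen dataset.

import pandas as pd
import numpy as np

# 3. Load the dataset into a pandas DataFrame ("data.csv" is a placeholder file name)
df = pd.read_csv("data.csv")

# 4. Data pre-processing: missing values, initial statistics, variable types, dimensions
print(df.isnull().sum())   # count of missing values per column
print(df.describe())       # basic statistics for the numeric columns
print(df.dtypes)           # data type of each variable
print(df.shape)            # (rows, columns)

# 5. Data formatting: convert a variable to the correct type if needed
# df["age"] = df["age"].astype(int)               # 'age' is an assumed column name

# 6. Turn a categorical variable into quantitative (dummy/indicator) variables
# df = pd.get_dummies(df, columns=["category"])   # 'category' is an assumed column name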

Conclusion:

Hence we can perform all the Data Wrangling steps on an open-source dataset.

Practical No – 02

Title: Data Wrangling II

Create an “Academic performance” dataset of students and perform the following operations
using Python.

1. Scan all variables for missing values and inconsistencies. If there are missing values
and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable
techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to
decrease the skewness and convert the distribution into a normal distribution.
Reason and document your approach properly.

Requirements: Python, Anaconda, Jupyter Notebook.

Theory Concepts:

Data Transformation:

Data Transformation is the process of converting data from one format to another, typically
from the format of a source system into the required format of a destination system. Data
Transformation is a component of most data integration, data wrangling and data
warehousing tasks.

Data transformation is a technique used to convert the raw data into a suitable format that
efficiently eases data mining and retrieves strategic information. Data transformation includes
data cleaning techniques to convert the data into the appropriate form.

4 Types of Data Transformation

1. Constructive:

The data transformation process adds, copies, or replicates data.

2. Destructive:

The system deletes fields or records

3. Aesthetic:

The transformation standardizes the data to meet requirements or parameters.

4. Structural:

The database is reorganized by renaming, moving, or combining columns.

Code & Outputs:
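The original code screenshots are not reproduced here. Below is a minimal Python sketch under the assumption of a small, self-created "Academic performance" table; the column names (math_score, attendance) are illustrative only.

import pandas as pd
import numpy as np

# A small, hypothetical academic-performance dataset with a missing value and an outlier
df = pd.DataFrame({
    "roll_no": [1, 2, 3, 4, 5],
    "math_score": [65, 72, np.nan, 88, 300],        # 300 is an inconsistent/outlier mark
    "attendance": [0.90, 0.85, 0.70, np.nan, 0.95],
})

# 1. Handle missing values (here: fill with the column median)
df["math_score"] = df["math_score"].fillna(df["math_score"].median())
df["attendance"] = df["attendance"].fillna(df["attendance"].median())

# 2. Detect outliers with the IQR rule and cap them to the whisker limits
q1, q3 = df["math_score"].quantile([0.25, 0.75])
iqr = q3 - q1
df["math_score"] = df["math_score"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Apply a log transformation to reduce skewness of the score distribution
df["math_score_log"] = np.log1p(df["math_score"])
print(df)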

Conclusion:

Hence we can perform all the Data Wrangling steps, including Data Transformation, on an
open-source dataset.

Practical No – 03

Title: Descriptive Statistics – Measures of Central Tendency and Variability

Perform the following operations on any open source dataset (e.g., data.csv)

1. provide summary statistics (Mean, Median, Minimum, Maximum, Standard Deviation)


for a dataset (age, income etc.) with numeric variable grouped by one of the qualitative
(categorical) variable. For example, if your categorical variable is age groups and
quantitative variable is income , then provide summary statistics of income grouped by
the age groups. Create a list that contains a numeric value for each response to the
categorical variable
2. Write a python program to display some basic statistical details like percentile, mean,
standard deviation etc. of the species of ‘Iris-Setosa’, ‘Iris-Versicolor’ and ‘Iris-
Virginica’ of iris.csv dataset.
Provide the codes with outputs and explain everything that you do in this step.

Requirements: Python, Anaconda, Jupyter Notebook.

Theory Concepts:

Mean:

The arithmetic mean is the sum of a set of numbers divided by the number of numbers in
the collection; it is also simply called the mean or the average.

Median:

In a sorted, ascending or descending, list of numbers, the median is the middle number and
may be more representative of that data set than the average.

Mode:

The mode is the value that most frequently appears in a data value set.

Standard Deviation:

The standard deviation is a measure of the amount of variation or dispersion of a set of values.

Variance:

The variance is the expectation of the squared deviation of a random variable from its mean.

Code & Outputs:
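The original code screenshots are not reproduced here. A minimal Python sketch of both parts is given below; data.csv, age_group and income are placeholder names, and the iris.csv column names are assumed to follow the common Kaggle layout.

import pandas as pd

# Part 1: summary statistics of a numeric variable grouped by a categorical variable
df = pd.read_csv("data.csv")                      # placeholder dataset
summary = df.groupby("age_group")["income"].agg(["mean", "median", "min", "max", "std"])
print(summary)

# A list containing one numeric value (mean income) per category
income_per_group = df.groupby("age_group")["income"].mean().tolist()
print(income_per_group)

# Part 2: basic statistical details for each Iris species
iris = pd.read_csv("iris.csv")                    # adjust column names to the actual file
for species in ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]:
    print(species)
    print(iris[iris["Species"] == species].describe())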

Practical No – 04

Title: Data Visualization I

1. Use the inbuilt dataset 'titanic'. The dataset contains 891 rows and contains information
about the passengers who boarded the unfortunate Titanic ship. Use the Seaborn library
to see if we can find any patterns in the data.
2. Write a code to check how the price of the ticket (column name: 'fare') for each
passenger is distributed by plotting a histogram.

Requirements: Python, Anaconda, Jupyter Notebook.

Theory Concepts:

Seaborn library:

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level


interface for drawing attractive and informative statistical graphics.

Histogram:

A histogram is basically used to represent data provided in the form of groups. It is an accurate
method for the graphical representation of a numerical data distribution. It is a type of bar plot
where the X-axis represents the bin ranges while the Y-axis gives information about frequency.

Code & Outputs:
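The original code screenshots are not reproduced here. A minimal sketch using the Seaborn library (a recent version is assumed) follows.

import seaborn as sns
import matplotlib.pyplot as plt

# Load the built-in 'titanic' dataset shipped with seaborn (891 rows)
titanic = sns.load_dataset("titanic")
print(titanic.head())
print(titanic.describe())

# Histogram of the ticket price ('fare') for each passenger
sns.histplot(titanic["fare"], bins=40)
plt.xlabel("Fare")
plt.ylabel("Number of passengers")
plt.show()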

Conclusion:

We have used the Seaborn library on the “titanic” dataset and, based on the output, observed the
patterns. We have also plotted a histogram of the ticket prices of the passengers.

Practical No – 05

Title: Data Visualization II

1. Use the inbuilt dataset 'titanic' as used in the above problem. Plot a box plot for
distribution of age with respect to each gender along with the information about whether
they survived or not. (Column names: 'sex' and 'age')
2. Write observations on the inference from the above statistics.

Requirements: Python, Anaconda, Jupyter Notebook.

Theory Concepts:

Data Visualization:

Data Visualization represents the text or numerical data in a visual format, which makes it easy
to grasp the information the data express. We, humans, remember the pictures more easily than
readable text, so Python provides us various libraries for data visualization like matplotlib,
seaborn, plotly, etc. In this tutorial, we will use Matplotlib and seaborn for performing various
techniques to explore data using various plots.

Exploratory Data Analysis:

Creating hypotheses and testing various business assumptions while dealing with any machine
learning problem statement is very important, and this is what EDA helps to accomplish. There
are various tools and techniques to understand your data, and the basic need is knowledge of
NumPy for mathematical operations and Pandas for data manipulation.

Univariate Analysis:

Univariate analysis is the simplest form of analysis where we explore a single variable.
Univariate analysis is performed to describe the data in a better way. We perform univariate
analysis of numerical and categorical variables differently because each requires different plots.

Categorical Data:

A variable that has text-based information is referred to as a categorical variable. Let’s look at
various plots which we can use for visualizing categorical data.

Titanic Dataset:

It is one of the most popular datasets used for understanding machine learning basics. It
contains information of all the passengers aboard the RMS Titanic, which unfortunately was
shipwrecked. This dataset can be used to predict whether a given passenger survived or not.

Code & Output:
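The original code screenshots are not reproduced here. A minimal sketch of the box plot described above follows.

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset("titanic")

# Box plot of age for each gender, further split by survival status
sns.boxplot(data=titanic, x="sex", y="age", hue="survived")
plt.title("Age distribution by gender and survival")
plt.show()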

Conclusion:

The columns that can be dropped are Passenger Id, Name, Ticket and Cabin: they are strings that
cannot be categorized and do not contribute much to the outcome. For Age and Fare, the
respective range columns are retained instead. The Titanic data can be analyzed using many more
graph techniques and column correlations than those described here.

Practical No – 06

Title: Data Visualization III

Download the Iris flower dataset or any other dataset into a Data Frame (e.g.,
https://archive.ics.uci.edu/ml/datasets/Iris). Scan the dataset and give the inference as:

1. List down the features and their types (e.g., numeric, nominal) available in the dataset.
2. Create a histogram for each feature in the dataset to illustrate the feature distributions.
3. Create a box plot for each feature in the dataset.
4. Compare distribution and identify outliers.

Requirements: Python, Anaconda, Jupyter Notebook.

Theory Concepts:

Data Set Information:

The Iris flower data set is a multivariate data set introduced by the British statistician and
biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic
problems. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the
data to quantify the morphologic variation of Iris flowers of three related species. The data set
consists of 50 samples from each of three species of Iris (Iris Setosa, Iris virginica, and Iris
versicolor). Four features were measured from each sample: the length and the width of the
sepals and petals, in centimeters. This dataset became a typical test case for many statistical
classification techniques in machine learning, such as support vector machines.

Codes & Outputs:
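The original code screenshots are not reproduced here. A minimal sketch is given below; it uses seaborn's bundled copy of the Iris data for convenience, so the column names (sepal_length, ..., species) differ slightly from the UCI file.

import seaborn as sns
import matplotlib.pyplot as plt

iris = sns.load_dataset("iris")

# 1. Features and their types: four numeric features and one nominal feature ('species')
print(iris.dtypes)

# 2. Histogram for each numeric feature
iris.hist(figsize=(8, 6))
plt.show()

# 3. Box plot for each numeric feature
iris.plot(kind="box", subplots=True, layout=(2, 2), figsize=(8, 6))
plt.show()

# 4. Outliers appear as points beyond the box-plot whiskers (e.g., in sepal_width)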

Conclusion:

Therefore, we studied the Iris flower dataset and explored it using a pandas DataFrame.

Practical No-07

Title: Data Analytics-I

Create a Linear Regression Model using Python/R to predict home prices using Boston
Housing Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset
contains information about various houses in Boston through different parameters. There are
506 samples and 14 feature variables in this dataset.

The objective is to predict the value of the prices of the houses using the given features.

Requirements: Python, Anaconda, Jupyter Notebook.

Theory Concepts:

Linear Regression

In statistics, linear regression is a linear approach for modelling the relationship between a
scalar response and one or more explanatory variables. The case of one explanatory variable is
called simple linear regression; for more than one, the process is called multiple linear
regression.

Linear regression analysis is used to predict the value of a variable based on the value of
another variable. The variable you want to predict is called the dependent variable. The variable
you are using to predict the other variable's value is called the independent variable.

More precisely, linear regression is used to determine the character and strength of the
association between a dependent variable and a series of other independent variables. It helps
create models to make predictions, such as predicting a company's stock price.

Code & Output:
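The original code screenshots are not reproduced here. Since recent scikit-learn versions no longer ship the Boston dataset, the sketch below assumes the Kaggle CSV has been downloaded locally as housing.csv with the median house value in a column named MEDV (adjust names to the actual file).

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("housing.csv")              # placeholder file name
X = df.drop(columns=["MEDV"])                # 13 feature variables
y = df["MEDV"]                               # target: median home price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE :", mean_squared_error(y_test, y_pred))
print("R^2 :", r2_score(y_test, y_pred))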

Conclusion:

Hence we understood the concept of Data Analytics and implemented Linear Regression on the
Boston Housing dataset.

Practical No-08

Title: Data Analytics-II

Implement logistic regression using Python/R to perform classification on


Social_Network_Ads.csv.

Requirements: Python, Anaconda, Jupyter Notebook.

Theory Concepts:

Logistic Regression

In statistics, the logistic model is a statistical model that models the probability of an event
taking place by having the log-odds for the event be a linear combination of one or more
independent variables. In regression analysis, logistic regression estimates the parameters
of a logistic model.

1. Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
2. Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
True or False, etc., but instead of giving the exact values 0 and 1, it gives
probabilistic values which lie between 0 and 1.
3. Logistic Regression is similar to Linear Regression except in how they are
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
4. In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
5. The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
6. Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
7. Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The logistic (sigmoid) function maps any real-valued input to a probability between 0 and 1.

Code & Output:
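The original code screenshots are not reproduced here. A minimal sketch follows, assuming Social_Network_Ads.csv has the usual columns Age, EstimatedSalary and Purchased.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

df = pd.read_csv("Social_Network_Ads.csv")
X = df[["Age", "EstimatedSalary"]]
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Confusion matrix and the derived metrics
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Error rate:", 1 - accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))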

Conclusion:

Hence we understood the concept of Data Analytics and performed Logistic Regression on the
given dataset.

Practical No-09

Title: Data Analytics-III

Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv dataset

Requirements: Python, Anaconda, Jupyter Notebook.

Theory Concepts:

Naïve Bayes:

1. Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes


theorem and used for solving classification problems.
2. It is mainly used in text classification that includes a high-dimensional training dataset.
3. Naïve Bayes Classifier is one of the simplest and most effective classification algorithms,
which helps in building fast machine learning models that can make quick
predictions.
4. It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
5. Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.

Code & Implementation:
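The original code screenshots are not reproduced here. A minimal Gaussian Naïve Bayes sketch follows; the iris.csv column layout (a Species column, optionally an Id column) is assumed and may need adjusting.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

iris = pd.read_csv("iris.csv")
X = iris.drop(columns=["Species"])            # drop an 'Id' column too, if present
y = iris["Species"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))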

Conclusion:

Hence we studied the Naive Bayes Classification Algorithm.


Practical No-10

Title: Text Analytics

1. Extract Sample document and apply following document preprocessing methods:


Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse
Document Frequency.

Requirements: Python, Anaconda, Jupyter Notebook.

Theory Concepts:

1. Basic concepts of Text Analytics


One of the most frequent types of day-to-day conversation is text communication. In our
everyday routine, we chat, message, tweet, share status, email, create blogs, and offer
opinions and criticism. All of these actions lead to a substantial amount of unstructured
text being produced. It is critical to examine huge amounts of data in this sector of the
online world and social media to determine people's opinions.

Text mining is also referred to as text analytics. Text mining is a process of exploring
sizable textual data and finding patterns. Text mining processes the text itself, while NLP
processes the underlying metadata. Finding frequency counts of words, length of the
sentence, presence/absence of specific words is known as text mining. Natural language
processing is one of the components of text mining. NLP helps identify sentiment,
finding entities in the sentence, and category of blog/article. Text mining is preprocessed
data for text analytics. In Text Analytics, statistical and machine learning algorithms are
used to classify information.

2. Text Analysis Operations using natural language toolkit

NLTK(natural language toolkit) is a leading platform for building Python programs to


work with human language data. It provides easy-to-use interfaces and lexical resources
such as WordNet, along with a suite of text processing libraries for classification,
tokenization, stemming, tagging, parsing, and semantic reasoning, and many more.
Analysing movie reviews is one of the classic examples used to demonstrate a simple NLP
bag-of-words model.

2.1. Tokenization:
Tokenization is the first step in text analytics. The process of breaking down a text
paragraph into smaller chunks such as words or sentences is called Tokenization.
A token is a single entity that is a building block of a sentence or paragraph.

● Sentence tokenization : split a paragraph into list of sentences using

sent_tokenize() method

● Word tokenization : split a sentence into list of words using word_tokenize()

method

2.2. Stop words removal


Stopwords are considered noise in the text. Text may contain stop words such as is,
am, are, this, a, an, the, etc. In NLTK, for removing stopwords, you need to create
a list of stopwords and filter out your list of tokens from these words.

2.3. Stemming and Lemmatization


Stemming is a normalization technique where lists of tokenized words are
converted into shortened root words to remove redundancy. Stemming is the
process of reducing inflected (or sometimes derived) words to their word stem,
base or root form.
A computer program that stems words may be called a stemmer.
E.g.
A stemmer reduces the words like fishing, fished, and fisher to the stem fish.
The stem need not be a word; for example, the Porter algorithm reduces argue,
argued, argues, arguing, and argus to the stem argu.
Lemmatization in NLTK is the algorithmic process of finding the lemma of a
word depending on its meaning and context. Lemmatization usually refers to the
morphological analysis of words, which aims to remove inflectional endings. It
helps in returning the base or dictionary form of a word known as the lemma.
Eg. Lemma for studies is study

Lemmatization Vs Stemming

A stemming algorithm works by cutting the suffix from the word; in a broader sense it
cuts either the beginning or the end of the word.

On the contrary, Lemmatization is a more powerful operation, and it takes into


consideration morphological analysis of the words. It returns the lemma which is
the base form of all its inflectional forms. In-depth linguistic knowledge is
required to create dictionaries and look for the proper form of the word.
Stemming is a general operation while lemmatization is an intelligent operation
where the proper form will be looked in the dictionary. Hence, lemmatization
helps in forming better machine learning features.

2.4. POS Tagging


POS (Parts of Speech) tagging tells us the grammatical information of the words of a
sentence by assigning a specific tag (DT, NN, JJ, RB, VB, PRP etc.) for each category
(determiner, noun, adjective, adverb, verb, personal pronoun etc.) to each word.
A word can have more than one POS depending upon the context in which it is used.
We can use POS tags for statistical NLP tasks. Tagging distinguishes the sense of a word,
which is very helpful in text realization, and it helps infer semantic information from text
for sentiment analysis.

3. Text Analysis Model using TF-IDF.


Term frequency-inverse document frequency (TFIDF) is a numerical statistic
that is intended to reflect how important a word is to a document in a collection or
corpus.
● Term Frequency (TF)
It is a measure of the frequency of a word (w) in a document (d). TF is defined as
the ratio of a word’s occurrence in a document to the total number of words in a
document. The denominator term in the formula is to normalize since all the
corpus documents are of different lengths.

Example:

The initial step is to make a vocabulary of unique words and calculate TF for each
document. TF will be more for words that frequently appear in a document and
less for rare words in a document.
● Inverse Document Frequency (IDF)
It is the measure of the importance of a word. Term frequency (TF) does not
consider the importance of words. Some words such as ‘of’, ‘and’, etc. can be
most frequently present but are of little significance. IDF provides weightage to
each word based on its frequency in the corpus D.

In our example, since we have two documents in the corpus, N=2.


● Term Frequency — Inverse Document Frequency (TFIDF)
It is the product of TF and IDF.
TFIDF gives more weightage to the word that is rare in the corpus (all the documents).
TFIDF provides more importance to the word that is more frequent in the document.

After applying TFIDF, text in A and B documents can be represented as a TFIDF vector of
dimension equal to the vocabulary words. The value corresponding to each word represents
the importance of that word in a particular document.
TFIDF is the product of TF with IDF. Since TF values lie between 0 and 1, not using ln
can result in high IDF for some words, thereby dominating the TFIDF. We don’t want
that, and therefore, we use ln so that the IDF should not completely dominate the
TFIDF.
● Disadvantage of TFIDF
It is unable to capture the semantics. For example, funny and humorous are synonyms,
but TFIDF does not capture that. Moreover, TFIDF can be computationally expensive if
the vocabulary is vast.
4. Bag of Words (BoW)
Machine learning algorithms cannot work with raw text directly. Rather, the text
must be converted into vectors of numbers. In natural language processing, a
common technique for extracting features from text is to place all of the words that
occur in the text in a bucket. This approach is called a bag of words model or BoW
for short. It’s referred to as a “bag” of words because any information about the
structure of the sentence is lost.

Code & Implementation for Tokenization, POS Tagging, stop words removal, Stemming and
Lemmatization:
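The original code screenshots are not reproduced here. A minimal NLTK sketch of the preprocessing pipeline follows; the sample sentence is illustrative, and the listed resource downloads are needed once per environment.

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("stopwords")
nltk.download("wordnet")

text = "Text analytics studies raw documents. Studying documents helps in finding useful patterns."

# Tokenization
sentences = sent_tokenize(text)
words = word_tokenize(text)
print(sentences)
print(words)

# POS tagging
print(nltk.pos_tag(words))

# Stop words removal
stop_words = set(stopwords.words("english"))
filtered = [w for w in words if w.isalpha() and w.lower() not in stop_words]
print(filtered)

# Stemming and lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(w) for w in filtered])
print([lemmatizer.lemmatize(w) for w in filtered])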
Code & Implementation for Representation of a document by calculating TFIDF:
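The original code screenshots are not reproduced here. The sketch below builds a TFIDF representation with scikit-learn's TfidfVectorizer (a recent version is assumed) for a small two-document corpus chosen only for illustration.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny corpus of N = 2 documents
docs = [
    "Jupiter is the largest planet",
    "Mars is the fourth planet from the Sun",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# TFIDF weight of every vocabulary word in each document
table = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print(table)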
Conclusion:
Successfully performed text analysis using different methods and algorithms.
Practical No-11

Title: Write a code in Java for a simple WordCount application that counts the number of
occurrences of each word in a given input set using the Hadoop MapReduce framework on a
local standalone setup.

Requirements: Java (JDK), Hadoop installed in local standalone mode.

Theory Concepts:

Hadoop is an open source framework from Apache and is used to store, process and analyze
data which are very huge in volume. Hadoop is written in Java and is not OLAP (online
analytical processing). It is used for batch/offline processing. It is being used by Facebook,
Yahoo, Google, Twitter, LinkedIn and many more. Moreover it can be scaled up just by adding
nodes in the cluster.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks and
stored in nodes over the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and managing the
cluster.
3. Map Reduce: This is a framework which helps Java programs to do the parallel
computation on data using key value pair. The Map task takes input data and converts
it into a data set which can be computed in key-value pairs. The output of the Map task is
consumed by the Reduce task and then the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.

MapReduce:

MapReduce is a processing technique and a program model for distributed computing based
on java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). Secondly, reduce task, which takes the output from
a map as an input and combines those data tuples into a smaller set of tuples. As the sequence
of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in the
MapReduce form, scaling the application to run over hundreds, thousands, or even tens of
thousands of machines in a cluster is merely a configuration change. This simple scalability is
what has attracted many programmers to use the MapReduce model.

MapReduce Phase:

Input Splits:
An input in the MapReduce model is divided into small fixed-size parts called input splits. This
part of the input is consumed by a single map. The input data is generally a file or directory
stored in the HDFS.
Mapping:
This is the first phase in the map-reduce program execution where the data in each split is
passed line by line, to a mapper function to process it and produce the output values.
Shuffling:
It is a part of the output phase of Mapping where the relevant records are consolidated from
the output. It consists of merging and sorting. So, all the key-value pairs which have the same
keys are combined. In sorting, the inputs from the merging step are taken and sorted. It returns
key-value pairs, sorting the output.
Reduce:
All the values from the shuffling phase are combined and a single output value is returned.
Thus, summarizing the entire dataset.

Code:

package com.wc;

import java.io.IOException;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;

import org.apache.hadoop.mapred.FileOutputFormat;

import org.apache.hadoop.mapred.JobClient;

import org.apache.hadoop.mapred.JobConf;

import org.apache.hadoop.mapred.TextInputFormat;

import org.apache.hadoop.mapred.TextOutputFormat;

public class WC_Runner {

public static void main(String[] args) throws IOException {

JobConf conf = new JobConf(WC_Runner.class);

conf.setJobName("WordCount");

conf.setOutputKeyClass(Text.class);

conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(WC_Mapper.class);

conf.setCombinerClass(WC_Reducer.class);

conf.setReducerClass(WC_Reducer.class);

conf.setInputFormat(TextInputFormat.class);

conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf,new Path(args[0]));

FileOutputFormat.setOutputPath(conf,new Path(args[1]));

JobClient.runJob(conf);
}
}

// WC_Mapper.java

package com.wc;

import java.io.IOException;

import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.MapReduceBase;

import org.apache.hadoop.mapred.Mapper;

import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reporter;

public class WC_Mapper extends MapReduceBase implements

Mapper<LongWritable,Text,Text,IntWritable>{

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

public void map(

LongWritable key,
Text value,

OutputCollector<Text,IntWritable> output,

Reporter reporter

) throws IOException {

String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while (tokenizer.hasMoreTokens()) {

word.set(tokenizer.nextToken());

output.collect(word, one); // emit (word, 1) for every token
}
}
}

// WC_Reducer.java

package com.wc;

import java.io.IOException;

import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reducer;

import org.apache.hadoop.mapred.Reporter;

public class WC_Reducer extends MapReduceBase implements

Reducer<Text,IntWritable,Text,IntWritable> {

public void reduce(

Text key,

Iterator<IntWritable> values,

OutputCollector<Text,IntWritable> output,

Reporter reporter

) throws IOException {

int sum=0;

while (values.hasNext()) {

sum += values.next().get(); // add up the counts emitted for this word
}

output.collect(key, new IntWritable(sum));
}
}

Output:

Input:

HDFS is a storage unit of Hadoop

MapReduce is a processing tool for Hadoop

Result:

HDFS 1

Hadoop 2

MapReduce 1

a 2

for 1

is 2

of 1

processing 1

storage 1

tool 1

unit 1

Conclusion:

The MapReduce framework is a powerful tool for processing large-scale datasets in a distributed
manner. It provides a simple and efficient way to analyze big data using commodity hardware.

The key to using MapReduce effectively is to design the map and reduce functions carefully to
take advantage of the distributed nature of the framework.

Practical No-12

Title: Design a distributed application using MapReduce which processes a log file of a system.

Requirements: Java (JDK), Hadoop installed in local standalone mode.

Theory Concepts:

MapReduce:

MapReduce is a programming paradigm that enables massive scalability across hundreds or


thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the
heart of Apache Hadoop. The term "MapReduce" refers to two separate and distinct tasks that
Hadoop programs perform. The first is the map job, which takes a set of data and converts it
into another set of data, where individual elements are broken down into tuples (key/value
pairs).

The reduce job takes the output from a map as input and combines those data tuples into a
smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always
performed after the map job.

MapReduce programming offers several benefits to help you gain valuable insights from
your big data:

• Scalability. Businesses can process petabytes of data stored in the Hadoop


Distributed File System (HDFS).
• Flexibility. Hadoop enables easier access to multiple sources of data and multiple
types of data.
• Speed. With parallel processing and minimal data movement, Hadoop offers fast
processing of massive amounts of data.
• Simple. Developers can write code in a choice of languages, including Java, C++ and
Python.
Code:
// SalesCountryRunner.java
package SalesCountry;
import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class SalesCountryRunner {
public static void main(String[] args) {
JobClient my_client = new JobClient();
// Create a configuration object for the job
JobConf job_conf = new JobConf(SalesCountryRunner.class);
// Set a name of the Job
job_conf.setJobName("SalePerCountry");
// Specify data type of output key and value
job_conf.setOutputKeyClass(Text.class);
job_conf.setOutputValueClass(IntWritable.class);
// Specify names of Mapper and Reducer Class
job_conf.setMapperClass(SalesCountry.SalesMapper.class);
job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);
// Specify formats of the data type of Input and output
job_conf.setInputFormat(TextInputFormat.class);
job_conf.setOutputFormat(TextOutputFormat.class);
// Set input and output directories using command line arguments:
// arg[0] = name of the input directory on HDFS, and arg[1] = name of the output
// directory to be created to store the output file.

FileInputFormat.setInputPaths(job_conf, new Path(args[0]));


FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));
my_client.setConf(job_conf);
try {
// Run the job
JobClient.runJob(job_conf);
} catch (Exception e) {
e.printStackTrace();
}
}
}

// SalesMapper.java
package SalesCountry;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class SalesMapper extends MapReduceBase implements Mapper<LongWritable,
Text, Text,
IntWritable> {
private final static IntWritable one = new IntWritable(1);
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException {
String valueString = value.toString();
String[] SingleCountryData = valueString.split(",");
output.collect(new Text(SingleCountryData[7]), one);
}
}

// SalesCountryReducer.java
package SalesCountry;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class SalesCountryReducer extends MapReduceBase implements Reducer<Text,
IntWritable,
Text, IntWritable> {
public void reduce(Text t_key, Iterator<IntWritable> values,
OutputCollector<Text,IntWritable> output, Reporter reporter) throws IOException {

Text key = t_key;

int frequencyForCountry = 0;
while (values.hasNext()) {
// replace type of value with the actual type of our value
IntWritable value = (IntWritable) values.next();
frequencyForCountry += value.get();
}
output.collect(key, new IntWritable(frequencyForCountry));
}
}

Output:

Argentina 1

Australia 38

Austria 7

Bahrain 1

Belgium 8

Bermuda 1

Brazil 5

Bulgaria 1

CO 1

Canada 76

Cayman Isls 1

China 1

Costa Rica 1

Country 1

Czech Republic 3

Denmark 15

Dominican Republic 1

Finland 2

France 27

Germany 25

Greece 1

Guatemala 1

Hong Kong 1

Hungary 3

Iceland 1

India 2

Ireland 49

Israel 1

Italy 15

Japan 2

Jersey 1

Kuwait 1

Latvia 1

Luxembourg 1

Malaysia 1

Malta 2

Mauritius 1

Moldova 1

Monaco 2

Netherlands 22

New Zealand 6

Norway 16

Philippines 2

Poland 2

Romania 1

Russia 1

South Africa 5

South Korea 1

Spain 12

Sweden 13

Switzerland 36

Thailand 2

The Bahamas 2

Turkey 6

Ukraine 1

United Arab Emirates 6

United Kingdom 100

United States 462

Conclusion:

MapReduce is an effective tool for processing large log files, but it requires careful consideration
of data partitioning, fault tolerance, scalability, performance optimization and the data
processing logic.

Practical No-13

Title: Locate a dataset (e.g., sample_weather.txt) for working on weather data; read the text
input files and find the average for temperature, dew point and wind speed.

Requirements: Java (JDK), Hadoop installed in local standalone mode.

Theory Concepts:

Hadoop is an Apache open source framework written in java that allows distributed processing
of large datasets across clusters of computers using simple programming models. The Hadoop
framework application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed to scale
up from single server to thousands of machines, each offering local computation and storage.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has many
similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed
on low-cost hardware. It provides high throughput access to application data and is suitable for
applications having large datasets.

Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −

• Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.
• Hadoop YARN − This is a framework for job scheduling and cluster resource
management.

Code:

// MaxTemperatureDriver.java

package MaxMinTemp;

import org.apache.hadoop.conf.Configured;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.util.Tool;

import org.apache.hadoop.util.ToolRunner;

public class MaxTemperatureDriver extends Configured implements Tool{

public int run(String[] args) throws Exception {

if(args.length !=2) {

System.err.println("Usage: MaxTemperatureDriver <input path>

<outputpath>");

System.exit(-1);
}

Job job = new Job();

job.setJarByClass(MaxTemperatureDriver.class);

job.setJobName("Max Temperature");

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job,new Path(args[1]));

job.setMapperClass(MaxTemperatureMapper.class);

job.setReducerClass(MaxTemperatureReducer.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

boolean success = job.waitForCompletion(true);

return success ? 0 : 1;
}

public static void main(String[] args) throws Exception {

MaxTemperatureDriver driver = new MaxTemperatureDriver();

int exitCode = ToolRunner.run(driver, args);

System.exit(exitCode);
}
}

// MaxTemperatureMapper.java

package MaxMinTemp;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text,

IntWritable> {

private static final int MISSING = 9999;

@Override

public void map(LongWritable key, Text value, Context context) throws

IOException, InterruptedException {

String line = value.toString();

String year = line.substring(15, 19);

int airTemperature;

if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs

airTemperature = Integer.parseInt(line.substring(88, 92));

} else {

airTemperature = Integer.parseInt(line.substring(87, 92));
}

String quality = line.substring(92, 93);

if (airTemperature != MISSING && quality.matches("[01459]")) {

context.write(new Text(year), new IntWritable(airTemperature));
}
}
}

// MaxTemperatureReducer.java

package MaxMinTemp;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

@Override

public void reduce(Text key, Iterable<IntWritable> values, Context context)

throws IOException, InterruptedException {

int maxValue = Integer.MIN_VALUE;

for (IntWritable value : values) {

maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}

Output:

1901 317

1902 244

1903 289

1904 256

1905 283

1906 294

1907 283

1908 289

1909 278

1910 294

1911 306

1912 322

1913 300

1914 333

1915 294

1916 278

1917 317

1918 322

1919 378

1920 294

Conclusion:

The weather data analysis application using Hadoop and MapReduce provides an efficient and
scalable way to process and analyze large amounts of weather data.

