DSBDA Lab Manual
PREREQUISITES:
1. Discrete mathematics
2. Database Management Systems, Data warehousing, Data mining
3. Programming in Python
COURSE OBJECTIVES:
COURSE OUTCOMES:
Assignment 1:
OBJECTIVE:
SOFTWARE REQUIREMENTS:
1. Ubuntu stable version
2. Java
THEORY:
Introduction
Hadoop is an open-source framework that allows users to store and process big data in a
distributed environment across clusters of computers using simple programming models. It
is designed to scale up from single servers to thousands of machines, each offering local
computation and storage.
Big Data
Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. Big data is not merely data; it has become a complete subject in its
own right, involving various tools, techniques and frameworks.
Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.
Hadoop
Hadoop is an Apache open-source framework written in Java that allows distributed
processing of large datasets across clusters of computers using simple programming
models. A Hadoop-based application works in an environment that provides
distributed storage and computation across clusters of computers. Hadoop is designed
to scale up from a single server to thousands of machines, each offering local computation
and storage.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes. In short, the Hadoop framework lets us
develop applications that run on clusters of computers and perform complete statistical
analysis of huge amounts of data.
Hadoop Architecture
The Hadoop framework includes the following four modules:
Hadoop Common: These are Java libraries and utilities required by other
Hadoop modules. These libraries provide filesystem and OS-level abstractions and
contain the necessary Java files and scripts required to start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster
resource management.
Hadoop Distributed File System (HDFS): A distributed file system that
provides high-throughput access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing of
large data sets.
sunita@sunita:~$java -version
java version "1.7.0_91"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
Hadoop requires SSH access to manage its nodes, and by default SSH prompts for a
password. However, this requirement can be eliminated by creating and setting up SSH
certificates using the following commands. If asked for a filename, just leave it blank and
press the enter key to continue.
sunita@sunita:~$ssh-keygen -t rsa -P ""
sunita@sunita:~$cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The second command adds the newly created key to the list of authorized keys so that
Hadoop can use SSH without prompting for a password.
[ Note: in case of a "connection refused" error, purge openssh-server and install it again:
sunita@sunita:~$sudo apt-get purge openssh-server
sunita@sunita:~$sudo apt-get install openssh-server ]
D. Install Hadoop
Download Hadoop-
sunita@sunita:~$wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz
We extract the downloaded archive and move the Hadoop installation to the
/usr/local/hadoop directory using the following commands:
sunita@sunita:~$ tar xvzf hadoop-2.9.0.tar.gz
sunita@sunita:~$ sudo mv hadoop-2.9.0 /usr/local/hadoop
sunita@sunita:~$ sudo chown -R hduser /usr/local/hadoop
Add the following lines to the end of the ~/.bashrc file:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
sunita@sunita:~$source .bashrc
This command applies the changes made in the .bashrc file.
2. sudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying the hadoop-env.sh file.
sunita@sunita:~$gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Adding the above statement to the hadoop-env.sh file ensures that the value of the
JAVA_HOME variable will be available to Hadoop whenever it starts up.
sunita@sunita:~$hadoop namenode -format
Note that the hadoop namenode -format command should be executed once, before we start
using Hadoop. If this command is executed again after Hadoop has been used, it will destroy
all the data on the Hadoop file system.
G. Starting Hadoop
Now it's time to start the newly installed single node cluster. We can use start-all.sh
or (start-dfs.sh and start-yarn.sh)
sunita@sunita:~$start-all.sh
We can check if it's really up and running:
sunita@sunita:~$jps
9026 NodeManager
7348 NameNode
9766 Jps
8887 ResourceManager
7507 DataNode
The jps command (Java Virtual Machine Process Status tool) lists all the processes
running on the Java virtual machine. The output above means that we now have a
functional instance of Hadoop running on our VPS (virtual private server).
Hadoop Web Interfaces
Let's start Hadoop again and see its web UI:
Accessing Hadoop through the browser
http://localhost:50070/ (NameNode web UI)
http://localhost:8088/ (YARN ResourceManager web UI)
CONCLUSION:
Assignment 2:
SOFTWARE REQUIREMENTS:
1. Ubuntu stable version
2. GNU GCC Compiler
3. Hadoop
4. JDK 8
THEORY:
What is MapReduce?
MapReduce is a framework with which we can write applications to process huge amounts of
data, in parallel, on large clusters of commodity hardware in a reliable manner. MapReduce is
a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes
a set of data and converts it into another set of data, where individual elements are broken down
into tuples (key/value pairs).
The reduce task takes the output from a map as its input and combines those data
tuples into a smaller set of tuples. As the name MapReduce implies, the reduce
task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and
reducers is sometimes nontrivial. But once we write an application in the MapReduce form,
scaling the application to run over hundreds, thousands, or even tens of thousands of
machines in a cluster is merely a configuration change. This simple scalability is what has
attracted many programmers to the MapReduce model.
• Map stage: The map or mapper's job is to process the input data. Generally, the input data
is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input
file is passed to the mapper function line by line. The mapper processes the data and creates
several small chunks of data.
• Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The
reducer's job is to process the data that comes from the mapper. After processing, it produces
a new set of output, which will be stored in HDFS.
The classic word-count example, expressed in pseudocode:
map(key, value):
    // key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
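The same word-count logic can be sketched in plain Python to see how the two phases fit together. This is only a local simulation of the model (the sample documents are illustrative), not an actual Hadoop job:

from collections import defaultdict

def map_phase(doc_name, text):
    # emit (word, 1) for every word in the document
    for word in text.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # sum all counts emitted for the same word
    return word, sum(counts)

documents = {"doc1": "deer bear river", "doc2": "car car river"}

# map: process each document, then group intermediate pairs by key (shuffle)
intermediate = defaultdict(list)
for name, text in documents.items():
    for word, count in map_phase(name, text):
        intermediate[word].append(count)

# reduce: combine the counts collected for each word
result = dict(reduce_phase(w, c) for w, c in intermediate.items())
print(result)  # {'deer': 1, 'bear': 1, 'river': 2, 'car': 2}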
Assignment 3:
OBJECTIVE:
1. To understand and apply the Analytical concept of Big data using Python.
2. To study Python libraries for Data Analytics
SOFTWARE REQUIREMENTS:
1. Ubuntu 16.04
2. Python-3
3. Anaconda-Spyder/ Jupyter notebook/ Google colab
PROBLEM STATEMENT:
Perform the following operations using Python on the Facebook metrics data sets.
o Create data subsets
o Merge Data
o Sort Data
o Transposing Data
o Shape and reshape Data
THEORY:
NumPy:
▪ introduces objects for multidimensional arrays and matrices, as well as
functions that make it easy to perform advanced mathematical and statistical
operations on those objects
▪ provides vectorization of mathematical operations on arrays and matrices, which
significantly improves performance
▪ many other Python libraries are built on NumPy
SciPy:
▪ collection of algorithms for linear algebra, differential equations, numerical
integration, optimization, statistics and more
▪ part of SciPy Stack built on NumPy
Pandas:
▪ adds data structures and tools designed to work with table-like data (similar to
Series and Data Frames in R)
▪ provides tools for data manipulation: reshaping, merging, sorting, slicing,
aggregation etc.
▪ allows handling missing data
SciKit-Learn:
▪ provides machine learning algorithms: classification, regression, clustering,
model validation etc.
▪ built on NumPy, SciPy and matplotlib
matplotlib:
▪ a Python 2D plotting library which produces publication-quality figures in a
variety of hardcopy formats
▪ a set of functionalities similar to those of MATLAB
▪ line plots, scatter plots, bar charts, histograms, pie charts, etc.
▪ relatively low-level; some effort is needed to create advanced visualizations
Seaborn:
▪ based on matplotlib
▪ provides high level interface for drawing attractive statistical graphics
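As a quick reference, these libraries are conventionally imported with the following aliases (a minimal sketch; the code in the following assignments assumes pandas is available as pd):

import numpy as np               # multidimensional arrays and vectorized math
import pandas as pd              # DataFrames and data-manipulation tools
import matplotlib.pyplot as plt  # 2D plotting
import seaborn as sns            # statistical graphics built on matplotlib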
Data is usually collected and stored in a standard file format. The most commonly used
format for storing data is the spreadsheet format, where data is stored in rows and columns.
Each row is called a record, and each column in a spreadsheet holds data belonging to the
same data type. Commonly used spreadsheet formats are comma-separated values (CSV) and
Excel sheets. Other formats include plain text, JSON, HTML, MP3, MP4, etc.
Importing data-
Junk values can be converted to missing values by passing them as a list to the
parameter ‘na_values’
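A minimal sketch of this (the file name and the junk markers are illustrative, not taken from the actual dataset):

import pandas as pd

# strings listed in na_values are converted to NaN while reading the file
data = pd.read_csv('data.csv', na_values=['??', '????'])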
Introduction to Pandas-
Pandas provides high-performance, easy-to-use data structures and analysis tools for
the Python programming language.
It is an open-source Python library providing high-performance data manipulation and
analysis tools built on its powerful data structures. The name pandas is derived from
'panel data', an econometrics term for multidimensional data.
DataFrame-
Pandas primarily deals with the DataFrame object, a two-dimensional, labelled data structure
with rows and columns.
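The examples below operate on a DataFrame named dset. A minimal sketch of creating it (the file name and separator are assumptions; adjust them to your copy of the Facebook metrics dataset):

import pandas as pd

# load the Facebook metrics dataset into a DataFrame named dset
dset = pd.read_csv('dataset_Facebook.csv', sep=';')
print(dset.shape)    # (number of rows, number of columns)
print(dset.columns)  # available column names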
Creating Subset-
There are a number of ways to subset the Data Frame:
• one or more columns
• one or more rows
• a subset of rows and columns
Rows and columns can be selected by their position or label.
To subset the data we can apply Boolean indexing. This indexing is commonly known as a
filter. For example, if we want to subset the rows in which the salary value is greater than
$120K:
df_sub = df[ df['salary'] > 120000 ]
# subset of initial specified observations
dset.head(20)
# Subset of last specified observations
dset.tail(40)
# subset selecting specified columns only
sub1= dset[['pagetotallikes','type','category','comment']]
#subset having specified columns and a range of observations
sub2= dset[['pagetotallikes','type','category','comment']].loc[50:300]
#subset having specified columns, excluding observations of a given type
sub5=dset[['pagetotallikes','type']].loc[(dset['type']!=2)]
#subset having observations constrained on some column
sub6=dset[ dset['type']==3 ]
#------------------------------------------
## Create subset and merge dataset
sub2= dset[['pagetotallikes','type','category','comment']].loc[50:150]
sub3= dset[['pagetotallikes','type','category','comment']].loc[151:300]
sub4= dset[['pagetotallikes','type','category','comment']].loc[1:25]
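# The subsets created above can be combined again; a minimal sketch
# (row-wise concatenation of sub2 and sub3, assuming pandas is imported as pd):
mergedrows = pd.concat([sub2, sub3])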
#------------------------------------------------
## Merge on some common variable
#1. Create a dictionary where keys are column names and values are observations
d={'Student name':['raj','mahesh','jon'],'age':[22,23,25]}
#2. Pass the dictionary to DataFrame method
df=pd.DataFrame(d)
d1={'roll no':[1,2,3],'Student name':['raj','dilip','raam'],'age':[22,22,22]}
df1=pd.DataFrame(d1)
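# A minimal sketch of merging the two DataFrames on their common
# 'Student name' column (an inner join by default):
mergeddf = pd.merge(df, df1, on='Student name')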
##-----------------------------------
### Sort observations by column values
sorteddset=dset.sort_values('pagetotallikes')                  # ascending order (default)
sorteddset=dset.sort_values('pagetotallikes', ascending=False) # descending order
#------------------------------------------------
### Transpose
Tsub4=sub4.transpose()
#------------------------------------------------
#shape and reshape like pivot table
df1.shape
p_table=pd.pivot_table(df1,index=['roll no','Student name'],values='age')
p_table.shape
Conclusion: Hence, we have studied the create-subset, merge, sort, transpose and reshape
operations on a dataset.
Assignment 2:
OBJECTIVE:
SOFTWARE REQUIREMENTS:
1. Ubuntu 16.04
2. Python-3
3. Anaconda-Spyder/ Jupyter notebook/ Google colab
PROBLEM STATEMENT:
Perform the following operations using Python on the Air quality and Heart Diseases
data sets
• Data cleaning
• Data integration
• Data transformation
• Error correcting
• Data model building
THEORY:
Data cleaning, or data preparation, is an essential part of statistical analysis. In fact,
in practice it is often more time-consuming than the statistical analysis itself.
The way information is stored in a DataFrame or a Python object affects the analysis and
the outputs of calculations.
• There are two main types of data:
1. Numeric types
2. Character types
1. Numeric types
◦ In numeric types such as int64 and float64, '64' simply refers to the memory allocated
to store the data in each cell, which effectively relates to how many digits it can store in
each "cell"
◦ 64 bits is equivalent to 8 bytes
◦ Allocating space ahead of time allows computers to optimize storage and
processing efficiency
2. Character types
◦ Strings are known as 'object' types in pandas; they can store values that contain numbers
and/or characters
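A short sketch of checking which type pandas has assigned to each column (the file and DataFrame names are assumptions, matching the cars_data DataFrame used later in this assignment):

import pandas as pd

cars_data = pd.read_csv('cars.csv')  # file name is an assumption
print(cars_data.dtypes)              # int64 / float64 for numeric columns, object for strings
cars_data.info()                     # column types together with non-null counts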
Data Cleaning-
We need to know how missing values are represented in the dataset in order to make
reasonable decisions. By default, pandas replaces blank values with NaN.
Missing values may also appear in other forms, such as 'nan', '??', '????', etc.
We can import the data while treating these other forms as missing values in a DataFrame.
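A minimal sketch of doing this (the file name and column name are assumptions), followed by a quick check of how many missing values remain and one possible way to fill them:

import pandas as pd

# convert the listed junk strings to NaN while importing the data
cars_data = pd.read_csv('cars.csv', na_values=['??', '????'])

# count the missing values in each column
print(cars_data.isnull().sum())

# one possible treatment: fill numeric gaps with the column mean
cars_data['Price'] = cars_data['Price'].fillna(cars_data['Price'].mean())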
Data transformation-
Example: converting the data type of a column-
cars_data['Doors']=cars_data['Doors'].astype('int64')
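A slightly fuller sketch of the same idea, assuming the 'Doors' column mixes word-form and numeric values (the replacement values are illustrative):

# replace word-form entries with digits, then cast the column to integers
cars_data['Doors'] = cars_data['Doors'].replace(['three', 'four', 'five'], [3, 4, 5])
cars_data['Doors'] = cars_data['Doors'].astype('int64')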
Assignment 3:
OBJECTIVE:
1. To understand and apply the Analytical concept of Big data using Python.
2. To study Python libraries for Data visualization
SOFTWARE REQUIREMENTS:
1. Ubuntu 16.04
2. Python-3
3. Anaconda-Spyder/ Jupyter notebook/ Google colab
PROBLEM STATEMENT:
Visualize the data using Python libraries matplotlib, seaborn by plotting histogram, scatter-
plot and bar-plot
Data Visualization-
Data visualization allows us to quickly interpret the data and adjust different variables
to see their effect. Technology is increasingly making it easier for us to do so.
Data visualization helps to-
o Observe the patterns
o Identify extreme values that could be anomalies
o Easy interpretation of data insights
• Matplotlib makes heavy use of NumPy and other extension code to provide good
performance, even for large arrays
1. Scatter Plot
A scatter plot is a set of points that represents the values obtained for two different
variables plotted on a horizontal and vertical axis.
• When to use scatter plots?
Scatter plots are used to convey the relationship between two numerical variables.
Scatter plots are sometimes called correlation plots because they show how two
variables are correlated.
2. Histogram-
It is a graphical representation of data using bars of different heights. A histogram groups
numbers into ranges, and the height of each bar depicts the frequency of each range or bin.
• When to use histograms?
To represent the frequency distribution of numerical variables
For example, the frequency distribution of kilometres travelled by the cars shows that most
of the cars have travelled between 50000 and 100000 km, and only a few cars have travelled more.
3. Bar Plot-
A bar plot is a plot that presents categorical data with rectangular bars with lengths
proportional to the counts that they represent.
• When to use bar plot?
To represent the frequency distribution of categorical variables
A bar diagram makes it easy to compare sets of data between different groups
1. Scatter Plot-
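A minimal matplotlib sketch (the cars_data DataFrame and the 'Age' and 'Price' column names are assumptions carried over from the previous assignment):

import matplotlib.pyplot as plt

# each point represents one car, plotted by its age and price
plt.scatter(cars_data['Age'], cars_data['Price'], c='red')
plt.title('Scatter plot of Price vs Age')
plt.xlabel('Age')
plt.ylabel('Price')
plt.show()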
2. Histogram-
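A minimal histogram sketch (the 'KM' column name is an assumption):

import matplotlib.pyplot as plt

# group the kilometre values into 5 bins and show the frequency of each bin
plt.hist(cars_data['KM'], color='green', edgecolor='white', bins=5)
plt.title('Histogram of KM')
plt.xlabel('Kms')
plt.ylabel('Frequency')
plt.show()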
3. Bar Plot-
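A minimal bar-plot sketch using seaborn (the categorical 'FuelType' column is an assumption):

import seaborn as sns
import matplotlib.pyplot as plt

# one bar per fuel type, with height equal to the number of cars of that type
sns.countplot(x='FuelType', data=cars_data)
plt.title('Bar plot of fuel types')
plt.show()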
Assignment 4:
OBJECTIVE:
SOFTWARE REQUIREMENTS:
1. Ubuntu 16.04
2. Tableau Desktop / Tableau Public
PROBLEM STATEMENT:
Perform the data visualization operations using Sales order dataset using tableau.
1 Upload dataset CSV file
2 Plot Sales by region
3 Plot year, month, quarterwise Sale
4 Plot yearwise sale vs profit
THEORY:
Tableau is a Data Visualization tool that is widely used for Business Intelligence but is
not limited to it. It helps create interactive graphs and charts in the form of dashboards and
worksheets to gain business insights.
Tableau Public
Tableau Public is completely free and does not require any license. But it comes
with a limitation: all of your data and workbooks are made public to all Tableau users.
After launching Tableau, you should see its start screen. This is where you import your data.
There are multiple formats that your data can be in: it can be in a flat file such as
Excel or CSV, or you can load it directly from data servers.
steps:
1. Since the data is in an Excel file, click on Excel and choose the Sample – Superstore.xls
file to open it:
2. You can see three sheets on the screen, but we are only going to be dealing with Orders
here, so go ahead and drag that sheet onto the "Drag sheets here" area: