
314457: Data Science and Big Data Analytics

Laboratory

Third Year – Information Technology


(2019 Course)
Academic Year 2023-24 Semester – II

Teaching Scheme: Practical (PR): 02 Hrs./week
Credit Scheme: 01 Credit
Examination Scheme: PR: 25 Marks, TW: 25 Marks

LABORATORY MANUAL V 3.0


DEPARTMENT OF INFORMATION TECHNOLOGY
Sinhgad College of Engineering, Pune

2023-2024
VISION

To provide excellent Information Technology education by building a strong teaching and research environment.

MISSION
1) To transform the students into innovative, competent and high-quality IT professionals to meet the growing global challenges.

2) To achieve and impart quality education with an emphasis on practical skills and social relevance.

3) To endeavour for continuous up-gradation of the technical expertise of students to cater to the needs of society.

4) To achieve effective interaction with industry for mutual benefits.

PROGRAM EDUCATIONAL OBJECTIVES

The students of the Information Technology course, after passing out, will:

PEO1: Possess strong fundamental concepts in mathematics, science, engineering, and technology to address technological challenges.

PEO2: Possess knowledge and skills in the field of Computer Science and Information Technology for analysing, designing, and implementing complex engineering problems of any domain with innovative approaches.

PEO3: Possess an attitude and aptitude for research, entrepreneurship, and higher studies in the field of Computer Science and Information Technology.

PEO4: Have a commitment to ethical practices, societal contributions through communities, and life-long learning.

PEO5: Possess better communication, presentation, time management, and teamwork skills, leading to responsible and competent professionals who will be able to address challenges in the field of IT at the global level.
PROGRAM OUTCOMES
The students in the Information Technology course are expected to know and be able to demonstrate:

PO1 (Engineering Knowledge): An ability to apply knowledge of mathematics, computing, science, engineering and technology.

PO2 (Problem Analysis): An ability to define a problem and provide a systematic solution by conducting experiments, analysing the problem and interpreting the data.

PO3 (Design/Development of Solutions): An ability to design, implement, and evaluate a software or software/hardware system, component, or process to meet desired needs within realistic constraints.

PO4 (Conduct Investigation of Complex Problems): An ability to identify, formulate, and provide research-based solutions to complex engineering/technology problems.

PO5 (Modern Tool Usage): An ability to use the techniques, skills, modern engineering technology tools, and standard processes necessary for practice as an IT professional.

PO6 (The Engineer and Society): An ability to apply mathematical foundations, algorithmic principles, and computer science theory in the modelling and design of computer-based systems with necessary constraints and assumptions.

PO7 (Environment and Sustainability): An ability to analyse and provide solutions for the local and global impact of information technology on individuals, organizations, and society.

PO8 (Ethics): An ability to understand professional, ethical, legal, security and social issues and responsibilities.

PO9 (Individual and Team Work): An ability to function effectively as an individual or as a team member to accomplish desired goals.

PO10 (Communication Skills): An ability to engage in life-long learning and continuing professional development to cope with fast changes in technologies/tools with the help of electives, professional organizations and extra-curricular activities.

PO11 (Project Management and Finance): An ability to communicate effectively with the engineering community at large by means of effective presentations, report writing, paper publications and demonstrations.

PO12 (Life-long Learning): An ability to understand engineering, management and financial aspects, performance, optimizations and time complexity necessary for professional practice.
PROGRAM SPECIFIC OUTCOMES

A graduate of the Information Technology Program will demonstrate:

PSO1: An ability to apply the theoretical concepts and practical knowledge of Information Technology in the analysis, design, development and management of information processing systems and applications in the interdisciplinary domain.

PSO2: Decision-making skills through the use of modern IT tools, to be ready for professional responsibilities.
DOCUMENT CONTROL

Reference Code SCOE-IT / Lab Manual Procedures


Version No 3.0
Compliance Status Complete
Date of Compliance 01-12-2023
Security Classification Department Specific
Document Status Definitive
Review Period Yearly

Author
Signature

Name Mrs. S. S. Gadekar, Mrs. T. H. Patil


Designation Assistant Professor

Document History

Revision No. Revision Date Reason For Change


1 01-07-2021 Update
2 01-07-2022 Update
3 01-02-2023 Updating the theory content of a few assignments

Summary of Changes to Data Science & Big Data Analytics Laboratory

1. Assignment Group A - 2: Theory content added and changes in code
2. Assignment Group B - 5: Theory content added
Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon(BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’

DEPARTMENT OF INFORMATION TECHNOLOGY

LABORATORY CODE

Sr. No. Laboratory Code


1. Students should report to the concerned laboratory as per the time table.
2. Keep your bags in the rack.
3. While entering the lab, remove your shoes and keep them in the shoe stand.
4. Turn computer monitors off when asked by your teacher.
5. Do not visit banned websites.
6. No food or drinks near the keyboard.
7. Use only your assigned computer and workstation.
8. Do not change the settings on the computer.
9. Ask permission before downloading.
10. Ask permission before printing documents.
11. Save your work often.
12. If you are the last class of the day, please power down all computers and monitors.
Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon(BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’

ACADEMIC YEAR 2023-24, SEMESTER-II


DEPARTMENT OF INFORMATION TECHNOLOGY

SYLLABUS

Savitribai Phule Pune University, Pune


Third Year Information Technology (2019 Course)
314457: DS & BDA Lab

Teaching Scheme: Practical (PR): 02 hrs./week
Credit Scheme: 01 Credit
Examination Scheme: PR: 25 Marks, TW: 25 Marks

Prerequisite Courses:
 Discrete mathematics
 Database Management Systems, Data warehousing, Data mining
 Programming in Python

Course Objectives:
1. To understand Big data primitives and fundamentals.

2. To understand the different Big data processing techniques.

3. To understand and apply the Analytical concept of Big data using Python.

4. To understand different data visualization techniques for Big Data.

5. To understand the application and impact of Big Data.

6. To understand emerging trends in Big data analytics.

Course Outcomes:
On completion of the course, students will be able to–
CO1: Apply Big data primitives and fundamentals for application development.
CO2: Explore different Big data processing techniques with use cases.
CO3: Apply the Analytical concept of Big data using Python.
CO4: Visualize the Big Data using Tableau.
CO5: Design algorithms and techniques for Big data analytics.
CO6: Design and develop Big data analytic application for emerging trends.

List of Laboratory Assignments


Group A: Assignments based on the Hadoop
1. Single node/Multiple node Hadoop Installation.

2. Design a distributed application using MapReduce (Using Java) which processes a log file
of a system. List out the users who have logged for maximum period on the system. Use
simple log file from the Internet and process it using a pseudo distribution mode on
Hadoop platform.

3. Write an application using HiveQL for flight information system which will include
A. Creating, Dropping, and altering Database tables.
B. Creating an external Hive table.
C. Load table with data, insert new values and field in the table, Join tables with Hive
D. Create index on Flight Information Table
E. Find the average departure delay per day in 2008.

Group B: Assignments based on Data Analytics using Python

1. Perform the following operations using Python on the Facebook metrics data sets
a. Create data subsets
b. Merge Data
c. Sort Data
d. Transposing Data
e. Shape and reshape Data

2. Perform the following operations using Python on the Air quality and Heart Diseases data
sets
a. Data cleaning
b. Data integration
c. Data transformation
d. Error correcting
e. Data model building

3. Integrate Python and Hadoop and perform the following operations on forest fire dataset
a. Data analysis using the Map Reduce in PyHadoop
b. Data mining in Hive

4. Visualize the data using Python libraries matplotlib, seaborn by plotting the graphs for
assignment no. 2 and 3 ( Group B)

5. Perform the following data visualization operations using Tableau on Adult and Iris datasets.
a. 1D (Linear) Data visualization
b. 2D (Planar) Data Visualization
c. 3D (Volumetric) Data Visualization
d. Temporal Data Visualization
e. Multidimensional Data Visualization
f. Tree/ Hierarchical Data visualization
g. Network Data visualization

Group C: Model Implementation


1. Create a review scraper for any ecommerce website to fetch real-time comments, reviews, ratings, comment tags and customer names using Python.
2. Develop a mini project in a group using different predictive modelling techniques to solve any real-life problem. (Reference dataset: https://www.kaggle.com/tanmoyie/us-graduate-schools-admission-parameters)

Reference Books:
1. Big Data, Black Book, DT Editorial Services, 2015 edition.
2. Data Analytics with Hadoop, Jenny Kim and Benjamin Bengfort, O'Reilly Media, Inc.
3. Python for Data Analysis, Wes McKinney, O'Reilly Media, ISBN: 978-1-449-31979-3.
4. Python Data Science Handbook, Jake VanderPlas, https://tanthiamhuat.files.wordpress.com/2018/04/pythondatasciencehandbook.pdf
5. Hadoop in Practice, Alex Holmes, Dreamtech Press.
6. Online references for data sets:

a. http://archive.ics.uci.edu/ml/
b. https://www.kaggle.com/tanmoyie/us-graduate-schools-admission-parameters
c. https://www.kaggle.com
Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon(BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’
ACADEMIC YEAR 2023-24, SEMESTER-II
DEPARTMENT OF INFORMATION TECHNOLOGY

Name of Student : PRN No. :


Student Roll No. : Class: Third Year
Data Science & Big Data Analytics
Subject : Laboratory Batch:

INDEX

Columns: Group, Title of Assignment, Given Date, Submission Date, Pg. No., Remark, Sign

1. Single node/Multiple node Hadoop


Installation.
2. Design a distributed application
using MapReduce (Using Java) which
processes a log file of a system.
A
List out the users who have logged for
maximum period on the system. Use
simple log file from the Internet and
process it using a pseudo distribution
mode on Hadoop platform.
3. Write an application using HiveQL
for flight information system which
will include
a. Creating, Dropping, and
altering Database tables.
b. Creating an external Hive
table.
c. Load table with data, insert
new values and field in the
table, and Join tables with
Hive.
d. Create index on Flight
Information Table.
e. Find the average departure
delay per day in 2008.

1. Perform the following operations


using Python on the Facebook metrics
data sets.
a. Create data subsets
b. Merge Data
c. Sort Data
d. Transposing Data
e. Shape and reshape Data
2. Perform the following operations
using Python on the Air quality and
Heart Diseases data sets
a. Data cleaning
b. Data integration
B c. Data transformation
d. Error correcting
e. Data model building
3. Integrate Python and Hadoop and
perform the following operations on
forest fire dataset
a. Data analysis using the Map
Reduce in PyHadoop.
b. Data mining in Hive
4. Visualize the data using Python
libraries matplotlib, seaborn by
plotting the graphs for assignment no.
2 and 3 ( Group B)
5. Perform the following data
visualization operations using Tableau
on Adult and Iris datasets.
a. 1D (Linear) Data visualization
b. 2D (Planar) Data Visualization
c. 3D (Volumetric) Data
Visualization
d. Temporal Data Visualization
e. Multidimensional Data
Visualization
f. Tree/ Hierarchical Data
visualization
g. Network Data visualization
1. Create a review scraper for any
ecommerce website to fetch real time
comments, reviews, ratings, comment
tags, customer name using Python.
2. Develop a mini project in a group
C
using different predictive models
techniques to solve any real life
problem. (Reference dataset:
https://www.kaggle.com/tanmoyie/us-graduate-schools-admission-parameters)
Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon (BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’

GROUP A Single node/Multiple node Hadoop Installation.


ASSIGNMENT NO. 01

GIVEN DATE:

SUBMISSION DATE:

SIGN. OF FACULTY:

ASSIGNMENT NO. : 01(A)

AIM:
To Perform Hadoop Installation (Configuration) on
a. Single Node
b. Multiple Node

OBJECTIVES:
1. To understand Big Data Fundamentals.
2. To understand Different Big Data Processing Techniques.

OUTCOMES:
1. To apply Big Data Primitives & Fundamentals for application development.

THEORY:

Cluster Computing
A computer cluster is a set of connected computers (nodes) that work together as if
they are a single (much more powerful) machine. Unlike grid computers, where each node
performs a different task, computer clusters assign the same task to each node.
Homogeneous Cluster
In homogeneous clusters, all machines are assumed to be the same; however, in the
heterogeneous type, machines have different computing and consumption power. All-in
strategy (AIS) [70] is a framework for energy management in MapReduce clusters by
powering down all nodes in the cluster during a low utilization period.

Heterogeneous Cluster
A heterogeneous cluster environment can contain processors and devices with
different bandwidth and computational capabilities. Symmetric MPI applications will assign
identical workloads to all participants in the application, which can cause load imbalance, as
the execution time might be shorter on some devices due to their higher computational
performance.

Hadoop
Apache Hadoop is an open-source software framework written in Java for distributed storage
and distributed processing of very large data sets on computer clusters built from commodity
hardware. All the modules in Hadoop are designed with a fundamental assumption that
hardware failures are common and should be automatically handled by the framework. The
core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System
(HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and
distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code
for nodes to process in parallel based on the data that needs to be processed. This approach
takes advantage of data locality— nodes manipulating the data they have access to— to allow
the dataset to be processed faster and more efficiently than it would be in a more conventional
supercomputer architecture that relies on a parallel file system where computation and data
are distributed via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:


 Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
 Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;
 Hadoop YARN – a resource-management platform responsible for managing computing
resources in clusters and using them for scheduling of users' applications; and
 Hadoop MapReduce – an implementation of the MapReduce programming model for large
scale data processing.

Benefits of using Hadoop

Installing Hadoop

Prerequisites
• VIRTUAL BOX: used to install the guest operating system.
• OPERATING SYSTEM: You can install Hadoop on Linux-based operating systems. Ubuntu and CentOS are very commonly used. In this tutorial, we are using CentOS.
• JAVA: You need to install the Java 8 package on your system.
• HADOOP: You require the Hadoop 2.7.3 package.

Step 1: Download the Java 8 Package & Save the file in your home directory.
Step 2: Extract the Java Tar File.

Command: tar -xvf jdk-8u101-linux-i586.tar.gz

Fig: Hadoop Installation – Downloading Hadoop


Step 3: Download the Hadoop 2.7.3 Package.
Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-
2.7.3.tar.gz

Step 4: Extract the Hadoop tar File.


Command: tar -xvf hadoop-2.7.3.tar.gz

Fig: Hadoop Installation – Extracting Hadoop Files

Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc

Fig: Hadoop Installation – Setting Environment Variable

Then, save the bash file and close it.


For applying all these changes to the current Terminal, execute the source command.
Command: source .bashrc

Fig: Hadoop Installation – Refreshing environment variables


To make sure that Java and Hadoop have been properly installed on your system and can be
accessed through the Terminal, execute the java -version and Hadoop version commands.
Command: java -version

Fig: Hadoop Installation – Checking Java Version

Command: hadoop version

Fig: Hadoop Installation – Checking Hadoop Version

Step 6: Edit the Hadoop Configuration files.


Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
All the Hadoop configuration files are located in hadoop-2.7.3/etc/hadoop directory as you
can see in the snapshot below:

Fig: Hadoop Installation – Hadoop Configuration Files


Step 7: Open core-site.xml and edit the property mentioned below inside the configuration tag:
core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings of the Hadoop core, such as I/O settings that are common to HDFS and MapReduce.
Command: vi core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

Step 8: Edit hdfs-site.xml and edit the property mentioned below inside configuration tag:
Hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode,
and Secondary NameNode). It also includes the replication factor and block size of HDFS.
Command: vi hdfs-site.xml

Fig: Hadoop Installation – Configuring hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.permission</name>
      <value>false</value>
   </property>
</configuration>
Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside
configuration tag:

mapred-site.xml contains configuration settings of the MapReduce application, such as the number of JVMs that can run in parallel, the size of the mapper and reducer processes, the CPU cores available for a process, etc.

In some cases, the mapred-site.xml file is not available, so we have to create it from the mapred-site.xml template.

Command: cp mapred-site.xml.template mapred-site.xml

Command: vi mapred-site.xml

Fig: Hadoop Installation – Configuring mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Step 10: Edit yarn-site.xml and edit the property mentioned below inside the configuration tag:

yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as the application memory management size, the operations needed on the program and algorithm, etc.


Command: vi yarn-site.xml
Fig: Hadoop Installation – Configuring yarn-site.xml

<?xml version="1.0"?>
<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
</configuration>

Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:

Hadoop-env.sh contains the environment variables that are used in the script to run Hadoop
like Java home path, etc.

Command: vi hadoop-env.sh

Fig: Hadoop Installation – Configuring hadoop-env.sh

Step 12: Go to Hadoop home directory and format the NameNode.

Command: cd

Command: cd hadoop-2.7.3

Command: bin/hadoop namenode -format


Fig: Hadoop Installation – Formatting NameNode

Step 13: Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all the
daemons.

Command: cd hadoop-2.7.3/sbin

Either you can start all daemons with a single command or do it individually.

Command: ./start-all.sh

The above command is a combination of start-dfs.sh, start-yarn.sh and mr-jobhistory-daemon.sh. Alternatively, you can run all the services individually as below:

Start NameNode:

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all
files stored in the HDFS and tracks all the files stored across the cluster.

Command: ./hadoop-daemon.sh start namenode

Start DataNode: On startup, a DataNode connects to the Namenode and it responds to the
requests from the Namenode for different operations.

Command: ./hadoop-daemon.sh start datanode


Start Resource Manager:
The ResourceManager is the master that arbitrates all the available cluster resources and thus helps in managing the distributed applications running on the YARN system. Its job is to manage each NodeManager and each application's ApplicationMaster.

Command: ./yarn-daemon.sh start resourcemanager

Fig: Hadoop Installation – Starting Resource Manager

Start Node Manager:


The NodeManager on each machine is the agent responsible for managing containers, monitoring their resource usage and reporting the same to the ResourceManager.

Command: ./yarn-daemon.sh start nodemanager

Start JobHistoryServer:
JobHistoryServer is responsible for servicing all job-history-related requests from clients.
Command: ./mr-jobhistory-daemon.sh start historyserver

Step 14: To check that all the Hadoop services are up and running, run the below command.

Command: jps

Fig: Hadoop Installation – Checking Daemons


Step 15: Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check the NameNode interface.
Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon (BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’

GROUP A
ASSIGNMENT NO. 02

Design a distributed application using MapReduce (Using Java) which processes a log file of a system. List out the users who have logged for the maximum period on the system. Use a simple log file from the Internet and process it using a pseudo distribution mode on the Hadoop platform.
GIVEN DATE:

SUBMISSION DATE:

SIGN. OF FACULTY:

ASSIGNMENT NO. : 02(A)


AIM:
Design a distributed application using MapReduce (Using Java) which processes a log file of
a system. List out the users who have logged for maximum period on the system. Use simple
log file from the Internet and process it using a pseudo distribution mode on Hadoop platform.

OBJECTIVES:
1. To understand Big Data Primitives & Fundamentals.
2. To understand different Big Data Processing Techniques.

OUTCOMES:
1. To apply Big Data primitives & fundamentals for application development.
2. To design algorithm & Techniques for Big Data Analysis.

THEORY:
MapReduce is a framework using which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a program model for distributed computing based
on java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). Secondly, reduce task, which takes the output from
a map as an input and combines those data tuples into a smaller set of tuples. As the sequence
of the name MapReduce implies, the reduce task is always performed after the map job. The
major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers
is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling
the application to run over hundreds, thousands, or even tens of thousands of machines in a
cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model. MapReduce program executes in three stages,
namely map stage, shuffle stage, and reduce stage.

Map stage: The map or mapper's job is to process the input data. Generally the input data is
in the form of file or directory and is stored in the Hadoop file system (HDFS). The input file
is passed to the mapper function line by line. The mapper processes the data and creates
several small chunks of data.

Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer's job is to process the data that comes from the mapper. After processing, it produces
a new set of output, which will be stored in the HDFS.

Inserting Data into HDFS:

 The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
 The key and value classes should be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
 Input and Output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
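
To make the map and reduce stages concrete, here is a minimal sketch of the same log-analysis idea written for Hadoop Streaming in Python (the assignment itself uses Java). The assumed log format (user name in the first field, session duration in seconds in the last field) and the script names mapper.py and reducer.py are illustrative only, not part of the prescribed solution.

#!/usr/bin/env python3
# mapper.py - emits "user<TAB>duration" for every log line.
# Assumed line format: user name in the first field, duration (seconds) in the last.
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) < 2:
        continue                      # skip malformed lines
    user, duration = fields[0], fields[-1]
    if duration.isdigit():
        print(f"{user}\t{duration}")

#!/usr/bin/env python3
# reducer.py - input arrives sorted by user; total the durations per user and
# report the user who was logged in for the maximum period.
import sys

current_user, current_total = None, 0
best_user, best_total = None, -1

for line in sys.stdin:
    user, duration = line.rstrip("\n").split("\t")
    if user != current_user and current_user is not None:
        if current_total > best_total:
            best_user, best_total = current_user, current_total
        current_total = 0
    current_user = user
    current_total += int(duration)

if current_user is not None and current_total > best_total:
    best_user, best_total = current_user, current_total

print(f"{best_user}\t{best_total}")

Such scripts would be submitted with the Hadoop Streaming jar (shipped under share/hadoop/tools/lib in a standard Hadoop install), passing -mapper mapper.py, -reducer reducer.py and the HDFS input/output paths; the paths and jar location here are assumptions for illustration.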

Fig.1: An Example Program to understand working of MapReduce Program.


Steps for Compilation & Execution of Program:

1. Switch to the Hadoop user and prepare a working directory in the home folder:
su hadoopuser
sudo mkdir analyzelogs
sudo chmod -R 777 analyzelogs/
sudo chown -R hadoop1 analyzelogs/

2. Copy the source files (SalesMapper.java, SalesCountryReducer.java, SalesCountryDriver.java) to the analyzelogs folder, and convert access_log_short.txt into access_log_short.csv:
sudo cp /home/priyanka/Desktop/assignment3/* ~/analyzelogs/

3. Start Hadoop and verify the daemons:
start-dfs.sh
start-yarn.sh
jps

4. From inside the analyzelogs directory (the paths below assume it is the present working directory), make the files readable and set the classpath:
cd ~/analyzelogs
sudo chmod +r *.*
export CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.9.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.9.0.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-2.9.0.jar:~/analyzelogs/SalesCountry/*:$HADOOP_HOME/lib/*"

5. Compile the Java files:
javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java

6. Create the manifest file:
sudo gedit Manifest.txt
and add the line:
Main-class: SalesCountry.SalesCountryDriver

7. Build the jar from the compiled classes:
jar -cfm analyzelogs.jar Manifest.txt SalesCountry/*.class

8. Create an input directory, copy the log file into HDFS and run the job:
sudo mkdir ~/input2000
sudo cp access_log_short.csv ~/input2000/
$HADOOP_HOME/bin/hdfs dfs -put ~/input2000 /
$HADOOP_HOME/bin/hadoop jar analyzelogs.jar /input2000 /output2000
$HADOOP_HOME/bin/hdfs dfs -cat /output2000/part-00000

9. Stop the daemons when done:
stop-all.sh
jps

10. For the GUI, go to the browser (localhost:50070) and open Utilities > Browse Directory.

Conclusion: Thus we have learnt how to design a distributed application using MapReduce and process a log file of a system.

Output: Please attach the output after the conclusion page.


Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon (BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’

GROUP A
ASSIGNMENT NO. 03

Write an application using HiveQL for flight information system which will include
a. Creating, Dropping, and altering Database tables.
b. Creating an external Hive table.
c. Load table with data, insert new values and field in the table, and Join tables with Hive.
d. Create index on Flight Information Table.
e. Find the average departure delay per day in 2008.
GIVEN DATE:

SUBMISSION DATE:

SIGN. OF FACULTY:

ASSIGNMENT NO. : 03(A)

AIM:

Write an application using HiveQL for flight information system which will
include
a. Creating, Dropping, and altering Database tables.
b. Creating an external Hive table.
c. Load table with data, insert new values and field in the table, and Join tables
with Hive.
d. Create index on Flight Information Table.
e. Find the average departure delay per day in 2008.

OBJECTIVES:
1. To understand Big Data Primitives & Fundamentals.
2. To understand different Big Data Processing Techniques.
OUTCOMES:
1. To apply Big Data primitives & fundamentals for application development.
2. To design algorithm & Techniques for Big Data Analysis.

THEORY:

Hive:
Apache Hive is a data warehouse system developed by Facebook to process huge amounts of structured data in Hadoop. To process data using Hadoop we would normally need to write complex MapReduce functions, which is not an easy task for most developers. Hive makes this work very easy for us. It uses a scripting language called HiveQL which is almost similar to SQL, so we just have to write SQL-like commands and the Hive backend will automatically convert them into MapReduce jobs.

Hive architecture:

Hive is data warehouse infrastructure software that creates an interaction between the user and HDFS. Hive chooses a database server to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping. HiveQL is similar to SQL for querying the schema information in the metastore. The execution engine processes the query and generates results in the same way as MapReduce; internally it uses MapReduce. The Hadoop Distributed File System or HBase is used as the underlying storage to store the data.
Working of Hive:

Data Types in Apache Hive

Hive data types are divided into the following 5 different categories:

1. Numeric Type: TINYINT, SMALLINT, INT, BIGINT


2. Date/Time Types: TIMESTAMP, DATE, INTERVAL
3. String Types: STRING, VARCHAR, CHAR
4. Complex Types: STRUCT, MAP, UNION, ARRAY
5. Misc Types: BOOLEAN, BINARY
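
As a concrete illustration of the HiveQL statements the assignment asks for, the sketch below issues them from Python through PyHive (the same library used later in Group B, Assignment 3). The database and table names, the column layout, the HDFS paths and the connection parameters are assumptions for illustration only; adapt them to the actual flight information data.

from pyhive import hive

# Connect to HiveServer2 (host, port and username are placeholders).
conn = hive.connect(host="localhost", port=10000, username="hadoopuser")
cur = conn.cursor()

# a. Creating, dropping and altering database tables.
cur.execute("CREATE DATABASE IF NOT EXISTS flights_db")
cur.execute("USE flights_db")
cur.execute("""
    CREATE TABLE IF NOT EXISTS flight_info (
        fl_date STRING, carrier STRING, origin STRING, dest STRING, dep_delay INT
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")
cur.execute("ALTER TABLE flight_info ADD COLUMNS (arr_delay INT)")

# b. Creating an external Hive table over data already in HDFS (path assumed).
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS flight_info_ext (
        fl_date STRING, carrier STRING, origin STRING, dest STRING, dep_delay INT
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hadoopuser/flights'
""")

# c. Load data and insert new values (file path assumed).
cur.execute("LOAD DATA INPATH '/user/hadoopuser/flights_2008.csv' INTO TABLE flight_info")
cur.execute("INSERT INTO TABLE flight_info VALUES ('2008-01-01', 'AA', 'PNQ', 'DEL', 15, 10)")

# d. Hive indexes (CREATE INDEX ... WITH DEFERRED REBUILD) existed in older Hive
#    releases but were removed in Hive 3.x, so this sketch skips that step.

# e. Average departure delay per day in 2008.
cur.execute("""
    SELECT fl_date, AVG(dep_delay)
    FROM flight_info
    WHERE fl_date LIKE '2008%'
    GROUP BY fl_date
""")
for row in cur.fetchall():
    print(row)

The same statements can equally be typed directly into the Hive CLI or Beeline; wrapping them in Python simply makes the whole flow repeatable as one script.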

Conclusion:

Hence we have created an application using HiveQL for flight information system
which will include
1. Creating, Dropping, and altering Database tables.
2. Creating an external Hive table.

Output: Please attach the output after the conclusion page.


Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon (BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’

GROUP B
ASSIGNMENT NO. 01

Perform the following operations using Python on the Facebook metrics data sets
a. Create data subsets
b. Merge Data
c. Sort Data
d. Transposing Data
e. Shape and reshape Data
GIVEN DATE:

SUBMISSION DATE:

SIGN. OF FACULTY:

ASSIGNMENT NO. : 01(B)

AIM:

Perform the following operations using Python on the Facebook metrics data
sets
a. Create data subsets
b. Merge Data
c. Sort Data
d. Transposing Data
e. Shape and reshape Data

OBJECTIVES:

1. To understand & apply the analytical concept of Big Data using R/Python.

OUTCOMES:

1. To apply the analytical concept of Big Data using R/Python.


2. To design Big Data analytic application for Emerging Trends.
THEORY:

Python
Python is a general-purpose interpreted, interactive, object-oriented, high-level programming language. It was created by Guido van Rossum during 1985-1990. Like Perl, Python source code is freely available under an open-source (GPL-compatible) license. This tutorial gives enough understanding of the Python programming language.

Python's features
 Easy-to-learn − Python has few keywords, simple structure, and a clearly defined syntax.
This allows the student to pick up the language quickly.
 Easy-to-read − Python code is more clearly defined and visible to the eyes.
 Easy-to-maintain − Python's source code is fairly easy-to-maintain.
 A broad standard library − Python's bulk of the library is very portable and cross-platform
compatible on UNIX, Windows, and Macintosh.
 Interactive Mode − Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.
 Portable − Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.
 Extendable − you can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.
 Databases − Python provides interfaces to all major commercial databases.
 GUI Programming − Python supports GUI applications that can be created and ported to many system calls, libraries and windowing systems.

Installation steps -
Python is available on a wide variety of platforms including Linux and Mac OS.
Python distribution is available for a wide variety of platforms. You need to download only
the binary code applicable for your platform and install Python.

UNIX and LINUX Installation

Here are the simple steps to install Python on Unix/Linux machine.

 Open a Web browser and go to https://www.python.org/downloads/


 Follow the link to download zipped source code available for Unix/Linux.
 Download and extract files.
 Editing the Modules/Setup file if you want to customize some options.
 Run ./configure script
 Make
 Make install

This installs Python at standard location /usr/local/bin and its libraries at


/usr/local/lib/pythonXX where XX is the version of Python.
Note- You can install Python on Windows or any other operating system.

Jupyter notebook:

With your virtual environment active, install Jupyter with the local instance of pip:

pip install jupyter

Run your notebook:

jupyter notebook

Dataset: Download Facebook metrics data set.

Sample operation statements (a pandas sketch illustrating several of these follows the list):

1. Create a data subset having 5 columns and 50 rows (Type, Post Weekday, Post Hour, like, share).
2. Create a data subset having 4 columns and 25 rows (Type, Total Interactions, like, share).
3. Create a subset having Post Hour > 3.
4. Sort subset 1 on the like and share columns in descending order.
5. Merge the first two subsets on Type and sort them on the number of shares (share column).
6. Merge the first two subsets on different left (like) and right (share) columns.
7. Transpose the first two subsets and sort them in descending order.
8. Show all data frames in wide and long formats. Convert wide to long and vice versa.
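
A minimal pandas sketch of some of these operations is given below. The file name dataset_Facebook.csv and the semicolon separator are assumptions about how the downloaded Facebook metrics file is stored; adjust them to the actual download.

import pandas as pd

# Load the Facebook metrics data set (file name and separator are assumed).
df = pd.read_csv("dataset_Facebook.csv", sep=";")

# 1-2. Create data subsets.
subset1 = df[["Type", "Post Weekday", "Post Hour", "like", "share"]].head(50)
subset2 = df[["Type", "Total Interactions", "like", "share"]].head(25)

# 3. Rows with Post Hour > 3.
late_posts = df[df["Post Hour"] > 3]

# 4. Sort subset 1 on like and share in descending order.
sorted1 = subset1.sort_values(by=["like", "share"], ascending=False)

# 5-6. Merge the subsets on Type, and on different left/right columns.
merged_on_type = pd.merge(subset1, subset2, on="Type").sort_values(by="share_y")
merged_lr = pd.merge(subset1, subset2, left_on="like", right_on="share")

# 7. Transpose.
transposed1 = subset1.T

# 8. Reshape: wide to long (melt) and back to wide (pivot).
long_form = pd.melt(subset1.reset_index(), id_vars=["index", "Type"],
                    value_vars=["like", "share"], var_name="metric", value_name="count")
wide_form = long_form.pivot(index="index", columns="metric", values="count")

print(sorted1.head())
print(long_form.head())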

Conclusion:

Thus we have learnt different operations using Python on the Facebook metrics data
sets.
Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon (BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’

GROUP B
ASSIGNMENT NO. 02

Perform the following operations using Python on the Air quality and Heart Diseases data sets
a. Data cleaning
b. Data integration
c. Data transformation
d. Error correcting
e. Data model building
GIVEN DATE:

SUBMISSION DATE:

SIGN. OF FACULTY:

ASSIGNMENT NO. : 02(B)


AIM:
Perform the following operations using Python on the Air quality and
Heart Diseases data sets
a. Data cleaning
b. Data integration
c. Data transformation
d. Error correcting
e. Data model building

OBJECTIVES:
1. To understand & apply the analytical concept of Big Data using R/Python.
2. To understand different Data Visualization techniques for Big Data.

OUTCOMES:
1. To apply the analytical concept of Big Data using R/Python.
2. To design Big Data analytic application for Emerging Trends.
THEORY:

Download the datasets Air Quality and heart diseases available at kaggle.com.

Data cleaning

Data cleaning means fixing bad data in your data set. Bad data could be:

• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
When working with multiple data sources, there are many chances for data to be incorrect, duplicated, or mislabeled. If data is wrong, outcomes and algorithms are unreliable, even though they may look correct. Data cleaning is the process of changing or eliminating garbage, incorrect, duplicate, corrupted, or incomplete data in a dataset. There is no single way to describe the precise steps in the data cleaning process, because the process varies from dataset to dataset. Data cleaning, also called data cleansing or data scrubbing, is the first step of the general data preparation process. It plays an important part in developing reliable answers within the analytical process and is a basic part of data science. Its motive is to construct uniform and standardized data sets that give data analytics and business intelligence tools easy access to accurate data for each problem.

Data cleaning is one of the most important tasks for a data science professional. Wrong or bad-quality data can be detrimental to processes and analysis, while clean data ultimately increases overall productivity and permits the highest-quality information in decision-making. Cleaning aims at the following (a pandas sketch of typical cleaning steps follows this list):

• Error-Free Data
• Data Quality
• Accurate and Efficient
• Complete Data
• Maintains Data Consistency
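
A minimal pandas sketch of these cleaning steps is shown below. It assumes the air quality file is named AirQuality.csv, uses semicolons and comma decimals, and marks missing readings as -200 (as the UCI version of this dataset does); the column names are assumptions to adapt to the actual download.

import pandas as pd

# Load the air quality data (file name, separator and decimal mark are assumed).
air = pd.read_csv("AirQuality.csv", sep=";", decimal=",")

# Empty cells: drop columns/rows that are completely empty, then remove duplicates.
air = air.dropna(axis=1, how="all").dropna(axis=0, how="all")
air = air.drop_duplicates()

# Wrong format: parse the date column and force a sensor column to numeric.
air["Date"] = pd.to_datetime(air["Date"], dayfirst=True, errors="coerce")
air["CO(GT)"] = pd.to_numeric(air["CO(GT)"], errors="coerce")

# Wrong data: the -200 marker (assumed) denotes a missing reading; treat it as NaN
# and fill the gaps in one column with the column mean.
air = air.replace(-200, float("nan"))
air["CO(GT)"] = air["CO(GT)"].fillna(air["CO(GT)"].mean())

print(air.isnull().sum())   # remaining missing values per column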
Data Integration:

So far, we've made sure to remove the impurities in data and make it clean. Now, the next step
is to combine data from different sources to get a unified structure with more meaningful and
valuable information. This is mostly used if the data is segregated into different sources.

Data Transformation:

Now, we have a lot of columns that have different types of data. Our goal is to transform the
data into a machine-learning-digestible format. All machine learning algorithms are based on
mathematics. So, we need to convert all the columns into numerical format.
Handling Categorical Data:

There are some algorithms that can work well with categorical data, such as decision trees.
But most machine learning algorithms cannot operate directly with categorical data. These
algorithms require the input and output both to be in numerical form. If the output to be
predicted is categorical, then after prediction we convert them back to categorical data from
numerical data. Let's discuss some key challenges that we face while dealing with categorical
data.

Encoding:

To address the problems associated with categorical data, we can use encoding. This is the
process by which we convert a categorical variable into a numerical form. Here, we will
look at three simple methods of encoding categorical data.

Replacing:

This is a technique in which we replace the categorical data with a number. It is a simple replacement and does not involve much logical processing. The short sketch below gives a better idea of this.
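
A small sketch of encoding on the heart disease data is shown below. The file name heart.csv, the column names sex and cp, and the replacement mappings are assumptions for illustration; the same calls apply to whichever categorical columns the actual file contains.

import pandas as pd

heart = pd.read_csv("heart.csv")   # file name assumed

# Replacing: map categorical labels to numbers directly (mapping assumed).
heart["sex"] = heart["sex"].replace({"male": 1, "female": 0})

# Label encoding via pandas categories.
heart["cp_code"] = heart["cp"].astype("category").cat.codes

# One-hot encoding (dummy variables) as an alternative.
heart = pd.get_dummies(heart, columns=["cp"], prefix="cp")

print(heart.head())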

Error Correction:

There are many reasons, such as noise, cross-talk, etc., which may cause data to get corrupted during transmission. Most applications would not function as expected if they receive erroneous data, so error correction is important before any analysis. Two quick checks are listed below (see the short sketch after this list):
• Gauge min and max values: For continuous variables, checking the minimum and maximum values for each column can give you a quick idea of whether your values fall within the correct range.
• Look for missing values: The easiest way to find missing values is to count them or sort your columns; the missing values can then be replaced or removed before analysis.
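
A short, self-contained sketch of these two checks, under the same assumptions about the air quality file and its columns as in the cleaning sketch above:

import pandas as pd

air = pd.read_csv("AirQuality.csv", sep=";", decimal=",")   # file layout assumed

# Gauge min and max values of continuous columns to spot out-of-range readings
# (the assumed -200 missing marker shows up clearly here).
print(air[["CO(GT)", "NOx(GT)"]].describe().loc[["min", "max"]])

# Look for missing values: count them per column before replacing or removing them.
print(air.isnull().sum())

# Example correction: treat -200 as missing, then drop rows still incomplete in key columns.
air = air.replace(-200, float("nan")).dropna(subset=["CO(GT)", "NOx(GT)"])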

Model Building:

In this phase, the data science team needs to develop data sets for training, testing, and production purposes. These data sets enable data scientists to develop analytical methods and train them, while holding aside some data for testing the model. Typical steps include the following (a scikit-learn sketch follows this list):
• Normalization
• Simple and Multiple Linear Regression
• Model Evaluation Using Visualization
• Polynomial Regression and Pipelines
• R-squared and MSE for In-Sample Evaluation
• Prediction and Decision Making
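
A minimal model-building sketch on the heart disease data using scikit-learn is shown below. The file name heart.csv, the chosen feature columns and the target column are assumptions for illustration only.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

heart = pd.read_csv("heart.csv")            # file name assumed
X = heart[["age", "trestbps", "chol"]]      # feature columns assumed
y = heart["thalach"]                        # target column assumed (max heart rate)

# Hold aside a test set for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalization.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Simple/multiple linear regression.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R-squared and MSE for evaluation, then prediction-based decision making follows.
print("R2:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))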

Conclusion:

Thus we have learnt different operations using Python on the Air quality and Heart Diseases data sets.
Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon (BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’

GROUP B
ASSIGNMENT NO. 03

Integrate Python and Hadoop and perform the following operations on forest fire dataset
a. Data analysis using the Map Reduce in PyHadoop.
b. Data mining in Hive
GIVEN DATE:

SUBMISSION DATE:

SIGN. OF FACULTY:

ASSIGNMENT NO. : 03(B)

AIM:
Integrate Python and Hadoop and perform the following operations on
forest fire dataset
a. Data analysis using the Map Reduce in PyHadoop.
b. Data mining in Hive
OBJECTIVES:

1. To understand & apply the Analytical concept of Big Data using R/Python.
2. To understand different Data Visualization techniques for Big Data.

OUTCOMES:

1. To apply the analytical concept of Big Data using R/Python.


2. To design Big Data analytic application for Emerging Trends.
THEORY:

Python:

Python is a popular high-level programming language known for its simplicity and readability.
It has a large standard library and a vast ecosystem of third-party packages that make it
suitable for a wide range of applications. Python supports multiple programming paradigms,
including procedural, object-oriented, and functional programming. It is widely used in
various domains, such as web development, data analysis, machine learning, artificial
intelligence, scientific computing, and automation.

Hadoop:

Hadoop is an open-source framework designed for distributed storage and processing of large
datasets across clusters of computers. It provides a scalable and fault-tolerant solution for
processing and analyzing big data. The core components of Hadoop are Hadoop Distributed
File System (HDFS) for distributed storage and MapReduce for distributed processing.
Hadoop allows for parallel processing of data by breaking it into smaller chunks and
distributing them across multiple nodes in a cluster. It is widely used in big data analytics and
has become the de facto standard for processing large-scale datasets.

Integration of Python and Hadoop:

Python can be integrated with Hadoop through various libraries and frameworks, such as
PySpark, Hadoop Streaming, and PyHive. PySpark provides a Python API for Apache Spark,
a fast and distributed data processing engine that runs on top of Hadoop. Hadoop Streaming
allows you to write MapReduce programs in any language, including Python, by reading input
from standard input and writing output to standard output. PyHive provides a Python interface
to interact with Apache Hive, a data warehouse infrastructure built on top of Hadoop, allowing
you to perform data mining and analysis using SQL-like queries. By integrating Python with
Hadoop, you can leverage the power of distributed computing to process and analyze large
datasets efficiently. Python's simplicity and rich ecosystem make it a popular choice for data
analysis and manipulation, while Hadoop provides the infrastructure for distributed
processing and storage. To integrate Python and Hadoop, you can use the PySpark library,
which provides a Python API for Apache Spark, a fast and distributed data processing engine
that runs on Hadoop. PySpark allows you to write MapReduce programs in Python and
execute them on a Hadoop cluster.

Here's a step-by-step guide on how to perform data analysis using MapReduce in PyHadoop:

1. Install PySpark: Install PySpark on your machine by following the instructions


provided in the official PySpark documentation:
https://spark.apache.org/docs/latest/api/python/getting_started/installation.html

2. Set up Hadoop cluster: Set up a Hadoop cluster or use an existing cluster that you
have access to.
3. Import necessary modules: In your Python script, import the necessary modules for
PySpark: from pyspark import SparkContext, SparkConf
4. Create a SparkContext: Create a SparkContext object to connect to the Spark cluster:

conf = SparkConf().setAppName("ForestFireAnalysis")
sc = SparkContext(conf=conf)

5. Load the forest fire dataset: Load the forest fire dataset into an RDD (Resilient
Distributed Dataset) using the textFile() method. Assuming the dataset is stored in
HDFS, you can specify the HDFS path to the dataset file:

dataset_path = "hdfs://<hdfs_path_to_dataset>"
dataset_rdd = sc.textFile(dataset_path)

6. MapReduce operations: Perform your desired data analysis using MapReduce operations, such as map(), reduceByKey(), filter(), etc. Here's an example of counting the number of forest fires by year:

def extract_year(record):
    fields = record.split(",")
    return fields[2].split("-")[0]

# Map operation: extract the year from each record
year_rdd = dataset_rdd.map(extract_year)

# Reduce operation: count the number of forest fires by year
fire_count_by_year = year_rdd.map(lambda year: (year, 1)).reduceByKey(lambda a, b: a + b)

# Print the results
for year, count in fire_count_by_year.collect():
    print(f"Year: {year}, Count: {count}")

7. Stop the SparkContext: After performing the analysis, stop the SparkContext to release
the cluster resources:

sc.stop()

For data mining in Hive, you can use the PyHive library, which provides a Python interface
to interact with Hive. Here's an example of how to perform data mining operations in Hive
using PyHive:

1. Install PyHive: Install PyHive on your machine by running the


following command:

pip install pyhive


2. Import necessary modules: In your Python script, import the necessary modules for
PyHive:
from pyhive import hive

3. Connect to Hive: Connect to the Hive server using the connect function:
connection = hive.connect(host='<hive_host>', port=<hive_port>,
username='<hive_username>')

Replace <hive_host>, <hive_port>, and <hive_username> with


the appropriate values for your Hive setup.

4. Create a cursor: Create a cursor object to execute Hive queries:


cursor = connection.cursor()

5. Execute data mining queries: Use the cursor to execute data mining queries in Hive.
For example, you can run a query to find the average forest fire area by month:

query = "SELECT MONTH(date), AVG(area) FROM forest_fire_data GROUP BY


MONTH(date)"
cursor.execute(query)

# Fetch the results results = cursor

Conclusion:

Hence we have integrated Python and Hadoop and performed the following operations on
forest fire dataset:
a. Data analysis using the Map Reduce in PyHadoop.
b. Data mining in Hive
Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon (BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’

GROUP B
ASSIGNMENT NO. 04

Visualize the data using Python libraries matplotlib, seaborn by plotting the graphs for assignment no. 2 and 3 (Group B)
GIVEN DATE:

SUBMISSION DATE:

SIGN. OF FACULTY:

ASSIGNMENT NO. : 04(B)


AIM:

Visualize the data using Python libraries matplotlib, seaborn by plotting


the graphs for assignment no. 2 and 3 (Group B)

OBJECTIVES:

1. To understand & apply the Analytical concept of Big Data using R/Python.
2. To understand different data visualization techniques for Big Data.

OUTCOMES:

1. To apply the analytical concept of Big Data using R/Python.


2. To design Big Data Analytic application for Emerging Trends.

THEORY:

It may sometimes seem easier to go through a set of data points and build insights from
it but usually this process may not yield good results. There could be a lot of things left
undiscovered as a result of this process. Additionally, most of the data sets used in real life
are too big to do any analysis manually.
Data visualization is an easier way of presenting the data, however complex it is, to analyze
trends and relationships amongst variables with the help of pictorial representation.
The following are the advantages of data visualization:
1. Easier representation of complex data
2. Highlights good and bad performing areas
3. Explores relationships between data points
4. Identifies data patterns even for large numbers of data points

A good visualization should have:

1. Appropriate usage of shapes, colors, and size while building visualization


2. Plots/graphs using a co-ordinate system are more pronounced
3. Knowledge of suitable plot with respect to the data types brings more clarity to the
information
4. Usage of labels, titles, legends and pointers passes seamless information to the wider audience.

Visualization libraries in python:

There are a lot of Python libraries which can be used to build visualizations, like matplotlib, vispy, bokeh, seaborn, pygal, folium, plotly, cufflinks, and networkx. Of the many, matplotlib and seaborn seem to be the most widely used for basic to intermediate levels of visualization.

1. Matplotlib

Matplotlib is a library in Python for 2D plots of arrays. It is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack. Its well-maintained visualization output with high-quality graphics draws a lot of users to it. Basic as well as advanced charts can be built very easily from the user's/developer's point of view, and since it has large community support, resolving issues and debugging becomes much easier.

2. Seaborn

This library sits on top of matplotlib, which means it retains some of matplotlib's flavour while, from a visualization point of view, it is much better than matplotlib and has added features as well (a short plotting sketch follows the list of benefits below).
Benefits:
 Built-in themes aid better visualization
 Statistical functions aiding better data insights
 Better aesthetics and built-in plots
 Helpful documentation with effective examples
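
A minimal plotting sketch for the assignment-2 data is shown below, assuming the heart disease file is named heart.csv and contains columns named age, chol and target; these names are assumptions to adapt to the actual dataset.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

heart = pd.read_csv("heart.csv")   # file name assumed

# Matplotlib: histogram of patient ages.
plt.hist(heart["age"], bins=20, color="steelblue")
plt.xlabel("Age")
plt.ylabel("Number of patients")
plt.title("Age distribution")
plt.show()

# Seaborn: cholesterol vs. age, coloured by the (assumed) target column.
sns.scatterplot(data=heart, x="age", y="chol", hue="target")
plt.title("Cholesterol vs. age")
plt.show()

# Seaborn: correlation heatmap of the numeric columns.
sns.heatmap(heart.select_dtypes(include="number").corr(), cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()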

Conclusion:
Hence we have successfully visualized the data using the Python libraries matplotlib and seaborn by plotting the graphs for assignment no. 2 and 3 (Group B).
Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon (BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’

GROUP B
ASSIGNMENT NO. 05

Perform the following data visualization operations using Tableau on Adult and Iris datasets.
a. 1D (Linear) Data visualization
b. 2D (Planar) Data Visualization
c. 3D (Volumetric) Data Visualization
d. Temporal Data Visualization
e. Multidimensional Data Visualization
f. Tree/ Hierarchical Data visualization
g. Network Data visualization
GIVEN DATE:

SUBMISSION DATE:
SIGN. OF FACULTY:

ASSIGNMENT NO. : 05(B)


AIM:

Perform the following data visualization operations using Tableau on Adult


and Iris datasets.
a. 1D (Linear) Data visualization
b. 2D (Planar) Data Visualization
c. 3D (Volumetric) Data Visualization
d. Temporal Data Visualization
e. Multidimensional Data Visualization
f. Tree/ Hierarchical Data visualization
g. Network Data visualization

OBJECTIVES:

1. To understand Application & Impact of Big Data.


2. To understand Emerging Trends in Big Data Analytics.
3. To understand different data visualization techniques for Emerging Trends.
OUTCOMES:

1. To design Big Data Analytic Application of Emerging Trends.


2. To visualize the Big Data using Tableau.
3. To design algorithms & techniques for Big Data Analytics.

THEORY:

Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of
data, meaning "information that has been abstracted in some schematic form, including
attributes or variables for the units of information".
Data visualization refers to the techniques used to communicate data or information by
encoding it as visual objects (e.g., points, lines or bars) contained in graphics. The goal is to
communicate information clearly and efficiently to users. It is one of the steps in data analysis
or data science.

Examples (1D/linear data):
• Lists of data items, organized by a single feature (e.g., alphabetical order) (not commonly visualized)

Examples (2D/planar, geospatial data):
• Choropleth map
Broadly, examples of scientific visualization:

• 3D computer models

In 3D computer graphics, 3D modeling (or three-dimensional modeling) is the process of


developing a mathematical representation of any surface of an object (either inanimate or
living) in three dimensions via specialized software. The product is called a 3D model.
Someone who works with 3D models may be referred to as a 3D artist. It can be displayed as
a two- dimensional image through a process called 3D rendering or used in a computer
simulation of physical phenomena. The model can also be physically created using 3D
printing devices.

• Surface and volume rendering

Rendering is the process of generating an image from a model, by means of computer


programs. The model is a description of three-dimensional objects in a strictly defined
language or data structure. It would contain geometry, viewpoint, texture, lighting, and
shading information. The image is a digital image or raster graphics image. The term may be
by analogy with an "artist's rendering" of a scene. 'Rendering' is also used to describe the
process of calculating effects in a video editing file to produce final video output.

Volume rendering is a technique used to display a 2D projection of a 3D


discretely sampled data set. A typical 3D data set is a group of 2D slice images
acquired by a CT or MRI scanner. Usually these are acquired in a regular pattern
(e.g., one slice every millimeter) and usually have a regular number of image
pixels in a regular pattern. This is an example of a regular volumetric grid, with
each volume element, or voxel represented by a single value that is obtained by
sampling the immediate area surrounding the voxel.

• Computer simulations

Computer simulation is a computer program, or network of computers, that


attempts to simulate an abstract model of a particular system. Computer
simulations have become a useful part of mathematical modeling of many natural
systems in physics, and computational physics, chemistry and biology; human
systems in economics, psychology, and social science; and in the process of
engineering and new technology, to gain insight into the operation of those
systems, or to observe their behavior. [6] The simultaneous visualization and
simulation of a system is called visualization.

Examples (temporal data):
• Time series
Tools: SIMILE Timeline, TimeFlow, Timeline JS, Excel
[Image: Friendly, M. & Denis, D. J. (2001). Milestones in the history of thematic cartography, statistical graphics, and data visualization. Web document, http://www.datavis.ca/milestones/. Accessed: August 30, 2012.]

Examples (category proportions, counts):
• Histogram
• Pie chart
Examples (tree/hierarchical data):
• General tree visualization
• Dendrogram

Examples (network data):
• Matrix
• Node-link diagram (link-based layout algorithm)

Tableau:
Tableau is a Business Intelligence tool for visually analyzing data. Users can create and distribute interactive and shareable dashboards, which depict the trends, variations, and density of the data in the form of graphs and charts. Tableau can connect to files, relational and Big Data sources to acquire and process data. The software allows data blending and real-time collaboration, which makes it very unique. It is used by businesses, academic researchers, and many government organizations for visual data analysis. It has also been positioned as a leader in the Gartner Magic Quadrant for Business Intelligence and Analytics Platforms.

Tableau Features:

Tableau provides solutions for all kinds of industries, departments, and data environments.
Following are some unique features which enable Tableau to handle diverse scenarios.

 Speed of Analysis − As it does not require a high level of programming expertise, any user with access to data can start using it to derive value from the data.
 Self-Reliant − Tableau does not need a complex software setup. The desktop version
which is used by most users is easily installed and contains all the features needed to
start and complete data analysis.
 Visual Discovery − The user explores and analyzes the data by using visual tools
like colors, trend lines, charts, and graphs. There is very little script to be written
as nearly everything is done by drag and drop.
 Blend Diverse Data Sets − Tableau allows you to blend different relational, semi-structured and raw data sources in real time, without expensive up-front integration costs. The users don't need to know the details of how data is stored.
 Architecture Agnostic − Tableau works in all kinds of devices where data flows. Hence,
the user need not worry about specific hardware or software requirements to use
Tableau.
 Real-Time Collaboration − Tableau can filter, sort, and discuss data on the fly and
embed a live dashboard in portals like SharePoint site or Salesforce. You can save your
view of data and allow colleagues to subscribe to your interactive dashboards so they
see the very latest data just by refreshing their web browser.
 Centralized Data − Tableau Server provides a centralized location to manage all of the organization's published data sources. You can delete, change permissions, add tags, and manage schedules in one convenient location. It's easy to schedule extract refreshes and manage them in the data server. Administrators can centrally define a schedule for extracts on the server for both incremental and full refreshes.

There are three basic steps involved in creating any Tableau data analysis report.
These three steps are −

 Connect to a data source − It involves locating the data and using an appropriate type
of connection to read the data.
 Choose dimensions and measures − this involves selecting the required columns from
the source data for analysis.
 Apply visualization technique − This involves applying required visualization
methods, such as a specific chart or graph type to the data being analyzed.

For convenience, let's use the sample data set that comes with the Tableau installation, named Sample – Superstore.xls. Locate the installation folder of Tableau and go to My Tableau Repository. Under it, you will find the above file at Data sources\9.2\en_US-US.

Connect to a Data Source

On opening Tableau, you will get the start page showing various data sources. Under
the header “Connect”, you have options to choose a file or server or saved data source. Under
Files, choose excel. Then navigate to the file “Sample – Superstore.xls” as mentioned above.
The excel file has three sheets named Orders, People and Returns. Choose Orders.

Choose the Dimensions and Measures

Next, choose the data to be analyzed by deciding on the dimensions and measures.
Dimensions are the descriptive data while measures are numeric data. When put together, they
help visualize the performance of the dimensional data with respect to the data which are
measures. Choose Category and Region as the dimensions and Sales as the measure. Drag and
drop them as shown in the following screenshot. The result shows the total sales in each
category for each region.

Apply Visualization Technique

In the previous step, you can see that the data is available only as numbers. You have
to read and calculate each of the values to judge the performance. However, you can see them
as graphs or charts with different colors to make a quicker judgment.
We drag and drop the sum (sales) column from the Marks tab to the Columns shelf. The table
showing the numeric values of sales now turns into a bar chart automatically.

Conclusion:
Thus we have learnt how to visualize data of different types (1D (Linear) Data Visualization, 2D (Planar) Data Visualization, 3D (Volumetric) Data Visualization, Temporal Data Visualization, Multidimensional Data Visualization, Tree/Hierarchical Data Visualization, Network Data Visualization) by using Tableau software.
