DSBDA Lab Manual 23-24
Laboratory
2023-2024
VISION
MISSION
1) To transform the students into innovative, competent and high quality IT
professionals to meet the growing global challenges.
The students of the Information Technology programme, after graduating, will:
PEO2: Possess knowledge and skills in the field of Computer Science and Information Technology for analysing, designing, and implementing complex engineering problems of any domain with innovative approaches.
PEO3: Possess an attitude and aptitude for research, entrepreneurship, and higher studies in the field of Computer Science and Information Technology.
PEO5: Possess better communication, presentation, time management, and teamwork skills, leading to responsible and competent professionals able to address challenges in the field of IT at the global level.
PROGRAM OUTCOMES
The students in the Information Technology course are expected to know and be able to:
PSO2: Decision-making skills through the use of modern IT tools, preparing students for professional responsibilities.
LABORATORY CODE
SYLLABUS
Prerequisite Courses:
Discrete mathematics
Database Management Systems, Data warehousing, Data mining
Programming in Python
Course Objectives:
1. To understand Big data primitives and fundamentals.
3. To understand and apply the Analytical concept of Big data using Python.
Course Outcomes:
On completion of the course, students will be able to–
CO1: Apply Big data primitives and fundamentals for application development.
CO2: Explore different Big data processing techniques with use cases.
CO3: Apply the Analytical concept of Big data using Python.
CO4: Visualize the Big Data using Tableau.
CO5: Design algorithms and techniques for Big data analytics.
CO6: Design and develop Big data analytic application for emerging trends.
2. Design a distributed application using MapReduce (using Java) that processes a log file of a system. List the users who have logged in for the maximum period on the system. Use a simple log file from the Internet and process it in pseudo-distributed mode on the Hadoop platform.
3. Write an application using HiveQL for flight information system which will include
A. Creating, Dropping, and altering Database tables.
B. Creating an external Hive table.
C. Load the table with data, insert new values and fields into the table, and join tables with Hive
D. Create index on Flight Information Table
E. Find the average departure delay per day in 2008.
1. Perform the following operations using Python on the Facebook metrics data sets
a. Create data subsets
b. Merge Data
c. Sort Data
d. Transposing Data
e. Shape and reshape Data
2. Perform the following operations using Python on the Air quality and Heart Diseases data
sets
a. Data cleaning
b. Data integration
c. Data transformation
d. Error correcting
e. Data model building
3. Integrate Python and Hadoop and perform the following operations on forest fire dataset
a. Data analysis using the Map Reduce in PyHadoop
b. Data mining in Hive
4. Visualize the data using the Python libraries matplotlib and seaborn by plotting the graphs for assignments no. 2 and 3 (Group B)
5. Perform the following data visualization operations using Tableau on Adult and Iris datasets.
a. 1D (Linear) Data visualization
b. 2D (Planar) Data Visualization
c. 3D (Volumetric) Data Visualization
d. Temporal Data Visualization
e. Multidimensional Data Visualization
f. Tree/ Hierarchical Data visualization
g. Network Data visualization
Reference Books:
1. Big Data, Black Book, DT Editorial Services, 2015 edition.
2. Data Analytics with Hadoop, Jenny Kim, Benjamin Bengfort, O'Reilly Media, Inc.
3. Python for Data Analysis by Wes McKinney, published by O'Reilly Media, ISBN: 978-1-449-31979-3.
4. Python Data Science Handbook by Jake VanderPlas
https://tanthiamhuat.files.wordpress.com/2018/04/pythondatasciencehandbook.pdf
5. Alex Holmes, Hadoop in Practice, Dreamtech Press.
6. Online references for data sets
a. http://archive.ics.uci.edu/ml/
b. https://www.kaggle.com/tanmoyie/us-graduate-schools-admission-parameters
c. https://www.kaggle.com
Sinhgad Technical Education Society’s
SINHGAD COLLEGE OF ENGINEERING, PUNE
S. No. 44/1, Off Sinhgad Road, Vadgaon(BK), Pune- 411041
Accredited by NAAC with Grade ‘A+’
ACADEMIC YEAR 2023-24, SEMESTER-II
DEPARTMENT OF INFORMATION TECHNOLOGY
INDEX
GIVEN DATE:
SUBMISSION DATE:
SIGN. OF FACULTY:
AIM:
To Perform Hadoop Installation (Configuration) on
a. Single Node
b. Multiple Nodes
OBJECTIVES:
1. To understand Big Data Fundamentals.
2. To understand Different Big Data Processing Techniques.
OUTCOMES:
1. To apply Big Data Primitives & Fundamentals for application development.
THEORY:
Cluster Computing
A computer cluster is a set of connected computers (nodes) that work together as if
they are a single (much more powerful) machine. Unlike grid computers, where each node
performs a different task, computer clusters assign the same task to each node.
Homogeneous Cluster
In homogeneous clusters, all machines are assumed to be the same; however, in the
heterogeneous type, machines have different computing and consumption power. All-in
strategy (AIS) [70] is a framework for energy management in MapReduce clusters by
powering down all nodes in the cluster during a low utilization period.
Heterogeneous Cluster
A heterogeneous cluster environment can contain processors and devices with
different bandwidth and computational capabilities. Symmetric MPI applications will assign
identical workloads to all participants in the application, which can cause load imbalance, as
the execution time might be shorter on some devices due to their higher computational
performance.
Hadoop
Apache Hadoop is an open-source software framework written in Java for distributed storage
and distributed processing of very large data sets on computer clusters built from commodity
hardware. All the modules in Hadoop are designed with a fundamental assumption that
hardware failures are common and should be automatically handled by the framework. The
core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System
(HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and
distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code
for nodes to process in parallel based on the data that needs to be processed. This approach
takes advantage of data locality— nodes manipulating the data they have access to— to allow
the dataset to be processed faster and more efficiently than it would be in a more conventional
supercomputer architecture that relies on a parallel file system where computation and data
are distributed via high-speed networking.
Installing Hadoop
Prerequisites
• VIRTUAL BOX: used to install the guest operating system.
• OPERATING SYSTEM: You can install Hadoop on Linux-based operating systems. Ubuntu and CentOS are very commonly used. In this tutorial, we are using CentOS.
• JAVA: You need to install the Java 8 package on your system.
• HADOOP: You require the Hadoop 2.7.3 package.
Step 1: Download the Java 8 Package & Save the file in your home directory.
Step 2: Extract the Java Tar File.
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc
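The exact entries depend on where you extracted the packages; a minimal sketch, assuming Java was extracted to $HOME/jdk1.8.0_101 and Hadoop to $HOME/hadoop-2.7.3 (adjust both paths to your actual locations):
export JAVA_HOME=$HOME/jdk1.8.0_101
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
After saving the file, run source .bashrc so the changes take effect in the current shell.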
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside the configuration tag:
The hdfs-site.xml file contains the configuration settings of the HDFS daemons (i.e. NameNode, DataNode, and Secondary NameNode). It also includes the replication factor and block size of HDFS.
Command: vi hdfs-site.xml
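The exact properties depend on your cluster layout; a minimal single-node sketch (the directory path is illustrative and should point to a folder that exists on your machine) is:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
</configuration>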
Step 9: Edit mapred-site.xml. In some cases, the mapred-site.xml file is not available, so we have to create it from the mapred-site.xml.template file.
Command: vi mapred-site.xml
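Creating the file from the template and pointing MapReduce at YARN typically looks like the following for Hadoop 2.x (a sketch; mapreduce.framework.name tells MapReduce jobs to run on YARN):
Command: cp mapred-site.xml.template mapred-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>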
Step 10: Edit yarn-site.xml and edit the property mentioned below inside configuration tag:
Command: vi yarn-site.xml
Fig: Hadoop Installation – Configuring yarn-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Step 11: Edit hadoop-env.sh and add the Java path as mentioned below:
The hadoop-env.sh file contains the environment variables used by the scripts that run Hadoop, such as the Java home path.
Command: vi hadoop-env.sh
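For example, keeping the Java location assumed earlier (use an absolute path if $HOME is not expanded in your setup), the line to set in hadoop-env.sh is:
export JAVA_HOME=$HOME/jdk1.8.0_101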
Command: cd
Command: cd hadoop-2.7.3
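Step 12: With the two commands above you are in the Hadoop home directory; now format the NameNode (a one-time step before the first start). The usual Hadoop 2.x command is:
Command: bin/hadoop namenode -format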
Step 13: Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all the
daemons.
Command: cd hadoop-2.7.3/sbin
Either you can start all daemons with a single command or do it individually.
Command: ./start-all.sh
Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files stored in HDFS and tracks all the files stored across the cluster.
Start DataNode: On startup, a DataNode connects to the Namenode and it responds to the
requests from the Namenode for different operations.
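If you prefer to start the daemons one by one rather than with start-all.sh, the per-daemon scripts shipped in the Hadoop 2.7.3 sbin directory can be used:
Command: ./hadoop-daemon.sh start namenode
Command: ./hadoop-daemon.sh start datanode
Command: ./yarn-daemon.sh start resourcemanager
Command: ./yarn-daemon.sh start nodemanager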
Start JobHistoryServer:
JobHistoryServer is responsible for servicing all job-history-related requests from clients.
Command: ./mr-jobhistory-daemon.sh start historyserver
Step 14: To check that all the Hadoop services are up and running, run the below command.
Command: jps
SUBMISSION DATE:
SIGN. OF FACULTY:
OBJECTIVES:
1. To understand Big Data Primitives & Fundamentals.
2. To understand different Big Data Processing Techniques.
OUTCOMES:
1. To apply Big Data primitives & fundamentals for application development.
2. To design algorithms and techniques for Big Data analysis.
THEORY:
MapReduce is a framework using which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a program model for distributed computing based
on java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (key/value pairs). Secondly, reduce task, which takes the output from
a map as an input and combines those data tuples into a smaller set of tuples. As the sequence
of the name MapReduce implies, the reduce task is always performed after the map job. The
major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers
is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling
the application to run over hundreds, thousands, or even tens of thousands of machines in a
cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model. MapReduce program executes in three stages,
namely map stage, shuffle stage, and reduce stage.
Map stage: The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
The MapReduce framework operates on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Input and output types of a MapReduce job: (Input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (Output).
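As a concrete illustration of the mapper and reducer for this assignment, the sketch below assumes a simplified log format in which each line is comma-separated, with the user name in the first field and the session duration in minutes in the last field; the real log file you download will likely need different parsing. The package name SalesCountry and the old mapred API follow the folder and driver names used in the walkthrough that follows; the class names LogMapper and LogReducer are illustrative, and each class goes in its own source file.

// Mapper.java (sketch)
package SalesCountry;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LogMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String[] fields = value.toString().split(",");
        String user = fields[0].trim();                                    // assumed: user name in the first field
        int minutes = Integer.parseInt(fields[fields.length - 1].trim());  // assumed: duration (minutes) in the last field
        output.collect(new Text(user), new IntWritable(minutes));          // emit (user, duration)
    }
}

// Reduce.java (sketch)
package SalesCountry;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LogReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int total = 0;
        while (values.hasNext()) {
            total += values.next().get();               // sum all session durations for this user
        }
        output.collect(key, new IntWritable(total));    // the user with the largest total logged in the longest
    }
}

The driver class (SalesCountryDriver in the walkthrough below) wires these two classes into a JobConf, sets the input and output paths, and submits the job.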
su hadoopuser
sudo mkdir analyzelogs
ls
sudo chmod -R 777 analyzelogs/
cd
ls
cd .. (to move to the home directory)
pwd
ls
cd
pwd
sudo chown -R hadoop1 analyzelogs/
cd
ls
cd analyzelogs/
ls
cd ..
Copy the files (Mapper.java, Reduce.java, Driver.java) to the analyzelogs folder.
CLASSPATH="$HADOOP_HOME/share/hadoop/mapreduce/hadoo p-mapreduceclient-
core-
2.9.0.jar:$HADOOP_HOME/share/hadoop/mapreduce/hadoop- mapreduce-clientcommon-
2.9.0.jar:$HADOOP_HOME/share/hadoop/common/hadoop-common-
2.9.0.jar:~/analyzelogs/SalesCountry/*:$HADOOP_HOME/lib/*"
ls cd ..
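The Java sources are then compiled against this classpath into the SalesCountry package directory; a sketch, assuming the three source files sit in ~/analyzelogs:
#javac -d . -cp "$CLASSPATH" Mapper.java Reduce.java Driver.java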
#sudo gedit Manifest.txt
Add the following line to the file:
Main-Class: SalesCountry.SalesCountryDriver
(Press Enter at the end of the line so the manifest ends with a newline)
#jar -cfm analyzelogs.jar Manifest.txt SalesCountry/*.class
ls
cd
#cd analyzelogs/
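Finally, the job is run on the cluster and its output inspected. A sketch of the remaining commands, assuming the input log file has already been copied into HDFS under /input_logs and using an illustrative output directory name:
#hadoop jar analyzelogs.jar /input_logs /output_logs
#hdfs dfs -cat /output_logs/part*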
Conclusion: Thus we have learnt how to design a distributed application using MapReduce and process a log file of a system.
SUBMISSION DATE:
SIGN. OF FACULTY:
AIM:
Write an application using HiveQL for flight information system which will
include
a. Creating, Dropping, and altering Database tables.
b. Creating an external Hive table.
c. Load the table with data, insert new values and fields into the table, and join tables with Hive.
d. Create index on Flight Information Table.
e. Find the average departure delay per day in 2008.
OBJECTIVES:
1. To understand Big Data Primitives & Fundamentals.
2. To understand different Big Data Processing Techniques.
OUTCOMES:
1. To apply Big Data primitives & fundamentals for application development.
2. To design algorithms and techniques for Big Data analysis.
THEORY:
Hive:
Apache Hive is a data warehouse system developed by Facebook to process huge amounts of structured data in Hadoop. We know that to process data using Hadoop we need to write complex map-reduce functions, which is not an easy task for most developers. Hive makes this work very easy for us. It uses a scripting language called HiveQL which is very similar to SQL, so we just have to write SQL-like commands, and at the back end Hive automatically converts them into map-reduce jobs.
Hive architecture:
Hive data types are divided into the following five categories: numeric types, date/time types, string types, complex types (arrays, maps, structs, unions), and miscellaneous types (boolean and binary).
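A minimal HiveQL sketch of the assignment tasks follows; the database name, table name, column names, sample values, and HDFS paths are illustrative and must be adapted to the flight dataset actually used (the 2008 on-time performance data has many more columns), and CREATE INDEX is supported only up to Hive 2.x:

CREATE DATABASE IF NOT EXISTS flight_db;
USE flight_db;

-- External table over a CSV file already copied to HDFS (path is an assumption)
CREATE EXTERNAL TABLE IF NOT EXISTS flight_info (
    fl_year INT, fl_month INT, fl_day INT,
    flight_num STRING, origin STRING, dest STRING,
    dep_delay INT, arr_delay INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/flight_data';

-- Altering and dropping
ALTER TABLE flight_info ADD COLUMNS (carrier STRING);
-- DROP TABLE flight_info;   (drops the table definition; the external data stays in HDFS)

-- Load additional data and insert a new row
LOAD DATA INPATH '/user/hive/new_flights.csv' INTO TABLE flight_info;
INSERT INTO TABLE flight_info VALUES (2008, 1, 3, 'WN335', 'IAD', 'TPA', 8, -14, 'WN');

-- Join with another (assumed) table of airport details
-- SELECT f.flight_num, a.city FROM flight_info f JOIN airports a ON (f.origin = a.iata);

-- Index on the flight information table
CREATE INDEX flight_idx ON TABLE flight_info (origin)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Average departure delay per day in 2008
SELECT fl_month, fl_day, AVG(dep_delay) AS avg_dep_delay
FROM flight_info
WHERE fl_year = 2008
GROUP BY fl_month, fl_day;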
Conclusion:
Hence we have created an application using HiveQL for a flight information system which includes:
1. Creating, Dropping, and altering Database tables.
2. Creating an external Hive table.
SUBMISSION DATE:
SIGN. OF FACULTY:
AIM:
Perform the following operations using Python on the Facebook metrics data
sets
a. Create data subsets
b. Merge Data
c. Sort Data
d. Transposing Data
e. Shape and reshape Data
OBJECTIVES:
1. To understand & apply the analytical concept of Big Data using R/Python.
OUTCOMES:
Python
Python is a general-purpose, interpreted, interactive, object-oriented, high-level programming language. It was created by Guido van Rossum during 1985-1990. Like Perl, Python source code is also available under the GNU General Public License (GPL). This section gives a brief overview of the Python language and its features.
Python's features
Easy-to-learn − Python has few keywords, simple structure, and a clearly defined syntax.
This allows the student to pick up the language quickly.
Easy-to-read − Python code is more clearly defined and visible to the eyes.
Easy-to-maintain − Python's source code is fairly easy-to-maintain.
A broad standard library − The bulk of Python's library is very portable and cross-platform compatible on UNIX, Windows, and Macintosh.
Interactive Mode − Python has support for an interactive mode which allows interactive
testing and debugging of snippets of code.
Portable − Python can run on a wide variety of hardware platforms and has the same
interface on all platforms.
Extendable − you can add low-level modules to the Python interpreter. These modules
enable programmers to add to or customize their tools to be more efficient.
Databases − Python provides interfaces to all major commercial databases.
GUI Programming − Python supports GUI applications that can be created and ported to many system calls, libraries, and windowing systems.
Installation steps -
Python is available on a wide variety of platforms including Linux and Mac OS.
Python distribution is available for a wide variety of platforms. You need to download only
the binary code applicable for your platform and install Python.
Jupyter notebook:
With your virtual environment active, install Jupyter with the local instance of pip:
pip install jupyter
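A minimal pandas sketch of the five required operations follows; the file name dataset_Facebook.csv, the semicolon separator, and column names such as Type, like, share, and Total Interactions follow the UCI Facebook metrics dataset and should be adjusted to the file you actually download:

import pandas as pd

# Load the Facebook metrics dataset (semicolon-separated in the UCI version)
df = pd.read_csv("dataset_Facebook.csv", sep=";")
df = df.reset_index().rename(columns={"index": "post_id"})   # add a row identifier for merging/reshaping

# a. Create data subsets
photos = df[df["Type"] == "Photo"]                  # row subset: only photo posts
engagement = df[["post_id", "like", "share"]]       # column subset

# b. Merge data
details = df[["post_id", "Type", "Category"]]
merged = pd.merge(details, engagement, on="post_id", how="inner")

# c. Sort data
sorted_df = df.sort_values(by="Total Interactions", ascending=False)

# d. Transposing data
transposed = df.head().T                            # rows become columns and vice versa

# e. Shape and reshape data
print(df.shape)                                     # (number of rows, number of columns)
long_form = pd.melt(df, id_vars=["post_id"], value_vars=["like", "share"])            # wide to long
wide_again = long_form.pivot(index="post_id", columns="variable", values="value")     # long back to wide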
Conclusion:
Thus we have learnt different operations using Python on the Facebook metrics data
sets.
SUBMISSION DATE:
SIGN. OF FACULTY:
OBJECTIVES:
1. To understand & apply the analytical concept of Big Data using R/Python.
2. To understand different Data Visualization techniques for Big Data.
OUTCOMES:
1. To apply the analytical concept of Big Data using R/Python.
2. To design Big Data analytic application for Emerging Trends.
THEORY:
Download the Air Quality and Heart Diseases datasets, which are available on kaggle.com.
Data cleaning
Data cleaning means fixing bad data in your data set. Bad data could be:
• Empty cells
• Data in wrong format
• Wrong data
• Duplicates
When working with multiple data sources, there are many chances for data to be incorrect,
duplicated, or mislabeled. If data is wrong, outcomes and algorithms are unreliable, even
though they may look correct. Data cleaning is the process of changing or eliminating garbage,
incorrect, duplicate, corrupted, or incomplete data in a dataset. There is no single, absolute way to describe the precise steps of the data cleaning process, because the steps vary from dataset to dataset. Data cleaning (also called data cleansing or data scrubbing) is the first step of the general data preparation process. It plays an important part in producing reliable answers within the analytical process and is a basic part of data science. The aim of data cleaning is to construct uniform, standardized data sets that allow data analytics and business intelligence tools to access and work with accurate data for each problem.
Data cleaning is one of the most important tasks for a data science professional. Wrong or bad-quality data can be detrimental to processing and analysis, while clean data ultimately increases overall productivity and allows the highest-quality information in your decision-making. Benefits of clean data include the following (a short pandas sketch follows this list):
• Error-Free Data
• Data Quality
• Accurate and Efficient
• Complete Data
• Maintains Data Consistency
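A short pandas illustration of these cleaning steps; the file name AirQuality.csv, the semicolon separator, the decimal comma, the -200 missing-value marker, and the Date format follow the UCI air-quality dataset commonly mirrored on Kaggle, so adjust them to your copy:

import numpy as np
import pandas as pd

df = pd.read_csv("AirQuality.csv", sep=";", decimal=",")

# Wrong data: this dataset marks missing sensor readings as -200
df = df.replace(-200, np.nan)

# Empty cells and duplicates: drop fully empty columns and duplicate rows
df = df.dropna(axis=1, how="all").drop_duplicates()

# Data in the wrong format: parse the Date column into a real datetime type
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y", errors="coerce")

# Fill the remaining missing numeric values with the column mean
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())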
Data Integration:
So far, we've made sure to remove the impurities in data and make it clean. Now, the next step
is to combine data from different sources to get a unified structure with more meaningful and
valuable information. This is mostly used if the data is segregated into different sources.
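As a small sketch of integration with pandas (the two file names and the shared patient_id column are assumptions for illustration; the heart-disease data you download may already be a single file):

import pandas as pd

clinical = pd.read_csv("heart_clinical.csv")   # e.g. age, sex, chest pain type
labs = pd.read_csv("heart_labs.csv")           # e.g. cholesterol, resting ECG
combined = pd.merge(clinical, labs, on="patient_id", how="inner")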
Data Transformation:
Now, we have a lot of columns that have different types of data. Our goal is to transform the
data into a machine-learning-digestible format. All machine learning algorithms are based on
mathematics. So, we need to convert all the columns into numerical format.
Handling Categorical Data:
There are some algorithms that can work well with categorical data, such as decision trees.
But most machine learning algorithms cannot operate directly with categorical data. These
algorithms require the input and output both to be in numerical form. If the output to be
predicted is categorical, then after prediction we convert them back to categorical data from
numerical data. Let's discuss some key challenges that we face while dealing with categorical
data.
Encoding:
To address the problems associated with categorical data, we can use encoding. This is the
process by which we convert a categorical variable into a numerical form. Here, we will
look at three simple methods of encoding categorical data.
Replacing:
This is a technique in which we replace the categorical data with a number. This is a simple
replacement and does not involve much logical processing. Let's look at an exercise to get a
better idea of this
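A small sketch of replacing and one-hot encoding with pandas (the column names sex and chest_pain_type and their string labels are assumptions; in many heart-disease files these columns are already numeric codes):

import pandas as pd

df = pd.read_csv("heart.csv")

# Replacing: map category labels to numbers by hand
df["sex"] = df["sex"].replace({"male": 1, "female": 0})

# One-hot encoding: expand a multi-valued categorical column into indicator columns
df = pd.get_dummies(df, columns=["chest_pain_type"])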
Error Correction:
There are many reasons, such as noise and cross-talk, which may cause data to get corrupted during transmission. Most applications would not function as expected if they received erroneous data. Thus, error correction is important before any analysis.
• Gauge min and max values: For continuous variables, checking the minimum and
maximum values for each column can give you a quick idea of whether your values
are falling within the correct range.
• Look for missing values: The easiest way to find missing values is to count them or sort your columns. This helps in finding missing values that can be replaced or removed to obtain the expected analysis (a quick check is sketched after this list).
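For instance, on the dataframe from the cleaning step above:

print(df.describe())        # per-column min/max to spot values outside the expected range
print(df.isnull().sum())    # count of missing values in each column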
Model Building:
In this phase, the data science team needs to develop data sets for training, testing, and production purposes. These data sets enable data scientists to develop analytical methods and train them, while holding aside some data for testing the model; a compact sketch follows the list below.
• Normalization
• Simple and Multiple Linear Regression
• Model Evaluation Using Visualization
• Polynomial Regression and Pipelines
• R-squared and MSE for In-Sample Evaluation
• Prediction and Decision Making
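A compact sketch of the model-building step using scikit-learn (a common choice, though not mandated by the manual), predicting the CO(GT) concentration from the other numeric columns of the cleaned air-quality dataframe; the target column is an assumption you can swap for any other variable:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

numeric = df.select_dtypes(include="number").dropna()
X = numeric.drop(columns=["CO(GT)"])     # features
y = numeric["CO(GT)"]                    # target to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("R-squared:", r2_score(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))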
Conclusion:
Thus we have learnt different operations using Python on the Air Quality and Heart Diseases data sets.
SUBMISSION DATE:
SIGN. OF FACULTY:
AIM:
Integrate Python and Hadoop and perform the following operations on
forest fire dataset
a. Data analysis using the Map Reduce in PyHadoop.
b. Data mining in Hive
OBJECTIVES:
1. To understand & apply the Analytical concept of Big Data using R/Python.
2. To understand different Data Visualization techniques for Big Data.
OUTCOMES:
Python:
Python is a popular high-level programming language known for its simplicity and readability.
It has a large standard library and a vast ecosystem of third- party packages that make it
suitable for a wide range of applications. Python supports multiple programming paradigms,
including procedural, object- oriented, and functional programming. It is widely used in
various domains, such as web development, data analysis, machine learning, artificial
intelligence, scientific computing, and automation.
Hadoop:
Hadoop is an open-source framework designed for distributed storage and processing of large
datasets across clusters of computers. It provides a scalable and fault-tolerant solution for
processing and analyzing big data. The core components of Hadoop are Hadoop Distributed
File System (HDFS) for distributed storage and MapReduce for distributed processing.
Hadoop allows for parallel processing of data by breaking it into smaller chunks and
distributing them across multiple nodes in a cluster. It is widely used in big data analytics and
has become the de facto standard for processing large- scale datasets.
Python can be integrated with Hadoop through various libraries and frameworks, such as PySpark, Hadoop Streaming, and PyHive. PySpark provides a Python API for Apache Spark, a fast and distributed data processing engine that runs on top of Hadoop. Hadoop Streaming
allows you to write MapReduce programs in any language, including Python, by reading input
from standard input and writing output to standard output. PyHive provides a Python interface
to interact with Apache Hive, a data warehouse infrastructure built on top of Hadoop, allowing
you to perform data mining and analysis using SQL-like queries. By integrating Python with
Hadoop, you can leverage the power of distributed computing to process and analyze large
datasets efficiently. Python's simplicity and rich ecosystem make it a popular choice for data
analysis and manipulation, while Hadoop provides the infrastructure for distributed
processing and storage. PySpark is used below to write MapReduce-style programs in Python and execute them on a Hadoop cluster.
Here's a step-by-step guide on how to perform data analysis using MapReduce in PyHadoop:
2. Set up Hadoop cluster: Set up a Hadoop cluster or use an existing cluster that you
have access to.
3. Import necessary modules: In your Python script, import the necessary modules for
PySpark: from pyspark import SparkContext, SparkConf
4. Create a SparkContext: Create a SparkContext object to connect to the Spark cluster:
conf = SparkConf().setAppName("ForestFireAnalysis")
sc = SparkContext(conf=conf)
5. Load the forest fire dataset: Load the forest fire dataset into an RDD (Resilient
Distributed Dataset) using the textFile() method. Assuming the dataset is stored in
HDFS, you can specify the HDFS path to the dataset file:
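A sketch of the load and a simple aggregation follows; the HDFS path and the column positions (month in the third column, burned area in the last column) match the UCI forest-fires CSV but should be checked against your file:

# Load the forest-fire CSV from HDFS into an RDD of lines (path is an assumption)
lines = sc.textFile("hdfs:///user/hadoop/forestfires.csv")

# Drop the header row, split each record, and map it to (month, burned area)
header = lines.first()
records = lines.filter(lambda l: l != header).map(lambda l: l.split(","))
month_area = records.map(lambda r: (r[2], float(r[12])))

# Reduce by key to get the total burned area per month
total_by_month = month_area.reduceByKey(lambda a, b: a + b)
print(total_by_month.collect())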
7. Stop the SparkContext: After performing the analysis, stop the SparkContext to release
the cluster resources:
sc.stop()
For data mining in Hive, you can use the PyHive library, which provides a Python interface
to interact with Hive. Here's an example of how to perform data mining operations in Hive
using PyHive:
3. Connect to Hive: Connect to the Hive server using the connect function:
connection = hive.connect(host='<hive_host>', port=<hive_port>, username='<hive_username>')
5. Execute data mining queries: Use the cursor to execute data mining queries in Hive.
For example, you can run a query to find the average forest fire area by month:
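A sketch of such a query with PyHive, assuming a Hive table named forest_fires with month and area columns has already been created over the dataset:

cursor = connection.cursor()
cursor.execute(
    "SELECT month, AVG(area) AS avg_area FROM forest_fires GROUP BY month"
)
for month, avg_area in cursor.fetchall():
    print(month, avg_area)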
Conclusion:
Hence we have integrated Python and Hadoop and performed the following operations on
forest fire dataset:
a. Data analysis using the Map Reduce in PyHadoop.
b. Data mining in Hive
SUBMISSION DATE:
SIGN. OF FACULTY:
OBJECTIVES:
1. To understand & apply the Analytical concept of Big Data using R/Python.
2. To understand different data visualization techniques for Big Data.
OUTCOMES:
THEORY:
It may sometimes seem easier to go through a set of data points and build insights from
it but usually this process may not yield good results. There could be a lot of things left
undiscovered as a result of this process. Additionally, most of the data sets used in real life
are too big to do any analysis manually.
Data visualization is an easier way of presenting the data, however complex it is, to analyze
trends and relationships amongst variables with the help of pictorial representation.
The following are the advantages of Data Visualization
1. Easier representation of complex data
2. Highlights good and bad performing areas
3. Explores relationship between data points
4. Identifies data patterns even for large numbers of data points
A visualization should have:
There are a lot of Python libraries that can be used to build visualizations, such as matplotlib, vispy, bokeh, seaborn, pygal, folium, plotly, cufflinks, and networkx. Of these, matplotlib and seaborn seem to be the most widely used for basic to intermediate levels of visualization.
1. Matplotlib
2. Seaborn
This library sits on top of matplotlib, so it retains much of matplotlib's behaviour, while from a visualization point of view it is much better than matplotlib and has added features as well (a short example follows the list of benefits below).
Benefits:
Built-in themes aid better visualization
Statistical functions aiding better data insights
Better aesthetics and built-in plots
Helpful documentation with effective examples
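A short sketch of the kind of plots expected, assuming the cleaned air-quality dataframe df from Group B assignment 2 (column names such as CO(GT) and T follow the UCI file and may differ in yours):

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of CO concentration with matplotlib
plt.hist(df["CO(GT)"].dropna(), bins=30)
plt.xlabel("CO(GT)")
plt.ylabel("Frequency")
plt.title("Distribution of CO concentration")
plt.show()

# Scatter plot of temperature vs. CO with a seaborn regression line
sns.regplot(x="T", y="CO(GT)", data=df)
plt.show()

# Correlation heatmap of the numeric columns
sns.heatmap(df.select_dtypes(include="number").corr(), cmap="coolwarm")
plt.show()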
Conclusion:
Hence we have successfully visualized the data using the Python libraries matplotlib and seaborn by plotting the graphs for assignments no. 2 and 3 (Group B).
SUBMISSION DATE:
SIGN. OF FACULTY:
OBJECTIVES:
THEORY:
1D (Linear) Data Visualization
Examples:
• Lists of data items, organized by a single feature (e.g., alphabetical order) (not commonly visualized)
2D (Planar) Data Visualization
Examples (geospatial):
• Choropleth
3D (Volumetric) Data Visualization
Broadly, examples of scientific visualization:
• 3D computer models
• Computer simulations
Multidimensional Data Visualization
Examples:
• Histogram
• Pie chart
Tree/Hierarchical Data Visualization
Examples:
• Dendrogram
Network Data Visualization
Examples:
• Matrix
• Node-link diagram (link-based layout algorithm)
Tableau:
Tableau is a Business Intelligence tool for visually analyzing data. Users can create and distribute interactive and shareable dashboards, which depict the trends, variations, and density of the data in the form of graphs and charts. Tableau can connect to files, relational databases, and Big Data sources to acquire and process data. The software allows data blending and real-time collaboration, which makes it very unique. It is used by businesses, academic researchers, and many government organizations for visual data analysis. It is also positioned as a leader in the Gartner Magic Quadrant for Business Intelligence and Analytics Platforms.
Tableau Features:
Tableau provides solutions for all kinds of industries, departments, and data environments.
Following are some unique features which enable Tableau to handle diverse scenarios.
Speed of Analysis − As it does not require a high level of programming expertise, any user with access to data can start using it to derive value from the data.
Self-Reliant − Tableau does not need a complex software setup. The desktop version
which is used by most users is easily installed and contains all the features needed to
start and complete data analysis.
Visual Discovery − The user explores and analyzes the data by using visual tools
like colors, trend lines, charts, and graphs. There is very little script to be written
as nearly everything is done by drag and drop.
Blend Diverse Data Sets − Tableau allows you to blend different relational, semi-structured, and raw data sources in real time, without expensive up-front integration costs. Users don't need to know the details of how data is stored.
Architecture Agnostic − Tableau works in all kinds of devices where data flows. Hence,
the user need not worry about specific hardware or software requirements to use
Tableau.
Real-Time Collaboration − Tableau can filter, sort, and discuss data on the fly and
embed a live dashboard in portals like SharePoint site or Salesforce. You can save your
view of data and allow colleagues to subscribe to your interactive dashboards so they
see the very latest data just by refreshing their web browser.
Centralized Data − Tableau Server provides a centralized location to manage all of the organization's published data sources. You can delete, change permissions, add tags, and manage schedules in one convenient location. It's easy to schedule extract refreshes and manage them in the data server. Administrators can centrally define a schedule for extracts on the server for both incremental and full refreshes.
There are three basic steps involved in creating any Tableau data analysis report.
These three steps are −
Connect to a data source − It involves locating the data and using an appropriate type
of connection to read the data.
Choose dimensions and measures − this involves selecting the required columns from
the source data for analysis.
Apply visualization technique − This involves applying required visualization
methods, such as a specific chart or graph type to the data being analyzed.
For convenience, let's use the sample data set named Sample - Superstore.xls that comes with the Tableau installation. Locate the installation folder of Tableau and go to My Tableau Repository. Under it, you will find the above file at Data sources\9.2\en_US-US.
On opening Tableau, you will get the start page showing various data sources. Under the header "Connect", you have options to choose a file, a server, or a saved data source. Under Files, choose Excel. Then navigate to the file "Sample - Superstore.xls" as mentioned above. The Excel file has three sheets named Orders, People, and Returns. Choose Orders.
Next, choose the data to be analyzed by deciding on the dimensions and measures.
Dimensions are the descriptive data while measures are numeric data. When put together, they
help visualize the performance of the dimensional data with respect to the data which are
measures. Choose Category and Region as the dimensions and Sales as the measure. Drag and
drop them as shown in the following screenshot. The result shows the total sales in each
category for each region.
In the previous step, you can see that the data is available only as numbers. You have
to read and calculate each of the values to judge the performance. However, you can see them
as graphs or charts with different colors to make a quicker judgment.
We drag and drop the sum (sales) column from the Marks tab to the Columns shelf. The table
showing the numeric values of sales now turns into a bar chart automatically.
Conclusion:
Thus we have learnt how to visualize data of different types (1D (Linear) Data Visualization, 2D (Planar) Data Visualization, 3D (Volumetric) Data Visualization, Temporal Data Visualization, Multidimensional Data Visualization, Tree/Hierarchical Data Visualization, and Network Data Visualization) using the Tableau software.