Unit 5 Ids
UNIT-V
Data Visualization and Prototype Application Development: Data Visualization options,
Crossfilter, the JavaScript MapReduce library, Creating an interactive dashboard with dc.js,
Dashboard development tools. Applying the Data Science process for real world problem solving
scenarios as a detailed case study.
5.1 Introduction
Data scientists must deliver their new insights to the end user. The results can be communicated
in several ways:
• A one-time presentation: Research questions are one-shot deals because the business
decision derived from them will bind the organization to a certain course for many years to
come.
Example: Company investment decisions:
➢ Do we distribute our goods from two distribution centers or only one?
➢ Where do they need to be located for optimal efficiency?
When the decision is made, the exercise may not be repeated until you’ve retired.
• A new viewport on your data: The most obvious example is customer segmentation. The
segments themselves will be communicated via reports and presentations. When a clear and
relevant customer segmentation is discovered, it can be fed back to the database as a new
dimension on the data from which it was derived. From then on, people can make their own
reports, such as how many products were sold to each segment of customers.
• A real-time dashboard: Sometimes the task of a data scientist doesn’t end when the
newly discovered information is sent to the database. When other people start making
reports on this newly discovered data, they might interpret it incorrectly and make reports
that don’t make sense. The data scientist should therefore make the first refreshable report so
others, mainly reporters and IT, can understand it and follow its example. This in turn shortens
the delivery time of our insights to the end user who wants to use them on an everyday basis.
Important factors that we come across while preparing a final report are:
• What kind of decision are you supporting? Is it a strategic or an operational one? Strategic
decisions often only require you to analyze and report once, whereas operational decisions
require the report to be refreshed regularly.
Introduction to Data Science Unit-5
• How big is your organization? In smaller ones you’ll be in charge of the entire cycle: from
data gathering to reporting. In bigger ones a team of reporters might be available to make the
dashboards for you. But even in this last situation, delivering a prototype dashboard can be
beneficial because it presents an example and often shortens delivery time.
Data visualization options
• The art of presenting your data and information as graphs, charts, or maps is known as data
visualization.
• Data visualization's purpose is to emphasize observations that would not otherwise jump out
when looking at a linear list of values and numbers to enable people to quickly and easily
grasp their data.
How to Select the Appropriate Graph or Chart for Your Data?
To successfully express our message and insights, selecting the appropriate chart or graph for the
data is essential. The following factors need to be considered while choosing the optimal data
visualization:
Purpose
What are you trying to visualize? Are you attempting to demonstrate contrasts, patterns, or
connections in your data?
Type of Data
What kind of data do you have? Is it numerical or categorical? Continuous or discrete?
This will aid in choosing the best types of data visualization charts.
Context
What context does your data come from? Is it recent or historical? Local or worldwide? This will
enable you to choose the proper scale and coverage for your visualization.
Most Common Types of Data Visualization are:
1. Column Chart
2. Line Graph
3. Pie Chart
4. Bar Chart
5. Heat Maps
6. Scatter Plot
7. Bubble Chart
8. Funnel Chart
9. Radar Chart
10. Tree Chart
Fig. 5.2 Pharmacy medicines data set opened in Excel: the first 10 lines of stock data are
enhanced with a light-sensitivity variable
• As we can see, the information is time-series data for an entire year of stock movement, so
every medicine has 365 entries in the data set. For the example’s sake we’ll use a
fraction of the full data.
• The data set is limited to 29 medicines, a little more than 10,000 lines of data.
• It’s not recommended to load an entire database into the user’s browser; the browser
will freeze while loading, and if there is too much data, it may even crash.
• Normally data is precalculated on the server and parts of it are requested using, for example,
a REST service.
• We use the data visualization option dc.js, which is a cross-breed between the JavaScript
MapReduce library Crossfilter and the data visualization library d3.js.
• Crossfilter was developed by Square, a company that handles payment transactions.
Square developed Crossfilter to give its customers extremely fast slicing and dicing of
their payment history. Crossfilter is not the only JavaScript library capable of MapReduce
processing, but it does the job, is open source, is free to use, and is
maintained by an established company (Square).
• Example alternatives to Crossfilter are Map.js, Meguro, and Underscore.js.
• d3.js can safely be called the most versatile JavaScript data visualization library; it was
developed by Mike Bostock as a successor to his Protovis library. Many JavaScript libraries
are built on top of d3.js.
• NVD3, C3.js, xCharts, and Dimple all offer roughly the same thing: an abstraction layer on top
of d3.js, which makes it easier to draw simple graphs. They mainly differ in the types of
graphs they support and their default design.
• The main reason for choosing dc.js among many options is: dc.js can easily set up an
interactive dashboard where clicking one graph will create filtered views on related graphs.
• We don’t want to send enormous loads of data over the internet, or even over our internal
network, for these reasons:
➢ Sending a bulk of data will tax the network to the point where it will bother other
users.
➢ The browser is on the receiving end, and while loading in the data it will temporarily
freeze. For small amounts of data this is unnoticeable, but when we start looking at
100,000 lines, it can become a visible lag. When we go over 1,000,000 lines,
depending on the width of our data, our browser could give up.
• For the data we do send, Crossfilter handles it once it arrives in the browser. In
this case study, the pharmacist requested from the central server the 2015 stock data for the 29
medicines she was particularly interested in.
5.3.1 Setting up everything
• dc.js is the visualization library we will use to create our interactive dashboard.
• To build the actual dc.js application we require the following libraries:
➢ JQuery—To handle the interactivity
➢ Crossfilter.js—A MapReduce library and prerequisite to dc.js
➢ d3.js—A popular data visualization library and prerequisite to dc.js
➢ Bootstrap—A widely used layout library you’ll use to make it all look better
• We write only three files:
➢ index.html—The HTML page that contains our application
➢ application.js—To hold all the JavaScript code
➢ application.css—For CSS (Cascading Style Sheet)
• In addition, we need to run our code on an HTTP server. We could set up a LAMP
(Linux, Apache, MySQL, PHP), WAMP (Windows, Apache, MySQL, PHP), or XAMPP
(Cross Environment, Apache, MySQL, PHP, Perl) server.
• But for the sake of simplicity we won’t set up any of those servers here. Instead we can do it
with a single Python command.
• Use the command-line tool (Linux shell or Windows CMD) and move to the folder
containing the index.html.
• The following command launches a Python HTTP server on our localhost. This is the
Python 2 form; with Python 3 the equivalent command is python -m http.server.
python -m SimpleHTTPServer
CreateTable()
CreateTable() requires three arguments:
➢ data: The data it needs to put into a table.
➢ variablesInTable: What variables it needs to show.
➢ title: The title of the table.
CreateTable() uses a predefined variable, tableTemplate, that contains our overall table layout.
CreateTable() can then add rows of data to this template.
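A minimal sketch of a function with this signature follows. It builds the table as an HTML string in plain JavaScript so that it can run outside the browser. The tableTemplate variable itself is not shown here, so the markup, the escapeHtml helper, and the sample rows are assumptions made for illustration only.

```javascript
// Hypothetical helper: escape characters that are special in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Sketch of CreateTable(data, variablesInTable, title).
// Returns an HTML string; the markup is an assumption, not the original tableTemplate.
function CreateTable(data, variablesInTable, title) {
  // One header cell per requested variable.
  var header = variablesInTable.map(function (name) {
    return '<th>' + escapeHtml(name) + '</th>';
  }).join('');

  // One table row per record, showing only the requested variables.
  var rows = data.map(function (record) {
    var cells = variablesInTable.map(function (name) {
      return '<td>' + escapeHtml(record[name]) + '</td>';
    }).join('');
    return '<tr>' + cells + '</tr>';
  }).join('');

  return '<table class="table">' +
    '<caption>' + escapeHtml(title) + '</caption>' +
    '<thead><tr>' + header + '</tr></thead>' +
    '<tbody>' + rows + '</tbody>' +
    '</table>';
}

// Made-up sample rows, only to exercise the function.
var sampleRows = [
  { MedName: 'Medicine A', Stock: 120 },
  { MedName: 'Medicine B', Stock: 45 }
];
var tableHtml = CreateTable(sampleRows, ['MedName', 'Stock'], 'Sample table');
```

In the browser, a string like tableHtml could be handed to jQuery's .append() on the element that should display the table.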
We show our data on the screen, but preferably not all of it; only the first five entries, as shown
in figure 5.5. Our data contains a date variable, and to make sure Crossfilter
recognizes it as a date later on, we first parse it and create a new variable called Day. We
let the original, Date, appear in the table for now, but later on we’ll use Day for all our
calculations.
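The parsing step can be sketched in plain JavaScript as follows. The month/day/year text format, the parseDay helper, and the sample records are assumptions; the real data set may store dates differently, and a d3.js date parser could be used instead.

```javascript
// Made-up records; the Date format (month/day/year) is an assumption.
var medicineData = [
  { MedName: 'Medicine A', Date: '01/01/2015', Stock: '120' },
  { MedName: 'Medicine A', Date: '01/02/2015', Stock: '118' },
  { MedName: 'Medicine A', Date: '01/03/2015', Stock: '115' },
  { MedName: 'Medicine B', Date: '01/01/2015', Stock: '45' },
  { MedName: 'Medicine B', Date: '01/02/2015', Stock: '44' },
  { MedName: 'Medicine B', Date: '01/03/2015', Stock: '40' }
];

// Hypothetical parser for 'MM/DD/YYYY' strings (local time, midnight).
function parseDay(text) {
  var parts = text.split('/');
  return new Date(Number(parts[2]), Number(parts[0]) - 1, Number(parts[1]));
}

// Keep the original Date text for display and add Day for calculations.
medicineData.forEach(function (d) {
  d.Day = parseDay(d.Date);
});

// Only the first five entries are shown in the table.
var firstFive = medicineData.slice(0, 5);
```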
Fig. 5.5 Input medicine table shown in browser: first five lines
5.3.2 Unleashing Crossfilter to filter the medicine data set
• Now let’s go into Crossfilter to use filtering and MapReduce. We should put all our code
now within the main() function.
• The first thing we’ll need to do is declare a Crossfilter instance and initiate it with our data.
var CrossfilterInstance = crossfilter(medicineData);
• On this instance we can register dimensions, which are the columns of the table.
• Currently Crossfilter is limited to 32 dimensions. If we are handling data wider than 32
dimensions, we should consider narrowing it down before sending it to the browser.
Fig. 5.7 Data filtered on medicine name Grazax 75 000 SQ-T and sorted by day
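As a rough illustration of what this filtered and sorted view contains, here is the same operation on a few made-up records in plain JavaScript. With Crossfilter the equivalent would typically be a filter(value) call on the medicine-name dimension followed by bottom(Infinity) on the day dimension; the variable names and sample values below are assumptions.

```javascript
// Made-up records for illustration; only the medicine name comes from the figure caption.
var stockRecords = [
  { MedName: 'Grazax 75 000 SQ-T', Day: new Date(2015, 0, 3), Stock: 40 },
  { MedName: 'Medicine B', Day: new Date(2015, 0, 1), Stock: 10 },
  { MedName: 'Grazax 75 000 SQ-T', Day: new Date(2015, 0, 1), Stock: 50 },
  { MedName: 'Grazax 75 000 SQ-T', Day: new Date(2015, 0, 2), Stock: 45 }
];

// Keep one medicine, then order its records from earliest to latest day.
// filter() returns a new array, so stockRecords itself is left unsorted.
var grazaxByDay = stockRecords
  .filter(function (d) { return d.MedName === 'Grazax 75 000 SQ-T'; })
  .sort(function (a, b) { return a.Day - b.Day; });
```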
• Suppose we want to know how many observations we have per medicine. Logic dictates that we
should end up with the same number for every medicine: 365, or 1 observation per day in
2015.
• Crossfilter comes with two MapReduce functions: reduceCount() and reduceSum().
• If we want to do anything apart from counting and summing, we need to write reduce
functions for it.
• The countPerMed variable now contains the data grouped by the medicine dimension and a
line count for each medicine in the form of a key and a value.
• To create the table we need to address the variable key instead of medName, and the variable
value for the count.
var countPerMed = medNameDim.group().reduceCount();
var variablesInTable = ["key", "value"];
filteredTable
    .empty()
    .append(CreateTable(countPerMed.top(Infinity), variablesInTable, 'Reduced Table'));
Fig. 5.8 MapReduced table with the medicine as the group and a count of data lines as the value
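To make the key/value shape of the reduced table concrete, the sketch below counts records per medicine in plain JavaScript. It is a stand-in for what the grouped count produces, not Crossfilter's implementation; countByKey and the sample records are hypothetical.

```javascript
// Plain-JavaScript stand-in for a grouped count: one {key, value} pair per medicine.
function countByKey(records, keyOf) {
  var counts = {};
  records.forEach(function (d) {
    var key = keyOf(d);
    counts[key] = (counts[key] || 0) + 1;
  });
  return Object.keys(counts)
    .map(function (key) { return { key: key, value: counts[key] }; })
    // Largest count first, similar to reading a group with top(Infinity).
    .sort(function (a, b) { return b.value - a.value; });
}

// Made-up records for illustration.
var countSample = [
  { MedName: 'Medicine B', Stock: 10 },
  { MedName: 'Medicine A', Stock: 50 },
  { MedName: 'Medicine A', Stock: 45 },
  { MedName: 'Medicine B', Stock: 12 },
  { MedName: 'Medicine A', Stock: 40 }
];

var countPerMedSketch = countByKey(countSample, function (d) { return d.MedName; });
```

Each element of countPerMedSketch has a key (the medicine name) and a value (the line count), which is why the table addresses the columns "key" and "value".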
• Apart from the reduceCount() and reduceSum() functions, Crossfilter has the more general
reduce() function. This function takes three arguments:
➢ The reduceAdd() function: A function that describes what happens when an extra
observation is added.
➢ The reduceRemove() function: A function that describes what needs to happen when
an observation disappears (for instance, because a filter is applied).
➢ The reduceInit() function: This one sets the initial values for everything that’s
calculated. For a sum and count the most logical starting point is 0.
• A custom reduce function requires three components: an initiation, an add function, and a
remove function.
• The initial reduce function will set starting values of the p object:
var reduceInitAvg = function () {
// Starting values for every group; Crossfilter calls this function without arguments.
return {count: 0, stockSum: 0, stockAvg: 0};
};
• The reduce functions themselves take two arguments.
➢ p is an object that contains the combined state so far; it persists over all
observations. This variable keeps track of the sum and count for you and thus
represents your goal, your end result.
➢ v represents a record of the input data and has all its variables available to you.
reduceInit() is called only once for each group, but reduceAdd() is called every time a
record is added and reduceRemove() every time a record is removed.
➢ The reduceInit() function, here called reduceInitAvg() because we’re going to
calculate an average, basically initializes the p object by defining its components
(count, sum, and average) and setting their initial values.
Let’s look at reduceAddAvg():
var reduceAddAvg = function(p,v){
p.count += 1;
p.stockSum = p.stockSum + Number(v.Stock);
p.stockAvg = Math.round(p.stockSum / p.count);
return p; }
• reduceAddAvg() takes the p and v arguments described above. The Stock is summed up for every
record we add, and then the average is calculated based on the accumulated sum and record
count. Its counterpart for removing records is reduceRemoveAvg():
var reduceRemoveAvg = function(p,v){
p.count -= 1;
p.stockSum = p.stockSum - Number(v.Stock);
// Guard against dividing 0 by 0 when the last record of a group is removed.
p.stockAvg = p.count > 0 ? Math.round(p.stockSum / p.count) : 0;
return p;
}
• The reduceRemoveAvg() function looks similar but does the opposite: when a record is
removed, the count and sum are lowered. Now apply this MapReduce function to the data
set:
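A minimal sketch of applying the three functions follows. In Crossfilter they would be passed to a group as group().reduce(reduceAddAvg, reduceRemoveAvg, reduceInitAvg). To keep the example runnable without the library, the hypothetical groupReduce driver below imitates the grouping in plain JavaScript; the sample records are made up, and the zero-count guard in reduceRemoveAvg is a defensive addition of this sketch.

```javascript
// Copies of the three reduce functions, so the sketch is self-contained.
var reduceInitAvg = function () {
  return { count: 0, stockSum: 0, stockAvg: 0 };
};
var reduceAddAvg = function (p, v) {
  p.count += 1;
  p.stockSum += Number(v.Stock);
  p.stockAvg = Math.round(p.stockSum / p.count);
  return p;
};
var reduceRemoveAvg = function (p, v) {
  p.count -= 1;
  p.stockSum -= Number(v.Stock);
  // Guard added in this sketch: avoid 0 / 0 when the group becomes empty.
  p.stockAvg = p.count > 0 ? Math.round(p.stockSum / p.count) : 0;
  return p;
};

// Hypothetical driver imitating a grouped reduce: one p object per key.
function groupReduce(records, keyOf, add, init) {
  var groups = {};
  records.forEach(function (v) {
    var key = keyOf(v);
    if (!Object.prototype.hasOwnProperty.call(groups, key)) {
      groups[key] = init();           // called once per group
    }
    groups[key] = add(groups[key], v); // called once per record
  });
  return Object.keys(groups).map(function (key) {
    return { key: key, value: groups[key] };
  });
}

// Made-up records for illustration.
var avgSample = [
  { MedName: 'Medicine A', Stock: '10' },
  { MedName: 'Medicine A', Stock: '20' },
  { MedName: 'Medicine B', Stock: '7' }
];

var avgPerMedSketch = groupReduce(
  avgSample,
  function (v) { return v.MedName; },
  reduceAddAvg,
  reduceInitAvg
);
var avgBeforeRemoval = avgPerMedSketch[0].value.stockAvg;

// Simulate a filter that drops the second record from the 'Medicine A' group.
var groupA = reduceRemoveAvg(avgPerMedSketch[0].value, avgSample[1]);
```

Each result has the form {key, value}, where value is the p object, so a chart or table has to read the average from value.stockAvg.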
• To add the graphs, insert placeholders for the graph-related code in the index.html page.
• In application.js we can add all the upcoming code to our main() function.
• dc.renderAll() is dc’s command to draw the graphs which should be placed only once at the
bottom of the main() function.
• The first graph we need is the “total stock over time,” as shown in the following listing. We
already have the time dimension declared, so all we need is to sum the stock by the time
dimension.
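The aggregation behind this graph can be sketched in plain JavaScript as shown below. With Crossfilter, the same totals would typically come from the time dimension's group().reduceSum() with an accessor that returns the Stock value. The sumStockPerDay function and the sample records are hypothetical and cover only the data side, not the dc.js chart itself.

```javascript
// Made-up records for illustration; Stock arrives as text, as in the reduce functions above.
var dailySample = [
  { MedName: 'Medicine A', Day: new Date(2015, 0, 2), Stock: '30' },
  { MedName: 'Medicine A', Day: new Date(2015, 0, 1), Stock: '20' },
  { MedName: 'Medicine B', Day: new Date(2015, 0, 1), Stock: '5' },
  { MedName: 'Medicine B', Day: new Date(2015, 0, 2), Stock: '4' }
];

// Sum the stock of all medicines per day and return {key, value} pairs ordered by day.
function sumStockPerDay(records) {
  var totals = {};
  records.forEach(function (d) {
    var key = d.Day.getTime(); // milliseconds since epoch identify the day
    totals[key] = (totals[key] || 0) + Number(d.Stock);
  });
  return Object.keys(totals)
    .map(function (key) {
      return { key: new Date(Number(key)), value: totals[key] };
    })
    .sort(function (a, b) { return a.key - b.key; });
}

var totalStockPerDay = sumStockPerDay(dailySample);
```

Each pair supplies one point of the line: the day on the x-axis and the summed stock on the y-axis.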
Listing 5.3 Code to generate "total stock over time" graph
Figure 5.10 dc.js graph: sum of medicine stock over the year 2015
• Now let’s create a row chart that represents the average stock per medicine.
Listing 5.4 Code to generate “average stock per medicine” graph
• Since we used a custom-defined reduce() function this time, dc.js doesn’t know which value to
represent. With the .valueAccessor() method we can specify p.value.stockAvg as the value
of our choice.
• The font color of the dc.js row chart’s labels is gray, which makes our row chart somewhat hard to
read. We can remedy this by overriding the following CSS rule in the application.css file:
.dc-chart g.row text {fill: black;}
• One simple line can make the difference between a clear and an obscure graph.
• Now when we select an area on the line chart, the row chart is automatically adapted to
represent the data for the correct time period. Inversely, we can select one or multiple
medicines on the row chart, causing the line chart to adjust accordingly.
• Next we need to register the light dimension on the Crossfilter instance. We can also add
a reset button, which causes all filters to reset, as shown in the following listing.
Listing 5.7 The dashboard reset filters button
Fig. 5.12 dc.js fully interactive dashboard on medicines and their stock within the hospital
pharmacy
5.5 Dashboard development tools
• We have the tried-and-true software packages of renowned vendors such as Tableau,
MicroStrategy, Qlik, SAP, IBM, SAS, Microsoft, Spotfire, and so on.
• These companies all offer dashboard tools worth investigating, but they are paid tools.
• Some vendors also offer free public versions with limited functionality.
• Others will at least give us a trial version. In the end we have to pay for the full
version of any of these packages.
• We can also get visualization libraries that only come with a trial period and no free
community edition, such as Wijmo, Kendo, and FusionCharts. They are worth looking into
because they also provide support and guarantee regular updates.
• HTML5 is a free option for data visualization, and it abounds with free JavaScript libraries to
plot any data we want.
• Some of the visualization tools are:
➢ HighCharts: One of the most mature browser-based graphing libraries. The free license
applies only to noncommercial pursuits. If you want to use it in a commercial context,
prices range anywhere from $90 to $4000.
➢ Chartkick: A JavaScript charting library for Ruby on Rails fans.
➢ Google Charts: The free charting library of Google. As with many Google products, it is
free to use, even commercially, and offers a wide range of graphs.
➢ d3.js: This is the odd one out because it isn’t a graphing library but a data visualization
library. Libraries such as HighCharts and Google Charts are meant to draw certain
predefined charts; d3.js doesn’t lay down such restrictions. d3.js is currently the most
versatile JavaScript data visualization library available.
• Even though we have many options, why or when would we consider building our own
interface with HTML5 instead of using alternatives such as SAP’s BusinessObjects, SAS
JMP, Tableau, QlikView, or one of the many others?
• Here are a few reasons:
➢ No budget: When we work in a startup or other small company, the licensing costs
accompanying this kind of software can be high.
➢ High accessibility: The data science application is meant to release results to any kind of
user, especially people who might only have a browser at their disposal—our own
customers, for instance. Data visualization in HTML5 runs fluently on mobile.
➢ Big pools of talent out there: Although there aren’t that many Tableau developers, scads
of people have web-development skills. When planning a project, it’s important to take
into account whether you can staff it.
➢ Quick release: Going through the entire IT cycle might take too long at the company,
and we want people to enjoy our analysis quickly. Once the interface is available and
being used, IT can take all the time they want to industrialize the product.
➢ Prototyping: The better you can show IT its purpose and what it should be capable of,
the easier it is for them to build or buy a sustainable application that does what you want
it to do.
➢ Customizability: Although the established software packages are great at what they do,
an application can never be as customized as when you create it yourself.
And why wouldn’t you do this?
➢ Company policy: This is the biggest one: it’s not allowed. Large companies have IT
backup teams that allow only a certain number of tools to be used so they can keep their
supporting role under control.
➢ Mature reporting team: If you have a good reporting department, why would you still
bother?
➢ Customization is satisfactory: Not everyone wants the shiny stuff; basic can be enough.
Several of the bigger platforms are browser interfaces with JavaScript running under the
hood. Tableau, BusinessObjects Webi, SAS Visual Analytics, and so on all have HTML
interfaces; their tolerance to customization might grow over time.