Ds 1603 - Data Visualization Unit I Introduction
Unit I INTRODUCTION
https://www.tableau.com/learn/articles/data-visualization
Tables: This consists of rows and columns used to compare variables. Tables can show a
great deal of information in a structured way, but they can also overwhelm users who are
simply looking for high-level trends.
https://www.tableau.com/data-insights/reference-library/visual-analytics/tables
Pie charts and stacked bar charts: These graphs are divided into sections that represent
parts of a whole. They provide a simple way to organize data and compare the sizes of the
components to one another.
Line charts and area charts: These visuals show change in one or more quantities by
plotting a series of data points over time, and they are frequently used within predictive
analytics. Line graphs use lines to show these changes, while area charts connect the data
points with line segments, stacking variables on top of one another and using color to
distinguish between them.
https://www.tableau.com/data-insights/reference-library/visual-analytics/charts
Histograms: This graph plots a distribution of numbers using a bar chart (with no spaces
between the bars), representing the quantity of data that falls within a particular range.
This visual makes it easy for an end user to identify outliers within a given dataset.
Scatter plots: These visuals are beneficial in revealing the relationship between two
variables, and they are commonly used within regression data analysis. However, they can
sometimes be confused with bubble charts, which are used to visualize three variables via
the x-axis, the y-axis, and the size of the bubble.
Heat maps: These graphical representations are helpful in visualizing behavioral
data by location. This can be a location on a map, or even a webpage.
Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles.
Tree maps are great for comparing the proportions between categories via their area size.
Geospatial: A visualization that shows data in map form using different shapes and colors
to show the relationship between pieces of data and specific locations.
Infographic: A combination of visuals and words that represent data. Usually uses charts
or diagrams.
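To make a few of the distinctions above concrete, here is a minimal sketch in Python using matplotlib, with synthetic data, contrasting a histogram, a scatter plot, and a line chart; the data and styling are illustrative only.

```python
# A minimal sketch, with synthetic data, contrasting three of the
# chart types described above.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: a distribution of values, drawn with no gaps between bars
ax1.hist(rng.normal(size=1000), bins=30)
ax1.set_title("Histogram")

# Scatter plot: the relationship between two variables
x = rng.uniform(0, 10, size=100)
ax2.scatter(x, 2 * x + rng.normal(0, 2, size=100), s=10)
ax2.set_title("Scatter plot")

# Line chart: change in a quantity over time
t = np.arange(50)
ax3.plot(t, np.cumsum(rng.normal(size=50)))
ax3.set_title("Line chart")

plt.tight_layout()
plt.show()
```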
Open source visualization tools
Open source libraries, such as D3.js, provide a way for analysts to present data in an
interactive way, allowing them to engage a broader audience with new data. Some of the
most popular open source visualization libraries include:
ECharts: A powerful charting and visualization library that offers an easy way to
add intuitive, interactive, and highly customizable charts to products, research
papers, presentations, etc. ECharts is built with JavaScript on top of ZRender, a
lightweight canvas library.
Vega: Vega describes itself as a “visualization grammar,” providing support for
building and customizing visualizations over large datasets that are accessible from the web.
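As a rough illustration of the “visualization grammar” idea, the sketch below uses Altair, a Python library that compiles chart definitions into Vega-Lite specifications; the data frame and field names are synthetic, invented for the example.

```python
# A sketch of the grammar-of-graphics idea via Altair, a Python library
# that compiles chart definitions to Vega-Lite specifications. The data
# frame and field names here are synthetic, for illustration only.
import altair as alt
import pandas as pd

data = pd.DataFrame({
    "category": ["A", "B", "C", "D"],
    "value": [28, 55, 43, 91],
})

# Declare the mapping from data fields to visual channels; the library
# derives the rendering from this declaration.
chart = alt.Chart(data).mark_bar().encode(
    x="category:N",  # nominal field on the x-axis
    y="value:Q",     # quantitative field on the y-axis
)
chart.save("bar.html")  # rendered by Vega-Lite in the browser
```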
With so many data visualization tools readily available, there has also been a rise in
ineffective information visualization. Visual communication should be simple and
deliberate to ensure that your data visualization helps your target audience arrive at your
intended insight or conclusion. The following best practices can help ensure your data
visualization is useful and clear:
Know your audience(s): Think about who your visualization is designed for and then
make sure your data visualization fits their needs. What is that person trying to
accomplish? What kind of questions do they care about? Does your visualization address
their concerns? You’ll want the data that you provide to motivate people to act within the
scope of their role. If you’re unsure if the visualization is clear, present it to one or two
people within your target audience to get feedback, allowing you to make additional edits
prior to a large presentation.
Choose an effective visual: Specific visuals are designed for specific types of datasets. For
instance, scatter plots display the relationship between two variables well, while line
graphs display time series data well. Ensure that the visual actually assists the audience in
understanding your main takeaway. Misalignment of charts and data can result in the
opposite, confusing your audience further rather than providing clarity.
Keep it simple: Data visualization tools can make it easy to add all sorts of information to
your visual. However, just because you can, it doesn’t mean that you should! In data
visualization, you want to be very deliberate about the additional information that you add
to focus user attention. For example, do you need data labels on every bar in your bar
chart? Perhaps you only need one or two to help illustrate your point. Do you need a
variety of colors to communicate your idea? Are you using colors that are accessible to a
wide range of audiences (e.g., accounting for color-blind audiences)? Design your data
visualization for maximum impact by eliminating information that may distract your target
audience.
Visualization design objectives are principles and goals that guide the creation of effective
and impactful data visualizations. These objectives ensure that the visual representation of
data is clear, accurate, and meaningful to the audience.
Clarity:
Objective: Ensure that the message of the visualization is immediately clear to the
audience.
Methods: Use clear labels, legends, and annotations. Avoid unnecessary complexity and
clutter. Prioritize simplicity in design.
Accuracy:
Objective: Represent the data truthfully so that the visualization does not distort or
misstate what the data shows.
Methods: Use accurate scales, provide context for data points, and validate data sources.
Clearly communicate any limitations or uncertainties in the data.
Relevance:
Objective: Tailor the visualization to address the specific needs and questions of the
audience.
Methods: Focus on key insights, eliminate unnecessary details, and choose visualization
types that best convey the intended message for the target audience.
Aesthetics:
Objective: Create visually appealing and engaging visualizations to capture and maintain
audience attention.
Methods: Choose appropriate color schemes, use consistent and readable fonts, and pay
attention to layout and design principles. Aesthetically pleasing visualizations can enhance
user experience.
Consistency:
Objective: Maintain a consistent style and format throughout the visualization to reduce
confusion.
Methods: Use a consistent color palette, font, and layout. Ensure that similar data points or
categories are represented consistently across different charts or graphs.
Interactivity:
Objective: Allow users to interact with the visualization to explore data and gain deeper
insights.
Methods: Incorporate interactive features like tooltips, filters, and zoom options.
Interactive elements can enhance engagement and make the visualization more dynamic.
Accessibility:
Objective: Ensure that the visualization is accessible to a diverse audience, including those
with visual or cognitive impairments.
Methods: Use accessible color contrasts, provide alternative text for images, and design
with consideration for users who may rely on screen readers or other assistive
technologies.
Storytelling:
Objective: Use the visualization to tell a compelling story and guide the audience through
key insights.
Methods: Organize the data in a logical sequence, use annotations to highlight important
points, and consider the narrative flow of the visualization.
Scalability:
Objective: Design visualizations that can scale effectively as the volume of data grows.
Methods: Choose appropriate chart types for the size of the dataset, use hierarchical
representations when necessary, and consider the potential for expansion without loss of
clarity.
Usability:
Objective: Ensure that the visualization is user-friendly and intuitive for the intended
audience.
Methods: Test the visualization with representative users, gather feedback, and iterate on
the design to improve user experience.
By incorporating these visualization design objectives, creators can develop visualizations
that effectively communicate data insights and contribute to informed decision-making.
Data Visualization Tools
1. Tableau
Tableau is a highly popular tool for visualizing data for two main reasons: it's easy to use
and very powerful. You can connect it to lots of data sources and create all sorts of charts
and maps. Salesforce owns Tableau, and it's widely used by many people and big
companies.
Tableau is available in several editions, including desktop, server, and web-based options,
as well as CRM-focused products.
Providing integration for advanced databases, including Teradata, SAP, MySQL, Amazon
AWS, and Hadoop, Tableau efficiently creates visualizations and graphics from large,
constantly evolving datasets used for artificial intelligence, machine learning, and Big Data
applications.
2. Plotly
An open-source data visualization tool, Plotly offers full integration with analytics-centric
programming languages like MATLAB, Python, and R, which enables complex visualizations.
Widely used for creating, modifying, and sharing interactive graphical data collaboratively,
Plotly supports both on-premise installation and cloud deployment.
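A minimal sketch of Plotly's Python interface is shown below; it assumes the plotly package is installed and uses its bundled gapminder sample dataset purely for illustration.

```python
# A minimal sketch of Plotly's Python interface, using the bundled
# gapminder sample dataset purely for illustration.
import plotly.express as px

df = px.data.gapminder().query("year == 2007")

fig = px.scatter(
    df,
    x="gdpPercap",
    y="lifeExp",
    size="pop",            # bubble size encodes a third variable
    color="continent",
    hover_name="country",  # interactive tooltip on hover
    log_x=True,
)
fig.show()  # opens an interactive figure in a browser or notebook
```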
3. Power BI
Power BI, Microsoft's easy-to-use data visualization tool, is available for both on-premise
installation and deployment on the cloud infrastructure. Power BI is one of the most
complete data visualization tools that supports a myriad of backend databases, including
Teradata, Salesforce, PostgreSQL, Oracle, Google Analytics, Github, Adobe Analytics, Azure,
SQL Server, and Excel. The enterprise-level tool creates stunning visualizations and
delivers real-time insights for fast decision-making.
Among the pros of Power BI is its high-grade security.
4. D3.js
With Data-Driven Documents (D3.js), you can use any browser to bind data to the DOM of a
document, allowing you to manipulate documents from anywhere. Transforming data
involves selecting sets of nodes and manipulating them individually. By working with
functions of data (styles, attributes, and other properties), you can easily alter node
attributes, register event listeners, change HTML or text content, and access the
document's underlying DOM. You can associate operations (updates, additions, and
deletions) with nodes to improve performance. You can build new functions using the
function factory, as well as use the graphical primitives included. Geographic coordinates
can be computed with a function as opposed to a constant, and properties can be reused by
binding data to the documents.
D3 uses HTML, SVG, and CSS to create graphics from data, for example generating a table in
HTML from a dataset. With animated transitions and high performance, you can easily
visualize data in bar charts and other graphics, support large amounts of data, and enjoy
dynamic interaction and animation even with large datasets.
Reference:
https://www.simplilearn.com/data-visualization-tools-article
Each set of data has particular display needs, and the purpose for which you’re using the
data set has just as much of an effect on those needs as the data itself. There are dozens of
quick tools for developing graphics in a cookie-cutter fashion in office programs, on the
Web, and elsewhere, but complex data sets used for specialized applications require unique
treatment. Throughout this book, we’ll discuss how the characteristics of a data set help
determine what kind of visualization you’ll use.
When you hear the term “information overload,” you probably know exactly what it means
because it’s something you deal with daily. In Richard Saul Wurman’s book Information
Anxiety (Doubleday), he describes how the New York Times on an average
Sunday contains more information than a Renaissance-era person had access to in his
entire lifetime. But this is an exciting time. For $300, you can purchase a commodity PC that
has thousands of times more computing power than the first computers used to tabulate
the U.S. Census. The capability of modern machines is astounding. Performing sophisticated
data analysis no longer requires a research laboratory, just a cheap machine and some
code. Complex data sets can be accessed, explored, and analyzed by the public in a way that
simply was not possible in the past.
The past 10 years have also brought about significant changes in the graphic capabilities of
average machines. Driven by the gaming industry, high-end 2D and 3D graphics hardware
no longer requires dedicated machines from specific vendors, but can instead be purchased
as a $100 add-on card and is standard equipment for any machine costing $700 or more.
When not used for gaming, these cards can render extremely sophisticated models with
thousands of shapes, and can do so quickly enough to provide smooth, interactive
animation. And these prices will only decrease—within a few years’ time, accelerated
graphics will be standard equipment on the aforementioned commodity PC.
Data Collection
We’re getting better and better at collecting data, but we lag in what we can do with it. Most
of the examples in this book come from freely available data sources on the
Internet. Lots of data is out there, but it’s not being used to its greatest potential because
it’s not being visualized as well as it could be. (More about this can be found in Chapter 9,
which covers places to find data and how to retrieve it.) With all the data we’ve collected,
we still don’t have many satisfactory answers to the sort of questions that we started with.
This is the greatest challenge of our information-rich era: how can these questions be
answered quickly, if not instantaneously?
Thinking About Data
We also do very little sophisticated thinking about information itself. When AOL released a
data set containing the search queries of millions of users that had been “randomized” to
protect the innocent, articles soon appeared about how people could be identified by—and
embarrassed by—information regarding their search habits. Even though we can collect
this kind of information, we often don’t know quite what it means. Was this a major issue
or did it simply embarrass a few AOL users? Similarly, when millions of records of personal
data are lost or accessed illegally, what does that mean? With so few people addressing
data, our understanding remains quite narrow, boiling down to things like, “My credit card
number might be stolen” or “Do I care if anyone sees what I search?”
We might be accustomed to thinking about data as fixed values to be analyzed, but data is a
moving target. How do we build representations of data that adjust to new values every
second, hour, or week? This is a necessity because most data comes from the real world,
where there are no absolutes. The temperature changes, the train runs late, or a product
launch causes the traffic pattern on a web site to change drastically. What happens when
things start moving? How do we interact with “live” data? How do we unravel data as it
changes over time? We might use animation to play back the evolution of a data set, or
interaction to control what time span we’re looking at.
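As a rough sketch of the playback idea, the following Python example animates a synthetic series one point at a time using matplotlib's animation support; the data and frame rate are arbitrary assumptions.

```python
# A rough sketch, with synthetic data, of replaying the evolution of a
# dataset over time using matplotlib's animation support.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

rng = np.random.default_rng(0)
values = np.cumsum(rng.normal(size=200))  # a stand-in "live" series

fig, ax = plt.subplots()
(line,) = ax.plot([], [])
ax.set_xlim(0, len(values))
ax.set_ylim(values.min() - 1, values.max() + 1)

def update(frame):
    # Reveal one more data point per frame, replaying the evolution.
    line.set_data(range(frame), values[:frame])
    return (line,)

anim = FuncAnimation(fig, update, frames=len(values), interval=30, blit=True)
plt.show()
```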
The process of visualization can help us see the world in a new way, revealing unexpected
patterns and trends in the otherwise hidden information around us. At its best, data
visualization is expert storytelling.
Mapping data by hand can be satisfying, yet is slow and tedious. So we usually employ the
power of computation to speed things up. The increased speed enables us to work with
much larger datasets of thousands or millions of values; what would have taken years of
effort by hand can be mapped in a moment. Just as important, we can rapidly experiment
with alternate mappings, tweaking our rules and seeing their output rerendered
immediately. This loop of write/render/evaluate is critical to the iterative process of
refining a design.
Why Interactive?
Dynamic, interactive visualizations can empower people to explore the data for themselves.
An interactive visualization that offers an overview of the data alongside tools for “drilling
down” into the details may successfully fulfill many roles at once, addressing the different
concerns of different audiences, from those new to the subject matter to those already
deeply familiar with the data.
Of course, interactivity can also encourage engagement with the data in ways that static
images cannot. With animated transitions and well-crafted interfaces, some visualizations
can make exploring data feel more like playing a game. Interactive visualization can be a
great medium for engaging an audience who might not otherwise care about the topic or
data at hand.
Visualizations aren’t truly visual unless they are seen. Getting your work out there for
others to see is critical, and publishing on the Web is the quickest way to reach a global
audience. Working with web-standard technologies means that your work can be seen and
experienced by anyone using a recent web browser, regardless of the operating system
(Windows, Mac, Linux) and device type (laptop, desktop, smartphone, tablet).
Acquire
Obtain the data, whether from a file on a disk or a source over a network.
Parse
Provide some structure for the data’s meaning, and order it into categories.
Filter
Remove all but the data of interest.
Mine
Apply methods from statistics or data mining as a way to discern patterns or place the
data in a mathematical context.
Represent
Choose a basic visual model, such as a bar graph, list, or tree.
Refine
Improve the basic representation to make it clearer and more visually engaging.
Interact
Add methods for manipulating the data or controlling what features are visible.
An Example
To illustrate the seven steps listed in the previous section, and how they contribute to
effective information visualization, let’s look at how the process can be applied to
understanding a simple data set. In this case, we’ll take the zip code numbering system that
the U.S. Postal Service uses. The application is not particularly advanced, but it provides a
skeleton for how the process works.
Acquire
The acquisition step involves obtaining the data. Like many of the other steps, this can be
either extremely complicated (e.g., trying to glean useful data from a large system) or very
simple (e.g., reading a readily available text file).
A copy of the zip code listing can be found on the U.S. Census Bureau web site, as it is
frequently used for geographic coding of statistical data. The listing is a freely available file
with approximately 42,000 lines, one for each of the codes, a tiny portion of which is shown
in Figure 1-1.
Figure 1-1. Zip codes in the format provided by the U.S. Census Bureau
Acquisition concerns how the user downloads your data as well as how you obtained the
data in the first place. If the final project will be distributed over the Internet, as you design
the application, you have to take into account the time required to download data into the
browser. And because data downloaded to the browser is probably part of an even larger
data set stored on the server, you may have to structure the data on the server to facilitate
retrieval of common subsets.
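A minimal sketch of the Acquire step in Python appears below; the URL is hypothetical, standing in for wherever the zip code listing actually lives.

```python
# A minimal sketch of the Acquire step: fetch the data file over the
# network if a local copy is not already present. The URL below is
# hypothetical; substitute the actual location of the zip code listing.
import os
import urllib.request

DATA_URL = "https://example.com/zips.tsv"  # hypothetical URL
LOCAL_PATH = "zips.tsv"

if not os.path.exists(LOCAL_PATH):
    urllib.request.urlretrieve(DATA_URL, LOCAL_PATH)

with open(LOCAL_PATH, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(f"acquired {len(lines)} records")
```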
Parse
After you acquire the data, it needs to be parsed—changed into a format that tags each part
of the data with its intended use. Each line of the file must be broken into its individual
parts; in this case, it must be delimited at each tab character. Then, each piece of data needs
to be converted to a useful format. Figure 1-2 shows the layout of each line in the census
listing, which we have to understand to parse it and get out of it what we want.
Each field is formatted as a data type that we’ll handle in a conversion program:
String
A set of characters that forms a word or a sentence. Here, the city or town name is
designated as a string. Because the zip codes themselves are not so much numbers as a
series of digits (if they were numbers, the code 02139 would be stored as 2139, which is
not the same thing), they also might be considered strings.
Float
A number with decimal points (used for the latitudes and longitudes of each location). The
name is short for floating point, from programming nomenclature that describes how the
numbers are stored in the computer’s memory.
Character
A single letter or other symbol. In this data set, a character sometimes designates special
post offices.
Integer
A number without a fractional portion, and hence no decimal points (e.g., –14, 0, or 237).
Index
Data (commonly an integer or string) that maps to a location in another table of data. In
this case, the index maps numbered codes to the names and two-digit abbreviations of
states. This is common in databases, where such an index is used as a pointer into another
table, sometimes as a way to compact the data further (e.g., a two-digit code requires less
storage than the full name of the state or territory).
With the completion of this step, the data is successfully tagged and consequently more
useful to a program that will manipulate or represent it in some way.
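Continuing the Acquire sketch above, the fragment below illustrates the Parse step in Python. The column order is an assumption for illustration, since the actual layout comes from Figure 1-2.

```python
# A sketch of the Parse step, continuing from the Acquire sketch
# (which produced `lines`). The column order below is an assumption;
# the real layout is given in Figure 1-2.
from dataclasses import dataclass

@dataclass
class ZipRecord:
    code: str        # a zip code is a string of digits, not a number ("02139")
    state: str       # assumed: two-letter state abbreviation
    latitude: float  # floating-point coordinate
    longitude: float
    city: str

def parse_line(line: str) -> ZipRecord:
    # Each line is tab-delimited; convert each field to a useful type.
    fields = line.split("\t")
    return ZipRecord(
        code=fields[0],
        state=fields[1],
        latitude=float(fields[2]),
        longitude=float(fields[3]),
        city=fields[4],
    )

records = [parse_line(line) for line in lines]
```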
Filter
The next step involves filtering the data to remove portions not relevant to our use. In this
example, for the sake of keeping it simple, we’ll be focusing on the contiguous 48 states, so
the records for cities and towns that are not part of those states— Alaska, Hawaii, and
territories such as Puerto Rico—are removed. Another project could require significant
mathematical work to place the data into a mathematical model or normalize it (convert it
to an acceptable range of numbers).
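A short sketch of the Filter step, continuing the Parse example; the set of excluded state abbreviations is an assumption based on the description above.

```python
# A sketch of the Filter step: keep only records from the contiguous
# 48 states. The abbreviations excluded here are assumptions based on
# the description above (Alaska, Hawaii, and territories).
EXCLUDED = {"AK", "HI", "PR", "VI", "GU", "AS", "MP"}

filtered = [r for r in records if r.state not in EXCLUDED]
```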
Mine
This step involves math, statistics, and data mining. The data in this case receives only a
simple treatment: the program must figure out the minimum and maximum values for
latitude and longitude by running through the data (as shown in Figure 1-3) so that it can
be presented on a screen at a proper scale. Most of the time, this step will be far more
complicated than a pair of simple math operations.
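Continuing the example, the Mine step here reduces to the pair of simple operations just described: scanning for the extreme values.

```python
# A sketch of the Mine step: scan the filtered records to find the
# minimum and maximum latitude and longitude, so the points can later
# be scaled to fit the screen.
min_lat = min(r.latitude for r in filtered)
max_lat = max(r.latitude for r in filtered)
min_lon = min(r.longitude for r in filtered)
max_lon = max(r.longitude for r in filtered)
```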
Represent
This step determines the basic form that a set of data will take. Some data sets are shown as
lists, others are structured like trees, and so forth. In this case, each zip code has a latitude
and longitude, so the codes can be mapped as a two-dimensional plot, with the minimum
and maximum values for the latitude and longitude used for the start and end of the scale
in each dimension. This is illustrated in Figure 1-4.
The Represent stage is a linchpin that informs the single most important decision in a
visualization project and can make you rethink earlier stages. How you choose to represent
the data can influence the very first step (what data you acquire) and the third step (what
particular pieces you extract).
Refine
In this step, graphic design methods are used to further clarify the representation, drawing
attention to the most important data and making the display clearer and more visually
engaging.
Interact
The next stage of the process adds interaction, letting the user control or explore the data.
Interaction might cover things like selecting a subset of the data or changing the viewpoint.
As another example of a stage affecting an earlier part of the process, this stage can also
affect the refinement step, as a change in viewpoint might require the data to be designed
differently.
In the Zipdecode project, typing a number selects all zip codes that begin with that number.
Figures 1-6 and 1-7 show all the zip codes beginning with zero and nine, respectively.
In addition, users can enable a “zoom” feature that draws them closer to each subsequent
digit, revealing more detail around the area and showing a constant rate of detail at each
level. Because we’ve chosen a map as a representation, we could add more details of state
and county boundaries or other geographic features to help viewers associate the “data”
space of zip code points with what they know about the local environment.
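As a final sketch, the selection logic behind the Zipdecode interaction might look like this in Python, continuing the earlier example; the function name is invented for illustration.

```python
# A sketch of the selection logic behind the Zipdecode interaction:
# typing another digit narrows the display to zip codes that begin
# with the accumulated prefix. The function name is invented here.
def select_by_prefix(records, prefix: str):
    """Return only the records whose zip code starts with `prefix`."""
    return [r for r in records if r.code.startswith(prefix)]

nines = select_by_prefix(filtered, "9")    # e.g., Figure 1-7
deeper = select_by_prefix(filtered, "02")  # each digit drills further in
```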