DVT Unit-1 Notes
Application and Popularization: Existing graphical methods were refined and applied.
Key Points:
While the early 1900s seemed like a "dark age" in terms of new graphic inventions, it was vital
for the widespread adoption and practical application of existing methods.
The era also set the stage for future advancements in data visualization by establishing standards
and laying the groundwork for more complex analysis.
The increase in the use of statistical analysis caused a temporary decline in the perceived value of
graphic data display.
Data visualization began to grow in importance in the mid-1960s after a period of dormancy since
the 1930s, driven largely by three key developments. In the USA, John W. Tukey called for data
analysis to be recognized as a distinct branch of statistics in his influential paper, "The Future of
Data Analysis" (1962). He introduced new graphic displays under the term 'exploratory data
analysis' (EDA), including stem-and-leaf plots and boxplots, which became well-known in the statistical
community. Though his book on EDA was only published in 1977, its chapters circulated earlier
and made graphical data analysis more appealing.
In France, Jacques Bertin published "Sémiologie graphique" (1967), which organized visual
elements in graphics akin to Mendeleev's organization of chemical elements. Alongside this, Jean-
Paul Benzécri developed a visual approach to multidimensional data analysis. Various schools of
thought around graphical data emerged in Europe, including the Netherlands and Germany, despite
a decline in hand-drawn graphics during this time.
The introduction of Fortran in 1957 marked the start of computer processing for statistical data. By
the late 1960s, mainframe computers allowed for the construction of graphics through computer
programs. Significant collaborations emerged between computer science, data analysis, and display
technology, leading to innovative visualization methods. New techniques for multivariate data
representation, dimension-reduction methods, animations of statistical processes, and perceptually
based theories also began to develop, leading to the first modern GIS and interactive systems for
statistical graphics.
During the last quarter of the 20th century, data visualization became a well-developed and lively
area of research involving various fields. Many software tools for different visualization methods
and data types became available for desktop computers. However, giving a quick overview of the
latest trends in data visualization is challenging due to their diversity and rapid development across
many disciplines. A few major themes emerge despite this complexity.
One key development was the creation of highly interactive statistical computing systems, which
shifted from command-driven, programmable systems to those designed for visual data analysis.
New methods for visualizing complex, high-dimensional data also emerged, such as the grand tour,
scatterplot matrix, and parallel coordinates plot. Additionally, there was a resurgence in graphical
techniques for discrete and categorical data, with visualization methods applied to a growing
number of problems and data structures. There was increased focus on the cognitive and perceptual
aspects of how data is displayed.
These advancements in visualization techniques depended on progress in theoretical and
technological infrastructure. Examples include the growth of statistical and graphics software, both
commercial and open-source, as well as new statistical modeling approaches. Innovations in
computer processing capabilities allowed for the use of intense computational methods and access
to large datasets.
From the early 1970s to the mid-1980s, advances mainly focused on static graphs for
multidimensional quantitative data. Techniques like principal component analysis allowed for the
visualization of high-dimensional datasets in interesting lower-dimensional views. The foundations
for visualizing multidimensional contingency tables were established, leading to specialized
techniques such as the mosaic plot, which became widely used.
Statistical Historiography
This review is based on information from the Milestones Project, which covers important
developments in data visualization. It explores how modern statistics and graphics can inform this
history, a concept called ‘statistical historiography’. This also provides new perspectives on the
history.
History as ‘Data’
Historical events are usually marked with specific dates and descriptions. Joseph Priestley, one of
the first to view history as data, created a Chart of Biography in the late 1700s. This chart displayed
the lifespans of famous individuals from 1200 B.C. to A.D. 1750 using horizontal lines and dots to
show uncertainty. Priestley classified these people vertically, such as statesmen or scholars. His
work influenced Playfair's time-series and bar charts, though British statisticians at the time didn't
see the connection between history and data. In 1885, Alfred Marshall argued that statistics could
help understand historical events through 'historical curves'.
Analysing Milestones Data
The Milestones Project collects information as a chronological list and maintains it in a relational
database for data analysis. The simplest analyses focus on trends over time. A density estimate of
248 milestone items from 1500 to the present shows notable trends, with a steady rise until about
1880, followed by a decline through 1945, and a sharp rise to the present. The peak during the
Golden Age is close to present levels but may be underrepresented due to fewer recent events.
Different trends can be observed by classifying the items based on factors like place and content.
When items are sorted by development place (Europe vs. North America), it reveals that the peak in
Europe around 1875-1880 coincides with a smaller peak in North America. Europe's decline after
the Golden Age was matched by an initial rise in North America, driven by popularization and the
application of graphical methods, followed by a steeper decline due to the influence of
mathematical statistics.
Mosaic plots classify milestone items by subject and aspect. Early innovations mainly dealt with
physical subjects, while later periods shifted to mathematical topics. Human subjects were
prominent in the 19th century but less so overall. The analysis suggests that exploring historical
data is a fruitful area for future research.
2. GOOD GRAPHS
2.1 Introduction
Data graphics have been used for centuries to present information in a
visually meaningful way.
A historical example is Playfair’s Commercial and Political Atlas
(1801), which effectively visualized trade data between England and
Ireland.
The effectiveness of a graphic depends on how well it communicates information.
Many modern data graphics, especially in media and scientific publications,
fail to convey information clearly or may even mislead the audience.
The key factors that make a graphic effective include:
o Content – What information is being shown.
o Context – How the graphic fits within the overall message.
o Construction – How well it is designed.
A good graphic blends all these elements for clarity, accuracy, and visual appeal.
Figure 2.1. Playfair’s chart of trade between England and Ireland from 1700 to 1800
B. Context (How does the graphic fit in?)
A graphic must complement the overall narrative or research it supports.
It should be relevant and understandable in the given setting.
Example:
A business report may use bar charts, but the same format might not work
well in a scientific study.
Graphics should be placed close to the text that explains them for
easy reference.
Presentation Graphics Definition:
These graphics are carefully designed visual representations used to present
specific insights to an audience.
The focus is on clarity, accuracy, and effectiveness, ensuring that the
audience understands the intended message.
Key Considerations for Presentation Graphics:
1. Audience Awareness
o A chart in a business report should be different from one in a
scientific journal.
o The choice of visualization should match the audience’s familiarity
and expectations.
o Example: A pie chart might be suitable for a general audience, but a box plot
would be better for statisticians.
o Example:
A scientific paper may allow space for one key graph, so choosing
the right visualization is critical.
o Example:
A corporate report may be referenced for years, so
errors in visualization can have long-term
consequences.
o Example:
A histogram shows data distribution, while a box plot highlights
outliers.
3. Customization for Deep Analysis
o Since these graphics are used internally, they do not need to be
visually perfect.
o Analysts may use interactive tools like Python’s Matplotlib or
Tableau to explore relationships dynamically.
History
Data graphics have a long history, but most experts believe they truly began with Playfair
over 200 years ago. He created basic plots like bar charts, line graphs, and pie charts and made visually
appealing displays. Wainer and Spence recently published some of his work. While not all
of his graphics were good, many were. In the late 19th century, Minard created notable
graphics, including a famous chart of Napoleon's movements during the Moscow
campaign. The French Ministry of Public Works used his ideas in annual publications from
1879 to 1899 to present economic data geographically for France. In the early 20th century,
graphics were less used in statistics, although Fisher’s 1925 book highlighted their
importance. The 1920s and early 1930s saw Neurath’s group creating pictograms that
influenced modern infographics. Early computers limited graphic quality, but
improvements in hardware and software have since made it easier to create better graphics,
highlighting the need for quality in graphic presentation.
Literature
Several authors have written great books on making good statistical graphics, with Edward Tufte
being the most well-known. His books include many excellent examples and some poor ones,
explaining how to properly represent data. Tufte criticizes bad decoration and data
misrepresentation but focuses mainly on representing data correctly. Another valuable source is
Cleveland's books, which also provide useful advice on data displays. While practical guides exist,
there's a need for theory to help understand practices in this field. Bertin’s work on graphical
semiology is often mentioned, but there's little theory developed since then. Examples are crucial in
this area, and Wainer's books offer helpful and engaging illustrations. Websites like Friendly's
Gallery and ASK E.T. also provide various examples and advice. Readers can find many poor
examples easily without searching too hard.
Presenting simple statistical information can easily go wrong despite its apparent simplicity.
Distortions can occur due to misleading scales, 3-D displays complicating 2-D data
comparisons, non-proportional areas, and cluttered visuals making it hard to discern
information. Beyond technical issues, there are semantic challenges as well. A graphic's
message depends on its caption, headline, and accompanying article, which should ideally
align and support each other. However, discrepancies among these elements can lead to
confusion. Statisticians may have little control over headlines and accompanying articles if
they are not the primary authors.
In the media, graphics often highlight news items or enhance articles, sometimes created by
independent companies under tight deadlines, which can lead to awkward fits. Academic
publications, where authors produce graphics, should aim for careful and thorough
preparation to ensure context alignment.
The success of a graphic also relies on its subject, context, and aesthetics. Familiarity with
certain graphic types helps readers interpret them effectively; however, new forms may cause
misunderstandings. This issue extends beyond graphics to design in general, as creators often
find that their work is misinterpreted. Additionally, graphics may appear differently in print
versus on screens, and complex visuals work better in scientific articles than in brief TV
segments. Graphics explained by commentators differ from printed graphics, and interactive
web graphics allow for more information without cluttering the display.
Plotting a single variable is usually easy. The type of variable affects the choice of graphic;
for example, use histograms or boxplots for continuous variables, and bar charts or
pie charts for categorical variables. Data transformation or aggregation depends on the data
distribution and graphic goals. Scaling and captioning should be simple but chosen
carefully.
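A minimal matplotlib sketch of these one-variable defaults; the data here are invented for illustration, not taken from the text:

```python
# Sketch: choosing a one-variable display by variable type.
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(170, 10, 200)               # a continuous variable
colours = ["red", "blue", "green", "blue"] * 25  # a categorical variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(heights, bins=20)        # histogram for continuous data
ax1.set_title("Continuous: histogram")
labels, counts = np.unique(colours, return_counts=True)
ax2.bar(labels, counts)           # bar chart for categorical data
ax2.set_title("Categorical: bar chart")
fig.tight_layout()
```

A boxplot (`ax1.boxplot(heights)`) or pie chart would substitute directly, depending on the graphic's goal.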
Multivariate graphics are more complex. Displaying the joint distribution of two
categorical variables is challenging. The main decision is the type of display, with variable
choice and order being important. Generally, the dependent variable should be plotted last,
and in a scatterplot, it is traditional to place the dependent variable on the vertical axis.
The goal is to develop a method that provides good results most of the time, while users
should verify and adjust the scales for their data. Difficult scaling issues arise when data
cross natural boundaries, such as scaling data from 4 to 95 being easier than from 4 to 101.
Choosing scales that span from minimum to maximum can obscure boundary points, so it's
advisable to extend scales beyond observed limits and use easily understandable rounded
values. There is no strict requirement to include zero in a scale, but there should be a good
reason for excluding it to avoid misleading readers. Zero is a common baseline, but one
can also use other values like one or one hundred for financial data.
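The scale advice above can be sketched as a small helper that extends the observed range outward to rounded values; the step-based rounding rule is an assumption for illustration, not a prescription from the text:

```python
import math

def nice_limits(lo, hi, step=5):
    """Extend [lo, hi] outward to multiples of `step`,
    so boundary points are not drawn on the axis edge."""
    return (math.floor(lo / step) * step,
            math.ceil(hi / step) * step)

# Data from 4 to 95 fits below the natural boundary of 100;
# data from 4 to 101 forces the scale past it.
print(nice_limits(4, 95))   # (0, 95)
print(nice_limits(4, 101))  # (0, 105)
```

Note that the lower limit happens to land on zero here; whether zero *should* be included is a separate judgment, as the text explains.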
Figures illustrating the cumulative times in the Tour de France and histograms of the
Hidalgo stamp thickness data emphasize the importance of scale choices and labeling axes.
Good graphics should align axes properly and allow for meaningful comparisons across
different graphics. An appropriate number of labels is crucial: too many can create clutter,
while too few can hinder comprehension. Lastly, unnecessary tick marks can complicate
the visuals and obscure important data.
The effect of a display can be influenced by many factors. The order in which variables are
shown in graphics can make a difference. This is seen in various plots like parallel
coordinate plots and matrix visualizations. When dealing with nominal variables without
natural ordering, the plotting order matters significantly. It can be alphabetic, geographic,
or based on size or another variable. Two bar charts of the Titanic data show this, with
different ordering methods for the same data set.
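The ordering point can be shown in a few lines: the same counts read quite differently sorted alphabetically versus by size. The class totals below are illustrative stand-ins for the Titanic data:

```python
# Two orderings of the same counts: alphabetic vs by size.
counts = {"Crew": 885, "First": 325, "Second": 285, "Third": 706}

alphabetic = sorted(counts)                             # by name
by_size = sorted(counts, key=counts.get, reverse=True)  # by value

print(alphabetic)  # ['Crew', 'First', 'Second', 'Third']
print(by_size)     # ['Crew', 'Third', 'First', 'Second']
```

Passing either ordering to a bar chart (`ax.bar(order, [counts[k] for k in order])`) produces the two contrasting displays the text describes.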
Adding Model or Statistical Information – Overlaying (Statistical) information
Guides can be used on plots to highlight specific issues and indicate positive or negative
values. Sloping guides show deviations from a straight line. Fitted lines, like polynomial
regression, help illustrate overall patterns and local variations. For example, a figure showing
times from a 100-meter road race indicates a linear relationship for faster runners and a flat
one for slower ones.
In scientific journals, it is common to plot point estimates with their confidence intervals,
usually at 95%. Another figure demonstrates the deterioration of thin plastic over time, with
high variability in results. Overlapping information can clutter displays, and adjustments may
be needed for clear presentation, especially when preparing for publication.
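A hedged matplotlib sketch of both overlays discussed above, a fitted guide line and point estimates with approximate 95% intervals; all numbers are invented:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 * x + rng.normal(0, 1.5, x.size)  # roughly linear data

fig, ax = plt.subplots()
ax.scatter(x, y, s=15)
coef = np.polyfit(x, y, 1)                 # fitted guide line
ax.plot(x, np.polyval(coef, x), linewidth=2)

# Point estimates with ~95% intervals (1.96 x standard error).
means = np.array([2.1, 3.8, 6.2])
half_width = 1.96 * np.array([0.4, 0.5, 0.3])
ax.errorbar([2, 5, 8], means, yerr=half_width, fmt="o", capsize=3)
```

Higher-degree fits (`np.polyfit(x, y, 3)`) would show local variation as well as the overall pattern.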
Captions should clearly explain the graphic they support and include the data source, as
relying on surrounding text is often ineffective. While ideal captions can be long and
detailed, they may discourage readers. A balance where the caption summarizes the
graphic, with detailed explanations in the text, is often a better approach.
Legends should indicate which symbols and colors represent data groups directly on the
plot to avoid unnecessary eye movement.
Annotations highlight specific features but should be used sparingly due to space
limitations. The union estimates of protest turnout are larger than police estimates, except
in Marseille and Paris, where the differences are significant.
Positioning in Text
Keeping graphics and text on the same page or on facing pages is helpful for practical
reasons. It's inconvenient to flip pages back and forth when graphics and related text are on
different pages. However, it is not always possible to avoid this. Where graphics are placed
on a page is a design issue.
Parallel Coordinates
Mosaic Plots
- Represents cumulative times of riders in the 2004 Tour de France for all 21 stages.
- Vertical lines indicate stages, and lines show individual rider progress.
- The plot reveals time differences increasing in mountain stages and fewer line
crossings in later stages, indicating stable rankings.
- Displays time series plots, highlighting the importance of proper time alignment.
Figure 2.16. Cancer mortality rates for white males in the USA between 1970 and 1994 by
State Economic Area. The scale has been chosen so that each interval contains 10% of the SEAs.
- Shows regional patterns, with higher rates in the East and lower in the Midwest.
3. STATIC GRAPHS
Static graphics in data visualization refer to images or charts that present data in a fixed,
unchanging form. These visualizations do not allow for user interaction or real-time updates;
they are simply
representations of data at a particular point in time. Static graphics are typically used in
reports, printed materials, presentations, and other contexts where interactivity is not required.
Here are some common types of static graphics used in data visualization:
1. Bar Charts
FRAME: gov*birth GRAPH: bar()
A description consistent with this chapter would involve a description of the coordinate
systems and graphical shapes that make up the plot. For example, the barplot consists of
a plotting region and several graphical elements. The plotting region is positioned to
provide margins for axes and has scales appropriate to the range of the data. The
graphical elements consist of axes drawn on the edges of the plotting region, plus three
rectangles drawn relative to the scales within the plotting region, with the height of the
rectangles based on the birth rate data.
3. Grid Lines:
o Optional but helpful, grid lines allow viewers to visually track the values
of data points across the plot. This is particularly useful for line charts,
bar charts, and scatter plots.
4. Data Points / Markers:
o The actual data points (dots, bars, lines) that represent the values being
plotted. Depending on the chart, these markers might have different shapes,
sizes, or colors to distinguish between different groups or categories.
5. Annotations:
3.2 Customization
Let us assume that your statistical software allows you to produce a complete plot from a
single command and that it provides sensible defaults for the positioning and appearance of
the plot. It is still quite unlikely that the plot you end up with will be exactly what you
want. For example, you may want a different scale on the axes, or the tick marks in
different positions, or no axes at all. After being able to draw something, the next most
important feature of statistical graphics software is the ability to control what gets drawn
and how it gets drawn.
When creating plots, the parameters become more complex. For instance, when drawing an
axis, one parameter might set the number of tick marks, and another might set the axis
label text. An important parameter for a complete plot is the data to display, along with
options for showing axes and legends.
In R, each graphics function has specific parameters for controlling output. For example,
you can create a plot without axes and labels and add a line with specific controls for its
position, color, type, and width.
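The text describes this in R; a rough matplotlib analogue of the same customization, using invented data, might look like:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
y = x ** 2

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_axis_off()  # suppress axes and labels entirely

# Add a reference line with explicit control over its
# position, colour, type, and width.
ax.axhline(50, color="red", linestyle="--", linewidth=2)
```

In R the equivalent would be arguments such as `axes = FALSE` to the plotting call, with `abline()` supplying the extra line.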
Graphical Parameters
A common set of graphical parameters can be used to change how graphical outputs look.
This includes line color, fill color, line width, and line style. These parameters are similar
to the graphics state in the PostScript language. For complete control over the appearance
of graphical outputs, statistical graphics software needs to provide a full set of graphical
parameters. Some often-overlooked parameters are semitransparent colors, line joins and
endings, and access to various fonts. Edward Tufte suggests using professional graphics
software like Adobe Illustrator for high quality, but better results can be achieved with
control within the statistical graphics software itself.
In R, there are many graphical parameters to control aspects like colors, line types, and
fonts, which can be further expanded to include basic drawing parameters found in
advanced graphics languages such as SVG. Examples include gradient fills and general
pattern fills. Composition of graphical elements can be illustrated by adding a legend to a
plot, with both having transparent backgrounds while the plot has grid lines. If we do not
want grid lines behind the legend, one way is to have the legend output fully cover the plot
output where drawn.
Drawing the legend on top of the plot can lead to an undesired result where grid lines are
visible behind it. Using an opaque background for the legend can work, but it relies on
knowing the plot's background color. A better solution involves more complex image
manipulations, such as negating the alpha channel of the legend. This process allows the
legend to blend with the plot without showing unwanted grid lines behind it, regardless of
the plot’s final background color.
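The simpler of the two approaches, an opaque legend background covering the grid lines, can be sketched in matplotlib (the alpha-channel manipulation described above has no one-line equivalent there):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin")
ax.plot(x, np.cos(x), label="cos")
ax.grid(True)

# framealpha=1 makes the legend patch fully opaque, so grid
# lines are hidden where the legend output covers the plot.
leg = ax.legend(loc="upper right", framealpha=1)
```

As the text notes, this works but assumes the legend patch matches the plot's background colour.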
-> Arranging Plots
When creating multiple plots on a page, new free parameters related to their location and size
become available. It's crucial for statistical graphics software to allow specifying arrangements for
these plots. In R, one can easily create an array of equally sized plots and different sized
arrangements using specific code.
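A matplotlib sketch of both arrangements, an equal-sized array and a layout with different cell sizes; the specific ratios are arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# A 2 x 2 array of equally sized plots.
fig1, axes = plt.subplots(2, 2)

# An arrangement with different sizes: a grid of rows and
# columns whose proportions are set explicitly.
fig2 = plt.figure()
gs = fig2.add_gridspec(2, 2, width_ratios=[2, 1], height_ratios=[1, 2])
big = fig2.add_subplot(gs[1, 0])    # large lower-left cell
small = fig2.add_subplot(gs[0, 1])  # small upper-right cell
```

In R, `par(mfrow = c(2, 2))` or the `layout()` function plays the same role.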
-> Annotation
A more complex customization adds more graphical output to a plot, like informative labels for
data symbols.
Graphical Primitives
The first requirement for producing annotations is the ability to create basic graphical
output, like text labels. Statistical graphics software should function like a drawing
program, allowing users to draw lines, rectangles, and text. In R, functions exist for
standard graphical elements. Example code shows how to add shapes and text to a plot.
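A matplotlib version of the drawing-program idea, adding a line, a rectangle, and a text label to an existing plot; the coordinates are invented:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 2, 3])                  # the base plot
ax.plot([1, 3], [3, 1], color="grey")          # a plain line segment
ax.add_patch(Rectangle((1.5, 1.5), 1, 0.5,     # a rectangle
                       fill=False, edgecolor="black"))
ax.text(2, 2.6, "interesting region")          # a text label
```

The R equivalents are `lines()`, `rect()`, and `text()` called after the main plotting function.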
Coordinate Systems
Statistical graphics software is unique because it can create multiple graphical outputs at
once and position them in different coordinate systems. For example, the title of a plot can
be placed halfway across the page using a normalized coordinate system for the whole
page, while data symbols in a scatterplot are positioned based on the range of the data
within the plot area. Axis labels can also be centered along an axis using a normalized
coordinate system limited to the plot axes.
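The three coordinate systems mentioned (page, data, and axis-relative) map onto matplotlib's transforms; a brief sketch with invented content:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1, 2, 3], [4, 5, 6])   # symbols placed in data coordinates

# Title halfway across the page: normalized (0..1) figure coordinates.
fig.text(0.5, 0.95, "A Title", ha="center")

# A label centred along the axis: normalized axes coordinates,
# limited to the plot region rather than the whole page.
ax.text(0.5, -0.12, "centred axis label",
        transform=ax.transAxes, ha="center")
```

Each piece of output is positioned in a different system at once, which is the capability the text singles out.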
Users often export their plots for editing in third-party software, like Microsoft Office, but
this loses the original coordinate systems, making it harder to add annotations accurately.
Annotations should be placed relative to the original plot's coordinate systems to maintain
proper positioning.
In R, the traditional graphics functions only work with one coordinate system at a time,
such as the text() function for axis scales and the mtext() function for plot margins. R’s
grid package offers a more flexible method for handling multiple coordinate systems,
which allows for more precise control over adding output to plots.
The user interface for controlling graphical output can use either a command line or a
graphical user interface (GUI). In a command-line setup, function calls include arguments
for each control parameter, while GUIs offer dialog boxes filled with options. A challenge
in statistical graphics is the many parameters needed for complex elements, like a matrix of
scatterplots, where each element requires specific control. Using a mouse to select plot
components is intuitive but can lead to confusion due to the hierarchical nature of plots.
Command lines allow for more precision, such as expressing axis text with specific paths.
Additionally, in GUIs, repeating editing actions across different plots is difficult, whereas
command lines can easily capture and repeat operations.
3.3 Extensibility
The ability to create complete plots and manage their appearance is a basic requirement for
statistical graphics software. A more advanced feature is the option to add new types of
plots. Creating a new plot differs from customization because it starts from scratch and can
be shared with others as a new function or menu item. Essential features needed to develop
new plots include the ability for users to add functions or menu items, access to low-level
building blocks, and support for combining these blocks into larger graphical elements.
The production of a complex plot requires placing different parts within various coordinate
systems. The output arrangement in a coordinate system is clear, like drawing a data
symbol at a specific spot and size. However, the way coordinate systems or plots are
arranged relative to each other is less clear, often organized in rows and columns, similar to
typesetting in LATEX or HTML. For a statistical graphics system, defining these implicit
arrangements is helpful.
In R, there’s a concept called ‘layout’ that divides a space into rows and columns with
adjustable sizes. A viewport can be positioned based on this layout instead of exact
coordinates. For instance, code can create a viewport in a central area with equal margins.
The next code positions another viewport, and nesting viewports allows for complex
implicit arrangements of graphical elements in R.
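R's nested viewports have no exact matplotlib counterpart, but nested grid layouts give a comparable sketch of the idea; the arrangement here is arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

fig = plt.figure()
outer = fig.add_gridspec(1, 2)          # split the page into two columns
left = fig.add_subplot(outer[0, 0])

# Nest a further row/column layout inside one cell of the outer
# layout, mirroring how viewports nest in R's grid package.
inner = outer[0, 1].subgridspec(2, 1)
top = fig.add_subplot(inner[0, 0])
bottom = fig.add_subplot(inner[1, 0])
```

Deeper nesting works the same way, each `subgridspec` positioning its children relative to its parent cell rather than in absolute coordinates.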
A statistical graphics system should let users create basic graphic shapes and position them
flexibly. Users should also be able to record a series of drawing operations in a function for others
to use.
Extending a system relies heavily on the user interface. An extensible system usually
needs a language for creating new graphics, meaning users must write code to make
extensions. While graphical programming interfaces are possible, using a command line is
more powerful and flexible. A minimum requirement for a GUI is to record code for actions.
Ideally, the extension language should match the system's development language, with
scripting languages like R or Python being favored for ease of use.
Static 3-D plots are not very useful because it's hard to see 3-D structures without movement.
However, 3-D images are important, like for visualizing a model's prediction surface. R can
draw 3-D surfaces simply with the persp() function, while the rgl add-on package offers
access to the OpenGL 3-D graphics system.
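A matplotlib analogue of what `persp()` does in R, drawing a smooth surface that could stand in for a model's prediction surface; the function plotted is invented:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2, 2, 40)
X, Y = np.meshgrid(x, x)
Z = np.exp(-(X**2 + Y**2))   # a smooth, prediction-like surface

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
surf = ax.plot_surface(X, Y, Z)
```

For genuinely interactive rotation, the text's point stands: an OpenGL-backed system such as R's rgl package is the better tool.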
-> Speed
In dynamic and interactive statistical graphics, speed is crucial. The drawing process must be
fast to allow real-time updates as users change settings. In static graphics, speed is less
important; achieving the desired result is prioritized over the time it takes. Plots may take
several seconds to draw, which is acceptable. This speed consideration is key for the user
interface. In R, much graphics code is slower as it is written in interpreted R code, but it
allows users to see, modify, and create their own code. Limits are still needed because
drawing a single plot can take longer with many observations and batch jobs. Complex plots
in R, like Trellis plots, may also be slow, but users generally find this acceptable, as
generating all figures for a medium-sized book can take under a minute.
When creating plots for reports, different formats are needed based on the report type. For
printed reports, PostScript or PDF formats are best, while PNG is preferred for web
publication. There are many software options available to convert graphic formats, making
it easier to produce the required output without needing statistical graphics software to
support all formats.
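Producing both kinds of output is usually a one-line change per format; a matplotlib sketch, writing to a temporary directory for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import os
import tempfile

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])

outdir = tempfile.mkdtemp()
pdf_path = os.path.join(outdir, "report.pdf")  # vector: printed reports
png_path = os.path.join(outdir, "report.png")  # raster: web publication
fig.savefig(pdf_path)            # format inferred from the extension
fig.savefig(png_path, dpi=150)   # resolution matters only for raster
```

In R, the same split is handled by opening a `pdf()` or `png()` device before plotting.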
Nonetheless, it's valuable for statistical graphics software to support various formats. This
can enhance the lowest-common-denominator format, with R, for instance, allowing for
clipping in formats that lack this feature. Furthermore, modern formats can offer advanced
features like transparency and animation that simpler formats do not support.
Additionally, saving the original code used to create plots, like R code, is recommended
along with traditional formats. This allows for easier modifications and better manipulation
of plots compared to editing PDF or PostScript versions.
Statistical graphics software mainly focuses on how data is presented rather than on its
source. While this approach allows for more graphic flexibility, it is important to recognize
the need for tools to generate, import, and analyze data, as data is essential for meaningful
plots. Ideally, statistical graphics should be part of a larger system with data-handling
capabilities, like R.