
DVT Unit-1 Notes

The document outlines the historical milestones of data visualization from ancient maps to modern interactive graphics, highlighting key developments and figures in each era. It emphasizes the importance of effective graphic design, focusing on content, context, and construction to enhance clarity and communication. Additionally, it distinguishes between presentation graphics for communicating findings and exploratory graphics for data analysis, detailing their respective characteristics and purposes.

Uploaded by

madhuri p

UNIT-I

1. MILESTONES OF DATA VISUALIZATION

Pre-17th Century: Early Maps and Diagrams


- Early Visualization: Maps and geometric diagrams were crucial for navigation and astronomy.
Ancient civilizations like Egyptians, Babylonians, and Greeks developed sophisticated visual
representations. The Turin Papyrus Map (1150 BC) and the Dunhuang Star Chart are notable
examples.
- Coordinate Systems: Ancient Egyptians used concepts akin to latitude and longitude by 200
B.C. Claudius Ptolemy's map projections remained standard until the 14th century.
- Graphical Depictions: A 10th-century graph depicted celestial body positions over time. By
the 14th century, Nicole Oresme introduced proto bar graphs.
- Notable Figures:
- Claudius Ptolemy: Developed map projections using latitude and longitude.
- Nicole Oresme: Introduced proto bar graphs in the 14th century.
- Nicolas of Cusa: Explored theoretical graphs of distance vs. speed.

1600–1699: Measurement and Theory


Key Developments
- Physical Measurement: The 17th century focused on measuring time, distance, and space for
astronomy, surveying, map-making, navigation, and territorial expansion.
- Theoretical Advances:
- Analytic Geometry: Developed by Descartes and Fermat, introducing coordinate systems.
- Probability Theory: Initiated by Pascal and Fermat.
- Demographic Statistics: John Graunt and William Petty pioneered 'political arithmetic' to study
population and wealth.
- Notable Figures and Contributions:
- Galileo: Contributed to measurement theory and used telescopes for astronomical observations.
- Ole Roemer: Measured the speed of light in 1676.
- Michael Florent van Langren: Created an early visual representation of statistical data in 1644.
- Christiaan Huygens: Applied probability theory to graph continuous distribution functions.
- Technological Innovations:
- Telescopes: Revolutionized astronomy by enabling detailed observations of celestial bodies.
- Instruments for Navigation: Quadrants and sextants were used to measure angular distances.

1700–1799: New Graphic Forms


Key Developments
- Cartography and Thematic Mapping: Mapmakers began to show more than just
geographical positions, introducing isolines, contours, and thematic mapping of physical,
economic, and medical data. Edmund Halley developed isogons (lines of equal magnetic
declination) in 1701, and Philippe Buache introduced contour maps later.
- Abstract Graphs and Statistical Theory: Abstract graphs and graphs of functions became
more widespread, alongside early statistical theory developments. Johann Lambert contributed to
curve fitting and interpolation from empirical data points.
- Technological Innovations:
- Printing Techniques: Jacob le Blon invented three-color printing in 1710, and Aloys Senefelder
developed lithography in 1798, facilitating the reproduction of data images.
- Coordinate Paper: Dr. Buxton patented printed coordinate paper in 1794, though its widespread
use was delayed.
- Notable Figures:
- Joseph Priestley: Created timeline charts and visualizations of historical events.
- William Playfair: Invented line graphs, bar charts, pie charts, and circle graphs, revolutionizing
data visualization.

1800–1850: Beginning of Modern Graphics


Main Innovations in Cartography During the 18th Century
Technological and Methodological Advances
- Printing Techniques: Lithography, invented by Aloys Senefelder in 1798, allowed for precise
and error-free reproduction of maps, enhancing accuracy and reducing manual transcription errors.
- Triangulation Methods: Systematic surveys using triangulation improved map precision and
reliability. Innovators like Cassini and Jean Picard developed astronomical methods for measuring
longitude.
- Thematic Mapping: The 18th century saw the beginnings of thematic mapping, where maps
began to depict not just geographical features but also data like geology, economics, and
medicine.
Impact on Accuracy and Accessibility
- Marine Chronometer: John Harrison's invention of the marine chronometer enabled precise
longitude measurements at sea, significantly enhancing navigational accuracy.
- Increased Map Availability: Advances in printing and surveying techniques made maps more
accessible and widespread, contributing to broader geographical understanding.

1850–1900: The Golden Age of Statistical Graphics


The period from 1850 to 1900 is often referred to as the Golden Age of Statistical Graphics. This
era saw an explosive growth in the use of graphic methods and their application across various
topics. Key factors contributing to this growth included:
- Systematic Data Collection: Official state statistical offices were established across Europe,
emphasizing the importance of numerical data for planning and development.
- Statistical Theory: Advances by Gauss, Laplace, Quetelet, and others provided a framework for
analyzing large datasets.
- Technological Innovations: New methods and technologies enabled the creation of
sophisticated graphics.
- Graphical Innovations: Pioneers like Minard and Nightingale developed novel graphical
techniques, such as flow maps and polar area charts.

1900–1950: The Modern Dark Ages

- Late 1800s (Golden Age): A period of significant innovation in statistical graphics and thematic cartography.
- Early 1900s (Modern Dark Ages?):
- Initially seen as a decline in graphical innovation.
- Shift towards quantification and statistical models in the social sciences.
- However, it was also a time of:
- Mainstreaming: Statistical graphics became widely used in textbooks, education, government, business, and science.
- Application and Popularization: Existing graphical methods were refined and applied.
- Scientific Breakthroughs: Graphics played a crucial role in discoveries in astronomy, physics, and biology.
- Development of Standards: Efforts were made to standardize and improve graphical presentation.
- Foundation for Future Development: The period prepared the ground for the computational revolution.
- Key Points:
- While the early 1900s seemed like a "dark age" in terms of new graphic inventions, it was vital for the widespread adoption and practical application of existing methods.
- The era also set the stage for future advancements in data visualization by establishing standards and laying the groundwork for more complex analysis.
- The increase in the use of statistical analysis caused a temporary decline in the perceived value of graphic data display.

1950–1975: Rebirth of Data Visualization

Data visualization began to grow in importance in the mid-1960s after a period of dormancy since
the 1930s, driven largely by three key developments. In the USA, John W. Tukey called for data
analysis to be recognized as a distinct branch of statistics in his influential paper, "The Future of
Data Analysis" (1962). He introduced new graphic displays under the term 'exploratory data
analysis' (EDA), including stem-and-leaf plots and boxplots, which became well-known in the statistical
community. Though his book on EDA was only published in 1977, its chapters circulated earlier
and made graphical data analysis more appealing.

In France, Jacques Bertin published "Sémiologie graphique" (1967), which organized visual
elements in graphics akin to Mendeleev's organization of chemical elements. Alongside this, Jean-
Paul Benzécri developed a visual approach to multidimensional data analysis. Various schools of
thought around graphical data emerged in Europe, including the Netherlands and Germany, despite
a decline in hand-drawn graphics during this time.

The introduction of Fortran in 1957 marked the start of computer processing for statistical data. By
the late 1960s, mainframe computers allowed for the construction of graphics through computer
programs. Significant collaborations emerged between computer science, data analysis, and display
technology, leading to innovative visualization methods. New techniques for multivariate data
representation, dimension-reduction methods, animations of statistical processes, and perceptually
based theories also began to develop, leading to the first modern GIS and interactive systems for
statistical graphics.

1975–present: High-D, Interactive and Dynamic Data Visualization

During the last quarter of the 20th century, data visualization became a well-developed and lively
area of research involving various fields. Many software tools for different visualization methods
and data types became available for desktop computers. However, giving a quick overview of the
latest trends in data visualization is challenging due to their diversity and rapid development across
many disciplines. A few major themes emerge despite this complexity.

One key development was the creation of highly interactive statistical computing systems, which
shifted from command-driven, programmable systems to those designed for visual data analysis.
New methods for visualizing complex, high-dimensional data also emerged, such as the grand tour,
scatterplot matrix, and parallel coordinates plot. Additionally, there was a resurgence in graphical
techniques for discrete and categorical data, with visualization methods applied to a growing
number of problems and data structures. There was increased focus on the cognitive and perceptual
aspects of how data is displayed.

These advancements in visualization techniques depended on progress in theoretical and
technological infrastructure. Examples include the growth of statistical and graphics software, both
commercial and open-source, as well as new statistical modeling approaches. Innovations in
computer processing capabilities allowed for the use of intense computational methods and access
to large datasets.

From the early 1970s to the mid-1980s, advances mainly focused on static graphs for
multidimensional quantitative data. Techniques like principal component analysis allowed for the
visualization of high-dimensional datasets in interesting lower-dimensional views. The foundations
for visualizing multidimensional contingency tables were established, leading to specialized
techniques such as the mosaic plot, which became widely used.

The development of interactive graphic methods, enabling real-time manipulation of graphical
objects, further drove growth in data visualization. By the 1990s, these ideas were integrated into
more comprehensive systems for dynamic graphics, combining data manipulation and analysis,
showing the greater power of these combined factors compared to their individual parts.

Statistical Historiography

This review is based on information from the Milestones Project, which covers important
developments in data visualization. It explores how modern statistics and graphics can inform this
history, a concept called ‘statistical historiography.’ This also provides new perspectives on the
history.

History as ‘Data’
Historical events are usually marked with specific dates and descriptions. Joseph Priestley, one of
the first to view history as data, created a Chart of Biography in the late 1700s. This chart displayed
the lifespans of famous individuals from 1200 B.C. to A.D. 1750 using horizontal lines and dots to
show uncertainty. Priestley classified these people vertically, such as statesmen or scholars. His
work influenced Playfair's time-series and bar charts, though British statisticians at the time didn't
see the connection between history and data. In 1885, Alfred Marshall argued that statistics could
help understand historical events through 'historical curves.'

Analysing Milestones Data

The Milestones Project collects information as a chronological list and maintains it in a relational
database for data analysis. The simplest analyses focus on trends over time. A density estimate of
248 milestone items from 1500 to the present shows notable trends, with a steady rise until about
1880, followed by a decline through 1945, and a sharp rise to the present. The peak during the
Golden Age is close to present levels but may be underrepresented due to fewer recent events.
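
A density estimate of event dates can be sketched in a few lines. The example below is a minimal illustration using a Gaussian kernel; the list of years is invented for illustration and is NOT the Milestones Project data, and the bandwidth of 25 years is an arbitrary assumption.

```python
import math

def gaussian_kde(points, bandwidth):
    """Return a function that estimates the density of `points`
    by averaging Gaussian kernels centred on each point."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density

# Hypothetical milestone years, invented for illustration only
# (not the Milestones Project data).
years = [1530, 1600, 1644, 1686, 1701, 1765, 1786, 1801, 1821, 1833,
         1846, 1855, 1858, 1861, 1869, 1874, 1877, 1879, 1884, 1885,
         1896, 1913, 1924, 1957, 1962, 1967, 1969, 1971, 1974, 1975,
         1977, 1981, 1985, 1987, 1990, 1991, 1996, 1999]

# Bandwidth (in years) is an assumed smoothing choice.
density = gaussian_kde(years, bandwidth=25.0)

# Evaluate on a coarse grid and print a text "plot" of the smoothed trend.
grid = list(range(1500, 2001, 50))
values = [density(y) for y in grid]
for y, v in zip(grid, values):
    print(f"{y}  {'#' * round(v * 4000)}")
```

A smaller bandwidth follows individual events more closely, while a larger one shows only the broad rise, dip, and rise described above.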

Different trends can be observed by classifying the items based on factors like place and content.
When items are sorted by development place (Europe vs. North America), it reveals that the peak in
Europe around 1875-1880 coincides with a smaller peak in North America. Europe's decline after
the Golden Age was matched by an initial rise in North America, driven by popularization and the
application of graphical methods, followed by a steeper decline due to the influence of
mathematical statistics.

Mosaic plots classify milestone items by subject and aspect. Early innovations mainly dealt with
physical subjects, while later periods shifted to mathematical topics. Human subjects were
prominent in the 19th century but less so overall. The analysis suggests that exploring historical
data is a fruitful area for future research.

2. GOOD GRAPHS

2.1 Introduction
- Data graphics have been used for centuries to present information in a visually meaningful way.
- A historical example is Playfair’s Commercial and Political Atlas (1801), which effectively visualized trade data between England and Ireland.
- The effectiveness of a graphic depends on how well it communicates information.
- Many modern data graphics, especially in media and scientific publications, fail to convey information clearly or may even mislead the audience.
- The key factors that make a graphic effective include:
  - Content – What information is being shown.
  - Context – How the graphic fits within the overall message.
  - Construction – How well it is designed.
- A good graphic blends all these elements for clarity, accuracy, and visual appeal.

Content, Context, and Construction


A. Content (What is being plotted?)
- The most crucial part of a graphic is the data itself.
- A graphic should accurately reflect the data and make the message clear.
- No amount of fancy design can make a meaningless or misleading graphic useful.

Figure 2.1. Playfair’s chart of trade between England and Ireland from 1700 to 1800

B. Context (How does the graphic fit in?)
- A graphic must complement the overall narrative or research it supports.
- It should be relevant and understandable in the given setting.
- Example: A business report may use bar charts, but the same format might not work well in a scientific study.
- Graphics should be placed close to the text that explains them for easy reference.

C. Construction (How is it designed?)

- Good design improves readability and ensures that the message is clear.
- Example:
  - The DDB Life Style Survey (1975–1998) was presented using two different charts.
  - The left-hand graph used gridlines, 3D columns, and colors, which could distort the perception of data.
  - The right-hand graph was simpler and clearer, making it a better choice.
- Key design principles:
  - Consistency – Use the same style and scaling across multiple graphics.
  - Proximity – Keep related graphics and text close together.
  - Layout – Graphics should be appropriately sized and well-positioned.

Presentation Graphics and Exploratory Graphics


Data visualization serves two primary purposes:

1. Presentation Graphics – Used to communicate findings to others in a clear and structured manner.
2. Exploratory Graphics – Used to analyze and discover insights from data before making final conclusions.

A. Presentation Graphics (For Communicating Information)

Definition:
- These graphics are carefully designed visual representations used to present specific insights to an audience.
- The focus is on clarity, accuracy, and effectiveness, ensuring that the audience understands the intended message.

Key Considerations for Presentation Graphics:
1. Audience Awareness
  - A chart in a business report should be different from one in a scientific journal.
  - The choice of visualization should match the audience’s familiarity and expectations.
  - Example: A pie chart might be suitable for a general audience, but a box plot would be better for statisticians.
2. Design and Readability
  - Since presentation graphics are often published or shared widely, they must be visually appealing and easy to understand.
  - Guidelines for effective design:
    - Use appropriate labels and scaling.
    - Avoid unnecessary decorations that may distract from the message.
    - Choose simple, clear colors and avoid 3D effects that can distort perception.
3. Space and Layout Constraints
  - Often, there is limited space available for charts in reports, presentations, or newspapers.
  - Example: A scientific paper may allow space for one key graph, so choosing the right visualization is critical.
4. Longevity and Accuracy
  - Unlike exploratory graphics, which may be temporary, presentation graphics remain in use for a long time.
  - Example: A corporate report may be referenced for years, so errors in visualization can have long-term consequences.

B. Exploratory Graphics (For Analyzing Data)

Definition:
- These graphics are used to explore and understand raw data, helping researchers and analysts uncover patterns, trends, and relationships.
- The focus is on quick insights rather than polished design.

Key Characteristics of Exploratory Graphics:
1. Iterative and Flexible
  - These graphs are temporary and can be quickly modified or discarded.
  - Example: A data scientist analyzing customer sales data may generate multiple scatter plots before finding a meaningful trend.
2. Multiple Views of the Same Data
  - Different visualization methods may reveal different insights.
  - Example: A histogram shows data distribution, while a box plot highlights outliers.
3. Customization for Deep Analysis
  - Since these graphics are used internally, they do not need to be visually perfect.
  - Analysts may use interactive tools like Python’s Matplotlib or Tableau to explore relationships dynamically.
4. Short Life Span
  - Unlike presentation graphics, exploratory graphs are not meant for final reports.
  - They are experimental and can be discarded once insights are extracted.
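
The point about multiple views of the same data can be illustrated with the numbers that sit behind a histogram and a box plot. The sketch below is an assumption-laden example: the sales figures are invented, the bin width of 5 is arbitrary, and the outlier rule is the common 1.5 × IQR convention rather than anything prescribed by these notes.

```python
import statistics

# Hypothetical daily sales figures, invented for illustration.
sales = [12, 15, 14, 13, 16, 15, 14, 17, 13, 15, 14, 16, 48]

def histogram_counts(data, bin_width):
    """Count observations per bin; each bin starts at a multiple of bin_width."""
    counts = {}
    for x in data:
        start = (x // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

def boxplot_outliers(data):
    """Return the points lying more than 1.5 * IQR beyond the quartiles,
    which is the usual convention for drawing outliers on a box plot."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

# View 1: the histogram shows where the bulk of the data lies.
print(histogram_counts(sales, bin_width=5))

# View 2: the box plot rule singles out the unusual observation.
print(boxplot_outliers(sales))
```

The histogram counts show the shape of the distribution, while the box plot rule draws attention to the single extreme value; neither view alone tells the whole story.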

2.2 Background

History
Data graphics have a long history, but most experts believe they truly began with Playfair
over 200 years ago. He created basic plots like bar charts and histograms and made visually
appealing displays. Wainer and Spence recently republished some of his work. While not all
of his graphics were good, many were. In the late 19th century, Minard created notable
graphics, including a famous chart of Napoleon's movements during the Moscow
campaign. The French Ministry of Public Works used his ideas in annual publications from
1879 to 1899 to present economic data geographically for France. In the early 20th century,
graphics were less used in statistics, although Fisher’s 1925 book highlighted their
importance. The 1920s and early 1930s saw Neurath’s group creating pictograms that
influenced modern infographics. Early computers limited graphic quality, but
improvements in hardware and software have since made it easier to create better graphics,
highlighting the need for quality in graphic presentation.

Literature

Several authors have written great books on making good statistical graphics, with Edward Tufte
being the most well-known. His books include many excellent examples and some poor ones,
explaining how to properly represent data. Tufte criticizes bad decoration and data
misrepresentation but focuses mainly on representing data correctly. Another valuable source is
Cleveland's books, which also provide useful advice on data displays. While practical guides exist,
there's a need for theory to help understand practices in this field. Bertin’s work on graphical
semiology is often mentioned, but there's little theory developed since then. Examples are crucial in
this area, and Wainer's books offer helpful and engaging illustrations. Websites like Friendly's
Gallery and Ask E.T. also provide various examples and advice. Readers can find many poor
examples easily without searching too hard.

The Media and Graphics


Graphical displays of data are frequently seen in the press and are considered a type of infographic,
focusing on information visualization. The New York Times provides many notable examples.
While there are general guidelines for all infographics, this chapter focuses on creating data
visualizations. Media data displays summarize information, like political survey results or financial
trends. These simple displays are often decorated, which can sometimes highlight the information
or make it harder to understand. Misleading or flawed graphics are also common.

2.3 Presentation (What to Whom, How and Why)

Presenting simple statistical information can easily go wrong despite its apparent simplicity.
Distortions can occur due to misleading scales, 3-D displays complicating 2-D data
comparisons, non-proportional areas, and cluttered visuals making it hard to discern
information. Beyond technical issues, there are semantic challenges as well. A graphic's
message depends on its caption, headline, and accompanying article, which should ideally
align and support each other. However, discrepancies among these elements can lead to
confusion. Statisticians may have little control over headlines and accompanying articles if
they are not the primary authors.

In the media, graphics often highlight news items or enhance articles, sometimes created by
independent companies under tight deadlines, which can lead to awkward fits. Academic
publications, where authors produce graphics, should aim for careful and thorough
preparation to ensure context alignment.

The success of a graphic also relies on its subject, context, and aesthetics. Familiarity with
certain graphic types helps readers interpret them effectively; however, new forms may cause
misunderstandings. This issue extends beyond graphics to design in general, as creators often
find that their work is misinterpreted. Additionally, graphics may appear differently in print
versus on screens, and complex visuals work better in scientific articles than in brief TV
segments. Graphics explained by commentators differ from printed graphics, and interactive
web graphics allow for more information without cluttering the display.

2.4 Scientific Design Choices in Data Visualization

Plotting a single variable is usually easy. The type of variable affects the choice of graphic;
for example, use histograms or boxplots for continuous variables, and bar charts or
pie charts for categorical variables. Data transformation or aggregation depends on the data
distribution and graphic goals. Scaling and captioning should be simple but chosen
carefully.

Multivariate graphics are more complex. Displaying the joint distribution of two
categorical variables is challenging. The main decision is the type of display, with variable
choice and order being important. Generally, the dependent variable should be plotted last,
and in a scatterplot, it is traditional to place the dependent variable on the vertical axis.

Choice of Graphical Form


There are various types of data displays, including bar charts, pie charts, histograms, dot
plots, box plots, scatter plots, rose plots, and mosaic plots. The choice of display depends
on the data type and the information to be shown. A poor graph type choice is hard to fix
later, so it's essential to choose wisely from the start. However, there may not always be
one best option, and different displays can highlight various aspects of the same data. It's
also important not to just use default settings from software.

Graphical Display Options


Scales
Defining the scale for the axis of a categorical variable involves choosing an informative
order based on what the categories mean or their sizes. For continuous variables, it's more
challenging, as one has to select endpoints, divisions, and tick marks. It can be surprising
when reliable software creates a poor scale, as the right scale often seems obvious.
However, creating an algorithm to automatically determine scales reveals the complexity
of the task. Wilkinson, in the Grammar of Graphics, suggests properties that "nice" scales
should have, such as simplicity, granularity, and coverage, and proposes a possible algorithm.
While these properties are beneficial, the algorithm can be easily manipulated.

The goal is to develop a method that provides good results most of the time, while users
should verify and adjust the scales for their data. Scaling becomes difficult when data
cross natural boundaries: data ranging from 4 to 95 are easy to scale, whereas data from 4 to 101 are awkward.
Choosing scales that span from minimum to maximum can obscure boundary points, so it's
advisable to extend scales beyond observed limits and use easily understandable rounded
values. There is no strict requirement to include zero in a scale, but there should be a good
reason for excluding it to avoid misleading readers. Zero is a common baseline, but one
can also use other values like one or one hundred for financial data.
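
To see why automatic scaling is harder than it looks, consider the sketch below. It is NOT Wilkinson's algorithm; it is a simple "nice numbers" heuristic in the style of Heckbert's loose axis labelling, shown here only as an assumed stand-in. The function names and the default of six ticks are choices made for this illustration.

```python
import math

def nice_number(x, round_it):
    """Return a 'nice' number (1, 2, 5 or 10 times a power of ten) near x.
    With round_it=True the value is rounded; otherwise it is a ceiling."""
    exp = math.floor(math.log10(x))
    frac = x / 10 ** exp          # fraction between 1 and 10
    if round_it:
        if frac < 1.5:
            nice = 1
        elif frac < 3:
            nice = 2
        elif frac < 7:
            nice = 5
        else:
            nice = 10
    else:
        if frac <= 1:
            nice = 1
        elif frac <= 2:
            nice = 2
        elif frac <= 5:
            nice = 5
        else:
            nice = 10
    return nice * 10 ** exp

def nice_scale(lo, hi, max_ticks=6):
    """Return (start, end, step) for an axis that covers [lo, hi], lo < hi,
    extended outwards to rounded values."""
    span = nice_number(hi - lo, round_it=False)
    step = nice_number(span / (max_ticks - 1), round_it=True)
    start = math.floor(lo / step) * step
    end = math.ceil(hi / step) * step
    return start, end, step

def ticks(start, end, step):
    """List the tick positions from start to end inclusive."""
    count = round((end - start) / step)
    return [start + i * step for i in range(count + 1)]

print(nice_scale(4, 95))    # data that stay inside a natural boundary
print(nice_scale(4, 101))   # data that just cross 100
```

With this heuristic the data from 4 to 95 fit an axis from 0 to 100, but the single extra unit in the 4 to 101 case pushes the axis out to 120, leaving a wide empty band. This is the kind of result a user should check and adjust by hand.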

Figures illustrating the cumulative times in the Tour de France and histograms of the
Hidalgo stamp thickness data emphasize the importance of scale choices and labeling axes.
Good graphics should align axes properly and allow for meaningful comparisons across
different graphics. An appropriate number of labels is crucial: too many can create clutter,
while too few can hinder comprehension. Lastly, unnecessary tick marks can complicate
the visuals and obscure important data.

Sorting and Ordering

The effect of a display can be influenced by many factors. The order in which variables are
shown in graphics can make a difference. This is seen in various plots like parallel
coordinate plots and matrix visualizations. When dealing with nominal variables without
natural ordering, the plotting order matters significantly. It can be alphabetic, geographic,
or based on size or another variable. Two bar charts of the Titanic data show this, with
different ordering methods for the same data set.
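
The effect of ordering is easy to try out. The sketch below prints the same bar chart twice as text, once in alphabetical order and once ordered by size. The class counts are those commonly quoted for the Titanic data set and should be treated as illustrative here; the scale of one character per 25 people is an arbitrary assumption.

```python
# Counts of people on board by class, as commonly quoted for the
# Titanic data set (treated as illustrative values here).
counts = {"First": 325, "Second": 285, "Third": 706, "Crew": 885}

# Two possible orderings of the same nominal variable.
alphabetical = sorted(counts)
by_size = sorted(counts, key=counts.get, reverse=True)

def text_barchart(order):
    """Print one text bar per category, one '#' per 25 people."""
    for name in order:
        print(f"{name:<7}{'#' * (counts[name] // 25)} {counts[name]}")

print("Alphabetical order:")
text_barchart(alphabetical)
print("Ordered by size:")
text_barchart(by_size)
```

Ordering by size makes the ranking of the groups obvious at a glance, while alphabetical order makes it easier to look up a particular group.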

Adding Model or Statistical Information – Overlaying (Statistical) Information

Guides can be used on plots to highlight specific issues and indicate positive or negative
values. Sloping guides show deviations from a straight line. Fitted lines, like polynomial
regression, help illustrate overall patterns and local variations. For example, a figure showing
times from a 100-km road race indicates a linear relationship for faster runners and a flat
one for slower ones.

In scientific journals, it is common to plot point estimates with their confidence intervals,
usually at 95%. Another figure demonstrates the deterioration of thin plastic over time, with
high variability in results. Overlapping information can clutter displays, and adjustments may
be needed for clear presentation, especially when preparing for publication.
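
The point estimate and interval drawn on such a plot can be computed as sketched below. This is a minimal example that assumes a normal approximation (the multiplier 1.96); for small samples a t-based multiplier would usually be preferred. The measurements are invented for illustration.

```python
import math
import statistics

def mean_ci95(sample):
    """Return (mean, lower, upper) for an approximate 95% confidence
    interval of the mean, using the normal approximation."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error
    half_width = 1.96 * se
    return m, m - half_width, m + half_width

# Hypothetical repeated measurements, invented for illustration.
measurements = [9.8, 10.2, 10.0, 10.4, 9.6]

m, lower, upper = mean_ci95(measurements)
print(f"estimate {m:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

On a plot, the estimate would be drawn as a point and the interval as a vertical bar through it; when many such bars overlap, the display can become cluttered, as noted above.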

Captions, Legends and Annotations

Captions should clearly explain the graphic they support and include the data source, as
relying on surrounding text is often ineffective. While ideal captions can be long and
detailed, they may discourage readers. A balance where the caption summarizes the
graphic, with detailed explanations in the text, is often a better approach.
Legends should indicate which symbols and colors represent data groups directly on the
plot to avoid unnecessary eye movement.
Annotations highlight specific features but should be used sparingly due to space
limitations. The union estimates of protest turnout are larger than police estimates, except
in Marseille and Paris, where the differences are significant.

Positioning in Text
Keeping graphics and text on the same page or on facing pages is helpful for practical
reasons. It's inconvenient to flip pages back and forth when graphics and related text are on
different pages. However, it is not always possible to avoid this. Where graphics are placed
on a page is a design issue.

Size, Frames and Aspect Ratio


Graphics should be large enough for clear visibility but not excessively so, depending on
the layout.
Frames can be used to separate graphics from each other or text, but they should be
avoided if they create clutter.
Aspect ratios greatly impact how graphics are perceived. To make a change look gradual, stretch
the horizontal axis and compress the vertical axis; doing the opposite makes the same change look
dramatic. Useful advice on aspect ratios is found in Cleveland (1994).
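
One heuristic often associated with Cleveland's work is "banking to 45 degrees": choose the aspect ratio so that the line segments of a curve are, on average, drawn at about 45 degrees. The notes do not say which advice from Cleveland (1994) is meant, so the sketch below is only an assumed illustration, using the median absolute slope as the summary.

```python
import statistics

def bank_to_45(xs, ys):
    """Return an aspect ratio (height / width) such that the median
    absolute slope of the plotted line segments is 1 on the page,
    i.e. a typical segment is drawn at 45 degrees."""
    x_range = max(xs) - min(xs)
    y_range = max(ys) - min(ys)
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:])
        if x1 != x0
    ]
    return y_range / (x_range * statistics.median(slopes))

# A straight line is best shown in a square frame ...
straight = bank_to_45([0, 1, 2, 3, 4], [0, 2, 4, 6, 8])

# ... while a rapidly oscillating series needs a wide, flat frame.
zigzag = bank_to_45([0, 1, 2, 3, 4], [0, 10, 0, 10, 0])

print(straight, zigzag)
```

A result of 1 suggests a square plot, and values well below 1 suggest a plot that is much wider than it is tall.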

Colour
Colour is a powerful way to display data, but it is hard to use correctly. A useful tool for
choosing colour schemes for maps is ColorBrewer by Cynthia Brewer, found at
http://colorbrewer.org. Factors to consider include colour blindness, associations with
colours, print reproduction issues, and personal preferences.

2.5 Higher-dimensional Displays and Special Structures

Scatterplot Matrices (Sploms)

- Used for visualizing pairwise relationships between multiple continuous variables.
- Helpful for small datasets to identify correlations.
- Example: a scatterplot matrix of five variables from a car emissions dataset (engine size, fuel consumption, CO2 emissions, etc.).

Figure 2.10. A scatterplot matrix of the five main continuous variables from a car emissions dataset

- Reveals linear relationships between engine size, fuel consumption, and CO2 emissions.
- Diagonal elements contain variable names; off-diagonal elements show pairwise scatterplots.
- Helps identify patterns and correlations efficiently.

Parallel Coordinates

- Used to display high-dimensional data in a 2D space.
- Helps compare multiple continuous variables simultaneously.
- Example (Figure 2.11): the cumulative times of 147 cyclists in the 2004 Tour de France.
- Represents cumulative times of riders for all 21 stages.
- Vertical lines indicate stages, and lines show individual rider progress.
- Axes are aligned at their means, making it easier to compare performances across stages.
- The plot reveals time differences increasing in mountain stages and fewer line crossings in later stages, indicating stable rankings.

Mosaic Plots

- Used for visualizing counts in multivariate contingency tables.
- Different types exist, including the doubledecker plot (a 5-D example is shown in Figure 2.12).
- Helps analyze categorical data relationships by varying width and color to represent proportions.
- The given example studies patterns of arrest based on 5,226 cases in Toronto using factors like Gender, Employment, Citizenship, and Color.

Small Multiples and Trellis Displays

- Instead of a single large plot, smaller comparable plots are used.


- Helps in subgroup analysis and better data comparison.
Figure 2.12. A doubledecker plot of Toronto arrest data. Source: Fox (2003)

- A doubledecker plot of the Toronto arrest data.


- Helps in analyzing categorical data distributions.

(Boxplots of Fuel Consumption)

- Compares fuel consumption by engine type in Germany.


- Shows distribution, median, and variation across different engine types.
(Trellis Display of Car Emissions)

- A trellis plot showing car emissions data from Germany.


- Each panel represents a subset of the data, helping in detailed comparison.
Time Series and Maps

- Time Series: Focus on temporal data ordering, such as tracking measurements over time.

- Maps: Used for geographical data visualization, showing spatial patterns effectively.
(Time Series Data)

- Displays time series plots, highlighting the importance of proper time alignment.

Figure 2.16. Cancer mortality rates for white males in the USA between 1970 and 1994 by
State Economic Area (SEA). The scale has been chosen so that each interval contains 10% of
the SEAs.

- Map of cancer mortality rates in the US (1970 to 1994).

- Shows regional patterns, with higher rates in the East and lower in the Midwest.
3. STATIC GRAPHS

Static graphics in data visualization refer to images or charts that present data in a fixed,
unchanging form. These visualizations do not allow for user interaction or real-time updates;
they are simply representations of data at a particular point in time. Static graphics are
typically used in reports, printed materials, presentations, and other contexts where
interactivity is not required.
Here are some common types of static graphics used in data visualization:
1. Bar Charts

- Purpose: Display and compare quantities of different categories.
- Usage: Ideal for showing discrete data and making comparisons across categories.
- Example: Comparing sales across different regions or products.
2. Line Charts

- Purpose: Show trends or changes over time.
- Usage: Often used in time series data to observe patterns or fluctuations.
- Example: Tracking stock prices or temperature changes over days, months, or years.
3. Pie Charts

- Purpose: Show parts of a whole.
- Usage: Best for representing percentage breakdowns of a dataset.
- Example: Distribution of market share among different companies.
4. Histograms

- Purpose: Show the distribution of numerical data.
- Usage: Used to visualize the frequency of data points within certain ranges (bins).
- Example: Distribution of test scores for a class.
5. Scatter Plots

- Purpose: Display relationships between two numerical variables.
- Usage: Used to identify patterns, correlations, or trends between variables.
- Example: Comparing height vs. weight for a group of individuals.
6. Box Plots

- Purpose: Summarize the distribution of a dataset through its quartiles.
- Usage: Good for identifying outliers, variability, and central tendency in data.
- Example: Showing the distribution of test scores for multiple groups.
7. Venn Diagrams

- Purpose: Show relationships between sets.
- Usage: Used to highlight commonalities or differences between different sets.
- Example: Showing the overlap between users of two different software tools.
Benefits of Static Graphics:
- Simplicity: Easy to design and understand without the need for interactive elements.
- Clarity: They can provide a clear, focused message for communicating data insights
quickly.
- Print-Friendly: Can be easily included in reports, publications, and other printed
materials.
- Consistency: The message remains unchanged regardless of user interaction.
The Grammar of Graphics:
A comprehensive overview of statistical graphics is provided by Wilkinson's Grammar of
Graphics. Wilkinson outlines a system in which statistical graphics are described in a
high-level, abstract language and which encompasses more than just static graphical
displays. This chapter provides a different view, where statistical graphics is seen as
an extension of a general graphics language like PostScript (Adobe Systems Inc.) or SVG
(Ferraiolo et al.). This view is lower level, more explicit about the basic graphical
elements which are drawn, and more focused on static graphics. To emphasize the difference,
consider a simple barplot of birth rate for three different types of government. A Grammar
of Graphics description of the barplot would be a statement of the following form (from the
Grammar of Graphics, 1st edn., Wilkinson):

FRAME: gov*birth
GRAPH: bar()

A description consistent with this chapter would involve a description of the coordinate
systems and graphical shapes that make up the plot. For example, the barplot consists of a
plotting region and several graphical elements. The plotting region is positioned to provide
margins for axes and has scales appropriate to the range of the data. The graphical elements
consist of axes drawn on the edges of the plotting region, plus three rectangles drawn
relative to the scales within the plotting region, with the height of the rectangles based
on the birth rate data.

3.1 Complete Plots in Data Visualization

A complete plot in data visualization is a comprehensive visual representation of data,
typically providing the viewer with clear and detailed insights into the structure, distribution,
and relationships within a dataset. It consists of several components designed to ensure that
the information is conveyed accurately and effectively. In addition to the data itself, a
well-designed plot includes labels, axes, legends, titles, and sometimes annotations, all of
which help to provide context and interpretation.

Key Components of a Complete Plot

1. Data Representation:
o This is the core of the plot and refers to the graphical representation of the
dataset itself. The data can be displayed in various forms like points, lines,
bars, or areas, depending on the type of plot.
2. Axes:
o X-axis (horizontal): Represents one dimension of the data, usually the
independent variable or time.
o Y-axis (vertical): Represents another dimension, often the dependent
variable or measured outcomes.
o Axis Labels: The labels on the axes should clearly indicate what each axis
represents. These labels should include units of measurement where
applicable (e.g., "Time (days)" or "Temperature (°C)").
3. Title:
o A clear, concise title should describe what the plot is showing. This helps the
viewer immediately understand the context or focus of the data.
4. Legend:
o Example: In a line chart with multiple lines, each line should be labeled in the
legend with its corresponding category or group.
5. Grid Lines:
o Optional but helpful, grid lines allow viewers to visually track the values
of data points across the plot. This is particularly useful for line charts,
bar charts, and scatter plots.
6. Data Points / Markers:
o The actual data points (dots, bars, lines) that represent the values being
plotted. Depending on the chart, these markers might have different shapes,
sizes, or colors to distinguish between different groups or categories.
7. Annotations:
o Annotations can be added to plots to highlight specific data points, trends, or
areas of interest. This could be in the form of arrows, text labels, or other
indicators.

-> Sensible defaults:

Sensible defaults in data visualization are pre-configured settings that ensure a plot is
clear, readable, and aesthetically pleasing without the user having to adjust each detail.
These defaults aim to simplify the process of creating effective visualizations. Key sensible
defaults include:
1. Axis Labels and Ticks: Automatically labeled based on data columns.

2. Gridlines: Light gridlines help align data with the axis.

3. Color Palette: Default colors that are distinguishable and accessible.

4. Plot Title: Automatically generated, often based on data context.

5. Legends: Automatically added for multi-series plots.

6. Axis Scales: Default to linear scales unless data suggests otherwise.

7. Data Markers: Default markers (dots, lines) for clarity.

8. Bar Width/Spacing: Set to ensure bars are readable.

9. Date Formatting: Default date format for time-series data.

10. Tick Intervals: Sensibly spaced to avoid clutter.
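
The effect of defaults can be seen in a minimal R sketch. The data below are made up for illustration. Note that base R applies only some of the defaults listed above (axis labels, tick intervals, linear scales, default markers); a title and gridlines are not added automatically, so the second call supplies them explicitly.

```r
# Illustrative data (invented for this sketch)
days <- 1:10
temp <- c(12, 14, 13, 17, 19, 18, 21, 22, 20, 23)

# 1. Rely entirely on defaults: axis labels are taken from the
#    expressions "days" and "temp"; ticks and limits come from the data.
plot(days, temp)

# 2. Override only what the defaults cannot know (units, title).
plot(days, temp, type = "b", pch = 16,
     xlab = "Time (days)", ylab = "Temperature (deg C)",
     main = "Daily temperature")
grid(col = "lightgrey")   # light gridlines, added after the data is drawn

# Tick positions that R's default algorithm suggests for this data range
ticks <- pretty(range(temp))
print(ticks)
```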


-> User interface:
A sometimes controversial aspect of statistical graphics software is the user interface. The
choice is between a command line, where the user must type textual commands (or function
calls), and a graphical user interface (GUI), consisting of menus and dialogue boxes. A batch
system is considered to be a command-line interface; the important point is that the user has
to do everything by typing on the keyboard rather than by pointing and clicking with a
mouse. Often both a command line and a GUI will be offered. The interface to a piece of
software is conceptually orthogonal to the set of features that the software provides, which
is our main focus here. Nevertheless, in each section of this chapter we will briefly discuss
the user interface because there are situations where the interface has a significant impact
on the accessibility of certain features. For the purpose of producing complete plots, the
choice of user interface is not very important. Where one system might have an option on a
GUI menu to produce a histogram, another system can have a command or function to do the
same thing.

3.2 Customization

Let us assume that your statistical software allows you to produce a complete plot from a
single command and that it provides sensible defaults for the positioning and appearance of
the plot. It is still quite unlikely that the plot you end up with will be exactly what you
want. For example, you may want a different scale on the axes, or the tick marks in
different positions, or no axes at all. After being able to draw something, the next most
important feature of statistical graphics software is the ability to control what gets drawn
and how it gets drawn.

-> Setting Parameters


For any output, you need to specify several free parameters. For example, to draw a line,
you must choose where it starts and ends. Even for a simple line, you need to provide the
color, line style (like dashed or solid), thickness, and how the ends should look (rounded or
square).

When creating plots, the parameters become more complex. For instance, when drawing an
axis, one parameter might set the number of tick marks, and another might set the axis
label text. An important parameter for a complete plot is the data to display, along with
options for showing axes and legends.

In R, each graphics function has specific parameters for controlling output. For example,
you can create a plot without axes and labels and add a line with specific controls for its
position, color, type, and width.

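> # Draw the points with no axes and no annotation (titles and axis labels)
> plot(1:10, axes=FALSE, ann=FALSE)
> # Overlay a thick, dashed, red line through the same points
> lines(1:10, col="red", lty="dashed", lwd=3)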

Graphical Parameters

A common set of graphical parameters can be used to change how graphical outputs look.
This includes line color, fill color, line width, and line style. These parameters are similar
to the graphics state in the PostScript language. For complete control over the appearance
of graphical outputs, statistical graphics software needs to provide a full set of graphical
parameters. Some often-overlooked parameters are semitransparent colors, line joins and
endings, and access to various fonts. Edward Tufte suggests using professional graphics
software like Adobe Illustrator for high quality, but better results can be achieved with
control within the statistical graphics software itself.

In R, there are many graphical parameters to control aspects like colors, line types, and
fonts, which can be further expanded to include basic drawing parameters found in
advanced graphics languages such as SVG. Examples include gradient fills and general
pattern fills. Composition of graphical elements can be illustrated by adding a legend to a
plot, with both having transparent backgrounds while the plot has grid lines. If we do not
want grid lines behind the legend, one way is to have the legend output fully cover the plot
output where drawn.

Drawing the legend on top of the plot can lead to an undesired result where grid lines are
visible behind it. Using an opaque background for the legend can work, but it relies on
knowing the plot's background color. A better solution involves more complex image
manipulations, such as negating the alpha channel of the legend. This process allows the
legend to blend with the plot without showing unwanted grid lines behind it, regardless of
the plot’s final background color.
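
The simpler of the two approaches above (an opaque legend background) can be sketched in base R, together with a semitransparent fill. This is only an illustration with made-up data; it does not implement the alpha-channel manipulation described in the text, and the legend background colour "white" is an assumption that must match the plot background by hand.

```r
x <- 1:10
plot(x, x^2, type = "n", xlab = "x", ylab = "value")
grid(col = "grey70", lty = "solid")          # grid lines in the plot region
lines(x, x^2, col = "red", lwd = 2)
lines(x, 10 * x, col = "blue", lwd = 2, lty = "dashed")

# Semitransparent fill: alpha = 0.3 lets the grid lines show through
band <- rgb(1, 0, 0, alpha = 0.3)
polygon(c(x, rev(x)), c(x^2, rev(10 * x)), col = band, border = NA)

# Opaque legend background hides the grid lines behind the legend,
# but only looks right if "white" matches the plot background.
legend("topleft", legend = c("x^2", "10x"),
       col = c("red", "blue"), lty = c("solid", "dashed"), lwd = 2,
       bg = "white")
```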
-> Arranging Plots

When creating multiple plots on a page, new free parameters related to their location and size
become available. It's crucial for statistical graphics software to allow specifying arrangements for
these plots. In R, one can easily create an array of equally sized plots and different sized
arrangements using specific code.
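
A minimal sketch of both kinds of arrangement in base R follows: par(mfrow=) for an array of equally sized plots and layout() for plots of different sizes. The plotted values are invented for illustration.

```r
# Equal sizes: a 2 x 2 array of plots, filled row by row.
# par() returns the previous settings so they can be restored.
old <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) plot(1:10, (1:10)^i, main = paste("Power", i))
par(old)

# Unequal sizes: one wide plot on top, two smaller plots below.
# The matrix entries say which plot number occupies each cell.
m <- matrix(c(1, 1,
              2, 3), nrow = 2, byrow = TRUE)
n_plots <- layout(m, heights = c(2, 1))   # top row twice as tall
vals <- c(2, 3, 3, 4, 4, 4, 5, 5, 6)
plot(1:10, main = "Wide plot")
hist(vals, main = "Histogram")
boxplot(vals, main = "Boxplot")
layout(1)   # reset to a single plot per page
```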

-> Annotation

A more complex customization adds more graphical output to a plot, like informative labels for
data symbols.

Graphical Primitives

The first requirement for producing annotations is the ability to create basic graphical
output, like text labels. Statistical graphics software should function like a drawing
program, allowing users to draw lines, rectangles, and text. In R, functions exist for
standard graphical elements. Example code shows how to add shapes and text to a plot.

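> # Simulate 20 observations and plot them against their index
> x <- rnorm(20)
> plot(x)
> # Grey area between the data and the horizontal line y = 0
> polygon(c(1, 1:20, 20), c(0, x, 0),
col="grey", border=NA)
> # White "within control" band from -0.5 to 0.5 with a dotted border
> rect(1, -0.5, 20, 0.5,
col="white", lty="dotted")
> # Redraw the data on top of the shapes
> lines(x)
> points(x, pch=16)
> # Vertical text labels just outside each end of the band
> text(c(0.7, 20.3), 0, c("within", "control"), srt=90)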
Some slightly more complex shapes (not natively supported by R when the source text was
written; later versions of R added xspline() and polypath()) are spline curves, arbitrary
paths (like in PostScript or SVG), and polygons with holes. An example of a polygon with a
hole is an island within a lake within an island, where both islands are part of the same
country or state and can be represented as a single polygon.

Coordinate Systems

Statistical graphics software is unique because it can create multiple graphical outputs at
once and position them in different coordinate systems. For example, the title of a plot can
be placed halfway across the page using a normalized coordinate system for the whole
page, while data symbols in a scatterplot are positioned based on the range of the data
within the plot area. Axis labels can also be centered along an axis using a normalized
coordinate system limited to the plot axes.

Users often export their plots for editing in third-party software, like Microsoft Office, but
this loses the original coordinate systems, making it harder to add annotations accurately.
Annotations should be placed relative to the original plot's coordinate systems to maintain
proper positioning.

In R, the traditional graphics functions only work with one coordinate system at a time,
such as the text() function for axis scales and the mtext() function for plot margins. R’s
grid package offers a more flexible method for handling multiple coordinate systems,
which allows for more precise control over adding output to plots.
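
The difference between the two traditional coordinate systems can be sketched as follows, using made-up data. text() works in the data ("user") coordinates, mtext() works in margin lines, and grconvertX()/grconvertY() are used here only to show how a data location maps onto normalized coordinates for the whole page.

```r
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)
plot(x, y, pch = 16)

# text() positions output in the data ("user") coordinate system
text(4, 3, "point (4, 3)", pos = 4)

# mtext() positions output in the margins, measured in lines of text
mtext("note in the top margin", side = 3, line = 1)

# Convert a data location to normalized device coordinates
# (0 to 1 across the whole page)
ndc_x <- grconvertX(4, from = "user", to = "ndc")
ndc_y <- grconvertY(3, from = "user", to = "ndc")
print(c(ndc_x, ndc_y))
```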

-> The User Interface

The user interface for controlling graphical output can use either a command line or a
graphical user interface (GUI). In a command-line setup, function calls include arguments
for each control parameter, while GUIs offer dialog boxes filled with options. A challenge
in statistical graphics is the many parameters needed for complex elements, like a matrix of
scatterplots, where each element requires specific control. Using a mouse to select plot
components is intuitive but can lead to confusion due to the hierarchical nature of plots.
Command lines allow for more precision, such as expressing axis text with specific paths.
Additionally, in GUIs, repeating editing actions across different plots is difficult, whereas
command lines can easily capture and repeat operations.

3.3 Extensibility

The ability to create complete plots and manage their appearance is a basic requirement for
statistical graphics software. A more advanced feature is the option to add new types of
plots. Creating a new plot differs from customization because it starts from scratch and can
be shared with others as a new function or menu item. Essential features needed to develop
new plots include the ability for users to add functions or menu items, access to low-level
building blocks, and support for combining these blocks into larger graphical elements.

-> Building Blocks

What are the fundamental building blocks from which plots are made? At the lowest level, a
plot is simply basic graphical shapes and text, so these must be available. In addition, there
must be some way to define coordinate systems so that graphical elements can be
conveniently positioned in sensible locations to make up a plot. Surprisingly, that's about it.
Given the ability to draw shapes and locate them conveniently, you can produce a huge
variety of results. Controlling coordinate systems is a special case of being able to define
arbitrary transformations on output, such as is provided by the current transformation matrix
in PostScript or transform attributes on group elements in SVG. We have already seen that R
provides basic graphical elements such as lines and text. R also provides ways to control
coordinate systems; this discussion will focus on the features provided by the grid system
because they are more flexible. The grid system in R provides the concept of a 'viewport',
which represents a rectangular region on the page and contains several different coordinate
systems. Viewports can be nested (positioned within each other) to produce quite complex
arrangements of regions.
First of all, we load the grid package, start a new page, and create a region centred on
the page, but only 80% as wide and high as the page.
> library(grid)
> grid.newpage()
> pushViewport(viewport(width=0.8, height=0.8,
xscale=c(0, 3), yscale=c(0, 10)))
This now is where drawing occurs, so rectangles and axes are drawn relative to this
viewport.
> grid.rect(gp=gpar(fill="light grey"))
> grid.xaxis(at=1:2, gp=gpar(cex=0.5))
> grid.yaxis(gp=gpar(cex=0.5))
Now we define a new viewport, which is located at (1, 4) relative to the axis scales
of the first viewport. This also demonstrates the idea of multiple coordinate systems;
the width and height of this new viewport are specified in terms of absolute units,
rather than relative to the axis scales of the previous viewport.
> pushViewport(viewport(unit(1, "native"),
unit(4, "native"),
width=unit(1, "cm"),
height=unit(1, "inches")))
We draw a rectangle around this new viewport and then draw the word 'thermometer'.
> grid.rect(gp=gpar(fill="white"))
> grid.text("thermometer",
y=0, just="left", rot=90)
We create yet another viewport, which is just the bottom 30% of the second viewport,
and draw a filled rectangle within that.
> pushViewport(viewport(height=0.3, y=0,
just="bottom"))
> grid.rect(gp=gpar(fill="black"))
Finally, we create a viewport in exactly the same location as the third viewport, but
this time with clipping turned on; when we draw the word 'thermometer' again in white,
it is only drawn within the filled black rectangle.
> pushViewport(viewport(clip=TRUE))
> grid.text("thermometer",
y=0, just="left", rot=90,
gp=gpar(col="white"))
Graphical Layout

The production of a complex plot requires placing different parts within various coordinate
systems. The output arrangement in a coordinate system is clear, like drawing a data
symbol at a specific spot and size. However, the way coordinate systems or plots are
arranged relative to each other is less clear, often organized in rows and columns, similar to
typesetting in LATEX or HTML. For a statistical graphics system, defining these implicit
arrangements is helpful.

In R's grid package, there is a concept called a 'layout', which divides a region into rows
and columns with adjustable sizes. A viewport can be positioned by naming a cell of this
layout instead of giving exact coordinates. For instance, a viewport can be placed in the
central cell of a layout with equal margins, and a further viewport can then be positioned
within it; nesting viewports in this way allows for complex implicit arrangements of
graphical elements in R.
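
A minimal sketch of this idea with the grid package is shown below. The 1 cm margins, the viewport names "centre" and "inner", and the fill colours are assumptions chosen for illustration.

```r
library(grid)
grid.newpage()

# 3 x 3 layout: fixed 1 cm margins around a flexible central cell.
# "null" units share out whatever space remains.
lay <- grid.layout(nrow = 3, ncol = 3,
                   widths  = unit(c(1, 1, 1), c("cm", "null", "cm")),
                   heights = unit(c(1, 1, 1), c("cm", "null", "cm")))
pushViewport(viewport(layout = lay))

# Position a viewport by layout cell rather than by coordinates
pushViewport(viewport(layout.pos.row = 2, layout.pos.col = 2,
                      name = "centre"))
grid.rect(gp = gpar(fill = "light grey"))

# Nest a further viewport inside the central cell
pushViewport(viewport(width = 0.5, height = 0.5, name = "inner"))
grid.rect(gp = gpar(fill = "white"))
grid.text("inner")
vp_name <- current.viewport()$name   # viewport currently being drawn in

popViewport(3)   # return to the top level
```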

Transformations in Statistical Graphics

An important difference between transformations in a general graphics language and
transformations in statistical software is that statistical software does not apply
transformations to all output. In statistical graphics, transformations affect the locations
and sizes of shapes but text is sized separately to keep it readable.
-> Combining Graphical Elements

A statistical graphics system should let users create basic graphic shapes and position them
flexibly. Users should also be able to record a series of drawing operations in a function for others
to use.
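
As a sketch of recording drawing operations in a function, the thermometer shape built from viewports earlier can be wrapped up for reuse. The function name thermometer() and its arguments are hypothetical, chosen only for this illustration.

```r
library(grid)

# Hypothetical helper (name and arguments are assumptions for this sketch):
# draws a simple "thermometer" whose black fill covers a given fraction.
thermometer <- function(level, x = 0.5, y = 0.5,
                        width = unit(1, "cm"), height = unit(1, "inches")) {
  stopifnot(level >= 0, level <= 1)
  pushViewport(viewport(x = x, y = y, width = width, height = height))
  grid.rect(gp = gpar(fill = "white"))                # outline
  grid.rect(y = 0, height = level, just = "bottom",   # filled part
            gp = gpar(fill = "black"))
  popViewport()
  invisible(level)
}

# Reuse the same drawing operations at three positions across the page
grid.newpage()
fills <- c(0.2, 0.5, 0.9)
for (i in seq_along(fills)) {
  thermometer(fills[i], x = i / 4)
}
```

Because the function pushes and pops its own viewport, it leaves the drawing context as it found it, so it can be called from inside other viewports or other functions.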

-> The User Interface

Extending a system relies heavily on the user interface. An extensible system usually
needs a language for creating new graphics, meaning users must write code to make
extensions. While graphical programming interfaces are possible, using a command line is
more powerful and flexible. A minimum requirement for a GUI is to record code for actions.
Ideally, the extension language should match the system's development language, with
scripting languages like R or Python being favored for ease of use.

3.4 Other Issues


This section briefly covers several further issues related to static graphics that are
treated in more detail elsewhere.

-> 3-D Plots

Static 3-D plots are not very useful because it's hard to see 3-D structures without movement.
However, 3-D images are important, like for visualizing a model's prediction surface. R can
draw 3-D surfaces simply with the persp() function, while the rgl add-on package offers
access to the OpenGL 3-D graphics system.
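
A minimal persp() sketch is given below. The surface is an invented mathematical function standing in for a model's prediction surface, and the viewing angles are arbitrary choices.

```r
# Illustrative surface: z = sin(r) / r over a regular grid
x <- seq(-10, 10, length.out = 40)
y <- x
f <- function(x, y) {
  r <- sqrt(x^2 + y^2)
  ifelse(r == 0, 1, sin(r) / r)   # guard against division by zero
}
z <- outer(x, y, f)               # 40 x 40 matrix of heights

# theta and phi set the viewing direction; shade adds simple lighting.
# persp() returns the viewing transformation matrix invisibly.
vt <- persp(x, y, z, theta = 30, phi = 30, expand = 0.5,
            col = "lightblue", shade = 0.5,
            xlab = "x", ylab = "y", zlab = "z")
```

Redrawing with a few different values of theta is a simple way to compensate for the lack of movement in a static 3-D plot.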

-> Speed

In dynamic and interactive statistical graphics, speed is crucial: drawing must be fast
enough to allow real-time updates as users change settings. In static graphics, speed is less
important; achieving the desired result takes priority over the time it takes, and a plot
that takes several seconds to draw is acceptable. This difference in priorities also affects
how the software is written. In R, much of the graphics code is written in interpreted R
code, which is slower but lets users see, modify, and write their own code. Speed still
matters at the extremes, because a single plot with very many observations can take a long
time to draw and batch jobs may produce many plots. Complex plots in R, like Trellis plots,
may also be slow, but users generally find this acceptable; generating all the figures for a
medium-sized book can take under a minute.

-> Output Formats

When creating plots for reports, different formats are needed based on the report type. For
printed reports, PostScript or PDF formats are best, while PNG is preferred for web
publication. There are many software options available to convert graphic formats, making
it easier to produce the required output without needing statistical graphics software to
support all formats.

Nonetheless, it is valuable for statistical graphics software to support various formats
directly. Doing so lets the software make up for features missing from a
lowest-common-denominator format; R, for instance, performs clipping itself for formats that
lack this feature. Furthermore, modern formats can offer advanced features like transparency
and animation that simpler formats do not support.

Additionally, saving the original code used to create plots, like R code, is recommended
along with traditional formats. This allows for easier modifications and better manipulation
of plots compared to editing PDF or PostScript versions.
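
Keeping the plotting code in a function makes it easy to regenerate the same plot in several formats, as in the following sketch. The function name draw_plot() and the data are invented for illustration, and temporary files are used so that nothing is left behind. Only the PDF and PostScript devices are shown; a png() call for web output follows the same open, draw, close pattern where that device is available.

```r
# Hypothetical plotting function: the "original code" worth saving
draw_plot <- function() {
  plot(1:10, (1:10)^2, type = "b", xlab = "x", ylab = "x squared")
}

pdf_file <- tempfile(fileext = ".pdf")
ps_file  <- tempfile(fileext = ".ps")

# Each device is opened, drawn to, and closed in turn
pdf(pdf_file, width = 6, height = 4)          # sizes in inches
draw_plot()
dev.off()

postscript(ps_file, width = 6, height = 4)
draw_plot()
dev.off()

print(file.exists(c(pdf_file, ps_file)))
```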

-> Data Handling

This discussion of statistical graphics software has focused mainly on how data is
presented rather than on where the data comes from. While this approach allows for more
graphic flexibility, it is important to recognize the need for tools to generate, import,
and analyze data, as data is essential for meaningful plots. Ideally, statistical graphics
should be part of a larger system with data-handling capabilities, like R.
