JavaScript and jQuery for Data Analysis and Visualization
Ebook, 845 pages, 5 hours


About this ebook

Go beyond design concepts—build dynamic data visualizations using JavaScript

JavaScript and jQuery for Data Analysis and Visualization goes beyond design concepts to show readers how to build dynamic, best-of-breed visualizations using JavaScript—the most popular language for web programming.

The authors show data analysts, developers, and web designers how they can put the power and flexibility of modern JavaScript libraries to work to analyze data and then present it using best-of-breed visualizations. They also demonstrate the use of each technique with real-world use cases, showing how to apply the appropriate JavaScript and jQuery libraries to achieve the desired visualization.

All of the key techniques and tools are explained in this full-color, step-by-step guide. The companion website includes all sample codes used to generate the visualizations in the book, data sets, and links to the libraries and other resources covered.

  • Go beyond basic design concepts and get a firm grasp of visualization approaches and techniques using JavaScript and jQuery
  • Discover detailed, step-by-step directions for building specific types of data visualizations in this full-color guide
  • Learn more about the core JavaScript and jQuery libraries that enable analysis and visualization
  • Find compelling stories in complex data, and create amazing visualizations cost-effectively

Let JavaScript and jQuery for Data Analysis and Visualization be the resource that guides you through the myriad strategies and solutions for combining analysis and visualization with stunning results.

Language: English
Publisher: Wiley
Release date: Nov 14, 2014
ISBN: 9781118847220

    JavaScript and jQuery for Data Analysis and Visualization - Jon Raasch

    PART I

    The Beauty of Numbers Made Visible

    Chapter 1: The World of Data Visualization

    Chapter 2: Working with the Essentials of Analysis

    Chapter 3: Building a Visualization Foundation

    Chapter 1

    The World of Data Visualization

    What's in This Chapter

    Overview of chart design options

    Comparison of different business applications for data visualization

    Rundown of technological advancements that have made data visualization what it is today

    When thinking about data visualization, it's hard to resist the comparison to natural metamorphosis. Consider raw data as the caterpillar: functional, multi-faceted, able to get from here to there, but a little ungainly and really appreciated only by a select few. After data is transformed via visualization, it becomes the butterfly: sleek, agile, and highly recognizable to the point of inspiring and evoking an emotional response. The world of data visualization is an ecosystem unto itself, constantly spawning new nodes of details that—under the proper nourishing conditions—evolve into relatable depictions that consolidate concepts into an understandable, and hopefully compelling, form.

    And where does the web professional fit in this metaphor? Why, they are the spinners and caretakers of the cocoon that transforms raw numbers into meaningful representation, of course. Putting the linguistic paraphrasing aside, web designers and developers are a vital component in visualizing data. Naturally, the current and evolving technological landscape has made this role possible—and increasingly efficient.

    Overall, JavaScript and jQuery for Data Analysis and Visualization serves as a practical field guide to the robust world of data visualization, from the acquisition and nurturing of data to its transfiguration into the optimal visual format. This chapter is intended to provide an overview of the present environment, highlighting its capabilities and limitations and discussing how you, the web professional, are a key player in visualizing data.

    Bringing Numbers to Life

    Appreciating numeric data can be a challenge. Data visualization with relational graphics and evocative imagery helps make raw data meaningful. But before you can transform the data into a meaningful representation, you have to get it first.

    Acquiring the Data

    The data sphere is enormous and growing dramatically, if not exponentially, every day. Data is streaming in from everywhere—and when you consider that the Mars Rover, Curiosity, continually sends its data findings back to Earth, you understand that everywhere is no exaggeration.

    With the tremendous amount of data already available, its acquisition is often just a matter of logistics. If the information is in a non-digital form—that is, written records—it will need to be transcribed into the proper format. Should the desired data be accessible digitally, it may need to be converted from its current structure to one compatible with the display or visualization application.

    When your information is in the proper format, you next need to ensure it is exactly the data you need and nothing more. The wealth of data available today makes targeting your data selection, typically through a process known as filtering, pretty much a requirement in all situations. Even when organizations fine-tune their data input from the beginning, changes in the sample or desired output over time will force a filtering adjustment.

    Why is it so important to restrict your data stream? One clear reason is processing efficiency. Working with an overload of unnecessary information increases application execution time—which corresponds directly to increased bandwidth and, thus, costs. Additionally, filtering makes raw data more meaningful. Focused information is easier to analyze and also more easily digested by end users.
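
    As a minimal sketch of filtering, assume the raw data arrives as an array of records. The field names (region, year, sales, rep), the sample values, and the function name filterForChart are all hypothetical and are not taken from the book's data sets. Keeping only the rows and fields a chart needs might look like this:

```javascript
// Hypothetical raw records; a real feed would typically carry many more fields.
const rawRecords = [
  { region: 'Northeast', year: 2013, sales: 100000, rep: 'A. Smith' },
  { region: 'Northeast', year: 2012, sales: 90000, rep: 'A. Smith' },
  { region: 'Southeast', year: 2013, sales: 75000, rep: 'B. Jones' },
  { region: 'Southwest', year: 2013, sales: null, rep: 'C. Lee' }
];

// Keep only rows for the requested year that have a numeric sales value,
// then strip each row down to the fields the visualization needs.
function filterForChart(records, year) {
  return records
    .filter(function (r) {
      return r.year === year && typeof r.sales === 'number';
    })
    .map(function (r) {
      return { region: r.region, sales: r.sales };
    });
}

const chartData = filterForChart(rawRecords, 2013);
console.log(chartData.length); // 2
```

    Filtering before rendering means less data is sent to and processed by the browser, which is the efficiency argument made above.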

    Visualizing the Data

    In a sense, the most difficult aspect of data visualization is deciding exactly how the information should be depicted. The web designer must select the optimum representation that communicates the data in the clearest, most desired manner with the highest degree of impact. More importantly, the representation should be a discovery tool that leads the user to meaningful insights. Here's an incomplete list of available formats:

    Area chart

    Bar chart

    Bubble chart

    Candlestick chart

    Gauge chart

    Geographic chart

    Heat map

    Hierarchical edge bundling

    Infographics

    Line chart

    Marimekko chart

    Network node map

    OHLC (open-high-low-close) chart

    We've really just scratched the surface with ways data can be presented. Most of these formats can be shown in either 2D or 3D. You can include interactive elements and animation to add dimensions to the data. But be careful to balance these bells and whistles with meaningful data. No amount of eye candy is worth compromising the representation of information.

    NOTE It's important to realize that a key factor in visualization is intent. Raw data on almost every subject can be interpreted in any number of ways. What message is intended to be communicated should be among the first decisions made when beginning the process of representing data visually.

    There are other primary options to consider as well. Do you expose the underlying data or not? If so, are the numbers always visible or are they visible only when some interaction occurs, such as when the viewer's mouse hovers over a data point? Is the initial visualization all there, or does the online version allow the user to drill down for more details? Is animation used to represent a dynamic change? Is there other interactivity available, such as horizontal scrolling along a timeline or zooming into it?

    Then, of course, there is styling. With simple bar and pie charts, you'll not only need to decide which colors represent which elements, but also the size, color, style, and font to be applied for labels and legends, if any—yet another choice. Many such selections will be governed by other factors, such as the creating organization's branding or in-house standards; however, just as many will have no such foundation to work from, and the designer's vision will become paramount.

    Moving beyond the basics of charting primitives, the visualization designer can choose to include graphics. Not only can background images frame a presentation—both literally and thematically—but symbols can be used as data points, like logos pinned in a map of third-quarter sales. An entire field of data visualization—infographics—is devoted to the combination of information and visual imagery.

    The truth is that the web professional's current options for depicting data are a bounty of riches. Although the possibilities may appear to be overwhelming, it's up to the visualization designer to identify the optimum representation and bring it into reality.

    Simultaneous Acquisition and Visualization

    The world of data visualization doesn't just consume existing data: New data is constantly being added to the stores, even in real time. Information can be collected directly through an HTML form on a website and incorporated into the representation programmatically. One of the most common examples of this is an online poll, such as the one shown in Figure 1.1. After a site visitor has chosen his or her desired response and clicked Vote, the current relative standing of all entries, including the one just entered, is displayed.

    Figure 1.1 Some polls allow the user to instantly see the current results.

    Source: www.dailykos.com/story/2014/08/18/1322337/-Cheers-and-Jeers-Monday

    Collecting live data has a number of challenges, but the recent advances made by the widespread acceptance of HTML5 have ameliorated many of them. When combined with a few key JavaScript libraries, it is now possible to use advanced form elements, such as slider controls, across the full spectrum of modern browsers.

    Acquiring the data in real time is just the first step. The web developer is also responsible for validating and standardizing the data. Validation is critical in two ways: first, to ensure that all required information is supplied, and second, to verify that the data is in the proper format. Naturally, if you're trying to find out where your clientele is based, you can't if the requested postal code is left blank. Likewise, if the postal code is in the wrong format, such as a four-digit entry for a U.S. address, the data is worthless. Both of these issues can be corrected by proper validation, whether handled on the client-side with JavaScript, server-side via PHP or another server language, or some combination of the two.
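
    A client-side version of both checks (presence and format) can be sketched as follows. The function name validateUsPostalCode and the shape of the returned object are assumptions made for illustration, and accepting the nine-digit ZIP+4 form is a choice made here rather than something the text requires:

```javascript
// Validate a U.S. postal code field: it must be present and must be
// five digits, optionally followed by a hyphen and four more (ZIP+4).
function validateUsPostalCode(input) {
  const value = String(input == null ? '' : input).trim();
  if (value === '') {
    return { valid: false, reason: 'required' };
  }
  if (!/^\d{5}(-\d{4})?$/.test(value)) {
    return { valid: false, reason: 'format' };
  }
  return { valid: true, value: value };
}

console.log(validateUsPostalCode('1234').reason); // format
console.log(validateUsPostalCode('12345').valid); // true
```

    Because client-side checks can be bypassed, the same rules would normally be repeated on the server.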

    Standardized data is just as important and typically applies to time and date details. There are numerous ways to enter a date: March 10, 2011 could be 03/10/11, 10/03/11, or 11/03/10 depending on whether you're in the United States, Australia, or China, respectively. To make sure the intended date is collected correctly, the entered information will need to be standardized to a format the visualization application recognizes before it is saved. Read Chapter 6 for more information about data validation.
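
    One way to standardize the three regional forms above is to convert each to ISO 8601 (YYYY-MM-DD) before saving. This is a sketch under stated assumptions: the function name toIsoDate and the order codes 'MDY', 'DMY', and 'YMD' are hypothetical, the separator is assumed to be a slash, and two-digit years are assumed to fall in the 2000s.

```javascript
// Convert a two-digit-year date such as "03/10/11" to ISO 8601 (YYYY-MM-DD).
// `order` names the regional field order: 'MDY', 'DMY', or 'YMD'.
// Assumption: two-digit years belong to the 2000s.
function toIsoDate(text, order) {
  const parts = String(text).trim().split('/');
  if (parts.length !== 3 || ['MDY', 'DMY', 'YMD'].indexOf(order) === -1) {
    return null;
  }
  const fields = {};
  for (let i = 0; i < 3; i++) {
    if (!/^\d{1,2}$/.test(parts[i])) return null;
    fields[order.charAt(i)] = parseInt(parts[i], 10);
  }
  const year = 2000 + fields.Y;
  // Let Date confirm that the day exists in that month (rejects 02/30, etc.).
  const check = new Date(Date.UTC(year, fields.M - 1, fields.D));
  if (check.getUTCMonth() !== fields.M - 1 || check.getUTCDate() !== fields.D) {
    return null;
  }
  const pad = function (n) { return (n < 10 ? '0' : '') + n; };
  return year + '-' + pad(fields.M) + '-' + pad(fields.D);
}

console.log(toIsoDate('03/10/11', 'MDY')); // 2011-03-10
console.log(toIsoDate('10/03/11', 'DMY')); // 2011-03-10
console.log(toIsoDate('11/03/10', 'YMD')); // 2011-03-10
```

    All three inputs resolve to the same stored value, which is the point of standardizing: the regional order has to be known at entry time, because the string alone is ambiguous.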

    Applications of Data Visualization

    So there's all this wonderful data out there, just waiting to be brought to life by this almost magical transformative process. But why should it? The question really is cui bono? Who benefits? In a sense, the answer is everyone. Whenever information is made clearer and more understandable, it's better for all. But the web professional doesn't get paid by everyone, so let's narrow the scope and focus on the key groups who stand the most to gain from data visualization.

    Uses in the Public Sector

    Groups in the public sector include all levels of government (those in it and those trying to get in it), as well as police, military, transportation agencies, and educational and healthcare facilities. Just a few folks, right? Oh, and let's add philanthropy and philanthropic projects, a.k.a. charities, into the mix, just for fun.

    All these organizations have a key interest in discovering what is happening (the data) and then conveying that information internally to others in their own group and/or externally to the broader public (the visual). Many such efforts are mandated and essential to the organization's existence. Take, for example, the U.S. census. The data is collected on a massive scale every 10 years—by law—and then impacts multiple facets of American life such as state and regional funding and, of course, congressional representation. The U.S. Census Bureau maintains a treasure trove of the aggregate data, now visually accessible to everyone through its online presence at www.census.gov. Not only are there government-sanctioned representations of the collected census information, like the map in Figure 1.2, but the site also makes APIs available (api.census.gov) for public web developer access.

    Figure 1.2 You'll need to request a no-charge digital key to access the APIs from api.census.gov.

    Business-to-Business and Intrabusiness Uses

    If the business of business is business, how do you do business? Mostly through marketing, whether you're a vendor targeting another company or one department lobbying internally for increased resources. And the heart of marketing is persuasion—which is often bolstered, if not solely accomplished, by making your case through the compelling presentation of data.

    As with the public sector, many such presentations are required. Look through any annual report to see the latest encapsulation of the company's standing, graphically depicted in quickly graspable charts. Today, creating an online report is standard practice. Similar data visualizations are undertaken daily in department and division meetings to plot sales progress, reveal public reaction to products, and adjust business direction.

    There are significant data visualization opportunities for the web designer within the business-to-business arena. Most of this type of work, like other website or intranet work, will be handled by an internal team. Cultivating such skills would definitely add value to any web professional's resume.

    Additionally, a wide variety of data visualizations are used internally within organizations. These tools help businesses grapple with and understand their own data.

    Business-to-Consumer Uses

    Obviously, marketing plays as big a role in the business-to-consumer realm as it does in business to business, if not more. Sharp, effective advertising, along with other forms of marketing, is pretty much required for a company's message to cut through the omnipresent media noise. Often a clearly defined representation of data can make the difference.

    Although there are plenty of uses for pie charts, stock charts and other fundamental data representations in business-to-consumer communications, infographics are seen far more frequently. Infographics combine data and information in a visually engaging manner. Sometimes, the data is represented straightforwardly, such as the percentage values shown in the infographic from HealthIT.gov (see Figure 1.3), or more graphically, as shown in the infographic from the CDC (see Figure 1.4).

    Figure 1.3 The icons in this infographic graphically reinforce the numeric percentages.

    Figure 1.4 Infographics are adept at combining highlighted key terms, such as urban areas and heat-related illnesses with numeric data, as shown in this infographic from the CDC.

    Infographics is a tremendously rich area with an almost endless range of possibilities; because of the openness of the format, it can be a designer's playground. To learn more about creating this particular type of data visualization, see Chapter 16.

    Web Professionals: In the Thick of It

    As noted in this chapter's introduction, web professionals are at the heart of data visualization. Consider that it first takes someone with web savvy to access and translate the data into a usable form. Then, if the data collection is to be ongoing, one or more forms have to be set up correctly online to make sure the needed data is acquired, valid, and—where necessary—standardized. Finally, someone with a working knowledge of browser-compatible languages must create the visual display of the data so that it can be viewed on the Internet.

    Control of Presentation

    Web professionals—across the spectrum of their functionality—are responsible for this growing sphere of communication. Let's break down the process from their perspective:

    A web developer with server-side skills is needed to handle the back-end processing of data to make it accessible.

    A JavaScript coder is responsible for filtering, sorting, and manipulating the data to prepare it for representation. This role could also be handled server-side or in combination with client-side technology.

    An HTML coder builds any required forms to allow interactive data addition, often with JavaScript libraries for validation.

    One or more web designers create the look-and-feel of all data-related pages, including styling the output of the visualized data.

    A web coder, leveraging his or her own knowledge of JavaScript, combined with core frameworks and data visualization libraries, displays the data in a representational format.

    Although all the described tasks could possibly be fulfilled by a single individual, it's just as likely that these tasks are handled by a group working closely together. Whether it's done by one (very busy) person or a networked team spread around the world, the important take-away is that web professionals own the data visualization process from top to bottom.

    TIP Curious as to what other web professionals have been doing in the field of data visualization? There are a number of sites online that provide a bevy of examples. One of the best that we've found is at http://visualizing.org/, which not only has compelling galleries but also a robust community dedicated to data and design.

    What Tech Brings to the Table

    Web professionals are dependent on robust web software to accomplish any aspect of their work, but the need for power tools is particularly vital to properly handle data visualization. Recent years have witnessed a sea change in online technology that has greatly expanded the possibilities for representing data. Although there are many contributing factors, the following discussion focuses on three key ones:

    Faster, more efficient JavaScript engines in browsers

    The rapid proliferation of HTML5 compatible browsers

    The increased availability of JavaScript frameworks and libraries

    Faster and Better JavaScript Processing

    For the last several years, browser makers have identified JavaScript processing as a key battleground and have pursued faster JavaScript engines with great vigor. The bar graph in Figure 1.5 compares runs of the SunSpider benchmark, created and maintained by WebKit.org, for older browsers (Internet Explorer 7 and Safari 3) against the latest browsers as of this writing, Internet Explorer 10 and Safari 6. In this chart, smaller is better, and you can see there has been a radical shift in browser efficiency. The values for the earlier browser versions come from a June 2008 article that appeared on ZDNet (http://www.zdnet.com/blog/hardware/sunspider-javascript-benchmark-and-acid-3-compatibility-charts-firefox-3-0-rc-3-and-opera-9-50-added/2090); we ran the benchmarks on the newer browsers ourselves.

    Figure 1.5 The lesser values indicate faster and more desirable processing times by JavaScript engines.

    The increase in JavaScript processing functionality has had a direct effect on the realm of data visualization, in both the analysis and the rendering phase. The JavaScript engine handles raw numeric computations as well as on-screen drawing, either directly or in conjunction with the hardware renderer. This combination greatly increases the viability of direct browser data visualization, without resorting to a third-party plug-in, like Adobe Flash.

    Rise of HTML5

    A faster engine isn't much good without fuel to run it—luckily, a load of high-octane HTML5 was delivered just in time. The roots of HTML5 can be traced back to 2004 and the Web Hypertext Application Technology (WHAT) Working Group—but adoption was glacially slow. At one point, the W3C had actually slated the web language for final recommendation status in 2022! The introduction of smartphones, most notably Apple's iPhone, changed all that. The device's embrace of HTML5 in lieu of Flash triggered a feature adoption race among all major browsers, with HTML5 becoming the current standard for mobile devices.

    Why is HTML5 so important to data visualization? First, let me clarify that this latest version of the web's primary language brings along two closely knit partners: CSS3 and advanced JavaScript APIs. The enhanced capabilities brought by these three related technologies have truly revolutionized web design and development overall. The following are a few key features that have been especially beneficial for data visualization:

    The <canvas> tag: Include a seemingly blank <canvas> element on your HTML5 page and suddenly you have access to the full palette of graphics, including primitives (such as circles and rectangles), plotted points with connected lines, gradients, text, imported images, and much more, all drawn by JavaScript, live. What's more, you have the option to make whatever you put on your canvas interactive, capable of being changed by the user (see Figure 1.6).

    Figure 1.6 HTML5 brings support for advanced functionality such as the <canvas> tag, which opens the door to interactive charting among many other data visualization benefits.

    SVG: Although we've had limited SVG support for some time, its usage has greatly expanded with HTML5. This canvas alternative also enables you to create rich graphics on the web.

    Web fonts: After being limited to a handful of system fonts common to PC and Mac, web designers everywhere were hungry for the possibilities brought by browser support for web fonts. Now, designers can use an ever-growing family of decorative and other font faces to give the impact their infographics and other data visualizations need—while remaining search engine compatible and screen reader friendly.

    Advanced form elements: Because we were sick and tired of working with the extremely limited set of form elements, this one was pretty high on our personal wish list. HTML5 brings a great number of new input types (such as email, tel, and url) that makes it much easier for users to correctly enter the proper data, especially on mobile devices. In addition, new form controls such as the range slider bring an enhanced user experience into play. Browser support for these elements is not quite at the same level as some of the other HTML5 features, but it does seem to get better with each version release.
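
    To make the canvas item in the list above concrete, here is a minimal sketch of drawing a bar chart with the canvas 2D context. The function name drawBars, the fill color, the element id "chart", and the sample values are all hypothetical; clearRect, fillRect, and fillStyle are standard members of CanvasRenderingContext2D.

```javascript
// Draw one bar per value onto a canvas 2D context, scaled so the largest
// value fills the full height. `ctx` is a CanvasRenderingContext2D.
function drawBars(ctx, values, width, height) {
  const max = Math.max.apply(null, values);
  const slot = width / values.length; // horizontal space per bar
  const gap = slot * 0.2;             // blank space between bars
  ctx.clearRect(0, 0, width, height);
  ctx.fillStyle = '#4a7ab5';
  values.forEach(function (value, i) {
    const barHeight = (value / max) * height;
    // Canvas y coordinates grow downward, so a bar starts at height - barHeight.
    ctx.fillRect(i * slot + gap / 2, height - barHeight, slot - gap, barHeight);
  });
}

// In a browser page containing <canvas id="chart" width="300" height="150">:
//   const canvas = document.getElementById('chart');
//   drawBars(canvas.getContext('2d'), [400, 125, 100, 75], canvas.width, canvas.height);
```

    Calling drawBars again with new values redraws the chart in place, which is the basis for the interactivity mentioned above.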

    TIP Perhaps the best resource for checking whether HTML5 specifics can be incorporated into a web page is http://caniuse.com/. This site tracks each of the HTML5, CSS, and JavaScript API features and their current (as well as past and future) browser version support. We consider Can I Use an essential stop in the planning stage of any new site or application.

    Lowering the Implementation Bar

    To complete our car metaphor, let's agree that we now have a powerful vehicle (our highly efficient JavaScript engine) and a super fuel (widely supported HTML5). Does anyone know how to drive this thing? Thanks to the popularity and ease of use of JavaScript libraries, specifically jQuery and the libraries built on it, the answer for an increasing number of web professionals is a resounding Yes!

    It's true that anyone with sufficient JavaScript know-how could manage the requisite data acquisition, conversion, and rendering required in the data visualization life cycle. However, armed with core jQuery and targeted libraries, such a process becomes much more efficient and successful.

    In fact, if there is a single raison d'etre for this book, it's the existence and proliferation of these JavaScript libraries that will be leveraged throughout this title. In addition to making it easier to bring the real-world data numbers to life in the first place, most sophisticated JavaScript libraries also make it much more straightforward to modify controlling parameters and even the data itself, all on the fly. This added degree of flexibility strengthens the case for taking advantage of code libraries such as Google Charts, D3, Raphaël and jqPlot to name just a few covered in this book and available right now to be put to work.

    Summary

    Data visualization is the process of acquiring data, analyzing it, and displaying the resulting information in a graphical fashion. The entire procedure itself can run the gamut from the extremely straightforward, such as creating a pie chart from values in a spreadsheet, to the exceedingly complex, as when building a sophisticated infographic distilling reams of census and geographic data. When thinking about the world of data visualization, keep these key points in mind:

    Visualizing data makes it easier for a wider audience to quickly grasp the relative nature of selected data.

    There are a tremendous number of options when it comes to deciding which form of representation your information should take. The job of the visualization designer is to realize the optimum choices for communicating the data's message.

    Data can be collected and displayed visually in real time through the use of HTML forms and JavaScript coding.

    The primary creators of data visualizations are the public sector and the business-to-business, intrabusiness, and business-to-consumer markets.

    Advances in browser JavaScript processing, HTML5 browser support, and the proliferation of related JavaScript libraries lay the technological foundation for data visualization.

    Chapter 2

    Working with the Essentials of Analysis

    What's in This Chapter

    Basic analytic concepts

    Key mathematical terms commonly applied when evaluating data

    Techniques for uncovering patterns within the information

    Strategies for forecasting future trends

    The current Google definition of analysis is a perfect fit when applied to data visualization:

    Detailed examination of the elements or structure of something, typically as a basis for discussion or interpretation.

    You know the expression "Can't see the forest for the trees"? When you analyze data with visualization in mind, you potentially are looking at both the forest and the trees. The individual data points are, of course, extremely important, but so is the overall pattern they form: the structure referenced in the Google definition. Moreover, the whole purpose of analyzing data for visualization is to discuss, interpret, and understand; that is, to paint a picture with the numbers and not by the numbers.

    This chapter covers the basic tenets of analysis in order to lay a foundation for the material ahead. It starts by defining a few of the key mathematical terms commonly applied when evaluating data. Next, the chapter discusses techniques frequently used to uncover patterns within the information and strategies for forecasting future trends based on the data.

    Key Analytic Concepts

    At its heart, most data is number based. For every text-focused explication that starts with "One side feels this way and another side feels that way," the next question is inevitably numeric: "How many are on each side?" Such simplified headcounts are rarely the full scope of a data visualization project, and it is often necessary to bring more sophisticated numeric analysis into play. This section explores the more frequently applied concepts.

    Mean Versus Median

    One of the most common statistical tasks is to determine the average—or mean—of a particular set of numbers. The mean is the sum of all the considered values divided by the total number of those values. Let's say you have sales figures for seven different parts of the country, shown in Table 2.1.

    Table 2.1 Sample Sales by Region

    All the dollar amounts added together equal $1,000,000. Divide the total by 7—the total number of values—to arrive at the mean: $142,857. Although this is significant in terms of sales as a whole, it doesn't really indicate the more typical figure for most of the regions. The significantly higher amount from California skews the results. Quite often when someone asks for the average, what they are really asking for is the median.

    The median is the midpoint in a series of values: quite literally, the middle. Let's list regional sales in descending order, from highest to lowest (see Table 2.2).

    Table 2.2 Sample Sales by Region, Descending Order

    The median sales figure (the Northeast region's $100,000) is actually much closer to what most of the other areas are bringing in. To quantify variance in the data—like that shown in the preceding example—statisticians rely on a concept called standard deviation.
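
    The two calculations can be sketched in JavaScript. The individual table values are not reproduced here, so the array below is an illustrative set of seven figures chosen to match the totals quoted in the text (a $1,000,000 sum, a $400,000 high, and a $100,000 median); treat the individual numbers, and the function names, as assumptions.

```javascript
// Illustrative regional sales figures (assumed, chosen to match the quoted totals).
const sales = [400000, 125000, 125000, 100000, 100000, 75000, 75000];

// Mean: the sum of the values divided by how many there are.
function mean(values) {
  const sum = values.reduce(function (a, b) { return a + b; }, 0);
  return sum / values.length;
}

// Median: the middle value of the sorted list, or the average of the
// two middle values when the list has an even length.
function median(values) {
  const sorted = values.slice().sort(function (a, b) { return a - b; });
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

console.log(Math.round(mean(sales))); // 142857
console.log(median(sales));           // 100000
```

    Note the numeric comparator passed to sort: without it, JavaScript sorts array elements as strings, which would order the figures incorrectly.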

    Standard Deviation

    Standard deviation measures the distribution of numbers from the average or mean of any given sample set. The higher the deviation, the more spread out the data. Knowing the standard deviation allows you to determine, and thus potentially map, which values lie outside the norm.

    Following are the steps for calculating the standard deviation:

    Determine the mean of the values set.

    Subtract the mean from each value.

    Square the results. Cleverly, this is called the squared differences.

    Find the mean for all the squared differences.

    Get the square root of the just-calculated mean. The result is the standard deviation.
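
    Written as a formula, the five steps above amount to the following, where N is the number of values, x_i is each value, \mu is their mean, and \sigma is the standard deviation. The symbols are conventional notation rather than anything the text itself defines.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
```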

    Let's run our previous data set through these steps to identify its standard deviation.

    The mean, as calculated before, is 142,857.

    Subtract the mean from the values to get the following results:

    To handle the negative values properly, square the results:

    Add all the squared values together to get 79,642,857,143; divide by 7 (the number of values) and you have 11,377,551,020.

    Calculate the square root of that value to find that 106,665 is the standard deviation.

    When you know the standard deviation from the mean, you can say which figures might be abnormally high or abnormally low. The range runs from 36,191 (the mean minus the standard deviation) to 249,522 (the mean plus the standard deviation). The California sales figure of $400,000 is outside the norm by slightly more than $150,000.
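
    The walkthrough above can be sketched in code. As before, the individual figures are assumed values chosen to be consistent with the totals quoted in the text, and the function names are hypothetical.

```javascript
// Illustrative figures (assumed) consistent with the totals quoted in the text.
const regionalSales = [400000, 125000, 125000, 100000, 100000, 75000, 75000];

function mean(values) {
  return values.reduce(function (a, b) { return a + b; }, 0) / values.length;
}

// Population standard deviation, following the five steps in the text.
function standardDeviation(values) {
  const avg = mean(values);                      // step 1: mean
  const squaredDiffs = values.map(function (v) {
    const diff = v - avg;                        // step 2: subtract the mean
    return diff * diff;                          // step 3: square the result
  });
  const variance = mean(squaredDiffs);           // step 4: mean of the squares
  return Math.sqrt(variance);                    // step 5: square root
}

// Values lying more than one standard deviation from the mean.
function outsideNorm(values) {
  const avg = mean(values);
  const sd = standardDeviation(values);
  return values.filter(function (v) { return Math.abs(v - avg) > sd; });
}

console.log(standardDeviation(regionalSales).toFixed(1)); // 106665.6
console.log(outsideNorm(regionalSales));                  // [ 400000 ]
```

    The unrounded result is about 106,665.6, which the text reports as 106,665.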

    To demonstrate how values can change the standard deviation, try recalculating it after dropping the California sales to $150,000, a figure much more in line with the other regions. With that modification, the mean itself falls to 107,143 (the new $750,000 total divided by 7) and the standard deviation shrinks to 25,754, indicating a much narrower variance range from 81,389 to 132,897.

    Working with Sampled Data

    Statisticians aren't always able to access all the data as we were with the regional sales information referenced earlier in this chapter. Polls, for example, almost always reflect the input of just a portion—or sample—of the targeted population. To account for the difference, three separate concepts are applied: a variation on the standard deviation formula, the per capita calculation for taking into account the relative size of the data population, and the margin of error.

    Standard Deviation Variation

    There's a very simple modification to the standard deviation formula that is incorporated when working with sampled data. Called Bessel's correction, this change modifies a single value: rather than dividing the sum of the squared differences by the total number of values, you divide it by the number of values less one. Statisticians consider that this seemingly minor change represents the standard deviation more accurately when working with a subset of the data rather than the complete set.

    Assume that the previously discussed sales data came from a global sales force and is thus only a portion of the data rather than its entirety. In this situation, the sum of the squared differences (79,642,857,143) would be divided by 6 rather than 7, which results in 13,273,809,523 as opposed to 11,377,551,020, a difference of almost 2 billion. Taking the square root of this value results in a new standard deviation of 115,212 versus 106,665.
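
    As a rough sketch of the difference Bessel's correction makes, the function below starts from the sum of squared differences quoted in the text and divides by either the number of values or one less. The name deviationFromSquares and its sampled flag are assumptions made for this example.

```javascript
// Standard deviation from a precomputed sum of squared differences.
// Pass sampled = true to apply Bessel's correction (divide by count - 1).
function deviationFromSquares(sumOfSquaredDifferences, count, sampled) {
  const divisor = sampled ? count - 1 : count;
  if (divisor <= 0) {
    throw new Error('not enough values for this calculation');
  }
  return Math.sqrt(sumOfSquaredDifferences / divisor);
}

// Sum of squared differences and value count taken from the text.
const sumOfSquares = 79642857143;
const populationDeviation = deviationFromSquares(sumOfSquares, 7, false);
const sampleDeviation = deviationFromSquares(sumOfSquares, 7, true);

// Whole-number part, which is how the text quotes the figures.
console.log(Math.floor(populationDeviation)); // 106665
console.log(Math.floor(sampleDeviation)); // 115212
```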

    Per Capita Calculations

    Looking at raw numbers without taking any other factors into consideration can lead to inaccurate conclusions. One enhancement is to bring the size of the population of a sampled region into play. This type of calculation is called per capita, Latin for each head.

    To apply the per capita value, you divide the given number attributed to an area by the population of that area. Typically, this results in a very small decimal, which makes it difficult to completely comprehend. To make the result easier to grasp, it is often multiplied by a larger value, such as 100,000, which would then be described as per 100,000 people.

    To better understand this concept, compare two of the sales regions that each brought in $75,000: the Southeast and the Southwest. According to the U.S. 2010 census, the population of the Southeast is 78,320,977, whereas the Southwest's population is 38,030,918. If you divide the sales figure for each by their respective population and then multiply that by 100,000, you get the results shown in Table 2.3.

    Table 2.3 Regional Sales per Capita

    When the per capita calculation is figured in, the perceived difference is quite significant. Essentially, per capita sales in the Southwest outperformed those in the Southeast by better than 2-to-1. Such framing of the data would be critical information for any organization making decisions about future spending based on current data.
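
    The per capita calculation can be sketched in a few lines. The sales and population figures below are the ones quoted in the text; the function name perCapita and its scale parameter are assumptions made for illustration.

```javascript
// Divide an amount by a population, then scale the result
// (by default to a "per 100,000 people" figure).
function perCapita(amount, population, scale = 100000) {
  if (population <= 0) {
    throw new Error('population must be greater than zero');
  }
  return (amount / population) * scale;
}

// Sales and 2010 census populations as quoted in the text.
const southeast = perCapita(75000, 78320977);
const southwest = perCapita(75000, 38030918);

console.log(southeast.toFixed(2)); // "95.76"
console.log(southwest.toFixed(2)); // "197.21"
console.log((southwest / southeast).toFixed(2)); // "2.06"
```

    Because both regions have the same sales total, the ratio of the two results is simply the ratio of the populations, a little over 2-to-1.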

    Margin of Error

    If you're not sampling the entire population on any given subject, your data is likely to be somewhat imprecise. This imprecision is known as the margin of error. The term is frequently used with political polls, where you might encounter a note that the poll has a margin of error of plus or minus 3.5% or something similar. This percentage value is very easy to calculate and, wondrously, works regardless of the overall population's size.

    To find the margin of error, simply divide 1 by the square root of the number of samples. For example, let's say you surveyed a neighborhood about a household cleaning product. If 1,500 people answered your questions, the resulting margin of error would be 2.58 percent. Here's how the math breaks down:

    Find the square root of your sample size. The square root of 1,500 is close to 38.729.

    Divide 1 by that square root value. One divided by 38.729 is around 0.0258.

    Multiply the decimal value by 100 to find the percentage. In this case, the final percentage would be 2.58 percent.

    The larger your sample, the smaller the margin of error, which stands to reason. So if the sample size doubles to 3,000, the margin of error would be 1.83 percent. Note that doubling the survey size does not halve the margin of error for 1,500; the margin shrinks in proportion to the square root of the sample size, not on a 1-to-1 ratio.

    Because this calculation is true regardless of the overall population size—your sampled audience could be in New York or in Montana—it has wide application. Naturally, there are many other factors that could come into play, but the margin of error is unaffected.
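
    The margin of error calculation translates directly into code. The sketch below implements only the one-over-square-root rule described in this section, which is commonly treated as a rule of thumb approximating a 95 percent confidence level; the function name marginOfErrorPercent is an assumption made here.

```javascript
// Margin of error as a percentage: 1 divided by the square root
// of the sample size, multiplied by 100.
function marginOfErrorPercent(sampleSize) {
  if (!(sampleSize > 0)) {
    throw new Error('sampleSize must be greater than zero');
  }
  return (1 / Math.sqrt(sampleSize)) * 100;
}

console.log(marginOfErrorPercent(1500).toFixed(2)); // "2.58"
console.log(marginOfErrorPercent(3000).toFixed(2)); // "1.83"
```

    Doubling the sample divides the margin by the square root of 2 (about 1.41) rather than by 2.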

    Detecting Patterns with Data Mining

    Data visualizations are often used to illustrate one or more perceived patterns in targeted information. Another term for identifying these patterns and their relationships to each other is data mining. The most common data-mining tool is a relational database that contains multiple forms of information, such as transactional data, environmental information, and demographics.

    Data mining incorporates a number of techniques for recognizing relationships between various pieces of information. The following are the key techniques:

    Associations: The Association technique is often applied to transactions, where a consumer purchases two or more items at the same time. The textbook example (albeit a fictional one) is of a supermarket chain discovering that men frequently buy beer when they purchase diapers on Thursdays. This association between the seemingly disparate products enables the retailer to make key decisions, like those involving product placement or pricing. Of course, any association data should be taken with a grain of salt because correlation does not imply causation.

    Classifications: Classification separates data records into predefined groups or classes according to existing or predictive criteria. For example, let's say you're classifying online customers according to whether they would buy a new car every other year. Using relational, comparative data—identifying other factors that correlated with previous consumers who bought an automobile every two years—you could classify new entries in the database accordingly.

    Decision trees: A decision tree follows a logic flow dictated by choices and circumstances. In practice, the decision tree resembles a flow chart, like the one shown in Figure 2.1. Decision trees are often used in conjunction with classifications.

    Figure 2.1 In a decision tree, environmental factors, such as the weather, along with personal choices, can impact the final decision equally.

    Clusters: Clustering looks at existing attributes or values and groups entries with similarities. The clustering technique lends itself to more of an exploratory approach than classification because you don't have to predetermine the associated groups. However, this data mining method can also identify members of specific market segments.

    Sequential patterning: By examining the sequential order in which actions are taken, you can determine the action likely to be taken next. Sequential patterning is a foundation of trend analysis, and timelines are often incorporated for related data visualization.

    These various techniques can be applied separately or in combination with one another.
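
    To make the association technique a little more concrete, the sketch below counts how often pairs of items appear in the same transaction. It is a deliberately simple illustration, not a full data-mining algorithm; the function name countPairs and the transaction data are hypothetical, invented for this example.

```javascript
// Count how many transactions contain each pair of items.
function countPairs(transactions) {
  const counts = new Map();
  for (const basket of transactions) {
    // De-duplicate and sort so that a pair always produces the same key,
    // whatever order the items were purchased in.
    const items = [...new Set(basket)].sort();
    for (let i = 0; i < items.length; i++) {
      for (let j = i + 1; j < items.length; j++) {
        const key = items[i] + '|' + items[j];
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

// Hypothetical transactions, invented for illustration.
const transactions = [
  ['diapers', 'beer', 'chips'],
  ['diapers', 'beer'],
  ['milk', 'bread'],
  ['beer', 'chips'],
];

const pairCounts = countPairs(transactions);
console.log(pairCounts.get('beer|diapers')); // 2
console.log(pairCounts.get('beer|chips')); // 2
console.log(pairCounts.get('bread|milk')); // 1
```

    Pairs with high counts are candidates for a closer look, although a count alone says nothing about why the items are bought together.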

    Projecting Future Trends

    The prediction of future actions based on current behavior is a cornerstone of data visualization. Much projection is based on regression analysis. The simplest regression analysis depends on two variables interconnected in a causal relationship. The first variable is considered independent and the second, dependent. Let's say you're looking at how long a dog attends a behavioral school and the number of times the dog chews up the furniture. As you analyze the data, you discover that there is a correlation between the length of the dog's training (the independent variable) and its behavior (the dependent variable): The longer the pet stays in the training, the less furniture destruction. Table 2.4 shows the raw data.

    Table 2.4 Data for Regression Analysis

    To get a better sense of how regression analysis works, Figure 2.2 shows the basic points plotted on a graph. As you can see, the points are slightly scattered across the grid. Because it involves only two variables, this type of projection is referred to as simple linear regression.

    NOTE Some statisticians refer to the independent and dependent variables in regression analysis as exogenous and endogenous, respectively. Exogenous refers to something that was developed from external factors, whereas endogenous is defined as having an internal cause or origin.

    Figure 2.2 The number of days of training (the independent variable) is shown on the X axis and the number of dog bites (the dependent variable) on the Y axis.
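
    One common way to compute a simple linear regression is an ordinary least-squares fit, sketched below. This is an illustration under stated assumptions, not necessarily the method the book goes on to use: the function name linearRegression is an assumption, and the data points are hypothetical rather than the values from Table 2.4.

```javascript
// Ordinary least-squares fit of a line y = slope * x + intercept.
function linearRegression(points) {
  const n = points.length;
  if (n < 2) {
    throw new Error('linearRegression needs at least two points');
  }
  const meanX = points.reduce((sum, p) => sum + p.x, 0) / n;
  const meanY = points.reduce((sum, p) => sum + p.y, 0) / n;

  let sumXY = 0; // sum of (x - meanX) * (y - meanY)
  let sumXX = 0; // sum of (x - meanX) squared
  for (const p of points) {
    sumXY += (p.x - meanX) * (p.y - meanY);
    sumXX += (p.x - meanX) ** 2;
  }
  if (sumXX === 0) {
    throw new Error('x values must not all be identical');
  }
  const slope = sumXY / sumXX;
  const intercept = meanY - slope * meanX;
  return { slope, intercept };
}

// Hypothetical data: x is days of training, y is incidents observed.
const points = [
  { x: 1, y: 9 },
  { x: 2, y: 7 },
  { x: 3, y: 6 },
  { x: 4, y: 3 },
  { x: 5, y: 2 },
];

const fit = linearRegression(points);
console.log(fit.slope.toFixed(1)); // "-1.8"
console.log(fit.intercept.toFixed(1)); // "10.8"
```

    The negative slope reflects the trend described above: in this made-up data, each additional day of training is associated with about 1.8 fewer incidents.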

    To clarify the data trend direction, a regression
