Exploratory Data Analysis

Exploratory data analysis (EDA) is both a data analysis perspective and a set of techniques that allow the data to guide the analysis rather than imposing a structure on the data. EDA emphasizes visual representations and graphical techniques over summary statistics and involves cycling between exploratory and confirmatory approaches. Several useful techniques for displaying data are discussed such as frequency tables, bar charts, and pie charts.

Uploaded by

Ranita Lakra
Copyright
© Attribution Non-Commercial (BY-NC)

Exploratory Data Analysis

Exploratory data analysis (EDA) is both a data analysis perspective and a set of techniques.¹ In exploratory data analysis, the data guide the choice of analysis, or a revision of the planned analysis, rather than the analysis presuming to overlay its structure on the data without the benefit of the analyst's scrutiny. This is comparable to our position that research should be problem-oriented rather than tool-driven. The flexibility to respond to the patterns revealed by successive iterations in the discovery process is an important attribute of this approach. Confirmatory data analysis, by contrast, occupies a position closer to classical statistical inference in its use of significance and confidence. But confirmatory analysis may also differ from traditional practices by using information from a closely related data set or by validating findings through the gathering and analyzing of new data.²

One authority has compared exploratory data analysis to the role of police detectives and other investigators and confirmatory analysis to that of judges and the judicial system. The former are involved in the search for clues and evidence; the latter are preoccupied with evaluating the strength of what is found. Exploratory data analysis is the first step in the search for evidence, without which confirmatory analysis has nothing to evaluate.³ Consistent with that analogy, EDA has more in common with exploratory designs than with formalized ones. Because it doesn't follow a rigid structure, it is free to take many paths in unraveling the mysteries in the data, to sift the unpredictable from the predictable.

A major contribution of the exploratory approach lies in the emphasis on visual representations and graphical techniques over summary statistics. Summary statistics, as you will see momentarily, may obscure, conceal, or even misrepresent the underlying structure of the data. When numerical summaries are used exclusively and accepted without visual inspection, the selection of confirmatory models may be precipitous and based on flawed assumptions. Consequently, it may produce erroneous conclusions.⁴ For these reasons, data analysis should begin with visual inspection. After that, it is not only possible but also desirable to cycle between exploratory and confirmatory approaches.
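The point about summary statistics concealing structure can be illustrated with a minimal Python sketch. The two samples below are hypothetical and were chosen only for illustration; they are not taken from the exhibits. Both have a mean of 5 and a population standard deviation of 2, yet a simple text plot shows that one is clustered around its center while the other is split into two separate groups. The helper name `text_plot` is likewise an assumption made for this example.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical samples, chosen for illustration only.
unimodal = [2, 4, 4, 4, 5, 5, 7, 9]
bimodal = [3, 3, 3, 3, 7, 7, 7, 7]


def text_plot(values):
    """Return one line per integer in the observed range, with one '*' per observation."""
    counts = Counter(values)
    lines = []
    for v in range(min(values), max(values) + 1):
        # Counter returns 0 for values that never occur, so gaps stay visible.
        lines.append(f"{v:>2} | " + "*" * counts[v])
    return "\n".join(lines)


for name, sample in (("unimodal", unimodal), ("bimodal", bimodal)):
    # The numerical summaries are identical for both samples.
    print(name, "mean =", mean(sample), "std dev =", pstdev(sample))
    # The visual display is what reveals the difference in shape.
    print(text_plot(sample))
    print()
```

An analyst who looked only at the two summary figures would treat the samples as interchangeable; the plots suggest they would call for different confirmatory models.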

Several useful techniques for displaying data are not new to EDA. They are essential to any examination of the data. For example, a frequency table is a simple device for arraying data. An example is presented in Exhibit 16-2. It arrays data by assigned numerical value, with columns for percent, percent adjusted for missing values, and cumulative percent. Sector, the nominal variable that describes the business classifications or markets of the sampled corporations, provides the observations for this table. Although there are 100 observations, the small number of categories makes the variable easily tabled. The same data are presented in Exhibit 16-3 using a bar chart and a pie chart. The values and percentages are more readily understood in this graphic format, and visualization of the sector categories and their relative sizes is improved.

When the variable of interest is measured on an interval-ratio scale and is one with many potential values, these techniques are not particularly informative. Exhibit 16-4 is a condensed frequency table of the highest total return to investors, measured in percentages, of the top 48 companies in this category taken from the Fortune 500.⁶ Only two values, 59.9 and 66, have a frequency greater than 1. Thus, the primary contribution of this table is an ordered list of values. If the table were converted to a bar chart, it would have 48 bars of equal length and two bars with two occurrences. Moreover, bar charts do not reserve spaces for values where no observations occur within the range. Constructing a pie chart for this variable would also be pointless.
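A frequency table of the kind described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used to build the exhibits: the function name `frequency_table`, the use of `None` to mark a missing observation, and both data sets are hypothetical. The sketch assumes that "percent" is computed over all observations and that "percent adjusted for missing values" and "cumulative percent" are computed over the non-missing observations only.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, count, percent, valid_percent, cumulative_percent).

    None marks a missing observation (an assumption of this sketch).
    Percent is based on all observations; valid percent and cumulative
    percent are based on the non-missing observations only.
    """
    total = len(values)
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        count = counts[value]
        percent = 100.0 * count / total
        valid_percent = 100.0 * count / len(valid)
        cumulative += valid_percent
        rows.append((value, count, percent, valid_percent, cumulative))
    return rows


# Hypothetical sector codes (not the exhibit data); None is a missing value.
sectors = [1] * 4 + [2] * 3 + [3] * 2 + [None]
sector_rows = frequency_table(sectors)
for value, count, pct, valid_pct, cum_pct in sector_rows:
    # The '#' run is a crude text bar chart of the same counts.
    print(f"{value:>6} {count:>3} {pct:6.1f} {valid_pct:6.1f} {cum_pct:6.1f}  " + "#" * count)

# Hypothetical many-valued variable: almost every value occurs exactly once.
returns = [12.5, 12.5, 14.0, 15.3, 17.8, 21.1]
return_rows = frequency_table(returns)
print([count for _, count, *_ in return_rows])
```

With only three categories, the first table and its text bars summarize the variable compactly. For the many-valued variable, nearly every count is 1, so the table amounts to little more than an ordered list of values, which is the limitation the passage describes.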