Lesson4 Data
Module 1: AI Conception
• Experience
• Training
• Manuals
• Procedures
• Other People
• Where else?
Recap
• “Knowledge is the factor that
allows you to take effective action.
It allows you to make the right
decision and to do the right
thing.” Nick Milton
• Knowledge provides capability
and know-how
• Cannot have knowledge without
data, context, and a story to tell
Metrics for Analytics Ready Data
▪ Data source reliability: where did the data come from?
▪ Data content accuracy: is the data correct and a good match for the problem?
▪ Data accessibility: can we easily get to the data when needed?
▪ Data security and data privacy
▪ Data richness: Comprehensiveness
▪ Data consistency
▪ Data currency/data timeliness
▪ Data granularity: lowest level of detail
▪ Data validity: Acceptable values
▪ Data relevancy
Data Quality Activity
The Art and Science of Data Preprocessing
• Data reduction
1. Variables
• Dimensional reduction
• Variable selection
2. Cases/samples
• Sampling
• Balancing / stratification
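As a rough illustration of sampling with stratification, the sketch below draws the same fraction of cases from every class so that the class mix of the sample matches the full data set. The function name `stratified_sample`, the record layout, and the 90/10 class split are assumptions made for this example only.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Draw the same fraction from every class so class proportions are preserved."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    by_class = defaultdict(list)
    for rec in records:
        by_class[label_of(rec)].append(rec)
    sample = []
    for group in by_class.values():
        k = max(1, round(len(group) * fraction))  # keep at least one case per class
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 90 "no" cases and 10 "yes" cases.
records = [("no", i) for i in range(90)] + [("yes", i) for i in range(10)]
sample = stratified_sample(records, label_of=lambda r: r[0], fraction=0.2)
counts = {label: sum(1 for r in sample if r[0] == label) for label in ("no", "yes")}
print(counts)
```

A plain random sample of 20 cases could easily contain no "yes" cases at all; sampling within each class avoids that.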
Data Preprocessing Tasks and Methods (1 of 3)
Table 2.1 A Summary of Data Preprocessing Tasks and Potential Methods
Main Task | Subtasks | Popular Methods
Data consolidation | Access and collect the data | SQL queries, software agents, Web services.
Data consolidation | Select and filter the data | Domain expertise, SQL queries, statistical tests.
Data consolidation | Integrate and unify the data | SQL queries, domain expertise, ontology-driven data mapping.
Data cleaning | Handle missing values in the data | Fill in missing values (imputations) with most appropriate values (mean, median, min/max, mode, etc.); recode the missing values with a constant such as “ML”; remove the record of the missing value; do nothing.
Data cleaning | Identify and reduce noise in the data | Identify the outliers in data with simple statistical techniques (such as averages and standard deviations) or with cluster analysis; once identified, either remove the outliers or smooth them by using binning, regression, or simple averages.
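Two of the data cleaning methods in the table, imputing missing values and flagging outliers with averages and standard deviations, can be sketched in a few lines. This is only one possible approach: the helper names, the `ages` data, the choice of the median as the fill value, and the cutoff of 2 standard deviations are all assumptions for illustration.

```python
import statistics

def impute_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * sd]

# Hypothetical column with two missing entries and one suspicious value.
ages = [23, 25, None, 24, 26, 22, 95, None, 24]
filled = impute_with_median(ages)
outliers = flag_outliers(filled)
print(filled)
print(outliers)
```

Whether a flagged value should be removed or smoothed depends on domain knowledge; the code only identifies candidates.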
Data Preprocessing Tasks and Methods (2 of 3)
• Arithmetic mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Median
– The number in the middle
• Mode
– The most frequent observation
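The three measures can be compared on a small example using Python's standard `statistics` module. The `salaries` values are made up for illustration; one large value is included to show how the mean and the median can differ.

```python
import statistics

# Hypothetical salaries in thousands; 180 is an unusually large value.
salaries = [42, 45, 45, 48, 50, 52, 180]

mean = statistics.mean(salaries)      # sum of the values divided by their count
median = statistics.median(salaries)  # the number in the middle of the sorted values
mode = statistics.mode(salaries)      # the most frequent observation

print(mean, median, mode)
```

Here the single large salary pulls the mean well above the median, which is why the median is often preferred for skewed data.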
Descriptive Statistics Measures of Dispersion (1 of 2)
• Dispersion
– Degree of variation in a given variable
• Range
– Max - Min
• Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Standard Deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
• Mean Absolute Deviation (MAD)
– Average absolute deviation from the mean
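A short sketch of the dispersion measures on a made-up data set follows. `statistics.variance` and `statistics.stdev` use the n − 1 (sample) denominator; the mean absolute deviation is computed by hand, dividing by n, which is an assumption about the intended definition.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

data_range = max(data) - min(data)       # Max - Min
mean = statistics.mean(data)
variance = statistics.variance(data)     # sample variance, divides by n - 1
std_dev = statistics.stdev(data)         # square root of the sample variance
mad = sum(abs(x - mean) for x in data) / len(data)  # average absolute deviation from the mean

print(data_range, variance, std_dev, mad)
```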
Descriptive Statistics Measures of Dispersion (2 of 2)
▪ Quartiles
▪ Box-and-Whiskers Plot
▪ a.k.a. box-plot
▪ Versatile / informative
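The numbers behind a box plot can be computed with `statistics.quantiles` (Python 3.8+). The data are hypothetical, and the 1.5 × IQR rule for marking outliers is a common convention assumed here, not something the slides specify. Quartile values also depend on the interpolation method; `method="inclusive"` is one choice among several.

```python
import statistics

data = [3, 5, 7, 8, 9, 11, 13, 15, 40]  # hypothetical observations

# Three cut points that split the sorted data into four quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, outliers)
```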
Descriptive Statistics Shape of a Distribution
• Histogram – frequency chart
• Skewness
– Measure of asymmetry
$$\text{Skewness} = S = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3}$$
• Kurtosis
– Peak/tall/skinny nature of the distribution
$$\text{Kurtosis} = K = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\, s^4} - 3$$
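The skewness and kurtosis formulas can be coded directly. This sketch follows the slide's definitions, with s taken as the sample standard deviation; statistical packages often apply different bias corrections, so their results may not match exactly. The function names and both data sets are made up for illustration.

```python
import statistics

def skewness(values):
    """S = sum((x - mean)^3) / ((n - 1) * s^3), with s the sample standard deviation."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return sum((x - mean) ** 3 for x in values) / ((n - 1) * s ** 3)

def kurtosis(values):
    """K = sum((x - mean)^4) / (n * s^4) - 3, with s the sample standard deviation."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return sum((x - mean) ** 4 for x in values) / (n * s ** 4) - 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical, evenly spread around the mean
right_tailed = [1, 1, 2, 2, 3, 10]  # hypothetical, one large value on the right

print(skewness(symmetric), skewness(right_tailed))
print(kurtosis(symmetric))
```

The symmetric data give a skewness of zero, while the long right tail gives a positive value.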
Relationship Between Dispersion and Shape Properties
Data Visualization
• “The use of visual representations to explore,
make sense of, and communicate data.”
• Data visualization vs. Information visualization
• Information = aggregation, summarization, and
contextualization of data
• Related to information graphics, scientific
visualization, and statistical graphics
• Often includes charts, graphs, illustrations, …
A Brief History of Data Visualization
• Data visualization dates back to the second century AD
• Most developments have occurred in the last two and a half
centuries
• Until recently it was not recognized as a discipline
• Today’s most popular visual forms date back a few centuries
Which Chart or Graph Should You Use?
The Emergence of Data Visualization and Visual Analytics
• Emergence of new companies
– Tableau, Spotfire, QlikView, …
• Increased focus by the big players
– MicroStrategy improved Visual Insight
– SAP launched Visual Intelligence
– SAS launched Visual Analytics
– Microsoft bolstered PowerPivot with Power View
– IBM launched Cognos Insight
– Oracle acquired Endeca
Visual Analytics
• A recently coined term
– Information visualization + predictive analytics
• Information visualization
– Descriptive, backward focused
– “what happened” “what is happening”
• Predictive analytics
– Predictive, future focused
– “what will happen” “why will it happen”
• There is a strong move toward visual analytics