Lesson4 Data

This document discusses data, information, and knowledge. It defines data as raw facts and figures that require context and processing to become meaningful information. Information answers questions by analyzing data. Knowledge involves understanding relationships between information and determining appropriate actions. The document provides examples of how data is transformed into information and knowledge. It also outlines key aspects of ensuring data quality and preprocessing data for analytics.

Uploaded by

Halah Aftab

MIS 272

Artificial Intelligence for Business

Module 1: AI Conception

Lesson 2: Data, Context, Information, Knowledge
Student Learning Outcomes

2.1 What is Data?

2.2 Data to Information to Knowledge

2.3 Data Quality

2.4 Descriptive Statistics

2.5 Data Visualization


Why Look at Data?
• To answer a question
▪ What percentage of college players make it to the PL?
❖ Answer: 0.2%
▪ How many people died last year because of
preventable medical errors?
❖ Answer: 44,000 to 98,000
• To explore
▪ You may find a question along the way
• Keep an open mind for:
▪ Unusual patterns, results
▪ "Aha!" moments
Think about a good question to ask. Examples?
What is Data?
• Raw facts and figures
• No meaning until it is processed and given a context
• Data point: a value assigned to a thing
• Example: What values can we assign to the picture?
▪ What is it?: Football
▪ Category: Sport
▪ Color(s): Black & White
▪ Condition: New, Used
▪ Size: 23 cm (Dia); 70 cm (Circ.)
▪ Weight: 440 g
▪ Price: 100 SR
What about something else? Human? Juice?
Types of Data
• Qualitative data: data that is not expressed as a number
▪ Feelings, emotions (Facebook status, Tweet)
▪ Interview transcript
▪ Description of colors, texture, taste, etc.

• Quantitative data: information measured and written down with numbers
▪ Score on a test
▪ Number of laces on a football
▪ Price of item, size of item

Examples: which data is qualitative and which is quantitative?


Unstructured vs. Structured data

Unstructured Data
• Information that doesn't reside in a traditional row-column database
• Sentence: “I have 5 used brown footballs with a circumference of 22 inches and a length of 11 inches.”
• No fixed underlying structure
• Ex: PDFs, images (.png), e-mails, videos
• 80% to 90% of data in an organization

Structured Data
• Resides in a fixed field within a record or file
• Ex: CSV structured format of football
"quantity","color","condition","item","category","circumference(in)","price per unit (USD)"
5,"brown","used","ball","football",22,11
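As a rough illustration of what "resides in a fixed field" means in practice, the sketch below reads the slide's CSV example with Python's standard `csv` module. The header and row are copied from the slide as given; the variable names are assumptions made for this example.

```python
import csv
import io

# Header and row copied from the slide's structured-data example.
csv_text = (
    '"quantity","color","condition","item","category",'
    '"circumference(in)","price per unit (USD)"\n'
    '5,"brown","used","ball","football",22,11\n'
)

# DictReader maps every value to the fixed field name in the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
record = rows[0]

# Each value can be looked up by field name, with no text parsing needed.
print(record["color"])
print(int(record["circumference(in)"]))
```

Pulling the same values out of the unstructured sentence would need text parsing or pattern matching, because the sentence has no fixed fields.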
Data to Information to Knowledge
Example 1
• Remember, data has no meaning
until it is given a context and
processed
• Then, it can be deemed as
information
• What do these numbers mean?
• 51, 77, 58, 82, 64, 70
• Needs a context
• Test scores achieved by students
• Needs to be processed in some form
(i.e., needs an interesting and practical
question)
• Ex: what is the average test score?
• Information: the average test score for
the students is 67%
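The processing step in this example can be sketched in a few lines of Python. The numbers are the six test scores from the slide; the variable names are assumptions made for illustration.

```python
# Data: the raw numbers from the slide (context: student test scores).
scores = [51, 77, 58, 82, 64, 70]

# Processing: answer the question "what is the average test score?"
average = sum(scores) / len(scores)

# Information: the processed result, stated in context.
print(f"Average test score: {average:.0f}%")  # Average test score: 67%
```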
Data to Information to Knowledge
Example 2
• What does this text mean?
• chocolate, strawberry, vanilla, strawberry,
vanilla, vanilla, strawberry, vanilla, vanilla
• Needs a context
• Tubs of ice cream sold yesterday
• Needs to be processed in some form
• Needs an interesting and practical question
• Needs to be analyzed
• Ex: what was the most popular flavor sold
yesterday
• Information: we create a chart from the data
to show which was the most popular flavor
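A minimal sketch of the same steps in Python, using the flavor list from the slide. A simple text bar chart stands in for the chart the slide mentions; the variable names are assumptions made for illustration.

```python
from collections import Counter

# Data: tubs of ice cream sold yesterday (from the slide).
sales = [
    "chocolate", "strawberry", "vanilla", "strawberry", "vanilla",
    "vanilla", "strawberry", "vanilla", "vanilla",
]

# Processing: count how many tubs of each flavor were sold.
counts = Counter(sales)

# A plain-text bar chart, most popular flavor first.
for name, sold_count in counts.most_common():
    print(f"{name:<10} {'#' * sold_count}")

# Information: the most popular flavor and its share of sales.
flavor, sold = counts.most_common(1)[0]
share = sold / len(sales)
print(f"{flavor}: {sold} of {len(sales)} tubs ({share:.0%})")  # vanilla: 5 of 9 tubs (56%)
```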
Data to Information to Knowledge
Example 1
• Knowledge: the capability of understanding the relationship between pieces of information and what to do with the information
• Simple definition: what to do with the information
Example
• Data: 51, 77, 58, 82, 64, 70
• Context: Student test scores
• Processing (analysis): what is the average? Let’s get it.
• Information: 67% is the average
• Knowledge: what would/could you do?
Data to Information to Knowledge
Example 2
• Knowledge: the capability of understanding the relationship between pieces of information and what to do with the information
• Simple definition: what to do with the information
Example
• Data: chocolate, strawberry, vanilla, strawberry, vanilla, vanilla, strawberry, vanilla, vanilla
• Context: Tubs of ice cream sold yesterday
• Processing (analysis): what was the most popular flavor sold yesterday?
• Information: vanilla @ 56%
• Knowledge: what would/could you do?
Where Does Knowledge Come From?

• Experience
• Training
• Manuals
• Procedures
• Other People
• Where else?
Recap
• “Knowledge is the factor that
allows you to take effective action.
It allows you to make the right
decision and to do the right
thing.” Nick Milton
• Knowledge provides capability
and know-how
• Cannot have knowledge without
data, context, and a story to tell
Metrics for Analytics Ready Data
▪ Data source reliability: where did it come from?
▪ Data content accuracy: is it correct and a good match?
▪ Data accessibility: can we easily get to it when needed?
▪ Data security and data privacy
▪ Data richness: Comprehensiveness
▪ Data consistency
▪ Data currency/data timeliness
▪ Data granularity: lowest level of detail
▪ Data validity: Acceptable values
▪ Data relevancy
Data Quality Activity
The Art and Science of Data Preprocessing

▪ Real-world data is dirty, misaligned, overly complex, and inaccurate
▪ Not ready for analytics!
▪ Art – it develops and improves with experience
▪ Readying the data for analytics is needed
▪ Data preprocessing
❖ Data consolidation
❖ Data cleaning
❖ Data transformation
❖ Data reduction
The Art and Science of Data Preprocessing

• Data reduction

1. Variables
• Dimensional reduction
• Variable selection

2. Cases/samples
• Sampling
• Balancing / stratification
Data Preprocessing Tasks and Methods (1 of 3)
Table 2.1 A Summary of Data Preprocessing Tasks and Potential Methods
Main Task | Subtasks | Popular Methods
Data consolidation | Access and collect the data | SQL queries, software agents, Web services.
Data consolidation | Select and filter the data | Domain expertise, SQL queries, statistical tests.
Data consolidation | Integrate and unify the data | SQL queries, domain expertise, ontology-driven data mapping.
Data cleaning | Handle missing values in the data | Fill in missing values (imputations) with most appropriate values (mean, median, min/max, mode, etc.); recode the missing values with a constant such as “ML”; remove the record of the missing value; do nothing.
Data cleaning | Identify and reduce noise in the data | Identify the outliers in data with simple statistical techniques (such as averages and standard deviations) or with cluster analysis; once identified, either remove the outliers or smooth them by using binning, regression, or simple averages.
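Two of the cleaning methods in the table, imputation and outlier detection with standard deviations, can be sketched as follows. The values, the choice of the median for imputation, and the two-standard-deviation cutoff are all assumptions made for this example; the table does not prescribe specific settings.

```python
import statistics

# Hypothetical column with one missing value (None) and one extreme value.
values = [12.0, 15.0, None, 14.0, 13.0, 95.0, 16.0]

# Imputation: fill the missing value with the median of the observed values.
observed = [v for v in values if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in values]

# Outlier check: flag values more than 2 standard deviations from the mean.
# The cutoff of 2 is an assumption; it is a judgment call for each data set.
mean = statistics.mean(imputed)
stdev = statistics.stdev(imputed)
outliers = [v for v in imputed if abs(v - mean) > 2 * stdev]

print("Imputed column:", imputed)
print("Flagged outliers:", outliers)
```

The median is used here because, unlike the mean, it is barely moved by the extreme value 95.0. Once flagged, an outlier can be removed or smoothed, as the table notes.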
Data Preprocessing Tasks and Methods (2 of 3)
Main Task | Subtasks | Popular Methods
Data cleaning | Find and eliminate erroneous data | Identify the erroneous values in data (other than outliers), such as odd values, inconsistent class labels, odd distributions; once identified, use domain expertise to correct the values or remove the records holding the erroneous values.
Data transformation | Normalize the data | Reduce the range of values in each numerically valued variable to a standard range (e.g., 0 to 1 or -1 to +1) by using a variety of normalization or scaling techniques.
Data transformation | Discretize or aggregate the data | If needed, convert the numeric variables into discrete representations using range- or frequency-based binning techniques; for categorical variables, reduce the number of values by applying proper concept hierarchies.
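The two transformation steps in the table can be illustrated with a short sketch: min-max normalization to the range 0 to 1, followed by range-based (equal-width) binning. The sample values and the choice of three bins are assumptions made for this example.

```python
# Hypothetical numeric variable.
values = [20, 35, 50, 65, 80]

# Normalization: rescale every value to the standard range 0 to 1.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Discretization: range-based binning into equal-width bins.
n_bins = 3
width = (hi - lo) / n_bins
# The maximum value would land in bin n_bins, so it is clamped to the last bin.
bins = [min(int((v - lo) / width), n_bins - 1) for v in values]

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(bins)        # [0, 0, 1, 2, 2]
```

Frequency-based binning would choose the bin edges so that each bin holds roughly the same number of records, instead of covering the same width.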
Data Preprocessing Tasks and Methods (3 of 3)
Main Task | Subtasks | Popular Methods
Data transformation | Construct new attributes | Derive new and more informative variables from the existing ones using a wide range of mathematical functions (as simple as addition and multiplication or as complex as a hybrid combination of log transformations).
Data reduction | Reduce number of attributes | Principal component analysis, independent component analysis, chi-square testing, correlation analysis, and decision tree induction.
Data reduction | Reduce number of records | Random sampling, stratified sampling, expert-knowledge-driven purposeful sampling.
Data reduction | Balance skewed data | Oversample the less represented or undersample the more represented classes.
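Balancing skewed data by oversampling can be sketched as below: records from the less represented class are drawn again at random (with replacement) until every class is as large as the biggest one. The labels, the 8-to-2 split, and the fixed random seed are assumptions made for this example.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed data set: 8 "no" records and 2 "yes" records.
labels = ["no"] * 8 + ["yes"] * 2
records = list(enumerate(labels))  # (record id, class label)

counts = Counter(label for _, label in records)
majority_size = max(counts.values())

# Oversample: add random copies of each smaller class until it matches
# the size of the largest class.
balanced = list(records)
for label, size in counts.items():
    pool = [r for r in records if r[1] == label]
    balanced.extend(random.choices(pool, k=majority_size - size))

print(Counter(label for _, label in balanced))
```

Undersampling works the other way round: it randomly drops records from the more represented class, which loses data but avoids duplicated records.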
Descriptive Statistics Measures of Central Tendency
• Arithmetic mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Median
– The number in the middle
• Mode
– The most frequent observation
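The three measures can be tried out with Python's standard `statistics` module. The sketch reuses the test scores and ice cream flavors from the earlier examples in this lesson; the variable names are assumptions made for illustration.

```python
import statistics

scores = [51, 77, 58, 82, 64, 70]
flavors = [
    "chocolate", "strawberry", "vanilla", "strawberry", "vanilla",
    "vanilla", "strawberry", "vanilla", "vanilla",
]

# Arithmetic mean: the sum of the values divided by how many there are.
mean_score = statistics.mean(scores)

# Median: the middle value of the sorted data. With an even number of
# values, it is the average of the two middle values (64 and 70 here).
median_score = statistics.median(scores)

# Mode: the most frequent observation. It also works for qualitative data.
top_flavor = statistics.mode(flavors)

print(mean_score, median_score, top_flavor)
```

For these scores the mean and the median are both 67, and the mode of the flavor list is vanilla.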
Descriptive Statistics Measures of Dispersion (1 of 2)
• Dispersion
– Degree of variation in a given variable
• Range
– Max - Min
• Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Standard Deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
• Mean Absolute Deviation (MAD)
– Average absolute deviation from the mean
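A short sketch of these dispersion measures, again using the test scores from the earlier example. `statistics.variance` and `statistics.stdev` use the sample form with n - 1 in the denominator; the variable names are assumptions made for illustration.

```python
import statistics

scores = [51, 77, 58, 82, 64, 70]

# Range: max - min.
value_range = max(scores) - min(scores)

# Sample variance and standard deviation (n - 1 in the denominator).
variance = statistics.variance(scores)
std_dev = statistics.stdev(scores)

# Mean absolute deviation: average absolute distance from the mean.
mean = statistics.mean(scores)
mad = sum(abs(x - mean) for x in scores) / len(scores)

print(f"range={value_range}, variance={variance}, "
      f"std dev={std_dev:.2f}, MAD={mad:.2f}")
```

The squared deviations from the mean of 67 add up to 680, so the variance is 680 / 5 = 136 and the standard deviation is about 11.66.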
Descriptive Statistics Measures of Dispersion (2 of 2)

▪ Quartiles
▪ Box-and-Whiskers Plot
▪ a.k.a. box-plot
▪ Versatile / informative
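The numbers behind a box-and-whiskers plot can be computed with `statistics.quantiles`. This is a sketch on the earlier test scores; note that several quartile conventions exist, and the library's default ("exclusive") method is only one of them, so other tools may report slightly different values.

```python
import statistics

scores = [51, 77, 58, 82, 64, 70]

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(scores, n=4)

# Interquartile range: the height of the "box" in a box plot.
iqr = q3 - q1

# Five-number summary that a box plot draws.
print(min(scores), q1, q2, q3, max(scores))
print("IQR:", iqr)
```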
Descriptive Statistics Shape of a Distribution
• Histogram – frequency chart
• Skewness
– Measure of asymmetry
$$\text{Skewness} = S = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n - 1)\, s^3}$$
• Kurtosis
– Peak/tall/skinny nature of the distribution
$$\text{Kurtosis} = K = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n\, s^4} - 3$$
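The two shape formulas can be written directly as small Python functions. This sketch assumes that s is the sample standard deviation defined on the dispersion slide (n - 1 in the denominator); the function names are chosen for illustration. Statistical packages often use slightly different variants of these formulas, so their results may not match exactly.

```python
import statistics


def skewness(data):
    """Skewness as on the slide: sum of cubed deviations / ((n - 1) * s^3)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (assumption)
    return sum((x - mean) ** 3 for x in data) / ((n - 1) * s ** 3)


def kurtosis(data):
    """Kurtosis as on the slide: sum of 4th-power deviations / (n * s^4), minus 3."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (assumption)
    return sum((x - mean) ** 4 for x in data) / (n * s ** 4) - 3


scores = [51, 77, 58, 82, 64, 70]
print(f"skewness={skewness(scores):.3f}, kurtosis={kurtosis(scores):.3f}")

# A perfectly symmetric data set has zero skewness.
print(skewness([1, 2, 3, 4, 5]))
```

A negative skewness means the longer tail is on the left; a negative kurtosis under this formula means the distribution is flatter than a normal distribution.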
Relationship Between Dispersion and Shape Properties
Data Visualization
• “The use of visual representations to explore,
make sense of, and communicate data.”
• Data visualization vs. Information visualization
• Information = aggregation, summarization, and
contextualization of data
• Related to information graphics, scientific
visualization, and statistical graphics
• Often includes charts, graphs, illustrations, …
A Brief History of Data Visualization
• Data visualization can date back to the second century AD
• Most developments have occurred in the last two and a half
centuries
• Until recently it was not recognized as a discipline
• Today’s most popular visual forms date back a few centuries
Which Chart or Graph Should You Use?
The Emergence of Data Visualization and Visual Analytics
• Emergence of new companies
– Tableau, Spotfire, QlikView, …
• Increased focus by the big players
– MicroStrategy improved Visual Insight
– SAP launched Visual Intelligence
– SAS launched Visual Analytics
– Microsoft bolstered PowerPivot with Power View
– IBM launched Cognos Insight
– Oracle acquired Endeca
Visual Analytics
• A recently coined term
– Information visualization + predictive analytics
• Information visualization
– Descriptive, backward focused
– “what happened” “what is happening”
• Predictive analytics
– Predictive, future focused
– “what will happen” “why will it happen”
• There is a strong move toward visual analytics
Copyright
This work is protected by United States copyright laws and is
provided solely for the use of instructors in teaching their
courses and assessing student learning. Dissemination or sale of
any part of this work (including on the World Wide Web) will
destroy the integrity of the work and is not permitted. The work
and materials from it should never be made available to students
except by instructors using the accompanying text in their
classes. All recipients of this work are expected to abide by these
restrictions and to honor the intended pedagogical purposes and
the needs of other instructors who rely on these materials.
