AIML Unit 2 Understanding Data
Machine Learning
S. Sridhar and M. Vijayalakshmi
Chapter 2: Understanding of Data
What is Data?
• Data are facts.
• Facts can be in the form of numbers, audio, video, and images.
• Data needs to be analyzed for taking decisions.
• Other forms of data include XML/JSON objects, RSS feeds, and hierarchical records.
Data Storage
❖ Flat Files
I. The simplest and most commonly available data source.
II. Data is stored in plain ASCII or EBCDIC format.
III. Minor changes to the data in flat files can affect the results of data mining algorithms.
IV. Suitable only for storing small datasets; not desirable as the dataset becomes larger.
• CSV files – CSV stands for comma-separated values: files in which the values are separated by commas. They are used by spreadsheet and database applications. The first row may hold the attribute names, and the remaining rows hold the data.
• TSV files – TSV stands for tab-separated values: files in which the values are separated by tabs. Both CSV and TSV files are generic in nature and can be shared easily. There are many tools, such as Google Sheets and Microsoft Excel, to process these files.
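The sketch below shows one way to read such files in Python using only the standard library csv module; the file names students.csv and students.tsv are hypothetical and assume a header row of attribute names.

import csv

# Read a CSV file whose first row holds the attribute names.
with open("students.csv", newline="") as f:
    reader = csv.DictReader(f)          # maps each row to {attribute: value}
    for row in reader:
        print(row)

# The same module handles TSV files by changing the delimiter:
with open("students.tsv", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        print(row)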
Data Storage
❖ Database System
I. Consists of database files and a database management system (DBMS).
II. The database files contain the original data and metadata.
III. The DBMS provides support facilities such as a database administrator, query processing, and a transaction manager.
Considerable time is spent on the collection of good quality data. 'Good data' is data that has the following properties:
1. Timeliness – the data should be current and not stale or obsolete.
2. Relevancy – the data should be relevant to the task at hand.
3. Knowledge about the data – the data should be understandable and interpretable.
1. Data Collection
The main sources of data are open/public data, social media data, and multimodal data.
1. Open or public data source – data made freely available, such as government and scientific datasets.
2. Social media – data generated by social media platforms.
3. Multimodal data – data that combines multiple forms, such as text, images, audio, and video.
2. Data Preprocessing
In the real world, the available data is 'dirty'. Dirty data includes:
• Incomplete data
• Inaccurate data
• Outlier data
• Data with missing values
• Data with inconsistent values
• Duplicate data
✓ Data preprocessing improves the quality of the data and, consequently, of the data mining results.
✓ Raw data must be preprocessed to give accurate results.
✓ The process of detection and removal of errors in data is called data cleaning.
✓ Making the data processable for ML algorithms is called data wrangling.
• Salary = ' ' is an example of incomplete data.
• The DoB of patients John, Andre, and Raju being absent is missing data.
• The age of David is '5' but his DoB is 10/10/1980 → inconsistent data.
Outliers are data objects whose characteristics are different from the rest of the data and that have very unusual values. An outlier might be a typographical error, and it is often required to distinguish between noise and outlier data.
Missing Data Analysis
1. Ignore the tuple.
2. Fill in the values manually.
3. Use a global constant, such as 'Unknown' or 'Infinity', to fill in the missing attribute values.
4. Fill in the missing value with the attribute mean.
5. Use the attribute mean of all samples belonging to the same class.
6. Use the most probable value to fill in the missing value.
Some of these strategies are sketched below.
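A minimal sketch of strategies 1, 3 and 4, assuming the pandas library is available; the patient table and its values are hypothetical.

import pandas as pd

# Hypothetical patient records with missing values (None becomes NaN).
df = pd.DataFrame({
    "Name":   ["John", "Andre", "Raju", "David"],
    "Age":    [34, None, 29, 41],
    "Salary": [50000, 62000, None, 58000],
})

# Strategy 1: ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# Strategy 3: fill missing values with a global constant.
constant_filled = df.fillna("Unknown")

# Strategy 4: fill a numeric attribute with the attribute mean.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())
print(df)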
Removal of Noisy or Outlier Data
1. Noise is a random error or variance in a measured value.
2. Noise can be removed using binning: the given data values are sorted and distributed into equal-frequency bins, also called buckets.
3. The binning method then uses the neighboring values to smooth the noisy data. Smoothing can be done in three ways:
❑ 'Smoothing by bin means' – the mean of the bin replaces all the values of the bin.
❑ 'Smoothing by bin medians' – the bin median replaces the bin values.
❑ 'Smoothing by bin boundaries' – each bin value is replaced by the closest bin boundary. The maximum and minimum values of a bin are called its bin boundaries.
Example
Example 2.1: Consider the following set: S = {12, 14, 19, 22, 24, 26, 28, 31, 32}. Apply the various binning techniques and show the result.
Solution: By the equal-frequency bin method, the data is distributed across bins. Assuming bins of size 3, the above data is distributed across the bins as shown below:
Bin 1 : 12, 14, 19
Bin 2 : 22, 24, 26
Bin 3 : 28, 31, 32
By the smoothing by bin means method, the values of each bin are replaced by the bin mean:
Bin 1 : 15, 15, 15
Bin 2 : 24, 24, 24
Bin 3 : 30.3, 30.3, 30.3
Using the smoothing by bin boundaries method, the bin values become:
Bin 1 : 12, 12, 19
Bin 2 : 22, 22, 26
Bin 3 : 28, 32, 32
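The same binning steps can be sketched in Python with the standard library alone; the code below reproduces the bin means and bin boundaries of Example 2.1 (ties between boundaries are resolved toward the lower boundary, matching the solution above).

# Equal-frequency binning with smoothing by means and by boundaries.
S = [12, 14, 19, 22, 24, 26, 28, 31, 32]
bin_size = 3

S_sorted = sorted(S)
bins = [S_sorted[i:i + bin_size] for i in range(0, len(S_sorted), bin_size)]

for i, b in enumerate(bins, start=1):
    mean = sum(b) / len(b)
    by_means = [round(mean, 1)] * len(b)
    # Smoothing by bin boundaries: replace each value by the closest boundary.
    lo, hi = b[0], b[-1]
    by_bounds = [lo if v - lo <= hi - v else hi for v in b]
    print(f"Bin {i}: {b} -> means {by_means}, boundaries {by_bounds}")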
Data Integration and Data Transformations
1. Data integration merges data from multiple sources into a single data source. This may lead to redundant data.
2. The goal of data integration is to detect and remove redundancies that arise from integration.
3. In normalization, the attribute values are scaled to fit into a range (say 0–1) to improve the performance of the data mining algorithm. Two common procedures are:
4. Min-Max normalization
5. z-Score normalization
Min-Max Procedure
1. Each value v of a variable V is normalized by taking its difference from the minimum value and dividing by the range, then scaling to a new range, say 0–1:
v' = ((v − min) / (max − min)) × (new_max − new_min) + new_min
Here, min and max are the minimum and maximum of the given data, and new_min and new_max are the minimum and maximum of the target range, say 0 and 1.
Consider the set: V = {88, 90, 92, 94}. Apply the Min-Max procedure and map the marks to the new range 0–1.
Solution: The minimum of the list V is 88 and the maximum is 94. The new_min and new_max are 0 and 1, respectively. The marks {88, 90, 92, 94} are mapped to the new range {0, 0.33, 0.67, 1}. Thus, the Min-Max normalization range is between 0 and 1.
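A minimal sketch of the Min-Max procedure in Python; min_max_normalize is a hypothetical helper name, applied here to the marks of the example above.

# Min-Max normalization to a target range (default 0-1).
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

marks = [88, 90, 92, 94]
print([round(v, 2) for v in min_max_normalize(marks)])
# [0.0, 0.33, 0.67, 1.0]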
z-Score Normalization
In z-score normalization, the difference between the field value and the mean value is scaled by the standard deviation of the attribute:
z = (x − μ) / σ
For example, the marks {10, 20, 30} have mean 20 and sample standard deviation 10. Hence, the z-scores of the marks 10, 20, 30 are −1, 0 and 1, respectively.
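A corresponding sketch of z-score normalization, using the sample standard deviation from the standard library statistics module:

import statistics

# z-score normalization: subtract the mean, divide by the standard deviation.
def z_score_normalize(values):
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)   # sample standard deviation (N - 1)
    return [(v - mu) / sigma for v in values]

print(z_score_normalize([10, 20, 30]))  # [-1.0, 0.0, 1.0]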
2.4 DESCRIPTIVE STATISTICS
1. Descriptive statistics is used to summarize and describe data.
Dataset and Data Types
1. A dataset can be viewed as a collection of data objects.
2. The data objects may be records, points, vectors, patterns, events, cases, samples or observations.
3. These records contain many attributes.
4. An attribute can be defined as a property or characteristic of an object.
1. Every attribute should be associated with a value. This process of assigning values is called measurement.
2. The type of attribute determines the data type, often referred to as the measurement scale type.
Data can broadly be classified as:
1. Categorical or qualitative data
2. Numerical or quantitative data
Categorical or Qualitative Data
• Nominal Data – e.g., patient ID.
1. Nominal data can be categorized but has no numerical value.
2. The nominal data type provides only labels and has no ordering among the data.
3. Only operations like (=, ≠) are meaningful for these data.
• Ordinal Data –
1. It provides enough information and has a natural order.
2. Fever = {Low, Medium, High} is ordinal data: certainly, low is less than medium and medium is less than high, irrespective of the underlying values.
3. Any order-preserving transformation can be applied to these data to get a new value.
Numeric or Quantitative Data
• Interval Data –
1. Interval data is numeric data for which the differences between values are meaningful.
2. For example, the difference between 30 degrees and 40 degrees is meaningful.
3. The only permissible operations are + and −.
• Ratio Data –
1. For ratio data, both differences and ratios are meaningful.
Another way of classifying the data is to classify it as:
1. Discrete Data – This kind of data is recorded as integers.
2. Continuous Data – This kind of data can be fitted into a range and may include a decimal point. For example, age is continuous data: though age appears to be discrete, one may be 12.5 years old and it makes sense. Patient height and weight are also continuous data.
It can be observed that the number of students with 22 marks is 2. The total number of students is 10. So, 2/10 × 100 = 20% of the space in a pie of 100% is allotted to the mark 22.
Histogram – shows frequency distributions. The histogram for students' marks {45, 60, 60, 80, 85} in the group ranges 0-25, 26-50, 51-75 and 76-100 is given in Figure 2.5. One can visually inspect from Figure 2.5 that the number of students in the range 76-100 is 2. A sketch of such a histogram follows below.
Dot Plots – less cluttered as compared to bar charts. The dot plot of English marks for five students with IDs {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given. The advantage is that, by visual inspection, one can find out who got more marks.
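A minimal sketch of the histogram described above, assuming the matplotlib library is available; it recreates the grouping of the marks into the four ranges (the original figure is not reproduced here).

import matplotlib.pyplot as plt

# Students' marks grouped into the ranges 0-25, 26-50, 51-75, 76-100.
marks = [45, 60, 60, 80, 85]
bin_edges = [0, 25, 50, 75, 100]

plt.hist(marks, bins=bin_edges, edgecolor="black")
plt.xlabel("Marks range")
plt.ylabel("Number of students")
plt.title("Frequency distribution of students' marks")
plt.show()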
2.5.2 Central Tendency
1. Mean – The arithmetic average (or mean) is a measure of central tendency that represents the 'center' of the dataset. Mathematically, the average of all the values in the sample (population) is denoted as x̄. Let x1, x2, …, xN be a set of 'N' values or observations; then the arithmetic mean is given as:
x̄ = (x1 + x2 + … + xN) / N
2. Median – For grouped data, the median class is the class where the (N/2)th item is present. The median is then given as:
Median = L1 + ((N/2 − cf) / f) × i
Here, i is the class interval of the median class, L1 is the lower limit of the median class, f is the frequency of the median class, and cf is the cumulative frequency of all classes preceding the median class.
3. Mode – The mode is the value that occurs most frequently in the dataset; the value that has the highest frequency is called the mode.
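All three measures can be computed with the standard library statistics module, as in this minimal sketch with the hypothetical marks used earlier:

import statistics

marks = [45, 60, 60, 80, 85]

print(statistics.mean(marks))    # arithmetic mean: 66
print(statistics.median(marks))  # middle value:    60
print(statistics.mode(marks))    # most frequent:   60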
2.5.3 Dispersion
The spread of a dataset around the central tendency (mean, median or mode) is called dispersion. Dispersion is represented in various ways, such as range, variance, standard deviation, and standard error. These are second-order measures.
1. Range – the difference between the maximum and minimum values of the given list of data.
2. Standard Deviation – the average distance from the mean of the dataset to each point. The formula for the standard deviation of a population is:
σ = sqrt( Σ (xi − μ)² / N )
Here, N is the size of the population, xi is an observation or value from the population, and μ is the population mean. Often, N − 1 is used instead of N in the denominator of Eq. (2.8); this gives the sample standard deviation, as sketched below.
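A minimal sketch contrasting the two denominators (N versus N − 1) with the standard library statistics module, using hypothetical marks:

import statistics

marks = [10, 20, 30]

print(statistics.pstdev(marks))  # population std (N in denominator):  ~8.16
print(statistics.stdev(marks))   # sample std (N - 1 in denominator):  10.0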
Quartiles and Inter Quartile Range
1. Percentiles describe data in terms of the percentage of the total values that lie at or below a given value.
2. The kth percentile Xi has the property that k% of the data lies at or below Xi.
3. For example, the median is the 50th percentile and can be denoted as Q0.50. The 25th percentile is called the first quartile (Q1) and the 75th percentile is called the third quartile (Q3).
4. The Inter Quartile Range (IQR) is the difference between Q3 and Q1, that is, IQR = Q0.75 − Q0.25.
5. Outliers are normally the values falling at least 1.5 × IQR above the third quartile or below the first quartile.
Example 2.4: For the patients' age list {12, 14, 19, 22, 24, 26, 28, 31, 34}, find the IQR.
Solution: The median is in the fifth position; here, 24 is the median. The first quartile is the median of the scores below the median, i.e., of {12, 14, 19, 22}. In this case, it is the average of the second and third values, that is, Q0.25 = (14 + 19)/2 = 16.5.
Similarly, the third quartile is the median of the values above the median, that is, of {26, 28, 31, 34}. So, Q0.75 is the average of the seventh and eighth scores of the full list, that is, (28 + 31)/2 = 59/2 = 29.5.
Hence, the IQR using Eq. (2.10) is:
IQR = Q0.75 − Q0.25 = 29.5 − 16.5 = 13
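A minimal sketch reproducing Example 2.4 with the standard library; statistics.quantiles with its default (exclusive) method yields exactly these quartiles.

import statistics

ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]

q1, q2, q3 = statistics.quantiles(ages, n=4)  # quartile cut points
iqr = q3 - q1

print(q1, q2, q3)  # 16.5 24.0 29.5
print(iqr)         # 13.0

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in ages if x < lower or x > upper]
print(outliers)    # [] - no outliers in this list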
Five-point Summary and Box Plots – The median, the quartiles Q1 and Q3, and the minimum and maximum, written in the order <Minimum, Q1, Median, Q3, Maximum>, are known as the five-point summary.
Example 2.5: Find the 5-point summary of the list {13, 11, 2, 3, 4, 8, 9}.
Solution: The minimum is 2 and the maximum is 13. Q1, Q2 and Q3 are 3, 8 and 11, respectively. Hence, the 5-point summary is {2, 3, 8, 11, 13}, that is, {minimum, Q1, median, Q3, maximum}. Box plots are useful for visualizing the 5-point summary. The box plot for the set is given in Figure 2.7.
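A minimal sketch computing the five-point summary of Example 2.5; the optional box plot at the end assumes matplotlib is available.

import statistics

data = [13, 11, 2, 3, 4, 8, 9]

q1, median, q3 = statistics.quantiles(data, n=4)
summary = (min(data), q1, median, q3, max(data))
print(summary)  # (2, 3.0, 8.0, 11.0, 13)

# A box plot visualizes the same five numbers:
import matplotlib.pyplot as plt
plt.boxplot(data)
plt.show()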
2.5.4 Shape
Skewness and kurtosis (called moments) indicate the symmetry/asymmetry and peak location of the dataset.
Skewness
The measures of the direction and degree of symmetry are called measures of third order. Ideally, skewness should be zero, as in an ideal normal distribution. More often, the given dataset may not have perfect symmetry (consider Figure 2.8).
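A minimal sketch, assuming SciPy is available, showing how skewness and kurtosis behave on a symmetric versus a right-skewed hypothetical dataset:

from scipy.stats import skew, kurtosis

symmetric = [10, 20, 30, 40, 50]
right_skewed = [10, 11, 12, 13, 50]

print(skew(symmetric))      # 0.0 - perfectly symmetric data
print(skew(right_skewed))   # > 0 - long tail on the right
print(kurtosis(symmetric))  # excess kurtosis relative to a normal peak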
Mean Absolute Deviation (MAD) and Coefficient of Variation (CV)