Understanding Data-1
Introduction
All facts are data. Data can be directly human-interpretable, or diffused data, such as images or video, that can be interpreted only by a computer. A bit is either 0 or 1, and a byte is 8 bits. A kilobyte (KB) is 1024 bytes, one megabyte (MB) is approximately 1000 KB, one gigabyte (GB) is approximately 1,000,000 KB, 1000 gigabytes make one terabyte (TB), and 1,000,000 terabytes make one exabyte (EB).
Data is available in different data sources like flat files, databases, or data warehouses. It can be either operational or non-operational data. Operational data is the data encountered in normal business procedures and processes.
Data whose volume is small and which can be stored and processed by a small-scale computer is called 'small data'. These data are collected from several sources, and integrated and processed by a small-scale computer. Big Data, on the other hand, is data whose volume is much larger than 'small data', and is characterized as follows:
1. Volume – Since there has been a reduction in the cost of storage devices, there has been a tremendous growth of data. Small traditional data is measured in terms of gigabytes (GB) and terabytes (TB), but Big Data is measured in terms of petabytes (PB) and exabytes (EB). One exabyte is 1 million terabytes.
2. Velocity – The fast arrival speed of data and its rate of increase in volume is noted as velocity. The availability of IoT devices and Internet connectivity ensures that data arrives at a faster rate. Velocity helps to understand the relative growth of Big Data and its accessibility by users, systems and applications.
3. Variety – Variety deals with the forms, functions, and sources of data:
Form – There are many forms of data. Data types range from text, graph, audio, and video to maps. There can be composite data too, where one media type contains other sources of data; for example, a video can contain an audio song.
Function – These are data from various sources like human conversations, transaction records, and old archived data.
Source of data – This is the third aspect of variety. There are many sources of data. Broadly, the data source can be classified as open/public data, social media data and multimodal data. These are discussed in Section 2.3.1 of this chapter.
Some of the other Vs that are often quoted in the literature as characteristics of Big Data are:
4. Veracity of data – Veracity deals with aspects like conformity to the facts, truthfulness, believability, and confidence in data. There may be many sources of error, such as technical errors, typographical errors, and human errors. So, veracity is one of the most important aspects of data.
5. Validity – Validity is the accuracy of the data for taking decisions or for any other goals needed by the given problem.
MODULE-1 Understanding Data -1
6. Value – Value is the characteristic of Big Data that indicates the value of the information that is extracted from the data and its influence on the decisions that are taken based on it.
Thus, these 6 Vs help to characterize Big Data. The data quality of numerical attributes is determined by factors like precision, bias, and accuracy. Precision is defined as the closeness of repeated measurements; often, the standard deviation is used to measure precision. Bias is a systematic error that results from erroneous assumptions of the algorithms or procedures. Accuracy refers to the closeness of measurements to the true value of the quantity.
Types of Data
In Big Data, there are three kinds of data. They are structured data, unstructured data, and semi-
structured data.
Structured Data
In structured data, data is stored in an organized manner such as a database where it is available in the
form of a table. The data can also be retrieved in an organized manner using tools like SQL.
The structured data frequently encountered in machine learning are listed below:
Record Data – A dataset is a collection of measurements taken from a process. We have a collection of objects in a dataset, and each object has a set of measurements. The measurements can be arranged in the form of a matrix. Rows in the matrix represent objects and can be called entities, cases, or records. The columns of the dataset are called attributes, features, or fields. The table is filled with observed data. It is also useful to note the general jargon associated with datasets: label is the term used to describe an individual observation.
Data Matrix It is a variation of the record type because it consists of numeric attributes. The standard
matrix operations can be applied on these data. The data is thought of as points or vectors in the
multidimensional space where every attribute is a dimension describing the object.
Graph Data – It involves relations among objects. For example, a web page can refer to another web page. This can be modelled as a graph, where the nodes are web pages and the hyperlinks are the edges that connect the nodes.
Ordered Data Ordered data objects involve attributes that have an implicit order among them.
1. Temporal data – It is data whose attributes are associated with time. For example, customer purchasing patterns during festival time are sequential data. Time series data is a special type of sequence data where the data is a series of measurements over time.
2. Sequence data- It is like sequential data but does not have time stamps. This data involves the
sequence of words or letters. For example, DNA data is a sequence of four characters- A T G C.
3. Spatial data- It has attributes such as positions or areas. For example, maps are spatial data where
the points are related by location.
Unstructured Data
Unstructured data includes video, image, and audio. It also includes textual documents, programs, and
blog data. It is estimated that 80% of the data are unstructured data.
Semi-Structured Data
Semi-structured data are partially structured and partially unstructured. These include data like
XML/JSON data, RSS feeds, and hierarchical data.
Once the dataset is assembled, it must be stored in a structure that is suitable for data analysis. The goal
of data storage management is to make data available for analysis. There are different approaches to
organize and manage data in storage files and systems from flat file to data warehouses. Some of them
are listed below:
Flat Files – These are the simplest and most commonly available data sources, and also the cheapest way of organizing data. Flat files are files where data is stored in plain ASCII or EBCDIC format. Minor changes of data in flat files affect the results of the data mining algorithms. Hence, flat files are suitable only for storing small datasets and are not desirable if the dataset becomes larger.
CSV files- CSV stands for comma-separated value files where the values are separated by
commas. These are used by spreadsheet and database applications. The first row may have
attributes and the rest of the rows represent data.
TSV files- TSV stands for Tab separated values files where values are separated by Tab.
Both CSV and TSV files are generic in nature and can be shared. There are many tools like
Google Sheets and Microsoft Excel to process these files.
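As a minimal sketch, CSV and TSV files can also be processed programmatically with Python's standard csv module; only the delimiter differs between the two formats. The marks data below is illustrative, not from any real file:

```python
import csv
import io

# A small CSV sample. In practice this text would come from a file,
# e.g. open("marks.csv") -- the filename is hypothetical.
csv_text = "StudentID,Marks\n1,45\n2,60\n3,60\n4,80\n5,85\n"

# The first row holds the attribute names; the remaining rows are data.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [{"StudentID": int(r["StudentID"]), "Marks": int(r["Marks"])}
        for r in reader]

# A TSV file differs only in using Tab as the delimiter.
tsv_text = csv_text.replace(",", "\t")
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
```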
Database System: It normally consists of database files and a database management system (DBMS). Database files contain the original data and metadata. The DBMS manages the data and improves performance through components such as database administration tools, the query processor, and the transaction manager. A relational database consists of a set of tables with rows and columns; the columns represent attributes and the rows represent tuples. A tuple corresponds to either an object or a relationship between objects. A user can access and manipulate the data in the database using SQL.
World Wide Web (WWW) It provides a diverse, worldwide online information source. The objective
of data mining algorithms is to mine interesting patterns of information present in WWW.
XML (eXtensible Markup Language) It is both human and machine interpretable data format that
can be used to represent data that needs to be shared across the platforms.
Data Stream – It is dynamic data, which flows in and out of the observing environment. Typical characteristics of a data stream are huge volume, dynamic nature, fixed-order movement, and real-time constraints.
RSS (Really Simple Syndication) It is a format for sharing instant feeds across services.
JSON (JavaScript Object Notation) It is another useful data interchange format that is often used for
many machine learning algorithms.
For performing data analytics, many frameworks have been proposed. All proposed analytics frameworks have some common factors. A Big Data framework is a layered architecture with the following layers:
Data Connection Layer: It has data ingestion mechanisms and data connectors. Data ingestion means taking raw data and importing it into appropriate data structures. It performs the tasks of the ETL (extract, transform and load) process.
Data Management Layer: It performs preprocessing of data. The purpose of this layer is to allow
parallel execution of queries, and read, write and data management tasks. There may be many
schemes that can be implemented by this layer such as data-in-place, where the data is not moved at
all, or constructing data repositories such as data warehouses and pull data on-demand mechanisms.
Data Analytic Layer: It has many functionalities such as statistical tests, machine learning algorithms
to understand, and construction of machine learning models. This layer implements many model
validation mechanisms too.
Presentation Layer: It has mechanisms such as dashboards, and applications that display the results of
analytical engines and machine learning algorithms.
Thus, the Big Data processing cycle involves data management that consists of the following steps.
1. Data collection
2. Data preprocessing
3. Applications of machine learning algorithm
4. Interpretation of results and visualization of machine learning algorithm
Data Collection
The first task is the collection of data. It is often estimated that most of the time is spent on collecting good-quality data, since good-quality data yields better results. It is often difficult to characterize 'good data'; broadly, 'good data' is data that has the following properties:
1. Timeliness- The data should be relevant and not stale or obsolete data.
2. Relevancy- The data should be relevant and ready for the machine learning or data mining
algorithms. All the necessary information should be available and there should be no bias in the
data.
3. Knowledge about the data- The data should be understandable and interpretable, and should be
self-sufficient for the required application as desired by the domain knowledge engineer.
Broadly, the data source can be classified as open/public data, social media data and multimodal
data.
1. Open or public data source – It is a data source that does not have any stringent copyright rules or restrictions. Its data can be freely used for many purposes. Government census data are good examples of open data. Other examples include:
Digital libraries that have huge amount of text data as well as document images
Scientific domains with huge collection of experimental data like genomic data and
biological data
Healthcare systems that use extensive databases like patient databases, health
insurance data, doctors’ information, and bioinformatics information.
2. Social media- It is the data that is generated by various social media platforms like Twitter,
Facebook, YouTube, and Instagram. An enormous amount of data is generated by these
platforms.
3. Multimodal data- It includes data that involves many modes such as text, video, audio and mixed
types. Some of them are listed below:
Image archives that contain large image databases along with numeric and text data
The World Wide Web (WWW) has huge amount of data that is distributed on the
Internet.
Data Preprocessing
Data preprocessing improves the quality of the data for the data mining techniques. The raw data must be preprocessed to give accurate results. The process of detection and removal of errors in data is called data cleaning. Data wrangling means making the data processable for machine learning algorithms. Some of the data errors include human errors, such as typographical errors or incorrect measurements, and structural errors, like improper data formats. Data errors can also arise from the omission and duplication of attributes. Noise is a random component and involves the distortion of a value or the introduction of spurious objects. Often, the term noise is used when the data has a spatial or temporal component. Certain deterministic distortions in the form of a streak are known as artifacts.
Consider, for example, the patient records shown in Table 2.1. 'Bad' or 'dirty' data can be observed in this table.
It can be observed that data like Salary = ' ' is incomplete data. The DoB of the patients John, Andre, and Raju is missing data. The age of David is recorded as '5', but his DoB indicates 10/10/1980. This is called inconsistent data.
Inconsistent data occurs due to problems in conversions, inconsistent formats, and difference in
units. Salary for John is -1500. It cannot be less than ‘0’. It is an instance of noisy data. Outliers are data
that exhibit the characteristics that are different from other data and have very unusual values. The age
of Raju cannot be 136. It might be a typographical error. It is often required to distinguish between noise
and outlier data.
Outliers may be legitimate data and sometimes are of interest to the data mining algorithms. These
errors often come during data collection stage. These must be removed so that machine learning
algorithms yield better results as the quality of results is determined by the quality of input data. This
removal process is called data cleaning.
The primary data cleaning process is missing data analysis. Data cleaning routines attempt to fill in the missing values, smooth the noise while identifying the outliers, and correct inconsistencies in the data. This enables data mining to avoid overfitting of the models.
The procedures given below can solve the problem of missing data:
1. Ignore the tuple- A tuple with missing data, especially the class label, is ignored. This method
is not effective when the percentage of missing values increases.
2. Fill in the values manually – Here, the domain expert can analyse the data tables and carry
out the analysis and fill in the values manually. But, this is time consuming and may not be
feasible for larger sets.
3. A global constant can be used to fill in the missing attributes. The missing values may be labelled 'Unknown' or 'Infinity'. But some data mining algorithms may give spurious results by analysing these labels.
4. The attribute value may be filled by the average value. Say, the average income can replace a
missing value.
5. Use the attribute mean for all samples belonging to the same class. Here, the average value replaces the missing values of all tuples that fall in this group.
6. Use the most probable value to fill in the missing value. The most probable value can be obtained from other methods like classification and decision tree prediction.
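The attribute-mean method (method 4 above) can be sketched in a few lines of Python; the income values here are hypothetical, with missing entries recorded as None:

```python
from statistics import mean

# Hypothetical income column with two missing entries.
incomes = [1500, None, 2400, 3000, None, 2100]

# Replace each missing value with the mean of the observed values.
observed = [x for x in incomes if x is not None]
fill = mean(observed)  # (1500 + 2400 + 3000 + 2100) / 4 = 2250
filled = [x if x is not None else fill for x in incomes]
```

Note that this introduces the estimation bias discussed next: the filled value is only an estimate of the true, unobserved value.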
Some of these methods introduce bias in the data, as the filled value may not be correct and could be just an estimated value. The difference between the estimated value and the original value is called an error or bias.
Noise is a random error or variance in a measured value. It can be removed by using binning, a method where the given data values are sorted and distributed into equal-frequency bins, also called buckets. The binning method then uses the neighbouring values to smooth the noisy data.
Some commonly used techniques are 'smoothing by bin means', where every value in a bin is replaced by the bin mean; 'smoothing by bin medians', where every value is replaced by the bin median; and 'smoothing by bin boundaries', where each value is replaced by the closest bin boundary. The maximum and minimum values of a bin are called its bin boundaries. Binning methods may also be used as a discretization technique. Example 2.1 illustrates this principle.
Example 2.1: Consider the following set: S = {12, 14, 19, 22, 24, 26, 28, 31, 34}. Apply various
binning techniques and show the result.
Solution: By the equal-frequency bin method, the data should be distributed across bins. Assuming bins of size 3, the data is distributed across the bins as shown below:
Bin 1 : 12, 14, 19
Bin 2 : 22, 24, 26
Bin 3 : 28, 31, 34
By the smoothing by bin means method, every value is replaced by its bin mean. This results in:
Bin 1 : 15, 15, 15
Bin 2 : 24, 24, 24
Bin 3 : 31, 31, 31
Using the smoothing by bin boundaries method, the bins' values would be:
Bin 1 : 12, 12, 19
Bin 2 : 22, 22, 26
Bin 3 : 28, 28, 34
As per this method, the minimum and maximum values of each bin are determined; they serve as the bin boundaries and do not change. The rest of the values are transformed to the nearest boundary. It can be observed that in Bin 1, the middle value 14 is compared with the boundary values 12 and 19 and changed to the closest value, that is 12. In Bin 2 and Bin 3, the middle values 24 and 31 are equidistant from both boundaries, and the lower boundary is chosen here. This process is repeated for all bins.
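These steps can be reproduced with a short Python sketch. Ties (a value equidistant from both boundaries) are resolved toward the lower boundary here, which is an assumed convention:

```python
S = [12, 14, 19, 22, 24, 26, 28, 31, 34]
size = 3

# Equal-frequency binning: consecutive groups of `size` sorted values.
bins = [S[i:i + size] for i in range(0, len(S), size)]

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value moves to the nearer of the
# bin's minimum or maximum (ties go to the lower boundary).
def smooth_boundaries(b):
    lo, hi = min(b), max(b)
    return [lo if v - lo <= hi - v else hi for v in b]

by_bounds = [smooth_boundaries(b) for b in bins]
```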
DESCRIPTIVE STATISTICS
Data visualization is a branch of study that is useful for investigating the given data. Mainly, the plots are useful to explain and present data to customers.
Descriptive analysis and data visualization techniques help to understand the nature of the
data, which further helps to determine the kinds of machine learning or data mining tasks that
can be applied to the data. This step is often known as Exploratory Data Analysis (EDA). The focus
of EDA is to understand the given data and to prepare it for machine learning algorithms. EDA
includes descriptive statistics and data visualization.
A dataset can be assumed to be a collection of data objects. The data objects may be records,
points, vectors, patterns, events, cases, samples or observations. These records contain many
attributes. An attribute can be defined as the property or characteristics of an object.
For example, consider the following database shown in sample Table 2.2.
Every attribute should be associated with a value. This process is called measurement. The type
of attribute determines the data types, often referred to as measurement scale types. The data
types are shown in Figure 2.1.
Categorical or Qualitative Data: The categorical data can be divided into two types. They are
nominal type and ordinal type.
Nominal Data – In Table 2.2, patient ID is nominal data. Nominal data are symbols and cannot be processed like numbers. For example, the average of patient IDs does not make any statistical sense. The nominal data type provides only information and has no ordering among data. Only operations like (=, ≠) are meaningful for these data. For example, a patient ID can be checked for equality and nothing else.
Ordinal Data – It provides enough information and has a natural order. For example, Fever = {Low, Medium, High} is ordinal data. Certainly, low is less than medium and medium is less than high, irrespective of the actual values. Any order-preserving transformation can be applied to these data.
Numeric or Quantitative Data: It can be divided into two categories. They are interval type and ratio type.
Interval Data – Interval data is numeric data for which the differences between values are meaningful. For example, there is a meaningful difference between 30 degrees and 40 degrees. Only + and − are permissible operations.
Ratio Data – For ratio data, both differences and ratios are meaningful. The difference between ratio and interval data is the position of zero in the scale. For example, take the Centigrade–Fahrenheit conversion: the zeroes of the two scales do not match. Hence, these temperatures are interval data.
Discrete Data: This kind of data is recorded as integers. For example, the responses of a survey can be discrete data. An employee identification number such as 10001 is discrete data.
Continuous Data: It can take any value within a range and includes decimal points. For example, age is continuous data. Though age appears to be discrete data, one may be 12.5 years old, and it makes sense. Patients' height and weight are all continuous data.
A third way of classifying the data is based on the number of variables used in the dataset. Based on that, the data can be classified as univariate data, bivariate data, and multivariate data. This is shown in Figure 2.2.
In the case of univariate data, the dataset has only one variable. A variable is also called a category. Bivariate data indicates that the number of variables used is two, and multivariate data uses three or more variables.
This chapter primarily deals with univariate data in detail with just an overview of bivariate and
multivariate data.
Univariate analysis is the simplest form of statistical analysis. As the name indicates, the dataset has only one variable. Univariate analysis does not deal with causes or relationships; its aim is to describe the data and find patterns.
Univariate data description involves finding the frequency distributions, central tendency measures, dispersion or variation, and shape of the data.
Data Visualization
To understand data, graph visualization is a must. Data visualization helps in understanding data and in presenting information to customers. Some of the graphs used in univariate data analysis are bar charts, histograms, frequency polygons and pie charts.
The advantages of the graphs are presentation of data, summarization of data, description of data,
exploration of data, and to make comparisons of data. Let us consider some forms of graphs now:
Bar Chart: A Bar Chart (or Bar graph) is used to display the frequency distribution for variables. Bar
charts are used to illustrate discrete data. The charts can also help to explain the counts of nominal
data. It also helps in comparing the frequency of different groups.
The bar chart for students’ marks {45, 60, 60, 80, 85} with Student ID= {1, 2, 3, 4, 5} is shown below in
Figure 2.3.
Pie Chart: These are equally helpful in illustrating univariate data. The percentage frequency distribution of students' marks {22, 22, 40, 40, 70, 70, 70, 85, 90, 90} is shown in Figure 2.4.
It can be observed that the number of students with 22 marks is 2, out of 10 students in total. So, 2/10 × 100 = 20% of the pie of 100% is allotted for marks 22 in Figure 2.4.
Histogram: It plays an important role in data mining for showing frequency distributions. The histogram
for students’ marks {45, 60, 60, 80, 85} in the group range of 0-25, 26-50, 51-75, 76-100 is given below in
Figure 2.5. One can visually inspect from Figure 2.5 that the number of students in the range 76-100 is 2.
Histogram conveys useful information like nature of data and its mode. Mode indicates the peak of
dataset. In other words, histograms can be used as charts to show frequency, skewness present in the
data, and shape.
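The group counts behind such a histogram can be computed directly; a minimal sketch for the marks above:

```python
marks = [45, 60, 60, 80, 85]
ranges = [(0, 25), (26, 50), (51, 75), (76, 100)]

# Count how many marks fall into each group range.
counts = [sum(lo <= m <= hi for m in marks) for lo, hi in ranges]
# ranges 0-25, 26-50, 51-75, 76-100 receive 0, 1, 2 and 2 marks
```

These counts are exactly the bar heights a histogram of Figure 2.5 would show, including the two students in the 76-100 range.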
Dot Plots: These are similar to bar charts, but less cluttered, as they illustrate the bars with single points only. The dot plot of English marks for five students with IDs {1, 2, 3, 4, 5} and marks {45, 60, 60, 80, 85} is given in Figure 2.6. The advantage is that, by visual inspection, one can find out who got higher marks.
One cannot remember all the data. Therefore, a condensation or summary of the data is necessary. This makes data analysis easy and simple. One such summary is called central tendency. Central tendency can explain the characteristics of the data, which further helps in comparison. Mass data have a tendency to concentrate at certain values, normally in a central location. This is called the measure of central tendency (or average). These represent the first-order measures. Popular measures are the mean, median and mode.
1. Mean – The arithmetic average (or mean) is a measure of central tendency that represents the 'centre' of the dataset. It is the commonest measure used in daily conversation, as in average income or average traffic. It can be found by adding all the data values and dividing the sum by the number of observations. Mathematically, let x1, x2, …, xN be a set of 'N' values or observations; then the arithmetic mean, denoted x̄, is given as:
x̄ = (x1 + x2 + … + xN) / N
Weighted mean- Unlike arithmetic mean that gives the weightage of all items equally,
weighted mean gives different importance to all items as the item importance varies.
Hence, different weightage can be given to items.
In the case of a frequency distribution, the mid-values of the ranges are taken for the computation.
In weighted mean, the mean is computed by adding the product of proportion and
group mean. It is mostly used when the sample sizes are unequal.
Geometric mean – Let x1, x2, …, xN be a set of 'N' values or observations. The geometric mean is the Nth root of the product of the N items. The formula for computing the geometric mean is:
GM = (x1 × x2 × … × xN)^(1/N)
Here, N is the number of items and xi are the values. For example, if the values are 6 and 8, the geometric mean is GM = √(6 × 8) = √48 ≈ 6.93.
The problem with the mean is its extreme sensitivity to noise: even small changes in the input can affect the mean drastically. Hence, for large datasets, the extreme values (say, the top 2%) are often chopped off before the mean is calculated; this is known as a trimmed mean.
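A quick sketch of the three means using only the standard library; the values and weights below are illustrative:

```python
from math import prod

values = [6, 8]

# Arithmetic mean: sum divided by count.
amean = sum(values) / len(values)  # (6 + 8) / 2 = 7.0

# Weighted mean: each item contributes in proportion to its weight
# (the weights here are hypothetical).
weights = [1, 3]
wmean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Geometric mean: Nth root of the product of the N items.
gmean = prod(values) ** (1 / len(values))  # sqrt(48), about 6.93
```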
2. Median-The middle value in the distribution is called median. If total number of items in the
distribution is odd, then the middle value is called median. If the numbers are even, then the
average value of two items in the centre is the median. It can be observed that the median is the
value where xi is divided into two equal halves, with half of the values being lower than the
median and half higher than the median. A median class is that class where (N/2)th item is
present.
In the continuous (grouped frequency) case, the median is given by the formula:
Median = L + ((N/2 − cf) / f) × h
where L is the lower boundary of the median class, cf is the cumulative frequency of the class preceding the median class, f is the frequency of the median class, and h is the class width.
3. Mode – Mode is the value that occurs most frequently in the dataset. In other words, the value that has the highest frequency is called the mode. Mode is used mainly for discrete data and is not applicable to continuous data, as there are typically no repeated values in continuous data.
The procedure for finding the mode is to calculate the frequencies for all the values in the
data, and mode is the value (or values) with the highest frequency. Normally, the dataset is
classified as unimodal, bimodal and trimodal with modes 1, 2 and 3, respectively.
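The median and mode can be computed with Python's statistics module; the marks below reuse the pie-chart data from earlier:

```python
from statistics import median, multimode

data = [22, 22, 40, 40, 70, 70, 70, 85, 90, 90]

# Even number of items: median is the average of the two middle values.
med = median(data)  # (70 + 70) / 2 = 70

# Mode: the most frequent value(s); multimode would also return
# several values for a bimodal dataset.
modes = multimode(data)  # 70 occurs three times, all others twice
```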
Dispersion
The spread of a set of data around the central tendency (mean, median or mode) is called dispersion. Dispersion is represented in various ways, such as range, variance, standard deviation, and standard error. These are second-order measures. The most common measures of dispersion are listed below:
Range: Range is the difference between the maximum and minimum values of the given list of data.
Standard Deviation: The mean does not convey much more than a middle point. For example,
the following datasets {10, 20, 30} and {10, 50, 0} both have a mean of 20. The difference
between these two sets is the spread of data.
Standard deviation is the average distance from the mean of the dataset to each point.
The formula for the population standard deviation is given by:
σ = √( (1/N) Σ (xi − µ)² )     (2.8)
Here, N is the size of the population, xi is an observation or value from the population, and µ is the population mean. Often, N − 1 is used instead of N in the denominator of Eq. (2.8); the reason is that for real-world samples, division by N − 1 gives an answer closer to the actual population value.
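Both conventions are available in Python's statistics module, illustrated here on the {10, 20, 30} dataset from above:

```python
from statistics import pstdev, stdev

data = [10, 20, 30]

# Population standard deviation: divides the squared deviations by N.
sigma = pstdev(data)  # sqrt(200 / 3), about 8.165

# Sample standard deviation: divides by N - 1 instead.
s = stdev(data)       # sqrt(200 / 2) = 10.0
```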
Quartiles and Inter Quartile Range: It is sometimes convenient to subdivide the dataset using percentile coordinates. The kth percentile is the value Xk with the property that k% of the data lies at or below Xk. For example, the median is the 50th percentile and can be denoted as Q0.50. The 25th percentile is called the first quartile (Q1) and the 75th percentile is called the third quartile (Q3).
Another measure that is useful for measuring dispersion is the Inter Quartile Range (IQR). The IQR is the difference between Q3 and Q1:
IQR = Q3 − Q1     (2.9)
Outliers are normally the values falling at least 1.5 × IQR above the third quartile or below the first quartile.
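A sketch of the quartile computation and the 1.5 × IQR outlier rule, using the data from Example 2.1 with one suspicious value appended; note that statistics.quantiles defaults to the 'exclusive' textbook method of interpolation:

```python
from statistics import quantiles

data = [12, 14, 19, 22, 24, 26, 28, 31, 34, 136]  # 136 is suspect

# Q1, median, Q3 via the default 'exclusive' method.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# The 1.5 * IQR rule flags values far outside the quartiles.
lo_fence = q1 - 1.5 * iqr
hi_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lo_fence or x > hi_fence]
```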
Five-point Summary and Box Plots: The median, quartiles Q1 and Q3, and minimum and
maximum written in the order < Minimum, Q1, Median, Q3, Maximum > is known as five-point
summary.
Box plots are suitable for continuous variables, optionally grouped by a nominal variable. Box plots can be used to illustrate data distributions and to summarize data; they are the popular way of plotting five-point summaries. A box plot is also known as a box-and-whisker plot.
The box contains the bulk of the data, namely the data between the first and third quartiles. The line inside the box indicates the location, mostly the median, of the data. If the median is not equidistant from the ends of the box, then the data is skewed. The whiskers that project from the ends of the box indicate the spread of the tails and the maximum and minimum data values.
Shape
Skewness and Kurtosis (called moments) indicate the symmetry/asymmetry and peak location
of the dataset.
Skewness
The measures of direction and degree of symmetry are called measures of third order. Ideally,
skewness should be zero as in ideal normal distribution. More often, the given dataset may not
have perfect symmetry (consider the following Figure 2.8).
The dataset may have either very high values or extremely low values. If the dataset has far more high values, then it is said to be skewed to the right; if it has far more low values, then it is said to be skewed to the left. If the tail is longer on the right-hand side and the hump is on the left-hand side, it is called positive skew; otherwise, it is called negative skew.
If the data is skewed, then there is a greater chance of outliers in the dataset. This affects the mean and the median, and hence may affect the performance of the data mining algorithm. A perfect symmetry means the skewness is zero. In a positive skew, the mean is greater than the median; in a negatively skewed distribution, the median is greater than the mean. The relationship between skew and the relative sizes of the mean and median can be summarized by a convenient numerical skew index known as the Pearson 2 skewness coefficient:
Pearson 2 coefficient = 3 × (mean − median) / σ
Also, the following moment-based measure is more commonly used to measure skewness. Let x1, x2, …, xN be a set of 'N' values or observations; then the skewness can be given as:
skewness = (1/N) Σ ((xi − µ) / σ)³
Here, µ is the population mean and σ is the population standard deviation of the univariate data. Sometimes, for bias correction, N − 1 is used instead of N.
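Both skewness measures can be sketched as follows; the dataset is illustrative:

```python
from statistics import mean, median, pstdev

data = [2, 3, 4, 8, 9, 11, 13]
mu, sigma = mean(data), pstdev(data)

# Moment-based skewness: average of the cubed standardized deviations.
skew = sum(((x - mu) / sigma) ** 3 for x in data) / len(data)

# Pearson 2 coefficient: 3 * (mean - median) / sigma.
pearson2 = 3 * (mu - median(data)) / sigma
```

The two indices can disagree on small samples, as they do here: the moment skewness is slightly positive while the Pearson 2 coefficient is negative, since the median (8) exceeds the mean (about 7.14).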
Kurtosis
Kurtosis indicates the peakedness of the data: if the data has a high peak, it indicates higher kurtosis, and vice versa. Kurtosis is the measure of whether the data is heavy-tailed or light-tailed relative to the normal distribution. It can be observed that the normal distribution has a bell-shaped curve with no long tails; low kurtosis tends to indicate light tails, implying that there is no outlier data. Let x1, x2, …, xN be a set of 'N' values or observations. Then, kurtosis is measured using the formula given below:
kurtosis = Σ (xi − µ)⁴ / ((N − 1) × σ⁴)
It can be observed that N − 1 is used instead of N in the denominator for bias correction. Here, µ and σ are the mean and standard deviation of the univariate data, respectively.
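A sketch of this kurtosis formula (with the N − 1 correction used in the text) on an illustrative dataset:

```python
from statistics import mean, pstdev

data = [2, 3, 4, 8, 9, 11, 13]
mu, sigma = mean(data), pstdev(data)
n = len(data)

# Kurtosis: fourth-power deviations scaled by (N - 1) * sigma^4.
kurt = sum((x - mu) ** 4 for x in data) / ((n - 1) * sigma ** 4)
```

A normal distribution has kurtosis near 3 under the uncorrected convention; this flat, short-tailed sample comes out well below that, indicating light tails and no outliers.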
Some of the other useful measures for finding the shape of the univariate dataset are mean
absolute deviation (MAD) and coefficient of variation (CV).
MAD is another dispersion measure and is robust to outliers. Normally, an outlier point is detected by computing its deviation from the median and dividing that by the MAD. In the simple form used here, the absolute deviation between each data value and the mean is taken, |x − µ|, and the MAD is the average of these absolute deviations:
MAD = (1/N) Σ |xi − µ|
The coefficient of variation is used to compare datasets with different units. CV is the ratio of the standard deviation to the mean, and %CV is the coefficient of variation expressed as a percentage.
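MAD and CV can be sketched as follows, using the mean-based absolute deviation described above on the {10, 20, 30} dataset:

```python
from statistics import mean, pstdev

data = [10, 20, 30]
mu = mean(data)

# Mean absolute deviation: average absolute distance from the mean.
mad = sum(abs(x - mu) for x in data) / len(data)  # (10 + 0 + 10) / 3

# Coefficient of variation: standard deviation relative to the mean.
cv = pstdev(data) / mu
pct_cv = 100 * cv
```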
An ideal way to check the shape of a dataset is a stem and leaf plot. A stem and leaf plot is a display that helps us to know the shape and distribution of the data. In this method, each value is split into a 'stem' and a 'leaf': the last digit is usually the leaf, and the digits to the left of the leaf form the stem. For example, the mark 45 is split into stem 4 and leaf 5 in Figure 2.9.
The stem and leaf plot for the English subject marks, say, {45, 60, 60, 80, 85} is given in Figure
2.9.
It can be seen from Figure 2.9 that the first column is the stem and the second column is the leaf. For the given English marks, the two students with 60 marks appear in the stem and leaf plot as stem 6 with two leaves of 0.
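Building the stem and leaf structure is a one-pass operation; a minimal sketch for the English marks:

```python
marks = [45, 60, 60, 80, 85]

# Split each mark into a stem (tens digit) and a leaf (units digit),
# keeping the leaves of each stem together in sorted order.
stems = {}
for m in sorted(marks):
    stem, leaf = divmod(m, 10)
    stems.setdefault(stem, []).append(leaf)
```

Printing each stem followed by its leaves reproduces the display of Figure 2.9, with stem 6 carrying the two 0 leaves.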
As discussed earlier, the ideal shape of a dataset is a bell-shaped curve, which corresponds to normality, and most statistical tests are designed only for normally distributed data. A Q-Q plot can be used to assess the shape of the dataset. The Q-Q plot is a 2D scatter plot of univariate data against theoretical normal-distribution quantiles, or of two datasets, plotting the quantiles of the first against the quantiles of the second. The normal Q-Q plot for the marks x = [13 11 2 3 4 8 9] is given in Figure 2.10.
Ideally, the points fall along the 45-degree reference line if the data follows a normal distribution. If the deviation is large, then there is greater evidence that the dataset follows a distribution other than the normal distribution. In such a case, careful analysis should be carried out before interpreting the statistical investigations.
Thus, skewness, kurtosis, mean absolute deviation and coefficient of variation all help in assessing univariate data.