W1L1,2,3 Lecture Script
Hello everyone! I am sure you must have come across the word “Data”.
Data is fundamental to every activity you do in your life. Data is
everywhere. In this digital world, there is an explosion of data due to
emerging data acquisition and data-generating technologies. Not only
humans, but also machines generate data. Data is growing in terms of
volume, variety, and velocity. We are drowning in data but starving for
its understanding and use. Therefore, understanding data and its various
aspects is very important. In this module we will explain these terms
and concepts.
Types of Data
Data can belong to any of four measurement scales: Nominal, Ordinal,
Interval, and Ratio. Data types (numbers or symbols) differ in the
following four characteristics:
1. Labelling or categorising,
2. Order (the numbers are ordered),
3. Distance (the differences between numbers are ordered), and
4. Origin (the numbers have a unique origin indicated by the number
zero).
Example:
How happy are you with our product? (Variable: Happy)
1. Very Unhappy 2. Unhappy 3. Neutral 4. Happy 5. Very Happy
In the above example, the options are in order. On the other hand, the
difference between the choices “Very Unhappy” and “Unhappy” may not be
equal (or even known) compared to the difference between “Unhappy” and
“Neutral”, or the difference between “Happy” and “Very Happy”.
Only a few statistics and statistical tests are available for analysing
such data, such as frequency distribution, percentile, mode, median, and
rank-order correlation.
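
As a small illustration (a sketch using Python’s standard library; the
survey responses below are made up), these statistics can be computed on
coded ordinal responses:

from collections import Counter
from statistics import median, mode

# Hypothetical coded responses to "How happy are you with our product?"
# 1 = Very Unhappy ... 5 = Very Happy
responses = [4, 5, 3, 4, 2, 4, 5, 3, 4, 1]

print(Counter(responses))   # frequency distribution of each category
print(mode(responses))      # most frequent category -> 4 (Happy)
print(median(responses))    # middle value of the ordered responses -> 4.0
# Note: the mean is NOT meaningful here, because the distances
# between categories are unknown for ordinal data.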
Interval:
Example: Time
The difference between 7 PM and 8 PM is a measurable 60 minutes, as is
the difference between 8 PM and 9 PM. In interval data, the distance
between options does have meaning.
The interval data type has no true zero value but has a relative zero
value. A time of zero does not mean that time does not exist or is
unavailable at that point. The zero value is a human-defined number which
represents a certain time (noon or midnight). This is because the time
scale is arbitrary.
Suppose we redefine time (12-hour time) so that what is currently 11 PM
becomes 2 AM (i.e., we shift the scale by 3). This is allowed because the
scale is arbitrary. Now we have:

3 PM (original) → 6 PM (shifted)
6 PM (original) → 9 PM (shifted)

Here, 9 is not double of 6, even though 6 is double of 3 on the original
scale, indicating that the interpretation of “double” or “twice” depends
on our choice of “0”.
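
This point can be checked with a few lines of Python (a minimal sketch;
the clock times are simply represented as hour numbers):

# Interval scale: shifting the origin preserves differences but not ratios.
times = [3, 6]                    # 3 PM and 6 PM on the original scale
shifted = [t + 3 for t in times]  # the same moments after shifting the scale by 3

print(shifted[1] - shifted[0] == times[1] - times[0])  # True: differences survive
print(times[1] / times[0])        # 2.0 -> "6 is double of 3" on the old scale
print(shifted[1] / shifted[0])    # 1.5 -> the "double" claim vanishes after the shift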
A good number of statistics are available for analysing interval data,
including the mean, median, mode, and standard deviation. As the scale
does not have a true zero, one cannot compute ratios (multiplication and
division), whereas additions and subtractions are possible.
Ratio:
A ratio data type is a scale that has a true or absolute zero point. For
example, weight is a ratio scale. It has a true zero, and one can say
that 80 kg is twice 40 kg. A length of zero means that there is no
length. Other examples are heart rate and salary. The other properties of
ratio data are identity, magnitude, order, and equal distance. A large
number of statistics, mathematical operations, and statistical tests are
available for analysing this type of data, for example, range, mean,
standard deviation, t-tests, factor analysis, and regression.
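
Because a ratio scale has a true zero, ratios and the full range of
descriptive statistics are meaningful. A short sketch with made-up
weights:

from statistics import mean, stdev

weights_kg = [40, 55, 62, 80, 73]         # hypothetical weights on a ratio scale

print(max(weights_kg) - min(weights_kg))  # range -> 40
print(mean(weights_kg))                   # mean -> 62.0
print(round(stdev(weights_kg), 2))        # sample standard deviation
print(weights_kg[3] / weights_kg[0])      # 2.0 -> 80 kg really is twice 40 kg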
Classification of Data
Data can be classified in many ways:
Qualitative and Quantitative
Structured, Semi-structured, and Unstructured
Metric, and Non-metric data
Primary data and Secondary data
Human- generated data and Machine-generated data
Online and Offline data
Real-Time data and Near-Real-Time data
Static and Dynamic data
Small Data, and Big Data
Streaming data, Perishable data and Embedded data
Qualitative vs. Quantitative Data
Structured data:
Structured data is data that is highly organized in a uniform or standard
form. It refers to clearly defined data types and data that has a defined
length and format. For example, a table that contains a fixed set of rows
and columns stores structured data. Database systems based on the
relational data model store structured data. Structured data conforms to
a predefined schema used in database systems, and the structure is
consistent. In reality, structured data is only about 5% of the total
data generated. An example of structured data is an employees table (in a
database) that contains each employee’s name, date of birth, gender,
number of dependents, and date of joining. It is much easier to upload,
extract, load, store, query, and analyze structured data compared to
unstructured and semi-structured data.
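
As an illustration, the employees table above maps naturally onto a
relational schema. Below is a minimal sketch using Python’s built-in
sqlite3 module; the table layout and the sample row are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("""
    CREATE TABLE employees (
        name            TEXT,
        date_of_birth   TEXT,
        gender          TEXT,
        dependents      INTEGER,
        date_of_joining TEXT
    )
""")
conn.execute("INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
             ("A. Kumar", "1990-05-12", "M", 2, "2015-07-01"))

# Every row conforms to the predefined schema, which is what makes
# structured data easy to store, query, and analyze.
for row in conn.execute("SELECT name, dependents FROM employees"):
    print(row)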
Unstructured data
In the case of unstructured data, the nature of the schema may not be
known, or the data does not conform to the organized form of structured
data. It comprises data that is usually not as easily searched, stored,
and processed. It does not comply with any data storage schema (defined
rows and columns). Examples of unstructured data are audio, text, video,
blogs, log files of machine data, and emails. Facebook, LinkedIn,
WhatsApp, and Twitter are some of the technologies (social media data)
that deal with unstructured data.
Semi-Structured data
Semi-structured data lies in between structured and unstructured data.
There are no standards for the organization of the data, but in some
cases it contains parts that can be separated out and put in a fixed
form, while the other parts remain unstructured. Examples of
semi-structured data are X-ray images, XML (a textual markup language for
exchanging data on the Web) documents, JSON documents, .csv files,
tab-delimited files, and NoSQL databases. Due to the lack of data
organization, it is difficult to manage unstructured and semi-structured
data.
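
For instance, a JSON document can be parsed so that the fixed-form parts
are extracted while the free-text parts remain unstructured. A minimal
sketch using Python’s standard json module, with a made-up record:

import json

doc = '''{
  "patient_id": 101,
  "scan_type": "X-ray",
  "notes": "Patient reported mild discomfort; image quality acceptable."
}'''

record = json.loads(doc)
print(record["patient_id"], record["scan_type"])  # parts that fit a fixed form
print(record["notes"])                            # unstructured free-text part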
Metric and Non-metric data
Metric data is quantitative and is measured on an interval or ratio
scale. Non-metric data, in contrast, is qualitative and indicates the
presence or absence of a property. Nominal data types and ordinal data
types are examples of non-metric data.
Primary data vs. Secondary data
Primary data is data collected first-hand by the researcher specifically
for the problem at hand. Secondary data, on the other hand, is prepared
data that is ready for use. Other researchers and agencies have already
collected this type of data for their own purposes, so the source of
secondary data is problem dependent. Researchers can use secondary data
in their studies after considering its relevance. Secondary data is
generally available in published reports, journals, computerized
databases, and syndicated services provided by firms.
Human-generated data vs. Machine-generated data:
Real-time data vs. Near-real-time data
Real-time data is data that is captured and processed immediately as it
is generated. Near-real-time data is data that is made available quickly,
soon after the events occur, but the response time is not required, and
not optimized, to be as fast as possible. The time involved in
near-real-time processing depends on the type of application and context.
The delay can be in minutes, seconds, or milliseconds.
Static data vs. Dynamic data
Static data is fixed and does not change once it is recorded. This type
of data is not available on a real-time basis and is self-controlled as
well. Static data is also not frequently accessed.
Example: When you join a college, you provide data such as your name,
gender, educational details, and home town in a form. The data collected
from the filled-in form is static data.
Dynamic data is data that may change after it is stored or recorded. The
data can be updated at any point of time whenever there is a requirement,
and it changes over time as new data becomes available. Dynamic data can
become static subsequently.
Example: Savings bank account data is one example, where the balance
amount or loan amount changes when transactions are done.
Small Data
Small data is data whose volume, format, and type a human can comprehend
easily, and to which traditional data processing applications and
techniques can be applied for storing, processing, and analysing. Small
data is usually collected in a controlled manner and is part of database
systems in various applications. Small data is generally structured
(e.g., a relational model database) and in a tabular format.
Big data
In recent times, “Big Data” has become a buzzword. Firms like Google,
eBay, LinkedIn, Macy’s, Caesars Entertainment Corp, and Facebook are
pioneers in using big data. The three main features of big data are
volume (quantity of data), velocity (speed of data), and variety (types
of data). The unique characteristics of big data are its lack of
structure, the opportunities it creates, and the low-cost technologies it
uses.
Volume: In recent years, the quantity of new data generated has grown
exponentially. For example, Facebook generates around a thousand
terabytes of new data on a day-to-day basis and processes millions of
photographs and videos every day. There is no clear-cut volume threshold
for data to qualify as big data; it is relative and depends on the
industry. The complexity of volume varies with many factors, such as how
fast the data is captured or generated and the types of data generated.
Velocity: Velocity is the rate at which data is generated, and also how
fast the data is processed and analyzed for use. Data can be generated on
a real-time or near-real-time basis. Digital devices such as sensors and
mobile devices generate millions of data points on a real-time basis.
Streaming Data
Streaming data is data that is generated as a constant flow. The data is
dynamic and is generated by many data sources. It is a new type of data
that has gained significance in the context of social media.
Examples: Log files generated from mobile applications, Twitter data
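
The idea can be sketched in Python as a generator that yields records as
a constant flow; the log records below are simulated, and a real stream
would come from sources such as mobile applications or social media APIs:

import time
import random

def log_stream():
    """Simulate a constant flow of log records from a mobile application."""
    while True:
        yield {"ts": time.time(), "event": random.choice(["click", "view", "error"])}
        time.sleep(0.1)  # records keep arriving; there is no final record

# Consume the stream record by record as it is generated.
for i, record in enumerate(log_stream()):
    print(record)
    if i >= 4:          # stop the demo after five records
        break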
Perishable Data
In many critical applications, the value of data is lost over time. Such
data is known as perishable data or fast data. Data is perishable when
only a minimal amount of time is left to act upon it. One must capture
such data, analyse it, and then act upon it in near-real time or real
time to get its benefit before it loses its value.
The importance of perishable data and its processing has increased in the
context of big data and big data analytics.
Examples of perishable data: health-related data (e.g., blood pressure,
ECG), location data in a retail store, clickstream data, mobile device
data, and stock market trading data.
Embedded data
Storing of data:
For example, when data is stored and managed in low-end software
applications that do not use a standard database, accounting information
can easily be manipulated, and no audit is possible subsequently. On the
other hand, it would be extremely difficult or impossible to manipulate
data if the application data is maintained in high-end applications that
use a qualified database, as forensic auditing will detect any
manipulation of data.
The stages of data processing are:
1. Data Collection
2. Data Preparation
3. Input
4. Processing
5. Output
Data Collection:
Data can be collected from respondents, groups, and organizations using
various data collection methodologies. The source of data can be primary
or secondary. Before data collection, an instrument (measurement scale)
for data collection is developed, tested for reliability, and validated
using established methods. To ensure quality, a pre-test of the
instrument is conducted. Different types of data collection biases (e.g.,
non-response bias and common method bias) are addressed. A tool must be
reliable and valid.
Data Preparation
One must detect any errors and inconsistencies in the data collected.
This detection covers ineligible entries, confusing answers, incomplete
answers, outliers, conflicting answers, and missing data. There are many
ways to address these issues, such as collecting the respective data
again, estimating missing values, validating the response (through
observation, peer review, and record inspection), and discarding the
response. While editing, the user should not introduce bias in any form.
One can also take the help of a computer in finding such errors. In this
step, any extra or unusable data that should not be entered and analyzed
is also identified.
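
As a small illustration (a sketch over made-up survey records; the field
names and thresholds are hypothetical), a computer can flag missing
values and suspicious outliers for review:

records = [
    {"id": 1, "age": 25, "income": 4500},
    {"id": 2, "age": None, "income": 2300},   # missing answer
    {"id": 3, "age": 31, "income": 999999},   # suspicious outlier
]

for r in records:
    if r["age"] is None:
        print(f"Record {r['id']}: missing age -> re-collect or estimate")
    if r["income"] > 100000:
        print(f"Record {r['id']}: income {r['income']} looks like an outlier")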
Coding: Coding means assigning numbers or other symbols to answers so
that the responses can be grouped into different categories. Coding helps
in entering data faster and entering less data as well.
Input
The raw data is now ready for processing. In the case of automation, the
data is entered into the computer using an appropriate application. The
data is then stored in some system, as discussed earlier. If the input
data is wrong, then the result will be wrong.
Data Processing
Data processing means applying operations to data, such as arithmetical
calculations (addition, subtraction, multiplication, and division),
sequencing (ascending or descending), sorting, merging, summarising,
aggregating, and classifying. Data processing can be done manually or by
using a computer (electronic data processing).
Data Analysis
In data analysis, data is reviewed and analyzed to produce output that
helps in drawing inferences, finding relationships, and supporting
decisions.
E.g. Income data: 4500, 2300, 1000, 3000, and 4300.
Expenditure data: 2300, 1200, 3000, 1450, and 7000.
In this case, data processing is finding the average income and the
average expenditure. Data analysis is finding the relationship between
income and expenditure, i.e., how expenditure changes when there is a
change in income.
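
Using the figures above, the distinction can be shown in a few lines of
Python (statistics.correlation requires Python 3.10 or later):

from statistics import mean, correlation

income      = [4500, 2300, 1000, 3000, 4300]
expenditure = [2300, 1200, 3000, 1450, 7000]

# Data processing: summarising the raw figures.
print(mean(income))       # 3020
print(mean(expenditure))  # 2990

# Data analysis: examining the relationship between the two variables.
print(correlation(income, expenditure))  # how expenditure moves with income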
Methodologies of processing and analysis vary across different types or
classifications of data. Some emerging data analysis methods include data
mining, data analytics, and data visualization.
Output
In the final stage of data processing, the output or result is generated.
Once the result is obtained, further processing, analysis,
interpretation, and visualization may be done manually or through IT. In
all stages of data processing, the quality of data must be maintained, as
explained below.
Quality of data:
***