W1L1,2,3 Lecture Script


Module-1: Introduction to Data

Hello everyone! I am sure you have come across the word "data".
Data is fundamental to almost every activity in your life. Data is
everywhere. In this digital world, there is an explosion of data due to
emerging data-acquisition and data-generating technologies. Not only
humans but also machines generate data. Data is growing in volume,
variety, and velocity. We are drowning in data but starving for its
understanding and use. Therefore, understanding data and its various
aspects is very important. In this module, we will explain these terms
and concepts.

Definition: What is Data?


Data is raw facts and figures about an object or phenomenon, without any
context, and hence meaningless on its own. "Data" is used as the plural
and "datum" is the singular form of "data." Data is generally captured,
stored, and processed, but it is not presented "as it is" because it has
no meaning and is not in any context. It is more technology oriented.
Data is a set of values of qualitative or quantitative variables
concerning some subjects. Raw data is "unprocessed data", i.e., data
that has not been changed since it was collected.
Example of a dataset: 34, 45, 67, and 20.

Type of Data
Data can belong to any of four types of measurement scales: Nominal,
Ordinal, Interval, and Ratio. Data types (numbers or symbols) differ in
the following four characteristics:

1. Labelling or categorising,
2. Order (the numbers are ordered),
3. Distance (the differences between numbers are ordered), and
4. Origin (the numbers have a unique origin, indicated by the number
zero).

The data types with their characteristics are explained below:


Nominal: A nominal scale is a kind of "naming" or labelling of variables
(named variables). A variable is a symbolic name that takes different
values, and the associated values may change. A nominal variable has no
numerical value. Re-ordering or rearranging the names or options does
not change any value. Only minimal statistics and statistical tests,
such as frequency distribution, percentages, mode, and chi-square, are
available for further processing of such data. A nominal data type
identifies mutually exclusive and exhaustive categories.
Examples:
 Caste: a. OBC, b. SC, c. ST, d. General
 Gender: M: Male, F: Female
Here, Caste and Gender are nominal scales or variables.
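
As a small illustration (a sketch in plain Python; the respondents and values are invented for this example), frequency counts and the mode are about the only numerical summaries that make sense for nominal data:

from collections import Counter

# Hypothetical nominal data: gender of ten respondents
gender = ["M", "F", "F", "M", "F", "M", "M", "F", "F", "F"]

freq = Counter(gender)             # frequency distribution
mode = freq.most_common(1)[0][0]   # most frequent category

print(freq)   # Counter({'F': 6, 'M': 4})
print(mode)   # 'F'
# An "average gender" would be meaningless; only counts, percentages,
# the mode, and chi-square style tests apply to nominal data.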

Ordinal: In the case of the ordinal data type, the values are in a
specific "order." On the other hand, the difference between two
consecutive values or options is not the same, or is not known. It is
something like nominal data with an order.

Example:
How happy are you with our product? (Variable: Happy): 1. Very Unhappy
2. Unhappy 3. Neutral 4. Happy 5. Very Happy
In the above example, the options are in order. On the other hand, the
difference between the choices "Unhappy" and "Very Unhappy" may not be
equal to, or is simply not known relative to, the difference between
"Neutral" and "Unhappy", or the difference between "Very Happy" and
"Happy".
Only a few statistics and statistical tests are available for analysing
such data, such as frequency distribution, percentiles, mode, median,
and rank-order correlation.
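
A minimal sketch in Python (the survey responses below are invented): ordinal responses can be ranked and summarised with a median, but the numeric codes should not be averaged as if the gaps between options were equal.

import statistics

# Code-book for the ordered options
scale = {"Very Unhappy": 1, "Unhappy": 2, "Neutral": 3, "Happy": 4, "Very Happy": 5}

responses = ["Happy", "Neutral", "Very Happy", "Happy", "Unhappy"]
codes = sorted(scale[r] for r in responses)   # ordering is meaningful

print(codes)                      # [2, 3, 4, 4, 5]
print(statistics.median(codes))   # 4 -> the median respects only the order
# A mean of these codes would assume equal distances between options,
# which ordinal data does not guarantee.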

Interval: An interval scale has values with equal intervals between
them. In this numeric scale, the options are not only in "order", but
the differences between consecutive values are also exact and
meaningful. Interval scales therefore provide not only the order but
also the distance between options.

Example: Time
The difference between 7 PM and 8 PM is a measurable 60 minutes, as is
the difference between 8 PM and 9 PM. In interval data, the distance
between options does have meaning.

The interval data type has no true zero value, only a relative zero. A
time of zero does not mean that time does not exist or is unavailable at
that point; the zero is a human-defined number that represents a certain
time (noon or midnight). This is because the time scale is arbitrary.
Suppose we redefine the 12-hour clock so that what is currently 11 PM is
called 2 AM (i.e., we shift the scale by 3 hours). This is allowed
because the scale is arbitrary. Now we have:
3 PM -> 6 PM
6 PM -> 9 PM
On the original scale, 6 is double 3; on the shifted scale, the same two
moments read 6 and 9, and 9 is not double 6. This shows that the
interpretation of "double" or "twice" depends on our choice of "0".

A good number of statistics are available for analysing interval data,
including the mean, median, mode, and standard deviation. As the scale
does not have a true zero, one cannot compute ratios (multiplication and
division), whereas additions and subtractions are possible.
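
The point about the arbitrary zero can be checked with a small sketch (plain Python; the values are chosen just for this illustration): differences survive a shift of the scale, but ratios do not.

# Two clock readings on the original 12-hour scale (in hours)
a, b = 3, 6                     # 3 PM and 6 PM
shift = 3                       # relabel the scale by +3 hours

a2, b2 = a + shift, b + shift   # 6 and 9 on the shifted scale

print(b - a, b2 - a2)   # 3 3     -> the difference (interval) is unchanged
print(b / a, b2 / a2)   # 2.0 1.5 -> the "ratio" changes with the chosen zero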

Ratio:
The ratio data type is a scale that has a true or absolute zero point.
For example, weight is a ratio scale: it has a true zero, and one can
say that 80 kg is twice 40 kg. A length of zero means that there is no
length. Other examples are heartbeat and salary. The other properties of
ratio data are identity, magnitude, order, and equal distance. A large
number of statistics, mathematical operations, and statistical tests are
available for analysing this type of data, for example range, mean,
standard deviation, t-tests, factor analysis, and regression.
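
Because ratio data has a true zero, ratios as well as the usual statistics are meaningful. A small sketch in Python with made-up weights:

import statistics

weights_kg = [40, 55, 80, 62, 73]   # hypothetical weights on a ratio scale

print(max(weights_kg) / min(weights_kg))   # 2.0 -> "80 kg is twice 40 kg" is valid
print(statistics.mean(weights_kg))         # 62
print(statistics.stdev(weights_kg))        # sample standard deviation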

Classification of Data
Data can be classified in many ways:
 Qualitative and Quantitative
 Structured, Semi-structured, and Unstructured
 Metric, and Non-metric data
 Primary data and Secondary data
 Human-generated data and Machine-generated data
 Online and Offline data
 Real-Time data and Near-Real-Time data
 Static and Dynamic data
 Small Data, and Big Data
 Streaming data, Perishable data and Embedded data

Qualitative vs. Quantitative Data

Qualitative data is subjective and descriptive. It is generally
expressed in terms of text, names, subjective narratives, symbols, or
number codes. Examples of qualitative data are a Twitter message, a
voice recording, and a medical prescription.
Quantitative data is a set of numbers. The data is objective, i.e., it
can be measured objectively and expressed numerically. An example of
quantitative data is the salary of employees expressed as an amount.

Structured, Semi-structured, and Unstructured

Structured data:
Structured data is data that is highly organized in a uniform or
standard form. It refers to clearly defined data types and data that has
a defined length and format. For example, a table that contains a fixed
set of rows and columns stores structured data. Database systems based
on the relational data model store structured data. Structured data
follows the predefined schema used in database systems, and the
structure is consistent. In reality, structured data makes up only about
5% of the total data generated. An example of structured data is an
employees table (in a database) that consists of the name of the
employee, date of birth, gender, number of dependents, and date of
joining. It is much easier to upload, extract, load, store, query, and
analyze structured data compared to unstructured and semi-structured
data.

Unstructured data
In the case of unstructured data, the schema may not be known, or the
data does not conform to the organized form of structured data. It
comprises data that is usually not as easily searchable, stored, and
processed. It does not comply with any data storage schema (defined rows
and columns). Examples of unstructured data are audio, text, video,
blogs, machine log files, and emails. Facebook, LinkedIn, WhatsApp, and
Twitter are some of the technologies (social media data) that deal with
unstructured data.

Semi-Structured data
Semi-structured data lies between structured and unstructured data.
There is no standard for how the data is organized, but in some cases it
contains parts that can be separated out and put into a fixed form,
while the other parts remain unstructured. Examples of semi-structured
data are X-ray images, XML documents (a textual markup language for
exchanging data on the Web), JSON documents, .csv files, tab-delimited
files, and NoSQL databases. Due to this lack of organization, it is more
difficult to manage unstructured and semi-structured data.
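
For instance, a JSON document (a common semi-structured format) mixes a loosely fixed part with free text. A small sketch in Python, with invented field names:

import json

# A semi-structured record: "id" and "name" are regular fields, "notes" is free text
record = '{"id": 101, "name": "Asha", "notes": "called twice, prefers email"}'

doc = json.loads(record)        # parse the JSON text into a Python dict
print(doc["id"], doc["name"])   # the structured part can be queried directly
print(doc["notes"])             # the free-text part needs further processing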

Metric data vs. Non-Metric data

Metric data is quantitative data on which basic mathematical
calculations can be applied meaningfully. Metric data is based on metric
properties, where the distance between values is defined. Interval and
ratio data types are metric data types.

Non-metric data is data where the distance between values cannot be
measured. Values of non-metric variables are mutually exclusive and
indicate the presence or absence of a property. Nominal and ordinal data
types are examples of non-metric data.

Primary data vs. Secondary data

Primary data is data collected by the researcher, or by third-party
agencies, directly from primary sources. The data collected can be
qualitative or quantitative. Primary data is collected through
observations, survey questionnaires, interviews, experimentation, Focus
Group Discussions (FGD), and projective techniques.

Secondary data, on the other hand, is prepared data, ready for use.
Other researchers and agencies have already collected this type of data
for their own purposes. The source of secondary data is therefore
problem dependent. Researchers can use secondary data in their research
studies, considering its relevance. Secondary data is generally
available in published reports, journals, computerized databases, and
syndicated services provided by firms.

Generally, primary data collection is more time consuming and costly
than using secondary data. Users of secondary data may face problems
because the available secondary data may not meet the needs of the
problem in hand; only partial data may be available for use. Such data
may also have issues such as irrelevance, inaccuracy, time-lag, and
questionable dependability (the credibility, reputation, and
trustworthiness of the source).

Human-generated data vs. Machine-generated data:

Human-generated data is data created manually by human actors in various
activities and processes, either on paper or using Information
Technology (IT). Some examples are manual data entry in software
applications, emails, Twitter posts, and videos created and uploaded to
YouTube.

Data can also be generated by machines (non-human actors) automatically,
without any human intervention. There is exponential growth in
data-capturing and data-generating machines, and the data generated by
such devices grows significantly in volume and velocity.

Examples of technologies that generate data (of different varieties) are
slot machines in casinos, Wi-Fi devices, network and equipment logs,
cameras, smart-phones, online gaming, and Twitter feeds. A few of the
above (e.g., online gaming, web logs) are of a hybrid type, i.e., both
human- and machine-generated data. Machine-generated data leads to high
volume, high velocity, and a great variety of data, and processing and
analysing such data is an enormous challenge.

Online data vs. Offline data

Online data is data available on the Internet. Online indicates
"connectivity".
Examples: online shopping, online booking of movie tickets, checking
results online, and writing an email using Google Gmail.
Offline data is data available in a manual system, or in a computer
system that is not connected to the Internet.

Real-time data vs. Near-real-time data

Real-time data is data that is generated and delivered on a real-time
basis. The data is delivered instantaneously after it is collected,
without any delay, and is viewable the moment it is available. The
response time is mandatory and is optimized to be as fast as possible.
Examples: bank ATMs, credit card fraud detection, buying an air ticket
online.

Near real-time data is data that is processed quickly, soon after it
occurs. The response time is not mandatory and is not optimized to be as
fast as possible. The time involved in near real-time processing depends
on the type of application and its context. The delay can be in minutes,
seconds, or milliseconds.

Static data vs. Dynamic data

Static data is fixed and does not change once it is recorded. This type
of data is not available on a real-time basis and is self-controlled as
well. Static data is also not frequently accessed.
Examples: When you join a college, you provide data such as your name,
gender, educational details, and home town in a form. The data collected
from the filled-in form is static data.

Dynamic data is data that may change after it is stored or recorded. The
data is updated at any point in time, whenever there is a requirement,
and changes over time as new data becomes available. Dynamic data can
subsequently become static.
Example: Savings bank account data is one example, where the balance or
loan amount changes as transactions are made.

Small data vs. Big data

Small Data
Small data is data whose volume, format, and type a human can comprehend
easily, and to which traditional data processing applications and
techniques can be applied for storing, processing, and analysis. Small
data is usually collected in a controlled manner and forms part of the
database systems behind various applications. Small data is generally
structured (e.g., in a relational database) and in a tabular format.

Big data
In recent times, "Big Data" has become a buzzword. Firms like Google,
eBay, LinkedIn, Macy's, Caesars Entertainment Corp, and Facebook are
pioneers in using big data. The three main features of big data are
volume (quantity of data), velocity (speed of data), and variety (types
of data). The distinguishing characteristics of big data are its lack of
structure, the opportunities it creates, and the low-cost technologies
it uses.

Volume: In recent years, the quantity of new data generated has grown
exponentially. For example, Facebook generates on the order of a
thousand terabytes of new data every day and processes millions of
photographs and videos. There is no clear-cut threshold of volume above
which data counts as big data; it is relative and depends on the
industry. The complexity of volume varies with many factors, such as how
fast the data is captured or generated and the types of data generated.

Velocity: Velocity is the rate at which data is generated, and also how
fast the data is processed and analyzed for use. The data can be
generated on a real-time or near real-time basis. Digital devices such
as sensors and mobile devices generate millions of data points in real
time.

Variety: Variety means data in various formats, types, and structures.
The new types of data include unstructured and semi-structured data.

Veracity: The term veracity was coined by IBM. Veracity is conformity to
facts, covering accuracy, quality, trustworthiness, and variability. It
is about ensuring the highest level of consistency in the data flow.

Streaming data, Perishable data, and Embedded data

Streaming Data

Streaming data is data that is generated as a constant flow. The data is
dynamic and is generated by many data sources. It is a type of data that
has gained significance in the context of social media.
Examples: log files generated by mobile applications, Twitter data.

Perishable Data

In many critical applications, the value of data is lost over time. Such
data is known as perishable data or fast data. Data is perishable when
only a minimal amount of time is left to act upon it. One must capture
such data, analyse it, and then act upon it in near-real time or real
time to get its benefit before it loses its value.
The importance of perishable data and of processing it has increased in
the context of big data and big data analytics.
Examples of perishable data: health-related data (e.g., blood pressure,
ECG), location data in a retail store, clickstream data, mobile device
data, and stock market trading data.

Embedded data

Embedded data is extra data that is recorded alongside the data related
to the intended activity. It has two components: a variable, and a
single value or multiple values. An embedded variable can be named and
set equal to any desired value.
Example: When a customer is entering responses to a set of questions,
the location of the customer can be recorded as embedded data in
addition to the customer's responses. Here the embedded variable is
"Location" and its value is set to "India" (a single value).

The Granularity of data:

The granularity of data represents the level of detail in the data.
Higher granularity means more detail. Different levels of granularity
are required for different kinds of processing, and processing more
granular data generally leads to better insights.
Example: Yearly sales are more summarized (i.e., less granular) than
monthly, weekly, or daily sales (which are more granular).
It is sometimes impossible to derive more granular data from less
granular data. For example, it is impossible to obtain daily sales from
yearly sales data, whereas it is easy to obtain monthly or yearly sales
from daily sales data, as sketched below.
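
A small sketch in Python (with invented daily sales figures): daily data can be rolled up into monthly totals, but monthly totals cannot be split back into daily sales.

from collections import defaultdict

# Hypothetical daily sales: (date, amount)
daily_sales = [("2024-01-05", 120), ("2024-01-20", 80),
               ("2024-02-03", 150), ("2024-02-25", 60)]

monthly = defaultdict(int)
for date, amount in daily_sales:
    month = date[:7]        # keep "YYYY-MM" only, i.e., reduce the granularity
    monthly[month] += amount

print(dict(monthly))        # {'2024-01': 200, '2024-02': 210}
# Going the other way (monthly -> daily) is impossible: the detail is gone.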

Storing of data:

Data can be stored in physical files, database management systems
(DBMS), data warehouses (DW), or various software applications such as
Microsoft Excel and Microsoft Word. The storage capacity of software
applications and databases varies; for example, a spreadsheet
application like Microsoft Excel can store far less data than a popular
database system like Oracle. Similarly, the organization of data varies
across storage systems. Major enterprise software applications use
database systems for storing data.

Database management systems (DBMS)

A database management system (DBMS) is a system with which one can
create a database, store data in it, make queries, manipulate data, and
generate standard or customized reports. Database systems store current
transaction data and detailed data. The data is dynamic and application
oriented. A DBMS supports day-to-day operations and is accessed by many
operational users. One can also write programs to process the data
stored in the database. Data in database systems is shared and
integrated, and hence the scope for using the data in various
applications is very wide. A DBMS can also be integrated with other
application software, the Internet, and intranets.
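
As a minimal sketch of these DBMS operations, the example below uses Python's built-in sqlite3 module (the table and column names are invented for illustration): it creates a database, defines a schema, stores data, and queries it.

import sqlite3

conn = sqlite3.connect(":memory:")   # create a small throwaway database
cur = conn.cursor()

# Define a structured table (schema), store data, and query it
cur.execute("CREATE TABLE employee (name TEXT, dob TEXT, gender TEXT)")
cur.execute("INSERT INTO employee VALUES (?, ?, ?)", ("Asha", "1990-04-12", "F"))
conn.commit()

for row in cur.execute("SELECT name, gender FROM employee"):
    print(row)                       # ('Asha', 'F')
conn.close()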

The primary objective of database systems is to provide "efficiency".
The system is designed to maintain atomicity, consistency, integrity,
durability, security, and privacy. On the other hand, software
applications that do not use database systems may not provide such
features, and hence managing the data becomes a challenge.

For example, when data is stored and managed in low-end software
applications that do not use a standard database, accounting information
can easily be manipulated and no audit is possible afterwards. On the
other hand, it is extremely difficult, if not impossible, to manipulate
data when the application data is maintained in high-end applications
that use a proper database, as forensic auditing will detect any
manipulation of the data.

Data Warehousing Systems

A data warehouse (DW) is a repository of subject-oriented, time-variant,
historical data. The objective of a DW is "decision support": a "single
version of the truth" is made available to all users. A DW stores
historical data that may be detailed, lightly summarized, or highly
summarized. Data in a DW is generally static and not updated. At
intervals, data from various transaction systems is extracted,
transformed, and loaded into the DW. Instead of being updated, data is
added to the DW with a date and time stamp. Processing of data in a DW
is ad hoc and not frequent. Analytics are applied to the data in the DW
to discover insights. A limited number of managerial users use the DW
for organization-level decisions. A data mart is a small, scaled-down
version of a DW; it contains data specific to some domain and is
accessed by users of that domain. Usually, the response time of a data
mart is much better than that of the DW.

Steps of Data Processing

The various steps of data processing are:


1. Data Collection
2. Data Preparation

3. Input
4. Processing
5. Output

Data Collection:
Data can be collected from respondents, groups, organizations using
various data collection methodologies. The source of data can be
primary or secondary. Before data collection, instrument (measurement
scale) for data collection is developed, tested for reliability and validated
using established methods. To ensure quality, a pre-test of the
instrument is conducted. Different types of biases of data collection (e.g.,
Non-response bias and common method bias) are addressed. A tool
must be reliable and valid.

Data Preparation
One must detect any errors and inconsistencies in the collected data.
This includes detecting ineligible entries, confusing answers,
incomplete answers, outliers, conflicting answers, and missing data.
There are many ways to address these issues, such as collecting the data
again, estimating missing values, validating the response (through
observation, peer review, and record inspection), or discarding the
response. While editing, the user should not introduce bias in any form.
One can also take the help of a computer for finding such errors. In
this step, any extra or unusable data that is not to be entered and
analyzed is also identified.

Coding: For better understanding and analysis, different coding
strategies are adopted for different entries, and a code-book is
prepared. Categories need to be established, and the data are then
assigned to the different categories. Coding helps in entering data
faster and in entering less data as well.
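
A tiny sketch of coding in Python (the code-book below is invented): text answers are mapped to short codes before entry, so less data needs to be keyed in.

# Hypothetical code-book prepared during data preparation
code_book = {"Gender": {"Male": "M", "Female": "F"},
             "Caste":  {"OBC": 1, "SC": 2, "ST": 3, "General": 4}}

raw_response = {"Gender": "Female", "Caste": "General"}
coded = {var: code_book[var][val] for var, val in raw_response.items()}

print(coded)   # {'Gender': 'F', 'Caste': 4} -> shorter codes, faster data entry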

Input
The raw data is now ready for processing. In the case of automation, the
data is entered into the computer using an appropriate application and
then stored in one of the systems discussed earlier. If the input data
is wrong, the result will be wrong.

Data Processing
Data processing means applying operations to the data, such as
arithmetical calculations (addition, subtraction, multiplication, and
division), sequencing (ascending or descending), sorting, merging,
summarising, aggregating, and classifying. Data processing can be done
manually or by using a computer (electronic data processing).
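
A short sketch of such processing in plain Python, using the income figures from the data analysis example below (sequencing, aggregating, and an arithmetic calculation):

incomes = [4500, 2300, 1000, 3000, 4300]

sorted_incomes = sorted(incomes)   # sequencing in ascending order
total = sum(incomes)               # aggregation
average = total / len(incomes)     # arithmetic calculation

print(sorted_incomes)   # [1000, 2300, 3000, 4300, 4500]
print(total, average)   # 15100 3020.0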

Data Analysis
In data analysis, data is reviewed and analyzed to find the output that
helps in drawing inferences, finding relationships and supporting
decisions.
E.g. Income data: 4500, 2300, 1000, 3000, and 4300.
Expenditure data: 2300, 1200, 3000, 1450, and 7000.
In this case, data processing is finding the average income and average
expenditure.
Data analysis is finding the relationship between income and
expenditure, i.e., for example, how expenditure changes when there is a
change in income.
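
As a sketch of this difference in plain Python (using the figures above; statistics.correlation requires Python 3.10 or later), the averages are a processing step, while the correlation examines the relationship:

import statistics

income      = [4500, 2300, 1000, 3000, 4300]
expenditure = [2300, 1200, 3000, 1450, 7000]

# Processing: summary figures
print(statistics.mean(income), statistics.mean(expenditure))   # 3020 2990

# Analysis: how do the two variables move together?
print(statistics.correlation(income, expenditure))   # Pearson r, about 0.4 here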

Methodologies of processing and analysis vary across different types or
classification of data. Some of the emerging data analysis methods
include data mining, data analytics, and data visualization.

Output
In the final stage of data processing, the output or result is
generated. Once the result or output is received, further processing,
analysis, interpretation, and visualization may be done manually or
through IT. In all stages of data processing, the quality of the data
must be maintained, as explained below.

Quality of data:

Quality of data refers to the condition of the values of variables and
their fitness for serving the intended purpose. The quality of data
depends on what questions are used to collect the data, how the data is
collected, from what sources it is collected, when it is collected, and
in what sequence it is collected.

Various characteristics of data quality are accuracy, freedom from human
bias and duplication, appropriateness, reliability, consistency,
validity, relevance, and completeness. Poor quality data leads to
inaccurate processing, analysis, interpretation, and understanding. Data
cleaning, validation, and verification methods are generally used to
ensure superior data quality.
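
A minimal sketch of a data-cleaning check in Python (the records and the valid age range are invented): simple validation can flag duplicates and out-of-range values before analysis.

records = [{"id": 1, "age": 34}, {"id": 1, "age": 34},
           {"id": 2, "age": -5}, {"id": 3, "age": 130}]

seen, clean, rejected = set(), [], []
for rec in records:
    if rec["id"] in seen:                # duplication check
        rejected.append(rec)
    elif not (0 <= rec["age"] <= 120):   # validity / accuracy check
        rejected.append(rec)
    else:
        clean.append(rec)
        seen.add(rec["id"])

print(len(clean), len(rejected))   # 1 3 -> only the first record passes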

***
