Emerging Tech CH 2
DATA SCIENCE
An Overview of Data Science
Data science is also known as data-driven science.
It is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
o It is a blend of various tools, algorithms, and machine learning principles.
It is much more than simply analyzing data.
It offers a range of roles and requires a range of skills.
It is primarily used to make decisions and predictions.
It is a process of using raw data to explore insights and deliver a data product.
What are data and information?
• Data can be defined as a representation of facts, concepts, or instructions
in a formalized manner
• Data is unprocessed facts and figures
• Data is a symbol or any raw material (it can be text, numbers, images, or diagrams).
o Can be represented with:
alphabets (A-Z, a-z)
digits (0-9) or
special characters (+, -, /, *, <,>, =, etc.).
Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people or machines.
Raw data is fed to computer systems to generate the final output, which is information.
Data processing cycle
Input:
• In this step, the input data is prepared in some convenient form for
processing.
• The form will depend on the processing machine.
• For example, when electronic computers are used, the input data can be recorded on any one of several types of storage media, such as a hard disk, CD, flash disk, and so on.
Processing:
• The input data is changed to produce data in a more useful form.
Output:
• The result of the preceding processing step is collected.
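A minimal sketch of the cycle in Python (the score values and the average calculation are illustrative assumptions, not from the slides):

```python
# Input: raw data prepared in a convenient form for processing
raw_scores = ["70", "85", "92", "66"]        # e.g. values read from a file or keyboard

# Processing: transform the input into a more useful form
scores = [int(s) for s in raw_scores]        # convert text to numbers
average = sum(scores) / len(scores)          # derive new information

# Output: collect and present the result (the "information")
print(f"Average score: {average:.1f}")
```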
DATA TYPES
A data type is simply an attribute of data that tells the compiler or interpreter how
the programmer intends to use the data.
A data type constrains the values that an expression, such as a variable or a function, might take.
This data type defines the operations that can be done on the data, the meaning of the
data, and the way values of that type can be stored.
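As a small hedged illustration in Python, the same operator behaves differently depending on the types involved, because the type defines which operations are valid and what they mean:

```python
# The type of a value defines the operations allowed on it and their meaning.
print(3 + 4)        # int + int  -> arithmetic addition: 7
print("3" + "4")    # str + str  -> string concatenation: "34"

# An operation that the types do not define raises an error.
try:
    "3" + 4         # mixing str and int is not defined
except TypeError as err:
    print("TypeError:", err)
```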
Data types from Computer programming perspective
Integers (int) - used to store whole numbers, mathematically known as integers
Booleans (bool) - used to represent values restricted to one of two values: true or false
Characters (char) - used to store a single character
Floating-point numbers (float) - used to store real numbers
Alphanumeric strings (string) - used to store a combination of characters and numbers
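A minimal sketch of these five data types in Python (the sample values are assumptions; note that Python has no separate char type, so a one-character string stands in for it):

```python
age = 21                    # integer (int): whole numbers
is_registered = True        # Boolean (bool): restricted to True or False
grade = "A"                 # character: Python uses a one-character string
cgpa = 3.75                 # floating-point number (float): real numbers
student_id = "ETS0123/11"   # alphanumeric string (str): characters and digits

for value in (age, is_registered, grade, cgpa, student_id):
    print(type(value).__name__, value)
```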
Data types from Data Analytics perspective
From a data analytics point of view, it is important to understand that there are three common data types or structures:
1. Structured,
2. Semi-structured, and
3. Unstructured data types.
1. Structured Data
Structured data are those that can be easily organized, stored and
transferred in a defined data model.
Easily searchable by basic algorithms and tools such as spreadsheets.
Easily processed by computers.
Structured data conforms to a tabular format with a relationship between
the different rows and columns.
Example:
o Excel files or SQL databases
Example: a database table
ID | Name | Age | Department | CGPA
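A minimal sketch of this structured, tabular layout using Python's built-in sqlite3 module (the table name, columns, and sample row are assumptions based on the example headings above):

```python
import sqlite3

# Structured data: rows and columns with a fixed, predefined schema
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE students (
    id INTEGER, name TEXT, age INTEGER, department TEXT, cgpa REAL)""")
conn.execute("INSERT INTO students VALUES (1, 'Abebe', 21, 'Software Eng.', 3.75)")

# Easily searchable with a simple query
for row in conn.execute("SELECT name, cgpa FROM students WHERE cgpa > 3.5"):
    print(row)   # ('Abebe', 3.75)
```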
2. Semi-structured Data
Their structures are irregular, implicit, flexible and often nested
hierarchically.
It is a form of structured data that does not conform to the formal structure of data models associated with relational databases.
It has some organizational properties, like tags and other markers to separate semantic elements, which make it easier to analyze.
It is also known as a self-describing structure.
o Examples: include JSON and XML
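A minimal sketch of semi-structured data in Python using the standard json module (the record fields are illustrative assumptions; note the nested, flexible structure and the self-describing tags):

```python
import json

# Semi-structured data: tags name the fields, nesting is flexible,
# and different records may carry different fields.
record = """
{
  "name": "Abebe",
  "department": "Software Engineering",
  "contacts": {"email": "abebe@example.com"},
  "courses": ["Emerging Technologies", "Databases"]
}
"""
student = json.loads(record)          # parse the self-describing structure
print(student["contacts"]["email"])   # navigate by tag, not by column position
```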
3. Unstructured Data
It is information that either does not have a predefined data model or is not organized in a pre-defined manner.
It is not easily combined or computationally analyzed.
Unstructured information is typically text-heavy but may contain data
such as dates, numbers, and facts as well.
This results in irregularities and ambiguities that make it difficult to
understand using traditional programs as compared to data stored in
structured databases.
o Examples: include text documents, audio, video files , or PDFs
Metadata
Metadata – Data about Data
From a technical point of view, this is not a separate data structure, but it is one of the most important elements for big data analysis and big data solutions.
Metadata is data about data; it describes the meaning of data.
It provides additional information about a specific set of data.
Metadata is considered processed data and is used by big data solutions for initial analysis.
o Example: In a set of photographs, metadata could describe when and where the photos were taken.
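A minimal sketch in Python of metadata describing a photo (all field names and values are assumed for illustration; real image metadata such as EXIF carries similar information):

```python
# The photo itself is the data; the dictionary below is data *about* that data.
photo_metadata = {
    "file_name": "holiday_001.jpg",
    "taken_on": "2023-01-07 14:32",
    "location": "Addis Ababa",
    "camera": "Phone camera, 12 MP",
    "size_bytes": 2_457_600,
}

# Metadata supports initial analysis without opening the photo itself,
# e.g. finding when and where pictures were taken.
print(photo_metadata["taken_on"], photo_metadata["location"])
```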
Data value Chain
Describes the process of data creation and use, from first identifying a need for data to its final use and possible reuse.
The Data Value Chain is introduced to describe the information flow within a
big data system as a series of steps needed to generate value and useful
insights from data.
Data chain: any combination of two or more data elements/data items.
1. Data Acquisition
It is the process of gathering, filtering, and cleaning data before it is put
in a data warehouse or any other storage on which data analysis can be
carried out.
The data is later used for data analysis.
Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.
Data acquisition answers the following:
• How do we get the data?
• What kind of data do we need?
• Who owns the data?
2. Data Analysis
It is concerned with making the acquired raw data amenable to use in decision-making as well as in domain-specific applications.
Data analysis involves exploring, transforming, and modeling data with the
goal of highlighting relevant data, synthesizing and extracting useful hidden
information with high potential from a business point of view.
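A minimal exploratory sketch in Python using only the standard library (the sales records and the "revenue by region" question are assumptions made for illustration):

```python
from collections import defaultdict
from statistics import mean

# Acquired raw data (assumed sample records)
sales = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 200.0},
]

# Transform and summarize to highlight relevant, hidden information
revenue_by_region = defaultdict(list)
for sale in sales:
    revenue_by_region[sale["region"]].append(sale["amount"])

for region, amounts in revenue_by_region.items():
    print(region, "total:", sum(amounts), "average:", mean(amounts))
```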
3. Data Curation
It is the active management of data over its life cycle to ensure it
meets the necessary data quality requirements for its effective usage.
o e.g. research
Data curation processes can be categorized into different activities
such as content creation, selection, classification, transformation,
validation, and preservation.
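A minimal sketch of two of these curation activities, validation and transformation, in Python (the records, the required fields, and the cleaning rules are assumptions):

```python
# Assumed raw records needing curation before effective use
records = [
    {"name": " Abebe ", "age": "21"},
    {"name": "", "age": "not a number"},   # fails validation
]

def curate(record):
    """Validate required fields and transform values into a clean form."""
    name = record["name"].strip().title()
    if not name or not record["age"].isdigit():
        return None                        # reject records that fail validation
    return {"name": name, "age": int(record["age"])}

curated = []
for record in records:
    clean = curate(record)
    if clean is not None:
        curated.append(clean)
print(curated)   # [{'name': 'Abebe', 'age': 21}]
```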
4. Data Storage
It is the persistence and management of data in a scalable way
that satisfies the needs of applications that require fast access to
the data.
o E.g. Relational Database Management Systems (RDBMS)
RDBMSs have been the main, and almost only, solution to the storage paradigm for nearly 40 years.
However, traditional RDBMSs are generally not suited to big data.
5. Data Usage
It covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the
business activity.
What Is Big Data?
Big data is the term for a collection of data sets so large and complex that they cannot be handled by a single computer.
They become difficult to process using on-hand database management tools or traditional data processing applications.
Here, a “large dataset” means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
Big data is characterized by 3 Vs (often extended to 5 Vs) and more:
Big data is characterized by 5 Vs and more:
Volume:
• Refers to the vast amount of data generated every second.
• Data is generated from emails, social networking sites, photos, videos, sensor data, etc.
• These ever-growing data sets are too large to store and analyze using traditional database technology.
• Now, with big data technology, we can store and use such data with the help of distributed systems.
Big data is characterized by 5 Vs and more:
Variety
Refers to the different types of data we can now use.
In the past, we focused on dealing only with structured data that fitted neatly into tables and relational databases such as MySQL.
Now, about 80% of data is unstructured and cannot easily be put into tables.
With big data technology, we can now analyze and bring together data of different types such as messages, social media conversations, photos, sensor data, and video or voice recordings.
• We can handle not only structured data but also accommodate unstructured and semi-structured data.
Big data is characterized by 5 Vs and more:
Veracity: refers to the trustworthiness/reliability of the data.
With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos, and colloquial speech, as well as the reliability and accuracy of content), but big data and analytics technology now allows us to work with these types of data.
The volumes often make up for the lack of quality or accuracy.
Big data is characterized by 5 Vs and more:
Value: the most important V.
Having access to big data is no good unless we can turn it into value.
Big data solutions collect the data, analyze it, and make the results available for the cases required by the business.
Big data is characterized by 5 Vs and more:
Velocity: refers to the speed at which data is generated and the speed at which data moves around.
o Just think of social media messages going viral in seconds.
o Technology allows us now to analyze the data while it is being
generated (sometimes referred to as in-memory analytics),
without ever putting it into databases.
Big data is characterized by 5 Vs (summary)
Volume: large amounts of data (zettabytes/massive datasets)
Velocity: Data is live streaming or in motion
Variety: data comes in many different forms from diverse sources
Veracity: can we trust the data? How accurate is it?
Value: a mechanism to bring the correct meaning out of the data
Clustered Computing and Hadoop Ecosystem
Clustered Computing
Because of the qualities of big data, individual computers
are often inadequate for handling the data at most stages.
To better address the high storage and computational needs
of big data, computer clusters are a better fit.
Different tasks are given to different computers.
Cont’d…
Big data clustering software combines the resources of many
smaller machines,
Seeking to provide a number of benefits:
o Resource pooling/sharing
o High availability
o Easy scalability
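The same idea can be sketched on a single machine with Python's multiprocessing module, where a pool of worker processes stands in for the machines in a cluster (this is only an analogy for resource pooling and task distribution, not a real cluster):

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Work assigned to one 'machine': here, just sum a slice of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 250_000] for i in range(0, len(data), 250_000)]

    # Different tasks are given to different workers, and the partial
    # results are combined, as a cluster would combine its machines' work.
    with Pool(processes=4) as pool:
        partial_sums = pool.map(process_chunk, chunks)
    print(sum(partial_sums))   # 499999500000
```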
Hadoop and its Ecosystem
Hadoop is an open-source framework intended to make interaction with
big data easier.
It is a framework that allows for the distributed processing of large
datasets across clusters of computers using simple programming models.
Hadoop is software that manages different computers that are located in different places but are connected to each other through a computer network.
It is inspired by a technical document published by Google.
The four key characteristics of Hadoop are:
Economical: Its systems are highly economical as ordinary computers can be
used for data processing.
Reliable: It is reliable as it stores copies of the data on different machines and is
resistant to hardware failure.
Scalable: It is easily scalable, both horizontally and vertically. Adding a few extra nodes helps in scaling up the framework.
Flexible: It is flexible, and you can store as much structured and unstructured data as you need and decide how to use it later.
Cont’d…
Hadoop has an ecosystem that has evolved from its
four core components:
o Data management,
o Access,
o Processing, and
o Storage.
It is continuously growing to meet the needs of Big
Data.
It comprises the following components and many others:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based data processing (see the sketch after this list)
Spark: In-Memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning algorithm libraries
Solr, Lucene: Searching and indexing
ZooKeeper: Cluster management
Oozie: Job Scheduling
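As a small illustration of the MapReduce programming model named above (not Hadoop's Java API, just a hedged single-machine sketch in Python), a word count maps each word to a (word, 1) pair and then reduces the pairs by key:

```python
from collections import defaultdict

lines = ["big data needs big tools", "hadoop processes big data"]

# Map: emit a (key, value) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine the values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)
# {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'processes': 1}
```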
Hadoop Ecosystem
HDFS
HDFS is specially designed for storing huge datasets on commodity hardware.
Data is stored in a distributed manner.
It enables fast data transfer among the nodes.
It is all about storing and managing huge datasets in a cluster.
It is highly fault tolerant and efficient enough to process huge amounts of data.
• HDFS has two core components:
1. Name node and
2. Data node
• Name node: also called the master
• It is the brain of the system.
• There is only one name node.
• It maintains and manages the data nodes and also stores the metadata.
• If the name node crashes, the entire system goes down.
• Data node: also called a slave
• Stores blocks of data.
• There can be multiple data nodes.
• Stores the actual data, does reading, writing, and processing, and performs replication as well.
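A conceptual sketch in Python of how the name node's metadata relates to blocks stored on data nodes (this is only an illustration of the idea, not Hadoop's actual implementation; the file name, block size, and replication factor are assumptions):

```python
import itertools

# Conceptual model only: the name node keeps metadata, data nodes keep blocks.
REPLICATION_FACTOR = 3
data_nodes = {"node1": {}, "node2": {}, "node3": {}, "node4": {}}
name_node = {}          # metadata kept by the master: file -> block locations
node_cycle = itertools.cycle(data_nodes)

def put_file(name, content, block_size=8):
    """Split a file into blocks, replicate each block, record the metadata."""
    blocks = [content[i:i + block_size] for i in range(0, len(content), block_size)]
    name_node[name] = []
    for block_id, block in enumerate(blocks):
        holders = [next(node_cycle) for _ in range(REPLICATION_FACTOR)]
        for node in holders:
            data_nodes[node][(name, block_id)] = block   # slaves store the data
        name_node[name].append((block_id, holders))      # master stores metadata only

put_file("report.txt", "big data stored in blocks across the cluster")
print(name_node["report.txt"][0])   # (0, ['node1', 'node2', 'node3'])
```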
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
2. Processing the data in storage
3. Computing and analyzing data
4. Visualizing the result
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
The first stage of Big Data processing is Ingest.
The data is ingested or transferred to Hadoop from
various sources such as relational databases, systems, or
local files.
Sqoop transfers data from RDBMS to HDFS, whereas
Flume transfers event data.
Big Data Life Cycle with Hadoop
2. Processing the data in storage
The second stage is Processing.
In this stage, the data is stored and processed.
The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
Spark and MapReduce perform the data processing.
Big Data Life Cycle with Hadoop
3. Computing and analyzing data
The third stage is to Analyze. Here, the data is analyzed by
processing frameworks such as Pig, Hive, and Impala.
Pig converts the data using map and reduce operations and then analyzes it.
Hive is also based on map and reduce programming and is most suitable for structured data.
Big Data Life Cycle with Hadoop
4. Visualizing the results
The fourth stage is Access, which is performed by tools
such as Hue and Cloudera Search.
In this stage, the analyzed data can be accessed by users.