Chapter 2 [Data Science]
Data Science
Main Contents
Overview of Data Science
Data
Is a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by human or electronic machines.
Information
Is the processed data on which decisions and actions are based.
It is data that has been processed into a form that is meaningful to the recipient.
Data Processing Cycle
Input
In this step, the input data is prepared in some convenient form for processing.
The form will depend on the processing machine used to produce an output.
Example: [keyboard, mouse…]
Processing
In this step, the input data is changed to produce data in a more useful
form.
Example: [CPU, GPU, Network Interface Cards…]
Output
At this stage, the result of the preceding processing step is collected.
The particular form of the output data depends on the use of the data.
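The three steps above can be sketched in Python; the score list and the pass/fail rule are made-up illustrations, not part of the chapter:

```python
# A minimal sketch of the input -> processing -> output cycle.
# The raw score values and the grading rule are invented example data.

def collect_input():
    """Input: prepare raw data in some convenient form for processing."""
    return [55, 78, 91, 42, 67]

def process(scores):
    """Processing: change the input data into a more useful form."""
    return ["pass" if s >= 50 else "fail" for s in scores]

def emit_output(results):
    """Output: collect the result of the preceding processing step."""
    return {"pass": results.count("pass"), "fail": results.count("fail")}

raw = collect_input()
processed = process(raw)
summary = emit_output(processed)
print(summary)  # {'pass': 4, 'fail': 1}
```

Note how the form of the output (a small summary dictionary here) depends on how the data will be used, exactly as the slide states.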
Data types from Computer programming perspective
A data type tells the compiler or interpreter how the programmer intends to use the data.
Common data types include integers, Booleans, characters, floating-point numbers, and alphanumeric strings.
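As a hedged illustration, these common types look as follows in Python (which has no separate character type; a character is a one-character string):

```python
# Common programming data types, illustrated with Python literals.
an_integer = 42            # integer
a_boolean = True           # Boolean
a_character = "A"          # character (a one-character string in Python)
a_float = 3.14             # floating-point number
a_string = "data science"  # alphanumeric string

# Each value carries its type, which tells the interpreter
# how the programmer intends to use the data.
for value in (an_integer, a_boolean, a_character, a_float, a_string):
    print(type(value).__name__, repr(value))
```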
Data types from Data analytics perspective
From a data analytics perspective, there are three common types of data:
Structured,
Semi-structured, and
Unstructured data.
The three types of data, together with metadata, are described below.
Data Types and their Representation
Structured Data
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
Semi-structured Data
Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements. JSON and XML are common examples.
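A minimal sketch of a semi-structured record, using a made-up JSON document (names and values are invented):

```python
import json

# A hypothetical record: no fixed relational schema, but tags (keys)
# separate semantic elements -- the hallmark of semi-structured data.
record = {
    "name": "Abebe",
    "emails": ["abebe@example.com"],    # repeated field, not a fixed column
    "address": {"city": "Addis Ababa"}  # nested field
}

text = json.dumps(record)         # serialize to a JSON string
parsed = json.loads(text)         # parse it back
print(parsed["address"]["city"])  # Addis Ababa
```

Unlike a relational table, nothing forces every record to have the same fields; the tags themselves describe the structure.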
Unstructured Data
Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
Metadata – Data about Data
Metadata is data about data. It provides additional information about a specific set of data, and it is one of the most important elements for Big Data analysis and big data solutions.
In a set of photographs, for example, metadata could describe when and where the photos were taken. The metadata then provides fields for dates and locations which, by themselves, can be considered structured data.
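The photograph example can be sketched as follows; the field names and values are invented for illustration:

```python
# The photo bytes themselves are unstructured; the metadata fields
# (date, location) are structured data about the data.
photo = {
    "pixels": b"\x89PNG...",  # unstructured content (illustrative bytes)
    "metadata": {             # structured fields describing the photo
        "taken_on": "2021-09-11",
        "location": "Addis Ababa",
    },
}

# Metadata lets us query a collection without inspecting the pixels:
photos = [photo]
taken_in_addis = [
    p for p in photos if p["metadata"]["location"] == "Addis Ababa"
]
print(len(taken_in_addis))  # 1
```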
Data Value Chain
The Data Value Chain is concerned with describing the information flow within
a big data system as a series of steps needed to generate value and useful insights
from data.
The data value chain describes the evolution of data from collection through curation and storage to analysis and usage.
Data Acquisition
It is the process of gathering, filtering, and cleaning data before it is put in a data
warehouse or any other storage solution on which data analysis can be carried
out.
Data acquisition is one of the major big data challenges in terms of infrastructure
requirements.
The infrastructure required to support the acquisition of big data must deliver
low, predictable latency in both capturing data and in executing queries; be able
to handle very high transaction volumes, often in a distributed environment; and
support flexible and dynamic data structures.
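A minimal sketch of the gather-filter-clean idea, assuming made-up sensor records and a simple validity rule:

```python
# Acquisition sketch: gather raw records, filter out invalid ones,
# and clean the rest before they reach a storage solution.
# The field names and the validity rule are assumptions for illustration.
raw_records = [
    {"sensor": "t1", "temp": "21.5"},
    {"sensor": "t2", "temp": ""},      # missing reading -> filtered out
    {"sensor": "t3", "temp": "19.0"},
]

def clean(record):
    """Clean a record by converting the temperature string to a float."""
    return {"sensor": record["sensor"], "temp": float(record["temp"])}

# Filter (drop records with no reading), then clean the survivors.
acquired = [clean(r) for r in raw_records if r["temp"]]
print(acquired)
```

A real acquisition pipeline would additionally need the low-latency, high-volume, distributed infrastructure the slide describes; this sketch shows only the logical filter/clean step.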
Data Analysis
It is concerned with making the raw data acquired amenable to use in
decision-making as well as domain-specific usage.
Data analysis involves:
Exploring,
Transforming, and
Modeling data
The main goal of data analysis is highlighting relevant data, synthesizing and
extracting useful hidden information with high potential from a business point
of view.
Related areas include data mining, business intelligence, and machine learning.
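The explore/transform/model steps can be sketched with the standard library; the sales figures and the naive forecasting rule are illustrative assumptions, not a recommended method:

```python
import statistics

# A small made-up monthly sales series.
sales = [120, 135, 150, 160, 181]

# Exploring: basic descriptive statistics.
mean_sales = statistics.mean(sales)

# Transforming: derive month-over-month growth from the raw series.
growth = [b - a for a, b in zip(sales, sales[1:])]

# Modeling: a deliberately naive forecast -- last value plus mean growth.
forecast = sales[-1] + statistics.mean(growth)
print(mean_sales, growth, forecast)
```

The point is the shape of the workflow: raw data is explored, transformed into a more revealing form, and finally modeled to extract information useful for decisions.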
Data Curation
It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage: ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for purpose.
A key trend for the curation of big data utilizes community and crowdsourcing approaches.
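A minimal sketch of a curation check, assuming a hypothetical record schema; real curation spans the whole life cycle, but validation like this is one concrete piece of keeping data trustworthy and fit for purpose:

```python
# Hypothetical schema: every curated record must carry these fields.
REQUIRED_FIELDS = {"id", "title", "year"}

def is_curated(record):
    """A record passes curation if required fields exist and the year is plausible."""
    return (REQUIRED_FIELDS <= record.keys()
            and 1900 <= record.get("year", 0) <= 2100)

records = [
    {"id": 1, "title": "Report A", "year": 2020},
    {"id": 2, "title": "Report B"},  # missing year -> rejected
]
trusted = [r for r in records if is_curated(r)]
print(len(trusted))  # 1
```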
Data Storage
It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
Relational Database Management Systems (RDBMS) have been the main, and almost unique, solution to the storage paradigm for nearly 40 years.
Not Only SQL (NoSQL) technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
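The contrast between the two storage styles can be sketched with the standard library alone: SQLite stands in for an RDBMS, and a plain dictionary of JSON documents stands in for a document-oriented NoSQL store. All names and values are made up.

```python
import json
import sqlite3

# Relational style: a fixed schema enforced up front.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (id, name) VALUES (1, 'Sara')")
name = db.execute("SELECT name FROM users WHERE id = 1").fetchone()[0]

# Document style (NoSQL-like): schema-free JSON documents keyed by id.
doc_store = {}
doc_store["user:1"] = json.dumps({"name": "Sara", "tags": ["admin"]})

print(name, json.loads(doc_store["user:1"])["tags"])
```

The relational side rejects data that does not fit the declared columns; the document side accepts any shape, which is part of what lets NoSQL systems scale out across flexible, dynamic data structures.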
Data Usage
It covers the data-driven business activities that need access to data,
its analysis, and the tools needed to integrate the data analysis within
the business activity.
Data usage in business decision-making can enhance competitiveness through the reduction of costs or increased added value.
Basic Concepts of Big Data
Big data is often characterized by the four Vs: Volume, Velocity, Variety, and Veracity.
Clustered Computing
Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages. Clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
Resource Pooling
High Availability
Easy Scalability
Clustered Computing and Hadoop Ecosystem
Resource Pooling
Combining the available storage space to hold data.
High Availability
Clusters can provide availability guarantees to prevent hardware or software failures from affecting access to data and its processing.
Easy Scalability
Clusters make it easy to scale horizontally by adding additional
machines to the group. This means the system can react to changes in
resource requirements without expanding the physical resources on a
machine.
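Horizontal scaling can be sketched with simple hash partitioning; the node names are made up, and real clusters use more sophisticated schemes (such as consistent hashing) to avoid reshuffling most keys when a node is added.

```python
# Sketch of scaling horizontally: adding a machine to the group
# redistributes keys without enlarging any single node.

def assign_node(key, nodes):
    """Pick a node for a key by hashing (simple modulo partitioning)."""
    return nodes[hash(key) % len(nodes)]

three_nodes = ["node-a", "node-b", "node-c"]
four_nodes = three_nodes + ["node-d"]  # react to demand: add a machine

keys = [f"record-{i}" for i in range(100)]
before = {n: sum(assign_node(k, three_nodes) == n for k in keys)
          for n in three_nodes}
after = {n: sum(assign_node(k, four_nodes) == n for k in keys)
         for n in four_nodes}
print(before, after)  # every key still lands on exactly one node
```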
Cluster membership and resource allocation can be handled by software like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
The assembled computing cluster often acts as a foundation that other software interfaces with to process the data.
Hadoop is an open-source framework intended to make interaction with big data easier. Its key characteristics include being:
Reliable
Scalable
Flexible
Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
Big data processing with Hadoop has four key stages:
1. Ingesting data into the system:
Data is transferred to Hadoop from various sources such as relational databases or local files. Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage:
The second stage is Processing.
The data is stored in the distributed file system, HDFS, and the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing.
3. Computing and analyzing data:
Data is analyzed by processing frameworks such as Pig, Hive, and Impala.
Pig converts the data using MapReduce and then analyzes it.
Hive is also based on MapReduce programming and is most suitable for structured data.
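The MapReduce model that Pig and Hive build on can be sketched as a single-process word count; a real cluster runs the map and reduce phases in parallel across many machines, with a shuffle step between them that this sketch omits.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) pairs for every word in a line of text."""
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    """Reduce: sum the emitted counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data big insights", "big cluster"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(pairs)
print(counts["big"])  # 3
```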
4. Visualizing the results:
The fourth stage is Access, which is performed by tools such as Cloudera Search. In this stage, the analyzed data can be accessed by users.
END OF CHAPTER TWO
Next: Chapter Three [Artificial Intelligence]