
CHAPTER TWO

Data Science

Page 2

Main Contents
 Overview of Data Science
 Data and Information
 Data Processing Cycle
 Data Types and their Representation
 Data Value Chain
 Basic Concepts of Big Data
 Clustered Computing and Hadoop Ecosystem

Overview of Data Science Page 3

 Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
 Let’s consider this idea by thinking about some of the data involved in buying a box of cereal from the store or supermarket:
Whatever your cereal preference (teff, wheat, or barley), you prepare for the purchase by writing “cereal” in your notebook. This planned purchase, even though it is only written in pencil, is a piece of data that you can read. (This is an example of data.)


Data and Information Page 4

Data
 Is the representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
 It can be described as unprocessed facts and figures.
 Can be represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
Data and Information … Page 5

Information
 Is the processed data on which decisions and actions are based.
 It is data that has been processed into a form that is meaningful to its receivers.
 Information is interpreted data: created from organized, structured, and processed data in a particular context.


Data Processing Cycle Page 6

 Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
 Data processing consists of the following basic steps:
 Input, processing, and output
 These three steps constitute the data processing cycle.

[Figure: the data processing cycle: Input → Processing → Output]
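
To make the cycle concrete, here is a minimal Python sketch of the three steps; the exam scores and the averaging step are made-up illustrative examples:

```python
# A minimal sketch of the data processing cycle in Python.
# The exam scores below are hypothetical example data.

def main():
    # Input: raw, unprocessed facts (data), prepared in a convenient form
    scores = [85, 92, 78, 90]

    # Processing: the data is changed into a more useful form
    average = sum(scores) / len(scores)

    # Output: the result (information) in a form suited to its use
    print(f"Average score: {average:.1f}")

if __name__ == "__main__":
    main()
```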
Data Processing Cycle… Page 7

Input
 In this step, the input data is prepared in some convenient form for processing.
 The form will depend on the processing machine.
 Any information that is provided to a computer or a software program is known as input.
 The input enables the computer to do what it is designed to do and produce an output.
Example: keyboard, mouse, ...
Data Processing Cycle… Page 8

Processing
In this step, the input data is changed to produce data in a more useful form.
Example: CPU, GPU, network interface cards, ...
Data Processing Cycle… Page 9

Output
At this stage, the result of the preceding processing step is collected.
The particular form of the output data depends on the use of the data.
Example: monitor, printer, projector, ...


Data Types and their Representation Page 10

 In computer programming, a data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
Data types from the computer programming perspective
 The common data types include:
 Integers (int): used to store whole numbers
 Booleans (bool): used to represent true or false
 Characters (char): used to store a single character like “A”
 Floating-point numbers (float): used to store real numbers
 Alphanumeric strings (string): used to store a combination of characters and numbers like “ddu01256”
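
As an illustration, the same common data types can be written in Python; note that Python has no separate char type, so a one-character string stands in, and all values here are made-up examples:

```python
# The common data types expressed in Python (illustrative values only).
count = 42               # integer (int): a whole number
passed = True            # boolean (bool): true or false
grade = "A"              # character: Python uses a 1-character string
price = 19.99            # floating-point number (float): a real number
student_id = "ddu01256"  # string (str): characters and digits combined

print(type(count), type(passed), type(grade), type(price), type(student_id))
```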


Data Types and their Representation Page 11

Data types from the data analytics perspective
 From a data analytics point of view, it is important to understand that there are three common data types or structures:
 Structured,
 Semi-structured, and
 Unstructured data types
 A fourth data type is metadata, which is data about data.
 The following figure describes the three types of data and metadata.
Data Types and their Representation… Page 12

Structured Data
 Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
 Structured data conforms to a tabular format with a relationship between the different rows and columns.
Example: Excel files, Comma-Separated Value files (.csv), and SQL database files.
 Each of these has structured rows and columns that can be sorted.
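
A small sketch of structured data using Python's standard csv module; the students.csv file and its columns are hypothetical:

```python
import csv

# Hypothetical structured data: every row follows the same pre-defined columns.
rows = [
    {"id": "1", "name": "Abebe", "score": "85"},
    {"id": "2", "name": "Sara", "score": "92"},
]

# Write the rows to a CSV file (a common structured format).
with open("students.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "score"])
    writer.writeheader()
    writer.writerows(rows)

# Because the structure is pre-defined, rows can be read back and sorted easily.
with open("students.csv", newline="") as f:
    by_score = sorted(csv.DictReader(f), key=lambda r: int(r["score"]), reverse=True)
print([r["name"] for r in by_score])  # ['Sara', 'Abebe']
```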
Data Types and their Representation… Page 13

Semi-structured Data
 Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables.
 It contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as a self-describing structure.
Examples: JSON (JavaScript Object Notation) and XML (Extensible Markup Language) are forms of semi-structured data.
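
To see the self-describing structure, here is a small JSON record parsed with Python's json module; the student record itself is an invented example:

```python
import json

# A hypothetical semi-structured JSON record: the keys act as tags that
# separate semantic elements, and nesting enforces a hierarchy of fields.
record = """
{
  "student": {
    "id": "ddu01256",
    "name": "Abebe",
    "courses": ["Data Science", "Artificial Intelligence"]
  }
}
"""

data = json.loads(record)
print(data["student"]["name"])        # Abebe
print(data["student"]["courses"][0])  # Data Science
```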
Data Types and their Representation… Page 14

Unstructured Data
 Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
 Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well.
 This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases.
Example: audio files, video files, and NoSQL (Not Only SQL) databases.
Data Types and their Representation… Page 15

Metadata (Data about Data)


 From a technical point of view, this is not a separate data structure, but it is one of the most important elements for Big Data analysis and big data solutions.
 Metadata is data about data.
 It provides additional information about a specific set of data.
 Metadata is frequently used by Big Data solutions for initial analysis.
 In a set of photographs, for example, metadata could describe when and where the photos were taken. The metadata then provides fields for dates and locations which, by themselves, can be considered structured data.
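
A brief sketch of the photo example: the metadata fields are structured even though the photos themselves are not (file names, dates, and locations below are invented):

```python
# Hypothetical metadata for a set of photos: the image content is
# unstructured, but the date and location fields are structured data.
photos = [
    {"file": "img_001.jpg", "taken": "2023-05-01", "location": "Dire Dawa"},
    {"file": "img_002.jpg", "taken": "2023-06-03", "location": "Addis Ababa"},
]

# Structured metadata can be queried directly, e.g. all photos taken in May 2023.
may_photos = [p["file"] for p in photos if p["taken"].startswith("2023-05")]
print(may_photos)  # ['img_001.jpg']
```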
Data Types and their Representation… Page 16

[Figure: Metadata]
Data Value Chain Page 17

 The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
 The data value chain describes the evolution of data from collection to analysis, dissemination, and the final impact of data on decision-making.
 The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage.
Data Value Chain… Page 18

Data Acquisition
 It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
 Data acquisition is one of the major big data challenges in terms of infrastructure requirements.
 The infrastructure required to support the acquisition of big data must deliver low, predictable latency both in capturing data and in executing queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible and dynamic data structures.
Data Value Chain… Page 19

Data Analysis
 It is concerned with making the raw data acquired amenable to use in decision-making, as well as domain-specific usage.
 Data analysis involves:
 Exploring,
 Transforming, and
 Modeling data
 The main goal of data analysis is highlighting relevant data and synthesizing and extracting useful hidden information with high potential from a business point of view.
 Related areas include data mining, business intelligence, and machine learning.
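
A minimal sketch of the three activities (exploring, transforming, modeling) on hypothetical monthly sales figures; the linear_regression helper needs Python 3.10 or newer:

```python
import statistics

# Hypothetical monthly sales figures.
sales = [120, 135, 99, 180, 175, 160]

# Exploring: summarize the raw data.
print(min(sales), max(sales), statistics.mean(sales))

# Transforming: normalize the values into the 0-1 range.
lo, hi = min(sales), max(sales)
normalized = [(s - lo) / (hi - lo) for s in sales]

# Modeling: fit a simple linear trend of sales against the month index.
months = range(len(sales))
slope, intercept = statistics.linear_regression(months, sales)
print(f"trend: {slope:+.1f} units per month")
```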
Data Value Chain… Page 20

Data Curation
 It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
 Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
 Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
 Data curators (scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for purpose.
 A key trend for the curation of big data utilizes community and crowdsourcing approaches.
Data Value Chain… Page 21

Data Storage
 It is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
 Relational Database Management Systems (RDBMS) have been the main, and almost only, solution to the storage paradigm for nearly 40 years.
 Not Only SQL (NoSQL) technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
Data Value Chain… Page 22

Data Usage
 It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
 Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
Basic Concepts of Big Data Page 23

 Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and gain insights from large datasets.
 While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years.
Basic Concepts of Big Data Page 24

What is Big Data?


 Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
 In this context, a “large dataset” means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
 Big data is characterized by the 3Vs and more: Volume, Velocity, and Variety, plus Veracity.
Basic Concepts of Big Data Page 25

Characteristics of Big Data


 Volume: large amounts of data (massive datasets)
 Velocity: data is live-streaming or in motion
 Variety: data comes in many different forms from diverse sources
 Veracity: can we trust the data? How accurate is it?


Clustered Computing and Hadoop Ecosystem Page 26

Clustered Computing
Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages.
To better address the high storage and computational needs of big data, computer clusters are a better fit.
Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
 Resource Pooling
 High Availability
 Easy Scalability
Clustered Computing and Hadoop Ecosystem… Page 27

Resource Pooling
 Combining the available storage space to hold data.
High Availability
 Clusters can provide availability guarantees that prevent hardware or software failures from affecting access to data and processing.
Easy Scalability
 Clusters make it easy to scale horizontally by adding additional machines to the group. This means the system can react to changes in resource requirements without expanding the physical resources on a machine.
Clustered Computing and Hadoop Ecosystem… Page 28

 Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.
 Cluster membership and resource allocation can be handled by software like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).
 The assembled computing cluster often acts as a foundation that other software interfaces with to process the data.


Clustered Computing and Hadoop Ecosystem… Page 29

Hadoop and its Ecosystem


Hadoop is an open-source framework intended to make interaction with big data easier.
 It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
 It is inspired by a technical document published by Google. The four key characteristics of Hadoop are:
 Economical
 Reliable
 Scalable
 Flexible
Clustered Computing and Hadoop Ecosystem… Page 30

The key characteristics of Hadoop:
 Economical: its systems are highly economical, as ordinary computers can be used for data processing.
 Reliable: it is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
 Scalable: it is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
 Flexible: it is flexible, and you can store as much structured and unstructured data as you need and decide how to use it later.


Clustered Computing and Hadoop Ecosystem… Page 31

 Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.
 It is continuously growing to meet the needs of Big Data.
 It comprises the following main components, among many others:

• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: cluster management
• Oozie: job scheduling
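
MapReduce expresses a computation as a map phase that emits key-value pairs, a shuffle that groups the pairs by key, and a reduce phase that aggregates each group. Below is a pure-Python sketch of the classic word-count pattern; it only mimics the model and is not actual Hadoop MapReduce code:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: lines of text, as a file in HDFS might be split.
lines = ["big data needs big clusters", "hadoop processes big data"]

# Map: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: sort and group the pairs by key (the word).
mapped.sort(key=itemgetter(0))
grouped = groupby(mapped, key=itemgetter(0))

# Reduce: sum the counts within each group.
counts = {word: sum(count for _, count in pairs) for word, pairs in grouped}
print(counts)  # {'big': 3, 'clusters': 1, 'data': 2, ...}
```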
Clustered Computing and Hadoop Ecosystem… Page 32
Big Data Life Cycle with Hadoop (Stages) Page 33

1. Ingesting data into the system:
 The first stage of Big Data processing is Ingest.
 The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files.
 Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage:
 The second stage is Processing.
 In this stage, the data is stored and processed.
 The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.



Big Data Life Cycle with Hadoop… Page 34

3. Computing and analyzing data:
 The third stage is Analyze.
 Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
 Pig converts the data using MapReduce and then analyzes it.
 Hive is also based on MapReduce programming and is most suitable for structured data.
4. Visualizing the results:
 The fourth stage is Access, which is performed by tools such as Hue and Cloudera Search.
 In this stage, the analyzed data can be accessed by users.
Page 35

?
END OF CHAPTER TWO
Next: Chapter Three [Artificial Intelligence]
