
Chapter 2

Data Science
Contents:

• Overview of Data Science
• Data Types and Their Representation
• Data Value Chain
• Basic Concepts of Big Data


Overview of Data Science

• Data science is a multi-disciplinary field that uses scientific methods,


processes, algorithms, and systems to extract knowledge and insights
from structured, semi-structured and unstructured data.

• Data science is much more than simply analyzing data.

• It offers a range of roles and requires a range of skills.


What are data and information?
• Data can be defined as a representation of facts, concepts, or instructions in a
formalized manner.
• It can be described as unprocessed facts and figures.
• It is represented with the help of:
• alphabets (A-Z, a-z),
• digits (0-9) or
• special characters (+, -, /, *, <,>, =, etc.)
• Information is the processed data on which decisions and actions are based.
• It is interpreted data, created from organized, structured, and processed data
in a particular context.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people
or machines to increase its usefulness and add value for a particular
purpose.
• Data processing consists of the following basic steps - input,
processing, and output.
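As a minimal sketch of this cycle (the exam scores are hypothetical example values, not from the slides), the Python snippet below takes raw input data, processes it, and outputs information that a decision could be based on:

# A minimal sketch of the data processing cycle: input -> processing -> output.
# The exam scores below are hypothetical example values.

# Input: unprocessed facts and figures (raw data)
raw_scores = [78, 85, 92, 64, 88]

# Processing: re-structure the data to add value for a particular purpose
average = sum(raw_scores) / len(raw_scores)
passed = [score for score in raw_scores if score >= 50]

# Output: information on which decisions and actions can be based
print(f"Average score: {average:.1f}")
print(f"Students who passed: {len(passed)} out of {len(raw_scores)}")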

[Figure: the Data Processing Cycle, showing input, processing, and output]

Data types and their representation

1. Data types from Computer programming perspective


• Integers (int): used to store whole numbers, mathematically known as integers
• Booleans (bool): used to store a value restricted to one of two values:
true or false
• Characters (char): used to store a single character
• Floating-point numbers (float): used to store real numbers
• Alphanumeric strings (string): used to store a combination of
characters and numbers
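As a quick, hedged sketch, the Python snippet below shows rough equivalents of these types (Python has no separate char type, so a one-character string stands in for it; all values are made-up examples):

# Common programming data types, illustrated in Python.
count = 42                  # integer (int): whole numbers
is_valid = True             # boolean (bool): restricted to True or False
grade = "A"                 # character (char): here a one-character string
temperature = 36.6          # floating-point number (float): real numbers
student_id = "ETS0123/12"   # alphanumeric string: characters and numbers combined

for value in (count, is_valid, grade, temperature, student_id):
    print(value, type(value).__name__)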
2. Data types from Data Analytics perspective

Structured Data
• Has a pre-defined data model
• Straightforward to analyze
• Placed in tabular format
• Example: Excel files or SQL databases

Unstructured Data
• Has no pre-defined data model
• May contain data such as dates, numbers, and facts
• Difficult to understand using traditional programs
• Example: audio and video files

Semi-structured Data
• Contains tags or other markers to separate semantic elements
• Known as a self-describing structure
• Example: JSON and XML
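For instance, a small JSON record (a hypothetical student record, for illustration) is semi-structured: its keys act as the tags or markers that describe each value, which is what makes the structure self-describing. A minimal Python sketch:

import json

# A hypothetical semi-structured record: the keys are the "tags" that
# separate and describe the semantic elements (self-describing structure).
record = '''
{
    "name": "Abebe",
    "age": 21,
    "courses": ["Data Science", "Emerging Technologies"]
}
'''

student = json.loads(record)             # parse the JSON text
print(student["name"], student["age"])   # access elements by their tags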
Metadata
• It is not a separate data structure, but it is one of the most important
elements for Big Data analysis and Big Data solutions.
• It is often described as data about data.
• In a set of photographs, for example, metadata could describe when and
where the photos were taken.
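A hedged sketch of what such photo metadata might look like (the field names are illustrative, loosely modelled on EXIF-style tags, and the values are invented):

# Data: the photo file itself (its pixels).
# Metadata: data about that data, e.g. when and where the photo was taken.
photo_metadata = {
    "file_name": "IMG_0042.jpg",         # illustrative file name
    "date_taken": "2023-05-14 09:30",
    "gps_location": (9.0054, 38.7636),   # illustrative coordinates
    "camera_model": "Example Cam X",
    "resolution": "4032x3024",
}

for key, value in photo_metadata.items():
    print(f"{key}: {value}")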
Data Value Chain

• The Data Value Chain is introduced to describe the information flow
within a big data system.
• It describes the full data lifecycle, from collection to analysis and usage.
• The Big Data Value Chain identifies the following key high-level
activities: data acquisition, data analysis, data curation, data storage,
and data usage.
Basic concepts of big data

• Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.

• According to IBM, big data is characterized by the 3Vs and more:

• Volume (amount of data): dealing with large scales of data within


data processing (e.g. Global Supply Chains, Global Financial
Analysis, Large Hadron Collider).
• Velocity (speed of data): dealing with streams of high frequency of
incoming real-time data (e.g. Sensors, Pervasive Environments, Electronic
Trading, Internet of Things).

• Variety (range of data types/sources): dealing with data using differing


syntactic formats (e.g. Spreadsheets, XML, DBMS), schemas, and meanings
(e.g. Enterprise Data Integration).

• Veracity (trustworthiness of data): can we trust the data? How accurate is it?


Clustered Computing and Hadoop Ecosystem

Clustered Computing

• Cluster computing refers to connecting many computers on a network so that they
perform like a single entity.

• Because of the qualities of big data, individual computers are often inadequate for handling
the data at most stages.

• To better address the high storage and computational needs of big data, computer clusters are
a better fit.

• Big data clustering software combines the resources of many smaller machines, seeking to
provide a number of benefits:
For example, suppose you have a big file containing more than 500 MB of data and you need to
count the number of words in it, but your computer has only 100 MB of memory. How can you handle it?
(A sketch follows the list below.)
• Resource Pooling: Combining the available storage space to hold data
is a clear benefit, but CPU and memory pooling are also extremely
important.
• Processing large datasets requires large amounts of all three of these
resources.
• High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
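One way to approach the 500 MB word-count question above on a single machine is to stream the file in small pieces instead of loading it all at once; a cluster takes the same idea further by splitting the file across machines and summing the partial counts. A minimal Python sketch (the file name is hypothetical):

# Count words in a file larger than available memory by reading it
# line by line (streaming) rather than loading the whole file at once.
def count_words(path, encoding="utf-8"):
    total = 0
    with open(path, "r", encoding=encoding) as f:
        for line in f:                 # only one line is held in memory at a time
            total += len(line.split())
    return total

if __name__ == "__main__":
    # "bigfile.txt" is a hypothetical example file
    print("Word count:", count_words("bigfile.txt"))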
• Cluster membership and resource allocation can be handled by software
like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).

• Hadoop is an open-source framework intended to make interaction with big


data easier.

• It is a framework that allows for the distributed processing of large datasets


across clusters of computers using simple programming models.
The four key characteristics of Hadoop are:

• Economical: Its systems are highly economical as ordinary computers can be used for data
processing.

• Reliable: It is reliable as it stores copies of the data on different machines and is resistant
to hardware failure.

• Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in
scaling up the framework.
• Flexible: It is flexible, so you can store as much structured and unstructured data as you
need and decide how to use it later.
• Hadoop has an ecosystem that has evolved from its four core
components: data management, data access, data processing, and data
storage.

Hadoop ecosystem
Big Data Life Cycle with Hadoop
1. Ingesting data into the system

• The first stage of Big Data processing is Ingest.

• The data is ingested or transferred to Hadoop from various sources


such as relational databases, systems, or local files.

• Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers


event data.
2. Processing the data in storage

• The second stage is Processing. In this stage, the data is stored and
processed.

• The data is stored in the distributed file system (HDFS) and in the
NoSQL distributed database (HBase); Spark and MapReduce perform the data
processing.
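As a hedged, single-process sketch of the MapReduce idea behind this stage (not tied to any particular cluster setup), the snippet below implements the classic word count as a map step that emits (word, 1) pairs and a reduce step that sums them; on a real Hadoop cluster the same two functions would run in parallel over HDFS blocks:

from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each word after grouping by key."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return counts

if __name__ == "__main__":
    sample = ["big data needs big clusters", "data is processed in parallel"]  # hypothetical input
    for word, count in sorted(reduce_phase(map_phase(sample)).items()):
        print(word, count)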
3. Computing and analyzing data

• The third stage is to Analyze. Here, the data is analyzed by processing


frameworks such as Pig, Hive, and Impala.

• Pig converts the data using map and reduce operations and then analyzes it.

• Hive is also based on map and reduce programming and is most
suitable for structured data.
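For illustration only, a hedged sketch of querying structured data in Hive from Python using the third-party PyHive library (the host, database, table, and column names are assumptions, and a reachable HiveServer2 is required):

from pyhive import hive   # third-party package: pip install pyhive

# Hypothetical connection details; a running HiveServer2 is assumed.
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Hive turns this SQL-like query into map and reduce jobs over data stored in HDFS.
cursor.execute("SELECT department, COUNT(*) FROM employees GROUP BY department")
for department, count in cursor.fetchall():
    print(department, count)

cursor.close()
conn.close()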
4. Visualizing the results
• The fourth stage is Access, which is performed by tools such as Hue
and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
Thank you!!!
