CHAPTER 2: DATA SCIENCE

AFTER COMPLETING THIS CHAPTER, THE STUDENTS WILL BE ABLE TO:
➢ Describe what data science is and the role of data scientists.
➢ Differentiate data and information.
➢ Describe the data processing life cycle.
➢ Understand different data types from diverse perspectives.
➢ Describe the data value chain in the emerging era of big data.
➢ Understand the basics of Big Data.
➢ Describe the purpose of the Hadoop ecosystem components.

2.1. AN OVERVIEW OF DATA SCIENCE
➢ What is data science? Can you describe the role of data in emerging technology?
➢ What are data and information?
➢ What is big data?

DEFINITION
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data. Data science is much more than simply analyzing data: it offers a range of roles and requires a range of skills.
Data scientists need to be curious and results-oriented, with exceptional industry-specific knowledge and communication skills that allow them to explain highly technical results to their non-technical counterparts. They possess a strong quantitative background in statistics and linear algebra as well as programming knowledge, with a focus on data warehousing, mining, and modeling to build and analyze algorithms.

2.1.1. WHAT ARE DATA AND INFORMATION?
• Define data and information.

2.1.2. DATA PROCESSING CYCLE

Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose. Data processing consists of three basic steps: input, processing, and output. These three steps constitute the data processing cycle.
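To make the cycle concrete, here is a minimal Python sketch (not part of the original slides; the exam-score data and function name are made up for illustration) that walks through input, processing, and output:

```python
# A minimal, hypothetical sketch of the data processing cycle:
# input -> processing -> output.

def data_processing_cycle(raw_scores):
    # Input: raw data collected from some source (here, exam scores).
    data = list(raw_scores)

    # Processing: restructure the data to make it more useful,
    # e.g. compute the average score.
    average = sum(data) / len(data) if data else 0

    # Output: the processed result (information) returned to the user.
    return {"count": len(data), "average": average}

if __name__ == "__main__":
    print(data_processing_cycle([85, 72, 90, 64]))
    # {'count': 4, 'average': 77.75}
```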
2.3. DATA TYPES AND THEIR REPRESENTATION

Data types can be described from diverse perspectives. In computer science and computer programming, for instance, a data type is simply an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
Almost all programming languages explicitly include the notion of data type, though different languages may use different terminology. Common data types include:
• Integers (int) - used to store whole numbers, mathematically known as integers
• Booleans (bool) - used to represent values restricted to one of two values: true or false
• Characters (char) - used to store a single character
• Floating-point numbers (float) - used to store real numbers
• Alphanumeric strings (string) - used to store a combination of characters and numbers
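As a brief illustration (the chapter is language-neutral; this sketch uses Python, where a one-character string stands in for a char type):

```python
# Illustrating the common data types listed above (Python syntax).
age = 25                    # integer (int): whole numbers
is_enrolled = True          # Boolean (bool): restricted to True or False
grade = "A"                 # character: Python uses a one-character string
gpa = 3.75                  # floating-point number (float): real numbers
student_id = "ETS0123/12"   # alphanumeric string (str): letters and digits

for value in (age, is_enrolled, grade, gpa, student_id):
    print(type(value).__name__, value)
```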
2.3.2. DATA TYPES FROM A DATA ANALYTICS PERSPECTIVE

From a data analytics point of view, it is important to understand that there are three common data types or structures: structured, semi-structured, and unstructured data. The figure below describes the three types of data and metadata.

STRUCTURED DATA

Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze. Structured data conforms to a tabular format with a relationship between the different rows and columns. Common examples of structured data are Excel files or SQL databases. Each of these has structured rows and columns that can be sorted.

SEMI-STRUCTURED DATA
Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as a self-describing structure. JSON and XML are common examples of semi-structured data.
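A small hedged sketch of what self-describing means in practice, using Python's standard json module on a hypothetical student record:

```python
import json

# A hypothetical semi-structured record: the tags (keys) describe the data,
# but there is no rigid table schema, and fields can be nested or missing.
record = """
{
  "name": "Abebe",
  "email": "abebe@example.com",
  "courses": [
    {"code": "EmTe1012", "title": "Introduction to Emerging Technologies"}
  ]
}
"""

parsed = json.loads(record)
print(parsed["name"])                  # Abebe
print(parsed["courses"][0]["title"])   # Introduction to Emerging Technologies
```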
UNSTRUCTURED DATA

Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases. Common examples of unstructured data include audio files, video files, or NoSQL databases.

METADATA – DATA ABOUT DATA

The last category of data type is metadata. From a technical point of view, this is not a separate data structure, but it is one of the most important elements for Big Data analysis and big data solutions. Metadata is data about data: it provides additional information about a specific set of data. In a set of photographs, for example, metadata could describe when and where the photos were taken. The metadata then provides fields for dates and locations which, by themselves, can be considered structured data. For this reason, metadata is frequently used by Big Data solutions for initial analysis.
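As a hedged illustration of the photo example, the metadata can be written as a small Python dictionary; the field names and values here are invented for the sketch:

```python
# Hypothetical metadata for a single photograph ("data about data").
# The photo itself is unstructured; these descriptive fields are structured.
photo_metadata = {
    "file_name": "IMG_0042.jpg",
    "date_taken": "2023-05-14T09:30:00",
    "location": {"latitude": 9.03, "longitude": 38.74},  # approximate coordinates
    "camera_model": "ExampleCam X100",
    "resolution": "4000x3000",
}

# Because these fields are well defined, they can be queried like structured data.
print(photo_metadata["date_taken"], photo_metadata["location"])
```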
2.4. DATA VALUE CHAIN

The Data Value Chain is introduced to describe the information flow within a big data system as a series of steps needed to generate value and useful insights from data. The Big Data Value Chain identifies the following key high-level activities:

2.4.1. DATA ACQUISITION

Data acquisition is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out. Data acquisition is one of the major big data challenges in terms of infrastructure requirements. The infrastructure required to support the acquisition of big data must deliver low, predictable latency both in capturing data and in executing queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible and dynamic data structures.

2.4.2. DATA ANALYSIS
Data analysis is concerned with making the acquired raw data amenable to use in decision-making as well as in domain-specific applications. It involves exploring, transforming, and modeling data with the goal of highlighting relevant data and synthesizing and extracting useful hidden information with high potential from a business point of view. Related areas include data mining, business intelligence, and machine learning.
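A short illustrative sketch of exploring, transforming, and summarizing data with pandas; the sales.csv file and its columns are assumptions made for this example, not material from the chapter:

```python
import pandas as pd

# Hypothetical input file with columns: region, product, units, unit_price.
sales = pd.read_csv("sales.csv")

# Exploring: look at the shape and a few rows.
print(sales.shape)
print(sales.head())

# Transforming: derive a new column and drop incomplete rows.
sales["revenue"] = sales["units"] * sales["unit_price"]
sales = sales.dropna(subset=["region", "revenue"])

# Summarizing: highlight which regions generate the most revenue.
print(sales.groupby("region")["revenue"].sum().sort_values(ascending=False))
```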
2.4.3. DATA CURATION

Data curation is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage. Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation. Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data. Data curators (also known as scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose. A key trend in the curation of big data is the use of community and crowdsourcing approaches.
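A minimal sketch of two of the curation activities listed above, validation and selection with de-duplication, using plain Python on hypothetical records:

```python
# Hypothetical raw records to be curated before analysis.
records = [
    {"id": 1, "name": "Sara", "email": "sara@example.com"},
    {"id": 2, "name": "", "email": "unknown"},               # fails validation
    {"id": 1, "name": "Sara", "email": "sara@example.com"},  # duplicate
]

def is_valid(record):
    # Validation: require a non-empty name and a plausible email address.
    return bool(record["name"]) and "@" in record["email"]

curated, seen_ids = [], set()
for record in records:
    # Selection + de-duplication: keep only valid, previously unseen records.
    if is_valid(record) and record["id"] not in seen_ids:
        curated.append(record)
        seen_ids.add(record["id"])

print(curated)   # only the first record survives curation
```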
2.4.4. DATA STORAGE

Data storage is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data. Relational Database Management Systems (RDBMS) have been the main, and almost only, solution to the storage paradigm for nearly 40 years. However, the ACID (Atomicity, Consistency, Isolation, and Durability) properties that guarantee database transactions lack flexibility with regard to schema changes, and their performance and fault tolerance suffer when data volumes and complexity grow, making them unsuitable for big data scenarios. NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
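For contrast with the NoSQL systems mentioned above, the following hedged sketch uses Python's built-in sqlite3 module to show the fixed-schema, transactional (ACID-style) storage model of an RDBMS; the table and rows are invented for the example:

```python
import sqlite3

# An in-memory relational database with a fixed, predefined schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, gpa REAL)")

# Writes happen inside a transaction; either all rows commit or none do.
with conn:
    conn.executemany(
        "INSERT INTO students (id, name, gpa) VALUES (?, ?, ?)",
        [(1, "Hana", 3.8), (2, "Kebede", 3.2)],
    )

for row in conn.execute("SELECT name, gpa FROM students ORDER BY gpa DESC"):
    print(row)

conn.close()
```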
2.4.5. DATA USAGE

Data usage covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity. Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.

2.5. BASIC CONCEPTS OF BIG DATA

2.5.1. What Is Big Data?

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. In this context, a “large dataset” means a dataset too large to reasonably process or store with traditional tooling or on a single computer. This means that the common scale of big datasets is constantly shifting and may vary significantly from organization to organization.
Big data is characterized by the 3Vs and more:
• Volume: large amounts of data - zettabytes, massive datasets
• Velocity: data is live, streaming, or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it?

Figure: Characteristics of Big Data

2.5.2. CLUSTERED COMPUTING AND HADOOP ECOSYSTEM
Because of the qualities of big data, individual computers are often inadequate for handling the data at most stages. To better address the high storage and computational needs of big data, computer clusters are a better fit. Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
• Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important. Processing large datasets requires large amounts of all three of these resources.
• High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures from affecting access to data and processing. This becomes increasingly important as we continue to emphasize real-time analytics.
• Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group. This means the system can react to changes in resource requirements without expanding the physical resources on any one machine.
Using clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes. Cluster membership and resource allocation can be handled by software like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).

2.5.2.2. HADOOP AND ITS ECOSYSTEM

Hadoop is an open-source framework intended to make interaction with big data easier. It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models.
Hadoop is inspired by a technical document published by Google. The four key characteristics of Hadoop are:
• Economical: Its systems are highly economical, as ordinary computers can be used for data processing.
• Reliable: It is reliable, as it stores copies of the data on different machines and is resistant to hardware failure.
• Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
• Flexible: It is flexible, and you can store as much structured and unstructured data as you need and decide to use it later.

Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage. It is continuously growing to meet the needs of Big Data. It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• Zookeeper: managing the cluster
• Oozie: job scheduling

2.5.3. BIG DATA LIFE CYCLE WITH HADOOP

2.5.3.1. Ingesting data into the system

The first stage of Big Data processing is Ingest. The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files. Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2.5.3.2. Processing the data in storage

The second stage is Processing. In this stage, the data is stored and processed. The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce perform the data processing.
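To make the MapReduce programming model concrete, here is a hedged, single-machine Python sketch of the classic word count; a real MapReduce job runs distributed across the cluster, so this only imitates the map, shuffle, and reduce phases:

```python
from collections import defaultdict

documents = ["big data needs big clusters", "hadoop processes big data"]

# Map phase: emit (word, 1) pairs from every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the intermediate pairs by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 3, 'data': 2, ...}
```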
2.5.3.3. Computing and analyzing data

The third stage is Analyze. Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala. Pig converts the data using map and reduce operations and then analyzes it. Hive is also based on map and reduce programming and is most suitable for structured data.
2.5.3.4. Visualizing the results

The fourth stage is Access, which is performed by tools such as Hue and Cloudera Search. In this stage, the analyzed data can be accessed by users.