0% found this document useful (0 votes)

18 views37 pages

Chapter 2 EmTe

Uploaded by

fasilmengesha79

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

18 views37 pages

Chapter 2 EmTe

Uploaded by

fasilmengesha79

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 37

Emerging Technologies

Chapter 2. Introduction to Data Science

Outlines
 Overview for Data Science
 Definition of data and information
 Data types and representation
 Data Value Chain
 Data Acquisition
 Data Analysis
 Data Curating
 Data Storage
 Data Usage
 Basic concepts of Big data

2
What is data science?
 Data science is a multi-disciplinary field that uses
scientific methods, processes, algorithms, and systems
to extract knowledge and insights from structured,
semi-structured and unstructured data.
 Data science is much more than simply analyzing
data. It offers a range of roles and requires a range of
skills.

3
What are data and information?
 Data can be defined as a representation of facts,
concepts, or instructions in a formalized manner, which
should be suitable for communication, interpretation, or
processing, by human or electronic machines.
 It can be described as unprocessed facts and figures.
 It is represented with the help of characters such as
 alphabets (A-Z, a-z)
 digits (0-9) or
 special characters (+, -, /, *, <,>, =, etc.)

4
What are data and information?...
 Whereas information is the processed data on which
decisions and actions are based.
 It is data that has been processed into a form that is
meaningful to the recipient and is of real or perceived
value in the current or the prospective action or decision
of recipient.
 Furtherer more, information is interpreted data; created
from organized, structured, and processed data in a
particular context.

5
2.1.2. Data Processing Cycle
 Data processing is the re-structuring or re-ordering of
data by people or machines to increase their usefulness
and add values for a particular purpose.
 Data processing consists of the following basic steps -
input, processing, and output. These three steps
constitute the data processing cycle.

6
Data Processing Cycle
 Input − in this step, the input data is prepared in some convenient form
for processing. The form will depend on the processing machine. For
example, when electronic computers are used, the input data can be
recorded on any one of the several types of storage medium, such as hard
disk, CD, flash disk and so on.
 Processing − in this step, the input data is changed to produce data in a
more useful form. For example, interest can be calculated on deposit to a
bank, or a summary of sales for the month can be calculated from the
sales orders.
 Output − at this stage, the result of the proceeding processing step is
collected. The particular form of the output data depends on the use of the
data. For example, output data may be payroll for employees.

7
2.3 Data types and their representation
 Data types can be described from diverse perspectives.
 In computer science and computer programming, for
instance, a data type is simply an attribute of data that
tells the compiler or interpreter how the programmer
intends to use the data.

8
2.3.1. Data types from Computer programming perspective

 Almost all programming languages explicitly include

the notion of data type, though different languages may
use different terminology. Common data types include:
 Integers(int)- is used to store whole numbers, mathematically
known as integers
 Booleans(bool)- is used to represent restricted to one of two
values: true or false
 Characters(char)- is used to store a single character
 Floating-point numbers(float)- is used to store real numbers
 Alphanumeric strings(string)- used to store a combination of
characters and numbers

9
Data types from Computer programming perspective
 A data type makes the values that expression, such as a
variable or a function, might take.
 This data type defines the operations that can be done
on the data, the meaning of the data, and the way values
of that type can be stored.

10
2.3.2. Data types from Data Analytics perspective
 From a data analytics point of view, it is important to
understand that there are three common types of data
types or structures: Structured, Semi-structured, and
Unstructured data types. Fig. 2.2 below describes the
three types of data and metadata.

Figure 2.2 Data types from a data analytics perspective

11
Structured Data
 Structured data is data that adheres to a pre-defined
data model and is therefore straightforward to analyze.
 Structured data conforms to a tabular format with a
relationship between the different rows and columns.
 Common examples of structured data are Excel files or
SQL databases.
 Each of these has structured rows and columns that
can be sorted.

12
Semi-structured Data
 Semi-structured data is a form of structured data that
does not conform with the formal structure of data
models associated with relational databases or other
forms of data tables, but nonetheless, contains tags or
other markers to separate semantic elements and enforce
hierarchies of records and fields within the data.
 Therefore, it is also known as a self-describing
structure. Examples of semi-structured data include
JSON and XML are forms of semi-structured data.

13
Unstructured Data
 Unstructured data is information that either does not
have a predefined data model or is not organized in a
pre-defined manner.
 Unstructured information is typically text-heavy but
may contain data such as dates, numbers, and facts as
well. This results in irregularities and ambiguities that
make it difficult to understand using traditional
programs as compared to data stored in structured
databases. Common examples of unstructured data
include audio, video files or No-SQL databases.

14
Metadata – Data about Data
 The last category of data type is metadata. From a technical point
of view, this is not a separate data structure, but it is one of the
most important elements for Big Data analysis and big data
solutions.
 Metadata is data about data. It provides additional information
about a specific set of data.
 In a set of photographs, for example, metadata could describe
when and where the photos were taken.
 The metadata then provides fields for dates and locations which,
by themselves, can be considered structured data. Because of this
reason, metadata is frequently used by Big Data solutions for
initial analysis.

15
2.4. Data value Chain
 The Data Value Chain is introduced to describe the information
flow within a big data system as a series of steps needed to
generate value and useful insights from data. The Big Data Value
Chain identifies the following key high-level activities:

Figure 2.3 Data Value Chain

16
2.4.1. Data Acquisition
 It is the process of gathering, filtering, and cleaning data
before it is put in a data warehouse or any other storage
solution on which data analysis can be carried out.
 Data acquisition is one of the major big data challenges
in terms of infrastructure requirements.
 The infrastructure required to support the acquisition of
big data must deliver low, predictable latency in both
capturing data and in executing queries; be able to
handle very high transaction volumes, often in a
distributed environment; and support flexible and
dynamic data structures

17
2.4.2. Data Analysis
 It is concerned with making the raw data acquired
amenable to use in decision-making as well as domain-
specific usage.
 Data analysis involves exploring, transforming, and
modeling data with the goal of highlighting relevant
data, synthesizing and extracting useful hidden
information with high potential from a business point of
view.
 Related areas include data mining, business
intelligence, and machine learning.

18
2.4.3. Data Curation
 It is the active management of data over its life cycle to ensure it
meets the necessary data quality requirements for its effective
usage.
 Data curation processes can be categorized into different
activities such as content creation, selection, classification,
transformation, validation, and preservation.
 Data curation is performed by expert curators that are responsible
for improving the accessibility and quality of data.
 Data curators (also known as scientific curators or data
annotators) hold the responsibility of ensuring that data are
trustworthy, discoverable, accessible, reusable and fit their
purpose. A key trend for the duration of big data utilizes
community and crowdsourcing approaches.

19
2.4.4. Data Storage
 It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
 Relational Database Management Systems (RDBMS) have been the
main, and almost unique, a solution to the storage paradigm for
nearly 40 years.
 However, the ACID (Atomicity, Consistency, Isolation, and Durability)
properties that guarantee database transactions lack flexibility with
regard to schema changes and the performance and fault tolerance
when data volumes and complexity grow, making them unsuitable
for big data scenarios.
 No SQL technologies have been designed with the scalability goal
in mind and present a wide range of solutions based on alternative
data models.

20
2.4.5. Data Usage
 It covers the data-driven business activities that need
access to data, its analysis, and the tools needed to
integrate the data analysis within the business activity.
 Data usage in business decision-making can enhance
competitiveness through the reduction of costs,
increased added value, or any other parameter that can
be measured against existing performance criteria.

21
2.5. Basic concepts of big data
 Big data is a blanket term for the non-traditional
strategies and technologies needed to gather, organize,
process, and gather insights from large datasets.
 In this section, we will talk about big data on a
fundamental level and define common concepts you
might come across.
 We will also take a high-level look at some of the
processes and technologies currently being used in this
space.

22
What Is Big Data?
Big data is characterized by 10V’s and more:
Volume:

 large amounts of data Zeta bytes/Massive datasets

Velocity:
 Data is live streaming or in motion
Variety:
 data comes in many different forms from diverse sources
Variability
 Variability is a measure of the inconsistencies in data and is

often confused with variety. To understand variability, let us

consider an example. You go to a coffee shop every day and
purchase the same latte each day. However, it may smell or
taste slightly or significantly different each day.

23
Cont’d
 Veracity:

Veracity refers to the reliability of the data source. Numerous
factors can contribute to the reliability of the input they provide
at a particular time in a particular situation.

Veracity is particularly important for making data-driven
decisions for businesses as reproducibility of patterns relies
heavily on the credibility of initial data inputs.
 Validity

Validity pertains to the accuracy of data for its intended use.
For example, you may acquire a dataset pertaining to data
related to your subject of inquiry, increasing the task of forming
a meaningful relationship and inquiry. Registered charity data
contact lists
24
Cont’d
 Visualization
 With a new data visualization tool being released every month
or so, visualizing data is key to insightful results.
 Value
 BIG data is nothing if it cannot produce meaningful value.
Consider, again, the example of Target using a 16-year-old’s
shopping habits to predict her pregnancy. While in this case, it
violates privacy, in most other cases, it can generate
incredible customer value by bombarding them with the
specific product advertisement they require.

25
Cont’d
 Volatility
 Volatility refers to the time considerations placed on a
particular data set. It involves considering if data acquired a
year ago would be relevant for analysis for predictive
modeling today
 Vulnerability
 Big data is often about consumers. We often overlook the
potential harm in sharing our shopping data, but the reality is
that it can be used to uncover confidential information about
an individual. For instance, Target accurately predicted a
teenage girl’s pregnancy before her own parents knew it. To
avoid such consequences, it’s important to be mindful of the
information we share online.
26
Cont’d

fig. Characteristics of big data(10V’s

27
What Is Big Data?

Figure 2.4 Characteristics of big data

28
2.5.2. Clustered Computing and Hadoop Ecosystem
Clustered Computing
Because of the qualities of big data, individual computers are often
inadequate for handling the data at most stages.
To better address the high storage and computational needs of big

data, computer clusters are a better fit.

Big data clustering software combines the resources of many

smaller machines, seeking to provide a number of benefits:

 Resource Pooling

 High Availability

 Easy Scalability

29
Cluster computing benefits
 Resource Pooling
 Combining the available storage space to hold data is a clear
benefit, but CPU and memory pooling are also extremely
important.
 High Availability
 Clusters can provide varying levels of fault tolerance and
availability guarantees to prevent hardware or software failures
from affecting access to data and processing.
 Easy Scalability
 Clusters make it easy to scale horizontally by adding additional
machines to the group.

30
Cluster computing benefits
 Using clusters requires a solution for managing cluster
membership, coordinating resource sharing,and
scheduling actual work on individual nodes.
 Cluster membership and resource allocation can be
handled by software like Hadoop’s YARN (which
stands for Yet Another Resource Negotiator).
 The assembled computing cluster often acts as a
foundation that other software interfaces with to process
the data.

31
Hadoop and its Ecosystem
 Hadoop is an open-source framework intended to make
interaction with big data easier.
 It is a framework that allows for the distributed
processing of large datasets across clusters of computers
using simple programming models.
 It is inspired by a technical document published by
Google. The four key characteristics of Hadoop are:

32
Hadoop and its Ecosystem
 Economical: Its systems are highly economical as
ordinary computers can be used for data processing.
 Reliable: It is reliable as it stores copies of the data on
different machines and is resistant to hardware failure.
 Scalable: It is easily scalable both, horizontally and
vertically. A few extra nodes help in scaling up the
framework.
 Flexible: It is flexible and you can store as much
structured and unstructured data as you need to and
decide to use them later.

33
Hadoop and its Ecosystem
 Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and storage. It
is continuously growing to meet the needs of Big Data. It comprises
the following components and many others
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query-based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solar, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling

34
Hadoop and its Ecosystem

Figure 2.5 Hadoop Ecosystem

35
Big Data Life Cycle with Hadoop
Ingesting data into the system
 The first stage of Big Data processing is Ingest. The data
is ingested or transferred to Hadoop from various sources
such as relational databases, systems, or local files. Sqoop
transfers data from RDBMS to HDFS, whereas Flume
transfers event data.
Processing the data in storage
 The second stage is Processing. In this stage, the data is
stored and processed. The data is stored in the distributed
file system, HDFS, and the NoSQL distributed data,
HBase. Spark and MapReduce perform data processing.

36
Big Data Life Cycle with Hadoop

Computing and analyzing data

 The third stage is to Analyze. Here, the data is analyzed by
processing frameworks such as Pig, Hive, and Impala. Pig
converts the data using a map and reduce and then analyzes
it. Hive is also based on the map and reduce programming
and is most suitable for structured data.
Visualizing the results
 The fourth stage is Access, which is performed by tools
such as Hue and Cloudera Search. In this stage, the
analyzed data can be accessed by users.

PCX - Report 1
No ratings yet
PCX - Report 1
11 pages
Cbe receiptFT25127PN6DQ
No ratings yet
Cbe receiptFT25127PN6DQ
1 page
Cbe receiptTT25151HHSLN
No ratings yet
Cbe receiptTT25151HHSLN
1 page
Cbe receiptFT25127CP302
No ratings yet
Cbe receiptFT25127CP302
1 page
Student Management System Project Report
88% (24)
Student Management System Project Report
66 pages
Database Management System
From Everand
Database Management System
Manish Soni
No ratings yet
Telebirr - Oromia Tax Collection7
No ratings yet
Telebirr - Oromia Tax Collection7
1 page
8 Magnetism Problem Solutions
No ratings yet
8 Magnetism Problem Solutions
4 pages
J Notes 2
No ratings yet
J Notes 2
123 pages
Isp Email
No ratings yet
Isp Email
40 pages
Img 0001
No ratings yet
Img 0001
1 page
Shivani Pathak
No ratings yet
Shivani Pathak
8 pages
Synchronizing MQL5 With Python Involves Setting Up An Environment Where MQL5 Can Call Python Scripts and Exchange Data
No ratings yet
Synchronizing MQL5 With Python Involves Setting Up An Environment Where MQL5 Can Call Python Scripts and Exchange Data
4 pages
Python ETL - Course Content
No ratings yet
Python ETL - Course Content
4 pages
ФТН демо ENG
No ratings yet
ФТН демо ENG
15 pages
2042 Unit 2 - 1
No ratings yet
2042 Unit 2 - 1
28 pages
Chapter VI. Fundamental Concepts of Macroeconomics22
No ratings yet
Chapter VI. Fundamental Concepts of Macroeconomics22
49 pages
Talend Metadata
No ratings yet
Talend Metadata
14 pages
IDS, IPS and Firewalls
No ratings yet
IDS, IPS and Firewalls
2 pages
Chapter One PDF
No ratings yet
Chapter One PDF
66 pages
Intro Stat Session 7
No ratings yet
Intro Stat Session 7
54 pages
Foreign Relation of Ethiopia & Eritrea
No ratings yet
Foreign Relation of Ethiopia & Eritrea
8 pages
ФТН ПП ENG
No ratings yet
ФТН ПП ENG
13 pages
Inclusiveness Lecture Notes 3 (Chapters 1-4)
No ratings yet
Inclusiveness Lecture Notes 3 (Chapters 1-4)
45 pages
USB Security System
No ratings yet
USB Security System
88 pages
Devops
No ratings yet
Devops
6 pages
Inclusiveness Lecture Notes 4 (Chapters 1-4)
No ratings yet
Inclusiveness Lecture Notes 4 (Chapters 1-4)
201 pages
Resume Mayur
No ratings yet
Resume Mayur
1 page
Abeni Emerg
No ratings yet
Abeni Emerg
8 pages
APPLICATIONS PROGRAM Part2
No ratings yet
APPLICATIONS PROGRAM Part2
14 pages
Chapter 2EMR
No ratings yet
Chapter 2EMR
21 pages
Project Proposal
No ratings yet
Project Proposal
3 pages
(ET) Chapter - 2
No ratings yet
(ET) Chapter - 2
31 pages
Chapter 2 - EMTE - 240216 - 133452
No ratings yet
Chapter 2 - EMTE - 240216 - 133452
47 pages
Introduction To Emerging Technologies Chapter 2
No ratings yet
Introduction To Emerging Technologies Chapter 2
31 pages
Smurf Attacks
No ratings yet
Smurf Attacks
4 pages
XcitiumAdvancedEDR DataSheet V6
No ratings yet
XcitiumAdvancedEDR DataSheet V6
5 pages
Course Name: Introduction To Emerging Technologies
No ratings yet
Course Name: Introduction To Emerging Technologies
24 pages
UNIT2 - IoT-1
No ratings yet
UNIT2 - IoT-1
67 pages
Docker
No ratings yet
Docker
22 pages
CV Zied-SIOUD SeniorDev
No ratings yet
CV Zied-SIOUD SeniorDev
1 page
Ch2 Emerging
No ratings yet
Ch2 Emerging
24 pages
EmTec Chapter 2
No ratings yet
EmTec Chapter 2
32 pages
Chapter 2 (Data Science)
No ratings yet
Chapter 2 (Data Science)
35 pages
Cloud Computing.
No ratings yet
Cloud Computing.
105 pages
ET Ch-2 Data Science PPT
No ratings yet
ET Ch-2 Data Science PPT
28 pages
Yokogawa Marine Business: Bulletin 53L01A01-01EN
No ratings yet
Yokogawa Marine Business: Bulletin 53L01A01-01EN
4 pages
Chapter 2 Introduction To Data Science - For Extension
No ratings yet
Chapter 2 Introduction To Data Science - For Extension
51 pages
200 IT Security Job Interview Questions-1
No ratings yet
200 IT Security Job Interview Questions-1
188 pages
Ict Ch. 2
No ratings yet
Ict Ch. 2
38 pages
Emerging CH2
No ratings yet
Emerging CH2
41 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
8 pages
Data Science
No ratings yet
Data Science
32 pages
Home Automation Using IoT
No ratings yet
Home Automation Using IoT
21 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
33 pages
Peningkatan Kualitas Pengelolaan Jurnal Ilmiah: Model Dashboard Information System Untuk
No ratings yet
Peningkatan Kualitas Pengelolaan Jurnal Ilmiah: Model Dashboard Information System Untuk
8 pages
Emerging Tech CH 2
No ratings yet
Emerging Tech CH 2
52 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
27 pages
Chapter 2
No ratings yet
Chapter 2
22 pages
IS632 Chapter 3 Test Bank
No ratings yet
IS632 Chapter 3 Test Bank
4 pages
Chapter 2 Data Science1
No ratings yet
Chapter 2 Data Science1
41 pages
Wireless PLCs (EU2-16Langu
No ratings yet
Wireless PLCs (EU2-16Langu
75 pages
IET - Chapter 2
No ratings yet
IET - Chapter 2
32 pages
Emergency Chapter Two
No ratings yet
Emergency Chapter Two
41 pages
Chapter 2
No ratings yet
Chapter 2
31 pages
Chapter 2 - Intro To Data Sciences
No ratings yet
Chapter 2 - Intro To Data Sciences
41 pages
Chapter 2 - Introduction To Data Science
No ratings yet
Chapter 2 - Introduction To Data Science
56 pages
ETCh 2
No ratings yet
ETCh 2
36 pages
Chapter 2. Introduction To Data Science
No ratings yet
Chapter 2. Introduction To Data Science
41 pages
Chapter 2 - Intro To Data Sciences (Updated)
No ratings yet
Chapter 2 - Intro To Data Sciences (Updated)
67 pages
EmgTech Chapter 02
No ratings yet
EmgTech Chapter 02
52 pages
Chapter 2 - Introduction To Data Science
No ratings yet
Chapter 2 - Introduction To Data Science
37 pages
Nodeflux - Face Recognition Analytic Sheet 1.2.71
No ratings yet
Nodeflux - Face Recognition Analytic Sheet 1.2.71
9 pages
Data Lifecycle
No ratings yet
Data Lifecycle
55 pages
Chapter - 2 Data Sciences
No ratings yet
Chapter - 2 Data Sciences
25 pages
Chapter 2 - Introduction To Data Science
No ratings yet
Chapter 2 - Introduction To Data Science
58 pages
Chap 2-Data Analysis
No ratings yet
Chap 2-Data Analysis
27 pages
Chapter 2 - Data Science
No ratings yet
Chapter 2 - Data Science
20 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
55 pages
Hana - 2092196 - How-To - Terminating Sessions in SAP HANA
No ratings yet
Hana - 2092196 - How-To - Terminating Sessions in SAP HANA
3 pages
CH-2 Introduction To Data Science
No ratings yet
CH-2 Introduction To Data Science
26 pages
Data Science
No ratings yet
Data Science
35 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
28 pages
Chaoter Data Science
No ratings yet
Chaoter Data Science
20 pages
Chapter 2 - Data Science
No ratings yet
Chapter 2 - Data Science
57 pages
Chapter-2 Data Science2
No ratings yet
Chapter-2 Data Science2
24 pages
2 Data-Science PDF
No ratings yet
2 Data-Science PDF
49 pages
IT 106 - Intro To Data Sciences
No ratings yet
IT 106 - Intro To Data Sciences
32 pages
Tuning PostgreSQL With Pgbench
No ratings yet
Tuning PostgreSQL With Pgbench
11 pages
Chapter 2 Emerging
No ratings yet
Chapter 2 Emerging
31 pages
Chapter 2 EMTE@Kibru 014914
No ratings yet
Chapter 2 EMTE@Kibru 014914
40 pages
Chapter - 2 - Data Science
No ratings yet
Chapter - 2 - Data Science
33 pages
#2 Data Science
No ratings yet
#2 Data Science
32 pages
Using TKProf To Compare Actual and Predicted Row Counts
No ratings yet
Using TKProf To Compare Actual and Predicted Row Counts
2 pages
Chapter 2: Data Science
No ratings yet
Chapter 2: Data Science
32 pages
Chapter Two
No ratings yet
Chapter Two
14 pages
Mccall (1977), Boehm (1978) Furps: Article Purpose
No ratings yet
Mccall (1977), Boehm (1978) Furps: Article Purpose
3 pages
Chapter 2 - Intro To Data Sciences
No ratings yet
Chapter 2 - Intro To Data Sciences
41 pages
CH 2 Data Science
No ratings yet
CH 2 Data Science
28 pages
Chapter 2
No ratings yet
Chapter 2
27 pages
Chapter 2 Data Science
No ratings yet
Chapter 2 Data Science
37 pages
Chapter - 2 - Data Science
No ratings yet
Chapter - 2 - Data Science
32 pages
THE SQL LANGUAGE: Master Database Management and Unlock the Power of Data (2024 Beginner's Guide)
From Everand
THE SQL LANGUAGE: Master Database Management and Unlock the Power of Data (2024 Beginner's Guide)
JAMIE POWERS
No ratings yet
CH-2 Data Science
No ratings yet
CH-2 Data Science
45 pages
Data Science: Chapter Two
No ratings yet
Data Science: Chapter Two
8 pages
Chapter 2. Introduction To Data Science
100% (2)
Chapter 2. Introduction To Data Science
45 pages
Emerging Chapter 2
No ratings yet
Emerging Chapter 2
22 pages

Chapter 2 EmTe

Uploaded by

Chapter 2 EmTe

Uploaded by

Emerging Technologies

Chapter 2. Introduction to Data Science

 Almost all programming languages explicitly include

Figure 2.2 Data types from a data analytics perspective

Figure 2.3 Data Value Chain

 large amounts of data Zeta bytes/Massive datasets

often confused with variety. To understand variability, let us

fig. Characteristics of big data(10V’s

Figure 2.4 Characteristics of big data

data, computer clusters are a better fit.

smaller machines, seeking to provide a number of benefits:

Figure 2.5 Hadoop Ecosystem

Computing and analyzing data

You might also like