Chapter 2 DS New

The document is a chapter from Addis Ababa University's introduction to emerging technologies course. It covers key concepts in data science including differentiating data and information, explaining the data processing lifecycle and value chain, describing different data types, and analyzing the Hadoop ecosystem. The chapter aims to help students understand fundamental data science topics.

Uploaded by

Kenean

Addis Ababa University

School of Commerce

Introduction to Emerging Technologies
Chapter Two: Data Science

07/01/2023 1
Contents
➢ Learning outcomes
➢ An overview of data science
➢ Data Vs information
➢ Data processing cycle
➢ Data types and their representation
➢ Data value chain
➢ Basic concepts of big data
➢ Hadoop ecosystem
➢ Review questions

Learning outcomes
After successfully completing this chapter, students will be able to:
Differentiate data and information
Explain the data processing life cycle
Differentiate data types from diverse perspectives
Explain the data value chain
Explain the basics of big data
Analyze Hadoop ecosystem components and their use in big data

An Overview of Data Science
➢ Data science is a multi-disciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured,
semi-structured, and unstructured data.

Data Vs Information
Data:
➢ A representation of facts, concepts, or instructions in a formalized manner, suitable
for communication, interpretation, or processing by humans or electronic machines.
➢ Described as unprocessed facts and figures.
➢ Represented with characters such as letters (A-Z, a-z), digits (0-9), or
special characters (+, -, /, *, <, >, =, etc.).

…Data Vs Information
Information:
➢ Processed data on which decisions and actions are based.
➢ Data that has been processed into a form that is meaningful to the recipient and is of
real or perceived value in the recipient's current or prospective actions or
decisions.
➢ Interpreted data; created from organized, structured, and processed data in a
particular context.

…Data Vs Information

Source: internet

Data Processing Cycle
➢ Data processing is the restructuring or reordering of data by people or machines to
increase its usefulness and add value for a particular purpose. It has three steps:
input, processing, and output.
Input:
➢ The data is prepared in a convenient form for processing. The form depends on the
processing machine.
➢ For example, when electronic computers are used for data processing, the input data
can be recorded on a hard disk, CD, flash disk, and so on.

Source: Introduction to emerging technology module page 23

Data Processing Cycle
Processing:
➢ The input data is transformed to produce data in a more useful form.
➢ For example, interest can be calculated on a bank deposit, or a summary of sales
for the month can be calculated from the sales orders.
Output:
➢ The result of the processing step is collected. The particular form of the output data
depends on how the data will be used.
➢ For example, the output data may be a payroll for employees.
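The three steps can be sketched in a few lines of Python. This is only an illustration: the sales orders, the interest figures, and the function names `summarize_sales` and `simple_interest` are made up for the example and do not come from the course module.

```python
from collections import defaultdict

# Input: hypothetical sales orders recorded as (month, amount) pairs.
sales_orders = [("2023-01", 120.0), ("2023-01", 80.0), ("2023-02", 50.0)]


def summarize_sales(orders):
    """Processing: turn individual orders into a total per month."""
    totals = defaultdict(float)
    for month, amount in orders:
        totals[month] += amount
    return dict(totals)


def simple_interest(principal, annual_rate, years):
    """Processing: simple (non-compounding) interest on a deposit."""
    return principal * annual_rate * years


# Output: the processed results, presented in the form the reader needs.
monthly_totals = summarize_sales(sales_orders)
for month, total in sorted(monthly_totals.items()):
    print(f"{month}: {total:.2f}")
print(f"Interest: {simple_interest(1000.0, 0.05, 2):.2f}")
```

The same raw orders could feed a different output, such as a yearly report, without changing the input step.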

Data types and their representation
1. Data types from a computer programming perspective: a data type defines the
operations that can be performed on the data, the meaning of the data, and the way
values of that type can be stored.
E.g., int, bool, char, float, double, string
2. Data types from a data analytics perspective: there are three common data types
or structures: structured, semi-structured, and unstructured.
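As a small illustration of the programming perspective, the Python sketch below shows how a value's type decides what an operator does with it. The variable names and values are hypothetical. Note also that Python has no separate `char` or `double` type: a character is a one-letter `str`, and `float` is double precision.

```python
# Hypothetical values, one per built-in type.
quantity = 3        # int
unit_price = 2.5    # float (double precision in Python)
in_stock = True     # bool
label = "3"         # str


def describe(value):
    """Return the name of the value's data type."""
    return type(value).__name__


total = quantity * unit_price   # "*" multiplies two numbers
repeated = label * quantity     # the same "*" repeats a string
joined = label + "4"            # "+" concatenates strings instead of adding

for value in (quantity, unit_price, in_stock, label):
    print(describe(value), value)
print(total, repeated, joined)
```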

Source: Introduction to emerging technology module page 25

Structured Data
➢ It conforms to a tabular format with a relationship between the different rows and
columns.
➢ Examples of structured data are Excel files or SQL databases. Each of these has
structured rows and columns that can be sorted.
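A minimal sketch of structured data, using Python's built-in `sqlite3` module with an in-memory database. The `employees` table and its rows are invented for illustration.

```python
import sqlite3

# An in-memory SQL database: every row has the same named, typed columns.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)"
)
connection.executemany(
    "INSERT INTO employees (name, salary) VALUES (?, ?)",
    [("Hana", 5200.0), ("Dawit", 4100.0), ("Meron", 6100.0)],
)

# Because the structure is fixed, sorting by a column is a one-line query.
rows = connection.execute(
    "SELECT name, salary FROM employees ORDER BY salary DESC"
).fetchall()
connection.close()

for name, salary in rows:
    print(f"{name:<6} {salary:>8.2f}")
```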

Source: internet

Semi-structured data
➢ It is a form of structured data that does not conform to the formal structure of data
models associated with relational databases or other forms of data tables, but
nonetheless contains tags or other markers to separate semantic elements and
enforce hierarchies of records and fields within the data.
➢ Examples of semi-structured data include JSON and XML.
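The short Python sketch below parses a JSON document to show what "tags or other markers" look like in practice. The student record and course codes are hypothetical. Keys label each field and nesting expresses the hierarchy, yet records need not share the same fields.

```python
import json

# A hypothetical semi-structured record: the second course has no "grade" key.
document = """
{
  "student": {
    "name": "Sara",
    "courses": [
      {"code": "ET101", "grade": "A"},
      {"code": "ET102"}
    ]
  }
}
"""

record = json.loads(document)

# Walk the hierarchy and tolerate missing fields with a default value.
grades = [
    (course["code"], course.get("grade", "n/a"))
    for course in record["student"]["courses"]
]
print(record["student"]["name"], grades)
```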

Source: internet

Unstructured Data
➢ It is information that either does not have a predefined data model or is not
organized in a predefined manner. Unstructured information is typically text-heavy
but may also contain data such as dates, numbers, and facts. This results in
irregularities and ambiguities that make it harder to process with traditional
programs than data stored in structured databases.
➢ Examples of unstructured data include audio files, video files, and NoSQL databases.
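To make the "irregularities and ambiguities" concrete, here is a rough Python sketch that pulls dates out of free text with regular expressions. The note and the two date patterns are assumptions chosen for the example; real text would need many more patterns.

```python
import re

# A hypothetical free-text note: dates and numbers are buried in prose.
note = (
    "Customer called on 2023-01-07 about order 4512. "
    "Follow-up promised by 07/01/2023, maybe sooner."
)

# Each date style needs its own pattern; nothing in the text declares a format.
iso_dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", note)
slash_dates = re.findall(r"\b\d{2}/\d{2}/\d{4}\b", note)

print(iso_dates)    # year-month-day, unambiguous
print(slash_dates)  # could be 7 January or 1 July, depending on locale
```

In a structured table the same information would sit in a typed date column, with no guessing needed.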

Source: internet

Metadata – Data about Data
➢ It is not a separate data structure, but it is one of the most important elements for
big data analysis and big data solutions.
➢ Metadata is data about data. It provides additional information about a specific set
of data.
➢ For example, in a set of photographs, metadata could describe when and where the
photos were taken.
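File systems keep metadata too. The sketch below writes a small temporary file and then reads its metadata with `os.stat`, without opening the content again. The helper name `describe_file` and the sample content are made up for illustration.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Collect metadata about a file without reading its content."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello data")
    sample_path = handle.name

metadata = describe_file(sample_path)
os.remove(sample_path)
print(metadata)
```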

Data value Chain
➢ The Data Value Chain is introduced to describe the information flow within a big
data system as a series of steps needed to generate value and useful insights from
data.

Source: Introduction to emerging technology module page 26

…Data value Chain
Data Acquisition:
➢ The process of gathering, filtering, and cleaning data before it is put in a data
warehouse or any other storage solution on which data analysis can be carried out.
➢ It is one of the major big data challenges in terms of infrastructure requirements,
because the infrastructure must deliver low, predictable latency both in capturing
data and in executing queries; handle very high transaction volumes, often in a
distributed environment; and support flexible and dynamic data structures.

…Data value Chain
Data Analysis:
➢ It is concerned with making the acquired raw data amenable to use in decision-making
as well as domain-specific usage.
➢ It involves exploring, transforming, and modeling data with the goal of highlighting
relevant data, and synthesizing and extracting useful hidden information with high
potential from a business point of view.

…Data value Chain
Data Curation:
➢ The active management of data over its life cycle to ensure it meets the necessary
data quality requirements for its effective usage.
➢ Its processes can be categorized into different activities such as content creation,
selection, classification, transformation, validation, and preservation.
➢ Data curation is performed by expert curators who are responsible for improving the
accessibility and quality of data.
➢ Data curators are responsible for ensuring that data is trustworthy,
discoverable, accessible, reusable, and fit for its purpose.

…Data value Chain
Data Storage:
➢ The persistence and management of data in a scalable way that satisfies
the needs of applications that require fast access to the data.
➢ Relational Database Management Systems (RDBMS) have been the
main, and almost the only, solution to the storage paradigm. However,
the ACID (Atomicity, Consistency, Isolation, and Durability) properties
that guarantee database transactions lack flexibility with regard to
schema changes.
➢ NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.

…Data value Chain
Data Usage:
➢ It covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data
analysis within the business activity.
➢ Data usage in business decision-making can enhance
competitiveness through the reduction of costs, increased
added value, or any other parameter that can be measured
against existing performance criteria.
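The five steps of the value chain can be pictured as a small pipeline. The Python sketch below is a toy illustration only: the records, the function names, and the use of a dictionary as "storage" are all assumptions made for the example, not part of any real big data system.

```python
# Hypothetical raw records, including an empty name and an unparsable amount.
raw_records = [
    {"customer": " Abebe ", "amount": "120.5"},
    {"customer": "", "amount": "80"},
    {"customer": "Sara", "amount": "not a number"},
    {"customer": "Sara", "amount": "40"},
]


def acquire(records):
    """Acquisition: gather, filter, and clean records before storage."""
    cleaned = []
    for record in records:
        name = record["customer"].strip()
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # drop records whose amount cannot be parsed
        if name:
            cleaned.append({"customer": name, "amount": amount})
    return cleaned


def analyze(records):
    """Analysis: transform the records into a total per customer."""
    totals = {}
    for record in records:
        name = record["customer"]
        totals[name] = totals.get(name, 0.0) + record["amount"]
    return totals


def curate(totals):
    """Curation: validate the results (here, reject negative totals)."""
    return {name: total for name, total in totals.items() if total >= 0}


def store(totals, storage):
    """Storage: persist the results (a dict stands in for a database)."""
    storage.update(totals)
    return storage


def use(storage):
    """Usage: answer a business question from the stored data."""
    return max(storage, key=storage.get)


database = store(curate(analyze(acquire(raw_records))), {})
print(database, "top customer:", use(database))
```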

Basic concepts of big data
➢ Big data is the term for a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools or traditional
data processing applications.
➢ In this context, a “large dataset” means a dataset too large to reasonably process or
store with traditional tooling or on a single computer.
➢ E.g.

Basic concepts of big data
Big data is characterized by the 3 Vs (volume, velocity, and variety) and more:

Source: Introduction to emerging technology module page 29

Clustered Computing
➢ Individual computers are often inadequate for handling big data at most stages.
➢ To address the high storage and computational needs of big data, computer clusters
are needed.
➢ Big data clustering software combines the resources of many smaller machines,
seeking to provide a number of benefits:
✓ Resource Pooling → combine available storage space, CPU, …
✓ High Availability → fault tolerance and availability
✓ Easy Scalability → meet growing resource requirements by adding machines,
without expanding the physical resources of any one machine
➢ A good example of clustering software is Hadoop’s YARN.
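The three benefits can be illustrated with a toy model in Python. This is not how YARN works internally. The `Node` class, the node names, and the capacities are hypothetical, and the sketch only shows the arithmetic of pooling resources, losing a node, and adding one.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_cores: int
    storage_gb: int
    alive: bool = True


def pooled_resources(nodes):
    """Resource pooling: sum the capacity of every node that is still up."""
    live = [node for node in nodes if node.alive]
    return {
        "nodes": len(live),
        "cpu_cores": sum(node.cpu_cores for node in live),
        "storage_gb": sum(node.storage_gb for node in live),
    }


cluster = [Node("node-1", 4, 500), Node("node-2", 4, 500), Node("node-3", 8, 1000)]
before = pooled_resources(cluster)

# High availability: one machine fails, the cluster keeps working with less.
cluster[0].alive = False
after_failure = pooled_resources(cluster)

# Easy scalability: grow by adding a machine, not by upgrading an existing one.
cluster.append(Node("node-4", 8, 1000))
after_scaling = pooled_resources(cluster)

print(before, after_failure, after_scaling, sep="\n")
```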

Hadoop and its Ecosystem
➢ Hadoop is an open-source framework intended to make interaction with big data
easier. It is a framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models.
➢ It provides massive data storage, enormous computational power, and the
ability to handle a virtually limitless number of jobs or tasks.
➢ The four key characteristics of Hadoop are:
✓ Economical → ordinary computers can be used for data processing
✓ Reliable → stores copies of data on different machines (resistant to hardware failure)
✓ Scalable → expands horizontally or vertically by adding a few extra nodes
✓ Flexible → stores as much structured and unstructured data as you need

…Hadoop and its Ecosystem
The Hadoop ecosystem has evolved from its four core components:
1. Data management,
2. Data access,
3. Data processing, and
4. Data storage.
It is continuously growing to meet the needs of big data.

Source: Introduction to emerging technology module page 31

Big Data Life Cycle with Hadoop
The life cycle has four stages:
1. Ingesting: data is transferred into Hadoop from various sources such as relational
databases, systems, or local files. Sqoop transfers data from an RDBMS to HDFS.
2. Processing: the data is stored and processed. The data is stored in the distributed
file system, HDFS, and in the NoSQL distributed database, HBase. Spark and MapReduce
perform the data processing.
3. Computing and analyzing: the data is analyzed by processing frameworks such as
Pig, Hive, and Impala. Pig converts the data using map and reduce operations and then
analyzes it.
4. Visualizing: the results are accessed through tools such as Hue and Cloudera
Search.
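The map and reduce idea mentioned in stages 2 and 3 can be sketched in plain Python as a word count. This runs on a single machine and does not use Hadoop's API. The function names and input lines are made up for illustration; a real cluster would run the map and reduce phases in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data needs big clusters", "data drives decisions"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```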

Review Questions
➢ Briefly explain the difference between data and information.
➢ Discuss data and its types from the computer programming and data analytics
perspectives.
➢ Briefly explain each step of the data value chain.
➢ List and discuss the characteristics of big data.
➢ What is the Hadoop system? What is it used for?

END!!
