
Adigrat University

College of Engineering and Technology


Department of Computing

Course Title: Introduction to Emerging Technologies


Course Code:

Chapter Two: Data Science


Outline

• An Overview of Data Science


• Data types and their representation
• Data Value Chain
• Basic Concepts of Big Data

2
An Overview of Data Science
• Data science is a multi-disciplinary field.
• It uses scientific methods, processes, algorithms, and systems to
  extract knowledge and insights from structured and unstructured
  data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.

• Let’s consider the data involved in buying a box of cereal (teff, wheat,
  or barley) from a store:
• Prepare for the purchase by writing “cereal” in your notebook.
  This planned purchase is a piece of data.
• In the store, use your data as a reminder to grab the item and put it in
  your cart.

3
An Overview of Data Science…
• The cashier scans the barcode on your container, and the
cash register logs the price.

• If your purchase was one of the last boxes in the store, a computer tells
  the stock manager that it is time to request another order from the
  distributor.

• At the end of the month, a store manager looks at a collection of
  pie charts showing all the different kinds of cereal that were sold
  and decides to offer more varieties of these next month.

• So, the small piece of information that began in your notebook
  ended up on the desk of a manager as an aid to decision making.

4
An Overview of Data Science…
• On the trip from your pencil (notebook) to the manager’s desk, the
  data went through many transformations.
• Pieces of hardware such as the barcode scanner were involved in
  collecting, manipulating, and storing the data.
• Different pieces of software were used to organize, aggregate,
  visualize, and present the data.
• People decided which systems to buy and install, and who should
  get access to what kinds of data.

• As an academic discipline, data science continues to evolve as one of
  the most promising and in-demand career paths for skilled
  professionals.
• Today, successful data professionals understand that they must
  advance beyond the traditional skills of analyzing large amounts of
  data, data mining, and programming.
5
An Overview of Data Science…
What are data and information?
• Data is a representation of facts, concepts, or instructions in a
formalized manner.
• It should be suitable for communication, interpretation, or
processing, by human or electronic machines.
• It can be described as unprocessed facts and figures.
• It is represented with the help of characters such as
 alphabets (A-Z, a-z), digits (0-9) or special characters (+, -, /,
*, <,>, =, etc.).

• Information is processed data on which decisions and actions are
  based; it is created from organized, structured, and processed
  data in a particular context.

6
An Overview of Data Science…
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people or
  machines to increase its usefulness and add value for a particular
  purpose.
• It consists of the following basic steps:
 input, processing, and output
• These three steps constitute the data processing cycle.

7
An Overview of Data Science…
Data Processing Cycle...
A. Input: the input data is prepared in some convenient form for
processing, depending on the processing machine.
• For example, for electronic computers, input data can be
recorded on any one of several types of storage media,
such as a hard disk, CD, or flash disk.
B. Processing: the input data is changed to produce data in a more
useful form.
• For example, interest can be calculated on a deposit to a bank, or a
summary of sales for the month can be calculated from the
sales orders.
C. Output: the result of the processing step is collected.
• For example, output data may be the payroll for employees.

8
Data Types and their Representation
• Data types can be described from diverse perspectives.
• For instance, in computer programming, a data type is simply an
attribute of data that tells the compiler how the programmer intends
to use the data.

Data Types from Computer Programming Perspective

• A data type defines the operations that can be done on the data,
though different languages may use different terminology.
• Common data types include:
• Integers (int) - used to store whole numbers
• Floating-point numbers (float) - used to store real numbers
• Characters (char) - used to store a single character
• Booleans (bool) - used to store one of two values: true or false
• Alphanumeric strings (string) - used to store characters and numbers
9
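The common data types listed above can be illustrated with Python's built-in types (a rough sketch; names and behavior differ across languages, and Python has no separate char type):

```python
# The common data types from the list above, in Python.
quantity: int = 3            # integer: whole numbers
price: float = 12.99         # floating-point: real numbers
grade: str = "A"             # character: a single character (a 1-char string in Python)
in_stock: bool = True        # boolean: one of two values, True or False
label: str = "Box-42"        # alphanumeric string: characters and numbers

# The data type tells the language which operations are valid
# and what the result type is:
total = quantity * price     # multiplying an int by a float
print(type(total).__name__)  # -> float
```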
Data types and their representation…
Data types from Data Analytics perspective
• From a data analytics point of view, there are three common types
of data structures:
A. Structured
B. Semi-structured
C. Unstructured data types
The figure below describes the three types of data and metadata.

10
Data types and their representation…
Data types from Data Analytics perspective…
A. Structured Data: is data that adheres to a pre-defined data
model and is therefore straightforward to analyze
• It conforms to a tabular format with a relationship between
the different rows and columns.
• Common examples of structured data are Excel files or
SQL databases. Each of these has structured rows and
columns that can be sorted.
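A small sketch of structured data, using Python's built-in sqlite3 module; the table and its values are invented for illustration. Because the data conforms to a pre-defined model of rows and columns, it can be sorted and queried in a straightforward way:

```python
# Hypothetical structured data: rows and columns in a SQL table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cereal (name TEXT, grain TEXT, price REAL)")
conn.executemany(
    "INSERT INTO cereal VALUES (?, ?, ?)",
    [("Box A", "teff", 4.50), ("Box B", "wheat", 3.25)],
)
# The pre-defined schema makes analysis straightforward:
rows = conn.execute("SELECT name FROM cereal ORDER BY price").fetchall()
print(rows)  # -> [('Box B',), ('Box A',)]
```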
B. Semi-structured Data: is a form of structured data that
• Does not conform to the formal structure of data models
associated with relational databases.
• Contains tags or other markers to separate semantic
elements and enforce hierarchies of records and fields
• It is also known as a self-describing structure
• Example: JSON and XML
11
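The JSON example mentioned above shows why semi-structured data is called self-describing: keys act as tags that separate semantic elements and nest records, without any fixed table schema. A minimal sketch (field names invented):

```python
# Hypothetical semi-structured data: a JSON record whose keys (tags)
# label and nest the fields, so the structure describes itself.
import json

record = """
{
  "student": {
    "name": "Alem",
    "courses": ["Emerging Technologies", "Data Science"],
    "year": 2
  }
}
"""
data = json.loads(record)
# No pre-defined table schema, yet fields are labeled and hierarchical.
print(data["student"]["courses"][0])  # -> Emerging Technologies
```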
Data types and their representation…
Data types from Data Analytics perspective…
C. Unstructured Data: is information that either
• Does not have a predefined data model or is not organized
in a pre-defined manner.
• It is typically text-heavy but may contain data such as dates,
numbers, and facts as well.
• This results in irregularities and ambiguities
• For example, audio and video files, or NoSQL databases
D. Metadata (Data about Data): provides additional information
about a specific set of data.
• Frequently used by Big Data solutions for initial analysis
• For example, in a set of photographs, metadata could describe
when and where the photos were taken.

12
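The photograph example above can be sketched as a small metadata dictionary. The field names below mimic EXIF-style tags but are invented for illustration:

```python
# Hypothetical metadata (data about data) for a set of photographs.
photo_metadata = {
    "IMG_001.jpg": {"taken": "2023-05-01T09:30", "location": "Adigrat"},
    "IMG_002.jpg": {"taken": "2023-05-01T10:05", "location": "Mekelle"},
}

# An initial analysis can use the metadata alone, without ever
# opening the image files themselves:
places = {meta["location"] for meta in photo_metadata.values()}
print(sorted(places))  # -> ['Adigrat', 'Mekelle']
```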
Data Value Chain
• The Data Value Chain is introduced to describe the information flow
within a big data system as a series of steps needed to generate value
and useful insights from data.
• The Big Data Value Chain identifies the following key high-level
  activities: data acquisition, data analysis, data curation, data
  storage, and data usage.
13
Data Value Chain…

Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it is
put in a data warehouse or any other storage solution on which data
analysis can be carried out.

• It is one of the major big data challenges in terms of infrastructure


requirements.

• The infrastructure required to support the acquisition of big data must


deliver low, predictable latency in both capturing data and in executing
queries.

• It must be able to handle very high transaction volumes, often in a
  distributed environment, and
• Support flexible and dynamic data structures.
14
Data Value Chain…
Data Analysis

• It is concerned with making the raw data acquired amenable to use in


decision-making as well as domain-specific usage.

• Data analysis involves


• Exploring, transforming, and modeling data with the goal of
highlighting relevant data
• Synthesizing and extracting useful hidden information with high
potential from a business point of view.

• Related areas include:


• Data mining
• Business intelligence
• Machine learning
15
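The exploring/transforming/modeling steps above can be illustrated with a toy analysis in plain Python. The cereal sales figures are invented, and the "model" here is just a trend summary, a deliberately minimal stand-in for real data-mining or machine-learning methods:

```python
# Toy sketch of the analysis step: transform raw monthly sales data
# to highlight the relevant, business-useful information.
monthly_sales = {
    "teff":   [120, 135, 150],
    "wheat":  [80, 78, 75],
    "barley": [40, 42, 45],
}

# Transform: summarize each cereal's trend (last month minus first).
trend = {k: v[-1] - v[0] for k, v in monthly_sales.items()}

# Extract the useful hidden information: which products are growing?
growing = [k for k, t in trend.items() if t > 0]
print(growing)  # -> ['teff', 'barley']
```

A store manager could act on this output directly, much like the pie-chart example earlier in the chapter.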
Data Value Chain…
Data Curation

• It is the active management of data over its life cycle to ensure it


meets the necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities
such as:
• Content creation
• Selection and classification
• Transformation, validation, and preservation
• It is performed by expert curators who are responsible for improving
the accessibility and quality of data.

• Data curators hold the responsibility of ensuring that data are


trustworthy, discoverable, accessible, reusable and fit their purpose.

16
Data Value Chain…
Data Storage

• It is the persistence and management of data in a scalable way that


satisfies the needs of applications that require fast access to the data.

• Relational DBMSs have been the main storage solution for nearly

40 years.

• NoSQL technologies have been designed with the scalability goal in


mind and present a wide range of solutions based on alternative data
models.

• However, relational systems that guarantee the ACID (Atomicity,

Consistency, Isolation, and Durability) properties for database
transactions lack flexibility with regard to schema changes, and their
performance and fault tolerance suffer when data volumes and
complexity grow.
17
Data Value Chain…
Data Usage

• It covers the data-driven business activities that need access to data

and its analysis, as well as the tools needed to integrate that analysis
within the business activity.

• Data usage in business decision making can enhance competitiveness


through
• The reduction of costs
• Increased added value
• Any other parameter that can be measured against existing
performance criteria.

18
Basic Concepts of Big Data
• Big data is a term for the non-traditional strategies and technologies
needed to gather, organize, and process large datasets, and to draw
insights from them.
What Is Big Data?
• Big data is the term for a collection of large and complex data sets

• that are difficult to process using on-hand database management

tools or traditional data processing applications.

• Big data is characterized by 3Vs and more:

• Volume: large amounts of data, zettabytes/massive datasets
• Velocity: data is live-streaming or in motion
• Variety: data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it?

19
Basic Concepts of Big Data…
The figure below shows the characteristics of big data.

20
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Because of the sheer quantities involved, individual computers are
often inadequate for handling big data at most stages.

• To better address the high storage and computational needs of big


data, computer clusters are a better fit.

• Big data clustering software combines the resources of many smaller


machines, seeking to provide a number of benefits:
• Resource Pooling: Combining the available storage space to hold
data is a clear benefit.
• But CPU and memory pooling are also extremely important.
• Processing large datasets requires large amounts of all three of
these resources.
21
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Clustered Computing…

• High Availability: Clusters can provide varying levels of fault


tolerance and availability guarantees
• To prevent hardware or software failures from affecting access
to data and processing.
• This becomes increasingly important as we continue to
emphasize the importance of real-time analytics.

• Easy Scalability: Clusters make it easy to scale horizontally by


adding additional machines to the group.
• This means the system can react to changes in resource
requirements without needing to expand the physical resources
of any one machine.
22
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Clustered Computing…
• Using clusters requires a solution for managing
• Cluster membership
• Coordinating resource sharing
• Scheduling actual work on individual nodes

• Cluster membership and resource allocation can be handled by


software like Hadoop’s YARN.

• The assembled computing cluster often acts as a foundation that other


software interfaces with to process the data.

• The machines involved in the computing cluster are also typically


involved with the management of a distributed storage system.
23
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction
with big data easier.
• It is a framework that allows for the distributed processing of large
datasets across clusters of computers

• The four key characteristics of Hadoop are:


• Economical: highly economical as ordinary computers can be used
for data processing
• Reliable: as it stores copies of the data on different machines and is
resistant to hardware failure.
• Scalable: It is easily scalable horizontally and vertically
• Flexible: It is flexible and you can store as much structured and
unstructured data as you need.
24
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem…

• Hadoop has an ecosystem that has evolved from its four core
components:
• Data management
• Access
• Processing
• Storage
• It is continuously growing to meet the needs of Big Data.
• It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• HBase: NoSQL Database
25
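MapReduce is listed above as Hadoop's programming-based data processing component. The sketch below imitates its two phases (map, then shuffle/reduce) on a single machine in plain Python; it is a conceptual illustration, not the Hadoop API:

```python
# Single-machine sketch of the MapReduce idea: a word count.
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs from each input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big clusters", "data value chain"]
print(reduce_phase(map_phase(lines)))
# -> {'big': 2, 'data': 2, 'clusters': 1, 'value': 1, 'chain': 1}
```

In real Hadoop, the map and reduce functions run in parallel across the cluster's nodes, with the framework handling the shuffle between them.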
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…
Hadoop and its Ecosystem…
Below figure shows Hadoop Ecosystem

26
Basic Concepts of Big Data…
Clustered Computing and Hadoop Ecosystem…

Big Data Life Cycle with Hadoop


1. Ingesting data into the system: this is the first stage; data is
ingested or transferred to Hadoop from various sources such as
relational databases and local files.
2. Processing the data in storage: the data is stored in the
distributed file system (HDFS), and NoSQL databases such as
HBase are used for data processing.
3. Computing and analyzing data: the data is analyzed by processing
frameworks such as Pig, Hive, and Impala.
4. Visualizing the results: this stage is Access, performed
by tools such as Hue and Cloudera Search.
• The analyzed data can be accessed by users.

27
