Chapter 2. Introduction to Data Science
Objectives
After completing this chapter, the students will be able to:
• Describe what data science is and the role of data scientists.
• Differentiate data and information.
• Describe the data processing life cycle.
• Understand different data types from diverse perspectives.
• Describe the data value chain in the emerging era of big data.
• Understand the basics of Big Data.
• Describe the purpose of the Hadoop ecosystem components.
Overview of Data Science
Data Science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data (structured, semi-structured, and unstructured data).
Data Processing Cycle
Input step:
The input data is prepared in some convenient form for processing; the form depends on the processing machine.
Processing step:
The activities that convert the input into an output; the input data is changed to produce data in a more useful form.
Output step:
The result of the preceding processing step is collected; this result is called the output.
Example: Data Processing Cycle
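As a minimal sketch of the three steps, the following Python example reads raw exam marks (input), computes their average (processing), and reports the result (output). The data values and function names are invented for illustration; they are not from the chapter.

```python
# Minimal sketch of the data processing cycle: input -> processing -> output.
# The marks and function names are illustrative examples only.

def input_step():
    # Input: raw data prepared in a form convenient for processing.
    return [72, 85, 90, 64]  # e.g. exam marks collected on paper, then typed in

def processing_step(marks):
    # Processing: convert the input into a more useful form.
    return sum(marks) / len(marks)

def output_step(average):
    # Output: collect and present the result of processing.
    return f"Class average: {average:.1f}"

marks = input_step()
average = processing_step(marks)
print(output_step(average))  # -> Class average: 77.8
```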
Data types and their representation
Data types can be described from diverse
perspectives.
From the perspective of computer science
and computer programming, for instance, a
data type is simply an attribute of data that
tells the compiler or interpreter how the
programmer intends to use the data.
Data types from the computer programming perspective
All programming languages explicitly include the notion of data type.
Common data types include:
• Integers (int) - used to store whole numbers, mathematically known as integers
• Booleans (bool) - used to store a value restricted to one of two values: true or false
• Characters (char) - used to store a single character
• Floating-point numbers (float) - used to store real numbers
• Alphanumeric strings (string) - used to store a combination of characters and numbers
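The data types listed above can be sketched in Python. Note one caveat: Python is dynamically typed, so types are attached to values rather than declared, and Python has no separate char type (a character is a string of length 1). The variable names and values below are invented examples.

```python
# Hypothetical examples of the common data types, shown in Python.
count = 42            # integer (int): a whole number
is_valid = True       # boolean (bool): restricted to True or False
grade = "A"           # character: in Python, a string of length 1
price = 19.99         # floating-point number (float): a real number
plate = "ABC123"      # alphanumeric string (str): characters and digits

print(type(count).__name__, type(is_valid).__name__,
      type(price).__name__, type(plate).__name__)
# -> int bool float str
```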
Data types from the data analytics perspective
Data analytics is the science of analyzing raw data in order to draw conclusions from it.
From a data analytics point of view, there are
three common data types or structures:
Structured data
Semi-structured data
Unstructured data
Data types from Data Analytics perspective
Structured, Unstructured, and Semi-structured
Structured Data
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
Structured data conforms to a tabular format
with a relationship between the different rows
and columns.
Common examples of structured data are Excel
files or SQL databases. Each of these has
structured rows and columns that can be sorted.
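As a small runnable illustration of tabular structure, the sketch below builds an SQL table with Python's built-in sqlite3 module. The table and column names are invented for this example, not taken from the chapter.

```python
# Structured data: rows and columns that conform to a pre-defined model,
# using Python's built-in sqlite3 module (an in-memory SQL database).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, score REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Abebe", 88.5), (2, "Sara", 92.0)])

# Because the data conforms to a tabular model, it is straightforward
# to sort and query.
rows = conn.execute("SELECT name FROM students ORDER BY score DESC").fetchall()
print(rows)  # -> [('Sara',), ('Abebe',)]
```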
Unstructured Data
Unstructured data does not have a predefined data
model and is not organized in a pre-defined manner.
Unstructured information is typically text-heavy but
may contain data such as dates, numbers, and facts as
well.
Unstructured data is difficult to understand using
traditional programs as compared to data stored in
structured databases.
Common examples of unstructured data include audio files, video files, PDFs, Word documents, and NoSQL databases.
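To illustrate the point that unstructured text still contains dates and numbers, the sketch below extracts them with a regular expression. The sentence and patterns are invented for this example.

```python
# Unstructured text has no pre-defined model, yet it may still contain
# dates and numbers that can be pulled out, e.g. with regular expressions.
import re

note = "Meeting on 2023-04-15: 3 vendors quoted prices near 250 birr."

# An ISO-style date embedded in free text.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)
# Standalone numbers (lookarounds avoid matching pieces of the date).
numbers = re.findall(r"(?<![\d-])\d+(?![\d-])", note)

print(dates)    # -> ['2023-04-15']
print(numbers)  # -> ['3', '250']
```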
Semi-Structured Data
Semi-structured data is a form of structured
data that does not obey the tabular structure of
data models associated with relational databases or
other forms of data tables
Semi-structured data contains tags or other markers to separate semantic elements within the data; therefore, it is also known as a self-describing structure.
Examples of semi-structured data: XML, JSON, etc.
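A short JSON example shows the self-describing idea: keys act as the tags that mark semantic elements, and there is no fixed table schema. The record below is invented for illustration.

```python
# Semi-structured data: a JSON record whose keys describe its own structure.
import json

record = json.loads("""
{
  "name": "Hanna",
  "email": "hanna@example.com",
  "courses": ["Data Science", "Statistics"]
}
""")

# No tabular schema: fields can vary per record and values can nest,
# yet each element is labeled by its key.
print(record["courses"][0])  # -> Data Science
```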
Metadata – Data about Data
From a technical point of view, this is not a
separate data structure, but it is one of the most
important elements for Big Data analysis and big
data solutions.
Metadata is data about data.
It provides additional information about a
specific set of data.
For example, in a set of photographs, the metadata could describe when and where the photos were taken.
The metadata then provides fields for dates and
locations which, by themselves, can be
considered structured data.
Because of this reason, metadata is frequently
used by Big Data solutions for initial analysis.
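The photograph example above can be sketched in code: the image pixels themselves are unstructured, but the metadata fields (dates, locations) can be queried like structured data. The photo records below are invented for illustration.

```python
# Sketch of photo metadata as structured fields that describe each file.
photos = [
    {"file": "img_001.jpg", "date": "2021-05-01", "location": "Addis Ababa"},
    {"file": "img_002.jpg", "date": "2021-05-01", "location": "Bahir Dar"},
    {"file": "img_003.jpg", "date": "2021-06-12", "location": "Addis Ababa"},
]

# The pixels are unstructured, but the metadata can be filtered and
# sorted like structured data, which is why big data solutions often
# use it for initial analysis.
taken_in_addis = [p["file"] for p in photos if p["location"] == "Addis Ababa"]
print(taken_in_addis)  # -> ['img_001.jpg', 'img_003.jpg']
```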
Big Data Value Chain
The Big Data Value Chain describes the information flow within a big data system that aims to generate value and useful insights from data.
The Big Data Value Chain identifies the following key
high-level activities:
✓ Data Acquisition
✓ Data Analysis
✓ Data Curation
✓ Data Storage
✓ Data Usage
Data Value Chain (DVC)
Data Acquisition
Data Acquisition is the process of gathering, filtering,
and cleaning data before it is put in a data warehouse or
any other storage solution on which data analysis can be
carried out.
Data acquisition is one of the major big data challenges in
terms of infrastructure requirements.
The infrastructure required to support the acquisition of big data must provide:
• Low latency
• High transaction volumes
• Flexible and dynamic data structures
Data Analysis
Data Analysis is concerned with making the raw data
acquired amenable to use in decision-making as well as
domain-specific usages.
Data analysis involves exploring, transforming, and
modeling data with the goal of highlighting relevant
data, synthesizing and extracting useful hidden
information with high potential from a business point of
view.
Related areas include data mining, business intelligence,
and machine learning.
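The exploring/transforming/highlighting idea can be sketched with the standard-library statistics module: summarize the raw data, then flag values that stand out as potentially useful hidden information. The sales figures and the 1.5-standard-deviation threshold are invented for this example.

```python
# Tiny data-analysis sketch: explore raw data and highlight unusual values.
import statistics

daily_sales = [120, 135, 128, 310, 125, 131]  # raw data; 310 looks unusual

mean = statistics.mean(daily_sales)
stdev = statistics.stdev(daily_sales)

# "Highlight relevant data": flag values far from the mean as potentially
# useful hidden information (e.g. a promotion day or a data-entry error).
outliers = [x for x in daily_sales if abs(x - mean) > 1.5 * stdev]
print(outliers)  # -> [310]
```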
Data Curation
It is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements
for its effective usage.
Data curation is the process of content creation, selection, classification, transformation, validation, and preservation of data.
Data curation is performed by expert curators (data curators, scientific curators, or data annotators) who are responsible for improving the accessibility, quality, trustworthiness, discoverability, and reusability of data.
Data Storage
It is the persistence and management of data in a
scalable way that satisfies the needs of applications that
require fast access to the data.
Relational Database Management Systems (RDBMS)
have been the main solution to data storage.
A data lake is often the best solution for storing big data because it can support various data types; data lakes are typically based on Hadoop clusters, cloud object storage services, NoSQL databases, or other big data platforms.
Data Usage
It covers the data-driven business activities that need
access to data, its analysis, and the tools needed to
integrate the data analysis within the business
activity.
Characteristics of Big Data
Big Data Solutions:
Clustered Computing
A computer cluster is a set of computers that work together
so that they can be viewed as a single system.
Because of the qualities of big data, individual
computers are often inadequate for handling the data at
most stages.
To better address the high storage and computational
needs of big data, computer clusters are a better fit.
Big data clustering software combines the resources of
many smaller machines, seeking to provide a number of
benefits.
Benefits of Clustered Computing
Resource Pooling:
Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important.
Processing large datasets requires large
amounts of all three of these resources.
Storage (Hard Disk)
Processor (CPU)
Memory (RAM)
Benefits of Clustered Computing
High Availability:
Clusters can provide varying levels of fault tolerance and availability guarantees, preventing hardware or software failures from affecting access to data and processing.
This becomes increasingly important as we
continue to emphasize the importance of real-
time analytics.
Benefits of Clustered Computing
Easy Scalability:
Clusters make it easy to scale or to expand
horizontally by adding additional machines to
the network.
This means the system can react to changes in
resource requirements without expanding the
physical resources on a machine.
Hadoop Ecosystem
Hadoop is an open-source framework intended to make
interaction with big data easier.
Hadoop Ecosystem Interface
Big Data Life Cycle with Hadoop
The activities, or life cycle stages, involved in big data processing are:
I. Ingesting data into the system