Chapter 2. Introduction to Data Science
Objectives
After completing this chapter, the students will be able to:
• Describe what data science is and the role of data scientists.
• Differentiate data and information.
• Describe the data processing life cycle.
• Understand different data types from diverse perspectives.
• Describe the data value chain in the emerging era of big data.
• Understand the basics of Big Data.
• Describe the purpose of the Hadoop ecosystem components.
Overview of Data Science
Data Science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data (structured, semi-structured, and unstructured data).
Data Processing Cycle
Input step:
The input data is prepared in some convenient form for processing; the form depends on the processing machine.
Processing step:
The activities that convert the input into an output; the input data is changed to produce data in a more useful form.
Output step:
The result of the preceding processing step is collected; this result is called the output.
Example: Data Processing Cycle
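As a minimal sketch of the three steps, the following Python example reads raw exam marks (input), computes their average (processing), and reports the result (output). The data values and function names are invented for illustration; they are not from the chapter.

```python
# Minimal sketch of the data processing cycle: input -> processing -> output.
# The marks and function names are illustrative examples only.

def input_step():
    # Input: raw data prepared in a form convenient for processing.
    return [72, 85, 90, 64]  # e.g. exam marks collected on paper, then typed in

def processing_step(marks):
    # Processing: convert the input into a more useful form.
    return sum(marks) / len(marks)

def output_step(average):
    # Output: collect and present the result of processing.
    return f"Class average: {average:.1f}"

marks = input_step()
average = processing_step(marks)
print(output_step(average))  # -> Class average: 77.8
```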
Data types and their representation
Data types can be described from diverse
perspectives.
From the perspective of computer science
and computer programming, for instance, a
data type is simply an attribute of data that
tells the compiler or interpreter how the
programmer intends to use the data.
Data types from the computer programming perspective
All programming languages explicitly include the notion of data type.
Common data types include:
• Integers (int) - used to store whole numbers, mathematically known as integers
• Booleans (bool) - used to store a value restricted to one of two values: true or false
• Characters (char) - used to store a single character
• Floating-point numbers (float) - used to store real numbers
• Alphanumeric strings (string) - used to store a combination of characters and numbers
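The data types listed above can be sketched in Python. Note one caveat: Python is dynamically typed, so types are attached to values rather than declared, and Python has no separate char type (a character is a string of length 1). The variable names and values below are invented examples.

```python
# Hypothetical examples of the common data types, shown in Python.
count = 42            # integer (int): a whole number
is_valid = True       # boolean (bool): restricted to True or False
grade = "A"           # character: in Python, a string of length 1
price = 19.99         # floating-point number (float): a real number
plate = "ABC123"      # alphanumeric string (str): characters and digits

print(type(count).__name__, type(is_valid).__name__,
      type(price).__name__, type(plate).__name__)
# -> int bool float str
```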
Data types from the data analytics perspective
Data analytics is the science of analyzing raw data in order to draw conclusions from it.
From a data analytics point of view, there are
three common data types or structures:
Structured data
Semi-structured data
Unstructured data
Data types from Data Analytics perspective
Structured, Unstructured, and Semi-structured
Structured Data
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
Structured data conforms to a tabular format
with a relationship between the different rows
and columns.
Common examples of structured data are Excel
files or SQL databases. Each of these has
structured rows and columns that can be sorted.
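As a small runnable illustration of tabular structure, the sketch below builds an SQL table with Python's built-in sqlite3 module. The table and column names are invented for this example, not taken from the chapter.

```python
# Structured data: rows and columns that conform to a pre-defined model,
# using Python's built-in sqlite3 module (an in-memory SQL database).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, score REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Abebe", 88.5), (2, "Sara", 92.0)])

# Because the data conforms to a tabular model, it is straightforward
# to sort and query.
rows = conn.execute("SELECT name FROM students ORDER BY score DESC").fetchall()
print(rows)  # -> [('Sara',), ('Abebe',)]
```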
Unstructured Data
Unstructured data does not have a predefined data
model and is not organized in a pre-defined manner.
Unstructured information is typically text-heavy but
may contain data such as dates, numbers, and facts as
well.
Unstructured data is difficult to understand using
traditional programs as compared to data stored in
structured databases.
Common examples of unstructured data include audio files, video files, PDFs, Word documents, and NoSQL databases.
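To illustrate the point that unstructured text still contains dates and numbers, the sketch below extracts them with a regular expression. The sentence and patterns are invented for this example.

```python
# Unstructured text has no pre-defined model, yet it may still contain
# dates and numbers that can be pulled out, e.g. with regular expressions.
import re

note = "Meeting on 2023-04-15: 3 vendors quoted prices near 250 birr."

# An ISO-style date embedded in free text.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)
# Standalone numbers (lookarounds avoid matching pieces of the date).
numbers = re.findall(r"(?<![\d-])\d+(?![\d-])", note)

print(dates)    # -> ['2023-04-15']
print(numbers)  # -> ['3', '250']
```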
Semi-Structured Data
Semi-structured data is a form of structured
data that does not obey the tabular structure of
data models associated with relational databases or
other forms of data tables
Semi-structured data contains tags or other markers to separate semantic elements within the data; therefore, it is also known as a self-describing structure.
Examples of semi-structured data: XML, JSON, etc.
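A short JSON example shows the self-describing idea: keys act as the tags that mark semantic elements, and there is no fixed table schema. The record below is invented for illustration.

```python
# Semi-structured data: a JSON record whose keys describe its own structure.
import json

record = json.loads("""
{
  "name": "Hanna",
  "email": "hanna@example.com",
  "courses": ["Data Science", "Statistics"]
}
""")

# No tabular schema: fields can vary per record and values can nest,
# yet each element is labeled by its key.
print(record["courses"][0])  # -> Data Science
```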
Metadata – Data about Data
From a technical point of view, this is not a
separate data structure, but it is one of the most
important elements for Big Data analysis and big
data solutions.
Metadata is data about data.
It provides additional information about a
specific set of data.
For example, in a set of photographs, the metadata could describe when and where the photos were taken.
The metadata then provides fields for dates and
locations which, by themselves, can be
considered structured data.
Because of this reason, metadata is frequently
used by Big Data solutions for initial analysis.
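The photograph example above can be sketched in code: the image pixels themselves are unstructured, but the metadata fields (dates, locations) can be queried like structured data. The photo records below are invented for illustration.

```python
# Sketch of photo metadata as structured fields that describe each file.
photos = [
    {"file": "img_001.jpg", "date": "2021-05-01", "location": "Addis Ababa"},
    {"file": "img_002.jpg", "date": "2021-05-01", "location": "Bahir Dar"},
    {"file": "img_003.jpg", "date": "2021-06-12", "location": "Addis Ababa"},
]

# The pixels are unstructured, but the metadata can be filtered and
# sorted like structured data, which is why big data solutions often
# use it for initial analysis.
taken_in_addis = [p["file"] for p in photos if p["location"] == "Addis Ababa"]
print(taken_in_addis)  # -> ['img_001.jpg', 'img_003.jpg']
```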
Big Data Value Chain
The Big Data Value Chain describes the information flow within a big data system that aims to generate value and useful insights from data.
The Big Data Value Chain identifies the following key
high-level activities:
✓ Data Acquisition
✓ Data Analysis
✓ Data Curation
✓ Data Storage
✓ Data Usage
Data Value Chain (DVC)
Data Acquisition
Data Acquisition is the process of gathering, filtering,
and cleaning data before it is put in a data warehouse or
any other storage solution on which data analysis can be
carried out.
Data acquisition is one of the major big data challenges in
terms of infrastructure requirements.
The infrastructure required to support the acquisition of big data must provide:
• Low latency
• High transaction volumes
• Flexible and dynamic data structures
Data Analysis
Data Analysis is concerned with making the raw data
acquired amenable to use in decision-making as well as
domain-specific usages.
Data analysis involves exploring, transforming, and
modeling data with the goal of highlighting relevant
data, synthesizing and extracting useful hidden
information with high potential from a business point of
view.
Related areas include data mining, business intelligence,
and machine learning.
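The exploring/transforming/highlighting idea can be sketched with the standard-library statistics module: summarize the raw data, then flag values that stand out as potentially useful hidden information. The sales figures and the 1.5-standard-deviation threshold are invented for this example.

```python
# Tiny data-analysis sketch: explore raw data and highlight unusual values.
import statistics

daily_sales = [120, 135, 128, 310, 125, 131]  # raw data; 310 looks unusual

mean = statistics.mean(daily_sales)
stdev = statistics.stdev(daily_sales)

# "Highlight relevant data": flag values far from the mean as potentially
# useful hidden information (e.g. a promotion day or a data-entry error).
outliers = [x for x in daily_sales if abs(x - mean) > 1.5 * stdev]
print(outliers)  # -> [310]
```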
Data Curation
It is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements
for its effective usage.
Data curation is the process of content creation, selection, classification, transformation, validation, and preservation of data.
Data curation is performed by expert curators (data curators, scientific curators, or data annotators) who are responsible for improving the accessibility, quality, trustworthiness, discoverability, and reusability of data.
Data Storage
It is the persistence and management of data in a
scalable way that satisfies the needs of applications that
require fast access to the data.
Relational Database Management Systems (RDBMS)
have been the main solution to data storage.
A data lake is often the best solution for storing big data because it can support various data types; data lakes are typically based on Hadoop clusters, cloud object storage services, NoSQL databases, or other big data platforms.
Data Usage
It covers the data-driven business activities that need
access to data, its analysis, and the tools needed to
integrate the data analysis within the business
activity.
Characteristics of Big Data
Big Data Solutions:
Clustered Computing
A computer cluster is a set of computers that work together
so that they can be viewed as a single system.
Because of the qualities of big data, individual
computers are often inadequate for handling the data at
most stages.
To better address the high storage and computational
needs of big data, computer clusters are a better fit.
Big data clustering software combines the resources of
many smaller machines, seeking to provide a number of
benefits.
Benefits of Clustered Computing
Resource Pooling:
Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important.
Processing large datasets requires large
amounts of all three of these resources.
Storage (Hard Disk)
Processor (CPU)
Memory (RAM)
Benefits of Clustered Computing
High Availability:
Clusters can provide varying levels of fault tolerance and availability guarantees, preventing hardware or software failures from affecting access to data and processing.
This becomes increasingly important as we
continue to emphasize the importance of real-
time analytics.
Benefits of Clustered Computing
Easy Scalability:
Clusters make it easy to scale or to expand
horizontally by adding additional machines to
the network.
This means the system can react to changes in
resource requirements without expanding the
physical resources on a machine.
Hadoop Ecosystem
Hadoop is an open-source framework intended to make
interaction with big data easier.
Hadoop Ecosystem Interface
Big Data Life Cycle with Hadoop
The activities, or life cycle stages, involved in big data processing are:
I. Ingesting data into the system