
Chapter 2

Data Science
OUTLINE
 After completing this chapter, you will be able to:

 Describe what data science is and the role of data scientists.

 Differentiate between data and information.

 Describe the data processing life cycle.

 Understand different data types from diverse perspectives.

 Describe the data value chain in the emerging era of big data.

 Understand the basics of Big Data.

 Describe the purpose of the Hadoop ecosystem components.


INTRODUCTION TO DATA SCIENCE
 Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge
and insights from data.

 The main purpose of data science is to find patterns within the
data, using several techniques to analyze the data and draw
insights from it.

 Data science is much more than simply analyzing data. It offers a
range of roles and requires a range of skills.

 A data scientist is the person who performs analysis on data and gives
insights to decision makers. In data science, the data scientist has
the responsibility of making predictions from the data.
INTRODUCTION TO DATA SCIENCE
 A data scientist aims to derive conclusions from the whole data.
With the help of these conclusions, the data scientist can support
industries in making smarter business decisions.

 Data Science is all about:

 Asking the correct questions and analyzing the raw data.

 Visualizing the data to get a better perspective.

 Modeling the data by using various complex and efficient algorithms.

 Understanding the data to make better decisions and find the final result.
EXAMPLE OF DATA SCIENCE
 Let us consider choosing a department from the available departments at Dilla
University as an example of a data science problem:

 Before choosing a department, you write your preferences on paper, such as Law,
Anthropology, Economics, Accounting, etc., based on what you see during the semester.
This planned choice is a piece of data that you can read, whether it is written in
pencil or on a mobile device.

 When it is time to choose, you use your data as a reminder to choose a department.

 When you choose a department, the registrar registers you in the department you
want.

 At the registrar, a system may report that a department's capacity is full and that
students must be assigned to another department.

 Finally, at the end of registration, the registrar employees see different graphs of
student data based on sex, age, country, etc. They use this information to prepare
for the next year.
EXAMPLE OF DATA SCIENCE
 The small piece of information that began in your notebook
ended up in many different places, most notably in the registrar's
office as an aid to decision making.

 The data went through many transformations. In addition to the
computers the data may have passed through or been stored on
for the long term, lots of other pieces of hardware were involved
in collecting, manipulating, transmitting, and storing the data.

 In addition, many different pieces of software were used to
organize, aggregate, visualize, and present the data. Finally,
many different human systems were involved in working with
the data.
DATA AND INFORMATION
 Data can be defined as a representation of facts, concepts, or
instructions in a formalized manner, which should be suitable
for communication, interpretation, or processing by humans or
electronic machines.

 It can be described as unprocessed facts and figures.

 It is represented with the help of characters such as alphabets
(A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
DATA AND INFORMATION
 Information is the processed data on which decisions and
actions are based. It is data that has been processed into a form
that is meaningful to the recipient.

 Furthermore, information is interpreted data, created from
organized, structured, and processed data in a particular
context.
DATA AND INFORMATION
 Data:
 raw facts
 no context
 just numbers and text

 Information:
 data with context
 processed data
 value added to data
 summarized
 organized
 analyzed

 For example:
 Data: 51012
 Information:
 5/10/12: the date of your final exam.
 51,012 Birr: the starting salary of an accounting major.
 51012: the zip code of Dilla.
DATA PROCESSING CYCLE
 Data processing is the re-structuring or re-ordering of data by people
or machines to increase its usefulness and add value for a particular
purpose. Three steps constitute the data processing cycle.

 Input: in this step, the input data is prepared in some convenient form
for processing. The input data can be recorded on any one of several
types of storage media, such as a hard disk, CD, flash disk, paper, and so on.

 E.g., when opening an account at a CBE branch, your data is captured on a
form and stored in the bank's system as input.
DATA PROCESSING CYCLE
 Processing: in this step, the input data is changed to produce data
in a more useful form. For example, interest can be calculated on a
deposit at a bank, or a summary of the withdrawals in a month can be
calculated.

 Output: at this stage, the result of the processing step is collected.
The particular form of the output data depends on the use of the
data. For example, the output data may be a summary such as a bank
statement.
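
A minimal Java sketch of the input, processing, and output steps, using the deposit-interest example above; the deposit amount and interest rate are assumed, illustrative values:

public class InterestExample {
    public static void main(String[] args) {
        // Input: raw data captured from a form or storage medium (assumed values)
        double deposit = 10_000.0;   // deposit in Birr
        double annualRate = 0.07;    // 7% annual interest rate

        // Processing: transform the input into a more useful form
        double interest = deposit * annualRate;

        // Output: the result of processing, e.g. a line on a bank statement
        System.out.printf("Interest earned this year: %.2f Birr%n", interest);
    }
}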

 There are various data science tools which are used to
analyze different types of data:

 SAS (Statistical Analysis Software)


DATA PROCESSING TOOLS
 SAS (Statistical Analysis Software): one of the data science
tools specially designed for statistical operations; it is used
by large organizations to analyze data.

 Weka: a tool that allows easier implementation of
machine learning algorithms through an interactive platform. The
user can understand the functioning of machine learning on the
data without having to write a line of code. Weka is ideal
for data scientists who are beginners in machine learning.

 Excel: Microsoft developed Excel for spreadsheet
calculations, but nowadays it is also widely used for data
processing and visualization.
DATA TYPE

 Data types can be described from diverse perspectives.

 In computer science, a data type is simply an attribute
(characteristic) of data that tells the programmer the intended
use of the data. For example:

 int x = 5;              // integer
 String name = "Abebe";  // string (text)
 float p = 3.14f;        // floating-point number (note the f suffix in Java)

 From a data analytics point of view, there are three data
types: structured, semi-structured, and unstructured.
DATA TYPE
 Structured data: data whose elements are organized in a
tabular format. It is easy to retrieve, search, and perform
analysis on this type of data. Examples: Excel files, SQL databases.

 Semi-structured data: a form of structured data that does not
obey the tabular format but has some organizational
properties that make it easier to analyze, i.e. it might contain tags
or markers to separate semantic elements in the data. Therefore it is
also known as a self-describing structure. Examples: JSON and XML.

 Unstructured data: data which is not organized in a pre-defined
manner or does not have a pre-defined data structure.
Examples: images, audio, video, and free-form text.
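
A minimal Java sketch contrasting the three forms; the student record, field names, and values are illustrative assumptions (requires Java 16+ for records and text blocks):

public class DataTypeExamples {
    // Structured: fixed schema, like a row in an SQL table or an Excel sheet
    record Student(int id, String name, String department) {}

    public static void main(String[] args) {
        Student structured = new Student(1, "Abebe", "Economics");

        // Semi-structured: self-describing tags/keys (JSON), no rigid tabular schema
        String semiStructured = """
                { "id": 1, "name": "Abebe", "skills": ["SQL", "Java"] }
                """;

        // Unstructured: free-form text with no pre-defined structure
        String unstructured = "Abebe wrote that he enjoys economics and data analysis.";

        System.out.println(structured);
        System.out.println(semiStructured);
        System.out.println(unstructured);
    }
}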
DATA TYPE
 Metadata is not a separate data type, but it is one of the most
important elements for Big Data analysis and big data solutions.

 Metadata is data about data. It provides additional information
about a specific set of data.

 In a set of photographs, for example, metadata could describe
when and where the photos were taken.

 The metadata then provides fields for dates and locations which,
by themselves, can be considered structured data. For this
reason, metadata is frequently used by Big Data solutions
for initial analysis.
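
A small Java sketch of photo metadata stored as structured fields; the field names and values are illustrative assumptions:

import java.util.Map;

public class PhotoMetadataExample {
    public static void main(String[] args) {
        // The photo itself is unstructured data; this map is metadata describing it
        Map<String, String> photoMetadata = Map.of(
                "fileName", "graduation.jpg",
                "dateTaken", "2023-07-01",
                "location", "Dilla, Ethiopia");

        // Fields such as dates and locations can be analyzed like any structured data
        photoMetadata.forEach((field, value) -> System.out.println(field + ": " + value));
    }
}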
DATA VALUE CHAIN
 Data value chain describes the information flow within a big data
system as a series of steps needed to generate value and useful
insights from data.

 The Big Data Value Chain identifies the following key high-level
activities:
DATA ACQUISITION
 It is the process of gathering, filtering, and cleaning data before it is
put in a data warehouse.

 A data warehouse is a subject-oriented, integrated (unified), time-
variant, and nonvolatile collection of data in support of management's
decision-making process. A data warehouse is a decision-support
database that is maintained separately from the organization's
operational database.

 Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.

 The infrastructure required to support the acquisition of big data must
deliver low, predictable latency in capturing data. It must be able to
handle very high transaction volumes, often in a distributed
environment.
DATA ANALYSIS

 Data analysis involves exploring, transforming, and modeling
data with the goal of highlighting relevant data,
synthesizing (combining) and extracting useful hidden
information with high potential from a business point of view.

 The data will be analyzed to obtain the following characteristics
from the data:

 Classification

 Clustering

 Association, etc.

 To perform the analysis, techniques from different computer science
areas are applied, such as machine learning and data mining.


DATA CURATION
 It is the active management of data over its life cycle to ensure it
meets the necessary data quality requirements for its effective
usage.

 Data curation processes can be categorized into different
activities such as:

 Content creation and selection

 Transformation and validation

 Preservation.

 Data curators (scientific curators or data annotators) hold the
responsibility of ensuring that data are trustworthy,
discoverable, accessible, reusable, and fit for their purpose.
DATA STORAGE
 In the data storage phase, the collected data is stored and prepared to
be used in the next phase. The collected data needs to be secured.

 In order to guarantee the safety of the collected data, security
measures such as data anonymization, permutation, and data
partitioning (vertical or horizontal) can be used.

 In data storage we must ensure the persistent management of data
in a scalable way that satisfies the needs of applications that require
fast access to the data.

 RDBMSs have been the main, and almost unique, solution to the
storage paradigm for nearly 40 years, but they lack flexibility when data
volumes and complexity grow. For this reason, NoSQL technologies
have been designed.
DATA USAGE
 It covers the data-driven business activities that need access to
data, its analysis, and the tools needed to integrate the data
analysis within the business activity.

 Data usage in business decision-making can enhance
competitiveness through the reduction of costs, increased added
value, or any other parameter that can be measured against
existing performance criteria.

 Data usage is supported by specific tools and in turn by
query and scripting languages that typically depend on the
underlying data stores, their execution engines, APIs, and
programming models.
BIG DATA
 Big data is a blanket term for the non-traditional strategies and
technologies needed to gather, organize, process, and draw insights
from large datasets.

 While the problem of working with data that exceeds the computing
power or storage of a single computer is not new, the pervasiveness,
scale, and value of this type of computing have greatly expanded in
recent years.

 Big Data is a collection of data that is huge in volume, yet growing
exponentially with time. It is data of such large size and complexity
that none of the traditional data management tools can store it or
process it efficiently.

 "Large" means a dataset too large to reasonably process or store with
traditional tooling or on a single computer.


BIG DATA
 Big data is characterized by the 4 Vs:
 Volume: large amounts of data; zettabytes/massive datasets.
 Velocity: the speed of data processing. Data is live streaming or
in motion
 Variety: data comes in many different forms from diverse
sources.
 Veracity: can we trust the data? How accurate/truthful is it?
BIG DATA
 Big Data is used for analysis to get insights that will help with
business decisions.

 Some real-world examples that will explain how big data is used are as
follows:

 Big Data is used to find out consumer shopping habits.

 It can be used to monitor health conditions through data from wearables.

 The transportation industry uses fuel optimization tools where big data is
used.

 It is used for predictive inventory ordering.

 It can help you with real-time data monitoring(like knowing how your
networks, applications, and services are performing.) and cybersecurity
protocols.
BIG DATA
 Big Data technologies can handle, manage, and process different
types of data: structured, semi-structured, and unstructured.

 It is cost-effective in terms of maintaining a large amount of data.
It works on a distributed system.

 We can save large amounts of data for a long time using Big Data
techniques, so it is easy to handle historical data and generate
accurate reports.

 Data processing speed is very fast, and thus social media platforms
use Big Data techniques.

 It allows users to make efficient decisions for their business
based on current and historical data.
WHO’S GENERATING BIG DATA
 There are many sources that are generating big data:

 Social media and networks (all of us are generating data)

 Mobile devices (tracking all objects all the time)

 Scientific instruments (collecting all sorts of data)

 Sensor technology and networks (measuring all kinds of data)
CLUSTERED COMPUTING
 To better address the high storage and computational needs of big
data, we need a fast, secure, and reliable computing environment.

 Cluster computing solves the problems of stand-alone
technology. The objective is to improve on the performance/power efficiency
of a single processor for storing and mining large data sets by using
multiple disks and CPUs.

 Cluster computing means that many computers are connected on a
network and perform like a single entity. Each computer that is
connected to the network is called a node.

 Cluster computing offers solutions to complicated problems by
providing faster computational speed and enhanced data integrity. The
connected computers execute operations all together, thus creating the
impression of a single system (virtual machine).
CLUSTERED COMPUTING
 Clustering software provides a number of benefits, such as:

 Resource Pooling: combining the available storage space to
hold data.

 High Availability: guarantees availability in case of hardware or
software failures.

 Easy Scalability: easy to scale horizontally by adding machines to
the group.

 Clusters require a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.

 Cluster membership and resource allocation can be handled by
software such as Hadoop's YARN (Yet Another Resource Negotiator).


HADOOP ECOSYSTEM
 Hadoop is an open-source framework intended to make interaction
with big data easier. It is a framework that allows for the distributed
processing of large datasets across clusters of computers using simple
programming models.

 The four key characteristics of Hadoop are:

 Economical: its systems are highly economical as ordinary

computers can be used for data processing.

 Reliable: stores copies of the data on different machines and is

resistant to hardware failure.

 Scalable: it is easily scalable, both horizontally and vertically.

 Flexible: it can store as much structured and unstructured data as needed.


HADOOP ECOSYSTEM
 Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and storage.

 In the ecosystem there are two main components for data storage:

 HDFS (Hadoop Distributed File System): the major
component of the Hadoop ecosystem, responsible for
storing large data sets of structured or unstructured data.

 HBase: a NoSQL database that runs on top of HDFS and
stores structured and semi-structured data.
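
A minimal sketch of storing a local file in HDFS through the Java FileSystem API, assuming a running HDFS cluster and the Hadoop client library on the classpath; the namenode address and file paths are illustrative assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed namenode address

        // Copy a local file into the distributed file system
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.copyFromLocalFile(new Path("/tmp/students.csv"),
                                 new Path("/data/students.csv"));
        }
    }
}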


HADOOP ECOSYSTEM
 For processing, the ecosystem has:

 YARN (Yet Another Resource Negotiator): manages the resources
across the clusters. It performs scheduling and resource allocation
for the Hadoop system.

 MapReduce: makes it possible to carry over the processing logic
and helps to write applications which transform big data sets into
manageable ones. It takes away the complexity of distributed
programming by exposing two processing steps that developers
implement: 1) Map and 2) Reduce. In the Map step, data is split
into chunks and converted into intermediate (key, value) pairs; in the
Reduce step, the intermediate outputs are aggregated into the final result.
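
The two steps can be sketched with the classic word-count example; this assumes the Hadoop MapReduce client library is on the classpath, and the job configuration (driver) code is omitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map step: split each input line into words and emit (word, 1) pairs
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce step: sum the counts emitted for each word to produce the final result
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}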


HADOOP ECOSYSTEM
 For data access:

 Hive: performs reading and writing of large data sets.

 Pig: structures the data flow for processing and analyzing huge
data sets.

 Mahout: provides various functionalities such as
collaborative filtering, clustering, and classification. It allows
invoking algorithms with the help of its own libraries.

 For data management, the ecosystem has:

 Solr, Lucene: searching and indexing
 Zookeeper: managing the cluster
 Oozie: job scheduling
BIG DATA LIFE CYCLE WITH HADOOP
 Ingest: the first stage of Big Data processing is ingesting data into the system.

 The data is ingested or transferred to Hadoop from various sources such
as relational databases, other systems, or local files. Sqoop transfers data from
an RDBMS to HDFS, whereas Flume transfers event data (such as social
media data, clickstreams, etc.).

 Processing: in this stage, the data is stored and processed. The data is
stored in HDFS and/or HBase. Spark and MapReduce perform the data
processing.

 Analyze: the data is analyzed by processing frameworks such as Pig, Hive,
and Impala.

 Access: this stage is performed by tools such as Hue and Cloudera Search. In
this stage, the analyzed data can be accessed by users.


SUMMARY
 Let us understand the Hadoop ecosystem through a real-life example. There is a
company that has established branches in three different cities; let us
assume branches in A.A, Awassa, and Dilla.

 In every branch, the entire customer data is stored in the local database daily.

 On a quarterly, half-yearly, or yearly basis, the organization wants to analyze
this data for business development. To do this, the organization collects all this
data from multiple sources, performs the necessary cleaning steps, and puts it in
a data warehouse.

 Then, we can use it for analytical purposes. For analysis, we can generate
reports from the data available in the data warehouse. Multiple charts and
reports can be generated using business intelligence tools.

 We require this analysis to grow the business and make
appropriate decisions for the organization.
THE END
?
