2 Data Science

This document provides an overview of data science and key concepts related to big data. It discusses how data science uses scientific methods to extract knowledge from various types of data. It also defines key terms like data, information, and different data types. Additionally, it covers the data value chain, characteristics of big data, Hadoop ecosystem, and the basic big data life cycle.

Uploaded by

kigali ac

Chap 2. DATA SCIENCE

Lecturer: Dr Djuma SUMBIRI

1. An Overview of Data Science

 Data science is a multi-disciplinary field.
 It uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured and unstructured data.

2. Data vs Information
 Data
 A representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
 Unprocessed facts and figures.
 Represented with characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
 Information
 The processed data on which decisions and actions are based.
 Interpreted data, created from organized, structured, and processed data in a particular context.

Data Processing Cycle

 Input
 Input data is prepared in some convenient form for processing.
 Example: the input data can be recorded on any one of several types of storage media.
 Processing
 Input data is changed to produce data in a more useful form.
 Example: interest can be calculated on a bank deposit, or a summary of sales can be produced.
 Output
 The result of the preceding processing step is collected.
 Example: output data may be the payroll for employees.

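The three stages of the cycle can be sketched in a few lines of Python. This is only an illustration using the bank interest example: the deposit records, the 5% rate, and the function names are assumptions made for the sketch, not part of the lecture.

```python
# Hypothetical deposit records; names and amounts are illustrative only.
deposits = [
    {"customer": "Alice", "balance": 1000.0},
    {"customer": "Bob", "balance": 2500.0},
]

ANNUAL_RATE = 0.05  # assumed 5% simple annual interest


def read_input(records):
    """Input: put raw records into a convenient form, (name, balance) pairs."""
    return [(r["customer"], float(r["balance"])) for r in records]


def process(pairs, rate):
    """Processing: compute one year of simple interest per deposit."""
    return [(name, balance, balance * rate) for name, balance in pairs]


def write_output(rows):
    """Output: collect the results as printable report lines."""
    return [f"{name}: balance={balance:.2f}, interest={interest:.2f}"
            for name, balance, interest in rows]


report = write_output(process(read_input(deposits), ANNUAL_RATE))
for line in report:
    print(line)
```

Each function corresponds to one stage, so the output of one stage is the input of the next.
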
3. Data types and their representation
 Data types can be described from diverse perspectives.
 Data types from a computer programming perspective
 An attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
 Data types from a data analytics perspective
 Describe the structure of the data.

DT-Computer programming perspective
 Integers (int)
 Store whole numbers, mathematically known as integers.
 Examples: 7, 12, 999
 Booleans (bool)
 Represent values restricted to one of two values: true or false.
 Characters (char)
 Store a single character.
 Example: 97 (in ASCII, 97 is a lowercase 'a')
 Floating-point numbers (float)
 Store real numbers.
 Examples: 3.15, 9.06, 0.13
 Alphanumeric strings (string)
 Store a combination of characters and numbers.
 Examples: hello world, Alice, Bob123

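As a quick illustration, the same five types can be tried in Python. Note one assumption: Python has no separate char type, so a one-character string stands in for it here. The variable names are chosen for this sketch only.

```python
count = 999            # int: whole number
is_valid = True        # bool: one of two values, True or False
letter = "a"           # Python has no char type; a one-character str stands in
price = 3.15           # float: real number
name = "Bob123"        # str: letters and digits combined

# ord() and chr() convert between a character and its code point.
code = ord(letter)
print(code, chr(97))

# Show the type the interpreter assigned to each value.
for value in (count, is_valid, letter, price, name):
    print(type(value).__name__, repr(value))
```
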
DT-Data Analytics perspective

Structured, Unstructured, and Semi-structured Data

 Metadata is data about data.
 It provides additional information about a specific set of data.
 It is one of the most important elements for big data analysis and big data solutions.

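A small sketch may help separate the three structures and metadata. The CSV text, JSON text, free-text note, and the file name `example.csv` are all made-up values for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (illustrative data).
csv_text = "id,name,age\n1,Alice,30\n2,Bob,25\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys, but fields may vary between records.
json_text = '[{"id": 1, "name": "Alice", "tags": ["vip"]}, {"id": 2, "name": "Bob"}]'
records = json.loads(json_text)

# Unstructured: free text with no predefined fields.
note = "Alice called on Monday to ask about her order."

# Metadata: data about the data above, not the data itself.
metadata = {
    "source": "example.csv",       # hypothetical file name
    "row_count": len(rows),
    "columns": list(rows[0].keys()),
}

print(rows[0]["name"], records[0].get("tags"), records[1].get("tags"))
print(len(note.split()), metadata)
```

The second JSON record has no `tags` key, which a fixed-column table would not allow without an empty cell.
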
Activity
 Discuss data types from programming and analytics perspectives.
 Compare metadata with structured, unstructured and semi-structured data.
 Give at least one example each of structured, unstructured and semi-structured data types.

4. Data Value Chain
 Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
 Describes the process of data creation and use, from first identifying a need for data to its final use and possible reuse.

Data Acquisition
 The process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.

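Gathering, filtering, and cleaning can be sketched as three small functions. The two sources, the field names, and the list called `warehouse` are hypothetical; a real pipeline would read from actual systems and load into real storage.

```python
# Hypothetical raw sensor readings gathered from two sources.
source_a = [{"city": " Kigali ", "temp_c": "24.5"}, {"city": "Huye", "temp_c": ""}]
source_b = [{"city": "musanze", "temp_c": "18.0"}, {"city": "Kigali", "temp_c": "n/a"}]


def gather(*sources):
    """Gathering: merge records from all sources into one list."""
    return [record for source in sources for record in source]


def is_usable(record):
    """Filtering: keep only records whose temperature parses as a number."""
    try:
        float(record["temp_c"])
        return True
    except ValueError:
        return False


def clean(record):
    """Cleaning: normalize city names and convert the temperature to float."""
    return {"city": record["city"].strip().title(), "temp_c": float(record["temp_c"])}


# Only usable, cleaned records reach the (stand-in) warehouse.
warehouse = [clean(r) for r in gather(source_a, source_b) if is_usable(r)]
print(warehouse)
```
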
Data Analysis
 Making the raw data acquired amenable to use in decision-making as well as domain-specific usage.
 Involves exploring, transforming, and modeling data with the goal of highlighting relevant data, and synthesizing and extracting useful hidden information with high potential from a business point of view.
 Related areas include data mining, business intelligence, and machine learning.

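As a minimal sketch of exploring, transforming, and modeling, the example below fits a straight line to a tiny made-up dataset. The figures and variable names are assumptions, and ordinary least squares is just one simple modeling choice among many.

```python
from statistics import mean

# Hypothetical monthly figures: advertising spend and sales, both in thousands.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

# Exploring: summarize the raw values.
summary = {"mean_spend": mean(ad_spend), "mean_sales": mean(sales)}

# Transforming: center both series on their means.
x = [v - summary["mean_spend"] for v in ad_spend]
y = [v - summary["mean_sales"] for v in sales]

# Modeling: ordinary least-squares line, sales = slope * spend + intercept.
slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
intercept = summary["mean_sales"] - slope * summary["mean_spend"]

print(summary, slope, intercept)
```

The fitted slope is the kind of "hidden information" a business could act on: how much sales change per unit of spend in this toy data.
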
Data Curation
 Active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
 Processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
 Performed by expert curators who are responsible for improving the accessibility and quality of data.
 Data curators (also known as scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable and fit for their purpose.

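Two of the curation activities, validation and classification, can be illustrated with a short sketch. The required fields, the year range, and the sample records are hypothetical quality rules invented for this example.

```python
# Hypothetical quality rules; a real curation workflow would define its own.
REQUIRED_FIELDS = ("id", "title", "year")


def validate(record):
    """Return a list of quality problems found in one record (empty if none)."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS
                if record.get(field) in (None, "")]
    year = record.get("year")
    if isinstance(year, int) and not 1900 <= year <= 2100:
        problems.append("year out of range")
    return problems


def classify(records):
    """Split records into those fit for use and those needing a curator."""
    accepted, rejected = [], []
    for record in records:
        issues = validate(record)
        (rejected if issues else accepted).append((record.get("id"), issues))
    return accepted, rejected


records = [
    {"id": 1, "title": "Rainfall 2020", "year": 2020},
    {"id": 2, "title": "", "year": 2021},
    {"id": 3, "title": "Census", "year": 3021},
]
accepted, rejected = classify(records)
print(accepted, rejected)
```
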
Data Storage
 Persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.

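The idea of persisting data and serving fast lookups can be tried with Python's built-in `sqlite3` module. This is a single-machine stand-in and does not show scalability. The table, column names, and readings are assumptions for the sketch.

```python
import sqlite3

# In-memory database as a single-machine stand-in for a storage layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id TEXT, taken_at TEXT, value REAL)")

# Hypothetical readings; the sensor names and values are illustrative.
rows = [("s1", "2024-01-01", 20.5), ("s2", "2024-01-01", 19.0), ("s1", "2024-01-02", 21.0)]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# An index lets lookups by sensor avoid scanning the whole table.
conn.execute("CREATE INDEX idx_sensor ON readings (sensor_id)")
conn.commit()

s1_values = [value for (value,) in conn.execute(
    "SELECT value FROM readings WHERE sensor_id = ? ORDER BY taken_at", ("s1",))]
print(s1_values)
conn.close()
```
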
Data Usage
 Covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
 With data usage, decision-making in business can enhance competitiveness through:
 the reduction of costs,
 increased added value,
 or any other parameter that can be measured against existing performance criteria.

5. Basic concepts of big data

 Big data
 A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
 A “large dataset”
 A dataset too large to reasonably process or store with traditional tooling or on a single computer.

Big Data Characteristics

Five Vs

 Volume
 The size and amount of big data that companies manage and analyze.
 Value
 The most important “V” from the perspective of the business. The value of big data usually comes from insight discovery and pattern recognition that lead to more effective operations, stronger customer relationships and other clear and quantifiable business benefits.
 Variety
 The diversity and range of different data types, including unstructured data, semi-structured data and raw data.
 Velocity
 The high speed at which data accumulates.
 Veracity
 The “truth” or accuracy of data and information assets, which often determines executive-level confidence.

Clustered Computing and Hadoop Ecosystem

 Individual computers are often inadequate for handling the data at most stages.
 Solution
 Computer clusters
 Big data clustering software combines the resources of many smaller machines.

Benefits of Combining Small Computers

 Resource Pooling
 Combining the available storage space, CPU, and memory of the machines.
 High Availability
 Prevents hardware or software failures from affecting access to data and processing.
 Increasingly important as real-time analytics gains emphasis.
 Easy Scalability
 Easy to scale horizontally by adding additional machines to the group.
 Cluster membership and resource allocation can be handled by software like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).

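Resource pooling and horizontal scaling can be pictured with a toy model of a cluster. This sketch is plain Python and does not use YARN or any Hadoop API. The `Node` class, node names, and capacities are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One machine in a hypothetical cluster."""
    name: str
    storage_gb: int
    cpu_cores: int
    memory_gb: int


def pooled_resources(nodes):
    """Resource pooling: total capacity the cluster offers as a whole."""
    return {
        "storage_gb": sum(n.storage_gb for n in nodes),
        "cpu_cores": sum(n.cpu_cores for n in nodes),
        "memory_gb": sum(n.memory_gb for n in nodes),
    }


cluster = [Node("node-1", 500, 4, 16), Node("node-2", 500, 4, 16)]
before = pooled_resources(cluster)

# Horizontal scaling: add another machine instead of upgrading one.
cluster.append(Node("node-3", 1000, 8, 32))
after = pooled_resources(cluster)

print(before, after)
```
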
Hadoop and its Ecosystem
 Hadoop is an open-source framework intended to make interaction with big data easier.
 It allows for the distributed processing of large datasets across clusters of computers using simple programming models.
 Key characteristics of Hadoop
 Economical: Its systems are highly economical, as ordinary computers can be used for data processing.
 Reliable: It stores copies of the data on different machines and is resistant to hardware failure.
 Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in scaling up the framework.
 Flexible: You can store as much structured and unstructured data as you need and decide how to use it later.

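The reliability point, keeping copies of data on different machines, can be illustrated with a simplified simulation. It is not Hadoop code: the round-robin placement is an assumption made for brevity (real HDFS placement also considers racks), and the node and block names are invented. The replication factor of 3 matches the HDFS default.

```python
REPLICATION_FACTOR = 3  # HDFS uses 3 replicas by default; configurable there


def place_replicas(block_ids, nodes, replicas=REPLICATION_FACTOR):
    """Assign each block to `replicas` distinct nodes, round-robin.

    Assumes replicas <= len(nodes) so the chosen nodes are distinct.
    """
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replicas)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one copy on a healthy node."""
    return [block for block, holders in placement.items()
            if any(node not in failed_nodes for node in holders)]


nodes = ["n1", "n2", "n3", "n4"]
placement = place_replicas(["b0", "b1", "b2", "b3"], nodes)
print(placement)
print(readable_blocks(placement, failed_nodes={"n1", "n2"}))
```

With three copies per block, losing any two of the four nodes still leaves every block readable in this toy setup.
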
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
 The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files.
 Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage
 The data is stored and processed.
 The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
 Spark and MapReduce perform data processing.
3. Computing and analyzing data
 Data is analyzed by processing frameworks such as Pig and Hive.
4. Visualizing the results
 The analyzed data can be accessed by users.
 Hue and Cloudera Search are used for this step.

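To give a feel for the MapReduce style of processing named in step 2, here is a word count written in plain Python. It runs on one machine and uses no Hadoop or Spark API. The function names and input chunks are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input, already split into chunks as a cluster would split a file.
chunks = ["big data big clusters", "data science uses data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

On a real cluster each chunk would be mapped on a different machine, and the shuffle would move pairs across the network before reducing.
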
Activity
1. Which information flow step in the data value chain do you think is labor-intensive? Why?
2. What are the different data types and their value chain?
3. List and describe each technology or tool used in the big data life cycle.
4. Discuss the methods of computing over a large dataset.
5. Discuss the purpose of each of the Hadoop ecosystem components.
6. Why is data science a confluence of multiple disciplines? Which disciplines are those?

Practical Assignment
 Plate recognition system with Python (Day)
 Sentiment analysis using Python (Evening)
