This document provides an overview of data science and key concepts related to big data. It discusses how data science uses scientific methods to extract knowledge from various types of data. It also defines key terms like data, information, and different data types. Additionally, it covers the data value chain, characteristics of big data, Hadoop ecosystem, and the basic big data life cycle.


Chap 2. DATA SCIENCE

Lecturer: Dr Djuma SUMBIRI
1. An Overview of Data Science

 Data science is a multi-disciplinary field.
 It uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
2. Data vs Information

 Data
   A representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
   Unprocessed facts and figures.
   Represented with characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
 Information
   The processed data on which decisions and actions are based.
   Interpreted data; created from organized, structured, and processed data in a particular context.
Data Processing Cycle

 Input
   Input data is prepared in some convenient form for processing.
   Example: the input data can be recorded on any one of several types of storage media.
 Processing
   Input data is transformed to produce data in a more useful form.
   Example: interest can be calculated on a bank deposit, or a summary of sales produced (see the sketch below).
 Output
   The result of the preceding processing step is collected.
   Example: the output data may be payroll for employees.
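A minimal Python sketch of one pass through the cycle, using the interest example; the deposit amount and rate are invented for illustration:

    # Input: a deposit read from some storage medium (figures are hypothetical).
    deposit = 10_000.00   # principal held in the account
    annual_rate = 0.05    # assumed 5% yearly interest

    # Processing: transform the input into a more useful form.
    interest = deposit * annual_rate

    # Output: collect and present the result of the processing step.
    print(f"Interest earned this year: {interest:.2f}")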
3. Data types and their representation

 Data types can be described from diverse perspectives.
 From a computer programming perspective:
   An attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
 From a data analytics perspective:
   The structure of the data.
DT-Computer programming perspective

 Integers (int)
   Store whole numbers, mathematically known as integers.
   Examples: 7, 12, 999
 Booleans (bool)
   Represent values restricted to one of two options: true or false.
 Characters (char)
   Store a single character.
   Example: 97 (in ASCII, 97 is a lowercase 'a')
 Floating-point numbers (float)
   Store real numbers.
   Examples: 3.15, 9.06, 00.13
 Alphanumeric strings (string)
   Store a combination of characters and numbers.
   Examples: hello world, Alice, Bob123
 (A Python sketch of these types follows below.)
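A minimal sketch of these types in Python; the variable names are invented, and note that Python has no separate char type:

    count = 42              # int: a whole number
    is_valid = True         # bool: one of two values, True or False
    letter = chr(97)        # closest to char: chr(97) yields 'a' (ASCII 97)
    price = 3.15            # float: a real number
    greeting = "Bob123"     # string: characters and numbers combined

    print(type(count).__name__, type(is_valid).__name__,
          type(letter).__name__, type(price).__name__, type(greeting).__name__)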
DT-Data Analytics perspective

 From the analytics perspective, data types describe the structure of the data: structured, semi-structured, or unstructured, as detailed below.
Structured-Unstructured-Semi structured

 Structured data fits a fixed schema (e.g., relational tables); semi-structured data carries organizational tags without a rigid schema (e.g., JSON, XML); unstructured data has no predefined model (e.g., free text, images).
 Metadata is data about data.
   It provides additional information about a specific set of data.
   It is among the most important elements for big data analysis and big data solutions.
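As a sketch, consider a hypothetical product record in JSON, a common semi-structured format, with a metadata block describing the record itself; all field names here are invented for illustration:

    import json

    # Hypothetical semi-structured record: tagged fields, no rigid schema.
    record = {
        "product": "laptop",
        "price": 950.00,
        "reviews": ["fast", "great battery"],   # nested values resist flat tables
        "_metadata": {                          # metadata: data about the data
            "source": "web_scrape",
            "collected_at": "2023-05-01T10:00:00Z",
            "format": "json",
        },
    }

    print(json.dumps(record, indent=2))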
Activity

 Discuss data types from programming and analytics perspectives.
 Compare metadata with structured, unstructured, and semi-structured data.
 Give at least one example each of structured, unstructured, and semi-structured data types.
4. Data value Chain

 Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
 Describes the process of data creation and use, from first identifying a need for data to its final use and possible reuse.
Data Acquisition

 The process of gathering, filtering, and cleaning data before it is put into a data warehouse or any other storage solution on which data analysis can be carried out (see the sketch below).
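A minimal acquisition sketch with pandas; the file name sales.csv and the column names are assumptions for illustration:

    import pandas as pd

    # Gather: load raw records from a hypothetical CSV source.
    raw = pd.read_csv("sales.csv")

    # Filter: keep only the rows relevant to downstream analysis.
    sales = raw[raw["amount"] > 0]

    # Clean: remove duplicates and rows with missing key fields.
    sales = sales.drop_duplicates().dropna(subset=["customer_id", "amount"])

    # Hand off to the storage layer (here simply another file).
    sales.to_csv("clean_sales.csv", index=False)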
Data Analysis

 Making the acquired raw data amenable to use in decision-making as well as domain-specific usage.
 Involves exploring, transforming, and modeling data with the goal of highlighting relevant data and synthesizing and extracting useful hidden information with high potential from a business point of view (see the sketch below).
 Related areas include data mining, business intelligence, and machine learning.
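Continuing the hypothetical sales example from the acquisition step, a minimal explore-and-summarize sketch (column names remain assumptions):

    import pandas as pd

    sales = pd.read_csv("clean_sales.csv")

    # Explore: summary statistics of the numeric columns.
    print(sales.describe())

    # Transform/model: revenue per region, a simple pattern worth surfacing.
    revenue_by_region = (
        sales.groupby("region")["amount"]
             .sum()
             .sort_values(ascending=False)
    )
    print(revenue_by_region.head())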
Data Curation

 The active management of data over its life cycle, to ensure it meets the necessary data quality requirements for its effective usage.
 Curation processes can be categorized into activities such as content creation, selection, classification, transformation, validation, and preservation (a validation sketch follows below).
 Performed by expert curators who are responsible for improving the accessibility and quality of data.
 Data curators (also known as scientific curators or data annotators) are responsible for ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for purpose.
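As one illustration of the validation activity, a minimal record check; the required fields and rules are assumptions, not a standard:

    # Minimal validation sketch: one curation activity among many.
    REQUIRED_FIELDS = {"customer_id", "amount", "region"}

    def validate(record):
        """Return a list of quality problems found in a single record."""
        problems = []
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            problems.append("missing fields: " + ", ".join(sorted(missing)))
        if "amount" in record and record["amount"] < 0:
            problems.append("amount must be non-negative")
        return problems

    print(validate({"customer_id": 1, "amount": -5}))
    # -> ['missing fields: region', 'amount must be non-negative']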
Data Storage

 The persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data.
Data Usage

 Covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
 With data usage, decision-making in business can enhance competitiveness:
   through the reduction of costs,
   increased added value,
   or any other parameter that can be measured against existing performance criteria.
5. Basic concepts of big data

 Big data
   A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
 A “large dataset”
   A dataset too large to reasonably process or store with traditional tooling or on a single computer.
Big Data Characteristics: the Five Vs

 Volume
   The size and amounts of big data that companies manage and analyze.
 Value
   The most important “V” from the perspective of the business: the value of big data usually comes from insight discovery and pattern recognition that lead to more effective operations, stronger customer relationships, and other clear and quantifiable business benefits.
 Variety
   The diversity and range of different data types, including unstructured data, semi-structured data, and raw data.
 Velocity
   The high speed at which data accumulates.
 Veracity
   The “truth” or accuracy of data and information assets, which often determines executive-level confidence.
Clustered Computing and Hadoop Ecosystem

 Individual computers are often inadequate for handling the data at most stages.
 Solution: computer clusters.
 Big data clustering software combines the resources of many smaller machines.
Benefits of Combining Small computers

 Resource Pooling
   Combining the available storage space, CPU, and memory.
 High Availability
   Clusters can provide the availability that real-time analytics demands, preventing hardware or software failures from affecting access to data and processing.
 Easy Scalability
   Easy to scale horizontally by adding additional machines to the group.
 Cluster membership and resource allocation can be handled by software like Hadoop's YARN (which stands for Yet Another Resource Negotiator).
Hadoop and its Ecosystem

 Hadoop is an open-source framework intended to make interaction with big data easier.
 It allows for the distributed processing of large datasets across clusters of computers using simple programming models (see the word-count sketch below).
 Key characteristics of Hadoop
   Economical: Its systems are highly economical, as ordinary computers can be used for data processing.
   Reliable: It stores copies of the data on different machines and is resistant to hardware failure.
   Scalable: It is easily scalable, both horizontally and vertically; a few extra nodes help in scaling up the framework.
   Flexible: You can store as much structured and unstructured data as you need and decide how to use it later.
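To make "simple programming models" concrete, here is the classic word count written for Hadoop Streaming, which runs ordinary scripts as the map and reduce steps. This is a minimal sketch, not a production job:

    #!/usr/bin/env python3
    # mapper.py -- emit "word<TAB>1" for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sum the counts per word; Hadoop Streaming delivers
    # the mapper output sorted by key, so equal words arrive together.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

Locally, the pipeline cat input.txt | ./mapper.py | sort | ./reducer.py mimics the shuffle-and-sort phase for a quick test before submitting the job to a cluster.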
Big Data Life Cycle with Hadoop

1. Ingesting data into the system
   The data is ingested or transferred into Hadoop from various sources such as relational databases, systems, or local files.
   Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
2. Processing the data in storage
   The data is stored and processed.
   The data is stored in the distributed file system HDFS and in the NoSQL distributed database HBase.
   Spark and MapReduce perform data processing (see the sketch below).
3. Computing and analyzing data
   Data is analyzed by processing frameworks such as Pig and Hive.
4. Visualizing the results
   The analyzed data can be accessed by users.
   Hue and Cloudera Search are used.
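A minimal PySpark sketch of the processing and analysis steps; the HDFS path and column names are assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("sales-summary").getOrCreate()

    # Read ingested records from a hypothetical HDFS location.
    sales = spark.read.csv("hdfs:///data/sales", header=True, inferSchema=True)

    # Distributed aggregation: total amount per region.
    summary = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))

    summary.show()
    spark.stop()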
Activity

1. Which information-flow step in the data value chain do you think is the most labor-intensive? Why?
2. What are the different data types and their value chain?
3. List and describe each technology or tool used in the big data life cycle.
4. Discuss the methods of computing over a large dataset.
5. Discuss the purpose of each Hadoop ecosystem component.
6. Why is data science a confluence of multiple disciplines? Which are those?
Practical Assignment

 Plate recognition system with Python (Day)
 Sentiment analysis using Python (Evening)
