2 Data Science
Chap 2. DATA SCIENCE
1. An Overview of Data Science
2. Data vs Information
Data
A representation of facts, concepts, or instructions in a formalized manner,
suitable for communication, interpretation, or processing by humans or
electronic machines.
Unprocessed facts and figures.
Input
Input data is prepared in some convenient form for processing.
Example: the input data can be recorded on any one of
several types of storage media.
Processing
Input data is changed to produce data in a more useful form.
Example: interest can be calculated on a bank deposit, or
sales can be summarized.
Output
The result of the preceding processing step is collected.
Example: output data may be the payroll for employees.
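The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the account names, deposit amounts, and the 5% rate are made-up values, and `read_input`, `process`, and `write_output` are hypothetical helper names, not part of any real system.

```python
# Minimal sketch of the input -> processing -> output cycle.
# All names and numbers here are illustrative assumptions.

def read_input():
    """Input: data prepared in a convenient form (here, a list of dicts)."""
    return [
        {"account": "A-001", "deposit": 1000.0},
        {"account": "A-002", "deposit": 250.0},
    ]

def process(records, annual_rate=0.05):
    """Processing: turn raw deposits into a more useful form (interest due)."""
    return [
        {**r, "interest": round(r["deposit"] * annual_rate, 2)}
        for r in records
    ]

def write_output(results):
    """Output: collect the results of the processing step as report lines."""
    return [f"{r['account']}: interest {r['interest']:.2f}" for r in results]

report = write_output(process(read_input()))
for line in report:
    print(line)
```

Each function maps to one stage, so swapping the input source or the output format would not affect the processing step.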
3. Data types and their representation
Data types can be described from diverse perspectives.
Data types from a computer programming perspective
An attribute of data that tells the compiler or interpreter how the
programmer intends to use the data.
Data types: computer programming perspective
Integers (int)
Store whole numbers, mathematically known as integers.
7, 12, 999
Booleans (bool)
Represent values restricted to one of two options: true or false.
Characters (char)
Store a single character.
97 (in ASCII, 97 is a lowercase 'a')
Floating-point numbers (float)
Store real numbers.
3.15, 9.06, 0.13
Alphanumeric strings (string)
Store a combination of characters and numbers.
hello world, Alice, Bob123
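As a rough illustration, the types listed above map onto Python as follows. One assumption to note: Python has no separate char type, so a single character is represented here as a one-character string. The variable names and values are illustrative only.

```python
# Python counterparts of the data types listed above (illustrative values).
count = 12            # int: whole number
is_valid = True       # bool: one of two values, True or False
letter = chr(97)      # Python has no char type; this is a 1-character str
price = 3.15          # float: real number
name = "Bob123"       # str: combination of characters and digits

for value in (count, is_valid, letter, price, name):
    print(type(value).__name__, repr(value))
```

`chr(97)` returns the character for code 97, which matches the ASCII example on the slide.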
Data types: data analytics perspective
Structured, unstructured, and semi-structured data
Activity
Discuss data types from programming and analytics
perspectives.
Compare metadata with structured, unstructured,
and semi-structured data.
Give at least one example each of structured,
unstructured, and semi-structured data types.
4. Data Value Chain
Describes the information flow within a big data system as
a series of steps needed to generate value and useful
insights from data.
Describes the process of data creation and use from first
identifying a need for data to its final use and possible
reuse.
Data Acquisition
Process of gathering, filtering, and cleaning data
before it is put in a data warehouse or any other
storage solution on which data analysis can be
carried out.
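The gather, filter, and clean steps can be sketched as a small pipeline. This is a toy example under stated assumptions: the two sources, the field names `city` and `temp_c`, and the in-memory `warehouse` list are all hypothetical stand-ins for real feeds and a real storage solution.

```python
# Sketch of data acquisition: gather -> filter -> clean, before storage.
# The sources and field names are hypothetical.

def gather():
    """Gather raw records from two pretend sources."""
    sensor_feed = [{"city": " Addis Ababa ", "temp_c": "21.5"},
                   {"city": "Adama", "temp_c": ""}]
    manual_entry = [{"city": "bahir dar", "temp_c": "24"}]
    return sensor_feed + manual_entry

def keep(record):
    """Filter: drop records that lack a temperature reading."""
    return record["temp_c"].strip() != ""

def clean(record):
    """Clean: trim whitespace, normalize case, convert text to a number."""
    return {"city": record["city"].strip().title(),
            "temp_c": float(record["temp_c"])}

# Only filtered and cleaned records reach the (pretend) warehouse.
warehouse = [clean(r) for r in gather() if keep(r)]
print(warehouse)
```

Filtering before cleaning means the numeric conversion never sees an empty reading.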
Data Analysis
Making the raw data acquired amenable to use in decision-making
as well as domain-specific usage.
Involves exploring, transforming, and modeling data with the goal
of highlighting relevant data, synthesizing and extracting useful
hidden information with high potential from a business point of view.
Related areas include data mining, business intelligence, and
machine learning.
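Exploring, transforming, and modeling can be shown on a tiny data set. The sketch below assumes made-up numbers (advertising spend versus sales, in arbitrary units) and uses an ordinary least-squares line as the model; the slides do not prescribe any particular modeling technique, so this choice is purely illustrative.

```python
# Sketch of exploring, transforming, and modeling a tiny data set.
# The numbers are made up: advertising spend vs. sales, arbitrary units.
from statistics import mean

spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

# Explore: basic summary statistics.
mx, my = mean(spend), mean(sales)
print("mean spend:", mx, "mean sales:", my)

# Transform: center both variables on their means.
dx = [x - mx for x in spend]
dy = [y - my for y in sales]

# Model: ordinary least-squares line, sales = slope * spend + intercept.
slope = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
intercept = my - slope * mx
print("slope:", slope, "intercept:", intercept)
```

On this data the fitted line is sales = 2 * spend + 1, which is the kind of hidden relationship the analysis step aims to surface.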
Data Curation
Active management of data over its life cycle to
ensure it meets the necessary data quality
requirements for its effective usage.
Processes can be categorized into different
activities such as content creation, selection,
classification, transformation, validation, and
preservation.
Performed by expert curators who are responsible
for improving the accessibility and quality of data.
Data curators (also known as scientific curators or
data annotators) are responsible for ensuring that
data are trustworthy, discoverable, accessible,
reusable, and fit for their purpose.
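Of the curation activities listed above, validation is the easiest to show in code. The sketch below is a minimal example under stated assumptions: the quality rules (a non-empty `id`, an integer `age` between 0 and 120) and the record layout are hypothetical, chosen only to show what a quality check might look like.

```python
# Sketch of one curation activity: validating records against quality rules.
# The rules and field names are hypothetical.

def validate(record):
    """Return a list of quality problems found in one record."""
    problems = []
    if not record.get("id"):
        problems.append("missing id")
    age = record.get("age")
    if not isinstance(age, int) or not 0 <= age <= 120:
        problems.append("age out of range")
    return problems

records = [
    {"id": "p1", "age": 34},
    {"id": "", "age": 34},
    {"id": "p3", "age": 150},
]
report = {index: validate(r) for index, r in enumerate(records)}
print(report)
```

Records with an empty problem list meet the rules; the others would go back to a curator for correction.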
Data Storage
Persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
Data Usage
Covers the data-driven business activities that need access to data,
its analysis, and the tools needed to integrate the data analysis
within the business activity.
With data usage, decision-making in business can enhance
competitiveness through the reduction of costs, increased added value,
or any other parameter that can be measured against existing
performance criteria.
5. Basic concepts of big data
Big data
A collection of data sets too large to reasonably
process or store with traditional tooling or on a single computer.
Big Data Characteristics
Five Vs
Volume
The size and amount of big data that companies manage and
analyze.
Value
The most important “V” from a business perspective. The value of big
data usually comes from insight discovery and pattern recognition that
lead to more effective operations, stronger customer relationships, and
other clear and quantifiable business benefits.
Variety
The diversity and range of different data types, including unstructured
data, semi-structured data, and raw data.
Velocity
The high speed at which data accumulates.
Veracity
The “truth” or accuracy of data and information assets, which often
determines executive-level confidence.
Clustered Computing and Hadoop Ecosystem
Individual computers are often inadequate for handling the data
at most stages.
Solution: computer clusters.
Benefits of Combining Small Computers
Resource Pooling
Combining the available storage space, CPU, and memory.
High Availability
Prevents hardware or software failures from affecting
access to data and processing, which matters more as
real-time analytics grows in importance.
Easy Scalability
Easy to scale horizontally by adding additional
machines to the group.
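Resource pooling and horizontal scaling can be pictured with a toy model of a cluster. The sketch assumes made-up machine sizes, and `Node` and `pooled` are hypothetical names; it models only the arithmetic of pooling, not how a real cluster manager schedules work.

```python
# Sketch of resource pooling and horizontal scaling in a cluster.
# Node sizes are made-up numbers; Node and pooled are hypothetical names.
from dataclasses import dataclass

@dataclass
class Node:
    storage_gb: int
    cpu_cores: int
    memory_gb: int

def pooled(cluster):
    """Resource pooling: the cluster offers the sum of its nodes' resources."""
    return Node(
        storage_gb=sum(n.storage_gb for n in cluster),
        cpu_cores=sum(n.cpu_cores for n in cluster),
        memory_gb=sum(n.memory_gb for n in cluster),
    )

cluster = [Node(500, 4, 16), Node(500, 4, 16)]
before = pooled(cluster)

# Easy scalability: scale horizontally by adding one more machine.
cluster.append(Node(1000, 8, 32))
after = pooled(cluster)
print(before)
print(after)
```

Adding the third machine doubles the pooled capacity without replacing either of the existing machines.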
Big Data Life Cycle with Hadoop
1. Ingesting data into the system
The data is ingested or transferred to Hadoop from various
sources such as relational databases, systems, or local files.
Sqoop transfers data from RDBMS to HDFS, whereas Flume
transfers event data.
2. Processing the data in storage
The data is stored and processed.
The data is stored in the distributed file system, HDFS, and the
NoSQL distributed database, HBase.
Spark and MapReduce perform data processing.
3. Computing and analyzing data
Data is analyzed by processing frameworks such as Pig and Hive.
4. Visualizing the results
The analyzed data can be accessed by users.
Hue and Cloudera Search are used for this stage.
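Step 2 names MapReduce as one of the processing engines. The idea behind it can be sketched in plain Python as a word count. This is not Hadoop's API: `map_phase`, `shuffle`, and `reduce_phase` are hypothetical helper names, the input lines are made up, and everything runs in a single process rather than across a cluster.

```python
# Sketch of the MapReduce idea in plain Python (not Hadoop's API).
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big clusters", "data flows into HDFS"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

In a real cluster, each map call could run on the machine that holds its slice of the input, which is why the map step takes one line at a time and shares no state.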
Activity
1. Which information flow step in the data value
chain do you think is labor-intensive? Why?
2. What are the different data types and their value
chain?
3. List and describe each technology or tool used in
the big data life cycle.
4. Discuss the methods of computing over a large
dataset.
5. Discuss the purpose of each Hadoop ecosystem
component.
6. Why is data science a confluence of multiple
disciplines? Which disciplines are they?
Practical Assignment
Plate recognition system with Python (Day)
Sentiment analysis using Python (Evening)