Chap 2-Data Analysis
An Overview of Data Science
• Data science is a multi-disciplinary field.
• It uses scientific methods, processes, algorithms, and systems to
extract knowledge and insights from structured and unstructured
data.
• Data science is much more than simply analyzing data.
• It offers a range of roles and requires a range of skills.
• The cashier scans the barcode on your container, and the
cash register logs the price.
• If your purchase was one of the last boxes in the store, a computer tells
the stock manager that it is time to request another order from the
distributor.
• On the trip from your pencil (notebook) to the manager’s desk, the
data went through many transformations.
• Pieces of hardware such as the barcode scanner were involved in
collecting, manipulating and storing the data.
• Different pieces of software were used to organize, aggregate,
visualize, and present the data.
• People decided which systems to buy and install, and who should get
access to which kinds of data.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people or
machines to increase its usefulness and add value for a particular
purpose.
• It consists of three basic steps: input, processing, and output.
• These three steps constitute the data processing cycle.
A. Input: the input data is prepared in a convenient form for
processing. The form depends on the processing machine.
• For example, for electronic computers, input data can be
recorded on any one of several types of storage media,
such as a hard disk, CD, or flash disk.
B. Processing: the input data is changed to produce data in a more
useful form.
• For example, interest can be calculated on a bank deposit, or a
summary of sales for the month can be calculated from the
sales orders.
C. Output: the result of the processing step is collected.
• For example, output data may be the payroll for employees.
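The three steps can be sketched in a few lines of Python. This is only an illustration of the cycle using the interest and sales-summary examples above; the sample orders, the deposit, the 5% rate, and the function names are all hypothetical.

```python
# Input: data prepared in a convenient form (hypothetical sales orders).
sales_orders = [
    {"month": "2024-01", "amount": 120.0},
    {"month": "2024-01", "amount": 80.0},
    {"month": "2024-02", "amount": 50.0},
]
deposit = 1000.0     # hypothetical bank deposit
annual_rate = 0.05   # hypothetical 5% simple annual interest


def simple_interest(principal, rate, years=1):
    """Processing: compute simple interest on a deposit."""
    return principal * rate * years


def monthly_sales_summary(orders):
    """Processing: total the order amounts for each month."""
    totals = {}
    for order in orders:
        totals[order["month"]] = totals.get(order["month"], 0.0) + order["amount"]
    return totals


# Output: the results of the processing step are collected.
interest = simple_interest(deposit, annual_rate)
summary = monthly_sales_summary(sales_orders)
print(f"Interest: {interest:.2f}")
print(summary)
```

Here the lists and variables at the top play the role of input, the two functions are the processing step, and the printed values are the output.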
Data Types and their Representation
• Data types can be described from diverse perspectives.
• For instance, in computer programming, a data type is simply an
attribute of data that tells the compiler or interpreter how the
programmer intends to use the data.
• A data type defines the operations that can be done on the data,
though different languages may use different terminology.
• Common data types include:
• Integers (int): used to store whole numbers
• Floating-point numbers (float): used to store real numbers
• Characters (char): used to store a single character
• Booleans (bool): restricted to one of two values, true or false
• Alphanumeric strings (string): used to store a combination of characters and numbers
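As a rough illustration of how a data type determines the permitted operations, the Python sketch below assigns one value of each kind. The variable names and values are hypothetical, and Python has no separate char type, so a one-character string stands in for it.

```python
count = 42        # integer (int): whole number
price = 19.99     # floating-point number (float): real number
grade = "A"       # Python has no char type; a one-character str stands in
in_stock = True   # Boolean (bool): True or False
sku = "AB123"     # alphanumeric string (str)

# The type determines which operations are valid.
total = count * price   # numeric multiplication
repeated = grade * 3    # string repetition, not arithmetic

try:
    sku + count         # adding a str and an int is not defined
except TypeError as err:
    error_message = str(err)

print(type(count).__name__, type(price).__name__, type(in_stock).__name__)
print(repeated)
```

The same `*` operator multiplies numbers but repeats strings, and adding a string to an integer raises a `TypeError`.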
Data types from Data Analytics perspective
• From a data analytics point of view, there are three common data
types or structures:
A. Structured
B. Semi-structured
C. Unstructured data types
The figure below describes the three types of data and metadata.
A. Structured Data: is data that adheres to a pre-defined data
model and is therefore straightforward to analyze
• It conforms to a tabular format with a relationship between
the different rows and columns.
• Common examples of structured data are Excel files or
SQL databases. Each of these has structured rows and
columns that can be sorted.
B. Semi-structured Data: is a form of structured data that
• Does not conform to the formal structure of the data models
associated with relational databases.
• Contains tags or other markers to separate semantic
elements and enforce hierarchies of records and fields.
• Is also known as a self-describing structure.
• Examples: JSON and XML
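The difference between the two can be sketched with Python's standard `csv` and `json` modules. The product records below are made up for illustration; the point is that every CSV row shares the same columns, while the JSON records carry their own keys and may nest or differ.

```python
import csv
import io
import json

# Structured: every row has the same columns (hypothetical data).
csv_text = "id,name,price\n1,cereal,3.50\n2,milk,1.20\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
rows_by_price = sorted(rows, key=lambda r: float(r["price"]))

# Semi-structured: keys act as tags; records may nest and differ in fields.
json_text = """
[
  {"id": 1, "name": "cereal", "tags": ["breakfast"]},
  {"id": 2, "name": "milk", "supplier": {"name": "Dairy Co"}}
]
"""
records = json.loads(json_text)
field_sets = [sorted(record.keys()) for record in records]

print([r["name"] for r in rows_by_price])
print(field_sets)
```

The tabular rows can be sorted by a column directly, whereas the JSON records describe their own fields, which is why the format is called self-describing.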
C. Unstructured Data: is information that either does not have a
pre-defined data model or is not organized in a pre-defined manner.
• It is typically text-heavy but may also contain data such as dates,
numbers, and facts.
• This results in irregularities and ambiguities.
• Examples: audio and video files, or NoSQL databases
D. Metadata (Data about Data): provides additional information
about a specific set of data.
• It is frequently used by Big Data solutions for initial analysis.
• For example, in a set of photographs, metadata could describe
when and where the photos were taken.
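File-system attributes are a simple, everyday case of metadata. The sketch below uses a temporary text file rather than a photograph, purely so that it is self-contained, and reads the file's size, modification time, and extension with Python's standard library. The dictionary keys are hypothetical names chosen for illustration.

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a small file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello")
    path = handle.name

# The file's content is the data; the attributes below are data about it.
info = os.stat(path)
metadata = {
    "size_bytes": info.st_size,
    "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    "extension": os.path.splitext(path)[1],
}
os.remove(path)
print(metadata)
```

None of these values appear in the file's content, yet they allow a first analysis (how large, how recent, what kind) without opening the file.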
Data Value Chain
• The Data Value Chain is introduced to describe the information flow
within a big data system as a series of steps needed to generate value
and useful insights from data.
• The Big Data Value Chain identifies the following key high-level
activities:
Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it is
put in a data warehouse or any other storage solution on which data
analysis can be carried out.
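A minimal sketch of gathering, filtering, and cleaning is shown below. The sensor readings, the two sources, and the function names are hypothetical, and a plain Python list stands in for the data warehouse.

```python
# Hypothetical raw readings gathered from two sources.
source_a = [{"sensor": " T1 ", "value": "21.5"}, {"sensor": "T2", "value": ""}]
source_b = [{"sensor": "t3", "value": "19.0"}, {"sensor": "T4", "value": "n/a"}]


def gather(*sources):
    """Gathering: combine records from all sources into one list."""
    return [record for source in sources for record in source]


def is_usable(record):
    """Filtering: keep only records whose value parses as a number."""
    try:
        float(record["value"])
        return True
    except ValueError:
        return False


def clean(record):
    """Cleaning: normalize the sensor name and convert the value to float."""
    return {"sensor": record["sensor"].strip().upper(), "value": float(record["value"])}


# Only filtered and cleaned records reach the storage stand-in.
warehouse = [clean(r) for r in gather(source_a, source_b) if is_usable(r)]
print(warehouse)
```

Records with missing or non-numeric values are dropped before storage, so later analysis can assume a consistent format.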
Data Storage
• And the tools needed to integrate the data analysis within the business
activity.
Basic Concepts of Big Data
• Big data is a term for the non-traditional strategies and technologies
needed to gather, organize, process, and extract insights from large
datasets.
What Is Big Data?
• Big data is the term for a collection of large and complex data sets.
The figure below shows the characteristics of big data.
Clustered Computing and Hadoop Ecosystem
Clustered Computing
• Because of the quantities of big data, individual computers are often
inadequate for handling the data at most stages.
• Hadoop has an ecosystem that has evolved from its four core
components:
• Data management
• Access
• Processing
• Storage
• It is continuously growing to meet the needs of Big Data.
• It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming-based data processing
• Spark: In-Memory data processing
• HBase: NoSQL Database
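To give a feel for the MapReduce programming model listed above, here is a word count written in plain Python. It does not use Hadoop or any Hadoop API; the function names and sample documents are hypothetical, and everything runs in a single process, whereas a real cluster would distribute the map and reduce work across machines.

```python
from collections import defaultdict

documents = ["big data needs clusters", "big clusters process data"]


def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can in principle be spread over many machines.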
Hadoop and its Ecosystem…
The figure below shows the Hadoop ecosystem.