
Chapter 2

Data Science
Contents:

• Overview of Data Science
• Data Types and Their Representation
• Data Value Chain
• Basic Concepts of Big Data


Overview of Data Science

• Data science is a multi-disciplinary field that uses scientific methods,


processes, algorithms, and systems to extract knowledge and insights
from structured, semi-structured and unstructured data.

• Data science is much more than simply analyzing data.

• It offers a range of roles and requires a range of skills.


What are data and information?
• Data can be defined as a representation of facts, concepts, or instructions in a
formalized manner.
• It can be described as unprocessed facts and figures.
• It is represented with the help of:
• alphabets (A-Z, a-z),
• digits (0-9) or
• special characters (+, -, /, *, <,>, =, etc.)
• Information is the processed data on which decisions and actions are based.
• It is interpreted data, created from organized, structured, and processed data
in a particular context.
Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by people
or machines to increase its usefulness and add value for a particular
purpose.
• Data processing consists of the following basic steps - input,
processing, and output.
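As a minimal sketch of this cycle (the exam scores are hypothetical example values, not from the slides), the Python snippet below takes raw input data, processes it, and outputs information that a decision could be based on:

# A minimal sketch of the data processing cycle: input -> processing -> output.
# The exam scores below are hypothetical example values.

# Input: unprocessed facts and figures (raw data)
raw_scores = [78, 85, 92, 64, 88]

# Processing: re-structure the data to add value for a particular purpose
average = sum(raw_scores) / len(raw_scores)
passed = [score for score in raw_scores if score >= 50]

# Output: information on which decisions and actions can be based
print(f"Average score: {average:.1f}")
print(f"Students who passed: {len(passed)} out of {len(raw_scores)}")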

[Figure: the Data Processing Cycle, showing input, processing, and output]

Data types and their representation

1. Data types from Computer programming perspective


• Integers (int): used to store whole numbers, mathematically known as integers
• Booleans (bool): used to store a value restricted to one of two values:
true or false
• Characters (char): used to store a single character
• Floating-point numbers (float): used to store real numbers
• Alphanumeric strings (string): used to store a combination of
characters and numbers
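As a quick, hedged sketch, the Python snippet below shows rough equivalents of these types (Python has no separate char type, so a one-character string stands in for it; all values are made-up examples):

# Common programming data types, illustrated in Python.
count = 42                  # integer (int): whole numbers
is_valid = True             # boolean (bool): restricted to True or False
grade = "A"                 # character (char): here a one-character string
temperature = 36.6          # floating-point number (float): real numbers
student_id = "ETS0123/12"   # alphanumeric string: characters and numbers combined

for value in (count, is_valid, grade, temperature, student_id):
    print(value, type(value).__name__)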
2. Data types from Data Analytics perspective

Structured Data
• Has a pre-defined data model
• Straightforward to analyze
• Placed in tabular format
• Example: Excel files or SQL databases

Unstructured Data
• Has no pre-defined data model
• May contain data such as dates, numbers, and facts
• Difficult to understand using traditional programs
• Example: audio and video files

Semi-structured Data
• Contains tags or other markers to separate semantic elements
• Known as a self-describing structure
• Example: JSON and XML
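For instance, a small JSON record (a hypothetical student record, for illustration) is semi-structured: its keys act as the tags or markers that describe each value, which is what makes the structure self-describing. A minimal Python sketch:

import json

# A hypothetical semi-structured record: the keys are the "tags" that
# separate and describe the semantic elements (self-describing structure).
record = '''
{
    "name": "Abebe",
    "age": 21,
    "courses": ["Data Science", "Emerging Technologies"]
}
'''

student = json.loads(record)             # parse the JSON text
print(student["name"], student["age"])   # access elements by their tags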
Metadata
• It is not a separate data structure, but it is one of the most important
elements for Big Data analysis and Big Data solutions.
• It is often described as data about data.
• In a set of photographs, for example, metadata could describe when and
where the photos were taken.
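A hedged sketch of what such photo metadata might look like (the field names are illustrative, loosely modelled on EXIF-style tags, and the values are invented):

# Data: the photo file itself (its pixels).
# Metadata: data about that data, e.g. when and where the photo was taken.
photo_metadata = {
    "file_name": "IMG_0042.jpg",         # illustrative file name
    "date_taken": "2023-05-14 09:30",
    "gps_location": (9.0054, 38.7636),   # illustrative coordinates
    "camera_model": "Example Cam X",
    "resolution": "4032x3024",
}

for key, value in photo_metadata.items():
    print(f"{key}: {value}")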
Data Value Chain

• The Data Value Chain is introduced to describe the information flow
within a big data system.
• It describes the full data lifecycle, from collection to analysis and usage.
• The Big Data Value Chain identifies the following key high-level
activities: data acquisition, data analysis, data curation, data storage,
and data usage.
Basic concepts of big data

• Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.

• According to IBM, big data is characterized by the 3Vs and more:

• Volume (amount of data): dealing with large scales of data within


data processing (e.g. Global Supply Chains, Global Financial
Analysis, Large Hadron Collider).
• Velocity (speed of data): dealing with streams of high frequency of
incoming real-time data (e.g. Sensors, Pervasive Environments, Electronic
Trading, Internet of Things).

• Variety (range of data types/sources): dealing with data using differing


syntactic formats (e.g. Spreadsheets, XML, DBMS), schemas, and meanings
(e.g. Enterprise Data Integration).

• Veracity (trustworthiness of data): can we trust the data? How accurate is it?


Clustered Computing and Hadoop Ecosystem

Clustered Computing

• Cluster computing refers to connecting many computers on a network so that they
perform like a single entity.

• Because of the qualities of big data, individual computers are often inadequate for handling
the data at most stages.

• To better address the high storage and computational needs of big data, computer clusters are
a better fit.

• Big data clustering software combines the resources of many smaller machines, seeking to
provide a number of benefits:
For example, suppose you have a big file containing more than 500 MB of data and you need to
count the number of words in it, but your computer has only 100 MB of memory. How can you handle it?
(A sketch follows the list below.)
• Resource Pooling: Combining the available storage space to hold data
is a clear benefit, but CPU and memory pooling are also extremely
important.
• Processing large datasets requires large amounts of all three of these
resources.
• High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
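One way to approach the 500 MB word-count question above on a single machine is to stream the file in small pieces instead of loading it all at once; a cluster takes the same idea further by splitting the file across machines and summing the partial counts. A minimal Python sketch (the file name is hypothetical):

# Count words in a file larger than available memory by reading it
# line by line (streaming) rather than loading the whole file at once.
def count_words(path, encoding="utf-8"):
    total = 0
    with open(path, "r", encoding=encoding) as f:
        for line in f:                 # only one line is held in memory at a time
            total += len(line.split())
    return total

if __name__ == "__main__":
    # "bigfile.txt" is a hypothetical example file
    print("Word count:", count_words("bigfile.txt"))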
• Cluster membership and resource allocation can be handled by software
like Hadoop’s YARN (which stands for Yet Another Resource Negotiator).

• Hadoop is an open-source framework intended to make interaction with big


data easier.

• It is a framework that allows for the distributed processing of large datasets


across clusters of computers using simple programming models.
The four key characteristics of Hadoop are:

• Economical: Its systems are highly economical as ordinary computers can be used for data
processing.

• Reliable: It is reliable as it stores copies of the data on different machines and is resistant
to hardware failure.

• Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes help in
scaling up the framework.
• Flexible: It is flexible, so you can store as much structured and unstructured data as you
need and decide how to use it later.
• Hadoop has an ecosystem that has evolved from its four core
components: data management, data access, data processing, and data
storage.

Hadoop ecosystem
Big Data Life Cycle with Hadoop
1. Ingesting data into the system

• The first stage of Big Data processing is Ingest.

• The data is ingested or transferred to Hadoop from various sources


such as relational databases, systems, or local files.

• Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers


event data.
2. Processing the data in storage

• The second stage is Processing. In this stage, the data is stored and
processed.

• The data is stored in the distributed file system (HDFS) and in the
NoSQL distributed database (HBase); Spark and MapReduce perform the data
processing.
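As a hedged, single-process sketch of the MapReduce idea behind this stage (not tied to any particular cluster setup), the snippet below implements the classic word count as a map step that emits (word, 1) pairs and a reduce step that sums them; on a real Hadoop cluster the same two functions would run in parallel over HDFS blocks:

from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each word after grouping by key."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return counts

if __name__ == "__main__":
    sample = ["big data needs big clusters", "data is processed in parallel"]  # hypothetical input
    for word, count in sorted(reduce_phase(map_phase(sample)).items()):
        print(word, count)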
3. Computing and analyzing data

• The third stage is to Analyze. Here, the data is analyzed by processing


frameworks such as Pig, Hive, and Impala.

• Pig converts the data using map and reduce operations and then analyzes it.

• Hive is also based on map and reduce programming and is most
suitable for structured data.
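For illustration only, a hedged sketch of querying structured data in Hive from Python using the third-party PyHive library (the host, database, table, and column names are assumptions, and a reachable HiveServer2 is required):

from pyhive import hive   # third-party package: pip install pyhive

# Hypothetical connection details; a running HiveServer2 is assumed.
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# Hive turns this SQL-like query into map and reduce jobs over data stored in HDFS.
cursor.execute("SELECT department, COUNT(*) FROM employees GROUP BY department")
for department, count in cursor.fetchall():
    print(department, count)

cursor.close()
conn.close()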
4. Visualizing the results
• The fourth stage is Access, which is performed by tools such as Hue
and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
Thank you!!!
