Emerging CH2

Chapter Two discusses the fundamental concepts of data science, including the distinction between data and information, the role of data scientists, and the importance of algorithms in data processing. It outlines the data processing cycle, various data types, and the data value chain, emphasizing the significance of big data and clustered computing. Additionally, it introduces Hadoop and its ecosystem as a framework for managing and processing large datasets.

CHAPTER TWO

Prepared by: Marta G.(MSc.)


Contents
Data vs. Information
What is data science?
Data scientist
Algorithm
Data processing cycle
Data types
Data value chain
Big data
Clustered computing
Data vs. Information
• Data
• Can be defined as a representation of facts, concepts, or instructions in a
formalized manner, suitable for communication, interpretation, or processing
by humans or electronic machines.
• It can be described as unprocessed facts and figures.
• It is represented with the help of characters such as alphabets (A-Z, a-z), digits
(0-9) or special characters (+, -, /, *, <, >, =, etc.).
• Information
• The processed data on which decisions and actions are based
• Information is interpreted data, created from organized, structured, and processed
data in a particular context
• What is knowledge and wisdom?
Data science
➢ Data science is now one of the most influential and widely discussed fields.
➢ Data science is the study of data to extract meaningful insights for business.
➢ Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and insights from
structured, semi-structured and unstructured data.
➢ Example: the data generated when buying a box of cereal from a supermarket,
from the store shelf to the checkout.
Cont…

Data science can also be defined as the extraction of actionable knowledge
directly from data through a process of discovery, hypothesis formulation,
and hypothesis testing.
Data scientist
A data scientist (a job title) is a person who
engages in systematic activity to acquire
knowledge from data.
In a more restricted sense, a data scientist
may refer to an individual who uses the
scientific method on existing data.
Data scientists possess a strong quantitative
background in statistics and linear algebra, along
with programming knowledge focused on data
warehousing, data mining, and data modeling, which
they use to build and analyze algorithms.
Algorithms
An algorithm is a set of instructions designed to perform a specific task.
It refers to a sequence of finite steps to solve a particular problem
This can be a simple process, such as multiplying two numbers
An algorithm:
Is an unambiguous description that makes clear what has to be
implemented.
Expects a defined set of inputs.
Produces a defined set of outputs.
Is guaranteed to terminate and produce a result
Most algorithms are guaranteed to produce the correct result.
If the algorithm has preconditions (requirements), they must be met.
Search engines use proprietary algorithms to display the most relevant results
from their search index for specific queries.
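To make these properties concrete, here is a minimal Python sketch (not from the slides) of a simple algorithm with a defined input, a defined output, a precondition, and guaranteed termination; the function name and example values are chosen purely for illustration.

```python
def multiply(a: int, b: int) -> int:
    """Multiply two integers by repeated addition.

    Precondition: b must be non-negative.
    Input:  two integers, a and b.
    Output: the product a * b.
    The loop runs exactly b times, so the algorithm always terminates.
    """
    if b < 0:
        raise ValueError("precondition violated: b must be non-negative")
    result = 0
    for _ in range(b):  # a finite, unambiguous sequence of steps
        result += a
    return result


print(multiply(6, 7))  # 42
```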
Data Processing Cycle
• Data processing is the conversion of raw data to meaningful information
through a process.
• The data processing cycle, as the term suggests, is a sequence of steps or operations
for processing data, i.e., turning raw data into a usable form.
• The stages of data processing include activities like data entry/input,
calculation/processing, output, and storage.
Data processing cycle Cont…

• Input
• It is the task where verified data is coded or converted into machine readable
form so that it can be processed through a computer.
• Data entry is done through the use of a keyboard, digitizer, scanner, or data
entry from an existing source.
• Processing
• Once the input is provided the raw data is processed by a suitable or selected
processing method.
• This is the most important step as it provides the processed data in the form
of output which will be used further.
Cont…

• Output and interpretation


• It is the stage where processed information is now transmitted to the
user.
• Output is presented to users in various formats, such as a printed
report, audio, video, or a display on a monitor.
• Storage
• It is the last stage in the data processing cycle, where data, instruction
and information are held for future use.
• The importance of this cycle is that it allows quick access and retrieval
of the processed information, allowing it to be passed on to the next
stage directly, when needed.
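As a small illustration (not part of the original slides), the sketch below walks through the four stages, input, processing, output, and storage, in plain Python; the values and the output file name are assumptions made for the example.

```python
import json

# Input: raw data captured in machine-readable form (hypothetical values).
raw_scores = ["85", "92", "78"]

# Processing: convert the raw data and compute a meaningful result.
scores = [int(s) for s in raw_scores]
average = sum(scores) / len(scores)

# Output: present the processed information to the user.
print(f"Average score: {average:.1f}")

# Storage: persist the result for future use (illustrative file name).
with open("average_score.json", "w") as f:
    json.dump({"average": average}, f)
```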
Data types
Data types can be described from diverse perspectives.
In computer science and computer programming, for instance, a
data type is simply an attribute of data that tells the compiler or interpreter
how the programmer intends to use the data.
Data types can be classified from two perspectives.
1. Data types from Data Analytics perspective
2. Data types from Computer programming perspective
From computer programing perspective
• Common data types include:
• Integers (int): used to store whole numbers, mathematically known as integers
• Examples of integers are 0, 1, 2, 3 and 4
• Booleans (bool): used to store a value restricted to one of two values: true or false
• Characters (char): used to store a single character
• Floating-point numbers (float): used to store real numbers, e.g. 5.2, 12.7
• Alphanumeric strings (string): used to store a combination of characters and numbers

• A string is a sequence of characters enclosed between double quotes (" "),
as illustrated in the sketch below.
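A minimal Python sketch of these common data types follows; note that Python has no separate char type, so a one-character string stands in for it, and the variable names and values are illustrative assumptions only.

```python
age = 25                   # integer (int): a whole number
is_student = True          # boolean (bool): one of two values, True or False
grade = "A"                # character: represented in Python as a one-character string
gpa = 3.75                 # floating-point number (float): a real number
student_id = "ID-2024-17"  # alphanumeric string: a mix of characters and digits

print(type(age), type(is_student), type(grade), type(gpa), type(student_id))
```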


Data representation
Types are an abstraction that lets us model things in categories; they are
largely a mental construct.
All computers represent data as nothing more than strings of ones and
zeros.
In order for those ones and zeros to convey any meaning, they need to
be contextualized.
Data types provide that context.
E.g. 01100001
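For instance, the same bit pattern 01100001 can be read as the integer 97 or as the character 'a', depending on the data type used to interpret it; a tiny Python sketch of that idea:

```python
bits = 0b01100001   # the raw bit pattern 01100001

print(bits)         # interpreted as an integer  -> 97
print(chr(bits))    # interpreted as a character -> 'a'
```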
Data types from Data Analytics perspective
Data analytics (DA) is the process of examining data sets in order to
draw conclusions about the information they contain, increasingly with
the aid of specialized systems and software.
From a data analytics point of view, it is important to
understand that there are three common types of data types or
structures:
Structured,
Semi-structured, and
Unstructured data types
Structured Data
Structured data refers to data that is organized and formatted in
a specific way to make it easily readable and understandable by
both humans and machines.
Structured data is stored in a table format, with relationships
between the different rows and columns.
Structured data is highly valuable because it can be easily
searched, queried, and analyzed using various tools and
techniques
Cont…
Common examples of structured data are
Excel files
SQL databases
Each of these has structured rows and columns that can be sorted
Semi structured Data
Semi-structured data is a type of data that is not purely structured, but
also not completely unstructured.
It contains some level of organization or structure, but does not
conform to a rigid schema or data model
Semi-structured data contains tags or other markers to separate semantic
elements.
Semi-structured data is information that does not exist in a relational
database but that has some organizational properties that make it
easier to analyze.
E.g.: JSON, XML
JSON Example
XML Example
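The slides' JSON and XML examples are not reproduced in this text; as a stand-in, here is a minimal Python sketch that parses a small JSON document and an equivalent XML document (the field names and values are made up for illustration).

```python
import json
import xml.etree.ElementTree as ET

# JSON: semantic elements are separated by keys and nesting.
json_text = '{"name": "Abebe", "age": 21, "courses": ["Math", "Statistics"]}'
student = json.loads(json_text)
print(student["name"], student["courses"])

# XML: semantic elements are separated by tags.
xml_text = """
<student>
  <name>Abebe</name>
  <age>21</age>
  <courses><course>Math</course><course>Statistics</course></courses>
</student>
"""
root = ET.fromstring(xml_text)
print(root.find("name").text, [c.text for c in root.iter("course")])
```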
Unstructured Data
• Unstructured data is information that either does not have a predefined
data model or is not organized in a pre-defined manner.
• Unstructured data may have its own internal structure, but it does not
fit neatly into a spreadsheet or database.
• From 80% to 90% of data generated and collected by organizations is
unstructured,
• Its volumes are growing rapidly — many times faster than the rate of growth for
structured databases.
• E.g.: audio, video, files, sensor data, etc.
Cont…
Metadata – Data about Data
Metadata is data about data (Data that describes other data).
It provides additional information about a specific set of data.
Metadata summarizes basic information about data, which can make
finding and working with particular instances of data easier.
For example: author, date created, date modified, and file size
are examples of very basic document metadata.
Having the ability to filter through that metadata makes it much easier
for someone to locate a specific document.
In the context of databases, metadata would be information on tables, views,
columns, arguments, etc.
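As a small illustration (not from the slides), the sketch below reads basic file metadata, size and last-modified time, using Python's standard library; the file name is a hypothetical placeholder.

```python
import os
import time

path = "report.pdf"  # hypothetical document; any existing file path works here

info = os.stat(path)  # reads the file's metadata, not its contents
print("Size (bytes):", info.st_size)
print("Last modified:", time.ctime(info.st_mtime))
```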
Data value chain
The Data Value Chain is introduced to describe the information
flow within a big data system as a series of steps needed to
generate value and useful insights from data.
Data acquisition, data analysis, data curation, data storage, data usage
1. Data acquisition
It is the process of gathering, filtering, and cleaning data before it is
put in a data warehouse or any other storage solution on which data
analysis can be carried out.
Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.
Cont…
2. Data analysis:
▪ Data analysis involves exploring, transforming, and modeling data with the goal of

extracting useful hidden information

▪ It is concerned with making the raw data acquired amenable to use in decision-making
as well as domain-specific usage.
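A minimal sketch of exploring and transforming data in Python follows (it assumes the pandas library is installed; the column names and values are invented for illustration).

```python
import pandas as pd

# Hypothetical raw sales records acquired earlier in the value chain.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "amount": [120.0, 80.5, 200.0, 150.0],
})

# Transform and model: total and average sales per region.
summary = sales.groupby("region")["amount"].agg(["sum", "mean"])
print(summary)
```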
Cont…

3. Data curation:
• It is the active management of data over its life cycle to ensure it meets the
necessary data quality requirements for its effective usage.
• Data curation processes can be categorized into different activities such as
content creation, selection, classification, transformation, validation, and
preservation.

• Data curation is performed by expert curators (also known as scientific
curators or data annotators) who are responsible for improving the
accessibility and quality of data.
Cont..

4. Data storage:
• It is the persistence and management of data in a scalable way that
satisfies the needs of applications that require fast access to the data.
E.g.: RDBMS
• Relational databases guarantee the ACID (Atomicity, Consistency,
Isolation, and Durability) properties for transactions, but they lack flexibility
with regard to schema changes, and their performance degrades as data volumes
and complexity grow, making them unsuitable for many big data scenarios.
• NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.
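As a small, hedged illustration of the ACID transaction behaviour mentioned above (not from the slides), the sketch below uses Python's built-in sqlite3 module: the two balance updates either commit together or, on error, are rolled back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # one atomic transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    print("transfer failed, nothing was applied")

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('alice', 70.0), ('bob', 80.0)]
```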
Cont…

5. Data usage:
• It covers the data-driven business activities that need access to data,
its analysis, and the tools needed to integrate the data analysis
within the business activity.
• Data usage in business decision making can enhance
competitiveness through the reduction of costs, increased added
value, or any other parameter that can be measured against existing
performance criteria.
• In everyday usage, "data usage" also refers to the amount of data (things like images,
movies, photos, videos, and other files) that you send, receive, download and/or upload.
Cont..

• Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
• Today, it may consist of petabytes (1,024 terabytes) or exabytes
(1,024 petabytes) of information, including billions or even trillions of
records from millions of people.
• It is not just the amount of data that matters; what matters is what
organizations do with the data.
• Big Data is analyzed for insights that lead to better decisions.
Cont…
• Big Data is associated with the concept of the 3 Vs: volume,
velocity, and variety. Big data is characterized by these 3 Vs and
more:
• Volume: large amounts of data (zettabytes / massive datasets)
• Velocity: data is live-streaming or in motion
• Variety: data comes in many different forms from diverse
sources
• Veracity: can we trust the data? How accurate is it? etc.
• Value: refers to the economically useful benefits that an
organization obtains from Big Data
Cont…
Clustered Computing
• Because of the qualities and quantities of big data,
individual computers are often inadequate for handling the
data at most stages.
• To better address the high storage and computational needs of
big data, computer clusters are a better fit.
• “Computer cluster” basically refers to a set of connected
computers working together.
• The cluster represents one system, and the objective is to
improve performance.
Cont…

• Big data clustering software combines the resources of many


smaller machines, seeking to provide a number of benefits:
• Resource Pooling:
• Combining the available storage space to hold data is a clear
benefit, but CPU and memory pooling are also extremely
important.
• Processing large datasets requires large amounts of all three of
these resources.
Cont…

• High Availability
• Clusters can provide varying levels of fault tolerance and
availability guarantees to prevent hardware or software
failures from affecting access to data and processing.
• This becomes increasingly important as we continue to
emphasize the importance of real-time analytics.
Cont..

• Easy Scalability
• Clusters make it easy to scale horizontally by adding additional
machines to the group.
• This means the system can react to changes in resource
requirements without expanding the physical resources on a
machine.
Cont…

• Using clusters requires a solution for managing cluster


membership, coordinating resource sharing, and
scheduling actual work on individual nodes.
• Cluster membership and resource allocation can be handled
by software like Hadoop’s YARN (which stands for Yet
Another Resource Negotiator).
Hadoop and its Ecosystem
Hadoop is an open-source framework intended to make interaction
with big data easier. It is a framework that allows for the distributed
processing of large datasets across clusters of computers using simple
programming models.
The four key characteristics of Hadoop are:
Economical: Its systems are highly economical as ordinary computers can be
used for data processing.
Reliable: It is reliable as it stores copies of the data on different machines and is
resistant to hardware failure.
Scalable: It is easily scalable, both horizontally and vertically. A few extra nodes
help in scaling up the framework.
Flexible: It is flexible; you can store as much structured and unstructured
data as you need and decide how to use it later.
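To make the "simple programming models" idea concrete, here is a minimal word-count mapper and reducer in the style of Hadoop's MapReduce, written in Python and run locally; this is only a sketch under that assumption (job submission, HDFS paths, and cluster configuration are omitted), not the course's reference implementation.

```python
from itertools import groupby

def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Sum the counts for each word (pairs must be sorted by word)."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local, single-machine simulation of the map -> shuffle/sort -> reduce flow.
    text = ["big data needs big clusters", "hadoop processes big data"]
    mapped = sorted(mapper(text))  # the shuffle/sort step
    for word, count in reducer(mapped):
        print(word, count)
```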
Hadoop and its Ecosystem
● Hadoop has an ecosystem that has evolved from its four core components: data
management, access, processing, and storage.
● It is continuously growing to meet the needs of Big Data.
● It comprises the following components and many others:
○ HDFS: Hadoop Distributed File System
○ YARN: Yet Another Resource Negotiator
○ MapReduce: Programming-based data processing
○ Spark: In-memory data processing
○ PIG, HIVE: Query-based processing of data services
○ HBase: NoSQL database
○ Mahout, Spark MLlib: Machine learning algorithm libraries
○ Solr, Lucene: Searching and indexing
○ Zookeeper: Managing the cluster
○ Oozie: Job scheduling

Big Data Life Cycle with Hadoop
Ingesting data into the system
• The first stage of Big Data processing is Ingest.
• The data is ingested or transferred to Hadoop from various sources such as
relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.

Processing the data in storage


• The second stage is Processing.
• In this stage, the data is stored and processed.
• The data is stored in the distributed file system, HDFS (Hadoop Distributed File
System), and in the NoSQL distributed database, HBase. Spark and MapReduce perform
the data processing, as sketched below.
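As a hedged illustration of this stage (assuming a PySpark installation; the HDFS path is a hypothetical placeholder), a minimal Spark word count over data stored in HDFS might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Hypothetical input path on HDFS.
lines = spark.sparkContext.textFile("hdfs:///data/input/logs.txt")

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```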
Big Data Life Cycle with Hadoop
Computing and analyzing data :
• The third stage is to Analyze.
• Here, the data is analyzed by processing frameworks such as Pig,
Hive, and Impala.
• Pig converts the data using map and reduce operations and then analyzes
it. Hive is also based on map-and-reduce programming and is
best suited for structured data.
Visualizing the results
• The fourth stage is Access, which is performed by tools such as Hue
and Cloudera Search. In this stage, the analyzed data can be
accessed by users.
