
Chapter 2

Data Science
OUTLINE
 After completing this chapter, you will be able to:

 Describe what data science is and the role of data scientists.

 Differentiate between data and information.

 Describe the data processing life cycle.

 Understand different data types from diverse perspectives.

 Describe the data value chain in the emerging era of big data.

 Understand the basics of Big Data.

 Describe the purpose of the Hadoop ecosystem components.


INTRODUCTION TO DATA SCIENCE
 Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data.

 The main purpose of data science is to find patterns within data; it uses several techniques to analyze data and draw insights from it.

 Data science is much more than simply analyzing data. It offers a range of roles and requires a range of skills.

 A data scientist is a person who performs analysis on data and gives insights to decision makers. In data science, the data scientist has the responsibility of making predictions from the data.
INTRODUCTION TO DATA SCIENCE
 A data scientist aims to derive conclusions from the data as a whole. With the help of these conclusions, the data scientist can support industries in making smarter business decisions.

 Data science is all about:

 Asking correct questions and analyzing the raw data.

 Visualizing the data to get a better perspective.

 Modeling the data using various complex and efficient algorithms.

 Understanding the data to make better decisions and find the final result.
DATA AND INFORMATION
 Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.

 It can be described as unprocessed facts and figures.

 It is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).


DATA AND INFORMATION
 Information is processed data on which decisions and actions are based. It is data that has been processed into a form that is meaningful to the recipient.

 Furthermore, information is interpreted data: it is created from organized, structured, and processed data in a particular context.
DATA AND INFORMATION
Data:
 raw facts
 no context
 just numbers and text

Information:
 data with context
 processed data
 value added to data (summarized, organized, analyzed)

 For example:

 Data: 51012

 Information:
 5/10/12: the date of your final exam.
 51,012 Birr: the starting salary of an accounting major.
 51012: the zip code of Dilla.
EXAMPLE OF DATA SCIENCE
 Let us consider choosing a department from the available departments at Dilla University as an example data science problem:

 Before choosing a department, you write your preferences on paper as Law, Anthropology, Economics, Accounting, etc., based on what you see in the semester. This planned choice is a piece of data that you can read, whether it is written in pencil or on a mobile device.

 When it is time to choose, you use your data as a reminder to choose a department.

 When you choose a department, the registrar registers you to the department you want.

 At the registrar, a system may indicate that a department's capacity is full and that students should be assigned to other departments.

 Finally, at the end of registration, the registrar employees see different graphs of student data based on sex, age, country, etc. They use this information to prepare for the next year.
EXAMPLE OF DATA SCIENCE
 The small piece of information that began in your notebook ended up in many different places, most notably at the registrar's office as an aid to decision making.

 The data went through many transformations. In addition to the computers where the data might have stopped briefly or stayed for the long term, lots of other pieces of hardware were involved in collecting, manipulating, transmitting, and storing the data.

 In addition, many different pieces of software were used to organize, aggregate, visualize, and present the data. Finally, many different human systems were involved in working with the data.

 All of these processes together constitute data science.


DATA TYPE

 Data types can be described from diverse perspectives.

 In computer science, a data type is simply an attribute (characteristic) of data that tells the programmer the intended use of the data, for example:

 int x = 5;

 String name = "Melaku";

 float p = 3.14f;

 From a data analytics point of view, there are three data types: structured, semi-structured, and unstructured.
DATA TYPE
 Structured data: data whose elements are organized in a tabular format. It is easy to retrieve, search, and perform analysis on this type of data. Example: the data in your list, stored in a table.

 Semi-structured data: a form of structured data that does not obey the tabular format but has some organizational properties that make it easier to analyze. Example: the content of Facebook pages.

 Unstructured data: data that is not organized in a pre-defined manner or does not have a pre-defined data structure. Examples: Word documents, PDFs, text, music.
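
 To make the distinction concrete, below is a minimal Python sketch (the records and field names are made up for illustration) that reads the same kind of data in structured form (CSV) and semi-structured form (JSON):

    import csv
    import io
    import json

    # Structured: every record has the same columns in a fixed tabular layout.
    csv_text = "id,name,dept\n1,Abebe,Law\n2,Sara,Economics\n"
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    print(rows[0]["dept"])  # 'Law': lookup by a fixed column name always works

    # Semi-structured: records share keys/tags, but fields may vary per record.
    json_text = '[{"id": 1, "dept": "Law"}, {"id": 2, "interests": ["Economics"]}]'
    for record in json.loads(json_text):
        # Not every record has the same keys, so missing fields must be handled.
        print(record.get("dept", "no dept field"))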


DATA TYPE
 Metadata is not a separate data type, but it is one of the most important elements
for Big Data analysis and big data solutions.

 Metadata is data about data. It provides additional information about a specific set
of data.

 In a set of photographs, for example, metadata could describe when and where the
photos were taken.

 The metadata then provides fields for dates and locations which, by themselves, can be considered structured data. For this reason, metadata is frequently used by big data solutions for initial analysis, as sketched below.
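
 As a small hedged illustration in Python (the keys below are made up, not a real EXIF schema), photo metadata can be represented and queried like structured data:

    # Metadata: data about the photos, not the photos themselves.
    photos = [
        {"file": "img_001.jpg", "taken": "2023-05-10", "location": "Dilla"},
        {"file": "img_002.jpg", "taken": "2023-05-11", "location": "Awassa"},
    ]

    # The date and location fields are structured, so an initial analysis
    # can query them directly.
    taken_in_dilla = [p["file"] for p in photos if p["location"] == "Dilla"]
    print(taken_in_dilla)  # ['img_001.jpg']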
DATA PROCESSING CYCLE
 Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose. Three steps constitute the data processing cycle.

 Input: in this step, the input data is prepared in some convenient form for processing. The input data can be recorded on any one of several types of storage media, such as a hard disk, CD, flash disk, paper, and so on.

 E.g., when opening an account at a CBE branch, your data will be stored in computers.
DATA PROCESSING CYCLE
 Processing: in this step, the input data is changed to produce data in a more useful form. For example, interest can be calculated on a deposit to a bank, or a summary of the month's withdrawals can be calculated.

 Output: at this stage, the result of the processing step is collected. The particular form of the output data depends on the use of the data. For example, the output may be a summary, like a bank statement.
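
 As a minimal sketch of the three steps (the deposits and interest rate are made-up numbers), the bank-interest example could look like this in Python:

    # Input: deposits recorded in a convenient form for processing.
    deposits = [1000.0, 2500.0, 400.0]  # hypothetical deposits in Birr
    annual_rate = 0.07                  # assumed 7% annual interest rate

    # Processing: change the input into a more useful form.
    interest = [round(d * annual_rate, 2) for d in deposits]

    # Output: collect the result in the form the user needs, e.g. a summary.
    print("Total interest this year:", sum(interest), "Birr")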

 There are various types of data science tools used to analyze different types of data, for example:

 SAS (Statistical Analysis Software)

 Excel, etc.
DATA PROCESSING TOOLS
 SAS (Statistical Analysis Software): one of the data science tools specially designed for statistical operations; it is used by large organizations to analyze data.

 Weka: software that allows easier implementation of machine learning algorithms through an interactive platform. The user can understand the functioning of machine learning on the data without having to write a line of code. Weka is ideal for data scientists who are beginners in machine learning.

 Excel: Microsoft developed Excel for spreadsheet calculations, but nowadays it is widely used for data processing, visualization, and complex calculations. It is a powerful analytical tool for data science. Excel comes with formulae, tables, and filters, and users can also create their own custom functions.
BIG DATA
 Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and draw insights from large datasets.

 While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years.

 Big data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently.

 "Large" here means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
BIG DATA
 Big data is characterized by 4 Vs:
 Volume: large amounts of data (zettabytes / massive datasets).
 Velocity: the speed of data processing; data is live-streaming or in motion.
 Variety: data comes in many different forms from diverse sources.
 Veracity: can we trust the data? How accurate is it?
BIG DATA
 Big data is used for analysis to get insights that will help you with business decisions.

 Some real-world examples of how big data is used are as follows:

 Big Data is used to find out consumer shopping habits.

 It can be used to monitor health conditions through data from wearables.

 The transportation industry uses fuel optimization tools where big data is used.

 It is used for predictive inventory ordering.

 It can help you with real-time data monitoring and cybersecurity protocols.

 It is used for forecasting.


BIG DATA
 Big data is responsible for handling, managing, and processing different types of data: structured, semi-structured, and unstructured.

 It is cost-effective in terms of maintaining a large amount of data, because it works on distributed systems.

 We can save large amounts of data for a long time using big data techniques, so it is easy to handle historical data and generate accurate reports.

 Data processing speed is very fast, which is why social media platforms use big data techniques.

 It allows users to make efficient decisions for their business based on current and historical data.
WHO'S GENERATING BIG DATA?
 There are many companies and devices that are generating big data:

 Social media and networks (all of us are generating data)

 Scientific instruments (collecting all sorts of data)

 Mobile devices (tracking all objects all the time)

 Sensor technology and networks (measuring all kinds of data)
DATA VALUE CHAIN
 The data value chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data. It describes the process of data creation and use, from first identifying a need for data to its final use and possible reuse.

 The Big Data Value Chain identifies the following key high-level activities: data acquisition, data analysis, data curation, data storage, and data usage (each discussed in the following slides).
DATA ACQUISITION
 Data acquisition is the process of gathering, filtering, and cleaning data before it is put in a data warehouse (processed).

 A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. It is a decision-support database that is maintained separately from the organization's operational database.

 Data acquisition is one of the major big data challenges in terms of infrastructure requirements.

 The infrastructure required to support the acquisition of big data must deliver low, predictable latency in capturing data. It must be able to handle very high transaction volumes, often in a distributed environment, and support flexible and dynamic data structures.
DATA ANALYSIS

 Data analysis involves exploring, transforming, and modeling data with the goal of highlighting relevant data and synthesizing and extracting useful hidden information with high potential from a business point of view.

 The data is analyzed to obtain the following characteristics from the data:

 Classification

 Clustering

 Association, etc.

 To perform analysis, different computer science areas such as data mining, business intelligence, and machine learning are used. A small clustering sketch follows below.
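
 For instance, clustering can be sketched with the scikit-learn library, assuming it is installed; the data points below are invented for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical 2-D points, e.g. (age, monthly spend) per customer.
    X = np.array([[20, 100], [22, 120], [45, 600], [47, 650], [21, 90]])

    # Group the points into two clusters to reveal hidden structure.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(model.labels_)           # cluster assignment for each point
    print(model.cluster_centers_)  # the center of each cluster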
DATA CURATION
 Data curation is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for effective usage.

 Data curation processes can be categorized into different activities, such as:

 Content creation and selection

 Classification

 Transformation and validation

 Preservation

 Data curators (scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for purpose.
DATA STORAGE
 In the data storage phase, the collected data is stored and prepared for use in the next phase. As the collected data may contain sensitive information, it is essential to take sufficient precautions when storing data.

 In data storage, we must ensure the persistent management of data in a scalable way that satisfies the needs of applications that require fast access to the data.

 In order to guarantee the safety of the collected data, some security measures can be used, such as data anonymization, permutation, and data partitioning (vertical or horizontal); a small anonymization sketch follows below.

 RDBMSs have been the main, and almost the only, solution to the storage paradigm for nearly 40 years.
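
 As one hedged illustration of anonymization (the salt and field names are made up; a real deployment needs a stronger, reviewed scheme), sensitive identifiers can be replaced with salted hashes before storage:

    import hashlib

    SALT = b"change-me"  # hypothetical secret; keep it out of the stored data

    def anonymize(value: str) -> str:
        """Replace a sensitive identifier with a salted SHA-256 digest."""
        return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

    record = {"name": "Abebe Kebede", "balance": 5000}
    stored = {"name": anonymize(record["name"]), "balance": record["balance"]}
    print(stored)  # the name is unreadable, but equal names hash alike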
DATA USAGE
 Data usage covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business decision making.

 Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.

 The process of decision-making includes reporting, exploration of data (browsing and lookup), and exploratory search (finding correlations, comparisons, what-if scenarios, etc.).

 Data usage happens through specific tools, and in turn through query and scripting languages that typically depend on the underlying data stores, their execution engines, APIs, and programming models.
SUMMARY OF DATA SCIENCE AND BIG DATA
 Example: take the banking industry to explain the tasks of big data and data science:

 Data science will analyze the work systems, information, procedures, and documents of the bank. Big data analysis will assess the financial and management aspects of the bank based on cost and time for a specific user.

 Data science will help the banking industry with:
 Fraud detection
 Risk management
 Customer data analysis
 Marketing and sales
 AI-driven chatbots & virtual assistants

 Big data will help the banking industry with:
 Providing personalized banking solutions to customers
 Boosting performance
 Performing effective customer feedback analysis
 Effective risk management
CLUSTERED COMPUTING
 To better address the high storage and computational needs of big data, we need a fast, secure, and reliable computing environment.

 Cluster computing solves the problems of stand-alone technology. The objective is to improve the performance/power efficiency over a single processor for storing and mining large data sets, using multiple disks and CPUs.

 Cluster computing means that many computers are connected on a network and perform like a single entity. Each computer connected to the network is called a node.

 Cluster computing offers solutions to complicated problems by providing faster computational speed and enhanced data integrity. The connected computers execute operations all together, thus creating the impression of a single system (a virtual machine).
CLUSTERED COMPUTING
 Clustering software provides a number of benefits, such as:

 Resource pooling: combining the available storage space to hold data.

 High availability: guarantees availability in case of hardware or software failures.

 Easy scalability: easy to scale horizontally by adding machines to the group.

 Clusters require a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.

 Cluster membership and resource allocation can be handled by software like Hadoop's ecosystem.
HADOOP ECOSYSTEM
 Hadoop is an open-source framework intended to make interaction with big data easier. It allows for the distributed processing of large datasets across clusters of computers using simple programming models.

 The four key characteristics of Hadoop are:

 Economical: highly economical, as ordinary computers can be used for data processing.

 Reliable: stores copies of the data on different machines and is resistant to hardware failure.

 Scalable: easily scalable both horizontally and vertically.

 Flexible: can store as much structured and unstructured data as needed.


HADOOP ECOSYSTEM
 Hadoop has an ecosystem that has evolved from its four core components: data management, access, processing, and storage.

 In the ecosystem, there are two main components for data storage:

 HDFS: the major component of the Hadoop ecosystem, responsible for storing large data sets of structured or unstructured data across various nodes.

 HBase: a NoSQL database that supports all kinds of data and is thus capable of handling anything in a Hadoop database.


HADOOP ECOSYSTEM
 For processing, the ecosystem has:

 Yet Another Resource Negotiator (YARN): manages resources across the clusters. It performs scheduling and resource allocation for the Hadoop system.

 MapReduce: carries the processing logic and helps to write applications that transform big data sets into manageable ones (a pure-Python sketch of the idea follows below).
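
 To show the idea only (pure Python, not Hadoop's actual Java API), a word count can mimic the map, shuffle, and reduce phases:

    from collections import defaultdict

    lines = ["big data is big", "data science uses big data"]

    # Map: emit (word, 1) pairs from each input line.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle: group pairs by key so each reducer sees one word's counts.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: aggregate each group into a single, manageable result.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}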
HADOOP ECOSYSTEM
 For data access, the ecosystem has:

 Hive: performs reading and writing of large data sets.

 Pig: structures the data flow and processes and analyzes huge data sets.

 Mahout: provides various functionalities such as collaborative filtering, clustering, and classification. It allows invoking algorithms with the help of its own libraries.

 For data management, the ecosystem has:

 Solr, Lucene: searching and indexing
 Zookeeper: managing the cluster
 Oozie: job scheduling
 Flume: collecting and transferring event/log data
BIG DATA LIFE CYCLE WITH HADOOP
 Ingesting data into the system: the first stage of big data processing is ingest. The data is ingested, or transferred, to Hadoop from various sources such as relational databases, other systems, or local files. Sqoop transfers data from RDBMSs to HDFS, whereas Flume transfers event data.

 Processing: in this stage, the data is stored and processed. The data is stored in HDFS and/or HBase; Spark and MapReduce perform the data processing.

 Analyze: the data is analyzed by processing frameworks such as Pig, Hive, and Impala.

 Access: performed by tools such as Hue and Cloudera Search; in this stage, the analyzed data can be accessed by users. An illustrative end-to-end sketch follows below.
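
 Purely as an illustrative sketch (plain Python functions standing in for Sqoop/Flume, HDFS/HBase with Spark or MapReduce, Pig/Hive/Impala, and Hue/Cloudera Search), the four stages could be outlined as:

    def ingest():
        # Transfer raw records from a source system into the cluster.
        return ["alice,120", "bob,80", "alice,40"]

    def process(raw):
        # Store and reshape the data into a usable form.
        return [(name, int(amount)) for name, amount in (r.split(",") for r in raw)]

    def analyze(records):
        # Summarize, e.g. the total amount per user.
        totals = {}
        for name, amount in records:
            totals[name] = totals.get(name, 0) + amount
        return totals

    def access(result):
        # Expose the analyzed data to end users.
        print(result)

    access(analyze(process(ingest())))  # {'alice': 160, 'bob': 80}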
SUMMARY
 Let us understand the Hadoop ecosystem through a real-life example. Suppose a company has established branches in different locations, say in India, Addis Ababa, Awassa, and Dilla.

 In every branch, the entire customer data is stored in a local database daily.

 On a quarterly, half-yearly, or yearly basis, the organization wants to analyze this data for business development. To do this, the organization collects all this data from multiple sources, performs the necessary cleaning steps, and puts it in a data warehouse.

 Then we can use it for analytical purposes. For analysis, we can generate reports from the data available in the data warehouse; multiple charts and reports can be generated using business intelligence tools.

 We require such analysis to grow the business and make appropriate decisions for the organization.
THE END
Questions?
