Case Study DSBDA Report

ABSTRACT

Nowadays, the healthcare industry is seeing a significant rise in the number of doctors, patients, drugs, and medicines, while the analysis and prediction of future health conditions are still at a developing stage. The data involved, once a few bytes, has grown to terabytes, and with this growth the demands on storage and dataset maintenance have increased as well. Traditional data mining and diagnosis tools struggle at this scale, so the need for big data tools and techniques arises. Big data and big data analytics are rising technologies, and primary sets of such data are now being created in medical and healthcare contexts, yet there is still no ideal method to measure patient satisfaction. This paper presents ideas and a methodology that apply data mining techniques such as clustering to the collected datasets and feed them into the big data tool Hadoop for effective analysis of healthcare data. Hadoop has gained popularity due to its ability to store and access large amounts of data quickly and cost-effectively on clusters of commodity hardware.

Keywords- Big data, Healthcare, Clustering, Hadoop.

INTRODUCTION

The healthcare industry is one of the world's largest and most extensive sectors. Over recent years, healthcare administration around the globe has been shifting from a disease-focused, volume-based model to a patient-focused model. Improving the quality of healthcare while reducing its cost is the guiding principle behind the growing movement towards value-based healthcare delivery and patient-focused care. The volume of, and demand for, big data in healthcare organizations is growing steadily. To provide effective patient-focused care, it is essential to manage and analyze very large data sets. Traditional methods are obsolete and no longer adequate for analyzing this information, as the variety and volume of data sources have expanded at a very high rate over the previous two decades. There is a requirement for new and creative tools and methods that can meet and surpass the challenge of managing the huge amount of data being generated by healthcare departments. The healthcare system is collaborative in nature.

This is because it comprises a substantial number of stakeholders, such as doctors with different specializations, medical caretakers, research laboratory technologists, and other individuals who cooperate to accomplish the shared objectives of decreasing medical costs and errors while providing a quality healthcare experience. Each of these stakeholders produces information from heterogeneous sources, for example physical examinations, clinical notes, patient interviews and observations, laboratory tests, imaging reports, medications, treatments, surveys, bills, and insurance. The rate at which information is generated from these heterogeneous sources across healthcare departments has increased exponentially on a daily basis. It is therefore becoming hard to store, process, and analyze this interrelated information with traditional data-handling applications. New and efficient methods and systems, together with modern processing technologies, are needed to store, process, analyze, and extract value from the voluminous and heterogeneous medical information being generated continuously.

PROBLEM STATEMENT
Write a case study on processing data-driven Digital Marketing OR Healthcare systems with the Hadoop Ecosystem components shown below.
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database (Provides real-time reads and writes)
● Mahout, Spark MLlib: Machine learning algorithm libraries (provide analytical tools)
● Solr, Lucene: Searching and Indexing

Objective:

Our objective is to process and analyze healthcare data using various Hadoop components.
Let’s delve into how each component contributes to our healthcare use case:

HDFS (Hadoop Distributed File System): Store diverse healthcare data, such as electronic
health records (EHRs), medical images (X-rays, MRIs), and genomic sequences.

YARN: Efficiently allocate resources for healthcare applications.

MapReduce: Process structured and semi-structured data.

Spark: Perform in-memory data processing for faster analytics.

PIG and HIVE: Query and transform healthcare data.

HBase: Store real-time patient data, such as vital signs or telemetry data.

Mahout and Spark MLlib: Apply machine learning algorithms to predict disease outcomes,
recommend treatments, or personalize patient care.

Solr and Lucene: Build search indexes for medical literature, research papers, and clinical
guidelines.

Outcome:

By integrating the Hadoop ecosystem into healthcare systems, we achieve the following
outcomes:
● Real-time patient monitoring and alerts.
● Personalized treatment recommendations.
● Early disease detection.
● Streamlined research collaboration.

INTRODUCTION TO HADOOP ECOSYSTEM

2.1. Hadoop Distributed File System

It is the most important component of the Hadoop Ecosystem and the primary storage
system of Hadoop. The Hadoop Distributed File System (HDFS) is a Java-based file system that
provides scalable, fault-tolerant, reliable, and cost-efficient data storage for big data. HDFS is a
distributed file system that runs on commodity hardware. For many installations the default
configuration is sufficient; only large clusters typically need further tuning. Users interact
directly with HDFS through shell-like commands.
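
As a minimal sketch of this storage layer, the Python snippet below uses the third-party `hdfs` WebHDFS client to load an EHR export into HDFS and list the result; the NameNode address, user name, and paths are placeholders assumed for this case study.

```python
# Minimal sketch: loading an EHR export into HDFS over WebHDFS.
# Assumes the third-party `hdfs` package (pip install hdfs); the NameNode
# address, user, and paths below are placeholders for this case study.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")  # hypothetical host/user

# Create a directory for raw EHR records and upload a local CSV export.
client.makedirs("/healthcare/ehr/raw")
client.upload("/healthcare/ehr/raw/patients.csv", "patients.csv", overwrite=True)

# List what is now stored under the healthcare directory.
print(client.list("/healthcare/ehr/raw"))
```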

HDFS Components:

There are two major components of Hadoop HDFS: the NameNode and the DataNode. Let's now discuss
these HDFS components.

i. NameNode

It is also known as the Master node. The NameNode does not store the actual data or datasets; it
stores metadata, i.e., the number of blocks, their locations, the rack and DataNode on which the data
is stored, and other details. It keeps track of the files and directories in the file system namespace.
Tasks of HDFS NameNode

● Manages the file system namespace.
● Regulates clients' access to files.
● Executes file system operations such as naming, opening, and closing files and directories.

ii. DataNode

It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS
and performs operations as per the requests of clients. Each block replica on a DataNode consists
of two files on the local file system: the first holds the data and the second records the block's
metadata, which includes checksums for the data. At startup, each DataNode connects to its
corresponding NameNode and performs a handshake, which verifies the namespace ID and the software
version of the DataNode. If a mismatch is found, the DataNode shuts down automatically.
Tasks of HDFS DataNode

● DataNode performs block replica creation, deletion, and replication according to the
instructions of the NameNode.
● DataNode manages the data storage of the system.

This was all about HDFS as a Hadoop Ecosystem component.


2.2. MapReduce

Hadoop MapReduce is the core Hadoop ecosystem component that provides data processing.
MapReduce is a software framework for easily writing applications that process the vast amounts of
structured and unstructured data stored in the Hadoop Distributed File System.
MapReduce programs are parallel, and are therefore very useful for performing large-scale data analysis
using multiple machines in the cluster. This parallel processing improves the speed and reliability of
the cluster.

Fig. Hadoop MapReduce

Working of MapReduce

Hadoop Ecosystem component ‘MapReduce’ works by breaking the processing into two phases:

2.2.1. Map phase


2.2.2. Reduce phase
Each phase has key-value pairs as input and output. In addition, the programmer also specifies two
functions: the map function and the reduce function.


The map function takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). Read Mapper in detail.
The reduce function takes the output from the map as its input, combines those data tuples based on
the key, and accordingly modifies the value of the key. Read Reducer in detail.
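
To make the two phases concrete, here is a hedged Hadoop Streaming sketch in Python: the mapper emits key/value pairs and the reducer aggregates the values for each key, in this case as a simple word count over clinical notes. The script name and the map/reduce command-line switch are assumptions for illustration.

```python
#!/usr/bin/env python3
# Hadoop Streaming sketch of the two MapReduce phases: a classic word count
# over clinical notes stored in HDFS. Script and path names are illustrative.
import sys

def mapper():
    # Map phase: break each line into (word, 1) key/value pairs.
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key, so counts can be accumulated
    # per word and emitted when the key changes.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Such a script would typically be submitted with the Hadoop Streaming jar, e.g. `hadoop jar hadoop-streaming.jar -input <notes dir> -output <counts dir> -mapper "wordcount.py map" -reducer "wordcount.py reduce"`, where the jar name and all paths are placeholders.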

Features of MapReduce:

● Simplicity – MapReduce jobs are easy to run. Applications can be written in any
language, such as Java, C++, or Python.
● Scalability – MapReduce can process petabytes of data.
● Speed – Through parallel processing, problems that take days to solve can be solved
in hours or minutes by MapReduce.
● Fault Tolerance – MapReduce takes care of failures. If one copy of the data is
unavailable, another machine has a copy of the same key-value pair which can be
used for solving the same subtask.

Refer to MapReduce Comprehensive Guide for more details.

The next component of the Hadoop ecosystem is YARN.

2.3. YARN

Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that provides
resource management, and it is one of the most important components of the Hadoop ecosystem.
YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring
workloads. It allows multiple data processing engines, such as real-time streaming and batch
processing, to handle data stored on a single platform.


Fig. Hadoop YARN Diagram

YARN has been projected as the data operating system for Hadoop 2. The main features of YARN are:

● Flexibility – Enables other purpose-built data processing models beyond MapReduce
(batch), such as interactive and streaming. Due to this feature, other applications can
also be run along with MapReduce programs in Hadoop 2.
● Efficiency – Since many applications run on the same cluster, the efficiency of Hadoop
increases without much effect on the quality of service.
● Shared – Provides a stable, reliable, and secure foundation and shared operational
services across multiple workloads. Additional programming models, such as graph
processing and iterative modeling, are now possible for data processing.

Refer to the YARN Comprehensive Guide for more details.
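
As an illustration of YARN acting as the resource negotiator, the hedged PySpark sketch below requests a session whose executors are scheduled by YARN instead of running locally; the application name, executor count, and memory size are arbitrary example values, and a configured Hadoop/YARN client environment is assumed.

```python
# Sketch: submitting an analytics job to the cluster through YARN.
# Assumes PySpark on a cluster edge node with Hadoop configuration available;
# the executor sizes below are arbitrary examples, not tuned values.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("healthcare-analytics")
    .master("yarn")                        # let YARN negotiate the resources
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)

# Any job built on this session now runs in containers allocated by YARN.
print(spark.sparkContext.applicationId)
spark.stop()
```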


2.4. Hive

The Hadoop ecosystem component Apache Hive is an open-source data warehouse
system for querying and analyzing large datasets stored in Hadoop files. Hive performs three
main functions: data summarization, query, and analysis.


Hive uses a language called HiveQL (HQL), which is similar to SQL. Hive automatically
translates the SQL-like queries into MapReduce jobs that execute on Hadoop.

Fig. Hive Diagram

The main parts of Hive are:

● Metastore – Stores the metadata.
● Driver – Manages the lifecycle of a HiveQL statement.
● Query compiler – Compiles HiveQL into a Directed Acyclic Graph (DAG).
● Hive server – Provides a Thrift interface and a JDBC/ODBC server.

Refer to Hive Comprehensive Guide for more details.
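
A hedged example of the HiveQL workflow described above, using the PyHive client to run an aggregation against HiveServer2; the host, port, user, and the `patient_records` table layout are assumptions for this case study.

```python
# Sketch: running HiveQL from Python through HiveServer2 with PyHive.
# The host, port, username, and table layout are placeholders.
from pyhive import hive

conn = hive.Connection(host="hiveserver", port=10000, username="hadoop")
cursor = conn.cursor()

# Summarise expenditure per prescribed drug; Hive compiles this HiveQL
# statement into MapReduce (or Tez/Spark) jobs behind the scenes.
cursor.execute("""
    SELECT drug_prescribed,
           SUM(total_expenditure_on_prescribed_drugs) AS total_spent
    FROM patient_records
    GROUP BY drug_prescribed
""")
for drug, total_spent in cursor.fetchall():
    print(drug, total_spent)
```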


2.5. Pig

Apache Pig is a high-level language platform for analyzing and querying huge datasets that are stored
in HDFS. Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language, which is very
similar to SQL. It loads the data, applies the required filters, and dumps the data in the required
format. For program execution, Pig requires a Java runtime environment.

Fig. Pig Diagram

Features of Apache Pig:

● Extensibility – For carrying out special-purpose processing, users can create their own
functions.
● Optimization opportunities – Pig allows the system to optimize execution automatically.
This allows the user to pay attention to semantics instead of efficiency.
● Handles all kinds of data – Pig analyzes both structured and unstructured data.

Refer to Pig – A Complete guide for more details.
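
For comparison, here is a hedged sketch of a similar aggregation written as a Pig Latin script; the script is generated from Python and handed to the `pig` command line, and all HDFS paths and field names are placeholders.

```python
# Sketch: generating and launching a Pig Latin job from Python.
# The HDFS paths and field layout are placeholders for this case study.
import subprocess
from pathlib import Path

pig_latin = """
records = LOAD '/healthcare/ehr/raw/patients.csv' USING PigStorage(',')
          AS (serial:int, name:chararray, drug:chararray,
              gender:chararray, expenditure:double);
by_drug = GROUP records BY drug;
totals  = FOREACH by_drug GENERATE group AS drug, SUM(records.expenditure);
STORE totals INTO '/healthcare/ehr/drug_totals' USING PigStorage(',');
"""

Path("drug_totals.pig").write_text(pig_latin)
# Requires the Pig client (and a Java runtime) on this machine.
subprocess.run(["pig", "drug_totals.pig"], check=True)
```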


2.6. HBase

Apache HBase is a Hadoop ecosystem component: a distributed database designed to store
structured data in tables that can have billions of rows and millions of columns. HBase is a
scalable, distributed NoSQL database built on top of HDFS, and it provides real-time access to
read and write data in HDFS.

Fig. HBase Diagram


Components of HBase

There are two HBase components, namely the HBase Master and the RegionServer.

i. HBase Master

It is not part of the actual data storage but negotiates load balancing across all RegionServers.
● Maintains and monitors the Hadoop cluster.
● Performs administration (an interface for creating, updating, and deleting tables).
● Controls failover.
● HMaster handles DDL operations.

ii. RegionServer

It is the worker node which handles read, write, update, and delete requests from clients. The
RegionServer process runs on every node in the Hadoop cluster, alongside the HDFS DataNode.

Refer to HBase Tutorial for more details.
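
A hedged sketch of the real-time read/write path, using the `happybase` Thrift client to store and fetch patient vital signs; the Thrift server host, table name, and "vitals" column family are placeholders and are assumed to already exist.

```python
# Sketch: real-time reads and writes of patient vitals through HBase, via the
# `happybase` Thrift client. Host, table, and column family are placeholders.
import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("patient_vitals")

# Write the latest vital signs for one patient (HBase stores raw bytes).
table.put(b"patient-1514", {
    b"vitals:heart_rate": b"82",
    b"vitals:spo2": b"97",
})

# Read them back immediately, the low-latency access described above.
row = table.row(b"patient-1514")
print(row[b"vitals:heart_rate"], row[b"vitals:spo2"])
connection.close()
```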

2.7. HCatalog

It is a table and storage management layer for Hadoop. HCatalog supports different components
available in the Hadoop ecosystem, such as MapReduce, Hive, and Pig, to easily read and write data
from the cluster. HCatalog is a key component of Hive that enables users to store their data in any
format and structure.
By default, HCatalog supports RCFile, CSV, JSON, SequenceFile, and ORC file formats.

Benefits of HCatalog:

● Enables notifications of data availability.
● With the table abstraction, HCatalog frees the user from the overhead of data storage.
● Provides visibility for data cleaning and archiving tools.
2.8. Avro

Avro is a part of the Hadoop ecosystem and the most popular data serialization system. Avro is
an open-source project that provides data serialization and data exchange services for


Hadoop. These services can be used together or independently. Using Avro, big data programs
written in different languages can exchange data.
Using the serialization service, programs can serialize data into files or messages. Avro stores the
data definition and the data together in one message or file, making it easy for programs to
dynamically understand the information stored in Avro files or messages.

Avro schema – Avro relies on schemas for serialization and deserialization and requires the schema
for writing and reading data. When Avro data is stored in a file, its schema is stored with it, so that
the file may be processed later by any program.

Dynamic typing – It refers to serialization and deserialization without code generation. It
complements the code generation which is available in Avro for statically typed languages as an
optional optimization.

Features provided by Avro:

● Rich data structures.
● Remote procedure call.
● A compact, fast, binary data format.
● A container file, to store persistent data.
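
The schema-with-the-data behaviour described above can be seen in the short `fastavro` sketch below, which writes patient records with an embedded schema and reads them back; the record layout and file name are assumptions.

```python
# Sketch: Avro serialization with the schema stored alongside the data.
# Uses the `fastavro` package; the record layout is an assumption.
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "name": "PatientRecord",
    "type": "record",
    "fields": [
        {"name": "serial_number", "type": "int"},
        {"name": "drug_prescribed", "type": "string"},
        {"name": "expenditure", "type": "double"},
    ],
})

records = [{"serial_number": 1514, "drug_prescribed": "metformin", "expenditure": 42.5}]

with open("patients.avro", "wb") as out:
    writer(out, schema, records)          # the schema is embedded in the file

with open("patients.avro", "rb") as fo:
    for record in reader(fo):             # any program can read it back later
        print(record)
```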

2.9. Thrift

It is a software framework for scalable cross-language services development. Thrift is an interface
definition language for RPC (Remote Procedure Call) communication. Hadoop makes a lot of RPC
calls, so there is a possibility of using the Hadoop ecosystem component Apache Thrift for
performance or other reasons.


Fig. Thrift Diagram

2.10. Apache Drill

The main purpose of this Hadoop ecosystem component is large-scale data processing, including
structured and semi-structured data. Drill is a low-latency distributed query engine designed to
scale to several thousand nodes and query petabytes of data. It is the first distributed SQL query
engine with a schema-free model.

Application of Apache Drill

Drill has become an invaluable tool at Cardlytics, a company that provides consumer purchase
data for mobile and internet banking. Cardlytics uses Drill to quickly process trillions of
records and execute queries.

Features of Apache Drill:

Drill has a specialized memory management system to eliminate garbage collection and
optimize memory allocation and usage. Drill also plays well with Hive, allowing developers to
reuse their existing Hive deployments.
● Extensibility – Drill provides an extensible architecture at all layers, including the query
layer, query optimization, and the client API. Any layer can be extended for the specific
needs of an organization.


● Flexibility – Drill provides a hierarchical columnar data model that can represent
complex, highly dynamic data and allow efficient processing.
● Dynamic schema discovery – Apache Drill does not require a schema or type specification
for the data to start the query execution process. Instead, Drill starts processing the data
in units called record batches and discovers the schema on the fly during processing.
● Decentralized metadata – Unlike other SQL-on-Hadoop technologies, Drill does not
have a centralized metadata requirement. Drill users do not need to create and manage
tables in metadata to query data.
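
Drill's schema-free querying can also be exercised over its REST interface; the hedged sketch below posts a SQL statement that reads a JSON file directly, with the Drillbit address and file path as placeholders.

```python
# Sketch: running a schema-free SQL query through Apache Drill's REST API.
# The Drillbit address and the JSON file path are placeholders.
import requests

response = requests.post(
    "http://drillbit:8047/query.json",
    json={
        "queryType": "SQL",
        # Drill queries the file directly, with no table or schema registration.
        "query": "SELECT gender, COUNT(*) AS patients "
                 "FROM dfs.`/healthcare/ehr/raw/patients.json` GROUP BY gender",
    },
)
response.raise_for_status()
print(response.json().get("rows", []))
```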

2.11. Apache Mahout

Mahout is an open-source framework for creating scalable machine learning algorithms and data
mining libraries. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to
automatically find meaningful patterns in those big data sets.

Algorithms of Mahout are:

● Clustering – Takes items and organizes them into naturally occurring groups, such that
items belonging to the same group are similar to each other.
● Collaborative filtering – Mines user behavior and makes product recommendations
(e.g., Amazon recommendations).
● Classification – Learns from existing categorizations and then assigns unclassified
items to the best category.
● Frequent pattern mining – Analyzes items in a group (e.g., items in a shopping cart or
terms in a query session) and identifies which items typically appear together.
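
Since the problem statement pairs Mahout with Spark MLlib, the hedged MLlib sketch below illustrates the classification idea: a logistic regression model is trained on labelled vital signs and then used to score records. The feature columns and toy rows are invented purely for illustration.

```python
# Sketch: the "classification" idea above expressed with Spark MLlib (which the
# problem statement lists alongside Mahout). Features and rows are invented.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("disease-risk").getOrCreate()

rows = [(82.0, 97.0, 0.0), (110.0, 89.0, 1.0), (95.0, 93.0, 1.0), (70.0, 99.0, 0.0)]
df = spark.createDataFrame(rows, ["heart_rate", "spo2", "label"])

# Assemble the raw vitals into a single feature vector per patient.
assembler = VectorAssembler(inputCols=["heart_rate", "spo2"], outputCol="features")
train = assembler.transform(df)

# Learn from the existing labels, then score records (here, the same rows).
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(train).select("heart_rate", "spo2", "prediction").show()
spark.stop()
```
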
2.12. Apache Sqoop

Sqoop imports data from external sources into related Hadoop ecosystem components such as HDFS,
HBase, or Hive. It also exports data from Hadoop to other external sources. Sqoop works with
relational databases such as Teradata, Netezza, Oracle, and MySQL.


Fig. Apache Sqoop Diagram

Features of Apache Sqoop:

● Import sequential datasets from the mainframe – Sqoop satisfies the growing need to
move data from the mainframe to HDFS.
● Import directly to ORC files – Improves compression and lightweight indexing and
improves query performance.
● Parallel data transfer – For faster performance and optimal system utilization.
● Efficient data analysis – Improves the efficiency of data analysis by combining
structured and unstructured data in a schema-on-read data lake.
● Fast data copies – From an external system into Hadoop.
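
Sqoop is driven from the command line, so the hedged sketch below simply assembles and launches an import of a relational `patients` table into HDFS from Python; the JDBC URL, credentials file, and target directory are placeholders.

```python
# Sketch: importing a relational "patients" table into HDFS with Sqoop.
# Sqoop is a CLI tool, so Python only assembles and launches the command;
# the JDBC URL, credentials, and target directory are placeholders.
import subprocess

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/hospital",
        "--username", "etl_user", "--password-file", "/user/etl/.dbpass",
        "--table", "patients",
        "--target-dir", "/healthcare/ehr/raw/patients",
        "--num-mappers", "4",          # parallel data transfer
    ],
    check=True,
)
```
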
2.13. Apache Flume

Flume efficiently collects, aggregates, and moves large amounts of data from their origin into HDFS.
It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem component allows data to flow
from the source into the Hadoop environment. It uses a simple, extensible data model that allows for
online analytic applications. Using Flume, we can immediately get data from multiple servers into
Hadoop.


Fig. Apache Flume Diagram

Refer to Flume Comprehensive Guide for more details.

2.14. Ambari

Ambari, another Hadoop ecosystem component, is a management platform for provisioning,
managing, monitoring, and securing the Apache Hadoop cluster. Hadoop management gets
simpler as Ambari provides a consistent, secure platform for operational control.


Fig. Ambari Diagram

Features of Ambari:

● Simplified installation, configuration, and management – Ambari easily and
efficiently creates and manages large-scale clusters.
● Centralized security setup – Ambari reduces the complexity of administering and
configuring cluster security across the entire platform.
● Highly extensible and customizable – Ambari is highly extensible for bringing
custom services under management.
● Full visibility into cluster health – Ambari ensures that the cluster is healthy and
available with a holistic approach to monitoring.

2.15. Zookeeper

Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining
configuration information, naming, providing distributed synchronization, and providing group
services. Zookeeper manages and coordinates a large cluster of machines.


Fig. ZooKeeper Diagram

Features of Zookeeper:

● Fast – Zookeeper is fast with workloads where reads are more common than writes.
The ideal read/write ratio is 10:1.
● Ordered – Zookeeper maintains a record of all transactions.
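
A hedged `kazoo` sketch of this coordination service: one process publishes a small piece of shared configuration as a znode and reads it back; the ensemble address, znode path, and value are placeholders, and the znode is assumed not to exist yet.

```python
# Sketch: using ZooKeeper for shared configuration through the `kazoo` client.
# The ensemble address, znode path, and value are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")
zk.start()

# Publish a small piece of cluster-wide configuration...
zk.ensure_path("/healthcare/config")
zk.create("/healthcare/config/alert_threshold", b"120")

# ...and read it back from any other coordinated process.
value, stat = zk.get("/healthcare/config/alert_threshold")
print(value.decode(), stat.version)
zk.stop()
```
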
2.16. Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple
jobs sequentially into one logical unit of work. The Oozie framework is fully integrated with the
Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache
MapReduce, Pig, Hive, and Sqoop.

Fig. Oozie Diagram


In Oozie, users can create a Directed Acyclic Graph (DAG) of workflow actions, which can run in
parallel or sequentially in Hadoop. Oozie is scalable and can manage the timely execution of
thousands of workflows in a Hadoop cluster. Oozie is also very flexible: one can easily start, stop,
suspend, and rerun jobs, and it is even possible to skip a specific failed node or rerun it.

There are two basic types of Oozie jobs:

● Oozie workflow – Stores and runs workflows composed of Hadoop jobs, e.g.,
MapReduce, Pig, and Hive.
● Oozie Coordinator – Runs workflow jobs based on predefined schedules and the
availability of data.

This was all about the components of the Hadoop ecosystem.

3. Big Data in the Healthcare Industry

Fig. Big Data Adoption in Industry

As can be seen above, most industries are already using or shifting towards big data, including the
healthcare industry.
The healthcare industry uses it primarily for the following:
• Data Warehouse Optimization
• Patient Analysis
• Predictive Maintenance


• Hadoop uses the MapReduce algorithm to break a job into tasks that can be executed
independently on different nodes (DataNodes) of the cluster, after which the results are
combined into a single output.
• In our example, each record handled by the map function has the following format:
• <serial_number, name, drug_prescribed, gender, total_expenditure_on_prescribed_drugs>

Fig. MapReduce Processing of Patient Data

As can be seen above, the system groups the items having the same key and then provides the
requested output.

Patient data is stored in a centralized repository, which makes the system cost-effective by
reducing the number of storage warehouses and eliminating data redundancy, which in turn keeps
the system consistent.
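
A hedged Hadoop Streaming version of the patient-data job sketched above: the mapper emits the prescribed drug and its expenditure from each record in the <serial_number, name, drug_prescribed, gender, total_expenditure_on_prescribed_drugs> format, and the reducer totals the expenditure per drug. The script and file names are placeholders.

```python
#!/usr/bin/env python3
# Sketch of the patient-data job above as a Hadoop Streaming program.
# Records follow the format <serial_number, name, drug_prescribed, gender,
# total_expenditure_on_prescribed_drugs>; file names are placeholders.
import sys

def mapper():
    # Emit <drug_prescribed, expenditure> for every patient record.
    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) == 5:
            print(f"{fields[2]}\t{fields[4]}")

def reducer():
    # Keys arrive sorted, so expenditure can be totalled per drug.
    current, total = None, 0.0
    for line in sys.stdin:
        drug, spent = line.rstrip("\n").split("\t")
        if drug != current:
            if current is not None:
                print(f"{current}\t{total:.2f}")
            current, total = drug, 0.0
        total += float(spent)
    if current is not None:
        print(f"{current}\t{total:.2f}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

The job would be launched with the Hadoop Streaming jar, e.g. `hadoop jar hadoop-streaming.jar -input /healthcare/ehr/raw -output /healthcare/ehr/drug_totals -mapper "patient_job.py map" -reducer "patient_job.py reduce"`, where the jar name and paths are placeholders.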


CONCLUSION

We have covered all the Hadoop ecosystem components in detail, and these components are what
empower Hadoop's functionality. The Hadoop ecosystem is mainly designed to store and process huge
data sets that exhibit at least two of the three factors: volume, velocity, and variety. It stores data in a
distributed system that runs on commodity hardware. Considering the full Hadoop ecosystem process,
HDFS distributes the data blocks, and MapReduce provides the programming framework to read data
from files stored in HDFS.



Submitted By:

Manasi Shivarkar
Roll no.: T213049
Div.: TE-C

Date:

Marks obtained:
Sign of course coordinator:

Name of course Coordinator:
