Case Study DSBDA Report Final
Nowadays, the healthcare industry is seeing a significant rise in the number of doctors, patients, drugs, and medicines, yet the analysis and prediction of future health conditions are still at a developing stage. The data involved, once only a small amount, has grown from a few bytes to terabytes, and with the growth in storage comes the burden of dataset maintenance. Traditional data mining and diagnosis tools are difficult to apply at this scale, so the need for big data tools and techniques arises. Big data and big data analytics are rising technologies, and primary sets of such data are being created in medical and healthcare contexts. There is no ideal method to measure patient satisfaction. This paper presents ideas and a methodology that apply data mining techniques such as clustering to these datasets and feed them into the big data tool Hadoop for effective analysis of healthcare data. Hadoop has gained popularity due to its ability to store and access large amounts of data quickly and cost-effectively through clusters of commodity hardware.
INTRODUCTION
The healthcare industry is one of the world's largest and most extensive ventures. Over recent years, healthcare administration around the globe has been changing from a disease-focused, volume-based model to a patient-focused model. Improving the quality of healthcare while reducing its cost is the guiding principle behind the growing movement towards a value-based healthcare delivery model and patient-focused care. The volume of, and demand for, big data in healthcare organizations is growing little by little. To give effective patient-focused care, it is essential to manage and analyze these huge data sets. Traditional methods are obsolete and no longer adequate to process such enormous information, since the variety and volume of data sources have expanded at a very high rate over the previous two decades. There is a requirement for new and creative tools and methods that can meet and surpass the challenge of managing the huge amount of data being generated by the healthcare sector. The healthcare delivery system is collaborative in nature.
This is because it comprises a substantial number of stakeholders, such as doctors with different specializations, medical caretakers, research-center technologists, and other individuals who cooperate to accomplish the shared objectives of decreasing medical costs and errors while giving a quality healthcare experience. Each of these stakeholders produces information from heterogeneous sources, for example physical examinations, clinical notes, patient meetings and observations, laboratory tests, imaging reports, medications, treatments, surveys, bills, and insurance. The rate at which information is generated from the heterogeneous sources of various healthcare departments has increased exponentially on a daily basis. Therefore, it is becoming hard to store, process, and analyze this interrelated information with traditional data-handling applications. Nonetheless, new and efficient methods and systems, providing great processing advancements, are emerging to store, process, analyze, and extract value from the voluminous and heterogeneous medical information being generated continuously.
PROBLEM STATEMENT
Write a case study on data-driven processing for Digital Marketing or Healthcare systems with the Hadoop Ecosystem components shown below.
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database (Provides real-time reads and writes)
● Mahout, Spark MLLib: (Provides analytical tools) Machine Learning algorithm libraries
● Solr, Lucene: Searching and Indexing
Objective:
Our objective is to process and analyze healthcare data using various Hadoop components.
Let’s delve into how each component contributes to our healthcare use case:
HDFS (Hadoop Distributed File System): Store diverse healthcare data, such as electronic
health records (EHRs), medical images (X-rays, MRIs), and genomic sequences.
HBase: Store real-time patient data, such as vital signs or telemetry data.
Mahout and Spark MLLib: Apply machine learning algorithms to predict disease outcomes,
recommend treatments, or personalize patient care.
Solr and Lucene: Build search indexes for medical literature, research papers, and clinical
guidelines.
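For the searching and indexing item above, a hedged sketch of how clinical guidelines might be indexed and queried through Solr from Python (using the third-party pysolr client) is shown below; the Solr URL, core name, and documents are assumptions for illustration.

    # Sketch: index and search clinical documents in Solr via pysolr (hypothetical core and URL).
    import pysolr  # pip install pysolr

    solr = pysolr.Solr("http://solr.example.com:8983/solr/clinical_docs", timeout=10)

    # Index a couple of guideline documents; Lucene builds the inverted index underneath.
    solr.add([
        {"id": "guide-001", "title": "Hypertension management", "body": "First-line therapy ..."},
        {"id": "guide-002", "title": "Diabetes screening", "body": "Screening intervals ..."},
    ])
    solr.commit()

    # Full-text search across the indexed body field.
    for doc in solr.search("body:hypertension"):
        print(doc["id"], doc["title"])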
Outcome:
By integrating the Hadoop ecosystem into healthcare systems, we achieve the following
outcomes:
Real-time patient monitoring and alerts.
Personalized treatment recommendations.
Early disease detection.
Streamlined research collaboration.
INTRODUCTION TO HADOOP ECOSYSTEM
2.1. HDFS
HDFS is the most important component of the Hadoop ecosystem and is Hadoop's primary storage system. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for big data. HDFS is a distributed file system that runs on commodity hardware. Its default configuration suits many installations, but large clusters usually need additional configuration. Hadoop interacts directly with HDFS through shell-like commands.
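Besides the shell commands, applications can also talk to HDFS programmatically. The snippet below is a minimal sketch using the third-party Python hdfs (WebHDFS) client; the NameNode URL, user, and file paths are assumptions for illustration.

    # Sketch: upload an EHR export to HDFS via WebHDFS (hypothetical host and paths).
    from hdfs import InsecureClient  # pip install hdfs

    # Assumed NameNode WebHDFS endpoint; adjust host/port for your cluster.
    client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

    # Copy a local CSV of patient records into HDFS.
    client.upload("/healthcare/ehr/patients.csv", "patients.csv", overwrite=True)

    # List what is stored under the healthcare directory.
    print(client.list("/healthcare/ehr"))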
HDFS Components:
There are two major components of Hadoop HDFS – NameNode and DataNode. Let's now discuss these HDFS components.
i. NameNode
It is also known as the Master node. The NameNode does not store the actual data or datasets; it stores metadata, i.e., the number of blocks, their locations, the rack and DataNode on which the data is stored, and other details. It keeps track of the files and directories in the file system.
Tasks of HDFS NameNode
● Manages the file system namespace.
● Regulates clients' access to files.
● Executes file system operations such as opening, closing, and renaming files and directories.
ii. DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS and performs operations as per the requests of clients. Each block replica on a DataNode consists of two files on the local file system: the first file holds the data and the second records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake in which the namespace ID and software version of the DataNode are verified. If a mismatch is found, the DataNode shuts down automatically.
Tasks of HDFS DataNode
● DataNode performs operations like block replica creation, deletion, and replication according to the instruction of the NameNode.
● DataNode manages the data storage of the system.
2.2. MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component that provides data processing. MapReduce is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System.
MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis across multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
Fig. Hadoop MapReduce
Working of MapReduce
The Hadoop ecosystem component MapReduce works by breaking the processing into two phases: Map and Reduce.
The map function takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce function takes the output of the map phase as its input and combines those data tuples based on the key, modifying the value of the key accordingly.
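To make the key/value flow concrete, here is a small self-contained Python sketch (plain Python, not Hadoop itself) that mimics the map, shuffle/group, and reduce steps on an in-memory list of symptom strings:

    # Toy illustration of MapReduce semantics in plain Python (no Hadoop involved).
    from itertools import groupby
    from operator import itemgetter

    records = ["fever cough", "fever headache", "cough"]

    # Map: emit (key, value) pairs - here (symptom, 1).
    mapped = [(word, 1) for line in records for word in line.split()]

    # Shuffle/sort: group pairs by key, as the framework would between the phases.
    mapped.sort(key=itemgetter(0))

    # Reduce: combine the values for each key - here, sum the counts.
    reduced = {key: sum(v for _, v in group) for key, group in groupby(mapped, key=itemgetter(0))}
    print(reduced)  # {'cough': 2, 'fever': 2, 'headache': 1}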
Features of MapReduce:
● Simplicity – MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, and Python.
● Scalability – MapReduce can process petabytes of data.
● Speed – With parallel processing, problems that would take days to solve are solved in hours or minutes by MapReduce.
● Fault tolerance – MapReduce takes care of failures. If one copy of the data is unavailable, another machine holds a copy of the same key/value pairs, which can be used to solve the same subtask.
2.3. YARN
Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that provides resource management and is one of the most important components of the Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads. It allows multiple data-processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.
YARN has been projected as the data operating system of Hadoop 2. The main features of YARN are:
● Flexibility – Enables purpose-built data-processing models beyond MapReduce (batch), such as interactive and streaming models. Due to this feature of YARN, other applications can run alongside MapReduce programs in Hadoop 2.
● Efficiency – Many applications run on the same cluster, so the efficiency of Hadoop increases without much effect on the quality of service.
● Shared – Provides a stable, reliable, and secure foundation and shared operational services across multiple workloads. Additional programming models, such as graph processing and iterative modeling, become possible for data processing.
2.4. Hive
Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
● Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.
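As an illustration of how a HiveQL query can be submitted from an application, the sketch below uses the third-party PyHive client; the HiveServer2 host, database, and patient_prescriptions table are hypothetical.

    # Sketch: run a HiveQL query from Python via PyHive (hypothetical host and table).
    from pyhive import hive  # pip install pyhive

    conn = hive.Connection(host="hiveserver.example.com", port=10000, database="healthcare")
    cursor = conn.cursor()

    # HiveQL is translated into MapReduce (or Tez/Spark) jobs on the cluster.
    cursor.execute(
        "SELECT drug_prescribed, AVG(total_expenditure) AS avg_cost "
        "FROM patient_prescriptions GROUP BY drug_prescribed"
    )
    for drug, avg_cost in cursor.fetchall():
        print(drug, avg_cost)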
2.5. Apache Pig
Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS. Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language, which is very similar to SQL. It loads the data, applies the required filters, and dumps the data in the required format. For program execution, Pig requires a Java runtime environment.
● Extensibility – For carrying out special-purpose processing, users can create their own functions.
● Optimization opportunities – Pig optimizes execution automatically, which allows the user to pay attention to semantics instead of efficiency.
● Handles all kinds of data – Pig analyzes both structured and unstructured data.
2.6. HBase
Apache HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access for reading and writing data in HDFS.
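As a sketch of real-time reads and writes of patient vitals, the snippet below uses the third-party happybase client, which talks to HBase through its Thrift server; the host, table name, and column family are assumptions.

    # Sketch: write and read a patient's vital signs in HBase via happybase (hypothetical names).
    import happybase  # pip install happybase; requires the HBase Thrift server

    connection = happybase.Connection("hbase-thrift.example.com")
    table = connection.table("patient_vitals")

    # Write: row key = patient id, columns live in the 'vitals' column family.
    table.put(b"patient-1001", {b"vitals:heart_rate": b"82", b"vitals:spo2": b"97"})

    # Read the row back in real time.
    row = table.row(b"patient-1001")
    print(row[b"vitals:heart_rate"])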
Fig. HBase Diagram
Components of HBase
There are two HBase components, namely HBase Master and RegionServer.
i. HBase Master
It is not part of the actual data storage, but it negotiates load balancing across all RegionServers.
● It maintains and monitors the Hadoop cluster.
ii. RegionServer
It is the worker node that handles read, write, update, and delete requests from clients. The RegionServer process runs on every node in the Hadoop cluster, on the HDFS DataNode.
2.7. HCatalog
It is a table and storage management layer for Hadoop. HCatalog enables the different components of the Hadoop ecosystem, such as MapReduce, Hive, and Pig, to easily read and write data from the cluster. HCatalog is a key component of Hive that enables users to store their data in any format and structure.
By default, HCatalog supports RCFile, CSV, JSON, SequenceFile, and ORC file formats.
Benefits of HCatalog:
● With its table abstraction, HCatalog frees users from the overhead of knowing where or in what format their data is stored.
● It enables notifications of data availability to downstream tools.
2.8. Avro
Avro is a part of the Hadoop ecosystem and a widely used data serialization system. Avro is an open-source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently. With Avro, big data programs written in different languages can exchange data.
Using the serialization service, programs can serialize data into files or messages. Avro stores the data definition and the data together in one message or file, making it easy for programs to dynamically understand the information stored in Avro files or messages.
Avro schema – Avro relies on schemas for serialization and deserialization and requires the schema for writing and reading data. When Avro data is stored in a file, its schema is stored with it, so that the file can be processed later by any program.
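To illustrate schema-based serialization, here is a small sketch using the third-party fastavro library; the patient schema, sample record, and file name are made up for the example.

    # Sketch: serialize and read back records with an Avro schema (fastavro, hypothetical schema).
    from fastavro import writer, reader, parse_schema  # pip install fastavro

    schema = parse_schema({
        "type": "record",
        "name": "Patient",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "drug_prescribed", "type": "string"},
            {"name": "total_expenditure", "type": "double"},
        ],
    })

    records = [{"name": "A. Kumar", "drug_prescribed": "metformin", "total_expenditure": 120.5}]

    # The schema is written into the file alongside the data.
    with open("patients.avro", "wb") as out:
        writer(out, schema, records)

    # Any program can read the file later because the schema travels with it.
    with open("patients.avro", "rb") as inp:
        for rec in reader(inp):
            print(rec["name"], rec["total_expenditure"])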
2.9. Thrift
Apache Thrift is a software framework for scalable, cross-language service development; within the Hadoop ecosystem it is mainly used for RPC communication between services (HBase's Thrift gateway is one example).
2.10. Apache Drill
The main purpose of this Hadoop ecosystem component is large-scale data processing, including structured and semi-structured data. Drill is a low-latency distributed query engine designed to scale to several thousands of nodes and to query petabytes of data. Drill is the first distributed SQL query engine with a schema-free model.
Drill has become an invaluable tool at Cardlytics, a company that provides consumer purchase data for mobile and internet banking. Cardlytics uses Drill to quickly process trillions of records and execute queries.
Drill has a specialized memory management system that eliminates garbage collection and optimizes memory allocation and usage. Drill also plays well with Hive, allowing developers to reuse their existing Hive deployments.
● Extensibility – Drill provides an extensible architecture at all layers, including the query layer, query optimization, and the client API. Any layer can be extended for the specific needs of an organization.
● Flexibility – Drill provides a hierarchical columnar data model that can represent complex, highly dynamic data and allows efficient processing.
● Dynamic schema discovery – Apache Drill does not require a schema or type specification for the data before starting query execution. Instead, Drill processes the data in units called record batches and discovers the schema on the fly during processing.
● Decentralized metadata – Unlike other SQL-on-Hadoop technologies, Drill does not have a centralized metadata requirement. Drill users do not need to create and manage tables in a metadata store in order to query data.
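As a hedged sketch of issuing a schema-free query from Python, the snippet below posts SQL to what is assumed to be Drill's REST endpoint (default port 8047); the host, file path, and field names are illustrative, not part of any specific deployment.

    # Sketch: query Drill over its REST interface (assumed endpoint, host, and data path).
    import requests  # pip install requests

    drill_url = "http://drill.example.com:8047/query.json"
    payload = {
        "queryType": "SQL",
        # Drill can query raw files without pre-defining a schema.
        "query": "SELECT drug_prescribed, COUNT(*) AS prescriptions "
                 "FROM dfs.`/healthcare/ehr/patients.json` GROUP BY drug_prescribed",
    }

    response = requests.post(drill_url, json=payload, timeout=60)
    for row in response.json().get("rows", []):
        print(row)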
2.11. Mahout
Mahout is an open-source framework for creating scalable machine learning algorithms and data mining libraries. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.
● Clustering – Takes items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
● Collaborative filtering – Mines user behavior and makes product recommendations (e.g., Amazon recommendations).
● Classification – Learns from existing categorization and then assigns unclassified items to the best category.
● Frequent pattern mining – Analyzes items in a group (e.g., items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
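The problem statement also lists Spark MLlib alongside Mahout for predictive analytics. A minimal PySpark sketch of training a classifier on hypothetical patient features to predict a disease outcome could look like this; the toy data and column names are invented for illustration.

    # Sketch: disease-outcome classification with Spark MLlib (hypothetical toy data).
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("healthcare-ml").getOrCreate()

    # Toy patient records: (age, bmi, blood_pressure, label) - label 1 means disease present.
    data = spark.createDataFrame(
        [(45, 28.0, 130, 1), (30, 22.5, 118, 0), (62, 31.2, 150, 1), (25, 20.1, 110, 0)],
        ["age", "bmi", "blood_pressure", "label"],
    )

    # Assemble raw columns into the single feature vector that MLlib expects.
    assembler = VectorAssembler(inputCols=["age", "bmi", "blood_pressure"], outputCol="features")
    train_df = assembler.transform(data)

    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_df)

    # Predict on the training data just to show the flow.
    model.transform(train_df).select("age", "prediction").show()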
2.12. Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem components such as HDFS, HBase, or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
● Import sequential datasets from the mainframe – Sqoop satisfies the growing need to move data from the mainframe to HDFS.
● Import directly to ORC files – Improves compression and lightweight indexing, and improves query performance.
● Parallel data transfer – For faster performance and optimal system utilization.
● Efficient data analysis – Improves the efficiency of data analysis by combining structured and unstructured data in a schema-on-read data lake.
● Fast data copies – From an external system into Hadoop.
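Sqoop itself is driven from the command line; as a sketch, the import below (wrapped in Python's subprocess for consistency with the other examples) pulls a hypothetical patients table from MySQL into HDFS with four parallel mappers. The connection string, credentials, and paths are assumptions.

    # Sketch: launch a Sqoop import of a relational table into HDFS (hypothetical DB and paths).
    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com/hospital",   # source RDBMS
        "--table", "patients",                                 # table to import
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",
        "--target-dir", "/healthcare/raw/patients",            # HDFS destination
        "--num-mappers", "4",                                  # parallel data transfer
    ], check=True)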
2.13. Apache Flume
Flume efficiently collects, aggregates, and moves large amounts of data from their origin and sends them into HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem component allows data to flow from the source into the Hadoop environment. It uses a simple, extensible data model that allows for online analytic applications. Using Flume, we can immediately get data from multiple servers into Hadoop.
Fig. Apache Flume
2.14. Ambari
Apache Ambari is a management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters through a consistent web interface.
Features of Ambari:
● Simplified installation, configuration, and management of Hadoop clusters.
● Centralized security setup.
● Full visibility into cluster health.
2.15. Zookeeper
Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining
configuration information, naming, providing distributed synchronization, and providing group
services. Zookeeper manages and coordinates a large cluster of machines.
Features of Zookeeper:
● Fast – Zookeeper is fast for workloads where reads of the data are more common than writes. The ideal read/write ratio is 10:1.
● Ordered – Zookeeper maintains a record of all transactions.
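As a sketch of how an application coordinates shared configuration through ZooKeeper, the snippet below uses the third-party kazoo client; the connection string and znode path are illustrative.

    # Sketch: store and read a shared configuration value in ZooKeeper via kazoo (hypothetical path).
    from kazoo.client import KazooClient  # pip install kazoo

    zk = KazooClient(hosts="zk1.example.com:2181")
    zk.start()

    # Create a znode holding a config value if it does not already exist.
    zk.ensure_path("/healthcare/config")
    if not zk.exists("/healthcare/config/alert_threshold"):
        zk.create("/healthcare/config/alert_threshold", b"120")

    value, _stat = zk.get("/healthcare/config/alert_threshold")
    print(value.decode())

    zk.stop()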
2.16. Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.
In Oozie, users can create a Directed Acyclic Graph (DAG) of workflow actions, which can run in parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely execution of thousands of workflows in a Hadoop cluster. Oozie is also very flexible: one can easily start, stop, suspend, and rerun jobs, and it is even possible to skip a specific failed node or rerun it.
● Oozie workflow – Stores and runs workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
● Oozie Coordinator – Runs workflow jobs based on predefined schedules and the availability of data.
As can be seen above, most industries are already using or shifting towards big data, including the healthcare industry.
The healthcare industry uses it primarily for the following:
• Data Warehouse Optimization
• Patient Analysis
• Predictive Maintenance
• Hadoop uses the MapReduce model to create jobs that are split into tasks, which are executed independently on different nodes (DataNodes), while the results are combined in the reduce phase to produce the final output.
• In our example, each input record of the map phase has the following format:
• <serial_number, name, drug_prescribed, gender, total_expenditure_on_prescribed_drugs>
The system groups the items having the same key (here, the prescribed drug) and finally provides the requested output, as sketched below.
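A hedged sketch of this job as two Hadoop Streaming scripts in Python follows; the field order matches the record format above, while the script names and paths are illustrative. The mapper emits <drug, expenditure> pairs and the reducer sums the expenditure per drug, relying on Hadoop's shuffle to deliver identical keys together.

    # ---- mapper.py : emit <drug_prescribed, expenditure> for every input record ----
    # Input line: serial_number,name,drug_prescribed,gender,total_expenditure_on_prescribed_drugs
    import sys

    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) == 5:
            print(f"{fields[2]}\t{fields[4]}")

    # ---- reducer.py : sum expenditure per drug (keys arrive grouped after the shuffle) ----
    import sys

    current_drug, total = None, 0.0
    for line in sys.stdin:
        drug, expenditure = line.rstrip("\n").split("\t")
        if drug != current_drug and current_drug is not None:
            print(f"{current_drug}\t{total:.2f}")
            total = 0.0
        current_drug = drug
        total += float(expenditure)
    if current_drug is not None:
        print(f"{current_drug}\t{total:.2f}")

In practice these scripts would be submitted with the hadoop-streaming jar, passing them as the -mapper and -reducer options and reading the patient records from HDFS.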
Patient data is stored in a centralized repository, which makes the system cost-effective by reducing the number of storage warehouses and eliminating data redundancy, which in turn keeps the system consistent.
CONCLUSION
We have covered the major Hadoop ecosystem components in detail. These Hadoop ecosystem components empower Hadoop's functionality. The Hadoop ecosystem is mainly designed to store and process huge data characterized by volume, velocity, and variety. It stores data in a distributed file system that runs on commodity hardware. Considering the full Hadoop ecosystem process, HDFS distributes the data blocks and MapReduce provides the programming model to process them in parallel.
REFERENCES
Submitted By:
Manasi Shivarkar
Roll no.: T213049
Div.: TE-C
Date:
Marks obtained:
Sign of course coordinator: