Case Study DSBDA Report Final
Nowadays, the healthcare industry is seeing a significant rise in the number of doctors, patients, drugs, and medicines, yet the analysis and prediction of future health conditions are still at a developing stage. The data involved, once only a small amount, has grown from a few bytes to terabytes, and with the growth in storage comes the burden of dataset maintenance. Traditional data mining and diagnosis tools are difficult to apply at this scale, so the need for big data tools and techniques arises. Big data and big data analytics are rising technologies, and primary sets of such data are being created in medical and healthcare contexts. There is no ideal method to measure patient satisfaction. This paper presents ideas and a methodology that apply data mining techniques such as clustering to these datasets and feed them into the big data tool Hadoop for effective analysis of healthcare data. Hadoop has gained popularity due to its ability to store and access large amounts of data quickly and cost-effectively through clusters of commodity hardware.
INTRODUCTION
The healthcare industry is one of the world's largest and most extensive ventures. Over recent years, healthcare administration around the globe has been changing from a disease-focused, volume-based model to a patient-focused model. Improving the quality of healthcare while reducing its cost is the guiding principle behind the growing movement towards a value-based healthcare delivery model and patient-focused care. The volume of, and demand for, big data in healthcare organizations is growing little by little. To give effective patient-focused care, it is essential to manage and analyze these huge data sets. Traditional methods are obsolete and no longer adequate to process such enormous information, since the variety and volume of data sources have expanded at a very high rate over the previous two decades. There is a requirement for new and creative tools and methods that can meet and surpass the challenge of managing the huge amount of data being generated by the healthcare sector. The healthcare delivery system is collaborative in nature.
This is because it comprises a substantial number of stakeholders, such as doctors with different specializations, medical caretakers, research-center technologists, and other individuals who cooperate to accomplish the shared objectives of decreasing medical costs and errors while giving a quality healthcare experience. Each of these stakeholders produces information from heterogeneous sources, for example physical examinations, clinical notes, patient meetings and observations, laboratory tests, imaging reports, medications, treatments, surveys, bills, and insurance. The rate at which information is generated from the heterogeneous sources of various healthcare departments has increased exponentially on a daily basis. Therefore, it is becoming hard to store, process, and analyze this interrelated information with traditional data-handling applications. Nonetheless, new and efficient methods and systems, providing great processing advancements, are emerging to store, process, analyze, and extract value from the voluminous and heterogeneous medical information being generated continuously.
PROBLEM STATEMENT
Write a case study on data-driven processing for Digital Marketing or Healthcare systems with the Hadoop Ecosystem components shown below.
● HDFS: Hadoop Distributed File System
● YARN: Yet Another Resource Negotiator
● MapReduce: Programming based Data Processing
● Spark: In-Memory data processing
● PIG, HIVE: Query based processing of data services
● HBase: NoSQL Database (Provides real-time reads and writes)
● Mahout, Spark MLLib: (Provides analytical tools) Machine Learning algorithm libraries
● Solr, Lucene: Searching and Indexing
Objective:
Our objective is to process and analyze healthcare data using various Hadoop components.
Let’s delve into how each component contributes to our healthcare use case:
HDFS (Hadoop Distributed File System): Store diverse healthcare data, such as electronic
health records (EHRs), medical images (X-rays, MRIs), and genomic sequences.
HBase: Store real-time patient data, such as vital signs or telemetry data.
Mahout and Spark MLLib: Apply machine learning algorithms to predict disease outcomes,
recommend treatments, or personalize patient care.
Solr and Lucene: Build search indexes for medical literature, research papers, and clinical
guidelines.
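For the searching and indexing item above, a hedged sketch of how clinical guidelines might be indexed and queried through Solr from Python (using the third-party pysolr client) is shown below; the Solr URL, core name, and documents are assumptions for illustration.

    # Sketch: index and search clinical documents in Solr via pysolr (hypothetical core and URL).
    import pysolr  # pip install pysolr

    solr = pysolr.Solr("http://solr.example.com:8983/solr/clinical_docs", timeout=10)

    # Index a couple of guideline documents; Lucene builds the inverted index underneath.
    solr.add([
        {"id": "guide-001", "title": "Hypertension management", "body": "First-line therapy ..."},
        {"id": "guide-002", "title": "Diabetes screening", "body": "Screening intervals ..."},
    ])
    solr.commit()

    # Full-text search across the indexed body field.
    for doc in solr.search("body:hypertension"):
        print(doc["id"], doc["title"])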
Outcome:
By integrating the Hadoop ecosystem into healthcare systems, we achieve the following
outcomes:
Real-time patient monitoring and alerts.
Personalized treatment recommendations.
Early disease detection.
Streamlined research collaboration.
INTRODUCTION TO HADOOP ECOSYSTEM
2.1. HDFS
HDFS is the most important component of the Hadoop ecosystem and is Hadoop's primary storage system. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for big data. HDFS is a distributed file system that runs on commodity hardware. Its default configuration suits many installations, but large clusters usually need additional configuration. Hadoop interacts directly with HDFS through shell-like commands.
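Besides the shell commands, applications can also talk to HDFS programmatically. The snippet below is a minimal sketch using the third-party Python hdfs (WebHDFS) client; the NameNode URL, user, and file paths are assumptions for illustration.

    # Sketch: upload an EHR export to HDFS via WebHDFS (hypothetical host and paths).
    from hdfs import InsecureClient  # pip install hdfs

    # Assumed NameNode WebHDFS endpoint; adjust host/port for your cluster.
    client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

    # Copy a local CSV of patient records into HDFS.
    client.upload("/healthcare/ehr/patients.csv", "patients.csv", overwrite=True)

    # List what is stored under the healthcare directory.
    print(client.list("/healthcare/ehr"))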
HDFS Components:
There are two major components of Hadoop HDFS – NameNode and DataNode. Let's now discuss these HDFS components.
i. NameNode
It is also known as the Master node. The NameNode does not store the actual data or datasets; it stores metadata, i.e., the number of blocks, their locations, the rack and DataNode on which the data is stored, and other details. It keeps track of the files and directories in the file system.
Tasks of HDFS NameNode
● Manages the file system namespace.
● Regulates clients' access to files.
● Executes file system operations such as opening, closing, and renaming files and directories.
ii. DataNode
It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS and performs operations as per the requests of clients. Each block replica on a DataNode consists of two files on the local file system: the first file holds the data and the second records the block's metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake in which the namespace ID and software version of the DataNode are verified. If a mismatch is found, the DataNode shuts down automatically.
Tasks of HDFS DataNode
● DataNode performs operations like block replica creation, deletion, and replication according to the instruction of the NameNode.
● DataNode manages the data storage of the system.
2.2. MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component that provides data processing. MapReduce is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System.
MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis across multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
Fig. Hadoop MapReduce
Working of MapReduce
The Hadoop ecosystem component MapReduce works by breaking the processing into two phases: Map and Reduce.
The map function takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce function takes the output of the map phase as its input and combines those data tuples based on the key, modifying the value of the key accordingly.
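To make the key/value flow concrete, here is a small self-contained Python sketch (plain Python, not Hadoop itself) that mimics the map, shuffle/group, and reduce steps on an in-memory list of symptom strings:

    # Toy illustration of MapReduce semantics in plain Python (no Hadoop involved).
    from itertools import groupby
    from operator import itemgetter

    records = ["fever cough", "fever headache", "cough"]

    # Map: emit (key, value) pairs - here (symptom, 1).
    mapped = [(word, 1) for line in records for word in line.split()]

    # Shuffle/sort: group pairs by key, as the framework would between the phases.
    mapped.sort(key=itemgetter(0))

    # Reduce: combine the values for each key - here, sum the counts.
    reduced = {key: sum(v for _, v in group) for key, group in groupby(mapped, key=itemgetter(0))}
    print(reduced)  # {'cough': 2, 'fever': 2, 'headache': 1}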
Features of MapReduce:
● Simplicity – MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, and Python.
● Scalability – MapReduce can process petabytes of data.
● Speed – With parallel processing, problems that would take days to solve are solved in hours or minutes by MapReduce.
● Fault tolerance – MapReduce takes care of failures. If one copy of the data is unavailable, another machine holds a copy of the same key/value pairs, which can be used to solve the same subtask.
2.3. YARN
Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that provides resource management and is one of the most important components of the Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads. It allows multiple data-processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.
YARN has been projected as the data operating system of Hadoop 2. The main features of YARN are:
● Flexibility – Enables purpose-built data-processing models beyond MapReduce (batch), such as interactive and streaming models. Due to this feature of YARN, other applications can run alongside MapReduce programs in Hadoop 2.
● Efficiency – Many applications run on the same cluster, so the efficiency of Hadoop increases without much effect on the quality of service.
● Shared – Provides a stable, reliable, and secure foundation and shared operational services across multiple workloads. Additional programming models, such as graph processing and iterative modeling, become possible for data processing.
2.4. Hive
Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.
● Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.
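As an illustration of how a HiveQL query can be submitted from an application, the sketch below uses the third-party PyHive client; the HiveServer2 host, database, and patient_prescriptions table are hypothetical.

    # Sketch: run a HiveQL query from Python via PyHive (hypothetical host and table).
    from pyhive import hive  # pip install pyhive

    conn = hive.Connection(host="hiveserver.example.com", port=10000, database="healthcare")
    cursor = conn.cursor()

    # HiveQL is translated into MapReduce (or Tez/Spark) jobs on the cluster.
    cursor.execute(
        "SELECT drug_prescribed, AVG(total_expenditure) AS avg_cost "
        "FROM patient_prescriptions GROUP BY drug_prescribed"
    )
    for drug, avg_cost in cursor.fetchall():
        print(drug, avg_cost)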
2.5. Apache Pig
Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS. Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language, which is very similar to SQL. It loads the data, applies the required filters, and dumps the data in the required format. For program execution, Pig requires a Java runtime environment.
● Extensibility – For carrying out special-purpose processing, users can create their own functions.
● Optimization opportunities – Pig optimizes execution automatically, which allows the user to pay attention to semantics instead of efficiency.
● Handles all kinds of data – Pig analyzes both structured and unstructured data.
2.6. HBase
Apache HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access for reading and writing data in HDFS.
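As a sketch of real-time reads and writes of patient vitals, the snippet below uses the third-party happybase client, which talks to HBase through its Thrift server; the host, table name, and column family are assumptions.

    # Sketch: write and read a patient's vital signs in HBase via happybase (hypothetical names).
    import happybase  # pip install happybase; requires the HBase Thrift server

    connection = happybase.Connection("hbase-thrift.example.com")
    table = connection.table("patient_vitals")

    # Write: row key = patient id, columns live in the 'vitals' column family.
    table.put(b"patient-1001", {b"vitals:heart_rate": b"82", b"vitals:spo2": b"97"})

    # Read the row back in real time.
    row = table.row(b"patient-1001")
    print(row[b"vitals:heart_rate"])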
Fig. HBase Diagram
Components of HBase
There are two HBase components, namely HBase Master and RegionServer.
i. HBase Master
It is not part of the actual data storage, but it negotiates load balancing across all RegionServers.
● It maintains and monitors the Hadoop cluster.
ii. RegionServer
It is the worker node that handles read, write, update, and delete requests from clients. The RegionServer process runs on every node in the Hadoop cluster, on the HDFS DataNode.
2.7. HCatalog
It is a table and storage management layer for Hadoop. HCatalog enables the different components of the Hadoop ecosystem, such as MapReduce, Hive, and Pig, to easily read and write data from the cluster. HCatalog is a key component of Hive that enables users to store their data in any format and structure.
By default, HCatalog supports RCFile, CSV, JSON, SequenceFile, and ORC file formats.
Benefits of HCatalog:
● With its table abstraction, HCatalog frees users from the overhead of knowing where or in what format their data is stored.
● It enables notifications of data availability to downstream tools.
2.8. Avro
Avro is a part of the Hadoop ecosystem and a widely used data serialization system. Avro is an open-source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently. With Avro, big data programs written in different languages can exchange data.
Using the serialization service, programs can serialize data into files or messages. Avro stores the data definition and the data together in one message or file, making it easy for programs to dynamically understand the information stored in Avro files or messages.
Avro schema – Avro relies on schemas for serialization and deserialization and requires the schema for writing and reading data. When Avro data is stored in a file, its schema is stored with it, so that the file can be processed later by any program.
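To illustrate schema-based serialization, here is a small sketch using the third-party fastavro library; the patient schema, sample record, and file name are made up for the example.

    # Sketch: serialize and read back records with an Avro schema (fastavro, hypothetical schema).
    from fastavro import writer, reader, parse_schema  # pip install fastavro

    schema = parse_schema({
        "type": "record",
        "name": "Patient",
        "fields": [
            {"name": "name", "type": "string"},
            {"name": "drug_prescribed", "type": "string"},
            {"name": "total_expenditure", "type": "double"},
        ],
    })

    records = [{"name": "A. Kumar", "drug_prescribed": "metformin", "total_expenditure": 120.5}]

    # The schema is written into the file alongside the data.
    with open("patients.avro", "wb") as out:
        writer(out, schema, records)

    # Any program can read the file later because the schema travels with it.
    with open("patients.avro", "rb") as inp:
        for rec in reader(inp):
            print(rec["name"], rec["total_expenditure"])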
2.9. Thrift
Apache Thrift is a software framework for scalable, cross-language service development; within the Hadoop ecosystem it is mainly used for RPC communication between services (HBase's Thrift gateway is one example).
2.10. Apache Drill
The main purpose of this Hadoop ecosystem component is large-scale data processing, including structured and semi-structured data. Drill is a low-latency distributed query engine designed to scale to several thousands of nodes and to query petabytes of data. Drill is the first distributed SQL query engine with a schema-free model.
Drill has become an invaluable tool at Cardlytics, a company that provides consumer purchase data for mobile and internet banking. Cardlytics uses Drill to quickly process trillions of records and execute queries.
Drill has a specialized memory management system that eliminates garbage collection and optimizes memory allocation and usage. Drill also plays well with Hive, allowing developers to reuse their existing Hive deployments.
● Extensibility – Drill provides an extensible architecture at all layers, including the query layer, query optimization, and the client API. Any layer can be extended for the specific needs of an organization.
● Flexibility – Drill provides a hierarchical columnar data model that can represent complex, highly dynamic data and allows efficient processing.
● Dynamic schema discovery – Apache Drill does not require a schema or type specification for the data before starting query execution. Instead, Drill processes the data in units called record batches and discovers the schema on the fly during processing.
● Decentralized metadata – Unlike other SQL-on-Hadoop technologies, Drill does not have a centralized metadata requirement. Drill users do not need to create and manage tables in a metadata store in order to query data.
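As a hedged sketch of issuing a schema-free query from Python, the snippet below posts SQL to what is assumed to be Drill's REST endpoint (default port 8047); the host, file path, and field names are illustrative, not part of any specific deployment.

    # Sketch: query Drill over its REST interface (assumed endpoint, host, and data path).
    import requests  # pip install requests

    drill_url = "http://drill.example.com:8047/query.json"
    payload = {
        "queryType": "SQL",
        # Drill can query raw files without pre-defining a schema.
        "query": "SELECT drug_prescribed, COUNT(*) AS prescriptions "
                 "FROM dfs.`/healthcare/ehr/patients.json` GROUP BY drug_prescribed",
    }

    response = requests.post(drill_url, json=payload, timeout=60)
    for row in response.json().get("rows", []):
        print(row)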
2.11. Mahout
Mahout is an open-source framework for creating scalable machine learning algorithms and data mining libraries. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.
● Clustering – Takes items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
● Collaborative filtering – Mines user behavior and makes product recommendations (e.g., Amazon recommendations).
● Classification – Learns from existing categorization and then assigns unclassified items to the best category.
● Frequent pattern mining – Analyzes items in a group (e.g., items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
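The problem statement also lists Spark MLlib alongside Mahout for predictive analytics. A minimal PySpark sketch of training a classifier on hypothetical patient features to predict a disease outcome could look like this; the toy data and column names are invented for illustration.

    # Sketch: disease-outcome classification with Spark MLlib (hypothetical toy data).
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("healthcare-ml").getOrCreate()

    # Toy patient records: (age, bmi, blood_pressure, label) - label 1 means disease present.
    data = spark.createDataFrame(
        [(45, 28.0, 130, 1), (30, 22.5, 118, 0), (62, 31.2, 150, 1), (25, 20.1, 110, 0)],
        ["age", "bmi", "blood_pressure", "label"],
    )

    # Assemble raw columns into the single feature vector that MLlib expects.
    assembler = VectorAssembler(inputCols=["age", "bmi", "blood_pressure"], outputCol="features")
    train_df = assembler.transform(data)

    model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_df)

    # Predict on the training data just to show the flow.
    model.transform(train_df).select("age", "prediction").show()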
2.12. Apache Sqoop
Sqoop imports data from external sources into related Hadoop ecosystem components such as HDFS, HBase, or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.
● Import sequential datasets from the mainframe – Sqoop satisfies the growing need to move data from the mainframe to HDFS.
● Import directly to ORC files – Improves compression and lightweight indexing, and improves query performance.
● Parallel data transfer – For faster performance and optimal system utilization.
● Efficient data analysis – Improves the efficiency of data analysis by combining structured and unstructured data in a schema-on-read data lake.
● Fast data copies – From an external system into Hadoop.
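Sqoop itself is driven from the command line; as a sketch, the import below (wrapped in Python's subprocess for consistency with the other examples) pulls a hypothetical patients table from MySQL into HDFS with four parallel mappers. The connection string, credentials, and paths are assumptions.

    # Sketch: launch a Sqoop import of a relational table into HDFS (hypothetical DB and paths).
    import subprocess

    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://db.example.com/hospital",   # source RDBMS
        "--table", "patients",                                 # table to import
        "--username", "etl_user",
        "--password-file", "/user/etl/.db_password",
        "--target-dir", "/healthcare/raw/patients",            # HDFS destination
        "--num-mappers", "4",                                  # parallel data transfer
    ], check=True)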
2.13. Apache Flume
Flume efficiently collects, aggregates, and moves large amounts of data from their origin and sends them into HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem component allows data to flow from the source into the Hadoop environment. It uses a simple, extensible data model that allows for online analytic applications. Using Flume, we can immediately get data from multiple servers into Hadoop.
Fig. Apache Flume
2.14. Ambari
Apache Ambari is a management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters through a consistent web interface.
Features of Ambari:
● Simplified installation, configuration, and management of Hadoop clusters.
● Centralized security setup.
● Full visibility into cluster health.
2.15. Zookeeper
Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining
configuration information, naming, providing distributed synchronization, and providing group
services. Zookeeper manages and coordinates a large cluster of machines.
Features of Zookeeper:
● Fast – Zookeeper is fast for workloads where reads of the data are more common than writes. The ideal read/write ratio is 10:1.
● Ordered – Zookeeper maintains a record of all transactions.
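As a sketch of how an application coordinates shared configuration through ZooKeeper, the snippet below uses the third-party kazoo client; the connection string and znode path are illustrative.

    # Sketch: store and read a shared configuration value in ZooKeeper via kazoo (hypothetical path).
    from kazoo.client import KazooClient  # pip install kazoo

    zk = KazooClient(hosts="zk1.example.com:2181")
    zk.start()

    # Create a znode holding a config value if it does not already exist.
    zk.ensure_path("/healthcare/config")
    if not zk.exists("/healthcare/config/alert_threshold"):
        zk.create("/healthcare/config/alert_threshold", b"120")

    value, _stat = zk.get("/healthcare/config/alert_threshold")
    print(value.decode())

    zk.stop()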
2.16. Oozie
It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.
In Oozie, users can create a Directed Acyclic Graph (DAG) of workflow actions, which can run in parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely execution of thousands of workflows in a Hadoop cluster. Oozie is also very flexible: one can easily start, stop, suspend, and rerun jobs, and it is even possible to skip a specific failed node or rerun it.
● Oozie workflow – Stores and runs workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
● Oozie Coordinator – Runs workflow jobs based on predefined schedules and the availability of data.
As can be seen above, most industries are already using or shifting towards big data, including the healthcare industry.
The healthcare industry uses it primarily for the following:
• Data Warehouse Optimization
• Patient Analysis
• Predictive Maintenance
• Hadoop uses the MapReduce model to create jobs that are split into tasks, which are executed independently on different nodes (DataNodes), while the results are combined in the reduce phase to produce the final output.
• In our example, each input record of the map phase has the following format:
• <serial_number, name, drug_prescribed, gender, total_expenditure_on_prescribed_drugs>
The system groups the items having the same key (here, the prescribed drug) and finally provides the requested output, as sketched below.
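A hedged sketch of this job as two Hadoop Streaming scripts in Python follows; the field order matches the record format above, while the script names and paths are illustrative. The mapper emits <drug, expenditure> pairs and the reducer sums the expenditure per drug, relying on Hadoop's shuffle to deliver identical keys together.

    # ---- mapper.py : emit <drug_prescribed, expenditure> for every input record ----
    # Input line: serial_number,name,drug_prescribed,gender,total_expenditure_on_prescribed_drugs
    import sys

    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) == 5:
            print(f"{fields[2]}\t{fields[4]}")

    # ---- reducer.py : sum expenditure per drug (keys arrive grouped after the shuffle) ----
    import sys

    current_drug, total = None, 0.0
    for line in sys.stdin:
        drug, expenditure = line.rstrip("\n").split("\t")
        if drug != current_drug and current_drug is not None:
            print(f"{current_drug}\t{total:.2f}")
            total = 0.0
        current_drug = drug
        total += float(expenditure)
    if current_drug is not None:
        print(f"{current_drug}\t{total:.2f}")

In practice these scripts would be submitted with the hadoop-streaming jar, passing them as the -mapper and -reducer options and reading the patient records from HDFS.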
Patient data is stored in a centralized repository, which makes the system cost-effective by reducing the number of storage warehouses and eliminating data redundancy, which in turn keeps the system consistent.
CONCLUSION
We have covered the major Hadoop ecosystem components in detail. These Hadoop ecosystem components empower Hadoop's functionality. The Hadoop ecosystem is mainly designed to store and process huge data characterized by volume, velocity, and variety. It stores data in a distributed file system that runs on commodity hardware. Considering the full Hadoop ecosystem process, HDFS distributes the data blocks and MapReduce provides the programming model to process them in parallel.
REFERENCES
Submitted By:
Manasi Shivarkar
Roll no.: T213049
Div.: TE-C
Date:
Marks obtained:
Sign of course coordinator: