
A

MINI PROJECT REPORT


on
HEALTH CARE ANALYTICS USING BIGDATA

BACHELOR OF TECHNOLOGY
in

COMPUTER SCIENCE AND ENGINEERING


Submitted by
CH. Likhil Kumar Goud

(197Y1A0521)

Y. Navadeep Reddy
(197Y1A0526)

Under the Guidance of

Mrs. K. Jaysri (Assistant Professor)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
MARRI LAXMAN REDDY

INSTITUTE OF TECHNOLOGY AND MANAGEMENT


(AUTONOMOUS)

(Affiliated to JNTU-H, Approved by AICTE New Delhi and Accredited by NBA & NAAC with ‘A’ Grade)
CERTIFICATE

This is to certify that the project report titled “Health Care Analytics using Bigdata” is being submitted by CH. Likhil Kumar Goud (197Y1A0521), IV B.Tech I Semester, Computer Science & Engineering, as a record of bonafide work carried out by him. The results embodied in this report have not been submitted to any other University for the award of any degree.

Internal Guide HOD

Principal External Examiner


DECLARATION

I hereby declare that the Minor Project Report entitled “Health Care Analytics using Bigdata”, submitted for the B.Tech degree, is entirely my own work, and all ideas and references have been duly acknowledged. It does not contain any work submitted for the award of any other degree.

Date:

CH.Likhil Kumar Goud

(197Y1A0521)

Y. Navadeep Reddy

(197Y1A0526)

ACKNOWLEDGEMENT

I am happy to express my deep sense of gratitude to the principal of the college, Dr. K. Venkateswara Reddy, Professor, Department of Computer Science and Engineering, Marri Laxman Reddy Institute of Technology & Management, for having provided me with adequate facilities to pursue my project.

I would like to thank Mr. Abdul Basith Khateeb, Assoc. Professor and Head, Department of
Computer Science and Engineering, Marri Laxman Reddy Institute of Technology &
Management, for having provided the freedom to use all the facilities available in the
department, especially the laboratories and the library.

I am very grateful to my project guide, Mrs. K. Jaysri, Asst. Prof., Department of Computer Science and Engineering, Marri Laxman Reddy Institute of Technology & Management, for her extensive patience and guidance throughout my project work.

I sincerely thank my seniors and all the teaching and non-teaching staff of the Department of
Computer Science for their timely suggestions, healthy criticism and motivation during the
course of this work.

I would also like to thank my classmates for always being there whenever I needed help or
moral support. With great respect and obedience, I thank my parents and brother who were the
backbone behind my deeds.

Finally, I express my immense gratitude to the other individuals who have directly or indirectly contributed at the right time to the development and success of this work.


CONTENTS

Certificate
Declaration
Acknowledgement
Abstract

1. INTRODUCTION
1.1 Bigdata 3V’s
1.2 Ecosystem
    HDFS
    MapReduce
    Pig
    Hive
    Sqoop
    Impala
1.3 Applications of Bigdata
1.4 Cloudera
1.5 Hue
2. LITERATURE SURVEY
2.1 Existing system
2.2 Proposed system
3. REQUIREMENT ANALYSIS
3.1 Hardware requirements
3.2 Software requirements
4. IMPLEMENTATION
4.1 Problem Definition
4.2 System Architecture
    Get to the Source
    Ingestion Strategy and Acquisition
    Storage
    Data processing
    Export Data sets
    Reporting and visualization
    Data Exploration
    Adhoc Querying
5. METHODOLOGY
5.1 How HDFS is used in our project
5.2 How Hive is used
5.3 How Cloudera is used
5.4 How Hue is used
5.5 How Sqoop is used
6. SCREENSHOTS
    To create database
    To create table
    To display fields
    Loading data into MySQL
    To import data from MySQL to HDFS
    Compilation time

LIST OF FIGURES
4.2 System Architecture
5.1 How HDFS is used in our project


LIST OF TABLES

6. SCREENSHOTS


ABSTRACT
In today's modern world, healthcare also needs to be modernized. This means that healthcare data should be properly analyzed so that it can be categorized into groups by gender, disease, city, symptoms and treatment.
Big data is used to predict epidemics, cure diseases, improve quality of life and avoid preventable deaths. With the world's population increasing and everyone living longer, models of treatment delivery are rapidly changing, and many of the decisions behind those changes are being driven by data.
The drive now is to understand as much about a patient as possible, as early in their life as possible, hopefully picking up warning signs of serious illness at an early enough stage that treatment is far simpler and less expensive than if it had been spotted later. Analytics at this gigantic scale needs large-scale computation, which can be done with the help of distributed processing in Hadoop.
The framework used will provide multiple beneficial outputs, including presenting the healthcare data analysis in various forms. The groups made by the system would be symptom-wise, age-wise, gender-wise, season-wise, disease-wise, etc. As the system displays the data group-wise, it helps to get a clear idea about diseases and their rate of spreading, so that appropriate treatment can be given at the proper time.


1. INTRODUCTION
1.1 Bigdata 3V’s:

The 3 Vs that define Big Data are Volume, Velocity and Variety.
Volume
We currently see exponential growth in data storage, since data is now much more than plain text: we find videos, music and large images on our social media channels. It is very common for enterprises to have storage systems of terabytes, even petabytes. As the data grows, the applications and architecture built to support it need to be re-evaluated quite often. Sometimes the same data is re-evaluated from multiple angles, and even though the original data is the same, the newfound intelligence creates an explosion of data. This sheer volume represents Big Data.

Velocity
Data growth and the social media explosion have changed how we look at data. There was a time when we believed that yesterday's data was recent; as a matter of fact, newspapers still follow that logic. However, news channels and radio changed how fast we receive the news. Today, people rely on social media to keep them updated with the latest happenings, often discarding old messages and paying attention only to recent updates. Data movement is now almost real time, and the update window has shrunk to fractions of a second. This high-velocity data represents Big Data.


Variety
Data can be stored in multiple formats: for example in a database, Excel, CSV or Access files, or in a simple text file. Sometimes the data is not even in a traditional format as we assume; it may be video, SMS, PDF or something we have not thought about yet. It is the organization's task to arrange it and make it meaningful. This would be easy if all data arrived in the same format, but that is rarely the case. The real world has data in many different formats, and that is the challenge we need to overcome with Big Data. This variety of data represents Big Data.

1.2 Ecosystem
HDFS:
HDFS is built to support applications with large data sets,
including individual files that reach into the terabytes. It uses a
master/slave architecture, with each cluster consisting of a single
Namenode that manages file system operations and supporting
Datanodes that manage data storage on individual compute
nodes.
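As an illustration, the basic HDFS shell commands for moving files in and out of the cluster look like the following (all paths and file names are examples only, not taken from this project):

    # create a directory in HDFS and copy a local file into it
    hdfs dfs -mkdir -p /user/demo/input
    hdfs dfs -put records.csv /user/demo/input/

    # list the directory and read the file back
    hdfs dfs -ls /user/demo/input
    hdfs dfs -cat /user/demo/input/records.csv

    # copy a file from HDFS back to the local file system
    hdfs dfs -get /user/demo/input/records.csv ./records_copy.csv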
MapReduce:
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. The MapReduce model consists of two separate routines, namely the Map function and the Reduce function. The computation on an input in the MapReduce model occurs in three stages:
In the map stage, the mapper takes a single (key, value) pair as input and produces any number of (key, value) pairs as output.
The shuffle stage is handled automatically by the MapReduce framework: the underlying system routes all of the values that are associated with an individual key to the same reducer.
In the reduce stage, the reducer takes all of the values associated with a single key k and outputs any number of pairs.
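The three stages can be imitated on a single machine with an ordinary shell pipeline. The sketch below counts words in a file (the file name is a placeholder), with sort standing in for the shuffle stage:

    # map stage: emit a (word, 1) pair for every word of the input;
    # sort groups equal keys together, as the shuffle stage would;
    # the final awk plays the reducer, summing the counts per key
    cat input.txt \
      | awk '{for (i = 1; i <= NF; i++) print $i "\t1"}' \
      | sort \
      | awk -F'\t' '{sum[$1] += $2} END {for (w in sum) print w "\t" sum[w]}'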
Pig:
Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation that makes MapReduce programming high level, similar to that of SQL for an RDBMS.
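For a flavour of the notation, a short Pig Latin script over an illustrative comma-separated file could look like this (the file path and field names are assumptions):

    -- load a CSV file and declare a schema for it
    records = LOAD '/user/demo/patients.csv' USING PigStorage(',')
              AS (name:chararray, age:int, disease:chararray);
    -- keep only adult patients
    adults = FILTER records BY age >= 18;
    -- group by disease and count the records in each group
    grouped = GROUP adults BY disease;
    counts = FOREACH grouped GENERATE group, COUNT(adults);
    DUMP counts;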

Hive:
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing easy. Hive gives an SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop; without it, traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.
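As an illustration, a Hive table can be declared over a delimited file already sitting in HDFS and then queried with ordinary SQL; the schema below is a made-up example, not this project's actual table:

    -- external table over a CSV directory in HDFS
    CREATE EXTERNAL TABLE patients (
      name    STRING,
      age     INT,
      disease STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/demo/patients';

    -- Hive compiles this query into MapReduce jobs behind the scenes
    SELECT disease, COUNT(*) AS cases FROM patients GROUP BY disease;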
Sqoop:
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. It is a big data tool that offers the capability to extract data from non-Hadoop data stores, transform the data into a form usable by Hadoop, and then load the data into HDFS. This process is called ETL: Extract, Transform and Load.
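A minimal Sqoop import looks like the following; the connection string, credentials, table name and target directory are all placeholders:

    # pull a relational table into HDFS
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username demo --password demo \
      --table orders \
      --target-dir /user/demo/orders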
Impala:
Cloudera Impala is Cloudera's open source massively
parallel processing (MPP) SQL query engine for data stored in
a computer cluster running Apache Hadoop. Impala brings
scalable parallel database technology to Hadoop, enabling users
to issue low-latency SQL queries to data stored
in HDFS and Apache HBase without requiring data movement
or transformation. Impala is integrated with Hadoop to use the
same file and data formats, metadata, security and resource
management frameworks used by MapReduce, Apache
Hive, Apache Pig and other Hadoop software.
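For instance, once a table is visible in the shared metastore, an interactive query can be issued from the Impala shell (the table name here is illustrative):

    # low-latency SQL over data in HDFS, without launching MapReduce jobs
    impala-shell -q "SELECT city, COUNT(*) AS cases FROM patients GROUP BY city;"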
1.3 Applications of Bigdata:
Healthcare contributions

Banking sectors and fraud detection

The public sector uses big data in traffic management, route planning, intelligent transportation systems and congestion management.

The private sector uses big data in revenue management, manufacturing improvements, logistics and for competitive advantage.

1.4 Cloudera:
Cloudera's open-source Apache Hadoop distribution, CDH
(Cloudera Distribution Including Apache Hadoop), targets
enterprise-class deployments of that technology. Cloudera says
that more than 50% of its engineering output is donated
upstream to the various Apache-licensed open source projects
(Apache Hive, Apache Avro, Apache HBase, and so on) that
combine to form the Hadoop platform.
1.5 Hue:
Hue is an open-source web interface for analyzing data with Apache Hadoop. Hue allows technical and non-technical users to take advantage of Hive, Pig and many of the other tools that are part of the Hadoop ecosystem.
You can load your data, run interactive Hive queries, develop and run Pig scripts, work with HDFS, check on the status of your jobs, and more. Hue's File Browser also allows you to browse Amazon Simple Storage Service (S3) buckets, and you can use the Hive editor to run queries against data stored in S3.
2. LITERATURE SURVEY
2.1 Existing system:
The existing systems are built using an RDBMS, which stores data in the form of tables and can hold only structured data.
When a user wants basic information about a disease, they have to contact the hospital concerned, and to take an appointment they have to go to the hospital in person. If the user is unable to reach the hospital at that particular time, they are unable to take an appointment instantly.

2.2 Proposed system:


The proposed system will group together the disease and symptom data and analyze it to provide cumulative information. After the analysis, algorithms can be applied to the result, and groupings can be made to show a clear picture of the analysis.

3. REQUIREMENT ANALYSIS
3.1 Hardware requirements
Processor
16 GB Memory
4 TB Disk


3.2 Software requirements


VMware
Linux OS

4. IMPLEMENTATION
4.1 Problem Definition:
Health care analytics using big data and Hadoop.
4.2 System Architecture

Get to the Source!


Source profiling is one of the most important steps in deciding
the architecture. It involves identifying the different source
systems and categorizing them based on their nature and type.
Points to be considered while profiling the data sources:
 Identify the internal and external sources systems
 High Level assumption for the amount of data ingested
from each source
 Identify the mechanism used to get data – push or pull
 Determine the type of data source – Database, File, web
service, streams etc.
 Determine the type of data – structured, semi structured or
unstructured.
Ingestion Strategy and Acquisition


Data ingestion in the Hadoop world means ELT (Extract, Load and Transform), as opposed to ETL (Extract, Transform and Load) in traditional warehouses.
Points to be considered:
 Determine the frequency at which data would be ingested
from each source
 Is there a need to change the semantics of the data (append, replace, etc.)?
 Is there any data validation or transformation required
before ingestion (Pre-processing)?
 Segregate the data sources based on mode of ingestion –
Batch or real-time
Storage:
The Hadoop Distributed File System is the most commonly used storage framework in the Big Data world; others are the NoSQL data stores, such as MongoDB, HBase and Cassandra. One of the salient features of Hadoop storage is its capability to scale, self-manage and self-heal.
Things to consider while planning storage methodology:
 Type of data (Historical or Incremental)
 Format of data (structured, semi-structured and unstructured)
 Compression requirements
 Frequency of incoming data
 Query pattern on the data
 Consumers of the data
Data processing:
Earlier, frequently accessed data was stored in dynamic RAM, but now, due to the sheer volume, it is stored on multiple disks across a number of machines connected via the network. Instead of bringing the data to the processing, in the new approach the processing is taken closer to the data, which significantly reduces network I/O. The processing methodology is driven by business requirements, and can be categorized into batch, real-time or hybrid based on the SLA.
 Batch Processing – Batch is collecting the input for a
specified interval of time and running transformations on it
in a scheduled way. Historical data load is a typical batch
operation.
Technology Used: MapReduce, Hive, Pig
 Real-time Processing – Real-time processing involves
running transformations as and when data is acquired.
Technology Used: Impala, Spark, Spark SQL.
 Hybrid Processing – It’s a combination of both batch and
real-time processing needs.
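As a small illustration of the batch and real-time modes described above, the same aggregation can be scheduled as a nightly Hive job or served interactively through Impala (the script path, schedule and table name are examples only):

    # batch: run a Hive transformation script every night at 1 a.m. (crontab entry)
    0 1 * * * hive -f /opt/jobs/daily_aggregate.hql

    # real-time/interactive: the same aggregation answered with low latency
    impala-shell -q "SELECT disease, COUNT(*) FROM patients GROUP BY disease;"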
Data consumption:
Different users, such as administrators, business users, vendors and partners, can consume the data in different formats. The output of the analysis can be consumed by a recommendation engine, or business processes can be triggered based on the analysis. Different forms of data consumption are:
 Export Data sets – There can be requirements for third-party data set generation. Data sets can be generated using Hive export or directly from HDFS.


 Reporting and visualization – Different reporting and visualization tools can connect to Hadoop using JDBC/ODBC connectivity to Hive.
 Data Exploration – Data scientists can build models and perform deep exploration in a sandbox environment. The sandbox can be a separate cluster (the recommended approach) or a separate schema within the same cluster that contains a subset of the actual data.
 Adhoc Querying – Ad hoc or interactive querying can be supported by using Hive, Impala or Spark SQL, as sketched below.
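For example, a reporting tool or the beeline CLI can reach Hive through JDBC; a hypothetical connection for an ad hoc query looks like:

    # ad hoc query over JDBC against a HiveServer2 instance (URL is a placeholder)
    beeline -u jdbc:hive2://localhost:10000/default \
      -e "SELECT disease, COUNT(*) AS cases FROM patients GROUP BY disease;"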
5. METHODOLOGY
5.1 How HDFS is used in our project:
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines, in a redundant fashion, to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing. HDFS mainly consists of two node types:
 Namenode
 Datanode
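In this project, that simply means copying the healthcare data set into HDFS before analysis; the commands would be along the following lines (the directory and file names are assumptions):

    # copy the healthcare CSV into HDFS, where it is split into blocks
    # and replicated across Datanodes for fault tolerance
    hdfs dfs -mkdir -p /user/cloudera/healthcare
    hdfs dfs -put healthcare_data.csv /user/cloudera/healthcare/

    # check block health and replication of the stored file
    hdfs fsck /user/cloudera/healthcare/healthcare_data.csv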


5.2 How Hive is used:
Hive gives an SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop, and it supports easy portability of SQL-based applications to Hadoop. SQL statements are broken down by the Hive service into MapReduce jobs and executed across the Hadoop cluster.
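A representative query for this project, assuming a hypothetical healthcare.patients table over the imported data, would be:

    hive -e "
      SELECT gender, disease, COUNT(*) AS cases
      FROM healthcare.patients
      GROUP BY gender, disease
      ORDER BY cases DESC;
    "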

5.3 How Cloudera is used:
The project runs on Cloudera's CDH distribution (inside a VMware virtual machine, per the software requirements), which packages HDFS, Hive, Sqoop and Hue together, so the individual components do not have to be installed and configured separately.

5.4 How Hue is used:
Hue provides the web interface through which the Hive queries of this project are composed and run, and through which the files stored in HDFS are browsed and verified.

5.5 How Sqoop is used:
Sqoop imports the healthcare records from the MySQL database into HDFS so that Hive can analyze them; the corresponding commands are shown in the screenshots section.


6. SCREENSHOTS
 To create database
 To create table

 To display fields
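The original screenshots are not reproduced here. A sketch of the MySQL statements they illustrate might look like the following; the database name and table layout are assumptions, not taken from the screenshots:

    -- to create database
    CREATE DATABASE healthcare;
    USE healthcare;

    -- to create table
    CREATE TABLE patients (
      name    VARCHAR(50),
      age     INT,
      gender  VARCHAR(10),
      city    VARCHAR(50),
      disease VARCHAR(50)
    );

    -- to display fields
    DESCRIBE patients;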


 Loading data into mysql

 To import data from mysql to hdfs
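Again as a sketch, with illustrative file, credential and table names: the data is first loaded into MySQL, then imported into HDFS with Sqoop.

    -- loading data into MySQL from a local CSV file
    LOAD DATA LOCAL INFILE 'patients.csv'
    INTO TABLE patients
    FIELDS TERMINATED BY ',';

    # importing the table from MySQL into HDFS with a single mapper
    sqoop import \
      --connect jdbc:mysql://localhost/healthcare \
      --username demo --password demo \
      --table patients \
      -m 1

    # verifying that the imported files landed in HDFS
    hdfs dfs -ls patients
    hdfs dfs -cat patients/part-m-00000 | head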


COMPILATION TIME


