Health Care Analytics using Bigdata

A Mini Project Report submitted for the degree of
BACHELOR OF TECHNOLOGY
in
Computer Science & Engineering

by
(197Y1A0521)
Y. Navadeep Reddy (197Y1A0526)
CERTIFICATE

This is to certify that the project report titled “Health Care Analytics using Bigdata”, submitted towards the B.Tech degree in Computer Science & Engineering, is a record of bonafide work carried out by him. The results embodied in this report have not been submitted to any other University for the award of any degree.
DECLARATION

I hereby declare that the Mini Project Report entitled “Health Care Analytics using Bigdata”, submitted for the B.Tech degree, is entirely my own work, and that all ideas and references have been duly acknowledged. It does not contain any work that has been submitted for the award of any other degree.
Date:
(197Y1A0521)
Y. Navadeep Reddy
(197Y1A0526)
ACKNOWLEDGEMENT
I am happy to express my deep sense of gratitude to the Principal of the college, Dr. K.
Venkateswara Reddy, Professor, Department of Computer Science and Engineering, Marri
Laxman Reddy Institute of Technology & Management, for having provided me with adequate
facilities to pursue my project.
I would like to thank Mr. Abdul Basith Khateeb, Assoc. Professor and Head, Department of
Computer Science and Engineering, Marri Laxman Reddy Institute of Technology &
Management, for having provided the freedom to use all the facilities available in the
department, especially the laboratories and the library.
I am very grateful to my project guide, Mrs. K. Jaysri, Asst. Prof., Department of Computer
Science and Engineering, Marri Laxman Reddy Institute of Technology & Management, for her
extensive patience and guidance throughout my project work.
I sincerely thank my seniors and all the teaching and non-teaching staff of the Department of
Computer Science for their timely suggestions, healthy criticism and motivation during the
course of this work.
I would also like to thank my classmates for always being there whenever I needed help or
moral support. With great respect and obedience, I thank my parents and brother who were the
backbone behind my deeds.
Finally, I express my immense gratitude to all the other individuals who have, directly or
indirectly, contributed at the right time to the development and success of this work.
TABLE OF CONTENTS

Certificates
Acknowledgement
Abstract

1. INTRODUCTION
   1.1 Bigdata 3V's
   1.2 Ecosystem
       HDFS
       MapReduce
       Pig
       Hive
       Sqoop
       Impala
   1.4 Cloudera
   1.5 Hue
2. LITERATURE SURVEY
3. REQUIREMENT ANALYSIS
   Data Exploration
   Ad hoc Querying
4. IMPLEMENTATION
5. METHODOLOGY
   To create database
   To create table
   To display fields
   Loading data into MySQL
   To import data from MySQL to HDFS
   Compilation time
6. SCREENSHOTS

LIST OF FIGURES
   4.2 System Architecture
   5.1 How HDFS is used in our project

LIST OF TABLES
ABSTRACT
In today's modern world, healthcare also needs to be modernized. This
means that healthcare data should be properly analyzed so that it can be
categorized into groups by gender, disease, city, symptoms and
treatment.

Big Data is used to predict epidemics, cure diseases, improve quality
of life and avoid preventable deaths. With the increasing population of
the world, and everyone living longer, models of treatment delivery are
rapidly changing, and many of the decisions behind those changes are
being driven by data.

The drive now is to understand as much about a patient as possible, as
early in their life as possible, hopefully picking up warning signs of
serious illness at an early enough stage that treatment is far simpler
and less expensive than if it had not been spotted until later. The
gigantic size of the data requires large-scale computation, which can
be done with the help of distributed processing in Hadoop.

The use of these frameworks will provide multiple beneficial outputs,
including presenting the healthcare data analysis in various forms. The
groups made by the system would be symptom-wise, age-wise, gender-wise,
season-wise, disease-wise, etc. As the system displays the data
group-wise, it helps give a clear idea about diseases and their rate of
spread, so that appropriate treatment can be given at the proper time.
1. INTRODUCTION
1.1 Bigdata 3V’s:
Velocity
The data growth and social media explosion have changed how we
look at data. There was a time when we used to believe that
yesterday's data was recent; as a matter of fact, newspapers
still follow that logic. However, news channels and radio have
changed how fast we receive the news. Today, people rely on
social media to keep them updated with the latest happenings.
They often discard old messages and pay attention to recent
updates. Data movement is now almost real time, and the update
window has shrunk to fractions of a second. This high-velocity
data represents Big Data.
Variety
Data can be stored in multiple formats: for example, in a
database, in Excel, CSV or Access files, or, for that matter, in
a simple text file. Sometimes the data is not even in the
traditional formats we assume; it may be in the form of video,
SMS, PDF or something we might not have thought about. It is the
organization's job to arrange this data and make it meaningful.
This would be easy if all the data were in the same format, but
that is rarely the case. The real world has data in many
different formats, and that is the challenge we need to overcome
with Big Data. This variety of data represents Big Data.
1.2 Ecosystem
HDFS:
HDFS is built to support applications with large data sets,
including individual files that reach into the terabytes. It uses a
master/slave architecture, with each cluster consisting of a single
Namenode that manages file system operations and supporting
Datanodes that manage data storage on individual compute
nodes.
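
As a quick illustration, the following shell commands show the kind of
HDFS interaction described above. This is a minimal sketch: the file
and directory names are assumptions for illustration, not the project's
actual paths.

    # Create a directory in HDFS and copy a local healthcare dataset into it.
    hdfs dfs -mkdir -p /user/cloudera/healthcare
    hdfs dfs -put patients.csv /user/cloudera/healthcare/

    # List the directory and read a few records back from the cluster;
    # the NameNode resolves the paths, the DataNodes serve the blocks.
    hdfs dfs -ls /user/cloudera/healthcare
    hdfs dfs -cat /user/cloudera/healthcare/patients.csv | head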
MapReduce:
MapReduce is a programming model for processing large data sets
with a parallel, distributed algorithm on a cluster.
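
For example, a simple MapReduce job can be submitted through Hadoop
Streaming, which lets ordinary executables act as the mapper and
reducer. The sketch below is illustrative only: the jar location varies
between installations, and the input and output directories are
assumptions.

    # Run a trivial streaming job: the mapper passes records through and
    # the reducer counts lines, words and bytes of the input.
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
      -input  /user/cloudera/healthcare/patients.csv \
      -output /user/cloudera/healthcare/counts \
      -mapper /bin/cat \
      -reducer /usr/bin/wc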
Hive:
Hive is a data warehouse infrastructure tool to process
structured data in Hadoop. It resides on top of Hadoop to
summarize Big Data, and makes querying and analyzing easy.
Hive gives an SQL-like interface to query data stored in various
databases and file systems that integrate with Hadoop.
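
As an example of this SQL-like interface, the sketch below creates a
Hive table over patient records and runs a group-wise query of the kind
described in the abstract. The database, table and column names are
assumptions for illustration, not the project's actual schema.

    # Create a Hive table and run a disease/gender-wise aggregation.
    hive -e "
    CREATE DATABASE IF NOT EXISTS healthcare;
    CREATE TABLE IF NOT EXISTS healthcare.patients (
        patient_id INT,
        gender     STRING,
        city       STRING,
        disease    STRING,
        symptoms   STRING,
        treatment  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    SELECT disease, gender, COUNT(*) AS cases
    FROM healthcare.patients
    GROUP BY disease, gender;
    "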
1.4 Cloudera:
Cloudera's open-source Apache Hadoop distribution, CDH
(Cloudera Distribution Including Apache Hadoop), targets
enterprise-class deployments of that technology. Cloudera says
that more than 50% of its engineering output is donated
upstream to the various Apache-licensed open source projects
(Apache Hive, Apache Avro, Apache HBase, and so on) that
combine to form the Hadoop platform.
1.5 Hue:
Hue is an open-source web interface for analyzing data with
Apache Hadoop. Hue allows technical and non-technical users to
take advantage of Hive, Pig, and many of the other tools that
are part of the Hadoop ecosystem.
You can load your data, run interactive Hive queries, develop
and run Pig scripts, work with HDFS, check on the status of
your jobs, and more. Hue's File Browser allows you to browse
Amazon Simple Storage Service (S3) buckets, and you can use
the Hive editor to run queries against data stored in S3.
2. LITERATURE SURVEY
2.1 Existing system:
The existing systems are built using an RDBMS, which stores
data in the form of tables. An RDBMS allows only structured
data to be stored.
When a user wants to know basic information about diseases, the
person has to contact the concerned hospital, and if the user
wants an appointment, he or she has to go directly to the
hospital to fix it. If the user is unable to reach the hospital
at the required time, the user is unable to take an appointment
instantly.
3. REQUIREMENT ANALYSIS
3.1 Hardware requirements
Processor
16 GB Memory
4 TB Disk
4. IMPLEMENTATION
4.1 Problem Definition:
Health care analytics using Big Data and Hadoop.
4.2 System Architecture
6. SCREENSHOTS
To create database
To create table
To display fields
COMPILATION TIME
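
The screenshots listed above correspond to the MySQL and Sqoop steps
from the methodology: creating the database and table, displaying the
fields, loading data into MySQL, and importing it into HDFS. The
commands below are a minimal sketch of those steps; the database, table
and column names, credentials and paths are illustrative assumptions
rather than the project's actual values.

    # 1. Create the database and table in MySQL, then display the fields.
    mysql -u root -p -e "
    CREATE DATABASE IF NOT EXISTS healthcare;
    CREATE TABLE IF NOT EXISTS healthcare.patients (
        patient_id INT PRIMARY KEY,
        gender     VARCHAR(10),
        city       VARCHAR(50),
        disease    VARCHAR(100),
        symptoms   VARCHAR(255),
        treatment  VARCHAR(255)
    );
    DESCRIBE healthcare.patients;
    "

    # 2. Load raw records into MySQL from a local CSV file.
    mysql -u root -p --local-infile=1 -e "
    LOAD DATA LOCAL INFILE 'patients.csv'
    INTO TABLE healthcare.patients
    FIELDS TERMINATED BY ',';
    "

    # 3. Import the MySQL table into HDFS with Sqoop.
    sqoop import \
      --connect jdbc:mysql://localhost/healthcare \
      --username root -P \
      --table patients \
      --target-dir /user/cloudera/healthcare/patients \
      -m 1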