
UNIT - 1

1) What is big data? Explain its applications in healthcare and credit risk management.

A) Big Data is a term used for collections of data sets that are so large and complex that they are
difficult to store and process using available database management tools or traditional data
processing applications. The challenges include capturing, curating, storing, searching, sharing,
transferring, analyzing and visualizing this data.

 It refers to a massive amount of data that keeps on growing exponentially with time.

 It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.

 It encompasses data mining, data storage, data analysis, data sharing, and data visualization.

 The term is all-encompassing, covering the data itself, data frameworks, and the
tools and techniques used to process and analyze the data.

Big Data Applications: Healthcare

 The volume of data generated within healthcare systems is not trivial.

 Traditionally, the healthcare industry lagged in using Big Data because of its limited ability
to standardize and consolidate data.
 Now, however, Big Data analytics has improved healthcare by enabling personalized
medicine and prescriptive analytics.
 Researchers are mining the data to see which treatments are more effective for
particular conditions, identify patterns related to drug side effects, and gain other
important information that can help patients and reduce costs.
 With the added adoption of mHealth, eHealth and wearable technologies, the volume of
data is increasing at an exponential rate.
 This includes electronic health record data, imaging data, patient-generated data, sensor
data, and other forms of data.
 By mapping healthcare data against geographical data sets, it is possible to predict disease
outbreaks that will escalate in specific areas.
 Based on these predictions, it is easier to strategize diagnostics and plan for stocking serums
and vaccines.

Big Data in Banking Sector

Credit Risk Management

 Establishing a robust risk management system is of utmost importance for banking
organizations; otherwise they risk huge revenue losses.
 To survive in a competitive market and increase their profits as much as they can,
organizations have to keep innovating.
 Through Big Data analysis, firms can detect risk in real time, saving customers
from potential fraud.
 The rapidly growing digital world furnishes us with numerous benefits but, on the
other hand, gives rise to various kinds of fraud as well.
 Our personal data is now more vulnerable to cyber attacks than ever before, and this is the
biggest challenge a banking organization faces.
 By employing Big Data analytics together with machine learning algorithms, organizations are
now able to detect fraud before it takes place.
 This is done by identifying unfamiliar spending patterns of the user, predicting unusual
activities of the user, and so on.

2) What is Big Data? Explain the five V’s of Big Data.

Big Data Characteristics

The five characteristics that define Big Data are: Volume, Velocity, Variety, Veracity and Value.

1. VOLUME

Volume refers to the ‘amount of data’, which is growing day by day at a very fast pace. The size
of data generated by humans, machines and their interactions on social media alone is massive.
Researchers predicted that 40 Zettabytes (40,000 Exabytes) of data would be generated by 2020,
an increase of 300 times from 2005.

2. VELOCITY

Velocity is defined as the pace at which different sources generate data every day. This flow
of data is massive and continuous. Facebook, for example, reported about 1.03 billion Daily
Active Users (DAU) on mobile, an increase of 22% year-over-year. This shows how fast the number
of users is growing on social media and how fast data is being generated daily. If you are
able to handle the velocity, you will be able to generate insights and take decisions based on
real-time data.

3. VARIETY

As there are many sources contributing to Big Data, the types of data they generate
differ. Data can be structured, semi-structured or unstructured; hence, a wide
variety of data is generated every day. Earlier, we used to get data from spreadsheets
and databases; now data also arrives in the form of images, audio, video, sensor data and so on.
This variety of unstructured data creates problems in capturing, storing, mining and
analyzing the data.

4. VERACITY

Veracity refers to data in doubt, i.e. the uncertainty of available data due to inconsistency
and incompleteness. For example, a data set may have missing values, and some recorded values
may be hard to accept, such as a minimum value of 15,000 where such a reading is
impossible. This inconsistency and incompleteness is veracity.
Available data can sometimes get messy and may be difficult to trust. With many forms of big
data, quality and accuracy are difficult to control; consider Twitter posts with hashtags,
abbreviations, typos and colloquial speech. The sheer volume is often the reason for the lack
of quality and accuracy in the data (a small validation sketch follows the statistics below).

 Due to the uncertainty of data, 1 in 3 business leaders don’t trust the information
they use to make decisions.

 It was found in a survey that 27% of respondents were unsure of how much of
their data was inaccurate.

 Poor data quality costs the US economy around $3.1 trillion a year.
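As an illustration of a basic veracity check, the sketch below flags missing and implausible
values in a small set of records, mirroring the missing-value and impossible-minimum examples
above. It is written in Java (the record syntax assumes Java 16+); the field names and the
plausible range are assumptions made purely for this example.

import java.util.Arrays;
import java.util.List;

public class VeracityCheck {

    // A hypothetical record: a row identifier and an optional numeric reading.
    record Reading(String source, Double minimumValue) {}

    public static void main(String[] args) {
        List<Reading> readings = Arrays.asList(
                new Reading("row-1", 42.0),
                new Reading("row-2", null),      // incompleteness: missing value
                new Reading("row-3", 15000.0));  // inconsistency: implausible minimum

        // Assumed plausible range for this example: 0 to 1000.
        for (Reading r : readings) {
            if (r.minimumValue() == null) {
                System.out.println(r.source() + ": missing value (incompleteness)");
            } else if (r.minimumValue() < 0 || r.minimumValue() > 1000) {
                System.out.println(r.source() + ": implausible value (inconsistency)");
            }
        }
    }
}

In practice such rules would be derived from the domain (sensor limits, business rules), but the
principle is the same: veracity problems are detected by checking completeness and consistency.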

5. VALUE

After discussing Volume, Velocity, Variety and Veracity, there is one more V that should be taken
into account when looking at Big Data: Value. It is all well and good to have access to
big data, but unless we can turn it into value it is useless. By turning it into value, I mean:
does it add to the benefit of the organizations analyzing the big data? Is
the organization working on Big Data achieving a high ROI (Return on Investment)? Unless
working on Big Data adds to their profits, it is useless.

3) Types of Big Data

Big Data could be of three types:

 Structured

 Semi-Structured

 Unstructured

1. Structured

Data that can be stored and processed in a fixed format is called Structured Data. Data
stored in a relational database management system (RDBMS) is one example of structured
data. It is easy to process structured data as it has a fixed schema. Structured Query Language
(SQL) is often used to manage such data.
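As a minimal sketch of working with structured data, the snippet below queries a relational
table over JDBC in Java. The database URL, credentials, and the orders table with its columns
are hypothetical; any RDBMS with a JDBC driver works the same way.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StructuredQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical MySQL database; the driver JAR must be on the classpath.
        String url = "jdbc:mysql://localhost:3306/sales";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT customer_id, amount FROM orders WHERE amount > ?")) {
            stmt.setDouble(1, 100.0); // the fixed schema lets us filter on a typed column
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("customer_id")
                            + " -> " + rs.getDouble("amount"));
                }
            }
        }
    }
}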

2. Semi-Structured

Semi-Structured Data is a type of data which does not have the formal structure of a data model,
i.e. a table definition in a relational DBMS, but nevertheless has some organizational
properties, like tags and other markers to separate semantic elements, which make it easier to
analyze. XML files and JSON documents are examples of semi-structured data.
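As a small sketch of handling semi-structured data, the snippet below parses a JSON document
whose fields are self-describing tags rather than columns in a fixed schema. It assumes the
Jackson library (com.fasterxml.jackson.databind) is on the classpath; the document contents
are invented for the example.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SemiStructuredExample {
    public static void main(String[] args) throws Exception {
        // Tags such as "name" and "tags" organize the data without a rigid table definition.
        String json = "{\"name\": \"unit-1-notes\", \"tags\": [\"big data\", \"hadoop\"], \"pages\": 10}";

        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(json);

        System.out.println(root.get("name").asText());        // unit-1-notes
        System.out.println(root.get("tags").get(0).asText()); // big data
        System.out.println(root.get("pages").asInt());        // 10
    }
}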

3. Unstructured

Data whose form is unknown, which cannot be stored in an RDBMS, and which cannot be analyzed
unless it is transformed into a structured format, is called unstructured data. Text files and
multimedia content like images, audio and video are examples of unstructured data.
Unstructured data is growing more quickly than the other types; experts say that about
80 percent of the data in an organization is unstructured.

4) Sources of Big Data?

Sources of big data

Technology today allows us to collect data at an astounding rate, both in terms of volume and
variety. There are various sources that generate data, but in the context of big data, the
primary sources are as follows:

 Social networks: Arguably, the primary source of all the big data we know of today is
the social networks that have proliferated over the past 5-10 years. This is by and large
unstructured data, represented by millions of social media postings and other
data generated on a second-by-second basis through user interactions on the
web across the world. Increasing access to the internet across the world has itself
fueled the growth of data in social networks.

 Media: Largely a result of the growth of social networks, media represents the millions,
if not billions, of audio and visual uploads that take place on a daily basis. Videos
uploaded on YouTube, music recordings on SoundCloud, and pictures posted on
Instagram are prime examples of media, whose volume continues to grow in an
unrestrained manner.

 Data warehouses: Companies have long invested in specialized data storage facilities
commonly known as data warehouses. A data warehouse (DW) is essentially a collection of
historical data that a company wishes to maintain and catalog for easy retrieval, whether
for internal use or regulatory purposes. As industries gradually shift toward storing data
in platforms such as Hadoop and NoSQL, more and more companies are moving data
from their pre-existing data warehouses to some of the newer technologies. Company
emails, accounting records, databases, and internal documents are some examples of
DW data that is now being offloaded onto Hadoop or Hadoop-like platforms that
leverage multiple nodes to provide a highly available and fault-tolerant platform.

 Sensors: A more recent phenomenon in the space of big data has been the collection of
data from sensor devices. While sensors have always existed, and industries such as oil
and gas have used drilling sensors for measurements at oil rigs for many decades,
the advent of wearable devices such as Fitbit and Apple Watch, part of the broader
Internet of Things, means that each individual can now stream data at the same rate at
which a few oil rigs did just 10 years ago.

5) Applications of Big Data?

We cannot talk about data without talking about the people who benefit from Big Data
applications. Almost every industry today leverages Big Data applications in one way or another.

 Smarter Healthcare: Making use of the petabytes of patient data, organizations can
extract meaningful information and then build applications that can predict a
patient’s deteriorating condition in advance.

 Telecom: The telecom sector collects information, analyzes it and provides solutions to
different problems. By using Big Data applications, telecom companies have been able
to significantly reduce data packet loss, which occurs when networks are overloaded,
thus providing a seamless connection to their customers.

 Retail: Retail has some of the tightest margins, and is one of the greatest beneficiaries
of big data. The beauty of using big data in retail is understanding consumer behavior.
Amazon’s recommendation engine, for example, provides suggestions based on the consumer’s
browsing history.

 Traffic control: Traffic congestion is a major challenge for many cities globally. Effective
use of data and sensors will be key to managing traffic better as cities become
increasingly densely populated.

 Manufacturing: Analyzing big data in the manufacturing industry can reduce component
defects, improve product quality, increase efficiency, and save time and money.

 Search Quality: Every time we extract information from Google, we simultaneously
generate data for it. Google stores this data and uses it to improve its
search quality.

6) Traditional Approach:

In the past, we dealt with big data using this approach: a single computer was used to store
and process the data. Data was stored in an RDBMS such as Oracle Database, MS SQL Server
or DB2, and sophisticated software could be written to interact with the database, process
the required data and present it to users for analysis.

Limitation
This approach works well for smaller volumes of data that can be accommodated by standard
database servers, or up to the limit of the processor handling the data. But when it
comes to dealing with huge amounts of data, it is really a tedious task to process such
data through a single traditional database server.

Google’s Solution

Google solved this problem using an algorithm called MapReduce. This algorithm divides
the task into small parts and assigns those parts to many computers connected over the
network, and collects the results to form the final result dataset.

The nodes in such a cluster are commodity hardware, which could be single-CPU
machines or servers with higher capacity.
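To make the divide-and-collect idea concrete, below is a minimal sketch of the classic
word-count job written against the Hadoop MapReduce API in Java. The input and output paths
are illustrative; the map step emits a (word, 1) pair for every word, and the reduce step sums
the counts for each word collected from all the machines.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts for each word gathered from all mappers.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class); // partial sums on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}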
Hadoop:
Doug Cutting, Mike Cafarella and their team took the solution provided by Google and
started an open-source project called HADOOP in 2005; Doug named it after his
son’s toy elephant. Apache Hadoop is now a registered trademark of the Apache
Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed
in parallel on different CPU nodes. In short, the Hadoop framework makes it possible to
develop applications that run on clusters of computers and perform complete statistical
analysis on huge amounts of data.

7) Explain Core Hadoop Architecture?

Hadoop is an open-source framework from Apache used to store, process and analyze
data of very huge volume. Hadoop is written in Java and is not an OLAP (online analytical
processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo,
Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes
to the cluster.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its GFS (Google File System)
paper, and HDFS was developed on the basis of it. It states that files are broken into
blocks and stored on nodes across the distributed architecture (see the access sketch
after this list).

2. YARN: Yet Another Resource Negotiator, used for job scheduling and managing the
cluster.

3. MapReduce: a framework which helps Java programs perform parallel
computation on data using key-value pairs. The Map task takes input data and converts it
into a data set that can be computed as key-value pairs. The output of the Map task is
consumed by the Reduce task, and the output of the reducer gives the desired result.

4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
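As a minimal sketch of how a client program interacts with HDFS, the Java snippet below lists
a directory and reads a file back through the Hadoop FileSystem API. The NameNode address and
the /data paths are hypothetical; in a real cluster the address would come from the
core-site.xml configuration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address for this sketch.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // List the files under a directory (illustrative path).
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }

        // Read one file; HDFS transparently reassembles it from its distributed blocks.
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}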

Big Data Vs Cloud Computing (Major Differences)

Let’s see 8 major differences between Big Data and Cloud Computing:

1) Concept

In cloud computing, we can store and retrieve data from anywhere at any time. Whereas
big data refers to large sets of data that are processed to extract the necessary information.

2) Characteristics

Cloud Computing provides services over the internet, which can be:

 Software as a Service (SaaS)

 Platform as a Service (PaaS)

 Infrastructure as a Service (IaaS)

Whereas Big Data is defined by a set of important characteristics that can lead to strategic
business moves: Volume, Velocity and Variety.

3) Accessibility

Cloud Computing provides universal access to its services. Whereas Big Data solves technical
problems and provides better results.

4) When to use

A customer can shift to Cloud Computing when they need rapid deployment and scaling of
applications. If an application deals with highly sensitive data and requires strict compliance,
one should think carefully before keeping it in the cloud.

Whereas we use Big Data where traditional methods and frameworks are ineffective. Big Data is
not a replacement for relational database systems; it solves specific problems related to large
data sets, and most Big Data tools are not suited to small data sets.

5) Cost

Cloud Computing is economical, as it has low maintenance costs, a centralized platform, no
upfront cost and disaster-safe implementation. Whereas Big Data is highly scalable, has a
robust ecosystem and is cost-effective.

6) Job roles and responsibility

The users of the cloud are the developers or office workers in an organization. Whereas, in
Big Data, there are big data analysts who are responsible for analyzing the data, finding
interesting insights and identifying possible future trends.

7) Types and trends

Cloud Computing includes the following deployment types:

 Public Cloud

 Private Cloud

 Hybrid Cloud

 Community Cloud

Whereas some important trends in Big Data technology are Hadoop, MapReduce and HDFS.

8) Vendors

Some of the vendors and solution providers of Cloud Computing are

 Google

 Amazon Web Services

 Microsoft

 Dell

 Apple

 IBM

Whereas, some of the vendors and solution providers of big data are

 Cloudera

 Hortonworks

 Apache

 MapR
