Big Data
A) Big Data is a term used for collections of data sets so large and complex that they are
difficult to store and process using available database management tools or traditional data
processing applications. The challenge includes capturing, curating, storing, searching, sharing,
transferring, analyzing and visualizing this data.
It refers to a massive amount of data that keeps on growing exponentially with time.
It includes data mining, data storage, data analysis, data sharing, and data visualization.
The term is an all-comprehensive one, covering the data and data frameworks, along with the
tools and techniques used to process and analyze the data.
The five characteristics that define Big Data are: Volume, Velocity, Variety, Veracity and Value.
1. VOLUME
Volume refers to the amount of data, which is growing at a very fast pace day by day. The size
of data generated by humans, machines and their interactions on social media alone is massive.
Researchers predicted that 40 zettabytes (40,000 exabytes) would be generated by 2020,
an increase of 300 times over 2005.
2. VELOCITY
Velocity is defined as the pace at which different sources generate data every day. This flow
of data is massive and continuous. Facebook, for example, reports 1.03 billion Daily Active Users
(DAU) on mobile, an increase of 22% year-over-year. This shows how fast the number of
users is growing on social media and how fast data is generated daily. If you are
able to handle the velocity, you will be able to generate insights and take decisions based on
real-time data.
3. VARIETY
As there are many sources contributing to Big Data, the types of data they generate differ.
Data can be structured, semi-structured or unstructured; hence, a variety of data is getting
generated every day. Earlier, data came from spreadsheets and databases; now it arrives in the
form of images, audio, video, sensor data and so on. This variety of largely unstructured data
creates problems in capturing, storing, mining and analyzing the data.
4. VERACITY
Veracity refers to data in doubt, i.e. the uncertainty of available data due to data inconsistency
and incompleteness. For example, a table may have missing values, and some values may be
hard to accept, such as a minimum value of 15,000 in a row where that is not possible. This
inconsistency and incompleteness is veracity.
Available data can sometimes get messy and may be difficult to trust. With many forms of big
data, quality and accuracy are difficult to control, as with Twitter posts full of hashtags,
abbreviations, typos and colloquial speech. Volume is often the reason behind the lack
of quality and accuracy in the data.
It was found in a survey that 27% of respondents were unsure of how much of
their data was inaccurate.
Poor data quality costs the US economy around $3.1 trillion a year.
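As a hedged sketch of how such veracity issues might be caught programmatically (the records, field name and threshold below are invented for illustration), a minimal Python check for missing and implausible values:

```python
# Toy records with the kinds of problems described above (values are invented).
records = [
    {"id": 1, "min_value": 12},
    {"id": 2, "min_value": None},   # incompleteness: missing value
    {"id": 3, "min_value": 15000},  # inconsistency: implausibly large minimum
]

PLAUSIBLE_MAX = 100  # assumed domain limit, purely illustrative

for rec in records:
    value = rec["min_value"]
    if value is None:
        print(f"row {rec['id']}: missing value")
    elif value > PLAUSIBLE_MAX:
        print(f"row {rec['id']}: implausible value {value}")
```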
5. VALUE
After discussing Volume, Velocity, Variety and Veracity, there is another V that should be taken
into account when looking at Big Data: Value. It is all well and good to have access to
big data, but unless we can turn it into value it is useless. Turning it into value means asking:
is it adding to the benefits of the organizations analyzing the data? Is
the organization working on Big Data achieving a high ROI (Return on Investment)? Unless
working on Big Data adds to their profits, it is useless.
3) Types of Big Data
Structured
Semi-Structured
Unstructured
1. Structured
Data that can be stored and processed in a fixed format is called Structured Data. Data
stored in a relational database management system (RDBMS) is one example of structured
data. Structured data is easy to process because it has a fixed schema. Structured Query
Language (SQL) is often used to manage such data.
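As a minimal sketch of working with structured data (the table and column names are invented for illustration), SQL against a fixed schema, here using Python's built-in sqlite3:

```python
import sqlite3

# Structured data: every row follows the same fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice', 'Pune')")
conn.execute("INSERT INTO customers VALUES (2, 'Bob', 'Mumbai')")

# The fixed schema makes querying with SQL straightforward.
for row in conn.execute("SELECT name, city FROM customers WHERE city = 'Pune'"):
    print(row)  # ('Alice', 'Pune')
```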
2. Semi-Structured
Semi-structured data is a type of data that does not have the formal structure of a data model
(i.e. a table definition in a relational DBMS), but nevertheless has some organizational
properties, such as tags and other markers that separate semantic elements and make it easier
to analyze. XML files and JSON documents are examples of semi-structured data.
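For instance, a JSON document carries its own markers (keys), so it can be navigated without a predefined table schema. A minimal sketch, with invented field names:

```python
import json

# Semi-structured: no fixed table schema, but keys mark the semantic elements.
doc = '{"user": "alice", "posts": [{"text": "hello", "likes": 3}]}'
record = json.loads(doc)

# The markers let us pull out values without a table definition.
print(record["user"])               # alice
print(record["posts"][0]["likes"])  # 3
```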
3. Unstructured
Data whose form is unknown, which cannot be stored in an RDBMS and cannot be analyzed
unless transformed into a structured format, is called unstructured data. Text files and
multimedia content such as images, audio and video are examples of unstructured data.
Unstructured data is growing quicker than the other types; experts say that 80 percent of the
data in an organization is unstructured.
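As a toy illustration of the "transform into a structured format" step (a sketch, not a production pipeline), the snippet below turns free text into structured (word, count) pairs:

```python
from collections import Counter

# Unstructured input: free-form text with no schema.
text = "big data is big and data keeps growing"

# Transform into a structured form: (word, count) pairs that could be
# loaded into a table and analyzed.
counts = Counter(text.split())
for word, count in sorted(counts.items()):
    print(word, count)
```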
Technology today allows us to collect data at an astounding rate, both in terms of volume and
variety. There are various sources that generate data, but in the context of big data, the
primary sources are as follows:
Social networks: Arguably, the primary source of all big data that we know of today is
the social networks that have proliferated over the past 5-10 years. This is by and large
unstructured data that is represented by millions of social media postings and other
data that is generated on a second-by-second basis through user interactions on the
web across the world. Increasing internet access across the world has, in turn, fueled the
growth of data in social networks.
Media: Largely a result of the growth of social networks, media represents the millions,
if not billions, of audio and visual uploads that take place on a daily basis. Videos
uploaded on YouTube, music recordings on SoundCloud, and pictures posted on
Instagram are prime examples of media, whose volume continues to grow in an
unrestrained manner.
Data warehouses: Companies have long invested in specialized data storage facilities
commonly known as data warehouses. A DW is essentially a collection of historical data
that companies wish to maintain and catalog for easy retrieval, whether for internal use
or regulatory purposes. As industries gradually shift toward the practice of storing data
in platforms such as Hadoop and NoSQL, more and more companies are moving data
from their pre-existing data warehouses to some of the newer technologies. Company
emails, accounting records, databases, and internal documents are some examples of
DW data that is now being offloaded onto Hadoop or Hadoop-like platforms that
leverage multiple nodes to provide a highly-available and fault-tolerant platform.
Sensors: A more recent phenomenon in the space of big data has been the collection of
data from sensor devices. While sensors have always existed and industries such as oil
and gas have been using drilling sensors for measurements at oil rigs for many decades,
the advent of wearable devices such as Fitbit and the Apple Watch, part of what is known as
the Internet of Things, means that each individual can now stream data at the same rate at
which a few oil rigs did just 10 years back.
We cannot talk about data without talking about the people who benefit from Big Data
applications. Almost all industries today leverage Big Data applications in one way or another.
Smarter Healthcare: Making use of petabytes of patient data, an organization can
extract meaningful information and then build applications that can predict a
patient's deteriorating condition in advance.
Retail: Retail has some of the tightest margins, and is one of the greatest beneficiaries
of big data. The beauty of using big data in retail is to understand consumer behavior.
Amazon's recommendation engine provides suggestions based on the browsing history
of the consumer.
Traffic control: Traffic congestion is a major challenge for many cities globally. Effective
use of data and sensors will be key to managing traffic better as cities become
increasingly densely populated.
Manufacturing: Analyzing big data in the manufacturing industry can reduce component
defects, improve product quality, increase efficiency, and save time and money.
Search Quality: Every time we extract information from Google, we simultaneously
generate data for it. Google stores this data and uses it to improve its
search quality.
6) Traditional Approach:
In the past, big data was dealt with using this approach: a single computer stores
and processes the data. Data is stored in an RDBMS like
Oracle Database, MS SQL Server or DB2, and sophisticated software is written to
interact with the database, process the required data and present it to the users for
analysis.
Limitation
This approach works well where the volume of data can be
accommodated by standard database servers, or up to the limit of the processor that is
processing the data. But when it comes to dealing with huge amounts of data, it is
a tedious task to process such data through a traditional database server.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides
the task into small parts and assigns those parts to many computers connected over the
network, and collects the results to form the final result dataset.
The machines in such a cluster are typically commodity hardware, which could be single-CPU
machines or servers with higher capacity.
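A minimal single-machine sketch of the idea (a word count, not Google's actual implementation) shows the three phases, where the map step is the part a cluster would run in parallel on many machines:

```python
from collections import defaultdict

# The input is split into parts, as a cluster would split a large file.
chunks = ["big data is big", "data keeps growing", "big data everywhere"]

# Map: each part is turned into (key, value) pairs independently,
# so this step could run on many computers in parallel.
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle: group the values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the values for each key into the final result dataset.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'big': 3, 'data': 3, 'is': 1, 'keeps': 1, 'growing': 1, 'everywhere': 1}
```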
Hadoop:
Doug Cutting, Mike Cafarella and team took the solution provided by Google and
started an Open Source Project called HADOOP in 2005 and Doug named it after his
son’s toy elephant. Now Apache Hadoop is a registered trademark of the Apache
Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed
in parallel on different CPU nodes. In short, the Hadoop framework makes it possible to
develop applications that run on clusters of computers and perform complete statistical
analysis of huge amounts of data.
7) Explain Core Hadoop Architecture.
Hadoop is an open source framework from Apache used to store, process and analyze
data that is very huge in volume. Hadoop is written in Java and is not an OLAP (online analytical
processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Google,
Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the
cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS (Google File System)
paper, and HDFS was developed on the basis of it. Files are broken into blocks and stored
on nodes over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator, used for job scheduling and managing the
cluster.
3. Map Reduce: This is a framework that helps Java programs perform parallel
computation on data using key-value pairs. The Map task takes input data and converts it
into a data set that can be computed as key-value pairs. The output of the Map task is
consumed by the Reduce task, and the output of the reducer gives the desired result
(see the sketch after this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
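To make the key-value flow of the Map Reduce module concrete, here is a sketch in the style of a Hadoop Streaming job, where a mapper and a reducer exchange tab-separated key-value lines (the word-count task and function names are illustrative; a real job would run each function as a separate script over HDFS data):

```python
def mapper(lines):
    """Map task: convert input lines into (word, 1) key-value pairs."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Reduce task: sum the counts for each word (input must be sorted by key)."""
    current, total = None, 0
    for pair in pairs:
        word, count = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Local demo of the map -> sort (shuffle) -> reduce pipeline.
mapped = sorted(mapper(["big data is big", "data keeps growing"]))
for line in reducer(mapped):
    print(line)
```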
Big Data Vs Cloud Computing (Major Differences)
Let’s see 8 major differences between Big Data and Cloud Computing:
1) Concept
In cloud computing, we can store and retrieve data from anywhere at any time. Whereas,
big data is a large set of data that is processed to extract the necessary information.
2) Characteristics
Cloud Computing provides services over the internet, which can be:
Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Software as a Service (SaaS)
Whereas, there are some important characteristics of Big Data that can lead to strategic
business moves, and they are Velocity, Variety, and Volume.
3) Accessibility
Cloud Computing provides universal access to the services. Whereas, Big data solves technical
problems and provides better results.
4) When to use
A customer can shift to Cloud Computing when they need rapid deployment and scaling of
applications. If the application deals with highly sensitive data and requires strict compliance,
the move to the cloud should be considered carefully.
Whereas, we can use Big Data where traditional methods and frameworks are ineffective. Big
data is not a replacement for relational database systems; it solves specific problems
related to large data sets, and most big data solutions do not deal well with small data.
5) Cost
Cloud Computing is economical, as it has low maintenance costs, a centralized platform, no
upfront cost and disaster-safe implementation. Whereas, Big Data is highly scalable, provides a
robust ecosystem, and is cost-effective.
6) Job roles and responsibility
The users of the cloud are developers or office workers in an organization. Whereas, in big
data there are big data analysts, who are responsible for analyzing the data and finding
interesting insights and possible future trends.
7) Trends
Some of the important trends in Cloud Computing are:
Public Cloud
Private Cloud
Hybrid Cloud
Community Cloud
Whereas, some important trends in Big Data technology are Hadoop, MapReduce, and HDFS.
8) Vendors
Some of the vendors and solution providers of Cloud Computing are:
Microsoft
Dell
Apple
IBM
Whereas, some of the vendors and solution providers of Big Data are:
Cloudera
Hortonworks
Apache
MapR