Understanding of Big Data
Introduction to Big Data, Definition of Big Data, Need of Big Data Management, Sources of Big
Data, Characteristics of Big Data, Evolution of Big Data, Differentiating between Data Warehouse
and Big Data, Real time data processing, Structure of Big Data, Big Data Life Cycle and
processing, Applications of Big Data, Benefits of Big Data Management.
According to Gartner, "Big data is high-volume, high-velocity and high-variety information assets
that demand cost-effective, innovative forms of information processing for enhanced insight and
decision making." This definition answers the "What is Big Data?" question: Big Data refers to
complex and large data sets that have to be processed and analyzed to uncover valuable information
that can benefit businesses and organizations. A few basic tenets make the idea even simpler: It
refers to a massive amount of data that keeps growing exponentially with time. It is so voluminous
that it cannot be processed or analyzed using conventional data processing techniques. Working with
it involves data mining, data storage, data analysis, data sharing, and data visualization. The
term is an all-encompassing one, covering the data itself, the data frameworks, and the tools and
techniques used to process and analyze the data.
Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-
effective, innovative forms of information processing that enable enhanced insight, decision
making, and process automation.
The importance of big data does not revolve around how much data a company has but how a
company utilizes the collected data. Every company uses data in its own way; the more efficiently
a company uses its data, the more potential it has to grow. A company can take data from any
source and analyze it to find answers that enable benefits such as the following:
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics makes it easy to
identify new sources of data, which helps businesses analyze data immediately and make quick
decisions based on what they learn.
3. Understand the market conditions: By analyzing big data you can get a better understanding of
current market conditions. For example, by analyzing customers' purchasing behavior, a
company can find out which products sell the most and produce products according to this
trend. In this way, it can get ahead of its competitors.
4. Control online reputation: Big data tools can do sentiment analysis, so you can get
feedback about who is saying what about your company. If you want to monitor and improve the
online presence of your business, big data tools can help with this.
5. Using Big Data Analytics to Boost Customer Acquisition and Retention
The customer is the most important asset any business depends on. No business can claim success
without first establishing a solid customer base. However, even with a customer base, a
business cannot afford to disregard the high competition it faces. If a business is slow to learn what
customers are looking for, then it is very easy to begin offering poor quality products. In the end,
loss of clientele will result, and this creates an adverse overall effect on business success. The use
of big data allows businesses to observe various customer-related patterns and trends, and acting on
these observations helps a business retain its clientele and win new customers.
6. Using Big Data Analytics to Solve Advertisers Problem and Offer Marketing Insights
The sheer volume of big data makes it crucial for businesses to differentiate between the disparate
big data sources available in order to use them effectively. Big data is used by organizations for the
sole purpose of analytics. However, before companies can set out to extract insights and valuable
information from big data, they must know which big data sources are available.
Data, as we know, is massive and exists in various forms; if it is not classified or sourced well, it
becomes very difficult to derive meaningful insights from it. The main sources of big data are
described below.
Media
Media is the most popular source of big data, as it provides valuable insights into consumer preferences
and changing trends. Since it is self-broadcast and crosses all physical and demographic barriers, it is
the fastest way for businesses to get an in-depth overview of their target audience, draw patterns and
conclusions, and enhance their decision-making. Media includes social media and interactive platforms,
like Google, Facebook, Twitter, YouTube, and Instagram, as well as generic media like images, videos, audio,
and podcasts, which provide quantitative and qualitative insights on every aspect of user interaction.
The Cloud
Today, companies have moved beyond traditional data sources by shifting their data to the cloud. Cloud
storage accommodates structured and unstructured data and provides businesses with real-time
information and on-demand insights. The main attributes of cloud computing are its flexibility and scalability.
As big data can be stored and sourced on public or private clouds, via networks and servers, the cloud makes
for an efficient and economical data source.
The Web
The public web constitutes big data that is widespread and easily accessible. Data on the Web, or 'Internet',
is commonly available to individuals and companies alike. Moreover, web services such as Wikipedia
provide free and quick informational insights to everyone. The enormity of the Web ensures its diverse
usability and is especially beneficial to start-ups and SMEs, as they don't have to wait to develop their
own big data infrastructure and repositories before they can leverage big data.
The Internet of Things (IoT)
Machine-generated content, or data created by IoT devices, constitutes a valuable source of big data. This data is
usually generated by sensors connected to electronic devices, and the sourcing capacity
depends on the ability of the sensors to provide real-time, accurate information. IoT is now gaining
momentum and includes big data generated not only from computers and smartphones, but potentially
from every device that can emit data. With IoT, data can now be sourced from medical devices, vehicular
processes, video games, meters, cameras, household appliances, and the like.
Databases
Businesses today prefer to use an amalgamation of traditional and modern databases to acquire relevant
big data. This integration paves the way for a hybrid data model and requires low investment and IT
infrastructure costs. These databases are also deployed for several business intelligence
purposes, and the insights extracted from them are used to drive
business profits. Popular databases include a variety of data sources, such as MS Access, DB2, Oracle, SQL,
and Amazon SimpleDB, among others.
Extracting and analyzing data across extensive big data sources is a complex process and
can be frustrating and time-consuming. These complications can be resolved if organizations take
all the necessary aspects of big data into account, consider the relevant data sources, and deploy them in
a manner that is well tuned to their organizational goals.
Characteristics of Big Data
Big Data involves volumes of data so large that they cannot be handled by traditional data storage or
processing units. It is used by many multinational companies to process data and run the business
of many organizations. The data flow can exceed 150 exabytes per day before replication.
There are five V's of Big Data that describe its characteristics:
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself relates to enormous size. Big Data refers to the vast 'volumes' of data generated
daily from many sources, such as business processes, machines, social media platforms, networks, human
interactions, and many more.
Facebook, for example, generates approximately a billion messages per day, records the "Like" button being
pressed around 4.5 billion times, and receives more than 350 million new posts each day. Big data
technologies are built to handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, collected from different
sources. In the past, data was collected only from databases and spreadsheets, but these days data
arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, and so on.
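To make the variety point concrete, here is a minimal sketch (using pandas; the columns and records are hypothetical, not taken from the text) of handling two different forms of data side by side:

    import io
    import pandas as pd

    # Structured data: fixed columns, as it might come from a database export (made-up rows).
    orders = pd.read_csv(io.StringIO("order_id,amount\n1,250\n2,99\n"))

    # Semi-structured data: records whose fields can vary from one post to the next (made-up rows).
    posts = pd.DataFrame([
        {"user": "asha", "text": "Great service!"},
        {"user": "ravi", "text": "Parcel was late", "hashtags": ["#delivery"]},
    ])

    print(orders.dtypes)    # every row shares the same schema
    print(posts.head())     # the 'hashtags' field is simply missing for the first post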
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at
which data is created in real time. It encompasses the speed of incoming data sets, the rate of change,
and bursts of activity. A primary aspect of Big Data is making the required data available rapidly, on demand.
Evolution of Big Data
Big data has revolutionized the modern business environment in recent years. A mixture of
structured, semistructured and unstructured data, big data is a collection of information that
organizations can mine for business purposes through machine learning, predictive modeling, and
other advanced data analytics applications.
At one time the concept of big data may have seemed like a buzzword, but the reality is the impact
of big data on the world around us has been tremendous. As you will see from this timeline
covering the history of big data, big data analytics builds on concepts that have been around for
centuries.
The history of data analysis that led to today's advanced big data analytics starts back in the 17th
century in London. Let's begin our journey.
1663
John Graunt introduces statistical data analysis with the bubonic plague. The London haberdasher
published the first collection of public health records when he recorded death rates and their
variations during the bubonic plague in England.
1865
Richard Millar Devens uses the term "business intelligence" in his Cyclopaedia of Commercial and
Business Anecdotes, describing how a banker profited by gathering and analyzing information ahead
of his competitors.
1884
Herman Hollerith invents the punch card tabulating machine, marking the beginning of data
processing. The tabulating device Hollerith developed was used to process data from the 1890 U.S.
Census. Later, in 1911, he founded the Computing-Tabulating-Recording Company, which would
eventually become IBM.
1926
Nikola Tesla predicts humans will one day have access to large swaths of data via an instrument
that can be carried "in [one's] vest pocket." Tesla managed to predict our modern affinity for
smartphones and other handheld devices based on his understanding of how wireless technology
would change particles: "When wireless is perfectly applied, the whole earth will be converted
into a huge brain, which in fact it is, all things being particles of a real and rhythmic whole. We
shall be able to communicate with one another instantly, irrespective of distance."
1928
Fritz Pfleumer invents a way to store information on tape. Pfleumer's process for putting metal
stripes on magnetic papers eventually led him to create magnetic tape, which formed the
foundation for video cassettes, movie reels and more.
1943
The U.K. created a theoretical computer and one of the first data processing machines to decipher
Nazi codes during WWII. The Colossus, as it was called, performed Boolean and counting
operations to analyze large volumes of data.
1959
Arthur Samuel, a programmer at IBM and pioneer of artificial intelligence, coined the term
machine learning (ML).
1969
Advanced Research Projects Agency Network (ARPANET), the first wide area network that
included distributed control and TCI/IP protocols, was created. This formed the foundation of
today's internet.
The internet age: The dawn of big data
As computers start sharing information at exponentially greater rates due to the internet, the next
stage in the history of big data takes shape.
1996
Digital data storage becomes more cost-effective than storing information on paper for the first
time in 1996, as reported by R.J.T. Morris and B.J. Truskowski in their 2003 IBM Systems Journal
paper, "The Evolution of Storage Systems."
1997
The domain google.com is registered a year before launching, starting the search engine's climb to
dominance and development of numerous other technological innovations, including in the areas
of machine learning, big data and analytics.
1998
Carlo Strozzi develops NoSQL, an open source relational database that does not use SQL as its
query language. The name was later adopted for non-relational databases, which store and retrieve
data modeled differently from the traditional tabular structures of relational databases.
2001
Doug Laney of analyst firm Gartner coins the 3Vs (volume, variety and velocity), defining the
dimensions and properties of big data. The Vs encapsulate the true definition of big data and usher
in a new period where big data can be viewed as a dominant feature of the 21st century. Additional
Vs -- such as veracity, value and variability -- have since been added to the list.
2005
Computer scientists Doug Cutting and Mike Cafarella create Apache Hadoop, the open source
framework used to store and process large data sets, with a team of engineers spun off from Yahoo.
2006
2008
The world's CPUs process over 9.57 zettabytes (or 9.57 trillion gigabytes) of
data, about equal to 12 gigabytes per person. Global production of new
information hits an estimated 14.7 exabytes.
2009
Gartner reports business intelligence as the top priority for CIOs. As companies
face a period of economic volatility and uncertainty due to the Great Recession,
squeezing value out of data becomes paramount.
2011
McKinsey reports that by 2018 the U.S. will face a shortage of analytics talent.
Lacking between 140,000 and 190,000 people with deep analytical skills and a
further 1.5 million analysts and managers with the ability to make accurate data-
driven decisions.
Also, Facebook launches the Open Compute Project to share specifications for
energy-efficient data centers. The initiative's goal is to deliver a 38% increase in
energy efficiency at a 24% lower cost.
2012
The Obama administration announces the Big Data Research and Development
Initiative with a $200 million commitment, citing a need to improve the ability to
extract valuable insights from data and accelerate the pace of STEM (science,
technology, engineering, and mathematics) growth, enhance national security
and transform learning. The acronym has since become STEAM, adding an A by
incorporating the arts.
Differentiating between Data Warehouse and Big Data
Big Data and Data Warehouse are both used as main sources of input for Business Intelligence, such as
the creation of analytical results and report generation, in order to support effective business decision-
making. Big Data allows unrefined data from any source, but a Data Warehouse allows only
processed data, as it has to maintain the reliability and consistency of the data. The unprocessed data in
Big Data systems can be of any size and format, whereas almost all the data in a Data Warehouse
shares a common structure due to its refined, organized design.
o A Data Warehouse is an architecture for storing data (a data repository), whereas Big Data is a
technology for handling huge volumes of data and preparing the repository.
o A Data Warehouse accepts only data coming from a DBMS, whereas Big Data accepts all kinds of data,
including transactional data, social media data, machine data, and DBMS data.
o A Data Warehouse handles only structured data, whereas Big Data can handle structured,
semi-structured, and unstructured data.
o A Data Warehouse has no concept of a distributed file system, whereas Big Data normally uses a
distributed file system to load huge volumes of data in a distributed way.
o A Data Warehouse is built on relational databases, so data is stored and fetched with normal SQL
queries, whereas Big Data does not follow a fixed database structure, so tools such as Hive or
Spark SQL are needed to query the data (see the sketch after this list).
o Close to 100% of the data loaded into a Data Warehouse is used for analytics reports, whereas only a
small fraction (roughly 0.5%) of the data loaded into Hadoop has been used for analytics reports so far;
the rest is loaded into the system but remains unused.
o A Data Warehouse cannot handle totally unstructured data, whereas Big Data platforms such as
Apache Hadoop are designed to handle unstructured data.
o In a Data Warehouse, fetch time increases as the data volume grows, whereas Big Data systems can
fetch huge volumes of data in a short period of time.
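As a minimal sketch of the querying difference mentioned above (assuming PySpark is installed and a Spark session can be created; the HDFS path and field names are hypothetical):

    from pyspark.sql import SparkSession

    # Start a Spark session (local or on a cluster).
    spark = SparkSession.builder.appName("bigdata-query-sketch").getOrCreate()

    # Big Data systems typically read raw, semi-structured files straight from a
    # distributed file system (the path below is made up).
    events = spark.read.json("hdfs:///data/raw/events/*.json")
    events.createOrReplaceTempView("events")

    # Spark SQL (or Hive) then lets us query the data much like a warehouse table.
    daily_counts = spark.sql("""
        SELECT event_date, COUNT(*) AS n_events
        FROM events
        GROUP BY event_date
        ORDER BY event_date
    """)
    daily_counts.show()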
Real-Time Data Processing
On-demand real-time analytics — This is a reactive approach. It awaits a query from the end user,
processes the request, and then delivers the analytics. For example, a web analyst queries site traffic
to head off a potential crash of the website.
Continuous real-time analytics — This is a proactive approach. It alerts users with continuous
updates in real time, for example, tracking the stock market with various visualizations
on a website.
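A minimal, pure-Python sketch of the two approaches over a simulated stream of page-view counts (all names and numbers here are made up for illustration):

    import random

    def pageview_stream():
        """Simulated metric stream; a stand-in for a real message queue or log feed."""
        while True:
            yield random.randint(50, 500)   # made-up page views per second

    # Continuous real-time analytics: proactively alert as each value arrives.
    def continuous_monitor(stream, threshold=450, max_events=50):
        for i, views in enumerate(stream):
            if views > threshold:
                print(f"ALERT: traffic spike of {views} views/sec")
            if i + 1 >= max_events:
                break

    # On-demand real-time analytics: keep recent values, compute only when asked.
    recent = []

    def record(views):
        recent.append(views)

    def average_traffic():
        """Called only when an analyst submits a query."""
        return sum(recent) / len(recent) if recent else 0.0

    stream = pageview_stream()
    continuous_monitor(stream)              # pushes alerts continuously
    for _ in range(10):
        record(next(stream))
    print("On-demand query result:", round(average_traffic(), 1))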
Structure Of Big Data
In the last few years, big data has become central to the tech landscape. You can consider big data
as a collection of massive and complex datasets that are difficult to store and process utilizing
traditional database management tools and traditional data processing applications. The key
challenges include capturing, storing, managing, analyzing, and visualization of that data.
When it comes to the structure of big data, you can consider it a collection of data values, the
relationships between them together with the operations or functions which can be applied to that
data.
These days, lots of resources (social media platforms being the number one) have become available
to companies from where they can capture massive amounts of data. Now, this captured data is
used by enterprises to develop a better understanding and closer relationships with their target
customers. It’s important to understand that every new customer action essentially creates a more
complete picture of the customer, helping organizations achieve a more detailed understanding of
their ideal customers. Therefore, it can be easily imagined why companies across the globe are
striving to leverage big data. Put simply, big data comes with the potential that can redefine a
business, and organizations, which succeed in analyzing big data effectively, stand a huge chance
to become global leaders in the business domain.
Structures of big data
Big data structures can be divided into three categories – structured, unstructured, and semi-
structured. Let’s have a look at them in detail.
1- Structured data
It’s the data which follows a pre-defined format and thus, is straightforward to analyze. It
conforms to a tabular format together with relationships between different rows and
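As a small illustration of the difference (using Python's built-in sqlite3 and json modules; the table and records are hypothetical), structured data fits a fixed schema that can be queried directly, while semi-structured data carries its own labels but no fixed schema:

    import json
    import sqlite3

    # Structured: a pre-defined tabular schema with typed rows and columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                     [(1, "Asha", "Pune"), (2, "Ravi", "Delhi")])
    for row in conn.execute("SELECT city, COUNT(*) FROM customers GROUP BY city"):
        print(row)

    # Semi-structured: self-describing keys, but fields can differ between records.
    posts = [
        {"user": "asha", "text": "Loving the new phone!"},
        {"user": "ravi", "text": "Battery drains fast", "hashtags": ["#battery"]},
    ]
    print(json.dumps(posts, indent=2))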
Big Data Life Cycle and Processing
Phase I: Business Case Evaluation
In this stage, the team learns about the business domain, which presents the motivation
and goals for carrying out the analysis. The problem is identified, and assumptions are made
about how much potential gain the company will realize after carrying out the analysis.
Important activities in this step include framing the business problem as an analytics challenge
that can be addressed in subsequent phases. It also helps decision-makers understand the business
resources that will need to be utilized, and thereby the underlying budget required to carry out
the project. Moreover, it can be determined whether the problem identified is a Big Data problem or not.
Phase II: Data Identification
Once the business case is identified, it is time to find appropriate datasets to
work with. In this stage, analysis is done to see what other companies have done for a
similar case.
Depending on the business case and the scope of analysis of the project being addressed,
the sources of datasets can be either external or internal to the company. Internal
datasets include data collected from internal sources, such as feedback forms or existing
software, while external datasets include datasets from third-party providers.
Phase III: Data Acquisition and Filtering
Once the sources of data are identified, the data is gathered from those sources.
This data is mostly unstructured. It is then subjected to filtration, in which corrupt or
irrelevant data, which is outside the scope of the analysis objective, is removed.
Here, corrupt data means data with missing records or incompatible data types.
After filtration, a copy of the filtered data is stored and compressed, as it may be of use
in the future for other analyses.
Phase IV: Data Extraction
The data is now filtered, but some of its entries might still be incompatible with the analysis.
To rectify this issue, a separate phase, known as the data extraction phase, is used: data that
does not match the underlying scope of the analysis is extracted and transformed into a
compatible form.
Phase V: Data Validation and Cleansing
As mentioned in Phase III, the data is collected from various sources, which results in
much of it being unstructured. It may also contain invalid entries or violate constraints,
which can lead to false results. Hence there is a need to clean and validate the data.
This includes removing invalid data and establishing complex validation rules. There
are many ways to validate and clean the data. For example, a dataset might contain a few
rows with null entries; if a similar dataset is present, those entries are copied from
that dataset, otherwise those rows are dropped, as shown in the sketch below.
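A small pandas sketch of that rule, copying a missing value from a similar dataset when one exists and dropping the row otherwise (the dataset names, roll numbers, and scores are hypothetical):

    import pandas as pd

    # Primary dataset with a missing (null) score.
    marks = pd.DataFrame({"roll_number": [101, 102, 103],
                          "score": [88.0, None, 74.0]})

    # A similar dataset that happens to hold the missing value.
    backup = pd.DataFrame({"roll_number": [102], "score": [91.0]})

    # Copy missing entries from the similar dataset where possible...
    filled = marks.set_index("roll_number")["score"].fillna(
        backup.set_index("roll_number")["score"])

    # ...and drop any rows that are still null.
    clean = filled.dropna().reset_index()
    print(clean)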
Phase VI: Data Aggregation and Representation
The data is now cleansed and validated against certain rules set by the enterprise. But the
data might be spread across multiple datasets, and it is not advisable to work with
multiple datasets. Hence, the datasets are joined together. For example, if there are two
datasets, namely a Student Academic dataset and a Student Personal Details dataset,
then both can be joined via a common field, i.e. the roll number, as sketched below.
This phase calls for intensive operations, since the amount of data can be very large.
Automation can be brought in so that these steps are executed without any human intervention.
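Continuing the student example, a minimal pandas sketch of joining the two datasets on their common field (all names and values are made up):

    import pandas as pd

    academic = pd.DataFrame({"roll_number": [101, 102, 103],
                             "grade": ["A", "B", "A"]})
    personal = pd.DataFrame({"roll_number": [101, 102, 103],
                             "name": ["Asha", "Ravi", "Meena"],
                             "city": ["Pune", "Delhi", "Chennai"]})

    # Join the Student Academic and Student Personal Details datasets
    # on their common field, the roll number.
    students = academic.merge(personal, on="roll_number", how="inner")
    print(students)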
Phase VII: Data Visualization
We now have answers to some questions, using the information from the data in the
datasets. But these answers are still in a form that can't be presented to business users.
Some form of representation is required to derive value or conclusions from the
analysis. Hence, various tools are used to visualize the data in graphical form, which can
easily be interpreted by business users.
Visualization is said to influence the interpretation of the results. Moreover, it allows
the users to discover answers to questions that are yet to be formulated.
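As an illustration of this step, a short matplotlib sketch that turns one such analysis result into a chart a business user can read (the category names and figures are made up):

    import matplotlib.pyplot as plt

    # Hypothetical output of the analysis phase: sales by product category.
    categories = ["Electronics", "Clothing", "Groceries", "Books"]
    sales = [420, 310, 510, 120]            # made-up values, in thousands

    plt.bar(categories, sales)
    plt.ylabel("Sales (thousands)")
    plt.title("Sales by category")
    plt.tight_layout()
    plt.savefig("sales_by_category.png")    # or plt.show() in an interactive session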
Applications of Big Data
1. Banking and Securities
Retail traders, big banks, hedge funds, and other so-called 'big boys' in the financial
markets use Big Data for trade analytics in high-frequency trading, pre-trade decision-
support analytics, sentiment measurement, predictive analytics, and more.
This industry also relies heavily on Big Data for risk analytics, including anti-money
laundering, demand enterprise risk management, "Know Your Customer," and fraud
mitigation.
Big Data providers specific to this industry include 1010data, Panopticon Software,
Streambase Systems, Nice Actimize, and Quartet FS.
2. Communications, Media and Entertainment
Spotify, an on-demand music service, uses Hadoop Big Data analytics to collect data from
its millions of users worldwide and then uses the analyzed data to give informed music
recommendations to individual users.
Amazon Prime, which is driven to provide a great customer experience by offering video,
music, and Kindle books in a one-stop-shop, also heavily utilizes Big Data.
Big Data Providers in this industry include Infochimps, Splunk, Pervasive Software, and
Visible Measures.
3. Healthcare Providers
Industry-specific Big Data Challenges
The healthcare sector has access to huge amounts of data but has been plagued by failures
in utilizing the data to curb the cost of rising healthcare and by inefficient systems that
stifle faster and better healthcare benefits across the board.
Other challenges related to Big Data include the exclusion of patients from the decision-
making process and the use of data from different readily available sensors.
Free public health data and Google Maps have been used by the University of Florida to
create visual data that allows for faster identification and efficient analysis of healthcare
information, used in tracking the spread of chronic disease. Obamacare has also utilized
Big Data in a variety of ways. Big Data Providers in this industry include Recombinant
Data, Humedica, Explorys, and Cerner.
4. Education
From a practical point of view, staff and institutions have to learn new data management
and analysis tools.
On the technical side, there are challenges to integrating data from different sources on
different platforms and from different vendors that were not designed to work with one
another. Politically, issues of privacy and personal data protection associated with Big Data
used for educational purposes is a challenge.
In another use case of Big Data in education, it is used to measure teachers'
effectiveness to ensure a pleasant experience for both students and teachers.
Teachers' performance can be fine-tuned and measured against student numbers, subject
matter, student demographics, student aspirations, behavioral classification, and several
other variables.
Big Data Providers in this industry include Knewton and Carnegie Learning and
MyFit/Naviance.
5. Manufacturing and Natural Resources
Similarly, large volumes of data from the manufacturing industry remain untapped. The
underutilization of this information holds back improvements in product quality, energy
efficiency, reliability, and profit margins.
Big data has also been used in solving today’s manufacturing challenges and to gain a
competitive advantage, among other benefits.
Big Data Providers in this industry include CSC, Aspen Technology, Invensys, and
Pentaho.
6. Government
Industry-specific Big Data Challenges
In governments, the most significant challenges are the integration and interoperability of
Big Data across different government departments and affiliated organizations.
Big data is being used in the analysis of large amounts of social disability claims made to
the Social Security Administration (SSA) that arrive in the form of unstructured data. The
analytics are used to process medical information rapidly and efficiently for faster decision
making and to detect suspicious or fraudulent claims.
The Food and Drug Administration (FDA) is using Big Data to detect and study patterns
of food-related illnesses and diseases. This allows for a faster response, which has led to
more rapid treatment and fewer deaths.
The Department of Homeland Security uses Big Data for several different use cases. Big
data is analyzed from various government agencies and is used to protect the country.
Big Data Providers in this industry include Digital Reasoning, Socrata, and HP.
7. Insurance
When it comes to claims management, predictive analytics from Big Data has been used
to offer faster service since massive amounts of data can be analyzed mainly in the
underwriting stage. Fraud detection has also been enhanced.
Through massive data from digital channels and social media, real-time monitoring of
claims throughout the claims cycle has been used to provide insights.
Big Data Providers in this industry include Sprint, Qualcomm, Octo Telematics, The
Climate Corp.
8. Retail and Wholesale Trade
At New York's Big Show retail trade conference in 2014, companies like Microsoft, Cisco,
and IBM pitched the need for the retail industry to utilize Big Data for analytics and other
uses, including:
o Optimized staffing through data from shopping patterns, local events, and so on
o Reduced fraud
o Timely analysis of inventory
Social media also has a lot of potential here and continues to be adopted slowly but surely,
especially by brick-and-mortar stores. Social media is used for customer
prospecting, customer retention, promotion of products, and more.
9. Transportation
Industry-specific Big Data Challenges
In recent times, huge amounts of data from location-based social networks and high-speed
data from telecoms have affected travel behavior. Regrettably, research to understand
travel behavior has not progressed as quickly.
In most places, transport demand models are still based on poorly understood new social
media structures.
o Government use of Big Data: traffic control, route planning, intelligent transport systems,
and congestion management (by predicting traffic conditions).
o Private-sector use of Big Data in transport: revenue management, technological
enhancements, logistics, and competitive advantage (by consolidating shipments and
optimizing freight movement).
o Individual use of Big Data: route planning to save on fuel and time, travel
arrangements in tourism, etc.
(Source: Using Big Data in the Transport Sector)
Big Data Providers in this industry include Qualcomm and Manhattan Associates.
10. Energy and Utilities
In utility companies, the use of Big Data also allows for better asset and workforce management,
which is useful for recognizing errors and correcting them as soon as possible before complete
failure is experienced.