What Is Big Data
The term big data has been in use since the 1990s, with some giving credit to John Mashey for
popularizing the term.[16][17] Big data usually includes data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage, and process data within a tolerable
elapsed time.[18] Big data philosophy encompasses unstructured, semi-structured and structured
data, however the main focus is on unstructured data. [19] Big data "size" is a constantly moving target;
as of 2012 ranging from a few dozen terabytes to many zettabytes of data.[20] Big data requires a set
of techniques and technologies with new forms of integration to reveal insights from data-sets that
are diverse, complex, and of a massive scale.[21]
"Variety", "veracity", and various other "Vs" are added by some organizations to describe it, a
revision challenged by some industry authorities.[22] The Vs of big data are often referred to as the
"three Vs", "four Vs", or "five Vs", representing the qualities of big data: volume, variety,
velocity, veracity, and value.[3] Variability is often included as an additional quality of big data.
A 2018 definition states "Big data is where parallel computing tools are needed to handle data", and
notes, "This represents a distinct and clearly defined change in the computer science used, via
parallel programming theories, and losses of some of the guarantees and capabilities made
by Codd's relational model."[23]
In a comparative study of big datasets, Kitchin and McArdle found that none of the commonly
considered characteristics of big data appear consistently across all of the analyzed cases.[24] For this
reason, other studies identified the redefinition of power dynamics in knowledge discovery as the
defining trait.[25] Instead of focusing on intrinsic characteristics of big data, this alternative perspective
pushes forward a relational understanding of the object, claiming that what matters is the way in
which data is collected, stored, made available and analyzed.
Characteristics[edit]
Figure: growth of big data's primary characteristics of volume, velocity, and variety.
Big data can be described by the following characteristics:
Volume
The quantity of generated and stored data. The size of the data determines the value and
potential insight, and whether it can be considered big data or not. The size of big data
usually ranges from terabytes to petabytes and beyond.[30]
Variety
The type and nature of the data. Earlier technologies such as RDBMSs were capable of
handling structured data efficiently and effectively. However, the shift in type and nature
from structured to semi-structured or unstructured data challenged the existing tools and
technologies. Big data technologies evolved with the prime intention of capturing, storing,
and processing semi-structured and unstructured data (variety) generated at high speed
(velocity) and huge in size (volume). Later, these tools and technologies were also explored
and used for handling structured data, though preferably for storage. Eventually, the processing
of structured data remained optional, using either big data technologies or traditional RDBMSs. This
helps in analyzing data towards effective usage of the hidden insights exposed in the data
collected via social media, log files, sensors, etc. Big data draws from text, images, audio, and
video, and it completes missing pieces through data fusion.
Velocity
The speed at which the data is generated and processed to meet the demands and
challenges that lie in the path of growth and development. Big data is often available in real-
time. Compared to small data, big data is produced more continually. Two kinds of velocity
related to big data are the frequency of generation and the frequency of handling, recording,
and publishing.[31]
Veracity
The truthfulness or reliability of the data, which refers to the data quality and the data value.[32]
Big data must not only be large in size, but must also be reliable in order to achieve value
in its analysis. The quality of captured data can vary greatly, affecting accurate
analysis.[33]
Value
The worth in information that can be achieved by the processing and analysis of large
datasets. Value can also be measured by an assessment of the other qualities of big data.[34]
Value may also represent the profitability of information that is retrieved from the analysis
of big data.
Variability
The characteristic of the changing formats, structure, or sources of big data. Big data can
include structured, unstructured, or combinations of structured and unstructured data. Big
data analysis may integrate raw data from multiple sources. The processing of raw data may
also involve transformations of unstructured data to structured data.
Other possible characteristics of big data are:[35]
Exhaustive
Whether the entire system (i.e., n = all) is captured or recorded or not. Big data may or may not
include all the available data from sources.
Fine-grained and uniquely lexical
Respectively, the proportion of specific data collected for each element, and whether each
element and its characteristics are properly indexed or identified.
Relational
If the data collected contains common fields that would enable a conjoining, or meta-
analysis, of different data sets.
Extensional
If new fields in each element of the data collected can be added or changed easily.
Scalability
If the size of the big data storage system can expand rapidly.
Architecture[edit]
Big data repositories have existed in many forms, often
built by corporations with a special need. Commercial
vendors historically offered parallel database
management systems for big data beginning in the
1990s. For many years, WinterCorp published the
largest database report.[36][promotional source?]
In 1984, Teradata Corporation marketed the parallel-
processing DBC 1012 system. Teradata systems were
the first to store and analyze 1 terabyte of data in
1992. Hard disk drives were 2.5 GB in 1991, so the
definition of big data continuously evolves according
to Kryder's law. Teradata installed the first petabyte-
class RDBMS-based system in 2007. As of 2017, there
are a few dozen petabyte-class Teradata relational
databases installed, the largest of which exceeds 50
PB. Systems up until 2008 were 100% structured
relational data. Since then, Teradata has added
unstructured data types including XML, JSON, and
Avro.
In 2000, Seisint Inc. (now LexisNexis Risk Solutions)
developed a C++-based distributed platform for data
processing and querying known as the HPCC
Systems platform. This system automatically partitions,
distributes, stores and delivers structured, semi-
structured, and unstructured data across multiple
commodity servers. Users can write data processing
pipelines and queries in a declarative dataflow
programming language called ECL. Data analysts
working in ECL are not required to define data
schemas upfront and can rather focus on the particular
problem at hand, reshaping data in the best possible
manner as they develop the solution. In 2004,
LexisNexis acquired Seisint Inc.[37] along with its high-speed
parallel-processing platform, and successfully used this
platform to integrate the data systems of ChoicePoint
Inc. when it acquired that company in 2008.[38] In
2011, the HPCC systems platform was open-sourced
under the Apache v2.0 License.
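The declarative, schema-light style described above can be loosely illustrated outside ECL. The short Python sketch below is a hypothetical stand-in, not HPCC or ECL code: it declares a small dataflow as a chain of lazy transformations that execute only when the result is pulled, which is the general idea behind declarative dataflow programming.

```python
# Hypothetical dataflow sketch (illustrative Python, not ECL/HPCC code):
# transformations are declared first and run only when results are pulled.
def source(lines):
    for line in lines:
        yield line

def parse(records):
    # No schema is declared up front; fields are shaped as needed.
    for rec in records:
        name, value = rec.split(",")
        yield {"name": name, "value": int(value)}

def keep_large(records, threshold=10):
    for rec in records:
        if rec["value"] > threshold:
            yield rec

raw = ["sensor-a,7", "sensor-b,42", "sensor-c,13"]
pipeline = keep_large(parse(source(raw)))  # nothing has executed yet
print(list(pipeline))                      # execution happens here
```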
CERN and other physics experiments have collected
big data sets for many decades, usually analyzed
via high-throughput computing rather than the map-
reduce architectures usually meant by the current "big
data" movement.
In 2004, Google published a paper on a process
called MapReduce that uses a similar architecture. The
MapReduce concept provides a parallel processing
model, and an associated implementation was
released to process huge amounts of data. With
MapReduce, queries are split and distributed across
parallel nodes and processed in parallel (the "map"
step). The results are then gathered and delivered (the
"reduce" step). The framework was very successful,
[39]
so others wanted to replicate the algorithm.
Therefore, an implementation of the MapReduce
framework was adopted by an Apache open-source
project named "Hadoop".[40] Apache Spark was
developed in 2012 in response to limitations in the
MapReduce paradigm, as it adds the ability to set up
many operations (not just map followed by reduce).
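As a minimal illustration of the model just described, the following Python sketch counts word frequencies with an explicit "map" step and a "reduce" step. It runs in a single process and only mimics how a framework such as Hadoop would scatter the map work across nodes and gather the results; the function and variable names are illustrative, not part of any framework's API.

```python
from collections import defaultdict

# "Map" step: emit (word, 1) pairs for every word in a chunk of input.
# A real framework would run this function on many nodes in parallel.
def map_chunk(chunk):
    return [(word, 1) for word in chunk.split()]

# "Reduce" step: sum the counts gathered for each word.
def reduce_counts(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["big data needs parallel tools", "parallel tools process big data"]

# Scatter the map work over the chunks, then gather and reduce the results.
mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
print(reduce_counts(mapped))  # {'big': 2, 'data': 2, 'needs': 1, ...}
```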
MIKE2.0 is an open approach to information
management that acknowledges the need for revisions
due to big data implications identified in an article titled
"Big Data Solution Offering".[41] The methodology
addresses handling big data in terms of
useful permutations of data sources, complexity in
interrelationships, and difficulty in deleting (or
modifying) individual records.[42]
Studies in 2012 showed that a multiple-layer
architecture was one option to address the issues that
big data presents. A distributed parallel architecture
distributes data across multiple servers; these parallel
execution environments can dramatically improve data
processing speeds. This type of architecture inserts
data into a parallel DBMS, which implements the use
of MapReduce and Hadoop frameworks. This type of
framework looks to make the processing power
transparent to the end-user by using a front-end
application server.[43]
The data lake allows an organization to shift its focus
from centralized control to a shared model to respond
to the changing dynamics of information management.
This enables quick segregation of data into the data
lake, thereby reducing the overhead time.[44][45]
Technologies[edit]
A 2011 McKinsey Global Institute report characterizes
the main components and ecosystem of big data in terms of
techniques for analyzing data, big data technologies, and
visualization.[46]
Applications[edit]
Figure: Bus wrapped with SAP big data advertising parked outside IDF13.
Government[edit]
The use and adoption of big data within governmental
processes allows efficiencies in terms of cost,
productivity, and innovation,[57] but does not come
without its flaws. Data analysis often requires multiple
parts of government (central and local) to work in
collaboration and create new and innovative processes
to deliver the desired outcome. A common government
organization that makes use of big data is the National
Security Agency (NSA), which constantly monitors
Internet activity in search of potential patterns of
suspicious or illegal activities its systems may pick up.
Civil registration and vital statistics (CRVS) collects all
certificate statuses from birth to death. CRVS is a
source of big data for governments.
International development[edit]
Research on the effective usage of information and
communication technologies for development (also
known as "ICT4D") suggests that big data technology
can make important contributions but also present
unique challenges to international development.[58][59]
Advancements in big data analysis offer cost-
effective opportunities to improve decision-making in
critical development areas such as health care,
employment, economic productivity, crime, security,
and natural disaster and resource management.[60][61][62]
Additionally, user-generated data offers new
opportunities to give the unheard a voice.[63] However,
longstanding challenges for developing regions such
as inadequate technological infrastructure and
economic and human resource scarcity exacerbate
existing concerns with big data such as privacy,
imperfect methodology, and interoperability issues.[60]
The challenge of "big data for development"[60] is
currently evolving toward the application of this data
through machine learning, known as "artificial
intelligence for development" (AI4D).[64]
Benefits[edit]
A major practical application of big data for
development has been "fighting poverty with data". [65] In
2015, Blumenstock and colleagues estimated
predicted poverty and wealth from mobile phone
metadata [66] and in 2016 Jean and colleagues
combined satellite imagery and machine learning to
predict poverty.[67] Using digital trace data to study the
labor market and the digital economy in Latin America,
Hilbert and colleagues [68][69] argue that digital trace data
has several benefits such as:
Education[edit]
A McKinsey Global Institute study found a shortage of
1.5 million highly trained data professionals and
managers,[46] and a number of universities,[84][better source needed]
including the University of Tennessee and UC
Berkeley, have created master's programs to meet this
demand. Private boot camps have also developed
programs to meet that demand, including free
programs like The Data Incubator or paid programs
like General Assembly.[85] In the specific field of
marketing, one of the problems stressed by Wedel and
Kannan[86] is that marketing has several subdomains
(e.g., advertising, promotions, product development,
branding) that all use different types of data.
Media[edit]
To understand how the media uses big data, it is first
necessary to provide some context on the mechanisms
used in the media process. It has been suggested by Nick
Couldry and Joseph Turow that practitioners in media
and advertising approach big data as many actionable
points of information about millions of individuals. The
industry appears to be moving away from the
traditional approach of using specific media
environments such as newspapers, magazines, or
television shows and instead taps into consumers with
technologies that reach targeted people at optimal
times in optimal locations. The ultimate aim is to serve
or convey a message or content that is (statistically
speaking) in line with the consumer's mindset. For
example, publishing environments are increasingly
tailoring messages (advertisements) and content
(articles) to appeal to consumers, based on insights
gleaned exclusively through various data-mining
activities.[87]
Insurance[edit]
Health insurance providers are collecting data on
social "determinants of health" such as food and TV
consumption, marital status, clothing size, and
purchasing habits, from which they make predictions
on health costs, in order to spot health issues in their
clients. It is controversial whether these predictions are
currently being used for pricing.[90]
Information technology[edit]
Especially since 2015, big data has come to
prominence within business operations as a tool to
help employees work more efficiently and streamline
the collection and distribution of information
technology (IT). The use of big data to resolve IT and
data collection issues within an enterprise is called IT
operations analytics (ITOA).[95] By applying big data
principles to the concepts of machine
intelligence and deep computing, IT departments can
predict potential issues and prevent them.[95] ITOA
businesses offer platforms for systems
management that bring data silos together and
generate insights from the whole of the system rather
than from isolated pockets of data.
Case studies[edit]
Government[edit]
United States of America[edit]
In 2012, the Obama
administration announced the Big Data
Research and Development Initiative, to
explore how big data could be used to
address important problems faced by the
government.[105] The initiative is composed
of 84 different big data programs spread
across six departments.[106]
Big data analysis played a large role
in Barack Obama's successful 2012 re-
election campaign.[107]
The United States Federal
Government owns five of the ten most
powerful supercomputers in the world.[108][109]
The Utah Data Center has been
constructed by the United States National
Security Agency. When finished, the facility
will be able to handle a large amount of
information collected by the NSA over the
Internet. The exact amount of storage
space is unknown, but more recent
sources claim it will be on the order of a
few exabytes.[110][111][112] This has posed
security concerns regarding the anonymity
of the data collected.[113]
Retail[edit]
Walmart handles more than 1 million
customer transactions every hour, which
are imported into databases estimated to
contain more than 2.5 petabytes (2560
terabytes) of data—the equivalent of 167
times the information contained in all the
books in the US Library of Congress.[5]
Windermere Real Estate uses location
information from nearly 100 million drivers
to help new home buyers determine their
typical drive times to and from work
throughout various times of the day.[114]
FICO Card Detection System protects
accounts worldwide.[115]
Science[edit]
The Large Hadron Collider experiments
represent about 150 million sensors
delivering data 40 million times per second.
There are nearly 600 million collisions per
second. After filtering and refraining from
recording more than 99.99995%[116] of these
streams, there are 1,000 collisions of
interest per second.[117][118][119]
As a result, only working with less than 0.001% of the
sensor stream data, the data flow from all four LHC
experiments represents a 25-petabyte annual rate
before replication (as of 2012). This becomes nearly
200 petabytes after replication.
If all sensor data were recorded in the LHC, the data
flow would be extremely hard to work with. The data
flow would exceed a 150 million petabyte annual rate,
or nearly 500 exabytes per day, before replication. To
put the number in perspective, this is equivalent to
500 quintillion (5×10²⁰) bytes per day, almost 200
times more than all the other sources combined in the
world (a rough arithmetic check of these figures
appears after this list).
The Square Kilometre Array is a radio
telescope built of thousands of antennas. It
is expected to be operational by 2024.
Collectively, these antennas are expected
to gather 14 exabytes and store one
petabyte per day.[120][121] It is considered one
of the most ambitious scientific projects
ever undertaken.[122]
When the Sloan Digital Sky Survey (SDSS)
began to collect astronomical data in 2000,
it amassed more in its first few weeks than
all data collected in the history of
astronomy previously. Continuing at a rate
of about 200 GB per night, SDSS has
amassed more than 140 terabytes of
information.[5] When the Large Synoptic
Survey Telescope, successor to SDSS,
comes online in 2020, its designers expect
it to acquire that amount of data every five
days.[5]
Decoding the human genome originally
took 10 years to process; now it can be
achieved in less than a day. The DNA
sequencers have divided the sequencing
cost by 10,000 in the last ten years, which
is 100 times cheaper than the reduction in
cost predicted by Moore's law.[123]
The NASA Center for Climate Simulation
(NCCS) stores 32 petabytes of climate
observations and simulations on the
Discover supercomputing cluster.[124][125]
Google's DNAStack compiles and
organizes DNA samples of genetic data
from around the world to identify diseases
and other medical defects. These fast and
exact calculations eliminate any "friction
points", or human errors that could be
made by one of the numerous science and
biology experts working with the DNA.
DNAStack, a part of Google Genomics,
allows scientists to use the vast sample of
resources from Google's search server to
scale social experiments that would usually
take years, instantly.[126][127]
23andme's DNA database contains the
genetic information of over 1,000,000
people worldwide.[128] The company
explores selling the "anonymous
aggregated genetic data" to other
researchers and pharmaceutical
companies for research purposes if
patients give their consent.[129][130][131][132][133]
Ahmad Hariri, professor of psychology
and neuroscience at Duke University who
has been using 23andMe in his research
since 2009 states that the most important
aspect of the company's new service is
that it makes genetic research accessible
and relatively cheap for scientists.[129] A
study that identified 15 genome sites linked
to depression in 23andMe's database led
to a surge in demands to access the
repository with 23andMe fielding nearly 20
requests to access the depression data in
the two weeks after publication of the
paper.[134]
Computational fluid dynamics (CFD) and
hydrodynamic turbulence research
generate massive data sets. The Johns
Hopkins Turbulence Databases (JHTDB)
contains over 350 terabytes of
spatiotemporal fields from direct numerical
simulations of various turbulent flows. Such
data have been difficult to share using
traditional methods such as downloading
flat simulation output files. The data within
JHTDB can be accessed using "virtual
sensors" with various access modes
ranging from direct web-browser queries,
access through Matlab, Python, Fortran
and C programs executing on clients'
platforms, to cut-out services for downloading
raw data. The data have been used in over
150 scientific publications.
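To make the LHC data-rate figures above easier to follow, here is a small arithmetic check in Python. It assumes decimal units (1 PB = 10^15 bytes, 1 EB = 10^18 bytes) and simply relates the quoted per-day and per-year volumes for the hypothetical "record everything" scenario; the variable names are illustrative only.

```python
# Rough unit check for the LHC figures quoted above (decimal units assumed).
PB = 10**15  # bytes in a petabyte
EB = 10**18  # bytes in an exabyte

bytes_per_day = 5 * 10**20            # "500 quintillion bytes per day"
print(bytes_per_day / EB)             # 500.0 exabytes per day

bytes_per_year = bytes_per_day * 365
print(bytes_per_year / PB / 1e6)      # ~182.5 million petabytes per year,
                                      # consistent with "exceeds 150 million PB"
```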
Sports[edit]
Big data can be used to improve training and to
understand competitors, using sport sensors. It is
also possible to predict winners in a match using big
data analytics.[135] Future performance of players could
be predicted as well. Thus, players' value and salary are
determined by data collected throughout the season.[136]
In Formula One races, race cars with hundreds of
sensors generate terabytes of data. These sensors
collect data points from tire pressure to fuel burn
efficiency.[137] Based on the data, engineers and data
analysts decide whether adjustments should be made
in order to win a race. In addition, using big data, race
teams try to predict the time they will finish the race
beforehand, based on simulations using data collected
over the season.[138]
Technology[edit]
eBay.com uses two data warehouses at
7.5 petabytes and 40 PB as well as a
40 PB Hadoop cluster for search, consumer
recommendations, and merchandising.[139]
Amazon.com handles millions of back-end
operations every day, as well as queries
from more than half a million third-party
sellers. The core technology that keeps
Amazon running is Linux-based and as of
2005 they had the world's three largest
Linux databases, with capacities of 7.8 TB,
18.5 TB, and 24.7 TB.[140]
Facebook handles 50 billion photos from
its user base.[141] As of June 2017,
Facebook reached 2 billion monthly active
users.[142]
Google was handling roughly 100 billion
searches per month as of August 2012.[143]
COVID-19[edit]
During the COVID-19 pandemic, big data was raised
as a way to minimise the impact of the disease.
Significant applications of big data included minimising
the spread of the virus, case identification and
development of medical treatment.[144]
Governments used big data to track infected people to
minimise spread. Early adopters included China,
Taiwan, South Korea, and Israel.[145][146][147]
Research activities[edit]
Encrypted search and cluster formation in big data
were demonstrated in March 2014 at the American
Society of Engineering Education. Gautam Siwach, working
on Tackling the Challenges of Big Data at the MIT
Computer Science and Artificial Intelligence
Laboratory, and Amir Esmailpour of the UNH Research
Group investigated the key features of big data, such as the
formation of clusters and their interconnections. They
focused on the security of big data and the orientation
of the term towards the presence of different types of
data in encrypted form at the cloud interface, by
providing raw definitions and real-time examples
within the technology. Moreover, they proposed an
approach for identifying the encoding technique in order
to advance towards an expedited search over encrypted
text, leading to security enhancements in big data.[148]
Critique[edit]
Critiques of the big data paradigm come in two flavors:
those that question the implications of the approach
itself, and those that question the way it is currently
done.[177] One approach to this criticism is the field
of critical data studies.