Big Data 101
1. The Story of Big Data
In the old days, people travelled from one village to another on a horse-drawn cart. As time passed, villages became towns, people spread out, and the distance from one town to the next grew, so travelling between towns with luggage became a problem. Out of the blue, one smart fellow suggested that we should groom and feed a horse more to solve this problem. When I look at this solution, it is not that bad, but do you think a horse can become an elephant? I don't think so. Another smart guy said that instead of one horse pulling the cart, we should have four horses pull the same cart. What do you think of this solution? I think it is a fantastic one: now people can cover large distances in less time and even carry more luggage.
The same concept applies to Big Data. Until recently, we were fine storing data on our servers because the volume was fairly limited and the time needed to process it was acceptable. In today's technological world, however, data is growing too fast and people rely on it constantly; at the rate it is growing, it is becoming impossible to store and process the data on any single server.
Through this Big Data tutorial, let us explore the sources of Big Data, which traditional systems are failing to store and process.
2. Big Data Driving Factors
The quantity of data on planet Earth is growing exponentially for many reasons. Various sources and our day-to-day activities generate lots of data. With the advent of the web, the whole world has gone online, and every single thing we do leaves a digital trace. With smart objects going online, the data growth rate has increased rapidly. The major sources of Big Data are social media sites, sensor networks, digital images and videos, cell phones, purchase transaction records, web logs, medical records, archives, military surveillance, eCommerce, complex scientific research and so on. Together these sources produce quintillions of bytes of data every day. By 2020, data volumes were expected to reach around 40 zettabytes, which is equivalent to every single grain of sand on the planet multiplied by seventy-five.
3. What is Big Data?
Big Data is a term used for collections of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenge includes capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.
4. Big Data Characteristics
The five characteristics that define Big Data are Volume, Velocity, Variety, Veracity, and Value.
4.1 Volume:
Volume refers to the amount of data, which is growing day by day at a very fast pace. The size of data generated by humans, machines and their interactions on social media alone is massive. The name Big Data itself refers to this enormous size: vast volumes of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions and many more.
Facebook alone generates approximately a billion messages, records around 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts each day. Big Data technologies are designed to handle data at this scale.
4.2 Velocity:
Velocity is defined as the pace at which different sources generate data every day. This flow of data is massive and continuous. There are 1.03 billion daily active users (Facebook DAU) on mobile as of now, an increase of 22% year-over-year. This shows how fast the number of users on social media is growing and how fast data is being generated daily. If you can handle the velocity, you can generate insights and make decisions based on real-time data.
Velocity also covers the dynamics of incoming data: the speed of the incoming streams, their rate of change, and bursts of activity. A key requirement of Big Data systems is to make this fast-moving data available for analysis quickly.
Big Data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors and mobile devices.
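To make the idea of handling velocity concrete, here is a minimal Python sketch of a sliding-window event counter, the kind of primitive a real-time pipeline uses to track how fast data is arriving. The event source is simulated; a real system would read from a log stream or message queue.

    import time
    from collections import deque

    window = deque()  # timestamps of recently seen events

    def record_event(now, window_s=60.0):
        """Record one event; return how many arrived in the last window_s seconds."""
        window.append(now)
        while window and window[0] < now - window_s:
            window.popleft()  # drop events older than the window
        return len(window)

    # Simulate a small burst of incoming events.
    for _ in range(5):
        print(record_event(time.time()))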
4.3 Variety:
As there are many sources contributing to Big Data, the types of data they generate differ. Data can be structured, semi-structured or unstructured, so a wide variety of data is generated every day. Earlier, we used to get data from spreadsheets and databases; now the data comes in the form of images, audio, videos, sensor readings and so on. This variety of largely unstructured data creates problems in capturing, storing, mining and analyzing the data.
Structured data: follows a fixed schema with all the required columns and is stored in tabular form, typically in a relational database management system.
Quasi-structured data: textual data with inconsistent formats that can be structured with some effort, time and the right tools (web clickstream data is a common example).
4.4 Veracity:
Veracity refers to data in doubt, i.e. the uncertainty of available data due to its inconsistency and incompleteness. A data set may have missing values, and some values may be hard to accept, for example a minimum value of 15,000 in a field where that is physically impossible. This inconsistency and incompleteness is veracity.
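As a minimal sketch of how such veracity problems are caught in practice, the following Python/pandas snippet checks a small, hypothetical table for missing values and physically implausible entries; the column names and thresholds are illustrative assumptions, not from the original tutorial.

    import pandas as pd

    # Hypothetical sensor readings with two quality problems:
    # a missing value and an implausible minimum temperature.
    df = pd.DataFrame({
        "device_id": [101, 102, 103, 104],
        "min_temp_c": [12.0, None, 15000.0, 9.5],
    })

    # Incompleteness: count missing values per column.
    print(df.isna().sum())

    # Inconsistency: flag rows outside a plausible physical range.
    suspect = df[(df["min_temp_c"] < -90) | (df["min_temp_c"] > 60)]
    print(suspect)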
Available data can sometimes get messy and may be difficult to trust. With many forms of Big Data, quality and accuracy are difficult to control; think of Twitter posts with hashtags, abbreviations, typos and colloquial speech. Volume is often the reason behind the lack of quality and accuracy in the data.
Because of this uncertainty of data, 1 in 3 business leaders don't trust the information they use to make decisions.
It was found in a survey that 27% of respondents were unsure of how much of their data was
inaccurate.
Poor data quality costs the US economy around $3.1 trillion a year.
4.5 Value:
After discussing Volume, Velocity, Variety and Veracity, there is one more V to take into account when looking at Big Data: Value. It is all well and good to have access to Big Data, but unless we can turn it into value it is useless. Turning it into value means asking: is it adding to the benefit of the organizations analyzing it? Is the organization working on Big Data achieving a high return on investment (ROI)? Unless it adds to their profits, working on Big Data is useless.
5. Types of Big Data
As discussed under Variety, different types of data are generated every day. These fall into three broad types:
Structured
Semi-Structured
Unstructured
5.1 Structured
The data that can be stored and processed in a fixed format is called structured data. Data stored in a relational database management system (RDBMS) is one example of structured data. It is easy to process structured data because it has a fixed schema. Structured Query Language (SQL) is often used to manage this kind of data.
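As a small illustration of how a fixed schema makes structured data easy to query, here is a minimal Python sketch using the standard-library sqlite3 module; the table and rows are made up for the example, and an in-memory database stands in for a production RDBMS.

    import sqlite3

    # An in-memory database stands in for a production RDBMS.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.execute("INSERT INTO customers (name, city) VALUES ('Alice', 'Pune')")
    conn.execute("INSERT INTO customers (name, city) VALUES ('Bob', 'Delhi')")

    # The fixed schema makes declarative querying straightforward.
    for row in conn.execute("SELECT name FROM customers WHERE city = 'Pune'"):
        print(row)  # ('Alice',)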
5.2 Semi-Structured
Semi-structured data is data that does not have the formal structure of a data model (i.e. a table definition in a relational DBMS) but nevertheless has some organizational properties, such as tags and other markers that separate semantic elements and make it easier to analyze. XML files and JSON documents are examples of semi-structured data.
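A brief sketch of the idea in Python: the JSON document below has no table schema, yet its keys act as the tags and markers just described, so the standard json module can parse it directly. The document itself is invented for the example.

    import json

    # No fixed table schema, but keys mark the semantic elements.
    doc = '{"user": "alice", "tags": ["big-data", "tutorial"], "age": 30}'
    record = json.loads(doc)

    print(record["user"])       # alice
    print(len(record["tags"]))  # 2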
5.3 Unstructured
Data whose form is unknown, which cannot be stored in an RDBMS and cannot be analyzed unless it is transformed into a structured format, is called unstructured data. Text files and multimedia content such as images, audio and video are examples of unstructured data. Unstructured data is growing quicker than the other types; experts say that around 80 percent of the data in an organization is unstructured.
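To illustrate the "transform into a structured format" step on a tiny scale, here is a minimal Python sketch that imposes a simple bag-of-words structure on raw free text; the text is invented, and real pipelines would of course work at far larger scale.

    from collections import Counter

    # Raw free text has no schema; impose one by counting words.
    text = "big data is big and data is everywhere"
    counts = Counter(text.split())

    print(counts.most_common(3))  # [('big', 2), ('data', 2), ('is', 2)]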
6. The Evolution of Big Data
The evolution of Big Data can roughly be subdivided into three main phases. Each phase was driven by technological advancements and has its own characteristics and capabilities. To understand the context of Big Data today, it is important to understand how each of these phases contributed to the modern meaning of Big Data.
Data analysis, data analytics and Big Data originate from the longstanding domain of database management, which relies heavily on the storage, extraction and optimization techniques common to Relational Database Management Systems (RDBMS). The techniques used in these systems, such as Structured Query Language (SQL) and the extraction, transformation and loading (ETL) of data, started to professionalize in the 1970s.
Database management and data warehousing systems are still fundamental components of modern-day Big Data solutions. The ability to quickly store and retrieve data from databases, or to find information in large data sets, is still a core requirement for the analysis of Big Data. Relational database management technology and other data processing technologies developed during this phase are still strongly embedded in the Big Data solutions of leading IT vendors such as Microsoft, Google and Amazon. A number of core technologies and characteristics of this first phase in the evolution of Big Data are outlined in figure 3.
From the early 2000s, the internet and corresponding web applications started to generate
tremendous amounts of data. In addition to the data that these web applications stored in relational
databases, IP-specific search and interaction logs started to generate web based unstructured data.
These unstructured data sources provided organizations with a new form of knowledge: insights into
the needs and behaviours of internet users. With the expansion of web traffic and online stores,
companies such as Yahoo, Amazon and eBay started to analyse customer behaviour by analysing
click-rates, IP-specific location data and search logs, opening a whole new world of possibilities.
From a technical point of view, HTTP-based web traffic introduced a massive increase in semi-
structured and unstructured data (further discussed in chapter 1.6). Besides the standard structured
data types, organizations now needed to find new approaches and storage solutions to deal with
these new data types in order to analyse them effectively. The arrival and growth of social media data greatly increased the need for tools, technologies and analytics techniques that could extract meaningful information from this unstructured data. New technologies, such as network analysis, web mining and spatial-temporal analysis, were developed specifically to analyse these large quantities of web-based unstructured data effectively.
The third and current phase in the evolution of Big Data is driven by the rapid adoption of mobile
technology and devices, and the data they generate. The number of mobile devices and tablets
surpassed the number of laptops and PCs for the first time in 2011. By 2020, an estimated 10 billion devices were connected to the internet, and all of these devices generate data every single second of the day.
Mobile devices not only give the possibility to analyse behavioural data (such as clicks and search
queries), but they also provide the opportunity to store and analyse location-based GPS data.
Through these mobile devices and tablets, it is possible to track movement, analyse physical
behaviour and even health-related data (for example the number of steps you take per day). And
because these devices are connected to the internet almost every single moment, the data they generate provides a real-time and unprecedented picture of people's behaviour.
Simultaneously, the rise of sensor-based internet-enabled devices is increasing the creation of data
to even greater volumes. Famously coined the ‘Internet of Things’ (IoT), millions of new TVs,
thermostats, wearables and even refrigerators are connected to the internet every single day,
providing massive additional data sets. Since this development is not expected to stop anytime soon,
it could be stated that the race to extract meaningful and valuable information out of these new
data sources has only just begun. A summary of the evolution of Big Data and its key characteristics
per phase is outlined in figure 3.
7. Big Data Facts and Figures
According to a study conducted by IBM, 90% of all data in the world was generated in just the last two years.
Recent surveys by Gartner found that 89% of companies are investing in Big Data to gain a
competitive edge.
The National Small Business Association found that 63% of small businesses use Big Data to
improve operations.
Big Data also includes unstructured data sets: 80% of all data is unstructured and still requires analysis.
McKinsey Co. found that Big Data can lead to a 2-3% increase in productivity and a 20-25%
reduction in costs.
More than 5 billion people are calling, texting, tweeting and browsing on mobile phones
worldwide.
YouTube users upload 48 hours of new video every minute of the day.
Amazon handles 15 million customer clickstream records per day to recommend products.
294 billion emails are sent every day. Email services analyze this data to filter out spam.
Modern cars have close to 100 sensors monitoring fuel level, tire pressure and so on; each vehicle generates a lot of sensor data.
8. Big Data Applications
We cannot talk about data without talking about the people who benefit from Big Data applications. Almost every industry today leverages Big Data applications in one way or another.
Smarter Healthcare: Making use of the petabytes of patient data, organizations can extract meaningful information and build applications that predict a patient's deteriorating condition in advance.
Telecom: The telecom sector collects information, analyzes it and provides solutions to different problems. Using Big Data applications, telecom companies have been able to significantly reduce data packet loss, which occurs when networks are overloaded, and thus provide a seamless connection to their customers.
Retail: Retail has some of the tightest margins and is one of the greatest beneficiaries of Big Data. The beauty of using Big Data in retail is understanding consumer behavior: Amazon's recommendation engine, for example, provides suggestions based on the consumer's browsing history.
Traffic control: Traffic congestion is a major challenge for many cities globally. Effective use
of data and sensors will be key to managing traffic better as cities become increasingly
densely populated.
Manufacturing: Analyzing big data in the manufacturing industry can reduce component
defects, improve product quality, increase efficiency, and save time and money.
Search Quality: Every time we are extracting information from Google, we are
simultaneously generating data for it. Google stores this data and uses it to improve its
search quality.
9. Big Data Challenges
Someone has rightly said: "Not everything in the garden is rosy!" So far in this Big Data tutorial, I have only shown you the rosy picture of Big Data. But if it were so easy to leverage Big Data, don't you think every organization would invest in it? Let me tell you up front, that is not the case. Several challenges come along when you are working with Big Data.
Now that you are familiar with Big Data and its various features, this section of the Big Data tutorial will shed some light on the major challenges of working with Big Data.
1. Data Quality — The problem here is the fourth V, Veracity. The data is often messy, inconsistent and incomplete. Dirty data costs companies in the United States $600 billion every year.
2. Discovery — Finding insights in Big Data is like finding a needle in a haystack. Analyzing petabytes of data with extremely powerful algorithms to find patterns and insights is very difficult.
3. Storage — The more data an organization has, the more complex the problems of managing
it can become. The question that arises here is “Where to store it?”. We need a storage
system which can easily scale up or down on-demand.
4. Analytics — In the case of Big Data, most of the time we are unaware of the kind of data we
are dealing with, so analyzing that data is even more difficult.
5. Security — Since the data is huge in size, keeping it secure is another challenge. It includes
user authentication, restricting access based on a user, recording data access histories,
proper use of data encryption etc.
6. Lack of Talent — There are plenty of Big Data projects in major organizations, but building a sophisticated team of developers, data scientists and analysts who also have a sufficient amount of domain knowledge is still a challenge.
10. Hadoop as a Solution
We have a savior to deal with Big Data challenges: it's Hadoop. Hadoop is an open-source, Java-based framework that supports the storage and processing of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
With its distributed processing, Hadoop handles large volumes of structured and unstructured data more efficiently than a traditional enterprise data warehouse. Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes and to handle thousands of terabytes of data. Organizations are adopting Hadoop because it is open-source software and can run on commodity hardware (even your personal computer). The initial cost savings are dramatic, as commodity hardware is very cheap, and as organizational data grows you can add more commodity hardware on the fly to store it; hence, Hadoop proves to be economical.
Additionally, Hadoop has a robust Apache community behind it that continues to contribute to its
advancement.
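To give a feel for Hadoop's MapReduce model, here is the classic word-count example written as a pair of Hadoop Streaming scripts in Python. This is a minimal sketch: Hadoop Streaming lets any program that reads stdin and writes stdout act as a mapper or reducer, and Hadoop itself distributes the work and sorts the mapper output by key before it reaches the reducer.

    #!/usr/bin/env python3
    # mapper.py — emit a (word, 1) pair for every word on stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py — sum the counts per word; input arrives sorted by word.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

The pair can be tested locally with an ordinary shell pipeline (cat input.txt | python3 mapper.py | sort | python3 reducer.py) before being submitted to a cluster via the Hadoop Streaming JAR.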
11. Advantages of Big Data
Big Data has several advantages that make it a valuable asset for organizations in various industries. Some of the advantages of Big Data include:
1. Improved decision-making: Big Data provides organizations with access to vast amounts of
data, allowing them to make more informed and data-driven decisions. By analyzing Big
Data, organizations can identify trends, patterns, and insights that would be difficult or
impossible to discern from smaller datasets.
2. Increased efficiency and productivity: Big Data technologies enable organizations to process
and analyze data more quickly and accurately. This can help organizations to optimize their
operations, reduce waste and inefficiencies, and increase productivity.
3. Better customer insights: Big Data can provide organizations with a more complete and
detailed understanding of their customers' behaviors, preferences, and needs. This can help
organizations to improve their marketing and customer engagement strategies, leading to
higher customer satisfaction and loyalty.
4. Enhanced product and service innovation: Big Data can provide organizations with insights
into emerging trends, consumer preferences, and market opportunities, which can help to
drive product and service innovation. By leveraging Big Data, organizations can develop
products and services that better meet customer needs and preferences.
5. Cost savings: By improving efficiency and productivity, Big Data can help organizations to
reduce costs and increase profitability. For example, Big Data can be used to optimize supply
chain operations, reduce inventory costs, and improve resource allocation.
Overall, the advantages of Big Data can be significant, and organizations that effectively manage and
analyze their data assets can gain a competitive advantage in their respective industries. However, it
is important to note that working with Big Data also presents significant challenges, including the
need for specialized expertise, tools, and infrastructure to manage and analyze large datasets.
12. Big Data Tools
There are many tools available for managing and analyzing Big Data, each with its own strengths and
weaknesses. Some popular Big Data tools include:
1. Apache Hadoop: Apache Hadoop is an open-source software framework that is widely used
for distributed storage and processing of large datasets. It provides a scalable and fault-
tolerant system for storing and processing data, and it includes several tools for data
processing and analysis, such as Hadoop Distributed File System (HDFS) and MapReduce.
2. Apache Spark: Apache Spark is an open-source data processing engine that is designed for high-speed data processing and analytics. It provides a unified analytics engine for data processing, machine learning, and graph processing, and it supports multiple programming languages, including Java, Python, and Scala (see the first sketch after this list).
3. NoSQL databases: NoSQL databases are a category of databases that are designed for handling unstructured and semi-structured data. They provide a flexible and scalable system for storing and retrieving data, and the category includes several popular databases such as MongoDB, Couchbase, and Apache CouchDB (see the second sketch after this list).
4. Data visualization tools: Data visualization tools are used for creating visual representations of data, such as charts, graphs, and maps. They provide an effective way to communicate insights and trends to stakeholders and decision-makers, and they include popular tools such as Tableau, D3.js, and QlikView (see the third sketch after this list).
5. Machine learning libraries: Machine learning libraries are used for developing and deploying machine learning models that can be used for a variety of applications, such as predictive analytics, natural language processing, and computer vision. Popular machine learning libraries include TensorFlow, Scikit-learn, and Keras (see the fourth sketch after this list).
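First, a minimal PySpark word-count sketch for item 2 (Apache Spark). It runs a local session; the input path is hypothetical, and a real deployment would point the session at a cluster.

    from pyspark.sql import SparkSession

    # A local Spark session; in production this would target a cluster.
    spark = SparkSession.builder.appName("wordcount").master("local[*]").getOrCreate()

    # Count words in a text file (the path is hypothetical).
    lines = spark.read.text("logs.txt")
    words = lines.rdd.flatMap(lambda row: row.value.split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    print(counts.take(5))
    spark.stop()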
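Second, a sketch for item 3 (NoSQL databases) using MongoDB through the pymongo driver. It assumes a MongoDB server is running locally; the database, collection and documents are invented for the example. Note how the two documents need not share a schema.

    from pymongo import MongoClient

    # Assumes a local MongoDB server; connection details are illustrative.
    client = MongoClient("mongodb://localhost:27017")
    db = client["tutorial"]

    # Documents in the same collection need not share a schema.
    db.events.insert_one({"user": "alice", "action": "click", "tags": ["promo"]})
    db.events.insert_one({"user": "bob", "ip": "10.0.0.7"})

    print(db.events.find_one({"user": "alice"}))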
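Third, a sketch for item 4 (data visualization). The tools named in that item are interactive products, so as a stand-in this example uses matplotlib, a common Python plotting library, with made-up daily event counts.

    import matplotlib.pyplot as plt

    # Made-up daily event counts.
    days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
    events = [120, 340, 290, 410, 380]

    plt.bar(days, events)
    plt.xlabel("Day")
    plt.ylabel("Events")
    plt.title("Daily event volume")
    plt.show()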
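Finally, a sketch for item 5 (machine learning libraries) using scikit-learn. Synthetic data stands in for a real feature set; the point is only the fit/score workflow these libraries share.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic data stands in for a real Big Data feature set.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(f"test accuracy: {model.score(X_test, y_test):.2f}")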
These are just a few examples of the many Big Data tools available today. Choosing the right tool for
a given use case depends on several factors, such as the size and complexity of the data, the desired
analysis or processing capabilities, and the available resources and expertise.
13. Big Data Job Types
There are various job types related to Big Data, depending on the specific skills and expertise required. Some of the common Big Data job types include:
1. Data Scientist: This job involves analyzing and interpreting complex data sets to identify
patterns and insights, and using them to develop predictive models and machine learning
algorithms.
2. Data Analyst: This job involves collecting, cleaning, and processing large data sets to derive
insights and trends, and presenting them in an understandable format to business
stakeholders.
3. Big Data Engineer: This job involves designing and building scalable data architectures and
pipelines that can process and manage large volumes of data from various sources.
4. Data Architect: This job involves designing and maintaining the overall data architecture of
an organization, including data models, schemas, and metadata.
5. Business Intelligence Analyst: This job involves designing and developing dashboards and
reports that help businesses make data-driven decisions.
6. Database Administrator: This job involves managing and maintaining databases, ensuring
their reliability, security, and scalability.
7. Machine Learning Engineer: This job involves designing and building machine learning
models and systems that can learn and improve over time.
8. Data Warehouse Developer: This job involves designing and building data warehouses,
which are central repositories of data used for reporting and analysis.
9. Data Mining Engineer: This job involves using machine learning and statistical techniques to
extract insights and patterns from large data sets.
10. Data Visualization Specialist: This job involves designing and creating visual representations
of data, such as charts and graphs, to help stakeholders understand complex data sets.