Unit 1 Big Data
Definition 1: Big Data is a collection of data that is huge in volume and keeps growing
exponentially with time. Its size and complexity are so great that no
traditional data management tool can store or process it efficiently. In short, Big Data is
still data, only at a much larger scale.
Big data is used in machine learning, predictive modeling, and other advanced analytics
to solve business problems and make informed decisions.
Example: Social Media
Statistics show that more than 500 terabytes of new data are ingested into the databases of
the social media site Facebook every day. This data is mainly generated by photo
and video uploads, message exchanges, comments, etc.
Types of Digital Data
1. Structured Data (10-15%): This is relational data that is stored in a fixed format, is easily accessible,
and can be processed directly.
Examples include: financial transactions, customer records (CRM), enterprise databases
(SQL, data warehouses).
Traditional enterprises rely on structured data, but its share is shrinking due to the rise of unstructured data.
3. Unstructured Data (60-65%): This is data in many different formats, such as document files, multimedia
files, images, backup files, social media posts, audio files, and open-ended customer comments.
Examples include: social media data (posts, images, videos), multimedia (videos,
audio files, images), IoT-generated sensor data, medical imaging (X-rays, MRIs).
This is the dominant form of Big Data, driven by multimedia, social media, and IoT devices.
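As a rough illustration of the difference between the two types, the sketch below holds a few structured records and one unstructured item in Python. The field names, values, and the sample comment are invented for this example and do not come from any real system.

```python
# Hypothetical sample data, invented for illustration only.

# Structured: every record has the same fixed fields, like a row in a SQL table.
transactions = [
    {"txn_id": 1, "customer_id": 101, "amount": 250.0},
    {"txn_id": 2, "customer_id": 102, "amount": 99.5},
]

# Unstructured: free text with no fixed schema (an open-ended customer comment).
comment = "Loved the phone, but delivery took 9 days and the box was damaged."

# Structured data can be queried directly by field name.
total_amount = sum(row["amount"] for row in transactions)

# Unstructured data needs extra processing first; here, a naive keyword check.
complaint_words = {"damaged", "late", "broken"}
words = {w.strip(".,").lower() for w in comment.split()}
mentions_problem = bool(words & complaint_words)

print(total_amount)
print(mentions_problem)
```

The structured records can be summed in one line, while the comment has to be tokenized and interpreted before it says anything useful.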
History of Big Data Innovation
1. Early Foundations (1960s–1980s): Birth of Data Management
Note: Stream computing is a way to analyze and process Big Data in real time to gain
current insights to take appropriate decisions or to predict new trends in the immediate
Need for a Big Data Platform
To provide users with efficient analytics tools specifically designed for
handling massive datasets.
Data engineers often utilize these platforms to aggregate, clean, and
prepare data for insightful business analysis.
Data scientists, on the other hand, leverage this platform to uncover
valuable relationships and patterns within large datasets using advanced
machine learning algorithms.
Furthermore, users have the flexibility to build custom applications
tailored to their specific use cases, such as calculating customer loyalty in
the e-commerce industry, among countless other possibilities.
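A custom application of this kind could be as small as the sketch below. It assumes one possible definition of customer loyalty, the repeat purchase rate (the share of customers with more than one order). The function name, the input layout, and the sample orders are hypothetical.

```python
from collections import Counter


def repeat_purchase_rate(orders):
    """Share of customers who placed more than one order.

    `orders` is a list of (customer_id, order_id) pairs. This is only one
    possible loyalty measure, chosen here for illustration.
    """
    orders_per_customer = Counter(customer_id for customer_id, _ in orders)
    if not orders_per_customer:
        return 0.0
    repeat_customers = sum(1 for n in orders_per_customer.values() if n > 1)
    return repeat_customers / len(orders_per_customer)


# Hypothetical order log: customers c1 and c3 ordered twice, c2 once.
sample_orders = [("c1", "o1"), ("c1", "o2"), ("c2", "o3"), ("c3", "o4"), ("c3", "o5")]
rate = repeat_purchase_rate(sample_orders)
print(f"Repeat purchase rate: {rate:.2f}")
```

On a real platform the same aggregation would run over a distributed dataset, but the logic stays the same.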
Different Types of Big Data Platforms and Tools
A big data platform is summed up by four letters, S, A, P, and S, which stand for Scalability, Availability,
Performance, and Security. Various tools are responsible for managing the hybrid data of IT systems.
Some of these platforms are listed below:
Hadoop - Delta Lake Migration Platform
Hadoop is an open-source software platform managed by the Apache Software Foundation. It is used to
collect and store large data sets cheaply and efficiently.
Note: Delta Lake is a table format built on the Parquet file format. It is an open-source
storage layer that helps bring reliability to data lakes. It provides ACID transactions,
scalable metadata handling, and unified streaming and batch data processing.
Data Catalog and Data Observability Platform
It provides a single self-service environment that helps users find, understand, and
trust data sources. It also helps users discover new data sources. Seeing and
understanding data sources are the first steps toward registering them. Users search the
data catalog tools and filter the results based on their needs. In enterprises, a data
lake serves business intelligence teams, data scientists, and ETL developers, all of whom
need the correct data. They use catalog discovery to find the data that fits their needs.
Different Types of Big Data Platforms and Tools (Cont..)
4. Data Integration and Warehouse – It lets users integrate data
from any source with ease.
8. Data Discovery Platform for Price Optimization – Data analytics, with the help
of a big data platform, provides insight for B2C and B2B enterprises, which helps
businesses optimize the prices they charge accordingly.
9. Data Observability – With the warehouse, analytics tools, and efficient data
transformation in place, it helps reduce data latency and provide high throughput.
Drivers for Big Data
Big Data emerged in the last decade from a combination of business needs and
technology innovations. A number of companies that have Big Data at the core of their
strategy have become very successful at the beginning of the 21st century. Famous
examples include Apple, Amazon, Facebook and Netflix. A number of business drivers are
at the core of this success and explain why Big Data has quickly risen to become one of
the most coveted topics in the industry. Six main business drivers can be identified:
1. The digitization of society;
2. The plummeting of technology costs;
3. Connectivity through cloud computing;
4. Increased knowledge about data science;
5. Social media applications;
6. The upcoming Internet-of-Things (IoT)
1. The digitization of society
Big data is largely consumer driven and consumer oriented. Most people now spend 4 to 6
hours per day consuming and generating data through a variety of devices and applications.
Some studies estimate that 60% of data was generated within the last three years, which is
a good indication of the rate at which society has digitized.
*Example: The adoption of electronic Bills of Lading (eBL) will enable the trade industry to
benefit from faster transactions, cost savings, and lowered fraud risks.
2. The plummeting of technology costs
The costs of data storage and processors keep declining, making it
possible for small businesses and individuals to become involved
with Big Data.
Besides the plummeting of the storage costs, a second key
contributing factor to the affordability of Big Data has been the
development of open source Big Data software frameworks.
The most popular software framework is Apache Hadoop for
distributed storage and processing.
Because these software frameworks are freely available as open
source, it has become increasingly inexpensive to start Big Data
projects in organizations.
3. Connectivity through cloud computing
Cloud computing environments have made it possible to quickly scale IT infrastructure up
or down and facilitate a pay-as-you-go model. Instead of investing in their own
infrastructure, organizations can license the storage and processing capacity they need and
only pay for the amounts they actually use. As a result, most Big Data solutions leverage
the possibilities of cloud computing to deliver their solutions to
4. Increased knowledge about data science
1. Volume
The name Big Data itself refers to enormous size. Big Data is the vast volume of data
generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
For example, Facebook generates approximately a billion messages, records 4.5 billion clicks of the
"Like" button, and receives more than 350 million new posts each day. Big data technologies
can handle such large amounts of data.
2. Variety
Data can be structured, semi-structured, or unstructured, and it is collected from
different sources. In the past, data was collected only from databases and spreadsheets, but these
days it arrives in an array of forms, such as PDFs, emails, audio, social media posts, photos, videos,
etc.
3. Veracity
Veracity means how reliable the data is. There are many ways to filter or translate the data.
Veracity is also about being able to handle and manage data efficiently. Big Data is
essential in business development.
For example, Facebook posts with hashtags.
Characteristics of Big Data (Cont…)
4. Value
Value is an essential characteristic of big data. What matters is not
simply the data that we process or store, but the valuable and reliable
data that we store, process, and analyze.
5. Velocity
Velocity plays an important role compared to the others. Velocity
is the speed at which data is created in real time. It
covers the speed of incoming data sets, the rate of change,
and bursts of activity. A primary aim of Big Data is to provide
the data that is demanded rapidly.
Big Data Technology Components
Big data technology that deals with data storage has the capability to fetch, store, and
manage big data.
It is made up of infrastructure that allows users to store the data so that it is convenient
to access.
Most data storage platforms are compatible with other programs.
Two commonly used tools are Apache Hadoop and MongoDB.
Example 1- Apache Hadoop
Apache Hadoop is the most widely used big data tool.
It is an open-source software platform that stores and processes big data
in a distributed computing environment across hardware clusters.
This distribution allows for faster data processing.
The framework is designed to tolerate faults, be scalable, and
process all data formats.
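Hadoop's processing model is MapReduce. The sketch below imitates its three steps (map, shuffle, reduce) in plain Python on a single machine to show how work on separate chunks of data can be combined into one result. It is not Hadoop's API; the function names and the sample text are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Each string stands in for a block of a file stored on a different node.
chunks = ["big data is big", "data is everywhere"]

mapped = []
for chunk in chunks:  # on a real cluster these calls would run in parallel
    mapped.extend(map_phase(chunk))

word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each chunk is mapped independently, adding more nodes lets more chunks be processed at the same time, which is where the speed-up comes from.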
1. Data storage (Cont…)
Example 2- MongoDB
•MongoDB is a NoSQL database that can be used to store large
volumes of data.
•MongoDB stores data as documents made up of key-value pairs (a basic
unit of data) and groups the documents into collections.
•It is written in C, C++, and JavaScript, and is one of the most
popular big data databases because it can manage and store
unstructured data with ease.
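The document and collection idea can be pictured with plain Python dictionaries and lists. The sketch below is not MongoDB's API and does not need a database server. The collection name, the fields, and the `find` helper are hypothetical and only mimic the style of a simple equality query.

```python
# A collection is a group of documents; documents need not share the same fields.
users = [
    {"_id": 1, "name": "Asha", "city": "Pune"},
    {"_id": 2, "name": "Ravi", "city": "Delhi", "interests": ["cricket", "music"]},
    {"_id": 3, "name": "Meera", "city": "Pune", "photo_url": "meera.jpg"},
]


def find(collection, query):
    """Return documents whose fields equal every key-value pair in `query`."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in query.items())
    ]


pune_users = find(users, {"city": "Pune"})
names = [doc["name"] for doc in pune_users]
print(names)
```

Note that the second and third documents carry extra fields the first one lacks. This flexibility is what makes a document store convenient for unstructured data.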
2. Data mining
Data mining extracts the useful patterns and trends from the raw data.
Big data technologies such as RapidMiner and Presto can turn unstructured and structured data
into usable information.
Example 1- RapidMiner
RapidMiner is a data mining tool that can be used to build predictive models.
Its two strengths are processing and preparing data, and
building machine learning and deep learning models.
This end-to-end approach allows both functions to drive impact across the
organization.
Example 2- Presto
Presto is an open-source query engine that was originally developed by Facebook
to run analytic queries against their large datasets.
It is now widely available. One query on Presto can combine data from multiple
sources within an organization and perform analytics on them in a matter of
minutes.
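The idea of one query spanning several sources can be tried on a small scale with Python's built-in `sqlite3` module, which allows a second database to be attached to a connection. This is only an analogy and not Presto itself. The database alias `crm`, the table names, and the sample rows are assumptions made for this sketch.

```python
import sqlite3

# Two separate in-memory databases stand in for two data sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 50.0)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi")],
)

# One SQL query joins the sales source with the CRM source.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
conn.close()

print(rows)
```

A distributed query engine applies the same principle to sources that live in different systems and on different machines.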
3. Data analytics
In big data analytics, technologies are used to clean and transform data into information that can
be used to drive business decisions.
This next step (after data mining) is where users run algorithms, models, and predictive
analytics using tools such as Apache Spark and Splunk.
Example 1- Apache Spark
•Spark is a popular big data tool for data analysis because it is fast and efficient at
running applications.
•It is faster than Hadoop because it processes data in random access memory (RAM),
rather than storing and processing it in batches via MapReduce.
•Spark supports a wide variety of data analytics tasks and queries.
Example 2- Splunk
•Splunk is another popular big data analytics tool for deriving insights from large
datasets.
•It has the ability to generate graphs, charts, reports, and dashboards.
•Splunk also enables users to incorporate artificial intelligence (AI) into data
outcomes.
4. Data visualization
Finally, big data technologies can be used to create stunning visualizations from the data.
In data-oriented roles, data visualization is a skill that is beneficial for presenting
recommendations to stakeholders for business profitability and operations—to tell an
impactful story with a simple graph.
Example 1- Tableau
Tableau is a very popular tool in data visualization because its drag-and-
drop interface makes it easy to create pie charts, bar charts, box plots,
Gantt charts, and more.
It is a secure platform that allows users to share visualizations and
dashboards in real time.
Example 2- Looker
Looker is a business intelligence (BI) tool used to make sense of big data
analytics and then share those insights with other teams.
Charts, graphs, and dashboards can be configured with a query, such as
monitoring weekly brand engagement through social media analytics.
Big Data Applications
Big companies use this data for business growth. By analyzing it,
useful decisions can be made in various cases, as discussed below:
1. Tracking Customer Spending Habits and Shopping Behavior:
In big retail stores (like Amazon, Walmart, Big Bazar, etc.), the management team has to keep
data on customers' spending habits, shopping behavior, and most-liked products.
The production or collection rate of a product is then fixed based on which products are
searched for or sold most.
2. Recommendation:
By tracking customers' spending habits and shopping behavior, big retail stores provide
recommendations to the customer.
E-commerce sites like Amazon, Walmart, and Flipkart recommend products.
They track what products a customer is searching for and, based on that data, recommend that
type of product to the customer.
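A very simplified version of this idea is sketched below: count which product category a customer searches for most and suggest items from that category. Real recommendation systems are far more elaborate. The `recommend` function, the catalog, and the search history are hypothetical.

```python
from collections import Counter


def recommend(search_history, catalog, top_n=2):
    """Suggest products from the category the customer searched for most."""
    if not search_history:
        return []
    favourite_category, _ = Counter(search_history).most_common(1)[0]
    return catalog.get(favourite_category, [])[:top_n]


# Hypothetical catalog and search log (one entry per search, by category).
catalog = {
    "phones": ["Phone A", "Phone B", "Phone C"],
    "shoes": ["Shoe X", "Shoe Y"],
}
history = ["phones", "shoes", "phones"]

suggestions = recommend(history, catalog)
print(suggestions)
```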
Big Data Applications (Cont…)
7. IoT:
Manufacturing companies install IoT sensors in machines to collect operational data.
By analyzing such data, it can be predicted how long a machine will work without any
problem and when it will require repair, so that the company can take action before the
machine develops many issues or goes down completely. Thus, the cost of replacing the
whole machine can be saved.
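One simple way to act on such sensor data is to raise an alert when recent readings drift above a safe limit. The sketch below assumes a moving-average rule. The function name, the vibration values, and the limit of 4.0 are invented for illustration and are not taken from any real machine.

```python
def needs_maintenance(readings, limit, window=3):
    """Flag a machine when the average of the last `window` readings exceeds `limit`."""
    if len(readings) < window:
        return False  # not enough data to judge yet
    recent = readings[-window:]
    return sum(recent) / window > limit


# Hypothetical vibration readings (mm/s) from one machine, oldest first.
vibration_mm_s = [2.1, 2.0, 2.3, 2.2, 3.9, 4.4, 4.8]

alert = needs_maintenance(vibration_mm_s, limit=4.0)
print("Schedule maintenance" if alert else "Machine healthy")
```

Averaging over a window rather than reacting to a single reading avoids false alarms from one noisy measurement.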
In the healthcare field, Big Data is making a significant contribution.
Using big data tools, data regarding patient experience is collected and used by doctors
to give better treatment.
8. Education Sector:
Organizations that conduct online educational courses use big data to find candidates
interested in those courses.
Big Data Applications (Cont…)
9. Energy Sector:
Smart electric meters read the power consumed every 15 minutes and send this data to
a server, where it is analyzed to estimate the times of day when the
power load is low across the city.
With this system, manufacturing units and householders are advised to
run their heavy machines at night, when the power load is low, to get a lower
electricity bill.
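The analysis described above can be sketched as a small aggregation: add up the 15-minute readings for each hour and pick the hour with the lowest total. The function name, the input layout, and the readings are made up for this example.

```python
from collections import defaultdict


def lowest_load_hour(readings):
    """Return the hour of day with the smallest total consumption.

    `readings` is a list of (hour, kwh) pairs, one per 15-minute interval.
    """
    load_by_hour = defaultdict(float)
    for hour, kwh in readings:
        load_by_hour[hour] += kwh
    return min(load_by_hour, key=load_by_hour.get)


# Hypothetical meter data: four 15-minute readings for each of three hours.
readings = [
    (2, 0.5), (2, 0.4), (2, 0.5), (2, 0.4),
    (14, 1.2), (14, 1.1), (14, 1.3), (14, 1.2),
    (20, 2.0), (20, 2.2), (20, 2.1), (20, 1.9),
]

best_hour = lowest_load_hour(readings)
print(f"Lowest load at hour {best_hour}")
```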
10. Media and Entertainment Sector:
Media and entertainment companies like Netflix, Amazon Prime, and
Spotify analyze data collected from their users.
Data such as what type of video or music users watch or listen to most and how long users
spend on the site is collected and analyzed to set the next business strategy.
Importance of Big Data
2. Variety
As discussed before, Big Data is generated in multiple varieties. Compared to
traditional data like phone numbers and addresses, the latest data takes the form
of photos, videos, audio, and more, so that about 80% of the data is
completely unstructured.
Big Data Characteristics (5 V’s) (Cont…)
3. Veracity
Veracity means the degree of reliability that the data has to offer. Since a major part of the
data is unstructured and irrelevant, Big Data needs a way to filter or
translate it, because the data is crucial in business development.
4. Value
Value is the major issue that we need to concentrate on. It is not just the amount of data that
we store or process. It is the amount of valuable, reliable, and trustworthy data that
needs to be stored, processed, and analyzed to find insights.
5. Velocity
Velocity plays a major role compared to the others; there is no point in investing so much only to
end up waiting for the data. So, a major aspect of Big Data is to provide data on demand
and at a faster pace.
Big Data Features
Analytics is the technique of examining data and reports to obtain actionable insights
that can be used to comprehend and improve business performance. Business users
may gain insights from data, recognize trends, and make better decisions with
workforce analytics.
Reporting is the process of presenting data from numerous sources clearly and simply.
The procedure is always carefully set out to report correct data and avoid
misunderstandings.
In general, the procedures needed to create a report are as follows:
Determining the business requirement
Obtaining and compiling essential data
Technical data translation
Recognizing the data context
Building dashboards for reporting
Providing real-time reporting
Allowing users to dive down into reports
Modern Data Analytic Tools