Unit - 1 (Big Data)

The document provides an overview of Big Data, defining it as data that exceeds conventional processing capabilities and categorizing it into structured, unstructured, and semi-structured data. It discusses the evolution of Big Data through three phases, highlights the architecture and technology components involved, and outlines various applications across sectors like healthcare, banking, and retail. Additionally, it emphasizes the importance of Big Data in improving business operations and decision-making, while also addressing security and compliance measures.


B.S.A. COLLEGE OF ENGG AND TECHNOLOGY, MATHURA

KCA022: Big Data


BIG DATA ANALYTICS
UNIT - I

INTRODUCTION TO BIG DATA


Prepared by : Er. Shahid Hussain
Introduction to Big Data
Big data is data that exceeds the processing capacity of conventional database
systems. The data is too big, moves too fast, or doesn’t fit the strictures of your
database architectures. To gain value from this data, you must choose an
alternative way to process it. Big data deals with large and complex datasets
that can be structured, semi-structured, or unstructured and that typically do not fit
into memory to be processed. Big data is also a field that treats ways to analyze,
systematically extract information from, or otherwise deal with data sets that are
too large or complex to be dealt with by traditional data-processing application
software.

Types of digital data


Digital data can be broadly classified into 3 types.

1. Structured Data:
Structured data is created using a fixed schema and is maintained in tabular format.
The elements in structured data are addressable for effective analysis. It covers
all data that can be stored in an SQL database in tabular form. Because of its
fixed layout, structured data is the simplest kind of data to process and manage.
Examples –
Relational data, geo-location, credit card numbers, addresses, etc.
As an example of relational data, suppose a university has to maintain a record
of its students: the name, ID, address, and email of each student. The following
relational schema and table can be used to store these records.

S_ID    S_Name    S_Address    S_Email

1001    A         Delhi        [email protected]

1002    B         Mumbai       [email protected]
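
The student table above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name student and the e-mail values are assumptions made for the example, not part of the original record.

```python
import sqlite3

# In-memory database; the table mirrors the S_ID / S_Name / S_Address / S_Email schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE student (
        S_ID      INTEGER PRIMARY KEY,
        S_Name    TEXT NOT NULL,
        S_Address TEXT,
        S_Email   TEXT
    )
    """
)

# Hypothetical e-mail values, for illustration only.
rows = [
    (1001, "A", "Delhi", "a@example.com"),
    (1002, "B", "Mumbai", "b@example.com"),
]
conn.executemany("INSERT INTO student VALUES (?, ?, ?, ?)", rows)

# Because every row follows the same schema, each element is addressable by column.
delhi_students = conn.execute(
    "SELECT S_ID, S_Name FROM student WHERE S_Address = ?", ("Delhi",)
).fetchall()
print(delhi_students)
```

The fixed schema is what lets the query address S_Address directly in every row.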

2. Unstructured Data:
Unstructured data is data that does not follow a pre-defined standard or any
organized format. This kind of data is not a good fit for a relational database,
because a relational database stores data in a pre-defined, organized way.
Unstructured data is nevertheless very important for the big data domain, and
there are many platforms for storing and managing it, such as NoSQL databases.
Examples – Word documents, PDF files, text, media logs, etc.

3. Semi-Structured Data:
Semi-structured data is information that does not reside in a relational
database but that has some organizational properties that make it easier to
analyze. With some processing it can be stored in a relational database, although
this is very hard for some kinds of semi-structured data; semi-structured
formats exist to save space.
Example – XML data.
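
A small sketch of why XML counts as semi-structured, using Python's standard xml.etree.ElementTree module. The XML document, its tag names, and its values are hypothetical; the point is that the tags name each value even though the two records do not share the same fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: both records are <student> elements,
# but the second one has no <email> and adds a <phone> field.
xml_text = """
<students>
  <student id="1001">
    <name>A</name>
    <city>Delhi</city>
    <email>a@example.com</email>
  </student>
  <student id="1002">
    <name>B</name>
    <city>Mumbai</city>
    <phone>000-0000</phone>
  </student>
</students>
"""

root = ET.fromstring(xml_text)

records = []
for student in root.findall("student"):
    # Tags give each value a name, so fields can be read without a fixed schema.
    record = {"id": student.get("id")}
    for child in student:
        record[child.tag] = child.text
    records.append(record)

for record in records:
    print(record)
```

Loading these records into a relational table would need either nullable columns for email and phone or a separate key-value table, which is the extra processing the text refers to.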

History of Big Data innovation

Big Data phase 1.0


Data analysis, data analytics and Big Data originate from the longstanding domain of
database management. They rely heavily on the storage, extraction, and optimization
techniques that are common for data stored in Relational Database
Management Systems (RDBMS).
Database management and data warehousing are considered the core components
of Big Data Phase 1. They provide the foundation of modern data analysis as we know
it today, using well-known techniques such as database queries, online analytical
processing and standard reporting tools.

Big Data phase 2.0
From the early 2000s, the Internet and the Web began to offer unique data
collections and data analysis opportunities. With the expansion of web traffic and
online stores, companies such as Yahoo, Amazon and eBay started to analyze
customer behavior through click-rates, IP-specific location data and search
logs. This opened a whole new world of possibilities.
From a data analysis, data analytics, and Big Data point of view, HTTP-based web
traffic introduced a massive increase in semi-structured and unstructured data.
Besides the standard structured data types, organizations now needed to find new
approaches and storage solutions to deal with these new data types in order to
analyze them effectively. The arrival and growth of social media data greatly
increased the need for tools, technologies and analytics techniques that were able
to extract meaningful information out of this unstructured data.

Big Data phase 3.0


Although web-based unstructured content is still the main focus for many
organizations in data analysis, data analytics, and big data, new possibilities
to retrieve valuable information are emerging from mobile devices.
Mobile devices make it possible not only to analyze behavioral data (such as clicks
and search queries), but also to store and analyze location-based
data (GPS data). With the advancement of these mobile devices, it is possible to
track movement and analyze physical behavior and even health-related data (such as
the number of steps you take per day). This data provides a whole new range of
opportunities, from transportation to city design and health care.
Simultaneously, the rise of sensor-based, internet-enabled devices is increasing
data generation like never before. Known as the ‘Internet of Things’ (IoT),
millions of TVs, thermostats, wearables and even refrigerators are now generating
vast amounts of data every day. And the race to extract meaningful and valuable
information out of these new data sources has only just begun.

Introduction to Big Data platform


A big data platform is an integrated computing solution that combines numerous
software systems, tools, and hardware for big data management. It is a one-stop
architecture that solves all the data needs of a business regardless of the volume
and size of the data at hand. Due to their efficiency in data management, enterprises
are increasingly adopting big data platforms to gather tons of data and convert them
into structured, actionable business insights.

Drivers for Big Data
A number of business drivers are at the core of the rise of Big Data and explain
why it has quickly become one of the most coveted topics in the
industry. Six main business drivers can be identified:
1. The digitization of society;
2. The plummeting of technology costs;
3. Connectivity through cloud computing;
4. Increased knowledge about data science;
5. Social media applications;
6. The upcoming Internet-of-Things (IoT).

Big data architecture


A big data architecture is designed to handle the ingestion, processing, and analysis of
data that is too large or complex for traditional database systems.

Big data solutions typically involve one or more of the following types of workload:

● Batch processing of big data sources at rest.
● Real-time processing of big data in motion.
● Interactive exploration of big data.
● Predictive analytics and machine learning.
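
The difference between the first two workloads can be sketched in plain Python. This is a toy illustration under assumed data: the click events and the names batch_page_counts and StreamingPageCounter are hypothetical. The batch function sees the whole data set at rest, while the streaming class updates its result one event at a time.

```python
from collections import Counter

# Hypothetical click events: (user, page).
events = [
    ("u1", "home"), ("u2", "home"), ("u1", "cart"),
    ("u3", "home"), ("u2", "cart"), ("u1", "home"),
]

def batch_page_counts(stored_events):
    """Batch: process the complete data set at rest in one pass."""
    return Counter(page for _, page in stored_events)

class StreamingPageCounter:
    """Real-time: update the result incrementally as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        _, page = event
        self.counts[page] += 1
        return dict(self.counts)  # snapshot available after every event

batch_result = batch_page_counts(events)

stream = StreamingPageCounter()
for event in events:
    snapshot = stream.on_event(event)

print(batch_result)
print(snapshot)
```

Both paths end with the same counts; the streaming version simply has a usable answer after every event instead of only at the end of the job.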

Most big data architectures include some or all of the following components:

● Data sources: All big data solutions start with one or more data sources. Examples
include:
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.
● Data storage: Data for batch processing operations is typically stored in a
distributed file store that can hold high volumes of large files in various formats.
This kind of store is often called a data lake. Options for implementing this storage
include Azure Data Lake Store or blob containers in Azure Storage.
● Batch processing: Because the data sets are so large, often a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
otherwise prepare the data for analysis. Usually these jobs involve reading source
files, processing them, and writing the output to new files. Options include running
U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce
jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in
an HDInsight Spark cluster.
● Real-time message ingestion: If the solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for
stream processing. This might be a simple data store, where incoming messages
are dropped into a folder for processing. However, many solutions need a message
ingestion store to act as a buffer for messages, and to support scale-out
processing, reliable delivery, and other message queuing semantics. Options
include Azure Event Hubs, Azure IoT Hub, and Kafka.
● Stream processing: After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data for analysis. The
processed stream data is then written to an output sink. Azure Stream Analytics
provides a managed stream processing service based on perpetually running SQL
queries that operate on unbounded streams. You can also use open source Apache
streaming technologies like Spark Streaming in an HDInsight cluster.
● Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using
analytical tools. The analytical data store used to serve these queries can be a
Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a
low-latency NoSQL technology such as HBase, or an interactive Hive database that
provides a metadata abstraction over data files in the distributed data store. Azure
Synapse Analytics provides a managed service for large-scale, cloud-based data
warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which
can also be used to serve data for analysis.
● Analysis and reporting: The goal of most big data solutions is to provide insights
into the data through analysis and reporting. To empower users to analyze the
data, the architecture may include a data modeling layer, such as a
multidimensional OLAP cube or tabular data model in Azure Analysis Services. It
might also support self-service BI, using the modeling and visualization
technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can
also take the form of interactive data exploration by data scientists or data
analysts. For these scenarios, many Azure services support analytical notebooks,
such as Jupyter, enabling these users to leverage their existing skills with Python or
R. For large-scale data exploration, you can use Microsoft R Server, either
standalone or with Spark.
● Orchestration: Most big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data, move data
between multiple sources and sinks, load the processed data into an analytical
data store, or push the results straight to a report or dashboard. To automate
these workflows, you can use an orchestration technology such as Azure Data Factory
or Apache Oozie and Sqoop.
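
The orchestration idea (run dependent processing steps in the right order) can be sketched with Python's standard graphlib module. This is not how Azure Data Factory or Oozie work internally; the step names, the shared context dictionary, and the sample values are all assumptions made for the illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow steps; each function stands in for a real job.
def ingest(ctx):
    ctx["raw"] = ["10", "x", "25", "7"]  # pretend these came from source files

def clean(ctx):
    ctx["clean"] = [int(v) for v in ctx["raw"] if v.isdigit()]

def aggregate(ctx):
    ctx["total"] = sum(ctx["clean"])

def report(ctx):
    ctx["report"] = f"total={ctx['total']}"

steps = {"ingest": ingest, "clean": clean, "aggregate": aggregate, "report": report}

# Each step maps to the set of steps it depends on.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
}

context = {}
order = list(TopologicalSorter(dependencies).static_order())
for name in order:
    steps[name](context)

print(order)
print(context["report"])
```

Declaring dependencies instead of hard-coding the call order is the design choice that lets an orchestrator rerun or reschedule individual steps.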

7 V's of Big Data


● Volume: As the term implies, big data analytics entails handling and analyzing vast
amounts of data. To effectively work with such massive datasets, specialized tools
and infrastructure are necessary for capturing, storing, managing, cleaning,
transforming, analyzing, and reporting the data.
● Velocity: Velocity denotes the speed at which data is generated. To keep up with
the rapid generation of data, systems for processing and analyzing data must
possess sufficient capacity to handle the influx of data and deliver timely, actionable
insights.
● Variety: Variety refers to the diversity of data types and sources. Data can
manifest in various forms, originate from diverse sources, and exist in structured or
unstructured formats. Understanding the types of data and their sources, as well as
the interrelationships within the datasets, is vital for generating meaningful insights
from big data.
● Variability: Big data often contains noisy and incomplete data points, which can
obscure valuable insights. Addressing this variability typically involves data cleaning
and validation processes to ensure data quality.
● Veracity: Veracity pertains to the accuracy and authenticity of the data. Data must
undergo validation to ensure that it accurately represents essential business
functions and that any data manipulation, modeling, and analysis does not
compromise the data's accuracy.
● Value: A successful big data analytics strategy must generate value. The insights
derived from the analysis should provide meaningful guidance for improving
operations, enhancing customer service, or creating other forms of value. An integral
part of developing a big data analytics strategy is distinguishing between data that
can contribute value and data that cannot.
● Visualization: Visualization plays a vital role in data analytics, as it involves
presenting the analyzed data in a visually comprehensible manner. When planning
data visualization, it is essential to consider the end user and the decisions the
visualizations aim to support. Well-executed data visualization facilitates swift and
well-informed decision-making.
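
The data cleaning and validation mentioned under Variability and Veracity can be sketched as follows. The sensor readings, the field names, and the valid temperature range are assumptions chosen for the example; a real pipeline would take its rules from the business domain.

```python
# Hypothetical sensor readings; None marks a missing value.
readings = [
    {"sensor": "s1", "temp_c": 21.5},
    {"sensor": "s2", "temp_c": None},   # incomplete
    {"sensor": "s1", "temp_c": 21.5},   # duplicate
    {"sensor": "s3", "temp_c": 999.0},  # outside the assumed valid range
    {"sensor": "s4", "temp_c": 19.0},
]

VALID_RANGE = (-50.0, 60.0)  # assumed plausible range for this example

def clean_readings(records):
    """Split records into kept and rejected (incomplete, out-of-range, duplicate)."""
    seen = set()
    kept, rejected = [], []
    for rec in records:
        value = rec["temp_c"]
        key = (rec["sensor"], value)
        out_of_range = value is not None and not (
            VALID_RANGE[0] <= value <= VALID_RANGE[1]
        )
        if value is None or out_of_range or key in seen:
            rejected.append(rec)
            continue
        seen.add(key)
        kept.append(rec)
    return kept, rejected

kept, rejected = clean_readings(readings)
print(len(kept), len(rejected))
```

Keeping the rejected records, rather than silently dropping them, makes it possible to report how much of the data failed validation.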

Big Data technology components


Two main building blocks are being added to the enterprise stack to accommodate big data:
● Hadoop: Provides storage capability through a distributed, shared-nothing file system, and
analysis capability through MapReduce.
● NoSQL: Provides the capability to capture, read, and update, in real time, the large influx of
unstructured data and data without schemas; examples include click streams, social media,
log files, event data, mobility trends, and sensor and machine data.
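
The MapReduce model named above can be illustrated with a single-process word count. This sketch only shows the map, shuffle, and reduce phases; it does not use Hadoop, and the input chunks and function names are hypothetical. In a real cluster each chunk would be processed on a different node.

```python
from collections import defaultdict

# Hypothetical input, already split into chunks as a distributed file system might do.
chunks = [
    "big data needs new tools",
    "new tools process big data",
]

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word, 1) for word in chunk.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key into one result."""
    return {word: sum(counts) for word, counts in groups.items()}

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each map call looks at only one chunk and each reduce call at only one key, both phases can be spread across many machines without sharing state.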

Applications of big data


Where it is used
1. Life Sciences:
Clinical research is a slow and expensive process, with trials failing for a variety of reasons.
Advanced analytics, artificial intelligence (AI) and the Internet of Medical Things (IoMT)
unlock the potential of improving speed and efficiency at every stage of clinical research by
delivering more intelligent, automated solutions.
2. Banking:
Financial institutions gather and access analytical insight from large volumes of unstructured
data in order to make sound financial decisions. Big data analytics allows them to access the
information they need when they need it, by eliminating overlapping, redundant tools and systems.
3. Manufacturing:
For manufacturers, solving problems is nothing new. They wrestle with difficult problems on a
daily basis - from complex supply chains, to motion applications, to labor constraints and
equipment breakdowns. That's why big data analytics is essential in the manufacturing industry,
as it has allowed competitive organizations to discover new cost saving opportunities and
revenue opportunities.
4. Health Care:
Big data is a given in the health care industry. Patient records, health plans, insurance
information and other types of information can be difficult to manage – but are full of key
insights once analytics are applied. That’s why big data analytics technology is so important to
health care. By analyzing large amounts of information – both structured and unstructured –
quickly, health care providers can provide lifesaving diagnoses or treatment options almost
immediately.
5. Government:
Certain government agencies face a big challenge: tighten the budget without compromising
quality or productivity. This is particularly troublesome with law enforcement agencies, which
are struggling to keep crime rates down with relatively scarce resources. And that’s why many
agencies use big data analytics; the technology streamlines operations while giving the agency
a more holistic view of criminal activity.
6. Retail:
Customer service has evolved in the past several years, as savvier shoppers expect retailers to
understand exactly what they need, when they need it. Big data analytics technology helps
retailers meet those demands. Armed with endless amounts of data from customer loyalty
programs, buying habits and other sources, retailers not only have an in-depth understanding
of their customers, they can also predict trends, recommend new products,
and boost profitability.

What’s the importance of Big Data?


Big Data can improve business operations, offer more personalized service to customers,
improve marketing campaigns and, in general, contribute to more efficient decision-making.
When a business knows how to use its data, it gains a competitive advantage over those that
don’t, which makes it easier to grow and to increase market share.
Regardless of its sector, any business can use Big Data to improve its operations and better reach
its audience.

Big Data features

Big data features are divided into four categories:
(1) Big data security is the collective term for all the measures and tools
used to guard both the data and analytics processes from attacks, theft, or
other malicious activities that could harm or negatively affect them. Much like
other forms of cyber-security, the big data variant is concerned with attacks
that originate either from the online or offline spheres.
Solution:
● One of the most common security tools is encryption, a relatively simple
tool that can go a long way. Encrypted data is useless to external actors
such as hackers if they don’t have the key to unlock it. Moreover,
encrypting data means that information is protected both at input and
at output.
● Building a strong firewall is another useful big data security tool.
Firewalls are effective at filtering traffic that both enters and leaves
servers. Organizations can prevent attacks before they happen by
creating strong filters that block untrusted third parties and unknown data
sources.
(2) Big data compliance: Data compliance is the formal governance
structure in place to ensure an organization complies with laws, regulations,
and standards around its data. The process governs the possession,
organization, storage, and management of digital assets or data to protect them
from loss, theft, misuse, or compromise. The stipulated regulations and
standards determine what data needs to be protected as well as the most
suitable processes for doing so.
Solution:
● Successful businesses approach data compliance in a holistic way.
They embark on integrating data governance with a data management
program that involves documentation of ownership, procedures,
definitions, and policies to bolster data compliance.
● To cultivate a culture of data compliance, organizations should develop
an all-encompassing approach for reviewing services, operations,
products, and processes to eliminate any compliance gaps. The
organization’s ability to protect data while maintaining user access to
real-time data should be a priority.

(3) Big data auditing and protection: Big Data and analytics make it possible
to review financial reporting, detect fraud, and examine operational business
risks more efficiently and effectively.

Data auditing, or data risk management, is a comprehensive assessment of
all aspects of data gathering, storage, and usage, including internal data such
as financial records and external data like customer and market trend
information.

Data auditing involves monitoring data creation, collection, usage, storage,
and destruction. It helps improve data quality, identify gaps and errors, and
make informed decisions based on accurate analytics.

Solution: Conducting a data audit involves planning, data collection and
analysis, monitoring compliance, presenting findings, and implementing
adjustments.

(4) Big data privacy and protection:

Big data privacy involves properly managing big data to minimize risk and
protect sensitive data. Because big data comprises large and complex data
sets, many traditional privacy processes cannot handle the scale and velocity
required.

Big Data privacy and ethics


Big data privacy involves properly managing big data to minimize risk and
protect sensitive data. Because big data comprises large and complex data
sets, many traditional privacy processes cannot handle the scale and velocity
required. To safeguard big data and ensure it can be used for analytics, you
need to create a framework for privacy protection that can handle the volume,
velocity, variety, and value of big data as it is moved between environments,
processed, analyzed, and shared.
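
One building block of such a privacy framework is pseudonymization: replacing direct identifiers with keyed hashes before data is shared for analytics. The sketch below uses Python's standard hmac and hashlib modules. The secret key, the record fields, and the 16-character truncation are assumptions made for the illustration, and pseudonymization alone is not a complete privacy solution.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would come from a key-management system.
SECRET_KEY = b"example-key-not-for-production"

def pseudonymize(value: str) -> str:
    """Replace an identifier with a keyed hash so records can still be joined."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

records = [
    {"customer": "alice@example.com", "amount": 120},
    {"customer": "bob@example.com", "amount": 75},
    {"customer": "alice@example.com", "amount": 30},
]

safe_records = [
    {"customer": pseudonymize(r["customer"]), "amount": r["amount"]}
    for r in records
]

for r in safe_records:
    print(r)
```

The same input always maps to the same token, so per-customer analysis still works, while the original e-mail address never leaves the protected environment.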

Big Data Analytics


Big data analytics refers to the methods, tools, and applications used to collect,
process, and derive insights from varied, high-volume, high-velocity data sets. These
data sets may come from a variety of sources, such as web, mobile, email, social
media, and networked smart devices.

Business Intelligence vs Big Data

Parameter           Big Data                              Business Intelligence

Definition          Large and diverse datasets that       The process of collecting,
                    require advanced analytics            analyzing, and presenting
                    techniques to uncover patterns,       structured data to support
                    correlations, and insights, often     informed decision-making and
                    involving unstructured and            drive business growth.
                    external data sources.

Data Type           Diverse data types, including         Structured data from internal
                    unstructured data                     sources

Data Volume         Vast amounts of data                  Moderate to large datasets

Data Sources        External and internal sources         Internal sources (databases,
                    (social media, sensors,               spreadsheets, etc.)
                    transactions, etc.)

Analysis Approach   Advanced analytics techniques         Aggregating and analyzing
                    (data mining, machine learning,       structured data
                    predictive analytics, etc.)

Purpose             Discover insights, patterns, and      Support operational
                    trends                                decision-making

Time Sensitivity    Real-time and near-real-time          Real-time and historical
                    processing                            analysis

User Role           Data scientists, analysts,            Executives, managers, analysts,
                    researchers                           decision-makers

Challenges in BIG DATA


1. Need For Synchronization Across Disparate Data Sources
As data sets become bigger and more diverse, incorporating them into
an analytical platform becomes a big challenge. If this is
overlooked, it will create gaps and lead to wrong messages
and insights.
2. Acute Shortage Of Professionals Who Understand Big Data
Analysis
Analysis is what makes the voluminous amount of data being produced
every minute useful. With the exponential rise of data, a huge demand
for Big Data scientists and Big Data analysts has been created in the
market. Because the job of a data scientist is multidisciplinary, it is
important for business organizations to hire data scientists with
varied skills. A major challenge faced by businesses is that there is
a sharp shortage of such professionals in comparison to the massive
amount of data being produced.
3. Getting Meaningful Insights Through The Use Of Big Data
Analytics
It is imperative for business organizations to gain important insights
from Big Data analytics, and it is also important that only the relevant
department has access to this information. A big challenge faced by
companies in Big Data analytics is bridging this wide gap in an
effective manner.
4. Getting Voluminous Data Into The Big Data Platform
It is hardly surprising that data is growing with every passing day. This
simply indicates that business organizations need to handle a large
amount of data on a daily basis. The amount and variety of data
available these days can overwhelm any data engineer, and that is
why it is considered vital to make data access easy and
convenient for brand owners and managers.
5. Uncertainty Of Data Management Landscape
With the rise of Big Data, new technologies and companies are being developed
every day. However, a big challenge faced by companies in
Big Data analytics is to find out which technology will suit them best
without introducing new problems and potential risks.
6. Data Storage And Quality
Business organizations are growing at a rapid pace. With the tremendous
growth of companies and large business organizations, the amount of
data produced increases. The storage of this massive amount of data is
becoming a real challenge for everyone. Popular data storage options
like data lakes and data warehouses are commonly used to gather and
store large quantities of unstructured and structured data in its
native format. The real problem arises when a data lake or warehouse
tries to combine unstructured and inconsistent data from diverse
sources and encounters errors. Missing data, inconsistent data, logic
conflicts, and duplicate data all result in data quality challenges.
7. Security And Privacy Of Data
Once business enterprises discover how to use Big Data, it brings
them a wide range of possibilities and opportunities. However, it also
involves potential risks when it comes to
the privacy and the security of the data. The Big Data tools used for
analysis and storage utilize data from disparate sources. This
eventually leads to a high risk of exposure of the data, making it
vulnerable. Thus, the rise of voluminous amounts of data increases
privacy and security concerns.

Classification of analytics
1) Descriptive analytics
Descriptive analytics is a statistical method that is used to search and
summarize historical data in order to identify patterns or meaning.
2) Predictive analytics
Predictive Analytics is a statistical method that utilizes algorithms and
machine learning to identify trends in data and predict future
behaviors.
3) Prescriptive analytics
Prescriptive analytics is a statistical method used to generate
recommendations and make decisions based on the computational
findings of algorithmic models.
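
The three classes can be contrasted on one small data set. This is a minimal sketch with made-up numbers: the monthly sales figures, the STOCK_ON_HAND threshold, and the reorder rule are all assumptions, and the predictive step is a plain least-squares line rather than a machine learning model.

```python
from statistics import mean

# Hypothetical monthly sales figures.
sales = [100.0, 110.0, 120.0, 130.0]
months = list(range(1, len(sales) + 1))

# Descriptive: summarize what has already happened.
average_sales = mean(sales)

# Predictive: fit a least-squares line and extrapolate one month ahead.
x_bar, y_bar = mean(months), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales)) / sum(
    (x - x_bar) ** 2 for x in months
)
intercept = y_bar - slope * x_bar
forecast = slope * (len(sales) + 1) + intercept

# Prescriptive: turn the forecast into a recommended action (assumed rule).
STOCK_ON_HAND = 125.0
recommendation = "reorder" if forecast > STOCK_ON_HAND else "hold"

print(average_sales, forecast, recommendation)
```

Each stage builds on the previous one: the summary describes the past, the fitted line projects it forward, and the rule converts the projection into a decision.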
