Big Data Unit 1 Notes


Unit 1 Topics:

Introduction to Big Data


Types of digital data
History of Big Data innovation
Introduction to Big Data platform
Drivers for Big Data
Big Data architecture and characteristics
5 Vs of Big Data
Big Data technology components
Big Data importance and applications
Big Data features – security, compliance, auditing and protection
Big Data privacy and ethics
Big Data Analytics
Challenges of conventional systems
Intelligent data analysis
Nature of data
Analytic processes and tools
Analysis vs reporting
Modern data analytic tools

1. Introduction to Big Data:


The term big data has been in use since the 1990s, with some giving
credit to John Mashey for popularizing the term. Big data usually includes
data sets with sizes beyond the ability of commonly used software
tools to capture, curate, manage, and process data within a tolerable
elapsed time.
Data that is very large in size is called Big Data. Normally we work
with data on the scale of megabytes (Word documents, Excel sheets) or
at most gigabytes (movies, code), but data on the scale of petabytes,
i.e. 10^15 bytes, is called Big Data. It is often stated that almost
90% of today's data has been generated in the past 3 years.
Sources of Big Data
This data comes from many sources, such as:
• Social networking sites: Facebook, Google, and LinkedIn generate huge amounts of data on a day-to-day basis, as they have billions of users worldwide.
• E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge numbers of logs, from which users' buying trends can be traced.
• Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store the data of their millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
3V's of Big Data
• Velocity: Data is being generated at a very fast rate. It is estimated that the volume of data doubles roughly every two years.
• Variety: Nowadays data is not stored only in rows and columns; data is structured as well as unstructured. Log files and CCTV footage are unstructured data, while data that can be saved in tables, such as a bank's transaction data, is structured.
• Volume: The amount of data we deal with is of very large size, on the order of petabytes.

2. Types of Digital Data (mnemonic: SUS - Structured, Unstructured, Semi-structured):

• Structured Data: Organized in a predefined format, typically stored in databases, and easily searchable. Examples include transaction records, customer profiles, and inventory databases.
Cons of Structured Data
1. Structured data can only be leveraged in cases of predefined
functionalities. This means that structured data has limited
flexibility and is suitable for certain specific use cases only.
2. Structured data is stored in a data warehouse with rigid
constraints and a definite schema. Any change in requirements
would mean updating all that structured data to meet the new
needs. This is a massive drawback in terms of resource and time
management.

• Unstructured Data: Doesn't have a predefined structure and is not easily organized or analyzed by traditional methods. Examples include text documents, social media posts, videos, and images.
• Unstructured data is the kind of data that doesn’t adhere to any
definite schema or set of rules. Its arrangement is unplanned and
haphazard.
• Photos, videos, text documents, and log files can be generally
considered unstructured data. Even though the metadata
accompanying an image or a video may be semi-structured, the
actual data being dealt with is unstructured.
• Additionally, unstructured data is known as "dark data" because it cannot be analyzed without the proper software tools.

• Semi-Structured Data: Has some organizational properties but doesn't conform to the structure of traditional relational databases. Examples include XML and JSON files, web logs, and metadata.

• Semi-structured data is not bound by any rigid schema for data storage and handling. The data is not in the relational format and is not neatly organized into rows and columns like that in a spreadsheet. However, there are some features, like key-value pairs, that help in discerning the different entities from each other.
• Since semi-structured data doesn’t need a structured query
language, it is commonly called NoSQL data.
• A data serialization language is used to exchange semi-
structured data across systems that may even have varied
underlying infrastructure.
• Semi-structured content is often used to store metadata about
a business process but it can also include files containing
machine instructions for computer programs.
• This type of information typically comes from external sources
such as social media platforms or other web-based data feeds.
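To make the distinction concrete, here is a minimal Python sketch contrasting the three types. The records and values below are invented purely for illustration:

    import json

    # Structured: fixed schema, rows and columns (uniform records)
    transactions = [
        {"id": 1, "amount": 250.0, "currency": "INR"},
        {"id": 2, "amount": 99.5, "currency": "INR"},
    ]

    # Semi-structured: JSON key-value pairs, but no rigid schema;
    # a serialization language (JSON) moves it between systems
    raw = '{"user": "alice", "tags": ["sports", "news"], "location": null}'
    profile = json.loads(raw)

    # Unstructured: free text with no schema; needs text mining/NLP to analyze
    review = "Great product, arrived two days late but works perfectly."

    print(profile["user"], len(transactions), len(review.split()))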
3. History of Big Data Innovation:
• The concept of Big Data emerged in the early 2000s,
driven by the need to process and analyze large volumes
of data generated by internet companies like Google,
Yahoo, and Amazon.
• Technologies such as Apache Hadoop, developed by Doug
Cutting and Mike Cafarella, played a pivotal role in
enabling the storage and processing of massive datasets
across distributed computing clusters.
• Since then, advancements in storage, processing, and
analytics technologies, as well as the proliferation of cloud
computing, have further accelerated the innovation and
adoption of Big Data solutions.
4. Introduction to Big Data Platform:
• A Big Data platform comprises the infrastructure and tools
used to store, process, and analyze large volumes of data.
It typically includes distributed storage systems,
distributed processing frameworks, and analytics tools.
• Examples of Big Data platforms include Apache Hadoop,
which provides a distributed file system (HDFS) and
MapReduce framework for parallel processing, as well as
cloud-based platforms like Amazon Web Services (AWS)
EMR, Google BigQuery, and Microsoft Azure HDInsight.
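To illustrate the MapReduce framework mentioned above, here is a toy word count in plain Python. This is only a sketch of the programming model; on a real platform such as Hadoop, the map and reduce phases run in parallel across many cluster nodes, and the sort-then-group step below stands in for the shuffle the cluster performs over the network:

    from itertools import groupby
    from operator import itemgetter

    documents = ["big data needs big tools", "data tools scale"]

    # Map phase: emit (word, 1) pairs from each document
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle phase: group the pairs by key (the word)
    mapped.sort(key=itemgetter(0))

    # Reduce phase: sum the counts for each word
    counts = {word: sum(c for _, c in pairs)
              for word, pairs in groupby(mapped, key=itemgetter(0))}
    print(counts)  # {'big': 2, 'data': 2, 'needs': 1, 'scale': 1, 'tools': 2}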
5. Drivers for Big Data (mnemonic: DBTR - Data explosion, Business needs, Technological advancements, Regulatory requirements):
• Several factors drive the growth and adoption of Big Data
solutions:
• Data Explosion: The exponential growth in data
volume generated by various sources, including
social media, IoT devices, sensors, and mobile
devices.
• Business Needs: Organizations increasingly rely on
data-driven insights to gain a competitive edge,
improve decision-making, and enhance customer
experiences.
• Technological Advancements: Advances in storage,
processing, and analytics technologies have made it
more cost-effective and feasible to store, process,
and analyze large datasets.
• Regulatory Requirements: Compliance with data
protection and privacy regulations, such as GDPR
and HIPAA, also drives the adoption of Big Data
solutions to ensure data governance and
compliance.
6. Big Data Architecture and Characteristics:
A big data architecture is designed to handle the ingestion,
processing, and analysis of data that is too large or complex for
traditional database systems.

Big data solutions typically involve one or more of the following types
of workloads:
• Batch processing of big data sources at rest.
• Real-time processing of big data in motion.
• Interactive exploration of big data.
• Predictive analytics and machine learning.
Most big data architectures include some or all of the following
components:
• Data sources: All big data solutions start with one or more data
sources. Examples include:
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server
log files.
o Real-time data sources, such as IoT devices.
• Data storage: Data for batch processing operations is typically
stored in a distributed file store that can hold high volumes of
large files in various formats. This kind of store is often called
a data lake. Options for implementing this storage include
Azure Data Lake Store or blob containers in Azure Storage.
• Batch processing: Because the data sets are so large, often a big
data solution must process data files using long-running batch
jobs to filter, aggregate, and otherwise prepare the data for
analysis. Usually these jobs involve reading source files,
processing them, and writing the output to new files. Options
include running U-SQL jobs in Azure Data Lake Analytics, using
Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop
cluster, or using Java, Scala, or Python programs in an HDInsight
Spark cluster. (A minimal batch-job sketch follows this list.)
• Real-time message ingestion: If the solution includes real-time
sources, the architecture must include a way to capture and
store real-time messages for stream processing. This might be a
simple data store, where incoming messages are dropped into a
folder for processing. However, many solutions need a message
ingestion store to act as a buffer for messages, and to support
scale-out processing, reliable delivery, and other message
queuing semantics. Options include Azure Event Hubs, Azure
IoT Hubs, and Kafka.
• Stream processing: After capturing real-time messages, the
solution must process them by filtering, aggregating, and
otherwise preparing the data for analysis. The processed
stream data is then written to an output sink. Azure Stream
Analytics provides a managed stream processing service based
on perpetually running SQL queries that operate on unbounded
streams. You can also use open source Apache streaming
technologies like Spark Streaming in an HDInsight cluster. (A minimal streaming sketch also follows this list.)
• Analytical data store: Many big data solutions prepare data for
analysis and then serve the processed data in a structured
format that can be queried using analytical tools. The analytical
data store used to serve these queries can be a Kimball-style
relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be
presented through a low-latency NoSQL technology such as
HBase, or an interactive Hive database that provides a
metadata abstraction over data files in the distributed data
store. Azure Synapse Analytics provides a managed service for
large-scale, cloud-based data warehousing. HDInsight supports
Interactive Hive, HBase, and Spark SQL, which can also be used
to serve data for analysis.
• Analysis and reporting: The goal of most big data solutions is to
provide insights into the data through analysis and reporting. To
empower users to analyze the data, the architecture may
include a data modeling layer, such as a multidimensional OLAP
cube or tabular data model in Azure Analysis Services. It might
also support self-service BI, using the modeling and
visualization technologies in Microsoft Power BI or Microsoft
Excel. Analysis and reporting can also take the form of
interactive data exploration by data scientists or data analysts.
For these scenarios, many Azure services support analytical
notebooks, such as Jupyter, enabling these users to leverage
their existing skills with Python or R. For large-scale data
exploration, you can use Microsoft R Server, either standalone
or with Spark.
• Orchestration: Most big data solutions consist of repeated data
processing operations, encapsulated in workflows, that
transform source data, move data between multiple sources
and sinks, load the processed data into an analytical data store,
or push the results straight to a report or dashboard. To
automate these workflows, you can use an orchestration
technology such as Azure Data Factory or Apache Oozie and
Sqoop.
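As a concrete illustration of the batch-processing component above, here is a minimal PySpark sketch of a long-running batch job. The file paths, column names, and filter condition are placeholders, not taken from any real system:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("batch-prep").getOrCreate()

    # Read many large source files from the distributed store (data lake)
    logs = spark.read.json("hdfs:///data/raw/weblogs/")

    # Filter and aggregate to prepare the data for analysis
    daily = (logs
             .filter(F.col("status") == 200)
             .groupBy("date", "page")
             .agg(F.count("*").alias("hits")))

    # Write the prepared output to new files
    daily.write.mode("overwrite").parquet("hdfs:///data/prepared/daily_hits/")
    spark.stop()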
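And here is a matching Spark Structured Streaming sketch for the stream-processing component: it consumes messages from Kafka, aggregates the unbounded stream, and writes to an output sink. The broker address and topic name are placeholders, and the Kafka source assumes the spark-sql-kafka connector package is available:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stream-prep").getOrCreate()

    # Capture real-time messages from the ingestion store (Kafka)
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "sensor-readings")
              .load())

    # Kafka delivers raw bytes; cast the key and count messages per key
    counts = (stream
              .select(F.col("key").cast("string"))
              .groupBy("key")
              .count())

    # Write the processed stream to a sink (console here; could be files, Kafka, ...)
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()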
Azure includes many services that can be used in a big data
architecture. They fall roughly into two categories:
• Managed services, including Azure Data Lake Store, Azure Data
Lake Analytics, Azure Synapse Analytics, Azure Stream Analytics,
Azure Event Hubs, Azure IoT Hub, and Azure Data Factory.
• Open source technologies based on the Apache Hadoop
platform, including HDFS, HBase, Hive, Spark, Oozie, Sqoop, and
Kafka. These technologies are available on Azure in the Azure
HDInsight service.
These options are not mutually exclusive, and many solutions
combine open source technologies with Azure services.
When to use this architecture
Consider this architecture style when you need to:
• Store and process data in volumes too large for a traditional
database.
• Transform unstructured data for analysis and reporting.
• Capture, process, and analyze unbounded streams of data in
real time, or with low latency.
• Use Azure Machine Learning or Azure Cognitive Services.
Benefits
• Technology choices. You can mix and match Azure managed
services and Apache technologies in HDInsight clusters, to
capitalize on existing skills or technology investments.
• Performance through parallelism. Big data solutions take
advantage of parallelism, enabling high-performance solutions
that scale to large volumes of data.
• Elastic scale. All of the components in the big data architecture
support scale-out provisioning, so that you can adjust your
solution to small or large workloads, and pay only for the
resources that you use.
• Interoperability with existing solutions. The components of the
big data architecture are also used for IoT processing and
enterprise BI solutions, enabling you to create an integrated
solution across data workloads.
Challenges
• Complexity. Big data solutions can be extremely complex, with
numerous components to handle data ingestion from multiple
data sources. It can be challenging to build, test, and
troubleshoot big data processes. Moreover, there may be a
large number of configuration settings across multiple systems
that must be used in order to optimize performance.
• Skillset. Many big data technologies are highly specialized, and
use frameworks and languages that are not typical of more
general application architectures. On the other hand, big data
technologies are evolving new APIs that build on more
established languages. For example, the U-SQL language in
Azure Data Lake Analytics is based on a combination of
Transact-SQL and C#. Similarly, SQL-based APIs are available for
Hive, HBase, and Spark.
• Technology maturity. Many of the technologies used in big data
are evolving. While core Hadoop technologies such as Hive and
Pig have stabilized, emerging technologies such as Spark
introduce extensive changes and enhancements with each new
release. Managed services such as Azure Data Lake Analytics
and Azure Data Factory are relatively young, compared with
other Azure services, and will likely evolve over time.
• Security. Big data solutions usually rely on storing all static data
in a centralized data lake. Securing access to this data can be
challenging, especially when the data must be ingested and
consumed by multiple applications and platforms.

• Big Data architectures are designed to handle the unique challenges posed by large volumes of data. Key characteristics include:
• Scalability: The ability to scale horizontally by adding
more nodes to handle increasing data volumes and
processing loads.
• Fault Tolerance: Systems are resilient to failures,
with built-in redundancy and data replication to
ensure data integrity and availability.
• Flexibility: Support for diverse data types and
sources, including structured, unstructured, and
semi-structured data.
• Distributed Computing: Data is distributed across
multiple nodes in a cluster, with parallel processing
capabilities to improve performance and scalability.
7. 5 Vs of Big Data:
• The 5 Vs framework describes the key characteristics of
Big Data:
• Volume: The sheer size of data, ranging from
terabytes to exabytes and beyond.
• Velocity: The speed at which data is generated and
processed, often in real-time or near real-time.
• Variety: The diverse types of data, including
structured, unstructured, and semi-structured data.
• Veracity: The quality and reliability of data, including
accuracy, completeness, and consistency.
• Value: The importance and potential insights
derived from analyzing Big Data to drive business
outcomes and create value.
8. Big Data Technology Components (mnemonic: SPA - Storage, Processing, Analytics):

• Storage: Big Data storage solutions include distributed file systems like Hadoop Distributed File System (HDFS), NoSQL databases (e.g., MongoDB, Cassandra), and cloud-based storage services (e.g., Amazon S3, Google Cloud Storage).
• Processing: Big Data processing frameworks enable
distributed computing across clusters of machines.
Examples include Apache Hadoop (with MapReduce),
Apache Spark, Apache Flink, and stream processing
systems like Apache Kafka and Apache Storm.
• Analytics: Big Data analytics tools and algorithms enable
organizations to derive insights from large datasets. This
includes data mining techniques, machine learning
algorithms, predictive analytics, and business intelligence
tools.
9. Big Data Importance and Applications:
• Big Data is crucial for organizations across various
industries to gain insights, make informed decisions, and
drive innovation. Some key applications include:
• Retail: Customer segmentation, personalized
recommendations, supply chain optimization.
• Healthcare: Predictive analytics for disease diagnosis
and treatment, patient monitoring, drug discovery.
• Finance: Fraud detection, risk management,
algorithmic trading, customer analytics.
• Manufacturing: Predictive maintenance, quality
control, supply chain management.
• Telecommunications: Network optimization,
customer churn prediction, targeted marketing.
• Media and Entertainment: Content
recommendation, audience segmentation,
sentiment analysis.
10. Big Data Features – Security, Compliance, Auditing, and
Protection:
• Security: Ensuring the confidentiality, integrity, and
availability of data through encryption, access controls,
and authentication mechanisms.
• Compliance: Adhering to regulatory requirements such as
GDPR, HIPAA, and PCI-DSS to protect sensitive data and
ensure legal and ethical use.
• Auditing: Tracking and monitoring data access, usage, and
modifications for accountability and compliance purposes.
• Protection: Implementing measures to safeguard data
against unauthorized access, data breaches, and cyber
threats using firewalls, intrusion detection systems, and
data encryption.
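As one hedged illustration of the protection measures above, the sketch below encrypts a record at rest with symmetric encryption, using Python's third-party cryptography library. In practice the key would be held in a dedicated key-management service, and encryption is only one layer alongside access controls, auditing, and network defenses:

    from cryptography.fernet import Fernet  # pip install cryptography

    key = Fernet.generate_key()   # secret key; protect and rotate it
    cipher = Fernet(key)

    record = b'{"patient_id": 42, "diagnosis": "..."}'
    token = cipher.encrypt(record)     # ciphertext, safe to store at rest
    restored = cipher.decrypt(token)   # only key holders can read it back
    assert restored == record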
11. Big Data Privacy and Ethics:
• Privacy concerns arise from the collection, storage, and
use of personal data in Big Data analytics. Ethical
considerations include:
• Transparency: Providing clear information about
data collection and usage practices to users.
• Consent: Obtaining explicit consent from individuals
before collecting and processing their data.
• Fairness: Ensuring fair and unbiased data analysis to
avoid discrimination or unfair treatment.
• Accountability: Holding organizations accountable
for the responsible use of data and compliance with
privacy regulations.
12. Big Data Analytics:
How does big data analytics work? (mnemonic: CSPIPAVIF - Collection, Storage, Preprocessing, Integration, Processing, Analysis, Visualization, Interpretation, Feedback)
Big data analytics combines several stages and processes to extract
insights.
Here's a quick overview of what this could look like (a short code sketch follows these steps):
1. Data collection: Gather data from various sources, such as
surveys, social media, websites, databases, and transaction
records. This data can be structured, unstructured, or semi-
structured.
2. Data storage: Store data in distributed systems or cloud-based
solutions. These types of storage can handle a large volume of
data and provide fault tolerance.
3. Data preprocessing: It’s best to clean and preprocess the raw
data before performing analysis. This process could involve
handling missing values, standardizing formats, addressing
outliers, and structuring the data into a more suitable format.
4. Data integration: Data usually comes from various sources in
different formats. Data integration combines the data into a
unified format.
5. Data processing: Most organizations benefit from using
distributed frameworks to process big data. These break down
the tasks into smaller chunks and distribute them across
multiple machines for parallel processing.
6. Data analysis techniques: Depending on the goal of the
analysis, you’ll likely apply several data analysis techniques.
These could include descriptive, predictive, and prescriptive analytics using
machine learning, text mining, exploratory analysis, and other
methods.
7. Data visualization: After analysis, communicate the results
visually, like charts, graphs, dashboards, or other visual tools.
Visualization helps you communicate complex insights in an
understandable and accessible way.
8. Interpretation and decision making: Interpret the insights
gained from your analysis to draw conclusions and make data-
backed decisions. These decisions impact business strategies,
processes, and operations.
9. Feedback and scale: One of the main advantages of big data
analytics frameworks is their ability to scale horizontally. This
scalability enables you to handle increasing data volumes and
maintain performance, so you have a sustainable method for
analyzing large datasets.
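Here is the promised sketch, compressing steps 3, 6, and 7 onto a tiny in-memory dataset with pandas. The numbers are invented; in a genuine big data setting the same logic would run on a distributed framework such as Spark rather than a single DataFrame, and the bar chart assumes matplotlib is installed:

    import pandas as pd

    # Stand-in for collected data: illustrative sales records
    df = pd.DataFrame({
        "region": ["N", "S", "N", "S", "N"],
        "revenue": [120.0, None, 95.0, 130.0, 5000.0],  # missing value + outlier
    })

    # Step 3 - preprocessing: fill the missing value, cap the outlier
    df["revenue"] = df["revenue"].fillna(df["revenue"].median())
    df["revenue"] = df["revenue"].clip(upper=df["revenue"].quantile(0.95))

    # Step 6 - descriptive analysis: aggregate by region
    summary = df.groupby("region")["revenue"].agg(["mean", "sum"])
    print(summary)

    # Step 7 - visualization: bar chart of revenue by region
    summary["sum"].plot(kind="bar", title="Revenue by region")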
It’s important to remember that big data analytics isn’t a linear
process, but a cycle.
You’ll continually gather new data, analyze it, and refine business
strategies based on the results. The whole process is iterative, which
means adapting to changes and making adjustments is key.
The importance of big data analytics
Big data analytics has the potential to transform the way you
operate, make decisions, and innovate. It’s an ideal solution if you’re
dealing with massive datasets and are having difficulty choosing a
suitable analytical approach.
By tapping into the finer details of your information, using techniques
and specific tools, you can use your data as a strategic asset.
Big data analytics enables you to benefit from:
• Informed decision-making: You can make informed decisions
based on actual data, which reduces uncertainty and improves
outcomes.
• Business insights: Analyzing large datasets uncovers hidden
patterns and trends, providing a deeper understanding
of customer behavior and market dynamics.
• Customer understanding: Get insight into customer
preferences and needs so you can personalize experiences and
create more impactful marketing strategies.
• Operational efficiency: By analyzing operational data, you can
optimize processes, identify bottlenecks, and streamline
operations to reduce costs and improve productivity.
• Innovation: Big data analytics can help you uncover new
opportunities and niches within industries. You can identify
unmet needs and emerging trends to develop more innovative
products and services to stay ahead of the competition.
Types of big data analytics
There are four main types of big data analytics (mnemonic: DDPP): descriptive, diagnostic, predictive, and prescriptive. Each serves a different purpose and offers varying levels of insight.
Collectively, they enable businesses to comprehensively understand
their big data and make decisions to drive improved performance.
Let’s take a closer look at each one.
Descriptive analytics
This type focuses on summarizing historical data to tell you what's
happened in the past. It uses aggregation, data mining, and
visualization techniques to understand trends, patterns, and key
performance indicators (KPIs).
Descriptive analytics helps you understand your current situation and
make informed decisions based on historical information.
Diagnostic analytics
Diagnostic analytics goes beyond describing past events and aims to understand why they occurred. It drills down into the data to identify the root causes of specific outcomes or issues.
By analyzing relationships and correlations within the data,
diagnostic analytics helps you gain insights into factors influencing
your results.
Predictive analytics
This type of analytics uses historical data and statistical algorithms to
predict future events. It spots patterns and trends and forecasts what
might happen next.
You can use predictive analytics to anticipate customer behavior,
product demand, market trends, and more to plan and make
strategic decisions proactively.
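Here is a minimal predictive-analytics sketch with scikit-learn, fitting a trend on invented historical sales figures and forecasting the next two months:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Historical data: month number vs. units sold (illustrative values)
    months = np.array([[1], [2], [3], [4], [5], [6]])
    units = np.array([110, 125, 139, 151, 168, 180])

    model = LinearRegression().fit(months, units)
    forecast = model.predict(np.array([[7], [8]]))  # next two months
    print(forecast.round(1))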
Prescriptive analytics
Prescriptive analytics builds on predictive analytics by recommending
actions to optimize future outcomes. It considers various possible
actions and their potential impact on the predicted event or
outcome.
Prescriptive analytics helps you make data-driven decisions by
suggesting the best course of action based on your desired goals and
any constraints.
13. Challenges of Conventional Systems:
• Scalability: Traditional systems often struggle to handle
large volumes of data efficiently. As data grows,
conventional systems may experience performance issues
and require additional resources for processing.
• Efficiency: Conventional systems may lack the efficiency to
process and analyze data in real-time or near-real-time.
This can lead to delays in decision-making and hinder
agility.
• Adaptability: These systems are often rigid and require
significant effort to adapt to changing data structures or
analytical requirements. They may not easily
accommodate new data sources or types of analysis.
• Maintenance: Conventional systems may require manual
intervention for maintenance, updates, and
troubleshooting. This can increase operational costs and
complexity.
14. Intelligent Data Analysis:
• Automation: Intelligent data analysis focuses on
automating analytical processes using advanced
techniques such as machine learning and artificial
intelligence. This automation helps in extracting insights
from data more efficiently.
• Pattern Recognition: Advanced algorithms enable the identification of patterns, trends, and anomalies in data, facilitating better decision-making and predictive capabilities (a short sketch follows this list).
• Predictive Analytics: By leveraging historical data and
predictive modeling, intelligent data analysis can forecast
future trends, behaviors, and outcomes.
• Recommendation Systems: Intelligent data analysis
powers recommendation systems that provide
personalized recommendations based on user preferences
and behavior.
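As a small sketch of the pattern-recognition idea, the snippet below flags anomalies in a series of sensor readings with scikit-learn's IsolationForest; the readings are invented:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Mostly normal readings around 10, plus one obvious anomaly
    readings = np.array([[10.1], [9.8], [10.3], [10.0], [55.0], [9.9]])

    detector = IsolationForest(contamination=0.2, random_state=0)
    labels = detector.fit_predict(readings)  # -1 marks anomalies, 1 marks normal
    print(readings[labels == -1])            # the flagged outlier(s), e.g. 55.0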
15. Nature of Data:
• Volume: Refers to the amount of data generated, which
can range from small datasets to massive volumes
commonly seen in big data applications.
• Velocity: Indicates the speed at which data is generated
and must be processed, such as real-time data streams
from sensors or social media.
• Variety: Data comes in various formats and types,
including structured data (e.g., databases), unstructured
data (e.g., text, images), and semi-structured data (e.g.,
JSON, XML).
• Veracity: Concerns the reliability and accuracy of data, as
it may contain errors, inconsistencies, or biases.
• Value: Refers to the usefulness and relevance of data in
addressing specific business needs or objectives.
16. Analytic Processes and Tools:
• Data Collection: Gathering data from various sources,
including databases, files, APIs, sensors, and streaming
platforms.
• Data Cleaning: Preprocessing data to remove noise,
handle missing values, standardize formats, and ensure
data quality.
• Data Transformation: Converting raw data into a format suitable for analysis, such as feature engineering or dimensionality reduction (a brief sketch follows this list).
• Data Analysis: Applying statistical, machine learning, or
deep learning techniques to uncover insights, patterns,
and relationships in the data.
• Interpretation: Explaining the findings from data analysis
in the context of business objectives or research
questions.
• Visualization: Presenting data analysis results using
charts, graphs, dashboards, or other visual
representations to aid interpretation and decision-making.
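Here is the promised sketch of the transformation step: one-hot encoding a categorical feature (a simple form of feature engineering) and then reducing dimensionality with PCA. The columns and values are illustrative only:

    import pandas as pd
    from sklearn.decomposition import PCA

    df = pd.DataFrame({
        "channel": ["web", "store", "web", "app"],
        "visits": [10, 3, 7, 5],
        "spend": [200.0, 50.0, 120.0, 80.0],
    })

    # Feature engineering: one-hot encode the categorical column
    features = pd.get_dummies(df, columns=["channel"])

    # Dimensionality reduction: project onto 2 principal components
    reduced = PCA(n_components=2).fit_transform(features)
    print(reduced.shape)  # (4, 2)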
17. Analysis vs Reporting:
• Analysis: Involves exploring data to uncover patterns,
trends, correlations, and insights. It focuses on
understanding the underlying factors driving observed
phenomena.
• Reporting: Involves summarizing data in a structured
format for communication purposes. Reports typically
provide descriptive information, such as key performance
indicators (KPIs), metrics, and trends, often using
visualizations or dashboards.
• Depth of Insight: Analysis goes deeper into understanding
the underlying patterns and relationships in the data,
whereas reporting provides a high-level summary of
findings.
• Actionability: Analysis aims to provide actionable insights
that drive decision-making and problem-solving, while
reporting primarily communicates information for
monitoring and review purposes.
18. Modern Data Analytic Tools:
• Open-Source Libraries: Tools like pandas, NumPy, and
scikit-learn for Python, and R for statistical computing,
provide powerful capabilities for data manipulation,
analysis, and machine learning.
• Deep Learning Frameworks: TensorFlow, PyTorch, and
Keras offer libraries and APIs for building and training
deep neural networks, enabling advanced tasks like image
recognition, natural language processing, and
reinforcement learning.
• Data Visualization Tools: Platforms like Tableau, Power BI,
and Plotly enable the creation of interactive dashboards
and visualizations to explore and communicate insights
from data.
• Cloud-Based Platforms: Services such as Google Cloud
Platform (GCP), Amazon Web Services (AWS), and
Microsoft Azure offer scalable infrastructure and managed
services for big data processing, analytics, and machine
learning.
• Business Intelligence (BI) Tools: Solutions like SAS, IBM
Watson Analytics, and Qlik provide comprehensive BI and
analytics capabilities for businesses, including data
integration, reporting, and predictive modeling.
• Big Data Technologies: Technologies like Apache Hadoop,
Spark, and Kafka enable distributed processing of large
datasets and real-time data streams, supporting big data
analytics and data engineering workflows.
