Big Data Components


Introduction to Big Data
Unit 1
P. SRIDEVI
DEPT OF CSE-IT

UNIT 1
Introduction to Big Data Platform
Challenges of Conventional Systems
Intelligent data analysis
Nature of Data
Analytic Processes and Tools
Analysis vs Reporting
What is Big Data?
“Big data refers to huge volumes of data sets
whose size is beyond the ability of typical
database software tools to capture, store,
manage and analyze.”
Steps of big data analytics
Big data platform
• A big data platform is an integrated computing solution that
combines numerous software systems, tools, and hardware for big
data management.
Characteristics of a big data platform
• Big data platform features:
• Ability to accommodate new applications and tools as business needs evolve
• Support for several data formats (numerical, text, multimedia, research, or
sensor data), document types (JSON, XML), and table formats
• Support for both relational and non-relational data
• Ability to accommodate large volumes of continuous streaming (data in
motion) or at-rest data
• A wide variety of conversion tools to transform data to different preferred
formats (e.g., Excel to PDF, PDF to Word, text extraction)
• Capacity to accommodate data arriving at any speed
• Tools for scouring massive data sets (thoroughly searching or exploring the
vast expanse of information)
• The ability for quick deployment
• Tools for data analysis and reporting requirements
Sources of Big data
Big Data technology components
1. Ingestion
Ingestion is the process of bringing data into the data system we are building.
• The ingestion layer is the very first step of pulling in raw data.
• Data comes from a variety of sources:
• internal sources,
• relational databases,
• non-relational databases,
• social media,
• emails,
• phone calls/mobile apps, etc.
Types of ingestion
• There are two kinds of ingestion:
• Batch, in which large groups of data are gathered and delivered together.
• A batch layer (cold path) stores all of the incoming data in its raw form
and performs batch processing on the data. The result of this processing
is stored as a batch view.
• Streaming, which is a continuous flow of data. This is necessary for
real-time data analytics.
• A speed layer (hot path) analyzes data in real time. This layer is designed
for low latency (minimum delay), at the expense of accuracy.
Data sources.
• All big data solutions start with one or more data sources.
• Examples include:
• Relational databases -- application data stores.
• Web server log files -- static files produced by applications.
• IoT devices -- real-time data sources.
Batch processing (open-source software used: Spark)
• Because the data sets are so large, a big data solution must often
process data files using batch jobs to filter, aggregate, and otherwise
prepare the data for analysis.
• Usually these jobs involve reading source files, processing them, and
writing the output to new files.
• Options include running U-SQL (a single query language to process data
in any format) jobs in Azure Data Lake Analytics, in an HDInsight
Hadoop cluster, or using Java, Scala, or Python programs in an
HDInsight Spark cluster.
• Azure HDInsight is a fully managed cloud service that makes it easy,
fast, and cost-effective to process massive amounts of data. It uses
the most popular open-source frameworks such as Hadoop, Spark,
Hive, Kafka, Storm, HBase, Microsoft ML Server, and more.
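As a concrete illustration of such a batch job, here is a minimal PySpark sketch: read source files, filter and aggregate them, and write the batch view out as new files. The paths and column names (amount, region, sale_date) are illustrative, not from the slides.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-prep").getOrCreate()

# Read the raw source files (path and schema are illustrative).
raw = spark.read.csv("/data/raw/sales/*.csv", header=True, inferSchema=True)

# Filter and aggregate to prepare the data for analysis.
daily = (raw
         .filter(F.col("amount") > 0)                  # drop invalid rows
         .groupBy("region", "sale_date")
         .agg(F.sum("amount").alias("total_amount")))

# Write the resulting batch view to new files.
daily.write.mode("overwrite").parquet("/data/curated/daily_sales")
```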
Real-time message ingestion.
• If the solution includes real-time sources, the architecture must include
a way to capture and store real-time messages for stream processing.
• This might be a simple data store, where incoming messages are
dropped into a folder for processing.
• However, many solutions need a message ingestion store to act as a
buffer for messages, and to support scale-out processing, reliable
delivery, and other message queuing semantics.
• This streaming architecture is often referred to as stream buffering.
Options include Azure Event Hubs.
• (Azure Event Hubs is a big data streaming platform and event ingestion
service. It can receive and process millions of events per second. Data
sent to an event hub can be transformed and stored by using any real-
time analytics provider or batching/storage).
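A minimal sketch of sending events into such a message ingestion store, using the azure-eventhub Python SDK (v5). The connection string, hub name, and JSON payload below are placeholders for your own resources.

```python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders: substitute your Event Hubs namespace credentials.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT_HUBS_CONNECTION_STRING>",
    eventhub_name="<EVENT_HUB_NAME>",
)

with producer:
    batch = producer.create_batch()          # respects the hub's size limit
    batch.add(EventData('{"sensor": "s1", "reading": 21.5}'))
    producer.send_batch(batch)               # buffered for stream processing
```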
2. Data storage (data warehouse vs data lake)
• Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats.
• This kind of store is often called a data lake. Options for implementing
this storage include Azure Data Lake Store or blob containers in Azure Storage.
• Storage is where the converted data is stored in a data lake or
warehouse and eventually processed.
• The data lake/warehouse is the most essential component of a big data
ecosystem.
• A data lake should contain only thorough, relevant data, to make insights
as valuable as possible.
• It must be efficient with as little redundancy as possible to allow for
quicker processing.
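One common way to keep the lake efficient, sketched below in PySpark, is to store the curated zone as partitioned Parquet so later queries read only the files they need. The paths and the event_date partition column are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-layout").getOrCreate()

# Land raw events unchanged in the lake's raw zone (paths are illustrative;
# the data is assumed to carry an event_date column).
events = spark.read.json("/lake/raw/events/2024/")

# Store the curated copy partitioned by date, so queries can skip
# irrelevant files instead of scanning the whole lake.
(events.write
       .mode("overwrite")
       .partitionBy("event_date")
       .parquet("/lake/curated/events"))

# Readers prune partitions, touching only the data they need.
jan1 = (spark.read.parquet("/lake/curated/events")
             .where("event_date = '2024-01-01'"))
```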
Azure Data Lake
• Azure Data Lake is a big data solution based on multiple cloud
services in the Microsoft Azure ecosystem.
• It allows organizations to ingest multiple data sets, including
structured, unstructured, and semi-structured data, into an infinitely
scalable data lake, enabling storage, processing, and analytics.
DWH vs Data MART vs Data Lake
• Data warehouses, data lakes, and data marts are different cloud
storage solutions.
• A data warehouse stores data in a structured format. It is a central
repository of pre-processed data for analytics and business
intelligence.
• A data mart is a data warehouse that serves the needs of a specific
business unit, like a company’s finance, marketing, or sales
department.
• A data lake is a central repository for raw and unstructured data.
You can store data first and process it later.
3. Big data analytics

• In the analysis layer, data gets passed through several tools, shaping it
into actionable insights.
• There are four types of analytics on big data :
• Diagnostic: Explains why a problem is happening.
• Descriptive: Describes the current state of a business through
historical data.
• Predictive: Projects future results based on historical data.
• Prescriptive: Takes predictive analytics a step further by recommending
the best actions to take.
Big data solutions typically deal with the following types of workload:

• Batch processing of big data sources at rest (distributed processing)
• Real-time processing of big data in motion (stream processing)
• Interactive exploration of big data (analytics)
• Predictive analytics and machine learning
4. Consumption (end user)
• The final big data component is presenting the information in a
format useful to the end user.
• This can take the form of:
• tables,
• advanced visualizations, and even single numbers if requested.
• The most important thing in this layer is making sure the intent and
meaning of the output is understandable.
BIG DATA MANAGEMENT TOOLS

• These days, organizations are realising the value they get out of big
data analytics, and hence they are deploying big data tools and
processes to bring more efficiency to their work environment
(e.g., Secoda, Collibra).
• Collibra is a data catalog platform and tool that helps organizations
better understand and manage their data assets. Collibra helps
create an inventory of data assets, capture information (metadata)
about them, and govern these assets.
• Secoda is a tool for writing queries to search company data.
Challenges of conventional systems
• Big data is the storage and analysis of large data sets.
• These are complex data sets that can be both structured or unstructured.
• They are so large that it is not possible to work on them with traditional
analytical tools.
• One of the major challenges of conventional systems is the uncertainty of
data management.
• Big data is continuously expanding; new companies and technologies are
being developed every day (e.g., Google, Amazon, Netflix).
• Trusting the quality of the data is difficult, and data security and
privacy are a challenge.
• Conventional systems are not designed to be user-friendly for data
extraction.
• A big challenge for companies is to find out which technology works best
for them without introducing new risks and problems.
• These days, organizations are realising the value they get out of big data
analytics and hence are deploying big data tools and processes to bring
more efficiency to their work environment.
BIG DATA AS A SERVICE

• Big Data has created a demand for scalable, flexible and affordable data
management platforms to meet modern compute requirements.
• Big Data as a Service (BDaaS) integrates many of the functionalities and
benefits of SaaS, IaaS, PaaS and DaaS, and leverages additional resources
in the market for analyzing Big Data.
• Big Data as a Service encompasses the software, data warehousing,
infrastructure and platform service models in order to deliver advanced
analysis of large data sets, generally through a cloud-based network.
• It is a solution-based system designed to provide organizations with the
wide-ranging capabilities to gain insights from data.
Some data analytics tools
• Data analytics tools not only report the results of the data but also explain
why the results occurred to help identify weaknesses, fix potential problem
areas, alert decision-makers to unforeseen events and even forecast future
results based on decisions the company might make.
• R Programming (Leading Analytics Tool in the industry)
• Python
• Excel
• SAS
• Apache Spark
• Splunk
• RapidMiner
• Tableau Public
Orchestration.
• Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data between
multiple sources and sinks, load the processed data into an analytical
data store, or push the results straight to a report or dashboard.
• To automate these workflows, an orchestration technology such as Azure
Data Factory or Apache Oozie (with Sqoop) is used.
• Workflow:
source data -> move data between sources and sinks -> load processed
data for analytics -> display the results on a dashboard.
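The slides name Azure Data Factory and Apache Oozie; as a concrete illustration, here is a minimal sketch of the same four-step workflow in Apache Airflow, a comparable open-source orchestrator not covered in these slides. The DAG id, schedule, and task bodies are illustrative placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder steps; real tasks would trigger Spark jobs, copy activities, etc.
def extract():  print("pull source data")
def move():     print("move data between sources and sinks")
def load():     print("load processed data into the analytical store")
def publish():  print("refresh the report/dashboard")

with DAG(dag_id="big_data_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_move    = PythonOperator(task_id="move", python_callable=move)
    t_load    = PythonOperator(task_id="load", python_callable=load)
    t_publish = PythonOperator(task_id="publish", python_callable=publish)

    # Encode the workflow order: source -> move -> load -> display.
    t_extract >> t_move >> t_load >> t_publish
```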
A big data architecture is designed to handle the
ingestion, processing, and analysis of data that is too
large or complex for traditional database systems.
Analysis and reporting.
• The goal of most big data solutions is to provide insights into the data
through analysis and reporting. To empower users to analyze the data,
the architecture may include a data modelling layer, such as a
multidimensional OLAP cube or tabular data model in Azure Analysis
Services.
• It might also support self-service BI, using the modelling and
visualization technologies in Microsoft Power BI or Microsoft Excel.
• Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts. For these scenarios,
many Azure services support analytical notebooks, such as Jupyter,
enabling these users to leverage their existing skills with Python or R.
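For instance, a minimal exploratory snippet of the kind a data scientist might run in such a notebook; the file path and column names are illustrative placeholders (reusing the curated output from the batch sketch above).

```python
import pandas as pd

# Load a curated extract for interactive exploration (illustrative path).
df = pd.read_parquet("/data/curated/daily_sales")

print(df.describe())                 # summary statistics per column
print(df.groupby("region")["total_amount"]
        .sum()
        .sort_values(ascending=False))   # top regions by total sales
```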
Reporting vs Analytics

| Reporting | Analysis |
| --- | --- |
| Uses a DWH (data warehouse) | Uses a data lake |
| The DWH contains structured data | The data lake contains unstructured data |
| Collects data from many sources (flat files, spreadsheets, DBs, apps, etc.) | Collects all kinds of data (structured and unstructured) in one place |
| Designed for report generation | Designed for big data analytics |
| Reporting is used to provide facts, which stakeholders can use for presentations | Analytics offers pre-analyzed conclusions that a company can use to solve problems and improve its performance |
| Reporting presents the actual data to end users, after collecting, sorting and summarizing it to make it easy to understand | Analytics doesn't present the data but instead draws information from the available data and uses it to generate insights, forecasts and recommended actions |
| Mostly done by automated tools | Requires skill sets |


Stream processing.
• After capturing real-time messages, the solution must process them by
filtering, aggregating, and otherwise preparing the data for analysis.
• The processed stream data is then written to an output sink. Azure
Stream Analytics provides a managed stream processing service based
on perpetually running SQL queries that operate on unbounded streams.
• Open-source Apache streaming technologies like Storm and Spark
Streaming in an HDInsight cluster can also be used.
• Azure HDInsight is a service offered by Microsoft, that enables us to use
open source frameworks for big data analytics.
• Azure HDInsight allows the use of frameworks like Hadoop, Apache
Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm, R, etc., for
processing large volumes of data.
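A minimal Spark Structured Streaming sketch of this pattern: read an unbounded stream, aggregate it over time windows, and write the result to a sink. The Kafka broker address, topic name, and console sink are placeholders, and the Spark-Kafka connector package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-prep").getOrCreate()

# Read an unbounded stream (a Kafka topic here; broker/topic are placeholders).
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "sensor-events")
          .load())

# Filter/shape the raw messages, then aggregate over 1-minute windows.
readings = stream.selectExpr("CAST(value AS STRING) AS value", "timestamp")
per_minute = (readings
              .groupBy(F.window("timestamp", "1 minute"))
              .count())

# Write the processed stream to an output sink (console as a stand-in).
query = (per_minute.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```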
Intelligent data analysis(IDA)
Data Intelligence Definition:
• Data intelligence is the use of various tools and methods to analyze
and transform data into information from which valuable insight can
be drawn.
• Data intelligence refers to the practice of using artificial intelligence
and machine learning tools to analyze and transform massive datasets
into intelligent data insights, which can then be used to improve an
organization's services and revenues.
• The application of data intelligence tools and techniques can help
decision makers develop a better understanding of collected
information with the goal of developing better business processes.
What is Intelligent Data Analysis?
• Intelligent data analysis refers to the use of analysis, classification, conversion,
extraction, organization, and reasoning methods to extract useful knowledge
from data.
• This data analytics intelligence process generally consists of:
1. the data preparation stage,
2. the data mining stage,
3. the result validation stage, and
4. the result explanation stage.
• Data preparation involves the integration of the required data into a
dataset that will be used for data mining.
• Data mining involves examining large databases in order to generate new
information and patterns.
• Result validation involves verifying the accuracy of the patterns produced
by data mining algorithms.
• Result explanation involves the intuitive communication of results, for
example through visualizations and reports. (A toy sketch of all four
stages follows below.)
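A toy end-to-end sketch of the four IDA stages using scikit-learn, standing in for whatever mining tool is actually used; the bundled dataset and the decision-tree model are illustrative choices, not from the slides.

```python
from sklearn.datasets import load_breast_cancer   # stand-in dataset
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# 1. Data preparation: integrate and scale data into a modelling-ready set.
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(max_depth=3))

# 2. Data mining: fit the model to find patterns in the data.
model.fit(X, y)

# 3. Result validation: verify accuracy with cross-validation.
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f}")

# 4. Result explanation: communicate what the model learned.
tree = model.named_steps["decisiontreeclassifier"]
print("Most informative feature index:", tree.feature_importances_.argmax())
```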
Five major components of IDA
1. descriptive data,
2. prescriptive data,
3. diagnostic data,
4. decisive data, and
5. predictive data.
These disciplines focus on understanding data, developing alternative
knowledge, resolving issues, and analyzing historical data to predict
future trends.
IDA
• Some industries with the greatest need for data intelligence include:
• 1) cyber security,
• 2) finance,
• 3) health,
• 4) insurance, and
• 5) law enforcement.
Intelligent data capture technology is a valuable application in these
industries for transforming print documents or images into meaningful
data.
Big Data Tools & Software for
Analytics
• Best Big Data Tools & Software for Analytics 2022
• Tableau.
• Apache Hadoop.
• Apache Spark: a unified analytics engine for large-scale data processing.
• Zoho Analytics.
• MongoDB.
• Xplenty.
Modern data analytic tools
• These days, organizations are realising the value they get out of big data
analytics and hence they are deploying big data tools and processes to bring
more efficiency to their work environment.
• Many big data tools and processes are being utilised by companies these
days in the processes of discovering insights and supporting decision making.
• Data Analytics tools are types of application software that retrieve data from
one or more systems and combine it in a repository, such as a data
warehouse, to be reviewed and analysed.
• Most organizations use more than one analytics tool including spreadsheets
with statistical functions, statistical software packages, data mining tools,
and predictive modelling tools.
• Together, these Data Analytics Tools give the organization a complete
overview of the company to provide key insights and understanding of the
market/business so smarter decisions may be made.
Microsoft Azure
• Microsoft Azure provides robust services for analyzing big data. One
of the most effective ways is to store your data in Azure Data Lake
Storage Gen2 and then process it using Spark on Azure Databricks.
• Azure Stream Analytics (ASA) is Microsoft’s service for real-time data
analytics.
• Ex: stock trading analysis,
• fraud detection,
• embedded sensor analysis,
• and web clickstream analytics.
ASA uses Stream Analytics Query Language, which is a variant of T-SQL.
That means anyone who knows SQL will have a fairly easy time learning
how to write jobs for Stream Analytics.
Amazon S3 to store data
Analytics
• Analyze the consumption patterns of electricity. Draw a histogram with
hourly intervals on the x-axis and the number of units consumed on the y-axis.
• Fit a linear regression model relating outdoor temperature to electricity
consumption.
• Fit an autoregressive model on the time-series data.
(A Python sketch of all three tasks follows below.)
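A minimal Python sketch of the three exercises. It uses synthetic hourly data so it runs self-contained; a real study would load consumption records from S3, and the column names, seasonal temperature curve, and lag order are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.ar_model import AutoReg

# Synthetic hourly consumption data (a real study would load it from S3).
rng = np.random.default_rng(0)
hours = pd.date_range("2024-01-01", periods=24 * 30, freq="h")
temp = 20 + 8 * np.sin(np.arange(len(hours)) * 2 * np.pi / 24)
units = 2.0 + 0.15 * temp + rng.normal(0, 0.3, len(hours))
df = pd.DataFrame({"hour": hours.hour, "temp": temp, "units": units})

# Histogram: hourly interval on the x-axis, units consumed on the y-axis.
df.groupby("hour")["units"].mean().plot(kind="bar")
plt.xlabel("Hour of day")
plt.ylabel("Avg units consumed")
plt.show()

# Linear regression: outdoor temperature vs electricity consumption.
lr = LinearRegression().fit(df[["temp"]], df["units"])
print("Units per extra degree:", lr.coef_[0])

# Autoregressive model on the consumption time series (24-hour lags).
ar = AutoReg(df["units"], lags=24).fit()
print(ar.forecast(steps=24))   # next day's hourly forecast
```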
