Big Data Components
Big Data Unit-1
P. SRIDEVI
Dept. of CSE-IT
UNIT 1
Introduction to Big Data Platform
Challenges of Conventional Systems
Intelligent data analysis
Nature of Data
Analytic Processes and Tools
Analysis vs Reporting
What is Big Data?
"Big data refers to huge volumes of data sets whose size is beyond the
ability of traditional database software tools to capture, store,
manage, and analyze."
Steps of big data analytics
Big data platform
• A big data platform is an integrated computing solution that
combines numerous software systems, tools, and hardware for big
data management.
Characteristics of a big data platform
• Big data platform features:
• Ability to accommodate new applications and tools as business needs
evolve
• Support for several data formats (numerical, text, multimedia, research,
or sensor data), document types (JSON, XML), and table formats
• Support for relational and non-relational data
• Ability to accommodate large volumes of continuously streaming (data
in flow) or at-rest data
• A wide variety of conversion tools to transform data into different
preferred formats (Excel to PDF, PDF to Word, text extraction)
• Capacity to accommodate data arriving at any speed
• Tools for scouring the data (thoroughly searching or exploring the vast
expanse of information) through massive data sets
• The ability for quick deployment
• Tools for data analysis and reporting requirements
Sources of Big data
Big Data technology components
1. Ingestion
Ingestion is the process of bringing data into the data system we are building.
• The ingestion layer is the very first step: pulling in raw data.
• Data comes from various internal and external sources:
• relational databases,
• non-relational databases,
• social media,
• emails,
• phone calls / mobile apps, etc.
Types of ingestion
• There are two kinds of ingestion:
• Batch, in which large groups of data are gathered and delivered together.
• A batch layer (cold path) stores all of the incoming data in its raw form
and performs batch processing on it. The result of this processing is
stored as a batch view.
• Streaming, which is a continuous flow of data. This is necessary for
real-time data analytics.
• A speed layer (hot path) analyzes data in real time. This layer is designed
for low latency (minimum delay), at the expense of accuracy.
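The two paths above can be sketched in a few lines of plain Python. This is a scaled-down analogue, not a real framework: the batch layer keeps every raw event and recomputes its view in bulk, while the speed layer maintains a running result per event. The event shape ({"value": ...}) is an assumption for illustration.

```python
class BatchLayer:
    """Cold path: store everything raw, process in bulk later."""
    def __init__(self):
        self.raw = []          # master dataset, append-only

    def ingest(self, event):
        self.raw.append(event)

    def batch_view(self):
        # Accurate but delayed: computed over the full raw dataset.
        return sum(e["value"] for e in self.raw)


class SpeedLayer:
    """Hot path: low-latency running result, updated per event."""
    def __init__(self):
        self.running_total = 0

    def ingest(self, event):
        self.running_total += event["value"]   # available immediately


events = [{"value": v} for v in (3, 5, 7)]
batch, speed = BatchLayer(), SpeedLayer()
for e in events:
    batch.ingest(e)    # cold path: accumulate raw data
    speed.ingest(e)    # hot path: update the live view

print(batch.batch_view())     # 15, computed in one batch pass
print(speed.running_total)    # 15, maintained incrementally
```

In a real system the two layers trade accuracy for latency: the speed layer's view can drift (late or duplicate events) and is periodically replaced by the batch view.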
Data sources.
• All big data solutions start with one or more data sources.
• Examples include:
• Relational databases -- application data stores.
• Web server log files -- static files produced by applications.
• IoT devices -- real-time data sources.
Batch processing (commonly used open-source software: Spark)
• Because the data sets are so large, a big data solution must often
process data files using batch jobs to filter, aggregate, and otherwise
prepare the data for analysis.
• Usually these jobs involve reading source files, processing them, and
writing the output to new files.
• Options include running U-SQL jobs (a single query language for
processing data in any format) in Azure Data Lake Analytics or an
HDInsight Hadoop cluster, or using Java, Scala, or Python programs in
an HDInsight Spark cluster.
• Azure HDInsight is a fully managed cloud service that makes it easy,
fast, and cost-effective to process massive amounts of data. It uses
the most popular open-source frameworks, such as Hadoop, Spark,
Hive, Kafka, Storm, HBase, Microsoft ML Server, and more.
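The read-process-write pattern above can be shown with a toy batch job in plain Python. The CSV content, column names, and in-memory "files" are all assumptions for illustration; a production deployment would express the same filter/aggregate steps as a Spark or U-SQL job over distributed storage.

```python
import csv
import io

# Stand-in for a source file landed in the data lake (assumed sample data).
source = io.StringIO("page,hits\nhome,3\nabout,1\nhome,4\n")

# Read the source file, filter out empty rows, and aggregate hits per page.
totals = {}
for row in csv.DictReader(source):
    hits = int(row["hits"])
    if hits > 0:                                    # filter step
        totals[row["page"]] = totals.get(row["page"], 0) + hits  # aggregate

# Write the batch view to a new "file" for downstream analysis.
output = io.StringIO()
csv.writer(output).writerows(sorted(totals.items()))

print(totals)   # {'home': 7, 'about': 1}
```

The same three stages (read, transform, write) are what a batch framework parallelizes across many machines; the logic per record stays this simple.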
Real-time message ingestion.
• If the solution includes real-time sources, the architecture must include
a way to capture and store real-time messages for stream processing.
• This might be a simple data store, where incoming messages are
dropped into a folder for processing.
• However, many solutions need a message ingestion store to act as a
buffer for messages, and to support scale-out processing, reliable
delivery, and other message queuing semantics.
• This streaming architecture is often referred to as stream buffering.
Options include Azure Event Hubs.
• (Azure Event Hubs is a big data streaming platform and event ingestion
service. It can receive and process millions of events per second. Data
sent to an event hub can be transformed and stored by using any real-
time analytics provider or batching/storage).
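As a single-process analogue of such a message ingestion store (Event Hubs itself is a distributed, partitioned service; this sketch only mimics the buffering behavior), a bounded queue can sit between producers and a stream processor. The event fields are illustrative.

```python
import queue

# Bounded ingestion buffer: producers block (back-pressure) when it is full.
buffer = queue.Queue(maxsize=1000)

def produce(event):
    """Producer side: drop an event into the buffer."""
    buffer.put(event)           # a real service partitions and replicates here

def consume_all():
    """Consumer side: drain buffered events for stream processing."""
    drained = []
    while not buffer.empty():
        drained.append(buffer.get())
    return drained

for i in range(3):
    produce({"sensor": "s1", "reading": i})

drained = consume_all()
print(drained)   # events delivered in arrival order
```

The buffer decouples ingestion speed from processing speed, which is what enables scale-out processing and reliable delivery in the real services.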
2. Data storage (data warehouse vs. data lake)
• Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats.
• This kind of store is often called a data lake. Options for implementing
this storage include Azure Data Lake Store in Azure Storage.
• Storage is where the converted data is stored in a data lake or
warehouse and eventually processed.
• The data lake/warehouse is the most essential component of a big data
ecosystem.
• A data lake should contain only complete, relevant data, so that the
insights drawn from it are as valuable as possible.
• It must be efficient, with as little redundancy as possible, to allow for
quicker processing.
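A common data-lake layout (a convention, not a fixed standard) appends raw records under date-partitioned paths, so batch jobs can later read one partition at a time. The sketch below uses a temporary local directory as a stand-in for a lake store such as Azure Data Lake Store; the path scheme and field names are assumptions.

```python
import datetime
import json
import pathlib
import tempfile

lake_root = pathlib.Path(tempfile.mkdtemp())   # stand-in for the lake store

def land(record, event_date):
    """Append a raw record under raw/year=YYYY/month=MM/day=DD/."""
    part = (lake_root / "raw"
            / f"year={event_date.year}"
            / f"month={event_date.month:02d}"
            / f"day={event_date.day:02d}")
    part.mkdir(parents=True, exist_ok=True)
    with open(part / "part-0.json", "a") as f:
        f.write(json.dumps(record) + "\n")     # raw, schema applied on read
    return part

d = datetime.date(2024, 1, 15)
part = land({"sensor": "s1", "reading": 7}, d)
print(part.relative_to(lake_root).as_posix())  # raw/year=2024/month=01/day=15
```

Partitioning by ingestion date keeps redundancy low and lets a downstream job scan only the days it needs.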
Azure Data Lake
• Azure Data Lake is a big data solution based on multiple cloud
services in the Microsoft Azure ecosystem.
• It allows organizations to ingest multiple data sets, including
structured, unstructured, and semi-structured data, into an infinitely
scalable data lake, enabling storage, processing, and analytics.
DWH vs Data MART vs Data Lake
• Data warehouses, data lakes, and data marts are different cloud
storage solutions.
• A data warehouse stores data in a structured format. It is a central
repository of pre-processed data for analytics and business
intelligence.
• A data mart is a data warehouse that serves the needs of a specific
business unit, like a company’s finance, marketing, or sales
department.
• A data lake is a central repository for raw and unstructured data. You
can store data first and process it later.
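The warehouse/lake distinction above boils down to when the schema is applied. A toy contrast, with an invented two-field schema purely for illustration: the warehouse validates on write (schema-on-write), the lake stores raw bytes and interprets them only when read (schema-on-read).

```python
import json

# Warehouse style: schema-on-write -- reject records that don't fit.
SCHEMA = {"order_id": int, "amount": float}

def warehouse_insert(table, record):
    for field, ftype in SCHEMA.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"schema violation on {field!r}")
    table.append(record)

# Lake style: schema-on-read -- store anything raw, parse when queried.
def lake_store(blobs, raw_bytes):
    blobs.append(raw_bytes)

def lake_read(blobs):
    return [json.loads(b) for b in blobs]

table, blobs = [], []
warehouse_insert(table, {"order_id": 1, "amount": 9.5})
lake_store(blobs, b'{"order_id": 2, "note": "no schema needed"}')

print(table)             # structured, validated rows
print(lake_read(blobs))  # raw data interpreted only at read time
```

This is why a warehouse holds pre-processed data ready for BI, while a lake can accept any record now and defer the "store first, process later" work.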
3. Big data analytics
• In the analysis layer, data is passed through several tools that shape it
into actionable insights.
• There are four types of analytics on big data:
• Descriptive: Describes the current state of a business through
historical data.
• Diagnostic: Explains why a problem is happening.
• Predictive: Projects future results based on historical data.
• Prescriptive: Takes predictive analytics a step further by recommending
the best future actions.
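Descriptive and predictive analytics can be contrasted on one toy series. The sales figures are invented, and the linear-trend "model" is deliberately simplistic; it only illustrates that descriptive analytics summarizes history while predictive analytics extrapolates from it.

```python
monthly_sales = [100, 110, 120, 130]   # assumed historical data

# Descriptive: what happened? Summarize the history.
average = sum(monthly_sales) / len(monthly_sales)

# Predictive: what comes next, assuming the recent trend continues?
trend = monthly_sales[-1] - monthly_sales[-2]
forecast = monthly_sales[-1] + trend

print(average)    # 115.0
print(forecast)   # 140
```

Diagnostic and prescriptive analytics build on these: diagnostic asks *why* the trend is +10/month, and prescriptive recommends an action given the forecast.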
Big data tools (e.g. Secoda, Collibra)
• These days, organizations are realising the value they get out of big
data analytics, and hence are deploying big data tools and processes
to bring more efficiency to their work environment.
• Collibra is a data catalog platform and tool that helps organizations
better understand and manage their data assets. Collibra helps create
an inventory of data assets, capture information (metadata) about
them, and govern these assets.
• Secoda is a tool for writing queries to search company data.
Challenges of conventional systems
• Big data is the storage and analysis of large data sets.
• These are complex data sets that can be structured or unstructured.
• They are so large that it is not possible to work on them with traditional
analytical tools.
• One of the major challenges for conventional systems is the uncertainty
of data management.
• Big data is continuously expanding; new companies and technologies
are being developed every day (Google, Amazon, Netflix).
• Trusting the quality of data: data security and privacy are a challenge.
• Conventional systems are not designed to be user friendly for data
extraction.
• A big challenge for companies is to find out which technology works
best for them without introducing new risks and problems.
BIG DATA AS A SERVICE
• Big Data has created a demand for scalable, flexible and affordable data
management platforms to meet modern compute requirements.
• Big Data as a Service (BDaaS) integrates many of the functionalities and
benefits of SaaS, IaaS, PaaS and DaaS, and leverages additional resources
in the market for analyzing Big Data.
• Big Data as a Service encompasses the software, data warehousing,
infrastructure and platform service models in order to deliver advanced
analysis of large data sets, generally through a cloud-based network.
• It is a solution-based system designed to provide organizations with the
wide-ranging capabilities to gain insights from data.
Some data analytics tools
• Data analytics tools not only report the results of the data but also explain
why the results occurred to help identify weaknesses, fix potential problem
areas, alert decision-makers to unforeseen events and even forecast future
results based on decisions the company might make.
• R Programming (Leading Analytics Tool in the industry)
• Python
• Excel
• SAS
• Apache Spark
• Splunk
• RapidMiner
• Tableau Public
Orchestration.
• Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data
between multiple sources and sinks, load the processed data into an
analytical data store, or push the results straight to a report or
dashboard.
• To automate these workflows, an orchestration technology such as
Azure Data Factory or Apache Oozie (with Sqoop) is used.
• Workflow:
source data -> move data between sources and sinks -> load
processed data for analytics -> display the results on a dashboard.
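The workflow above can be sketched as an ordered pipeline of steps in plain Python. The function names and data are invented for illustration; an orchestrator like Azure Data Factory or Oozie would express the same dependency chain declaratively, then schedule, monitor, and retry each step.

```python
def extract():
    """Step 1: pull source data (assumed sample values)."""
    return [1, 2, 3, 4]

def transform(rows):
    """Step 2: move/shape data between source and sink."""
    return [r * 10 for r in rows]

def load(rows):
    """Step 3: load processed data into the analytical store."""
    return {"analytics_table": rows}

def report(store):
    """Step 4: surface a result for the dashboard (stand-in)."""
    return f"total={sum(store['analytics_table'])}"

# An orchestrator runs these steps in dependency order; composing the
# functions directly makes that order explicit.
pipeline = report(load(transform(extract())))
print(pipeline)   # total=100
```

Each step's output is the next step's input, which is exactly the source -> move -> load -> display chain in the workflow above.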
A big data architecture is designed to handle the
ingestion, processing, and analysis of data that is too
large or complex for traditional database systems.
Analysis and reporting.
• The goal of most big data solutions is to provide insights into the data
through analysis and reporting. To empower users to analyze the data,
the architecture may include a data modelling layer, such as a
multidimensional OLAP cube or tabular data model in Azure Analysis
Services.
• It might also support self-service BI, using the modelling and
visualization technologies in Microsoft Power BI or Microsoft Excel.
• Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts. For these scenarios,
many Azure services support analytical notebooks, such as Jupyter,
enabling these users to leverage their existing skills with Python or R.
Reporting vs Analytics

Reporting: presents the actual data to end-users after collecting,
sorting, and summarizing it, to make it easy to understand.

Analytics: does not present the data itself but instead draws
information from the available data and uses it to generate insights,
forecasts, and recommended actions.