Big Data Architecture

This document describes the architecture of big data systems. It discusses the different layers, including data sources, ingestion, storage, analysis and consumption. It also covers cross-layer operations such as connecting data sources, governance, systems management and quality of service. Finally, it explains the Lambda architecture, which combines batch and real-time processing for queries, and the related Kappa architecture.
• The architecture of Big Data consists of methods and mechanisms for collecting and storing data, securing it, processing it, and then converting it into database structures and file systems.
• Analysis tools help us analyze the collected data and then make intelligent decisions on the basis of it.
• Hence, the greater the amount of data collected and analyzed, the better the decision-making ability of the machine or device.

Big Data Architecture: A very generic design
• The architecture of Big Data consists of multiple layers:
• Layer 1. Big data sources layer
• Layer 2. Data ingestion layer
• Layer 3. Data messaging and storage layer
• Layer 4. Analysis layer
• Layer 5. Consumption layer
• These logical layers are as follows:

• Big data sources layer: The data that comes into a big data system has many different sources. These sources can be company servers, third-party data providers and various sensors related to companies. Big Data can take in and store data in two modes, namely real-time mode and batch mode. Examples of data sources include applications and software such as MS Office docs, ERP systems, Relational Database Management Systems (RDBMS), mobile devices, social media, sensors and data warehouses.

Data Ingestion layer
• Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. To ingest something is to take something in or absorb something.
• Data can be streamed in real time or ingested in batches. In real-time data ingestion, each data item is imported as the source emits it. When data is ingested in batches, data items are imported in discrete chunks at periodic intervals of time (both modes are contrasted in the sketch under cross-layer operation 1 below). The first step in an effective data ingestion process is to prioritize the data sources. Individual files must be validated and data items routed to their correct destinations.

• Data messaging and storage layer: All the data from the various sources is received by this layer. If the data received is unstructured, or is not in a format the analytic tools can understand, this layer converts it into a format the analysis tools can read. In Big Data, unstructured data is stored in specialized file systems such as the Hadoop Distributed File System (HDFS) or in a NoSQL database, whereas structured data is stored in an RDBMS.

• Analysis layer: This layer deals with the analysis of the stored data. Here the stored data is analyzed to extract trends and business intelligence from it. Many different sorts of tools operate in a big data environment. For the analysis of structured data, techniques such as sampling are used, whereas unstructured data requires advanced, specialized analytics toolsets.

• Consumption layer: All the analyzed data is received by this layer. The task of this layer is to present the analyzed data as output to the desired receiver. There are various types of outputs, such as applications, business processes and human viewers.

• Four types of processes operate in between these logical layers. These cross-layer operations are:
• connecting to data sources,
• governance,
• systems management and
• quality of service (QoS).

Cross-layer operations
• 1. Connecting to data sources: Data arrives in a big data system at a very fast rate. In order to quickly receive and analyze this data, we need connections that can support these actions at that rate. For that, the architecture requires adapters and connectors that can connect the data sources to the storage system, protocols and networks.
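To make the ingestion modes and the connector idea concrete, here is a minimal, illustrative sketch in Python. All names (store, batch_ingest, stream_ingest, the destination labels) are hypothetical stand-ins for real adapters and connectors such as a Kafka consumer or an HDFS writer.

```python
from typing import Iterable, Iterator

def store(records: list, destination: str) -> None:
    # Stand-in for a storage-layer write (e.g., HDFS, NoSQL, RDBMS).
    print(f"wrote {len(records)} record(s) to {destination}")

def batch_ingest(source: Iterable[str], chunk_size: int = 100) -> None:
    """Batch mode: import data items in discrete chunks at intervals."""
    chunk = []
    for record in source:
        chunk.append(record)
        if len(chunk) == chunk_size:
            store(chunk, "batch_store")
            chunk = []
    if chunk:  # flush the final partial chunk
        store(chunk, "batch_store")

def stream_ingest(source: Iterator[str]) -> None:
    """Real-time mode: import each data item as the source emits it."""
    for record in source:
        store([record], "stream_store")

if __name__ == "__main__":
    events = (f"event-{i}" for i in range(250))
    batch_ingest(events, chunk_size=100)  # writes chunks of 100, 100, 50
    stream_ingest(iter(["late-event-1", "late-event-2"]))
```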
Cross-layer operations
• 2. Governing big data: The architecture of Big Data provides for both the privacy and the security of the data that it receives and analyzes.
• Organizations using Big Data have a choice: use a security tool of their own on the analytics storage system, invest in specialized software to keep their Hadoop environment safe and secure, or sign an agreement with their cloud Hadoop provider that provides service-level security.
• The policies that deal with the protection and security of data should cover the whole process, from data ingestion through analysis to the deletion or archiving of data.

Cross-layer operations
• 3. Managing systems: The architecture of Big Data is a large-scale, distributed cluster with highly scalable performance and capacity.
• The health of the system should be checked regularly and continuously with the help of central management system consoles.
• If the consumer is using the cloud as an environment for Big Data, they should establish and monitor strong Service Level Agreements (SLAs) with their cloud provider.

Cross-layer operations
• 4. Quality of service: Quality of service (QoS) is an important aspect of Big Data. It is the framework that helps define the quality of data, security and compliance policies, the sizes and frequency of incoming data sets, and the filtering of data.

Lambda Architecture
• The Lambda Architecture is a deployment model for data processing that organizations use to combine a traditional batch pipeline with a fast real-time stream pipeline for data access.
• It is a common architecture model in IT and development organizations' toolkits as businesses strive to become more data-driven and event-driven in the face of massive volumes of rapidly generated data, often referred to as "big data."
• The Lambda Architecture contains both a traditional batch data pipeline and a fast streaming pipeline for real-time data, as well as a serving layer for responding to queries.
• Five main components of the Lambda Architecture:
• New Data. This component represents the new data coming in from a variety of sources, which can then be included in the Lambda Architecture for analysis.
• This component is oftentimes a streaming source like Apache Kafka, which is not the original data source but an intermediary store that can hold data in order to serve both the batch layer and the speed layer of the Lambda Architecture.
• The data is delivered simultaneously to both the batch layer and the speed layer to enable a parallel indexing effort, as sketched below.
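The following toy sketch (plain Python, not a real Kafka client) shows the dual-delivery idea: every incoming record is handed to both the batch layer and the speed layer. The batch_log and speed_index structures are hypothetical stand-ins for the two layers.

```python
batch_log = []    # batch layer: durable, append-only history
speed_index = {}  # speed layer: small index of the newest data

def deliver(record: dict) -> None:
    # Simultaneous delivery: the same record feeds both layers,
    # enabling the parallel indexing effort described above.
    batch_log.append(record)            # -> batch layer
    speed_index[record["id"]] = record  # -> speed layer

for i in range(3):
    deliver({"id": i, "value": f"reading-{i}"})

print(len(batch_log), "records in the batch log")
print(sorted(speed_index))  # ids already visible in near real time
```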
• Batch Layer. This component saves all data coming
into the system as batch views in preparation for indexing.
• The input data is saved in a model that looks like a series of changes/updates made to a system of record, similar to the output of a change data capture (CDC) system. Oftentimes this is simply a file in the comma-separated values (CSV) format.
• The data is treated as immutable and append-only to ensure a trusted historical record of all incoming data. A technology like Apache Hadoop is often used as a system for ingesting the data as well as storing it in a cost-effective way. A minimal append-only sketch follows.
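Here is a minimal sketch of the immutable, append-only storage model using a local CSV file. The file name and record fields are hypothetical; a real deployment would typically write to HDFS rather than a local file.

```python
import csv

BATCH_FILE = "batch_views.csv"  # hypothetical stand-in for a file on HDFS

def append_change(entity_id: str, field: str, new_value: str) -> None:
    # Append-only: existing rows are never updated or deleted, so the
    # file remains a trusted historical record of CDC-style changes.
    with open(BATCH_FILE, "a", newline="") as f:
        csv.writer(f).writerow([entity_id, field, new_value])

append_change("user-42", "email", "old@example.com")
append_change("user-42", "email", "new@example.com")  # a later update

# Current state is recovered by replaying the full change history.
state = {}
with open(BATCH_FILE) as f:
    for entity, field, value in csv.reader(f):
        state.setdefault(entity, {})[field] = value
print(state)  # {'user-42': {'email': 'new@example.com'}}
```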
• Serving Layer. This layer incrementally indexes
the latest batch views to make them queryable by end users. This layer can also reindex all data to fix a coding bug or to create different indexes for different use cases.
• The key requirement in the serving layer is that the processing is done in an extremely parallelized way to minimize the time to index the data set. While an indexing job runs, newly arriving data is queued up for indexing in the next indexing job.
• Speed Layer. This layer complements the serving layer by indexing the most recently added data not yet fully indexed by the serving layer. This includes the data that the serving layer is currently indexing as well as new data that arrived after the current indexing job started. Since there is an expected lag between the time the latest data is added to the system and the time it is available for querying (due to the time it takes to perform the batch indexing work), it is up to the speed layer to index the latest data and narrow this gap.
• This layer typically leverages stream processing software to index the incoming data in near real time, minimizing the latency before the data becomes available for querying (a small stream-indexing sketch follows). When the Lambda Architecture was first introduced, Apache Storm was a leading stream processing engine used in deployments, but other technologies have since gained popularity as candidates for this component (like Hazelcast Jet, Apache Flink, and Apache Spark Streaming).
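As a minimal illustration of speed-layer stream indexing, the sketch below applies an incremental, per-event update so a running aggregate is queryable immediately, without waiting for the next batch indexing job. The event shape and index are hypothetical; real deployments would use an engine such as Flink or Spark Streaming.

```python
from collections import defaultdict

page_views = defaultdict(int)  # speed-layer index: view counts per page

def on_event(event: dict) -> None:
    # Incremental update: O(1) work per event, queryable immediately.
    page_views[event["page"]] += 1

for e in [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]:
    on_event(e)

print(page_views["/home"])  # 2 -- available without any batch job
```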
• Query. This component is responsible for
submitting end user queries to both the serving layer and the speed layer and consolidating the results (a query-and-hand-off sketch appears after the benefits list below).
• This gives end users a complete query over all data, including the most recently added data, providing a near-real-time analytics system.

Working of Lambda Architecture
• Data is indexed simultaneously by both the serving layer and the speed layer.
• The batch/serving layers continue to index incoming data in batches. Since the batch indexing takes time, the speed layer complements the batch/serving layers by indexing all the new, unindexed data in near real time.
• This gives you a large and consistent view of data in the batch/serving layers that can be recreated at any time, along with a smaller index that contains the most recent data.
• Once a batch indexing job completes, the newly batch-indexed data is available for querying, so the speed layer's copy of the same data/indexes is no longer needed and is therefore deleted from the speed layer.
• The serving layer then begins indexing the latest data in the system that it has not yet indexed, data which has already been indexed by the speed layer (and so is available for querying at the speed layer).
• This ongoing hand-off between the speed layer and the batch/serving layers ensures that all data is ready for querying and that the latency of data availability is low. When the serving layer completes a job, it moves on to the next batch, and the speed layer discards its copy of the data that the serving layer just indexed.

Benefits of the Lambda Architecture
• Reduced latency: The speed layer uses stream processing technologies to immediately index recent data that is not yet queryable in the batch/serving layers, narrowing the time window of unanalyzable data.
• Data consistency: The indexing process ensures the data reflects the latest state in both the batch and speed layers.
• Scalability: The Lambda Architecture is based on distributed, scale-out technologies that can be expanded by simply adding more nodes.
• Fault tolerance: The Lambda Architecture is based on distributed systems that support fault tolerance, so should a hardware failure occur, other nodes are available to continue the workload.
• Human fault tolerance: If there are any bugs in the indexing code or any omissions, the code can be updated and then rerun to reindex all data.
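The following sketch ties the query component and the hand-off together: queries consult both indexes, and once a batch job completes, the speed layer discards its now-redundant copy. All structures are hypothetical stand-ins for real serving/speed indexes.

```python
serving_index = {"k1": "v1-batch"}  # older, batch-indexed data
speed_index = {"k2": "v2-recent"}   # newest, not yet batch-indexed

def query(key: str):
    # Consolidate results: prefer the speed layer for the newest data,
    # fall back to the serving layer for historical data.
    return speed_index.get(key, serving_index.get(key))

print(query("k1"), query("k2"))  # complete view across both layers

def complete_batch_job() -> None:
    # After batch indexing finishes, the speed layer's copy of that
    # data is redundant and is deleted.
    serving_index.update(speed_index)
    speed_index.clear()

complete_batch_job()
print(query("k2"))  # "v2-recent", now served by the serving layer
```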
Kappa Architecture
• The Kappa Architecture is a software architecture used for processing streaming data. The main premise behind the Kappa Architecture is that you can perform both real-time and batch processing, especially for analytics, with a single technology stack.
• It is based on a streaming architecture in which an incoming series of data is first stored in a messaging engine like Apache Kafka.
• From there, a stream processing engine reads the data, transforms it into an analyzable format, and then stores it in an analytics database for end users to query.
• The Kappa Architecture is useful for on-demand analytics.
• The Kappa Architecture is considered a simpler alternative to the Lambda Architecture, as it uses the same technology stack to handle both real-time stream processing and historical batch processing.
• A streaming architecture is a defined set of technologies that work together to handle stream processing, which is the practice of taking action on a series of data at the time the data is created.
• Streaming layer: The streaming layer delivers low-latency, near-real-time results. It uses incremental algorithms to perform updates, which saves time but sacrifices accuracy.
• Serving layer: The serving layer serves the data computed by the streaming layer. A minimal Kappa-style sketch follows.
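To make the single-stack premise concrete, here is a minimal Kappa-style sketch: one processing function serves both real-time and "batch" needs, with batch results produced by replaying the same immutable log through the same code. The log and table names are hypothetical stand-ins for Kafka and an analytics database.

```python
event_log = []        # stand-in for a Kafka topic (immutable, ordered)
analytics_table = {}  # stand-in for the analytics database

def process(event: dict, table: dict) -> None:
    # The single transformation, used for both live and replayed data.
    table[event["user"]] = table.get(event["user"], 0) + event["amount"]

def on_live_event(event: dict) -> None:
    event_log.append(event)          # retain the event for future replays
    process(event, analytics_table)  # near-real-time update

def replay() -> dict:
    # "Batch" processing = replaying history through the same code path.
    table: dict = {}
    for event in event_log:
        process(event, table)
    return table

on_live_event({"user": "a", "amount": 5})
on_live_event({"user": "a", "amount": 7})
print(analytics_table, replay())  # identical results, one code path
```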
Lambda vs. Kappa architecture
• Layers: Lambda uses separate layers for batch and streaming; Kappa uses a unified layer for both batch and stream.
• Code complexity: Lambda's is higher, i.e., two technology stacks must be maintained for the batch and stream layers; Kappa's is lower, i.e., a single technology stack is maintained for both batch and stream.
• Performance: in Lambda, processing large amounts of data from the database would be expensive; Kappa offers faster performance with its combined batch-and-stream layer.

Technology of Big Data
[Image-only slides: big data technologies]

ADVANTAGE OF BIG DATA TOOLS
[Image-only slide: advantages of big data tools]