Introduction To Big Data Platforms

Uploaded by

it21047

BIG DATA:

According to Gartner, the definition of Big Data is:

"Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight
and decision making."

A few basic tenets (principles) of Big Data make it easier to answer the question of
what Big Data is.
Big Data Characteristics
Big Data refers to data sets that are too large or complex to be handled by traditional data
storage and processing systems. Many multinational companies use it to process data and
run the business of their organizations. The data flow would exceed 150 exabytes per day
before replication.

Five V's describe the characteristics of Big Data.

5 V's of Big Data

o Volume
o Veracity
o Variety
o Value
o Velocity

Volume
The name Big Data itself refers to enormous size. Big Data is the vast volume of data
generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.

For example, Facebook generates approximately a billion messages, records about 4.5 billion
clicks of the "Like" button, and receives more than 350 million new posts each day. Big data
technologies are built to handle such large amounts of data.

Variety
Big Data can be structured, semi-structured, or unstructured, and it is collected from many
different sources. In the past, data was collected only from databases and spreadsheets, but
these days it arrives in an array of forms: PDFs, emails, audio, social media posts,
photos, videos, etc.
The data is categorized as below:

a. Structured data: Structured data follows a defined schema with all the required columns.
It is in a tabular form and is stored in a relational database management system, i.e., in
relations (tables). OLTP (Online Transaction Processing) systems are built to work with
structured data.

b. Semi-structured data: In semi-structured data, the schema is not strictly defined,
e.g., JSON, XML, CSV, TSV, and email.

c. Unstructured data: Unstructured files such as log files, audio files, and image files
are included in unstructured data. Some organizations have a great deal of data available
but do not know how to derive value from it because the data is raw.

d. Quasi-structured data: Textual data with inconsistent formats that can be formatted with
effort, time, and some tools.

Example: Web server logs, i.e., log files created and maintained by a server that
contain a list of activities.
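
To illustrate how quasi-structured data such as a web server log can be brought into a
structured form, here is a minimal Python sketch. The sample log line and the regular
expression are assumptions made for illustration (the line imitates the widely used Common
Log Format); real logs vary and would need their own pattern.

```python
import re

# Hypothetical log line imitating the Common Log Format.
LOG_LINE = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Named groups give each extracted piece a column name.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Turn one quasi-structured log line into a structured record, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields so they can be stored in typed columns.
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

record = parse_log_line(LOG_LINE)
print(record)
```

Each parsed record has fixed fields (host, time, method, path, protocol, status, size), so it
could be loaded into a table like any other structured data.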

Veracity
Veracity refers to how reliable the data is. Data can be filtered or translated in many
ways, and veracity also covers the ability to handle and manage data efficiently. Reliable
data is essential in business development.

For example, Facebook posts with hashtags can be noisy and of uncertain reliability.

Value
Value is an essential characteristic of big data. What matters is not simply the data that
we process or store, but the valuable and reliable data that we store, process, and analyze.

Velocity
Velocity plays an important role compared to the other characteristics. Velocity is the speed
at which data is created in real time. It covers the speed of incoming data sets, the rate of
change, and activity bursts. A primary aspect of Big Data is to provide demanded data
rapidly.

Big data velocity deals with the speed at which data flows from sources such as application
logs, business processes, networks, social media sites, sensors, mobile devices, etc.
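
One simple way to make velocity and activity bursts measurable is to count how many events
arrive within a sliding time window. The sketch below is only an illustration: the
`RateMeter` class, the one-second window, and the arrival times are all hypothetical.

```python
from collections import deque

class RateMeter:
    """Counts the events seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def record(self, timestamp):
        """Register one event and return the number of events in the current window."""
        self.timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= timestamp - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps)

# Hypothetical arrival times in seconds: a quiet period followed by a burst.
arrivals = [0.0, 5.0, 10.0, 10.1, 10.2, 10.3, 10.4]
meter = RateMeter(window_seconds=1.0)
counts = [meter.record(t) for t in arrivals]
print(counts)  # [1, 1, 1, 2, 3, 4, 5]
```

The rising counts at the end correspond to an activity burst: five events within one second
after a period of one event every five seconds.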

What is a big data platform?

Big data platforms are comprehensive frameworks that enable organizations to store, process,
and analyze vast amounts of structured and unstructured data.
Big data platform features
Big data platforms offer several features - from data sourcing to advanced analytics, helping
businesses utilize data to achieve their business objectives. Some prominent features offered
by big data platforms include:
a. Data storage and management
Data storage and management is a fundamental feature of big data platforms. These platforms
provide robust and scalable storage solutions for handling large volumes of structured and
unstructured data. They offer various storage options, such as distributed file systems,
NoSQL databases, and data lakes, allowing organizations to store and organize data
efficiently.
One key advantage of big data platforms is the support for distributed file systems like
Hadoop Distributed File System (HDFS) and cloud-based storage options, which enable
seamless data storage across various environments. With advanced data management
capabilities, these platforms facilitate data integration, cleansing, and transformation,
ensuring the data is readily accessible for analysis and decision-making.
b. Distributed processing
Distributed processing is a crucial feature of big data platforms that enables processing large
volumes of data across multiple nodes or servers in a distributed computing environment.
This feature allows big data platforms to scale horizontally, meaning they can be easily
expanded to handle more data by adding more nodes.
This approach enables parallel or simultaneous processing, significantly reducing the time
required for data analysis. By distributing the workload across multiple nodes, big data
platforms handle massive data sets that would be impractical on a single machine. Distributed
processing is essential for achieving scalability and high performance in big data
environments.
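
The idea of splitting a workload, processing the parts in parallel, and combining the results
can be sketched on a single machine. In this illustrative Python example, threads stand in
for the nodes of a cluster; the function names are hypothetical, and the example shows only
the structure of the computation, not the performance of a real distributed system.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split data into num_partitions roughly equal slices, one per worker."""
    return [data[i::num_partitions] for i in range(num_partitions)]

def partial_sum(chunk):
    """Work done independently on each 'node'."""
    return sum(chunk)

def parallel_sum(data, num_workers):
    chunks = partition(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Combine the per-worker results into the final answer.
    return sum(partials)

total = parallel_sum(list(range(1, 101)), num_workers=4)
print(total)  # 5050
```

Adding more workers here corresponds to adding more nodes, which is the horizontal scaling
described above.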
c. Fault tolerance
Fault tolerance refers to the ability of a system to continue functioning even in the event of
software or hardware failures. The risk of failure is significantly higher in big data
environments, where massive volumes of data are processed and analyzed across many nodes. A
fault-tolerant big data platform ensures that data processing and analytics operations can
continue seamlessly, even if individual components or nodes within the system fail.
Fault tolerance is achieved through various techniques such as data replication, distributed
computing, and automatic failover mechanisms. In the event of a hardware failure or software
glitch, the system can seamlessly switch to backup resources, preventing data loss and
minimizing disruptions. This feature ensures continuous data availability and uninterrupted
processing, which is crucial for mission-critical applications and real-time analytics.
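
Data replication and automatic failover can be illustrated with a toy in-memory model. The
`Node` and `ReplicatedStore` classes below are hypothetical and greatly simplified; they are
a sketch of the principle, not the mechanism used by any particular platform.

```python
class Node:
    """A toy storage node that can be switched off to simulate a failure."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.blocks = {}

    def read(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.blocks[key]

class ReplicatedStore:
    def __init__(self, nodes, replication_factor):
        self.nodes = nodes
        self.replication_factor = replication_factor
        self.locations = {}  # key -> list of nodes holding a replica

    def put(self, key, value):
        # Data replication: write the value to several consecutive nodes.
        start = len(self.locations) % len(self.nodes)
        replicas = [self.nodes[(start + i) % len(self.nodes)]
                    for i in range(self.replication_factor)]
        for node in replicas:
            node.blocks[key] = value
        self.locations[key] = replicas

    def get(self, key):
        # Automatic failover: try each replica until one answers.
        for node in self.locations[key]:
            try:
                return node.read(key)
            except ConnectionError:
                continue
        raise RuntimeError(f"all replicas of {key!r} are unavailable")

nodes = [Node("node-1"), Node("node-2"), Node("node-3")]
store = ReplicatedStore(nodes, replication_factor=2)
store.put("block-a", b"sensor readings")
nodes[0].alive = False        # simulate a hardware failure
print(store.get("block-a"))   # still served by the second replica
```

With a replication factor of 2, the data survives the loss of one node; it becomes
unavailable only if every node holding a replica fails.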
d. Data analytics and visualization
Big data platforms offer robust tools and algorithms that can process large volumes of
data in real time or near real time. They support various analytical techniques, from
descriptive analytics to predictive and prescriptive analytics, for processing
complex business data.
Additionally, big data platforms offer advanced visualization capabilities, allowing users to
create interactive dashboards, charts, and graphs to convey insights in a visually appealing
and easily understandable manner. These capabilities enhance data comprehension and
facilitate effective communication across different teams and stakeholders.
How do big data platforms work?
Big data platforms follow a structured process to ensure companies can harness data to make
informed decisions. This process involves the following steps:
a. Data collection
Data collection is the initial step in the operation of big data platforms. It systematically
gathers data from various sources such as databases, social media, sensors, and other sources.
The data is collected using various methods such as web scraping, data feeds, APIs, and data
integration tools. The collected data is then stored in a centralized repository, often a data
lake or a data warehouse, where it can be easily accessed and processed for further analysis.
b. Data storage
Once the data is collected, it must be stored for efficient retrieval and processing. Big data
platforms typically utilize distributed storage systems that can handle large volumes of data.
These systems include Hadoop Distributed File System (HDFS), Google Cloud Storage, and
Amazon S3. This distributed storage architecture ensures high availability, fault tolerance,
and scalability.
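
As a rough worked example of what distributed storage implies, the sketch below estimates
how many blocks a file occupies and how much raw capacity its replicas consume. The 128 MB
block size and the replication factor of 3 are assumptions chosen to match common HDFS
defaults; both are configurable, and the function name is hypothetical.

```python
def storage_footprint(file_size_bytes, block_size_bytes=128 * 1024**2, replication=3):
    """Estimate how a file is split into blocks and replicated across a cluster."""
    # Ceiling division: the last block may be only partly filled.
    blocks = -(-file_size_bytes // block_size_bytes)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_bytes": file_size_bytes * replication,
    }

footprint = storage_footprint(file_size_bytes=1024**3)  # a hypothetical 1 GiB file
print(footprint)  # {'blocks': 8, 'block_replicas': 24, 'raw_bytes': 3221225472}
```

Under these assumptions, a 1 GiB file is stored as 8 blocks, each kept in 3 copies, so it
uses 3 GiB of raw capacity in exchange for availability and fault tolerance.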
c. Data processing
Once the data is stored, it must be processed to extract valuable insights. This process
involves various operations such as cleaning, transforming, and aggregating the data. The
parallel processing capabilities of big data platforms, such as Apache Hadoop and Apache
Spark, enable rapid computations and complex data transformations.
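
The three operations named above (cleaning, transforming, and aggregating) can be sketched
as a small pipeline in plain Python. The records and field names are hypothetical; on a real
platform the same steps would run in parallel over far larger data.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from the collection step.
raw_records = [
    {"region": " north ", "amount": "120.50"},
    {"region": "South", "amount": "80"},
    {"region": "north", "amount": None},   # missing amount
    {"region": "SOUTH", "amount": "19.50"},
    {"region": "", "amount": "5"},         # missing region
]

def clean(records):
    """Cleaning: drop records with missing fields."""
    for r in records:
        if r["amount"] is not None and r["region"].strip():
            yield r

def transform(records):
    """Transforming: normalise region names and convert amounts to numbers."""
    for r in records:
        yield {"region": r["region"].strip().lower(), "amount": float(r["amount"])}

def aggregate(records):
    """Aggregating: total amount per region."""
    totals = defaultdict(float)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)

totals = aggregate(transform(clean(raw_records)))
print(totals)  # {'north': 120.5, 'south': 99.5}
```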
d. Data analysis
Data analysis involves examining and interpreting large volumes of data to extract
meaningful insights and patterns. The analysis process includes using machine learning
algorithms, data mining techniques, or visualization tools to better understand the
information. The analysis results can then be used to make data-driven decisions, optimize
processes, identify opportunities, or solve complex problems.
e. Data quality assurance
This stage ensures the accuracy, consistency, integrity, relevance, and security of the
data. The prominent techniques to implement data quality and governance include data
quality management, lineage tracking, and cataloging. By implementing robust data quality
assurance measures, organizations can have confidence in the data they use for
decision-making.
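
A very small example of a data quality check is counting missing values and duplicate keys
in a batch of records. The function and the sample rows below are hypothetical and cover only
two of the many checks a real data quality process would apply.

```python
def quality_report(rows, required_fields, key_field):
    """Count two common data quality problems in a list of dict rows."""
    seen_keys = set()
    report = {"rows": len(rows), "missing_values": 0, "duplicate_keys": 0}
    for row in rows:
        # Completeness: every required field must be present and non-empty.
        if any(row.get(field) in (None, "") for field in required_fields):
            report["missing_values"] += 1
        # Uniqueness: the key field must not repeat.
        key = row.get(key_field)
        if key in seen_keys:
            report["duplicate_keys"] += 1
        seen_keys.add(key)
    return report

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "b@example.com"},
]
report = quality_report(rows, required_fields=["id", "email"], key_field="id")
print(report)  # {'rows': 3, 'missing_values': 1, 'duplicate_keys': 1}
```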
f. Data management
Data management is a crucial aspect of big data platforms. It involves organizing, storing,
and retrieving large volumes of data. Platforms employ various techniques such as data
backup, recovery, and archiving to manage data effectively. These techniques help implement
fault tolerance and ensure optimized data retrieval for all use cases.
Big data platforms
Several big data platforms offer comprehensive features and solutions for businesses to
manage and analyze complex datasets. The most prominent big data platforms used by
companies include the following:

a. Apache Hadoop
Apache Hadoop is one of the industry's most widely used big data platforms. It is an
open-source framework that enables distributed processing of massive datasets across
clusters of machines. Hadoop provides a scalable and cost-effective solution for storing,
processing, and analyzing massive amounts of structured and unstructured data.
One of the key features of Hadoop is its distributed file system, known as Hadoop Distributed
File System (HDFS). HDFS enables data to be stored across multiple machines, providing
fault tolerance and high availability. This feature allows businesses to store and process data
at a previously unattainable scale. Hadoop also includes a powerful processing engine called
MapReduce, which allows for parallel data processing across the cluster. The prominent
companies that use Apache Hadoop are:
● Yahoo
● Facebook
● Twitter
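
The MapReduce model used by Hadoop can be illustrated with the classic word-count example.
The sketch below runs the map, shuffle, and reduce phases sequentially in plain Python on
hypothetical input splits; it shows the programming model only and does not use Hadoop's
own API, which distributes these phases across a cluster.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)

# Hypothetical input splits; on a cluster each would be mapped on a different node.
splits = ["big data platforms", "big data tools", "data lakes"]
mapped = [pair for split in splits for pair in map_phase(split)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(word_counts)
```

Because each split is mapped independently and each key is reduced independently, both
phases can be spread over many machines.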

b. Apache Spark
Apache Spark is a unified analytics engine for batch processing, streaming data, machine
learning, and graph processing. It is one of the most popular big data platforms used by
companies. One of the key benefits that Apache Spark offers is speed. It is designed to
perform data processing tasks in-memory and achieve significantly faster processing times
than traditional disk-based systems.

Spark also supports various programming languages, including Java, Scala, Python, and R,
making it accessible to a wide range of developers. Spark offers a rich set of libraries and
tools, such as Spark SQL for querying structured data, MLlib for machine learning, and
GraphX for graph processing. Spark integrates well with other big data technologies, such as
Hadoop, allowing companies to leverage their existing infrastructure. The prominent
companies that use Apache Spark include:

● Netflix
● Uber
● Airbnb
c. Google Cloud BigQuery
Google Cloud BigQuery is a top-rated big data platform that provides a fully managed and
serverless data warehouse solution. It offers a robust and scalable infrastructure for storing,
querying, and analyzing massive datasets. BigQuery is designed to handle petabytes of data
and allows users to run SQL queries on large datasets with impressive speed and efficiency.

BigQuery supports multiple data formats and integrates seamlessly with other Google Cloud
services, such as Google Cloud Storage and Google Data Studio. BigQuery's unique
architecture enables automatic scaling, ensuring users can process data quickly without
worrying about infrastructure management. BigQuery offers a standard SQL interface for
querying data, built-in machine learning algorithms for predictive analytics, and geospatial
analysis capabilities. The prominent companies that use Google Cloud BigQuery are:

● Spotify
● Walmart
● The New York Times
d. Amazon EMR
Amazon EMR is a widely used big data platform from Amazon Web Services (AWS). It
offers a scalable and cost-effective solution for processing and analyzing large datasets using
popular open-source frameworks such as Apache Hadoop, Apache Spark, and Apache Hive.
EMR allows users to quickly provision and manage clusters of virtual servers, known as
instances, to process data in parallel.
EMR integrates seamlessly with other AWS services, such as Amazon S3 for data storage and
Amazon Redshift for data warehousing, enabling a comprehensive big data ecosystem.
Additionally, EMR supports various data processing frameworks and tools, making it suitable
for a wide range of use cases, including data transformation, machine learning, log analysis,
and real-time analytics. The prominent companies that use Amazon EMR are:
● Expedia
● Lyft
● Pfizer
e. Microsoft Azure HDInsight
Microsoft Azure HDInsight is a leading big data platform offered by Microsoft Azure. It
provides a fully managed cloud service for processing and analyzing large datasets using
popular open-source frameworks such as Apache Hadoop, Apache Spark, Apache Hive, and
Apache HBase. HDInsight offers a scalable and reliable infrastructure that allows users to
easily deploy and manage clusters.
HDInsight integrates seamlessly with other Azure services, such as Azure Data Lake Storage
and Azure Synapse Analytics, offering a comprehensive ecosystem of Microsoft Azure
services. HDInsight supports various programming languages, including Java, Python, and R,
making it accessible to a wide range of users. The prominent companies that use Microsoft
Azure HDInsight are:
● Starbucks
● Boeing
● T-Mobile
f. Cloudera
Cloudera is a leading big data platform that offers a comprehensive suite of tools and services
designed to help organizations effectively manage and analyze large volumes of data.
Cloudera's platform is built on Apache Hadoop, an open-source framework for distributed
storage and processing of big data. Cloudera is a hybrid data platform that can be deployed
across on-premises, cloud, and edge environments.
Cloudera offers a unified platform that integrates various components such as Hadoop
Distributed File System (HDFS), Apache Spark, and Apache Hive, enabling users to perform
various data processing and analytics tasks. Cloudera also provides machine learning and
advanced analytics tools, allowing businesses to gain deeper insights from their data. The
prominent companies that use Cloudera are:
● Dell
● Nissan Motor
● Comcast
g. IBM InfoSphere BigInsights
IBM InfoSphere BigInsights is a powerful big data platform that offers a range of tools to
manage and analyze large volumes of structured as well as unstructured data in a reliable
manner. IBM InfoSphere BigInsights can handle massive data, making it suitable for
enterprises dealing with complex datasets. It provides a comprehensive set of features for
data management, data warehousing, data analytics, machine learning, and more.
IBM InfoSphere BigInsights provides a user-friendly interface and intuitive data exploration
and visualization tools. The platform also offers robust security and governance features,
ensuring data privacy and compliance with regulatory requirements. BigInsights is built on
top of Apache Hadoop and Apache Spark, and it integrates with other IBM products and
services, such as IBM DB2, IBM SPSS Modeler, and IBM Watson Analytics. This
integration makes it a good choice for businesses already using the IBM ecosystem of
products and services. The prominent companies that use IBM InfoSphere BigInsights are:
● Lenovo
● DBS Bank
● General Motors
h. Databricks
Databricks is a prominent big data platform built on Apache Spark. Databricks simplifies the
process of building and deploying big data applications by providing a scalable and fully
managed infrastructure. It allows users to process large datasets in real-time, perform
complex analytics, and build machine learning models using Spark's powerful capabilities.
Databricks provides an interactive workspace where users can write code, visualize data, and
collaborate on projects. It also integrates with popular data sources and tools, making it easy
to ingest and process data from various sources. Its automated infrastructure management
and auto-scaling capabilities give users the resources to handle large datasets and
complex workloads efficiently, making it a reliable choice for such workloads. The
prominent companies that use Databricks are:
● Nvidia Corporation
● Johnson & Johnson
● Salesforce
