Introduction To Big Data Platforms
A few basic tenets (principles) of Big Data make it easier to answer the question of what
Big Data is:
Big Data Characteristics
Big Data refers to data sets so large that traditional data storage and processing systems
cannot handle them. Many multinational companies use Big Data technologies to process data
and support the business of many organizations. The data flow would exceed 150 exabytes per
day before replication.
Five V's describe the characteristics of Big Data.
Volume
The name Big Data itself refers to enormous size. Big Data is a vast volume of data
generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
Facebook, for example, generates approximately a billion messages, records about 4.5 billion
clicks of the "Like" button, and receives more than 350 million new posts each day. Big data
technologies are built to handle data at this scale.
Variety
Big Data can be structured, unstructured, or semi-structured, and it is collected from
different sources. In the past, data was collected only from databases and spreadsheets, but
these days data comes in an array of forms, such as PDFs, emails, audio, social media posts,
photos, and videos.
The data is categorized as below:
a. Structured data: Data that follows a defined schema with all the required columns. It is
in tabular form and is stored in a relational database management system.
b. Unstructured data: Log files, audio files, image files, and other files with no
predefined structure are unstructured data. Some organizations have a lot of data
available but do not know how to derive value from it because the data is raw.
c. Quasi-structured data: Textual data with inconsistent formats that can be formatted with
some tools, given effort and time.
Example: Web server logs, that is, log files created and maintained by a server that
contain a list of its activities.
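To make the web server log example concrete, the sketch below turns one quasi-structured log
line into a structured record. It assumes the log follows the widely used Common Log Format;
the sample line and the names LOG_PATTERN and parse_log_line are illustrative only and do not
come from any particular server.

```python
import re

# Assumed layout: Common Log Format
#   host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no response body was sent; store it as 0.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical log line used only for illustration.
sample = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(parse_log_line(sample))
```

Once each line is a record with named fields, it can be loaded into a table like any other
structured data. Lines that do not fit the assumed format return None, which is one simple
way to set aside inconsistent entries for later inspection.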
Veracity
Veracity means how reliable the data is. There are many ways to filter or translate data.
Veracity is also about being able to handle and manage data efficiently. Big Data is also
essential in business development.
Value
Value is an essential characteristic of Big Data. What matters is not simply the data that we
process or store, but the valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to
the speed at which data is created in real time. It covers the speed of incoming data sets,
the rate of change, and bursts of activity. A primary aspect of Big Data is to provide
demanded data rapidly.
Big Data velocity deals with the speed at which data flows from sources such as application
logs, business processes, networks, social media sites, sensors, and mobile devices.
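As a rough illustration of velocity, the sketch below counts how many events arrive in each
one-second interval and finds the busiest interval, which is one simple way to spot an
activity burst. The arrival times are made up, and the function names events_per_second and
peak_rate are hypothetical helpers, not part of any platform.

```python
from collections import Counter


def events_per_second(timestamps):
    """Count events in each one-second bucket (timestamps are in seconds, >= 0)."""
    return Counter(int(ts) for ts in timestamps)


def peak_rate(timestamps):
    """Return (second, count) for the busiest one-second bucket."""
    buckets = events_per_second(timestamps)
    return max(buckets.items(), key=lambda item: item[1])


# Hypothetical arrival times, in seconds, with a burst during second 2.
arrivals = [0.1, 0.7, 1.2, 2.0, 2.1, 2.3, 2.8, 3.5]
print(events_per_second(arrivals))
print(peak_rate(arrivals))
```

A real streaming system would compute such rates continuously over a moving window rather
than over a finished list, but the idea of bucketing events by time is the same.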
Big Data Platforms
a. Apache Hadoop
Apache Hadoop is one of the industry's most widely used big data platforms. It is an
open-source framework that enables distributed processing of massive datasets across
clusters of computers. Hadoop provides a scalable and cost-effective solution for storing,
processing, and analyzing massive amounts of structured and unstructured data.
One of the key features of Hadoop is its distributed file system, known as Hadoop Distributed
File System (HDFS). HDFS enables data to be stored across multiple machines, providing
fault tolerance and high availability. This feature allows businesses to store and process data
at a previously unattainable scale. Hadoop also includes a powerful processing engine called
MapReduce, which allows for parallel data processing across the cluster. The prominent
companies that use Apache Hadoop are:
● Yahoo
● Facebook
● Twitter
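The MapReduce model mentioned above can be sketched in plain Python with the classic word
count. This is a single-process illustration of the map, shuffle, and reduce steps, not the
Hadoop API; the function names and the sample input are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = []
    # On a cluster, each split would be mapped on a different node in parallel.
    for doc in documents:
        mapped.extend(map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input splits used only for illustration.
splits = ["big data needs big storage", "data moves fast"]
print(word_count(splits))
```

Because each call to map_phase depends only on its own split, and each call to reduce_phase
depends only on one key, both steps can be spread across many machines. That independence is
what lets the model process data in parallel across the cluster.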
b. Apache Spark
Apache Spark is a unified analytics engine for batch processing, streaming data, machine
learning, and graph processing. It is one of the most popular big data platforms used by
companies. One of the key benefits that Apache Spark offers is speed. It is designed to
perform data processing tasks in-memory and achieve significantly faster processing times
than traditional disk-based systems.
Spark also supports various programming languages, including Java, Scala, Python, and R,
making it accessible to a wide range of developers. Spark offers a rich set of libraries and
tools, such as Spark SQL for querying structured data, MLlib for machine learning, and
GraphX for graph processing. Spark integrates well with other big data technologies, such as
Hadoop, allowing companies to leverage their existing infrastructure. The prominent
companies that use Apache Spark include:
● Netflix
● Uber
● Airbnb
c. Google Cloud BigQuery
Google Cloud BigQuery is a top-rated big data platform that provides a fully managed and
serverless data warehouse solution. It offers a robust and scalable infrastructure for storing,
querying, and analyzing massive datasets. BigQuery is designed to handle petabytes of data
and allows users to run SQL queries on large datasets with impressive speed and efficiency.
BigQuery supports multiple data formats and integrates seamlessly with other Google Cloud
services, such as Google Cloud Storage and Google Data Studio. BigQuery's unique
architecture enables automatic scaling, ensuring users can process data quickly without
worrying about infrastructure management. BigQuery offers a standard SQL interface for
querying data, built-in machine learning algorithms for predictive analytics, and geospatial
analysis capabilities. The prominent companies that use Google Cloud BigQuery are:
● Spotify
● Walmart
● The New York Times
d. Amazon EMR
Amazon EMR is a widely used big data platform from Amazon Web Services (AWS). It
offers a scalable and cost-effective solution for processing and analyzing large datasets using
popular open-source frameworks such as Apache Hadoop, Apache Spark, and Apache Hive.
EMR allows users to quickly provision and manage clusters of virtual servers, known as
instances, to process data in parallel.
EMR integrates seamlessly with other AWS services, such as Amazon S3 for data storage and
Amazon Redshift for data warehousing, enabling a comprehensive big data ecosystem.
Additionally, EMR supports various data processing frameworks and tools, making it suitable
for a wide range of use cases, including data transformation, machine learning, log analysis,
and real-time analytics. The prominent companies that use Amazon EMR are:
● Expedia
● Lyft
● Pfizer
e. Microsoft Azure HDInsight
Microsoft Azure HDInsight is a leading big data platform offered by Microsoft Azure. It
provides a fully managed cloud service for processing and analyzing large datasets using
popular open-source frameworks such as Apache Hadoop, Apache Spark, Apache Hive, and
Apache HBase. HDInsight offers a scalable and reliable infrastructure that allows users to
easily deploy and manage clusters.
HDInsight integrates seamlessly with other Azure services, such as Azure Data Lake Storage
and Azure Synapse Analytics, offering a comprehensive ecosystem of Microsoft Azure
services. HDInsight supports various programming languages, including Java, Python, and R,
making it accessible to a wide range of users. The prominent companies that use Microsoft
Azure HDInsight are:
● Starbucks
● Boeing
● T-Mobile
f. Cloudera
Cloudera is a leading big data platform that offers a comprehensive suite of tools and services
designed to help organizations effectively manage and analyze large volumes of data.
Cloudera's platform is built on Apache Hadoop, an open-source framework for distributed
storage and processing of big data. Cloudera is a hybrid data platform that can be deployed
across on-premises, cloud, and edge environments.
Cloudera offers a unified platform that integrates various components such as Hadoop
Distributed File System (HDFS), Apache Spark, and Apache Hive, enabling users to perform
various data processing and analytics tasks. Cloudera also provides machine learning and
advanced analytics tools, allowing businesses to gain deeper insights from their data. The
prominent companies that use Cloudera are:
● Dell
● Nissan Motor
● Comcast
g. IBM InfoSphere BigInsights
IBM InfoSphere BigInsights is a powerful big data platform that offers a range of tools to
manage and analyze large volumes of structured as well as unstructured data in a reliable
manner. IBM InfoSphere BigInsights can handle massive data, making it suitable for
enterprises dealing with complex datasets. It provides a comprehensive set of features for
data management, data warehousing, data analytics, machine learning, and more.
IBM InfoSphere BigInsights provides a user-friendly interface and intuitive data exploration
and visualization tools. The platform also offers robust security and governance features,
ensuring data privacy and compliance with regulatory requirements. BigInsights is built on
top of Apache Hadoop and Apache Spark, and it integrates with other IBM products and
services, such as IBM DB2, IBM SPSS Modeler, and IBM Watson Analytics. This
integration makes it a good choice for businesses already using IBM products and services.
The prominent companies that use IBM InfoSphere BigInsights are:
● Lenovo
● DBS Bank
● General Motors
h. Databricks
Databricks is a prominent big data platform built on Apache Spark. Databricks simplifies the
process of building and deploying big data applications by providing a scalable and fully
managed infrastructure. It allows users to process large datasets in real-time, perform
complex analytics, and build machine learning models using Spark's powerful capabilities.
Databricks provides an interactive workspace where users can write code, visualize data, and
collaborate on projects. It also integrates with popular data sources and tools, making it easy
to ingest and process data from various sources. With its automated infrastructure
management and auto-scaling capabilities, Databricks ensures that users have the resources
to handle large datasets and complex workloads efficiently. The prominent companies that use
Databricks are:
● Nvidia Corporation
● Johnson & Johnson
● Salesforce