Unit 1
Data that is very large in size is called Big Data. Normally we work with data of size MB
(Word documents, Excel sheets) or at most GB (movies, code repositories), but data on the scale of
petabytes, i.e. 10^15 bytes, is called Big Data. It is estimated that almost 90% of today's data has
been generated in the past three years.
o Social networking sites: Facebook, Google and LinkedIn all generate huge amounts of
data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of
logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large volumes of data,
which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data
through their daily transactions.
These characteristics of big data are often summarized as the three Vs:
1. Velocity: Data is being generated at a very fast rate; it is estimated that the volume of
data will double every two years.
2. Variety: Nowadays data is not stored only in rows and columns. Data is structured as
well as unstructured. Log files and CCTV footage are unstructured data, while data that can be
saved in tables, such as a bank's transaction data, is structured.
3. Volume: The amount of data we deal with is very large, on the order of petabytes.
Use case
An e-commerce site XYZ (having 100 million users) wants to offer a gift voucher of $100 to
its top 10 customers who have spent the most in the previous year. Moreover, it wants to
find the buying trend of these customers so that the company can suggest more items relevant to
them.
Issues
A huge amount of unstructured data needs to be stored, processed and analyzed.
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System),
which uses commodity hardware to form clusters and stores the data in a distributed fashion. It
works on the write once, read many times principle.
Processing: The MapReduce paradigm is applied to the data distributed over the network to compute
the required output.
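To make the processing step concrete, below is a minimal MapReduce sketch in Java using the standard Hadoop API. The input format is an assumption made for illustration (each purchase log line is taken to be "customerId,amount,timestamp"), and the class and path names are invented: the map step emits (customerId, amount) pairs and the reduce step sums the spend per customer, after which the top 10 totals can be selected.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomerSpend {

  // Map: one purchase log line ("customerId,amount,timestamp") -> (customerId, amount)
  public static class SpendMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length >= 2) {
        try {
          context.write(new Text(fields[0]),
                        new DoubleWritable(Double.parseDouble(fields[1])));
        } catch (NumberFormatException e) {
          // Malformed lines are simply skipped in this sketch.
        }
      }
    }
  }

  // Reduce: (customerId, [amounts]) -> (customerId, total spend for the year)
  public static class SpendReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text customer, Iterable<DoubleWritable> amounts, Context context)
        throws IOException, InterruptedException {
      double total = 0;
      for (DoubleWritable amount : amounts) {
        total += amount.get();
      }
      context.write(customer, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "customer spend");
    job.setJarByClass(CustomerSpend.class);
    job.setMapperClass(SpendMapper.class);
    job.setReducerClass(SpendReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the per-customer totals are tiny compared with the raw logs, picking the top 10 and analyzing their buying trends can then be done with a small follow-up job or even a local sort of the output.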
Big data analytics also supports applications such as the following:
Social media listening. This analyzes what people are saying on social media
about a business or product, which can help identify potential problems and target
audiences for marketing campaigns.
Sentiment analysis. All of the data that's gathered on customers can be analyzed
to reveal how they feel about a company or brand, customer satisfaction levels,
potential issues and how customer service could be improved.
Big data management technologies
Hadoop, an open source distributed processing framework released in 2006, initially was at
the center of most big data architectures. The development of Spark and other processing
engines pushed MapReduce, the engine built into Hadoop, more to the side. The result is
an ecosystem of big data technologies that can be used for different applications but often are
deployed together.
Big data platforms and managed services offered by IT vendors combine many of those
technologies in a single package, primarily for use in the cloud.
For organizations that want to deploy big data systems themselves, either on premises or in
the cloud, the technologies that are available to them in addition to Hadoop and Spark include
the following categories of tools:
storage repositories, such as the Hadoop Distributed File System (HDFS) and
cloud object storage services that include Amazon Simple Storage Service (S3),
Google Cloud Storage and Azure Blob Storage;
stream processing engines, such as Flink, Hudi, Kafka, Samza, Storm and the
Spark Streaming and Structured Streaming modules built into Spark;
data lake and data warehouse platforms, among them Amazon Redshift, Delta
Lake, Google BigQuery, Kylin and Snowflake; and
SQL query engines, like Drill, Hive, Impala, Presto and Trino.
Big data challenges
Beyond processing capacity issues, designing a big data architecture is a
common challenge for users. Big data systems must be tailored to an organization's particular
needs, a DIY undertaking that requires IT and data management teams to piece together a
customized set of technologies and tools. Deploying and managing big data systems also
require new skills compared to the ones that database administrators and developers focused
on relational software typically possess.
Both of those issues can be eased by using a managed cloud service, but IT managers need to
keep a close eye on cloud usage to make sure costs don't get out of hand. Also, migrating on-
premises data sets and processing workloads to the cloud is often a complex process.
Other challenges in managing big data systems include making the data accessible to data
scientists and analysts, especially in distributed environments that include a mix of different
platforms and data stores. To help analysts find relevant data, data management and analytics
teams are increasingly building data catalogs that incorporate metadata management and data
lineage functions. The process of integrating sets of big data is often also complicated,
particularly when data variety and velocity are factors.
To ensure that sets of big data are clean, consistent and used properly, a data
governance program and associated data quality management processes also must be
priorities. Other best practices for managing and analyzing big data include focusing on
business needs for information over the available technologies and using data visualization to
aid in data discovery and analysis.
While there is no comparable federal data privacy law in the U.S., the California Consumer Privacy Act
(CCPA) aims to give California residents more control over the collection and use of their
personal information by companies that do business in the state. CCPA was signed into law
in 2018 and took effect on Jan. 1, 2020.
To ensure that they comply with such laws, businesses need to carefully manage the process
of collecting big data. Controls must be put in place to identify regulated data and prevent
unauthorized employees from accessing it.
Big data can be contrasted with small data, a term that's sometimes used to describe data sets
that can be easily used for self-service BI and analytics. A commonly quoted axiom is, "Big
data is for machines; small data is for people."
1. Reading a complete dataset from a single drive takes a long time, so the obvious way to
reduce the read time is to read from multiple disks at once: spread the data over 100 drives,
each holding one hundredth of the data, and read from them in parallel.
2. Only using one hundredth of a disk may seem wasteful. But we can store one hundred
datasets, each of which is one terabyte, and provide shared access to them. We can
imagine that the users of such a system would be happy to share access in return for
shorter analysis times, and, statistically, that their analysis jobs would be likely to be
spread over time, so they wouldn’t interfere with each other too much.
3. There’s more to being able to read and write data in parallel to or from multiple disks,
though. The first problem to solve is hardware failure: as soon as you start using many
pieces of hardware, the chance that one will fail is fairly high. A common way of
avoiding data loss is through replication: redundant copies of the data are kept by the
system so that in the event of failure, there is another copy available. This is how
RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed
Filesystem (HDFS), takes a slightly different approach, as you shall see later.
4. The second problem is that most analysis tasks need to be able to combine the data in
some way; data read from one disk may need to be combined with the data from any
of the other 99 disks. Various distributed systems allow data to be combined from
multiple sources, but doing this correctly is notoriously challenging. MapReduce
provides a programming model that abstracts the problem from disk reads and writes.
Another difference between MapReduce and an RDBMS is the amount of structure in the
datasets that they operate on. Structured data is data that is organized into entities that
have a defined format, such as XML documents or database tables that conform to a
particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on
the other hand, is looser, and though there may be a schema, it is often ignored, so it may
be used only as a guide to the structure of the data: for example, a spreadsheet, in which
the structure is the grid of cells, although the cells themselves may hold any form of data.
Unstructured data does not have any particular internal structure: for example, plain text
or image data. MapReduce works well on unstructured or semistructured data, since it is
designed to interpret the data at processing time. In other words, the input keys and values
for MapReduce are not an intrinsic property of the data, but they are chosen by the person
analyzing the data.
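As a small illustration of that last point (the keys and values are chosen by the analyst, not dictated by the data), the hedged sketch below uses Hadoop's standard Mapper API over plain-text input: TextInputFormat hands the mapper each line's byte offset as the key and the line as the value, and this particular mapper chooses to re-key the data by word. A different mapper could just as easily key the same lines by, say, line length; the class name here is invented.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // The key we were handed (the byte offset) is ignored; the keys we emit
    // (the words) are a choice made by the person analyzing the data.
    for (String word : line.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), ONE);
      }
    }
  }
}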
Grid Computing
The high-performance computing (HPC) and grid computing communities have been
doing large-scale data processing for years, using such application program interfaces
(APIs) as the Message Passing Interface (MPI). Broadly, the approach in HPC is to
distribute the work across a cluster of machines, which access a shared filesystem hosted
by a storage area network (SAN). This works well for predominantly compute-intensive
jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds
of gigabytes, the point at which Hadoop really starts to shine), since the network
bandwidth is the bottleneck and compute nodes become idle.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because
it is local. This feature, known as data locality, is at the heart of data processing in
Hadoop and is the reason for its good performance. Recognizing that network bandwidth
is the most precious resource in a data center environment (it is easy to saturate network
links by copying data around), Hadoop goes to great lengths to conserve it by explicitly
modeling network topology. Notice that this arrangement does not preclude high-CPU
analyses in Hadoop. MPI gives great control to programmers, but it requires that they
explicitly handle the mechanics of the data flow, exposed via low-level C routines and
constructs such as sockets, as well as the higher-level algorithms for the analyses.
Processing in Hadoop operates only at the higher level: the programmer thinks in terms
of the data model (such as key-value pairs for MapReduce), while the data flow remains
implicit.
History of Hadoop
Hadoop was developed under the Apache Software Foundation, and its co-founders are Doug
Cutting and Mike Cafarella.
Co-founder Doug Cutting named it after his son's toy elephant. In October 2003, Google
published its Google File System paper. In January 2006, MapReduce development started
on Apache Nutch, with around 6,000 lines of code for MapReduce and around 5,000
lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop has a distributed file system known as HDFS, which splits files into blocks and
distributes them across the various nodes of large clusters. In case of a node failure, the
system keeps operating, and the necessary data transfer between nodes is handled by
HDFS.
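As a rough sketch of how a client program interacts with HDFS, the code below uses Hadoop's Java FileSystem API to write a file once, request extra replicas of its blocks, and read it back. The NameNode address and the paths are placeholders invented for illustration; in a real cluster they come from the configuration files (core-site.xml and hdfs-site.xml).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/data/example.txt"); // illustrative path

    // Write once: HDFS splits the file into blocks and spreads them over nodes.
    try (FSDataOutputStream out = fs.create(path)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Ask for three replicas of each block so a node failure does not lose data.
    fs.setReplication(path, (short) 3);

    // Read many times: the client is served from whichever replicas are available.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
    fs.close();
  }
}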
Advantages of HDFS:
It is inexpensive, immutable in nature, stores data reliably, tolerates faults, is scalable and
block structured, can process a large amount of data simultaneously, and more.
Disadvantages of HDFS:
Its biggest disadvantage is that it is not a good fit for small quantities of data. It also has
potential stability issues and is restrictive and rough in nature.
Hadoop also supports a wide range of software packages such as Apache Flume, Apache
Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache
Hive, Apache Phoenix, Cloudera Impala.
1. Hive- It uses HiveQL for data structuring and for writing queries that would otherwise
require complicated MapReduce jobs over data in HDFS (see the sketch after this list).
2. Drill- It consists of user-defined functions and is used for data exploration.
3. Storm- It allows real-time processing and streaming of data.
4. Spark- It contains a Machine Learning Library(MLlib) for providing enhanced
machine learning and is widely used for data processing. It also supports Java,
Python, and Scala.
5. Pig- It provides Pig Latin, a SQL-like language, and performs data transformations on
unstructured data.
6. Tez- It reduces the complexities of Hive and Pig and helps their jobs run
faster.
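As an example of how the Hive entry above is commonly used from Java, the sketch below submits a HiveQL query over JDBC to a HiveServer2 endpoint. The host, database, table and column names are invented for illustration; Hive itself translates the query into MapReduce (or Tez) jobs over data stored in HDFS.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // The Hive JDBC driver must be on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Assumed HiveServer2 endpoint; host, database and table are placeholders.
    String url = "jdbc:hive2://hiveserver:10000/default";
    try (Connection con = DriverManager.getConnection(url, "user", "");
         Statement stmt = con.createStatement();
         // HiveQL that the Hive runtime turns into MapReduce (or Tez) jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT customer_id, SUM(amount) AS total " +
             "FROM purchases GROUP BY customer_id ORDER BY total DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
      }
    }
  }
}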
Hadoop framework is made up of the following modules:
1. Hadoop MapReduce- a MapReduce programming model for handling and
processing large data.
2. Hadoop Distributed File System- stores files in a distributed manner across the nodes of a cluster.
3. Hadoop YARN- a platform which manages computing resources.
4. Hadoop Common- it contains packages and libraries which are used by the other
modules.
Advantages:
Ability to store a large amount of data.
High flexibility.
Cost effective.
High computational power.
Tasks are independent.
Linear scaling.
Disadvantages:
Not very effective for small data.
Hard cluster management.
Has stability issues.
Security concerns.
Most of the core projects covered in this book are hosted by the Apache Software
Foundation, which provides support for a community of open-source software projects,
including the original HTTP Server from which it gets its name. As the Hadoop ecosystem
grows, more projects are appearing, not necessarily hosted at Apache, which provide
complementary services to Hadoop, or build on the core to add higher-level abstractions.
The Hadoop projects that are covered in this book are described briefly here:
Common
A set of components and interfaces for distributed filesystems and general I/O
(serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
A data flow language and execution environment for exploring very large datasets. Pig
runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query
language based on SQL (and which is translated by the runtime engine to MapReduce
jobs) for querying the data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage,
and supports both batch-style computations using MapReduce and point
queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives
such as distributed locks that can be used for building distributed applications (a crude
lock sketch follows these project descriptions).
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
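To give a feel for the kind of primitive ZooKeeper offers, here is a deliberately crude sketch of a lock built on an ephemeral znode; the ensemble address and lock path are invented, and production code would normally use a well-tested recipe (for example Apache Curator's lock recipes) rather than this simplification.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CrudeZkLock {
  public static void main(String[] args) throws Exception {
    // Placeholder ensemble address; crudely wait for the first session event.
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("zk1:2181", 15000,
        event -> connected.countDown());
    connected.await();

    try {
      // An ephemeral znode vanishes automatically if this client's session ends,
      // so the "lock" cannot be held forever by a crashed process.
      zk.create("/job-lock", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      System.out.println("Lock acquired, doing work...");
    } catch (KeeperException.NodeExistsException e) {
      System.out.println("Another client holds the lock.");
    } finally {
      zk.close();
    }
  }
}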