Unit-1

What is Big Data

Data that is very large in size is called Big Data. Normally we work with data on the order of megabytes (Word documents, Excel sheets) or at most gigabytes (movies, code), but data at the petabyte scale, i.e. 10^15 bytes, is called Big Data. It is often stated that almost 90% of today's data has been generated in the past three years.

Sources of Big Data

This data comes from many sources, such as:

o Social networking sites: Facebook, Google and LinkedIn generate huge amounts of data every day, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly, and for this they store data about millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.

3V's of Big Data

1. Velocity: Data is being generated at a very fast rate. It is estimated that the volume of data doubles every two years.
2. Variety: Nowadays data is not stored only in rows and columns. Data can be structured as well as unstructured. Log files and CCTV footage are unstructured data; data that can be stored in tables, such as a bank's transaction data, is structured.
3. Volume: The amount of data we deal with is very large, on the scale of petabytes.
Use case

An e-commerce site XYZ (with 100 million users) wants to offer a gift voucher of $100 to its top 10 customers, i.e. those who spent the most in the previous year. Moreover, it wants to find the buying trends of these customers so that the company can suggest more items relevant to them.

Issues

A huge amount of largely unstructured data needs to be stored, processed and analyzed.

Solution

Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and stores data in a distributed fashion. It works on the write once, read many times principle.

Processing: The MapReduce paradigm is applied to the data distributed over the network to compute the required output (a minimal sketch is given below).
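The following is a minimal sketch, in Java against the standard Hadoop MapReduce API, of the spend-aggregation step of this use case: the mapper emits (customerId, amount) for each transaction record and the reducer sums per customer. The input layout (CSV lines of the form customerId,itemId,amount), class names and paths are assumptions for illustration, not XYZ's actual pipeline; ranking the top 10 totals would be a small second pass over the reducer output.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomerSpend {

    // Map: each assumed transaction line "customerId,itemId,amount" becomes (customerId, amount).
    public static class SpendMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 3) {
                ctx.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[2])));
            }
        }
    }

    // Reduce: sum all amounts for one customer to get the yearly spend.
    public static class SpendReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text customer, Iterable<DoubleWritable> amounts, Context ctx)
                throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable a : amounts) {
                total += a.get();
            }
            ctx.write(customer, new DoubleWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "customer spend");
        job.setJarByClass(CustomerSpend.class);
        job.setMapperClass(SpendMapper.class);
        job.setReducerClass(SpendReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}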

Analyze: Pig and Hive can be used to analyze the data.

Cost: Hadoop is open source, so cost is no longer an issue.


Using customer data as an example, the different branches of analytics that can be done with
sets of big data include the following:

 Comparative analysis. This examines customer behavior metrics and real-time customer engagement in order to compare a company's products, services and branding with those of its competitors.

 Social media listening. This analyzes what people are saying on social media
about a business or product, which can help identify potential problems and target
audiences for marketing campaigns.

 Marketing analytics. This provides information that can be used to improve marketing campaigns and promotional offers for products, services and business initiatives.

 Sentiment analysis. All of the data that's gathered on customers can be analyzed
to reveal how they feel about a company or brand, customer satisfaction levels,
potential issues and how customer service could be improved.
Big data management technologies
Hadoop, an open source distributed processing framework released in 2006, initially was at
the center of most big data architectures. The development of Spark and other processing
engines pushed MapReduce, the engine built into Hadoop, more to the side. The result is
an ecosystem of big data technologies that can be used for different applications but often are
deployed together.

Big data platforms and managed services offered by IT vendors combine many of those
technologies in a single package, primarily for use in the cloud. Currently, that includes these
offerings, listed alphabetically:

 Amazon EMR (formerly Elastic MapReduce)

 Cloudera Data Platform

 Google Cloud Dataproc


 HPE Ezmeral Data Fabric (formerly MapR Data Platform)

 Microsoft Azure HDInsight

For organizations that want to deploy big data systems themselves, either on premises or in
the cloud, the technologies that are available to them in addition to Hadoop and Spark include
the following categories of tools:

 storage repositories, such as the Hadoop Distributed File System (HDFS) and
cloud object storage services that include Amazon Simple Storage Service (S3),
Google Cloud Storage and Azure Blob Storage;

 cluster management frameworks, like Kubernetes, Mesos and YARN, Hadoop's built-in resource manager and job scheduler, which stands for Yet Another Resource Negotiator but is commonly known by the acronym alone;

 stream processing engines, such as Flink, Hudi, Kafka, Samza, Storm and the
Spark Streaming and Structured Streaming modules built into Spark;

 NoSQL databases that include Cassandra, Couchbase, CouchDB, HBase, MarkLogic Data Hub, MongoDB, Neo4j, Redis and various other technologies;

 data lake and data warehouse platforms, among them Amazon Redshift, Delta
Lake, Google BigQuery, Kylin and Snowflake; and

 SQL query engines, like Drill, Hive, Impala, Presto and Trino.
Big data challenges
In connection with the processing capacity issues, designing a big data architecture is a
common challenge for users. Big data systems must be tailored to an organization's particular
needs, a DIY undertaking that requires IT and data management teams to piece together a
customized set of technologies and tools. Deploying and managing big data systems also
require new skills compared to the ones that database administrators and developers focused
on relational software typically possess.

Both of those issues can be eased by using a managed cloud service, but IT managers need to
keep a close eye on cloud usage to make sure costs don't get out of hand. Also, migrating on-
premises data sets and processing workloads to the cloud is often a complex process.
Other challenges in managing big data systems include making the data accessible to data
scientists and analysts, especially in distributed environments that include a mix of different
platforms and data stores. To help analysts find relevant data, data management and analytics
teams are increasingly building data catalogs that incorporate metadata management and data
lineage functions. The process of integrating sets of big data is often also complicated,
particularly when data variety and velocity are factors.

Keys to an effective big data strategy


In an organization, developing a big data strategy requires an understanding of business goals
and the data that's currently available to use, plus an assessment of the need for additional
data to help meet the objectives. The next steps to take include the following:

 prioritizing planned use cases and applications;

 identifying new systems and tools that are needed;

 creating a deployment roadmap; and

 evaluating internal skills to see if retraining or hiring are required.

To ensure that sets of big data are clean, consistent and used properly, a data
governance program and associated data quality management processes also must be
priorities. Other best practices for managing and analyzing big data include focusing on
business needs for information over the available technologies and using data visualization to
aid in data discovery and analysis.

Big data collection practices and regulations


As the collection and use of big data have increased, so has the potential for data misuse. A
public outcry about data breaches and other personal privacy violations led the European
Union to approve the General Data Protection Regulation (GDPR), a data privacy law that
took effect in May 2018. GDPR limits the types of data that organizations can collect and
requires opt-in consent from individuals or compliance with other specified reasons for
collecting personal data. It also includes a right-to-be-forgotten provision, which lets EU
residents ask companies to delete their data.

While there aren't similar federal laws in the U.S., the California Consumer Privacy Act
(CCPA) aims to give California residents more control over the collection and use of their
personal information by companies that do business in the state. CCPA was signed into law
in 2018 and took effect on Jan. 1, 2020.

To ensure that they comply with such laws, businesses need to carefully manage the process
of collecting big data. Controls must be put in place to identify regulated data and prevent
unauthorized employees from accessing it.

The human side of big data management and analytics


Ultimately, the business value and benefits of big data initiatives depend on the workers
tasked with managing and analyzing the data. Some big data tools enable less technical users
to run predictive analytics applications or help businesses deploy a suitable infrastructure for
big data projects, while minimizing the need for hardware and distributed software know-
how.

Big data can be contrasted with small data, a term that's sometimes used to describe data sets
that can be easily used for self-service BI and analytics. A commonly quoted axiom is, "Big
data is for machines; small data is for people."

Data Storage and Analysis


1. The storage capacities of hard drives have increased massively over the years, but access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read all the data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes (the arithmetic is sketched after this list).

2. Only using one hundredth of a disk may seem wasteful. But we can store one hundred
datasets, each of which is one terabyte, and provide shared access to them. We can
imagine that the users of such a system would be happy to share access in return for
shorter analysis times, and, statistically, that their analysis jobs would be likely to be
spread over time, so they wouldn’t interfere with each other too much.

3. There’s more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach.

4. The second problem is that most analysis tasks need to be able to combine the data in
some way; data read from one disk may need to be combined with the data from any
of the other 99 disks. Various distributed systems allow data to be combined from
multiple sources, but doing this correctly is notoriously challenging. MapReduce
provides a programming model that abstracts the problem from disk reads and writes.
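The drive-read-time figures quoted in point 1 can be checked with a few lines of arithmetic. The snippet below is only a back-of-the-envelope illustration using the numbers from the text (1,370 MB at 4.4 MB/s, roughly 1 TB at 100 MB/s, and 100 drives read in parallel); the class name is arbitrary.

public class DriveReadTime {
    public static void main(String[] args) {
        double mb1990 = 1370, rate1990 = 4.4;     // 1990 drive: 1,370 MB at 4.4 MB/s
        double mbNow = 1_000_000, rateNow = 100;  // ~1 TB drive at ~100 MB/s
        System.out.printf("1990 drive: %.0f s (about %.1f minutes)%n",
                mb1990 / rate1990, mb1990 / rate1990 / 60);   // ~311 s, around five minutes
        System.out.printf("1 TB drive: %.0f s (about %.1f hours)%n",
                mbNow / rateNow, mbNow / rateNow / 3600);     // 10,000 s, more than 2.5 hours
        System.out.printf("100 drives in parallel: %.0f s%n",
                mbNow / rateNow / 100);                       // ~100 s, under two minutes
    }
}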

Comparison with other systems


 RDBMS
In many ways, MapReduce can be seen as a complement to an RDBMS. MapReduce is a
good fit for problems that need to analyze the whole dataset, in a batch fashion,
particularly for ad hoc analysis. An RDBMS is good for point queries or updates,
where the dataset has been indexed to deliver low-latency retrieval and update
times of a relatively small amount of data. MapReduce suits applications where
the data is written once, and read many
times, whereas a relational database is good for datasets that are continually
updated.

Another difference between MapReduce and an RDBMS is the amount of structure in the
datasets that they operate on. Structured data is data that is organized into entities that
have a defined format, such as XML documents or database tables that conform to a
particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on
the other hand, is looser, and though there may be a schema, it is often ignored, so it may
be used only as a guide to the structure of the data: for example, a spreadsheet, in which
the structure is the grid of cells, although the cells themselves may hold any form of data.
Unstructured data does not have any particular internal structure: for example, plain text
or image data. MapReduce works well on unstructured or semistructured data, since it is
designed to interpret the data at processing time. In other words, the input keys and values
for MapReduce are not an intrinsic property of the data, but they are chosen by the person
analyzing the data.

Grid Computing
High-Performance Computing (HPC) and grid computing communities have been doing large-scale data processing for years, using Application Program Interfaces (APIs) such as the Message Passing Interface (MPI). Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem hosted by a Storage Area Network (SAN). This works well for compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which Hadoop really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), Hadoop goes to great lengths to conserve it by explicitly modelling network topology. Note that this arrangement does not preclude high-CPU analyses in Hadoop. MPI gives great control to programmers, but it requires that they explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithms for the analyses. Processing in Hadoop operates only at the higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.

A brief history of Hadoop


Hadoop is an open source software programming framework for storing a large amount of
data and performing the computation. Its framework is based on Java programming with
some native code in C and shell scripts.

History of Hadoop

Hadoop is developed by the Apache Software Foundation, and its co-founders are Doug Cutting and Mike Cafarella.
Doug Cutting named it after his son's toy elephant. In October 2003, the first relevant paper, on the Google File System, was released. In January 2006, MapReduce development started on Apache Nutch, with around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.

Hadoop has a distributed file system known as HDFS. HDFS splits files into blocks and distributes them across the nodes of large clusters. In the case of a node failure, the system keeps operating, and the necessary data transfer between nodes is handled by HDFS.
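A minimal sketch of this behaviour through the HDFS Java API: it copies a file into HDFS (which splits it into blocks behind the scenes) and then asks the NameNode where those blocks are stored. The cluster address, paths and class name are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/tmp/transactions.csv");     // assumed local file
        Path inHdfs = new Path("/data/transactions.csv");   // assumed HDFS destination
        fs.copyFromLocalFile(local, inHdfs);                 // HDFS splits the file into blocks

        // Ask which DataNodes hold each block; because blocks are replicated,
        // a failed node does not lose data, another replica is simply read instead.
        FileStatus status = fs.getFileStatus(inHdfs);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(block);
        }
        fs.close();
    }
}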

Advantages of HDFS:
It is inexpensive, its stored data is immutable, it stores data reliably, it can tolerate faults, it is scalable and block structured, it can process large amounts of data in parallel, and more.
Disadvantages of HDFS:

Its biggest disadvantage is that it is not well suited to small quantities of data. It also has potential stability issues and is restrictive and rough in nature.
Hadoop also supports a wide range of software packages such as Apache Flume, Apache
Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache
Hive, Apache Phoenix, Cloudera Impala.

Some common frameworks of Hadoop

1. Hive- It uses HiveQL for data structuring and for expressing complicated MapReduce jobs over data in HDFS.
2. Drill- It consists of user-defined functions and is used for data exploration.
3. Storm- It allows real-time processing and streaming of data.
4. Spark- It contains a Machine Learning Library (MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
5. Pig- It has Pig Latin, a SQL-like language, and performs data transformation of unstructured data.
6. Tez- It reduces the complexities of Hive and Pig and helps their code run faster.
Hadoop framework is made up of the following modules:
1. Hadoop MapReduce- a MapReduce programming model for handling and
processing large data.
2. Hadoop Distributed File System- distributes files in clusters among nodes.
3. Hadoop YARN- a platform which manages computing resources.
4. Hadoop Common- it contains packages and libraries which are used for other
modules.

Advantages and Disadvantages of Hadoop

Advantages:
 Ability to store a large amount of data.
 High flexibility.
 Cost effective.
 High computational power.
 Tasks are independent.
 Linear scaling.

Disadvantages:
 Not very effective for small data.
 Hard cluster management.
 Has stability issues.
 Security concerns.

Apache Hadoop and the Hadoop Eco System


Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS,
renamed from NDFS), the term is also used for a family of related projects that fall under
the umbrella of infrastructure for distributed computing and large-scale data processing.

Most of the core projects covered in this book are hosted by the Apache Software
Foundation, which provides support for a community of open-source software projects,
including the original HTTP Server from which it gets its name. As the Hadoop ecosystem
grows, more projects are appearing, not necessarily hosted at Apache, which provide
complementary services to Hadoop, or build on the core to add higher-level abstractions.

The Hadoop projects that are covered in this book are described briefly here:
Common
A set of components and interfaces for distributed filesystems and general I/O
(serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC, and persistent data storage.
MapReduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.

HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
A data flow language and execution environment for exploring very large datasets. Pig
runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query
language based on SQL (which is translated by the runtime engine to MapReduce jobs)
for querying the data; a small example of submitting such a query is sketched after this list.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage,
and supports both batch-style computations using MapReduce and point
queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives
such as distributed locks that can be used for building distributed applications.
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
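As mentioned under Hive above, the query language is SQL-based and is compiled into MapReduce (or, in newer versions, Tez or Spark) jobs, so the caller never writes map and reduce functions by hand. The following is a minimal sketch of submitting such a query from Java through the HiveServer2 JDBC driver; the server address, database, table and column names are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTopSpenders {
    public static void main(String[] args) throws Exception {
        // HiveServer2 commonly listens on port 10000; database, table and columns are assumed.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT customer_id, SUM(amount) AS total "
                   + "FROM transactions GROUP BY customer_id "
                   + "ORDER BY total DESC LIMIT 10")) {
            // Hive runs the query as distributed jobs over data stored in HDFS.
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}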
