Unit 1
Data that is very large in size is called Big Data. Normally we work with data of size MB
(Word documents, Excel sheets) or at most GB (movies, code repositories), but data on the scale of
petabytes, i.e. 10^15 bytes, is called Big Data. It is estimated that almost 90% of today's data has
been generated in the past three years.
o Social networking sites: Facebook, Google and LinkedIn all generate huge amounts of
data on a day-to-day basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of
logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large volumes of data,
which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data
through their daily transactions.
These characteristics of big data are often summarized as the three Vs:
1. Velocity: Data is being generated at a very fast rate; it is estimated that the volume of
data will double every two years.
2. Variety: Nowadays data is not stored only in rows and columns. Data is structured as
well as unstructured. Log files and CCTV footage are unstructured data, while data that can be
saved in tables, such as a bank's transaction data, is structured.
3. Volume: The amount of data we deal with is very large, on the order of petabytes.
Use case
An e-commerce site XYZ (having 100 million users) wants to offer a gift voucher of $100 to
its top 10 customers who have spent the most in the previous year. Moreover, it wants to
find the buying trend of these customers so that the company can suggest more items relevant to
them.
Issues
A huge amount of unstructured data needs to be stored, processed and analyzed.
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System),
which uses commodity hardware to form clusters and stores the data in a distributed fashion. It
works on the write once, read many times principle.
Processing: The MapReduce paradigm is applied to the data distributed over the network to compute
the required output.
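To make the processing step concrete, below is a minimal MapReduce sketch in Java using the standard Hadoop API. The input format is an assumption made for illustration (each purchase log line is taken to be "customerId,amount,timestamp"), and the class and path names are invented: the map step emits (customerId, amount) pairs and the reduce step sums the spend per customer, after which the top 10 totals can be selected.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CustomerSpend {

  // Map: one purchase log line ("customerId,amount,timestamp") -> (customerId, amount)
  public static class SpendMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length >= 2) {
        try {
          context.write(new Text(fields[0]),
                        new DoubleWritable(Double.parseDouble(fields[1])));
        } catch (NumberFormatException e) {
          // Malformed lines are simply skipped in this sketch.
        }
      }
    }
  }

  // Reduce: (customerId, [amounts]) -> (customerId, total spend for the year)
  public static class SpendReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text customer, Iterable<DoubleWritable> amounts, Context context)
        throws IOException, InterruptedException {
      double total = 0;
      for (DoubleWritable amount : amounts) {
        total += amount.get();
      }
      context.write(customer, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "customer spend");
    job.setJarByClass(CustomerSpend.class);
    job.setMapperClass(SpendMapper.class);
    job.setReducerClass(SpendReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the per-customer totals are tiny compared with the raw logs, picking the top 10 and analyzing their buying trends can then be done with a small follow-up job or even a local sort of the output.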
Big data analytics also supports applications such as the following:
Social media listening. This analyzes what people are saying on social media
about a business or product, which can help identify potential problems and target
audiences for marketing campaigns.
Sentiment analysis. All of the data that's gathered on customers can be analyzed
to reveal how they feel about a company or brand, customer satisfaction levels,
potential issues and how customer service could be improved.
Big data management technologies
Hadoop, an open source distributed processing framework released in 2006, initially was at
the center of most big data architectures. The development of Spark and other processing
engines pushed MapReduce, the engine built into Hadoop, more to the side. The result is
an ecosystem of big data technologies that can be used for different applications but often are
deployed together.
Big data platforms and managed services offered by IT vendors combine many of those
technologies in a single package, primarily for use in the cloud.
For organizations that want to deploy big data systems themselves, either on premises or in
the cloud, the technologies that are available to them in addition to Hadoop and Spark include
the following categories of tools:
storage repositories, such as the Hadoop Distributed File System (HDFS) and
cloud object storage services that include Amazon Simple Storage Service (S3),
Google Cloud Storage and Azure Blob Storage;
stream processing engines, such as Flink, Hudi, Kafka, Samza, Storm and the
Spark Streaming and Structured Streaming modules built into Spark;
data lake and data warehouse platforms, among them Amazon Redshift, Delta
Lake, Google BigQuery, Kylin and Snowflake; and
SQL query engines, like Drill, Hive, Impala, Presto and Trino.
Big data challenges
Beyond processing capacity issues, designing a big data architecture is a
common challenge for users. Big data systems must be tailored to an organization's particular
needs, a DIY undertaking that requires IT and data management teams to piece together a
customized set of technologies and tools. Deploying and managing big data systems also
require new skills compared to the ones that database administrators and developers focused
on relational software typically possess.
Both of those issues can be eased by using a managed cloud service, but IT managers need to
keep a close eye on cloud usage to make sure costs don't get out of hand. Also, migrating on-
premises data sets and processing workloads to the cloud is often a complex process.
Other challenges in managing big data systems include making the data accessible to data
scientists and analysts, especially in distributed environments that include a mix of different
platforms and data stores. To help analysts find relevant data, data management and analytics
teams are increasingly building data catalogs that incorporate metadata management and data
lineage functions. The process of integrating sets of big data is often also complicated,
particularly when data variety and velocity are factors.
To ensure that sets of big data are clean, consistent and used properly, a data
governance program and associated data quality management processes also must be
priorities. Other best practices for managing and analyzing big data include focusing on
business needs for information over the available technologies and using data visualization to
aid in data discovery and analysis.
While there is no comparable federal data privacy law in the U.S., the California Consumer Privacy Act
(CCPA) aims to give California residents more control over the collection and use of their
personal information by companies that do business in the state. CCPA was signed into law
in 2018 and took effect on Jan. 1, 2020.
To ensure that they comply with such laws, businesses need to carefully manage the process
of collecting big data. Controls must be put in place to identify regulated data and prevent
unauthorized employees from accessing it.
Big data can be contrasted with small data, a term that's sometimes used to describe data sets
that can be easily used for self-service BI and analytics. A commonly quoted axiom is, "Big
data is for machines; small data is for people."
1. Reading a complete dataset from a single drive takes a long time, so the obvious way to
reduce the read time is to read from multiple disks at once: spread the data over 100 drives,
each holding one hundredth of the data, and read from them in parallel.
2. Only using one hundredth of a disk may seem wasteful. But we can store one hundred
datasets, each of which is one terabyte, and provide shared access to them. We can
imagine that the users of such a system would be happy to share access in return for
shorter analysis times, and, statistically, that their analysis jobs would be likely to be
spread over time, so they wouldn’t interfere with each other too much.
3. There’s more to being able to read and write data in parallel to or from multiple disks,
though. The first problem to solve is hardware failure: as soon as you start using many
pieces of hardware, the chance that one will fail is fairly high. A common way of
avoiding data loss is through replication: redundant copies of the data are kept by the
system so that in the event of failure, there is another copy available. This is how
RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed
Filesystem (HDFS), takes a slightly different approach, as you shall see later.
4. The second problem is that most analysis tasks need to be able to combine the data in
some way; data read from one disk may need to be combined with the data from any
of the other 99 disks. Various distributed systems allow data to be combined from
multiple sources, but doing this correctly is notoriously challenging. MapReduce
provides a programming model that abstracts the problem from disk reads and writes.
Another difference between MapReduce and an RDBMS is the amount of structure in the
datasets that they operate on. Structured data is data that is organized into entities that
have a defined format, such as XML documents or database tables that conform to a
particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on
the other hand, is looser, and though there may be a schema, it is often ignored, so it may
be used only as a guide to the structure of the data: for example, a spreadsheet, in which
the structure is the grid of cells, although the cells themselves may hold any form of data.
Unstructured data does not have any particular internal structure: for example, plain text
or image data. MapReduce works well on unstructured or semistructured data, since it is
designed to interpret the data at processing time. In other words, the input keys and values
for MapReduce are not an intrinsic property of the data, but they are chosen by the person
analyzing the data.
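As a small illustration of that last point (the keys and values are chosen by the analyst, not dictated by the data), the hedged sketch below uses Hadoop's standard Mapper API over plain-text input: TextInputFormat hands the mapper each line's byte offset as the key and the line as the value, and this particular mapper chooses to re-key the data by word. A different mapper could just as easily key the same lines by, say, line length; the class name here is invented.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // The key we were handed (the byte offset) is ignored; the keys we emit
    // (the words) are a choice made by the person analyzing the data.
    for (String word : line.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), ONE);
      }
    }
  }
}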
Grid Computing
The high-performance computing (HPC) and grid computing communities have been
doing large-scale data processing for years, using such application program interfaces
(APIs) as the Message Passing Interface (MPI). Broadly, the approach in HPC is to
distribute the work across a cluster of machines, which access a shared filesystem hosted
by a storage area network (SAN). This works well for predominantly compute-intensive
jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds
of gigabytes, the point at which Hadoop really starts to shine), since the network
bandwidth is the bottleneck and compute nodes become idle.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because
it is local. This feature, known as data locality, is at the heart of data processing in
Hadoop and is the reason for its good performance. Recognizing that network bandwidth
is the most precious resource in a data center environment (it is easy to saturate network
links by copying data around), Hadoop goes to great lengths to conserve it by explicitly
modeling network topology. Notice that this arrangement does not preclude high-CPU
analyses in Hadoop. MPI gives great control to programmers, but it requires that they
explicitly handle the mechanics of the data flow, exposed via low-level C routines and
constructs such as sockets, as well as the higher-level algorithms for the analyses.
Processing in Hadoop operates only at the higher level: the programmer thinks in terms
of the data model (such as key-value pairs for MapReduce), while the data flow remains
implicit.
History of Hadoop
Hadoop was developed under the Apache Software Foundation, and its co-founders are Doug
Cutting and Mike Cafarella.
Co-founder Doug Cutting named it after his son's toy elephant. In October 2003, Google
published its Google File System paper. In January 2006, MapReduce development started
on Apache Nutch, with around 6,000 lines of code for MapReduce and around 5,000
lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop has a distributed file system known as HDFS, which splits files into blocks and
distributes them across the various nodes of large clusters. In case of a node failure, the
system keeps operating, and the necessary data transfer between nodes is handled by
HDFS.
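As a rough sketch of how a client program interacts with HDFS, the code below uses Hadoop's Java FileSystem API to write a file once, request extra replicas of its blocks, and read it back. The NameNode address and the paths are placeholders invented for illustration; in a real cluster they come from the configuration files (core-site.xml and hdfs-site.xml).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/data/example.txt"); // illustrative path

    // Write once: HDFS splits the file into blocks and spreads them over nodes.
    try (FSDataOutputStream out = fs.create(path)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Ask for three replicas of each block so a node failure does not lose data.
    fs.setReplication(path, (short) 3);

    // Read many times: the client is served from whichever replicas are available.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
    fs.close();
  }
}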
Advantages of HDFS:
It is inexpensive, immutable in nature, stores data reliably, tolerates faults, is scalable and
block structured, can process a large amount of data simultaneously, and more.
Disadvantages of HDFS:
Its biggest disadvantage is that it is not a good fit for small quantities of data. It also has
potential stability issues and is restrictive and rough in nature.
Hadoop also supports a wide range of software packages such as Apache Flume, Apache
Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache
Hive, Apache Phoenix, Cloudera Impala.
1. Hive- It uses HiveQL for data structuring and for writing queries that would otherwise
require complicated MapReduce jobs over data in HDFS (see the sketch after this list).
2. Drill- It consists of user-defined functions and is used for data exploration.
3. Storm- It allows real-time processing and streaming of data.
4. Spark- It contains a Machine Learning Library(MLlib) for providing enhanced
machine learning and is widely used for data processing. It also supports Java,
Python, and Scala.
5. Pig- It provides Pig Latin, a SQL-like language, and performs data transformations on
unstructured data.
6. Tez- It reduces the complexities of Hive and Pig and helps their jobs run
faster.
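As an example of how the Hive entry above is commonly used from Java, the sketch below submits a HiveQL query over JDBC to a HiveServer2 endpoint. The host, database, table and column names are invented for illustration; Hive itself translates the query into MapReduce (or Tez) jobs over data stored in HDFS.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // The Hive JDBC driver must be on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Assumed HiveServer2 endpoint; host, database and table are placeholders.
    String url = "jdbc:hive2://hiveserver:10000/default";
    try (Connection con = DriverManager.getConnection(url, "user", "");
         Statement stmt = con.createStatement();
         // HiveQL that the Hive runtime turns into MapReduce (or Tez) jobs.
         ResultSet rs = stmt.executeQuery(
             "SELECT customer_id, SUM(amount) AS total " +
             "FROM purchases GROUP BY customer_id ORDER BY total DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
      }
    }
  }
}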
Hadoop framework is made up of the following modules:
1. Hadoop MapReduce- a MapReduce programming model for handling and
processing large data.
2. Hadoop Distributed File System- stores files in a distributed manner across the nodes of a cluster.
3. Hadoop YARN- a platform which manages computing resources.
4. Hadoop Common- it contains packages and libraries which are used by the other
modules.
Advantages:
Ability to store a large amount of data.
High flexibility.
Cost effective.
High computational power.
Tasks are independent.
Linear scaling.
Disadvantages:
Not very effective for small data.
Hard cluster management.
Has stability issues.
Security concerns.
Most of the core projects covered in this book are hosted by the Apache Software
Foundation, which provides support for a community of open-source software projects,
including the original HTTP Server from which it gets its name. As the Hadoop ecosystem
grows, more projects are appearing, not necessarily hosted at Apache, which provide
complementary services to Hadoop, or build on the core to add higher-level abstractions.
The Hadoop projects that are covered in this book are described briefly here:
Common
A set of components and interfaces for distributed filesystems and general I/O
(serialization, Java RPC, persistent data structures).
Avro
A serialization system for efficient, cross-language RPC and persistent data storage.
MapReduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
A data flow language and execution environment for exploring very large datasets. Pig
runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a query
language based on SQL (and which is translated by the runtime engine to MapReduce
jobs) for querying the data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage,
and supports both batch-style computations using MapReduce and point
queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives
such as distributed locks that can be used for building distributed applications (a crude
lock sketch follows these project descriptions).
Sqoop
A tool for efficiently moving data between relational databases and HDFS.
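To give a feel for the kind of primitive ZooKeeper offers, here is a deliberately crude sketch of a lock built on an ephemeral znode; the ensemble address and lock path are invented, and production code would normally use a well-tested recipe (for example Apache Curator's lock recipes) rather than this simplification.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CrudeZkLock {
  public static void main(String[] args) throws Exception {
    // Placeholder ensemble address; crudely wait for the first session event.
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("zk1:2181", 15000,
        event -> connected.countDown());
    connected.await();

    try {
      // An ephemeral znode vanishes automatically if this client's session ends,
      // so the "lock" cannot be held forever by a crashed process.
      zk.create("/job-lock", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      System.out.println("Lock acquired, doing work...");
    } catch (KeeperException.NodeExistsException e) {
      System.out.println("Another client holds the lock.");
    } finally {
      zk.close();
    }
  }
}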