Unit 1


Big Data Analytics

DR. SHILPA BADE-GITE


Syllabus
Mode of Conduction
•Unit 1, 2 and 3-2 credits-Dr. Shilpa Bade-Gite-July-Sept 2024
•Unit 4 and 5-1 credit-Mr. Amit Khedkar-Oct 7-11, 2024-3 hrs daily

Amit Khedkar’s profile


https://www.linkedin.com/in/amit-khedkar-023758166/?originalSubdomain=in
Director and Lead Instructor at Talentum Global Technologies
[email protected]
My Timetable
Evaluation Plan

Unit test is cancelled.


BDA Final Evaluation-30 Marks
1. Quiz-CO1, CO2-Unit 1, Unit 2-12 Marks-Individual submission-31 Aug 24.
2. Poster-CO3-Unit 3-6 Marks-Group submission-22 Sept 24.
3. Case study-CO4, CO5-Unit 4, Unit 5-12 Marks-Individual submission-20 Oct 24.
Unit 1-Introduction to Big Data-6 Hrs

•Big Data Fundamentals and Big Data Analytics
•Structured Data, Unstructured Data, and Semi-Structured Data
•Hadoop Overview and Evolution of Big Data Hadoop
•Hadoop Architecture/Framework
•HDFS
•MapReduce
•Hadoop Environment Setup
•Distributed File System(s)
What is Big Data?

“Big data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
- Gartner, Research and Advisory Company

Source: https://medium.com/analysts-corner/what-is-big-data-and-why-is-it-important-to-business-41d3d0bd9d87
Why Big Data Analytics?
• Risk Management
• Product Development and Innovations
• Quicker and Better Decision making
• Improve Customer Experience
• Complex Supplier Networks
• Focused and Targeted Campaigns

https://www.analyticssteps.com/blogs/what-big-data-analytics-definition-advantages-and-types
Types of BDA
Big data analytics is categorized into four types:

•Descriptive Analytics
•Diagnostic Analytics
•Predictive Analytics
•Prescriptive Analytics

https://www.analyticssteps.com/blogs/what-big-data-analytics-definition-advantages-and-types
https://www.zucisystems.com/blog/big-data-analytics/
Types of Data
Healthcare Example
Hadoop
Hadoop is an open-source framework, overseen by the Apache Software Foundation and written in Java, for storing and processing huge datasets on clusters of commodity hardware.

There are mainly two problems with big data: the first is storing such a huge amount of data, and the second is processing that stored data.

Accordingly, Hadoop has two main components: the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN).
What is Hadoop
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very huge in volume.
Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing.
It is used by Facebook, Yahoo, Twitter, LinkedIn, and many more.
Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on that basis. Files are broken into blocks and stored on nodes across the distributed architecture.
YARN: Yet Another Resource Negotiator, used for job scheduling and for managing the cluster.
MapReduce: A framework that helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (see the sketch below).
Hadoop Common: Java libraries used to start Hadoop and used by the other Hadoop modules.
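
To make the key-value flow of Map and Reduce concrete, here is a minimal word-count sketch in Java using the standard org.apache.hadoop.mapreduce API. It follows the shape of the canonical Hadoop tutorial example; the class name and input/output paths are placeholders, not course-specific code.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map task: turns each input line into (word, 1) key-value pairs.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE); // emit (word, 1)
                }
            }
        }

        // Reduce task: sums all the counts received for the same word key.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result); // emit (word, total count)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation per mapper
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a jar, this would typically be launched with something like: hadoop jar wordcount.jar WordCount /input /output (hypothetical paths).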
Hadoop Architecture
DFS (Distributed File System)
A Distributed File System (DFS) is a file system that is distributed across multiple file servers or multiple locations. It allows programs to access or store isolated files as they do local ones, and allows programmers to access files from any network or computer.
DFS is a technology that lets you group shared folders located on different servers into one or more logically structured namespaces.
The main purpose of a DFS is to allow users of physically distributed systems to share their data and resources through a common file system.
A typical DFS configuration is a collection of workstations and mainframes connected by a Local Area Network (LAN).
A DFS is implemented as a part of the operating system. In DFS, a namespace is created, and this process is transparent to the clients.
Components of DFS
Location Transparency: users access files without needing to know their physical location; this is achieved through the namespace component.
Redundancy: achieved through file replication, which improves availability and fault tolerance.

Although clients access and share files as if they were local, the servers retain complete control over the data and provide users with access control.
Hadoop Distributed File System
Hadoop's distributed file system, HDFS, splits files into blocks and distributes them across the nodes of large clusters. In case of a node failure, the system keeps operating, and the data transfer between nodes is facilitated by HDFS (a minimal client sketch follows below).

Advantages of HDFS: it is inexpensive, its files are immutable (write-once), it stores data reliably, tolerates faults, scales well, is block-structured, and can process large amounts of data in parallel.

Disadvantages of HDFS: its biggest disadvantage is that it is not a good fit for small quantities of data; it also has potential stability issues and can be restrictive and rough to work with.

Hadoop also supports a wide range of software packages such as Apache Flume, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
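
To illustrate how a client program talks to HDFS, here is a minimal Java sketch using the org.apache.hadoop.fs.FileSystem API. The NameNode address (localhost:9000), the replication factor, and the file path are assumptions made for this example only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // hypothetical NameNode address
            conf.set("dfs.replication", "3");                  // each block stored on 3 nodes

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/sample.txt");     // hypothetical HDFS path

            // Write a small file; HDFS files are write-once (immutable after close).
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // Read the file back from the cluster.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n));
            }

            // Show which DataNodes hold the blocks of this file.
            long len = fs.getFileStatus(file).getLen();
            System.out.println(java.util.Arrays.toString(
                    fs.getFileBlockLocations(file, 0, len)));
            fs.close();
        }
    }

The same API applies to large files: HDFS transparently splits the stream into blocks (128 MB by default) and replicates each block across DataNodes.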
Some common frameworks of Hadoop
Hive- It uses HiveQL for data structuring and for writing complicated MapReduce logic over HDFS.
Drill- It consists of user-defined functions and is used for data exploration.
Storm- It allows real-time processing and streaming of data.
Spark- It contains a Machine Learning library (MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
Pig- It has Pig Latin, a SQL-like language, and performs data transformation on unstructured data.
Tez- It reduces the complexities of Hive and Pig and helps their code run faster.
The Hadoop framework is made up of the following modules:
Hadoop MapReduce- a MapReduce programming model for handling and processing large data.
Hadoop Distributed File System- distributes files in blocks across the nodes of a cluster.
Hadoop YARN- a platform which manages computing resources.
Hadoop Common- packages and libraries used by the other Hadoop modules.
Advantages
It allows users to access and store data.
It helps to improve access time, network efficiency, and availability of files.
It provides transparency of data even if a server or disk fails.
It permits data to be shared remotely.
It helps to enhance the ability to scale the amount of data and to exchange data.
Disadvantages
In a DFS, the database connection is complicated.
In a DFS, database handling is also more complex than in a single-user system.
If all nodes try to transfer data simultaneously, there is a chance that overloading will happen.
There is a possibility that messages and data would be lost in the network while moving from one node to another.
Hadoop has several key features that make it well-suited for big data processing:
Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add capacity as needed.
Fault Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate even in the presence of hardware failures.
Data Locality: Hadoop stores data on the same node where it will be processed; this reduces network traffic and improves performance.
High Availability: Hadoop's high-availability features help ensure that data is always available and is not lost.
Flexible Data Processing: Hadoop's MapReduce programming model allows data to be processed in a distributed fashion, making it easy to implement a wide variety of data processing tasks.
Data Integrity: Hadoop's built-in checksum feature helps ensure that stored data is consistent and correct.
Data Replication: Hadoop replicates data across the cluster for fault tolerance.
Data Compression: Hadoop's built-in data compression helps reduce storage space and improve performance.
YARN: a resource management platform that allows multiple data processing engines, such as real-time streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.
Disadvantages
Not very effective for small data.
Hard cluster management.
Has stability issues.
Security concerns.
Complexity: Hadoop can be complex to set up and maintain, especially for organizations without a dedicated team of experts.
Latency: Hadoop is not well-suited for low-latency workloads and may not be the best choice for real-time data processing.
Limited Support for Real-time Processing: Hadoop's batch-oriented nature makes it less suited for real-time streaming or interactive data processing use cases.
Limited Support for Structured Data: Hadoop is designed to work with unstructured and semi-structured data; it is not well-suited for structured data processing.
Data Security: Hadoop's built-in security features, such as data encryption and user authentication, are limited out of the box, which can make it difficult to secure sensitive data.
Limited Support for Ad-hoc Queries: Hadoop's MapReduce programming model is not well-suited for ad-hoc queries, making it difficult to perform exploratory data analysis.
Limited Support for Graph and Machine Learning: Hadoop's core components, HDFS and MapReduce, are not well-suited for graph and machine learning workloads; specialized components like Apache Giraph and Mahout are available but have some limitations.
Cost: Hadoop can be expensive to set up and maintain, especially for organizations with large amounts of data.
Data Loss: in the event of a hardware failure, data stored on a single node may be lost permanently.
Data Governance: data governance is a critical aspect of data management, and Hadoop does not provide built-in features to manage data lineage, data quality, data cataloging, and data audit.
References
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
https://www.geeksforgeeks.org/hadoop-history-or-evolution/
https://data-flair.training/blogs/hadoop-history/
