Unit 1


Big Data Analytics

DR. SHILPA BADE-GITE


Syllabus
Mode of Conduction
•Unit 1, 2 and 3-2 credits-Dr. Shilpa Bade-Gite-July-Sept 2024
•Unit 4 and 5-1 credit-Mr. Amit Khedkar-Oct 7-11, 2024-3 hrs daily

Amit Khedkar’s profile


https://www.linkedin.com/in/amit-khedkar-023758166/?originalSubdomain=in
Director and Lead Instructor at Talentum Global Technologies
[email protected]
My Timetable
Evaluation Plan

Unit test is cancelled.


BDA Final Evaluation-30 Marks
1. Quiz-CO1, CO2-Unit 1, Unit 2-12 Marks-Individual submission-31 Aug 24.
2. Poster-CO3-Unit 3-6 Marks-Group submission-22 Sept 24.
3. Case study-CO4, CO5-Unit 4, Unit 5-12 Marks-Individual submission-20 Oct 24.
Unit 1-Introduction to Big Data-6 Hrs

•Big Data Fundamentals and Big Data Analytics
•Structured Data, Unstructured Data, and Semi-Structured Data
•Hadoop Overview and Evolution of Big Data Hadoop
•Hadoop Architecture/Framework
•HDFS
•MapReduce
•Hadoop Environment Setup
•Distributed File System(s)
What is Big Data?

“Big data is high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
- Gartner, Research and Advisory Company

Source: https://medium.com/analysts-corner/what-is-big-data-and-why-is-it-important-to-business-41d3d0bd9d87
Why Big Data Analytics?
• Risk Management
• Product Development and Innovations
• Quicker and Better Decision making
• Improve Customer Experience
• Complex Supplier Networks
• Focused and Targeted Campaigns

https://www.analyticssteps.com/blogs/what-big-data-analytics-definition-advantages-and-types
Types of BDA
Big data analytics is categorized into four types:

•Descriptive Analytics
•Diagnostic Analytics
•Predictive Analytics
•Prescriptive Analytics

https://www.analyticssteps.com/blogs/what-big-data-analytics-definition-advantages-and-types
https://www.zucisystems.com/blog/big-data-analytics/
Types of Data
Healthcare Example
Hadoop
Hadoop is an open-source framework, overseen by the Apache Software Foundation and written in Java, for storing and processing huge datasets on clusters of commodity hardware.

There are mainly two problems with big data: the first is storing such a huge amount of data, and the second is processing that stored data.

Accordingly, Hadoop has two main components: the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN).
What is Hadoop
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very huge in volume.
Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing.
It is used by Facebook, Yahoo, Twitter, LinkedIn, and many more.
Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on that basis. Files are broken into blocks and stored on nodes across the distributed architecture.
YARN: Yet Another Resource Negotiator, used for job scheduling and for managing the cluster.
MapReduce: A framework that helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (see the sketch below).
Hadoop Common: Java libraries used to start Hadoop and used by the other Hadoop modules.
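
To make the key-value flow of Map and Reduce concrete, here is a minimal word-count sketch in Java using the standard org.apache.hadoop.mapreduce API. It follows the shape of the canonical Hadoop tutorial example; the class name and input/output paths are placeholders, not course-specific code.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map task: turns each input line into (word, 1) key-value pairs.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE); // emit (word, 1)
                }
            }
        }

        // Reduce task: sums all the counts received for the same word key.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result); // emit (word, total count)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation per mapper
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a jar, this would typically be launched with something like: hadoop jar wordcount.jar WordCount /input /output (hypothetical paths).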
Hadoop Architecture
DFS (Distributed File System)
A Distributed File System (DFS) is a file system that is distributed across multiple file servers or multiple locations. It allows programs to access or store isolated files as they do local ones, and allows programmers to access files from any network or computer.
DFS is a technology that lets you group shared folders located on different servers into one or more logically structured namespaces.
The main purpose of a DFS is to allow users of physically distributed systems to share their data and resources through a common file system.
A typical DFS configuration is a collection of workstations and mainframes connected by a Local Area Network (LAN).
A DFS is implemented as a part of the operating system. In DFS, a namespace is created, and this process is transparent to the clients.
Components of DFS
Location Transparency: users access files without needing to know their physical location; this is achieved through the namespace component.
Redundancy: achieved through file replication, which improves availability and fault tolerance.

Although clients access and share files as if they were local, the servers retain complete control over the data and provide users with access control.
Hadoop Distributed File System
Hadoop's distributed file system, HDFS, splits files into blocks and distributes them across the nodes of large clusters. In case of a node failure, the system keeps operating, and the data transfer between nodes is facilitated by HDFS (a minimal client sketch follows below).

Advantages of HDFS: it is inexpensive, its files are immutable (write-once), it stores data reliably, tolerates faults, scales well, is block-structured, and can process large amounts of data in parallel.

Disadvantages of HDFS: its biggest disadvantage is that it is not a good fit for small quantities of data; it also has potential stability issues and can be restrictive and rough to work with.

Hadoop also supports a wide range of software packages such as Apache Flume, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
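
To illustrate how a client program talks to HDFS, here is a minimal Java sketch using the org.apache.hadoop.fs.FileSystem API. The NameNode address (localhost:9000), the replication factor, and the file path are assumptions made for this example only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000"); // hypothetical NameNode address
            conf.set("dfs.replication", "3");                  // each block stored on 3 nodes

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/sample.txt");     // hypothetical HDFS path

            // Write a small file; HDFS files are write-once (immutable after close).
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeBytes("hello hdfs\n");
            }

            // Read the file back from the cluster.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n));
            }

            // Show which DataNodes hold the blocks of this file.
            long len = fs.getFileStatus(file).getLen();
            System.out.println(java.util.Arrays.toString(
                    fs.getFileBlockLocations(file, 0, len)));
            fs.close();
        }
    }

The same API applies to large files: HDFS transparently splits the stream into blocks (128 MB by default) and replicates each block across DataNodes.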
Some common frameworks of Hadoop
Hive- It uses HiveQL for data structuring and for writing complicated MapReduce logic over HDFS.
Drill- It consists of user-defined functions and is used for data exploration.
Storm- It allows real-time processing and streaming of data.
Spark- It contains a Machine Learning library (MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.
Pig- It has Pig Latin, a SQL-like language, and performs data transformation on unstructured data.
Tez- It reduces the complexities of Hive and Pig and helps their code run faster.
The Hadoop framework is made up of the following modules:
Hadoop MapReduce- a MapReduce programming model for handling and processing large data.
Hadoop Distributed File System- distributes files in blocks across the nodes of a cluster.
Hadoop YARN- a platform which manages computing resources.
Hadoop Common- packages and libraries used by the other Hadoop modules.
Advantages
It allows users to access and store data.
It helps to improve access time, network efficiency, and availability of files.
It provides transparency of data even if a server or disk fails.
It permits data to be shared remotely.
It helps to enhance the ability to scale the amount of data and to exchange data.
Disadvantages
In a DFS, the database connection is complicated.
In a DFS, database handling is also more complex than in a single-user system.
If all nodes try to transfer data simultaneously, there is a chance that overloading will happen.
There is a possibility that messages and data would be lost in the network while moving from one node to another.
Hadoop has several key features that make it well-suited for big data processing:
Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines, making it easy to add capacity as needed.
Fault Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to operate even in the presence of hardware failures.
Data Locality: Hadoop stores data on the same node where it will be processed; this reduces network traffic and improves performance.
High Availability: Hadoop's high-availability features help ensure that data is always available and is not lost.
Flexible Data Processing: Hadoop's MapReduce programming model allows data to be processed in a distributed fashion, making it easy to implement a wide variety of data processing tasks.
Data Integrity: Hadoop's built-in checksum feature helps ensure that stored data is consistent and correct.
Data Replication: Hadoop replicates data across the cluster for fault tolerance.
Data Compression: Hadoop's built-in data compression helps reduce storage space and improve performance.
YARN: a resource management platform that allows multiple data processing engines, such as real-time streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.
Disadvantages
Not very effective for small data.
Hard cluster management.
Has stability issues.
Security concerns.
Complexity: Hadoop can be complex to set up and maintain, especially for organizations without a dedicated team of experts.
Latency: Hadoop is not well-suited for low-latency workloads and may not be the best choice for real-time data processing.
Limited Support for Real-time Processing: Hadoop's batch-oriented nature makes it less suited for real-time streaming or interactive data processing use cases.
Limited Support for Structured Data: Hadoop is designed to work with unstructured and semi-structured data; it is not well-suited for structured data processing.
Data Security: Hadoop's built-in security features, such as data encryption and user authentication, are limited out of the box, which can make it difficult to secure sensitive data.
Limited Support for Ad-hoc Queries: Hadoop's MapReduce programming model is not well-suited for ad-hoc queries, making it difficult to perform exploratory data analysis.
Limited Support for Graph and Machine Learning: Hadoop's core components, HDFS and MapReduce, are not well-suited for graph and machine learning workloads; specialized components like Apache Giraph and Mahout are available but have some limitations.
Cost: Hadoop can be expensive to set up and maintain, especially for organizations with large amounts of data.
Data Loss: in the event of a hardware failure, data stored on a single node may be lost permanently.
Data Governance: data governance is a critical aspect of data management, and Hadoop does not provide built-in features to manage data lineage, data quality, data cataloging, and data audit.
References
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
https://www.databricks.com/glossary/hadoop-distributed-file-system-hdfs
https://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
https://www.geeksforgeeks.org/hadoop-history-or-evolution/
https://data-flair.training/blogs/hadoop-history/
