BIG DATA ANALYTICS
Understanding Big Data:
HDFS, NoSQL, Functions in R
Presented By
Sidrah Mohammadi Waris
AI&DS-III yr-II sem
22L51A7218
Introduction To HDFS
The Storage Backbone of Big Data
HDFS, or Hadoop Distributed File System, is a core
component of the Hadoop ecosystem used to store
large volumes of data reliably. It works by dividing
files into large blocks and distributing them across
several computers (DataNodes). A central server
(NameNode) manages where each part is stored.
HDFS is designed to handle hardware failures
automatically through data replication, and it
supports high-speed, batch-style processing,
making it ideal for big data applications.
Hadoop Distributed File System
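The block-and-replica idea can be made concrete with a little arithmetic. The sketch below assumes a 128 MB block size and a replication factor of 3, which are common HDFS defaults but are configurable per cluster; the file sizes are hypothetical values chosen only for illustration.

```r
# Assumed values: 128 MB blocks and 3 replicas are common HDFS defaults,
# but both are configurable per cluster.
block_size_mb <- 128
replication   <- 3

# Hypothetical file sizes in MB, chosen only for illustration.
file_sizes_mb <- c(small = 100, medium = 500, large = 1024)

# A file occupies ceiling(size / block_size) blocks; the last block may be partial.
blocks_per_file <- ceiling(file_sizes_mb / block_size_mb)

# Every block is stored `replication` times, so raw usage is size * replication.
raw_storage_mb <- file_sizes_mb * replication

summary_table <- data.frame(
  size_mb        = file_sizes_mb,
  blocks         = blocks_per_file,
  raw_storage_mb = raw_storage_mb
)
print(summary_table)
```

Under these assumptions a 500 MB file needs 4 blocks and about 1500 MB of raw cluster storage, which is the price paid for surviving hardware failures.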
PURPOSE OF HDFS IN HADOOP ECOSYSTEM
HDFS stands for Hadoop Distributed File System.
It is the primary storage system of Hadoop.
It acts as a central storage layer that allows other components like MapReduce, Hive, Pig, and Spark to access data efficiently.
HDFS is designed for high throughput and reliable data storage across clusters.
It supports very large files, which are split into blocks so they can be processed in parallel.
HDFS is fault-tolerant: replicated blocks keep data safe even in the event of hardware failures.
HDFS Architecture
HDFS follows a master-slave architecture that ensures reliable, scalable, and fault-tolerant storage for big data applications.
Components
NameNode: The master node that manages metadata (file names, block locations).
DataNodes: Worker nodes where the actual data is stored.
Secondary NameNode: Periodically merges the NameNode's edit log into its metadata snapshot (a checkpoint) so the log does not grow without bound. It is a helper, not a standby NameNode.
HDFS stores each file by splitting it into blocks and distributing those blocks across multiple DataNodes, ensuring speed, scalability, and data safety.
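The division of labour between the NameNode and the DataNodes can be sketched as a small simulation. Everything here is an assumption made for illustration: the node names, the replication factor of 3, and especially the round-robin placement rule, which stands in for HDFS's real rack-aware policy.

```r
# Hypothetical cluster: four DataNodes, replication factor 3 (assumed).
data_nodes  <- c("dn1", "dn2", "dn3", "dn4")
replication <- 3
n_blocks    <- 4   # e.g. a 500 MB file with 128 MB blocks

# Toy placement: block i goes to `replication` consecutive nodes, wrapping around.
# Real HDFS uses a rack-aware policy; this round-robin rule is only illustrative.
place_block <- function(i) {
  idx <- (((i - 1) + 0:(replication - 1)) %% length(data_nodes)) + 1
  data_nodes[idx]
}

# This list plays the role of the NameNode's metadata: block -> DataNodes.
block_map <- lapply(seq_len(n_blocks), place_block)
names(block_map) <- paste0("block_", seq_len(n_blocks))

# Simulate one DataNode failing and check that every block is still readable.
failed   <- "dn2"
readable <- vapply(block_map, function(nodes) any(nodes != failed), logical(1))

print(block_map)
print(all(readable))
```

Because each block lives on three different nodes, losing any single DataNode still leaves at least two copies of every block in this toy model.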
NOSQL
NoSQL databases are well suited to modern big data workloads that traditional relational (SQL) databases struggle to handle.
NoSQL (Not Only SQL) refers to a class of non-relational databases designed to store and manage large volumes of unstructured or semi-structured data with high scalability, flexibility, and performance. Unlike traditional SQL databases, NoSQL systems support dynamic schemas and horizontal scaling, making them ideal for big data and real-time applications.
📋❌ No fixed table rules
💾 Handles huge data
⬆️ Super fast at scaling
🌐📱 Perfect for modern apps
Types of NoSQL Databases
“Whether it’s keys, docs, columns, or graphs, NoSQL has a type for every kind of data need.”
NoSQL = freedom to store data your way.
Types:
🔑 Key-Value Pair: Stores data as a collection of key and value pairs
📄 Document Based: Stores and organizes data in flexible, JSON-like documents
📊 Column-Family: Stores data in columns rather than rows
🔗 Graph Database: Stores data using nodes and relationships
Samples of types of NoSQL databases

🔑 1. Key-Value Pair (e.g., Redis)
SET user:101 "{'name':'Laila', 'age':20}"
GET user:101
Key: user:101
Value: {'name':'Laila', 'age':20}
Output:
{'name':'Laila', 'age':20}

📄 2. Document Based (e.g., MongoDB)
db.users.insertOne({
  _id: 101,
  name: "Laila",
  age: 20,
  hobbies: ["reading", "music"]
})
// Read the document back
db.users.findOne({ _id: 101 })
Output:
{
  "_id": 101,
  "name": "Laila",
  "age": 20,
  "hobbies": ["reading", "music"]
}

📊 3. Column-Family (e.g., Cassandra)
INSERT INTO users (id, name, age, hobbies) VALUES (101, 'Laila', 20, ['reading', 'music']);
-- Read the row back
SELECT id, name, age, hobbies FROM users WHERE id = 101;
Output:
 id  | name  | age | hobbies
-----+-------+-----+----------------------
 101 | Laila |  20 | ['reading', 'music']

🔗 4. Graph Database (e.g., Neo4j)
CREATE (laila:Person {name: 'Laila', age: 20})
CREATE (book:Interest {type: 'Reading'})
CREATE (laila)-[:LIKES]->(book)
// Query the relationship as a separate statement
MATCH (p:Person)-[:LIKES]->(i:Interest) RETURN p.name, i.type
Output:
+--------+---------+
| p.name | i.type  |
+--------+---------+
| Laila  | Reading |
+--------+---------+
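The key-value idea can also be mimicked in plain R, which may help when connecting this topic to the R material later in the deck. This is only a sketch: `kv_set` and `kv_get` are hypothetical helper names, the store is an in-memory R environment, and nothing here talks to a real database such as Redis.

```r
# An environment used as a hash map: keys are strings, values are arbitrary R objects.
store <- new.env(hash = TRUE)

# Hypothetical helpers mirroring the SET/GET idea; not a real database API.
kv_set <- function(key, value) assign(key, value, envir = store)
kv_get <- function(key) {
  if (exists(key, envir = store, inherits = FALSE)) {
    get(key, envir = store, inherits = FALSE)
  } else {
    NULL  # a missing key yields NULL instead of an error
  }
}

# The value is a schema-free record, similar in spirit to a JSON-like document.
kv_set("user:101", list(name = "Laila", age = 20, hobbies = c("reading", "music")))

user <- kv_get("user:101")
print(user$name)
print(kv_get("user:999"))
```

Each value can have its own set of fields, which is the "no fixed table rules" property in miniature.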
Functions In R
A function is a set of statements organized together to perform a specific
task. R has a large number of in-built functions and the user can create their
own functions.
In R, a function is an object so the R interpreter is able to pass control to the
function, along with arguments that may be necessary for the function to
accomplish the actions.
The function in turn performs its task and returns control to the interpreter
as well as any result which may be stored in other objects.
Creating A Function:
An R function is created by using the keyword function().
Syntax
function_name <- function(arg_1, arg_2, ...) {
  Function body
}
Calling A Function:
Calling a function means executing it by writing its name followed by its arguments in parentheses.
Syntax
function_name(argument1, argument2, ...)
Example
sqrt(49)
O/P: 7
sum(1, 2) # Calling sum function
O/P: 3
Functions are the building blocks of R programming, allowing you to
write once, use many times — and stay DRY (Don’t Repeat Yourself).
Function Components
Every R function is like a recipe, with ingredients (arguments), instructions (body), and a finished dish (return value).
Function Name:
This is the actual name of the function. It is stored in the R environment as an object with this name.
Arguments:
An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may have no arguments. Arguments can also have default values.
Function Body:
The function body contains a collection of statements that defines what the function does.
Return Value:
The return value of a function is the last expression in the function body to be evaluated, unless return() is called earlier.
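The four components can be seen together in a short sketch. The function names `rectangle_area` and `say_hi` are hypothetical and chosen only for illustration; `formals()` and `body()` are base R functions for inspecting a function object.

```r
# Name: rectangle_area (hypothetical). Arguments: width, and height with a default.
rectangle_area <- function(width, height = width) {
  # Body: a single expression. It is the last one evaluated,
  # so its value is returned without an explicit return().
  width * height
}

# A function may also take no arguments at all.
say_hi <- function() "hi"

print(rectangle_area(3, 4))  # both arguments supplied
print(rectangle_area(5))     # height falls back to its default (width)
print(say_hi())

# The components can be inspected on the function object itself.
print(names(formals(rectangle_area)))
print(body(say_hi))
```

Calling `rectangle_area(5)` works because the default for `height` is only evaluated when it is needed, at which point `width` already has a value.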
Types Of Functions In R
1. Built-in Functions
R has many built-in functions that come pre-defined and can be called directly in a program without defining them first.
A few of the built-in functions are as follows:
Function    | Description                  | Example            | Output
------------+------------------------------+--------------------+-------
1. sum()    | Adds numbers                 | sum(1, 2, 3)       | [1] 6
2. mean()   | Calculates the average       | mean(c(2, 4))      | [1] 3
3. sqrt()   | Finds the square root        | sqrt(16)           | [1] 4
4. length() | Returns length of a vector   | length(c(1, 2, 3)) | [1] 3
2. User-Defined Functions
We can create user-defined functions in R. They are specific to what a user wants and, once created, they can be used like the built-in functions.
Example:
greet <- function(name) {
  # paste0() joins without separators, so no space appears before "!"
  greeting <- paste0("Hello ", name, "!")
  return(greeting)
}
greet("Laila")
Output:
[1] "Hello Laila!"
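To illustrate that a user-defined function behaves like a built-in one, the sketch below passes both kinds to `sapply()` in exactly the same way. The helper `spread` and the `scores` data are hypothetical, made up for this example.

```r
# Hypothetical helper: the width of the range of a numeric vector.
spread <- function(x) max(x) - min(x)

# Made-up data for illustration.
scores <- list(math = c(70, 85, 90), art = c(60, 95))

# Built-in and user-defined functions are passed to sapply() the same way.
print(sapply(scores, mean))
print(sapply(scores, spread))
```

Functions are ordinary objects in R, so `spread` can be handed to another function just like `mean`.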
3. Return statement
The return() function is used to return output from a function.
If not explicitly used, R returns the last evaluated expression.
Example:
myfunc <- function(x) {
  return(x + 10)
}
myfunc(5)
Output:
[1] 15

4. Nested Functions
A function defined inside another function.
Useful for organizing logic or keeping helper functions private.
Example:
outer_func <- function(a) {
  inner_func <- function(b) {
    return(b^2)
  }
  return(inner_func(a) + 1)
}
outer_func(3)
Output:
[1] 10

5. Function Scoping
R uses lexical scoping, which means that the value of a variable is looked up in the environment where the function was defined, not where it is called.
Example
x <- 10
scope_example <- function() {
  x <- 5
  return(x)
}
scope_example() # Returns 5 because x is 5 inside the function scope
x               # Returns 10 as the global x remains unchanged
Output:
[1] 5
[1] 10
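The "defined, not called" part of lexical scoping is easier to see when the function is called from a place that has its own variable of the same name. The sketch below uses hypothetical names (`show_x`, `caller`, `make_counter`) and adds a small closure to show a function remembering the environment it was created in.

```r
x <- "global"

# show_x has no local x, so it looks x up where it was defined:
# the global environment.
show_x <- function() x

caller <- function() {
  x <- "local to caller"  # this x is invisible to show_x
  show_x()
}

print(caller())

# A closure: the inner function remembers `count` from the environment
# in which it was created.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # <<- updates count in the enclosing environment
    count
  }
}

counter <- make_counter()
first   <- counter()
second  <- counter()
print(c(first, second))
```

Under dynamic scoping `caller()` would have returned the caller's own `x`; in R it returns the global value because that is where `show_x` was defined.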
6. Recursion
A recursive function is a function that calls itself.
Recursion is useful for tasks that can be divided into similar subtasks, such as calculating factorials.
Example: Factorial Function
# Note: this definition masks base R's built-in factorial()
factorial <- function(n) {
  if (n <= 1) {
    return(1)
  } else {
    return(n * factorial(n - 1))
  }
}
factorial(5)
Output:
[1] 120

Loading an R Package
R packages contain functions, data, and code that you can use in your R scripts. You can load a package using library() or require().
You install a package once, but you must load it in every new R session in which you use it.
Example:
# Install a package if not already installed
if (!require(ggplot2)) install.packages("ggplot2")
# Load the package
library(ggplot2)
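A slightly more defensive variant of the loading pattern is sketched below. It uses the `stats` package only because it ships with R and so needs no installation; that choice is an assumption made to keep the example runnable anywhere. `requireNamespace()` checks availability without attaching the package, whereas `library()` stops with an error if the package is missing and `require()` returns FALSE with a warning.

```r
# "stats" ships with R, so this runs without installing anything.
pkg <- "stats"

# requireNamespace() returns TRUE/FALSE without attaching the package.
if (!requireNamespace(pkg, quietly = TRUE)) {
  stop("Package '", pkg, "' is not installed; install it with install.packages() first.")
}

# pkg::fun calls a function without library(); useful for avoiding name clashes.
print(stats::median(c(3, 1, 2)))

# library() attaches the package so its functions can be called by bare name.
library(stats)
print(median(c(10, 20, 30, 40)))
```

The `pkg::fun` form is handy in scripts that use only one or two functions from a package.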
Thank You!