Unit-3 BDA
Examples:
Arrays, lists,
Nested documents,
Sets.
• Fraud detection:
Patterns in financial transactions.
Introduction to Graph Databases
Graph Databases:
Store data as nodes, edges, and properties.
Nodes represent entities, edges represent relationships,
properties contain attributes.
Examples:
Neo4j
Amazon Neptune
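The node, edge, and property model described above can be sketched in a few lines of Python. This is only an in-memory illustration, not how Neo4j or Amazon Neptune store data; the class name `PropertyGraph`, its methods, and the sample people and company are all hypothetical.

```python
from collections import defaultdict


class PropertyGraph:
    """A tiny in-memory property graph: nodes, directed edges, properties."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties dict
        self.edges = defaultdict(list)  # source id -> [(relationship, target id, properties)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, relationship, target, **properties):
        self.edges[source].append((relationship, target, properties))

    def neighbors(self, source, relationship):
        """Return ids of nodes reachable from `source` over edges of one type."""
        return [t for (r, t, _) in self.edges[source] if r == relationship]


graph = PropertyGraph()
graph.add_node("alice", label="Person", age=30)    # entity with attributes
graph.add_node("bob", label="Person", age=25)
graph.add_node("acme", label="Company")
graph.add_edge("alice", "KNOWS", "bob", since=2020)  # relationship with a property
graph.add_edge("alice", "WORKS_AT", "acme")

print(graph.neighbors("alice", "KNOWS"))  # ['bob']
```

Following a relationship is a direct lookup on the source node, which is why graph stores suit relationship-heavy queries.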
Schema-less Databases
Definition:
Schema-less databases, also known as schema-flexible
databases, allow data to be stored without a predefined
schema.
Benefits:
• Flexibility: Data can be added or modified without altering the
entire database schema.
• Agility: Rapid development and iteration, especially in evolving or
experimental projects.
• Scalability: Easier to scale horizontally, as no schema change
has to be coordinated across servers.
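A minimal sketch of the flexibility benefit, using plain Python dictionaries as stand-ins for documents. The collection `users`, the helper names, and the sample records are hypothetical and do not correspond to any particular database's API.

```python
# A "collection" is just a list of documents; no schema is declared up front.
users = []

users.append({"name": "Asha", "email": "asha@example.com"})
# A later record introduces new fields; existing records need no migration.
users.append({"name": "Ravi", "phones": ["555-0100"], "address": {"city": "Pune"}})


def find(collection, field, value):
    """Return documents in which `field` is present and equals `value`."""
    return [doc for doc in collection if field in doc and doc[field] == value]


def fields_in_use(collection):
    """List every field name that appears in at least one document."""
    return sorted({key for doc in collection for key in doc})


print(find(users, "name", "Ravi"))
print(fields_in_use(users))  # ['address', 'email', 'name', 'phones']
```

The trade-off is that code reading the data must cope with fields that are missing from some documents.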
Materialized Views
Purpose:
Materialized views are precomputed views of data that are stored
physically and updated periodically based on changes to the underlying
data.
Benefits:
– Improved Performance: Materialized views can significantly speed up
query processing by precomputing and storing frequently accessed or
complex query results.
– Reduced Complexity: They simplify query execution by reducing the
need for complex joins or computations, leading to more efficient data
retrieval.
– Enhanced Scalability: Materialized views can help distribute query load
and improve scalability by offloading heavy query processing tasks to
precomputed views.
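The idea of a stored, periodically refreshed result can be sketched as follows. This is an illustration in plain Python under assumed names (`TotalsByCustomerView`, `orders`), not the mechanism any specific database uses; it performs a full refresh, whereas real systems may refresh incrementally.

```python
from collections import defaultdict

# Hypothetical base data: one row per order.
orders = [
    {"customer": "c1", "amount": 40.0},
    {"customer": "c2", "amount": 15.0},
    {"customer": "c1", "amount": 10.0},
]


class TotalsByCustomerView:
    """Stores a precomputed aggregate so reads avoid rescanning the base data."""

    def __init__(self, base_rows):
        self.base_rows = base_rows
        self.totals = {}
        self.refresh()

    def refresh(self):
        """Recompute the stored result from the base data (a full refresh)."""
        totals = defaultdict(float)
        for row in self.base_rows:
            totals[row["customer"]] += row["amount"]
        self.totals = dict(totals)

    def total_for(self, customer):
        """Answer from the stored result without touching the base rows."""
        return self.totals.get(customer, 0.0)


view = TotalsByCustomerView(orders)
print(view.total_for("c1"))  # 50.0

orders.append({"customer": "c2", "amount": 5.0})
print(view.total_for("c2"))  # 15.0, stale until the next refresh
view.refresh()
print(view.total_for("c2"))  # 20.0
```

Reads are fast because the aggregation is already done, at the cost of results that can be stale between refreshes.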
Distribution Models
Distribution models in NoSQL databases determine
how data is distributed across multiple nodes or servers in
a distributed environment.
Common distribution models include replication,
partitioning, and hybrid approaches.
In short, they decide how data spreads across many servers.
Think of it like sharing a pizza: each server gets a slice of the data.
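Partitioning gives each server a different slice, while replication gives several servers the same copy. A minimal sketch of replication, assuming synchronous writes to every copy; the class `ReplicatedStore` and its methods are hypothetical.

```python
class ReplicatedStore:
    """Keeps a full copy of the data on every replica."""

    def __init__(self, num_replicas):
        self.replicas = [{} for _ in range(num_replicas)]

    def put(self, key, value):
        # Assumption: synchronous replication, so every copy is written at once.
        for replica in self.replicas:
            replica[key] = value

    def get(self, key, replica_index=0):
        # Any replica can serve a read because all copies hold the same data.
        return self.replicas[replica_index].get(key)


store = ReplicatedStore(num_replicas=3)
store.put("user:1", {"name": "Asha"})
print(store.get("user:1", replica_index=2))  # {'name': 'Asha'}
```

A hybrid approach combines both: the data is partitioned, and each partition is replicated.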
Overview of Sharding
Sharding is a database partitioning technique where large
datasets are divided into smaller, more manageable parts
called shards.
Each shard is stored on a separate server or node.
Sharding helps distribute data and query load evenly across
multiple servers, improving scalability and performance.
Imagine it like organizing a library: spreading books across
multiple shelves for quicker access.
Limits the impact of failures: if one server fails, shards on the other servers stay available.
Speeds up data access by spreading the workload.
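One common way to decide which shard holds a record is to hash its key. The sketch below assumes hash-based sharding with a fixed number of shards; `ShardedStore` and the `user:N` keys are hypothetical, and each in-memory dictionary stands in for a separate server.

```python
import hashlib


class ShardedStore:
    """Routes each key to one of several shards using a stable hash."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]  # one dict per "server"

    def shard_for(self, key):
        # Use a stable hash; Python's built-in hash() of a str varies between runs.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        # Only the shard that owns the key is consulted.
        return self.shards[self.shard_for(key)].get(key)


store = ShardedStore(num_shards=3)
for i in range(9):
    store.put(f"user:{i}", {"id": i})

print(store.get("user:4"))                # {'id': 4}
print([len(s) for s in store.shards])     # number of records held by each shard
```

With simple modulo routing, changing the number of shards remaps most keys, which is one reason production systems often use schemes such as consistent hashing instead.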
Versioning
Versioning tracks changes made to data or documents over
time, creating a historical record of edits and updates.
Enables auditing, compliance, and error recovery.
Methods:
Timestamps: Assigns a timestamp to each data change, enabling
chronological tracking of revisions.
Incremental Versioning: Assigns a unique version number to each
update, allowing easy comparison and retrieval of specific
versions.
Branch Versioning: Creates separate branches for different
lines of change to the data, enabling parallel development or
experimentation without affecting the main version.
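The first two methods can be combined in a small sketch: each update receives the next version number and a timestamp, and earlier versions remain retrievable. The class `VersionedDocument` and its fields are hypothetical; branch versioning is not covered here.

```python
import time


class VersionedDocument:
    """Keeps every revision with an incremental version number and a timestamp."""

    def __init__(self, initial):
        self.history = []
        self.update(initial)

    def update(self, content):
        """Append a new revision and return its version number (starting at 1)."""
        version = len(self.history) + 1
        self.history.append({
            "version": version,
            "timestamp": time.time(),   # when this revision was recorded
            "content": dict(content),   # copy so later edits do not alter history
        })
        return version

    def latest(self):
        return self.history[-1]["content"]

    def get(self, version):
        """Retrieve the content stored under a specific version number."""
        return self.history[version - 1]["content"]


doc = VersionedDocument({"title": "Draft"})
doc.update({"title": "Final", "reviewed": True})

print(doc.get(1))    # {'title': 'Draft'}
print(doc.latest())  # {'title': 'Final', 'reviewed': True}
```

Because old revisions are never overwritten, the history supports both auditing and recovery from a bad update.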
MapReduce: Overview and Significance
Overview:
MapReduce is a programming model for processing and
generating large datasets in parallel.
It consists of two main phases, Map and Reduce, with a Shuffle & Sort step between them.
Significance:
Revolutionized big data processing by enabling
distributed computing on large datasets.
Key component in processing and analyzing massive
volumes of data efficiently.
MapReduce in NoSQL Databases
How it Works:
NoSQL databases leverage MapReduce for distributed data processing.
MapReduce tasks are executed across multiple nodes in the database
cluster.
Benefits:
Enables parallel processing and distributed computation.
Enhances scalability and performance of NoSQL databases for handling
large datasets.
MapReduce Phases: Map, Shuffle & Sort, Reduce.
Example:
In MongoDB, MapReduce can be used for aggregating, filtering, and
analyzing large volumes of data stored in document collections.
(Since MongoDB 5.0, map-reduce is deprecated in favor of the aggregation pipeline.)
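The Map, Shuffle & Sort, and Reduce steps can be illustrated with a single-process word count. This sketch is plain Python, not MongoDB's API; the function names and sample documents are hypothetical, and a real cluster would run the map and reduce calls on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle_and_sort(pairs):
    """Shuffle & Sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return (key, sum(values))


documents = ["big data big ideas", "data moves fast"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_and_sort(mapped)
result = dict(reduce_phase(key, values) for key, values in grouped)

print(result)  # {'big': 2, 'data': 2, 'fast': 1, 'ideas': 1, 'moves': 1}
```

Each document can be mapped independently and each key can be reduced independently, which is what makes the model easy to parallelize.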
Utilization of Partitioning and Combining
NoSQL databases employ Partitioning to distribute data across
nodes for scalability and fault tolerance.
Combining is utilized during query processing to aggregate and
summarize data from multiple partitions, reducing the need for
data movement and improving query performance.
Example:
Partitioning: Imagine dividing a large library into smaller sections.
NoSQL databases use partitioning to split data across servers.
Combining: Then, you count or summarize information from all
sections.
Combining merges data from partitions for efficient querying.
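The library analogy can be sketched directly: books are split into sections (partitions), each section is summarized locally (combining), and only the small summaries are merged. Round-robin placement, the function names, and the sample books are assumptions made for illustration.

```python
from collections import Counter


def partition(records, num_partitions):
    """Split records round-robin into partitions (one per hypothetical server)."""
    parts = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        parts[index % num_partitions].append(record)
    return parts


def combine(part):
    """Summarize one partition locally: count books per genre."""
    return Counter(book["genre"] for book in part)


def merge(partials):
    """Merge the per-partition summaries into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


books = [
    {"title": "A", "genre": "fiction"},
    {"title": "B", "genre": "science"},
    {"title": "C", "genre": "fiction"},
    {"title": "D", "genre": "history"},
    {"title": "E", "genre": "science"},
    {"title": "F", "genre": "fiction"},
]

partitions = partition(books, 3)
partials = [combine(part) for part in partitions]  # runs where the data lives
totals = merge(partials)                           # only small summaries move

print(totals["fiction"], totals["science"], totals["history"])  # 3 2 1
```

Only three small counters cross partition boundaries instead of six full records, which is the reduction in data movement that combining provides.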
Partitioning and Combining
Partitioning involves dividing a large dataset into
smaller, manageable partitions or shards.
It enables horizontal scalability by distributing data
across multiple nodes.
Examples:
Word Count: Counts the frequency of words in a document.
Average Calculation: Computes the average value of a dataset.
Sorting: Orders data based on specific criteria, like alphabetical
order or numerical value.
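The average calculation shows why the partial results must be chosen carefully. In this sketch, each partition emits a (sum, count) pair rather than its own average; the function names and sample values are hypothetical.

```python
def summarize_partition(values):
    """Emit one (sum, count) pair for a partition instead of a local average."""
    return (sum(values), len(values))


def reduce_average(partials):
    """Combine the (sum, count) pairs and compute the overall average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Partitions of unequal size, as would be typical across nodes.
partitions = [[10, 20, 30], [40], [50, 60]]

partials = [summarize_partition(p) for p in partitions]
print(partials)                  # [(60, 3), (40, 1), (110, 2)]
print(reduce_average(partials))  # 35.0
```

Averaging the three per-partition averages (20, 40, and 55) would give about 38.3 instead of 35.0, because the partitions differ in size; carrying the sum and count avoids that error.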
Conclusion: Recap of Key Points
NoSQL Data Management:
Explored various NoSQL concepts including aggregate data models, graph databases,
and MapReduce.
Importance:
NoSQL offers flexibility, scalability, and performance for modern data management
needs.
Key Takeaways:
– NoSQL databases provide diverse data models suited for different use cases.
– MapReduce enables distributed processing of large datasets.
– Techniques like partitioning and combining improve scalability and efficiency.
Future Directions:
Explore emerging trends and advancements in NoSQL technology for continued
innovation and growth.
Questions and Discussion
Feel free to ask questions or share insights. We're here to engage
and learn together.
Topics to Explore:
NoSQL databases
Aggregate data models
Graph databases
MapReduce
Partitioning and combining
Any related topics of interest
THANK YOU