Unit 5 DMS
Backup Strategies
1. Local Backup:
o Backups are stored on the same server or nearby local storage devices.
o Pros: Fast access for restoring data.
o Cons: Vulnerable to local disasters (e.g., fire, theft).
2. Remote Backup:
o Backups are stored on external servers or cloud storage.
o Pros: Provides protection against local failures and disasters.
o Cons: Slower recovery times and a dependence on network connectivity.
3. Cloud Backup:
o Backups are stored in cloud services like AWS, Google Cloud, or Azure.
o Pros: Off-site storage, automated backups, and scalability.
o Cons: Reliant on internet access and cloud service provider reliability.
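To make the remote and cloud options concrete, below is a minimal sketch of a cloud backup step: it uploads an existing local database dump to Amazon S3 using boto3. The bucket name, key prefix, and dump path are illustrative placeholders, and AWS credentials are assumed to be configured in the environment.
```python
# Cloud backup sketch: upload a local database dump to Amazon S3 with boto3.
# The bucket name, dump path, and key prefix are illustrative placeholders.
from datetime import datetime, timezone

import boto3

def backup_to_s3(dump_path: str, bucket: str = "db-backups") -> str:
    """Upload a dump file to S3 under a timestamped key and return the key."""
    key = f"backups/{datetime.now(timezone.utc):%Y-%m-%dT%H-%M-%S}.dump"
    s3 = boto3.client("s3")  # credentials come from the AWS environment
    s3.upload_file(dump_path, bucket, key)
    return key

if __name__ == "__main__":
    print("uploaded as", backup_to_s3("/var/backups/mydb.dump"))
```
Timestamped keys keep every backup individually retrievable, which pairs naturally with a retention policy on the bucket.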
Data Warehouse:
Definition: A Data Warehouse is a centralized repository used for storing and analyzing large volumes of
structured data from multiple sources. It’s specifically designed for business intelligence (BI) tasks like reporting,
querying, and data analysis.
Characteristics:
o Structured Data: Primarily stores structured data (e.g., relational data from operational systems).
o OLAP (Online Analytical Processing): Optimized for complex queries and analytical workloads rather
than transactional processing.
o Data Integration: Integrates data from various sources like transactional databases, external data
sources, and other systems.
o ETL Process: Data is extracted, transformed, and loaded (ETL) into the warehouse for analysis (a
miniature sketch follows this section).
Use Cases:
o Business intelligence (BI) reporting
o Trend analysis and decision-making support
o Historical data analysis
Examples:
o Amazon Redshift, Google BigQuery, Microsoft Azure Synapse Analytics.
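As a miniature illustration of the ETL process mentioned above, the following sketch uses SQLite (standing in for both the operational source and the warehouse) to extract raw rows, transform them into a daily aggregate, and load the result. All table and column names are invented for the example.
```python
# Minimal ETL sketch using only the standard library: extract rows from an
# operational SQLite database, transform them into a daily aggregate, and
# load the result into a warehouse table. Names are illustrative.
import sqlite3

source = sqlite3.connect(":memory:")     # stand-in for an operational database
warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Extract: seed and read raw transactional rows.
source.execute("CREATE TABLE sales (day TEXT, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("2024-01-01", 10.0), ("2024-01-01", 5.5), ("2024-01-02", 7.0)])
rows = source.execute("SELECT day, amount FROM sales").fetchall()

# Transform: aggregate per day (the shaping done before loading).
totals = {}
for day, amount in rows:
    totals[day] = totals.get(day, 0.0) + amount

# Load: write the structured, query-ready result into the warehouse.
warehouse.execute("CREATE TABLE daily_sales (day TEXT PRIMARY KEY, total REAL)")
warehouse.executemany("INSERT INTO daily_sales VALUES (?, ?)", totals.items())
warehouse.commit()
print(warehouse.execute("SELECT * FROM daily_sales ORDER BY day").fetchall())
```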
Data Lake:
Definition: A Data Lake is a large, centralized repository that stores vast amounts of raw, unstructured,
semi-structured, and structured data. It allows data to be stored in its native format until it is needed
for analysis.
Characteristics:
o Raw and Unstructured Data: Can store unstructured data like text, images, videos, logs, and social
media data, alongside structured data.
o Scalability: Highly scalable, capable of storing petabytes of data, often used in big data applications.
o Schema-on-Read: Data is stored without a predefined schema; the schema is applied when the data is
read or analyzed, making it more flexible for future analysis (see the sketch at the end of this section).
o Data Variety: Capable of storing diverse data types (e.g., JSON, XML, images, text).
Use Cases:
o Big data analytics
o Machine learning and data mining
o Real-time analytics and data exploration
Examples:
o Amazon S3 (with analytics tools like AWS Lake Formation), Microsoft Azure Data Lake Storage, Google
Cloud Storage.
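The schema-on-read idea can be shown in a few lines of Python: raw, heterogeneous JSON records are ingested as-is, and a schema is imposed only when a particular analysis reads them back. The file and field names below are illustrative.
```python
# Schema-on-read sketch: raw JSON lines are dumped into the "lake" as-is;
# a schema is only imposed at read time, when a specific analysis needs it.
import json
from pathlib import Path

lake_file = Path("events.jsonl")

# Ingest: store heterogeneous raw events without any upfront schema.
raw_events = [
    {"type": "click", "user": "u1", "ts": 1700000000},
    {"type": "purchase", "user": "u2", "amount": 19.99},
    {"note": "free-form text with no common structure"},
]
lake_file.write_text("\n".join(json.dumps(e) for e in raw_events))

# Read: apply a schema now, keeping only records that fit this analysis.
purchases = []
for line in lake_file.read_text().splitlines():
    record = json.loads(line)
    if record.get("type") == "purchase" and "amount" in record:
        purchases.append({"user": record["user"], "amount": float(record["amount"])})

print(purchases)  # [{'user': 'u2', 'amount': 19.99}]
```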
Difference Between Data Warehouse and Data Lake
Feature        | Data Warehouse                                         | Data Lake
Data Type      | Primarily structured data                              | Structured, semi-structured, and unstructured data
Storage Format | Data is processed and structured before storing        | Raw data stored in its native format
Schema         | Schema-on-write (predefined schema)                    | Schema-on-read (schema applied during analysis)
Use Case       | Business intelligence, reporting, historical analysis  | Big data analytics, machine learning, real-time processing
Processing     | Optimized for complex queries and reporting (OLAP)     | Optimized for flexible and scalable data storage and processing
1. Data Mining
Definition: Data mining is the process of discovering patterns, correlations, trends, and useful information from
large datasets using statistical, mathematical, and computational techniques. It transforms raw data into
valuable insights.
Techniques:
o Classification: Assigning items to predefined categories (e.g., spam vs. non-spam emails).
o Clustering: Grouping similar items without predefined labels (e.g., customer segmentation; see the
sketch below).
o Association Rule Mining: Finding interesting relationships between variables (e.g., market basket
analysis).
o Regression: Predicting a continuous value based on data (e.g., forecasting sales).
o Anomaly Detection: Identifying outliers or unusual patterns in data (e.g., fraud detection).
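As a small example of the clustering technique above, the sketch below segments a handful of made-up customers with k-means, assuming scikit-learn is installed; the two features and the cluster count are chosen purely for illustration.
```python
# Clustering sketch for customer segmentation, assuming scikit-learn is
# installed. Features and data are illustrative, not from a real dataset.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer: [annual_spend, visits_per_month].
customers = np.array([
    [200.0, 1], [220.0, 2], [250.0, 1],        # low-spend, infrequent
    [5000.0, 12], [5200.0, 15], [4800.0, 10],  # high-spend, frequent
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(model.labels_)           # cluster assignment per customer
print(model.cluster_centers_)  # centroid of each segment
```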
Applications:
o Market basket analysis
o Fraud detection
o Customer segmentation
o Predictive analytics (e.g., stock market forecasting)
2. Big Data
Definition: Big Data refers to extremely large datasets that are too complex or voluminous to be processed by
traditional database management systems (DBMS) or computing tools. Big data typically involves the 3 Vs:
o Volume: Large amounts of data (petabytes, exabytes).
o Velocity: The speed at which data is generated, processed, and analyzed (real-time or near real-time).
o Variety: Different types of data (structured, semi-structured, unstructured).
Characteristics:
o Scale: Big data can include vast amounts of data that come from diverse sources like social media,
sensors, logs, and more.
o Data Processing: Processing big data efficiently requires distributed computing frameworks such as
Hadoop and Spark (see the PySpark sketch after this section).
o Advanced Analytics: Often used for advanced analytics, predictive modeling, and machine learning.
Applications:
o Real-time analytics (e.g., stock market analysis)
o Internet of Things (IoT)
o Social media analysis
o Healthcare, genomics, and scientific research
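Since Hadoop and Spark are named above as typical processing engines, here is a minimal PySpark sketch of a distributed aggregation, assuming a local pyspark installation; the input path and column names are placeholders.
```python
# PySpark sketch: a distributed aggregation of the kind used for big data,
# assuming pyspark is installed. The JSON path and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

# In a real deployment this would point at a distributed store (HDFS, S3).
df = spark.read.json("events/*.json")

# Spark plans this group-by as parallel tasks across the cluster's workers.
(df.groupBy("user")
   .agg(F.count("*").alias("events"), F.sum("amount").alias("total"))
   .orderBy(F.desc("events"))
   .show())

spark.stop()
```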
3. MongoDB
Definition: MongoDB is an open-source, NoSQL database that stores data in a flexible, document-oriented
format using JSON-like documents (BSON - Binary JSON). MongoDB is designed to handle large-scale,
unstructured, or semi-structured data.
Key Features:
o Schema-less Design: Data is stored in JSON-like documents, allowing flexibility in the structure (no
predefined schema).
o Scalability: MongoDB is horizontally scalable, meaning it can distribute data across many servers
(sharding).
o High Availability: Supports replication, where data is duplicated across multiple nodes to ensure
availability.
o Aggregation Framework: Provides a powerful way to perform data analysis and aggregation.
Advantages:
o Flexibility and scalability
o High availability and fault tolerance
o Suitable for unstructured data
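A short pymongo sketch illustrates the schema-less design and the aggregation framework described above; it assumes a MongoDB server on localhost, and the database, collection, and field names are made up.
```python
# pymongo sketch: insert schema-less documents and run an aggregation,
# assuming a MongoDB server on localhost and pymongo installed.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Documents in one collection need not share a structure (schema-less design).
orders.insert_many([
    {"user": "u1", "items": ["pen"], "total": 2.5},
    {"user": "u2", "items": ["book", "lamp"], "total": 30.0, "coupon": "SAVE10"},
])

# Aggregation framework: total spend per user, highest first.
pipeline = [
    {"$group": {"_id": "$user", "spend": {"$sum": "$total"}}},
    {"$sort": {"spend": -1}},
]
for row in orders.aggregate(pipeline):
    print(row)
```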
4. DynamoDB
Definition: DynamoDB is a fully managed NoSQL database service provided by Amazon Web Services (AWS). It
is designed for high availability, scalability, and low-latency performance. DynamoDB is optimized for
applications that require consistent, single-digit millisecond response times.
Features:
o Key-Value and Document Data Model: It supports both key-value pairs and document-based structures,
allowing for flexible data representation.
o Scalable and Fast: DynamoDB can automatically scale up or down to handle increasing traffic without
manual intervention.
o Fully Managed: Amazon handles infrastructure, backups, replication, and scaling automatically.
o Global Replication: Allows for data replication across multiple regions for low-latency access.
Advantages:
o Automatic scaling and management
o Low-latency reads and writes
o Built-in fault tolerance and replication
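To ground the key-value model described above, here is a minimal boto3 sketch of one write and one keyed read; it assumes configured AWS credentials and a pre-existing table named Orders with a string partition key order_id, both of which are illustrative.
```python
# boto3 sketch for DynamoDB: one write and one key-value read, assuming AWS
# credentials are configured and a table "Orders" exists with partition key
# "order_id" (all names here are illustrative).
import boto3

table = boto3.resource("dynamodb").Table("Orders")

# Put a document-like item; attributes beyond the key are free-form.
table.put_item(Item={"order_id": "o-1001", "user": "u1", "total": 25})

# Single-item read by key: the access pattern DynamoDB serves at
# single-digit-millisecond latency.
response = table.get_item(Key={"order_id": "o-1001"})
print(response.get("Item"))
```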