Unit II

Data formats refer to the structure in which data is stored and processed, including structured, semi-structured, and unstructured types. A data model is a conceptual representation that organizes data in databases, with key elements like entities, attributes, and relationships, and various types including hierarchical and relational models. Data marts are smaller, subject-oriented data warehouses that provide faster access to specific data compared to larger data warehouses, with advantages and disadvantages in terms of cost, complexity, and data analysis capabilities.

Uploaded by

Shafeen Nagoor
Copyright
© All Rights Reserved

Unit II: Data Models

What is a Data Format?

• A data format is the structure or arrangement in which data is stored, processed, and
exchanged between systems. It specifies how information is organized for storage and
retrieval.
• In Big Data, understanding the different data formats is crucial for efficient storage and
retrieval.
• Data comes in a variety of formats (structured, semi-structured, and unstructured),
depending on its source and use case.

Types of Data Formats

1. Structured Data
o This data is well-organized and stored in predefined schemas such as rows and
columns in relational databases.
o It is easy to store, access, and process.
o Example: Sales records, employee data in an Excel sheet, or SQL databases.
2. Semi-structured Data
o This type of data does not follow a rigid schema but carries some organizational
markers, such as tags or key names, that make it easier to analyze than unstructured data.
o Common formats include XML and JSON. CSV files are often listed here as well,
although their fixed columns place them close to structured data.
o Example: Web server logs, sensor data, and emails.
3. Unstructured Data
o This data does not have a predefined format, making it the most challenging to store
and analyze.
o It requires specialized tools for processing.
o Example: Images, videos, social media posts, and multimedia files.
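
The three categories above can be illustrated with a minimal sketch. The sample records, table name, and field names are hypothetical and chosen only for illustration. The structured case uses an in-memory SQLite table, the semi-structured case uses JSON, and the unstructured case is plain free text.

```python
import json
import sqlite3

# Structured: a fixed schema is declared first, and every row must fit it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "pen", 2.5), (2, "notebook", 4.0)],
)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Semi-structured: keys describe the data, but records may differ in shape.
semi_structured_text = (
    '[{"user": "asha", "tags": ["a", "b"]},'
    ' {"user": "ravi", "email": "ravi@example.com"}]'
)
events = json.loads(semi_structured_text)
keys_per_record = [sorted(event.keys()) for event in events]

# Unstructured: free text with no schema; even a word count needs custom logic.
unstructured_text = "Loved the notebook, but the pen leaked after a day."
word_count = len(unstructured_text.split())

print(total)
print(keys_per_record)
print(word_count)
```

Note how the structured data can be queried directly with SQL, the semi-structured records have to be inspected one by one because their keys differ, and the unstructured text offers no fields at all.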

Importance of Understanding Data Formats:

• Helps in choosing the appropriate storage system.
• Optimizes data processing and retrieval.
• Improves decision-making by enabling better data analysis.

What is a Data Model?

A data model is a conceptual representation of how data is organized, stored, and accessed in
databases. It acts as a blueprint for designing and managing data in both traditional databases and
Big Data environments. Data models help manage large and complex datasets, ensuring efficiency
in data storage, retrieval, and analysis.

Key Elements of a Data Model

1. Entities:
o Objects or concepts about which data is stored.
o Example: In a university database, entities can be Students, Courses, and
Departments.
2. Attributes:
o Characteristics or properties of entities.
o Example: For the Student entity, attributes could be Name, ID Number, and Date of
Birth.
3. Relationships:
o Define how entities are related to each other.
o Example: A Student entity can be related to a Course entity through an "enrolled in"
relationship.
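
As a rough sketch of these three elements, the university example can be written with Python dataclasses. The class names, field names, and the `enroll` helper are illustrative assumptions, not part of any particular database system.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Course:
    """Entity: Course, with two attributes."""
    code: str
    title: str


@dataclass
class Student:
    """Entity: Student, with the attributes named in the example above."""
    student_id: int
    name: str
    date_of_birth: date
    # Relationship: the courses this student is enrolled in.
    enrolled_in: list[Course] = field(default_factory=list)

    def enroll(self, course: Course) -> None:
        # Record the "enrolled in" relationship once per course.
        if course not in self.enrolled_in:
            self.enrolled_in.append(course)


databases = Course(code="CS201", title="Databases")
student = Student(student_id=1, name="Asha", date_of_birth=date(2004, 3, 14))
student.enroll(databases)
student.enroll(databases)  # A repeated enrollment is ignored.
print([course.title for course in student.enrolled_in])
```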

Types of Data Models

1. Hierarchical Data Model:
o Data is organized in a tree-like structure. Each child node has a single parent.
o Example: File systems, where folders have subfolders or files.
o Usage: Early mainframe databases.
2. Relational Data Model:
o Represents data in tables (rows and columns) with relationships between tables.
o Example: SQL-based databases like MySQL and PostgreSQL.
o Usage: Banking systems, customer relationship management (CRM).
3. Network Data Model:
o Similar to the hierarchical model but allows many-to-many relationships between
entities.
o Usage: Complex applications like telecommunications networks.
4. Document Data Model:
o Stores data as documents, typically in JSON or BSON format.
o Example: MongoDB.
o Usage: Web applications, content management systems.
5. Key-Value Data Model:
o Stores data as key-value pairs for fast access.
o Example: Redis, Amazon DynamoDB.
o Usage: Real-time applications and caching.
6. Graph Data Model:
o Represents data as nodes and edges to show relationships.
o Example: Neo4j.
o Usage: Social network analysis, recommendation systems, fraud detection.
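
To compare the models side by side, the sketch below stores the same small fact (one customer with two orders) in several of the shapes listed above. It uses plain Python structures only; it does not call MongoDB, Redis, Neo4j, or any other real database API, and all names and values are hypothetical.

```python
# Relational: flat rows in separate tables, linked by a shared key.
customers = [{"customer_id": 1, "name": "Asha"}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
]
relational_total = sum(o["amount"] for o in orders if o["customer_id"] == 1)

# Hierarchical: a tree in which every child has exactly one parent.
tree = {
    "name": "Asha",
    "children": [
        {"name": "order 10", "amount": 25.0, "children": []},
        {"name": "order 11", "amount": 40.0, "children": []},
    ],
}


def subtree_amount(node):
    """Add up the amounts stored in a node and all of its descendants."""
    return node.get("amount", 0.0) + sum(subtree_amount(c) for c in node["children"])


hierarchical_total = subtree_amount(tree)

# Document: one nested record holds the customer together with its orders.
document = {
    "_id": 1,
    "name": "Asha",
    "orders": [{"order_id": 10, "amount": 25.0}, {"order_id": 11, "amount": 40.0}],
}
document_total = sum(o["amount"] for o in document["orders"])

# Key-value: opaque values fetched by key; there is no querying inside a value.
key_value = {"customer:1:name": "Asha", "customer:1:order_total": "65.0"}
key_value_total = float(key_value["customer:1:order_total"])

# Graph: nodes plus labelled edges that make the relationships explicit.
nodes = {
    "c1": {"type": "Customer", "name": "Asha"},
    "o10": {"type": "Order", "amount": 25.0},
    "o11": {"type": "Order", "amount": 40.0},
}
edges = [("c1", "PLACED", "o10"), ("c1", "PLACED", "o11")]
graph_total = sum(
    nodes[dst]["amount"] for src, label, dst in edges if src == "c1" and label == "PLACED"
)

print(relational_total, hierarchical_total, document_total, key_value_total, graph_total)
```

Each shape answers the same question (the customer's order total), but the work needed to get there differs: a filter across tables, a tree walk, a nested read, a single key lookup, or an edge traversal.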

Example of a Data Model in a Relational Database

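• Entity: Customer
o Attributes: Customer ID, Name, Email, Phone Number
• Entity: Orders
o Attributes: Order ID, Order Date, Amount
• Relationship: Customer places Orders. One customer can place many orders, so in a
relational table each order row would also carry the Customer ID as a foreign key.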
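
A minimal sketch of this model, using Python's built-in sqlite3 module with an in-memory database. The table and column names are adapted from the attributes above; the `customer_id` foreign key in `orders` and the sample rows are assumptions added to make the "places" relationship concrete.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked.
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id  INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        email        TEXT,
        phone_number TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        order_date  TEXT NOT NULL,
        amount      REAL NOT NULL,
        -- Assumed column: links each order to the customer who placed it.
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    );
    """
)

conn.execute("INSERT INTO customer VALUES (1, 'Asha', 'asha@example.com', '555-0100')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(10, "2024-01-05", 25.0, 1), (11, "2024-02-11", 40.0, 1)],
)

# Follow the relationship: join each customer to the orders they placed.
summary = conn.execute(
    """
    SELECT c.name, COUNT(o.order_id), SUM(o.amount)
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    """
).fetchall()
print(summary)
```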

Benefits of Appropriate Models and Storage Environments in Big Data

Selecting the right data model and storage environment is crucial for the success of Big Data
projects. The benefits include:

1. Scalability:
o Allows systems to handle large datasets and grow as data volume increases.
o Example: Hadoop Distributed File System (HDFS) and Amazon S3 provide
scalable storage solutions.
2. High Performance:
o Optimized data models enhance query performance and reduce latency.
o Example: Columnar storage like Apache Parquet for fast read-heavy analytics.
3. Cost-Effectiveness:
o Appropriate models and storage systems reduce hardware and maintenance costs.
o Example: Managed cloud services such as Google BigQuery remove the need to buy
and operate physical infrastructure.
4. Data Integration:
o Enables seamless integration of data from multiple sources, improving data quality
and consistency.
o Example: Data lakes integrate structured, semi-structured, and unstructured data.
5. Real-Time Analytics:
o Supports real-time data processing for immediate insights and decision-making.
o Example: Kafka and Spark Streaming for processing streaming data.
6. Data Security and Compliance:
o Modern storage environments offer advanced security features to protect sensitive
data.
o Example: Role-based access control (RBAC) and encryption in cloud storage
systems like Azure Data Lake.
7. Flexibility:
o Different models cater to different types of data (structured, semi-structured,
unstructured).
o Example: Document databases (MongoDB) for flexible schema designs.
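
The High Performance point about columnar storage can be illustrated with a small sketch. It only mimics the two layouts with Python lists and dictionaries; it does not read or write Apache Parquet files, and the sample rows are hypothetical.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "north", "amount": 25.0},
    {"order_id": 2, "region": "south", "amount": 40.0},
    {"order_id": 3, "region": "north", "amount": 10.0},
]

# Column-oriented layout: each field is stored as its own contiguous list.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytic query such as "total amount" needs only one field.
# With rows, every record has to be visited to pick that field out.
row_total = sum(row["amount"] for row in rows)

# With columns, the query reads a single list and ignores the other fields.
column_total = sum(columns["amount"])

print(row_total, column_total)
print(columns["region"])
```

On disk, the same idea lets a columnar format skip the columns a query does not use, which is why it suits read-heavy analytics.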

Examples of Storage Environments:

• Hadoop Distributed File System (HDFS): Stores and manages large datasets across
distributed clusters.
• Amazon S3: Cloud-based object storage for scalable and secure data storage.
• Google BigQuery: Serverless cloud data warehouse for running SQL analytics over large
datasets.

What is a Data Mart?

A data mart is a small, subject-oriented data warehouse that stores data specific to a business
function or department, such as sales or marketing. It provides faster access to relevant data for
analysis and decision-making.

Features of a Data Mart:

• Smaller and more focused than a data warehouse.
• Provides fast access to specific data.
• Helps in targeted business analysis.
• Easier to maintain and administer.

How is a data mart different from a data warehouse?

A data warehouse stores data for the entire organization, while a data mart stores data for a
specific department or function. Data marts are therefore smaller, more focused, and usually
faster to query.

Types of Data Marts

1. Dependent Data Mart
o Extracts data from a central data warehouse.
o Ensures consistency with enterprise-wide data.
o Example: A sales department data mart derived from a company’s main data
warehouse.
2. Independent Data Mart
o Created directly from operational systems without relying on a central data
warehouse.
o Suitable for smaller organizations.
o Example: A stand-alone marketing data mart.
3. Hybrid Data Mart
o Combines data from both a data warehouse and external operational systems.
o Offers more flexibility in integrating data.
o Example: A finance data mart that combines internal financial data with external
market trends.
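
A dependent data mart can be sketched as a subject-specific subset copied out of a central warehouse table. The example below uses an in-memory SQLite database; the table names `warehouse_facts` and `sales_mart`, the columns, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Central warehouse: one table holding facts for every department.
conn.execute(
    "CREATE TABLE warehouse_facts (department TEXT, record_date TEXT, metric TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?, ?)",
    [
        ("sales", "2024-01-05", "revenue", 120.0),
        ("sales", "2024-01-06", "revenue", 80.0),
        ("marketing", "2024-01-05", "ad_spend", 30.0),
        ("finance", "2024-01-05", "payroll", 500.0),
    ],
)

# Dependent data mart: only the sales rows, extracted from the warehouse.
conn.execute(
    """
    CREATE TABLE sales_mart AS
    SELECT record_date, metric, value
    FROM warehouse_facts
    WHERE department = 'sales'
    """
)

mart_rows = conn.execute("SELECT COUNT(*) FROM sales_mart").fetchone()[0]
mart_revenue = conn.execute(
    "SELECT SUM(value) FROM sales_mart WHERE metric = 'revenue'"
).fetchone()[0]
print(mart_rows, mart_revenue)
```

Because the mart is derived from the warehouse, its figures stay consistent with enterprise-wide data, while sales analysts query a much smaller table.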

Advantages of Data Marts

1. Simpler, more focused, and more flexible than a full data warehouse.
2. Low cost for both hardware and software.
3. Faster and cheaper to build.
4. Keeps data close to the users who need it, which improves performance.
Disadvantages of Data Marts

1. Maintaining a large number of independent data marts can be time-consuming,
resource-intensive, and challenging to manage.
2. A data mart cannot provide company-wide data analysis because its data set is limited.
3. Without proper planning and coordination, the development of data marts can become
unorganized.
4. Growth in data mart size leads to problems such as performance degradation and data
inconsistency.

Data Mart vs Data Warehouse

Data Mart                                 Data Warehouse
----------------------------------------  --------------------------------------------------
It is a decentralized system              It is a centralized system
It is a bottom-up model                   It is a top-down model
Building a mart is easy                   Building a warehouse is difficult
Smaller, handles a subset of data         Larger, handles vast amounts of data
Limited to a few sources                  Multiple sources across the organization
Focused on one department or group        Serves multiple departments and users
Simpler design, fewer integration needs   More complex, integrates data from diverse sources
Lower cost, faster implementation         Higher cost, longer implementation
Better performance, faster queries        May have slower queries due to large data volume
