Ex 1 - DATA EXPLORATION AND INTEGRATION WITH WEKA
AIM:
To Explore Data and Integrate with WEKA
ALGORITHM:
1. Download and install Weka. You can find it here: https://sourceforge.net/projects/weka/
2. Open the Weka tool and select the Explorer option.
3. A new window opens, consisting of different tabs (Preprocess, Classify, Associate, etc.).
4. In the Preprocess tab, click the "Open file" option.
5. Go to C:\Program Files\Weka-3-8-6\data to find the existing .arff datasets. Click on any dataset to load it; the data will then be displayed. (An optional programmatic cross-check is sketched below.)
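The same exploration can optionally be cross-checked outside the Weka GUI. The following is a minimal Python sketch, assuming scipy and pandas are installed and that the bundled weather.nominal.arff file is at the path shown (adjust it to your install); it is not part of the Weka procedure itself.

# explore_arff.py - optional programmatic cross-check of a Weka dataset
# Assumes: scipy and pandas are installed; the path below is a typical install location.
from scipy.io import arff
import pandas as pd

ARFF_PATH = r"C:\Program Files\Weka-3-8-6\data\weather.nominal.arff"  # adjust as needed

# Load the ARFF file into a structured array plus metadata
data, meta = arff.loadarff(ARFF_PATH)
df = pd.DataFrame(data)

# Nominal attributes are read as bytes; decode them to strings for readability
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.decode("utf-8")

print(meta)                         # attribute names and types, as the Preprocess tab shows
print(df.head())                    # first few instances
print(df.describe(include="all"))   # simple summary statistics per attribute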
OUTPUT:
RESULT:
Thus data exploration and integration with WEKA was executed successfully.
Ex 2 - APPLY WEKA TOOL FOR DATA VALIDATION
AIM:
To Apply WEKA tool for Data Validation
ALGORITHM:
1. Click the "Open file ..." option under the Preprocess tab and select the weather.nominal.arff file.
2. Go to the Classify tab; in the left-hand navigation pane we can see the different classification algorithms grouped into sections (rules, trees, etc.).
3. Click on the Choose button in the Filter sub-window and select the following filters:
weka->filters->supervised->attribute->Discretize
weka->filters->supervised->attribute->AttributeSelection
Click on the Apply button and examine the temperature and/or humidity attribute.
4. Selecting Weka classifiers.
5. Setting the test options, as listed below:
Training set
Supplied test set
Cross-validation
Percentage split
6. Selecting the classifier: click on the Choose button and select the following classifier:
weka->classifiers->trees->J48 (see the programmatic sketch after these steps).
7. Visualize Results
Select Visualize tree to get a visual representation of the decision tree.
Selecting Visualize classifier errors plots the results of the classification.
8. The current plot is outlook versus play.
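As an optional cross-check outside Weka, the same validation workflow (choose a classifier, then evaluate it with cross-validation or a percentage split) can be sketched in Python with scikit-learn. This is only a hedged analogue: DecisionTreeClassifier is not Weka's J48, and the tiny weather-style dataset below is made up for illustration.

# validate_sketch.py - scikit-learn analogue of the Weka test options
# Assumes: pandas and scikit-learn are installed; the data below is illustrative only.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Tiny nominal dataset in the spirit of weather.nominal (made-up values)
df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast", "sunny", "rainy"],
    "windy":   [False, True, False, False, True, True, False, True],
    "play":    ["no", "no", "yes", "yes", "no", "yes", "yes", "no"],
})

X = pd.get_dummies(df[["outlook", "windy"]])   # one-hot encode the nominal attributes
y = df["play"]

clf = DecisionTreeClassifier(random_state=0)   # rough analogue of choosing trees->J48

# "Cross-validation" test option (small fold count because the toy data is tiny)
scores = cross_val_score(clf, X, y, cv=2)
print("cross-validation accuracy per fold:", scores)

# "Percentage split" test option (e.g., 66% train / 34% test)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, random_state=0)
clf.fit(X_tr, y_tr)
print("percentage-split accuracy:", clf.score(X_te, y_te))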
OUTPUT:
RESULT:
Thus data validation using the WEKA tool was done successfully.
Ex 3 - PLAN THE ARCHITECTURE FOR REAL TIME APPLICATION
AIM:
To plan the architecture for a real-time application, considering factors such as scalability, performance, reliability, and maintainability.
ALGORITHM:
Here's a high-level guide to help you plan the architecture for your real-time application:
Define Requirements:
Clearly define the functional and non-functional requirements of your real-time application.
Identify the specific use cases and scenarios that require real-time processing.
System Components:
Identify the major components of your system. This could include servers, databases, user
interfaces, external APIs, and more.
Divide the system into smaller, manageable modules that can be developed, tested, and deployed
independently.
Scalability:
Plan for scalability from the beginning. Consider how the system will handle an increase in load
and user activity.
Use scalable infrastructure, such as cloud services, to easily adapt to changing demands.
Data Storage:
Choose an appropriate database solution for real-time data storage and retrieval.
Consider using in-memory databases or caching mechanisms to improve data access speed.
Real-time Processing:
Decide on the technologies and frameworks for real-time processing. This may include stream processing and messaging systems such as Apache Kafka, Apache Flink, or RabbitMQ.
Implement mechanisms for an event-driven architecture to handle real-time events efficiently.
Communication:
Establish efficient communication channels between the different components of the system. APIs, message queues, and WebSocket protocols are common choices for real-time communication.
Ensure low latency and high throughput for communication between components.
Fault Tolerance:
Design the system to be fault-tolerant. Use redundant components, implement backup and recovery strategies, and handle errors gracefully.
Consider a microservices architecture to isolate failures and improve overall system resilience.
Security:
Prioritize security measures to protect real-time data and communication.
Implement secure communication protocols, access controls, and encryption mechanisms.
Monitoring and Analytics:
Incorporate monitoring tools to track the performance of your real-time application.
Use analytics to gain insights into user behavior, system performance, and potential issues.
Testing:
Develop a comprehensive testing strategy, including unit testing, integration testing, and
performance testing.
Implement continuous integration and continuous deployment (CI/CD) pipelines to automate
testing and deployment processes.
Documentation:
Document the architecture, APIs, and data flow to facilitate easier maintenance and future
development. Include clear documentation on how to troubleshoot and resolve common issues.
Compliance:
Ensure that your real-time application complies with relevant regulations and standards, especially if it involves sensitive data.
ARCHITECTURE:
Explanation:
The above reference architecture is generally applicable: data streams in from a variety of producers, typically delivered via Apache Kafka, Amazon Kinesis, or Azure Event Hubs, to tools that ingest it and deliver it to a range of data stores and analytics engines. Between source and destination, the data is prepared for consumption in a variety of ways, including normalization, obfuscation of PII, flattening of nested data, filtering, and joining of data from multiple sources.
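To make the data flow concrete, here is a minimal, self-contained Python sketch of the ingest, prepare, and store stages described above. It only simulates the pipeline with an in-process queue; in a real deployment the queue would be a broker such as Kafka, Kinesis, or Event Hubs, and the store would be a database, as the reference architecture assumes.

# realtime_pipeline_sketch.py - toy simulation of the reference architecture
# Producers push events onto a queue (standing in for Kafka/Kinesis/Event Hubs),
# a consumer prepares each event (normalization, PII obfuscation, filtering),
# and prepared events land in an in-memory "store" (standing in for a database).
import queue
import threading
import time

event_queue = queue.Queue()   # stand-in for the streaming platform
data_store = []               # stand-in for the destination data store

def producer(name, readings):
    """Simulated data producer (e.g., a sensor or application emitting events)."""
    for value in readings:
        event_queue.put({"source": name, "user_email": "a@b.com", "value": value})
        time.sleep(0.01)

def prepare(event):
    """Prepare data for consumption: normalize, obfuscate PII, filter."""
    if event["value"] is None:                       # filtering
        return None
    return {
        "source": event["source"].lower(),           # normalization
        "user": hash(event["user_email"]) % 10_000,  # PII obfuscation (toy)
        "value": float(event["value"]),
    }

def consumer(stop_after):
    """Ingest events, prepare them, and load them into the store."""
    for _ in range(stop_after):
        event = event_queue.get()
        prepared = prepare(event)
        if prepared is not None:
            data_store.append(prepared)

p1 = threading.Thread(target=producer, args=("SensorA", [1, 2, None, 3]))
c1 = threading.Thread(target=consumer, args=(4,))
p1.start(); c1.start()
p1.join(); c1.join()
print(data_store)   # prepared events ready for analytics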
RESULT:
Thus the architecture for a real-time application was planned successfully.
EX 4 - WRITE THE QUERY FOR SCHEMA DEFINITION
AIM:
To write the query for schema definition.
ALGORITHM:
1. Create a new database
2. Switch to the newly created database
3. Define the schema for each table
4. Define relationships between tables (if needed)
5. Execute the schema definition queries
QUERY:
Create a new database named "library"
CREATE DATABASE library;
Switch to the "library" database
USE library;
Define the schema for the "books" table
CREATE TABLE books (book_id INT AUTO_INCREMENT PRIMARY KEY, title VARCHAR(255) NOT NULL, author VARCHAR(100) NOT NULL, publication_year INT, isbn VARCHAR(20), available BOOLEAN DEFAULT TRUE);
Define the schema for the "members" table
CREATE TABLE members ( member_id INT AUTO_INCREMENT PRIMARY KEY, name
VARCHAR(100) NOT NULL, email VARCHAR(255) UNIQUE, phone_number VARCHAR(20),
address VARCHAR(255) );
Define the schema for the "checkouts" table
CREATE TABLE checkouts ( checkout_id INT AUTO_INCREMENT PRIMARY KEY, book_id INT
NOT NULL, member_id INT NOT NULL, checkout_date DATE NOT NULL, return_date DATE,
FOREIGN KEY (book_id) REFERENCES books(book_id), FOREIGN KEY (member_id)
REFERENCES members(member_id) );
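As an optional way to exercise the schema end-to-end, the following Python sketch creates an equivalent schema in SQLite and runs a sample insert and join. Note the DDL is adapted: SQLite uses INTEGER PRIMARY KEY AUTOINCREMENT instead of MySQL's AUTO_INCREMENT, and foreign keys must be enabled with a PRAGMA; the MySQL statements above remain the reference, and the sample rows are illustrative only.

# schema_check.py - exercise an SQLite adaptation of the library schema
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite needs this for FK enforcement

conn.executescript("""
CREATE TABLE books (
    book_id INTEGER PRIMARY KEY AUTOINCREMENT,
    title VARCHAR(255) NOT NULL,
    author VARCHAR(100) NOT NULL,
    publication_year INT,
    isbn VARCHAR(20),
    available BOOLEAN DEFAULT TRUE
);
CREATE TABLE members (
    member_id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE,
    phone_number VARCHAR(20),
    address VARCHAR(255)
);
CREATE TABLE checkouts (
    checkout_id INTEGER PRIMARY KEY AUTOINCREMENT,
    book_id INT NOT NULL,
    member_id INT NOT NULL,
    checkout_date DATE NOT NULL,
    return_date DATE,
    FOREIGN KEY (book_id) REFERENCES books(book_id),
    FOREIGN KEY (member_id) REFERENCES members(member_id)
);
""")

# Sample rows (illustrative data only)
conn.execute("INSERT INTO books (title, author) VALUES ('Sample Book', 'Sample Author')")
conn.execute("INSERT INTO members (name, email) VALUES ('Sample Member', 'member@example.com')")
conn.execute("INSERT INTO checkouts (book_id, member_id, checkout_date) VALUES (1, 1, '2024-01-01')")

# Join across the three tables to confirm the relationships work
for row in conn.execute("""
    SELECT m.name, b.title, c.checkout_date
    FROM checkouts c
    JOIN books b ON b.book_id = c.book_id
    JOIN members m ON m.member_id = c.member_id
"""):
    print(row)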
OUTPUT:
RESULT:
Thus the schema definition queries were written and executed successfully.
EX 5 - DESIGN DATA WAREHOUSE FOR REAL TIME APPLICATION
AIM:
To design a data warehouse for a real-time application that stores and analyzes large volumes of real-time data efficiently.
ALGORITHM:
1. Define Dimensional Modeling:
Identify key business processes and metrics relevant to the real-time application. Define facts (measurable metrics) and dimensions (contextual attributes) to create a dimensional model.
2. Select ETL Processes:
Choose Extract, Transform, Load (ETL) processes suitable for real-time data ingestion. Implement mechanisms to continuously extract data from various sources, transform it to fit the data warehouse schema, and load it efficiently.
3. Implement Star or Snowflake Schema:
Choose a schema design that optimizes query performance for analytical processing. For simplicity and performance, consider a star schema, where a central fact table is surrounded by dimension tables. Each dimension table represents specific attributes related to the facts.
4. Define Fact Table:
Create a central fact table, e.g., "FactRealTimeData," to store real-time metrics. Include a timestamp for time-based analysis and foreign keys referencing dimension tables for additional context.
5. Define Dimension Tables:
Create dimension tables (e.g., "Dimension1" and "Dimension2") to store contextual attributes.
Each dimension table has a primary key and attributes related to specific dimensions.
6. Implement Foreign Key Relationships:
Establish foreign key relationships between the fact table and dimension tables. Ensure referential
integrity to maintain consistency in the data warehouse.
QUERY:
Define dimension tables
CREATE TABLE Dimension1 (dimension1_id INT PRIMARY KEY, attribute1 VARCHAR(255), attribute2 DATE);
CREATE TABLE Dimension2 (dimension2_id INT PRIMARY KEY, attribute3 VARCHAR(255), attribute4 BOOLEAN);
Define the fact table (created after the dimension tables so that its foreign keys can be resolved)
CREATE TABLE FactRealTimeData (timestamp TIMESTAMP, metric1 INT, metric2 FLOAT, dimension1_id INT, dimension2_id INT, PRIMARY KEY (timestamp), FOREIGN KEY (dimension1_id) REFERENCES Dimension1(dimension1_id), FOREIGN KEY (dimension2_id) REFERENCES Dimension2(dimension2_id));
OUTPUT:
OUTPUT EXPLANATION:
The SQL script provided above creates a simple data warehouse structure. The "FactRealTimeData" table stores real-time metrics with a timestamp and references two dimension tables ("Dimension1" and "Dimension2") to provide additional context to the metrics. This structure facilitates efficient querying and analysis of real-time data within the context of various dimensions.
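To illustrate the kind of analysis this star schema supports, here is a short, hedged Python/SQLite sketch that loads a few made-up rows and aggregates the fact table by a dimension attribute. SQLite is used only because it needs no server; the attribute values and metrics are purely illustrative.

# star_schema_query.py - sample aggregation over the star schema (illustrative data)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dimension1 (dimension1_id INT PRIMARY KEY, attribute1 VARCHAR(255), attribute2 DATE);
CREATE TABLE Dimension2 (dimension2_id INT PRIMARY KEY, attribute3 VARCHAR(255), attribute4 BOOLEAN);
CREATE TABLE FactRealTimeData (
    timestamp TIMESTAMP PRIMARY KEY,
    metric1 INT, metric2 FLOAT,
    dimension1_id INT, dimension2_id INT,
    FOREIGN KEY (dimension1_id) REFERENCES Dimension1(dimension1_id),
    FOREIGN KEY (dimension2_id) REFERENCES Dimension2(dimension2_id)
);
INSERT INTO Dimension1 VALUES (1, 'RegionA', '2024-01-01'), (2, 'RegionB', '2024-01-01');
INSERT INTO Dimension2 VALUES (1, 'DeviceX', 1);
INSERT INTO FactRealTimeData VALUES
    ('2024-01-01 10:00:00', 10, 1.5, 1, 1),
    ('2024-01-01 10:01:00', 20, 2.5, 1, 1),
    ('2024-01-01 10:02:00', 30, 3.5, 2, 1);
""")

# Aggregate metrics per dimension attribute - a typical analytical query on a star schema
for row in conn.execute("""
    SELECT d1.attribute1, SUM(f.metric1) AS total_metric1, AVG(f.metric2) AS avg_metric2
    FROM FactRealTimeData f
    JOIN Dimension1 d1 ON d1.dimension1_id = f.dimension1_id
    GROUP BY d1.attribute1
"""):
    print(row)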
RESULT:
Implemented a real-time data warehouse using a star schema with "FactRealTimeData," "Dimension1," and "Dimension2" tables. The fact table stores metrics with timestamps, and the dimension tables provide additional context.
EX 6 - ANALYSE THE DIMENSIONAL MODELING
AIM:
To analyse the dimensional modeling.
ALGORITHM:
1. Identify the business process
2. Identify dimensions and facts
3. Design the dimensional model
4. Define relationships
5. Optimize for query performance
QUERY:
1. Sales Fact Table:
CREATE TABLE SalesFact ( SaleID INT PRIMARY KEY, DateID INT, ProductID INT, QuantitySold
INT, AmountSold DECIMAL(10,2));
2. Date Dimension:
CREATE TABLE DateDim ( DateID INT PRIMARY KEY, CalendarDate DATE, Day INT, Month INT,
Year INT );
Populate Date Dimension (sample data)
INSERT INTO DateDim (DateID, CalendarDate, Day, Month, Year) VALUES (1, '2024-01-01', 1, 1,
2024), (2, '2024-01-02', 2, 1, 2024);
-- Add more dates as needed
3. Product Dimension:
CREATE TABLE ProductDim ( ProductID INT PRIMARY KEY, ProductName VARCHAR(255),
Category VARCHAR(50));
-- Additional attributes as needed
Populate Product Dimension (sample data)
INSERT INTO ProductDim (ProductID, ProductName, Category) VALUES (101, 'Product A',
'Electronics'), (102, 'Product B', 'Clothing');
-- Add more products as needed
4. Query to retrieve sales with date and product details:
SELECT s.SaleID, d.CalendarDate, p.ProductName, s.QuantitySold, s.AmountSold FROM SalesFact s
JOIN DateDim d ON s.DateID = d.DateID JOIN ProductDim p ON s.ProductID = p.ProductID;
This query retrieves sales information along with corresponding date and product details, leveraging the
dimensional model.
OUTPUT:
RESULT:
Thus the dimensional modeling was analysed successfully.
EX 7 - CASE STUDY USING OLAP
AIM:
To study a case using OLAP.
Introduction:
OLAP:-
OLAP stands for "Online Analytical Processing." OLAP allows users to analyze database
information from multiple database systems at one time. While relational databases are considered to be
two-dimensional, OLAP data is multidimensional, meaning the information can be compared in many
different ways. For example, a company might compare their computer sales in June with sales in July,
and then compare those results with the sales from another location, which might be stored in a different
database. In order to process database information using OLAP, an OLAP server is required to organize
and compare the information. Clients can analyze different sets of data using functions built into the
OLAP server. Some popular OLAP server software programs include Oracle Express Server and Hyperion
Solutions Essbase.
Purpose of OLAP:-
An effective OLAP solution solves problems for both business users and IT departments. For
business users, it enables fast and intuitive access to centralized data and related calculations for the
purposes of analysis and reporting. For IT, an OLAP solution enhances a data warehouse or other
relational database with aggregate data and business calculations. In addition, by enabling business users
to do their own analyses and reporting, OLAP systems reduce demands on IT resources.
OLAP offers five key benefits:
Business-focused multidimensional data
Business-focused calculations
Trustworthy data and calculations
Speed-of-thought analysis
Flexible, self-service reporting
OLAP operations
These are used to analyze data in an OLAP cube. There are five basic operations:
Drill down
This makes the data more detailed by moving down the concept hierarchy or adding a new dimension. For
example, in a cube showing sales data by Quarter, drilling down would show sales data by Month.
Roll up
This makes the data less detailed by climbing up the concept hierarchy or reducing dimensions. For
example, in a cube showing sales data by City, rolling up would show sales data by Country.
Dice
This selects a sub-cube by choosing two or more dimensions and criteria. For example, in a cube showing
sales data by Location, Time, and Item, dicing could select sales data for Delhi or Kolkata, in Q1 or Q2,
for Cars or Buses.
Slice
This selects a single dimension and creates a new sub-cube. For example, in a cube showing sales data by
Location, Time, and Item, slicing by Time would create a new sub-cube showing sales data for Q1.
Pivot
This rotates the current view to get a new representation. For example, after slicing by Time, pivoting
could show the same data but with Location and Item as rows instead of columns.
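The five operations above can be sketched with pandas on a tiny, made-up sales cube. This is only an illustration of the concepts, not an OLAP server, and the values below are invented for the example.

# olap_operations_sketch.py - illustrating drill down, roll up, slice, dice, and pivot
# with pandas on a small, made-up sales dataset (not a real OLAP server).
import pandas as pd

sales = pd.DataFrame({
    "Location": ["Delhi", "Delhi", "Kolkata", "Kolkata"],
    "Quarter":  ["Q1", "Q2", "Q1", "Q2"],
    "Month":    ["Jan", "Apr", "Feb", "May"],
    "Item":     ["Cars", "Buses", "Cars", "Cars"],
    "Sales":    [100, 150, 80, 120],
})

# Roll up: less detail - total sales per Quarter (climbing the Time hierarchy)
print(sales.groupby("Quarter")["Sales"].sum())

# Drill down: more detail - sales per Quarter and Month
print(sales.groupby(["Quarter", "Month"])["Sales"].sum())

# Slice: fix one dimension - only Q1 data
print(sales[sales["Quarter"] == "Q1"])

# Dice: criteria on two or more dimensions - Delhi or Kolkata, Q1 or Q2, Cars only
print(sales[sales["Location"].isin(["Delhi", "Kolkata"])
            & sales["Quarter"].isin(["Q1", "Q2"])
            & (sales["Item"] == "Cars")])

# Pivot: rotate the view - Locations as rows, Quarters as columns
print(sales.pivot_table(index="Location", columns="Quarter", values="Sales", aggfunc="sum"))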
RESULT:
Thus the case study using OLAP was done successfully.
EX 8 - CASE STUDY USING OLTP
AIM:
To study a case using OLTP.
Introduction:
OLTP, or online transactional processing, is a software program or operating system that supports transaction-oriented applications in a three-tier architecture. It facilitates and supports the execution of a large number of real-time transactions in a database.
OLTP monitors daily transactions and is typically done over an internet-based multi-access
environment. It handles query processing and, at the same time, ensures and protects data integrity. The
efficacy of OLTP is determined by the number of transactions per second that it can process. OLTP
systems are optimized for transactional superiority and are hence suitable for most monetary transactions.
The defining characteristic of OLTP transactions is atomicity and concurrency. Concurrency
prevents multiple users from changing the same data simultaneously. Atomicity (or indivisibility) ensures
that all transactional steps are completed for the transaction to be successful. If one step fails or is
incomplete, the entire transaction fails.
Atomic statefulness is a computing condition in which database changes are permanent, requiring
transactions to be completed successfully. OLTP systems enable inserting, deleting, changing, and
querying data in a database.
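Atomicity can be illustrated with a small, hedged Python/SQLite sketch of a money transfer: either both the debit and the credit are committed, or the whole transaction is rolled back. This is an illustration of the concept only; real OLTP systems run on full DBMS platforms with far more sophisticated transaction and concurrency management.

# atomic_transfer_sketch.py - illustrating atomicity with an SQLite transaction
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 500), (2, 100)")
conn.commit()

def transfer(amount, src, dst):
    """Debit src and credit dst as one atomic unit; roll back if any step fails."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # forces the rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
    except ValueError as exc:
        print("transaction rolled back:", exc)

transfer(200, 1, 2)    # succeeds: both updates are committed together
transfer(1000, 1, 2)   # fails: the debit is rolled back, balances are unchanged
print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 300), (2, 300)]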
OLTP systems activities consist of gathering input data, processing the data, and updating it using
the data collected. OLTP is usually supported by a database management system (DBMS) and operates in
a client-server system. It also relies on advanced transaction management systems to facilitate multiple
concurrent updates.
OLTP Transaction Examples
OLTP systems facilitate many types of financial and non-financial transactions such as:
Automated teller machines (ATMs)
Online banking applications
Online bookings for airline ticketing, hotel reservations, etc.
Online and in-store credit card payment processing
Order entry
E-commerce and in-store purchases
Password changes and sending text messages
OLTP systems are found in a broad spectrum of industries with a concentration in client-facing
environments.
OLTP Characteristics
1. Short response time
OLTP systems maintain very short response times to be effective for users. For example, responses from
an ATM operation need to be quick to make the process effective, worthwhile, and convenient.
2. Process small transactions
OLTP systems support numerous small transactions with a small amount of data executed simultaneously
over the network. It can be a mixture of queries and Data Manipulation Language (DML) overload. The
queries normally include insertions, deletions, updates, and related actions. Response time measures the
effectiveness of OLTP transactions, and millisecond responses are becoming common.
3. Data maintenance operations
Data maintenance operations are data-intensive computational reporting and data update programs that run
alongside OLTP systems without interfering with user queries.
4. High-level transaction volume and multi-user access
OLTP systems are synonymous with a large number of users accessing the same data at the same time.
Online purchases of a popular or trending gadget such as an iPhone may involve an enormous number of
users all vying for the same product. The system is built to handle such situations expertly.
5. Very high concurrency
An OLTP environment experiences very high concurrency due to the large user population, small
transactions, and very short response times. However, data integrity is maintained by a concurrency
algorithm, which prevents two or more users from altering the same data at the same time. It prevents
double bookings or allocations in online ticketing and sales, respectively.
A mobile money transfer application is a good example where concurrency is very high as thousands of
users can be making transfers simultaneously on the platform at every time of the day.
6. Round-the-clock availability
OLTP systems often need to be available round the clock, 24/7, without interruption. A small period of unavailability or offline operation can significantly impact a large number of people and an equally huge transaction volume. Downtime can also pose potential losses to organizations, e.g., an online banking system outage has adverse consequences for the bank’s bottom line. Therefore, an OLTP system requires frequent, regular, and incremental backups.
7. Data usage patterns
OLTP systems experience periods of both high data usage and low data usage. Finance-related OLTP
systems typically see high data usage during month ends when financial obligations are settled.
8. Indexed data sets
Index data sets are used to facilitate rapid query, search, and retrieval.
9. Normalized schema
OLTP systems utilize a fully normalized schema for database consistency.
10. Storage
OLTP stores data records for the past few days or about a week. It supports sophisticated data models and
tables.
OLTP System Architecture
1. Business Strategy
The business strategy influences the design of the OLTP system. The strategy is formulated at the senior management and board of directors level.
2. Business Process
These are the processes carried out by the OLTP system to accomplish the goals set by the business strategy. The processes comprise a set of activities, tasks, and actions.
3. Product, Customer/Supplier, Transactions, Employees
The OLTP database contains information on products, transactions, employees, customers, and suppliers.
4. Extract, Transform, Load (ETL) Process
The ETL process extracts data from the OLTP database into a staging area, where it is transformed; this includes data cleansing and optimizing the data for analysis. The transformed data is then loaded into the online analytical processing (OLAP) database, which is synonymous with the data warehouse environment.
5. Data Warehouse and Data Mart
Data warehouses are central repositories of integrated data from one or more disparate sources. A data mart is an access layer of the data warehouse that is used to access specific or summarized information for a unit or department.
6. Data Mining, Analytics, and Decision Making
The data stored in the data warehouse and data mart is used for analysis, data mining, and decision
making.
RESULT:
Thus the case study using OLTP was done successfully.
EX 9 - IMPLEMENTATION OF WAREHOUSE TESTING
AIM:
To implement warehouse testing.
ALGORITHM:
1. Install the necessary libraries:
pip install pytest pandas
2. Create a Python script for data transformation and loading
3. Create test cases using pytest
4. Run the tests using pytest:
pytest test_data_integration.py
5. Analyze the test results to ensure that the data transformation and loading processes are functioning
correctly in the operational data layer.
By implementing automated tests for data integration processes in the data warehousing environment, you can ensure the accuracy and reliability of the data transformation and loading operations. This approach helps in identifying any issues or discrepancies early in the development cycle, leading to a more robust and efficient data warehousing system.
PROGRAM:
# data_transformation.py
import pandas as pd

def transform_data(input_data):
    # Perform data transformation logic here
    transformed_data = input_data.apply(lambda x: x * 2)
    return transformed_data

def load_data(transformed_data):
    # Load transformed data into the operational data layer
    transformed_data.to_csv('transformed_data.csv', index=False)

# test_data_integration.py
import pandas as pd
import data_transformation

def test_transform_data():
    input_data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    expected_output = pd.DataFrame({'A': [2, 4, 6], 'B': [8, 10, 12]})
    transformed_data = data_transformation.transform_data(input_data)
    assert transformed_data.equals(expected_output)

def test_load_data():
    input_data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
    data_transformation.load_data(input_data)
    loaded_data = pd.read_csv('transformed_data.csv')
    assert input_data.equals(loaded_data)
OUTPUT:
RESULT:
Thus the implementation of warehouse testing was done successfully.