Dm-Lab - Nov 1
TECH I-SEMESTER
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
1. To impart quality professional education that meets the needs of present and
emerging technological world.
2. To strive for student achievement and success, while preparing them for life,
career and leadership.
3. To produce graduates with professional ethics and responsibility towards the
development of industry and society, and for sustainable development.
4. To ensure abilities in the graduates to lead technical and management teams
for conception, development and management of projects for industrial and
national development.
5. To forge mutually beneficial relationships with government organizations,
industries, society and the alumni.
The mission and vision are displayed in the department, laboratories and all
instructional rooms.
They are also provided on the college website and on department notice boards.
They are explained to students and their parents as part of the induction
programme.
The mission and vision are also exhibited in the library and in the seminar
halls.
They are published in the lab manuals, newsletters and course files.
DM LAB
SYLLABUS
PO8 Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of engineering practice.
PO9 Individual and team work: Function effectively as an individual, and as a
member or leader in diverse teams, and in multidisciplinary settings.
PO10 Communication: Communicate effectively on complex engineering activities
with the engineering community and with society at large, such as being able to
comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
PO11 Project management and finance: Demonstrate knowledge and understanding of
the engineering and management principles and apply these to one's own work, as
a member and leader in a team, to manage projects and in multidisciplinary
environments.
PO12 Life-long learning: Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in the broadest context
of technological change.
CO       PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2
C418.2    3   -   2   2   3   -   -   -   -   -    -    -    2    -
C418.3    3   -   2   2   3   -   -   -   -   -    -    -    2    -
C418.4    3   -   2   2   3   -   -   -   -   -    -    -    2    -
Avg       3   -   2   2   3   -   -   -   -   -    -    -    2    -
(C418)
LIST OF EXPERIMENTS
Weka is a popular tool for data mining and machine learning tasks. It also provides
functionalities for basic data preprocessing like data cleaning.
Steps:
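Since the Weka steps above are carried out through the GUI, the following is a
minimal pandas sketch of equivalent cleaning operations; the file name
student.csv and the age column are hypothetical, used only for illustration.
import pandas as pd

# Load a hypothetical dataset (file name and column names are assumptions)
df = pd.read_csv("student.csv")

# Remove exact duplicate records
df = df.drop_duplicates()

# Replace missing values in a numeric attribute with the column mean,
# roughly what Weka's ReplaceMissingValues filter does for numeric attributes
df["age"] = df["age"].fillna(df["age"].mean())

# Drop any rows that still contain missing values
df = df.dropna()
print(df.head())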
Pentaho Data Integration (PDI), also known as Kettle, is a powerful tool for data
processing, including ETL (Extract, Transform, Load) processes.
Steps:
Weka is not typically designed for complex data integration tasks like merging
datasets from different sources. However, you can perform basic data integration
operations, such as merging datasets with similar structures.
Steps:
Pentaho Data Integration (PDI) is well-suited for more complex data integration
tasks, especially when dealing with data from different sources (e.g., databases,
CSV files, Excel, etc.).
Steps:
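The Weka and Pentaho steps above are GUI driven. As a rough illustration of
both styles of integration, here is a small pandas sketch (all data and column
names are hypothetical): appending two extracts with the same structure, then
joining a second source on a common key.
import pandas as pd

# Two hypothetical extracts with the same structure (append, as in the Weka case)
sales_jan = pd.DataFrame({"ID": [1, 2], "Amount": [100, 200]})
sales_feb = pd.DataFrame({"ID": [3, 4], "Amount": [150, 250]})
combined = pd.concat([sales_jan, sales_feb], ignore_index=True)

# A hypothetical second source with a different structure (join, as in the Pentaho case)
customers = pd.DataFrame({"ID": [1, 2, 3, 4],
                          "City": ["Hyderabad", "Delhi", "Hyderabad", "Chennai"]})
integrated = combined.merge(customers, on="ID", how="left")
print(integrated)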
Conclusion:
Weka: Suitable for basic data integration when datasets have similar
structures and are small in size.
Pentaho Data Integration: Ideal for more complex data integration tasks
involving multiple data sources, different formats, and larger datasets.
Pentaho is generally preferred for comprehensive data integration tasks due to its
flexibility and wide range of functionalities.
EXPERIMENT – 2
2.
Weka is more commonly used for data mining and analysis than for data
partitioning, but you can perform basic horizontal and vertical partitioning through
its functionalities.
Horizontal Partitioning
Horizontal partitioning involves dividing the dataset into multiple subsets of rows
(instances).
Steps:
Vertical Partitioning
Steps:
Pentaho PDI offers more advanced and flexible partitioning methods, including
horizontal, vertical, round-robin, and hash-based partitioning.
Horizontal Partitioning
Steps:
Vertical Partitioning:
Steps:
Round-Robin Partitioning:
Steps:
Hash-Based Partitioning:
Conclusion:
Pentaho’s flexibility makes it ideal for complex data partitioning tasks, especially
when you need to automate and scale these processes across large datasets.
Dimension table
Nearly all of the information in a typical fact table is also present in one or more
dimension tables. The main purpose of maintaining Dimension Tables is to allow
browsing the categories quickly and easily.
The primary keys of each of the dimension tables are linked together to form the
composite primary key of the fact table. In a star schema design, there is only
one de-normalized table for a given dimension. Typical dimension tables in a
data warehouse include product, customer, time and location dimensions.
Example:
import pandas as pd

# Sample DataFrame
data = {'ID': [1, 2, 3, 4, 5], 'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Horizontal partitioning: split the rows into two subsets
partition1 = df.iloc[:3]   # first three rows
partition2 = df.iloc[3:]   # remaining rows

print(partition1)
print(partition2)
Example:
import pandas as pd

# Sample DataFrame with an extra Category column
data = {'ID': [1, 2, 3, 4, 5],
        'Value': [10, 20, 30, 40, 50],
        'Category': ['A', 'B', 'A', 'B', 'A']}
df = pd.DataFrame(data)

# Vertical partitioning: split the columns, repeating the key column in each partition
partition1 = df[['ID', 'Value']]
partition2 = df[['ID', 'Category']]

print(partition1)
print(partition2)
Example:
import pandas as pd

# Sample DataFrame
data = {'ID': [1, 2, 3, 4, 5], 'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Round-robin partitioning: alternate rows between two partitions
partition1 = df.iloc[::2]   # rows 0, 2, 4
partition2 = df.iloc[1::2]  # rows 1, 3
print(partition1); print(partition2)
Example:
import pandas as pd

# Sample DataFrame
data = {'ID': [1, 2, 3, 4, 5], 'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Hash-based partitioning: assign each row to a partition by hashing the key
df['Partition'] = df['ID'] % 2          # simple hash: ID modulo number of partitions
partition1 = df[df['Partition'] == 0]
partition2 = df[df['Partition'] == 1]
print(partition1); print(partition2)
Experiment No. 3
Objective: Create a simple data warehouse with three schemas: Star, Snowflake,
and Fact Constellation. Load sample data and perform queries to demonstrate their
structure and functionality.
Star Schema:
1. Create Tables:
-- Dimension Tables
CREATE TABLE ProductDimension (
ProductID INT PRIMARY KEY,
ProductName VARCHAR(100),
Category VARCHAR(100)
);
1. Create Tables:
-- Dimension Tables
CREATE TABLE ProductDimension (
ProductID INT PRIMARY KEY,
ProductName VARCHAR(100),
CategoryID INT
);
1. Create Tables:
-- Dimension Tables
CREATE TABLE ProductDimension (
ProductID INT PRIMARY KEY,
ProductName VARCHAR(100),
Category VARCHAR(100)
);
Load some sample data into the tables. Here’s an example for the Star Schema:
Repeat similar INSERT statements for the Snowflake and Fact Constellation
schemas, adapting the values as necessary.
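Because the INSERT statements themselves are not reproduced here, the sketch
below uses Python's built-in sqlite3 module to load a few rows into the Star
Schema and run one join query; the SalesFact table and all values are
assumptions added only for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: the ProductDimension table above plus a hypothetical fact table
cur.execute("CREATE TABLE ProductDimension (ProductID INT PRIMARY KEY, "
            "ProductName VARCHAR(100), Category VARCHAR(100))")
cur.execute("CREATE TABLE SalesFact (ProductID INT, SalesAmount REAL)")

# Load sample data (hypothetical values)
cur.executemany("INSERT INTO ProductDimension VALUES (?, ?, ?)",
                [(1, "Rice", "Grocery"), (2, "Laptop", "Electronics")])
cur.executemany("INSERT INTO SalesFact VALUES (?, ?)",
                [(1, 500.0), (2, 45000.0), (1, 250.0)])

# Query: total sales per category (demonstrates the star join)
cur.execute("SELECT d.Category, SUM(f.SalesAmount) "
            "FROM SalesFact f JOIN ProductDimension d ON f.ProductID = d.ProductID "
            "GROUP BY d.Category")
print(cur.fetchall())
conn.close()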
Objective: Create a data cube from a sample dataset and perform various OLAP
operations, such as slicing, dicing, drilling down, and rolling up.
Dimension Tables
Product Dimension:
Customer Dimension:
Date Dimension:
You can construct a data cube using SQL queries. In OLAP, a data cube is
typically created by aggregating data across multiple dimensions.
1. Slicing: see the sketch below for building a small cube and taking a slice of it.
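A minimal pandas sketch of the idea, assuming hypothetical Product, Region and
Year dimensions with a single Amount measure (all names and values are
illustrative, not part of the sample dataset above):
import pandas as pd

# Hypothetical fact data with three dimensions and one measure
sales = pd.DataFrame({
    "Product": ["Rice", "Rice", "Laptop", "Laptop"],
    "Region":  ["North", "South", "North", "South"],
    "Year":    [2023, 2023, 2024, 2024],
    "Amount":  [100, 150, 900, 1200],
})

# Build a small data cube: aggregate the measure across all dimensions
cube = pd.pivot_table(sales, values="Amount",
                      index=["Product", "Region"], columns="Year",
                      aggfunc="sum", fill_value=0)

# Slice: fix one dimension (Year = 2023) to obtain a sub-cube
slice_2023 = cube[2023]
print(slice_2023)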
Experiment – 4
i) Develop an application to implement the OLAP roll-up, drill-down,
slice and dice operations
OLAP operations:
The analyst can understand the meaning contained in the databases using
multi-dimensional analysis. By aligning the data content with the analyst's
mental model, the chances of confusion and erroneous interpretations are
reduced. The analyst can navigate through the database and screen for a
particular subset of the data, changing the data's orientations and defining
analytical calculations. The user-initiated process of navigating by calling
for page displays interactively, through the specification of slices via
rotations and drill down/up, is sometimes called "slice and dice". Common
operations include slice and dice, drill down, roll up, and pivot.
Dice: The dice operation is a slice on more than two dimensions of a data cube (or
more than two consecutive slices).
Drill Down/Up: Drilling down or up is a specific analytical technique whereby
the user navigates among levels of data ranging from the most summarized (up)
to the most detailed (down).
Roll-up: A roll-up involves computing all of the data relationships for one or more
dimensions. To do this, a computational relationship or formula might be defined.
Department of Computer Science and Engineering DM Lab
Pivot: To change the dimensional orientation of a report or page display.
Other operations
Drill through: drill through the bottom level of the cube to its back-end
relational tables (using SQL).
Slice: the slice operation performs a selection on one dimension of the given
cube, resulting in a sub-cube.
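A minimal pandas sketch of roll-up, drill-down and dice on a hypothetical
Year -> Month -> City hierarchy; the data and column names are assumptions
used only to make the operations concrete.
import pandas as pd

sales = pd.DataFrame({
    "Year":   [2023, 2023, 2023, 2024],
    "Month":  ["Jan", "Jan", "Feb", "Jan"],
    "City":   ["Hyd", "Del", "Hyd", "Hyd"],
    "Amount": [100, 200, 150, 300],
})

# Drill-down: detailed view at the Year -> Month -> City level
detailed = sales.groupby(["Year", "Month", "City"])["Amount"].sum()

# Roll-up: climb the hierarchy by aggregating Month and City away
rolled_up = sales.groupby("Year")["Amount"].sum()

# Dice: select a sub-cube on two dimensions (Year 2023 and City Hyd)
diced = sales[(sales["Year"] == 2023) & (sales["City"] == "Hyd")]

print(detailed, rolled_up, diced, sep="\n\n")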
Drill Down:
Slice:
Dice:
A natural way to view the multidimensional data model is as a cube. The cube at
the left contains detailed sales data by product, market and time. The cube on
the right associates sales numbers (units sold) with the dimensions product
type, market and time, with the unit variables organized as cells in an array.
This cube can be expanded to include another array, price, which can be
associated with all or only some dimensions.
As the number of dimensions increases, the number of cube cells increases
exponentially.
Dimensions are hierarchical in nature; for example, the time dimension may
contain hierarchies for years, quarters, months, weeks and days. Geography may
contain country, state, city, etc.
Fig.A.1
a) Logical Cubes
b) Logical Measures
Measures populate the cells of a logical cube with the facts collected about
business operations. Measures are organized by dimensions, which typically
include a Time dimension.
Measures are static and consistent while analysts are using them to inform their
decisions. They are updated in a batch window at regular intervals: weekly,
daily, or periodically throughout the day. Many applications refresh their data by
adding periods to the time dimension of a measure, and may also roll off an
equal number of the oldest time periods. Each update provides a fixed historical
record of a particular business activity for that interval. Other applications do a
full rebuild of their data rather than performing incremental updates.
The base level determines whether analysts can get an answer to this question.
For this particular question, Time could be rolled up into months, Customer
could be rolled up into regions, and Product could be rolled up into items (such
as dresses) with an attribute of color. However, this level of aggregate data could
not answer the question: At what time of day are women most likely to place an
order? An important decision is the extent to which the data has been pre-
aggregated before being loaded into a data warehouse.
c) Logical Dimensions
Dimensions contain a set of unique values that identify and categorize data. They
form the edges of a logical cube, and thus of the measures within the cube.
Because measures are typically multidimensional, a single value in a measure
must be qualified by a member of each dimension to be meaningful. For
example, the Sales measure has four dimensions: Time, Customer, Product, and
Channel. A particular Sales value (43,613.50) only has meaning when it is
qualified by a specific time period (Feb-01), a customer (Warren Systems), a
product (Portable PCs), and a channel (Catalog).
Each level represents a position in the hierarchy. Each level above the base (or
most detailed) level contains aggregate values for the levels below it. The
members at different levels have a one-to-many parent-child relation. For
example, Q1-02 and Q2-02 are the children of 2002, thus 2002 is the parent of
Q1-02 and Q2-02.
Suppose a data warehouse contains snapshots of data taken three times a day,
that is, every 8 hours. Analysts might normally prefer to view the data that has
been aggregated into days, weeks, quarters, or years. Thus, the Time dimension
needs a hierarchy with at least five levels. Similarly, a sales manager with a
particular target for the upcoming year might want to allocate that target amount
among the sales representatives in his territory; the allocation requires a
dimension hierarchy in which individual sales representatives are the child
values of a particular territory.
An attribute provides additional information about the data. Some attributes are
used for display. For example, you might have a product dimension that uses
Stock Keeping Units (SKUs) for dimension members. The SKUs are an excellent
way of uniquely identifying thousands of products, but are meaningless to most
people if they are used to label the data in a report or graph. You would define
attributes for the descriptive labels.
Time attributes can provide information about the Time dimension that may be
useful in some types of analysis, such as identifying the last day or the number of
days in each time period.
a) Dimension Tables
A star schema stores all of the information about a dimension in a single table.
Each level of a hierarchy is represented by a column or column set in the
dimension table. A dimension object can be used to define the hierarchical
relationship between two columns (or column sets) that represent two levels of a
hierarchy; without a dimension object, the hierarchical relationships are defined
only in metadata. Attributes are stored in columns of the dimension tables.
A snowflake schema normalizes the dimension members by storing each level in
a separate table.
Department of Computer Science and Engineering DM Lab
b) Fact Tables
Measures are stored in fact tables. Fact tables contain a composite primary key,
which is composed of several foreign keys (one for each dimension table) and a
column for each measure that uses these dimensions.
c) Materialized Views
Queries can be written either against a fact table or against a materialized view.
If a query is written against the fact table that requires aggregate data for its
result set, the query is either redirected by query rewrite to an existing
materialized view, or the data is aggregated on the fly.
K-means is the most popular algorithm for clustering. The user needs to specify
k, the number of clusters. The basic procedure is:
1. Select k objects as the initial cluster means
2. Assign each object to the cluster whose mean is closest
3. Compute a new mean for each cluster Ci
4. Iterate until the criterion function converges, that is, there are no more new
assignments
iv. If the class attribute is known, the user can select that attribute for
"Classes to clusters evaluation" to check the accuracy of the results.
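The sketch below is a minimal NumPy implementation of the procedure described
above, run on a small hypothetical 2-D dataset; it is only an illustration of
the algorithm, not Weka's SimpleKMeans implementation.
import numpy as np

def kmeans(X, k, iterations=100, seed=10):
    rng = np.random.default_rng(seed)
    # Step 1: pick k objects as the initial cluster means
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Step 2: assign each object to the cluster with the closest mean
        labels = np.argmin(np.linalg.norm(X[:, None] - means[None, :], axis=2), axis=1)
        # Step 3: compute a new mean for each cluster
        new_means = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 4: stop once the means no longer change (convergence)
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

# Hypothetical 2-D data with two obvious groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids, sep="\n")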
Aim: This experiment illustrates some of the basic data preprocessing operations
that can be performed using the WEKA Explorer. The sample dataset used for this
example is the student data available in ARFF format.
Step 1: Loading the data. We can load the dataset into Weka by clicking on the
Open file button in the Preprocess tab and selecting the appropriate file.
Discretization
1) Sometimes association rule mining can only be performed on categorical data.
This requires performing discretization on numeric or continuous attributes. In
the following example, let us discretize the age attribute.
To change the defaults for the filter, click on the box immediately to the right
of the Choose button.
We enter the index of the attribute to be discretized. In this case the attribute
is age, so we enter '1', corresponding to the age attribute.
Enter '3' as the number of bins and leave the remaining field values as they are.
Click the OK button.
Click Apply in the filter panel. This results in a new working relation with the
selected attribute partitioned into 3 bins.
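What the filter does can be sketched in pandas as follows; the age values below
are hypothetical, and equal-width binning is used, similar to Weka's
unsupervised Discretize filter with default settings.
import pandas as pd

# Hypothetical numeric attribute
ages = pd.DataFrame({"age": [17, 18, 19, 20, 21, 22, 24, 27, 30]})

# Equal-width discretization into 3 bins
ages["age_binned"] = pd.cut(ages["age"], bins=3, labels=["low", "medium", "high"])
print(ages)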
@relation student
@data
Load each dataset into Weka and run Apriori algorithm with different
support and confidence values. Study the rules generated.
Step 1: Open the data file in Weka Explorer. It is presumed that the required
data fields have been discretized. In this example it is the age attribute.
Step 2: Clicking on the Associate tab will bring up the interface for the
association rule algorithms.
Step 4: In order to change the parameters for the run (e.g. support, confidence)
we click on the text box immediately to the right of the Choose button.
Aim: This experiment illustrates some of the basic elements of association rule
mining using WEKA. The sample dataset used for this example is test.arff.
Step 2: Clicking on the Associate tab will bring up the interface for the
association rule algorithms.
Step 4: In order to change the parameters for the run (e.g. support, confidence)
we click on the text box immediately to the right of the Choose button.
Dataset test.arff
@relation test
% attribute declarations reconstructed from the data values below
@attribute year {2005, 2006, 2007, 2008, 2009}
@attribute branch {cse, it, mech, ece}
@data
2005, cse
2005, it
2005, cse
2006, mech
2006, it
2006, ece
2007, it
2007, cse
2008, it
2008, cse
2009, it
2009, ece
The following screenshot shows the association rules that were generated when
the Apriori algorithm is applied to the given dataset.
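To make the idea behind the generated rules concrete, here is a small
self-contained Python sketch that computes support and confidence for rules of
the form {year} => {branch} over the test.arff records above. It is a
simplification of Apriori for two-item transactions, not Weka's implementation,
and the thresholds are chosen only for illustration.
from collections import Counter

# Transactions taken from test.arff: (year, branch)
transactions = [("2005", "cse"), ("2005", "it"), ("2005", "cse"),
                ("2006", "mech"), ("2006", "it"), ("2006", "ece"),
                ("2007", "it"), ("2007", "cse"), ("2008", "it"),
                ("2008", "cse"), ("2009", "it"), ("2009", "ece")]

n = len(transactions)
year_counts = Counter(year for year, _ in transactions)
pair_counts = Counter(transactions)

min_support, min_confidence = 0.15, 0.6
for (year, branch), count in pair_counts.items():
    support = count / n                     # fraction of all transactions
    confidence = count / year_counts[year]  # fraction of that year's transactions
    if support >= min_support and confidence >= min_confidence:
        print(f"{year} => {branch}  (support={support:.2f}, confidence={confidence:.2f})")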
Step 3: Now we specify the various parameters. These can be specified by clicking
in the text box to the right of the Choose button. In this example, we accept the
default values. The default version does perform some pruning but does not
perform error pruning.
Step 4: Under the 'Test options' in the main panel, we select 10-fold
cross-validation as our evaluation approach. Since we don't have a separate
evaluation data set, this is necessary to get a reasonable idea of the accuracy
of the generated model.
Step 5: We now click 'Start' to generate the model. The ASCII version of the tree
as well as the evaluation statistics will appear in the right panel when model
construction is complete.
Step 6: Note that the classification accuracy of the model is about 69%. This
indicates that more work may be needed (either in preprocessing or in selecting
the parameters for classification).
Step 7: Weka also lets us view a graphical version of the classification tree.
This can be done by right-clicking the last result set and selecting 'Visualize
tree' from the pop-up menu.
Step 9: In the main panel, under 'Test options', click the 'Supplied test set'
radio button and then click the 'Set' button. This will pop up a window which
allows you to open the file containing the test instances.
Dataset test.arff
@data
2005, cse
2005, it
2005, cse
2006, mech
2006, it
2006, ece
2007, it
2007, cse
2008, it
2008, cse
2009, it
2009, ece
Aim: This experiment illustrates the use of the ID3 classifier in Weka. The
sample dataset used in this experiment is the 'employee' data available in ARFF
format. This document assumes that appropriate data preprocessing has been
performed.
Step 3: Now we specify the various parameters. These can be specified by clicking
in the text box to the right of the Choose button. In this example, we accept the
default values. The default version does perform some pruning but does not
perform error pruning.
Step 4: Under the 'Test options' in the main panel, we select 10-fold
cross-validation as our evaluation approach.
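The Weka steps above are GUI driven. A rough scikit-learn sketch of the same
idea is shown below: train a decision tree and estimate its accuracy with
10-fold cross-validation. It assumes scikit-learn is installed, uses the
built-in iris data as a stand-in for the preprocessed employee data, and is not
Weka's ID3/J48 implementation.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

# A built-in dataset stands in for the preprocessed employee data
X, y = load_iris(return_X_y=True)

# Train a decision tree (the entropy criterion is the closest analogue to ID3)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)

# 10-fold cross-validation, as selected under Weka's "Test options"
scores = cross_val_score(clf, X, y, cv=10)
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())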
Aim: This experiment illustrates the use of simple k-mean clustering with Weka
explorer. The sample data set used for this example is based on the iris data
available in ARFF format. This document assumes that appropriate preprocessing
has been performed. This iris dataset includes 150 instances.
Step 1: Run the Weka explorer and load the data file iris.arff in preprocessing
interface.
Step 2: In order to perform clustering, select the 'Cluster' tab in the Explorer
and click on the Choose button. This step results in a dropdown list of available
clustering algorithms.
Step 4: Next, click the text box to the right of the Choose button to get the
pop-up window shown in the screenshots. In this window we enter six as the number
of clusters and leave the seed value as it is. The seed value is used in
generating a random number, which is used for making the initial assignments of
instances to clusters.
Step 6: The result window shows the centroid of each cluster as well as
statistics on the number and percentage of instances assigned to the different
clusters. Here the cluster centroid is the mean vector for each cluster, and
these centroids can be used to characterize the clusters. For example, the
centroid of cluster 1 shows the class Iris-versicolor with a mean sepal length
of 5.4706, sepal width of 2.4765, petal width of 1.1294 and petal length of
3.7941.
The following screenshot shows the clustering results that were generated when
the simple k-means algorithm is applied to the given dataset.
From the above visualization, we can understand the distribution of sepal length
and petal length in each cluster. For instance, each cluster is dominated by
petal length. By changing the colour dimension to other attributes, we can see
their distribution within each of the clusters.
Step 8: We can save the resulting dataset, which includes each instance along
with its assigned cluster. To do so, we click the Save button in the
visualization window and save the result as iris k-mean. The top portion of this
file is shown in the following figure.
PROCEDURE FOR ALL EXPERIMENTS WITH VIVA QUESTIONS
4. What is a dimension?
o A dimension is something that qualifies a quantity (measure).
For example, consider this: if I just say "20 kg", it does not mean anything.
But if I say "20 kg of rice (product) was sold to Ramesh (customer) on 5th April
(date)", then that gives a meaningful sense. The product, customer and date are
dimensions that qualify the measure, 20 kg. Dimensions are mutually independent.
Technically speaking, a dimension is a data element that categorizes each item
in a data set into non-overlapping regions.
5. What is Fact?
o A fact is something that is quantifiable (Or measurable). Facts are typically
(but not always) numerical values that can be aggregated.
6. Briefly state the difference between a data warehouse and a data mart.
o A data warehouse is made up of many data marts and contains many subject
areas, whereas a data mart generally focuses on one subject area. For example, a
bank's data warehouse may contain one data mart for accounts, one for loans, and
so on. This is a high-level definition. Metadata is data about data: for
example, if a data mart receives a file, the metadata will contain information
such as the number of columns, whether the file is fixed-width or delimited, the
ordering of the fields, the data types of the fields, and so on.
62. What can business analysts gain from having a data warehouse?
80. If there are 3 dimensions, how many cuboids are there in the cube?
o 2^3 = 8 cuboids (see the sketch below).
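The count can be checked with a short Python sketch that enumerates every subset
of three example dimensions; the dimension names are hypothetical, and the empty
subset corresponds to the apex cuboid.
from itertools import combinations

dimensions = ["time", "item", "location"]
cuboids = [combo for r in range(len(dimensions) + 1)
           for combo in combinations(dimensions, r)]
print(len(cuboids))   # 8 = 2^3
for c in cuboids:
    print(c if c else "(apex cuboid)")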
85. What are the criteria on the basis of which classification and prediction
can be compared?
o Speed, accuracy, robustness, scalability, goodness of rules, and
interpretability.