DMW LabFile 0901CS243D11 Swastik
Deemed University
(Declared under Distinct Category by Ministry of Education, Government of India)
NAAC ACCREDITED WITH A++ GRADE
Lab Report
of
DATA MINING AND WAREHOUSING (3150403)
SUBMITTED BY :-
Swastik Jain
(0901CS243D11)
SUBMITTED TO :-
Prof. Amit Kumar Manjhvar
Assistant Professor
Session:
Jan-June 2025
EXPERIMENT 1:
● Install Java: WEKA requires Java to run. Make sure you have Java installed on your
system. You can download and install Java from the official Oracle website (https://
www.oracle.com/java/technologies/javase-jdk11-downloads.html).
● Download WEKA: Download the WEKA ZIP distribution from the official WEKA website.
● Extract the WEKA ZIP file: Once downloaded, extract the ZIP file to a location on your
computer. This will create a directory containing the WEKA files.
● Run WEKA: Navigate to the directory where you extracted the WEKA files and locate the
weka.jar file. You can run WEKA by double-clicking on this file or by running the
following command in the terminal or command prompt:
java -jar weka.jar
● Set Up Environment Variables (Optional): If you plan to use WEKA from the command
line or scripts, you might want to set up environment variables to simplify running WEKA.
For example, you can set the WEKA_HOME variable to point to the directory where
WEKA is installed.
● Explore WEKA: Once WEKA is running, you can explore its features and start using it for
machine learning tasks.
EXPERIMENT 2:
● Open a Text Editor: You can use any text editor like Notepad (on Windows), TextEdit (on
macOS), or any other code editor of your choice.
● Define Attributes: In the ARFF file, you need to define the attributes of your dataset. Each
attribute is defined on a separate line using the @attribute keyword followed by the
attribute name and its type, for example @attribute attribute1 numeric.
Replace attribute1, attribute2, etc., with your actual attribute names, and specify the appropriate
type (e.g., numeric, string, or a list of possible nominal values enclosed in curly braces).
● Define Relation: The dataset (relation) name is declared with the @relation keyword, for
example @relation weather. Note that the @relation declaration must be the first line of the
ARFF file, before the @attribute declarations.
● Add Data: Below the header, start the data section with the @data keyword. Each data
instance then goes on a separate line, with attribute values separated by commas and listed
in the same order as the attribute declarations, for example sunny,85,85,FALSE,no.
Replace these values with the actual values for each attribute.
● Save the File: Save the file with a .arff extension. For example, my_dataset.arff.
Here's a simple example of a complete ARFF file:
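A minimal sketch modelled on WEKA's classic weather data (the attribute names, types, and values
are illustrative):

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes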
This example represents a weather dataset with attributes like outlook, temperature, humidity,
windy, and whether to play outside. Adjust the attribute names, types, and data according to your
specific dataset.
EXPERIMENT 3:
● Data Cleaning:
- Handling Missing Values: WEKA provides various methods for handling missing
values. One simple approach is to use the ReplaceMissingValues filter. You can find
it under the "Filters" tab in the Preprocess panel. Select the filter and apply it to your
dataset.
- Outlier Detection and Removal: WEKA doesn't have built-in support for outlier
detection, but you can use filters like RemoveWithValues to remove instances with
outlier values or use clustering techniques to identify outliers.
● Data Transformation:
- Normalization and Standardization: You can use the Normalize filter for
normalization and the Standardize filter for standardization. These filters are
available under the "Filters" tab in the Preprocess panel.
- Encoding Categorical Variables: WEKA automatically handles nominal attributes,
but if you need to explicitly encode them, you can use the NominalToBinary or
NominalToString filters.
● Feature Selection:
- Filter Methods: WEKA offers several attribute evaluators under the "Select
Attributes" tab. For example, you can use the CorrelationAttributeEval evaluator
(with the Ranker search method) to rank attributes by their correlation with the
target variable.
- Wrapper Methods: WEKA supports wrapper-based attribute selection through the
WrapperSubsetEval evaluator or the AttributeSelectedClassifier meta-classifier,
where you specify a search algorithm (e.g., genetic search, greedy hill climbing)
and a base classifier whose cross-validated accuracy is used to evaluate feature
subsets.
● Data Integration:
- Merge or Join: If you need to merge datasets, you can use the MergeTwoFiles filter
under the "Filters" tab. This filter allows you to merge datasets by appending
instances or adding new attributes.
- Aggregation: WEKA doesn't have built-in support for aggregation, but you can
aggregate data using preprocessing steps outside of WEKA or by writing custom
Java code.
● Data Reduction:
- Dimensionality Reduction: WEKA provides various dimensionality reduction
techniques like PCA (PrincipalComponents filter) and attribute selection methods.
- Sampling: WEKA offers filters for sampling, such as Resample and
StratifiedRemoveFolds.
● Data Discretization:
- Binning: You can use the Discretize filter under the "Filters" tab to bin numerical
attributes into intervals.
● Data Augmentation:
- WEKA doesn't have built-in support for data augmentation, but you can generate
synthetic data outside of WEKA and then import it into your dataset.
After applying the desired preprocessing techniques using WEKA's graphical interface,
you can save the modified dataset by clicking the Save... button in the Preprocess panel.
Make sure to save it under a new filename to avoid overwriting the original dataset.
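Most of these filters can also be applied programmatically through WEKA's Java API. Below is a
minimal sketch (the file names are illustrative, and it assumes weka.jar is on the classpath) that
fills missing values, normalizes the numeric attributes, and saves the result under a new name:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class PreprocessDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("my_dataset.arff").getDataSet();

        // Replace missing values with the attribute means/modes
        ReplaceMissingValues fillMissing = new ReplaceMissingValues();
        fillMissing.setInputFormat(data);
        data = Filter.useFilter(data, fillMissing);

        // Scale all numeric attributes into the [0, 1] range
        Normalize normalize = new Normalize();
        normalize.setInputFormat(data);
        data = Filter.useFilter(data, normalize);

        // Save the preprocessed copy under a new file name
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("my_dataset_preprocessed.arff"));
        saver.writeBatch();
    }
}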
EXPERIMENT 4:
OLAP (Online Analytical Processing) operations involve the construction and analysis of data
cubes, which are multidimensional structures used for interactive analysis of large datasets. Here
are the basic OLAP operations:
1. Roll-up (Aggregation): This operation involves summarizing data along one or more
dimensions by moving up in the hierarchy. For example, rolling up sales data from daily
to monthly or yearly.
2. Drill-down (De-aggregation): The opposite of roll-up, drill-down involves breaking
down aggregated data into more detailed levels. For instance, drilling down from yearly
sales to quarterly or monthly sales.
3. Slice: Slicing involves selecting a subset of data from a single dimension of the cube.
For example, analyzing sales data for a particular region or product category.
4. Dice (Subcube): Dicing involves selecting a subcube by specifying ranges or conditions
on multiple dimensions. For instance, analyzing sales data for a specific region and time
period.
5. Pivot (Rotate): Pivoting involves reorienting the cube to view it from different
perspectives. This operation can involve swapping dimensions or reordering them to
facilitate analysis.
6. Drill-across: This operation involves accessing data stored in separate cubes or datasets
that share one or more dimensions. It allows for analysis across multiple datasets.
● Implementing these operations typically involves using OLAP tools or databases that
support multidimensional data structures. While it's possible to perform these operations
manually using SQL or programming languages like Java, specialized OLAP tools provide
a more efficient and user-friendly interface.
● For example, you can perform OLAP operations using tools like Microsoft Excel with its
PivotTable functionality, dedicated OLAP software like IBM Cognos or Oracle Essbase,
or OLAP-enabled databases like Microsoft SQL Server Analysis Services (SSAS) or
PostgreSQL with OLAP extensions.
● Each tool or database may have its own syntax or interface for performing OLAP
operations, but the underlying principles remain the same. These operations enable analysts
to explore and analyze large datasets from different angles, providing valuable insights for
decision-making.
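To make the roll-up idea concrete, here is a small, self-contained Java sketch (the sales figures
are made up) that aggregates daily sales up to the month level; the other operations follow the
same grouping-and-filtering pattern:

import java.util.LinkedHashMap;
import java.util.Map;

public class RollUpDemo {
    public static void main(String[] args) {
        // Daily sales keyed by date (yyyy-MM-dd)
        Map<String, Double> dailySales = new LinkedHashMap<>();
        dailySales.put("2024-01-05", 120.0);
        dailySales.put("2024-01-20", 80.0);
        dailySales.put("2024-02-11", 200.0);

        // Roll-up: aggregate from the day level up to the month level
        Map<String, Double> monthlySales = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : dailySales.entrySet()) {
            String month = e.getKey().substring(0, 7);   // "yyyy-MM"
            monthlySales.merge(month, e.getValue(), Double::sum);
        }

        System.out.println(monthlySales);   // {2024-01=200.0, 2024-02=200.0}
    }
}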
EXPERIMENT 5:
The Apriori algorithm is a classic and widely used algorithm in data mining for discovering
frequent itemsets in transactional databases. It is primarily employed in market basket analysis to
uncover associations between items purchased together. Developed by Agrawal and Srikant in
1994, Apriori is based on the principle of association rule mining, aiming to find patterns or
relationships among items in a dataset.
The algorithm works by iteratively generating candidate itemsets and pruning those that do not
meet the minimum support threshold. Here's a step-by-step explanation of how the Apriori
algorithm operates:
1. Initialization:
- Begin by identifying all unique items in the dataset as candidate 1-itemsets.
- Calculate the support (frequency) of each itemset, representing the number of transactions
containing that itemset, and keep only the itemsets that meet the minimum support threshold.
2. Candidate Generation:
- Join the frequent (k-1)-itemsets with themselves to generate candidate k-itemsets.
- Prune any candidate that has an infrequent (k-1)-subset, using the Apriori property that every
subset of a frequent itemset must itself be frequent.
3. Calculating Support:
- Count the occurrences of each remaining candidate itemset in the dataset to calculate its support.
- Discard candidate itemsets with support below the minimum threshold.
4. Iterating:
- Repeat the process of generating candidate itemsets, calculating support, and pruning until no
new frequent itemsets can be generated.
Overall, the Apriori algorithm remains a foundational technique in data mining, enabling the
discovery of valuable associations and patterns in transactional data, which can be leveraged for
various applications such as market basket analysis, recommendation systems, and more.
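For reference, the algorithm can also be run through WEKA's Java API. A minimal sketch, assuming
weka.jar is on the classpath and a nominal, market-basket style dataset such as WEKA's bundled
supermarket.arff:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("supermarket.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.1);   // minimum support threshold
        apriori.setMinMetric(0.9);              // minimum confidence for the rules
        apriori.setNumRules(10);                // report the top 10 rules
        apriori.buildAssociations(data);

        System.out.println(apriori);            // frequent itemsets and association rules
    }
}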
EXPERIMENT 6:
The FP-Growth (Frequent Pattern Growth) algorithm discovers frequent itemsets without generating
candidate itemsets by compressing the transactional database into an FP-tree.
1. FP-Tree Construction:
- The database is scanned once to count item frequencies; infrequent items are discarded and the
remaining items in each transaction are ordered by frequency.
- A second scan inserts each transaction into a prefix tree (the FP-tree), so that transactions
sharing frequent items share branches of the tree.
2. Mining Frequent Patterns:
- For each item, the algorithm builds a conditional pattern base and a conditional FP-tree, and
recursively grows longer frequent itemsets from shorter ones without candidate generation.
3. Performance Benefits:
- FP-Growth offers significant performance advantages over the Apriori algorithm, especially
for large datasets. By constructing the FP-tree in just two database scans and avoiding candidate
itemset generation, FP-Growth removes the computational overhead of repeated database scans and
candidate generation that Apriori incurs.
- The FP-tree data structure compresses the transactional database, leading to reduced memory
requirements and faster processing.
4. Applications:
- FP-Growth is widely used in various data mining tasks, including market basket analysis,
association rule mining, and frequent itemset discovery.
- It has applications in retail for identifying patterns in customer purchasing behavior, in
healthcare for analyzing patient treatment histories, and in web mining for discovering navigation
patterns on websites, among others.
Overall, the FP-Growth algorithm offers a robust and scalable approach to frequent itemset mining,
enabling the discovery of valuable patterns and associations in transactional data with improved
efficiency compared to traditional methods like Apriori.
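As a sketch, WEKA's FPGrowth implementation can be invoked from Java much like Apriori; note that
it expects binary (market-basket style) attributes, such as those in WEKA's bundled
supermarket.arff, which is assumed here:

import weka.associations.FPGrowth;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FPGrowthDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("supermarket.arff").getDataSet();

        FPGrowth fpGrowth = new FPGrowth();
        fpGrowth.buildAssociations(data);   // mine frequent itemsets and rules with default settings

        System.out.println(fpGrowth);       // print the discovered rules
    }
}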
EXPERIMENT 7:
1. Tree Construction:
- The algorithm recursively partitions the training data based on the values of attributes to create
a tree structure.
- At each node of the tree, the algorithm selects the best attribute to split the data. This is done
by evaluating various splitting criteria, such as Gini impurity, entropy, or information gain, to find
the attribute that maximally separates the classes or reduces uncertainty.
- The data is split into subsets based on the selected attribute's values, creating child nodes
corresponding to each subset.
- This process continues recursively until the stopping criteria are met, such as reaching a
maximum tree depth, no further improvement in impurity, or reaching a minimum number of
instances in a node.
Decision tree induction is widely used in various domains, including healthcare, finance,
marketing, and natural language processing, due to its simplicity, interpretability, and effectiveness
in handling both categorical and numerical data. However, decision trees are prone to overfitting,
especially with noisy data, and may not always generalize well to unseen data without appropriate
techniques such as pruning or ensemble methods like Random Forests or Gradient Boosting Trees.
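To tie this to WEKA, here is a minimal sketch of decision tree induction with the J48 (C4.5)
learner, assuming weka.jar is on the classpath and using the bundled weather.nominal.arff dataset:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class DecisionTreeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the class

        J48 tree = new J48();            // C4.5-style decision tree learner
        tree.buildClassifier(data);

        // 10-fold cross-validation to estimate how well the tree generalizes
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        System.out.println(tree);                   // the induced tree
        System.out.println(eval.toSummaryString()); // accuracy and error statistics
    }
}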
EXPERIMENT 8:
The Bayesian approach to classification involves using Bayes' theorem to compute the probability
of a class given the features of an instance. Here's a step-by-step guide to implementing
classification using the Bayesian approach:
1. Bayes' Theorem:
Bayes' theorem states:
P(Ci∣X) = P(X∣Ci) · P(Ci) / P(X)
● P(Ci∣X) is the posterior probability of class Ci given the features X.
● P(X ∣ Ci) is the likelihood of observing the features X given class Ci.
● P(Ci) is the prior probability of class Ci.
● P(X) is the total probability of observing the features X, also known as the
evidence.
2. Classification Process:
● Training Phase:
● Calculate class priors P(Ci) based on the frequencies of each class in the
training data.
● Estimate the likelihood P(X ∣Ci) for each class based on the training data.
● Classification Phase:
● For a given instance X, compute the posterior probability P(Ci∣X) for each
class Ci.
● Assign the instance to the class with the highest posterior probability.
3. Implementation:
Here's a simple Python implementation of classification using the Bayesian approach:
from collections import Counter

class NaiveBayesClassifier:
    def __init__(self):
        self.class_probs = {}    # priors P(Ci)
        self.feature_probs = {}  # likelihoods P(xj | Ci), keyed by (position, value)

    def train(self, X, y):
        # Class priors from the class frequencies in the training data
        counts = Counter(y)
        self.class_probs = {c: n / len(y) for c, n in counts.items()}
        # Count (position, value) pairs per class, then convert to relative frequencies
        self.feature_probs = {c: Counter() for c in counts}
        for xi, c in zip(X, y):
            self.feature_probs[c].update(enumerate(xi))
        for c, n in counts.items():
            self.feature_probs[c] = {k: v / n for k, v in self.feature_probs[c].items()}

    def predict(self, X):
        predictions = []
        for xi in X:
            scores = {}
            for c, prior in self.class_probs.items():
                likelihood = 1.0
                for feature in enumerate(xi):
                    # Unseen attribute values get a small probability instead of zero
                    likelihood *= self.feature_probs[c].get(feature, 1e-6)
                scores[c] = prior * likelihood
            predictions.append(max(scores, key=scores.get))
        return predictions

# Example usage:
X_train = [[1, 'Sunny'], [2, 'Overcast'], [3, 'Rainy'], [4, 'Sunny']]
y_train = ['No', 'Yes', 'Yes', 'No']
X_test = [[5, 'Rainy']]
classifier = NaiveBayesClassifier()
classifier.train(X_train, y_train)
predictions = classifier.predict(X_test)
print("Predicted class:", predictions[0])
In this implementation, X_train is the training dataset, y_train contains the corresponding class
labels, and X_test is the test dataset for which we want to make predictions. The
NaiveBayesClassifier class trains on the training data using the train method and predicts the
classes of instances in the test data using the predict method.
EXPERIMENT 10:
1. Initialization:
- The algorithm starts by randomly initializing K cluster centroids in the feature space, where K
is the number of clusters specified by the user.
2. Assignment Step:
- Each data point in the dataset is assigned to the nearest centroid based on a distance metric,
commonly the Euclidean distance.
- This step results in the formation of K clusters, where each cluster consists of data points
closest to its corresponding centroid.
3. Update Step:
- After the assignment step, the centroids of the clusters are updated to the mean of the data
points belonging to each cluster.
- The new centroids represent the center of gravity for each cluster and are recalculated
iteratively.
4. Convergence:
- Steps 2 and 3 are repeated iteratively until either the centroids do not change significantly
between iterations or a specified number of iterations is reached.
- The algorithm converges when the centroids stabilize, indicating that the clusters have
reached a stable configuration.
5. Final Result:
- Once the algorithm converges, the final clusters are formed, and each data point is associated
with a cluster.
- The cluster centroids represent the centers of the clusters, and data points within the same
cluster are more similar to each other than to those in other clusters.
6. Evaluation:
- Various metrics can be used to evaluate the quality of the clustering, such as the within-
cluster sum of squares (WCSS) or the silhouette score, which measure the compactness and
separation of clusters.
K-means is efficient and scalable, making it suitable for large datasets. However, it has several
limitations:
- The algorithm's performance depends on the initial placement of centroids, which can lead to
different results for each run.
- K-means assumes clusters to be spherical and of equal size, which may not hold true for all
datasets.
- It is sensitive to outliers and noise, as they can significantly affect the positions of centroids and
the resulting clusters.
Despite these limitations, K-means remains one of the most widely used clustering algorithms due
to its simplicity, efficiency, and effectiveness in many practical scenarios.
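A minimal sketch of K-means through WEKA's Java API using SimpleKMeans, assuming weka.jar is on
the classpath and the bundled iris.arff dataset; the class label is removed first so that
clustering uses only the numeric features:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances raw = new DataSource("iris.arff").getDataSet();

        // Drop the class attribute so it does not influence the clustering
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(raw);
        Instances data = Filter.useFilter(raw, remove);

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(3);   // K, the number of clusters chosen by the user
        kMeans.setSeed(10);         // seed for the random centroid initialization
        kMeans.buildClusterer(data);

        System.out.println(kMeans);                              // centroids and cluster sizes
        System.out.println("WCSS: " + kMeans.getSquaredError()); // within-cluster sum of squares
    }
}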
EXPERIMENT 11:
Building a data warehouse involves integrating data from multiple sources into a centralized
repository optimized for querying and analysis. However, WEKA is primarily a machine learning
toolkit rather than a data warehousing platform. Still, you can use WEKA for exploratory data
analysis and modeling after preparing your data.
Case study of open source data mining tools (WEKA, ORANGE &
TERADATA)
A case study comparing three data mining and analytics tools: WEKA and Orange, which are open
source, and Teradata, which is a commercial platform. We'll consider a hypothetical scenario
where a retail company wants to analyze customer data to improve marketing strategies and
increase sales.
1. WEKA:
The retail company has collected a dataset containing information about customers,
including demographics, purchase history, and behavior. They decide to use
WEKA for exploratory data analysis and predictive modeling.
2. Orange:
The retail company also considers using Orange, another open-source data mining
tool, for their analysis.
3. Teradata:
The retail company also has access to Teradata, a powerful data warehousing and
analytics platform.
● Data Integration: Teradata allows them to integrate data from multiple sources,
including transactional databases, CRM systems, and external sources.
● Scalability: Teradata's parallel processing capabilities enable them to analyze large
volumes of data efficiently.
● Advanced Analytics: They leverage Teradata's advanced analytics capabilities,
such as in-database analytics and machine learning functions, to perform complex
analyses on their customer data.
● Real-Time Analytics: Teradata supports real-time analytics, allowing them to make
data-driven decisions in near real-time, such as personalized marketing campaigns
and dynamic pricing strategies.
In this case study, the retail company compares and evaluates the suitability of WEKA, Orange,
and Teradata based on factors such as ease of use, functionality, scalability, and performance. They
may choose to use a combination of these tools depending on their specific requirements and
constraints. For example, they might use WEKA for initial data exploration and modeling, Orange
for interactive data visualization and feature engineering, and Teradata for scalable analytics and
real-time insights.