Software Project

Slide 2:

The goal of the first part of the project is to extract insights from the UCI Wholesale customers dataset using unsupervised methods. The dataset refers to clients of a wholesale distributor in Portugal and contains the annual spending, in monetary units, on diverse product categories.

Slide 3:

#### Loading, Preprocessing, and Initial Analysis

**Download and convert the dataset:**


1. **Downloading the Dataset**: Obtain the UCI Wholesale customers data from the repository.
2. **Converting the Dataset**: Convert the downloaded data into numerical tables, such
as numpy arrays. This step ensures that the raw data is structured appropriately for
machine learning algorithms.

**Preprocess the Data:**


1. **Filtering Features**: Select relevant numerical data for analysis, such as annual
spending per category, and exclude meta-data that is not needed for the analysis.
2. **Transforming Data**: Apply necessary transformations to handle heterogeneity
among features.

**Generate Basic Statistical Visualizations:**


1. **Histograms**: Create histograms to show the distribution of spendings across
different product categories.
2. **Scatter Plots**: Generate scatter plots to visualize the correlation between various
product categories.
3. **Transformation Verification**: Recompute the histograms and scatter plots after the
transformations to verify that they produce the desired distributions.
#### Detecting Anomalies

**Basic Anomaly Detection:**


1. **Neighbor-Based Anomaly Detection**: Calculate the anomaly score for each data
point based on the distance to its nearest neighbor. This score ranks data points from most
to least anomalous.

**Robust Anomaly Models:**


1. **Soft Minimum Approach**: Implement a robust anomaly detection model using a soft
minimum approach. This approach considers multiple neighbors to define an outlier,
ensuring reproducibility and reducing sensitivity to slight variations in the dataset.
2. **Equation Implementation**: Use the soft minimum function. One standard formulation, consistent with the logsumexp-based implementation described later, is
   $$\mathrm{softmin}_\gamma(z_{j1}, \dots, z_{jN}) = -\frac{1}{\gamma} \log\left(\frac{1}{N} \sum_{k \neq j} e^{-\gamma z_{jk}}\right),$$
   where $z_{jk} = \lVert x_j - x_k \rVert^2$ is the squared distance between observations $j$ and $k$.

**Selecting a Suitable Parameter γ:**


1. **Bootstrap Method**: Use bootstrapping to generate multiple variants of the dataset
by randomly sampling instances. Compute anomaly scores for each variant and
characterize the anomaly of each instance by its average (μ_j) and spread (σ_j).
2. **Parameter Optimization**: Experiment with different values of γ. Plot the anomaly
averages and spreads to visually identify the parameter that offers the best separability
between anomalous and non-anomalous instances. Formalize this visual intuition into an
evaluation metric to derive the optimal γ value.

#### Getting Insights into Anomalies

**Relation Between Anomalies and Meta-Data:**


1. **Subset Analysis**: Investigate whether anomalies are over- or under-represented
among specific types of retailers or geographical locations. Use the meta-data to define
subsets and analyze the distribution of anomaly scores across these subsets.

**Identifying Input Features that Drive Anomaly:**

1. **Explainable AI Techniques**: Use methods like Layer-wise Relevance Propagation
(LRP) to attribute anomaly scores to specific input features.
2. **Relevance Scores**: Calculate the contribution of each data point and feature to the
anomaly score.
3. **Visualization**: Implement the propagation process and visualize the contributions of
each input feature to the anomaly scores. This helps in interpreting the factors driving
anomalies and validating the model.

Slide 4:
### Key Concepts and Their Use in the Project

**Unsupervised Machine Learning:**


Unsupervised machine learning involves training models on data without
labeled outcomes. The goal is to identify patterns and structures within
the data. In this project, unsupervised learning techniques are used to
detect anomalies and clusters within the UCI Wholesale customers
dataset, helping to identify unusual spending behaviors and group similar
customers.

**Data Transformation:**
Data transformation modifies data to improve its suitability for analysis.
Common transformations include scaling, normalization, and log
transformations. In this project, log transformations are applied to the
dataset to handle the heavy-tailed distribution of spending values, making
the data more suitable for statistical analysis and anomaly detection.

**Softmin Function:**
The softmin function provides a smooth approximation to the minimum
function. It is used in the project to enhance the robustness of anomaly
detection by considering multiple nearest neighbors rather than just the
closest one. This approach helps to mitigate the impact of slight variations
in the dataset, improving the reproducibility of anomaly detection results.

**Bootstrap Method:**
The bootstrap method involves repeatedly sampling from a dataset with
replacement to create multiple simulated samples. This technique is used
in the project to estimate the stability of anomaly scores across different
subsets of data. By analyzing the variation in anomaly scores, the project
selects an optimal parameter for the softmin function, ensuring reliable
anomaly detection.

**Explainable AI:**
Explainable AI (XAI) refers to methods that make the outputs of machine
learning models interpretable to humans. In this project, XAI techniques,
such as Layer-wise Relevance Propagation (LRP), are used to identify
which input features contribute most to an anomaly score. This helps in
understanding why certain instances are considered anomalous and
provides insights into the underlying causes of anomalies.

These concepts are integrated to achieve a comprehensive analysis of the wholesale customers' dataset, focusing on detecting, explaining, and understanding anomalies within the data.
Slide 5:
numpy (alias np): for numerical calculations.
matplotlib.pyplot (alias plt): for plotting.
ucimlrepo: to download data from the UCI repository.

The loading code is taken from the dataset website. It downloads the repository with id number 292 and keeps it in a variable named wholesale_customers. We store the features in the X variable and the targets in the y variable as data frames. We then created a new variable, X_copy, without the Channel column (a sketch of this loading step is shown below).
We used itertools, via its function `product`, to generate subplot indexes for the diagrams. We defined a function plot_histograms, which takes two arguments, including the dataframe with the data to visualize (a sketch follows the itertools.product explanation below).
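A minimal sketch of the loading step just described, assuming the standard ucimlrepo API; the Channel column name is taken from the text:

```python
from ucimlrepo import fetch_ucirepo  # pip install ucimlrepo

# Download UCI repository dataset id=292 (Wholesale customers)
wholesale_customers = fetch_ucirepo(id=292)

# Features and targets as pandas DataFrames
X = wholesale_customers.data.features
y = wholesale_customers.data.targets

# Drop the Channel meta-data column, as described in the text
X_copy = X.drop(columns=["Channel"], errors="ignore")
```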
itertools.product(range(subplot_rows), range(subplot_cols)):

itertools.product creates the Cartesian product of two sequences.


range(subplot_rows) generates the sequence of numbers from 0 to subplot_rows-1.
range(subplot_cols) generates the sequence of numbers from 0 to subplot_cols-1.
As a result of running itertools.product(range(3), range(2)) we will get the sequence: (0, 0),
(0, 1), (1, 0), (1, 1), (2, 0), (2, 1).
enumerate(itertools.product(...)):

enumerate adds a counter (idx) to the items generated by itertools.product.


idx is the index number (starting from 0), (r,c) is the pair of coordinates for each subplot.
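Putting the pieces together, a minimal sketch of the plot_histograms function, assuming a 3x2 subplot grid; the exact second argument and styling of the original function are not given in the text:

```python
import itertools
import matplotlib.pyplot as plt

def plot_histograms(df, subplot_rows=3, subplot_cols=2):
    """Plot one histogram per column of df on a grid of subplots."""
    fig, axes = plt.subplots(subplot_rows, subplot_cols, figsize=(10, 10))
    coords = itertools.product(range(subplot_rows), range(subplot_cols))
    for idx, (r, c) in enumerate(coords):
        if idx >= len(df.columns):
            axes[r, c].axis("off")  # hide panels with no column to plot
            continue
        col = df.columns[idx]
        axes[r, c].hist(df[col], bins=30)
        axes[r, c].set_title(col)
    plt.tight_layout()
    plt.show()
```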
Slide 7:
Slide 8:
Slide 9:
**Logarithmic transformation:**
- Values of theta = 1 and theta = 10 in particular seem to produce the best results.

**Anomaly detection:**
- The scores obtained using the hard and the soft minimum indicate different rows as anomalies.
- The top 10 anomalies in the X_copy data show extreme values in different columns, which may point to invalid data or to interesting cases worth investigating further.

**Optimal gamma:**
- The optimal gamma value (0.01) is crucial for more accurate anomaly detection using the soft minimum.

**Impact of the features on the anomaly:**
- Analysis of the Rk and Ri results shows which data features have the greatest impact on the anomaly scores.

**theta = 100:**
- It may shift the data too much, which can lead to loss of information about the relative differences between values.
- The histogram may be less useful if all values are shifted by a large fixed amount, causing a loss of discrimination between small and large values.
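The transformation discussed above is presumably a shifted logarithm; a minimal sketch, assuming theta is the additive shift applied before taking the log:

```python
import numpy as np

# Shifted log transform to tame the heavy-tailed spending values;
# theta = 1 and theta = 10 reportedly gave the best-looking distributions.
theta = 1
X_transform = np.log(X_copy + theta)
```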
Slide 6:
A function that calculates an anomaly score for a single row (observation) in the data.
j = row.name: Gets the row index.
x_j = row.values: Gets the row values as an array.
data_values = data.values: Gets all the values in the DataFrame as a numpy array.

Calculations:
diffs = -data_values + x_j: Calculates the differences between x_j and each observation in
data.
squared_diffs = np.square(diffs): Squares the differences.
z_jk = np.sum(squared_diffs, axis=1): Sums the squared differences for each observation,
creating an array z_jk, which represents the sum of the squared differences for each
observation from x_j.

Excludes the row's own index from the calculation and computes the minimum value of z_jk.
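Putting these steps together, a minimal sketch of the hard-minimum scoring function, reconstructed from the description above (the original code is not shown in the text):

```python
import numpy as np

def compute_outlier_score(row, data):
    """Hard-minimum anomaly score: the squared distance from the row
    to its nearest other observation in the data."""
    j = row.name                          # index of the current row
    x_j = row.values                      # feature values of the row
    data_values = data.values             # all observations as a numpy array

    diffs = -data_values + x_j            # x_j - x_k for every observation k
    squared_diffs = np.square(diffs)
    z_jk = np.sum(squared_diffs, axis=1)  # squared distance to each row

    # Exclude the row itself (distance zero) before taking the minimum
    return z_jk[data.index != j].min()
```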
.apply(lambda row: compute_outlier_score(row, X_transform), axis=1): Applies the
compute_outlier_score function to each row in X_transform.
axis=1: Specifies that the function is applied along the row axis.
.values: Gets the results as a numpy array.
X_outliers['Outlier_Score'] = outlier_scores:
• Adds a new Outlier_Score column to X_outliers containing the calculated anomaly scores.

Sorts outliers_df by the Outlier_Score column in descending order (ascending=False).

full_data.iloc[top_10_outliers.index]:
top_10_outliers.index: Returns the indexes of the 10 most abnormal observations from
sorted_df.
full_data.iloc[top_10_outliers.index]: Uses these indexes to select the appropriate rows from
full_data, the original dataset.
print(...): Prints the selected rows, i.e. the 10 most anomalous observations, in their original form.
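The scoring, sorting, and inspection steps described above can be combined as follows; X_outliers, sorted_df, top_10_outliers, and full_data are the names used in the text, while head(10) is an assumed way of selecting the top rows:

```python
# Score every row of the transformed data
outlier_scores = X_transform.apply(
    lambda row: compute_outlier_score(row, X_transform), axis=1
).values
X_outliers["Outlier_Score"] = outlier_scores

# Sort descending and inspect the 10 most anomalous observations
sorted_df = X_outliers.sort_values("Outlier_Score", ascending=False)
top_10_outliers = sorted_df.head(10)
print(full_data.iloc[top_10_outliers.index])  # original, untransformed rows
```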

**Anomaly Models section:**

logsumexp: A function from the scipy.special library that calculates the logarithm of the sum of exponentials. It is used to compute such logarithmic sums in a numerically stable way.
j = row.name: Gets the row index.
x_j = row.values: Gets the row values as an array.
N = data.shape[0]: Number of rows in the data set.
data_values = data.values: Gets all the values in the DataFrame as a numpy array.

X_transform.apply(lambda row: compute_outlier_score_softmin(row, X_transform, gamma), axis=1).values: Applies the compute_outlier_score_softmin function to each row in X_transform using the apply method, calculating anomaly scores for each observation.
axis=1 means the function is applied to rows.
.values: Gets the results as a numpy array.
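A sketch of the soft-minimum scoring function, assuming the softmin formulation given earlier; the exact normalization constant of the original code is not shown in the text:

```python
import numpy as np
from scipy.special import logsumexp

def compute_outlier_score_softmin(row, data, gamma):
    """Soft-minimum anomaly score over all other observations."""
    j = row.name
    x_j = row.values
    N = data.shape[0]
    data_values = data.values

    z_jk = np.sum(np.square(data_values - x_j), axis=1)
    z_jk = z_jk[data.index != j]  # exclude the row itself

    # softmin_gamma(z) = -(1/gamma) * log(mean(exp(-gamma * z))),
    # computed stably via logsumexp over the N-1 neighbors
    return -(logsumexp(-gamma * z_jk) - np.log(N - 1)) / gamma
```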
Slide 7:
A function that calculates the gamma metric based on average values and standard
deviations.
Parameters:
mean_values: An array of average anomaly score values for individual observations.
std_values: An array of standard deviations of anomaly scores for individual observations.
Action:
np.diff(mean_values).max() calculates the largest difference between consecutive mean values.
std_values.mean() calculates the average standard deviation.
The result is the quotient of the largest difference in mean values divided by the mean standard deviation.
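A minimal sketch of the metric just described; whether mean_values must be pre-sorted is not stated in the text:

```python
import numpy as np

def gamma_metric(mean_values, std_values):
    """Largest gap between consecutive mean anomaly scores,
    divided by the average spread of the scores."""
    return np.diff(mean_values).max() / std_values.mean()
```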

np.arange(0, 8, 1):

np.arange(start, stop, step) creates an array of values from start (inclusive) to stop (exclusive) with the given step.
In this case, np.arange(0, 8, 1) creates the array [0, 1, 2, 3, 4, 5, 6, 7].
for gamma in gamma_values: Loops over each gamma value in gamma_values.
scores = {}: A dictionary to store the anomaly scores for each observation.
for idx in range(X_transform.shape[0]): A loop that initializes the result lists for each
observation.
means = {} and stds = {}: Dictionaries for storing the means and standard deviations of
anomaly results.
bootstrap_sample.apply(..., axis=1):
Applies a function to each row in bootstrap_sample.
lambda row: compute_outlier_score_softmin(row, bootstrap_sample, gamma):
A lambda function that calculates the anomaly score for each row using
compute_outlier_score_softmin.
row: A single row of data.
bootstrap_sample: The entire bootstrap sample, used as the data in the
compute_outlier_score_softmin function.
gamma: Adjustment parameter for the softmin function.

for loop:
Iterates through each index in bootstrap_sample.
range(sample_size): Assumes that sample_size specifies the number of rows in
bootstrap_sample.
Index retrieval:

```python
idx = bootstrap_sample.index.tolist()[i]
```

bootstrap_sample.index.tolist(): Returns the list of row indexes in bootstrap_sample.
[i]: Gets the index of the i-th row.

Adding an anomaly score to scores:

```python
scores[idx].append(outlier_scores[i])
```

scores[idx]: The list of anomaly scores for the row at index idx.
append(outlier_scores[i]): Adds the anomaly score of row i from bootstrap_sample to the list in scores[idx].

scores.items():
scores is a dictionary in which the keys are indexes of observations and the values are lists
of anomaly scores.
scores.items() returns (key, value) pairs, where key is the index and value is a list of
anomaly scores for a given index.
k, v:
k is the observation index.
v is a list of anomaly scores for observations with index k.
means[k] = np.mean(v):
Calculates the average of the anomaly scores v for index k.
Stores this average in the means dictionary under key k.
stds[k] = np.std(v):
Calculates the standard deviation of anomaly scores v for index k.
Stores this standard deviation in the stds dictionary under the key k.
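A sketch of the whole bootstrap loop described on this slide; n_bootstrap is an assumed name, the default integer index is assumed for X_transform, and gamma = 0 from the np.arange(0, 8, 1) grid is skipped here because the softmin sketch above would divide by zero:

```python
import numpy as np

gamma_values = np.arange(1, 8, 1)  # text uses np.arange(0, 8, 1); 0 skipped here
n_bootstrap = 20                   # assumed number of bootstrap rounds
sample_size = X_transform.shape[0]

results = {}
for gamma in gamma_values:
    scores = {idx: [] for idx in range(X_transform.shape[0])}
    for _ in range(n_bootstrap):
        # Resample rows with replacement
        bootstrap_sample = X_transform.sample(n=sample_size, replace=True)
        outlier_scores = bootstrap_sample.apply(
            lambda row: compute_outlier_score_softmin(row, bootstrap_sample, gamma),
            axis=1,
        ).values
        for i in range(sample_size):
            idx = bootstrap_sample.index.tolist()[i]
            scores[idx].append(outlier_scores[i])
    # Mean and spread of each observation's anomaly score (skip never-sampled rows)
    means = {k: np.mean(v) for k, v in scores.items() if v}
    stds = {k: np.std(v) for k, v in scores.items() if v}
    results[gamma] = (means, stds)
```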

Slide 8:
logsumexp: A function from the scipy.special library that reliably calculates the logarithm
of the sum of exponentials. It is useful in numerical calculations to avoid precision problems.

j = row.name: Gets the row index of row.


x_j = row.values: Gets the values of the row as an array.
N = data.shape[0]: Number of rows in the data dataset.
data_values = data.values: Gets all the values in the DataFrame as a numpy array.

Ri_scores:
This is an empty list that will be used to store the Ri anomaly metric results for each
feature and each observation.
range(X_transform.shape[1]):

X_transform.shape[1] returns the number of columns (features) in X_transform.


range(X_transform.shape[1]) creates a sequence of numbers from 0 to
X_transform.shape[1] - 1 that is used to iterate over all features.
i:

i is the loop variable that takes values from 0 to the number of features - 1.

It represents the index of the current feature in X_transform.
X_transform.apply(..., axis=1):

Applies a function to each row in X_transform.


axis=1 means that the function will be applied to each row (observation).
Lambda function:

lambda row: compute_Ri(row, X_transform, gamma, Rk_scores, i):


An anonymous function that takes a single argument, row (one row of data).
It calls the compute_Ri function for each row.
compute_Ri(row, X_transform, gamma, Rk_scores, i):

A function that calculates the Ri metric for a given row, the entire X_transform dataset, the
gamma adjustment parameter, the Rk_scores value, and the feature index i.
.values:

Converts the resulting series to a numpy array.


Ri_scores.append(...):

Adds the resulting array of Ri metrics for the current feature i to the Ri_scores list.
np.array(Ri_scores):
Converts the Ri_scores list to a numpy array.
np.swapaxes(Ri_scores, 0, 1):
Swaps axes 0 and 1 in the Ri_scores array.
After this operation, the Ri_scores array has the shape (n_observations, n_features).
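A sketch of the loop just described. The body of compute_Ri is not given in the text; the version below is a hypothetical redistribution of the neighbor relevances Rk_scores onto feature i, proportional to that feature's share of each squared distance, so treat it as an illustration rather than the project's actual code:

```python
import numpy as np

def compute_Ri(row, data, gamma, Rk_scores, i):
    """Hypothetical feature relevance: the share of feature i in each squared
    distance z_jk, weighted by the neighbor relevances Rk_scores[j]."""
    j = row.name
    diffs_sq = np.square(data.values - row.values)   # (N, n_features)
    z_jk = diffs_sq.sum(axis=1)
    mask = (data.index != j) & (z_jk > 0)            # skip the row itself
    share = diffs_sq[mask, i] / z_jk[mask]
    # Rk_scores[j] is assumed to hold one relevance value per neighbor
    return np.sum(np.asarray(Rk_scores[j])[mask] * share)

# The loop structure described above
Ri_scores = []
for i in range(X_transform.shape[1]):                # iterate over all features
    Ri_scores.append(
        X_transform.apply(
            lambda row: compute_Ri(row, X_transform, gamma, Rk_scores, i),
            axis=1,
        ).values
    )

Ri_scores = np.array(Ri_scores)                      # (n_features, n_observations)
Ri_scores = np.swapaxes(Ri_scores, 0, 1)             # (n_observations, n_features)
```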

You might also like