Software Project

Slide 2:

The goal of the first part of the project is to extract insights from the UCI Wholesale customers dataset using unsupervised methods. The dataset refers to clients of a wholesale distributor in Portugal and contains the annual spending, in monetary units, on diverse product categories.

Slide 3:

#### Loading, Preprocessing, and Initial Analysis

**Download and convert the dataset:**


1. **Downloading the Dataset**: Obtain the UCI Wholesale customers data from the repository.
2. **Converting the Dataset**: Convert the downloaded data into numerical tables, such
as numpy arrays. This step ensures that the raw data is structured appropriately for
machine learning algorithms.

**Preprocess the Data:**


1. **Filtering Features**: Select relevant numerical data for analysis, such as annual
spending per category, and exclude meta-data that is not needed for the analysis.
2. **Transforming Data**: Apply necessary transformations to handle heterogeneity
among features.

**Generate Basic Statistical Visualizations:**


1. **Histograms**: Create histograms to show the distribution of spendings across
different product categories.
2. **Scatter Plots**: Generate scatter plots to visualize the correlation between various
product categories.
3. **Transformation Verification**: Recompute the histograms and scatter plots after the
transformations to verify that they produce the desired distributions.
#### Detecting Anomalies

**Basic Anomaly Detection:**


1. **Neighbor-Based Anomaly Detection**: Calculate the anomaly score for each data
point based on the distance to its nearest neighbor. This score ranks data points from most
to least anomalous.

**Robust Anomaly Models:**


1. **Soft Minimum Approach**: Implement a robust anomaly detection model using a soft
minimum approach. This approach considers multiple neighbors to define an outlier,
ensuring reproducibility and reducing sensitivity to slight variations in the dataset.
2. **Equation Implementation**: Use the soft minimum function. One standard formulation, consistent with the logsumexp-based implementation described later, is
   $$\mathrm{softmin}_\gamma(z_{j1}, \dots, z_{jN}) = -\frac{1}{\gamma} \log\left(\frac{1}{N} \sum_{k \neq j} e^{-\gamma z_{jk}}\right),$$
   where $z_{jk} = \lVert x_j - x_k \rVert^2$ is the squared distance between observations $j$ and $k$.

**Selecting a Suitable Parameter γ:**


1. **Bootstrap Method**: Use bootstrapping to generate multiple variants of the dataset
by randomly sampling instances. Compute anomaly scores for each variant and
characterize the anomaly of each instance by its average (μ_j) and spread (σ_j).
2. **Parameter Optimization**: Experiment with different values of γ. Plot the anomaly
averages and spreads to visually identify the parameter that offers the best separability
between anomalous and non-anomalous instances. Formalize this visual intuition into an
evaluation metric to derive the optimal γ value.

#### Getting Insights into Anomalies

**Relation Between Anomalies and Meta-Data:**


1. **Subset Analysis**: Investigate whether anomalies are over- or under-represented
among specific types of retailers or geographical locations. Use the meta-data to define
subsets and analyze the distribution of anomaly scores across these subsets.

**Identifying Input Features that Drive Anomaly:**

1. **Explainable AI Techniques**: Use methods like Layer-wise Relevance Propagation
(LRP) to attribute anomaly scores to specific input features.
2. **Relevance Scores**: Calculate the contribution of each data point and feature to the
anomaly score.
3. **Visualization**: Implement the propagation process and visualize the contributions of
each input feature to the anomaly scores. This helps in interpreting the factors driving
anomalies and validating the model.

Slide 4:
### Key Concepts and Their Use in the Project

**Unsupervised Machine Learning:**


Unsupervised machine learning involves training models on data without
labeled outcomes. The goal is to identify patterns and structures within
the data. In this project, unsupervised learning techniques are used to
detect anomalies and clusters within the UCI Wholesale customers
dataset, helping to identify unusual spending behaviors and group similar
customers.

**Data Transformation:**
Data transformation modifies data to improve its suitability for analysis.
Common transformations include scaling, normalization, and log
transformations. In this project, log transformations are applied to the
dataset to handle the heavy-tailed distribution of spending values, making
the data more suitable for statistical analysis and anomaly detection.

**Softmin Function:**
The softmin function provides a smooth approximation to the minimum
function. It is used in the project to enhance the robustness of anomaly
detection by considering multiple nearest neighbors rather than just the
closest one. This approach helps to mitigate the impact of slight variations
in the dataset, improving the reproducibility of anomaly detection results.

**Bootstrap Method:**
The bootstrap method involves repeatedly sampling from a dataset with
replacement to create multiple simulated samples. This technique is used
in the project to estimate the stability of anomaly scores across different
subsets of data. By analyzing the variation in anomaly scores, the project
selects an optimal parameter for the softmin function, ensuring reliable
anomaly detection.

**Explainable AI:**
Explainable AI (XAI) refers to methods that make the outputs of machine
learning models interpretable to humans. In this project, XAI techniques,
such as Layer-wise Relevance Propagation (LRP), are used to identify
which input features contribute most to an anomaly score. This helps in
understanding why certain instances are considered anomalous and
provides insights into the underlying causes of anomalies.

These concepts are integrated to achieve a comprehensive analysis of the wholesale customers' dataset, focusing on detecting, explaining, and understanding anomalies within the data.
Slide 5:
numpy (alias np): for numerical calculations.
matplotlib.pyplot (alias plt): for plotting.
ucimlrepo: to download data from the UCI repository.

The loading code is taken from the dataset website. It downloads the repository with id number 292 and keeps it in a variable named wholesale_customers. We store the features in the X variable and the targets in the y variable as data frames. We then created a new variable, X_copy, without the Channel column (a sketch of this loading step is shown below).
We used itertools, via its function `product`, to generate subplot indexes for the diagrams. We defined a function plot_histograms, which takes two arguments, including the dataframe with the data to visualize (a sketch follows the itertools.product explanation below).
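A minimal sketch of the loading step just described, assuming the standard ucimlrepo API; the Channel column name is taken from the text:

```python
from ucimlrepo import fetch_ucirepo  # pip install ucimlrepo

# Download UCI repository dataset id=292 (Wholesale customers)
wholesale_customers = fetch_ucirepo(id=292)

# Features and targets as pandas DataFrames
X = wholesale_customers.data.features
y = wholesale_customers.data.targets

# Drop the Channel meta-data column, as described in the text
X_copy = X.drop(columns=["Channel"], errors="ignore")
```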
itertools.product(range(subplot_rows), range(subplot_cols)):

itertools.product creates the Cartesian product of two sequences.


range(subplot_rows) generates the sequence of numbers from 0 to subplot_rows-1.
range(subplot_cols) generates the sequence of numbers from 0 to subplot_cols-1.
As a result of running itertools.product(range(3), range(2)) we will get the sequence: (0, 0),
(0, 1), (1, 0), (1, 1), (2, 0), (2, 1).
enumerate(itertools.product(...)):

enumerate adds a counter (idx) to the items generated by itertools.product.


idx is the index number (starting from 0), (r,c) is the pair of coordinates for each subplot.
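Putting the pieces together, a minimal sketch of the plot_histograms function, assuming a 3x2 subplot grid; the exact second argument and styling of the original function are not given in the text:

```python
import itertools
import matplotlib.pyplot as plt

def plot_histograms(df, subplot_rows=3, subplot_cols=2):
    """Plot one histogram per column of df on a grid of subplots."""
    fig, axes = plt.subplots(subplot_rows, subplot_cols, figsize=(10, 10))
    coords = itertools.product(range(subplot_rows), range(subplot_cols))
    for idx, (r, c) in enumerate(coords):
        if idx >= len(df.columns):
            axes[r, c].axis("off")  # hide panels with no column to plot
            continue
        col = df.columns[idx]
        axes[r, c].hist(df[col], bins=30)
        axes[r, c].set_title(col)
    plt.tight_layout()
    plt.show()
```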
Slide 7:
Slide 8:
Slide 9:
**Logarithmic transformation:**
- Values of theta = 1 and theta = 10 in particular seem to produce the best results.

**Anomaly detection:**
- The scores obtained using the hard and the soft minimum indicate different rows as anomalies.
- The top 10 anomalies in the X_copy data show extreme values in different columns, which may point to invalid data or to interesting cases worth investigating further.

**Optimal gamma:**
- The optimal gamma value (0.01) is crucial for more accurate anomaly detection using the soft minimum.

**Impact of the features on the anomaly:**
- Analysis of the Rk and Ri results shows which data features have the greatest impact on the anomaly scores.

**theta = 100:**
- It may shift the data too much, which can lead to loss of information about the relative differences between values.
- The histogram may be less useful if all values are shifted by a large fixed amount, causing a loss of discrimination between small and large values.
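The transformation discussed above is presumably a shifted logarithm; a minimal sketch, assuming theta is the additive shift applied before taking the log:

```python
import numpy as np

# Shifted log transform to tame the heavy-tailed spending values;
# theta = 1 and theta = 10 reportedly gave the best-looking distributions.
theta = 1
X_transform = np.log(X_copy + theta)
```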
Slide 6:
A function that calculates an anomaly score for a single row (observation) in the data.
j = row.name: Gets the row index.
x_j = row.values: Gets the row values as an array.
data_values = data.values: Gets all the values in the DataFrame as a numpy array.

Calculations:
diffs = -data_values + x_j: Calculates the differences between x_j and each observation in
data.
squared_diffs = np.square(diffs): Squares the differences.
z_jk = np.sum(squared_diffs, axis=1): Sums the squared differences for each observation,
creating an array z_jk, which represents the sum of the squared differences for each
observation from x_j.

Excludes the row's own index from the calculation and computes the minimum value of z_jk.
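Putting these steps together, a minimal sketch of the hard-minimum scoring function, reconstructed from the description above (the original code is not shown in the text):

```python
import numpy as np

def compute_outlier_score(row, data):
    """Hard-minimum anomaly score: the squared distance from the row
    to its nearest other observation in the data."""
    j = row.name                          # index of the current row
    x_j = row.values                      # feature values of the row
    data_values = data.values             # all observations as a numpy array

    diffs = -data_values + x_j            # x_j - x_k for every observation k
    squared_diffs = np.square(diffs)
    z_jk = np.sum(squared_diffs, axis=1)  # squared distance to each row

    # Exclude the row itself (distance zero) before taking the minimum
    return z_jk[data.index != j].min()
```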
.apply(lambda row: compute_outlier_score(row, X_transform), axis=1): Applies the
compute_outlier_score function to each row in X_transform.
axis=1: Specifies that the function is applied along the row axis.
.values: Gets the results as a numpy array.
X_outliers['Outlier_Score'] = outlier_scores:
• Adds a new Outlier_Score column to X_outliers containing the calculated anomaly scores.

Sorts outliers_df by the Outlier_Score column in descending order (ascending=False).

full_data.iloc[top_10_outliers.index]:
top_10_outliers.index: Returns the indexes of the 10 most abnormal observations from
sorted_df.
full_data.iloc[top_10_outliers.index]: Uses these indexes to select the appropriate rows from
full_data, the original dataset.
print(...): Prints the selected rows, i.e. the 10 most anomalous observations, in their original form.
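The scoring, sorting, and inspection steps described above can be combined as follows; X_outliers, sorted_df, top_10_outliers, and full_data are the names used in the text, while head(10) is an assumed way of selecting the top rows:

```python
# Score every row of the transformed data
outlier_scores = X_transform.apply(
    lambda row: compute_outlier_score(row, X_transform), axis=1
).values
X_outliers["Outlier_Score"] = outlier_scores

# Sort descending and inspect the 10 most anomalous observations
sorted_df = X_outliers.sort_values("Outlier_Score", ascending=False)
top_10_outliers = sorted_df.head(10)
print(full_data.iloc[top_10_outliers.index])  # original, untransformed rows
```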

**Anomaly Models section:**

logsumexp: A function from the scipy.special library that calculates the logarithm of the sum of exponentials. It is used to compute such logarithmic sums in a numerically stable way.
j = row.name: Gets the row index.
x_j = row.values: Gets the row values as an array.
N = data.shape[0]: Number of rows in the data set.
data_values = data.values: Gets all the values in the DataFrame as a numpy array.

X_transform.apply(lambda row: compute_outlier_score_softmin(row, X_transform, gamma), axis=1).values: Applies the compute_outlier_score_softmin function to each row in X_transform using the apply method, calculating anomaly scores for each observation.
axis=1 means the function is applied to rows.
.values: Gets the results as a numpy array.
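A sketch of the soft-minimum scoring function, assuming the softmin formulation given earlier; the exact normalization constant of the original code is not shown in the text:

```python
import numpy as np
from scipy.special import logsumexp

def compute_outlier_score_softmin(row, data, gamma):
    """Soft-minimum anomaly score over all other observations."""
    j = row.name
    x_j = row.values
    N = data.shape[0]
    data_values = data.values

    z_jk = np.sum(np.square(data_values - x_j), axis=1)
    z_jk = z_jk[data.index != j]  # exclude the row itself

    # softmin_gamma(z) = -(1/gamma) * log(mean(exp(-gamma * z))),
    # computed stably via logsumexp over the N-1 neighbors
    return -(logsumexp(-gamma * z_jk) - np.log(N - 1)) / gamma
```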
Slide 7:
A function that calculates the gamma metric based on average values and standard
deviations.
Parameters:
mean_values: An array of average anomaly score values for individual observations.
std_values: An array of standard deviations of anomaly scores for individual observations.
Action:
np.diff(mean_values).max() calculates the largest difference between consecutive mean values.
std_values.mean() calculates the average standard deviation.
The result is the quotient of the largest difference in mean values divided by the mean standard deviation.
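A minimal sketch of the metric just described; whether mean_values must be pre-sorted is not stated in the text:

```python
import numpy as np

def gamma_metric(mean_values, std_values):
    """Largest gap between consecutive mean anomaly scores,
    divided by the average spread of the scores."""
    return np.diff(mean_values).max() / std_values.mean()
```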

np.arange(0, 8, 1):

np.arange(start, stop, step) creates an array of values from start (inclusive) to stop (exclusive) with the given step.
In this case, np.arange(0, 8, 1) creates the array [0, 1, 2, 3, 4, 5, 6, 7].
for gamma in gamma_values: Loops over each gamma value in gamma_values.
scores = {}: A dictionary to store the anomaly scores for each observation.
for idx in range(X_transform.shape[0]): A loop that initializes the result lists for each
observation.
means = {} and stds = {}: Dictionaries for storing the means and standard deviations of
anomaly results.
bootstrap_sample.apply(..., axis=1):
Applies a function to each row in bootstrap_sample.
lambda row: compute_outlier_score_softmin(row, bootstrap_sample, gamma):
A lambda function that calculates the anomaly score for each row using
compute_outlier_score_softmin.
row: A single row of data.
bootstrap_sample: The entire bootstrap sample, used as the data in the
compute_outlier_score_softmin function.
gamma: Adjustment parameter for the softmin function.

for loop:
Iterates through each index in bootstrap_sample.
range(sample_size): Assumes that sample_size specifies the number of rows in
bootstrap_sample.
Index retrieval:

```python
idx = bootstrap_sample.index.tolist()[i]
```

bootstrap_sample.index.tolist(): Returns the list of row indexes in bootstrap_sample.
[i]: Gets the index of the i-th row.

Adding an anomaly score to scores:

```python
scores[idx].append(outlier_scores[i])
```

scores[idx]: The list of anomaly scores for the row at index idx.
append(outlier_scores[i]): Adds the anomaly score of row i from bootstrap_sample to the list in scores[idx].

scores.items():
scores is a dictionary in which the keys are indexes of observations and the values are lists
of anomaly scores.
scores.items() returns (key, value) pairs, where key is the index and value is a list of
anomaly scores for a given index.
k, v:
k is the observation index.
v is a list of anomaly scores for observations with index k.
means[k] = np.mean(v):
Calculates the average of the anomaly scores v for index k.
Stores this average in the means dictionary under key k.
stds[k] = np.std(v):
Calculates the standard deviation of anomaly scores v for index k.
Stores this standard deviation in the stds dictionary under the key k.
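A sketch of the whole bootstrap loop described on this slide; n_bootstrap is an assumed name, the default integer index is assumed for X_transform, and gamma = 0 from the np.arange(0, 8, 1) grid is skipped here because the softmin sketch above would divide by zero:

```python
import numpy as np

gamma_values = np.arange(1, 8, 1)  # text uses np.arange(0, 8, 1); 0 skipped here
n_bootstrap = 20                   # assumed number of bootstrap rounds
sample_size = X_transform.shape[0]

results = {}
for gamma in gamma_values:
    scores = {idx: [] for idx in range(X_transform.shape[0])}
    for _ in range(n_bootstrap):
        # Resample rows with replacement
        bootstrap_sample = X_transform.sample(n=sample_size, replace=True)
        outlier_scores = bootstrap_sample.apply(
            lambda row: compute_outlier_score_softmin(row, bootstrap_sample, gamma),
            axis=1,
        ).values
        for i in range(sample_size):
            idx = bootstrap_sample.index.tolist()[i]
            scores[idx].append(outlier_scores[i])
    # Mean and spread of each observation's anomaly score (skip never-sampled rows)
    means = {k: np.mean(v) for k, v in scores.items() if v}
    stds = {k: np.std(v) for k, v in scores.items() if v}
    results[gamma] = (means, stds)
```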

Slide 8:
logsumexp: A function from the scipy.special library that reliably calculates the logarithm
of the sum of exponentials. It is useful in numerical calculations to avoid precision problems.

j = row.name: Gets the row index of row.


x_j = row.values: Gets the values of the row as an array.
N = data.shape[0]: Number of rows in the data dataset.
data_values = data.values: Gets all the values in the DataFrame as a numpy array.

Ri_scores:
This is an empty list that will be used to store the Ri anomaly metric results for each
feature and each observation.
range(X_transform.shape[1]):

X_transform.shape[1] returns the number of columns (features) in X_transform.


range(X_transform.shape[1]) creates a sequence of numbers from 0 to
X_transform.shape[1] - 1 that is used to iterate over all features.
i:

i is the loop variable that takes values from 0 to the number of features - 1.

It represents the index of the current feature in X_transform.
X_transform.apply(..., axis=1):

Applies a function to each row in X_transform.


axis=1 means that the function will be applied to each row (observation).
Lambda function:

lambda row: compute_Ri(row, X_transform, gamma, Rk_scores, i):


An anonymous function that takes a single argument, row (one row of data).
It calls the compute_Ri function for each row.
compute_Ri(row, X_transform, gamma, Rk_scores, i):

A function that calculates the Ri metric for a given row, the entire X_transform dataset, the
gamma adjustment parameter, the Rk_scores value, and the feature index i.
.values:

Converts the resulting series to a numpy array.


Ri_scores.append(...):

Adds the resulting array of Ri metrics for the current feature i to the Ri_scores list.
np.array(Ri_scores):
Converts the Ri_scores list to a numpy array.
np.swapaxes(Ri_scores, 0, 1):
Swaps axes 0 and 1 in the Ri_scores array.
After this operation, the Ri_scores array has the shape (n_observations, n_features).
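A sketch of the loop just described. The body of compute_Ri is not given in the text; the version below is a hypothetical redistribution of the neighbor relevances Rk_scores onto feature i, proportional to that feature's share of each squared distance, so treat it as an illustration rather than the project's actual code:

```python
import numpy as np

def compute_Ri(row, data, gamma, Rk_scores, i):
    """Hypothetical feature relevance: the share of feature i in each squared
    distance z_jk, weighted by the neighbor relevances Rk_scores[j]."""
    j = row.name
    diffs_sq = np.square(data.values - row.values)   # (N, n_features)
    z_jk = diffs_sq.sum(axis=1)
    mask = (data.index != j) & (z_jk > 0)            # skip the row itself
    share = diffs_sq[mask, i] / z_jk[mask]
    # Rk_scores[j] is assumed to hold one relevance value per neighbor
    return np.sum(np.asarray(Rk_scores[j])[mask] * share)

# The loop structure described above
Ri_scores = []
for i in range(X_transform.shape[1]):                # iterate over all features
    Ri_scores.append(
        X_transform.apply(
            lambda row: compute_Ri(row, X_transform, gamma, Rk_scores, i),
            axis=1,
        ).values
    )

Ri_scores = np.array(Ri_scores)                      # (n_features, n_observations)
Ri_scores = np.swapaxes(Ri_scores, 0, 1)             # (n_observations, n_features)
```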

You might also like