Software Projekt
The goal of the first part of the project is to extract insights from the unsupervised UCI Wholesale customers dataset. The dataset refers to clients of a wholesale distributor in Portugal and contains the annual spending, in monetary units, on diverse product categories.
Slide 3:
Slide 4:
### Key Concepts and Their Use in the Project
**Data Transformation:**
Data transformation modifies data to improve its suitability for analysis.
Common transformations include scaling, normalization, and log
transformations. In this project, log transformations are applied to the
dataset to handle the heavy-tailed distribution of spending values, making
the data more suitable for statistical analysis and anomaly detection.
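As a minimal sketch of this step (the column names and values are illustrative, not taken from the project code), the log transformation can be applied with `np.log1p`, which maps `x` to `log(1 + x)` and so keeps zero spendings finite while compressing the heavy tail:

```python
import numpy as np
import pandas as pd

# Hypothetical spending values with a heavy right tail
X = pd.DataFrame({"Fresh": [3, 12669, 7057, 112151],
                  "Milk":  [9656, 1, 8, 55000]})

# log1p(x) = log(1 + x): zeros stay finite, large values are compressed
X_transform = np.log1p(X)
```

After the transform, the spread between the smallest and largest values shrinks from five orders of magnitude to roughly one, which makes distance-based anomaly scores far less dominated by a single feature.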
**Softmin Function:**
The softmin function provides a smooth approximation to the minimum
function. It is used in the project to enhance the robustness of anomaly
detection by considering multiple nearest neighbors rather than just the
closest one. This approach helps to mitigate the impact of slight variations
in the dataset, improving the reproducibility of anomaly detection results.
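A sketch of the softmin idea (the exact form used in the project may differ in normalization): for scores `z_k`, softmin_gamma(z) = -(1/gamma) * log((1/N) * sum_k exp(-gamma * z_k)), which interpolates between the mean (small gamma) and the hard minimum (large gamma):

```python
import numpy as np
from scipy.special import logsumexp

def softmin(z, gamma):
    """Smooth approximation of min(z); larger gamma -> closer to the true minimum.
    logsumexp is used for numerical stability."""
    z = np.asarray(z, dtype=float)
    return -(logsumexp(-gamma * z) - np.log(len(z))) / gamma

z = np.array([4.0, 1.0, 3.0])
# Small gamma -> close to the mean of z; large gamma -> close to min(z) = 1.0
```

Because the result depends smoothly on all neighbors, a tiny perturbation of the closest point no longer flips the score, which is exactly the robustness property described above.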
**Bootstrap Method:**
The bootstrap method involves repeatedly sampling from a dataset with
replacement to create multiple simulated samples. This technique is used
in the project to estimate the stability of anomaly scores across different
subsets of data. By analyzing the variation in anomaly scores, the project
selects an optimal parameter for the softmin function, ensuring reliable
anomaly detection.
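A minimal sketch of a single bootstrap draw (dummy data; the project repeats this for many samples and gamma values):

```python
import numpy as np
import pandas as pd

# One bootstrap sample: draw n rows with replacement, so some original rows
# appear several times and others not at all. Repeating this B times yields
# B simulated datasets on which score stability can be measured.
data = pd.DataFrame({"Fresh": np.arange(10)})
sample_size = len(data)
bootstrap_sample = data.sample(n=sample_size, replace=True, random_state=0)
```

The variance of a row's anomaly score across such samples is what the parameter selection below is based on.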
**Explainable AI:**
Explainable AI (XAI) refers to methods that make the outputs of machine
learning models interpretable to humans. In this project, XAI techniques,
such as Layer-wise Relevance Propagation (LRP), are used to identify
which input features contribute most to an anomaly score. This helps in
understanding why certain instances are considered anomalous and
provides insights into the underlying causes of anomalies.
The loading code was taken from the dataset website. It downloads the UCI repository with id number 292 and keeps it in a variable named wholesale_customers. We store the features in the X variable and the targets in the Y variable, both as data frames. We then created a new variable, X_copy, without the Channel column. We used itertools' "product" function to generate subplot indexes for the diagrams. We defined a function plot_histograms with two arguments, among them the dataframe with the data for visualization.
itertools.product(range(subplot_rows), range(subplot_cols)):
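The steps above can be sketched as follows; the download lines assume the `ucimlrepo` package from the dataset website, and the grid size and column values here are illustrative stand-ins, not the project's actual data:

```python
import itertools
import pandas as pd
import matplotlib
matplotlib.use("Agg")                     # headless backend for scripts
import matplotlib.pyplot as plt

# The actual download (requires network and the ucimlrepo package) would be:
#   from ucimlrepo import fetch_ucirepo
#   wholesale_customers = fetch_ucirepo(id=292)
#   X = wholesale_customers.data.features
#   Y = wholesale_customers.data.targets
# Offline stand-in with a similar layout (dummy values):
X = pd.DataFrame({"Channel": [1, 2, 1],
                  "Fresh":   [3, 12669, 7057],
                  "Milk":    [9656, 9810, 8]})
X_copy = X.drop(columns=["Channel"])      # drop the Channel column

def plot_histograms(dataframe, bins=10):
    subplot_rows, subplot_cols = 1, 2
    fig, axes = plt.subplots(subplot_rows, subplot_cols, squeeze=False)
    # itertools.product yields every (row, col) subplot position in order
    positions = itertools.product(range(subplot_rows), range(subplot_cols))
    for col, (r, c) in zip(dataframe.columns, positions):
        axes[r][c].hist(dataframe[col], bins=bins)
        axes[r][c].set_title(col)
    fig.tight_layout()
    return fig

fig = plot_histograms(X_copy)
```

`itertools.product(range(rows), range(cols))` avoids a nested loop when pairing each column with its subplot cell.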
It may shift the data too much, leading to a loss of information about the relative differences between values.
The histogram may also become less useful if all values are shifted by a large fixed amount, because the discrimination between small and large values is lost.
Slide 6:
A function that calculates an anomaly score for a single row (observation) in the data.
j = row.name: Gets the row index.
x_j = row.values: Gets the row values as an array.
data_values = data.values: Gets all the values in the DataFrame as a numpy array.
Calculations:
diffs = -data_values + x_j: Calculates the differences between x_j and each observation in
data.
squared_diffs = np.square(diffs): Squares the differences.
z_jk = np.sum(squared_diffs, axis=1): Sums the squared differences for each observation,
creating an array z_jk, which represents the sum of the squared differences for each
observation from x_j.
Excludes the row's own index from the calculation and takes the minimum of z_jk as the anomaly score.
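Putting the steps above together, a sketch of the scoring function (reconstructed from the description; the real project code may differ in details) looks like this:

```python
import numpy as np
import pandas as pd

def compute_outlier_score(row, data):
    """Nearest-neighbor anomaly score: squared distance to the closest other row."""
    j = row.name                          # index of the current row
    x_j = row.values                      # feature values of the current row
    data_values = data.values             # all rows as a numpy array
    diffs = -data_values + x_j            # difference to every observation
    squared_diffs = np.square(diffs)
    z_jk = np.sum(squared_diffs, axis=1)  # squared distance to each row
    mask = data.index != j                # exclude the row's own index
    return np.min(z_jk[mask])

# Tiny illustrative dataset: the third row lies far from the other two
X_transform = pd.DataFrame({"a": [0.0, 0.1, 5.0], "b": [0.0, 0.0, 5.0]})
outlier_scores = X_transform.apply(
    lambda row: compute_outlier_score(row, X_transform), axis=1).values
```

With `axis=1`, `apply` passes each row to the function in turn, so `outlier_scores` lines up positionally with the rows of `X_transform`.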
.apply(lambda row: compute_outlier_score(row, X_transform), axis=1): Applies the
compute_outlier_score function to each row in X_transform.
axis=1: Specifies that the function is applied along the row axis.
.values: Gets the results as a numpy array.
X_outliers['Outlier_Score'] = outlier_scores:
• Adds a new Outlier_Score column to X_outliers containing the calculated anomaly scores.
full_data.iloc[top_10_outliers.index]:
top_10_outliers.index: Returns the indexes of the 10 most anomalous observations from sorted_df.
full_data.iloc[top_10_outliers.index]: Uses these indexes to select the corresponding rows from full_data, the original dataset.
print(...): Prints these selected rows, i.e. the 10 most anomalous observations, in their original form.
logsumexp: A function from the scipy.special library that calculates the logarithm of the sum of exponentials. It is used to compute such sums in a numerically stable way.
j = row.name: Gets the row index.
x_j = row.values: Gets the row values as an array.
N = data.shape[0]: Number of rows in the data set.
data_values = data.values: Gets all the values in the DataFrame as a numpy array.
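Combining these pieces, a softmin version of the scoring function can be sketched as follows (a reconstruction from the description, using the softmin form -(1/gamma) * log((1/N) * sum exp(-gamma * z)); the project's exact normalization may differ):

```python
import numpy as np
import pandas as pd
from scipy.special import logsumexp

def compute_outlier_score_softmin(row, data, gamma):
    """Softmin over the squared distances z_jk instead of a hard minimum."""
    j = row.name
    x_j = row.values
    N = data.shape[0]
    data_values = data.values
    z_jk = np.sum(np.square(data_values - x_j), axis=1)
    z_jk = z_jk[data.index != j]          # exclude the row's own index
    # logsumexp keeps exp(-gamma * z) from underflowing for large gamma
    return -(logsumexp(-gamma * z_jk) - np.log(N - 1)) / gamma
```

For large gamma this recovers the hard nearest-neighbor score; smaller gamma blends in more neighbors.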
np.arange(0, 8, 1):
np.arange(start, stop, step) creates an array of values from start (inclusive) to stop (exclusive) with the given step.
In this case, np.arange(0, 8, 1) creates the array [0, 1, 2, 3, 4, 5, 6, 7].
for gamma in gamma_values: Loops over each gamma value in gamma_values.
scores = {}: A dictionary to store the anomaly scores for each observation.
for idx in range(X_transform.shape[0]): A loop that initializes the result lists for each
observation.
means = {} and stds = {}: Dictionaries for storing the means and standard deviations of
anomaly results.
bootstrap_sample.apply(..., axis=1):
Applies a function to each row in bootstrap_sample.
lambda row: compute_outlier_score_softmin(row, bootstrap_sample, gamma):
A lambda function that calculates the anomaly score for each row using
compute_outlier_score_softmin.
row: A single row of data.
bootstrap_sample: The entire bootstrap sample, used as the data in the compute_outlier_score_softmin function.
gamma: Adjustment parameter for the softmin function.
for loop:
Iterates through each index in bootstrap_sample.
range(sample_size): Assumes that sample_size specifies the number of rows in
bootstrap_sample.
Retrieving the index:
idx = bootstrap_sample.index.tolist()[i]
bootstrap_sample.index.tolist(): Returns the list of row indexes in bootstrap_sample.
scores[idx].append(outlier_scores[i])
scores[idx]: List of anomaly scores for the row at index idx.
append(outlier_scores[i]): Adds the anomaly score for row i from bootstrap_sample to the
list in scores[idx].
scores.items():
scores is a dictionary in which the keys are indexes of observations and the values are lists
of anomaly scores.
scores.items() returns (key, value) pairs, where key is the index and value is a list of
anomaly scores for a given index.
k, v:
k is the observation index.
v is a list of anomaly scores for observations with index k.
means[k] = np.mean(v):
Calculates the average of the anomaly scores v for index k.
Stores this average in the means dictionary under key k.
stds[k] = np.std(v):
Calculates the standard deviation of anomaly scores v for index k.
Stores this standard deviation in the stds dictionary under the key k.
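The whole bootstrap loop described above can be sketched as follows (random stand-in data; `compute_outlier_score_softmin` is reconstructed from the earlier description, and `gamma`, `n_bootstrap`, and `sample_size` are assumed parameters):

```python
import numpy as np
import pandas as pd
from scipy.special import logsumexp

def compute_outlier_score_softmin(row, data, gamma):
    z = np.sum(np.square(data.values - row.values), axis=1)
    z = z[data.index != row.name]                # exclude the row's own index
    return -(logsumexp(-gamma * z) - np.log(len(z))) / gamma

rng = np.random.default_rng(0)
X_transform = pd.DataFrame(rng.normal(size=(20, 3)))
gamma, n_bootstrap = 1.0, 10
sample_size = X_transform.shape[0]

scores = {idx: [] for idx in range(X_transform.shape[0])}
for b in range(n_bootstrap):
    bootstrap_sample = X_transform.sample(n=sample_size, replace=True,
                                          random_state=b)
    outlier_scores = bootstrap_sample.apply(
        lambda row: compute_outlier_score_softmin(row, bootstrap_sample, gamma),
        axis=1).values
    # Record each row's score under its ORIGINAL index, so repeated draws
    # of the same observation accumulate in one list
    for i in range(sample_size):
        idx = bootstrap_sample.index.tolist()[i]
        scores[idx].append(outlier_scores[i])

means, stds = {}, {}
for k, v in scores.items():
    if v:                                        # an index may never be drawn
        means[k] = np.mean(v)
        stds[k] = np.std(v)
```

A small `stds[k]` relative to `means[k]` indicates that gamma gives reproducible scores, which is the criterion used to pick the softmin parameter.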
Slide 8:
logsumexp: A function from the scipy.special library that reliably calculates the logarithm
of the sum of exponents. It is useful for numerical calculations to avoid precision problems.
Ri_scores:
An empty list that will store the Ri anomaly metric results for each feature and each observation.
range(X_transform.shape[1]):
Loops over the feature indexes i. Inside the loop, a function calculates the Ri metric for a given row, using the entire X_transform dataset, the gamma adjustment parameter, the Rk_scores values, and the feature index i.
.values:
Adds the resulting array of Ri metrics for the current feature i to the Ri_scores list.
np.array(Ri_scores):
Converts the Ri_scores list to a numpy array.
np.swapaxes(Ri_scores, 0, 1):
Swaps axes 0 and 1 of the Ri_scores array.
After this operation, the Ri_scores array has the shape (n_observations, n_features).
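The shape bookkeeping above can be sketched with dummy Ri values (the actual relevance formula lives in the project code and is not reproduced here):

```python
import numpy as np

n_features, n_observations = 3, 5
Ri_scores = []
for i in range(n_features):
    # One array of Ri values per feature (placeholder values)
    Ri_scores.append(np.full(n_observations, float(i)))

Ri_scores = np.array(Ri_scores)           # shape (n_features, n_observations)
Ri_scores = np.swapaxes(Ri_scores, 0, 1)  # shape (n_observations, n_features)
```

After the swap, `Ri_scores[j]` holds one relevance value per feature for observation j, which is the layout needed to explain that observation's anomaly score.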