Description
Describe the workflow you want to enable
I'm concerned about the slow execution of the classification_report procedure, which makes it barely suitable for production-grade workloads.
On an 8M sample it already takes about 15 seconds, whereas a simple proof-of-concept numba implementation takes 20 to 40 milliseconds. I understand that the sklearn code is well covered by tests, has wide functionality, and follows style guidelines and best practices, but that cannot excuse a performance gap of roughly three orders of magnitude (~1000x) against a simple POC.
import numpy as np
from sklearn.metrics import classification_report
y_true = np.random.randint(0, 2, size=2 ** 23)
y_pred = y_true.copy()
np.random.shuffle(y_pred[2 ** 20 : 2 ** 21])
print(classification_report(y_true=y_true, y_pred=y_pred, digits=10, output_dict=False, zero_division=0,))
              precision       recall     f1-score   support

           0 0.9373906697 0.9373906697 0.9373906697  4192570
           1 0.9374424159 0.9374424159 0.9374424159  4196038

    accuracy                           0.9374165535  8388608
   macro avg 0.9374165428 0.9374165428 0.9374165428  8388608
weighted avg 0.9374165535 0.9374165535 0.9374165535  8388608
time: 18.6 s (started: 2023-07-08 16:10:08 +03:00)
print(own_classification_report(y_true=y_true, y_pred=y_pred, zero_division=0))
(3930076, 3933544, 262494, 262494, 0.9374165534973145, array([4192570, 4196038], dtype=int64), array([0.93739067, 0.93744242]), array([0.93739067, 0.93744242]), array([0.93739067, 0.93744242]))
time: 16 ms (started: 2023-07-08 16:11:18 +03:00)
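As an aside: when only the per-class numbers are needed (and not the formatted string), sklearn's own precision_recall_fscore_support returns the same arrays without building the report text. A minimal sketch (not part of the original benchmark; the fixed seed is my addition for reproducibility):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)
y_pred = y_true.copy()  # perfect predictions for this sketch

# Returns per-class arrays instead of a formatted string; this skips the
# report's string building but still pays for sklearn's full input validation.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, zero_division=0
)
```

This does not close the 1000x gap on its own, but it is the lighter-weight entry point to time against.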
Describe your proposed solution
import numpy as np
from numba import njit


@njit()
def own_classification_report(y_true: np.ndarray, y_pred: np.ndarray, nclasses: int = 2, zero_division: int = 0):
    correct = np.zeros(nclasses, dtype=np.int64)
    wrong = np.zeros(nclasses, dtype=np.int64)
    for truth, pred in zip(y_true, y_pred):
        if pred == truth:
            correct[pred] += 1
        else:
            wrong[pred] += 1
    # Binary case: class 0 is the negative class, class 1 the positive class.
    tn, tp = correct
    fn, fp = wrong  # predicted 0 but truth 1 -> fn; predicted 1 but truth 0 -> fp
    accuracy = (tn + tp) / len(y_true)
    groups = np.array([(tn + fn), (tp + fp)])    # predicted count per class
    supports = np.array([(tn + fp), (tp + fn)])  # true count per class
    precisions = np.array([tn, tp]) / groups
    recalls = np.array([tn, tp]) / supports
    f1s = 2 * (precisions * recalls) / (precisions + recalls)
    for arr in (precisions, recalls, f1s):
        np.nan_to_num(arr, copy=False, nan=zero_division)
    return tn, tp, fn, fp, accuracy, supports, precisions, recalls, f1s
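For comparison, the same counting can be done in pure NumPy with a single np.bincount over encoded (truth, prediction) pairs, with no numba dependency. This is my own sketch of the idea, not part of the proposal above, and bincount_report is a hypothetical helper name:

```python
import numpy as np


def bincount_report(y_true: np.ndarray, y_pred: np.ndarray, nclasses: int = 2, zero_division: float = 0.0):
    # Encode each (truth, prediction) pair as one integer, then count all
    # pairs in a single pass; reshape into a confusion matrix.
    cm = np.bincount(y_true * nclasses + y_pred, minlength=nclasses * nclasses)
    cm = cm.reshape(nclasses, nclasses)  # rows: truth, cols: prediction
    supports = cm.sum(axis=1)            # true samples per class
    predicted = cm.sum(axis=0)           # predicted samples per class
    diag = np.diag(cm).astype(np.float64)
    with np.errstate(divide="ignore", invalid="ignore"):
        precisions = diag / predicted
        recalls = diag / supports
        f1s = 2 * precisions * recalls / (precisions + recalls)
    # Replace 0/0 artifacts with the requested zero_division value.
    for arr in (precisions, recalls, f1s):
        np.nan_to_num(arr, copy=False, nan=zero_division)
    accuracy = diag.sum() / len(y_true)
    return accuracy, supports, precisions, recalls, f1s
```

Unlike the njit version, this generalizes to any nclasses without the hardcoded tn/tp unpacking, and in my experience a single bincount over 8M elements also runs in milliseconds.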
Describe alternatives you've considered, if relevant
Could one of the original classification_report authors take a look at why it takes so long?
Additional context
I have the latest versions of scikit-learn and numba as of this writing, and an Intel CPU with AVX.