|
4 | 4 | "cell_type": "markdown",
|
5 | 5 | "metadata": {},
|
6 | 6 | "source": [
|
7 |
| - "\n# Post-tuning the decision threshold for cost-sensitive learning\n\nOnce a classifier is trained, the output of the :term:`predict` method outputs class\nlabel predictions corresponding to a thresholding of either the :term:`decision\nfunction` or the :term:`predict_proba` output. For a binary classifier, the default\nthreshold is defined as a posterior probability estimate of 0.5 or a decision score of\n0.0.\n\nHowever, this default strategy is most likely not optimal for the task at hand.\nHere, we use the \"Statlog\" German credit dataset [1]_ to illustrate a use case.\nIn this dataset, the task is to predict whether a person has a \"good\" or \"bad\" credit.\nIn addition, a cost-matrix is provided that specifies the cost of\nmisclassification. Specifically, misclassifying a \"bad\" credit as \"good\" is five\ntimes more costly on average than misclassifying a \"good\" credit as \"bad\".\n\nWe use the :class:`~sklearn.model_selection.TunedThresholdClassifierCV` to select the\ncut-off point of the decision function that minimizes the provided business\ncost.\n\nIn the second part of the example, we further extend this approach by\nconsidering the problem of fraud detection in credit card transactions: in this\ncase, the business metric depends on the amount of each individual transaction.\n.. topic:: References\n\n .. [1] \"Statlog (German Credit Data) Data Set\", UCI Machine Learning Repository,\n [Link](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29).\n\n .. [2] [Charles Elkan, \"The Foundations of Cost-Sensitive Learning\",\n International joint conference on artificial intelligence.\n Vol. 17. No. 1. Lawrence Erlbaum Associates Ltd, 2001.](https://cseweb.ucsd.edu/~elkan/rescale.pdf)\n" |
| 7 | + "\n# Post-tuning the decision threshold for cost-sensitive learning\n\nOnce a classifier is trained, the output of the :term:`predict` method outputs class\nlabel predictions corresponding to a thresholding of either the\n:term:`decision_function` or the :term:`predict_proba` output. For a binary classifier,\nthe default threshold is defined as a posterior probability estimate of 0.5 or a\ndecision score of 0.0.\n\nHowever, this default strategy is most likely not optimal for the task at hand.\nHere, we use the \"Statlog\" German credit dataset [1]_ to illustrate a use case.\nIn this dataset, the task is to predict whether a person has a \"good\" or \"bad\" credit.\nIn addition, a cost-matrix is provided that specifies the cost of\nmisclassification. Specifically, misclassifying a \"bad\" credit as \"good\" is five\ntimes more costly on average than misclassifying a \"good\" credit as \"bad\".\n\nWe use the :class:`~sklearn.model_selection.TunedThresholdClassifierCV` to select the\ncut-off point of the decision function that minimizes the provided business\ncost.\n\nIn the second part of the example, we further extend this approach by\nconsidering the problem of fraud detection in credit card transactions: in this\ncase, the business metric depends on the amount of each individual transaction.\n\n.. rubric :: References\n\n.. [1] \"Statlog (German Credit Data) Data Set\", UCI Machine Learning Repository,\n [Link](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29).\n\n.. [2] [Charles Elkan, \"The Foundations of Cost-Sensitive Learning\",\n International joint conference on artificial intelligence.\n Vol. 17. No. 1. Lawrence Erlbaum Associates Ltd, 2001.](https://cseweb.ucsd.edu/~elkan/rescale.pdf)\n" |
8 | 8 | ]
|
9 | 9 | },
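For context on the estimator introduced above: a minimal, self-contained sketch of how `TunedThresholdClassifierCV` (scikit-learn 1.5+) post-tunes the decision threshold, on synthetic data rather than the notebook's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

# Imbalanced toy problem where the default 0.5 cut-off is unlikely to be optimal.
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)

# The wrapper cross-validates internally to find the cut-off that maximizes
# the given objective metric, then exposes it as `best_threshold_`.
tuned = TunedThresholdClassifierCV(
    LogisticRegression(), scoring="balanced_accuracy"
).fit(X, y)
print(f"Tuned cut-off: {tuned.best_threshold_:.3f}")  # instead of the default 0.5
```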
|
10 | 10 | {
|
|
422 | 422 | },
|
423 | 423 | "outputs": [],
|
424 | 424 | "source": [
|
425 |
| - "fraud = target == 1\namount_fraud = data[\"Amount\"][fraud]\n_, ax = plt.subplots()\nax.hist(amount_fraud, bins=100)\nax.set_title(\"Amount of fraud transaction\")\n_ = ax.set_xlabel(\"Amount ($)\")" |
| 425 | + "fraud = target == 1\namount_fraud = data[\"Amount\"][fraud]\n_, ax = plt.subplots()\nax.hist(amount_fraud, bins=100)\nax.set_title(\"Amount of fraud transaction\")\n_ = ax.set_xlabel(\"Amount (\u20ac)\")" |
426 | 426 | ]
|
427 | 427 | },
|
428 | 428 | {
|
429 | 429 | "cell_type": "markdown",
|
430 | 430 | "metadata": {},
|
431 | 431 | "source": [
|
432 |
| - "### Addressing the problem with a business metric\n\nNow, we create the business metric that depends on the amount of each transaction. We\ndefine the cost matrix similarly to [2]_. Accepting a legitimate transaction provides\na gain of 2% of the amount of the transaction. However, accepting a fraudulent\ntransaction result in a loss of the amount of the transaction. As stated in [2]_, the\ngain and loss related to refusals (of fraudulent and legitimate transactions) are not\ntrivial to define. Here, we define that a refusal of a legitimate transaction is\nestimated to a loss of $5 while the refusal of a fraudulent transaction is estimated\nto a gain of $50 dollars and the amount of the transaction. Therefore, we define the\nfollowing function to compute the total benefit of a given decision:\n\n" |
| 432 | + "### Addressing the problem with a business metric\n\nNow, we create the business metric that depends on the amount of each transaction. We\ndefine the cost matrix similarly to [2]_. Accepting a legitimate transaction provides\na gain of 2% of the amount of the transaction. However, accepting a fraudulent\ntransaction result in a loss of the amount of the transaction. As stated in [2]_, the\ngain and loss related to refusals (of fraudulent and legitimate transactions) are not\ntrivial to define. Here, we define that a refusal of a legitimate transaction is\nestimated to a loss of 5\u20ac while the refusal of a fraudulent transaction is estimated\nto a gain of 50\u20ac and the amount of the transaction. Therefore, we define the\nfollowing function to compute the total benefit of a given decision:\n\n" |
433 | 433 | ]
|
434 | 434 | },
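The cell defining the metric itself is not part of this hunk. Below is a sketch consistent with the rules stated above (class 1 flags fraud, and predicting 1 means refusing the transaction); the function body and the metadata-routing setup are reconstructions rather than the notebook's verbatim code, though the names `business_metric` and `business_scorer` match the later cells:

```python
import sklearn
from sklearn.metrics import make_scorer

def business_metric(y_true, y_pred, amount):
    """Total benefit in euro of the decisions in `y_pred`."""
    accepted_legit = (y_true == 0) & (y_pred == 0)
    accepted_fraud = (y_true == 1) & (y_pred == 0)
    refused_legit = (y_true == 0) & (y_pred == 1)
    refused_fraud = (y_true == 1) & (y_pred == 1)
    return (
        (amount[accepted_legit] * 0.02).sum()  # 2% gain on each accepted legitimate
        - amount[accepted_fraud].sum()         # full loss on each accepted fraud
        - 5 * refused_legit.sum()              # 5 euro loss per refused legitimate
        + (50 + amount[refused_fraud]).sum()   # 50 euro + amount per refused fraud
    )

# Metadata routing lets the scorer receive the per-sample `amount` at scoring
# time, which is how `business_scorer(model, X, y, amount=...)` works below.
sklearn.set_config(enable_metadata_routing=True)
business_scorer = make_scorer(business_metric).set_score_request(amount=True)
```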
|
435 | 435 | {
|
|
505 | 505 | },
|
506 | 506 | "outputs": [],
|
507 | 507 | "source": [
|
508 |
| - "from sklearn.dummy import DummyClassifier\n\neasy_going_classifier = DummyClassifier(strategy=\"constant\", constant=0)\neasy_going_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n easy_going_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our easy-going classifier: ${benefit_cost:,.2f}\")" |
| 508 | + "from sklearn.dummy import DummyClassifier\n\neasy_going_classifier = DummyClassifier(strategy=\"constant\", constant=0)\neasy_going_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n easy_going_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our easy-going classifier: {benefit_cost:,.2f}\u20ac\")" |
509 | 509 | ]
|
510 | 510 | },
|
511 | 511 | {
|
512 | 512 | "cell_type": "markdown",
|
513 | 513 | "metadata": {},
|
514 | 514 | "source": [
|
515 |
| - "A classifier that predict all transactions as legitimate would create a profit of\naround $220,000. We make the same evaluation for a classifier that predicts all\ntransactions as fraudulent.\n\n" |
| 515 | + "A classifier that predict all transactions as legitimate would create a profit of\naround 220,000.\u20ac We make the same evaluation for a classifier that predicts all\ntransactions as fraudulent.\n\n" |
516 | 516 | ]
|
517 | 517 | },
|
518 | 518 | {
|
|
523 | 523 | },
|
524 | 524 | "outputs": [],
|
525 | 525 | "source": [
|
526 |
| - "intolerant_classifier = DummyClassifier(strategy=\"constant\", constant=1)\nintolerant_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n intolerant_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our intolerant classifier: ${benefit_cost:,.2f}\")" |
| 526 | + "intolerant_classifier = DummyClassifier(strategy=\"constant\", constant=1)\nintolerant_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n intolerant_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our intolerant classifier: {benefit_cost:,.2f}\u20ac\")" |
527 | 527 | ]
|
528 | 528 | },
|
529 | 529 | {
|
530 | 530 | "cell_type": "markdown",
|
531 | 531 | "metadata": {},
|
532 | 532 | "source": [
|
533 |
| - "Such a classifier create a loss of around $670,000. A predictive model should allow\nus to make a profit larger than $220,000. It is interesting to compare this business\nmetric with another \"standard\" statistical metric such as the balanced accuracy.\n\n" |
| 533 | + "Such a classifier create a loss of around 670,000.\u20ac A predictive model should allow\nus to make a profit larger than 220,000.\u20ac It is interesting to compare this business\nmetric with another \"standard\" statistical metric such as the balanced accuracy.\n\n" |
534 | 534 | ]
|
535 | 535 | },
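`balanced_accuracy_scorer`, used in the next hunk, is also defined outside the lines shown; one plausible construction via scikit-learn's scorer registry:

```python
from sklearn.metrics import get_scorer

# Retrieve the prebuilt scorer so it can be called the same way as the
# business scorer, i.e. balanced_accuracy_scorer(model, data_test, target_test).
balanced_accuracy_scorer = get_scorer("balanced_accuracy")
```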
|
536 | 536 | {
|
|
559 | 559 | },
|
560 | 560 | "outputs": [],
|
561 | 561 | "source": [
|
562 |
| - "from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nlogistic_regression = make_pipeline(StandardScaler(), LogisticRegression())\nparam_grid = {\"logisticregression__C\": np.logspace(-6, 6, 13)}\nmodel = GridSearchCV(logistic_regression, param_grid, scoring=\"neg_log_loss\").fit(\n data_train, target_train\n)\n\nprint(\n \"Benefit/cost of our logistic regression: \"\n f\"${business_scorer(model, data_test, target_test, amount=amount_test):,.2f}\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model, data_test, target_test):.3f}\"\n)" |
| 562 | + "from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nlogistic_regression = make_pipeline(StandardScaler(), LogisticRegression())\nparam_grid = {\"logisticregression__C\": np.logspace(-6, 6, 13)}\nmodel = GridSearchCV(logistic_regression, param_grid, scoring=\"neg_log_loss\").fit(\n data_train, target_train\n)\n\nprint(\n \"Benefit/cost of our logistic regression: \"\n f\"{business_scorer(model, data_test, target_test, amount=amount_test):,.2f}\u20ac\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model, data_test, target_test):.3f}\"\n)" |
563 | 563 | ]
|
564 | 564 | },
|
565 | 565 | {
|
|
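The diff jumps from the grid-searched `model` to evaluating `tuned_model`; the cell constructing it is not shown. One way it could be built, assuming `amount_train` holds the transaction amounts of the training split:

```python
from sklearn.model_selection import TunedThresholdClassifierCV

# Tune the cut-off of the grid-searched pipeline against the business scorer;
# the routed `amount` reaches the scorer during the internal cross-validation.
tuned_model = TunedThresholdClassifierCV(
    estimator=model,
    scoring=business_scorer,
    thresholds=100,
    n_jobs=2,
)
tuned_model.fit(data_train, target_train, amount=amount_train)
```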
606 | 606 | },
|
607 | 607 | "outputs": [],
|
608 | 608 | "source": [
|
609 |
| - "print(\n \"Benefit/cost of our logistic regression: \"\n f\"${business_scorer(tuned_model, data_test, target_test, amount=amount_test):,.2f}\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(tuned_model, data_test, target_test):.3f}\"\n)" |
| 609 | + "print(\n \"Benefit/cost of our logistic regression: \"\n f\"{business_scorer(tuned_model, data_test, target_test, amount=amount_test):,.2f}\u20ac\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(tuned_model, data_test, target_test):.3f}\"\n)" |
610 | 610 | ]
|
611 | 611 | },
|
612 | 612 | {
|
|
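`model_fixed_threshold`, evaluated in the next hunk, is likewise defined outside the lines shown. A sketch using `FixedThresholdClassifier`; reusing the tuned cut-off here is purely illustrative, and the notebook may pin a different value:

```python
from sklearn.model_selection import FixedThresholdClassifier

# Apply a hand-picked cut-off without any tuning; fitting refits a clone of
# the wrapped pipeline and then thresholds its scores at `threshold`.
model_fixed_threshold = FixedThresholdClassifier(
    estimator=model, threshold=tuned_model.best_threshold_
).fit(data_train, target_train)
```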
635 | 635 | },
|
636 | 636 | "outputs": [],
|
637 | 637 | "source": [
|
638 |
| - "business_score = business_scorer(\n model_fixed_threshold, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our logistic regression: ${business_score:,.2f}\")\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model_fixed_threshold, data_test, target_test):.3f}\"\n)" |
| 638 | + "business_score = business_scorer(\n model_fixed_threshold, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our logistic regression: {business_score:,.2f}\u20ac\")\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model_fixed_threshold, data_test, target_test):.3f}\"\n)" |
639 | 639 | ]
|
640 | 640 | },
|
641 | 641 | {
|
|