How To Argue With An Algorithm: Lessons From The COMPAS-ProPublica Debate
INTRODUCTION .......................................................... 133
I. THE CASE: WISCONSIN v. LOOMIS ...................................... 138
   A. Pre-Sentence Information ........................................ 138
   B. Sentencing ...................................................... 139
   C. Due Process Claims .............................................. 140
II. THE ALGORITHM: RISK ASSESSMENT .................................... 142
   A. Why Assess Risk with an Algorithm? .............................. 142
   B. Why Do Information Systems Matter? .............................. 144
   C. Why Is Data Quality Alone Insufficient? ......................... 146
III. THE DEBATE: PROPUBLICA AND COMPAS ................................ 148
   A. ProPublica Claims Bias .......................................... 148
   B. Fairness in Predictive Algorithms ............................... 150
   C. Explainable Data Science ........................................ 151
   D. Comparing Populations ........................................... 152
IV. A PROPOSAL: ALTERNATIVE CLAIMS FOR LOOMIS ......................... 154
   A. Provenance ...................................................... 154
   B. Practice ........................................................ 155
   C. Training Data ................................................... 157
   D. Data Science Reasoning .......................................... 158
CONCLUSION ............................................................ 159
INTRODUCTION
opportunity to challenge his risk scores by arguing that other factors or information
demonstrate their inaccuracy.").
8. Id. at 761 ("Although Loomis cannot review and challenge how the COMPAS
algorithm calculates risk, he can at least review and challenge the resulting risk scores
set forth in the report attached to the PSI.").
9. Id. at 772, cert. denied, 137 S. Ct. 2290 (2017).
10. Id. at 761-62.
11. Id. ("Loomis had the opportunity to verify that the questions and answers listed
on the COMPAS report were accurate.").
12. Data quality has multiple dimensions beyond accuracy, including completeness,
consistency, timeliness, representativeness, unambiguousness, meaning, precision, and
reliability. See Yair Wand & Richard Y. Wang, Anchoring Data Quality Dimensions in
Ontological Foundations, COMMS. ACM, Nov. 1996, at 86, 93-94.
13. Loomis, 881 N.W.2d at 763.
14. For a review of the initial academic impact of a May 2016 ProPublica
investigative journalism article, see Julia Angwin & Jeff Larson, Bias in Criminal Risk
Scores is Mathematically Inevitable, Researchers Say, PROPUBLICA (Dec. 30, 2016),
https://www.propublica.org/article/bias-in-criminal-risk-scores-is-mathematically-
inevitable-researchers-say [http://perma.cc/Y7VC-66TG].
15. In a concurring opinion to Loomis, Justice Shirley S. Abrahamson noted, "this
court's lack of understanding of COMPAS was a significant problem in the instant case.
At oral argument, the court repeatedly questioned both the State's and the defendant's
counsel about how COMPAS works. Few answers were available." Loomis, 881 N.W.2d
at 774 (Abrahamson, J., concurring).
16. Julia Angwin et al., Machine Bias: There's Software Used Across the Country to
Predict Future Criminals. And It's Biased Against Blacks, PROPUBLICA (May 23, 2016),
https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-
sentencing [http://perma.cc/3M9F-LFDM].
17. Id.
18. WILLIAM DIETERICH ET AL., COMPAS RISK SCALES: DEMONSTRATING ACCURACY
EQUITY AND PREDICTIVE PARITY (2016), http://go.volarisgroup.com/rs/430-MBX-989/
images/ProPublicaCommentaryFinal 070616.pdf [http://perma.cc/L7VU-T4BT].
19. Jeff Larson & Julia Angwin, ProPublica Responds to Company's Critique of
Machine Bias Story, PROPUBLICA (July 29, 2016), www.propublica.org/article/propublica
-responds-to-companys-critique-of-machine-bias-story [http://perma.cc/FM9V-W5EU].
20. On January 8, 2019 we ran a Google Scholar search for articles that cited the
URL or the title of the ProPublica article in 2016-17 and found 248 English-language
results. There were another 330 results in 2018. Citations to "Machine Bias" by
ProPublica, GOOGLE SCHOLAR, https://scholar.google.com/ (type "machine bias" OR
"www.propublica.org/article/machine-bias" into Google Scholar).
21. Data and Analysis for 'Machine Bias', GITHUB, https://github.com/propublica/
compas-analysis/ [http://perma.cc/6UEP-24YS] [hereinafter Data and Analysis]; see Jeff
Larson et al., How We Analyzed the COMPAS Recidivism Algorithm, PROPUBLICA (May
23, 2016), https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-
algorithm [http://perma.cc/QD4F-3VBR].
22. For an in-depth discussion of the social construction of actuarial risk in risk
assessment, see Jessica M. Eaglin, Constructing Recidivism Risk, 67 EMORY L.J. 59
(2017).
23. State v. Loomis, 881 N.W.2d 749, 753 (Wis. 2016).
24. Id.
25. Thanks to Julia Powles and Andrew Selbst for this point. See Frederick Schauer,
Giving Reasons, 47 STAN. L. REV. 633, 633 (1995); Martin Shapiro, The Giving Reasons
Requirement, 1992 U. CHI. LEGAL F. 179, 180 (1992).
26. Legal and public administration scholars have written extensively on the
importance of transparent procedures. See, e.g., Loomis, 881 N.W.2d at 774
(Abrahamson, J., concurring) ("First, I conclude that in considering COMPAS (or other
risk assessment tools) in sentencing, a circuit court must set forth on the record a
meaningful process of reasoning addressing the relevance, strengths, and weaknesses of
the risk assessment tool."); CHRISTOPHER HOOD & DAVID HEALD, TRANSPARENCY: THE
KEY TO BETTER GOVERNANCE? (2006); Rónán Kennedy, Algorithms and the Rule of Law,
17 LEGAL INFO. MGMT. 170, 170-72 (2017); Shapiro, supra note 25.
27. The 2017 GDPR regulates the online exchange of data in the European Union
and calls for more accountability than contemporary laws in other jurisdictions. See
Andrew D. Selbst & Solon Barocas, The Intuitive Appeal of Explainable Machines, 87
FORDHAM L. REV. 1085 (2018); Andrew D. Selbst & Julia Powles, Meaningful
Information and the Right to Explanation, 7 INT'L DATA PRIVACY L. 233, 234 (2017).
28. Danielle Keats Citron, Technological Due Process, 85 WASH. U. L. REV. 1249,
1254 (2008) ("The opacity of automated systems shields them from scrutiny. Citizens
cannot see or debate these new rules. In turn, the transparency, accuracy, and political
accountability of administrative rulemaking are lost."). Legal scholars warned about the
possibility that technology could hide procedures. For a discussion on procedures hidden
in predictive scoring algorithms, see FRANK PASQUALE, THE BLACK BOX SOCIETY: THE
SECRET ALGORITHMS THAT CONTROL MONEY AND INFORMATION (2015).
29. Criminal justice has a history of risk assessments even before the use of
computational algorithms. See, e.g., Charles W. Dean & Thomas J. Duggan, Problems in
Parole Prediction: A Historical Analysis, 15 SOC. PROBS. 450, 457 (1968); Michael
Hakeem, The Validity of the Burgess Method of Parole Prediction, 53 AM. J. SOC. 376,
379 (1948).
30. "Data Science Reasoning" was the title of my 2016-17 Fellowship at the Data
&
Society Research Institute where I considered how to improve data science education
and data literacy in the public sector. See Data Science Reasoning, DATA & SOC'Y,
https://datasociety.net/initiatives/additional-projects/datareasoning/
[http://perma.ccIL85T-URUG].
31. Public sector informatics considers the institutional and social contexts of the
texts created by government. See generally Kevin P. Jones, Informatics, 261 NATURE 370
(1976) (defining informatics as the study of structure within large collections of text);
Rob Kling, What Is Social Informatics and Why Does It Matter?, 5 D-LIB MAG., Jan. 1999,
http://www.dlib.org/dlib/january99/kling/01kling.html [http://perma.cc/M897-5JKA].
A. Pre-Sentence Information
The concerns in Loomis revolved around the contents of a Pre-Sentence
Investigation (PSI) report. The Wisconsin circuit court ordered a PSI report on the
defendant in Loomis, which included a risk assessment generated by the COMPAS
algorithm.35 PSI reports support the internal operational efficiency of the court.36
The Wisconsin Supreme Court cited the State v. Skaff decision, which determined
that a defendant was in the best position to refute, explain, or supplement incorrect
or incomplete information in the PSI.37
The PSI in Loomis included a COMPAS risk assessment score, a graph chart
showing the placement of the score, and twenty-one related questions and answers.38
COMPAS scores range from 1 to 10, with 10 representing the strongest prediction of
risk.39 The predictive risk scores are then grouped into classifications: 1-4 Low Risk,
5-7 Medium Risk, and 8-10 High Risk.40
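The decile-to-category grouping can be written out directly. The short Python sketch below is illustrative only; the function name and cut points simply restate the bands described above and do not reproduce Northpointe's implementation.

def risk_category(decile_score):
    """Map a COMPAS decile score (1-10) to the risk label reported in the PSI.
    Illustrative sketch only; cut points follow the text above, not vendor code."""
    if not 1 <= decile_score <= 10:
        raise ValueError("COMPAS decile scores range from 1 to 10")
    if decile_score <= 4:
        return "Low Risk"
    if decile_score <= 7:
        return "Medium Risk"
    return "High Risk"

# Example: a score of 8 falls in the High Risk band.
print(risk_category(8))  # High Risk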
The COMPAS risk assessment is derived, in part, from responses to a series of
questions.41 The sources that go into the COMPAS algorithms differ by jurisdiction
and predictive
B. Sentencing
In Loomis, the defendant denied involvement in the crime but waived his right to
trial by agreeing to a plea deal.44 The plea deal left the actual sentence to the
discretion of the Wisconsin circuit court judge.45 The judge accepted the guilty plea
from the defendant and ordered a risk assessment as part of the PSI.46 The COMPAS
risk assessment predicted that the defendant had high pre-trial risk, high risk of
recidivism, and high risk of violent recidivism.47
Instead of one year in county jail with probation, which the prosecution and
defense had agreed upon, the circuit court sentenced the defendant to "seven years
with four years initial confinement" for operating a motor vehicle without the owner's
consent.48 For attempting to flee an officer, the circuit court sentenced him to four
years with two years of initial confinement to be served consecutively in state prison.49
Both charges were repeat offenses.50
The defendant filed a motion requesting a new sentencing hearing, arguing that
"the circuit court erroneously exercised its discretion" by referring to a high-risk
assessment score when imposing the maximum sentence.51 At the sentencing hearing,
the circuit court referenced the high COMPAS risk classification given to the
defendant, specifically stating that his PSI shows "a high risk
63. Id.
64. 152 Wis. 2d 48, 53 (Ct. App. 1989).
65. Id. at 57. The Loomis opinion stated multiple times that the defendant had the
ability "to refute, explain, or supplement the [pre-sentencing report]." Loomis, 881
N.W.2d at 760.
66. Loomis, 881 N.W.2d at 761.
67. Id.
68. Computer scientists debate over whether data structures or operations are more
influential in determining the outcome of algorithms. Moshe Vardi compares the
problem to physicists arguing about whether light is a particle or a wave. Moshe Y. Vardi,
What Is an Algorithm?, COMMS. ACM, Mar. 2012, at 5, 5.
69. See generally "RAW DATA" IS AN OXYMORON (Lisa Gitelman ed., 2013) (arguing
that data are anything but "raw" and that data should be viewed as a cultural resource
that needs to be generated, protected, and interpreted).
70. Loomis, 881 N.W.2d at 764.
71. Id. at 757, 763-64 ("Although we ultimately conclude that a COMPAS risk
assessment can be used at sentencing, we do so by circumscribing its use. Importantly,
we address how it can be used and what limitations and cautions a circuit court must
observe in order to avoid potential due process violations .... Specifically, any PSI
containing a COMPAS risk assessment must inform the sentencing court about the
following cautions regarding a COMPAS risk assessment's accuracy: (1) the proprietary
nature of COMPAS has been invoked to prevent disclosure of information relating to how
factors are weighed or how risk scores are to be determined; (2) risk assessment
compares defendants to a national sample, but no cross-validation study for a Wisconsin
population has yet been completed; (3) some studies of COMPAS risk assessment scores
have raised questions about whether they disproportionately classify minority offenders
as having a higher risk of recidivism; and (4) risk assessment tools must be constantly
monitored and re-normed for accuracy due to changing populations and
subpopulations.").
The question presented for review by the Court was whether the proprietary
nature of the COMPAS violated a defendant's constitutional right to due process
because a defendant cannot challenge the algorithm's accuracy or scientific validity.72
The United States Supreme Court declined to hear the case, allowing the Wisconsin
Supreme Court's ruling to stand.73
In a concurring opinion in the Wisconsin Loomis case, Justice
Abrahamson called for ways that courts could keep up to date with
developments in evidence-based decision-making, noting that "[t]he
court needed all the help it could get."74 This paper attempts to
provide some of that help with arguments about algorithms from
data science scholarship.
72. Loomis v. Wisconsin, 137 S. Ct. 2290 (2017) ("Petition for writ of certiorari to
the Supreme Court of Wisconsin denied.").
73. Id.
74. Loomis, 881 N.W.2d at 774 (Abrahamson, J., concurring).
75. Eaglin, supra note 22.
76. ROBERT SEDGEWICK & KEVIN DANIEL WAYNE, ALGORITHMS 4 (4th ed. 2011).
77. Id.
78. See generally Amanda Clarke & Helen Margetts, Governments and Citizens
Getting to Know Each Other? Open, Closed, and Big Data in Public Management Reform,
6 POL'Y & INTERNET 393 (2014) (discussing the use of big data analysis by governments).
79. Risk assessments are commonly offered along with needs assessments. The
needs of defendants entering the system are assessed to identify low-risk offenders who,
if certain criteria are met, can be supervised in outside rehabilitation. COMPAS provides
The tension between identifying risk and minimizing harm87 was the focus of the
academic debate over COMPAS risk assessment. Is it possible to minimize harm to
those who might be misidentified as a high risk? Is it possible to minimize harm to
the public if someone is misidentified as a low risk? These are age-old questions of
public policy. Discussing similar predictions in 1936, Burgess writes: "Parole, and in
fact our whole system of criminal justice, must constantly be prepared to face trial in
the court of public opinion."88
87. See generally Rachel Courtland, Bias Detectives: The Researchers Striving to
Make Algorithms Fair, 558 NATURE 357, 358-59 (2018); Geoff Pleiss et al., On Fairness
and Calibration, 31 CONF. ON NEURAL INFO. PROCESSING SYS. 2904 (2017),
https://papers.nips.cc/paper/7151-on-fairness-and-calibration.pdf
[http://perma.cc/K7EM-5BSB] (summarizing machine learning research on fairness and
bias).
88. Burgess, supra note 80, at 491.
89. The stops were not equally distributed throughout the population. 83% of the
stops involved a person who was identified as black or Hispanic. Only 6% of these stops
resulted in an arrest. See N.Y. Times Editorial Bd., Racial Discrimination in Stop-and-
Frisk, N.Y. TIMES (Aug. 12, 2013), https://nyti.ms/15x3ngU [https://perma.cc/K5AX-
MXK9].
90. Table 18: Estimated Number of Arrests, United States, 2016, FBI
UNIFORM CRIME REPORTING, https://ucr.fbi.gov/crime-in-the-u.s/2016/crime-in-the-u.s.-
2016/tables/table-18 [http://perma.cc/K5RZ-N2FS].
91. E. ANN CARSON, BUREAU OF JUSTICE STATISTICS, U.S. DEP'T OF JUSTICE,
PRISONERS IN 2014, at 1 (2015), https://www.bjs.gov/content/pub/pdf/p14.pdf
[http://perma.cc/M4Y5-UGNK]; BUREAU OF JUSTICE STATISTICS, U.S. DEP'T OF JUSTICE,
PRISONERS IN 1983, at 1 (1984), https://www.bjs.gov/content/pub/pdf/p83.pdf
[http://perma.cc/827X-8L8D].
92. Teppo Felin et al., The Law and Big Data, CORNELL J.L. & PUB. POL'Y 357, 359
(2017). Courts began to modernize using technology along with other parts of
government. An important point in the United States was the Federal E-Government
Act of 2002. See E-Government Act of 2002, Pub. L. No. 107-347, 116 Stat. 2899 (2002)
(codified as amended across sections of the U.S. Code).
93. BRENNAN ET AL., ENHANCING PRISON CLASSIFICATION SYSTEMS, supra note 43,
at 21; COMPAS Classification, EQUIVANT, http://www.equivant.com/solutions/inmate-
classification [http://perma.cc/AK27-PJ7L].
94. Ed Yong, A Popular Algorithm Is No Better at Predicting Crimes than Random
People, ATLANTIC (Jan. 17, 2018) https://www.theatlantic.com/technology/archive/2018/
01/equivant-compas-algorithm/550646 [http://perma.cc/RPF9-8R9S]; Practitioner's
Guide to COMPAS Core, supra note 40, at 2. Dressel and Farid conducted a study that
compared COMPAS assessments with non-expert human assessments. See Julia Dressel
& Hany Farid, The Accuracy, Fairness, and Limits of Predicting Recidivism, SCl.
ADVANCES, Jan. 17, 2018, at 1, 4
95. E-justice systems in the judicial branch of government developed alongside e-
government systems in the executive branch. In both cases, the systems were designed to
meet public sector statutory goals more efficiently. See generally BUREAU OF JUSTICE
STATISTICS, U.S. DEP'T OF JUSTICE, REPORT OF THE NATIONAL TASK FORCE ON COURT
AUTOMATION AND INTEGRATION (1999), http://www.ncjrs.gov/pdffiles1/177601.pdf
[http://perma.cc/9DB4-JLVR]; João Rosa et al., Risk Factors in E-Justice Information
Systems, 30 GOV'T INFO. Q. 241 (2013).
96. State v. Loomis, 881 N.W.2d 749, 754 (Wis. 2016).
97. Rebecca Wexler, Life, Liberty, and Trade Secrets: Intellectual Property in the
Criminal Justice System, 70 STAN. L. REV. 1343, 1421-22 (2018).
98. See generally Janssen & Kuk, supra note 3, at 373; Kennedy, supra note 26, at
170; Frank Pasquale, A Rule of Persons, Not Machines: The Limits of Legal Automation,
87 GEO. WASH. L. REV. 1, 6 (2019).
99. Abhishek Borah & Gerard J. Tellis, Make, Buy, or Ally? Choice of and Payoff
from Announcements of Alternate Strategies for Innovations, 33 MARKETING SCI. 114,
114 (2014).
100. MIKE EISENBERG ET AL., THE COUNCIL OF STATE GOV'TS JUSTICE CTR.,
VALIDATION OF THE WISCONSIN DEPARTMENT OF CORRECTIONS RISK ASSESSMENT
INSTRUMENT 1 (2009), https://csgjusticecenter.org/wp-content/uploads/2012/12/
WIRiskValidationFinalJuly2009.pdf [http://perma.cc/TD7X-LFYV].
A. ProPublica Claims Bias
A few weeks before the Loomis decision, investigative journalists at ProPublica
published a controversial article claiming that COMPAS risk assessment was biased.118
Although COMPAS and other risk assessment scores had been accused of gender
113. ProPublica published their article in May 2016 and Northpointe replied with a
report disputing their claims in July 2016. See DIETERICH ET AL., supra note 18; Angwin
et al., supra note 16: Larson & Angwin, supra note 19.
114. CORRECTIONS Act, S. 467, 114th Cong. (2015); Sentencing Reform and
Corrections Act, S. 2123, 114th Cong. (2015); Recidivism Risk Reduction Act, H.R. 759,
114th Cong. (2015); Sensenbrenner-Scott SAFE (Safe, Accountable, Fair, Effective)
Justice Reinvestment Act, H.R. 2944, 114th Cong. (2015).
115. Loomis, 881 N.W.2d at 749.
116. DIETERICH ET AL., supra note 18; Angwin et al., supra note 16; Larson & Angwin,
supra note 19.
117. Data and Analysis, supra note 21.
118. ProPublica published the "Machine Bias" Article on May 23, 2016. Angwin et
al., supra note 16. The Loomis decision was released on July 13, 2016. Loomis, 881
N.W.2d 749.
119. Shaina Massie, Orange is the New Equal Protection Violation: How Evidence-
Based Sentencing Harms Male Offenders, 24 WM. & MARY BILL RTS. J. 521 (2015)
(illustrating how some states give different threshold cutoffs or tailor actuarial
instruments to reflect differences in people labeled as male or female); see John
Lightbourne, Damned Lies & Criminal Sentencing: Using Evidence-Based Tools, 15
DUKE L. & TECH. REV. 327 (2017).
120. ProPublica profiled two shoplifting arrests as an illustration. A teenage African-
American girl who had never been arrested before was rated as a medium risk by
COMPAS after being charged with burglary for attempting to steal a bike. A 54-year-old
man of European heritage had been arrested twice, had a criminal record, and had drugs
in his car, but he was rated as low risk by COMPAS after being arrested for shoplifting.
These individual examples represented the statistical problem that inaccurate
predictions, or false positives, were not uniformly applied. Angwin et al., supra note 16.
121. Id.
122. For an excellent visual depiction of the differences in false positives between the
groups, see Sam Corbett-Davies et al., A Computer Program Used for Bail and
Sentencing Decisions Was Labeled Biased Against Blacks. It's Actually Not that Clear.,
WASH. POST (Oct. 17, 2016), https://www.washingtonpost.com/news/monkey-
cage/wp/2016/10/17/can-an-algorithm-be-racist-our-analysis-is-more-cautious-than-
propublicas/ [https://perma.cc/9DC3-3CFH] [hereinafter Corbett-Davies et al., A
Computer Program Used for Bail] ("ProPublica points out that among defendants who
ultimately did not reoffend, blacks were more than twice as likely as whites to be
classified as medium or high risk (42 percent vs. 22 percent). Even though these
defendants did not go on to commit a crime, they are nonetheless subjected to harsher
treatment by the courts. ProPublica argues that a fair algorithm cannot make these
serious errors more frequently for one race group than for another .... Black defendants
who don't reoffend are predicted to be riskier than white defendants who don't reoffend;
this is ProPublica's criticism of the algorithm.").
123. Angwin et al., supra note 16; see Jon Kleinberg et al., Inherent Trade-Offs in the
Fair Determination of Risk Scores, 2017 PROC. OF INNOVATIONS IN THEORETICAL COMPUTER
SCI. (2017), https://arxiv.org/pdf/1609.05807.pdf [http://perma.cc/E3NX-QJWX] (noting
the ProPublica point as "[o]ne of their main contentions was that the tool's errors were
asymmetric: African-American defendants were more likely to be incorrectly labeled as
higher-risk than they actually were, while white defendants were more likely to be
incorrectly labeled as lower-risk than they actually were").
124. Responding to the ProPublica article, Northpointe stated: "Based on our
examination of the work of Angwin et al. and on results of our analysis of their data, we
strongly reject the conclusion that the COMPAS risk scales are racially biased against
blacks. ProPublica focused on classification statistics that did not take into account the
different base rates of recidivism for blacks and whites." DIETERICH ET AL., supra note
18, at 1.
C. Explainable Data Science
Data-driven organizations, including governments, thrive on finding unusual
data sources and complex algorithms to create predictions.138 Algorithms in public
service, however, have a special need to be understood by the general public through
models with
130. Logan & Ferguson, supra note 85; Pasquale, supra note 98.
131. See generally Pleiss et al., supra note 87 (defining calibration).
132. Kleinberg et al., supra note 123, at 2-4.
133. Kleinberg does introduce one hypothetical condition where it is possible to meet
all three conditions. The trade-offs disappear if all populations have equal base rates.
This means the groups are essentially identical in distribution and behavior. Only the
label changes. In national risk assessment data, the individuals in the black and white
sets have different base rates of recidivism. Id. at 5-6, 17.
134. Corbett-Davies et al., A Computer Program Used for Bail, supra note 122.
135. Chouldechova, supra note 125.
136. See generally Sarah Tan et al., Detecting Bias in Black-Box Models Using
Transparent Model Distillation, 2018 AAAI/ACM CONF. ON ARTIFICIAL INTELLIGENCE,
ETHICS, & SOC'Y 96 (2018).
137. Since 2015, the annual Fairness, Accountability, and Transparency conferences
have investigated new concerns about machine learning and algorithms. See ACM
Conference on Fairness, Accountability, and Transparency (ACM FAT), ACM FAT CONF.,
http://fatconference.org [https://perma.cc/89GL-RG7N].
138. See generally Judie Attard et al., Value Creation on Open Government Data, 49
HAW. INT'L CONF. ON SYS. SCI. 2605 (2016).
D. Comparing Populations
The ProPublica-COMPAS debate emphasized the best statistical practices for
comparing populations in models. Models guide how data that is input into an
algorithm is processed into an output.144 Models are often expressed as equations
showing relationships between concepts. A risk assessment score is built on a model
that abstracts behavioral data about past populations.145
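The role base rates play in this debate can be made concrete with a small numerical sketch. The counts below are invented for illustration and are not drawn from the COMPAS data; they show how two groups with different base rates can receive scores with the same positive predictive value (the "predictive parity" Northpointe emphasized) while producing very different false positive rates (the disparity ProPublica reported).

# Hypothetical confusion-matrix counts for two groups scored by the same
# hypothetical classifier; the numbers are invented for illustration only.
groups = {
    "Group A (higher base rate)": dict(tp=400, fp=100, fn=100, tn=400),
    "Group B (lower base rate)":  dict(tp=200, fp=50,  fn=50,  tn=700),
}

for name, c in groups.items():
    ppv = c["tp"] / (c["tp"] + c["fp"])            # "predictive parity" compares this
    fpr = c["fp"] / (c["fp"] + c["tn"])            # ProPublica compared this
    base_rate = (c["tp"] + c["fn"]) / sum(c.values())
    print(f"{name}: base rate {base_rate:.2f}, PPV {ppv:.2f}, FPR {fpr:.2f}")

Both groups see the same positive predictive value, yet the group with the higher base rate also has the higher false positive rate. With unequal base rates, equalizing one of these quantities pushes the other apart, which is the trade-off Kleinberg and Chouldechova formalize.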
The base rate is a vital indicator because it reflects actual population trends
within the data set. Baird expressed concern about algorithms being used across
jurisdictions because of changes in population base rates.146 Predicting the likelihood
of arrest is
139. See Jim Dwyer, Showing the Algorithms Behind New York City Services, N.Y.
TIMES, Aug. 24, 2017, at A18.
140. Elaine Angelino et al., Learning Certifiably Optimal Rule Lists for Categorical
Data, 18 J. MACHINE LEARNING RES. 234 (2018).
141. James E. Johndrow & Kristian Lum, An Algorithm for Removing Sensitive
Information: Application to Race-Independent Recidivism Prediction, ANNALS APPLIED
STAT. (forthcoming 2019) (manuscript at 5), https://arxiv.org/pdf/1703.04957.pdf
[http://perma.cc/KE2D-YGPP].
142. See BAIRD ET AL., supra note 79, at 134.
143. Dressel & Farid, supra note 94, at 1-2.
144. See Lehr & Ohm, supra note 2, at 671.
145. Lightbourne, supra note 119, at 329; Logan & Ferguson, supra note 85, at 554-
56; Provost & Fawcett, supra note 102, at 52.
146. BAIRD ET AL., supra note 79, at 11. For a further discussion of base rates as a
test of validity, see Dr. David Thompson's expert testimony on base rates as a test of
validity: "The Court does not know how the COMPAS compares that individual's history
with the population that it's comparing them with. The Court doesn't even know whether
that population is a Wisconsin population, a New York population, a California
population .... There's all kinds of information that the court doesn't have, and what
we're doing is we're mis-informing the court when we put these graphs in front of them
and let them use it for sentence." State v. Loomis, 881 N.W.2d 749, 756-57 (Wis. 2016).
147. For a discussion on validation in data science, see Galen Panger, Reassessing
the Facebook Experiment: Critical Thinking About the Validity of Big Data Research, 19
INFO., COMM. & SOC'Y 1108 (2016).
148. Loomis, 881 N.W.2d at 762-63.
149. Richard Berk et al., Fairness in Criminal Justice Risk Assessments: The State of
the Art --8, 18-19 (Univ. of Pa. Dep't of Criminology, Working Paper No. 2017-1.0, 2017),
https://crim.sas.upenn.edu/sites/default/files/2017-1.0-BerkFairnessCrimJustRisk.pdf
[http://perma.cc/U6B9-4JLL].
150. Chouldechova, supra note 125, at 135-37.
151. DIETERICH ET AL., supra note 18, at 9 ("A risk scale exhibits accuracy equity if it
can discriminate recidivists and non-recidivists equally well for two different groups
such as blacks and whites. The risk scale exhibits predictive parity if the classifier
obtains similar predictive values for two different groups such as blacks and whites, for
example, the probability of recidivating, given a high risk score, is similar for blacks and
whites.").
152. Flores et al., supra note 127, at 40.
A. Provenance
Provenance establishes the value of an item by documenting
its history and ownership. Buyers of fine art use provenance to
trace paintings and sculptures across centuries of owners. Although
usually considered for data that moves through multiple systems,
provenance can also apply to the origins, ownership, and history of
any digital asset.153 Provenance could provide a linkage between
who created the algorithm and who currently owns it. How were
the COMPAS products introduced to the organization? How long
has the Wisconsin circuit court used COMPAS predictive risk
assessments? When was the product last updated? Has the
algorithm been specially calibrated for Wisconsin populations or is
it the standard version of the product? The provenance of the
153. For more on digital provenance, see Lucian Carata et al., A Primer on
Provenance, 12 ACM QUEUE, Apr. 10, 2014, at 1, and Luc Moreau et al., The Provenance
of Electronic Data, COMMS. ACM, Apr. 2008, at 52, 54-58.
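The provenance questions above amount, in effect, to fields in a provenance record. A minimal sketch of such a record follows; the class name, field names, and placeholder values are hypothetical and serve only to show how answers to those questions could travel with the tool itself.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AlgorithmProvenance:
    """Hypothetical provenance record for a risk assessment tool."""
    tool_name: str
    developer: str                 # who created the algorithm
    current_owner: str             # who owns it now
    acquired_by_court: str         # how the product was introduced
    in_use_since: Optional[int]    # year the circuit court adopted it, if known
    last_updated: Optional[str]    # date of the last product update, if known
    locally_calibrated: bool       # calibrated for Wisconsin populations?
    notes: List[str] = field(default_factory=list)

# Illustrative entry; unknown values are left as placeholders, not facts.
record = AlgorithmProvenance(
    tool_name="COMPAS", developer="Northpointe", current_owner="equivant",
    acquired_by_court="procurement details unknown", in_use_since=None,
    last_updated=None, locally_calibrated=False,
)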
B. Practice
Norms of practice specify how technology is used and implemented. While the
Loomis decision did recognize existing practice, it merely pointed to language in
documents.159 Stronger support for the practice would have been some indication of
training
154. Erna H.J.M. Ruijer & Richard F. Huff, Breaking through Barriers: The Impact
of Organizational Culture on Open Government Reform, 10 TRANSFORMING GOV'T 335
(2016).
155. Volaris Group Acquires Northpointe Institute for Public Management, VOLARIS
GROUP (May 03, 2011), https://www.volarisgroup.com/news/article/volaris-group-
acquires-northpointe-institute-for-public-management [http://perma.cc/QE4J-B8J5].
156. CourtView, Constellation, & Northpointe Re-Brand to equivant, supra note 5.
157. For a discussion of concerns about using businesses whose goals and time-scales
differ from those of long-standing public institutions, see Klievink et al., supra note 102,
at 72-73.
158. See generally Cecelia Klingele, The Promises and Perils of Evidence-Based
Corrections, 91 NOTRE DAME L. REV. 537, 538-41 (2016) (explaining that the criminal
justice system undergoes periodic reforms as values and science change).
159. State v. Loomis, 881 N.W.2d 749, 764 (Wis. 2016) (making references to the
State of Wisconsin Department of Corrections Electronic Case Reference Manual,
COMPAS Assessment Frequently Asked Questions, and a Practitioner's Guide to
COMPAS Core published by Northpointe).
and familiarity with the system. Where was COMPAS first used in
Wisconsin? Why did the COMPAS product meet their requirement
needs? What other systems did the state consider and why did they
not meet their needs? How are employees trained to use COMPAS?
How long had the State of Wisconsin been using predictive risk
assessments? How long had the Wisconsin circuit court been using
predictions in sentencing? When was the decision made to use it
throughout the court systems? These are not unusual questions
when considering how digital government products go through procurement in the
federal executive branch.160 Although the
circuit court in Loomis discussed its commitment to evidence-based
sentencing, more specific details about implementation and
evaluation could have shown that the court had integrated
COMPAS risk assessment into normative organizational
practices.161
The Loomis decision was factually correct that both sides had
access to the same information because they both had documents
that contained the score and questionnaire. Yet, there was a subtle
difference in access to information for each party in Loomis. The
circuit court and the defense had access to the same documents, but
not the same context for the information contained in those
documents. Anyone with the opportunity to see how COMPAS
scores were applied to hundreds of people over time could develop
an inductive understanding of what is important on the
questionnaire and how it is applied through the Wisconsin
population. People external to an organization are unlikely to understand its
intuitive internal needs.
The court probably takes for granted how things work and
what aspects of their work they prioritize. Employees of the
Wisconsin circuit court, including the judge, would be familiar with
the administrative goals of court operations. The COMPAS
algorithm is a product marketed to target exactly the efficiency
concerns of courts. As a part of an organizational information
system, predictive assessments are designed to support employees
of criminal justice organizations. The courts have a better grasp on
risk thresholds because they see these scores applied over time.
Because of this unique retrospective view, the state may have been uniquely
positioned to corroborate its own connection between a COMPAS risk assessment and
specific questions, while the defendant did not have that inductive experience.
160. Peter Johnson & Pamela Robinson, Civic Hackathons: Innovation, Procurement,
or Civic Engagement?, 31 REV. POL'Y RES. 349, 352 (2014).
161. A field of public policy, the science of science policy, considers how to evaluate
digital investments like this. See Sandra Braman, Technology and Epistemology:
Information Policy and Desire, in CULTURAL TECHNOLOGIES: THE SHAPING OF CULTURE
IN MEDIA AND SOCIETY 133 (Goran Bolin ed., 2012); Maryann Feldman et al., The New
Data Frontier, 44 RES. POL'Y 1629 (2015).
C. Training Data
Many public organizations are making digital information
available to the public as open data. 162 Training data is one type of
open data that could be useful to establish the validity of risk
assessment scores. Generated sample data is another type of open
data that could be used to establish claims. With shared data, either
side in Loomis could have engaged statisticians to prove their
claims using the same data set.
Training data calibrate and optimize algorithms.163 In a 2009
article, Northpointe employees pointed out that they trained the
COMPAS algorithm on a population sample of 2,328 that was 76%
white and only 19% female.164 Anyone who wanted to argue with
COMPAS could point to the training populations in this older
article to question the validity of current predictions given their
own jurisdiction demographics. Furthermore, knowing the training
set could support a claim that the defendant was an outlier and
therefore may be easily misclassified.165 Algorithms use training
data as a benchmark for speed, accuracy, or other optimization
goals. A training set could also confirm how the algorithm
considered characteristics represented by the defendant. A
characteristic less probable in the training data could be an
argument for a less accurate prediction.
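If the composition of the training sample is known, the outlier argument can be made concrete by asking how common a defendant's combination of characteristics was in that sample. The sketch below uses invented joint counts; only the marginal shares echo the figures reported in the 2009 article (roughly 76% white, 19% female), and it is an illustration, not the actual COMPAS training set.

from collections import Counter

# Invented joint counts for a stand-in training sample of 100 people.
# Marginals loosely echo the 2009 article, but the joint breakdown is made up.
training_counts = Counter({
    ("white", "male"): 62, ("white", "female"): 14,
    ("nonwhite", "male"): 19, ("nonwhite", "female"): 5,
})
total = sum(training_counts.values())

def support(race, sex):
    """Share of the training sample sharing a defendant's characteristics."""
    return training_counts[(race, sex)] / total

# A profile that is rare in the training data (here 5% of the sample) is one
# a model may fit poorly -- a possible basis for an outlier argument.
print(f"{support('nonwhite', 'female'):.2%}")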
Generated data reflect representative practices and are used
in quality assurance to test the breadth of design requirements.166
Given the sensitivity of criminal justice data, it might be reasonable
to generate a data set that would allow for hypothetical testing.
Unlike training data, generated data are designed to test the
robustness of the system to handle a range of cases. Generated data
could confirm how the algorithm handles unusual or under-
represented factors in the data. In Loomis, Wisconsin could have
provided generated data about their populations, or COMPAS could
162. Open data are internal organizational files released to the general public usually
through the Internet. See Anne L. Washington & David Morar, Open Government Data
and File Formats: Constraints on Collaboration, 18 PROC. INT'L CONF. ON DIGITAL GOV'T
RES. 155 (2017).
163. Tom Dietterich, Overfitting and Undercomputing in Machine Learning, 27 ACM
COMPUTING SURVS. 326, 326-27 (1995); Leslie Scism & Mark Maremont, Insurers Test
Data Profiles to Identify Risky Clients, WALL ST. J. (Nov. 19, 2010, 12:01 AM),
http://www.wsj.com/articles/SB10001424052748704648604575620750998072986
[http://perma.cc/9B6Y-C34S].
164. Brennan et al., Evaluating the Predictive Validity, supra note 43.
165. An outlier is an observation beyond the general data trend. Algorithms can
exclude large segments of society if variations in human populations are not considered.
In a series of experiments, computer scientists showed that facial recognition software
could have high overall average success but failed at the intersection of gender and skin
color. See Joy Buolamwini & Timnit Gebru, Gender Shades: Intersectional Accuracy
Disparities in Commercial Gender Classification, 81 PROC. MACHINE LEARNING RES. 77
(2018).
166. To understand the role of generated data in software engineering, see D. C. Ince,
The Automatic Generation of Test Data, 30 COMPUTER J. 63 (1987).
CONCLUSION
The purpose of this article was to examine the ProPublica-
COMPAS debate over risk assessment algorithms. My analysis of
the arguments contributes the following three points.
First, the standard set in Loomis for challenging a predictive
assessment, by reviewing data accuracy, was not supported in the
data science literature. Accuracy is just one of many data qualities
and does not address how the algorithm produces results or
manages the input data. 174 The data quality standard set in Loomis
is a very low bar for understanding predictive risk assessment. The
ProPublica-COMPAS debate revealed claims of fairness, simplicity,
and population comparison that can be the basis for arguments about
predictive algorithms. Accurate data is a necessary but not
sufficient standard for assessing the integrity of a risk assessment
prediction.
Second, a healthy debate on predictive analytics required
shared information such as open data, standard evaluation
practices, shared designs, and hypotheses. Sharing information