Early evidence suggests some risk assessment tools offer promise in rationalizing decisions on granting bail without racial bias. But we still need to monitor how judges actually use the algorithms, says a Boston attorney.
Next Monday morning, visit an urban criminal courthouse. Find a seat on a bench, and then watch the call of the arraignment list.
Files will be shuffled. Cases will be called. Knots of lawyers will enter the well of the court and mutter recriminations and excuses. When a case consumes more than two minutes you will see unmistakable signals of impatience from the bench.
Pleas will be entered. Dazed, manacled prisoners—almost all of them young men of color—will have their bails set and their next dates scheduled.
Some of the accused will be released; some will be detained, and stepped back into the cells.
You won’t leave the courthouse thinking that this is a process that needs more dehumanization.
But a substantial number of criminal justice reformers have argued that if the situation of young men facing charges is to be improved, it will be through reducing each accused person who comes before the court to a predictive score that employs mathematically derived algorithms which weigh only risk.
This system of portraiture, known as risk assessment tools, is claimed to simultaneously reduce pretrial detentions, pretrial crime, and failures to appear in court—or at least that was the claim during a euphoric period when the data revolution first poked its head up in the criminal justice system.
We can have fewer prisoners and less crime. It would be, the argument went, a win/win: a silver bullet that offers liberals reduced incarceration rates and conservatives a whopping cost cut.
These confident predictions came under assault pretty quickly. Prosecutors—represented, for example, by Eric Sidall here in The Crime Report—marshaled tales of judges (“The algorithm made me do it!”) who released detainees who then committed blood-curdling crimes.
Other voices raised fears about the danger that risk assessment tools derived from criminal data trails that are saturated with racial bias will themselves aggravate already racially disparate impacts.
A ProPublica series analyzed the startling racial biases the authors claim were built into one widely used proprietary instrument. Bernard Harcourt of Columbia University argued that “risk” has become a proxy for race.
A 2016 study by Jennifer Skeem and Christopher Lowenkamp dismissed Harcourt’s warnings as “rhetoric,” but found that on the level of particular factors (such as the criminal history factors) the racial disparities are substantial.
Meanwhile, a variety of risk assessment tools have proliferated: Some are simple checklists; some are elaborate “machine learning” algorithms; some offer transparent calculations; others are proprietary “black boxes.”
Whether or not the challenge of developing a race-neutral risk assessment tool from the race-saturated raw materials we have available can ever be met is an argument I am not statistician enough to join.
But early practical experience seems to show that some efforts, such as the Public Safety Assessment instrument, developed by the Laura and John Arnold Foundation and widely adopted, do offer a measure of promise in rationalizing bail decision-making at arraignments without aggravating bias (anyway, on particular measurements of impact).
The Public Safety Assessment (PSA), developed relatively transparently, aims to be an objective procedure that could encourage timid judges to separate the less dangerous from the more dangerous, and to send the less dangerous home under community-based supervision.
At least, this practical experience seems to show that in certain Kentucky jurisdictions where (with a substantial push from the Kentucky legislature) PSA has been operationalized, the hoped-for safety results have been produced—and with no discernible increase in racial disparity in outcomes.
Unfortunately, the same practical experience also shows that those jurisdictions are predominately white and rural, and that there are other Kentucky jurisdictions, predominately minority and urban, where judges have been—despite the legislature’s efforts—gradually moving away from using PSA.
These latter jurisdictions are not producing the same pattern of results.
The judges are usually described as substituting “instinct” or “intuition” for the algorithm. The implication is that they are either simply mobilizing their personal racial stereotypes and biases, or reverting to a primitive traditional system of prophesying risk by opening beasts and fowl and reading their entrails, or crooning to wax idols over fires.
As Malcolm M. Feeley and Jonathan Simon predicted in a 2012 article for Berkeley Law, past decades have seen a paradigm shift in academic and policy circles, and “the language of probability and risk increasingly replaces earlier discourse of diagnosis and retributive punishment.”
A fashion for risk assessment tools was to be expected, they wrote, as everyone tried to “target offenders as an aggregate in place of traditional techniques for individualizing or creating equities.”
But the judges at the sharp end of the system whom you will observe on your courthouse expedition don’t operate in a scholarly laboratory.
They have other goals to pursue besides optimizing their risk-prediction compliance rate, and those goals exert constant, steady pressure on release decision-making.
Some of these “goals” are distasteful. A judge who worships the great God, Docket, and believes the folk maxim that “Nobody pleads from the street” will set high bails to extort quick guilty pleas and pare down his or her room list.
Another judge, otherwise unemployable, who needs re-election or re-nomination, will think that the bare possibility that some guy with a low predictive risk score whom he has just released could show up on the front page tomorrow, arrested for a grisly murder, inexorably points to detention as the safe road to continued life on the public payroll.
They are just trying to get through their days.
But the judges are subject to other pressures that most of us hope they will respect.
For example, judges are expected to promote legitimacy and trust in the law.
It isn’t so easy to resist the pull of “individualizing “and “diagnostic” imperatives when you confront people one at a time.
Somehow, “My husband was detained, so he lost his job, and our family was destroyed, but after all, a metronome did it, it was nothing personal” doesn’t seem to be a narrative that will strengthen community respect for the courts.
Rigorously applying the algorithm may cut the error rate in half, from two in six to one in six, but one in six are still Russian roulette odds, and the community knows that if you play Russian roulette all morning (and every morning) and with the whole arraignment list, lots of people get shot.
No judge can forget this community audience, even if the “community” is limited to the judge’s courtroom work group. It is fine for a judge to know whether the re-offense rate for pretrial releases in a particular risk category is eight in ten, but to the judges, their retail decisions seem to be less about finding the real aggregated rate than about whether this guy is one of the eight or one of the two.
Embedded in this challenge is the fact that you can make two distinct errors in dealing with difference.
First, you can take situations that are alike, and treat them as if they are different: detain an African-American defendant and let an identical white defendant go.
Second, you can take things that are very different and treat them as if they are the same: Detain two men with identical scores, and ignore the fact that one of the two has a new job, a young family, a serious illness, and an aggressive treatment program.
A risk assessment instrument at least seems to promise a solution to the first problem: Everyone with the same score can get the same bail.
But it could be that this apparent objectivity simply finesses the question. An arrest record, after all, is an index of the detainee’s activities, but it also a measure of police behavior. If you live in an aggressively policed neighborhood your history may be the same as your white counterpart’s, but your scores can be very different.
And risk assessment approaches are extremely unwieldy when it comes to confronting the second problem. A disciplined sticking-to-the-score requires blinding yourself to a wide range of unconsidered factors that might not be influential in many cases, but could very well be terrifically salient in this one.
This tension between the frontline judge and the backroom programmer is a permanent feature of criminal justice life. The suggested solutions to the dissonance range from effectively eliminating the judges by stripping them of discretion in applying the Risk Assessment scores to eliminating the algorithms themselves.
But the judges aren’t going away, and the algorithms aren’t going away either.
As more cautious commentators seem to recognize, the problem of the judges and the algorithms is simply one more example of the familiar problem of workers and their tools.
If the workers don’t pick up the tools it might be the fault of the workers, but it might also be the fault of the design of the tools.
And it’s more likely that the fault does not lie in either the workers or the tools exclusively but in the relationship between the workers, the tools, and the work. A hammer isn’t very good at driving screws; a screw-driver is very bad at driving nails; some work will require screws, other work, nails.
If you are going to discuss these elements, it usually makes most sense to discuss them together, and from the perspectives of everyone involved.
The work that the workers and their tools are trying to accomplish here is providing safety—safety for everyone: for communities, accused citizens, cops on the streets. A look at the work of safety experts in other fields such as industry, aviation, and medicine provides us with some new directions.
To begin with, those safety experts would argue that this problem can never be permanently “fixed” by weighing aggregate outputs and then tinkering with the assessment tool and extorting perfect compliance from workers. Any “fix” we install will be under immediate attack from its environment.
Among the things that the Kentucky experience indicates is that in courts, as elsewhere, “covert work rules”, workarounds, and “informal drift” will always develop, no matter what the formal requirements imposed from above try to require.
The workers at the sharp end will put aside the tool when it interferes with their perception of what the work requires. Deviations won’t be huge at first; they will be small modifications. But they will quickly become normal.
And today’s small deviation will provide the starting point for tomorrow’s.
What the criminal justice system currently lacks—but can build—is the capacity for discussing why these departures seemed like good ideas. Why did the judge zig, when the risk assessment tool said he or she should have zagged? Was the judge right this time?
Developing an understanding of the roots of these choices can be (as safety and quality experts going back to W. Edwards Deming would argue) a key weapon in avoiding future mistakes.
We can never know whether a “false positive” detention decision was an error, because we can never prove that the detainee if released would not have offended. But we can know that the decision was a “variation” and track its sources. Was this a “special cause variation” traceable to the aberrant personality of a particular judge? (God knows, they’re out there.)
Or was it a “common cause variation” a natural result of the system (and the tools) that we have been employing?
This is the kind of analysis that programs like the Sentinel Events Initiative demonstration projects about to be launched by the National Institute of Justice and the Bureau of Justice Assistance can begin to offer. The SEI program, due to begin January 1, with technical assistance from the Quattrone Center for the Fair Administration of Justice at the University of Pennsylvania Law School, will explore the local development of non-blaming, all-stakeholders, reviews of events (not of individual performances) with the goal of enhancing “forward-looking accountability” in 20-25 volunteer jurisdictions.
The “thick data” that illuminates the tension between the algorithm and the judge can be generated. The judges who have to make the decisions, the programmers who have to refine the tools, the sheriff who holds the detained, the probation officer who supervises the released, and the community that has to trust both the process and the results can all be included.
We can mobilize a feedback loop that delivers more than algorithms simply “leaning in” to listen to themselves.
What we need here is not a search for a “silver bullet,” but a commitment to an ongoing practice of critically addressing the hard work of living in the world and making it safe.
James Doyle is a Boston defense lawyer and author, and a frequent contributor to The Crime Report. He has advised in the development of the Sentinel Events Initiative of the National Institute of Justice. The opinions expressed here are his own. He welcomes readers’ comments.