When the Research Gets It Wrong: What a New Study on AI Police Reports Actually Tells Us


A recently circulated paper on AI-assisted police report writing has been making the rounds in law enforcement circles and academic discussions, and I want to address it directly. The study is a working paper by Adams et al. (2026) titled A Good College Essay but a Bad Police Report: A Triple-Blind Expert Evaluation of AI-Assisted Police Reporting. The study, which evaluated Axon Draft One reports through blinded supervisor review, concludes that AI-assisted report writing produces lower quality reports, that supervisory oversight is fundamentally broken, and that governance frameworks need to rethink their reliance on human review entirely. Those are serious claims, and they deserve serious scrutiny, because when you dig into the methodology, what the study actually demonstrates is something quite different from what its authors claim. What it really provides is a roadmap for how AI-assisted report writing should be done.

Let me break down the critical problems with this research and explain what it actually means for agencies considering AI report-writing tools.

The Study Tested the Wrong Thing

The most fundamental flaw in this research is one the authors actually acknowledge but then largely ignore in their conclusions. The study evaluated one specific product, Axon’s Draft One, using one specific architecture: a passive-input system that takes body-worn camera audio, transcribes it, and then generates a report narrative from that raw transcription. The officer's role in this process is purely editorial. The AI decides what is relevant from hours of ambient audio, background noise, and scene chatter, and builds the report from that (Adams et al., 2026).

I have written before about the inherent problems with this approach. As any officer with a body-worn camera knows, a massive portion of that audio has nothing to do with the incident being investigated. Generating reports from all of that captured audio is inefficient at best and legally problematic at worst, and that is before even considering the effects on officer memory and cognition. The study's own authors note that Draft One's transcript-only architecture "ensures incompleteness" on a structural level, because the system has no access to officer notes, dispatch records, CAD data, or anything the officer observed but did not vocalize on scene (Adams et al., 2026).

The completeness gaps and accuracy concerns the study identifies are not evidence that AI-assisted report writing is flawed. They are evidence that this particular passive-input architecture is flawed. Those are very different conclusions, and conflating them does a disservice to the field.

Additionally, both the AI-assisted reports and the traditional officer-generated reports came from the No Man's Hand study by Adams et al. (2024). Without disregarding my points above about the flaws in Draft One's method of report generation, significant improvements may well have been made to the Draft One system in the two years between the studies; a more up-to-date sample could have been obtained, along with a cross-section of AI report-writing tools.

The Numbers Tell a More Complicated Story

Beyond the conceptual problem, the study has significant issues with the size and balance of what it actually tested that undermine its headline findings.

Start with a basic structural problem: the study compared 60 non-AI reports against only 20 AI-assisted reports. That 3-to-1 imbalance matters because the study's most important claim, that supervisors rated AI reports lower on accuracy, is built almost entirely on those 20 reports (Adams et al., 2026). With a sample that small in the AI group, a handful of genuinely poor reports, or even a couple of outlier rater judgments, can swing the entire finding. That is not a solid foundation for sweeping conclusions about AI report writing.

The study also tested six different quality dimensions simultaneously, including clarity, completeness, grammar, accuracy, utility, and an overall global rating, and found a significant difference on only one of them: accuracy (Adams et al., 2026). When researchers test that many outcomes at once, the odds of finding at least one that looks significant purely by chance increase substantially; this is the classic multiple-comparisons problem. The authors did not correct for it, which means their one significant finding deserves considerably more skepticism than the paper gives it.
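To see why this matters, consider the arithmetic. If each of six tests is run at the conventional 0.05 significance threshold, the chance of at least one false positive is not 5% but closer to 26%. A quick sketch in Python makes the point; it assumes fully independent tests, which somewhat overstates the inflation for correlated quality dimensions, so treat it as an intuition pump rather than a reanalysis:

```python
# Family-wise error rate: probability of at least one spurious
# "significant" result when running several tests at alpha = 0.05.
# Illustrative only; assumes independent tests.

alpha = 0.05
n_tests = 6  # clarity, completeness, grammar, accuracy, utility, overall

fwer = 1 - (1 - alpha) ** n_tests
print(f"Chance of >=1 false positive across the six tests: {fwer:.1%}")  # ~26.5%

# A standard Bonferroni correction would test each dimension at alpha / n_tests:
print(f"Bonferroni-adjusted per-test threshold: {alpha / n_tests:.4f}")  # 0.0083
```

Against a Bonferroni-style threshold, a single significant result out of six is exactly the pattern you would expect from chance alone.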

Perhaps most telling is a number buried in the study's own analysis: 73% of the variation in how supervisors rated reports came down to essentially random noise, and only about 9% reflected actual differences between reports (Adams et al., 2026). In practical terms, that means the ratings were driven more by which supervisor happened to review a report than by anything actually in that report.

This could be explained by the fact that the reviewers were not from the authoring agency; they were drawn from law enforcement supervisors and command staff attending regional or national training programs (the FBI National Academy, among others) (Adams et al., 2026). Review processes differ substantially between agencies, and the reviewing supervisors had no context for the incident, no familiarity with the authoring officer's writing style, and no knowledge of local practice. That leaves real questions about how insightful they could be in approving or rejecting a report for accuracy and other factors. Drawing firm conclusions about AI quality from a measurement approach this inconsistent requires a level of caution the authors simply do not apply.
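A small simulation, purely illustrative and not a reanalysis of the study's data, shows what that variance split means in practice. If only about 9% of rating variance reflects true report quality and roughly 73% is residual noise (with the remainder attributed here, as an assumption, to rater leniency), the same report draws wildly different scores depending on who happens to review it:

```python
# Illustrative simulation: supervisor ratings where ~9% of variance
# reflects true report quality and ~73% is noise, per the paper's
# reported decomposition. The rater-leniency share is my assumption.
import numpy as np

rng = np.random.default_rng(0)
n_reports, n_raters = 20, 10

report_quality = rng.normal(0, np.sqrt(0.09), size=(n_reports, 1))
rater_leniency = rng.normal(0, np.sqrt(0.18), size=(1, n_raters))
noise = rng.normal(0, np.sqrt(0.73), size=(n_reports, n_raters))

ratings = report_quality + rater_leniency + noise

# Spread of scores the SAME report receives from different raters,
# versus the spread of true quality across all reports:
print("typical within-report rating spread (sd):", ratings.std(axis=1).mean().round(2))
print("true quality spread across reports (sd): ", report_quality.std().round(2))
```

The within-report spread comes out roughly three times larger than the true quality differences between reports, which is exactly the "which supervisor you drew" problem described above.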

The Readability Critique Cuts the Wrong Way

The study argues at length that AI-assisted reports score lower on readability indices, including Flesch Reading Ease, SMOG, and Flesch-Kincaid grade level, and that supervisors fail to penalize this in their ratings, which the authors frame as an oversight failure (Adams et al., 2026).

There are two problems here. First, these are general-purpose readability measures developed for expository prose. They penalize technical and legal language that is often not just appropriate but required in police reports. Phrases like "in plain view," "exigent circumstances," or "did knowingly and intentionally" are legally precise terminology that these readability scores treat as complexity problems. Higher grade-level scores in police reports may reflect appropriate legal register, not problematic inaccessibility. And to the extent the lower scores come from genuinely "academic" language, which has no place in police reporting, that is easily controlled: prompting of the AI model matters, and constrained prompting can suppress academic jargon.
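A quick illustration of the register problem, using the open-source textstat package (my choice of tooling; the study describes its indices but not how they were computed):

```python
# How general-purpose readability indices score legally precise phrasing.
# Uses the open-source `textstat` package (pip install textstat); the
# study's exact tooling is not specified, so this is illustrative.
import textstat

legal = ("Upon arrival, I observed contraband in plain view. Exigent "
         "circumstances existed, and the suspect did knowingly and "
         "intentionally conceal the item on his person.")
plain = "When I got there, I saw drugs sitting out. I had to act fast."

for label, text in [("legal register", legal), ("plain register", plain)]:
    print(label,
          "| Flesch Reading Ease:", textstat.flesch_reading_ease(text),
          "| FK grade:", textstat.flesch_kincaid_grade(text))
```

The legally sufficient version scores "harder to read" on both indices, which is precisely the point: the metric is measuring register, not report quality.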

Second, and more practically, the authors use a Scrabble tile value score, specifically the mean English Scrabble tile value per letter, as an independent complexity measure. Whatever its theoretical justification, the operational validity of Scrabble scores as a police report quality metric is, to put it charitably, unestablished.
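For the curious, the metric is trivial to reproduce, which is part of the point. Here is the mean-tile-value calculation using the standard English Scrabble tile values (the phrases compared are my own examples):

```python
# Mean English Scrabble tile value per letter: the "complexity"
# metric in question, reproduced from the standard tile values.
TILE = {**dict.fromkeys("aeilnorstu", 1), **dict.fromkeys("dg", 2),
        **dict.fromkeys("bcmp", 3), **dict.fromkeys("fhvwy", 4),
        "k": 5, **dict.fromkeys("jx", 8), **dict.fromkeys("qz", 10)}

def mean_tile_value(text: str) -> float:
    letters = [c for c in text.lower() if c in TILE]
    return sum(TILE[c] for c in letters) / len(letters)

# The required legal term scores as "more complex" than its casual synonym,
# purely because of which letters it happens to contain.
print(round(mean_tile_value("exigent"), 2))  # ~2.14
print(round(mean_tile_value("urgent"), 2))   # ~1.17
```

A metric that penalizes a report for containing the letter X is not measuring anything a supervisor, prosecutor, or court cares about.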

The Detection Finding Is Already Irrelevant

The study devotes considerable attention to demonstrating that supervisors cannot detect AI authorship at better than chance levels. The authors frame this as a governance crisis.

But here is the thing: under California's SB 524 (codified as Penal Code § 13663 upon passage), Utah's 2025 AI disclosure statute, and the emerging framework from the Council on Criminal Justice (2026), supervisors reviewing AI-assisted reports will know those reports used AI, because disclosure is mandatory (and even absent a disclosure requirement, a supervisor would presumably know their own agency had deployed an AI system). Detection ability is already becoming irrelevant as a governance mechanism. The study's most dramatic finding addresses a problem that disclosure requirements directly solve, without requiring supervisors to develop any detection ability whatsoever.

The authors actually acknowledge this themselves when they recommend audit infrastructure, draft retention, version tracking, and mandatory disclosure as governance tools (Adams et al., 2026). They just bury those recommendations after extensive discussion of the detection failure, creating an impression of crisis that their own proposed solutions dissolve.

What the Study Actually Shows Us

Here is what I think is most important for agencies and administrators reading this research. Strip away the unsupported conclusions, and the study's findings actually point directly toward what effective, safely implemented AI report writing should look like:

·       Grammar showed the smallest negative effect, meaning AI demonstrably supports consistent, professional language

·       Approval rates were identical at roughly 22% in both conditions, meaning AI reports pass supervisory review at exactly the same rate as traditionally written reports

·       The completeness gap was tied specifically to the transcript-only architecture, which is a design choice and not a fundamental AI limitation

·       Word count showed no significant difference, meaning AI maintains appropriate report length without padding or truncation

Every failure mode the study identifies is addressable. Completeness gaps are solved by active-input architecture. Accuracy concerns are addressed through constrained prompting grounded in legally sufficient police reports, combined with an active-input method controlled by the officer using the AI system. Supervisor rubric failures are addressable through disclosure, training, and rubric updates. These are implementation problems, not fundamental AI problems; I made the same point in my evaluation of the No Man's Hand study.
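To make "constrained prompting" concrete, here is a simplified sketch of the kind of constraint layer I mean. The structure, field names, and rules are illustrative examples of the technique, not any vendor's actual production prompt:

```python
# Illustrative only: a simplified constraint layer for AI-assisted
# report drafting. Field names and rules are hypothetical examples,
# not any vendor's production prompt.

AGENCY_CONSTRAINTS = {
    "register": "plain, declarative, first-person past tense",
    "prohibited": ["academic hedging", "speculation beyond the dictation",
                   "facts not present in the officer's input"],
    "required_sections": ["synopsis", "narrative", "evidence", "disposition"],
    "statutory_elements": ["entry", "intent", "lack of consent"],  # e.g., burglary
}

def build_system_prompt(c: dict) -> str:
    return (
        "Draft a police report using ONLY the officer's dictation below. "
        f"Write in a {c['register']} register. "
        f"Include these sections: {', '.join(c['required_sections'])}. "
        f"Address each statutory element: {', '.join(c['statutory_elements'])}. "
        f"Never produce: {'; '.join(c['prohibited'])}. "
        "If the dictation does not support a required element, flag it "
        "for the officer instead of inventing content."
    )

print(build_system_prompt(AGENCY_CONSTRAINTS))
```

Notice what this does to the readability critique: the register is specified by the agency, not left to whatever prose style the model defaults to.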

The Policereports.ai Difference

This is exactly why our approach at Policereports.ai is built around officer-controlled, active-input dictation rather than passive audio transcription. The officer dictates the incident as he or she remembers it, controlling the narrative from the beginning, including what is relevant, what the legal elements are, and what the reporting structure of their specific agency requires. The AI's job is to take the officer's words and organize, format, and complete the required documentation efficiently, not to decide what happened based on hours of ambient body camera audio.

Our system is customized for each agency's specific report structure and forms, which means the constrained prompting is built around what that agency and external stakeholders actually need, what that jurisdiction's statutory elements require, and what that department's supervisors expect to see. That is the difference between AI that produces academically sophisticated prose disconnected from police report genre requirements and AI that produces operationally functional documentation in the format and register the criminal justice system needs. Our quality assurance checkers then analyze the AI-generated draft, comparing it back to the original input; deviations are flagged and addressed with the officer to ensure every part of the AI-generated report is faithful and accurate to that input. Finally, all iterations of the process are maintained, auditable, and subject to scrutiny.
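As a deliberately simplified illustration of that kind of input-faithfulness check (the principle, not our production pipeline), a checker can flag any draft sentence that introduces content words absent from the officer's dictation:

```python
# Deliberately simplified sketch of an input-faithfulness check: flag
# draft sentences that introduce content words absent from the officer's
# original dictation. A production system would use far richer matching
# (entities, numbers, negation), but the principle is the same: every
# claim in the draft must trace back to the input.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "was", "were", "i", "he", "she", "it", "at", "that"}

def content_words(text: str) -> set[str]:
    return set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS

def flag_unsupported(dictation: str, draft: str) -> list[str]:
    source = content_words(dictation)
    flags = []
    for sentence in re.split(r"(?<=[.!?])\s+", draft.strip()):
        novel = content_words(sentence) - source
        if novel:
            flags.append(f"{sentence}  <-- unsupported terms: {sorted(novel)}")
    return flags

dictation = "I contacted the driver, who admitted drinking two beers."
draft = ("I contacted the driver. He admitted drinking two beers. "
         "He became combative and fled on foot.")
for f in flag_unsupported(dictation, draft):
    print(f)  # flags only the third sentence, which the dictation never supports
```

The flagged sentence goes back to the officer, who either dictated it and can confirm it, or did not and strikes it before the report ever reaches a supervisor.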

The study's authors write that the useful governance question is "not whether a human is in the loop, but which specific failure modes the human in the loop is positioned to catch" (Adams et al., 2026). That is exactly right. With active-input dictation, constrained prompting, mandatory disclosure, and audit infrastructure, supervisors are positioned to catch what they have always been positioned to catch: whether the report is accurate, complete, and legally sufficient for the incident as the officer documented it. The AI's job is to make that documentation faster and more consistent, not to replace the officer's judgment about what matters.

The research, read carefully, is not an argument against AI-assisted report writing. It is an argument for doing it right.

References

Adams, I. T., Barter, M., McLean, K., et al. (2024). No man's hand: Artificial intelligence does not improve police report writing speed. Journal of Experimental Criminology. https://doi.org/10.1007/s11292-024-09644-7

Adams, I. T., Barter, M., McLean, K., Jr., I. A. G., Fabila, A., & McCrain, J. (2026). A Good College Essay but a Bad Police Report: A Triple-Blind Expert Evaluation of AI-Assisted Police Reporting. CrimRxiv. https://doi.org/10.21428/cb6ab371.ee18b482