
When the Math Doesn’t Add Up, Can AI Do the First Pass to Improve Biomedical Research?


Dr. Dobbins, PharmD

MedicalResearch.com Interview with:
Duncan Dobbins, PharmD, MHI
Geisinger College of Health Sciences
Scranton, Pennsylvania

MedicalResearch.com: What prompted this commentary, and what did you find?

Response: In theory, there could be a drug interaction between immunotherapy and medical cannabis. A small (N=102) observational report from Israel appeared to find that immunotherapies worked much less well in cancer patients who also used medical cannabis.1 However, a follow-up report2 took about two weeks and involved manually rechecking the math and data analysis. Several discrepancies emerged between the methods and the results. Two-tailed tests were listed in the methods, yet one-tailed p values appeared in the results. Arithmetic errors, some traceable to unconventional “floor” rounding, affected key percentages. Most of the p values in Table 1 (21 of 22) could not be reproduced with the stated tests. Finally, smoking status, a key confounder, was not reported. Taken together, these issues complicate interpretation and highlight how small computational slips can cascade into larger inferential uncertainty. After that follow-up report, I was asked, “Do you think AI could have double-checked this math?”

MedicalResearch.com: Why might AI tools be helpful here?

Response: Today, most statistical spot checks are manual, time-consuming, and vulnerable to copy-paste or rounding mistakes. These are the very problems that slip through busy peer review. Large Language Model (LLM)-assisted agents can parse tables, recompute percentages, and replicate 2×2 tests (chi-square or Fisher’s exact). They can reconcile Ns across text, figures, and supplements; flag the kinds of discrepancies seen here (e.g., one- vs. two-tailed tests); and mark p values that do not reproduce. This is like a spellcheck for statistics.
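To make the idea concrete, here is a minimal sketch of the kind of 2×2 replication such an agent could run. This is not the tool used in our work; the function name and the table counts are invented for illustration, and it computes a plain Pearson chi-square without Yates’ continuity correction.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for a 2x2 table
    [[a, b], [c, d]]; returns (statistic, two-tailed p) with 1 df."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Shortcut form of the Pearson statistic for a 2x2 table
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 df: p = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical counts, purely for illustration
stat, p = chi_square_2x2(20, 10, 8, 22)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")  # chi2 = 9.64, p = 0.0019
```

An agent would extract the four cell counts from a submitted table, run a check like this, and compare the result against the reported p value.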

Psychology’s statcheck software3 shows what is possible, and oncology can borrow the playbook. The ideal workflow is lightweight and auditable. On upload, the system recalculates core numbers, applies transparent rounding rules, and produces a red-yellow-green report with line-referenced flags. Editors and authors see exactly what tripped the alarm, resolve it quickly, and move on: no gatekeeping, just guardrails. Embedding such agents at submission and again before publication would create a safety net that reduces preventable errors before they trigger PubPeer debates.
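One way the red-yellow-green flagging could work is sketched below. The thresholds and the three-tier logic are illustrative assumptions, not statcheck’s actual rules: green for a match within rounding, yellow for a discrepancy that stays on the same side of the 0.05 line, red when the discrepancy flips significance.

```python
def flag_p_value(reported_p, recomputed_p, tol=0.005):
    """Compare a reported p value against a recomputed one.
    Thresholds are illustrative, not from any published tool."""
    if abs(reported_p - recomputed_p) <= tol:
        return "green"   # matches within rounding tolerance
    if (reported_p < 0.05) == (recomputed_p < 0.05):
        return "yellow"  # off, but same side of alpha = 0.05
    return "red"         # discrepancy changes the inference

print(flag_p_value(0.031, 0.030))    # green
print(flag_p_value(0.04, 0.02))      # yellow
print(flag_p_value(0.05178, 0.031))  # red
```

The last example mirrors the second-line therapy discrepancy discussed below: a reported p above 0.05 against a recomputed p of about 0.03 is exactly the kind of mismatch that should trip a red flag.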

MedicalResearch.com: How did AI/LLMs perform in your work?

Response: We used ChatGPT in a browser and compared GPT-4o to a reasoning model (o1). With minimal prompting (literally copying the relevant rows), the o1 model recalculated key 2×2 tests correctly and flagged problems GPT-4o missed (e.g., second-line therapy: our recompute χ²≈4.67, p≈0.03 vs. the paper’s 0.05178). It also caught a simple but meaningful percentage error (brain metastases: 8/34 = 23.5%, not 13.2%). Interestingly, these results are consistent with published benchmarks: OpenAI’s reasoning model, o1, solved 83 percent of problems on the American Invitational Mathematics Examination (AIME)4, performance comparable to top PhD-entry exam success and far surpassing GPT-4o’s 12 percent on the same evaluation4. These models were released only four months apart, which shows how quickly they are improving. GPT-5 came out this month (eight months after o1) and scored 95% on this year’s version of the same exam5. The takeaway is not that one model is perfect; it’s that modern reasoning models are already good enough to be embedded as “math agents” for authors pre-submission and for journals in triage. Models evolve fast, and the half-life of capability is getting shorter.
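Both checks described above are trivially reproducible from the numbers quoted here. A sketch using only Python’s standard library (the closed-form survival function for chi-square with 1 degree of freedom is erfc(√(x/2))):

```python
import math

# Percentage check: brain metastases, 8 of 34 patients
pct = 8 / 34 * 100
print(f"{pct:.1f}%")  # 23.5%, not the reported 13.2%

# p value implied by the recomputed chi-square statistic (1 df)
chi2 = 4.67
p = math.erfc(math.sqrt(chi2 / 2))
print(f"p = {p:.3f}")  # ~0.031, not the paper's 0.05178
```

Nothing here requires an LLM; the point is that a model asked to recheck the table should land on these same numbers, and o1 did.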

MedicalResearch.com: Where are the limits and does this replace human reviewers?

Response: Scope the bots to low-variance tasks: arithmetic, rounding, internal consistency, and simple test replication. Keep humans on the high-variance calls: design, identifying unreported confounders (e.g., tobacco), multiplicity, modeling choices, and clinical plausibility. Validate any checker on gold-standard datasets before rollout; freeze versions; record prompts; and attach the audit log when corrections change interpretation. Build a three-touch workflow that will not bottleneck timelines: (1) a pre-submission check run by authors; (2) an editorial check at acceptance to verify that revisions did not introduce new errors; (3) a production check before proofs. The outcome we want is modest but meaningful: fewer preventable math mistakes, clearer results sections, and less post-publication confusion. AI provides the first pass, but statisticians and clinicians provide the last word.

 Researcher and robot reading a paper together. Created by the author using ChatGPT (OpenAI), 2025.

Publication

Piper BJ, Dobbins DX, et al. Commentary on “Bar-Sela et al. Cannabis consumption used by cancer patients during immunotherapy correlates with poor clinical outcome. Cancers 2020, 12, 2447”. Cancers. 2025;17(17):2754 (24 pages). https://doi.org/10.3390/cancers1717275

Citations

  1. Bar-Sela G, et al. Cannabis consumption used by cancer patients during immunotherapy correlates with poor clinical outcome. Cancers. 2020;12(9):2447.
  2. Piper BJ, et al. Immunotherapy and cannabis: a harmful drug interaction or Reefer Madness? Cancers. 2024;16(7):1245.
  3. Nuijten M, Polanin JR. “statcheck”: automatically detect statistical reporting inconsistencies to increase reproducibility of meta-analyses. Research Synthesis Methods. 2020;11:574-579.
  4. OpenAI. Learning to reason with LLMs. OpenAI Research Blog. Published September 12, 2024. Available from: https://openai.com/index/learning-to-reason-with-llms/
  5. OpenAI. Introducing GPT‑5. OpenAI Blog. Published August 7, 2025. Available from: https://openai.com/index/introducing-gpt-5/

———

The information on MedicalResearch.com is provided for educational purposes only and may not be up to date. It is in no way intended to diagnose, cure, or treat any medical or other condition. Some links are sponsored. Products are not warranted or endorsed.

Always seek the advice of your physician or other qualified health provider, and ask your doctor any questions you may have regarding a medical condition. In addition to all other limitations and disclaimers in this agreement, service provider and its third party providers disclaim any liability or loss in connection with the content provided on this website.

Last Updated on August 28, 2025 by Marie Benz MD FAAD