Local-First Clinical Text Structuring with Fine-Tuned MedGemma for Readmission Risk Assessment

dc.contributor.author: Заболотній, Сергій Васильович
dc.contributor.author: Zabolotnii, Serhii
dc.contributor.author: Holinko, Viktoriia
dc.date.accessioned: 2026-06-05T12:56:02Z
dc.date.available: 2026-06-05T12:56:02Z
dc.date.issued: 2026
dc.description.abstract: Background. Unstructured clinical notes remain a bottleneck for deployable healthcare AI; cloud-dependent pipelines raise privacy and infrastructure barriers. Methods. We present MedGemma StructCore, a local-first two-stage extraction pipeline using compact MedGemma 4B models. Stage 1 applies Schema-Guided Reasoning to summarize notes into structured JSON across nine clinical clusters. Stage 2 projects summaries into canonical KVT4 (Cluster|Keyword|Value|Timestamp) facts via a LoRA-adapted model. Deterministic normalization, a signal-integrity gate, and offline hybrid regeneration audit and reduce silent loss of objective signal between stages. Prompt KV-cache reuse yields a 10.6% speedup with bit-exact output [Verified]. Results. On MIMIC-IV (N=50,000; patient-level split; N_test=9,857), the tabular baseline (A4) achieves AUROC 0.685 (95% CI 0.670–0.699) [Verified]. On the full canonical test split (N_test=9,857), under a constrained training regime (N_train=1,500, N_val=400), A3factlevel achieves AUROC 0.659, AUPRC 0.321, and Brier 0.145. Against a fair tabular refit baseline (LogReg and XGBoost) with the same training split and demographic covariates, A3factlevel improves AUPRC and Brier [Verified], while the AUROC uplift is small and not statistically verified [Preliminary]. Notably, XGBoost does not outperform logistic regression on the same feature set, confirming that downstream gains are attributable to KVT4 features rather than estimator choice. As a post-closure continuation branch, direct typed downstream fusion of four high-signal semantic labels improves on the current Stage 2 baseline on the same canonical split and yields a verified AUPRC gain over the canonical A4 tabular arm [Verified], while remaining near parity with, rather than clearly superior to, A3factlevel.
KVT4 format validity is 99.74%; a signal-integrity audit (N=4,000) finds 15.55% doc-level objective loss (among admissions with Stage 1 numeric vitals/labs), reduced to 8.48% by offline hybrid regeneration without additional LLM calls. Structured-reference validation now includes a large LABS benchmark on the full canonical test split and a preliminary VITALS benchmark path with chartevents-backed BP/Weight evaluation. A model scaling pilot replacing Stage 1 with GPT-4.1-mini confirms that the moderate LABS micro-F1 (≈0.52 ceiling) reflects reference-alignment mismatch rather than model capacity [Preliminary, N=200]. Conclusion. The primary contribution is reliable, auditable local-first clinical text structuring infrastructure running on consumer hardware. On the canonical test split, factlevel KVT tokenization improves precision–recall and probabilistic accuracy metrics (AUPRC, Brier) over a tabular refit baseline [Verified]; the AUROC uplift is small [Preliminary]. Direct typed downstream fusion now provides the strongest verified continuation path over the current Stage 2 baseline, suggesting that typed semantic signals are a more promising next optimization target than further free-form Stage 2 generator variants. The current revision package therefore supports a conservative conclusion: notes-derived KVT4 facts add useful predictive signal, but stronger extraction-quality and fairness claims still require further validation.
dc.identifier.citation: Заболотній С.В., Голінько В. Local-First Clinical Text Structuring with Fine-Tuned MedGemma for Readmission Risk Assessment. Zenodo (Preprint). 2026-02-19. https://doi.org/10.5281/zenodo.18701786
dc.identifier.uri: https://zenodo.org/records/19465471
dc.identifier.uri: https://dr.csbc.edu.ua/handle/123456789/2246
dc.language.iso: en
dc.publisher: https://zenodo.org
dc.subject: TECHNOLOGY
dc.subject: SOCIAL SCIENCES::Statistics, computer and systems science::Informatics, computer and systems science::Information technology
dc.subject: MEDICINE
dc.subject: TECHNOLOGY::Other technology::Medical engineering
dc.subject: MEDICINE::Physiology and pharmacology::Physiology::Medical informatics
dc.subject: MEDICINE::Physiology and pharmacology::Physiology::Medical technology
dc.title: Local-First Clinical Text Structuring with Fine-Tuned MedGemma for Readmission Risk Assessment
dc.type: Article
Files
Original bundle
Name: draft_paper_08_04_26.pdf
Size: 690.57 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed to upon submission
Description: