Dear Editor,
Acute abdominal pain syndrome represents one of the most common and diagnostically challenging presentations in emergency medicine, accounting for approximately 5-10% of all emergency department visits (1). The differential diagnosis spans diverse surgical emergencies including acute appendicitis, acute cholecystitis, intestinal obstruction, and volvulus, each requiring distinct management strategies ranging from conservative therapy to emergent surgical intervention (2). Diagnostic accuracy is paramount, as delayed recognition of surgical emergencies significantly increases morbidity and mortality. Recent advances in multimodal artificial intelligence (AI) systems have demonstrated promise in clinical decision support; however, their application across the heterogeneous spectrum of acute abdominal emergencies remains limited (3). We present a pilot evaluation comparing a multidisciplinary AI system with a single-model AI for diagnosing and managing acute abdominal pain syndrome.
We retrospectively analyzed 36 consecutive patients who presented to Etimesgut Şehit Sait Ertürk State Hospital (Ankara, Türkiye) with acute abdominal pain between September 2024 and January 2025. A multidisciplinary AI system integrating GPT-4V for radiological interpretation (ultrasonography and computed tomography findings), Med-PaLM 2 for emergency medicine assessment, BioGPT for intensive care unit (ICU) triage, and LLaMA 4 for surgical decision-making was compared against ChatGPT-5.0 and the decisions made by the clinical team. GPT-4V was prompted with standardized radiological descriptor templates to identify pathognomonic imaging features: appendiceal diameter ≥6 mm with periappendiceal fat stranding (appendicitis); gallbladder wall thickening ≥3 mm, pericholecystic fluid, and a sonographic Murphy sign (cholecystitis); and whirl sign, coffee-bean sign, and transition point (volvulus). Med-PaLM 2 processed structured clinical parameters, including vital signs, symptom onset, physical examination findings (rebound tenderness, guarding, McBurney’s tenderness), and laboratory values (white blood cell count, C-reactive protein, bilirubin, lipase, lactate), using PICO-formatted prompts. BioGPT assessed the necessity for ICU admission based on Sequential Organ Failure Assessment scores, systemic inflammatory response syndrome criteria, hemodynamic parameters, and lactate levels according to the Surviving Sepsis Campaign 2023 guidelines. LLaMA 4 integrated the American Society of Anesthesiologists physical status classification, comorbidity profile, and current guideline recommendations [World Society of Emergency Surgery (WSES), Tokyo Guidelines 2018, American Society for Gastrointestinal Endoscopy (ASGE)] to generate urgency-stratified surgical management decisions. All AI models, which did not have access to clinical team decisions, received standardized, pre-specified prompts that were developed a priori and applied uniformly across all 36 cases.
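For readers interested in how such task distribution might be wired together, a minimal sketch follows. Every function, rule, and threshold below is an illustrative placeholder standing in for the proprietary models named above (GPT-4V, Med-PaLM 2, BioGPT, LLaMA 4); it is not the study's actual implementation.

```python
# Illustrative sketch only (assumption, not the study's code): a dispatcher
# that routes each clinical question to a dedicated "specialist" function.

def radiology_assess(findings):
    # Stand-in for GPT-4V prompted with standardized radiological descriptors
    if "appendiceal diameter >=6 mm" in findings:
        return "acute appendicitis"
    return "nonspecific"

def emergency_assess(clinical):
    # Stand-in for Med-PaLM 2 processing structured clinical parameters
    return "surgical abdomen" if clinical.get("rebound_tenderness") else "observe"

def icu_triage(severity):
    # Stand-in for BioGPT applying Surviving Sepsis Campaign-style cutoffs
    if severity.get("sofa", 0) >= 2 or severity.get("lactate", 0.0) > 2.0:
        return "ICU"
    return "ward"

def surgical_plan(diagnosis):
    # Stand-in for LLaMA 4 generating urgency-stratified recommendations
    if diagnosis == "acute appendicitis":
        return "early laparoscopic appendectomy"
    return "conservative management"

def multidisciplinary_pipeline(case):
    # Route each modality to its specialist, then combine into one plan
    diagnosis = radiology_assess(case["imaging"])
    return {
        "diagnosis": diagnosis,
        "triage": emergency_assess(case["clinical"]),
        "disposition": icu_triage(case["severity"]),
        "plan": surgical_plan(diagnosis),
    }

example = {
    "imaging": "appendiceal diameter >=6 mm with periappendiceal fat stranding",
    "clinical": {"rebound_tenderness": True, "wbc": 14.2},
    "severity": {"sofa": 1, "lactate": 1.4},
}
print(multidisciplinary_pipeline(example))
```

The design point this sketch captures is that each sub-model sees only the inputs relevant to its domain, and the final recommendation is assembled from four independent, narrowly scoped assessments rather than one end-to-end answer.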
Final diagnoses, established by consensus of the attending clinical team (emergency physician, surgeon, and radiologist), were confirmed by operative findings, pathology reports, or clinical course as appropriate, and served as the reference standard for all AI performance evaluations. The cohort (mean age 52.4±17.6 years; 55.6% male) comprised acute appendicitis (n=12, 33.3%), acute cholecystitis (n=10, 27.8%), paralytic ileus (n=8, 22.2%), sigmoid volvulus (n=4, 11.1%), and cecal volvulus (n=2, 5.6%).
The multidisciplinary AI achieved 94.4% overall diagnostic accuracy (34/36), significantly outperforming ChatGPT-5.0 (80.6%, 29/36; p=0.012). For acute appendicitis, the multidisciplinary AI system correctly identified 11/12 cases (91.7%), including 3 cases of complicated appendicitis with perforation, compared with 9/12 (75.0%) for ChatGPT-5.0; integration of the Alvarado score further enhanced diagnostic precision. The multidisciplinary system detected acute cholecystitis with 90.0% sensitivity (9/10), accurately identifying gallbladder wall thickening and pericholecystic fluid on ultrasonography. The critical differentiation between surgical emergencies (appendicitis, cholecystitis, volvulus) and candidates for conservative management (ileus) showed 96.4% accuracy (27/28) for the multidisciplinary system versus 82.1% (23/28) for ChatGPT-5.0 (p=0.024). All volvulus cases were detected (100% sensitivity, 6/6), with recognition of the coffee-bean and whirl signs (Table 1).
Analysis of guideline adherence revealed concordance rates of 94.4% (34/36) for the multidisciplinary AI compared with 77.8% (28/36) for ChatGPT-5.0 (p=0.018). For appendicitis, the AI correctly recommended early laparoscopic appendectomy per the WSES Jerusalem guidelines in all uncomplicated cases (4). Acute cholecystitis management was aligned with Tokyo Guidelines 2018 criteria, with appropriate early cholecystectomy recommended in 9/10 cases (90.0%) versus 7/10 (70.0%) for ChatGPT-5.0 (5). Sigmoid volvulus cases received guideline-concordant endoscopic detorsion recommendations (100% vs 50.0%) per ASGE criteria (6). ICU triage accuracy was 91.7% versus 77.8% (p=0.032), with BioGPT demonstrating superior sepsis recognition per Surviving Sepsis Campaign criteria (7). Clinical outcomes comprised surgical intervention in 22 patients (61.1%), endoscopic intervention in 3 patients (8.3%), and conservative management in 11 patients (30.6%); there was one in-hospital death (perforated appendicitis, 2.8%) (Table 2).
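As a quick sanity check, the headline proportions in the two paragraphs above can be reproduced directly from the reported raw counts. The snippet below is plain arithmetic on the counts given in the text; it is not part of the study's statistical analysis and does not reproduce the quoted p-values.

```python
# Plain arithmetic check of the proportions reported in the letter
# (raw counts taken directly from the text; no statistical testing here).
counts = {
    "multidisciplinary overall accuracy": (34, 36),  # reported 94.4%
    "ChatGPT-5.0 overall accuracy": (29, 36),        # reported 80.6%
    "surgical-vs-conservative (multi)": (27, 28),    # reported 96.4%
    "surgical-vs-conservative (single)": (23, 28),   # reported 82.1%
    "guideline concordance (multi)": (34, 36),       # reported 94.4%
    "guideline concordance (single)": (28, 36),      # reported 77.8%
}
for label, (hits, total) in counts.items():
    print(f"{label}: {hits}/{total} = {100 * hits / total:.1f}%")

# Accuracy gap quoted in the discussion: 94.4 - 80.6 = 13.8 percentage points
gap = round(100 * 34 / 36, 1) - round(100 * 29 / 36, 1)
print(f"overall accuracy gap: {gap:.1f} percentage points")
```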
Our findings support the emerging paradigm of task-distributed AI in acute care. In a comprehensive scoping review of 432 studies, Schouten et al. (8) reported that multimodal AI models provide a mean 6.2-point improvement in area under the curve over unimodal systems, directionally consistent with the 13.8-percentage-point accuracy difference observed here. Notably, the 94.4% diagnostic accuracy achieved by the multidisciplinary AI system is contextually comparable to published benchmarks: Kaczmarczyk et al. (9) reported 95.8% accuracy for collective human decision-making in complex diagnostic tasks. However, given the retrospective, single-center design of the current pilot study, direct equivalence with human specialist performance cannot be inferred, and prospective head-to-head comparison with clinical teams is required. Krones et al. (10) demonstrated in Information Fusion that specialized model integration consistently outperforms single comprehensive systems in complex clinical scenarios requiring cross-domain expertise. Recent work by Wiest et al. (11) in Nature Reviews Gastroenterology & Hepatology highlighted that domain-specific large language models achieve optimal performance through integration rather than through general-purpose approaches. The 2025 multimodal AI review in World Journal of Gastroenterology emphasized the transformative potential of AI in precision diagnosis of gastrointestinal emergencies, with particular relevance to acute abdominal presentations. Our superior performance in differentiating appendicitis, cholecystitis, and ileus addresses a critical diagnostic challenge in which clinical and radiological overlap frequently delays appropriate management.
This study has several limitations, including the small sample size, single-center design, and retrospective data collection, all of which preclude definitive conclusions regarding generalizability. Nevertheless, these preliminary results, which demonstrate high diagnostic accuracy within a limited pilot context, support larger multicenter prospective validation trials to establish the clinical utility of multidisciplinary AI integration across the acute abdominal pain spectrum.


