Page 719 - ISC PROCEEDINGS 21.4

baseline correctness rates of 8-27% approximating chance performance on multiple-
choice items meant that any genuine improvement would necessarily produce large
absolute gains. Several internal patterns lend credibility to these results. Improvement
was not uniform: it was steepest for the most cognitively demanding indicators, I² and
NNT, consistent with the hypothesis that the structured protocol specifically supported
higher-order reasoning. Concurrently, task completion time fell sharply, a pattern difficult
to reconcile with a ceiling or test-familiarity artefact. Nevertheless, effect sizes in a
controlled design with a comparison group would likely be more modest.
The 83.7% reduction in task completion time (from 135 to 22 minutes) is consistent
with a transition from effortful, deliberate processing to more fluent performance, as
described by Anderson’s (1987) skill acquisition model [13]: students progressed from
looking up definitions to directly applying indicators in clinical contexts. A residual
familiarity effect with the item format cannot be fully excluded.
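The reported reduction follows directly from the two completion times given above. A minimal Python check is shown here; the function name `percent_reduction` is illustrative and not taken from the study materials:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, expressed in percent."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before * 100.0


# Completion times reported in the text: 135 minutes before, 22 minutes after.
reduction = percent_reduction(135, 22)
print(f"{reduction:.1f}%")
```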
The observed improvements may reflect a combination of language support,
stepwise reasoning, and structured practice. The AI helped students interpret technical
statistical terms in more accessible language, while also breaking complex indicators such
as NNT and I² into manageable steps. At the same time, the session design encouraged
active engagement through independent reading, AI-assisted analysis, and source
verification. The lack of between-group differences suggests that the structured protocol,
rather than the platform itself, was the main contributor to the observed gains. The
authentic meta-analysis dataset employed (Wu et al., 2025: pooled RR = 1.10, p = 0.07,
low I²) provided a pedagogically valuable real-world case illustrating the independence of
statistical significance from sample size, a conceptual lesson that a synthetic or
hypothetical dataset could not replicate with equivalent credibility.
Comparison with International Literature and Interpretation of the AI Comparison
Sallam (2023), in a
systematic review of 60 records on ChatGPT in health sciences education and practice,
identified notable benefits reported in 85% of records (51/60), particularly in improving
academic writing, data analysis, and health education [6]. A distinguishing feature of the
present study is that improvements were concentrated on complex, multi-step integrated
indicators (NNT, I²) rather than on knowledge recall alone, which differentiates the
structured-prompt intervention from free AI use.
Regarding the ChatGPT versus Gemini comparison: no statistically significant
difference was detected on any indicator (Fisher’s exact p = 0.71-1.00; Mann-Whitney U p
= 0.54; φ ≈ 0.02-0.10). With subgroups of only 19 and 18 students, the comparison had
limited statistical power to detect modest between-platform differences; the null result
therefore reflects insufficient evidence rather than demonstrated equivalence. Under the
conditions of this study, neither platform showed a clear advantage. This finding is
consistent with the hypothesis that prompt methodology, rather than AI platform, is the
primary determinant of the observed gains under the study conditions.
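The limited power of a comparison between subgroups of 19 and 18 students can be illustrated with an exact calculation. The sketch below is not the authors' analysis: it enumerates every possible 2×2 outcome and sums the probability of those for which a two-sided Fisher's exact test would reject at α = 0.05. The true success proportions passed to `fisher_power` are hypothetical inputs chosen for illustration and are not results from this study.

```python
from math import comb


def fisher_two_sided_p(a: int, n1: int, b: int, n2: int) -> float:
    """Two-sided Fisher exact p-value for a/n1 versus b/n2 successes."""
    k = a + b                      # total successes (fixed margin)
    total = comb(n1 + n2, k)

    def prob(x: int) -> float:     # hypergeometric probability of x successes in group 1
        return comb(n1, x) * comb(n2, k - x) / total

    lo, hi = max(0, k - n2), min(n1, k)
    p_obs = prob(a)
    # Sum all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def binom_pmf(x: int, n: int, p: float) -> float:
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def fisher_power(p1: float, p2: float, n1: int = 19, n2: int = 18,
                 alpha: float = 0.05) -> float:
    """Exact probability of rejecting H0 given hypothetical true proportions p1, p2."""
    power = 0.0
    for a in range(n1 + 1):
        for b in range(n2 + 1):
            if fisher_two_sided_p(a, n1, b, n2) <= alpha:
                power += binom_pmf(a, n1, p1) * binom_pmf(b, n2, p2)
    return power


# Hypothetical scenario: a 20-percentage-point true difference between platforms.
print(round(fisher_power(0.90, 0.70), 3))
```

Under such assumed inputs, the probability of detecting even a fairly large between-platform difference stays well below the conventional 80% target, which is why a non-significant result at these sample sizes cannot be read as evidence of equivalence.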
4.2. Limitations and future directions
First, the one-group pretest-posttest design does not fully exclude temporal
confounding (maturation effect) or repeated-testing effects. The use of parallel
forms and a two-week interval mitigates but does not eliminate these threats. An
optimal design would include a randomized control group. Second, the small subgroup
sizes (n₁ = 19; n₂ = 18) limit the sensitivity of the ChatGPT-Gemini comparison for
detecting modest between-group differences. The accurate conclusion is that no
statistically significant difference was detected under these study conditions, not that the
