Page 719 - ISC PROCEEDINGS 21.4
baseline correctness rates of 8-27%, approximating chance performance on multiple-choice
items, meant that any genuine improvement would necessarily produce large
absolute gains. Several internal patterns lend credibility to these results. Improvement
was not uniform: it was steepest for the most cognitively demanding indicators, I² and
NNT, consistent with the hypothesis that the structured protocol specifically supported
higher-order reasoning. Concurrently, task completion time fell sharply, a pattern difficult
to reconcile with a ceiling or test-familiarity artefact. Nevertheless, effect sizes in a
controlled design with a comparison group would likely be more modest.
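As a reminder of why these two indicators count as multi-step, the sketch below computes each from its standard definition: NNT as the reciprocal of the absolute risk reduction, and I² as the share of Cochran's Q in excess of its degrees of freedom. The function names and all input values are hypothetical and chosen only to make the arithmetic visible; they are not taken from the study materials.

```python
import math


def nnt(control_event_rate: float, treated_event_rate: float) -> float:
    """Number needed to treat: 1 / absolute risk reduction (ARR)."""
    arr = control_event_rate - treated_event_rate
    if arr <= 0:
        raise ValueError("no risk reduction, so NNT is undefined here")
    return 1.0 / arr


def i_squared(q: float, k: int) -> float:
    """Higgins I² in percent, from Cochran's Q and the number of studies k."""
    df = k - 1
    if q <= 0:
        return 0.0
    # Negative values are truncated to zero by convention.
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical inputs, for illustration only.
print(math.ceil(nnt(0.25, 0.125)))  # NNT is conventionally rounded up
print(i_squared(12.0, 7))
```

Each result depends on an intermediate quantity (ARR, or Q and its degrees of freedom) that must itself be read or derived from the meta-analysis output, which may be part of what makes these indicators harder than single-value lookups.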
The 83.7% reduction in task completion time (from 135 to 22 minutes) is consistent
with a transition from effortful, deliberate processing to more fluent performance, as
described by Anderson’s (1987) skill acquisition model [13]: students progressed from
looking up definitions to directly applying indicators in clinical contexts. A residual
familiarity effect with the item format cannot be fully excluded.
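The reported reduction can be checked directly from the two completion times given above. The helper name below is an assumption introduced for this check.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Completion times reported in the text: 135 minutes before, 22 minutes after.
print(round(percent_reduction(135, 22), 1))  # 83.7
```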
The observed improvements may reflect a combination of language support,
stepwise reasoning, and structured practice. The AI helped students interpret technical
statistical terms in more accessible language, while also breaking complex indicators such
as NNT and I² into manageable steps. At the same time, the session design encouraged
active engagement through independent reading, AI-assisted analysis, and source
verification. The lack of between-group differences suggests that the structured protocol,
rather than the platform itself, was the main contributor to the observed gains. The
authentic meta-analysis dataset employed (Wu et al., 2025: pooled RR = 1.10, p = 0.07,
low I²) provided a pedagogically valuable real-world case illustrating the independence of
statistical significance from sample size, a conceptual lesson that a synthetic or
hypothetical dataset could not replicate with equivalent credibility.
Comparison with International Literature and Interpretation of the AI Comparison
Sallam (2023), in a
systematic review of 60 records on ChatGPT in health sciences education and practice,
identified notable benefits reported in 85% of records (51/60), particularly in improving
academic writing, data analysis, and health education [6]. A distinguishing feature of the
present study is that improvements were concentrated on complex, multi-step integrated
indicators (NNT, I²), not merely on knowledge recall, which differentiates the
structured-prompt intervention from free AI use.
Regarding the ChatGPT versus Gemini comparison: no statistically significant
difference was detected on any indicator (Fisher’s exact p = 0.71-1.00; Mann-Whitney U p
= 0.54; φ ≈ 0.02-0.10). With subgroups of only 19 and 18 students, the comparison had
limited statistical power to detect modest between-platform differences; the null result
therefore reflects insufficient evidence rather than demonstrated equivalence. Under the
conditions of this study, neither platform showed a clear advantage. This finding is
consistent with the hypothesis that prompt methodology, rather than AI platform, is the
primary determinant of the observed gains, although the limited power means it cannot confirm it.
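To make the between-platform statistics concrete, the following is a minimal sketch of a two-sided Fisher's exact test and the φ coefficient for a 2×2 table, written with the Python standard library only. The subgroup sizes (19 and 18) come from the text; the correct/incorrect counts are hypothetical and are not the study's data, and the function names are assumptions made for illustration.

```python
from math import comb, sqrt


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x first-column counts in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables no more probable than the observed one.
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi effect size for a 2x2 table; 0.0 if any margin is empty."""
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


# Hypothetical counts (correct, incorrect) for subgroups of 19 and 18.
chatgpt = (15, 4)
gemini = (13, 5)
print(fisher_exact_two_sided(*chatgpt, *gemini))
print(phi_coefficient(*chatgpt, *gemini))
```

With subgroups this small, sizeable differences in proportions can still return non-significant p-values, which is why a null result here indicates insufficient evidence rather than equivalence.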
4.2. Limitations and future directions
First, the one-group pretest-posttest design does not fully exclude temporal
confounding (maturation effect) or repeated-testing effects. The use of parallel
forms and the two-week interval reduces but does not eliminate these threats. An
optimal design would require a randomized control group. Second, the small subgroup
sizes (n₁ = 19; n₂ = 18) limit the sensitivity of the ChatGPT-Gemini comparison for
detecting modest between-group differences. The accurate conclusion is that no
statistically significant difference was detected under these study conditions, not that the

