Page 719 - ISC PROCEEDINGS 21.4

baseline correctness rates of 8-27%, approximating chance performance on multiple-choice
                  items, meant that any genuine improvement would necessarily produce large
                  absolute gains. Several internal patterns lend credibility to these results. Improvement
                  was not uniform: it was steepest for the most cognitively demanding indicators, I² and
                  NNT, consistent with the hypothesis that the structured protocol specifically supported
                  higher-order reasoning. Concurrently, task completion time fell sharply, a pattern difficult
                  to reconcile with a ceiling or test-familiarity artefact. Nevertheless, effect sizes in a
                  controlled design with a comparison group would likely be more modest.
                        The 83.7% reduction in task completion time (from 135 to 22 minutes) is consistent
                  with a transition from effortful, deliberate processing to more fluent performance, as
                  described by Anderson’s (1987) skill acquisition model [13]: students progressed from
                  looking up definitions to directly applying indicators in clinical contexts. A residual
                  familiarity effect with the item format cannot be fully excluded.
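The reported reduction follows directly from the two completion times given above. A minimal Python sketch of the arithmetic is shown here; the function name `percent_reduction` is illustrative and not part of the study materials.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Completion times reported in the text: 135 minutes before, 22 minutes after.
reduction = percent_reduction(135, 22)
print(f"{reduction:.1f}%")  # prints 83.7%
```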
                        The observed improvements may reflect a combination of language support,
                  stepwise reasoning, and structured practice. The AI helped students interpret technical
                  statistical terms in more accessible language, while also breaking complex indicators such
                  as NNT and I² into manageable steps. At the same time, the session design encouraged
                  active engagement through independent reading, AI-assisted analysis, and source
                  verification. The lack of between-group differences suggests that the structured protocol,
                  rather than the platform itself, was the main contributor to the observed gains. The
                  authentic meta-analysis dataset employed (Wu et al., 2025: pooled RR = 1.10, p = 0.07,
                  low I²) provided a pedagogically valuable real-world case illustrating the independence of
                  statistical significance from sample size, a conceptual lesson that a synthetic or
                  hypothetical dataset could not replicate with equivalent credibility.
                        Comparison with International Literature and Interpretation of the AI Comparison
                        Sallam (2023), in a
                  systematic review of 60 records on ChatGPT in health sciences education and practice,
                  identified notable benefits reported in 85% of records (51/60), particularly in improving
                  academic writing, data analysis, and health education [6]. A distinguishing feature of the
                  present study is that improvements were concentrated on complex, multi-step integrated
                  indicators (NNT, I²) rather than on knowledge recall alone, which differentiates a
                  structured-prompt intervention from free AI use.
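To illustrate why NNT and I² count as multi-step indicators, the sketch below spells out the intermediate quantities using the standard textbook definitions: NNT as the reciprocal of the absolute risk reduction, and I² derived from Cochran's Q and its degrees of freedom. All numeric inputs are hypothetical and are not taken from the study or from the Wu et al. dataset; the function names are likewise illustrative.

```python
import math
from fractions import Fraction


def number_needed_to_treat(events_ctrl: int, n_ctrl: int,
                           events_trt: int, n_trt: int) -> int:
    """NNT = 1 / |ARR|, rounded up to the next whole patient."""
    cer = Fraction(events_ctrl, n_ctrl)   # step 1: control event rate
    eer = Fraction(events_trt, n_trt)     # step 2: experimental event rate
    arr = cer - eer                       # step 3: absolute risk reduction
    if arr == 0:
        raise ValueError("ARR is zero; NNT is undefined")
    return math.ceil(1 / abs(arr))        # step 4: reciprocal, rounded up


def i_squared(q: float, df: int) -> float:
    """I² in percent: share of total variability attributed to heterogeneity."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical inputs, for illustration only.
print(number_needed_to_treat(20, 100, 15, 100))  # ARR = 0.05, so NNT = 20
print(i_squared(q=20.0, df=10))                  # 50.0
print(i_squared(q=5.0, df=10))                   # Q below df is truncated to 0.0
```

Each function has several places where a learner can go wrong (choosing the right denominators, the sign of the difference, the truncation at zero), which is one plausible reason these indicators are harder than single-definition recall items.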
                        Regarding the ChatGPT versus Gemini comparison: no statistically significant
                  difference was detected on any indicator (Fisher’s exact p = 0.71-1.00; Mann-Whitney U p
                  = 0.54; φ ≈ 0.02-0.10). With subgroups of only 19 and 18 students, the comparison had
                  limited statistical power to detect modest between-platform differences; the null result
                  therefore reflects insufficient evidence rather than demonstrated equivalence. Under the
                  conditions of this study, neither platform showed a clear advantage. This finding is
                  consistent with, but given the limited power does not confirm, the hypothesis that prompt
                  methodology rather than AI platform is the primary determinant of performance.
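The kind of between-platform comparison described above can be sketched with a two-sided Fisher's exact test and the φ effect size for a 2×2 table. The code below is a standard-library illustration only: the counts are hypothetical (they are not the study's data), and it is not a reconstruction of the software the authors used.

```python
import math


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)

    def weight(k: int) -> int:
        # Unnormalised hypergeometric weight (exact integer arithmetic).
        return math.comb(row1, k) * math.comb(row2, col1 - k)

    observed = weight(a)
    extreme = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    return extreme / math.comb(row1 + row2, col1)


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi effect size for a 2x2 table."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


# Hypothetical (correct, incorrect) counts for subgroups of 19 and 18 students.
table = (15, 4, 13, 5)
p = fisher_exact_two_sided(*table)
phi = phi_coefficient(*table)
print(f"p = {p:.2f}, phi = {phi:.2f}")  # prints p = 0.71, phi = 0.08
```

In this hypothetical table the two subgroups differ by roughly seven percentage points in correctness, yet the test is nowhere near significance, which illustrates why a null result with subgroups this small is better read as insufficient evidence than as equivalence.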
                        4.2. Limitations and future directions
                        First, the one-group pretest-posttest design does not fully exclude temporal
                  confounding (maturation effect) or repeated-testing effects. The use of parallel
                  forms and the two-week interval reduces but does not eliminate these threats. A
                  stronger design would include a randomized control group. Second, the small subgroup
                  sizes (n₁ = 19; n₂ = 18) limit the sensitivity of the ChatGPT-Gemini comparison for
                  detecting modest between-group differences. The accurate conclusion is that no
                  statistically significant difference was detected under these study conditions, not that the

