Page 694 - ISC PROCEEDINGS 21.4
assertive in tone. This condition follows Bucinca et al. (2021) and de Jong et al. (2025); its
answer frequency and confidence levels serve as the baseline for evaluating nudges in
other conditions.
Hesitant language condition (HLC). Participants also receive an immediate LLM
response, but in a hesitant tone achieved through probability modals ('could', 'might'),
hedging phrases ('perhaps', 'seems', 'suppose'), and uncertainty expressions ('this isn't
entirely clear to me', 'not fully certain'). This condition follows Xu et al. (2025).
Delayed answer condition (DAC). Participants answer the questions before receiving
the LLM response, then re-answer afterward, with the option to revise their original
answer. This condition is derived from Bucinca et al. (2021).
Delayed answer with guidance condition (DAWGC). Similar to DAC, but participants
receive guidance on approaching the text before answering. They similarly re-answer
after viewing the LLM response.
2.2.2. Materials
The survey is conducted on Google Forms and takes approximately 15–20 minutes.
It consists of two sections: LLM usage in educational contexts, and a Reading multiple-
choice questions (MCQs) task followed by AI-generated responses.
For Section 2, we select a Reading MCQs passage titled 'Hollywood' from Cambridge
CPE 1 (2001), extracting the second paragraph and asking participants to answer Question
22. The question has one correct answer. See Appendix 1 for the full text.
Using Claude Sonnet 4.6, we generate three fixed LLM responses: an assertive
response (AR), appearing in CC upon condition selection and in DAC/DAWGC post-answer;
a hesitant response (HR) designed using first-person perspective similar to Kim et al.
(2024), appearing only in HLC; and a guidance response (GR), appearing at the start of
DAWGC only. See Appendix 2 for the full text of each response.
AR and HR both suggest the wrong answer (choice C) with a manipulative explanation,
while GR directs participants towards the correct answer (choice A). This manipulation is
a major step up from existing studies, because it allows us to assess whether participants'
self-confidence in their own answer and their final decision are significantly mediated
by LLM output.
2.2.3. Procedure
Section 1 asks participants about their year of study (multiple choice), LLM use for
academic purposes (multiple choice, multiple responses allowed), and, using a 5-point
Likert scale, how often they use LLMs for research, use them to complete assignments, and
verify LLM outputs (1 = never, 5 = always; Cronbach's α = .7237).
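For readers unfamiliar with the reliability coefficient reported above, the following is a minimal sketch of how Cronbach's α is conventionally computed from item scores, using sample variances. The function name `cronbach_alpha` and the example scores are hypothetical illustrations; they are not the study's data or analysis script.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-participant item-score lists.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Transpose so that each tuple holds every participant's score on one item.
    items = list(zip(*rows))
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in rows])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical 5-point Likert responses (rows = participants, columns = items).
example_scores = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
]
example_alpha = cronbach_alpha(example_scores)
```

Values closer to 1 indicate that the items vary together; identical items yield exactly 1.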
In Section 2, participants are randomly assigned to conditions by selecting the
topmost option in a shuffled multiple-choice question (options 1–4, mapped to CC, HLC,
DAC, and DAWGC respectively). After each attempt at the Reading MCQs, participants
rate their confidence on a 5-point Likert scale (1 = 0%, not confident at all; 5 = 100%, very
confident). Upon completion, participants are thanked and debriefed.
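The assignment step described above can be simulated as follows. This is only a sketch under stated assumptions: it supposes that the form shuffles the four options uniformly at random and that every participant picks the topmost one. The names `CONDITIONS` and `assign_condition` are hypothetical, and the code does not reproduce Google Forms' actual shuffling behaviour.

```python
import random
from collections import Counter

# Option number -> condition label, as mapped in the procedure.
CONDITIONS = {1: "CC", 2: "HLC", 3: "DAC", 4: "DAWGC"}


def assign_condition(rng):
    """Shuffle the four options and return the condition of the topmost one."""
    options = list(CONDITIONS)
    rng.shuffle(options)  # assumed: uniform shuffle of the option order
    return CONDITIONS[options[0]]


# Simulate 200 hypothetical participants with a fixed seed for repeatability.
rng = random.Random(0)
assignment_counts = Counter(assign_condition(rng) for _ in range(200))
```

Under these assumptions each condition is equally likely for every participant, although the realised group sizes will generally differ.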
3. Results and discussion
3.1. Results
3.1.1. Right-tailed z-test for difference in proportions
We focus our analysis on detecting a significant effect on the accuracy rate (that is, the
proportion of participants choosing A, the correct answer). Every condition meets the
minimum sample size conventionally required for the Central Limit Theorem approximation
(n ≥ 30). We use a right-tailed test at the 5% significance level, because the hypothesis
predicts a positive difference between

