Page 694 - ISC PROCEEDINGS 21.4

assertive in tone. This condition follows Bucinca et al. (2021) and de Jong et al. (2025); its
                  answer frequency and confidence levels serve as the baseline for evaluating nudges in
                  other conditions.
                        Hesitant language condition (HLC). Participants also receive an immediate LLM
                  response, but in a hesitant tone achieved through probability modals ('could', 'might'),
                  hedging phrases ('perhaps', 'seems', 'suppose'), and uncertainty expressions ('this isn't
                  entirely clear to me', 'not fully certain'). This condition follows Xu et al. (2025).
                        Delayed answer condition (DAC). Participants answer the questions before receiving
                  the LLM response, then re-answer afterward, with the option to revise their original
                  answer. This condition is derived from Bucinca et al. (2021).
                        Delayed answer with guidance condition (DAWGC). Similar to DAC, but participants
                  receive guidance on approaching the text before answering. They similarly re-answer
                  after viewing the LLM response.
                        2.2.2. Materials
                        The survey is conducted on Google Forms and takes approximately 15–20 minutes.
                  It consists of two sections: LLM usage in educational contexts, and a Reading multiple-
                  choice questions (MCQs) task followed by AI-generated responses.
                        For Section 2, we select a Reading MCQs passage titled 'Hollywood' from Cambridge
                  CPE 1 (2001), extracting the second paragraph and asking participants to answer Question
                  22. The question has one correct answer. See Appendix 1 for the full text.
                        Using Claude Sonnet 4.6, we generate three fixed LLM responses: an assertive
                  response (AR), appearing in CC upon condition selection and in DAC/DAWGC post-answer;
                  a hesitant response (HR) designed using first-person perspective similar to Kim et al.
                  (2024), appearing only in HLC; and a guidance response (GR), appearing at the start of
                  DAWGC only. See Appendix 2 for the full text of each response.
                        AR and HR both suggest the wrong answer (choice C) with a manipulative explanation,
                  while GR directs participants towards the correct answer (choice A). This manipulation
                  goes beyond existing studies because it allows us to assess whether participants'
                  self-confidence in their own answer and their final decision are significantly mediated
                  by LLM output.
                        2.2.3. Procedure
                        Section 1 asks participants about their year of study (multiple choice), their LLM use for
                  academic purposes (multiple choice, multiple responses allowed), and, on a 5-point
                  Likert scale, how often they use LLMs for research, use them to complete assignments, and
                  verify LLM outputs (1 = never, 5 = always; Cronbach's α = .7237).
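A reliability coefficient such as the Cronbach's α reported above can be computed from the item-level Likert scores. The following is a minimal sketch using the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of total scores). The function name `cronbach_alpha` and the `demo` responses are hypothetical illustrations, not the study's code or data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of per-participant item-score lists.

    Uses sample variances (n - 1 denominator) for both the items and
    the per-participant total scores.
    """
    k = len(rows[0])  # number of items in the scale
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two participants")
    item_columns = list(zip(*rows))  # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in item_columns)
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Made-up responses on three 5-point items (NOT the study's data).
demo = [
    [4, 5, 3],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 2],
]
alpha = cronbach_alpha(demo)
print(round(alpha, 4))
```

Values of roughly .70 and above are conventionally read as acceptable internal consistency, although that threshold is a rule of thumb rather than a strict cutoff.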
                        In Section 2, participants are randomly assigned to conditions by selecting the
                  topmost option in a shuffled multiple-choice question (options 1–4, mapped to CC, HLC,
                  DAC, and DAWGC respectively). After each attempt at the Reading MCQs, participants
                  rate their confidence on a 5-point Likert scale (1 = 0%, not confident at all; 5 = 100%, very
                  confident). Upon completion, participants are thanked and debriefed.
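The assignment step can be pictured with a short simulation. This is only a sketch of the logic described above, assuming the form shuffles the four options uniformly at random and every participant picks the topmost one; the names `CONDITIONS` and `assign_condition` are hypothetical, and the snippet does not reproduce how Google Forms shuffles options internally.

```python
import random

# Option number -> condition, as described in the procedure.
CONDITIONS = {1: "CC", 2: "HLC", 3: "DAC", 4: "DAWGC"}


def assign_condition(rng):
    """Shuffle the option order and return the condition shown on top."""
    options = list(CONDITIONS)  # [1, 2, 3, 4]
    rng.shuffle(options)        # assumed uniform shuffle of display order
    topmost = options[0]        # the participant selects the topmost option
    return CONDITIONS[topmost]


# Simulate many participants to check that groups come out roughly equal.
rng = random.Random(0)  # fixed seed so the simulation is repeatable
counts = {name: 0 for name in CONDITIONS.values()}
for _ in range(4000):
    counts[assign_condition(rng)] += 1
print(counts)
```

Under these assumptions each condition receives about a quarter of participants, although individual group sizes in a real sample will vary by chance.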
                        3. Results and discussion
                        3.1. Results
                        3.1.1. Right-tailed z-test for difference in proportions
                        Our analysis tests for a significant effect on the accuracy rate (that is, the
                  proportion of participants choosing the correct answer, A). Every condition meets the
                  minimum sample size conventionally required for the Central Limit Theorem approximation
                  (n ≥ 30). We use a right-tailed test at the 5% significance level because the hypothesis
                  predicts a positive difference between

