Page 692 - ISC PROCEEDINGS 21.4

response. The current research seeks to address this gap by introducing both
verbalized uncertainty and cognitive forcing methods (making a decision before seeing the
AI response, and receiving suggestions before answering) into the experiment. The
hypotheses are as follows.
H1.1 The proportion of accurate answers among participants receiving a hesitant response
from the LLM is significantly higher than that among participants in the control condition.
H1.2 The proportion of accurate answers among participants experiencing a delay before
receiving the LLM response is significantly higher than that among participants in the
control condition.
H1.3 The proportion of accurate answers at final decision among participants experiencing
a delay with LLM suggestions before receiving the LLM response is significantly higher than
that among participants in the control condition.
H1.4 The proportion of accurate answers at final decision among participants experiencing
a delay with LLM suggestions before receiving the LLM response is significantly higher than
that among participants experiencing a delay before receiving the LLM response.
H1.5 The rate of choosing the deliberately incorrect answer among participants receiving
hesitant responses from the LLM is significantly lower than that among participants in the
control condition.
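Each hypothesis above compares a proportion between two conditions. The text does not state which statistical test is used, so the following is only a minimal sketch of one conventional option, a one-sided two-proportion z-test with a pooled standard error. The function name and the counts in the usage line are hypothetical and are not taken from the study.

```python
import math

def two_proportion_z_test(correct_treat, n_treat, correct_ctrl, n_ctrl):
    """One-sided z-test of the alternative p_treat > p_ctrl.

    Assumption: independent groups and samples large enough for the
    normal approximation. This is an illustrative choice of test, not
    necessarily the analysis used in the study.
    """
    p_treat = correct_treat / n_treat
    p_ctrl = correct_ctrl / n_ctrl
    # Pooled proportion under the null hypothesis of equal proportions.
    pooled = (correct_treat + correct_ctrl) / (n_treat + n_ctrl)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_ctrl))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p_treat - p_ctrl) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 70/100 correct (treatment) vs 55/100 correct (control).
z, p = two_proportion_z_test(70, 100, 55, 100)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

For a hypothesis predicting a lower proportion in the treatment group, as in H1.5, the two groups can simply be swapped in the call so that the alternative is still expressed as "first group higher than second".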
2.1.2. Self-confidence and reliance before and after LLM response
The internal measure of self-confidence is fundamentally a metacognitive process: a
person's subjective assessment of their own competence, which may or may not align with
their performance. In psychological research, this self-evaluation is operationalized
through confidence calibration, the degree to which an individual's expressed certainty
corresponds to their actual accuracy (Fleming & Dolan, 2012). Researchers have long
distinguished between two core components of this internal measure: calibration, which
is the goodness of fit between probability assessments and the corresponding proportion
of correct responses, and resolution, which captures an individual's ability to discriminate
between their own correct and incorrect judgments (Praveen et al., 2025). Importantly,
personality traits and cognitive ability appear to play only a small role in determining
accuracy of self-assessment; rather, there are multiple causes of miscalibration that
current imperfect models fail to capture. This means self-confidence ratings are not a
stable trait but a dynamic judgment susceptible to disruption by the surrounding
environment, including AI-assisted environments.
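The two components named above can be made concrete with a short computation. The text does not say which indices are used, so the sketch below assumes one common operationalization, the calibration and resolution terms of the Murphy decomposition of the Brier score; the function name and the sample data are hypothetical.

```python
from collections import defaultdict

def calibration_and_resolution(confidences, correct):
    """Return (calibration, resolution) from the Murphy decomposition.

    confidences: stated probabilities of being correct, each in [0, 1]
    correct:     1 if the corresponding answer was right, else 0

    Assumption: answers are grouped by their exact stated confidence
    level, as with a discrete rating scale.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Collect the outcomes observed at each stated confidence level.
    groups = defaultdict(list)
    for conf, hit in zip(confidences, correct):
        groups[conf].append(hit)

    n = len(correct)
    base_rate = sum(correct) / n
    calibration = 0.0
    resolution = 0.0
    for conf, hits in groups.items():
        hit_rate = sum(hits) / len(hits)
        # Calibration: gap between stated confidence and observed accuracy.
        calibration += len(hits) * (conf - hit_rate) ** 2
        # Resolution: how far each level's accuracy departs from the base rate.
        resolution += len(hits) * (hit_rate - base_rate) ** 2
    return calibration / n, resolution / n

# Hypothetical ratings: four answers at 90% confidence, four at 60%.
conf = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.6]
hits = [1, 1, 1, 0, 1, 0, 0, 1]
cal, res = calibration_and_resolution(conf, hits)
print(f"calibration = {cal:.5f}, resolution = {res:.5f}")
```

Under this decomposition a calibration value of 0 indicates a perfect fit between stated confidence and observed accuracy, while a larger resolution value indicates that confidence levels separate correct from incorrect answers more sharply.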
Recent research has revealed that frequent use of AI may disrupt the accuracy of
users' self-reported confidence. A landmark study by Fernandes and Welsch (2025) found
that while participants using ChatGPT to complete logical reasoning tasks outperformed a
control group, their self-reported estimates of success were dramatically inflated. On
average, participants estimated they had answered approximately 17 out of 20 questions
correctly, despite their real performance being considerably lower - a pattern the
researchers describe as an "illusion of competence" in which confidence is overrated
relative to accuracy. A parallel study at Carnegie Mellon University reinforced these
findings across multiple LLMs and task types, with researchers observing that when an AI
asserts an answer with confidence, users may not be as skeptical as they should be, partly
because humans lack the non-verbal cues they would normally use to evaluate another
person's certainty. Together, these findings demonstrate that AI interactions do not
merely supplement human judgment - they actively distort the user's internal confidence
signal.
The question of how users perceive their performance relative to an AI response
