The erosion of error-detection under fluent authority — and the temporal arc of critical challenge
DA-008 and SP-006 developed a broad framework for cognitive trajectory risks in LLM-mediated environments. This preliminary note narrows the question to one specific and empirically tractable component: critical resilience. Humans are not passive recipients of external frames — they have robust biological, social, and experiential resources that resist cognitive influence. The question is not whether LLM interaction can override this resilience, but whether sustained interaction with a highly fluent, authoritative-sounding system may gradually reduce the exercise of critical challenge — and whether that reduction is detectable and reversible.
The cognitive trajectory framework in SP-006 asks whether independent framing capacity Fa(t) changes over time. DA-008 introduced the epistemic identifiability constraint — whether the human can distinguish their own frames from internalised external ones. Both questions may be too large for available measurement tools.
A more tractable entry point is critical resilience: the capacity to detect errors, inconsistencies, and unjustified claims in sources that present themselves with high fluency and apparent authority. This is a specific, measurable competence. It has established assessment methodologies (argument evaluation tasks, source credibility tasks, claim verification tasks). And it is the competence most directly at risk from sustained interaction with systems that are fluent, confident, and often correct — but not always.
The GPS analogy, corrected: the question is not whether frequent GPS users can still navigate. Most can. The question is whether they notice when GPS takes them the wrong way. That is the resilience question — not capacity, but error-detection under conditions of fluent authoritative guidance.
LLM outputs are distinctive in this respect. Unlike Google search (which returns sources for human evaluation) or Wikipedia (which is explicitly marked as user-editable), LLM output presents as continuous, confident, first-person reasoning. The fluency is high. The confidence is unmarked — the system does not reliably signal uncertainty in the way a human expert would. And errors, when they occur, are embedded in text that reads exactly like correct text.
The resilience question is: does sustained interaction with this kind of source change how hard a person works to detect its errors?
Human critical resilience is not a fixed trait. It is a practised behaviour — the habit of questioning, the reflex of checking, the investment of cognitive effort in source evaluation. Like any practised behaviour, it can be maintained or reduced depending on the environment that exercises or fails to exercise it.
Prior to LLM interaction, a person reading a confident-sounding claim would apply varying degrees of critical challenge depending on source credibility, personal familiarity with the domain, and cognitive resources available. The challenge reflex was exercised — sometimes well, sometimes poorly, but exercised.
LLM interaction introduces a specific pressure on this reflex. When a system is correct most of the time and presents output in consistently fluent, well-structured prose, the cognitive cost of critical challenge may come to feel disproportionate to its yield. Not because the person has been persuaded to lower their guard, but because the reward signal for critical effort has weakened: most of the time, the challenge produces nothing. The reflex is not overridden — it is gradually deprioritised through low reinforcement.
This is a behavioural conditioning mechanism, not a cognitive capacity mechanism. It does not require the person to become less capable of critical thought. It requires only that they become less likely to deploy it — particularly toward LLM outputs specifically, and potentially, through generalisation, toward fluent authoritative-sounding sources more broadly.
In addition to reinforcement effects, sustained interaction with highly fluent and mostly-correct systems may induce a calibration drift in error expectation. Users adapt their level of critical challenge based on perceived error frequency. When interaction history suggests a low error rate, the expected value of critical effort declines, and challenge behaviour is reduced accordingly.
However, LLM error distributions are not uniform. Errors are domain-dependent, often embedded in otherwise correct reasoning, and difficult to detect without domain-specific knowledge. This creates a structural asymmetry: the perceived reliability of the system increases faster than the user's ability to detect its residual errors. As a result, critical challenge is reduced in exactly those cases where it remains most necessary. The user does not become lazy — they optimise rationally against a wrong model of the system's failure distribution.
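The feedback loop described above can be sketched numerically. The following toy model is illustrative only: every parameter (a 5% true error rate, 60% detection given challenge, the calibration gain, the prior) is an assumption chosen for exposition, not an empirical estimate. The user updates a perceived error rate from the errors they actually catch and scales challenge effort to that estimate; because uncaught errors never enter the estimate, perceived reliability overshoots and challenge effort decays.

```python
# Toy model of calibration drift (all parameters are illustrative assumptions).
TRUE_ERROR_RATE = 0.05        # assumed: 5% of outputs contain an error
DETECT_IF_CHALLENGED = 0.6    # assumed: challenge catches 60% of errors present
GAIN = 18.0                   # assumed: challenge effort scales with perceived risk
PRIOR_ERRORS, PRIOR_OBS = 1.0, 20.0   # prior belief of a ~5% error rate

challenge_prob = 0.9          # initial willingness to challenge
caught = 0.0                  # expected count of errors the user has detected

for n in range(1, 2001):
    # Expected-value update: only challenged-and-detected errors are observed,
    # so undetected errors never correct the user's estimate.
    caught += TRUE_ERROR_RATE * challenge_prob * DETECT_IF_CHALLENGED
    perceived_rate = (PRIOR_ERRORS + caught) / (PRIOR_OBS + n)
    challenge_prob = min(1.0, max(0.05, GAIN * perceived_rate))

print(f"true error rate      {TRUE_ERROR_RATE:.3f}")
print(f"perceived error rate {perceived_rate:.3f}")
print(f"challenge prob       {challenge_prob:.3f} (started at 0.90)")
```

Under these assumptions the perceived error rate settles below the true rate and challenge effort decays with it. The direction of the effect, not the specific numbers, is the point: the estimate is biased precisely because the behaviour it governs is the only source of corrective evidence.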
The framework developed so far assumes critical resilience operates at the sentence or paragraph level — spontaneous detection of errors in real time. This assumption may be too narrow. A naturalistic observation from extended LLM use identifies a distinct pattern: the user deliberately allows the system to lead through an extended sequence, watches where it goes, and evaluates the arc only after it has completed — rather than challenging individual outputs as they arrive. Critical challenge is not absent in this mode. It is deferred and repositioned: the evaluation unit is the trajectory, not the sentence.
In this pattern the user asks not "is this statement correct?" but "did this sequence of reasoning lead somewhere defensible, and do I agree with where it ended?" The challenge happens at the end of the arc, not at each step. This is a deliberate strategy, not a failure of vigilance — it allows the reasoning chain to complete before interruption, making the overall direction visible before committing to any part of it.
This matters for the resilience question in two ways. First, the extended-arc approach may be more robust against fluency effects than immediate-response evaluation: it is less sensitive to local sentence coherence and more sensitive to the ultimate destination of the reasoning. A fluent but misguided argument may be harder to resist sentence by sentence than to reject in retrospect, once the conclusion is visible. Second, and more important for measurement: a study design that measures only spontaneous error-detection at the sentence level will systematically underestimate critical resilience in users who have developed this approach. They will appear not to challenge — because their challenge comes later, at the arc level, not at the output level.
A complete assessment of critical resilience must therefore account for both temporal modes: immediate challenge (sentence level, measurable within a single output) and deferred evaluation (trajectory level, measurable only across a sequence). The detection/latency/effort framework in §05 requires extension to include arc-level evaluation as a third measurement mode distinct from both immediate detection and maximum-effort capacity.
Earlier information systems reduced the exercise of certain cognitive skills — calculators reduced arithmetic practice, GPS reduced spatial navigation practice. But those systems were passive: they did not engage the critical evaluation circuit because they did not produce reasoning to evaluate. They produced answers, not arguments.
LLM systems are different. They produce reasoning — chains of inference, evaluations of evidence, conclusions with justifications. This means they engage the critical evaluation circuit, not bypass it. The question is whether sustained interaction with a source that produces fluent, mostly-correct reasoning trains the circuit toward acceptance rather than challenge.
The social parallel is instructive. Extended interaction with a highly competent, rarely-wrong human expert tends to reduce the frequency with which a less-expert interlocutor challenges them. This is not pathological — it is rational given the track record. The risk arises when the deference generalises beyond the domain of demonstrated competence, or when the expert is occasionally wrong in ways the interlocutor fails to detect because the challenge habit has fallen into disuse.
LLM systems present this risk at scale, across domains, and without the social signals that would normally moderate deference to human experts — visible uncertainty, admission of limits, peer challenge, institutional accountability.
This note does not assume that LLM interaction inevitably reduces critical resilience. Humans have substantial resources that resist this mechanism.
Embodied experience provides a persistent check on abstract claims. A person with twenty years of logistics experience will notice when an LLM's logistics reasoning is wrong, regardless of how fluently the error is expressed. Life experience is not uniformly distributed across domains, but where it is deep, it provides a robust error-detection baseline that LLM fluency cannot easily override.
Social calibration through peer interaction provides ongoing challenge to individually-held beliefs. A person embedded in active professional or intellectual communities receives regular challenge to their reasoning from sources independent of LLM systems. This maintains the challenge reflex even if LLM interaction tends to reduce it.
Motivated scepticism in high-stakes domains — where the cost of being wrong is tangible — also maintains critical challenge independently of interaction habits. A surgeon, a pilot, or an engineer working on safety-critical systems has strong external reinforcement for challenge behaviour that operates independently of what LLM interactions tend to reinforce.
These factors suggest that the resilience reduction risk, if it exists, is not uniform. It is likely highest in domains where the individual lacks deep experiential expertise, where social calibration is limited, and where the stakes of errors are not immediately visible — precisely the conditions under which LLM use is most attractive as a cognitive supplement.
Does sustained LLM interaction reduce the frequency or quality of critical challenge directed at authoritative-sounding sources — and if so, does this effect generalise beyond LLM outputs to other fluent sources, and is it reversible under conditions of deliberate critical practice?
Answering this question requires a design that avoids the confounds identified in DA-008. Standard performance tasks are insufficient — they measure capacity, not the deployment of capacity. A valid design needs to measure how hard participants work to detect errors they are capable of detecting, not whether they can detect them when instructed to try.
The key design requirement is a naturalistic error-detection task: participants are presented with outputs (from LLM and non-LLM sources) that contain embedded errors of varying subtlety, under conditions where critical challenge is not explicitly demanded. The measure is spontaneous error-detection rate, not maximum-effort error-detection rate. This distinguishes resilience (habitual challenge deployment) from capacity (maximum achievable challenge performance).
Three measurement levels are required, not one. Detection (did the participant identify the error?) is the primary measure but is insufficient alone. The effects hypothesised in H3 and H4 are likely to manifest first in latency (how quickly the error was identified — immediate versus delayed recognition) and in effort allocation (did the participant read critically or skim?), before appearing in detection rate. A design that measures only detection will miss the earliest indicators of resilience change. Measuring all three levels is necessary to distinguish a capacity effect (detection drops) from a deployment effect (latency increases and effort decreases, but detection capacity remains available on demand).
A further operationalisation requirement: the design must distinguish spontaneous non-detection from detection-without-report. A participant may notice an error but not mention it in a free-response task because the task context does not reward reporting. Post-task probes, response-time measurement, or eye-tracking can distinguish these cases. Absent this control, spontaneous error-detection rates will systematically underestimate actual awareness, and the study will conflate resilience reduction with reporting suppression.
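A minimal sketch of how the three measurement levels, plus the prompted-capacity control from the preceding paragraph, might be aggregated and compared across sessions. All field names, the dwell-ratio proxy for effort, and the comparison thresholds are hypothetical placeholders for whatever instruments (response-time logging, eye-tracking, post-task probes) a concrete study would use.

```python
from dataclasses import dataclass
from statistics import mean, median

@dataclass
class Trial:
    """One embedded-error trial. All fields are illustrative placeholders."""
    detected: bool        # error reported spontaneously (detection level)
    latency_s: float      # exposure-to-report time; float("inf") if never reported
    dwell_ratio: float    # critical-reading effort proxy (dwell time / baseline)
    probe_detected: bool  # identified under explicit post-task probe (capacity)

def summarise(trials: list[Trial]) -> dict:
    """Aggregate the three measurement levels plus prompted capacity."""
    reported = [t for t in trials if t.detected]
    return {
        "detection": mean(t.detected for t in trials),
        "latency": median(t.latency_s for t in reported) if reported else None,
        "effort": mean(t.dwell_ratio for t in trials),
        "capacity": mean(t.probe_detected for t in trials),
    }

def classify(baseline: dict, followup: dict) -> str:
    """Capacity effect: prompted detection itself drops.
    Deployment effect: spontaneous detection drops while capacity holds.
    The 0.1 margin is an arbitrary illustrative threshold."""
    if followup["capacity"] < baseline["capacity"] - 0.1:
        return "capacity effect"
    if followup["detection"] < baseline["detection"] - 0.1:
        return "deployment effect"
    return "no clear effect"

baseline = summarise([Trial(True, 4.0, 1.3, True),
                      Trial(True, 6.0, 1.2, True),
                      Trial(False, float("inf"), 1.1, True)])
followup = summarise([Trial(True, 12.0, 0.9, True),
                      Trial(False, float("inf"), 0.8, True),
                      Trial(False, float("inf"), 0.7, True)])
# Spontaneous detection drops and effort falls, but probed capacity holds.
print(classify(baseline, followup))  # → deployment effect
```

The separation matters because the two outcomes call for different interpretations: a deployment effect leaves the capacity available on demand, which is exactly the distinction the spontaneous-versus-probed design is meant to isolate.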
Longitudinal comparison between high-LLM-interaction and low-LLM-interaction groups would require a minimum of six months to show conditioning effects, controlling for domain expertise and social calibration. Cross-domain generalisation — whether reduced challenge toward LLM outputs transfers to reduced challenge toward other fluent sources — would require a separate task battery.
This design is feasible with existing research resources. It does not require access to operator interaction logs. It does not require the measurement of I(t) or Fa(t) as defined in SP-006. It requires only a reliable measure of spontaneous error-detection behaviour across conditions of varying source fluency.
This preliminary note does not replace DA-008 or SP-006. Those documents address the broader question of cognitive trajectory under sustained LLM interaction — including framing capacity, identifiability, and the H1–H5 hypothesis space. That broader question remains open and requires longitudinal data at scale that is not currently publicly available.
The resilience question formulated here is a tractable subset of that broader question. It can be investigated with available methods, in a reasonable timeframe, without access to proprietary operator data. If the resilience reduction hypothesis receives empirical support, it provides partial evidence relevant to H3 and H4 in the SP-006 framework. If it does not, it provides evidence against the most immediately testable version of the cognitive trajectory concern.
The relationship between resilience and the SP-006/DA-008 framework is therefore: this note identifies where empirical work can begin. The broader framework identifies where it needs to go.
ACI (2026). DA-008 — Cognitive Integration in Extended Human–LLM Systems. aethercontinuity.org/papers/
ACI (2026). SP-006 — Framing Externalization and Cognitive Trajectories. aethercontinuity.org/papers/
Maguire, E.A. et al. (2000). Navigation-related structural change in the hippocampi of taxi drivers. PNAS 97(8).
Sparrow, B., Liu, J. & Wegner, D.M. (2011). Google Effects on Memory. Science, 333(6043).