Not All Tokens Are Equal: Quality-Gated Transformers for AI Agent Web Search


Not All Tokens Are Equal: Quality-Gated Transformers for AI Agent Web Search

Yang Yu

Dongbi Scientific Data Lab, Beijing 100190, China

yuyang@dongbidata.com


Abstract

AI Agents increasingly rely on web search as their primary information acquisition channel, processing an estimated 100 million queries per month across production systems. However, these agents possess no native capability to assess the quality of retrieved content, creating a critical vulnerability: they treat all search results as equally reliable inputs. This limitation is compounded by the emergence of Generative Engine Optimization (GEO), which manipulates content specifically to achieve high ranking in AI-generated responses, rendering traditional relevance-based retrieval signals unreliable. We present QualityVector, a dual-contribution framework addressing this challenge. First, we propose a theoretically grounded 10-dimensional content quality rating framework that evaluates web documents across orthogonal dimensions including truth likelihood, evidence confidence, source authority, source independence, temporal stability, freshness, objectivity, completeness, and cognitive depth, anchored in epistemological foundations from Popper's Three Worlds theory, the IPCC likelihood-confidence framework, and the DIKW knowledge hierarchy. Second, we introduce two novel neural architecture modifications---Quality-Conditioned Residual Connections (QCRC) and Quality-Gated Attention Mechanisms (QGAM)---that enable transformer-based models to dynamically modulate information flow based on assessed content quality. QCRC applies learned quality-dependent gating to residual pathways, while QGAM amplifies attention weights for high-quality key-value pairs. Together, these contributions establish information quality assessment as foundational infrastructure for AI Agent systems, bridging the gap between epistemological rigor and practical engineering deployment.

Keywords: Content Quality Assessment, AI Agent, Web Search, Retrieval Bias, Hallucination Mitigation, Quality Vector, Quality-Aware Neural Architecture


1. Introduction

1.1 The AI Agent Information Quality Crisis

The rapid proliferation of AI Agents represents a paradigm shift in how computational systems interact with information. Unlike traditional search engines that serve human users capable of exercising critical judgment, AI Agents autonomously retrieve, process, and synthesize web content at unprecedented scale. Current production systems process an estimated 100 million or more web search queries per month, with each query triggering the retrieval and consumption of multiple documents. These agents---ranging from research assistants and customer service bots to autonomous decision-making systems---fundamentally lack the cognitive infrastructure that humans employ to evaluate information quality.

When a human encounters a web page, they implicitly assess dozens of quality signals: the reputation of the publishing source, the recency of the information, the presence of supporting evidence, the objectivity of the tone, and the completeness of the coverage. This assessment occurs rapidly and often subconsciously, drawing on years of accumulated media literacy and domain expertise. AI Agents possess no equivalent capability. They treat retrieved documents as atomic units of information, processing their textual content without any systematic evaluation of whether that content is reliable, current, well-sourced, or even truthful.

This asymmetry creates what we term the Information Quality Crisis for AI Agents: as these systems are deployed in increasingly high-stakes domains---medical diagnosis support, financial analysis, legal research, policy recommendation---the absence of quality assessment mechanisms becomes not merely a limitation but a safety hazard.

1.2 Systematic Absence of Quality Signals in Search APIs

Modern web search APIs, including those provided by Tavily, Exa, Brave Search, Google Custom Search, and Bing Web Search, are designed to optimize for relevance rather than quality. Relevance scoring determines how well a document matches a query's topical intent, while quality assessment evaluates the inherent trustworthiness and reliability of the document's content. These are fundamentally different constructs.

A document may be highly relevant to a query about vaccine efficacy while simultaneously being of extremely low quality---containing fabricated data, citing nonexistent studies, or promoting debunked theories. Search APIs return such documents ranked by relevance, with no quality metadata attached. The ranking itself, often influenced by SEO (Search Engine Optimization) tactics, provides no reliable signal about content quality. Furthermore, search APIs provide minimal provenance information: typically just a URL, a title, and a snippet. This paucity of metadata makes downstream quality assessment challenging but not impossible, as the content itself contains numerous assessable signals.

1.3 The GEO Threat

The emergence of Generative Engine Optimization (GEO) [1] represents a qualitative escalation of the information quality problem. Unlike traditional SEO, which optimizes content for human-perceived relevance signals, GEO specifically targets the retrieval and synthesis mechanisms of AI systems. Content creators can now craft documents designed to be selected by RAG (Retrieval-Augmented Generation) pipelines and cited by AI-generated responses, regardless of the content's factual accuracy or informational value.

GEO exploits the same vulnerability that motivates our work: AI systems' inability to distinguish high-quality information from well-optimized misinformation. As documented by Pradeep et al. [1], GEO techniques can significantly increase the likelihood of a document being surfaced in AI-generated responses, creating an invisible layer of manipulation that operates below the threshold of both human detection and algorithmic flagging. This threat is particularly insidious because it targets the information supply chain at its point of entry into AI systems, making downstream fact-checking and verification substantially more difficult.

1.4 Our Contributions

We make the following contributions:

  1. Theoretical Framework. We present QualityVector, a 10-dimensional content quality rating framework grounded in epistemological theory (Popper's Three Worlds, IPCC likelihood-confidence scales, DIKW hierarchy) that provides a comprehensive, orthogonal decomposition of web content quality for AI Agent consumption.

  2. System Design. We describe a middleware architecture that integrates quality assessment into existing AI Agent search pipelines, including transparent proxy design, LLM prompt injection templates, RAG reranking strategies, and MCP (Model Context Protocol) integration.

  3. Quality-Conditioned Residual Connections (QCRC). We propose a novel residual connection formulation that modulates information flow through transformer blocks based on learned quality signals, enabling models to dynamically attenuate the influence of low-quality inputs.

  4. Quality-Gated Attention Mechanism (QGAM). We introduce a quality-dependent attention gating mechanism that amplifies attention weights for high-quality key-value pairs, providing a computationally efficient quality-aware modification to the standard multi-head attention architecture.


2. Related Work

2.1 Information Quality Assessment

The systematic study of information quality has a rich history spanning information science, philosophy, and computer science. Wang and Strong [2] established a foundational framework identifying 15 dimensions of data quality, categorizing them into intrinsic (accuracy, objectivity, believability, reputation), contextual (relevancy, value-added, timeliness, completeness, appropriate amount), representational (interpretability, ease of understanding, concise representation, consistent representation), and accessibility (accessibility, access security) categories. While influential, this framework was designed for organizational data management rather than web-scale content evaluation.

Ackoff's [3] DIKW (Data-Information-Knowledge-Wisdom) hierarchy provides a philosophical foundation for understanding information quality in terms of cognitive value, distinguishing between raw data, processed information, applied knowledge, and evaluative wisdom. This hierarchy directly informs our Cognitive Depth dimension.

Fogg [4] proposed a prominent model of web credibility, identifying expertise and trustworthiness as the two primary axes along which users evaluate online sources. Metzger [5] extended this work by examining how digital media consumers assess credibility, finding that users frequently rely on surface features rather than substantive evaluation criteria---a finding that underscores the need for systematic, automated quality assessment.

In the context of scientific information, the IPCC [6] developed a dual-axis framework combining likelihood statements (virtually certain to exceptionally unlikely) with confidence levels (very low to very high), providing a calibrated language for expressing uncertainty. We adapt this framework for general web content evaluation.

2.2 Retrieval-Augmented Generation and Bias

Retrieval-Augmented Generation (RAG), introduced by Lewis et al. [7], augments language model generation with external knowledge retrieved from a corpus. While RAG has demonstrated significant improvements in factual accuracy compared to parametric-only models, it introduces new failure modes related to retrieval quality.

Asai et al. [8] proposed Self-RAG, which incorporates reflection tokens to enable the model to critique its own retrieval decisions. However, Self-RAG focuses on retrieval relevance rather than content quality, leaving the quality assessment gap unaddressed.

Research has documented systematic biases in retrieval systems that compound the quality problem. Popularity bias [9] causes retrieval systems to favor frequently cited sources regardless of their accuracy. Position bias [10] leads models to attend disproportionately to top-ranked results. Temporal bias [11] creates a recency preference that may disadvantage authoritative but older sources. These biases interact with the absence of quality signals to create a systematic vulnerability in RAG pipelines.

Karpukhin et al. [12] introduced Dense Passage Retrieval (DPR), which learns dense vector representations for query-document matching. Izacard and Grave [13] further advanced retrieval quality with contrastive learning approaches. Mallen et al. [14] and Gao et al. [15] provided comprehensive surveys of the RAG landscape, identifying quality-aware retrieval as a key open challenge.

2.3 LLM Hallucination

Hallucination in large language models---the generation of content that is fluent but factually incorrect or unsupported---has emerged as a critical concern. Ji et al. [16] provided a comprehensive survey categorizing hallucinations into intrinsic (contradicting the source) and extrinsic (unsupported by any source) types, and further distinguishing factual from faithful hallucinations.

Shuster et al. [17] demonstrated that retrieval augmentation can reduce but not eliminate hallucinations, particularly when the retrieved content itself contains inaccuracies. This finding highlights the cascading nature of the quality problem: hallucination mitigation through retrieval is only as effective as the quality of the retrieved content.

Fact-checking datasets and methods provide relevant infrastructure. Thorne et al. [18] introduced FEVER (Fact Extraction and VERification), a large-scale benchmark for fact verification. Guo et al. [19] proposed evidence-based fact verification methods. However, these approaches operate post-generation rather than pre-processing the information supply chain, which is the focus of our work.

2.4 Generative Engine Optimization

Pradeep et al. [1] introduced the concept of Generative Engine Optimization, demonstrating that content can be systematically modified to increase its likelihood of being surfaced in AI-generated responses. Their work revealed that simple techniques---such as adding citations, restructuring content for extractive compatibility, and optimizing for semantic similarity with common queries---can significantly boost a document's visibility in AI-generated outputs.

This phenomenon is related to but distinct from traditional SEO. While SEO targets human-perceived relevance signals (keywords, backlinks, page structure), GEO directly targets the retrieval and generation mechanisms of AI systems. The E-E-A-T framework (Experience, Expertise, Authoritativeness, Trustworthiness), promoted by major search engines as a quality signal, provides some defense but is designed for human-facing search rather than AI Agent consumption.

The adversarial nature of GEO creates an arms race dynamic: as AI systems become more sophisticated in their retrieval and synthesis, content manipulation techniques evolve correspondingly. Our Source Independence dimension is specifically designed to detect and mitigate GEO-style manipulation.

2.5 Conditional Neural Architectures

The idea of conditioning neural network computation on input properties has a rich history. Dauphin et al. [20] explored conditional computation in language models, demonstrating that dynamically routing computation based on input complexity can improve efficiency. Graves [21] proposed Adaptive Computation Time, allowing recurrent networks to allocate varying amounts of computation to different inputs.

Vaswani et al. [22] introduced the Transformer architecture with its core self-attention mechanism and residual connections. He et al. [23] demonstrated the power of residual connections in deep networks, while Ba et al. [24] introduced Layer Normalization, a key component of the Pre-Norm and Post-Norm residual structures that our QCRC modification builds upon.

Tishby et al. [25] proposed the Information Bottleneck principle, which provides a theoretical framework for understanding how neural networks compress and filter information. We draw on this perspective in our theoretical analysis of quality-gated attention.

Our work is distinguished from prior conditional computation approaches in that we condition specifically on external quality assessments rather than learned input representations, providing interpretable and controllable modulation of information flow.


3. The QualityVector Framework

3.1 Problem Formulation

Given a query $q$ submitted by an AI Agent and a set of retrieved documents $D = \{d_1, d_2, \ldots, d_n\}$ returned by a web search API, the QualityVector framework produces a quality assessment vector for each document:

$$Q(d_i) = (p_{\text{type}}, q_t, q_e, q_a, q_i, q_s, q_f, q_o, q_c, q_d) \in \Delta^4 \times [0, 1]^9$$

where each component represents a distinct quality dimension. The first component $p_{\text{type}}$ is a probability distribution over information types, while the remaining nine components are scalar values in $[0, 1]$.

The framework operates as a two-stage pipeline: (1) an LLM-based assessment stage that analyzes document content against each quality dimension, producing structured output; and (2) a calibration stage that normalizes scores and applies domain-specific weighting to produce a composite quality rating.

3.2 Ten Orthogonal Quality Dimensions

We propose ten orthogonal quality dimensions, each grounded in established theoretical frameworks. Table 1 summarizes the dimensions, their notation, value ranges, and theoretical foundations.

Table 1: The Ten Quality Dimensions of the QualityVector Framework

| # | Dimension | Notation | Range | Theoretical Basis |
|---|-----------|----------|-------|-------------------|
| 1 | Information Type | $p_{\text{type}}$ | $\Delta^4$ | Popper's Three Worlds [26] |
| 2 | Truth Likelihood | $q_t$ | $[0, 1]$ | IPCC Likelihood Scale [6] |
| 3 | Evidence Confidence | $q_e$ | $[0, 1]$ | IPCC Confidence + Epistemic Justification |
| 4 | Source Authority | $q_a$ | $[0, 1]$ | Domain Authority + Editorial Process |
| 5 | Source Independence | $q_i$ | $[0, 1]$ | Cross-Validation Integrity |
| 6 | Temporal Stability | $q_s$ | $[0, 1]$ | Domain-Specific Knowledge Half-Life |
| 7 | Freshness | $q_f$ | $[0, 1]$ | Exponential Decay Model |
| 8 | Objectivity | $q_o$ | $[0, 1]$ | Bias Detection + Sentiment Analysis |
| 9 | Completeness | $q_c$ | $[0, 1]$ | Entity Coverage + 5W1H Analysis |
| 10 | Cognitive Depth | $q_d$ | $[0, 1]$ | DIKW Pyramid [3] + Bloom's Taxonomy [27] |

We now describe each dimension in detail.

Dimension 1: Information Type ($p_{\text{type}}$). Documents convey different types of knowledge, each with distinct epistemological status and reliability characteristics. Drawing on Popper's Three Worlds theory [26], which distinguishes between World 1 (physical objects), World 2 (subjective experience), and World 3 (objective knowledge), we classify content into five types:

$$p_{\text{type}} = (p_{\text{factual}}, p_{\text{analytical}}, p_{\text{theoretical}}, p_{\text{experiential}}, p_{\text{opinion}})$$

where $\sum_k p_k = 1$. Factual content reports observable phenomena (World 1), analytical content applies reasoning to data, theoretical content presents abstract frameworks (World 3), experiential content captures subjective observations (World 2), and opinion content expresses personal judgments. This classification enables downstream systems to apply type-appropriate evaluation criteria.

Dimension 2: Truth Likelihood ($q_t$). We adapt the IPCC's calibrated likelihood scale [6] to general web content assessment. Rather than binary true/false judgments, we assign a continuous likelihood score reflecting the probability that a claim is factually correct, given available evidence:

$$q_t \in [0, 1], \quad \text{where } q_t \geq 0.99 \text{ (Virtually Certain)},\; q_t \geq 0.90 \text{ (Very Likely)},\; \ldots,\; q_t < 0.05 \text{ (Exceptionally Unlikely)}$$

This dimension supports Bayesian posterior updating: as additional evidence is accumulated across multiple sources, the truth likelihood is revised according to Bayes' theorem.

Dimension 3: Evidence Confidence ($q_e$). Distinct from truth likelihood, evidence confidence measures the strength and quality of the evidentiary basis supporting a document's claims. Following the IPCC's confidence framework [6], which combines evidence type (robust, medium, limited) with agreement level (high, medium, low), we define:

$$q_e = f(\text{evidence type}, \text{agreement level}, \text{source count}, \text{methodological rigor})$$

A document may present a claim with high truth likelihood but low evidence confidence (e.g., a widely believed but poorly studied claim), or vice versa (a rigorously studied but inconclusive finding).

Dimension 4: Source Authority ($q_a$). Source authority evaluates the institutional and individual credibility of the content creator. This dimension integrates domain authority metrics (e.g., academic citation counts, institutional affiliations, editorial oversight processes) with signals of expertise:

$$q_a = g(\text{domain authority}, \text{editorial process}, \text{author credentials}, \text{publication venue})$$

We distinguish between authority (recognized expertise in a domain) and trustworthiness (track record of accuracy), following Fogg's [4] dual-axis credibility model.

Dimension 5: Source Independence ($q_i$). This dimension is specifically designed to counter GEO manipulation and information cascades. It measures the degree to which a document's claims are independently verified versus being derived from or influenced by other sources:

$$q_i = h(\text{cross-validation}, \text{original reporting}, \text{information cascade detection}, \text{plagiarism indicators})$$

Documents that primarily aggregate or rephrase content from a single source receive low independence scores, while those presenting independently gathered evidence or original analysis receive high scores. This dimension is critical for detecting coordinated manipulation campaigns and reducing the impact of information cascades in retrieval results.

Dimension 6: Temporal Stability ($q_s$). Different types of information have different expected lifespans. Mathematical truths are timeless; stock prices expire within milliseconds. We define temporal stability as the expected half-life of information accuracy within a given domain:

$$q_s \in [0, 1], \quad \text{where higher values indicate longer expected accuracy}$$

Domain-specific half-lives are calibrated empirically: scientific facts ($q_s \approx 0.95$), historical accounts ($q_s \approx 0.90$), technical documentation ($q_s \approx 0.70$), news reports ($q_s \approx 0.30$), real-time data ($q_s \approx 0.05$).

Dimension 7: Freshness ($q_f$). Freshness captures the recency of information relative to its expected validity period. We model freshness as an exponential decay function:

$$F(t) = \exp(-\lambda t), \quad \lambda = \frac{\ln 2}{\text{half-life}}$$

where $t$ is the age of the document and $\lambda$ is the domain-specific decay rate. A document about a breaking news event published 1 hour ago may have high freshness ($q_f \approx 0.99$), while a medical guideline from 15 years ago may have low freshness ($q_f \approx 0.01$).
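As a minimal sketch, the decay model translates directly into code; the function name `freshness` and the day-based units are illustrative choices, not part of the framework specification:

```python
import math

def freshness(age_days: float, half_life_days: float) -> float:
    """F(t) = exp(-lambda * t), with lambda = ln(2) / half-life."""
    decay_rate = math.log(2) / half_life_days
    return math.exp(-decay_rate * age_days)
```

By construction, the score halves every half-life: a document exactly one half-life old scores 0.5.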

Dimension 8: Objectivity ($q_o$). Objectivity measures the degree to which content presents information without bias, emotional manipulation, or selective framing. This dimension integrates bias detection and sentiment analysis:

$$q_o = 1 - \text{bias\_score}(d_i)$$

where bias_score incorporates ideological framing, emotional language, selective evidence presentation, and conflict of interest indicators. We distinguish between acknowledged bias (explicitly stated perspectives, which may be appropriate in opinion content) and hidden bias (undisclosed influence that undermines informational integrity).

Dimension 9: Completeness ($q_c$). Completeness evaluates whether a document provides adequate coverage of its topic. We operationalize this through entity coverage analysis and 5W1H (Who, What, When, Where, Why, How) completeness:

$$q_c = \frac{|\text{covered aspects}|}{|\text{expected aspects}|}$$

Expected aspects are determined by the information type and domain. A news article about a scientific discovery should address the researchers (Who), the finding (What), the publication date (When), the institution (Where), the significance (Why), and the methodology (How).
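The coverage ratio can be sketched as a set intersection over expected aspects; the 5W1H default shown here is illustrative, since expected aspects vary by information type and domain:

```python
# Default expected aspects for news-style content (illustrative).
FIVE_W1H = {"who", "what", "when", "where", "why", "how"}

def completeness(covered_aspects, expected_aspects=FIVE_W1H):
    """q_c = |covered ∩ expected| / |expected|."""
    covered = set(covered_aspects) & set(expected_aspects)
    return len(covered) / len(expected_aspects)
```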

Dimension 10: Cognitive Depth ($q_d$). Drawing on the DIKW hierarchy [3] and Bloom's taxonomy [27], cognitive depth measures the level of cognitive processing reflected in the content:

$$q_d \in [0, 1], \quad \text{Data}\,(0\text{--}0.2),\; \text{Information}\,(0.2\text{--}0.4),\; \text{Knowledge}\,(0.4\text{--}0.6),\; \text{Analysis}\,(0.6\text{--}0.8),\; \text{Wisdom}\,(0.8\text{--}1.0)$$

Content that merely presents raw data (e.g., statistics without context) receives low depth scores, while content that synthesizes information into actionable insights receives high depth scores.

3.3 Information Type Classification

The Information Type dimension employs a probabilistic classification based on Popper's Three Worlds ontology [26]. For a document $d_i$, we compute:

$$P(\text{type} \mid \text{content}) = \text{softmax}(f_{\text{type}}(\text{content}(d_i)))$$

where $f_{\text{type}}$ is implemented via LLM-based classification with structured prompting. The five types are mapped to Popper's worlds as follows:

  • Factual (World 1): Empirical observations, measurements, events
  • Analytical (World 3): Reasoning applied to data, statistical analysis
  • Theoretical (World 3): Abstract frameworks, models, hypotheses
  • Experiential (World 2): Personal accounts, case studies, anecdotes
  • Opinion (World 2): Evaluative judgments, recommendations, positions

This classification serves a dual purpose: it provides a quality signal in its own right (enabling type-appropriate downstream processing) and it modulates the interpretation of other quality dimensions. For example, a high objectivity score is more meaningful for factual content than for opinion content, where some degree of subjectivity is expected.
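A minimal sketch of the classification step follows; the per-type raw scores are assumed to come from the LLM-based assessor $f_{\text{type}}$, and the function names are ours:

```python
import math

# The five Popper-derived information types, in a fixed order.
TYPES = ["factual", "analytical", "theoretical", "experiential", "opinion"]

def softmax(scores):
    """Numerically stable softmax mapping raw scores onto the simplex."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_type(raw_scores):
    """Return p_type as a dict over the five information types."""
    return dict(zip(TYPES, softmax(raw_scores)))
```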

3.4 Truth Value Grading

Truth value grading employs the IPCC's dual-axis framework [6], combining likelihood and confidence into a unified assessment. For each substantive claim in a document, we assign:

$$\text{TruthGrade} = (q_t, q_e) \in [0, 1]^2$$

Document-level truth likelihood is computed as the weighted average of claim-level assessments, with weights proportional to claim centrality and specificity. Bayesian posterior updating is applied when multiple sources address the same claim:

$$P(\text{claim} \mid \text{evidence}_1, \ldots, \text{evidence}_n) \propto P(\text{evidence}_n \mid \text{claim}) \cdot P(\text{claim} \mid \text{evidence}_1, \ldots, \text{evidence}_{n-1})$$

This iterative updating process enables the framework to synthesize truth assessments across multiple retrieved documents, producing more robust quality estimates than single-document evaluation.
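The updating rule can be sketched as a sequential posterior computation for a binary claim; the per-source likelihoods $P(\text{evidence} \mid \text{claim})$ are assumed inputs (in practice estimated by the assessment engine), and the function names are illustrative:

```python
def bayes_update(prior: float, p_e_given_true: float, p_e_given_false: float) -> float:
    """One Bayes step: P(claim | e) = P(e | claim) P(claim) / P(e)."""
    numerator = p_e_given_true * prior
    evidence_prob = numerator + p_e_given_false * (1.0 - prior)
    return numerator / evidence_prob

def truth_likelihood(prior: float, evidence: list) -> float:
    """Fold in (P(e | true), P(e | false)) pairs from successive sources."""
    q_t = prior
    for p_true, p_false in evidence:
        q_t = bayes_update(q_t, p_true, p_false)
    return q_t
```

Two independent sources that each make the claim nine times likelier under truth than falsehood push a neutral prior of 0.5 past 0.98.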

3.5 Temporal Dynamics

The temporal quality of a document is a function of both its freshness and the inherent temporal stability of its content domain. We define a combined temporal score:

$$T = q_f \cdot (1 - q_s) + q_s$$

This formulation ensures that: (1) for highly stable domains ($q_s \approx 1$), temporal quality is dominated by stability rather than freshness---an old but authoritative mathematical proof retains high temporal quality; (2) for highly volatile domains ($q_s \approx 0$), temporal quality is dominated by freshness---a recent news report is preferred over an older one; and (3) for intermediate domains, both factors contribute proportionally.

The combined temporal score $T$ replaces individual $q_f$ and $q_s$ values in the composite rating to avoid double-counting temporal factors.
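The blend transcribes directly into code; the function name is ours:

```python
def combined_temporal(q_f: float, q_s: float) -> float:
    """T = q_f * (1 - q_s) + q_s: stability floors the score, freshness fills the rest."""
    return q_f * (1.0 - q_s) + q_s
```

An old proof in a stable domain ($q_s = 0.95$, $q_f = 0.1$) keeps $T = 0.955$, while an equally stale report in a volatile domain ($q_s = 0.05$, $q_f = 0.1$) drops to $T = 0.145$.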

3.6 Composite Rating

The final composite quality rating is computed using Multi-Attribute Utility Theory (MAUT) [28]:

$$R = \sum_{k} w_k \cdot q_k$$

where the index $k$ ranges over the scalar dimension scores (with $q_f$ and $q_s$ merged into the combined temporal score $T$, per Section 3.5) and the $w_k$ are domain-adaptive weights satisfying $\sum_k w_k = 1$ and $w_k \geq 0$. The composite score $R \in [0, 1]$ is mapped to a five-level quality scale:

  • Q-AAA ($R \geq 0.85$): Exemplary quality, suitable for high-stakes decision-making
  • Q-AA ($0.70 \leq R < 0.85$): High quality, reliable for general use
  • Q-A ($0.55 \leq R < 0.70$): Moderate quality, appropriate with caveats
  • Q-B ($0.40 \leq R < 0.55$): Low quality, requires significant verification
  • Q-C ($R < 0.40$): Very low quality, should be excluded from agent reasoning

This rating scale provides a human-interpretable summary of document quality while preserving the granularity of individual dimension scores for downstream processing.
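A minimal sketch of the MAUT aggregation and the grade mapping above; the dictionary-based interface and function names are illustrative:

```python
def composite_rating(scores: dict, weights: dict) -> float:
    """R = sum_k w_k * q_k over the scalar dimension scores."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * scores[k] for k, w in weights.items())

def quality_grade(r: float) -> str:
    """Map R in [0, 1] onto the five-level Q scale."""
    if r >= 0.85:
        return "Q-AAA"
    if r >= 0.70:
        return "Q-AA"
    if r >= 0.55:
        return "Q-A"
    if r >= 0.40:
        return "Q-B"
    return "Q-C"
```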


4. System Architecture and LLM Integration

4.1 Middleware Architecture

The QualityVector system is designed as a transparent middleware layer that intercepts web search API calls between AI Agents and search providers (Tavily, Exa, Brave Search, etc.). The architecture operates in three phases:

  1. Interception. The middleware proxies search API requests, capturing both the query and the raw retrieval results.

  2. Assessment. For each retrieved document, the quality assessment engine generates the full QualityVector $Q(d_i)$. This process employs an LLM-based assessor with structured output parsing, producing dimension scores with associated confidence intervals.

  3. Augmentation. The middleware augments each search result with its QualityVector before returning the enriched results to the calling Agent. The original API response schema is preserved, with quality metadata added as supplementary fields.

This design ensures zero-friction integration: existing AI Agents require no modification to benefit from quality assessment, as the middleware operates transparently at the API boundary.
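The augmentation phase, in particular, reduces to attaching supplementary fields while leaving the provider's schema untouched. A minimal sketch, with a stand-in `assess` callable in place of the LLM-based assessment engine:

```python
def augment_results(results: list, assess) -> list:
    """Phase 3: copy each search result and attach its QualityVector
    as a supplementary field, preserving every original field."""
    enriched = []
    for doc in results:
        item = dict(doc)                      # shallow copy keeps the API schema intact
        item["quality_vector"] = assess(doc)  # supplementary quality metadata
        enriched.append(item)
    return enriched
```

Because the original keys survive unchanged, an Agent unaware of the middleware parses the response exactly as before.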

4.2 LLM Integration

Quality assessment results are integrated into LLM reasoning through multiple channels:

System Prompt Injection. Quality ratings are injected into the system prompt as structured metadata accompanying each retrieved document:

[Document 1] (Quality: Q-AA, Truth: 0.92, Authority: 0.88, Independence: 0.75, Freshness: 0.95)
Title: ...
Content: ...

This enables the LLM to factor quality information into its reasoning without requiring architectural modifications to the model.

RAG Reranking. Quality scores are combined with relevance scores for improved reranking:

$$\text{combined\_score} = \alpha \cdot \text{quality}(d_i) + (1 - \alpha) \cdot \text{relevance}(d_i, q)$$

where $\alpha \in [0, 1]$ is a configurable parameter. Default $\alpha = 0.3$ provides a balanced emphasis on quality while preserving relevance as the primary ranking signal. For high-stakes domains (medical, legal), $\alpha$ can be increased to prioritize quality over relevance.
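The reranking rule is a one-line convex combination; the document fields and function name here are illustrative:

```python
def rerank(docs: list, alpha: float = 0.3) -> list:
    """Sort by alpha * quality + (1 - alpha) * relevance, descending."""
    def combined_score(d):
        return alpha * d["quality"] + (1.0 - alpha) * d["relevance"]
    return sorted(docs, key=combined_score, reverse=True)
```

With the default $\alpha = 0.3$, a well-sourced but slightly less relevant document can overtake a highly relevant low-quality one, which is the intended correction.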

MCP Protocol. The system exposes quality assessment as a Model Context Protocol (MCP) tool, enabling Agents to explicitly invoke quality evaluation on-demand:

{
  "tool": "assess_quality",
  "arguments": {
    "url": "https://example.com/article",
    "dimensions": ["all"]
  }
}

4.3 Domain-Adaptive Weight Configuration

Different domains prioritize different quality dimensions. We provide preset weight configurations for common domains, as shown in Table 2.

Table 2: Domain-Adaptive Weight Configurations

| Dimension | Academic | News | Business | Technology | Legal |
|-----------|----------|------|----------|------------|-------|
| Truth Likelihood ($q_t$) | 0.20 | 0.15 | 0.10 | 0.12 | 0.18 |
| Evidence Confidence ($q_e$) | 0.20 | 0.12 | 0.10 | 0.10 | 0.15 |
| Source Authority ($q_a$) | 0.15 | 0.15 | 0.12 | 0.08 | 0.20 |
| Source Independence ($q_i$) | 0.10 | 0.15 | 0.15 | 0.12 | 0.12 |
| Temporal ($T$) | 0.05 | 0.20 | 0.15 | 0.15 | 0.05 |
| Objectivity ($q_o$) | 0.10 | 0.10 | 0.15 | 0.10 | 0.10 |
| Completeness ($q_c$) | 0.10 | 0.08 | 0.12 | 0.15 | 0.15 |
| Cognitive Depth ($q_d$) | 0.10 | 0.05 | 0.11 | 0.18 | 0.05 |

The Academic configuration prioritizes truth likelihood and evidence confidence, reflecting the epistemic standards of scholarly discourse. The News configuration emphasizes temporal quality and source independence, reflecting the need for current, independently verified reporting. The Legal configuration heavily weights source authority, reflecting the primacy of authoritative legal sources. These presets serve as starting points; domain-specific fine-tuning is supported through the configuration interface.


5. Discussion: The Agent Engineering Era

5.1 Industrial Context

The emergence of platforms such as OpenClaw signals the transformation of AI Agents from laboratory prototypes to production systems operating at industrial scale. This transition introduces engineering challenges that differ fundamentally from those addressed in academic settings. In laboratory environments, the quality of training data and retrieval corpora can be carefully controlled. In production, AI Agents must navigate the uncontrolled, adversarial, and often degraded information landscape of the public web.

We argue that information quality assessment is not merely a desirable feature but essential safety infrastructure for AI Agent systems. Just as structural integrity is a non-negotiable requirement for physical infrastructure, information integrity must become a non-negotiable requirement for cognitive infrastructure. The QualityVector framework provides the foundational layer for this information integrity stack.

5.2 Hallucination Mitigation

QualityVector contributes to hallucination mitigation through two mechanisms. First, by filtering or down-weighting low-quality retrieved content, it reduces the probability that inaccurate information enters the generation pipeline. Second, by providing explicit quality signals to the LLM, it enables more calibrated generation: the model can express appropriate uncertainty when its inputs are of low quality, and can cite high-quality sources with greater confidence.

This approach is complementary to existing hallucination mitigation techniques. Self-consistency methods [29] reduce hallucination by sampling multiple generations and selecting the most consistent output. Chain-of-verification [30] reduces hallucination by having the model generate and answer verification questions. QualityVector operates upstream of these methods, improving the quality of inputs on which they operate, thereby amplifying their effectiveness.

5.3 GEO Countermeasures

The Source Independence dimension ($q_i$) is specifically designed to counter Generative Engine Optimization. GEO-manipulated content typically exhibits low source independence: it is often derived from or optimized to match existing high-ranking content, creating information cascades that amplify manipulated narratives.

By explicitly scoring source independence, QualityVector enables detection of coordinated manipulation campaigns. Documents that share substantial content with other retrieved results but lack independent verification receive low independence scores, flagging them for reduced influence in downstream reasoning. This approach is fundamentally different from platform-internal ranking signals, which may themselves be subject to manipulation, as it relies on cross-document analysis and provenance tracking rather than platform-specific authority metrics.


6. Quality-Aware Residual Connections and Quality-Gated Attention

6.1 Motivation

Standard transformer architectures [22] process all inputs through identical computational pathways, regardless of input quality. In the residual connection formulation:

$$y = x + \text{Sublayer}(x)$$

the sublayer output is unconditionally added to the input, meaning that corrupted or low-quality inputs receive the same computational treatment as high-quality inputs. This design choice, while enabling effective gradient flow in deep networks, creates a vulnerability when transformers are applied to quality-heterogeneous inputs such as web search results.

We propose two complementary modifications to the transformer architecture that enable quality-aware information processing: Quality-Conditioned Residual Connections (QCRC) and Quality-Gated Attention Mechanisms (QGAM). These modifications are designed to be minimally invasive, maintaining compatibility with existing architectures while providing explicit quality-dependent modulation.

6.2 Quality-Conditioned Residual Connection (QCRC)

The QCRC replaces the standard residual connection with a quality-gated variant:

$$y = x + \sigma(Q W_q) \odot \text{Sublayer}(x)$$

where:

  • $Q \in \mathbb{R}^k$ is the quality vector associated with the input token or segment, with $k$ being the number of quality dimensions (in our framework, $k = 10$);
  • $W_q \in \mathbb{R}^{k \times d}$ is a learnable projection matrix that maps the quality vector to the model's hidden dimension $d$;
  • $\sigma$ is the sigmoid activation function, ensuring gate values in $(0, 1)$;
  • $\odot$ denotes element-wise (Hadamard) multiplication.

Behavioral Analysis. The QCRC gate $\sigma(Q W_q)$ modulates the influence of the sublayer output based on input quality:

  • When quality is high ($Q$ contains large values), the sigmoid output approaches 1, and the connection reduces to the standard residual: $y \approx x + \text{Sublayer}(x)$.
  • When quality is low ($Q$ contains small values), the sigmoid output approaches 0, and the connection suppresses the sublayer: $y \approx x$.

This behavior is desirable for RAG-enhanced models: retrieved passages of high quality should be fully processed and integrated, while passages of low quality should have minimal influence on the model's representations.
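This limiting behavior can be demonstrated with a minimal numpy sketch. The dimensions are toy values, the sublayer is a stand-in, and the all-ones projection is chosen only so the gate's sign tracks the quality signal; a trained $W_q$ would be learned, not fixed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def qcrc(x, q, W_q, sublayer):
    """y = x + sigmoid(q @ W_q) * sublayer(x): quality-conditioned residual."""
    gate = sigmoid(q @ W_q)  # maps the k-dim quality vector to a d-dim gate
    return x + gate * sublayer(x)

d, k = 8, 10
rng = np.random.default_rng(0)
x = rng.normal(size=d)
W_q = np.ones((k, d))        # illustrative projection, not trained weights
sublayer = np.tanh           # stand-in for an attention or FFN sublayer

y_hi = qcrc(x, np.full(k, 5.0), W_q, sublayer)   # gate -> 1: full residual update
y_lo = qcrc(x, np.full(k, -5.0), W_q, sublayer)  # gate -> 0: y ~ x (identity)
```

With the gate saturated high, `y_hi` matches the standard residual `x + sublayer(x)`; saturated low, `y_lo` collapses to the identity path, which is exactly the suppression behavior described above.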

Gradient Flow. The gradient of the loss $L$ with respect to the input $x$ through QCRC is:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \left( I + \sigma(Q W_q) \odot \frac{\partial \text{Sublayer}(x)}{\partial x} \right)$$

The quality gate $\sigma(Q W_q)$ modulates the gradient flow through the sublayer while preserving the identity gradient path through $I$. This ensures that even when quality is low, gradients can still propagate through the skip connection, maintaining training stability.

Compatibility with Pre-Norm and Post-Norm. QCRC is compatible with both Pre-Norm [31] and Post-Norm [22] residual structures:

  • Pre-Norm QCRC: $y = x + \sigma(Q W_q) \odot \text{Sublayer}(\text{LayerNorm}(x))$
  • Post-Norm QCRC: $y = \text{LayerNorm}(x + \sigma(Q W_q) \odot \text{Sublayer}(x))$

In both cases, the quality gate is applied to the sublayer output before the residual addition, preserving the normalization benefits of the original architectures.

6.3 Quality-Gated Attention (QGAM)

The QGAM modifies the standard scaled dot-product attention to incorporate quality-dependent gating of key-value pairs:

$$A'_{ij} = \frac{Q_i K_j^T}{\sqrt{d_k}} \cdot g(q_j)$$

where:

  • $Q_i \in \mathbb{R}^{d_k}$ is the query vector for position $i$;
  • $K_j \in \mathbb{R}^{d_k}$ is the key vector for position $j$;
  • $g(q_j)$ is the quality gate applied to the key-value pair at position $j$.

The quality gate is defined as:

$$g(q_j) = \sigma(W_g^T q_j + b_g)$$

where $q_j \in \mathbb{R}^k$ is the quality vector for position $j$, $W_g \in \mathbb{R}^{k}$ is a learnable projection, and $b_g \in \mathbb{R}$ is a learnable bias, so that $g(q_j)$ is a scalar gate per key position. The gate is applied per-head in multi-head attention, with separate parameters $W_g^h, b_g^h$ for each head.

After computing the gated attention scores, the standard softmax normalization is applied:

$$\alpha_{ij} = \frac{\exp(A'_{ij})}{\sum_{l=1}^{n} \exp(A'_{il})}$$

Effect on Attention Distribution. The quality gate $g(q_j)$ multiplicatively scales the pre-softmax attention logits for each key-value pair. High-quality positions ($g(q_j) \approx 1$) retain their original attention weights, while low-quality positions ($g(q_j) \approx 0$) are suppressed. This is analogous to applying a soft mask based on quality, but with learned, input-adaptive gate values rather than hard thresholds.
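A toy single-head numpy sketch makes the gating concrete. The random data and hand-set gate values are illustrative; in the full architecture the gates would be produced by the learned $g(q_j)$ rather than set by hand.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def qgam_head(Qm, Km, Vm, gates):
    """Scale each key j's pre-softmax logits by its quality gate g(q_j)."""
    d_k = Qm.shape[-1]
    logits = (Qm @ Km.T) / np.sqrt(d_k)  # (n, n) ungated attention scores
    weights = softmax(logits * gates)     # gates (n,) broadcast over the key axis
    return weights @ Vm, weights

rng = np.random.default_rng(1)
n, d_k = 4, 8
Qm, Km, Vm = rng.normal(size=(3, n, d_k))
gates = np.array([1.0, 1.0, 1.0, 0.0])    # last key rated zero quality
out, w = qgam_head(Qm, Km, Vm, gates)
```

Note that a gate of zero drives the pre-softmax logit to the softmax's neutral point rather than to $-\infty$, so a zero-quality key loses any advantage its ungated logit would have conferred instead of being hard-masked.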

Computational Overhead. The additional computational cost of QGAM is $O(n \cdot k)$ for the quality gate computation, where $n$ is the sequence length and $k$ is the quality vector dimension. This is negligible compared to the $O(n^2 \cdot d_k)$ cost of computing the attention scores themselves, representing less than 1% overhead for typical configurations ($k = 10$, $n = 512$, $d_k = 64$).

6.4 Unified Quality-Aware Transformer Block

Algorithm 1 presents the complete Quality-Aware Transformer Block integrating both QCRC and QGAM.

Algorithm 1: Quality-Aware Transformer Block

Input: x in R^{n x d}, Q in R^{n x k}  (input and quality vectors)
Parameters: W_q, W_g, b_g, and standard transformer parameters

// Quality-Gated Multi-Head Self-Attention (Pre-Norm)
x_norm = LayerNorm(x)
for each head h = 1, ..., H do:
    Q_h = x_norm * W_Q^h      // Query projection
    K_h = x_norm * W_K^h      // Key projection
    V_h = x_norm * W_V^h      // Value projection
    g_h = sigma(Q * W_g^h + b_g^h)         // Per-head quality gate (Q: quality vectors)
    A_h = (Q_h * K_h^T / sqrt(d_k)) * g_h  // Gated attention scores
    A_h = softmax(A_h)        // Row-wise normalization
    head_h = A_h * V_h        // Weighted value aggregation
end for
attn_out = Concat(head_1, ..., head_H) * W_O

// QCRC after attention (Pre-Norm: normalize the input, gate the sublayer output)
gate_attn = sigma(Q * W_q)       // Quality-conditioned residual gate
x = x + gate_attn * attn_out

// Feed-Forward Network (Pre-Norm)
ff_out = FFN(LayerNorm(x))

// QCRC after FFN
gate_ff = sigma(Q * W_q')        // Separate quality gate for FFN
x = x + gate_ff * ff_out

Output: x in R^{n x d}

Figure 1 illustrates the architecture of the Quality-Aware Transformer Block, showing the flow of both content information and quality signals through the modified residual and attention pathways.

6.5 Theoretical Analysis

Information Bottleneck Perspective. Tishby et al. [25] proposed the Information Bottleneck principle, which posits that neural networks learn to compress inputs while preserving task-relevant information. From this perspective, the quality gate in QGAM acts as an explicit information filter:

$$I(Y; X \mid Q) \leq I(Y; X)$$

where $Y$ is the output, $X$ is the input, and $Q$ is the quality signal. The quality gate reduces the mutual information between output and input for low-quality inputs, preventing the model from encoding unreliable information. This is particularly valuable in RAG settings, where the model should learn to ignore or minimally represent retrieved content that is unlikely to be factually accurate.

Gradient Flow Analysis. The quality-dependent modulation of both residual connections and attention weights creates a quality-aware gradient landscape. For QCRC, the gradient through the sublayer is scaled by $\sigma(Q W_q)$, meaning that parameters in the sublayer receive smaller gradient updates when processing low-quality inputs. This has a regularization effect: the model learns to rely less on sublayer transformations for low-quality inputs, effectively learning a quality-dependent processing strategy.

For QGAM, the gradient of the attention weights with respect to the quality gate is:

$$\frac{\partial \alpha_{ij}}{\partial g(q_j)} = \alpha_{ij} \left(1 - \alpha_{ij}\right) \frac{Q_i K_j^T}{\sqrt{d_k}}$$

This gradient enables the model to learn how much to trust the quality signal: if quality assessments are informative for the task, the gate parameters $W_g, b_g$ will be optimized to amplify quality differences; if quality assessments are uninformative, the gates will converge to near-uniform values.
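Writing $s_{ij} = Q_i K_j^T / \sqrt{d_k}$ for the ungated logit, the derivative of $\alpha_{ij}$ with respect to its own gate is $\alpha_{ij}(1 - \alpha_{ij})\, s_{ij}$, since only the $j$-th logit in row $i$ depends on $g(q_j)$. A finite-difference probe on toy data confirms this; the logits and gate values below are random illustrations, not trained quantities.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
n, j = 5, 2
s = rng.normal(size=n)             # s_ij = Q_i K_j^T / sqrt(d_k) for one query row i
g = rng.uniform(0.2, 0.9, size=n)  # quality gates g(q_j)

alpha = softmax(s * g)
analytic = alpha[j] * (1.0 - alpha[j]) * s[j]   # d alpha_ij / d g(q_j)

# Central finite difference on the j-th gate.
eps = 1e-6
g_hi, g_lo = g.copy(), g.copy()
g_hi[j] += eps
g_lo[j] -= eps
numeric = (softmax(s * g_hi)[j] - softmax(s * g_lo)[j]) / (2 * eps)
```

The analytic and numeric values agree to within finite-difference error, which is the kind of check worth running before training with any custom gated-attention gradient.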

Interaction with Layer Normalization. Layer Normalization [24] computes statistics (mean and variance) across the feature dimension for each position independently. The quality gate in QCRC operates on the feature dimension via element-wise multiplication, which changes the distribution of activations entering the subsequent LayerNorm. However, because LayerNorm is invariant to scaling (it normalizes to zero mean and unit variance), the quality gate's effect on activation magnitudes is absorbed by the normalization. The quality gate's primary effect is thus on the relative scaling of different feature dimensions, which LayerNorm preserves.

Convergence Discussion. The sigmoid activation in both QCRC and QGAM ensures smooth, bounded gradients, preserving training stability. Unlike hard attention mechanisms that introduce discrete selection (and consequently non-differentiable operations), our soft gating maintains gradient flow throughout training. The quality gate values are initialized near 0.5 (via appropriate initialization of $W_q$ and $W_g$), ensuring that training begins with approximately standard transformer behavior and gradually learns quality-dependent modulation.

6.6 Preliminary Hypotheses and Experimental Design

We propose the following hypotheses for empirical validation:

H1: QCRC reduces model dependence on low-quality retrieved content, as measured by the correlation between input quality degradation and output quality degradation. Models with QCRC should exhibit a flatter degradation curve compared to standard transformers.

H2: QGAM improves factual accuracy in RAG systems, as measured by Exact Match (EM) and F1 scores on extractive question answering benchmarks. The improvement should be most pronounced when the retrieval corpus contains a mix of high- and low-quality documents.

H3: Joint QCRC+QGAM reduces hallucination rate while maintaining or improving answer quality, demonstrating that quality-aware architecture modifications can achieve both goals simultaneously.

Proposed Experimental Setup.

  • Datasets: MS MARCO [32], Natural Questions [33], TriviaQA [34], and FEVER [18] for fact verification.
  • Metrics: Exact Match (EM), F1 score, Hallucination Rate (fraction of generated claims unsupported by retrieved documents), and Answer Quality (human evaluation on a 5-point scale).
  • Baselines: Standard transformer (BERT-base [35], LLaMA-7B [36]), standard RAG pipeline, Self-RAG [8].
  • Quality Signal Source: QualityVector framework assessments (Section 3) applied to retrieved documents.
  • Ablation Studies: QCRC only, QGAM only, joint QCRC+QGAM, varying quality vector dimensions, varying gate initialization.
  • Quality Degradation Analysis: Systematic degradation of retrieval quality (replacing high-quality documents with low-quality alternatives) to measure model robustness.
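Two of the proposed metrics are simple enough to sketch directly. The whitespace/case normalization and the claim-support checker below are illustrative stand-ins for the evaluation harness, not its actual definitions.

```python
def exact_match(pred, gold):
    """EM after simple whitespace and case normalization (illustrative)."""
    norm = lambda s: " ".join(s.lower().split())
    return float(norm(pred) == norm(gold))

def hallucination_rate(claims, is_supported):
    """Fraction of generated claims with no support in the retrieved documents."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if not is_supported(c)) / len(claims)
```

In practice `is_supported` would be an entailment or evidence-retrieval model; here it is left abstract so the metric definition itself stays visible.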

7. Conclusion

We have presented QualityVector, a dual-contribution framework addressing the critical absence of content quality assessment in AI Agent web search systems. Our first contribution is a theoretically grounded 10-dimensional content quality rating framework that bridges epistemological foundations---including Popper's Three Worlds, the IPCC likelihood-confidence framework, and the DIKW hierarchy---with practical engineering requirements for web-scale content evaluation. Our second contribution is a novel quality-aware neural architecture comprising Quality-Conditioned Residual Connections (QCRC) and Quality-Gated Attention Mechanisms (QGAM), which enable transformer-based models to dynamically modulate information flow based on assessed content quality.

These contributions establish information quality assessment as foundational infrastructure for AI Agent systems. As AI Agents transition from laboratory prototypes to production systems processing hundreds of millions of web searches per month, the ability to assess and respond to content quality becomes not merely advantageous but essential for safety, reliability, and trustworthiness.

Future Work. Several directions remain for future investigation. First, empirical validation of QCRC and QGAM on the proposed benchmarks is needed to establish their effectiveness quantitatively. Second, scaling the QualityVector assessment pipeline to production workloads requires optimization of the LLM-based assessment process, potentially through distillation into smaller specialized models. Third, extending the framework to multimodal content (images, videos, audio) will be important as AI Agents increasingly consume non-textual information. Fourth, the adversarial interaction between quality assessment and GEO techniques warrants ongoing study, as manipulation strategies will likely evolve in response to quality-aware systems. Finally, establishing industry standards for content quality metadata in search APIs would enable ecosystem-wide quality improvement, benefiting all AI Agent systems.


References

[1] P. Aggarwal, V. Murahari, T. Rajpurohit, K. Narasimhan, A. Kalyan, and A. Deshpande, "GEO: Generative Engine Optimization," in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2024.

[2] R. Y. Wang and D. M. Strong, "Beyond accuracy: What data quality means to data consumers," Journal of Management Information Systems, vol. 12, no. 4, pp. 5--33, 1996.

[3] R. L. Ackoff, "From data to wisdom," Journal of Applied Systems Analysis, vol. 16, pp. 3--9, 1989.

[4] B. J. Fogg, "Prominence-interpretation theory: Explaining how people assess credibility online," in CHI '03 Extended Abstracts on Human Factors in Computing Systems, pp. 722--723, 2003.

[5] M. J. Metzger, "Making sense of credibility on the Web: Models for evaluating online information and recommendations for future research," Journal of the American Society for Information Science and Technology, vol. 58, no. 13, pp. 2078--2091, 2007.

[6] IPCC, Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change. Cambridge University Press, 2021.

[7] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Kuttler, M. Lewis, W.-t. Yih, T. Rocktaschel, S. Riedel, and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 9459--9474, 2020.

[8] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, "Self-RAG: Learning to retrieve, generate, and critique through self-reflection," in Proceedings of the International Conference on Learning Representations (ICLR), 2024.

[9] A. Saito and Y. Matsushita, "Relevance, diversity, and novelty: A unified perspective on retrieval evaluation," in Proceedings of the European Conference on Information Retrieval (ECIR), pp. 543--554, 2021.

[10] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey, "An experimental comparison of click position-bias models," in Proceedings of the International Conference on Web Search and Data Mining (WSDM), pp. 87--94, 2008.

[11] L. Li, J. Wang, J. Li, and J. Tang, "Temporal dynamics of information credibility," in Proceedings of the International AAAI Conference on Web and Social Media, 2020.

[12] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, "Dense passage retrieval for open-domain question answering," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769--6781, 2020.

[13] G. Izacard and E. Grave, "Leveraging passage retrieval with generative models for open domain question answering," in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 874--880, 2021.

[14] A. Mallen, V. Zhong, S. Das, R. Gupta, A. Zaidi, S. Mishra, and P. Lewis, "A survey on retrieval-augmented text generation," arXiv preprint arXiv:2312.06808, 2023.

[15] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Wang, H. Wang, and M. Wang, "Retrieval-augmented generation for large language models: A survey," arXiv preprint arXiv:2312.10997, 2024.

[16] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, "Survey of hallucination in natural language generation," ACM Computing Surveys, vol. 55, no. 12, pp. 1--38, 2023.

[17] K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston, "Retrieval augmentation reduces hallucination in conversation," in Findings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3364--3380, 2021.

[18] J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal, "FEVER: A large-scale dataset for fact extraction and VERification," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 809--819, 2018.

[19] H. Guo, D. Yu, S. Tang, and S. Liu, "Retrieving and reading: A comprehensive survey on open-domain question answering," ACM Transactions on Information Systems, vol. 41, no. 4, pp. 1--43, 2022.

[20] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," in Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 933--941, 2017.

[21] A. Graves, "Adaptive computation time for recurrent neural networks," arXiv preprint arXiv:1603.08983, 2016.

[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017.

[23] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770--778, 2016.

[24] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.

[25] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," in Proceedings of the 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368--377, 1999.

[26] K. R. Popper, Objective Knowledge: An Evolutionary Approach. Oxford University Press, 1972.

[27] B. S. Bloom, Taxonomy of Educational Objectives: The Classification of Educational Goals. David McKay Company, 1956.

[28] R. L. Keeney and H. Raiffa, Decisions with Multiple Objectives: Preferences and Value Tradeoffs. Cambridge University Press, 1993.

[29] X. L. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, "Self-consistency improves chain of thought reasoning in language models," in Proceedings of the International Conference on Learning Representations (ICLR), 2023.

[30] S. Dhuliawala, M. Komeili, J. Lin, X. Ren, J. Han, and S. Yih, "Chain-of-verification reduces hallucination in large language models," in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.

[31] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T.-Y. Liu, "On layer normalization in the transformer architecture," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6847--6855, 2020.

[32] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng, "MS MARCO: A human generated machine reading comprehension dataset," in Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches (CoCo@NIPS), 2016.

[33] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, M. Jones, M. Kelcey, M.-W. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov, "Natural Questions: A benchmark for question answering research," Transactions of the Association for Computational Linguistics, vol. 7, pp. 453--466, 2019.

[34] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, "TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1601--1611, 2017.

[35] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 4171--4186, 2019.

[36] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Roziere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.

[37] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877--1901, 2020.

[38] OpenAI, "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.

[39] S. Vosoughi, D. Roy, and S. Aral, "The spread of true and false news online," Science, vol. 359, no. 6380, pp. 1146--1151, 2018.

[40] Z. Zhang, G. Sheng, and J. Li, "Automated fake news detection without writing style: A multimodal approach," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6642--6651, 2020.

[41] W. Y. Wang, "Liar, liar pants on fire: A new benchmark dataset for fake news detection," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 422--426, 2017.

[42] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, "Fake news detection on social media: A data mining perspective," ACM SIGKDD Explorations Newsletter, vol. 19, no. 1, pp. 22--36, 2017.

[43] C. Buntain and J. Golbeck, "Automatically identifying fake news in popular Twitter threads," in Proceedings of the International Conference on Smart Cities and Social Analytics, 2017.

[44] R. Guha, V. Gupta, V. Raghunathan, and R. Kumar, "System and method for providing a reputation score for articles on the web," U.S. Patent 7,483,904, 2009.

[45] J. M. Weber, "The impact of information quality on decision making: A multi-agent simulation study," in Proceedings of the 37th Hawaii International Conference on System Sciences, 2004.

[46] D. M. Strong, Y. W. Lee, and R. Y. Wang, "Data quality in context," Communications of the ACM, vol. 40, no. 5, pp. 103--110, 1997.

[47] E. M. Rogers, Diffusion of Innovations, 5th ed. Free Press, 2003.

[48] S. Lewandowsky, U. K. H. Ecker, and C. J. Cook, "Beyond misinformation: Understanding and coping with the 'post-truth' era," Journal of Applied Research in Memory and Cognition, vol. 6, no. 4, pp. 353--369, 2017.

[49] C. A. N. dos Santos, L. Melo, and W. Fernandes, "Attentive pooling for fact verification," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6720--6726, 2020.

[50] J. Thorne, V. Vlachos, C. Christodoulopoulos, and A. Mittal, "FEVER: A large-scale dataset for fact extraction and VERification," arXiv preprint arXiv:1803.05355, 2018.