Self-Critique is a technique in which a language model evaluates, identifies flaws in, and iteratively improves its own generated text. It is a form of self-supervised reasoning that does not require external human labels or reward models at inference time. The core idea is to leverage the model's own knowledge of language, facts, and reasoning to detect errors—hallucinations, logical inconsistencies, or suboptimal phrasing—and then produce a revised version.
How it works: The process typically involves two phases: generation and critique. In the generation phase, the model produces an initial response (e.g., an answer, code snippet, or summary). In the critique phase, the same model (or a separate instance) is prompted to analyze that response. The critique prompt might ask: "List any factual errors, missing steps, or contradictions in the above answer." The model outputs a list of issues. Then, a final refinement step uses the original query plus the critique to produce an improved response. This can be repeated multiple times, though diminishing returns are common after 1-2 iterations.
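The generate-critique-refine loop above can be sketched in a few lines of Python. Here `call_model` is a hypothetical stand-in for any LLM API, stubbed out so the control flow runs without a real model; the prompt wording and stop condition are illustrative assumptions.

```python
def call_model(prompt: str) -> str:
    # Stub standing in for a real LLM API call; canned replies let the
    # loop below execute end to end without network access.
    if "List any factual errors" in prompt:
        return "NONE" if "384,400 km" in prompt else "Issue: missing units."
    if "Revise the answer" in prompt:
        return "The Moon is about 384,400 km from Earth."
    return "The Moon is about 384400 from Earth."

def self_refine(query: str, max_iters: int = 2) -> str:
    answer = call_model(query)  # generation phase
    for _ in range(max_iters):
        critique = call_model(  # critique phase
            f"Question: {query}\nAnswer: {answer}\n"
            "List any factual errors, missing steps, or contradictions "
            "in the above answer. Reply 'NONE' if there are none."
        )
        if critique.strip() == "NONE":  # stop once the critique is clean
            break
        answer = call_model(  # refinement phase
            f"Question: {query}\nAnswer: {answer}\n"
            f"Critique: {critique}\nRevise the answer to fix every issue."
        )
    return answer

print(self_refine("How far is the Moon from Earth?"))
```

Note the `max_iters` cap: since returns diminish quickly, one or two passes are usually enough, and the early exit on a clean critique avoids paying for extra model calls.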
Technical implementations vary. Some systems use a single model with a special "critique" system prompt (e.g., Anthropic's Constitutional AI or OpenAI's self-critique work on GPT-4). Others use a separate, often smaller, critic model (e.g., a 7B-parameter model checking outputs of a 70B model) to save costs. Chain-of-thought (CoT) prompting is frequently combined with self-critique, as the model's step-by-step reasoning makes errors easier to locate. A 2023 paper by Madaan et al. ("Self-Refine") formalized this loop and showed improvements on dialogue, code generation, and reasoning tasks.
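A separate-critic setup of the kind described above can be expressed as an ordinary chat-API request. This is a minimal sketch: the message structure follows the common chat-completion convention, and the model name `small-critic-7b` is a placeholder, not a real model identifier.

```python
def build_critic_request(query: str, draft: str) -> dict:
    # Build a request that sends the generator's draft to a smaller
    # critic model for review. Model name is a placeholder.
    return {
        "model": "small-critic-7b",
        "messages": [
            {"role": "system",
             "content": "You are a critic. Identify factual errors and "
                        "logical gaps in the assistant's draft answer."},
            {"role": "user",
             "content": f"Question: {query}\n\nDraft answer: {draft}"},
        ],
    }

req = build_critic_request("What is 17 * 24?", "17 * 24 = 398")
print(req["model"])
```

The only difference from the single-model variant is the `model` field and the critic-specific system prompt; the surrounding refine loop stays the same.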
Why it matters: Self-critique reduces reliance on expensive human feedback (RLHF) or external verifiers. Applied post-hoc, it has been reported to improve accuracy by roughly 10-30% on benchmarks such as HotpotQA (multi-hop question answering) and GSM8K (grade-school math), though gains vary by task and model. It also enables models to correct their own biases or safety violations without retraining, which is crucial for deployed systems that must adapt to new guidelines.
When to use vs. alternatives: Self-critique is most effective when the model has strong internal knowledge but occasionally makes slips (e.g., arithmetic errors, missing citations). It is less useful for tasks where the model fundamentally lacks knowledge (e.g., niche domains not in training data) or where errors stem from prompt misunderstanding rather than knowledge gaps. Alternatives include: (1) external tool use (e.g., a calculator for math), (2) human-in-the-loop feedback, (3) ensemble voting across multiple model runs, and (4) fine-tuning with RLHF to internalize corrections. Self-critique is cheaper than human feedback and faster than ensembling, but it can amplify confirmation bias if the model is overconfident in its own reasoning.
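Alternative (3), ensemble voting, is simple enough to sketch: sample several completions for the same query and keep the most common answer. The sampled strings below stand in for repeated LLM calls at nonzero temperature.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Normalize lightly so trivial formatting differences don't split votes.
    normalized = [a.strip().lower() for a in answers]
    winner, _ = Counter(normalized).most_common(1)[0]
    return winner

# e.g., five sampled runs on the same arithmetic question:
samples = ["42", "42", "41", "42", "41 "]
print(majority_vote(samples))  # the most frequent answer wins
```

Unlike self-critique, voting requires N full generations per query, which is why the text above calls it slower; its advantage is that it needs no critique prompt and cannot talk itself out of a correct answer.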
Common pitfalls: (1) The model may produce vague or incorrect critiques ("this could be improved") that don't lead to better outputs. (2) Over-iteration can degrade quality as the model drifts from the original query. (3) The critique phase itself can introduce new errors (e.g., the model hallucinates a flaw that wasn't there). (4) Computational cost doubles or triples per query, which can be prohibitive at scale.
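One way to guard against pitfall (2), over-iteration drift, is to stop refining when a revision strays too far from the previous answer. A minimal sketch using the standard library's difflib; the 0.5 threshold is an illustrative assumption, not a tuned value.

```python
import difflib

def drifted(previous: str, revised: str, threshold: float = 0.5) -> bool:
    # Ratio is near 1.0 for near-identical strings and near 0.0 for
    # unrelated ones; below the cutoff we treat the revision as drift.
    similarity = difflib.SequenceMatcher(None, previous, revised).ratio()
    return similarity < threshold

# A small edit is not drift; an unrelated rewrite would be.
print(drifted("Paris is the capital of France.",
              "Paris is the capital city of France."))  # prints False
```

In a refine loop, a `True` result would trigger falling back to the previous answer rather than accepting the revision.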
State of the art (2026): Self-critique is now standard in production LLM systems. OpenAI's o1 and o3 models use an internal self-critique loop during their "thinking" phase, though details are proprietary. Google's Gemini 2.0 employs a separate critic-head that is co-trained with the main model. Open-source models like Llama 3.3 70B can be prompted with structured critique templates (e.g., "You are a strict editor. Find exactly one error in the following...") to achieve near-parity with GPT-4 on factual accuracy. Research in 2025 focused on meta-critique—using a second model to evaluate the quality of the first critique—and on reinforcement learning from critique feedback (RLCF), where the model learns to improve based on its own critiques during training.
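A structured critique template of the kind described above pairs naturally with a small parser for the model's reply. The tag names (CLAIM/VERDICT/FIX) are an illustrative convention for this sketch, not a standard format.

```python
CRITIQUE_TEMPLATE = """You are a strict editor. Find exactly one error in the following answer.
Reply in this format:
CLAIM: <the erroneous statement>
VERDICT: <why it is wrong>
FIX: <the corrected statement>

Answer:
{answer}"""

def parse_critique(reply: str) -> dict:
    # Pull the tagged fields out of the model's reply, line by line.
    fields = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("CLAIM", "VERDICT", "FIX"):
            fields[key] = value.strip()
    return fields

# Example reply a critic model might produce:
reply = ("CLAIM: The war ended in 1944.\n"
         "VERDICT: World War II ended in 1945.\n"
         "FIX: The war ended in 1945.")
print(parse_critique(reply)["FIX"])  # prints: The war ended in 1945.
```

Constraining the critique to "exactly one error" and a fixed reply format makes the output machine-checkable, which mitigates the vague-critique pitfall noted earlier.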