LLMs Show Weak Agreement with Human Essay Graders, Overvalue Short Essays and Penalize Minor Errors


A new arXiv study finds LLMs like GPT and Llama have weak agreement with human essay scores. They systematically over-score short, underdeveloped essays and under-score longer essays with minor grammatical errors.

Gala Smith & AI Research Desk · 5h ago · 7 min read · AI-Generated
Source: arxiv.org via arxiv_ai · Corroborated

A new study posted to arXiv, "LLMs Do Not Grade Essays Like Humans," provides a systematic, zero-shot evaluation of how large language models perform as automated essay graders. The research, submitted on March 24, 2026, directly tests several models from the GPT and Llama families against human grading practices, finding that while LLM-generated scores and feedback are internally consistent, their alignment with human judgment remains limited and follows distinct, sometimes problematic, patterns.

This work arrives amid a surge of research on arXiv exploring the practical limits and applications of LLMs. Just this week, the repository has seen studies on RAG chunking strategies, the fairness of model representations, and RL-guided robot planning, reflecting the platform's central role in disseminating rapid AI research. The trend of using LLMs as evaluative tools—for code, reasoning, or, as in this case, writing—is accelerating, making this granular analysis of their grading behavior particularly timely.

What the Researchers Tested

The researchers conducted a controlled evaluation using several prominent LLMs in an "out-of-the-box" setting. This means the models were not fine-tuned or specifically trained for the essay grading task; they were prompted to generate a score and feedback based solely on their pre-existing knowledge and capabilities. The study compared the scores generated by these models against grades assigned by human raters for the same set of essays.

The key variables were the essays' characteristics. The researchers analyzed how model performance varied with factors such as essay length, depth of development, and the presence of surface-level errors like grammar and spelling mistakes.

Key Results: A Systematic Mismatch

The core finding is that agreement between LLM-generated scores and human grades is "relatively weak." More importantly, the disagreement is not random but follows predictable, systematic biases:

  1. Over-scoring of Short/Underdeveloped Essays: Compared to human raters, LLMs consistently assigned higher scores to essays that were brief or lacked substantive development. This suggests LLMs may be overly forgiving of a lack of content or depth, potentially rewarding superficial responses.
  2. Under-scoring of Longer Essays with Minor Errors: Conversely, LLMs tended to assign lower scores to longer, more developed essays that contained minor grammatical or spelling errors. Human graders, while noting such errors, typically weighed the overall content and argument more heavily. LLMs appeared to disproportionately penalize these surface-level mistakes.
  3. Internal Consistency Between Score and Feedback: The study found a strong correlation within the LLMs' own outputs: essays that received more praise in the model's textual feedback also received higher scores, and those receiving more criticism received lower scores. This indicates the models are not generating arbitrary numbers; their scoring logic is coherent and aligned with their own evaluative language.
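
These patterns are straightforward to check with simple diagnostics. A sketch, assuming integer scores on a 0–6 ASAP-style scale (the function names and the 250-word cutoff are illustrative assumptions, not details from the paper): quadratic weighted kappa is a standard agreement measure in essay-scoring research, and a signed score difference stratified by essay length surfaces the over/under-scoring asymmetry described above.

```python
from collections import Counter

def quadratic_weighted_kappa(human, llm, min_score=0, max_score=6):
    """Quadratic weighted kappa: 1.0 = perfect agreement, 0 = chance
    level. A common agreement metric for ASAP-style essay scoring."""
    n = max_score - min_score + 1
    observed = [[0.0] * n for _ in range(n)]
    for h, l in zip(human, llm):
        observed[h - min_score][l - min_score] += 1
    total = len(human)
    hist_h = Counter(h - min_score for h in human)
    hist_l = Counter(l - min_score for l in llm)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2      # quadratic penalty
            num += w * observed[i][j]
            den += w * hist_h[i] * hist_l[j] / total  # chance expectation
    return 1.0 - num / den if den else 1.0

def mean_bias_by_length(essays, human, llm, cutoff_words=250):
    """Mean signed (LLM - human) score difference for short vs. long
    essays; positive means the LLM over-scores. The 250-word cutoff
    is an illustrative assumption."""
    short, long_ = [], []
    for essay, h, l in zip(essays, human, llm):
        (short if len(essay.split()) < cutoff_words else long_).append(l - h)
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(short), avg(long_)
```

Under the study's findings, the first value would tend to come out positive (short essays over-scored) and the second negative (long essays with minor errors under-scored).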

How the Grading Disconnect Manifests

The results point to a fundamental difference in the "signals" prioritized by LLMs versus human graders. Human raters typically employ a holistic rubric that balances ideas, organization, development, and conventions (grammar, mechanics). They can distinguish between a critical error that obscures meaning and a minor typo in an otherwise sophisticated argument.
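The holistic weighting described above can be made concrete with a toy rubric. The dimensions follow the description in this section, but the specific weights are illustrative assumptions, not from any published rubric:

```python
# Illustrative holistic rubric: content dimensions outweigh surface
# conventions, mirroring how human raters weigh a minor typo lightly.
RUBRIC_WEIGHTS = {
    "ideas": 0.35,
    "organization": 0.25,
    "development": 0.25,
    "conventions": 0.15,  # grammar/mechanics deliberately weigh least
}

def holistic_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each on a 0-6 scale)."""
    return sum(RUBRIC_WEIGHTS[d] * s for d, s in dimension_scores.items())
```

With these weights, an essay rated 6 on ideas, organization, and development but only 2 on conventions still lands at 5.4 overall, whereas a grader fixated on surface errors would drag it far lower.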

(Figure panel (a): ASAP Task 1)

LLMs, operating as next-token predictors trained on vast corpora of text, likely develop a strong statistical sense of "well-formedness." An essay with perfect grammar and sentence structure may statistically correlate with high-quality writing in their training data, leading to a positive bias. Conversely, they may lack the nuanced, context-aware judgment to see past minor errors to evaluate the strength of an idea or argument. Their tendency to over-score short essays could stem from a lack of training data featuring low-quality but lengthy text, or an inability to properly assess argumentative depth and structural coherence.

Why It Matters for Deploying LLM Graders

This research serves as a crucial reality check for the burgeoning use of LLMs in educational technology and assessment. Proponents often highlight the potential for scalable, instant feedback. However, this study demonstrates that deploying these models as primary or unsupervised graders introduces specific, non-human biases.

Figure 1: Overview of the proposed analysis framework. Essays are first evaluated by LLMs to produce both predicted scores and textual feedback.

  • For formative assessment (practice): An LLM that over-scores short answers could give students a false sense of competency, failing to push them toward deeper analysis.
  • For summative assessment (grading): An LLM that unduly penalizes a strong essay for a few typos could unfairly impact a student's outcome.

The authors conclude that while LLM-human alignment is limited, the models "can be reliably used in supporting essay scoring." The key term is "supporting." This suggests a role as a first-pass analyzer, a feedback generator for students to review, or a tool to flag essays for human review—not as a replacement for human judgment. The internal consistency between scores and feedback also means the models could be useful for generating explanations for a given score, even if that score itself requires human calibration.
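A "supporting" role can be operationalized as first-pass triage: route an essay to human review whenever the LLM's score falls into one of the study's known bias zones. The thresholds and function name below are illustrative assumptions, not values from the paper:

```python
def needs_human_review(essay: str, llm_score: int,
                       short_words: int = 150, long_words: int = 400,
                       hi: int = 5, lo: int = 2) -> bool:
    """First-pass triage under the study's bias profile: short essays
    with high LLM scores may be over-scored, and long essays with low
    LLM scores may be over-penalized for minor surface errors."""
    n = len(essay.split())
    if n < short_words and llm_score >= hi:
        return True   # possible over-scoring of underdeveloped work
    if n > long_words and llm_score <= lo:
        return True   # possible over-penalizing of minor errors
    return False
```

Essays that trip either check go to a human rater; the rest can ship with the LLM's feedback attached as a draft for the instructor to confirm.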

gentic.news Analysis

This paper fits into a clear and critical trend in AI research: moving from demonstrating capability to rigorously auditing real-world performance. As covered in our recent article on the EnterpriseArena benchmark, which found LLM agents fail at complex resource allocation, there is a growing body of work identifying the specific, practical gaps between LLM potential and reliable deployment. This essay grading study performs a similar function for the education domain, swapping out business logic for pedagogical judgment.

The findings also resonate with earlier research on LLM biases. The models' hypersensitivity to grammatical errors mirrors observations in other contexts where LLMs over-index on surface-level patterns. Their struggle with holistic, weighted evaluation is a known challenge in complex reasoning tasks. This study usefully quantifies these tendencies in a high-stakes, familiar application.

Looking at the broader arXiv activity this week—with papers on RAG, fairness, and planning—a pattern emerges. The field is in a deep calibration phase. The initial wave of "LLMs can do X" is giving way to a more nuanced question: "How well do LLMs do X, under what conditions, and with what biases?" This essay grading paper is a direct contributor to that essential line of inquiry. For practitioners building educational tools, the takeaway is not to abandon LLMs but to design systems that leverage their consistency and text-generation power while instituting human oversight to correct for their systematic misalignments with human value judgments.

Frequently Asked Questions

Can I use ChatGPT to grade my students' essays?

Based on this research, using an off-the-shelf LLM like ChatGPT as the sole grader for high-stakes assessments is not recommended. The study found weak agreement with human scores and systematic biases, such as overscoring short essays and underscoring longer ones with minor errors. It could be used as a supportive tool to generate initial feedback or highlight potential areas for review, but final scoring should involve human judgment calibrated against a clear rubric.

Which LLM is the best for essay grading?

The arXiv study evaluated several models from the GPT and Llama families in a zero-shot setting and found that agreement with human graders remained "relatively weak" across the board. The paper does not declare a clear winner, as the core issue appears to be a fundamental mismatch in grading signals between LLMs and humans, not a simple performance deficit of one model over another. The choice of model may be less important than designing a system that mitigates the identified biases.

Will fine-tuning an LLM on graded essays fix this problem?

The study specifically examined an "out-of-the-box setting, without task-specific training." Fine-tuning on a high-quality dataset of human-graded essays would likely improve alignment by explicitly teaching the model the human rubric. However, the success would depend entirely on the quality, size, and representativeness of the training data. It could mitigate the bias but may not eliminate it, as the model's underlying architecture still processes text differently than a human mind.

What is the main takeaway for teachers and educators?

The main takeaway is caution and context. LLMs can be powerful assistants for generating practice prompts, providing illustrative feedback, or helping students brainstorm. However, they should not be trusted as autonomous graders. Educators should be aware of the specific biases identified—leniency on depth and harshness on minor errors in long essays—and use LLM output as one data point among many, not a definitive assessment.

AI Analysis

This study is a textbook example of the necessary 'phase two' in applied AI research: rigorous benchmarking against human standards. It moves beyond the simplistic question of 'can it grade?' to the more critical 'how does it grade, and where does it diverge from experts?' The identified biases are not surprising from a technical perspective: LLMs are statistical models of text, not cognitive models of argument evaluation. Their over-rewarding of grammatical fluency and under-rewarding of substantive depth is a direct reflection of their training objective (predicting plausible text) rather than an educational objective (evaluating rhetorical merit).

This work connects directly to a major trend we've been tracking: the systematic uncovering of LLM failure modes in practical scenarios. Just yesterday, we covered research showing LLMs can [de-anonymize users from public data](https://gentic.news), highlighting a different kind of misalignment with human expectations of privacy. Earlier this week, another arXiv paper challenged the assumption that [fair model representations guarantee fair recommendations](https://gentic.news/research-challenges-assumption). Together, these papers form a compelling narrative: as LLMs are pushed into real-world decision loops (grading, recommending, planning), their internal optima often diverge from human social, ethical, and pedagogical optima. This grading paper adds a concrete, measurable instance to that pattern.

For practitioners, the implication is clear: LLMs are components, not complete solutions. The value is in designing systems that use the LLM's strengths (text generation, pattern matching at scale) while constraining its weaknesses with human oversight, explicit rubrics, or hybrid scoring models. The finding of strong internal consistency between scores and feedback is perhaps the most actionable insight: it means an LLM's explanation of its score coheres with that score, even if the score itself is miscalibrated. This could enable novel applications where the LLM's role is to articulate a rationale for a human-assigned grade, not to assign the grade itself.