A new study posted to arXiv, "LLMs Do Not Grade Essays Like Humans," provides a systematic, zero-shot evaluation of how large language models perform as automated essay graders. The research, submitted on March 24, 2026, directly tests several models from the GPT and Llama families against human grading practices, finding that while LLM-generated scores and feedback are internally consistent, their alignment with human judgment remains limited and follows distinct, sometimes problematic, patterns.
This work arrives amid a surge of research on arXiv exploring the practical limits and applications of LLMs. Just this week, the repository has seen studies on RAG chunking strategies, the fairness of model representations, and RL-guided robot planning, reflecting the platform's central role in disseminating rapid AI research. The trend of using LLMs as evaluative tools—for code, reasoning, or, as in this case, writing—is accelerating, making this granular analysis of their grading behavior particularly timely.
What the Researchers Tested
The researchers conducted a controlled evaluation using several prominent LLMs in an "out-of-the-box" setting. This means the models were not fine-tuned or specifically trained for the essay grading task; they were prompted to generate a score and feedback based solely on their pre-existing knowledge and capabilities. The study compared the scores generated by these models against grades assigned by human raters for the same set of essays.
The key variables were essay characteristics: the researchers analyzed how model performance varied with essay length, developmental depth, and the presence of surface-level errors such as grammar and spelling mistakes.
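A zero-shot setup of this kind can be sketched in a few lines. The rubric wording, the `Score:`/`Feedback:` response format, and the 1–6 scale below are illustrative assumptions, not the paper's actual prompts:

```python
import re

def build_grading_prompt(essay: str, max_score: int = 6) -> str:
    """Assemble a zero-shot grading prompt: no fine-tuning, no examples."""
    return (
        f"You are an essay rater. Score the essay from 1 to {max_score} "
        "and give brief feedback.\n"
        "Respond in exactly this format:\n"
        "Score: <number>\n"
        "Feedback: <one or two sentences>\n\n"
        f"Essay:\n{essay}"
    )

def parse_grader_response(text: str) -> tuple[int, str]:
    """Pull the numeric score and the feedback text out of a model reply."""
    score = int(re.search(r"Score:\s*(\d+)", text).group(1))
    feedback_match = re.search(r"Feedback:\s*(.+)", text, re.DOTALL)
    feedback = feedback_match.group(1).strip() if feedback_match else ""
    return score, feedback

# A reply in the assumed format, as a model might return it:
reply = "Score: 4\nFeedback: Clear thesis, but development is thin."
score, feedback = parse_grader_response(reply)
```

The prompt string would be sent to any chat-style model; the parser then turns the free-text reply into the paired score and feedback the study compares against human grades.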
Key Results: A Systematic Mismatch
The core finding is that agreement between LLM-generated scores and human grades is "relatively weak." More importantly, the disagreement is not random but follows predictable, systematic biases:

- Over-scoring of Short/Underdeveloped Essays: Compared to human raters, LLMs consistently assigned higher scores to essays that were brief or lacked substantive development. This suggests LLMs may be overly forgiving of a lack of content or depth, potentially rewarding superficial responses.
- Under-scoring of Longer Essays with Minor Errors: Conversely, LLMs tended to assign lower scores to longer, more developed essays that contained minor grammatical or spelling errors. Human graders, while noting such errors, typically weighed the overall content and argument more heavily. LLMs appeared to disproportionately penalize these surface-level mistakes.
- Internal Consistency Between Score and Feedback: The study found a strong correlation within the LLMs' own outputs: essays that received more praise in the model's textual feedback also received higher scores, and those receiving more criticism received lower scores. This indicates the models are not generating arbitrary numbers; their scoring logic is coherent and aligned with their own evaluative language.
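The first two biases amount to a length-conditioned score gap: the sign of (LLM score − human score) flips between short and long essays. A minimal sketch of that analysis, on fabricated toy records that merely mimic the reported pattern:

```python
from statistics import mean

# Toy records: (word_count, human_score, llm_score).
# Fabricated for illustration; not data from the paper.
records = [
    (120, 2, 4),   # short, underdeveloped: LLM scores it higher
    (150, 2, 3),
    (600, 5, 4),   # long, minor surface errors: LLM scores it lower
    (650, 5, 3),
]

def gap(rows):
    """Mean LLM-minus-human score difference for a group of essays."""
    return mean(llm - human for _, human, llm in rows)

short = [r for r in records if r[0] < 300]
long_ = [r for r in records if r[0] >= 300]

print("short-essay gap:", gap(short))   # positive: systematic over-scoring
print("long-essay gap:", gap(long_))    # negative: systematic under-scoring
```

A gap that is positive for one group and negative for the other is exactly the signature of a systematic rather than random disagreement.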
How the Grading Disconnect Manifests
The results point to a fundamental difference in the "signals" prioritized by LLMs versus human graders. Human raters typically employ a holistic rubric that balances ideas, organization, development, and conventions (grammar, mechanics). They can distinguish between a critical error that obscures meaning and a minor typo in an otherwise sophisticated argument.

LLMs, operating as next-token predictors trained on vast corpora of text, likely develop a strong statistical sense of "well-formedness." An essay with perfect grammar and sentence structure may statistically correlate with high-quality writing in their training data, leading to a positive bias. Conversely, they may lack the nuanced, context-aware judgment to see past minor errors to evaluate the strength of an idea or argument. Their tendency to over-score short essays could stem from a lack of training data featuring low-quality but lengthy text, or an inability to properly assess argumentative depth and structural coherence.
Why It Matters for Deploying LLM Graders
This research serves as a crucial reality check for the burgeoning use of LLMs in educational technology and assessment. Proponents often highlight the potential for scalable, instant feedback. However, this study demonstrates that deploying these models as primary or unsupervised graders introduces specific, non-human biases.

- For formative assessment (practice): An LLM that over-scores short answers could give students a false sense of competency, failing to push them toward deeper analysis.
- For summative assessment (grading): An LLM that unduly penalizes a strong essay for a few typos could unfairly impact a student's outcome.
The authors conclude that while LLM-human alignment is limited, the models "can be reliably used in supporting essay scoring." The key term is "supporting." This suggests a role as a first-pass analyzer, a feedback generator for students to review, or a tool to flag essays for human review—not as a replacement for human judgment. The internal consistency between scores and feedback also means the models could be useful for generating explanations for a given score, even if that score itself requires human calibration.
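One way to operationalize this "supporting" role is a triage rule that keeps the LLM as a first-pass scorer but routes essays in the known bias zones to a human. The word-count and score thresholds below are hypothetical, chosen only to illustrate the shape of such a rule:

```python
def needs_human_review(word_count: int, llm_score: int,
                       error_count: int) -> bool:
    """Flag essays whose LLM score falls in a known bias zone.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    # Bias 1: short, possibly underdeveloped essay with a high LLM score
    # -> risk of over-scoring a superficial response.
    if word_count < 250 and llm_score >= 4:
        return True
    # Bias 2: long essay with surface errors and a low LLM score
    # -> risk of over-penalizing minor grammar/spelling mistakes.
    if word_count > 500 and error_count > 0 and llm_score <= 3:
        return True
    return False
```

Essays that clear both checks could receive the LLM's score and feedback directly as a draft; flagged essays go to a human rater, concentrating scarce grading time where the model is least trustworthy.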
Agentic.news Analysis
This paper fits into a clear and critical trend in AI research: moving from demonstrating capability to rigorously auditing real-world performance. As covered in our recent article on the EnterpriseArena benchmark, which found LLM agents fail at complex resource allocation, there is a growing body of work identifying the specific, practical gaps between LLM potential and reliable deployment. This essay grading study performs a similar function for the education domain, swapping out business logic for pedagogical judgment.
The findings also resonate with earlier research on LLM biases. The models' hypersensitivity to grammatical errors mirrors observations in other contexts where LLMs over-index on surface-level patterns. Their struggle with holistic, weighted evaluation is a known challenge in complex reasoning tasks. This study usefully quantifies these tendencies in a high-stakes, familiar application.
Looking at the broader arXiv activity this week—with papers on RAG, fairness, and planning—a pattern emerges. The field is in a deep calibration phase. The initial wave of "LLMs can do X" is giving way to a more nuanced question: "How well do LLMs do X, under what conditions, and with what biases?" This essay grading paper is a direct contributor to that essential line of inquiry. For practitioners building educational tools, the takeaway is not to abandon LLMs but to design systems that leverage their consistency and text-generation power while instituting human oversight to correct for their systematic misalignments with human value judgments.
Frequently Asked Questions
Can I use ChatGPT to grade my students' essays?
Based on this research, using an off-the-shelf LLM like ChatGPT as the sole grader for high-stakes assessments is not recommended. The study found weak agreement with human scores and systematic biases, such as over-scoring short essays and under-scoring longer ones with minor errors. It could be used as a supportive tool to generate initial feedback or highlight potential areas for review, but final scoring should involve human judgment calibrated against a clear rubric.
Which LLM is the best for essay grading?
The arXiv study evaluated several models from the GPT and Llama families in a zero-shot setting and found that agreement with human graders remained "relatively weak" across the board. The paper does not declare a clear winner, as the core issue appears to be a fundamental mismatch in grading signals between LLMs and humans, not a simple performance deficit of one model over another. The choice of model may be less important than designing a system that mitigates the identified biases.
Will fine-tuning an LLM on graded essays fix this problem?
The study specifically examined an "out-of-the-box setting, without task-specific training." Fine-tuning on a high-quality dataset of human-graded essays would likely improve alignment by explicitly teaching the model the human rubric. However, the success would depend entirely on the quality, size, and representativeness of the training data. It could mitigate the bias but may not eliminate it, as the model's underlying architecture still processes text differently than a human mind.
What is the main takeaway for teachers and educators?
The main takeaway is caution and context. LLMs can be powerful assistants for generating practice prompts, providing illustrative feedback, or helping students brainstorm. However, they should not be trusted as autonomous graders. Educators should be aware of the specific biases identified—leniency on depth and harshness on minor errors in long essays—and use LLM output as one data point among many, not a definitive assessment.