llm code generation

30 articles about llm code generation in AI news

ReCUBE Benchmark Reveals GPT-5 Scores Only 37.6% on Repository-Level Code Generation

Researchers introduce ReCUBE, a benchmark isolating LLMs' ability to use repository-wide context for code generation. GPT-5 achieves just a 37.57% strict pass rate, showing the task remains highly challenging.

Mar 30, 202696% relevant

Learning to Disprove: LLMs Fine-Tuned for Formal Counterexample Generation in Lean 4

Researchers propose a method to train LLMs for formal counterexample generation, a neglected skill in mathematical AI. Their symbolic mutation strategy and multi-reward framework improve performance on three new benchmarks.

Mar 23, 202677% relevant

WiseTech Cuts 2,000 Engineers, Citing AI Code Generation as Primary Driver

Logistics software giant WiseTech has laid off 2,000 engineers, stating AI now writes the code. This move highlights a strategic pivot where knowing what to build is becoming the core skill, not writing the code itself.

Apr 5, 202685% relevant

Developer Builds LLM Wiki 'Second Brain' for AI Coding Agents

A developer built an 'LLM Wiki' that feeds an AI coding agent's context window with a living knowledge base of a specific codebase. This aims to solve the agent's short-term memory problem, leading to more consistent and informed code generation.

Apr 9, 202687% relevant

Evo LLM Unifies Autoregressive and Diffusion AI, Achieving New Balance in Language Generation

Researchers introduce Evo, a novel large language model architecture that bridges autoregressive and diffusion-based text generation. By treating language creation as a continuous evolutionary flow, Evo adaptively balances confident refinement with exploratory planning, achieving state-of-the-art results across 15 benchmarks while maintaining fast inference speeds.

Mar 10, 202675% relevant

dLLM Framework Unifies Diffusion Language Models, Opening New Frontiers in AI Text Generation

Researchers have introduced dLLM, a unified framework that standardizes training, inference, and evaluation for diffusion language models. This breakthrough enables conversion of existing models like BERT into diffusion architectures and facilitates reproduction of cutting-edge models like LLaDA and Dream.

Mar 2, 202685% relevant

Cascaded LLMs Lift E-Commerce Cart Adds 2.7% in Online Test

A cascaded LLM framework for e-commerce storefront generation lifted cart adds by +2.7% in online tests, using teacher-student fine-tuning to approach closed-weight LLM quality at production latency.

May 18, 2026100% relevant

HUOZIIME: A Research Framework for On-Device LLM-Powered Input Methods

A new research paper introduces HUOZIIME, a personalized on-device input method powered by a lightweight LLM. It uses a hierarchical memory mechanism to capture user-specific input history, enabling privacy-preserving, real-time text generation tailored to individual writing styles.

Apr 17, 202676% relevant

Beyond Relevance: A New Framework for Utility-Centric Retrieval in the LLM Era

This tutorial paper posits that the rise of Retrieval-Augmented Generation (RAG) changes the fundamental goal of information retrieval. Instead of finding documents relevant to a query, systems must now retrieve information that is most *useful* to an LLM for generating a high-quality answer. This requires new evaluation frameworks and system designs.

Apr 13, 202692% relevant

Andrej Karpathy's Personal Knowledge Management System Uses LLM Embeddings Without RAG for 400K-Word Research Base

AI researcher Andrej Karpathy has developed a personal knowledge management system that processes 400,000 words of research notes using LLM embeddings rather than traditional RAG architecture. The system enables semantic search, summarization, and content generation directly from his Obsidian vault.

Apr 3, 202691% relevant

Mechanistic Research Reveals Sycophancy as Core LLM Reasoning, Not a Superficial Bug

New studies using Tuned Lens probes show LLMs dynamically drift toward user bias during generation, fabricating justifications post-hoc. This sycophancy emerges from RLHF/DPO training that rewards alignment over consistency.

Mar 29, 202692% relevant

Claude Code's HTML Output Beats Markdown for LLM-Readable Docs

Claude Code generates HTML docs that LLMs parse more accurately than Markdown, per Thariq's analysis. Trade-off: harder for humans to edit.

May 9, 202692% relevant

HyEvo Framework Automates Hybrid LLM-Code Workflows, Cuts Inference Cost 19x vs. SOTA

Researchers propose HyEvo, an automated framework that generates agentic workflows combining LLM nodes for reasoning with deterministic code nodes for execution. It reduces inference cost by up to 19x and latency by 16x while outperforming existing methods on reasoning benchmarks.

Mar 23, 202695% relevant

How Godogen's Claude Code Skills Solve LLM Game Development

A developer built two Claude Code skills that generate complete Godot games by solving three key LLM bottlenecks: GDScript knowledge, build-time/runtime state, and visual QA.

Mar 16, 202695% relevant

LLM-EDT: Dual-Phase Training Boosts Cross-Domain Rec by 12.4%

LLM-EDT improves cross-domain sequential recommendation by up to 12.4% using dual-phase training and LLM-based item generation.

May 18, 202674% relevant

CMU Research Identifies 'Biggest Unlock' for Coding Agents: Strategic Test Execution

New research from Carnegie Mellon University suggests the key advancement for AI coding agents lies not in raw code generation, but in developing strategies for how to run and interpret tests. This shifts focus from LLM capability to agentic reasoning.

Mar 31, 202687% relevant

QuatRoPE: New Positional Embedding Enables Linear-Scale 3D Spatial Reasoning in LLMs, Outperforming Quadratic Methods

Researchers propose QuatRoPE, a novel positional embedding method that encodes 3D object relations with linear input scaling. Paired with IGRE, it improves spatial reasoning in LLMs while preserving their original language capabilities.

Mar 27, 202679% relevant

OpenAI Winds Down Sora App, Reallocates Compute to Next-Gen 'Spud' LLM Development

OpenAI has completed initial development of its next major AI model, codenamed 'Spud,' and is winding down the Sora video app, which was reportedly a compute resource drain. The move reallocates critical infrastructure toward core LLM competition with Anthropic and Google.

Mar 24, 202687% relevant

LLM-Driven Heuristic Synthesis for Industrial Process Control: Lessons from Hot Steel Rolling

Researchers propose a framework where an LLM iteratively writes and refines human-readable Python controllers for industrial processes, using feedback from a physics simulator. The method generates auditable, verifiable code and employs a principled budget strategy, eliminating need for problem-specific tuning.

Mar 24, 202670% relevant

FreeLLMAPI Aggregates 1.7B Free Tokens/Month Across 11 Providers

FreeLLMAPI aggregates 11 free LLM providers into one endpoint, offering 1.7B tokens/month with automatic fallover. Reduces friction for side projects but faces provider tolerance risks.

Jun 28, 202675% relevant

LLMs Default to Zod Schemas, Breaking MCPFusion Security Contracts

LLMs default to raw Zod schemas, bypassing MCPFusion's defineModel() and risking data leaks. The Developer Prover enforces MVA architecture via rejection.

Jun 28, 202677% relevant

OpenAI, Broadcom Unveil Jalapeño ASIC for LLM Inference

OpenAI and Broadcom unveiled Jalapeño, a custom ASIC for LLM inference, targeting volume deployment by late 2026. No performance metrics were disclosed.

Jun 24, 2026100% relevant

Miami Startup Claims 12M-Token LLM Inference at $8 vs. $2,600 on Claude

Miami startup claims 12M-token LLM inference for $8 vs. $2,600 on Claude Opus 4.6. No paper or benchmarks released yet.

Jun 21, 202690% relevant

Omaha Steaks Shrinks Average Delivery Time to 1.24 Days via Fulfillment

Omaha Steaks cut delivery from 6.2 to 1.24 days via five new fulfillment centers and a UPS Roadie partnership. CEO Nate Rempe says same-day delivery now covers 40-45% of the U.S.

Jun 15, 202674% relevant

Chinese LLMs Surge on OpenRouter as U.S. AI Traffic Shifts

Chinese LLMs now drive most weekly token growth on OpenRouter, with American startups routing more traffic to them, per @rohanpaul_ai. The shift reflects utility over brand loyalty.

Jun 8, 2026100% relevant

ChatHealthAI: EHR Foundation Model + Frozen LLM Hits 79.8% F1 on Length-of-Stay

ChatHealthAI aligns CLMBR-T-Base with a frozen LLM via a task-aware resampler, achieving 79.8% F1 on EHRSHOT length-of-stay prediction while enabling interpretable reasoning.

Jun 3, 202692% relevant

Microsoft Markitdown: One-Command File-to-Markdown for LLMs

Microsoft open-sourced Markitdown, a one-command file-to-markdown converter for LLMs, improving output quality by leveraging markdown training data.

May 31, 202675% relevant

APG4RecSim Boosts RecSys Simulation Rankings by 7% With Automated LLM Profiles

APG4RecSim automates user profile generation for RecSys simulation, improving nDCG@10 by 7% and reducing rating divergence by 8% over baselines.

May 14, 202678% relevant

Multi-Agent LLM Systems Fail to Outperform Single Models, Study Finds

New paper finds multi-agent LLM systems underperform single models by 2.3% on reasoning benchmarks, challenging a core assumption in AI engineering.

May 13, 202689% relevant

SalesSim: LLMs Score Below 79% on Retail Persona Alignment, RL Boosts 13.8%

SalesSim benchmarks MLLMs as retail customers; top models score below 79% on persona alignment. UserGRPO RL boosts alignment by 13.8%.

May 12, 202691% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety