llm architecture
30 articles about llm architecture in AI news
LLM Architecture Gallery Compiles 38 Model Designs from 2024-2026 with Diagrams and Code
A new open-source repository provides annotated architecture diagrams, key design choices, and code implementations for 38 major LLMs released between 2024 and 2026, including DeepSeek V3, Qwen3 variants, and GLM-5 744B.
Moonshot AI CEO Yang Zhilin Advocates for Attention Residuals in LLM Architecture
Yang Zhilin, founder of Moonshot AI, argues for the architectural value of attention residuals in large language models. This technical perspective comes from the creator of the popular Kimi Chat model.
DriveXQA: New AI Framework Helps Autonomous Vehicles See Through Fog and Sensor Failures
Researchers introduce DriveXQA, a multimodal dataset and MVX-LLM architecture that enables autonomous vehicles to answer complex questions about adverse driving conditions by fusing data from multiple visual sensors, significantly improving performance in challenging scenarios like fog.
Akshay Pachaar Inverts LLM Agent Architecture with 'Harness' Design
AI engineer Akshay Pachaar outlined a novel 'harness' architecture for LLM agents that externalizes intelligence into memory, skills, and protocols. He is building a minimal, didactic open-source implementation of this design.
Solving LLM Debate Problems with a Multi-Agent Architecture
A developer details moving from generic prompts to a multi-agent system where two LLMs are forced to refute each other, improving reasoning and output quality. This is a technical exploration of a novel prompting architecture.
DualPath Architecture Shatters KV-Cache Bottleneck, Doubling LLM Throughput for AI Agents
Researchers have developed DualPath, a novel architecture that eliminates the KV-cache storage bottleneck in agentic LLM inference. By implementing dual-path loading with RDMA transfers, the system achieves nearly 2× throughput improvements for both offline and online scenarios.
MARS Method Boosts LLM Throughput 1.7x With No Architecture Changes
Researchers introduced MARS, a training-free method that allows autoregressive LLMs to generate multiple tokens per forward pass, boosting throughput by 1.5-1.7x without architectural modifications or accuracy loss.
New Pipeline Enables Lossless Distillation of Transformer LLMs into Hybrid xLSTM Architectures
Researchers developed a distillation pipeline that transfers transformer LLM knowledge into hybrid xLSTM models. The distilled students match or exceed teacher models like Llama, Qwen, and Olmo on downstream tasks.
Expert Pyramid Tuning: A New Parameter-Efficient Fine-Tuning Architecture for Multi-Task LLMs
Researchers propose Expert Pyramid Tuning (EPT), a novel PEFT method that uses multi-scale feature pyramids to better handle tasks of varying complexity. It outperforms existing MoE-LoRA variants while using fewer parameters, offering more efficient multi-task LLM deployment.
Two-Tower vs Vector DB + LLM: Which Wins for RecSys at Scale?
Two-tower models offer sub-10ms latency for cold-start; vector DB + LLM provides richer semantics. Hybrid architectures reduce churn by 15-20%.
Columbia Prof: LLMs Can't Generate New Science, Only Map Known Data
Columbia CS Professor Vishal Misra argues LLMs cannot generate new scientific ideas because they learn structured maps of known data and fail outside those boundaries. True discovery requires creating new conceptual maps, a capability current architectures lack.
A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search
A new research paper presents a reference architecture for 'agentic hybrid retrieval' that orchestrates BM25, dense embeddings, and LLM agents to handle underspecified queries against sparse metadata. It introduces offline metadata augmentation and analyzes two architectural styles for quality attributes like governance and performance.
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
Cognitive Companion Monitors LLM Agent Reasoning with Zero Overhead
A 'Cognitive Companion' architecture uses a logistic regression probe on LLM hidden states to detect when agents loop or drift, reducing failures by over 50% with zero inference overhead.
GeoAgentBench: New Dynamic Benchmark Tests LLM Agents on 117 GIS Tools
A new benchmark, GeoAgentBench, evaluates LLM-based GIS agents in a dynamic sandbox with 117 tools. It introduces a novel Plan-and-React agent architecture that outperforms existing frameworks in multi-step spatial tasks.
Bi-Predictability: A New Real-Time Metric for Monitoring LLM
A new arXiv paper introduces 'bi-predictability' (P), an information-theoretic measure, and a lightweight Information Digital Twin (IDT) architecture to monitor the structural integrity of multi-turn LLM conversations in real-time. It detects a 'silent uncoupling' regime where outputs remain semantically sound but the conversational thread degrades, offering a scalable tool for AI assurance.
MiniMax M2.7 Tops Open LLM Leaderboard with 230B Parameter Sparse Model
MiniMax announced its M2.7 model has taken the top spot on the Hugging Face Open LLM Leaderboard. The model uses a sparse mixture-of-experts architecture with 230B total parameters but only activates 10B per token.
SauerkrautLM-Doom-MultiVec: 1.3M-Param Model Outperforms LLMs 92,000x Its Size
Researchers built a 1.3M-parameter model that plays DOOM in real-time, scoring 178 frags in 10 episodes. It outperforms LLMs like Nemotron-120B and GPT-4o-mini, which scored only 13 combined, demonstrating the power of small, task-specific architectures.
Memory Systems for AI Agents: Architectures, Frameworks, and Challenges
A technical analysis details the multi-layered memory architectures—short-term, episodic, semantic, procedural—required to transform stateless LLMs into persistent, reliable AI agents. It compares frameworks like MemGPT and LangMem that manage context limits and prevent memory drift.
Andrej Karpathy's Personal Knowledge Management System Uses LLM Embeddings Without RAG for 400K-Word Research Base
AI researcher Andrej Karpathy has developed a personal knowledge management system that processes 400,000 words of research notes using LLM embeddings rather than traditional RAG architecture. The system enables semantic search, summarization, and content generation directly from his Obsidian vault.
FAOS Neurosymbolic Architecture Boosts Enterprise Agent Accuracy by 46% via Ontology-Constrained Reasoning
Researchers introduced a neurosymbolic architecture that constrains LLM-based agents with formal ontologies, improving metric accuracy by 46% and regulatory compliance by 31.8% in controlled experiments. The system, deployed in production, serves 21 industries with over 650 agents.
Meta's Adaptive Ranking Model: A Technical Breakthrough for Efficient LLM-Scale Inference
Meta has developed a novel Adaptive Ranking Model (ARM) architecture designed to drastically reduce the computational cost of serving large-scale ranking models for ads. This represents a core infrastructure breakthrough for deploying LLM-scale models in production at massive scale.
Alibaba's XuanTie C950 CPU Hits 70+ SPECint2006, Claims RISC-V Record with Native LLM Support
Alibaba's DAMO Academy launched the XuanTie C950, a RISC-V CPU scoring over 70 on SPECint2006—the highest single-core performance for the architecture—with native support for billion-parameter LLMs like Qwen3 and DeepSeek V3.
8 AI Model Architectures Visually Explained: From Transformers to CNNs and VAEs
A visual guide maps eight foundational AI model architectures, including Transformers, CNNs, and VAEs, providing a clear reference for understanding specialized models beyond LLMs.
A Deep Dive into LoRA: The Mathematics, Architecture, and Deployment of Low-Rank Adaptation
A technical guide explores the mathematical foundations, memory architecture, and structural consequences of Low-Rank Adaptation (LoRA) for fine-tuning LLMs. It provides critical insights for practitioners implementing efficient model customization.
The Digital Twin Revolution: How LLMs Are Creating Virtual Testbeds for Social Media Policy
Researchers have developed an LLM-augmented digital twin system that simulates short-video platforms like TikTok to test policy changes before implementation. This four-twin architecture allows platforms to study long-term effects of AI tools and content policies in realistic closed-loop simulations.
dLLM Framework Unifies Diffusion Language Models, Opening New Frontiers in AI Text Generation
Researchers have introduced dLLM, a unified framework that standardizes training, inference, and evaluation for diffusion language models. This breakthrough enables conversion of existing models like BERT into diffusion architectures and facilitates reproduction of cutting-edge models like LLaDA and Dream.
Beyond the Transformer: Liquid AI's Hybrid Architecture Challenges the 'Bigger is Better' Paradigm
Liquid AI's LFM2-24B-A2B model introduces a novel hybrid architecture blending convolutions with attention, addressing critical scaling bottlenecks in modern LLMs. This 24-billion parameter model could redefine efficiency standards in AI development.
Anthropic Paper: 'Emotion Concepts and their Function in LLMs' Published
Anthropic has released a new research paper titled 'Emotion Concepts and their Function in LLMs.' The work investigates the role and representation of emotional concepts within large language model architectures.
HAVEN Benchmark Exposes MLLM Gap Between Fluency and Video Understanding
HAVEN benchmark tests MLLMs on hierarchical video understanding across frame, shot, and video levels. Results show top models lack grounded multimodal reasoning despite fluent text generation.