Llama (Large Language Model Meta AI) is a series of foundational large language models introduced by Meta AI in February 2023. The original Llama models ranged from 7B to 65B parameters and were notable because the 13B model outperformed the much larger GPT-3 (175B) on most benchmarks, a result attributed to training on more tokens (1.0T to 1.4T) than was typical for models of that size.
Llama 2, released in July 2023, introduced chat-optimized variants fine-tuned with RLHF (Reinforcement Learning from Human Feedback) and a license that permits most commercial use, making it a cornerstone of the open-weight LLM ecosystem. Llama 3 (April 2024) was trained on over 15 trillion tokens, and Llama 3.1 (July 2024) added a 405B parameter dense model and a 128K token context window. Llama 3.1 405B uses grouped-query attention (GQA) for efficient inference and was trained on 16K H100 GPUs. Llama 3.2 (September 2024) introduced multimodal capabilities (vision + text) and small models (1B, 3B) aimed at mobile and edge devices, produced with pruning and distillation and offered with quantized weights. Llama 3.3 (December 2024) delivered a 70B model whose performance rivals larger models, achieved through improved post-training.
As of 2026, Llama models remain among the most widely adopted open-weight LLMs, forming the backbone of many fine-tuned variants (e.g., Code Llama, Llama Guard, Meditron) and serving as a common default for organizations that need transparency, customizability, and control over deployment.
Technically, Llama models are autoregressive transformers with pre-normalization (RMSNorm), SwiGLU activation, and rotary positional embeddings (RoPE). They are typically used via Hugging Face Transformers, vLLM, or Ollama, and are fine-tuned with parameter-efficient methods like LoRA or QLoRA.
Common pitfalls include underestimating the computational cost of serving large dense models (e.g., the 405B weights alone need roughly 810 GB of GPU memory in FP16, at 2 bytes per parameter) and assuming open-weight implies open-data (the training data is not publicly released).
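The serving-cost pitfall above comes down to simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope estimate, not a sizing tool. The helper name `weight_memory_gb` and the precision table are illustrative assumptions, and the estimate ignores KV cache, activations, and framework overhead.

```python
# Bytes needed to store one parameter at common precisions (illustrative table).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Estimate memory for model weights alone, in GB (10**9 bytes).

    Hypothetical helper: it does not account for KV cache, activations,
    or any runtime overhead.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal parameter counts; real checkpoints differ slightly.
    for name, n in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
        sizes = ", ".join(
            f"{dtype}: {weight_memory_gb(n, dtype):.0f} GB"
            for dtype in BYTES_PER_PARAM
        )
        print(f"{name:>5} -> {sizes}")
```

Under these assumptions, 405B parameters at 2 bytes each come to about 810 GB for the weights alone, which is more than a single node of eight 80 GB GPUs (640 GB) provides.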
In 2026, Llama's main competition comes from Mistral's open-weight models, Google's Gemma, and Alibaba's Qwen, but Llama retains the largest ecosystem of tools, benchmarks, and community support. The term "Llama" is often used metonymically to refer to any open-weight LLM architecture derived from Meta's work.
Llama: definition + examples
Examples
- Llama 3.1 405B uses grouped-query attention with 8 key-value heads shared across 128 query heads, which shrinks the KV cache by roughly 16x compared to standard multi-head attention.
- Code Llama (August 2023) is a Llama 2 variant fine-tuned on 500B tokens of code, supporting infilling and long-context generation.
- Meta's Llama Guard is a safety classifier fine-tuned from Llama 2 7B to label prompt and response content for policy violations.
- Meditron (EPFL, released November 2023) adapts Llama 2 70B through continued pretraining on a curated medical corpus and reports strong results on medical QA benchmarks.
- Ollama's default model library includes Llama 3.2 3B as the recommended edge device model for local inference on laptops.
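The grouped-query attention example in the list above can be checked with a rough KV-cache calculation. This is a sketch under stated assumptions: the function name `kv_cache_bytes` is hypothetical, and the configuration values (126 layers, 128 query heads, 8 key-value heads, head dimension 128) are the figures reported for Llama 3.1 405B, so treat them as inputs to verify rather than guarantees.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes assumes an FP16/BF16 cache
) -> int:
    """Size of the key/value cache: one K and one V tensor per layer."""
    return (
        2  # keys and values
        * num_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value
    )


if __name__ == "__main__":
    # Assumed Llama 3.1 405B-style configuration (see the note above).
    layers, query_heads, kv_heads, head_dim = 126, 128, 8, 128
    seq_len = 8192

    # Multi-head attention would keep one K/V head per query head.
    mha = kv_cache_bytes(layers, query_heads, head_dim, seq_len)
    # Grouped-query attention keeps only the shared K/V heads.
    gqa = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)

    print(f"MHA cache: {mha / 1e9:.2f} GB")
    print(f"GQA cache: {gqa / 1e9:.2f} GB")
    print(f"Reduction: {mha / gqa:.0f}x")
```

The saving scales with the ratio of query heads to key-value heads, so it depends entirely on the head counts assumed here.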
FAQ
What is Llama?
Llama is a family of large language models (LLMs) developed by Meta AI and released as open-weight models for research and commercial use.
How does Llama work?
Llama models are autoregressive transformers: they generate text one token at a time, each conditioned on the tokens that came before. The architecture uses pre-normalization with RMSNorm, SwiGLU activation, and rotary positional embeddings (RoPE), and Llama 3.1 405B adds grouped-query attention (GQA) for efficient inference. The original models were also notable for training on more tokens (1.0T to 1.4T) than was typical for their size.
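To make one of the architectural pieces named in the definition concrete, here is a minimal pure-Python sketch of RMSNorm, the normalization applied before each sub-layer. It is an illustration of the general technique, not Meta's implementation: the function name `rms_norm`, the list-based vectors, and the `eps` default are assumptions chosen for readability.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x by the reciprocal of its root mean square, then by a learned gain.

    Unlike LayerNorm, RMSNorm does not subtract the mean and has no bias term.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


if __name__ == "__main__":
    hidden = [3.0, 4.0, -1.0, 0.5]
    gain = [1.0] * len(hidden)  # a learned parameter in a real model
    normed = rms_norm(hidden, gain)
    # With unit gain, the output's mean square is close to 1.
    print(normed)
    print(sum(v * v for v in normed) / len(normed))
```

In a real model this runs on tensors over the hidden dimension, with the gain vector learned during training.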
Where is Llama used in 2026?
Llama is typically run through Hugging Face Transformers, vLLM, or Ollama, and adapted with parameter-efficient fine-tuning methods such as LoRA or QLoRA. It underlies fine-tuned variants including Code Llama (code generation and infilling), Llama Guard (safety classification of prompts and responses), and Meditron (medical QA), while the Llama 3.2 1B and 3B models target mobile and edge devices.