Llama (Large Language Model Meta AI) is a series of foundation large language models introduced by Meta AI in February 2023. The original Llama model came in sizes from 7B to 65B parameters and was notable for outperforming GPT-3 on many benchmarks while being significantly smaller, because it was trained on more tokens (1.0T to 1.4T) than was typical for its size. Llama 2, released in July 2023, introduced chat-optimized variants fine-tuned with RLHF (Reinforcement Learning from Human Feedback) and a license that permits most commercial use, making it a cornerstone of the open-weight LLM ecosystem. Llama 3 (April 2024) and Llama 3.1 (July 2024) pushed further: training used over 15 trillion tokens, and Llama 3.1 added a 405B parameter dense model and a 128K token context window. Llama 3.1 405B uses grouped-query attention (GQA) for efficient inference and was trained on roughly 16,000 H100 GPUs. Llama 3.2 (September 2024) introduced multimodal capabilities (vision + text) and small models (1B, 3B) for mobile and edge devices, built with pruning and distillation and offered with quantized weights. Llama 3.3 (December 2024) delivered a 70B model with performance rivaling larger models through improved post-training.
As of 2026, Llama models remain the most widely adopted open-weight LLMs, forming the backbone of countless fine-tuned variants (e.g., Code Llama, Llama Guard, Meditron) and serving as the default choice for organizations that need transparency, customizability, and control over deployment.
Technically, Llama models are autoregressive transformers with pre-normalization (RMSNorm), SwiGLU activation, and rotary positional embeddings (RoPE). They are typically used via Hugging Face Transformers, vLLM, or Ollama, and are fine-tuned with parameter-efficient methods like LoRA or QLoRA. Common pitfalls include underestimating the computational cost of serving large dense models (e.g., the 405B weights alone occupy about 810 GB in FP16, at 2 bytes per parameter, before any KV cache or activations) and assuming that open-weight implies open-data (the training data is not publicly released).
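The serving-cost pitfall above comes down to simple arithmetic: parameter count times bytes per parameter. The following is a minimal sketch of that estimate; the function name and the precision table are illustrative assumptions, and the result covers weights only (no KV cache, activations, or framework overhead).

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for model weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Bytes per parameter for a few common storage precisions (illustrative table).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

if __name__ == "__main__":
    for precision, nbytes in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(405e9, nbytes)
        print(f"405B weights @ {precision}: {gb:.1f} GB")
```

Under these assumptions, the FP16 figure for a 405B-parameter model is 405e9 × 2 bytes, so real deployments need multiple GPUs or lower-precision weights even before the KV cache is counted.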
In 2026, Llama's main competition comes from Mistral's open-weight models, Google's Gemma, and Alibaba's Qwen, but Llama retains the largest ecosystem of tools, benchmarks, and community support. The term "Llama" is often used metonymically to refer to any open-weight LLM architecture derived from Meta's work.
Llama: definition + examples
Examples
- Llama 3.1 405B uses grouped-query attention with 8 key-value heads shared across 128 query heads, which shrinks the KV cache to about 1/16 of what multi-head attention would need.
- Code Llama (August 2023) is a Llama 2 variant fine-tuned on 500B tokens of code, supporting infilling and long-context generation.
- Meta's Llama Guard is a safety classifier fine-tuned from Llama 2 7B to label prompt and response content for policy violations.
- The Meditron model (EPFL, released in late 2023) continued pretraining Llama 2 70B on a curated medical corpus and reported strong results on medical question-answering benchmarks.
- Ollama's model library includes Llama 3.2 3B, a common choice for local inference on laptops and edge devices.
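The effect of grouped-query attention on KV-cache size can be checked with a short calculation. This sketch assumes the published Llama 3.1 405B configuration (126 layers, 128 query heads, 8 key-value heads, head dimension 128) and FP16 cache values; the function name is hypothetical and the estimate ignores any framework overhead.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Assumed Llama 3.1 405B configuration (from Meta's published model details).
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 126, 128, 8, 128
SEQ_LEN = 131072  # full 128K-token context

# Multi-head attention would keep one KV head per query head.
mha_bytes = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped-query attention keeps only the shared KV heads.
gqa_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

if __name__ == "__main__":
    print(f"MHA cache: {mha_bytes / 1e9:.1f} GB per sequence")
    print(f"GQA cache: {gqa_bytes / 1e9:.1f} GB per sequence")
    print(f"Reduction factor: {mha_bytes // gqa_bytes}x")
```

The reduction factor is simply the ratio of query heads to key-value heads, so it depends on the model configuration rather than being a fixed percentage.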
FAQ
What is Llama?
Llama is a family of large language models (LLMs) developed by Meta AI, released as open-weight models for research and commercial use, setting benchmarks in efficiency and performance.
How does Llama work?
Llama models are autoregressive transformers: given a sequence of tokens, they predict the next token, and generation repeats that step. The architecture uses pre-normalization with RMSNorm, SwiGLU activations in the feed-forward layers, and rotary positional embeddings (RoPE). Larger models such as Llama 3.1 405B also use grouped-query attention (GQA) for efficient inference.
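Two of the architectural components named in this entry, RMSNorm pre-normalization and the SwiGLU feed-forward layer, can be illustrated in a few lines of dependency-free Python. This is a toy sketch on plain lists, not Meta's implementation: the function names, the weight layout (one row per output unit), and the default epsilon are assumptions made for illustration, and attention and RoPE are omitted.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Divide x by its root mean square, then apply a learned per-element gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization: normalize first, run the sub-layer, then add the residual."""
    ffn_out = swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)
    return [xi + fi for xi, fi in zip(x, ffn_out)]


if __name__ == "__main__":
    # Toy sizes: model dimension 2, hidden dimension 3, arbitrary weights.
    x = [1.0, -2.0]
    w_gate = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
    w_up = [[0.2, 0.1], [-0.3, 0.4], [0.1, 0.1]]
    w_down = [[0.5, -0.2, 0.1], [0.3, 0.3, -0.4]]
    print(pre_norm_ffn_block(x, [1.0, 1.0], w_gate, w_up, w_down))
```

The key design point is the ordering in pre_norm_ffn_block: the normalization is applied to the sub-layer input while the residual path carries the unnormalized x, which is what "pre-normalization" refers to.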
Where is Llama used in 2026?
Llama serves as the base for fine-tuned variants such as Code Llama (code generation and infilling), Llama Guard (safety classification of prompts and responses), and Meditron (medical question answering). It is typically deployed via Hugging Face Transformers, vLLM, or Ollama, including local inference on laptops and edge devices with the small Llama 3.2 models, and is favored by organizations that need transparency, customizability, and control over deployment.