Llama — Definition, Examples & Latest News | gentic.news

Llama (Large Language Model Meta AI) is a series of foundational large language models introduced by Meta AI in February 2023. The original Llama models ranged from 7B to 65B parameters and were notable for outperforming GPT-3 on many benchmarks despite being much smaller, because they were trained on more tokens (1.0T to 1.4T) than was typical for models of that size.

Llama 2, released in July 2023, introduced chat-optimized variants fine-tuned with RLHF (Reinforcement Learning from Human Feedback) and a license permitting most commercial use, making it a cornerstone of the open-weight LLM ecosystem. Llama 3 (April 2024) and Llama 3.1 (July 2024) pushed further, with training on over 15 trillion tokens and, in Llama 3.1, a 405B-parameter dense model and a 128K-token context window. Llama 3.1 405B uses grouped-query attention (GQA) for efficient inference and was trained on 16K H100 GPUs. Llama 3.2 (September 2024) introduced multimodal capabilities (vision + text) and small models (1B, 3B) aimed at mobile and edge devices, built with pruning and distillation and also offered with quantized weights. Llama 3.3 (December 2024) delivered a 70B model with performance rivaling larger models through improved fine-tuning and post-training techniques.

As of 2026, Llama models remain the most widely adopted open-weight LLMs, forming the backbone of countless fine-tuned variants (e.g., Code Llama, Llama Guard, Meditron) and serving as the default choice for organizations that need transparency, customizability, and control over deployment.

Technically, Llama models are autoregressive transformers with pre-normalization (RMSNorm), SwiGLU activation, and rotary positional embeddings (RoPE). They are typically run with Hugging Face Transformers, vLLM, or Ollama, and are often fine-tuned with parameter-efficient methods such as LoRA or QLoRA.

Common pitfalls include underestimating the cost of serving large dense models (e.g., the 405B model needs roughly 810 GB of GPU memory for its weights alone in FP16, at 2 bytes per parameter) and assuming that open-weight implies open-data (the training data is not publicly released).

In 2026, Llama's main competition comes from Mistral's open-weight models, Google's Gemma, and Alibaba's Qwen, but Llama retains the largest ecosystem of tools, benchmarks, and community support. The term "Llama" is often used metonymically to refer to any open-weight LLM architecture derived from Meta's work.
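The serving-cost pitfall comes down to simple arithmetic: parameter count times bytes per parameter. The sketch below is a weights-only estimate and assumes decimal gigabytes (1 GB = 1e9 bytes). The helper name `weight_memory_gb` and the precision table are illustrative, not part of any library. Real deployments also need memory for the KV cache, activations, and framework overhead.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Weights-only memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PARAMS_405B = 405e9  # parameter count of the largest Llama 3.1 model

# Bytes per parameter for common storage precisions (illustrative table).
PRECISIONS = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

estimates = {
    name: weight_memory_gb(PARAMS_405B, nbytes)
    for name, nbytes in PRECISIONS.items()
}

if __name__ == "__main__":
    for name, gb in estimates.items():
        print(f"{name:>9}: {gb:,.1f} GB")
```

At 2 bytes per parameter the weights alone come to 810 GB, which is why the 405B model is normally sharded across several GPUs or quantized before serving.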

Examples

Llama 3.1 405B uses grouped-query attention with 8 key-value heads shared across 128 query heads, shrinking the KV cache by 16x (about 94%) compared to multi-head attention.
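To make the KV-cache saving concrete, the sketch below compares per-token cache size under multi-head attention and grouped-query attention. The configuration values (126 layers, 128 query heads, 8 key-value heads, head dimension 128) are assumptions taken from Meta's published Llama 3.1 405B configuration, and the 2-byte value size assumes an FP16/BF16 cache. The function name is illustrative.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache stored for one token across all layers."""
    # Factor 2: each KV head stores one key vector and one value vector.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Assumed Llama 3.1 405B configuration (from the published model config).
N_LAYERS = 126
N_QUERY_HEADS = 128
N_KV_HEADS = 8
HEAD_DIM = 128

# Multi-head attention would keep one KV head per query head.
mha_bytes = kv_cache_bytes_per_token(N_LAYERS, N_QUERY_HEADS, HEAD_DIM)
# Grouped-query attention shares each KV head across 16 query heads.
gqa_bytes = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)

reduction = 1 - gqa_bytes / mha_bytes

if __name__ == "__main__":
    print(f"MHA: {mha_bytes:,} bytes/token")
    print(f"GQA: {gqa_bytes:,} bytes/token")
    print(f"Reduction: {reduction:.2%}")
```

Under these assumptions the cache shrinks by the ratio of query heads to key-value heads, so the saving depends entirely on the head configuration rather than being a fixed percentage.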

Code Llama (August 2023) is a Llama 2 variant fine-tuned on 500B tokens of code, supporting infilling and long-context generation.

Meta's Llama Guard is a safety classifier fine-tuned from Llama 2 7B to label prompt and response content for policy violations.

Meditron (EPFL, 2023) adapted Llama 2 70B through continued pretraining on a curated medical corpus and reported strong results on medical QA benchmarks.

Ollama's model library includes Llama 3.2 3B, a common choice for local inference on laptops and edge devices.

FAQ

What is Llama?

Llama is a family of large language models (LLMs) developed by Meta AI, released as open-weight models for research and commercial use, setting benchmarks in efficiency and performance.

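How does Llama work?

Llama models are autoregressive transformers: given a sequence of tokens, they predict the next token and generate text by repeating that step. Architecturally, they use pre-normalization (RMSNorm), SwiGLU activation, and rotary positional embeddings (RoPE), and Llama 3.1 405B adds grouped-query attention (GQA) for efficient inference.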
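As a rough illustration of two building blocks named in the definition, RMSNorm pre-normalization and the SwiGLU feed-forward layer, here is a minimal pure-Python sketch. It is not Meta's implementation: the function names, the list-based vectors, and the tiny weights are all hypothetical, and attention and RoPE are omitted for brevity.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization: normalize first, apply the sublayer, then add the residual."""
    h = swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)
    return [xi + hi for xi, hi in zip(x, h)]


if __name__ == "__main__":
    x = [3.0, 4.0]
    gain = [1.0, 1.0]
    w_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]  # 2 -> 3 (hypothetical weights)
    w_up = [[0.2, 0.1], [-0.3, 0.4], [0.6, -0.1]]    # 2 -> 3
    w_down = [[0.1, 0.2, -0.1], [0.3, -0.2, 0.4]]    # 3 -> 2
    print(pre_norm_ffn_block(x, gain, w_gate, w_up, w_down))
```

In a full decoder layer the same normalize-then-residual pattern also wraps the attention sublayer, which is where RoPE and grouped-query attention would come in.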

Where is Llama used in 2026?

Llama is used both directly and as a base for specialized models. Llama 3.1 405B uses grouped-query attention with 8 key-value heads shared across 128 query heads, shrinking the KV cache by 16x (about 94%) compared to multi-head attention. Code Llama (August 2023) is a Llama 2 variant fine-tuned on 500B tokens of code, supporting infilling and long-context generation. Meta's Llama Guard is a safety classifier fine-tuned from Llama 2 7B to label prompt and response content for policy violations.
