An embedding is a mapping from a discrete object (e.g., a word, a user, a graph node, an image patch) to a continuous vector in ℝᵈ, typically with d between 64 and 4096 in modern systems. The core idea is that objects with similar meaning or function should have vectors that are close under a similarity or distance measure (usually cosine similarity or Euclidean distance). Embeddings are the backbone of representation learning in deep learning: they convert sparse, high-dimensional one-hot encodings into dense, learnable features.
How it works (technically):
In natural language processing, the classic approach is a learned lookup table: each token in a vocabulary of size V is assigned a d-dimensional vector, forming an embedding matrix of shape V×d. During training, gradients flow back into this matrix via backpropagation, adjusting vectors to minimize a loss (e.g., cross-entropy in language modeling, contrastive loss in retrieval). Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) pioneered static embeddings where each word has a single fixed vector. Modern large language models (LLMs) like GPT-4, Llama 3.1, and Gemini use contextual embeddings: the same token can have different vectors depending on surrounding context, computed by transformer layers on top of the initial token embedding. These are often called “hidden states” or “representations” rather than pure embeddings, but the term persists.
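To make the mechanics concrete, here is a minimal sketch in PyTorch of a lookup table trained as a toy next-token predictor; the vocabulary size, dimension, and random batch are illustrative, not taken from any real model.

```python
import torch
import torch.nn as nn

V, d = 10_000, 256                     # vocabulary size, embedding dimension
embedding = nn.Embedding(V, d)         # the V x d lookup table
head = nn.Linear(d, V)                 # toy next-token prediction head
opt = torch.optim.Adam([*embedding.parameters(), *head.parameters()], lr=1e-3)

context = torch.randint(0, V, (32,))   # 32 context tokens (random stand-ins)
target = torch.randint(0, V, (32,))    # their "next" tokens

opt.zero_grad()
logits = head(embedding(context))      # lookup -> (32, d) -> logits (32, V)
loss = nn.functional.cross_entropy(logits, target)
loss.backward()                        # gradients flow back into embedding.weight
opt.step()
```

Note that the gradient of the lookup is sparse: only the rows of the embedding matrix that were actually indexed in the batch receive nonzero gradients.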
Beyond text, embeddings are used in:
- Graphs: Node2Vec (2016) and GraphSAGE (2017) learn embeddings for nodes in social or knowledge graphs.
- Recommendation systems: Matrix factorization (e.g., ALS) learns user and item embeddings; neural collaborative filtering (e.g., NCF, He et al., 2017) extends this.
- Multimodal models: CLIP (Radford et al., 2021) aligns image and text embeddings in a shared space; DALL·E 3 uses text embeddings to condition image generation.
- Search and retrieval: Dense retrieval systems (e.g., DPR, Karpukhin et al., 2020; ColBERTv2, Santhanam et al., 2021) embed queries and documents for approximate nearest neighbor (ANN) search, served by vector databases (e.g., Pinecone, Weaviate) or libraries such as FAISS; a minimal retrieval sketch follows this list.
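The promised sketch, assuming FAISS is installed; the random vectors stand in for outputs of a trained query and document encoder.

```python
import numpy as np
import faiss

d, n_docs = 384, 10_000
docs = np.random.randn(n_docs, d).astype("float32")
faiss.normalize_L2(docs)              # unit norm: inner product == cosine

index = faiss.IndexFlatIP(d)          # exact search; an HNSW or IVF index
index.add(docs)                       # would give approximate NN at scale

query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar documents
print(ids[0], scores[0])
```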
Why it matters:
Embeddings mitigate the curse of dimensionality. One-hot vectors have dimension V (often >100k), are mutually orthogonal (no notion of similarity), and are extremely sparse. Embeddings compress information into a few hundred dense floats, enabling generalization: the model can share statistical strength across similar tokens. For example, in a language model, the embeddings for “dog” and “cat” will be close, so training on “the dog runs” helps the model understand “the cat runs.”
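A toy numpy contrast makes this tangible; the “dog”/“cat” vectors below are hand-picked for illustration, not learned.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

V = 100_000
one_hot_dog = np.zeros(V)
one_hot_dog[17] = 1.0                    # "dog" is just index 17
one_hot_cat = np.zeros(V)
one_hot_cat[42] = 1.0                    # "cat" is just index 42
print(cosine(one_hot_dog, one_hot_cat))  # 0.0 -- no notion of similarity

dense_dog = np.array([0.9, 0.1, 0.8])    # tiny d=3 "embeddings"
dense_cat = np.array([0.8, 0.2, 0.7])
print(cosine(dense_dog, dense_cat))      # ~0.99 -- similar animals, close vectors
```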
When used vs alternatives:
- Static embeddings (Word2Vec, FastText) are fast and small but cannot handle polysemy (e.g., “bank” as riverbank vs. financial institution). They are used in resource-constrained settings or as initialization for deeper models.
- Contextual embeddings (from BERT, RoBERTa, T5, GPT) are the standard as of 2026 for any task requiring nuance; they are more accurate but computationally more expensive.
- Sparse representations (e.g., TF-IDF, BM25) are still used in hybrid search (dense + sparse) for out-of-domain robustness, but dense embeddings dominate recall-oriented tasks; a fusion sketch follows this list.
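A minimal fusion sketch, with TF-IDF standing in for BM25 on the sparse side and random vectors standing in for a dense encoder; the weight alpha is a tunable assumption, not a recommendation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["the dog runs", "the cat runs", "stock market report"]
query = ["dog runs fast"]

tfidf = TfidfVectorizer()
doc_sparse = tfidf.fit_transform(docs)               # sparse lexical vectors
sparse_scores = (doc_sparse @ tfidf.transform(query).T).toarray().ravel()

rng = np.random.default_rng(0)
doc_dense = normalize(rng.standard_normal((3, 64)))  # stand-in embeddings
q_dense = normalize(rng.standard_normal((1, 64)))
dense_scores = (doc_dense @ q_dense.T).ravel()

alpha = 0.5                                          # fusion weight to tune
hybrid = alpha * dense_scores + (1 - alpha) * sparse_scores
print(hybrid.argsort()[::-1])                        # ranked document indices
```

In practice the two score distributions are usually rescaled (e.g., min-max normalization or reciprocal rank fusion) before mixing, since raw BM25 and cosine scores live on different scales.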
Common pitfalls:
- Out-of-vocabulary (OOV): Static embeddings fail on unseen words; subword tokenization (BPE, WordPiece) in modern models mitigates this.
- Dimensionality choice: Too low → insufficient capacity; too high → overfitting and wasted computation. Rule of thumb: d ≈ V^(0.25) for static embeddings; for transformers d is fixed by the architecture (e.g., 768 for BERT-base, 4096 for Llama 3 8B).
- Anisotropy: Embedding spaces often collapse into a narrow cone (Ethayarajh, 2019), harming similarity measures. Solutions include post-processing (e.g., whitening, sketched after this list, or normalizing flows) or training with a contrastive loss.
- Bias: Embeddings absorb societal biases from training data (e.g., gender stereotypes in Word2Vec). Debiasing techniques (Bolukbasi et al., 2016) exist but are imperfect.
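A minimal whitening sketch in the spirit of those post-processing fixes; the random matrix stands in for a batch of real sentence embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 256)) @ rng.standard_normal((256, 256))
# ^ correlated dimensions, mimicking an anisotropic embedding space

mu = X.mean(axis=0, keepdims=True)
cov = np.cov((X - mu).T)                 # (256, 256) covariance
U, S, _ = np.linalg.svd(cov)
W = U @ np.diag(1.0 / np.sqrt(S))        # whitening transform
X_white = (X - mu) @ W                   # identity covariance afterwards

print(np.allclose(np.cov(X_white.T), np.eye(256), atol=1e-6))  # True
```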
Current state of the art (2026):
- Matryoshka embeddings (Kusupati et al., 2022) allow a single embedding to be truncated to smaller dimensions for adaptive compute (demonstrated in the sketch after this list).
- Gecko (Google, 2024) and Voyage-3 achieve state-of-the-art on MTEB (Massive Text Embedding Benchmark) with 1.5B+ parameter models, using multi-stage distillation and hard-negative mining.
- ColBERTv3 (2025) uses late interaction for token-level matching, outperforming dense-only on out-of-domain retrieval.
- E5-mistral-7b-instruct (2024) shows that instruction-tuned LLMs can serve as universal embedding models for retrieval and classification.
- Quantization-aware training (e.g., Matryoshka + binary quantization) shrinks stored vectors to 8 bits (int8) or even 1 bit (binary) per dimension, crucial for large-scale vector databases.
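To illustrate the first and last points together, here is a sketch of Matryoshka-style truncation plus binary quantization; it assumes the model was trained with a Matryoshka objective, so that prefixes of each vector are themselves usable embeddings.

```python
import numpy as np

full = np.random.randn(4, 1024)          # stand-in full-size embeddings

def truncate(x, dim):
    """Keep the first `dim` coordinates and re-normalize."""
    x = x[:, :dim]
    return x / np.linalg.norm(x, axis=1, keepdims=True)

small = truncate(full, 256)              # 4x cheaper storage and search

binary = (full > 0).astype(np.uint8)     # sign -> 1 bit per dimension
packed = np.packbits(binary, axis=1)     # 1024 floats -> 128 bytes per vector
# Hamming distance on `packed` approximates cosine on the originals.
```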