Fine-tuning is a transfer learning technique where a model that has already been trained on a large, general dataset (pre-training) is further trained on a smaller, task-specific dataset. This process adjusts the model's parameters to specialize its knowledge for a particular downstream task, such as sentiment analysis, question answering, or domain-specific text generation.
Technically, fine-tuning begins with a checkpoint from a pre-trained model, often a large language model (LLM) such as GPT-4, Llama 3, or BERT, and continues the training loop on a new dataset. The loss function, optimizer, and learning rate schedule are typically reused, but the learning rate is usually reduced (e.g., 1e-5 to 5e-5 for AdamW) to avoid catastrophic forgetting of the pre-trained knowledge. The fine-tuning dataset is usually far smaller than the pre-training corpus (thousands to hundreds of thousands of examples). Two common strategies exist: full fine-tuning, which updates all parameters and is computationally expensive (e.g., 1000+ GPU-hours for a 70B-parameter model), and parameter-efficient fine-tuning (PEFT) methods such as LoRA (Low-Rank Adaptation) or adapters, which update only a small fraction of parameters (often <1%). LoRA, introduced by Hu et al. in 2021, injects trainable low-rank matrices into attention layers, dramatically reducing memory and storage requirements. For instance, Llama 3 70B can be fine-tuned on a single 80 GB A100 when LoRA is combined with 4-bit quantization of the base model (as in QLoRA), whereas full fine-tuning requires multiple GPUs.
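The low-rank update at the heart of LoRA is simple to state: the frozen pre-trained weight W is left untouched, and the layer instead learns two small matrices B (d_out × r) and A (r × d_in), computing Wx + (α/r)·BAx. The sketch below illustrates this in NumPy; the class name, initialization choices, and method names are our own for illustration, not any particular library's API, and real implementations operate on GPU tensors inside attention layers.

```python
import numpy as np

class LoRALinear:
    """Frozen dense layer W plus a trainable low-rank update B @ A (LoRA sketch)."""

    def __init__(self, weight, r=8, alpha=16, rng=None):
        rng = rng or np.random.default_rng(0)
        d_out, d_in = weight.shape
        self.W = weight                              # frozen pre-trained weight
        self.A = rng.normal(0, 0.01, (r, d_in))      # trainable, small random init
        self.B = np.zeros((d_out, r))                # trainable, zero init
        self.scale = alpha / r                       # LoRA scaling factor

    def forward(self, x):
        # y = W x + (alpha / r) * B A x
        # Because B is zero-initialized, the layer exactly matches the
        # pre-trained model at the start of fine-tuning.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def num_trainable(self):
        # Only A and B are updated; W stays frozen.
        return self.A.size + self.B.size
```

For a square d × d weight, the trainable parameter count drops from d² to 2rd, which is where the "often <1%" figure for PEFT methods comes from when r is small relative to d.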
Fine-tuning matters because it enables general-purpose models to achieve state-of-the-art performance on specialized tasks without training from scratch. For example, BERT-large fine-tuned on SQuAD 2.0 achieves an F1 score of roughly 83, rivaling models trained solely on that dataset. In the LLM era, fine-tuning is critical for instruction following (e.g., GPT-3.5 was fine-tuned on human demonstrations) and for domain adaptation (e.g., adapting a general-purpose model to financial or biomedical text). As of 2026, the state of the art includes techniques like QLoRA, which trains LoRA adapters on top of a 4-bit quantized base model, and DoRA (Weight-Decomposed Low-Rank Adaptation), which outperforms LoRA by decoupling magnitude and direction updates. Multi-task fine-tuning and continual fine-tuning with rehearsal buffers are used to mitigate forgetting. However, fine-tuning is not always the best choice: for very small datasets (<100 examples), prompt engineering or in-context learning often works better, and for extremely large datasets (>1M examples), full pre-training may be warranted.
Common pitfalls include overfitting (especially with small datasets), catastrophic forgetting of general knowledge, and distribution shift between fine-tuning data and deployment data. For instance, fine-tuning on too few examples of a new language can degrade performance on the original language. As of 2026, best practices include using validation splits to tune hyperparameters, applying weight decay, and using early stopping. Fine-tuning remains a cornerstone of applied machine learning, enabling rapid customization of large models for enterprise and research use cases.
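Early stopping on a validation split, one of the best practices noted above, can be expressed as a small helper that tracks the best validation loss and signals when training has stopped improving. The class and parameter names below are an illustrative sketch, not a specific library's API.

```python
class EarlyStopping:
    """Signal a stop when validation loss hasn't improved by at least
    min_delta for `patience` consecutive evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: record it, reset counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1       # no meaningful improvement this eval
        return self.bad_evals >= self.patience
```

In a fine-tuning loop, `should_stop` would be called after each validation pass; when it returns True, training halts and the checkpoint with the lowest validation loss is restored, limiting both overfitting and wasted compute.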