gentic.news — AI News Intelligence Platform

F1 Score: definition + examples

F1 Score is a classification evaluation metric that combines precision and recall into a single number using their harmonic mean. Unlike the arithmetic mean, the harmonic mean penalizes extreme values, so a high F1 Score requires both precision and recall to be reasonably high. The formula is: F1 = 2 * (precision * recall) / (precision + recall).
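As a minimal sketch in plain Python (not any particular library's API), the formula and the harmonic mean's behavior look like this:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean punishes imbalance between its two inputs:
# the arithmetic mean of 1.0 and 0.0 is 0.5, but F1 is 0.0.
print(f1_score(1.0, 0.0))  # 0.0
# When precision == recall, F1 equals that common value.
print(f1_score(0.8, 0.8))
```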

Precision measures the proportion of positive identifications that were actually correct: TP / (TP + FP). Recall measures the proportion of actual positives that were identified correctly: TP / (TP + FN). F1 Score is symmetric with respect to precision and recall — swapping them yields the same value.
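With made-up confusion-matrix counts (purely illustrative), the three quantities can be computed directly, and the symmetry is easy to check:

```python
# Made-up confusion-matrix counts, purely for illustration.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)  # 8/10 = 0.8
recall = tp / (tp + fn)     # 8/12 ~= 0.667

f1 = 2 * precision * recall / (precision + recall)

# Symmetry: swapping the two arguments leaves the result unchanged.
swapped = 2 * recall * precision / (recall + precision)
assert f1 == swapped
```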

F1 Score is most useful when the costs of false positives and false negatives are roughly equal and the dataset is class-imbalanced. For example, in medical diagnostics where only 1% of patients have a disease, a model can reach 99% accuracy by always predicting 'no disease', but F1 Score would expose the poor performance, since recall (and therefore F1) would be 0.
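The diagnostics example can be reproduced numerically; the 1%-positive dataset below is synthetic:

```python
# Synthetic dataset: 1000 patients, 10 with the disease (1% positives).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # trivial model: always predict 'no disease'

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp) if tp + fp else 0.0  # 0/0 treated as 0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy)  # 0.99
print(f1)        # 0.0
```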

When to use alternatives: If false positives are far more costly than false negatives (e.g., spam filtering, where deleting a legitimate email is worse than letting spam through), precision-weighted metrics such as F-beta with beta < 1 are better. If false negatives are more costly (e.g., cancer screening), F-beta with beta > 1 (such as the F2 Score) is preferred. For multi-class problems, macro-F1 (the unweighted average of per-class F1 scores) treats all classes equally, while micro-F1 (computed from global counts of TP/FP/FN) reduces to accuracy in the single-label multi-class setting, since every misclassification contributes exactly one FP and one FN. Weighted-F1 averages per-class F1 scores weighted by each class's support.
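The F-beta family and the macro/micro distinction can be sketched with a toy single-label 3-class example (all labels invented for illustration):

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta: beta < 1 weights precision more, beta > 1 weights recall more."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# With precision above recall, the precision-weighted score is highest:
p, r = 0.9, 0.6
assert fbeta(p, r, 0.5) > fbeta(p, r, 1.0) > fbeta(p, r, 2.0)

# Macro- vs micro-F1 on a toy single-label 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

def class_f1(c: int) -> float:
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

macro_f1 = sum(class_f1(c) for c in (0, 1, 2)) / 3

# Micro counts: each correct prediction is one TP; each error contributes
# one FP (for the predicted class) and one FN (for the true class),
# which is why micro-F1 equals accuracy here.
correct = sum(t == p for t, p in zip(y_true, y_pred))
errors = len(y_true) - correct
micro_p = correct / (correct + errors)
micro_r = correct / (correct + errors)
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
accuracy = correct / len(y_true)
assert abs(micro_f1 - accuracy) < 1e-12
```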

Common pitfalls: Using F1 Score on highly imbalanced datasets without stating the baseline. For a dataset with 99% negatives, a model that predicts all negatives has recall = 0 and undefined precision (0/0, conventionally reported as 0), so F1 = 0. Unlike accuracy, where the majority-class baseline is 99%, the baseline F1 for a majority-class classifier is therefore 0, and any nonzero F1 already beats it. Also, F1 Score does not consider true negatives, so it ignores correct rejections entirely.
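The true-negative blind spot is easy to demonstrate with hypothetical counts: adding true negatives changes accuracy but leaves F1 untouched, since TN appears nowhere in the F1 formula.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    # Note: true negatives do not appear anywhere in this formula.
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    return (tp + tn) / (tp + fp + fn + tn)

# Same TP/FP/FN, very different TN counts (hypothetical numbers).
f1 = f1_from_counts(50, 10, 10)           # identical in both scenarios
acc_small = accuracy(50, 10, 10, 30)      # 0.8
acc_large = accuracy(50, 10, 10, 30_000)  # ~0.999
```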

Current state of the art (2026): In large language model (LLM) evaluation, F1 Score remains standard for token-level tasks like named entity recognition (NER) and span extraction. For example, the GLUE benchmark's QQP (Quora Question Pairs) task uses F1 Score. In retrieval-augmented generation (RAG) pipelines, F1 Score is used to evaluate the overlap between generated answers and reference answers (e.g., the ROUGE-L F1 variant). In computer vision, mean Average Precision (mAP) has largely replaced F1 for object detection, but F1 is still used for binary segmentation tasks. Recent work (2024-2026) has proposed 'calibrated F1' that adjusts for dataset priors, and 'confidence-weighted F1' for probabilistic classifiers. The metric remains a staple in scikit-learn, Hugging Face Evaluate, and MLflow.

Examples

  • The GLUE benchmark's Quora Question Pairs (QQP) task uses F1 Score as the primary metric.
  • Hugging Face's evaluate.load('f1') computes F1 for binary and multi-class classification.
  • In medical imaging, the CheXpert competition used F1 Score for pneumonia detection from chest X-rays.
  • Named entity recognition (NER) on the CoNLL-2003 dataset is reported with entity-level F1 scores (e.g., BERT-large reaches 92.8 test F1).
  • The SQuAD question-answering benchmark reports token-overlap F1 alongside exact match.

Related terms

  • Precision
  • Recall
  • F-beta Score
  • AUC-ROC
  • Confusion Matrix

FAQ

What is F1 Score?

F1 Score is the harmonic mean of precision and recall, balancing false positives and false negatives. It ranges from 0 (worst) to 1 (best) and is used when classes are imbalanced.

How does F1 Score work?

F1 Score combines precision and recall into a single number via their harmonic mean: F1 = 2 * (precision * recall) / (precision + recall). Because the harmonic mean penalizes extreme values, a high F1 requires both precision (TP / (TP + FP)) and recall (TP / (TP + FN)) to be reasonably high.

Where is F1 Score used in 2026?

The GLUE benchmark's Quora Question Pairs (QQP) task uses F1 Score as the primary metric. Hugging Face's evaluate.load('f1') computes F1 for binary and multi-class classification. In medical imaging, the CheXpert competition used F1 Score for pneumonia detection from chest X-rays.