A new arXiv paper from April 2026 finds that Prithvi-EO and ViT-Base embeddings yield uniformly negative R² in cross-country maize yield prediction. The study evaluates 6,404 field observations across five African countries using a leave-one-country-out scheme.
Key Facts
- 6,404 maize field observations from five African countries.
- Prithvi-EO / Ridge achieves least-negative LOCO R² of −0.027.
- All nine feature–regressor combinations yield negative cross-country R².
- Within-country random CV yields moderate R²; cross-country collapses.
- Paper argues yield distribution shift, not representation, is the limit.
Geospatial foundation models are marketed as universal feature extractors for Earth observation tasks, but a rigorous generalization test in sub-Saharan Africa shows they fail to transfer across national boundaries. The paper, Do Foundation Model Embeddings Improve Cross-Country Crop Yield Generalisation? A Leave-One-Country-Out Evaluation in Sub-Saharan Africa [arXiv], tests Prithvi-EO-1.0-100M (a NASA-developed Vision Transformer pretrained on satellite imagery) and ViT-Base against traditional Sentinel-2 spectral indices.
The core finding: every feature-regressor combination achieves negative R² under leave-one-country-out (LOCO) cross-validation. Within-country random splits yield moderate R², but the moment the model must predict on an unseen country, performance collapses. The best result comes from Prithvi-EO with Ridge regression, scoring −0.027 R². That means the models are worse than simply predicting the mean yield of the target country.
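The divergence between random splits and country-held-out splits is easy to reproduce on synthetic data. The sketch below is illustrative only — the dimensions, the per-country yield offset, and the Ridge settings are assumptions, not the paper's data or code — but it uses the same scikit-learn machinery (LeaveOneGroupOut, Ridge, r2_score) such an evaluation would typically rest on:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, LeaveOneGroupOut

# Illustrative stand-ins: 500 fields with 32-dim "embeddings", a country
# id, and a yield that depends on the features *plus* a per-country
# offset the features do not encode (the distribution shift).
rng = np.random.default_rng(0)
n, d = 500, 32
X = rng.normal(size=(n, d))
country = rng.integers(0, 5, size=n)
w = rng.normal(size=d) * 0.25
y = X @ w + 1.0 * country + rng.normal(scale=0.3, size=n)

def cv_r2(splits):
    scores = []
    for tr, te in splits:
        model = Ridge(alpha=1.0).fit(X[tr], y[tr])
        scores.append(r2_score(y[te], model.predict(X[te])))
    return float(np.mean(scores))

# Random splits mix all countries into train and test; LOCO holds one out.
within = cv_r2(KFold(n_splits=5, shuffle=True, random_state=0).split(X))
loco = cv_r2(LeaveOneGroupOut().split(X, y, groups=country))
print(f"random-split R^2: {within:.2f}   LOCO R^2: {loco:.2f}")
```

Because r2_score normalizes by the held-out country's own yield variance, any model that cannot anticipate that country's mean level scores below zero — the random-split number stays moderately positive while the LOCO number goes negative, mirroring the paper's pattern.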
Why Foundation Models Don't Help
The paper's unique take: the bottleneck is not representation quality but a shift in yield distribution between countries. Even frozen Prithvi-EO embeddings, which encode rich spatial-spectral features, cannot compensate for the fact that maize yields in Kenya follow a different distribution than those in Tanzania. The authors argue that most published benchmarks overstate generalization by reporting only within-country performance.
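The arithmetic of a pure mean shift makes the argument concrete: if a model effectively predicts the training countries' mean yield while the target country's mean sits Δ away, then R² ≈ 1 − (Δ² + σ²)/σ² = −(Δ/σ)², no matter how good the features are. A toy check (the yield values here are illustrative, not from the paper):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
sigma, delta = 0.5, 1.0  # within-country spread and cross-country mean shift (t/ha)

# Held-out country: mean shifted by delta relative to the training countries.
y_target = rng.normal(loc=3.0 + delta, scale=sigma, size=10_000)
pred = np.full_like(y_target, 3.0)  # model stuck at the training-country mean

print(r2_score(y_target, pred))  # approx. -(delta/sigma)**2 = -4
```

A shift of just two within-country standard deviations is enough to drive R² to roughly −4, which is why even a small residual mean offset leaves every LOCO score below zero.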
This echoes a broader pattern in applied ML: foundation models excel when the test distribution closely matches the training distribution, but their value diminishes under severe covariate shift. The paper releases a reproducible negative benchmark — a rare and valuable contribution for a field that tends to publish only positive results.
Implications for Food Security AI
Accurate cross-country yield forecasting is critical for food security planning in sub-Saharan Africa, where smallholder maize farming dominates. The negative result suggests that purely satellite-based models, even with foundation model embeddings, cannot replace ground-truth yield surveys or country-specific calibration. Future work must either collect more representative training data or develop methods to handle distribution shift explicitly.
The study joins a growing body of work showing that foundation models for Earth observation are not silver bullets. A prior paper from April 2026 [arXiv] evaluating nine pretrained audio models for music recommendation similarly found that pretraining does not guarantee cross-domain transfer.
What to Watch
Watch for follow-up work that attempts to close the LOCO generalization gap — either through domain adaptation techniques, multi-task learning across countries, or integration of non-satellite data sources like soil surveys and market prices. The authors' released benchmark provides a standardized evaluation protocol for future methods to beat. Also worth monitoring is whether NASA or IBM adjust Prithvi-EO training to include more geographically diverse yield data.