![CANDOR: Counterfactual ANnotated DOubly Robust off-policy evaluation ...](https://shengpu-tang.me/publication/candor/featured.png)

Listen to today's AI briefing

Daily podcast — 5 min, AI-narrated summary of top stories

Data scientist comparing IPS, SNIPS, and Doubly Robust evaluation methods on a laptop with ranking model charts visible

AI ResearchScore: 98

Counterfactual Evaluation in Ads: IPS, SNIPS, and Doubly Robust Explained

Towards AI article explains counterfactual evaluation methods (IPS, SNIPS, doubly robust) for ad ranking models. These techniques estimate model performance from logged data without A/B tests, critical for recommendation systems in retail.

AAAla SMITH & AI Research Desk·Jun 3, 2026·4 min read··102 views·AI-Generated·Report error

Source: pub.towardsai.netvia medium_recsysWidely Reported

How do IPS, SNIPS, and doubly robust estimation work for counterfactual evaluation in ad ranking models?

The article covers three counterfactual evaluation methods—IPS, SNIPS, and doubly robust—for estimating ad ranking model performance from logged data, avoiding costly A/B tests. It explains the bias-variance tradeoff and practical implementation considerations.

TL;DR

New article explains how to evaluate ad ranking models without A/B tests using IPS, SNIPS, and doubly robust estimation.

Key Takeaways

Towards AI article explains counterfactual evaluation methods (IPS, SNIPS, doubly robust) for ad ranking models.
These techniques estimate model performance from logged data without A/B tests, critical for recommendation systems in retail.

What Happened

CANDOR: Counterfactual ANnotated DOubly Robust off-policy evaluation ...

A new article on Towards AI breaks down the challenge every recommendation system team faces: how to evaluate a new ranking model without running a full A/B test. The article focuses on three counterfactual evaluation methods—Inverse Propensity Scoring (IPS), Self-Normalized Inverse Propensity Scoring (SNIPS), and Doubly Robust estimation—that allow teams to estimate model performance from historical logged data.

The core problem is selection bias: the data you have was generated by the old model, not the new one. Comparing outcomes directly would be misleading because the items shown (and not shown) were chosen by a different policy. Counterfactual evaluation corrects for this by reweighting or modeling the missing counterfactuals.

Technical Details

IPS (Inverse Propensity Scoring) reweights each logged observation by the inverse of the probability that the old model would have taken that action. If a recommendation was rare under the old policy, its outcome gets higher weight when evaluating the new policy. The article explains the bias-variance tradeoff: IPS is unbiased but can have high variance when propensity scores are small.

SNIPS (Self-Normalized IPS) addresses this by normalizing the weights so they sum to 1, reducing variance at the cost of a small bias. This makes it more stable in practice, especially when propensity scores vary widely.

Doubly Robust combines IPS with a direct model of the outcome (e.g., a regression model predicting click-through rate). It is unbiased if either the propensity model or the outcome model is correct—hence "doubly robust." The article walks through the math and practical considerations, including how to estimate propensity scores from logged data.

Retail & Luxury Implications

For retail and luxury recommendation systems—whether for e-commerce product ranking, personalized email campaigns, or ad targeting—this is directly applicable. A/B tests are expensive, time-consuming, and risk exposing customers to suboptimal recommendations. Counterfactual evaluation lets teams iterate faster.

Product recommendation ranking: A luxury fashion retailer can evaluate a new ranking model using historical click and purchase data, without running a live test that might show less relevant products to high-value customers.
Ad creative optimization: IPS can estimate which ad creative would have performed better under a different targeting policy, using past campaign data.
Personalization for VIP segments: For small but high-value segments (e.g., VIC customers), A/B tests may not have enough statistical power. Counterfactual methods can provide faster, more reliable estimates.

However, the article acknowledges a key limitation: these methods rely on accurate propensity estimation. If the old model's logging policy is not well understood, estimates can be biased. For luxury brands with complex, multi-stage personalization, this requires careful implementation.

Business Impact

Counterfactual Evaluation for Recommendation Systems

Faster iteration: Teams can evaluate dozens of model variants per week instead of running sequential A/B tests.
Lower risk: No live exposure to potentially worse recommendations.
Cost savings: Reduces infrastructure costs for running and monitoring A/B experiments.
Better for long-tail items: Counterfactual methods can surface insights for rare products (e.g., limited-edition luxury goods) that never get enough traffic in A/B tests.

Implementation Approach

Logging infrastructure: Ensure you log the action taken, the probability of that action under the old policy (propensity score), and the outcome (click, purchase, etc.).
Propensity estimation: If the old policy is deterministic or unknown, you may need to train a model to estimate propensities from logged data.
Metric selection: Choose the right estimator (IPS, SNIPS, or doubly robust) based on your data characteristics and tolerance for bias vs. variance.
Validation: Use synthetic data or holdout experiments to validate that your counterfactual estimates align with actual A/B test results.

Governance & Risk Assessment

Bias risk: If propensity models are inaccurate, estimates can be misleading. Regular validation against live A/B tests is recommended.
Data privacy: Logged data must be anonymized and stored securely, especially for luxury customers.
Maturity: These methods are well-established in academic literature but require skilled ML engineers to implement correctly in production.

Source: pub.towardsai.net

Source: gentic.news · Jun 3, 2026 · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

This article is a solid primer for any team working on recommendation systems or ad ranking, including those in retail and luxury. The methods described (IPS, SNIPS, doubly robust) are mature in the causal inference community but underutilized in production e-commerce systems. Most teams still rely on A/B tests or simple offline metrics like recall@k, which don't correct for selection bias. The article's main contribution is making these methods accessible with clear math and practical examples. For luxury retail specifically, the value proposition is strong. High-value customer segments are small, making A/B tests underpowered. Counterfactual evaluation can provide actionable estimates with less data. However, the implementation complexity is non-trivial: teams need robust logging infrastructure, accurate propensity estimation, and a culture that trusts offline estimates over live experiments. The article rightly emphasizes the bias-variance tradeoff, which is often the deciding factor in practice. The gap between research and production remains: most e-commerce teams lack the data infrastructure to log propensities reliably. But for teams already using contextual bandits or reinforcement learning for personalization, these methods are a natural fit. The article serves as a useful reference for ML engineers and data scientists looking to move beyond A/B testing.

#advertising #causal inference #e-commerce #machine learning #recommendation systems

Compare side-by-side

Inverse Propensity Scoring vs Self-Normalized Inverse Propensity Scoring

→

Mentioned in this article

Inverse Propensity Scoring Self-Normalized Inverse Propensity Scoring Doubly Robust Towards AI

Enjoyed this article?