DriveXQA: New AI Framework Helps Autonomous Vehicles See Through Fog and Sensor Failures

Researchers introduce DriveXQA, a multimodal dataset and MVX-LLM architecture that enables autonomous vehicles to answer complex questions about adverse driving conditions by fusing data from multiple visual sensors, significantly improving performance in challenging scenarios like fog.


DriveXQA: Teaching AI to Understand Adverse Driving Scenes Through Multi-Sensor Fusion

In the quest for truly reliable autonomous vehicles, one of the most persistent challenges has been maintaining situational awareness when sensors fail or weather conditions deteriorate. While today's self-driving systems rely on multiple sensors—cameras, LiDAR, radar—they often struggle to integrate this information intelligently when faced with fog, rain, or partial sensor failures. A new arXiv preprint, "DriveXQA: Cross-modal Visual Question Answering for Adverse Driving Scene Understanding," addresses this gap with a comprehensive framework that enables AI to answer complex questions about challenging driving scenarios using multiple visual modalities.

The DriveXQA Dataset: A Benchmark for Adverse Conditions

The researchers' first contribution is the DriveXQA dataset, which represents a significant advancement in autonomous driving benchmarks. Unlike previous datasets that focus primarily on normal driving conditions, DriveXQA specifically targets adverse scenarios with:

  • Four visual modalities: Including different sensor types that provide complementary information about the driving environment
  • Five sensor failure cases: Simulating realistic scenarios where one or more sensors malfunction
  • Five weather conditions: Including fog, rain, and other challenging atmospheric conditions
  • 102,505 question-answer pairs: Categorized into three distinct levels of understanding:
    • Global scene level (overall environment assessment)
    • Allocentric level (relationships between objects in the scene)
    • Ego-vehicle centric level (implications for the vehicle itself)

This structured approach allows researchers to test not just whether an AI system can perceive elements of a scene, but whether it can understand their relationships and implications for safe navigation.
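To make the dataset's three-level structure concrete, here is a minimal sketch of what a single DriveXQA-style record might look like. The field names and example values are assumptions for illustration; the actual schema released with the dataset may differ.

```python
from dataclasses import dataclass

# The three semantic levels described in the paper's taxonomy.
LEVELS = ("global_scene", "allocentric", "ego_centric")

@dataclass
class DriveXQASample:
    """Hypothetical record layout for one question-answer pair."""
    modalities: dict       # modality name -> file reference, e.g. {"rgb": ...}
    weather: str           # one of the five weather conditions (e.g. "fog")
    sensor_failure: str    # one of the five failure cases, or "none"
    level: str             # which of the three semantic levels the question targets
    question: str
    answer: str

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown semantic level: {self.level}")

sample = DriveXQASample(
    modalities={"rgb": "frame_0001.png", "depth": "frame_0001.npy"},
    weather="fog",
    sensor_failure="none",
    level="ego_centric",
    question="Is it safe to maintain the current speed?",
    answer="No; dense fog reduces visibility, so the vehicle should slow down.",
)
```

Organizing samples this way makes it easy to slice the benchmark by weather condition, failure case, or semantic level when reporting results.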

MVX-LLM: A Novel Architecture for Multi-Modal Understanding

Perhaps the most innovative aspect of this research is the MVX-LLM (Multi-View Cross-modal Large Language Model) architecture. The researchers identified that existing Multimodal Large Language Models (MLLMs) were not designed to take multiple complementary visual modalities as input simultaneously: simply feeding several sensor streams into a conventional architecture introduces significant information redundancy, wasting computational resources and potentially confusing the model.

Figure 4: Overview of MVX-LLM. The framework processes multi-modal sensor inputs (RGB, depth, and event cameras).

To solve this problem, the team developed a Dual Cross-Attention (DCA) projector that intelligently fuses information from different modalities. This token-efficient architecture identifies complementary information across sensors while filtering redundant data, allowing the model to maintain a comprehensive understanding even when some sensors provide partial or conflicting information.
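The article does not spell out the DCA projector's internals, but the core idea of cross-attention between two modality token streams can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: it omits learned projections, multiple heads, and the token-reduction step, and simply lets each modality's tokens attend over the other's before fusion.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, d):
    """Each query token gathers complementary information from the other modality."""
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

def dual_cross_attention(rgb, depth):
    """Toy two-way fusion: each stream attends to the other, with a residual."""
    d = rgb.shape[1]
    rgb_enriched = rgb + cross_attend(rgb, depth, d)      # RGB queries, depth keys/values
    depth_enriched = depth + cross_attend(depth, rgb, d)  # depth queries, RGB keys/values
    return np.concatenate([rgb_enriched, depth_enriched], axis=0)

rng = np.random.default_rng(0)
rgb_tokens = rng.normal(size=(4, 8))    # 4 RGB patch tokens, feature dim 8
depth_tokens = rng.normal(size=(6, 8))  # 6 depth patch tokens, feature dim 8
fused = dual_cross_attention(rgb_tokens, depth_tokens)  # shape (10, 8)
```

In the real architecture, the "token-efficient" property would come from projecting the fused result down to a smaller set of tokens before the language model sees it; the sketch above keeps all tokens for clarity.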

Performance Breakthroughs in Challenging Conditions

The experimental results demonstrate significant improvements over baseline approaches, particularly in adverse conditions. In foggy scenarios, the DCA-enhanced MVX-LLM achieved a GPTScore of 53.5 compared to just 25.1 for the baseline—more than doubling performance. This improvement suggests the architecture is particularly effective at piecing together partial information from different sensors to form a coherent understanding when any single sensor would be insufficient.

The researchers note that their approach shows similar advantages across various sensor failure scenarios, indicating robustness to the types of hardware malfunctions that inevitably occur in real-world deployment.

Implications for Autonomous Vehicle Development

This research arrives at a critical juncture in autonomous vehicle development. As companies transition from controlled testing environments to broader deployment, handling edge cases and adverse conditions becomes increasingly important for both safety and public acceptance.

Figure 3: Hierarchical XQA examples from the DriveXQA dataset, illustrating the three semantic levels: global scene, allocentric, and ego-vehicle centric.

The DriveXQA framework addresses several key challenges:

  1. Sensor redundancy management: Rather than simply adding more sensors, the system learns which sensors provide unique versus redundant information in specific conditions
  2. Graceful degradation: When sensors fail, the architecture can compensate by relying more heavily on remaining functional sensors
  3. Interpretable reasoning: By framing the problem as question-answering, the system's understanding becomes more transparent and testable
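Graceful degradation, the second point above, can be illustrated with a toy fusion routine that skips modalities flagged as failed and averages whatever remains. This is a stand-in for the paper's actual mechanism; the function name and the simple mean-pooling fusion are assumptions made purely for illustration.

```python
import numpy as np

def fuse_features(features, healthy):
    """Average per-modality feature vectors, ignoring failed sensors.

    features: {modality name: np.ndarray feature vector}
    healthy:  {modality name: bool} -- False marks a failed sensor
    """
    usable = [vec for name, vec in features.items() if healthy.get(name, False)]
    if not usable:
        raise RuntimeError("no functional sensors remain")
    return np.mean(usable, axis=0)

feats = {
    "rgb":   np.array([1.0, 2.0]),
    "depth": np.array([3.0, 4.0]),
    "event": np.array([5.0, 6.0]),
}
# Simulate a depth-sensor failure: fusion falls back to RGB + event only.
fused = fuse_features(feats, {"rgb": True, "depth": False, "event": True})
```

A learned system like MVX-LLM would reweight the surviving modalities rather than average them uniformly, but the control flow is the same: failed streams are excluded, and the remaining ones carry the full burden.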

The Broader AI Research Context

This work fits into several important trends in AI research. First, it represents the continued expansion of large language models into multimodal domains, extending their reasoning capabilities beyond text to complex visual scenes. Second, it addresses the growing need for AI systems that can handle uncertainty and partial information—a requirement for real-world deployment where perfect data is never guaranteed.

The timing is notable within arXiv's recent publication patterns. In the same week this paper was submitted (March 11, 2026), arXiv also published significant research on AI agents' capabilities in executing cyber attacks, LLM calibration degeneration, and safety alignment issues. This cluster of papers reflects accelerating work on, and concern about, AI robustness and safety across multiple domains.

Open Source Contribution and Future Directions

Consistent with arXiv's culture of open research, the authors have committed to making both the DriveXQA dataset and source code publicly available. This transparency should accelerate further research in several directions:

Figure 1: Left: two corner cases of adverse driving scenes (fog causing poor visibility, and camera over-exposure).

  • Integration with existing autonomous systems: How can MVX-LLM be incorporated into complete self-driving stacks?
  • Extension to additional modalities: Could radar, ultrasonic, or thermal imaging data be integrated using similar architectures?
  • Real-time optimization: The current research focuses on accuracy; future work will need to address latency and computational efficiency for real-time driving applications

Conclusion

The DriveXQA framework represents a significant step toward autonomous vehicles that can truly understand their environment rather than simply perceive it. By combining a comprehensive adverse-conditions dataset with an innovative multi-modal fusion architecture, the research addresses one of the most persistent challenges in self-driving technology: maintaining situational awareness when conditions are less than ideal.

As autonomous vehicles move closer to widespread adoption, breakthroughs like this—which improve performance specifically in the most challenging scenarios—may prove crucial for both safety and public trust. The open availability of both dataset and code ensures that this advancement will benefit the entire research community, potentially accelerating progress toward more robust and reliable autonomous systems.

Source: arXiv:2603.11380v1, submitted March 11, 2026

AI Analysis

The DriveXQA research represents a sophisticated evolution in autonomous vehicle AI that addresses fundamental limitations in current systems. Most existing approaches treat different sensors as independent information sources that get fused at a relatively simple level, but this research recognizes that true understanding requires reasoning across modalities—not just combining them. The Dual Cross-Attention mechanism is particularly significant because it moves beyond simple concatenation or averaging of sensor data to actively identify complementary relationships between modalities.

From a practical standpoint, this work directly tackles the "edge case" problem that has slowed autonomous vehicle deployment. By specifically testing against sensor failures and adverse weather—and showing dramatic improvements in these scenarios—the researchers are addressing the exact conditions where current systems often fail. The question-answering framework is also strategically important because it creates a testable benchmark for understanding rather than just perception. An autonomous system might correctly identify objects in fog but still make poor decisions if it doesn't understand their relationships or implications.

Looking forward, this architecture could influence AI design beyond autonomous vehicles. Any application requiring robust multi-sensor understanding in dynamic environments—from industrial robotics to surveillance systems to augmented reality—could benefit from similar approaches to cross-modal reasoning. The researchers' decision to release both dataset and code ensures this will become a foundational benchmark for evaluating how well AI systems truly understand complex multi-modal scenes.
