streaming
30 articles about streaming in AI news
DIET: A New Framework for Continually Distilling Streaming Datasets in Recommender Systems
Researchers propose DIET, a framework for streaming dataset distillation in recommender systems. It maintains a compact, evolving dataset (1-2% of original size) that preserves training-critical signals, reducing model iteration costs by up to 60x while maintaining performance trends.
Kimi 2.5's 1T Parameter MoE Model Runs on 96GB Mac Hardware via SSD Streaming
Developers have demonstrated that Kimi 2.5's 1 trillion parameter Mixture-of-Experts model can run on Mac hardware with just 96GB of RAM by streaming expert weights from SSD, with only 32B parameters active per token.
Qwen 3.5 397B-A17B MoE Model Runs on M3 Mac at 5.7 TPS with 5.5GB Active Memory via SSD Streaming
Developer Dan reportedly runs the 209GB Qwen 3.5 397B-A17B MoE model on an M3 Mac at ~5.7 tokens per second using only 5.5GB of active memory by quantizing and streaming weights from SSD.
Global TV Liberation: How Open Source Collaboration Is Disrupting Streaming
An open-source project called Free-TV/IPTV has compiled free live TV channels from over 60 countries into a single M3U playlist. With 88 contributors maintaining the repository, this GitHub project offers HD streams from major platforms without subscriptions.
Run Claude Code in Any Sandbox with One API: AgentBox SDK
AgentBox SDK lets you swap coding agents and sandbox providers without changing code, while preserving full interactive capabilities (approval flows, streaming).
IAT: Instance-As-Token Compression for Historical User Sequence Modeling
Researchers propose Instance-As-Token (IAT), which compresses all features of each historical interaction into a unified embedding token, then applies standard sequence modeling. This approach outperforms state-of-the-art methods and has been deployed in e-commerce advertising, shopping mall marketing, and live-streaming e-commerce with substantial business metric improvements.
OpenClaw Voice Interface Demo Shows Real-Time AI Assistant Hardware
A developer showcased a custom hardware rig that integrates a push-button voice interface with the OpenClaw AI model, streaming responses in real time. This demonstrates a tangible, open-source alternative to proprietary voice assistants like Amazon Alexa.
scan-for-secrets 0.2: Streamline Your Security Workflow with New CLI Options
Simon Willison's scan-for-secrets 0.2 adds streaming output, multi-directory scanning, and file-specific options that developers can use immediately in Claude Code workflows.
Building a Memory Layer for a Voice AI Agent: A Developer's Blueprint
A developer shares a technical case study on building a voice-first journal app, focusing on the critical memory layer. The article details using Redis Agent Memory Server for working/long-term memory and key latency optimizations like streaming APIs and parallel fetches to meet voice's strict responsiveness demands.
Storing Less, Finding More: Novelty Filtering Architecture for Cross-Modal Retrieval on Edge Cameras
A new streaming retrieval architecture uses an on-device 'epsilon-net' filter to retain only semantically novel video frames, dramatically improving cross-modal search accuracy while reducing power consumption to 2.7 mW. This addresses the fundamental problem of redundant frames crowding out correct results in continuous video streams.
Extended Thinking's Two-Block Response: What Claude Code Users Need to Know
Extended Thinking returns separate thinking and text blocks. Handle them correctly when streaming, or your UI will show raw reasoning.
FastAPI-FullStack: Production-Ready Template for AI Agent Apps with FastAPI, Next.js, and Framework Choice
A new open-source template, fastapi-fullstack, provides a pre-built foundation for deploying AI agent applications. It integrates FastAPI, Next.js, and multiple agent frameworks with WebSocket streaming, authentication, and database support out of the box.
Claude-to-IM Skill: Get Claude Code in Your Team Chat (Without OpenClaw's Security Risks)
An open-source bridge brings Claude Code to Telegram and Discord with permission prompts, streaming, and persistent sessions, offering a safer alternative to OpenClaw.
OmniForcing Enables Real-Time Joint Audio-Visual Generation at 25 FPS with 0.7s Latency
Researchers introduced OmniForcing, a method that distills a bidirectional LTX-2 model into a causal streaming generator for joint audio-visual synthesis. It achieves ~25 FPS with 0.7s latency, a 35× speedup over offline diffusion models while maintaining multi-modal fidelity.
The Billion-Dollar Bet on AI World Models: How AMI's Funding Signals a New Era of Machine Understanding
AMI's $1 billion funding round for world model development highlights a strategic shift toward AI systems that understand physical reality. Meanwhile, robotics and creative AI tools see massive investments, with YouTube maintaining streaming dominance.
Gemini 3.5 Live Translate Debuts as Real-Time Audio Model
Google DeepMind released Gemini 3.5 Live Translate, an audio model for real-time translation, but disclosed no pricing, latency, or language pair details.
DeepSeek-V4 Hits 500K Context with 90% Less KV Cache via FlashMemory
DeepSeek-V4 achieves 500K context with 90% less KV cache via FlashMemory's lookahead sparse attention, keeping only 13.5% of cache in GPU memory without retraining.
Kling AI Video Enters Hollywood Production with 'House of David'
Kling AI video was used in 'House of David', billed as the first Hollywood production to use AI video at industrial scale. The show reached 44M+ viewers and hit #1 on Prime Video U.S.
train-llm-from-scratch: 1B-Parameter LLM on a Single GPU
train-llm-from-scratch trains billion-parameter LLMs on a single GPU, bringing training that once cost $10M+ within reach of consumer hardware.
Claude Code's Six-Layer Architecture: Harness, Not Magic
Claude Code's six-layer architecture uses a 3-layer context compressor with a 92% threshold and a Redis-based multi-agent FSM protocol. The model is just one node in a harness.
Two-Tower vs Vector DB + LLM: Which Wins for RecSys at Scale?
Two-tower models offer sub-10ms latency for cold-start; vector DB + LLM provides richer semantics. Hybrid architectures reduce churn by 15-20%.
Microsoft’s VibeVoice: Open-Source Speech-to-Text with Diarization
Microsoft released VibeVoice, an MIT-licensed speech-to-text model with built-in speaker diarization. Simon Willison tested a 4-bit MLX conversion on an M5 MacBook, transcribing 1 hour of audio in ~9 minutes using ~60GB RAM.
Free-Claude-Code Proxy Routes Anthropic API to Free NVIDIA NIM Models
A developer released free-claude-code, a proxy that intercepts Claude Code's API calls and routes them to free NVIDIA NIM endpoints, unlocking free access to models like Kimi K2 and GLM 4.7. This bypasses Anthropic's subscription fees and adds remote execution via a Telegram bot.
Catching Drift Before It Catches You
The author details implementing the open-source Evidently AI library to monitor a Kafka-powered movie recommender for data drift. This is a hands-on guide to a fundamental MLOps task for maintaining live AI systems.
AI-Powered PS4 Emulator 'Spine' Runs Bloodborne Locally on PC
A developer has released Spine, a PS4 emulator that uses AI techniques to run Bloodborne fully on PC. This represents a major step forward in console emulation, previously considered years away.
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
GPT-5.5 Limited Rollout Begins, Frontend Improvements Noted
OpenAI has started a limited rollout of GPT-5.5 to select users, with early reports highlighting significant frontend quality improvements. This suggests an incremental update focused on user experience rather than core model capabilities.
Vibe's $227M ARR Shows AI-Powered CTV Ads Are Eating Linear TV Budgets
Ad platform Vibe.co reports $227M in annual recurring revenue, growing 264% year-over-year. The surge is driven by AI that optimizes Connected TV ads by combining identity graphs with transactional data, convincing brands to shift major budgets.
A Practical Guide to Building Real-Time Recommendation Systems
This article provides a practical overview of building real-time recommendation systems, covering core components like data ingestion, feature stores, and model serving. It matters because real-time personalization is becoming a baseline expectation in digital commerce.
Onlook: Open-Source AI Tool Edits React Code Visually, Hits 23.9K GitHub Stars
Onlook, an open-source desktop app, enables visual editing of live React and Next.js applications, with AI generating and writing code changes directly to the codebase. It has gained 23.9K GitHub stars, positioning itself as a free alternative to paid design tools like Figma.