A new, fully open-access graduate course titled "Training and Deploying Large-Scale Models" has been published for the 2025-2026 academic year by Edouard Oyallon of the MVA program at ENS Paris-Saclay. The course provides a complete, production-oriented curriculum covering the entire lifecycle of large language models (LLMs), from distributed training fundamentals to agentic AI deployment. All lecture slides and hands-on lab notebooks are freely available on GitHub Pages.
What the Course Covers: The Full LLM Stack
The course is structured into seven core sessions, designed to bridge the gap between theoretical machine learning and the systems engineering required to build frontier models.
- Distributed Training Fundamentals & Systems for ML: Introduces the core concepts and system architectures that enable training across thousands of GPUs.
- Multi-GPU Parallelization: A deep dive into data, tensor, and pipeline parallelism—the three primary strategies for splitting model workloads across hardware.
- Communication-Efficient Distributed Optimization: Covers advanced techniques like gradient compression to reduce the communication bottleneck in large-scale training.
- Post-Training: Explores supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and model evaluation.
- Serving LLMs at Scale: Focuses on high-throughput, low-latency inference using vLLM, a leading open-source serving engine.
- Agentic AI: Examines the architecture and implementation of AI agents that can autonomously perform multi-step tasks.
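To make the parallelism strategies above concrete, here is a minimal sketch of the idea behind tensor parallelism, using NumPy in place of real devices (this is an illustration of the concept, not code from the course labs): a linear layer's weight matrix is split column-wise across workers, each worker computes its output shard locally, and the shards are gathered back together.

```python
import numpy as np

# Toy illustration of tensor parallelism: a linear layer's weight matrix
# is split column-wise across "devices"; each device computes a shard of
# the output, and the shards are concatenated (an all-gather in practice).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # batch of activations
W = rng.normal(size=(8, 16))       # full weight matrix

n_devices = 4
shards = np.split(W, n_devices, axis=1)          # each device holds an 8x4 shard
partial = [x @ w for w in shards]                # local matmuls
y_parallel = np.concatenate(partial, axis=1)     # "all-gather" of output shards

y_full = x @ W                                   # reference computation
assert np.allclose(y_parallel, y_full)
```

Data parallelism instead replicates the full weights and splits the batch, while pipeline parallelism assigns whole layers to different devices; real frameworks such as torchtitan combine all three.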
The Production Toolchain: Meta's Stack in the Classroom
A defining feature of the course is its commitment to teaching with the same tools used in industry. The hands-on labs are built on a production-grade PyTorch stack, heavily featuring tools developed and open-sourced by Meta.
- Core Framework: Labs use PyTorch nightly builds.
- Training Systems: torchtitan (Meta's framework for large-scale model training) and torchft (Meta's fault-tolerant training library) are central to the curriculum.
- Fine-Tuning & Evaluation: torchtune (Meta's library for LLM fine-tuning) is used for post-training workflows.
- Serving: vLLM is taught for high-performance model deployment.
This choice directly connects academic learning to the engineering practices at companies building the largest models, such as Meta, which recently published research on its LeWorldModel and V-JEPA 2.1 models.
Key Hands-On Labs
The course emphasizes practical implementation. Key lab assignments include:
- Tiny Scaling Laws with nanoGPT: Students empirically explore how model performance scales with compute and data size.
- Porting nanoGPT to torchtitan: A practical exercise in adapting a known codebase to a production-scale training framework.
- Pipeline Parallelism Simulator: Builds intuition for the complexities and scheduling challenges of pipeline-parallel training.
- Collaborative Training with TorchFT: Implements fault-tolerant training, a critical requirement for long-running, multi-node jobs.
- Evaluation and SFT with TorchTune: Guides students through the post-training pipeline.
- Serving with vLLM: Deploys a model for inference with optimizations like PagedAttention.
- LLM Agents: Constructs a basic agentic system capable of planning and executing tasks.
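The spirit of the scaling-laws lab can be conveyed with a short sketch (synthetic data and parameter values are illustrative, not taken from the course): losses that follow a power law in compute become a straight line in log-log space, so the scaling exponent can be recovered with a linear fit.

```python
import numpy as np

# Toy scaling-law fit: generate synthetic losses following L(C) = a * C^(-b)
# plus noise, then recover the exponent by linear regression in log-log space.
rng = np.random.default_rng(42)
a_true, b_true = 5.0, 0.25
compute = np.logspace(15, 21, 20)                        # FLOPs budgets
loss = a_true * compute ** (-b_true)
loss *= np.exp(rng.normal(scale=0.01, size=loss.shape))  # multiplicative noise

slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
b_fit = -slope
print(f"fitted exponent b = {b_fit:.3f} (true {b_true})")
assert abs(b_fit - b_true) < 0.02
```

In the actual lab, the losses would come from training a family of small nanoGPT models at different compute budgets rather than from a synthetic generator.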
All materials are hosted on GitHub, a platform central to the AI engineering ecosystem and a frequent subject of our coverage, appearing in 58 prior articles.
gentic.news Analysis
This course release is a significant contribution to open AI education, arriving at a time when practical engineering knowledge for large-scale systems is as valuable as algorithmic innovation. By standardizing instruction around Meta's open-source stack (torchtitan, torchft, torchtune), it creates a direct pedagogical pipeline to the tools used for training models like Llama 3. This is particularly noteworthy given Meta's intense recent research activity, including the LeWorldModel paper from Yann LeCun's team and the V-JEPA 2.1 release, which we covered last week.
The curriculum's heavy focus on agentic AI aligns perfectly with the current industry trend. It provides the foundational systems knowledge required to build the autonomous agents that are becoming a primary interface for LLMs, a topic frequently in our headlines, such as the recent story on Anthropic's Claude Code acting as an autonomous PR agent. The course effectively demystifies the infrastructure—distributed training, efficient serving—that makes such agentic capabilities possible at scale.
Furthermore, the publication of a comprehensive, free course from a prestigious institution like ENS Paris-Saclay acts as a force multiplier for the open-source AI ecosystem. It lowers the barrier to entry for aspiring ML engineers and researchers, enabling them to contribute more effectively to projects on GitHub. This educational initiative complements the wave of open-source technical releases, such as the "Maths, CS & AI Compendium" textbook and various "skill packs" for AI agents, that are collectively expanding the global talent pool capable of working on frontier AI systems.
Frequently Asked Questions
Where can I access the "Training and Deploying Large-Scale Models" course?
All course materials, including lecture slides (PDFs) and Jupyter notebook labs, are freely available on the course's GitHub Pages site: https://training-large-models-course.github.io/. No registration or payment is required.
What are the prerequisites for taking this course?
The course is designed for graduate students (MVA program). A strong foundation in machine learning, deep learning (with PyTorch), and software engineering is assumed. Familiarity with basic parallel computing concepts is beneficial but not strictly required, as the fundamentals are covered in the first sessions.
Why does the course focus on Meta's tools like torchtitan and torchft?
The instructor, Edouard Oyallon, states the goal is to use "the same production toolchain used to train frontier models." Meta has been a major contributor to the open-source ecosystem for large-scale AI training. Tools like torchtitan and torchft (fault-tolerant training) represent state-of-the-art, production-tested frameworks. Learning them provides direct, applicable skills for industry and research roles focused on scaling LLMs.
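To give a sense of what fault tolerance means in this context, here is a conceptual, pure-Python sketch of the simplest mechanism, checkpoint-and-resume (this is an illustration of the idea, not the torchft API, which uses replication and quorum-based recovery): training state is saved periodically, and after a simulated crash the loop resumes from the last checkpoint instead of from step zero.

```python
# Conceptual sketch of checkpoint-based fault tolerance (not the torchft
# API): training state is saved every few steps, and after a simulated
# crash the loop resumes from the last checkpoint instead of step 0.
def train(total_steps, checkpoint_every, crash_at=None, checkpoint=None):
    state = dict(checkpoint) if checkpoint else {"step": 0, "weight": 0.0}
    saved = dict(state)
    for _ in range(state["step"], total_steps):
        if crash_at is not None and state["step"] == crash_at:
            return None, saved                   # simulate a node failure
        state["weight"] += 0.1                   # stand-in for an SGD update
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            saved = dict(state)                  # persist a checkpoint
    return state, saved

uninterrupted, _ = train(total_steps=100, checkpoint_every=10)
crashed, ckpt = train(100, 10, crash_at=57)      # fails mid-run...
recovered, _ = train(100, 10, checkpoint=ckpt)   # ...resumes from step 50
assert crashed is None and ckpt["step"] == 50
assert abs(recovered["weight"] - uninterrupted["weight"]) < 1e-9
```

Real multi-node jobs add the hard parts this sketch omits: detecting failures, coordinating which replica's state is authoritative, and rejoining workers without stopping the whole job.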
How does this course relate to learning about AI agents?
The course dedicates an entire session to Agentic AI, covering the architecture of multi-agent systems and including a hands-on lab to build an LLM agent. It positions agentic AI not as a standalone topic but as the culmination of the stack: you need a reliably trained model, fine-tuned for instruction-following, served efficiently with vLLM, and then orchestrated into an agentic loop. The course provides the full-stack engineering context necessary to deploy agents beyond simple API calls.
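The agentic loop described above can be sketched in a few lines of plain Python (all names here are hypothetical; a real agent would call an LLM where `stub_policy` stands in, and tools would wrap real APIs): the model picks a tool, the tool runs, the observation is appended to the history, and the loop repeats until the model emits a final answer.

```python
# Minimal sketch of an agentic loop (illustrative only). The loop
# alternates: the policy picks a tool, the tool runs, the observation
# is recorded, and the loop stops when the policy returns a final answer.
def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}))     # toy arithmetic tool

TOOLS = {"calculator": calculator}

def stub_policy(task, history):
    # Stand-in for an LLM: answer "what is <expr>?" with one tool call.
    if not history:
        return ("call", "calculator", task.removeprefix("what is ").rstrip("?"))
    return ("final", history[-1])                    # answer = last observation

def run_agent(task, policy, max_steps=5):
    history = []
    for _ in range(max_steps):
        action = policy(task, history)
        if action[0] == "final":
            return action[1]
        _, tool, arg = action
        history.append(TOOLS[tool](arg))             # observe the tool result
    raise RuntimeError("step budget exceeded")

print(run_agent("what is (3 + 4) * 6?", stub_policy))   # → 42
```

The systems topics earlier in the course matter precisely because each loop iteration is an inference call: slow or expensive serving multiplies across every step the agent takes.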