NanoEuler is a 116M-parameter, GPT-2-small-scale language model written from scratch in C/CUDA, with no PyTorch, autograd, or other ML libraries. Posted on Hacker News on June 28, 2026, the project offers a complete training pipeline intended for educational use.
Key facts
- 116M parameters, roughly GPT-2-small scale.
- Trained on a single consumer RTX 4070 GPU.
- No PyTorch, autograd, or any ML libraries used.
- Hand-written BPE tokenizer, FlashAttention, and training pipeline.
- A double-precision gradient check validates the backward pass.
The project, by developer JustVugg, includes a hand-written byte-level BPE tokenizer, pretraining on a corpus of books and web text, and supervised fine-tuning into a chat model; RLHF/DPO is planned. A small showcase model runs on the CPU. The ~116M-parameter model is trained on a single RTX 4070 by a from-scratch CUDA engine that uses cuBLAS for matrix multiplications and a hand-written FlashAttention implementation, and that is validated against a CPU reference with a full-model gradient check.
A sample from the model after partial pretraining shows fluent-sounding but shallow output: "Alessandro eat a icing textile: the satisfied by the servants in order to keep your weight". The text has the surface grammar and encyclopedic register of its training data but no real-world knowledge. The project's name draws an analogy between residual connections and the forward-Euler method for solving ordinary differential equations, as detailed in the README.
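The analogy behind the name can be made concrete. A forward-Euler step for dx/dt = f(x) updates x to x + h * f(x), and a residual connection computes x + f(x), which is the same update with step size h = 1. The sketch below illustrates only that correspondence; the names layer_fn, euler_step, residual_block, and toy_layer are hypothetical and are not taken from the NanoEuler source.

```c
#include <stddef.h>

/* Hypothetical layer signature: writes f(x) into out. */
typedef void (*layer_fn)(const float *x, float *out, size_t n);

/* One forward-Euler step for dx/dt = f(x): x <- x + h * f(x). */
void euler_step(float *x, float *scratch, size_t n, float h, layer_fn f) {
    f(x, scratch, n);
    for (size_t i = 0; i < n; i++) x[i] += h * scratch[i];
}

/* A residual block is the same update with step size h = 1. */
void residual_block(float *x, float *scratch, size_t n, layer_fn f) {
    euler_step(x, scratch, n, 1.0f, f);
}

/* Toy "layer" used only for illustration: f(x) = -0.5 * x. */
void toy_layer(const float *x, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = -0.5f * x[i];
}
```

Under this reading, stacking residual blocks corresponds to taking successive unit-size Euler steps, with each layer supplying its own f.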
Community reception on Hacker News has been mixed, with one top comment questioning the neural ODE analogy and suggesting the README may be AI-generated. Another comment pointed the developer to Y Combinator's second-chance pool for HN posts.
What the project is — and isn't
NanoEuler is explicitly a research and educational artifact. As the developer states, "At ~116M parameters trained on a single consumer GPU, it is a text generator in the spirit of GPT-2-small: fluent-ish English, no real world knowledge. It is not a capable assistant." The point is the from-scratch engineering and the complete, understandable training pipeline.
The developer cites two motivations: (1) using LLMs through an interface is not the same as understanding how they are built, and (2) working at a very low level makes it possible to study the relationship between parameters, data, and model growth, as well as how the GPU works and how layers can be optimized.
Technical architecture
The model is a decoder-only transformer with RMSNorm (pre-norm, no bias), rotary position embeddings (RoPE) applied to queries and keys, and SwiGLU feed-forward: down(silu(gate(x)) * up(x)). The backward pass is verified with a gradient check in double precision via make check.
Training is started with ./nanoeuler train for the small ~0.76M-parameter model or ./nanoeuler train big for the larger ~10M-parameter model; ./nanoeuler chat opens a REPL interface.
Key Takeaways
- NanoEuler is a 116M-parameter GPT-2-scale model built in pure C/CUDA from scratch.
- It provides a complete educational training pipeline for understanding LLMs at the lowest level.
What to watch
Watch for whether the developer adds the planned RLHF/DPO training and whether the project gains traction as a teaching tool for low-level LLM implementation. Also track how the community responds to the concern that the README may be AI-generated.
Source: github.com








