What follows is just my interpretation after reading the paper you linked. The paper says that using the infinity norm makes the algorithm surprisingly stable. The update rule is u_t = max(β₂ · u_{t-1}, |g_t|), so g_t is completely ignored whenever |g_t| < β₂ · u_{t-1}, not only when it's close to zero. Because each step keeps the running max rather than averaging, u_1, u_2, ..., u_n are influenced by fewer gradients, and this is what makes the algorithm more robust to noise in the gradients.
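To make that concrete, here's a minimal sketch of how I read that update rule (the function name and the β₂ = 0.999 value are just my choices for illustration, not from the paper):

```python
def infnorm_update(u_prev, g_t, beta2=0.999):
    """One step of the max-based (infinity-norm) accumulator:
    keep a decayed running max of gradient magnitudes."""
    return max(beta2 * u_prev, abs(g_t))

# A tiny gradient is ignored: the decayed previous max wins.
u = infnorm_update(1.0, 1e-6)   # max(0.999, 1e-6) -> 0.999

# A large gradient takes over immediately.
u = infnorm_update(u, 2.0)      # max(~0.998, 2.0) -> 2.0
```

The point being: small gradients (noise) can't drag the accumulator down the way they would in a squared-average, which is how I understand the stability claim.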
Source: Discussion on r/MachineLearning




