What is the benefit of adamax over adam? : r/MachineLearning



What follows is just my interpretation after reading the paper you linked. The paper argues that using the infinity norm (the limit of the L^p norm as p goes to infinity) makes the algorithm surprisingly stable. The update rule is u_t = max(b_2 * u_{t-1}, |g_t|), so the current gradient g_t is completely ignored whenever |g_t| < b_2 * u_{t-1}, i.e., when it is small relative to the running maximum. This means that u_1, u_2, ..., u_n are influenced by fewer gradients, which makes the algorithm more robust to noise in the gradients.
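To make that concrete, here's a minimal NumPy sketch of one AdaMax step based on the update rule above (hyperparameter defaults follow the Adam paper; the small eps in the denominator is my own safety addition, not part of the paper's pseudocode):

```python
import numpy as np

def adamax_update(params, grads, m, u, t, lr=0.002, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaMax step. m is the first-moment estimate, u the running
    exponentially weighted infinity norm of the gradients."""
    m = b1 * m + (1 - b1) * grads           # first-moment (momentum) estimate
    u = np.maximum(b2 * u, np.abs(grads))   # max-based second-moment term
    # Bias correction only applies to m; u needs none because of the max.
    params = params - (lr / (1 - b1 ** t)) * m / (u + eps)
    return params, m, u

# A near-zero gradient leaves u untouched once b2 * u dominates |g_t|:
u = np.array([1.0])
u_next = np.maximum(0.999 * u, np.abs(np.array([1e-6])))
# here u_next equals 0.999 * u, so the tiny gradient had no effect on u
```

The last two lines show the point of the argument: once the decayed running maximum exceeds the new gradient's magnitude, the gradient contributes nothing to u, so occasional noisy near-zero (or the occasional outlier) gradients perturb the effective step size far less than they would Adam's squared-gradient average.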

Source: Discussion on r/MachineLearning
