Why are residual blocks called “residual” blocks?
I was confused because the equation in the diagram that explains residual blocks in the research paper is f(x) + x. So I thought, “Where is the residual?”
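The answer shows up when you rearrange the equation. Call the mapping the whole block should produce H(x). Because the skip connection adds the identity x to the output of the layers, the block computes H(x) = f(x) + x, which means the layers themselves only have to learn f(x) = H(x) - x. In other words, the model learns the difference between the desired output H(x) and the original input x, which is the residual, instead of learning H(x) directly.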
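To make this concrete, here is a minimal sketch in plain Python, with no deep learning framework. The names `linear`, `relu`, `make_branch`, and `residual_block` are made up for this illustration, and the two-layer branch is just an assumption about what f might look like. It is not the paper's implementation. It only shows that the block returns f(x) + x, so the branch f ends up holding the residual H(x) - x.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Small dense layer: computes W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def make_branch(w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Callable[[Vector], Vector]:
    """Build the learned branch f: linear -> ReLU -> linear (an assumed shape)."""
    def f(x: Vector) -> Vector:
        return linear(w2, b2, relu(linear(w1, b1, x)))
    return f


def residual_block(f: Callable[[Vector], Vector], x: Vector) -> Vector:
    """H(x) = f(x) + x: branch output plus the skip-connected input."""
    return [fx_i + x_i for fx_i, x_i in zip(f(x), x)]


x = [1.0, -2.0, 3.0]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
half = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights the branch outputs 0, so the block is the identity.
f_zero = make_branch(zeros, [0.0] * 3, zeros, [0.0] * 3)
print(residual_block(f_zero, x))  # [1.0, -2.0, 3.0]

# With non-zero weights, subtracting the input from the block output
# recovers exactly what the branch produced: the residual.
f = make_branch(identity, [0.0] * 3, half, [1.0] * 3)
h = residual_block(f, x)
residual = [h_i - x_i for h_i, x_i in zip(h, x)]
print(residual)  # [1.5, 1.0, 2.5]
```

The zero-weight case is the useful intuition: if the best thing a block can do is pass its input through unchanged, the layers only need to push f(x) toward zero rather than learn an identity mapping from scratch.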