Category: Deep Learning

163. Why Normalize Inputs?

Why do we normalize inputs? When the input is not normalized, the shape of the cost function can become distorted, like the diagram on the left. This leads to instability when optimizing the model. The training speed decreases depending on…
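A minimal NumPy sketch of standardizing inputs to zero mean and unit variance (the feature scales below are made up for illustration):

```python
import numpy as np

# Features on very different scales (illustrative data).
X_train = np.random.rand(1000, 3) * np.array([1.0, 100.0, 0.01])

# Standardize each feature using statistics computed on the training set only;
# apply the same mean and std to validation/test data later.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_norm = (X_train - mean) / std
```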

162. Residual Blocks

Why are residual blocks called “residual” blocks? I was confused because the equation in the diagram explaining residual blocks in the research paper was f(x) + x. So I thought, “Where is the residual…?” When…
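A minimal PyTorch sketch of the f(x) + x pattern, with illustrative layer choices:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: the stacked layers learn the residual f(x),
    and the skip connection adds the input back, giving f(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))  # f(x)
        return self.relu(residual + x)                   # f(x) + x
```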

156. Highway Networks

Highway Networks Training DEEP networks becomes difficult, even when using variance-preserving initialization. Adding an information highway (learning how to route information through the network) makes it easier to train models even when they are really DEEP.…
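A minimal PyTorch sketch of one highway layer, y = T(x) · H(x) + (1 − T(x)) · x, with an illustrative gate bias:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: a transform gate T(x) decides how much of the
    candidate transformation H(x) to use and how much input to carry through."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        # Bias the gate negative so the layer initially just carries the input.
        nn.init.constant_(self.gate.bias, -2.0)

    def forward(self, x):
        h = torch.relu(self.transform(x))   # candidate transformation H(x)
        t = torch.sigmoid(self.gate(x))     # transform gate T(x)
        return t * h + (1 - t) * x          # route through the "highway"
```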

155. S3D (Separable 3D CNN)

I learned about S3D (Separable 3D CNN) today, so I’d like to share it here. S3D helps solve three challenges in video analysis: how to understand spatial information (recognizing the appearance of an object), how to understand temporal information (such as…
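A minimal PyTorch sketch of the separable idea, factoring a k×k×k 3D convolution into a spatial convolution followed by a temporal one (channel sizes are illustrative):

```python
import torch.nn as nn

class SeparableConv3d(nn.Module):
    """Replace a full k x k x k 3D conv with a 1 x k x k spatial conv
    followed by a k x 1 x 1 temporal conv."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.temporal(self.spatial(x))
```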

153. Non-Local Neural Networks

“Local” means only understanding the CURRENT “time” and “space”. To understand “non-local” nuances (What will the person in the image do next? Where is the soccer ball being kicked heading?), if we were to use traditional methods such as…
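A simplified PyTorch sketch of a non-local block (embedded-Gaussian style) that lets every space-time position attend to every other position; layer choices are illustrative:

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Simplified non-local block for video features of shape
    (batch, channels, time, height, width)."""
    def __init__(self, channels):
        super().__init__()
        inner = max(channels // 2, 1)
        self.theta = nn.Conv3d(channels, inner, kernel_size=1)
        self.phi = nn.Conv3d(channels, inner, kernel_size=1)
        self.g = nn.Conv3d(channels, inner, kernel_size=1)
        self.out = nn.Conv3d(inner, channels, kernel_size=1)

    def forward(self, x):
        b, c, t, h, w = x.shape
        n = t * h * w
        theta = self.theta(x).reshape(b, -1, n).transpose(1, 2)  # (b, n, inner)
        phi = self.phi(x).reshape(b, -1, n)                      # (b, inner, n)
        g = self.g(x).reshape(b, -1, n).transpose(1, 2)          # (b, n, inner)
        attn = torch.softmax(theta @ phi, dim=-1)  # affinity between ALL positions
        y = (attn @ g).transpose(1, 2).reshape(b, -1, t, h, w)
        return x + self.out(y)                     # residual connection
```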

146. BERT

What is BERT? BERT is a deep learning architecture for natural language processing. If you stack the Transformer’s encoder layers, you get BERT. What can BERT solve? Neural machine translation, question answering, sentiment analysis, and text summarization. How to solve the problems…
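A minimal PyTorch sketch of stacking Transformer encoder layers with BERT-base-like sizes (this is only the encoder stack, not a full BERT with embeddings and pretraining heads):

```python
import torch
import torch.nn as nn

# 12 encoder layers, hidden size 768, 12 attention heads (BERT-base-like).
encoder_layer = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                           dim_feedforward=3072, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

tokens = torch.randn(2, 16, 768)   # (batch, sequence length, hidden size), already embedded
contextual = encoder(tokens)       # same shape, now contextualized by self-attention
```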

140. Spatial Pyramid Pooling

Spatial Pyramid Pooling helps the network output the same shape regardless of aspect ratio and input size. Instead of pooling with a fixed filter size, it divides the input into a fixed number of bins at several pyramid levels, so the output would not…
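A minimal PyTorch sketch of pyramid pooling using adaptive pooling at a few illustrative grid sizes:

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(x, levels=(1, 2, 4)):
    """Pool a (batch, channels, H, W) feature map into a fixed-length vector,
    regardless of H and W, by adaptive max pooling over 1x1, 2x2, and 4x4 grids."""
    batch = x.size(0)
    pooled = [F.adaptive_max_pool2d(x, output_size=n).reshape(batch, -1) for n in levels]
    return torch.cat(pooled, dim=1)  # length = channels * (1 + 4 + 16) for these levels

# Two different input sizes produce the same output length.
a = spatial_pyramid_pool(torch.randn(1, 8, 32, 48))
b = spatial_pyramid_pool(torch.randn(1, 8, 57, 19))
assert a.shape == b.shape
```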

138. Variational Autoencoders

Autoencoders encode an input to a smaller representation vector (also called a latent vector) and decode that to restore the original input. For example, you can encode an image, send the encoded image to someone, and have them decode it…
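A minimal PyTorch sketch of a plain autoencoder’s encode/decode round trip (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Compress a 784-dim input (e.g. a flattened 28x28 image) to a 32-dim latent
# vector, then decode it back to an approximate reconstruction.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

image = torch.rand(1, 784)
latent = encoder(image)            # the compact representation you could send
reconstruction = decoder(latent)   # the receiver restores an approximation
```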

137. Vision Transformers

Vision Transformers are inspired by Transformers for natural language processing. Unlike traditional convolutional networks with pyramid architectures, ViT has an isotropic architecture, where the representation is not downsampled as it passes through the network. The steps are the following: split images into “patches”, flatten each patch…
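A minimal PyTorch sketch of the patch-splitting and flattening step (patch size 16 is the common ViT choice; the rest is illustrative):

```python
import torch

def patchify(images, patch_size=16):
    """Split (batch, channels, H, W) images into flattened patches of shape
    (batch, num_patches, channels * patch_size * patch_size)."""
    b, c, h, w = images.shape
    p = patch_size
    x = images.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)              # (b, h/p, w/p, c, p, p)
    return x.reshape(b, (h // p) * (w // p), c * p * p)

patches = patchify(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768]) -- 14x14 patches, each 3*16*16 values
```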