The Transformer architecture has been the cornerstone for many of the latest SOTA NLP models. It relies mainly on a mechanism called attention and, unlike the successful models that came before it, involves no convolutional or recurrent layers whatsoever.
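To give a first taste of what attention means, here is a minimal NumPy sketch of scaled dot-product attention, the core operation inside the Transformer. This is an illustrative toy, not the full model: each query is compared against every key, the similarities are turned into weights with a softmax, and the output is the corresponding weighted sum of values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Toy scaled dot-product attention.

    Q: (num_queries, d_k), K: (num_keys, d_k), V: (num_keys, d_v)
    Returns the attended values and the attention weights.
    """
    d_k = K.shape[-1]
    # similarity of each query to each key, scaled to keep magnitudes stable
    scores = Q @ K.T / np.sqrt(d_k)
    # softmax over the keys: each query gets a probability distribution
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # output = weighted average of the value vectors
    return weights @ V, weights

# toy example: 2 queries attending over 3 key/value pairs of dimension 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (2, 4): one output vector per query
print(w.sum(axis=-1))  # each row of weights sums to 1
```

Keep this picture in mind; the rest of the article builds up to how the Transformer uses exactly this operation at scale.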
If you’re new to this model, chances are you won’t find the architecture the easiest to understand. If that’s the case, I hope this article can help.
We’ll start by explaining how a regular encoder-decoder network works and what difficulties it may encounter, then look at what an attention mechanism does in a regular encoder-decoder architecture, and finally see how attention is used in Transformers.