Index

Self-Attention Variants
    Long-Short Range Attention
    Tree-Structured Attention with Subtree Masking
    Hashed Attention
    eXtra Hop Attention
Training Objectives
    Discriminative Replacement Task
    Word and Sentence Structural Tasks
    Type-Constrained Entity Replacement
Embeddings
    Position-Aware Complex Word Embeddings
    Hierarchical Embeddings
    Factorized Embedding Parametrization
Model Architecture
    Compressive Memory
    Reversible Layers
    Cross-Layer Parameter Sharing
    Adaptive Depth Estimation
Conclusion

Link: http://gsarti.com/post/iclr2020-transformers/