Softmax is tailor-made for multi-class classification problems like those posed by the MNIST and CIFAR datasets. It's ideal for converting the raw outputs of a linear layer into a vote for each category: it exponentiates each output and divides by the sum of those exponentials, so the votes are all positive and add up to 1. It behaves well across a wide range of input values, so in the output layer it takes the place of other activation functions, like sigmoid (logistic) or rectified linear units (ReLU). Softmax emphasizes the strongest vote and so focuses the learning on the parameters that will strengthen that vote. It's also relatively cheap to compute.
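
To make the "vote" idea concrete, here is a minimal NumPy sketch of softmax applied to the outputs of a linear layer. The function name `softmax` and the three example scores are assumptions chosen for illustration, not values from any particular model. Subtracting the maximum score before exponentiating is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np


def softmax(logits):
    """Convert a 1-D array of raw scores into probabilities that sum to 1."""
    logits = np.asarray(logits, dtype=float)
    # Shifting by the max does not change the result but avoids overflow in exp().
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()


# Hypothetical linear-layer outputs for a 3-class problem
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

print(np.round(probs, 3))  # approximately [0.659 0.242 0.099]
print(probs.sum())         # the votes add up to 1 (within floating-point error)
```

The largest raw score (2.0) is only twice the second largest (1.0), yet it receives almost three times the probability. That is the sense in which softmax emphasizes the strongest vote.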