The Lottery Ticket Hypothesis
Using an analogy from the gambling world, the training of machine learning models is often compared to winning the lottery by buying every possible ticket. But if we know what winning the lottery looks like, couldn’t we be smarter about selecting the tickets?
In machine learning, training typically produces large neural network structures that are the equivalent of a big bag of lottery tickets. After the initial training, models often undergo optimization techniques such as pruning, which removes unnecessary weights from the network in order to reduce the size of the model without sacrificing performance. This is the equivalent of searching for the winning tickets in the bag and getting rid of the rest. Very often, pruning techniques end up producing neural network structures that are 90% smaller than the original. The obvious question then is: if a network can be reduced in size, why not train the smaller architecture instead, in the interest of making training more efficient as well? Paradoxically, practical experience with machine learning solutions shows that the architectures uncovered by pruning are harder to train from the start, reaching lower accuracy than the original networks. So you can buy a big bag of tickets and work your way to the winning numbers, but the opposite process is too hard. Or so we thought 😉
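To make the pruning step concrete, here is a minimal sketch of one common approach, magnitude pruning, which zeroes out the weights with the smallest absolute values. The text above does not say which pruning criterion is used, so the choice of magnitude pruning, the function name `magnitude_prune`, and the random weight matrix standing in for a trained layer are all assumptions made for illustration.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Returns the pruned weights and the boolean mask of surviving weights.
    (Hypothetical helper for illustration; not taken from any library.)
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")

    magnitudes = np.abs(weights).ravel()
    n_prune = int(round(sparsity * magnitudes.size))

    mask = np.ones(magnitudes.size, dtype=bool)
    if n_prune > 0:
        # argpartition places the n_prune smallest magnitudes in the first n_prune slots
        prune_idx = np.argpartition(magnitudes, n_prune - 1)[:n_prune]
        mask[prune_idx] = False

    mask = mask.reshape(weights.shape)
    return weights * mask, mask


# Stand-in for a trained layer: random weights, since no real model is available here
rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))

pruned, mask = magnitude_prune(w, sparsity=0.9)
kept = int(mask.sum())
print(f"kept {kept} of {w.size} weights")  # kept 1000 of 10000 weights
```

The mask is the interesting part: it records which "tickets" survived, so the same connections can be singled out again later.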
The main idea behind MIT’s Lottery Ticket Hypothesis is that a large neural network will consistently contain a smaller subnetwork that, if trained from the start, will achieve accuracy similar to that of the larger structure. Specifically, the research paper states the hypothesis as follows:
· The Lottery Ticket Hypothesis: A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations.
In the context of the paper, the small subnetwork is often referred to as the winning ticket.
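For readers who prefer symbols, the statement can be written down roughly as follows. The notation is an assumption introduced here rather than something defined in the text above: θ₀ stands for the random initial weights, m for a binary mask selecting the subnetwork, a and j for the test accuracy and number of training iterations of the dense network, and a′ and j′ for the same quantities when only the masked subnetwork is trained.

```latex
\begin{aligned}
&\text{Dense network: } f(x;\theta_0) \text{ reaches test accuracy } a \text{ after } j \text{ iterations.}\\
&\text{Subnetwork: } f(x;\, m \odot \theta_0),\quad m \in \{0,1\}^{|\theta|},
  \text{ reaches test accuracy } a' \text{ after } j' \text{ iterations.}\\
&\text{Hypothesis: } \exists\, m \ \text{such that}\quad a' \ge a,\qquad j' \le j,\qquad \lVert m \rVert_0 \ll |\theta| .
\end{aligned}
```

Read this way, a winning ticket is the pair of a mask and the original initialization: the subnetwork keeps the starting values it was dealt and is not re-initialized.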