A few months ago I posted a story showing how to use an autoencoder (a type of neural-network architecture mostly used for dimensionality reduction and image compression) to efficiently detect anomalous or faulty string sequences in a large set of sequences. Given a list of sequences following a certain pattern or format (e.g. ‘AB121E’, ‘AB323’, ‘DN176’), I showed how we can design a network that learns these patterns or formats and can detect cases in which a sequence was not formatted properly.

In this story, I want to show a more “advanced” use case: how we can use the same methodology to address a somewhat more complicated problem. Sometimes the question we want to answer is not whether a certain sequence is anomalous or an outlier, but how anomalous or abnormal a given sequence is compared to other sequences. Suppose, for example, that it is pretty normal to see some anomalous sequences per hour in our data stream, but we need a way to evaluate how normal or abnormal a certain hour is compared to the last one. We could calculate the average level of abnormality we see each hour, but how do we go about computing something like this?
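To make that last step concrete, here is a small sketch of the per-hour aggregation, assuming we already have some per-sequence abnormality score (the hours and score values below are made up for illustration, not data from the original story):

```python
from collections import defaultdict

# Hypothetical (hour, anomaly_score) pairs; in practice each score would
# come from a model that rates how abnormal a single sequence is.
events = [(13, 0.02), (13, 0.03), (13, 0.04),
          (14, 0.10), (14, 0.30), (14, 0.20)]

# Group the scores by the hour in which they were observed.
by_hour = defaultdict(list)
for hour, score in events:
    by_hour[hour].append(score)

# Average abnormality per hour lets us compare one hour to the last.
hourly_mean = {h: sum(v) / len(v) for h, v in by_hour.items()}
print({h: round(m, 4) for h, m in hourly_mean.items()})
```

The aggregation itself is trivial; the real question, which the rest of this story addresses, is how to produce a meaningful per-sequence abnormality score in the first place.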
I won’t delve into the theory of autoencoders and how they work (there is quite a bit of good and accessible reading material out there; I have listed a few resources below), but the basic idea, in short, is that an autoencoder is a network that learns a compressed representation of its input data and tries to reconstruct the input from it. By learning how to reconstruct a sequence from its compressed representation, a well-trained autoencoder can be said to learn the “dominant rules” that govern the sequences’ formatting (I should admit that although this analogy is helpful for some folks, not everyone finds the idea of framing compressed representations as learned formatting rules appealing). We can therefore use such an autoencoder to estimate how much the formatting or pattern of a certain sequence diverges from the others.
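As a rough sketch of this idea, and not the original story’s actual setup, the following toy example one-hot encodes fixed-length sequences, trains a single-hidden-layer autoencoder with plain NumPy gradient descent, and uses the mean squared reconstruction error as a per-sequence abnormality score (the vocabulary, sequence length, network size, and training data are all illustrative assumptions):

```python
import numpy as np

VOCAB = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_"  # '_' pads short sequences
MAXLEN = 6

def encode(seq):
    """One-hot encode a padded sequence into a flat 0/1 vector."""
    seq = seq.ljust(MAXLEN, "_")[:MAXLEN]
    vec = np.zeros((MAXLEN, len(VOCAB)))
    for i, ch in enumerate(seq):
        vec[i, VOCAB.index(ch)] = 1.0
    return vec.ravel()

def reconstruction_loss(X, W1, W2):
    """Mean squared error between a batch and its reconstruction."""
    R = np.tanh(X @ W1) @ W2
    return float(np.mean((R - X) ** 2))

def train_autoencoder(X, hidden=16, lr=0.05, epochs=2000, seed=0):
    """Gradient descent on the reconstruction error of the batch X."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W1 = rng.normal(0.0, 0.1, (d, hidden))   # encoder weights
    W2 = rng.normal(0.0, 0.1, (hidden, d))   # decoder weights
    for _ in range(epochs):
        H = np.tanh(X @ W1)                  # compressed representation
        err = H @ W2 - X                     # reconstruction residual
        gW2 = H.T @ err / len(X)
        gW1 = X.T @ ((err @ W2.T) * (1.0 - H ** 2)) / len(X)
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2

def anomaly_score(seq, W1, W2):
    """Higher score = the sequence fits the learned patterns less well."""
    x = encode(seq)
    return reconstruction_loss(x[None, :], W1, W2)

# Toy "normal" sequences following a letters-then-digits style pattern.
normal = ["AB121E", "AB323", "DN176", "CK901", "QW445A", "ZP332"]
X = np.stack([encode(s) for s in normal])
W1, W2 = train_autoencoder(X)
print(anomaly_score("AB121E", W1, W2), anomaly_score("000000", W1, W2))
```

In practice one would reach for a proper framework and richer encoders; the only point of this sketch is that the reconstruction error gives us a graded, per-sequence measure of abnormality rather than a binary anomalous/normal label.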