Let's walk through the actual mechanism of LSTMs. We will first briefly discuss the overall view of an LSTM cell and then discuss each of the operations taking place within it, using a text generation example.
This article is an excerpt from the book Natural Language Processing with TensorFlow written by Thushan Ganegedara. The book provides an emphasis on both the theory and practice of natural language processing. It introduces the reader to existing TensorFlow functions and explains how to apply them while writing NLP algorithms. Specific examples are used to make the concepts and techniques concrete.
LSTMs are mainly composed of the following three gates:
- Input gate: A gate which outputs values between 0 (the current input is not written to the cell state), and 1 (the current input is fully written to the cell state). Sigmoid activation is used to squash the output to between 0 and 1.
- Forget gate: A sigmoidal gate which outputs values between 0 (the previous cell state is fully forgotten for calculating the current cell state) and 1 (the previous cell state is fully read in when calculating the current cell state).
- Output gate: A sigmoidal gate which outputs values between 0 (the current cell state is fully discarded for calculating the final state) and 1 (the current cell state is fully used when calculating the final hidden state).
The following diagram is very high-level, and some details have been hidden in order to avoid clutter. To aid understanding, we present the LSTM both with and without loops. The figure on the right-hand side depicts an LSTM with loops and that on the left-hand side shows the same LSTM with the loops expanded so that no loops are present in the model:
To get a better understanding of LSTMs, we will discuss the actual update rules and equations alongside an example.
Now let's consider an example of generating text starting from the following sentence: John gave Mary a puppy.
The story that we output should be about John, Mary, and the puppy. Let's assume our LSTM outputs two sentences following the given sentence:
John gave Mary a puppy. ____________________. _____________________.
The following is the output given by our LSTM:
John gave Mary a puppy. It barks very loudly. They named it Luna.
We are still far from outputting realistic phrases such as these. However, LSTMs can learn relationships such as those between nouns and pronouns. For example, "it" refers to the puppy, and "they" to John and Mary. Next, the LSTM should learn the relationship between the noun/pronoun and the verb. For example, for "it", the verb should have an s at the end. We illustrate these relationships/dependencies in the following figure.
As we can see, both long-term (for example, Luna -> puppy) and short-term (for example, It -> barks) dependencies are present in this phrase. The solid arrows depict links between nouns and pronouns, and the dashed arrows show links between nouns/pronouns and verbs:
Now let's consider how an LSTM, using its various operations, can model such relationships and dependencies to output sensible text, given a starting sentence.
The input gate ($i_t$) takes the current input ($x_t$) and the previous final hidden state ($h_{t-1}$) as its input and is calculated as follows:
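In the standard LSTM formulation, this is a sigmoid applied to an affine function of the two inputs. The weight and bias names $W_{ix}$, $W_{ih}$, and $b_i$ are notation assumed here:

$$i_t = \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$$

Here $\sigma$ denotes the element-wise sigmoid function.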
The input gate $i_t$ can be understood as the calculation performed at the hidden layer of a single-hidden-layer standard RNN with sigmoidal activation. Remember that we calculated the hidden state of a standard RNN as follows:
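A common bias-free form of that calculation, with $U$ and $W$ as assumed names for the input and recurrent weight matrices and tanh as the assumed activation, is:

$$h_t = \tanh(U x_t + W h_{t-1})$$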
Therefore, the calculation of $i_t$ in the LSTM is analogous to the calculation of $h_t$ in a standard RNN, except for the change in the activation function and the addition of a bias.
After the calculation, a value of 0 for $i_t$ means that no information from the current input flows to the cell state, whereas a value of 1 means that all the information from the current input flows to the cell state.
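As a rough illustration, the input gate can be sketched in NumPy. This is a minimal sketch rather than the book's TensorFlow code: the sizes, the random weights, and the names `W_ix`, `W_ih`, and `b_i` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function; squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 5  # hypothetical sizes (five state neurons)

# Hypothetical, randomly initialised parameters of the input gate
W_ix = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_ih = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_i = np.zeros(hidden_size)

x_t = rng.normal(size=input_size)  # current input (stand-in for a word vector)
h_prev = np.zeros(hidden_size)     # previous hidden state

# One gate value per state neuron, each strictly between 0 and 1
i_t = sigmoid(W_ix @ x_t + W_ih @ h_prev + b_i)
print(i_t)
```

Each entry of `i_t` controls how much of the corresponding candidate value is written to that neuron of the cell state.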
Next, another value, called the candidate value ($\tilde{c}_t$), is calculated as follows; it is used later when calculating the current cell state:
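In the standard formulation, the candidate value uses a tanh activation, so its entries lie between -1 and 1. The names $W_{cx}$, $W_{ch}$, and $b_c$ are notation assumed here:

$$\tilde{c}_t = \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c)$$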
We can visualize these calculations in the following figure:
In our example, at the very beginning of learning, the input gate needs to be highly activated. The first word that the LSTM outputs is "it". To do so, the LSTM must learn that the puppy is also referred to as "it". Let's assume our LSTM has five neurons to store the state. We would like the LSTM to store the information that "it" refers to the puppy. Another piece of information we would like the LSTM to learn (in a different neuron) is that a present-tense verb should end in s when the pronoun "it" is used. One more thing the LSTM needs to know is that the puppy barks loudly. Figure 7.5 illustrates how this knowledge might be encoded in the cell state of the LSTM. Each circle represents a single neuron (that is, hidden unit) of the cell state:
With this information, we can output the first new sentence:
John gave Mary a puppy. It barks very loudly.
Next, the forget gate is calculated as follows:
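In the standard formulation, it has the same shape as the input gate but its own parameters. The names $W_{fx}$, $W_{fh}$, and $b_f$ are notation assumed here:

$$f_t = \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$$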
The forget gate does the following. A value of 0 for the forget gate means that no information from $c_{t-1}$ is passed on to calculate $c_t$, and a value of 1 means that all the information in $c_{t-1}$ propagates into the calculation of $c_t$.
Now we will see how the forget gate helps in predicting the next sentence: They named it Luna.
As you can see, the new relationship we are looking at is between John and Mary and "they". Therefore, we no longer need the information about "it" and how the verb bark behaves, as the subjects are now John and Mary. We can use the forget gate, in combination with the current subject "they" and the corresponding verb "named", to replace the information stored in the Current subject and Verb for current subject neurons (see the following figure):
In terms of the values of weights, we illustrate this transformation in Figure 6. We do not change the state of the neuron maintaining the it -> puppy relationship, because puppy appears as an object in the last sentence. This is done by setting the forget gate $f_t$ for that neuron to 1, so that its value passes unchanged from $c_{t-1}$ to $c_t$. Then we replace the contents of the neurons maintaining the current subject and current verb information with the new subject and verb. This is achieved by setting the forget gate $f_t$ for those neurons to 0, which erases the old values, and setting the input gate $i_t$ for the same neurons to 1, so that the new subject and verb are written in. We can think of $\tilde{c}_t$ as the entity that contains the new information (such as new information from the current input $x_t$) that should be brought to the cell state:
The current cell state will be updated as follows:
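Written in the standard form, with $\odot$ denoting element-wise multiplication (notation assumed here), the update is:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$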
In other words, the current state is the combination of the following:
- What information to forget/remember from the previous cell state
- What information to add/discard from the current input
Next, in the following figure, we highlight what we have calculated so far with respect to all the calculations that are taking place inside an LSTM:
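The effect of this combination can be sketched with hand-picked numbers. Everything below is hypothetical: the five values, the assignment of neurons to roles, and the gate settings are chosen only to show a forget gate of 1 preserving a neuron and a forget gate of 0 (with an input gate of 1) overwriting it.

```python
import numpy as np

# Hypothetical previous cell state of a five-neuron LSTM.
# Assumed roles: unit 0 = it -> puppy, unit 1 = current subject,
# unit 2 = verb for current subject, units 3 and 4 = other knowledge.
c_prev = np.array([0.9, 0.8, 0.7, 0.6, 0.5])

# Forget gate: 1 keeps the old value, 0 erases it.
f_t = np.array([1.0, 0.0, 0.0, 1.0, 1.0])

# Input gate: 1 writes the candidate value, 0 blocks it.
i_t = np.array([0.0, 1.0, 1.0, 0.0, 0.0])

# Candidate values proposed from the current input (made-up numbers).
c_tilde = np.array([0.3, -0.4, 0.2, 0.1, -0.6])

# Cell state update: keep what the forget gate allows, add what the input gate allows.
c_t = f_t * c_prev + i_t * c_tilde
print(c_t)
```

Units 0, 3, and 4 carry over their previous values, while units 1 and 2 are replaced by the candidate values. In a trained LSTM the gate values are learned and rarely exactly 0 or 1.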
After learning the full state, it would look like the following figure:
Next, we will look at how the final state of the LSTM cell ($h_t$) is computed:
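In the standard formulation, the output gate $o_t$ is computed like the other gates and then scales a squashed copy of the cell state. The names $W_{ox}$, $W_{oh}$, and $b_o$ are notation assumed here, and $\odot$ denotes element-wise multiplication:

$$o_t = \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o)$$

$$h_t = o_t \odot \tanh(c_t)$$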
In our example, we want to output the following sentence: They named it Luna.
For this, we do not need the neuron that stores how the puppy barks, because this sentence is about the name of the puppy. Therefore, we can ignore that neuron (containing the bark -> loud relationship) when predicting the last sentence. This is exactly what $o_t$ does: it ignores the unnecessary memory and retrieves only the relevant memory from the cell state when calculating the final output of the LSTM cell. In the following figure, we illustrate what an LSTM cell looks like at a full glance:
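The masking behaviour of the output gate can be sketched with made-up numbers. The cell state values and the position of the bark -> loud neuron (index 4 here) are assumptions for illustration only.

```python
import numpy as np

# Hypothetical cell state; unit 4 is assumed to hold the bark -> loud relationship.
c_t = np.array([0.9, -0.4, 0.2, 0.6, 0.5])

# Output gate: 1 exposes a neuron to the hidden state, 0 hides it.
o_t = np.array([1.0, 1.0, 1.0, 1.0, 0.0])

# Hidden state: gated, tanh-squashed view of the cell state.
h_t = o_t * np.tanh(c_t)
print(h_t)
```

The memory in unit 4 is still present in `c_t` and can be used at later time steps; it is only hidden from the current output `h_t`.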
Here, we summarize all the equations relating to the operations taking place within an LSTM cell.
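In the standard formulation, and using assumed names for the weights ($W_{\cdot x}$ for input weights, $W_{\cdot h}$ for recurrent weights) and biases, the full set reads:

$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) \\
\tilde{c}_t &= \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c) \\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$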
Now in the bigger picture, for a sequential learning problem, we can unroll the LSTM cells over time to show how they link together, with each cell receiving the previous state of the cell to compute the next state, as shown in the following figure:
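Unrolling can be sketched as a plain loop that feeds each step's hidden and cell state into the next step. This is a minimal NumPy sketch under assumed sizes and random weights, not the book's TensorFlow implementation; `init_params` and `lstm_step` are hypothetical helper names.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_size, hidden_size, rng):
    """Create one (input weights, recurrent weights, bias) triple for each
    of the three gates (i, f, o) and the candidate value (c)."""
    params = {}
    for name in ("i", "f", "o", "c"):
        params[name] = (
            rng.normal(scale=0.1, size=(hidden_size, input_size)),
            rng.normal(scale=0.1, size=(hidden_size, hidden_size)),
            np.zeros(hidden_size),
        )
    return params

def lstm_step(x_t, h_prev, c_prev, params):
    """Compute one LSTM time step and return the new hidden and cell state."""
    def affine(name):
        W_x, W_h, b = params[name]
        return W_x @ x_t + W_h @ h_prev + b

    i_t = sigmoid(affine("i"))          # input gate
    f_t = sigmoid(affine("f"))          # forget gate
    o_t = sigmoid(affine("o"))          # output gate
    c_tilde = np.tanh(affine("c"))      # candidate value
    c_t = f_t * c_prev + i_t * c_tilde  # cell state update
    h_t = o_t * np.tanh(c_t)            # final hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 5, 6  # hypothetical sizes
params = init_params(input_size, hidden_size, rng)
xs = rng.normal(size=(seq_len, input_size))  # stand-in input sequence

# Unroll over time: the state produced at step t is the input state at step t + 1.
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
hidden_states = []
for x_t in xs:
    h, c = lstm_step(x_t, h, c, params)
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)
print(hidden_states.shape)
```

The same parameters are reused at every time step; only the states `h` and `c` change as the sequence is consumed.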
However, this alone is not enough to do something useful. Even though we can create a nice chain of LSTM cells that models a sequence, we still don't have an output or a prediction. To use what the LSTM has learned, we need a way to extract the final output from it. Therefore, we attach a softmax layer (with weights $W_s$ and bias $b_s$) on top of the LSTM. The final output is obtained using the following equation:
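With the weights $W_s$ and bias $b_s$ named above, a softmax layer over the hidden state takes the standard form (the output symbol $y_t$ is notation assumed here):

$$y_t = \mathrm{softmax}(W_s h_t + b_s)$$

A minimal NumPy sketch of this layer follows. The vocabulary size, the random weights, and the stand-in hidden state are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_size, vocab_size = 5, 8  # hypothetical sizes

# Hypothetical softmax layer parameters
W_s = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
b_s = np.zeros(vocab_size)

h_t = rng.normal(size=hidden_size)  # stand-in for the LSTM's hidden state

# Probability distribution over the vocabulary for the next word
y_t = softmax(W_s @ h_t + b_s)
predicted_index = int(np.argmax(y_t))
print(y_t.sum(), predicted_index)
```

In a text generation setting, `predicted_index` would be looked up in the vocabulary to obtain the next word.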
Now the final picture of the LSTM with the softmax layer looks like the following figure:
Thushan Ganegedara is currently a third year Ph.D. student at the University of Sydney, Australia. He is specializing in machine learning and has a liking for deep learning. He lives dangerously and runs algorithms on untested data. He also works as the chief data scientist for AssessThreat, an Australian start-up. He got his BSc. (Hons) from the University of Moratuwa, Sri Lanka. He frequently writes technical articles and tutorials about machine learning. Additionally, he also strives for a healthy lifestyle by including swimming in his daily schedule.