There are some excellent explanations for these terms, including this and this. Essentially, consider the example of an information retrieval system, where a query is mapped to the best results. To do so, the query is tested against certain descriptors or keys, and the best-matching key has its value selected and shown at the top of the search results. In the context of transformers, each word vector is projected into a query, a key, and a value: the queries are compared against the keys to score how relevant each word is to every other word, and the values are then combined according to those scores.
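To make the query/key/value analogy concrete, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the core of the transformer's attention blocks. The function name and shapes are illustrative, not taken from any particular library.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: arrays of shape (seq_len, d_k); V: shape (seq_len, d_v)."""
    d_k = K.shape[-1]
    # Each query is scored against every key (the "search" step).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns the scores into weights that sum to 1 per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the values.
    return weights @ V

# Example: 4 words, 8-dimensional queries/keys/values
Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```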
Like RNNs, the transformer model is divided into an encoder and a decoder section. The encoder has one set of inputs and the decoder has two (these will be described soon). The set of inputs that the encoder receives is simply the set of inputs that goes into the model (i.e. if you're trying to translate English to French, the inputs into the encoder will be whatever English sentence is fed into the model). Prior to being fed into the encoder, each word in the sentence is mapped to a vector which represents its location relative to other words in a high-dimensional space (i.e. the vectors for dog and horse might be relatively similar to one another because they are both animals). This embedding space/mapping can either be trained or taken from elsewhere. After each word in the sentence is mapped to a vector, a positional vector is added to it, which encapsulates information about the relative location of each word using sine and cosine functions. Then, the encoder takes the resultant vector representing each word and attempts to spit out new vectors which capture each word's semantic relation to the others (within the sentence, as opposed to general meaning like in the embedding stage). It does this using a combination of multi-headed attention, residual connections, and a feedforward network.
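Here is a rough sketch of that sine/cosine positional encoding, assuming the formulation from the original "Attention Is All You Need" paper; the function name and dimensions are just for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                  # (1, d_model)
    # Wavelengths grow geometrically with the dimension index.
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                     # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                # odd dims: cosine
    return pe

# The encoder input is simply the word embeddings plus these vectors:
# encoder_input = word_embeddings + positional_encoding(seq_len, d_model)
```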
The decoder is a bit more complex. Let's assume that we are doing English-to-French translation and we have predicted the first French word. We now want to predict the next word. We again map the French words generated so far to embedding vectors (this time French ones) and add positional encoding vectors. We then put these vectors through 2 attention blocks. The first attention block uses a technique called masking, which prevents each position from attending to words that come after it, so the model cannot peek at future words (see the sketch below). The second attention block takes the results of the first attention block along with the results from the encoder and forms association vectors between the French and English words. These resultant association vectors then go through the rest of the model and output a set of probabilities for the possible selections of the next word.
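Here is a small sketch of what that mask typically looks like: a causal (look-ahead) mask added to the attention scores so each position can only attend to itself and earlier positions. The helper name is illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    # Upper-triangular entries (future positions) are set to -inf so they
    # become ~0 after the softmax inside the attention step.
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

# Applied inside attention: scores = Q @ K.T / sqrt(d_k) + causal_mask(seq_len)
print(causal_mask(4))
```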