We use one tiny sentence, Vienna is super, and follow a single Skip-gram example all the way from one-hot input to a softmax distribution and one training update.

One-hot input
We start with a vocabulary index, not meaning. The center word Vienna becomes a one-hot vector that only says “pick the first row”.
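A minimal NumPy sketch of this step, assuming a toy vocabulary {Vienna: 0, is: 1, super: 2} (the indices are chosen here for illustration):

```python
import numpy as np

# Hypothetical toy vocabulary: word -> index
vocab = {"Vienna": 0, "is": 1, "super": 2}

def one_hot(word, vocab):
    """Return a vector that is 1 at the word's index and 0 elsewhere."""
    x = np.zeros(len(vocab))
    x[vocab[word]] = 1.0
    return x

x = one_hot("Vienna", vocab)
print(x)  # [1. 0. 0.] -- only says "pick the first row"
```

The vector carries no meaning of its own; it is purely an index in disguise.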
Multiplying the one-hot vector by the input matrix W simply selects Vienna’s row; no real arithmetic happens. That row is Vienna’s learned embedding.
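The “multiplication is a lookup” claim is easy to verify; a sketch with an assumed 3×2 matrix W (dimensions and random values chosen only for illustration):

```python
import numpy as np

np.random.seed(0)                # reproducible illustration
W = np.random.randn(3, 2)        # input matrix: one 2-d embedding per vocab word
x = np.array([1.0, 0.0, 0.0])    # one-hot vector for Vienna (index 0)

h = x @ W                        # the matrix product...
assert np.allclose(h, W[0])      # ...is exactly Vienna's row: the embedding
print(h)
```

Real implementations skip the multiplication entirely and index the row directly.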
The hidden embedding is then multiplied by the output matrix W', producing one raw score per vocabulary word. These are logits, not probabilities.
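Continuing the sketch with an assumed 2×3 output matrix W' (again, shapes and values are illustrative):

```python
import numpy as np

np.random.seed(0)
W  = np.random.randn(3, 2)   # input embeddings, as before
Wp = np.random.randn(2, 3)   # output matrix W': one column per vocab word

h = W[0]                     # Vienna's embedding (the selected row)
logits = h @ Wp              # one raw score per vocabulary word
print(logits.shape)          # (3,) -- raw scores: can be negative, need not sum to 1
```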
This is the key transformation. Softmax exponentiates the logits and divides each result by their shared sum, producing a probability distribution over the vocabulary that sums to 1.
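A minimal softmax, with the standard max-shift for numerical stability (the logit values below are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by the max so exp() cannot overflow
    return e / e.sum()        # divide by the shared sum

logits = np.array([2.0, -1.0, 0.5])  # made-up scores for illustration
p = softmax(logits)
print(p, p.sum())                    # probabilities that sum to 1
```

Note that softmax preserves the ordering of the logits: the largest score becomes the largest probability.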
The target context word is “is”, so the correct answer is a one-hot vector with a 1 at the index of “is”. We compare that target with the predicted distribution to get an error signal.
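Under cross-entropy loss, this comparison has a famously simple form: the gradient on the logits is just p − y, the predicted distribution minus the one-hot target. A sketch with made-up probabilities, assuming “is” sits at index 1:

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])   # predicted distribution (made-up values)
y = np.array([0.0, 1.0, 0.0])   # one-hot target for the context word "is"

error = p - y                   # gradient of cross-entropy w.r.t. the logits
print(error)                    # negative at the true word, positive elsewhere
```

The signs tell the story: the update will push the true word’s score up and every other score down.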
Now we use the error signal to update the matrices. All of W' receives gradient, but in W only Vienna’s row changes, because only that row was selected by the one-hot input.
This is the whole point of training. After the update, the probability of the true context word, “is”, should increase.
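Putting all the steps together, one gradient-descent update should raise the probability of the true context word. A self-contained sketch; the dimensions, random seed, and learning rate are all chosen for illustration:

```python
import numpy as np

np.random.seed(0)
V, D, lr = 3, 2, 0.1            # vocab size, embedding dim, learning rate (assumed)
W  = np.random.randn(V, D)      # input embeddings
Wp = np.random.randn(D, V)      # output matrix W'
center, context = 0, 1          # Vienna -> is

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = W[center]                   # one-hot lookup: Vienna's row
p = softmax(h @ Wp)             # predicted distribution
p_before = p[context]

d_logits = p.copy()
d_logits[context] -= 1.0        # p - y: gradient on the logits
d_Wp = np.outer(h, d_logits)    # gradient for W'
d_h  = Wp @ d_logits            # gradient flowing back to the embedding

Wp -= lr * d_Wp                 # all of W' is updated
W[center] -= lr * d_h           # but in W, only Vienna's row changes

p_after = softmax(W[center] @ Wp)[context]
print(p_before, p_after)        # probability of "is" goes up
```

Repeating this update over many (center, context) pairs from a real corpus is, in essence, all that Skip-gram training does.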