
Lecture 01: A Gentle Introduction to Word2Vec

We use one tiny sentence, "Vienna is super", and follow a single Skip-gram example all the way from one-hot input to a softmax distribution and one training update.

Shared example

One training example, kept fixed throughout the whole page

Vocabulary: Vienna, is, super
Center word: Vienna
Target context word: is
Embedding size: 2
Step 1

One-hot input

We start with a vocabulary index, not meaning. The center word Vienna becomes a one-hot vector that only says “pick the first row”.

[Figure: vocabulary view]

Takeaway: One-hot encoding does not contain semantics. It only selects which embedding row will be used next.
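Step 1 can be sketched in a few lines of Python. The vocabulary order (Vienna, is, super) comes from the shared example above; everything else is just index bookkeeping.

```python
# Build the one-hot input for the center word "Vienna".
vocab = ["Vienna", "is", "super"]

def one_hot(word, vocab):
    """Return a vector that is 1.0 at the word's index and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]

x = one_hot("Vienna", vocab)
print(x)  # [1.0, 0.0, 0.0] -- it only says "pick row 0", nothing more
```

Note that the vector carries no meaning by itself; it is purely a row selector.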
Step 2

Embedding lookup from W

Multiplying the one-hot vector by W simply selects Vienna’s row. That row is the learned embedding.

[Figure: input matrix W and hidden embedding h]
Vienna selects the first row of W, so the dense representation becomes [0.1, 0.2].
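The lookup can be verified directly. Vienna's row [0.1, 0.2] is given on the page; the other two rows of W are not shown, so the values below are placeholders for illustration.

```python
# Multiplying the one-hot vector by W just copies out Vienna's row.
W = [
    [0.1, 0.2],   # Vienna  (given in the lecture)
    [0.3, 0.4],   # is      (assumed for illustration)
    [0.5, 0.6],   # super   (assumed for illustration)
]
x = [1.0, 0.0, 0.0]  # one-hot for Vienna

# h = x^T W: a weighted sum of rows where only Vienna's weight is 1.
h = [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
print(h)  # [0.1, 0.2] -- exactly Vienna's row, nothing computed, just selected
```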
Step 3

Logits: raw scores for every context word

Multiplying the hidden embedding by W' produces one raw score per vocabulary word. These are logits, not probabilities.

[Figure: output matrix W' and logits u]
Logits are just unnormalized scores. The largest score right now belongs to super, which is not what we want.
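A sketch of the logit computation. The page does not show the entries of W', so the values below are illustrative placeholders, chosen only so that super ends up with the largest raw score, as in the lecture.

```python
# Multiply h by the output matrix W' to get one logit per vocabulary word.
h = [0.1, 0.2]             # Vienna's embedding from Step 2 (given)
Wp = [[0.5, 0.4, 0.9],     # assumed W' (2 x 3); columns correspond to
      [0.6, 0.7, 0.8]]     # Vienna, is, super

u = [sum(h[d] * Wp[d][j] for d in range(2)) for j in range(3)]
print(u)  # three raw scores; the third entry ("super") is the largest
```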
Step 4

Softmax: how raw scores become probabilities

This is the key transformation. Softmax exponentiates the logits, divides by their shared total, and produces a probability distribution that sums to 1.

A. The softmax rule: p(w_i | center) = exp(u_i) / (exp(u_1) + exp(u_2) + exp(u_3))
B. Substitute the current logits
C. Exponentiate and normalize: the exponentiated logits share one denominator
D. Final probabilities
What goes in: raw scores. What happens: exponentiate, then normalize. What comes out: probabilities over Vienna, is, super.
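The transformation is small enough to write out. The logits here continue the illustrative placeholders from Step 3, not the lecture's exact figures.

```python
import math

# Softmax: exponentiate each logit, then divide by the shared total.
def softmax(u):
    exps = [math.exp(ui) for ui in u]       # step 1: exponentiate
    total = sum(exps)                       # step 2: shared denominator
    return [e / total for e in exps]        # step 3: normalize

u = [0.17, 0.18, 0.25]   # assumed raw scores for Vienna, is, super
p = softmax(u)
print(p)        # three positive numbers
print(sum(p))   # they sum to 1, so this is a probability distribution
```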
Step 5

Compare prediction with the target

The target context word is is, so the correct answer is the one-hot vector for is. We compare that target with the predicted distribution.

[Figure: target one-hot vs. predicted probabilities]
The model should push p(is | Vienna) up and push the others down.
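The "push up / push down" intuition is exactly the sign pattern of the error vector p - y. The predicted probabilities below are assumed placeholders (roughly uniform, as in the lecture); the signs are what matter.

```python
# Compare the predicted distribution with the one-hot target for "is".
y = [0.0, 1.0, 0.0]            # target: "is" is the true context word
p = [0.3233, 0.3265, 0.3502]   # assumed prediction, roughly uniform

# Error signal e = p - y: positive entries get pushed down,
# and the single negative entry (the true word "is") gets pushed up.
e = [p_i - y_i for p_i, y_i in zip(p, y)]
print(e)   # only the middle ("is") entry is negative
```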
Step 6

One update of gradient descent

Now we use the error signal to update the matrices. Only Vienna’s row in W changes, because Vienna was the center word.

[Figure: updates for W' and W]
The model changes exactly the parameters that participated in this training example.
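One full update can be sketched with the standard softmax/cross-entropy gradients for Skip-gram. Everything except Vienna's row of W ([0.1, 0.2]) and the vocabulary order is an assumed placeholder, including the learning rate.

```python
import math

lr = 0.1                                 # assumed learning rate
h  = [0.1, 0.2]                          # Vienna's row of W (given)
Wp = [[0.5, 0.4, 0.9],                   # assumed output matrix W' (2 x 3)
      [0.6, 0.7, 0.8]]
y  = [0.0, 1.0, 0.0]                     # target one-hot for "is"

# Forward pass: logits, softmax probabilities, error signal.
u = [sum(h[d] * Wp[d][j] for d in range(2)) for j in range(3)]
exps = [math.exp(v) for v in u]
p = [v / sum(exps) for v in exps]
e = [p[j] - y[j] for j in range(3)]      # e = p - y

# Gradient w.r.t. h uses the *old* W', so compute it before updating W'.
grad_h = [sum(Wp[d][j] * e[j] for j in range(3)) for d in range(2)]

# dL/dW'[d][j] = h[d] * e[j]: every entry of W' participated, so all move.
for d in range(2):
    for j in range(3):
        Wp[d][j] -= lr * h[d] * e[j]

# dL/dh = W' e: only Vienna's row of W participated, so only it moves.
h = [h[d] - lr * grad_h[d] for d in range(2)]
print(h)   # Vienna's row has shifted; every other row of W is untouched
```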
Step 7

Before and after: did the right probability go up?

This is the whole point of training. After the update, the probability of the true context word is should increase.

[Figure: old vs. new distribution]
After one step, p(is | Vienna) rises from 0.3332 to 0.3443. That is Word2Vec learning.
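The whole page can be replayed end to end to check that direction of change. Because the matrix values here (other than Vienna's given row [0.1, 0.2]) are assumed placeholders, the exact numbers differ from the lecture's 0.3332 and 0.3443, but p(is | Vienna) still rises after one step, which is the point.

```python
import math

def softmax(u):
    exps = [math.exp(v) for v in u]
    s = sum(exps)
    return [v / s for v in exps]

def forward(h, Wp):
    """Logits h @ W', then softmax."""
    u = [sum(h[d] * Wp[d][j] for d in range(2)) for j in range(3)]
    return softmax(u)

lr = 0.1                                  # assumed learning rate
h  = [0.1, 0.2]                           # Vienna's embedding (given)
Wp = [[0.5, 0.4, 0.9], [0.6, 0.7, 0.8]]   # assumed W'
y  = [0.0, 1.0, 0.0]                      # target: "is"

p_old = forward(h, Wp)
e = [p_old[j] - y[j] for j in range(3)]

# Same update as Step 6: grad w.r.t. h from the old W', then both updates.
grad_h = [sum(Wp[d][j] * e[j] for j in range(3)) for d in range(2)]
for d in range(2):
    for j in range(3):
        Wp[d][j] -= lr * h[d] * e[j]
h = [h[d] - lr * grad_h[d] for d in range(2)]

p_new = forward(h, Wp)
print(p_old[1], "->", p_new[1])   # probability of the true word "is" rises
```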