Noise-Contrastive Estimation

I encountered the noise-contrastive estimation (NCE) technique in several papers recently, so it is a good time to visit this topic. NCE is a commonly used technique under the word2vec umbrella [2]. I previously talked about several other ways to generate word embeddings in [1] but skipped introducing NCE. In this post, I will introduce it, with notation borrowed from the paper that applies it to word2vec [2].

First, let’s get familiar with two NLP terms, target word and context [3], which are denoted as w and h in [2]. For example, in the sentence “the quick brown fox jumps over the lazy dog”, if the target word w is “fox”, its context h could be a window of surrounding words such as “quick”, “brown”, “jumps”, and “over”.

Using the idea of word2vec, words that appear together frequently should have similar embeddings. Going one step further, a word’s embedding should match the embedding of its context. The context is often represented by another embedding derived from the embeddings of the words in it; for example, the context embedding can be as simple as the mean pooling of the embeddings of all the words in the context.
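
As a toy illustration of that pooling idea (a minimal numpy sketch; the embedding matrix E, the word indices, and the dot-product scorer are placeholder assumptions, not the exact parameterization used in [2]):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10000, 64              # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, d))   # toy word embedding matrix, one row per word

def context_embedding(context_ids):
    """h: mean pooling of the embeddings of the context words."""
    return E[context_ids].mean(axis=0)

def score(w_id, h):
    """s_theta(w, h): here simply a dot product between w's embedding and h."""
    return E[w_id] @ h

h = context_embedding([12, 45, 387, 9021])  # indices of the context words
print(score(777, h))                        # how well target word 777 matches its context
```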

Translating this idea into a loss function, we want to increase the following probability for every observed pair of (w, h):

(1)   \begin{equation*}    P^h_\theta(w) = \frac{\exp(s_\theta(w, h))}{\sum_{w' \in V} \exp(s_\theta(w', h))}\end{equation*}


where s_\theta(\cdot) is a scoring function that scores the similarity of any pair (w, h), and w' ranges over all words in the entire vocabulary V. Obviously, this can be an expensive loss function if the vocabulary size is large, which is usually the case.
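
To make that cost concrete, here is what computing Eqn. 1 naively looks like (a sketch using the same kind of toy embeddings as above; note that the normalizer touches every word in the vocabulary for every single (w, h) pair):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10000, 64
E = rng.normal(size=(V, d))                  # toy embedding matrix, as before

def softmax_prob(w_id, h):
    """P^h_theta(w): normalized over the ENTIRE vocabulary, i.e. O(|V|) per pair."""
    logits = E @ h                           # s_theta(w', h) for all |V| words at once
    logits -= logits.max()                   # subtract the max for numerical stability
    p = np.exp(logits)
    return p[w_id] / p.sum()

h = E[[12, 45, 387, 9021]].mean(axis=0)      # mean-pooled context embedding
print(softmax_prob(777, h))
```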

To avoid the inefficiency of computing pairwise scores for all the words, NCE transforms the problem of Eqn. 1 into a binary classification problem. For an observed pair (w, h), we randomly pick k additional words from the vocabulary to form a negative sample set. Among the k+1 words (w and the k sampled words), we look at each word and compute the probability that it is the word from the original observed pair (w, h). A word ends up among the k+1 candidates either because it is the observed target word, drawn from the distribution P^h_\theta(w), or because it is one of the k negatives, drawn from a noise distribution, here taken to be uniform: P_n(w)=1/|V|. Using Bayes' theorem (more details in [3]), we can derive the conditional probability that w comes from the observed pair, given that w is among the k+1 words:

(2)   \begin{equation*}    P^h(D=1|w, \theta) = \frac{P^h_\theta(w)}{P^h_\theta(w) + kP_n(w)}\end{equation*}
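
For completeness, here is the Bayes step behind Eqn. 2 spelled out (a sketch under the setup above: one positive word drawn from P^h_\theta(w) and k negative words drawn from P_n(w), so the prior of a candidate being the positive is P(D=1) = \frac{1}{k+1}):

\begin{equation*}P^h(D=1|w, \theta) = \frac{P(w|D=1)P(D=1)}{P(w|D=1)P(D=1) + P(w|D=0)P(D=0)} = \frac{\frac{1}{k+1}P^h_\theta(w)}{\frac{1}{k+1}P^h_\theta(w) + \frac{k}{k+1}P_n(w)}\end{equation*}

Multiplying the numerator and denominator by k+1 gives exactly Eqn. 2.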

The classification objective then becomes: for the word w from (w, h), P^h(D=1|w, \theta) should be as high as possible; for a word w from the negative samples, P^h(D=0|w, \theta) = 1 - P^h(D=1|w, \theta) should be as high as possible. That is what Eqn. 8 from [2] expresses:

(3)   \begin{equation*}    J^h(\theta) = \mathbb{E}_{w \sim \text{data}} \left[ \log P^h(D=1|w,\theta)\right] + k\,\mathbb{E}_{w \sim P_n} \left[ \log P^h(D=0|w,\theta) \right]\end{equation*}

There are some theoretical reasons, which I am not too sure about, that can be taken advantage of to relax the definition of P^h(D=1|w, \theta): we replace P^h_\theta(w) in Eqn. 2 with the unnormalized score \exp(s_\theta(w,h)). With this replacement, we no longer need to worry about computing the partition function \sum_{w' \in V} \exp(s_\theta(w', h)) from Eqn. 1.
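
Putting Eqn. 2, Eqn. 3, and the unnormalized-score relaxation together, a per-pair NCE loss can be sketched as follows (a rough numpy sketch on the same toy embeddings; the uniform noise distribution and k are as described above, everything else is a placeholder rather than the actual training code of [2]):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10000, 64, 5
E = 0.1 * rng.normal(size=(V, d))     # toy embeddings, small scale so exp() stays tame

def nce_loss(w_id, context_ids):
    """Negative of Eqn. 3 for one (w, h) pair, with exp(s_theta) replacing P^h_theta."""
    h = E[context_ids].mean(axis=0)   # context embedding
    p_n = 1.0 / V                     # uniform noise distribution P_n(w) = 1/|V|

    def p_pos(word_id):               # Eqn. 2 with the unnormalized model score
        u = np.exp(E[word_id] @ h)    # exp(s_theta(word, h)) in place of P^h_theta(word)
        return u / (u + k * p_n)

    noise_ids = rng.integers(0, V, size=k)          # k negatives sampled from P_n
    loss = -np.log(p_pos(w_id))                     # positive: push P(D=1|w) up
    loss -= np.sum(np.log(1.0 - p_pos(noise_ids)))  # negatives: push P(D=0|w') up
    return loss

print(nce_loss(777, [12, 45, 387, 9021]))
```

In practice the second expectation in Eqn. 3 is approximated by the k sampled negatives, which is what the sum over noise_ids does here, and the loss is minimized with respect to the embedding parameters.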

There are also variants of the NCE loss. One is called InfoNCE (see Eqn. 4 in [4]) and has been applied to model-based reinforcement learning [5]. [4] claims that its objective actually maximizes a lower bound on the mutual information between w and h in the word2vec context (or between the state and the next state in the reinforcement learning context). Mutual information can be understood as the reduction in uncertainty about w after observing h [6].
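
For comparison, a minimal InfoNCE-style loss for one positive and k negatives could look like the following (again only a sketch; the dot-product critic and toy embeddings are my own simplifications, not the exact setup of [4] or [5]):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10000, 64, 5
E = 0.1 * rng.normal(size=(V, d))

def info_nce_loss(w_id, context_ids, noise_ids):
    """InfoNCE: cross-entropy of identifying the true w among k+1 candidates."""
    h = E[context_ids].mean(axis=0)
    candidates = np.concatenate(([w_id], noise_ids))  # positive first, then k negatives
    logits = E[candidates] @ h                        # critic scores, here dot products
    logits -= logits.max()                            # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]                            # -log P(positive | candidates)

noise_ids = rng.integers(0, V, size=k)
print(info_nce_loss(777, [12, 45, 387, 9021], noise_ids))
```

Note the difference from the NCE sketch above: InfoNCE scores the positive against the negatives jointly with a softmax over the k+1 candidates, whereas NCE makes k+1 independent binary decisions.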


References

[1] https://czxttkl.com/2017/01/02/embedding-and-heterogeneous-network-papers/

[2] Learning word embeddings efficiently with noise-contrastive estimation

[3] https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html#context-based-skip-gram-model

[4] Representation Learning with Contrastive Predictive Coding

[5] Unsupervised State Representation Learning in Atari

[6] http://www.scholarpedia.org/article/Mutual_information
