Language Modelling with n-grams
In this article, we start from the fundamentals of probability and build a model that can predict the next word in a sentence. We will also define an appropriate metric to evaluate the model’s quality.
Language Model
In simple terms, a language model assigns a probability to each possible next word in a sentence.
Example: I love __ → coffee: 50%, tea: 40%, milk: 10%
The next question is how to compute these probabilities. Let’s say we have some history h and we want to compute the probability of a word w given h — $P(w \mid h)$.
For example, if w = ‘coffee’ and h = ‘I love’, we want to compute P(coffee | I love). Simple rules of probability tell us that if we have a large corpus, and these phrases are taken from it, then we can estimate this from relative counts: \(P(\text{coffee} \mid \text{I love}) = \frac{C(\text{I love coffee})}{C(\text{I love})}\), where C(·) is the number of times a phrase appears in the corpus.
To approach this problem in more formal and general way, let us introduce some notations.
- Let us say we have a corpus and $w_1,w_2,w_3,\dots,w_n$ is a sequence of n words. The same sequence can also be written as $w_{1:n}$.
- The joint probability of each word in the sequence taking a particular value can be written as: \(\begin{equation} \tag{1} P(X_1=w_1,X_2=w_2,\dots,X_n=w_n)=P(w_1 \dots w_n) \end{equation}\)
- We can compute the joint probability using the chain rule: \(\begin{equation} P(w_1 \dots w_n)=P(w_1)P(w_2|w_1)P(w_3|w_{1:2}) \dots P(w_n|w_{1:n-1}) \end{equation}\) \(\begin{equation} \tag{2} P(w_1 \dots w_n)=\prod_{k=1}^{n} P(w_k|w_{1:k-1}) \end{equation}\) And we’re done. But here’s a catch: terms like $P(w_n \mid w_{1:n-1})$ cannot be estimated reliably for long histories, because most long prefixes never appear in any corpus, as the small example below illustrates. So, how do we proceed? We use an approximation.
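To see the issue concretely, apply the chain rule to the short sequence ‘I love coffee’:

\[P(\text{I love coffee}) = P(\text{I})\,P(\text{love} \mid \text{I})\,P(\text{coffee} \mid \text{I love})\]

Even here, the last factor needs counts of the exact prefix ‘I love’. For a twenty-word sentence, the corresponding prefix will almost certainly never have appeared in the corpus, so its count-based probability cannot be estimated.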
Markov assumption:
We assume that the probability of the next word can be predicted without looking too far into the past. That is:
\(\begin{equation} \tag{3} P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-N+1:n-1}) \end{equation}\) If N = 2, we consider only the previous word as history (a bigram). If N = 3, we use the previous two words (a trigram), and so on.
This simplifies our calculations. Now we can compute counts from the corpus and build a model.
Bigram Model
Let’s take the simple case of N = 2. Therefore: $P(w_n|w_{1:n-1}) \approx P(w_n|w_{n-1})$. Substituting this into equation (2), we get: \(\begin{equation} \tag{4} P(w_1 \dots w_n)=\prod_{k=1}^{n} P(w_k|w_{k-1}) \end{equation}\)
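Applying equation (4) to the earlier example, the long history in the last factor collapses to a single word:

\[P(\text{I love coffee}) \approx P(\text{I})\,P(\text{love} \mid \text{I})\,P(\text{coffee} \mid \text{love})\]

(In practice a start-of-sentence token such as &lt;s&gt; is often added so that the first factor is also a bigram; the code sketch further below does this.)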
We can calculate the counts from the corpus and then normalize them so that the values lie between 0 and 1.
\(P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_{n})}{\sum_w C(w_{n-1}w)}\) Since every bigram $w_{n-1}w$ starts with $w_{n-1}$, the denominator is simply the unigram count of $w_{n-1}$:
\(\begin{equation} \tag{5} P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_{n})}{C(w_{n-1})} \end{equation}\) Using equations (4) and (5), we can build a bigram model.
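To make equations (4) and (5) concrete, here is a minimal sketch in Python. The three-sentence corpus and the `<s>`/`</s>` boundary tokens are illustrative assumptions of mine, not something fixed by the text above.

```python
from collections import Counter, defaultdict

# Toy corpus (an illustrative assumption, not from the article).
corpus = [
    "i love coffee",
    "i love tea",
    "i love coffee and tea",
]

# Collect unigram and bigram counts, padding sentences with boundary tokens.
unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens)
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """Equation (5): P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][word] / unigram_counts[prev]

def sentence_prob(sentence):
    """Equation (4): P(w_1 ... w_n) as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("love", "coffee"))  # 2/3: 'love' occurs 3 times, 'love coffee' twice
print(sentence_prob("i love tea"))    # 1/3 on this toy corpus
```

The boundary tokens make the first word’s probability a proper bigram and let the model assign probability to a sentence ending.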
Evaluation
We can evaluate our language model in two ways:
Maximum Likelihood
One way to evaluate a language model is to examine the probabilities it assigns to a test set. Higher probabilities generally indicate a better model. Because individual word probabilities are typically much smaller than 1, multiplying many of them can lead to numerical underflow. Therefore, we typically work in log space. For a good model:
- The product of probabilities, $\prod_{i=1}^{n}P(w_i\mid w_{i-1})$, should be high. This product is the likelihood of the test set under the model, so preferring the model with the highest value is a maximum-likelihood criterion.
- Equivalently, in log space, the sum of log probabilities, $\sum_{i=1}^{n} \log P(w_i \mid w_{i-1})$, should be high. This is called the maximum log-likelihood. We often divide this sum by n to normalize for sequence length, which doesn’t change the relative ranking of models.
- Maximizing this normalized log-likelihood amounts to maximizing an empirical estimate of the expected log probability, \(E[\log P(w_i \mid w_{i-1})]\); a small code sketch follows below.
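As a rough sketch of this evaluation, the snippet below computes the length-normalized log-likelihood of a test sentence. It reuses the `bigram_prob` function and the toy corpus from the earlier sketch, so the numbers are only illustrative.

```python
import math

def avg_log_likelihood(sentence):
    """(1/n) * sum of log P(w_i | w_{i-1}) over the sentence; higher (closer to 0) is better."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total, n = 0.0, 0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob(prev, word)
        if p == 0.0:
            return float("-inf")  # unseen bigram; addressed by smoothing later
        total += math.log(p)
        n += 1
    return total / n

print(avg_log_likelihood("i love tea"))  # ~ -0.27 on the toy corpus
```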
Perplexity
The probability of a test set also depends on its length. We prefer a metric normalized by length—one that is independent of the number of words. Perplexity provides this normalization. It’s defined as the inverse probability of the test set, normalized by the number of words. If there are N words:\(\text{Perplexity}(W) = P(w_1,w_2,\dots,w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1,w_2,\dots,w_N)}}\)
For a bigram model:
\[\text{Perplexity} = \sqrt[N]{\frac{1}{\prod_{i=1}^{N}P(w_i|w_{i-1})}}\]Lower perplexity indicates a better model.
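Here is a matching sketch, again reusing the `bigram_prob` function from the toy example above. It relies on the identity $P(W)^{-1/N} = \exp\left(-\frac{1}{N}\sum_{i}\log P(w_i \mid w_{i-1})\right)$, so we can work in log space instead of multiplying many small probabilities.

```python
import math

def perplexity(test_sentences):
    """Perplexity of a test set: exp of the negative average log probability."""
    log_prob, N = 0.0, 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p = bigram_prob(prev, word)
            if p == 0.0:
                return float("inf")  # a zero probability makes perplexity blow up
            log_prob += math.log(p)
            N += 1
    return math.exp(-log_prob / N)

print(perplexity(["i love tea"]))   # ~1.32 on the toy corpus
print(perplexity(["i love milk"]))  # inf: 'love milk' never occurs in training
```

The second call shows exactly the problem the next section addresses: a single unseen bigram drives the whole test-set probability to zero.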
Smoothing:
A problem arises when the test set contains words or n-grams not present in the training set. In such cases, the probability assigned by the model will be zero, leading to an undefined perplexity (division by zero). To address this, we use smoothing techniques.
One common method is add-one smoothing (also known as Laplace smoothing). This involves adding 1 to all n-gram counts before normalization.
Example:
For unigrams, the standard probability is $P(w_i) = \frac{c_i}{N}$, where $c_i$ is the count of word $w_i$ and N is the total number of words.
With add-one smoothing:
\[P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}\]Here, V is the vocabulary size. Adding V to the denominator ensures that the probabilities still sum to 1. This technique helps to assign non-zero probabilities to unseen n-grams, preventing zero probabilities and undefined perplexities.
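A sketch of the same idea applied to our bigram model: the bigram count gets +1 and the vocabulary size V is added to the denominator, so unseen bigrams no longer get zero probability. It reuses `unigram_counts` and `bigram_counts` from the earlier toy example; counting the boundary tokens in V is a simplifying choice of mine.

```python
# Vocabulary size from the toy corpus (boundary tokens included for simplicity).
V = len(unigram_counts)

def bigram_prob_laplace(prev, word):
    """Add-one smoothing for bigrams: (C(prev word) + 1) / (C(prev) + V)."""
    return (bigram_counts[prev][word] + 1) / (unigram_counts[prev] + V)

print(bigram_prob_laplace("love", "coffee"))  # (2 + 1) / (3 + 7) = 0.3
print(bigram_prob_laplace("coffee", "tea"))   # (0 + 1) / (2 + 7) ~ 0.11, unseen bigram but no longer zero
```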
Notebook Implementation:
For a better understanding, I have implemented these in Python; the notebooks can be found here.
References:
- Andrej Karpathy. Neural Networks: Zero to Hero
- Daniel Jurafsky and James H. Martin. 2024. Speech and Language Processing, Chapter 3