N-Gram Language Model

An N-Gram language model is a model that predicts the next word in a sequence given the previous (N-1) words.

These models are pretty handy when we are trying to analyze text from noisy or ambiguous input sources, e.g. speech recognition, or in spelling correction when the erroneous word is itself a valid English word, e.g. "30 minuets" vs "30 minutes".

So mathematically, what we want to compute is the probability of a word w following a sequence of words s = { w1, w2, w3, w4, …, wn }.

P(w | s) = count(s + w) / count(s)

i.e. the total number of times the word w occurs right after the sequence s, divided by the total number of times the sequence s occurs in the training corpus.
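To make that counting concrete, here is a minimal sketch of the estimate on a toy corpus; the sentence, the chosen phrase, and the helper count_seq are all made up purely for illustration:

corpus = "its water is so transparent that the fish can see the bottom".split()

s = ["water", "is", "so"]     # the preceding sequence
w = "transparent"             # the candidate next word

def count_seq(tokens, seq):
    # number of times seq occurs as a contiguous run inside tokens
    return sum(tokens[i:i + len(seq)] == seq for i in range(len(tokens) - len(seq) + 1))

# P(w | s) = count(s + w) / count(s)
p = count_seq(corpus, s + [w]) / count_seq(corpus, s)
print(p)   # 1.0 in this tiny corpus, since "water is so" is always followed by "transparent"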

Now, one could use the web as a corpus to train such a model, but there are two constraints:

  1. Language is creative in nature, and new sentences get created quite frequently.
  2. Computing such statistics means estimating the joint probability of every sequence like s above: out of all the sequences of n words in our corpus, we would have to find the frequency of the exact sequence s. That is a lot of computation, and most long sequences will never appear in the corpus at all.

This demands a better means of computing, or at least approximating, the probability of a sentence. This is where the Markov property comes into the picture.

Markov Property: the future state of the system depends only on the present state and not on the states preceding it.

In the case of language, this means the probability of a word depends only on the previous word(s). For example, when the probability of the Nth word depends only on the (N-1)th word, we have a Bigram model: at any moment only two words are involved, hence the "Bi".

P(Wn | Wn-1) = count(Wn-1 Wn) / count(Wn-1)
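As a quick illustration, here is a minimal sketch of turning bigram counts into probabilities on a toy corpus (the corpus and the names are assumptions, chosen only for illustration):

from collections import Counter

words = "I am Sam Sam I am I do not like green eggs and ham".lower().split()

unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))

# P(Wn | Wn-1) = count(Wn-1 Wn) / count(Wn-1)
def bigram_prob(w, prev):
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(bigram_prob("am", "i"))   # count("i am") / count("i") = 2 / 3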

Similarly, generalizing to an N-Gram model, we use the previous (N-1) words to predict the probability of a word being the Nth word: the (N-1) words form the current state, and the Nth word is the future we would like to predict.

So far so good. But the approach above still needs some tuning. There will be cases where a particular N-Gram does not exist in the corpus, simply because our corpus is of finite length; since those N-Grams never appeared in training, their probabilities will be zero, which leads to a very sparse count matrix. To avoid this we can use a smoothing technique: for example, add-one (Laplace) smoothing, where before we compute the probabilities we add one to all counts, which eliminates all zero-probability outcomes and gives them some small chance.
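With add-one smoothing the bigram estimate becomes (count(Wn-1 Wn) + 1) / (count(Wn-1) + V), where V is the vocabulary size. A minimal sketch, reusing toy counts like the ones above (the corpus and names are chosen just for illustration):

from collections import Counter

words = "the cat sat on the mat".split()
unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))
V = len(unigram_counts)   # vocabulary size

def smoothed_bigram_prob(w, prev):
    # add one to every bigram count so unseen pairs get a small non-zero probability
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(smoothed_bigram_prob("sat", "the"))   # non-zero even though "the sat" never occurs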

Another approach one can use is Backoff or Interpolation. Suppose an N-Gram sequence does not exist in the corpus; we then check the (N-1)-Gram, and continue until we reach a unigram or find some non-zero value. The difference between Backoff and Interpolation is that Backoff stops as soon as it finds a value for some shorter N-Gram, whereas Interpolation uses a weighted combination of the probabilities of all the orders (N-word sequence, (N-1)-word sequence, …, unigram).

Backoff:
value = None
for sequence_length in range(N, 0, -1):      # try the longest n-gram first: N, N-1, ..., 1
    if exist(seq(sequence_length)):          # was this n-gram seen in the corpus?
        value = score(seq(sequence_length))
        break                                # stop at the first n-gram with a count

Interpolation:
value = 0
for sequence_length in range(N, 0, -1):      # combine every order: N, N-1, ..., 1
    if exist(seq(sequence_length)):
        value = value + importance(sequence_length) * score(seq(sequence_length))
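To make interpolation concrete, here is a minimal runnable sketch that mixes a bigram and a unigram estimate on a toy corpus; the lambda weight and the sentences are assumptions made for illustration, and in practice the weights are tuned on held-out data:

from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(w):
    return unigrams[w] / total

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def p_interpolated(w, prev, lam=0.7):
    # weighted mix of the two estimates; the weights sum to 1
    return lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w)

print(p_interpolated("fish", "cat"))   # non-zero even though the bigram "cat fish" was never seen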

 