Introduction
In the context of NLP (natural language processing), Bayes’ rule is given by: \(P(category | word) = P(word | category) * P(category) / P(word)\)
where:
- \(P(category | word)\): probability of a category given a word
- \(P(word | category)\): probability of a word given a category
- \(P(category)\): probability that a category occurs
- \(P(word)\): probability that a word occurs
For a sequence of words, assuming each word is independent of the others given the category (the “naive” assumption), the formula becomes:
\(P(category | word_0, word_1, ..., word_n) = P(category) * \frac{P(word_0 | category) * P(word_1 | category) * ... * P(word_n | category)}{P(word_0) * P(word_1) * ... * P(word_n)}\)
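To make the sequence formula concrete, here is a minimal Python sketch of it under the naive independence assumption; the probability values in it are hypothetical placeholders rather than estimates from a real corpus.

```python
from fractions import Fraction as F

# Minimal sketch of Bayes' rule for a word sequence under the naive
# independence assumption. The probability values below are hypothetical
# placeholders chosen only to exercise the formula.
def posterior(words, p_word_given_cat, p_cat, p_word):
    """P(category | words) = P(category) * prod P(w|cat) / prod P(w)."""
    result = p_cat
    for w in words:
        result *= p_word_given_cat[w] / p_word[w]
    return result

p_word_given_positive = {"happy": F(2, 6), "fox": F(2, 6)}  # P(word | positive)
p_word = {"happy": F(1, 6), "fox": F(2, 6)}                 # P(word)
print(posterior(["happy", "fox"], p_word_given_positive, F(1, 2), p_word))  # -> 1
```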
Basic Idea
A Naive Bayes classifier is built with the following steps:
1. count word frequencies by category
2. compute a table of word probabilities per category
3. compute the likelihood of a statement
A Python sketch of these steps follows the worked example below.
For example, when classifying statements into the categories Positive and Negative:
Given the statements:
- positive: The happy fox
- negative: The dead fox
- positive: Fox is happy
- negative: Fox is dead
- count word frequencies per category
word table:
word | positive | negative |
---|---|---|
the | 1 | 1 |
fox | 2 | 2 |
happy | 2 | 0 |
dead | 0 | 2 |
is | 1 | 1 |
TOTAL | 6 | 6 |
- compute table of probabilities
word | pos_probability | neg_probability |
---|---|---|
the | 1/6 = 0.1666 | 1/6 = 0.1666 |
fox | 2/6 = 0.3333 | 2/6 = 0.3333 |
happy | 2/6 = 0.3333 | 0/6 = 0 |
dead | 0/6 = 0 | 2/6 = 0.3333 |
is | 1/6 = 0.1666 | 1/6 = 0.1666 |
- compute likelihood of a statement
query statement example: Is fox happy?
\[ \prod_{i=1}^{m} \frac{P(w_i|positive)}{P(w_i|negative)}= \frac{P(is|positive)*P(fox|positive)*P(happy|positive)}{P(is|negative)*P(fox|negative)*P(happy|negative)}= \frac{0.1666*0.3333*0.3333}{0.1666*0.3333*0} \]
Replacing the 0 in the formula above with a very small number (e.g. 0.0001) as a temporary workaround, before we later learn about Laplacian Smoothing:
\[ \frac{0.1666*0.3333*0.3333}{0.1666*0.3333*0.0001} \approx 3333 \]
3333 > 1, therefore the statement is classified as positive.
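As referenced above, here is a minimal Python sketch of the three steps applied to this example; the corpus, the 0.0001 floor for zero probabilities, and the function names are only illustrative.

```python
from collections import Counter

# Minimal sketch of the three steps applied to the example above. The corpus,
# the 1e-4 floor for zero probabilities (the temporary workaround from the
# text), and the function names are only illustrative.
training = [
    ("positive", "the happy fox"),
    ("negative", "the dead fox"),
    ("positive", "fox is happy"),
    ("negative", "fox is dead"),
]

# Step 1: count word frequencies per category.
freq = {"positive": Counter(), "negative": Counter()}
for category, text in training:
    freq[category].update(text.lower().split())

# Step 2: turn counts into per-category word probabilities.
totals = {c: sum(counts.values()) for c, counts in freq.items()}
prob = {c: {w: n / totals[c] for w, n in counts.items()} for c, counts in freq.items()}

# Step 3: likelihood ratio of a query statement (0 replaced by a small number).
def likelihood_ratio(statement, floor=1e-4):
    ratio = 1.0
    for w in statement.lower().split():
        ratio *= (prob["positive"].get(w, 0) or floor) / (prob["negative"].get(w, 0) or floor)
    return ratio

print(likelihood_ratio("Is fox happy"))  # ~3333 > 1, so classified as positive
```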
Laplacian Smoothing
Laplacian smoothing avoids the problem of words with 0 frequency in one of the classes, which produce a probability of 0 when computing the word probabilities in step 2 above. Note the words happy and dead in the example above.
\[ P(w_i|class)=\frac{freq(w_i, class)+1}{N_{class}+V} \]
where:
- \(w_i\): the \(i^{th}\) word from the training data
- \(class\): the category class
- \(N_{class}\): total number of words in the class
- \(V\): number of unique words in the vocabulary
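Applied to the word table above, with \(N_{positive} = N_{negative} = 6\) and \(V = 5\) unique words (the, fox, happy, dead, is), the previously zero probability is no longer zero:
\[ P(happy|negative) = \frac{0 + 1}{6 + 5} = \frac{1}{11} \approx 0.0909, \qquad P(happy|positive) = \frac{2 + 1}{6 + 5} = \frac{3}{11} \approx 0.2727 \]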
Log Likelihood
Working with logarithms avoids the problem of underflow (numbers too small to represent) that results from multiplying many probabilities together:
\[ log( a * b * c ) = log(a) + log(b) + log(c) \]
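A quick Python illustration of the underflow problem; the probability 0.001 and the 200 repetitions are arbitrary values chosen only for the demonstration:

```python
import math

p = 0.001   # an arbitrary small word probability
n = 200     # an arbitrary number of words in a long statement

# Multiplying many small probabilities underflows to 0.0 in floating point...
product = 1.0
for _ in range(n):
    product *= p
print(product)                               # 0.0 (underflow)

# ...while summing their logarithms stays perfectly representable.
print(sum(math.log10(p) for _ in range(n)))  # about -600.0
```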
Original Solution
\[ \prod_{i=1}^{m} \frac{P(w_i|positive)}{P(w_i|negative)}= \frac{P(is|positive)*P(fox|positive)*P(happy|positive)}{P(is|negative)*P(fox|negative)*P(happy|negative)}= \frac{0.1666*0.3333*0.3333}{0.1666*0.3333*0} \]
Log Solution with Non-Zero Correction
\(log( \prod_{i=1}^{m} \frac{P(w_i|positive) + 1}{P(w_i|negative) + 1} )=\sum_{i=1}^{m} log(\frac{P(w_i|positive) + 1}{P(w_i|negative) + 1})=\)
\(log(\frac{P(is|positive) + 1}{P(is|negative) + 1}) + log(\frac{P(fox|positive) + 1}{P(fox|negative) + 1}) + log(\frac{P(happy|positive) + 1}{P(happy|negative) + 1})=\)
\(log(\frac{0.1666 + 1}{0.1666 + 1}) + log(\frac{0.3333 + 1}{0.3333 + 1}) + log(\frac{0.3333 + 1}{0 + 1})=log(1) + log(1) + log(1.3333)=0 + 0 + 0.1249 = 0.1249 > 0\)
Since the result is greater than 0 (using base-10 logarithms), the statement can be classified as positive.
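The same computation in Python, confirming the value (base-10 logarithms, matching the numbers above):

```python
import math

# Log ratios with the ad-hoc +1 non-zero correction used above.
ratios = [
    (0.1666 + 1) / (0.1666 + 1),  # is
    (0.3333 + 1) / (0.3333 + 1),  # fox
    (0.3333 + 1) / (0 + 1),       # happy
]
print(sum(math.log10(r) for r in ratios))  # ~0.1249 > 0 -> positive
```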
Improved Naive Bayes: Laplacian Smoothing + Log Likelihood
In conclusion, a better Naive Bayes classifier can be built by using:
- Laplacian smoothing to avoid 0 probabilities
- log likelihood to avoid underflow on longer word sequences.
Therefore the final classifier formula becomes:
\[ \sum_{i=1}^{m} log\left(\frac{ \frac{freq(w_i, positive) + 1}{N_{positive} + V} }{ \frac{freq(w_i, negative) + 1}{N_{negative} + V} }\right) \]
where \(m\) is the number of words in the statement, and a sum greater than 0 classifies the statement as positive.
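Finally, a minimal Python sketch of this improved classifier on the same tiny corpus as above; the names and structure are illustrative rather than a standard API:

```python
import math
from collections import Counter

training = [
    ("positive", "the happy fox"),
    ("negative", "the dead fox"),
    ("positive", "fox is happy"),
    ("negative", "fox is dead"),
]

# Word frequencies per class, vocabulary size V and total word count N per class.
freq = {"positive": Counter(), "negative": Counter()}
for category, text in training:
    freq[category].update(text.lower().split())
V = len(set(freq["positive"]) | set(freq["negative"]))
N = {c: sum(counts.values()) for c, counts in freq.items()}

def smoothed(word, category):
    # Laplacian smoothing: (freq + 1) / (N_class + V) is never zero.
    return (freq[category][word] + 1) / (N[category] + V)

def log_likelihood(statement):
    # Sum of log ratios; > 0 classifies as positive, < 0 as negative.
    return sum(
        math.log10(smoothed(w, "positive") / smoothed(w, "negative"))
        for w in statement.lower().split()
    )

print(log_likelihood("Is fox happy"))  # ~0.477 > 0, classified as positive
```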