I type a lot, and over the years I’ve been tempted to learn Dvorak. When that urge strikes, I remember this old article from 1996 exposing the myth.

By reposting it here I hope to save someone else hours of futile frustration.

I’ve been optimizing a Naive Bayes classifier and one part of it was tuning the penalty ratio. First, some background on Naive Bayes.

When training Naive Bayes, you count how many times each feature occurs in each class. At the end of training you turn each count into a ratio. I call the table containing the ratio for each feature in each class the **FDM**, for Feature Density Matrix. For example, suppose you are classifying email as **Spam** or **Ham** and the features are words in the email body. Each word has a ratio representing how often it appeared in that class. The word **viagra**, say, might have a high ratio in Spam and a low one in Ham (or never appear there at all).
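In Python, the training step might be sketched like this. The name `build_fdm` is mine, and I’m assuming the ratio is word count over total words in the class, since the exact denominator isn’t pinned down above:

```python
from collections import Counter, defaultdict

def build_fdm(labeled_docs):
    # labeled_docs: list of (class_label, list_of_words) pairs.
    # First, count how many times each word occurs in each class...
    counts = defaultdict(Counter)
    for label, words in labeled_docs:
        counts[label].update(words)
    # ...then turn each count into a ratio (count / total words in class).
    fdm = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        fdm[label] = {word: n / total for word, n in counter.items()}
    return fdm

docs = [
    ("Spam", ["buy", "viagra", "now"]),
    ("Ham",  ["meeting", "at", "noon"]),
    ("Spam", ["cheap", "viagra"]),
]
fdm = build_fdm(docs)
print(fdm["Spam"]["viagra"])  # 2 of 5 Spam words -> 0.4
```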

To classify with Naive Bayes, you take each feature from the item (in this case, words from an email), look up the ratio for each, and multiply them all together to get a single number. Do this for each class (here twice, once each for Spam and Ham), then assume the item belongs to whichever class produced the highest number.
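In code, that scoring step might look like this (a hypothetical sketch; `fdm` maps each class to its word-ratio table, and for the moment I assume every word is present in every class — the missing-word case is what the rest of the post is about):

```python
def classify(fdm, words):
    # Multiply the ratio of every word to get one score per class,
    # then pick the class with the highest score.
    def score(ratios):
        product = 1.0
        for word in words:
            product *= ratios[word]
        return product
    return max(fdm, key=lambda label: score(fdm[label]))

fdm = {
    "Spam": {"viagra": 0.4, "now": 0.2},
    "Ham":  {"viagra": 0.001, "now": 0.1},
}
print(classify(fdm, ["viagra", "now"]))  # 0.08 vs 0.0001 -> Spam
```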

NOTE: this is where the name Naive Bayes comes from. **Bayes**, because you’re multiplying frequencies (probabilities) together, as if the probability of them all happening together equaled the product of their individual probabilities. **Naive**, because this would be true only if each feature were independent of the others. They’re not, so the naive assumption is wrong, but it still works surprisingly well. An incorrect model is often a useful tool.

So what happens if the item you’re classifying has a feature that is not in the FDM? It could be a new word that never appeared during training, or a word you saw in training but not for this particular class. You can’t use 0, because that would nullify the entire number for this class (remember, you’re multiplying them all together). You can’t use 1, because that applies no penalty at all for a feature this class has never seen.
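A quick numeric check of both dead ends, with made-up ratios:

```python
# Ratios for the three words we did see for this class:
product = 1.0
for r in (0.3, 0.2, 0.5):
    product *= r
# product is about 0.03

# Plugging in 0 for the unseen word nullifies the whole score:
assert product * 0 == 0.0
# Plugging in 1 applies no penalty at all; the score is unchanged,
# as if the email never contained the word:
assert product * 1 == product
```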

Here’s one solution: use a small constant ratio **K** for words that never appeared in the class. During training, while building the FDM, keep track of the smallest ratio seen so far, then at the end use it to set **K**. The question is: **how much smaller than the minimum should K be?** There is no easy answer. It depends on how much you want to penalize a class that doesn’t have a feature from the item you’re classifying. Whether you over-penalize it (K too small) or under-penalize it (K too big) shows up only in the actual results. Optimal classification accuracy may come from some in-between value of K.
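One way to sketch that step (the helper name and the 1/1000 multiplier are just for illustration; the multiplier is the thing you’d tune):

```python
def smallest_ratio(fdm):
    # Scan every class's table for the smallest ratio seen in training.
    return min(r for ratios in fdm.values() for r in ratios.values())

fdm = {
    "Spam": {"viagra": 0.4, "cheap": 0.1},
    "Ham":  {"meeting": 0.25, "noon": 0.05},
}
k = smallest_ratio(fdm) * 1e-3  # K = 1/1000 of the smallest ratio
```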

NOTE: why am I assuming K should be smaller than the smallest ratio in the FDM? Because K is used for features that never appeared for a certain class. If that feature had appeared even once for this class, we would have some other ratio R. Never appearing is worse than appearing once, so K should be smaller than R. We don’t know exactly what R would be, but we do know the smallest ratio in the table belongs to some feature that occurred at least once, so K should be less than that.

But then I thought, **why does K matter at all?** There is a kind of symmetry to this. For each class, you’re multiplying a bunch of ratios together. If a certain feature is not in the FDM, then you’ll have the same K in each of the overall class ratios. Thus K will cancel out – that is, if the same K is in all of them, it can’t change which is biggest.

As commonly happens, that random thought turned out to be half-baked. Yes, it’s true – but only if the feature is missing from the FDM entirely, meaning no class had it. That is the symmetric case. But what often happens is that a particular feature exists for some classes yet not others. Like the viagra example above – it might appear in the FDM, but only for Spam, not for Ham. This is asymmetric. So yes, the value of K does matter, because other classes might have contained the feature that this class did not.
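To make the symmetry argument concrete, here’s a toy demonstration (words and ratios invented for illustration):

```python
def score(ratios, words, k):
    # Multiply each word's ratio, falling back to K when this class
    # never saw the word.
    product = 1.0
    for word in words:
        product *= ratios.get(word, k)
    return product

fdm = {
    "Spam": {"hello": 0.05, "viagra": 0.4},
    "Ham":  {"hello": 0.5},
}

# Symmetric: "zzz" appears in no class, so K multiplies both scores
# equally and the winner is the same for any K.
for k in (1e-2, 1e-8):
    assert score(fdm["Ham"], ["hello", "zzz"], k) > \
           score(fdm["Spam"], ["hello", "zzz"], k)

# Asymmetric: "viagra" exists only for Spam, so K hits Ham alone,
# and its size can flip the decision.
assert score(fdm["Ham"], ["hello", "viagra"], 0.1) > \
       score(fdm["Spam"], ["hello", "viagra"], 0.1)   # K too big: Ham wins
assert score(fdm["Ham"], ["hello", "viagra"], 1e-3) < \
       score(fdm["Spam"], ["hello", "viagra"], 1e-3)  # smaller K: Spam wins
```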

Let’s take two cases to see how K can be both “too big” and “too small”.

First, what if K is too big? This is the obvious case. Consider an email containing the word viagra. For Spam, the ratio for viagra is pretty high. For Ham, we need that ratio to be low. But if viagra never appeared in Ham, we rely on K. If K is too big, it won’t penalize the Ham class enough, and the overall ratio for Ham might be big, causing us to classify the email as Ham.

Second, what if K is too small? Suppose there is a relatively “neutral” feature, say a relatively rare word like **zugzwang**, which appears in some classes but not others, and its ratios are about the same in the classes where it does appear. This means the word is not useful for classifying items (because the ratios across classes are similar). Yet the classes where it didn’t occur get unfairly penalized because K is so small.
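A toy version of the zugzwang problem (all the numbers are invented):

```python
def score(ratios, words, k):
    # Multiply each word's ratio, using K for words this class never saw.
    product = 1.0
    for word in words:
        product *= ratios.get(word, k)
    return product

fdm = {
    "A": {"common": 0.2, "zugzwang": 0.01},
    "B": {"common": 0.25},   # zugzwang simply never showed up for B
}
words = ["common", "zugzwang"]

# With K close to the ratio zugzwang would plausibly have had,
# B still wins on the genuinely informative word "common":
assert score(fdm["B"], words, 0.009) > score(fdm["A"], words, 0.009)
# With K far too small, the near-useless word zugzwang
# single-handedly sinks B's score:
assert score(fdm["B"], words, 1e-8) < score(fdm["A"], words, 1e-8)
```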

In summary, K should always be less than the smallest ratio in the FDM. But you need to test different values of K to see how much smaller to set it. I find multipliers in the range 10^-3 to 10^-10 work well. That is, K is 1/1000th or less of the smallest ratio in the FDM. Your mileage may vary!
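Putting it all together, tuning K is just a sweep over multipliers against held-out data. A minimal sketch, with invented toy data (a set this small won’t show accuracy actually varying with K the way real data does):

```python
def classify(fdm, words, k):
    # Score each class, using K for words the class never saw,
    # and pick the class with the highest score.
    def score(ratios):
        product = 1.0
        for word in words:
            product *= ratios.get(word, k)
        return product
    return max(fdm, key=lambda label: score(fdm[label]))

def accuracy(fdm, k, test_set):
    hits = sum(classify(fdm, words, k) == label for label, words in test_set)
    return hits / len(test_set)

fdm = {
    "Spam": {"viagra": 0.4, "hello": 0.05},
    "Ham":  {"hello": 0.5, "meeting": 0.3},
}
test_set = [
    ("Spam", ["hello", "viagra"]),
    ("Ham",  ["hello", "meeting"]),
]
min_ratio = min(r for ratios in fdm.values() for r in ratios.values())
for exp in (3, 6, 10):              # try K = min_ratio * 10^-exp
    k = min_ratio * 10.0 ** -exp
    print(f"10^-{exp}: accuracy {accuracy(fdm, k, test_set):.2f}")
```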