Bayesian Binary Classifiers

We've recently been looking at using Bayesian classifiers for trading in the financial markets. So I thought I'd write a general description of how they work.

A classifier works by looking at the attributes of an object to decide which of a set of classes it belongs to. Take human beings for example. The goal is to classify a person as male or female. You can do this if you have data on the person's height, weight, arm length and other physiological attributes. All you need to know is the distribution of these measurements for both sexes and you can determine the probability of a person being male or female. The distributions must come from a set of already classified attribute measurements. This is an example of binary classification since there are only two classes.

An attribute is not necessarily a direct property of an object but may just influence which class the object ends up in. An example of this is using bond yields and currency exchange rates to determine if the stock market will have an up or down day. The yields and exchange rates are not direct attributes of the stock market but can have a strong effect on stock indices. Another example is classifying tomorrow's weather as cloudy or sunny based on the weather conditions that exist today.

It's not always clear which attributes are most relevant for classifying an object or if they are sufficient for a definitive classification. So a perfect classifier is rare and most will make mistakes regardless of the method they use to draw conclusions from the attributes. A good classifier will be correct with a high probability but it will not be perfect. The only way to know if a given set of attributes is relevant is to test how well they classify. But this can sometimes be misleading since the same set of attributes can give different results depending on the method used.

A good general-purpose classifier, one that can exploit almost any kind of relationship between an object and its attributes, is the Bayesian classifier. The name comes from the fact that it is based on Bayes' theorem in probability. In what follows I will describe how a Bayesian binary classifier works and give a couple of examples.

First let's look at the case of just a single attribute. Let the two classes be labeled \(A\) and \(B\) and let the attribute value be \(X\). What we want is the probability that the object is in class \(A\), and the probability that it is in class \(B\), for a given value of \(X\). These are conditional probabilities, which are written as \(P(A|X)\) and \(P(B|X)\). Since there are only two classes and the object must be in one of them, the two probabilities must sum to 1: \(P(A|X) + P(B|X) = 1\). Bayes' theorem gives the following equations for these probabilities.

\[P(A|X) = \frac{P(X|A)P(A)}{P(X|A)P(A)+P(X|B)P(B)}\]

\[P(B|X) = \frac{P(X|B)P(B)}{P(X|A)P(A)+P(X|B)P(B)}\]

The denominator in both equations is a normalizing factor that ensures that \(P(A|X) + P(B|X) = 1\). The first equation says that the probability of being in class \(A\) given the attribute value \(X\) is proportional to the probability of having attribute value \(X\) given that the class is \(A\), \(P(X|A)\), times the probability of an object being in class \(A\), \(P(A)\). In the language of Bayesian inference \(P(A)\) is called the prior probability of \(A\), \(P(X|A)\) is called the likelihood function and \(P(A|X)\) is called the posterior probability of \(A\). The prior probability and the likelihood function are usually estimated from a training set of already classified objects. Exactly how this is done will be explained below.
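As a minimal sketch, the two equations above can be written as a small Python function. The function name `posterior` and the input values in the usage lines are assumptions made up for illustration; they do not come from any data set.

```python
def posterior(likelihood_a, likelihood_b, prior_a):
    """Return (P(A|X), P(B|X)) for a two-class problem via Bayes' theorem.

    likelihood_a is P(X|A), likelihood_b is P(X|B), prior_a is P(A).
    With only two classes, P(B) = 1 - P(A).
    """
    prior_b = 1.0 - prior_a
    # Normalizing factor P(X|A)P(A) + P(X|B)P(B), shared by both equations.
    evidence = likelihood_a * prior_a + likelihood_b * prior_b
    if evidence == 0.0:
        raise ValueError("X has zero probability under both classes")
    p_a = likelihood_a * prior_a / evidence
    p_b = likelihood_b * prior_b / evidence
    return p_a, p_b


# Made-up inputs, purely for illustration.
p_a, p_b = posterior(likelihood_a=0.8, likelihood_b=0.3, prior_a=0.4)
print(p_a, p_b)
```

Because both posteriors share the same denominator, they always sum to 1, which is a handy sanity check on any implementation.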

I will look at two examples. The first involves medical diagnosis and is probably the simplest example of a binary classifier. In this example \(X\) is a discrete variable representing the result of a test for disease. There are only two possible values for \(X\): positive (1) or negative (0). A positive result suggests the person has the disease and a negative result suggests that they don't. To be specific let's say we are trying to determine if a person has cancer. Class \(A\) means the person has cancer and class \(B\) means they do not. The value of \(X\) is the result of a test for the presence of cancer.

No test is perfect. There are times when a test will come back positive but the person does not have cancer, and there are times when it will come back negative and the person does have cancer. If the test is any good then there will be few of these false positives and false negatives. If the test has been used for a long time then there should be sufficient data for a good estimate of the probability of a true positive \(P(X=1|A)\) and a false positive \(P(X=1|B)\). Simply count how many times in the past a person with cancer tested positive and divide by the number of tested people who had cancer to estimate \(P(X=1|A)\); likewise, count how many times a person without cancer tested positive and divide by the number of tested people who did not have cancer to estimate \(P(X=1|B)\). In the same way you can get an estimate for the probability that a randomly chosen person has cancer, \(P(A)\), or not, \(P(B)\), from the fraction of the population with and without the disease. (Note that you can take a more Bayesian approach to these probabilities and use distributions to represent them instead of point estimates but I'm not going to go into that here.)
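Here is a rough sketch of that counting procedure in Python. The counts are invented for illustration and the dictionary layout is just one convenient assumption. Note that each conditional probability is a count divided by the size of its own class, while the priors are class sizes divided by the overall total.

```python
# Hypothetical record of past outcomes; every count here is invented.
counts = {
    ("cancer", 1): 45,      # had cancer, tested positive (true positives)
    ("cancer", 0): 5,       # had cancer, tested negative (false negatives)
    ("healthy", 1): 398,    # no cancer, tested positive (false positives)
    ("healthy", 0): 9552,   # no cancer, tested negative (true negatives)
}

n_cancer = counts[("cancer", 1)] + counts[("cancer", 0)]
n_healthy = counts[("healthy", 1)] + counts[("healthy", 0)]
n_total = n_cancer + n_healthy

p_true_positive = counts[("cancer", 1)] / n_cancer      # estimate of P(X=1|A)
p_false_positive = counts[("healthy", 1)] / n_healthy   # estimate of P(X=1|B)
prior_cancer = n_cancer / n_total                       # estimate of P(A)
prior_healthy = n_healthy / n_total                     # estimate of P(B)

print(p_true_positive, p_false_positive, prior_cancer, prior_healthy)
```

The priors estimated this way are only meaningful if the tested group is representative of the population you want to classify; that is an assumption of this sketch.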

Let's assume that our best estimate for cancer in the general population is \(P(A)=0.001\), i.e. one tenth of one percent of the population has this cancer, so that \(P(B)=1-P(A)=0.999\). For the accuracy of the test assume that the probability of a true positive is \(P(X=1|A)=0.97\) and the probability of a true negative is \(P(X=0|B)=0.95\). This means that \(97\%\) of the time if a person has the cancer they will test positive and \(95\%\) of the time if they don't have cancer they will test negative. The probabilities of a false negative and a false positive are then \(P(X=0|A)=0.03\) and \(P(X=1|B)=0.05\) respectively. Now we can find the probability that a person has cancer given that they have a positive test result. From the equation above the probability is:

\[P(A|X=1) = \frac{P(X=1|A)P(A)}{P(X=1|A)P(A)+P(X=1|B)P(B)}\]

which, if you plug in the numbers (with \(P(B)=1-P(A)=0.999\)), gives \(P(A|X=1)=0.00097/0.05092=0.01905\). A positive test result therefore means that the person only has about a \(2\%\) chance of having cancer. The chance that the person doesn't have cancer given a positive test result is \(P(B|X=1)=0.98095\). This counterintuitive result is due to the cancer rate being so low, \(P(A)=0.001\). The low rate means that the number of true positives is much smaller than the number of false positives. The true positives are therefore only a small portion of all the positive test results, which is why the probability is low.
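One way to make this result less surprising is to count people instead of multiplying probabilities. The sketch below uses the probabilities from the example; the population size of 100,000 is an arbitrary round number chosen here for illustration.

```python
population = 100_000       # arbitrary round number, an assumption of this sketch
p_cancer = 0.001           # P(A)
p_true_positive = 0.97     # P(X=1|A)
p_false_positive = 0.05    # P(X=1|B)

with_cancer = population * p_cancer                    # expected people with cancer
without_cancer = population - with_cancer              # expected people without
true_positives = with_cancer * p_true_positive         # positives who have cancer
false_positives = without_cancer * p_false_positive    # positives who do not

# Fraction of all positive results that are true positives, i.e. P(A|X=1).
p_cancer_given_positive = true_positives / (true_positives + false_positives)

print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"P(A|X=1) = {p_cancer_given_positive:.5f}")
```

The false positives outnumber the true positives by roughly fifty to one, so a positive result is far more likely to come from the large healthy group than from the small group with cancer.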

For the second example I'm going to look at classifying a person as male or female based solely on height. In this case class \(A\) is male and \(B\) is female. \(X\) is now a continuous random variable representing a person's height and \(P(X|A)\) and \(P(X|B)\) are continuous probability distributions. There are two ways to get these distributions. In the nonparametric approach you make no assumptions about their form and evaluate them using something called a kernel density estimate. I'm not going to describe how that works here, but see our data mining and machine learning page or read the Wikipedia article on it. The other way is to assume a particular distribution and then estimate the parameters using a set of classified training data. In the case of heights it is reasonable to assume that a Normal distribution is a good fit for both males and females. The parameters for these distributions can be found in the CDC's Anthropometric Reference Data for Children and Adults. Mean heights, in inches, for males and females twenty years or older are \(\mu_m=69.4\) and \(\mu_f=63.8\). The standard deviations are \(\sigma_m=4.686\) and \(\sigma_f=4.181\). With these parameters the two Normal distributions can be evaluated.

\[P(X|A) = \frac{1}{\sqrt{2\pi\sigma_m^2}}e^{-\frac{(X-\mu_m)^2}{2\sigma_m^2}}\]

\[P(X|B) = \frac{1}{\sqrt{2\pi\sigma_f^2}}e^{-\frac{(X-\mu_f)^2}{2\sigma_f^2}}\]
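A minimal sketch of evaluating these two densities in Python, using only the standard library. The helper name `normal_pdf` and the example height of 67 are assumptions chosen for illustration; the means and standard deviations are the ones quoted in the text.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a Normal distribution with mean mu and std dev sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


MU_M, SIGMA_M = 69.4, 4.686   # male parameters quoted in the text
MU_F, SIGMA_F = 63.8, 4.181   # female parameters quoted in the text

height = 67.0  # arbitrary example height, in the same units as the means
print(normal_pdf(height, MU_M, SIGMA_M))  # P(X|A)
print(normal_pdf(height, MU_F, SIGMA_F))  # P(X|B)
```

These values are probability densities rather than probabilities, so they can only be compared with each other, not read as percentages. Bayes' theorem still applies because the ratio in the posterior formula is what matters.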

The only missing information is the prior probabilities, \(P(A)\) and \(P(B)\). These can be estimated from the sex ratio, which is simply the ratio of the number of males to females in the population. Using a ratio of 0.97 for the United States gives \(P(A)=0.97/1.97\approx 0.49\) and \(P(B)\approx 0.51\). Given these probabilities you can now calculate the probability of a person being male or female given just their height. There is of course no way the classification can be perfect based on height alone, but on simulated data it turns out to work surprisingly well, with an accuracy of around \(70\%\). The reason it works is that the two distributions differ, mainly in their means; the unequal sex ratio only shifts the decision point slightly. The larger the difference in means relative to the standard deviations, the less the distributions overlap, so for a given height there will be a larger difference in probabilities, making it easier to separate male and female.
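The following sketch puts the pieces together and gives a rough check of the accuracy figure using simulated data. It assumes heights really are drawn from the two Normal distributions above with the stated priors; the function names, the seed, the sample size and the decision rule (predict male when the posterior exceeds 0.5) are choices made here for illustration, not necessarily the setup used for the figure quoted in the text.

```python
import math
import random

MU_M, SIGMA_M = 69.4, 4.686   # male height parameters from the text
MU_F, SIGMA_F = 63.8, 4.181   # female height parameters from the text
P_MALE = 0.49                 # prior P(A) from the text; P(B) = 1 - P(A)


def normal_pdf(x, mu, sigma):
    """Density of a Normal distribution with mean mu and std dev sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def prob_male(height):
    """Posterior P(A|X) for a given height via Bayes' theorem."""
    male = normal_pdf(height, MU_M, SIGMA_M) * P_MALE
    female = normal_pdf(height, MU_F, SIGMA_F) * (1.0 - P_MALE)
    return male / (male + female)


def simulated_accuracy(n, seed=0):
    """Draw n people from the assumed model and score the classifier."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        is_male = rng.random() < P_MALE
        if is_male:
            height = rng.gauss(MU_M, SIGMA_M)
        else:
            height = rng.gauss(MU_F, SIGMA_F)
        predicted_male = prob_male(height) > 0.5
        correct += predicted_male == is_male
    return correct / n


print(f"P(male | height 72) = {prob_male(72.0):.3f}")
print(f"simulated accuracy  = {simulated_accuracy(100_000):.3f}")
```

The exact accuracy depends on the seed and sample size, and because the data are generated from the same model the classifier assumes, it is an optimistic estimate of how the classifier would do on real measurements.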

These are just two simple examples of binary classifiers. There are countless other applications where they can be used. With more than one attribute things get more interesting. You then have to consider the dependence or independence of the attributes. If they are independent within each class then the conditional distributions will factor, so for two attributes \(X\) and \(Y\) you get \(P(X,Y|A)=P(X|A)P(Y|A)\), and likewise for class \(B\). If not, then you have to deal with joint distributions. It turns out that even if the attributes are not independent you can often get away with treating them as if they are. This is then called a Naive Bayes classifier and it often works quite well.
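As a hedged sketch of the factored form, the function below multiplies the per-attribute likelihoods for each class before applying Bayes' theorem. The function name and the usage scenario are assumptions: the scenario imagines two positive results from tests with the accuracy used in the cancer example, treated as independent given the person's true status, which may not hold for real tests.

```python
import math


def naive_bayes_posterior(likelihoods_a, likelihoods_b, prior_a):
    """P(A | all attributes), treating attributes as independent given the class.

    likelihoods_a[i] is P(attribute_i | A); likelihoods_b[i] is P(attribute_i | B).
    """
    joint_a = prior_a * math.prod(likelihoods_a)           # P(X,Y,...|A) P(A)
    joint_b = (1.0 - prior_a) * math.prod(likelihoods_b)   # P(X,Y,...|B) P(B)
    return joint_a / (joint_a + joint_b)


# Hypothetical scenario: two positive results, assumed conditionally independent.
p = naive_bayes_posterior([0.97, 0.97], [0.05, 0.05], prior_a=0.001)
print(f"P(A | two positives) = {p:.4f}")
```

With many attributes the products can become very small, so practical implementations usually add log-probabilities instead of multiplying; this sketch skips that for clarity.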

For software to implement these classifiers see the Exstrom Labs web page on data mining and machine learning. For information on how to use classifiers to trade in the stock market, stay tuned for a report we will be releasing soon. If you want to know when it is released, sign up for our newsletter.

© 2010-2012 Stefan Hollos and Richard Hollos
