Sunday, May 29, 2016

AdaBoost 3 and Bayesian logic

I think I've figured out a way to express the workings of AdaBoost in terms of Bayesian processing, and I think it becomes simpler this way. Now, I'm not the first one to do that: the book "Boosting" describes a way to do this with what it calls logistic regression, and it also derives an approach to probabilities in AdaBoost from it. But the logistic regression uses a sigmoid curve (I think like the one generally used for the normal distribution) and differs from the plain AdaBoost. I think my explanation works for the plain AdaBoost.

Before I start on that explanation, I want to talk about why the boosting works. Yeah, I've followed the proofs (more or less) but it took me a while to get a bit of an intuitive feeling in my head, the "physical meaning" of the formulas. I want to share this understanding which is much simpler than these proofs (and I hope that it's correct):

The premise of boosting is that we're able to find a number of methods (what they call "hypotheses" in AdaBoost) to predict the correct outcomes of the training cases, each method correctly predicting more than 50% of the training cases. Then if we collect a large-ish number of these methods, we can predict the correct outcomes of all the training cases simply by averaging the predictions of these methods. And the other cases will follow the training cases (unless an overfitting happens). Since more than 50% of the cases are correctly predicted by each method, after the averaging more than 50% of the votes for each training case will be correct, and thus the result will be correct too. Of course this depends on the correct predictions being distributed pretty evenly among the cases. If we have a thousand methods that predict correctly the same cases and incorrectly the other cases, obviously after averaging these other cases will still be predicted incorrectly. So the selection of methods must somehow shuffle the preference for the cases, so that the next picked method will predict well the cases that have been predicted poorly by the previous picked methods. That's it, that's the whole basic idea. It's a bit of an oversimplification but it's easy to understand.
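
As a toy illustration of this averaging (my own example, just to make the idea concrete): take three methods, each correct on two out of three training cases, with each method wrong on a different case. The majority vote then comes out correct on all three cases:

y = [+1, +1, -1]  # correct outcomes of the 3 training cases

# predictions of the 3 methods for the 3 cases; each method is wrong
# on exactly one case, and the errors fall on different cases
methods = [
    [-1, +1, -1],  # wrong on case 0
    [+1, -1, -1],  # wrong on case 1
    [+1, +1, +1],  # wrong on case 2
]

def majority(case):
    votes = sum(m[case] for m in methods)
    return +1 if votes > 0 else -1

print([majority(i) == y[i] for i in range(3)])  # [True, True, True]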

Now on to the explanation of the connection between AdaBoost and the Bayesian logic.

To start with, I want to show one more rewrite of the AdaBoost algorithm. It builds further on the version I've shown in the last installment. Now I'm returning to the notation of et and (1-et). These values are proportional to Wbadt and Wgoodt but are confined to the range [0, 1], which fits well into the Bayesian computations. More importantly, this new version gets rid of the square root in the computation of Dt+1(i). It wasn't the first thing I realized about the connection between AdaBoost and the Bayesian logic, it was actually the last one, but after it all the parts of the puzzle fell into place. So I want to show this key piece of the puzzle first.

The point of this computation is to readjust the weights of the training cases for the next round, so that the total weight of all the cases successfully predicted by the current round equals the total weight of the unsuccessfully predicted cases. There is more than one way to make the weights satisfy this condition; all such ways are proportional to each other and all are just as good. The way the weights are modified in the classic form of the AdaBoost algorithm has been selected to make the modification symmetric and to allow writing the adjustment for both correct and incorrect cases as a single formula:

Dt+1(i) = Dt(i)*exp(-Alphat*yi*ht(xi)) / Zt

But if we're writing an honest if/else statement, it doesn't have to be symmetric. We can just as well write either of:

     if ht(xi) != yi {
          Dt+1(i) = Dt(i) / et
     } else {
          Dt+1(i) = Dt(i) / (1-et)
     }

or with a degenerate "else" part:

     if ht(xi) != yi {
          Dt+1(i) = Dt(i) * (1-et) / et
     } else {
          Dt+1(i) = Dt(i)
     }

Either way the resulting weights are the same up to a common factor, which doesn't matter for the algorithm. And there is no need to involve the square roots; the formula becomes simpler. After making this simplification, here is my latest and simplest version of the algorithm:



Given: (x1, y1), ..., (xm, ym) where xi belongs to X, yi belongs to {-1, +1}.
Initialize: D1(i) = 1 for i = 1, ..., m.
For t = 1, ..., T:
  • Train the basic algorithm using the weights Dt.
  • Get weak hypothesis ht: X -> {-1, +1}.
  • Aim: select ht to minimize the weighted error et:
    Wgoodt = 0; Wbadt = 0;
    for i = 1, ..., m {
         if ht(xi) = yi {
              Wgoodt += Dt(i)
         } else {
              Wbadt += Dt(i)
         }
    }
    et = Wbadt / (Wgoodt + Wbadt)
  • Update,
    for i = 1, ..., m {
          if ht(xi) != yi {
              Dt+1(i) = Dt(i) * (1-et) / et
         } else {
              Dt+1(i) = Dt(i)
         }
    }
Produce the function for computing the value of the final hypothesis:
H(x) {
     chance = 1;
     for t=1,...,T {
          if (ht(x) > 0) {
               chance *= (1-et)/et;
          } else {
               chance *= et/(1-et);
          }
     }
     return sign(chance - 1)
}
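
For the record, here is a minimal sketch of this version in Python (my own illustration, not anything from the book). It assumes the simplest kind of basic algorithm, a single-feature threshold stump, and clamps et away from 0 and 1 to avoid division by zero:

import numpy as np

def train_adaboost(X, y, T):
    """X: (m, n) array of features; y: array of -1/+1 labels; T: number of rounds."""
    m, n = X.shape
    D = np.ones(m)                       # D1(i) = 1
    model = []                           # list of (feature, threshold, sign, et)
    for t in range(T):
        # the "basic algorithm": pick the stump h(x) = (sign if x[feature] > threshold
        # else -sign) with the smallest weighted error et under the current weights Dt
        best = None
        for f in range(n):
            for thr in np.unique(X[:, f]):
                for s in (+1, -1):
                    pred = np.where(X[:, f] > thr, s, -s)
                    wbad = D[pred != y].sum()
                    wgood = D[pred == y].sum()
                    e = wbad / (wgood + wbad)
                    if best is None or e < best[0]:
                        best = (e, f, thr, s)
        e, f, thr, s = best
        e = min(max(e, 1e-9), 1 - 1e-9)  # keep et away from 0 and 1
        # update: boost the weights of the incorrectly predicted cases
        pred = np.where(X[:, f] > thr, s, -s)
        D[pred != y] *= (1 - e) / e
        model.append((f, thr, s, e))
    return model

def H(model, x):
    chance = 1.0
    for f, thr, s, e in model:
        h = s if x[f] > thr else -s
        chance *= (1 - e) / e if h > 0 else e / (1 - e)
    return 1 if chance > 1 else -1

Any other basic algorithm that does better than 50% weighted accuracy under Dt would work in place of the stumps here.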



Now this form can be translated to the Bayesian approach. I'll be using the form of Bayesian computations that works with weights rather than probabilities. The translation to this form is fairly straightforward:

There are two mutually exclusive Bayesian hypotheses (but to avoid confusion with the AdaBoost term "hypothesis", let's call them "outcomes" here): one that the result is true (1), another one that the result is false (0).

Each "hypothesis" ht(x) in AdaBoost terms becomes an "event" Evt in the Bayesian form (to avoid the terminological confusion I'll call it an event henceforth). Each event is positive: it being true predicts that the true outcome is correct, and vice versa. The training table looks like this:

Weight Outcome  Ev1 ... EvT
1 *    true     1   ... 1
1 *    false    0   ... 0

Aside from this, we remember et from all the boosting rounds, and of course the computations for all the "events" chosen by the boosting.

Then when we run the model, we get the set of arguments x as the input, and start computing the values of the events. When we compute the event Evt, we apply it with the confidence value C(Evt) = (1-et), as described in the fuzzy training logic. As a result, if Evt is true, the weight of the true outcome gets multiplied by (1-et) and the weight of the false outcome gets multiplied by et. If the event is false, the multiplication goes the opposite way.

In the end we compute the probability of the true outcome from the final weights: W(true)/(W(true)+ W(false)). If it's over 0.5, the true outcome wins. This logic is exactly the same as in the function H(x) of the AdaBoost algorithm.
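
As a sketch, the run-time part of this computation can be written in Python like this (my own illustration; it assumes that the trained events are given as a list of pairs of an event function of x and its remembered et):

def bayes_by_weight(events, x):
    # events: list of (ev, e) pairs, where ev(x) returns True or False
    # and e is the training error et remembered for that event
    w_true, w_false = 1.0, 1.0           # initial weights of the two outcomes
    for ev, e in events:
        if ev(x):                        # the event is true: favors the true outcome
            w_true *= (1 - e)
            w_false *= e
        else:                            # the event is false: favors the false outcome
            w_true *= e
            w_false *= (1 - e)
    return w_true / (w_true + w_false)   # probability of the true outcome

Dividing W(true) by W(false) gives exactly the "chance" computed in H(x), so the true outcome wins here if and only if H(x) returns +1.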

Why is (1-et) used as the confidence value? Since it's the fraction of the training cases that got predicted correctly by Evt, it makes sense that we're only so much confident in this event. Well, for the first event it is, but for the following events (1-et) is computed from a modified distribution of the training cases, with the adjusted weights. Does it still make sense?

As it turns out, it does. To find out why, we need to look at the weights of the training cases after they pass the filter of the first event. Instead of looking at the training table with two composite rows we need to step back and look at the table with the M original uncombined training cases:

CaseId Weight Outcome  Ev1 ... EvT
1       1 *    true     1   ... 1
...
M       1 *    false    0   ... 0

The cases that have the outcome of true still have 1 for all the events, and the ones with the false outcome have 0 for all the events. If we run the algorithm for an input that matches a particular training case, Ev1 will predict the outcome correctly for some of the training cases and incorrectly for the others. When it predicts correctly, the weight of this training case will be multiplied by the confidence C(Ev1) and will become (1-e1). But when it predicts incorrectly, the weight will be multiplied by e1. The distribution will become skewed. We want to unskew it for computing the confidence of Ev2, so we compensate by multiplying the weight of the cases that got predicted incorrectly by (1-e1)/e1. Lo and behold, it's the same thing that is done in AdaBoost:

     if ht(xi) != yi {
          Dt+1(i) = Dt(i) * (1-et) / et
     } else {
          Dt+1(i) = Dt(i)
     }

So it turns out that on each step of AdaBoost the distribution represents the compensation of the weight adjustment for application of the previous events, and the value (1-et) is the proper adjusted fraction of the training cases that got predicted correctly. Amazing, huh? I couldn't have come up with this idea from scratch, but given an existing formula, it all fits together.

What is the benefit of looking at AdaBoost from the Bayesian standpoint? For one, we get a probability value for the result. For another, it looks like AdaBoost could be slightly improved by changing the initial weights of the true and false outcomes from 1:1 to their actual proportions in the M training cases. That's an interesting prediction to test. And for the third, maybe it can be used to improve the use of AdaBoost for multi-class classification. I haven't read the book that far yet; I want to understand the current state of affairs before trying to improve on it.

Saturday, May 28, 2016

Summary of Bayes by weight

This is a copy of the post from my MSDN blog. If you've been reading this blog, you've already seen all the ideas described here. But I've realized that the repost can be useful on this blog for reference, as a short summary of the most interesting parts. So, here it goes.

(P.S. See also a simpler later explanation.)

I recently wrote a series of posts in my other blog on the Bayes expert system. Some time ago I wrote an expert system, and I've been surprised by how little information is available on the Internet and how much of it is misleading, so I wanted to make a write-up of my experience, and recently I've finally found time to do it. It turned out larger than I expected, and as I was writing it I myself gained a deeper understanding of my previous experience and came up with further, better ideas. The text starts from the very basics of the Bayesian formula and goes through its use in the expert systems, the typical pitfalls when building the expert systems and the solutions to these pitfalls, the ways to handle the uncertainties, a deeper dive into how and why these formulas actually work, more of the practical solutions based on that knowledge, and the aspects of the testing.

The whole series is here (in the reverse order): http://babkin-cep.blogspot.com/search/label/bayes
The first post is here: http://babkin-cep.blogspot.com/2015/10/bayes-1-basic-terms.html
They are not interspersed with any other posts, so you can read them sequentially.

Right in the first post I refer to Yudkowsky's explanation, so why would you want to read my explanation rather than his? By all means, read his explanation too, it's a good one. But he explains it in a different context. He takes one application of the Bayesian formula and works through it. So do many other texts on artificial intelligence. The expert systems don't apply the formula once. They apply the formula thousands of times to make a single diagnosis. Which brings its own problems of scale that I describe. And there are many other practical issues. Your input data might be "impossible" from the standpoint of probabilities in your knowledge base. The evidence of absence is not the same as the absence of evidence. The events are rarely completely independent. Multiple hypotheses might be valid at the same time. And so on.

The particularly surprising realization for me was that the Bayesian systems and the decision trees are fundamentally the same thing. You can represent either one through the other one. They are traditionally used in somewhat different ways but that's just the question of tuning the parameters. They can be easily tuned either way or anywhere in between. This stuff is in part 11 of my notes. There it requires the context of the previous parts, but here is the short version as a spoiler. The short version might not be very comprehensible due to its shortness, but at least it points to what to look for in the long version.

So, it starts with the basic Bayes formula where the probability of some hypothesis H after taking into account the information about the event E is:

P(H|E) = P(H) * P(E|H) / P(E)

The hypotheses can also be thought of as diagnoses, and the events as symptoms.

In a typical expert system there can be hundreds of hypotheses and hundreds of events to consider. The values of P(E|H) are taken from the table that is computed from the training data. The values of P(H) and P(E) change as the events are applied one after another (P(H|E) for one event becomes P(H) for the next event, and P(E) gets computed from the complete set of values for P(H) and P(E|H)) but their initial values are also sourced from the same table. At the end one or multiple most probable hypotheses are chosen as the diagnosis.
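
As a Python sketch of this sequential application (with made-up names, and without the handling of the practical issues mentioned above, such as the "impossible" input data that would cause a division by zero here):

def apply_events(p_h, p_e_h, observed):
    # p_h: dict of prior P(H); p_e_h: dict of dicts P(E|H);
    # observed: list of (event, True/False) pairs
    p_h = dict(p_h)                               # current P(H), to be updated
    for e, value in observed:
        # P(E) is computed from the current P(H) and the table P(E|H)
        p_e = sum(p_h[h] * p_e_h[h][e] for h in p_h)
        for h in p_h:
            if value:                             # the event was observed as true
                p_h[h] = p_h[h] * p_e_h[h][e] / p_e
            else:                                 # the event was observed as false
                p_h[h] = p_h[h] * (1 - p_e_h[h][e]) / (1 - p_e)
    return p_h                                    # P(H | all the observed events)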

What is the training data? It's the list of the previously diagnosed cases, where both the correct winning hypotheses and the values of the events are known. In the simplest way it can be thought of as a bitmap where each row has a hypothesis name and the bits for every event, showing if it was true or false:

    E1  E2  E3  E4
H1   1   0   0   0
H1   1   0   0   0
H1   0   0   0   1
H2   0   1   1   0
H2   0   1   1   0

In reality one case may have a diagnosis of multiple hypotheses, and the values of the events might be not just 0 and 1 but a number in the range between 0 and 1: we might not be completely confident in some symptom but have say a 0.7 confidence that it's true.

How is the table of probabilities built from the training data? For P(H) we take the proportion of the number of cases with this hypothesis to the total number of cases. For P(E|H) we take all the cases for the hypothesis H and average all the values for the event E in them. For P(E) we average all the values for E in all the cases in the whole training data.
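
In Python this table-building could be sketched like this (my own illustration; the cases are assumed to be given as a list of (hypothesis, {event: value}) pairs):

def build_table(cases, events):
    n = len(cases)
    p_h = {}        # P(H): the fraction of the cases with this hypothesis
    for h, _ in cases:
        p_h[h] = p_h.get(h, 0) + 1.0 / n
    p_e_h = {}      # P(E|H): the average value of E among the cases of H
    for h in p_h:
        rows = [ev for hh, ev in cases if hh == h]
        p_e_h[h] = {e: sum(r[e] for r in rows) / len(rows) for e in events}
    p_e = {e: sum(ev[e] for _, ev in cases) / n for e in events}   # P(E)
    return p_h, p_e_h, p_e

# for the bitmap above: p_h comes out as about {'H1': 0.6, 'H2': 0.4},
# p_e_h['H1']['E1'] as 2/3, p_e['E1'] as 0.4, and so on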

As it turns out, this approach doesn't always work so well. Consider the following training data:

    E1 E2
H1   0  0
H1   1  1
H2   1  0
H2   0  1

The meaning of the data is intuitively clear, it's H1 if E1 and E2 are the same and H2 if E1 and E2 are different. But when we start computing P(E|H), all of them end up at 0.5 and the resulting expert system can't tell anything apart.

There is a solution: for the duration of the computation, split each hypothesis into two, say H1A and H1B:

    E1 E2
H1A  0  0
H1B  1  1
H2A  1  0
H2B  0  1

Before making the final decision, add up the probabilities P(H1)=P(H1A)+P(H1B) and use them for the decision. Now the logic works. The hypotheses can be split down to the point where each case in the training data becomes its own sub-hypothesis. Indeed, the current example has done exactly this. If there are multiple cases that are exactly the same, we could keep them together by assigning a weight instead of splitting them into separate sub-hypotheses. For example, if there are 5 cases that are exactly the same, we can make them one sub-hypothesis with the weight of 5.

And with such a fine splitting the computation of probabilities can be thought of as striking out the sub-hypotheses that don't match the incoming events. If we're computing the diagnosis and receive E1=1, we can throw away H1A and H2B, leaving only H1B and H2A. If then we receive E2=1, we can throw away H2A, and the only hypothesis left, H1B, becomes the final diagnosis that we'll round up to H1.
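
In Python the striking-out can be sketched like this (my own toy version of it):

table = {
    "H1A": {"E1": 0, "E2": 0},
    "H1B": {"E1": 1, "E2": 1},
    "H2A": {"E1": 1, "E2": 0},
    "H2B": {"E1": 0, "E2": 1},
}

alive = dict(table)
for event, value in [("E1", 1), ("E2", 1)]:      # the incoming events
    alive = {h: ev for h, ev in alive.items() if ev[event] == value}

print(alive)   # only H1B is left, and it rounds up to the diagnosis H1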

We're essentially trying to find a match between the current case and one of the training cases. But that's exactly what the decision trees do! We can represent the same table as a decision tree:

             |
             V
        0   E1    1
        +----+----+
        |         |
        V         V
     0 E2  1   0 E2  1
     +--+--+   +--+--+
     |     |   |     |
     V     V   V     V
    H1A   H2B H2A   H1B

As you can see, it's equivalent: it produces the exact same result on the same input.

But what if we're not entirely confident in the input data? What if we get E1 with the confidence C(E1)=0.7? The way the Bayesian formula

P(H|C(E)) = P(H) * ( C(E)*P(E|H) + (1-C(E))*(1-P(E|H)) )
  / ( C(E)*P(E) + (1-C(E))*(1-P(E)) )

treats it amounts to "multiply the weight of the cases where E1=1 by 0.7 and multiply the weight of the cases where E1=0 by 0.3". So in this example we won't throw away the cases H1A and H2B but will multiply their weights by 0.3 (getting 0.3 in the result). H1B and H2A don't escape unscathed either, their weights get multiplied by 0.7. Suppose we then get E2=0.8. Now the weights of H1A and H2A get multiplied by 0.2, and of H1B and H2B get multiplied by 0.8. We end up with the weights:

W(H1A) = 1*0.3*0.2 = 0.06
W(H1B) = 1*0.7*0.8 = 0.56
W(H2A) = 1*0.7*0.2 = 0.14
W(H2B) = 1*0.3*0.8 = 0.24

When we add up the weights to full hypotheses, W(H1)=0.62 and W(H2)=0.38, so H1 wins (although whether it actually wins or we consider the data inconclusive depends on the boundaries we set for the final decision).
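
Here is the same computation by weights as a small Python sketch, reproducing the numbers above:

table = {
    "H1A": {"E1": 0, "E2": 0},
    "H1B": {"E1": 1, "E2": 1},
    "H2A": {"E1": 1, "E2": 0},
    "H2B": {"E1": 0, "E2": 1},
}
confidence = {"E1": 0.7, "E2": 0.8}

weights = {h: 1.0 for h in table}
for e, c in confidence.items():
    for h, ev in table.items():
        weights[h] *= c if ev[e] == 1 else (1 - c)

# weights come out as about {'H1A': 0.06, 'H1B': 0.56, 'H2A': 0.14, 'H2B': 0.24}
w_h1 = weights["H1A"] + weights["H1B"]   # about 0.62
w_h2 = weights["H2A"] + weights["H2B"]   # about 0.38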

Can this be represented as a decision tree? Sure! It just means that when we make a decision at each event node we don't choose one way out. We choose BOTH ways out but with different weights. If the weight of some branch comes down to 0, we can stop following it, but we faithfully follow all the other branches until they come down to the leaves, and keep track of the weights along the way.
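
As a sketch, such a weighted traversal of the tree above could look like this in Python, and it arrives at the same weights as the table computation:

# a node is either a leaf (the name of a sub-hypothesis)
# or a tuple (event, branch_if_0, branch_if_1)
tree = ("E1",
        ("E2", "H1A", "H2B"),   # the E1 = 0 side
        ("E2", "H2A", "H1B"))   # the E1 = 1 side

def traverse(node, confidence, weight, result):
    if weight == 0:
        return                               # this branch has died out
    if isinstance(node, str):                # a leaf: accumulate the weight
        result[node] = result.get(node, 0) + weight
        return
    event, if0, if1 = node
    c = confidence[event]
    traverse(if0, confidence, weight * (1 - c), result)   # follow the 0 branch
    traverse(if1, confidence, weight * c, result)         # follow the 1 branch

result = {}
traverse(tree, {"E1": 0.7, "E2": 0.8}, 1.0, result)
# result: about {'H1A': 0.06, 'H2B': 0.24, 'H2A': 0.14, 'H1B': 0.56}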

Now, what if the table contains not just 0s and 1s but the arbitrary values between 0 and 1? This might be because some training cases had only a partial confidence in their events. Or it might be because we averaged the events of multiple training cases together, building the classic Bayesian table with one line per hypothesis.

We can still compute it with weights. We can just logically split this case into two cases with the partial weights. If we have a value of 0.7 for the hypothesis H and event E, we can split this line into two, one with 1 in this spot and the weight of 0.7, another one with 0 in this spot and the weight of 0.3. And we can keep splitting this case on every event. For example, if we start with

    E1  E2
H1  0.7 0.1

we can first split it into 2 cases on E1:

     E1  E2   W
H1A  1   0.1  0.7
H1B  0   0.1  0.3

And then further split on E2:

     E1  E2   W
H1AA 1   1    0.07
H1AB 1   0    0.63
H1BA 0   1    0.03
H1BB 0   0    0.27

Or we can modify the table as we apply the events: right before applying the event, split each row in the table in twain based on the values in it for this event, apply the event by multiplying the weights appropriately, and then collapse the split rows back into one by adding up their weights. This makes the intermediate values more short-lived and saves on their upkeep. And that's exactly the logic used in the classic Bayesian formula.
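
A Python sketch of this split-apply-collapse step, for a single row and a single event:

def apply_event(row, event, c):
    # row: the fractional event values plus the weight "W" of the row;
    # c: the confidence C(E) of the incoming event
    v = row[event]            # the fraction of the row where the event is 1
    w = row["W"]
    # split into the "event = 1" part with the weight w*v and the "event = 0"
    # part with the weight w*(1-v), apply the confidence to each part,
    # then collapse them back by adding up the weights
    row["W"] = w * v * c + w * (1 - v) * (1 - c)
    return row

row = {"E1": 0.7, "E2": 0.1, "W": 1.0}
apply_event(row, "E1", 0.7)   # W becomes 0.7*0.7 + 0.3*0.3 = 0.58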

Since, once the table rows get split, the operations on the table stay exactly the same, they can still be represented with the decision tree. Just the splitting of the rows in the table would translate to the splitting of nodes in the decision tree.

In the end the Bayesian logic and the decision trees are equivalent. They do the same things. Traditionally the Bayesian logic uses the training cases that have been averaged out while the decision trees try to match to the individual training cases. But both can be used in either way. And it's possible to get a mix of both by partitioning the weights between the original training cases and their combined averaged versions. It's a smooth scale of 0 to 1: if you transfer all the weight of the training cases into averaging, you get the traditional Bayesian logic, if you transfer none of it, you get the traditional decision trees, and if you transfer say 0.3 of the weight, you get the logic that is a combination of 0.3 of the traditional Bayesian logic and of 0.7 of the traditional decision trees. Note I've been saying "traditional" because they are really equivalent and can be transformed into each other.

Sunday, May 22, 2016

AdaBoost in simpler formulas 2

I've been reading along the book on boosting, and I'm up to about 1/3 of it :-) I've finally realized an important thing about how the H(x) is built.

For easy reference, here is another copy of the AdaBoost algorithm from the previous installment, simplified slightly further by replacing (1-et)/et with Wgoodt/Wbadt, and et/(1-et) with Wbadt/Wgoodt as mentioned at the end of it and getting rid of et altogether:



Given: (x1, y1), ..., (xm, ym) where xi belongs to X, yi belongs to {-1, +1}.
Initialize: D1(i) = 1 for i = 1, ..., m.
For t = 1, ..., T:
  • Train the basic algorithm using the weights Dt.
  • Get weak hypothesis ht: X -> {-1, +1}.
  • Aim: select ht to minimize the weighted error Wbadt/Wgoodt:
    Wgoodt = 0; Wbadt = 0;
    for i = 1, ..., m {
         if ht(xi) = yi {
              Wgoodt += Dt(i)
         } else {
              Wbadt += Dt(i)
         }
    }
  • Update,
    for i = 1, ..., m {
          if ht(xi) != yi {
              Dt+1(i) = Dt(i) * sqrt(Wgoodt/Wbadt)
         } else {
              Dt+1(i) = Dt(i) * sqrt(Wbadt/Wgoodt)
         }
    }
Output the final hypothesis:
H(x) = sign(sum for t=1,...,T (ln(sqrt(Wgoodt/Wbadt))*ht(x)) ).



I've been wondering, what's the meaning of ln() in the formula for H(x). Here is what it is:

First of all, let's squeeze everything in under the logarithm. The first step would be to put ht(x) there.

ln(sqrt(Wgoodt/Wbadt))*ht(x) = ln(sqrt(Wgoodt/Wbadt)^ht(x))

This happens by the rule of ln(a)*b = ln(a^b).

Since ht(x) can be only +1 or -1, raising the value to this power basically means that, depending on the result of ht(x), the value is either taken as-is or inverted (1 divided by it). Which is the exact same thing that is happening in the computation of Dt+1(i). The two formulas are getting closer.

The next step, let's stick the whole sum under the logarithm using the rule ln(a)+ln(b) = ln(a*b):

H(x) = sign(ln( product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^ht(x) ) ))

The expression under the logarithm becomes very similar to the formula for Dt+1(i) as traced through all the steps of the algorithm:

Dt+1(i) = product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^(-yi*ht(xi)) )

So yeah, the cuteness of expressing the condition as a power comes in handy. And now the final formula for H(x) makes sense: the terms in it are connected with the terms in the computation of D.

The next question, what is the meaning of the logarithm? Note that its result is fed into the sign function. So the exact value of the logarithm doesn't matter in the end result, what matters is only whether it's positive or negative. The value of the logarithm is positive if its argument is > 1, and negative if it's < 1. So we can get rid of the logarithm and write the computation of H(x) as:

if ( product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^ht(x) ) > 1 ) then H(x) = +1 else H(x) = -1

Okay, if it's exactly = 1 then H(x) = 0 but we can as well push it to +1 or -1 in this case. Or we can write that

H(x) = sign( product for t=1,...,T ( sqrt(Wgoodt/Wbadt)^ht(x) ) - 1 )

The next thing, we can pull the square root out of the product:

if ( sqrt( product for t=1,...,T ( (Wgoodt/Wbadt)^ht(x) ) ) > 1 ) then H(x) = +1 else H(x) = -1

But since the only operation on its result is the comparison with 1, taking the square root doesn't change the result of this comparison. If the argument of square root was > 1, the result will still be >1, and the same for < 1. So we can get rid of the square root altogether:

if ( product for t=1,...,T ( (Wgoodt/Wbadt)^ht(x) ) > 1 ) then H(x) = +1 else H(x) = -1

The downside of course is that the computation becomes unlike the one for Dt+1(i). Not sure yet if this is important or not.

Either way, we can do one more thing to make the algorithm more readable: we can write the product as a normal loop:



chance = 1;
for t=1,...,T {
     if (ht(x) > 0) {
          chance *= Wgoodt/Wbadt;
     } else {
          chance *= Wbadt/Wgoodt;
     }
}
H(x) = sign(chance - 1)



Note that this code runs not at the training time but later, at the run time, with the actual input data set x. When the model runs, it computes the actual values ht(x) for the actual x and computes the result H(x).

I've named the variable "chance" for a good reason: it represents the chance that H(x) is positive. The chance can be expressed as a ratio of two numbers A/B. The number A represents the positive "bid", and the number B the negative "bid". The chance and probability are connected and can be expressed through each other:

chance = p / (1-p)
p = chance / (1+chance)

The chance of 1 matches the probability of 0.5. Initially we have no knowledge about the result, so we start with the chance of 1, and with each t the chance gets updated according to the hypothesis picked on that round of boosting.
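
In code the conversion is trivial; as a sketch, this is what would turn the final value of "chance" into a probability estimate of the positive outcome:

def chance_to_probability(chance):
    return chance / (1 + chance)

print(chance_to_probability(1.0))   # 0.5: no preference either way
print(chance_to_probability(3.0))   # 0.75: leaning towards the positive outcome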

The final thing to notice is that in the Bayesian approach we do a very similar thing: we start with the prior probabilities (here there are two possible outcomes, with the probability 0.5 each), and then look at the events and multiply the probability accordingly. At the end we see which hypothesis wins. Thus I get the feeling that there should be a way to express the boosting in Bayesian terms, for a certain somewhat strange definition of events. Freund and Schapire describe a lot of various ways to express the boosting, so why not one more. I can't quite formulate it yet, it needs more thinking. But the concept of "margins" maps very nicely to the Bayesian approach.

In case you wonder what the margins are: as the rounds of boosting go, after each round H(x) can be computed for each training case xi. At some point they all start matching the training results yi; however, the boosting can be run further, and more rounds can still improve the results on the test data set. This happens because the sign() function in H(x) collapses the details, and the further improvements are not visible on the training data. But if we look at the argument of sign(), such as the result of the logarithm in the original formula, we'll notice that they keep moving away from the 0 boundary, representing more confidence. This extra confidence then helps make better decisions on the test data. This distance between the result of the logarithm and 0 is called the margin. Well, in the Bayesian systems we also have the boundary of the probability (in the simplest case of two outcomes, 0.5), and when a Bayesian system has more confidence, it drives the resulting probabilities farther from 0.5 and closer to 1. Very similar.