Sunday, December 13, 2015

Bayes 20: Testing

I've already mentioned how to test a model but I want to go into it in more detail and collect all the items on this subject in one place.

The first thing you need to define is some cost metric for comparison. The result of the testing will not be binary "pass or fail", it will be the value of that cost metric, and you'll be trying to minimize that cost.

In the perfect world, you would have the table of cost for replacing every spare part (including both the part itself and the labor). You'd also know the labor cost of performing the diagnosis by a human diagnostician. This cost might vary by the type of problem, and in the even more perfect world you'd have this information too. Moreover, the human diagnostician can make mistakes, so you'd have the information about the frequency of mistakes and the cost of labor and parts spend on them factored into the information about the cost of the human diagnosis. Having these tables, you'd be ready to do the comparisons.

If you don't have the exact tables, you can do some kind of estimations. For a toy model, you can do some very rough estimations: for example, say that the cost of the human diagnosis is 1 unit, and the cost of any particular repair is 3 units.

Then you take the body of the training data. The meaning of the words "training data" is kind of loose here, it's not necessarily the data that you will be using to train the model, you might put at least some of it aside to use in the testing. It all can also be called the "ground truth": the information about the previous cases, diagnosed by humans, and confirmed that the diagnosis is correct. For each test, you'd want to measure how good each version of your model does against the other versions of the model, and also against the humans. After all, if your diagnostic automation results in higher costs than the human diagnosis, there is no point in using it. And if it does better, you'd have some exciting numbers to show to your bosses and/or customers.

There is an absolute minimum cost needed to do the repairs on a set of training data: you take the cost of fixing every confirmed problem in it and add them up.

Cperfect = sum( cost(every_problem) )

To estimate, what the human diagnosis would cost, you'd take the cost for diagnosing one case, multiply it my the number of the cases in the training data and add to Cperfect to get the total cost.

Chuman = Cperfect + cost(diagnosis)*n_of_cases

Of course, if you have the more detailed information about the cost of diagnosis by each type of problem, you can use this detailed information to add them up. Generally there still will be the fixed cost of diagnosing one case, plus the sum of diagnosing each problem in this case:

Chuman = Cperfect 
    + cost(diagnose_case)*n_of_cases 
    + sum( cost(diagnose_every_problem) )

Note that even if you build a perfect diagnoser, the best savings you can hope for are (Chuman - Cperfect). Or if you prefer to see it as a percentage, 100%*(Chuman - Cperfect)/Chuman.

In reality when you run your automatic diagnoser, you'll have some problems misdiagnosed. There will be some false negatives (when your model failed to notice some problem) and some false positives (when your model diagnosed a problem that is not present). If your model has produced a wrong diagnosis, that's obviously a combination of a false negative (for the real problem that got missed) and a false positive (for the wrong problem that got diagnosed).

The cost of the false negatives is the cost of a human diagnosis, because the human diagnostician would have to go and look at this case. The cost of post-repair testing might need to be added as well, because that's what would be detecting that the problem is not fixed before sending it to the human. In many cases the cost of this testing might be negligible compared to the cost of human diagnosis.

The cost of the false positives is the cost of the parts and labor spent on replacing the parts that aren't broken.

With all this, the cost of the repairs per the expert system will be:

C = Cperfect 
    + cost(diagnose_case)*n_of_false_negative_cases
    + sum( cost(diagnose_every_false_negative_problem) )
    + sum( cost(repair_every_false_positive_problem) )

You'd compare the output of your model with the perfect diagnosis, notice the false positives and false negatives, and add their costs.

Now you're able to compare two models: run them on the same data, find the resulting cost, and see which cost is lower and by how much. You can try different algorithms and different constants for these algorithms and see the changes of the cost. And sometimes the results would surprise you, you'll discover that you went for that fancier algorithm only to make things worse.

If you're wondering, what kind of boundary constant value should be used for accepting the hypotheses, the answer is to try multiple values and see, which one works better. If all goes well, you should be able to build a graph of the total cost by the boundary value and see a nice smooth-ish curve with a distinct minimum on it, something like this:

|
|                   *
|                   
|   *           *
|      *    *
|         
|
|
+-----------------------

If you have two interdependent constants (such as in the algorithm that computes probabilities for both all hypotheses and independent hypotheses, and has different acceptance boundaries for these sub-algorithms), you may want to try taking a couple values of one constant, and for each one of them go through the graphing of the cost by the changing of the other constant. That might give you the hint of how they are interdependent. And then with this knowledge you'd be able to choose a smaller interesting area and go through every combination of both constants in it, compute the cost, and find the minimum on the 3-dimensional graph.

You might be even able to analyze the dependencies, build a formula, and find a good approximation of the minimum analytically.

These boundary constants generally adjust the balance between the false positives and false negatives. If you set the boundary too low, a lot of false positives will be accepted. If you set the boundary too high, the model will reject a lot of diagnoses and thus generate many false negatives. And around the optimum, you'll be trading some false positives for false negatives. In the perfect world there would be some area with no false positives nor false negatives but in reality you'll still have both, and it will come to giving up some in one area to win in the other.

The exact win or loss around the optimum area will depend on the mix of cases in the training data. If you move the boundary up, and lose in lots of cases, and win in a few, your cost will go up. If you move the boundary up, lose in a few cases but win in many, your cost will go down. The proportions of different cases mixed in your training data will determine the balance for the minimal total loss.

This has two consequences: First, there is no point in the very precise picking of the boundary constants. The small changes of these constants one way or the other won't matter, they will depend on which mix of cases did we get in the last time period. And the mix will fluctuate a little over time, there is no way to predict it precisely. Second, it's important to preserve the mix of the cases in the training. If you select a subset of data for the training, the mix of cases in it should match the full real set. Don't use the approaches like "we'll only include into the training data the cases where we've been able to diagnose the problem on the first go". If you do that, you'll be changing the mix, throwing away the more complex data, and your cost estimations won't match the
reality.

The common advice is "split your case data in two, use one half for training, another half for testing". As you can see, how exactly your split the data, will matter. You don't want to change the mix too much. If you take the first half of the data for training and the second half for testing, and the data happens to be sorted on some parameter, you'll get the very different mixes in two halves. It can get as bad as the training in the first half being totally wrong for diagnosing the second half. You need to do the splitting in some more random way. Either by picking the cases by some random number generator, or another good approach is to split them by the timestamp: order the cases by the timestamp and then you can divide them into the first and second half.

But this whole advice of splitting the data in two is not very good. It's good for the follow-up testing but not good for the initial testing. For the initial testing, you want to use the SAME data both for training and for testing. If you trained your model on some data and still can't diagnose everything right when testing with the exact same data, that will give you a good opportunity to analyze, what went wrong. Some people might say "oh, but you'll be over-fitting your model to the particular data". Over-fitting is rarely a problem for the Bayesian models, the typical problem is the opposite. And if the mix of cases in your training data matches the typical real mix, you'll be fitting the expectations close enough to the reality.

Thus start with testing on the same data that you used for training. Identify all the cases of false positives and false negatives. See if there is something common with them. Look in depth at some of the cases, how did the model come up with this result? What caused it to miss the correct result? In my small examples, I've been showing the step-by-step intermediate results of applying every event. This is the kind of data you want to look at. Did the steps match what you expected? If they didn't then why, what values in the tables cause the computation to veer this way?

This detailed data becomes large fairly quickly, so some automatic pre-processing can help with finding the source of trouble. In particular, I've been using the automated code that would pick, which events caused the biggest changes in the probabilities of the hypotheses, both the computed ones and the expected ones. Then it can be printed out in the more compact form that is easier to analyze at a glance. To get this kind of printout automatically, it helps to build the diagnostic support into the program itself. The good plan is to run the test data through the model once, pick the strange cases, and then run them through the model the second time, along with the information about the correct diagnosis and the diagnosis produced by the model. These known diagnoses can then be used to drive the debugging printouts during the second computation.

The different algorithms will give different results. Running multiple algorithms on the same set of training data (with the same data used for training and testing) will let you see the upsides and downsides of each algorithm. That's how I've been coming up with my refinements. Look at what can be adjusted to make the algorithm work better. If one algorithm handles some cases well, and another one handles the other cases well, could we perhaps differentiate these cases somehow and then run them through the different algorithms? Some refinements will work well, some will crash and burn, try the different ones and pick the ones that work.

And only after you've got a system that can do reasonably well on processing the data that were used for training, it's time to test it on the other data. Again, identify the cases that go wrong. In some cases the results will be better or worse than before simply because of a different mix on cases. If you see the same kinds of mistakes as when you tested with the training data but in different proportions, it's probably the mix issue. If you see the different mistakes, well, it means that there is something in this set of data that you haven't seen before and that throws your model off.

A useful experiment is to split your data (nicely and randomly) in two, then use each half to train and test in turn. If we call these parts A and B, then you can do:

  • train with part A, test with part A
  • train with part B, test with part B
  • train with part A, test with part B
  • train with part B, test with part A

If you test each algorithm change on both halves of the data, you'll be able to see if it affects them in the same way or differently. Then test the training table you produced from one half of data on the other half of data. Did it do substantially differently than the same algorithm did when trained by the same data as used in the test? If yes then perhaps the mix in two halves of data is different.

After your system is in production, you should still keep collecting the training data, and keep re-training the model. As the devices age, some of their parts will be coming to the end of life (by design or by mis-design), and this will be showing as the increasing frequency of their failures. This is something that you'd want to capture for the future diagnosis.

And the diagnostic data that was useful for testing is useful in production too. It's useful in two ways: First, when the diagnosis turns out to be wrong, and the case goes to the human diagnostician, the human may benefit from knowing, why did the machine make this diagnosis. He might be able to pinpoint the error quickly and cheaply, without repeating the difficult steps. Second, you should always look back at what is going wrong in production? It would never be perfect but can we do it better? For that, it would help to take the misdiagnosed cases and analyze them further. Try to figure out, what went wrong, and the diagnostic information is what lets you find out what went wrong. You might be also able to use the feedback from the human diagnosticians.

Oh, and a little tangent on the subject of the "ground truth". Recently I went to an internal conference on the subject of machine learning, and there they've been saying that there are the systems with the "ground truth" and systems without the "ground truth" (i.e. no known perfect solution, such as finding the topic of a text message). I disagree with that. I think that there are no systems without the "ground truth". There is always at least the "ground truth" benchmark of how would a human do at this task? Can the automated system match the human? Can the automated system beat the human? In that example of finding the topic of a test message, there obviously must be a body of training messages that the humans would read and formulate the topics. Then these messages can be used for testing of the automated models. And the good automated models must come reasonably close to what the humans did. Only after that can we use these automated models to process the real messages in the wild. Otherwise there is no way to tell if the results of these models are any good or if they are some complete garbage.

There is always, ALWAYS a way to test your system and estimate, how good is it doing. If you didn't test it, you're not following the best practices, and you probably have a whole lot of bugs in your program. Testing the large volumes of data manually is impossible but the solution is to pick some samples and check carefully that these samples are producing the correct results. And of course there are other indications: for example, if you ever see a probability value of 15, this means that there is something very wrong with your computation.

***

That concludes what I had to say on the subject of the Bayesian expert systems. Unless I recall something that I forgot :-) I wrote up all the ideas I had, and that's actually more ideas than I've started with, I've had some interesting realizations as I went along. At some point I'll probably provide the code examples for the techniques discussed in the last few installments. We'll see how it will go with my time.

Saturday, December 12, 2015

Bayes 19: model information, and enumerated events

Sometimes the diagnosis depends on the features of the device being diagnosed. For example, you wouldn't want to diagnose the power steering fluid leak in a car with the electric power steering (or no power steering at all).

The typical thing would be to at least ask the make and model of the car up front and enter it as the events into the diagnosis. How would it be entered as events? The straightforward way would be to define one event per model. Then when diagnosing a car, the event for its model will be true, and for the rest of the models false.

This would also mean that every single training case will be model-specific. And even if we fold multiple training cases into the per-hypothesis entry, it still will include the information about what models experienced this failure, and at what prior probability.

In one way, this is good. It means that we will diagnose this hypothesis only for the models where this failure can actually occur. In another way, it's bad. For some rarer models we may not have any training cases showing that yes, this problem can happen for this model. There might be plenty of Volvos in the training data to show all their possible problems but not enough Aston Martins to show any but the most frequent problems. Yet the Aston Martins are mechanically similar to the other cars, and much of knowledge about the other cars can be applied to the Aston Martins too. So should we take the model into account or not?

Its hard to tell for sure for each particular case. But the approach of specifying the model information as events actually works fairly well together with the confidence capping. The confidence capping means that even if we can't find an exact match in training data, we'll still keep trying to find the best match, first with the mismatch of one event, then with the mismatch of two events, and so on. Finding a similar training case on another model means the mismatch of two events: the event for the model of that training case will be false instead of true in the input, and the event for the model of the actual car being examined will be true instead of false. So in essence it will mean that the fully matching training case for the same model will be preferred, but if there isn't one, it will become a choice between the otherwise fully matching case from another model and a partially matching case for this model.

So far so good but there are more caveats to keep in mind. This would not work well if we try aggressively to find the relevant symptoms/events of the training cases. We might end up with a training case for this particular model that has only the model information and one other event marked as relevant. Well, this event will be a mismatch but it's a mismatch on only one event (since the others are marked as irrelevant), so this case will take a priority over the cases for the other model but a matching symptom (since these case will have two mismatches in the model information events). A possible workaround for this is to not fold the model information into the events but just partition the training data into the separate per-model tables and a common table without the model information. Then look in the per-model table first, and if nothing useful is found there, look in the common table. This workaround has its own problems of course, such as what if there are multiple failures, one of which showing in the per-model table but the second one only in the common table? The second table computation might never happen because the first failure would be found in the per-model table and the processing will stop.

Another complication is that with the large number of events the uncertainty introduced by the certainty capping will accumulate fast. With a hundred events, and using 0.01 for the cap value, by the end of the processing the result will be completely uncertain. That's why the smaller values of the cap, such as 1e-8 are typically more useful.

You might think that if we do the computation directly by the training cases, we preserve the correlation between the events, so instead of having N events for N models, we can encode them as a binary number that can be represented with only log2(N) events. But that's a bad idea. It means that the code distance between the representations of the models will become varying. Instead of differing always by 2 events, they could differ by as few as 1 and as many as log2(N) events. That will seriously mess up the logic.

A possible workaround for that is to allow some events to have not 2 possible values (true or false) but a number of possible values, as many as hundreds or even thousands. The fuzzy logic can still be used with such events: they would be represented by two values, the confidence value in the range [0..1] and the enumeration value. We'd check if the enumeration value of the event in the input matches the enumeration value of the event in the training case, and then apply the confidence value appropriately, for or against this training case. It also means that the mismatching value of this event will be a mismatch of only one event, not two, and thus improve the standing of the training cases from other models with the capping logic.

A completely different take on this problem is to preprocess the information about the model and split it into the separate events that describe the mechanical features of the car. There might be a separate event for the type of power steering (none, or hydraulic, or electric), the type of the front brakes (disc or drum), the type of the rear brakes, and so on. The approach with the enumerated multiple-valued events can be used here as well, with for example the power steering type becoming one event with 3 possible values. However to use this approach effectively, we really need to handle the event relevance properly: the training cases for the problems in each subsystem must mark only the identifying information about this subsystem as relevant. You wouldn't want to fail recognizing a brake problem just because the car in the training case had a different type of power steering than the one being diagnosed. But this kind of relevance might be difficult to discover automatically, it would require the information about the subsystems and their relations to be added by the humans.

The good part about this splitting is that it would allow to diagnose the cars that have been modified by the owners. If someone installed the Lincoln disc brakes on the rear axle of a Fox-body Ford Mustang that had the drum brakes from the factory, this information can be entered as an override after entering the car model, changing the event for the type of the rear brakes. And the system will still be capable of diagnosing the problems. On the other hand, it might not help much: the aftermarket parts often have their own problems stemming from the mismatch of their characteristics with the original parts. For example, the said Lincoln brake conversion fits mechanically but causes the very wrong proportioning of the brake forces between the front and the rear. On the third hand, a sufficiently smart diagnostics system might be still useful. Even though the misproportioned brake forces do not happen in the cars from the factory, a smart enough system might still be able to point to a faulty master brake cylinder. It wouldn't be faulty as such but it would be faulty in its function for the modified brakes, distributing the brake forces in the wrong proportion.

Tuesday, December 8, 2015

Bayes 18: finding the relevance

I've previously promised to describe the ideas I have for computing the relevance values, and here we go. I haven't tested this ideas with a set of production data. They're just ideas. But they look like a reasonable start.

The general idea should be to tell apart if this value of this event differentiates this hypothesis/case from the other hypotheses/cases or not. If it does differentiate, it's relevant, otherwise it's irrelevant.

Taking two cases, we should be able to differentiate them based only on the relevant events in them both (and obviously a relevant event in one case can only be used to make a decision for this case, not for the other one).

For example, suppose that we have two cases with outcome of two different hypotheses where the event values are represented as a bit string:

hyp1 0 0 0 1 0 0
hyp2 0 1 0 0 0 0

We can mark some events irrelevant by "-" (how we do this is a separate question, for now let's suppose that we've done it somehow) and still be able to differentiate these cases:

hyp1 - - - 1 - -
hyp2 - 1 - - - -

The bit string "0 0 0 1 0 0" will still match only the case for hyp1, and "0 1 0 0 0 0" will still match only the case for hyp2. Obviously, when working with the full training set, we should be able to differentiate between any two cases in the training set.

As it turns out, this problem of finding the smallest set of differentiating bits is already known. It's the same problem as the optimization of a set of boolean functions in the hardware engineering. The very classic way to approach it is with the Karnaugh maps, though they don't scale so easily to the large number of bits. But there's got to be some newer and better ways for this optimization.

This approach has its pitfalls too. For another example, suppose that we have the cases:

hyp3 0 0 0 1 0 0
hyp4 0 1 0 1 0 0

Then we can differentiate them based purely on one event:

hyp3 - 0 - - - -
hyp4 - 1 - - - -

Of course, this works only as long as we have only two events. And it would likely result in the spurious false positives: in case if all the events in the input are at 0, they would still be taken to mean hyp3.

This can be resolved or at least improved by having the training set include the set of the cases where there is nothing wrong. What is known as the "null hypothesis", and what I've been marking as the "ok" hypothesis in the previous examples. The events that differ in value from the null hypothesis should always be relevant. If we arrange the event polarity so that the null hypothesis has all the events at 0, this rule translates into "if an event is at 1, it must be relevant", and only the events at 0 need to be considered for the possible irrelevance.

Things get more complicated when we have some cases resulting in multiple hypotheses. Splitting their symptoms into the single-hypotheses cases is not obvious. It can be attempted by subtracting the other single-hypothesis cases and looking at what is left. If some hypothesis have no single-hypothesis cases, that becomes even more strange. Then perhaps these cases should be left as-is because they express the special interactions between multiple hypotheses. Or we can try to find the commonality between all the cases that include this particular hypothesis and discard the differences stemming from the other hypotheses included in these cases.

Let's look at the highly pathological example I've shown before and try to do these manipulations with it. The example was:

# tab09_01.txt
!             evA evB evC
1 * hyp1,hyp2 1   1   0
1 * hyp2,hyp3 0   1   1
1 * hyp1,hyp3 1   0   1

Previously I've calculated the probabilities by hypothesis:

# tab09_01.txt
!,,evA,evB,evC
hyp1,0.66667,1,0.5,0.5
hyp2,0.66667,0.5,1,0.5
hyp3,0.66667,0.5,0.5,1

Those probabilities P(E|H) can be used as a guidance for relevance. If they're close to 0 or 1 (say, below 0.1 or above 0.9) , we'll say that this event is relevant for this hypothesis, and if it's somewhere in the middle, we'll say that it's irrelevant. Then we can use this knowledge to split the cases:

!             evA evB evC
1 * hyp1      1   -   -
1 *      hyp2 -   1   -
1 * hyp2      -   1   -
1 *      hyp3 -   -   1
1 * hyp1      1   -   -
1 *      hyp3 -   -   1

Then we can merge the cases that are the same:

!        evA evB evC
2 * hyp1 1   -   -
2 * hyp2 -   1   -
2 * hyp3 -   -   1

It worked well for this special example but might be not so great in the bigger world. It needs more thinking and experimentation.

Tuesday, November 17, 2015

Bayes 17: confidence subtraction

One of the ways I've come up with for differentiating the valid hypotheses from noise is the idea of subtracting the competing hypotheses.

To remind, what the problem is, consider a simple training table:

# tab17_01.txt
!,,evA,evB,evC
hyp1,1,1,0,0
hyp2,1,0,1,0
hyp3,1,0,0,1

One event indicates one hypothesis. For now let's ignore the idea of the relevance because for now the relevance is not obvious to compute. I actually plan to work up to the computation of the relevance from the subtraction logic.

If we feed the input with all events present (and with capping), that results with all hypotheses getting the equal probability:

# in17_01_01.txt 
evA,1
evB,1
evC,1

$ perl ex16_01run.pl -c 0.01 tab17_01.txt in17_01_01.txt 
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=0.990000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.99000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.01000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.01000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.990000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00990,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00990,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.00010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.990000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00010,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00010,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.00010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

If we feed the input with all the events absent (and with capping), that also results with all hypotheses getting the equal probability:

# in17_01_02.txt 
evA,0
evB,0
evC,0

$ perl ex16_01run.pl -c 0.01 tab17_01.txt in17_01_02.txt 
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.01000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.99000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.99000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00990,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00990,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.98010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00980,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00980,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.00980,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

How do we know that for the first input the right answer is "all three hypotheses are true" and in the second input the right answer is "all three hypotheses are false"? Note that if we look at the weights, the weights are much higher for the second input.

The idea I've come up with is that we can take the set of the highly probable hypotheses (all three hypotheses in the examples above) and try to subtract the effects of all but one hypothesis in the set from the input. Then run the modified input through the table again and see if that one remaining hypothesis will pop up above all the others. If it will, it should be accepted. If it won't, it should be refused. Repeat the computation for every hypothesis in the set.

To do that, we need to decide, what does it mean, "subtract"?

It seems reasonable to make the decision based on what probability this event has for this one hypothesis and what probability it has for all the other top hypotheses.

This can be interpreted in two ways depending on what case weights we're using for this computation: these from the original table or these from the result of the first computation. Using the weights from the result of the first computation seems to make more sense, since it favors the cases that have actually matched the input.

OK, suppose, we get these two probability values, how do we subtract the effects? Let's look at some examples of what results would make sense.

Let's name the probability of the event Pone (it can also be called P(E|H) fairly consistently with what we designated by it before, or TC(E|H)) in the situation where the one chosen hypothesis is true, and Prest in the situation where all the other top hypotheses are true. Let's call the actual event confidence C, and the confidence after subtraction Csub.

Some obvious cases would be if Pone and Prest are opposite:

Pone=0, Prest=1, C=1 => Csub=0
Pone=1, Prest=0, C=1 => Csub=1
Pone=0, Prest=1, C=0 => Csub=0
Pone=1, Prest=0, C=0 => Csub=1

Basically, if Prest and C are opposite, C stays as it was, if Prest and C are the same, C flips. The other way to say it is that Csub ends up matching Pone.

The less obvious cases are where both Pone and Prest point the same way. Should C stay? Should it move towards 0.5? One thing that can be said for sure is that C shouldn't flip in this situation. There are arguments for both staying and for moving towards 0.5. This situation means that all the remaining cases match the state of this event, so staying means that there is no reason to penalize one case just because the other cases match it. Moving towards 0.5 means that we say that the rest of the hypotheses can account well for this event by themselves, so let's try to eliminate the "also ran" hypothesis. Staying seems to make more sense to me.

The goal of the subtraction is that with the subtracted confidences applied, a valid hypothesis should be strongly boosted above all others. If it doesn't get boosted (i.e. if it still gets tangled with other hypotheses), it's probably not a valid hypothesis but just some random noise.

The only time I've used the subtraction approach with the real data, I did it in a simple way, and it still worked quite well. That implementation can be expressed as:

Csub = C * TCone/(TCone + TCrest)

Here TCone and TCrest are similar to Pone and Prest but represent the sums of weighted training confidences instead of probabilities:

TCone(E) = sum(TC(E|Hone)*W(Hone))
TCrest(E) = sum(TC(E|Hrest)*W(Hrest))

That implementation was asymmetric: if C is 1, Csub may become less than C, but if C is 0, Csub will stay at 0. It handles reasonably well the situations where the event is mostly positive for the top hypotheses but not the situations where the event is mostly negative for the top hypotheses.

If we compute the values of Csub in this way for the first example above (C(evA)=1, C(evB)=1, C(evC)=1), we will get:

  • For hyp1: Csub(evA)=1, Csub(evB)=0, Csub(evC)=0
  • For hyp2: Csub(evA)=0, Csub(evB)=1, Csub(evC)=0
  • For hyp3: Csub(evA)=0, Csub(evB)=0, Csub(evC)=1

These exactly match the training cases, so all three hypotheses will be accepted, with each hypothesis going to the probability 1 on its run.

If we compute the values of Csub in this way for the second example above (C(evA)=0, C(evB)=0, C(evC)=0), we will get:

  • For hyp1: Csub(evA)=0, Csub(evB)=0, Csub(evC)=0
  • For hyp2: Csub(evA)=0, Csub(evB)=0, Csub(evC)=0
  • For hyp3: Csub(evA)=0, Csub(evB)=0, Csub(evC)=0

They stay the same as the original input, thus the results won't change, the probabilities of all hypothesis will stay at 0.33 for each run, and all the hypotheses will be rejected.

The defect of this formula shows itself when the events are negative, their absence pointing towards the hypotheses, as in the following table:

# tab17_02.txt
!,,evA,evB,evC
hyp1,1,0,1,1
hyp2,1,1,0,1
hyp3,1,1,1,0

In this case the all-0 input should produce the result saying that all the hypotheses are true, and all-1 input should have all the hypotheses as false.

For all-one C(evA)=1, C(evB)=1, C(evC)=1, we will get:

  • For hyp1: Csub(evA)=0, Csub(evB)=0.5, Csub(evC)=0.5
  • For hyp2: Csub(evA)=0.5, Csub(evB)=0, Csub(evC)=0.5
  • For hyp3: Csub(evA)=0.5, Csub(evB)=0.5, Csub(evC)=0

Let's try applying the computed values for hyp1:

# in17_02_03.txt
evA,0
evB,0.5
evC,0.5

$ perl ex16_01run.pl -c 0.01 tab17_02.txt in17_02_03.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.99000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,0.01000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3   ,0.01000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Applying event evB, c=0.500000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.49500,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,0.00500,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3   ,0.00500,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Applying event evC, c=0.500000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.24750,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp2   ,0.00250,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp3   ,0.00250,1.00000/1.00,1.00000/1.00,0.00000/1.00,
--- Probabilities
hyp1    0.98020
hyp2    0.00990
hyp3    0.00990

The hyp1 still comes out as true! This is the problem caused by the asymmetry of the formula.

My next idea of a better formula was this: instead of subtractions, just either leave Csub=C or "flip" it: Csub=(1-C). The idea of the flipping is that the direction of C, whether it's less or greater than 0.5, shows the "value" of the event while the distance between C and 0.5 shows the confidence as such. The operation of flipping keeps the confidence (i.e. the distance between C and 0.5) the same while changes the direction. And if C was 0.5, the flipping will have no effect.

C would be flipped in the situation where it points against this hypothesis but for another top hypothesis. This situation likely means that this event is a symptom of another hypothesis but not really relevant for this one.

The logic will be like this:

If TCone and C point in the same direction (i.e. both >0.5 or both <0.5)
then Csub = C;
else if there exists another top hypothesis with TCother pointing in the
 same direction as C
then Csub = (1-C);
else Csub = C;

And instead of hypotheses, we can work with the individual training cases. Instead of picking the top hypotheses, pick the top cases. And then do the subtraction/flipping logic case-by-case. Except perhaps exclude the other cases of the same hypothesis from consideration of "exists another" for flipping.

Let's work this logic through our examples.

The first example is the table

# tab17_01.txt
!,,evA,evB,evC
hyp1,1,1,0,0
hyp2,1,0,1,0
hyp3,1,0,0,1

and the all-one input: C(evA)=1, C(evB)=1, C(evC)=1. From the previous computation we know that all three hypotheses are the top probable hypotheses.

Let's check hyp1:

  • For evA, TC=1 and C=1, point the same way, so Csub=1.

  • For evB, TC=0 and C=1, point opposite, and there is TC(evB|hyp2)=1, so flip to Csub=0.

  • For evC, TC=0 and C=1, point opposite, and there is TC(evC|hyp3)=1, so flip to Csub=0.

We've got the subtracted values, let's run them through the processing:

# in17_01_03.txt
evA,1
evB,0
evC,0

$ perl ex16_01run.pl -c 0.01 tab17_01.txt in17_01_03.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,1.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=0.990000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.99000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.01000,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.01000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.98010,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00010,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.00990,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.97030,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00010,0.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.00010,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.99980
hyp2    0.00010
hyp3    0.00010

It says that hyp1 is quite true. Similarly, hyp2 and hyp3 would show themselves as true.

For the second example, let's look at the same table and the inputs of all-0: C(evA)=0, C(evB)=0, C(evC)=0. From the previous computation we know that all three hypotheses are again the top probable hypotheses.

Let's check hyp1:

  • For evA, TC=1 and C=0, point opposite, and there are two other hypotheses with TC=0, so flip to Csub=1.
  • For evB, TC=0 and C=0, point the same, so Csub=0.
  • For evC, TC=0 and C=0, point the same, so Csub=0.

It again produced (1, 0, 0), and hyp1 would also show as true! But that's not a good result, it's a false positive. This idea didn't work out well.

The problem is that we need to differentiate between the states of the event that say "there is nothing wrong" and "there is something wrong", and flip the event only if it was pointing in the direction of "there is something wrong". That's what my first asymmetric logic did, it always assumed that C=0 meant
"there is nothing wrong".

If we have the additional information about which state of the event is "normal" and which is "wrong", that would solve this problem. If we don't have this information, we can try to deduce it. A simple assumption could be that if a symptom is specific to some cases, then in most of the training cases it will be in the "normal" state, and only in a few of these specific cases it will be in the "wrong" state.

Of course, there will be exceptions, for example if a medical diagnostic system has an event with the question "is the patient feeling unwell?" then the answer for this question in most cases will be true, even though this is not the "normal" state. But it doesn't seem to cause problems: for most patients and most hypotheses TC and C on this event will be pointing the same way, and there will be no need for flipping anyway.

So, let's update the logic rules:

If TCone and C point in the same direction (i.e. both >0.5 or both <0.5)
then Csub = C;
else if there exists another top hypothesis with TCother pointing in the
 same direction as C and most cases in the original training data were
 pointing opposite C
then Csub = (1-C);
else Csub = C;

With this revised logic the revisited case of the all-0 inputs (C(evA)=0, C(evB)=0, C(evC)=0) for hyp1 will be:

  • For evA, TC=1 and C=0, point opposite, but most training cases (2 of 3) also point to C=0, so leave Csub=0.
  • For evB, TC=0 and C=0, point the same, so Csub=0.
  • For evC, TC=0 and C=0, point the same, so Csub=0.

With this unchanged input, hyp1 will still finish with the probability of 0.33, and it won't make the cut. Neither will make hyp2 nor hyp3 when processed in the same way.

Let's look at the example with the opposite table

# tab17_02.txt
!,,evA,evB,evC
hyp1,1,0,1,1
hyp2,1,1,0,1
hyp3,1,1,1,0

and again the all-0 input. In it the handling of hyp1 will be:

  • For evA, TC=0 and C=0, point the same, so Csub=0.
  • For evB, TC=1 and C=0, point opposite, most training cases (2 of 3) point to C=1, and there is hyp2 with TC(evB|hyp2) = 0, so flip to Csub=1.
  • For evC, TC=1 and C=0, point opposite, most training cases (2 of 3) point to C=1, and there is hyp3 with TC(evB|hyp3) = 0, so flip to Csub=1.

This result of (C(evA)=0, C(evB)=1, C(evC)=1) will match the training case for hyp1 exactly, and drive its probability all the way up, just as we wanted to.

This last logic has managed to handle all the examples fairly decently.

Monday, November 9, 2015

Bayes 16: code for hypothesis composition

This time I want to show the code that composes the separate cases into the merged hypotheses. I've shown before how to do it manually, now the code to do it.

The short summary of changes is:

  • The %hyphash went away since it turned out to be unnecessary.
  • Option "-compose N" added. N is the fraction in range [0..1] of what part of the weights needs to be composed into the per-hypothesis merged cases. 0 is the default and means that nothing will be composed. 1 means that all the data will be composed per hypothesis.
  • Option "-msplit" added. It enables the splitting of the multi-hypothesis cases into multiple single-hypothesis cases (just as shown before, the same data gets copied to every hypothesis separately).
  • The table loading code has been extended to allow the reading of the multi-hypothesis cases that are formatted on multiple lines. This allows to save the result of the composition and then load it for the diagnosis runs.
  • The new logic is in the function compose_hypotheses().

Initially I wanted to show only the changed code but then decided that since I'm not uploading the files anywhere yet, it would be easier for the people who want to play with the code to copy-and-paste it than to assembly it from bits and pieces. So here we go, the whole new code:

# ex16_01run.pl
#!/usr/bin/perl
#
# Running of a Bayes expert system on a table of training cases.
# Includes the code to compose the training cases by hypothesis.
# And can parse back the multi-hypothesis cases printed in multi-line.

use strict;
use Carp;

our @evname; # the event names, in the table order
our %evhash; # hash of event names to indexes
our @case; # the table of training cases
  # each case is represented as a hash with elements:
  # "hyp" - array of hypotheis names that were diagnosed in this case
  # "wt" - weight of the case
  # "origwt" - original weight of the case as loaded form the table
  # "tc" - array of training confidence of events TC(E|I)
  # "r" - array of relevance of events R(E|I)
our %phyp; # will be used to store the computed probabilities

# options
our $verbose = 0; # verbose printing during the computation
our $cap = 0; # cap on the confidence, factor adding fuzziness to compensate
  # for overfitting in the training data;
  # limits the confidence to the range [$cap..1-$cap]
our $boundary = 0.9; # boundary for accepting a hypothesis as a probable outcome
our $composite = 0.; # fraction [0..1] of the cases to compose by hypothesis
our $mhyp_split = 0; # split the multi-hypothesis cases when computing the composites

# print formatting values
our $pf_w = 7; # width of every field
our $pf_tfmt = "%-${pf_w}.${pf_w}s"; # format of one text field
our $pf_nw = $pf_w-2; # width after the dot in the numeric fields field
our $pf_nfmt = "%-${pf_w}.${pf_nw}f"; # format of one numeric field (does the better rounding)
our $pf_rw = 4; # width of the field R(E|I)
our $pf_rtfmt = "%-${pf_rw}.${pf_rw}s"; # format of text field of the same width as R(E|I)
our $pf_rnw = $pf_rw-2; # width after the dot for R(E|I)
our $pf_rnfmt = "%-${pf_rw}.${pf_rnw}f"; # format of the field for R(E|I)

sub load_table($) # (filename)
{
  my $filename = shift;

  @evname = ();
  %evhash = ();
  @case = ();

  my $nev = undef; # number of events minus 1

  confess "Failed to open '$filename': $!\n"
    unless open(INPUT, "<", $filename);

  my $prepend;
  while(<INPUT>) {
    chomp;
    s/,\s*$//; # remove the trailing comma if any
    if (/^\#/ || /^\s*$/) {
      # a comment line
    } elsif (/^\!/) {
      # row with event names
      @evname = split(/,/); # CSV format, the first 2 elements gets skipped
      shift @evname;
      shift @evname;
    } elsif (/\+\s*$/) {
      # a split-line of hypothesis names
      $prepend .= $_;
    } else {
      $_ = $prepend . $_;
      $prepend = undef;
      my @s = split(/,/); # CSV format for a training case
      # Each line contains:
      # - list of hypotheses, separated by "+"
      # - weight (in this position it's compatible with the format of probability tables)
      # - list of event data that might be either of:
      #   - one number - the event's training confidence TC(E|I), implying R(E|I)=1
      #   - a dash "-" - the event is irrelevant, meaning R(E|I)=0
      #   - two numbers separated by a "/": TC(E|I)/R(E|I)

      my $c = { };
      
      my @hyps = split(/\+/, shift @s);
      for (my $i = 0; $i <= $#hyps; $i++) {
        $hyps[$i] =~ s/^\s+//;
        $hyps[$i] =~ s/\s+$//;
      }

      $c->{hyp} = \@hyps;
      $c->{origwt} = $c->{wt} = shift(@s) + 0.;

      if (defined $nev) {
        if ($nev != $#s) {
          close(INPUT);
          my $msg = sprintf("Wrong number of events, expected %d, got %d in line: %s\n",
            $nev+1, $#s+1, $_);
          confess $msg;
        }
      } else {
        $nev = $#s;
      }

      # the rest of fields are the events in this case
      foreach my $e (@s) {
        if ($e =~ /^\s*-\s*$/) {
          push @{$c->{r}}, 0.;
          push @{$c->{tc}}, 0.;
        } else {
          my @edata = split(/\//, $e);
          push @{$c->{tc}}, ($edata[0] + 0.);
          if ($#edata <= 0) {
            push @{$c->{r}}, 1.;
          } else {
            push @{$c->{r}}, ($edata[1] + 0.);
          }
        }
      }

      push @case, $c;
    }
  }
  close(INPUT);
  if ($prepend) {
    confess "The input contained the hanging hypothese names: $prepend\n";
  }

  if ($#evname >= 0) {
    if ($#evname != $nev) {
      my $msg = sprintf("Wrong number of event names, %d events in the table, %d names\n",
        $nev+1, $#evname+1);
      confess $msg;
    }
  } else {
    for (my $i = 0; $i <= $nev; $i++) {
      push @evname, ($i+1)."";
    }
  }

  for (my $i = 0; $i <= $#evname; $i++) {
    $evname[$i] =~ s/^\s+//;
    $evname[$i] =~ s/\s+$//;
    $evhash{$evname[$i]} = $i;
  }
}

sub print_table()
{
  # the title row
  printf($pf_tfmt . ",", "!");
  printf($pf_tfmt . ",", "");
  foreach my $e (@evname) {
    printf($pf_tfmt . " " . $pf_rtfmt . ",", $e, "");
  }
  print("\n");
  # the cases
  for (my $i = 0; $i <= $#case; $i++) {
    my $c = $case[$i];
    # if more than one hypothesis, print each of them on a separate line
    for (my $j = 0; $j < $#{$c->{hyp}}; $j++) {
      printf($pf_tfmt . "+\n", $c->{hyp}[$j]);
    }

    printf($pf_tfmt . ",", $c->{hyp}[ $#{$c->{hyp}} ]);
    printf($pf_nfmt . ",", $c->{wt});
    for (my $j = 0; $j <= $#evname; $j++) {
      printf($pf_nfmt . "/" . $pf_rnfmt . ",", $c->{tc}[$j], $c->{r}[$j]);
    }
    print("\n");
  }
}

# Compute the hypothesis probabilities from weights
sub compute_phyp()
{
  %phyp = ();
  my $h;

  # start by getting the weights
  my $sum = 0.;
  for (my $i = 0; $i <= $#case; $i++) {
    my $w = $case[$i]->{wt};
    $sum += $w;

    foreach $h (@{$case[$i]->{hyp}}) {
      $phyp{$h} += $w;
    }
  }

  if ($sum != 0.) { # if 0 then all the weights are 0, leave them alone
    for $h (keys %phyp) {
      $phyp{$h} /= $sum;
    }
  }
}


# Print the probabilities of the kypotheses
sub print_phyp()
{
  printf("--- Probabilities\n");
  for my $h (sort keys %phyp) {
    printf($pf_tfmt . " " . $pf_nfmt . "\n", $h, $phyp{$h});
  }
}

# Apply one event
# evi - event index in the array
# conf - event confidence [0..1]
sub apply_event($$) # (evi, conf)
{
  my ($evi, $conf) = @_;

  # update the weights
  for (my $i = 0; $i <= $#case; $i++) {
    my $w = $case[$i]->{wt};
    my $r = $case[$i]->{r}[$evi];
    my $tc = $case[$i]->{tc}[$evi];

    $case[$i]->{wt} = $w * (1. - $r)
      + $w*$r*( $tc*$conf + (1. - $tc)*(1. - $conf) );
  }
}


# Apply an input file
sub apply_input($) # (filename)
{
  my $filename = shift;

  confess "Failed to open the input '$filename': $!\n"
    unless open(INPUT, "<", $filename);
  while(<INPUT>) {
    chomp;
    next if (/^\#/ || /^\s*$/); # a comment

    my @s = split(/,/);
    $s[0] =~ s/^\s+//;
    $s[0] =~ s/\s+$//;

    confess ("Unknown event name '" . $s[0] . "' in the input\n")
      unless exists $evhash{$s[0]};
    my $evi = $evhash{$s[0]};

    my $conf = $s[1] + 0.;
    if ($conf < $cap) {
      $conf = $cap;
    } elsif ($conf > 1.-$cap) {
      $conf = 1. - $cap;
    }
    printf("--- Applying event %s, c=%f\n", $s[0], $conf);
    &apply_event($evi, $conf);
    &print_table;
  }
  close(INPUT);
}

# compose the training cases by hypothesis
sub compose_hypotheses()
{
  if ($mhyp_split) {
    # the number of cases will grow, remember the index of last original case
    my $lastcase = $#case + 1;
    for (my $i = 0; $i <= $lastcase; $i++) {
      my $c = $case[$i];
      while ($#{$c->{hyp}} > 0) {
        # split a copy of the case for each resulting hypothesis in it
        my $hname = pop @{$c->{hyp}};
        push @case, {
          hyp => [$hname],
          wt => $c->{wt},
          origwt => $c->{origwt},
          tc => $c->{tc},
          r => $c->{r},
        };
      }
    }
  }

  if ($composite <= 0.) {
    return; # nothing to do
  }

  # newly-generated composite hypotheses
  my %hyps; # keyed by the hypothesis names
  for (my $i = 0; $i <= $#case; $i++) {
    my $c = $case[$i];
    my $key = join("+", sort @{$c->{hyp}});
    if (!exists $hyps{$key}) {
      $hyps{$key} = {
        hyp => $c->{hyp},
        wt => 0.,
        wtrel => [], # weight of relevant part, by event
        tc => [], # initially will contain the sum
        r => [], # will be filled later
      };
    }
    my $hyp = $hyps{$key};
    my $wt = $c->{wt};

    $hyp->{wt} += $wt;

    for (my $e = 0; $e <= $#evname; $e++) {
      my $r = $c->{r}[$e];
      $hyp->{wtrel}[$e] += $wt * $r;
      $hyp->{tc}[$e] += $wt * $r * $c->{tc}[$e];
    }
  }
  if ($composite >= 1.) {
    # throw away the raw cases, since the hypotheses will replace them
    @case = ();
  } else {
    # 2nd pass: adjust the weight of the raw cases,
    # unless it's the only case of the hypothesis (then throw
    # away that hypothesis)
    for (my $i = 0; $i <= $#case; $i++) {
      my $c = $case[$i];
      my $key = join("+", sort @{$c->{hyp}});
      if ($hyps{$key}->{wt} == $c->{wt}) {
        delete $hyps{$key}; # hypothesis is a copy of the case
      } else {
        $c->{wt} *= (1. - $composite);
      }
    }
  }

  foreach my $h (sort keys %hyps) {
    my $hyp = $hyps{$h};
    my $wt = $hyp->{wt};

    for (my $e = 0; $e <= $#evname; $e++) {
      $hyp->{r}[$e] = $hyp->{wtrel}[$e] / $wt;
      if ($hyp->{wtrel}[$e] == 0.) {
        $hyp->{tc}[$e] = 0.;
      } else {
        $hyp->{tc}[$e] /= $hyp->{wtrel}[$e];
      }
    }

    # scale down the weight
    $hyp->{wt} *= $composite;
    $hyp->{origwt} = $hyp->{wt};
    delete $hyp->{wtrel};
    push @case, $hyp;
  }
}

# main()
while ($ARGV[0] =~ /^-(.*)/) {
  if ($1 eq "v") {
    $verbose = 1;
  } elsif ($1 eq "c") {
    shift @ARGV;
    $cap = $ARGV[0]+0.;
  } elsif ($1 eq "b") {
    shift @ARGV;
    $boundary = $ARGV[0]+0.;
  } elsif ($1 eq "compose") {
    shift @ARGV;
    $composite = $ARGV[0]+0.;
  } elsif ($1 eq "msplit") {
    $mhyp_split = 1;
  } else {
    confess "Unknown switch -$1";
  }
  shift @ARGV;
}
&load_table($ARGV[0]);
&print_table;
if ($composite > 0. || $mhyp_split) {
  &compose_hypotheses;
  printf "--- Composite ${pf_nfmt}:\n", $composite;
  &print_table;
}
&compute_phyp;
&print_phyp;
if ($#ARGV >= 1) {
  &apply_input($ARGV[1]);
  &compute_phyp;
  &print_phyp;
}

The basic example, loading of a multi-line case and splitting of the multi-hypothesis cases:

#tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,

$ perl ex16_01run.pl -msplit tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

The same but also fully composed by hypotheses:

$ perl ex16_01run.pl -msplit -compose 1 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 1.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,2.00000,1.00000/1.00,0.50000/1.00,0.50000/1.00,
hyp2   ,2.00000,0.50000/1.00,1.00000/1.00,0.50000/1.00,
hyp3   ,2.00000,0.50000/1.00,0.50000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

The same with only 0.5 of the weights composed:

$ perl ex16_01run.pl -msplit -compose 0.5 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.50000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.50000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   ,0.50000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,0.50000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp2   ,0.50000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp3   ,0.50000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp3   ,0.50000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
hyp1   ,1.00000,1.00000/1.00,0.50000/1.00,0.50000/1.00,
hyp2   ,1.00000,0.50000/1.00,1.00000/1.00,0.50000/1.00,
hyp3   ,1.00000,0.50000/1.00,0.50000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

It's a combination of the two examples above. The original split events are left half their weight, and the other half of the weight has been composed.

What if we try the composition without splitting?

$ perl ex16_01run.pl -compose 0.5 tab16_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Composite 0.50000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.66667
hyp3    0.66667

The result is exactly the same as the original. This is because the original had only one case for each hypothesis combination, so there is nothing to compose and there is no point in splitting them into two.

Now a demo of the relevance computation:

# tab16_02.txt
!,,evA,evB,evC
hyp1,1,1,0,-
hyp1,1,0,1,-
hyp1,6,-,-,1

$ perl ex16_01run.pl -compose 1 tab16_02.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/1.00,0.00000/0.00,
hyp1   ,1.00000,0.00000/1.00,1.00000/1.00,0.00000/0.00,
hyp1   ,6.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Composite 1.00000:
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,8.00000,0.50000/0.25,0.50000/0.25,1.00000/0.75,
--- Probabilities
hyp1    1.00000

According to the logic from the previous installment, the relevance of evA and evB gets set to 0.25 because 2/8=1/4 of the cases for them have them relevant, and for evC the relevance is set to 0.75 because 6/8=3/4 cases for it are relevant.

Wednesday, November 4, 2015

Bayes 15: relevance combination

I've set out to write the code that would do the combination of the cases into the likeness of the probability tables and I've realized that since I've introduced the concept of relevance, now I'd need to deal with it too. What would it mean to combine two cases that have the relevance values?

First, what does it mean to combine two cases without the relevance values? We've been though this computation many times when building the probability tables but I don't think I've given the formula yet.

The weights obviously add up (it were that many cases uncombined, and now they become the same number of cases in a single combination):

W(I) = W(I1) + W(I2)

The training confidences get added up according to their weights:

TC(E|I) = ( W(I1)*TC(E|I1) + W(I2)*TC(E|I2) ) / ( W(I1) + W(I2) )

To check that it's right let's compute the application of one event and see that the combined weights work out the same either way.

Separately they get computed as follows (without including the relevance yet):

W(I1|E) = W(I1) * ( TC(E|I1)*C(E) + (1 - TC(E|I1))*(1 - C(E)) )
    = W(I1) * ( TC(E|I1)*C(E) + 1 - TC(E|I1) - C(E) + TC(E|I1)*C(E) )
    = W(I1) * ( 1 - TC(E|I1) + (2*TC(E|I1) - 1)*C(E) )
    = W(I1) - W(I1)*TC(E|I1) + (W(I1)*2*TC(E|I1) - W(I1))*C(E) )

W(I2|E) follows the same kind of formula. After combination they compute as:

W(I|E) = W(I) * ( TC(E|I)*C(E) + (1 - TC(E|I))*(1 - C(E)) )
    = ( W(I1) + W(I2) )*( TC(E|I)*C(E) + (1 - TC(E|I))*(1 - C(E)) )

Let's express TC(E|I) as a ratio:

TC(E|I) = TCup(E|I) / TClow(E|I)
TCup(E|I) = W(I1)*TC(E|I1) + W(I2)*TC(E|I2)
TClow(E|I) = W(I1) + W(I2)
    = W(I)

Then substituting it into the formula for weights we get

W(I|E) = W(I) * ( TC(E|I)*C(E) + (1 - TC(E|I))*(1 - C(E)) )
    = W(I) * ( TCup(E|I)*C(E)/TClow(E|I) + (1 - TCup(E|I)/TClow(E|I))*(1 - C(E)) )
    = W(I) * ( TCup(E|I)*C(E)/TClow(E|I) + (TClow(E|I) - TCup(E|I))/TClow(E|I) *(1 - C(E)) )
    = W(I) * ( TClow(E|I) - TCup(E|I) + (TCup(E|I) - TClow(E|I) + TCup(E|I))*C(E) ) / TClow(E|I)
    = W(I) * ( W(I) - TCup(E|I) + (2*TCup(E|I) - W(I))*C(E) ) / W(I)
    = W(I) - TCup(E|I) + (2*TCup(E|I) - W(I))*C(E)
    = W(I1) + W(I2) - W(I1)*TC(E|I1) + W(I2)*TC(E|I2) + 
        (2*W(I1)*TC(E|I1) + 2*W(I2)*TC(E|I2) - W(I1) - W(I2))*C(E)
    = ( W(I1) - W(I1)*TC(E|I1) + (2*W(I1)*TC(E|I1) - W(I1))*C(E) ) +
        ( W(I2) - W(I2)*TC(E|I2) + (2*W(I2)*TC(E|I2) - W(I2))*C(E) )
    = W(I1|E) + W(I2|E)

It all computes. The result is the same either way. The difference is of course that as we apply multiple events, the effect of the previous events on the following ones will manifest differently. But that is to be expected.

That taken care of, let's look at the relevance. If we have a case with a fractional R(E|I), we can simulate it by splitting the case into two cases, the fully relevant part and the fully irrelevant part. The case

W(I) * ...,TC/R,...

gets split into two cases, relevant one with the weight W(Ir) and irrelevant one with the weight W(Ii):

W(Ii)=W(I)*(1-R) * ...,0/0,...
W(Ir)=W(I)*R * ...,TC,...

To show that it's equivalent, the original formula for the posterior weight is:

W(I|E) = W(I)*(1 - R(E|I)) + W(I)*R(E|I)*( TC(E|I)*C(E) + (1 - TC(E|I))*(1 - C(E)) )

If we add up the weights of the two split cases, we get:

W(Ii|E) = W(I)*(1-R(E|I))
W(Ir|E) = W(I)*R(E|I)*( TC(E|I)*C(E) + (1 - TC(E|I))*(1 - C(E)) )
W(Ii|E) + W(Ir|E) = W(I)*(1 - R(E|I)) + W(I)*R(E|I)*( TC(E|I)*C(E) + (1 - TC(E|I))*(1 - C(E)) )

Both give the exact same result, the split is correct.

This split can also be undone. The formulas for the weights can be expanded as:

W(Ii) = W(I)*(1 - R)  = W(I) - W(I)*R
W(Ir) = W(I)*R

And from there it follows that:

W(Ii) + W(Ir) = W(I) - W(I)*R + W(I)*R = W(I)

Not surprisingly, the original weight can be found as the sum of two split weights. And from there R can be found as:

R = W(Ir) / W(I)

If we want to combine multiple cases with fractional relevances, we can split each of them into the relevant and irrelevant parts, combine the relevant parts together, combine the irrelevant parts together (this is easy: just add up their weights), and then undo the split.

Incidentally, the whole relevance thing can also be expressed purely by the manipulation with the case weights and training confidences.

The point of the relevance concept is to leave the weight of a case unchanged if the event is irrelevant to it. The closest effect that can be achieved with the training confidences is the TC=0.5, in it the weight gets halved, no matter what is C(E):

W(I|E) = W(I)*( TC(E)*C(E) + (1 - TC(E))*(1 - C(E)) )
    = W(I)*( 0.5*C(E) + (1 - 0.5)*(1 - C(E)) )
    = W(I)*( 0.5*(C(E) + 1 - C(E)) )
    = W(I)*0.5

Which means that we can simulate an irrelevant event purely through the training confidence, by setting TC(E|I)=0.5 and doubling the weight of the event. For example, if we had a case with weight 15 and an irrelevant event

15 * ...,0/R=0,...

We can replace it with the case of weight 30 and TC=0.5.

30 * ...,0.5,...

The same can be also seen in another way: create the second case, exactly the same as the first one and with the same weight, except that put TC(E|I)=0 in one case and TC(E|I)=1 in another case. No matter what C(E), one case will get through and another one will be thrown away (or the matching fractions of both cases will be thrown away, still leaving one original weight). For example, taking the same case of weight 15 and an irrelevant event, we can convert it to two cases:

15 * ...,0,...
15 * ...,1,...

Either way, doubling the weight of one case, or splitting into two cases, the total weight gets doubled, and the average TC(E|I) gets set equal to 0.5.

Monday, November 2, 2015

Bayes 14: code for weight-based computation

As promised, here is the code that performs the computation directly from the table of the training cases, using the weights:

# ex14_01run.pl
#!/usr/bin/perl
#
# Running of a Bayes expert system on a table of training cases.

use strict;
use Carp;

our @evname; # the event names, in the table order
our %evhash; # hash of event names to indexes
our %hyphash; # the hypothesis names translation to the arrays of
  # refences to all cases involving this hypothesis
our @case; # the table of training cases
  # each case is represented as a hash with elements:
  # "hyp" - array of hypotheis names that were diagnosed in this case
  # "wt" - weight of the case
  # "origwt" - original weight of the case as loaded form the table
  # "tc" - array of training confidence of events TC(E|I)
  # "r" - array of relevance of events R(E|I)
our %phyp; # will be used to store the computed probabilities

# options
our $verbose = 0; # verbose printing during the computation
our $cap = 0; # cap on the confidence, factor adding fuzziness to compensate
  # for overfitting in the training data;
  # limits the confidence to the range [$cap..1-$cap]
our $boundary = 0.9; # boundary for accepting a hypothesis as a probable outcome

# print formatting values
our $pf_w = 7; # width of every field
our $pf_tfmt = "%-${pf_w}.${pf_w}s"; # format of one text field
our $pf_nw = $pf_w-2; # width after the dot in the numeric fields field
our $pf_nfmt = "%-${pf_w}.${pf_nw}f"; # format of one numeric field (does the better rounding)
our $pf_rw = 4; # width of the field R(E|I)
our $pf_rtfmt = "%-${pf_rw}.${pf_rw}s"; # format of text field of the same width as R(E|I)
our $pf_rnw = $pf_rw-2; # width after the dot for R(E|I)
our $pf_rnfmt = "%-${pf_rw}.${pf_rnw}f"; # format of the field for R(E|I)

sub load_table($) # (filename)
{
  my $filename = shift;

  @evname = ();
  %evhash = ();
  %hyphash = ();
  @case = ();

  my $nev = undef; # number of events minus 1

  confess "Failed to open '$filename': $!\n"
    unless open(INPUT, "<", $filename);
  while(<INPUT>) {
    chomp;
    s/,\s*$//; # remove the trailing comma if any
    if (/^\#/ || /^\s*$/) {
      # a comment line
    } elsif (/^\!/) {
      # row with event names
      @evname = split(/,/); # CSV format, the first 2 elements gets skipped
      shift @evname;
      shift @evname;
    } else {
      my @s = split(/,/); # CSV format for a training case
      # Each line contains:
      # - list of hypotheses, separated by "+"
      # - weight (in this position it's compatible with the format of probability tables)
      # - list of event data that might be either of:
      #   - one number - the event's training confidence TC(E|I), implying R(E|I)=1
      #   - a dash "-" - the event is irrelevant, meaning R(E|I)=0
      #   - two numbers separated by a "/": TC(E|I)/R(E|I)

      my $c = { };
      
      my @hyps = split(/\+/, shift @s);
      my %hypuniq;
      for (my $i = 0; $i <= $#hyps; $i++) {
        $hyps[$i] =~ s/^\s+//;
        $hyps[$i] =~ s/\s+$//;
        $hypuniq{$hyps[$i]} = 1;
      }
      foreach my $h (keys %hypuniq) {
        push @{$hyphash{$h}}, $c;
      }

      $c->{hyp} = \@hyps;
      $c->{origwt} = $c->{wt} = shift(@s) + 0.;

      if (defined $nev) {
        if ($nev != $#s) {
          close(INPUT);
          my $msg = sprintf("Wrong number of events, expected %d, got %d in line: %s\n",
            $nev+1, $#s+1, $_);
          confess $msg;
        }
      } else {
        $nev = $#s;
      }

      # the rest of fields are the events in this case
      foreach my $e (@s) {
        if ($e =~ /^\s*-\s*$/) {
          push @{$c->{r}}, 0.;
          push @{$c->{tc}}, 0.;
        } else {
          my @edata = split(/\//, $e);
          push @{$c->{tc}}, ($edata[0] + 0.);
          if ($#edata <= 0) {
            push @{$c->{r}}, 1.;
          } else {
            push @{$c->{r}}, ($edata[1] + 0.);
          }
        }
      }

      push @case, $c;
    }
  }
  close(INPUT);

  if ($#evname >= 0) {
    if ($#evname != $nev) {
      my $msg = sprintf("Wrong number of event names, %d events in the table, %d names\n",
        $nev+1, $#evname+1);
      confess $msg;
    }
  } else {
    for (my $i = 0; $i <= $nev; $i++) {
      push @evname, ($i+1)."";
    }
  }

  for (my $i = 0; $i <= $#evname; $i++) {
    $evname[$i] =~ s/^\s+//;
    $evname[$i] =~ s/\s+$//;
    $evhash{$evname[$i]} = $i;
  }
}

sub print_table()
{
  # the title row
  printf($pf_tfmt . ",", "!");
  printf($pf_tfmt . ",", "");
  foreach my $e (@evname) {
    printf($pf_tfmt . " " . $pf_rtfmt . ",", $e, "");
  }
  print("\n");
  # the cases
  for (my $i = 0; $i <= $#case; $i++) {
    my $c = $case[$i];
    # if more than one hypothesis, print each of them on a separate line
    for (my $j = 0; $j < $#{$c->{hyp}}; $j++) {
      printf($pf_tfmt . "+\n", $c->{hyp}[$j]);
    }

    printf($pf_tfmt . ",", $c->{hyp}[ $#{$c->{hyp}} ]);
    printf($pf_nfmt . ",", $c->{wt});
    for (my $j = 0; $j <= $#evname; $j++) {
      printf($pf_nfmt . "/" . $pf_rnfmt . ",", $c->{tc}[$j], $c->{r}[$j]);
    }
    print("\n");
  }
}

# Compute the hypothesis probabilities from weights
sub compute_phyp()
{
  %phyp = ();
  my $h;

  # start by getting the weights
  my $sum = 0.;
  for (my $i = 0; $i <= $#case; $i++) {
    my $w = $case[$i]->{wt};
    $sum += $w;

    foreach $h (@{$case[$i]->{hyp}}) {
      $phyp{$h} += $w;
    }
  }

  if ($sum != 0.) { # if 0 then all the weights are 0, leave them alone
    for $h (keys %phyp) {
      $phyp{$h} /= $sum;
    }
  }
}


# Print the probabilities of the kypotheses
sub print_phyp()
{
  printf("--- Probabilities\n");
  for my $h (sort keys %phyp) {
    printf($pf_tfmt . " " . $pf_nfmt . "\n", $h, $phyp{$h});
  }
}

# Apply one event
# evi - event index in the array
# conf - event confidence [0..1]
sub apply_event($$) # (evi, conf)
{
  my ($evi, $conf) = @_;

  # update the weights
  for (my $i = 0; $i <= $#case; $i++) {
    my $w = $case[$i]->{wt};
    my $r = $case[$i]->{r}[$evi];
    my $tc = $case[$i]->{tc}[$evi];

    $case[$i]->{wt} = $w * (1. - $r)
      + $w*$r*( $tc*$conf + (1. - $tc)*(1. - $conf) );
  }
}


# Apply an input file
sub apply_input($) # (filename)
{
  my $filename = shift;

  confess "Failed to open the input '$filename': $!\n"
    unless open(INPUT, "<", $filename);
  while(<INPUT>) {
    chomp;
    next if (/^\#/ || /^\s*$/); # a comment

    my @s = split(/,/);
    $s[0] =~ s/^\s+//;
    $s[0] =~ s/\s+$//;

    confess ("Unknown event name '" . $s[0] . "' in the input\n")
      unless exists $evhash{$s[0]};
    my $evi = $evhash{$s[0]};

    my $conf = $s[1] + 0.;
    if ($conf < $cap) {
      $conf = $cap;
    } elsif ($conf > 1.-$cap) {
      $conf = 1. - $cap;
    }
    printf("--- Applying event %s, c=%f\n", $s[0], $conf);
    &apply_event($evi, $conf);
    &print_table;
  }
  close(INPUT);
}

# main()
while ($ARGV[0] =~ /^-(.*)/) {
  if ($1 eq "v") {
    $verbose = 1;
  } elsif ($1 eq "c") {
    shift @ARGV;
    $cap = $ARGV[0]+0.;
  } elsif ($1 eq "b") {
    shift @ARGV;
    $boundary = $ARGV[0]+0.;
  } else {
    confess "Unknown switch -$1";
  }
  shift @ARGV;
}
&load_table($ARGV[0]);
&print_table;
&compute_phyp;
&print_phyp;
if ($#ARGV >= 1) {
  &apply_input($ARGV[1]);
  &compute_phyp;
  &print_phyp;
}

As you can see, the function apply_event() became much simpler, and there is no chance of division by 0 any more. The function apply_input() stayed exactly the same, just the functions called by it have changed.

The inputs are backwards-compatible with the previously shown examples, although what was formerly the field for hypothesis probabilities now becomes the field for case weights. The new code just works with weights and converts them to hypothesis probabilities at the end for display.

There are a couple of new features supported in the inputs. First, the field for hypothesis names is now allowed to contain multiple hypotheses, separated by the plus sign. That allows to enter the cases with multi-hypothesis results.

Second, what used to be the conditional probabilities of the events now can contain both the training confidence values and the relevance values of the events. These fields may have one of three formats:

  • A single number: the confidence value TC(E|I), similar to the previous P(E|H). The relevance R(E|I) is assumed to be 1.
  • The character "-": means that this event is not relevant for this case. Which means that the relevance value R(E|I) is 0, and TC(E|I) doesn't matter.
  • Two numbers separated by a "/": the first one is TC(E|I) and the second one is R(E|I).

Let's look at some examples.

The very first example I shown was this:

# tab06_01_01.txt
!,,evA,evB,evC
hyp1,0.66667,1,0.66667,0.66667
hyp2,0.33333,0,0,1

The very first input I've shown was this:

# in06_01_01_01.txt
evA,1
evB,0
evC,0

Let's compare the old and the new results. Old:

$ perl ex06_01run.pl tab06_01_01.txt in06_01_01_01.txt
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,0.66667,1.00000,0.66667,0.66667,
hyp2   ,0.33333,0.00000,0.00000,1.00000,
--- Applying event evA, c=1.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,Since this case has no match 
--- Applying event evB, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,
--- Applying event evC, c=0.000000
!      ,       ,evA    ,evB    ,evC    ,
hyp1   ,1.00000,1.00000,0.66667,0.66667,
hyp2   ,0.00000,0.00000,0.00000,1.00000,

New:

$ perl ex14_01run.pl tab06_01_01.txt in06_01_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.66667,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.33333,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evA, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.66667,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.000000Since this case has no match 
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.22222,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.07407,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    1.00000
hyp2    0.00000

The result is the same, although the intermediate data is printed as weights, not probabilities, and the printed table contains the relevance information (all the relevance values are at 1 here).

This table of probabilities was produced from 6 cases for hyp1 and 3 cases for hyp2:

         evA evB evC
4 * hyp1 1   1   1
2 * hyp1 1   0   0
3 * hyp2 0   0   1

Before entering the raw cases, let's look at the same combined probability table with weights entered directly instead of probabilities:

# tab14_01a.txt
!,,evA,evB,evC
hyp1,6,1,0.66667,0.66667
hyp2,3,0,0,1

$ perl ex14_01run.pl tab14_01a.txt in06_01_01_01.txt 
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,6.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,3.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evA, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,6.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.99998,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.66665,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    1.00000
hyp2    0.00000

The end result is the same, and the intermediate weights are simply scaled up proportionally. Note that even though the case matches one of the training cases one-to-one, the weight of hyp1 ends up at only 0.66665. That's because the table entry is produced by folding two kinds of cases, and this case matched only one-third of the cases. Since originally this particular case had the weight of 2, the result is 2*1/3 = 0.67 (approximately).

If we enter the raw training cases into the table, the result changes:

# tab14_01b.txt
!,,evA,evB,evC
hyp1,4,1,1,1
hyp1,2,1,0,0
hyp2,3,0,0,1

$ perl ex14_01run.pl tab14_01b.txt in06_01_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,4.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,2.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,3.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evA, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,4.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,2.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,2.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,2.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    1.00000
hyp2    0.00000

Now the resulting probability is the same but the weight 2 matches what was in the training table.

Let's see how it handles an impossible input:

# in08_01_01.txt
evC,0
evA,0
evB,0

$ perl ex14_01run.pl tab14_01a.txt in08_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,6.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,3.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evC, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.99998,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evA, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.00000
hyp2    0.00000

Since this case has no match in the training table, the weights come out as 0. The computation of the probabilities would have required to divide the weights by the sum of weights, but since the sum is 0, the code just leaves the probabilities at 0 to avoid the division by 0.

It's interesting to compare the results with capping for two different kinds of the tables:

$ perl ex14_01run.pl -c 0.01 tab14_01a.txt in08_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,6.00000,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,3.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evC, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,2.01998,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.03000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.02020,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.02970,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00680,1.00000/1.00,0.66667/1.00,0.66667/1.00,
hyp2   ,0.02940,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.18784
hyp2    0.81216

$ perl ex14_01run.pl -c 0.01 tab14_01b.txt in08_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,4.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,2.00000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,3.00000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.33333
--- Applying event evC, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.04000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,1.98000,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.03000,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00040,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,0.01980,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.02970,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   ,0.01960,1.00000/1.00,0.00000/1.00,0.00000/1.00,
hyp2   ,0.02940,0.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.40005
hyp2    0.59995

They came out different. Why? In the second run, the first training case for hyp1 mismatches the values of all 3 input events. Its weight gets multiplied by 0.01 thrice and becomes very close to 0. The second training case for hyp1 and the training case for hyp2 mismatch only one input event, so their weights get multiplied by 0.01 only once, and their relative weights decide the resulting probabilities. They were 2:3 to start with and that stayed as 2:3 (the little deviation is contributed by the first training case for hyp1).

On the other hand, in the first run all the cases for hyp1 were lumped into one line, and the average content of that line was seriously tilted towards the case that mismatched all 3 events. Thus hyp1 ended up with a much lower weight but hyp2 ended up with exactly the same weight as in the second run, so it outweighed the hyp1 much more seriously.

To look at the effects of the relevance values and of the training cases with multi-hypothesis results let's revisit the example from the 9th and 10th parts. First, a training table with multiple hypotheses, as in the part 9:

# tab14_02a.txt
!,,evA,evB,evC
hyp1+hyp2,1,1,1,0
hyp2+hyp3,1,0,1,1
hyp1+hyp3,1,1,0,1

With the input data:

# in09_01_01.txt
evA,0
evB,0
evC,1

$ perl ex14_01run.pl -c 0.01 tab14_02a.txt in09_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,1.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,1.00000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,1.00000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.66667
hyp2    0.66667
hyp3    0.66667
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,0.01000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,0.99000,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,0.01000,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,0.00010,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,0.00990,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,0.00990,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Applying event evC, c=0.990000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   +
hyp2   ,0.00000,1.00000/1.00,1.00000/1.00,0.00000/1.00,
hyp2   +
hyp3   ,0.00980,0.00000/1.00,1.00000/1.00,1.00000/1.00,
hyp1   +
hyp3   ,0.00980,1.00000/1.00,0.00000/1.00,1.00000/1.00,
--- Probabilities
hyp1    0.50003
hyp2    0.50003
hyp3    0.99995

The capping was needed since the data doesn't match any of the training cases, but in the end it points pretty conclusively to the hypothesis hyp3. The multi-hypothesis cases are printed out in the intermediate results with one hypothesis per line (this format cannot be read back as input).

Now a variation of the example from the part 10, where I've manually decided, which events should be relevant to which hypotheses, using the same in input as above. Only this table is for all 3 hypotheses at once, not one hypothesis at a time:

# tab14_02b.txt
!,,evA,evB,evC
hyp1,1,1,-,-
hyp2,1,-,1,-
hyp1,1,-,-,1

$ perl ex14_01run.pl tab14_02b.txt in09_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evB, c=0.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,0.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evC, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,0.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.00000
hyp2    0.00000
hyp3    1.00000

You can see in the printout that some of the relevances have been set to 1 and some to 0. This time it picked hyp3 with full certainty, without even any need for capping.

Now the same table but the input with all events true:

# in10_01_02.txt
evA,1
evB,1
evC,1

$ perl ex14_01run.pl tab14_02b.txt in10_01_02.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evA, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evB, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evC, c=1.000000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

As far as the weights are concerned, all the hypotheses got the full match. But the probabilities have naturally gotten split three-way. Let's contrast it with the result of another input:

# in08_01_01.txt
evC,0
evA,0
evB,0

$ perl ex14_01run.pl -c 0.01 tab14_02b.txt in08_01_01.txt
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,1.00000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333
--- Applying event evC, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,1.00000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,0.01000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evA, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.01000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,1.00000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,0.01000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Applying event evB, c=0.010000
!      ,       ,evA         ,evB         ,evC         ,
hyp1   ,0.01000,1.00000/1.00,0.00000/0.00,0.00000/0.00,
hyp2   ,0.01000,0.00000/0.00,1.00000/1.00,0.00000/0.00,
hyp3   ,0.01000,0.00000/0.00,0.00000/0.00,1.00000/1.00,
--- Probabilities
hyp1    0.33333
hyp2    0.33333
hyp3    0.33333

In this run the probabilities also split three-way. But how do we know that for the first input (1,1,1) we should pick all 3 hypotheses and for the second input (0,0,0) we should pick none? Both are split evenly, so we can't pick based on the rule of even splitting discussed in the part 9, both would fit it. One indication is that the second input required capping, or it would have produced the probabilities of 0. For another indication we can compare the final weights of the cases with their initial weights. In the first run they stayed the same. In the second run they got multiplied by 0.01 by the capping. Thus we can say that neither really matched the input. Since only one multiplication by 0.01 was done, each case had only one mismatch. Is one mismatch that bad? Since each hypothesis has only one relevant event, we can say that yes, the mismatch in this one event is real bad, and all these hypotheses should be considered false.

With the larger sets of training data, it would be possible to make decisions based on how many relevant events are available for each case, and what fraction of them is allowed to be mismatched before we consider the case inapplicable. And if a hypothesis has no applicable cases, it would be considered false.

Saturday, October 31, 2015

Bayes 13: the relevance

Before showing the code that computes by the table of training cases, I want to talk through one more aspect.

What causes the "impossible" cases that haven't been seen in the training data? Some of them are just rare occurrences that weren't observed during training. Some of them are the results of errors made when evaluating and entering the event data. But in my experience most of them are caused by the situations when multiple hypotheses are true and their evidences interact with each other.

It's not such a big deal for the hypotheses that occur often (i.e. they have the high prior P(H)) and also occur together relatively often. Then there will be the training cases reflecting these interactions, and they can be recognized fairly well. However if you have 200 hypotheses, that means the possibility of 40000 combinations of two of them occurring together. If you have less than 40000 training cases, you're guaranteed not to see them all. And if you remember that the trueness of two hypotheses is usually rather rare, and three or more hypotheses might happen to be true, you won't have a good chance to see all the reasonable combinations until you have millions of training cases.

One way to work around this problem is to try isolating the hypotheses from each other. I've already mentioned it in the part 10: if we can figure out, what events are relevant for a hypothesis, we can ignore the other events that are relevant for the other hypotheses, thus reducing the interaction between these hypotheses.

How can we figure out, which events are relevant and which aren't? One way is to use the human expertise and the knowledge about the structure of the object. Such as, if a brake light on a car is out, it has no connection with the blown engine head gasket.

Some amount of this kind of knowledge engineering is inevitable because the Bayesian logic is notoriously bad at structuring the data. Someone's got to prepare the meaningful events and combine the knowledge into this form before feeding it to this model. For example, the question "is the compression in the rear-most cylinders low?" can be used as an indication that the engine has been mildly overheated. But it combines multiple pieces of knowledge:

  • How many cylinders does this car have, and which are the rear-most? Depending on an inline or V (or flat) engine configuration, there might be one or two rear-most cylinders.
  • What is the measured compression in these cylinders?
  • What is the normal range of compression for this car model?
  • Is the coolant flow going from the front of the engine towards the rear? That's the normal situation but in some cars the direction might be opposite, and then we'd have to look at the front-most cylinders instead.

All this would be pretty hard to enter as events into a Bayesian model. But if the data gets combined and pre-processed, it becomes one convenient event. This pre-processing would use the knowledge about the car model to find the cylinders to look at, and then would combine the compression measurement with the lower end of the factory-specified range to give a confidence value that the compression is low.

The problem is obviously that someone has to program this pre-processing. And the same goes for the expert estimation of relevance of symptoms for a hypothesis: someone has to do this work.

The second problem with the expert estimations is that the experts' knowledge might be limited. I believe I've read about the history of the medicine where some easy-to-test symptoms of some illnesses weren't discovered until a much later time, and in the meantime everyone went by the roundabout and unreliable way. But I don't remember the details of this story: what illnesses, what symptoms? Well, we'd like such symptoms to be discovered automatically from the data. But if we force the expert knowledge onto the model, it would never discover such dependencies because it would be forced to follow the human opinions.

So how do we discover this relevance automatically? I don't have a good answer. I have some ideas but they've occurred to me fairly late, and I've had only a very limited time to experiment with some of them, and I haven't done any experimentation with the ones that occurred to me later yet. I'll describe them anyway but not now yet.

Right now let's discuss a different, simpler aspect: suppose we somehow know which events are relevant to which hypotheses, how do we use this knowledge in the processing?

In the part 10 I've shown a way to do it with the independent tables for each hypothesis: to treat an event as irrelevant to a hypothesis, either drop the event from the hypothesis table altogether or set the probabilities of both P(E|H) and P(E|~H) to 0.5.

The computation on training table allows to do the same for the combined table. Just add an additional relevance value to every cell of the table, to every event of every training case. If this value is 1, process the event on this case as before. If this value is 0, leave the weight of this case unchanged when applying this event.

There is also an obvious extension: rather than have the relevance as discrete 0 or 1, make it a number in the range [0..1]. The processing would follow the same principle as for the event confidence: logically split the case into two cases with their relative weights determined by the relevance value R(E|I), one case with the event being fully relevant, another with the event being fully irrelevant, process them separately, and then put back together:

W(I|E) = W(I)*(1 - R(E|I)) + W(I)*R(E|I)*( TC(E|I)*C(E) + (1 - TC(E|I))*(1 - C(E)) )

This concept can also be applied to the classic Bayesian probability tables. It's straightforward: treat the prior probabilities as weights, do the same computation, then divide each resulting weight by the sum of all these weights to get the posterior probability.