This assignment deals with learning in Bayesian networks. You will develop learning algorithms for both the complete-data case and the incomplete-data case, and apply them to the restaurant and oil-drilling domains. The complete-data algorithm is essentially counting in the training set. The incomplete-data algorithm is a modification of MCMC that counts over the complete states sampled by the MCMC process. You should have an MCMC algorithm from A5; if not, we will be publishing the solution on Monday May 3rd (after the expiry of the 5 late days for A5). In the meantime, if you want to start earlier, you can use complete states generated by the rejection sampling algorithm instead.
Be sure to use the latest version from ~cs188.
Question 1 (5 pts). To familiarize yourself with the learning code, generate an incremental learning curve for decision tree learning applied to the 100 restaurant examples. Your curve should reflect 100 trials with data points every 5 examples. Use plot-alist to write the output to a file called restaurant-dtl-curve.data. There is a gnuplot command file called restaurant-dtl-curve.plot. Copy this to your directory and run it using the Unix command
/usr/sww/bin/gnuplot restaurant-dtl-curve.plot
The results should appear on the screen and will then be written to restaurant-dtl-curve.eps, which should be included in your submission after you have checked that it looks right.
Question 2 (10 pts). Complete-data learning is described on pages 716-718 of AIMA2e. The key is counting: the maximum-likelihood parameter estimates are exactly the relative frequencies of the appropriate events in the data. For example, the conditional probability P(Y=true | X=true) is estimated by the fraction of cases with X=true that also have Y=true. We will need to keep counts corresponding to every CPT entry in the network. Recall that a CPT for a tabulated node is an array, indexed by parent values, each element of which is a discrete distribution, i.e., a vector of probabilities. Hence, corresponding to each CPT, we will need an array, each element of which is a vector of counts. Write a function
(make-bn-counts bn)
that creates and returns a set of such arrays, one for each node in the BN. Write also a function
(increment-bn-counts event bn counts)
that updates and returns the set of count arrays for the new event.
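As a rough sketch of these data structures (not the required implementation), one might write something like the following; the accessors bn-nodes, node-parents, and node-arity and the helpers parent-values-index and node-value-index are stand-ins for whatever bayes-nets.lisp actually provides:

(defun make-bn-counts (bn)
  "Return one array of count vectors per node, mirroring that node's CPT shape."
  (mapcar
   (lambda (node)
     (let* ((rows (reduce #'* (mapcar #'node-arity (node-parents node))
                          :initial-value 1))
            (counts (make-array rows)))
       (dotimes (row rows counts)
         (setf (aref counts row)
               (make-array (node-arity node) :initial-element 0)))))
   (bn-nodes bn)))

(defun increment-bn-counts (event bn counts)
  "Bump the count cell selected by EVENT for every node, then return COUNTS."
  (loop for node in (bn-nodes bn)
        for node-counts in counts
        do (incf (aref (aref node-counts (parent-values-index node event)) ; CPT row
                       (node-value-index node event))))                    ; value within the row
  counts)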
Question 3 (5 pts). Probably your first answer to Question 2 initialized the counts to 0. Explain, by means of a simple example, why this can cause problems when an ML-trained BN is used to answer queries concerning a new example. [Hint: what happens to an inference algorithm when given evidence that has probability zero according to the network?]
Question 4 (10 pts). Now write a function
(complete-data-bn-learning examples bn)
which is given a set of examples and a BN, and returns the BN with CPT entries set to the ML estimates given the examples. [Hint: the make-event function in uncertainty/domains/bayes-nets.lisp will come in handy. Also, to debug your code, use a very simple network such as uncertainty/domains/ab.bn and a small set of examples.]
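One possible shape for the top level, reusing the sketch above; set-cpts-from-counts is an assumed helper that normalizes each count vector into the corresponding CPT row, and the argument order of make-event is a guess:

(defun complete-data-bn-learning (examples bn)
  "Set every CPT entry of BN to its maximum-likelihood estimate from EXAMPLES."
  (let ((counts (make-bn-counts bn)))
    (dolist (example examples)
      (increment-bn-counts (make-event example bn) bn counts)) ; complete example -> event
    (set-cpts-from-counts bn counts) ; assumed helper: normalize each count vector into a CPT row
    bn))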
Question 5 (5 pts). Write a function
(bn->hypothesis bn goal)
that is analogous to dt->hypothesis for decision trees. That is, your function should return a function that, given an unlabelled example, returns a value for the goal attribute using the BN. The value returned should be the most probable given the evidence in the example. (Use the-biggest-random-tie from utilities.lisp.) Also write the function
(complete-data-bayes-net-hypothesis examples goal bn)
that is analogous to decision-tree-hypothesis.
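A sketch of one way these could fit together; bn-query is a placeholder for whatever inference routine you use (for instance your A5 code), assumed here to return an alist of (value . probability) pairs for the goal variable:

(defun bn->hypothesis (bn goal)
  "Return a classifier mapping an unlabelled example to the most probable GOAL value."
  (lambda (example)
    ;; bn-query is a stand-in: it is assumed to return ((value . probability) ...)
    ;; for GOAL, given the example's attribute values as evidence.
    (let ((distribution (bn-query goal example bn)))
      (car (the-biggest-random-tie #'cdr distribution))))) ; pick the value with the biggest probability

(defun complete-data-bayes-net-hypothesis (examples goal bn)
  "Learn ML CPTs from EXAMPLES, then return the induced classifier."
  (bn->hypothesis (complete-data-bn-learning examples bn) goal))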
Question 6 (10 pts). In the file learning/domains/restaurant-naive-bayes.bn is a naive Bayes model for the restaurant data, i.e., the root is WillWait and the leaves are the 10 other attributes. (The model has uniform CPTs; since your learning algorithm will modify them, you can always use set-cpts-to-uniform from bayes-nets.lisp to reset them.) Use an appropriate call to incremental-learning-curve and call plot-alist to write the output to a file called restaurant-naive-bayes-curve.data. Call gnuplot on restaurant-naive-bayes-curve.plot to show the results. [Hint: this is a little bit tricky, because the learning curve function expects an induction algorithm that takes examples, attributes, and a goal as input. Lambda to the rescue!]
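One way to follow the hint is to wrap the Bayes net learner in an adapter with the expected signature; *restaurant-naive-bayes-bn* is an assumed name for the network loaded from restaurant-naive-bayes.bn:

(defun naive-bayes-induction (examples attributes goal)
  "Adapter with the (examples attributes goal) signature the learning-curve code expects."
  (declare (ignore attributes))
  (set-cpts-to-uniform *restaurant-naive-bayes-bn*) ; reset the shared network before each trial
  (complete-data-bayes-net-hypothesis examples goal *restaurant-naive-bayes-bn*))

Passing #'naive-bayes-induction (or the equivalent lambda) to incremental-learning-curve then works just as in Question 1.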
With MCMC, incomplete-data learning becomes very simple indeed: each state visited by the MCMC algorithm, given an incomplete example as evidence, can be viewed as a possible completion of the example. Thus the E-step calls a suitably modified MCMC algorithm once for each example and accumulates counts (just as in complete-data learning) for every complete state visited by MCMC. From the accumulated counts, the M-step recalculates all the CPTs.
Question 7 (10 pts). You should have a function (mcmc X e bn) from A5. Modify this to define a function
(counting-mcmc e bn counts)
which calls increment-bn-counts for every state it visits and returns the final counts at the end.
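A sketch of the overall shape, under the assumption that your A5 sampler exposes something like an initial-state constructor and a per-variable Gibbs step; initial-mcmc-state, nonevidence-variables, and gibbs-resample are all placeholders for your own A5 functions:

(defparameter *mcmc-steps-per-example* 200) ; tuning knob; the value here is arbitrary

(defun counting-mcmc (e bn counts)
  "Run the A5 Gibbs sampler with evidence E, tallying every visited complete state."
  (let ((state (initial-mcmc-state e bn)))        ; evidence variables fixed, others set randomly
    (dotimes (step *mcmc-steps-per-example* counts)
      (dolist (var (nonevidence-variables e bn))
        (setf state (gibbs-resample var state bn))) ; resample VAR given its Markov blanket
      (increment-bn-counts state bn counts))))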
Question 8 (20 pts). Write a function
(incomplete-data-bn-learning examples bn)
that is analogous to complete-data-bn-learning but uses the MCMC-based EM scheme described above. That is, it should have an outer loop specifying some number of EM iterations. Each iteration reinitializes the counts, calls counting-mcmc on every example to accumulate the counts, and then recalculates the CPTs from the counts. Also write incomplete-data-bayes-net-hypothesis, analogous to complete-data-bayes-net-hypothesis.
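A sketch of the EM loop under the same assumptions as the earlier sketches (set-cpts-from-counts as in Question 4, *em-iterations* as an arbitrary tuning knob):

(defparameter *em-iterations* 10) ; tuning knob; adjust as Question 9 suggests

(defun incomplete-data-bn-learning (examples bn)
  "MCMC-based EM: alternate expected-count collection (E-step) and CPT re-estimation (M-step)."
  (dotimes (iteration *em-iterations* bn)
    (let ((counts (make-bn-counts bn)))   ; reinitialize the counts each iteration
      (dolist (example examples)          ; E-step: counts from the MCMC completions of each example
        ;; how an incomplete example is turned into an evidence event depends on your A5 interface
        (counting-mcmc (make-event example bn) bn counts))
      (set-cpts-from-counts bn counts)))) ; M-step: same normalization as in Question 4

(defun incomplete-data-bayes-net-hypothesis (examples goal bn)
  "Learn CPTs from incomplete EXAMPLES, then return the induced classifier."
  (bn->hypothesis (incomplete-data-bn-learning examples bn) goal))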
Question 9 (10 pts). We can generate incomplete data from a Bayes net by generating complete samples (using prior-sample) and deleting some of the attribute values. For example, the examples in *oil-100-incomplete-examples* in learning/domains/oil.lisp were generated from uncertainty/domains/oil.bn with some of the variables hidden. Generate a learning curve for your algorithm when applied to this data with the goal of predicting Bankrupt (see *oil-goal*). You may need to play with the number of EM iterations and the number of MCMC steps per example; you should be able to do 10 trials with data points every 10 examples.
Plot the results using oil-incomplete-curve.plot. How do the results compare to the prediction performance of the true network on the same 100 examples? Explain any apparently surprising aspects of your learning curve.