
Occam's razor is generally interpreted as counselling the use
of "simpler" models rather than complex ones, fewer
parameters rather than more, and "smoother" generalizers
rather than those that are less smooth. The mathematical descendents
of this philosophical principle of parsimony appear in minimum-description-length,
Akaike, Kolmogorov complexity and related principles, having numerous
manifestations in learning, for instance regularization, pruning,
and overfitting avoidance.
For a given quality of fit to the training data, in the absence of other information should we favor "simpler" models, and if so, why? How do we measure simplicity, and which representation should we use when doing so? What assumptions are made -- explicitly or implicitly -- by these methods and when are such assumptions valid? What are the minimum assumptions or conditions -- for instance that by increasing the amount of training data we will improve a classifier's performance -- that yield Occam's razor? Support Vector Machines and some neural networks contain a very large number of free parameters, more than might be permitted by the size of the training data and in seeming contradiction to Occam's razor; nevertheless, such classifiers can work exceedingly well. Why? Bayesian techniques such as ML-II reduce a classifier's complexity in a data-dependent way. Does this comport with Occam's razor? Can we characterize problems for which Occam's razor should or should not apply? Even if we abandon the search for the "true" model that generated the training data, can Occam's razor improve our chances of finding a "useful" model?
It has been said that Occam's razor is either profound and true, or vacuous and false -- it just isn't clear which. Rather than address specific implementation techniques or applications, the goal of this workshop is to shed light on, and if possible resolve, the theoretical questions associated with Occam's razor, some of the deepest in the intellectual foundations of machine learning and pattern recognition.
Duration: one day
Date: Friday, December 7, 2001
Location: Delta Whistler Resort, 4050 Whistler Way, Whistler, British Columbia, Canada VON 184; phone: 604-932-1982
Confirmed speakers:
7:30am
7:35am
This presentation will begin with a short history of William of Ockham (1285?-1349?) (Latinized to Occam) and his philosophy of parsimony. We then give an overview of the central issues addressed by Occam's razor, most of which are to be addressed in the workshop. Does the razor state something "objective" about the external world, or instead about our theories of it? Are there methodological biases that "justify" the razor? For instance Karl Popper argued that science progresses by falsifying bad hypotheses (rather than confirming true ones). "Simpler" models are more easily falsified by experiment or consistency considerations, and this, perhaps, is a bias towards simpler models. Moreover, in pattern classification, a version of Occam's razor can be justified by the hypothesis that more training data does not decrease the performance of the resulting classifier. How do we quantify "simplicity"? What representations are appropriate? What are the theoretically justified methods for balancing simplicity and quality of fit to training data? How can models with many parameters, such as neural networks with large numbers of free parameters or Support Vector Machines, work well despite large numbers of free parameters? Can we prove Occam's razor to be "correct"?
7:45am
Amongst many other things, the No-Free-Lunch theorems mean that without making assumptions about the real world, it is impossible to prove that any particular learning algorithm --- Occam's razor included --- works well. Typically, when such assumptions are actually made explicit, they directly concern the possible states of the real world. Such assumptions tend to be of very restricted applicability. Fortunately, it turns out that we can instead use indirect, very broadly applicable assumptions concerning the relation between the real world and various algorithms. In particular, if we assume that the generalization of an algorithm improves with more data, on average, then its generalization can be improved further by transforming the algorithm via Occam's razor (i.e., by simplifying it according to a particular model-free definition of "simplicity").
References:
Wolpert, D. H., and Macready, W. G., "No Free Lunch Theorems for Search," IEEE Transactions on Evolutionary Computation, 1, 1997.
Wolpert, D. H., "The Lack of A Priori Distinctions between Learning Algorithms," and "The Existence of A Priori Distinctions between Learning Algorithms," Neural Computation, 7, 1996.
Wolpert, D. H., "On the Bayesian 'Occam Factors' Argument for Occam's Razor," in Computational Learning Theory and Natural Learning Systems III, T. Petsche, S. Hanson and J. Shavlik (eds.), MIT Press, 1995.
Wolpert, D. H., "The Relationship Between Occam's Razor and Convergent Guessing," Complex Systems, 4, 319-368, 1990.
8:15am
Under what circumstances should our learning algorithms choose hypotheses simpler than reality?
I will defend three arguments for explaining (this formulation of) Occam's razor in terms of standard learning theory. The first argument (the Variance Argument) assumes that the real world is very complex and that therefore our learning algorithms are always operating with insufficient training data. As a result, we must employ hypotheses simpler than reality, because we lack enough training data to statistically justify hypotheses of the same complexity as reality.
The second argument (the Bias Argument) assumes that we have enough data but that our space of hypotheses in inefficiently parameterized. That is, some fraction of the hypotheses that we can represent could never correspond to reality. Consequently, if we consider only hypotheses with a statistically-justified level of complexity, our hypothesis space will still not include the true hypothesis, and we will select a simpler hypothesis instead.
The third argument (the Non-identifiability Argument) assumes that the available data are drawn from a skewed distribution that does not reveal the full complexity of the real world. According to this argument, the data are only able to identify an equivalence class of possible models but not the true model. This equivalence class typically contains some models simpler than reality, and our algorithms can choose one such model with no loss of accuracy.
There are some important consequences of these arguments (and of Occam's razor in general). First, methods for incorporating prior knowledge into learning algorithms (e.g., Bayesian networks) must still be combined with some form of Occam's razor to obtain optimal performance. Hence, maximum likelihood, MAP, and even full Bayesian approaches to fitting belief networks will not give optimal performance. Second, the methods advocated by Pearl, Spirtes, et al. for inferring causal models will not succeed in general.
References:
Pearl, J., Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
Spirtes, P. and Glymour, C. and Scheines, R., Causation, Prediction, and Search. Springer-Verlag, 1993.
8:45am break
9:00am
Given two models with the same accuracy on training data, should we prefer the simpler one? The answer depends entirely on what we mean by "simpler." If we mean "having fewer parameters," and we have no knowledge whatsoever beyond the data, there is no reason whatsoever to expect the simpler model to generalize better. This is implied by the "no free lunch" theorems, and confirmed empirically by the success of methods that produce models with very many parameters (e.g., boosting, Support Vector Machines). If we believe in the Bayesian approach, and by "simpler" we mean "encoded in fewer bits," with shorter encodings corresponding to more probable models, then a preference for simpler models is correct but vacuous, because it is simply a restatement of the preference for more probable ones. Under this interpretation, any proof that some set of assumptions implies Occam's razor is futile, because we know a priori that any set of assumptions does so. If by "simpler" we mean "generatable by a shorter program," we have no way of actually determining which model is simpler. The only useful interpretation of Occam's razor as a universal principle (as opposed to a domain-specific heuristic) is that simpler models should be preferred because they are more comprehensible to people. However, simplicity is only an imperfect measure of comprehensibility, and precisely quantifying the latter does not seem possible.
References:
Domingos, P., "The role of Occam's razor in knowledge discovery," Data Mining and Knowledge Discovery, 3(4): 409-425, 1999.
See: http://www.cs.washington.edu/homes/pedrod
9:30am
Recently a considerable body of work in model selection has focused on estimation by maximizing a penalized likelihood, where the penalty is an intuitive measure of complexity, such as the number of terms in the model. This criterion, it is argued, favors the "simplest'' model which explain the data adequately. Suppose instead we measure complexity of the bitstring y by the length of shortest computer program that computes it. This notion is often called Kolmogorov complexity, and has captivated many as the most satisfying notion of complexity.
So, given noisy data x, suppose we look for the lowest complexity y that matches the observed x to within the noise level according to the natural goodness of fit measure associated with the noise distribution. This can be viewed as the ultimate form of complexity-penalized likelihood estimation, in which the complexity penalty is measured in its ultimately most satisfying way. We might call this the Kolmogorov estimator. We remark that it is a deterministic function of the input data.
We consider performance of this estimator in a few interesting continuous and discrete settings where it is applied to noisy data generated by a Bayesian prior for the many unknown estimands. We show that, in these settings anyway, the estimator quite generally and quite precisely acts as a sampler from the underlying posterior distribution from whatever prior happens to be the case -- this, despite the fact that no prior knowledge about the object to be recovered has been input to the procedure, but only some information about the noise distribution. In other words, the minimum complexity criterion automatically "figures" out what prior distribution is holding, and generates a sample from the corresponding posterior. So we had better call this the "Kolmogorov Sampler," even though it is a deterministic function of the data.
It follows that in estimating a high-dimensional mean vector in Gaussian noise, for squared-error loss and under a variety of assumptions about the means for example (a) iid means from an unknown prior or (b) means following an unknown stationary process, the Kolmogorov Sampler achieves asymptotically twice the Bayes risk. So it is quite impressively adaptive, although only 50% efficient. Why is this true? The argument is based on a chain of relationships between Kolmogorov Complexity, Shannon Rate Distortion Theory, and Bayes Decision Theory, which I will explain. Some of these do not seem to be very well known, even to experts, and perhaps they should be better known.
10:00am
10:30am adjourn
4:00pm
Solomonoff's formal
theory of universal induction is based on Ideas of Occam's razor,
Epicurus' principle of multiple explanations, and Bayes' rule,
and is closely related to Kolmogorov complexity. Solomonoff formally
solves the problem of sequence prediction in case of an unknown
environmental probability distribution. We show that the number
of prediction errors of this scheme is only slightly larger than
the number of errors made by the best possible scheme which knows
the true distribution. The error excess depends on the Kolmogorov
complexity of the environment. The results can be generalized
to arbitrary loss or utility functions. We outline how Solomonoff's
scheme can be used to construct an active agent behaving and optimally
in any computable, but otherwise unknown, environment.
4:30pm
The theory of algorithmic probability and algorithmic information has traditionally focused on "monotone'' universal Turing machines with one-way write-only output tape. The universal, enumerable Solomonoff-Levin measure of bitstring x is the probability that such a machine's output starts with x, given random inputs. This measure plays a central role in the theory of universal inductive inference, because it dominates all other enumerable measures in the sense that it assigns higher probability to any finite x, save for a constant factor. Here we discuss measures that are even more dominant but still computable in the limit: nonenumerable but cumulatively enumerable measures (CEMs) derived from Enumerable Output Machines with random input and lexicographically nondecreasing output, and even more dominant approximable measures. We obtain a natural hierarchy of generalizations of Kolmogorov complexity and algorithmic probability, suggesting that the "true'' but nonapproximable information content of some possibly infinite bitstring x is the size of the shortest nonhalting program that converges to x and nothing but x on a general Turing machine that can edit its previous outputs.
We show that certain x are computable in the limit yet more random than Chaitin's "number of wisdom'' Omega, that any approximable measure of x is small for any x lacking a short description (a fundamental justification of Occam's razor), that there is no universal approximable distribution dominating all the others, that there is a universal nonenumerable CEM, and that any CEM of x is small for any x lacking a short enumerating program (Occam's razor version for enumerable data). Finally we conjecture a very simple Occam-type relationship between generalized algorithmic probability P(x) and generalized algorithmic information I(x): P(x)=O(2^{-I(x)}) (no corrective factors -- x may be infinite!).
See: http://www.idsia.ch/~juergen/toesv2/
5:00pm break
5:15pm
Here are three mysteries:
1. Bayesian arguments have been used both to vindicate Occam's razor (`Occam factor'-type arguments) and to criticize Occam's razor (`No Free Lunch'-type arguments).
2. The MDL Principle seems to be, at least to some extent, arbitrary: one can always change an encoding scheme such that some of the hypotheses that were originally very `complex' (long description length) become very `simple' (short description length).
3. Since the phenomena we are trying to model are typically very complex, how can a general preference for `simple' models be useful at all?
We analyze and `solve' all three mysteries based on a single argument. We show that, modern, non-asymptotic versions of MDL, make use of non-arbitrary notions of `simplicity.' We argue that in many realistic situations, even if the `truth' is complex, there are strong reasons to prefer models that, according to MDL, are `simple.'
References:
P. Grunwald, Minimum Description Length and Maximum Probability. Kluwer Academic Publishers, to appear early 2002.
5:45pm
The relationship between the Bayesian approach and the minimum description length approach is established. We sharpen and clarify the general modeling principles MDL and MML, abstracted as the ideal MDL principle and defined from Bayes' rule by means of Kolmogorov complexity. The basic condition under which the ideal principle should be applied is encapsulated as the Fundamental Inequality, which in broad terms states that the principle is valid when the data are random, relative to every contemplated hypothesis and also these hypotheses are random relative to the (universal) prior. Basically, the ideal principle states that the prior probability associated with the hypothesis should be given by the algorithmic universal probability, and the sum of the log universal probability of the model plus the log of the probability of the data given the model should be minimized. If we restrict the model class to the finite sets then application of the ideal principle turns into Kolmogorov's minimal sufficient statistic. In general we show that data compression is almost always the best strategy, both in hypothesis identification and prediction.
See: http://www.cwi.nl/~paulv
References:
P.M.B. Vitanyi and M. Li, "Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity," IEEE Transactions on Information Theory, IT-46:2, 446--464, 2000. http://arXiv.org/abs/cs.LG/9901014
Q. Gao, M. Li and P.M.B. Vitanyi, "Applying MDL to learning best model granularity," Artificial Intelligence, 121:1--29, 2000. http://arXiv.org/abs/physics/0005062
M. Li and P.M.B. Vitanyi, An Introduction to Kolmogorov Complexity and its Applications (2nd. ed.), Springer-Verlag, New York, 1997.
6:15pm
6:55pm
7:00pm adjourn
http://www.rii.ricoh.com/~stork/OccamWorkshop.html