All branches of science and engineering proceed by the formulation of hypotheses and the provision of supporting (or refuting) evidence for them. As both science and engineering, Informatics should be no exception. Unfortunately, the explicit statement of hypotheses in Informatics publications is rare (see the review papers, for instance). This causes problems both for authors and for reviewers.
In these notes, I hope to persuade you of the importance of the explicit and prominent statement of defendable hypotheses or claims in papers about your research. This will not only help give your research direction and progress, but will also head off some avoidable misunderstandings with those who assess it. That is good not just for you personally, but for the field of Informatics as a whole.
This all sounds fine in theory, but Informatics is not a natural science, in which hypotheses are claims about the nature of the world, supported by experimental evidence. So, do "hypotheses", as normally understood in natural sciences, have any role to play in Informatics? I strongly believe that hypotheses can and should be formulated about nearly all Informatics research, but you may be more comfortable calling them "claims" rather than "hypotheses". To understand what shape such claims might take, we need to remind ourselves of the nature of Informatics as the exploration of a space of techniques.
A research advance in Informatics might be realised as any of the following:
A hypothesis will then take the form of a claim that such an advance has taken place. The space of such possible claims is multi-dimensional. Below we discuss some of these dimensions. Firstly, a claim can be formulated at different levels, i.e. be about a task, a system, a technique or a particular parameter. They might, for instance, take one of the following forms:
Properties of and relationships between techniques and systems can be characterised by placing or relating them along particular dimensions. Below I describe some of these dimensions, grouping them as scientific, engineering or cognitive science.
The following three scientific dimensions are fundamental in the sense that they are relevant not just to the theoretical study of Informatics techniques, but also when they are being used as engineering tools or for modelling natural systems.
Claiming that something has a property is usually to place it on one of these dimensions; claiming that it bears some relationship to something else is to compare it along a dimension. Often, these dimensions are combined in a claim. For instance, the behaviour or efficiency of a technique might vary within the range of application, being worse, say, in some more complex situations.
When building systems either as platforms for further research or as ICT products, additional dimensions come into play. These are usually applied at the system level, although the other levels are also possible. There is also some overlap between these engineering dimensions and the scientific ones.
Additional dimensions also come into play when developing computer systems as models of natural systems, e.g. in cognitive science, where we are trying to model cognition and the mind. In fact, Cognitive Science is a conventional natural science in which the typical hypothesis is that some technique or system is a faithful model of some mental faculty. The dimensions listed below reflect different interpretations of 'faithful'. Note that all claims of faithfulness will be made at some level of abstraction, i.e. we always abstract away some details and focus on others. For instance, since brains and computers are made from different kinds of hardware, we would not normally make a claim at this level. Most claims are to do with similarities at the level of process.
Below we will illustrate the four dimensions with reference to a hypothetical model of the processes that children use when carrying out elementary addition. For instance, Seely Brown and Burton built a similar model in their Buggy system, as part of a tool to assist primary school teachers in diagnosing students' arithmetic errors. Theirs was a modular subtraction program in which different kinds of error could be obtained by making minor modifications to the program, e.g. removing or adding a module, and then running it.
We can investigate properties at several levels. We can look at inherent properties of a particular task, e.g. prove the undecidability of deduction. We can look at properties of a complete system, e.g. consider the naturalness of the dialogue conducted by a natural language understanding system. We can look at the properties of a particular technique, e.g. explore the ability of a machine learning algorithm to cope with noisy training data. We can look at a particular parameter within a technique, e.g. find the optimal mutation rate in a genetic algorithm.
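To make the last of these levels concrete, the sketch below sweeps a single parameter of a technique. It is a toy example: the one-max task, the genetic-algorithm details and all parameter values are my own illustrative assumptions, not taken from the text.

```python
import random

def one_max_ga(mutation_rate, bits=20, pop_size=20, generations=40, seed=0):
    """Tiny genetic algorithm on the one-max problem:
    maximise the number of 1-bits in a bit-string."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=sum, reverse=True)
        parents = pop[: pop_size // 2]           # truncation selection
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, bits)
            child = a[:cut] + b[cut:]            # one-point crossover
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]             # per-bit mutation
            children.append(child)
        pop = children
    return max(sum(ind) for ind in pop)

# Parameter-level investigation: sweep the mutation rate.
results = {rate: one_max_ga(rate) for rate in (0.0, 0.01, 0.5)}
print(results)
```

A sweep like this is the experimental analogue of "find the optimal mutation rate": the task, technique and system are held fixed while one parameter varies.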
Our investigations can be theoretical, e.g. using mathematical logic to prove the completeness of a deductive procedure. They can also be experimental, e.g. testing a robot on an assembly task. Experimental investigations often start in an exploratory mode, in which we try out a program in various situations and try to spot patterns. When we think we have seen a pattern, we may switch to a hypothesis-testing mode. In both modes statistical methods are useful: summary statistics and visualisation techniques, e.g. histograms and scatter plots, are good for exploration; statistical measures of significance are good for hypothesis testing.
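As a minimal sketch of the two modes, assuming some invented runtime data for two program variants, the code below first computes summary statistics (exploration) and then Welch's t statistic as one possible measure of significance (hypothesis testing):

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples
    (unequal variances allowed)."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

# Hypothetical runtimes (seconds) of two program variants on ten tasks.
old = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.4, 4.1, 4.0, 4.2]
new = [3.5, 3.6, 3.4, 3.7, 3.5, 3.3, 3.6, 3.5, 3.4, 3.6]

# Exploratory mode: summary statistics to spot a pattern.
print(statistics.mean(old), statistics.mean(new))

# Hypothesis-testing mode: is the apparent speed-up significant?
t = welch_t(old, new)
print(t)
```

In practice one would also compute degrees of freedom and a p-value, but the two-stage pattern, look first, then test, is the point of the sketch.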
Sometimes we are interested in the properties of a particular task, system or technique. Sometimes we are interested in the relationships between two or more tasks, systems, techniques or parameters. Comparisons are usually made along one of the scientific, engineering or cognitive science dimensions. For instance, we may investigate which of two teaching strategies is more effective in an intelligent teaching system. This is a behavioural comparison. We may ask which parser can cope with a wider range of sentence types. This is a coverage comparison. We may ask which initial weight setting in a neural net leads to the quickest convergence. This is an efficiency comparison. We may also investigate trade-offs between the dimensions, e.g. how the quality of behaviour or efficiency degrades in certain parts of the range. We might try to detect and explain phase boundaries, where the value along some dimension changes suddenly rather than gradually.
In cognitive science, experimentation takes a different form. We are usually comparing a computational model with the natural system that it is modelling. Much of our experimental work is aimed at checking the range of behaviour, both external and internal, of the natural systems, across subjects and time, rather than those of the model. The experimental methodology is closer to psychology than it is to engineering.
The internal structure of techniques can also be compared. The outcome of such a comparison might be, for instance, to show that two apparently different techniques are essentially the same, or that one is a special case of the other. For instance, semantic nets were initially introduced as an alternative to logical representations, but Schubert, Hayes and others have shown that they are just a graphical presentation of a subset of first-order logic. Conversely, one might show that two or more distinct techniques confusingly share the same name. For instance, the name 'back propagation' is overloaded, referring to at least two distinct techniques.
Theoretical research can be conducted either into the nature of a task, the techniques for doing that task, or the parameters of some technique. Complete systems are usually too complicated to yield to theoretical analysis, so their evaluation requires experimentation. Theory usually uses mathematical tools. Typical properties checked are correctness, completeness, termination and complexity. For instance, theoretical research into the task of a robot grasping objects might start by using a combination of geometry and mechanics to investigate: optimal places and forces to hold an object; optimal trajectories for a robot hand/arm to traverse; required accuracies in motors and joints for acceptable error rates; etc. This theoretical analysis might then be used to design and evaluate an actual robot arm and its software. We might predict that the robot's behaviour would be of sufficiently high quality in terms of its success rate and accuracy. We might also predict its coverage by showing which types of object-grasping problem it could and could not be expected to deal with.
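One small ingredient of such a geometry-and-mechanics analysis can be sketched as follows. The scenario (a two-finger friction grasp, in which friction alone must support the object's weight, so fingers × μ × N ≥ m × g) and the friction coefficient are assumptions for illustration only:

```python
def min_grip_force(mass_kg, friction_coeff, g=9.81, fingers=2):
    """Minimum normal force (in newtons) each finger must apply so
    that friction alone supports the object's weight:
    fingers * mu * N >= m * g, hence N >= m * g / (fingers * mu)."""
    return mass_kg * g / (fingers * friction_coeff)

# A 0.5 kg object gripped with rubber pads (mu = 0.8, an assumed value).
print(min_grip_force(0.5, 0.8))
```

A prediction derived from such a calculation, e.g. a minimum motor torque, is exactly the kind of theoretical result that the subsequent robot experiments would confirm or refute.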
Of course, we will eventually have to show that these theoretical predictions are borne out in practice by building and testing the robot. Theoretical analysis nearly always requires some simplifying assumptions or abstractions. If these assumptions turn out to be false then the theoretical results may be invalid. Experiment may reveal this and suggest a more refined theoretical analysis.
However, theoretical analysis has various advantages that are not available from experiment alone.
In theoretical research we can regard any theorems as being hypotheses and their proofs as their supporting evidence. Theoretical Informatics is thus unusual in making its hypotheses explicit.
Initial experimental research is usually exploratory, but switches to formulating and testing hypotheses once a pattern has been observed in the data. Polished experimental research should state a hypothesis and then present evidence to support or reject it. Experiments might investigate the quality of system behaviour, its success rate, the scalability of the system, the dependability and usability of the system, etc.
To demonstrate a hypothesis experimentally we need to compare a new system, technique or parameter setting with rival ones. Usually we need to compare complete programs. Ideally, the compared programs will differ in only one respect from each other. If there are several differences it is hard to apportion credit for the improvement. Unfortunately, there are often technical or pragmatic reasons for programs to differ in more than one respect. This is especially the case when two complete systems are being compared, since they will each usually consist of a combination of several different techniques. So hypotheses about complete systems are usually a bit vague. Hypotheses about isolated techniques or the adjustment of single parameters within a technique are to be preferred.
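A minimal sketch of such a single-difference comparison: the two insertion-sort programs below are identical except in one respect, how the insertion point is found, so any difference in the comparison counts can legitimately be credited to that one difference. The algorithms and the test input are my own illustrative choices, not from the text.

```python
def insertion_sort_linear(xs):
    """Insertion sort; linear scan for the insertion point."""
    out, comparisons = [], 0
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            comparisons += 1
            i += 1
        out.insert(i, x)
    return out, comparisons

def insertion_sort_binary(xs):
    """Identical in every respect except one: a binary search
    finds the insertion point."""
    out, comparisons = [], 0
    for x in xs:
        lo, hi = 0, len(out)
        while lo < hi:
            mid = (lo + hi) // 2
            comparisons += 1
            if out[mid] <= x:
                lo = mid + 1
            else:
                hi = mid
        out.insert(lo, x)
    return out, comparisons

data = list(range(50))   # already-sorted input: worst case for the linear scan
sorted1, c_linear = insertion_sort_linear(data)
sorted2, c_binary = insertion_sort_binary(data)
print(c_linear, c_binary)
```

Because only the insertion-point search differs, the reduction in comparisons can be attributed to that technique alone, which is exactly the credit assignment that multi-difference comparisons make impossible.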
The onus is on the author of a hypothesis to identify the kind of evidence which would serve to support or refute it. Usually the compared programs will be tested on some example tasks and then some analysis will be applied to the results. If the claim is about speed, coverage or efficiency then the analysis might consist of the calculation and presentation of some statistics. If the claim is about psychological validity or user-friendly interaction then the analysis might be a discussion of some case studies. In all these cases the analysis has to argue for two difficult-to-establish conclusions:
These two issues are discussed in the following two sub-sections.
Obtaining representative sets of examples for testing a technique is a standard problem in science, and there are some standard solutions:
The purpose of controlled experimentation is to simplify the task of explaining the experimental results. For instance, if we compare two systems which differ in only one respect then any difference in performance between them must be due to this sole difference. On the other hand, if the two systems differ in two respects then a difference in performance might be due to either difference in isolation or a combination of the differences. Unfortunately, most Informatics experimentation is not well controlled. Compared systems usually differ in many respects. This makes allocation of credit or blame very difficult.
This lack of experimental control is sometimes unavoidable. A difference in one respect might entail necessary differences in other respects. For instance, a learning technique that adjusts the weights of features in an evaluation function presupposes a search process controlled by such an evaluation function. Sometimes the lack of experimental control is avoidable in principle, but understandable in practice. For instance, a researcher might have invested heavily in time and money in software and hardware that is different from that used by a rival system. Unfortunately, the lack of experimental control is also sometimes due to sloppy experimental design and is wholly avoidable.
Fortunately, we can often circumvent lack of experimental control by more careful analysis of the experimental results. For instance, suppose two systems are being compared on speed. We can investigate the slower system and identify the procedures which contribute most to the slowness. If these slow procedures implement the technique being compared then it can legitimately be blamed for the slow performance. Similarly, a difference in the traces of the two systems might be credited to the two techniques being compared, rather than some more accidental difference between systems. Such micro-analysis is often possible and must often be used to supplement summary statistics in order to accurately apportion credit or blame.
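Such micro-analysis can be sketched with a standard profiler. In the hypothetical example below, profiling a toy system reveals that a linear-scan search procedure dominates the runtime; all names and the workload are assumptions for illustration:

```python
import cProfile
import io
import pstats

def slow_search(items, target):
    """Linear scan: the technique suspected of causing the slowness."""
    for i, x in enumerate(items):
        if x == target:
            return i
    return -1

def run_system(n=2000):
    """A toy 'complete system' that calls the suspect procedure."""
    items = list(range(n))
    return sum(slow_search(items, t) for t in range(0, n, 10))

# Micro-analysis: profile the system and identify which procedures
# contribute most to the runtime.
profiler = cProfile.Profile()
profiler.enable()
run_system()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```

If the profile shows the compared technique's procedures at the top of the table, the slowness can be blamed on the technique rather than on some accidental difference between the systems.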
Often the behaviour of the systems is too complex for analysis by hand. There is increasing use of automatic tools to support system analysis. Such tools can locate correlations between events in a system, providing hypotheses for more detailed investigation.
When formulating a hypothesis to be established by a piece of research, it is vital to ensure that it can be evaluated. A hypothesis that cannot be evaluated, and hence cannot be refuted, fails Popper's test for what constitutes science. Our scientific investigations should be inspired by big and important problems. We all hope that our advances will contribute to the solutions of these problems, but we must take care not to claim more for these advances than we can defend.
For instance, the most obvious hypothesis may be too expensive to evaluate. Suppose I have invented a new programming language, which we will call MyLang. My aim in inventing it may have been to improve programmer productivity by dramatically reducing the number of bugs that a typical programmer might produce. Unfortunately, a realistic evaluation of this claim might require building several industrial-scale systems in both MyLang and its immediate rival languages and comparing the times taken to arrive at a stable and dependable product. Each system might require a team of a dozen or so programmers, who would need to be carefully matched in terms of their abilities in the various languages. The person-years in both the development and debugging of each of these programs might run to several hundred. Where would a PhD student realistically get such resources?
What we need to do is to replace this over-ambitious hypothesis with one that can be more readily evaluated. For instance, we might note that certain kinds of error are detected at compile time as syntax errors when a programming language is strongly typed. MyLang might have such strong typing at its heart. Our modified hypothesis is that MyLang improves programmer productivity by its early detection of a common form of bug. To show that such errors are common and affect productivity we might conduct some studies of buggy programs. We want to analyse the frequency of 'type violation' errors in untyped programs and how long they take to detect and fix when they are not caught at compile time. Of course, MyLang might decrease programmer productivity in other ways, e.g. by forcing programmers to declare the types of its procedures, variables, etc. We might also want to cover these issues in our analysis. But even taking such issues into account, we now have an evaluable, but much more modest, hypothesis.
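A minimal sketch of such a bug-frequency analysis, using entirely made-up bug records (the categories and fix times are illustrative assumptions, not real data):

```python
# Hypothetical bug records from untyped programs:
# (category, hours taken to detect and fix).
bugs = [
    ("type violation", 3.0),
    ("off-by-one", 1.5),
    ("type violation", 5.0),
    ("null dereference", 2.0),
    ("type violation", 4.5),
    ("logic error", 6.0),
]

type_fix_times = [hours for cat, hours in bugs if cat == "type violation"]
frequency = len(type_fix_times) / len(bugs)
mean_fix_time = sum(type_fix_times) / len(type_fix_times)
print(frequency, mean_fix_time)
```

If such a study showed type violations to be both frequent and costly to fix, it would support the modest hypothesis that catching them at compile time improves productivity.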
Just like other science and engineering disciplines, Informatics advances via the formulation of hypotheses (or claims) and the provision of supporting (or refuting) evidence for them. Hypotheses in Informatics typically establish or compare properties along some dimension. These dimensions include the following:
The evidence to support (or refute) these hypotheses can be either theoretical, experimental or a combination of both. When devising experiments, care needs to be taken to ensure that the experimental data are representative and to identify the correct cause of any observed effects.
I always welcome feedback on these notes, especially if you have detected missing material, e.g. new kinds of dimension or new forms of hypothesis.

Alan Bundy
Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK