The Need for Hypotheses in Informatics

Introduction

All branches of science and engineering proceed by the formulation of hypotheses and the provision of supporting (or refuting) evidence for them. As both science and engineering, Informatics should be no exception. Unfortunately, the provision of explicit hypotheses in Informatics publications is rare (see the review papers, for instance). This causes many problems for both authors and reviewers:

As we will see, there are often many possible hypotheses. If the author does not stake explicit claim(s), then it is quite likely that the reviewers will identify the wrong one(s), causing them to misinterpret and then misassess the paper. Many "rogue reviews" of both research papers and grant proposals can be traced back to such misinterpretations.
When you analyse a paper (see the feedback on previous review papers for example analyses) you can typically identify several possible hypotheses. Usually, only some of these are intended seriously and supported with evidence; the others are merely suggestive and left for future work. Confusion about which is which will lead to criticism that insufficient evidence has been presented.
Sometimes researchers are unclear in their own minds about what claim(s) they are making. It follows that they are also unclear about what evidence they should be presenting. The result can be that none of the possible claims is properly tested, leading to unfocused research.

In these notes, I hope to persuade you of the importance of the explicit and prominent statement of defendable hypotheses or claims in papers about your research. This will not only help give direction and progress to your research, but also head off some avoidable misunderstandings with those who assess it. This will not just be good for you personally, but good for the field of Informatics.

What is the Form of Hypotheses in Informatics?

This all sounds fine in theory, but Informatics is not a natural science, in which hypotheses are claims about the nature of the world, supported by experimental evidence. So, do "hypotheses", as normally understood in natural sciences, have any role to play in Informatics? I strongly believe that hypotheses can and should be formulated about nearly all Informatics research, but you may be more comfortable calling them "claims" rather than "hypotheses". To understand what shape such claims might take, we need to remind ourselves of the nature of Informatics as the exploration of a space of techniques.

A research advance in Informatics might be realised as any of the following:

The invention of a new technique, e.g. a new algorithm, information storage method, computer architecture, software engineering practice, etc.
The better understanding of an existing technique, e.g. its properties or relationships to other techniques.
The extension or improvement of an existing technique, e.g. to extend its range of application, or improve its efficiency or output quality.
The discovery of new applications of a technique, either within an artificial system or in the modelling of some natural system.
The implementation of a combination of techniques to build a computer system.
The better understanding of the properties of a task that these techniques might undertake, e.g. that it is NP-complete, so that it is unlikely that a polynomial-time solution exists.

A hypothesis will then take the form of a claim that such an advance has taken place. The space of such possible claims is multi-dimensional. Below we discuss some of these dimensions. Firstly, a claim can be formulated at different levels, i.e. be about a task, a system, a technique or a particular parameter. Such claims might, for instance, take one of the following forms:

All techniques to solve task X will have property Y.
System X is superior to system Y on dimension Z.
Technique X has property Z.
Technique X has relationship Z to technique Y.
X is the optimal setting of parameter Y.

Dimensions of Comparison

Properties of and relationships between techniques and systems can be characterised by placing or relating them along particular dimensions. Below I describe some of these dimensions, grouping them as scientific, engineering or cognitive science.

Scientific Dimensions

The following three scientific dimensions are fundamental in the sense that they are relevant not just to the theoretical study of Informatics techniques, but also when they are being used as engineering tools or for modelling natural systems.

Behaviour: The effect or result of the technique. To assess behaviour an external "gold standard" is needed to which the behaviour can be compared. The assessment can be absolute or comparative, i.e. we can say merely that the behaviour is either correct or incorrect, or we can say that it is of higher or lower quality than some rival. For instance, suppose we were assessing a mobile robot navigation system. If the robot successfully reaches its target then its behaviour could be regarded as correct; if it does so by a shorter route than a rival robot then we could regard its behaviour as of better quality.
Coverage: The range of application of the technique. To assess coverage we need to identify a set of situations to which its application is relevant. The assessment can be absolute or comparative, i.e. we can claim that it applies to the whole set, sometimes called completeness, or that it applies to a bigger subset than some rival. In terms of the robot, we might claim that it can navigate all routes of interest or merely that it can navigate more routes than some comparator robot.
Efficiency: The resources consumed by the technique. The resources measured are usually time or space. In complexity theory, a space of abstract efficiency measures is used, e.g. constant, linear, quadratic, polynomial, exponential, etc. We might, for instance, claim that the run time of our robot route-finding algorithm is polynomial in the size of the space to be navigated, e.g. as measured by the number of arcs in some graph representation of the possible paths.

Claiming that something has a property is usually to place it on one of these dimensions; claiming that it bears some relationship to something else is to compare it along a dimension. Often, these dimensions are combined in a claim. For instance, the behaviour or efficiency of a technique might vary within the range of application, being worse, say, in some more complex situations.
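
As a rough illustration of how behaviour and coverage claims might be made measurable for the route-finding example, the following Python sketch compares two route finders on a small set of tasks. The map, the task list and all function names (bfs_route, limited_dfs_route, is_valid_route, assess) are assumptions made purely for illustration; they are not taken from any particular robot system.

```python
from collections import deque

# Hypothetical map: nodes are places, arcs are one-way corridors.
GRAPH = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": ["F"],
    "F": [],
    "G": [],  # no arcs lead here, so no route to G exists
}

def bfs_route(graph, start, goal):
    """Breadth-first search; returns a route with the fewest arcs, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            route = []
            while node is not None:
                route.append(node)
                node = parents[node]
            return route[::-1]
        for neighbour in graph[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

def limited_dfs_route(graph, start, goal, limit=2):
    """Hypothetical rival: depth-first search that gives up after `limit` arcs."""
    if start == goal:
        return [start]
    if limit == 0:
        return None
    for neighbour in graph[start]:
        rest = limited_dfs_route(graph, neighbour, goal, limit - 1)
        if rest is not None:
            return [start] + rest
    return None

def is_valid_route(graph, route, start, goal):
    """Behaviour check: the route must really connect start to goal."""
    if route is None or route[0] != start or route[-1] != goal:
        return False
    return all(b in graph[a] for a, b in zip(route, route[1:]))

def assess(finder, graph, tasks):
    """Return (tasks solved correctly, total arcs used on solved tasks)."""
    solved, arcs = 0, 0
    for start, goal in tasks:
        route = finder(graph, start, goal)
        if is_valid_route(graph, route, start, goal):
            solved += 1
            arcs += len(route) - 1
    return solved, arcs

TASKS = [("A", "D"), ("A", "F"), ("A", "G")]
bfs_solved, bfs_arcs = assess(bfs_route, GRAPH, TASKS)
dfs_solved, dfs_arcs = assess(limited_dfs_route, GRAPH, TASKS)
print(f"BFS: {bfs_solved}/{len(TASKS)} tasks, {bfs_arcs} arcs")
print(f"depth-limited DFS: {dfs_solved}/{len(TASKS)} tasks, {dfs_arcs} arcs")
```

Here the number of tasks solved stands in for coverage and the validity check stands in for the gold standard. Neither finder covers the whole task set, since no route to G exists, and a quality comparison by route length is only meaningful on tasks that both finders solve.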

Engineering Dimensions

When building systems either as platforms for further research or as ICT products, additional dimensions come into play. These are usually applied at the system level, although the other levels are also possible. There is also some overlap between these engineering dimensions and the scientific ones.

Dependability: Whether the system is reliable, secure and safe. This is especially important when the system is to be used in a safety or security critical application, such as flying an aeroplane or running a bank's ATMs. Reliability could be interpreted as correct behaviour, even in the presence of faults in the systems, human or machine, to which it is interfaced. Safety and security can also be expressed as behavioural properties.
Usability: How easy the system is to use. This includes ease of use by the intended human users and ease of interfacing to other systems.
Maintainability: How easily the system can evolve to meet changes in the user's requirements. This is especially important where the system is to be used over a long period in a changing situation, e.g. a company's payroll system that must take into account regular changes to the tax laws.
Scalability: Whether the system continues to work within realistic resource limits on the most complex examples. This is really a form of efficiency, but this term is usually used for large systems when complexity analysis would be infeasible and the evidence is largely empirical.
Cost: How much the system costs to build and/or maintain. This might be measured as money, but might also be measured in development time.
Fitness: How well a system meets the user's requirements. This dimension will often be an amalgam of various other dimensions, such as dependably exhibiting correct behaviour on all the user's problems within certain resource limits, while being easy to use, within budget and on time.

Cognitive Science Dimensions

Additional dimensions also come into play when developing computer systems as models of natural systems, e.g. in cognitive science, where we are trying to model cognition and the mind. In fact, Cognitive Science is a conventional natural science in which the typical hypothesis is that some technique or system is a faithful model of some mental faculty. The dimensions listed below reflect different interpretations of 'faithful'. Note that all claims of faithfulness will be made at some level of abstraction, i.e. we always abstract away some details and focus on others. For instance, since brains and computers are made from different kinds of hardware, we would not normally make a claim at this level. Most claims are to do with similarities at the level of process.

Below we will illustrate the four dimensions with reference to a hypothetical model of the processes that children use when carrying out elementary addition. Seely Brown and Burton built a similar model in their Buggy system, as part of a tool to assist primary school teachers to diagnose student arithmetic errors. Theirs was a modular subtraction program in which different kinds of errors could be obtained by making minor modifications to the program, e.g. removing or adding a module, and then running it.

External: The model displays the appropriate external behaviours. For instance, when modelling the arithmetic behaviour of a particular child, a faithful model should give the same answers as the child when applied to the same arithmetic problems. This is similar to the scientific behaviour dimension, except that the 'gold standard' is subtly changed; if the child gives a wrong answer then the program should give the same wrong answer and vice versa.
Internal: The model works in the same way as the thing modelled. We may want to claim that not only does the model produce the appropriate external behaviours, but it does so as a result of internal processing similar to that of the thing being modelled. There is a major impediment to demonstrating internal faithfulness: whereas we can readily trace the internal processing of a computer model, the internal processing of the mind is difficult for us to discover. We can get clues to the mind's processing from techniques such as protocol analysis, where recordings (audio, written, eye-tracking, video, brain scan, etc.) are taken during the act of problem solving. Analysis of these recordings can yield clues about internal processing, e.g. via the reflections of the subject, tracking of the subject's focus of attention, 'chunking' of the process into episodes, involvement of similar brain areas, etc. In the case of our arithmetic task we might note, for instance, that both schoolchild and program start by adding the rightmost column, add from top to bottom, pause after each column and mark any carries at the bottom of the next column. A common erroneous behaviour might be modelled when both program and schoolchild fail to carry digits into the next column, resulting in a wrong answer.
Adaptability: The model accounts for a wide range of occurring behaviours. A model will gain credibility if it can be readily adapted to account for the range of observed behaviours - but can't be readily adapted to generate behaviours that are not observed. For instance, we might define a space of minor modifications of our arithmetic program and then try to match the resulting range of behaviours to those observed among real schoolchildren. If the range of modelled behaviours (nearly) coincides with the range of observed naturally occurring behaviours then our program would score highly on the adaptability dimension.
Evolvability: The model accounts for the evolution or learning of the ability it models. We need to show that not only can we model each stage of development, but that there is a progression in these models, i.e. that simple modifications take us from each stage to the next. In the case of addition we might show that developing arithmetic ability can be modelled by the gradual accumulation of program modules. We might also show how each of these modules might be automatically learnt.
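
To make the external dimension concrete, here is a minimal Python sketch of a modular addition model in which the carrying step can be switched off, together with a measure of how often the model reproduces a child's answers. It is only a toy under stated assumptions: the function names (column_add, external_match), the carry_module switch and the recorded answers are all hypothetical, and this is not a reconstruction of the Buggy system.

```python
def column_add(a, b, carry_module=True):
    """Add two non-negative integers column by column, rightmost column first.

    Setting carry_module=False removes the carrying step, which models the
    common error of failing to carry digits into the next column.
    """
    digits_a = [int(d) for d in reversed(str(a))]
    digits_b = [int(d) for d in reversed(str(b))]
    columns = []
    carry = 0
    for i in range(max(len(digits_a), len(digits_b))):
        da = digits_a[i] if i < len(digits_a) else 0
        db = digits_b[i] if i < len(digits_b) else 0
        total = da + db + carry
        columns.append(total % 10)  # digit written under this column
        carry = total // 10 if carry_module else 0
    if carry:
        columns.append(carry)
    return int("".join(str(d) for d in reversed(columns)))

def correct_model(a, b):
    """Model with every module present."""
    return column_add(a, b, carry_module=True)

def no_carry_model(a, b):
    """Model with the carry module removed."""
    return column_add(a, b, carry_module=False)

def external_match(model, observed):
    """Fraction of problems on which the model gives the child's answer,
    whether that answer is right or wrong."""
    hits = sum(model(a, b) == answer for (a, b), answer in observed.items())
    return hits / len(observed)

# Hypothetical answers recorded from one child; two of the three are wrong.
child_answers = {(47, 38): 75, (23, 14): 37, (19, 26): 35}

print("correct model :", external_match(correct_model, child_answers))
print("no-carry model:", external_match(no_carry_model, child_answers))
```

On this invented data the model without the carry module reproduces all of the child's answers, including the wrong ones, whereas the fully correct model does not. The same idea extends towards adaptability: one could enumerate a space of such module removals and compare the resulting answers with those observed across many children.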

Kinds of Research

We can investigate properties at several levels. We can look at inherent properties of a particular task, e.g. prove the undecidability of deduction. We can look at properties of a complete system, e.g. consider the naturalness of the dialogue conducted by a natural language understanding system. We can look at the properties of a particular technique, e.g. explore the ability of a machine learning algorithm to cope with noisy training data. We can look at a particular parameter within a technique, e.g. find the optimal mutation rate in a genetic algorithm.

Our investigations can be theoretical, e.g. use mathematical logic to prove the completeness of a deductive procedure. They can also be experimental, e.g. testing a robot on an assembly task. Experimental investigations often start in an exploratory mode, in which we try out a program in various situations and try to spot patterns. When we think we have seen a pattern we may switch to a hypothesis testing mode. In both modes statistical methods are useful. Visualisation techniques are good for exploration, e.g. histograms, scatter plots, summary statistics. Statistical measures of significance are good for hypothesis testing.
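
As one small example of a significance measure in hypothesis testing mode, the Python sketch below applies an exact two-sided sign test to paired run times of two systems on the same examples. The choice of the sign test, the function name sign_test_p_value and the run times are all assumptions for illustration; other tests may suit other data better.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Two-sided exact sign test; ties are assumed to have been dropped."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided in one direction,
    # if each system were equally likely to win on any given example.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical run times in seconds on ten shared test examples.
new_times = [1.2, 0.9, 1.5, 2.0, 1.1, 0.8, 1.7, 1.3, 1.0, 2.4]
old_times = [1.6, 1.4, 1.9, 2.6, 1.5, 1.1, 2.3, 1.2, 1.5, 2.9]

wins = sum(new < old for new, old in zip(new_times, old_times))
losses = sum(new > old for new, old in zip(new_times, old_times))
p_value = sign_test_p_value(wins, losses)
print(f"new system faster on {wins} of {wins + losses} examples, p = {p_value:.4f}")
```

The hypothesis being tested here is the comparative efficiency claim "the new system is faster than the old one"; a small p-value says only that so lopsided a split would be unlikely if the two systems were equally fast.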

Sometimes we are interested in the properties of a particular task, system or technique. Sometimes we are interested in the relationships between two or more tasks, systems, techniques or parameters. Comparisons are usually made along one of the scientific, engineering or cognitive science dimensions. For instance, we may investigate which of two teaching strategies is more effective in an intelligent teaching system. This is a behavioural comparison. We may ask which parser can cope with a wider range of sentence types. This is a coverage comparison. We may ask which initial weight setting in a neural net leads to the quickest convergence. This is an efficiency comparison. We may also investigate trade-offs between the dimensions, e.g. how the quality of behaviour or efficiency degrades in certain parts of the range. We might try to detect and explain phase boundaries, where the value along some dimension changes suddenly rather than gradually.

In cognitive science, experimentation takes a different form. We are usually comparing a computational model with the natural system that it is modelling. Much of our experimental work is aimed at checking the range of behaviour, both external and internal, of the natural systems, across subjects and time, rather than those of the model. The experimental methodology is closer to psychology than it is to engineering.

The internal structure of techniques can also be compared. The outcome of such a comparison might be, for instance, to show that two apparently different techniques are essentially the same or that one is a special case of the other. For instance, semantic nets were initially introduced as an alternative to logical representations, but Schubert, Hayes and others have shown that they are just a graphical presentation of a subset of first order logic. Contrariwise, one might show that two or more different techniques confusingly share the same name. For instance, the name 'back propagation' is overused and refers to at least two distinct techniques.

Theoretical Research

Theoretical research can be conducted either into the nature of a task, the techniques for doing that task, or the parameters of some technique. Complete systems are usually too complicated to yield to theoretical analysis, so their analysis requires experimentation. Theory usually uses mathematical tools. Typical properties checked are correctness, completeness, termination and complexity. For instance, theoretical research into the task of a robot grasping objects might start by using a combination of geometry and mechanics to investigate: optimal places and forces to hold an object; optimal trajectories for a robot hand/arm to traverse; required accuracies in motors and joints for acceptable error rates; etc. This theoretical analysis might then be used to design and evaluate an actual robot arm and its software. We might predict that the robot's behaviour would be of sufficiently high quality in terms of its success rate and accuracy. We might also predict its coverage by showing which type of object grasping problems it could and could not be expected to deal with.

Of course, we will eventually have to show that these theoretical predictions are borne out in practice by building and testing the robot. Theoretical analysis nearly always requires some simplifying assumptions or abstractions. If these assumptions turn out to be false then the theoretical results may be invalid. Experiment may reveal this and suggest a more refined theoretical analysis.

However, theoretical analysis has various advantages that are not available from experiment alone.

We can analyse the task in the abstract. Sometimes it is possible to show inherent limitations in any solution to the task, e.g. it may be undecidable or exponential, or some forms of it may be impossible to solve.
Theoretical analysis of a task can sometimes suggest a new technique for solving it. For instance, we might derive constraints on solutions to a task that can be used to define an inefficient 'generate and test' procedure. By progressively integrating more of the constraints into the generation process we might evolve more efficient procedures. Techniques whose design is informed by theory are often simpler, easier to understand and more robust than those developed by more ad hoc methods like exploratory programming.
We can prove properties of and relationships between techniques that hold for a potentially infinite set of examples. For instance, we might show that a technique always produces optimal solutions.
Theoretical analysis can often suggest ways of generalising or extending a technique. It might show that some aspects of the knowledge representation or the operations on it are unnecessarily restrictive. Removing these restrictions might enable new problems to be tackled or new solutions to be generated. Analysis might reveal some non-optimality whose removal will result in greater efficiency or in higher quality solutions.
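
The idea of starting from a 'generate and test' procedure and then moving constraints into the generator can be sketched in Python. The task chosen here, placing n queens on an n by n board so that none attacks another, is my own choice of a standard textbook example, and the function names are hypothetical; the sketch illustrates the general idea rather than any specific technique from the literature.

```python
from itertools import permutations

def no_diagonal_attacks(cols):
    """Test: cols[r] is the queen's column in row r; no two share a diagonal."""
    return all(abs(cols[i] - cols[j]) != j - i
               for i in range(len(cols))
               for j in range(i + 1, len(cols)))

def generate_and_test(n):
    """Generate every complete placement, then test the diagonal constraint."""
    solutions, tested = [], 0
    for candidate in permutations(range(n)):
        tested += 1
        if no_diagonal_attacks(candidate):
            solutions.append(candidate)
    return solutions, tested

def constrained_generation(n):
    """Check the constraints while generating, so that a partial placement
    which already violates them is never extended."""
    solutions, explored = [], 0

    def extend(partial):
        nonlocal explored
        row = len(partial)
        if row == n:
            solutions.append(tuple(partial))
            return
        for col in range(n):
            if col in partial:
                continue  # column already used
            if any(abs(col - c) == row - r for r, c in enumerate(partial)):
                continue  # shares a diagonal with an earlier queen
            explored += 1
            extend(partial + [col])

    extend([])
    return solutions, explored

gt_solutions, tested = generate_and_test(6)
cg_solutions, explored = constrained_generation(6)
print(f"generate and test: {len(gt_solutions)} solutions from {tested} complete candidates")
print(f"constrained generation: {len(cg_solutions)} solutions from {explored} partial placements")
```

Both procedures return the same set of solutions. The second counts partial placements rather than complete candidates, so the two counts are not measured in the same units, but they give a feel for how much generation is avoided once the test is folded into the generator.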

In theoretical research we can regard any theorems as being hypotheses and their proofs as their supporting evidence. Theoretical Informatics is thus unusual in making its hypotheses explicit.

Experimental Research

Initial experimental research is usually exploratory, but switches to formulating and testing hypotheses once a pattern has been observed in the data. Polished experimental research should state a hypothesis and then present evidence to support or reject it. Experiments might investigate the quality of system behaviour, its success rate, the scalability of the system, the dependability and usability of the system, etc.

To demonstrate a hypothesis experimentally we need to compare a new system, technique or parameter setting with rival ones. Usually we need to compare complete programs. Ideally, the compared programs will differ in only one respect from each other. If there are several differences it is hard to apportion credit for the improvement. Unfortunately, there are often technical or pragmatic reasons for programs to differ in more than one respect. This is especially the case when two complete systems are being compared, since they will each usually consist of a combination of several different techniques. So hypotheses about complete systems are usually a bit vague. Hypotheses about isolated techniques or the adjustment of single parameters within a technique are to be preferred.

The onus is on the author of a hypothesis to identify the kind of evidence which would serve to support or refute it. Usually the compared programs will be tested on some example tasks and then some analysis will be applied to the results. If the claim is about speed, coverage or efficiency then the analysis might consist of the calculation and presentation of some statistics. If the claim is about psychological validity or user-friendly interaction then the analysis might be a discussion of some case studies. In all these cases the analysis has to argue for two difficult-to-establish conclusions:

That the results extend beyond the current case studies; and
That the results are explained by the stated hypothesis and not by some other cause.

These two issues are discussed in the following two sub-sections.

How to Show that Examples are Representative

Obtaining representative sets of examples for testing a technique is a standard problem in science, and there are some standard solutions:

Distinguish training from test examples. The generality of a technique can be demonstrated by selecting a set of test examples and setting them aside until the technique has been fully debugged on a separate training set. This guards against the danger of over-tuning the technique to the training set. Of course, to be fair to the technique the training set should itself be representative of the range and complexity of problems that will be encountered in the test set.
Use a lot of dissimilar examples. Showing that a technique is successful and/or efficient on a lot of dissimilar examples in itself provides evidence that the technique is widely applicable. In addition, over-tuning the technique is less likely if the training and test sets are large and heterogeneous.
Collect examples from an independent source. If some third party has provided the examples and they have not been subsequently filtered then we have evidence that they are not biased in favour of any of the technique(s) being compared. However, care must be taken to show that the source is truly independent, i.e. is not, due to its origin, inherently biased. For instance, if the technique is aimed at some final application then it is good if the examples have arisen from that application. This might even be taken as a definition of representative.
Use the shared examples of the field. If the field has developed a standard test set then it aids comparison to rival techniques if both techniques are evaluated on the same test set. This does not make the test set representative. Indeed, there is a real danger that the test set is provided by the founders of the field and is biased in favour of the techniques that they developed. One of the aims of the field should be to evolve the standard test set to make it more representative.
Use challenging examples. It is easy to get a high success rate on easy problems, but this does not aid behavioural or coverage comparison with other techniques. The technique needs to be evaluated on examples that have caused problems for other techniques.
Use acute examples, i.e. examples that are small but embody features that defeat naive approaches. Their small size simplifies the task of debugging techniques that are defeated by them. Even when they do not have a shared set of test examples, most fields pass around a selection of acute examples. These are of most use during the development phase for exposing over-simplifications, missing cases, false assumptions, etc. in the technique. Success on them also demonstrates the generality of the technique, although acute examples are in some sense the opposite of representative.
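
The first of these practices, setting test examples aside before any tuning takes place, can be sketched in Python as follows. The data, the simple threshold 'technique' being tuned and the function names (split_examples, accuracy, tune_threshold) are all hypothetical and chosen only to keep the sketch short.

```python
import random

def split_examples(examples, test_fraction=0.3, seed=0):
    """Shuffle once, then set the test examples aside before any tuning."""
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training, test)

def accuracy(threshold, examples):
    """Fraction of examples on which 'value >= threshold' predicts the label."""
    correct = sum((value >= threshold) == label for value, label in examples)
    return correct / len(examples)

def tune_threshold(training, candidates):
    """Pick the candidate that does best on the training set only."""
    return max(candidates, key=lambda t: accuracy(t, training))

# Hypothetical labelled examples: (measurement, whether the case is positive).
examples = [(x, x >= 10) for x in range(20)]
training, test = split_examples(examples)

threshold = tune_threshold(training, candidates=range(21))
print(f"threshold tuned on {len(training)} training examples: {threshold}")
print(f"accuracy on {len(test)} held-out test examples: {accuracy(threshold, test):.2f}")
```

The point of the design is that tune_threshold never sees the test set, so the figure reported on it is not inflated by over-tuning. Whether either set is representative is a separate question that the split itself cannot answer.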

How to Show that the Results Support the Hypothesis

The purpose of controlled experimentation is to simplify the task of explaining the experimental results. For instance, if we compare two systems which differ in only one respect then any difference in performance between them must be due to this sole difference. On the other hand, if the two systems differ in two respects then a difference in performance might be due to either difference in isolation or a combination of the differences. Unfortunately, most Informatics experimentation is not well controlled. Compared systems usually differ in many respects. This makes allocation of credit or blame very difficult.
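
One lightweight way to keep track of experimental control is to record each compared system as a set of named factors and check in how many respects two systems differ. The Python sketch below does this; the factor names, their values and the function names are hypothetical.

```python
def differing_factors(config_a, config_b):
    """Names of the factors on which two system configurations differ."""
    factors = set(config_a) | set(config_b)
    return sorted(f for f in factors if config_a.get(f) != config_b.get(f))

def is_controlled(config_a, config_b):
    """True if the two configurations differ in exactly one respect."""
    return len(differing_factors(config_a, config_b)) == 1

# Hypothetical configurations of three systems to be compared.
baseline = {"search": "best-first", "learning": "none", "hardware": "machine-1"}
variant = {"search": "best-first", "learning": "weight-adjustment", "hardware": "machine-1"}
rival = {"search": "depth-first", "learning": "weight-adjustment", "hardware": "machine-2"}

print("baseline vs variant:", differing_factors(baseline, variant))
print("baseline vs rival:  ", differing_factors(baseline, rival))
```

A performance difference between baseline and variant can be credited to the single differing factor, whereas a difference between baseline and rival could be due to any of the listed factors or to some combination of them.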

This lack of experimental control is sometimes unavoidable. A difference in one respect might entail necessary differences in other respects. For instance, a learning technique that adjusts the weights of features in an evaluation function presupposes a search process controlled by such an evaluation function. Sometimes the lack of experimental control is avoidable in principle, but understandable in practice. For instance, a researcher might have invested heavily in time and money in software and hardware that is different from that used by a rival system. Unfortunately, the lack of experimental control is also sometimes due to sloppy experimental design and is wholly avoidable.

Fortunately, we can often circumvent lack of experimental control by more careful analysis of the experimental results. For instance, suppose two systems are being compared on speed. We can investigate the slower system and identify the procedures which contribute most to the slowness. If these slow procedures implement the technique being compared then it can legitimately be blamed for the slow performance. Similarly, a difference in the traces of the two systems might be credited to the two techniques being compared, rather than some more accidental difference between systems. Such micro-analysis is often possible and must often be used to supplement summary statistics in order to accurately apportion credit or blame.

Often the behaviour of the systems is too complex for analysis by hand. There is increasing use of automatic tools to support system analysis. Such tools can locate correlations between events in a system, providing hypotheses for more detailed investigation.

Hypotheses must be Evaluable

When formulating a hypothesis to be established by a piece of research, it is vital to take care that it is one that can be evaluated. A hypothesis that cannot be evaluated, and hence cannot be refuted, fails Popper's test for what constitutes science. Our scientific investigations should be inspired by big and important problems. We all hope that our advances will contribute to the solutions to these problems, but we must take care not to claim more for these advances than we can defend.

For instance, the most obvious hypothesis may be too expensive to evaluate. Suppose I have invented a new programming language, which we will call MyLang. My aim in inventing it may have been to improve programmer productivity by dramatically reducing the number of bugs that a typical programmer might produce. Unfortunately, a realistic evaluation of this claim might require building several industrial-scale systems in both MyLang and its immediate rival languages and comparing the times taken to arrive at a stable and dependable product. Each system might require a team of a dozen or so programmers, who would need to be carefully matched in terms of their abilities in the various languages. The person-years spent on both the development and debugging of each of these programs might run to several hundreds. Where would a PhD student realistically get such resources?

What we need to do is to replace this over-ambitious hypothesis with one that can be more readily evaluated. For instance, we might note that certain kinds of error are detected at compile time as syntax errors when a programming language is strongly typed. MyLang might have such strong typing at its heart. Our modified hypothesis is that MyLang improves programmer productivity by its early detection of a common form of bug. To show that such errors are common and affect productivity we might conduct some studies of buggy programs. We want to analyse the frequency of 'type violation' errors in untyped programs and how long they take to detect and fix when they are not detected at compile time. Of course, MyLang might decrease programmer productivity in other directions, e.g. by forcing programmers to identify the types of its procedures, variables, etc. We might also want to cover these issues in our analysis. But even taking such issues into account, we now have an evaluable, but much more modest hypothesis.
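
The kind of analysis this more modest hypothesis calls for can be sketched in a few lines of Python: given a log of bugs found in untyped programs, compute how frequent 'type violation' errors are and how long they take to fix. The log entries below are invented placeholders, not real measurements, and the record format and the function name summarise are assumptions.

```python
from statistics import mean

# Hypothetical bug records from an untyped code base:
# (category, hours taken to detect and fix). Placeholder values only.
bug_log = [
    ("type violation", 3.0),
    ("logic", 5.0),
    ("type violation", 1.5),
    ("off-by-one", 2.0),
    ("type violation", 4.5),
]

def summarise(bug_log, category):
    """Return (share of all bugs in this category, mean hours to fix them)."""
    hours = [h for c, h in bug_log if c == category]
    if not hours:
        return 0.0, 0.0
    return len(hours) / len(bug_log), mean(hours)

share, mean_hours = summarise(bug_log, "type violation")
print(f"type violations: {share:.0%} of bugs, mean {mean_hours:.1f} hours to fix")
```

A real study would also need the representativeness safeguards discussed earlier, e.g. bug logs collected from an independent source rather than assembled by the language's inventor.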

Summary

Just like other science and engineering disciplines, Informatics advances via the formulation of hypotheses (or claims) and the provision of supporting (or refuting) evidence for them. Hypotheses in Informatics typically establish or compare properties along some dimension. These dimensions include the following:

Scientific: behaviour, coverage, efficiency.
Engineering: dependability, usability, maintainability, scalability, cost, fitness.
Cognitive Science: external, internal, adaptability, evolvability.

The evidence to support (or refute) these hypotheses can be either theoretical, experimental or a combination of both. When devising experiments, care needs to be taken to ensure that the experimental data are representative and to identify the correct cause of any observed effects.

I always welcome feedback on these notes, especially if you have detected missing material, e.g. new kinds of dimension, new forms of hypothesis.

Alan Bundy
