Feedback on paper of Shepherd et al


This is an interesting paper for you to review. On the face of it, it appears to
be full of technical terms from bioinformatics and would seem to require someone
expert in that area to do it justice. I maintain, however, that some basic
understanding of informatics methodology is sufficient to do quite a good review
job and, to a first approximation, the detailed bioinformatics material can be
skimmed. Informaticians have to be 'jacks of all trades', so you need practice
in operating outside your comfort zone. You do, of course, need to know
something about databases, but I assume this is shared knowledge among all
informaticians. 

HYPOTHESES

For once, I think there is one main hypothesis and it is clearly (if not
entirely explicitly) stated. The last sentence of the penultimate paragraph of
the Results section on the first page says:

    "The resultant database is both fast and flexible."

I think that's it. Earlier parts of this same section even give the 'because'
clauses. 

Fast: "A potential drawback with this approach — poor performance caused by the
       number of joins across meta-level tables — is avoided by implementing the
       PFDB with materialized views using the mature relational database
       technology of Oracle 8i."

Flexible: "The explicit representation of relationships at the meta-level has a
       number of advantages, including flexibility — both in terms of the range
       of queries that can be formulated and the ability to integrate new
       biological entities within the existing design."


EVALUATION

When it comes to evaluation, however, things are not so clear. 

Fast: Let's take the 'fast' claim first, as this is the clearer case. What you
might expect here is some empirical data, e.g., timing information on a range
of queries, perhaps with some favourable comparisons with similar data from
rival systems. But, apart from a range of 2-10 seconds being given for the
preformulated queries, there is nothing of this sort. Rather, the main evidence
seems to be at the beginning of the "Implementing the PFDB" section (p1668),
where a design choice is presented between relational and object databases, and
a pilot study with a particular object database shows poor performance. No
comparative data is provided, and there is no evidence that the results from
that particular object database are representative of object databases in
general.
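
To make concrete the kind of timing evidence I have in mind, here is a minimal
sketch. It is not the PFDB schema or the authors' method: the tables (entity,
relation, joined), the data and the two queries are all hypothetical, and SQLite
is used only because it ships with Python. SQLite has no materialized views, so
a precomputed table stands in for one. The point is just the shape of the
experiment: the same question asked in two ways, each timed over several runs.

```python
import sqlite3
import statistics
import time


def build_toy_db():
    """Create an in-memory database with a hypothetical toy schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (id INTEGER PRIMARY KEY, kind TEXT);
        CREATE TABLE relation (src INTEGER, dst INTEGER, label TEXT);
        CREATE INDEX relation_src ON relation (src);
    """)
    conn.executemany(
        "INSERT INTO entity VALUES (?, ?)",
        [(i, "fold" if i % 2 else "domain") for i in range(2000)],
    )
    conn.executemany(
        "INSERT INTO relation VALUES (?, ?, ?)",
        [(i, (i * 7 + 1) % 2000, "has") for i in range(2000)],
    )
    # Assumption: a precomputed table stands in for a materialized view,
    # which SQLite does not provide.
    conn.execute("""
        CREATE TABLE joined AS
        SELECT a.id AS src, a.kind AS src_kind, b.id AS dst, b.kind AS dst_kind
        FROM entity a
        JOIN relation r ON r.src = a.id
        JOIN entity b ON b.id = r.dst
    """)
    return conn


# Two hypothetical formulations of the same question.
QUERIES = {
    "join at query time": """
        SELECT a.id, b.id FROM entity a
        JOIN relation r ON r.src = a.id
        JOIN entity b ON b.id = r.dst
        WHERE a.kind = 'fold' AND b.kind = 'domain'
    """,
    "precomputed table": """
        SELECT src, dst FROM joined
        WHERE src_kind = 'fold' AND dst_kind = 'domain'
    """,
}


def time_query(conn, sql, repeats=20):
    """Run sql `repeats` times; return (median seconds, rows of the last run)."""
    samples = []
    rows = []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), rows


def benchmark(conn, queries, repeats=20):
    """Time every query in `queries`, keyed by its descriptive name."""
    return {name: time_query(conn, sql, repeats) for name, sql in queries.items()}


if __name__ == "__main__":
    conn = build_toy_db()
    for name, (median_s, rows) in benchmark(conn, QUERIES).items():
        print(f"{name}: {len(rows)} rows, median {median_s * 1000:.3f} ms")
```

A table of medians like this, over a representative range of queries and
ideally alongside the same queries run on a rival system, is what would turn
'fast' from an assertion into an evaluated claim. The numbers from a toy such
as this one say nothing about PFDB itself.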

Furthermore, later in the discussion (bottom p1670, 2nd col) we read:

    "Preformulated queries are easy to use, they can be highly optimized to
     guarantee fast response times, and they prevent users from running queries
     that are inefficient and/or require excessive amount of CPU time. However,
     preformulated queries do not offer the kind of flexibility that many users
     desire."

This suggests that you can either have 'fast', provided you stick to
preformulated queries, or have the flexibility of user-designed queries, but at
the risk of slow performance. This observation rather undermines the conjunction
in the main claim above, i.e., that you can have both.

Similar remarks apply to the penultimate paragraph of the Discussion, which says:

    "The absence of atomic-level data from the PFDB points to another of its key
     characteristics. Rather than attempt to be comprehensive, the PFDB is by
     design selective in the data it allows users to search on, preferring
     high-level information to vast quantities of low-level information (such as
     atomic-level data). This selectivity has clear performance benefits."

Again, it seems flexibility has been traded off in favour of efficiency. 


Flexible: The main body of the paper, from the 2nd column of p1666 to the 1st
column of p1668, describes the many uses to which PFDB has been put and the
many other DBs it interacts with. Apart from giving background information about
the importance of the system, does this material provide any evidence for the
main claim? Sort of. We could be generous and argue that it demonstrates
flexibility by:

     "the ability to integrate new biological entities within the existing
      design"

as taken from the second 'because' clause above. No evidence is provided,
however, that this flexibility arises from:

      "The explicit representation of relationships at the meta-level"

which is the reason claimed in that 'because' clause.

Where the advantages of the meta-level representation /are/ discussed is in the
four bullets at the bottom of p1668 and the top of p1669. Here we see examples
of a (presumably) unusually wide range of query types and, in the last bullet, a
claim about the easy introduction of new entities. This is backed up by the
claim in the Discussion that:

      "The changes that need to be made to the PFDB schema in order to establish
       explicit, well-defined relationships to entities in MSD are negligible,
       being confined to a small number of base tables."

The usefulness of the wide range of queries is, however, rather undermined by
the provision of only preformulated queries in the web interface, which is
(presumably) intended for use by the majority of users. A forthcoming /flexible/
interface is promised at the top of p1671, 1st column, which makes one question
whether the claim of flexibility might have been a bit premature. There's no
discussion of how this flexible interface will protect the user from asking
questions requiring "excessive amount of CPU time", which was part of the
justification for restricting users to preformulated queries.


RELATED WORK

There is, essentially, no discussion of related work. This is needed both to
establish the originality of the work and to provide a baseline against which to
assess the speed and flexibility of PFDB. Without this discussion it is not
possible to assess how significant the results are.


YOUR REVIEWS

On the whole, the standard of the reviews was very high. It's good to see
several people improving significantly on their first review performance. 

* Some people missed the main claim about speed and flexibility, but sometimes
  identified subsidiary claims. Nevertheless, they were often able to spot the
  main criticisms about poor evaluation.

* When identifying claims, try to spot and avoid: claims about the context
  within which the work is framed; claims that are well known parts of the
  'folklore' of the field, for which there is usually nothing that can be cited
  because it's considered too obvious to merit publication; descriptions of what
  a system does or how it works.

* Many people omitted 'experimental evidence to support a hypothesis' as one
  kind of contribution. Perhaps this reflected the fact that the authors didn't
  do a very good job of evaluation, but I think they did try, if
  ineffectually.

* Not everyone drew attention to the effective absence of a related work
  discussion. Several people /did/ spot examples of systems PFDB /should/ have
  been compared with, such as SCOP.

* Some of you went to a lot of trouble to find out about the current state of
  PFDB and its rivals. Well done to you, although I wasn't expecting this.