Querying and Storing XML - Spring 2013 - Projects
Projects
The project is the main assessed component of the course. Projects
can take one of two forms:
- Design and development projects involving
two or three students who implement or improve upon an algorithm
or system. Project reports should be around 15-20 pages (for a 2 person
project) or 20-30 pages (for a 3 person project).
- Survey projects involving one student,
involving surveying a number of research papers or using several
systems for the same task and comparing them. These should be
around 15 pages.
It is up to you (and any partners) to set a realistic schedule
for completion of the project, report and presentation. Group
projects must be a joint effort, and each member of
the group should contribute to the final report and participate in
the presentation.
Here is a suggested schedule of milestones for a project:
Week 3 (~ Feb 1) | Identify groups and pick project topics |
Week 5 (~ Feb 15) | Literature review |
Week 7 (March 1) | Draft report/preliminary results |
Week 10-11 (March 26-April 5) | Final project reports and presentations |
Project Selection
Please read over the following project ideas and then send your project preferences to the instructor by
Wednesday, January 30, with the following information:
- List up to three projects you're interested in
- Indicate whether you prefer a group (design/development) or individual (survey/benchmarking) project, or if you don't care.
- List any classmates who you would (or would not) prefer to work with. (This information will be kept confidential.)
- If you are interested in a topic not listed as a possible project area, let me know as soon as possible and I will see whether it makes sense to add it.
I will then make a preliminary allocation of students to
projects. I will prioritize
forming groups based on stated mutual preferences for working
together, then form additional groups from students interested in similar
projects, and finally assign projects to groups. There will be a
chance to adjust the groups/project assignments during week 3 before
they are finalized.
Please feel free to use the mailing list or discuss with others in
class (or the instructor) before sending your preferences. It is
also a good idea to have a look at any suggested readings, which are
linked from the course web page. If you do not send
preferences, you may be assigned to any project.
Design and Development
Team project: a group of 2 or 3 people.
- Project 1: DTD Subsumption
Responsibility: develop an algorithm to decide whether a DTD is subsumed
by another, as defined below.
Background.
A DTD D1 = (E1, r1, P1) subsumes another
DTD D2 = (E2, r2, P2) if there exists a tag renaming
function f: E2 -> E1 such that for any XML tree T
conforming to D2, f(T) conforms to D1.
Objective.
This project is to develop an algorithm that,
given any two DTDs D1 and D2, determines
whether or not D1 subsumes D2.
The final report should consist of (a) a problem statement,
(b) the algorithm, (c) a proof for the correctness of the algorithm,
ie, you need to prove that your algorithm returns true
if and only if D1 subsumes D2,
(d) a complexity analysis of your algorithms, and
(e) an argument showing that your algorithm is efficient.
You may focus on DTDs with productions of the form A -> W, where
W ::=
A1, ..., An | A1 + ... + An
| A* | PCDATA
and Ai is an element tag for 0 <= n.
Investigators (a team of at most two people):
- Project 2:
Validating XML keys.
A relative XML key is defined as (Q, (P, {@A1, ..., @An})),
where the context and target paths Q, P
are XPath expressions defined by:
p::= A | * | p/p | p//p
If the context path Q is empty, the key is called an absolute key.
Find out more about XML keys from "Keys for XML".
- Write a query in XQuery such
that given an XML absolute key k
defined as above and an XML document T, it returns
true if and only if T satisfies k.
- Write a query in XQuery such
that given an XML relative key k
defined as above and an XML document T, it returns
true if and only if T satisfies k.
- Develop an algorithm that, given a set K of
XML relative keys, generates a query Q in XQuery
such that given an XML document T,
Q(T) is true if and only if
T satisfies K.
- Provide a complexity analysis of your algorithm and the query
generated.
- Experimentally evaluate your validating queries with respect to
various XML documents and XML keys.
- What is the complexity of the following problems?
- Given an XML document T, a set K of
XML relative keys, and a DTD D, it is to determine
whether T satisfies K and conforms to D.
- Given an XML document T, a set K of
XML absolute keys, and a DTD D, it is to determine
whether T satisfies K and conforms to D.
If these problems are intractable, develop heuristic algorithms for
them.
Investigators (a team of at most three people):
- Project 3: Tree pattern queries
Responsibility: develop effective algorithms for evaluating
tree pattern queries on directed acyclic graphs (DAGs).
Background.
A number of algorithms have been developed for tree-pattern (twig) queries
defined in terms of:
p::= A | * | p/p | p//p | p[q]
q ::= p = `c'
where 'c' is a constant. A tree-pattern query can be depicted as a tree,
and is evaluated on XML data modeled as a tree.
In practice XML data is often modeled as a graph rather than a tree,
when ID and IDREF attributes are taken into account.
Objective.
This project aims to develop
efficient algorithms for evaluating tree pattern queries on DAGs with
performance guarantee, i.e., traversing each node in the XML "DAG"
no more than twice.
The final report should consist of (a) a problem statement,
(b) your algorithms, (c) a proof for the correctness of the algorithms,
(d) a complexity analysis of your algorithms, and
(e) an experimental study verifying the efficiency of your
algorithms.
Investigators (a team of at most two people):
- Project 4: Querying XML streams
The objective is to extend an XML SAX parser to evaluate
a small fragment of XPath.
Read about evaluating XML path queries over streams, in "Querying XML
Streams" or "An Efficient XPath Query Processor for XML Streams"
The XPath fragment supports downward traversal only and does not allow
predicates. More specifically, it is defined as follows:
p::= A | * | p/p | p//p
Objective.
- Develop an algorithm that, given an XPath query q
defined as above, generates SAX events to evaluate the
query.
- Implement the algorithm on top of any SAX parser of your choice,
to evaluate XPath queries in a single traversal of any XML
document.
- Provide a complexity analysis of your algorithm.
- Experimentally evaluate the algorithm and implementation.
Investigators (a team of at most two people):
- Project 5:
Invertible Shredding of XML data
The objective is store data of an XML
document in relations, without loss of information. Read "Selectively
Storing XML in Relations" and "DTD-Directed Publishing with Attribute Translation Grammars".
- Find a real-life recursive DTD D from the Web.
- Design a relational schema R, in 3NF,
for storing XML documents of the DTD.
- Write an XML2DB mapping f
from documents of D to
database instances of R.
- Write an ATG g from databases of R
to XML instances of DTD D.
- Prove that for any XML instances T of D,
T = g(f(T)), ie, g is the inverse of f;
in other words, the document can be restored from its relational
representation.
- Discuss the complexity of your XML2DB mapping f and
ATG g in terms of the size of XML instances of D.
- Implement your XML2DB mapping and ATG.
- Generate XML documents of various sizes based on
the DTD by using, eg, ToXgene
(http://www.cs.toronto.edu/tox/toxgene/)
- Select data from these documents
via your XML2DB mappings
and store it in a relational database
(using whatever relational DBMS you like).
- Experimentally verify that your ATG can restore the
XML data stored in relations via your XML2DB mapping.
Investigators (a team of at most three people):
- Project 6:
Lossless Shredding of XML data
The objective is store data of an XML
document in relations, without loss of information when XPath
queries are concerned. Read "Selectively
Storing XML in Relations".
- Find a real-life recursive DTD D from the Web.
- Design a relational schema R, in 3NF,
for storing XML documents of the DTD.
- Write an XML2DB mapping f
from documents of D to
database instances of R.
- Develop an algorithm that, given any XPath query posed
on documents of DTD D, translates the query into an
equivalent query in SQL'99 on the corresponding relations
of R. In other words, XPath queries on
the documents can be answered in SQL.
- Implement your XML2DB mapping and your query translation algorithm.
- Generate XML documents of various sizes based on
the DTD by using, eg, ToXgene
(http://www.cs.toronto.edu/tox/toxgene/)
- Select data from these documents
via your XML2DB mappings
and store it in a relational database
(using whatever relational DBMS you like).
- Experimentally verify that your XML2DB mapping and query
translation algorithm are correct, by comparing the results of
XPath queries posed on XML documents with the results of
the SQL queries that are generated by your translation algorithms
and posed on the relations produced by your XML2DB mapping.
You may focus on the following XPath fragment:
p::= A | * | p/p | p//p | p[q]
q ::= @a = `c'
where @a denotes an attribute.
Investigators (a team of at most three people):
- Project 7:
Selective Shredding of XML Data
The objective is to select data from an XML
document and store it in an existing relational database,
by using XML2DB proposed by
"Selectively Storing XML Data in Relations".
- Find a real-life recursive DTD from the Web.
- Design a relational schema, in 3NF,
for storing XML documents of the DTD.
- Write 3 sufficiently diverse XML2DB mappings
from documents of the DTD to its
relational storage.
- Incorporate the mappings into an SAX parser for XML DTD.
- Generate XML documents of various sizes based on
the DTD by using, eg, ToXgene
(http://www.cs.toronto.edu/tox/toxgene/)
- Select data from these documents
via your XML2DB mappings and SAX-implementation,
and store it in a relational database
(using whatever relational DBMS you like).
- Experimentally verify the scalability of your
shredding mappings with respect to
the size of the XML documents generated.
Investigators (a team of at most two people):
- Project 8: Storing DBLP XML records
in relations.
The
DBLP server provides bibliographic information on major
computer science journals and proceedings. It now collects more than
830000 article entries, in
XML records, and is being widely used by computer scientists and
students worldwide.
The objective of this project is to store DBLP XML data in relations
and query DBLP data using relational DBMS.
- Design a relational schema in 3NF (with appropriate functional
dependencies). Your schema should be able to store
the title, authors, conference/journal, year,
page numbers of each entry.
- Write an XML2DB mapping for each set of the following data.
An XML2DB mapping traverses DBLP XML data,
extracts the relevant entries and extends the relations
of your schema using the data:
- Select entries of the top 5 database conferences: SIGMOD,
PODS, VLDB, ICDT, ICDE (ignoring the
pdf files).
- Select entries of major computer science journals:
JACM, SICOMP, JCSS, TCS, TODS.
- Design a DTD to specify an XML document that
consists of all publications in the conferences or journals above.
Based upon the DTD,
write an ATG to generate an XML view T from your
relational database.
- Write an XSLT program to generate a HTML page from
T, such that the page is of the form
http://www.lfcs.inf.ed.ac.uk/research/database/Publications.html.
Investigators (a team of at most two people):
- Project 9: Schema-Directed XML Transformations
Read "DTD-Directed Publishing with Attribute Translation Grammars".
Objective.
The objective of this project is to a systematic method to transform XML
data of a DTD D1 to instances of another DTD
D2. More specifically, it is to design the following:
- A specification language that allows users to
specify what data they want from an XML document of DTD
D1, extract the data, and generate an XML document that
is guaranteed to conform to DTD D2 based on the
extracted information;
- An algorithm for supporting the language such that given
a specification with respect to a source DTD D1 and
a target DTD D2, and any instance T of D1,
it generates an XML document that conforms to D2; and
- Optimization techniques.
The final report should consist of (a) a problem statement,
(b) the specification language, (c) the algorithm, (d)
a proof for the correctness of the algorithm,
(e) optimization techniques, and (f) an experimental study verifying the
effectiveness of the optimization techniques.
Investigators (a team of at most three people):
- Project 10:
Answering XPath queries in SQL.
The objective of this project is to develop a program to translate simple
XPath queries posed on DBLP data
into equivalent SQL queries.
Read "Query Translation from XPath to SQL in the Presence of Recursive
DTDs" and "Accelerating XPath Evaluation in any DBMS".
- Extract the DBLP XML records described in Project 8 above,
and store the records in relations of the following schema:
conference(id, name),
journal(id, name),
publish(author, title, id, year)
where id is the acronym of a conference or journal, e.g., SIGMOD, JACM;
name indicates the full name of a conference or journal, e.g.,
Proceedings of the ACM SIGMOD International Conference on Management of Data,
Journal of the ACM; a tuple (a, t, id, year) in the publish
table means that author a published
a paper with title t, at conference (or in journal) id,
in the particular year.
Use XML2DB mappings to specify XML shredding.
- Write a program that, given an XPath query posed on
the DBLP XML records,
translates it to an equivalent SQL query on the relational database storing
DBLP data. You may only consider XPath queries of the form:
p::= A | * | p/p | p//p | p[q]
q ::= p = `c'
where 'c' is a constant.
- Populate the relations using your XML2DB mapping and DBLP data.
- Experimentally evaluate your translation program
by answering at least 12 various XPath queries
on DBLP XML records. Compare the performance with
its counterpart by evaluating your XPath queries
on the DBLP XML file directly.
Investigators (a team of at most two people):
- Project 11:
Updating XML data using a relational DBMS.
This project is to develop a program to translate simple
XML updates posed on DBLP XML records into relational updates.
Read "Query Translation from XPath to SQL in the Presence of Recursive
DTDs".
- Extract the DBLP XML records described in Project 8 above,
and store the records in relations of the following schema:
conference(id, name)
journal(id, name),
publish(author, title, id, year)
see described in project 10 above.
You may use XML2DB mappings to specify XML shredding.
- Write a program that, given an XML update posed on
the DBLP XML records,
translates the update into an equivalent relational update on the
tables you created.
An XML update is either insert e into p, or delete p,
where e is a constant XML element, and p is an XPath
query in the fragment below:
p::= A | * | p/p | p//p | p[q]
q ::= p = `c'
where 'c' is a constant.
If an input update violates the DBLP DTD or
an equivalent SQL translation does not exist, raise exception.
- Populate the relations as described in Project 10.
- Generate at least 12 XML updates.
- Experimentally evaluating
your view-update program by issuing the XML updates generated
on DBLP XML records. Report the running time.
- Compare the performance with
its counterpart by updating the DBLP XML file directly.
Investigators (a team of at most two people):
- Project 12:
Incremental maintenance of XML views of relational data.
The objective of this project is to develop a program to propagate
changes from relational course data to its XML views.
Read "Incremental evaluation of schema-directed XML
publishing".
- Collect data from the
teaching page of the School of Informatics, and store the data in
a relational database of the following schema
- course(title, cno, type), where type indicates informatics 1,
informatics 2, 3rd year, 4th year, MSc, and MEng.
- prereq(cno1, cno2), which indicates that course cno2 is a prerequisite
of cno1.
- Write an ATG that builds
an XML view of the relational database, such that
the view conforms to the DTD given below (only productions are listed):
db -> course*,
course -> cno, title, type, prerequisite*,
prerequisite -> course,
where prerequisite under each course C should list only direct prerequisites
of C. Note that this is a recursive DTD.
- Write a program that incrementally modify your published
XML view in response to one of the following relational update:
- Insertion of a set of tuples into the course table.
- Insertion of a set of tuples into the prereq table.
- Deletion of a set of tuples from the course table.
- Deletion of a set of tuples from the prereq table.
- Experimentally evaluate the response time of your incremental maintenance
program by testing various insertions and deletions posed
on the relational database, vs. re-computing the views starting from
scratch.
Investigators (a team of at most two people):
- Project 13:
Translating XML View Updates to Relations.
This project is to develop a program to process simple
XML updates posed on XML views published from relational data.
Read "Updating XML Views of Relations".
-
Collect data from the
teaching page of the School of Informatics, and store the data in
a relational database of the following schema
- course(title, cno, type), where type indicates informatics 1,
informatics 2, 3rd year, 4th year, MSc, and MEng.
- prereq(cno1, cno2), which indicates that course cno2 is a prerequisite
of cno1.
- Write an ATG that builds
an XML view of the relational database, such that
the view conforms to the DTD given below (only productions are listed):
db -> course*,
course -> cno, title, type, prerequisite*,
prerequisite -> course,
where prerequisite under each course C should list only direct prerequisites
of C. Note that this is a recursive DTD.
- Write a program that, given an XML update posed on
the ATG view,
translates the update into an equivalent relational update on the
tables you created. That is, when the relational updates
are executed, the new view
published by the ATG from the updated relations
is precisely the one obtained from the old view modified by the
input XML updates. If a translation does not exist, raise exception.
An XML update is either insert e into p, or delete p,
where e is a constant XML element, and p is an XPath
query in the fragment below:
p::= A | * | p/p | p//p | p[q]
q ::= p = `c'
where 'c' is a constant.
- Populate your ATG view based on your relational data.
- Experimentally evaluate your view-update program by issuing various
XML updates on the ATG view.
Investigators (a team of at most two people):
- Project 14: Providing XQuery Engines with
Update Functionality
The objective is to support XML updates on top of XQuery engines.
Approach: Translate XML updates
to XML queries. Read "Querying XML using Update Syntax".
An XML update is specified as either insert e into q, or delete q,
where e is a constant XML element, and q is a query in
XQuery. You may focus on a class of queries q for which you can
find an available parser (eg, XPath or even a fragment of XPath) from the Web.
- Develop an algorithm that, given an XML update defined as above,
rewrites the update into an equivalent
query in XQuery.
- Implement the algorithm on top of any XQuery engine of your choice,
to provide the update functionality. You may use
the XQuery engine
to evaluate the rewritten XML query.
- Generate XML documents of various sizes by using
eg, ToXgene (http://www.cs.toronto.edu/tox/toxgene/), and
design a generator to produce a set of XML updates.
- Experimentally evaluate your algorithm using the XML data and
XML updates generated.
Investigators (a team of at most two people):
- Project 15: Constraint Propagation from Relations to XML
Responsibility: Investigate constraint propagation from relations to its XML
views. Read "Propagating XML Constraints to Relations".
Background. Consider XML publishing via an ATG g
defined via SPJ
queries from relations of schema R to XML trees of
a DTD D. Suppose that
there are keys and foreign keys defined on R. The propagation
problem
is to determine whether an XML key defined on D is propagated
from the keys and foreign keys on R via g.
That is, for any relational instance I of R,
if I satisfies the relational keys and foreign keys,
then g(I) must satisfy the XML key.
Objective. (a) Establish the complexity of the propagation problem.
(b) Develop a heuristic algorithm to derive XML keys from relational
keys and foreign keys via ATG mappings. (c) Verify the correctness
of your algorithm. (d) Experimentally
verify the effectiveness and efficiency of your algorithm.
Investigators (a team of at most two people):
-
Project 16: XML node identification/indexing schemes.
Background. Read "Accelerating XPath Evaluation
in Any DBMS".
The goal of the project is to implement XPath queries over
XML stored as relational data, handling all XPath axes.
Suggested approach: (a) Obtain some XML documents such as
DBLP or IMDB. (b) Design a relational schema that represents XML elements as
rows with appropriate indexes. (c) Implement a SAX parser that traverses the
document and loads it into the database, with appropriate indexes
for each node. (d) Consider downward XPath queries:
p::= A | * | p/p | p//p
Translate such queries to SQL queries. Generate some test queries and
evaluate their performance.
(e) Extend the translation to arbitrary XPath steps; generate
additional test queries and evaluate performance.
(f) Experiment with different choices of relational indexes
(e.g. hash, btree) on the index columns of the relational representation.
Investigators (a team of at most two people):
-
Project 17: Dynamic XML node identification/indexing schemes.
Background. Read "DDE: From Dewey to a Fully Dynamic XML
Labeling Scheme". For this project you should be (or become) familiar with how to
extend a relational database such as PostgreSQL or MySQL with user-defined functions.
The goal of the project is to implement and evaluate a dynamic XML node
identification/indexing schemes, taking both query and update
performance into account. An indexing scheme is dynamic if the
node ids can be efficiently maintained when the XML they represent
is updated.
Suggested approach: (a) Obtain some XML documents such as
DBLP or IMDB. (b) Design a relational schema that represents XML elements as
rows with appropriate indexes. (c) Implement a SAX parser that traverses the
document and loads it into the database, with appropriate indexes
for each node. (d) Consider downward XPath queries:
p::= A | * | p/p | p//p
Translate such queries to SQL queries. Generate some test queries and
evaluate their performance.
(e) Consider updates based on XPath of the form:
u::= insert xml as (first|last) into p | delete p
that either insert an XML tree as a child of the node(s) selected by a
path expression p (in
first or last position) or delete all nodes matching p.
(f) Extend the update translation to handle other updates, such as:
u ::= ... insert xml (before|after) p | replace p with xml |
rename p to A
Generate additional sample updates using these operations and
evaluate their performance.
Investigators (a team of at most three people):
-
Project 18: Information extraction from (X)HTML pages.
Background. Read WebDAM chapter 10 "Wrappers and Data
Extraction with XSLT" and "OXPath: A
language for scalable, memory-efficient data extraction from Web applications"
The goal of the project is to extract structured information from HTML
pages, as obtained from some Web resource, such as the Informatics
course pages, books on Amazon, Edinburgh Festival events, job
postings, etc.
Suggested approach: (a) Build a corpus of (static) HTML pages
concerning structured data in some domain. (b) Find an appropriate
tool for converting HTML (which may not be well-formed XML) to
XHTML, such as HTML Tidy, and use it to generate clean XHTML
versions of the corpus. (c) Define a schema for the data structures
represented by the web pages.
(d)
Use XSLT, XQuery, or any other
appropriate tools to convert each page into an XML tree containing
the data.
(e) Using the extracted data, write XSLT or XQuery code that
re-generates HTML similar to the original pages.
(f) Following a similar evaluation strategy to the "OXPath" paper,
experimentally evaluate your extraction tool.
(g) If time permits, experiment with the open-source OXPath system. Can
your extraction technique be implemented more efficiently using
this system?
Investigators (a team of at most two people):
-
Project 19: Typed XML projection.
Background. Read "Type-based XML projection"
The goal of the project is to implement and evaluate type-based
techniques for discarding parts of the XML document that are not
needed for evaluating a query.
Suggested approach: (a) Obtain some XML and accompanying
DTD, ideally with nontrivial (i.e. non-flat) structure. (b)
Implement a SAX parser that takes a set of element names S
to
discard, and produces a new XML document such that each
subtree whose root node is in S is removed. This is
called the S-projection of the document.
(c)
Consider XPath queries of the form:
p ::= A | * | p/p | p//p | p[q]
q ::= p = 'a'
Following the idea in the above paper, define a function that
analyzes a path / filter and calculates a set S of element
names that can be safely discarded: that is, evaluating q
on T produces the same result as evaluating q on
the S-projection of T.
Investigators (a team of at most three people):
Survey
These are individual projects rather than team efforts.
The investigator of each project
needs to select five recent
research papers (
not on
the reading list of this course) or
commercial systems (or fully implemented research prototype systems)
on the topic chosen, and carefully evaluate the papers or systems.
The final
report should consist of (a) a clear problem statement;
(b) a set of criteria for the evaluation,
(c) justification for the
criteria; and
(d) evaluation and assessment of the papers or systems selected,
based on the criteria you determined.
Topics include, but are not limited to the following:
- Project 20: XML applications in linguistics or humanities
Investigators:
- Project 21: XML data generators and benchmarks
Investigators:
- Project 22: Native XML stores
Investigators:
- Project 23: XQuery engines
Investigators:
- Project 24: Integrity constraints for XML
Investigators:
- Project 25: Specifications of XML data: Schema
Investigators:
- Project 26: XML data compression
Investigators:
- Project 27: XML shredding in commercial systems
Investigators:
- Project 28: XML publishing in commercial systems
Investigators:
- Project 29: Querying XML data stored in relations
Investigators:
- Project 30: XML updates
Investigators:
- Project 31: Incremental maintenance of XML views
Investigators:
- Project 32: XML view updates/bidirectional transformations
Investigators:
- Project 33: XML publish-subscribe systems
Investigators:
- Project 34: Access control for XML data
Investigators: