Querying and Storing XML - Spring 2013 - Projects

Projects

The project is the main assessed component of the course. Projects can take one of two forms:

Design and development projects involving two or three students who implement or improve upon an algorithm or system. Project reports should be around 15-20 pages (for a 2 person project) or 20-30 pages (for a 3 person project).
Survey projects involving one student, involving surveying a number of research papers or using several systems for the same task and comparing them. These should be around 15 pages.

It is up to you (and any partners) to set a realistic schedule for completion of the project, report and presentation. Group projects must be a joint effort, and each member of the group should contribute to the final report and participate in the presentation. Here is a suggested schedule of milestones for a project:

Week 3 (~ Feb 1)	Identify groups and pick project topics
Week 5 (~ Feb 15)	Literature review
Week 7 (March 1)	Draft report/preliminary results
Week 10-11 (March 26-April 5)	Final project reports and presentations

Project Selection

Please read over the following project ideas and then send your project preferences to the instructor by Wednesday, January 30, with the following information:

List up to three projects you're interested in
Indicate whether you prefer a group (design/development) or individual (survey/benchmarking) project, or if you don't care.
List any classmates who you would (or would not) prefer to work with. (This information will be kept confidential.)
If you are interested in a topic not listed as a possible project area, let me know as soon as possible and I will see whether it makes sense to add it.

I will then make a preliminary allocation of students to projects. I will prioritize forming groups based on stated mutual preferences for working together, then form additional groups from students interested in similar projects, and finally assign projects to groups. There will be a chance to adjust the groups/project assignments during week 3 before they are finalized.

Please feel free to use the mailing list or discuss with others in class (or the instructor) before sending your preferences. It is also a good idea to have a look at any suggested readings, which are linked from the course web page. If you do not send preferences, you may be assigned to any project.

Design and Development

Team project: a group of 2 or 3 people.

Project 1: DTD Subsumption
Responsibility: develop an algorithm to decide whether a DTD is subsumed by another, as defined below.
Background. A DTD D1 = (E1, r1, P1) subsumes another DTD D2 = (E2, r2, P2) if there exists a tag renaming function f: E2 -> E1 such that for any XML tree T conforming to D2, f(T) conforms to D1.
Objective. This project is to develop an algorithm that, given any two DTDs D1 and D2, determines whether or not D1 subsumes D2.
The final report should consist of (a) a problem statement, (b) the algorithm, (c) a proof for the correctness of the algorithm, ie, you need to prove that your algorithm returns true if and only if D1 subsumes D2, (d) a complexity analysis of your algorithms, and (e) an argument showing that your algorithm is efficient.
You may focus on DTDs with productions of the form A -> W, where
W ::= A1, ..., An | A1 + ... + An | A* | PCDATA
and Ai is an element tag for 0 <= n.
Investigators (a team of at most two people):
Project 2: Validating XML keys.
A relative XML key is defined as (Q, (P, {@A1, ..., @An})), where the context and target paths Q, P are XPath expressions defined by:
p::= A | * | p/p | p//p
If the context path Q is empty, the key is called an absolute key. Find out more about XML keys from "Keys for XML".
- Write a query in XQuery such that given an XML absolute key k defined as above and an XML document T, it returns true if and only if T satisfies k.
- Write a query in XQuery such that given an XML relative key k defined as above and an XML document T, it returns true if and only if T satisfies k.
- Develop an algorithm that, given a set K of XML relative keys, generates a query Q in XQuery such that given an XML document T, Q(T) is true if and only if T satisfies K.
- Provide a complexity analysis of your algorithm and the query generated.
- Experimentally evaluate your validating queries with respect to various XML documents and XML keys.
- What is the complexity of the following problems?
  - Given an XML document T, a set K of XML relative keys, and a DTD D, it is to determine whether T satisfies K and conforms to D.
  - Given an XML document T, a set K of XML absolute keys, and a DTD D, it is to determine whether T satisfies K and conforms to D.
  If these problems are intractable, develop heuristic algorithms for them.
Investigators (a team of at most three people):
Project 3: Tree pattern queries
Responsibility: develop effective algorithms for evaluating tree pattern queries on directed acyclic graphs (DAGs).
Background. A number of algorithms have been developed for tree-pattern (twig) queries defined in terms of:
p::= A | * | p/p | p//p | p[q]
q ::= p = `c'
where 'c' is a constant. A tree-pattern query can be depicted as a tree, and is evaluated on XML data modeled as a tree.
In practice XML data is often modeled as a graph rather than a tree, when ID and IDREF attributes are taken into account.
Objective. This project aims to develop efficient algorithms for evaluating tree pattern queries on DAGs with performance guarantee, i.e., traversing each node in the XML "DAG" no more than twice.
The final report should consist of (a) a problem statement, (b) your algorithms, (c) a proof for the correctness of the algorithms, (d) a complexity analysis of your algorithms, and (e) an experimental study verifying the efficiency of your algorithms.
Investigators (a team of at most two people):
Project 4: Querying XML streams
The objective is to extend an XML SAX parser to evaluate a small fragment of XPath.
Read about evaluating XML path queries over streams, in "Querying XML Streams" or "An Efficient XPath Query Processor for XML Streams"
The XPath fragment supports downward traversal only and does not allow predicates. More specifically, it is defined as follows:
p::= A | * | p/p | p//p
Objective.
- Develop an algorithm that, given an XPath query q defined as above, generates SAX events to evaluate the query.
- Implement the algorithm on top of any SAX parser of your choice, to evaluate XPath queries in a single traversal of any XML document.
- Provide a complexity analysis of your algorithm.
- Experimentally evaluate the algorithm and implementation.
Investigators (a team of at most two people):
Project 5: Invertible Shredding of XML data
The objective is store data of an XML document in relations, without loss of information. Read "Selectively Storing XML in Relations" and "DTD-Directed Publishing with Attribute Translation Grammars".
- Find a real-life recursive DTD D from the Web.
- Design a relational schema R, in 3NF, for storing XML documents of the DTD.
- Write an XML2DB mapping f from documents of D to database instances of R.
- Write an ATG g from databases of R to XML instances of DTD D.
- Prove that for any XML instances T of D, T = g(f(T)), ie, g is the inverse of f; in other words, the document can be restored from its relational representation.
- Discuss the complexity of your XML2DB mapping f and ATG g in terms of the size of XML instances of D.
- Implement your XML2DB mapping and ATG.
- Generate XML documents of various sizes based on the DTD by using, eg, ToXgene (http://www.cs.toronto.edu/tox/toxgene/)
- Select data from these documents via your XML2DB mappings and store it in a relational database (using whatever relational DBMS you like).
- Experimentally verify that your ATG can restore the XML data stored in relations via your XML2DB mapping.
Investigators (a team of at most three people):
Project 6: Lossless Shredding of XML data
The objective is store data of an XML document in relations, without loss of information when XPath queries are concerned. Read "Selectively Storing XML in Relations".
- Find a real-life recursive DTD D from the Web.
- Design a relational schema R, in 3NF, for storing XML documents of the DTD.
- Write an XML2DB mapping f from documents of D to database instances of R.
- Develop an algorithm that, given any XPath query posed on documents of DTD D, translates the query into an equivalent query in SQL'99 on the corresponding relations of R. In other words, XPath queries on the documents can be answered in SQL.
- Implement your XML2DB mapping and your query translation algorithm.
- Generate XML documents of various sizes based on the DTD by using, eg, ToXgene (http://www.cs.toronto.edu/tox/toxgene/)
- Select data from these documents via your XML2DB mappings and store it in a relational database (using whatever relational DBMS you like).
- Experimentally verify that your XML2DB mapping and query translation algorithm are correct, by comparing the results of XPath queries posed on XML documents with the results of the SQL queries that are generated by your translation algorithms and posed on the relations produced by your XML2DB mapping.
  You may focus on the following XPath fragment:
  p::= A | * | p/p | p//p | p[q]
  q ::= @a = `c'
  where @a denotes an attribute.
Investigators (a team of at most three people):
Project 7: Selective Shredding of XML Data
The objective is to select data from an XML document and store it in an existing relational database, by using XML2DB proposed by "Selectively Storing XML Data in Relations".
- Find a real-life recursive DTD from the Web.
- Design a relational schema, in 3NF, for storing XML documents of the DTD.
- Write 3 sufficiently diverse XML2DB mappings from documents of the DTD to its relational storage.
- Incorporate the mappings into an SAX parser for XML DTD.
- Generate XML documents of various sizes based on the DTD by using, eg, ToXgene (http://www.cs.toronto.edu/tox/toxgene/)
- Select data from these documents via your XML2DB mappings and SAX-implementation, and store it in a relational database (using whatever relational DBMS you like).
- Experimentally verify the scalability of your shredding mappings with respect to the size of the XML documents generated.
Investigators (a team of at most two people):
Project 8: Storing DBLP XML records in relations.
The DBLP server provides bibliographic information on major computer science journals and proceedings. It now collects more than 830000 article entries, in XML records, and is being widely used by computer scientists and students worldwide.
The objective of this project is to store DBLP XML data in relations and query DBLP data using relational DBMS.
- Design a relational schema in 3NF (with appropriate functional dependencies). Your schema should be able to store the title, authors, conference/journal, year, page numbers of each entry.
- Write an XML2DB mapping for each set of the following data. An XML2DB mapping traverses DBLP XML data, extracts the relevant entries and extends the relations of your schema using the data:
  - Select entries of the top 5 database conferences: SIGMOD, PODS, VLDB, ICDT, ICDE (ignoring the pdf files).
  - Select entries of major computer science journals: JACM, SICOMP, JCSS, TCS, TODS.
- Design a DTD to specify an XML document that consists of all publications in the conferences or journals above. Based upon the DTD, write an ATG to generate an XML view T from your relational database.
- Write an XSLT program to generate a HTML page from T, such that the page is of the form http://www.lfcs.inf.ed.ac.uk/research/database/Publications.html.
Investigators (a team of at most two people):
Project 9: Schema-Directed XML Transformations
Read "DTD-Directed Publishing with Attribute Translation Grammars". Objective. The objective of this project is to a systematic method to transform XML data of a DTD D1 to instances of another DTD D2. More specifically, it is to design the following:
- A specification language that allows users to specify what data they want from an XML document of DTD D1, extract the data, and generate an XML document that is guaranteed to conform to DTD D2 based on the extracted information;
- An algorithm for supporting the language such that given a specification with respect to a source DTD D1 and a target DTD D2, and any instance T of D1, it generates an XML document that conforms to D2; and
- Optimization techniques.
The final report should consist of (a) a problem statement, (b) the specification language, (c) the algorithm, (d) a proof for the correctness of the algorithm, (e) optimization techniques, and (f) an experimental study verifying the effectiveness of the optimization techniques.
Investigators (a team of at most three people):
Project 10: Answering XPath queries in SQL.
The objective of this project is to develop a program to translate simple XPath queries posed on DBLP data into equivalent SQL queries.
Read "Query Translation from XPath to SQL in the Presence of Recursive DTDs" and "Accelerating XPath Evaluation in any DBMS".
- Extract the DBLP XML records described in Project 8 above, and store the records in relations of the following schema:
  conference(id, name),
  journal(id, name),
  publish(author, title, id, year)
  where id is the acronym of a conference or journal, e.g., SIGMOD, JACM; name indicates the full name of a conference or journal, e.g., Proceedings of the ACM SIGMOD International Conference on Management of Data, Journal of the ACM; a tuple (a, t, id, year) in the publish table means that author a published a paper with title t, at conference (or in journal) id, in the particular year.
  Use XML2DB mappings to specify XML shredding.
- Write a program that, given an XPath query posed on the DBLP XML records, translates it to an equivalent SQL query on the relational database storing DBLP data. You may only consider XPath queries of the form:
  p::= A | * | p/p | p//p | p[q]
  q ::= p = `c'
  where 'c' is a constant.
- Populate the relations using your XML2DB mapping and DBLP data.
- Experimentally evaluate your translation program by answering at least 12 various XPath queries on DBLP XML records. Compare the performance with its counterpart by evaluating your XPath queries on the DBLP XML file directly.
Investigators (a team of at most two people):
Project 11: Updating XML data using a relational DBMS.
This project is to develop a program to translate simple XML updates posed on DBLP XML records into relational updates.
Read "Query Translation from XPath to SQL in the Presence of Recursive DTDs".
- Extract the DBLP XML records described in Project 8 above, and store the records in relations of the following schema:
  conference(id, name)
  journal(id, name),
  publish(author, title, id, year)
  see described in project 10 above. You may use XML2DB mappings to specify XML shredding.
- Write a program that, given an XML update posed on the DBLP XML records, translates the update into an equivalent relational update on the tables you created.
  An XML update is either insert e into p, or delete p, where e is a constant XML element, and p is an XPath query in the fragment below:
  p::= A | * | p/p | p//p | p[q]
  q ::= p = `c'
  where 'c' is a constant. If an input update violates the DBLP DTD or an equivalent SQL translation does not exist, raise exception.
- Populate the relations as described in Project 10.
- Generate at least 12 XML updates.
- Experimentally evaluating your view-update program by issuing the XML updates generated on DBLP XML records. Report the running time.
- Compare the performance with its counterpart by updating the DBLP XML file directly.
Investigators (a team of at most two people):
Project 12: Incremental maintenance of XML views of relational data.
The objective of this project is to develop a program to propagate changes from relational course data to its XML views.
Read "Incremental evaluation of schema-directed XML publishing".
- Collect data from the teaching page of the School of Informatics, and store the data in a relational database of the following schema
  - course(title, cno, type), where type indicates informatics 1, informatics 2, 3rd year, 4th year, MSc, and MEng.
  - prereq(cno1, cno2), which indicates that course cno2 is a prerequisite of cno1.
- Write an ATG that builds an XML view of the relational database, such that the view conforms to the DTD given below (only productions are listed):
  db -> course*,
  course -> cno, title, type, prerequisite*,
  prerequisite -> course,
  where prerequisite under each course C should list only direct prerequisites of C. Note that this is a recursive DTD.
- Write a program that incrementally modify your published XML view in response to one of the following relational update:
  - Insertion of a set of tuples into the course table.
  - Insertion of a set of tuples into the prereq table.
  - Deletion of a set of tuples from the course table.
  - Deletion of a set of tuples from the prereq table.
- Experimentally evaluate the response time of your incremental maintenance program by testing various insertions and deletions posed on the relational database, vs. re-computing the views starting from scratch.
Investigators (a team of at most two people):
Project 13: Translating XML View Updates to Relations.
This project is to develop a program to process simple XML updates posed on XML views published from relational data.
Read "Updating XML Views of Relations".
- Collect data from the teaching page of the School of Informatics, and store the data in a relational database of the following schema
  - course(title, cno, type), where type indicates informatics 1, informatics 2, 3rd year, 4th year, MSc, and MEng.
  - prereq(cno1, cno2), which indicates that course cno2 is a prerequisite of cno1.
- Write an ATG that builds an XML view of the relational database, such that the view conforms to the DTD given below (only productions are listed):
  db -> course*,
  course -> cno, title, type, prerequisite*,
  prerequisite -> course,
  where prerequisite under each course C should list only direct prerequisites of C. Note that this is a recursive DTD.
- Write a program that, given an XML update posed on the ATG view, translates the update into an equivalent relational update on the tables you created. That is, when the relational updates are executed, the new view published by the ATG from the updated relations is precisely the one obtained from the old view modified by the input XML updates. If a translation does not exist, raise exception.
  An XML update is either insert e into p, or delete p, where e is a constant XML element, and p is an XPath query in the fragment below:
  p::= A | * | p/p | p//p | p[q]
  q ::= p = `c'
  where 'c' is a constant.
- Populate your ATG view based on your relational data.
- Experimentally evaluate your view-update program by issuing various XML updates on the ATG view.
Investigators (a team of at most two people):
Project 14: Providing XQuery Engines with Update Functionality
The objective is to support XML updates on top of XQuery engines.
Approach: Translate XML updates to XML queries. Read "Querying XML using Update Syntax".
An XML update is specified as either insert e into q, or delete q, where e is a constant XML element, and q is a query in XQuery. You may focus on a class of queries q for which you can find an available parser (eg, XPath or even a fragment of XPath) from the Web.
- Develop an algorithm that, given an XML update defined as above, rewrites the update into an equivalent query in XQuery.
- Implement the algorithm on top of any XQuery engine of your choice, to provide the update functionality. You may use the XQuery engine to evaluate the rewritten XML query.
- Generate XML documents of various sizes by using eg, ToXgene (http://www.cs.toronto.edu/tox/toxgene/), and design a generator to produce a set of XML updates.
- Experimentally evaluate your algorithm using the XML data and XML updates generated.
Investigators (a team of at most two people):
Project 15: Constraint Propagation from Relations to XML
Responsibility: Investigate constraint propagation from relations to its XML views. Read "Propagating XML Constraints to Relations".
Background. Consider XML publishing via an ATG g defined via SPJ queries from relations of schema R to XML trees of a DTD D. Suppose that there are keys and foreign keys defined on R. The propagation problem is to determine whether an XML key defined on D is propagated from the keys and foreign keys on R via g. That is, for any relational instance I of R, if I satisfies the relational keys and foreign keys, then g(I) must satisfy the XML key.
Objective. (a) Establish the complexity of the propagation problem. (b) Develop a heuristic algorithm to derive XML keys from relational keys and foreign keys via ATG mappings. (c) Verify the correctness of your algorithm. (d) Experimentally verify the effectiveness and efficiency of your algorithm.
Investigators (a team of at most two people):
Project 16: XML node identification/indexing schemes.
Background. Read "Accelerating XPath Evaluation in Any DBMS".
The goal of the project is to implement XPath queries over XML stored as relational data, handling all XPath axes.
Suggested approach: (a) Obtain some XML documents such as DBLP or IMDB. (b) Design a relational schema that represents XML elements as rows with appropriate indexes. (c) Implement a SAX parser that traverses the document and loads it into the database, with appropriate indexes for each node. (d) Consider downward XPath queries:
p::= A | * | p/p | p//p
Translate such queries to SQL queries. Generate some test queries and evaluate their performance. (e) Extend the translation to arbitrary XPath steps; generate additional test queries and evaluate performance. (f) Experiment with different choices of relational indexes (e.g. hash, btree) on the index columns of the relational representation.
Investigators (a team of at most two people):
Project 17: Dynamic XML node identification/indexing schemes.
Background. Read "DDE: From Dewey to a Fully Dynamic XML Labeling Scheme". For this project you should be (or become) familiar with how to extend a relational database such as PostgreSQL or MySQL with user-defined functions.
The goal of the project is to implement and evaluate a dynamic XML node identification/indexing schemes, taking both query and update performance into account. An indexing scheme is dynamic if the node ids can be efficiently maintained when the XML they represent is updated.
Suggested approach: (a) Obtain some XML documents such as DBLP or IMDB. (b) Design a relational schema that represents XML elements as rows with appropriate indexes. (c) Implement a SAX parser that traverses the document and loads it into the database, with appropriate indexes for each node. (d) Consider downward XPath queries:
p::= A | * | p/p | p//p
Translate such queries to SQL queries. Generate some test queries and evaluate their performance. (e) Consider updates based on XPath of the form:
u::= insert xml as (first|last) into p | delete p
that either insert an XML tree as a child of the node(s) selected by a path expression p (in first or last position) or delete all nodes matching p. (f) Extend the update translation to handle other updates, such as:
u ::= ... insert xml (before|after) p | replace p with xml | rename p to A
Generate additional sample updates using these operations and evaluate their performance.
Investigators (a team of at most three people):
Project 18: Information extraction from (X)HTML pages.
Background. Read WebDAM chapter 10 "Wrappers and Data Extraction with XSLT" and "OXPath: A language for scalable, memory-efficient data extraction from Web applications"
The goal of the project is to extract structured information from HTML pages, as obtained from some Web resource, such as the Informatics course pages, books on Amazon, Edinburgh Festival events, job postings, etc.
Suggested approach: (a) Build a corpus of (static) HTML pages concerning structured data in some domain. (b) Find an appropriate tool for converting HTML (which may not be well-formed XML) to XHTML, such as HTML Tidy, and use it to generate clean XHTML versions of the corpus. (c) Define a schema for the data structures represented by the web pages. (d) Use XSLT, XQuery, or any other appropriate tools to convert each page into an XML tree containing the data. (e) Using the extracted data, write XSLT or XQuery code that re-generates HTML similar to the original pages. (f) Following a similar evaluation strategy to the "OXPath" paper, experimentally evaluate your extraction tool. (g) If time permits, experiment with the open-source OXPath system. Can your extraction technique be implemented more efficiently using this system?
Investigators (a team of at most two people):
Project 19: Typed XML projection.
Background. Read "Type-based XML projection"
The goal of the project is to implement and evaluate type-based techniques for discarding parts of the XML document that are not needed for evaluating a query.
Suggested approach: (a) Obtain some XML and accompanying DTD, ideally with nontrivial (i.e. non-flat) structure. (b) Implement a SAX parser that takes a set of element names S to discard, and produces a new XML document such that each subtree whose root node is in S is removed. This is called the S-projection of the document. (c) Consider XPath queries of the form:
p ::= A | * | p/p | p//p | p[q]
q ::= p = 'a'
Following the idea in the above paper, define a function that analyzes a path / filter and calculates a set S of element names that can be safely discarded: that is, evaluating q on T produces the same result as evaluating q on the S-projection of T.
Investigators (a team of at most three people):

Survey

These are individual projects rather than team efforts. The investigator of each project needs to select five recent research papers (not on the reading list of this course) or commercial systems (or fully implemented research prototype systems) on the topic chosen, and carefully evaluate the papers or systems. The final report should consist of (a) a clear problem statement; (b) a set of criteria for the evaluation, (c) justification for the criteria; and (d) evaluation and assessment of the papers or systems selected, based on the criteria you determined. Topics include, but are not limited to the following:

Project 20: XML applications in linguistics or humanities
Investigators:
Project 21: XML data generators and benchmarks
Investigators:
Project 22: Native XML stores
Investigators:
Project 23: XQuery engines
Investigators:
Project 24: Integrity constraints for XML
Investigators:
Project 25: Specifications of XML data: Schema
Investigators:
Project 26: XML data compression
Investigators:
Project 27: XML shredding in commercial systems
Investigators:
Project 28: XML publishing in commercial systems
Investigators:
Project 29: Querying XML data stored in relations
Investigators:
Project 30: XML updates
Investigators:
Project 31: Incremental maintenance of XML views
Investigators:
Project 32: XML view updates/bidirectional transformations
Investigators:
Project 33: XML publish-subscribe systems
Investigators:
Project 34: Access control for XML data
Investigators:

Home : Teaching : Courses