MASWS Coursework 2012/13
Author: Fiona McNeill
Date: 2012-03-02
By successfully completing these coursework assignments, you will gain
the ability to carry out the following tasks:
- design simple data query and analysis techniques
- choose appropriate methods / tools for converting legacy data into RDF
- design and implement appropriate data representations within an RDF data model
- explore available RDF schemas for representing a given dataset
- gain practical experience of using Turtle as an RDF serialization
- design and evaluate SPARQL queries
- evaluate pros and cons of native and RDF-based query techniques
What | How | When
Assignment 1: source data check | email to masws00@gmail.com | 4.00 pm, Monday 28th January [Week 3]
Assignment 1 part 1 | submit masws 1 masws-source-data.pdf | 4.00 pm, Friday 8th February [Week 4]
Assignment 1 part 2 | submit masws 1 masws-conversion.pdf | 4.00 pm, Monday 25th February [Week 7]
Assignment 2 | submit masws 2 masws-query.pdf | 4.00 pm, Friday 8th March [Week 8]
Assignment 3 (MSc only) | submit masws 3 masws-tools.pdf | 4.00 pm, Friday 15th March [Week 9]
This coursework involves (i) converting some pre-existing dataset into
RDF, and (ii) querying the RDF triples via the SPARQL language.
Before starting the assignments, you should decide whether you wish
to carry out the work as part of a team. A team can have from two to
four members. The advantages of working in a team are that you get to
share ideas and divide up the work; the potential drawbacks are that
you need to synchronize your efforts with each other and that
communication carries an overhead. Every team should choose a
coordinator. The coordinator will be responsible for submitting the
results of the team's work and for ensuring that any other required
communication takes place in a timely fashion.
You are not obliged to work as part of a team. To simplify the
discussion below, an individual will be considered to be a team of
one member who is also the 'team coordinator'.
Every submission by a team should
contain a statement indicating the relative contributions of each team
member. Every member of the team will receive the full marks assigned for
the submission as a whole, scaled by their agreed relative contribution.
More specifically, each submission should be accompanied by a list of
the team members, together with an agreed ranking of the contribution
that each member made to the work and the report. This estimate should be a
number in the range 1..5, where 1 represents minimal contribution and
5 represents a full contribution. If N is the total mark
awarded for the submission, and IR is the individual ranking
of a given student s, then the mark awarded to s
will be:
N * IR/5
So, for example, if the total awarded for the assignment is 50, and Ann gets a
ranking of 5, while Bill gets a ranking of 3, then the marks awarded
to Ann and Bill will be 50 and 30 respectively.
Groups should carry out a discussion about the relative contributions
of individual team members and try to reach a consensus about the
individual rankings to be assigned. If there is a failure to reach
consensus, then you should note this on the submission and also report
the problem to me.
To start off, each team will be required to identify some data that is
freely available and is suitable for processing as described in Part 2
below. Your source data can be in any format, but it is probably
best if you stick to the following formats: HTML, CSV, XML, or
RDB. The source dataset should not be data that you have created
yourself and it should not be data which is already available as RDF
data. (It is your responsibility to verify that the second condition
is true.) Here are some suggestions for sourcing open data.
There are many, many tools available for converting from diverse data
structures into RDF. Here are some links which aggregate lists of such
tools:
You could also write your own converter. For example scripts (in PHP,
Ruby and Python) to use as a starting point, see:
For cleaning up data prior to conversion, consider Google Refine or
Stanford DataWrangler.
Having selected a candidate dataset (and thought about a proposed
conversion method), the team coordinator should send an email to
masws00@gmail.com which describes the data, and either provides a link
to it or sends the dataset as an attachment. In addition, please
include a list of all the team members (plus matriculation numbers)
and state clearly who the coordinator is. This should be done
by 4.00 pm on Monday 28th January; your choice of data will have to be
approved before you progress further. I will respond to your email no
later than Wednesday 1st February.
Only once you have received approval should you write a report of around
1,000 words which addresses the following issues:
- A brief description of the proposed dataset, and reasons
why it might be useful to make it available in RDF format.
- Proposed methods and tools for converting the source dataset into
RDF; be as specific as you can.
- Example of the kinds of queries that you would expect to be able to
make.
- A brief description of any difficulties that you might have to deal
with.
Submission should be an electronic document in PDF format (not a
Microsoft Word document!!!), submitted via the on-line submit system as follows:
submit masws 1 masws-source-data.pdf
Deadline: 4.00 pm, Friday 8th February
Marks: 20
In this task, you will convert the data into RDF triples in Turtle format, using your chosen
methods and tools. You should as far as possible use existing
vocabularies for creating the URIs in your RDF.
Google will often find something relevant; for example, if you want to
find a vocabulary about TV programs, then a query such as "tv program
rdf vocabulary" or "tv program ontology" is likely to get you some
useful results.
If after diligent investigation you come to the conclusion that there
is no suitable vocabulary already
available, you can coin your own URIs. You can choose your own
namespace, but http://vocab.inf.ed.ac.uk is a reasonable default. Verify that your RDF is
valid using the Jena eyeball tool (available on DICE).
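As a sketch of what vocabulary reuse might look like, here is a hypothetical Turtle fragment describing a TV programme using the Dublin Core and FOAF vocabularies. The ex: namespace, the resource identifiers, and the property values are all invented for illustration:

```turtle
@prefix dc:   <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://vocab.inf.ed.ac.uk/example/> .

# A hypothetical resource; the ex: identifiers are invented.
ex:programme42
    dc:title       "The Blue Planet" ;
    dc:date        "2001-09-12" ;
    dc:description "A documentary series on marine life." ;
    foaf:maker     ex:bbc .

ex:bbc a foaf:Organization ;
    foaf:name "British Broadcasting Corporation" .
```

Note that only the subject URIs are coined locally; the predicates and classes all come from widely deployed 3rd-party vocabularies, which is what makes the data easy for others to consume.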
It is often easier to 'see' what is going on in an RDF graph by
visualizing it. Here are some pointers to available tools, in addition
to resources mentioned in the lectures:
Some of these tools are available as plugins to Protege, which is a
widely used tool for developing ontologies in RDF and OWL. Protege is
available on DICE, but is also simple to install on your own computer.
When you have completed this task, write a report of about 2,500 words
which addresses the following issues:
- Explain what steps you took to transform your source data into RDF,
and describe any conceptual or engineering issues that arose.
- Describe and justify any 3rd-party schemas that you used in your
RDF data.
- If you need to add some vocabulary of your own, explain why it is
necessary, and explain the intended meaning of each vocabulary
term. If this vocabulary is more than a few classes or properties,
then make it available as an independent file.
- Provide a visualization (as a graph) of the RDF schema (i.e., your
own plus 3rd-party) that you have used.
- Either attach your transformed RDF output, or (better) supply a URL for
where it can be retrieved. In the latter case, please verify that the
data is indeed accessible over HTTP.
Submission should be an electronic document in PDF format, submitted via the on-line submit system as follows:
submit masws 1 masws-conversion.pdf
Deadline: 4.00 pm, Monday 25th February
Marks: 30
Design at least four SPARQL queries that provide a representative sample
of information that can be extracted from your dataset. Evaluate the
queries against your dataset and store the results.
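For instance, assuming a hypothetical dataset of TV programmes described with Dublin Core and FOAF terms (the data and property choices here are invented for illustration), one representative query might list each programme together with the organization that made it:

```sparql
PREFIX dc:   <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# List programmes with the names of their makers, most recent first.
# The dataset and its vocabulary choices are hypothetical.
SELECT ?title ?makerName
WHERE {
  ?programme dc:title   ?title ;
             dc:date    ?date ;
             foaf:maker ?maker .
  ?maker     foaf:name  ?makerName .
}
ORDER BY DESC(?date)
LIMIT 10
```

A representative set of queries would go beyond simple lookups like this one, exercising joins across several resources, optional patterns, filtering, and aggregation as appropriate to your data.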
When you have completed this task, write a report of about 2,500 words
which addresses the following issues:
- List the SPARQL queries that you wrote, and justify why they are representative.
- Execute your queries against your RDF dataset, using a standard
SPARQL query engine. Include in your report the result
set for each query, limiting the set to no more than 10 results for each query.
- Consider whether (and if applicable, how) the results you have
obtained could have been extracted from the source dataset using
some other programmatic query technique. What benefit, if any,
results from the RDF + SPARQL combination?
Marks for (1) and (2) will take into account both the way in which
you communicate your understanding of your data and the way in which
you demonstrate your understanding and knowledge of SPARQL.
In addressing issue (3), you may want to consider the benefits of
federated query, where you in effect query the result of merging your
RDF graph with the graph provided by a 3rd-party dataset. In
practice, it may be infeasible to do this over your whole dataset, or
when the 3rd-party data is only exposed via a SPARQL endpoint. In
that case, you could illustrate the general idea by creating a small
'temporary' dataset which combines a sample of your data with a
sample of the 3rd-party data in a single file. You should then be
able to query this small combined dataset relatively easily.
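As a sketch, such a 'temporary' merged file might simply concatenate a few of your own triples with a few third-party triples about the same resource, linked via owl:sameAs. All identifiers below are invented for illustration, with the third-party sample styled after DBpedia:

```turtle
@prefix ex:      <http://vocab.inf.ed.ac.uk/example/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbo:     <http://dbpedia.org/ontology/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dc:      <http://purl.org/dc/terms/> .

# A sample of your own (hypothetical) data ...
ex:programme42 dc:title "The Blue Planet" .

# ... linked to a sample of third-party data about the same resource.
ex:programme42 owl:sameAs dbpedia:The_Blue_Planet .
dbpedia:The_Blue_Planet a dbo:TelevisionShow .
```

A query over this merged file can then combine properties drawn from both sources within a single graph pattern, which is the essential benefit that federated querying provides.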
Submission should be an electronic document in PDF format, submitted via the on-line submit system as follows:
submit masws 2 masws-query.pdf
Deadline: 4.00 pm, Friday 8th March
Marks: 50
This assignment must be carried out on an individual basis; you
should not work in a team.
Mike Bergman regularly reviews the tools that support semantic
technologies,
and has compiled a list of more than 1,000 SW tools that have been
developed over the last 5 years.
Choose one of the tools in this list, and critically review it. This
should not be a tool that you used in other parts of these
assignments. For further context, you might also want to have a look
at the following report:
The
kinds of questions you should address include:
- what problem is the tool intended to address?
- how well does it fulfil its intended function?
- are there alternative approaches which might do a better job?
- how does it fit into the wider picture of semantic technologies?
- in what respects is the tool innovative or technically interesting?
- what kind of impact does it seem to have had?
Do not just rely on the claims of the tool's designer(s) or on other
people's views; you should ensure that you try out the tool yourself,
using test cases which probe the strengths and weaknesses of the
tool.
You should submit a report of approximately 1,000 words describing the results of
your work. Your report should contain
enough technical detail that I can understand your topic without going to the source
material myself.
Please put your name and matriculation number on the first page.
Submission should be an electronic document in PDF format (not a
Microsoft Word document!!!), submitted via the on-line submit system as follows:
submit masws 3 masws-tools.pdf
Warning
It is essential that the report is written in your own words. Any direct
copying or paraphrasing of text written by someone else (unless put inside quotation
marks with the source cited, so that it is clear that you are quoting) will be treated as
plagiarism, and could result in failure or academic discipline proceedings. If you
are at all unclear about this, please consult me before submitting the assignment.
Deadline: 4.00 pm, Friday 15th March
Marks: 10
The MSc marks (which total 110 across the assignments) will be scaled down to 100.