MASWS Coursework 2013/14

Author: Fiona McNeill
Edited by: Paolo Pareti
Date: March 2014

Main Goals

By successfully completing these coursework assignments, you will gain the ability to carry out the following tasks:

Summary of Deadlines

What                              How                                    When
Assignment 1: source data check   email to amy.guy@ed.ac.uk              4.00 pm, Monday 27th January [Week 3]
Assignment 1 part 1               submit masws 1 masws-source-data.pdf   4.00 pm, Friday 7th February [Week 4]
Assignment 1 part 2               submit masws 1 masws-conversion.pdf    4.00 pm, Wednesday 5th March [Week 7]
Assignment 2                      submit masws 2 masws-query.pdf         4.00 pm, Friday 21st March [Week 8]

Overview

This coursework involves (i) converting a pre-existing dataset into RDF, and (ii) querying the resulting RDF triples using the SPARQL query language.

These assignments are designed to be carried out individually. However, if you wish, you can carry out the work in a team of two. The advantages of working in pairs are that you get to share ideas and divide up the work; the potential drawbacks are that you need to synchronize your efforts with each other and that there is the overhead of communication. If you choose to work in a team of two, you should choose a coordinator. The coordinator will be responsible for submitting the results of the team's work and for ensuring that any other required communication takes place in a timely fashion.

Assignment 1

Part 1: Identifying a data source

To start off, you are required to identify some data that is freely available and is suitable for processing as described in Part 2 below. Your source data can be in any format, but it is probably best if you stick to the following formats: HTML, CSV, XML, or RDB. The source dataset should not be data that you have created yourself and it should not be data which is already available as RDF data. (It is your responsibility to verify that the second condition is true.) Here are some suggestions for sourcing open data.

There are many, many tools available for converting from diverse data structures into RDF. Here are some links which aggregate lists of such tools:

You could also write your own converter. For a starting point of example scripts (in PHP, Ruby and Python), see:

If you want to create your own converter in Java, the Apache Jena library allows you to create RDF data (and, more generally, to work with RDF and SPARQL, which will be useful in the later parts of the coursework).
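To make the "write your own converter" option concrete, here is a minimal sketch in Python using the rdflib library. The input file, column names and choice of vocabulary are hypothetical placeholders, not part of the coursework; substitute your own data and whichever vocabulary fits it.

import csv
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

# Hypothetical namespace for any terms not covered by an existing vocabulary.
EX = Namespace("http://vocab.inf.ed.ac.uk/masws#")

g = Graph()
g.bind("ex", EX)
g.bind("foaf", FOAF)

# Hypothetical input: a CSV file with columns 'id', 'name' and 'homepage'.
with open("people.csv", newline="") as f:
    for row in csv.DictReader(f):
        person = EX[row["id"]]
        g.add((person, RDF.type, FOAF.Person))
        g.add((person, FOAF.name, Literal(row["name"])))
        g.add((person, FOAF.homepage, URIRef(row["homepage"])))

# Write the result out as Turtle, ready for Part 2.
g.serialize(destination="people.ttl", format="turtle")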

For cleaning up data prior to conversion, consider Open Refine.

Having selected a candidate dataset (and thought about a proposed conversion method), you should send an email to amy.guy@ed.ac.uk which describes the data, and either provides a link to it or includes the dataset as an attachment. In addition, if you are working in a team of two, please include the names of the team members (plus matriculation numbers) and state clearly who the coordinator is. This should be sent by 4.00 pm on Monday the 27th of January; your choice of data will have to be approved before you progress further. Amy will respond to your email no later than Thursday the 30th of January.

Once you have received approval (and not before), you should write a report of around 1,000 words which addresses the following issues:

  1. A brief description of the proposed dataset, and reasons why it might be useful to make it available in RDF format.
  2. Proposed methods and tools for converting the source dataset into RDF; be as specific as you can (you have to propose at least one method, but you can propose more if you want).
  3. Examples of the kinds of queries that you would expect to be able to make.
  4. A brief description of any difficulties that you might have to deal with.

Submission should be an electronic document in PDF format (not a Microsoft Word document!!!), submitted via the on-line submit system as follows:

submit masws 1 masws-source-data.pdf
Deadline: 4.00 pm, Friday 7th February
Marks: 20

Part 2: Converting the data into RDF

In this task, you will convert the data into RDF triples in Turtle format, using your chosen methods and tools. You should as far as possible use existing vocabularies for creating the URIs in your RDF. Google will often find something relevant; for example, if you want to find a vocabulary about TV programs, then a query such as "tv program rdf vocabulary" or "tv program ontology" is likely to get you some useful results. Other useful websites to search for existing vocabularies are the following:

If, after diligent investigation, you come to the conclusion that there is no suitable vocabulary already available, you can coin your own URIs. Be careful: you will lose marks if you coin new URIs when a suitable vocabulary already exists and can easily be found. You can choose your own namespace, but http://vocab.inf.ed.ac.uk/masws is a reasonable default. Verify that your RDF is valid using the Jena eyeball tool. To use it, download it from here and unzip it in a folder of your choice. You can then run eyeball to check the validity of your RDF files using a command like the following:

java -cp "<path to your unzipped eyeball>/eyeball-2.3/lib/*" jena.eyeball -check <path to your turtle file>/your_turtle_file.ttl

You may want to read this page about some common eyeball warnings which can safely be ignored, such as the unknown predicate report ("predicate not declared in any schema").
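In addition (not as a substitute for eyeball), a quick sanity check is simply to try parsing your Turtle file; a syntactically broken file will raise an error. A minimal sketch using the Python rdflib library, with a placeholder file name:

from rdflib import Graph

g = Graph()
# Raises an exception if the Turtle file is not syntactically well formed.
g.parse("your_turtle_file.ttl", format="turtle")
print(len(g), "triples parsed")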

It is often easier to 'see' what is going on in an RDF graph by visualizing it. Here are some pointers to available tools, in addition to resources mentioned in the lectures:

Some of these tools are available as plugins to Protege, which is a widely used tool for developing ontologies in RDF and OWL. Protege is available on DICE, but it is also simple to install on your own computer.
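If you are already working in Python, one lightweight (and entirely optional) route to a first visualization is the rdf2dot utility bundled with rdflib, which converts a graph into Graphviz DOT format; the file names below are placeholders:

from rdflib import Graph
from rdflib.tools.rdf2dot import rdf2dot

g = Graph()
g.parse("your_turtle_file.ttl", format="turtle")

# Write a Graphviz DOT file; render it with e.g.: dot -Tpng graph.dot -o graph.png
with open("graph.dot", "w") as out:
    rdf2dot(g, out)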

When you have completed this task, write a report of about 2,500 words which addresses the following issues:

  1. Explain what steps you took to transform your source data into RDF, and describe any conceptual or engineering issues that arose. If you are not using one of the transformation methods you proposed in the previous assignment, explain why.
  2. Describe and justify any 3rd-party schemas that you used in your RDF data.
  3. If you need to add some vocabulary of your own, explain why it is necessary, and explain the intended meaning of each vocabulary term. If this vocabulary is more than a few classes or properties, then make it available as an independent file.
  4. Provide a visualization (as a graph) of the RDF schema (i.e., your own plus 3rd-party) that you have used.
  5. To submit your converted RDF, supply in the report a URL from which it can be retrieved (do not attach all the RDF triples to the report). Please verify that the data is indeed accessible over HTTP. If your RDF data occupies less than 5 MB, you can make it available on your webpage: http://www.inf.ed.ac.uk/systems/web/homepages.html. Otherwise, you can put it in Google Drive, Dropbox or any other free storage website and supply the link in your report. Your converted RDF data needs to be in Turtle or N3 format (file extension .ttl or .n3).

Submission should be an electronic document in PDF format, submitted via the on-line submit system as follows:

submit masws 1 masws-conversion.pdf
Deadline: 4.00 pm, Wednesday 5th March
Marks: 30

Assignment 2

Design at least four SPARQL queries that provide a representative sample of information that can be extracted from your dataset. Evaluate the queries against your dataset and store the results. In order to do so, you might want to use one of the many existing tools and libraries to evaluate SPARQL queries against local RDF files (e.g. Apache Jena). You might also want to look at this tutorial on how to query RDF with SPARQL: http://www.inf.ed.ac.uk/teaching/courses/masws/Coding/build/html/index.html.
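If you would rather work in Python than Java, the rdflib library can also evaluate SPARQL queries against a local Turtle file. A minimal sketch follows; the file name and the FOAF-based query are placeholders to be replaced by your own data and vocabulary:

from rdflib import Graph

g = Graph()
g.parse("your_turtle_file.ttl", format="turtle")

# Placeholder query: list up to 10 names of foaf:Person resources.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE {
    ?person a foaf:Person ;
            foaf:name ?name .
} LIMIT 10
"""

for row in g.query(query):
    print(row.name)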

When you have completed this task, write a report of about 2,500 words which addresses the following issues:

  1. List the SPARQL queries that you wrote, and justify why they are representative.
  2. Execute your queries against your RDF dataset, using a standard SPARQL query engine. Include in your report the result set for each query, limiting the set to no more than 10 results for each query.
  3. Consider whether (and if applicable, how) the results you have obtained could have been extracted from the source dataset using some other programmatic query technique. What benefit, if any, results from the RDF + SPARQL combination?

Marks for (1) and (2) will take into account both the way in which you communicate your understanding of your data and the way in which you demonstrate your understanding and knowledge of SPARQL.

In addressing issue (3), you may want to consider the benefits of federated query, where you in effect query the result of merging your RDF graph with a graph provided by a 3rd-party dataset. In practice, it may be infeasible to do this with your whole dataset, or the 3rd-party data may not be exposed via a SPARQL endpoint. In that case, you could illustrate your general idea by creating a small 'temporary' dataset which combines a sample of your data with a sample of the 3rd-party data into a single file. You should then be able to query this small combined set of data relatively easily.
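One simple way to build such a temporary combined dataset, assuming both samples are saved as local Turtle files (the file names below are placeholders), is to parse them into a single rdflib graph and query the merge directly:

from rdflib import Graph

g = Graph()
g.parse("sample_of_my_data.ttl", format="turtle")         # sample of your own RDF data
g.parse("sample_of_3rd_party_data.ttl", format="turtle")  # sample of the 3rd-party data

# g now holds the merge of both samples and can be queried like any other graph.
for row in g.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"):
    print(row)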

Submission should be an electronic document in PDF format, submitted via the on-line submit system as follows:

submit masws 2 masws-query.pdf
Deadline: 4.00 pm, Friday 21st March
Marks: 50