SWS Coursework 2015/16

Assignment 1

Submission should be a single zipped file, containing the files described below, submitted via the online submit system as follows:

submit sws 1 student_matriculation_id.zip
Deadline Monday 29th of February 2016, 4pm
Marks 50

In this task, you will convert a specific dataset into RDF triples in Turtle format. Information on the dataset assigned to you, and how to access it, will be emailed to you by the end of week 3 (by Friday the 29th of January). If this date has passed and you have not yet received this information, contact the course TA (p.pareti@sms.ed.ac.uk) immediately.

Once you have received information regarding the dataset that has been assigned to you, you have to think about how you will convert it into RDF. There are many tools available for converting diverse data formats into RDF. Here are some links that aggregate lists of such tools:

If you want to create your own converter in Java, the Apache Jena library allows you to create RDF data (and, more generally, to work with RDF and SPARQL, which will be useful in the coming pieces of coursework).
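If you would rather work in Python, note that for a small dataset no special library is strictly required: Turtle is plain text, so a converter can be sketched with the standard library alone. The column names (`name`, `capital`), the example namespace, and the `ex:capital` property below are illustrative assumptions, not part of the assignment; real data will also need proper literal escaping and datatypes, which a library such as Jena or rdflib handles for you.

```python
import csv
import io
import urllib.parse

# Placeholder namespace -- substitute your own matriculation id.
BASE = "http://vocab.inf.ed.ac.uk/sws/s1234567/"

def to_turtle(csv_text):
    """Convert CSV rows (assumed columns: name, capital) into Turtle.
    A minimal sketch only: it escapes double quotes in literals but
    handles no other Turtle escaping, and emits plain string literals."""
    out = ["@prefix ex: <%s> ." % BASE,
           "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .", ""]
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Percent-encode the label so the subject is a legal URI.
        subj = "ex:" + urllib.parse.quote(row["name"].replace(" ", "_"))
        out.append('%s rdfs:label "%s" ;' % (subj, row["name"].replace('"', '\\"')))
        out.append('    ex:capital "%s" .' % row["capital"].replace('"', '\\"'))
    return "\n".join(out)
```

Generating the text yourself like this makes the mapping from source columns to triples explicit, which also helps when writing the step-by-step instructions asked for in the report.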

For cleaning up data prior to conversion, consider Open Refine.

You should, as far as possible, reuse existing vocabularies for the URIs of concepts and relations in your RDF. When reusing existing vocabularies, make sure that they are suitable for representing your data:

Google will often find something relevant; for example, if you want to find a vocabulary about TV programs, then a query such as "tv program rdf vocabulary" or "tv program ontology" is likely to get you some useful results. Other useful websites to search for existing vocabularies include:

If, after diligent investigation, you come to the conclusion that no suitable vocabulary is already available, you can coin your own URIs. Be careful: you will be penalised if you coin new URIs when a suitable vocabulary already exists and can easily be found. You will not lose marks if you plausibly justify your choice not to reuse a seemingly relevant vocabulary. If you have to coin new URIs you can choose your own namespace, but a reasonable default is:
http://vocab.inf.ed.ac.uk/sws/student_matriculation_id/
Adding your student matriculation id to the namespace ensures that you don't accidentally coin the same URI as another student.
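A sketch of how such URIs might be minted, using only the Python standard library; the matriculation id `s1234567` is a placeholder, and the space-to-underscore convention is just one reasonable choice. Percent-encoding keeps labels containing spaces or punctuation legal in a URI:

```python
import urllib.parse

# Placeholder -- substitute your own matriculation id.
NAMESPACE = "http://vocab.inf.ed.ac.uk/sws/s1234567/"

def mint_uri(label):
    """Mint a URI in your own namespace from a raw label.
    Spaces become underscores; anything else unsafe for a URI
    path segment is percent-encoded."""
    slug = label.strip().replace(" ", "_")
    return NAMESPACE + urllib.parse.quote(slug, safe="_-")
```

Minting every new URI through one function like this keeps your naming consistent across the whole dataset.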

It is also important to reuse existing entity URIs for each class of resources for which you can identify them. For example, if your dataset contains a list of countries, it is better to reuse the existing GeoNames URIs of the countries instead of creating your own. However, because of the large number of entities, it is often impossible to match all of them manually, and an automatic entity-matching system needs to be created. This task of matching entities between different representations is called data integration. Data integration is often based on heuristics, such as assuming that two URIs that refer to the same entity will have a similar label, or similar properties. Since these heuristics can sometimes be wrong, a small margin of error is tolerated.

You are asked to develop a data-integration system for only one class of resources for which you can identify existing URIs. For example, if your dataset contains instances of countries that you can match with GeoNames URIs, and a list of actors that you can match with DBpedia URIs, you can choose one of these classes (countries or actors) to match automatically, and then generate your own URIs for the instances of the other class. For the class of resources that you intend to match with existing URIs automatically, you need to:
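As an illustration of the label-similarity heuristic mentioned above, the following sketch uses only the Python standard library. The GeoNames URIs and the similarity cutoff are assumptions you would tune against your own data; labels with no sufficiently similar candidate are returned as `None` so they can be reviewed by hand, since heuristic matching is never perfect:

```python
import difflib

def match_entities(labels, candidates, cutoff=0.85):
    """Match each local label to the most similar candidate label.

    candidates: dict mapping a known label -> its existing URI
    (e.g. built from a GeoNames dump). Returns a dict mapping each
    input label to a matched URI, or None when nothing is close enough.
    """
    matches = {}
    for label in labels:
        best = difflib.get_close_matches(label, candidates, n=1, cutoff=cutoff)
        matches[label] = candidates[best[0]] if best else None
    return matches
```

Spot-checking a sample of the automatic matches by hand is a good way to choose the cutoff, and also gives you the manually verified URIs that question 4 asks you to list.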

After you have performed the conversion, you should verify that your RDF file is correct according to the Turtle syntax. You can find a list of possible validators here. You can also use a Semantic Web library such as Jena to verify that your file can be correctly parsed as Turtle.
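Before reaching for a full validator, a crude smoke test can catch one common mistake, a QName prefix used without being declared, early. This is emphatically not a Turtle parser (it ignores multi-line literals, among other things), so a proper validator or a library such as Jena is still needed; it is only a quick first check:

```python
import re

def undeclared_prefixes(turtle_text):
    """Report QName prefixes used without a matching @prefix line.
    Crude heuristic only: full IRIs in <...> and quoted literals are
    stripped first so colons inside them are not miscounted."""
    declared = set(re.findall(r'@prefix\s+([\w-]*):', turtle_text))
    body = re.sub(r'<[^>]*>|"[^"]*"', '', turtle_text)
    used = set(re.findall(r'(?<![\w:])([A-Za-z][\w-]*):', body))
    return used - declared
```

An empty result does not mean the file is valid; a non-empty one means it certainly is not.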

It is often easier to 'see' what is going on in an RDF graph by visualising it. Here are some pointers to available tools, in addition to the resources mentioned in the lectures:

Some of these tools are available as plugins for Protege, which is a widely used tool for developing ontologies in RDF and OWL. Protege is available on DICE, and is also simple to install on your own computer.

Submission

Submit a zipped file containing your report along with all the other required files (as detailed below). Your report should be a PDF file containing your answers to the following questions (the number in brackets indicates the percentage of marks for each question):

  1. [20/100] Include your converted RDF dataset in the zipped file. Your converted RDF data needs to be in Turtle (.ttl) format.
  2. [24/100] Explain how you performed the transformation as step-by-step instructions, in such a way that another person could replicate your transformation process. Make sure to include in your instructions valid links to (1) the third-party tools you have used (if any) and (2) the program code you have developed (if any). If you have developed your own parser, do not include your code in the zipped submission file; instead, upload it to a storage website such as GitHub or Bitbucket (in this case, please verify that the files are openly available to users with the correct link).
  3. [24/100] Describe and justify any 3rd-party vocabulary that you used in your RDF data. If you need to add some vocabulary of your own, explain why it is necessary, and explain the intended meaning of each vocabulary term. If your own vocabulary contains more than 5 classes or properties, then create a Turtle file that describes it using the RDFS model and include it in the zipped submission file.
  4. [24/100] Choose one class of resources for which you have identified existing entity-URIs, and explain which existing dataset contains these URIs. For this class:
    • List 5 existing URIs that you have reused in your dataset. These 5 URIs can be URIs that you have manually matched.
    • Describe the integration system that you have used to match all the entities of the class. Describe it as step-by-step instructions, in such a way that another person could replicate it. Make sure to include in your instructions valid links to (1) the third-party tools you have used (if any) and (2) the program code you have developed (if any). As with question 2, do not include your code in the zipped submission file, but upload it to a storage website instead.
  5. [8/100] Provide a visualisation, as a graph, of all the RDF vocabularies that you have used (you shouldn't visualise the whole dataset, but only the vocabularies). This visualisation should include 3rd-party vocabularies and any new vocabulary that you might have created. Briefly explain how this visualisation was produced. Include this visualisation, as a JPEG file, in the zipped submission file.