To start off, you are required to identify some data that is
freely available and is suitable for processing as described in Part 2
below. Your source data can be in any format, but it is probably
best if you stick to the following formats: HTML, CSV, XML, or
RDB. The source dataset should not be data that you have created
yourself and it should not be data which is already available as RDF
data. (It is your responsibility to verify that the second condition
is true.) Here are some suggestions for sourcing open data.
There are many, many tools available for converting from diverse data
structures into RDF. Here are some links which aggregate lists of such
tools:
You could also write your own converter. For a starting point of
example scripts (in PHP, Ruby and Python), see:
If you want to write your own converter in Java, the Apache Jena library lets you create RDF data and, more generally, work with RDF and SPARQL, which can be useful in later pieces of coursework.
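As an illustration of what a hand-written converter might look like, here is a minimal sketch in Python that uses only the standard library. It assumes a CSV source with a header row whose column names are simple identifiers (letters, digits, underscores) and one column that identifies each row. The function names, the `item/` path segment and the trailing slash on the namespace are assumptions made for this example, not part of the assignment.

```python
import csv
import io
from urllib.parse import quote

# Assumption: the default namespace suggested in this coursework,
# with a trailing slash added so that local names can be appended.
BASE = "http://vocab.inf.ed.ac.uk/masws/"


def turtle_literal(value: str) -> str:
    """Quote a string as a Turtle literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def csv_to_turtle(csv_text: str, id_column: str) -> str:
    """Emit one subject per CSV row and one triple per non-empty cell.

    Assumes every row has the same number of fields as the header.
    """
    lines = [f"@prefix ex: <{BASE}> .", ""]
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Hypothetical URI pattern: <BASE>item/<percent-encoded id>
        subject = f"<{BASE}item/{quote(row[id_column], safe='')}>"
        pairs = [
            f"ex:{column} {turtle_literal(value)}"
            for column, value in row.items()
            if column != id_column and value
        ]
        if pairs:
            lines.append(f"{subject}\n    " + " ;\n    ".join(pairs) + " .")
    return "\n".join(lines) + "\n"
```

Calling `csv_to_turtle(text, "id")` on a file with columns `id`, `title` and `year` would produce predicates `ex:title` and `ex:year`. In a real conversion you would normally map columns to terms from existing vocabularies rather than coining one property per column, and give typed literals to numbers and dates.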
For cleaning up data prior to conversion, consider OpenRefine.
Having selected a candidate dataset (and thought about a proposed conversion method), you should
send an email to amy.guy@ed.ac.uk which describes the data, and
either provides a link to it or includes the dataset as an
attachment. In addition, if you are working in a team of two, please include the names of the team
members (plus matriculation numbers) and state clearly who the coordinator is.
This should be sent by 4.00 pm on Monday the 27th of January; your choice of data will have to be
approved before you progress further. Amy will respond to your email no
later than Thursday the 30th of January.
Once and only once you have received approval, you should write a report of around
1,000 words which addresses the following issues:
- A brief description of the proposed dataset, and reasons
why it might be useful to make it available in RDF format.
- Proposed methods and tools for converting the source dataset into
RDF; be as specific as you can (you have to propose at least one method, but you can propose more if you want).
- Examples of the kinds of queries that you would expect to be able to
make.
- A brief description of any difficulties that you might have to deal
with.
Submission should be an electronic document in PDF format (not a
Microsoft Word document!!!), submitted via the on-line submit system as follows:
submit masws 1 masws-source-data.pdf
Deadline: 4.00 pm, Friday 7th February
Marks: 20
In this task, you will convert the data into RDF triples in Turtle format, using your chosen
methods and tools. You should as far as possible use existing
vocabularies for creating the URIs in your RDF.
Google will often find something relevant; for example, if you want to
find a vocabulary about TV programs, then a query such as "tv program
rdf vocabulary" or "tv program ontology" is likely to get you some
useful results. Other useful websites to search for existing vocabularies are the following:
If after diligent investigation you come to the conclusion that there is no suitable vocabulary already
available, you can coin your own URIs. Be careful: you will lose marks if you coin new URIs while a suitable vocabulary already exists and can be easily found.
You can choose your own namespace, but http://vocab.inf.ed.ac.uk/masws is a reasonable default.
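If you do end up coining terms, it helps to generate them consistently from human-readable labels. The following is a minimal sketch, assuming the default namespace above with a trailing slash added and a lowerCamelCase convention for local names. The function name `mint_uri` and the naming convention are illustrative choices, not requirements of the assignment.

```python
import re
import unicodedata

# Assumption: default namespace from the text, with a trailing slash added.
NAMESPACE = "http://vocab.inf.ed.ac.uk/masws/"


def mint_uri(label: str, namespace: str = NAMESPACE) -> str:
    """Build a URI whose local name is a lowerCamelCase form of `label`."""
    # Drop accents and other non-ASCII marks so the local name stays simple.
    ascii_label = (
        unicodedata.normalize("NFKD", label)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    words = re.findall(r"[A-Za-z0-9]+", ascii_label)
    if not words:
        raise ValueError(f"cannot derive a local name from {label!r}")
    local = words[0].lower() + "".join(w.capitalize() for w in words[1:])
    return namespace + local
```

For example, `mint_uri("broadcast date")` yields a URI ending in `broadcastDate`. Whatever convention you choose, document it in your report along with the intended meaning of each term.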
Verify that your RDF is
valid using the Jena eyeball tool. To use the Jena eyeball tool, download it from
here and unzip it in a folder of your choice.
You can now run the eyeball tool to check the validity of your RDF files using a command like the following:
java -cp "<path to where you unzipped eyeball>eyeball-2.3/lib/*" jena.eyeball -check <path to your turtle file>/your_turtle_file.ttl
You may want to read this page about some common eyeball warnings which you can ignore, such as the "predicate not declared in any schema" unknown predicate report.
It is often easier to 'see' what is going on in an RDF graph by
visualizing it. Here are some pointers to available tools, in addition
to resources mentioned in the lectures:
Some of these tools are available as plugins to Protege, which is a
widely used tool for developing ontologies in RDF and OWL. Protege is
available on DICE, but is also simple to install on your own computer.
When you have completed this task, write a report of about 2,500 words
which addresses the following issues:
- Explain what steps you took to transform your source data into RDF,
and describe any conceptual or engineering issues that arose.
If you are not using one of the transformation methods you proposed in the previous assignment,
explain why.
- Describe and justify any 3rd-party schemas that you used in your
RDF data.
- If you need to add some vocabulary of your own, explain why it is
necessary, and explain the intended meaning of each vocabulary
term. If this vocabulary is more than a few classes or properties,
then make it available as an independent file.
- Provide a visualization (as a graph) of the RDF schema (i.e., your
own plus 3rd-party) that you have used.
- To submit your converted RDF, supply in the report a URL from which it can be retrieved (do not include the RDF triples themselves in the report).
Please verify that the data is indeed accessible over HTTP.
If your RDF data occupies less than 5 MB, you can make it available on your webpage: http://www.inf.ed.ac.uk/systems/web/homepages.html.
Otherwise, you can put it on Google Drive, Dropbox or any other free storage website and then supply the link in your report.
Your converted RDF data needs to be in Turtle or N3 format (file extension .ttl or .n3).
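One way to check that the data is accessible over HTTP is to request the URL from a machine other than the one hosting it. The following is a minimal sketch using only the Python standard library. The function name `is_retrievable`, the `Accept` header and the ten-second timeout are assumptions made for this example.

```python
import urllib.error
import urllib.request


def is_retrievable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a GET request for `url` succeeds with a 2xx status."""
    try:
        # Building the Request can itself raise ValueError for a malformed URL.
        request = urllib.request.Request(url, headers={"Accept": "text/turtle"})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP error statuses, DNS and connection failures, and timeouts.
        return False
```

A successful status only shows that something was served at that URL. It is still worth downloading the file once and confirming that it is the Turtle or N3 data you intended, since some storage sites return an HTML preview page rather than the raw file.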
Submission should be an electronic document in PDF format, submitted via the on-line submit system as follows:
submit masws 1 masws-conversion.pdf
Deadline: 4.00 pm, Wednesday 5th March
Marks: 30