MASWS Coursework 2012/13
Author: Fiona McNeill
Date: 2012-03-02
By successfully completing these coursework assignments, you will gain
the ability to carry out the following tasks:
- design simple data query and analysis techniques
- choose appropriate methods / tools for converting legacy data into RDF
- design and implement appropriate data representations within an RDF data model
- explore available RDF schemas for representing a given dataset
- gain practical experience of using Turtle as an RDF serialization
- design and evaluate SPARQL queries
- evaluate pros and cons of native and RDF-based query techniques
What | How | When
Assignment 1: source data check | email to masws00@gmail.com | 4.00 pm, Monday 28th January [Week 3]
Assignment 1 part 1 | submit masws 1 masws-source-data.pdf | 4.00 pm, Friday 8th February [Week 4]
Assignment 1 part 2 | submit masws 1 masws-conversion.pdf | 4.00 pm, Monday 25th February [Week 7]
Assignment 2 | submit masws 2 masws-query.pdf | 4.00 pm, Friday 8th March [Week 8]
Assignment 3 (MSc only) | submit masws 3 masws-tools.pdf | 4.00 pm, Friday 15th March [Week 9]
This coursework involves (i) converting some pre-existing dataset into
RDF, and (ii) querying the RDF triples via the SPARQL language.
Before starting the assignments, you should decide whether you wish
to carry out the work as part of a team. A team can have from two to
four members. The advantages of working in a team are that you get to
share ideas and divide up the work; the potential drawbacks are that
you need to synchronize your efforts with each other and that
communication carries an overhead. Every team should choose a
coordinator. The coordinator will be responsible for submitting the
results of the team's work and for ensuring that any other required
communication takes place in a timely fashion.
You are not obliged to work as part of a team. To simplify the
discussion below, an individual will be considered to be a team of
one member who is also the 'team coordinator'.
Every submission by a team should
contain a statement indicating the relative contributions of each team
member. Every member of the team will receive the full marks assigned for
the submission as a whole, scaled by their agreed relative contribution.
More specifically, each submission should be accompanied by a list of
the team members, together with an agreed ranking of the contribution
that each member made to the work and the report. This estimate should be a
number in the range 1..5, where 1 represents minimal contribution and
5 represents a full contribution. If N is the total mark
awarded for the submission, and IR is the individual ranking
of a given student s, then the mark awarded to s
will be:
N * IR/5
So, for example, if the total awarded for the assignment is 50, and Ann gets a
ranking of 5, while Bill gets a ranking of 3, then the marks awarded
to Ann and Bill will be 50 and 30 respectively.
Groups should carry out a discussion about the relative contributions
of individual team members and try to reach a consensus about the
individual rankings to be assigned. If there is a failure to reach
consensus, then you should note this on the submission and also report
the problem to me.
To start off, each team will be required to identify some data that is
freely available and is suitable for processing as described in Part 2
below. Your source data can be in any format, but it is probably
best if you stick to the following formats: HTML, CSV, XML, or
RDB. The source dataset should not be data that you have created
yourself and it should not be data which is already available as RDF
data. (It is your responsibility to verify that the second condition
is true.) Here are some suggestions for sourcing open data.
There are many, many tools available for converting from diverse data
structures into RDF. Here are some links which aggregate lists of such
tools:
You could also write your own converter. For example scripts (in PHP,
Ruby and Python) to use as a starting point, see:
For cleaning up data prior to conversion, consider Google Refine or
Stanford DataWrangler.
Having selected a candidate dataset (and thought about a proposed
conversion method), the team coordinator should send an email to
masws00@gmail.com which describes the data, and either provides a link
to it or sends the dataset as an attachment. In addition, please
include a list of all the team members (plus matriculation numbers)
and state clearly who the coordinator is. This should be done
by 4.00 pm on Monday 28th January; your choice of data will have to be
approved before you progress further. I will respond to your email no
later than Wednesday 1st February.
Only once you have received approval should you write a report of around
1,000 words which addresses the following issues:
- A brief description of the proposed dataset, and reasons
why it might be useful to make it available in RDF format.
- Proposed methods and tools for converting the source dataset into
RDF; be as specific as you can.
- Example of the kinds of queries that you would expect to be able to
make.
- A brief description of any difficulties that you might have to deal
with.
Submission should be an electronic document in PDF format (not a
Microsoft Word document!!!), submitted via the on-line submit system as follows:
submit masws 1 masws-source-data.pdf
Deadline: 4.00 pm, Friday 8th February
Marks: 20
In this task, you will convert the data into RDF triples in Turtle format, using your chosen
methods and tools. You should as far as possible use existing
vocabularies for creating the URIs in your RDF.
Google will often find something relevant; for example, if you want to
find a vocabulary about TV programs, then a query such as "tv program
rdf vocabulary" or "tv program ontology" is likely to get you some
useful results.
If after diligent investigation you come to the conclusion that there
is no suitable vocabulary already
available, you can coin your own URIs. You can choose your own
namespace, but http://vocab.inf.ed.ac.uk is a reasonable default. Verify that your RDF is
valid using the Jena eyeball tool (available on DICE).
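As a sketch of what vocabulary reuse might look like, here is a hypothetical Turtle fragment describing a TV programme using the Dublin Core and FOAF vocabularies. The ex: namespace, the resource identifiers, and the property values are all invented for illustration:

```turtle
@prefix dc:   <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://vocab.inf.ed.ac.uk/example/> .

# A hypothetical resource; the ex: identifiers are invented.
ex:programme42
    dc:title       "The Blue Planet" ;
    dc:date        "2001-09-12" ;
    dc:description "A documentary series on marine life." ;
    foaf:maker     ex:bbc .

ex:bbc a foaf:Organization ;
    foaf:name "British Broadcasting Corporation" .
```

Note that only the subject URIs are coined locally; the predicates and classes all come from widely deployed 3rd-party vocabularies, which is what makes the data easy for others to consume.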
It is often easier to 'see' what is going on in an RDF graph by
visualizing it. Here are some pointers to available tools, in addition
to resources mentioned in the lectures:
Some of these tools are available as plugins to Protege, which is a
widely used tool for developing ontologies in RDF and OWL. Protege is
available on DICE, but is also simple to install on your own computer.
When you have completed this task, write a report of about 2,500 words
which addresses the following issues:
- Explain what steps you took to transform your source data into RDF,
and describe any conceptual or engineering issues that arose.
- Describe and justify any 3rd-party schemas that you used in your
RDF data.
- If you need to add some vocabulary of your own, explain why it is
necessary, and explain the intended meaning of each vocabulary
term. If this vocabulary is more than a few classes or properties,
then make it available as an independent file.
- Provide a visualization (as a graph) of the RDF schema (i.e., your
own plus 3rd-party) that you have used.
- Either attach your transformed RDF output, or (better) supply a URL for
where it can be retrieved. In the latter case, please verify that the
data is indeed accessible over HTTP.
Submission should be an electronic document in PDF format, submitted via the on-line submit system as follows:
submit masws 1 masws-conversion.pdf
Deadline: 4.00 pm, Monday 25th February
Marks: 30
Design at least four SPARQL queries that provide a representative sample
of information that can be extracted from your dataset. Evaluate the
queries against your dataset and store the results.
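For instance, assuming a hypothetical dataset of TV programmes described with Dublin Core and FOAF terms (the data and property choices here are invented for illustration), one representative query might list each programme together with the organization that made it:

```sparql
PREFIX dc:   <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# List programmes with the names of their makers, most recent first.
# The dataset and its vocabulary choices are hypothetical.
SELECT ?title ?makerName
WHERE {
  ?programme dc:title   ?title ;
             dc:date    ?date ;
             foaf:maker ?maker .
  ?maker     foaf:name  ?makerName .
}
ORDER BY DESC(?date)
LIMIT 10
```

A representative set of queries would go beyond simple lookups like this one, exercising joins across several resources, optional patterns, filtering, and aggregation as appropriate to your data.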
When you have completed this task, write a report of about 2,500 words
which addresses the following issues:
- List the SPARQL queries that you wrote, and justify why they are representative.
- Execute your queries against your RDF dataset, using a standard
SPARQL query engine. Include in your report the result
set for each query, limiting the set to no more than 10 results for each query.
- Consider whether (and if applicable, how) the results you have
obtained could have been extracted from the source dataset using
some other programmatic query technique. What benefit, if any,
results from the RDF + SPARQL combination?
Marks for (1) and (2) will take into account both the way in which
you communicate your understanding of your data and the way in which
you demonstrate your understanding and knowledge of SPARQL.
In addressing issue (3), you may want to consider the benefits of
federated query, where you in effect query the result of merging your
RDF graph with the graph provided by a 3rd-party dataset. In
practice, it may be infeasible to do this over your whole dataset, or
when the 3rd-party data is only exposed via a SPARQL endpoint. In
that case, you could illustrate the general idea by creating a small
'temporary' dataset which combines a sample of your data with a
sample of the 3rd-party data in a single file. You should then be
able to query this small combined dataset relatively easily.
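As a sketch, such a 'temporary' merged file might simply concatenate a few of your own triples with a few third-party triples about the same resource, linked via owl:sameAs. All identifiers below are invented for illustration, with the third-party sample styled after DBpedia:

```turtle
@prefix ex:      <http://vocab.inf.ed.ac.uk/example/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbo:     <http://dbpedia.org/ontology/> .
@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix dc:      <http://purl.org/dc/terms/> .

# A sample of your own (hypothetical) data ...
ex:programme42 dc:title "The Blue Planet" .

# ... linked to a sample of third-party data about the same resource.
ex:programme42 owl:sameAs dbpedia:The_Blue_Planet .
dbpedia:The_Blue_Planet a dbo:TelevisionShow .
```

A query over this merged file can then combine properties drawn from both sources within a single graph pattern, which is the essential benefit that federated querying provides.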
Submission should be an electronic document in PDF format, submitted via the on-line submit system as follows:
submit masws 2 masws-query.pdf
Deadline: 4.00 pm, Friday 8th March
Marks: 50
This assignment must be carried out on an individual basis; you
should not work in a team.
Mike Bergman regularly reviews the tools that support semantic
technologies,
and has compiled a list of more than 1,000 SW tools that have been
developed over the last 5 years.
Choose one of the tools in this list, and critically review it. This
should not be a tool that you used in other parts of these
assignments. For further context, you might also want to have a look
at the following report:
The
kinds of questions you should address include:
- what problem is the tool intended to address?
- how well does it fulfil its intended function?
- are there alternative approaches which might do a better job?
- how does it fit into the wider picture of semantic technologies?
- in what respects is the tool innovative or technically interesting?
- what kind of impact does it seem to have had?
Do not just rely on the claims of the tool's designer(s) or on other
people's views; you should ensure that you try out the tool yourself,
using test cases which probe the strengths and weaknesses of the
tool.
You should submit a report of approximately 1,000 words describing the results of
your work. Your report should contain
enough technical detail that I can understand your topic without going to the source
material myself.
Please put your name and matriculation number on the first page.
Submission should be an electronic document in PDF format (not a
Microsoft Word document!!!), submitted via the on-line submit system as follows:
submit masws 3 masws-tools.pdf
Warning
It is essential that the report is written in your own words. Any direct
copying or paraphrasing of text written by someone else (unless put inside quotation
marks with the source cited, so that it is clear that you are quoting) will be treated as
plagiarism, and could result in failure or academic discipline proceedings. If you
are at all unclear about this, please consult me before submitting the assignment.
Deadline: 4.00 pm, Friday 15th March
Marks: 10
The MSc marks (which total 110 across the assignments) will be scaled down to 100.