Important Notes
- Late Policy: please click here to read about the school's late policy.
- Partner Programming: you can choose to submit individually or with a partner.
  If you submit with a partner, then
  - submit only one submission (by one of the students in the team) and
  - include a file partner.txt which holds the name and UUN of your partner.
  - Both students in the team receive the same mark.
- Virtual Machine (VM): we provide you with a VirtualBox disc image that runs Ubuntu Linux and has
everything installed that is needed for the assignment. Your code must compile and run within
this VM without any modifications to the VM (i.e., without installing any packages or libraries, etc.).
You may use the AD_assigment_1 directory provided. The username is ad, there is no password.
Click here to download image.
Warning: file size is 1.7GB
Warning: VM changed on March 2nd. Download again if you have an older version.
To use the image, run VirtualBox. Under Machine select New, provide a name, choose type Linux and Ubuntu (32 bit).
Choose a memory size (512MB should be OK), then select "Use an existing virtual hard drive file", and
click on the folder-icon to select the disc image that you downloaded.
Assignment 2: Search and Retrieval
In Assignment 1 you designed a relational schema for eBay data (given in XML).
You converted the XML files to CSV files and loaded them into your tables under
a MySQL database named "ad".
The purpose of this assignment is to
- use JDBC to fetch data from your "ad" database,
- insert this data into a Lucene index for full text search, and
- provide a search function that combines Lucene's full text search
with MySQL's spatial queries.
To carry out spatial search efficiently in MySQL, you are asked to
convert the longitude and latitude information of items into
points (i.e., into MySQL's POINT data type),
and to build a spatial index over those points.
Note:
You can either use your own "ad" database generated with
your assignment-1 solution, or, use the "ad" database that
we provide for you on the VM.
On the VM you can run mysql ad and then type show tables to see a list of the table names
of that database. Use describe tablename to see the schema of a table. For instance,
describe item shows the column names and types of table item.
Part A: Create a Spatial Index in MySQL
Here you are asked to write two SQL scripts named
createSpatialIndex.sql and dropSpatialIndex.sql.
The createSpatialIndex.sql script carries out these tasks:
- create a new table that associates geo-coordinates (as POINTs) with item-IDs, i.e.,
  a table with two columns:
  one column for item-IDs, and another column for geo-coordinates.
  Geo-coordinates are points, represented using the POINT data type of MySQL.
- fill this table with all items that have latitude and longitude information, i.e.,
  write a SQL insert into statement with a SQL query that
  selects each item-ID together with its latitude and longitude information
  (converted into a POINT).
- create a spatial index for the point column of the table created in the first step.
Recall that the point column must be declared as NOT NULL and that
the table must be created using
ENGINE=MyISAM, i.e., the CREATE TABLE statement of the first step has
this structure:
CREATE TABLE IF NOT EXISTS xxx (...) ENGINE = MyISAM;
The dropSpatialIndex.sql script drops the index and the table
generated by the createSpatialIndex.sql script.
Note that we will never call the dropSpatialIndex.sql script
in the further steps described below.
Part B: Create a Lucene Index
Here you are asked to write a program Indexer.java that
creates a Lucene index.
The index should be stored in a directory named indexes
(under the current working directory from which runLoad.sh is called).
We will want to use this index to carry out keyword searches over the
union of the name, categories, and description of an item.
For example, for the query "Disney", your basic search function should
print (in HTML)
the item-ID and name of items that have the keyword "Disney" in the union of the name,
category or description attributes.
Also, for multiple word queries, such as "star trek", you should consider that
as "star" OR "trek". That is, you should return any item if it has either
"star" or "trek" in its name, its categories, or its description.
Use Lucene's SimpleAnalyzer, which carries out case folding.
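To make the effect of case folding and the OR semantics concrete, here is a small illustration in plain Java. It does not use Lucene at all: the class SimpleTokens and its methods are hypothetical names, and the code only mimics the way SimpleAnalyzer splits text at non-letter characters and lowercases the resulting tokens. Your Indexer and Searcher should use the real SimpleAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: mimics SimpleAnalyzer-style tokenization without Lucene. */
final class SimpleTokens {

    /** Splits text into maximal runs of letters and lowercases each run. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter (space, digit, punctuation) ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** OR semantics: true if at least one query token occurs in the text. */
    static boolean matchesAny(String query, String text) {
        List<String> textTokens = tokenize(text);
        for (String queryToken : tokenize(query)) {
            if (textTokens.contains(queryToken)) {
                return true;
            }
        }
        return false;
    }
}
```

Note that digits and punctuation never become part of a token under this scheme, so a name such as "Superman's Pal Jimmy Olsen # 81" yields the tokens superman, s, pal, jimmy, olsen.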
Part C: Implement the Search Function
You will write the Java program Searcher.java, which carries out
keyword and spatial search over the eBay data.
The Searcher program will take as its first argument a list of
space-separated keywords, given within quotes.
For instance,
java Searcher "star trek"
returns a list of item-IDs, item names, and Lucene scores of all items that contain the word
"star" or the word "trek" (or both) in the name of the item, or in one
of the categories of the item, or in the description of the item.
This list may be returned in HTML format, but plain text without
any HTML tags is fine.
In the first line, print the number of hits.
Your program should print to stdout, not to a file.
When you initialize the query parser, again use Lucene's SimpleAnalyzer.
The output of java Searcher "superman" may look as follows (possibly with other numbers):
totalHits 72
1049430907, SUPERMAN WITH GEN 13 AND OTHER PRESTIGE BOOKS, score: 1.6115568
1045823269, Superman Doomsday Hunter Prey tpb ,score: 1.4560543
1047062670, Superman's Pal Jimmy Olsen # 81, score: 1.4560543
1048743351, Superman Lunchbox Hallmark Ornament, score: 1.3813344
1048647703, SUPERMAN COMIC N0.199 - AUSTRALIAN ISSUE, score: 1.2355031
1047692530, BATMAN OR SUPERMAN CHRISTMAS ORNAMENTS HOT!!, score: 1.1648434
1047761329, Superman Domed Lunchbox/Carrying Case NEW!!, score: 1.1511121
1048263344, SUPERMAN DAILY PLANET Magnet PICTURE FRAME, score: 1.0896113
1046936194, SUPERMAN METAL LUNCH BOX, score: 1.069977
1047388061, SUPERMAN #405 NM "BATMAN" (1985), score: 1.069977
.
.
Your Searcher program should be implemented in such a way
that it can also take three additional arguments,
in the following way:
java Searcher "star trek" -x longitude -y latitude -w width
where longitude, latitude, and width are numbers.
The longitude and latitude numbers describe a geo-location, and
width is the width of a square, in kilometers.
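One way to handle both call forms is to parse the command line into a small value object first. The following is a minimal sketch, not a required design: the class SearchArgs, its fields, and the choice to accept the -x, -y, and -w options in any order are all assumptions made for illustration.

```java
/** Hypothetical holder for the parsed Searcher command line. */
final class SearchArgs {
    final String keywords;
    final Double longitude; // null when no spatial restriction was requested
    final Double latitude;  // null when no spatial restriction was requested
    final Double widthKm;   // null when no spatial restriction was requested

    private SearchArgs(String keywords, Double longitude, Double latitude, Double widthKm) {
        this.keywords = keywords;
        this.longitude = longitude;
        this.latitude = latitude;
        this.widthKm = widthKm;
    }

    boolean isSpatial() {
        return longitude != null;
    }

    /** Accepts either {"keywords"} or {"keywords", "-x", lon, "-y", lat, "-w", width}. */
    static SearchArgs parse(String[] args) {
        if (args.length != 1 && args.length != 7) {
            throw new IllegalArgumentException(
                "usage: java Searcher \"keywords\" [-x longitude -y latitude -w width]");
        }
        Double x = null;
        Double y = null;
        Double w = null;
        // Options come in (flag, value) pairs after the keyword string.
        for (int i = 1; i < args.length; i += 2) {
            double value = Double.parseDouble(args[i + 1]);
            if ("-x".equals(args[i])) {
                x = value;
            } else if ("-y".equals(args[i])) {
                y = value;
            } else if ("-w".equals(args[i])) {
                w = value;
            } else {
                throw new IllegalArgumentException("unknown option: " + args[i]);
            }
        }
        if (args.length == 7 && (x == null || y == null || w == null)) {
            throw new IllegalArgumentException("-x, -y and -w must each be given once");
        }
        return new SearchArgs(args[0], x, y, w);
    }
}
```

Because values are read by position, a negative longitude such as -3.19 is not mistaken for an option flag.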
If these three parameters are given, then
your program should further restrict the results of the keyword search by
only returning items that
have longitude and latitude information,
and for which these numbers fall into the "square" that is centered
at the given longitude and latitude numbers, and that has the width
given by the width number. Recall that this is not exactly a square, because
of the curvature of the earth.
It is fine to use an actual square, as long as it is big enough so that any point
with actual distance at most "width" falls inside the square.
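A bounding box in degrees can be derived from a distance in kilometers roughly as follows. This is a sketch under stated assumptions: the class GeoBox is a hypothetical name, the earth is treated as a sphere with a mean radius of 6371 km, and locations near the poles or the 180-degree meridian are not handled.

```java
/** Hypothetical axis-aligned box in (longitude, latitude) degrees. */
final class GeoBox {
    static final double EARTH_RADIUS_KM = 6371.0; // mean radius, spherical approximation

    final double minLon;
    final double minLat;
    final double maxLon;
    final double maxLat;

    private GeoBox(double minLon, double minLat, double maxLon, double maxLat) {
        this.minLon = minLon;
        this.minLat = minLat;
        this.maxLon = maxLon;
        this.maxLat = maxLat;
    }

    /** Box extending halfSideKm north, south, east and west of the given centre. */
    static GeoBox around(double lon, double lat, double halfSideKm) {
        // One radian of latitude spans EARTH_RADIUS_KM kilometers everywhere.
        double dLat = Math.toDegrees(halfSideKm / EARTH_RADIUS_KM);
        // A radian of longitude shrinks with the cosine of the latitude.
        double dLon = Math.toDegrees(
            halfSideKm / (EARTH_RADIUS_KM * Math.cos(Math.toRadians(lat))));
        return new GeoBox(lon - dLon, lat - dLat, lon + dLon, lat + dLat);
    }

    boolean contains(double lon, double lat) {
        return lon >= minLon && lon <= maxLon && lat >= minLat && lat <= maxLat;
    }
}
```

Whether halfSideKm should be width / 2 or the full width depends on how you read the requirement above; passing the full width is the conservative choice, since every point within distance width of the centre then lies inside the box. The two corners can then serve as the rectangle in a MySQL minimum-bounding-rectangle test such as MBRContains over the indexed POINT column.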
In the first line, again print the number of total hits.
Then, return the items in this way:
first the item-ID, then the name, then the Lucene score, and
then the distance from the given geo-location (in kilometers).
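The assignment does not prescribe how to compute the distance. One common choice is the haversine formula for great-circle distance, sketched below; the class name GeoDistance and the spherical earth with a mean radius of 6371 km are assumptions.

```java
/** Hypothetical helper: great-circle distance via the haversine formula. */
final class GeoDistance {
    static final double EARTH_RADIUS_KM = 6371.0; // mean radius, spherical approximation

    /** Distance in kilometers between two (longitude, latitude) pairs given in degrees. */
    static double haversineKm(double lon1, double lat1, double lon2, double lat2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);
        double sinHalfPhi = Math.sin(dPhi / 2.0);
        double sinHalfLambda = Math.sin(dLambda / 2.0);
        double a = sinHalfPhi * sinHalfPhi
            + Math.cos(phi1) * Math.cos(phi2) * sinHalfLambda * sinHalfLambda;
        // Clamp to 1.0 to guard against rounding slightly above the valid range of asin.
        return 2.0 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }
}
```

With this radius, one degree of longitude along the equator comes out at roughly 111.19 km.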
Ranking
If your Searcher program is called only with the keywords and without
the other parameters, then the ranked list of item-IDs, names, and scores should be
- ordered by decreasing Lucene score
- for items with equal Lucene score, ordered
  by increasing price (i.e., lowest price first). Here, price is defined as the
  current price of the item.
Note that the Lucene score of a ScoreDoc s can be retrieved via s.score.
If your Searcher program is called with the
latitude, longitude, and width parameters,
then the ranked list of item-IDs and names of items should be
- ordered by decreasing Lucene score
- for items with equal Lucene score, ordered
  by increasing distance from the geo-location given by the latitude and longitude numbers
  (i.e., smallest distance first = closest items first)
- for items with equal Lucene score and equal distance, ordered
  by increasing price (i.e., lowest price first), where price is defined as before.
It is not required but perfectly acceptable to also print
the prices of items.
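Both ranking rules can be expressed as a single comparator. The sketch below uses hypothetical classes Hit and HitOrder; it assumes that each hit already carries its Lucene score, its distance in kilometers, and its current price, and that the distance is set to 0 for every hit in a keyword-only search so that the order reduces to score and then price.

```java
import java.util.Comparator;

/** Hypothetical result row; distanceKm is 0 for every hit in a keyword-only search. */
final class Hit {
    final String itemId;
    final String name;
    final float score;       // Lucene score, as read from ScoreDoc.score
    final double distanceKm;
    final double price;      // current price of the item

    Hit(String itemId, String name, float score, double distanceKm, double price) {
        this.itemId = itemId;
        this.name = name;
        this.score = score;
        this.distanceKm = distanceKm;
        this.price = price;
    }
}

/** Orders by decreasing score, then increasing distance, then increasing price. */
final class HitOrder implements Comparator<Hit> {
    @Override
    public int compare(Hit a, Hit b) {
        int byScore = Float.compare(b.score, a.score); // arguments swapped: highest first
        if (byScore != 0) {
            return byScore;
        }
        int byDistance = Double.compare(a.distanceKm, b.distanceKm);
        if (byDistance != 0) {
            return byDistance;
        }
        return Double.compare(a.price, b.price);
    }
}
```

A list of hits can then be sorted with Collections.sort(hits, new HitOrder()) before printing.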
Web Page and other Rankings
This part is not taken into account for marking and is optional.
You may experiment with the rudimentary web page we have provided.
For this, type in the following command in a terminal window of the VM:
php -S 127.0.0.1:8000 &
Now run firefox (in the VM) and go to
http://127.0.0.1:8000.
You will see a simple web page where you can type in a keyword, press
the button next to it, and see the result of running
your java Searcher program with the appropriate parameters.
You may try to add a way to see the description of items (e.g., by clicking on them),
or to highlight (e.g., in bold font) each occurrence of the keywords in the
display.
You may want to experiment by adding two more buttons:
one to rank by lowest price and one to rank by smallest distance.
In each case, for items with equal price or equal distance, respectively,
you should then order by Lucene score, and finally order by
smallest distance and lowest price, respectively.
NOTE
You do not need to implement the three functions
basicSearch(String query, int NumResultsToSkip, int numResultsToReturn),
spatialSearch(String query, SearchRegion region, int NumResultsToSkip, int numResultsToReturn)
and getHTMLforItemId(String itemId) that were mentioned on
the lecture slides!
The final version of the second assignment is what you see written on this web page.
Part D: Automate
Write a small runLoad.sh script.
First, if they do not exist yet, this script
creates the geo-coordinates table and the spatial index.
Next, the script compiles your Indexer.java and runs it
in order to build the Lucene index.
Finally, the script compiles your Searcher.java program.
Thus, after completion of this script, one may call the Searcher program
via java Searcher "list of keywords" (or with the
additional parameters) to obtain the correct output.
Sample Files
In the zip-file sampleFiles.zip
we have provided three sample files to help you get going:
Indexer.java, which creates an index in a directory "indexes"
and adds a few documents (the ones shown in the lecture slides)
to the index.
You can compile and run this class via the two-line
run-Indexer.sh shell script provided.
Then the file Searcher.java, which runs a keyword search with
the keyword provided as argument.
You can compile and run this class (searching for the keyword "the")
via the shell script run-Searcher.sh.
Finally, there is a file jdbc.java.
It runs a simple query against the MySQL "ad" database and displays the result.
You can compile and run this program via the
run-jdbc.sh script provided.
Note that on the VM, under the directory AD_assignment2, we have already
unpacked this zip file for you.
What to submit:
You should submit a file
assignment2.zip containing these files (and no directories whatsoever):
- (Optionally) A plain text file README.txt, with any comments you would like to make.
- Your MySQL scripts createSpatialIndex.sql, dropSpatialIndex.sql.
- The shell script runLoad.sh as described above.
- The Java file Indexer.java containing the source code of your Lucene Indexer.
- The Java file Searcher.java containing the source code of your Lucene Searcher.
- The Java file DbManager.java as provided by us.
- (optionally) partner.txt containing your partner's details.
- (optionally) index.php for running the Searcher via a web-page.
Submission Instruction
To submit, run the following command on DICE:
$ submit ad 2 assignment2.zip