Important Notes
- Late Policy: please click here to read about the school's late policy.
- Partner Programming: you can choose to submit individually or with a partner.
If you submit with a partner, then
- submit only one submission (by one of the students in the team) and
- include a file partner.txt which holds the name and UUN of your partner.
- Both students in the team receive the same mark.
- Virtual Machine (VM): we provide you with a VirtualBox disc image that runs Ubuntu Linux and has
everything installed that is needed for the assignment. Your code must compile and run within
this VM without any modifications to the VM (i.e., without installing any packages or libraries, etc.).
You may use the AD_assigment_1 directory provided. The username is ad, there is no password.
Click here to download image.
Warning: file size is 1.7GB
To use the image, run VirtualBox. Under Machine select New, provide a name,
choose type Linux and Ubuntu (32 bit).
Choose a memory size (512MB should be OK), then select "Use an existing virtual hard drive file", and
click on the folder-icon to select the disc image that you downloaded.
Simply "Cancel" when it asks about Ubuntu updates.
Assignment 2: Update (20-March-2017)
There was a mistake in the order of the result list given
for java Searcher "star trek".
This page has been updated to show the correct list.
Also note that an example of spatial search, namely
java Searcher "star trek" -x -73.997255 -y 40.732371 -w 10
has been added at the bottom of this web page.
Assignment 2: Search and Retrieval
In Assignment 1 you designed a relational schema for eBay data (given in XML).
You converted the XML files to CSV files and loaded them into your tables under
a MySQL database named "ad".
The purpose of this assignment is to
- use JDBC to fetch data from the "ad" database,
- insert this data into a Lucene index for full text search, and
- provide a search function that combines Lucene's full text search
with MySQL's spatial queries.
To efficiently carry out spatial search in MySQL, you are asked to
convert the longitude and latitude information of items into
points (i.e., into MySQL's
POINT data type),
and to build a spatial index over those points.
Note:
In this assignment you cannot use your database from Assignment 1,
but you must use the "ad" database that is provided on the VM.
If you run
mysql ad on the VM, then
show tables;
will list all tables in the "ad" database. The attributes of a table and their
types can be seen via
describe tablename, e.g.,
describe bid
will show the five attribute names of the bid-table:
bid_id, item_id, bidder_name, bid_time, and bid_price.
If you want to use the database on your local MySQL installation, start the VM and in a terminal type
mysqldump ad > ad.sql
This should create a file named "ad.sql".
Copy it out of the VM onto your local machine and load it into your local database via
mysql ad < /path/to/ad.sql.
Part A: Create a Spatial Index in MySQL
Here you are asked to write two sql-scripts named
createSpatialIndex.sql and
dropSpatialIndex.sql.
The
createSpatialIndex.sql-script carries out these tasks
- create a new table that associates geo-coordinates (as POINTs) with item-IDs, i.e.,
a table with two columns:
one column for item-IDs, and another column for geo-coordinates.
Geo-coordinates are points, represented using the POINT data type of MySQL.
- fill this table with all items that have latitude and longitude information, i.e.,
write a SQL insert into statement with a SQL query that
selects each item-ID together with its latitude and longitude information
(converted into a POINT).
- create a spatial index for the point column of your table from the previous step.
Recall that the point column must be declared as NOT NULL and that
the table from the first step must be created using
ENGINE=MyISAM, i.e., the CREATE TABLE statement has
this structure:
CREATE TABLE IF NOT EXISTS xxx (...) ENGINE = MyISAM;
The
dropSpatialIndex.sql script drops the index and the table
generated by the
createSpatialIndex.sql script.
Note that we will never call the
dropSpatialIndex.sql script
in the further steps described below.
Part B: Create a Lucene Index
Here you are asked to write a program
Indexer.java that
creates a lucene index.
The index should be stored in a directory named
indexes
(under the current working directory from which
runLoad.sh is called).
We will want to use this index to carry out keyword searches over the
union of the name, categories, and description of an item.
For example, for the query "Disney", your basic search function should
print (in HTML)
the item-ID and name of items that have the keyword "Disney" in the union of the name,
category and description entries.
Also, for multiple word queries, such as "star trek", you should consider that
as "star" OR "trek". That is, you should return any item if it has either
"star" or "trek" in its name, its categories, or its description.
Use the
SimpleAnalyzer of Lucene which carries out casefolding.
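The OR semantics described above can be illustrated without Lucene. The following is a plain-Java sketch, not Lucene code: it mimics what SimpleAnalyzer does (split text at non-letter characters and lowercase the tokens) and then checks whether any query keyword occurs in the union of the three fields. The class and method names (KeywordMatchSketch, tokenize, matchesAny) are made up for this illustration; in your Indexer and Searcher, Lucene does this work for you.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: mimics the behaviour of Lucene's SimpleAnalyzer
// (split at non-letters, lowercase). All names here are hypothetical.
class KeywordMatchSketch {

    // Split text into lowercase tokens at every non-letter character.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // True if at least one query keyword occurs in the union of
    // name, categories, and description (i.e., "star" OR "trek").
    static boolean matchesAny(String query, String name, String categories, String description) {
        Set<String> itemTokens = new HashSet<>(tokenize(name + " " + categories + " " + description));
        for (String keyword : tokenize(query)) {
            if (itemTokens.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesAny("star trek", "STAR WARS poster", "Movies", "mint condition"));
        System.out.println(matchesAny("star trek", "Superman comic", "Comics", "rare"));
    }
}
```

Because of case folding, the query "star trek" matches an item named "STAR WARS poster": the token "star" is shared even though "trek" is absent.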
Part C: Implement the Search Function
You will write the Java program
Searcher.java which carries out
keyword and spatial search over the eBay data.
The
Searcher program will take as first argument a list of
space-separated keywords, given within quotes.
For instance,
java Searcher "star trek"
returns a list of item-IDs, item-names, and Lucene scores of all items that contain the word
"star" or the word "trek" (or both) in the name of the item, or in one
of the categories of the item, or in the description of the item.
This list is returned in HTML format. In fact, just plain text without
HTML-tags is fine.
In the first line, print the number of hits.
Your program should print to stdout, not a file.
When you initialize the query parser, use again the
SimpleAnalyzer of Lucene.
An output (possibly with other numbers)
of
java Searcher "superman" may look as follows:
totalHits 72
1049430907, SUPERMAN WITH GEN 13 AND OTHER PRESTIGE BOOKS, score: 1.6115568, price: 6.00
1047062670, Superman's Pal Jimmy Olsen # 81, score: 1.4560543, price: 1.20
1045823269, Superman Doomsday Hunter Prey tpb ,score: 1.4560543, price: 7.97
1048743351, Superman Lunchbox Hallmark Ornament, score: 1.3813344, price: 9.99
1048647703, SUPERMAN COMIC N0.199 - AUSTRALIAN ISSUE, score: 1.2355031, price: 1.99
1047692530, BATMAN OR SUPERMAN CHRISTMAS ORNAMENTS HOT!!, score: 1.1648434, price: 19.99
1047761329, Superman Domed Lunchbox/Carrying Case NEW!!, score: 1.1511121, price: 11.99
1048263344, SUPERMAN DAILY PLANET Magnet PICTURE FRAME, score: 1.0896113, price: 4.95
1047388061, SUPERMAN #405 NM "BATMAN" (1985), score: 1.069977, price: 6.99
1046936194, SUPERMAN METAL LUNCH BOX, score: 1.069977, price: 12.95
.
.
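One way to produce lines in the shape of the sample output above is sketched here. ResultLine, format, and render are hypothetical names for this illustration, and the exact number formatting (two decimals for the price) is an assumption, since the assignment only fixes which values are printed and in which order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Sketch: a hit count followed by one line per item. All names are hypothetical.
class OutputSketch {

    static final class ResultLine {
        final String itemId;
        final String name;
        final float score;   // Lucene scores are floats
        final double price;

        ResultLine(String itemId, String name, float score, double price) {
            this.itemId = itemId;
            this.name = name;
            this.score = score;
            this.price = price;
        }
    }

    // Format one result as "itemId, name, score: S, price: P".
    static String format(ResultLine r) {
        // Locale.ROOT keeps the decimal separator a dot on every system.
        return String.format(Locale.ROOT, "%s, %s, score: %s, price: %.2f",
                r.itemId, r.name, Float.toString(r.score), r.price);
    }

    // First line is the number of hits, then one line per listed item.
    static String render(int totalHits, List<ResultLine> results) {
        StringBuilder out = new StringBuilder();
        out.append("totalHits ").append(totalHits).append('\n');
        for (ResultLine r : results) {
            out.append(format(r)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<ResultLine> hits = Arrays.asList(
                new ResultLine("1046936194", "SUPERMAN METAL LUNCH BOX", 1.069977f, 12.95));
        System.out.print(render(72, hits)); // written to stdout, not to a file
    }
}
```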
Your
Searcher program should be implemented in such a way
that it can also take three additional parameters,
in the following way:
java Searcher "star trek" -x longitude -y latitude -w width
where
longitude,
latitude, and
width are numbers.
The longitude and latitude numbers describe a geo-location, and the
width is a number that describes the
radius of a circle, in
kilometers.
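A minimal sketch of how the command line above might be parsed is given here. ArgsSketch and its fields are hypothetical names, and accepting the -x, -y, and -w options in any order is an assumption of this sketch; the assignment only shows them in the order -x, -y, -w.

```java
// Sketch of parsing: java Searcher "keywords" [-x longitude -y latitude -w width]
// Class and field names are hypothetical.
class ArgsSketch {
    final String keywords;
    final Double longitude; // null when no spatial search was requested
    final Double latitude;
    final Double widthKm;

    ArgsSketch(String keywords, Double longitude, Double latitude, Double widthKm) {
        this.keywords = keywords;
        this.longitude = longitude;
        this.latitude = latitude;
        this.widthKm = widthKm;
    }

    boolean isSpatial() {
        return longitude != null;
    }

    static ArgsSketch parse(String[] args) {
        // Either just the keywords, or the keywords plus three option/value pairs.
        if (args.length != 1 && args.length != 7) {
            throw new IllegalArgumentException(
                    "usage: java Searcher \"keywords\" [-x longitude -y latitude -w width]");
        }
        Double x = null, y = null, w = null;
        for (int i = 1; i + 1 < args.length; i += 2) {
            double value = Double.parseDouble(args[i + 1]); // also accepts negative numbers
            switch (args[i]) {
                case "-x": x = value; break;
                case "-y": y = value; break;
                case "-w": w = value; break;
                default: throw new IllegalArgumentException("unknown option: " + args[i]);
            }
        }
        if (args.length == 7 && (x == null || y == null || w == null)) {
            throw new IllegalArgumentException("-x, -y and -w must each be given once");
        }
        return new ArgsSketch(args[0], x, y, w);
    }

    public static void main(String[] ignored) {
        String[] sample = {"star trek", "-x", "-73.997255", "-y", "40.732371", "-w", "10"};
        ArgsSketch parsed = parse(sample);
        System.out.println(parsed.keywords + " spatial=" + parsed.isSpatial());
    }
}
```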
If these three parameters are given, then
your program should further restrict the results of the keyword search by
only returning items that
have longitude and latitude numbers,
and for which these numbers denote a location that has distance to the given
longitude and latitude numbers of at most the given width number.
You should carry out the spatial search in two stages: first find items in a bounding
box that is guaranteed to be large enough to contain all items of distance
at most
width. Then, from the items in this box, keep only those whose precise distance
(computed with the distance function for longitude/latitude pairs) is at most width.
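The two stages can be sketched in plain Java as follows. This is an illustration under stated assumptions, not a prescribed implementation: it models the Earth as a sphere of radius 6371 km, uses the haversine formula as the precise distance function, and ignores wrap-around at the 180th meridian. The names SpatialSketch, boundingBox, inBox, and distanceKm are hypothetical.

```java
// Sketch of the two-stage spatial filter on a spherical Earth model.
// Names and the radius constant are assumptions of this illustration.
class SpatialSketch {
    static final double EARTH_RADIUS_KM = 6371.0; // mean radius, an approximation

    // Stage 2: great-circle (haversine) distance in kilometers.
    static double distanceKm(double lon1, double lat1, double lon2, double lat2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = phi2 - phi1;
        double dLambda = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dPhi / 2) * Math.sin(dPhi / 2)
                + Math.cos(phi1) * Math.cos(phi2) * Math.sin(dLambda / 2) * Math.sin(dLambda / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    // Stage 1: a box {minLon, minLat, maxLon, maxLat} that contains every point
    // within widthKm of (lon, lat). Longitudes are not wrapped at +/-180.
    static double[] boundingBox(double lon, double lat, double widthKm) {
        double r = widthKm / EARTH_RADIUS_KM; // angular radius in radians
        double dLat = Math.toDegrees(r);
        double minLat = Math.max(-90.0, lat - dLat);
        double maxLat = Math.min(90.0, lat + dLat);
        // Widest longitude extent of the circle; the whole range if it reaches a pole.
        double ratio = Math.sin(r) / Math.cos(Math.toRadians(lat));
        double dLon = (r >= Math.PI / 2 || ratio >= 1.0) ? 180.0 : Math.toDegrees(Math.asin(ratio));
        return new double[] {lon - dLon, minLat, lon + dLon, maxLat};
    }

    static boolean inBox(double[] box, double lon, double lat) {
        return lon >= box[0] && lon <= box[2] && lat >= box[1] && lat <= box[3];
    }

    public static void main(String[] args) {
        double lon = -73.997255, lat = 40.732371, width = 10;
        double[] box = boundingBox(lon, lat, width);
        double candLon = -73.95, candLat = 40.75; // a made-up candidate item location
        boolean keep = inBox(box, candLon, candLat)
                && distanceKm(lon, lat, candLon, candLat) <= width;
        System.out.println("keep candidate: " + keep);
    }
}
```

In the Searcher, the box test would typically be delegated to MySQL so that the spatial index is used, with the precise distance check applied to the rows that come back.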
Return the items in this way:
first the item-ID, then the name, then the Lucene score, and
then the distance from the given geo-location (in kilometers).
Ranking
If your
Searcher program is called only with the keywords and without
the other parameters, then the ranked list of item-IDs, names, and scores should be
- ordered by decreasing Lucene-score
- for items with equal Lucene score, order them
by increasing price (i.e., lowest price first). Here, price is defined as the
current price of the item.
Note that the Lucene score of a
ScoreDoc s can be retrieved via
s.score.
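The keyword-only ordering can be expressed as a comparator. The sketch below assumes that the score and the current price of each hit have already been collected into a small value class; Hit, ORDER, and rankedIds are hypothetical names, not part of Lucene or of the provided sample files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the keyword-only ranking: score descending, then price ascending.
// All names are hypothetical.
class KeywordRankingSketch {

    static final class Hit {
        final String itemId;
        final float score;  // as read from ScoreDoc.score
        final double price; // current price of the item

        Hit(String itemId, float score, double price) {
            this.itemId = itemId;
            this.score = score;
            this.price = price;
        }
    }

    static final Comparator<Hit> ORDER = (a, b) -> {
        int byScore = Float.compare(b.score, a.score); // higher score first
        if (byScore != 0) {
            return byScore;
        }
        return Double.compare(a.price, b.price);       // lower price first
    };

    static List<String> rankedIds(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(ORDER);
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.itemId);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Scores and prices taken from the "superman" sample output.
        List<Hit> hits = Arrays.asList(
                new Hit("1045823269", 1.4560543f, 7.97),
                new Hit("1049430907", 1.6115568f, 6.00),
                new Hit("1047062670", 1.4560543f, 1.20));
        System.out.println(rankedIds(hits));
    }
}
```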
If your
Searcher program is called with the
latitude, longitude, and width parameters,
then the ranked list of item-IDs and names of items should be
- ordered by decreasing Lucene-score
- for items with equal Lucene score, order them
by increasing distance from the geo-location given by the latitude and longitude numbers
(i.e., smallest distance first = closest items first)
- for items with equal Lucene score and equal distance, order them
by increasing price (i.e., lowest price first), where price is defined as before.
It is not required but perfectly acceptable (and useful) to also print
the prices of items.
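The three-level ordering for spatial searches can be sketched the same way. As before, Hit, ORDER, and rankedIds are hypothetical names, and the sketch assumes that score, distance, and current price are already available for each hit. It compares the raw distance values; whether distances are rounded before comparison is not specified by the assignment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the spatial ranking: score descending, then distance ascending,
// then price ascending. All names are hypothetical.
class SpatialRankingSketch {

    static final class Hit {
        final String itemId;
        final float score;
        final double distanceKm;
        final double price;

        Hit(String itemId, float score, double distanceKm, double price) {
            this.itemId = itemId;
            this.score = score;
            this.distanceKm = distanceKm;
            this.price = price;
        }
    }

    static final Comparator<Hit> ORDER = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);              // higher score first
        if (byScore != 0) {
            return byScore;
        }
        int byDistance = Double.compare(a.distanceKm, b.distanceKm); // closer first
        if (byDistance != 0) {
            return byDistance;
        }
        return Double.compare(a.price, b.price);                     // cheaper first
    };

    static List<String> rankedIds(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(ORDER);
        List<String> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.itemId);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Values taken from the "star trek" spatial sample output.
        List<Hit> hits = Arrays.asList(
                new Hit("1495290715", 0.096099295f, 5.91, 19.99),
                new Hit("1496408475", 0.096099295f, 5.778, 24.95),
                new Hit("1678501544", 0.13453901f, 5.91, 35.0),
                new Hit("1046248019", 0.13453901f, 5.91, 9.99));
        System.out.println(rankedIds(hits));
    }
}
```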
Here is an example of a spatial search.
$ java Searcher "star trek" -x -73.997255 -y 40.732371 -w 10
Running search(star trek)
totalHits 13
1049497688, Star Trek Impel 1991 series 1 & 2 complete, score: 0.9100517, dist: 5.91, price: 5.0
1498090827, NEW Star Trek Show TV Guide 3 (MINT) +!!, score: 0.7392576, dist: 5.91, price: 4.0
1496025492, **STAR TREK II 2 THE WRATH OF KHAN** on DVD, score: 0.5630657, dist: 5.91, price: 12.0
1046248019, DOMINICK SWAIN 4 Candid Photos #d2-12 4x6, score: 0.13453901, dist: 5.91, price: 9.99
1678501544, JEFF HAMILTON ORLANDO ALL STAR WEEKEND JACKET, score: 0.13453901, dist: 5.91, price: 35.0
1496595283, ZORRO SET & THE SIGN OF ZORRO-BRAND NEW-SEAL, score: 0.13315909, dist: 5.91, price: 49.99
1047947208, HALLMARK DARTH VADER LUKE SKYWALKER ORNAMENTS, score: 0.10872394, dist: 5.91, price: 15.0
1046171386, Star Spangled Soldier - Unkown Soldier # 182, score: 0.10872394, dist: 6.582, price: 2.5
1496408475, Vin 10x13 Joan Crawford Willinger '38Portrait, score: 0.096099295, dist: 5.778, price: 24.95
1495290715, KEVIN SORBO ANDROMEDA XENA SIGNED AUTOGRAPH, score: 0.096099295, dist: 5.91, price: 19.99
1496307141, (ROCK STAR) Zakk Wylde/ BLACK LABEL SOCIETY, score: 0.08322443, dist: 4.361, price: 1.99
1494690860, Autographed Photo of Soap Star Matt Cedeno, score: 0.081542954, dist: 5.91, price: 15.5
1679297745, NOT IN STORES ADIDAS T-MAC basketball sneaker, score: 0.076879434, dist: 5.91, price: 70.0
Web Page and other Rankings
This part is not taken into account for marking and is optional.
You may experiment with the rudimentary web page we have provided.
For this, type in the following command in a terminal window of the VM:
php -S 127.0.0.1:8000 &
Now run firefox (in the VM) and go to
http://127.0.0.1:8000.
You will see a simple web page where you can type in a keyword, press
the button next to it, and see the result of running
your java Searcher program with the appropriate parameters.
You may try to add a way to see the description of items (e.g., by clicking on them),
or to highlight (e.g., in bold font) each occurrence of the keywords in the
display.
You may want to experiment by adding two more buttons:
one to rank by
lowest price and one for ranking by
smallest distance.
In each case, for items with equal price or equal distance, respectively,
you should then order by Lucene score, and finally order by
smallest distance and lowest price, respectively.
NOTE
You do
not need to implement the three functions
basicSearch(String query, int NumResultsToSkip, int numResultsToReturn),
spatialSearch(String query, SearchRegion region, int NumResultsToSkip, int numResultsToReturn)
and
getHTMLforItemId(String itemId) that were mentioned on
the lecture slides!
The final version of the second assignment is what you see written on this web page.
Part D: Automate
Write a small
runLoad.sh script.
First, if they do not exist yet, this script
creates the geo-coordinates table and the spatial index.
Next, the script compiles your
Indexer.java and runs it
in order to build the Lucene index.
Finally, the script compiles your
Searcher.java program.
Thus, after completion of this script, one may call the Searcher program
via
java Searcher "list of keywords" (or, with the
additional parameters) to obtain the correct output.
Sample Files
In the zip-file
sampleFiles.zip
we have provided three sample files to help you get going:
Indexer.java which creates an index in a directory "indexes"
and adds a few documents (the ones shown in the lecture slides)
into the index.
You can compile and run this class via the two-line
run-Indexer.sh
shell-script provided.
Then the file
Searcher.java which runs a keyword search with
the keyword provided as argument.
You can compile and run this class (searching for the keyword "the")
via the shell-script
run-Searcher.sh.
Finally, there is a file
jdbc.java.
It runs a simple query against the MySQL "ad" database and displays the result.
You can compile and run this program via the
run-jdbc.sh script provided.
Note that on the VM, under the directory
AD_assignment2 we have already
unpacked this zip file for you.
What to submit:
You should submit a file
assignment2.zip containing these files (and no directories whatsoever):
- (Optionally) A plain text file README.txt, with any comments you would like to make.
- Your MySQL scripts createSpatialIndex.sql, dropSpatialIndex.sql.
- The shell script runLoad.sh as described above.
- The Java file Indexer.java containing the source code of your Lucene Indexer.
- The Java file Searcher.java containing the source code of your Searcher.
- The Java file DbManager.java as provided by us.
- (optionally) partner.txt containing your partner's details.
- (optionally) index.php for running the Searcher via a web-page.
Submission Instructions
To submit, run the following command on DICE:
$ submit ad 2 assignment2.zip