Important Notes


Assignment 2: Update (20-March-2017)

There had been a mistake in the order of the result list given for java Searcher "start trek". This pages has been updated to show the correct list. Also note that an example of spatial search, namely java Searcher "star trek" -x -73.997255 -y 40.732371 -w 10 has been added at the bottom of this web page.

Assignment 2: Search and Retrieval

In Assignment 1 you desigend a relational schema for EBAY data (given in XML). You converted the XML files to cvs-files and loaded them into your tables under a MySQL database named "ad". The purpose of this assignment is to
  1. use JDBC to fetch data from the "ad" database,
  2. insert this data into a Lucene index for full text search, and
  3. provide a search function that combines Lucene's full text search with MySQL's spatial queries.
To efficiently carry out spatial search in MySQL, you are asked to convert the longitude and latitude information of items into points (i.e., into MySQL's POINT data type), and to build a spacial index over those points.

Note: In this assignment you cannot use your database from Assignment 1, but you must use the "ad" database that is provided on the VM. If you run mysql ad on the VM, then show tables; will list all tables in the "ad" database. The attributes of a table and their types can be seen via describe tablename, e.g., describe bid will show the five attribute names of the bid-table: bid_id, item_id, bidder_name, bid_time, and bid_price. If you want to use the database on your local MySQL installation, start the VM and in a terminal type mysqldump ad > ad.sql This should create a file named "ad.sql". Copy it out of the VM onto your local machine and load it into your local database via mysql ad < /path/to/ad.sql.

Part A: Create a Spatial Index in MySQL

Here you are asked to write two sql-scripts named createSpatialIndex.sql and dropSpatialIndex.sql. The createSpatialIndex.sql-script carries out these tasks
  1. create a new table that associates geo-coordinates (as POINTs) to item-IDs, i.e., a table with two columns: one column for item-IDs, and another column for geo-coordinates. Geo-coordinates are points, represented using the POINT data type of MySQL.
  2. fill this table with all items that have latitude and longitude information, i.e., write a SQL insert into statement with a SQL query that selects each item-ID together with its latitude and longitued information (converted into a POINT).
  3. create a spatial index for the point column of your table from the previous step. Recall that the point-column must be declared as NOT NULL and that the table of the previous step must be created using ENGINE=MyISAM, i.e., the statement of the previous step has this structure

    CREATE TABLE IF NOT EXISTS xxx (...) ENGINE = MyISAM;

The dropSpatialIndex.sql script drops the index and the table generated by the createSpatialIndex.sql script. Note that we will never call the dropSpatialIndex.sql script in the further steps described below.

Part B: Create a Lucene Index

Here you are asked to write a program Indexer.java that creates a lucene index. The index should be stored in a directory named indexes (under the current working directory from which runLoad.sh is called). We will want to use this index to carry out keyword searches over the union of the name, categories, and description of an item. For example, for the query "Disney", your basic search function should print (in HTML) the item-ID and name of items that have the keyword "Disney" in the union of the name, category and description entires. Also, for multiple word queries, such as "star trek", you should consider that as "star" OR "trek". That is, you should return any item if it has either "star" or "trek" in its name, its categories, or its description. Use the SimpleAnalyzer of Lucene which carries out casefolding.

Part C: Implement the Search Function

You will write the Java program Searcher.java which carries out keyword and spatial search over the ebay data. The Searcher program will take as first argument a list of space-separated keywords, given within quotes. For instance,

java Searcher "star trek"

returns a list of item-IDs, item-names, and Lucene scores of all items that contain the word "star" or the word "trek" (or both) in the name of the item, or in one of the categories of the item, or in the description of the item. This list is returned in HTML format. In fact, just plain text without HTML-tags is fine. In the first line, print the number of hits. Your program should print to stdout, not a file. When you initilize the query parser, use again the SimpleAnalyzer of Lucene. An output (possibly with other numbers) of java Searcher "superman" may look as follows:

totalHits 72
1049430907, SUPERMAN WITH GEN 13 AND OTHER PRESTIGE BOOKS, score: 1.6115568, price: 6.00
1047062670, Superman's Pal Jimmy Olsen # 81, score: 1.4560543, price: 1.20
1045823269, Superman Doomsday Hunter Prey tpb ,score: 1.4560543, price: 7.97
1048743351, Superman Lunchbox Hallmark Ornament, score: 1.3813344, price: 9.99
1048647703, SUPERMAN COMIC N0.199 - AUSTRALIAN ISSUE, score: 1.2355031, price: 1.99
1047692530, BATMAN OR SUPERMAN CHRISTMAS ORNAMENTS HOT!!, score: 1.1648434, price: 19.99
1047761329, Superman Domed Lunchbox/Carrying Case NEW!!, score: 1.1511121, price: 11.99
1048263344, SUPERMAN DAILY PLANET Magnet PICTURE FRAME, score: 1.0896113, price: 4.95
1047388061, SUPERMAN #405 NM "BATMAN" (1985), score: 1.069977, price: 6.99
1046936194, SUPERMAN METAL LUNCH BOX, score: 1.069977, price: 12.95
.
.

Your Searcher program should be implemented in such a way that it can also take three arguments, in the following way:

java Searcher "star trek" -x longitude -y latitude -w width

where longitude, latitude, and width are numbers. The longitude and latitude numbers descibe a geo-location, and the width is a number that describes the radius of a circle, in kilometers. If such three parameters are given, then your program should further restrict the results of the keyword search by only returning items that have longitude and latitude numbers, and for which these numbers denote a location that has distance to the given longitude and latitude numbers of at most the given width number. You should carry out the spatial search in two stages: first find items in a bounding box that is guaranteed to be large enough to contain all items of distance at most width. Then from the items in this box, filter out the ones with the correct distances using the precise distance function for longitude/latitude pairs. Return the items in this way: first the item-ID, then the name, then the Lucene score, and then the distance from the given geo-location (in kilometers).
Ranking
If your Searcher program is called only with the keywords and without the other parameters, then the ranked list of item-IDs, names, and scores should be Note that the Lucene score of a ScoreDoc s can be retrieved via s.score. If your Searcher program is called with the latitude, longitude, and width parameters, then the ranked list of item-IDs and names of items should be It is not required but perfectly acceptable (and useful) to also print the prices of items.

Here is an example of a spatial search.

$ java Searcher "star trek" -x -73.997255 -y 40.732371 -w 10
Running search(star trek)
totalHits 13
1049497688, Star Trek Impel 1991 series 1 & 2 complete, score: 0.9100517, dist: 5.91, price: 5.0
1498090827, NEW Star Trek Show TV Guide 3 (MINT) +!!, score: 0.7392576, dist: 5.91, price: 4.0
1496025492, **STAR TREK II 2 THE WRATH OF KHAN** on DVD, score: 0.5630657, dist: 5.91, price: 12.0
1046248019, DOMINICK SWAIN 4 Candid Photos #d2-12 4x6, score: 0.13453901, dist: 5.91, price: 9.99
1678501544, JEFF HAMILTON ORLANDO ALL STAR WEEKEND JACKET, score: 0.13453901, dist: 5.91, price: 35.0
1496595283, ZORRO SET & THE SIGN OF ZORRO-BRAND NEW-SEAL, score: 0.13315909, dist: 5.91, price: 49.99
1047947208, HALLMARK DARTH VADER LUKE SKYWALKER ORNAMENTS, score: 0.10872394, dist: 5.91, price: 15.0
1046171386, Star Spangled Soldier - Unkown Soldier # 182, score: 0.10872394, dist: 6.582, price: 2.5
1496408475, Vin 10x13 Joan Crawford Willinger '38Portrait, score: 0.096099295, dist: 5.778, price: 24.95
1495290715, KEVIN SORBO ANDROMEDA XENA SIGNED AUTOGRAPH, score: 0.096099295, dist: 5.91, price: 19.99
1496307141, (ROCK STAR) Zakk Wylde/ BLACK LABEL SOCIETY, score: 0.08322443, dist: 4.361, price: 1.99
1494690860, Autographed Photo of Soap Star Matt Cedeno, score: 0.081542954, dist: 5.91, price: 15.5
1679297745, NOT IN STORES ADIDAS T-MAC basketball sneaker, score: 0.076879434, dist: 5.91, price: 70.0

Web Page and other Rankings
This part is not taken into account for marking and is optional. You may experiment with the rudimentary web page we have provided. For this, type in the following command in a terminal window of the VM:

php -S 127.0.0.1:8000 &

Now run firefox (in the VM) and go to http://127.0.0.1:8000. You will see a simple web page where you can type in a keyword, press the button next to it, and get displayed to result of running your java Searcher program with the appropriate parameters. You may try to add a way to see the description of items (e.g., by clicking on them), or to highlight (e.g., in bold font) each occurrence of the keywords in the display. You may want to experiment by adding two more buttons: one to rank by lowest price and one for ranking by smallest distance. In each case, for items with equal price or equal distance, respectively, you should then order by Lucene score, and finally order by smallest distance and lowest price, respectively.
NOTE
You do not need to implement the three functions basicSearch(String query, int NumResultsToSkip, int numResultsToReturn), spatialSearch(String query, SearchRegion region, int NumResultsToSkip, int numResultsToReturn) and getHTMLforItemId(String itemId) that were mentioned on the lecture slides! The final version of the second assignment is what you see written on this web page.

Part D: Automize

Write a small runLoad.sh script. First, if they do not exists yet, this script creates the geo-coordinates table and the spatial index. Next, the script compiles your Indexer.java and runs it in order to build the Lucene index. Finally, the script compiles your Searcher.java program. Thus, after completion of this script, one may call the Searcher program via java Searcher "list of keywords" (or, with the additional parameters) to obtain the correct output.
Sample Files
In the zip-file sampleFiles.zip we have provided four three sample files the help you get going: Indexer.java which creates an index in a directory "indexes" and adds a few documents (the ones shown in the lecture slides) into the index. You can compile and run this class via the two-line run-Indexer.sh shell-script provided. Then the file Searcher.java which runs a keyword search with the keyword provided as argument. You can compile and run this class (searching for the keyword "the") via the shell-script run-Searcher.sh. Finally, there is a file jdbc.java. It runs a simple query against the MySQL "ad" database and displays the result. You can compile and run this program via the run-jdbc.sh script provided. Note that on the VM, under the directory AD_assignment2 we have already unpacked this zip file for you.
What to submit:
You should submit a file assignment2.zip containing these files (and no directories whatsoever):
Submission Instruction
To submit, run the following command on DICE:

$ submit ad 2 assignment2.zip


Home : Teaching : Courses : Ad 

Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 651 5661, Fax: +44 131 651 1426, E-mail: school-office@inf.ed.ac.uk
Please contact our webadmin with any comments or corrections. Logging and Cookies
Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh