Assignment 2: Search and Retrieval

In Assignment 1 you designed a relational schema for eBay data (given in XML). You converted the XML files to CSV files and loaded them into your tables in a MySQL database named "ad". The purpose of this assignment is to
  1. use JDBC to fetch data from your "ad" database,
  2. insert this data into a Lucene index for full text search, and
  3. provide a search function that combines Lucene's full text search with MySQL's spatial queries.
To carry out spatial search efficiently in MySQL, you are asked to convert the longitude and latitude information of items into points (i.e., into MySQL's POINT data type), and to build a spatial index over those points.

Note: You can either use your own "ad" database generated with your Assignment 1 solution, or use the "ad" database that we provide for you on the VM. On the VM you can run mysql ad and then type show tables to see a list of the table names of that database. Use describe tablename to see the schema of a table. For instance, describe item shows the column names and types of the table item.

Part A: Create a Spatial Index in MySQL

Here you are asked to write two SQL scripts named createSpatialIndex.sql and dropSpatialIndex.sql. The createSpatialIndex.sql script carries out these tasks:
  1. create a new table that associates geo-coordinates (as POINTs) to item-IDs, i.e., a table with two columns: one column for item-IDs, and another column for geo-coordinates. Geo-coordinates are points, represented using the POINT data type of MySQL.
  2. fill this table with all items that have latitude and longitude information, i.e., write a SQL insert into statement with a SQL query that selects each item-ID together with its latitude and longitude information (converted into a POINT).
  3. create a spatial index for the point column of your table. Recall that the point column must be declared as NOT NULL and that the table from step 1 must be created using ENGINE=MyISAM, i.e., the CREATE TABLE statement of step 1 has this structure:

    CREATE TABLE IF NOT EXISTS xxx (...) ENGINE = MyISAM;

The dropSpatialIndex.sql script drops the index and the table generated by the createSpatialIndex.sql script. Note that we will never call the dropSpatialIndex.sql script in the further steps described below.

Part B: Create a Lucene Index

Here you are asked to write a program Indexer.java that creates a Lucene index. The index should be stored in a directory named indexes (under the current working directory from which runLoad.sh is called). We will use this index to carry out keyword searches over the union of the name, categories, and description of an item. For example, for the query "Disney", your basic search function should print (in HTML) the item-ID and name of all items that have the keyword "Disney" in the union of the name, categories, or description attributes. A multiple-word query such as "star trek" should be treated as "star" OR "trek". That is, you should return every item that has either "star" or "trek" in its name, its categories, or its description. Use the SimpleAnalyzer of Lucene, which carries out case folding.
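The matching rule described above (case folding plus OR semantics over the union of name, categories, and description) can be illustrated without Lucene. The following is only a plain-Java sketch of the intended semantics, not the Lucene-based Indexer you are asked to write. The class and method names are made up for illustration, and the tokenizer merely imitates SimpleAnalyzer by splitting text at non-letters and lowercasing the pieces.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class: illustrates the expected matching semantics only.
class OrMatchSketch {

    // Imitates SimpleAnalyzer: split at non-letter characters, then lowercase.
    static Set<String> tokens(String text) {
        Set<String> out = new HashSet<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // An item matches if ANY query term occurs in the union of its three fields.
    static boolean matches(String query, String name, String categories, String description) {
        Set<String> content = tokens(name + " " + categories + " " + description);
        for (String term : tokens(query)) {
            if (content.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "star" occurs in the name, so the OR query matches.
        System.out.println(matches("star trek", "STAR WARS poster", "Movies", ""));
        // "disney" occurs in none of the three fields.
        System.out.println(matches("Disney", "Superman Lunchbox", "Collectibles", "metal"));
    }
}
```

In an actual solution, one way to get the same effect is to index the concatenation of the three attributes as a single searchable field, but that is a design choice left to you.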

Part C: Implement the Search Function

You will write the Java program Searcher.java, which carries out keyword and spatial search over the eBay data. The Searcher program takes as its first argument a list of space-separated keywords, given within quotes. For instance,

java Searcher "star trek"

returns a list of item-IDs, item names, and Lucene scores of all items that contain the word "star" or the word "trek" (or both) in the name of the item, in one of its categories, or in its description. This list is returned in HTML format; in fact, plain text without any HTML tags is fine. In the first line, print the number of hits. Your program should print to stdout, not to a file. When you initialize the query parser, again use Lucene's SimpleAnalyzer. The output of java Searcher "superman" (possibly with other numbers) may look as follows:

totalHits 72
1049430907, SUPERMAN WITH GEN 13 AND OTHER PRESTIGE BOOKS, score: 1.6115568
1045823269, Superman Doomsday Hunter Prey tpb ,score: 1.4560543
1047062670, Superman's Pal Jimmy Olsen # 81, score: 1.4560543
1048743351, Superman Lunchbox Hallmark Ornament, score: 1.3813344
1048647703, SUPERMAN COMIC N0.199 - AUSTRALIAN ISSUE, score: 1.2355031
1047692530, BATMAN OR SUPERMAN CHRISTMAS ORNAMENTS HOT!!, score: 1.1648434
1047761329, Superman Domed Lunchbox/Carrying Case NEW!!, score: 1.1511121
1048263344, SUPERMAN DAILY PLANET Magnet PICTURE FRAME, score: 1.0896113
1046936194, SUPERMAN METAL LUNCH BOX, score: 1.069977
1047388061, SUPERMAN #405 NM "BATMAN" (1985), score: 1.069977
.
.
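The line format of the sample output above can be reproduced with plain string concatenation. The following is a minimal sketch, assuming the hits have already been retrieved; the class name, method names, and the two hard-coded hits are hypothetical and only serve to show the layout.

```java
// Hypothetical helper class: shows only the output layout, not the search itself.
class OutputSketch {

    // First line of the output: the total number of hits.
    static String header(int totalHits) {
        return "totalHits " + totalHits;
    }

    // One result line: item-ID, name, and Lucene score.
    static String hitLine(long itemId, String name, float score) {
        return itemId + ", " + name + ", score: " + score;
    }

    public static void main(String[] args) {
        // Made-up values, printed to stdout as the assignment requires.
        System.out.println(header(2));
        System.out.println(hitLine(1046936194L, "SUPERMAN METAL LUNCH BOX", 1.5f));
        System.out.println(hitLine(1047388061L, "SUPERMAN #405 NM \"BATMAN\" (1985)", 1.25f));
    }
}
```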

Your Searcher program should be implemented in such a way that it can also take three additional arguments, in the following way:

java Searcher "star trek" -x longitude -y latitude -w width

where longitude, latitude, and width are numbers. The longitude and latitude describe a geo-location, and the width describes the width of a square, in kilometers. If these three parameters are given, then your program should further restrict the results of the keyword search by returning only items that have longitude and latitude information and whose coordinates fall into the "square" that is centered at the given longitude and latitude and has the given width. Recall that this is not exactly a square, because of the curvature of the earth. It is fine to use an actual square, as long as it is big enough so that any point with actual distance at most "width" falls inside the square. In the first line, again print the number of total hits. Then return the items in this way: first the item-ID, then the name, then the Lucene score, and then the distance from the given geo-location (in kilometers).
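The geometry behind the spatial restriction can be sketched in plain Java. The sketch below makes several assumptions that are not part of the assignment text: the earth is treated as a sphere of radius 6371 km, the search location is away from the poles and the ±180° meridian, and points are stored as POINT(longitude, latitude). The class name, the table name item_coordinates, and its columns item_id and coords are hypothetical. The half side length of the box is a parameter, so you can pass width/2 or width depending on how you read the requirement above.

```java
import java.util.Locale;

// Hypothetical helper class: bounding box, WKT polygon, and great-circle distance.
class GeoSketch {
    static final double EARTH_RADIUS_KM = 6371.0; // assumption: spherical earth

    // Example query shape (assumed table and column names; GeomFromText is the
    // MySQL 5.x name, newer versions call it ST_GeomFromText).
    static final String EXAMPLE_SQL =
        "SELECT item_id FROM item_coordinates WHERE MBRContains(GeomFromText(?), coords)";

    // Great-circle distance in kilometers, using the haversine formula.
    static double distanceKm(double lon1, double lat1, double lon2, double lat2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Returns {minLon, minLat, maxLon, maxLat} of a box that contains every point
    // within halfSideKm of (lon, lat), under the assumptions stated above.
    static double[] boundingBox(double lon, double lat, double halfSideKm) {
        double r = halfSideKm / EARTH_RADIUS_KM; // angular radius in radians
        double dLat = Math.toDegrees(r);
        double dLon = Math.toDegrees(Math.asin(Math.sin(r) / Math.cos(Math.toRadians(lat))));
        return new double[] { lon - dLon, lat - dLat, lon + dLon, lat + dLat };
    }

    // Formats the box as a closed WKT polygon with x = longitude, y = latitude.
    static String toWktPolygon(double[] b) {
        return String.format(Locale.ROOT,
            "POLYGON((%1$f %2$f, %3$f %2$f, %3$f %4$f, %1$f %4$f, %1$f %2$f))",
            b[0], b[1], b[2], b[3]);
    }

    public static void main(String[] args) {
        double[] box = boundingBox(-3.19, 55.95, 10.0);
        System.out.println(toWktPolygon(box));
        System.out.println(distanceKm(-3.19, 55.95, box[2], 55.95) + " km to the east edge");
    }
}
```

The polygon string would be bound to the ? placeholder of the query, and distanceKm can then be used to compute the distance column of the output for each item that survives the box test.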
Ranking
If your Searcher program is called only with the keywords and without the other parameters, then the ranked list of item-IDs, names, and scores should be
Note that the Lucene score of a ScoreDoc s can be retrieved via s.score.
If your Searcher program is called with the latitude, longitude, and width parameters, then the ranked list of item-IDs and names of items should be
It is not required but perfectly acceptable to also print the prices of items.
Web Page and other Rankings
This part is not taken into account for marking and is optional. You may experiment with the rudimentary web page we have provided. For this, type in the following command in a terminal window of the VM:

php -S 127.0.0.1:8000 &

Now run firefox (in the VM) and go to http://127.0.0.1:8000. You will see a simple web page where you can type in a keyword, press the button next to it, and see the result of running your java Searcher program with the appropriate parameters. You may try to add a way to see the description of items (e.g., by clicking on them), or to highlight (e.g., in bold font) each occurrence of the keywords in the display. You may also want to experiment by adding two more buttons: one to rank by lowest price and one to rank by smallest distance. In each case, ties should be broken by Lucene score: items with equal price are ordered by Lucene score and then by smallest distance, and items with equal distance are ordered by Lucene score and then by lowest price.
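The two optional orderings can be expressed as chained comparators. This is a sketch only: the Hit class and its fields are hypothetical, it assumes that a higher Lucene score should come first, and it assumes that price and distance are already known for every hit (distances are only available when a geo-location was given).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper class: the two optional rankings as comparator chains.
class RankSketch {

    // Minimal value class for one search result (illustrative fields only).
    static class Hit {
        final long id;
        final double price;
        final float score;
        final double distanceKm;

        Hit(long id, double price, float score, double distanceKm) {
            this.id = id;
            this.price = price;
            this.score = score;
            this.distanceKm = distanceKm;
        }
    }

    // Lowest price first; ties by highest Lucene score, then smallest distance.
    static final Comparator<Hit> BY_PRICE =
        Comparator.<Hit>comparingDouble(h -> h.price)
                  .thenComparing(Comparator.<Hit>comparingDouble(h -> h.score).reversed())
                  .thenComparingDouble(h -> h.distanceKm);

    // Smallest distance first; ties by highest Lucene score, then lowest price.
    static final Comparator<Hit> BY_DISTANCE =
        Comparator.<Hit>comparingDouble(h -> h.distanceKm)
                  .thenComparing(Comparator.<Hit>comparingDouble(h -> h.score).reversed())
                  .thenComparingDouble(h -> h.price);

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(1, 10.0, 1.0f, 5.0));
        hits.add(new Hit(2, 10.0, 2.0f, 9.0));
        hits.add(new Hit(3, 5.0, 0.5f, 1.0));
        hits.sort(BY_PRICE);
        for (Hit h : hits) {
            System.out.println(h.id + ", price: " + h.price + ", score: " + h.score);
        }
    }
}
```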
NOTE
You do not need to implement the three functions basicSearch(String query, int NumResultsToSkip, int numResultsToReturn), spatialSearch(String query, SearchRegion region, int NumResultsToSkip, int numResultsToReturn) and getHTMLforItemId(String itemId) that were mentioned on the lecture slides! The final version of the second assignment is what you see written on this web page.

Part D: Automate

Write a small runLoad.sh script. First, if they do not exist yet, this script creates the geo-coordinates table and the spatial index. Next, the script compiles your Indexer.java and runs it in order to build the Lucene index. Finally, the script compiles your Searcher.java program. Thus, after completion of this script, one may call the Searcher program via java Searcher "list of keywords" (or with the additional parameters) to obtain the correct output.
Sample Files
In the zip file sampleFiles.zip we have provided three sample files to help you get going. Indexer.java creates an index in a directory "indexes" and adds a few documents (the ones shown in the lecture slides) to the index; you can compile and run this class via the two-line run-Indexer.sh shell script provided. Searcher.java runs a keyword search with the keyword provided as argument; you can compile and run this class (searching for the keyword "the") via the shell script run-Searcher.sh. Finally, jdbc.java runs a simple query against the MySQL "ad" database and displays the result; you can compile and run this program via the run-jdbc.sh script provided. Note that on the VM we have already unpacked this zip file for you under the directory AD_assignment2.
What to submit:
You should submit a file assignment2.zip containing these files (and no directories whatsoever):
Submission Instruction
To submit, run the following command on DICE:

$ submit ad 2 assignment2.zip



Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 651 5661, Fax: +44 131 651 1426, E-mail: school-office@inf.ed.ac.uk
Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh