Important Notes

Late Policy: please click here to read about the school's late policy.
Partner Programming: you can choose to submit individually, or with a partner.
If you submit with a partner, then
1. submit only one submission (by one of the students in the team) and
2. include a file partner.txt which holds the name and UUN of your partner.
3. Both students in the team receive the same mark.
Virtual Machine (VM): we provide you with a VirtualBox disc image that runs Ubuntu Linux and has everything installed that is needed for the assignment. Your code must compile and run within this VM without any modifications to the VM (i.e., without installing any packages or libraries, etc.). You may use the AD_assigment_1 directory provided. The username is ad, there is no password.
Click here to download image. Warning: file size is 1.7GB
Warning: the VM changed on March 2nd. Download it again if you have an older version.
To use the image, run VirtualBox. Under Machine select New, provide a name, choose type Linux and Ubuntu (32 bit). Choose a memory size (512MB should be OK), then select "Use an existing virtual hard drive file", and click on the folder-icon to select the disc image that you downloaded.

Assignment 2: Search and Retrieval

In Assignment 1 you designed a relational schema for eBay data (given in XML). You converted the XML files to CSV files and loaded them into your tables under a MySQL database named "ad". The purpose of this assignment is to

use JDBC to fetch data from your "ad" database,
insert this data into a Lucene index for full text search, and
provide a search function that combines Lucene's full text search with MySQL's spatial queries.

To efficiently carry out spatial search in MySQL, you are asked to convert the longitude and latitude information of items into points (i.e., into MySQL's POINT data type), and to build a spatial index over those points.

Note: You can either use your own "ad" database generated with your assignment-1 solution, or, use the "ad" database that we provide for you on the VM. On the VM you can run mysql ad and then type show tables to see a list of table names of that database. Use describe tablename to see the schema of a table. For instance, describe item shows the column names and types of table item.

Part A: Create a Spatial Index in MySQL

Here you are asked to write two SQL scripts named createSpatialIndex.sql and dropSpatialIndex.sql. The createSpatialIndex.sql script carries out the following tasks:

create a new table that associates geo-coordinates (as POINTs) to item-IDs, i.e., a table with two columns: one column for item-IDs, and another column for geo-coordinates. Geo-coordinates are points, represented using the POINT data type of MySQL.
fill this table with all items that have latitude and longitude information, i.e., write a SQL insert into statement with a SQL query that selects each item-ID together with its latitude and longitude information (converted into a POINT).
create a spatial index for the point column of that table. Recall that the point column must be declared as NOT NULL and that the table must be created using ENGINE=MyISAM, i.e., the CREATE TABLE statement of the first step has this structure:

CREATE TABLE IF NOT EXISTS xxx (...) ENGINE = MyISAM;

The dropSpatialIndex.sql script drops the index and the table generated by the createSpatialIndex.sql script. Note that we will never call the dropSpatialIndex.sql script in the further steps described below.

Part B: Create a Lucene Index

Here you are asked to write a program Indexer.java that creates a Lucene index. The index should be stored in a directory named indexes (under the current working directory from which runLoad.sh is called). We will use this index to carry out keyword searches over the union of the name, categories, and description of an item. For example, for the query "Disney", your basic search function should print (in HTML) the item-ID and name of every item that has the keyword "Disney" in its name, its categories, or its description. Multiple-word queries, such as "star trek", should be treated as "star" OR "trek". That is, you should return an item if it has either "star" or "trek" in its name, its categories, or its description. Use Lucene's SimpleAnalyzer, which carries out case folding.
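The matching semantics described above can be illustrated without Lucene. The following is a minimal plain-Java sketch, not the Indexer itself: it mimics what SimpleAnalyzer does (split the text at non-letter characters and lower-case each token) and then applies the OR rule over the concatenated fields. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java illustration of the matching semantics; it does not use Lucene. */
final class OrMatchSketch {

    /** Splits text at non-letter characters and lower-cases each token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** True if at least one query keyword occurs in name, categories, or description. */
    static boolean matches(String query, String name, String categories, String description) {
        // Searching the "union" of the fields: concatenate them into one text.
        Set<String> content = new HashSet<>(tokenize(name + " " + categories + " " + description));
        for (String keyword : tokenize(query)) {
            if (content.contains(keyword)) {
                return true; // OR semantics: one matching keyword is enough
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("STAR-Trek #81!"));
        System.out.println(matches("star trek", "Trek mug", "Collectibles", "A ceramic mug"));
        System.out.println(matches("disney", "Trek mug", "Collectibles", "A ceramic mug"));
    }
}
```

In the actual Indexer, one way to get the same effect is to index the concatenation of the three attributes as a single searchable field, so that one query covers all of them.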

Part C: Implement the Search Function

You will write the Java program Searcher.java which carries out keyword and spatial search over the eBay data. The Searcher program will take as first argument a list of space-separated keywords, given within quotes. For instance,

java Searcher "star trek"

returns a list of item-IDs, item-names, and Lucene scores of all items that contain the word "star" or the word "trek" (or both) in the name of the item, or in one of the categories of the item, or in the description of the item. This list is returned in HTML format; in fact, plain text without any HTML tags will be fine. In the first line, print the number of hits. Your program should print to stdout, not to a file. When you initialize the query parser, again use Lucene's SimpleAnalyzer. The output of java Searcher "superman" (possibly with other numbers) may look as follows:

totalHits 72 

1049430907, SUPERMAN WITH GEN 13 AND OTHER PRESTIGE BOOKS, score: 1.6115568

1045823269, Superman Doomsday Hunter Prey tpb ,score: 1.4560543

1047062670, Superman's Pal Jimmy Olsen # 81, score: 1.4560543

1048743351, Superman Lunchbox Hallmark Ornament, score: 1.3813344

1048647703, SUPERMAN COMIC N0.199 - AUSTRALIAN ISSUE, score: 1.2355031

1047692530, BATMAN OR SUPERMAN CHRISTMAS ORNAMENTS HOT!!, score: 1.1648434

1047761329, Superman Domed Lunchbox/Carrying Case NEW!!, score: 1.1511121

1048263344, SUPERMAN DAILY PLANET Magnet PICTURE FRAME, score: 1.0896113

1046936194, SUPERMAN METAL LUNCH BOX, score: 1.069977

1047388061, SUPERMAN #405 NM "BATMAN" (1985), score: 1.069977

.
.
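A minimal sketch of how this output layout could be produced, assuming the hits have already been retrieved and ranked. The class and method names are hypothetical; only the layout (a totalHits line followed by one line per item) is taken from the sample above.

```java
import java.util.Arrays;
import java.util.List;

/** Formatting helper for the plain-text result list; names are illustrative. */
final class OutputSketch {

    /** One result line: item-ID, name, and Lucene score. */
    static String formatHit(String itemId, String name, float score) {
        return itemId + ", " + name + ", score: " + score;
    }

    /** The whole listing: the number of hits first, then one line per hit. */
    static String formatResults(int totalHits, List<String> hitLines) {
        StringBuilder out = new StringBuilder();
        out.append("totalHits ").append(totalHits).append('\n');
        for (String line : hitLines) {
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                formatHit("1046936194", "SUPERMAN METAL LUNCH BOX", 1.5f));
        System.out.print(formatResults(72, lines)); // print to stdout, not to a file
    }
}
```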

Your Searcher program should be implemented in such a way that it can also take three additional arguments, in the following way:

java Searcher "star trek" -x longitude -y latitude -w width

where longitude, latitude, and width are numbers. The longitude and latitude numbers describe a geo-location, and the width is a number that describes the width of a square, in kilometers. If these three parameters are given, then your program should further restrict the results of the keyword search by only returning items that have longitude and latitude information, and for which these numbers fall into the "square" that is centered at the given longitude and latitude numbers, and that has width given by the width number. Recall that this is not exactly a square, because of the curvature of the earth. It is fine to use an actual square, as long as it is big enough so that any point with actual distance at most "width" falls inside the square. In the first line, again print the number of total hits. Then, return the items in this way: first the item-ID, then the name, then the Lucene score, and then the distance from the given geo-location (in kilometers).
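The geometry involved can be sketched in plain Java. The code below is one possible approach under stated assumptions, not a reference solution: it treats the Earth as a sphere of radius 6371 km, computes distances with the haversine formula, and derives a latitude/longitude box that contains every point within a given distance of the center. Because the text above can be read as asking for a half-side of either width/2 or width, the half-side is passed in as a parameter. The class and method names are hypothetical, and the WKT polygon assumes that points were stored with x = longitude and y = latitude.

```java
import java.util.Locale;

/** Geometry helpers for the spatial restriction; a sketch, not a reference solution. */
final class GeoSketch {
    /** Mean Earth radius in km; treating the Earth as a sphere is an assumption. */
    static final double EARTH_RADIUS_KM = 6371.0;

    /** Great-circle distance in kilometers between two points (haversine formula). */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /**
     * Returns {minLon, minLat, maxLon, maxLat} in degrees for a box that contains
     * every point within halfSideKm of the center. Not handled: boxes that reach
     * a pole or cross the 180-degree meridian.
     */
    static double[] boundingBox(double lon, double lat, double halfSideKm) {
        double r = halfSideKm / EARTH_RADIUS_KM; // angular radius in radians
        double dLat = Math.toDegrees(r);
        double dLon = Math.toDegrees(Math.asin(Math.sin(r) / Math.cos(Math.toRadians(lat))));
        return new double[] { lon - dLon, lat - dLat, lon + dLon, lat + dLat };
    }

    /** Renders the box as a WKT polygon with x = longitude and y = latitude. */
    static String toPolygonWkt(double[] box) {
        return String.format(Locale.ROOT,
                "POLYGON((%f %f, %f %f, %f %f, %f %f, %f %f))",
                box[0], box[1], box[2], box[1], box[2], box[3],
                box[0], box[3], box[0], box[1]);
    }

    public static void main(String[] args) {
        double[] box = boundingBox(-3.19, 55.95, 10.0);
        System.out.println(toPolygonWkt(box));
        System.out.println(distanceKm(55.95, -3.19, 55.95, -3.09) + " km");
    }
}
```

The polygon string could then be handed to a MySQL spatial predicate such as MBRContains so that the spatial index does the filtering; the exact query depends on your own table and column names and on the MySQL version installed on the VM.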

Ranking

If your Searcher program is called only with the keywords and without the other parameters, then the ranked list of item-IDs, names, and scores should be

ordered by decreasing Lucene-score
for items with equal Lucene score, order them by increasing price (i.e., lowest price first). Here, price is defined as the current price of the item.

Note that the Lucene score of a ScoreDoc s can be retrieved via s.score. If your Searcher program is called with the latitude, longitude, and width parameters, then the ranked list of item-IDs and names of items should be

ordered by decreasing Lucene-score
for items with equal Lucene score, order them by increasing distance from the geo-location given by the latitude and longitude numbers (i.e., smallest distance first = closest items first)
for items with equal Lucene score and equal distance, order them by increasing price (i.e., lowest price first), where price is defined as before.
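Both ranking rules can be expressed as a single comparator. The sketch below assumes a hypothetical Hit holder class (not part of the provided sample files) and assumes that the distance is set to 0 for every hit when no geo-location is given, so that the comparison falls through from score directly to price.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Ranking sketch: score descending, then distance ascending, then price ascending. */
final class RankingSketch {

    /** Hypothetical result holder; the field names are illustrative only. */
    static final class Hit {
        final String itemId;
        final String name;
        final float score;       // Lucene score, e.g. taken from ScoreDoc.score
        final double distanceKm; // assumed to be 0 when no geo-location was given
        final double price;      // current price of the item

        Hit(String itemId, String name, float score, double distanceKm, double price) {
            this.itemId = itemId;
            this.name = name;
            this.score = score;
            this.distanceKm = distanceKm;
            this.price = price;
        }
    }

    static final Comparator<Hit> RANKING = (a, b) -> {
        int byScore = Float.compare(b.score, a.score); // higher score first
        if (byScore != 0) {
            return byScore;
        }
        int byDistance = Double.compare(a.distanceKm, b.distanceKm); // closer first
        if (byDistance != 0) {
            return byDistance;
        }
        return Double.compare(a.price, b.price); // cheaper first
    };

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("A", "first", 1.0f, 5.0, 10.0));
        hits.add(new Hit("B", "second", 2.0f, 9.0, 99.0));
        hits.add(new Hit("C", "third", 1.0f, 5.0, 3.0));
        hits.add(new Hit("D", "fourth", 1.0f, 2.0, 50.0));
        hits.sort(RANKING);
        for (Hit h : hits) {
            System.out.println(h.itemId + ", " + h.name + ", score: " + h.score);
        }
    }
}
```

Note that Lucene returns hits ordered by score only, so the tie-breaking on distance and price has to happen after the price (and, where applicable, the coordinates) of each hit has been looked up.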

It is not required but perfectly acceptable to also print the prices of items.

Web Page and other Rankings

This part is not taken into account for marking and is optional. You may experiment with the rudimentary web page we have provided. For this, type in the following command in a terminal window of the VM:

php -S 127.0.0.1:8000 &

Now run Firefox (in the VM) and go to http://127.0.0.1:8000. You will see a simple web page where you can type in a keyword, press the button next to it, and see the result of running your java Searcher program with the appropriate parameters. You may try to add a way to see the description of items (e.g., by clicking on them), or to highlight (e.g., in bold font) each occurrence of the keywords in the display. You may want to experiment by adding two more buttons: one to rank by lowest price and one for ranking by smallest distance. In each case, for items with equal price or equal distance, respectively, you should then order by Lucene score, and finally order by smallest distance and lowest price, respectively.

NOTE

You do not need to implement the three functions basicSearch(String query, int NumResultsToSkip, int numResultsToReturn), spatialSearch(String query, SearchRegion region, int NumResultsToSkip, int numResultsToReturn) and getHTMLforItemId(String itemId) that were mentioned on the lecture slides! The final version of the second assignment is what you see written on this web page.

Part D: Automate

Write a small runLoad.sh script. First, if they do not exist yet, this script creates the geo-coordinates table and the spatial index. Next, the script compiles your Indexer.java and runs it in order to build the Lucene index. Finally, the script compiles your Searcher.java program. Thus, after completion of this script, one may call the Searcher program via java Searcher "list of keywords" (or with the additional parameters) to obtain the correct output.

Sample Files

In the zip-file sampleFiles.zip we have provided three sample files to help you get going. Indexer.java creates an index in a directory "indexes" and adds a few documents (the ones shown in the lecture slides) to the index. You can compile and run this class via the two-line run-Indexer.sh shell script provided. Searcher.java runs a keyword search with the keyword provided as argument. You can compile and run this class (searching for the keyword "the") via the shell script run-Searcher.sh. Finally, jdbc.java runs a simple query against the MySQL "ad" database and displays the result. You can compile and run this program via the run-jdbc.sh script provided. Note that on the VM, under the directory AD_assignment2, we have already unpacked this zip file for you.

What to submit:

You should submit a file assignment2.zip containing these files (and no directories whatsoever):

(Optionally) A plain text file README.txt, with any comments you would like to make.
Your MySQL scripts createSpatialIndex.sql, dropSpatialIndex.sql.
The shell script runLoad.sh as described above.
The Java file Indexer.java containing the source code of your Lucene Indexer.
The Java file Searcher.java containing the source code of your Searcher.
The Java file DbManager.java as provided by us.
(optionally) partner.txt containing your partner's details.
(optionally) index.php for running the Searcher via a web-page.

Submission Instructions

To submit, run the following command on DICE:

$ submit ad 2 assignment2.zip
