Advanced Databases

Coursework assignment 1: External Sort

In this assignment, you will have to implement the External Sort algorithm in attica.

Download the code-base

The first thing to do is to download the code-base from attica's location. There are some added "hooks" in the code so you know where you should insert your own code.

If you are familiar with ant, you can use an ant build file for attica. You can obtain ant from here and a build file from here (the build file is also included in attica's source distribution). If you decide to use the build file, these are (some of) its targets:

ant build builds the system,
ant dist builds the jar file and the javadoc's for distribution,
ant clean cleans up the installation,
ant cleanAll cleans up everything.

Otherwise, import the source files into your favourite IDE.

Your implementation

The steps you have to follow are the following:

look into the ExternalSort.java file,
initialise all the temporary files you are going to need,
implement the external sort algorithm, storing the sorted result in the pre-defined output relation, and
modify the optimiser (PlanBuilder.java) so that it will pick up your code.

In more detail:

Open ExternalSort.java under org/dejave/attica/engine/operators of your source installation. This is the main class that you will have to modify. The class's constructor has the following signature:

/** * Construct a new external sort operator * * @param operator the input operator * @param slots the slots acting as sort keys * @param buffers the number of buffers (i.e., * output files) to be used for the sort * @throws EngineException thrown whenever the sort operator * cannot be properly initialized */ public ExternalSort (Operator operator, StorageManager sm, int [] slots, int buffers) throws EngineException

The first two arguments are obvious and their proper values will actually be taken care of by the caller of the constructor -- in this case, attica's heuristics-based optimizer. The third and fourth arguments are the ones that you should be concerned with.

The third argument is an array of integers. These integers correspond to slots in the input relation of the operator and designate the sort keys. For instance, if the array values are [1, 0, 2] that means that the primary sort key is the value in the first slot of the tuple. The secondary sort key is the value in the zero-th slot of the tuple, while the third sort key is the value in the second slot of the tuple. The optimiser will set these values for you (you do not have to do any range-checking) but you will have to use these values to access the fields of the tuple that you are sorting by.

The fourth argument is the number of buffer pool pages allocated for the sort. You should make sure that this number is less than the number of available buffer pool pages. There is a limit to the number of open files you can have, but that is actually determined by the operating system and depends on your local installation. As a guide, a 64-bit program using standard I/O can use up to 2 billion descriptors. For all sensible uses of your code, you will not reach this upper bound.

Moving on into ExternalSort.java, there is a method called initTempFiles(). You should write your own code here in order to generate the temporary files that will be used for sorting. Notice that this is only for the initialisation phase. In reality, you will have to perform this operation more than one times, i.e., during the intermediate passes of the algorithm.

The following code, initialises a temporary file, monitored by attica's storage manager:

String filename = FileUtil.createTempFileName(); sm.createFile(filename);

where sm is a StorageManager instance.

If you want to have a look at how these files can be created inside an operator, look at the constructor of NestedLoopsJoin.java under the directory attica/engine/operators of your source installation (lines 74 through 77.)

The next step is implementing the external sort algorithm. For this, you will have to modify the setup() method of class ExternalSort. Of course, not your entire implementation has to be in this single method! But when the method exits, the sorted result should be stored in the designated output file.

To re-iterate from lectures, here is a rough sketch of the algorithm to sort a file using B buffer pool pages:

Read in the input file in batches of B pages, sort them in main memory and write them out to a temporary file. If the file had X pages originally, this will generate X/B temporary sorted files on disk.
Read in B-1 temporary files and merge them into a new temporary file. Use one page for output (i.e., keep merging from the B-1 files into the output page, and as soon as the output page fills up, flush it to disk.)
Read in the next B-1 files and merge them. Continue until the X/B files are exhausted.
Apply steps 2 and 3 for the new set of temporary files and iterate until you are left with one big sorted file.

An alternative to this is to use replacement selection with two heaps, the current heap and the next heap. That is, instead of batching the input into chunks of B pages maintain two heaps of a combined size equal to the number of tuples that can fit into these B pages and then:

Read in B pages and turn them into a heap. This is the current heap; the next heap is empty.
While the input is not exhausted, apply the following steps.
1. Output the minimum value from the current heap into the file for the current run.
2. Read the next value from the input and decide if it belongs to the current heap or the next heap. Place it in the appropriate heap and resize the heaps if necessary.
3. If the current heap becomes empty, close the run, start a new run, and swap the heaps (that is, the next heap becomes the current heap and the new next heap is empty).

To implement the algorithm you will have to do relation-level I/O, using attica's storage manager. This I/O can be initialised with the following code:

Relation rel = getInputOperator().getOutputRelation(); RelationIOManager man = new RelationIOManager(getStorageManager(), rel, inputFile);

which initialises a relation I/O manager; input is read from inputFile, which is simply a file name (NB: this should be a file monitored by the storage manager, e.g., the file you created previously though a call to sm.createFile()) and the tuples in the file are expected to conform to the relational schema pointed to by rel. You can use a RelationIOManager instance to read/write tuples, through the nextTuple()/insertTuple() methods it provides. But that is not enough.

You will also need to use the page-level functionality of RelationIOManager. The interface is entirely similar to the tuple-level interface. The following code reads all the pages in a relation:

RelationIOManager manager = ... ... for (Page page : manager.pages()) { // manipulate the tuples in the page }

You will have to determine yourselves when you need page-level and when tuple-level I/O functionality. (Various "hints" about this are in the rest of this assignment.)

If you need to write a page out to disk you need to make a simple call to the StorageManager through the following method: sm.writePage(page) where sm is the StorageManager instance and page is the Page instance you need to write. You will be able to manipulate the contents of the page (which, again, you will need to do) by using the setTuple() and retrieveTuple() methods of the Page class.

When implementing the merge phase, you will not have to access the buffer pool. Instead, you will have to continue writing tuples in the output file designated in your call, by making continuous calls to outputManager.insertTuple(). The storage manager of attica will take care of the rest (i.e., pagination, outputting a page as soon as it becomes full and so on).

After you have implemented both the sort and the merge phases of external sort, the final, sorted result, should be stored in the output file pointed to by the outputFile field of ExternalSort (which outputManager writes to.) This file will then be scanned during the retrieval of the sorted relation. Again, you might want to take a look at NestedLoopsJoin.java to see how this is done in the context of a different operator.

The final step you should take: open PlanBuilder.java under attica/engine/optimiser of your source installation. Go to line 1205 under method imposeSorts(). Notice that a few lines are commented out -- the system right now does no sorting at all. You should take out the comments and comment out lines 1227 and 1228, which implement what the system is doing right now when the user requests a sort -- which is, simply, nothing. After you have made these substitutions and re-compiled, then the next time you start the server and you issue an order by query, the optimiser will pick up your code and use it to sort the output relation. If you do not make this change your code will never run!

Notice that in the example call, external sort is implemented using half the buffer pool pages. However, you should experiment with various values for the number of buffer pool pages.

Testing your implementation

For that purpose, use WBGen.java to generate a synthetic dataset that conforms (for the most part) to the logical/physical layout of the relations in the Wisconsin Benchmark. Compile the source file, and you will have the corresponding class file which can be executed as follows:

java WBGen table-name number-of-tuples

This command line will output into standard output the SQL commands to create a table by the name of table-name containing number-of-tuples tuples. The schema of the generated file is:

Column	Type
`unique1`	`long`
`unique2`	`long`
`two`	`long`
`four`	`long`
`ten`	`long`
`twenty`	`long`
`onepercent`	`long`
`tenpercent`	`long`
`twentypercent`	`long`
`fiftypercent`	`long`
`unique3`	`long`
`even`	`long`
`odd`	`long`
`stringu1`	`string`
`stringu2`	`string`
`stringu4`	`string`

You can redirect the output of the generator to a text file and use that text file to feed attica with tuples. For instance, the following sequence of commands generates the SQL statements for a table of 1,000 tuples named sort_data, stores the statements in a text file called sort_data.sql and then uses the text file as input to attica in order to actually store the table.

$ java WBGen sort_data 1000 > sort_data.sql $ java org.dejave.attica.server.Database attica.properties < sort_data.sql ... [a whole bunch of successful insertion messages] ... $

You can then use the generated table for your tests.

What you need to hand in

The minimum set of files you will have to hand in are (i) the compiled version of your entire source tree as a single jar file, and (ii) the source code for your implementation of ExternalSort.java and any other files you have modified.

The submission is electronic only and the deadline is Wednesday, 26 February, 4:00 pm.

Use the submit program to make the submission. For instance, to submit the compiled version of the source tree, use:

submit adbs 1 attica.jar

You get the idea for the source file. If you have modified any other source files of the code base, please submit them as well. You might also think of handing in a description (a text file will do) of what you did if you think something is worth mentioning. It is not compulsory, but it might make marking the assignment easier. You can write whatever you want in that file, ranging from implementation issues and problems you faced (hopefully, along with the solutions you provided!) to comments about the code in general.

That is all. For any questions, contact me.

Good luck.
--stratis

Home : Teaching : Courses : Adbs