In this assignment, you will have to implement the External Sort algorithm in attica.
The first thing to do is to download the code-base from attica's location. There are some added "hooks" in the code so you know where you should insert your own code.
If you are familiar with ant, you can use an ant build file for attica. You can obtain ant from here and a build file from here (the build file is also included in attica's source distribution). If you decide to use the build file, these are (some of) its targets:
ant build
builds the system,ant dist
builds the jar
file and the
javadoc
's for distribution,ant clean
cleans up the installation,ant cleanAll
cleans up everything.Otherwise, import the source files into your favourite IDE.
The steps you have to follow are the following:
ExternalSort.java
file,PlanBuilder.java
) so that
it will pick up your code.In more detail:
Open ExternalSort.java
under
org/dejave/attica/engine/operators
of your source
installation. This is
the main class that you will have to modify. The class's constructor has
the following signature:
/**
  * Construct a new external sort operator
  *
  * @param operator the input operator
  * @param slots the slots acting as sort keys
  * @param buffers the number of buffers (i.e.,
  * output files) to be used for the sort
  * @throws EngineException thrown whenever the sort operator
  * cannot be properly initialized
  */
public ExternalSort (Operator operator,
            StorageManager sm,
            int [] slots,
            int buffers)
throws EngineException
The first two arguments are obvious and their proper values will actually be taken care of by the caller of the constructor -- in this case, attica's heuristics-based optimizer. The third and fourth arguments are the ones that you should be concerned with.
The third argument is an array of integers. These integers correspond
to slots in the input relation of the operator and designate the sort
keys. For instance, if the array values are [1, 0, 2]
that means that the primary sort key is the value in the first slot
of the tuple. The secondary sort key is the value in the zero-th
slot of the tuple, while the third sort key is the value in the
second slot of the tuple. The optimiser will set these values for
you (you do not have to do any range-checking) but you will have
to use these values to access the fields of the tuple that you are
sorting by.
The fourth argument is the number of buffer pool pages allocated for the sort. You should make sure that this number is less than the number of available buffer pool pages. There is a limit to the number of open files you can have, but that is actually determined by the operating system and depends on your local installation. As a guide, a 64-bit program using standard I/O can use up to 2 billion descriptors. For all sensible uses of your code, you will not reach this upper bound.
Moving on into ExternalSort.java
, there is a method
called initTempFiles()
. You should write your own
code here in order to generate the temporary files that
will be used for sorting. Notice that this is only for the initialisation
phase. In reality, you will have to perform this operation more than
one times, i.e., during the intermediate passes of the algorithm.
The following code, initialises a temporary file, monitored by attica's storage manager:
String filename = FileUtil.createTempFileName();
sm.createFile(filename);
where sm
is a StorageManager
instance.
If you want to have a look
at how these files can be created inside an operator, look at the
constructor of NestedLoopsJoin.java
under
the directory attica/engine/operators
of your source
installation (lines 74
through 77
.)
The next step is implementing the external sort algorithm. For this,
you will have to modify the setup()
method of
class ExternalSort
. Of course, not your
entire implementation has to be in this single method! But
when the method exits, the sorted result should be stored
in the designated output file.
To re-iterate from lectures, here is a rough sketch of the algorithm
to sort a file using B
buffer pool pages:
B
pages, sort them in main memory and write them
out to a temporary file. If the file
had X
pages originally, this will generate
X/B
temporary sorted files on disk.B-1
temporary files
and merge them into a new temporary file. Use one page for output
(i.e., keep merging from the B-1
files into
the output page, and as soon as the output page fills up,
flush it to disk.)B-1
files and merge them.
Continue until the X/B
files are exhausted.
An alternative to this is to use replacement selection with two heaps,
the current heap and the next heap. That is,
instead of batching the input into chunks of B
pages
maintain two heaps of a combined size equal to the number of tuples
that can fit into
these B
pages and then:
B
pages and turn them into a heap. This is
the current heap; the next heap is empty.To implement the algorithm you will have to do relation-level I/O, using attica's storage manager. This I/O can be initialised with the following code:
Relation rel = getInputOperator().getOutputRelation();
RelationIOManager man =
     new RelationIOManager(getStorageManager(), rel,
inputFile);
which initialises a relation I/O manager; input is read from
inputFile
, which is simply a file name (NB: this
should be a file monitored by the storage manager, e.g.,
the file you created previously though a call to sm.createFile()
)
and the tuples in the file are expected to conform to the relational
schema pointed to by rel
.
You can use a RelationIOManager
instance
to read/write tuples, through the nextTuple()
/insertTuple()
methods it provides. But that is not enough.
You will also need to use the page-level functionality
of RelationIOManager
. The interface is entirely
similar to the tuple-level interface. The following
code reads all the pages in a relation:
RelationIOManager manager = ...
...
for (Page page : manager.pages()) {
    // manipulate the tuples in the page
}
You will have to determine yourselves when you need page-level and when tuple-level I/O functionality. (Various "hints" about this are in the rest of this assignment.)
If you need to write a page out to disk
you need to make a simple call to the StorageManager
through
the following
method: sm.writePage(page)
where sm
is
the StorageManager
instance and page
is the
Page
instance you need to write. You will be able
to manipulate the contents of the page (which, again, you will need to do)
by using the setTuple()
and retrieveTuple()
methods of the Page
class.
When implementing the merge phase, you will not have to access the buffer
pool. Instead, you will have to continue writing tuples in the output
file designated in your call, by making continuous calls to
outputManager.insertTuple()
. The storage manager of attica
will take care of the rest (i.e., pagination, outputting a page as soon
as it becomes full and so on).
After you have implemented both the sort and the merge phases of external sort,
the final, sorted result, should be stored in the output file pointed
to by the outputFile
field of ExternalSort
(which
outputManager
writes to.) This file will then be scanned
during the retrieval of the sorted relation. Again, you might
want to take a look
at NestedLoopsJoin.java
to see how this is done
in the context of a different operator.
The final step you should take: open PlanBuilder.java
under attica/engine/optimiser
of your source installation. Go to line 1205
under method imposeSorts()
. Notice that a few lines are commented out --
the system right now does no sorting at all. You should take out the comments
and comment out lines 1227
and 1228
, which
implement
what the system is doing right now when the user requests a sort -- which
is, simply, nothing. After you have made these substitutions and re-compiled,
then the next time
you start the server and you issue an order by
query, the optimiser
will pick up your code and use it to sort the output relation.
If you do not make this change your code will never run!
Notice that in the example call, external sort is implemented using half the buffer pool pages. However, you should experiment with various values for the number of buffer pool pages.
For that purpose, use WBGen.java to generate a synthetic dataset that conforms (for the most part) to the logical/physical layout of the relations in the Wisconsin Benchmark. Compile the source file, and you will have the corresponding class file which can be executed as follows:
java WBGen table-name number-of-tuples
This command line will output into standard output the SQL commands to create
a table by the name of table-name
containing
number-of-tuples
tuples. The schema of the generated file is:
Column | Type |
unique1 |
long |
unique2 |
long |
two |
long |
four |
long |
ten |
long |
twenty |
long |
onepercent |
long |
tenpercent |
long |
twentypercent |
long |
fiftypercent |
long |
unique3 |
long |
even |
long |
odd |
long |
stringu1 |
string |
stringu2 |
string |
stringu4 |
string |
You can redirect the output of the generator to a text file and use that text file to
feed attica with tuples. For instance, the following sequence of commands generates
the SQL statements for a table of 1,000 tuples named sort_data
,
stores the statements in a text file called sort_data.sql
and then uses the text file as input to attica in order to actually store
the table.
$ java WBGen sort_data 1000 > sort_data.sql
$ java org.dejave.attica.server.Database attica.properties < sort_data.sql
...
[a whole bunch of successful insertion messages]
...
$
You can then use the generated table for your tests.
The minimum set of files you will have to hand
in are (i) the compiled version of your entire source tree as a single
jar file, and
(ii) the source code for your implementation of
ExternalSort.java
and any other files you have
modified.
The submission is electronic only and the deadline is Wednesday, 26 February, 4:00 pm.
Use the submit
program to make the submission. For instance,
to submit the compiled version of the source tree, use:
submit adbs 1 attica.jar
You get the idea for the source file. If you have modified any other source files of the code base, please submit them as well. You might also think of handing in a description (a text file will do) of what you did if you think something is worth mentioning. It is not compulsory, but it might make marking the assignment easier. You can write whatever you want in that file, ranging from implementation issues and problems you faced (hopefully, along with the solutions you provided!) to comments about the code in general.
That is all. For any questions, contact me.
Good luck.
--stratis
Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 651 5661, Fax: +44 131 651 1426, E-mail: school-office@inf.ed.ac.uk Please contact our webadmin with any comments or corrections. Logging and Cookies Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh |