Java I/O Fundamentals
Overview

This lab explores the basics of manipulating data in files and streams.

Oracle Java Tutorial links

Exercises

Do the exercises for I/O in the Oracle Java tutorial here. Once you are done, the answers can be checked here.

PDF metadata reading

PDF documents can contain metadata such as author, document title, etc. This metadata is encoded within the PDF data structures. In this exercise you will use the random access file features of Java to parse enough of a PDF file to extract this metadata.

While it would be possible to just scan the entire PDF file for keywords and obtain the data, this exercise requires you to use the java.nio.channels.SeekableByteChannel interface and knowledge of the PDF structure to read in as little of the PDF as possible, ideally less than 1000 bytes.
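As a starting point, here is a minimal sketch of reading only the last few bytes of a file through a SeekableByteChannel. The class name TailReader and the method readTail are hypothetical names chosen for illustration, and decoding as ISO-8859-1 is an assumption that suits the plain-text parts of a simple PDF.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of the required ReadPDFInfo API.
class TailReader {

    /** Reads at most maxBytes from the end of the file and decodes them as ISO-8859-1. */
    static String readTail(Path path, int maxBytes) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(path, StandardOpenOption.READ)) {
            long size = channel.size();
            int length = (int) Math.min(size, maxBytes);

            // Jump straight to the region of interest instead of reading from the start.
            channel.position(size - length);

            ByteBuffer buffer = ByteBuffer.allocate(length);
            // A single read may return fewer bytes than requested, so loop until full or EOF.
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep filling the buffer
            }
            buffer.flip();
            return StandardCharsets.ISO_8859_1.decode(buffer).toString();
        }
    }
}
```

The same position-then-read pattern works for any other offset in the file, which is what makes it possible to follow the PDF's internal pointers without scanning the whole document.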

The PDF specification, available here, is sizable. At over 1300 pages it covers editable documents, revisions, encryption, data compression, and a host of other features. To simplify things, for this exercise use this test document: TestDocument.pdf, which uses the bare minimum of features.

Create the class ReadPDFInfo in the package io with the following method:

public static String readPDFInfo(File file) throws IOException

This method takes a PDF file as a parameter and returns the following metadata from the file:

Title: Test Document
Subject: The Subject
Author: A N Other
Producer: Latex with hyperref
Creator: pdflatex
CreationDate: D:20160118224408Z

From the PDF Reference, Sixth Edition, version 1.7, you'll need section 3.4 (File Structure) and section 10.2.1. Specifically, you'll need to locate and parse the following parts of the PDF file:

  • 3.4.3 Cross-Reference Table
  • 3.4.4 File Trailer
  • 10.2.1 Document Information Dictionary
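The sections above describe a chain of pointers: the end of the file holds a startxref offset, the trailer dictionary names the Info object, and each in-use cross-reference entry is a fixed 20-byte record giving an object's byte offset. The following is a rough sketch of parsing those pieces from text that has already been read. The class and method names are hypothetical, the regular expressions are a simplification rather than a full PDF tokenizer, and the sketch assumes a single cross-reference subsection.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and approach are illustrative only.
class TrailerSketch {
    private static final Pattern START_XREF = Pattern.compile("startxref\\s+(\\d+)\\s+%%EOF");
    private static final Pattern INFO_REF = Pattern.compile("/Info\\s+(\\d+)\\s+(\\d+)\\s+R");

    /** Byte offset of the cross-reference table, taken from the last startxref in the text. */
    static long xrefOffset(String tail) {
        Matcher m = START_XREF.matcher(tail);
        long offset = -1;
        while (m.find()) {
            offset = Long.parseLong(m.group(1)); // keep the last match
        }
        if (offset < 0) {
            throw new IllegalArgumentException("no startxref found");
        }
        return offset;
    }

    /** Object number of the indirect reference stored under /Info in the trailer dictionary. */
    static int infoObjectNumber(String trailer) {
        Matcher m = INFO_REF.matcher(trailer);
        if (!m.find()) {
            throw new IllegalArgumentException("no /Info reference found");
        }
        return Integer.parseInt(m.group(1));
    }

    /** Position of an object's 20-byte entry, assuming one subsection starting at firstObject. */
    static long entryPosition(long firstEntryPosition, int firstObject, int objectNumber) {
        return firstEntryPosition + 20L * (objectNumber - firstObject);
    }

    /** Parses an in-use entry of the form "nnnnnnnnnn ggggg n" followed by a two-byte EOL. */
    static long parseEntryOffset(String entry) {
        if (entry.length() < 18 || entry.charAt(17) != 'n') {
            throw new IllegalArgumentException("not an in-use cross-reference entry");
        }
        return Long.parseLong(entry.substring(0, 10));
    }
}
```

Because entries have a fixed width, entryPosition lets you seek directly to the one entry you need instead of reading the whole table.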

The document information fields in the test document are guaranteed to be string literals (as defined in section 3.2.3) enclosed in parentheses, rather than anything more complex.
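Since the values are plain parenthesized literals, extracting one can be sketched as below. The class LiteralStringSketch and its method are hypothetical. The sketch handles balanced nested parentheses and treats a backslash as "take the next character literally", which is an assumption that ignores octal escapes and sequences such as \n described in section 3.2.3.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a simplification of the literal string rules in section 3.2.3.
class LiteralStringSketch {

    /** Returns the literal string stored under /key, or null if the key is absent or unterminated. */
    static String valueOf(String dictionary, String key) {
        // Require "(" right after the key so that a key which is a prefix of another is not confused with it.
        Matcher m = Pattern.compile("/" + Pattern.quote(key) + "\\s*\\(").matcher(dictionary);
        if (!m.find()) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        int depth = 1; // we are already inside the opening parenthesis
        for (int i = m.end(); i < dictionary.length(); i++) {
            char c = dictionary.charAt(i);
            if (c == '\\' && i + 1 < dictionary.length()) {
                out.append(dictionary.charAt(++i)); // escaped character, taken literally
            } else if (c == '(') {
                depth++;
                out.append(c);
            } else if (c == ')') {
                if (--depth == 0) {
                    return out.toString();
                }
                out.append(c);
            } else {
                out.append(c);
            }
        }
        return null; // ran out of input before the closing parenthesis
    }
}
```

For example, calling valueOf with the text of an Info dictionary and the key "Title" yields the title string without its enclosing parentheses.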

An automated test has been created for this exercise: ReadPDFInfoTest.java.

Note: this exercise requires a bit more reading outside the course to complete. The PDF specification is complex, although only a small fraction of it is used here. I'd appreciate feedback on how difficult you find this exercise compared with the other advanced lab exercises. If necessary, more explanation of how to decode the PDF can be added.