This lab explores the basics of manipulating data in files and streams.
Oracle Java Tutorial links
- Read and write data from the console.
  - I/O Streams lesson and, in particular, the following pages:
    - Byte Streams
    - I/O from the Command Line
- Use streams to read and write files.
  - File I/O (Featuring NIO.2) lesson, and in particular, the following pages:
    - Reading, Writing, and Creating Files
    - Creating and Reading Directories
    - Random Access Files
PDF documents can contain metadata such as author, document title, etc. This metadata is encoded within the PDF data structures. In this exercise you will use the random access file features of Java to parse enough of a PDF file to extract this metadata.
While it would be possible to scan the entire PDF file for keywords and obtain the data that way, this exercise requires you to use the java.nio.channels.SeekableByteChannel interface and your knowledge of the PDF structure to read as little of the PDF as possible: less than 1000 bytes if you can.
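As a starting point, here is a minimal sketch of reading only the last few bytes of a file through a SeekableByteChannel. The class name TailReader, the method readTail and the choice of ISO-8859-1 for decoding are assumptions made for illustration; they are not part of the exercise or its test.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: not part of the lab skeleton.
public class TailReader {

    /** Reads at most maxBytes from the end of the file, without touching the rest. */
    static String readTail(Path path, int maxBytes) throws IOException {
        try (SeekableByteChannel channel = Files.newByteChannel(path, StandardOpenOption.READ)) {
            long size = channel.size();
            long start = Math.max(0, size - maxBytes);
            channel.position(start);                       // jump straight to the tail
            ByteBuffer buffer = ByteBuffer.allocate((int) (size - start));
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // keep reading until the buffer is full or end of file is reached
            }
            buffer.flip();
            // Assumption: ISO-8859-1 maps each byte to exactly one char,
            // so character indexes in the result equal byte offsets in the tail.
            return StandardCharsets.ISO_8859_1.decode(buffer).toString();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TailReader <file>");
            return;
        }
        System.out.println(readTail(Paths.get(args[0]), 64));
    }
}
```

The same position-then-read pattern works for any offset in the file, so it can be reused once you know where a particular structure starts.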
The PDF specification, available here, is sizable. At over 1300 pages, it covers editable documents, revisions, encryption, data compression, and a host of other features. To simplify things, use this test document for the exercise: TestDocument.pdf, which uses the bare minimum of features.
Create the class ReadPDFInfo in the package io with the following method:
public static String readPDFInfo(File file) throws IOException
This method takes a PDF file as a parameter, and outputs the following metadata from the file:
Title: Test Document
Subject: The Subject
Author: A N Other
Producer: Latex with hyperref
Creator: pdflatex
CreationDate: D:20160118224408Z
From the PDF Reference, Sixth Edition, version 1.7, you'll mainly need section 3.4 (File Structure). Specifically, you'll need the following sections of the reference to locate and parse the corresponding parts of the PDF file:
- 3.4.3 Cross-Reference Table
- 3.4.4 File Trailer
- 10.2.1 Document Information Dictionary
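To give a feel for how these sections fit together, the following is a rough sketch of the text-level parsing involved, not a full solution. It assumes a single classic cross-reference table (no cross-reference streams or incremental updates) and a trailer that holds an /Info indirect reference. The class name PdfTrailerSketch, its method names and every sample value used with it are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names and sample values are illustrative only.
public class PdfTrailerSketch {

    // The file ends with: startxref, the byte offset of the xref keyword, then %%EOF.
    private static final Pattern START_XREF = Pattern.compile("startxref\\s+(\\d+)\\s+%%EOF");
    // The trailer dictionary refers to the info dictionary as "/Info <object> <generation> R".
    private static final Pattern INFO_REF = Pattern.compile("/Info\\s+(\\d+)\\s+(\\d+)\\s+R");

    /** Returns the byte offset of the cross-reference table, taken from the last startxref. */
    static long startXrefOffset(String tail) {
        Matcher m = START_XREF.matcher(tail);
        long offset = -1;
        while (m.find()) {
            offset = Long.parseLong(m.group(1));   // keep the last match
        }
        if (offset < 0) {
            throw new IllegalArgumentException("no startxref found");
        }
        return offset;
    }

    /** Returns the object number of the document information dictionary. */
    static int infoObjectNumber(String trailer) {
        Matcher m = INFO_REF.matcher(trailer);
        if (!m.find()) {
            throw new IllegalArgumentException("no /Info reference found");
        }
        return Integer.parseInt(m.group(1));
    }

    /** Each table entry is exactly 20 bytes, so an entry's position can be computed directly. */
    static long entryPosition(long firstEntryPosition, int firstObjectNumber, int objectNumber) {
        return firstEntryPosition + 20L * (objectNumber - firstObjectNumber);
    }

    /** Parses an in-use entry of the form "nnnnnnnnnn ggggg n" and returns the object's byte offset. */
    static long entryOffset(String entry) {
        if (entry.length() < 18 || entry.charAt(17) != 'n') {
            throw new IllegalArgumentException("not an in-use entry: " + entry);
        }
        return Long.parseLong(entry.substring(0, 10));
    }
}
```

Because the entries have a fixed width, only the 20 bytes belonging to the info object need to be read from the table, which helps keep the total byte count low.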
The document information fields in the test document are guaranteed to be string literals (as defined in section 3.2.3) enclosed in parentheses, rather than anything more complex.
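Given that guarantee, the fields can be pulled out with a simple pattern. The sketch below assumes the values contain no nested or escaped parentheses, which section 3.2.3 allows in general but which this simplification ignores. The class name InfoDictionarySketch and the method parseInfo are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: handles only plain "(...)" literal strings.
public class InfoDictionarySketch {

    // A name such as /Title followed by a literal string with no parentheses or backslashes inside.
    private static final Pattern ENTRY = Pattern.compile("/([A-Za-z]+)\\s*\\(([^()\\\\]*)\\)");

    /** Returns the key/value pairs in the order they appear in the dictionary text. */
    static Map<String, String> parseInfo(String dictionary) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(dictionary);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }
}
```

For example, passing the text `<< /Title (Test Document) /Author (A N Other) >>` (a made-up layout using values from the expected output) yields a map with Title and Author in that order.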
An automated test has been created for this exercise: ReadPDFInfoTest.java.
Note: this exercise requires a bit more reading outside the course to complete. The PDF specification is complex, although only a small fraction of it is used here. I'd appreciate feedback on how difficult you find this exercise in comparison to the other advanced lab exercises. If necessary, more explanation of how to decode the PDF can be added.