Using Perl for data conversion

This page gives a very brief introduction to the use of Perl for data conversion. This together with the example program in the tutorial should be enough to get you started writing programs for data conversion. This is by no means a good introduction to the complete Perl language. If you want to learn more about Perl, check out this short tutorial (which was mainly the source for this page), or the on-line textbook Perl by example. The Perl home page is at http://www.perl.com.

Running a Perl program
Comments
Scalar variables
Array variables
Control structures
Reading from and writing to files
Regular expressions
Split and substitution

Running a Perl program

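Perl programs in this course have the extension .prl. Perl itself does not require a particular extension, and .pl is the more common convention. You can run a program by typing 'perl -w program_name.prl'. The -w option switches on warnings.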

Comments

Comments are placed inside a program file using the # character. Everything after the # is ignored. For example:

# This whole line is ignored.

print("Perl is easy.\n");      # Here's a half-line comment.

Scalar variables

Simple variables in Perl have a name that starts with a '$' symbol. You don't have to declare them in advance, just assign a value to them:

$var = 9;
$name = 'John';

The usual arithmetic operations can be used:

$var = 5 + 8;
$var += 10;
$var++;

Array variables

Arrays in Perl have a name that starts with an '@' symbol. They are lists of scalar variables. You can assign values as follows:

@names = ('John', 'Pete');
$name = 'Jake';
@morenames = (@names, $name, 5);

This gives the same as:

@morenames = ('John', 'Pete', 'Jake', 5);

You can also access the array elements one by one, using square brackets. As in C, indexing starts from 0. Remember that an array element is not an array but a scalar variable, so it should start with a '$' symbol.

$myname = $morenames[2];  
$morenames[9] = 'Olga';

The assignment above to element number 9 extends the array: all the elements between the last assigned position (the value 5 in position 3) and the new element are created with an undefined value (undef), which behaves like an empty string when printed. One last important feature of arrays is how to find out their length. This is done by using @array in a scalar context:

$length = @morenames;
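
As a minimal sketch of the points above (the variables $before and $after are made up for illustration), the following shows how the length changes when element 9 is assigned, and that the elements in between have no value yet:

```perl
@names = ('John', 'Pete');
@morenames = (@names, 'Jake', 5);

$before = @morenames;             # 4 elements: positions 0 to 3
$morenames[9] = 'Olga';           # positions 4 to 8 now exist, but hold no value
$after = @morenames;              # 10 elements: positions 0 to 9

print "Length before: $before, length after: $after\n";
if (!defined($morenames[6])) {
  print "Element 6 has no value yet\n";
}
```

The built-in function defined tests whether a scalar holds a value at all, which is a handy check before printing elements of an array that may contain gaps.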

Control structures

The most important control structures you will need are for, while and if.

The for structure takes an initialisation, a condition and an update, for example like this:

for($i = 0;$i < 10;$i++) {
  Commands;
  ...
 }

The while structure only takes a condition:

while ($input < 9) {
  Commands;
  ...
 }

Also if is quite straightforward:

if ($number == 1) {
  ...
 } elsif ($number == 2) {
  ... 
 } else {
  ...
 }
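
As a small sketch of how these structures combine (the array @values and the counters are made up for illustration), the following loop walks through a list of numbers, adds them up and sorts them into two groups:

```perl
@values = (3, 12, 7, 25, 1);
$small = 0;
$large = 0;
$total = 0;

for ($i = 0; $i < @values; $i++) {    # @values in a scalar context gives the length
  $total += $values[$i];
  if ($values[$i] < 10) {
    $small++;
  } else {
    $large++;
  }
}

print "Total: $total, below 10: $small, 10 or more: $large\n";
```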

Evaluating conditions is necessary in all of these control structures. It works as follows:

$a == $b       # Is $a numerically equal to $b?
$a != $b       # Not equal
$a < $b        # Less than
$a <= $b       # Less or equal

$a eq $b       # Is string $a equal to string $b?
$a ne $b       # Not equal
$a lt $b       # Less than
$a le $b       # Less than or equal
$a gt $b       # Greater than

$a && $b       # Are $a and $b true
$a || $b       # $a or $b
!($a)          # Not $a
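
The choice between the numeric and the string operators matters in data conversion, because the same two values can compare differently. The following sketch (the variable names are made up for illustration) compares '10' and '9' both ways, using the numeric greater-than operator '>' and its string counterpart 'gt':

```perl
$first = '10';
$second = '9';

if ($first > $second) {          # numeric comparison: 10 is greater than 9
  $numeric = 'first is larger';
} else {
  $numeric = 'second is larger';
}

if ($first gt $second) {         # string comparison: the character '1' sorts before '9'
  $string = 'first is larger';
} else {
  $string = 'second is larger';
}

print "Numerically: $numeric\n";
print "As strings:  $string\n";
```

When the fields you read are numbers, use the numeric operators; use the string operators only when you want alphabetical ordering.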

Reading from and writing to files

For data conversion, you will do a lot of reading from and writing to files. You first have to open each file and attach it to a filehandle (IN and OUT below). When you want to open a file for writing, you should type '>' before the filename:

open(IN,'Infile');        # open Infile for reading
open(OUT,'>Outfile');     # open Outfile for writing; an existing file is overwritten

To read from a file, you assign the input to a variable. The following will assign the next line from the input file to the variable $in:

$in = <IN>;

To read the whole contents of a file, you read until you reach the end of the file. A peculiarity of Perl is that reading past the last line returns a value that evaluates as false when used as a condition. So the following control structure will read line by line until the end of the file is reached:

while ($in = <IN>) {
  ...
 }

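Notice how in the previous example '=' was used instead of '==': this is not a test whether $in is equal to the line in the input file. It is an assignment to $in, and the result of the assignment is evaluated. Empty lines and lines containing only a zero do not end the loop: a line read from a file still carries its end-of-line character, so it is never an empty string, and Perl treats a while condition of this form as a test of whether a line could be read at all (as if it were written while (defined($in = <IN>))). The loop therefore stops only at the end of the file.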

To get input from the user instead of from a file, use the standard input: <STDIN>.

When you read data in from a file, you will read line by line. So the variable $in will contain an end-of-line character. When you process input you will usually want to get rid of this character. You can do this with the following command:

chomp($in);
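
The reading loop and chomp can be tried out together in the following sketch. As an assumption made only to keep the example self-contained, the "file" is a string held in memory (Perl 5.8 and later can open a filehandle on a reference to a string); with a real file you would pass its name to open instead. The variables $data, $count and @lines are made up for illustration.

```perl
$data = "John,31\nPete,27\nOlga,45\n";

# Assumption: read from the string $data instead of a file on disk.
open(IN, '<', \$data) or die "Cannot open input: $!";

$count = 0;
@lines = ();
while ($in = <IN>) {
  chomp($in);                 # remove the end-of-line character
  $lines[$count] = $in;       # keep the cleaned line for later processing
  $count++;
}
close(IN);

print "Read $count lines, the first one is '$lines[0]'\n";
```

The 'or die' part stops the program with a message if the file cannot be opened, which is usually better than silently reading nothing.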

To write to a file, you use the 'print' command. When you want to print several lines, you have to include newline characters (\n):

print OUT "$name likes pizza \n";

If you use "..." with the print command, variable names will be filled in. If you use '...', you will print literally (so you write $name to the file instead of the variable's value). If you want to print characters that have a special meaning inside "..." (such as ", $, @ or \ itself), you have to precede them with the escape character '\'. Escaping other punctuation, such as { or }, does no harm. The following writes 'bra{ckets}' to the output file:

print OUT "bra\{ckets\} \n";

To write to the screen instead of to a file, just omit the filehandle OUT:

print "$name is eating fish";
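
The difference between the two kinds of quotes is easiest to see side by side. In this sketch (the variable names are made up for illustration) the same text is built three ways:

```perl
$name = 'Olga';

$double  = "$name likes pizza";                    # the value of $name is filled in
$single  = '$name likes pizza';                    # taken literally, no filling in
$escaped = "The variable \$name holds \"$name\"";  # \$ and \" print as $ and "

print "$double\n";
print "$single\n";
print "$escaped\n";
```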

Regular expressions

One of the most useful features of Perl is its support for regular expressions, which let you look for patterns in a string variable. A pattern is written between forward slashes. Matching is done with '=~'; its opposite, '!~', is true when the pattern does not match. The following code will print the string 'funny' on the screen:

$i = 'many people like beer';
if ($i =~ /like/) {
  print "funny";
 }

Regular expressions become very useful when you make more complex and more flexible patterns. You can use '.' as a wild card, '+' means that you want to repeat the last character at least once and '*' that you want to repeat it 0 or more times. '?' takes the last character 1 or 0 times. The following statements all evaluate to true:

$i =~ /l.ke/
$i =~ /be+r/
$i =~ /bee+r/
$i =~ /beee*r/
$i !~ /be?r/

Between square brackets you can define a range of characters you want to look for. '[a-z]' includes all characters between a and z. And if you place '^' after the opening bracket, you get the negation of the defined range. You can combine this with + and *. The following expressions all evaluate to true:

$i =~ /b[eao]*r/
$i =~ /people [a-z]+ beer/
$i =~ /[^A-Z]*/

With round brackets, you can group things together. A '|' means 'or'. A very interesting feature of round brackets is that you can refer to them afterwards: the text matched within the first pair of brackets is saved in '$1', the text matched within the second pair in '$2', and so on. The following will print 'people':

$i =~ /(pe[a-z]*le|lion)/;
print "$1\n";

Perl looks for the pattern anywhere in the string. If you only want a match when the pattern appears at the beginning of the string, use /^.../. If you want to match at the end of the string, use /...$/. For example, if you get a line of numbers and only want to print the last one, you can do the following:

$i = '5 45 89 9 34 90 98 33 40 59 71';
$i =~ /([0-9]+)$/;
print "$1 \n";
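
In a conversion program it is safer to use $1 and $2 only after checking that the match succeeded, because after a failed match they still hold whatever an earlier match left in them. A minimal sketch, assuming a made-up input line of the form 'name: number':

```perl
$line = 'temperature: 23 degrees';

if ($line =~ /^([a-z]+): ([0-9]+)/) {
  $field = $1;                 # text matched by the first pair of brackets
  $value = $2;                 # text matched by the second pair
  print "$field = $value\n";
} else {
  print "Line not recognised\n";
}
```

Copying $1 and $2 into named variables straight away also protects them from being overwritten by the next match.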

Sometimes you want to use special characters in your regular expressions:

\n            # A newline
\t            # A tab
\w            # Any alphanumeric (word) character, the same as [a-zA-Z0-9_]
\W            # Any non-word character, the same as [^a-zA-Z0-9_]
\d            # Any digit, the same as [0-9]
\D            # Any non-digit, the same as [^0-9]
\s            # Any whitespace character: space, tab, newline, etc
\S            # Any non-whitespace character
\|            # Vertical bar
\[            # An open square bracket
\)            # A closing parenthesis
\*            # An asterisk
\^            # A caret symbol
\/            # A slash
\\            # A backslash
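
As a sketch of how these shorthands combine, the following pattern picks apart a tab-separated record. The record and the variable names are made up for illustration.

```perl
$record = "John\t42\tEdinburgh";

# A word, whitespace, a number, whitespace, a word, then the end of the string
if ($record =~ /^(\w+)\s+(\d+)\s+(\w+)$/) {
  $who  = $1;
  $age  = $2;
  $town = $3;
  print "$who is $age and lives in $town\n";
}
```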

Split and substitution

Here are two functions you will probably find very useful when doing data conversions. They both use regular expressions.

split takes two arguments: a regular expression and a string. It searches the string and splits it up by removing all parts that correspond to the regular expression. The remaining string parts are returned in an array. The following code takes a comma delimited line and returns a space delimited one:

$i = "5,45,89,9,34,90,98,33,40,59,71\n";
chomp($i);       # Take away the newline character
@list = split(/,/,$i);
for ($n=0;$n<@list;$n++) {     # a separate counter, so the string in $i is kept
  print "$list[$n] ";
 }
print "\n";

The previous example can actually be written in less code using substitution. Substitution takes a pattern to look for in the string and a replacement text that is put in place of whatever the pattern matched. The following code does the same as the previous example, replacing each ',' with ' ':

$i = "5,45,89,9,34,90,98,33,40,59,71\n";
$i =~ s/,/ /g;     # 'g' indicates that you want to replace all occurrences of ','
print "$i";
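
Because both functions take a regular expression, they can also clean up untidy delimiters in one step. The following sketch assumes a made-up input line in which fields are separated by semicolons with optional spaces around them, and converts it to a comma delimited line in two ways. It uses join, Perl's built-in counterpart of split, which glues the elements of an array together with a given separator.

```perl
$line = "John ; 42;Edinburgh\n";
chomp($line);

# First way: split on a semicolon with optional surrounding whitespace, then join
@fields = split(/\s*;\s*/, $line);
$joined = join(',', @fields);

# Second way: substitute directly in a copy of the line
$copy = $line;
$copy =~ s/\s*;\s*/,/g;

print "$joined\n";
print "$copy\n";
```

The split version is the better choice when you also need to reorder, drop or check individual fields before writing them out.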

This page was written by Frederick Ducatelle (fredduc@dai.ed.ac.uk).
