Introduction

Computer Science Large Practical

Organisational Matters

  • Me: Allan Clark
  • Email: a.d.clark@ed.ac.uk
  • Website: http://www.inf.ed.ac.uk/teaching/courses/cslp/
  • There is one lecture per week
    • Fridays at 12:10-13:00
    • At 7 Bristo Square, Lecture Theatre 4
  • Coursework: Accounts for 100% of your mark for the course
  • No required textbook
  • No scheduled office hours, please email at any time

Restrictions

  • CSLP is a third-year undergraduate course, available only to third-year undergraduate students.
  • CSLP is not available to visiting undergraduate students, or to fourth-year undergraduate students and MSc students, who have their own individual projects.
  • Third-year undergraduate students should choose at most one large practical, as allowed by their degree regulations.
    • Computer Science, Software Engineering and Artificial Intelligence large practicals
    • On most degrees a large practical is compulsory.
    • On some degrees (typically combined Honours) you can do the System Design Project instead, or additionally.
  • See the Degree Programme Tables (DPT) in the Degree Regulations and Programmes of Study (DRPS) for your degree for clarification.

The Computer Science Large Practical Requirement

  • The requirement for the Computer Science Large Practical is to create a command-line application.
  • The purpose of the application is to implement a stochastic, discrete-event, discrete-state, continuous time simulator
    • I'll explain these words further below
  • This will simulate the progression of buses through a network of stops specified by the input
  • The output will be the sequence of events that have been simulated as well as some summary statistics.
  • The input and output formats are specified in the coursework handout together with several other requirements
  • It is your responsibility to read the requirements carefully

Today's Lecture

  • Today I will discuss:
    • Context for the practical, timing and deadlines
    • Motivation for the simulation of a bus network
    • The simulation algorithm used
    • The main requirements for the practical
    • Kinds of simulators and in particular the kind that you are being asked to produce

Context

  • So far most of your practicals have been small exercises
  • Next year, you will undertake an honours project
  • This practical represents something in between those
  • It is larger and less rigidly defined than your previous coursework
  • It is more rigidly defined and smaller than your honours project
  • The CSLP tries to prepare you for
    • The System Design Project (in the second semester)
    • The Individual Project (in fourth year).

Requirements

  • The requirements are more realistic than most coursework
  • But still a little contrived in order to allow for grading
  • There is:
    • a set of requirements (rather than a specification);
    • a design element to the course; and
    • more scope for creativity.

How much time should I spend?

  • 100 hours, all in Semester 1, of which
  • 8 hours lecture/demonstrating
  • 92 hours practical work, of which
    • 70 hours non-timetabled assessed assignments
    • 22 hours private study/reading/other

How much time is that really?

  • There are 13 weeks remaining in semester 1 (Weeks 2 to 14)
  • 7 * 13 = 91 hours
  • So you can think of it as 7 hours per week in the first semester
  • This could be one hour a day including weekends
  • You could work 7 hours in a single day
    • for example work 9:00-17:00 with an hour for lunch

Managing your time

It is unlikely that you will want to arrange your work on your large practical as one day where you do nothing else, but one day per week all semester is the amount of work that you should do for the course.

Scheduling work

Course lecturers have been asked not to let deadlines overlap Weeks 11-14 because students are expected to be concentrating on their large practical in that time.

Deadlines

The Computer Science Large Practical is split in two parts:
  • Part 1
    • Deadline: Thursday 24th October, 2013 at 16:00
    • Part 1 is zero-weighted: it is just for feedback.
  • Part 2
    • Deadline: Thursday 19th December, 2013 at 16:00
    • Part 2 is worth 100% of the marks.

Scheduling work

  • It is not necessary to keep working on the project right up to the deadline.
  • For example, if you are travelling home for Christmas you might wish to submit the project early.
  • In this case you need to ensure that you start the project early.
  • The coursework submission is electronic so it is possible to submit remotely.

Early submission credit

  • In order to motivate good project management, planning, and efficient software development, the CSLP reserves marks above 90% for work which is submitted early (specifically, one week before the deadline for Part 2).
  • Work submitted less than a week before the deadline does not qualify as an early submission, and the mark for this work will be capped at 90%. Thus, the mark may be 90%, but it may not be higher than this.
  • Regardless of when it is submitted, every submission is assessed in exactly the same way, but submissions which attract a mark of above 90% which were not submitted early have this mark brought down to 90%.

Early submission credit

Question:
Can I submit both an early submission version and a version for the end deadline and have the marks for whichever is highest?
Answer:
No. Before the early submission deadline you have to choose whether or not you are going to hope for a mark above 90% then, or have an extra week to accumulate more marks up to 90%. The submission marked will be the latest one made before the deadline. Hence if you submit both before and after the early submission deadline, only the last submission will be marked and it will be capped to 90%.

Extensions

Implementation Language

  • You may choose whichever programming language you deem most suitable. However:
    • Your application should compile and run on DiCE
    • Here is an obvious list of languages which should work on DiCE without any problems: C, C++, C#, Haskell, Java, Python, Objective-C, Ruby. However, care should be taken with versions.
    • If you wish to use something else it would probably be prudent to ask me first.

Implementation Language

  • You may choose whichever programming language you deem most suitable. However:
    • Your application should compile and run on DiCE
    • I am even open to installing a compiler and/or runtime on my DiCE installation but this is entirely at my discretion.
    • It is up to you to choose a suitable language
    • Your choice of language will not itself be judged; however, if you choose poorly this will not be reflected in more lenient marking.
    • Whatever choice you make, you must live with it

Source Code Control

  • For this project source code control is mandatory
  • You will have to use git
    • This is somewhat realistic
    • Any project you join will likely already have some form of source code control set up, and you will have to learn to use that rather than whichever system you are already familiar with
    • See the git homepage

Source Code Control

  • The practical is not looking for you to become an expert in git
  • You will not need to be able to perform complicated branches, merges or rebasing
  • This is, after all, an individual practical
  • What is key is that your commits are appropriate:
    • Small frequent commits of single units of work
    • Clear, coherent and unique commit messages

Getting Started

Do this today.
  $ mkdir simulator
  $ cd simulator
  $ git init
  $ editor README.md
  $ git add README.md
  $ git commit -m "Initial commit including a README"
  

Code Sharing Sites

  • Code sharing sites are a great resource but please refrain from using them for this practical. This is an individual practical, so code sharing is not allowed, even if you are not the one benefiting.
    • This is a bit of a shame, but again somewhat realistic
    • It is at least somewhat likely that in the future you will be unable to publicly share all of the code you produce at your place of employment.

Motivation - Simulators

  • It is common in both academic and industrial contexts to author some kind of simulator
  • Simulators can save time, money, effort and even lives
  • Simulators allow the very low cost running of experiments that might otherwise be infeasible
  • However, the catch is that unless the simulator is an appropriate model for the real system under investigation, the results may be worthless

Middle-lane Hogging


A recent BBC news article on the proposed government crackdown on middle-lane hoggers

Middle-lane Hogging

  • The government recently announced a ‘crackdown’ on middle-lane hoggers on motorways
  • Is this a money-making scheme? Is it a publicity scheme? Is it truly a worthwhile policy?
  • Difficult to know. A first step is deciding whether or not middle-lane hoggers cause significant delay and/or danger
  • It's difficult to gather data: how would you know how many people are middle-lane hogging?

Middle-lane Hogging

  • Even if you could count them, how could you change this number?
  • Even if you could change it (or wait for it to change), how could you keep all other conditions the same?
  • With simulation it's possible to do both
  • Hence simulation is the first step towards answering the question of how much middle-lane hoggers cost

Why Simulate Buses?

  • City based transport is a huge problem in many parts of the world
  • Different people wish for different outcomes:
    • Passengers do not wish to wait long for a bus, they hope buses are not too full
    • Bus companies do not like empty buses, and would rather run as few as possible (whilst still having the same number of passengers)
    • Citizens wish for less congestion and pollution

Why Simulate Buses?

  • Some of these desires are complementary, some are contradictory
  • With simulation we can try out different policy ideas and see which desires are affected
  • Only recently however have we begun to be able to obtain large amounts of related data: times, queues, passengers etc.
  • In this practical we will be interested in bus queues at bus stops

Why care about Bus Queues?

  • In this practical we are mostly going to be focused on the queue of buses at each bus stop
  • NOTE: that is the queue of buses, not the queue of passengers
  • The queue of passengers at a bus stop is almost irrelevant
  • Provided a passenger does get on the next bus, it doesn't really matter when, and the whole queue is dequeued at more or less the same time
  • However, if a bus arrives at a bus stop, only to find a previous bus currently using it (to board and disembark passengers) the new bus is stuck waiting doing nothing productive
  • This is bad, for pretty much every player in the game

Why care about Bus Queues?

  • One possibility is to change the charging model
  • It takes time for passengers to board the bus because they all have to pay the driver
  • Alternatively we could move to a pre-pay scheme, with inspectors that check people have valid tickets and dispense fines for offenders
  • Or simply have an extra conductor on every bus who deals with payment
  • In order to evaluate these possibilities we first need to work out how much of a problem bus queueing is

Your Simulator

  • Your simulator will be a command-line application
  • It will accept a text file with a description of the input network
  • This text file specifies the routes, buses, rates and other entities required to simulate a given bus network
  • It should output a list of events which occur
  • The strict formats for both input and output are described in the coursework handout
  • In the second part you will analyse the sequence of events to obtain statistics about the input network

Simulation Algorithm

The underlying simulation algorithm is itself quite simple:
WHILE {time ≤ max time}
    Choose an event and time for it based on the current state
    Update the state based on the event
ENDWHILE

Simulation Algorithm

The underlying simulation algorithm is itself quite simple:
WHILE {time ≤ max time}
    From the current state calculate the set of events which may occur
    total rate ← the sum of the rates of those events
    delay ← choose a delay based on the total rate
    event ← choose probabilistically from those events
    modify the state of the system based on the chosen event
    time ← time + delay
ENDWHILE

Simulation Algorithm

WHILE {time ≤ max time}
    ...
    delay ← choose a delay based on the total rate
    ...
ENDWHILE
  • To choose a delay we sample from the exponential distribution
  • I'll say more about this later, but for now it can be done by:
  • −(mean) ∗ log(random(0.0, 1.0))
  • Where mean is the average delay, which is the reciprocal of the total rate
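
As a concrete illustration, this sampling might be written as follows in Python (a minimal sketch using only the standard random and math modules; the function name sample_delay is my own, not anything prescribed):

import math
import random

def sample_delay(total_rate):
    """Draw an exponentially distributed delay with mean 1/total_rate."""
    mean = 1.0 / total_rate
    # 1.0 - random.random() lies in (0.0, 1.0], so log() never sees zero
    return -mean * math.log(1.0 - random.random())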

Simulation Algorithm

WHILE {time ≤ max time}
    ...
    event ← choose probabilistically from those events
    ...
ENDWHILE
  • Similarly, “choose probabilistically” here means weighted by the rates of those events
  • So if two events a and b are enabled at rates 2.0 and 1.0 respectively, then:
  • Choose in such a way that a is twice as likely as b to be chosen
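
Putting the pieces together, here is a minimal Python sketch of the whole loop. It reuses the sample_delay function sketched above; enabled_events and apply_event are hypothetical placeholders standing in for your own model of the bus network, not part of any prescribed interface:

import random

def choose_event(events):
    """Pick one event from (event, rate) pairs, with probability proportional to its rate."""
    population = [event for event, _ in events]
    weights = [rate for _, rate in events]
    return random.choices(population, weights=weights)[0]

def simulate(enabled_events, apply_event, initial_state, max_time):
    """Generic loop: enabled_events(state) returns (event, rate) pairs,
    apply_event(state, event) returns the new state."""
    state, time, log = initial_state, 0.0, []
    while time <= max_time:
        events = enabled_events(state)
        total_rate = sum(rate for _, rate in events)
        if total_rate == 0.0:
            break  # nothing can happen any more, so stop early
        time += sample_delay(total_rate)
        event = choose_event(events)
        state = apply_event(state, event)
        log.append((time, event))
    return log

For example, choose_event([("a", 2.0), ("b", 1.0)]) returns "a" roughly twice as often as "b", exactly as described above.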

Components of the Simulation

  • Input network description:
    1. Stops
    2. Routes
    3. Roads
  • Dynamic state components:
    1. Buses
    2. Passengers

Components of the Simulation

Stops

  • Stops have a queue of buses
  • And a set of passengers waiting to board buses which pass through the stop
  • Passengers can only board the bus at the head of the queue

Components of the Simulation

Routes

  • Routes consist of a sequence of stops
  • Routes are implicitly circular in that the next stop after the last stop is the first stop

Components of the Simulation

Roads

  • Between any two stops which occur adjacently on at least one route there is a road
  • Including between the last and first stops of each route
  • Each road has an average rate at which buses can traverse it
  • We simplify things by saying buses may traverse a road at the same speed regardless of how many buses are on that road

Components of the Simulation

Buses

  • Each bus is associated with exactly one route, but there may be many buses associated with that route
  • Each bus has a number unique among the buses which traverse the same route
  • Hence a bus can be uniquely identified by its route and number
  • The bus 31.4 is the fifth bus on route 31
  • Each bus has an associated capacity

Components of the Simulation

Buses

  • A bus is always either at a stop or on a road between stops
  • At a stop it might not be at the head of the queue but behind other buses
  • A bus should not leave a stop if there are passengers wishing to board or disembark from it
  • A bus may leave a stop even if there are waiting passengers, provided that:
    • The bus is full, and
    • No passenger on board wishes to disembark

Components of the Simulation

Passengers

  • Each passenger has an origin stop and a destination stop
  • At any one time a passenger is either waiting at a stop or on board a particular bus
  • New passengers enter the simulation at any time at a specified rate
  • New passengers are randomly assigned an origin stop and a destination stop, but the pair must be connected by a valid route
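
Pulling together the components described on the last few slides, one purely illustrative way to represent them in Python is with small dataclasses; every field name here is my own invention rather than anything prescribed by the coursework handout:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Stop:
    stop_id: int
    bus_queue: List["Bus"] = field(default_factory=list)      # buses queued at this stop
    waiting: List["Passenger"] = field(default_factory=list)  # passengers waiting here

@dataclass
class Route:
    route_id: int
    stops: List[int]              # sequence of stop ids, implicitly circular

@dataclass
class Road:
    from_stop: int
    to_stop: int
    rate: float                   # average rate at which buses traverse this road

@dataclass
class Bus:
    route_id: int
    number: int                   # unique among buses on the same route, e.g. bus 31.4
    capacity: int
    passengers: List["Passenger"] = field(default_factory=list)

@dataclass
class Passenger:
    origin: int
    destination: int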

Components of the Simulation

Events

  • Your simulator will produce a sequence of events
    • A bus may arrive at a stop
    • A bus may leave a stop
    • A passenger may board a bus
    • A passenger may disembark from a bus
    • A new passenger may enter the simulation at a particular stop

Components of the Simulation

Events

  • Your simulator will produce a sequence of events looking like:
  • Bus ‹bus› arrives at stop ‹stop› at time ‹time›
    Bus ‹bus› leaves stop ‹stop› at time ‹time›
    Passenger boards bus ‹bus› at stop ‹stop› with
       destination ‹stop› at time ‹time›
    Passenger disembarks bus ‹bus› at stop ‹stop› 
       at time ‹time›
    A new passenger enters at stop ‹stop› with destination 
       ‹stop› at time ‹time›
            

Components of the Simulation

Events

  • In reality of course you will replace the ‹bus›, ‹stop› and ‹time› parts with real values:
  • Bus 1.2 leaves stop 3 at time 99.498
    Bus 1.2 arrives at stop 4 at time 99.692
            
  • This is valid output in the sense that it is formatted correctly
  • It may be invalid for other reasons, for example route 1 may not pass through stop 3
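
Purely for illustration, lines like the two above could be produced with ordinary string formatting; the function name and the three-decimal-place time below are my own choices, so check the handout for the exact format that is actually required:

def format_arrival(route, number, stop, time):
    """Format a bus-arrival event, e.g. 'Bus 1.2 arrives at stop 4 at time 99.692'."""
    return f"Bus {route}.{number} arrives at stop {stop} at time {time:.3f}"

print(format_arrival(1, 2, 4, 99.692))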

Part One and Part Two

  • For part one, you need only have a working simulator
  • For part two, there are additional requirements:
    • Output of analysis, such as average number of queued buses
    • Experimentation support, varying rates to see how those affect the network
    • Parameter Optimisation, finding the best rates
    • Validation, checking that the input is valid
  • These are all specified in the coursework handout

Coursework Handout

Definitions

  • In the requirements I stated that your simulator will be a:
    • stochastic,
    • discrete event, discrete state,
    • continuous time
    simulator
  • I will now define these terms

I finished the first lecture here.

Stochasticity

  • Don't worry, it essentially just means “non-deterministic”
  • This means that if you run your simulator more than once you might not get the same results
  • This also means that you can use your simulator to obtain some statistics
  • Remember, these are statistics about the model:
    • You hope that the real system exhibits behaviour with similar statistics

Discrete Events, Discrete State

  • It is possible to have discrete events and continuous state or vice-versa
  • But it is common that either both are discrete or both are continuous
  • This means that each event either takes place or it does not, there is no aggregation of multiple events
  • This generally means that the state could be encoded as an integer
  • Usually it is encoded as a set of integers, possibly coded as different data types
  • This means there is no ‘fluid-flow’
  • An entity, such as a person, is in a particular place, and cannot be divided up into fractions of a person in multiple places at once

Discrete State vs Continuous State

Continuous Time

  • Some simulations use a discrete number of time points:
    • Days, Weeks, Months, Years
  • Can also be logical time points:
    • Moves in a board game
    • Communications in a protocol
  • These would be examples of discrete time simulators
  • Your task is to write a continuous time simulator
  • An event could therefore happen at any particular time

The Exponential Distribution

  • Both graphs relate to an event which occurs at a rate of λ, and describe how likely the event is to occur at, or by, time x
  • The left graph depicts the probability density function
  • The right graph depicts the cumulative distribution function

The Exponential Distribution

  • The PDF is given by: f(x, λ) = λe^(−λx) for all x ≥ 0
  • Describes the relative likelihood that an event with rate λ occurs at time x
  • A time point is infinitesimally small
  • The integral of this gives the probability that it occurs within two time bounds
  • But you can largely ignore all this

The Exponential Distribution

  • The CDF is given by: F(x, λ) = 1 − e^(−λx) for all x ≥ 0
  • So if something happens at a rate of 0.5 per unit of time, then the probability that we will observe it occurring within 1 time unit is: F(1, 0.5) = 1 − e^(−0.5×1) ≈ 0.393
  • The exponential distribution has a couple of excellent properties for use in simulation
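
A quick sanity check of that number in Python:

import math

# CDF of the exponential distribution: F(x, lambda) = 1 - exp(-lambda * x)
print(1 - math.exp(-0.5 * 1))   # 0.3934693402873666, i.e. roughly 0.393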

The Exponential Distribution

  • The mean or expected value is given by the reciprocal of the rate parameter
  • In plain English this means that if something occurs at rate r then we can expect to wait 1/r time units on average to see each occurrence
  • If something occurs 7 times per week, you can expect to wait 1/7 of a week (or a full 24 hours) on average between each occurrence

The Exponential Distribution

  • Even better it is memoryless
  • Formally: Pr(X > s + t | X > s) = Pr(X > t) for all s, t > 0
  • Less formally: The time that we can expect to wait for the next occurrence of some (exponentially distributed) event, is unaffected by how long we have already been waiting for it
  • In the 7 times a week example, if it has been 24 hours since the last occurrence, the expected additional time I have to wait is still 24 hours
  • A quick note, don't confuse these two properties:
    • Correct Pr(X > 100 | X > 80) = Pr(X > 20)
    • Incorrect Pr(X > 100 | X > 80) = Pr(X > 100)
    The latter would be a strange kind of pre-determined system
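
If you would like to convince yourself of the memoryless property, here is a small illustrative experiment in Python: among samples which have already exceeded some threshold s, the average extra wait is still roughly the overall mean of 1/rate.

import random

rate = 0.5                                              # mean wait is 1/rate = 2.0
samples = [random.expovariate(rate) for _ in range(100000)]
overall_mean = sum(samples) / len(samples)

s = 2.0                                                 # pretend we have already waited 2.0
extra_waits = [x - s for x in samples if x > s]
conditional_mean = sum(extra_waits) / len(extra_waits)

print(overall_mean, conditional_mean)                   # both come out close to 2.0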

The Exponential Distribution

Memorylessness

  • Why is this so great?
  • During simulation, the simulator can choose an event based on the current rates of possible events
  • Those rates are based on the current state of the simulation
  • As a result of firing that event, the global state of the simulation changes
  • However, local state may not have changed; in our case, for example, there may still be two buses at stop 8
  • When we choose the next event, we can simply re-calculate the rates of possible events based on the new state of the simulation
  • We need not remember how long a particular event has been enabled for

Your Simulators

  1. Will be Discrete event simulators
  2. Will be Discrete state simulators
  3. Will be Continuous time simulators
  4. Will make use of the exponential distribution

Coursework Handout

Any Questions?

Source Code Control

Computer Science Large Practical

Quick Introduction to SCC

  • Source Code Control or Version Control Software is used for two main purposes:
    1. To record a history of the changes to the source code that have led to the current version
    2. To allow multiple developers to develop the same code base concurrently and merge their changes
  • Since this is an individual practical we will concentrate on the first of these two

A common error


/*
 * 12/26/93 (seiwald) - allow NOTIME targets to be expanded via $(<), $(>)
 * 01/04/94 (seiwald) - print all targets, bounded, when tracing commands
 * 12/20/94 (seiwald) - NOTIME renamed NOTFILE.
 * 12/17/02 (seiwald) - new copysettings() to protect target-specific vars
 * 01/03/03 (seiwald) - T_FATE_NEWER once again gets set with missing parent
 * 01/14/03 (seiwald) - fix includes fix with new internal includes TARGET
 * 04/04/03 (seiwald) - fix INTERNAL node binding to avoid T_BIND_PARENTS
 */

Basic Source Code Control

As I stated previously, the first thing to do is to initialise your repository

$ mkdir simulator
$ cd simulator
  

$ git init
  

$ editor README.md
$ git add README.md
$ git commit -m "Initial commit including a README"
  

The main point

  • After each portion of work, commit to the repository what you have done
  • Everything you have done since your last commit is not recorded
  • You can see what has changed since your last commit, with the status and diff commands:

$ git status
# On branch master
nothing to commit (working directory clean)
  

Staging and Committing

  • When you commit, you do not have to record all of your recent changes. Only changes which have been staged will be recorded
  • You stage those changes with the git add command.
  • Here I have modified a file but not staged it

$ editor README.md
$ git status
# On branch master
# Changed but not updated:
#   (use "git add ‹file›..." to update what will be committed)
#   (use "git checkout -- ‹file›..." to discard changes in working directory)
#
#	modified:   README.md
#
no changes added to commit (use "git add" and/or "git commit -a")
  

Unrecorded and Unstaged Changes

  • A git diff at this point will tell me the changes I have made that have not been committed or staged

$ git diff
diff --git a/README.md b/README.md
index 9039fda..eb8a1a2 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,2 @@
 This is a stochastic simulator.
+It is a discrete event/state, continuous time simulator.
  

To Add is to Stage

  • If I stage that modification and then ask for the status I will be told that there are staged changes waiting to be committed
  • To stage the changes in a file use git add

$ git add README.md
$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ‹file›..." to unstage)
#
#	modified:   README.md
#
  

Viewing Staged Changes

  • At this point git diff is empty because there are no changes that are not either committed or staged
  • Adding --staged will show differences which have been staged but not committed

$ git diff # outputs nothing
$ git diff --staged
diff --git a/README.md b/README.md
index 9039fda..eb8a1a2 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,2 @@
 This is a stochastic simulator.
+It is a discrete event/state, continuous time simulator.
  

New Files

  • Creating a new file causes git to notice there is a file which is not yet tracked by the repository
  • At this point it is treated equivalently to an unstaged/uncommitted change

$ editor mycode.mylang
$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ‹file›..." to unstage)
#
#	modified:   README.md
#
# Untracked files:
#   (use "git add ‹file›..." to include in what will be committed)
#
#	mycode.mylang

  

New Files

  • Slightly tricky: git add is also used to tell git to start tracking a new file
  • Once done, the creation is treated exactly as if you were modifying an existing file
  • The addition of the file is now treated as a staged but uncommitted change

$ git add mycode.mylang
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ‹file›..." to unstage)
#
#	modified:   README.md
#	new file:   mycode.mylang
#
  

Committing

  • Once you have staged all the changes you wish to record, use git commit to record them
  • Give a useful message to the commit

$ git commit -m "Added more to the readme and started the implementation"
[master a3a0ed9] Added more to the readme and started the implementation
 2 files changed, 2 insertions(+), 0 deletions(-)
 create mode 100644 mycode.mylang
  

A Clean Repository Feels Good

  • After a commit, you can check the status; in this case there are no changes
  • In general though there might be some if you did not stage all of your changes

$ git status
# On branch master
nothing to commit (working directory clean)
  

Finally git log

  • The git log command lists all your commits and their messages

$ git log
commit a3a0ed90bc90e601aca8cc9736827fdd05c97f8d
Author: Allan Clark ‹author email›
Date:   Wed Sep 25 10:26:57 2013 +0100

    Added more to the readme and started the implementation

commit 22de604267645e0485afa7202dd601d7c64c857c
Author: Allan Clark ‹author email›
Date:   Wed Sep 25 10:17:45 2013 +0100

    Initial commit
  

More on the Web

The Point

  • Don't forget that the point of all this is to record a history of changes to the code
  • This allows you to revert to previous versions in order to locate when a bug was introduced
  • This can help greatly in locating the source of a bug
  • This history of changes also helps other people (including your future self) understand why the code is the way it is
  • This is very helpful when you wish to change something without breaking anything

Debugging Help

  • Suppose you write some new test input, try it out, and find that it causes your application to crash
  do{ revert to previous commit/version
      re-compile and re-run your new test
      flag = does the program still crash
    } while(flag)
  
Once you have done this, you now know that the commit you just reverted contains the code which is causing the crash

Git Blame

Not relevant for this individual project, but when it comes time to do your System Design Project, keep in mind git blame:

$ git blame sbsi_numeric_devel/Template/main_Model.c
352c44 (ntsorman   2010-07-08 14:03:43 +0000  5) #ifndef NO_UCF
352c44 (ntsorman   2010-07-08 14:03:43 +0000  6) #include
352c44 (ntsorman   2010-07-08 14:03:43 +0000  7) #endif
352c44 (ntsorman   2010-07-08 14:03:43 +0000  8) 
815381 (allanderek 2011-08-30 13:24:45 +0000  9) #include "MainOptimiseTemplate"
352c44 (ntsorman   2010-07-08 14:03:43 +0000 10)

Committing

  • When and what to commit?
  • The easy answer is it should be “one unit of work”
  • Defining one unit of work is difficult, but if you have to use the word ‘and’ to describe it, there is a good chance you have more than one commit's worth of work there
  • Note that my previous example was therefore bad
    
    $ git commit -m "Added more to the readme and started the implementation"
    [master a3a0ed9] Added more to the readme and started the implementation
     2 files changed, 2 insertions(+), 0 deletions(-)
     create mode 100644 mycode.mylang
      
  • It is bad because it is doing two separate things, indicated by the use of the word ‘and’, not because it updates more than one file

Committing

  • Your commit should be improving the project. It should be improving one portion of it:
    • The code
    • The documentation
    • The tests
  • And it should be improving that one part in one way:
    • Improving functionality
    • Improving readability
    • Improving maintainability
    • Improving performance

XKCD Signal

  • XKCD is a popular web comic
  • It has an associated IRC channel
  • As with many large communities it faced the problem of a large noise-to-signal ratio
  • A large part of the problem is that frequently asked questions are not read and are hence re-asked
  • Common debates are hence frequently re-hashed

XKCD Signal

  • In a blog post the xkcd creator outlines a proposal to deal with this
  • It has been implemented as the ROBOT9000 bot-moderator
  • The rule it enforces is a simple one:
  • “You are not allowed to repeat anything anyone has already said”

XKCD Signal

  • You can read about the specifics here
  • But some obvious questions arise:
    • Question: Isn't this limiting?
    • Answer: You're underestimating the versatility of natural language and the sheer number of possible sentences

XKCD Signal

  • You can read about the specifics here
  • But some obvious questions arise:
    • Question: Can't I just game it by tagging extra nonsense on?
    • Answer: Yes, but the focus is on dealing with unwitting noise generators. Those who are actively attempting to destroy the conversation can be otherwise banned.

XKCD Signal

  • You can read about the specifics here
  • But some obvious questions arise:
    • Question: What happens if I just want to answer someone with a yes/no?
    • Answer: Expand slightly e.g. “I agree, ... because ...”

What has this got to do with SCC?

  • A persistent problem is the lack of meaningful commit messages
    • “fixed a bug”
    • “More work”
    • “Fixes.”
    • “Updates”
    • “big commit of all outstanding changes”
    • “commit everything”
    • “commit”

What has this got to do with SCC?

  • I hope to give some good advice on writing good commit messages
  • But it is notoriously difficult to enforce
  • One could easily enforce a minimum length, but this would only solve part of the problem and in some cases would not actually be appropriate
  • A sneakier idea: copy the “Do not repeat” rule from XKCD-Signal
  • “Do not use a commit message which has been used previously”

Non-repeating Commit Messages

  • When I say “used previously” do I mean in the same repository?
  • Beginner level: yes, I mean in the same repository
  • Advanced level: no, I mean in any repository that exists for any project
  • It should not really matter, since it is hard to accept that a commit message used for an entirely different project is appropriate for yours

Non-repeating Commit Messages

  • Said in a whingey voice: “But I really did just fix a typo in the README”
  • You can probably expand on that a little
  • However, of course some violations of this rule will be worse than others
  • Similarly just because you pass this rule, does not mean you have a useful commit message
  • Gaming this by adding superfluous characters is definitely wrong

Non-repeating Commit Messages

  • In order to check the advanced level I will need a corpus of repositories
  • I might use github for this. You certainly should not be repeating a commit message used for an entirely different repository
  • But I will at least check your commit messages against all other repositories submitted for this practical
  • Bear in mind, you're all implementing the same requirements

More Advice

  • The commit message should be a summary of the actual ‘diff’
  • Part of the point of the commit message is so that a reader can avoid looking at the actual ‘diff’
  • The reader is looking in the history for a reason. Most likely they are trying to find the source of an issue. Help them.

More Advice

  • You should at least make clear the purpose of the commit
  • Is it?
    • A bug fix
    • A feature addition
    • A conflict resolution between two branches
    • Style enhancements
      • On what scale? A single fixed spelling error, or reformatting all of the code?
    • A refactor of some portion of the code
    • Addition of a test
    • Updating of documentation
    • Optimisation

More Advice

  • Even once the purpose is described, try to explain the reason for that purpose
  • Sometimes this will be obvious, for example if the purpose of the commit is to fix a bug
  • Even then, you may wish to explain why that is fixable now rather than earlier
  • Other purposes really require an associated ‘why’; in particular a refactor does

Summary of the Main Advice

  • Small frequent commits. Each commit should do one thing
  • Ask yourself is it plausible that you might wish to revert some of the changes in a commit but not all of them?
    • If so, you almost certainly have more than one commit's worth of work
  • A person looking through your history is most likely looking for the source of a bug, or trying to figure out why a certain bit of code is the way it is. In either case help them.
  • Some people branch for any new unit of work. You should at least branch if you start doing two things at once

Micro Commits

  • It is possible to commit too small a portion of work
  • But for this practical we will ignore that possibility (unless you're clearly gaming the system)
  • Just a note: small style enhancements are usually not too small
  • “I just fixed a small typo in a comment, no one could possibly wish to revert to the code before I fixed the typo”
    • Probably not, but what are you about to do?
    • Someone may well wish to revert to the code immediately after you fixed the typo

Micro Commits

  • If you commit code such that the “build is broken” it is certainly not an appropriate commit
    • If the code fails to compile, or has a syntax error (for dynamic languages)
  • If this is the case you are likely committing too little
  • Though this could also be caused by over-shooting an appropriate commit
    • In other words you have 1 and a half commits worth of work
    • Or 2 and a half, or X plus 1/y commits worth of work

Branching

Branching

  • This occurs in software development frequently
  • In particular, you aim to add a new feature only to discover that the supporting code does not support your enhancement
  • Hence you need to first improve the supporting code, which may itself depend on supporting code which may itself require modification
  • Branching is the software solution to this problem, one which most other kinds of project do not have available to them
  • Because it is pretty easy to copy the current state of a project and work on the copy and then merge back in the results if the work is successful

Branching - The Basic Idea

When commencing a unit of work:
  1. Begin a branch, this logically copies the current state of the project
  2. The original branch might be called ‘master’ and the new branch ‘feature’
  3. Complete your work on the ‘feature’ branch
  4. When you are happy merge the results back into the ‘master’ branch

Branching - First Reason

  • Mid-way through, should you discover that your new feature is ill-conceived,
  • or, your approach is unsuitable,
  • You can simply revert back to the master branch and try again
  • Of course you can revert commits anyway, but this means you're not entirely deleting the failed attempt
  • You can also concurrently work on several branches and only throw away the changes you do not want to keep

Branching - Second Reason

Should you discover that there is some other enhancement required before your proposed enhancement can be delivered:
  1. Create a new branch (let's say ‘sub-feature’) from ‘master’
  2. This new branch does not contain any of the work you have done on ‘feature’
  3. Complete your requirements on ‘sub-feature’
  4. Once you are happy, merge those results with ‘master’
  5. You can now rebase the ‘feature’ branch which essentially pretends that you created it from ‘master’ after the work done on ‘sub-feature’ was merged

Branching

  • It is possible to do these steps retrospectively
  • But it is easier to stay organised
  • One approach is to have a newly named branch for each feature
    • This has the advantage that multiple features can be worked upon concurrently
    • Usually each feature branch is deleted as soon as it is merged back into ‘master’
  • A more lightweight solution is to develop everything on a branch named ‘dev’
  • After each commit, merge it back to ‘master’; you then always have a way of creating a new branch from the previous commit

With Regards to Grading

  • Advice about branching and rebasing etc. is worthwhile and may help you
  • However, I won't be specifically testing you on it
  • The main thing I wish to see is appropriate commits, both the work done in a commit and the commit message
  • These can be retroactively “fixed up”
  • There is no penalty for this. Though I advise that you attempt to render it unnecessary by keeping organised

External Git Advice

  • There are literally millions of web pages offering git support and advice
  • Go forth and explore

Any Questions?

Languages

Computer Science Large Practical

Language Choice

  • I stated that you were free to choose whichever language you wish
  • For anyone who has not yet started, this lecture may help you decide
  • For those of you who have, it probably is not too late to switch
  • In any case it won't do you any harm to justify your choice and/or utilise your choice appropriately

Language Choice

Languages come in many varieties, here are some of the distinctions made:
  1. Compiled vs Interpreted
  2. Strongly typed vs Weakly typed
  3. Statically typed vs Dynamically typed
  4. Functional vs Imperative
  5. Object Oriented vs Classless
  6. Lazy vs Eager
  7. Managed vs Unmanaged

For the most part these are independent of each other, giving us 2^7 (128) possibilities

Language Choice

  • Before I start though, don't forget
  • Despite being labelled large, this is a short term project
  • As such, it's okay to choose language X because:
    • “I know X better than any other language”

Compiled vs Interpreted

  • Many languages will claim to be either a “compiled language” or an “interpreted language”
  • The distinction is intended to be simple:
    • Either the source code is translated into machine code and then run or:
    • An interpreter reads the source code and executes each line of code dynamically

Advantages of Interpreters

  • An interpreter is a less complicated piece of machinery to implement than is a compiler
  • Interpreters are generally easier to port to a new machine than compilers are to re-target
  • An interpreter also works well as a debugger

Advantages of Compilers

  • The interpreter need not be installed on users' machines
  • The generated machine code is generally less expensive to run than is interpreting the original source code
  • Significant and complicated transformations can be implemented in the compiler, so even if the above were not true, compiled code should still be faster
    • This is because it represents code which has been automatically optimised

Bytecode

  • Many language implementations therefore implement something of a compromise
  • The language is compiled to a portable bytecode
  • For example, an integer addition compiled to JVM bytecode:

    0 iload_1
    1 iload_2
    2 iadd
    3 istore_3

    and the corresponding native x86 machine code:

    mov eax, dword [ebp-4]
    mov edx, dword [ebp-8]
    add eax, edx
    mov ecx, eax
    
  • This bytecode is then interpreted on the user machines
  • Even this compromise solution is further modified with the use of Just In Time compilers
  • This is now so common that the distinction between compiled and interpreted languages is debatable

Small Rebuttal

  • “The compiler can perform expensive automatic optimisations that the interpreter cannot”
  • However, one might suggest that such expensive optimisations can be performed at the source code level, hence the interpreter can still benefit from them
  • But, whilst some transformations can be performed at the source code level, not all can
  • Source-to-source optimisations are not common, likely because if efficiency is a large factor then a compiler is used anyway

Compiled/Interpreted Language?

  • There is not really any such thing as a “compiled language” or an “interpreted language”
  • There are compiler or interpreter implementations
  • A language may have one particular official implementation
  • Interpreters are nearly always implemented via some kind of bytecode
  • So we only really have compiler implementations, it is just a question of what that compiler targets, physical machines or virtual bytecode machines

Compiled/Interpreted Implementations

  • OCaml has ocamlc (a bytecode compiler) and ocamlopt (a native-code compiler)
  • Java is generally compiled to the JVM, but implementations such as gcc-java exist
  • C# and some other languages now target the CLR runtime
  • Python is generally interpreted but Cython exists (an optimising static compiler)

Compiled vs Interpreted

Conclusion

  • The distinction between compiled and interpreted is one of implementation not languages
  • However, some language features lend themselves to one more easily than the other
  • But, increased runtime sophistication has meant that the line between compiled and interpreted has become increasingly blurred
  • Your language choice should probably not focus too heavily on whether the official language implementation is a compiler or an interpreter

Type Systems

  • Languages involve expressions which evaluate to values
  • It is possible to give a type to those values
  • We can then check that operations use values of an appropriate type
  • For example we may check that we are not trying to add a string to an integer: 3 + "hello"
  • The types may also determine what the operation is:
    • Integer addition: 3 + 2
    • Floating point addition: 3.0 + 2.0

Type Systems

  • Some type systems also give types to statements
  • For example some type systems determine what exceptions may be raised by a given command (which may be a sequence of commands)
  • Some such type systems oblige the user to declare these exceptions
  • For our purposes we will concentrate on the typing of expressions/values

Strongly typed vs Weakly typed

  • This is often confused with the distinction between statically and dynamically typed languages, but the two distinctions are quite separate
  • One can have static-strong, static-weak, dynamic-strong, dynamic-weak

Strongly typed vs Weakly typed

  • Strongly: Objects of the wrong, or incompatible types cause an error:
    • 3 + "5" = error, as seen in C++, Java, Python, Ocaml
  • Weakly: Objects of the wrong, or incompatible types are converted:
    • 3 + "5" = "35" in Javascript
    • 3 + "5" = 8 in PHP, Perl5, Tcl

Advantages of Strong Typing

  • When something goes wrong, the error is produced as soon as it is discovered
  • This makes it easier to investigate the source of the error
  • Additionally, you are less likely to calculate incorrect results
  • Often, incorrect results are worse than no results

Advantages of Weak Typing

Uhm?

Advantages of Weak Typing

  • Occasionally completing a computation and obtaining a result is better than obtaining no result
  • Even if the result you obtain is wrong
  • Displaying a web page wrongly is generally better than not displaying it at all
  • You can implement this in either a strongly or weakly typed language but it is easier in a weakly typed one

Strong vs Weak Typing

Conclusion

  • You're writing a simulator, do you think that any result, no matter how incorrect, is better than no result?
  • Most of the advice I will give you here is of the annoyingly non-committal variety
  • In this case though, unless there are rather compelling reasons to decide otherwise: use a strongly typed programming language
  • But do not confuse weak typing with other type system distinctions, such as nominative, structural, duck typing

Statically typed vs Dynamically typed

  • A statically typed language specifies that the typing of expressions should be done before the program is run
  • A dynamically typed language specifies that the typing of values should be done whilst the program is run

Statically typed vs Dynamically typed


(Chart: language popularity. Source: TIOBE language index)

Statically typed vs Dynamically typed

  • One reason to type expressions is to aid compilation
  • Recall the typing of the operands to an addition operator meant that we could determine what kind of addition is required
  • We might also need to know the size of the computed value so that we know where it might be stored
  • Obviously, if the purpose of the types is to aid compilation, the type checking will have to be done statically
  • More importantly the typing of expressions and values is done to avoid the computation of incorrect results

Advantages of Static Typing

  • Type errors are caught before you attempt to run the program
    • This means for example that type errors should not occur mid-run on a user's machine
    • Even during development, perhaps you have a program that:
      • takes seconds to compile,
      • minutes/hours to run
      • and a type error in the final printing of the result
    • Using static types you will be alerted to the type error after the compile
    • Using dynamic types you will be alerted at the end of a first run

Advantages of Static Typing

  • You may be releasing a library, which isn't “run”
    • Of course you should have a test suite with 100% code coverage
    • That does not always mean the tests are particularly useful
    • What you should have and what you do have are not always the same
    • Static typing gives you some kind of guarantee for “free”

Advantages of Dynamic Typing

  • Static type checking is necessarily conservative
  • This means it will reject some programs that ultimately would not, when run, have resulted in a type error
  • During development you can avoid type checking code you know will not be run; this is a subtle point

Example of Subtle Point

Suppose you have a method to create some data type:

void create_character(int initial_health){ ... }

You realise some new feature requires a second parameter:

void create_character(int initial_health, Gender gender){ ... }

You have a small test case to test your new feature, which you know will only call this method once, say at the start

Example of Subtle Point

Unfortunately calls to this method are spread throughout your code

void restart_game (...){
  ... create_character(100); ... }
void respawn(...){
  ... create_character(80); ... }
void duplicate_cheat(...){
  ... create_character(100); ... }

But you know none of these will get called in your small test case.

Example of Subtle Point

With a static compiler you will have no choice but to update each call anyway

void restart_game (...){
  ... create_character(100, character.current_gender); ... }
void respawn(...){
  ... create_character(80, character.current_gender); ... }
void duplicate_cheat(...){
  ... create_character(100, character.current_gender); ... }

Furthermore, your new feature might not work so you might revert the change

Example of Subtle Point

Worse, you might not yet have reasonable values so you just do this:

void restart_game (...){
  ... create_character(100, None); ... }
void respawn(...){
  ... create_character(80, None); ... }
void duplicate_cheat(...){
  ... create_character(100, None); ... }

Example of Subtle Point


void restart_game (...){
  ... create_character(100, None); ... }
void respawn(...){
  ... create_character(80, None); ... }
void duplicate_cheat(...){
  ... create_character(100, None); ... }

But now, once you have completed your new feature the static type checker is of no help in finding all the places that you need to update your calls to create_character

Example of Subtle Point

Some languages have optional parameters or default arguments:

void create_character(int initial_health, Gender gender=Female){ ... }

But not all do and the same arguments apply for similar situations with changes to types, classes, interfaces, abstract classes etc.

Two Competing Forces

  • When programmers learn static type systems it often feels like you are getting more program correctness for free
  • It seems as though it is not quite for free, and that the static type system does hamper productivity in the short term
  • It also seems likely that static type systems can save on some kinds of work in the future
  • The question is, does the short-term loss in productivity pay for itself with a long-term increase in productivity?

Philosophy of Typing

  • Just as I suggested that a language is not really a “compiled” or an “interpreted” language, when typing is performed is also something of an implementation issue
  • However, there is generally a type system attached to each language
  • Some type systems are very difficult or even impossible to fully check statically
  • Some type systems deliberately ensure that it is possible to statically type check the language

Soft Typing

  • Soft typing is something of a compromise between static and dynamic typing
  • The idea of soft typing is to statically type as much of the program as is possible
  • Where the type system cannot determine that an expression or operation will never cause a type error, it inserts a run-time check
  • In this sense a dynamic type system is an extreme example of a soft-typing system that is not very good at determining any expressions which will never produce a type error

Soft Typing

  • In a sense many of our supposedly static type systems are in fact soft type systems which need few checks
  • Commonly, array indexes are not statically checked to be within the bounds of the size of the array
  • Instead a dynamic run-time check is inserted for this purpose
  • Additionally cast operations are generally checked at runtime as they cannot be statically checked to be valid

Static Analysers

  • When the types are not used by the compiler to generate code, the static type checker is ultimately just a static analyser
  • We can deploy many static analysers
  • We can also omit to run any or all of them during a development run
  • Personally, I'm a big fan of static analysers
  • Static type systems are no exception, but I think they should be optional

Statically Typed vs Dynamically Typed

Conclusion

  • The distinction between statically typed and dynamically typed is in theory one of implementation, but in practice one of language
  • The distinction though is softer than some may suggest
  • It is more of a gradient than a dichotomy
  • For this project, either kind of type system will be fine
  • But, whichever choice you make, I recommend making use of additional static analysers
  • And, whichever choice you make, you should write some tests

Functional vs Imperative

  • This distinction is somewhat disputed
  • The main idea is that a functional language computes values of expressions, but does not modify state
  • An imperative language is simply a non-functional language, that is, one which allows/encourages the programmer to directly modify state

Functional vs Imperative

  • It turns out, that a lot of programs involve a lot of functional computation, with a very small amount of state modification
  • Hence, the term functional is often relaxed to include those languages that discourage state modification
  • More importantly, such languages encourage declarative code.
    • That is, code which does not modify the state

Functional Programming

  • I tend to describe any language with proper support and syntax for nested, higher-order functions to be functional
  • A higher-order function is simply one that:
    • Takes one or more functions as parameters
    • Or returns a function as a result (or both)
  • In general treating functions as any other kind of value is known as providing first class functions
  • If the language also allows nested functions which can access the scope of containing functions, the implementation requires function closures
  • The provision of nested, higher-order functions usually encourages declarative programming
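
A tiny Python example of a nested, higher-order function; the name make_adder is invented purely for this illustration:

def make_adder(n):
    """Return a new function which adds n to its argument (a closure over n)."""
    def add(x):
        return x + n
    return add

add_three = make_adder(3)                      # make_adder returns a function
print(add_three(4))                            # 7
print(list(map(make_adder(10), [1, 2, 3])))    # map is higher-order too: [11, 12, 13]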

Functional Programming

  • Languages which entirely forbid state updates I describe as strictly functional
  • Even this is a little confusing because some people describe eager evaluation as strict evaluation
  • So I might also say a pure functional language or simply a pure language

Functional Advantages

  • The key advantage of a functional programming language is the hugely pretentious phrase “referential transparency”
  • I'm not sure, but I suspect this phrase is one reason functional programming languages are not more widely adopted
  • It means, that an expression evaluates to the same result regardless of the time, or state, in which it is evaluated
  • In particular invoking a function: some_fun(args) with the same arguments args will always produce the same result

Functional Advantages

  • This makes testing and/or reasoning about the correctness of code much easier
  • In theory, it means code is more re-usable
    • This is debatable, and not, to my knowledge, demonstrated (either way) satisfactorily
    • But it's certainly plausible

Functional Advantages

In theory, this additionally allows for some interesting compiler optimisations, consider the following double transformation over a list of items:

some_list = map f (map g original_list)
This is common in both functional and imperative languages, even if in imperative languages it is an array which is looped over.

Functional Advantages


some_list = map f (map g original_list)
It can be re-written to the faster:

fg = f . g
some_list = map fg original_list
Where f . g is the composition of the two functions. This is faster because it only loops over the list once.
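
The same rewrite, expressed in Python for anyone unfamiliar with the notation above; the compose helper is defined here purely for illustration:

def compose(f, g):
    """Return the function x -> f(g(x))."""
    return lambda x: f(g(x))

f = lambda x: x + 1
g = lambda x: x * 2
original_list = [1, 2, 3]

# Two separate passes, building an intermediate list:
some_list = [f(y) for y in [g(x) for x in original_list]]
# Fused into a single pass with no intermediate list:
fused_list = [compose(f, g)(x) for x in original_list]

print(some_list, fused_list)                   # [3, 5, 7] [3, 5, 7]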

Functional Advantages


some_list = map f (map g original_list)
It can be re-written to the faster:

fg = f . g
some_list = map fg original_list
However, this optimisation changes the order of execution. So it is only applicable where f does not modify state which g references, or vice-versa. In a functional language this is both more likely and easier to check automatically.

Functional Advantages


fg = f . g
some_list = map fg original_list
  • Similarly if you have multiple processors, you could begin the second map operation in parallel as soon as the first transforms the first item.
  • Again, only if you can determine that there are no state dependencies.
  • In general parallel programming can in theory be advanced by limiting state updates

Imperative Advantages

  • With no state modifications all information required by any function must be passed in as an argument
  • This can arguably make the code more complicated
  • Worse, it can require a large refactoring in order to make a relatively simple change

Imperative Advantages

  • However, recall that my definition of a functional programming language did allow for state modifications.
  • It only required nested, higher order functions
  • It's hard to argue that not providing these is an advantage to the programmer
  • One could argue that the implementation (of the language) is simpler
    • It is debatable, but one can certainly argue that the implementation of nested higher-order functions requires a performance degradation
    • Functions are more heavyweight and hence more expensive to invoke

Functional vs Imperative

Conclusion

  • You could certainly use either a functional or an imperative language for this practical
  • You're probably best off with whichever you prefer

Object Oriented vs Classless

  • Given my glowing recommendation for higher-order functions why are they not more commonly used?
  • Classes, or objects, allow for a similar abstraction
  • An object is really a collection of state together with operations over that state

Typical Class Definition


class ClassName (ParentClass){
   classmember_1 = 0;
   classmember_2 = "hello";

   void class_method_1(int i){
       self.classmember_1 += i;
   }
   void class_method_2(String suffix){
       print_to_screen(self.classmember_2);
       print_to_screen(suffix);
   }
}

Object Oriented Languages Popularity

Category                     Ratings Sep 2013    Delta Sep 2012
Object-Oriented Languages    56.0%               -1.1%
Procedural Languages         37.3%               -0.9%
Functional Languages         3.8%                +0.6%
Logical Languages            3.0%                +1.3%
Source: TIOBE language index

Advantages of Object-Oriented

  • Surprisingly debatable
  • Most people agree that there is some value in object-oriented programming
  • But when asked to give concrete advantages, most offer:
    • Vague perceived benefits, with no logic connecting to OOP:
      • Advances reuse
      • Better models the real world
    • Clear benefits but which are not unique to OOP:
      • Polymorphism (fancy word for a specific kind of generality)
      • Encapsulation (fancy word for hiding/abstraction)

Advantages of Classless

  • No one really argues that the provision of classes is inherently destructive
  • In a similar way to higher-order functions, having the ability to utilise classes does not do any harm if you never use them
  • However, once the temptation is there, it's easy to go class crazy
  • But such arguments are not arguments against the use of an object-oriented language, so much as an argument for careful use of classes

Object Oriented vs Classless

Conclusion

  • By all means choose an object-oriented language
  • There is little reason not to, but pure languages often do not have a notion of an object
    • This is for good reason and should not put you off choosing a pure language
  • If you do choose an object-oriented language, use your classes with care
  • Classes are just one way of organising source code.
    • There are others which are just as effective
    • Using an OOP language will not magically organise your source code for you

Lazy vs Eager

Lazy vs Eager

  • Suppose we attempt to enforce the policy: everyone leaves the seat down
  • What if two men (or the same man twice) use the toilet in succession
  • This means the first man unnecessarily put the seat down only for the second man to put it back up again

Lazy vs Eager

  • Suppose we attempt to enforce the policy: everyone leaves the seat up
  • Now if two women (or the same woman twice) use the toilet in succession
  • This means the first woman unnecessarily put the seat up only for the second woman to put it back down again
  • “hugely” wasteful

Lazy vs Eager

  • A more efficient strategy: leave the seat as it is
  • If two people of the same gender visit successively no unnecessary work is done
  • Whenever there is a gender switch the second person must change the state of the seat
  • But that would otherwise have been done by the previous visitor anyway

Lazy vs Eager

  • The first two inefficient strategies are examples of eager evaluation
  • The final more efficient strategy is an example of lazy evaluation
  • Essentially lazy evaluation is the policy of only ever computing a value when it is required

Lazy vs Eager

  • Remaining in the household, this is equivalent to only washing dirty dishes when you are about to use them
  • In this case, you do the same amount of washing up, it is only a question of when
  • Unless you have some dish that is used exactly once
  • But note, the lazy policy requires more space next to the sink

Lazy vs Eager

  • Laziness is awesome
  • But, there are some significant caveats to that
  • I'll try to describe why I think laziness is an excellent feature
  • But then also why it is not widely available

I stopped here at the lecture on October 4th

A common argument

You can compute infinite values

primes = [ x | x <- [2..], is_prime x ]
get_prime x = primes !! x
This is a mild benefit
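For comparison, here is a rough (and much less elegant) Java analogue of the Haskell above, using an unbounded, lazily evaluated stream; the helper names are mine, not part of any required API.

import java.util.stream.IntStream;

public class LazyPrimes {
    static boolean isPrime(int n) {
        for (int i = 2; i * i <= n; i++) {
            if (n % i == 0) { return false; }
        }
        return n >= 2;
    }

    // The x-th prime, taken from a conceptually infinite stream of candidates
    static int getPrime(int x) {
        return IntStream.iterate(2, n -> n + 1)
                        .filter(LazyPrimes::isPrime)
                        .skip(x)
                        .findFirst()
                        .getAsInt();
    }

    public static void main(String[] args) {
        System.out.println(getPrime(0)); // 2
        System.out.println(getPrime(4)); // 11
    }
}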

Consider

Imagine you are writing software to statically analyse a programming language. You can imagine many such analyses, and you wish the user to be able to turn the various analyses on or off as they see fit. Suppose you first attempt to check whether there are any calls to methods which are undefined:

if (analyse_called_names){
   method_names = gather_method_names(..)
   called_names = gather_called_names(..)
   for name in called_names{
       if name ∉ method_names{
           report_error()
       }
   }
}

Being Concise

I'm going to keep everything on one slide so I'll pretend we have some set based operators:

if (analyse_called_names){
   method_names = gather_method_names(..)
   called_names = gather_called_names(..)
   if (called_names ⊈ method_names){ 
      report_error () }
}

Adding A Second Analysis

Now you wish to check if there are any methods which are defined but never used. Note that this might be considered more of a warning than an error:

if (analyse_called_names){
   method_names = gather_method_names(..)
   called_names = gather_called_names(..)
   if (called_names ⊈ method_names){ 
      report_error () }
}
if (analyse_uncalled_names){
   method_names = gather_method_names(..)
   called_names = gather_called_names(..)
   if (method_names ⊈ called_names){ 
      report_error () }
}

Computing Sets Twice

This means if the user wants both analyses we are computing method_names and called_names twice.

if (analyse_called_names){
   method_names = gather_method_names(..)
   called_names = gather_called_names(..)
   if (called_names ⊈ method_names){ 
      report_error () }
}
if (analyse_uncalled_names){
   method_names = gather_method_names(..)
   called_names = gather_called_names(..)
   if (method_names ⊈ called_names){ 
      report_error () }
}

Only Compute What We Need

Any attempt to only compute the stuff you need gets complicated:

if (analyse_called_names || analyse_uncalled_names){
   method_names = gather_method_names(..)
   called_names = gather_called_names(..)
   if (analyse_called_names && called_names ⊈ method_names)
      { report_error() }
   if (analyse_uncalled_names && method_names ⊈ called_names)
      { report_error() }
}
This case was helped because both analyses required the same two sets of names. It's still a touch ugly that I have to inspect analyse_(un)called_names twice.

A Third Analysis

Let's add another analysis that checks if method names overlap class names:

if (analyse_called_names || analyse_uncalled_names || analyse_class_names){
   method_names = gather_method_names(..)
   called_names = gather_called_names(..) // Hmm?

   if (analyse_called_names){ 
        if (called_names ⊈ method_names) { report_error() }
   }
   if (analyse_uncalled_names){
        if (method_names ⊈ called_names) { report_error() }
   }
   if (analyse_class_names){
      class_names = gather_class_names(..)
      if (method_names ∩ class_names != {}){ return error() }
   }
}

A Third Analysis

Let's add another analysis that checks if method names overlap class names:

if (analyse_called_names || analyse_uncalled_names || analyse_class_names){
   method_names = gather_method_names(..)

   if (analyse_called_names){ 
        called_names = gather_called_names(..)
        if (called_names ⊈ method_names) { report_error() }
   }
   if (analyse_uncalled_names){
        called_names = gather_called_names(..)
        if (method_names ⊈ called_names) { report_error() }
   }
   if (analyse_class_names){
      class_names = gather_class_names(..)
      if (method_names ∩ class_names != {}){ return error() }
   }
}
Deliberate error?

Worse

  • Imagine I now add another analysis that uses the set of class_names but not either of the other two
  • We can use a thunk pattern (a sketch of one follows below), but that still complicates the code a little
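A minimal sketch of such a thunk in Java, assuming nothing beyond java.util.function.Supplier: the computation runs at most once, and only if it is ever asked for. A lazy language gives you exactly this behaviour for free.

import java.util.function.Supplier;

class Lazy<T> {
    private Supplier<T> supplier;
    private T value;
    private boolean computed = false;

    Lazy(Supplier<T> supplier) { this.supplier = supplier; }

    T get() {
        if (!computed) {
            value = supplier.get();   // computed only on first use
            computed = true;
            supplier = null;          // allow the closure to be collected
        }
        return value;
    }
}

// Hypothetical usage, with gather_class_names standing in for the real analysis:
//   Lazy<Set<String>> class_names = new Lazy<>(() -> gather_class_names());
//   ... class_names.get() ...   // gathered only if some analysis needs it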

Lazy Implements Thunk Anyway

In a lazy language I just do this:

method_names = gather_method_names(..)
called_names = gather_called_names(..)
class_names = gather_class_names(..)
if (analyse_called_names && called_names ⊈ method_names){
   report_error() }
if (analyse_uncalled_names && method_names ⊈ called_names){
   report_error() }
if (analyse_class_names && method_names ∩ class_names != {}){
   report_error() }
Because each list of names is lazily evaluated, each is only computed if required, but I don't need to code that logic myself

Advantages of Eager Evaluation

  • So why aren't all languages lazy?
  • Lazy evaluation removes the predictability of when an expression may be evaluated
  • Hence if your language allows side effects, lazy evaluation does not really work
  • So lazy evaluation only really works together with a purely functional language
  • Haskell is lazy; OCaml is not

Lazy vs Eager

Conclusion

  • There is no strong reason to choose a lazy or an eager language for this practical
  • In any case your choice is more or less made up for you by your other choices
  • If you like Haskell, Clean, Miranda or Hope, you will compute values lazily, most other languages are eager

Managed vs Unmanaged

  • Automatic memory management, sometimes called garbage collection
  • Without this, whenever you need to store a value in memory, you must first ask for the space in memory
  • When a value in memory is no longer useful, you should give back the space in memory that it used
  • If you let the last reference to a value go out of scope without freeing the associated memory, you can never free it, and hence you have a space leak
  • Unfortunately, if you give back the memory too soon, you may subsequently try to reference the value, which may cause a segmentation fault

Advantages of Memory Management

  • You need not manage the memory yourself, this is hugely liberating
  • I believe that much of the productivity gain associated with:
    • Object-oriented languages
    • Dynamically/statically typed languages
    • Lazy languages
    • Reflection
    is actually productivity gained from automatic memory management, misattributed to the above
  • I'm not saying these things do not also improve productivity

Advantages of Memory Management

  • I can say f(g(x)) and not have to think about whether the intermediate result produced by g needs to be cleaned-up
  • I can return from anywhere I like in the middle of a method, without worrying about all paths re-joining to free-up used memory
    • Honestly: “Only One Return” was a common coding rule
    • Sometimes called “Single Entry, Single Exit”

Advantages of Manual Memory Management

  • Nostalgia
  • In theory you can implement manual memory management more efficiently
    • This is a bit debatable
    • In any case, the improved productivity gained through the use of an automatic garbage collector can be put to use in optimising the rest of your code
    • In particular better algorithms rather than faster implementation of the same one

Advantages of Manual Memory Management

  • Predictability, it can be difficult to know when the garbage collector might run
    • So real-time systems which must respond to incoming external events may suffer
    • But there is much research into automatic garbage collection, and real-time garbage collectors do exist

Managed vs Unmanaged

Conclusion

  • Choose a managed language
  • If you are only familiar with an unmanaged language either:
  • Don't complain that I haven't given you any concrete advice

Other Distinctions

  • Low-level vs High-level
    • This is mostly a distinction made from a combination of those above
  • Significant Whitespace or not:
    • Personally I love it, but it is only syntax; it does not matter much
    • If it bothers you that much you can always write a parser for a different syntax
  • Scripting vs Systems:
    • If you must distinguish these you can interoperate between them

What is the Best Language?

Main Conclusion

  • In general, it is less what the language provides and more what libraries are available in that language
  • This practical however, does not require the use of any major libraries
  • Hence you are somewhat more free to choose based on the criteria I have discussed above
  • Good Luck!

Summary and Conclusions

  • My hard advice can be summarised as:
    • Choose a strongly typed language
    • Choose a language with automatic memory management
  • It may be a useful thing to report in your README why you have chosen the language you have
  • A perfectly valid reason is:
    • “Language X is my favourite language which I know better than all others”

Any Questions?

Structure & Strategy

Computer Science Large Practical

This Lecture

  • In this lecture I will try to give some helpful advice about the structure of your source code and your strategy

Overall Structure

  • I do not wish to give too much advice since I do not want a set of near identically structured solutions
  • Part of the practical is structuring it yourself. However, it seems likely you will want at least the following components:
    • A parser
    • A representation of the state of a simulation
      • Operations over that state
    • The simulation algorithm
    • Something to handle output
    • Something to analyse results
    • A test suite

Overall Structure

  • I call these components, I do not call them:
    • Classes, Instances, Objects, structs
    • Interfaces, signatures, prototypes, aspects
    • Methods, functions
    • Modules, packages, functors
    • Types, type classes
  • This is not because I did not specify the source language
  • It is because they could reasonably be any of these things
  • It is up to you to decide what is most appropriate

Some Obvious Decisions

  • Do you want to parse into some abstract syntax data structure and then convert that into a representation of the initial state?
    • Or you could parse directly into the representation of the initial state
  • Do you wish to print out events as they occur during the simulation?
    • Or record them and print them out later
  • Do you wish to analyse the simulation events as the simulation proceeds?
    • Or analyse the events afterwards
    • Either by recording them, or by writing a parser for your own event output
    • You could write the simulator and the events analyser as two completely separate programs, even in different languages

Parsing

  • You do not need to start with the parser
  • The parser produces some kind of data structure. You could instead start by hard coding your examples in your source code
  • But the parsing for this project is pretty simple
  • Hence I would start with the parser, even if I did not complete the parser before moving on
    • I find hard coding data structure instances laborious
    • But doing so would ensure your simulator code is not heavily coupled with your parser code, if you decide that is important

Software Construction

  • Software construction is unusual among large projects in that it allows a great deal of backtracking
  • Many other forms of projects, such as construction, event planning, and manufacturing, only allow for backtracking in the design phase.
  • Because of this, traditional project planning advocates a large amount of up-front design
  • When computer programming projects first started to grow beyond the remit of a single week, such techniques were applied
  • We now know that this is often a waste of software's unusual capacity for backtracking

Software Construction

  • Another way to view this:
    • Construction projects cannot afford to change the design once construction has begun
    • Hence, the design phase consists of building the object virtually (on paper, on a computer) when backtracking is inexpensive
    • Software projects do not produce physical artifacts, so the construction of the software is mostly the design

Refactoring

  • Refactoring is the process of changing code such that it computes exactly the same function (of inputs to outputs), but has a better design.
  • This is tremendously powerful, because it allows us to try out various designs, rather than guessing which one is the best
  • It allows us to determine whether something is possible, without necessarily building it in the best way
  • It allows us to design retrospectively once we know significant details about the problem at hand.
  • It allows us to avoid the cost of full commitment to a particular solution which, ultimately, fails.

Suggested Strategy

  • Note that this is merely a suggested strategy
  • Start with the simplest program possible
  • Incrementally add features based on the requirements
  • After each feature is added, refactor your code
    • This step is important, it helps to avoid the risk of developing an unmaintainable mess
    • Additionally it should be done with the goal of making future feature implementations easier
    • This step includes janitorial work (see below)

Suggested Strategy

  • At each stage, you always have something that works
  • Although you need not specifically design for later features, you do at least know of them, and hence can avoid doing anything which will make those features particularly difficult.

Alternative Inferior Strategy

  • Design the whole system before you start
  • Work out all components and sub-components you will need
  • Start with the sub-components which have no dependencies
  • Complete each sub-component before moving on to the next
  • Once you have developed all the dependencies of a component you can now choose that component to develop
  • Finally, put everything together to obtain the entire system
    • Test the entire system

Janitorial Work

  • Wish to discuss two points:
    • Real and Logical Time
    • How to break a rule
  • To do so I'll need the notion of janitorial work
  • Examples of Janitorial
    • Reformatting
    • Commenting
    • Changing Names
    • Tightening

Janitorial Work

Reformatting


int method_name (int x)
{
  return x + 10;
}
Becomes:

int method_name(int x) {
  return x + 10;
}
There is plenty of software which will do this work for you as well.

Janitorial Work

Reformatting

  • Reformatting is entirely superficial
  • It is important to consider when you apply this
  • Reformatting can result in a large ‘diff’
  • This may well conflict with other work performed concurrently
  • Reformatting should be largely unnecessary if you keep your code formatted correctly in the first place
    • More commonly required on group projects

Janitorial Work

Commenting

  • I hope I needn't re-iterate the importance of writing good comments in your source code
  • When done as janitorial work this can be particularly useful
    • You can comment on the stuff that is not obvious even to yourself as you read it. This is much more difficult when writing new code
  • The important thing to comment is not what or how but why
  • Try not to have redundant information in your comments:
    
    // the first integer argument
    
  • The fancy XML formatting does nothing to save this comment

Janitorial Work

Commenting

Ultra bad:

// increment x
x += 1;
Better:

// Since we now have an extra element to consider
// the count must be incremented
x += 1;

Janitorial Work

Changing Names

  • The previous example used x as a variable name
  • Unless it really is the x-axis of a graph, choose a better name
  • This is of course better to do the first time around
  • However as with commenting, unclear code can often be more obvious to its author upon later reading it

Janitorial Work

Tightening


void main(...){
  run_simulation();
}
Tightened to become:

void main(...){
  try{
    run_simulation();
  } catch (FileNotFoundException e) {
    // Explain to the user ..
  }
}

Janitorial Work

Tightening

  • For some this is not janitorial work, since it actually changes in a non-superficial way the function of the code
  • I place it here, since similar to other forms it is often caused by being unable to think of everything when writing new code

Janitorial Work

  • Most of this work is work that arguably could have been done right the first time around when the code was developed
  • However, when developing new code, you have limited cognitive capacity
  • You cannot think of everything when you develop new code, janitorial work is your time to rectify the minor stuff you forgot
  • Better than trying to get it right first time is making sure you later review your code

Janitorial Work

  • “Refactoring is the process of changing code such that it computes exactly the same function (of inputs to outputs), but has a better design.”
  • Strictly speaking janitorial work is not refactoring
    • It should not change the function of the code
      • Tightening might, but generally for exceptional input
    • But neither does it make the design any better
  • In common with refactoring you should not perform janitorial work on pre-existing code whilst developing new code
  • It will not do you any harm to use the phrase “janitorial work” in your commit messages

Janitorial Work

  • Suppose I'm implementing some new feature and I come across this
  • 
    // prase the 'validate' command
    
  • It's tempting to fix it right now and you should
  • 
    // parse the 'validate' command
    
  • So how do I follow these two bits of advice?
  • How do I “fix small things right now” whilst also avoiding “doing two things at once”

Real and Logical Time

  • The answer is, I fix it right now in real time, but use source code control to avoid doing two things at once in logical time
  • You should be on a development branch and do this:
  • 
    $ git checkout master # go back to the master branch
    $ editor mycode.cobol # Fix the typo
    $ git commit -a -m "Fixed a prase->parse typo"
    $ git checkout dev # go back to your development branch
    $ git rebase master # pretend you fixed the typo before
    
Do not use cobol, that's just a joke

How to Break a Rule

  • “you should not perform janitorial work on pre-existing code whilst developing new code”
  • If this is “too much work right now” then just fix the typo rather than leave it as is
  • In other words, if you must break the rule, break it such that the code is still fixed
  • This is especially true if the fix is some form of tightening
  • Of course if the fix itself is too much work for right now, then it should go in a bug tracker

More About Refactoring

  • Refactoring is a term which encompasses both factoring and defactoring
  • Generally the principle is to make sure that code is written exactly once
  • We hope for zero duplication
  • However, we would also like for our code to be as simple and comprehensible as possible

Factoring and Defactoring

  • We avoid duplication by writing re-usable code
  • Re-usable code is generalised
  • Unfortunately, this often means it is more complicated
  • Factoring is the process of removing common or replaceable units of code, usually in an attempt to make the code more general
  • Defactoring is the opposite process specialising a unit of code usually in an attempt to make it more comprehensible

I ended the lecture on October 11th here

Breaking (Bad) Methods

Here is a question posted to stack overflow:

When is a function too long? [closed]

35 lines, 55 lines, 100 lines, 300 lines? When you should start to break it apart? I'm asking because I have a function with 60 lines (including comments) and was thinking about breaking it apart

long_function(){ ... }

into:

small_function_1(){...}
small_function_2(){...}
small_function_3(){...}

The functions are not going to be used outside of the long_function, making smaller functions means more function calls, etc.

When would you break apart a function into smaller ones? Why?

A Blog Post

Long methods and classes are evil

See the original here.
  1. Any method should fit into your IDE window ... I strive for an average of not more than 5 lines of code per method
  2. Too many private members .. say greater than 10
  3. The size of the code file. Offhand, I’d say any class over 10k in size ...
  4. Exception handling code and instrumentation tend to push methods to be much larger. Invest some thought in how to segregate this type of code away from the main functionality
  5. Reduce the number of public methods in a class. Just picking a number, I would say less than 10 in most cases.

Common Rules

These are some common rules:
  • “Methods should have no more than X lines of code”
  • “Classes should have no more than X methods/private variables”
  • “Files should have no more than X lines of code”

Ignore Such Rules

  • Most such rules are well intentioned
  • They are supposed to be easy to adhere to and check
  • But unless you understand the motivation behind such a rule, following it will do you no good
  • These rules tell you what not to write, but they do not explain what you should write instead
  • Not to mention the fact that most good rules have some exceptions

Example (long methods)


integer my_long_method(int input){
   int x = 0;
   ...
   // Do some stuff
   ...
   // Do some other stuff
   ...
   // Finally return
   return x;
}
  
Oh oh, this method is apparently 900 lines of code long.

Example (long methods)


void do_some_stuff(int x, int y){
    // Do some stuff
}
integer do_some_other_stuff(int x, String star){
    // Do some other stuff
    return x
}
integer my_long_method(int input){
   int x = 0;
   ...
   do_some_stuff(x, 0);
   x = do_some_other_stuff(x, input_string);
   // Finally return
   return x;
}
  
Ah good, this method is now only a few lines long

Factoring


void primes(int limit){
    integer x = 2;
    while (x <= limit){
        boolean prime = true;
        for (i = 2; i < x; i++){
            if (x % i == 0){ prime = false; break; }
        }
        if (prime){ System.out.println(x + " is prime"); }
        x++;
    }
}
A very naive but perfectly reasonable bit of code to print out a set of prime numbers up to a particular limit

Factoring


void print_prime(int x){
    System.out.println(x + " is prime");
}
void primes(int limit){
    x = 2;
    while (x <= limit){
        ... // as before
        if (prime){ print_prime(x); }
    }
}
Here we have “factored out” the code to print the prime number to the screen. This may make it more readable, but I have not made the code more general.

Factoring

To make it more general we have to actually parameterise what we do with the primes once we have found them.

interface PrimeProcessor{
    void process_prime(int x);
}
class PrimePrinter implements PrimeProcessor{
    public void process_prime(int x){
        System.out.println(x + " is prime");
    }
}
void primes(int limit, PrimeProcessor p){
    x = 2;
    while (x <= limit){
        ... // as before
        if (prime){ p.process_prime(x); }
    }
}

Factoring

If I wish to store the primes instead:

class PrimeRecorder implements PrimeProcessor{
    public LinkedList primes;
    public PrimeRecorder(){
       self.primes = new LinkedList();
    }
    public void process_prime(int x){
        self.primes.append(x);
    }
}

Factoring

I can go further and factor out the testing as well:

interface PrimeTester{
    boolean is_prime(int x);
}
class NaivePrimeTester implements PrimeTester{
    public boolean is_prime(int x){
        for (i = 2; i < x; i++){
            if (x % i == 0){ return false; }
        }
        return true;
    }
}
void primes(int limit, PrimeTester t, PrimeProcessor p){
    x = 2;
    while (x <= limit){
        if (t.is_prime(x)){ p.process_prime(x); }
        x++;
    }
}

Factoring

Now that I've factored out the test, it does not have to be used solely for primes

interface IntTester{
    boolean property_holds(int x);
}
class NaivePrimeTester implements IntTester{
    public boolean property_holds(int x){
        for (i = 2; i < x; i++){
            if (x % i == 0){ return false; }
        }
        return true;
    }
} // Similarly for PrimeProcessor to IntProcessor
void number_sieve(int limit, IntTester t, IntProcessor p){
    x = 0;
    while (x <= limit){
        if (t.property_holds(x)){ p.process_integer(x); }
        x++;
    }
}

Factoring

Print the perfect numbers:

interface IntTester{
    boolean property_holds(int x);
}
class PerfectTester implements IntTester{
    public boolean property_holds(int x){
        return (sum(factors(x)) == x);
    }
} // Similarly for PerfectProcessor
void number_sieve(int limit, IntTester t, IntProcessor p){
    x = 0;
    while (x <= limit){
        if (t.property_holds(x)){ p.process_integer(x); }
        x++;
    }
}

Factoring

We might find the two extra parameters a bit ugly, no problem:

public abstract class NumberSieve{
    abstract boolean property_holds(int x);
    abstract void process_integer(int x);
    abstract int start_number;
    void number_sieve(int limit){
        x = self.start_number;
        while (x <= limit){
            if (self.property_holds(x)){ self.process_integer(x); }
            x++;
        }
    }
}

Factoring

Here is the code for printing the primes:

public abstract class NumberSieve{
    abstract boolean property_holds(int x);
    abstract void process_integer(int x);
    abstract int start_number;
    void number_sieve(int limit){
        x = self.start_number;
        while (x <= limit){
            if (self.property_holds(x)){ self.process_integer(x); }
            x++;
        }}} // Close all the scopes
public class PrimeSieve inherits NumberSieve{
    public boolean property_holds(int x){
        for (i = 2; i < x; i++){
            if (x % i == 0){ return false; }
        }        return true;  }
    void process_integer(int x) { System.out.println (x + " is prime!"); }
    int start_number = 2;
}

Factoring

Print the perfect numbers:

public class PerfectSieve inherits NumberSieve{
    public boolean property_holds(int x){
        return (sum(factors(x)) == x); }
    void process_integer(int x) { System.out.println (x + " is perfect!"); }
    int start_number = 2;
}

Factoring

So which version do we prefer? This one:

public abstract class NumberSieve{
    abstract boolean property_holds(int x);
    abstract void process_integer(int x);
    abstract int start_number;
    void number_sieve(int limit){
        x = self.start_number;
        while (x <= limit){
            if (self.property_holds(x)){ self.process_integer(x); }
            x++;
        }}} // Close all the scopes
public class PrimeSieve inherits NumberSieve{
    public boolean property_holds(int x){
        for (i = 2; i < x; i++){
            if (x % i == 0){ return false; }
        }        return true;  }
    void process_integer(int x) { System.out.println (x + " is prime!"); }
    int start_number = 2;
}

Factoring

Or the original version?

void primes(int limit){
    integer x = 2;
    while (x <= limit){
        boolean prime = true;
        for (i = 2; i < x; i++){
            if (x % i == 0){ prime = false; break; }
        }
        if (prime){ System.out.println(x + " is prime"); }
        x++;
    }
}

Factoring

Something in between?

LinkedList get_primes(int limit){
    int x = 2; LinkedList results = new LinkedList();
    while (x <= limit){
        boolean prime = true;
        for (i = 2; i < x; i++){
            if (x % i == 0){ prime = false; break; }
        }
        if (prime){ results.append(x); }
        x++;
    }
    return results;
}
void primes(int limit){ 
    for x in get_primes(limit){
        System.out.println(x + " is prime"); 
    }
}

Factoring

  • The answer of course depends on the context
  • How likely am I to need more number sieves?
  • How likely am I to do something other than print the primes?
  • The compromise is surely slower for printing the primes out
  • But it is very adaptable

Defactoring

  • Numbers such as the number 20 can be factored in different ways
    • 2,10
    • 4,5
    • 2,2,5
  • If we have the factors 2 and 10, and realise that we want the number 4 included in the factorisation we can either:
    • Try to go directly by multiplying one factor and dividing the other
    • Defactor 2 and 10 back into 20 and then divide 20 by 4

Defactoring

  • Similarly your code is factored in some way
  • In order to obtain the factorisation that you desire, you may have to first defactor some of your code
  • This allows you to factor down into the desired components
  • This is often easier than trying to short-cut across factorisations

Sieve of Eratosthenes

  1. Create a list of consecutive integers from 2 to n: (2, 3, 4, ..., n)
  2. Initially, let p equal 2, the first prime number
  3. Starting from p, count up in increments of p and mark each of these numbers greater than p itself in the list
    • These will be multiples of p: 2p, 3p, 4p, etc.; note that some of them may have already been marked.
  4. Find the first number greater than p in the list that is not marked
    • If there was no such number, stop
    • Otherwise, let p now equal this number (which is the next prime), and repeat from step 3

Sieve of Eratosthenes


void primes(int limit){
    LinkedList prime_numbers = new LinkedList();
    boolean[] is_prime = new Array(limit, true);
    for (int i = 2; i <= Math.sqrt(limit); i++){
        if (is_prime[i]){
            for (int j = i * i; j < limit; j += i){
                is_prime[j] = false;
            }
        }
    }
    for (int i = 2; i < limit; i++){
        if (is_prime[i]){ prime_numbers.append(i); }
    }
}
I can probably do this via our abstract number sieve class, but I doubt I want to. The alternative is to defactor back to close to our original version and then factor the way we want it.

Defactoring

  • Defactoring then can be used as the first step of refactoring
  • It might also simply be that you feel the current factored version is over-engineered
  • Flexibility is great, but it is generally not without cost
    • The cognitive cost associated with understanding the more abstract code
  • If the flexibility is not currently required, and is unlikely to become required, then it might be worthwhile defactoring
  • Don't be shy in explaining your reasoning in comments and commit messages

Refactoring Can Better Document

What does this code do?

r = g.nextDouble();
d = -(1.0/x) * Math.log(r);
System.out.println (d);

Refactoring Can Better Document

Better?

dice = generator.nextDouble();
delay = -(1.0/rate) * Math.log(dice);
System.out.println (delay);

Refactoring Can Better Document

How about now?

// Choose a delay from the exponential distribution given the rate
dice = generator.nextDouble();
delay = -(1.0/rate) * Math.log(dice);
System.out.println (delay);

Refactoring Can Better Document

Do I need a comment?

double calculate_exponential_delay(double rate, Random generator){
    dice = generator.nextDouble();
    delay = -(1.0/rate) * Math.log(dice);
    return delay;
}
System.out.println (calculate_exponential_delay(rate, generator));
Even if the method is defined elsewhere and you only see the print line

System.out.println (calculate_exponential_delay(rate, generator));

Refactoring Can Better Document


System.out.println (calculate_exponential_delay(rate, generator));
  • If your code is highly coupled it will be difficult to extract such self-documenting fragments
  • In this case, you have code you should try to re-arrange first before factoring out
  • If your factored out method has a ridiculously long name, or many parameters it is a good sign that it is not worth factoring out:

xs = calculate_exponential_delays_from_global_events(rate_function, 
                                                     generator , 
                                                     ...);

Refactoring Summary

  • Code should be factored into multiple components
  • Refactoring is the process of changing the division of components
  • Defactoring can help the process of changing the way the code is factored
  • Well factored code will be easier to understand
  • Do not update functionality at the same time

Common Approach

  • There is a common approach to developing applications
    1. Start with the main method
    2. Write some code, for example to parse the input
    3. Write (or update) a test input file
    4. Run your current application
    5. See if the output is what you expect
    6. Go back to step 2.

Do Not Start with Main

  • A better place to start is with a test suite
  • This doesn't have to mean you cannot start coding
  • Write a couple of test inputs
    • in separate files or as string literals
  • Create a skeleton “do nothing” parse method
  • Create an entry point which simply calls your parse method on your test inputs (all of them)
  • Watch them fail

Do Not Start with Main


DataStructure parse_method(String input_string){
    return null;
}
void run_test(input){
    try { result = parse_method(input);
          if (result == null){
            System.out.println("Test failed by producing null");
          } else { System.out.println("Test passed"); }
    }
    catch (Exception e){
        System.out.println("Test raised an exception!");
    }}
test_input_one = "...";
test_input_two = "...";
void test_main(){
    run_test(test_input_one);
    run_test(test_input_two);
    ...
}

Do Not Start with Main

  1. Code until those tests are green
    • Including possibly refactoring
  2. Without forgetting to commit to git as appropriate
  3. Consider new functionality
    • Write a method that tests for that new functionality
    • Watch it fail, whether by raising an exception or simply not producing the results required
    • Return to step 1.
  4. You can write your main method any time you like
    • It should be very simple, as it simply calls all of your fully tested functionality

Do Not Start with Main

  • Any time you run your code and examine the results, you should be examining output of tests
  • If you are examining the output of your program ask yourself:
    • Why am I examining this output by hand and not automatically?
    • If I fix whatever is strange about the output can I be certain that I will never have to fix this again?
  • Of course sometimes you need to examine the output of your program to determine why it is failing a test. This is just semantics (it is still the output of some test)

Do Not Start with Main

Summary

  • Everything your program outputs should be tested
  • Intermediate results that you might not output can still be tested as well
  • Run all of your tests, all of the time
    • It may take too long to run them all for each development run
    • In which case, run them all before and after each commit

Optimisation

  • Re-usability can conflict heavily with readability
  • Similarly optimised or fast code can conflict with readability
  • You are writing a simulator which may have to simulate millions of events
  • In order to obtain statistics, it may then have to repeat the simulation thousands of times
  • Optimised code is generally the opposite of reusable code
  • It is optimised for its particular assumptions which cannot be violated

Premature Optimisation

  • The notion of optimising code before it is required
  • The downside is that code becomes less adaptable
  • Because the requirements on your optimised piece of code may change, you may have to throw away your specialised code and all its optimisations
  • Note: I do not mean the requirements of the project
    • In a realistic setting they may, but not here
    • It is the requirements of a particular portion of your code which may change

Premature Optimisation

  • Worse than throwing away your own optimisations, you may instead elect to work around your specialised and optimised section of code
  • Thus your premature optimisation has negatively affected other parts of your code

Timely Optimisation

  • So when is the correct time to optimise?
  • Refactoring is done in between development of new functionality
    • Recall this makes it easier to test that your refactoring has not changed the behaviour of your code
  • This is also a good time to do some optimisation
    • You should be in a good position to test that your optimisations have not negatively impacted correctness
    • This has the additional bonus that, since you are refactoring at a similar time, you should already be considering the adaptability and readability of your code

Timely Optimisation

  • The absolute best time to optimise code is when you discover that it is not running fast enough
  • Often this will come towards the end of the project
  • It should certainly be after you have something deployable
  • After you have developed and tested some major portion of functionality

A Plausible Strategy

  • Perform no optimisation until the end of the project once all functionality is complete and tested
  • This is a reasonable stance to take, however:
  • During development, you may find that your test suite takes days to run
  • Even one simple run to test the functionality you are currently developing may take minutes or hours
  • This can seriously hamper development, so it may be best to do some optimisation at that point

How to Optimise

  • The very first thing you need before you can possibly optimise code is a benchmark
  • This can be as simple as timing how long it takes to run your test suite (a minimal timing sketch follows below)
  • O(n²) solutions will beat O(n log n) solutions on sufficiently small inputs, so your benchmarks must not be too small
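A minimal timing sketch, assuming a hypothetical runAllTests() entry point for your own test suite; averaging a few runs also helps with the stochastic variation discussed below.

public class Benchmark {
    public static void main(String[] args) {
        final int runs = 5;
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            runAllTests();                        // hypothetical: your whole test suite
            total += System.nanoTime() - start;
        }
        System.out.println("Average run time: " + (total / runs) / 1_000_000 + " ms");
    }

    static void runAllTests() { /* ... */ }
}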

How to Optimise

  • Once you have a suitable benchmark then you can:
    1. Run your benchmark on a build from a clean repository recording the run time
    2. Perform what you think is an optimisation on your source code
    3. Re-run your benchmark and compare the run times
    4. If you have successfully improved the performance of your code, commit your changes; otherwise revert them
    5. Do one optimisation at a time

How to Optimise

  • “This can be as simple as timing how long it takes to run your test suite”
  • However, bear in mind that you are writing a stochastic simulator
    • This means each run is different and hence may take a significantly different time to run
    • Even if the code has not changed or has not changed in a way that significantly affects the run time
    • Simply running several inputs or the same input several times should be enough to reduce or nullify the effect of this

Interacting Optimisations

  • Word of caution: some optimisations may interact with each other, so you may wish to evaluate them independently as well as in conjunction
    • As always source code control can empower you to do this

High-level vs Low-level Optimisations

  • It is usually more productive to consider high-level optimisations
  • The compiler is often good at low-level optimisations
  • It is often better to call a method fewer times, than to optimise the code within a method

Profiling

  • Profiling is not the same as benchmarking
  • Benchmarking:
    • determines how quickly your program runs
    • is to performance what testing is to correctness
  • Profiling:
    • is used after benchmarking has determined that your program is running too slowly
    • is used to determine which parts of your program are causing it to run slowly
    • is to performance what debugging is to correctness
  • Without benchmarking you risk making your program worse
  • Without profiling you risk wasting effort optimising a part of your program which is either already fast or rarely executed

Documenting Optimisations

  • Source code comments are a good place to explain why the code is the way it is
  • Source code control commits are a good place to document why you performed the optimisations including benchmark/profiler results etc.

Summary

  • I have mostly talked about strategy rather than structure
    • Structure is difficult to give concrete advice about
  • Refactoring is the most important thing you can learn from this lecture:
    • Refactoring allows us to avoid doing a large amount of upfront design and also avoid producing a big hairy mess
    • Do not change functionality whilst refactoring
    • Your code should be adaptable
  • Do not start with main; write a test suite instead
  • Do not optimise blindly, benchmark and profile
  • There is not a thing on this page that your source code control will not make easier

Any Questions?

General Tips & Assessment

Computer Science Large Practical

A Small Story

Any Normal Person Would Do

My Message

Response

Reply to Response

Results

Exasperation

The Lesson

  • Aside from the obvious business lesson
  • This tells me that the developers of the website and app are not users
  • They have developed the website for one user story:
    • “I know which film I want to watch I want to book it now”
  • They have developed the app for a different user story:
    • “I might go to the cinema tonight, what's on?”

Assessment Criteria

  1. Implementation of requirements:
    1. Parsing
    2. Correct simulation & correct output
    3. Summary statistics of simulation results
    4. Experimentation implementation
    5. Parameter optimisation implementation
    6. Input Validation
  2. Use of source code control
  3. Documentation, including source comments
  4. Testing, including sample test input scripts
  5. Maintainable code
  6. Evidence of benchmark/profile-based optimisation
  7. Any additional features
  8. Early submission

Objective & Subjective Criteria

  • Some of the items on the above list are objective whilst some are subjective
  • Objective criteria are those which are testable
  • Subjective criteria are those which are, at least partially, based upon opinion
    • Whether or not a criterion is met is open to debate

Objective Assessment Criteria

  • The most objective assessment criterion is:
    • Early submission
  • Either you submit it before the early submission deadline or you do not
  • Though arguably this is not really an assessment criterion

Objective Assessment Criteria

The first list of implementation requirements is relatively objective:
  1. Parsing
  2. Correct simulation & correct output
  3. Summary statistics of simulation results
  4. Experimentation implementation
  5. Parameter optimisation implementation
  6. Input Validation

Objective Assessment

  • These will be marked almost entirely algorithmically
  • This means your application will be put through my own suite of test inputs
  • Some of these test inputs will be inputs you have seen, some will be new
  • Part of the exercise is for you to foresee possible inputs for which your application would fail
    • Either by crashing, or by producing incorrect output
  • There may be some non-algorithmic marking should your application fail any tests
    • In which case I have to figure out why your application is failing

Parsing

  • Your parser should be able to parse all syntactically valid input scripts
  • I cannot say it much simpler than that
  • There won't be any deliberately tricky tests

Correct Simulation & Output

  • Here I'm testing whether your simulator correctly follows the requirements
  • The simulator is tested via its output, so these are tested at the same time
  • Having said that, where the output is not correct, the code is inspected to determine why
  • Minor syntactic issues with the output will be judged leniently
    • This is part of the reason your code must compile on DiCE
    • It certainly won't hurt your grade to get it correct

Summary Statistics

  • Similarly this will test for correctly calculating and reporting the specified summary statistics
  • It is possible to get the simulation incorrect but the summary statistics correct
  • A small tip is to make sure your reported statistics are consistent with each other:
    • I will say more about this later
  • It might be that you are getting inconsistent results because your simulation is incorrect, in which case you should note this in your README

Experimentation Implementation

  • Whether or not you correctly implement the experimentation of rates and numbers
  • As before it is possible to get this correct, without getting either (or both) of the simulation and the summary statistics correct
  • As before, if you are getting inconsistent results you should at least note that in your README

Parameter Optimisation

  • Similarly parameter optimisation should be handled correctly
  • Similarly it is possible to implement this correctly with everything above implemented incorrectly
  • Again, similarly if you get inconsistent results you should at least have noted this in your README.

Input Validation

  • This is the first task which is not finely specified
  • Here you have to demonstrate some ingenuity to conjure up your own rules for what should and should not be valid input
  • You also have to decide which kinds of inputs result in warnings or errors; although this is specified, there is still some scope for interpretation
    • Specifically those in which the simulation could be started but may result in an error
    • This may depend upon the structure of your simulator

Noting Deficiencies

  • Use your README file to catalog any deficiencies which you are aware of
  • Or a more sophisticated form of bug database, but please not a public one
  • In general any implementation errors will be viewed significantly more leniently if they are known about
  • Known bugs are better than unknown bugs
  • Even better if you additionally avoid the output of erroneous results

How to Fail

  • Remember, it is generally worse to produce incorrect output than no output at all
  • This will generally require defensive programming (a small sketch follows below)
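A small sketch of the idea, with entirely hypothetical statistics: if an internal consistency check fails, report the problem rather than printing output you know to be wrong.

public class StatisticsReporter {
    // Hypothetical defensive check: refuse to print statistics we know are
    // inconsistent, rather than emitting incorrect output.
    static void reportStatistics(int totalMissed, int maxMissedOnAnyRoute) {
        if (maxMissedOnAnyRoute > totalMissed) {
            System.err.println("Internal error: inconsistent statistics, not reported");
            return;
        }
        System.out.println("Missed passengers in total: " + totalMissed);
        System.out.println("Worst single route missed: " + maxMissedOnAnyRoute);
    }

    public static void main(String[] args) {
        reportStatistics(10, 3);  // consistent: printed
        reportStatistics(2, 5);   // inconsistent: suppressed with an error message
    }
}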

Subjective Assessment

The remaining items are mostly judged subjectively
  • Use of source code control
  • Documentation, including source comments
  • Testing, including sample test input scripts
  • Maintainable code
  • Evidence of benchmark/profile-based optimisation
  • Any additional features

Source Code Control

  • I've spoken about this at length already
  • Keep your commits small
  • Write good commit messages
    • Don't be shy, I'm sure I will enjoy reading your commit messages
    • If you know you're committing something you shouldn't at least say so in the commit message

Source Code Control

  • One more thing about source code control
  • Should you commit commented out code?
  • Some people's reaction:
  • I'm a little more mellow, certainly no huge swathes of commented out code
  • But the occasional line of code used as part of a comment explanation is fine

Documentation

  • Mostly of your source code
  • You may develop additional features which, if you do not document, I may not even know about
  • This is not at all unrealistic

Testing

  • Last year, on another course, I asked students to write a simulator for a distributed network protocol
  • The intention of the coursework was not to produce a good simulator
  • It was to investigate the properties of the network protocol
  • Some students returned a simulator with exactly one test input
    • The one I had supplied as an example

Testing

  • For you, the practical is indeed to write a good simulator
    • You can at least strive for “half decent”
  • Either way, running one test input is woefully insufficient

Maintainable Code

  • Highly subjective
  • Remember, reusable code is more difficult to understand
  • But reusable code is easier to, well, reuse
  • Reused code is easier to maintain
  • What is a poor developer to do?
  • Try to imagine what you might wish to do in the future

Maintainable Code

Specific Example

  • How should I write my parser?
    • Simple string interrogation
    • Regular expressions
    • Handwritten parser using, for example, functions
    • Use a parser-generator in style of flex and bison
  • The simple question you have to ask yourself is this:
  • “What kind of updates to the parser am I (or someone else) likely to do in the future?”

Maintainable Code

  • Highly subjective
  • Trying to justify some of your choices is likely a good thing
  • Even if your reasoning is flawed, it demonstrates that you have thought about how to design/arrange your source code
  • And that you probably could have implemented it in another way, but specifically chose not to
  • A future maintainer at least knows why you made that choice, if they disagree, they can change the code without fear of some other reason they have not yet uncovered

Additional Features

  • This is your chance to stop being an automaton mindlessly implementing requested features
  • It perhaps requires some imagination, but imagine you were really going to use your simulator to investigate some real (or other) network
  • What would be useful to you?
  • The evidence of mindless developers writing to specifications is all around you, once you notice it.

README

  • Don't forget to provide me with a README
  • In general this can only help your grade:
    • It lets me know good things are deliberate and not fortunate
    • It lets me know that deficiencies are at least known about

Final Point

  • Students are often worried about losing marks
  • Indeed our own assessment descriptions often talk of losing marks
  • But let's not forget, you start with zero

Test Strategies

  • Two obvious test strategies:
    1. Test for expected output. A given input should give predictable output
      • This is slightly more tricky for a program that makes use of random numbers
      • You have a stochastic simulator which is not expected to generate the same set of events for identical input
    2. Test for expected properties:
      • This is often used in conjunction with generating random inputs for which you do not know the output
      • If you used quickcheck in Inf1-FP, you have experience with this form of testing

Random Numbers

  • Computers are great at computing deterministic results
  • Not quite so good at generating a sequence of random numbers
  • You are going to need a sequence of random numbers
  • Generally this is done using a Pseudo Random Number Generator
  • This is really hard
    • It must avoid short periods as well as a biased distribution

PRNG

  • John von Neumann suggested an approach in 1946:
    1. Start with some 4 digit number say 6843
    2. Square it, 46826649
    3. Take the middle 4 digits as the next random number
    4. It also serves as the seed for the next random number
  • A few problems: many seeds repeat, and 0000 repeats immediately (a sketch of the method is given below)
  • Much better ones exist today and thankfully you should not need to implement one yourself
  • Generally a PRNG relies on a seed number
  • Where Rn is the nth random number and Sn is the nth seed:
    • Rn, Sn = f(Sn-1)
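Here is a small sketch of the middle-square method described above, for illustration only; as noted, you should use a much better generator in practice.

public class MiddleSquare {
    private int seed;   // a 4-digit seed

    public MiddleSquare(int seed) { this.seed = seed; }

    public int next() {
        long squared = (long) seed * seed;               // e.g. 6843 * 6843 = 46826649
        String padded = String.format("%08d", squared);  // pad to 8 digits
        seed = Integer.parseInt(padded.substring(2, 6)); // take the middle 4 digits
        return seed;                                     // it also serves as the next seed
    }

    public static void main(String[] args) {
        MiddleSquare rng = new MiddleSquare(6843);
        System.out.println(rng.next()); // 8266
    }
}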

PRNG


class Random{
    constructor(int seed){
        self.random_seed = seed;
    }
    int random(){
        // Update the random seed
        self.random_seed = self.random_seed * 1103515245 + 12345;
        // Generate and return a new random number
        return (self.random_seed / 65536) % 32768;
    }
}

Halo

Replays

  • The replays are stored as one number, plus the sequence of key/button presses made by the user
  • When a replay is run, the sequence of events which take place as a result of those button presses is recomputed
  • Replays are not stored as videos
    • Unless you explicitly ask for that in order to share it
  • This is why you can change the camera position when viewing a replay

Replays - But Wait

  • Are there not some random elements to the sequence of events?
  • Where all players are human this varies from game to game
  • But AI players almost always incorporate some probabilistic decision making
  • So how does this work with the replay?

Replays - PRNG

  • Pseudo random number generators are not really generating a random sequence of numbers
  • Recall: Rn, Sn = f(Sn-1)
  • If you know S0 the rest of the sequence is entirely deterministic
  • Hence the one number stored with the sequence of input events is the initial random number generator seed
  • Halo uses only one seed for all “random” numbers generated

Recall Two Forms of Testing

  1. Test for expected output. A given input should give predictable output
    • Now you have a handle on this:
    • Allow yourself to specify the seed used for a simulation
    • Now your testing routine can specify a seed for which it knows what the output should be
    • This is at least regression testing (see the sketch after this list)
    • Of course your production main either chooses a seed randomly or does not specify one and allows your simulation routine to choose one
  2. Test for expected properties:
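A minimal sketch of the seeded regression idea from point 1, where runSimulation is a hypothetical stand-in for your own simulator:

import java.util.Random;

public class SeededRegressionTest {
    // Hypothetical stand-in for the simulator: in reality this would parse a test
    // input, run the simulation, and return the logged sequence of events.
    static String runSimulation(long seed) {
        Random generator = new Random(seed);
        StringBuilder events = new StringBuilder();
        for (int i = 0; i < 3; i++) {
            events.append("delay ").append(generator.nextDouble()).append("\n");
        }
        return events.toString();
    }

    public static void main(String[] args) {
        // With a fixed seed java.util.Random always yields the same sequence, so two
        // runs must produce identical event logs. A real regression test would compare
        // against a log recorded from an earlier trusted run.
        if (runSimulation(42L).equals(runSimulation(42L))) {
            System.out.println("Test passed: seeded runs are reproducible");
        } else {
            System.out.println("Test failed: seeded runs differ");
        }
    }
}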

Expected Properties

  • This kind of testing is used for stochastic programs a lot, including the random number generators themselves
  • It is up to you to come up with your own set of properties
  • But to start you off, consider:
    • Events should be ordered according to the simulated time at which they have occurred
    • A bus should never have more passengers than its capacity suggests
    • Your summarised statistics should be consistent:
      • eg. The number of missed passengers in total should be greater than or equal to the number of missed passengers for any single route

Expected Properties

  • “New passengers” is an event that is always enabled and occurs at a constant rate:
    • This means that the number of occurrences should be close to the simulation time divided by the mean delay
    • The mean delay is simply 1 over the rate
    • If the rate is 1.0 and the simulation time is 100, you should expect that it occurs approximately 100 times
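
Because the process is random the observed count will fluctuate around rate × time, so a property test needs a tolerance; the three-standard-deviation bound below is my own (fairly arbitrary) choice:

public class CountCheck {
    // For a constant-rate (Poisson-like) event the count over a run should be
    // close to rate * time; sqrt(expected) approximates one standard deviation.
    static boolean plausibleCount(long observed, double rate, double time) {
        double expected = rate * time;              // e.g. 1.0 * 100 = 100
        double tolerance = 3 * Math.sqrt(expected); // e.g. 30
        return Math.abs(observed - expected) <= tolerance;
    }

    public static void main(String[] args) {
        System.out.println(plausibleCount(93, 1.0, 100.0));    // true
        System.out.println(plausibleCount(150, 1.0, 100.0));   // false
    }
}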

Testing Floating Point Numbers


GHCi, version 7.4.2: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> 1.0 - 0.95

Testing Floating Point Numbers


GHCi, version 7.4.2: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> 1.0 - 0.95
5.0000000000000044e-2
Prelude> 

Testing Floating Point Numbers


public class Floating {

    public static void main(String[] args) {
        System.out.println(1.0 - 0.95);
    }
}

mimi:tmp$ javac Floating.java
mimi:tmp$ java Floating
0.050000000000000044

Testing Floating Point Numbers


public class Floating {

    public static void main(String[] args) {
        System.out.println(1.0 - 0.95);
        System.out.println(0.05);
    }
}

mimi:tmp$ javac Floating.java
mimi:tmp$ java Floating
0.050000000000000044
0.05
Hence testing the result of some computation against a floating-point literal won't necessarily work as expected

Testing Floating Point Numbers

  • In your tests you might not be able to use equality to test for the floating point numbers that you expect
  • 
    void test_some_bit_of_code(...){
      // might not work
      assert_equal(2.0, some_bit_of_code(...));
    }
    
  • Instead you may be forced to test for approximate equality
  • 
    bool approximately_equal_simple(x, y){
        return absolute(x - y) < 1.0E-8;
    }
    

Testing Floating Point Numbers

A fancier version


// atol is the absolute tolerance
// rtol is the relative tolerance
bool approximately_equal(x, y, rtol, atol){
    return absolute(x - y) <= (atol + rtol * absolute(y));
}
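
Applied to the earlier 1.0 - 0.95 example, a Java rendering of the same check (a sketch, mirroring the pseudo-code above) behaves as you would hope:

public class Approx {
    static boolean approximatelyEqual(double x, double y, double rtol, double atol) {
        return Math.abs(x - y) <= atol + rtol * Math.abs(y);
    }

    public static void main(String[] args) {
        System.out.println(1.0 - 0.95 == 0.05);                                // false
        System.out.println(approximatelyEqual(1.0 - 0.95, 0.05, 1e-9, 1e-12)); // true
    }
}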

Review of Part One

Computer Science Large Practical

Submissions

13 out of 22 students submitted at least something:
  • 2 out of 13 zipped up their submission directory
    • No need to do this, just submit the directory
    • You will save me the effort of unzipping it
  • 1 out of 13 submitted for part 2

Submissions

13 out of 22 students submitted at least something:
Language    Number of Submissions
Python      7
Java        4
C#          1
Perl        1

One Brave Soul Chose Perl

  • I did warn you not to choose a weakly typed language
  • However, I also said I would not judge your choice of language
  • As a bit of extra help, I'll recount the tale of the Nancy bug

The Nancy Bug

  • Once upon a time, there was an expert system developed in Perl
  • An expert system is an artificial-intelligence program designed to mimic human decision making
  • This means it is quite tricky to debug, since you expect it to be wrong sometimes
  • One component of this system relied on checking whether two names were equal
  • It was discovered that the system was frequently claiming that two names were equal when they were not, giving many false positives

The Nancy Bug

  • But there were also the occasional false negative
  • As you might have guessed “Nancy” was a false negative
John       George      Apparently Equal
Paul       Ringo       Apparently Equal
Emma       Geri        Apparently Equal
Melanie    Victoria    Apparently Equal
Nancy      Nancy       Apparently Not Equal

The Nancy Bug

  • String equality in Perl is written as s1 eq s2
  • Unfortunately the expert system had accidentally used numeric equality, s1 == s2
  • Given two strings, Perl converts them to numbers
  • Since most names do not look like a number, each name generally got mapped to 0
    • Hence all the false positives
    • "John" == "George" is the same as 0 == 0
  • So how come "Nancy" != "Nancy"?

The Nancy Bug

  • Perl tries to parse as much of the string as a number as possible
  • "Nan" parses as not-a-number (NaN)
  • Most people (and the IEEE floating-point standard) agree that not-a-number does not equal itself
  • Hence "Nancy" == "Nancy" becomes nan == nan
    • Which is correctly False
    • Hence the false negative "Nancy" != "Nancy"

The READMEs

  • Ranged from very basic containing a couple of lines to very detailed.

The READMEs - Language Choice

  • 0 out of 13 explained the choice of language
    • 10 out of 13 specified the choice of language with no explanation
    • 3 out of 13 implied the choice of language
    • The source files are right there
    • This is a bit of an artificial requirement, but it might well lead me to view some of your design decisions differently (and hence increase your mark)
    • I think it is something of a good exercise though
  • “I picked Python as my programming language because it's nice.”

Random Goodness

  • “Please do not change the test files because that would cause the unit tests to fail”
  • “To compile this ....”
  • “To run this ... ”
  • “Such and such does not yet work ...”

Random Fussiness

  • One person managed to note down their matriculation number incorrectly
    • Don't worry, someone else has a directory named “scr”
  • In general don't fuss too much about spelling and grammar; I'm pretty immune to such “errors”

Random Not-So-Goodness

  • Low-level coding decisions:
    • Are apt to change and you will forget to update the README
    • Should be in comments in the source code file concerned
    • “Most of the logic in the simulation is in the objects (Stop, Bus, Passenger) themselves”
  • High-level structure that is less likely to change is absolutely fine
    • You are more likely to remember to change the README in the event of a major re-structuring
  • One person had both a README.md and a Readme.txt which were identical

Random Comments

  • “this might or might not be the worst-looking code you have ever seen”
    • Unlikely. I've seen some pretty hellish code

Git Status

  • 6 out of 13 have some unstaged modifications
  • 
    # On branch master
    # Changed but not updated:
    #   (use "git add/rm file..." to update what will be committed)
    #   (use "git checkout -- file..." to discard changes in working directory)
    #	modified:   src/Event.java
    #	modified:   src/Simulator.java
            
  • 9 out of 13 had some untracked files
  • 
    # On branch master
    # Untracked files:
    #   (use "git add file..." to include in what will be committed)
    #
    #	simulator/input.txt
    nothing added to commit but untracked files present (use "git add" to track)
    
  • 3 out of 13 were completely up to date:
    • Including our Perl user

Untracked Files

  • This did not necessarily correspond to a release or a good stopping place
  • Git helpfully outputs all untracked files, ask yourself:
    • Should they be tracked? You can always remove them later
    • Can they simply be deleted?
    • Can they be put in the .gitignore?
      • Good for editor save files, compiled files etc.
    • 
      # Editor save files
      *~
      *.swp
      *.swo
      # Compiled source #
      ###################
      *.class
      *.pyc
      *.exe
      

Tip

Copy this into your .bashrc (or .brc on DiCE):

#
# Colors
#
RED="\[\033[0;31m\]"
YELLOW="\[\033[0;33m\]"
GREEN="\[\033[0;32m\]"
NORMAL="\[\033[0m\]"

#
# Prompt Setup
#
function parse_git_in_rebase {
  [[ -d .git/rebase-apply ]] && echo " REBASING"
}

function parse_git_dirty {
  [[ $(git status 2> /dev/null | tail -n1) != "nothing to commit (working directory clean)" ]] && echo "*"
}

function parse_git_branch {
  branch=$(git branch 2> /dev/null | grep "*" | sed -e s/^..//g)
  if [[ -z ${branch} ]]; then
    return
  fi
  echo " ("${branch}$(parse_git_dirty)$(parse_git_in_rebase)")"
}

export PS1="$RED\u@\h:$GREEN\W$YELLOW\$(parse_git_branch)$NORMAL\$ " # Add git info to the prompt

Generated Files

  • At least one person had generated files in their repository
  • 
    $ git ls-files
    README.md
    bin/Bus.class
    bin/Event.class
    ... 
            
  • Every time you recompile you will get:
  • 
    # On branch master
    # Changed but not updated:
    #   (use "git add/rm file..." to update what will be committed)
    #   (use "git checkout -- file..." to discard changes in working directory)
    #
    #	modified:   bin/Event.class
    #	modified:   bin/Simulator.class
            

Generated Files

The unanimous opinion is that you should not store generated files in your repository.
Place them in your .gitignore file instead:

# Compiled source #
###################
*.class

Git Commits

[chart omitted]

Git Lines

[chart omitted]

Refactoring

  • 4 out of 13 logs contained any mention of refactoring
  • 2 of those 4 were with reference to future refactoring:
    • Either promising to refactor later or
    • Explaining some code saying that it will make future refactoring easier
  • It's early days yet, but still, refactoring is something you should be trying to do constantly

Command-line Applications

  • NO: “To run this open up Eclipse and ...”
  • Your program must be scriptable, so that I can run an external test suite over it
  • In the real world, many apps are command-line apps:
    • Even obviously GUI apps that run on your smartphone often communicate with some server
    • You can run them remotely, for example on a web server
    • You can run them on large computing clusters
    • You can script them to add multiple-run functionality

Command-line Applications

  • NO: “To run this open up Eclipse and ...”
  • Here (briefly) is how to do this in Eclipse:
    1. Right click your project and select “Export”
    2. Select “JAR file”
    3. Select which packages to export (likely only one)
    4. Run it from the command line:

      java -jar myprogram.jar args
  • More detailed instructions available here

Command-line Applications

  • NO: “To run this open up Eclipse and ...”
  • Alternatively try this:
  • 
    $ javac *.java
    $ java MyMainClass args
    

What's Wrong?


string num = "th";
int day = Convert.ToInt16(DateTime.Now.ToString("dd"));
switch(day)
{
    case 1:
        num = "st"; break;
    case 21:
        num = "st"; break;
    case 31:
        num = "st"; break;
    case 2:
        num = "nd"; break;
    case 22:
        num = "nd"; break;
    case 3:
        num = "rd"; break;
    case 33:
        num = "rd"; break;
    default:
        num = "th"; break;
}

What's Wrong?


string num = "th";
int day = Convert.ToInt16(DateTime.Now.ToString("dd"));
switch(day)
{
    case 1:
    case 21:
    case 31:
        num = "st"; break;
    case 2:
    case 22:
        num = "nd"; break;
    case 3:
    case 33:
        num = "rd"; break;
    default:
        num = "th"; break;
}

At Least

  • Could stack the cases rather than copy-pasting code
  • Matching 33 is arguably harmless (no month has a 33rd day), but it appears to have come at the expense of 23, which falls through to the default and gets “th”
  • The date is converted to a string, then converted back to a number, and presumably converted back to a string again to attach it to num (DateTime.Now.Day would give the day directly)

Worst Error

  • Of course the whole code is entirely unnecessary as there surely exists a library function to do the job for you
  • The easiest code to write is that which you do not have to write yourself
  • It's also easier to maintain
  • It's also likely to have fewer bugs
  • It also doesn't clutter up your source code repository

A More Subtle Error


string num = "th";
int day = Convert.ToInt16(DateTime.Now.ToString("dd"));
switch(day)
{
    case 1:
    case 21:
    case 31:
        num = "st"; break;
    case 2:
    case 22:
        num = "nd"; break;
    case 3:
    case 33:
        num = "rd"; break;
    default:
        num = "th"; break;
}

A More Subtle Error

  • Initial setting of num = "th"
  • In this case this is needless because of the default clause
  • This is common; what does it guard against?
  • The logic usually suggests that it guards against not setting num within the switch and hence getting an uninitialised variable error
  • When could that happen? When you have a bug!
  • Again, better to return no value than an incorrect one
  • So should you really program defensively against an uninitialised variable here?
    • Which would you rather see:
      • “Monday the 23” or
      • “Monday the 23th”

Multiple Files

  • Quick Pop Quiz: Should you spread your implementation across multiple source code files?

Multiple Source Files

Common reasons given:
  • It increases code reusability
  • It reduces compile time
  • Encapsulation (remember, when someone says this, they are most likely bluffing)
  • It makes code easier to find

Going File Crazy

  • I'm not saying you should not, but do so for a good reason

Many Files


147 Bus.java
74  Event.java
132 Main.java
102 Network.java
46  Passenger.java
26  Road.java
28  Route.java
555 total

Many Files


public class Road {
	private int firstStop;   // Initial stop of the road
	private int endStop;     // Ending stop of the road
	private float rate;      // Rate of a bus traversing the road
	
	// Basic constructor
	public Road (int fs, int es, float r) {
		this.firstStop = fs;
		this.endStop = es;
		this.rate = r;
	}
	
	// Getter Functions
	public int firstStop() {
		return this.firstStop;
	}
	
	public int endStop() {
		return this.endStop;
	}
	
	public float rate() {
		return this.rate;
	}
}

Many Files

Not just the Java developers:

34 bus.py
19 event.py
58 passenger.py
8 road.py
25 route.py
113 simulation_execution.py
101 simulation_instance.py
114 simulator_io.py
12 simulator.py
27 stop.py
12 unit_tests_main.py
314 unit_tests.py
837 total

Many Files


class Road:
    def __init__(self, first_stop, second_stop, rate):
        assert (first_stop.routes & second_stop.routes) != set([]), \
            "road must be between two stops who are adjacent on at least one route"
        assert rate > 0, "rates of roads must be positive"
        self.first_stop = first_stop
        self.second_stop = second_stop
        self.rate = rate

Some nice defensive programming going on here. Not sure it requires a whole separate file.

Personally

  • I try to use as few files as possible
  • New classes are simply written where required
  • I only move them out to their own file when that seems necessary

Getter Functions

What is the purpose of a getter function?

	// Getter Functions
	public int firstStop() {
		return this.firstStop;
	}
	
	public int endStop() {
		return this.endStop;
	}
	
	public float rate() {
		return this.rate;
	}

Getter Functions

  • Generally to avoid making the field in question public
  • Why? So that later if you wish to make this a computed value you can
  • Is it likely the “firstStop” will ever become a computed value?
  • If it does, you can replace the field with a method and the static type checker will show you all the references you have to change
  • In fairness, the getter means that the consumer cannot update the value:
    • Since you do not wish to update it privately either, you could just make it an immutable field
    • The fact that you cannot mark something as only privately mutable is something of a flaw in the language
    • You could be developing an API
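
For instance, a sketch of the immutable-field alternative, reworking the Road class shown earlier (illustrative only):

public class Road {
    // final fields cannot be reassigned by the consumer, nor by Road itself,
    // once the constructor has run - no getters needed.
    public final int firstStop;
    public final int endStop;
    public final float rate;

    public Road(int firstStop, int endStop, float rate) {
        this.firstStop = firstStop;
        this.endStop = endStop;
        this.rate = rate;
    }
}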

Object Sequences

  • A few people have done something like this:
  • 
    ...
        some_object = SomeClass(...);
    
        some_object.do_something(..);
        some_object.do_next_thing(..);
        some_object.do_the_final_thing(..);
    ...
        
  • You should at least consider having a simple method in SomeClass which does all three of these things
  • It is of course not universally true but when you see an uninterrupted sequence of calls on an object, it makes sense to consider whether the calling code is highly coupled with the called code
  • In other words, what would happen if you missed out one of the calls?

Object Sequences

  • More concretely, a few people did this:
  • 
    ...
        simulation = Simulator(...);
    
        simulation.set_up(..);
        simulation.run_simulation(..);
        simulation.conclude(..);
    ...
        
  • Even if you may wish to run these operations in a different order, you could still package up this functionality
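
A minimal sketch of packaging those calls up (the method names simply mirror the hypothetical ones above):

public class Simulator {
    void setUp()         { /* ... */ }
    void runSimulation() { /* ... */ }
    void conclude()      { /* ... */ }

    // One entry point, so callers cannot miss a step or reorder the sequence.
    void simulate() {
        setUp();
        runSimulation();
        conclude();
    }
}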

Premature Optimisation

  • A few did something like the following:
    • I see that there are many events which are all currently possible
    • When I select one, and update the state accordingly, that often does not affect the others
    • Therefore I should remember the list of possible events and only remove those that the current event makes impossible
  • This is not a bad idea
  • It does seem a little premature when the rest of your simulation is not yet working
  • Tip: What happens when a bus with capacity N boards its Nth customer?

Simulator and State

  • What is a Simulator?
  • What is the state of the simulator?
  • Are they one and the same thing?
  • Many of you, even though you have not gotten as far as implementing experimentation, seem to be worried by this
  • Correctly so, it could prove tricky
  • The simulation algorithm must operate over the state of the simulation
  • How can you be sure that the ending state of one experiment does not affect the starting state of the next experiment run?
  • Do you require multiple simulators to run multiple simulations?

Things Which Should Not Happen

Here is a bit of code from one student's submission

private int getStopIndexByID(int id) {
    for .... {
        if ...{
            return correctId;
        }
    }
    return -1; // if all the other code is correct 
               // that should never happen
}
Here's the calling code:

// update stop buses here
stops[getStopIndexByID(b.currentStop)].addBus(b);
Note that the error code is not checked for

What's Wrong with it?


private int getStopIndexByID(int id) {
    for .... {
        if ...{
            return correctId;
        }
    }
    return -1; // if all the other code is correct 
               // that should never happen
}
  • Remember the golden rule:
    • “Better to return no answer than an incorrect answer”

Two Ways to Fix This

  1. Return some type which forces you to check if there was an error
    • This is surprisingly tricky in most object-oriented languages
    • Functional languages handle this well, with Option or Maybe types
    • 
      match getCorrectId(name) with
        | None -> print_error(str(name) + " not ...")
        | Some v -> do_something(v)
                  
    • I cannot do_something(v) without pattern matching against the Option type
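
More recent Java (version 8 onwards) offers something similar with Optional, although it is easier to ignore than a pattern match; a sketch, where findStopIndexByID is an assumed variant of the student's method:

import java.util.Optional;

public class OptionalExample {
    // Returns Optional.empty() instead of -1 when the id is unknown.
    static Optional<Integer> findStopIndexByID(int[] stopIds, int id) {
        for (int i = 0; i < stopIds.length; i++) {
            if (stopIds[i] == id) {
                return Optional.of(i);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        int[] stopIds = {10, 20, 30};
        Optional<Integer> index = findStopIndexByID(stopIds, 20);
        // The caller has to unwrap the Optional before using the value.
        if (index.isPresent()) {
            System.out.println("Found at index " + index.get());
        } else {
            System.out.println("Id not found");
        }
    }
}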

Two Ways to Fix This

  1. Return some type which forces you to check if there was an error
    • Null does not quite fill this need
    • Because it does not force you to check that the value returned is not Null
    • Returning Null as an error value is generally the wrong thing to do
    • At worst the Null might be later interpreted incorrectly as meaning something, e.g. an empty list
    • At best you will ultimately raise a NullPointerException
    • You can avoid the uncertainty by:

Two Ways to Fix This

  2. Throw an exception:
    private int getStopIndexByID(int id) {
        //blah blah might return correctId
        throw new IdNotFound(..); // Should never ...
    }
    

The Original Case

If the id is not found, getStopIndexByID returns -1 and the array access raises an ArrayIndexOutOfBoundsException.

// update stop buses here
stops[getStopIndexByID(b.currentStop)].addBus(b);
  • So you end up raising an exception anyway
  • But here you get an ArrayIndexOutOfBounds error report rather than an IdNotFound error report
  • Why make things difficult for yourself?
  • Additionally, if this calling code changes, we may end up simply giving an incorrect answer

Exceptions

  • Exceptions are both loved and loathed
  • Part of the reason for the loathing is the “non-obvious control paths” which can result
  • Try to use exceptions for things which you really do not believe can happen under any normal execution conditions
  • Essentially, they are for things that you would like the type system to ensure can never happen, but for which the type system is not sophisticated enough
  • Rather less appropriate for errors made by the user, for example errors in the input

Exceptions and Validation

  • Given this definition what should you do if you discover you have incomplete information during a simulation run?
  • For example, you attempt to retrieve the rate associated with a road and find that it is unavailable
  • This is not exceptional, because the user may have simply forgotten to specify the rate for that particular road
  • However, if you validate the input before running the simulation, then it really is exceptional to find a missing rate during the simulation
  • Because the simulation should not have been started since the validation should have uncovered the error
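
One way to arrange this (the road/rate representation below is assumed, not taken from the handout): report missing rates politely during validation, and treat a gap found later, mid-simulation, as a genuine bug:

import java.util.HashMap;
import java.util.Map;

public class Validation {
    private final Map<String, Double> roadRates = new HashMap<>();

    // During validation: a missing rate is a user error, reported politely.
    boolean validate(String[] roads) {
        boolean ok = true;
        for (String road : roads) {
            if (!roadRates.containsKey(road)) {
                System.err.println("Input error: no rate given for road " + road);
                ok = false;
            }
        }
        return ok;
    }

    // During simulation: validation has already run, so a missing rate
    // can only mean a bug in our own code - hence an exception.
    double rateFor(String road) {
        Double rate = roadRates.get(road);
        if (rate == null) {
            throw new IllegalStateException("No rate for " + road
                    + " after validation - this should never happen");
        }
        return rate;
    }
}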

YAGNI

  • A final piece of advice
  • Try to keep things simple: Do the simplest thing that could work
    • Then rethink/refactor if it does not work
  • YAGNI: You Aren't Gonna Need It
    • Try not to over-complicate things by over-anticipating future requirements

Any Questions?

Going Class Crazy

  • A really good video on why writing classes can be harmful
  • Main piece of advice; “If you see a class with one method consider re-writing it as a function”
  • A good rebuttal

Going Class Crazy

  • Partly class craziness may be attributed to the popularity of Java
  • Which does not have first class functions
    • Just a way of saying functions can be passed around as normal values
    • You can fake this in Java by creating a simple class containing that function
    • So to pass a function in as a parameter in Java, one must:
      • Create a class
      • Create an instance of that class
    • This has caused many to believe that creating classes is the way to go for all manner of things
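
For example, to pass a comparison function to a sort in (pre-lambda) Java you wrap it in a one-method class, typically anonymously:

import java.util.Arrays;
import java.util.Comparator;

public class FakedFunction {
    public static void main(String[] args) {
        Integer[] numbers = {5, 1, 4, 2, 3};
        // A Comparator is really just one function, but Java makes you wrap
        // it in a class and create an instance of that class.
        Arrays.sort(numbers, new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return b - a;   // descending order
            }
        });
        System.out.println(Arrays.toString(numbers));   // [5, 4, 3, 2, 1]
    }
}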

Going Class Crazy

  • Here are a couple of good questions to ask yourself:
    • What is my class?
    • What is an instance of my class?
    • If the answer is the same, then at best, you're really just bunching some functions together

Going Class Crazy

  • Do not forget, an object is a gathering of state together with behaviour/operations over that state
  • If your object lacks either state or behaviour, then it is not an object
  • Tip: design your object first and then write your class to produce such objects