Computer Science Large Practical

Introduction

Paul Patras

Housekeeping

Website: http://www.inf.ed.ac.uk/teaching/courses/cslp/
One lecture per week

When: Fridays, 12:10–13:00
Where: David Hume Tower, George Square - Map Room LG.06

Please ask questions at any time
Coursework accounts for 100% of your mark
Office hours: flexible, but email me first (paul.patras@ed.ac.uk)

Restrictions (I)

CSLP is a third-year undergraduate course only available to third-year undergraduate students.
CSLP is not available to visiting undergraduate students, or to fourth-year undergraduate students and MSc students, who have their own individual projects.

Restrictions (II)

Third-year undergraduate students should choose at most one large practical, as allowed by their degree regulations.
- Computer Science, Software Engineering and Artificial Intelligence large practicals
- On most degrees a large practical is compulsory.
- On some degrees (typically combined Hons) you can do the System Design Project instead, or additionally.
See Degree Programme Tables (DPT) in the Degree Regulations and Programmes of Study (DRPS) for clarifications.

About this course

So far most of your practicals have been small exercises
This practical is larger and less rigidly defined than previous coursework
The CSLP tries to prepare you for
- The System Design Project (in the second semester)
- The Individual Project (in fourth year).

Requirements

There is:
- a set of requirements (rather than a specification);
- a design element to the course; and
- more scope for creativity.
The requirements are more realistic than most coursework
But still a little contrived in order to allow for grading

How much time should I spend?

100 hours, all in Semester 1, of which
8 hours lecture/demonstrating
92 hours practical work, of which

70 hours non-timetabled assessed assignments
22 hours private study/reading/other

How much time is that really?

13 weeks remaining in semester 1 (Weeks 2 to 14)
7 * 13 = 91 hours
You can think of it as 7 hours/week in the first semester
This could be one hour a day including weekends
You could work 7 hours in a single day
- for example work 9:00-17:00 with an hour for lunch

Managing your time

It is unlikely that you will want to arrange your work on your large practical as one day where you do nothing else, but one day per week all semester is the amount of work that you should do for the course.

Course lecturers have been asked not to let deadlines overlap Weeks 11-14 because students are expected to be concentrating on their large practical in that time.

Deadlines

The Computer Science Large Practical is split in two parts:

Part 1
- Deadline: Thursday 23rd October, 2014 at 16:00
- Part 1 is zero-weighted: it is just for feedback.
Part 2
- Deadline: Thursday 18th December, 2014 at 16:00
- Part 2 is worth 100% of the marks.

Scheduling work

It is not necessary to keep working on the project right up to the deadline.
For example, if you are travelling home for Christmas you might wish to submit the project early.
In this case ensure that you start the project early.
The coursework submission is electronic so it is possible to submit remotely.
- But you must make sure that your submission works as expected on DiCE
- This might be easier to do locally
- But see working remotely and remote graphical login

Early submission credit

To motivate good project management, planning, and efficient software development, marks above 90% are reserved for work that is submitted early (specifically, one week before the deadline for Part 2).
Work submitted less than a week before the deadline does not qualify as an early submission, and the mark for this work will be capped at 90%. Thus, the mark may be 90%, but it may not be higher than this.
Regardless of when it is submitted, every submission is assessed in exactly the same way, but submissions which attract a mark of above 90% and were not submitted early have this mark brought down to 90%.

Early submission credit

Question: Can I submit both an early submission version and a version for the end deadline and have the marks for whichever is highest?
Answer: No. Before the early submission deadline you have to choose whether to submit then and aim for a mark above 90%, or to take an extra week to accumulate more marks up to 90%. The submission marked will be the latest one made before the deadline.

Extensions

Do not ask me for an extension as I cannot grant them
The correct place is the ITO who will pass this on to the year organiser (Vijay Nagrajan)
See the policy on late coursework submission first

The Computer Science Large Practical

The CSLP Requirement

Create a command-line application in C
The purpose of the application is to implement a stochastic, discrete-event, discrete time simulator
- I'll come back to these terms
This will simulate the bin collection process in a “smart” city, with bin locations, capacities, etc. specified by input

The CSLP Requirement (Cont'd)

The output will be the sequence of events that have been simulated as well as some summary statistics
Input and output formats, and several other requirements are specified in the coursework handout
It is your responsibility to read the requirements carefully

Why Simulators?

Stochastic simulation is an important tool in physics, medicine, computer networking, logistics, and many other fields.
Particularly useful to understand complicated processes.
Can save time, money, effort and even lives.
Allow running inexpensive experiments of exceptional circumstances that might otherwise be infeasible.
However, the simulator must have an appropriate model for the real system under investigation, to produce meaningful results.

Example: preventing Internet outages

Source: Internet Census, world map of 24-hour relative average utilization of IPv4 addresses.

Last month CBC news reported that in the U.S. Verizon dumped 15,000 Internet destinations for ~10 minutes.

Preventing Internet outages

Global Internet routing table has passed 512K routes
Older routers have limited size routing tables; when these fill up, new routes are discarded
Large portions of the Internet become unreachable, thus online businesses are losing money
Upgrading equipment is expensive and takes time; workarounds are being proposed
Ensuring the proposed solutions will work is not trivial

Preventing Internet Outages

Testing patches in live networks poses the risk of further disruption
Waiting for the next surge is not acceptable
Forwarding all traffic for new routes through a default interface can have serious implications on routing costs
With simulation it is possible to generate synthetic traffic and test patches without disrupting the network
It is also possible to evaluate different metrics such as round-trip delays, throughput, and the propagation latency of routing changes

Why C?

Part of the challenge of this practical is to learn a new programming language
This is something you should expect when taking a job as a software developer in a company that has clear incentives to use a particular language.
C is efficient (low execution time), portable, excellent for working directly with the hardware, and also usable for web programming

Why C?

Currently ranked the most popular programming language --TIOBE Index for September 2014

Code Sharing

Code sharing sites are a great resource but please refrain from using them for this practical
This is an individual practical so code sharing is not allowed, even if you are not the one benefiting
It is somewhat likely that in the future you will be unable to publicly share the code you produce for your employer

Why Simulate Bin Collection?

Waste management is a major operation in many cities
Part of ongoing "smart cities" initiatives, bins are being equipped with occupancy sensors to improve scheduling and route planning for lorries
Current practice relies on periodic collection strategies, which have limitations:
- Lorries make unnecessary frequent trips and sometimes take lengthy routes → increased operation cost and pollution
- User daily demand varies and could cause overflows before scheduled pick-up → increased health hazards and cleaning costs

Why Simulate Bin Collection?

With simulation we can investigate the impact of different pick-up intervals and bin occupancy thresholds used to trigger scheduling.
In this practical we will evaluate waste collection efficiency in terms of volume collected per unit of travel time, percentage of overflows, etc.
- small thresholds → longer trips, but cleaner streets
- large thresholds → cost efficient, but risk of overflows

Your Simulator

Your simulator will be a command-line application
It will accept an input text file with the description of the serviced areas and
a set of global parameters: lorry capacity, service time, bin capacity, disposal rate, disposal volume
It should output information about occurring events
The strict formats for both input and output are described in the coursework handout
You will also need to produce summary statistics that you will later analyse

Simulation Algorithm

The underlying simulation algorithm is fairly simple:

WHILE {time ≤ max time}
   determine the set of events that may occur after the current state
   delay ← choose a delay based on the nearest event
   time ← time + delay
   modify the state of the system based on the current event
ENDWHILE
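
The loop above can be sketched in C roughly as follows. This is only an illustration: the Event struct, its fields, and the function names nearest_event and run_simulation are assumptions made here, not part of the coursework handout. A real simulator would also update the system state and schedule follow-up events at the point marked by the comment.

```c
#include <stddef.h>

/* Hypothetical event record; names and fields are illustrative only. */
typedef struct {
    double time;   /* absolute time at which the event fires */
    int    kind;   /* assumed code, e.g. 0 = bag disposed, 1 = lorry leaves */
} Event;

/* Return the index of the pending event with the smallest time, or -1. */
int nearest_event(const Event *pending, size_t count)
{
    int best = -1;
    for (size_t i = 0; i < count; i++)
        if (best < 0 || pending[i].time < pending[best].time)
            best = (int) i;
    return best;
}

/* Advance the clock from event to event until max_time is reached.
   Returns the number of events fired. A fired event is removed by
   overwriting it with the last element of the array. */
size_t run_simulation(Event *pending, size_t count, double max_time)
{
    double now = 0.0;
    size_t fired = 0;

    while (count > 0) {
        int i = nearest_event(pending, count);
        double delay = pending[i].time - now;
        if (now + delay > max_time)
            break;
        now += delay;
        /* a real simulator would modify the state here and possibly
           add new pending events */
        pending[i] = pending[--count];
        fired++;
    }
    return fired;
}
```

With pending events at times 5, 1, 12 and 3 and a maximum time of 10, this sketch fires the events at times 1, 3 and 5 and then stops.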

Simulation Algorithm

WHILE {time ≤ max time}
    ...
    delay ← choose a delay based on the nearest event
    ...
ENDWHILE

Some events are deterministic, some occur with exponentially distributed delays
I'll explain this in more detail, but for now drawing from an exponential distribution can be done by:


  −(mean) ∗ log(random(0.0, 1.0))

Where mean is the average delay, which is the reciprocal of the rate

Components of the Simulation

Input

Global parameters:
1. Lorry capacity
2. Service time
3. Bin capacity
4. Disposal rate
5. Disposal volume
6. Number of areas

Components of the Simulation

Input

Area description and dynamic parameters:
1. Collection frequency
2. Bin occupancy threshold
3. Number of bins
4. Matrix representation of bin map

Components of the Simulation

Lorries

Each area is serviced by a single lorry
Lorries are scheduled at fixed time intervals (once, twice, or n times per day). This is expressed as a number of trips per hour
Lorries have a fixed capacity, expressed in cubic metres

Components of the Simulation

Bins

Community bins have a fixed capacity expressed in m³
Bins have occupancy sensors, and in each area an occupancy threshold (expressed as a fraction) is used to trigger collection
There is a fixed service time (expressed in minutes) required to empty a bin, irrespective of its occupancy

Components of the Simulation

Users

We assume users dispose of rubbish bags of fixed volume, expressed in m³
Bags are disposed at exponentially distributed intervals
The mean disposal rate is expressed as a number of bags per hour

Components of the Simulation

Area map

For each area, we consider a graph representation of the bins' locations and the distances between them.
The graph corresponding to each area is given as an input in matrix form
The (0,0) element represents the waste processing facility and it is both the start and end point of a service route, i.e. we consider routes to be circular
The distances between any two locations are expressed in minutes

Example

Matrix representation

0     8     65535 65535 7     65535
8     0     5     65535 65535 2
65535 5     0     4     65535 6
65535 65535 4     0     2     65535
7     65535 65535 2     0     3
65535 2     6     65535 3     0
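
One possible in-memory representation of this matrix in C is sketched below. The names NO_EDGE, N_LOC, example_map, is_symmetric and count_edges are assumptions chosen for illustration; the actual input format is defined in the coursework handout.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_EDGE 65535u   /* value used in the matrix for "no direct road" */
#define N_LOC   6        /* depot (index 0) plus five bins */

/* The example matrix: entry [i][j] is the travel time in minutes. */
static const uint16_t example_map[N_LOC][N_LOC] = {
    {     0,     8, 65535, 65535,     7, 65535 },
    {     8,     0,     5, 65535, 65535,     2 },
    { 65535,     5,     0,     4, 65535,     6 },
    { 65535, 65535,     4,     0,     2, 65535 },
    {     7, 65535, 65535,     2,     0,     3 },
    { 65535,     2,     6, 65535,     3,     0 },
};

/* An undirected graph must have m[i][j] == m[j][i] for all i, j. */
bool is_symmetric(const uint16_t m[N_LOC][N_LOC])
{
    for (int i = 0; i < N_LOC; i++)
        for (int j = i + 1; j < N_LOC; j++)
            if (m[i][j] != m[j][i])
                return false;
    return true;
}

/* Count undirected edges by scanning the upper triangle only. */
int count_edges(const uint16_t m[N_LOC][N_LOC])
{
    int edges = 0;
    for (int i = 0; i < N_LOC; i++)
        for (int j = i + 1; j < N_LOC; j++)
            if (m[i][j] != NO_EDGE)
                edges++;
    return edges;
}
```

For the matrix above, the symmetry check succeeds and the graph has 8 edges.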

Components of the Simulation

Events

Your simulator will produce a sequence of events
- A bag may be disposed at a bin
- A lorry may leave from a location
- A lorry may arrive at a location
- A bin may be emptied at a location
- A particular bin may overflow
- A bin's occupancy threshold may have been exceeded

Components of the Simulation

Events

Your simulator will output a sequence of events in the following format:


bag disposed at bin ‹bin_no› at time ‹time›

bin ‹bin_no› overflowed at time ‹time›

lorry ‹lorry_no› leaves location ‹location_id› at time ‹time›

lorry ‹lorry_no› arrives at location ‹location_id› at time ‹time›

bin ‹bin_no› emptied at time ‹time›

Components of the Simulation

Events

Depending on the actual event in your simulation, you will replace the ‹lorry_no›, ‹bin_no›, ‹location_id› and ‹time› with real values, e.g.:

lorry 1 leaves location 1.0 at time 4.00
lorry 1 arrives at location 1.1 at time 4.1
bin 1.1 emptied at time 4.15

This is valid output in the sense that it is formatted correctly
It may be invalid for other reasons, for example the occupancy of bin 1.1 may have not exceeded the predefined threshold
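
A minimal sketch of how such lines could be produced in C is given below. The helper names are hypothetical, and two assumptions are made that the handout may contradict: identifiers are written as area.index, and times are printed with two decimal places.

```c
#include <stdio.h>

/* Hypothetical helpers: each writes one event line into buf and returns
   the value of snprintf. Locations and bins are written as area.index;
   times use two decimals (an assumption, check the handout). */
int format_lorry_leaves(char *buf, size_t size, int lorry,
                        int area, int location, double when)
{
    return snprintf(buf, size, "lorry %d leaves location %d.%d at time %.2f",
                    lorry, area, location, when);
}

int format_lorry_arrives(char *buf, size_t size, int lorry,
                         int area, int location, double when)
{
    return snprintf(buf, size, "lorry %d arrives at location %d.%d at time %.2f",
                    lorry, area, location, when);
}

int format_bin_emptied(char *buf, size_t size, int area, int bin, double when)
{
    return snprintf(buf, size, "bin %d.%d emptied at time %.2f",
                    area, bin, when);
}
```

Writing into a buffer rather than printing directly leaves open the choice, discussed later, between printing events as they occur and recording them for later.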

Part One & Part Two Assessments

Part one is just for feedback. You only need to have a working simulator
For part two, there are additional requirements:
- Full functionality should be implemented
- Summary statistics, such as average trip efficiency, should be produced
- Experimentation support, varying disposal rates to see how those impact the collection process
- Validation, checking that the input is valid
These are all specified in the coursework handout

Coursework Handout

This was a brief summary of the major components of the simulation
It is no substitute for reading the coursework handout
Available at: www.inf.ed.ac.uk/teaching/courses/cslp/coursework/cslp-2014.pdf

The Simulator

Definitions

In the requirements I stated that your simulator will be a:
- stochastic,
- discrete event,
- discrete time
simulator
Let's see what each of these terms means.

Stochasticity

A stochastic process is one whose state evolves “non-deterministically”, i.e. the next state is determined according to a probability distribution.
This means a stochastic simulator may produce slightly different results when run repeatedly with the same input.
Therefore it is appropriate to compute certain statistics to characterise the behaviour of the simulated system.
Remember, these are statistics about the model:
- You hope that the real system exhibits behaviour with similar statistics

Discrete Events

Discrete events happen at a particular time and mark a change of state in the system.
This means discrete-event simulators do not track system dynamics continuously, i.e. an event either takes place or it does not.
There is no fine-grained time slicing of the states, i.e.
Generally a state could be encoded as an integer.
Usually it is encoded as a set of integers, possibly coded as different data types.
Discrete-event simulations run faster than continuous ones.

Discrete vs Continuous States

When working with discrete events, it is common to consider that states are also discrete.
Example:

Discrete Time

Discrete time simulations operate with a discrete number of points:
- Minutes, Hours, Days, Weeks, etc.
These can also be logical time points:
- Moves in a board game,
- Communications in a protocol.
Your task is to write a discrete time simulator.
Events will occur with minute level granularity.

The Exponential Distribution

Remember that the probability distribution gives the probability of the different possible values of a random variable.
The exponential distribution describes the time between events in a Poisson process, i.e.

Events' inter-arrival times are independent (memoryless),
Events occur with a constant average rate λ.

The Exponential Distribution

Roughly speaking, the time X we need to wait before an event occurs has an exponential distribution if the probability that the event occurs during a certain time interval is proportional to the length of that time interval.
Applications:

Call arrivals at a telephone exchange
Radioactive particle decay
Aeroplane arrivals at a large hub

The Exponential Distribution

The probability density function (PDF) is given by:

f(x,λ) = λe^(-λx) ∀ x > 0

Describes the relative likelihood that an event with rate λ occurs at time x
A time point is infinitesimally small
The integral of this gives the probability that it occurs within two time bounds (but you can largely ignore this)

The Exponential Distribution

The cumulative distribution function (CDF) is given by:

F(x,λ) = 1 - e^(-λx) ∀ x > 0

So if something happens at a rate of 0.5 per unit of time, then the probability that we will observe it occurring within 1 time unit is: F(1, 0.5) = 1 - e^(-0.5*1) ≈ 0.393

The Exponential Distribution

The mean or expected value is given by the reciprocal of the rate parameter.
In plain English this means that if something occurs at rate r then we can expect to wait 1/r time units on average to see each occurrence.
If something occurs 7 times per week, you can expect to wait 1/7 of a week (or a full 24 hours) on average between each occurrence.

Exercise

What is the probability that a random variable X is less than its expected value, if X has an exponential distribution with rate λ?

The expected value of an exponential random variable with parameter λ is:

E[X] = 1/λ

We need to compute P(X ≤ E[X]) using the distribution function:

P(X ≤ E[X]) = P(X ≤ 1/λ)

= F(1/λ, λ)

= 1 - e^(-λ · 1/λ)

= 1 - 1/e ≈ 0.632

The Memoryless Property

Formally: P(X > s + t | X > s) = P(X > t) for all s, t > 0
Less formally: The time that we can expect to wait for the next occurrence of some (exponentially distributed) event, is unaffected by how long we have already been waiting for it
In the 7 times a week example, if it has been 24 hours since the last occurrence, the expected additional time I have to wait is still 24 hours
A quick note, don't confuse these two properties:
- Correct P(X > 100 | X > 80) = P(X > 20)
- Incorrect P(X > 100 | X > 80) = P(X > 100)
The latter would be a strange kind of pre-determined system
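
The formal statement follows from the distribution function in one line, using P(X > x) = 1 - F(x,λ) = e^(-λx) and the definition of conditional probability:

```latex
\begin{align*}
P(X > s + t \mid X > s)
  &= \frac{P(X > s + t)}{P(X > s)}
   = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
   = e^{-\lambda t}
   = P(X > t)
\end{align*}
```

With s = 80 and t = 20 this gives P(X > 100 | X > 80) = e^(-20λ) = P(X > 20), which is the "correct" case above.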

The Memoryless Property

In your simulation, users dispose of rubbish bags at exponentially distributed time intervals
That means the next disposal event does not depend on the previous ones
As a result of firing that event, the global state of the simulation changes
However local states may not have changed, e.g. a lorry may still be at the depot

How do we sample from a distribution?

Inverse Transform Method

Let X be a random variable with continuous and increasing distribution function F. Denote the inverse by F⁻¹.
Let U be a random variable uniformly distributed on the unit interval (0, 1).
Then X can be generated by X = F⁻¹(U).

If we use an exponential CDF for F, then we effectively sample from that distribution by

X = -ln(U)/λ
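
The intermediate step is to solve u = F(x) for x, using the exponential CDF given earlier:

```latex
\begin{align*}
u &= F(x) = 1 - e^{-\lambda x} \\
e^{-\lambda x} &= 1 - u \\
x &= F^{-1}(u) = -\frac{\ln(1 - u)}{\lambda}
\end{align*}
```

Since 1 - U is uniformly distributed on (0, 1) whenever U is, the term 1 - U can be replaced by U, which gives the simpler form X = -ln(U)/λ.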

Sampling exponential distributions in practice

Straightforward, right? (denoting mean = 1/λ)


  int r = (int) (-log((double) rand() / RAND_MAX) * mean);

Well...not quite

rand() is known to be implemented poorly
You want to draw a RV uniformly distributed on (0,1)
Properly seeding a pseudo-random generator is tricky

OK, so what should we do?

xkcd?

Drawing uniformly distributed random numbers

Use a uniform deviate and discard the zero.


double uniform_deviate ( int seed )
{
   return seed * ( 1.0 / ( RAND_MAX + 1.0 ) );
}

/* u must be a double: an int would truncate every deviate to 0 */
double u;
do
   u = uniform_deviate ( rand() );
while (u == 0.0);

int r = (int) (-log(u)*mean);

Seeding rand()

The usual solution is to get the system time.


srand ( (unsigned int) time ( NULL ) );

Note there may be some portability issues with the above. Julienne Walker argues that hashing the system time first is a good solution. For a longer discussion about this and random numbers you can check his web page.
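
One way to hash the system time before seeding is sketched below. It is written in the spirit of the approach mentioned above and is not necessarily identical to the code on that web page; the names hash_bytes and time_seed are assumptions. The idea is to mix every byte of the time_t value into the seed instead of casting time_t directly to unsigned int.

```c
#include <limits.h>
#include <stddef.h>
#include <time.h>

/* Fold a sequence of bytes into an unsigned value. */
unsigned hash_bytes(const void *data, size_t len)
{
    const unsigned char *p = data;
    unsigned seed = 0;

    for (size_t i = 0; i < len; i++)
        seed = seed * (UCHAR_MAX + 2U) + p[i];
    return seed;
}

/* Hash the object representation of the current time into a seed. */
unsigned time_seed(void)
{
    time_t now = time(NULL);
    return hash_bytes(&now, sizeof now);
}
```

The generator would then be seeded with srand(time_seed()). For repeatable test runs a fixed seed is more convenient.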

Your Simulators

Will be Discrete event simulators
Will be Discrete time simulators
Will make use of the exponential distribution to model user behaviour

Clarifications

Question: What is the proposed maximum number of bins in a given area?
Answer: It is foreseeable that the number of bins in an area may limit e.g. the possibility of performing exhaustive search to find an optimal route. For this practical we will assume the number of bins in an area is specified as an unsigned int value, i.e. the maximum number is 65,535.

NB: Some compilers use 4 bytes for int. You can check what your compiler is using with sizeof(int). If you want a portable unsigned 16-bit integer, use uint16_t.
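
A quick check along these lines could look as follows. The helper names print_int_sizes and fits_bin_count are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: does a bin count read from the input fit in 16 bits? */
int fits_bin_count(unsigned long n)
{
    return n <= UINT16_MAX;
}

/* Print the sizes the current compiler uses. */
void print_int_sizes(void)
{
    printf("sizeof(int)      = %zu bytes\n", sizeof(int));
    printf("sizeof(uint16_t) = %zu bytes\n", sizeof(uint16_t));
    printf("UINT16_MAX       = %u\n", (unsigned) UINT16_MAX);
}
```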

Clarifications

Question: How long does it take to service (empty) a lorry at the waste processing facility?
Answer: While in practice it usually takes longer to empty a lorry than to service a single bin, to make things simpler we will consider these service times to be equal.

Clarifications

Question: Can a lorry receive an updated route from the depot while in service, as a bin's occupancy threshold may be reached after the lorry's departure?
Answer: No. While this scenario is foreseeable, we will assume lorries are assigned routes only prior to their departure from the waste processing facility.

Clarifications

Question: The occupancy of some bins may increase as a lorry traverses a route and consequently the lorry may not be able to service all assigned bins due to capacity constraints. How should the lorry proceed in situations like this?

Clarifications

Answer:

How you deal with such events is a design choice you should make. You can either

conservatively assign fewer bins on a single run, to avoid capacity problems,
return following the planned route without servicing other bins in that run if this happens,
compute the shortest path from the current bin and return to the depot, or
implement a combination of these or other similar strategies.

Please ensure your choice is explained in your final report and your code is commented appropriately.

Simulation Components

Route Planning

Service Areas

We need an abstract representation of a street map and bin locations for each service area.
We will consider one lorry per area.
We need to model the roads between different locations and the time required to travel these.

Example

Leith Walk area in Edinburgh; 20 imagined bin locations

Map source: bing.com

Graph representation

In mathematical terms such a collection of bins interconnected with links can be represented through a graph
A graph G = (V,E) comprises a set of vertices V that represent objects (bins) and a set of edges E that connect different pairs of vertices (links).
Graphs can be directed or undirected

Directed Graphs

Edges have a direction associated with them and they are called arcs or directed edges
Formally, they are ordered pairs of vertices,
i.e. (a,b) ≠ (b,a) if a ≠ b

Undirected Graphs

Edges have no orientation, i.e. they are unordered pairs of vertices. That is, there is a symmetric relation between nodes and thus (a,b) = (b,a)
For our simulations we will consider undirected graph representations of the service areas

Back to the example

This area...

Map source: bing.com

Corresponding Graph

...can be represented by

Note that we numbered the vertices and added the '0' node, to model the lorry depot.

Weighted Graph

We also need to model the distances between bin locations
We will use a weighted graph representation, where a number (weight) is associated to each edge
In our case weights will represent the average travel duration between two bins (vertices), expressed in minutes

Weighted Graph

For our example, this may be

Input Script

The graph representation of the bins' locations and the distances between them will be given in the input script in matrix form.
For an area with N bins, a (N+1) x (N+1) matrix will be specified.
The graph keyword will precede the matrix
Being an undirected graph, the matrix will be symmetric
Where there is no edge in the graph between two vertices we will use a (2¹⁶-1) value in the matrix

For The Previous Example


   0     1     2     3     4     5       ...   19    20
   -------------------------------------------------------
 0|0     9     65535 8     10    65535   ...   65535 65535
 1|9     0     2     65535 65535 65535   ...   65535 65535
 2|65535 2     0     1     65535 65535   ...   65535 65535
 3|8     65535 1     0     1     65535   ...   65535 65535
 4|10    65535 65535 1     0     4       ...   65535 65535
 5|65535 65535 65535 65535 4     0       ...   65535 65535
 .|.     .     .     .     .     .             .     .
 .|.     .     .     .     .     .             .     .
 .|.     .     .     .     .     .             .     .
19|65535 65535 65535 65535 65535 65535   ...   0     1
20|65535 65535 65535 65535 65535 65535   ...   1     0

Route Planning

Lorries are scheduled periodically and their frequency is an input parameter
Bin occupancy thresholds are used to decide the subset of bins to be serviced
You must seek the route that visits all bins whose occupancy has exceeded the threshold, and has a minimum cost (in terms of total duration)
All routes are circular, i.e. they must start and end at location x.0, where x is the area index

Route Planning

Not all the bins in an area may need to be serviced at a given time.
Thus it may be appropriate to work with an equivalent graph in which vertices that do not need to be visited are isolated and equivalent edge weights are introduced between the remaining vertices.
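
One way to obtain such equivalent edge weights, offered here as an assumption rather than a required method, is to compute the shortest travel time between every pair of locations (the Floyd-Warshall algorithm) and use that as the weight between two bins that need service. The sketch stores the matrix as a flat row-major array; the names NO_EDGE and shortest_times are illustrative.

```c
#include <stdint.h>

#define NO_EDGE 65535u   /* matrix value meaning "no direct road" */

/* Fill dist (n * n entries, row-major) with the shortest travel time
   between every pair of locations, starting from the direct edges in map.
   Pairs that cannot reach each other keep the value NO_EDGE. */
void shortest_times(const uint16_t *map, int n, uint32_t *dist)
{
    for (int i = 0; i < n * n; i++)
        dist[i] = map[i];

    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                uint32_t a = dist[i * n + k], b = dist[k * n + j];
                /* only combine two legs that both exist */
                if (a != NO_EDGE && b != NO_EDGE && a + b < dist[i * n + j])
                    dist[i * n + j] = a + b;
            }
}
```

On the six-location example matrix shown earlier, this gives a shortest travel time of 9 minutes between the depot and location 3 (via location 4), even though there is no direct edge between them.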

The (More) Challenging Part

Let's refer to the graph of all bins that require service at a given time by "service graph"
How do we find the (almost) optimal route that visits all vertices in the service graph with minimum cost?
This is entirely up to you, but I will discuss some possibilities next
You must justify your choice in the final report and comment the simulator code appropriately
You may wish to implement more than one algorithm

Useful terminology

A walk is a sequence of edges connecting a sequence of vertices in a graph
A path is a walk that does not include any vertex twice
A cycle is a path that starts and ends at the same vertex

paths / cycle

Useful terminology

A trail is a walk that does not include any edge twice
A trail may include a vertex twice, as long as it comes and leaves on different edges
A circuit is a trail that starts and ends at the same vertex

trail / circuit

Hamiltonian Circuit

A Hamiltonian circuit (cycle) is a path that visits every vertex exactly once and starts and ends at the same vertex
NB: Not all graphs have a Hamiltonian circuit

Minimum Cost Hamiltonian Circuit

In a weighted graph, the minimum cost Hamiltonian circuit is that where the sum of the edge weights is the smallest
Finding the minimum cost Hamiltonian circuit on your bin service graph is one option for route planning
Warning: Finding a Hamiltonian circuit can be very difficult. Deciding whether one exists is a known NP-complete problem. Simply put, no polynomial-time algorithm is known and the running time of exact methods increases significantly with the number of vertices

Heuristic Algorithms

Heuristics work quite well for finding a solution, most of the time
Solutions may not always be optimal, but are often good enough
Work relatively fast
Popular heuristics for finding minimum cost Hamiltonian circuits:

Nearest Neighbour Algorithm
Sorted Edges Algorithm

Nearest Neighbour Algorithm

Nearest Neighbour is a greedy algorithm – at every step it chooses as the next vertex the one connected to the current through the edge with the smallest weight
Only searches locally
Nodes already visited are ignored
Due to its greedy nature it may not find a solution
Finding a solution and its total cost depends on the start vertex chosen
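
A minimal sketch of this heuristic in C is given below, assuming the graph is stored as a flat row-major matrix with 65535 marking a missing edge. The names nearest_neighbour, NO_EDGE and MAX_LOC are assumptions, and MAX_LOC is an arbitrary bound chosen for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_EDGE 65535u
#define MAX_LOC 64       /* assumed upper bound for this sketch */

/* Greedy tour from start over an n-vertex graph (map is n * n, row-major).
   Writes the visiting order into route[0..n] and returns the total cost,
   or -1 if the heuristic gets stuck or cannot return to start. */
long nearest_neighbour(const uint16_t *map, int n, int start, int *route)
{
    bool visited[MAX_LOC] = { false };
    long total = 0;
    int current = start;

    if (n < 1 || n > MAX_LOC || start < 0 || start >= n)
        return -1;
    visited[start] = true;
    route[0] = start;

    for (int step = 1; step < n; step++) {
        int best = -1;
        /* pick the unvisited neighbour reached by the cheapest edge */
        for (int v = 0; v < n; v++)
            if (!visited[v] && map[current * n + v] != NO_EDGE &&
                (best < 0 || map[current * n + v] < map[current * n + best]))
                best = v;
        if (best < 0)
            return -1;               /* dead end: no unvisited neighbour */
        total += map[current * n + best];
        visited[best] = true;
        route[step] = best;
        current = best;
    }
    if (map[current * n + start] == NO_EDGE)
        return -1;                   /* cannot close the circuit */
    route[n] = start;
    return total + map[current * n + start];
}
```

On the six-location example matrix shown earlier, starting from the depot (vertex 0), this sketch visits 4, 3, 2, 1 and 5 and then finds no edge back to 0, so it returns -1.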

Nearest Neighbour

Example: Starting at '0'

No solution

Nearest Neighbour

Example: Starting at '1'

Total cost: 22

Sorted Edges Algorithm

Also greedy, but has a more global view → takes slightly more time to find a solution
First sorts all the edges in ascending order of their weights
Adds sorted edges one at a time, unless adding a new edge leads to three edges entering a node, or creates a circuit that does not include all vertices
Skips the edges that violate these rules
Keeps adding edges until finding a Hamiltonian circuit
Stops when a solution is found, even if there are edges left
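
The steps above can be sketched in C as follows, again assuming a flat row-major matrix with 65535 for missing edges. The Edge struct and the names sorted_edges, by_weight and find_root are illustrative. Premature circuits are detected by tracking which vertices are already connected, which is one possible technique among several.

```c
#include <stdint.h>
#include <stdlib.h>

#define NO_EDGE 65535u
#define MAX_LOC 64   /* assumed upper bound for this sketch */

typedef struct { int a, b; uint16_t w; } Edge;

/* qsort comparator: ascending edge weight. */
static int by_weight(const void *x, const void *y)
{
    const Edge *e = x, *f = y;
    return (e->w > f->w) - (e->w < f->w);
}

/* Follow parent links to the representative of v's connected component. */
static int find_root(const int *parent, int v)
{
    while (parent[v] != v)
        v = parent[v];
    return v;
}

/* Returns the cost of the Hamiltonian circuit found, or -1 if none is. */
long sorted_edges(const uint16_t *map, int n)
{
    Edge edges[MAX_LOC * (MAX_LOC - 1) / 2];
    int parent[MAX_LOC], degree[MAX_LOC] = { 0 };
    int count = 0, added = 0;
    long total = 0;

    if (n < 3 || n > MAX_LOC)
        return -1;
    for (int i = 0; i < n; i++) {
        parent[i] = i;
        for (int j = i + 1; j < n; j++)
            if (map[i * n + j] != NO_EDGE)
                edges[count++] = (Edge){ i, j, map[i * n + j] };
    }
    qsort(edges, (size_t) count, sizeof edges[0], by_weight);

    for (int k = 0; k < count && added < n; k++) {
        int a = edges[k].a, b = edges[k].b;
        int ra = find_root(parent, a), rb = find_root(parent, b);

        if (degree[a] >= 2 || degree[b] >= 2)
            continue;                /* would give a vertex three edges */
        if (ra == rb && added != n - 1)
            continue;                /* circuit that misses some vertices */
        parent[ra] = rb;
        degree[a]++;
        degree[b]++;
        total += edges[k].w;
        added++;
    }
    return added == n ? total : -1;
}
```

On the six-location example matrix shown earlier, this sketch builds the path 0-1-5-4-3-2 but cannot close it because there is no edge between 2 and 0, so it returns -1.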

Sorted Edges

Example

Adding the first 3 edges is straightforward

Sorted Edges

Example

Adding 3-4 creates a circuit, but not all nodes visited.

Sorted Edges

Example

Remaining edges create circuits and violate the 3-edges rule. No solution found.

Brute Force Algorithm

When the number of vertices is small, a 'brute force' approach could be feasible
Find all paths that visit all vertices once and pick the one with the lowest cost
Guaranteed to find a solution (if there exists one), and this will be optimal
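
A brute-force search can be sketched with a short recursive function, assuming the same flat matrix layout with 65535 for missing edges and the depot at vertex 0. The names brute_force and search, and the small MAX_LOC bound, are assumptions made for this example. The running time grows factorially with the number of vertices.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_EDGE 65535u
#define MAX_LOC 16   /* brute force is only practical for small graphs */

/* Extend the partial route ending at current; when all n vertices have
   been seen, try to close the circuit back to vertex 0 and keep the best. */
static void search(const uint16_t *map, int n, int current, int seen,
                   bool *visited, long cost, long *best)
{
    if (seen == n) {
        uint16_t back = map[current * n];
        if (back != NO_EDGE && (*best < 0 || cost + back < *best))
            *best = cost + back;
        return;
    }
    for (int v = 1; v < n; v++) {
        uint16_t w = map[current * n + v];
        if (visited[v] || w == NO_EDGE)
            continue;
        visited[v] = true;
        search(map, n, v, seen + 1, visited, cost + w, best);
        visited[v] = false;
    }
}

/* Cost of the cheapest Hamiltonian circuit through vertex 0, or -1. */
long brute_force(const uint16_t *map, int n)
{
    bool visited[MAX_LOC] = { false };
    long best = -1;

    if (n < 1 || n > MAX_LOC)
        return -1;
    visited[0] = true;
    search(map, n, 0, 1, visited, 0, &best);
    return best;
}
```

On the six-location example matrix shown earlier, this finds the circuit 0-1-5-2-3-4-0 with a total cost of 29 minutes.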

Other approaches

A Hamiltonian circuit may not always be the path that visits all the vertices and has the lowest cost
Sometimes visiting a node more than once could be a good idea
For small graphs, using brute force to find the cheapest path that visits all nodes may be appropriate

Other approaches

Example: passing twice through node '4'

Total cost: 22

Choosing Route Planning Algorithms

You can use any of these approaches and heuristics
You can implement other heuristics you have studied
You can implement multiple solutions, as some may not work for every graph
You have complete freedom, but make sure you document your choice and discuss its implications for the system's performance in your written report

Code Structuring & Coding Strategy

How to structure your work?

This is for guidance only and I will not go into great detail, to avoid seeing identically structured solutions.
Part of the practical is structuring it yourself. However, it is likely you will want at least the following components:
- A parser
- A representation of the states of a simulation
- The simulation algorithm
- Something to handle output
- Something to analyse results
- A test suite

Some Obvious Decisions

Do you want to parse into some abstract syntax data structure and then convert that into a representation of the initial state
- Or you could parse directly into the representation of the initial state
Do you wish to print out events as they occur during the simulation
- Or record them and print them out later
Do you wish to analyse the simulation events as the simulation proceeds
- Or analyse the events afterwards

Parsing

You do not necessarily need to start with the parser
The parser produces some kind of data structure. You could instead start by hard coding your examples in your source code
But the parsing for this project is pretty simple
Hence you could start with the parser, even if this is not complete before moving on
- Hard coding data structure instances could prove laborious
- But doing so would ensure your simulator code is not heavily coupled with your parser code

Software Construction

Software construction is unusual among large projects in that it allows a great deal of backtracking
Many other forms of projects, such as construction, event planning, and manufacturing, only allow for backtracking in the design phase.
The design phase consists of building the object virtually (on paper, on a computer) when backtracking is inexpensive
Software projects do not produce physical artefacts, so the construction of the software is mostly the design

Refactoring

Refactoring is the process of restructuring code while achieving exactly the same functionality, but with a better design.
This is powerful, because it allows trying out various designs, rather than guessing which one is the best
It allows the programmer to design retrospectively once significant details are known about the problem at hand
It allows avoiding the cost of full commitment to a particular solution which, ultimately, may fail.

Suggested Strategy

Note that this is merely a suggested strategy
Start with the simplest program possible
Incrementally add features based on the requirements
After each feature is added, refactor your code
- This step is important, it helps to avoid the risk of developing an unmaintainable mess
- Additionally it should be done with the goal of making future feature implementations easier
- This step includes janitorial work (discussed later)

Suggested Strategy

At each stage, you always have something that works
Although you need not specifically design for later features you do at least know of them, and hence can avoid doing anything which will make those features particularly difficult.

Alternative Inferior Strategy

Design the whole system before you start
Work out all components and sub-components needed
Start with the sub-components which have no dependencies
Complete each sub-component at a time
Once all the dependencies of a component have been developed, choose that component to develop
Finally, put everything together to obtain the entire system
- Test the entire system

Janitorial Work

Janitorial work consists mainly of the following

Reformatting
Commenting
Changing Names
Tightening

Janitorial Work

Reformatting


int function_name (int x)
{
  return x + 10;
}

Becomes:


int function_name(int x) {
  return x + 10;
}

There is plenty of software which can do this work.

Janitorial Work

Reformatting

Reformatting is entirely superficial
It is important to consider when you apply this
This may well conflict with other work performed concurrently
Reformatting should be largely unnecessary if you keep your code correctly formatted in the first place
- More commonly required on group projects

Janitorial Work

Commenting

Writing good comments in your source code is essential
When done as janitorial work this can be particularly useful
- You can comment on the stuff that is not obvious even to yourself as you read it.
The important thing to comment is not what or how but why

Try not to have redundant/obvious information in your comments:


// 'x' is the first integer argument
int leastCommonMultiple(int x, int y);

Janitorial Work

Commenting

Ultra bad:


// increment x
x += 1;

Better:


// Since we now have an extra element to consider
// the count must be incremented
x += 1;

Janitorial Work

Changing Names

The previous example used x as a variable name
Unless it really is the x-axis of a graph, choose a better name
This is of course better to do the first time around
However, as with commenting, it is often more obvious to the author that code is unclear when reading it again later

Janitorial Work

Tightening


  ...
  FILE *fInput;
  fInput = fopen(fileName, "r");
  parseInput(fInput);
  fclose(fInput);

Tightened to become:


  ...
  FILE *fInput;
  fInput = fopen(fileName, "r");
  if (fInput == NULL) {
    // Explain to the user ...
    printf("Error: %d (%s)\n", errno, strerror(errno));
  } else {
    parseInput(fInput);
    fclose(fInput);
  }

Janitorial Work

Tightening

For some this is not janitorial work, since it actually changes the function of the code in a non-superficial way
However, similar to other forms it is often caused by being unable to think of everything when writing new code

Janitorial Work

Most of this work is work that arguably could have been done right the first time around when the code was developed
However, when developing new code, you have limited cognitive capacity
You cannot think of everything when you develop new code. Janitorial work is your time to rectify the minor stuff you forgot
Better than trying to get it right first time is making sure you later review your code

Janitorial Work

Remember, refactoring is the process of changing code without changing its functionality, whilst improving design.
Strictly speaking janitorial work is not refactoring
- It should not change the function of the code
  - Tightening might, but generally for exceptional input
- But neither does it make the design any better
In common with refactoring you should not perform janitorial work on pre-existing code whilst developing new code

More About Refactoring

Refactoring is a term which encompasses both factoring and defactoring
Generally the principle is to make sure that code is written exactly once
We hope for zero duplication
However, we would also like for our code to be as simple and comprehensible as possible

Factoring and Defactoring

We avoid duplication by writing re-usable code
Re-usable code is generalised
Unfortunately, this often means it is more complicated
Factoring is the process of removing common or replaceable units of code, usually in an attempt to make the code more general
Defactoring is the opposite process: specialising a unit of code, usually in an attempt to make it more comprehensible

Factoring Example


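#include <stdbool.h>
#include <stdio.h>

// Prints every prime number from 2 up to limit (inclusive)
void primes(int limit) {
    int i, x = 2;
    while (x <= limit) {
        bool prime = true;
        // x is prime if no i in [2, x) divides it
        for (i = 2; i < x; i++) {
            if (x % i == 0) {
                prime = false;
                break;
            }
        }
        if (prime) {
            printf("%d is prime\n", x);
        }
        x++;
    }
}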

A very naive but perfectly reasonable bit of code to print out a set of prime numbers up to a particular limit

Factoring Example


#include <stdbool.h>
void printPrime(int x) {
    printf("%d is prime\n", x); 
}

void primes(int limit) {
    int i, x = 2;
    while (x <= limit) {
        ... // as before
        if (prime) {
            printPrime(x);
        }
        x++;
    }
}

Here we have “factored out” the code to print the prime number to the screen. This may make it more readable, but the code is not more general.

Factoring Example

To make it more general we have to actually parametrise what we do with the primes once we have found them.


#include <stdbool.h>
void primes(int limit, void (*processPrimes) (int)) {
    int i, x = 2;
    while (x <= limit) {
        ... // as before
        if (prime) {
            (*processPrimes)(x);
        }
        x++;
    }
}

You can now use different functions to display, store, etc. the prime numbers.

Factoring Example

For instance, to print to display


void printPrime(int x) {
    printf("%d is prime\n", x);
}
void primes(int limit, void (*processPrimes) (int)) {
    int i, x = 2;
    while (x <= limit) {
        ... // as before
        if (prime) {
            (*processPrimes)(x);
        }
        x++;
    }
}
int main(void) {
    primes (100, &printPrime);
    ...
}
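As a further sketch of the same idea, the callback below counts the primes instead of printing them. The names `countPrime` and `primeCount` are illustrative and not part of the original example; `primes` is written out in full, using the same naive trial division as before, so that the block compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback state: a file-scope counter is needed because the
   callback signature void (*)(int) leaves no room for extra arguments. */
static int primeCount = 0;

/* Callback that counts primes rather than printing them. */
void countPrime(int x) {
    (void) x;       /* the value itself is not needed for counting */
    primeCount++;
}

/* Same naive search as before, parametrised by a callback. */
void primes(int limit, void (*processPrimes)(int)) {
    assert(processPrimes != NULL);
    for (int x = 2; x <= limit; x++) {
        bool prime = true;
        for (int i = 2; i < x; i++) {
            if (x % i == 0) {
                prime = false;
                break;
            }
        }
        if (prime) {
            (*processPrimes)(x);
        }
    }
}
```

Calling `primes(100, &countPrime)` with `primeCount` initially zero leaves it at 25. The need for a file-scope counter is part of the price of this kind of generality.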

Factoring

What you should factor depends on the context
How likely am I to need more than just prime numbers?
How likely am I to do something other than print the primes?
Try to find the right re-usability/time trade-off

Defactoring

Numbers such as the number 20 can be factored in different ways
- 2,10
- 4,5
- 2,2,5
If we have the factors 2 and 10, and realise that we want the number 4 included in the factorisation we can either:
- Try to go directly by multiplying one factor and dividing the other
- Defactor 2 and 10 back into 20 and then divide 20 by 4

Defactoring

Similarly, your code is factored in some way
In order to obtain the factorisation that you desire, you may have to first defactor some of your code
This allows you to factor down into the desired components
This is often easier than trying to short-cut across factorisations

Defactoring

Flexibility is great, but it is generally not without cost
- The cognitive cost associated with understanding the more abstract code
If the flexibility is not required now and is unlikely to become required, then it might be worthwhile defactoring
It is appropriate to explain your reasoning in comments

Refactoring Summary

Code should be factored into multiple components
Refactoring is the process of changing the division of components
Defactoring can help the process of changing the way the code is factored
Well factored code will be easier to understand
Do not update functionality at the same time

Common Development Approach

Start with the main function
Write some code, for example to parse the input
Write (or update) a test input file
Run your current application
See if the output is what you expect
Go back to step 2.

Do Not Start with Main

A better place to start is with a test suite
This doesn't have to mean you cannot start coding
Write a couple of test inputs
Create a skeleton “do nothing” parse function
Create an entry point which simply calls your parse function on your test inputs (all of them)
Watch them fail
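The steps above could be sketched as follows. This is only a minimal illustration: `parseInput`, `testInputs` and `runParserTests` are hypothetical names, the test strings are placeholders rather than valid input scripts, and the convention that the parser returns 0 on success is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical skeleton parser: returns 0 on success, -1 on failure.
   It deliberately does nothing yet, so every test fails at first. */
int parseInput(const char *text) {
    (void) text;
    return -1;
}

/* Placeholder test inputs; real ones would follow the required input format. */
static const char *testInputs[] = {
    "first sample input",
    "second sample input"
};

/* Runs the parser on all test inputs and returns the number of failures. */
int runParserTests(void) {
    size_t n = sizeof testInputs / sizeof testInputs[0];
    int failures = 0;
    for (size_t i = 0; i < n; i++) {
        assert(testInputs[i] != NULL);
        if (parseInput(testInputs[i]) != 0) {
            printf("FAIL: test input %zu\n", i);
            failures++;
        }
    }
    printf("%d of %zu tests failed\n", failures, n);
    return failures;
}
```

With the do-nothing parser, both tests fail; the entry point only needs to call `runParserTests()` and report the result.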

Do Not Start with Main

Code until those tests are green
- Including possibly refactoring
Consider new functionality
- Write a function that tests for that new functionality
- Watch it fail, whether by generating an error or simply not producing the results required
- Return to step 1.
You can write your main function any time you like
- It should be very simple, as it simply calls all of your fully tested functionality

Do Not Start with Main

Any time you run your code and examine the results, you should be examining output of tests
If you are examining the output of your program ask yourself:
- Why am I examining this output by hand and not automatically?
- If I fix whatever is strange about the output can I be certain that I will never have to fix this again?
Of course sometimes you need to examine the output of your program to determine why it is failing a test. This is just semantics (it is still the output of some test)

Summary

Refactoring allows you to avoid doing a large amount of upfront design and also avoid producing a big hairy mess

Do not change functionality whilst refactoring
Your code should be adaptable

Do not start with main, write a test suite instead

Memory management

Memory management in C

Memory allocation/deallocation is done differently in C compared to what you may be used to in other languages
There is no automatic memory management (garbage collection) and thus the programmer is responsible for releasing dynamically allocated memory when no longer needed
Poor memory management can lead to memory leaks → system performance suffers as virtual memory is progressively paged to hard drive. The OS may crash

Memory allocation

Two mechanisms are used to allocate memory in C

Declaring local variables. These are stored on the stack and are automatically freed when they go out of scope (e.g. on exiting a function)
```
int parseInput(FILE *finput) {
   int N;
   int array[100];
   ...
} 
```

Nice, right? Well there's a catch. The compiler needs to know in advance how much memory to allocate (inefficient) and the stack size is limited (not all your variables may fit).

Memory allocation

Two mechanisms are used to allocate memory in C

Requesting memory explicitly and storing variables on the heap.

You can allocate as much memory as the system allows you to, but you need to take care of releasing it


   ...
   int N;
   int *array;
   ...
   array = (int *) malloc(N * sizeof(int));

The compiler cannot know what you intend to do with the array and thus will not release it automatically. You have to do it manually when done:


   free(array);
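A minimal sketch of the whole allocate, use, free cycle, with the result of malloc checked before use. The helper names `makeArray` and `sumArray` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocates n ints on the heap, each set to value.
   Returns NULL if n is 0 or if the allocation fails. */
int *makeArray(size_t n, int value) {
    if (n == 0) {
        return NULL;
    }
    int *array = malloc(n * sizeof *array);
    if (array == NULL) {
        return NULL;            /* allocation failed; nothing to free */
    }
    for (size_t i = 0; i < n; i++) {
        array[i] = value;
    }
    return array;
}

/* Sums the array; the caller keeps ownership of the memory. */
long sumArray(const int *array, size_t n) {
    assert(array != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += array[i];
    }
    return total;
}
```

The caller owns the returned array and must call `free()` on it exactly once when it is no longer needed.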

Segmentation Fault & Co

A few things can go wrong if you are not careful with variables allocated on the heap.

Memory leaks → you forget to call free() when variables are no longer needed
You use or free memory, but the allocation failed or the memory was already deallocated


char *fileName = malloc(255*sizeof(char));
if (fileName != NULL) {
  ...
  free(fileName);
}

You try to write beyond the allocated length (corruption)


char *temp = malloc(64*sizeof(char));
memcpy(temp, data, dataLen);   // dataLen > 64 writes beyond the allocated buffer

Segmentation Fault & Co

You use an out of bounds array index


int *array = malloc(128*sizeof(int));
int N = 200;

for (i = 0; i < N; i++) {  // once i > 127, an error will occur
   ... 
}

You use an address, but memory has not been allocated


struct listNode *listElement;  // hypothetical struct type; pointer never initialised
x = listElement->value;

Segmentation Fault & Co

You return a pointer to a variable from the stack


int *getCount() {
   int n;  // Local stack variable

   ...     // count the number of values that 
	   // divide by x in an array

   return &n;
}

int main(void){
  int *n; 
  ...
  n = getCount();  // Stack given up by getCount(), 
	           // &n no longer safe

  ... // n may be corrupt when needed later
}
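Two common ways of avoiding this problem are sketched below: return the count by value, or let the caller provide the storage. The function names and parameters are hypothetical, and "divide by x" is read here as "are divisible by x".

```c
#include <assert.h>
#include <stddef.h>

/* Option 1: return the count by value. Nothing on this function's stack
   is referenced after it returns. */
int countDivisible(const int *values, size_t len, int x) {
    assert(values != NULL && x != 0);
    int n = 0;
    for (size_t i = 0; i < len; i++) {
        if (values[i] % x == 0) {
            n++;
        }
    }
    return n;
}

/* Option 2: the caller owns the storage and passes its address in. */
void countDivisibleInto(const int *values, size_t len, int x, int *out) {
    assert(out != NULL);
    *out = countDivisible(values, len, x);
}
```

A third option is to allocate the result on the heap with malloc, at the cost of the caller having to free it.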

Why no garbage collection in C?

Garbage collection involves constructing a complex data structure for keeping track of allocations and counting references.
This mechanism increases the complexity of the language and affects the performance (overhead).
C is meant for designing very fast code, e.g. for operating systems, device drivers, etc.
Convenience is traded for high performance.

Code Optimisation

Refactoring is done in between development of new functionality
- Recall this makes it easier to test that this process has not changed the behaviour of your code
This is also a good time to do some optimisation
- You should be in a good position to test that your optimisations have not negatively impacted correctness

When to Optimise?

When you discover that your code is not running fast enough, it's probably wise to optimise it
Often this will come towards the end of the project
It should certainly come after you have something deployable
Preferably after you have developed and tested some major portion of functionality

A Plausible Strategy

Perform no optimisation until the end of the project once all functionality is complete and tested
This is a reasonable approach; however:
During development, you may find that your test suite takes a long time to run
Even one simple run to test the functionality you are currently developing may take minutes or hours
This can slow down development significantly, so it may be appropriate to do some optimisation at that point

How to Optimise

The very first thing you need before you could possibly optimise code is a benchmark
This can be as simple as timing how long it takes to run your test suite
O(n²) solutions will beat O(n log n) solutions on sufficiently small inputs, so your benchmarks must not be too small

How to Optimise

Once you have a suitable benchmark then you can:

Save a copy of your current code
Run your benchmark and record the run time
Perform what you think is an optimisation on your source code
Re-run your benchmark and compare the run times
If you successfully improved the performance of your code keep the new version, otherwise revert changes
Do one optimisation at a time
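The "run your benchmark and record the run time" step could be sketched with the standard `clock()` function. Here `workload` is a hypothetical stand-in for the code being measured and `timeOnce` is an illustrative helper; note that `clock()` reports processor time rather than wall-clock time.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* File-scope volatile sink so the compiler cannot remove the loop below. */
static volatile unsigned long benchmarkSink;

/* Hypothetical stand-in for the code being benchmarked. */
void workload(void) {
    for (unsigned long i = 0; i < 1000000UL; i++) {
        benchmarkSink += i;
    }
}

/* Returns the processor time, in seconds, consumed by one call to fn. */
double timeOnce(void (*fn)(void)) {
    assert(fn != NULL);
    clock_t start = clock();
    fn();
    clock_t end = clock();
    return (double) (end - start) / CLOCKS_PER_SEC;
}
```

Recording `timeOnce(&workload)` before and after a change gives the two numbers to compare.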

How to Optimise

However, bear in mind that you are writing a stochastic simulator
- This means each run is different and hence may take a different time to run
- Even if the code has not changed or has changed in a way that does not affect the run time significantly
- Simply running the same input several times and comparing the average run times should be enough to reduce or nullify the effect of this
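A small sketch of summarising several timed runs of the same input. The helpers `meanRunTime` and `runTimeSpread` are hypothetical; the spread (slowest minus fastest) is included only as a crude indication of how much the runs vary.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n recorded run times (in seconds). */
double meanRunTime(const double *times, size_t n) {
    assert(times != NULL && n > 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += times[i];
    }
    return total / (double) n;
}

/* Difference between the slowest and the fastest recorded run. */
double runTimeSpread(const double *times, size_t n) {
    assert(times != NULL && n > 0);
    double min = times[0];
    double max = times[0];
    for (size_t i = 1; i < n; i++) {
        if (times[i] < min) min = times[i];
        if (times[i] > max) max = times[i];
    }
    return max - min;
}
```

If the difference between the means of two versions is small compared with the spread, the change probably did not affect the run time significantly.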

Profiling

Profiling is not the same as benchmarking
Benchmarking:
- determines how quickly your program runs
- is to performance what testing is to correctness
Profiling:
- is used after benchmarking has determined that your program is running too slowly
- is used to determine which parts of your program are causing it to run slowly
- is to performance what debugging is to correctness

Benchmarking & Profiling

Without benchmarking you risk making changes to your program that will lead to poorer performance
Without profiling you risk wasting effort optimising a part of code which is either already fast or rarely executed

Documenting: Source code comments are a good place to explain why the code is the way it is

CSLP assessment

Assessment Criteria (I)

Implementation of requirements:
1. Parsing
2. Input validation
3. Correct simulation & correct output
4. Summary statistics of simulation results
5. Experimentation implementation
Source code documentation (comments)

Assessment Criteria (II)

Testing, including sample test input scripts
Maintainable code
Code efficiency (optimisations)
Any additional features
Written report
Early submission

Objective & Subjective Criteria

Some of the items on the above list are objective whilst some are subjective
Objective criteria are those which are testable
Subjective criteria are those which are, at least partially, based upon opinion

Objective Assessment Criteria

The most objective assessment criterion is:
- Early submission
Either you submit it before the early submission deadline or you do not
Though arguably this is not really an assessment criterion

Objective Assessment Criteria

The implementation requirements in the first list are all relatively objective:

Parsing
Input validation
Correct simulation & correct output
Summary statistics of simulation results
Experimentation implementation

Objective Assessment

Your application will be put through my own suite of test inputs
Some of these test inputs will be inputs you have seen, some will be new
Part of the exercise is for you to foresee possible inputs for which your application would fail
- Either by crashing, or by producing incorrect output
Should your application fail any tests I would have to figure out why this happened and objective marking will not be so straightforward

Parsing

Your parser should be able to parse all syntactically valid input scripts
I cannot say it much simpler than that
There will not be any deliberately tricky tests

Input Validation

This is the first task which is not finely specified
You have to demonstrate some ingenuity to devise your own rules for what should and should not be valid input
You also have to decide which kinds of inputs result in warnings or errors
- Specifically those in which the simulation could be started but may result in an error
- This may depend upon the structure of your simulator

Correct Simulation & Output

Here I will be testing whether your simulator follows the requirements correctly
The simulator is tested via its output, so these are tested at the same time
Having said that, where the output is not correct, the code is inspected to determine why
Minor syntactic issues with the output will be judged leniently
- This is part of the reason your code must compile on DiCE

Summary Statistics

This will test for correctly calculating and reporting the specified summary statistics
It is possible to get the simulation incorrect but the summary statistics correct
A small tip is to make sure your reported statistics are consistent with each other
It might be that you are getting inconsistent results because your simulation is incorrect, in which case you should note this in your README

Experimentation Implementation

Whether or not you correctly implement the experimentation of disposal rates and collection frequencies
As before it is possible to get this correct, without getting either (or both) of the simulation and the summary statistics correct
As before, if you are getting inconsistent results you should at least note that in your README

Code efficiency

Implementing some code optimisations will lead to shorter run times
It is possible that you implement everything above correctly, but your simulations take a very long time to complete
On the other hand, your code may run fast, but will not have implemented all requirements. This is not considered to be efficient

Noting Deficiencies

Use your README file to record any deficiencies you are aware of
In general any implementation errors will be treated more indulgently if they are known about
Remember, it is generally worse to produce incorrect output than no output at all

Subjective Assessment

The remaining items are mostly judged subjectively

Source code documentation
Testing, including sample test input scripts
Maintainable code
Any additional features
Written report

Documentation

Use appropriate comments to document your code
You may develop additional features which, if you do not document, I may not even know about
Clearly mark and explain any code that you have not authored yourself
Remember that code sharing is not allowed

Testing

The aim of the practical is to write a good simulator

You can at least strive for “half decent”

Either way, running one test input is woefully insufficient
You also need to be able to investigate the performance of the "bin collection process", what parameters affect this and how

Maintainable Code

Highly subjective
Remember, reusable code is more difficult to understand
But, reusable code is easier to reuse and maintain
What is an inexperienced developer to do?
Try to imagine what you might wish to do in the future

Maintainable Code

Highly subjective
Trying to justify your choices is likely a good thing
Even if your reasoning is flawed, it demonstrates that you have thought about how to design your source code
It also shows that you probably could have implemented things in another way, but specifically chose not to
A future maintainer at least knows why you made that choice. If they disagree, they can change the code without fear of some other reason they have not yet uncovered

Additional Features

This is your chance to be creative and go beyond the implementation of the requested features
It perhaps requires some imagination, but imagine you were really going to use your simulator to investigate some real (or other) logistics operation
What would be useful to you?

README

Don't forget to provide me with a README
In general this can only help your grade:
- It lets me know good things are deliberate and not fortunate
- It lets me know that deficiencies are at least known about

Written report

You should produce a written report that discusses:

the key building blocks of your design
the results of the analyses you performed with different inputs
insights gained into the system's performance
a summary of the most important findings

Useful things to include in your written report

Produce graphs based on the numerical output of your simulations to support your findings, especially for experimentation
Explain the purpose of the tests carried out, whether the results met your initial expectations, and the lessons learned
Motivate your choice(s) of route planning algorithms implemented and discuss their impact on the performance of the system

Final Points

The report carries a 25% weight in the final mark
There is no minimum number of pages required for the report
Present your findings and results clearly
Submit the report as a PDF file
Students are often worried about losing marks
Indeed our own assessment descriptions often talk of losing marks
But let's not forget, you start with zero

Announcement

ACM ICPC
(international collegiate programming contest)

Prestigious, international programming contest for students in teams of 3.
This year the first regional will be in Sweden on 29-30 November. Details at http://www.nwerc.eu/
UoE agreed to fund travel for a team to compete!
There will be try outs and training sessions in the coming month.

If you'd like to be a part of this, email Hugh Leather <hughleat@gmail.com> by end of today.

Array & String Handling

Allocating arrays & matrices

Last time we discussed how you can allocate memory dynamically for an array of N elements.


int N;
int *array;

array = (int *) malloc(N * sizeof(int));

But, how do you allocate memory for a matrix?

Common mistake


int N;
int **matrix;

matrix = (int **) malloc(N * N * sizeof(int));

Matrix allocation

Remember, you are trying to allocate a pointer to an array of pointers to integers

Matrix allocation

Approach 1


int i,N;
int **matrix;

matrix = (int **) malloc(N * sizeof(int*));  // rows
for(i = 0; i < N; i++)   
   matrix[i] = (int *) malloc(N * sizeof(int));  // columns

// access the (i,j) element by
matrix[i][j] = ...

Approach 2 (define the matrix as an array)


int i,N;
int *matrix;

matrix = (int *) malloc(N * N * sizeof(int));  

// and access the (i,j) element by
matrix[i*N+j] = ...

Matrix deallocation

When using the first approach, first deallocate the memory allocated for each row



for(i = 0; i < N; i++)   
   free(matrix[i]);
free(matrix);

When using the second approach, simply


free(matrix);
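A sketch of the first approach wrapped into a pair of functions, with allocation failures handled so that a partially built matrix is not leaked. The names `allocMatrix` and `freeMatrix` are hypothetical; rows are zero-initialised with `calloc` as a convenience.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates an n x n matrix of ints, all zero.
   Returns NULL if any allocation fails, after releasing what was built. */
int **allocMatrix(size_t n) {
    assert(n > 0);
    int **matrix = malloc(n * sizeof *matrix);      /* row pointers */
    if (matrix == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        matrix[i] = calloc(n, sizeof **matrix);     /* one zeroed row */
        if (matrix[i] == NULL) {
            while (i > 0) {
                free(matrix[--i]);                  /* undo earlier rows */
            }
            free(matrix);
            return NULL;
        }
    }
    return matrix;
}

/* Frees each row first, then the array of row pointers. */
void freeMatrix(int **matrix, size_t n) {
    if (matrix == NULL) {
        return;
    }
    for (size_t i = 0; i < n; i++) {
        free(matrix[i]);
    }
    free(matrix);
}
```

Keeping allocation and deallocation in matching functions makes it harder to forget the per-row `free` calls.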

What about arrays of structures?

Imagine the following


typedef struct {
  int groupSize;
  float* marks;
} GROUP;

int nGroups = 5;
GROUP *g;

The same principle applies



g = (GROUP *) malloc(nGroups * sizeof(GROUP));
for (i = 0; i < nGroups; i++) {
   fscanf(stdin, "%d", &g[i].groupSize);
   g[i].marks = (float *) malloc(g[i].groupSize * sizeof(float));
   ...
}
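Deallocation follows the same principle in reverse: free each inner `marks` array before freeing the array of structures. The sketch below repeats the `GROUP` type so that it compiles on its own; `allocGroups` and `freeGroups` are hypothetical helpers, and a fixed group size replaces the `fscanf` call to keep the example self-contained.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int groupSize;
    float *marks;
} GROUP;

/* Allocates nGroups groups, each with room for groupSize marks.
   Returns NULL on failure, after releasing anything already allocated. */
GROUP *allocGroups(int nGroups, int groupSize) {
    assert(nGroups > 0 && groupSize > 0);
    GROUP *g = malloc((size_t) nGroups * sizeof *g);
    if (g == NULL) {
        return NULL;
    }
    for (int i = 0; i < nGroups; i++) {
        g[i].groupSize = groupSize;
        g[i].marks = malloc((size_t) groupSize * sizeof *g[i].marks);
        if (g[i].marks == NULL) {
            while (i > 0) {
                free(g[--i].marks);
            }
            free(g);
            return NULL;
        }
    }
    return g;
}

/* Releases memory in reverse order of allocation: inner arrays first. */
void freeGroups(GROUP *g, int nGroups) {
    if (g == NULL) {
        return;
    }
    for (int i = 0; i < nGroups; i++) {
        free(g[i].marks);
    }
    free(g);
}
```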

String handling

Strings are simply arrays of characters terminated by the ASCII null character '\0'.


char *str;
char string[100];

str = (char*) malloc(100*sizeof(char));

C provides a set of functions in the standard library, that are useful for manipulating strings.
Typical operations: copying, tokenizing, comparing, searching, etc.
Most of these are given in the <string.h> header file, but a few exist in <stdlib.h> as well

Functions you may use

Copying


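char* strcpy(char *dst, const char *src);

Copies src, including the terminating '\0', into dst


char* strncpy(char *dst, const char *src, size_t len);

Copies at most len characters of src into dst

NB: if src has len or more characters, dst will not be null-terminated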
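Because `strncpy` does not add a terminating '\0' when the source is too long, a common pattern is to terminate the destination explicitly. The wrapper `copyString` below is a hypothetical helper sketching that pattern, not a standard library function.

```c
#include <assert.h>
#include <string.h>

/* Copies src into dst, which has room for dstSize bytes, truncating if
   necessary and always null-terminating.
   Returns 1 if the whole string fitted, 0 if it was truncated. */
int copyString(char *dst, size_t dstSize, const char *src) {
    assert(dst != NULL && src != NULL && dstSize > 0);
    strncpy(dst, src, dstSize - 1);
    dst[dstSize - 1] = '\0';    /* strncpy does not do this on truncation */
    return strlen(src) < dstSize;
}
```

For example, copying "abcdef" into a 4-byte buffer leaves "abc" in the buffer and returns 0.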

Functions you may use

Comparing


int strcmp(const char *str1, const char *str2);

<0 if the first character that does not match has a lower value in str1 than in str2
0 if the contents of both strings are equal
>0 if the first character that does not match has a greater value in str1 than in str2

Functions you may use

Searching


char* strstr(const char *str1, const char *str2);

Returns a pointer to the first occurrence of str2 in str1, or NULL if str2 does not occur in str1

Examining


size_t strlen(const char *str);

Returns the length of str, not counting the terminating '\0'
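As a small illustration of using `strstr` and `strlen` together, the sketch below counts how many times one string occurs in another. `countOccurrences` is a hypothetical helper, and counting non-overlapping matches is a design choice made for this example.

```c
#include <assert.h>
#include <string.h>

/* Counts non-overlapping occurrences of needle in text. */
size_t countOccurrences(const char *text, const char *needle) {
    assert(text != NULL && needle != NULL);
    size_t needleLen = strlen(needle);
    if (needleLen == 0) {
        return 0;               /* an empty needle would match forever */
    }
    size_t count = 0;
    const char *p = text;
    while ((p = strstr(p, needle)) != NULL) {
        count++;
        p += needleLen;         /* resume searching after this match */
    }
    return count;
}
```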

Functions you may use

Tokenising: splitting a string into tokens according to some delimiter(s)


char *strtok(char *str, const char *delim);

str broken into smaller strings
delim may contain different characters to be used as delimiters
Returns a pointer to the next token found, or NULL if there are no more tokens
Can be called multiple times to find all tokens

Functions you may use

Example


char str[100] = "The quick brown fox jumps over the lazy dog";  // not const: strtok modifies str
const char delim[2] = " ";
char *token;
   
token = strtok(str, delim);   // gets first token
   
while(token != NULL) {   // retrieve all tokens; stop when no more found 	
   printf("%s\n", token);
   token = strtok(NULL, delim);
}

Converting strings to numbers

Converting to floating-point numbers


double strtod(const char *str, char **ptr);
float strtof(const char *str, char **ptr);

str: the string to convert
ptr: set to point to the first character after the converted number

Example:


char str[11] = "9.50 marks";
char *ptr;
float fVal;

fVal = strtof(str, &ptr);  
printf("Number:%.2f\t String:%s\n", fVal, ptr);
// Number:9.50     String: marks

Converting strings to numbers

Converting to (long) integer numbers


long int strtol(const char *str, char **ptr, int base);

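str: the string to convert
base: the numerical base in which the number is written
ptr: set to point to the first character after the converted number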

base must be between 2 and 36
if base is 0, the expected form is a decimal/octal/hexadecimal constant

Example:


char str[11] = "60 seconds";
char *ptr;
long int liVal;

liVal = strtol(str, &ptr, 10);  
printf("Number:%ld\t String:%s\n", liVal, ptr);
// Number:60     String: seconds

Converting strings to numbers

Question: what will be the output of the following?


char *str;
double fVal;
long int liVal;

liVal = strtol("20.00mm", &str, 10);  
printf("Number:%ld\t String:%s\n", liVal, str);

fVal = strtod("1e+2 litres", &str);
printf("Number:%.1lf\t String:%s\n",fVal,str);

liVal = strtol("FFGH", &str, 16);
printf("Number:%ld\t String:%s\n", liVal, str);

Converting strings to numbers

Answer:


Number:20	 String:.00mm
Number:100.0	 String: litres
Number:255	 String:GH

Note: A good resource for understanding other string manipulation functions is available here.
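The end pointer returned by `strtol` is also useful for input validation: if it does not point at the end of the string, the input contained something other than a number. The helper `parseLong` below is a hypothetical sketch of this check; it also uses `errno` to detect values that do not fit in a `long`.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parses the whole of str as a base-10 long.
   Returns 1 and stores the value in *out on success; returns 0 if str
   contains no number, has trailing characters, or is out of range. */
int parseLong(const char *str, long *out) {
    assert(str != NULL && out != NULL);
    char *end;
    errno = 0;
    long value = strtol(str, &end, 10);
    if (end == str) {
        return 0;               /* no digits were consumed */
    }
    if (*end != '\0') {
        return 0;               /* trailing characters, e.g. ".00mm" */
    }
    if (errno == ERANGE) {
        return 0;               /* value does not fit in a long */
    }
    *out = value;
    return 1;
}
```

Note that `strtol` skips leading whitespace, so this sketch accepts " 60" as well as "60"; whether that should count as valid input is a decision for your own validation rules.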

Optimising compilation

We have already discussed code optimisation

Benchmarking
Profiling

It is possible to further optimise your code at compilation

try to minimise program's execution time
try to minimise the amount of memory occupied (less common)
minimise the consumed power (for mobile devices)

Compiling and Linking

Compiling is not the same as creating an executable
Building an executable involves compilation and linking
Your code may compile without errors, but it may fail during the linking phase

Compiling and Linking

Compilation

Turning the source code into an 'object' file.
This is not executable, it only contains the corresponding machine language instructions
If you have multiple files, you will have multiple objects


# gcc -c -o "simulator.o" "simulator.c"
# gcc -c -o "utils.o" "utils.c"

The "-c" flag specifies that no linking should be done at this stage

Compiling and Linking

Linking

The process of creating a single executable from multiple object files
Finds references for the functions that are used in one file but were defined in another


# gcc -o "simulator" simulator.o utils.o

Compiling and Linking

This approach allows building large programs without having to recompile everything each time a file is changed
Conditional compilation: compile only the source files that have changed
Conditional compilation works well when you use an IDE.
Otherwise you will have to manually create a makefile and use the make utility, which determines what needs to be recompiled

Optimising compilation

When you compile your code, you can set some flags that instruct the compiler to perform some optimisation
Note that this often takes more time and requires more memory, but your executable may run faster
Example:


# gcc -O3 -c -o "simulator.o" "simulator.c"

-O<level> instructs the compiler to perform some optimisation.

Optimising compilation

-O1 - tries to reduce code size and execution time, without performing optimizations that increase compilation time significantly.
-O2 - performs several optimisations that do not involve a space-speed trade-off. Increases both compilation time and the performance of the generated code.
-O3 - optimises even more.
-O0 - reduces compilation time and makes debugging produce the expected results (default).

The GCC manual page gives you more in depth information about the above.

Multiple Files

Question: Should you spread your implementation across multiple source code files?
There may be some good reasons to do so:

Increase code reusability
Reduce compilation time
Could help navigating source code faster

Multiple Files

Not suggesting you should not, but do so for a good reason
Given the size of this project, you could try to use as few files as possible
Move type definitions, functions, etc. to separate files when that seems necessary

Should I develop code with or without an IDE?

This shouldn't make a difference, but you may have good reasons for choosing one of the two approaches.
Coding using a plain text editor (e.g. vi, nano)

You can easily code remotely (over ssh) on e.g. a DiCE machine
May have to write a makefile if working with multiple source files
Better control on compilation optimisation

Should I develop code with or without an IDE?

Using IDEs

Nicer keyword highlighting
Some auto-complete braces/brackets/parentheses
Some may have integrated help for functions
Some may warn about certain syntax errors as you type
Perhaps easier if you are not very experienced in C

If you decide to code using an IDE, it's entirely up to you which one you choose (NetBeans C/C++ pack, CodeLite, Eclipse CDT, etc.)
Eclipse CDT is installed on DiCE machines

Questions?

Performance Evaluation

The CSLP requirement includes (among others):

Computing summary statistics
Supporting experimentation with certain parameters

Performing these tasks should help you get a good understanding of the bin collection process and how this may be improved in a practical setup
Today we will look at performance evaluation aspects from a more general perspective

System/Process Implementation

Designing and implementing logistics operations, complex processes, and systems involves several steps.
There is often a feedback loop involved, which allows the system to be refined, improved, or extended.

Requirements Analysis

Understand the problem domain and specifications, and identify the key entities involved.
Build an abstract representation of the system to be able to handle various input scenarios.

System Design

Dividing the system into components; choosing suitable methodologies for implementing each component.
Defining appropriate data structures, input/output formats, etc.

Development

This is the actual implementation work and is typically coupled with some preliminary testing.
For source code, janitorial work, refactoring and some optimisation are also performed at this stage.

Testing

Validation is performed once the system is partially or entirely developed, along with benchmarking and profiling.
A system's performance evaluation is undertaken (experimentation with different inputs, distributions).

Deployment

Once the tool (planner, simulator, etc.) has been thoroughly tested it can be deployed in a real setting.
The input will be based on actual data and inputs may change over time (e.g. based on certain events).

Monitoring

Once the system is operational, it is possible to gather real measurements and use those to refine the design.
If new requirements are identified during operation, the system can be further extended.

The Bin Service Process

Your simulator will be implementing a good bit of what could become a real logistics system.
Unfortunately you will not have the opportunity to experiment with real data, but (time permitting) you have the flexibility to develop additional features.

Performance Evaluation

We have discussed the requirements, as well as different design and development aspects for your simulator.
We will now look into performance evaluation issues. Some of the things I will present may not be needed for this assignment, but will likely prove useful later.

Performance Evaluation

Generally speaking, this is about quantifying the performance of a system
The first step is to identify the relevant metrics, i.e. measurable quantities that capture properties of interest

This could be the throughput/delay of a communications link, the power consumption of a mobile device, the memory used by a software application, etc.
For CSLP we are interested in the average trip duration, trip efficiency, number of trips per schedule, and percentage of overflows.

Metrics

It is essential to understand, for each metric, whether it should be small or large.
It is also important to be aware of the goals of the evaluation:

Improve the dimensioning/parametrisation of a system or process
Compare how different designs perform under different inputs and choose the best one

Methodologies

When designing a system, performance evaluation can be conducted through one or more of the following methodologies

Numerical analysis - plugging some numerical values into a mathematical model of the system and computing the metrics of interest
Simulation - constructing a simplified model of a more complex real system and simulating its behaviour; typically fast, but neglecting certain practical aspects
Experimentation - Analysing the performance of a system through measurements. Assessing performance under exceptional circumstances may be infeasible

Accuracy

It is advisable that the assumptions made for the evaluation campaign are well documented, to ensure the tests performed are reproducible.
You are working with a stochastic simulator and thus there will be some variability in the results of different tests with the same input.
For this practical you have been asked to give average values of a set of metrics.
In rigorous studies, it is necessary to also provide some confidence intervals for the results.

Summary Statistics

Histograms are graphical representations of the distribution of a set of measurements.

Example: distribution of the h-index of Nobel-prize recipients in Physics between 1985 and 2005.

Source: J.E. Hirsch, "An index to quantify an individual's scientific research output", Proc. NAS, 2005.

h-index: the largest number h such that the researcher has h papers with at least h citations each.

Histograms

In mathematical terms, the histogram is a function that counts the number of observations in different categories (bins)
The number of bins is often chosen with the square-root rule

k = \lceil \sqrt{n} \rceil

where n is the number of samples in the data set x. Each bin then has width

h = \frac{max(x) - min(x)}{k}

Mean and Standard Deviation

Computing the mean (average) of a set of measurements is straightforward:

μ = \frac{1}{n} \sum_{i = 1}^{n} x_{i}

The standard deviation gives a measure of the variation of the measurements from the mean:

σ = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(x_{i}- μ)}^{2}}

Confidence Intervals

These can be used to quantify the uncertainty about the average of a set of measurements subject to randomness.
When computing averages across multiple simulations, you are gathering samples to estimate an unknown population mean.
You choose a significance level, which determines how confident you can be that the true value lies within the interval.
E.g. for a significance level of 0.05, you will obtain a 95% confidence interval (typically used in practice).

Confidence Intervals

The width of the confidence interval is affected by:

sample size,
population variability (standard deviation),
confidence level chosen.

Central Limit Theorem: For a large sample size, the distribution of the sample mean approaches a normal distribution.
The expected value of the sample mean equals the population mean.

Confidence Intervals

A quick method to compute a CI is:

μ  \pm z_{α/2} \frac{σ}{\sqrt{n}}

where z_{α/2} is the critical coefficient for a significance level α (i.e. a confidence level 1 − α) and is obtained from z-score tables.

Example: Sample size 20, mean 10, standard deviation 1.45, 95% confidence level. The critical coefficient is the z-score whose table entry (the area between 0 and z) is 0.475, i.e. 1.96.

CI is 10 ± 1.96 · 1.45/√20 ≈ 10 ± 0.64

i.e. [9.36, 10.64]

Confidence Intervals

Plotting CIs

CSLP statistics

Average Trip Duration

Compute the average duration of a lorry journey throughout the total simulation time.
The metric may differ between areas, so compute it both per area and globally.
When experimenting, you will be able to gain understanding of how bin thresholds impact route lengths, as well as how the implemented route planning algorithms perform.

CSLP statistics

Trip Efficiency

Compute the volume collected per unit of travel time.
This is somewhat related to the previous metric.
When experimenting, you can also examine how waste disposal rates affect the efficiency.
Again, per area and global statistics will be useful.
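
One possible way to collect the average trip duration and the trip efficiency is a small accumulator, kept once per area and once globally. This is only a sketch: the struct, the function names and the units are assumptions, and efficiency is taken here as total volume over total travel time rather than as an average of per-trip ratios.

```c
#include <assert.h>
#include <stddef.h>

struct trip_stats {
    double total_duration;  /* sum of trip durations (assumed unit: minutes) */
    double total_volume;    /* sum of collected volume (assumed unit: cubic metres) */
    unsigned trips;         /* number of trips recorded */
};

/* Record one completed lorry trip. */
void record_trip(struct trip_stats *s, double duration, double volume)
{
    assert(s != NULL && duration >= 0.0 && volume >= 0.0);
    s->total_duration += duration;
    s->total_volume += volume;
    s->trips++;
}

/* Average duration per trip; 0 if no trip has been recorded. */
double average_trip_duration(const struct trip_stats *s)
{
    assert(s != NULL);
    return s->trips ? s->total_duration / s->trips : 0.0;
}

/* Volume collected per unit of travel time; 0 if no time was spent. */
double trip_efficiency(const struct trip_stats *s)
{
    assert(s != NULL);
    return s->total_duration > 0.0 ? s->total_volume / s->total_duration : 0.0;
}
```

Calling `record_trip` on both the area's accumulator and a global one after each trip yields the per-area and global statistics from the same code.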

CSLP statistics

Number of Trips per Schedule

Compute the average number of trips a lorry performs to service all the bins whose thresholds have been reached at the start of the schedule.
Area size, disposal rate and (importantly) lorry capacity will impact this.
If trips to a small number of bins are to be performed to complete the service, the efficiency of the process may be affected.
On a new trip within the same schedule you can choose to go ahead with the initial plan or check if other bins' thresholds have been exceeded in the meantime.

CSLP statistics

Percentage of overflows

This metric should reflect whether the service is scheduled frequently enough.
A bin overflows once its capacity is exceeded and this event is marked only once.
For each area, you can count how many bins have overflowed at the start of a schedule.
There is some volatility in this metric: some bins may overflow while the lorry is en route to them, and you may miss those in the count.
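
A sketch of marking an overflow only once, using a flag per bin. The `struct bin` layout and the function names are hypothetical; your own data structures will differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct bin {
    double capacity;    /* maximum volume the bin can hold */
    double content;     /* current volume in the bin */
    bool overflowed;    /* set the first time capacity is exceeded */
};

/*
 * Add waste to a bin. Returns true only for the disposal event that first
 * pushes the content above the capacity, so the overflow is reported once.
 */
bool dispose(struct bin *b, double volume)
{
    assert(b != NULL && volume >= 0.0);
    b->content += volume;
    if (b->content > b->capacity && !b->overflowed) {
        b->overflowed = true;
        return true;
    }
    return false;
}

/* Percentage of the n bins that are currently marked as overflowed. */
double overflow_percentage(const struct bin *bins, size_t n)
{
    assert(bins != NULL && n > 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (bins[i].overflowed)
            count++;
    return 100.0 * (double)count / (double)n;
}
```

Clearing the flag when a bin is emptied is left out here; whether and when to reset it is a design decision for your simulator.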

Questions?

Review of Part 1

Submissions

6 out of 29 students submitted at least something.
That's roughly a 20% submission rate; I was expecting something closer to 50%.
3 out of the 6 submissions did not ask any explicit questions nor did they highlight any aspects on which they wanted feedback.
1 submission did not compile and did not have a proper declaration of main.

Multiple Files

Header files

Number of files	Frequency
16	1
14	1
10	1
8	1
2	1
0	1

Multiple Files

Source files

Number of files	Frequency
16	1
13	1
11	1
8	1
4	1
2	1

Multiple Files

There seems to be a preference for using a relatively large number of files, given the small size of this project
This is likely due to some of you taking an OOP-like design approach
There is nothing wrong with that, but pay attention to memory management
Using more or fewer files will not be penalised; just explain why you chose to implement things in a certain way
If refactoring is a reason, do emphasize that

The READMEs

Ranged from very basic ones, containing a couple of lines, to very detailed ones.
Most of them explained how to build and execute the simulator, as expected.
Some acknowledged the limitations of the code and problems known at the time of submission.
Some did not state explicitly at which stage of the development they were.

Random Goodness

“...using some pretty nasty typecasts though it seems to me that all of C is this way...”
“Output - is a mess”

Random Not-So-Goodness

Low-level coding decisions:
- Prone to change and you will forget to update the README
- Should be in comments in the source code file concerned – some code lacked commenting altogether
High-level structure: you are more likely to remember to change the README in case of a major re-structuring

Random Not-So-Goodness

Make sure you read the requirements carefully
- On two occasions the first command line parameter was not the input script, but a keyword that introduced the input script
- Although working to some extent, one instance did not implement any command line parsing at all.
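
A minimal sketch of taking the input script from the first command line parameter. The helper name `input_script_path` is an assumption; only the position of the parameter comes from the point above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the input script path, i.e. the first command line parameter,
 * or NULL if the user did not supply one.
 */
const char *input_script_path(int argc, char *const argv[])
{
    assert(argv != NULL);
    if (argc < 2 || argv[1] == NULL || argv[1][0] == '\0')
        return NULL;
    return argv[1];
}
```

In `main`, a NULL result would typically lead to a usage message on stderr and a non-zero exit status before anything else is attempted.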

Refactoring

2 out of 6 READMEs contained mentions of refactoring
Both with reference to future refactoring:
- Either promising to refactor later or
- Having done a good bit of refactoring already, but planning to do more
It is still early days; however, refactoring is something you should be trying to do constantly

Invalid Input

Try to think of exceptions which you really do not believe can happen under normal executing conditions
The user may simply make some typing errors when producing the input
Some parameters may have been given in an order different from the expected one
When is this a serious problem?
Distinguish between warnings/errors where possible

Invalid Input

What should you do if you discover you have incomplete information during a simulation run?
For example, you attempt to retrieve the disposalRate and find that it is unavailable
This is not problematic for parsing, because the user may have simply forgotten to specify the disposal rate
However, if you validate the input before running the simulation, then it really becomes a problem to find a missing disposal rate during the simulation
The simulation should not have been started since the validation should have uncovered the error

Further checks

Check the sign of numbers
Check whether it is indeed numbers you find when numbers are expected
Check if a number has the type you expect (integer/real)
Be careful with hours/minutes conversions
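
The first three checks can be sketched with the standard `strtol` and `strtod` functions. The helper names are illustrative, and treating zero or negative reals as invalid is an assumption that may not suit every parameter.

```c
#include <assert.h>
#include <errno.h>
#include <math.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Parse a non-negative integer. Rejects empty strings, trailing characters
 * (so "3.5" and "12abc" fail), negative values and out-of-range values.
 */
bool parse_non_negative_long(const char *text, long *out)
{
    assert(text != NULL && out != NULL);
    char *end;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (end == text || *end != '\0')    /* no digits, or junk after them */
        return false;
    if (errno == ERANGE || value < 0)   /* overflow, or wrong sign */
        return false;
    *out = value;
    return true;
}

/* Parse a strictly positive, finite real number. */
bool parse_positive_double(const char *text, double *out)
{
    assert(text != NULL && out != NULL);
    char *end;
    errno = 0;
    double value = strtod(text, &end);
    if (end == text || *end != '\0')
        return false;
    if (errno == ERANGE || !isfinite(value) || value <= 0.0)
        return false;
    *out = value;
    return true;
}
```

Both functions accept leading whitespace because `strtol` and `strtod` skip it. Returning a boolean lets the caller decide whether a failure is a warning or an error.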

Input scripts

Only some of you have authored input scripts
I recently made some examples available, but it is essential that you produce your own test scripts
Graph size is not the only thing that matters
Think of scenarios where e.g. route planning algorithms may have a hard time

Pleasant surprises

In no particular order:

One submission came with reference documentation generated with Doxygen
One submission used a fancy variant of a Mersenne Twister for pseudo-random number generation
One simulator used revision control (Git) during the development

Not-so-pleasant surprises

2 out of 6 submissions did not implement input parsing and validation at all.
Although not marked and aimed at helping you, only a few (20%) submitted part 1.
I have met in class (at least on one occasion) those who submitted something.
Low interest vs. difficult course?

YAGNI

A final piece of advice
Try to keep things simple: Do the simplest thing that could work
- Then rethink/refactor if it does not work
YAGNI: You Aren't Gonna Need It
- Try not to over-complicate things by over-anticipating future requirements

P(X \leq E[X])	= P(X \leq 1/λ)
	= F(1/λ, λ)
	= 1 - e^{-λ \cdot \frac{1}{λ}}
	= 1 - \frac{1}{e}