Computer Science
Large Practical

Introduction

 

Paul Patras

Housekeeping

  • Website: http://www.inf.ed.ac.uk/teaching/courses/cslp/
  • One lecture per week
    • When: Fridays, 12:10–13:00
    • Where: David Hume Tower, George Square - Map Room LG.06
  • Please ask questions at any time
  • Coursework accounts for 100% of your mark
  • Office hours: flexible, but email me first (paul.patras@ed.ac.uk)

Restrictions (I)

  • CSLP is a third-year undergraduate course only available to third-year undergraduate students.
  • CSLP is not available to visiting undergraduate students, or to fourth-year undergraduate students and MSc students, who have their own individual projects.

Restrictions (II)

  • Third-year undergraduate students should choose at most one large practical, as allowed by their degree regulations.
    • Computer Science, Software Engineering and Artificial Intelligence large practicals
    • On most degrees a large practical is compulsory.
    • On some degrees (typically combined Hons) you can do the System Design Project instead, or additionally.
  • See Degree Programme Tables (DPT) in the Degree Regulations and Programmes of Study (DRPS) for clarifications.

About this course

  • So far most of your practicals have been small exercises
  • This practical is larger and less rigidly defined than previous coursework
  • The CSLP tries to prepare you for
    • The System Design Project (in the second semester)
    • The Individual Project (in fourth year).

Requirements

  • There is:
    • a set of requirements (rather than a specification);
    • a design element to the course; and
    • more scope for creativity.
  • The requirements are more realistic than most coursework
  • But still a little contrived in order to allow for grading

How much time should I spend?

  • 100 hours, all in Semester 1, of which
  • 8 hours lecture/demonstrating
  • 92 hours practical work, of which
    • 70 hours non-timetabled assessed assignments
    • 22 hours private study/reading/other

How much time is that really?

  • 13 weeks remaining in semester 1 (Weeks 2 to 14)
  • 7 * 13 = 91 hours
  • You can think of it as 7 hours/week in the first semester
  • This could be one hour a day including weekends
  • You could work 7 hours in a single day
    • for example work 9:00-17:00 with an hour for lunch

Managing your time

It is unlikely that you will want to arrange your work on your large practical as one day where you do nothing else, but one day per week all semester is the amount of work that you should do for the course.

Course lecturers have been asked not to let deadlines overlap Weeks 11-14 because students are expected to be concentrating on their large practical in that time.

Deadlines

The Computer Science Large Practical is split in two parts:

 

  • Part 1
    • Deadline: Thursday 23rd October, 2013 at 16:00
    • Part 1 is zero-weighted: it is just for feedback.
  • Part 2
    • Deadline: Thursday 18th December, 2013 at 16:00
    • Part 2 is worth 100% of the marks.

Scheduling work

  • It is not necessary to keep working on the project right up to the deadline.
  • For example, if you are travelling home for Christmas you might wish to submit the project early.
  • In this case ensure that you start the project early.
  • The coursework submission is electronic so it is possible to submit remotely.

Early submission credit

  • To motivate good project management, planning, and efficient software development, marks above 90% are reserved for work that is submitted early (specifically, one week before the deadline for Part 2).
  • Work submitted less than a week before the deadline does not qualify as an early submission, and the mark for this work will be capped at 90%. Thus, the mark may be 90%, but it may not be higher than this.
  • Regardless of when it is submitted, every submission is assessed in exactly the same way, but submissions which attract a mark of above 90% and were not submitted early have this mark brought down to 90%.

Early submission credit

Question:
Can I submit both an early submission version and a version for the end deadline and have the marks for whichever is highest?

 

Answer:
No. Before the early submission deadline you have to choose whether or not you are going to hope for a mark above 90% then, or have an extra week to accumulate more marks up to 90%. The submission marked will be the latest one made before the deadline.

Extensions

The Computer Science Large Practical

The CSLP Requirement

  • Create a command-line application in C
  • The purpose of the application is to implement a stochastic, discrete-event, discrete time simulator
    • I'll come back to these terms
  • This will simulate the bin collection process in a “smart” city, with bin locations, capacities, etc. specified by input

The CSLP Requirement (Cont'd)

  • The output will be the sequence of events that have been simulated as well as some summary statistics
  • Input and output formats, and several other requirements are specified in the coursework handout
  • It is your responsibility to read the requirements carefully

Why Simulators?

  • Stochastic simulation is an important tool in physics, medicine, computer networking, logistics, and many other fields.
  • Particularly useful to understand complicated processes.
  • Can save time, money, effort and even lives.
  • Allow running inexpensive experiments of exceptional circumstances that might otherwise be infeasible.
  • However, the simulator must have an appropriate model for the real system under investigation, to produce meaningful results.

Example: preventing Internet outages

Source: Internet Census – World map of 24-hour relative average utilization of IPv4 addresses.


Last month CBC news reported that in the U.S. Verizon dumped 15,000 Internet destinations for ~10 minutes.

Preventing Internet outages

  • Global Internet routing table has passed 512K routes
  • Older routers have limited size routing tables; when these fill up, new routes are discarded
  • Large portions of the Internet become unreachable, thus online businesses are losing money
  • Upgrading equipment is expensive and takes time; workarounds are being proposed
  • Ensuring the proposed solutions will work is not trivial

Preventing Internet Outages

  • Testing patches in live networks poses the risk of further disruption
  • Waiting for the next surge is not acceptable
  • Forwarding all traffic for new routes through a default interface can have serious implications on routing costs
  • With simulation it is possible to generate synthetic traffic and test patches without disrupting the network
  • It is also possible to evaluate different metrics such as round-trip delays, throughput, and the propagation latency of routing changes

Why C?

  • Part of the challenge of this practical is to learn a new programming language
  • This is something you should expect when taking a job as a software developer in a company that has clear incentives to use a particular language.
  • C is efficient (low execution time), portable, excellent for working directly with the hardware, and also usable for web programming

Why C?

  • Currently ranked the most popular programming language --TIOBE Index for September 2014

Code Sharing

  • Code sharing sites are a great resource but please refrain from using them for this practical
  • This is an individual practical, so code sharing is not allowed, even if you are not the one benefiting
  • It is somewhat likely that in the future you will be unable to publicly share the code you produce for your employer

Why Simulate Bin Collection?

  • Waste management is a major operation in many cities
  • Part of ongoing "smart cities" initiatives, bins are being equipped with occupancy sensors to improve scheduling and route planning for lorries
  • Periodic collection strategies, which are the current practice, have limitations:
    • Lorries make unnecessarily frequent trips and sometimes take lengthy routes → increased operation cost and pollution
    • User daily demand varies and could cause overflows before scheduled pick-up → increased health hazards and cleaning costs

Why Simulate Bin Collection?

  • With simulation we can investigate the impact of different pick-up intervals and bin occupancy thresholds used to trigger scheduling.
  • In this practical we will evaluate waste collection efficiency in terms of volume collected per unit of travel time, percentage of overflows, etc.
    • small thresholds → longer trips, but cleaner streets
    • large thresholds → cost efficient, but risk of overflows

Your Simulator

  • Your simulator will be a command-line application
  • It will accept an input text file with the description of the serviced areas and a set of global parameters: lorry capacity, service time, bin capacity, disposal rate, disposal volume
  • It should output information about occurring events
  • The strict formats for both input and output are described in the coursework handout
  • You will also need to produce summary statistics that you will later analyse

Simulation Algorithm

The underlying simulation algorithm is fairly simple:
WHILE {time ≤ max time}
   determine the set of events that may occur after the current state
   delay ← choose a delay based on the nearest event
   time ← time + delay
   modify the state of the system based on the current event
ENDWHILE
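The loop above might be sketched in C roughly as follows. This is only an illustration under stated assumptions: the Event struct, the flat array of pending events, and the names next_event and run are hypothetical and do not come from the coursework handout. The state update is left as a comment.

```c
#include <stddef.h>

/* Hypothetical event record: when it fires (in minutes) and what it is. */
typedef struct {
    int time;
    int kind;
} Event;

/* Index of the pending event with the smallest time, or -1 if none. */
int next_event(const Event *pending, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++)
        if (best < 0 || pending[i].time < pending[best].time)
            best = (int) i;
    return best;
}

/* Fire events in time order until none are left or the nearest one lies
   beyond max_time. Returns the number of events fired and stores the
   time of the last fired event in *clock_out. */
size_t run(Event *pending, size_t n, int max_time, int *clock_out)
{
    size_t fired = 0;
    *clock_out = 0;
    while (n > 0) {
        int i = next_event(pending, n);
        if (pending[i].time > max_time)
            break;
        *clock_out = pending[i].time;   /* time <- time + delay */
        /* ... modify the state based on pending[i].kind here ... */
        pending[i] = pending[n - 1];    /* drop the fired event */
        n--;
        fired++;
    }
    return fired;
}
```

A linear scan for the nearest event is the simplest choice; a priority queue would serve the same purpose if the number of pending events grows large.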

Simulation Algorithm

WHILE {time ≤ max time}
    ...
    delay ← choose a delay based on the nearest event
    ...
ENDWHILE
  • Some events are deterministic, some occur with exponentially distributed delays
  • I'll explain this in more detail, but for now drawing from an exponential distribution can be done by:
  −(mean) ∗ log(random(0.0, 1.0))
  • Where mean is the average delay, which is the reciprocal of the rate

Components of the Simulation

Input

  • Global parameters:
    1. Lorry capacity
    2. Service time
    3. Bin capacity
    4. Disposal rate
    5. Disposal volume
    6. Number of areas

Components of the Simulation

Input

  • Area description and dynamic parameters:
    1. Collection frequency
    2. Bin occupancy threshold
    3. Number of bins
    4. Matrix representation of bin map

Components of the Simulation

Lorries

  • Each area is serviced by a single lorry
  • Lorries are scheduled at fixed time intervals (once, twice, or n times per day). This is expressed as a number of trips per hour
  • Lorries have a fixed capacity, expressed in cubic metres

Components of the Simulation

Bins

  • Community bins have a fixed capacity expressed in m³
  • Bins have occupancy sensors, and an occupancy threshold (a fraction) is used in each area to trigger collection
  • There is a fixed service time (expressed in minutes) required to empty a bin, irrespective of its occupancy

Components of the Simulation

Users

  • We assume users dispose of rubbish bags of fixed volume, expressed in m³
  • Bags are disposed of at exponentially distributed intervals
  • The mean disposal rate is expressed as a number of bags per hour

Components of the Simulation

Area map

  • For each area, we consider a graph representation of the bins' locations and the distances between them.
  • The graph corresponding to each area is given as an input in matrix form
  • The (0,0) element represents the waste processing facility and it is both the start and end point of a service route, i.e. we consider routes to be circular
  • The distances between any two locations are expressed in minutes

Example

  • Matrix representation
0	8	65535	65535	7	65535	
8	0	5	65535	65535	2	
65535	5	0	4	65535	6		
65535	65535	4	0	2	65535	
7	65535	65535	2	0	3		
65535	2	6	65535	3	0	

Components of the Simulation

Events

  • Your simulator will produce a sequence of events
    • A bag may be disposed at a bin
    • A lorry may leave from a location
    • A lorry may arrive at a location
    • A bin may be emptied at a location
    • A particular bin may overflow
    • A bin's occupancy threshold may have been exceeded

Components of the Simulation

Events

  • Your simulator will output a sequence of events in the following format:

bag disposed at bin ‹bin_no› at time ‹time›

bin ‹bin_no› overflowed at time ‹time›

lorry ‹lorry_no› leaves location ‹location_id› at time ‹time›

lorry ‹lorry_no› arrives at location ‹location_id› at time ‹time›

bin ‹bin_no› emptied at time ‹time›

        

Components of the Simulation

Events

  • Depending on the actual event in your simulation, you will replace the ‹lorry_no›, ‹bin_no›, ‹location_id› and ‹time› with real values, e.g.:
  • lorry 1 leaves location 1.0 at time 4.00
    lorry 1 arrives at location 1.1 at time 4.1
    bin 1.1 emptied at time 4.15
            
  • This is valid output in the sense that it is formatted correctly
  • It may be invalid for other reasons, for example the occupancy of bin 1.1 may not have exceeded the predefined threshold
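As a rough sketch of how such lines could be produced, the two helpers below format one event each into a caller-supplied buffer. The function names, the split of an identifier into area and index, and the two-decimal time are all assumptions made for illustration; the coursework handout remains the authority on the exact format.

```c
#include <stdio.h>

/* Write "bin <area>.<bin> emptied at time <when>" into buf.
   Returns what snprintf returns (length needed, excluding the NUL). */
int format_bin_emptied(char *buf, size_t size, int area, int bin, double when)
{
    return snprintf(buf, size, "bin %d.%d emptied at time %.2f",
                    area, bin, when);
}

/* Write "lorry <lorry_no> arrives at location <area>.<loc> at time <when>". */
int format_lorry_arrives(char *buf, size_t size, int lorry_no,
                         int area, int loc, double when)
{
    return snprintf(buf, size,
                    "lorry %d arrives at location %d.%d at time %.2f",
                    lorry_no, area, loc, when);
}
```

Formatting into a buffer rather than printing directly keeps the choice between printing events immediately and recording them for later open.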

Part One & Part Two Assessments

  • Part one is just for feedback. You only need to have a working simulator
  • For part two, there are additional requirements:
    • Full functionality should be implemented
    • Summary statistics, such as average trip efficiency, should be produced
    • Experimentation support, varying disposal rates to see how those impact the collection process
    • Validation, checking that the input is valid
  • These are all specified in the coursework handout

Coursework Handout

The Simulator

Definitions

  • In the requirements I stated that your simulator will be a:
    • stochastic,
    • discrete event,
    • discrete time
    simulator
  • Let's see what each of these terms means.

Stochasticity

  • A stochastic process is one whose state evolves “non-deterministically”, i.e. the next state is determined according to a probability distribution.
  • This means a stochastic simulator may produce slightly different results when run repeatedly with the same input.
  • Therefore it is appropriate to compute certain statistics to characterise the behaviour of the simulated system.
  • Remember, these are statistics about the model:
    • You hope that the real system exhibits behaviour with similar statistics

Discrete Events

  • Discrete events happen at a particular time and mark a change of state in the system.
  • This means discrete-event simulators do not track system dynamics continuously, i.e. an event either takes place or it does not.
  • There is no fine-grained time slicing of the states.
  • Generally a state could be encoded as an integer.
  • Usually it is encoded as a set of integers, possibly coded as different data types.
  • Discrete-event simulations run faster than continuous ones.

Discrete vs Continuous States

  • When working with discrete events, it is common to consider that states are also discrete.
  • Example:

Discrete Time

  • Discrete time simulations operate with a discrete number of points:
    • Minutes, Hours, Days, Weeks, etc.
  • These can also be logical time points:
    • Moves in a board game,
    • Communications in a protocol.
  • Your task is to write a discrete time simulator.
  • Events will occur with minute level granularity.

The Exponential Distribution

  • Remember that the probability distribution gives the probability of the different possible values of a random variable.
  • The exponential distribution describes the time between events in a Poisson process, i.e.
    • Events' inter-arrival times are independent (memoryless),
    • Events occur with a constant average rate λ.

The Exponential Distribution

  • Roughly speaking, the time X we need to wait before an event occurs has an exponential distribution if the probability that the event occurs during a certain time interval is proportional to the length of that time interval.
  • Applications:
    • Call arrivals at a telephone exchange
    • Radioactive particle decay
    • Aeroplane arrivals at a large hub

The Exponential Distribution

  • The probability density function (PDF) is given by:
  • f(x,λ) = λe^{-λx} ∀ x > 0

  • Describes the relative likelihood that an event with rate λ occurs at time x
  • A time point is infinitesimally small
  • The integral of this gives the probability that it occurs within two time bounds (but you can largely ignore this)

The Exponential Distribution

  • The cumulative distribution function (CDF) is given by:
  • F(x,λ) = 1 - e^{-λx} ∀ x > 0

  • So if something happens at a rate of 0.5 per unit of time, then the probability that we will observe it occurring within 1 time unit is: F(1, 0.5) = 1 - e^{-0.5·1} ≈ 0.393

The Exponential Distribution

  • The mean or expected value is given by the reciprocal of the rate parameter.
  • In plain English this means that if something occurs at rate r then we can expect to wait 1/r time units on average to see each occurrence.
  • If something occurs 7 times per week, you can expect to wait 1/7 of a week (or a full 24 hours) on average between each occurrence.

Exercise

What is the probability that a random variable X is less than its expected value, if X has an exponential distribution with rate λ?

The expected value of an exponential random variable with parameter λ is:

E[X] = 1/λ

We need to compute P(X ≤ E[X]) using the distribution function:

P(X ≤ E[X]) = P(X ≤ 1/λ)
= F(1/λ, λ)
= 1 - e^{-λ·(1/λ)}
= 1 - 1/e ≈ 0.632

The Memoryless Property

  • Formally: P(X > s + t | X > s) = P(X > t), for all s, t > 0
  • Less formally: The time that we can expect to wait for the next occurrence of some (exponentially distributed) event, is unaffected by how long we have already been waiting for it
  • In the 7 times a week example, if it has been 24 hours since the last occurrence, the expected additional time I have to wait is still 24 hours
  • A quick note, don't confuse these two properties:
    • Correct P(X > 100 | X > 80) = P(X > 20)
    • Incorrect P(X > 100 | X > 80) = P(X > 100)
    The latter would be a strange kind of pre-determined system
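The formal statement can be checked directly from the exponential tail probability P(X > x) = 1 - F(x,λ) = e^{-λx}. The first step uses the fact that the event X > s + t already implies X > s:

```latex
\begin{aligned}
P(X > s + t \mid X > s)
  &= \frac{P(X > s + t)}{P(X > s)} \\
  &= \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}} \\
  &= e^{-\lambda t} = P(X > t)
\end{aligned}
```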

The Memoryless Property

  • In your simulation, users dispose of rubbish bags at exponentially distributed time intervals
  • That means the next disposal event does not depend on the previous ones
  • As a result of firing that event, the global state of the simulation changes
  • However local states may not have changed, e.g. a lorry may still be at the depot

How do we sample from a distribution?

Inverse Transform Method

  • Let X be a random variable with continuous and increasing distribution function F. Denote the inverse by F^{-1}.
  • Let U be a random variable uniformly distributed on the unit interval (0, 1).
  • Then X can be generated by X = F^{-1}(U).

If we use an exponential CDF for F, then we effectively sample from that distribution by

X = -ln(U)/λ
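For completeness, the inversion of the exponential CDF can be written out as follows:

```latex
\begin{aligned}
u = F(x,\lambda) = 1 - e^{-\lambda x}
  \;&\Longrightarrow\; e^{-\lambda x} = 1 - u \\
  \;&\Longrightarrow\; x = F^{-1}(u) = -\frac{\ln(1 - u)}{\lambda}
\end{aligned}
```

Since 1 - U is itself uniformly distributed on (0, 1) whenever U is, 1 - U can be replaced by U, which gives the shorter form X = -ln(U)/λ.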

Sampling exponential distributions in practice

Straightforward, right? (denoting mean = 1/λ)

  int r = (int) (-log((double) rand()/RAND_MAX)*mean);

Well...not quite

  • rand() is known to be implemented poorly
  • You want to draw a RV uniformly distributed on (0,1)
  • Properly seeding a pseudo-random generator is tricky

OK, so what should we do?

xkcd?

Drawing uniformly distributed random numbers

Use a uniform deviate and discard the zero.

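#include <math.h>    /* log */
#include <stdlib.h>  /* rand, RAND_MAX */

/* Map an integer in [0, RAND_MAX] to a double in [0, 1). */
double uniform_deviate ( int seed )
{
   return seed * ( 1.0 / ( RAND_MAX + 1.0 ) );
}

/* u must be a double: stored in an int it would always truncate to 0. */
double u;
do
   u = uniform_deviate ( rand() );
while (u == 0.0);

int r = (int) (-log(u)*mean);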

Seeding rand()

The usual solution is to get the system time.

srand ( (unsigned int) time ( NULL ) );

Note there may be some portability issues with the above. Julienne Walker argues that hashing the system time first is a good solution. For a longer discussion about this and random numbers, see Julienne Walker's web page.
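One way to hash the system time is to fold the bytes of the time_t value into an unsigned integer, so that the seed does not depend on how time_t converts to unsigned int. The sketch below is written in that spirit; the function names hash_time and seed_from_clock and the multiplier are illustrative assumptions, not a quotation of the page mentioned above.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Fold every byte of t into an unsigned value. */
unsigned hash_time(time_t t)
{
    const unsigned char *p = (const unsigned char *) &t;
    unsigned seed = 0;
    for (size_t i = 0; i < sizeof t; i++)
        seed = seed * (UCHAR_MAX + 2U) + p[i];
    return seed;
}

/* Seed rand() from the clock; call this once, at program start. */
void seed_from_clock(void)
{
    srand(hash_time(time(NULL)));
}
```

For debugging and testing it is often handy to be able to bypass this and pass a fixed seed to srand(), so that a run can be repeated exactly.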

Further reading

  • Donald E. Knuth (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms, 3rd Edition, Addison-Wesley.
  • William H. Press et al. (2007). Numerical Recipes: The Art of Scientific Computing, 3rd Edition, Cambridge University Press.

Your Simulators

  1. Will be Discrete event simulators
  2. Will be Discrete time simulators
  3. Will make use of the exponential distribution to model user behaviour

Clarifications

Question:
What is the proposed maximum number of bins in a given area?

 

Answer:
It is foreseeable that the number of bins in an area may limit e.g. the possibility of performing exhaustive search to find an optimal route. For this practical we will assume the number of bins in an area is specified as an unsigned int value, i.e. the maximum number is 65,535.

NB: Some compilers use 4 bytes for int. You can check what your compiler is using with sizeof(int). If you want a portable unsigned 16-bit integer, use uint16_t (from <stdint.h>).
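A small sketch of how the 16-bit limit might be enforced when reading the number of bins. The names MAX_BINS, to_bin_count and print_int_sizes are hypothetical; only uint16_t, UINT16_MAX and sizeof come from standard C.

```c
#include <stdint.h>
#include <stdio.h>

/* Largest bin count accepted, matching the 16-bit limit (65535). */
#define MAX_BINS UINT16_MAX

/* Store n in *out and return 1 if it fits in 16 bits; otherwise return 0. */
int to_bin_count(long n, uint16_t *out)
{
    if (n < 0 || n > MAX_BINS)
        return 0;
    *out = (uint16_t) n;
    return 1;
}

/* Report the sizes the current compiler actually uses. */
void print_int_sizes(void)
{
    printf("sizeof(int) = %zu, sizeof(uint16_t) = %zu\n",
           sizeof(int), sizeof(uint16_t));
}
```

Parsing into a wider type such as long first, and only then narrowing, lets out-of-range input be rejected instead of silently wrapping around.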

Clarifications

Question:
How long does it take to service (empty) a lorry at the waste processing facility?

 

Answer:
While in practice it usually takes longer to empty a lorry than to service a single bin, to make things simpler we will consider these service times to be equal.

Clarifications

Question:
Can a lorry receive an updated route from the depot while in service, as a bin's occupancy threshold may be reached after the lorry's departure?

 

Answer:
No. While this scenario is foreseeable, we will assume lorries are assigned routes only prior to their departure from the waste processing facility.

Clarifications

Question:
The occupancy of some bins may increase as a lorry traverses a route and consequently the lorry may not be able to service all assigned bins due to capacity constraints. How should the lorry proceed in situations like this?

Clarifications

Answer:
How you deal with such events is a design choice you should make. You can either
  1. conservatively assign fewer bins on a single run, to avoid capacity problems,
  2. return following the planned route without servicing other bins in that run if this happens,
  3. compute the shortest path from the current bin and return to the depot, or
  4. implement a combination of these or other similar strategies.
Please ensure your choice is explained in your final report and your code is commented appropriately.

Simulation Components

Route Planning

Service Areas

  • We need an abstract representation of a street map and bin locations for each service area.
  • We will consider one lorry per area.
  • We need to model the roads between different locations and the time required to travel these.

Example

Leith Walk area in Edinburgh; 20 imagined bin locations

Map source: bing.com

Graph representation

  • In mathematical terms such a collection of bins interconnected with links can be represented through a graph
  • A graph G = (V,E) comprises a set of vertices V that represent objects (bins) and a set of edges E that connect different pairs of vertices (links).
  • Graphs can be directed or undirected

Directed Graphs

  • Edges have a direction associated with them and they are called arcs or directed edges
  • Formally, they are ordered pairs of vertices,
    i.e. (a,b) ≠ (b,a) if a ≠ b

Undirected Graphs

  • Edges have no orientation, i.e. they are unordered pairs of vertices. That is, there is a symmetric relation between nodes and thus (a,b) = (b,a)
  • For our simulations we will consider undirected graph representations of the service areas

Back to the example

This area...

Map source: bing.com

Corresponding Graph

...can be represented by

Note that we numbered the vertices and added the '0' node, to model the lorry depot.

Weighted Graph

  • We also need to model the distances between bin locations
  • We will use a weighted graph representation, where a number (weight) is associated to each edge
  • In our case weights will represent the average travel duration between two bins (vertices), expressed in minutes

Weighted Graph

For our example, this may be

Input Script

  • The graph representation of the bins' locations and the distances between them will be given in the input script in matrix form.
  • For an area with N bins, a (N+1) x (N+1) matrix will be specified.
  • The graph keyword will precede the matrix
  • Being an undirected graph, the matrix will be symmetric
  • Where there is no edge in the graph between two vertices we will use a value of 2^16 - 1 (i.e. 65535) in the matrix
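The properties listed above lend themselves to a simple input check. The sketch below assumes the matrix has already been read into a row-major array; the names NO_EDGE and is_valid_graph are hypothetical, and requiring a zero diagonal is an assumption based on the example matrices rather than a stated rule.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_EDGE UINT16_MAX   /* 2^16 - 1 = 65535 marks a missing edge */

/* m is a row-major n x n matrix, where n = number of bins + 1.
   Returns 1 if the diagonal is zero and the matrix is symmetric. */
int is_valid_graph(const uint16_t *m, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (m[i * n + i] != 0)
            return 0;
        for (size_t j = i + 1; j < n; j++)
            if (m[i * n + j] != m[j * n + i])
                return 0;
    }
    return 1;
}
```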

For The Previous Example

   0     1     2     3     4     5       ...   19    20
   -------------------------------------------------------
 0|0     9     65535 8     10    65535   ...   65535 65535
 1|9     0     2     65535 65535 65535   ...   65535 65535
 2|65535 2     0     1     65535 65535   ...   65535 65535
 3|8     65535 1     0     1     65535   ...   65535 65535
 4|10    65535 65535 1     0     4       ...   65535 65535
 5|65535 65535 65535 65535 4     0       ...   65535 65535
 .|.     .     .     .     .     .             .     .
 .|.     .     .     .     .     .             .     .
 .|.     .     .     .     .     .             .     .
19|65535 65535 65535 65535 65535 65535   ...   0     1
20|65535 65535 65535 65535 65535 65535   ...   1     0

Route Planning

  • Lorries are scheduled periodically and their frequency is an input parameter
  • Bin occupancy thresholds are used to decide the subset of bins to be serviced
  • You must seek the route that visits all bins whose occupancy has exceeded the threshold, and has a minimum cost (in terms of total duration)
  • All routes are circular, i.e. they must start and end at location x.0, where x is the area index

Route Planning

  • Not all the bins in an area may need to be serviced at a given time.
  • Thus it may be appropriate to work with an equivalent graph where vertices that need not be visited are isolated and equivalent edge weights are introduced.
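One plausible reading of "equivalent edge weights" is the shortest travel time between two bins through the full area graph, including bins that do not need servicing. Under that assumption, an all-pairs shortest-path pass (the Floyd–Warshall algorithm) over the input matrix yields the weights for the reduced graph. The names NO_EDGE and shortest_times are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_EDGE 65535u   /* sentinel used in the input for "no edge" */

/* In-place all-pairs shortest travel times on a row-major n x n matrix.
   Entries equal to NO_EDGE mean "not connected"; any path whose total
   would reach NO_EDGE or more is treated as no connection. */
void shortest_times(uint32_t *d, size_t n)
{
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                uint32_t ik = d[i * n + k];
                uint32_t kj = d[k * n + j];
                if (ik != NO_EDGE && kj != NO_EDGE &&
                    ik + kj < d[i * n + j])
                    d[i * n + j] = ik + kj;
            }
}
```

After this pass, the rows and columns of the bins that need servicing (plus the depot) form a complete weighted graph on which a route can be planned. The actual road-level path then has to be recovered separately if the intermediate locations must be reported.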

The (More) Challenging Part

  • Let's refer to the graph of all bins that require service at a given time as the "service graph"
  • How do we find the (almost) optimal route that visits all vertices in the service graph with minimum cost?
  • This is entirely up to you, but I will discuss some possibilities next
  • You must justify your choice in the final report and comment the simulator code appropriately
  • You may wish to implement more than one algorithm

Useful terminology

  • A walk is a sequence of edges connecting a sequence of vertices in a graph
  • A path is a walk that does not include any vertex twice
  • A cycle is a path that starts and ends at the same vertex

paths / cycle

Useful terminology

  • A trail is a walk that does not include any edge twice
  • A trail may include a vertex twice, as long as it comes and leaves on different edges
  • A circuit is a trail that starts and ends at the same vertex

trail / circuit

Hamiltonian Circuit

  • A Hamiltonian circuit (cycle) is a path that visits every vertex exactly once and starts and ends at the same vertex
  • NB: Not all graphs have a Hamiltonian circuit

Minimum Cost Hamiltonian Circuit

  • In a weighted graph, the minimum cost Hamiltonian circuit is that where the sum of the edge weights is the smallest
  • Finding the minimum cost Hamiltonian circuit on your bin service graph is one option for route planning
  • Warning: Finding a minimum cost Hamiltonian circuit can be very difficult. Even deciding whether a Hamiltonian circuit exists is a known NP-complete problem. Simply put, no polynomial-time algorithm is known, and the running time of exact methods grows very quickly with the number of vertices

Heuristic Algorithms

  • Heuristics work quite well for finding a solution, most of the time
  • Solutions may not always be optimal, but they are often good enough
  • Work relatively fast
  • Popular heuristics for finding minimum cost Hamiltonian circuits:
    • Nearest Neighbour Algorithm
    • Sorted Edges Algorithm

Nearest Neighbour Algorithm

  • Nearest Neighbour is a greedy algorithm – at every step it chooses as the next vertex the one connected to the current through the edge with the smallest weight
  • Only searches locally
  • Nodes already visited are ignored
  • Due to its greedy nature it may not find a solution
  • Finding a solution and its total cost depends on the start vertex chosen
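A minimal sketch of the Nearest Neighbour heuristic as described above, operating on a row-major weight matrix with 65535 marking a missing edge. The names NO_EDGE, MAX_N and nearest_neighbour, the fixed cap on the number of vertices, and the choice to break ties by lowest vertex index are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_EDGE 65535u
#define MAX_N   64       /* hypothetical cap on the number of vertices */

/* Greedy tour from `start` over all n vertices. Writes the visiting order
   into tour[0..n-1]. Returns the total cost including the closing edge
   back to `start`, or -1 if the heuristic gets stuck. */
long nearest_neighbour(const uint16_t *w, size_t n, size_t start, size_t *tour)
{
    unsigned char visited[MAX_N] = {0};
    long cost = 0;
    size_t cur = start;

    if (n == 0 || n > MAX_N || start >= n)
        return -1;
    visited[cur] = 1;
    tour[0] = cur;
    for (size_t step = 1; step < n; step++) {
        size_t best = n;   /* n means "no candidate yet" */
        for (size_t v = 0; v < n; v++)
            if (!visited[v] && w[cur * n + v] != NO_EDGE &&
                (best == n || w[cur * n + v] < w[cur * n + best]))
                best = v;
        if (best == n)
            return -1;     /* no unvisited neighbour is reachable */
        cost += w[cur * n + best];
        visited[best] = 1;
        tour[step] = best;
        cur = best;
    }
    if (w[cur * n + start] == NO_EDGE)
        return -1;         /* cannot close the circuit */
    return cost + w[cur * n + start];
}
```

Because the outcome depends on the start vertex, one option is to run the function from several start vertices and keep the cheapest circuit found, if any.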

Nearest Neighbour

Example: Starting at '0'

No solution

Nearest Neighbour

Example: Starting at '1'
Total cost: 22

Sorted Edges Algorithm

  • Also greedy, but has a more global view → takes slightly more time to find a solution
  • First sorts all the edges in ascending order of their weights
  • Adds sorted edges one at a time, unless adding a new edge leads to three edges entering a node, or creates a circuit that does not include all vertices
  • Skips the edges that violate these rules
  • Keeps adding edges until finding a Hamiltonian circuit
  • Stops when a solution is found, even if there are edges left
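The steps above can be sketched as follows. Detecting a premature circuit is done here with a small union-find structure, which is an implementation choice of this sketch rather than something prescribed by the algorithm; the names Edge, NO_EDGE, MAX_N and sorted_edges are likewise hypothetical. Only the total cost is returned.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define NO_EDGE 65535u
#define MAX_N   64       /* hypothetical cap on the number of vertices */

typedef struct { size_t u, v; uint16_t w; } Edge;

static int by_weight(const void *a, const void *b)
{
    const Edge *x = a, *y = b;
    return (x->w > y->w) - (x->w < y->w);
}

/* Union-find root lookup with path halving. */
static size_t find_root(size_t *parent, size_t x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Returns the cost of the circuit found, or -1 if none is found. */
long sorted_edges(const uint16_t *w, size_t n)
{
    static Edge edges[MAX_N * MAX_N / 2];
    size_t parent[MAX_N], degree[MAX_N] = {0};
    size_t m = 0, used = 0;
    long cost = 0;

    if (n < 3 || n > MAX_N)
        return -1;
    for (size_t i = 0; i < n; i++) {
        parent[i] = i;
        for (size_t j = i + 1; j < n; j++)
            if (w[i * n + j] != NO_EDGE)
                edges[m++] = (Edge){ i, j, w[i * n + j] };
    }
    qsort(edges, m, sizeof edges[0], by_weight);

    for (size_t e = 0; e < m && used < n; e++) {
        size_t u = edges[e].u, v = edges[e].v;
        if (degree[u] == 2 || degree[v] == 2)
            continue;                 /* would be a third edge at a node */
        size_t ru = find_root(parent, u), rv = find_root(parent, v);
        if (ru == rv && used != n - 1)
            continue;                 /* circuit that misses some vertices */
        parent[ru] = rv;
        degree[u]++;
        degree[v]++;
        cost += edges[e].w;
        used++;
    }
    return used == n ? cost : -1;
}
```

Note that qsort is not guaranteed to be stable, so when several edges share a weight the order in which they are considered, and hence possibly the circuit found, may differ between platforms.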

Sorted Edges

Example
Adding the first 3 edges is straightforward

Sorted Edges

Example
Adding 3-4 creates a circuit, but not all nodes visited.

Sorted Edges

Example
Remaining edges create circuits and violate the 3-edges rule. No solution found.

Brute Force Algorithm

  • When the number of vertices is small, a 'brute force' approach could be feasible
  • Find all paths that visit all vertices once and pick the one with the lowest cost
  • Guaranteed to find a solution (if there exists one), and this will be optimal
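For a handful of vertices, the brute force approach might look like the sketch below, which fixes vertex 0 as the start and tries every ordering of the remaining vertices. The names NO_EDGE, MAX_BF and brute_force_circuit and the cap on n are assumptions for illustration, and only the cost of the best circuit is returned, not the route itself.

```c
#include <stddef.h>
#include <stdint.h>

#define NO_EDGE 65535u
#define MAX_BF  16       /* hypothetical cap; (n-1)! orderings are tried */

static void swap_idx(size_t *a, size_t *b)
{
    size_t t = *a; *a = *b; *b = t;
}

/* perm[0..k-1] is a partial path of weight `cost`; extend it in every
   possible way and record the cheapest closed circuit in *best. */
static void search(const uint16_t *w, size_t n, size_t *perm, size_t k,
                   long cost, long *best)
{
    if (k == n) {
        uint16_t back = w[perm[n - 1] * n + perm[0]];
        if (back != NO_EDGE && (*best < 0 || cost + back < *best))
            *best = cost + back;
        return;
    }
    for (size_t i = k; i < n; i++) {
        swap_idx(&perm[k], &perm[i]);
        uint16_t e = w[perm[k - 1] * n + perm[k]];
        if (e != NO_EDGE)
            search(w, n, perm, k + 1, cost + e, best);
        swap_idx(&perm[k], &perm[i]);   /* undo before the next choice */
    }
}

/* Cost of the cheapest Hamiltonian circuit through vertices 0..n-1,
   or -1 if no such circuit exists. */
long brute_force_circuit(const uint16_t *w, size_t n)
{
    size_t perm[MAX_BF];
    long best = -1;

    if (n == 0 || n > MAX_BF)
        return -1;
    for (size_t i = 0; i < n; i++)
        perm[i] = i;
    search(w, n, perm, 1, 0, &best);
    return best;
}
```

The number of orderings grows factorially, so this is only practical for small service graphs; pruning branches whose partial cost already exceeds the best found so far is a common refinement.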

Other approaches

  • A Hamiltonian circuit may not always be the path that visits all the vertices and has the lowest cost
  • Sometimes visiting a node more than once could be a good idea
  • For small graphs, using brute force to find the cheapest path that visits all nodes may be appropriate

Other approaches

Example: passing twice through node '4'
Total cost: 22

Choosing Route Planning Algorithms

  • You can use any of these approaches and heuristics
  • You can implement other heuristics you have studied
  • You can implement multiple solutions, as some may not work for every graph
  • You have complete freedom, but make sure you document your choice and discuss its implications for the system's performance in your written report

Code Structuring & Coding Strategy

How to structure your work?

  • This is for guidance only and I will not go into great detail, to avoid seeing identically structured solutions.
  • Part of the practical is structuring it yourself. However, it is likely you will want at least the following components:
    • A parser
    • A representation of the states of a simulation
    • The simulation algorithm
    • Something to handle output
    • Something to analyse results
    • A test suite

Some Obvious Decisions

  • Do you want to parse into some abstract syntax data structure and then convert that into a representation of the initial state?
    • Or you could parse directly into the representation of the initial state
  • Do you wish to print out events as they occur during the simulation?
    • Or record them and print them out later
  • Do you wish to analyse the simulation events as the simulation proceeds?
    • Or analyse the events afterwards

Parsing

  • You do not necessarily need to start with the parser
  • The parser produces some kind of data structure. You could instead start by hard coding your examples in your source code
  • But the parsing for this project is pretty simple
  • Hence you could start with the parser, even if this is not complete before moving on
    • Hard coding data structure instances could prove laborious
    • But doing so would ensure your simulator code is not heavily coupled with your parser code

Software Construction

  • Software construction is relatively unique in the world of large projects in that it allows a great deal of backtracking
  • Many other forms of projects, such as construction, event planning, and manufacturing, only allow for backtracking in the design phase.
  • The design phase consists of building the object virtually (on paper, on a computer) when backtracking is inexpensive
  • Software projects do not produce physical artefacts, so the construction of the software is mostly the design

Refactoring

  • Refactoring is the process of restructuring code while achieving exactly the same functionality, but with a better design.
  • This is powerful, because it allows trying out various designs, rather than guessing which one is the best
  • It allows the programmer to design retrospectively once significant details are known about the problem at hand
  • It allows avoiding the cost of full commitment to a particular solution which, ultimately, may fail.

Suggested Strategy

  • Note that this is merely a suggested strategy
  • Start with the simplest program possible
  • Incrementally add features based on the requirements
  • After each feature is added, refactor your code
    • This step is important: it helps to avoid the risk of developing an unmaintainable mess
    • Additionally it should be done with the goal of making future feature implementations easier
    • This step includes janitorial work (discussed later)

Suggested Strategy

  • At each stage, you always have something that works
  • Although you need not specifically design for later features you do at least know of them, and hence can avoid doing anything which will make those features particularly difficult.

Alternative Inferior Strategy

  • Design the whole system before you start
  • Work out all components and sub-components needed
  • Start with the sub-components which have no dependencies
  • Complete each sub-component at a time
  • Once all the dependencies of a component have been developed, choose that component to develop
  • Finally, put everything together to obtain the entire system
    • Test the entire system

Janitorial Work

  • Janitorial work consists mainly of the following
    • Reformatting
    • Commenting
    • Changing Names
    • Tightening

Janitorial Work

Reformatting


int function_name (int x)
{
  return x + 10;
}
Becomes:

int function_name(int x) {
  return x + 10;
}
There is plenty of software which can do this work.

Janitorial Work

Reformatting

  • Reformatting is entirely superficial
  • It is important to consider when you apply this
  • This may well conflict with other work performed concurrently
  • Reformatting should be largely unnecessary if you keep your code correctly formatted in the first place
    • More commonly required on group projects

Janitorial Work

Commenting

  • Writing good comments in your source code is essential
  • When done as janitorial work this can be particularly useful
    • You can comment on the stuff that is not obvious even to yourself as you read it.
  • The important thing to comment is not what or how but why
  • Try not to have redundant/obvious information in your comments:
    
    // 'x' is the first integer argument
    int leastCommonMultiple(int x, int y); 
    

Janitorial Work

Commenting

Ultra bad:

// increment x
x += 1;
Better:

// Since we now have an extra element to consider
// the count must be incremented
x += 1;

Janitorial Work

Changing Names

  • The previous example used x as a variable name
  • Unless it really is the x-axis of a graph, choose a better name
  • This is of course better to do the first time around
  • However, as with commenting, the author often notices unclear code more readily when reading it again later

Janitorial Work

Tightening


  ...
  FILE *fInput;
  fInput = fopen(fileName, "r");
  parseInput(fInput);
  fclose(fInput);
Tightened to become:

  ...
  FILE *fInput;
  fInput = fopen(fileName, "r");
  if (fInput == NULL) {
    // Explain to the user ...
    printf("Error: %d (%s)\n", errno, strerror(errno));
  } else {
    parseInput(fInput);
    fclose(fInput);
  }

Janitorial Work

Tightening

  • For some this is not janitorial work, since it actually changes the function of the code in a non-superficial way
  • However, similar to other forms it is often caused by being unable to think of everything when writing new code

Janitorial Work

  • Most of this work is work that arguably could have been done right the first time around when the code was developed
  • However, when developing new code, you have limited cognitive capacity
  • You cannot think of everything when you develop new code. Janitorial work is your time to rectify the minor stuff you forgot
  • Better than trying to get it right first time is making sure you later review your code

Janitorial Work

  • Remember, refactoring is the process of changing code without changing its functionality, whilst improving design.
  • Strictly speaking janitorial work is not refactoring
    • It should not change the function of the code
      • Tightening might, but generally for exceptional input
    • But neither does it make the design any better
  • In common with refactoring you should not perform janitorial work on pre-existing code whilst developing new code

More About Refactoring

  • Refactoring is a term which encompasses both factoring and defactoring
  • Generally the principle is to make sure that code is written exactly once
  • We hope for zero duplication
  • However, we would also like for our code to be as simple and comprehensible as possible

Factoring and Defactoring

  • We avoid duplication by writing re-usable code
  • Re-usable code is generalised
  • Unfortunately, this often means it is more complicated
  • Factoring is the process of removing common or replaceable units of code, usually in an attempt to make the code more general
  • Defactoring is the opposite process: specialising a unit of code, usually in an attempt to make it more comprehensible

Factoring Example


#include <stdbool.h>
#include <stdio.h>
void primes(int limit) {
    int i, x = 2;
    while (x <= limit) {
        bool prime = true;
        for (i = 2; i < x; i++) {
            if (x % i == 0) {
                prime = false;
                break;
            }
        }
        if (prime) {
            printf("%d is prime\n", x);
        }
        x++;
    }
}
A very naive but perfectly reasonable bit of code to print out a set of prime numbers up to a particular limit

Factoring Example


#include <stdbool.h>
void printPrime(int x) {
    printf("%d is prime\n", x); 
}

void primes(int limit) {
    int i, x = 2;
    while (x <= limit) {
        ... // as before
        if (prime) {
            printPrime(x);
        }
        x++;
    }
}
Here we have “factored out” the code to print the prime number to the screen. This may make it more readable, but the code is not more general.

Factoring Example

To make it more general we have to actually parametrise what we do with the primes once we have found them.

#include <stdbool.h>
void primes(int limit, void (*processPrimes) (int)) {
    int i, x = 2;
    while (x <= limit) {
        ... // as before
        if (prime) {
            (*processPrimes)(x);
        }
        x++;
    }
}
You can now use different functions to display, store, etc. the prime numbers.

Factoring Example

For instance, to print to display

void printPrime(int x) {
    printf("%d is prime\n", x);
}
void primes(int limit, void (*processPrimes) (int)) {
    int i, x = 2;
    while (x <= limit) {
    	... // as before
        if (prime) {
        	(*processPrimes)(x);
        }
        x++;
    }
}
int main(void) {
    primes (100, &printPrime);
    ...
}
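
As a further illustration of the parametrised version, the same primes function can count the primes instead of printing them. This is only a minimal sketch: countPrime and primeCount are hypothetical names, and the counter lives in a file-scope variable purely because the callback signature carries no extra state.

```c
#include <stdbool.h>

/* Hypothetical counter updated by the callback below. */
static int primeCount = 0;

/* Callback that counts primes instead of printing them. */
void countPrime(int x) {
    (void)x;          /* the value itself is not needed for counting */
    primeCount++;
}

/* Same naive search as before, with the action on each prime passed in. */
void primes(int limit, void (*processPrimes)(int)) {
    int i, x = 2;
    while (x <= limit) {
        bool prime = true;
        for (i = 2; i < x; i++) {
            if (x % i == 0) {
                prime = false;
                break;
            }
        }
        if (prime) {
            (*processPrimes)(x);
        }
        x++;
    }
}
```

Calling primes(30, &countPrime) from a fresh start leaves primeCount at 10, and primes itself did not have to change.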

Factoring

  • What you should factor depends on the context
  • How likely am I to need more than just prime numbers?
  • How likely am I to do something other than print the primes?
  • Try to find the right re-usability/time trade-off

Defactoring

  • Numbers such as the number 20 can be factored in different ways
    • 2,10
    • 4,5
    • 2,2,5
  • If we have the factors 2 and 10, and realise that we want the number 4 included in the factorisation we can either:
    • Try to go directly by multiplying one factor and dividing the other
    • Defactor 2 and 10 back into 20 and then divide 20 by 4

Defactoring

  • Similarly, your code is factored in some way
  • In order to obtain the factorisation that you desire, you may have to first defactor some of your code
  • This allows you to factor down into the desired components
  • This is often easier than trying to short-cut across factorisations

Defactoring

  • Flexibility is great, but it is generally not without cost
    • The cognitive cost associated with understanding the more abstract code
  • If the flexibility is not required now and is unlikely to become required, then it might be worthwhile defactoring
  • It is appropriate to explain your reasoning in comments

Refactoring Summary

  • Code should be factored into multiple components
  • Refactoring is the process of changing the division of components
  • Defactoring can help the process of changing the way the code is factored
  • Well factored code will be easier to understand
  • Do not update functionality at the same time

Common Development Approach

    1. Start with the main function
    2. Write some code, for example to parse the input
    3. Write (or update) a test input file
    4. Run your current application
    5. See if the output is what you expect
    6. Go back to step 2.

Do Not Start with Main

  • A better place to start is with a test suite
  • This doesn't have to mean you cannot start coding
  • Write a couple of test inputs
  • Create a skeleton “do nothing” parse function
  • Create an entry point which simply calls your parse function on your test inputs (all of them)
  • Watch them fail
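
The starting point described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the name parseScript, its return convention, and the placeholder test inputs are hypothetical, not part of the coursework specification.

```c
#include <stddef.h>
#include <stdio.h>

/* Skeleton "do nothing" parser: always reports failure for now.
   Hypothetical convention: 0 on success, -1 on failure. */
int parseScript(const char *text) {
    (void)text;
    return -1;
}

/* Runs the parser on every test input and returns the number of failures. */
int runParserTests(void) {
    /* Placeholder inputs; real ones would follow the input script format. */
    const char *testInputs[] = { "test input 1", "test input 2" };
    size_t count = sizeof testInputs / sizeof testInputs[0];
    int failures = 0;

    for (size_t i = 0; i < count; i++) {
        if (parseScript(testInputs[i]) != 0) {
            printf("FAIL: input %zu\n", i);
            failures++;
        }
    }
    return failures;
}
```

With the skeleton parser every input fails, which is the expected first result; the number of failures should drop as the parser is filled in.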

Do Not Start with Main

  1. Code until those tests are green
    • Including possibly refactoring
  2. Consider new functionality
    • Write a function that tests for that new functionality
    • Watch it fail, whether by generating an error or simply not producing the results required
    • Return to step 1.
  3. You can write your main function any time you like
    • It should be very simple, as it simply calls all of your fully tested functionality

Do Not Start with Main

  • Any time you run your code and examine the results, you should be examining output of tests
  • If you are examining the output of your program ask yourself:
    • Why am I examining this output by hand and not automatically?
    • If I fix whatever is strange about the output can I be certain that I will never have to fix this again?
  • Of course sometimes you need to examine the output of your program to determine why it is failing a test. This is just semantics (it is still the output of some test)

Summary

  • Refactoring allows you to avoid doing a large amount of upfront design and also avoid producing a big hairy mess
    • Do not change functionality whilst refactoring
    • Your code should be adaptable
  • Do not start with main, write a test suite instead

Memory management

Memory management in C

  • Memory allocation/deallocation is done differently in C from what you may be used to in other languages
  • There is no automatic memory management (garbage collection) and thus the programmer is responsible for releasing dynamically allocated memory when no longer needed
  • Poor memory management can lead to memory leaks → system performance suffers as virtual memory is progressively paged to hard drive. The OS may crash

Memory allocation

  • Two mechanisms are used to allocate memory in C
    1. Declaring local variables. These are stored on the stack and are automatically freed when they go out of scope (e.g. exiting a function)
      
      int parseInput(FILE *finput) {
         int N;
         int array[100];
         ...
      } 
      
    • Nice, right? Well there's a catch. The compiler needs to know in advance how much memory to allocate (inefficient) and the stack size is limited (not all your variables may fit).

Memory allocation

  • Two mechanisms are used to allocate memory in C
    2. Requesting memory explicitly and storing variables on the heap.
    • You can allocate as much memory as the system allows you to, but you need to take care of releasing it
    • 
         ...
         int N;
         int *array;
         ...
         array = (int *) malloc(N * sizeof(int));
      
    • The compiler cannot know what you intend to do with the array and thus will not release it automatically. You have to do it manually when done:
    • 
         free(array);
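
A minimal sketch of the allocate, check, use, free pattern is given below. The helper name makeArray is hypothetical; the point is only that the result of malloc is checked before use and that the caller remains responsible for calling free.

```c
#include <stdlib.h>

/* Allocates an array of n ints on the heap, all set to zero.
   Returns NULL if n is 0 or the allocation fails; the caller must free(). */
int *makeArray(size_t n) {
    if (n == 0) {
        return NULL;
    }
    int *array = malloc(n * sizeof *array);
    if (array == NULL) {
        return NULL;           /* allocation failed: nothing to free */
    }
    for (size_t i = 0; i < n; i++) {
        array[i] = 0;
    }
    return array;
}
```

A caller would test the returned pointer against NULL, use the array, and finish with free(array).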
      

Segmentation Fault & Co

A few things can go wrong if not careful with variables allocated on the heap.

  • Memory leaks → you forget to call free() when variables no longer needed
  • You try to free, but allocation failed or memory already deallocated
  • 
    char *fileName = malloc(255*sizeof(char));
    if (fileName != NULL) {
      ...
      free(fileName);
    }
    
  • You try to write beyond the allocated length (corruption)
  • 
    char *temp = malloc(64*sizeof(char));
    memcpy(temp, data, dataLen);   // dataLen > 64 writes past the end of temp
    

Segmentation Fault & Co

  • You use an out of bounds array index
  • 
    int *array = malloc(128*sizeof(int));
    int N = 200;
    
    for (i = 0; i < N; i++) {  // once i > 127, an error will occur
       ... 
    }
    
  • You use an address, but memory has not been allocated
  • 
    struct element *listElement;   // declared, but never pointed at allocated memory
    x = listElement->value;
    

Segmentation Fault & Co

  • You return a pointer to a variable from the stack
  • 
    int *getCount() {
       int n;  // Local stack variable
    
       ...     // count the number of values that 
    	   // divide by x in an array
    
       return &n;
    }
    
    int main(void){
      int *n; 
      ...
      n = getCount();  // Stack given up by getCount(), 
    	           // &n no longer safe
    
      ... // n may be corrupt when needed later
    }            
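
One way to avoid this problem, sketched under the assumption that the function only needs to hand back a single number, is to return the count by value. The name countDivisible and its parameters are hypothetical.

```c
#include <stddef.h>

/* Counts the values in the array that are divisible by x.
   The count is returned by value, so no pointer to a local variable escapes. */
int countDivisible(const int *values, size_t length, int x) {
    int n = 0;                      /* local stack variable */
    for (size_t i = 0; i < length; i++) {
        if (x != 0 && values[i] % x == 0) {
            n++;
        }
    }
    return n;                       /* the value is copied to the caller */
}
```

If a pointer really has to be returned, the memory it points to should be allocated on the heap (and later freed by the caller) or supplied by the caller.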
    

Why no garbage collection in C?

  • Garbage collection involves constructing a complex data structure for keeping track of allocations and reference counts.
  • This mechanism increases the complexity of the language and affects the performance (overhead).
  • C is meant for designing very fast code, e.g. for operating systems, device drivers, etc.
  • Convenience is traded for high performance.

Code Optimisation

Code Optimisation

  • Refactoring is done in between development of new functionality
    • Recall this makes it easier to test that this process has not changed the behaviour of your code
  • This is also a good time to do some optimisation
    • You should be in a good position to test that your optimisations have not negatively impacted correctness

When to Optimise?

  • When you discover that your code is not running fast enough, it's probably wise to optimise it
  • Often this will come towards the end of the project
  • It should certainly come after you have something deployable
  • Preferably after you have developed and tested some major portion of functionality

A Plausible Strategy

  • Perform no optimisation until the end of the project once all functionality is complete and tested
  • This is a reasonable approach; however:
  • During development, you may find that your test suite takes a long time to run
  • Even one simple run to test the functionality you are currently developing may take minutes or hours
  • This can slow down development significantly, so it may be appropriate to do some optimisation at that point

How to Optimise

  • The very first thing you need before you could possibly optimise code is a benchmark
  • This can be as simple as timing how long it takes to run your test suite
  • O(n²) solutions can beat O(n log n) solutions on sufficiently small inputs, so your benchmarks must not be too small

How to Optimise

Once you have a suitable benchmark then you can:

  1. Save a copy of your current code
  2. Run your benchmark and record the run time
  3. Perform what you think is an optimisation on your source code
  4. Re-run your benchmark and compare the run times
  5. If you successfully improved the performance of your code keep the new version, otherwise revert changes
  6. Do one optimisation at a time

How to Optimise

  • However, bear in mind that you are writing a stochastic simulator
    • This means each run is different and hence may take a different time to run
    • Even if the code has not changed or has changed in a way that does not affect the run time significantly
    • Simply using the same input several times should be enough to reduce or nullify the effect of this
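
A benchmark of this kind can be sketched as below: the workload is run several times and the mean CPU time per run is reported, which smooths out run-to-run variation. The names meanRunTime and dummyWork are hypothetical, and dummyWork is only a stand-in for a real simulation run.

```c
#include <time.h>

/* Runs work() `runs` times and returns the mean CPU time per run in seconds.
   Returns -1.0 if runs is not positive. */
double meanRunTime(void (*work)(void), int runs) {
    if (runs <= 0) {
        return -1.0;
    }
    clock_t start = clock();
    for (int i = 0; i < runs; i++) {
        work();
    }
    clock_t end = clock();
    return ((double)(end - start) / CLOCKS_PER_SEC) / runs;
}

/* Stand-in workload; a real benchmark would run the simulator here. */
void dummyWork(void) {
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 100000UL; i++) {
        sum += i;
    }
}
```

Note that clock() measures processor time rather than wall-clock time, which is usually what you want when comparing two versions of the same code.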

Profiling

  • Profiling is not the same as benchmarking
  • Benchmarking:
    • determines how quickly your program runs
    • is to performance what testing is to correctness
  • Profiling:
    • is used after benchmarking has determined that your program is running too slowly
    • is used to determine which parts of your program are causing it to run slowly
    • is to performance what debugging is to correctness

Benchmarking & Profiling

  • Without benchmarking you risk making changes to your program that will lead to poorer performance
  • Without profiling you risk wasting effort optimising a part of code which is either already fast or rarely executed

Documenting: Source code comments are a good place to explain why the code is the way it is

CSLP assessment

Assessment Criteria (I)

  1. Implementation of requirements:
    1. Parsing
    2. Input validation
    3. Correct simulation & correct output
    4. Summary statistics of simulation results
    5. Experimentation implementation
  2. Source code documentation (comments)

Assessment Criteria (II)

  1. Testing, including sample test input scripts
  2. Maintainable code
  3. Code efficiency (optimisations)
  4. Any additional features
  5. Written report
  6. Early submission

Objective & Subjective Criteria

  • Some of the items on the above list are objective whilst some are subjective
  • Objective criteria are those which are testable
  • Subjective criteria are those which are, at least partially, based upon opinion

Objective Assessment Criteria

  • The most objective assessment criterion is:
    • Early submission
  • Either you submit it before the early submission deadline or you do not
  • Though arguably this is not really an assessment criterion

Objective Assessment Criteria

The items in this first list of implementation requirements are all relatively objective:
  1. Parsing
  2. Input validation
  3. Correct simulation & correct output
  4. Summary statistics of simulation results
  5. Experimentation implementation

Objective Assessment

  • Your application will be put through my own suite of test inputs
  • Some of these test inputs will be inputs you have seen, some will be new
  • Part of the exercise is for you to foresee possible inputs for which your application would fail
    • Either by crashing, or by producing incorrect output
  • Should your application fail any tests I would have to figure out why this happened and objective marking will not be so straightforward

Parsing

  • Your parser should be able to parse all syntactically valid input scripts
  • I cannot say it much simpler than that
  • There will not be any deliberately tricky tests

Input Validation

  • This is the first task which is not finely specified
  • You have to demonstrate some ingenuity to devise your own rules for what should and should not be valid input
  • You also have to decide which kinds of inputs result in warnings or errors
    • Specifically those in which the simulation could be started but may result in an error
    • This may depend upon the structure of your simulator

Correct Simulation & Output

  • Here I will be testing whether your simulator follows the requirements correctly
  • The simulator is tested via its output, so these are tested at the same time
  • Having said that, where the output is not correct, the code is inspected to determine why
  • Minor syntactic issues with the output will be judged leniently
    • This is part of the reason your code must compile on DiCE

Summary Statistics

  • This will test for correctly calculating and reporting the specified summary statistics
  • It is possible to get the simulation incorrect but the summary statistics correct
  • A small tip is to make sure your reported statistics are consistent with each other
  • It might be that you are getting inconsistent results because your simulation is incorrect, in which case you should note this in your README

Experimentation Implementation

  • Whether or not you correctly implement the experimentation of disposal rates and collection frequencies
  • As before it is possible to get this correct, without getting either (or both) of the simulation and the summary statistics correct
  • As before, if you are getting inconsistent results you should at least note that in your README

Code efficiency

  • Implementing some code optimisations will lead to shorter run times
  • It is possible that you implement everything above correctly, but your simulations take a very long time to complete
  • On the other hand, your code may run fast, but will not have implemented all requirements. This is not considered to be efficient

Noting Deficiencies

  • Use your README file to record any deficiencies you are aware of
  • In general any implementation errors will be treated more indulgently if they are known about
  • Remember, it is generally worse to produce incorrect output than no output at all

Subjective Assessment

The remaining items are mostly judged subjectively
  • Source code documentation
  • Testing, including sample test input scripts
  • Maintainable code
  • Any additional features
  • Written report

Documentation

  • Use appropriate comments to document your code
  • You may develop additional features which, if you do not document, I may not even know about
  • Clearly mark and explain any code that you have not authored yourself
  • Remember that code sharing is not allowed

Testing

  • The aim of the practical is to write a good simulator
    • You can at least strive for “half decent”
  • Either way, running one test input is woefully insufficient
  • You also need to be able to investigate the performance of the "bin collection process", what parameters affect this and how

Maintainable Code

  • Highly subjective
  • Remember, reusable code is more difficult to understand
  • But, reusable code is easier to reuse and maintain
  • What is an inexperienced developer to do?
  • Try to imagine what you might wish to do in the future

Maintainable Code

  • Highly subjective
  • Trying to justify your choices is likely a good thing
  • Even if your reasoning is flawed, it demonstrates that you have thought about how to design your source code
  • It also shows that you probably could have implemented things in another way, but specifically chose not to
  • A future maintainer at least knows why you made that choice. If they disagree, they can change the code without fear of some other reason they have not yet uncovered

Additional Features

  • This is your chance to be creative and go beyond the implementation of the requested features
  • It perhaps requires some imagination, but imagine you were really going to use your simulator to investigate some real (or other) logistics operation
  • What would be useful to you?

README

  • Don't forget to provide me with a README
  • In general this can only help your grade:
    • It lets me know good things are deliberate and not fortunate
    • It lets me know that deficiencies are at least known about

Written report

You should produce a written report that discusses:

  • the key building blocks of your design
  • the results of the analyses you performed with different inputs
  • insights gained into the system's performance
  • a summary of the most important findings

Useful things to include in your written report

  • Produce graphs based on the numerical output of your simulations to support your findings, especially for experimentation
  • Explain the purpose of the tests carried out, whether the results met your initial expectations, and any lessons learned
  • Motivate your choice(s) of route planning algorithms implemented and discuss their impact on the performance of the system

Final Points

  • The report carries 25% of the final mark
  • There is no minimum number of pages required for the report
  • Present your findings and results clearly
  • Submit the report as a PDF file
  • Students are often worried about losing marks
  • Indeed our own assessment descriptions often talk of losing marks
  • But let's not forget, you start with zero

Announcement

ACM ICPC
(international collegiate programming contest)

  • Prestigious, international programming contest for students in teams of 3.
  • This year the first regional will be in Sweden on 29-30 November. Details at http://www.nwerc.eu/
  • UoE agreed to fund travel for a team to compete!
  • There will be try outs and training sessions in the coming month.
If you'd like to be a part of this, email Hugh Leather <hughleat@gmail.com> by end of today.

Array & String Handling

Allocating arrays & matrices

Last time we discussed how you can allocate memory dynamically for an array of N elements.


int N;
int *array;

array = (int *) malloc(N * sizeof(int));

But, how do you allocate memory for a matrix?

Common mistake

int N;
int **matrix;

matrix = (int **) malloc(N * N * sizeof(int));

Matrix allocation

Remember, you are trying to allocate a pointer to an array of pointers to integers

Matrix allocation

Approach 1


int i,N;
int **matrix;

matrix = (int **) malloc(N * sizeof(int*));  // rows
for(i = 0; i < N; i++)   
   matrix[i] = (int *) malloc(N * sizeof(int));  // columns

// access the (i,j) element by
matrix[i][j] = ...

Approach 2 (define the matrix as an array)


int i,N;
int *matrix;

matrix = (int *) malloc(N * N * sizeof(int));  

// and access the (i,j) element by
matrix[i*N+j] = ...

Matrix deallocation

When using the first approach, first deallocate the memory allocated for each row



for(i = 0; i < N; i++)   
   free(matrix[i]);
free(matrix);

When using the second approach, simply


free(matrix);
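
The first approach can be wrapped in a pair of helper functions that also handle allocation failure. This is a sketch only; allocMatrix and freeMatrix are hypothetical names.

```c
#include <stdlib.h>

/* Frees the first `rows` rows of the matrix, then the row-pointer array. */
void freeMatrix(int **matrix, int rows) {
    if (matrix == NULL) {
        return;
    }
    for (int i = 0; i < rows; i++) {
        free(matrix[i]);
    }
    free(matrix);
}

/* Allocates an N x N matrix; returns NULL (leaking nothing) on failure. */
int **allocMatrix(int N) {
    if (N <= 0) {
        return NULL;
    }
    int **matrix = malloc(N * sizeof *matrix);           /* rows */
    if (matrix == NULL) {
        return NULL;
    }
    for (int i = 0; i < N; i++) {
        matrix[i] = malloc(N * sizeof *matrix[i]);        /* columns */
        if (matrix[i] == NULL) {
            freeMatrix(matrix, i);   /* release rows allocated so far */
            return NULL;
        }
    }
    return matrix;
}
```

Elements are then accessed as matrix[i][j], and freeMatrix(matrix, N) releases everything in the right order.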

What about arrays of structures?

  • Imagine the following

typedef struct {
  int groupSize;
  float* marks;
} GROUP;

int nGroups = 5;
GROUP *g;
  • The same principle applies


g = (GROUP *) malloc(nGroups * sizeof(GROUP));
for (i = 0; i < nGroups; i++) {
   fscanf(stdin, "%d", &g[i].groupSize);
   g[i].marks = (float *) malloc(g[i].groupSize * sizeof(float));
   ...
}
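
Deallocation follows the same principle as for matrices: free each inner array before the outer one. The helper below is a hypothetical sketch that assumes every marks pointer is either NULL or was obtained from malloc.

```c
#include <stdlib.h>

typedef struct {
  int groupSize;
  float *marks;
} GROUP;

/* Releases each group's marks array, then the array of groups itself. */
void freeGroups(GROUP *g, int nGroups) {
    if (g == NULL) {
        return;
    }
    for (int i = 0; i < nGroups; i++) {
        free(g[i].marks);      /* free(NULL) is a harmless no-op */
    }
    free(g);
}
```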

String handling

  • Strings are simply arrays of characters terminated by the ASCII null character '\0'.
  • 
    char *str;
    char string[100];
    
    str = (char*) malloc(100*sizeof(char));
    
  • C provides a set of functions in the standard library, that are useful for manipulating strings.
  • Typical operations: copying, tokenizing, comparing, searching, etc.
  • Most of these are given in the <string.h> header file, but a few exist in <stdlib.h> as well

Functions you may use

  • Copying
  • 
    char* strcpy(char *dst, const char *src);
    
    Copies src to dst including the terminating '\0' character. Returns dst.
    
    char* strncpy(char *dst, const char *src, size_t len);
    
    Copies at most len characters from src to dst. If the length of src is less than len, the remainder of dst is padded with '\0'; otherwise dst is not null-terminated. Returns dst.

    NB: careful with sizes to avoid memory corruption.
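
    One common way to stay within bounds is to copy at most one character fewer than the destination size and terminate the string explicitly. The helper name copyString below is hypothetical; this is a sketch, not the only safe approach.

```c
#include <string.h>

/* Copies src into dst (capacity dstSize bytes), truncating if necessary,
   and always leaves dst null-terminated. Does nothing if dstSize is 0. */
void copyString(char *dst, size_t dstSize, const char *src) {
    if (dstSize == 0) {
        return;
    }
    strncpy(dst, src, dstSize - 1);
    dst[dstSize - 1] = '\0';   /* strncpy does not terminate on truncation */
}
```

    For example, copying "hello" into a 4-byte buffer with copyString leaves the string "hel" in the buffer.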

Functions you may use

  • Comparing
  • 
    int strcmp(const char *str1, const char *str2);
    
    Returns:
    • <0 if the first character that does not match has a lower value in str1 than in str2
    • 0 if the contents of both strings are equal
    • >0 if the first character that does not match has a greater value in str1 than in str2

Functions you may use

  • Searching
  • 
    char* strstr(const char *str1, const char *str2);
    
    Returns: a pointer to the first occurrence of str2 in str1, or NULL if not found.
  • Examining
  • 
    size_t strlen(const char *str);
    
    Returns: the length of the null-terminated string str, i.e. the offset of the terminating '\0' character.

Functions you may use

  • Tokenising – splitting a string into tokens according to some delimiter(s)
  • 
    char *strtok(char *str, const char *delim);
    
    • str is broken into smaller strings (str itself is modified)
    • delim may contain different characters to be used as delimiters
    • Returns a pointer to the next token found, or NULL if no more tokens are found
    • Can be called multiple times to find all tokens

Functions you may use

Example


char str[100] = "The quick brown fox jumps over the lazy dog";  // not const: strtok modifies str
const char delim[2] = " ";
char *token;
   
token = strtok(str, delim);   // gets first token
   
while(token != NULL) {   // retrieve all tokens; stop when no more found 	
   printf("%s\n", token);
   token = strtok(NULL, delim);
}

Converting strings to numbers

  • Converting to floating-point numbers
  • 
    double strtod(const char *str, char **ptr);
    float strtof(const char *str, char **ptr);
    
    Convert the initial portion of the string str to double or float. Return the floating-point value and store in *ptr a pointer to the first character after the converted number (the non-numerical part, if any).
  • Example:
  • 
    char str[11] = "9.50 marks";
    char *ptr;
    float fVal;
    
    fVal = strtof(str, &ptr);  
    printf("Number:%.2f\t String:%s\n", fVal, ptr);
    // Number:9.50     String: marks
    

Converting strings to numbers

  • Converting to (long) integer numbers
  • 
    long int strtol(const char *str, char **ptr, int base);
    
    Converts the initial portion of the string str to long int according to the given base value. Returns the long value and stores in *ptr a pointer to the first character after the converted number (the non-numerical part, if any).
    • base must be between 2 and 36
    • if base is 0, the expected form is a decimal/octal/hexadecimal constant
  • Example:
  • 
    char str[11] = "60 seconds";
    char *ptr;
    long int liVal;
    
    liVal = strtol(str, &ptr, 10);  
    printf("Number:%ld\t String:%s\n", liVal, ptr);
    // Number:60     String: seconds
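
    For input validation it is often useful to accept a string only if the whole of it is a number. The sketch below shows one way to do this with strtol; the helper name parseLong and its return convention are assumptions made for illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Converts the whole of str to a long in base 10.
   Returns 1 and stores the value in *out on success, 0 otherwise. */
int parseLong(const char *str, long *out) {
    char *end;
    errno = 0;
    long value = strtol(str, &end, 10);
    if (end == str) {
        return 0;              /* no digits were converted */
    }
    if (*end != '\0') {
        return 0;              /* trailing non-numeric characters */
    }
    if (errno == ERANGE) {
        return 0;              /* value does not fit in a long */
    }
    *out = value;
    return 1;
}
```

    With this helper "60" is accepted, while "60 seconds" and the empty string are rejected.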
    

Converting strings to numbers

  • Question: what will be the output of the following?
  • 
    char *str;
    double fVal;
    long int liVal;
    
    liVal = strtol("20.00mm", &str, 10);  
    printf("Number:%ld\t String:%s\n", liVal, str);
    
    fVal = strtod("1e+2 litres", &str);
    printf("Number:%.1lf\t String:%s\n",fVal,str);
    
    liVal = strtol("FFGH", &str, 16);
    printf("Number:%ld\t String:%s\n", liVal, str);
    

Converting strings to numbers

  • Answer:

Number:20	 String:.00mm
Number:100.0	 String: litres
Number:255	 String:GH

Note: A good resource for understanding other string manipulation functions is available here.

Optimising compilation

Optimising compilation

  • We have already discussed code optimisation
    • Benchmarking
    • Profiling
  • It is possible to further optimise your code at compilation
    • try to minimise program's execution time
    • try to minimise the amount of memory occupied (less common)
    • minimise the consumed power (for mobile devices)

Compiling and Linking

  • Compiling is not the same as creating an executable
  • Building an executable involves compilation and linking
  • Your code may compile without errors, but it may fail during the linking phase

Compiling and Linking

Compilation

  • Turning the source code into an 'object' file.
  • This is not executable, it only contains the corresponding machine language instructions
  • If you have multiple files, you will have multiple objects

# gcc -c -o "simulator.o" "simulator.c"
# gcc -c -o "utils.o" "utils.c"

The "-c" flag specifies that no linking should be done at this stage

Compiling and Linking

Linking

  • The process of creating a single executable from multiple object files
  • Finds references for the functions that are used in one file but were defined in another

# gcc -o "simulator" simulator.o utils.o

Compiling and Linking

  • This approach allows building large programs without having to recompile everything each time a file is changed
  • Conditional compilation: compile only the source files that have changed
  • Conditional compilation works well when you use an IDE.
  • Otherwise you will have to manually create a makefile and use the make utility, which determines what needs to be recompiled

Optimising compilation

  • When you compile your code, you can set some flags that instruct the compiler to perform some optimisation
  • Note that compilation then often takes more time and requires more memory, but your executable may run faster
  • Example:

    # gcc -O3 -c -o "simulator.o" "simulator.c"

    -O<level> instructs the compiler to optimise at the given level; -c is needed because the output here is an object file.

Optimising compilation

  • -O1 - tries to reduce code size and execution time, without performing optimisations that increase compilation time significantly.
  • -O2 - performs several optimisations that do not involve a space-speed trade-off. Increases both compilation time and the performance of the generated code.
  • -O3 - turns on all -O2 optimisations plus a few more aggressive ones.
  • -O0 - reduces compilation time and makes debugging produce the expected results (default).

The GCC manual page gives you more in-depth information about the above.
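One way to see the effect of the optimisation level is to time the same piece of code in builds made with different flags. The sketch below uses clock() from the standard library; the workload and both function names are hypothetical stand-ins for whatever part of your simulator you want to measure.

```c
#include <time.h>

/* Hypothetical workload: sums (i * i) % 7 for i = 1..n. */
unsigned long checksum_loop(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; i++)
        total += (i * i) % 7;
    return total;
}

/* Returns the CPU time, in seconds, taken by one call to checksum_loop(n). */
double time_checksum_loop(unsigned long n)
{
    clock_t start = clock();
    /* volatile keeps the optimiser from discarding the unused result */
    volatile unsigned long result = checksum_loop(n);
    clock_t end = clock();
    (void)result;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Build it once with -O0 and once with -O3, call time_checksum_loop with a large n in both builds and compare the values. Timings vary between runs, so repeat the measurement a few times before drawing conclusions.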

Multiple Files

  • Question: Should you spread your implementation across multiple source code files?
  • There may be some good reasons to do so:
    • Increases code reusability
    • Reduces compilation time (only the files that changed need recompiling)
    • Can help you navigate the source code faster

Multiple Files

  • I am not suggesting you should not, but do so for a good reason
  • Given the size of this project, you could try to use as few files as possible
  • Move type definitions, functions, etc. to separate files when that seems necessary

Should I develop code with or without an IDE?

  • This shouldn't make a difference, but you may have good reasons for choosing one of the two approaches.
  • Coding using a plain text editor (e.g. vi, nano)
    • You can easily code remotely (over ssh) on e.g. a DiCE machine
    • May have to write a makefile if working with multiple source files
    • Better control over compilation optimisation

Should I develop code with or without an IDE?

  • Using IDEs
    • Nicer keyword highlighting
    • Some auto-complete braces/brackets/parentheses
    • Some may have integrated help for functions
    • Some may warn about certain syntax errors as you type
    • Perhaps easier if you are not very experienced in C
  • If you decide to code using an IDE, it's entirely up to you which one you choose (NetBeans C/C++ pack, CodeLite, Eclipse CDT, etc.)
  • Eclipse CDT is installed on DiCE machines

Questions?

Performance Evaluation

Performance Evaluation

  • The CSLP requirement includes (among others):
    • Computing summary statistics
    • Supporting experimentation with certain parameters
  • Performing these tasks should help you get a good understanding of the bin collection process and how this may be improved in a practical setup
  • Today we will look at performance evaluation aspects from a more general perspective

System/Process Implementation

  • Designing and implementing logistics operations, complex processes, and systems involves several steps.
  • There is often a feedback loop involved, which allows the system to be refined/improved/extended.

Requirements Analysis

  • Understand the problem domain and specifications, and identify the key entities involved.
  • Build an abstract representation of the system to be able to handle various input scenarios.

System Design

  • Dividing the system into components; choosing suitable methodologies for implementing each component.
  • Defining appropriate data structures, input/output formats, etc.

Development

  • This is the actual implementation work and is typically coupled with some preliminary testing.
  • For source code, janitorial work, refactoring and some optimisation are also performed at this stage.

Testing

  • Validation is performed once the system is partially/entirely developed, together with benchmarking and profiling.
  • A system's performance evaluation is undertaken (experimentation with different inputs, distributions).

Deployment

  • Once the tool (planner, simulator, etc.) has been thoroughly tested it can be deployed in a real setting.
  • The input will be based on actual data and inputs may change over time (e.g. based on certain events).

Monitoring

  • Once the system is operational, it is possible to gather real measurements and use those to refine the design.
  • If new requirements are identified during operation, the system can be further extended.

The Bin Service Process

  • Your simulator will be implementing a good bit of what could become a real logistics system.
  • Unfortunately you will not have the opportunity to experiment with real data, but (time permitting) you have the flexibility to develop additional features.

Performance Evaluation

  • We have discussed the requirements, as well as different design and development aspects for your simulator.
  • We will now look into performance evaluation issues. Some of the things I will present may not be needed for this assignment, but will likely prove useful later.

Performance Evaluation

  • Generally speaking, this is about quantifying the performance of a system
  • The first step is to identify the relevant metrics, i.e. measurable quantities that capture properties of interest
    • This could be the throughput/delay of a communications link, the power consumption of a mobile device, the memory used by a software application, etc.
    • For CSLP we are interested in the average trip duration, trip efficiency, number of trips per schedule, and percentage of overflows.

Metrics

  • It is essential to understand what is desirable for each metric, i.e. whether it should be small or large.
  • It is also important to be aware of the goals of the evaluation:
    • Improve the dimensioning/parametrisation of a system or process
    • Compare how different designs perform under different inputs and choose the best one

Methodologies

When designing a system, performance evaluation can be conducted through one or more of the following methodologies:

  1. Numerical analysis - plugging some numerical values into a mathematical model of the system and computing the metrics of interest
  2. Simulation - constructing a simplified model of a more complex real system and simulating its behaviour; typically fast, but neglecting certain practical aspects
  3. Experimentation - Analysing the performance of a system through measurements. Assessing performance under exceptional circumstances may be infeasible

Accuracy

  • It is advisable that the assumptions made for the evaluation campaign are well documented, to ensure the tests performed are reproducible.
  • You are working with a stochastic simulator and thus there will be some variability in the results of different tests with the same input.
  • For this practical you have been asked to give average values of a set of metrics.
  • In rigorous studies, it is necessary to also provide some confidence intervals for the results.

Summary Statistics

Histograms are graphical representations of the distribution of a set of measurements.

Example: distribution of the h-index of Nobel-prize recipients in Physics between 1985 and 2005.

Source: J.E. Hirsch, "An index to quantify an individual's scientific research output", Proc. NAS, 2005.

h-index: the largest number h such that h of a researcher's papers have at least h citations each.

Histograms

  • In mathematical terms, the histogram is a function that counts the number of observations in different categories (bins)
  • The number of bins is typically computed as

$$k = \frac{\max(x) - \min(x)}{n}$$

where n is the number of samples in the data set x.
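Counting observations into bins can be sketched as follows. This is an illustration under assumptions: the function name is made up, the bins are of equal width over [min(x), max(x)], and the number of bins k is simply passed in by the caller rather than derived from the data.

```c
#include <stddef.h>

/* Counts how many of the n samples in x fall into each of k equal-width
 * bins spanning [min(x), max(x)]. counts must have room for k entries.
 * Returns 0 on success, -1 if the arguments are unusable. */
int histogram(const double *x, size_t n, size_t k, size_t *counts)
{
    if (x == NULL || counts == NULL || n == 0 || k == 0)
        return -1;

    double lo = x[0], hi = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }

    for (size_t b = 0; b < k; b++)
        counts[b] = 0;

    double width = (hi - lo) / (double)k;
    for (size_t i = 0; i < n; i++) {
        /* If all samples are equal, everything goes into the first bin. */
        size_t b = (width > 0.0) ? (size_t)((x[i] - lo) / width) : 0;
        if (b >= k)          /* the maximum lands on the upper edge */
            b = k - 1;
        counts[b]++;
    }
    return 0;
}
```

Printing counts[b] next to the lower edge of each bin, lo + b * width, gives a quick text histogram of e.g. trip durations across simulation runs.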

Mean and Standard Deviation

Computing the mean (average) of a set of measurements is straightforward:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The standard deviation gives a measure of the variation of the measurements from the mean:

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$

Confidence Intervals

  • These can be used to quantify the uncertainty about the average of a set of measurements subject to randomness.
  • When computing averages across multiple simulations, you are gathering samples to estimate an unknown population mean.
  • You choose the significance level that will reflect how confident you can be that the true value lies within that interval.
  • E.g. for a significance level of 0.05, you will obtain a 95% confidence interval (typically used in practice).

Confidence Intervals

  • The width of the confidence interval is affected by:
    • sample size,
    • population variability (standard deviation),
    • confidence level chosen.
  • Central Limit Theorem: For a large sample size, the distribution of the sample mean approaches a normal distribution.
  • The expected value of the sample mean equals the mean of the population.

Confidence Intervals

A quick method to compute a CI is:

$$\mu \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

where $z_{\alpha/2}$ is the critical coefficient corresponding to a significance level α (i.e. a confidence level of 1 - α) and is obtained from z-score tables.

Example: Sample size 20, mean 10, standard deviation 1.45, 95% confidence level, i.e. the z-score corresponding to a table area of 0.475 (= 0.95/2), which is 1.96.

CI is 10 ± 1.96 × 1.45/√20 ≈ 10 ± 0.64

i.e. [9.36, 10.64]

Confidence Intervals

Plotting CIs

CSLP statistics

Average Trip Duration
  • Compute the average duration of a lorry journey throughout the total simulation time.
  • The metric may differ between areas, so compute it both per area and globally.
  • When experimenting, you will be able to gain understanding of how bin thresholds impact route lengths, as well as how the implemented route planning algorithms perform.

CSLP statistics

Trip Efficiency
  • Compute the volume collected per unit of travel time.
  • This is somewhat related to the previous metric.
  • When experimenting, you can also examine how waste disposal rates affect the efficiency.
  • Again, per area and global statistics will be useful.
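A possible way to compute the first two metrics from a list of completed trips. The struct layout and function names are hypothetical, and efficiency is interpreted here as total volume over total travel time, which is only one reasonable reading of the requirement.

```c
#include <stddef.h>

/* Hypothetical record of one completed lorry trip. */
struct trip {
    double duration;  /* travel time, in whatever unit you report */
    double volume;    /* volume collected during the trip */
};

/* Average trip duration; returns 0.0 if there were no trips. */
double average_trip_duration(const struct trip *trips, size_t n)
{
    if (n == 0)
        return 0.0;
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += trips[i].duration;
    return total / (double)n;
}

/* Volume collected per unit of travel time over all trips;
 * returns 0.0 if no time was spent travelling. */
double trip_efficiency(const struct trip *trips, size_t n)
{
    double total_volume = 0.0, total_time = 0.0;
    for (size_t i = 0; i < n; i++) {
        total_volume += trips[i].volume;
        total_time += trips[i].duration;
    }
    return (total_time > 0.0) ? total_volume / total_time : 0.0;
}
```

Keeping one array of trips per area lets the same two functions produce the per-area figures; passing the trips of all areas together gives the global ones.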

CSLP statistics

Number of Trips per Schedule
  • Compute the average number of trips a lorry performs to service all the bins whose thresholds have been reached at the start of the schedule.
  • Area size, disposal rate and (importantly) lorry capacity will impact this.
  • If trips to a small number of bins are to be performed to complete the service, the efficiency of the process may be affected.
  • On a new trip within the same schedule you can choose to go ahead with the initial plan or check if other bins' thresholds have been exceeded in the meantime.

CSLP statistics

Percentage of overflows
  • This metric should reflect whether the service is scheduled frequently enough.
  • A bin overflows once its capacity is exceeded and this event is marked only once.
  • For each area, you can count how many bins have overflowed at the start of a schedule.
  • There is some volatility in this metric. Some bins may overflow while the lorry is en route to them, and you may miss those in the count.
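A sketch of marking an overflow only once and turning the marks into a percentage. The struct fields and function names are assumptions; resetting the flag when a bin is serviced is left out for brevity.

```c
#include <stddef.h>

/* Hypothetical bin state. */
struct bin {
    double capacity;
    double contents;
    int    overflowed;  /* set the first time capacity is exceeded */
};

/* Adds waste to a bin. Returns 1 only on the disposal that first pushes
 * the contents above capacity, so the event is marked exactly once. */
int dispose(struct bin *b, double volume)
{
    b->contents += volume;
    if (b->contents > b->capacity && !b->overflowed) {
        b->overflowed = 1;
        return 1;
    }
    return 0;
}

/* Percentage of the n bins currently marked as overflowed. */
double overflow_percentage(const struct bin *bins, size_t n)
{
    if (n == 0)
        return 0.0;
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (bins[i].overflowed)
            count++;
    return 100.0 * (double)count / (double)n;
}
```

Calling overflow_percentage on the bins of one area at the start of each schedule gives the per-area figure.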

Questions?

Review of Part 1

Submissions

  • 6 out of 29 students submitted at least something.
  • That's a 20% submission rate; I was expecting something closer to 50%.
  • 3 out of the 6 submissions did not ask any explicit questions nor did they highlight any aspects on which they wanted feedback.
  • 1 submission did not compile and did not have a proper declaration of main.

Multiple Files

Header files

Number of files    Frequency
16                 1
14                 1
10                 1
8                  1
2                  1
0                  1

Multiple Files

Source files

Number of files    Frequency
16                 1
13                 1
11                 1
8                  1
4                  1
2                  1

Multiple Files

  • There seems to be a preference for using a relatively large number of files, given the small size of this project
  • This is likely due to some of you taking an OOP-like design approach
  • There is nothing wrong with that, but pay attention to memory management
  • Using more/fewer files will not be penalised; just explain why you chose to implement things in a certain way
  • If refactoring is a reason, do emphasise that

The READMEs

  • Ranged from very basic ones, containing a couple of lines, to very detailed ones.
  • Most of them explained how to build and execute the simulator, as expected.
  • Some acknowledged the limitations of the code and problems known at the time of submission.
  • Some did not state explicitly at which stage of the development they were.

Random Goodness

  • “...using some pretty nasty typecasts though it seems to me that all of C is this way...”
  • “Output - is a mess”

Random Not-So-Goodness

  • Low-level coding decisions:
    • Prone to change and you will forget to update the README
    • Should be in comments in the source code file concerned – some code lacked commenting altogether
  • High-level structure: you are more likely to remember to change the README in case of a major re-structuring

Random Not-So-Goodness

  • Make sure you read the requirements carefully
    • On two occasions the first command line parameter was not the input script, but a keyword that introduced the input script
    • One submission, although working to some extent, did not implement any command line parsing at all.
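Taking the input script from the first command line parameter needs very little code. The helper below is a hypothetical sketch; any further options your simulator accepts would have to be handled separately.

```c
#include <stddef.h>

/* Returns the input script path, i.e. the first command line parameter,
 * or NULL if it is missing or empty. */
const char *input_script_from_args(int argc, char *argv[])
{
    if (argc < 2 || argv[1] == NULL || argv[1][0] == '\0')
        return NULL;
    return argv[1];
}
```

In main you would call it with argc and argv, and print a usage message and exit with a non-zero status when it returns NULL.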

Refactoring

  • 2 out of 6 READMEs contained mentions of refactoring
  • Both with reference to future refactoring:
    • Either promising to refactor later or
    • Having done a good bit of refactoring already, but planning to do more
  • It is still early days; however, refactoring is something you should be trying to do constantly

Invalid Input

  • Try to think of exceptions which you really do not believe can happen under normal execution conditions
  • The user may simply make some typing errors when producing the input
  • Some parameters may have been given in an order different from the expected one
  • When is this a serious problem?
  • Distinguish between warnings/errors where possible

Invalid Input

  • What should you do if you discover you have incomplete information during a simulation run?
  • For example, you attempt to retrieve the disposalRate and find that it is unavailable
  • This is not problematic at the parsing stage, because the user may simply have forgotten to specify the disposal rate
  • However, if you validate the input before running the simulation, then it really becomes a problem to find a missing disposal rate during the simulation
  • The simulation should not have been started since the validation should have uncovered the error
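Validating before the run could look like the sketch below, which also separates warnings from errors. The struct, the enum and the rule that a zero rate is only a warning are assumptions made for illustration; only the disposalRate parameter comes from the slides.

```c
#include <stdio.h>

enum severity { SEVERITY_OK, SEVERITY_WARNING, SEVERITY_ERROR };

/* Hypothetical subset of the parsed simulation parameters. */
struct sim_params {
    int    has_disposal_rate;  /* set by the parser when the value was given */
    double disposal_rate;
};

/* Checks the parameters once, before the simulation starts.
 * Errors should stop the run; warnings are reported but allow it. */
enum severity validate_params(const struct sim_params *p)
{
    if (!p->has_disposal_rate) {
        fprintf(stderr, "Error: disposalRate is missing\n");
        return SEVERITY_ERROR;
    }
    if (p->disposal_rate < 0.0) {
        fprintf(stderr, "Error: disposalRate must not be negative\n");
        return SEVERITY_ERROR;
    }
    if (p->disposal_rate == 0.0) {
        fprintf(stderr, "Warning: disposalRate is 0, bins will never fill\n");
        return SEVERITY_WARNING;
    }
    return SEVERITY_OK;
}
```

If validate_params returns SEVERITY_ERROR the simulation is never started, so a missing disposal rate found later, during the run, can only be a bug in the simulator itself.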

Further checks

  • Check the sign of numbers
  • Check that what you find is indeed a number when a number is expected
  • Check if a number has the type you expect (integer/real)
  • Be careful with hours/minutes conversions
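The first three checks can be done with strtol and its end pointer. A minimal sketch, assuming the field is expected to hold a non-negative integer; the function name is made up.

```c
#include <errno.h>
#include <stdlib.h>

/* Parses text as a non-negative base-10 integer.
 * Returns 0 and stores the value in *out on success, -1 otherwise. */
int parse_non_negative_long(const char *text, long *out)
{
    char *end;

    errno = 0;
    long value = strtol(text, &end, 10);

    if (end == text)        /* no digits at all, e.g. "abc" or "" */
        return -1;
    if (*end != '\0')       /* trailing characters, e.g. "20.5" or "20mm" */
        return -1;
    if (errno == ERANGE)    /* does not fit in a long */
        return -1;
    if (value < 0)          /* wrong sign */
        return -1;

    *out = value;
    return 0;
}
```

Rejecting "20.5" here is what catches a real number where an integer was expected; a similar function built on strtod would cover the fields that are allowed to be real.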

Input scripts

  • Only some of you have authored input scripts
  • I recently made some examples available, but it is essential that you produce your own test scripts
  • Graph size is not the only thing that matters
  • Think of scenarios where e.g. route planning algorithms may have a hard time

Pleasant surprises

In no particular order:
  • One submission came with reference documentation generated with Doxygen
  • One submission used a fancy variant of a Mersenne Twister for pseudo-random number generation
  • One submission used revision control (Git) during development

Not-so-pleasant surprises

  • 2 out of 6 submissions did not implement input parsing and validation at all.
  • Although Part 1 was not marked and was aimed at helping you, only a few of you (20%) submitted it.
  • I have met those who submitted something in class (on at least one occasion).
  • Low interest vs. difficult course?

YAGNI

  • A final piece of advice
  • Try to keep things simple: Do the simplest thing that could work
    • Then rethink/refactor if it does not work
  • YAGNI: You Aren't Gonna Need It
    • Try not to over-complicate things by over-anticipating future requirements

Questions?