Git - Source Code Control

Software Engineering Large Practical

No Tutorials

There seems to be a little confusion over whether or not there are any tutorials. There are no tutorials so you do not need to worry about attending any.

Today's Lecture

  • I will talk about your source code control mechanism, git
  • Since I want your proposal to be stored under source code control, I'd better say something about it now
  • I will start with a very brief tutorial
    • What you need to know for this practical
  • I will then try to motivate more sophisticated use

Basic Source Code Control

As I stated previously the first thing to do is to initialise your repository

$ mkdir selp # Call it something sensible
$ cd selp
  

$ git init
  

$ editor README.md
$ git add README.md
$ git commit -m "Initial commit including a README"
  

The main point

  • After each portion of work, commit to the repository what you have done
  • Everything you have done since your last commit, is not recorded
  • You can see what has changed since your last commit, with the status and diff commands:

$ git status
# On branch master
nothing to commit (working directory clean)
  

Staging and Committing

  • If I edit a file the git status command will tell me what has changed

$ editor README.md
$ git status
# On branch master
# Changed but not updated:
#   (use "git add ‹file›..." to update what will be committed)
#   (use "git checkout -- ‹file›..." to discard changes in working directory)
#
#	modified:   README.md
#
no changes added to commit (use "git add" and/or "git commit -a")
  

Unrecorded and Unstaged Changes

  • A git diff at this point will tell me the changes I have made that have not been committed or staged

$ git diff
diff --git a/README.md b/README.md
index 9039fda..eb8a1a2 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,2 @@
 I am going to write a new web site.
+The web site will rank video games for difficulty.
  

Staging and Committing

  • When you commit, you do not have to record all of your recent changes. Only changes which have been staged will be recorded
  • You stage those changes with the git add command.

To Add is to Stage

  • If I stage that modification and then ask for the status I will be told that there are staged changes waiting to be committed
  • To stage the changes in a file use git add

$ git status
# On branch master
# Changed but not updated:
#   (use "git add ‹file›..." to update what will be committed)
#   (use "git checkout -- ‹file›..." to discard changes in working directory)
#
#	modified:   README.md
#
no changes added to commit (use "git add" and/or "git commit -a")
$ git add README.md
$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ‹file›..." to unstage)
#
#	modified:   README.md
#
  

Viewing Staged Changes

  • At this point git diff is empty because there are no changes that are not either committed or staged
  • Adding --staged will show differences which have been staged but not committed

$ git diff # outputs nothing
$ git diff --staged
diff --git a/README.md b/README.md
index 9039fda..eb8a1a2 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,2 @@
 I am going to write a new web site.
+The web site will rank video games for difficulty.
  

New Files

  • Creating a new file causes git to notice there is a file which is not yet tracked by the repository
  • At this point it is treated equivalently to an unstaged/uncommitted change

$ editor mycode.mylang
$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ‹file›..." to unstage)
#
#	modified:   README.md
#
# Untracked files:
#   (use "git add ‹file›..." to include in what will be committed)
#
#	mycode.mylang

  

New Files

  • Slightly tricky, git add is also used to tell git to start tracking a new file
  • Once done, the creation is treated exactly as if you were modifying an existing file
  • The addition of the file is now treated as a staged but uncommitted change

$ git add mycode.mylang
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ‹file›..." to unstage)
#
#	modified:   README.md
#	new file:   mycode.mylang
#
  

Committing

  • Once you have staged all the changes you wish to record, use git commit to record them
  • Give a useful message to the commit

$ git commit -m "Added more to the readme and started the implementation"
[master a3a0ed9] Added more to the readme and started the implementation
 2 files changed, 2 insertions(+), 0 deletions(-)
 create mode 100644 mycode.mylang
  

A Clean Repository Feels Good

  • After a commit, you can take the status, in this case there are no changes
  • In general though there might be some if you did not stage all of your changes

$ git status
# On branch master
nothing to commit (working directory clean)
  

Ignoring Files

  • Compiling and running your application may well produce files that you do not wish to be recorded in the repository
  • Git will “complain” about untracked files
  • Your .gitignore file is used to ignore them

$ cat .gitignore
*.py[cod]

# editor backup files
*~

# C extensions
*.so
  

Ignoring Files

  • Don't forget to store your .gitignore file itself in the repository

$ git add .gitignore
$ git commit -m "Added a .gitignore file"
  

Finally git log

  • The git log command lists all your commits and their messages

$ git log
commit a3a0ed90bc90e601aca8cc9736827fdd05c97f8d
Author: Allan Clark ‹author email›
Date:   Wed Sep 25 10:26:57 2013 +0100

    Added more to the readme and started the implementation

commit 22de604267645e0485afa7202dd601d7c64c857c
Author: Allan Clark ‹author email›
Date:   Wed Sep 25 10:17:45 2013 +0100

    Initial commit
  

Minimal Workflow for the Lazy

  • For the really lazy among you
  • Simply remember two things:
    1. Whenever you create a new file git add it
    2. Once you have done one portion of work, commit all your changes with: git commit -a -m "Fixed bug ..."
  • When you actually want to use the SCC features you can read about them then

More on the Web

Keep In Mind

  • Although named the Software Engineering Large Practical, the effort and resources expended means that this is firmly in the category of pretty small projects
  • Most of this advice applies more fully to larger scale projects with multiple developers
  • However it is well worth building up the skills now when you do not need them, such that when you do, you do not mess up

Source Code Control - Problems

  • Solves two main problems:
    1. Allows two or more developers to concurrently work on the same project
    2. Records a history of changes made so that you can revert to any version

Concurrent Development

  • Previously solved using multiple files and good modularisation
  • Multiple developers literally worked on different files
  • Turns out we're not very good at modularisation
    • And we're certainly not good at foresight modularisation
    • Once you have the interfaces set up there is not much work left to do

Concurrent Development

  • You may think that this does not apply to you as this is an individual project
  • That is to assume that you may not take on multiple roles yourself
  • Even the most single-minded of you will at least:
    1. Develop new code
    2. Fix issues in existing code

Concurrent Development

  • You can attempt to do these two things simultaneously, I'll briefly attempt to convince you to do otherwise
    • But I know I will fail
    • I know you will disregard this advice and do your own thing anyway
    • But at least then I'm not culpable

Branching

Branching

  • Let us suppose you have a nice clean source code repository
  • You have some modules with nice clean interfaces
  • You decide today is the day that you will implement new feature X
  • You can simply start implementing and committing as you go

Branching

  • During your development of new feature X, you realise that interface A is not general enough to properly aid the development of feature X
  • So now you have to modify interface A
    • and perhaps even “the” implementation of A
  • In short, you wish the code base had been in a different state before you started the development of new feature X

Branching

  • You should have started a new branch feature-x
  • Once you discover that deficiency of interface A:
    1. Revert back to the master branch
    2. This will have none of the changes which began the implementation of feature X
    3. Fix the interface of A there
      • And test it
    4. Switch back to your feature-x branch
    5. Rebase the changes to interface A here
    6. Continue development of feature X

Branching

  • This is the hard part: convincing you that this is worth it
  • Now your history of changes has logically separated pieces of work

Branching

With simply commiting as you go:
  1. Created three new tests for the intended new feature X, it currently fails
  2. began the implementation of feature X, one of the three acceptance tests passing
  3. Updated the interface of A to add an extra parameter to the get_info method
  4. Match the updated interface of A within the implementation
  5. Updated the doc-comment for the updated interface of get_info method ...
  6. Updated the implementation of feature X all three acceptance tests now passing
  7. Cleaned up the new code for the implementation of feature X
  8. Fixed the tests broken by the update for the get_info method of interface A
  9. Created a new test for the extra parameter in the get_info method of interface A
  10. Added a new acceptance test for feature X

Branching

Branching and rebasing:
  1. Updated the interface of A to add an extra parameter to the get_info method
  2. Match the updated interface of A within the implementation
  3. Updated the doc-comment for the updated interface of get_info method ...
  4. Fixed the tests broken by the update for the get_info method of interface A
  5. Created a new test for the extra parameter in the get_info method of interface A
  6. Created three new tests for the intended new feature X, it currently fails
  7. began the implementation of feature X, one of the three acceptance tests passing
  8. Updated the implementation of feature X all three acceptance tests now passing
  9. Cleaned up the new code for the implementation of feature X
  10. Added a new acceptance test for feature X

Branching

  • Why on earth do you care?
  • Why would you want a history full of logically separated pieces of work?
  • Because you may wish to revert any one of them
  • Suppose, you fix interface A but continued development of feature X makes you realise that this new feature is ill-conceived:
    • Perhaps it is infeasible to develop
    • Or perhaps you see another way to provide the same functionality
  • Now you may wish to revert the implementation of feature X
  • But the new interface A is possibly still better than the previous one, so you wish to keep that

Historical “Versions”

  • This is why source code control mechanims are often called “Revision Control Systems”
  • Note that we are not talking about each explicitly marked and labelled version, but each and every logically point in development that has lead to the current state of affairs
  • Although being able to revert to a labelled version is also useful

A common error


/*
 * 12/26/93 (seiwald) - allow NOTIME targets to be expanded via $(<), $(>)
 * 01/04/94 (seiwald) - print all targets, bounded, when tracing commands
 * 12/20/94 (seiwald) - NOTIME renamed NOTFILE.
 * 12/17/02 (seiwald) - new copysettings() to protect target-specific vars
 * 01/03/03 (seiwald) - T_FATE_NEWER once again gets set with missing parent
 * 01/14/03 (seiwald) - fix includes fix with new internal includes TARGET
 * 04/04/03 (seiwald) - fix INTERNAL node binding to avoid T_BIND_PARENTS
 */

Why is Version Control Important

  • Standard answer: “bug fixes must be back ported to older versions”

Why is Version Control Important

  • Suppose you develop GolfApp 1.0
  • You begin development of GolfApp 2.0
  • Unfortunately someone discovers a security bug in GolfApp 1.0
  • You need to revert back to the source code of GolfApp 1.0 to fix the bug for those users
  • You cannot wait until you release GolfApp 2.0
  • You also cannot fix the bug there and release what you have
    • That source code may contain many unfinished and untested features
    • Or you may not wish to give new features to GolfApp 1.0 users who have not paid for them

The Point

  • Again you may think this does not apply to you because you have only one release to make
  • This allows you to revert to previous versions in order to locate when a bug was introduced
  • This can help greatly in locating the source of a bug
  • This history of changes also helps other people (including your future self) understand why the code is the way it is
  • This is very helpful when you wish to change something without breaking anything

Why is that line like that?

  • Suppose you're reading the linux kernel source code
  • You're in ipc/msg.c

git log -S'msq = msq_obtain_object_check(ns, msqid);' ipc/msg.c

commit 41a0d523d0f626e9da0dc01de47f1b89058033cf
Author: Davidlohr Bueso 
Date:   Mon Jul 8 16:01:18 2013 -0700

    ipc,msg: shorten critical region in msgrcv
    
    do_msgrcv() is the last msg queue function that abuses the ipc lock Take
    it only when needed when actually updating msq.
    
    Signed-off-by: Davidlohr Bueso 
    Cc: Andi Kleen 
    Cc: Rik van Riel 
    Tested-by: Sedat Dilek 
    Signed-off-by: Andrew Morton 
    Signed-off-by: Linus Torvalds 

commit 3dd1f784ed6603d7ab1043e51e6371235edf2313
Author: Davidlohr Bueso 
Date:   Mon Jul 8 16:01:17 2013 -0700

    ipc,msg: shorten critical region in msgsnd
    
    do_msgsnd() is another function that does too many things with the ipc
:
commit 41a0d523d0f626e9da0dc01de47f1b89058033cf
Author: Davidlohr Bueso 
Date:   Mon Jul 8 16:01:18 2013 -0700

    ipc,msg: shorten critical region in msgrcv
    
    do_msgrcv() is the last msg queue function that abuses the ipc lock Take
    it only when needed when actually updating msq.
    
    Signed-off-by: Davidlohr Bueso 
    Cc: Andi Kleen 
    Cc: Rik van Riel 
    Tested-by: Sedat Dilek 
    Signed-off-by: Andrew Morton 
    Signed-off-by: Linus Torvalds 

commit 3dd1f784ed6603d7ab1043e51e6371235edf2313
Author: Davidlohr Bueso 
Date:   Mon Jul 8 16:01:17 2013 -0700

    ipc,msg: shorten critical region in msgsnd
    
    do_msgsnd() is another function that does too many things with the ipc
    object lock acquired.  Take it only when needed when actually updating
    msq.
    
    Signed-off-by: Davidlohr Bueso 
    Cc: Andi Kleen 
    Cc: Rik van Riel 
    Signed-off-by: Andrew Morton 
    Signed-off-by: Linus Torvalds 

commit ac0ba20ea6f2201a1589d6dc26ad1a4f0f967bb8
Author: Davidlohr Bueso 
Date:   Mon Jul 8 16:01:16 2013 -0700

    ipc,msg: make msgctl_nolock lockless
    
    While the INFO cmd doesn't take the ipc lock, the STAT commands do
    acquire it unnecessarily.  We can do the permissions and security checks
    only holding the rcu lock.
    
    This function now mimics semctl_nolock().
    
    Signed-off-by: Davidlohr Bueso 
    Cc: Andi Kleen 
    Cc: Rik van Riel 
    Signed-off-by: Andrew Morton 
    Signed-off-by: Linus Torvalds 

Debugging Help

  • Suppose you write some new test input, try it out, and find that it causes your application to crash
  do{ revert to previous commit/version
      re-compile and re-run your new test
      flag = does the program still crash
    } while(flag)
  
Once you have done this, you now know that the commit you just reverted, contains the code which is causing the crash

Committing

  • When and what to commit?
  • The easy answer is it should be “one unit of work”
  • Defining one unit of work is difficult but if you have to use the word ‘and’ to describe it, there is a good chance you have more than one commit there
  • Note that my previous example was therefore bad
    
    $ git commit -m "Added more to the readme and started the implementation"
    [master a3a0ed9] Added more to the readme and started the implementation
     2 files changed, 2 insertions(+), 0 deletions(-)
     create mode 100644 mycode.mylang
      
  • It is bad because it is doing two separate things, indicated by the use of the word ‘and’, not because it updates more than one file

Committing

  • Your commit should be improving the project. It should be improving one portion of it:
    • The code
    • The documentation
    • The tests
  • And it should be improving that one part in one way:
    • Improving functionality
    • Improving readability
    • Improving maintainability
    • Improving performance

XKCD Signal

  • XKCD is a popular web comic
  • It has an associated IRC channel
  • As with many large communities it faced a problem of a large noise to signal ratio
  • A large part of the problem is that frequently asked questions are not read and hence re-asked
  • Common debates are hence frequently re-hashed

XKCD Signal

  • In a blog post the xkcd creator outlines a proposal to deal with this
  • It has been implemented as the ROBOT9000 bot-moderator
  • The rule it enforces is a simple one:
  • ”You are not allowed to repeat anything anyone has already said”

XKCD Signal

  • You can read about the specifics here
  • But some obvious questions arise:
    • Question: Isn't this limiting?
    • Answer: You're underestimating the versatility of natural language and the sheer number of possible sentences

XKCD Signal

  • You can read about the specifics here
  • But some obvious questions arise:
    • Question: Can't I just game it by tagging extra nonsense on?
    • Answer: Yes, but the focus is on dealing with unwitting noise generators. Those who are actively attempting to destroy the conversation can be otherwise banned.

XKCD Signal

  • You can read about the specifics here
  • But some obvious questions arise:
    • Question:What happens if I just want to answer someone with a yes/no?
    • Answer: Expand slightly e.g. “I agree, ... because ...”

What has this got to do with SCC?

  • A persistent problem is the lack of meaningful commit messages
    • “fixed a bug”
    • “More work”
    • “Fixes.”
    • “Updates”
    • “big commit of all outstanding changes”
    • “commit everything”
    • “commit”

What has this got to do with SCC?

  • I hope to give some good advice on this writing good commit messages
  • But it is notoriously difficult to enforce
  • One could easily enforce a minimum length, but this would only solve part of the problem and in some cases would not actually be appropriate
  • A sneakier idea; copy the “Do not repeat” rule from XKCD-Signal
  • “Do not use a commit message which has been used previously”

Non-repeating Commit Messages

  • When I say “used previously” do I mean in the same repository?
  • Beginner level: yes, I mean in the same repository
  • Advanced level: no, I mean in any repository that exists for any project
  • It should not really matter, it is hard to accept that a commit message used for an entirely different project is appropriate for your one

Non-repeating Commit Messages

  • Said in a whingey voice: “But I really did just fix a typo in the README”
  • You can probably expand on that a little
  • However, of course some violations of this rule will be worse than others
  • Similarly just because you pass this rule, does not mean you have a useful commit message
  • Gaming this by adding superfluous characters is definitely wrong

Non-repeating Commit Messages

  • In order to check the advanced level I will need a corpus of repositories
  • I might use github for this. You certainly should not be repeating a commit message used for an entirely different repository
  • But I will at least check your commit messages against all other repositories submitted for this practical

More Advice

  • The commit message should be a summary of the actual ‘diff’
  • Part of the point of the commit message is so that a reader can avoid looking at the actual ‘diff’
  • The reader is looking in the history for a reason. Most likely they are trying to find the source of an issue. Help them.

More Advice

  • You should at least make clear the purpose of the commit
  • Is it?
    • A bug fix
    • A feature addition
    • A conflict resolution between two branches
    • Style enhancements
      • On what scale? A single fixed spelling error, or reformating all of the code?
    • A refactor of some portion of the code
    • Addition of a test
    • Updating of documentation
    • Optimisation

More Advice

  • Even once the purpose is described, try to explain the reason for that purpose
  • Some times this will be obvious, for example if the purpose of the commit is to fix a bug
  • Even then, you may wish to explain why that is fixable now rather than earlier
  • Others, really require an associated why. In particular a refactor.

Summary of the Main Advice

  • Small frequent commits. Each commit should do one thing
  • Ask yourself is it plausible that you might wish to revert some of the changes in a commit but not all of them?
    • If so, you almost certainly have more than one commit's worth of work
  • A person looking through your history is most likely looking for the source of a bug, or trying to figure out why a certain bit of code is the way it is. In either case help them.
  • Some people branch for any new unit of work. You should at least branch if you start doing two things at once

Micro Commits

  • It is possible to commit too little a portion of work
  • But for this practical we will ignore that possibility (unless you're clearly gaming the system)
  • Just a note: small style enhancements are usually not too small
  • “I just fixed a small typo in a comment, no one could possibly wish to revert to the code before I fixed the typo”
    • Probably not, but what are you about to do?
    • Someone may well wish to revert to the code immediately after you fixed the typo

Micro Commits

  • If you commit code such that the “build is broken” it is certainly not an appropriate commit
    • If the code fails to compile, or has a syntax error (for dynamic languages)
  • If this is the case you are likely committing too little
  • Though this could also be caused by over-shooting an appropriate commit
    • In other words you have 1 and a half commits worth of work
    • Or 2 and a half, or X plus 1/y commits worth of work

Git Lines

Git Lines

Git Lines

Git Lines

Git Lines

Git Lines

Other Software Development Tools

  • Source code control is just one of many development tools:
    1. Issue Tracking Software
    2. Profilers
    3. Debuggers
    4. Testing frameworks
    5. Static Analysers
    6. Automatic Documentation Generators
    7. IDEs/Good Editors with plug-ins

Square Wheels

  • All such tools have an initial investment of time to learn to use them
  • But generally that investment of time is well worth it in the long run

Small/Large Projects

  • As I stated at the start, you could easily do this project without source code control
  • You might not even utilise its backtracking features
  • But building up the habit will be invaluable to you when you do need it
  • You will certainly need it for the system design project

External Git Advice

  • There are literally millions of web pages offering git support and advice
  • Go forth and explore

Any Questions

Proposal: Word of Caution

  • The diff algorithm used by SCC is line based
  • If you only take newlines for paragraphs you will cause larger than necessary diffs
  • Programmer editor tend to implement a word-wrap feature that inserts a newline character
  • Other editors just display the long-line on multiple lines but do not insert newline characters