Git - Source Code Control

Software Engineering Large Practical

No Tutorials

There seems to be a little confusion over whether or not there are any tutorials. There are no tutorials so you do not need to worry about attending any.

Today's Lecture

I will talk about your source code control mechanism, git
Since I want your proposal to be stored under source code control, I'd better say something about it now
I will start with a very brief tutorial
- What you need to know for this practical
I will then try to motivate more sophisticated use

Basic Source Code Control

As I stated previously the first thing to do is to initialise your repository


$ mkdir selp # Call it something sensible
$ cd selp


$ git init


$ editor README.md
$ git add README.md
$ git commit -m "Initial commit including a README"

The main point

After each portion of work, commit to the repository what you have done
Everything you have done since your last commit, is not recorded
You can see what has changed since your last commit, with the status and diff commands:


$ git status
# On branch master
nothing to commit (working directory clean)

Staging and Committing

If I edit a file the git status command will tell me what has changed


$ editor README.md
$ git status
# On branch master
# Changed but not updated:
#   (use "git add ‹file›..." to update what will be committed)
#   (use "git checkout -- ‹file›..." to discard changes in working directory)
#
#	modified:   README.md
#
no changes added to commit (use "git add" and/or "git commit -a")

Unrecorded and Unstaged Changes

A git diff at this point will tell me the changes I have made that have not been committed or staged


$ git diff
diff --git a/README.md b/README.md
index 9039fda..eb8a1a2 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,2 @@
 I am going to write a new web site.
+The web site will rank video games for difficulty.

Staging and Committing

When you commit, you do not have to record all of your recent changes. Only changes which have been staged will be recorded
You stage those changes with the git add command.

To Add is to Stage

If I stage that modification and then ask for the status I will be told that there are staged changes waiting to be committed
To stage the changes in a file use git add


$ git status
# On branch master
# Changed but not updated:
#   (use "git add ‹file›..." to update what will be committed)
#   (use "git checkout -- ‹file›..." to discard changes in working directory)
#
#	modified:   README.md
#
no changes added to commit (use "git add" and/or "git commit -a")
$ git add README.md
$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ‹file›..." to unstage)
#
#	modified:   README.md
#

Viewing Staged Changes

At this point git diff is empty because there are no changes that are not either committed or staged
Adding --staged will show differences which have been staged but not committed


$ git diff # outputs nothing
$ git diff --staged
diff --git a/README.md b/README.md
index 9039fda..eb8a1a2 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,2 @@
 I am going to write a new web site.
+The web site will rank video games for difficulty.

New Files

Creating a new file causes git to notice there is a file which is not yet tracked by the repository
At this point it is treated equivalently to an unstaged/uncommitted change


$ editor mycode.mylang
$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ‹file›..." to unstage)
#
#	modified:   README.md
#
# Untracked files:
#   (use "git add ‹file›..." to include in what will be committed)
#
#	mycode.mylang

New Files

Slightly tricky, git add is also used to tell git to start tracking a new file
Once done, the creation is treated exactly as if you were modifying an existing file
The addition of the file is now treated as a staged but uncommitted change


$ git add mycode.mylang
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ‹file›..." to unstage)
#
#	modified:   README.md
#	new file:   mycode.mylang
#

Committing

Once you have staged all the changes you wish to record, use git commit to record them
Give a useful message to the commit


$ git commit -m "Added more to the readme and started the implementation"
[master a3a0ed9] Added more to the readme and started the implementation
 2 files changed, 2 insertions(+), 0 deletions(-)
 create mode 100644 mycode.mylang

A Clean Repository Feels Good

After a commit, you can take the status, in this case there are no changes
In general though there might be some if you did not stage all of your changes


$ git status
# On branch master
nothing to commit (working directory clean)

Ignoring Files

Compiling and running your application may well produce files that you do not wish to be recorded in the repository
Git will “complain” about untracked files
Your .gitignore file is used to ignore them


$ cat .gitignore
*.py[cod]

# editor backup files
*~

# C extensions
*.so

Ignoring Files

Don't forget to store your .gitignore file itself in the repository


$ git add .gitignore
$ git commit -m "Added a .gitignore file"

Finally git log

The git log command lists all your commits and their messages


$ git log
commit a3a0ed90bc90e601aca8cc9736827fdd05c97f8d
Author: Allan Clark ‹author email›
Date:   Wed Sep 25 10:26:57 2013 +0100

    Added more to the readme and started the implementation

commit 22de604267645e0485afa7202dd601d7c64c857c
Author: Allan Clark ‹author email›
Date:   Wed Sep 25 10:17:45 2013 +0100

    Initial commit

Minimal Workflow for the Lazy

For the really lazy among you
Simply remember two things:
1. Whenever you create a new file git add it
2. Once you have done one portion of work, commit all your changes with: git commit -a -m "Fixed bug ..."
When you actually want to use the SCC features you can read about them then

Keep In Mind

Although named the Software Engineering Large Practical, the effort and resources expended means that this is firmly in the category of pretty small projects
Most of this advice applies more fully to larger scale projects with multiple developers
However it is well worth building up the skills now when you do not need them, such that when you do, you do not mess up

Source Code Control - Problems

Solves two main problems:
1. Allows two or more developers to concurrently work on the same project
2. Records a history of changes made so that you can revert to any version

Concurrent Development

Previously solved using multiple files and good modularisation
Multiple developers literally worked on different files
Turns out we're not very good at modularisation
- And we're certainly not good at foresight modularisation
- Once you have the interfaces set up there is not much work left to do

Concurrent Development

You may think that this does not apply to you as this is an individual project
That is to assume that you may not take on multiple roles yourself
Even the most single-minded of you will at least:
1. Develop new code
2. Fix issues in existing code

Concurrent Development

You can attempt to do these two things simultaneously, I'll briefly attempt to convince you to do otherwise
- But I know I will fail
- I know you will disregard this advice and do your own thing anyway
- But at least then I'm not culpable

Branching

Let us suppose you have a nice clean source code repository
You have some modules with nice clean interfaces
You decide today is the day that you will implement new feature X
You can simply start implementing and committing as you go

Branching

During your development of new feature X, you realise that interface A is not general enough to properly aid the development of feature X
So now you have to modify interface A
- and perhaps even “the” implementation of A
In short, you wish the code base had been in a different state before you started the development of new feature X

Branching

You should have started a new branch feature-x
Once you discover that deficiency of interface A:
1. Revert back to the master branch
2. This will have none of the changes which began the implementation of feature X
3. Fix the interface of A there
  - And test it
4. Switch back to your feature-x branch
5. Rebase the changes to interface A here
6. Continue development of feature X

Branching

This is the hard part: convincing you that this is worth it
Now your history of changes has logically separated pieces of work

Branching

With simply commiting as you go:

Created three new tests for the intended new feature X, it currently fails
began the implementation of feature X, one of the three acceptance tests passing
Updated the interface of A to add an extra parameter to the get_info method
Match the updated interface of A within the implementation
Updated the doc-comment for the updated interface of get_info method ...
Updated the implementation of feature X all three acceptance tests now passing
Cleaned up the new code for the implementation of feature X
Fixed the tests broken by the update for the get_info method of interface A
Created a new test for the extra parameter in the get_info method of interface A
Added a new acceptance test for feature X

Branching

Branching and rebasing:

Updated the interface of A to add an extra parameter to the get_info method
Match the updated interface of A within the implementation
Updated the doc-comment for the updated interface of get_info method ...
Fixed the tests broken by the update for the get_info method of interface A
Created a new test for the extra parameter in the get_info method of interface A
Created three new tests for the intended new feature X, it currently fails
began the implementation of feature X, one of the three acceptance tests passing
Updated the implementation of feature X all three acceptance tests now passing
Cleaned up the new code for the implementation of feature X
Added a new acceptance test for feature X

Branching

Why on earth do you care?
Why would you want a history full of logically separated pieces of work?
Because you may wish to revert any one of them
Suppose, you fix interface A but continued development of feature X makes you realise that this new feature is ill-conceived:
- Perhaps it is infeasible to develop
- Or perhaps you see another way to provide the same functionality
Now you may wish to revert the implementation of feature X
But the new interface A is possibly still better than the previous one, so you wish to keep that

Historical “Versions”

This is why source code control mechanims are often called “Revision Control Systems”
Note that we are not talking about each explicitly marked and labelled version, but each and every logically point in development that has lead to the current state of affairs
Although being able to revert to a labelled version is also useful

A common error


/*
 * 12/26/93 (seiwald) - allow NOTIME targets to be expanded via $(<), $(>)
 * 01/04/94 (seiwald) - print all targets, bounded, when tracing commands
 * 12/20/94 (seiwald) - NOTIME renamed NOTFILE.
 * 12/17/02 (seiwald) - new copysettings() to protect target-specific vars
 * 01/03/03 (seiwald) - T_FATE_NEWER once again gets set with missing parent
 * 01/14/03 (seiwald) - fix includes fix with new internal includes TARGET
 * 04/04/03 (seiwald) - fix INTERNAL node binding to avoid T_BIND_PARENTS
 */

Why is Version Control Important

Standard answer: “bug fixes must be back ported to older versions”

Why is Version Control Important

Suppose you develop GolfApp 1.0
You begin development of GolfApp 2.0
Unfortunately someone discovers a security bug in GolfApp 1.0
You need to revert back to the source code of GolfApp 1.0 to fix the bug for those users
You cannot wait until you release GolfApp 2.0
You also cannot fix the bug there and release what you have
- That source code may contain many unfinished and untested features
- Or you may not wish to give new features to GolfApp 1.0 users who have not paid for them

The Point

Again you may think this does not apply to you because you have only one release to make
This allows you to revert to previous versions in order to locate when a bug was introduced
This can help greatly in locating the source of a bug
This history of changes also helps other people (including your future self) understand why the code is the way it is
This is very helpful when you wish to change something without breaking anything

Why is that line like that?

Suppose you're reading the linux kernel source code
You're in ipc/msg.c


git log -S'msq = msq_obtain_object_check(ns, msqid);' ipc/msg.c

commit 41a0d523d0f626e9da0dc01de47f1b89058033cf
Author: Davidlohr Bueso 
Date:   Mon Jul 8 16:01:18 2013 -0700

    ipc,msg: shorten critical region in msgrcv
    
    do_msgrcv() is the last msg queue function that abuses the ipc lock Take
    it only when needed when actually updating msq.
    
    Signed-off-by: Davidlohr Bueso 
    Cc: Andi Kleen 
    Cc: Rik van Riel 
    Tested-by: Sedat Dilek 
    Signed-off-by: Andrew Morton 
    Signed-off-by: Linus Torvalds 

commit 3dd1f784ed6603d7ab1043e51e6371235edf2313
Author: Davidlohr Bueso 
Date:   Mon Jul 8 16:01:17 2013 -0700

    ipc,msg: shorten critical region in msgsnd
    
    do_msgsnd() is another function that does too many things with the ipc
:
commit 41a0d523d0f626e9da0dc01de47f1b89058033cf
Author: Davidlohr Bueso 
Date:   Mon Jul 8 16:01:18 2013 -0700

    ipc,msg: shorten critical region in msgrcv
    
    do_msgrcv() is the last msg queue function that abuses the ipc lock Take
    it only when needed when actually updating msq.
    
    Signed-off-by: Davidlohr Bueso 
    Cc: Andi Kleen 
    Cc: Rik van Riel 
    Tested-by: Sedat Dilek 
    Signed-off-by: Andrew Morton 
    Signed-off-by: Linus Torvalds 

commit 3dd1f784ed6603d7ab1043e51e6371235edf2313
Author: Davidlohr Bueso 
Date:   Mon Jul 8 16:01:17 2013 -0700

    ipc,msg: shorten critical region in msgsnd
    
    do_msgsnd() is another function that does too many things with the ipc
    object lock acquired.  Take it only when needed when actually updating
    msq.
    
    Signed-off-by: Davidlohr Bueso 
    Cc: Andi Kleen 
    Cc: Rik van Riel 
    Signed-off-by: Andrew Morton 
    Signed-off-by: Linus Torvalds 

commit ac0ba20ea6f2201a1589d6dc26ad1a4f0f967bb8
Author: Davidlohr Bueso 
Date:   Mon Jul 8 16:01:16 2013 -0700

    ipc,msg: make msgctl_nolock lockless
    
    While the INFO cmd doesn't take the ipc lock, the STAT commands do
    acquire it unnecessarily.  We can do the permissions and security checks
    only holding the rcu lock.
    
    This function now mimics semctl_nolock().
    
    Signed-off-by: Davidlohr Bueso 
    Cc: Andi Kleen 
    Cc: Rik van Riel 
    Signed-off-by: Andrew Morton 
    Signed-off-by: Linus Torvalds

Debugging Help

Suppose you write some new test input, try it out, and find that it causes your application to crash

  do{ revert to previous commit/version
      re-compile and re-run your new test
      flag = does the program still crash
    } while(flag)

Once you have done this, you now know that the commit you just reverted, contains the code which is causing the crash

Committing

When and what to commit?
The easy answer is it should be “one unit of work”
Defining one unit of work is difficult but if you have to use the word ‘and’ to describe it, there is a good chance you have more than one commit there

Note that my previous example was therefore bad


$ git commit -m "Added more to the readme and started the implementation"
[master a3a0ed9] Added more to the readme and started the implementation
 2 files changed, 2 insertions(+), 0 deletions(-)
 create mode 100644 mycode.mylang

It is bad because it is doing two separate things, indicated by the use of the word ‘and’, not because it updates more than one file

Committing

Your commit should be improving the project. It should be improving one portion of it:
- The code
- The documentation
- The tests
And it should be improving that one part in one way:
- Improving functionality
- Improving readability
- Improving maintainability
- Improving performance

XKCD Signal

XKCD is a popular web comic
It has an associated IRC channel
As with many large communities it faced a problem of a large noise to signal ratio
A large part of the problem is that frequently asked questions are not read and hence re-asked
Common debates are hence frequently re-hashed

XKCD Signal

In a blog post the xkcd creator outlines a proposal to deal with this
It has been implemented as the ROBOT9000 bot-moderator
The rule it enforces is a simple one:
”You are not allowed to repeat anything anyone has already said”

XKCD Signal

You can read about the specifics here
But some obvious questions arise:
- Question: Isn't this limiting?
- Answer: You're underestimating the versatility of natural language and the sheer number of possible sentences

XKCD Signal

You can read about the specifics here
But some obvious questions arise:
- Question: Can't I just game it by tagging extra nonsense on?
- Answer: Yes, but the focus is on dealing with unwitting noise generators. Those who are actively attempting to destroy the conversation can be otherwise banned.

XKCD Signal

You can read about the specifics here
But some obvious questions arise:
- Question:What happens if I just want to answer someone with a yes/no?
- Answer: Expand slightly e.g. “I agree, ... because ...”

What has this got to do with SCC?

A persistent problem is the lack of meaningful commit messages
- “fixed a bug”
- “More work”
- “Fixes.”
- “Updates”
- “big commit of all outstanding changes”
- “commit everything”
- “commit”

What has this got to do with SCC?

I hope to give some good advice on this writing good commit messages
But it is notoriously difficult to enforce
One could easily enforce a minimum length, but this would only solve part of the problem and in some cases would not actually be appropriate
A sneakier idea; copy the “Do not repeat” rule from XKCD-Signal
“Do not use a commit message which has been used previously”

Non-repeating Commit Messages

When I say “used previously” do I mean in the same repository?
Beginner level: yes, I mean in the same repository
Advanced level: no, I mean in any repository that exists for any project
It should not really matter, it is hard to accept that a commit message used for an entirely different project is appropriate for your one

Non-repeating Commit Messages

Said in a whingey voice: “But I really did just fix a typo in the README”
You can probably expand on that a little
However, of course some violations of this rule will be worse than others
Similarly just because you pass this rule, does not mean you have a useful commit message
Gaming this by adding superfluous characters is definitely wrong

Non-repeating Commit Messages

In order to check the advanced level I will need a corpus of repositories
I might use github for this. You certainly should not be repeating a commit message used for an entirely different repository
But I will at least check your commit messages against all other repositories submitted for this practical

More Advice

The commit message should be a summary of the actual ‘diff’
Part of the point of the commit message is so that a reader can avoid looking at the actual ‘diff’
The reader is looking in the history for a reason. Most likely they are trying to find the source of an issue. Help them.

More Advice

You should at least make clear the purpose of the commit
Is it?
- A bug fix
- A feature addition
- A conflict resolution between two branches
- Style enhancements
  - On what scale? A single fixed spelling error, or reformating all of the code?
- A refactor of some portion of the code
- Addition of a test
- Updating of documentation
- Optimisation

More Advice

Even once the purpose is described, try to explain the reason for that purpose
Some times this will be obvious, for example if the purpose of the commit is to fix a bug
Even then, you may wish to explain why that is fixable now rather than earlier
Others, really require an associated why. In particular a refactor.

Summary of the Main Advice

Small frequent commits. Each commit should do one thing
Ask yourself is it plausible that you might wish to revert some of the changes in a commit but not all of them?
- If so, you almost certainly have more than one commit's worth of work
A person looking through your history is most likely looking for the source of a bug, or trying to figure out why a certain bit of code is the way it is. In either case help them.
Some people branch for any new unit of work. You should at least branch if you start doing two things at once

Micro Commits

It is possible to commit too little a portion of work
But for this practical we will ignore that possibility (unless you're clearly gaming the system)
Just a note: small style enhancements are usually not too small
“I just fixed a small typo in a comment, no one could possibly wish to revert to the code before I fixed the typo”
- Probably not, but what are you about to do?
- Someone may well wish to revert to the code immediately after you fixed the typo

Micro Commits

If you commit code such that the “build is broken” it is certainly not an appropriate commit
- If the code fails to compile, or has a syntax error (for dynamic languages)
If this is the case you are likely committing too little
Though this could also be caused by over-shooting an appropriate commit
- In other words you have 1 and a half commits worth of work
- Or 2 and a half, or X plus 1/y commits worth of work

Git Lines

Git Lines

Git Lines

Git Lines

Git Lines

Other Software Development Tools

Source code control is just one of many development tools:
1. Issue Tracking Software
2. Profilers
3. Debuggers
4. Testing frameworks
5. Static Analysers
6. Automatic Documentation Generators
7. IDEs/Good Editors with plug-ins

Square Wheels

All such tools have an initial investment of time to learn to use them
But generally that investment of time is well worth it in the long run

Small/Large Projects

As I stated at the start, you could easily do this project without source code control
You might not even utilise its backtracking features
But building up the habit will be invaluable to you when you do need it
You will certainly need it for the system design project

External Git Advice

There are literally millions of web pages offering git support and advice
Go forth and explore

Any Questions

Proposal: Word of Caution

The diff algorithm used by SCC is line based
If you only take newlines for paragraphs you will cause larger than necessary diffs
Programmer editor tend to implement a word-wrap feature that inserts a newline character
Other editors just display the long-line on multiple lines but do not insert newline characters

Git - Source Code Control

Software Engineering Large Practical

No Tutorials

Today's Lecture

Basic Source Code Control

The main point

Staging and Committing

Unrecorded and Unstaged Changes

Staging and Committing

To Add is to Stage

Viewing Staged Changes

New Files

New Files

Committing

A Clean Repository Feels Good

Ignoring Files

Ignoring Files

Finally git log

Minimal Workflow for the Lazy

More on the Web

Keep In Mind

Source Code Control - Problems

Concurrent Development

Concurrent Development

Concurrent Development

Branching

Branching

Branching

Branching

Branching

Branching

Branching

Branching

Historical “Versions”

A common error

Why is Version Control Important

Why is Version Control Important

The Point

Why is that line like that?

Debugging Help

Committing

Committing

XKCD Signal

XKCD Signal

XKCD Signal

XKCD Signal

XKCD Signal

What has this got to do with SCC?

What has this got to do with SCC?

Non-repeating Commit Messages

Non-repeating Commit Messages

Non-repeating Commit Messages

More Advice

More Advice

More Advice

Summary of the Main Advice

Micro Commits

Micro Commits

Git Lines

Git Lines

Git Lines

Git Lines

Git Lines

Git Lines

Other Software Development Tools

Square Wheels

Small/Large Projects

External Git Advice

Any Questions

Proposal: Word of Caution