Lecture 11, Friday w6, 2014-10-24
=================================

**Finished off source coding loose end:**

* Predictions can't come from just counting how often things happened in a single fixed-size context:
* Small contexts: the model's predictions will be too vague.
* Large contexts: we won't have enough data to get good counts. No matter how much data we have, it makes sense to consider contexts of multiple sizes.
* Prediction by Partial Match (PPM): one simple way (frequently combined with arithmetic coding) to back off between contexts. A toy sketch appears at the end of these notes.

**Started noisy channel coding:**

* Communication: compress (remove redundancy), encode so it's robust to noise (add redundancy again!), send over the channel, decode, decompress.
* Discrete memoryless channels:
    * Discrete: finite number of possible input and output symbols.
    * Memoryless: the channel's behaviour on each symbol doesn't depend on what it has been fed or has done before (unlike a neuron that can't fire any more after sustained activity, or a disk that flips more bits if too many 1's are stored in a row).
    * Can have different input and output alphabets (including different sizes).
    * Synchronized: we know the position in the sequence of every received symbol. None are *deleted* (removed without our knowing a symbol was lost), no extra symbols are *inserted*, and the order is maintained.
* Example channels: BSC, BEC, Z.
* The channel defines a matrix $Q$ of transition probabilities. We don't usually get to choose that. We do get to set the vector of input probabilities $\mathbf{p}_x$.
* $Q$ and $\mathbf{p}_x$ specify the whole joint distribution of $x$ and $y$. The marginal distribution of the output is $\mathbf{p}_y = Q\mathbf{p}_x$.
* Joint entropy of the channel: $H(X,Y) \le H(X) + H(Y)$. Equality would mean the channel is of zero use!
* The mutual information $I(X;Y) = H(X) + H(Y) - H(X,Y)$ tells us how much information is shared between the input and output. A small numerical check for the BSC appears at the end of these notes.

Check your progress
-------------------

* Why do we compress (remove redundancy) if we're just going to add redundancy again?
* What's the difference between a *deletion* channel (which is hard to deal with) and an *erasure* channel (which is fairly easy)?
* Why is $\mathbf{p}_y = Q\mathbf{p}_x$?

We're up to around the 13th slide in the 'week 6' slides. Mark anything that's unclear or that needs expanding on NB. This lecture covered around pp. 146--148 of the textbook.

Extra reading
-------------

If you'd like to know how compressors like gzip work (non-examinable), read MacKay p. 119, but also do Exercise 6.13 and then read its answer. If you want to read ahead, look at Chapter 8 and the rest of Chapter 9.
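Code sketches
-------------

To make the back-off idea concrete, here is a toy PPM-style predictor in Python. It's a minimal sketch, not the full algorithm: it uses method-A escape probabilities and skips the exclusion mechanism real PPM uses, so the probabilities it assigns don't quite sum to one. The training text, orders and names are made up for illustration.

```python
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def train(text, max_order=2):
    """Count, for each context of length 0..max_order, which symbols followed it."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(text):
        for order in range(max_order + 1):
            if i >= order:
                counts[text[i - order:i]][symbol] += 1
    return counts

def ppm_prob(counts, history, symbol, max_order=2):
    """Toy PPM (method-A escapes, no exclusions): P(symbol | history)."""
    p = 1.0
    for order in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - order:]
        ctx_counts = counts.get(context, {})
        total = sum(ctx_counts.values())
        if ctx_counts.get(symbol, 0) > 0:
            # Symbol seen in this context: use its count, reserving 1 for escape.
            return p * ctx_counts[symbol] / (total + 1)
        # Symbol unseen here: pay the escape probability and back off.
        p *= 1.0 / (total + 1)
    # Order -1 model: uniform over the alphabet.
    return p / len(ALPHABET)

counts = train("the cat sat on the mat ")
print(ppm_prob(counts, "the ca", "t"))   # high: 't' was seen after "ca"
print(ppm_prob(counts, "the ca", "z"))   # tiny: escapes all the way to uniform
```

The point of the sketch is only the back-off structure: try the longest context first, and whenever the symbol hasn't been seen there, multiply in an escape probability and drop to a shorter context, ending with a uniform distribution that guarantees every symbol gets non-zero probability.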
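Similarly, here is a small numerical check of $\mathbf{p}_y = Q\mathbf{p}_x$ and $I(X;Y) = H(X) + H(Y) - H(X,Y)$ for a binary symmetric channel. The flip probability 0.1 and uniform input distribution are chosen just for illustration.

```python
import numpy as np

f = 0.1                                   # flip probability (illustrative value)
Q = np.array([[1 - f, f],                 # Q[y, x] = P(y | x) for the BSC
              [f, 1 - f]])
p_x = np.array([0.5, 0.5])                # input distribution we get to choose

p_y = Q @ p_x                             # marginal distribution of the output
joint = Q * p_x                           # P(x, y) = P(y | x) P(x), shape (y, x)

def H(p):
    """Entropy in bits of a probability vector (zero entries contribute nothing)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

I = H(p_x) + H(p_y) - H(joint.ravel())    # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(p_y)                                # [0.5 0.5]
print(I)                                  # about 0.531 bits, i.e. 1 - H_2(0.1)
```

Because the joint distribution is $P(x,y) = Q_{y|x}\,p_x$, summing over $x$ gives exactly the matrix-vector product $\mathbf{p}_y = Q\mathbf{p}_x$, which is what the first print statement shows.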