Python is a great language for developing software quickly, and by combining several large libraries, we can obtain a system for scientific computing that has every bit as much functionality as Matlab or R, but also has a programming language that is not completely crazy.
As you'll see from the list below, there are a large number of libraries that you need to know to do data analysis in Python:
First, Dive Into Python is a book about the programming language. Here is a quick reference guide for Python.
IPython provides the interactive environment and notebook functionality that we will use
This IPython notebook provides a reference for the IPython notebook functionality. The short version is: if you are in command mode, i.e., not currently entering text, then type "h" for a list of keyboard shortcuts.
sklearn is a machine learning toolkit for Python
scipy is a collection of a large number of libraries for scientific computing, which covers a lot of Matlab functionality, including...
numpy (part of SciPy but can be used separately) provides matrix operations. If you know Matlab, here is a NumPy cheat sheet for Matlab users
matplotlib is a plotting library
The SciPy Cookbook is a great resource for the above.
pandas provides Python with the goodness of R data frames. We won't use it in this lab because it would repeat ideas that you've seen for R, but if you use Python long-term, then it is very much worth checking out.
For this lab we will be using Python along with a few open-source libraries (packages). These packages cannot be installed directly, so we will have to create a virtual environment. Virtual environments keep the installation of packages, and the retention of correct versions, as simple as possible. You can read here if you want to learn more about virtual environments, but this is not necessary for this tutorial.
Now open a terminal and follow these instructions. We expect you to enter these commands one by one. Waiting for each command to complete will help catch any unexpected warnings and errors. Please read and heed any warnings and errors you encounter. We are on standby in the labs to help if required.
Change directory to home and create a virtual environment
cd
virtualenv --distribute virtualenvs/irds_env # Creates a virtual environment called irds_env
Navigate to and activate the virtual environment (you will need to activate the virtual environment every time you open a new terminal - this adds the correct python version with all installed packages to your system's $PATH environment variable)
cd virtualenvs/irds_env
source ./bin/activate # Activates the environment, your shell prompt should now change to reflect that you are in the irds_env environment
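If you would like to confirm from inside Python that the environment is active, the following is a minimal sketch. It assumes that older virtualenv releases set sys.real_prefix and that newer tools make sys.base_prefix differ from sys.prefix. The helper name in_virtualenv is our own and is not part of any library.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases record the original interpreter in sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases keep it in sys.base_prefix instead.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


# String formatting keeps the output identical on Python 2 and Python 3.
print("Interpreter: %s" % sys.executable)
print("Inside a virtual environment: %s" % in_virtualenv())
```

The interpreter path printed should point inside virtualenvs/irds_env. This check is aimed at virtualenv-style environments, so a conda environment may not be detected by it.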
Install all the Python packages we need (once the correct virtual environment is activated, pip install will install packages into the virtual environment - if you're ever unsure which python you are using, type which python in the terminal). WATCH FOR WARNINGS AND ERRORS HERE. We have split these commands up to encourage you to enter them one by one.
pip install -U setuptools # The -U flag upgrades the current version
pip install -U pip
pip install yolk
pip install jupyter
pip install numpy
pip install scipy
pip install matplotlib
pip install pandas
pip install statsmodels
pip install scikit-learn
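One way to double-check the installation is to try importing each package from Python. The sketch below is only illustrative: the helper check_modules and the REQUIRED list are our own names, and the list uses import names, which is why scikit-learn appears as sklearn. It should run on both Python 2.7 and Python 3.

```python
import importlib

# Import names for the packages installed above; note that scikit-learn
# is imported as "sklearn".
REQUIRED = ["numpy", "scipy", "matplotlib", "pandas", "statsmodels", "sklearn"]


def check_modules(names):
    """Map each module name to its version string, or None if it cannot be imported."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Not every module defines __version__, so fall back to a placeholder.
            report[name] = getattr(module, "__version__", "unknown")
    return report


for name, version in sorted(check_modules(REQUIRED).items()):
    print("%-12s %s" % (name, version if version is not None else "NOT INSTALLED"))
```

Any package reported as NOT INSTALLED can be installed again with the matching pip install command while the environment is active.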
You should now have all the required modules installed. Our next step is to make a new directory where you will keep all the lab notebooks. Within your terminal:
Navigate back to your home directory
cd
Make a new directory (e.g. called irds_lab3)
mkdir irds_lab3
Navigate home and ensure the irds_env virtualenv is activated
cd
source virtualenvs/irds_env/bin/activate # Activates the environment
Enter the directory you just created
cd irds_lab3
Start a jupyter notebook
jupyter notebook
In the first text box, enter this Python code
import urllib
urllib.urlretrieve('http://www.inf.ed.ac.uk/teaching/courses/irds/2016-autumn/labs/Lab3.ipynb', 'Lab3.ipynb')
This will download the Jupyter notebook that contains the rest of this lab. Now you should be able to find the notebook by clicking File -> Open on the Jupyter notebook menu.
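The download code above relies on urllib.urlretrieve, which exists under that name only in Python 2. In Python 3 the same function lives in urllib.request. As a hedged sketch, a version that works on either interpreter and skips the download when the file is already present might look like this. The helper name fetch_notebook is our own.

```python
import os

try:
    from urllib.request import urlretrieve  # Python 3 location
except ImportError:
    from urllib import urlretrieve  # Python 2 location


def fetch_notebook(url, filename):
    """Download url to filename unless the file already exists; return the absolute path."""
    if not os.path.exists(filename):
        urlretrieve(url, filename)
    return os.path.abspath(filename)
```

In a notebook cell, calling fetch_notebook with the lab URL given above and 'Lab3.ipynb' as the filename would have the same effect as the original two lines, without downloading the file again on a second run.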
It is probably a good idea to delete the virtual environment once you have finished working on the lab to free up some space:
Navigate back to your home directory
cd
Remove the virtual environment
rm -rf virtualenvs/irds_env
If you are using a personal machine, you can choose whether to do as above or use the Anaconda distribution (Python version 2.7, choose the appropriate installer according to your operating system). Anaconda is a standard set of packages used in scientific computing which the Anaconda team curate to keep them consistent. It's also recommended that you set up a virtual environment for this project. This way, if you update anything in your Anaconda base install, this virtual environment will remain unchanged. To create a virtual environment called irds, open a Terminal (or Command Prompt window if you are running Windows) and type:
conda create -n irds python=2.7 anaconda
Don't forget to activate the virtual environment every time you begin work from a new terminal:
source activate irds
Once you have finished installing everything, open a terminal (or Command Prompt in Windows), navigate to the lab folder and type:
jupyter notebook
In the first text box, enter this Python code:
import urllib
urllib.urlretrieve('http://www.inf.ed.ac.uk/teaching/courses/irds/2016-autumn/labs/Lab3.ipynb', 'Lab3.ipynb')