On the 25th of April, David Lusher and Kieran Selvon gave a workshop on the Pandas Python "Data Analysis" library as part of the FEE module Advanced Computational Methods II. The event was popular, with attendees from outside the faculty including PhD and undergraduate students.
Python is becoming an increasingly common high-level programming language, and it excels at making scientific programming tasks easy to implement. For statistical analysis, however, environments such as R or Stata remain the first choice of most.
Pandas is an open-source Python library which allows advanced data analysis to be performed in the Python environment.
The workshop began with a presentation from Kieran giving a brief overview of what Pandas is and some of its basic functionality, including the Pandas data structures and the built-in masking, sorting and 'groupby' methods. He also covered how to output LaTeX and HTML code using Pandas. This is very useful for research students, as it allows data stored in arrays to be published with minimal effort. Finally, information was provided about using Pandas in conjunction with other Python data analysis libraries such as 'statsmodels'.
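As a rough illustration of the features covered in the presentation, here is a minimal sketch of masking, sorting, 'groupby' and HTML output. It is not the material used in the workshop: the table, its column names and its values are invented purely for this example.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "city": ["Southampton", "Winchester", "Southampton", "Portsmouth", "Winchester"],
    "value": [10, 3, 7, 8, 5],
})

# Masking: a boolean Series keeps only the rows where the condition holds.
large = df[df["value"] > 5]

# Sorting: order the rows by a column, largest first.
ranked = df.sort_values("value", ascending=False)

# groupby: split the rows by a key, then aggregate each group.
totals = df.groupby("city")["value"].sum()

# Output: render the grouped result as an HTML table (returned as a string).
html = totals.to_frame().to_html()
print(html)
```

The analogous method for LaTeX is `DataFrame.to_latex`; depending on the Pandas version installed it may need an optional dependency, so it is left out of the sketch.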
After the presentation, David gave a demonstration in which Pandas was used to 'munge' or 'wrangle' data read in from CSV files acquired from https://data.police.uk, extracting interesting information from the dataset. Pandas was used to find out how many bicycle thefts were occurring on Burgess Road (the home of the NGCM, where many of its members cycle to work!). Next, the general trend of crime levels for the whole of Hampshire was explored. Examples of outputting LaTeX and HTML code were provided, and finally an example was shown of how to perform an 'R style' linear regression, including statistical outputs and residual plotting.
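The first two steps of the demonstration might look something like the sketch below. It is not David's code. The column names ('Month', 'Location', 'Crime type') and the 'On or near ...' wording are assumptions about the layout of the police CSV files, and the rows are invented sample data held in a string so that the example runs without downloading anything.

```python
import io

import pandas as pd

# Invented sample rows; the column names are an assumption about the
# layout of the CSV files, not taken from the workshop material.
csv_text = """Month,Location,Crime type
2014-01,On or near Burgess Road,Bicycle theft
2014-01,On or near Burgess Road,Burglary
2014-02,On or near Burgess Road,Bicycle theft
2014-02,On or near High Street,Bicycle theft
2014-02,On or near High Street,Burglary
2014-03,On or near Burgess Road,Bicycle theft
"""
crimes = pd.read_csv(io.StringIO(csv_text))

# Combine two boolean masks: the right road AND the right crime type.
on_road = crimes["Location"].str.contains("Burgess Road", regex=False)
is_bike_theft = crimes["Crime type"] == "Bicycle theft"
bike_thefts = crimes[on_road & is_bike_theft]
n_bike_thefts = len(bike_thefts)

# General trend: number of recorded crimes per month.
per_month = crimes.groupby("Month").size()

print(n_bike_thefts)
print(per_month)
```

With real data, `pd.read_csv` would be pointed at the downloaded file instead of the in-memory string.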
The demonstration was followed by a series of exercises which continued the crime data munging: participants in the workshop had to find out the most frequently committed crime type on the road they live on. The exercises then ended on a more light-hearted note, with participants sorting through a dataset of films and their ratings to find out, for instance, what the most popular film was during the year they were born.
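One possible way to approach both exercises is sketched below. It is not the official solution. The road name, film titles, column names and ratings are all hypothetical, and 'most popular' is assumed here to mean the highest rating.

```python
import pandas as pd

# Hypothetical crime records, invented for illustration.
crimes = pd.DataFrame({
    "Location": ["On or near Example Road"] * 4 + ["On or near Other Street"] * 2,
    "Crime type": ["Burglary", "Bicycle theft", "Burglary",
                   "Vehicle crime", "Burglary", "Shoplifting"],
})

# Keep one road, count each crime type, and take the label with the top count.
my_road = crimes[crimes["Location"] == "On or near Example Road"]
most_common = my_road["Crime type"].value_counts().idxmax()

# Hypothetical film ratings, invented for illustration.
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "year": [1990, 1990, 1991, 1991],
    "rating": [7.1, 8.4, 6.0, 6.5],
})

# Keep one year, then look up the title on the row with the highest rating.
born = films[films["year"] == 1990]
top_film = born.loc[born["rating"].idxmax(), "title"]

print(most_common, top_film)
```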
A full description of the workshop, including links to all the content and a virtual machine which can be used for the exercises if desired, is available here.