Saturday 26 March 2016

Data Science and Higher Education South Africa data

TL;DR 
* I'm exploring "data science" and related technologies, including R
* This is a fun "puzzle": http://priceonomics.com/the-priceonomics-data-puzzle-treefortbnb/ 
* There exist some nice open data sets relating to Higher Education in South Africa 

Data Science
"Data Science" is as much of a buzzword as "The Cloud", "Big Data", and "Artificial Intelligence", and many intelligent people will make unidentifiable sounds of contempt when they hear or read it. But like like the other buzzwords mentioned, "Data Science" started out as an interesting idea, which the media, recruiters, and marketing departments ran away with in order to impress various stakeholders and make lots of money.

With an increasing amount of open data sets being made available (see https://en.wikipedia.org/wiki/Open_data), being able to get information from raw data is an an ever-more useful skill to learn. I came across an fun and simple puzzle recently here http://priceonomics.com/the-priceonomics-data-puzzle-treefortbnb/ and decided to use it as as a starting point for learning more about technologies that are useful for data analysis. While Python is normally the first tool I'd turn towards to solve a problem like this, I recently saw some quite impressive work done with R. I was surprised by how easy it was to carry out common data manipulations and visualisations and I wanted to try it for myself. 

I won't go into detail in how I solved the puzzle linked above, as Priceonomics use it as part of their recruitment process. But I downloaded R, and messed around with it and the Treefort dataset for an evening and had a lot of fun. Below is a brief write-up on my first experiences with R, and the most interesting graphs from the South Africa education data set I was using. There's also a link to the Excel spreadsheet I used instead of R.

Why not Python?
One of the main reasons I enjoy Python is the intuitiveness of its syntax. If I don't know how to do something using a Python library, I can usually fire up a shell and with a combination of dir() and guesswork work out how to do what I want faster than looking it up on stackoverflow. However, with matplotlib, numpy, and pandas, I always find the opposite. Even when faced with a very basic problem, I often find myself trawling through documentation and examples to work out how to solve it. 

While manipulating and plotting data from a .csv file in R, I very quickly got into my Python habits of using trial-and-error and the R help() command. It was very satisfying to read, manipulate, and plot data in a few lines of code. My current impression of R (which will almost certainly change drastically as I use it more), is that it will fit somewhere between M$Excel and Python for me. If I just want to do some really basic calculations, I'll use Excel. If I want to build and maintain a 100+ line programme, that I'll need to use and change for the foreseeable future, I'll use Python. And if I need to mess around programatically with rows and columns, but I don't need to build anything maintainable, R looks like it could be a good compromise between the two. 

South African Education Data
There's no shortage of data sets to play with. Cape Town open data (https://web1.capetown.gov.za/web1/OpenDataPortal/) was the first place I looked, but it seems that that initiative was a bit of a let down. While there's some interesting data available, most of it is hugely inconsistent in format, and looks as if it was intended for human consumption instead of for programmatic analysis. I thought education data might be interesting, and I found that the datasets available here http://chet.org.za/data/sahe-open-data were comprehensive and fairly consistent. Unfortunately they're also presented in xlsx format instead of .csv and are not as ideal as the Treefort data set to load directly into R. 

I converted them to .csv files and loaded them into R, but I need to spend some more time with R's syntax and libraries to efficiently work with data in non-ideal formats. The pain point was the double headers in most of the data sets. For example, the dataset of enrollments by race looks like this: 



I wanted to graph the data by institution, as in the picture below. Getting the specific row that represented each institution in R was straightforward enough, but I couldn't easily find a way to transform the data to use the year as the x-axis, the categories as separate series, and the the numbers as the y-axis. I'm sure it'll seem trivial once I've worked out how to do it, but I decided to play around with the data in M$Excel first so I could have a clear goal in mind before diving deeply into R.  



UCT enrollment by race
Interestingly, of all the institutions listed, UCT is the only one to have any crossing lines
.
I've used Excel pretty extensively in the past, and even taught an introductory course on it, so it was much easier to clean and manipulate the data and create pretty graphs than working out how to do everything in R. I loaded the simplest datasets from the CHET collection (Race, Gender, and Success) into separate worksheets, created some hacky VLOOKUPs to separate the time and category data by institution, and added some graphs in a separate sheet. A screenshot of the result is below - the big cell at the top is a dropdown that contains all the institutions, and the graphs update dynamically when a new institution is selected.
All data for Rhodes University
I'm not sure what happened to the success rate in 2013 - none of the other institutions showed a similar decline. Hopefully it's a mistake. (My brief lecturing attempt at Rhodes was last year, so it can't be caused by that).
Soon, I'll attempt to replicate the graphs using R, and write a follow up post about how I do it. I'll also extend the data sets I looked at, and if there's anything interesting I'll write a post which focuses on the data instead of the technology used to analyse it. 

If you want to play around with the education data and see the graphs for the other institutions, you can download the Excel spreadsheet I built here: https://docs.google.com/uc?authuser=0&id=0ByEENivQuwUBSmNJUXdPbDI2cU0&export=download. The messy VLOOKUPs would probably be enough to have me expelled from any respectable computer science institution, but luckily I don't belong to any. Feel free to write me snarky comments below on how I could have done it in a cleaner way.