The Kaggle competition: Predicting HIV Progression

I have recently decided to take part in a competition offered by the Kaggle website, about writing a predictor for the outcome of an HIV treatment. Since I have not been blogging here for a while, I will use this as an excuse to start again, and I will dedicate a series of posts to my ideas for this competition.

Kaggle is a new web 2.0 site that proposes competitions to people able to write predictors using machine learning or other techniques. In short, they propose a case study, like this one on HIV, and they give a prize of $500 to the team that writes the best predictor for the data. In theory, if you have a nice case study to propose, you can send it to them and they will also pay you if it is interesting.

The competition I want to enter is about predicting the outcome of an HIV treatment. In short, you have a training set of ~700 individuals who received the treatment, and for each individual you have the sequences of two viral proteins, some other parameters, and the outcome of the treatment. Then you are given a second set of 300 individuals, and you have to predict how the treatment will perform on them.
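To make the setup a bit more concrete, here is a minimal Python sketch of the task's shape: a labelled training table, an unlabelled test table, and the simplest possible predictor (always guess the most common outcome). The column names and the few rows below are invented for illustration and are not the real competition files or schema; the majority-class rule is just a baseline to beat, not a method I am proposing.

```python
import csv
import io
from collections import Counter

# Toy rows invented for illustration only. The column names are assumptions,
# not the competition's real schema.
TRAIN_CSV = """patient_id,outcome,protein_a_seq,protein_b_seq,marker
1,0,CCTCAAATC,CCCATTAGT,4.3
2,1,CCTCAGATC,CCCATTAGC,3.6
3,0,CCTCAAATC,CCCATCAGT,5.1
"""

# The test set has the same columns, minus the outcome we must predict.
TEST_CSV = """patient_id,protein_a_seq,protein_b_seq,marker
4,CCTCAAATC,CCCATTAGT,4.0
5,CCTCAGATC,CCCATTAGT,4.8
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per individual."""
    return list(csv.DictReader(io.StringIO(text)))


def majority_outcome(rows):
    """Return the most frequent outcome label in the training rows."""
    counts = Counter(row["outcome"] for row in rows)
    return counts.most_common(1)[0][0]


def predict_all(test_rows, label):
    """Assign the same label to every individual in the test set."""
    return {row["patient_id"]: label for row in test_rows}


train_rows = read_rows(TRAIN_CSV)
test_rows = read_rows(TEST_CSV)
baseline = majority_outcome(train_rows)
predictions = predict_all(test_rows, baseline)
print(predictions)
```

Any real predictor would replace `majority_outcome` with something that actually looks at the sequences and the other parameters.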

Since I am more interested in learning about machine learning methods than in the prize money, I will post my thoughts about the competition here. This will probably make for a nice tutorial on the basics of bioinformatics for dummies: how to interpret the data, which databases to query, which tools to use, a bit of bash scripting, … I will start writing tomorrow 🙂


