
Project Proposal

Choosing a competition

The first step towards the completion of our term project is choosing a Kaggle competition to compete in. The loose restrictions are to pick a competition that:

  • ends later rather than sooner (ideally towards the end of the semester)
  • poses a problem conducive to machine learning research rather than just a programming challenge

This makes the process of picking a competition to work on pretty hard. With these constraints, I see four main options: loan default prediction, galaxy classification, hierarchical text classification, and object recognition in images. None of these jump out at me immediately. I'm not particularly interested in financial problems or computer vision, so I decided to table those — although the loan default prediction looks like the most well-defined of the available problems.

Large Scale Hierarchical Text Classification Challenge

For now I'm choosing to work on the LSHTC challenge. LSHTC4 is the latest in a series of challenges sponsored and organized by various academic institutes in Greece and France. The dataset is large, with 2,400,000 documents and 325,000 labels. Additionally, each observation can have multiple labels, making the problem seemingly very difficult.

I chose this competition because I find text mining to be an interesting field within machine learning. The multi-label response is another interesting aspect of the challenge. Holistically, I feel the problem is the most closely related to the machine learning and data mining challenges one might meet out in the real world.

I do have some lingering concerns regarding the competition, and I'm not completely set in my choice as of now. The first is that the problem is hard, perhaps too hard. Clustering and unsupervised learning are tough enough with small numbers of single-label categories; over 325,000 multi-label categories seems rough. In the same vein, I'm a little worried about computational problems. The dataset might be too large for model fitting on my laptop, but this might be alleviated by the in-class suggestion of access to the davinci cluster.

Approach

I will start by attempting to train basic linear models on the data. This might be particularly challenging due to the large number of features.
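To make this concrete, here is a minimal sketch of the kind of baseline I have in mind, assuming scikit-learn: TF-IDF features feeding a one-vs-rest logistic regression, with a multi-label indicator matrix as the target. The toy corpus and labels below are made up for illustration, and fitting one binary classifier per label is only feasible for far fewer than the competition's 325,000 labels, so this is a starting point rather than a plan for the full dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy corpus standing in for the LSHTC documents
docs = [
    "stars galaxies telescope astronomy",
    "loan credit default finance",
    "telescope optics astronomy",
    "credit risk finance banking",
]
# Each document can carry multiple labels
labels = [{"science"}, {"finance"}, {"science"}, {"finance", "risk"}]

# Encode the label sets as a documents-by-labels indicator matrix
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# Sparse TF-IDF features; sparsity matters at 2.4M documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# One-vs-rest trains an independent binary linear model per label
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

# Predict label sets for a new document
pred = clf.predict(vectorizer.transform(["astronomy telescope stars"]))
print(mlb.inverse_transform(pred))
```

Even this simple setup surfaces the scaling question: the one-vs-rest step is linear in the number of labels, which is exactly where the 325,000-label hierarchy will hurt.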

Resources

As mentioned earlier, it might be very useful to obtain access to the davinci cluster for computing on this large dataset. I plan on reading research literature on the subject, as the topic does seem to be an open problem. Any suggestions for relevant texts or other words of advice would be appreciated.

P.S.: It would be interesting to see what problems the rest of the class has decided to work on.
