It turns out that the Large Scale Hierarchical Text Classification Challenge was a little too challenging for the purposes of this project.
There are two main problems I have with this competition.
- Multi-class and multi-label. Most of the algorithms we discuss in class are not fundamentally designed to handle multi-class problems. The standard ways of extending them, such as one-vs-rest, involve training one binary model per class (see the sketch after this list). When k = 350,000, that doesn't scale, and that is for multi-class alone; combined with the multi-label nature of the problem, the basic methods blow up. This makes it very difficult to fit the models we talk about in class, so the competition loses some of its relevance.
- Dataset size. The LSHTC dataset barely fits in the memory of my laptop, and I certainly cannot run any computation over it. This adds even more complexity to an already difficult problem. While large datasets are certainly part of dealing with ML in the wild, I feel they are more of a hindrance to learning than a useful experience for the purposes of the class project.
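
To make the scaling issue concrete, here is a minimal sketch of the one-vs-rest reduction on a toy dataset. The dataset, classifier, and class count are placeholders of my own choosing, not anything from the actual competition; the point is only that the number of fitted models grows linearly with the number of classes.

```python
# One-vs-rest on a tiny synthetic dataset with scikit-learn.
# With k classes it fits k binary classifiers -- fine for k = 5,
# intractable when k is on the order of 350,000 as in LSHTC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Tiny synthetic stand-in for a multi-class text dataset.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=10, n_classes=5)

# OneVsRestClassifier fits one binary logistic regression per class.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(clf.estimators_))  # -> 5, i.e. one model per class
```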
Due to these issues, I've chosen instead to work on the Galaxy Zoo problem, which is interesting in its own right. At its heart, the Galaxy Zoo competition is an image recognition problem. What makes it different from a typical image recognition problem is that the targets are proportions: probabilities generated by citizen scientists who classified the images by following a decision tree. We can think of this either as a regression problem or as a probability density estimation problem.
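
To illustrate the regression view, here is a rough sketch with synthetic stand-ins for the image features and the crowd-sourced answer proportions (the real competition asks for 37 proportions per galaxy and scores by RMSE); the feature dimensions and model choice here are assumptions for illustration only.

```python
# Galaxy Zoo as multi-output regression, on made-up data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_images, n_features, n_targets = 100, 64, 37

X = rng.normal(size=(n_images, n_features))      # stand-in image features
Y = rng.dirichlet(np.ones(n_targets), n_images)  # stand-in answer proportions

# Ridge handles multi-output targets directly: one predicted value per
# decision-tree question for each image.
model = Ridge(alpha=1.0).fit(X, Y)
preds = np.clip(model.predict(X), 0.0, 1.0)      # proportions live in [0, 1]
rmse = np.sqrt(np.mean((preds - Y) ** 2))
print(rmse)
```

A linear model like this is just a baseline to pin down the problem shape; anything that actually looks at the images will have to come later.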
I will be updating the blog with my first stab at the Galaxy Zoo problem shortly.