Kaggle competition
 

Archive for February, 2014


Starting on Galaxy Zoo

February 24th, 2014 by mjd2

As a first stab at working on the Galaxy Zoo problem, I ran least squares on a variety of compressed image dimensions.

The results showed that the best performance came from 32×32 images, but that 24×24 performed about the same. This is good news, considering that the training data size grows very fast with increased dimensionality. My next move here is to try matrix factorization/PCA to get a better least squares result, and then to try regularization.
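As a rough sketch of the experiment above, here is least squares over flattened, downscaled images at several sizes. Everything here is an assumption for illustration: the data is synthetic, the image size and the 37 response columns are stand-ins for the real Galaxy Zoo data, and `downscale` is a naive block average rather than whatever resizing the real pipeline uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def downscale(img, size):
    """Naive block-average downscale of a square grayscale image."""
    n = img.shape[0]
    block = n // size
    cropped = img[:size * block, :size * block]
    return cropped.reshape(size, block, size, block).mean(axis=(1, 3))

# Toy stand-ins for the real dataset (assumed shapes, random values).
n_train, full = 100, 96
images = rng.random((n_train, full, full))
y = rng.random((n_train, 37))   # 37 crowd-sourced probabilities per galaxy

for size in (8, 16, 24, 32):
    # Flatten each downscaled image into a feature row, add a bias column.
    X = np.stack([downscale(im, size).ravel() for im in images])
    X = np.hstack([np.ones((n_train, 1)), X])
    # Ordinary least squares fit of all 37 outputs at once.
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse = np.sqrt(np.mean((X @ W - y) ** 2))
    print(f"{size}x{size}: train RMSE {rmse:.4f}")
```

Note that at 32×32 the feature count (1024 + bias) exceeds this toy sample count, so the training RMSE alone is optimistic; held-out error is what actually matters when comparing sizes.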

I’ve also noted that some form of data augmentation will be important in my final submissions, but I’m working with the hypothesis that what works well without data augmentation will also work well with it. This allows me to avoid its computational costs during this more experimental period.
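One cheap form of augmentation this could mean, sketched below under my own assumptions (the post doesn't specify which transformations): galaxy labels are invariant to rotation and reflection, so each image can yield eight label-preserving variants (the symmetries of the square).

```python
import numpy as np

def augment(img):
    """Return the 8 rotations/reflections of a square image.

    Labels are proportions of volunteer votes, which don't depend on
    image orientation, so all 8 variants keep the original's label.
    """
    variants = []
    for k in range(4):
        rot = np.rot90(img, k)      # rotate by k * 90 degrees
        variants.append(rot)
        variants.append(np.fliplr(rot))  # and its mirror image
    return variants

img = np.arange(16.0).reshape(4, 4)
aug = augment(img)
print(len(aug))  # 8 variants per image
```

An 8× blow-up in training set size is exactly the computational cost the paragraph above is deferring.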

Changing kaggle competition

February 15th, 2014 by mjd2

It turns out that the Large Scale Hierarchical Text Classification Challenge was a little too challenging for the purposes of this project.

There are two main problems I have with this competition.

  1. Multi-class and multi-label. Most of the algorithms we discuss in class are not fundamentally designed to handle multi-class problems. Extending them to multi-class often involves techniques such as training k-1 binary models, which doesn’t work when k = 350,000. And that is only the multi-class aspect; once we combine it with the multi-label nature of the problem, the basic methods blow up. This makes it very difficult to fit the models we talk about in class, so the competition loses some of its relevance.
  2. Dataset size. The LSHTC dataset barely fits in the memory of my laptop, and I certainly cannot run any computation over it. This just adds even more complexity to an already difficult problem. While large datasets are certainly part of doing ML in the wild, for the purposes of the class project I feel they are more of a hindrance to learning than a useful experience.
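A quick back-of-the-envelope calculation makes the first point concrete. The numbers here are assumptions for illustration (the assumed vocabulary size of 400,000 is hypothetical; LSHTC uses sparse bag-of-words features over a large term vocabulary):

```python
# Rough cost of a dense one-vs-rest weight matrix for LSHTC.
n_classes = 350_000
n_features = 400_000        # assumed vocabulary size (illustrative only)
bytes_per_weight = 8        # one float64 per (class, feature) pair

dense_model_bytes = n_classes * n_features * bytes_per_weight
print(f"dense one-vs-rest weights: {dense_model_bytes / 1e12:.2f} TB")
```

Over a terabyte of parameters before training even starts, which is why the per-class binary-model approach doesn't scale here without aggressive sparsity tricks.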

Due to these issues, I’ve chosen instead to work on the Galaxy Zoo problem, which is particularly interesting in its own right. At its heart, the Galaxy Zoo competition is an image recognition problem. But what makes it different from a typical image recognition problem is that it involves predicting the proportions or probabilities generated by citizen scientists who classified the images by following a decision tree. We can think of this as either a regression problem or a probability density estimation problem.
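Taking the regression view, one immediate wrinkle (sketched below with made-up numbers) is that raw linear predictions can fall outside [0, 1] even though the targets are vote proportions; clipping is a cheap post-processing fix, and the competition is scored by root-mean-squared error.

```python
import numpy as np

def postprocess(preds):
    """Clamp raw regression outputs into the valid proportion range."""
    return np.clip(preds, 0.0, 1.0)

def rmse(pred, true):
    """Root-mean-squared error, the Galaxy Zoo evaluation metric."""
    return np.sqrt(np.mean((pred - true) ** 2))

# Hypothetical targets and raw predictions, purely for illustration.
true = np.array([0.9, 0.1, 0.0])
raw = np.array([1.2, 0.05, -0.1])
print(rmse(raw, true), rmse(postprocess(raw), true))
```

Since clipping can only move a prediction closer to any target inside [0, 1], it never hurts the score on valid proportion targets.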

I will be updating the blog with my first stab at the Galaxy Zoo problem shortly.