Kaggle competition
 

Final report

May 1st, 2014 by mjd2

The Galaxy Zoo Challenge asks participants to predict the proportions of user classifications of galaxy morphologies. Using the Extra-Trees algorithm together with data preprocessing and augmentation, I achieved an error rate of 0.12859 and a rank of 145/329. The key idea behind the solution was to mimic the success of convolutional neural networks by fitting a large model with a lot of built-in randomness to combat overfitting. While this approach performed moderately well, a feature-engineering-focused approach would likely have performed better given my computational limitations.
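
For concreteness, the core model-fitting step looked roughly like the sketch below. It assumes scikit-learn’s ExtraTreesRegressor and precomputed feature/target arrays; the hyperparameters and file names are illustrative rather than the exact values I used, and the preprocessing/augmentation that produces the features is not shown.

    # Sketch of an Extra-Trees regression pipeline for the 37 response columns.
    # Hyperparameters and file names are illustrative.
    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X = np.load("features.npy")   # one row of preprocessed pixel features per galaxy
    Y = np.load("targets.npy")    # the 37 response proportions per galaxy

    X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.1, random_state=0)

    # A large, heavily randomized ensemble: many trees with random split thresholds.
    model = ExtraTreesRegressor(n_estimators=500, max_features="sqrt",
                                n_jobs=-1, random_state=0)
    model.fit(X_tr, Y_tr)

    # The competition scores RMSE over all 37 outputs.
    rmse = np.sqrt(mean_squared_error(Y_val, model.predict(X_val)))
    print(f"validation RMSE: {rmse:.5f}")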

report



Final presentation

April 30th, 2014 by mjd2

Here is the poster for my presentation. I ended up using the Extra-Trees algorithm.

poster



Learning curve for least squares

March 29th, 2014 by mjd2

From the learning curve for least squares, we can see that the feature set I’m using is simply not rich enough to capture the information required for a competitive score on the leaderboard.

I’m having a bit of trouble with my implementation of ridge regression, but I cannot figure out what I’ve done wrong in my code.
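
In case someone spots the bug for me, here is what a minimal ridge regression via the regularized normal equations should look like. This is a generic numpy sketch, not my actual code; a common mistake is penalizing the bias term or using an inconsistent scale for lambda.

    # Generic ridge regression: w = (X^T X + lam * I)^(-1) X^T y.
    # Reference sketch only, not my implementation.
    import numpy as np

    def ridge_fit(X, y, lam):
        # Note: this penalizes every column, so a bias column in X should be
        # handled separately if you do not want it regularized.
        n_features = X.shape[1]
        A = X.T @ X + lam * np.eye(n_features)
        return np.linalg.solve(A, X.T @ y)

    # Sanity check on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)
    w = ridge_fit(X, y, lam=1.0)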



Quick Description of Galaxy Zoo 2 (GZ2) data

March 29th, 2014 by mjd2

This information was obtained by reading the official GZ2 paper: http://arxiv.org/abs/1308.3496.

The GZ2 dataset comes from citizen scientists who voluntarily classify galaxies using a guided process, namely a multi-step decision tree. The dataset used in the Kaggle competition is the result of several debiasing procedures that produce likelihoods from the classifications. ML applications of this data tend to interpret these likelihoods as probabilistic weights instead.

Each row of the Y matrix that we wish to predict is therefore made up of probabilistic weights. This does not mean that each row sums to 1. Instead, each of the 37 columns of Y corresponds to one of the 37 individual responses, and the responses are grouped under 11 sub-questions whose answers sum to 1. For example, “Is the galaxy simply smooth and rounded, with no sign of a disk?” has 3 responses (smooth, features, star), which correspond to Y(:,1), Y(:,2), and Y(:,3) and sum to 1.

Not all questions are answered in each individual classification; which parts of the decision tree a classifier sees depends on their previous answers.
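
A quick way to see this structure is to check the column-group sums directly. The sketch below assumes the training solutions have been downloaded as a CSV (the file name is an assumption) with the GalaxyID in the first column and the 37 responses in competition order after it.

    # Sanity check of the Y structure described above.
    import numpy as np

    # Assumed file name; drop the GalaxyID column, keep the 37 responses.
    Y = np.loadtxt("training_solutions_rev1.csv", delimiter=",", skiprows=1)[:, 1:]

    # The first question (smooth / features / star) occupies the first three
    # columns, and its weights should sum to 1 for every galaxy.
    print(np.allclose(Y[:, 0:3].sum(axis=1), 1.0, atol=1e-3))

    # Whole rows do not sum to 1, since the 37 columns span 11 question groups.
    print(Y.sum(axis=1)[:5])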



Starting on Galaxy Zoo

February 24th, 2014 by mjd2

As a first stab at working on the Galaxy Zoo problem, I ran least squares on a variety of compressed image dimensions.

Here we can see that the best results came from 32×32 images, but that 24×24 performed about the same. This is good news considering that the training data size grows very quickly with increased dimensionality. My next move is to try matrix factorization/PCA to get a better least squares result, and then to try regularization.
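
For reference, the baseline looks roughly like the sketch below. File and directory names are assumptions for illustration, and it assumes the sorted image files line up with the rows of the solutions file.

    # Baseline sketch: downsample each galaxy image, flatten it into a feature
    # vector, and fit ordinary least squares to the 37 response columns.
    import glob
    import numpy as np
    from PIL import Image

    SIZE = 24  # other sizes (e.g. 32) swap in here

    def load_features(image_dir, size=SIZE):
        files = sorted(glob.glob(f"{image_dir}/*.jpg"))  # must match row order of Y
        X = [np.asarray(Image.open(f).convert("L").resize((size, size)),
                        dtype=np.float64).ravel() / 255.0
             for f in files]
        return np.array(X)

    X = load_features("images_training_rev1")
    Y = np.loadtxt("training_solutions_rev1.csv", delimiter=",", skiprows=1)[:, 1:]

    # Least squares: W minimizes ||XW - Y||^2 over all 37 outputs at once.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    rmse = np.sqrt(np.mean((X @ W - Y) ** 2))
    print(f"training RMSE at {SIZE}x{SIZE}: {rmse:.5f}")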

I’ve also noted that some form of data augmentation will be important in my final submissions, but I’m working with the hypothesis that what works well without data augmentation will also work well with it. This allows me to avoid its computational costs during this more experimental period.
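
One cheap form of augmentation (not necessarily the scheme I’ll end up using) exploits the fact that galaxies have no preferred orientation, so rotated and flipped copies of a training image can be added with the same 37 target values:

    # Dihedral augmentation sketch: 4 rotations x optional horizontal flip = 8 variants.
    import numpy as np

    def augment(img):
        # Yield the 8 dihedral variants of a square image array.
        for k in range(4):
            rotated = np.rot90(img, k)
            yield rotated
            yield np.fliplr(rotated)

    variants = list(augment(np.zeros((24, 24))))  # toy usage: 8 arrays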



Changing kaggle competition

February 15th, 2014 by mjd2

It turns out that the Large Scale Hierarchical Text Classification Challenge was a little too challenging for the purposes of this project.

There are two main problems I have with this competition.

  1. Multi-class and multi-label. Most of the algorithms we discuss in class are not fundamentally designed to handle multi-class problems. The methods for extending these algorithms to multi-class often involve techniques such as training k-1 binary models. When k = 350,000 this doesn’t work, and that is only the multi-class part. When we combine this with the multi-label nature of the problem, the basic methods blow up. This makes it very difficult to fit the models we talk about in class, and so the competition loses some of its relevance.
  2. Dataset size. The LSHTC dataset hardly fits in the memory of my laptop and I certainly cannot run any computation over it. This just adds even more complexity to an already difficult problem. While large datasets are certainly part of dealing with ML in the wild, I feel like it’s more of a hindrance to learning than a useful experience for the purposes of the class project.

Due to these issues, I’ve chosen instead to work on the Galaxy Zoo problem, which is particularly interesting in its own right. At its heart, the Galaxy Zoo competition is an image recognition problem. But what makes it different from a typical image recognition problem is that it involves predicting proportions or probabilities generated by citizen scientists who classified the images by following a decision tree. We can think of this either as a regression problem or as a probability density generation problem.

I will be updating the blog with my first stab at the Galaxy Zoo problem shortly.



Project Proposal

January 24th, 2014 by mjd2

Choosing a competition

The first step towards the completion of our term project is choosing a Kaggle competition to compete in. The loose restrictions are to pick a competition that is:

  • ending later rather than sooner (ideally towards the end of the semester)
  • a problem conducive to machine learning research, as opposed to just a programming challenge

This makes the process of picking a competition to work on pretty hard. With these constraints, I see four main options: loan default predictions, galaxy classification, hierarchical text classification, and object recognition in images. None of these jump out at me immediately. I’m not particularly interested in financial problems or computer vision so I decided to table these — although the loan default prediction looks like the most well defined of the available problems.

Large Scale Hierarchical Text Classification Challenge

For now I’m choosing to work on the LSHTC challenge. LSHTC4 is the latest in a series of challenges sponsored and organized by various academic institutes in Greece and France. The dataset is large, with 2,400,000 documents and 325,000 labels. Additionally, each observation can have multiple labels, making the problem seemingly very difficult.

I chose this competition because I find text mining to be an interesting field within machine learning. The multi-label responses are another interesting aspect of the challenge. Holistically, I feel that this problem is the most closely related to the machine learning and data mining challenges one might meet out in the real world.

I do have some lingering concerns regarding the competition, and I’m not completely set in my choice as of now. The first is that the problem is hard, perhaps too hard. Clustering and unsupervised learning are tough enough with small numbers of single-label categories; over 325,000 multi-label categories seems rough. In the same vein, I’m a little worried about computational problems. The dataset might be too large for model fitting on my laptop, but this might be alleviated by the in-class suggestion of access to the davinci cluster.

Approach

I will start by attempting to train basic linear models on the data. This might be particularly challenging due to the large number of features.

Resources

As mentioned earlier, it might be very useful to obtain access to the davinci cluster for computing on this large dataset. I plan on reading research literature on the subject as the topic does seem to be an open problem. Any suggestions for relevant text or other words of advice would be appreciated.

P.S.: It would be interesting to see what problems the rest of the class has decided to work on.



