Kaggle competition
 

Quick Description of Galaxy Zoo 2 (GZ2) data

This information was obtained by reading the official GZ2 paper: http://arxiv.org/abs/1308.3496.

The GZ2 dataset comes from citizen scientists who voluntarily classify galaxies using a guided process, namely a multi-step decision tree. The dataset used in the Kaggle competition is the result of several debiasing procedures, which produce likelihoods from the raw classifications. ML applications of this data tend to interpret these likelihoods as probabilistic weights instead.

Each row of the Y matrix that we wish to predict therefore consists of probabilistic weights. This does not mean that each row sums to 1. Instead, each column of Y corresponds to one of the 37 individual responses. These responses belong to 11 sub-questions, and the responses within a sub-question sum to 1. For example, the question “Is the galaxy simply smooth and rounded, with no sign of a disk?” has 3 responses (smooth, features, star), which correspond to Y(:,1), Y(:,2) and Y(:,3) and sum to 1.
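
As a rough illustration of the point above, the following Python sketch checks that the three responses of the first question sum to 1 for a single row. The row values, the helper names `question_sum` and `sums_to_one`, and the tolerance are all assumptions made for this example; they are not taken from the GZ2 paper or the Kaggle data. Note that Python indexes from 0, so Y(:,1), Y(:,2) and Y(:,3) become columns 0, 1 and 2.

```python
# 0-based column indices of the first question's responses
# (smooth, features, star), i.e. Y(:,1), Y(:,2), Y(:,3) in 1-based notation.
FIRST_QUESTION_COLUMNS = (0, 1, 2)


def question_sum(row, columns):
    """Add up the weights of the given response columns in one row of Y."""
    return sum(row[c] for c in columns)


def sums_to_one(row, columns, tol=1e-6):
    """Return True if the selected responses sum to 1 within a tolerance."""
    return abs(question_sum(row, columns) - 1.0) <= tol


# Hypothetical, made-up weights for illustration only; a real row of Y
# has 37 columns, and the other 34 are omitted here.
example_row = [0.25, 0.5, 0.25]

print(sums_to_one(example_row, FIRST_QUESTION_COLUMNS))
```

The same helper could be applied to the column group of any other question, provided the column indices for that question are known.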

Not every question is answered in each individual classification: a volunteer only visits the parts of the decision tree that follow from their previous answers.
