Tuesday, April 09 2013 @ 00:00 +0200
After almost two years without a single competition, last September I decided to enter the Stackoverflow contest on Kaggle. It was a straightforward text classification problem with extremely unbalanced classes.
The winning model is an average of 10 neural network ensembles, each combining five constituent models: three Deep Belief Networks, one logistic regression model, and one Vowpal Wabbit model. Features are all binary and include handcrafted, binned indicators (time of post, length of title, etc.) and unigrams from the title and body.
Since the data set - especially the class distribution - evolves with time, one crucial step is to compensate for the effect of time. This is partly accomplished by adding date and time information as features, and also by training the ensemble on the most recent posts.
Since the constituent models are trained on a subset of the stratified sample provided by the organizer, the ensemble does two things:
- Blend the constituent models, duh.
- Compensate for the differences between the stratified sample and the most recent months of the full training set.
Feature Selection / Extraction
Didn't spend too much time on handcrafting the features, just played around with adding features one-by-one, keeping an eye on how the loss changes. These are all binary features. For example (:post-hour 8) is 8 o'clock UTC, (:post-hour 11) is 11 o'clock UTC.
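To make the shape of these features concrete, here is a minimal Python sketch of binned binary indicators. It is not the code used in the competition: the function name, the feature names other than post-hour, and the title length bin edges are all made up for illustration.

```python
from datetime import datetime, timezone

def binned_indicators(post_time, title, length_bins=(10, 20, 40, 80)):
    """Return the set of binary features that are 'on' for one post.

    post_time must be a timezone-aware datetime. The bin edges in
    length_bins are hypothetical, not the ones used in the competition.
    """
    features = set()
    # Hour of posting in UTC, e.g. ("post-hour", 8) for 8 o'clock UTC.
    features.add(("post-hour", post_time.astimezone(timezone.utc).hour))
    # Title length bin: the number of bin edges the length exceeds.
    bin_index = sum(len(title) > edge for edge in length_bins)
    features.add(("title-length-bin", bin_index))
    # Unigrams from the title, lowercased and split on whitespace.
    for word in title.lower().split():
        features.add(("title-word", word))
    return features

example = binned_indicators(
    datetime(2012, 9, 1, 8, 30, tzinfo=timezone.utc), "How to parse JSON")
```

Each post then becomes a sparse set of such indicators, which maps directly onto binary input units.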
Depending on the model the top-N features are used, where the features are sorted by log likelihood ratio. There were a number of other feature selection methods tried, see below.
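The exact log likelihood ratio formula is not spelled out here, so the following is only a sketch under an assumption: it scores each binary feature with the standard G statistic of the feature-presence by class contingency table and keeps the N highest scoring features. The function names are hypothetical.

```python
import math
from collections import Counter

def llr_scores(posts):
    """posts: list of (feature_set, label). Returns {feature: G statistic}.

    Assumption: G = 2 * sum(observed * ln(observed / expected)) over the
    2 x K table of (feature present/absent) x (class).
    """
    n = len(posts)
    class_counts = Counter(label for _, label in posts)
    # present[f][c] = number of posts of class c that contain feature f
    present = {}
    for features, label in posts:
        for f in features:
            present.setdefault(f, Counter())[label] += 1
    scores = {}
    for f, with_f in present.items():
        n_with = sum(with_f.values())
        g = 0.0
        for c, n_c in class_counts.items():
            # One cell for "feature present", one for "feature absent".
            for observed, row_total in ((with_f[c], n_with),
                                        (n_c - with_f[c], n - n_with)):
                if observed > 0:
                    expected = row_total * n_c / n
                    g += observed * math.log(observed / expected)
        scores[f] = 2.0 * g
    return scores

def top_n_features(posts, n):
    """Return the n features with the highest score."""
    scores = llr_scores(posts)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

A feature that occurs in every post, or equally often in every class, scores zero and drops to the bottom of the ranking.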
Modeling Techniques and Training
First let's look one by one at the models that went into the ensemble.
Deep Belief Networks
A DBN is made of restricted Boltzmann machines (RBMs) stacked on top of each other, trained in a layerwise manner. After training (called 'pretraining') the DBN is 'unrolled' into a backpropagation network that's trained (called 'fine-tuning') to minimize the cross entropy between the predicted and actual class probabilities.
There are three DBNs in the ensemble. The first one looks like this (omitted the biases for clarity):
    LABEL(5)  INPUTS(2000)
          \    /
         F1(400)
            |
         F2(800)
So we have 5 softmax neurons in the LABEL chunk, representing the class probabilities. There are 2000 sigmoid neurons in the INPUTS chunk standing for the top 2000 binary features extracted from the post. Then we have two hidden layers of sigmoid neurons: F1 and F2. This is created in the code by MAKE-MALACKA-DBN-SMALL.
The second DBN is the same except INPUTS, F1 and F2 have 10000, 800, 800 neurons respectively. See MAKE-MALACKA-DBN-BIG.
The third DBN is the same except INPUTS, F1 and F2 have 10000, 2000, 2000 neurons respectively. See MAKE-MALACKA-DBN-BIGGER. Note that this last DBN wasn't fine-tuned due to time constraints; predictions are extracted directly from the DBN, which doesn't try to minimize cross entropy.
The RBMs in the DBN were trained with contrastive divergence with mini batches of 100 posts. Learning rate was 0.001, momentum 0.9, weight decay 0.0002.
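For readers who have not seen contrastive divergence, here is a rough pure-Python sketch of one update using the hyperparameters above. It is an illustration only and makes several assumptions not stated in the text: a single Gibbs step (CD-1), no biases, and weight decay applied as a plain L2 penalty inside the gradient. All names are hypothetical, and a real implementation would use matrix operations.

```python
import math
import random

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hidden_probs(W, v):
    # W[i][j] connects visible unit i to hidden unit j (biases omitted).
    return [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(W[0]))]

def visible_probs(W, h):
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(W))]

def cd1_step(W, velocity, batch, rng, lr=0.001, momentum=0.9, decay=0.0002):
    """One CD-1 update of W (in place) from a mini batch of binary vectors."""
    n_vis, n_hid = len(W), len(W[0])
    grad = [[0.0] * n_hid for _ in range(n_vis)]
    for v0 in batch:
        h0 = hidden_probs(W, v0)
        # Sample binary hidden states, then reconstruct the visible layer.
        h0_sample = [1.0 if rng.random() < p else 0.0 for p in h0]
        v1 = visible_probs(W, h0_sample)
        h1 = hidden_probs(W, v1)
        for i in range(n_vis):
            for j in range(n_hid):
                # Positive phase minus negative phase statistics.
                grad[i][j] += v0[i] * h0[j] - v1[i] * h1[j]
    for i in range(n_vis):
        for j in range(n_hid):
            g = grad[i][j] / len(batch) - decay * W[i][j]
            velocity[i][j] = momentum * velocity[i][j] + lr * g
            W[i][j] += velocity[i][j]
```

In a stacked setup, the hidden probabilities of one trained RBM serve as the visible data for the next one.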
The backprop networks were trained for 38 epochs with the conjugate gradient method with three line searches on batches of 10000 posts. For the first 5 epochs, only the softmax units were trained, and for the last 3 epochs there was only one batch per epoch (i.e. normal conjugate gradient).
These guys take several hours to days to train.
Logistic Regression
Not much to say here. I used liblinear with the top 250000 features, with these parameters:
:solver-type :l2r-lr :c 256 :eps 0.001
Even though it had access to a much larger set of features, liblinear could only achieve a loss of ~0.83 on the stratified sample used for development vs ~0.79 for the second DBN. Still, even though they used the same kind of features, the two models were different enough to slightly improve the ensemble.
Vowpal Wabbit
I'm not sure adding this helped at all in the end; the results weren't entirely convincing. I just took Foxtrot's code. VW is run with --loss_function logistic --oaa 5.
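For context, VW reads one example per line in a plain text format, and with --oaa the class labels run from 1 to k. The helper below is a hypothetical sketch of producing such lines from tokenized posts; it is not Foxtrot's code, and the namespace name is made up.

```python
def to_vw_line(label, tokens, namespace="f"):
    """Format one post as a VW input line.

    label is 0-based here; VW's --oaa expects labels 1..k, hence the + 1.
    ':' and '|' have special meaning in the VW format, so they are replaced.
    Tokens are assumed to contain no whitespace.
    """
    clean = [t.replace(":", "_").replace("|", "_") for t in tokens if t.strip()]
    return f"{label + 1} |{namespace} " + " ".join(clean)

line = to_vw_line(0, ["how", "c++", "a:b"])
```

Writing one such line per post to a file gives something that can be passed to vw together with the flags above.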
The Ensemble
The ensemble is a backpropagation neural network with one hidden layer of 800 stochastic sigmoid neurons (at least that was the intention, see below). The network looked like this:
    PRED1 PRED2 PRED3 PRED4 PRED5
        \     \   |   /     /
         \_____\__|__/_____/
             OUTPUT(800)
                  |
            CROSS-ENTROPY
PRED1 is made of five neurons representing the class probabilities in the prediction of the first DBN. The rest of PRED* are for the other two DBNs, the liblinear model, and VW.
The network was trained with gradient descent with mini batches of 100 posts. Learning rate started out as 0.01 and was multiplied by 0.98 each epoch. Momentum started out as 0.5 and was increased to 0.99 in 50 epochs. Learning rate was also multiplied by (1 - momentum) to disentangle it from the momentum. No weight decay was used.
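The schedule can be sketched as follows. The linear momentum ramp is an assumption, since only the start value, the end value and the number of epochs are given. The function names are hypothetical.

```python
def schedule(epoch, lr0=0.01, lr_decay=0.98,
             m_start=0.5, m_end=0.99, m_epochs=50):
    """Return (momentum, effective learning rate) for a 0-based epoch."""
    lr = lr0 * lr_decay ** epoch
    # Assumption: momentum rises linearly, then stays at m_end.
    t = min(epoch / m_epochs, 1.0)
    momentum = m_start + t * (m_end - m_start)
    # Scale by (1 - momentum) so the long-run step size does not grow
    # as momentum increases.
    return momentum, lr * (1.0 - momentum)

def sgd_momentum_step(weights, velocity, grads, momentum, effective_lr):
    """Classical momentum update, in place, on flat lists of floats."""
    for i, g in enumerate(grads):
        velocity[i] = momentum * velocity[i] - effective_lr * g
        weights[i] += velocity[i]
```

The reason for the (1 - momentum) factor: under a constant gradient g, the velocity converges to lr * g / (1 - momentum), so without the correction raising momentum from 0.5 to 0.99 would silently multiply the step size by 50.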
I tried to get Hinton's dropout technique working, but it didn't live up to my expectations. On the other hand, the stochastic binary neurons mentioned in the dropout presentation did help a tiny bit. Unfortunately, I managed to make the final submission with a broken version where the weights of stochastic binary neurons were not trained at all, effectively resulting in 800 random features (!).
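As a rough illustration of what a stochastic binary neuron does in the forward pass (a sketch, not the competition code, with biases omitted and a hypothetical function name): each unit computes the usual sigmoid probability, then outputs 1 with that probability and 0 otherwise.

```python
import math
import random

def stochastic_binary_layer(inputs, weights, rng):
    """Return (sampled 0/1 states, firing probabilities) for one layer.

    weights[j] holds the incoming weights of hidden unit j.
    """
    probs = []
    for unit_weights in weights:
        a = sum(w * x for w, x in zip(unit_weights, inputs))
        probs.append(1.0 / (1.0 + math.exp(-a)))
    # Sample a binary state per unit instead of passing on the probability.
    states = [1.0 if rng.random() < p else 0.0 for p in probs]
    return states, probs

states, probs = stochastic_binary_layer(
    [1.0, 0.0], [[0.0, 0.0], [50.0, 0.0], [-50.0, 0.0]], random.Random(1))
```

If the incoming weights are never updated, as in the broken version, the sampled states are just fixed random projections of the inputs pushed through a noisy threshold.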
As good as stochastic binary neurons were before I broke the code, it still helped a tiny bit (as in a couple of 0.0001s) to average 10 ensembles.
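Averaging here simply means taking the mean of the predicted class probabilities over the independently trained ensembles. A minimal sketch, with a hypothetical function name and data layout:

```python
def average_predictions(runs):
    """runs[r][p][c]: probability of class c for post p from ensemble run r.

    Returns one averaged probability vector per post.
    """
    n_runs = len(runs)
    averaged = []
    for per_post in zip(*runs):
        # per_post holds one probability vector per run for the same post.
        averaged.append([sum(col) / n_runs for col in zip(*per_post)])
    return averaged

avg = average_predictions([[[0.2, 0.8]], [[0.4, 0.6]]])
```

Since every run outputs a proper distribution, the average is one too, so no renormalization is needed.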
Additional Comments and Observations
It was clear from the beginning that time plays an important role, and if scores are close then predicting the class distribution of the test set could be the deciding factor. I saw the pace of change (with regard to the distribution of classes) picking up near the end of the development training set and probed the public leaderboard by submitting a number of different constant predictions (the same prior for every post). It seemed that a prior estimated from the last two weeks or one month worked best.
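A constant prediction of this kind can be sketched like this: estimate the class prior from the posts in a recent time window and predict it for every post. The scoring function assumes the score is multiclass log loss, which is an assumption on my part for the sketch; the function names and the smoothing parameter are hypothetical.

```python
import math
from collections import Counter

def recent_labels(posts, cutoff):
    """posts: list of (timestamp, label). Keep labels at or after cutoff."""
    return [label for timestamp, label in posts if timestamp >= cutoff]

def class_prior(labels, n_classes, smoothing=1.0):
    """Smoothed class frequencies, usable as a constant prediction."""
    counts = Counter(labels)
    total = len(labels) + smoothing * n_classes
    return [(counts[c] + smoothing) / total for c in range(n_classes)]

def log_loss_constant(prior, labels):
    """Mean negative log probability of the true class under the prior."""
    return -sum(math.log(prior[y]) for y in labels) / len(labels)
```

Comparing log_loss_constant for priors taken from windows of different lengths on a held-out recent slice gives a local version of the leaderboard probing described above.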
There was no obvious seasonality or trend that could be exploited on the scale of months. I checked whether Stackoverflow was changing the mechanics, but didn't find anything. I certainly didn't foresee the drastic class distribution change that was to come.
I tried a couple of feature extraction methods. The Key-Substring-Group extractor looked very promising, but it simply didn't scale to more than a thousand features.
In the end, by playing with liblinear, which could handle all features at the same time, I found that no important features were left out. Take it with a grain of salt, of course, because there is a noise/signal issue lurking.
Naive Bayes, random forests, gradient boosting
I experimented with the above in scikit-learn. The results were terrible, but worse, they didn't contribute to the ensemble either. Maybe it was only me.
I couldn't get it to scale to several tens of thousands of posts so I had to go with liblinear.
Fine-tuning DBNs with dropout or stochastic binary neurons (without the bugs) didn't work. The best I could achieve was slightly worse than the conjugate gradient based score.
Retraining constituent models
Recall that the constituent models were trained only on 4/5 of the available data. After the ensemble was trained, I intended to retrain them on the whole stratified training set. Initial experiments with liblinear were promising, but with the DBN the public leaderboard score got a lot worse and I ran out of time to experiment.