We develop Multi-label Random Forests to tackle problems with millions of labels. We propose a novel node splitting criterion that: (a) generalizes traditional random forests to multi-label problems; (b) facilitates very large-scale training by learning from only positive data; and (c) overcomes the well-known random forest limitation of overfitting in high-dimensional spaces. To compensate for numerous missing and incorrect labels, we do not train on the given labels directly but extend our random forest classifier to train on beliefs inferred about the state of each label. These beliefs are generated through a novel sparse semi-supervised learning formulation optimized via an efficient distributed iterative hard thresholding algorithm.
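The abstract does not specify the paper's distributed formulation, but the core optimizer it names, iterative hard thresholding, is a standard method for sparsity-constrained least squares. The sketch below is a generic, single-machine illustration under assumed notation (design matrix `A`, observations `y`, sparsity budget `k`), not the authors' algorithm: each step takes a gradient move and then projects onto the set of `k`-sparse vectors by keeping the `k` largest-magnitude entries.

```python
import numpy as np

def hard_threshold(x, k):
    # Projection onto k-sparse vectors: keep the k largest-magnitude
    # entries of x and zero out the rest.
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def iht(A, y, k, step=None, iters=500):
    # Iterative hard thresholding for
    #     min_x ||y - A x||^2  subject to  ||x||_0 <= k.
    # This is a generic illustration, not the paper's distributed variant.
    m, n = A.shape
    if step is None:
        # Conservative fixed step: inverse squared spectral norm of A.
        step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(n)
    for _ in range(iters):
        grad = A.T @ (y - A @ x)          # gradient of the least-squares loss
        x = hard_threshold(x + step * grad, k)
    return x
```

In a distributed setting the gradient `A.T @ (y - A @ x)` decomposes over row blocks of `A`, so each worker can compute a partial gradient on its shard and the thresholding step is applied after aggregation; the sketch above only shows the serial core of the iteration.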