GitHub - a061105/ExtremeMulticlass


0. Compile:
Run "make" to compile; this requires g++ with OpenMP support.

1. Usage:

Three binaries will be generated: multiTrain, multiTrainHash and multiPred.
Both "multiTrain" and "multiTrainHash" can be used for training; they provide exactly the same functionality.
Note that "multiTrainHash" is designed to be more memory efficient, whereas "multiTrain" is faster when memory is sufficient (i.e. when several matrices of size (#classes) by (#features) can fit into memory).
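
As a rough aid for choosing between the two trainers, one can estimate the footprint of dense (#classes) by (#features) matrices. The sketch below is only an illustration: the function name, the 4-byte element size and the count of three matrices are assumptions, not values taken from the source code.

```python
def dense_matrix_bytes(num_classes, num_features, bytes_per_entry=4, num_matrices=1):
    """Bytes needed for `num_matrices` dense (#classes x #features) arrays.

    bytes_per_entry=4 is an assumption (single-precision float); the element
    type actually used by multiTrain may differ.
    """
    return num_classes * num_features * bytes_per_entry * num_matrices


# Sizes taken from the rcv1_regions model header in section 3
# (228 classes, 47237 features). Three matrices is a hypothetical count.
total = dense_matrix_bytes(228, 47237, num_matrices=3)
print(f"{total / 2**20:.1f} MiB")
```

If the estimate is well below the available memory, "multiTrain" is the natural first choice; otherwise "multiTrainHash" may be the safer option.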


1.1. Training: ./multiTrain (options) [train_data] (model)
options:
-s solver: (default 0)
	0 -- Stochastic Block Coordinate Descent
	1 -- Stochastic-Active Block Coordinate Descent (PD-Sparse)
-l lambda: L1 regularization weight (default 1.0)
-c cost: cost of each sample (default 1.0)
-r speed_up_rate: sample 1/r fraction of non-zero features to estimate gradient (default r = ceil( min( 5DK / (C log(K) nnz(X)), nnz(X) / (5N) ) ))
-q split_up_rate: divide all classes into q disjoint subsets (default 1)
-m max_iter: maximum number of iterations allowed (default 20)
-u uniform_sampling: use uniform sampling instead of importance sampling (default: not used)
-g max_select: maximum number of dual variables selected during search (default: -1 (i.e. dynamically adjusted during iterations) )
-p post_train_iter: number of post-training iterations without L1 regularization (default 0)
-e early_terminate (default 3)
-h <file>: using heldout file <file>
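
The default for -r can be worked through numerically. The sketch below assumes that D and K are the numbers of features and classes (as in section 3), N is the number of training samples, nnz(X) is the number of non-zero entries in the training data, C is the -c cost, and log is the natural logarithm. None of these readings is confirmed by this README, and the function name is hypothetical.

```python
import math


def default_speed_up_rate(D, K, N, nnz, C=1.0):
    """Evaluate r = ceil(min(5DK / (C log(K) nnz), nnz / (5N))).

    Assumed meanings: D = #features, K = #classes (K > 1), N = #samples,
    nnz = non-zeros in the training data, C = cost (-c), natural log.
    """
    first = 5.0 * D * K / (C * math.log(K) * nnz)
    second = nnz / (5.0 * N)
    return math.ceil(min(first, second))


# Hypothetical data set sizes, chosen only to illustrate the arithmetic.
print(default_speed_up_rate(D=1000, K=100, N=500, nnz=50000))
```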

Train models for the data sets provided in the examples folder:
./multiTrain -c 1.0 -l 0.1 -s 1 -r -1.0 -e 3 -m 200 -q 3 -g -1 -p 200  -h ./examples/multilabel/rcv1_regions.heldout ./examples/multilabel/rcv1_regions.train rcv1_regions.model
./multiTrain -c 1.0 -l 0.1 -s 1 -r -1.0 -e 3 -m 200 -q 1 -g -1 -p 200  -h ./examples/multiclass/sector/sector.heldout ./examples/multiclass/sector/sector.train sector.model
Models generated by post-training are stored separately, under the names "rcv1_regions.model.p" and "sector.model.p".

1.2. Prediction: ./multiPred [testfile] [model] (k) (compute top k accuracy, default 1)

Compute top-1 accuracy of the model generated:
./multiPred ./examples/multilabel/rcv1_regions.test rcv1_regions.model 1

Compute test accuracy of the model generated by post training:
./multiPred ./examples/multiclass/sector/sector.test sector.model.p 1
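
For readers who want to check the reported number by hand, here is a minimal sketch of one common definition of top-k accuracy: a sample counts as correct if any of its k highest-scoring classes is among its true labels. This definition is an assumption; multiPred may count hits differently, in particular for multilabel data. The function name and data layout are hypothetical.

```python
def top_k_accuracy(scores, true_labels, k=1):
    """scores: one {label: score} dict per sample.
    true_labels: one set of correct labels per sample.
    Returns the fraction of samples with a correct label in the top k."""
    hits = 0
    for sample_scores, truth in zip(scores, true_labels):
        # Labels sorted by descending score, truncated to the k best.
        top = sorted(sample_scores, key=sample_scores.get, reverse=True)[:k]
        if any(label in truth for label in top):
            hits += 1
    return hits / len(scores)


# Two toy samples with made-up scores.
scores = [{1: 0.9, 2: 0.1}, {1: 0.2, 2: 0.3, 3: 0.25}]
truth = [{1}, {3}]
print(top_k_accuracy(scores, truth, k=1), top_k_accuracy(scores, truth, k=2))
```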

1.3. Downloading data:
We provide data split into train, heldout and test sets.
The data set names are listed below:

multilabel data sets:
"Eur-Lex", "rcv1_regions", "bibtex", "LSHTCwiki"

multiclass data sets:
"sector", "aloi.bin", "Dmoz", "LSHTC1", "imageNet"  

Note that some data sets (like "rcv1_regions") are available online, whereas others (like "aloi.bin") are specially processed.

One can download them by

$ cd examples/
$ make construct dataset=rcv1_regions

Note that the variable "dataset" must be set to the exact data set name as listed above.

1.4. Run examples with Makefile:
For the data sets listed above, we also provide a simple script to download, train and test on them automatically.
One can do that by

make rcv1_regions


2. Input Format (with examples):

We use libsvm format (for multilabel data, the labels of a sample are separated by commas):
<label> <index1>:<value1> <index2>:<value2> ... 
.
.
.

$ head -1 ./examples/multiclass/sector/sector.train
53 1:0.00049 2:0.0009 3:4e-05 5:0.00054 6:0.00458 8:0.01302 ... 41303:0.14897 

$ head -1 ./examples/multilabel/rcv1_regions.train
53,112 440:0.0463308116107426 730:0.0982669864147017 ... 46694:0.0628286522726235
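
The two examples above can be read with a few lines of code. The following is a minimal sketch, not the parser used by multiTrain: the function name is hypothetical, labels are kept as strings, and lines without any label are not handled.

```python
def parse_libsvm_line(line):
    """Split one line into (labels, features).

    labels: list of label strings (one entry for multiclass data,
            several comma-separated entries for multilabel data).
    features: dict mapping feature index (int) to value (float).
    """
    parts = line.split()
    labels = parts[0].split(",")
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return labels, features


# Shortened versions of the sample lines shown above.
print(parse_libsvm_line("53 1:0.00049 2:0.0009 3:4e-05"))
print(parse_libsvm_line("53,112 440:0.0463308116107426 730:0.0982669864147017"))
```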

3. Output Format (with examples):

Our model file format:
nr_class <K = number of classes>
label <label 1> ... <label K>
nr_feature <D = number of features>
<nnz(w[0])> <index 1>:<w[0][index 1]> <index 2>:<w[0][index 2]> ... <index nnz(w[0])>:<w[0][index nnz(w[0])]>
<nnz(w[1])> <index 1>:<w[1][index 1]> <index 2>:<w[1][index 2]> ... <index nnz(w[1])>:<w[1][index nnz(w[1])]>
.
.
.
<nnz(w[D-1])> ...
 
$ head -5 model
nr_class 228
label 293 53 112 288 285 273 94 11 289 231 32 25 30 129 134 136 145 203 214 220 246 248 270 265 62 180 202 131 103 222 17 193 74 139 282 177 351 89 142 219 3 133 86 236 155 167 138 140 159 115 146 196 197 258 73 230 365 256 195 106 2 128 7 90 349 143 250 79 45 149 24 63 362 281 43 290 41 224 153 175 161 70 251 61 223 93 113 6 75 28 158 48 147 44 174 156 348 190 266 69 33 170 169 242 182 173 109 130 233 267 232 363 151 268 126 46 271 22 262 364 26 18 346 188 162 178 42 85 87 165 245 361 359 261 192 163 287 101 199 114 152 34 210 347 206 80 166 183 252 216 212 95 154 76 205 283 92 276 118 345 65 291 123 29 148 40 157 235 36 91 144 16 68 27 39 184 264 179 218 64 82 211 243 21 239 181 108 226 116 217 72 97 1 160 292 60 172 227 52 102 350 66 360 254 240 141 50 56 358 234 280 255 257 54 278 105 194 208 35 204 124 132 275 37 168 13 171 352 
nr_feature 47237
0 
6 65:0.223974 3:0.00958934 17:0.0270412 11:0.134858 42:-0.0397849 98:-0.0188011 
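
A model file in this layout can be loaded and applied as sketched below. Several points are assumptions rather than facts stated in this README: that row j of the weight section belongs to feature index j of the input data, that each index inside a row is a position in the "label" list, and that a class score is the sum of feature value times weight. The function names are hypothetical.

```python
def load_model(lines):
    """Parse model text lines into (labels, nr_feature, weights).

    weights[j] is a dict {class position: weight} for the j-th row.
    """
    it = iter(lines)
    nr_class = int(next(it).split()[1])
    labels = next(it).split()[1:]
    nr_feature = int(next(it).split()[1])
    weights = []
    for line in it:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        # tokens[0] is the non-zero count; the rest are index:value pairs.
        row = {}
        for token in tokens[1:]:
            k, w = token.split(":")
            row[int(k)] = float(w)
        weights.append(row)
    assert len(labels) == nr_class
    return labels, nr_feature, weights


def score_sample(features, labels, weights):
    """Linear scores per label, assuming row j <-> feature index j."""
    totals = {}
    for j, x in features.items():
        if j >= len(weights):
            continue  # feature not present in the model
        for k, w in weights[j].items():
            totals[labels[k]] = totals.get(labels[k], 0.0) + x * w
    return totals


# Tiny made-up model with 3 classes and 3 feature rows.
toy = ["nr_class 3", "label 7 5 9", "nr_feature 3",
       "0 ", "2 0:0.5 2:-1.0", "1 1:2.0"]
toy_labels, toy_D, toy_w = load_model(toy)
print(score_sample({1: 2.0, 2: 1.0}, toy_labels, toy_w))
```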

4. Citation:
Ian En-Hsu Yen
PD-Sparse: A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification.
Ian E.H. Yen*, Xiangru Huang*, Kai Zhong, Pradeep Ravikumar and Inderjit S. Dhillon. (* equally contributed)
In International Conference on Machine Learning (ICML), 2016.