@@ -99,7 +99,7 @@ from the repository using the function
 For example, to download a dataset of gene expressions in mice brains::

     >>> from sklearn.datasets import fetch_openml
-    >>> mice = fetch_openml(name='miceprotein', version=4)
+    >>> mice = fetch_openml(name='miceprotein', version=4, parser="auto")

 To fully specify a dataset, you need to provide a name and a version, though
 the version is optional, see :ref:`openml_versions` below.
@@ -147,7 +147,7 @@ dataset on the openml website::

 The ``data_id`` also uniquely identifies a dataset from OpenML::

-    >>> mice = fetch_openml(data_id=40966)
+    >>> mice = fetch_openml(data_id=40966, parser="auto")
     >>> mice.details # doctest: +SKIP
     {'id': '4550', 'name': 'MiceProtein', 'version': '1', 'format': 'ARFF',
     'creator': ...,
@@ -171,8 +171,8 @@ which can contain entirely different datasets.
 If a particular version of a dataset has been found to contain significant
 issues, it might be deactivated. Using a name to specify a dataset will yield
 the earliest version of a dataset that is still active. That means that
-``fetch_openml(name="miceprotein")`` can yield different results at different
-times if earlier versions become inactive.
+``fetch_openml(name="miceprotein", parser="auto")`` can yield different results
+at different times if earlier versions become inactive.

 You can see that the dataset with ``data_id`` 40966 that we fetched above is
 the first version of the "miceprotein" dataset::
@@ -182,19 +182,19 @@ the first version of the "miceprotein" dataset::
 In fact, this dataset only has one version. The iris dataset on the other hand
 has multiple versions::

-    >>> iris = fetch_openml(name="iris")
+    >>> iris = fetch_openml(name="iris", parser="auto")
     >>> iris.details['version'] #doctest: +SKIP
     '1'
     >>> iris.details['id'] #doctest: +SKIP
     '61'

-    >>> iris_61 = fetch_openml(data_id=61)
+    >>> iris_61 = fetch_openml(data_id=61, parser="auto")
     >>> iris_61.details['version']
     '1'
     >>> iris_61.details['id']
     '61'

-    >>> iris_969 = fetch_openml(data_id=969)
+    >>> iris_969 = fetch_openml(data_id=969, parser="auto")
     >>> iris_969.details['version']
     '3'
     >>> iris_969.details['id']
@@ -212,7 +212,7 @@ binarized version of the data::
 You can also specify both the name and the version, which also uniquely
 identifies the dataset::

-    >>> iris_version_3 = fetch_openml(name="iris", version=3)
+    >>> iris_version_3 = fetch_openml(name="iris", version=3, parser="auto")
     >>> iris_version_3.details['version']
     '3'
     >>> iris_version_3.details['id']
@@ -225,6 +225,45 @@ identifies the dataset::
 machine learning" ACM SIGKDD Explorations Newsletter, 15(2), 49-60, 2014.
 <1407.7722>`

+.. _openml_parser:
+
+ARFF parser
+~~~~~~~~~~~
+
+From version 1.2, scikit-learn provides a new keyword argument `parser` with
+several options to parse the ARFF files provided by OpenML. The legacy parser
+(i.e. `parser="liac-arff"`) is based on the project
+`LIAC-ARFF <https://github.com/renatopp/liac-arff>`_. This parser is, however,
+slow and consumes more memory than required. A new parser based on pandas
+(i.e. `parser="pandas"`) is both faster and more memory efficient.
+However, this parser does not support sparse data.
+Therefore, we recommend using `parser="auto"`, which will use the best parser
+available for the requested dataset.
+
+The `"pandas"` and `"liac-arff"` parsers can lead to different data types in
+the output. The notable differences are the following:
+
+- The `"liac-arff"` parser always encodes categorical features as `str`
+  objects. By contrast, the `"pandas"` parser infers the type while reading,
+  and numerical categories will be cast to integers whenever possible.
+- The `"liac-arff"` parser uses float64 to encode numerical features tagged as
+  'REAL' and 'NUMERICAL' in the metadata. The `"pandas"` parser instead infers
+  whether these numerical features correspond to integers and uses pandas'
+  Integer extension dtype.
+- In particular, classification datasets with integer categories are typically
+  loaded as such `(0, 1, ...)` with the `"pandas"` parser while `"liac-arff"`
+  will force the use of string encoded class labels such as `"0"`, `"1"` and so
+  on.
+
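The dtype differences listed above can be illustrated with pandas alone, without downloading anything from OpenML; this is only a sketch, and the column names and values below are invented:

```python
import numpy as np
import pandas as pd

# What "liac-arff" effectively produces: numeric columns as float64,
# categorical values (including numeric-looking ones) as str objects.
liac_like = pd.DataFrame({
    "weight": np.array([5.1, 4.9], dtype=np.float64),
    "class": ["0", "1"],  # string-encoded class labels
})

# What "pandas" can produce: integer-valued numeric columns use the
# nullable Integer extension dtype, and integer categories stay integers.
pandas_like = pd.DataFrame({
    "weight": pd.array([5, 4], dtype="Int64"),  # pandas' Integer extension dtype
    "class": [0, 1],  # integer class labels
})

print(liac_like.dtypes)    # weight: float64, class: object
print(pandas_like.dtypes)  # weight: Int64, class: int64
```

Code written against one parser's output (e.g. comparing labels to `"1"` rather than `1`) can therefore break when switching parsers.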
+In addition, when `as_frame=False` is used, the `"liac-arff"` parser returns
+ordinally encoded data where the categories are provided in the attribute
+`categories` of the `Bunch` instance. Instead, `"pandas"` returns a NumPy
+array where the categories are not ordinally encoded. It is then up to the
+user to design a feature engineering pipeline with an instance of
+`OneHotEncoder` or `OrdinalEncoder`, typically wrapped in a
+`ColumnTransformer`, to preprocess the categorical columns explicitly. See for
+instance: :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`.
+
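Such a pipeline can be sketched as follows; the toy DataFrame and its column names are invented, and `OrdinalEncoder` stands in for whichever encoder suits the downstream model:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Toy frame standing in for a fetched OpenML dataset with mixed types;
# the column names and values are invented for illustration.
X = pd.DataFrame({
    "age": [25.0, 32.0, 47.0],
    "city": ["lisbon", "porto", "lisbon"],
})

# Encode the categorical column explicitly, pass the numeric one through.
preprocessor = ColumnTransformer(
    transformers=[("cat", OrdinalEncoder(), ["city"])],
    remainder="passthrough",
)
Xt = preprocessor.fit_transform(X)
print(Xt)  # column 0: encoded "city" (0.0 / 1.0), column 1: untouched "age"
```

Selecting the categorical columns by name in the `ColumnTransformer` keeps the preprocessing explicit regardless of which parser produced the data.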
 .. _external_datasets:

 Loading from external datasets