Commit

Expose slicing utility method in __init__.py
Add documentation for slicing.

PiperOrigin-RevId: 292452583
paulgc authored and tfx-copybara committed Jan 31, 2020
1 parent 70a08b4 commit e4adca1
Showing 2 changed files with 43 additions and 0 deletions.
40 changes: 40 additions & 0 deletions g3doc/get_started.md
@@ -436,3 +436,43 @@ with beam.Pipeline() as p:
coder=beam.coders.ProtoCoder(
statistics_pb2.DatasetFeatureStatisticsList)))
```

## Computing statistics over slices of data

TFDV can be configured to compute statistics over slices of data. To enable
slicing, provide slicing functions, each of which takes an Arrow table and
outputs a sequence of tuples of the form `(slice key, Arrow table)`. TFDV
provides an easy way to
[generate feature-value-based slicing functions](https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/utils/slicing_util.py#L47),
which can be passed in `tfdv.StatsOptions` when computing statistics.
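
To make the slicing-function contract concrete, here is a toy sketch in plain
Python. It is not TFDV's implementation: a dict of column lists stands in for
the Arrow table, and `make_value_slicer` and the `feature_value` key format are
hypothetical names chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Stand-in for an Arrow table (assumption): column name -> list of row values.
Table = Dict[str, List[object]]


def make_value_slicer(feature: str):
    """Returns a toy slicing function keyed on the values of one column."""

    def slice_fn(table: Table) -> Iterable[Tuple[str, Table]]:
        # Group row indices by the value of the slicing feature.
        rows_by_value = defaultdict(list)
        for row, value in enumerate(table[feature]):
            rows_by_value[value].append(row)
        # Emit one (slice key, sub-table) pair per distinct value.
        for value, rows in rows_by_value.items():
            sub_table = {
                name: [column[r] for r in rows]
                for name, column in table.items()
            }
            yield f"{feature}_{value}", sub_table

    return slice_fn


table = {"country": ["US", "IN", "US"], "age": [10, 50, 70]}
slices = dict(make_value_slicer("country")(table))
```

Each distinct value of the chosen feature yields one slice, and every slice
keeps all columns of the rows that belong to it.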

When slicing is enabled, the output
[DatasetFeatureStatisticsList](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/statistics.proto#L36)
proto contains multiple
[DatasetFeatureStatistics](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/statistics.proto#L41)
protos, one for each slice. Each slice is identified by a unique name, which is
set as the
[dataset name in the DatasetFeatureStatistics proto](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/statistics.proto#L43).
By default, TFDV computes statistics for the overall dataset in addition to the
configured slices.

```python
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import slicing_util

# Slice on country feature (i.e., every unique value of the feature).
slice_fn1 = slicing_util.get_feature_value_slicer(features={'country': None})

# Slice on the cross of country and state feature (i.e., every unique pair of
# values of the cross).
slice_fn2 = slicing_util.get_feature_value_slicer(
    features={'country': None, 'state': None})

# Slice on specific values of a feature.
slice_fn3 = slicing_util.get_feature_value_slicer(
    features={'age': [10, 50, 70]})

stats_options = tfdv.StatsOptions(
    slice_functions=[slice_fn1, slice_fn2, slice_fn3])

```
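
Because each slice's name is stored as the dataset name, per-slice results can
be looked up by that name. The sketch below assumes only the `datasets`,
`name`, and `num_examples` fields of the statistics protos. `index_by_slice` is
a hypothetical helper, and the stand-in objects and slice names are made up so
the snippet runs without TFDV installed.

```python
from types import SimpleNamespace


def index_by_slice(stats_list):
    """Maps each slice name to its per-slice statistics message."""
    return {dataset.name: dataset for dataset in stats_list.datasets}


# Stand-in for a DatasetFeatureStatisticsList (assumption): the slice names
# and example counts here are invented for illustration.
stats_list = SimpleNamespace(datasets=[
    SimpleNamespace(name="overall", num_examples=3),
    SimpleNamespace(name="country_US", num_examples=2),
])
by_slice = index_by_slice(stats_list)
```

The helper only reads attributes, so it should work the same way on a real
`DatasetFeatureStatisticsList` message.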
3 changes: 3 additions & 0 deletions tensorflow_data_validation/__init__.py
@@ -60,6 +60,9 @@
from tensorflow_data_validation.utils.schema_util import set_domain
from tensorflow_data_validation.utils.schema_util import write_schema_text

# Import slicing utilities.
from tensorflow_data_validation.utils.slicing_util import get_feature_value_slicer

# Import stats lib.
from tensorflow_data_validation.utils.stats_gen_lib import generate_statistics_from_csv
from tensorflow_data_validation.utils.stats_gen_lib import generate_statistics_from_dataframe