Feature consistency fixes (databrickslabs#182)
* fixed use of time strings to allow both 'seconds' and 'second' etc

* fixed use of time strings to allow both 'seconds' and 'second' etc

* id column fixes

Doc updates 100522 (databrickslabs#119)

* fixed reference to dbx in pull_request_template

* reverted inadvertently changed file

* release 0.2.1

* doc updates

* doc updates

* updates for building docs

* updated public docs

* updated sphinx version

* updated docs

* doc updates

* removed generated docs

* removed changes to non-doc

* reverted inadvertently changed file

* release 0.2.1

* doc updates

doc updates

* tidied up makefile

* added workflow action to update tag 'preview'

* develop branch updates

* revised unit tests to use parameterized approach

* changed utils tests to pytest

* changed changelog format

* changelog changes

* changelog updates from merge

* update to changelog as a result of merge of time fixes

* updates for test speed improvements

* updated tests

* updated tests

* updated tests

* updated tests

* fixed typo

* reverted pytest changes - separate feature

* reverted pytest changes - separate feature

* reverted pytest changes - separate feature

* reverted pytest changes - separate feature

* changed partitioning to run more efficiently on github runner

* changed partitioning to run more efficiently on github runner

* changed partitioning to run more efficiently on github runner

* changed partitioning to run more efficiently on github runner

* changed partitioning to run more efficiently on github runner

* use  as query name for spark instance

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* Updates to template text generation for better performance and repeatable text generation

* Updates to template text generation for better performance and repeatable text generation

* reverted unrelated changes

* added further coverage tests and renamed option from 'seedColumn' to 'seedColumnName' for clarity

* added further coverage test for 'seedColumnName' property

* additional test coverage

* updated tests for ILText generation

* updated tests for ILText generation

* merged changes from master

* change to test potential break in build process

* updated build process to explicitly use python 3.8

* added explicit python version setup to build

* changes to build actions

* reverted changes to master + build action changes

* remerged repeatable feature generation

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* changed table formatting in TemplateGenerator doc string

* changed table formatting in TemplateGenerator doc string

* updates from master

* updates to develop

* don't update coverage when pushing to develop

* Feature docs v34 (databrickslabs#197)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc changes only

* wip

* wip

* Update generating_column_data.rst

* Update generating_json_data.rst

* wip

* new document build

* adjusted comment banner at start of each doc file

* updated build

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* reverted code comment changes

* merge Feature ordering improvements2 into develop (databrickslabs#198)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* improved build ordering

* improved build ordering

* improved build ordering

* reverted unnecessary changes

* reverted unnecessary changes

* reverted inadvertent merge

* Revert "Merge branch 'develop' into feature_consistency_fixes"

This reverts commit e0efc4e, reversing
changes made to a263bd9.
ronanstokes-db authored Apr 7, 2023
1 parent 3a2e3a8 commit 6c4702d
Showing 17 changed files with 718 additions and 260 deletions.
14 changes: 12 additions & 2 deletions CHANGELOG.md
@@ -8,9 +8,19 @@ All notable changes to the Databricks Labs Data Generator will be documented in
#### Changed
* Fixed use of logger in _version.py and in spark_singleton.py
* Fixed template issues
* Document reformatting and updates
* Modified option to allow a range when specifying `numFeatures` with `structType='array'`, allowing generation
of a varying number of columns
* When generating multi-column or array-valued columns, compute the random seed using a different name for each column

#### Fixed
* Apply pandas optimizations when generating multiple columns using same `withColumn` or `withColumnSpec`

#### Added
* Added use of prospector to build process to validate common code issues
* Added top level `random` attribute to data generator specification constructor
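
As a minimal sketch (assuming a `spark` session is available), the new top-level `random` attribute sets the default `random` behaviour for any column spec that does not set it explicitly:

```python
import dbldatagen as dg

# columns are generated with random values by default ...
ds = (dg.DataGenerator(sparkSession=spark, name="example", rows=1000, random=True)
      .withColumn("code1", "integer", minValue=100, maxValue=200)
      # ... unless the column spec overrides the default explicitly
      .withColumn("code2", "integer", minValue=0, maxValue=10, random=False)
      )
df = ds.build()
```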



### Version 0.3.2

217 changes: 146 additions & 71 deletions dbldatagen/column_generation_spec.py

Large diffs are not rendered by default.

62 changes: 48 additions & 14 deletions dbldatagen/data_generator.py
@@ -13,7 +13,9 @@
from .spark_singleton import SparkSingleton
from .column_generation_spec import ColumnGenerationSpec
from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_FIXED, RANDOM_SEED_HASH_FIELD_NAME, \
DEFAULT_SEED_COLUMN, SPARK_RANGE_COLUMN, MIN_SPARK_VERSION
DEFAULT_SEED_COLUMN, SPARK_RANGE_COLUMN, MIN_SPARK_VERSION, \
OPTION_RANDOM, OPTION_RANDOM_SEED, OPTION_RANDOM_SEED_METHOD

from .utils import ensure, topologicalSort, DataGenError, deprecated, split_list_matching_condition
from . _version import _get_spark_version
from .schema_parser import SchemaParser
@@ -40,6 +42,7 @@ class DataGenerator:
:param batchSize: = UDF batch number of rows to pass via Apache Arrow to Pandas UDFs
:param debug: = if set to True, output debug level of information
:param seedColumnName: = if set, this should be the name of the `seed` or logical `id` column. Defaults to `id`
:param random: = if set, specifies default value of `random` attribute for all columns where not set
By default the seed column is named `id`. If you need to use this column name in your generated data,
it is recommended that you use a different name for the seed column - for example `_id`.
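For example, a minimal sketch (assumes a `spark` session is available):

.. code-block:: python

   ds = dg.DataGenerator(sparkSession=spark, rows=1000, seedColumnName="_id")
   # generated data can now define its own `id` column
   ds = ds.withColumn("id", "string", format="0x%013x", baseColumn="_id")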
@@ -63,6 +66,7 @@ class DataGenerator:
def __init__(self, sparkSession=None, name=None, randomSeedMethod=None,
rows=1000000, startingId=0, randomSeed=None, partitions=None, verbose=False,
batchSize=None, debug=False, seedColumnName=DEFAULT_SEED_COLUMN,
random=False,
**kwargs):
""" Constructor for data generator object """

@@ -119,6 +123,9 @@ def __init__(self, sparkSession=None, name=None, randomSeedMethod=None,

self._seedMethod = randomSeedMethod

# set default random setting
self._defaultRandom = random if random is not None else False

if randomSeed is None:
self._instanceRandomSeed = self._randomSeed

@@ -297,6 +304,13 @@ def randomSeed(self):
""" return the data generation spec random seed"""
return self._instanceRandomSeed

@property
def random(self):
""" return the data generation spec default random setting for columns to be used
when an explicit `random` attribute setting is not supplied
"""
return self._defaultRandom

def _markForPlanRegen(self):
"""Mark that build plan needs to be regenerated
@@ -591,13 +605,19 @@ def withColumnSpecs(self, patterns=None, fields=None, matchTypes=None, **kwargs)
:returns: modified in-place instance of test data generator allowing for chaining of calls following
Builder pattern
.. note::
matchTypes may also take SQL type strings or a list of SQL type strings such as "array<integer>"
You may also add a variety of options to further control the test data generation process.
For full list of options, see :doc:`/reference/api/dbldatagen.column_spec_options`.
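For example, a minimal sketch of matching by SQL type string (assumes an
existing generator spec `dgSpec`; `percentNulls` is one of the column spec options):

.. code-block:: python

   dgSpec = dgSpec.withColumnSpecs(matchTypes=["array<integer>", "string"],
                                   percentNulls=0.05)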
"""
if fields is not None and type(fields) is str:
fields = [fields]

if OPTION_RANDOM not in kwargs:
kwargs[OPTION_RANDOM] = self._defaultRandom

# add support for deprecated legacy names
if "match_types" in kwargs:
assert matchTypes is None, "Argument 'match_types' is deprecated, use 'matchTypes' instead"
@@ -620,7 +640,15 @@ def withColumnSpecs(self, patterns=None, fields=None, matchTypes=None, **kwargs)
effective_fields = [x for x in effective_fields for y in patterns if re.search(y, x) is not None]

if matchTypes is not None:
effective_fields = [x for x in effective_fields for y in matchTypes
effective_types = []

for typ in matchTypes:
if isinstance(typ, str):
effective_types.append(SchemaParser.columnTypeFromString(typ))
else:
effective_types.append(typ)

effective_fields = [x for x in effective_fields for y in effective_types
if self.getColumnType(x) == y]

for f in effective_fields:
@@ -648,7 +676,7 @@ def _checkColumnOrColumnList(self, columns, allowId=False):
return True

def withColumnSpec(self, colName, minValue=None, maxValue=None, step=1, prefix=None,
random=False, distribution=None,
random=None, distribution=None,
implicit=False, dataRange=None, omit=False, baseColumn=None, **kwargs):
""" add a column specification for an existing column
@@ -670,6 +698,9 @@ def withColumnSpec(self, colName, minValue=None, maxValue=None, step=1, prefix=N
Datatype parameter is only needed for `withColumn` and not permitted for `withColumnSpec`
""")

if random is None:
random = self._defaultRandom

# handle migration of old `min` and `max` options
if _OLD_MIN_OPTION in kwargs:
assert minValue is None, \
@@ -705,7 +736,7 @@ def hasColumnSpec(self, colName):
return colName in self._columnSpecsByName

def withColumn(self, colName, colType=StringType(), minValue=None, maxValue=None, step=1,
dataRange=None, prefix=None, random=False, distribution=None,
dataRange=None, prefix=None, random=None, distribution=None,
baseColumn=None, nullable=True,
omit=False, implicit=False, noWarn=False,
**kwargs):
@@ -756,6 +787,9 @@ def withColumn(self, colName, colType=StringType(), minValue=None, maxValue=None
maxValue = kwargs[_OLD_MAX_OPTION]
kwargs.pop(_OLD_MAX_OPTION, None)

if random is None:
random = self._defaultRandom

new_props = {}
new_props.update(kwargs)

@@ -792,25 +826,25 @@ def _generateColumnDefinition(self, colName, colType=None, baseColumn=None,
# if the column has the option `random` set to true
# then use the instance level random seed
# otherwise use the default random seed for the class
if "randomSeed" in new_props:
effective_random_seed = new_props["randomSeed"]
new_props.pop("randomSeed")
new_props["random"] = True
if OPTION_RANDOM_SEED in new_props:
effective_random_seed = new_props[OPTION_RANDOM_SEED]
new_props.pop(OPTION_RANDOM_SEED)
new_props[OPTION_RANDOM] = True

# if random seed has override but randomSeedMethod does not
# set it to fixed
if "randomSeedMethod" not in new_props:
new_props["randomSeedMethod"] = RANDOM_SEED_FIXED
if OPTION_RANDOM_SEED_METHOD not in new_props:
new_props[OPTION_RANDOM_SEED_METHOD] = RANDOM_SEED_FIXED

elif "random" in new_props and new_props["random"]:
elif OPTION_RANDOM in new_props and new_props[OPTION_RANDOM]:
effective_random_seed = self._instanceRandomSeed
else:
effective_random_seed = self._randomSeed

# handle column level override
if "randomSeedMethod" in new_props:
effective_random_seed_method = new_props["randomSeedMethod"]
new_props.pop("randomSeedMethod")
if OPTION_RANDOM_SEED_METHOD in new_props:
effective_random_seed_method = new_props[OPTION_RANDOM_SEED_METHOD]
new_props.pop(OPTION_RANDOM_SEED_METHOD)
else:
effective_random_seed_method = self._seedMethod

5 changes: 5 additions & 0 deletions dbldatagen/datagen_constants.py
@@ -36,3 +36,8 @@
# minimum versions for version checks
MIN_PYTHON_VERSION = (3, 8)
MIN_SPARK_VERSION = (3, 1, 2)

# options for random data generation
OPTION_RANDOM = "random"
OPTION_RANDOM_SEED_METHOD = "randomSeedMethod"
OPTION_RANDOM_SEED = "randomSeed"
1 change: 1 addition & 0 deletions dbldatagen/text_generator_plugins.py
@@ -375,6 +375,7 @@ def fakerText(mname, *args, _lib=None, _rootClass=None, **kwargs):
:param args: positional args to be passed to underlying Faker instance
:param _lib: internal only param - library to load
:param _rootClass: internal only param - root class to create
:returns: instance of PyfuncText for use with Faker
``fakerText("sentence")`` is same as ``FakerTextFactory()("sentence")``
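For example, a minimal sketch of use with a column spec (assumes the Faker
library is installed and a `spark` session is available):

.. code-block:: python

   import dbldatagen as dg
   from dbldatagen.text_generator_plugins import fakerText

   df = (dg.DataGenerator(sparkSession=spark, rows=100)
         .withColumn("full_name", "string", text=fakerText("name"))
         .build())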
36 changes: 18 additions & 18 deletions dbldatagen/text_generators.py
@@ -178,24 +178,24 @@ class TemplateGenerator(TextGenerator): # lgtm [py/missing-equals]
It uses the following special chars:
========== ======================================
Chars Meaning
========== ======================================
``\\`` Apply escape to next char.
v0,v1,..v9 Use base value as an array of values and substitute the `nth` element ( 0 .. 9). Always escaped.
x Insert a random lowercase hex digit
X Insert an uppercase random hex digit
d Insert a random lowercase decimal digit
D Insert an uppercase random decimal digit
a Insert a random lowercase alphabetical character
A Insert a random uppercase alphabetical character
k Insert a random lowercase alphanumeric character
K Insert a random uppercase alphanumeric character
n Insert a random number between 0 .. 255 inclusive. This option must always be escaped
N Insert a random number between 0 .. 65535 inclusive. This option must always be escaped
w Insert a random lowercase word from the ipsum lorem word set. Always escaped
W Insert a random uppercase word from the ipsum lorem word set. Always escaped
========== ======================================
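
For example, a minimal sketch using the template syntax above via the `template`
column option (assumes a `spark` session is available):

.. code-block:: python

   import dbldatagen as dg

   df = (dg.DataGenerator(sparkSession=spark, rows=1000)
         .withColumn("email", "string", template=r'\\w.\\w@\\w.com')
         .build())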
.. note::
If escape is used and `escapeSpecialChars` is False, then the following
29 changes: 18 additions & 11 deletions docs/source/APIDOCS.md
@@ -165,13 +165,13 @@ testDataSpec = (
numColumns=column_count,
)
.withColumn("code1", IntegerType(), minValue=100, maxValue=200)
.withColumn("code2", IntegerType(), minValue=0, maxValue=10, random=True)
.withColumn("code2", "integer", minValue=0, maxValue=10, random=True)
.withColumn("code3", StringType(), values=["online", "offline", "unknown"])
.withColumn(
"code4", StringType(), values=["a", "b", "c"], random=True, percentNulls=0.05
)
.withColumn(
"code5", StringType(), values=["a", "b", "c"], random=True, weights=[9, 1, 1]
"code5", "string", values=["a", "b", "c"], random=True, weights=[9, 1, 1]
)
)

@@ -193,7 +193,8 @@ column. Note this expression can refer to any preceding column including the `id
inclusive. These will be computed using modulo arithmetic on the `id` column.

- The `withColumn` method call for the `code2` column specifies the generation of values between 0 and 10
inclusive. These will be computed via a uniformly distributed random value.
inclusive. These will be computed via a uniformly distributed random value. Note that type strings can be used
in place of `IntegerType()`.

> By default all random values are uniformly distributed
> unless either the `weights` option is used or a specific distribution is used.
@@ -329,29 +330,29 @@ testDataSpec = (
.withIdOutput()
# we'll use hash of the base field to generate the ids to
# avoid a simple incrementing sequence
.withColumn("internal_device_id", LongType(), minValue=0x1000000000000,
.withColumn("internal_device_id", "long", minValue=0x1000000000000,
uniqueValues=device_population, omit=True, baseColumnType="hash",
)
# note for format strings, we must use "%lx" not "%x" as the
# underlying value is a long
.withColumn(
"device_id", StringType(), format="0x%013x", baseColumn="internal_device_id"
"device_id", "string", format="0x%013x", baseColumn="internal_device_id"
)
# the device / user attributes will be the same for the same device id
# so lets use the internal device id as the base column for these attribute
.withColumn("country", StringType(), values=country_codes, weights=country_weights,
.withColumn("country", "string", values=country_codes, weights=country_weights,
baseColumn="internal_device_id")
.withColumn("manufacturer", StringType(), values=manufacturers,
.withColumn("manufacturer", "string", values=manufacturers,
baseColumn="internal_device_id", )
# use omit = True if you don't want a column to appear in the final output
# but just want to use it as part of generation of another column
.withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
.withColumn("line", "string", values=lines, baseColumn="manufacturer",
baseColumnType="hash", omit=True )
.withColumn("model_ser", IntegerType(), minValue=1, maxValue=11, baseColumn="device_id",
.withColumn("model_ser", "integer", minValue=1, maxValue=11, baseColumn="device_id",
baseColumnType="hash", omit=True, )
.withColumn("model_line", StringType(), expr="concat(line, '#', model_ser)",
.withColumn("model_line", "string", expr="concat(line, '#', model_ser)",
baseColumn=["line", "model_ser"] )
.withColumn("event_type", StringType(),
.withColumn("event_type", "string",
values=["activation", "deactivation", "plan change", "telecoms activity",
"internet activity", "device error", ],
random=True)
@@ -379,6 +380,12 @@
- The `withColumn` method call for the `line` column introduces a temporary column for purposes of
generating other columns, but through the use of the `omit` option, omits it from the final data set.

> NOTE: Type strings can be used in place of instances of data type objects. Type strings use SQL data type syntax
> and can be used to specify basic types, numeric types such as "decimal(10,3)" as well as complex structured types
> such as "array<string>", "map<string, int>" and "struct<a:binary, b:int, c:float>".
>
> Type strings are case-insensitive.
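
As a minimal sketch (assumes a `spark` session), each type string below is equivalent to passing the corresponding datatype object instance:

```python
import dbldatagen as dg

df = (dg.DataGenerator(sparkSession=spark, rows=1000)
      .withColumn("price", "decimal(10,3)", minValue=1.0, maxValue=100.0)
      # complex types take their value from an `expr` attribute
      .withColumn("tags", "array<string>", expr="array('red', 'green')")
      .build())
```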
### Scaling it up

When generating data, the number of rows to be generated is controlled by the `rows` parameter supplied to the
26 changes: 26 additions & 0 deletions docs/source/generating_column_data.rst
@@ -90,11 +90,37 @@
Generating complex columns - structs, maps, arrays
--------------------------------------------------

Complex column types are supported - that is a column may have its type specified as an array, map or struct. This can
be specified in the datatype parameter to the `withColumn` method as a string such as "array<string>" or as a
composite of datatype object instances.

If the column type is based on a struct, map or array, then the `expr` attribute must be specified to provide a
value for the column.

If the `expr` attribute is not specified, then the default column value will be `NULL`.
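
For example, a minimal sketch of a struct-valued column populated via `expr` (assumes a `spark` session is available):

.. code-block:: python

   import dbldatagen as dg

   ds = (dg.DataGenerator(sparkSession=spark, rows=1000)
         .withColumn("a", "integer", minValue=1, maxValue=10)
         .withColumn("b", "string", values=["x", "y", "z"])
         .withColumn("pair", "struct<a:int, b:string>",
                     expr="named_struct('a', a, 'b', b)")
         )
   df = ds.build()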

For array valued columns, where all of the elements of the array are to be generated with the same column
specification, an alternative method is also supported.

You can specify that a column has a specific number of features with structType of 'array' to control the generation of
the column. In this case, the datatype should be the type of the individual element, not of the array.

For example, the following code will generate rows with varying numbers of synthetic emails for each customer:

.. code-block:: python

   import dbldatagen as dg

   ds = (
       dg.DataGenerator(sparkSession=spark, name="test_dataset1", rows=1000, partitions=4,
                        random=True)
       .withColumn("name", "string", percentNulls=0.01, template=r'\\w \\w|\\w A. \\w|test')
       .withColumn("emails", "string", template=r'\\w.\\w@\\w.com', random=True,
                   numFeatures=(1, 6), structType="array")
   )
   df = ds.build()

The mechanics of column data generation
---------------------------------------
The data set is generated when the ``build`` method is invoked on the data generation instance.
