Feature consistency fixes (databrickslabs#182)
* fixed use of time strings to allow both 'seconds' and 'second' etc

* fixed use of time strings to allow both 'seconds' and 'second' etc

* id column fixes

Doc updates 100522 (databrickslabs#119)

* fixed reference to dbx in pull_request_template

* reverted inadvertently changed file

* release 0.2.1

* doc updates

* doc updates

* updates for building docs

* updated public docs

* updated sphinx version

* updated docs

* doc updates

* removed generated docs

* removed changes to non-doc

* reverted inadvertently changed file

* release 0.2.1

* doc updates

doc updates

* tidied up makefile

* added workflow action to update tag 'preview'

* develop branch updates

* revised unit tests to use parameterized approach

* changed utils tests to pytest

* changed changelog format

* changelog changes

* changelog updates from merge

* update to changelog as a result of merge of time fixes

* updates for test speed improvements

* updated tests

* updated tests

* updated tests

* updated tests

* fixed typo

* reverted pytest changes - separate feature

* reverted pytest changes - separate feature

* reverted pytest changes - separate feature

* reverted pytest changes - separate feature

* changed partitioning to run more efficiently on github runner

* changed partitioning to run more efficiently on github runner

* changed partitioning to run more efficiently on github runner

* changed partitioning to run more efficiently on github runner

* changed partitioning to run more efficiently on github runner

* use  as query name for spark instance

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* Updates to template text generation for better performance and repeatable text generation

* Updates to template text generation for better performance and repeatable text generation

* reverted unrelated changes

* added further coverage tests and renamed option from 'seedColumn' to 'seedColumnName' for clarity

* added further coverage test for 'seedColumnName' property

* additional test coverage

* updated tests for ILText generation

* updated tests for ILText generation

* merged changes from master

* change to test potential break in build process

* updated build process to explicitly use python 3.8

* added explicit python version setup to build

* changes to build actions

* reverted changes to master + build action changes

* remerged repeatable feature generation

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* changed table formatting in TemplateGenerator doc string

* changed table formatting in TemplateGenerator doc string

* updates from master

* updates to develop

* don't update coverage when pushing to develop

* Feature docs v34 (databrickslabs#197)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc changes only

* wip

* wip

* Update generating_column_data.rst

* Update generating_json_data.rst

* wip

* new document build

* adjusted comment banner at start of each doc file

* updated build

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* reverted code comment changes

* merge Feature ordering improvements2 into develop (databrickslabs#198)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* improved build ordering

* improved build ordering

* improved build ordering

* reverted unnecessary changes

* reverted unnecessary changes

* reverted inadvertent merge

* Revert "Merge branch 'develop' into feature_consistency_fixes"

This reverts commit e0efc4e, reversing
changes made to a263bd9.
ronanstokes-db authored Apr 7, 2023
1 parent 3a2e3a8 commit 6c4702d
Showing 17 changed files with 718 additions and 260 deletions.
14 changes: 12 additions & 2 deletions CHANGELOG.md
@@ -8,9 +8,19 @@ All notable changes to the Databricks Labs Data Generator will be documented in
#### Changed
* Fixed use of logger in _version.py and in spark_singleton.py
* Fixed template issues
* Document reformatting and updates
* Modified option to allow a range when specifying `numFeatures` with `structType='array'`, allowing generation
of a varying number of columns
* When generating multi-column or array-valued columns, compute the random seed using a different name for each column

#### Fixed
* Apply pandas optimizations when generating multiple columns using same `withColumn` or `withColumnSpec`

#### Added
* Added use of prospector to build process to validate common code issues
* Added top level `random` attribute to data generator specification constructor
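
As a minimal sketch (assuming a `spark` session is available), the new top-level `random` attribute sets the default `random` behaviour for any column spec that does not set it explicitly:

```python
import dbldatagen as dg

# columns are generated with random values by default ...
ds = (dg.DataGenerator(sparkSession=spark, name="example", rows=1000, random=True)
      .withColumn("code1", "integer", minValue=100, maxValue=200)
      # ... unless the column spec overrides the default explicitly
      .withColumn("code2", "integer", minValue=0, maxValue=10, random=False)
      )
df = ds.build()
```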



### Version 0.3.2

217 changes: 146 additions & 71 deletions dbldatagen/column_generation_spec.py

Large diffs are not rendered by default.

62 changes: 48 additions & 14 deletions dbldatagen/data_generator.py
@@ -13,7 +13,9 @@
from .spark_singleton import SparkSingleton
from .column_generation_spec import ColumnGenerationSpec
from .datagen_constants import DEFAULT_RANDOM_SEED, RANDOM_SEED_FIXED, RANDOM_SEED_HASH_FIELD_NAME, \
DEFAULT_SEED_COLUMN, SPARK_RANGE_COLUMN, MIN_SPARK_VERSION
DEFAULT_SEED_COLUMN, SPARK_RANGE_COLUMN, MIN_SPARK_VERSION, \
OPTION_RANDOM, OPTION_RANDOM_SEED, OPTION_RANDOM_SEED_METHOD

from .utils import ensure, topologicalSort, DataGenError, deprecated, split_list_matching_condition
from . _version import _get_spark_version
from .schema_parser import SchemaParser
@@ -40,6 +42,7 @@ class DataGenerator:
:param batchSize: = UDF batch number of rows to pass via Apache Arrow to Pandas UDFs
:param debug: = if set to True, output debug level of information
:param seedColumnName: = if set, this should be the name of the `seed` or logical `id` column. Defaults to `id`
:param random: = if set, specifies default value of `random` attribute for all columns where not set
By default the seed column is named `id`. If you need to use this column name in your generated data,
it is recommended that you use a different name for the seed column - for example `_id`.
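For example, a minimal sketch (assumes a `spark` session is available):

.. code-block:: python

   ds = dg.DataGenerator(sparkSession=spark, rows=1000, seedColumnName="_id")
   # generated data can now define its own `id` column
   ds = ds.withColumn("id", "string", format="0x%013x", baseColumn="_id")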
@@ -63,6 +66,7 @@ class DataGenerator:
def __init__(self, sparkSession=None, name=None, randomSeedMethod=None,
rows=1000000, startingId=0, randomSeed=None, partitions=None, verbose=False,
batchSize=None, debug=False, seedColumnName=DEFAULT_SEED_COLUMN,
random=False,
**kwargs):
""" Constructor for data generator object """

@@ -119,6 +123,9 @@ def __init__(self, sparkSession=None, name=None, randomSeedMethod=None,

self._seedMethod = randomSeedMethod

# set default random setting
self._defaultRandom = random if random is not None else False

if randomSeed is None:
self._instanceRandomSeed = self._randomSeed

@@ -297,6 +304,13 @@ def randomSeed(self):
""" return the data generation spec random seed"""
return self._instanceRandomSeed

@property
def random(self):
""" return the data generation spec default random setting for columns to be used
when an explicit `random` attribute setting is not supplied
"""
return self._defaultRandom

def _markForPlanRegen(self):
"""Mark that build plan needs to be regenerated
@@ -591,13 +605,19 @@ def withColumnSpecs(self, patterns=None, fields=None, matchTypes=None, **kwargs)
:returns: modified in-place instance of test data generator allowing for chaining of calls following
Builder pattern
.. note::
matchTypes may also take SQL type strings or a list of SQL type strings such as "array<integer>"
You may also add a variety of options to further control the test data generation process.
For full list of options, see :doc:`/reference/api/dbldatagen.column_spec_options`.
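For example, a minimal sketch of matching by SQL type string (assumes an
existing generator spec `dgSpec`; `percentNulls` is one of the column spec options):

.. code-block:: python

   dgSpec = dgSpec.withColumnSpecs(matchTypes=["array<integer>", "string"],
                                   percentNulls=0.05)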
"""
if fields is not None and type(fields) is str:
fields = [fields]

if OPTION_RANDOM not in kwargs:
kwargs[OPTION_RANDOM] = self._defaultRandom

# add support for deprecated legacy names
if "match_types" in kwargs:
assert matchTypes is None, "Argument 'match_types' is deprecated, use 'matchTypes' instead"
@@ -620,7 +640,15 @@ def withColumnSpecs(self, patterns=None, fields=None, matchTypes=None, **kwargs)
effective_fields = [x for x in effective_fields for y in patterns if re.search(y, x) is not None]

if matchTypes is not None:
effective_fields = [x for x in effective_fields for y in matchTypes
effective_types = []

for typ in matchTypes:
if isinstance(typ, str):
effective_types.append(SchemaParser.columnTypeFromString(typ))
else:
effective_types.append(typ)

effective_fields = [x for x in effective_fields for y in effective_types
if self.getColumnType(x) == y]

for f in effective_fields:
@@ -648,7 +676,7 @@ def _checkColumnOrColumnList(self, columns, allowId=False):
return True

def withColumnSpec(self, colName, minValue=None, maxValue=None, step=1, prefix=None,
random=False, distribution=None,
random=None, distribution=None,
implicit=False, dataRange=None, omit=False, baseColumn=None, **kwargs):
""" add a column specification for an existing column
@@ -670,6 +698,9 @@ def withColumnSpec(self, colName, minValue=None, maxValue=None, step=1, prefix=N
Datatype parameter is only needed for `withColumn` and not permitted for `withColumnSpec`
""")

if random is None:
random = self._defaultRandom

# handle migration of old `min` and `max` options
if _OLD_MIN_OPTION in kwargs:
assert minValue is None, \
@@ -705,7 +736,7 @@ def hasColumnSpec(self, colName):
return colName in self._columnSpecsByName

def withColumn(self, colName, colType=StringType(), minValue=None, maxValue=None, step=1,
dataRange=None, prefix=None, random=False, distribution=None,
dataRange=None, prefix=None, random=None, distribution=None,
baseColumn=None, nullable=True,
omit=False, implicit=False, noWarn=False,
**kwargs):
@@ -756,6 +787,9 @@ def withColumn(self, colName, colType=StringType(), minValue=None, maxValue=None
maxValue = kwargs[_OLD_MAX_OPTION]
kwargs.pop(_OLD_MAX_OPTION, None)

if random is None:
random = self._defaultRandom

new_props = {}
new_props.update(kwargs)

@@ -792,25 +826,25 @@ def _generateColumnDefinition(self, colName, colType=None, baseColumn=None,
# if the column has the option `random` set to true
# then use the instance level random seed
# otherwise use the default random seed for the class
if "randomSeed" in new_props:
effective_random_seed = new_props["randomSeed"]
new_props.pop("randomSeed")
new_props["random"] = True
if OPTION_RANDOM_SEED in new_props:
effective_random_seed = new_props[OPTION_RANDOM_SEED]
new_props.pop(OPTION_RANDOM_SEED)
new_props[OPTION_RANDOM] = True

# if random seed has override but randomSeedMethod does not
# set it to fixed
if "randomSeedMethod" not in new_props:
new_props["randomSeedMethod"] = RANDOM_SEED_FIXED
if OPTION_RANDOM_SEED_METHOD not in new_props:
new_props[OPTION_RANDOM_SEED_METHOD] = RANDOM_SEED_FIXED

elif "random" in new_props and new_props["random"]:
elif OPTION_RANDOM in new_props and new_props[OPTION_RANDOM]:
effective_random_seed = self._instanceRandomSeed
else:
effective_random_seed = self._randomSeed

# handle column level override
if "randomSeedMethod" in new_props:
effective_random_seed_method = new_props["randomSeedMethod"]
new_props.pop("randomSeedMethod")
if OPTION_RANDOM_SEED_METHOD in new_props:
effective_random_seed_method = new_props[OPTION_RANDOM_SEED_METHOD]
new_props.pop(OPTION_RANDOM_SEED_METHOD)
else:
effective_random_seed_method = self._seedMethod

5 changes: 5 additions & 0 deletions dbldatagen/datagen_constants.py
@@ -36,3 +36,8 @@
# minimum versions for version checks
MIN_PYTHON_VERSION = (3, 8)
MIN_SPARK_VERSION = (3, 1, 2)

# options for random data generation
OPTION_RANDOM = "random"
OPTION_RANDOM_SEED_METHOD = "randomSeedMethod"
OPTION_RANDOM_SEED = "randomSeed"
1 change: 1 addition & 0 deletions dbldatagen/text_generator_plugins.py
@@ -375,6 +375,7 @@ def fakerText(mname, *args, _lib=None, _rootClass=None, **kwargs):
:param args: positional args to be passed to underlying Faker instance
:param _lib: internal only param - library to load
:param _rootClass: internal only param - root class to create
:returns: instance of PyfuncText for use with Faker
``fakerText("sentence")`` is same as ``FakerTextFactory()("sentence")``
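For example, a minimal sketch of use with a column spec (assumes the Faker
library is installed and a `spark` session is available):

.. code-block:: python

   import dbldatagen as dg
   from dbldatagen.text_generator_plugins import fakerText

   df = (dg.DataGenerator(sparkSession=spark, rows=100)
         .withColumn("full_name", "string", text=fakerText("name"))
         .build())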
36 changes: 18 additions & 18 deletions dbldatagen/text_generators.py
@@ -178,24 +178,24 @@ class TemplateGenerator(TextGenerator): # lgtm [py/missing-equals]
It uses the following special chars:
========== ======================================
Chars Meaning
========== ======================================
``\\`` Apply escape to next char.
v0,v1,..v9 Use base value as an array of values and substitute the `nth` element ( 0 .. 9). Always escaped.
x Insert a random lowercase hex digit
X Insert an uppercase random hex digit
d Insert a random lowercase decimal digit
D Insert an uppercase random decimal digit
a Insert a random lowercase alphabetical character
A Insert a random uppercase alphabetical character
k Insert a random lowercase alphanumeric character
K Insert a random uppercase alphanumeric character
n Insert a random number between 0 .. 255 inclusive. This option must always be escaped
N Insert a random number between 0 .. 65535 inclusive. This option must always be escaped
w Insert a random lowercase word from the ipsum lorem word set. Always escaped
W Insert a random uppercase word from the ipsum lorem word set. Always escaped
========== ======================================
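
For example, a minimal sketch using the template syntax above via the `template`
column option (assumes a `spark` session is available):

.. code-block:: python

   import dbldatagen as dg

   df = (dg.DataGenerator(sparkSession=spark, rows=1000)
         .withColumn("email", "string", template=r'\\w.\\w@\\w.com')
         .build())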
.. note::
If escape is used and `escapeSpecialChars` is False, then the following
29 changes: 18 additions & 11 deletions docs/source/APIDOCS.md
@@ -165,13 +165,13 @@ testDataSpec = (
numColumns=column_count,
)
.withColumn("code1", IntegerType(), minValue=100, maxValue=200)
.withColumn("code2", IntegerType(), minValue=0, maxValue=10, random=True)
.withColumn("code2", "integer", minValue=0, maxValue=10, random=True)
.withColumn("code3", StringType(), values=["online", "offline", "unknown"])
.withColumn(
"code4", StringType(), values=["a", "b", "c"], random=True, percentNulls=0.05
)
.withColumn(
"code5", StringType(), values=["a", "b", "c"], random=True, weights=[9, 1, 1]
"code5", "string", values=["a", "b", "c"], random=True, weights=[9, 1, 1]
)
)

@@ -193,7 +193,8 @@ column. Note this expression can refer to any preceding column including the `id
inclusive. These will be computed using modulo arithmetic on the `id` column.

- The `withColumn` method call for the `code2` column specifies the generation of values between 0 and 10
inclusive. These will be computed via a uniformly distributed random value.
inclusive. These will be computed via a uniformly distributed random value. Note that type strings can be used
in place of `IntegerType()`.

> By default all random values are uniformly distributed
> unless either the `weights` option is used or a specific distribution is used.
@@ -329,29 +330,29 @@ testDataSpec = (
.withIdOutput()
# we'll use hash of the base field to generate the ids to
# avoid a simple incrementing sequence
.withColumn("internal_device_id", LongType(), minValue=0x1000000000000,
.withColumn("internal_device_id", "long", minValue=0x1000000000000,
uniqueValues=device_population, omit=True, baseColumnType="hash",
)
# note for format strings, we must use "%lx" not "%x" as the
# underlying value is a long
.withColumn(
"device_id", StringType(), format="0x%013x", baseColumn="internal_device_id"
"device_id", "string", format="0x%013x", baseColumn="internal_device_id"
)
# the device / user attributes will be the same for the same device id
# so lets use the internal device id as the base column for these attribute
.withColumn("country", StringType(), values=country_codes, weights=country_weights,
.withColumn("country", "string", values=country_codes, weights=country_weights,
baseColumn="internal_device_id")
.withColumn("manufacturer", StringType(), values=manufacturers,
.withColumn("manufacturer", "string", values=manufacturers,
baseColumn="internal_device_id", )
# use omit = True if you don't want a column to appear in the final output
# but just want to use it as part of generation of another column
.withColumn("line", StringType(), values=lines, baseColumn="manufacturer",
.withColumn("line", "string", values=lines, baseColumn="manufacturer",
baseColumnType="hash", omit=True )
.withColumn("model_ser", IntegerType(), minValue=1, maxValue=11, baseColumn="device_id",
.withColumn("model_ser", "integer", minValue=1, maxValue=11, baseColumn="device_id",
baseColumnType="hash", omit=True, )
.withColumn("model_line", StringType(), expr="concat(line, '#', model_ser)",
.withColumn("model_line", "string", expr="concat(line, '#', model_ser)",
baseColumn=["line", "model_ser"] )
.withColumn("event_type", StringType(),
.withColumn("event_type", "string",
values=["activation", "deactivation", "plan change", "telecoms activity",
"internet activity", "device error", ],
random=True)
@@ -379,6 +380,12 @@
- The `withColumn` method call for the `line` column introduces a temporary column for purposes of
generating other columns, but through the use of the `omit` option, omits it from the final data set.

> NOTE: Type strings can be used in place of instances of data type objects. Type strings use SQL data type syntax
> and can be used to specify basic types, numeric types such as "decimal(10,3)" as well as complex structured types
> such as "array<string>", "map<string, int>" and "struct<a:binary, b:int, c:float>".
>
> Type strings are case-insensitive.
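
As a minimal sketch (assumes a `spark` session), each type string below is equivalent to passing the corresponding datatype object instance:

```python
import dbldatagen as dg

df = (dg.DataGenerator(sparkSession=spark, rows=1000)
      .withColumn("price", "decimal(10,3)", minValue=1.0, maxValue=100.0)
      # complex types take their value from an `expr` attribute
      .withColumn("tags", "array<string>", expr="array('red', 'green')")
      .build())
```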
### Scaling it up

When generating data, the number of rows to be generated is controlled by the `rows` parameter supplied to the
26 changes: 26 additions & 0 deletions docs/source/generating_column_data.rst
@@ -90,11 +90,37 @@
Generating complex columns - structs, maps, arrays
--------------------------------------------------

Complex column types are supported - that is a column may have its type specified as an array, map or struct. This can
be specified in the datatype parameter to the `withColumn` method as a string such as "array<string>" or as a
composite of datatype object instances.

If the column type is based on a struct, map or array, then the `expr` attribute must be specified to provide a
value for the column.

If the `expr` attribute is not specified, then the default column value will be `NULL`.
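
For example, a minimal sketch of a struct-valued column populated via `expr` (assumes a `spark` session is available):

.. code-block:: python

   import dbldatagen as dg

   ds = (dg.DataGenerator(sparkSession=spark, rows=1000)
         .withColumn("a", "integer", minValue=1, maxValue=10)
         .withColumn("b", "string", values=["x", "y", "z"])
         .withColumn("pair", "struct<a:int, b:string>",
                     expr="named_struct('a', a, 'b', b)")
         )
   df = ds.build()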

For array valued columns, where all of the elements of the array are to be generated with the same column
specification, an alternative method is also supported.

You can specify that a column has a specific number of features with structType of 'array' to control the generation of
the column. In this case, the datatype should be the type of the individual element, not of the array.

For example, the following code will generate rows with varying numbers of synthetic emails for each customer:

.. code-block:: python

   import dbldatagen as dg

   ds = (
       dg.DataGenerator(sparkSession=spark, name="test_dataset1", rows=1000, partitions=4,
                        random=True)
       .withColumn("name", "string", percentNulls=0.01, template=r'\\w \\w|\\w A. \\w|test')
       .withColumn("emails", "string", template=r'\\w.\\w@\\w.com', random=True,
                   numFeatures=(1, 6), structType="array")
   )
   df = ds.build()

The mechanics of column data generation
---------------------------------------
The data set is generated when the ``build`` method is invoked on the data generation instance.
