
Commit

Fix CSV input
minimaxir committed May 5, 2019
1 parent 583fdb0 commit 213fff8
Showing 3 changed files with 4 additions and 3 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -93,6 +93,7 @@ The method GPT-2 uses to generate text is slightly different than those like oth
* GPT-2 cannot stop early upon reaching a specific end token. (workaround: pass the `truncate` parameter to a `generate` function to only collect text until a specified end token. You may want to reduce `length` appropriately.)
* Higher temperatures work better (e.g. 0.7 - 1.0) to generate more interesting text, while other frameworks work better between 0.2 - 0.5.
* When finetuning GPT-2, it has no sense of the beginning or end of a document within a larger text. You'll need to use a bespoke character sequence to indicate the beginning and end of a document. Then while generating, you can specify a `prefix` targeting the beginning token sequences, and a `truncate` targeting the end token sequence. You can also set `include_prefix=False` to discard the prefix token while generating (e.g. if it's something unwanted like `<|startoftext|>`).
* If you pass a single-column `.csv` file to `finetune()`, it will automatically parse the CSV into a format ideal for training with GPT-2 (prepending `<|startoftext|>` and appending `<|endoftext|>` to every text document, so the `truncate` trick above is helpful when generating output; see the sketch after this list). Going through a proper CSV parser is what allows quotes and newlines within each text document to be handled correctly.
* GPT-2 allows you to generate texts in parallel by setting a `batch_size` that is divisible into `nsamples`, resulting in much faster generation. Works very well with a GPU (can set `batch_size` up to 20 on Colaboratory's K80)!
* Due to GPT-2's architecture, it scales up nicely with more powerful GPUs. For the 117M model, if you want to train for longer periods of time, GCP's P100 GPU is about 3x faster than a K80/T4 for only 3x the price, making it price-comparable (the V100 is about 1.5x faster than the P100 but about 2x the price). The P100 uses 100% of the GPU even with `batch_size=1`, and about 88% of the V100 GPU.
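
The tips above map onto a short end-to-end workflow. The sketch below is illustrative only: it assumes the `gpt_2_simple` entry points documented elsewhere in this README (`download_gpt2`, `start_tf_sess`, `finetune`, `generate`) together with the parameters named in the list above, and the CSV filename and step count are placeholders.

```python
import gpt_2_simple as gpt2

gpt2.download_gpt2()  # fetch the default 117M model once, if not already present

sess = gpt2.start_tf_sess()

# Single-column CSV: each row becomes one document, wrapped in
# <|startoftext|> ... <|endoftext|> during finetuning.
gpt2.finetune(sess, 'texts.csv', steps=1000)

# Parallel generation: batch_size must divide evenly into nsamples.
# prefix/truncate/include_prefix trim the bespoke document delimiters.
gpt2.generate(sess,
              nsamples=20,
              batch_size=20,
              temperature=0.8,
              prefix='<|startoftext|>',
              truncate='<|endoftext|>',
              include_prefix=False)
```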

@@ -114,6 +115,8 @@ Note: this project is intended to have a very tight scope unless demand dictates

Max Woolf ([@minimaxir](https://minimaxir.com))

*Max's open-source projects are supported by his [Patreon](https://www.patreon.com/minimaxir). If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.*

## License

MIT
2 changes: 0 additions & 2 deletions gpt_2_simple/src/load_dataset.py
@@ -37,8 +37,6 @@ def load_dataset(enc, path, combine):
                reader = csv.reader(fp)
                for row in reader:
                    raw_text += start_token + row[0] + end_token + "\n"
            tokens = np.stack(enc.encode(raw_text))
            token_chunks.append(tokens)
        else:
            # Plain text
            with open(path, 'r', encoding='utf8', errors='ignore') as fp:
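
For reference, the CSV branch shown above expects a single-column file where each row holds one complete document (`row[0]`); the `csv` module's quoting is what lets quotes and embedded newlines survive inside a document. A minimal standalone sketch of that format (the `texts.csv` filename is a placeholder, not part of this commit):

```python
import csv

documents = [
    'First document, with "quotes" inside.',
    'Second document,\nspanning two lines.',
]

# Write a single-column CSV: one document per row; the csv module adds
# the quoting needed for embedded quotes and newlines.
with open('texts.csv', 'w', encoding='utf8', newline='') as fp:
    writer = csv.writer(fp)
    for doc in documents:
        writer.writerow([doc])

# Reading it back mirrors the loop above: row[0] is the whole document.
with open('texts.csv', 'r', encoding='utf8', errors='ignore') as fp:
    for row in csv.reader(fp):
        print("<|startoftext|>" + row[0] + "<|endoftext|>")
```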
2 changes: 1 addition & 1 deletion setup.py
@@ -47,7 +47,7 @@
setup(
    name='gpt_2_simple',
    packages=['gpt_2_simple'],  # this must be the same as the name above
    version='0.4',
    version='0.4.1',
    description="Python package to easily retrain OpenAI's GPT-2 " \
                "text-generating model on new texts.",
    long_description=long_description,
