Commit

fix: just read from watermark file
zanussbaum committed Mar 27, 2023
1 parent b369e5a commit 1a95f68
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions data.py
@@ -70,8 +70,7 @@ def load_data(config, tokenizer):
     else:
         dataset = load_dataset(dataset_path)
 
-    uuids = dataset.filter(lambda x: x["source"] == "nomic")
-    dataset = dataset.filter(lambda x: x["source"] != "nomic")
+    uuids = load_dataset("json", data_files="watermark.jsonl", split="train")
     dataset = dataset.train_test_split(test_size=.05, seed=config["seed"])
 
     train_dataset, val_dataset = dataset["train"], dataset["test"]
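
The added line reads `watermark.jsonl` as JSON Lines, where each line is expected to hold one JSON object. The following is a minimal stdlib sketch of that parsing, not the `datasets` implementation the commit actually calls. The `read_jsonl` helper and the sample rows (including the `uuid` field) are hypothetical, since the commit does not show the file's schema.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


# Hypothetical rows: the real watermark.jsonl schema is not shown in the commit.
sample_rows = [
    {"uuid": "a1", "source": "nomic"},
    {"uuid": "b2", "source": "nomic"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "watermark.jsonl"
    # Write one JSON object per line, then read the file back.
    path.write_text(
        "\n".join(json.dumps(row) for row in sample_rows) + "\n",
        encoding="utf-8",
    )
    uuids = read_jsonl(path)

print(len(uuids))
```

Note that the commit also drops the `source != "nomic"` filter, so after this change `dataset` goes into `train_test_split` exactly as loaded.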