Here, we provide our pipeline for generating the KodCode dataset.
To generate synthetic questions, first put the seed questions/snippets/docs in the ../seeds folder.
Then, run the following command to generate questions. Available modes are leetcode, algorithm, data_structure, package, apps, codeforces, code_contests, taco, and docs.
python step1.1_gen_questions.py --total_prompts [total_prompts] --mode [mode]
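As a minimal sketch of how the command above could be assembled and checked before launching it, the helper below validates the mode against the list given in this README. The function name `build_gen_questions_cmd` and the positivity check on `total_prompts` are assumptions made for illustration; they are not part of the KodCode scripts.

```python
import shlex

# Modes listed in this README for step1.1_gen_questions.py.
AVAILABLE_MODES = (
    "leetcode", "algorithm", "data_structure", "package", "apps",
    "codeforces", "code_contests", "taco", "docs",
)


def build_gen_questions_cmd(total_prompts: int, mode: str) -> list[str]:
    """Hypothetical helper: build the argv for the question-generation step."""
    if mode not in AVAILABLE_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {AVAILABLE_MODES}")
    # Assumption: a non-positive prompt count is never meaningful.
    if total_prompts <= 0:
        raise ValueError("total_prompts must be positive")
    return [
        "python", "step1.1_gen_questions.py",
        "--total_prompts", str(total_prompts),
        "--mode", mode,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_gen_questions_cmd(100, "leetcode")))
```

Building the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.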
We then call the GPT-4o API to generate instructions for each question.
To run this step, use the following command.
python step1.3_proccess_and_sanitize.py --input_file [file_name]
After you get the filtered instructions, you can run the following command to generate solutions and tests.
bash step2.1_gpt_completion.sh [file_name]
This step generates unit tests for each solution. The input folder contains the trials of solutions and tests; in our experiments, we use 10 trials for each solution.
python step2.2_gen_unit_tests.py --input_folder [folder_name]
A folder whose name starts with unit_test_ will be generated, containing the unit tests for each solution.
This step will run all the tests and generate the results.
bash step2.3_run_all_tests.sh [unit_test_folder_name]
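To illustrate what running a single test amounts to, here is a rough sketch that executes one Python test file in a subprocess and reports pass or fail from the exit code. This is only an illustration: `run_test_file` and its timeout default are hypothetical, and the actual step2.3_run_all_tests.sh script may use a different runner and different result format.

```python
import subprocess
import sys


def run_test_file(path: str, timeout_s: float = 10.0) -> bool:
    """Hypothetical helper: return True if the test file exits with code 0."""
    try:
        result = subprocess.run(
            [sys.executable, path],   # run with the current interpreter
            capture_output=True,      # keep test output off the console
            timeout=timeout_s,        # treat hangs as failures
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Running each test in its own process isolates crashes and infinite loops in generated code from the driver.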
This step will generate verified triplets for each solution.
python step2.4_gen_verified_triplets.py --unit_test_folder [unit_test_folder_name]
After this step, you will get the verified question-solution-test triplets.
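The selection of verified triplets can be pictured with the sketch below: among the trials for a question, keep one whose solution passed its tests. The `Trial` record, its field names, and the "first passing trial wins" rule are assumptions for illustration only; the logic in step2.4_gen_verified_triplets.py may differ.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical record for one solution/test attempt at a question."""
    question: str
    solution: str
    test: str
    passed: bool  # whether the solution passed its unit tests


def select_verified_triplets(trials: list[Trial]) -> list[tuple[str, str, str]]:
    """Keep the first passing (question, solution, test) triplet per question."""
    verified: dict[str, tuple[str, str, str]] = {}
    for t in trials:
        # Assumption: one verified triplet per question is enough.
        if t.passed and t.question not in verified:
            verified[t.question] = (t.question, t.solution, t.test)
    return list(verified.values())
```

Questions with no passing trial simply produce no triplet in this sketch.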