ProtoCode: Large Language Model pipeline for enhanced human and machine interpretable protocol resource construction
ProtoCode standardizes protocols from either raw input text or a URL to the literature. To initiate protocol standardization, users submit the URL of the article describing the protocol in natural language and choose the protocol to extract from a dropdown menu. Upon submission, ProtoCode first performs a screening analysis to identify the text region corresponding to the protocol; users can correct any mis-annotation by highlighting the region of interest. Next, ProtoCode performs data extraction with a protocol-specific fine-tuned LLM. The extracted data, in JSONL format, can then be converted into standardized natural-language text. Moreover, if the extracted protocol contains information on equipment settings and/or programs, users can select outputs for operating experimental equipment.
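The JSONL-to-text conversion described above can be sketched as below. The field names (`action`, `temperature_c`, `duration_min`) are illustrative assumptions, not ProtoCode's actual schema:

```python
import json

def jsonl_to_text(jsonl_lines):
    """Render protocol steps stored as JSONL into numbered natural-language text.

    Field names here (action, temperature_c, duration_min) are illustrative
    assumptions, not ProtoCode's actual schema.
    """
    sentences = []
    for i, line in enumerate(jsonl_lines, start=1):
        step = json.loads(line)
        parts = [step["action"]]
        if "temperature_c" in step:
            parts.append(f"at {step['temperature_c']} C")
        if "duration_min" in step:
            parts.append(f"for {step['duration_min']} min")
        sentences.append(f"Step {i}: " + " ".join(parts) + ".")
    return "\n".join(sentences)

# Example: two extracted steps
lines = [
    '{"action": "Denature the sample", "temperature_c": 95, "duration_min": 5}',
    '{"action": "Anneal primers", "temperature_c": 55, "duration_min": 1}',
]
print(jsonl_to_text(lines))
```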
- Data source layer: collects data from different sources.
- Protocol extraction layer: retrieves the content from article URLs, which is particularly helpful for reducing the number of input tokens passed to the model.
- LLM layer: processes the input protocols, ensuring the model's relevance and accuracy through fine-tuning and cross-validation.
- Application layer: demonstrates the practical capabilities of ProtoCode across different functions.
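The four layers above can be read as a simple pipeline. A minimal sketch, in which every function is a hypothetical stub standing in for the real ProtoCode component:

```python
# Stub stand-ins for the real components, so the sketch runs end to end.
def fetch_article(url):
    # Data source layer: would download the article text
    return f"full text fetched from {url}"

def extract_protocol(text):
    # Protocol extraction layer: would screen for the protocol region
    return text

def finetuned_llm(protocol_text, model_num):
    # LLM layer: would call one of the five fine-tuned models
    return {"model": model_num, "steps": [protocol_text]}

def to_robot_language(structured):
    # Application layer: would emit equipment-operating output
    return f"MODEL {structured['model']}: {len(structured['steps'])} step(s)"

def run_protocode(url, model_num=1):
    """Chain the four layers: data source -> extraction -> LLM -> application."""
    raw_text = fetch_article(url)
    protocol_text = extract_protocol(raw_text)
    structured = finetuned_llm(protocol_text, model_num)
    return to_robot_language(structured)
```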
- ProtoCode Framework: Large Language Model for enhanced human and machine interpretable protocol resource construction
Ensure you have the required dependencies installed by running:
pip install -r requirements.txt
Make sure you have Conda installed on your system before proceeding. If you don't have Conda installed, you can download it from the official Conda website: https://docs.conda.io/en/latest/miniconda.html
conda env create -f environment.yml
conda activate name_of_your_environment  # use the environment name defined in environment.yml
ProtoCode_Content_Extraction extracts and saves content from a URL:
- Select a bioRxiv URL (full text) for the paper.
- Paste the URL into the url field of the [input_link] section in the config file.
- The result will be saved to the path given by out_path.
- The number of keywords selected determines the quality of the extracted protocol.
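One way to read the keyword note above: candidate paragraphs can be ranked by how many of the chosen keywords they contain, so well-chosen keywords narrow the extraction to the protocol text. A minimal sketch, not ProtoCode's actual scoring:

```python
def rank_paragraphs(paragraphs, keywords):
    """Score each paragraph by how many keywords it mentions (case-insensitive).

    Illustrative only; ProtoCode's real screening step may differ.
    """
    scored = []
    for p in paragraphs:
        low = p.lower()
        score = sum(1 for k in keywords if k.lower() in low)
        scored.append((score, p))
    # Highest-scoring paragraphs are the most likely protocol text
    return [p for score, p in sorted(scored, key=lambda sp: sp[0], reverse=True) if score > 0]
```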
content_config is the config file for paper extraction:
- URL for the article - url
- Output path - out_path
- Number of keywords - num_keywords # default is 4

[input_link]
url =

[output_path]
out_path = ./protocols/protocol.csv

[num_keywords]
num_keywords =
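Since content_config uses INI-style sections, it can be read with Python's standard `configparser`. The snippet below parses an in-memory copy (with illustrative values filled in) to show how the fields map to code:

```python
import configparser

# In-memory copy of content_config, with illustrative values filled in
CONFIG_TEXT = """
[input_link]
url = https://www.biorxiv.org/content/example-full-text

[output_path]
out_path = ./protocols/protocol.csv

[num_keywords]
num_keywords = 4
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

url = config["input_link"]["url"]
out_path = config["output_path"]["out_path"]
num_keywords = config.getint("num_keywords", "num_keywords")  # default is 4
```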
ProtoCode_Application provides functions to:
- read extracted content or custom input
- obtain output from the fine-tuned model
- convert the output to robot language
- Use the extracted protocol from [input_file] as the input.
- If the quality is not satisfactory, manually extract the content and paste it into [input_content].
- Choose a number between 1 and 5 to select one of the five fine-tuned models.
- The result will be saved to output_path.
application_config is the config file for robot language generation:
- Use extracted content - input_path
- Or customized content - content
- API key from OpenAI - key
- Choice of fine-tuned model - model_num
- Output path - output_path

[input_file]
input_path = ./protocols/protocol.csv

[input_content]
content =

[openai_key]
key =

[model_selection]
model_num = 1

[output_path]
output_path = ./output_language/
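application_config follows the same INI layout, so it can be loaded the same way. The sketch below also validates the 1-5 model choice; the priority rule (custom [input_content] overrides the extracted file) is an assumption inferred from the steps above, and the key value is a placeholder:

```python
import configparser

# In-memory copy of application_config; key is a placeholder, not a real API key
CONFIG_TEXT = """
[input_file]
input_path = ./protocols/protocol.csv

[input_content]
content =

[openai_key]
key = sk-placeholder

[model_selection]
model_num = 1

[output_path]
output_path = ./output_language/
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

model_num = config.getint("model_selection", "model_num")
if not 1 <= model_num <= 5:
    raise ValueError("model_num must be between 1 and 5")

# Assumption: custom content in [input_content] takes priority over the extracted file
content = config["input_content"]["content"] or None
source = content if content else config["input_file"]["input_path"]
```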
- The output file will record errors if they occur at any step.
@article{protocode2023,
  title={ProtoCode: Large Language Model pipeline for enhanced human and machine interpretable protocol resource construction},
  author={Shuo Jiang and Daniel Evans-Yamamoto and Dennis Bersenev and Sucheendra K. Palaniappan and Ayako Yachie-Kinoshita},
  journal={},
  year={2023}
}