
Commit

readme
smith-nathanh committed Nov 25, 2024
1 parent 30ada94 commit 6e53db6
Showing 2 changed files with 4 additions and 2 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -1,6 +1,6 @@
# autoSWE

- autoSWE is a software engineering automation tool designed to produce a fully-functioning code repository based on input Produce Requirements Document (PRD). It produces the artifacts measured in [DevBench ](https://github.com/open-compass/DevBench), a software engineering benchmark designed to test the efficacy of LLM-based code generation systems.
+ autoSWE is a software engineering automation tool designed to produce a fully-functioning code repository based on a Product Requirements Document (PRD). It produces the artifacts measured in [DevBench](https://github.com/open-compass/DevBench), a software engineering benchmark designed to test the efficacy of LLM-based code generation systems.

For more detailed information on the features, tools, and processes used in autoSWE, please refer to the [system/README.md](system/README.md) document.

4 changes: 3 additions & 1 deletion system/README.md
@@ -1,6 +1,6 @@
# autoSWE

- AutoSWE is a system for producing entire code repositories from a PRD.md file. It is designed to produce the artifacts necessary for the [DevBench ](https://github.com/open-compass/DevBench) benchmark designed to evaluate the effectiveness of LLM-based code generation. It differs from typical evaluations of LLM-coding systems in that it is designed to evaluate the entire software engineering process, not just bug fixes or code completion.
+ autoSWE is a system for producing entire code repositories from a PRD.md file. It is designed to produce the artifacts necessary for the [DevBench](https://github.com/open-compass/DevBench) benchmark, which was established to evaluate the effectiveness of LLM-based code generation. It differs from typical evaluations of LLM-coding systems in that it evaluates the entire software engineering process, not just bug fixes or code completion.

DevBench has five evaluation tasks:

@@ -13,6 +13,8 @@ DevBench has five evaluation tasks:

We have implemented a system that can automatically generate the artifacts for these tasks. The system uses LangGraph to orchestrate control flow, and the artifacts are accumulated in a `state` object. We use Pydantic to validate the structured outputs of the LLMs for each task, such as requesting dictionaries with specific keys. The system also checks for and installs any dependencies needed to run the code it generates.

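As a rough illustration of this pattern (the `RepoPlan` model and its fields are hypothetical stand-ins, not autoSWE's actual schema), a Pydantic model can be bound to a LangChain chat model so the LLM's response is parsed and validated into a dictionary with the expected keys:

```python
# Illustrative sketch only; the schema below is hypothetical, not autoSWE's.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI


class RepoPlan(BaseModel):
    """Structured output the LLM must return: a mapping of file paths to descriptions."""
    files: dict[str, str] = Field(
        description="Relative file paths mapped to one-line descriptions of their contents"
    )


llm = ChatOpenAI(model="gpt-4o", temperature=0)
structured_llm = llm.with_structured_output(RepoPlan)  # responses are validated against RepoPlan

plan = structured_llm.invoke("Propose a file layout for the repository described in PRD.md: ...")
print(plan.files)  # dict with the specific keys/values the schema requires
```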
+ We use LangChain, LangGraph, and LangSmith to trace the OpenAI API calls and the `state` of the system; GPT-4o is the underlying LLM.

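A minimal sketch of how that wiring might look (the project name is an assumption, and API keys are expected in the environment):

```python
# Sketch of enabling LangSmith tracing for OpenAI calls; the project name is hypothetical.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"   # send runs to LangSmith
os.environ["LANGCHAIN_PROJECT"] = "autoSWE"   # assumed project name
# LANGCHAIN_API_KEY and OPENAI_API_KEY must already be set in the environment.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")  # calls made through this client are traced in LangSmith
response = llm.invoke("Summarize the PRD in one sentence: ...")
```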
### Control flow

We use LangGraph to manage the control flow of the system. Nodes prefixed with "approve_" evaluate the documents/code and either approve them or circle back with a message describing what is incorrect; conditional edges route the flow of the system accordingly.

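The sketch below shows the general shape of that pattern with LangGraph's conditional edges; the node names, state fields, and approval logic are illustrative assumptions, not the system's actual graph:

```python
# Illustrative sketch of an "approve_" node with conditional edges; names are assumptions.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class State(TypedDict):
    document: str
    feedback: str
    approved: bool


def write_document(state: State) -> dict:
    # In the real system an LLM drafts or revises the artifact using state["feedback"].
    return {"document": "draft of the artifact", "feedback": ""}


def approve_document(state: State) -> dict:
    # In the real system an LLM reviews the artifact and explains what is incorrect.
    ok = len(state["document"]) > 0
    return {"approved": ok, "feedback": "" if ok else "The document is empty; please regenerate."}


def route_on_approval(state: State) -> str:
    # Conditional edge: finish when approved, otherwise circle back with feedback.
    return "done" if state["approved"] else "revise"


graph = StateGraph(State)
graph.add_node("write_document", write_document)
graph.add_node("approve_document", approve_document)
graph.add_edge(START, "write_document")
graph.add_edge("write_document", "approve_document")
graph.add_conditional_edges("approve_document", route_on_approval, {"done": END, "revise": "write_document"})

app = graph.compile()
final_state = app.invoke({"document": "", "feedback": "", "approved": False})
```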