Exafunction/codeium-parse

codeium-parse

A command line tool for parsing code syntax

This repository contains a binary built with tree-sitter that lets you:

  • Inspect the concrete syntax tree of a source file
  • Use pre-written tree-sitter query files to locate important symbols in source code
  • Format output in JSON to use the results in your own applications

In particular, this repo provides a binary prepackaged with:

  • A recent version of the tree-sitter library
  • A large number of tree-sitter grammars
  • An implementation of many common query predicates

Contributions are welcome and we encourage using this tool for any applications that involve code syntax analysis. For example, these queries are used by Codeium Search to index code locally for repo-wide semantic search. If you use Codeium Search, adding queries for your language here will enable it to work better on your own code!

Example

(Requires fd and jq.)

# Print all names and arguments from function definitions.
fd -e js \
  | xargs -I '{}' ./parse -quiet -use_tags_query -json -json_include_path -file '{}' \
  | jq -r '.
    | select(.captures."definition.function" != null)
    | .file + ":" + .captures.name[0].text + .captures."codeium.parameters"[0].text'
# Output:
# examples/example.js:add(a, b)
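
If you would rather consume the JSON from your own program than from jq, the same filter can be sketched in Python. This is a minimal sketch under assumptions: it expects one JSON object per line, and it relies only on the fields the jq filter above reads (`file`, `captures`, and the `text` of each capture). The `function_signatures` helper and the sample record are hypothetical and written for illustration; real output may contain additional fields.

```python
import json


def function_signatures(json_lines):
    """Yield 'file:name(params)' for every match that defines a function.

    Assumes one JSON object per line, shaped like the fields the jq
    filter reads: .file, .captures.name[0].text, and
    .captures."codeium.parameters"[0].text.
    """
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        match = json.loads(line)
        captures = match.get("captures", {})
        # Mirrors: select(.captures."definition.function" != null)
        if captures.get("definition.function") is None:
            continue
        name = captures["name"][0]["text"]
        params = captures["codeium.parameters"][0]["text"]
        yield f'{match["file"]}:{name}{params}'


# Hypothetical record for illustration only; it is not captured tool output.
sample = json.dumps({
    "file": "examples/example.js",
    "captures": {
        "definition.function": [{"text": "function add(a, b) { return a + b; }"}],
        "name": [{"text": "add"}],
        "codeium.parameters": [{"text": "(a, b)"}],
    },
})

for signature in function_signatures([sample]):
    print(signature)
```

In practice you would pass `sys.stdin` (or the captured output of `./parse`) instead of the hand-written sample.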

Getting started

$ ./download_parse.sh
$ ./parse -file examples/example.js -named_only
program [0, 0] - [4, 0] "// Adds two numbers.\n…"
  comment [0, 0] - [0, 20] "// Adds two numbers."
  function_declaration [1, 0] - [3, 1] "function add(a, b) {\n…"
    name: identifier [1, 9] - [1, 12] "add"
    parameters: formal_parameters [1, 12] - [1, 18] "(a, b)"
      identifier [1, 13] - [1, 14] "a"
      identifier [1, 16] - [1, 17] "b"
    body: statement_block [1, 19] - [3, 1] "{\n…"
      return_statement [2, 4] - [2, 17] "return a + b;"
        binary_expression [2, 11] - [2, 16] "a + b"
          left: identifier [2, 11] - [2, 12] "a"
          right: identifier [2, 15] - [2, 16] "b"
$ ./parse -file examples/example.js -use_tags_query -json | jq ".captures.doc[0].text"
"// Adds two numbers."

Support status

Queries

Queries try to follow the conventions established by tree-sitter.

Most captures also include documentation as @doc. @definition.function and @definition.method also capture @codeium.parameters.

Python TypeScript JavaScript Go Java C++ PHP
@definition.class
@definition.function 1 N/A
@definition.method 2 1 2
@definition.interface N/A N/A N/A
@definition.namespace N/A N/A N/A N/A
@definition.module N/A N/A N/A N/A N/A
@definition.type N/A N/A N/A
@definition.constant
@definition.enum
@definition.import N/A
@definition.include N/A N/A N/A N/A N/A
@definition.package N/A N/A N/A N/A N/A
@reference.call
@reference.class 3

Want to write a query for a new language? The tags.scm file and other queries in each language's tree-sitter repository (for example, tree-sitter-javascript) are a good place to start.

Query predicates

$ ./parse -supported_predicates
#eq?/#not-eq?
    (#eq? <@capture|"literal"> <@capture|"literal">)
    Checks if two values are equal.

#has-parent?/#not-has-parent?
    (#has-parent? @capture node_type...)
    Checks if @capture has a parent node of any of the given types.

#has-type?/#not-has-type?
    (#has-type? @capture node_type...)
    Checks if @capture has a node of any of the given types.

#lineage-from-name!
    (#lineage-from-name! "literal")
    If the name captures scopes, split by "literal" and retain the last element
    as the name. The other elements are appended to the lineage.

#match?/#not-match?
    (#match? @capture "regex")
    Checks if the text for @capture matches the given regular expression.

#select-adjacent!
    (#select-adjacent! @capture @anchor)
    Selects @capture nodes contiguous with @anchor (all starting and ending on
    adjacent lines).

#set!
    (#set! key <@capture|"literal">)
    Store metadata as a side effect of a match.

#strip!
    (#strip! @capture "regex")
    Removes all matching text from all @capture nodes.

Need a predicate that hasn't been implemented? File an issue! Where possible, we reuse predicates from nvim-treesitter.
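
To make the two text-rewriting predicates more concrete, here is a rough Python sketch of the behavior their descriptions suggest. It is an illustration only, not the tool's implementation: the helper names `lineage_from_name` and `strip_matches` are hypothetical, and it assumes Python's `re` syntax, which may differ from the regex dialect the binary uses.

```python
import re


def lineage_from_name(name, separator, lineage=()):
    """Sketch of #lineage-from-name!: split a scoped name by `separator`,
    keep the last element as the name, and append the rest to the lineage."""
    parts = name.split(separator)
    return parts[-1], list(lineage) + parts[:-1]


def strip_matches(text, pattern):
    """Sketch of #strip!: remove every match of `pattern` from the text."""
    return re.sub(pattern, "", text)


# A scoped name such as "outer::inner::run" split on "::".
print(lineage_from_name("outer::inner::run", "::"))
# Removing the leading comment marker from a captured doc comment.
print(strip_matches("// Adds two numbers.", r"^//\s*"))
```

With these inputs the first call yields the name `run` with lineage `outer`, `inner`, and the second leaves `Adds two numbers.`.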

Grammars

$ ./parse -supported_languages
ada
c
cpp
csharp
css
dart
go
hcl
html
java
javascript
json
kotlin
latex
markdown
ocaml
ocaml_interface
perl
php
protobuf
python
ruby
rust
shell
svelte
swift
toml
tree_sitter_query
tsx
typescript
vue
yaml

Looking for support for another language? File an issue with a link to the repo that contains the grammar.

Contributing

Pull requests are welcome. For non-issue discussions about codeium-parse, join our Discord.

Adding and testing queries

  • You can create new source files with patterns you want to target in test_files/.
  • Look at the syntax tree using ./parse -file test_files/<your file> to get a sense of how to capture the pattern.
  • Learn the query syntax from tree-sitter documentation.
  • Run ./goldens.sh to see what your query captures.

Footnotes

  1. Function and method signatures are captured individually in TypeScript. Therefore, the @doc capture may not exist on all nodes.

  2. Currently functions and methods are not distinguished.

  3. Function calls and class instantiation are indistinguishable in Python.