java-columnar-query-engine

A toy Java implementation of a query engine over columnar, schema-ful, in-memory data.

Overview

I want to search over a cyclic structured set of data. Think "cities" belonging to "states" which are adjacent to other states which contain cities.

NOTE: This project was developed on macOS. It is for my own personal use.

I want all the following characteristics:

Columnar
- The data is physically laid out as columns (e.g. Java arrays).
Schema-ful
- The data adheres to a schema. Look back to my cities and states comment earlier. In other words, I don't care about implementing a query engine over a traditional "graph database" where the relationship of edges and vertices is unconstrained, like in Neo4J. The effect of having schema-ful data is that the data can be described as columns, and columnar data lays out physically as arrays. I believe what I want is something called an "object database". This type of database is somewhat obscure but actually there is one modern one called Realm which is popular on mobile. Also, I've been inspired by Kuzu which is a property graph database, but it has schemas (which I like) so doesn't that make it a traditional object database? I tried to build Kuzu from source but had issues (it's extremely new; so that's ok) so maybe I'll try Realm (although it's also C++ so I'm scared).
In-memory
- This is mostly for convenience. I don't want to deal with file IO.
Foreign Memory API (stretch goal)
- I want to lay the memory out not in arrays or objects but in contiguous chunks of off-heap memory. See JEP 442: Foreign Function & Memory API (Third Preview).
Vectorization (stretch goal)
- I want to at least vaguely think about vectorized CPU computation in light of Java's (incubating) vector API. I probably won't implement to this because it's kind of beyond me plus I don't know if it even applies.

Apache Arrow is the natural choice for modeling in-memory columnar data in 2023, but I've already learned that API in my other repository: https://github.com/dgroomes/arrow-playground. It has a strong reference implementation for Java, which is my go-to language. The Java Arrow implementation also offers table-like data modeling which goes a long way to making the developer experience pretty good for modeling the aforementioned "cities and states" data. But the convenience stops there. The Java Arrow implementation does not offer entity-to-entity relationships and it offers only very basic implementations of algorithms: specifically binary search and sorting. And I find it a bit cumbersome (but I still appreciate it; thank you open source developers). I want an actual basic query engine. So, this repo is me doing that (fingers crossed).

Design

The domain data of this project is ZIP code data, contained in cities, contained in states and with state adjacencies. This data is small (3MB) but it can be multiplied into "parallel universes" if needed. It's relatable. It's real. And the surface area of the schema is about as small as possible: ZIP codes, population, containing city name, containing state name and state code, and then state-to-state adjacency relationships. Nice. I don't want to be bogged down with more tables/columns than that.

When it comes to the workload of the queries, I want to execute a query like:

Find all ZIP codes that have a population around 10,000 and are in a state adjacent to a state with at least one city named "Plymouth".

Or something cyclic (although contrived) like:

Find all states named with "North" that are adjacent to a state with "South" that are adjacent to a state with "North".

Simple requirements. That allows the focus to be on the implementation.

The code is implemented across a few modules:

data-system
- This module describes the API of the "read-only object datastore and query engine". It also defines the core interfaces of tables, columns and associations. A data system is characterized by its physical data strategy (e.g. a columnar SQL database) and its query execution strategy.
data-system-serial-indices-arrays
- This module is an implementation of a data system. It is characterized by a serial-style (non-parallel) and indices- tracking execution strategy and the physical data is laid out in arrays. This is the most interesting module.
data-model-in-memory
- This module is a concrete implementation of the data model API using in-memory data structures (i.e. no file IO).
geography
- This module is a pure domain model. It has zero dependencies by design. It models the ZIP, city and state data which we can collectively refer to as "geography" for short.
geography-loader
- This module is responsible for loading the data from disk (JSON) into memory (Java domain objects). It's a "glue" module that is not very interesting.
geography-query
- NOT YET IMPLEMENTED
- This module is an application of the query engine over the geography domain. This is an interesting module.
app
- This module is the entrypoint of the project. It's a "glue" module.

Prior Art

I would love to know other implementations of something like this. I think there are many implementations of this and much more sophisticated systems that take it into the stratosphere like Apache Spark (Scala/Java) and maybe Data Fusion and other implementations of data frames. I'm not very educated in the space. While it would be fruitful to peruse the code of these open source projects and toy around with it I still think the barrier to entry is quite high; especially when you compare it to the library-less Java implementation like I want to do here (well I'm using SLF4J and Jackson to load the data). I like Rust for this workload but no I don't want to be constrained by ownership/borrowing. I'm in "query engine learning mode" not "Rust learning mode".

DuckDB is another OLAP database (columnar) and it is well-loved by developers and offers lots of features. I bet I could learn by building it from source and poking around. Although it's C++ so that's another learning curve for me.

tablesaw is a Java-based data frame implementation.

Instructions

Follow these instructions to build and run the example program:

Pre-requisite: Java 21
Build and run the program:
- ```
./gradlew :app:run
```
Run the tests:
- ```
./gradlew test
```

Wish List

General clean-ups, TODOs and things I wish to implement for this project:

Create a parallel query engine. Parallelization would be cool, and it's just a natural thing to do with data workloads like this.
DONE Re-package query-engine. I want a data-system module which is an API implemented by query-engine. query-engine is a "serial, indices-tracking query execution strategy" that implements the main database API. The overarching query API is this method signature: QueryResult match(Query query, Table table). There can be multiple different query engine implementations, like a parallel one, one that tracks statistics and has heuristics, one for readability and trying new features. I think creating more layers of abstraction within query-engine to accommodate different behavior will not scale well. It's ok to just re-implement some things for the sake of decoupling, interpretability, execution speed, and development speed. Thinking more widely, I even want to just use SQLite, DuckDB, Kuzu etc as a form of a "query engine" and be able to benchmark the same workload between my query engine (and in-memory column model).
- DONE Create data-system (it is purposely not called "database" because it is far from a database. A database supports writes and is durable and has more features. This is more like a "query engine for ephemeral data").
- DONE Move QueryResult match(Query query, Table table) out of Executor and into data-system.
- DONE Rename query-engine to something like data-system-serial-indices ... and follow-on renames like classes.
- DONE The data system should encapsulate the table. The API should become QueryResult match(Query query). While it is convenient to allow the caller to inject their own physical data model (e.g. InMemoryTable in this case), that should be encapsulated by the data system (see the earlier language I used to describe a data system). And actually data-system-serial-indices should really be data-system-serial-indices-arrays which makes way for things like data-system-serial-indices-arrow and data-system-serial-indices-foreign-memory. I actually want to extract the common "indices-" query engine again... but anyway this is progress. It's important that the data system encapsulates the physical data because that makes it possible for a substantially different query engine to code to its substantially and necessarily different physical data model. For example, Neo4J (Cypher engine and vertices/edges data model).
- DONE Rename data-system-serial-indices to data-system-serial-indices-arrays.
- DONE Clean up package and module names.
- DONE Update docs as appropriate.
- DONE Plan future work.
"Test Compatibility Kit" (TCK). Create a test fixture (harness? TCK?) that defines functional query tests but does not code to a specific implementation. In other words, make QueryEngineTest into QueryTest and use some indirection (explore the JUnit5 and Gradle options for this. Or, just do something plain old like a template method pattern) to allow the test to be implemented by different data systems (e.g. data-system-serial-indices-arrays or a future data-system-serial-indices-arrow or data-system-serial-indices-foreign-memory).
(cosmetic) Consider renaming the project to something like "object-query-engine" or something more specific/descriptive.
(stretch) Consider compressing integer arrays with this integer compression library which uses the (incubating) Java vector API. This would be kind of epic.
(stretch) Consider creating performance benchmarks. Consider using Java MicroBenchmark Harness.. Be careful with the benchmarks. Don't draw overly broad conclusions.
(stretch) Consider splitting apart a query verifier (UPDATE: the verifier is implemented) from a query planner from a query executor (and maybe even a query optimizer but I don't think I care to do that). I'm already finding that there is too much verification logic in the engine code which I'd rather be used just for execution.
(cosmetic) Implement some human readable descriptive toStrings for the domain types like Table, Column, etc.
(cosmetic) Criteria/criterion language. Consider it. singular/plural. I don't care much.
(stretch) Create a test fixtures module or maybe just a module built for testing. This will encapsulate the TestUtil class.
DONE (I narrowed the issue and just worked around it) Upgrade to Java 21. WARNING: I got some serious JVM failures when trying this (I guess this is liable to happen with preview features but I thought preview was more a statement of "this could change" not "this could fail"). Specifically I got the following.
```
Caused by: java.lang.VerifyError: Inconsistent stackmap frames at branch target 494
Exception Details:
  Location:
    dgroomes/queryengine/QueryEngineTest.queryOnAssociationProperty()V @494: aload
  Reason:
    Type top (current frame, locals[8]) is not assignable to 'dgroomes/queryengine/Executor$QueryResult$Failure' (stack map, locals[8])
```
I want to use the latest version of Java (currently 20), but the latest version of Gradle (8.1) only supports up to Java 19 when running Gradle itself but can support Java 20 for "forked work" like compilation, testing and running the program. See the Gradle Compatibility Matrix.
POJO (static) vs Table (dynamic) tension. By necessity, the query engine and its core API needs to be implemented with dynamism to accommodate tables of different widths and types but as application developers, we want to work with static types like a State POJO. This is what I envision the geography-query to demonstrate: it will bridge the dynamic query API to a static/specific domain. Now that I type all this out, I realize that modeling the static stuff and a "view/wrapper" class is a secondary concern for this project. I really want to focus on the query engine algorithm. Still, it's a valid nice-to-have. Or, seriously consider deleting the code and marking this wish list item as "won't do".
Fully implement boolean support. (remember we want to support 1-bit (boolean), 32-bit (int) and variable length (string)).
Consider using ZCTA (ZIP Code Tabulation Areas) instead of raw ZIP code data. ZCTA is a trademark of the US Census, and it represents an area whereas a ZIP code does not because it describes mail delivery routes. With area data, we can get into other area entities like counties. This is interesting for queries. Also, the data is well-described and official. See the related page on the Census website.
Maybe create a data system using Apache Calcite. Calcite has a model for describing relational queries, and it has many implementations (i.e. adapters) to implement those queries, including a simple "frontend" in-memory one. That's what I'll use.
- First I'll start with a proof-of-technology in its own module and not integrate it with the functional tests.
- Next, I'll actually load the geography data into (unfortuately yet another) POJO model and make a join query across the entities.
- Next, I'll write a converter to convert the Query to a Calcite relational algrebra expression/object. I found this is tricky because the user docs are not super amazing, however there are good JavaDocs (always hard to discover).

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
app		app
buildSrc		buildSrc
data-model-in-memory		data-model-in-memory
data-system-serial-indices-arrays		data-system-serial-indices-arrays
data-system		data-system
geography-loader		geography-loader
geography		geography
gradle		gradle
util		util
.gitignore		.gitignore
README.md		README.md
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle.kts		settings.gradle.kts
zips.jsonl		zips.jsonl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

java-columnar-query-engine

Overview

Design

Prior Art

Instructions

Wish List

Finished Wish List Items

About

Releases

Packages

Languages

dgroomes/java-columnar-query-engine

Folders and files

Latest commit

History

Repository files navigation

java-columnar-query-engine

Overview

Design

Prior Art

Instructions

Wish List

Finished Wish List Items

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages