64bit multithreaded python data analytics tools for numpy arrays and datasets

fabregas201307/riptable

 
 

Riptable

An open-source, 64-bit Python analytics engine for high-performance data analysis with multithreading support. Riptable supports Python 3.10 through 3.12 on 64-bit Linux and Windows.

Similar to Pandas and based on NumPy, Riptable is optimized for analyzing large volumes of data interactively, in real time. Riptable often crunches numbers at 1.5x to 10x the speed of NumPy or Pandas.

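Riptable achieves its speed through techniques such as hand-coded C routines, vector-intrinsic loops, and multithreading, which are described in the sections that follow.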

An introduction to Riptable and the reference documentation are available at riptable.readthedocs.io.

Basic concepts and classes

FastArray is a subclass of NumPy's ndarray that enables built-in multithreaded number crunching. All Scikit routines that expect a NumPy array also accept a FastArray.

Dataset replaces the Pandas DataFrame class and holds NumPy arrays of equal length.

Struct holds a collection of mixed-type data members, with Dataset as a subclass.
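
The relationship between Struct and Dataset can be pictured with a small pure-Python sketch. MiniStruct and MiniDataset are hypothetical stand-ins invented for this illustration, not Riptable classes; the sketch only shows the idea of a mixed-type container whose subclass enforces equal-length columns.

```python
class MiniStruct:
    """Hypothetical stand-in: a named collection of members of any type."""

    def __init__(self, **members):
        self._members = dict(members)

    def __getitem__(self, name):
        return self._members[name]

    def keys(self):
        return list(self._members)


class MiniDataset(MiniStruct):
    """Hypothetical stand-in: a struct whose members are equal-length columns."""

    def __init__(self, **columns):
        # Collect the distinct column lengths; more than one means a ragged table.
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError(f"columns differ in length: {sorted(lengths)}")
        super().__init__(**columns)
        self.nrows = lengths.pop() if lengths else 0


# A struct may mix scalars, strings, and lists freely.
settings = MiniStruct(name="run1", threshold=0.5, tags=["a", "b"])

# A dataset accepts only columns of the same length.
table = MiniDataset(price=[1.0, 2.0, 3.0], size=[10, 20, 30])
```

Constructing MiniDataset with columns of different lengths raises ValueError, which is the invariant the sketch is meant to illustrate.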

Categorical replaces both the Pandas DataFrame.groupby() method and the Pandas Categorical class. A Riptable Categorical supports multi-key, filterable groupings with the same functionality as Pandas groupby, and more.
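
One common way to think about a categorical is as a list of unique categories plus one integer code per row, so that a grouped, filtered sum becomes a pass over the codes. The sketch below uses plain NumPy to show that idea. It is an assumption about the general technique, not a description of Riptable's internals, and the sample keys and values are made up.

```python
import numpy as np

keys = np.array(["AAPL", "MSFT", "AAPL", "GOOG", "MSFT"])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Unique categories (sorted) and an integer code for every row.
categories, codes = np.unique(keys, return_inverse=True)

# Optional filter: only rows where the mask is True contribute to the result.
mask = values > 10.0

# Sum the filtered values per code; minlength keeps empty groups in the output.
group_sums = np.bincount(codes[mask], weights=values[mask], minlength=len(categories))

result = dict(zip(categories.tolist(), group_sums.tolist()))
```

Because the codes are computed once, any number of aggregations can reuse them without hashing or sorting the keys again.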

Datetime classes replace most NumPy and Pandas date/time classes. Riptable's DateTimeNano, Date, TimeSpan, and DateSpan classes have a design that's closer to Java, C++, or C# date/time classes.

Accum2 and AccumTable enable cross-tabulation functionality.
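
For readers new to the term, cross-tabulation means aggregating a value over every combination of two keys and laying the result out as a grid. The following pure-Python sketch shows the concept only; it does not use Accum2 or AccumTable, and the sample data is invented.

```python
from collections import defaultdict

rows = ["buy", "sell", "buy", "buy", "sell"]
cols = ["NYSE", "NYSE", "NASDAQ", "NYSE", "NASDAQ"]
qty = [100, 50, 200, 25, 75]

# Accumulate the quantity for each (row key, column key) pair.
totals = defaultdict(int)
for r, c, q in zip(rows, cols, qty):
    totals[(r, c)] += q

# Lay the totals out as a grid with sorted row and column labels.
row_labels = sorted(set(rows))
col_labels = sorted(set(cols))
table = [[totals.get((r, c), 0) for c in col_labels] for r in row_labels]
```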

SDS provides a new file format that can stack multiple datasets across multiple files, using zstd compression, multithreading, and no extra memory copies.

Small, medium, and large array performance

Riptable is designed for arrays of all sizes. For small arrays (fewer than 100 elements), low processing overhead is important. Riptable's FastArray is written in hand-coded C and processes simple arithmetic functions faster than NumPy arrays. For medium arrays (fewer than 100,000 elements), Riptable has vector-intrinsic loops. For large arrays (100,000 elements or more), Riptable dynamically scales out threading, waking up threads efficiently using a futex.
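
The size tiers above amount to choosing a code path from the array length. The sketch below illustrates that dispatch structure in pure Python. The function names are hypothetical, the thresholds are the ones quoted in the text, and the thread pool is only a stand-in for native threading: Python threads running pure-Python code do not provide the speedup that a C implementation would.

```python
from concurrent.futures import ThreadPoolExecutor
import os

SMALL_LIMIT = 100        # below this, per-call overhead dominates
LARGE_LIMIT = 100_000    # at or above this, work is split across threads


def choose_strategy(n):
    """Pick a code path from the number of elements (hypothetical helper)."""
    if n < SMALL_LIMIT:
        return "scalar"
    if n < LARGE_LIMIT:
        return "vector"
    return "threaded"


def chunked_sum(data, workers=None):
    """Sum one chunk per worker, then combine the partial sums."""
    workers = workers or min(8, os.cpu_count() or 1)
    step = max(1, -(-len(data) // workers))  # ceiling division, at least 1
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))


def total(data):
    """Dispatch on size; small and medium inputs share the built-in sum here."""
    if choose_strategy(len(data)) == "threaded":
        return chunked_sum(data)
    return sum(data)
```

In a native library the "scalar" and "vector" paths would be distinct loops; here they share one implementation so that the example stays short.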

Install and import Riptable

Create a Conda environment and run this command to install Riptable on Windows or Linux:

conda install riptable

Import Riptable in your Python code to access its functions, methods, and classes:

import riptable as rt

Note: We shorten the name of the Riptable module to rt to improve the readability of code.

Use NumPy arrays with Riptable

Easily change between NumPy's ndarray and Riptable's FastArray without producing a copy of the array.

import numpy as np
import riptable as rt

rtarray = rt.arange(100)              # FastArray holding 0..99
numpyarray = rtarray._np              # ndarray view of the same data
fastarray = rt.FastArray(numpyarray)  # back to a FastArray

Change the view of the two instances to confirm that FastArray is a subclass of ndarray.

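numpyarray.view(rt.FastArray)      # view the ndarray as a FastArray
fastarray.view(np.ndarray)         # view the FastArray as a plain ndarray
isinstance(fastarray, np.ndarray)  # True, because FastArray subclasses ndarray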
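
The no-copy behavior of view() can be checked with NumPy alone. In the sketch below, MyArray is a hypothetical ndarray subclass standing in for a FastArray-like type, so the example runs without Riptable installed; it demonstrates NumPy's view semantics and is not a test of Riptable itself.

```python
import numpy as np


class MyArray(np.ndarray):
    """Hypothetical ndarray subclass standing in for a FastArray-like type."""


base = np.arange(100)
as_subclass = base.view(MyArray)           # reinterpret as the subclass, no copy
back_again = as_subclass.view(np.ndarray)  # and back to a plain ndarray

# All three objects refer to the same underlying buffer.
shares = np.shares_memory(base, as_subclass) and np.shares_memory(base, back_again)

# Writing through the subclass view is visible through the original array.
as_subclass[0] = -1
```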

Use Pandas DataFrames with Riptable

Construct a Riptable Dataset directly from a Pandas DataFrame.

import riptable as rt
import numpy as np
import pandas as pd
df = pd.DataFrame({"intarray": np.arange(1_000_000), "floatarray": np.arange(1_000_000.0)})
ds = rt.Dataset(df)

How can I trust Riptable calculations?

Riptable has undergone years of development, and dozens of quants at a large financial firm have tested its capabilities. We also provide a full suite of tests to ensure that the modules are functioning as expected. But as with any project, there are still bugs and opportunities for improvement, which can be reported using GitHub issues.

How can Riptable perform calculations faster?

Riptable was written from day one to handle large data and multithreading using the riptide_cpp layer for basic arithmetic functions and algorithms. Many core algorithms have been painstakingly rewritten for multithreading.

How can I contribute?

The Riptable engine is another building block for Python data analytics computing, and we welcome help from users and contributors to take it to the next level. As you encounter bugs, issues with the documentation, and opportunities for new or improved functionality, please consider reaching out to the team.

See the contributing guide for more information.
