
T6: Tensor ProducT ATTenTion Transformer


T6 (Tensor ProducT ATTenTion Transformer) is a state-of-the-art transformer model built on Tensor Product Attention (TPA), an attention mechanism that factorizes queries, keys, and values into low-rank tensor products to improve model quality and shrink the KV cache at inference time. This repository provides tools for data preparation, model pretraining, and evaluation to facilitate research and development with the T6 architecture.

This repository contains the official code for the paper "Tensor Product Attention Is All You Need".

Authors: Yifan Zhang*, Yifeng Liu*, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao

[Webpage] [Huggingface]

Table of Contents

  • Features
  • Installation
  • Data Preparation
  • Pretraining
  • Evaluation
  • Acknowledgements
  • Star History
  • Citation

Features

  • Tensor Product Attention: Implements the TPA mechanism, which factorizes attention inputs to improve model quality and reduce KV cache size (a minimal sketch follows this list).
  • Scalability: Efficient training procedures optimized for large-scale datasets and multi-GPU setups.
  • Flexible Data Support: Compatible with popular datasets like Fineweb-Edu-100B and OpenWebText.
  • Comprehensive Evaluation: Integrated with lm-evaluation-harness for standardized benchmarking.
  • Higher-order TPA (TBD): Support for higher-order tensor product attention is planned.
  • Flash TPA (TBD): A FlashAttention-style implementation of TPA is planned.
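
To make the factorization concrete, here is a minimal, self-contained PyTorch sketch of tensor product attention. It is illustrative only and is not the repository's implementation (the official TPA module additionally handles RoPE and factor caching); the ranks, dimensions, and names below are assumptions chosen for clarity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TPASketch(nn.Module):
        """Illustrative tensor product attention: Q, K, V are averages of rank-R
        outer products between a per-head factor and a per-channel factor."""

        def __init__(self, d_model, n_heads, head_dim, r_q=6, r_kv=2):
            super().__init__()
            self.n_heads, self.head_dim = n_heads, head_dim
            self.r_q, self.r_kv = r_q, r_kv
            # Factor projections: "a" factors live in head space, "b" factors in channel space.
            self.a_q = nn.Linear(d_model, r_q * n_heads)
            self.b_q = nn.Linear(d_model, r_q * head_dim)
            self.a_k = nn.Linear(d_model, r_kv * n_heads)
            self.b_k = nn.Linear(d_model, r_kv * head_dim)
            self.a_v = nn.Linear(d_model, r_kv * n_heads)
            self.b_v = nn.Linear(d_model, r_kv * head_dim)
            self.proj = nn.Linear(n_heads * head_dim, d_model)

        def _compose(self, a_proj, b_proj, rank, x):
            B, T, _ = x.shape
            a = a_proj(x).view(B, T, rank, self.n_heads)
            b = b_proj(x).view(B, T, rank, self.head_dim)
            # Average of outer products over the rank axis -> (B, T, n_heads, head_dim).
            return torch.einsum("btrh,btrd->bthd", a, b) / rank

        def forward(self, x):
            B, T, _ = x.shape
            q = self._compose(self.a_q, self.b_q, self.r_q, x).transpose(1, 2)
            k = self._compose(self.a_k, self.b_k, self.r_kv, x).transpose(1, 2)
            v = self._compose(self.a_v, self.b_v, self.r_kv, x).transpose(1, 2)
            y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            y = y.transpose(1, 2).reshape(B, T, self.n_heads * self.head_dim)
            return self.proj(y)

The KV-cache saving comes from caching only the small K/V factors (the outputs of a_k, b_k, a_v, b_v) instead of full per-head keys and values; see the paper for the exact ranks and the RoPE-compatible formulation used in T6.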

Installation

Ensure you have Python 3.10 or higher installed. It's recommended to use a virtual environment to manage dependencies.

  1. Clone the Repository

    git clone https://github.com/tensorgi/T6.git
    cd T6
  2. Create and Activate a Virtual Environment

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install Required Packages

    pip install torch==2.4.0 numpy transformers datasets tiktoken wandb tqdm
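
After installation, a quick sanity check (optional, not part of the repository's tooling) confirms that PyTorch and its CUDA backend are visible before you launch training:

    # Optional environment sanity check.
    import torch

    print("torch:", torch.__version__)                  # expected: 2.4.0
    print("CUDA available:", torch.cuda.is_available())
    print("visible GPUs:", torch.cuda.device_count())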

Data Preparation

Prepare the necessary datasets before pretraining the model. T6 supports both Fineweb-Edu-100B and OpenWebText.

Fineweb-Edu-100B

Fineweb-Edu-100B is a large-scale educational dataset hosted on Hugging Face.

  1. Navigate to the Data Directory

    cd data/fineweb-edu
  2. Run the Data Preparation Script

    python fineweb-edu.py
  3. Move the Prepared Data

    mv fineweb-edu100B ..
    cd ../..

OpenWebText

OpenWebText is an open reproduction of OpenAI's WebText dataset.

  1. Run the Data Preparation Script

    python data/openwebtext/prepare.py

    Ensure you have sufficient storage and computational resources as OpenWebText is sizable.
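
Both preparation scripts are expected to write flat binary files of token IDs in the nanoGPT style (uint16 arrays readable with numpy.memmap); this is an assumption about their output format, and the path and tokenizer below are illustrative, so check what the scripts actually produced. A quick way to inspect the prepared data:

    # Peek at a prepared token file (assumes nanoGPT-style uint16 token binaries
    # and GPT-2 BPE tokenization; the path is illustrative).
    import numpy as np
    import tiktoken

    path = "data/openwebtext/val.bin"          # replace with an actual output file
    tokens = np.memmap(path, dtype=np.uint16, mode="r")
    print(f"{len(tokens):,} tokens in {path}")

    enc = tiktoken.get_encoding("gpt2")
    print(enc.decode(tokens[:64].tolist()))    # decode the first few tokens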

Pretraining

Pretrain the T6 model using the prepared datasets. The provided scripts support distributed training across multiple GPUs.

  1. Using the Provided Bash Script

    Execute the provided script, which launches the pretraining run:

    bash pretrain.sh
  2. Manual Execution with torchrun

    For more control or customization, use torchrun to launch training directly. Replace config/train_T6_medium_adam_80g8.py with your desired configuration file (an illustrative config sketch follows this list).

    torchrun --standalone --nproc_per_node=8 \
        train_adam_fw.py \
        config/train_T6_medium_adam_80g8.py
    • --nproc_per_node=8 sets the number of processes per node, typically one per GPU; adjust it to match your hardware.
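
The .py configuration path suggests nanoGPT-style config files: plain Python whose variables override the training script's defaults. The sketch below is purely hypothetical (every field name is illustrative); in practice, copy a provided config such as config/train_T6_medium_adam_80g8.py and edit it, since that file shows the options the training script actually reads.

    # config/train_T6_custom.py -- hypothetical example with illustrative field names.
    out_dir = "out-t6-custom"
    dataset = "openwebtext"        # or the fineweb-edu data prepared above
    batch_size = 16
    block_size = 1024
    learning_rate = 3e-4
    max_iters = 100000
    wandb_log = False              # wandb is installed above if you want logging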

Evaluation

Evaluate the performance of the pretrained T6 model using standardized benchmarks.

  1. Navigate to the Evaluation Harness Directory

    cd lm-evaluation-harness
  2. Follow the Instructions Within This Directory

    Ensure your model is compatible with the evaluation harness requirements.

Acknowledgements

Star History

Star History Chart

Citation

If you use Tensor Product Attention (TPA) or the Tensor ProducT ATTenTion Transformer (T6) in your research or application, please consider citing it!

@article{zhang2025tensor,
    title={Tensor Product Attention Is All You Need},
    author={Zhang, Yifan and Liu, Yifeng and Yuan, Huizhuo and Qin, Zhen and Yuan, Yang and Gu, Quanquan and Yao, Andrew Chi-Chih},
    journal={arXiv preprint arXiv:2501.06425},
    year={2025},
}