# T6: Tensor ProducT ATTenTion Transformer

T6 (Tensor ProducT ATTenTion Transformer) is a state-of-the-art transformer model that leverages Tensor Product Attention (TPA) mechanisms to enhance performance and reduce KV cache size. This repository provides tools for data preparation, model pretraining, and evaluation to facilitate research and development using the T6 architecture.
This repository contains the official code for the paper "Tensor Product Attention Is All You Need".
Authors: Yifan Zhang*, Yifeng Liu*, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
[Webpage] [Huggingface]
## Table of Contents

- Features
- Installation
- Data Preparation
- Pretraining
- Evaluation
- Usage
- Contributing
- License
- Acknowledgements
## Features

- Tensor Product Attention: factorizes queries, keys, and values into contextual low-rank tensor products, improving model performance while shrinking the KV cache (a minimal sketch follows this list).
- Scalability: Efficient training procedures optimized for large-scale datasets and multi-GPU setups.
- Flexible Data Support: Compatible with popular datasets like Fineweb-Edu-100B and OpenWebText.
- Comprehensive Evaluation: Integrated with lm-evaluation-harness for standardized benchmarking.
- Higher-order TPA: planned (TBD).
- Flash TPA: planned (TBD).
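As a rough illustration of the TPA idea referenced above, here is a minimal sketch of ours, not the repository's implementation: the module name and rank defaults are illustrative, and rotary position embeddings are omitted. Each token's queries, keys, and values are built as averaged sums of rank-1 tensor products between a head-side factor and a feature-side factor, so only the small key/value factors need to be cached at inference time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TensorProductAttentionSketch(nn.Module):
    """Illustrative TPA-style attention: Q, K, V are averaged sums of rank-1
    tensor products of a head-side factor (R^h) and a feature-side factor (R^{d_h})."""

    def __init__(self, d_model, n_heads, d_head, r_q=6, r_k=2, r_v=2):
        super().__init__()
        self.h, self.dh = n_heads, d_head
        self.rq, self.rk, self.rv = r_q, r_k, r_v
        # Linear maps producing the head-side (a) and feature-side (b) factors.
        self.a_q = nn.Linear(d_model, r_q * n_heads)
        self.b_q = nn.Linear(d_model, r_q * d_head)
        self.a_k = nn.Linear(d_model, r_k * n_heads)
        self.b_k = nn.Linear(d_model, r_k * d_head)
        self.a_v = nn.Linear(d_model, r_v * n_heads)
        self.b_v = nn.Linear(d_model, r_v * d_head)
        self.out = nn.Linear(n_heads * d_head, d_model)

    def _factorized(self, x, a_proj, b_proj, rank):
        # x: (B, T, d_model) -> (B, T, h, d_h) via an averaged sum of rank-1
        # tensor products; only these small factors would be cached for K/V.
        B, T, _ = x.shape
        a = a_proj(x).view(B, T, rank, self.h)
        b = b_proj(x).view(B, T, rank, self.dh)
        return torch.einsum('btrh,btrd->bthd', a, b) / rank

    def forward(self, x):
        B, T, _ = x.shape
        q = self._factorized(x, self.a_q, self.b_q, self.rq)
        k = self._factorized(x, self.a_k, self.b_k, self.rk)
        v = self._factorized(x, self.a_v, self.b_v, self.rv)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (B, h, T, d_h)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(B, T, self.h * self.dh))

# Toy usage: 8 heads of size 64 on a random batch.
attn = TensorProductAttentionSketch(d_model=512, n_heads=8, d_head=64)
print(attn(torch.randn(2, 128, 512)).shape)  # torch.Size([2, 128, 512])
```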
## Installation

Ensure you have Python 3.10 or higher installed. It's recommended to use a virtual environment to manage dependencies.
- **Clone the Repository**

  ```bash
  git clone https://github.com/tensorgi/T6.git
  cd T6
  ```

- **Create and Activate a Virtual Environment**

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- **Install Required Packages**

  ```bash
  pip install torch==2.4.0 numpy transformers datasets tiktoken wandb tqdm
  ```
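After installing, you can optionally confirm that PyTorch is at the expected version and can see your GPUs (a quick sanity check, not a script shipped with this repository):

```python
import torch

print(torch.__version__)          # expect 2.4.0
print(torch.cuda.is_available())  # True on a CUDA-capable machine
print(torch.cuda.device_count())  # number of GPUs visible to PyTorch
```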
## Data Preparation

Prepare the necessary datasets before pretraining the model. T6 supports both Fineweb-Edu-100B and OpenWebText.
### Fineweb-Edu-100B

Fineweb-Edu-100B is a large-scale educational dataset hosted on Hugging Face.
- **Navigate to the Data Directory**

  ```bash
  cd data/fineweb-edu
  ```

- **Run the Data Preparation Script**

  ```bash
  python fineweb-edu.py
  ```

- **Move the Prepared Data**

  ```bash
  mv fineweb-edu100B ..
  cd ../..
  ```
### OpenWebText

OpenWebText is an open reproduction of OpenAI's WebText dataset.
- **Run the Data Preparation Script**

  ```bash
  python data/openwebtext/prepare.py
  ```
Ensure you have sufficient storage and computational resources, as OpenWebText is sizable.
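If the preparation scripts write flat binary files of `uint16` GPT-2 BPE token IDs (a nanoGPT-style layout; this is an assumption about the repository's output format, and the `train.bin` filename is illustrative), you can spot-check the prepared data like this:

```python
import numpy as np
import tiktoken

# Assumed layout: a flat array of uint16 GPT-2 BPE token IDs.
tokens = np.memmap('data/openwebtext/train.bin', dtype=np.uint16, mode='r')
print(f"{len(tokens):,} training tokens")

# Decode a short window to confirm the data looks like readable text.
enc = tiktoken.get_encoding('gpt2')
print(enc.decode(tokens[:64].tolist()))
```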
## Pretraining

Pretrain the T6 model using the prepared datasets. The provided scripts support distributed training across multiple GPUs.
- **Using the Provided Bash Script**

  Execute the pretraining script, which handles the training process:

  ```bash
  bash pretrain.sh
  ```
- **Manual Execution with `torchrun`**

  For more control or customization, use `torchrun` to launch training. Replace `config/train_T6_medium_adam_80g8.py` with your desired configuration file:

  ```bash
  torchrun --standalone --nproc_per_node=8 \
      train_adam_fw.py \
      config/train_T6_medium_adam_80g8.py
  ```

  `--nproc_per_node=8` specifies the number of processes per node (typically one per GPU). A multi-node variant is sketched below.
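For multi-node training, `torchrun`'s standard rendezvous flags can replace `--standalone`. The sketch below assumes two 8-GPU nodes; `MASTER_ADDR` and the port are placeholders you would set for your own cluster:

```bash
# Run with --node_rank=0 on the first node and --node_rank=1 on the second.
torchrun --nnodes=2 --node_rank=0 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    train_adam_fw.py \
    config/train_T6_medium_adam_80g8.py
```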
## Evaluation

Evaluate the performance of the pretrained T6 model using standardized benchmarks.
- **Navigate to the Evaluation Harness Directory**

  ```bash
  cd lm-evaluation-harness
  ```

- **Follow the Instructions in That Directory**

  Ensure your model is compatible with the evaluation harness requirements. An example invocation is sketched after this list.
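Once the harness is installed (typically `pip install -e .` inside `lm-evaluation-harness`), evaluation is usually launched through its `lm_eval` CLI. The invocation below is a sketch: the checkpoint path is hypothetical, the task list is only an example, and a T6 checkpoint may need to be exported to (or wrapped as) a Hugging Face model before the `hf` backend can load it.

```bash
lm_eval --model hf \
    --model_args pretrained=/path/to/your/t6-checkpoint \
    --tasks arc_easy,hellaswag,piqa \
    --device cuda:0 \
    --batch_size 8
```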
## Acknowledgements

- Hugging Face for providing the Fineweb-Edu-100B dataset.
- EleutherAI for the lm-evaluation-harness.
- OpenWebText team for replicating the WebText dataset.
## Citation

If you use Tensor Product Attention (TPA) or the Tensor ProducT ATTenTion Transformer (T6) in your research or application, please consider citing it!
```bibtex
@article{zhang2025tensor,
  title={Tensor Product Attention Is All You Need},
  author={Zhang, Yifan and Liu, Yifeng and Yuan, Huizhuo and Qin, Zhen and Yuan, Yang and Gu, Quanquan and Yao, Andrew Chi-Chih},
  journal={arXiv preprint arXiv:2501.06425},
  year={2025}
}
```