Oracle bone character data collected by the VLRLab of HUST. We have open-sourced the HUST-OBC dataset and the models used in constructing and validating it, including the Chinese OCR model, MoCo, and the ResNet50 used for validation.
- HUST-OBC (We have renamed HUST-OBS to HUST-OBC)
  - deciphered
    - ID1
      - Source_ID1_Filename
      - Source_ID1_Filename
      - .....
    - ID2
      - Source_ID2_Filename
      - .....
    - ID3
      - .....
    - chinese_to_ID.json
    - ID_to_chinese.json
  - undeciphered
    - L
      - L_?_Filename
      - L_?_Filename
      - .....
    - X
      - X_?_Filename
      - .....
    - Y+H
      - Y_?_Filename
      - H_?_Filename
      - .....
  - GuoXueDaShi_1390
    - ID1
      - Source_ID1_Filename
      - Source_ID1_Filename
      - .....
    - ID2
      - Source_ID2_Filename
      - .....
    - ID3
      - .....
    - chinese_to_ID.json
    - ID_to_chinese.json
Sources: 'X' denotes the "New Compilation of Oracle Bone Scripts", 'L' the "Oracle Bone Script: Six Digit Numerical Code", 'G' the "GuoXueDaShi" website, 'Y' the "YinQiWenYuan" website, and 'H' the HWOBC dataset. These prefixes indicate the source of each image file.
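Because each filename starts with its source code (Source_ID_Filename), you can tally how many images come from each source. The snippet below is only a small illustration; it assumes HUST-OBC has been extracted to the current directory with the folder layout shown in the tree above.

```python
import os
from collections import Counter

# Count images per source (X, L, G, Y, H) in the deciphered split.
# Assumes HUST-OBC is extracted in the current directory and filenames
# follow the Source_ID_Filename convention described above.
root = "HUST-OBC/deciphered"
counts = Counter()
for category in os.listdir(root):
    category_dir = os.path.join(root, category)
    if not os.path.isdir(category_dir):
        continue  # skip chinese_to_ID.json / ID_to_chinese.json
    for filename in os.listdir(category_dir):
        source = filename.split("_", 1)[0]  # leading source code
        counts[source] += 1

print(counts)
```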
```bash
conda create -n HUST-OBC python=3.10
conda activate HUST-OBC
git clone https://github.com/Pengjie-W/HUST-OBC.git
cd HUST-OBC
pip install -r requirements.txt
```
To use MoCo or Validation, you need to download HUST-OBC; you can then use the trained models directly for prediction. If you want to use the Chinese OCR model, please download the OCR dataset and the corresponding model. After downloading, organize the data as follows.
- Your_dataroot
  - HUST-OBC
    - deciphered
    - ...
  - MoCo
    - model_last.pth
    - ...
  - OCR
  - Validation
    - max_val_acc.pth
    - ...
The code for training and testing (usage) of the Chinese OCR model is provided in the OCR folder. The model recognizes 88,899 classes of Chinese characters. Model download. Category numbers and their corresponding Chinese characters are stored in OCR/label.json. We provide the models and code with α set to 0.
OCR Dataset download.
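As a quick illustration, the snippet below looks up the Chinese character for a predicted class ID using OCR/label.json. It assumes the file maps class IDs (as string keys) to characters, per the description above; adjust the key handling if your copy differs.

```python
import json

# Map a predicted class index to its Chinese character.
# Assumes OCR/label.json stores {class_id: character} with string keys.
with open("OCR/label.json", "r", encoding="utf-8") as f:
    label_map = json.load(f)

predicted_class = 123  # example class index from the OCR model
print(label_map[str(predicted_class)])
```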
You can use train.py for fine-tuning or retraining. Chinese_to_ID.json and ID_to_Chinese.json store the mappings between OCR dataset category IDs and Chinese characters. Dataset establishment.py is used to generate the training dataset OCR_train.json. Once the model is downloaded, you can use test.py directly for testing; it ships with two example test images, which are Chinese character images cropped from other PDFs. Images with a white background work best. use.json contains the paths of the test images, stored as a list. The recognized content is written to result.json, as sketched below.
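For example, to run recognition on your own cropped character images, you could write their paths into use.json and read result.json afterwards. This is a minimal sketch, assuming test.py is invoked from the OCR folder and reads use.json / writes result.json in the working directory; the image filenames are placeholders.

```python
import json
import subprocess

# Hypothetical image paths; replace with your own white-background crops.
image_paths = ["examples/char_01.png", "examples/char_02.png"]

# use.json holds the test image paths as a plain list.
with open("use.json", "w", encoding="utf-8") as f:
    json.dump(image_paths, f, ensure_ascii=False)

# Run the provided test script (assumed to be run from the OCR folder).
subprocess.run(["python", "test.py"], check=True)

# result.json holds the recognized content.
with open("result.json", "r", encoding="utf-8") as f:
    print(json.load(f))
```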
The code for training and testing (usage) is provided in the MoCo folder. Model download.
You can use train.py for fine-tuning or retraining. Dataset establishment.py is used to generate the training dataset MOCO_train.json. After downloading the MoCo model, test.py runs MoCo over the 1,781 unmerged oracle bone categories: for each category it looks for the first sample from another category whose similarity exceeds args.w, thereby measuring the similarity between different oracle bone categories. The results are saved in result.json.
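The comparison amounts to thresholding pairwise similarity between MoCo embeddings. The sketch below only illustrates that idea and is not the repository's test.py: the feature tensors here are random stand-ins, and the threshold mirrors args.w.

```python
import torch
import torch.nn.functional as F

def first_similar_sample(query_feat, other_feats, w=0.9):
    """Return the index of the first sample whose cosine similarity with
    the query exceeds the threshold w (analogous to args.w), or None.

    query_feat: (D,) embedding of one oracle bone image.
    other_feats: (N, D) embeddings from a different category.
    """
    query = F.normalize(query_feat, dim=0)
    others = F.normalize(other_feats, dim=1)
    sims = others @ query                      # cosine similarities, shape (N,)
    hits = (sims > w).nonzero(as_tuple=True)[0]
    return int(hits[0]) if hits.numel() > 0 else None

# Toy usage with random features standing in for MoCo embeddings.
q = torch.randn(128)
bank = torch.randn(50, 128)
print(first_similar_sample(q, bank, w=0.9))
```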
The code for training and testing (usage) is provided in the Validation folder. Model download.
Dataset establishment.py is used to split the dataset. Since the classification model cannot recognize unseen categories, all categories with only one sample are allocated to the training set. Validation_test.json, Validation_val.json, and Validation_train.json are the test, validation, and training sets, respectively, split in a 1:1:8 ratio. standard deviation.py is used to compute the standard deviation of the training set. A sketch of the splitting rule follows.
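A minimal sketch of that splitting rule is shown below: single-sample categories end up entirely in the training set, and the remaining samples are divided roughly 8:1:1 per category. The data structure and random seed are assumptions, not the repository's exact script.

```python
import random
from collections import defaultdict

def split_dataset(samples, seed=0):
    """samples: list of (image_path, category_id) pairs.
    Returns train, val, test lists: categories with only one sample go to
    train (n // 10 == 0 puts everything there); others split ~8:1:1."""
    random.seed(seed)
    by_category = defaultdict(list)
    for path, cat in samples:
        by_category[cat].append(path)

    train, val, test = [], [], []
    for cat, paths in by_category.items():
        random.shuffle(paths)
        n = len(paths)
        n_test, n_val = n // 10, n // 10
        test += [(p, cat) for p in paths[:n_test]]
        val += [(p, cat) for p in paths[n_test:n_test + n_val]]
        train += [(p, cat) for p in paths[n_test + n_val:]]
    return train, val, test
```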
You can use train.py for fine-tuning or retraining. Once the model is downloaded, you can use test.py to evaluate the test set, on which the model reaches an accuracy of 94.6%. log.csv records the training set and test set accuracy for each epoch. Validation_label.json stores the mapping between classification IDs and dataset category IDs.
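If you want to load the released checkpoint outside test.py, a minimal sketch might look like the following. The backbone, preprocessing, and checkpoint layout are assumptions, and the class count is inferred from Validation_label.json, so check test.py for the exact settings.

```python
import json
import torch
from torchvision import models

# Minimal sketch of loading the released validation checkpoint.
# Assumptions: a plain torchvision ResNet50 backbone, and that
# Validation_label.json maps classification IDs to dataset category IDs,
# so its length gives the number of classes.
with open("Validation/Validation_label.json", "r", encoding="utf-8") as f:
    label_map = json.load(f)

model = models.resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, len(label_map))

ckpt = torch.load("Validation/max_val_acc.pth", map_location="cpu")
state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
model.load_state_dict(state_dict, strict=False)
model.eval()
```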