
HUST-OBC

Paper · figshare · Download Dataset

Oracle Bone Character data collected by VLRLab of HUST. We have open-sourced the HUST-OBC dataset and the models used to build it, including the Chinese OCR model, MoCo, and the ResNet50 used for validation.

HUST-OBC Dataset

HUST-OBC Download

Tree of our dataset

  • HUST-OBC (We have renamed HUST-OBS to HUST-OBC)
    • deciphered
      • ID1
        • Source_ID1_Filename
        • Source_ID1_Filename
        • .....
      • ID2
        • Source_ID2_Filename
        • .....
      • ID3
      • .....
      • chinese_to_ID.json
      • ID_to_chinese.json
    • undeciphered
      • L
        • L_?_Filename
        • L_?_Filename
        • .....
      • X
        • X_?_Filename
        • .....
      • Y+H
        • Y_?_Filename
        • H_?_Filename
        • .....
    • GuoXueDaShi_1390
      • ID1
        • Source_ID1_Filename
        • Source_ID1_Filename
        • .....
      • ID2
        • Source_ID2_Filename
        • .....
      • ID3
      • .....
      • chinese_to_ID.json
      • ID_to_chinese.json

Sources: 'X' denotes the "New Compilation of Oracle Bone Scripts", 'L' the "Oracle Bone Script: Six-Digit Numerical Code", 'G' the "GuoXueDaShi" website, 'Y' the "YinQiWenYuan" website, and 'H' the HWOBC dataset. The prefix of each filename records which of these sources the image came from.
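
For illustration, here is a minimal Python sketch of reading the mapping files and the Source_ID_Filename convention. The exact JSON structure of ID_to_chinese.json and the on-disk layout are assumptions based on the tree above, not guaranteed by the repository.

import json
import os

root = 'HUST-OBC/deciphered'

# Assumed structure: {"ID": "Chinese character", ...}
with open(os.path.join(root, 'ID_to_chinese.json'), encoding='utf-8') as f:
    id_to_chinese = json.load(f)

for class_id in sorted(os.listdir(root)):
    class_dir = os.path.join(root, class_id)
    if not os.path.isdir(class_dir):
        continue  # skip chinese_to_ID.json and ID_to_chinese.json
    for filename in os.listdir(class_dir):
        # Filenames follow Source_ID_Filename; the leading letter is the source code
        source = filename.split('_', 1)[0]
        print(class_id, id_to_chinese.get(class_id), source, filename)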

Environment

conda create -n HUST-OBC python=3.10
conda activate HUST-OBC
git clone https://github.com/Pengjie-W/HUST-OBC.git
cd HUST-OBC
pip install -r requirements.txt

Instructions for use

To use MoCo or Validation, you need to download HUST-OBC; you can then use the trained models directly for prediction. If you want to use Chinese OCR, please download the OCR dataset and the corresponding model. After downloading, organize the data as described in the sections below.

Chinese OCR

The code for training and testing (usage) is provided in the OCR folder; the model recognizes 88,899 classes of Chinese characters. Model download. Category numbers and their corresponding Chinese characters are stored in OCR/label.json. We provide the models and code with α set to 0.
OCR Dataset download.

You can use train.py for fine-tuning or retraining. Chinese_to_ID.json and ID_to_Chinese.json store the mappings between OCR dataset category IDs and Chinese characters. Dataset establishment.py generates the training dataset OCR_train.json. Once the model is downloaded, you can use test.py directly for testing; it ships with two example test images, which are Chinese character images cropped from other PDFs. Images with a white background work best. use.json contains the paths of the test images, saved as a JSON list. The recognized content is written to result.json.
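
As a minimal sketch of that workflow: the README states that use.json is a list of image paths, but the example paths below and the exact structure of result.json are assumptions.

import json

# Point use.json at your own character crops (paths here are hypothetical)
with open('use.json', 'w', encoding='utf-8') as f:
    json.dump(['crops/char_0.png', 'crops/char_1.png'], f)

# ...then run: python test.py ...

# result.json holds the recognized content; its exact structure is an
# assumption here, so inspect it before relying on specific keys
with open('result.json', encoding='utf-8') as f:
    print(json.load(f))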

MoCo

The code for training and testing (usage) is provided in the MoCo folder. Model download.

You can use train.py for fine-tuning or retraining; Dataset establishment.py generates the training dataset MOCO_train.json. After downloading the MoCo model, use test.py to run MoCo over the 1,781 unmerged oracle bone categories: for each sample it searches for the first sample from another category whose similarity exceeds args.w, which measures the similarity between different oracle bone categories. The results are saved in result.json.
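
To make the matching rule concrete, here is an illustrative sketch of the threshold search described above. It is not the repository's test.py: the embeddings, labels, and threshold w (standing in for args.w) are assumptions, and the features are assumed to be L2-normalized MoCo encoder outputs.

import numpy as np

def first_cross_category_match(embeddings, labels, w):
    """For each sample, find the first sample from a different
    category whose cosine similarity exceeds w (cf. args.w)."""
    sims = embeddings @ embeddings.T  # cosine similarity for normalized rows
    matches = {}
    for i in range(len(labels)):
        for j in np.argsort(-sims[i]):  # most similar first
            if labels[j] != labels[i] and sims[i, j] > w:
                matches[i] = (int(j), float(sims[i, j]))
                break  # keep only the first qualifying sample
    return matches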

Validation

The code for training and testing (usage) is provided in the Validation folder. Model download.

Dataset establishment.py splits the dataset. Since a classification model cannot recognize unseen categories, all categories with only one sample are allocated to the training set. Validation_test.json, Validation_val.json, and Validation_train.json are the test, validation, and training sets, respectively, split in a 1:1:8 ratio. standard deviation.py computes the standard deviation of the training set.
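
The split rule can be sketched as follows; this is an illustration of the stated 1:1:8 rule, not the repository's Dataset establishment.py, and samples_by_class (a dict from category ID to a list of image paths) is an assumed input format.

import random

def split_dataset(samples_by_class, seed=0):
    """1:1:8 test/val/train split; single-sample categories go to train."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for class_id, samples in samples_by_class.items():
        if len(samples) == 1:
            # A classifier cannot predict classes it never saw, so
            # one-sample categories stay entirely in the training set
            train.append((class_id, samples[0]))
            continue
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n = len(shuffled)
        n_test = n_val = n // 10  # per-class rounding; small classes lean to train
        test += [(class_id, s) for s in shuffled[:n_test]]
        val += [(class_id, s) for s in shuffled[n_test:n_test + n_val]]
        train += [(class_id, s) for s in shuffled[n_test + n_val:]]
    return train, val, test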

You can use train.py for fine-tuning or retraining. Once the model is downloaded, you can use test.py to evaluate it on the test set, where it reaches an accuracy of 94.6%. log.csv records the training set and test set accuracy for each epoch. Validation_label.json stores the mapping between classification IDs and dataset category IDs.
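
For example, a predicted class index could be mapped back to a dataset category ID and its character like this; the key types and structures of both JSON files are assumptions, so inspect them before use.

import json

with open('Validation_label.json', encoding='utf-8') as f:
    cls_to_dataset_id = json.load(f)  # assumed: {"classifier index": "dataset ID"}
with open('ID_to_chinese.json', encoding='utf-8') as f:
    id_to_chinese = json.load(f)      # assumed: {"dataset ID": "character"}

pred = 42  # hypothetical index predicted by the classifier
dataset_id = cls_to_dataset_id[str(pred)]
print(dataset_id, id_to_chinese.get(str(dataset_id)))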
