
HUST-OBC

Paper · figshare · Download Dataset

Oracle Bone Character data collected by VLRLab of HUST. We have open-sourced the HUST-OBC dataset and the models used to build it, including the Chinese OCR model, MoCo, and the ResNet50 used for validation.

HUST-OBC Dataset

HUST-OBC Download

Tree of our dataset

  • HUST-OBC (We have renamed HUST-OBS to HUST-OBC)
    • deciphered
      • ID1
        • Source_ID1_Filename
        • Source_ID1_Filename
        • .....
      • ID2
        • Source_ID2_Filename
        • .....
      • ID3
      • .....
      • chinese_to_ID.json
      • ID_to_chinese.json
    • undeciphered
      • L
        • L_?_Filename
        • L_?_Filename
        • .....
      • X
        • X_?_Filename
        • .....
      • Y+H
        • Y_?_Filename
        • H_?_Filename
        • .....
    • GuoXueDaShi_1390
      • ID1
        • Source_ID1_Filename
        • Source_ID1_Filename
        • .....
      • ID2
        • Source_ID2_Filename
        • .....
      • ID3
      • .....
      • chinese_to_ID.json
      • ID_to_chinese.json

Sources: 'X' denotes the "New Compilation of Oracle Bone Scripts", 'L' the "Oracle Bone Script: Six-Digit Numerical Code", 'G' the "GuoXueDaShi" website, 'Y' the "YinQiWenYuan" website, and 'H' the HWOBC dataset. The prefix of each filename records which of these sources the image came from.
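
For illustration, here is a minimal Python sketch of reading the mapping files and the Source_ID_Filename convention. The exact JSON structure of ID_to_chinese.json and the on-disk layout are assumptions based on the tree above, not guaranteed by the repository.

import json
import os

root = 'HUST-OBC/deciphered'

# Assumed structure: {"ID": "Chinese character", ...}
with open(os.path.join(root, 'ID_to_chinese.json'), encoding='utf-8') as f:
    id_to_chinese = json.load(f)

for class_id in sorted(os.listdir(root)):
    class_dir = os.path.join(root, class_id)
    if not os.path.isdir(class_dir):
        continue  # skip chinese_to_ID.json and ID_to_chinese.json
    for filename in os.listdir(class_dir):
        # Filenames follow Source_ID_Filename; the leading letter is the source code
        source = filename.split('_', 1)[0]
        print(class_id, id_to_chinese.get(class_id), source, filename)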

Environment

conda create -n HUST-OBC python=3.10
conda activate HUST-OBC
git clone https://github.com/Pengjie-W/HUST-OBC.git
cd HUST-OBC
pip install -r requirements.txt

Instructions for use

To use MoCo or Validation, you need to download HUST-OBC; you can then use the trained models directly for prediction. If you want to use Chinese OCR, please download the OCR dataset and the corresponding model. After downloading, organize the data as described in the sections below.

Chinese OCR

The code for training and testing (usage) is provided in the OCR folder; the model recognizes 88,899 classes of Chinese characters. Model download. Category numbers and their corresponding Chinese characters are stored in OCR/label.json. We provide the models and code with α set to 0.
OCR Dataset download.

You can use train.py for fine-tuning or retraining. Chinese_to_ID.json and ID_to_Chinese.json store the mappings between OCR dataset category IDs and Chinese characters. Dataset establishment.py generates the training dataset OCR_train.json. Once the model is downloaded, you can use test.py directly for testing; it ships with two example test images, which are Chinese character images cropped from other PDFs. Images with a white background work best. use.json contains the paths of the test images, saved as a JSON list. The recognized content is written to result.json.
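
As a minimal sketch of that workflow: the README states that use.json is a list of image paths, but the example paths below and the exact structure of result.json are assumptions.

import json

# Point use.json at your own character crops (paths here are hypothetical)
with open('use.json', 'w', encoding='utf-8') as f:
    json.dump(['crops/char_0.png', 'crops/char_1.png'], f)

# ...then run: python test.py ...

# result.json holds the recognized content; its exact structure is an
# assumption here, so inspect it before relying on specific keys
with open('result.json', encoding='utf-8') as f:
    print(json.load(f))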

MoCo

The code for training and testing (usage) is provided in the MoCo folder. Model download.

You can use train.py for fine-tuning or retraining; Dataset establishment.py generates the training dataset MOCO_train.json. After downloading the MoCo model, use test.py to run MoCo over the 1,781 unmerged oracle bone categories: for each sample it searches for the first sample from another category whose similarity exceeds args.w, which measures the similarity between different oracle bone categories. The results are saved in result.json.
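
To make the matching rule concrete, here is an illustrative sketch of the threshold search described above. It is not the repository's test.py: the embeddings, labels, and threshold w (standing in for args.w) are assumptions, and the features are assumed to be L2-normalized MoCo encoder outputs.

import numpy as np

def first_cross_category_match(embeddings, labels, w):
    """For each sample, find the first sample from a different
    category whose cosine similarity exceeds w (cf. args.w)."""
    sims = embeddings @ embeddings.T  # cosine similarity for normalized rows
    matches = {}
    for i in range(len(labels)):
        for j in np.argsort(-sims[i]):  # most similar first
            if labels[j] != labels[i] and sims[i, j] > w:
                matches[i] = (int(j), float(sims[i, j]))
                break  # keep only the first qualifying sample
    return matches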

Validation

The code for training and testing (usage) is provided in the Validation folder. Model download.

Dataset establishment.py splits the dataset. Since a classification model cannot recognize unseen categories, all categories with only one sample are allocated to the training set. Validation_test.json, Validation_val.json, and Validation_train.json are the test, validation, and training sets, respectively, split in a 1:1:8 ratio. standard deviation.py computes the standard deviation of the training set.
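
The split rule can be sketched as follows; this is an illustration of the stated 1:1:8 rule, not the repository's Dataset establishment.py, and samples_by_class (a dict from category ID to a list of image paths) is an assumed input format.

import random

def split_dataset(samples_by_class, seed=0):
    """1:1:8 test/val/train split; single-sample categories go to train."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for class_id, samples in samples_by_class.items():
        if len(samples) == 1:
            # A classifier cannot predict classes it never saw, so
            # one-sample categories stay entirely in the training set
            train.append((class_id, samples[0]))
            continue
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n = len(shuffled)
        n_test = n_val = n // 10  # per-class rounding; small classes lean to train
        test += [(class_id, s) for s in shuffled[:n_test]]
        val += [(class_id, s) for s in shuffled[n_test:n_test + n_val]]
        train += [(class_id, s) for s in shuffled[n_test + n_val:]]
    return train, val, test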

You can use train.py for fine-tuning or retraining. Once the model is downloaded, you can use test.py to evaluate it on the test set, where it reaches an accuracy of 94.6%. log.csv records the training set and test set accuracy for each epoch. Validation_label.json stores the mapping between classification IDs and dataset category IDs.
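
For example, a predicted class index could be mapped back to a dataset category ID and its character like this; the key types and structures of both JSON files are assumptions, so inspect them before use.

import json

with open('Validation_label.json', encoding='utf-8') as f:
    cls_to_dataset_id = json.load(f)  # assumed: {"classifier index": "dataset ID"}
with open('ID_to_chinese.json', encoding='utf-8') as f:
    id_to_chinese = json.load(f)      # assumed: {"dataset ID": "character"}

pred = 42  # hypothetical index predicted by the classifier
dataset_id = cls_to_dataset_id[str(pred)]
print(dataset_id, id_to_chinese.get(str(dataset_id)))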
