pip install -r requirements.txt
Load data
- Read training data, including authors and their correctly and incorrectly assigned papers.
- Read detailed information of all papers, including title, author, abstract, keywords, conference or journal, and publication year.
- Read test data, including authors and all of their papers that need to be verified (loading sketch below).
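A minimal loading sketch. `train_author.json` is taken from the GCN commands later in this README; the other two file names (`pid_to_info_all.json`, `ind_test_author.json`) are assumptions about the dataset layout.

```python
import json

DATA_DIR = "data/IND-WhoIsWho"

# Authors with their correctly and incorrectly assigned papers.
with open(f"{DATA_DIR}/train_author.json") as f:
    train_authors = json.load(f)

# Per-paper details: title, authors, abstract, keywords, venue, year (assumed file name).
with open(f"{DATA_DIR}/pid_to_info_all.json") as f:
    papers = json.load(f)

# Authors with the papers that need to be verified (assumed file name).
with open(f"{DATA_DIR}/ind_test_author.json") as f:
    test_authors = json.load(f)
```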
Convert to DataFrame
- Convert the training data from dictionary format to a pandas DataFrame for subsequent processing.
- Process correctly and incorrectly assigned papers separately and add labels (1 for correctly assigned papers, 0 for incorrectly assigned ones).
- Merge the processed data into the final training set, as sketched below.
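A sketch of the conversion, assuming each training-author record stores its correctly and incorrectly assigned papers under `normal_data` and `outliers` keys (the key names are assumptions):

```python
import pandas as pd

rows = []
for author_id, info in train_authors.items():
    for pid in info["normal_data"]:   # correctly assigned papers -> label 1
        rows.append({"author_id": author_id, "paper_id": pid, "label": 1})
    for pid in info["outliers"]:      # incorrectly assigned papers -> label 0
        rows.append({"author_id": author_id, "paper_id": pid, "label": 0})

# Merge both parts into the final training set.
train_df = pd.DataFrame(rows)
```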
Process test data
- Convert the test data to DataFrame format and collect all papers of each author (see the sketch below).
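The same conversion for the test side, assuming each test-author record lists its papers under a `papers` key:

```python
test_df = pd.DataFrame(
    [
        {"author_id": aid, "paper_id": pid}
        for aid, info in test_authors.items()
        for pid in info["papers"]  # "papers" key is an assumption
    ]
)
```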
Process paper details
- Convert the paper-details dictionary to DataFrame format to facilitate subsequent feature engineering, as sketched below.
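A one-step sketch of this conversion; the resulting columns (title, abstract, keywords, ...) mirror whatever fields the per-paper dictionaries contain:

```python
import pandas as pd

# One row per paper, keyed by paper ID.
paper_df = (
    pd.DataFrame.from_dict(papers, orient="index")
    .reset_index()
    .rename(columns={"index": "paper_id"})
)
```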
Feature engineering is the core of this project: we extract useful features from the data with a variety of techniques to improve model performance. We used the following feature engineering techniques:
Keyword processing
- Extract a list of common keywords from the papers' keyword fields.
- Count the occurrences of these keywords in each paper to generate keyword features, as sketched below.
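A sketch of the keyword features; the top-100 cutoff and the `keywords` field name are assumptions:

```python
from collections import Counter

# Vocabulary of the most common keywords across all papers.
keyword_counts = Counter(
    kw.lower()
    for info in papers.values()
    for kw in (info.get("keywords") or [])
)
common_keywords = [kw for kw, _ in keyword_counts.most_common(100)]

# Total keyword count per paper, plus one count column per common keyword.
paper_df["kw_count"] = paper_df["keywords"].apply(
    lambda kws: len(kws) if isinstance(kws, list) else 0
)
for kw in common_keywords:
    paper_df[f"kw_{kw}"] = paper_df["keywords"].apply(
        lambda kws, kw=kw: sum(k.lower() == kw for k in kws) if isinstance(kws, list) else 0
    )
```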
Author information processing
- Parse the author information in each paper and count the number of papers per author.
- Generate author-paper matching features (see the sketch below).
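A sketch of the author-side features; the `authors`/`name` keys inside the paper records are assumptions about the data layout:

```python
from collections import Counter

# How many papers each author name appears on.
name_counts = Counter(
    a["name"].lower()
    for info in papers.values()
    for a in (info.get("authors") or [])
)

# Papers-per-author count for the target author of each training row.
train_df["author_paper_count"] = train_df["author_id"].map(
    lambda aid: name_counts.get(train_authors[aid]["name"].lower(), 0)
)

# Matching feature: does the target author's name appear in the paper's author list?
def name_in_paper(row):
    names = {a["name"].lower() for a in (papers[row["paper_id"]].get("authors") or [])}
    return int(train_authors[row["author_id"]]["name"].lower() in names)

train_df["author_name_in_paper"] = train_df.apply(name_in_paper, axis=1)
```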
Text processing
- Process the title and abstract of each paper and extract text features with several methods (TF-IDF and Word2Vec, for example).
- Use TF-IDF (term frequency-inverse document frequency) to extract text features.
- Generate text embeddings with a Word2Vec model.
- Compute the similarity between title and abstract to generate text-similarity features, as sketched below.
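A sketch of all three text steps; whitespace tokenization and the vector sizes are simplifications:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

titles = paper_df["title"].fillna("").str.lower().tolist()
abstracts = paper_df["abstract"].fillna("").str.lower().tolist()

# TF-IDF features over titles.
tfidf = TfidfVectorizer(max_features=5000)
title_tfidf = tfidf.fit_transform(titles)

# Word2Vec: average word vectors per text.
w2v = Word2Vec([t.split() for t in titles], vector_size=64, window=5, min_count=1, seed=42)

def avg_vec(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(64)

title_emb = np.vstack([avg_vec(t.split()) for t in titles])
abstract_emb = np.vstack([avg_vec(a.split()) for a in abstracts])

# Title-abstract cosine similarity as a feature.
def cos(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

paper_df["title_abstract_sim"] = [cos(t, a) for t, a in zip(title_emb, abstract_emb)]
```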
Paper publication information processing
- Extract the publication year of each paper to generate year features.
- Extract the conference or journal where the paper was published to generate categorical features (see below).
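Year and venue features in two lines; the `year` and `venue` column names are assumptions:

```python
import pandas as pd

paper_df["year"] = pd.to_numeric(paper_df["year"], errors="coerce")
paper_df["venue_cat"] = paper_df["venue"].astype("category").cat.codes  # -1 for missing
```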
Other features
- Combine the features above into interaction features (such as author-keyword and publication-year-keyword interactions).
- Generate embedding features from paper IDs by training a Word2Vec model over them, as sketched below.
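A sketch of the paper-ID embeddings: each author's paper list is treated as a "sentence" (item2vec-style) and Word2Vec is trained over paper IDs. The interaction feature at the end is one illustrative combination:

```python
import numpy as np
from gensim.models import Word2Vec

# Paper-ID "sentences", one per training author (key names as assumed earlier).
pid_sentences = [info["normal_data"] + info["outliers"] for info in train_authors.values()]
pid_w2v = Word2Vec(pid_sentences, vector_size=32, window=10, min_count=1, seed=42)

pid_emb = np.vstack([
    pid_w2v.wv[pid] if pid in pid_w2v.wv else np.zeros(32)
    for pid in train_df["paper_id"]
])

# Example interaction feature: publication year crossed with keyword count.
train_df = train_df.merge(paper_df[["paper_id", "year", "kw_count"]], on="paper_id", how="left")
train_df["year_x_kw_count"] = train_df["year"] * train_df["kw_count"]
```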
Data preparation
- Use stratified K-fold cross-validation to split the training data into training and validation sets (see the sketch below).
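A sketch of the split; the fold count (5) and the excluded ID columns are assumptions:

```python
from sklearn.model_selection import StratifiedKFold

feature_cols = [c for c in train_df.columns if c not in ("author_id", "paper_id", "label")]
X, y = train_df[feature_cols], train_df["label"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
trn_idx, val_idx = next(iter(skf.split(X, y)))  # one fold shown; loop over all in practice
X_trn, X_val = X.iloc[trn_idx], X.iloc[val_idx]
y_trn, y_val = y.iloc[trn_idx], y.iloc[val_idx]
```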
Model selection
- Train LightGBM and XGBoost models separately.
- Use early stopping and logging callbacks to streamline training, as sketched below.
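A sketch with illustrative hyperparameters (not the competition settings), showing early stopping and periodic logging for both libraries:

```python
import lightgbm as lgb
import xgboost as xgb

lgb_model = lgb.LGBMClassifier(n_estimators=5000, learning_rate=0.05)
lgb_model.fit(
    X_trn, y_trn,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)],
)

xgb_model = xgb.XGBClassifier(
    n_estimators=5000, learning_rate=0.05,
    eval_metric="auc", early_stopping_rounds=100,
)
xgb_model.fit(X_trn, y_trn, eval_set=[(X_val, y_val)], verbose=100)
```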
Model training and validation
- Train the LightGBM and XGBoost models on the training set and evaluate them on the validation set.
- Use ROC AUC as the evaluation metric to select the best model (see below).
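Validation-side comparison by ROC AUC:

```python
from sklearn.metrics import roc_auc_score

lgb_auc = roc_auc_score(y_val, lgb_model.predict_proba(X_val)[:, 1])
xgb_auc = roc_auc_score(y_val, xgb_model.predict_proba(X_val)[:, 1])
print(f"LightGBM AUC: {lgb_auc:.4f} | XGBoost AUC: {xgb_auc:.4f}")
```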
Model fusion
- Fuse the predictions of the LightGBM and XGBoost models to improve stability and generalization, as sketched below.
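A minimal blend; the equal weights are a placeholder, and `test_df` is assumed to carry the same feature columns as the training frame:

```python
X_test = test_df[feature_cols]
test_df["score"] = (
    0.5 * lgb_model.predict_proba(X_test)[:, 1]
    + 0.5 * xgb_model.predict_proba(X_test)[:, 1]
)
```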
Model prediction
- Use the fused model to score the test data and output the predictions.
ChatGLM3-32k
- Use the official baseline and modify a few parameters (MAX_SOURCE_LEN, LR, EPOCH)
- Device: 8*A100
cd ChatGLM3
train: bash train.sh
infer: bash test.sh
Llama3-8b
- Adapt the official baseline, adjusting the Dataset and DataCollator to align Llama's inputs and outputs
- Device: 8*A100
cd llama3
train: bash train.sh
infer: bash test.sh
GCN
- From the official baseline
cd GCN
python encoding.py
python build_graph.py --author_dir /data/laiguibin/LLMs/incorrect_assignment_detection/data/IND-WhoIsWho/train_author.json --save_dir /data/laiguibin/LLMs/incorrect_assignment_detection/data/IND-WhoIsWho/train.pkl
python build_graph.py
python train.py
We perform a weighted fusion of the outputs of the models above. The weights are chosen mainly according to the differences between online scores and local model estimates.
cd LGBM
python lgb_xgb.py
python final.py # Model fusion
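A sketch of the final.py-style fusion; the file names, weights, and score format are illustrative, with the real weights tuned against online scores as described above:

```python
import json

outputs = {"lgbm.json": 0.4, "chatglm3.json": 0.2, "llama3.json": 0.2, "gcn.json": 0.2}

fused = {}
for path, weight in outputs.items():
    with open(path) as f:
        scores = json.load(f)  # assumed format: {author_id: {paper_id: score}}
    for aid, pid_scores in scores.items():
        for pid, s in pid_scores.items():
            fused.setdefault(aid, {})
            fused[aid][pid] = fused[aid].get(pid, 0.0) + weight * s

with open("final_result.json", "w") as f:
    json.dump(fused, f)
```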
Parameters: 8,000,000,000
Total video memory (GB): 640
Device: CPU 64C/256G, GPU 8*A100