pip install -r requirements.txt
Load data
- Read training data, including authors and their correctly and incorrectly assigned papers.
- Read detailed information of all papers, including title, author, abstract, keywords, conference or journal, and publication year.
- Read test data, including authors and all of their papers that need to be verified (loading sketch below).
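A minimal loading sketch. `train_author.json` is taken from the GCN commands later in this README; the other two file names (`pid_to_info_all.json`, `ind_test_author.json`) are assumptions about the dataset layout.

```python
import json

DATA_DIR = "data/IND-WhoIsWho"

# Authors with their correctly and incorrectly assigned papers.
with open(f"{DATA_DIR}/train_author.json") as f:
    train_authors = json.load(f)

# Per-paper details: title, authors, abstract, keywords, venue, year (assumed file name).
with open(f"{DATA_DIR}/pid_to_info_all.json") as f:
    papers = json.load(f)

# Authors with the papers that need to be verified (assumed file name).
with open(f"{DATA_DIR}/ind_test_author.json") as f:
    test_authors = json.load(f)
```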
Convert to DataFrame
- Convert the training data from dictionary format to a pandas DataFrame for subsequent processing.
- Process correctly and incorrectly assigned papers separately and add labels (1 for correctly assigned papers, 0 for incorrectly assigned ones).
- Merge the processed data into the final training set, as sketched below.
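A sketch of the conversion, assuming each training-author record stores its correctly and incorrectly assigned papers under `normal_data` and `outliers` keys (the key names are assumptions):

```python
import pandas as pd

rows = []
for author_id, info in train_authors.items():
    for pid in info["normal_data"]:   # correctly assigned papers -> label 1
        rows.append({"author_id": author_id, "paper_id": pid, "label": 1})
    for pid in info["outliers"]:      # incorrectly assigned papers -> label 0
        rows.append({"author_id": author_id, "paper_id": pid, "label": 0})

# Merge both parts into the final training set.
train_df = pd.DataFrame(rows)
```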
Process test data
- Convert the test data to DataFrame format and collect all papers of each author (see the sketch below).
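The same conversion for the test side, assuming each test-author record lists its papers under a `papers` key:

```python
test_df = pd.DataFrame(
    [
        {"author_id": aid, "paper_id": pid}
        for aid, info in test_authors.items()
        for pid in info["papers"]  # "papers" key is an assumption
    ]
)
```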
Process paper details
- Convert the paper-details dictionary to DataFrame format to facilitate subsequent feature engineering, as sketched below.
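A one-step sketch of this conversion; the resulting columns (title, abstract, keywords, ...) mirror whatever fields the per-paper dictionaries contain:

```python
import pandas as pd

# One row per paper, keyed by paper ID.
paper_df = (
    pd.DataFrame.from_dict(papers, orient="index")
    .reset_index()
    .rename(columns={"index": "paper_id"})
)
```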
Feature engineering is the core of this project: we extract useful features from the data with a variety of techniques to improve model performance. We used the following feature engineering techniques:
Keyword processing
- Extract a list of common keywords from the papers' keyword fields.
- Count the occurrences of these keywords in each paper to generate keyword features, as sketched below.
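A sketch of the keyword features; the top-100 cutoff and the `keywords` field name are assumptions:

```python
from collections import Counter

# Vocabulary of the most common keywords across all papers.
keyword_counts = Counter(
    kw.lower()
    for info in papers.values()
    for kw in (info.get("keywords") or [])
)
common_keywords = [kw for kw, _ in keyword_counts.most_common(100)]

# Total keyword count per paper, plus one count column per common keyword.
paper_df["kw_count"] = paper_df["keywords"].apply(
    lambda kws: len(kws) if isinstance(kws, list) else 0
)
for kw in common_keywords:
    paper_df[f"kw_{kw}"] = paper_df["keywords"].apply(
        lambda kws, kw=kw: sum(k.lower() == kw for k in kws) if isinstance(kws, list) else 0
    )
```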
Author information processing
- Parse the author information in each paper and count the number of papers per author.
- Generate author-paper matching features (see the sketch below).
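A sketch of the author-side features; the `authors`/`name` keys inside the paper records are assumptions about the data layout:

```python
from collections import Counter

# How many papers each author name appears on.
name_counts = Counter(
    a["name"].lower()
    for info in papers.values()
    for a in (info.get("authors") or [])
)

# Papers-per-author count for the target author of each training row.
train_df["author_paper_count"] = train_df["author_id"].map(
    lambda aid: name_counts.get(train_authors[aid]["name"].lower(), 0)
)

# Matching feature: does the target author's name appear in the paper's author list?
def name_in_paper(row):
    names = {a["name"].lower() for a in (papers[row["paper_id"]].get("authors") or [])}
    return int(train_authors[row["author_id"]]["name"].lower() in names)

train_df["author_name_in_paper"] = train_df.apply(name_in_paper, axis=1)
```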
Text processing
- Process the title and abstract of each paper and extract text features with several methods (TF-IDF and Word2Vec, for example).
- Use TF-IDF (term frequency-inverse document frequency) to extract text features.
- Generate text embeddings with a Word2Vec model.
- Compute the similarity between title and abstract to generate text-similarity features, as sketched below.
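A sketch of all three text steps; whitespace tokenization and the vector sizes are simplifications:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

titles = paper_df["title"].fillna("").str.lower().tolist()
abstracts = paper_df["abstract"].fillna("").str.lower().tolist()

# TF-IDF features over titles.
tfidf = TfidfVectorizer(max_features=5000)
title_tfidf = tfidf.fit_transform(titles)

# Word2Vec: average word vectors per text.
w2v = Word2Vec([t.split() for t in titles], vector_size=64, window=5, min_count=1, seed=42)

def avg_vec(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(64)

title_emb = np.vstack([avg_vec(t.split()) for t in titles])
abstract_emb = np.vstack([avg_vec(a.split()) for a in abstracts])

# Title-abstract cosine similarity as a feature.
def cos(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

paper_df["title_abstract_sim"] = [cos(t, a) for t, a in zip(title_emb, abstract_emb)]
```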
Paper publication information processing
- Extract the publication year of each paper to generate year features.
- Extract the conference or journal where the paper was published to generate categorical features (see below).
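Year and venue features in two lines; the `year` and `venue` column names are assumptions:

```python
import pandas as pd

paper_df["year"] = pd.to_numeric(paper_df["year"], errors="coerce")
paper_df["venue_cat"] = paper_df["venue"].astype("category").cat.codes  # -1 for missing
```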
Other features
- Combine the features above into interaction features (such as author-keyword and publication-year-keyword interactions).
- Generate embedding features from paper IDs by training a Word2Vec model over them, as sketched below.
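A sketch of the paper-ID embeddings: each author's paper list is treated as a "sentence" (item2vec-style) and Word2Vec is trained over paper IDs. The interaction feature at the end is one illustrative combination:

```python
import numpy as np
from gensim.models import Word2Vec

# Paper-ID "sentences", one per training author (key names as assumed earlier).
pid_sentences = [info["normal_data"] + info["outliers"] for info in train_authors.values()]
pid_w2v = Word2Vec(pid_sentences, vector_size=32, window=10, min_count=1, seed=42)

pid_emb = np.vstack([
    pid_w2v.wv[pid] if pid in pid_w2v.wv else np.zeros(32)
    for pid in train_df["paper_id"]
])

# Example interaction feature: publication year crossed with keyword count.
train_df = train_df.merge(paper_df[["paper_id", "year", "kw_count"]], on="paper_id", how="left")
train_df["year_x_kw_count"] = train_df["year"] * train_df["kw_count"]
```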
Data preparation
- Use stratified K-fold cross-validation to split the training data into training and validation sets (see the sketch below).
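A sketch of the split; the fold count (5) and the excluded ID columns are assumptions:

```python
from sklearn.model_selection import StratifiedKFold

feature_cols = [c for c in train_df.columns if c not in ("author_id", "paper_id", "label")]
X, y = train_df[feature_cols], train_df["label"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
trn_idx, val_idx = next(iter(skf.split(X, y)))  # one fold shown; loop over all in practice
X_trn, X_val = X.iloc[trn_idx], X.iloc[val_idx]
y_trn, y_val = y.iloc[trn_idx], y.iloc[val_idx]
```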
Model selection
- Train LightGBM and XGBoost models separately.
- Use early stopping and logging callbacks to streamline training, as sketched below.
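A sketch with illustrative hyperparameters (not the competition settings), showing early stopping and periodic logging for both libraries:

```python
import lightgbm as lgb
import xgboost as xgb

lgb_model = lgb.LGBMClassifier(n_estimators=5000, learning_rate=0.05)
lgb_model.fit(
    X_trn, y_trn,
    eval_set=[(X_val, y_val)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)],
)

xgb_model = xgb.XGBClassifier(
    n_estimators=5000, learning_rate=0.05,
    eval_metric="auc", early_stopping_rounds=100,
)
xgb_model.fit(X_trn, y_trn, eval_set=[(X_val, y_val)], verbose=100)
```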
Model training and validation
- Train the LightGBM and XGBoost models on the training set and evaluate them on the validation set.
- Use ROC AUC as the evaluation metric to select the best model (see below).
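Validation-side comparison by ROC AUC:

```python
from sklearn.metrics import roc_auc_score

lgb_auc = roc_auc_score(y_val, lgb_model.predict_proba(X_val)[:, 1])
xgb_auc = roc_auc_score(y_val, xgb_model.predict_proba(X_val)[:, 1])
print(f"LightGBM AUC: {lgb_auc:.4f} | XGBoost AUC: {xgb_auc:.4f}")
```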
Model fusion
- Fuse the predictions of the LightGBM and XGBoost models to improve stability and generalization, as sketched below.
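A minimal blend; the equal weights are a placeholder, and `test_df` is assumed to carry the same feature columns as the training frame:

```python
X_test = test_df[feature_cols]
test_df["score"] = (
    0.5 * lgb_model.predict_proba(X_test)[:, 1]
    + 0.5 * xgb_model.predict_proba(X_test)[:, 1]
)
```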
Model prediction
- Use the fused model to score the test data and output the predictions.
ChatGLM3-32k
- Use the official baseline and modify a few parameters (MAX_SOURCE_LEN, LR, EPOCH)
- Device: 8*A100
cd ChatGLM3
train: bash train.sh
infer: bash test.sh
Llama3-8b
- Adapt the official baseline, adjusting the Dataset and DataCollator to align Llama's inputs and outputs
- Device: 8*A100
cd llama3
train: bash train.sh
infer: bash test.sh
GCN
- From the official baseline
cd GCN
python encoding.py
python build_graph.py --author_dir /data/laiguibin/LLMs/incorrect_assignment_detection/data/IND-WhoIsWho/train_author.json --save_dir /data/laiguibin/LLMs/incorrect_assignment_detection/data/IND-WhoIsWho/train.pkl
python build_graph.py
python train.py
We perform a weighted fusion of the outputs of the models above. The weights are chosen mainly according to the differences between online scores and local model estimates.
cd LGBM
python lgb_xgb.py
python final.py # Model fusion
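A sketch of the final.py-style fusion; the file names, weights, and score format are illustrative, with the real weights tuned against online scores as described above:

```python
import json

outputs = {"lgbm.json": 0.4, "chatglm3.json": 0.2, "llama3.json": 0.2, "gcn.json": 0.2}

fused = {}
for path, weight in outputs.items():
    with open(path) as f:
        scores = json.load(f)  # assumed format: {author_id: {paper_id: score}}
    for aid, pid_scores in scores.items():
        for pid, s in pid_scores.items():
            fused.setdefault(aid, {})
            fused[aid][pid] = fused[aid].get(pid, 0.0) + weight * s

with open("final_result.json", "w") as f:
    json.dump(fused, f)
```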
Parameters: 8,000,000,000
Total video memory (GB): 640
Device: CPU 64C/256G, GPU 8*A100