
Why is there no use of MI or contrastive learning in the training code? #2

Closed
GYee opened this issue Jan 13, 2024 · 2 comments

Comments

@GYee

GYee commented Jan 13, 2024

No description provided.

@xuyaoxun
Collaborator

Thank you for your interest in our project! I'd like to address your question with two main points:

  1. Our training process is divided into two stages. The first stage is the pre-training of the Q-former, which is where contrastive learning and mutual information (MI) are used. The second stage is the overall training, in which the Q-former is fine-tuned again using the output of LLaMA. For simplicity and clarity, this repository only provides the second stage. Since the Q-former pre-trained in the first stage still needs to be fine-tuned in the second stage, we believe it is more useful to provide a pre-trained checkpoint and let users fine-tune it on their own data. Moreover, the method used in the first stage is tied directly to our training dataset: we partition the data according to manually annotated speech emotion labels and use this partition for the MI and contrastive learning objectives. Unfortunately, due to certain restrictions, I cannot release the complete training dataset, and preparing the data for the first stage is more demanding than for the second stage.
  2. If you would like to learn more about contrastive learning and MI training, I recommend the following resources: CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information (https://proceedings.mlr.press/v119/cheng20b.html) and its GitHub repository (https://github.com/Linear95/CLUB), as well as Learning Transferable Visual Models From Natural Language Supervision (https://arxiv.org/abs/2103.00020) and its GitHub repository (https://github.com/openai/CLIP). A minimal sketch of how both losses could look is included after this list.
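
For readers who want a concrete starting point, here is a minimal PyTorch sketch of the two losses referenced above: a CLUB-style MI upper bound and a CLIP-style symmetric contrastive loss. The class/function names, embedding dimensions, and the way they would plug into the (unreleased) stage-1 Q-former pre-training are assumptions; the CLUB estimator follows the reference implementation linked above.

```python
# Sketch only: a CLUB MI upper bound plus a CLIP-style contrastive loss.
# How these attach to the Q-former is an assumption; stage-1 code is not in this repo.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLUB(nn.Module):
    """CLUB MI upper bound, following https://github.com/Linear95/CLUB.

    A variational network q(y|x) (Gaussian with learned mean / log-variance) is
    fitted with learning_loss(); the bound returned by forward() is what the
    main model minimizes to reduce mutual information between x and y.
    """

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 512):
        super().__init__()
        self.p_mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.p_logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim), nn.Tanh())

    def forward(self, x, y):
        # CLUB estimate: E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)]
        mu, logvar = self.p_mu(x), self.p_logvar(x)
        positive = -((mu - y) ** 2) / 2.0 / logvar.exp()                      # matched (x_i, y_i) pairs
        negative = -((y.unsqueeze(0) - mu.unsqueeze(1)) ** 2).mean(dim=1) / 2.0 / logvar.exp()  # all (x_i, y_j) pairs
        return (positive.sum(-1) - negative.sum(-1)).mean()

    def learning_loss(self, x, y):
        # Negative log-likelihood for fitting q(y|x); optimized in a separate step.
        mu, logvar = self.p_mu(x), self.p_logvar(x)
        return (((mu - y) ** 2) / logvar.exp() + logvar).sum(-1).mean()


def clip_contrastive_loss(emb_a, emb_b, temperature: float = 0.07):
    """Symmetric InfoNCE over paired embeddings, as in CLIP."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature                  # [N, N] similarity matrix
    labels = torch.arange(emb_a.size(0), device=emb_a.device)  # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    # Toy check with random "speech" and "text" embeddings of a hypothetical Q-former.
    speech_emb, text_emb = torch.randn(8, 256), torch.randn(8, 256)
    club = CLUB(256, 256)
    print(clip_contrastive_loss(speech_emb, text_emb).item(), club(speech_emb, text_emb).item())
```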

@lucashueda

Excuse me, regarding the first-stage training: in Figure 3 of the paper, both the transcription embeddings and the caption embeddings are passed through a Q-Former block. Does that mean both outputs are Q-Former-like, i.e. the shape of the Q-Embedding == T-Embedding == C-Embedding?
