Inquiry about training setting #8

Closed
Nyquist0 opened this issue Sep 26, 2024 · 4 comments

Comments

@Nyquist0

Hi Sir or Madam,

Thanks for sharing your great work.
I would like to ask two questions about your training setup.

  1. I found that the identity consistency between your generated frames and the reference image is higher than Hallo's. In Hallo's results, the chin in the generated frames becomes stronger than in the reference image. May I ask why?
  2. Referring to your paper, the jdh-hallo dataset appears to be entirely front-view, yet the side-view test results are also good. Why does this happen?

Looking forward to your reply.
Thanks

@DBDXSS
Member

DBDXSS commented Sep 26, 2024

Thanks for your questions; they are valuable.

  1. As mentioned in the paper, we propose a semi-decoupled structure. It captures inter-feature relationships among the lip, expression, and pose features, which makes the results more vivid (see the illustrative sketch after this list).

  2. Actually, jdh-Hallo is not entirely front-view; we just show one frame of each video.
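
To make point 1 concrete, here is a minimal sketch (not the repository's actual code; the module and tensor names are assumptions) of one way a semi-decoupled cross attention could let the visual latents attend jointly to the lip, expression, and pose audio features, so relationships across the three groups are modeled in a single attention pass:

```python
# Illustrative sketch only: a "semi-decoupled" audio-visual cross attention where the
# visual latents attend to the concatenation of lip, expression, and pose audio features.
# All names (SemiDecoupledCrossAttention, tensor shapes) are assumptions for illustration.
import torch
import torch.nn as nn


class SemiDecoupledCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, lip, expression, pose):
        # visual: (B, N_v, C); lip / expression / pose: (B, N_a, C) audio features
        audio = torch.cat([lip, expression, pose], dim=1)   # one shared key/value set
        out, _ = self.attn(query=visual, key=audio, value=audio)
        return self.norm(visual + out)                      # residual connection


if __name__ == "__main__":
    B, Nv, Na, C = 2, 64, 16, 256
    block = SemiDecoupledCrossAttention(C)
    y = block(torch.randn(B, Nv, C), torch.randn(B, Na, C),
              torch.randn(B, Na, C), torch.randn(B, Na, C))
    print(y.shape)  # torch.Size([2, 64, 256])
```

Because the keys and values from all three groups share one softmax, the attention weights for lip, expression, and pose are normalized against each other, which is one plausible reading of "capturing inter-feature relationships".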

@Nyquist0
Author

Thanks for your quick reply.

  1. But referring to your paper, I think the semi-decoupled structure you propose is the same as the Hierarchical Audio-Visual Cross Attention in Hallo. That should not lead to different test results. Is there any misunderstanding here?
  2. Got it, thanks. So the dataset is similar to MEAD, but without emotions.

Best~

@DBDXSS
Member

DBDXSS commented Sep 26, 2024

But if you read the code in Hallo's repository, you will find that they are different. There is a misleading description in Hallo's paper: Hallo's Hierarchical Audio-Visual Cross Attention is actually a fully-decoupled structure.
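
For contrast, here is a minimal sketch (again only an illustration, not Hallo's code; the names are assumptions) of a fully-decoupled structure, where each audio feature group gets its own cross attention and the per-group outputs are only merged afterwards, so no attention weights are shared across lip, expression, and pose:

```python
# Illustrative sketch only: a "fully-decoupled" cross attention with one separate
# attention module per audio feature group. Names and shapes are assumptions.
import torch
import torch.nn as nn


class FullyDecoupledCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.lip_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.exp_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pose_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, lip, expression, pose):
        # Each audio stream is attended independently; interactions across streams
        # only happen in the final residual sum, not inside the attention itself.
        lip_out, _ = self.lip_attn(visual, lip, lip)
        exp_out, _ = self.exp_attn(visual, expression, expression)
        pose_out, _ = self.pose_attn(visual, pose, pose)
        return self.norm(visual + lip_out + exp_out + pose_out)
```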

@Nyquist0
Author

Got it. Thanks for the explanation~

@DBDXSS DBDXSS pinned this issue Sep 27, 2024
@DBDXSS DBDXSS closed this as completed Sep 27, 2024