To combine the sound and image input streams, concatenation is used: the two transformed sound images are concatenated with each other, and the result is concatenated with the image input. After this concatenation, GAN training begins: the generator tries to generate better images while the discriminator judges the output. After training, the network runs the reverse path to produce deconvolved images. The last step is the de-concatenation of the sound images and their retransformation into a .wav file, so the sound can be played.
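As a rough illustration of this pipeline, the sketch below shows a channel-wise concatenation and de-concatenation plus a retransformation of the sound image back into a .wav file. All function names, shapes, and library choices (NumPy, librosa with Griffin-Lim as a stand-in inverse transform, soundfile) are assumptions for illustration, not the project's actual implementation:

```python
# Minimal sketch of the concatenate -> GAN -> de-concatenate -> .wav pipeline.
# Names, shapes, and libraries here are assumptions; the repository may differ.
import numpy as np
import librosa
import soundfile as sf

def concat_streams(sound_image: np.ndarray, image: np.ndarray) -> np.ndarray:
    """Stack the transformed sound image and the visual image along the channel axis."""
    return np.concatenate([sound_image, image], axis=-1)

def deconcat_streams(combined: np.ndarray, sound_channels: int):
    """Reverse the concatenation: split the output back into sound and image parts."""
    return combined[..., :sound_channels], combined[..., sound_channels:]

def sound_image_to_wav(sound_image: np.ndarray, path: str, sr: int = 22050) -> None:
    """Treat the 2-D sound image as a magnitude spectrogram and invert it to audio.

    Griffin-Lim is used here as a placeholder for whatever inverse transform
    the project actually applies to its sound images.
    """
    audio = librosa.griffinlim(sound_image)
    sf.write(path, audio, sr)

# Example: combine a 128x128 single-channel sound image with an RGB image,
# then undo the concatenation and write the sound half back to disk.
sound_img = np.abs(np.random.randn(128, 128, 1)).astype(np.float32)
rgb_img = np.random.rand(128, 128, 3).astype(np.float32)
combined = concat_streams(sound_img, rgb_img)           # shape (128, 128, 4)
sound_out, image_out = deconcat_streams(combined, 1)    # reverse the concatenation
sound_image_to_wav(sound_out[..., 0], "generated.wav")  # back to a playable .wav
```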
For more information, read the papers:
https://github.com/markus-weiss/AVGAN/blob/master/Forschungsprojekt%20(2).pdf
https://github.com/markus-weiss/AVGAN/blob/master/AVGAN_Paper.pdf

Colab notebook:
https://colab.research.google.com/drive/1bxOf8m8MZAOuuih5zz3Wk3taASdY-U2t