We learn a score function (closely related to the direction of the added noise) and use it to recover the target distribution.
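A minimal sketch of that idea, assuming a DDPM-style noise-prediction parameterization (the network predicts the added noise, which equals the score up to scaling); `score_net`, `T`, and the schedule below are illustrative names, not from the talk:

```python
import torch

T = 100                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # simple linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(score_net, x0):
    """x0: clean samples from the target distribution, shape (batch, dim)."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward noising
    pred = score_net(x_t, t)               # network predicts the added noise
    # epsilon-prediction is score matching up to scaling:
    # score(x_t) ~= -epsilon / sqrt(1 - a_bar)
    return ((pred - noise) ** 2).mean()
```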
Empirical successes:
- handling multimodal action distributions
- being suitable for high-dimensional action spaces
- exhibiting impressive training stability
Overall objective: use a diffuser as a powerful distribution matching tool for control and planning problems.
Where do we need distribution matching?
- Imitation learning: match the expert's action distribution (GAIL does this via adversarial training; with a diffusion model the matching becomes more stable, especially in the multi-task setting; see the sketch after this list)
- Offline reinforcement learning: match the policy's action distribution (the model must be expressive enough to capture the policy's distribution while not deviating too far from the behavior (data-collecting) policy's distribution, otherwise extrapolation error arises)
  - challenge: the extrapolation error problem
  - current solutions: penalize/constrain OOD samples -> over-conservative
- Model-based reinforcement learning: match the dynamics model (needs to work over long horizons) and, sometimes, the policy's action distribution
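For the imitation-learning case, a hedged sketch of diffusion-based behavior cloning: the same noise-prediction loss as above, but conditioned on the observation, so the model matches the expert's (possibly multimodal) action distribution without an adversarial game. `policy_net` and the data layout are assumptions; in offline RL the same loss is often combined with a value term so the policy stays close to the dataset distribution.

```python
import torch

T = 100
alphas_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

def bc_diffusion_loss(policy_net, obs, expert_action):
    """obs: (batch, obs_dim); expert_action: (batch, act_dim) from demonstrations."""
    t = torch.randint(0, T, (obs.shape[0],))
    noise = torch.randn_like(expert_action)
    a_bar = alphas_bar[t].unsqueeze(-1)
    noisy_action = a_bar.sqrt() * expert_action + (1.0 - a_bar).sqrt() * noise
    pred = policy_net(noisy_action, t, obs)   # condition on the observation
    return ((pred - noise) ** 2).mean()       # plain regression: no adversarial training
```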
Why does diffusion work here?
- non-autoregressive (no sequential dependency): compounding error is not an issue, yet with certain architecture choices it can still generate sequences of arbitrary length
- multimodal: can handle multimodal action distributions
- matching the distribution: can match the expert's action distribution directly
- high capacity + high expressiveness: can handle high-dimensional action spaces -> foundation models, 50 demonstrations per task
- smoothness: the generated action sequences are smooth
Things to diffuse:
- in images: 2D pixel values
- in control: a 1D control/trajectory sequence
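An illustrative layout of what gets diffused in control, with made-up dimensions: instead of a 2D grid of pixels, one sample is a 1D sequence over time with states and actions stacked per timestep.

```python
import numpy as np

horizon, obs_dim, act_dim = 32, 11, 3                  # made-up sizes
trajectory = np.zeros((horizon, obs_dim + act_dim))    # one "sample" for the diffuser
image = np.zeros((64, 64, 3))                          # an image sample, by contrast
```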
Architecture:
- temporal convolutional network (TCN)
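A minimal temporal residual block in the spirit of such an architecture, using 1D convolutions over the time axis of a trajectory laid out as (batch, channels = obs_dim + act_dim, horizon); layer sizes are illustrative and the diffusion-timestep embedding is omitted for brevity.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """1D conv residual block operating along the trajectory's time axis."""
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, padding=pad)
        self.act = nn.Mish()
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        # x: (batch, in_ch, horizon) -- the whole horizon is convolved jointly
        h = self.act(self.conv1(x))
        h = self.act(self.conv2(h))
        return h + self.skip(x)
```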
How do we condition it on a certain objective?
- guidance function: directly shift the sampling distribution using a cost or a learned value function, etc.
- inpainting: fix known parts of the sequence and fill in the missing parts, which constrains that part of the distribution
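A hedged sketch putting both mechanisms inside one reverse-diffusion sampler: guidance nudges samples along the gradient of a value/cost function, and inpainting clamps known entries (e.g. the current state or a goal) so the rest of the trajectory is filled in around them. `model`, `value_fn`, `known`, and `mask` are placeholder names, and the deterministic update is a DDIM-style simplification.

```python
import torch

T = 100
alphas_bar = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, T), dim=0)

@torch.no_grad()
def guided_inpaint_sample(model, value_fn, known, mask, shape, guide_scale=0.1):
    """known/mask have the same shape as a sample; mask is 1 where entries are fixed."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        a_bar = alphas_bar[t]
        a_bar_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)
        # guidance: gradient of a (learned or hand-designed) value w.r.t. the noisy sample
        with torch.enable_grad():
            x_in = x.detach().requires_grad_(True)
            grad = torch.autograd.grad(value_fn(x_in).sum(), x_in)[0]
        eps = model(x, torch.tensor([t]))
        x0_hat = (x - (1.0 - a_bar).sqrt() * eps) / a_bar.sqrt()       # predicted clean sample
        x = a_bar_prev.sqrt() * x0_hat + (1.0 - a_bar_prev).sqrt() * eps
        x = x + guide_scale * grad                  # shift the distribution toward high value
        x = mask * known + (1.0 - mask) * x         # inpaint: keep constrained entries fixed
    return x
```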