taking the past into account - why?
- Compare: `GPTVer1`-generated completion vs `GPTVer2`-generated completion
- What difference do you notice?
- Why is there a difference?
vectorizing for loops - why?
- How is `HeadVer2` logically the same as `HeadVer1`?
- Why is `HeadVer2` faster than `HeadVer1`?
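
A minimal NumPy sketch may help when exploring the two questions above. It assumes, since the classes themselves are not shown here, that `HeadVer1` averages each token with its past using explicit for loops and that `HeadVer2` gets the same result from a single matrix multiplication with a lower-triangular weight matrix. The function names `average_past_loop` and `average_past_matmul` are hypothetical and stand in for the two versions.

```python
import numpy as np

def average_past_loop(x):
    """Average each token with all earlier tokens using explicit loops.

    x has shape (B, T, C): batch, time, channels.
    (Assumed to mirror the idea behind HeadVer1.)
    """
    B, T, C = x.shape
    out = np.zeros_like(x)
    for b in range(B):
        for t in range(T):
            # mean over positions 0..t of sequence b
            out[b, t] = x[b, : t + 1].mean(axis=0)
    return out

def average_past_matmul(x):
    """Same averaging, written as one matrix multiplication.

    (Assumed to mirror the idea behind HeadVer2.)
    """
    B, T, C = x.shape
    wei = np.tril(np.ones((T, T)))              # row t keeps positions 0..t
    wei = wei / wei.sum(axis=1, keepdims=True)  # each row sums to 1
    return wei @ x                              # (T, T) @ (B, T, C) -> (B, T, C)

x = np.random.default_rng(0).normal(size=(4, 8, 2))
print(np.allclose(average_past_loop(x), average_past_matmul(x)))
```

Both functions compute the same numbers; the second hands the whole computation to one optimized matrix-multiply routine instead of running `B * T` small Python-level steps, which is the usual reason a vectorized version runs faster.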
taking the past into account with masking & normalization - how?
- How is `HeadVer3` logically the same as `HeadVer1`?
- Why mask `wei` with `-inf`? Why not `0`?
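
The masking question can be checked with a small experiment. The sketch below assumes that `wei` is a `(T, T)` matrix of scores whose future positions (above the diagonal) are filled in before a row-wise softmax; the helper names `softmax` and `masked_weights` and the example scores are made up for illustration and are not taken from `HeadVer3`.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def masked_weights(scores, fill):
    """Replace future positions (above the diagonal) with `fill`, then normalize."""
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    return softmax(np.where(future, fill, scores))

# Hypothetical scores for a sequence of three tokens.
scores = np.array([[0.5, 2.0, -1.0],
                   [1.0, 0.3, 0.7],
                   [-0.2, 0.1, 0.4]])

wei_inf = masked_weights(scores, -np.inf)  # future weights become exactly 0
wei_zero = masked_weights(scores, 0.0)     # future weights stay positive

print(wei_inf)
print(wei_zero)
```

Since `exp(-inf)` is `0`, filling with `-inf` removes future tokens from the weighted average entirely. Filling with `0` leaves `exp(0) = 1` in each masked slot, so after normalization the future tokens still receive a share of the weight.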
...