[SPARK-46182][CORE] Track lastTaskFinishTime using the exact task finished event

### What changes were proposed in this pull request?

We found a race condition between lastTaskRunningTime and lastShuffleMigrationTime that could lead to a decommissioned executor exiting before all the shuffle blocks have been discovered. The issue could lead to an immediate task retry right after an executor exits, and thus longer query execution time.

To fix the issue, we update lastTaskRunningTime only when a task updates its status to finished through the StatusUpdate event. This is better than the current approach (which uses a thread to check the number of running tasks every second), because this way we know clearly whether the shuffle block refresh happened after all tasks finished running, which resolves the race condition mentioned above.

### Why are the changes needed?

To fix a race condition that could lead to shuffle data loss, and thus longer query execution time.

### How was this patch tested?

This is a very subtle race condition, and it is hard to write a unit test for it with the current unit test framework. We are confident the change is low risk, so it was verified only by passing all the existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#44090 from jiangxb1987/SPARK-46182.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
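
The ordering check described above can be illustrated with a small toy model. This is a minimal sketch, not Spark's actual code: the class `DecommissionTracker`, its method names, and the explicit integer timestamps are all hypothetical, chosen only to show why recording the finish time at the task-finished event makes the comparison with the last shuffle refresh unambiguous.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class DecommissionTracker:
    """Toy model of a decommissioning executor (hypothetical, not Spark's API)."""

    running_tasks: Set[int] = field(default_factory=set)
    last_task_finish_time: int = 0       # set only when a task reports "finished"
    last_shuffle_refresh_time: int = 0   # set when shuffle blocks are re-scanned

    def on_task_started(self, task_id: int) -> None:
        self.running_tasks.add(task_id)

    def on_status_update(self, task_id: int, finished: bool, now: int) -> None:
        # Record the finish time at the event itself, rather than from a
        # periodic poll of the running-task count.
        if finished:
            self.running_tasks.discard(task_id)
            self.last_task_finish_time = now

    def on_shuffle_refresh(self, now: int) -> None:
        self.last_shuffle_refresh_time = now

    def can_exit(self) -> bool:
        # Assumed rule for this sketch: exit is safe only if no task is running
        # and the last shuffle refresh happened after the last task finished.
        return (not self.running_tasks
                and self.last_shuffle_refresh_time > self.last_task_finish_time)


tracker = DecommissionTracker()
tracker.on_task_started(1)
tracker.on_shuffle_refresh(now=10)                  # refresh while task 1 still runs
tracker.on_status_update(1, finished=True, now=12)  # task 1 may have written new blocks
early = tracker.can_exit()                          # refresh at 10 predates finish at 12
tracker.on_shuffle_refresh(now=15)                  # refresh after the last finish
late = tracker.can_exit()
```

In this model, a refresh that predates the last task finish never permits an exit, which is the situation the commit message describes as the source of undiscovered shuffle blocks.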