Fix global linear indexing (fill!) #496
base: main
Metal Benchmarks
| Benchmark suite | Current: 28fe952 | Previous: 634cec7 | Ratio |
|---|---|---|---|
| private array/construct | 27387 ns | 26887 ns | 1.02 |
| private array/broadcast | 460417 ns | 462333 ns | 1.00 |
| private array/random/randn/Float32 | 806479.5 ns | 865083 ns | 0.93 |
| private array/random/randn!/Float32 | 631125 ns | 658875 ns | 0.96 |
| private array/random/rand!/Int64 | 566333 ns | 548542 ns | 1.03 |
| private array/random/rand!/Float32 | 598083.5 ns | 584292 ns | 1.02 |
| private array/random/rand/Int64 | 750791 ns | 756500 ns | 0.99 |
| private array/random/rand/Float32 | 617292 ns | 605542 ns | 1.02 |
| private array/copyto!/gpu_to_gpu | 683708 ns | 658041 ns | 1.04 |
| private array/copyto!/cpu_to_gpu | 651791.5 ns | 701083 ns | 0.93 |
| private array/copyto!/gpu_to_cpu | 822896 ns | 824854.5 ns | 1.00 |
| private array/accumulate/1d | 1313500 ns | 1325958.5 ns | 0.99 |
| private array/accumulate/2d | 1381083 ns | 1382792 ns | 1.00 |
| private array/iteration/findall/int | 2062187 ns | 2067791 ns | 1.00 |
| private array/iteration/findall/bool | 1816000 ns | 1825375 ns | 0.99 |
| private array/iteration/findfirst/int | 1688250 ns | 1674750 ns | 1.01 |
| private array/iteration/findfirst/bool | 1645416 ns | 1646437.5 ns | 1.00 |
| private array/iteration/scalar | 3873833 ns | 3884646 ns | 1.00 |
| private array/iteration/logical | 3163875 ns | 3164458 ns | 1.00 |
| private array/iteration/findmin/1d | 1734833.5 ns | 1740125 ns | 1.00 |
| private array/iteration/findmin/2d | 1348875 ns | 1346458 ns | 1.00 |
| private array/reductions/reduce/1d | 1034000 ns | 1020729 ns | 1.01 |
| private array/reductions/reduce/2d | 651208.5 ns | 664250 ns | 0.98 |
| private array/reductions/mapreduce/1d | 1033791 ns | 1032125 ns | 1.00 |
| private array/reductions/mapreduce/2d | 658334 ns | 659542 ns | 1.00 |
| private array/permutedims/4d | 2540500 ns | 2720542 ns | 0.93 |
| private array/permutedims/2d | 1011000 ns | 1011208 ns | 1.00 |
| private array/permutedims/3d | 1579959 ns | 1574854 ns | 1.00 |
| private array/copy | 603542 ns | 557250 ns | 1.08 |
| latency/precompile | 5146918500 ns | 5138248250 ns | 1.00 |
| latency/ttfp | 6634638146 ns | 6754936625 ns | 0.98 |
| latency/import | 1162510500 ns | 1151697916.5 ns | 1.01 |
| integration/metaldevrt | 712750 ns | 719667 ns | 0.99 |
| integration/byval/slices=1 | 1567645.5 ns | 1560604.5 ns | 1.00 |
| integration/byval/slices=3 | 10250000 ns | 10389833 ns | 0.99 |
| integration/byval/reference | 1546208 ns | 1566416.5 ns | 0.99 |
| integration/byval/slices=2 | 2583708 ns | 2542084 ns | 1.02 |
| kernel/indexing | 459667 ns | 487187.5 ns | 0.94 |
| kernel/indexing_checked | 451250 ns | 469791.5 ns | 0.96 |
| kernel/launch | 9895.833333333332 ns | 8042 ns | 1.23 |
| metal/synchronization/stream | 14708 ns | 14125 ns | 1.04 |
| metal/synchronization/context | 15250 ns | 14542 ns | 1.05 |
| shared array/construct | 26496.583333333336 ns | 26607.14285714286 ns | 1.00 |
| shared array/broadcast | 470917 ns | 453208 ns | 1.04 |
| shared array/random/randn/Float32 | 820708 ns | 790125 ns | 1.04 |
| shared array/random/randn!/Float32 | 666750 ns | 668750 ns | 1.00 |
| shared array/random/rand!/Int64 | 564875 ns | 573042 ns | 0.99 |
| shared array/random/rand!/Float32 | 590042 ns | 596500 ns | 0.99 |
| shared array/random/rand/Int64 | 771042 ns | 780209 ns | 0.99 |
| shared array/random/rand/Float32 | 590292 ns | 618625 ns | 0.95 |
| shared array/copyto!/gpu_to_gpu | 86583 ns | 87334 ns | 0.99 |
| shared array/copyto!/cpu_to_gpu | 88292 ns | 98084 ns | 0.90 |
| shared array/copyto!/gpu_to_cpu | 82375 ns | 77041 ns | 1.07 |
| shared array/accumulate/1d | 1325583.5 ns | 1330458 ns | 1.00 |
| shared array/accumulate/2d | 1384459 ns | 1388291 ns | 1.00 |
| shared array/iteration/findall/int | 1801250 ns | 1768750 ns | 1.02 |
| shared array/iteration/findall/bool | 1564583 ns | 1573208 ns | 0.99 |
| shared array/iteration/findfirst/int | 1384375 ns | 1392333 ns | 0.99 |
| shared array/iteration/findfirst/bool | 1364167 ns | 1363479.5 ns | 1.00 |
| shared array/iteration/scalar | 158270.5 ns | 152500 ns | 1.04 |
| shared array/iteration/logical | 2951000 ns | 2956500 ns | 1.00 |
| shared array/iteration/findmin/1d | 1471166.5 ns | 1459687.5 ns | 1.01 |
| shared array/iteration/findmin/2d | 1370666.5 ns | 1350209 ns | 1.02 |
| shared array/reductions/reduce/1d | 732833 ns | 718916 ns | 1.02 |
| shared array/reductions/reduce/2d | 653000 ns | 670875 ns | 0.97 |
| shared array/reductions/mapreduce/1d | 732583 ns | 722583 ns | 1.01 |
| shared array/reductions/mapreduce/2d | 667542 ns | 669292 ns | 1.00 |
| shared array/permutedims/4d | 2555500 ns | 2722021 ns | 0.94 |
| shared array/permutedims/2d | 1021812.5 ns | 1005625 ns | 1.02 |
| shared array/permutedims/3d | 1586041 ns | 1573479 ns | 1.01 |
| shared array/copy | 242416 ns | 248062.5 ns | 0.98 |
This comment was automatically generated by workflow using github-action-benchmark.
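
For readers checking the numbers: the Ratio column appears to be the current timing divided by the previous one, rounded to two decimals, so values above 1.00 mean the current commit was slower on that benchmark. The Python sketch below reproduces a few rows under that assumption. The helper name `ratio` is hypothetical and is not part of github-action-benchmark.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current/previous rounded to two decimals.

    Assumption: this mirrors how the Ratio column is derived;
    a value above 1.0 means the current commit was slower.
    """
    return round(current_ns / previous_ns, 2)


# A few (current, previous) pairs copied from the benchmark table.
rows = {
    "private array/construct": (27387, 26887),
    "private array/copy": (603542, 557250),
    "kernel/launch": (9895.833333333332, 8042),
    "shared array/copyto!/cpu_to_gpu": (88292, 98084),
}

for name, (current, previous) in rows.items():
    print(f"{name}: {ratio(current, previous):.2f}")
```

Under this reading, `kernel/launch` (1.23) is the largest slowdown in the table and `shared array/copyto!/cpu_to_gpu` (0.90) the largest speedup. With timings in the microsecond range, differences of a few percent may well be measurement noise.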
What is the underlying issue here? The blame on CUDA.jl's implementation goes back to when we imported that code from KA.jl, so cc @vchuravy.
I think the real fix may be to switch from However, this would be a very big (potentially breaking) change.
I opened #497 for discussion. In the meantime, I think this PR should be merged as-is (assuming the code is sound).
Implementation borrowed from CUDA.jl version.
Close #466