[ET-VK][ez] Fix 8 bit linear compute shader dispatch #9531
Conversation
## Context

Currently, for the `q_8w_linear` shader, both the texture and the buffer variants use the same global and local work group settings. Specifically, the global work group is set to `{out.numel(), 1, 1}` and the local work group is set to `{64, 1, 1}`. However, I believe this results in very poor memory re-use for the texture shader. In this configuration:

* Within a work group, each invocation requests a different row of A, so 64 rows of A are requested in total
* All invocations within a work group request the same row of B
* One work group therefore loads 65 unique rows from A and B combined

Compare this to a local work group size of `{8, 8, 1}`:

* Across the work group, 8 rows are loaded from A and 8 rows are loaded from B
* One work group loads 16 unique rows total from A and B

Evidently, there is better memory re-use in the latter configuration, since fewer unique rows are loaded.

## Changes

Modify the `q_8w_linear` shader to use a `{8, 8, 1}` local work group if possible. If `M` is small, instead use `{4, 16, 1}` or `{2, 32, 1}` to reduce the number of inactive invocations.

Differential Revision: [D71706489](https://our.internmc.facebook.com/intern/diff/D71706489/)
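The row-counting argument above can be sketched with a small, hypothetical cost model (not code from this diff): a work group that computes an `a_rows x b_rows` tile of the output must fetch `a_rows` unique rows of A and `b_rows` unique rows of B.

```python
def unique_rows_loaded(a_rows, b_rows):
    # Hypothetical model: a work group computing an a_rows x b_rows output
    # tile loads a_rows unique rows of A plus b_rows unique rows of B.
    return a_rows + b_rows

# Original {64, 1, 1} local work group: 64 distinct rows of A, 1 shared row of B.
assert unique_rows_loaded(64, 1) == 65

# Proposed {8, 8, 1} local work group: same 64 outputs, far fewer unique rows.
assert unique_rows_loaded(8, 8) == 16
```

Both shapes cover 64 output elements per work group; the square tile simply minimizes the sum of the tile's side lengths, which is what drives the number of unique rows fetched.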
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9531

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit a3a3d85 with merge base 7159650, the following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 273548740
Pull Request resolved: #9531
This pull request was exported from Phabricator. Differential Revision: D71706489
Pull Request resolved: #9531
ghstack-source-id: 274198011
Pull Request resolved: #9531
ghstack-source-id: 274260277
Merged commit e918ec2 into gh/SS-JIA/200/base
Stack from ghstack (oldest at bottom):
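The local work group selection described in the Changes section could be sketched as follows. This is a hypothetical Python rendering of the heuristic; the actual implementation is in the ExecuTorch Vulkan backend, and the exact thresholds here are assumptions, not taken from the diff.

```python
def pick_local_wg(M):
    # Hypothetical sketch of the selection heuristic: prefer a square
    # {8, 8, 1} tile for best A/B row re-use, but when M (output rows)
    # is small, shrink the M axis and widen the other axis so all 64
    # invocations stay active.
    if M >= 8:
        return (8, 8, 1)
    if M >= 4:
        return (4, 16, 1)
    return (2, 32, 1)
```

All three shapes keep 64 invocations per work group, so the dispatch size is unchanged; only the tile aspect ratio adapts to `M`.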