[Bug Report] Regression: Out-of-memory in ttnn.conv2d #17787

Open
friedrich opened this issue Feb 10, 2025 · 0 comments

Describe the bug

Since commit 5444b3c, ttnn.conv2d fails with an out-of-memory (OOM) error for an operation required by the transformer of Stable Diffusion 3.5 at an image size of 1024x1024.

A similar issue (#17489) concerns running conv2d on even larger tensors. As far as I know, however, the failing tests in that issue never succeeded, whereas the test in this issue passed before commit 5444b3c.

Test:

from __future__ import annotations

import pytest
import torch

import ttnn
from tests.ttnn.utils_for_testing import assert_with_pcc


@pytest.mark.parametrize(
    ("batch_size", "in_channels", "out_channels", "kernel_size", "stride", "padding", "height", "width"),
    [
        (2, 16, 1536, (2, 2), (2, 2), (0, 0), 128, 128),
    ],
)
@pytest.mark.parametrize("device_params", [{"l1_small_size": 8192}], indirect=True)
def test_conv2d(
    *,
    device: ttnn.Device,
    batch_size: int,
    in_channels: int,
    out_channels: int,
    kernel_size: tuple[int, int],
    stride: tuple[int, int],
    padding: tuple[int, int],
    height: int,
    width: int,
) -> None:
    dtype = ttnn.bfloat16

    torch_model = torch.nn.Conv2d(
        in_channels=in_channels,
        out_channels=out_channels,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
    )
    torch_model.eval()

    tt_weight = ttnn.from_torch(torch_model.state_dict()["weight"], dtype=dtype)
    tt_bias = ttnn.from_torch(torch_model.state_dict()["bias"].reshape((1, 1, 1, -1)), dtype=dtype)

    torch_input = torch.ones((batch_size, in_channels, height, width))

    tt_input = ttnn.from_torch(
        torch_input.permute([0, 2, 3, 1]),  # BCYX -> BYXC
        device=device,
        layout=ttnn.TILE_LAYOUT,
        dtype=dtype,
    )

    with torch.no_grad():
        torch_output = torch_model(torch_input)

    batch_size = tt_input.shape[0]
    device = tt_input.device()

    tt_output, [output_height, output_width], [prepared_weight, prepared_bias] = ttnn.conv2d(
        input_tensor=tt_input,
        weight_tensor=tt_weight,
        bias_tensor=tt_bias,
        in_channels=in_channels,
        out_channels=out_channels,
        device=device,
        kernel_size=kernel_size,
        stride=stride,
        padding=padding,
        batch_size=batch_size,
        input_height=tt_input.shape[1],
        input_width=tt_input.shape[2],
        return_output_dim=True,
        return_weights_and_bias=True,
        memory_config=ttnn.DRAM_MEMORY_CONFIG,
    )

    tt_output_torch = ttnn.to_torch(tt_output).reshape([batch_size, output_height, output_width, out_channels]).permute([0, 3, 1, 2])

    assert_with_pcc(torch_output, tt_output_torch, pcc=0.995)
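
For reference, the output shape this test expects can be worked out from the standard convolution output-size formula (the one given in the torch.nn.Conv2d documentation, here with dilation 1). This is a minimal sketch; the helper name `conv_out_size` is illustrative and not part of ttnn or torch.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output extent along one spatial axis, assuming dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


# Parameters taken from the parametrization in the test above.
batch_size, out_channels = 2, 1536
out_h = conv_out_size(size=128, kernel=2, stride=2, padding=0)
out_w = conv_out_size(size=128, kernel=2, stride=2, padding=0)

# Shape of torch_output (BCYX), which tt_output_torch is permuted to match.
expected_shape = (batch_size, out_channels, out_h, out_w)
print(expected_shape)  # (2, 1536, 64, 64)
```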

Error log:

E       RuntimeError: TT_THROW @ /home/user/tt-metal/tt_metal/impl/allocator/bank_manager.cpp:132: tt::exception
E       info:
E       Out of Memory: Not enough space to allocate 2097152 B L1 buffer across 1 banks, where each bank needs to store 2097152 B
E       backtrace:
E        --- /home/user/tt-metal/build/lib/libtt_metal.so(+0x14bd73) [0x7f87fa2cbd73]
E        --- tt::tt_metal::BankManager::allocate_buffer(unsigned long, unsigned long, bool, CoreRangeSet const&, std::__1::optional<unsigned int>)
E        --- tt::tt_metal::Allocator::allocate_buffer(tt::tt_metal::v0::Buffer*)
E        --- tt::tt_metal::v0::Buffer::allocate_impl()
E        --- /home/user/tt-metal/build/lib/libtt_metal.so(+0x1156d3) [0x7f87fa2956d3]
E        --- /home/user/tt-metal/build/lib/libtt_metal.so(+0xf4aaf) [0x7f87fa274aaf]
E        --- tt::tt_metal::v0::Buffer::create(tt::tt_metal::v0::IDevice*, unsigned long, unsigned long, tt::tt_metal::BufferType, tt::tt_metal::TensorMemoryLayout, std::__1::optional<tt::tt_metal::ShardSpecBuffer> const&, std::__1::optional<bool>, std::__1::optional<tt::tt_metal::SubDeviceId>)
E        --- tt::tt_metal::tensor_impl::allocate_buffer_on_device(tt::tt_metal::v0::IDevice*, tt::tt_metal::TensorSpec const&)
E        --- tt::tt_metal::create_device_tensor(tt::tt_metal::TensorSpec const&, tt::tt_metal::v0::IDevice*)
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(_ZN2tt8tt_metal9operation29default_create_output_tensorsIN4ttnn10operations13data_movement35InterleavedToShardedDeviceOperationEEENS1_21program_output_helperIT_Xsr18has_create_programIS8_EE5valueEE4typeERKS8_RKNSt3__16vectorINS0_6TensorENSD_9allocatorISF_EEEERKNSE_INSD_8optionalISF_EENSG_ISM_EEEE+0x194) [0x7f87fb1b7d94]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6bdbee) [0x7f87fb1b7bee]
E        --- tt::tt_metal::operation::OldInfraDeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>::tensor_return_value_t ttnn::device_operation::detail::launch_on_single_device<tt::tt_metal::operation::OldInfraDeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>>(unsigned char, tt::tt_metal::operation::OldInfraDeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>::operation_attributes_t const&, tt::tt_metal::operation::OldInfraDeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>::tensor_args_t const&)
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x22d4ebf) [0x7f87fcdceebf]
E        --- tt::tt_metal::operation::OldInfraDeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>::tensor_return_value_t ttnn::device_operation::detail::invoke<tt::tt_metal::operation::OldInfraDeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>>(unsigned char, tt::tt_metal::operation::OldInfraDeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>::operation_attributes_t const&, tt::tt_metal::operation::OldInfraDeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>::tensor_args_t const&)
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x22d408f) [0x7f87fcdce08f]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x22cf445) [0x7f87fcdc9445]
E        --- std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> tt::tt_metal::operation::run<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(tt::tt_metal::operation::DeviceOperation<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>&&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&, unsigned char)
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x69bd63) [0x7f87fb195d63]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x69a830) [0x7f87fb194830]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6ccf7a) [0x7f87fb1c6f7a]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6ccee7) [0x7f87fb1c6ee7]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6cce52) [0x7f87fb1c6e52]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x22e3565) [0x7f87fcddd565]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x22e3cb7) [0x7f87fcdddcb7]
E        --- /home/user/tt-metal/build/lib/libtt_metal.so(+0xf4aaf) [0x7f87fa274aaf]
E        --- void tt::tt_metal::operation::launch_op_func<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>>(std::__1::function<std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> (std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>> const&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>> const&)> const&, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>, std::__1::vector<tt::tt_metal::Tensor, std::__1::allocator<tt::tt_metal::Tensor>>&, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor const>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor const>>>, std::__1::vector<std::__1::optional<tt::tt_metal::Tensor>, std::__1::allocator<std::__1::optional<tt::tt_metal::Tensor>>>, bool)
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6cc977) [0x7f87fb1c6977]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6cc2cb) [0x7f87fb1c62cb]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6c570c) [0x7f87fb1bf70c]
E        --- std::__1::tuple<tt::tt_metal::Tensor, ttnn::operations::sliding_window::ParallelConfig, ttnn::operations::sliding_window::ParallelConfig, bool> ttnn::operations::conv::shard_or_reshard_tensor_if_required<tt::tt_metal::v0::IDevice>(tt::tt_metal::v0::IDevice*, tt::tt_metal::Tensor const&, ttnn::operations::conv::conv2d::Conv2dConfig const&, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, bool, bool, bool)
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(_ZN4ttnn10operations4conv6conv2d6conv2dIN2tt8tt_metal2v07IDeviceEEENSt3__15tupleIJNS5_6TensorEjjSA_NS8_8optionalISA_EEEEERKSA_SF_PT_jjjjjNS8_5arrayIjLm2EEESJ_SJ_SJ_jNSB_ISE_EERKNSB_IKNS2_12Conv2dConfigEEERKNSB_IKNS8_7variantIJNS_28GrayskullComputeKernelConfigENS_27WormholeComputeKernelConfigEEEEEERKNSB_IKNS5_12MemoryConfigEEE+0x5da) [0x7f87fb189e3a]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(_ZN4ttnn10operations4conv6conv2d15Conv2dOperation6invokeEhRKN2tt8tt_metal6TensorES8_PNS5_2v07IDeviceEjjjjjNSt3__15arrayIjLm2EEESE_SE_SE_jNSC_8optionalIS7_EERKNSF_IKNS2_12Conv2dConfigEEERKNSF_IKNSC_7variantIJNS_28GrayskullComputeKernelConfigENS_27WormholeComputeKernelConfigEEEEEERKNSF_IKNS5_12MemoryConfigEEE+0x84) [0x7f87fb189814]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6f1cef) [0x7f87fb1ebcef]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6f12f4) [0x7f87fb1eb2f4]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6f0207) [0x7f87fb1ea207]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6efe62) [0x7f87fb1e9e62]
E        --- /home/user/tt-metal/ttnn/ttnn/_ttnn.so(+0x6e1166) [0x7f87fb1db166]
E        --- /home/user/tt-metal/python_env/bin/python3(PyCFunction_Call+0x59) [0x5e66b9]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyObject_MakeTpCall+0x29e) [0x5e728e]
E        --- /home/user/tt-metal/python_env/bin/python3() [0x4f9588]
E        --- /home/user/tt-metal/python_env/bin/python3(PyObject_Call+0x62) [0x5e5e32]
E        --- /home/user/tt-metal/python_env/bin/python3() [0x58db4c]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyObject_MakeTpCall+0x29e) [0x5e728e]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x628d) [0x56247d]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/user/tt-metal/python_env/bin/python3(PyObject_Call+0x62) [0x5e5e32]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x1f34) [0x55e124]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/user/tt-metal/python_env/bin/python3() [0x58d83f]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyObject_MakeTpCall+0x29e) [0x5e728e]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x628d) [0x56247d]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/user/tt-metal/python_env/bin/python3(PyObject_Call+0x62) [0x5e5e32]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x1f34) [0x55e124]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/user/tt-metal/python_env/bin/python3(PyObject_Call+0x62) [0x5e5e32]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x1f34) [0x55e124]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x57f2) [0x5619e2]
E        --- /home/user/tt-metal/python_env/bin/python3() [0x4f8d5e]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x57f2) [0x5619e2]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/user/tt-metal/python_env/bin/python3() [0x58d83f]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyObject_MakeTpCall+0x29e) [0x5e728e]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x628d) [0x56247d]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x1b6) [0x5e6a66]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x859) [0x55ca49]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x1b6) [0x5e6a66]
E        --- /home/user/tt-metal/python_env/bin/python3(PyObject_Call+0x62) [0x5e5e32]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x1f34) [0x55e124]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x57f2) [0x5619e2]
E        --- /home/user/tt-metal/python_env/bin/python3() [0x4f8d5e]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x57f2) [0x5619e2]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/user/tt-metal/python_env/bin/python3() [0x58d83f]
E        --- /home/user/tt-metal/python_env/bin/python3(PyObject_Call+0x25e) [0x5e602e]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x1f34) [0x55e124]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x72d) [0x55c91d]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3() [0x4f8ed0]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x190b) [0x55dafb]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyFunction_Vectorcall+0x393) [0x5e6c43]
E        --- /home/user/tt-metal/python_env/bin/python3(PyObject_Call+0x62) [0x5e5e32]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalFrameDefault+0x1f34) [0x55e124]
E        --- /home/user/tt-metal/python_env/bin/python3(_PyEval_EvalCodeWithName+0x26a) [0x55abda]

ttnn/ttnn/operations/conv2d.py:194: RuntimeError
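
A note on the number in the error message: 2097152 B happens to equal the size of the whole input tensor if its last two dimensions are padded to multiples of 32, as tile layout would require. Together with the InterleavedToShardedDeviceOperation frame under shard_or_reshard_tensor_if_required in the backtrace, this suggests (but does not prove) that the entire input is being placed in the L1 of a single bank. The sketch below only reproduces the arithmetic; the 32x32 tile size is an assumption, and the helper name `padded_tile_bytes` is hypothetical.

```python
import math

TILE = 32            # assumed tile edge length for TILE_LAYOUT
BYTES_PER_BF16 = 2   # bfloat16 is a 16-bit type


def padded_tile_bytes(shape, elem_bytes=BYTES_PER_BF16, tile=TILE):
    """Bytes needed for a tensor whose last two dims are padded to tile multiples."""
    *outer, rows, cols = shape
    rows_padded = math.ceil(rows / tile) * tile
    cols_padded = math.ceil(cols / tile) * tile
    return math.prod(outer) * rows_padded * cols_padded * elem_bytes


# Input shape after the BCYX -> BYXC permute in the test: (2, 128, 128, 16).
# The channel dimension (16) would be padded up to 32.
input_bytes = padded_tile_bytes((2, 128, 128, 16))
print(input_bytes)  # 2097152
```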

Environment:

OS: Linux
Distro: Ubuntu 20.04.6 LTS
Kernel: 5.4.0-196-generic
Platform: x86_64
Python: 3.8.10
Device: n150
FW Bundle Version: 80.10.0.0
TT-Flash Version: 0.2.2.0
CM FW Version: 2.27.0.0
ETH FW Version: 6.9.0
BM BL Version: 129.2.0.0
BM App Version: 5.9.0.2
