Metric values suddenly drop to 0 during training #13

Open
MingboDuan opened this issue Oct 2, 2024 · 10 comments

Comments

@MingboDuan

Hello, while running train.py, at a certain epoch all metric values suddenly dropped to 0, as with the IoU shown below:
[screenshot: IoU metric curve]
My configuration is shown below:
[screenshot: training configuration]

Why does this happen? (The metrics collapse at roughly epoch 270.) It never occurred in my previous training runs, and the dataset has no missing or corrupted samples.
Could you please help me clear up this confusion? Thank you!

@MingboDuan
Author

Hello, author! I solved the problem above by moving the 'begin_test' starting epoch earlier (from 220 to 100)!
So I'd like to ask: for training on the SIRST3 dataset, what values are appropriate for the epoch at which testing begins and for the final epoch? And why does an unreasonable setting cause such a severe drop in the training metrics?

@xdFai
Owner

xdFai commented Oct 4, 2024

Hello, for SIRST3 I suggest starting testing at epoch 500 and ending at epoch 1000.
I'm sorry, but to be honest I never encountered the metrics dropping to 0 like this during my own training.
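For readers wondering how a 'begin_test' setting interacts with the training loop, here is a minimal, self-contained sketch of how such an option typically gates evaluation and best-model selection. The argument names, defaults, and stub functions are assumptions for illustration only, not the repository's actual train.py.

# Hypothetical sketch: evaluation only starts once epoch >= begin_test.
# The stubs and argument names are assumptions, not the repository's code.
import argparse

def train_one_epoch(epoch):
    # stub: the real loop would iterate the train loader and step the optimizer
    pass

def evaluate(epoch):
    # stub: the real code would compute pixAcc/mIoU/PD/FA on the test split
    return 0.0

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=1000)     # suggested end epoch for SIRST3
    parser.add_argument('--begin_test', type=int, default=500)  # suggested first evaluation epoch for SIRST3
    args = parser.parse_args()

    best_miou = 0.0
    for epoch in range(1, args.epochs + 1):
        train_one_epoch(epoch)
        # evaluation and best-model tracking are skipped before begin_test
        if epoch >= args.begin_test:
            miou = evaluate(epoch)
            best_miou = max(best_miou, miou)

if __name__ == '__main__':
    main()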

@MingboDuan
Author

MingboDuan commented Oct 4, 2024 via email

@xdFai
Owner

xdFai commented Oct 5, 2024

Speaking from experience, fairly good results usually appear between epochs 600 and 800, and you can stop training after epoch 800.

@MingboDuan
Author

MingboDuan commented Oct 6, 2024 via email

@arrowonstr

Same problem here: the metrics also dropped to 0, and the loss spiked to over a thousand at around epoch 200.

@xdFai
Owner

xdFai commented Dec 11, 2024

Hello, are you training on all three datasets together?

@arrowonstr

@xdFai I'm using IRSTD1K, with a 4:1 train-to-test split.
Optimizer: Adagrad, 500 epochs, begin_test at epoch 200.
I also added another IoU loss to the total loss for testing (see the sketch after this comment's log).
The log is as follows:

Dec  9 19:42:43 Epoch---10, total_loss---15.624369,
Dec  9 19:45:58 Epoch---20, total_loss---12.861537,
Dec  9 19:49:15 Epoch---30, total_loss---12.847774,
Dec  9 19:52:31 Epoch---40, total_loss---12.836510,
Dec  9 19:55:47 Epoch---50, total_loss---12.835945,
Dec  9 19:59:03 Epoch---60, total_loss---12.832285,
Dec  9 20:02:19 Epoch---70, total_loss---12.832659,
Dec  9 20:05:34 Epoch---80, total_loss---12.835690,
Dec  9 20:08:50 Epoch---90, total_loss---12.831832,
Dec  9 20:12:06 Epoch---100, total_loss---12.834795,
Dec  9 20:15:21 Epoch---110, total_loss---12.836212,
Dec  9 20:18:37 Epoch---120, total_loss---12.828453,
Dec  9 20:21:53 Epoch---130, total_loss---12.822076,
Dec  9 20:25:08 Epoch---140, total_loss---12.835477,
Dec  9 20:28:24 Epoch---150, total_loss---12.831886,
Dec  9 20:31:39 Epoch---160, total_loss---12.828893,
Dec  9 20:34:55 Epoch---170, total_loss---12.843395,
Dec  9 20:38:10 Epoch---180, total_loss---12.836482,
Dec  9 20:41:26 Epoch---190, total_loss---12.820898,
Dec  9 20:44:43 Epoch---200, total_loss---12.825828,
the best model epoch 	200
pixAcc, mIoU:	(0.842510461807251, np.float64(0.5948936996408116))
PD, FA:	(0.936026936026936, 6.841783033451065e-05)
Dec  9 20:50:35 Epoch---210, total_loss---111.857483,
Dec  9 20:56:29 Epoch---220, total_loss---113.037758,
Dec  9 21:02:21 Epoch---230, total_loss---113.045013,
Dec  9 21:08:13 Epoch---240, total_loss---113.042252,
Dec  9 21:14:06 Epoch---250, total_loss---113.056198,
Dec  9 21:19:58 Epoch---260, total_loss---113.046364,
Dec  9 21:25:50 Epoch---270, total_loss---113.059280,
Dec  9 21:31:43 Epoch---280, total_loss---113.042625,
Dec  9 21:37:37 Epoch---290, total_loss---113.056938,
Dec  9 21:43:30 Epoch---300, total_loss---113.048706,
Dec  9 21:49:23 Epoch---310, total_loss---113.051620,
Dec  9 21:55:16 Epoch---320, total_loss---113.050255,
Dec  9 22:01:10 Epoch---330, total_loss---113.046349,
Dec  9 22:07:03 Epoch---340, total_loss---113.040436,
Dec  9 22:12:56 Epoch---350, total_loss---113.055527,
Dec  9 22:18:49 Epoch---360, total_loss---113.044685,
Dec  9 22:24:43 Epoch---370, total_loss---113.042641,
Dec  9 22:30:36 Epoch---380, total_loss---113.047241,
Dec  9 22:36:29 Epoch---390, total_loss---113.046936,
Dec  9 22:42:24 Epoch---400, total_loss---113.043839,
Dec  9 22:48:16 Epoch---410, total_loss---113.050171,
Dec  9 22:54:11 Epoch---420, total_loss---113.042526,
Dec  9 23:00:06 Epoch---430, total_loss---113.046516,
Dec  9 23:05:59 Epoch---440, total_loss---113.046837,
Dec  9 23:11:53 Epoch---450, total_loss---113.053688,
Dec  9 23:17:47 Epoch---460, total_loss---113.042084,
Dec  9 23:23:41 Epoch---470, total_loss---113.057220,
Dec  9 23:29:34 Epoch---480, total_loss---113.051018,
Dec  9 23:35:27 Epoch---490, total_loss---113.048325,
Dec  9 23:41:21 Epoch---500, total_loss---113.043686,
pixAcc, mIoU:	(0.0, np.float64(0.0))
PD, FA:	(0.0, 0.0)

The mIoU at epoch 200 is abnormally high.
After epoch 200 the loss suddenly jumps to over 100, and the metrics all become 0.
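Since the comment above mentions adding an extra IoU loss on top of the original loss, here is a minimal sketch of a common soft-IoU formulation in PyTorch, with an explicit sigmoid and a smoothing constant. It is only an illustration of a numerically guarded variant, not the exact loss used in the run above.

# Generic soft-IoU loss sketch (PyTorch). The sigmoid and the smoothing term
# guard against degenerate values; this is NOT the repository's loss.
import torch
import torch.nn as nn

class SoftIoULoss(nn.Module):
    def __init__(self, smooth: float = 1.0):
        super().__init__()
        self.smooth = smooth

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        probs = torch.sigmoid(logits)                 # map raw logits to (0, 1)
        dims = tuple(range(1, probs.dim()))           # reduce over all but the batch dim
        inter = (probs * targets).sum(dim=dims)
        union = probs.sum(dim=dims) + targets.sum(dim=dims) - inter
        iou = (inter + self.smooth) / (union + self.smooth)  # smoothing avoids 0/0
        return 1.0 - iou.mean()

# usage: loss = SoftIoULoss()(model_output_logits, binary_masks.float())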

@xdFai
Owner

xdFai commented Dec 11, 2024 via email

@arrowonstr

@xdFai I suspect the reason PD drops to 0 is that, in

def update(self, preds, labels, size):
    # Raw predictions and labels are cast straight to int64 here;
    # note there is no sigmoid or threshold binarization before the cast.
    predits = np.array((preds).cpu()).astype('int64')
    labelss = np.array((labels).cpu()).astype('int64')

    # Connected-component labeling and region extraction for the
    # prediction map and the ground-truth map.
    image = measure.label(predits, connectivity=2)
    coord_image = measure.regionprops(image)
    label = measure.label(labelss, connectivity=2)
    coord_label = measure.regionprops(label)
there is no sigmoid or >threshold binarization applied to predits; casting directly to int64 means pixels that should be judged as part of the same connected region end up with different values.
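A minimal sketch of the binarization suggested above, applied before the connected-component step. The sigmoid followed by a 0.5 threshold is an assumed choice for illustration, not a confirmed fix for this repository's metric code.

# Sketch: threshold the raw predictions into a 0/1 mask before labeling.
# The 0.5 threshold after sigmoid is an assumption for illustration.
import numpy as np
import torch
from skimage import measure

def binarize_and_label(preds: torch.Tensor, threshold: float = 0.5):
    probs = torch.sigmoid(preds)                                  # logits -> probabilities
    predits = (probs > threshold).cpu().numpy().astype('int64')   # hard binary mask
    labeled = measure.label(predits, connectivity=2)              # connected components on the binary mask
    return measure.regionprops(labeled)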
