srt/13 - 2 - K-Means Algorithm (13 min).srt

﻿1
00:00:00,300 --> 00:00:02,220
In the clustering problem we are
在聚类问题中
(字幕整理：中国海洋大学 黄海广,haiguang2000@qq.com )

2
00:00:02,360 --> 00:00:03,630
given an unlabeled data
我们有未加标签的数据

3
00:00:03,950 --> 00:00:05,040
set and we would like
我们希望有一个算法

4
00:00:05,200 --> 00:00:06,480
to have an algorithm automatically
能够自动的

5
00:00:07,320 --> 00:00:08,700
group the data into coherent
把这些数据分成

6
00:00:09,340 --> 00:00:11,000
subsets or into coherent clusters for us.
有紧密关系的子集或是簇

7
00:00:12,380 --> 00:00:14,160
The K Means algorithm is by
K均值 (K-means) 算法

8
00:00:14,310 --> 00:00:15,860
far the most popular, by
是现在最为广泛使用的

9
00:00:16,090 --> 00:00:17,410
far the most widely used clustering
聚类方法

10
00:00:17,780 --> 00:00:19,380
algorithm, and in this
那么在这个视频中

11
00:00:19,550 --> 00:00:20,320
video I would like to tell
我将会告诉你

12
00:00:20,570 --> 00:00:23,400
you what the K Means Algorithm is and how it works.
什么是K均值算法以及它是怎么运作的

13
00:00:27,000 --> 00:00:29,310
The K means clustering algorithm is best illustrated in pictures.
K均值算法最好用图来表达

14
00:00:29,960 --> 00:00:30,770
Let's say I want to take
如图所示

15
00:00:31,080 --> 00:00:32,330
an unlabeled data set like
现在我有一些

16
00:00:32,490 --> 00:00:34,040
the one shown here, and I
没加标签的数据

17
00:00:34,100 --> 00:00:36,450
want to group the data into two clusters.
而我想将这些数据分成两个簇

18
00:00:37,710 --> 00:00:38,740
If I run the K Means clustering
现在我执行K均值算法

19
00:00:39,080 --> 00:00:41,560
algorithm, here is what
方法是这样的

20
00:00:41,910 --> 00:00:44,190
I'm going to do. The first step is to randomly initialize two
首先我随机选择两个点

21
00:00:44,410 --> 00:00:45,920
points, called the cluster centroids.
这两个点叫做聚类中心 (cluster centroids)

22
00:00:46,700 --> 00:00:48,170
So, these two crosses here,
就是图上边的两个叉

23
00:00:49,010 --> 00:00:51,730
these are called the Cluster Centroids
这两个就是聚类中心

24
00:00:53,270 --> 00:00:54,320
and I have two of them
为什么要两个点呢

25
00:00:55,100 --> 00:00:57,840
because I want to group my data into two clusters.
因为我希望聚出两个类

26
00:00:59,130 --> 00:01:02,400
K Means is an iterative algorithm and it does two things.
K均值是一个迭代方法 它要做两件事情

27
00:01:03,480 --> 00:01:04,790
First is a cluster assignment
第一个是簇分配

28
00:01:05,330 --> 00:01:07,800
step, and second is a move centroid step.
第二个是移动聚类中心

29
00:01:08,360 --> 00:01:09,630
So, let me tell you what those things mean.
我来告诉你这两个是干嘛的

30
00:01:11,170 --> 00:01:12,520
The first of the two steps in the
在K均值算法的每次循环中

31
00:01:12,700 --> 00:01:14,930
loop of K means, is this cluster assignment step.
第一步是要进行簇分配

32
00:01:15,840 --> 00:01:17,070
What that means is that, it's
这就是说

33
00:01:17,220 --> 00:01:18,360
going through each of the
我要遍历所有的样本

34
00:01:18,700 --> 00:01:19,880
examples, each of these green
就是图上所有的绿色的点

35
00:01:20,170 --> 00:01:22,120
dots shown here and depending
然后依据

36
00:01:22,580 --> 00:01:24,140
on whether it's closer to the
每一个点

37
00:01:24,350 --> 00:01:25,530
red cluster centroid or the
是更接近红色的这个中心

38
00:01:25,620 --> 00:01:27,390
blue cluster centroid, it is going
还是蓝色的这个中心

39
00:01:27,560 --> 00:01:28,570
to assign each of the
来将每个数据点

40
00:01:28,670 --> 00:01:30,670
data points to one of the two cluster centroids.
分配到两个不同的聚类中心中

41
00:01:32,040 --> 00:01:33,350
Specifically, what I mean
具体来讲

42
00:01:33,460 --> 00:01:34,610
by that, is to go through your
我指的是

43
00:01:34,730 --> 00:01:36,930
data set and color each
对数据集中的所有点

44
00:01:37,130 --> 00:01:38,510
of the points either red or
依据他们

45
00:01:38,810 --> 00:01:39,890
blue, depending on whether
更接近红色这个中心

46
00:01:40,160 --> 00:01:41,060
it is closer to the red
还是蓝色这个中心

47
00:01:41,170 --> 00:01:42,150
cluster centroid or the blue
进行染色

48
00:01:42,470 --> 00:01:45,210
cluster centroid, and I've done that in this diagram here.
染色之后的结果如图所示

49
00:01:46,930 --> 00:01:48,700
So, that was the cluster assignment step.
以上就是簇分配的步骤

50
00:01:49,780 --> 00:01:52,270
The other part of K means, in the
K均值的另一部分

51
00:01:52,410 --> 00:01:53,390
loop of K means, is the move
是要移动聚类中心

52
00:01:53,590 --> 00:01:54,860
centroid step, and what
具体的操作方法

53
00:01:55,020 --> 00:01:55,730
we are going to do is, we
是这样的

54
00:01:55,800 --> 00:01:56,890
are going to take the two cluster centroids,
我们将两个聚类中心

55
00:01:57,390 --> 00:01:58,550
that is, the red cross and
也就是说红色的叉

56
00:01:58,830 --> 00:02:00,270
the blue cross, and we are
和蓝色的叉

57
00:02:00,420 --> 00:02:01,420
going to move them to the average
移动到

58
00:02:02,070 --> 00:02:03,900
of the points colored the same colour.
和它一样颜色的那堆点的均值处

59
00:02:04,880 --> 00:02:05,700
So what we are going
那么我们要做的是

60
00:02:05,730 --> 00:02:06,510
to do is look at all the
找出所有红色的点

61
00:02:06,630 --> 00:02:07,810
red points and compute the
计算出它们的均值

62
00:02:08,240 --> 00:02:09,520
average, really the mean
就是所有红色的点

63
00:02:10,080 --> 00:02:11,500
of the location of all the red points,
平均下来的位置

64
00:02:11,650 --> 00:02:13,690
and we are going to move the red cluster centroid there.
然后我们就把红色点的聚类中心移动到这里

65
00:02:14,190 --> 00:02:15,260
And the same things for the
蓝色的点也是这样

66
00:02:15,460 --> 00:02:16,370
blue cluster centroid, look at all
找出所有蓝色的点

67
00:02:16,560 --> 00:02:17,720
the blue dots and compute their
计算它们的均值

68
00:02:17,840 --> 00:02:19,710
mean, and then move the blue cluster centroid there.
把蓝色的叉放到那里

69
00:02:20,320 --> 00:02:20,880
So, let me do that now.
那我们现在就这么做

70
00:02:21,170 --> 00:02:22,990
We're going to move the cluster centroids as follows
我们将按照图上所示这么移动

71
00:02:24,590 --> 00:02:27,350
and I've now moved them to their new means.
现在两个中心都已经移动到新的均值那里了

72
00:02:28,300 --> 00:02:29,760
The red one moved like that
你看

73
00:02:29,820 --> 00:02:31,350
and the blue one moved
蓝色的这么移动

74
00:02:31,510 --> 00:02:34,460
like that and the red one moved like that.
红色的这么移动

75
00:02:34,620 --> 00:02:35,460
And then we go back to another cluster
然后我们就会进入下一个

76
00:02:35,910 --> 00:02:36,920
assignment step, so we're again
簇分配

77
00:02:37,190 --> 00:02:38,090
going to look at all of
我们重新检查

78
00:02:38,160 --> 00:02:39,670
my unlabeled examples and depending
所有没有标签的样本

79
00:02:40,090 --> 00:02:42,840
on whether it's closer the red or the blue cluster centroid,
依据它离红色中心还是蓝色中心更近一些

80
00:02:43,340 --> 00:02:45,150
I'm going to color them either red or blue.
将它染成红色或是蓝色

81
00:02:45,640 --> 00:02:47,160
I'm going to assign each point
我要将每个点

82
00:02:47,530 --> 00:02:48,550
to one of the two cluster centroids, so let me do that now.
分配给两个中心的某一个 就像这么做

83
00:02:51,450 --> 00:02:52,260
And so the colors of some of the points just changed.
你看某些点的颜色变了

84
00:02:53,400 --> 00:02:55,690
And then I'm going to do another move centroid step.
然后我们又要移动聚类中心

85
00:02:56,040 --> 00:02:56,810
So I'm going to compute the
于是我计算

86
00:02:57,070 --> 00:02:57,880
average of all the blue points,
蓝色点的均值

87
00:02:58,110 --> 00:02:59,000
compute the average of all
还有红色点的均值

88
00:02:59,040 --> 00:03:00,360
the red points and move my
然后就像图上所表示的

89
00:03:00,480 --> 00:03:03,770
cluster centroids like this, and
移动两个聚类中心

90
00:03:03,930 --> 00:03:05,650
so, let's do that again.
来我们再来一遍

91
00:03:06,160 --> 00:03:07,810
Let me do one more cluster assignment step.
下面我还是要做一次簇分配

92
00:03:08,320 --> 00:03:09,450
So colour each point red
将每个点

93
00:03:09,620 --> 00:03:10,840
or blue, based on what it's
染成红色或是蓝色

94
00:03:11,170 --> 00:03:13,070
closer to and then
依然根据它们离那个中心近

95
00:03:13,310 --> 00:03:20,000
do another move centroid step and we're done.
然后是移动中心 你看就像这样

96
00:03:20,350 --> 00:03:21,230
And in fact if you
实际上

97
00:03:21,290 --> 00:03:23,250
keep running additional iterations of
如果你从这一步开始

98
00:03:23,500 --> 00:03:26,020
K means from here the
一直迭代下去

99
00:03:26,160 --> 00:03:27,240
cluster centroids will not change
聚类中心是不会变的

100
00:03:27,540 --> 00:03:28,770
any further and the colours of
并且

101
00:03:28,830 --> 00:03:29,760
the points will not change any
那些点的颜色也不会变

102
00:03:29,940 --> 00:03:31,520
further. And so, this is
在这时

103
00:03:31,810 --> 00:03:33,520
the, at this point,
我们就能说

104
00:03:33,770 --> 00:03:35,290
K means has converged and it's
K均值方法已经收敛了

105
00:03:35,400 --> 00:03:36,430
done a pretty good job finding
在这些数据中找到两个簇

106
00:03:37,470 --> 00:03:38,750
the two clusters in this data.
K均值表现的很好

107
00:03:39,360 --> 00:03:40,310
Let's write out the K means algorithm more formally.
来我们用更加规范的格式描述K均值算法

108
00:03:42,150 --> 00:03:43,930
The K means algorithm takes two inputs.
K均值算法接受两个输入

109
00:03:44,570 --> 00:03:46,200
One is a parameter K,
第一个是参数K

110
00:03:46,450 --> 00:03:47,260
which is the number of clusters
表示你想从数据中

111
00:03:47,830 --> 00:03:48,900
you want to find in the data.
聚类出的簇的个数

112
00:03:49,640 --> 00:03:50,820
I'll later say how we might
我一会儿会讲到

113
00:03:51,170 --> 00:03:53,290
go about trying to choose k, but
我们可以怎样选择K

114
00:03:53,470 --> 00:03:54,600
for now let's just say that
这里呢 我们只是说

115
00:03:55,110 --> 00:03:56,210
we've decided we want a
我们已经确定了

116
00:03:56,360 --> 00:03:57,600
certain number of clusters and we're
需要几个簇

117
00:03:57,690 --> 00:03:58,810
going to tell the algorithm how many
然后我们要告诉这个算法

118
00:03:59,040 --> 00:04:00,730
clusters we think there are in the data set.
我们觉得在数据集里有多少个簇

119
00:04:01,170 --> 00:04:02,120
And then K means also
K均值同时要

120
00:04:02,490 --> 00:04:03,430
takes as input this sort
接收另外一个输入

121
00:04:03,880 --> 00:04:05,060
of unlabeled training set of
那就是只有 x 的

122
00:04:05,250 --> 00:04:06,530
just the Xs and
没有标签 y 的训练集

123
00:04:06,710 --> 00:04:08,430
because this is unsupervised learning, we
因为这是非监督学习

124
00:04:08,520 --> 00:04:10,690
don't have the labels Y anymore.
我们用不着 y

125
00:04:10,980 --> 00:04:12,470
And for unsupervised learning of
同时在非监督学习的

126
00:04:12,740 --> 00:04:14,020
the K means I'm going to
K均值算法里

127
00:04:14,550 --> 00:04:16,160
use the convention that XI
我们约定

128
00:04:16,420 --> 00:04:17,750
is an RN dimensional vector.
x(i) 是一个n维向量

129
00:04:18,280 --> 00:04:19,190
And that's why my training examples
这就是

130
00:04:19,750 --> 00:04:22,460
are now N dimensional rather N plus one dimensional vectors.
训练样本是 n 维而不是 n+1 维的原因

131
00:04:24,340 --> 00:04:25,430
This is what the K means algorithm does.
这就是K均值算法

132
00:04:27,180 --> 00:04:28,630
The first step is that it
第一步是

133
00:04:28,790 --> 00:04:31,170
randomly initializes k cluster
随机初始化 K 个聚类中心

134
00:04:31,570 --> 00:04:33,550
centroids which we will
记作

135
00:04:33,820 --> 00:04:34,610
call mu 1, mu 2, up
μ1, μ2 一直到 μk

136
00:04:34,840 --> 00:04:36,250
to mu k. And so
就像之前

137
00:04:36,650 --> 00:04:38,450
in the earlier diagram, the
图中所示

138
00:04:38,550 --> 00:04:40,770
cluster centroids corresponded to the
聚类中心对应于

139
00:04:41,060 --> 00:04:42,240
location of the red cross
红色叉和蓝色叉

140
00:04:42,660 --> 00:04:44,020
and the location of the blue cross.
所在的位置

141
00:04:44,410 --> 00:04:45,640
So there we had two cluster
于是我们有两个聚类中心

142
00:04:45,960 --> 00:04:47,000
centroids, so maybe the red
按照这样的记法

143
00:04:47,170 --> 00:04:48,470
cross was mu 1
红叉是 μ1

144
00:04:48,650 --> 00:04:49,940
and the blue cross was mu
蓝叉是 μ2

145
00:04:50,300 --> 00:04:51,360
2, and more generally we would have
通常情况下

146
00:04:51,820 --> 00:04:53,830
k cluster centroids rather than just 2.
我们可能会有比2要多的聚类中心

147
00:04:54,520 --> 00:04:56,240
Then the inner loop
K均值的内部循环

148
00:04:56,520 --> 00:04:57,360
of k means does the following,
是这样的

149
00:04:57,830 --> 00:04:59,020
we're going to repeatedly do the following.
我们会重复做下面的事情

150
00:05:00,070 --> 00:05:01,950
First for each of
首先

151
00:05:02,160 --> 00:05:03,920
my training examples, I'm going
对于每个训练样本

152
00:05:04,110 --> 00:05:05,950
to set this variable CI
我们用变量 c(i) 表示

153
00:05:06,790 --> 00:05:07,960
to be the index 1 through
K个聚类中心中最接近 x(i) 的

154
00:05:08,170 --> 00:05:10,520
K of the cluster centroid closest to XI.
那个中心的下标

155
00:05:11,170 --> 00:05:13,810
So this was my cluster assignment
这就是簇分配

156
00:05:14,330 --> 00:05:16,870
step, where we
这个步骤

157
00:05:17,000 --> 00:05:18,680
took each of my examples and
我先将每个样本

158
00:05:18,980 --> 00:05:20,740
coloured it either red
依据它离那个聚类中心近

159
00:05:21,050 --> 00:05:22,050
or blue, depending on which
将其染成

160
00:05:22,380 --> 00:05:23,940
cluster centroid it was closest to.
红色或是蓝色

161
00:05:24,140 --> 00:05:25,090
So CI is going to be
所以 c(i) 是一个

162
00:05:25,280 --> 00:05:26,280
a number from 1 to
在1到 K 之间的数

163
00:05:26,380 --> 00:05:27,680
K that tells us, you
而且它表明

164
00:05:27,780 --> 00:05:28,760
know, is it closer to the
这个点到底是

165
00:05:28,920 --> 00:05:29,820
red cross or is it
更接近红色叉

166
00:05:29,900 --> 00:05:31,170
closer to the blue cross,
还是蓝色叉

167
00:05:32,200 --> 00:05:33,210
and another way of writing this
另一种表达方式是

168
00:05:33,580 --> 00:05:35,350
is I'm going to,
我想要计算 c(i)

169
00:05:35,620 --> 00:05:37,820
to compute Ci, I'm
那么

170
00:05:37,890 --> 00:05:39,120
going to take my Ith
我要用第i个样本x(i)

171
00:05:39,380 --> 00:05:41,170
example Xi and and
然后

172
00:05:41,360 --> 00:05:42,670
I'm going to measure it's distance
计算出这个样本

173
00:05:43,900 --> 00:05:44,860
to each of my cluster centroids,
距离所有K个聚类中心的距离

174
00:05:45,410 --> 00:05:46,690
this is mu and then
这是 μ

175
00:05:47,060 --> 00:05:48,640
lower-case k, right, so
以及小写的k

176
00:05:48,890 --> 00:05:50,630
capital K is the total
大写的 K 表示

177
00:05:50,910 --> 00:05:51,900
number centroids and I'm going
所有聚类中心的个数

178
00:05:52,100 --> 00:05:53,160
to use lower case k here
小写的 k 则是

179
00:05:53,770 --> 00:05:55,140
to index into the different centroids.
不同的中心的下标

180
00:05:56,240 --> 00:05:58,470
But so, Ci is going to, I'm going
我希望的是

181
00:05:58,550 --> 00:06:00,110
to minimize over my values
在所有K个中心中

182
00:06:00,550 --> 00:06:01,930
of k and find the
找到一个k

183
00:06:02,120 --> 00:06:03,650
value of K that minimizes this
使得xi到μk的距离

184
00:06:03,900 --> 00:06:04,750
distance between Xi and the
是xi到所有的聚类中心的距离中

185
00:06:04,800 --> 00:06:06,130
cluster centroid, and then,
最小的那个

186
00:06:06,340 --> 00:06:08,990
you know, the
也就是说

187
00:06:09,070 --> 00:06:10,350
value of k that minimizes
k的值使这个最小

188
00:06:10,940 --> 00:06:12,160
this, that's what gets set in
这就是计算ci的方法

189
00:06:12,300 --> 00:06:14,100
Ci. So, here's
这里还有

190
00:06:14,360 --> 00:06:16,470
another way of writing out what Ci is.
另外的表示ci的方法

191
00:06:18,050 --> 00:06:19,150
If I write the norm between
我用xi减μk的范数

192
00:06:19,270 --> 00:06:21,500
Xi minus Mu-k,
来表示

193
00:06:23,000 --> 00:06:24,120
then this is the distance between
这是第i个训练样本

194
00:06:24,630 --> 00:06:26,040
my ith training example
到聚类中心μk

195
00:06:26,180 --> 00:06:27,350
Xi and the cluster centroid
的距离

196
00:06:28,140 --> 00:06:30,280
Mu subscript K, this is--this
注意

197
00:06:31,150 --> 00:06:32,830
here, that's a lowercase K. So uppercase
我这里用的是小写的k

198
00:06:33,320 --> 00:06:34,710
K is going to be
大写的K

199
00:06:34,980 --> 00:06:36,210
used to denote the total
大写的k表示

200
00:06:36,450 --> 00:06:38,020
number of cluster centroids,
聚类中心的总数

201
00:06:38,770 --> 00:06:40,430
and this lowercase K's
这个小写的k

202
00:06:40,790 --> 00:06:41,840
a number between one and
是第一个到第K个中心

203
00:06:41,960 --> 00:06:42,940
capital K. I'm just using
中的一个

204
00:06:43,210 --> 00:06:44,450
lower case K to index
我用小写的k

205
00:06:44,930 --> 00:06:45,990
into my different cluster centroids.
表示不同聚类中心的下标

206
00:06:47,130 --> 00:06:49,020
Next is lower case k. So
这是个小写k

207
00:06:50,050 --> 00:06:51,330
that's the distance between the example and the cluster centroid
这就是某个样本到聚类中心的距离

208
00:06:51,490 --> 00:06:52,810
and so what I'm going to
接下来

209
00:06:53,050 --> 00:06:54,330
do is find the value
我要做的是

210
00:06:55,250 --> 00:06:56,390
of K, of lower case
找出小写的k的值

211
00:06:56,710 --> 00:06:58,900
k that minimizes this, and
让这个式子最小

212
00:06:59,080 --> 00:07:00,320
so the value of
那么

213
00:07:00,480 --> 00:07:02,100
k that minimizes you know,
接下来

214
00:07:02,280 --> 00:07:03,610
that's what I'm going to
我就要将 c(i)

215
00:07:04,000 --> 00:07:06,560
set as Ci, and
赋值为k

216
00:07:06,760 --> 00:07:07,850
by convention here I've written
我这里按照惯例表示

217
00:07:08,190 --> 00:07:09,430
the distance between Xi and
x(i) 和聚类中心的距离

218
00:07:09,480 --> 00:07:11,310
the cluster centroid, by convention
因为出于惯例

219
00:07:11,820 --> 00:07:13,330
people actually tend to write this as the squared distance.
人们更喜欢用距离的平方来表示

220
00:07:13,780 --> 00:07:15,370
So we think of Ci as picking
所以我们可以认为

221
00:07:15,660 --> 00:07:16,860
the cluster centroid with the smallest
c(i) 是距样本 x(i) 的距离的平方

222
00:07:17,450 --> 00:07:20,110
squared distance to my training example Xi.
最小的那个聚类中心

223
00:07:20,750 --> 00:07:22,080
But of course minimizing squared distance,
当然

224
00:07:22,500 --> 00:07:23,700
and minimizing distance that should
使距离的平方最小或是距离最小

225
00:07:23,880 --> 00:07:25,670
give you the same value of Ci,
都能让我们得到相同的 c(i)

226
00:07:25,830 --> 00:07:26,670
but we usually put in the
但是我们通常还是

227
00:07:26,750 --> 00:07:28,120
square there, just as the
写成距离的平方

228
00:07:28,430 --> 00:07:31,020
convention that people use for K means.
因为这是约定俗成的

229
00:07:31,170 --> 00:07:32,320
So that was the cluster assignment step.
这就是簇分配

230
00:07:33,480 --> 00:07:34,750
The other in the loop
K均值循环中的另一部分是

231
00:07:34,980 --> 00:07:37,740
of K means does the move centroid step.
移动聚类中心

232
00:07:40,540 --> 00:07:41,750
And what that does is for
这是说

233
00:07:42,160 --> 00:07:43,460
each of my cluster centroids, so
对于每个聚类中心

234
00:07:43,550 --> 00:07:44,740
for lower case k equals 1 through
也就是说

235
00:07:44,870 --> 00:07:46,190
K, it sets Mu-k equals
小写k从1循环到K

236
00:07:46,710 --> 00:07:48,460
to the average of the points assigned to cluster.
将 μk 赋值为这个簇的均值

237
00:07:49,270 --> 00:07:50,720
So as a concrete example, let's say
举个例子

238
00:07:50,910 --> 00:07:52,100
that one of my cluster
某一个聚类中心

239
00:07:52,340 --> 00:07:53,420
centroids, let's say cluster centroid
比如说是 μ2

240
00:07:53,750 --> 00:07:55,030
two, has training examples,
被分配了一些训练样本

241
00:07:55,820 --> 00:08:02,390
you know, 1, 5, 6, and 10 assigned to it.
像是1,5,6,10

242
00:08:03,220 --> 00:08:04,270
And what this means is,
这个表明

243
00:08:04,470 --> 00:08:05,510
really this means that C1 equals
c(1) 等于2

244
00:08:06,560 --> 00:08:09,180
to C5 equals to
c(5) 等于2

245
00:08:10,690 --> 00:08:12,180
C6 equals to and
c(6) 等于2

246
00:08:12,300 --> 00:08:13,730
similarly well c10 equals, too, right?
同样的 c(10) 也是等于2 对吧?

247
00:08:14,970 --> 00:08:17,070
If we got that
如果我们从上一步

248
00:08:17,160 --> 00:08:18,940
from the cluster assignment step, then
也就是簇分配那一步得到了这些

249
00:08:19,190 --> 00:08:21,250
that means examples 1,5,6 and
这个表明

250
00:08:21,450 --> 00:08:22,960
10 were assigned to the cluster centroid two.
样本1 5 6 10被分配给了聚类中心2

251
00:08:24,020 --> 00:08:25,210
Then in this move centroid step,
然后在移动聚类中心这一步中

252
00:08:25,540 --> 00:08:26,580
what I'm going to do is just
我们要做的是

253
00:08:27,180 --> 00:08:29,290
compute the average of these four things.
计算出这四个的平均值

254
00:08:31,340 --> 00:08:33,950
So X1 plus X5 plus X6
即

255
00:08:34,270 --> 00:08:35,620
plus X10.
计算 x(1)+x(5)+x(6)+x(10)

256
00:08:35,890 --> 00:08:37,190
And now I'm going
然后计算

257
00:08:37,380 --> 00:08:38,630
to average them so here I
它们的平均值

258
00:08:38,920 --> 00:08:40,020
have four points assigned to
这里聚类中心有

259
00:08:40,100 --> 00:08:41,700
this cluster centroid, just take
4个点

260
00:08:42,280 --> 00:08:43,240
one quarter of that.
那么我们要计算和的四分之一

261
00:08:43,980 --> 00:08:45,890
And now Mu2 is going to
这时μ2就是一个

262
00:08:46,100 --> 00:08:47,910
be an n-dimensional vector.
n维的向量

263
00:08:48,420 --> 00:08:49,480
Because each of these
因为

264
00:08:49,700 --> 00:08:51,050
example x1, x5, x6, x10
x(1) x(5) x(6) x(10) 都是

265
00:08:52,160 --> 00:08:53,170
each of them were an n-dimensional
n维的向量

266
00:08:53,700 --> 00:08:55,150
vector, and I'm going to
然后

267
00:08:55,240 --> 00:08:56,270
add up these things and, you
把这些相加

268
00:08:56,550 --> 00:08:57,870
know, divide by four because I
再除以4

269
00:08:57,940 --> 00:08:59,320
have four points assigned to this
因为

270
00:08:59,490 --> 00:09:00,730
cluster centroid, I end up
有4个点分配到了这个聚类中心

271
00:09:01,280 --> 00:09:02,770
with my move centroid step,
这样聚类中心μ2的移动

272
00:09:03,870 --> 00:09:05,260
for my cluster centroid mu-2.
就结束了

273
00:09:05,870 --> 00:09:06,850
This has the effect of moving
这个作用是说

274
00:09:07,210 --> 00:09:08,950
mu-2 to the average of
将μ2移动到

275
00:09:09,130 --> 00:09:10,620
the four points listed here.
这四个点的均值处

276
00:09:12,430 --> 00:09:13,850
One thing that I've asked is, well here we said, let's
我要问的问题是

277
00:09:14,080 --> 00:09:16,600
let mu-k be the average of the points assigned to the cluster.
既然我们要让μk移动到分配给它的那些点的均值处

278
00:09:17,500 --> 00:09:18,900
But what if there is
那么如果

279
00:09:18,960 --> 00:09:21,310
a cluster centroid no points
存在一个

280
00:09:21,690 --> 00:09:23,000
with zero points assigned to it.
没有点分配给它的聚类中心 那怎么办?

281
00:09:23,280 --> 00:09:24,300
In that case the more common
通常在这种情况下

282
00:09:24,650 --> 00:09:25,720
thing to do is to just
我们就直接移除

283
00:09:26,140 --> 00:09:27,220
eliminate that cluster centroid.
那个聚类中心

284
00:09:27,830 --> 00:09:28,630
And if you do that, you end
如果这么做了

285
00:09:28,840 --> 00:09:30,260
up with K minus one clusters
最终将会得到K-1个簇

286
00:09:31,350 --> 00:09:33,840
instead of k clusters.
而不是K个簇

287
00:09:34,400 --> 00:09:35,620
Sometimes if you really need
如果就是要K个簇

288
00:09:35,830 --> 00:09:37,380
k clusters, then the other
不多不少

289
00:09:37,490 --> 00:09:38,220
thing you can do if you
但是有个

290
00:09:38,290 --> 00:09:39,530
have a cluster centroid with no
没有点分配给它的聚类中心

291
00:09:39,740 --> 00:09:41,170
points assigned to it is you can
你所要做的是

292
00:09:41,260 --> 00:09:42,590
just randomly reinitialize that cluster
重新随机找一个聚类中心

293
00:09:43,450 --> 00:09:44,870
centroid, but it's more
但是直接移除那个中心

294
00:09:45,170 --> 00:09:46,590
common to just eliminate a
是更为常见的方法

295
00:09:46,670 --> 00:09:48,210
cluster if somewhere during
当你遇到了一个

296
00:09:48,410 --> 00:09:49,690
K means it with no points
没有分配点的

297
00:09:50,290 --> 00:09:52,020
assigned to that cluster centroid, and
聚类中心

298
00:09:52,140 --> 00:09:53,340
that can happen, altthough in practice
不过在实际过程中

299
00:09:53,820 --> 00:09:55,590
it happens not that often.
这个问题不会经常出现

300
00:09:55,810 --> 00:09:57,280
So that's the K means Algorithm.
这就是K均值算法

301
00:09:59,330 --> 00:10:00,220
Before wrapping up this video
在这个视频结束之前

302
00:10:00,620 --> 00:10:01,290
I just want to tell you
我还想告诉你

303
00:10:01,350 --> 00:10:02,710
about one other common application
K均值的

304
00:10:03,350 --> 00:10:04,680
of K Means and that's
另外一个常见应用

305
00:10:04,920 --> 00:10:06,840
to the problems with non well separated clusters.
应对没有很好分开的簇

306
00:10:08,160 --> 00:10:08,640
Here's what I mean.
我指的是这个

307
00:10:09,280 --> 00:10:10,320
So far we've been picturing K Means
到目前为止

308
00:10:10,950 --> 00:10:12,090
and applying it to data
我们的K均值算法

309
00:10:12,330 --> 00:10:13,520
sets like that shown here where
都是基于一些像图中所示的数据

310
00:10:14,150 --> 00:10:15,590
we have three pretty
有很好的隔离开来的

311
00:10:15,900 --> 00:10:17,380
well separated clusters, and we'd
三个簇

312
00:10:17,670 --> 00:10:19,930
like an algorithm to find maybe the 3 clusters for us.
然后我们就用这个算法找出三个簇

313
00:10:20,750 --> 00:10:21,840
But it turns out that
但是事实是

314
00:10:21,980 --> 00:10:23,180
very often K Means is also
K均值经常会用于

315
00:10:23,600 --> 00:10:24,860
applied to data sets that
一些这样的数据

316
00:10:25,170 --> 00:10:26,240
look like this where there may
看起来并没有

317
00:10:26,330 --> 00:10:28,300
not be several very
很好的分来的

318
00:10:28,550 --> 00:10:30,250
well separated clusters.
几个簇

319
00:10:30,830 --> 00:10:32,960
Here is an example application, to t-shirt sizing.
这是一个应用的例子 关于T恤的大小

320
00:10:34,070 --> 00:10:34,650
Let's say you are a t-shirt
假设你是T恤制造商

321
00:10:35,270 --> 00:10:37,360
manufacturer you've done is you've gone
你找到了一些人

322
00:10:38,030 --> 00:10:39,310
to the population that you
想把T恤卖给他们

323
00:10:39,380 --> 00:10:40,520
want to sell t-shirts to, and
然后

324
00:10:40,800 --> 00:10:42,190
you've collected a number of
你搜集了一些

325
00:10:42,580 --> 00:10:43,770
examples of the height and weight
这些人的

326
00:10:44,270 --> 00:10:45,350
of these people in your
身高和体重的数据

327
00:10:45,680 --> 00:10:46,740
population and so, well I
我猜

328
00:10:47,220 --> 00:10:48,280
guess height and weight tend to
身高体重更重要一些

329
00:10:48,370 --> 00:10:50,310
be positively highlighted so maybe
然后你可能

330
00:10:50,540 --> 00:10:51,160
you end up with a data
收集到了这样的样本

331
00:10:51,430 --> 00:10:52,590
set like this, you know, with
一些关于

332
00:10:52,830 --> 00:10:53,910
a sample or set of
人们身高和体重的样本

333
00:10:53,980 --> 00:10:56,000
examples of different peoples heights and weight.
就像这个图所表示的

334
00:10:56,530 --> 00:10:57,880
Let's say you want to size your t shirts.
然后你想确定一下T恤的大小

335
00:10:58,570 --> 00:10:59,810
Let's say I want to design
假设我们要设计

336
00:11:00,330 --> 00:11:01,480
and sell t shirts of three
三种不同大小的t恤

337
00:11:01,660 --> 00:11:03,970
sizes, small, medium and large.
小号 中号 和大号

338
00:11:04,660 --> 00:11:06,420
So how big should I make my small one?
那么小号应该是多大的?

339
00:11:06,550 --> 00:11:07,320
How big should I my medium?
中号呢?

340
00:11:07,690 --> 00:11:09,300
And how big should I make my large t-shirts.
大号呢?

341
00:11:10,370 --> 00:11:11,290
One way to do that would
有一种

342
00:11:11,410 --> 00:11:12,050
to be to run my k means
在这样的数据上

343
00:11:12,270 --> 00:11:13,570
clustering logarithm on this data
使用K均值算法进行聚类

344
00:11:13,830 --> 00:11:14,640
set that I have shown on
的方法就像我展示的那样