% !Mode:: "TeX:UTF-8"
\chapter{Dense Reconstruction}
\label{cpt:12}
\begin{mdframed}
\textbf{Goal of Study}
\begin{enumerate}[labelindent=0em,leftmargin=1.5em]
\item Learn how to estimate the dense depth in monocular SLAM.
\item Implement the dense depth estimation in monocular SLAM.
\item Learn some of the commonly used map forms in RGB-D reconstruction.
\end{enumerate}
\end{mdframed}
In this lecture, we introduce the algorithms for the mapping part. In the frontend and backend, we focused on simultaneously estimating the camera trajectory and the spatial positions of the feature points. However, in real applications, there are many requirements beyond localizing the camera. For example, consider SLAM running on a robot: we hope the map can be used for localization, navigation, obstacle avoidance, and interaction, and a feature point map obviously cannot meet all of these needs. Therefore, in this lecture, we will discuss the various forms of maps and point out the shortcomings of current visual SLAM maps.
\newpage
\section{Brief Introduction}
Mapping is one of the two goals of SLAM: after all, SLAM stands for simultaneous localization and mapping. So far, however, we have spent a long time discussing localization, including localization through feature points, the direct method, and backend optimization. Does this imply that mapping is not that important in SLAM, so we did not discuss it until this lecture?
The answer is no. In fact, in the classic SLAM model, the map is simply the collection of all landmarks. Once the landmark positions are determined, we can say that the mapping is complete. Therefore, the visual odometry and bundle adjustment models discussed earlier already model and optimize the landmark positions, so from this perspective, we have already discussed mapping. Why, then, do we need to treat the map separately?
This is because people have different needs for maps. As an underlying technology, SLAM is often used to provide information for upper-layer applications. If the upper layer is a robot, the application developer may want to use SLAM for global localization and for navigation on the map. For example, a sweeping robot needs to plan a path that covers the entire map in order to finish its cleaning task. If the upper layer is an augmented reality device, the developer may wish to superimpose virtual objects on the real scene and may also need to handle the occlusion relationship between virtual and real objects.
We find that the requirements for \textit{localization} at the application level are similar: applications expect SLAM to provide the real-time spatial pose of the subject carrying the camera. For maps, however, there are many different requirements. From the point of view of visual SLAM, \textit{mapping} is done simultaneously with \textit{localization}, but from the perspective of the application, \textit{mapping} clearly carries many other requirements. Regarding the usage of the map, we can roughly summarize it as follows:
\begin{enumerate}
\item \textbf{Localization}. Localization is a basic function of the map. In the visual odometry part, we discussed how to use local maps to achieve localization. In the loop detection part, we also saw that the robot's position can be determined through relocalization with descriptors. Furthermore, we also hope to save the map so that the robot can still localize itself in the map after the next startup. In this way, we only need to build the map once instead of running a complete SLAM every time the robot is started.
\item \textbf{Navigation}. Navigation refers to the process in which a robot plans a path on the map, finds a route between any two map points, and then controls its motion to reach the target point. In this process, we need to know at least which places on the map are passable and which are not. This is beyond the capability of sparse feature point maps, so we need another map form. We will see later that this requires at least a dense map.
\item \textbf{Obstacle avoidance}. Obstacle avoidance is also a problem often encountered by robots. It is similar to navigation but pays more attention to the handling of local and dynamic obstacles. Similarly, we cannot judge whether a feature point is an obstacle, so a dense map is also needed here.
\item \textbf{Reconstruction}. Sometimes, we hope to use SLAM to obtain a reconstruction of the surrounding environment. Such a map is mainly used for demonstration, so we hope it looks comfortable and beautiful. We can also use the map for communication so that others can remotely view the 3D objects or scenes we reconstructed, such as in 3D video calls or online shopping. This kind of map is also dense, and we have additional requirements on its appearance: we may not be satisfied with a dense point cloud but want textured surfaces, just like the 3D scenes in video games.
\item \textbf{Interaction}. Interaction mainly refers to the interaction between people and the map. For example, in augmented reality, we place virtual objects in the room and interact with them: we may click on a virtual web browser to watch a video or throw objects at the wall, hoping that they undergo a virtual physical collision. On the other hand, there are also interactions between humans and maps in robot applications. For example, the robot may receive the command ``take the newspaper on the table''. In addition to the traditional map, the robot then needs to know where the ``table'' is, what ``on'' means, and what a ``newspaper'' is. This requires the robot to have a higher-level understanding of the map, also known as a semantic map.
\end{enumerate}
\autoref{fig:maps} visually explains the relationship between the various map types discussed above and their uses. Our previous discussion basically focused on the ``sparse feature map'' part and did not cover dense maps. A dense map models not only the parts of interest, i.e., the feature points, but everything that is seen. For a table, a sparse map may only model its four corners, while a dense map models the entire desktop. From the localization perspective, a map with only the four corners can still be used to locate the camera, but since we cannot infer the spatial structure between these points, tasks such as navigation and obstacle avoidance, which require dense maps, cannot be accomplished with the four corners alone.
\begin{figure}[!ht]
\centering
\includegraphics[width=1.0\textwidth]{mapping/maps.pdf}
\caption{Different maps from~\cite{Mur-Artal2015, Labbe2014, Salas-Moreno2013}.}
\label{fig:maps}
\end{figure}
As can be seen from the above discussion, dense maps occupy a significant position. So, the remaining question is: can a dense map be built through visual SLAM? If so, how do we build it?
\section{Monocular Dense Reconstruction}
\subsection{Stereo Vision}
Dense reconstruction in visual SLAM is the important topic of this lecture. Cameras have long been considered bearing-only sensors: a pixel in a single image only tells us the direction from the camera to an object and the brightness of that object, but not the distance (range). In dense reconstruction, we need to know the distance for every pixel (or at least for most pixels). There are roughly the following solutions:
\begin{enumerate}
\item Use monocular cameras and estimate the depth using triangulation after motion.
\item Use stereo cameras and compute the depth from disparity (similarly for rigs with more than two cameras).
\item Use the depth sensor in RGB-D cameras to directly get the depth.
\end{enumerate}
The first two methods are sometimes called stereo vision; the moving monocular case is also known as moving view stereo (MVS). Compared with the depth measured directly by RGB-D cameras, depth obtained with monocular or stereo cameras tends to be fragile: we spend a lot of computation and finally get some not-so-reliable depth estimates. Of course, RGB-D cameras also have limits on measurement range, applicable scenarios, and illumination conditions, but compared with monocular and stereo results, using RGB-D for dense reconstruction is often the more common choice. The advantage of monocular and stereo cameras is that, in outdoor and large-scale scenes where RGB-D cameras do not yet work well, depth can still be estimated through stereo vision.
Having said that, in this section we will walk the reader through a monocular dense depth estimation and experience why it is a thankless job. Let's start with the simplest case: estimating the depth of one image from a video sequence with a given camera trajectory. In other words, we do not consider SLAM for now and first look at the slightly simpler mapping problem.
Assume that we have a video sequence and that we obtain the trajectory corresponding to each frame by some means (in practice, it would likely be estimated by a visual odometry frontend). We use the first image as the reference frame and compute the depth (or distance) of every pixel in it. First, recall how we completed this process in the feature matching approach:
\begin{enumerate}
\item First, we extract features from the images and compute the matches between them using descriptors. In other words, through the features, we track a certain spatial point and know its position in each image.
\item Then, since we cannot determine the feature point's position with only one image, we must estimate its depth through observations under different viewing angles. The principle is the triangulation mentioned above.
\end{enumerate}
\subsection{Epipolar Line Search and Block Matching}
Let's first discuss the geometric relationship produced by observing the same point from different viewpoints. This is very similar to the epipolar geometry discussed in section ~\ref{sec:epipolar-geometry}~. Please see \autoref{fig:epipolar-line-search}~. The camera on the left has observed a certain pixel $\mathbf{p}_1$. Since this is a monocular camera, we have no way of knowing its depth, so we assume the depth lies within some interval, say between a certain minimum value and infinity: $(d_\mathrm{min}, +\infty)$. Therefore, the spatial points corresponding to this pixel are distributed along a line segment (a ray in this example). Seen from the camera on the right, the projection of this line segment also forms a line on the image plane, which, as we know, is called the epipolar line. When the motion between the two cameras is known, this epipolar line can be determined\footnote{Conversely, if the motion is not known, the epipolar line cannot be determined. }. The question then is: which point on the epipolar line corresponds to the $\mathbf{p}_1$ we just saw?
\begin{figure}[!htp]
\centering
\includegraphics[width=.8\textwidth]{mapping/epipolar-line-search.pdf}
\caption{Epipolar line search.}
\label{fig:epipolar-line-search}
\end{figure}
Again, in the feature matching method, we find the position of $\mathbf{p}_2$ through features. However, now we have no descriptors, so we can only search along the epipolar line for points similar to $\mathbf{p}_1$. More specifically, we walk along the epipolar line in the second image from one end to the other and compare each pixel's similarity with $\mathbf{p}_1$ one by one. From the point of view of directly comparing pixels, this approach is the same as the direct method.
When discussing the direct method, we learned that comparing the brightness of a single pixel is not necessarily stable and reliable. One obvious issue is: if there are many pixels similar to $\mathbf{p}_1$ along the epipolar line, how can we determine which one is the true correspondence? This seems to bring us back to the question raised in loop detection: how do we determine the similarity of two images (or of two points)? Loop detection is solved with the bag-of-words model, but here we have no features, so we have to find another way.
An intuitive idea is: since a single pixel's brightness is not distinctive, can we compare pixel blocks instead? We take a small block of size $w \times w$ around $\mathbf{p}_1$ and then take many small blocks of the same size along the epipolar line for comparison, which should improve the discrimination to a certain extent. This is the so-called block matching method. Note that this comparison is meaningful only if the gray values of the entire small block remain unchanged between the two images. Therefore, the algorithm's assumption has changed from the grayscale invariance of one pixel to that of an image block, which is, to some extent, a stronger assumption.
Okay, now we have taken the small block around $\mathbf{p}_1$ and many small blocks along the epipolar line. We denote the small block around $\mathbf{p}_1$ as $\mathbf{A} \in \mathbb{R}^{w \times w}$ and the $n$ small blocks on the epipolar line as $\mathbf{B}_i, i=1, \cdots, n$. So, how do we compute the difference between the block $\mathbf{A}$ and the blocks $\mathbf{B}_i$? There are several options:
\begin{enumerate}
\item SAD (Sum of Absolute Difference):
\begin{equation}
S( \mathbf{A}, \mathbf{B} )_{\mathrm{SAD}} = \sum_{i,j} | \mathbf{A}(i,j) - \mathbf{B}(i,j) |.
\end{equation}
\item SSD (Sum of Squared Distance), not solid state drive:
\begin{equation}
S( \mathbf{A}, \mathbf{B} )_{\mathrm{SSD}} = \sum_{i,j} \left( \mathbf{A}(i,j) - \mathbf{B}(i,j) \right)^2.
\end{equation}
\item NCC (Normalized Cross Correlation):
\begin{equation}
S( \mathbf{A}, \mathbf{B} )_{\mathrm{NCC}} = \frac{{\sum\limits_{i,j} {\mathbf{A}(i,j)\mathbf{B}(i,j)} }}{{\sqrt {\sum\limits_{i,j} {\mathbf{A}{{(i,j)}^2}\sum\limits_{i,j} {\mathbf{B}{{(i,j)}^2}} } } }}.
\end{equation}
Please note that since a correlation is used here, a value close to 0 means that the two image blocks are not similar, and a value close to 1 means they are similar. The first two measures are the opposite: a value close to 0 means similar, while a larger value means different.
\end{enumerate}
Like many situations we have encountered, these measures trade accuracy against efficiency. Methods with good accuracy often require complex calculations, while simple and fast methods often do not work well, and we have to make a choice in actual engineering. In addition to these plain versions, we can first subtract each block's mean, which gives the zero-mean SSD, zero-mean NCC, and so on. After removing the mean, we allow situations like ``block $\mathbf{B}$ is brighter than $\mathbf{A}$ as a whole but still very similar'' \footnote{The overall brightness difference may be caused by ambient light or by a change in the camera exposure parameters. }, so these versions are more robust. If readers are interested in other block matching measures, we recommend the literature~\cite{stereo-matching-website, Hirschmuller2007} as supplementary material.
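To make these measures concrete, the following is a minimal sketch (not part of the book's demo program; the function names are ours) that computes SAD, SSD, and zero-mean NCC for two equally sized grayscale patches stored as OpenCV \texttt{cv::Mat} of type \texttt{CV\_64F}:
\begin{lstlisting}[language=c++]
#include <opencv2/core.hpp>
#include <cmath>

// Sum of absolute differences: close to 0 means similar.
double SAD(const cv::Mat &A, const cv::Mat &B) {
    double s = 0;
    for (int y = 0; y < A.rows; y++)
        for (int x = 0; x < A.cols; x++)
            s += std::fabs(A.at<double>(y, x) - B.at<double>(y, x));
    return s;
}

// Sum of squared differences: close to 0 means similar.
double SSD(const cv::Mat &A, const cv::Mat &B) {
    double s = 0;
    for (int y = 0; y < A.rows; y++)
        for (int x = 0; x < A.cols; x++) {
            double d = A.at<double>(y, x) - B.at<double>(y, x);
            s += d * d;
        }
    return s;
}

// Zero-mean NCC: close to 1 means similar.
double ZNCC(const cv::Mat &A, const cv::Mat &B) {
    double mean_A = cv::mean(A)[0], mean_B = cv::mean(B)[0];
    double num = 0, den_A = 0, den_B = 0;
    for (int y = 0; y < A.rows; y++)
        for (int x = 0; x < A.cols; x++) {
            double a = A.at<double>(y, x) - mean_A;
            double b = B.at<double>(y, x) - mean_B;
            num += a * b;
            den_A += a * a;
            den_B += b * b;
        }
    return num / std::sqrt(den_A * den_B + 1e-10);
}
\end{lstlisting}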
We have now computed the similarity measure between $\mathbf{A}$ and the blocks $\mathbf{B}_i$ along the epipolar line. For convenience, suppose we use NCC; we then obtain an NCC distribution along the epipolar line. The shape of this distribution depends heavily on the image data, as shown in \autoref{fig:matching-score}~. Over a long search range, we usually get a non-convex function: the distribution has many peaks, but there can be only one true corresponding point. In this case, we tend to describe the depth with a probability distribution rather than a single value. Our problem thus turns into how to update the depth distribution as we keep performing epipolar searches on new images. This is the so-called \textit{depth filter}.
\begin{figure}[!htp]
\centering
\includegraphics[width=.75\textwidth]{mapping/matching-score.pdf}
\caption{Matching score along with the epipolar line~\cite{Vogiatzis2011}.}
\label{fig:matching-score}
\end{figure}
\subsection{Gaussian Depth Filters}
The estimation of pixel depth can also be modeled as a state estimation problem, so there are naturally two ways of solving it: filtering and nonlinear optimization. Although nonlinear optimization usually gives better results, under the strong real-time requirements of SLAM, and considering that the frontend already consumes much of the computation budget, the cheaper filtering approach is usually used for mapping. This is the purpose of the depth filter discussed in this section.
There are several different ways of modeling the distribution of depth. Under a relatively simple assumption, we can assume that the depth value obeys a Gaussian distribution and obtain a Kalman-like method (which, in fact, is just a normalized product, as we will see later). On the other hand, works such as~\cite{Vogiatzis2011, Forster2014} assume a uniform-Gaussian mixture distribution and derive a more complex depth filter. For simplicity and ease of use, we first introduce and demonstrate the depth filter under the Gaussian assumption and leave the filter with the uniform-Gaussian mixture distribution as an exercise.
Assume that the depth $d$ of a certain pixel satisfies:
\begin{equation}
P(d) = N(\mu, \sigma^2).
\end{equation}
And whenever new data arrives, we will observe its depth. Similarly, suppose this observation is also a Gaussian distribution:
\begin{equation}
P(d_{\mathrm{obs}}) = N(\mu_{\mathrm{obs}}, \sigma_{\mathrm{obs}}^2 ).
\end{equation}
Therefore, our question is how to use the observed information to update the original distribution of $d$. This is exactly an information fusion problem. According to appendix~\ref{cpt:app-A}, we know that the normalized product of two Gaussian distributions is still a Gaussian distribution. Suppose the distribution of $d$ after fusion is $N(\mu_{\mathrm{fuse}}, \sigma_{\mathrm{fuse}}^2)$, then according to the product of the Gaussian distribution, there are:
\begin{equation}
{\mu _{\mathrm{fuse}}} = \frac{{\sigma _{\mathrm{obs}}^2\mu + {\sigma ^2}{\mu _{\mathrm{obs}}}}}{{{\sigma ^2} + \sigma _{\mathrm{obs}}^2}},\quad \sigma _{\mathrm{fuse}}^2 = \frac{{{\sigma ^2}\sigma _{\mathrm{obs}}^2}}{{{\sigma ^2} + \sigma _{\mathrm{obs}}^2}}.
\end{equation}
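As a quick illustration (not taken from the book's code; the helper name is ours), this fusion step can be written as a tiny function:
\begin{lstlisting}[language=c++]
#include <utility>

// Fuse a prior N(mu, sigma2) with an observation N(mu_obs, sigma2_obs).
// Returns the fused mean and variance, mirroring the formulas above.
std::pair<double, double> fuseGaussian(double mu, double sigma2,
                                       double mu_obs, double sigma2_obs) {
    double mu_fuse = (sigma2_obs * mu + sigma2 * mu_obs) / (sigma2 + sigma2_obs);
    double sigma2_fuse = (sigma2 * sigma2_obs) / (sigma2 + sigma2_obs);
    return {mu_fuse, sigma2_fuse};
}
\end{lstlisting}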
Since we only have observation equations and no motion equation, the depth estimation only uses this information fusion part; there is no prediction step as in a complete Kalman filter. Of course, you can also view it as a Kalman filter whose motion equation keeps the depth fixed. The fusion equation is indeed simple and easy to understand, but the question remains: how do we determine the distribution of the observed depth? That is, how do we compute $\mu_{\mathrm{obs}}$ and $\sigma_{\mathrm{obs}}$?
Regarding $\mu_{\mathrm{obs}}$ and $\sigma_{\mathrm{obs}}$, there are also several different treatments. For example,~\cite{Engel2013} considers the sum of geometric uncertainty and photometric uncertainty, while~\cite{Vogiatzis2011} only considers geometric uncertainty. For now, we only consider the uncertainty caused by the geometric relations. Suppose we have determined the projection position of a pixel through epipolar search and block matching, and we know that the block matching's accuracy is about one pixel. How does this uncertainty affect the estimated depth?
Take \autoref{fig:uncertainty-mapping}~ as an example. Suppose that through epipolar search we find the pixel $\mathbf{p}_2$ corresponding to $\mathbf{p}_1$ and then compute the depth of $\mathbf{p}_1$. Let the 3D point corresponding to $\mathbf{p}_1$ be $\mathbf{P}$. We denote $\mathbf{O}_1 \mathbf{P}$ by $\mathbf{p}$, the camera translation $\mathbf{O}_1 \mathbf{O}_2$ by $\mathbf{t}$, and $\mathbf{O}_2 \mathbf{P}$ by $\mathbf{a}$. Let the two bottom angles of this triangle be $\alpha$ and $\beta$. Now, suppose there is a one-pixel error on the epipolar line $l_2$, so that the angle $\beta$ becomes $\beta'$ and $\mathbf{p}_2$ changes to $\mathbf{p}_2'$. Let the top angle be $\gamma$. We ask: if $\mathbf{p}_2 \mathbf{p}_2'$ is one pixel, how much does $\mathbf{p}'$ differ from $\mathbf{p}$?
\begin{figure}[!ht]
\centering
\includegraphics[width=.84\textwidth]{mapping/uncertainty.pdf}
\caption{Uncertainty analysis.}
\label{fig:uncertainty-mapping}
\end{figure}
This is a typical geometric problem. Let's list the geometric relationship between these quantities. Obviously:
\begin{equation}
\begin{array}{l}
\mathbf{a} = \mathbf{p} - \mathbf{t} \\
\alpha = \arccos \left\langle {\mathbf{p}, \mathbf{t}} \right\rangle \\
\beta = \arccos \left\langle {\mathbf{a}, - \mathbf{t}} \right\rangle .
\end{array}
\end{equation}
Perturbing $\mathbf{p}_2$ by one pixel causes $\beta$ to change to $\beta'$. According to the geometric relationship, we have:
\begin{equation}
\begin{array}{l}
\beta ' = \arccos \left\langle {\mathbf{O}_2 \mathbf{p}_2', -\mathbf{t}} \right\rangle \\
\gamma = \pi - \alpha - \beta '.
\end{array}
\end{equation}
Therefore, according to the law of sines, $\|\mathbf{p}'\|$ can be obtained as:
\begin{equation}
\| \mathbf{p}' \| = \| \mathbf{t} \| \frac{{\sin \beta '}}{{\sin \gamma }}.
\end{equation}
Thus, we have computed the change in depth caused by a one-pixel uncertainty. If we assume that the block matching in the epipolar search has an error of exactly one pixel, we can set:
\begin{equation}
\sigma_{\mathrm{obs}} = \| \mathbf{p} \|-\| \mathbf{p}' \|.
\end{equation}
Of course, if the uncertainty of epipolar search is greater than one pixel, we can also amplify this uncertainty according to this derivation. The depth fusion process has been introduced before. In engineering, when the uncertainty is less than a certain threshold, it can be considered that the depth data has converged.
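The following Eigen-based sketch illustrates this uncertainty computation in code (the same geometry appears later inside the demo program's \texttt{updateDepthFilter}); here \texttt{p} is the 3D point $\mathbf{P}$ expressed in the reference frame, \texttt{t} is the translation $\mathbf{t}$, and \texttt{f\_curr\_prime} is the normalized bearing (in the reference frame orientation) of the match perturbed by one pixel along the epipolar line. These names are our own:
\begin{lstlisting}[language=c++]
#include <Eigen/Core>
#include <cmath>

// Depth uncertainty sigma_obs caused by a one-pixel matching error.
double depthUncertainty(const Eigen::Vector3d &p, const Eigen::Vector3d &t,
                        const Eigen::Vector3d &f_curr_prime) {
    double alpha = std::acos(p.normalized().dot(t.normalized()));
    double beta_prime = std::acos(f_curr_prime.dot(-t.normalized()));
    double gamma = M_PI - alpha - beta_prime;           // top angle of the triangle
    double p_prime_norm = t.norm() * std::sin(beta_prime) / std::sin(gamma);
    return p.norm() - p_prime_norm;   // ||p|| - ||p'||, sign is irrelevant once squared
}
\end{lstlisting}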
In summary, we give a complete process of estimating the pixel depth:
\begin{mdframed}
\begin{enumerate}
\item Assume that the depth of every pixel satisfies an initial Gaussian distribution.
\item When a new image arrives, determine the projected point's location through epipolar search and block matching.
\item Compute the depth of the triangulated point and its uncertainty from the geometric relationship.
\item Fuse the current observation into the previous estimate. Stop if the estimate has converged; otherwise, return to step 2.
\end{enumerate}
\end{mdframed}
These steps constitute a feasible depth estimation scheme. Please note that the depth value here is the length of $\mathbf{O}_1 \mathbf{P}$, which is slightly different from the depth in the pinhole camera model, where depth refers to the $z$ value of a pixel's 3D point. We will demonstrate the results of the algorithm in the practice part.
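If needed, converting between the two conventions is straightforward: for a normalized camera-frame bearing $\mathbf{f}$ of a pixel, the distance $d$ along the ray and the pinhole depth $z$ are related by $z = d \cdot f_z$. A minimal sketch (helper names are ours):
\begin{lstlisting}[language=c++]
#include <Eigen/Core>

// f is the normalized bearing vector of the pixel in the camera frame.
inline double rayDistanceToZ(double d, const Eigen::Vector3d &f) { return d * f.z(); }
inline double zToRayDistance(double z, const Eigen::Vector3d &f) { return z / f.z(); }
\end{lstlisting}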
\section{Practice: Monocular Dense Reconstruction}
The sample program in this section will use the test dataset of REMODE {\cite{Handa2012, Pizzoli2014}}. It provides a total of 200 monocular top-view images collected by a drone, and it also provides the ground-truth pose of each image. Based on these data, we can estimate each pixel's depth value of the first frame, which is exactly the dense monocular reconstruction.
The dataset can be downloaded from ~\url{http://rpg.ifi.uzh.ch/datasets/remode_test_data.zip}~ with a web browser or any download tool. After decompression, the images from 0 to 200 can be found in test\_data/Images, and a text file in the test\_data directory records the corresponding poses:
\begin{lstlisting}
scene_000.png 1.086410 4.766730 -1.449960 0.789455 0.051299 -0.000779 0.611661
scene_001.png 1.086390 4.766370 -1.449530 0.789180 0.051881 -0.001131 0.611966
scene_002.png 1.086120 4.765520 -1.449090 0.788982 0.052159 -0.000735 0.612198
......
\end{lstlisting}
\autoref{fig:remode-dataset}~ shows images from several moments. The scene mainly consists of the ground, a table, and small objects on the table. If the depth estimate is roughly correct, we should at least see the difference between the table's depth and the ground's. Below, following the previous explanation, we write the dense depth estimation program. The program is written in a procedural, C-like style and placed in a single file for ease of understanding. It is a bit long compared with the previous demos, so we will focus on a few important functions; please read the rest from the source code on GitHub.
\begin{figure}[!ht]
\centering
\includegraphics[width=1.0\textwidth]{mapping/remode-dataset.pdf}
\caption{Sample images from the dataset.}
\label{fig:remode-dataset}
\end{figure}
\begin{lstlisting}[language=c++,caption=slambook2/ch12/dense\_monocular/dense\_mapping.cpp (part)]
/**********************************************
* This program demonstrates the dense depth estimation of a monocular camera under a known trajectory
* use epipolar search + NCC matching method, which corresponds to section 12.2 of the book
* Please note that this program is not perfect, you can improve it by yourself.
***********************************************/
// ------------------------------------------------------------------
// parameters
const int boarder = 20; // image border
const int width = 640; // image width
const int height = 480; // image height
const double fx = 481.2f; // camera intrinsics
const double fy = -480.0f;
const double cx = 319.5f;
const double cy = 239.5f;
const int ncc_window_size = 3; // half window size of NCC
const int ncc_area = (2 * ncc_window_size + 1) * (2 * ncc_window_size + 1); // area of NCC
const double min_cov = 0.1; // converge criteria: minimal covariance
const double max_cov = 10; // disconverge criteria: maximal covariance
// ------------------------------------------------------------------
// important functions
/**
* update depth using new images
* @param ref reference image
* @param curr current image
* @param T_C_R matrix from ref to cur
* @param depth depth estimation
* @param depth_cov covariance of depth
* @return true if success
*/
bool update(
const Mat &ref, const Mat &curr, const SE3d &T_C_R,
Mat &depth, Mat &depth_cov2);
/**
* epipolar search
* @param ref reference image
* @param curr current image
* @param T_C_R matrix from ref to cur
* @param pt_ref point in ref
* @param depth_mu mean of depth
* @param depth_cov cov of depth
* @param pt_curr point in current
* @param epipolar_direction
* @return true if success
*/
bool epipolarSearch(
const Mat &ref, const Mat &curr, const SE3d &T_C_R,
const Vector2d &pt_ref, const double &depth_mu, const double &depth_cov,
Vector2d &pt_curr, Vector2d &epipolar_direction);
/**
* update depth filter
* @param pt_ref point in ref
* @param pt_curr point in cur
* @param T_C_R matrix from ref to cur
* @param epipolar_direction
* @param depth mean of depth
* @param depth_cov2 cov of depth
* @return true if success
*/
bool updateDepthFilter(
const Vector2d &pt_ref, const Vector2d &pt_curr, const SE3d &T_C_R,
const Vector2d &epipolar_direction, Mat &depth, Mat &depth_cov2);
/**
* NCC computation
* @param ref reference image
* @param curr current image
* @param pt_ref reference pixel
* @param pt_curr current pixel
* @return NCC score
*/
double NCC(const Mat &ref, const Mat &curr, const Vector2d &pt_ref, const Vector2d &pt_curr);
// bilinear interpolation
inline double getBilinearInterpolatedValue(const Mat &img, const Vector2d &pt) {
uchar *d = &img.data[int(pt(1, 0)) * img.step + int(pt(0, 0))];
double xx = pt(0, 0) - floor(pt(0, 0));
double yy = pt(1, 0) - floor(pt(1, 0));
return ((1 - xx) * (1 - yy) * double(d[0]) +
xx * (1 - yy) * double(d[1]) +
(1 - xx) * yy * double(d[img.step]) +
xx * yy * double(d[img.step + 1])) / 255.0;
}
int main(int argc, char **argv) {
if (argc != 2) {
cout << "Usage: dense_mapping path_to_test_dataset" << endl;
return -1;
}
// read data
vector<string> color_image_files;
vector<SE3d> poses_TWC;
Mat ref_depth;
bool ret = readDatasetFiles(argv[1], color_image_files, poses_TWC, ref_depth);
if (ret == false) {
cout << "Reading image files failed!" << endl;
return -1;
}
cout << "read total " << color_image_files.size() << " files." << endl;
// first image
Mat ref = imread(color_image_files[0], 0); // gray-scale image
SE3d pose_ref_TWC = poses_TWC[0];
double init_depth = 3.0; // initial depth
double init_cov2 = 3.0; // initial covariance
Mat depth(height, width, CV_64F, init_depth); // depth image
Mat depth_cov2(height, width, CV_64F, init_cov2); // depth cov image
for (int index = 1; index < color_image_files.size(); index++) {
cout << "*** loop " << index << " ***" << endl;
Mat curr = imread(color_image_files[index], 0);
if (curr.data == nullptr) continue;
SE3d pose_curr_TWC = poses_TWC[index];
SE3d pose_T_C_R = pose_curr_TWC.inverse() * pose_ref_TWC; // T_C_W * T_W_R = T_C_R
update(ref, curr, pose_T_C_R, depth, depth_cov2);
evaludateDepth(ref_depth, depth);
plotDepth(ref_depth, depth);
imshow("image", curr);
waitKey(1);
}
cout << "estimation returns, saving depth map ..." << endl;
imwrite("depth.png", depth);
cout << "done." << endl;
return 0;
}
bool update(const Mat &ref, const Mat &curr, const SE3d &T_C_R, Mat &depth, Mat &depth_cov2) {
for (int x = boarder; x < width - boarder; x++)
for (int y = boarder; y < height - boarder; y++) {
if (depth_cov2.ptr<double>(y)[x] < min_cov || depth_cov2.ptr<double>(y)[x] > max_cov) {
// converge or abort
continue;
}
// search match of (x,y) along the epipolar line
Vector2d pt_curr;
Vector2d epipolar_direction;
bool ret = epipolarSearch(
ref, curr, T_C_R, Vector2d(x, y), depth.ptr<double>(y)[x], sqrt(depth_cov2.ptr<double>(y)[x]), pt_curr, epipolar_direction);
if (ret == false) // failed
continue;
// un-comment this to display the match result
// showEpipolarMatch(ref, curr, Vector2d(x, y), pt_curr);
// update if succeed
updateDepthFilter(Vector2d(x, y), pt_curr, T_C_R, epipolar_direction, depth, depth_cov2);
}
return true;
}
bool epipolarSearch(
const Mat &ref, const Mat &curr,
const SE3d &T_C_R, const Vector2d &pt_ref,
const double &depth_mu, const double &depth_cov,
Vector2d &pt_curr, Vector2d &epipolar_direction) {
Vector3d f_ref = px2cam(pt_ref);
f_ref.normalize();
Vector3d P_ref = f_ref * depth_mu; // reference vector
Vector2d px_mean_curr = cam2px(T_C_R * P_ref); // pixel according to mean depth
double d_min = depth_mu - 3 * depth_cov, d_max = depth_mu + 3 * depth_cov;
if (d_min < 0.1) d_min = 0.1;
Vector2d px_min_curr = cam2px(T_C_R * (f_ref * d_min)); // pixel of minimal depth
Vector2d px_max_curr = cam2px(T_C_R * (f_ref * d_max)); // pixel of maximal depth
Vector2d epipolar_line = px_max_curr - px_min_curr; // epipolar line
epipolar_direction = epipolar_line; // normalized
epipolar_direction.normalize();
double half_length = 0.5 * epipolar_line.norm();
if (half_length > 100) half_length = 100; // we don't want to search too far
// un-comment this to show the epipolar line
// showEpipolarLine( ref, curr, pt_ref, px_min_curr, px_max_curr );
// epipolar search
double best_ncc = -1.0;
Vector2d best_px_curr;
for (double l = -half_length; l <= half_length; l += 0.7) { // step size is about sqrt(2)/2
Vector2d px_curr = px_mean_curr + l * epipolar_direction;
if (!inside(px_curr))
continue;
// compute NCC score
double ncc = NCC(ref, curr, pt_ref, px_curr);
if (ncc > best_ncc) {
best_ncc = ncc;
best_px_curr = px_curr;
}
}
if (best_ncc < 0.85f) // only trust NCC with high scores
return false;
pt_curr = best_px_curr;
return true;
}
double NCC(
const Mat &ref, const Mat &curr,
const Vector2d &pt_ref, const Vector2d &pt_curr) {
// zero-mean NCC
// compute the mean
double mean_ref = 0, mean_curr = 0;
vector<double> values_ref, values_curr;
for (int x = -ncc_window_size; x <= ncc_window_size; x++)
for (int y = -ncc_window_size; y <= ncc_window_size; y++) {
double value_ref = double(ref.ptr<uchar>(int(y + pt_ref(1, 0)))[int(x + pt_ref(0, 0))]) / 255.0;
mean_ref += value_ref;
double value_curr = getBilinearInterpolatedValue(curr, pt_curr + Vector2d(x, y));
mean_curr += value_curr;
values_ref.push_back(value_ref);
values_curr.push_back(value_curr);
}
mean_ref /= ncc_area;
mean_curr /= ncc_area;
// compute Zero mean NCC
double numerator = 0, demoniator1 = 0, demoniator2 = 0;
for (int i = 0; i < values_ref.size(); i++) {
double n = (values_ref[i] - mean_ref) * (values_curr[i] - mean_curr);
numerator += n;
demoniator1 += (values_ref[i] - mean_ref) * (values_ref[i] - mean_ref);
demoniator2 += (values_curr[i] - mean_curr) * (values_curr[i] - mean_curr);
}
return numerator / sqrt(demoniator1 * demoniator2 + 1e-10);
}
bool updateDepthFilter(
const Vector2d &pt_ref, const Vector2d &pt_curr, const SE3d &T_C_R,
const Vector2d &epipolar_direction, Mat &depth, Mat &depth_cov2) {
// anybody still reading?
// triangulation
SE3d T_R_C = T_C_R.inverse();
Vector3d f_ref = px2cam(pt_ref);
f_ref.normalize();
Vector3d f_curr = px2cam(pt_curr);
f_curr.normalize();
// equation:
// d_ref * f_ref = d_cur * ( R_RC * f_cur ) + t_RC
// f2 = R_RC * f_cur
// convert to this:
// => [ f_ref^T f_ref, -f_ref^T f2 ] [d_ref]   [f_ref^T t]
//    [ f2^T f_ref,    -f2^T f2    ] [d_cur] = [f2^T t   ]
Vector3d t = T_R_C.translation();
Vector3d f2 = T_R_C.so3() * f_curr;
Vector2d b = Vector2d(t.dot(f_ref), t.dot(f2));
Matrix2d A;
A(0, 0) = f_ref.dot(f_ref);
A(0, 1) = -f_ref.dot(f2);
A(1, 0) = -A(0, 1);
A(1, 1) = -f2.dot(f2);
Vector2d ans = A.inverse() * b;
Vector3d xm = ans[0] * f_ref; // result in ref
Vector3d xn = t + ans[1] * f2; // result in cur
Vector3d p_esti = (xm + xn) / 2.0; // take average as p
double depth_estimation = p_esti.norm(); // depth
// compute the covariance
Vector3d p = f_ref * depth_estimation;
Vector3d a = p - t;
double t_norm = t.norm();
double a_norm = a.norm();
double alpha = acos(f_ref.dot(t) / t_norm);
double beta = acos(-a.dot(t) / (a_norm * t_norm));
Vector3d f_curr_prime = px2cam(pt_curr + epipolar_direction);
f_curr_prime.normalize();
double beta_prime = acos(f_curr_prime.dot(-t) / t_norm);
double gamma = M_PI - alpha - beta_prime;
double p_prime = t_norm * sin(beta_prime) / sin(gamma);
double d_cov = p_prime - depth_estimation;
double d_cov2 = d_cov * d_cov;
// Gaussian fusion
double mu = depth.ptr<double>(int(pt_ref(1, 0)))[int(pt_ref(0, 0))];
double sigma2 = depth_cov2.ptr<double>(int(pt_ref(1, 0)))[int(pt_ref(0, 0))];
double mu_fuse = (d_cov2 * mu + sigma2 * depth_estimation) / (sigma2 + d_cov2);
double sigma_fuse2 = (sigma2 * d_cov2) / (sigma2 + d_cov2);
depth.ptr<double>(int(pt_ref(1, 0)))[int(pt_ref(0, 0))] = mu_fuse;
depth_cov2.ptr<double>(int(pt_ref(1, 0)))[int(pt_ref(0, 0))] = sigma_fuse2;
return true;
}
\end{lstlisting}
We omit functions such as plotting and data reading and only show the parts related to the depth computation. If the reader understands the previous section, the source code here should not be difficult to follow. Nevertheless, we briefly explain several key functions:
\begin{enumerate}
\item The main function is very simple. It is only responsible for reading images from the dataset and handing them over to the update function to update the depth map.
\item In the update function, we traverse each pixel of the reference frame and first look for an epipolar match in the current frame. If a match is found, we use it to update the depth map estimate.
\item The principle of epipolar search is roughly the same as the one introduced in the previous section, although some details have been added in the implementation. Because the depth is assumed to obey a Gaussian distribution, we take the mean as the center and $\pm 3 \sigma$ as the radius and then find this segment's projection, i.e., the epipolar line, in the current frame. We then traverse the pixels on this epipolar line (with a step size of 0.7 pixels, roughly $\sqrt{2}/2$) and take the point with the highest NCC as the matching point. If the highest NCC is lower than the threshold (0.85 here), the match is considered failed.
\item The calculation of NCC uses the zero-mean version, that is, for the image block $\mathbf{A}, \mathbf{B}$, take:
\begin{equation}
\mathrm{NCC}_{z} (\mathbf{A}, \mathbf{B}) = \frac{\sum\limits_{i,j} \left( \mathbf{A}(i,j)-\bar{\mathbf{A}} \right)\left( \mathbf{B}(i,j)-\bar{\mathbf{B}} \right)}{\sqrt{\sum\limits_{i,j} \left( \mathbf{A}(i,j)-\bar{\mathbf{A}} \right)^2 \sum\limits_{i,j} \left( \mathbf{B}(i,j)-\bar{\mathbf{B}} \right)^2 }},
\end{equation}
where $\bar{\mathbf{A}}$ and $\bar{\mathbf{B}}$ are the mean values of blocks $\mathbf{A}$ and $\mathbf{B}$, respectively.
\item The triangulation part is consistent with section~\ref{sec:7.5}, and the uncertainty computation and Gaussian fusion are consistent with the previous section.
\end{enumerate}
Although the program is a bit long, I believe readers can understand it according to the above tips. Let's take a look at the results.
\subsection*{Experimental Results}
After compiling this program, run it with the dataset directory as the argument: \footnote{Please note that dense depth estimation is time-consuming. If your computer is older, please wait patiently for a while. }
\begin{lstlisting}[language=sh,caption=Terminal output:]
$ build/dense_mapping ~/dataset/test_data
read total 202 files.
*** loop 1 ***
*** loop 2 ***
......
\end{lstlisting}
The program's output is relatively concise, showing only the iteration count, the current image, and the depth map. For display, we multiply the depth map by 0.4; that is, a pure white point (value 1.0) corresponds to a depth of about 2.5 meters. The darker the color, the smaller the depth and the closer the object. If you have run the program, you should find that depth estimation is a dynamic process that gradually converges from an uncertain initial value to a stable value. Our initial value is a distribution with mean 3.0 and variance 3.0. Of course, you can also modify the initial distribution and see how it affects the results.
From \autoref{fig:snapshot}~ we can see that after a certain number of iterations, the depth map becomes stable and new data hardly changes it anymore. In the stabilized depth map, the difference between the floor and the table can be roughly seen, and the depths of the objects on the table are close to the table's depth. Some of the estimates are correct, but there are also many wrong ones. They appear as pixels whose depth is inconsistent with their surroundings, with estimates that are obviously too large or too small. The wrong estimates are often located around edges, where the pixels are observed too few times to be estimated correctly. In summary, most of the depth map is correct, but it does not achieve the desired effect. We will analyze the causes in the next section and discuss what can be improved.
\begin{figure}[!ht]
\centering
\includegraphics[width=.9\textwidth]{mapping/snapshot.pdf}
\caption{Snapshots of running the depth filter after 10 and 30 iterations. }
\label{fig:snapshot}
\end{figure}
\subsection{Discussion}
In the previous section, we demonstrated dense mapping with a moving monocular camera and estimated the depth of every pixel of the reference frame. Our code is simple and straightforward, without any tricks, and this illustrates a common situation: simple methods often do not work well in real engineering. Due to the complexity of real data, programs that work in real environments usually require careful design and many engineering tricks, which makes much practical code extremely complicated and hard to explain to beginners, so we chose a less effective but relatively easy-to-read implementation. Below we put forward several suggestions for improving the demo program, but we do not intend to present the modified (and much more complicated) program directly to the reader.
Below, we give a preliminary analysis of the experimental results from the perspectives of computer vision and of filtering.
\subsection{Pixel Gradients}
Observing the depth image, we find an obvious fact: whether the block matching is correct or not depends on whether the image block is distinguishable. Obviously, if the image block is just a patch of uniform black or white, lacking visual information, we may erroneously match it with some surrounding pixels. For example, the printer surface in the demo program is uniformly white, which easily causes mismatches, so the depth information on the printer's surface is mostly incorrect: its surface shows clearly undesirable striped depth estimates, whereas, intuitively, the printer's surface should be smooth.
This involves a problem we have already seen in the direct method chapter. When performing block matching (and computing the NCC), we must assume that the gray values of the small block remain unchanged and then compare it with other blocks. Blocks with noticeable gradients have good discrimination and are unlikely to cause mismatches, while for pixels with inconspicuous gradients, block matching has no discrimination, and it is difficult to estimate their depth effectively. Conversely, the depth of pixels with evident gradients will be estimated relatively accurately, such as the magazines, phones, and other well-textured objects on the desktop. Therefore, the demo program reflects a widespread problem in stereo vision: the dependence on texture. This problem is also extremely common in binocular vision, which shows that the reconstruction quality of stereo vision depends heavily on the environment's texture.
Our demo program deliberately uses a well-textured environment, such as a checkerboard-like floor and a wood-grained desktop, so we get a seemingly good result. In practice, however, regions with uniform brightness, such as walls and smooth surfaces, appear frequently and degrade the depth estimation. From a certain perspective, this problem cannot be solved within the current pipeline (block matching) as long as we only look at the local neighborhood around a block.
Further discussing the pixel gradient problem, we will also find the connection between the pixel gradient and the epipolar line. The literature~\cite{Engel2013} has discussed their relationship in detail, but it is also intuitively reflected in our demo program.
Taking \autoref{fig:epipolar-gradient}~ as an example, consider two extreme cases: the pixel gradient is parallel to the epipolar direction, or orthogonal to it. Look at the orthogonal case first. In this case, even if the block has a noticeable gradient, when we do block matching along the epipolar line, we find that the matching score is the same everywhere, so no effective match can be obtained. Conversely, in the parallel case, we can accurately determine where the highest matching score appears. In reality, the gradient and the epipolar line lie somewhere in between: they are neither completely orthogonal nor completely parallel. When the angle between the pixel gradient and the epipolar line is large, the uncertainty of the epipolar matching is large; when the angle is small, the uncertainty becomes small. In the demo program, we uniformly treat all these cases as a one-pixel error, which is not entirely fair. A more accurate uncertainty model should take into account the angle between the epipolar line and the pixel gradient. Specific adjustments and improvements are left as exercises; a possible starting point is sketched below.
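As one possible (hypothetical) heuristic along these lines, the one-pixel error could be inflated according to the angle $\theta$ between the local image gradient and the epipolar direction, e.g., divided by $|\cos\theta|$, so that nearly orthogonal configurations receive a much larger uncertainty. The helper below is only a sketch of this idea and is not part of the book's demo program:
\begin{lstlisting}[language=c++]
#include <Eigen/Core>
#include <algorithm>
#include <cmath>

// Inflate the matching error (in pixels) by the angle between the local
// image gradient and the epipolar direction (heuristic sketch).
double matchingErrorPixels(const Eigen::Vector2d &gradient,
                           const Eigen::Vector2d &epipolar_direction,
                           double base_error = 1.0, double max_error = 10.0) {
    double g_norm = gradient.norm();
    if (g_norm < 1e-6) return max_error;  // textureless: essentially unusable
    double cos_theta = std::fabs(gradient.dot(epipolar_direction)) /
                       (g_norm * epipolar_direction.norm());
    if (cos_theta < 1e-3) return max_error;  // gradient almost orthogonal to the epipolar line
    return std::min(base_error / cos_theta, max_error);
}
\end{lstlisting}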
\begin{figure}[!htp]
\centering
\includegraphics[width=1.0\textwidth]{mapping/epipolar-gradient.pdf}
\caption{Relationship of pixel gradients and the epipolar direction.}
\label{fig:epipolar-gradient}
\end{figure}
\subsection{Inverse Depth Filter}
From another perspective, we might as well ask: Is it appropriate to assume that the pixel depth is a Gaussian distribution? This is related to a \textit{parameterization} problem.
In the previous content, we often use the world coordinates $x, y, z$ to describe a point, which is one of many parameterization methods. We may think that the three quantities $x, y, z$ are random, and they obey the 3D Gaussian distribution. However, this lecture uses the image coordinates $u,v$ and the depth value $d$ to describe a certain spatial point. We think that $u,v$ do not move, and $d$ obeys (one-dimensional) Gaussian distribution, which is another form of parameterization. Then we have to ask: Is there any difference between these two parameterized forms?
Different parameterizations describe the same quantity, namely a 3D spatial point. When we observe a certain point from the camera, its image coordinates $u,v$ are relatively accurate, while its depth $d$ is highly uncertain. If the world coordinates $x,y,z$ are used to describe this point, then, depending on the camera's current pose, there will be an apparent correlation between the three quantities $x,y,z$: the off-diagonal elements of the covariance matrix will not be zero. If the point is instead parameterized with $u, v, d$, then $u, v$, and $d$ are approximately independent (we can even regard $u$ and $v$ themselves as independent), so the covariance matrix is roughly diagonal, which is more concise.
Inverse depth is a widely used parameterization technique {\cite{Montiel2006, Civera2008}} that has appeared in SLAM research in recent years. In the demo program, we assume that the depth value satisfies the Gaussian distribution: $d \sim N(\mu, \sigma^2)$. But is it reasonable to do so? Does the depth really approximate a Gaussian distribution? If we think more deeply, there are indeed some problems with the normal distribution of depth:
\begin{enumerate}
\item What we actually want to express is something like: the depth of this scene is about $5 \textasciitilde 10$ meters; there may be some farther points, but the depth will certainly not be less than the camera's focal length (and certainly not less than 0). Such a distribution is not symmetric like a Gaussian: its tail towards larger depths may be long, while the probability of negative depths is zero.
\item There may be points that are very far away in some outdoor applications or even at infinity. It is difficult to cover these points in our initial value, and there will be some numerical difficulties in describing the large depth values with Gaussian distribution.
\end{enumerate}
Thus, the inverse depth parameterization came into being. Simulations have shown that assuming the inverse depth to be Gaussian is more effective {\cite{Civera2008}}. In practical applications, the inverse depth also has better numerical stability, so it has gradually become a common and standard practice in existing SLAM systems {\cite{Forster2014, Engel2014, Mur-Artal2015}}.
It is not complicated to change the demo program from depth to inverse depth: just replace the depth $d$ with the inverse depth $d^{-1}$ in the previous derivation and perform the fusion in inverse-depth space. We leave this change as an exercise for the reader; the sketch below only hints at the fusion step.
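As a hint for that exercise, the Gaussian fusion step itself is unchanged; only the fused quantity becomes $q = d^{-1}$. The sketch below (our own helper) assumes that the observation variance has already been converted to inverse-depth units, e.g., by the first-order propagation $\sigma_q \approx \sigma_d / d^2$; propagating the geometric uncertainty properly is part of the exercise:
\begin{lstlisting}[language=c++]
// Fuse in inverse-depth space. mu and sigma2 describe the inverse depth
// q = 1/d; the observed depth is converted to q_obs = 1/depth_obs and fused
// exactly as before. The depth can be recovered afterwards as 1/mu.
void fuseInverseDepth(double &mu, double &sigma2,
                      double depth_obs, double sigma2_obs_q) {
    double q_obs = 1.0 / depth_obs;
    double mu_fuse = (sigma2_obs_q * mu + sigma2 * q_obs) / (sigma2 + sigma2_obs_q);
    double sigma2_fuse = (sigma2 * sigma2_obs_q) / (sigma2 + sigma2_obs_q);
    mu = mu_fuse;
    sigma2 = sigma2_fuse;
}
\end{lstlisting}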
\subsection{Pre-transform the Image}
Before block matching, it is also common to apply a transformation between the images as a preprocessing step. This is because we assume the image block remains unchanged while the camera moves. This assumption roughly holds when the camera only translates (as is largely the case in the example dataset), but it no longer holds when the camera rotates significantly. In particular, when the camera rotates around the optical center, a block that is white on top and black at the bottom may become black on top and white at the bottom, causing the correlation to become negative even though it is still the same block.
To prevent this situation, we usually need to consider the motion between the reference frame and the current frame. A pixel $\mathbf{P}_R$ in the reference frame is related to the world coordinates of the 3D point $\mathbf{P}_W$ by:
\begin{equation}
d_R {\mathbf{P}_R} = \mathbf{K} \left( {{\mathbf{R}_{{RW}}}{\mathbf{P}_W} + {\mathbf{t}_{{RW}}}} \right).
\end{equation}
Similarly, for the current frame, there is a projection of $\mathbf{P}_W$ on it, denoted as $\mathbf{P}_C$:
\begin{equation}
d_C {\mathbf{P}_C} = \mathbf{K} \left( {{\mathbf{R}_{{CW}}}{\mathbf{P}_W} + {\mathbf{t}_{{CW}}}} \right).
\end{equation}
Substituting and eliminating $\mathbf{P}_W$, the pixel relationship between the two images is obtained:
\begin{equation}
d_C {\mathbf{P}_C} = d_R \mathbf{K} \mathbf{R}_{{CW}} \mathbf{R}_{{RW}}^T \mathbf{K}^{-1} \mathbf{P}_R + \mathbf{K} \mathbf{t}_{{CW}} - \mathbf{K} \mathbf{R}_{{CW}} \mathbf{R}_{{RW}}^T \mathbf{t}_{{RW}}.
\end{equation}
When $d_R$ and $\mathbf{P}_R$ are known, we can compute the projection position $\mathbf{P}_C$. If we then give the two components of $\mathbf{P}_R$ increments $\mathrm{d}u, \mathrm{d}v$, the corresponding increments $\mathrm{d}u_c, \mathrm{d}v_c$ of $\mathbf{P}_C$ can be computed. In this way, we obtain a local linear relationship between the pixel coordinates of the reference frame and those of the current frame, which forms an affine transformation:
\begin{equation}
\left[ \begin{array}{l}
\mathrm{d}u_c\\
\mathrm{d}v_c
\end{array} \right] = \left[ {\begin{array}{*{20}{c}}
{\frac{{\mathrm{d}u_c}}{{\mathrm{d}u}}}&{\frac{{\mathrm{d}u_c}}{{\mathrm{d}v}}}\\
{\frac{{\mathrm{d}v_c}}{{\mathrm{d}u}}}&{\frac{{\mathrm{d}v_c}}{{\mathrm{d}v}}}
\end{array}} \right]\left[ \begin{array}{l}
\mathrm{d}u\\
\mathrm{d}v
\end{array} \right]
\end{equation}
According to this affine transformation matrix, we can warp the pixels of the current frame (or of the reference frame) before block matching, which gives better results under rotation.
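A minimal sketch of this idea follows (helper names are ours): we numerically estimate the $2\times 2$ affine matrix by projecting the reference pixel and its unit offsets into the current frame; \texttt{project} stands for the reprojection $\mathbf{P}_R \mapsto \mathbf{P}_C$ derived above and is assumed to be provided by the caller:
\begin{lstlisting}[language=c++]
#include <Eigen/Core>
#include <functional>

// Estimate the local 2x2 affine warp at a reference pixel by finite
// differences of the reprojection. "project" maps a reference pixel (u, v)
// with depth d_ref to its pixel coordinates in the current frame.
Eigen::Matrix2d localAffineWarp(
    const std::function<Eigen::Vector2d(const Eigen::Vector2d &, double)> &project,
    const Eigen::Vector2d &px_ref, double d_ref) {
    Eigen::Vector2d pc = project(px_ref, d_ref);
    Eigen::Vector2d pc_du = project(px_ref + Eigen::Vector2d(1, 0), d_ref);
    Eigen::Vector2d pc_dv = project(px_ref + Eigen::Vector2d(0, 1), d_ref);
    Eigen::Matrix2d A;
    A.col(0) = pc_du - pc;  // d(u_c, v_c) / du
    A.col(1) = pc_dv - pc;  // d(u_c, v_c) / dv
    return A;
}
// A patch offset (du, dv) in the reference frame then maps to A * (du, dv)
// in the current frame, where the warped patch can be sampled, e.g., with
// bilinear interpolation.
\end{lstlisting}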
\subsection{Parallel Computing}
In the experiment, we have also seen that dense depth map estimation is very time-consuming. This is because the number of estimated points has grown from a few hundred feature points to hundreds of thousands of pixels, and even today's mainstream CPUs cannot perform such a large amount of computation in real time. However, the problem has another property: the depth estimates of these hundreds of thousands of pixels are mutually independent, which makes parallelization possible.
In the sample program, we traverse all the pixels in a double loop and perform epipolar searches one by one. On a CPU, this process is carried out sequentially: the computation of the next pixel must wait for the previous one. However, since the computations are independent, there is no need to wait. We can use multiple threads to compute each pixel separately and then collect the results. In theory, if we had 300,000 threads, the computation time for this problem would be the same as for a single pixel, as sketched below.
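For example, on the CPU the double loop in \texttt{update} could be parallelized with OpenMP. This is only a sketch (compile with \texttt{-fopenmp}); it relies on the fact that each $(x, y)$ iteration reads and writes only its own entries of \texttt{depth} and \texttt{depth\_cov2}, so the iterations are independent:
\begin{lstlisting}[language=c++]
// Inside update(): parallelize the per-pixel loop (sketch).
#pragma omp parallel for schedule(dynamic)
for (int x = boarder; x < width - boarder; x++) {
    for (int y = boarder; y < height - boarder; y++) {
        if (depth_cov2.ptr<double>(y)[x] < min_cov ||
            depth_cov2.ptr<double>(y)[x] > max_cov)
            continue;
        Vector2d pt_curr, epipolar_direction;
        if (!epipolarSearch(ref, curr, T_C_R, Vector2d(x, y),
                            depth.ptr<double>(y)[x],
                            sqrt(depth_cov2.ptr<double>(y)[x]),
                            pt_curr, epipolar_direction))
            continue;
        updateDepthFilter(Vector2d(x, y), pt_curr, T_C_R,
                          epipolar_direction, depth, depth_cov2);
    }
}
\end{lstlisting}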
The parallel computing architecture of GPUs is well suited to such problems, so in dense reconstruction, GPUs are often used for parallel acceleration. This book does not cover GPU programming, so we only point out the possibility here and leave the specific practice to the reader. According to similar work, GPU-accelerated dense depth estimation can run in real time on mainstream GPUs.
\subsection{Other Improvements}
In fact, we can also propose many improvements to this example, such as:
\begin{enumerate}
\item If each pixel is completely calculated independently, there may be cases where one pixel's depth is small, and the next one is large. We don't have any smooth constraints on the depth map. However, we can assume that the adjacent depth will not change too much, thus adding a spatial regularization term to the depth estimation.
\item We did not explicitly deal with outliers. Due to factors such as occlusion, lighting changes, and motion blur, it is impossible to maintain a successful match for every pixel. In the demo program, a match is accepted as long as the NCC exceeds a threshold, and mismatches are otherwise ignored. There are several ways to handle mismatches. For example, the depth filter under the uniform-Gaussian mixture distribution proposed in~\cite{Vogiatzis2011} explicitly distinguishes inliers from outliers and models them probabilistically, which handles outliers better. However, the theory of this type of filter is more complicated, and this book does not want to go into too much detail; please read the original paper if you are interested.
\end{enumerate}
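As a minimal illustration of the first idea above, the following sketch performs one pass of precision-weighted smoothing over the depth map. It assumes that the depth and its covariance are stored as CV\_64F images, as in the monocular demo; it is only a crude stand-in for a properly regularized optimization.
\begin{lstlisting}[language=c++]
#include <opencv2/core/core.hpp>

// One smoothing pass: each converged pixel is replaced by the precision-weighted
// average of itself and its converged 8-neighbors. max_cov marks pixels whose
// covariance is still too large to be trusted (value chosen for illustration).
void smoothDepth(cv::Mat &depth, const cv::Mat &depth_cov2, double max_cov = 0.1) {
    cv::Mat smoothed = depth.clone();
    for (int y = 1; y < depth.rows - 1; y++) {
        for (int x = 1; x < depth.cols - 1; x++) {
            if (depth_cov2.at<double>(y, x) > max_cov) continue; // unconverged
            double sum_w = 0, sum_d = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    double cov = depth_cov2.at<double>(y + dy, x + dx);
                    if (cov > max_cov) continue;    // skip unreliable neighbors
                    double w = 1.0 / (cov + 1e-10); // precision weight
                    sum_w += w;
                    sum_d += w * depth.at<double>(y + dy, x + dx);
                }
            if (sum_w > 0) smoothed.at<double>(y, x) = sum_d / sum_w;
        }
    }
    depth = smoothed;
}
\end{lstlisting}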
As can be seen from the above discussion, there are many possible improvements. If we carefully improve every step, we can hope to obtain a good dense mapping algorithm in the end. However, as discussed, there remain theoretical difficulties, such as the dependence on texture and the correlation between the pixel gradient and the epipolar direction (the orthogonal case). These problems are difficult to solve by merely adjusting the code or the parameters. So, although stereo and moving monocular cameras can build dense maps, we usually consider them too dependent on environmental texture and lighting, and therefore not robust enough.
\section{Dense RGB-D Mapping}
In addition to using monocular and stereo cameras for dense reconstruction, RGB-D cameras are a better choice within their applicable range. The depth estimation problem discussed in detail earlier can be solved by hardware measurement with an RGB-D camera, without consuming much computing resource. In addition, the structured-light or time-of-flight principle of RGB-D cameras ensures that the depth data is independent of texture. Even when facing a solid-colored object, we can measure its depth as long as it reflects light. This is a major advantage of RGB-D sensors.
It is relatively easy to use RGB-D cameras for dense mapping. However, depending on the map format, there are several mainstream mapping methods. The most intuitive and straightforward way is to convert the RGB-D data into point clouds according to the estimated camera poses and then stitch them into a global point cloud map composed of discrete points. On this basis, if we have further requirements on the appearance and want to estimate the object's surface, we can build the map with triangular meshes or surfels (surface elements). On the other hand, if we want to know the obstacle information in the map and navigate on it, we can also build an occupancy map with voxels.
We seem to have introduced many new concepts. Please don't worry; we will look into them one by one. As usual, we will provide several demonstration programs for the suitable experiments. Since not much theoretical knowledge is involved in RGB-D mapping, the following sections directly introduce the practical part. GPU-based mapping is beyond this book's scope, so we only briefly explain its principles without a demonstration.
\subsection{Practice: RGB-D Point Cloud Mapping}
First, let's look at the simplest point cloud map. A point cloud is a map represented by a set of discrete points. The most basic point contains the three-dimensional coordinates $x, y, z$ and the color information $r, g, b$. Since an RGB-D camera provides a color image and a depth image, it is easy to compute an RGB-D point cloud from the camera intrinsics. If the camera poses are also known, we can directly merge the keyframes into a global point cloud. In section~\ref{sec:join-point-cloud}~ of this book, an example of merging point clouds with the camera's intrinsic and extrinsic parameters was given. However, that example mainly helps the reader understand the camera parameters. In real mapping, we also apply some filtering to the point cloud to obtain a better visual effect. This program mainly uses two kinds of filters: the statistical outlier removal filter and the voxel grid filter. The code of the sample program is as follows:
\begin{lstlisting}[language=c++,caption=slambook/ch12/dense\_RGBD/pointcloud\_mapping.cpp (part)]
int main(int argc, char **argv) {
vector<cv::Mat> colorImgs, depthImgs;
vector<Eigen::Isometry3d> poses;
ifstream fin("./data/pose.txt");
if (!fin) {
cerr << "cannot find pose file" << endl;
return 1;
}
for (int i = 0; i < 5; i++) {
boost::format fmt("./data/%s/%d.%s"); // we use boost::format to read image files
colorImgs.push_back(cv::imread((fmt % "color" % (i + 1) % "png").str()));
depthImgs.push_back(cv::imread((fmt % "depth" % (i + 1) % "png").str(), -1)); // use -1 to read the unchanged data
double data[7] = {0};
for (int i = 0; i < 7; i++) {
fin >> data[i];
}
Eigen::Quaterniond q(data[6], data[3], data[4], data[5]);
Eigen::Isometry3d T(q);
T.pretranslate(Eigen::Vector3d(data[0], data[1], data[2]));
poses.push_back(T);
}
// merge the point clouds
// intrinsics
double cx = 319.5;
double cy = 239.5;
double fx = 481.2;
double fy = -480.0;
double depthScale = 5000.0;
cout << "convering image to point cloud ..." << endl;
// use XYZRGB as our format
typedef pcl::PointXYZRGB PointT;
typedef pcl::PointCloud<PointT> PointCloud;
PointCloud::Ptr pointCloud(new PointCloud);
for (int i = 0; i < 5; i++) {
PointCloud::Ptr current(new PointCloud);
cout << "converting " << i + 1 << endl;
cv::Mat color = colorImgs[i];
cv::Mat depth = depthImgs[i];
Eigen::Isometry3d T = poses[i];
for (int v = 0; v < color.rows; v++)
for (int u = 0; u < color.cols; u++) {
unsigned int d = depth.ptr<unsigned short>(v)[u]; // depth data
if (d == 0) continue; // 0 means invalid reading
Eigen::Vector3d point;
point[2] = double(d) / depthScale;
point[0] = (u - cx) * point[2] / fx;
point[1] = (v - cy) * point[2] / fy;
Eigen::Vector3d pointWorld = T * point;
PointT p;
p.x = pointWorld[0];
p.y = pointWorld[1];
p.z = pointWorld[2];
p.b = color.data[v * color.step + u * color.channels()];
p.g = color.data[v * color.step + u * color.channels() + 1];
p.r = color.data[v * color.step + u * color.channels() + 2];
current->points.push_back(p);
}
// depth filter and statistical removal
PointCloud::Ptr tmp(new PointCloud);
pcl::StatisticalOutlierRemoval<PointT> statistical_filter;
statistical_filter.setMeanK(50);
statistical_filter.setStddevMulThresh(1.0);
statistical_filter.setInputCloud(current);
statistical_filter.filter(*tmp);
(*pointCloud) += *tmp;
}
pointCloud->is_dense = false;
cout << "we have " << pointCloud->size() << " points." << endl;
// voxel filter
pcl::VoxelGrid<PointT> voxel_filter;
double resolution = 0.03;
voxel_filter.setLeafSize(resolution, resolution, resolution); // resolution
PointCloud::Ptr tmp(new PointCloud);
voxel_filter.setInputCloud(pointCloud);
voxel_filter.filter(*tmp);
tmp->swap(*pointCloud);
cout << "Now we have " << pointCloud->size() << " points after voxel filtering." << endl;
pcl::io::savePCDFileBinary("map.pcd", *pointCloud);
return 0;
}
\end{lstlisting}
This code requires the Point Cloud Library (PCL). On Ubuntu 18.04, a single command is enough:
\begin{lstlisting}[language=sh, caption=Terminal input:]
sudo apt-get install libpcl-dev pcl-tools
\end{lstlisting}
The code does not change much compared with the one in lecture~\ref{cpt:5}. The main differences are:
\begin{enumerate}
\item When generating the point cloud for each frame, we remove points with invalid depth values. This is mainly because of the limited effective range of the Kinect: beyond this range, the depth value has a large error or is simply returned as zero.
\item We use a statistical filter to remove outliers. This filter computes the distribution of distances from each point to its nearest $N$ points and removes the points whose distances are abnormally large. In this way, we keep the densely clustered points and remove the isolated noise points.
\item Finally, a voxel filter is used for downsampling. Due to the overlap between viewpoints, there are many redundant points in the overlapping areas, which wastes a lot of memory. The voxel filter keeps only one point in each cube (voxel) of a given size, which is equivalent to downsampling the 3D space and saves a lot of storage.
\end{enumerate}
In the second edition of the book, we use the ICL-NUIM dataset~\cite{Handa2014} as an example. It is a synthetic RGB-D dataset, which provides noise-free depth data and thus facilitates our experiments. We store five color images, the corresponding depth maps, and the camera poses in the data/ directory. In the voxel filter, we set the resolution to 0.03, which means that only one point is kept in each 0.03$\times$0.03$\times$0.03 grid. This is a relatively high resolution. We can see from the program output that the number of points is significantly reduced (from about 1.3 million points to 30,000 points, roughly 2\% of the storage space) while a similar visual effect is kept.
Run the program in the dense\_RGBD directory:
\begin{lstlisting}[language=sh, caption=Terminal input:]
./build/pointcloud_mapping
\end{lstlisting}
The point cloud file map.pcd will be generated in the same directory. Open it with the pcl\_viewer tool, and you can see the result shown in \autoref{fig:pcd-filter}.
\begin{figure}[!ht]
\centering
\includegraphics[width=1.0\textwidth]{mapping/pcd-filter.pdf}
\caption{Point cloud mapping using five image pairs in ICL-NUIM.}
\label{fig:pcd-filter}
\end{figure}
The point cloud map provides us with a rather basic visual map that lets us roughly see what the environment looks like. Since it is stored in 3D, we can quickly browse all corners of the scene and even roam inside it. A significant advantage of point clouds is that they can be generated directly and efficiently from RGB-D images without extra processing. The filtering operations are also very intuitive, and the processing efficiency is acceptable. However, point cloud maps are still quite basic. Let's check whether they can meet the requirements mentioned earlier.
\begin{enumerate}
\item Localization requirements: this depends on the implementation of the front-end visual odometry. If it is based on feature points, the point cloud map cannot be used directly for localization, because no feature information is stored in the point cloud. If the front end aligns point clouds with ICP, we can register the local point cloud against the global one to estimate the camera pose, but this requires the global point cloud to be sufficiently accurate. In this demo, we only merge the point clouds without any optimization, which is not accurate enough for that.
\item Navigation and obstacle avoidance: point clouds cannot be used directly for navigation and obstacle avoidance. A pure point cloud may be disturbed by dynamic objects and cannot represent occupancy information. We usually need to post-process the raw point cloud to obtain a map format more suitable for navigation and obstacle avoidance.
\item Visualization and interaction: point clouds have basic visualization and interaction capabilities. We can see the appearance of the scene and walk through it. However, from the visualization point of view, a point cloud contains only discrete points and no surface information (such as normals), so it does not match people's visual habits. For example, an object in a point cloud map looks the same from the front and from the back, and we can see what is behind it right through the object: this is not compatible with our everyday experience.
\end{enumerate}
In summary, we say that the raw point cloud map is \textit{basic} or \textit{primary}, meaning that it is close to the raw sensor data. It has some essential functions but is usually used for debugging and basic display and is inconvenient in most applications. If we want the map to have more advanced functions, the point cloud map is a good starting point. For example, for navigation, we can build an occupancy grid map from the point cloud so that a navigation algorithm can query whether a point is passable. Another example is Poisson reconstruction~{\cite{Kazhdan2006}}, commonly used in SfM, which reconstructs meshes from the point cloud to recover surface information. Besides Poisson reconstruction, the surfel is another way to represent an object's surface. Using small facets as the basic map unit, it can build a visually pleasing map~{\cite{Stuckler2014}}.
\autoref{fig:poisson-surfel}~ shows examples of Poisson reconstruction and surfel maps. Their visual effects are significantly better, and both can be constructed from point clouds. Most of the map formats obtainable from point clouds are provided in the PCL library, and interested readers can explore PCL further. Since this book serves as introductory material, it does not introduce every map form in detail.
\begin{figure}[!htp]
\centering
\includegraphics[width=1.0\textwidth]{mapping/poisson-surfel.pdf}
\caption{Reconstruction results of Poisson and surfel model. }
\label{fig:poisson-surfel}
\end{figure}
\subsection{Building Meshes from Point Cloud}
Reconstructing a mesh from a point cloud is also relatively easy. Let's demonstrate how to build a mesh from the point cloud file we just saved. The general idea is: first estimate the normals of the point cloud, and then compute the mesh from the points and normals.
\begin{lstlisting}[language=c++,caption=slambook2/ch12/dense\_RGBD/surfel\_mapping.cpp]
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/io/pcd_io.h>
#include <pcl/visualization/pcl_visualizer.h>
#include <pcl/kdtree/kdtree_flann.h>
#include <pcl/surface/surfel_smoothing.h>
#include <pcl/surface/mls.h>
#include <pcl/surface/gp3.h>
#include <pcl/surface/impl/mls.hpp>
// typedefs
typedef pcl::PointXYZRGB PointT;
typedef pcl::PointCloud<PointT> PointCloud;
typedef pcl::PointCloud<PointT>::Ptr PointCloudPtr;
typedef pcl::PointXYZRGBNormal SurfelT;
typedef pcl::PointCloud<SurfelT> SurfelCloud;
typedef pcl::PointCloud<SurfelT>::Ptr SurfelCloudPtr;
SurfelCloudPtr reconstructSurface(
const PointCloudPtr &input, float radius, int polynomial_order) {
pcl::MovingLeastSquares<PointT, SurfelT> mls;
pcl::search::KdTree<PointT>::Ptr tree(new pcl::search::KdTree<PointT>);
mls.setSearchMethod(tree);
mls.setSearchRadius(radius);
mls.setComputeNormals(true);
mls.setSqrGaussParam(radius * radius);
mls.setPolynomialFit(polynomial_order > 1);
mls.setPolynomialOrder(polynomial_order);
mls.setInputCloud(input);
SurfelCloudPtr output(new SurfelCloud);
mls.process(*output);
return (output);
}
pcl::PolygonMeshPtr triangulateMesh(const SurfelCloudPtr &surfels) {
// Create search tree*
pcl::search::KdTree<SurfelT>::Ptr tree(new pcl::search::KdTree<SurfelT>);
tree->setInputCloud(surfels);
// Initialize objects
pcl::GreedyProjectionTriangulation<SurfelT> gp3;
pcl::PolygonMeshPtr triangles(new pcl::PolygonMesh);
// Set the maximum distance between connected points (maximum edge length)
gp3.setSearchRadius(0.05);
// Set typical values for the parameters
gp3.setMu(2.5);
gp3.setMaximumNearestNeighbors(100);
gp3.setMaximumSurfaceAngle(M_PI / 4); // 45 degrees
gp3.setMinimumAngle(M_PI / 18); // 10 degrees
gp3.setMaximumAngle(2 * M_PI / 3); // 120 degrees
gp3.setNormalConsistency(true);
// Get result
gp3.setInputCloud(surfels);
gp3.setSearchMethod(tree);
gp3.reconstruct(*triangles);
return triangles;
}
int main(int argc, char **argv) {
// Load the points
PointCloudPtr cloud(new PointCloud);
if (argc < 2 || pcl::io::loadPCDFile(argv[1], *cloud)) {
cout << "failed to load point cloud!";
return 1;
}
cout << "point cloud loaded, points: " << cloud->points.size() << endl;
// Compute surface elements
cout << "computing normals ... " << endl;
double mls_radius = 0.05;
int polynomial_order = 2;
auto surfels = reconstructSurface(cloud, mls_radius, polynomial_order);
// Compute a greedy surface triangulation
cout << "computing mesh ... " << endl;
pcl::PolygonMeshPtr mesh = triangulateMesh(surfels);
cout << "display mesh ... " << endl;
pcl::visualization::PCLVisualizer vis;
vis.addPolylineFromPolygonMesh(*mesh, "mesh frame");
vis.addPolygonMesh(*mesh, "mesh");
vis.resetCamera();
vis.spin();
}
\end{lstlisting}
This program demonstrates the process of calculating normals and meshes. Use:
\begin{lstlisting}[language=sh,caption=Terminal input:]
./build/surfel_mapping map.pcd
\end{lstlisting}
to convert the point cloud into a mesh map, as shown in \autoref{fig:mesh}. It can be seen that after the mesh is reconstructed, surface information such as normals can be recovered from a point cloud that originally carries no surface information. For the reconstruction algorithms demonstrated in this section (moving least squares and greedy projection triangulation), readers can refer to~\cite{Alexa2003} and~\cite{Marton2009}, which are classic algorithms in this area.
\begin{figure}[!htp]
\centering
\includegraphics[width=1.0\textwidth]{mapping/mesh.pdf}
\caption{Building meshes from point clouds.}
\label{fig:mesh}
\end{figure}
\subsection{Octo-Mapping}
The following section introduces a map format that is commonly used for navigation and has better compression performance: the \textit{octree map}. Although the point cloud map has a three-dimensional structure and the voxel filter lets us adjust its resolution, the point cloud still has several obvious defects:
\begin{itemize}
\item The point cloud map is usually very large, so the pcd files will be very large. A single $640 \times 480$ image generates 300,000 spatial points, which requires a lot of storage space. Even after filtering, the pcd files are still unacceptably large in large-scale environments. The annoying thing is that this huge size is not really necessary: point cloud maps store a lot of unneeded detail. We do not particularly care about the folds of a carpet or the shadows on a wall, and putting them in the map wastes space. For navigation, we mostly just want to know whether a place is passable. Of course, reducing the resolution saves space, but it also degrades the map quality. Is there a way to compress the map when storing it and discard the duplicated information?
\item The point cloud map cannot handle moving objects. Our approach only adds points; there is no mechanism for removing points when an object disappears. Since moving objects are ubiquitous in real environments, this makes point cloud maps not practical enough.
\end{itemize}
What we introduce next is a flexible, compressible, and updatable map format: the octree map, implemented in the octomap library~{\cite{Hornung2013}} (no, this is not an advertisement).
We know that it is common to model 3D space with many small cubes (or voxels). If we cut each side of a cube in half, the cube is divided into eight smaller cubes of the same size. This step can be repeated until the cube size reaches the finest modeling accuracy. If we regard the division of a cube into eight equal parts as expanding one node into eight child nodes, then the whole process of subdividing the largest space down to the smallest is an octree.
As shown in \autoref{fig:octomap}~, the left side shows a large cube being evenly divided into eight pieces again and again until the smallest blocks are reached. The entire large cube can thus be regarded as the root node, and the smallest blocks as the leaf nodes. When we move up one level in the octree, the volume of the map expands to eight times the original. Let us do a simple calculation: if a leaf node has a size of 1 cm$^3$ and we limit the octree to 10 levels, the total volume that can be modeled is about $8^{10}~\text{cm}^3 \approx 1073~\text{m}^3$, which is enough for a room. Since the volume depends exponentially on the depth, the modeled volume grows very fast as we use deeper trees.
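In general, if the leaf side length is $r$ and the octree is subdivided over $q$ levels, the volume that can be modeled is
\begin{equation}
V = \left( 2^{q} r \right)^3 = 8^{q} r^3,
\end{equation}
which reproduces the $8^{10}~\text{cm}^3 \approx 1073~\text{m}^3$ above for $r = 1~\text{cm}$ and $q = 10$.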
\begin{figure}[!ht]
\centering
\includegraphics[width=1.0\textwidth]{mapping/octomap.pdf}
\caption{The structure of an octree.}
\label{fig:octomap}
\end{figure}
Readers may wonder: in the voxel filter we also limited each voxel to a single point, so why do we say that the point cloud takes up space while the octree saves space? This is because, in the octree, each node stores whether it is occupied. The difference is that when all the child nodes of a block are consistently occupied or consistently free, the node does not need to be expanded. For example, when the map is initially blank, we only need a root node instead of a complete tree. When adding information, since occupied objects and blank spaces tend to be spatially connected, most octree nodes never need to be expanded down to the leaf level. Therefore, the octree saves a lot of storage space compared with the point cloud.
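The following toy node structure illustrates this lazy expansion: children are allocated only when a block actually has to be split, and eight children that agree can be pruned back into their parent. It only conveys the idea; a real implementation such as the octomap library is considerably more elaborate.
\begin{lstlisting}[language=c++]
#include <array>
#include <memory>

// occupancy state of one block; a homogeneous block keeps no children,
// which is exactly where the memory saving comes from
enum class State { Unknown, Free, Occupied };

struct OctreeNode {
    State state = State::Unknown;
    std::array<std::unique_ptr<OctreeNode>, 8> children; // all null while homogeneous

    bool isLeaf() const { return children[0] == nullptr; }

    // split into eight children, each inheriting the parent's estimate
    void expand() {
        for (auto &c : children) {
            c = std::make_unique<OctreeNode>();
            c->state = state;
        }
    }

    // if all eight children are leaves and agree, collapse them into this node
    void pruneIfHomogeneous() {
        if (isLeaf()) return;
        for (const auto &c : children)
            if (!c->isLeaf() || c->state != children[0]->state) return;
        state = children[0]->state;
        for (auto &c : children) c.reset();
    }
};
\end{lstlisting}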
As mentioned earlier, an octree node stores whether it is occupied. From the point cloud's perspective, we could naturally use 0 for free and 1 for occupied. This 0-1 representation can be stored in a single bit, which saves space, but it is a bit too simplistic. Due to noise, we may observe a point as 0 at one moment and 1 at another, or 0 most of the time and 1 occasionally. Besides the occupied and free cases, there is also an unknown state in navigation. Therefore, we prefer to express whether a node is occupied by a probability, for example, a floating-point number $x \in [0,1]$. This $x$ starts at 0.5 to describe the unknown state. If the node keeps being observed as occupied, we increase the value; on the contrary, if it keeps being observed as free, we decrease it.
In this way, we dynamically model the obstacle information in the map. However, this simple method has a small problem: if $x$ keeps increasing or decreasing, it may leave the range $[0,1]$, which is inconvenient to handle. So instead of the probability itself, we use its log-odds. Let $y \in \mathbb{R}$ be the log-odds value and $x \in [0,1]$ be the probability; the transformation between them is given by the \textit{logit} transform:
\begin{equation}
y = \mathrm{logit}(x) = \log \left( \frac{x}{1-x} \right).
\end{equation}
The inverse transform is:
\begin{equation}
x = \mathrm{logit}^{-1}(y) = \frac{\exp(y)}{\exp(y)+1}.
\end{equation}
It can be seen that when $y$ changes from $-\infty$ to $+\infty$, $x$ changes from 0 to 1 accordingly, and when $y$ is 0, $x$ is 0.5. Therefore, we can store $y$ to express whether a node is occupied: when an occupation is observed, increase $y$ by a fixed amount; otherwise, decrease it. When querying the probability, convert $y$ back with the inverse logit transform. In mathematical terms, let $n$ denote a node and $z$ the observed data. The log-odds of the node being occupied given the observations up to time $t$ is $L(n|z_{1:t})$, and at time $t+1$ it becomes:
\begin{equation}
L(n|z_{1:t+1}) = L(n|z_{1:t}) + L(n|z_{t+1}).
\end{equation}
Written directly in terms of probabilities instead of log-odds, the update is a bit more complicated:
\begin{equation}
P(n|z_{1:T}) = \left[ 1+ \frac{1-P(n|z_T)}{P(n|z_T)} \frac{1-P(n|z_{1:T-1})}{P(n|z_{1:T-1})} \frac{P(n)}{1-P(n)} \right]^{-1}.
\end{equation}
With the log-odds, we can update the whole octree map from RGB-D data. Suppose we observe a pixel with depth $d$ in the RGB-D image. This implies two facts: (1) the point at depth $d$ along the corresponding ray is occupied; (2) the space on the line segment from the camera center to that point is free. With this information, the octree map can be updated, and moving objects can also be handled.
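A minimal sketch of this update is shown below; the hit/miss increments and the clamping bounds are illustrative values chosen for this sketch, not the defaults of any particular library.
\begin{lstlisting}[language=c++]
#include <algorithm>
#include <cmath>

double logit(double p)    { return std::log(p / (1.0 - p)); }
double invLogit(double y) { return std::exp(y) / (std::exp(y) + 1.0); }

// y = L(n | z_{1:t}) for one node; it starts at 0, i.e., p = 0.5 (unknown).
// hit = true: the node was observed occupied; hit = false: a ray passed through it.
void updateNode(double &y, bool hit) {
    const double l_hit = logit(0.7);   // positive increment for an occupied observation
    const double l_miss = logit(0.4);  // negative increment for a free observation
    y += hit ? l_hit : l_miss;
    // clamping keeps the node able to change its state after long observation
    y = std::min(std::max(y, logit(0.12)), logit(0.97));
}
\end{lstlisting}
When the probability is needed, it is recovered as invLogit(y).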
\subsection{Practice: Octo-mapping}
Let's demonstrate octo-mapping with a program. Please install the octomap library first. Since Ubuntu 18.04, octomap and its visualization tool octovis have been included in the apt repositories and can be installed with the following command:
\begin{lstlisting}[language=sh,caption=Terminal input:]
sudo apt-get install liboctomap-dev octovis
\end{lstlisting}
We will directly demonstrate how to generate an octree map from the previous five images and then draw it with octovis.
\begin{lstlisting}[language=c++,caption=slambook/ch12/dense\_RGBD/octomap\_mapping.cpp (part)]
// octomap tree
octomap::OcTree tree(0.01); // resolution=0.01
for (int i = 0; i < 5; i++) {
cout << "Converting " << i + 1 << endl;
cv::Mat color = colorImgs[i];
cv::Mat depth = depthImgs[i];
Eigen::Isometry3d T = poses[i];
octomap::Pointcloud cloud; // the point cloud in octomap
for (int v = 0; v < color.rows; v++)
for (int u = 0; u < color.cols; u++) {
unsigned int d = depth.ptr<unsigned short>(v)[u];
if (d == 0) continue;
Eigen::Vector3d point;
point[2] = double(d) / depthScale;
point[0] = (u - cx) * point[2] / fx;
point[1] = (v - cy) * point[2] / fy;
Eigen::Vector3d pointWorld = T * point;
cloud.push_back(pointWorld[0], pointWorld[1], pointWorld[2]);
}
// save into octo tree
tree.insertPointCloud(cloud, octomap::point3d(T(0, 3), T(1, 3), T(2, 3)));
}
// update and save into a bt file
tree.updateInnerOccupancy();
cout << "saving octomap ... " << endl;
tree.writeBinary("octomap.bt");
\end{lstlisting}
We used octomap::OcTree to build the map. In fact, octomap provides several kinds of octrees: some carry color information, some carry occupancy information, and you can also define which variables each node carries. For simplicity, we used the most basic octree map without color information.
Octomap provides its own point cloud structure, which is slightly simpler than PCL's point cloud and only stores the points' spatial positions. According to the RGB-D images and the camera poses, we first transform the points into the world coordinate system, put them into octomap's point cloud, and finally hand it to the octree map. Octomap then updates the internal occupancy probabilities according to the projection model introduced before and finally stores them as a compressed octree map. We save the generated map as the file \textit{octomap.bt}. Now call octovis to open the map file, and you can see the map.
\autoref{fig:octomap-result}~ shows the map we built. Since we did not add color information, the map is displayed in gray. Press the "1" key to color it by height. Readers can explore the octovis interface by themselves, including viewing, rotating, and zooming the map.
\begin{figure}[!htp]
\centering
\includegraphics[width=1.0\textwidth]{mapping/octomap-result.pdf}
\caption{The display results of the octree map at different resolutions.}
\label{fig:octomap-result}
\end{figure}
There is an octree depth slider on the right, with which you can adjust the displayed resolution of the map. Since the default depth is 16 levels, the 16th level shown here is the highest display resolution, i.e., blocks of 0.05 meters. If we reduce the depth by one level, the displayed leaf nodes move up one level, and the block size doubles to 0.1 meters. As you can see, we can easily adjust the map resolution to suit different occasions.
There is more to explore in octomap. For example, we can easily query the occupancy probability of any point, which can be used to design a navigation method on the map~{\cite{Burri2015}}. Readers can also compare the file sizes of the point cloud map and the octree map: the point cloud map generated in the previous section takes about 6.9 MB on disk, while the octomap file is only 56 KB, less than one percent of the point cloud map, so much larger scenes can be modeled effectively.
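As for the occupancy query, to the best of our understanding of the octomap API, it can be done roughly as follows (OcTree::search returns a null pointer for unknown space); please consult the octomap documentation for the authoritative interface.
\begin{lstlisting}[language=c++]
#include <octomap/octomap.h>
#include <iostream>

int main() {
    octomap::OcTree tree("octomap.bt");     // load the map saved by the demo
    octomap::point3d query(0.0, 0.0, 1.0);  // an arbitrary point in world coordinates
    octomap::OcTreeNode *node = tree.search(query);
    if (node == nullptr) {
        std::cout << "the point lies in unknown space" << std::endl;
    } else {
        std::cout << "occupancy probability: " << node->getOccupancy() << std::endl;
        std::cout << "occupied? " << tree.isNodeOccupied(node) << std::endl;
    }
    return 0;
}
\end{lstlisting}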
\section{*TSDF and RGB-D Fusion Series}
At the end of this lecture, we introduce a research direction that is closely related to SLAM but slightly different: real-time 3D reconstruction. Since this topic involves GPU programming and we do not provide examples, it serves as optional reading material.
In the previous map models, localization is the main subject, and the mesh or octree map is built as a post-processing step. This framework has become mainstream because the localization algorithm can meet real-time requirements, while the map can be processed at keyframes without a real-time response. Localization is usually lightweight, especially with sparse features or sparse direct methods; in contrast, the representation and storage of maps are heavyweight, and their scale and computational demand are not conducive to real-time processing. Therefore, dense maps can only be computed at the keyframe level.
However, with this practice we never optimize the dense map itself. For example, when the same chair is observed in two images, we only stitch the two point clouds according to the two estimated poses. Since pose estimation is inevitably error-prone, this direct stitching is often not accurate enough: the point clouds of the same chair may not align perfectly, and two copies of the chair then appear in the map. This phenomenon is sometimes vividly called the \textit{ghost shadow}.
This is obviously not what we want. We hope the reconstruction result is smooth and complete. With this goal in mind, a line of work takes the map as the main subject and puts localization in a secondary position; this is real-time 3D reconstruction. Since 3D reconstruction takes an accurate map as its main goal, it usually requires GPU acceleration. In contrast, SLAM is developing toward lightweight and miniaturized systems: some solutions even abandon the mapping and loop closing parts and keep only the visual odometry, while real-time reconstruction is developing toward large-scale and dynamic scene reconstruction.
Since the emergence of RGB-D sensors, real-time reconstruction from RGB-D images has become an important research direction, with results such as Kinect Fusion~{\cite{Newcombe2011}}, Dynamic Fusion~{\cite{Newcombe2015}}, Elastic Fusion~{\cite{Whelan2015}}, Fusion4D~{\cite{Dou2016}}, and VolumeDeform~{\cite{Innmann2016}}. Kinect Fusion achieved basic model reconstruction but was limited to small scenes; the follow-up works extend it to large, dynamic, and even deformable scenes. We regard them all as real-time reconstruction work, but it is impossible to discuss each of their working principles in detail due to the large variety. \autoref{fig:fusions}~ shows some of the reconstruction results. These models are very fine, much more delicate than simple point clouds.