
Well, visual recognition. Basically I will just talk about image classification. Actually I wanted to talk about object detection, but because Ross is here, you will see an even better talk about object detection, so I will just talk about classification. In this talk I will first give some introduction to image recognition and image classification. Then I will give a review of convolutional neural networks over the past few years, starting from LeNet and moving to AlexNet, VGG, and GoogLeNet. Then I will review ResNet and also introduce our recent work, ResNeXt, which will be presented at this CVPR. By the way, the slides of this tutorial will be available online after the tutorial.
In the past few years we have been witnessing a revolution of depth, and ImageNet classification is a very good benchmark for this revolution. In the first two years of this competition the models were shallow and the accuracy was usually not very good. Then in 2012 the famous AlexNet, which was one of the first convolutional neural networks applied to this large-scale dataset, greatly reduced the error rate, by about 10 percent on this dataset. After two years we saw another significant improvement from VGG and GoogLeNet, which again increased the depth, from about 8 layers to over 20 layers. Then two years ago, in 2015, the deep residual network again significantly increased the depth, to over 100 layers, and we saw another improvement in accuracy.
These deep neural networks are also the engines of visual recognition, and object detection is a very good benchmark for this behavior. For example, on PASCAL VOC 2007 object detection, the most popular model used to be the HOG-based DPM, a shallow model, which achieved only about 34 percent mAP on this dataset. After AlexNet was introduced, together with the region-based convolutional neural network, or R-CNN, the mAP was significantly improved, by 20 percent on this dataset. And if we replace the features, going from AlexNet to VGG to ResNet, we see another very big improvement on this challenging object detection task.
Actually, currently ResNet and its extensions are the leading models for many popular visual recognition benchmarks, such as object detection on COCO and PASCAL VOC, semantic segmentation and instance segmentation on COCO, VOC, ADE, or Cityscapes, and many other datasets. ResNet has also been applied to extract visual features for visual reasoning tasks such as VQA or CLEVR, and it can be used to improve video understanding. Actually, if you search for ResNet on the ImageNet 2016 results webpage, you will see about 200 entries, and this demonstrates the prevalence of these models in general computer visual recognition.
So now let's see how a computer recognized an image before the prevalence of deep learning. Let's think about the simplest case: given an image, we can just represent it as pixels, and then we can train a classifier on top of these pixels. The classifier can be very simple; it can be a nearest-neighbor classifier, an SVM, or a random forest. It is unlikely for this model to work well, so we can make it a little bit more complicated: for example, we can extract edges from the image. This representation of edges can be a little bit invariant to some factors, such as color or illumination.
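As a toy illustration of this pixel-level baseline (my own sketch, not code from the talk, with made-up 2x2 "images"), here is a minimal nearest-neighbor classifier operating directly on raw pixel vectors:

```python
import numpy as np

def nearest_neighbor_predict(train_images, train_labels, query):
    """Label a query image with the label of the closest training image
    (Euclidean distance on flattened raw pixels)."""
    X = train_images.reshape(len(train_images), -1).astype(float)
    q = query.reshape(-1).astype(float)
    distances = np.linalg.norm(X - q, axis=1)
    return train_labels[np.argmin(distances)]

# Two made-up 2x2 grayscale "images": one dark, one bright.
train = np.array([[[0, 10], [5, 0]], [[250, 255], [240, 245]]])
labels = np.array(["dark", "bright"])
print(nearest_neighbor_predict(train, labels, np.array([[245, 250], [250, 255]])))
# -> bright
```

Because raw pixel distance has no invariance at all (a shifted or recolored object lands far away in pixel space), this baseline breaks down quickly, which is exactly why feature engineering came next.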
Following this theme, we can build more and more complicated visual features to do better visual recognition, depending on how much variance and invariance we want to have. For example, popular features such as SIFT and HOG usually extract edges and then build local histograms of the orientations of those edges. Actually, this type of histogram can be thought of as a kind of normalization and pooling operation applied locally to the edges. We can then apply a classifier on top of these features. One step further, we can build some bag-of-words models, using k-means or sparse coding, or even higher-order Fisher vectors or VLAD, on top of these SIFT or HOG features, and after that we still train a classifier on top. These were usually the state-of-the-art models before the prevalence of deep learning.
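To make the "histogram of edge orientations" idea concrete, here is a rough sketch in that spirit (a drastic simplification of HOG, mine rather than the talk's): compute image gradients, quantize their orientations into bins, and accumulate gradient magnitude per bin.

```python
import numpy as np

def orientation_histogram(image, n_bins=8):
    """Quantize gradient orientations into n_bins and accumulate
    gradient magnitude per bin (a much-simplified single HOG cell)."""
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    # Orientation folded into [0, pi): edge direction is sign-insensitive.
    angle = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.minimum((angle / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), magnitude.ravel()):
        hist[b] += m
    # L2 normalization makes the descriptor invariant to contrast scaling.
    return hist / (np.linalg.norm(hist) + 1e-8)

# A vertical step edge: all gradient energy falls in the first orientation bin.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
hist = orientation_histogram(img)
print(hist.argmax())  # -> 0 (the gradient points along +x, angle near 0)
```

The normalization step and the spatial accumulation are what the speaker refers to as "normalization and pooling operated locally on the edges".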
From this diagram we can see that our models were going from a simpler form to a more complicated form; that is, they were just going from a shallower one to a deeper one. But what can we do if we want an even better model? Actually, this is the motivation for learning deep features. In traditional computer vision we design specialized components which require domain knowledge: for example, we want to extract something called edges, we want to compute something called histograms, or we want to extract their orientations. All of these components require specific domain knowledge of the problem at hand, and that domain knowledge limits the number of components we can design.

On the other hand, in order to learn deeper features, in a deep neural network we do not need these specialized components. We usually just need some generic components, which nowadays are usually called layers, such as convolutions, ReLU, normalizations, or pooling. These generic components require less domain knowledge, so we can just repeat these elementary layers, which means it is easy for us to create deeper neural networks without much domain knowledge. This gives us a richer solution space when going deeper, and we can still easily train these networks using end-to-end learning algorithms such as back-propagation. This is the fundamental motivation of deep learning and convolutional neural networks.
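As an illustration of how generic, repeatable layers compose (my own minimal numpy sketch, not code from the talk), here is one "convolution + ReLU + pooling" block that can be stacked to arbitrary depth without any new domain-specific design:

```python
import numpy as np

def conv2d(x, kernel):
    """Valid 2-D cross-correlation: the same weights slide over every
    spatial position (this sliding IS the weight sharing)."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, size=2):
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

def block(x, kernel):
    """One generic building block; deeper networks just repeat it."""
    return max_pool(relu(conv2d(x, kernel)))

x = np.random.default_rng(0).standard_normal((12, 12))
k = np.array([[1.0, 0.0, -1.0]] * 3)       # a hand-made edge filter
y = block(block(x, k), k)                  # two stacked blocks
print(y.shape)  # -> (1, 1)
```

Going deeper is literally just calling `block` again; nothing about the layer itself needs to know what problem it is solving.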
So in the following I will review a few typical neural networks from the last few years. First, let's recap LeNet, which was actually developed 20 or 30 years ago. Here are the basic components of LeNet, which are still popular in modern ConvNets. The first key component is convolution. Convolution is actually a locally connected layer, and, more importantly, it is a layer with spatially shared weights. In my opinion, weight sharing is the key in deep learning. In the case of a convolutional neural net we share weights across the spatial domain; in the case of a recurrent neural net we share weights temporally. By doing weight sharing we can significantly reduce the number of parameters in the model while still having very good capacity.

On the other hand, LeNet also has another key component: subsampling. Nowadays we still do this type of subsampling, using pooling or just strided convolutions. In the case of LeNet, the network ends with some fully connected layers. Typically, we can just think of the last layer as a linear classifier, which is very similar to an SVM. The entire architecture can be trained end to end by back-propagation. All of these components are still key components in today's ConvNets.
753 189
754 00:08:44,208 --> 00:08:50,489
755 2012 the framers alex net crashed into
756
757 190
758 00:08:48,089 --> 00:08:53,550
759 the in genetic classification challenge
760
761 191
762 00:08:50,490 --> 00:08:57,629
763 so for Alexander it is do a the next
764
765 192
766 00:08:53,549 --> 00:09:00,599
767 style backbone but it has some our key
768
769 193
770 00:08:57,629 --> 00:09:03,480
771 improvement so the first improvement is
772
773 194
774 00:09:00,600 --> 00:09:07,379
775 relu or rectified linear units
776
777 195
778 00:09:03,480 --> 00:09:09,870
779 so in some sense relu is kind of the
780
781 196
782 00:09:07,379 --> 00:09:12,539
783 reason for the revolution of deep
784
785 197
786 00:09:09,870 --> 00:09:15,210
787 learning recently because you can
788
789 198
790 00:09:12,539 --> 00:09:18,889
791 accelerate training because of better
792
793 199
794 00:09:15,210 --> 00:09:21,150
795 gradient propagation versus some typical
796
797 200
798 00:09:18,889 --> 00:09:24,569
799 activations such as tangent nature or
800
801 201
802 00:09:21,149 --> 00:09:27,629
803 sigmoid so another key component of Alex
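The "better gradient propagation" point can be seen numerically. In this small sketch (mine, not the talk's), the sigmoid derivative is at most 0.25, so chained gradients shrink geometrically with depth, while ReLU passes gradients through unchanged wherever its input is positive:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                      # peaks at 0.25 when x == 0

def relu_grad(x):
    return (np.asarray(x) > 0).astype(float)  # exactly 1 on the active side

# Chain rule through 10 layers, each activation evaluated at its best case.
depth = 10
print(sigmoid_grad(0.0) ** depth)   # about 9.5e-07: the gradient vanishes
print(relu_grad(1.0) ** depth)      # -> 1.0: the gradient is preserved
```

Even at sigmoid's best operating point the 10-layer gradient has shrunk by about a million, which is why pre-ReLU deep networks were so hard to train.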
Another key component of AlexNet is the dropout operation, which is essentially in-network ensembling. In the case of AlexNet, or later on VGG, which have very large fully connected layers with many parameters, dropout can help reduce overfitting. But it is also worth mentioning that dropout can be replaced by batch normalization in some later networks.
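To make the dropout idea concrete, here is a minimal sketch of inverted dropout as it is commonly implemented (my own illustration, not code from the talk): each training pass samples a different random sub-network, which is where the "in-network ensembling" interpretation comes from.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: at training time each unit is zeroed with
    probability `rate`; survivors are scaled by 1/(1-rate) so the
    expected activation matches test time (where this is a no-op)."""
    if not training:
        return x
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep
    return x * mask / keep

rng = np.random.default_rng(0)
x = np.ones((4, 8))
y = dropout(x, rate=0.5, rng=rng)
print(sorted(set(y.ravel())))   # units are either dropped (0.0) or rescaled (2.0)
```

Averaging over the exponentially many masks is what gives the ensemble-like regularization effect on those huge fully connected layers.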
Another key contribution of the AlexNet design, in my opinion, is the data augmentation step. Data augmentation is actually a kind of label-preserving transformation: you can do some random cropping, random scaling, or flipping, which can virtually create more data from the existing data on hand. This matters because even if we have 1 million images from the internet, the amount of data is still kind of limited compared with the neural network size. So data augmentation is also one of the key reasons for the recent success of neural networks, and it can also help reduce overfitting.
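These label-preserving transformations are easy to sketch. Here is a minimal version of random crop plus random horizontal flip on a numpy image (my own illustration; AlexNet's actual pipeline also included color jittering):

```python
import numpy as np

def augment(image, crop_size, rng):
    """Random crop followed by a random horizontal flip.
    Both transformations preserve the image's class label."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]      # mirror left-right
    return crop

rng = np.random.default_rng(0)
img = np.arange(8 * 8).reshape(8, 8)
views = [augment(img, crop_size=5, rng=rng) for _ in range(4)]
print(all(v.shape == (5, 5) for v in views))  # -> True
```

Each call yields a different view of the same labeled image, which is the sense in which augmentation "virtually creates more data".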
Another milestone, in my opinion, is the VGG network. I still remember that when we first saw the ImageNet 2014 results webpage, we discovered that there were some networks with 16 or even 19 layers, and our comment was just that it was beyond our imagination. Actually, the VGG networks are very simple, and this makes them very popular. In my opinion, the key design of the VGG networks is the modularized design: they just stack a lot of 3x3 convolutions following some very simple rules. In the same stage of the network, all the layers have the same shape, and when the spatial size is reduced, they just increase the number of filters, so as to roughly keep the computation constant for each module. This is very simple, so when we go deeper and deeper we do not need to design new layers; we can just use the same template.

When the VGG network was first published, it was trained using some stage-wise training: for example, people started from a shallower network that has 11 layers and then gradually increased the depth to 13 and then to 16. This type of stage-wise training is not practical, because it is time-consuming and it is not end to end in some sense. Actually, the VGG network can be trained from scratch if we have some better initialization.
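The "halve the spatial size, double the filters" rule keeps per-layer computation roughly constant, which is easy to check with some quick arithmetic on illustrative VGG-like stage sizes (my own sketch):

```python
def conv_flops(h, w, c_in, c_out, k=3):
    """Multiply-adds of one k x k convolution over an h x w feature map."""
    return h * w * c_in * c_out * k * k

# Successive VGG-style stages: spatial size halves, channel count doubles.
stage1 = conv_flops(112, 112, 64, 128)
stage2 = conv_flops(56, 56, 128, 256)
stage3 = conv_flops(28, 28, 256, 512)
print(stage1 == stage2 == stage3)  # -> True: computation per layer is preserved
```

Halving height and width divides the cost by 4, and doubling both the input and output channels multiplies it by 4 again, so the two rules exactly cancel.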
So next I will briefly talk about initialization techniques. Let's think about a layer whose input is x, whose weight matrix is W, and whose output is just y = Wx. If we assume that the activations are linear and that x, y, and W are independent, then from some basic math we can show that the variance of y equals the variance of x multiplied by a factor, and this factor is the number of input neurons multiplied by the variance of the weights. This is the case for one layer. If we have multiple layers, we can see that the variance of the output of the network is also proportional to the variance of the input of the network, up to a factor which is the product of the per-layer factors.

Actually, this is the reason for the famous vanishing gradient or exploding gradient problem. In some sense it is not just about gradients: both the forward pass and the backward pass can explode in this regard. If the weights of the network are slightly smaller than ideal, then the signal vanishes, because it is a multiplication of many small numbers; on the other hand, if they are slightly larger, it will just explode. This is why we need some careful initialization.
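The variance bookkeeping just described can be written out explicitly (a standard derivation, with n denoting the fan-in of a layer):

```latex
% One linear layer y = Wx with n independent, zero-mean inputs and weights:
\operatorname{Var}[y] = n \,\operatorname{Var}[w]\,\operatorname{Var}[x]

% Stacking L such layers multiplies the per-layer factors:
\operatorname{Var}\!\left[y^{(L)}\right]
  = \operatorname{Var}[x]\prod_{l=1}^{L} n_l \operatorname{Var}[w_l]

% If every factor equals a constant c, the signal scales like c^L:
% c < 1 gives vanishing, c > 1 gives exploding.
```

Setting each factor to exactly one is precisely the design goal of the initialization schemes discussed next.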
One of the most popular schemes is the so-called Xavier initialization, which was developed under some linear assumptions. Basically, it just says we want some constant factors in the forward pass and also some constant factors in the backward pass, and a simple choice is just that the factor for every single layer is one, for either the forward pass or the backward pass. This Xavier initialization is very useful; however, it was developed under the linear assumption. Actually, we can show that if the activation is ReLU, we can also have a nice analytic derivation, which simply modifies the factor of the Xavier case by 1/2. This factor is important, because if we have D layers on hand, a factor of 2 per layer just gives an exponential effect of 2 to the power of D. We can also see that this kind of exponential effect is very prevalent in deep neural networks if we don't do things right.
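A small numerical check of this point (my own sketch): initializing each weight with standard deviation sqrt(2/n), the MSRA/He scheme, keeps the signal variance roughly stable through a stack of ReLU layers, while Xavier's sqrt(1/n) loses about a factor of 2 per layer, exactly the exponential effect described above.

```python
import numpy as np

def forward_variance(depth, n, weight_std, rng):
    """Push unit-variance random data through `depth` linear+ReLU
    layers and return the output variance."""
    x = rng.standard_normal((n, 1000))
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * weight_std
        x = np.maximum(W @ x, 0.0)       # ReLU
    return x.var()

rng = np.random.default_rng(0)
n, depth = 256, 10
msra = forward_variance(depth, n, np.sqrt(2.0 / n), rng)
xavier = forward_variance(depth, n, np.sqrt(1.0 / n), rng)
print(msra)    # stays on the order of 1
print(xavier)  # collapses by roughly 2**-10
```

With a few more layers the Xavier-initialized signal underflows entirely, which matches the speaker's point that the factor of 1/2 compounds to 2 to the power of D.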
So basically, in my opinion, the Xavier or the MSRA initializations are required for training VGG-16 or VGG-19 from scratch, and if we want to train even deeper plain networks, for example with 20 layers, usually we have to use the MSRA initialization. Unfortunately, with this number of layers the deep plain networks are usually not better. This is why we had to develop ResNet, which we will discuss later. On the other hand, this type of initialization is still useful, for example when we want to do fine-tuning for detection or segmentation, where the network has some newly initialized layers; in this case I recommend trying Xavier or MSRA initialization first. It is also worth mentioning that the mathematical outcome of these two types of initialization does not directly apply to multi-branch networks such as GoogLeNet, but the same derivation methodology is still applicable, so we can still do the derivation for different types of inception blocks. Usually, though, we don't need to do this nowadays, because we have batch normalizations, which we will also discuss later.
1400
1401 351
1402 00:16:43,139 --> 00:16:48,419
1403 very successful neural network is
1404
1405 352
1406 00:16:45,749 --> 00:16:51,360
1407 designed and it is the Google net or
1408
1409 353
1410 00:16:48,419 --> 00:16:54,748
1411 called inception later so Google net is
1412
1413 354
1414 00:16:51,360 --> 00:16:58,188
1415 known for its very good accuracy and at
1416
1417 355
1418 00:16:54,749 --> 00:17:01,589
1419 the same time the small footprint so
1420
1421 356
1422 00:16:58,188 --> 00:17:04,139
1423 there are many complicated designed in
1424
1425 357
1426 00:17:01,589 --> 00:17:06,480
1427 the GoogLeNet and in my opinion I
1428
1429 358
1430 00:17:04,140 --> 00:17:09,959
1431 would like to summarize them into three
1432
1433 359
1434 00:17:06,480 --> 00:17:11,548
1435 main points so the first property of
1436
1437 360
1438 00:17:09,959 --> 00:17:14,100
1439 GoogLeNet is that it is a multiple
1440
1441 361
1442 00:17:11,548 --> 00:17:16,019
1443 branch architecture so for example in
1444
1445 362
1446 00:17:14,099 --> 00:17:18,658
1447 the original version it has a one by one
1448
1449 363
1450 00:17:16,019 --> 00:17:21,689
1451 branch and then three by three five by
1452
1453 364
1454 00:17:18,659 --> 00:17:24,539
1455 five and pooling and another very
1456
1457 365
1458 00:17:21,689 --> 00:17:26,579
1459 interesting behavior which may be kind
1460
1461 366
1462 00:17:24,538 --> 00:17:29,640
1463 of an accident in the original GoogLeNet is
1464
1465 367
1466 00:17:26,578 --> 00:17:32,250
1467 the usage of shortcut so as we can see
1468
1469 368
1470 00:17:29,640 --> 00:17:35,880
1471 there is a standalone 1x1 convolution in
1472
1473 369
1474 00:17:32,250 --> 00:17:38,700
1475 the original inception block and this
1476
1477 370
1478 00:17:35,880 --> 00:17:40,649
1479 one by one shortcut is merged into the
1480
1481 371
1482 00:17:38,700 --> 00:17:44,130
1483 other branches by concatenation in the
1484
1485 372
1486 00:17:40,648 --> 00:17:46,859
1487 original Inception so
1488
1489 373
1490 00:17:44,130 --> 00:17:49,110
1491 we will see this one-by-one shortcut has
1492
1493 374
1494 00:17:46,859 --> 00:17:53,819
1495 been kept in almost all the following
1496
1497 375
1498 00:17:49,109 --> 00:17:55,679
1499 generations of Inception so in my
1500
1501 376
1502 00:17:53,819 --> 00:17:59,069
1503 understanding this one by one shortcut
1504
1505 377
1506 00:17:55,680 --> 00:18:03,570
1507 somehow helps optimization of this very
1508
1509 378
1510 00:17:59,069 --> 00:18:06,059
1511 complicated designs of GoogLeNets and
1512
1513 379
1514 00:18:03,569 --> 00:18:08,339
1515 also at that time GoogLeNet is also
1516
1517 380
1518 00:18:06,059 --> 00:18:11,909
1519 kind of a bottleneck architecture
1520
1521 381
1522 00:18:08,339 --> 00:18:14,909
1523 so for each inception block there are
1524
1525 382
1526 00:18:11,910 --> 00:18:16,920
1527 some 1x1 convolutions to reduce the
1528
1529 383
1530 00:18:14,910 --> 00:18:20,580
1531 number of channels before doing the
1532
1533 384
1534 00:18:16,920 --> 00:18:22,769
1535 expensive 3x3 and 5x5 convolution so
1536
1537 385
1538 00:18:20,579 --> 00:18:24,720
1539 this is kind of a bottleneck representation
1540
1541 386
1542 00:18:22,769 --> 00:18:27,529
1543 in terms of the number of channels
1544
1545 387
1546 00:18:24,720 --> 00:18:30,390
1547 and in the original GoogLeNet because
1548
1549 388
1550 00:18:27,529 --> 00:18:32,879
1551 after the 3 by 3 or 5 by 5 convolution
1552
1553 389
1554 00:18:30,390 --> 00:18:37,830
1555 the channels are concatenated so usually
1556
1557 390
1558 00:18:32,880 --> 00:18:42,630
1559 they don't need to do some dimension
1560
1561 391
1562 00:18:37,829 --> 00:18:45,179
1563 increase in that case so there are many
1564
1565 392
1566 00:18:42,630 --> 00:18:47,790
1567 other versions of GoogLeNet or
1568
1569 393
1570 00:18:45,180 --> 00:18:50,670
1571 Inception developed after that in my
1572
1573 394
1574 00:18:47,789 --> 00:18:52,589
1575 opinion all those inception templates
1576
1577 395
1578 00:18:50,670 --> 00:18:55,740
1579 still have more or less the same three
1580
1581 396
1582 00:18:52,589 --> 00:18:59,509
1583 main properties multiple branches shortcut
1584
1585 397
1586 00:18:55,740 --> 00:19:02,579
1587 and bottleneck so I believe these are the
1588
1589 398
1590 00:18:59,509 --> 00:19:07,859
1591 key components in the success of Google
1592
1593 399
1594 00:19:02,579 --> 00:19:10,409
1595 net so as we mentioned before the
1596
1597 400
1598 00:19:07,859 --> 00:19:12,479
1599 Xavier or MSRA initializations are not
1600
1601 401
1602 00:19:10,410 --> 00:19:18,690
1603 directly applicable for multiple branch
1604
1605 402
1606 00:19:12,480 --> 00:19:20,519
1607 networks such as GoogLeNet but Google
1608
1609 403
1610 00:19:18,690 --> 00:19:22,650
1611 net can still be optimized very
1612
1613 404
1614 00:19:20,519 --> 00:19:27,629
1615 successfully thanks to the introduction
1616
1617 405
1618 00:19:22,650 --> 00:19:30,150
1619 of batch normalization or BN so in my
1620
1621 406
1622 00:19:27,630 --> 00:19:32,640
1623 understanding BN is also a milestone
1624
1625 407
1626 00:19:30,150 --> 00:19:40,580
1627 technique in the recent deep learning
1628
1629 408
1630 00:19:32,640 --> 00:19:42,960
1631 revolution so basically before the
1632
1633 409
1634 00:19:40,579 --> 00:19:45,449
1635 recent prevalence of deep learning
1636
1637 410
1638 00:19:42,960 --> 00:19:47,880
1639 people had long realized that if we
1640
1641 411
1642 00:19:45,450 --> 00:19:51,120
1643 want to train a neural network we at
1644
1645 412
1646 00:19:47,880 --> 00:19:55,290
1647 least want to normalize the inputs of
1648
1649 413
1650 00:19:51,119 --> 00:19:57,029
1651 the data by its mean and its std on the
1652
1653 414
1654 00:19:55,289 --> 00:19:57,950
1655 other hand the development of the Xavier
1656
1657 415
1658 00:19:57,029 --> 00:20:01,009
1659 or
1660
1661 416
1662 00:19:57,950 --> 00:20:03,980
1663 MSRA initialization is just to analytically
1664
1665 417
1666 00:20:01,009 --> 00:20:06,319
1667 normalize the mean and STD for each
1668
1669 418
1670 00:20:03,980 --> 00:20:09,259
1671 layer based on some strong assumptions
1672
1673 419
1674 00:20:06,319 --> 00:20:11,869
1675 such as linearity or independence
1676
1677 420
1678 00:20:09,259 --> 00:20:14,089
1679 assumptions so in the case of batch
1680
1681 421
1682 00:20:11,869 --> 00:20:17,029
1683 normalization it is a kind of
1684
1685 422
1686 00:20:14,089 --> 00:20:20,089
1687 data-driven normalization of each layer
1688
1689 423
1690 00:20:17,029 --> 00:20:21,230
1691 and this is done for each mini batch so
1692
1693 424
1694 00:20:20,089 --> 00:20:23,899
1695 this is why it is called batch
1696
1697 425
1698 00:20:21,230 --> 00:20:26,029
1699 normalization the batch norm can
1700
1701 426
1702 00:20:23,900 --> 00:20:27,490
1703 greatly accelerate training and also
1704
1705 427
1706 00:20:26,029 --> 00:20:30,039
1707 make the network less sensitive to
1708
1709 428
1710 00:20:27,490 --> 00:20:32,329
1711 initialization and it can also help
1712
1713 429
1714 00:20:30,039 --> 00:20:37,639
1715 generalization because of the noise
1716
1717 430
1718 00:20:32,329 --> 00:20:39,379
1719 introduced into the batch statistics so here
1720
1721 431
1722 00:20:37,640 --> 00:20:42,170
1723 is a simple formulation of the
1724
1725 432
1726 00:20:39,380 --> 00:20:45,620
1727 batch norm so given the outputs of any
1728
1729 433
1730 00:20:42,170 --> 00:20:48,740
1731 layer which we denote as X here
1732
1733 434
1734 00:20:45,619 --> 00:20:51,739
1735 we can compute the mean mu and standard
1736
1737 435
1738 00:20:48,740 --> 00:20:54,380
1739 deviation Sigma of X within the
1740
1741 436
1742 00:20:51,740 --> 00:20:56,029
1743 mini-batch then we can normalize the
1744
1745 437
1746 00:20:54,380 --> 00:20:59,300
1747 mini-batch by subtracting the mean and
1748
1749 438
1750 00:20:56,029 --> 00:21:01,549
1751 dividing by the standard deviation and
1752
1753 439
1754 00:20:59,299 --> 00:21:04,700
1755 after that we still learn a new scale
1756
1757 440
1758 00:21:01,549 --> 00:21:07,309
1759 gamma and a new shift beta which is to
1760
1761 441
1762 00:21:04,700 --> 00:21:10,160
1763 compensate for the loss of representation
1764
1765 442
1766 00:21:07,309 --> 00:21:12,980
1767 power in the normalization so in this
1768
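The batch-norm formula just described can be written out as a minimal sketch. This is an illustration of the formula, not the speaker's implementation; the batch size, feature width, and eps value are assumed for the example.

```python
import numpy as np

# Minimal training-mode batch norm: compute the mini-batch mean mu and
# standard deviation sigma of X, normalize, then apply a learned scale
# gamma and shift beta (initialized here to 1 and 0).
def batch_norm_train(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                  # mini-batch mean per channel
    sigma = X.std(axis=0)                # mini-batch std per channel
    X_hat = (X - mu) / (sigma + eps)     # normalize within the batch
    return gamma * X_hat + beta          # compensate with scale/shift

rng = np.random.default_rng(0)
X = 5.0 + 3.0 * rng.standard_normal((256, 8))   # mean 5, std 3 inputs
Y = batch_norm_train(X, gamma=np.ones(8), beta=np.zeros(8))
# After normalization the output is roughly zero-mean, unit-std.
```

With gamma and beta left at 1 and 0 the output is simply the normalized batch; learning them restores the representation power the normalization removes.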
1769 443
1770 00:21:10,160 --> 00:21:15,019
1771 sense actually mu and Sigma are kind
1772
1773 444
1774 00:21:12,980 --> 00:21:18,259
1775 of the functions of the activation and
1776
1777 445
1778 00:21:15,019 --> 00:21:20,359
1779 on the other hand the scale gamma and
1780
1781 446
1782 00:21:18,259 --> 00:21:22,400
1783 shift beta are parameters to be
1784
1785 447
1786 00:21:20,359 --> 00:21:26,359
1787 learned they are just analogous to
1788
1789 448
1790 00:21:22,400 --> 00:21:28,670
1791 weights so this means that there will be
1792
1793 449
1794 00:21:26,359 --> 00:21:30,889
1795 two modes for batch normalization
1796
1797 450
1798 00:21:28,670 --> 00:21:34,250
1799 so the first mode is the training mode
1800
1801 451
1802 00:21:30,890 --> 00:21:38,660
1803 in which case mu and Sigma are functions
1804
1805 452
1806 00:21:34,250 --> 00:21:41,809
1807 of a batch of the activations and in the
1808
1809 453
1810 00:21:38,660 --> 00:21:43,790
1811 case of testing these statistics of mu
1812
1813 454
1814 00:21:41,809 --> 00:21:46,279
1815 and Sigma are pre computed on the
1816
1817 455
1818 00:21:43,789 --> 00:21:50,529
1819 training set and they will be frozen and
1820
1821 456
1822 00:21:46,279 --> 00:21:54,019
1823 fixed so in my research experience
1824
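The two modes just described can be sketched as follows. This is an assumed, simplified implementation for illustration: the momentum-style running average (momentum 0.1) is a common convention, not something stated in the talk.

```python
import numpy as np

# Two batch-norm modes: during training, mu/sigma come from the current
# mini-batch and a running average is accumulated; at test time those
# precomputed, frozen statistics are used instead.
class BatchNorm1D:
    def __init__(self, dim, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)
        self.run_mu, self.run_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, X, training):
        if training:
            mu, var = X.mean(axis=0), X.var(axis=0)
            m = self.momentum
            self.run_mu = (1 - m) * self.run_mu + m * mu
            self.run_var = (1 - m) * self.run_var + m * var
        else:  # test mode: statistics are frozen and fixed
            mu, var = self.run_mu, self.run_var
        return self.gamma * (X - mu) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(4)
for _ in range(200):                   # accumulate stats on "training" data
    bn(2.0 + rng.standard_normal((64, 4)), training=True)
Y = bn(2.0 + rng.standard_normal((64, 4)), training=False)
```

The practical bugs mentioned in the talk typically come from mixing these two modes up, e.g. evaluating with batch statistics instead of the frozen running ones.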
1825 457
1826 00:21:50,529 --> 00:21:56,059
1827 usually the difference between these
1828
1829 458
1830 00:21:54,019 --> 00:22:01,099
1831 two modes will create many practical
1832
1833 459
1834 00:21:56,059 --> 00:22:02,990
1835 problems in our implementation so I just
1836
1837 460
1838 00:22:01,099 --> 00:22:04,459
1839 want to note that we usually have to
1840
1841 461
1842 00:22:02,990 --> 00:22:07,130
1843 make sure our batch normalization
1844
1845 462
1846 00:22:04,460 --> 00:22:10,100
1847 usage or implementation is correct and
1848
1849 463
1850 00:22:07,130 --> 00:22:10,370
1851 usually it is the cause of bugs in many
1852
1853 464
1854 00:22:10,099 --> 00:22:15,319
1855 of
1856
1857 465
1858 00:22:10,369 --> 00:22:19,189
1859 our cases so anyway batch norm is great it can
1860
1861 466
1862 00:22:15,319 --> 00:22:21,919
1863 greatly help accelerate training and it
1864
1865 467
1866 00:22:19,190 --> 00:22:26,539
1867 can most of the time also improve
1868
1869 468
1870 00:22:21,920 --> 00:22:32,210
1871 accuracy so now we are ready to
1872
1873 469
1874 00:22:26,539 --> 00:22:34,220
1875 introduce the ResNet so we have very
1876
1877 470
1878 00:22:32,210 --> 00:22:39,650
1879 good initializations and we have batch norm
1880
1881 471
1882 00:22:34,220 --> 00:22:44,870
1883 so can we now train even deeper
1884
1885 471
1886 00:22:34,220 --> 00:22:44,870
1887 neural networks actually we have tried
1888
1889 473
1890 00:22:44,869 --> 00:22:50,949
1891 it so here we introduce the concept of
1892
1893 474
1894 00:22:48,529 --> 00:22:53,740
1895 plain network which is just to
1896
1897 475
1898 00:22:50,950 --> 00:22:56,779
1899 repeatedly stacking three by three
1900
1901 476
1902 00:22:53,740 --> 00:23:00,769
1903 convolutional layers so we train this
1904
1905 477
1906 00:22:56,779 --> 00:23:03,259
1907 network on the small CIFAR-10 dataset with
1908
1909 477
1910 00:23:00,769 --> 00:23:05,808
1911 a 20 layer version and also a 56
1912
1913 479
1914 00:23:03,259 --> 00:23:08,058
1915 layer version so somehow surprisingly we
1916
1917 480
1918 00:23:05,808 --> 00:23:11,269
1919 found that the deeper version is not
1920
1921 481
1922 00:23:08,058 --> 00:23:12,649
1923 better than the shallow version and even
1924
1925 482
1926 00:23:11,269 --> 00:23:15,170
1927 worse we found that the training error
1928
1929 483
1930 00:23:12,650 --> 00:23:19,190
1931 of the deeper version is higher than the
1932
1933 484
1934 00:23:15,170 --> 00:23:21,259
1935 training error of the shallower version so
1936
1937 485
1938 00:23:19,190 --> 00:23:23,900
1939 actually we found this is a general
1940
1941 486
1942 00:23:21,259 --> 00:23:27,289
1943 phenomenon and it is observed in almost
1944
1945 487
1946 00:23:23,900 --> 00:23:31,960
1947 all datasets and in many types of models
1948
1949 488
1950 00:23:27,289 --> 00:23:35,149
1951 if the models are plain networks so
1952
1953 489
1954 00:23:31,960 --> 00:23:38,299
1955 however this is counterintuitive in some
1956
1957 490
1958 00:23:35,150 --> 00:23:40,820
1959 sense think of a shallow model for
1960
1961 491
1962 00:23:38,299 --> 00:23:42,500
1963 example it has 18 layers on the other
1964
1965 492
1966 00:23:40,819 --> 00:23:44,750
1967 hand we also think of a deeper model
1968
1969 493
1970 00:23:42,500 --> 00:23:48,049
1971 which is a counterpart of the shallower
1972
1973 494
1974 00:23:44,750 --> 00:23:49,670
1975 model let's say it has 34 layers so
1976
1977 495
1978 00:23:48,049 --> 00:23:52,159
1979 actually a deeper model has a richer
1980
1981 496
1982 00:23:49,670 --> 00:23:54,230
1983 solution space and a deeper model should
1984
1985 497
1986 00:23:52,160 --> 00:23:57,080
1987 not have higher training error than its
1988
1989 498
1990 00:23:54,230 --> 00:23:59,620
1991 shallower counterpart this is because we
1992
1993 499
1994 00:23:57,079 --> 00:24:02,058
1995 can think of a solution by construction
1996
1997 500
1998 00:23:59,619 --> 00:24:04,969
1999 so let's say we have a well trained
2000
2001 501
2002 00:24:02,058 --> 00:24:07,308
2003 shallow model then we can just copy the
2004
2005 502
2006 00:24:04,970 --> 00:24:09,558
2007 weights from this model to the deeper
2008
2009 503
2010 00:24:07,308 --> 00:24:11,210
2011 model and then for the extra layers in
2012
2013 504
2014 00:24:09,558 --> 00:24:14,450
2015 the deeper model we can just simply set
2016
2017 505
2018 00:24:11,210 --> 00:24:16,370
2019 them as the identity so the existence of
2020
2021 506
2022 00:24:14,450 --> 00:24:18,920
2023 this solution for the deeper model
2024
2025 507
2026 00:24:16,369 --> 00:24:21,049
2027 indicates that it should at least have
2028
2029 508
2030 00:24:18,920 --> 00:24:22,950
2031 more or less the same training error as
2032
2033 509
2034 00:24:21,049 --> 00:24:25,648
2035 the shallower model
2036
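The construction argument just made can be checked with a toy example. This is a deliberately simplified sketch (purely linear layers, no nonlinearities), assumed for illustration: copying a shallow model's weights into a deeper one and setting the extra layers to the identity leaves the computed function unchanged.

```python
import numpy as np

# Solution by construction: a deeper model built from a shallow one
# plus identity layers computes exactly the same function, so its
# training error can be no higher than the shallow model's.
def forward(x, layers):
    for W in layers:
        x = W @ x
    return x

rng = np.random.default_rng(0)
shallow = [rng.standard_normal((8, 8)) for _ in range(3)]   # "trained" weights
deeper = shallow + [np.eye(8), np.eye(8)]                   # extra layers = identity

x = rng.standard_normal(8)
# The two models agree on every input by construction.
assert np.allclose(forward(x, shallow), forward(x, deeper))
```

The degradation problem is that solvers fail to find even this trivially available solution for deep plain networks.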
2037 510
2038 00:24:22,950 --> 00:24:28,288
2039 so the degradation problem observed in
2040
2041 511
2042 00:24:25,648 --> 00:24:30,418
2043 experiments indicates there might be some
2044
2045 512
2046 00:24:28,288 --> 00:24:33,058
2047 optimization difficulties in the current
2048
2049 513
2050 00:24:30,419 --> 00:24:35,970
2051 solvers so the solvers cannot just find
2052
2053 514
2054 00:24:33,058 --> 00:24:38,278
2055 a solution when we create deeper and
2056
2057 515
2058 00:24:35,970 --> 00:24:40,259
2059 deeper models so this is the
2060
2061 516
2062 00:24:38,278 --> 00:24:43,829
2063 motivation of developing the deep
2064
2065 517
2066 00:24:40,259 --> 00:24:47,128
2067 residual network so let's think about a
2068
2069 518
2070 00:24:43,829 --> 00:24:52,378
2071 plain net so we may think about any
2072
2073 519
2074 00:24:47,128 --> 00:24:55,259
2075 2 or 3 or any number of consecutive
2076
2077 520
2078 00:24:52,378 --> 00:24:59,308
2079 layers in the plain net acting
2080
2081 521
2082 00:24:55,259 --> 00:25:01,710
2083 as a small subnet so let's say H(x) is
2084
2085 522
2086 00:24:59,308 --> 00:25:06,089
2087 the desired mapping to be learned by
2088
2089 523
2090 00:25:01,710 --> 00:25:10,169
2091 this small subnet so we just hope this
2092
2093 524
2094 00:25:06,089 --> 00:25:13,138
2095 small subnet to fit this mapping and
2096
2097 525
2098 00:25:10,169 --> 00:25:15,720
2099 in the case of this residual
2100
2101 526
2102 00:25:13,138 --> 00:25:18,329
2103 learning instead of directly fitting the
2104
2105 527
2106 00:25:15,720 --> 00:25:19,950
2107 mapping of this small subnet actually we
2108
2109 528
2110 00:25:18,329 --> 00:25:23,099
2111 hope the small subnet to fit another
2112
2113 529
2114 00:25:19,950 --> 00:25:26,190
2115 mapping which is called F(x) and then let
2116
2117 530
2118 00:25:23,099 --> 00:25:30,719
2119 the desired mapping be the summation of F(x)
2120
2121 531
2122 00:25:26,190 --> 00:25:32,730
2123 and x so in this case actually F(x) is
2124
2125 532
2126 00:25:30,720 --> 00:25:36,690
2127 kind of a residual mapping with respect
2128
2129 533
2130 00:25:32,730 --> 00:25:39,358
2131 to identity so the heuristic is that if
2132
2133 534
2134 00:25:36,690 --> 00:25:42,690
2135 identity is optimal then it should be
2136
2137 535
2138 00:25:39,358 --> 00:25:44,398
2139 easy to just set all the weights as 0 so
2140
2141 536
2142 00:25:42,690 --> 00:25:47,249
2143 on the other hand if the optimal mapping
2144
2145 537
2146 00:25:44,398 --> 00:25:49,288
2147 is close to identity then it should
2148
2149 538
2150 00:25:47,249 --> 00:25:52,889
2151 also be easy to find some fluctuations
2152
2153 539
2154 00:25:49,288 --> 00:25:54,720
2155 in addition to the identity so
2156
2157 540
2158 00:25:52,888 --> 00:25:56,788
2159 this is the basic idea of the residual
2160
2161 541
2162 00:25:54,720 --> 00:25:59,399
2163 learning and actually it works very well
2164
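The residual idea just described can be sketched in a few lines. This is a toy fully connected version assumed for illustration, not the actual ResNet block (which uses convolutions and batch norm): the subnet fits the residual F(x) and the block outputs F(x) + x.

```python
import numpy as np

# A minimal residual block: the shortcut adds the input x back onto
# the learned residual F(x). If all weights are zero, F(x) = 0 and the
# block reduces exactly to the identity mapping, which is the
# heuristic that makes near-identity mappings easy to fit.
def residual_block(x, W1, W2):
    f = W2 @ np.maximum(W1 @ x, 0.0)    # F(x): two layers with ReLU
    return f + x                         # the shortcut addition

x = np.arange(4.0)
W_zero = np.zeros((4, 4))
y = residual_block(x, W_zero, W_zero)
# Zero weights -> the block is exactly the identity: y == x.
```

This is why setting the weights toward zero recovers identity, and small weights give small fluctuations around identity.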
2165 542
2166 00:25:56,788 --> 00:26:03,628
2167 so here are the experiments on the
2168
2169 543
2170 00:25:59,398 --> 00:26:07,918
2171 CIFAR-10 data set on the left hand side
2172
2173 544
2174 00:26:03,628 --> 00:26:10,048
2175 are the results of plain networks and on
2176
2177 545
2178 00:26:07,919 --> 00:26:13,049
2179 the right hand side the results of deep
2180
2181 546
2182 00:26:10,048 --> 00:26:14,970
2183 residual networks so as we can see the
2184
2185 547
2186 00:26:13,048 --> 00:26:17,908
2187 deep ResNet can be trained without
2188
2189 548
2190 00:26:14,970 --> 00:26:19,499
2191 difficulty and the deep ResNets have lower
2192
2193 549
2194 00:26:17,909 --> 00:26:25,070
2195 training error and also can be
2196
2197 550
2198 00:26:19,499 --> 00:26:28,618
2199 generalized to the test error so
2200
2201 551
2202 00:26:25,069 --> 00:26:33,388
2203 this means that the residual learning
2204
2205 552
2206 00:26:28,618 --> 00:26:36,269
2207 can help the optimization so
2208
2209 553
2210 00:26:33,388 --> 00:26:37,000
2211 we also tried this on the ImageNet data
2212
2213 554
2214 00:26:36,269 --> 00:26:39,579
2215 set
2216
2217 555
2218 00:26:37,000 --> 00:26:41,980
2219 so a practical design when we go even
2220
2221 556
2222 00:26:39,579 --> 00:26:45,639
2223 deeper is that we also introduced the
2224
2225 557
2226 00:26:41,980 --> 00:26:48,579
2227 bottleneck design into the residual
2228
2229 558
2230 00:26:45,640 --> 00:26:52,090
2231 block so instead of using two or three
2232
2233 559
2234 00:26:48,579 --> 00:26:54,099
2235 sequential 3x3 convolutions in each
2236
2237 560
2238 00:26:52,089 --> 00:26:56,678
2239 residual block we first use a one by one
2240
2241 561
2242 00:26:54,099 --> 00:26:58,750
2243 to reduce the number of channels and
2244
2245 562
2246 00:26:56,679 --> 00:27:01,090
2247 then we do the three by three
2248
2249 563
2250 00:26:58,750 --> 00:27:03,339
2251 convolution and on the other hand we
2252
2253 564
2254 00:27:01,089 --> 00:27:08,048
2255 also use another one by one to increase
2256
2257 565
2258 00:27:03,339 --> 00:27:10,089
2259 the dimension back to the original so in my
2260
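The saving from the bottleneck design just described is easy to verify with a back-of-the-envelope parameter count. The channel sizes 256 and 64 are assumed for illustration (they match the common bottleneck configuration, but are not quoted in the talk):

```python
# Parameter counts for a plain residual block (two 3x3 convs at 256
# channels) versus the bottleneck design (1x1 reduce to 64, 3x3 at 64,
# 1x1 expand back to 256).
def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k

plain = 2 * conv_params(256, 256, 3)            # two 3x3 convolutions
bottleneck = (conv_params(256, 64, 1)           # 1x1 channel reduce
              + conv_params(64, 64, 3)          # the 3x3 convolution
              + conv_params(64, 256, 1))        # 1x1 channel expand
print(plain, bottleneck)   # the bottleneck is roughly 17x cheaper
```

Since FLOPs scale with the parameter count at a fixed spatial size, the same ratio applies to compute, which is what lets the deeper models stay cheap.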
2261 566
2262 00:27:08,048 --> 00:27:13,210
2263 understanding this design of the
2264
2265 567
2266 00:27:10,089 --> 00:27:17,558
2267 bottleneck is also helpful in addition
2268
2269 568
2270 00:27:13,210 --> 00:27:20,490
2271 to going deeper so here are the results
2272
2273 569
2274 00:27:17,558 --> 00:27:24,609
2275 of ResNet on the ImageNet data set so
2276
2277 570
2278 00:27:20,490 --> 00:27:28,720
2279 we trained residual networks with 34
2280
2281 571
2282 00:27:24,609 --> 00:27:31,119
2283 layers 50 layers 101 layers and 152
2284
2285 572
2286 00:27:28,720 --> 00:27:34,450
2287 layers so the deeper the model is the
2288
2289 573
2290 00:27:31,119 --> 00:27:38,139
2291 lower the training and validation errors
2292
2293 574
2294 00:27:34,450 --> 00:27:40,298
2295 are and it is also worth mentioning that
2296
2297 575
2298 00:27:38,140 --> 00:27:43,809
2299 actually even our deepest model which
2300
2301 576
2302 00:27:40,298 --> 00:27:46,720
2303 has 152 layers actually has lower time
2304
2305 577
2306 00:27:43,808 --> 00:27:48,819
2307 complexity or number of flops than the
2308
2309 578
2310 00:27:46,720 --> 00:27:51,940
2311 the previous VGG models this is
2312
2313 579
2314 00:27:48,819 --> 00:27:54,839
2315 because if we can have a better
2316
2317 580
2318 00:27:51,940 --> 00:27:57,759
2319 representation power by going deeper
2320
2321 581
2322 00:27:54,839 --> 00:28:00,639
2323 we can just use a smaller number of
2324
2325 582
2326 00:27:57,759 --> 00:28:05,200
2327 channels which can help us to greatly
2328
2329 583
2330 00:28:00,640 --> 00:28:08,200
2331 reduce the complexity actually ResNet is
2332
2333 584
2334 00:28:05,200 --> 00:28:11,169
2335 very popular even beyond computer vision
2336
2337 585
2338 00:28:08,200 --> 00:28:14,140
2339 so for example in a recent work of
2340
2341 586
2342 00:28:11,169 --> 00:28:17,259
2343 neural machine translation they also
2344
2345 587
2346 00:28:14,140 --> 00:28:21,549
2347 reported they trained an 8-layer LSTM
2348
2349 588
2350 00:28:17,259 --> 00:28:23,109
2351 by stacking LSTM blocks so this is
2352
2353 589
2354 00:28:21,548 --> 00:28:25,418
2355 with the help of the residual
2356
2357 590
2358 00:28:23,109 --> 00:28:28,089
2359 connections so the paper reported that
2360
2361 591
2362 00:28:25,419 --> 00:28:31,000
2363 usually people are not able to train
2364
2365 592
2366 00:28:28,089 --> 00:28:32,798
2367 LSTMs with more than four layers but with
2368
2369 593
2370 00:28:31,000 --> 00:28:35,859
2371 residual connections this is possible
2372
2373 594
2374 00:28:32,798 --> 00:28:40,089
2375 and they can even have some gains with
2376
2377 595
2378 00:28:35,859 --> 00:28:43,048
2379 up to 8 layers or maybe even more so
2380
2381 596
2382 00:28:40,089 --> 00:28:46,359
2383 ResNet has also been used for speech
2384
2385 597
2386 00:28:43,048 --> 00:28:48,788
2387 synthesis in this paper called WaveNet
2388
2389 598
2390 00:28:46,359 --> 00:28:50,758
2391 they construct a residual convolutional
2392
2393 599
2394 00:28:48,788 --> 00:28:54,868
2395 network on the 1D sequence
2396
2397 600
2398 00:28:50,759 --> 00:28:57,628
2399 so unlike many works using recurrent networks
2400
2401 601
2402 00:28:54,868 --> 00:29:01,769
2403 or LSTMs to do the sequence to sequence
2404
2405 602
2406 00:28:57,628 --> 00:29:04,108
2407 learning in this work they use conv
2408
2409 603
2410 00:29:01,769 --> 00:29:07,199
2411 nets so the key to the success in my
2412
2413 604
2414 00:29:04,108 --> 00:29:10,499
2415 understanding lies in two designs the first
2416
2417 605
2418 00:29:07,199 --> 00:29:12,599
2419 one is they use some dilations which can
2420
2421 606
2422 00:29:10,499 --> 00:29:15,149
2423 help them to capture some long-term
2424
2425 607
2426 00:29:12,598 --> 00:29:19,798
2427 dependency on the other hand they just
2428
2429 608
2430 00:29:15,148 --> 00:29:21,478
2431 stack a lot of such layers so in this
2432
2433 609
2434 00:29:19,798 --> 00:29:23,429
2435 regard they can have a very large
2436
2437 610
2438 00:29:21,479 --> 00:29:26,548
2439 receptive field so they can have even
2440
2441 611
2442 00:29:23,429 --> 00:29:28,528
2443 longer dependency captured so when they
2444
2445 612
2446 00:29:26,548 --> 00:29:32,269
2447 stack a lot of layers the residual
2448
2449 613
2450 00:29:28,528 --> 00:29:34,979
2451 connections are the key to their success
2452
2453 614
2454 00:29:32,269 --> 00:29:37,440
2455 and this story also applies to
2456
2457 615
2458 00:29:34,979 --> 00:29:39,659
2459 speech recognition and in another paper
2460
2461 616
2462 00:29:37,440 --> 00:29:41,338
2463 they also train a residual convolutional
2464
2465 617
2466 00:29:39,659 --> 00:29:46,469
2467 network on the 1D sequence with the
2468
2469 618
2470 00:29:41,338 --> 00:29:49,440
2471 help of residual connections so next I
2472
2473 619
2474 00:29:46,469 --> 00:29:51,838
2475 will talk about one extension of ResNet
2476
2477 620
2478 00:29:49,440 --> 00:29:56,788
2479 which we call ResNeXt and this
2480
2481 621
2482 00:29:51,838 --> 00:29:58,858
2483 work will be presented at this venue so
2484
2485 622
2486 00:29:56,788 --> 00:30:00,388
2487 as I mentioned before in my understanding
2488
2489 623
2490 00:29:58,858 --> 00:30:02,638
2491 there are at least three key components
2492
2493 624
2494 00:30:00,388 --> 00:30:06,298
2495 in the GoogLeNet or Inception design
2496
2497 625
2498 00:30:02,638 --> 00:30:08,908
2499 shortcut bottleneck and multi branch so
2500
2501 626
2502 00:30:06,298 --> 00:30:11,128
2503 in the original ResNet design we have
2504
2505 627
2506 00:30:08,909 --> 00:30:13,469
2507 the shortcuts we have the bottleneck and
2508
2509 628
2510 00:30:11,128 --> 00:30:18,418
2511 we didn't have the multiple branch
2512
2513 629
2514 00:30:13,469 --> 00:30:21,979
2515 design so ResNeXt is just a simple multi
2516
2517 630
2518 00:30:18,419 --> 00:30:25,679
2519 branch component designed in the
2520
2521 631
2522 00:30:21,979 --> 00:30:27,899
2523 scenario of ResNet so unlike Inception
2524
2525 632
2526 00:30:25,679 --> 00:30:30,319
2527 which has a heterogeneous multi branch
2528
2529 633
2530 00:30:27,898 --> 00:30:34,108
2531 building block which means that it has
2532
2533 634
2534 00:30:30,318 --> 00:30:36,898
2535 different shapes of different branches
2536
2537 635
2538 00:30:34,108 --> 00:30:39,058
2539 in the case of ResNeXt all the branches
2540
2541 636
2542 00:30:36,898 --> 00:30:43,288
2543 share the same shape and the same
2544
2545 637
2546 00:30:39,058 --> 00:30:45,808
2547 number of filters so the main
2548
2549 638
2550 00:30:43,288 --> 00:30:48,078
2551 observations in the ResNeXt paper are as
2552
2553 639
2554 00:30:45,808 --> 00:30:50,808
2555 follows so first we find that actually
2556
2557 640
2558 00:30:48,078 --> 00:30:53,548
2559 concatenation and addition are
2560
2561 641
2562 00:30:50,808 --> 00:30:56,729
2563 interchangeable so actually this is a
2564
2565 642
2566 00:30:53,548 --> 00:30:58,979
2567 general property for deep neural networks
2568
2569 643
2570 00:30:56,729 --> 00:31:03,479
2571 it is just not limited to ResNeXt or
2572
2573 644
2574 00:30:58,979 --> 00:31:04,558
2575 ResNet so if we have a uniform
2576
2577 645
2578 00:31:03,479 --> 00:31:07,169
2579 multi-branch
2580
2581 646
2582 00:31:04,558 --> 00:31:09,749
2583 network then we can just use this
2584
2585 647
2586 00:31:07,169 --> 00:31:12,690
2587 property to convert it into some
2588
2589 648
2590 00:31:09,749 --> 00:31:15,569
2591 group convolutions so for example this
2592
2593 649
2594 00:31:12,690 --> 00:31:20,159
2595 is the original design of the
2596
2597 650
2598 00:31:15,569 --> 00:31:23,579
2599 ResNeXt which has many branches of the
2600
2601 651
2602 00:31:20,159 --> 00:31:27,778
2603 same shape so we can show that instead of
2604
2605 652
2606 00:31:23,579 --> 00:31:31,668
2607 doing the addition in this block we can
2608
2609 653
2610 00:31:27,778 --> 00:31:34,288
2611 insert a concatenation in the
2612
2613 654
2614 00:31:31,669 --> 00:31:36,480
2615 second set of layers in this block and
2616
2617 655
2618 00:31:34,288 --> 00:31:39,990
2619 after this concatenation we can just put it
2620
2621 656
2622 00:31:36,480 --> 00:31:42,960
2623 into a single or wider one by one
2624
2625 657
2626 00:31:39,990 --> 00:31:45,028
2627 convolution and also because of this
2628
2629 658
2630 00:31:42,960 --> 00:31:47,610
2631 concatenation we can replace the previous
2632
2633 659
2634 00:31:45,028 --> 00:31:51,839
2635 layer with a group convolutional layer
2636
2637 660
2638 00:31:47,609 --> 00:31:56,329
2639 which can be implemented more efficiently
2640
2641 661
2642 00:31:51,839 --> 00:32:00,298
2643 than naively doing many branches
2644
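The concatenation/addition interchange just described can be checked numerically. Here 1x1 convolutions are modeled as matrix multiplies on a channel vector, and the 32 branches of width 4 are an illustrative shape assumed for the example: summing the branch outputs equals concatenating the branch features and applying one wider 1x1 convolution.

```python
import numpy as np

# Equivalence behind ResNeXt: addition of same-shaped branch outputs
# == concatenation of branch features followed by a single wide 1x1
# conv whose weight stacks the branches' last-layer weights.
rng = np.random.default_rng(0)
branches = [rng.standard_normal((256, 4)) for _ in range(32)]  # last 1x1 per branch
feats = [rng.standard_normal(4) for _ in range(32)]            # per-branch features

summed = sum(W @ f for W, f in zip(branches, feats))           # addition form
W_wide = np.concatenate(branches, axis=1)                      # one wide 1x1 conv
f_cat = np.concatenate(feats)                                  # concatenation form
assert np.allclose(summed, W_wide @ f_cat)
```

The earlier branch layers, which act on disjoint slices of the concatenated features, are exactly what a group convolution computes, which is why the whole block can be implemented with grouped layers.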
2645 662
2646 00:31:56,329 --> 00:32:03,509
2647 so actually the idea of this uniform
2648
2649 663
2650 00:32:00,298 --> 00:32:06,839
2651 multiple branching is very successful we
2652
2653 664
2654 00:32:03,509 --> 00:32:10,079
2655 have observed better accuracy than rest
2656
2657 665
2658 00:32:06,839 --> 00:32:12,628
2659 net when we manually keep the number of
2660
2661 666
2662 00:32:10,079 --> 00:32:16,288
2663 flops and parameters the same as the original
2664
2665 667
2666 00:32:12,628 --> 00:32:20,069
2667 ResNet for example in this figure the
2668
2669 668
2670 00:32:16,288 --> 00:32:23,609
2671 x-axis is the number of epochs and the
2672
2673 669
2674 00:32:20,069 --> 00:32:26,849
2675 y-axis is the top-1 error rate so the blue
2676
2677 670
2678 00:32:23,609 --> 00:32:30,028
2679 lines are the original ResNet and the
2680
2681 671
2682 00:32:26,849 --> 00:32:32,998
2683 red lines are the ResNeXt model so
2684
2685 672
2686 00:32:30,028 --> 00:32:35,849
2687 the dashed lines are the training error
2688
2689 673
2690 00:32:32,999 --> 00:32:38,399
2691 and the solid lines are the validation
2692
2693 674
2694 00:32:35,849 --> 00:32:40,798
2695 error so we can see that when having the
2696
2697 675
2698 00:32:38,398 --> 00:32:42,868
2699 same number of flops and parameters the
2700
2701 676
2702 00:32:40,798 --> 00:32:44,460
2703 ResNeXt model is better than the
2704
2705 677
2706 00:32:42,868 --> 00:32:46,888
2707 original ResNet model
2708
2709 678
2710 00:32:44,460 --> 00:32:49,590
2711 so actually what we learn from this is
2712
2713 679
2714 00:32:46,888 --> 00:32:53,298
2715 that we can have a better trade-off when
2716
2717 680
2718 00:32:49,589 --> 00:32:57,349
2719 we train larger and larger models and
2720
2721 681
2722 00:32:53,298 --> 00:33:00,388
2723 also this ResNeXt model can be
2724
2725 682
2726 00:32:57,349 --> 00:33:02,368
2727 generalized very well to some other more
2728
2729 683
2730 00:33:00,388 --> 00:33:05,459
2731 complicated recognition tasks such as
2732
2733 684
2734 00:33:02,368 --> 00:33:09,028
2735 object detection and instance segmentation
2736
2737 685
2738 00:33:05,460 --> 00:33:12,329
2739 so here are the results of ResNeXt for
2740
2741 686
2742 00:33:09,028 --> 00:33:15,538
2743 Mask R-CNN so we can see that our best
2744
2745 687
2746 00:33:12,329 --> 00:33:18,269
2747 ResNeXt model can improve the COCO
2748
2749 688
2750 00:33:15,538 --> 00:33:22,170
2751 bounding box AP by 1.6 and also
2752
2753 689
2754 00:33:18,269 --> 00:33:25,440
2755 the mask AP by 1.4 so all these results
2756
2757 690
2758 00:33:22,170 --> 00:33:30,240
2759 indicate that features do matter in the
2760
2761 691
2762 00:33:25,440 --> 00:33:32,269
2763 current visual recognition community so
2764
2765 692
2766 00:33:30,240 --> 00:33:35,130
2767 also there are many architectures
2768
2769 693
2770 00:33:32,269 --> 00:33:38,839
2771 developed recently which are not covered
2772
2773 694
2774 00:33:35,130 --> 00:33:42,960
2775 in this tutorial for example there is a
2776
2777 695
2778 00:33:38,839 --> 00:33:46,759
2779 Inception-ResNet which uses Inception as
2780
2781 696
2782 00:33:42,960 --> 00:33:49,319
2783 the transformation function and also
2784
2785 697
2786 00:33:46,759 --> 00:33:53,069
2787 training with the help of residual
2788
2789 698
2790 00:33:49,319 --> 00:33:56,339
2791 connections on the other hand a method
2792
2793 699
2794 00:33:53,069 --> 00:33:58,980
2795 called DenseNet is developed which uses
2796
2797 700
2798 00:33:56,339 --> 00:34:00,990
2799 some densely connected
2800
2801 701
2802 00:33:58,980 --> 00:34:04,410
2803 shortcuts which are merged by a
2804
2805 702
2806 00:34:00,990 --> 00:34:06,690
2807 concatenation and on the other hand also in
2808
2809 703
2810 00:34:04,410 --> 00:34:09,750
2811 this direction there is a work called
2812
2813 704
2814 00:34:06,690 --> 00:34:12,260
2815 Xception and later on another work
2816
2817 705
2818 00:34:09,750 --> 00:34:14,849
2819 called MobileNets both of which are
2820
2821 706
2822 00:34:12,260 --> 00:34:17,970
2823 driven by the so called depth wise
2824
2825 707
2826 00:34:14,849 --> 00:34:20,878
2827 convolutions so depthwise convolutions
2828
2829 708
2830 00:34:17,969 --> 00:34:22,829
2831 are a kind of grouped convolution where
2832
2833 709
2834 00:34:20,878 --> 00:34:25,949
2835 the number of groups equals the
2836
2837 710
2838 00:34:22,829 --> 00:34:28,349
2839 number of channels and also following
2840
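The relationship between standard, grouped, and depthwise convolutions described here can be sketched as a parameter count (a hypothetical Python illustration with made-up channel sizes, not code from any of the networks mentioned):

```python
# Parameter counts for standard, grouped, and depthwise convolutions.
# A grouped conv splits the channels into g groups; a depthwise conv
# is the special case where g equals the number of input channels.

def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution with the given group count."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k

standard  = conv_params(64, 64, 3)            # ordinary convolution
grouped   = conv_params(64, 64, 3, groups=4)  # ResNeXt-style grouped conv
depthwise = conv_params(64, 64, 3, groups=64) # MobileNet-style depthwise conv

print(standard, grouped, depthwise)  # 36864 9216 576
```

The count drops by a factor of g, which is why depthwise convolutions reduce model complexity so sharply.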
2841 711
2842 00:34:25,949 --> 00:34:31,408
2843 this line another recent work called
2844
2845 712
2846 00:34:28,349 --> 00:34:34,500
2847 ShuffleNet uses even more grouped or depth
2848
2849 713
2850 00:34:31,409 --> 00:34:38,190
2851 wise convolutions with channel shuffle to reduce
2852
2853 714
2854 00:34:34,500 --> 00:34:40,398
2855 the complexity of the models so actually
2856
2857 715
2858 00:34:38,190 --> 00:34:43,349
2859 as we can see from all these
2860
2861 716
2862 00:34:40,398 --> 00:34:48,059
2863 illustrations there are still three key
2864
2865 717
2866 00:34:43,349 --> 00:34:50,309
2867 components in these designs the first
2868
2869 718
2870 00:34:48,059 --> 00:34:53,398
2871 one is the usage of shortcuts and the
2872
2873 719
2874 00:34:50,309 --> 00:34:57,929
2875 second one is a bottleneck and the third
2876
2877 720
2878 00:34:53,398 --> 00:35:01,889
2879 one is multiple branching so finally I
2880
2881 721
2882 00:34:57,929 --> 00:35:04,889
2883 would also like to mention a recent work
2884
2885 722
2886 00:35:01,889 --> 00:35:07,019
2887 done by Facebook AI Research and Applied
2888
2889 723
2890 00:35:04,889 --> 00:35:09,480
2891 Machine Learning teams so now we are
2892
2893 724
2894 00:35:07,019 --> 00:35:14,869
2895 able to train on the ImageNet data set in
2896
2897 725
2898 00:35:09,480 --> 00:35:18,869
2899 one hour so our setting is driven by 256
2900
2901 726
2902 00:35:14,869 --> 00:35:24,050
2903 GPUs so now we can train an ImageNet
2904
2905 727
2906 00:35:18,869 --> 00:35:27,960
2907 model using a mini batch size of 8192
2908
2909 728
2910 00:35:24,050 --> 00:35:30,180
2911 using synchronized SGD so we
2912
2913 729
2914 00:35:27,960 --> 00:35:31,920
2915 trained a ResNet-50 on this data set
2916
2917 730
2918 00:35:30,179 --> 00:35:35,909
2919 so we have observed
2920
2921 731
2922 00:35:31,920 --> 00:35:39,720
2923 no loss of accuracy so as shown in the
2924
2925 732
2926 00:35:35,909 --> 00:35:42,960
2927 figure here the x-axis is the mini-batch
2928
2929 733
2930 00:35:39,719 --> 00:35:47,308
2931 size and the y-axis is the accuracy of
2932
2933 734
2934 00:35:42,960 --> 00:35:49,798
2935 the model so we can see that in a
2936
2937 735
2938 00:35:47,309 --> 00:35:55,319
2939 very wide spectrum of mini batch size
2940
2941 736
2942 00:35:49,798 --> 00:35:58,440
2943 from 64 to 8k the models have more or
2944
2945 737
2946 00:35:55,318 --> 00:36:01,920
2947 less the same accuracy in this case up
2948
2949 738
2950 00:35:58,440 --> 00:36:05,460
2951 to some random variation so the key
2952
2953 739
2954 00:36:01,920 --> 00:36:08,338
2955 factors in this algorithm are in two
2956
2957 740
2958 00:36:05,460 --> 00:36:11,429
2959 aspects the first aspect is a linear
2960
2961 741
2962 00:36:08,338 --> 00:36:14,338
2963 scaling of the learning rate with the
2964
2965 742
2966 00:36:11,429 --> 00:36:16,019
2967 mini-batch size another factor is the warm-up
2968
2969 743
2970 00:36:14,338 --> 00:36:19,170
2971 of learning rates at the beginning of
2972
2973 744
2974 00:36:16,019 --> 00:36:22,048
2975 training and also in our paper we report
2976
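The two factors just mentioned — linear scaling of the learning rate with the mini-batch size, plus a warm-up phase — can be sketched as follows (a hypothetical Python illustration; the base rate of 0.1 per 256 images and the 5-epoch warm-up are illustrative assumptions, not values quoted in this talk):

```python
# Large-minibatch SGD schedule: scale the base learning rate linearly
# with the mini-batch size, and ramp up to it over the first epochs.

def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256,
                  warmup_epochs=5):
    target = base_lr * batch_size / base_batch  # linear scaling rule
    if epoch < warmup_epochs:
        # warm-up: ramp linearly from base_lr to the scaled target
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target

print(learning_rate(0, 8192))  # 0.1 (start of warm-up)
print(learning_rate(5, 8192))  # 3.2 (fully scaled: 0.1 * 8192 / 256)
```

At the reference batch size of 256 the schedule reduces to the ordinary constant rate, so small-batch training is unchanged.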
2977 745
2978 00:36:19,170 --> 00:36:24,930
2979 some theoretical foundations of the
2980
2981 746
2982 00:36:22,048 --> 00:36:27,059
2983 usage of these two factors and also in
2984
2985 747
2986 00:36:24,929 --> 00:36:29,578
2987 my experience I found one of the most
2988
2989 748
2990 00:36:27,059 --> 00:36:32,099
2991 important factors is just to implement
2992
2993 749
2994 00:36:29,579 --> 00:36:36,869
2995 everything correctly in the case of
2996
2997 750
2998 00:36:32,099 --> 00:36:40,318
2999 multiple GPUs or multiple machines so in
3000
3001 751
3002 00:36:36,869 --> 00:36:42,568
3003 conclusion in my understanding the
3004
3005 752
3006 00:36:40,318 --> 00:36:44,599
3007 success of deep learning is the success
3008
3009 753
3010 00:36:42,568 --> 00:36:48,329
3011 of feature learning so in the case of
3012
3013 754
3014 00:36:44,599 --> 00:36:52,769
3015 visual recognition features still matter
3016
3017 755
3018 00:36:48,329 --> 00:36:54,809
3019 so the power of deep features can be
3020
3021 756
3022 00:36:52,769 --> 00:36:57,690
3023 demonstrated by the amazing visual
3024
3025 757
3026 00:36:54,809 --> 00:37:00,480
3027 recognition results on complicated tasks
3028
3029 758
3030 00:36:57,690 --> 00:37:03,030
3031 such as instance segmentation so here I
3032
3033 759
3034 00:37:00,480 --> 00:37:06,000
3035 show a result of Mask R-CNN with ResNet
3036
3037 760
3038 00:37:03,030 --> 00:37:09,829
3039 101 which will be covered in the next talk
3040
3041 761
3042 00:37:06,000 --> 00:37:09,829
3043 by Ross so that's all thank you
3044
3045 762
3046 00:37:13,539 --> 00:37:25,640
3047 okay any questions there is
3048
3049 763
3050 00:37:18,650 --> 00:37:28,249
3051 a microphone there okay I have a
3052
3053 764
3054 00:37:25,639 --> 00:37:30,379
3055 question so how do you determine the
3056
3057 765
3058 00:37:28,248 --> 00:37:32,058
3059 number of layers you want so when you
3060
3061 766
3062 00:37:30,380 --> 00:37:34,130
3063 design the network that's the first
3064
3065 767
3066 00:37:32,059 --> 00:37:36,619
3067 question and the second question is how do you
3068
3069 768
3070 00:37:34,130 --> 00:37:40,189
3071 think the mathematics of deep learning
3072
3073 769
3074 00:37:36,619 --> 00:37:42,349
3075 do you think it's understandable when
3076
3077 770
3078 00:37:40,188 --> 00:37:45,828
3079 you design the model yes so the number of
3080
3081 771
3082 00:37:42,349 --> 00:37:48,439
3083 layers so before the development of
3084
3085 772
3086 00:37:45,829 --> 00:37:50,449
3087 deep ResNet the number of layers is
3088
3089 773
3090 00:37:48,438 --> 00:37:53,838
3091 still kind of trial and error because we
3092
3093 774
3094 00:37:50,449 --> 00:37:55,818
3095 don't know when it degrades after the
3096
3097 775
3098 00:37:53,838 --> 00:37:57,889
3099 development of ResNet in theory we can
3100
3101 776
3102 00:37:55,818 --> 00:38:00,168
3103 just go deeper so now the number of
3104
3105 777
3106 00:37:57,889 --> 00:38:03,798
3107 layers are mainly limited by practical
3108
3109 778
3110 00:38:00,168 --> 00:38:06,558
3111 concerns such as memory or running
3112
3113 779
3114 00:38:03,798 --> 00:38:12,079
3115 time or sometimes by
3116
3117 780
3118 00:38:06,559 --> 00:38:13,640
3119 the amount of data so for the
3120
3121 781
3122 00:38:12,079 --> 00:38:15,919
3123 second question in my understanding
3124
3125 782
3126 00:38:13,639 --> 00:38:18,798
3127 there are many recent interesting works
3128
3129 783
3130 00:38:15,918 --> 00:38:21,139
3131 trying to explain the mathematics of
3132
3133 784
3134 00:38:18,798 --> 00:38:24,199
3135 deep learning but in my understanding it
3136
3137 785
3138 00:38:21,139 --> 00:38:26,748
3139 is still kind of an open question so I
3140
3141 786
3142 00:38:24,199 --> 00:38:31,028
3143 hope we will see more results in the
3144
3145 787
3146 00:38:26,748 --> 00:38:33,759
3147 future let's thank our speaker
3148
3149 788
3150 00:38:31,028 --> 00:38:35,730
3151 thank you
3152
3153 789
3154 00:38:33,760 --> 00:38:39,370
3155 [Music]
3156
3157 790
3158 00:38:35,730 --> 00:38:41,710
3159 so our next speaker will be Ross
3160
3161 791
3162 00:38:39,369 --> 00:38:44,469
3163 Girshick so Ross right now is a research
3164
3165 792
3166 00:38:41,710 --> 00:38:46,539
3167 scientist at Facebook AI Research so he
3168
3169 793
3170 00:38:44,469 --> 00:38:48,789
3171 has done a lot of interesting and
3172
3173 794
3174 00:38:46,539 --> 00:38:52,960
3175 very influential work on object detection
3176
3177 795
3178 00:38:48,789 --> 00:38:54,880
3179 so he proposed R-CNN and also
3180
3181 796
3182 00:38:52,960 --> 00:38:57,760
3183 was involved in the development of Fast
3184
3185 797
3186 00:38:54,880 --> 00:39:01,000
3187 R-CNN Faster R-CNN and the next step
3188
3189 798
3190 00:38:57,760 --> 00:39:03,630
3191 which is Mask R-CNN so today he will
3192
3193 799
3194 00:39:01,000 --> 00:39:06,219
3195 talk about object detection and
3196
3197 800
3198 00:39:03,630 --> 00:39:17,740
3199 instance-level understanding so
3200
3201 801
3202 00:39:06,219 --> 00:39:19,779
3203 let's welcome our speaker all right thank
3204
3205 802
3206 00:39:17,739 --> 00:39:21,849
3207 you very much I'll try to lean into this
3208
3209 803
3210 00:39:19,780 --> 00:39:23,860
3211 microphone so you can hear me so I'll be
3212
3213 804
3214 00:39:21,849 --> 00:39:25,710
3215 talking today about deep learning for
3216
3217 805
3218 00:39:23,860 --> 00:39:27,789
3219 instance level object understanding and
3220
3221 806
3222 00:39:25,710 --> 00:39:29,610
3223 this is work with a bunch of
3224
3225 807
3226 00:39:27,789 --> 00:39:33,159
3227 collaborators at Facebook AI Research
3228
3229 808
3230 00:39:29,610 --> 00:39:33,940
3231 so first of all to get started thank you
3232
3233 809
3234 00:39:33,159 --> 00:39:35,889
3235 for being here
3236
3237 810
3238 00:39:33,940 --> 00:39:39,789
3239 you know there's tough competition for
3240
3241 811
3242 00:39:35,889 --> 00:39:41,259
3243 for your attention so you know I'm
3244
3245 812
3246 00:39:39,789 --> 00:39:45,099
3247 glad to see that the room is still full
3248
3249 813
3250 00:39:41,260 --> 00:39:47,560
3251 and that we won out so now there's a
3252
3253 814
3254 00:39:45,099 --> 00:39:50,079
3255 lot to cover so let's dive right in so
3256
3257 815
3258 00:39:47,559 --> 00:39:51,909
3259 the first is you've just seen this talk
3260
3261 816
3262 00:39:50,079 --> 00:39:55,000
3263 and now you're all experts on deep
3264
3265 817
3266 00:39:51,909 --> 00:39:57,848
3267 representations so now
3268
3269 818
3270 00:39:55,000 --> 00:40:00,639
3271 and take that information and apply it
3272
3273 819
3274 00:39:57,849 --> 00:40:02,559
3275 to some other computer vision tasks such
3276
3277 820
3278 00:40:00,639 --> 00:40:06,460
3279 as object detection and instance level
3280
3281 821
3282 00:40:02,559 --> 00:40:09,639
3283 understanding so this is what object
3284
3285 822
3286 00:40:06,460 --> 00:40:11,710
3287 detection looked like in 2007 so this is
3288
3289 823
3290 00:40:09,639 --> 00:40:13,480
3291 when I started my PhD so first of all
3292
3293 824
3294 00:40:11,710 --> 00:40:16,358
3295 images were black and white because this
3296
3297 825
3298 00:40:13,480 --> 00:40:20,409
3299 was forever ago and you know we could
3300
3301 826
3302 00:40:16,358 --> 00:40:21,900
3303 detect things but really object
3304
3305 827
3306 00:40:20,409 --> 00:40:24,670
3307 detectors were not working all that well
3308
3309 828
3310 00:40:21,900 --> 00:40:26,980
3311 so this is what object detection looks
3312
3313 829
3314 00:40:24,670 --> 00:40:29,260
3315 like today so first of all there are a
3316
3317 830
3318 00:40:26,980 --> 00:40:31,960
3319 few things that you can look at that are
3320
3321 831
3322 00:40:29,260 --> 00:40:33,910
3323 different here so one is that the scenes
3324
3325 832
3326 00:40:31,960 --> 00:40:36,639
3327 that we can successfully detect objects
3328
3329 833
3330 00:40:33,909 --> 00:40:39,399
3331 in are much more complicated than they
3332
3333 834
3334 00:40:36,639 --> 00:40:41,230
3335 were back in 2007 the second thing is
3336
3337 835
3338 00:40:39,400 --> 00:40:43,389
3339 not only can we put a bounding box
3340
3341 836
3342 00:40:41,230 --> 00:40:46,059
3343 around objects but we can actually now
3344
3345 837
3346 00:40:43,389 --> 00:40:48,159
3347 provide a lot more information about
3348
3349 838
3350 00:40:46,059 --> 00:40:50,469
3351 those objects so for example we can
3352
3353 839
3354 00:40:48,159 --> 00:40:54,670
3355 provide detailed pixel-level instance
3356
3357 840
3358 00:40:50,469 --> 00:40:56,379
3359 segmentations so that's why the title of
3360
3361 841
3362 00:40:54,670 --> 00:40:58,230
3363 this is not just object detection but
3364
3365 842
3366 00:40:56,380 --> 00:41:00,608
3367 rather instance-level understanding
3368
3369 843
3370 00:40:58,230 --> 00:41:02,289
3371 because we're now moving away from
3372
3373 844
3374 00:41:00,608 --> 00:41:05,170
3375 simply detecting objects and putting
3376
3377 845
3378 00:41:02,289 --> 00:41:06,608
3379 boxes around them to being able to find
3380
3381 846
3382 00:41:05,170 --> 00:41:08,920
3383 the objects say what category they are
3384
3385 847
3386 00:41:06,608 --> 00:41:11,920
3387 but also provide much richer information
3388
3389 848
3390 00:41:08,920 --> 00:41:14,470
3391 about what those objects are so as two
3392
3393 849
3394 00:41:11,920 --> 00:41:16,630
3395 examples of this so one is that for the
3396
3397 850
3398 00:41:14,469 --> 00:41:20,588
3399 person category which is your particular
3400
3401 851
3402 00:41:16,630 --> 00:41:25,869
3403 focus of study we can provide a key
3404
3405 852
3406 00:41:20,588 --> 00:41:28,779
3407 point pose estimation and again for
3408
3409 853
3410 00:41:25,869 --> 00:41:30,519
3411 people we can also now pretty reliably
3412
3413 854
3414 00:41:28,780 --> 00:41:32,380
3415 detect the activities that those people
3416
3417 855
3418 00:41:30,519 --> 00:41:37,239
3419 are engaged in and what the target
3420
3421 856
3422 00:41:32,380 --> 00:41:38,950
3423 objects of those activities are so the
3424
3425 857
3426 00:41:37,239 --> 00:41:40,568
3427 outline for this talk is that I'll spend
3428
3429 858
3430 00:41:38,949 --> 00:41:42,969
3431 the first part of it going over Mask
3432
3433 859
3434 00:41:40,568 --> 00:41:44,858
3435 R-CNN which is the system that
3436
3437 860
3438 00:41:42,969 --> 00:41:47,019
3439 produced some of the visualizations I
3440
3441 861
3442 00:41:44,858 --> 00:41:49,150
3443 was just showing there and then in the
3444
3445 862
3446 00:41:47,019 --> 00:41:51,550
3447 second part of the talk I'll try to
3448
3449 863
3450 00:41:49,150 --> 00:41:54,190
3451 provide a brief survey of the current
3452
3453 864
3454 00:41:51,550 --> 00:41:56,920
3455 landscape of object detection using deep
3456
3457 865
3458 00:41:54,190 --> 00:41:58,389
3459 learning and this is really a vast area
3460
3461 866
3462 00:41:56,920 --> 00:42:00,130
3463 that's kind of exploded over the last
3464
3465 867
3466 00:41:58,389 --> 00:42:02,650
3467 few years so it's impossible to cover
3468
3469 868
3470 00:42:00,130 --> 00:42:04,568
3471 all of it in any detail so I'll just hit
3472
3473 869
3474 00:42:02,650 --> 00:42:06,920
3475 upon a few of the points that I think
3476
3477 870
3478 00:42:04,568 --> 00:42:10,170
3479 are some of the most salient
3480
3481 871
3482 00:42:06,920 --> 00:42:12,150
3483 so let's start by specifying what the
3484
3485 872
3486 00:42:10,170 --> 00:42:14,550
3487 task is that we're going to look at here
3488
3489 873
3490 00:42:12,150 --> 00:42:16,619
3491 so we're going to look at instance
3492
3493 874
3494 00:42:14,550 --> 00:42:19,280
3495 segmentation and I'll show it in
3496
3497 875
3498 00:42:16,619 --> 00:42:21,630
3499 contrast to a couple of other tasks so
3500
3501 876
3502 00:42:19,280 --> 00:42:23,850
3503 object detection traditionally has been
3504
3505 877
3506 00:42:21,630 --> 00:42:26,010
3507 this task of putting boxes around the
3508
3509 878
3510 00:42:23,849 --> 00:42:28,259
3511 objects we're trying to detect so here
3512
3513 879
3514 00:42:26,010 --> 00:42:31,140
3515 there are five people that are detected
3516
3517 880
3518 00:42:28,260 --> 00:42:33,180
3519 and because there's a box around each
3520
3521 881
3522 00:42:31,139 --> 00:42:36,389
3523 one we can say that there are exactly five
3524
3525 882
3526 00:42:33,179 --> 00:42:38,969
3527 instances now there's this other problem
3528
3529 883
3530 00:42:36,389 --> 00:42:41,190
3531 called semantic segmentation in which
3532
3533 884
3534 00:42:38,969 --> 00:42:42,929
3535 you don't want to use a box because
3536
3537 885
3538 00:42:41,190 --> 00:42:45,599
3539 that's a very coarse representation for
3540
3541 886
3542 00:42:42,929 --> 00:42:47,909
3543 an object but instead you want to try to
3544
3545 887
3546 00:42:45,599 --> 00:42:50,429
3547 label each pixel so that you can much
3548
3549 888
3550 00:42:47,909 --> 00:42:52,139
3551 more precisely delineate what pixels on
3552
3553 889
3554 00:42:50,429 --> 00:42:55,500
3555 the background versus what pixel is on
3556
3557 890
3558 00:42:52,139 --> 00:42:58,289
3559 an object now compared to box level
3560
3561 891
3562 00:42:55,500 --> 00:43:00,929
3563 detection this is an advance but it's
3564
3565 892
3566 00:42:58,289 --> 00:43:03,840
3567 also a step backwards because at least
3568
3569 893
3570 00:43:00,929 --> 00:43:06,659
3571 if this is applied to things instead of
3572
3573 894
3574 00:43:03,840 --> 00:43:08,160
3575 stuff then what happens is you're no
3576
3577 895
3578 00:43:06,659 --> 00:43:11,369
3579 longer able to differentiate between
3580
3581 896
3582 00:43:08,159 --> 00:43:13,559
3583 different instances so all the person
3584
3585 897
3586 00:43:11,369 --> 00:43:16,589
3587 pixels got lumped together into one
3588
3589 898
3590 00:43:13,559 --> 00:43:18,659
3591 mass of person pixels so instance
3592
3593 899
3594 00:43:16,590 --> 00:43:20,280
3595 segmentation is essentially the task
3596
3597 900
3598 00:43:18,659 --> 00:43:23,699
3599 that tries to take the best of both
3600
3601 901
3602 00:43:20,280 --> 00:43:25,830
3603 worlds from these so not only do you get
3604
3605 902
3606 00:43:23,699 --> 00:43:28,739
3607 rid of this very crude box level
3608
3609 903
3610 00:43:25,829 --> 00:43:30,840
3611 representation but instead you can
3612
3613 904
3614 00:43:28,739 --> 00:43:32,669
3615 replace it with this much finer
3616
3617 905
3618 00:43:30,840 --> 00:43:34,740
3619 representation that delineates which
3620
3621 906
3622 00:43:32,670 --> 00:43:38,970
3623 pixels are part of the person which are
3624
3625 907
3626 00:43:34,739 --> 00:43:40,889
3627 background but unlike semantic
3628
3629 908
3630 00:43:38,969 --> 00:43:44,549
3631 segmentation you actually retain this
3632
3633 909
3634 00:43:40,889 --> 00:43:46,500
3635 notion of instance so you can tell even
3636
3637 910
3638 00:43:44,550 --> 00:43:52,019
3639 with the segmentation that there still
3640
3641 911
3642 00:43:46,500 --> 00:43:54,300
3643 are exactly five people so Mask R-CNN
3644
3645 912
3646 00:43:52,019 --> 00:43:57,119
3647 is a model that's developed to try to
3648
3649 913
3650 00:43:54,300 --> 00:43:59,940
3651 solve this task and I'm going to start
3652
3653 914
3654 00:43:57,119 --> 00:44:02,250
3655 by providing an overview of how the
3656
3657 915
3658 00:43:59,940 --> 00:44:03,900
3659 model is structured and the first thing
3660
3661 916
3662 00:44:02,250 --> 00:44:06,750
3663 that I want to sort of impress upon you
3664
3665 917
3666 00:44:03,900 --> 00:44:08,610
3667 with this is that we're now in this era
3668
3669 918
3670 00:44:06,750 --> 00:44:11,309
3671 where we can build a whole lot of
3672
3673 919
3674 00:44:08,610 --> 00:44:13,110
3675 different modular components and they
3676
3677 920
3678 00:44:11,309 --> 00:44:15,960
3679 often stack on top of each other and
3680
3681 921
3682 00:44:13,110 --> 00:44:18,000
3683 very useful and interesting ways so a
3684
3685 922
3686 00:44:15,960 --> 00:44:19,289
3687 typical system that's developed today is
3688
3689 923
3690 00:44:18,000 --> 00:44:20,969
3691 going to take
3692
3693 924
3694 00:44:19,289 --> 00:44:22,739
3695 the successful modules that were built
3696
3697 925
3698 00:44:20,969 --> 00:44:24,179
3699 in the last couple of years and then
3700
3701 926
3702 00:44:22,739 --> 00:44:26,339
3703 figure out a new way of adding some
3704
3705 927
3706 00:44:24,179 --> 00:44:28,679
3707 piece to that so that it has some new
3708
3709 928
3710 00:44:26,340 --> 00:44:30,660
3711 and interesting and useful property so
3712
3713 929
3714 00:44:28,679 --> 00:44:34,169
3715 the way that I'm going to describe this
3716
3717 930
3718 00:44:30,659 --> 00:44:36,420
3719 model is by walking through the sequence
3720
3721 931
3722 00:44:34,170 --> 00:44:38,730
3723 of modules that it is built up from and
3724
3725 932
3726 00:44:36,420 --> 00:44:41,519
3727 I'll try to provide sort of a bird's eye
3728
3729 933
3730 00:44:38,730 --> 00:44:42,960
3731 view of how this is being built so that
3732
3733 934
3734 00:44:41,519 --> 00:44:46,650
3735 you can sort of keep the whole picture
3736
3737 935
3738 00:44:42,960 --> 00:44:49,889
3739 in context along the way so the first
3740
3741 936
3742 00:44:46,650 --> 00:44:53,880
3743 place that I want to start is with the
3744
3745 937
3746 00:44:49,889 --> 00:44:55,859
3747 are CNN or region based CNN approach to
3748
3749 938
3750 00:44:53,880 --> 00:44:57,240
3751 object detection because this is the
3752
3753 939
3754 00:44:55,860 --> 00:45:00,420
3755 general framework that Mask R-CNN
3756
3757 940
3758 00:44:57,239 --> 00:45:02,369
3759 operates in so this was originally
3760
3761 941
3762 00:45:00,420 --> 00:45:04,500
3763 developed on with particular
3764
3765 942
3766 00:45:02,369 --> 00:45:05,880
3767 architecture and design choices but
3768
3769 944
3770 00:45:05,880 --> 00:45:11,340
3771 here I have sort of abstracted the
3772
3773 944
3774 00:45:05,880 --> 00:45:11,340
3775 concept a little bit away from those
3776
3777 945
3778 00:45:08,340 --> 00:45:13,289
3779 specific choices so in general there's
3780
3781 946
3782 00:45:11,340 --> 00:45:16,140
3783 some input image then there's some
3784
3785 947
3786 00:45:13,289 --> 00:45:18,570
3787 mechanism which produces object or
3788
3789 948
3790 00:45:16,139 --> 00:45:21,059
3791 region proposals this could either be an
3792
3793 949
3794 00:45:18,570 --> 00:45:22,860
3795 external source for example the very
3796
3797 950
3798 00:45:21,059 --> 00:45:26,190
3799 successful selective search algorithm or
3800
3801 951
3802 00:45:22,860 --> 00:45:28,470
3803 it could be a sort of internal source
3804
3805 952
3806 00:45:26,190 --> 00:45:30,360
3807 which is the network that's going to be
3808
3809 953
3810 00:45:28,469 --> 00:45:34,469
3811 doing object detection provides the
3812
3813 954
3814 00:45:30,360 --> 00:45:36,990
3815 proposals itself now the next part of
3816
3817 955
3818 00:45:34,469 --> 00:45:39,689
3819 this is that there's some region of
3820
3821 956
3822 00:45:36,989 --> 00:45:42,829
3823 interest or ROI transformation which is
3824
3825 957
3826 00:45:39,690 --> 00:45:42,829
3827 going to take
3828
3829 958
3830 00:45:43,840 --> 00:45:48,840
3831 to coerce it from its arbitrary shape
3832
3833 959
3834 00:45:49,590 --> 00:45:54,140
3835 into a fixed-size vector and this could actually
3836
3837 960
3838 00:45:54,960 --> 00:46:00,289
3839 some level after many
3840
3841 961
3842 00:46:00,300 --> 00:46:02,930
3843 convolutional layers
3844
3845 962
3846 00:46:02,969 --> 00:46:07,519
3847 after the RoI transformation is performed
3848
3849 963
3850 00:46:09,710 --> 00:46:14,740
3851 they're obviously going to be based on
3852
3853 964
3854 00:46:11,659 --> 00:46:14,739
3855 the deep neural network
3856
3857 965
3858 00:46:18,000 --> 00:46:23,329
3859 of classifying regions and then there
3860
3861 966
3862 00:46:23,989 --> 00:46:28,569
3863 are others which have to do with refining
3864
3865 967
3866 00:46:26,179 --> 00:46:28,569
3867 spatial
3868
3869 968
3870 00:46:28,710 --> 00:46:33,610
3871 location and then what I'll focus on a
3872
3873 969
3874 00:46:30,809 --> 00:46:38,460
3875 little bit more predicting
3876
3877 970
3878 00:46:33,610 --> 00:46:38,460
3879 masks now Mask R-CNN okay
3880
3881 971
3882 00:46:38,530 --> 00:46:43,330
3883 the general type of approach that we're
3884
3885 972
3886 00:46:40,480 --> 00:46:45,980
3887 taking now I want to go
3888
3889 973
3890 00:46:43,329 --> 00:46:48,369
3891 through the modules that are used to build up the
3892
3893 974
3894 00:46:45,980 --> 00:46:48,369
3895 system
3896
3897 975
3898 00:46:49,210 --> 00:46:54,400
3899 many examples up and that's using what
3900
3901 976
3902 00:46:52,420 --> 00:46:57,670
3903 we
3904
3905 977
3906 00:46:54,400 --> 00:46:59,789
3907 so the backbone architecture is going to
3908
3909 978
3910 00:46:57,670 --> 00:46:59,789
3911 be used
3912
3913 979
3914 00:47:00,400 --> 00:47:07,480
3915 to drive the whole recognition system
3916
3917 980
3918 00:47:02,860 --> 00:47:08,590
3919 and this could really be any successful
3920
3921 981
3922 00:47:07,480 --> 00:47:10,690
3923 Network that's been developed in the
3924
3925 982
3926 00:47:08,590 --> 00:47:12,070
3927 past or when one that's even more
3928
3929 983
3930 00:47:10,690 --> 00:47:14,740
3931 successful is developed a year from now
3932
3933 984
3934 00:47:12,070 --> 00:47:17,350
3935 that could be dropped in as well so for
3936
3937 985
3938 00:47:14,739 --> 00:47:21,489
3939 example it could be AlexNet VGG ResNet
3940
3941 986
3942 00:47:17,349 --> 00:47:23,769
3943 ResNeXt now a couple of very basic
3944
3945 987
3946 00:47:21,489 --> 00:47:27,429
3947 guidelines that are useful to keep in
3948
3949 988
3950 00:47:23,769 --> 00:47:29,590
3951 mind is that so the first is that it's
3952
3953 989
3954 00:47:27,429 --> 00:47:32,109
3955 useful to use what's often referred to
3956
3957 990
3958 00:47:29,590 --> 00:47:34,210
3959 is same padding so this is the idea that
3960
3961 991
3962 00:47:32,110 --> 00:47:36,610
3963 when you do any sort of pooling or
3964
3965 992
3966 00:47:34,210 --> 00:47:38,409
3967 convolutional operator you want the
3968
3969 993
3970 00:47:36,610 --> 00:47:40,510
3971 spatial extent of the input to that
3972
3973 994
3974 00:47:38,409 --> 00:47:42,009
3975 operator to be the same as the spatial
3976
3977 995
3978 00:47:40,510 --> 00:47:44,470
3979 extent of the output of that operator
3980
3981 996
3982 00:47:42,010 --> 00:47:46,270
3983 and the reason for doing this is because
3984
3985 997
3986 00:47:44,469 --> 00:47:47,679
3987 it preserves an integer scale
3988
3989 998
3990 00:47:46,269 --> 00:47:50,320
3991 relationship between different levels
3992
3993 999
3994 00:47:47,679 --> 00:47:51,699
3995 that are computed by the network and
3996
3997 1000
3998 00:47:50,320 --> 00:47:55,180
3999 you'll see in a little bit why that's
4000
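The "same padding" rule and the integer scale relationship it preserves can be sketched with the standard output-size formula (a minimal Python illustration assuming odd kernel sizes; the input size 224 is just an example):

```python
# "Same" padding: pad so that a stride-1 convolution preserves the
# spatial size, keeping an integer scale relationship between the
# feature-map levels computed by the network.

def conv_output_size(n, k, stride=1, pad=0):
    """Spatial size after convolving an n-wide input with a k-wide kernel."""
    return (n + 2 * pad - k) // stride + 1

def same_pad(k):
    """Padding that preserves spatial size for an odd kernel at stride 1."""
    return (k - 1) // 2

n = 224
print(conv_output_size(n, 3, pad=same_pad(3)))            # 224 (preserved)
print(conv_output_size(n, 3, stride=2, pad=same_pad(3)))  # 112 (exactly n / 2)
```

With this padding every stride-2 layer halves the map exactly, so any level's coordinates map onto another level's by an integer factor — which is what later RoI operations rely on.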
4001 1001
4002 00:47:51,699 --> 00:47:57,819
4003 useful the second thing is that it's
4004
4005 1002
4006 00:47:55,179 --> 00:48:00,190
4007 nice to prefer using as a backbone
4008
4009 1003
4010 00:47:57,820 --> 00:48:02,410
4011 architecture a fully convolutional net
4012
4013 1004
4014 00:48:00,190 --> 00:48:05,349
4015 work and the reasons again end up being
4016
4017 1005
4018 00:48:02,409 --> 00:48:07,500
4019 a little bit more subtle and it
4020
4021 1006
4022 00:48:05,349 --> 00:48:09,849
4023 typically has to do with the later
4024
4025 1007
4026 00:48:07,500 --> 00:48:12,159
4027 architectural modifications that will
4028
4029 1008
4030 00:48:09,849 --> 00:48:14,799
4031 make and using a fully convolutional
4032
4033 1009
4034 00:48:12,159 --> 00:48:16,779
4035 network often provides a greater degree
4036
4037 1010
4038 00:48:14,800 --> 00:48:18,130
4039 of flexibility in terms of what you
4040
4041 1011
4042 00:48:16,780 --> 00:48:20,830
4043 might want to do with that network later
4044
4045 1012
4046 00:48:18,130 --> 00:48:22,720
4047 and then of course the last point is
4048
4049 1013
4050 00:48:20,829 --> 00:48:24,940
4051 that pre-training
4052
4053 1014
4054 00:48:22,719 --> 00:48:27,489
4055 on image net or a similar type of data
4056
4057 1015
4058 00:48:24,940 --> 00:48:30,490
4059 set is an extremely powerful mechanism
4060
4061 1016
4062 00:48:27,489 --> 00:48:32,799
4063 for transferring knowledge in the
4064
4065 1017
4066 00:48:30,489 --> 00:48:35,049
4067 weights of the backbone network to
4068
4069 1018
4070 00:48:32,800 --> 00:48:38,070
4071 another task like object detection where
4072
4073 1019
4074 00:48:35,050 --> 00:48:40,420
4075 you typically have less data available
4076
4077 1020
4078 00:48:38,070 --> 00:48:42,280
4079 so the first thing that we're going to
4080
4081 1021
4082 00:48:40,420 --> 00:48:45,010
4083 do after we've selected a backbone
4084
4085 1022
4086 00:48:42,280 --> 00:48:46,870
4087 architecture is prepare it for detection
4088
4089 1023
4090 00:48:45,010 --> 00:48:50,710
4091 assuming that we've done pre training
4092
4093 1024
4094 00:48:46,869 --> 00:48:52,539
4095 and so this is going to involve a little
4096
4097 1025
4098 00:48:50,710 --> 00:48:55,539
4099 bit of minor surgery to that network so
4100
4101 1026
4102 00:48:52,539 --> 00:48:57,809
4103 the first thing that we'll do is we'll
4104
4105 1027
4106 00:48:55,539 --> 00:49:00,579
4107 take the batch normalization layers and
4108
4109 1028
4110 00:48:57,809 --> 00:49:03,219
4111 we'll take their test time parameters
4112
4113 1029
4114 00:49:00,579 --> 00:49:05,739
4115 which are these scale and bias factors
4116
4117 1030
4118 00:49:03,219 --> 00:49:07,480
4119 and we're basically just going to treat
4120
4121 1031
4122 00:49:05,739 --> 00:49:09,009
4123 those as constants so effectively
4124
4125 1032
4126 00:49:07,480 --> 00:49:11,400
4127 removing batch norm from the network
4128
4129 1033
4130 00:49:09,010 --> 00:49:13,870
4131 and the reason why this is done is
4132
4133 1034
4134 00:49:11,400 --> 00:49:15,369
4135 purely pragmatic and hopefully
4136
4137 1035
4138 00:49:13,869 --> 00:49:17,259
4139 we'll be able to remove it someday in
4140
4141 1036
4142 00:49:15,369 --> 00:49:19,239
4143 the future more easily and it's
4144
4145 1037
4146 00:49:17,260 --> 00:49:20,980
4147 basically just that for training most of
4148
4149 1038
4150 00:49:19,239 --> 00:49:24,429
4151 these object detection networks you
4152
4153 1039
4154 00:49:20,980 --> 00:49:26,170
4155 can't fit very many images on the GPU at
4156
4157 1040
4158 00:49:24,429 --> 00:49:27,879
4159 a time and therefore what ends up
4160
4161 1041
4162 00:49:26,170 --> 00:49:29,740
4163 happening is that the batch norm
4164
4165 1042
4166 00:49:27,880 --> 00:49:31,599
4167 statistics that would be computed end up
4168
4169 1043
4170 00:49:29,739 --> 00:49:34,419
4171 not being very good for training and
4172
4173 1044
4174 00:49:31,599 --> 00:49:36,039
4175 often lead to worse results so this is
4176
4177 1045
4178 00:49:34,420 --> 00:49:39,369
4179 just sort of a simple pragmatic hack
4180
4181 1046
4182 00:49:36,039 --> 00:49:41,710
4183 these days to avoid that issue now the
4184
4185 1047
4186 00:49:39,369 --> 00:49:42,880
4187 second thing is that you need because
4188
4189 1048
4190 00:49:41,710 --> 00:49:44,980
4191 we're going to repurpose the
4192
4193 1049
4194 00:49:42,880 --> 00:49:46,780
4195 classification Network for detection is
4196
4197 1050
4198 00:49:44,980 --> 00:49:48,969
4199 that you need to remove the
4200
4201 1051
4202 00:49:46,780 --> 00:49:51,880
4203 classification specific head from that
4204
4205 1052
4206 00:49:48,969 --> 00:49:54,309
4207 network so in the case of this ResNet
4208
4209 1053
4210 00:49:51,880 --> 00:49:57,039
4211 that's illustrated here that amounts to
4212
4213 1054
4214 00:49:54,309 --> 00:49:59,949
4215 removing one average pooling layer and
4216
4217 1055
4218 00:49:57,039 --> 00:50:01,840
4219 then the fully connected layer after
4220
4221 1056
4222 00:49:59,949 --> 00:50:04,419
4223 that which was used for the thousand way
4224
4225 1057
4226 00:50:01,840 --> 00:50:06,460
4227 classification on image net and one
4228
4229 1058
4230 00:50:04,420 --> 00:50:09,190
4231 thing to note at least in the case of
4232
4233 1059
4234 00:50:06,460 --> 00:50:10,929
4235 ResNet is that once you've done this you
4236
4237 1060
4238 00:50:09,190 --> 00:50:17,110
4239 now have a fully convolutional network
4240
4241 1061
4242 00:50:10,929 --> 00:50:18,489
4243 that can take an input of any size now
4244
4245 1062
4246 00:50:17,110 --> 00:50:21,309
4247 the second thing that we're going to
4248
4249 1063
4250 00:50:18,489 --> 00:50:24,459
4251 look at is how scale-invariant object
4252
4253 1064
4254 00:50:21,309 --> 00:50:26,049
4255 detection is realized so there are a
4256
4257 1065
4258 00:50:24,460 --> 00:50:28,179
4259 bunch of strategies for doing this
4260
4261 1066
4262 00:50:26,050 --> 00:50:30,370
4263 including sort of making the second
4264
4265 1067
4266 00:50:28,179 --> 00:50:34,169
4267 module a no-op so that's
4268
4269 1068
4270 00:50:30,369 --> 00:50:37,569
4271 illustrated you can see the cursor at
4272
4273 1069
4274 00:50:34,170 --> 00:50:39,700
4275 panel B here so this is a strategy
4276
4277 1070
4278 00:50:37,570 --> 00:50:41,950
4279 that's been used in Fast R-CNN for
4280
4281 1071
4282 00:50:39,699 --> 00:50:44,349
4283 example which is basically just to use a
4284
4285 1072
4286 00:50:41,949 --> 00:50:47,259
4287 single feature map from the backbone
4288
4289 1073
4290 00:50:44,349 --> 00:50:49,929
4291 network as the basis for doing object
4292
4293 1074
4294 00:50:47,260 --> 00:50:52,510
4295 detection and scale invariance just
4296
4297 1075
4298 00:50:49,929 --> 00:50:54,069
4299 ends up coming through in this case via
4300
4301 1076
4302 00:50:52,510 --> 00:50:58,000
4303 the region of interest transformation
4304
4305 1077
4306 00:50:54,070 --> 00:51:01,480
4307 operation now sort of compatible
4308
4309 1078
4310 00:50:58,000 --> 00:51:04,960
4311 with approach B is the very classic idea
4312
4313 1079
4314 00:51:01,480 --> 00:51:07,539
4315 illustrated in A of building an image
4316
4317 1080
4318 00:51:04,960 --> 00:51:09,789
4319 pyramid and then applying whatever
4320
4321 1081
4322 00:51:07,539 --> 00:51:12,159
4323 technique you have independently to each
4324
4325 1082
4326 00:51:09,789 --> 00:51:13,659
4327 level of that image pyramid now in
4328
4329 1083
4330 00:51:12,159 --> 00:51:15,879
4331 practice this ends up working quite well
4332
4333 1084
4334 00:51:13,659 --> 00:51:18,339
4335 and will usually give you a nice
4336
4337 1085
4338 00:51:15,880 --> 00:51:20,680
4339 improvement in object detection quality
4340
4341 1086
4342 00:51:18,340 --> 00:51:22,210
4343 however it ends up being quite slow
4344
4345 1087
4346 00:51:20,679 --> 00:51:25,179
4347 because you now have to apply your
4348
4349 1088
4350 00:51:22,210 --> 00:51:26,260
4351 entire system to every level of an image
4352
4353 1089
4354 00:51:25,179 --> 00:51:31,278
4355 pyramid
4356
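The image-pyramid strategy just described can be sketched in a few lines; `detect_one_scale` below is a hypothetical stand-in for a detector tuned to a single object size, not anything from the talk:

```python
def detect_one_scale(size):
    # Hypothetical detector tuned to one object size: it "finds" a single
    # 32x32 box at the origin whenever the image is big enough.
    h, w = size
    return [(0, 0, 32, 32)] if min(h, w) >= 32 else []

def detect_with_pyramid(height, width, scales=(1.0, 0.5, 0.25)):
    detections = []
    for s in scales:
        level = (int(height * s), int(width * s))   # downsampled image size
        for x0, y0, x1, y1 in detect_one_scale(level):
            # Map the box back into original-image coordinates.
            detections.append((x0 / s, y0 / s, x1 / s, y1 / s))
    return detections

boxes = detect_with_pyramid(256, 256)
print(boxes)   # one 32x32 box per level -> 32, 64 and 128 px at full resolution
```

This is also why the approach is slow: the whole detector runs once per pyramid level.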
4357 1090
4358 00:51:26,260 --> 00:51:33,230
4359 so there's another approach which I want
4360
4361 1091
4362 00:51:31,278 --> 00:51:36,528
4363 to dig into a little bit and that's the
4364
4365 1092
4366 00:51:33,230 --> 00:51:38,960
4367 idea of using the fact that deep conv
4368
4369 1093
4370 00:51:36,528 --> 00:51:41,500
4371 nets already inherently are building a
4372
4373 1094
4374 00:51:38,960 --> 00:51:44,119
4375 multiscale representation inside of them
4376
4377 1095
4378 00:51:41,500 --> 00:51:46,639
4379 so let's look at this in a little bit
4380
4381 1096
4382 00:51:44,119 --> 00:51:49,099
4383 more detail so in this illustration
4384
4385 1097
4386 00:51:46,639 --> 00:51:51,858
4387 there's an image here and you have say
4388
4389 1098
4390 00:51:49,099 --> 00:51:56,329
4391 conv3, conv4, conv5 feature maps
4392
4393 1099
4394 00:51:51,858 --> 00:52:00,828
4395 computed by the network and you could in
4396
4397 1100
4398 00:51:56,329 --> 00:52:04,130
4399 principle take a detection model
4400
4401 1101
4402 00:52:00,829 --> 00:52:06,410
4403 and make predictions based on each one
4404
4405 1102
4406 00:52:04,130 --> 00:52:07,849
4407 of those levels of the network but
4408
4409 1103
4410 00:52:06,409 --> 00:52:10,759
4411 there are some issues that come up with
4412
4413 1104
4414 00:52:07,849 --> 00:52:13,400
4415 this so I guess the
4416
4417 1105
4418 00:52:10,760 --> 00:52:15,140
4419 first thing to note is that the reason
4420
4421 1106
4422 00:52:13,400 --> 00:52:18,710
4423 why you might want to do this and it
4424
4425 1107
4426 00:52:15,139 --> 00:52:22,129
4427 seems like a good idea is because you
4428
4429 1108
4430 00:52:18,710 --> 00:52:25,039
4431 allow your detector access to a range of
4432
4433 1109
4434 00:52:22,130 --> 00:52:27,170
4435 different scales so for example you
4436
4437 1110
4438 00:52:25,039 --> 00:52:28,549
4439 could detect small objects on conv3
4440
4441 1111
4442 00:52:27,170 --> 00:52:30,559
4443 because it has much higher spatial
4444
4445 1112
4446 00:52:28,548 --> 00:52:33,679
4447 resolution that should in principle
4448
4449 1113
4450 00:52:30,559 --> 00:52:35,930
4451 allow you to extract much better
4452
4453 1114
4454 00:52:33,679 --> 00:52:38,118
4455 features for detection than you could if
4456
4457 1115
4458 00:52:35,929 --> 00:52:40,219
4459 you try to detect tiny objects on conv
4460
4461 1116
4462 00:52:38,119 --> 00:52:43,910
4463 5 which has been subsampled
4464
4465 1117
4466 00:52:40,219 --> 00:52:46,038
4467 significantly however there's sort of a
4468
4469 1118
4470 00:52:43,909 --> 00:52:48,078
4471 catch in this which is that if you were
4472
4473 1119
4474 00:52:46,039 --> 00:52:49,880
4475 to try to do that directly you'd be
4476
4477 1120
4478 00:52:48,079 --> 00:52:51,920
4479 compromising the quality of the features
4480
4481 1121
4482 00:52:49,880 --> 00:52:54,170
4483 because we know that the feature is
4484
4485 1122
4486 00:52:51,920 --> 00:52:56,358
4487 computed up here in conv5 are going to
4488
4489 1123
4490 00:52:54,170 --> 00:52:58,548
4491 be really good for classification but
4492
4493 1124
4494 00:52:56,358 --> 00:53:00,828
4495 the features down here at conv3 are not
4496
4497 1125
4498 00:52:58,548 --> 00:53:02,210
4499 going to be so good and in the extreme
4500
4501 1126
4502 00:53:00,829 --> 00:53:04,430
4503 if you went down far enough you'd
4504
4505 1127
4506 00:53:02,210 --> 00:53:09,970
4507 effectively be using something that was
4508
4509 1128
4510 00:53:04,429 --> 00:53:14,000
4511 sort of equivalent to say SIFT or HOG so
4512
4513 1129
4514 00:53:09,969 --> 00:53:16,639
4515 what we propose to do is to make a minor
4516
4517 1130
4518 00:53:14,000 --> 00:53:17,869
4519 modification of that approach and build
4520
4521 1131
4522 00:53:16,639 --> 00:53:20,088
4523 something called a feature pyramid
4524
4525 1132
4526 00:53:17,869 --> 00:53:20,930
4527 Network and this is a paper that's at
4528
4529 1133
4530 00:53:20,088 --> 00:53:24,528
4531 CVPR
4532
4533 1134
4534 00:53:20,929 --> 00:53:28,940
4535 and will be presented on Saturday so the
4536
4537 1135
4538 00:53:24,528 --> 00:53:31,608
4539 idea is to try to get the best
4540
4541 1136
4542 00:53:28,940 --> 00:53:33,349
4543 of both worlds so we want to be able to
4544
4545 1137
4546 00:53:31,608 --> 00:53:35,989
4547 use the inherent multi scale
4548
4549 1138
4550 00:53:33,349 --> 00:53:38,359
4551 representation in the network but we
4552
4553 1139
4554 00:53:35,989 --> 00:53:39,559
4555 want to be able to use strong features
4556
4557 1140
4558 00:53:38,358 --> 00:53:41,179
4559 everywhere
4560
4561 1141
4562 00:53:39,559 --> 00:53:42,500
4563 and I guess I didn't mention it
4564
4565 1142
4566 00:53:41,179 --> 00:53:45,529
4567 explicitly though you probably saw in
4568
4569 1143
4570 00:53:42,500 --> 00:53:47,809
4571 the slide we also want it to be fast by
4572
4573 1144
4574 00:53:45,530 --> 00:53:50,720
4575 requiring only a marginal
4576
4577 1145
4578 00:53:47,809 --> 00:53:54,380
4579 increase in the computation
4580
4581 1146
4582 00:53:50,719 --> 00:53:57,709
4583 required to build this pyramid so the
4584
4585 1147
4586 00:53:54,380 --> 00:53:59,570
4587 basic idea here is that as before you
4588
4589 1148
4590 00:53:57,710 --> 00:54:01,159
4591 have the standard forward pass to the
4592
4593 1149
4594 00:53:59,570 --> 00:54:03,170
4595 network that builds up the multiple
4596
4597 1150
4598 00:54:01,159 --> 00:54:05,329
4599 levels of representation at different
4600
4601 1151
4602 00:54:03,170 --> 00:54:08,869
4603 scales but now we're going to add to
4604
4605 1152
4606 00:54:05,329 --> 00:54:10,099
4607 that forward pass some new connections so
4608
4609 1153
4610 00:54:08,869 --> 00:54:11,989
4611 there are going to be these lateral
4612
4613 1154
4614 00:54:10,099 --> 00:54:13,809
4615 connections and there are also going to be
4616
4617 1155
4618 00:54:11,989 --> 00:54:15,859
4619 these top-down connections and
4620
4621 1156
4622 00:54:13,809 --> 00:54:19,000
4623 effectively what these are going to do
4624
4625 1157
4626 00:54:15,860 --> 00:54:22,490
4627 is they're going to take the top-down
4628
4629 1158
4630 00:54:19,000 --> 00:54:25,579
4631 strong features and propagate them to
4632
4633 1159
4634 00:54:22,489 --> 00:54:28,309
4635 the high resolution feature Maps below
4636
4637 1160
4638 00:54:25,579 --> 00:54:30,799
4639 and that's going to create this auxiliary
4640
4641 1161
4642 00:54:28,309 --> 00:54:33,079
4643 or secondary pyramid over here which is
4644
4645 1162
4646 00:54:30,800 --> 00:54:35,180
4647 going to have a variety of different
4648
4649 1163
4650 00:54:33,079 --> 00:54:37,579
4651 spatial resolutions and the features
4652
4653 1164
4654 00:54:35,179 --> 00:54:42,500
4655 will ideally be strong across all those
4656
4657 1165
4658 00:54:37,579 --> 00:54:44,779
4659 levels so just to illustrate that again
4660
4661 1166
4662 00:54:42,500 --> 00:54:47,570
4663 the idea is that we're going to be able
4664
4665 1167
4666 00:54:44,780 --> 00:54:51,500
4667 to have strong features everywhere in
4668
4669 1168
4670 00:54:47,570 --> 00:54:53,180
4671 this pyramid and I also want to note
4672
4673 1169
4674 00:54:51,500 --> 00:54:55,159
4675 that this is an idea that seems to
4676
4677 1170
4678 00:54:53,179 --> 00:54:58,190
4679 be very popular right now because it
4680
4681 1171
4682 00:54:55,159 --> 00:55:00,319
4683 seems to have effectively been invented
4684
4685 1172
4686 00:54:58,190 --> 00:55:03,829
4687 sort of simultaneously by I think there
4688
4689 1173
4690 00:55:00,320 --> 00:55:07,309
4691 are four or five different groups okay
4692
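A minimal sketch of the FPN top-down pathway just described, under simplifying assumptions that are mine, not the paper's exact recipe: the 1x1 lateral convolutions are omitted and all levels share one channel count.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_pyramid(bottom_up):
    # bottom_up: [C3, C4, C5] ordered fine -> coarse, shapes (C, H, W),
    # each level at half the resolution of the previous one.
    p = bottom_up[-1]                     # P5 starts as C5
    pyramid = [p]
    for c in reversed(bottom_up[:-1]):    # C4, then C3
        p = upsample2x(p) + c             # top-down signal + lateral connection
        pyramid.append(p)
    return pyramid[::-1]                  # [P3, P4, P5], fine -> coarse

c3 = np.ones((256, 32, 32))
c4 = np.ones((256, 16, 16))
c5 = np.ones((256, 8, 8))
p3, p4, p5 = top_down_pyramid([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)   # P3 keeps the finest resolution
```

The point is that every output level mixes the strong top-level features into the higher-resolution maps, at the cost of only a few extra additions and upsamples.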
4693 1174
4694 00:55:03,829 --> 00:55:09,610
4695 so that's the second module in the
4696
4697 1175
4698 00:55:07,309 --> 00:55:12,739
4699 approach so now the third module is
4700
4701 1176
4702 00:55:09,610 --> 00:55:14,840
4703 going to be the mechanism that provides
4704
4705 1177
4706 00:55:12,739 --> 00:55:17,599
4707 the region proposals for doing object
4708
4709 1178
4710 00:55:14,840 --> 00:55:20,750
4711 detection now before describing this I
4712
4713 1179
4714 00:55:17,599 --> 00:55:22,250
4715 just want to draw your attention to what
4716
4717 1180
4718 00:55:20,750 --> 00:55:24,860
4719 I'm trying to provide is the bird's eye
4720
4721 1181
4722 00:55:22,250 --> 00:55:26,239
4723 view of what we're building up as I go
4724
4725 1182
4726 00:55:24,860 --> 00:55:28,250
4727 through each one of these steps so you
4728
4729 1183
4730 00:55:26,239 --> 00:55:30,019
4731 don't lose track of what's going on so
4732
4733 1184
4734 00:55:28,250 --> 00:55:32,389
4735 over here on the left side of the slide
4736
4737 1185
4738 00:55:30,019 --> 00:55:36,920
4739 you have what's now this little tiny
4740
4741 1186
4742 00:55:32,389 --> 00:55:39,589
4743 image of a ResNet and you have coming
4744
4745 1187
4746 00:55:36,920 --> 00:55:41,150
4747 off of that the feature pyramid network
4748
4749 1188
4750 00:55:39,590 --> 00:55:42,920
4751 which I just described which built that
4752
4753 1189
4754 00:55:41,150 --> 00:55:45,170
4755 feature pyramid and now what I'm going
4756
4757 1190
4758 00:55:42,920 --> 00:55:46,940
4759 to describe is the region proposal
4760
4761 1191
4762 00:55:45,170 --> 00:55:51,170
4763 mechanism which is going to be applied
4764
4765 1192
4766 00:55:46,940 --> 00:55:52,679
4767 to each one of the levels computed by
4768
4769 1193
4770 00:55:51,170 --> 00:55:55,619
4771 the feature pyramid Network
4772
4773 1194
4774 00:55:52,679 --> 00:55:59,480
4775 now the idea of the region proposal
4776
4777 1195
4778 00:55:55,619 --> 00:56:02,548
4779 network is that it's going to provide
4780
4781 1196
4782 00:55:59,480 --> 00:56:06,088
4783 these object proposals for detecting
4784
4785 1197
4786 00:56:02,548 --> 00:56:08,219
4787 objects using a sliding window mechanism
4788
4789 1198
4790 00:56:06,088 --> 00:56:10,828
4791 and what it's going to do is that at
4792
4793 1199
4794 00:56:08,219 --> 00:56:17,629
4795 each sliding window position it's going
4796
4797 1200
4798 00:56:10,829 --> 00:56:21,000
4799 to try to predict whether each one of K
4800
4801 1201
4802 00:56:17,630 --> 00:56:25,048
4803 prototypical boxes centered at that
4804
4805 1202
4806 00:56:21,000 --> 00:56:27,568
4807 position corresponds to an object so we
4808
4809 1203
4810 00:56:25,048 --> 00:56:29,548
4811 call these anchor boxes so they come in
4812
4813 1204
4814 00:56:27,568 --> 00:56:32,699
4815 different aspect ratios and different
4816
4817 1205
4818 00:56:29,548 --> 00:56:34,288
4819 scales and the idea is that hopefully
4820
4821 1206
4822 00:56:32,699 --> 00:56:36,899
4823 one of those aspect ratios and scales
4824
4825 1207
4826 00:56:34,289 --> 00:56:38,520
4827 will be kind of close to the aspect
4828
4829 1208
4830 00:56:36,900 --> 00:56:40,380
4831 ratio and scale of an object centered at
4832
4833 1209
4834 00:56:38,519 --> 00:56:42,409
4835 that location and then the region
4836
4837 1210
4838 00:56:40,380 --> 00:56:45,420
4839 proposal network basically needs to say
4840
4841 1211
4842 00:56:42,409 --> 00:56:48,449
4843 yes this anchor box is good and then
4844
4845 1212
4846 00:56:45,420 --> 00:56:50,670
4847 additionally it's going to suggest how
4848
4849 1213
4850 00:56:48,449 --> 00:56:53,338
4851 you transform that anchor box via a
4852
4853 1214
4854 00:56:50,670 --> 00:56:55,920
4855 small regression so that it better
4856
4857 1215
4858 00:56:53,338 --> 00:56:59,608
4859 localizes the object that's near it so
4860
4861 1216
4862 00:56:55,920 --> 00:57:00,900
4863 in practice at each location there most
4864
4865 1217
4866 00:56:59,608 --> 00:57:03,449
4867 likely won't be an object but if there
4868
4869 1218
4870 00:57:00,900 --> 00:57:05,338
4871 is an object maybe only one of the
4872
4873 1219
4874 00:57:03,449 --> 00:57:07,588
4875 anchor boxes at that location will be
4876
4877 1220
4878 00:57:05,338 --> 00:57:09,929
4879 a good match to it and the job of the RPN
4880
4881 1221
4882 00:57:07,588 --> 00:57:12,298
4883 is to identify which one of those anchor
4884
4885 1222
4886 00:57:09,929 --> 00:57:15,239
4887 boxes is a good match give it a high
4888
4889 1223
4890 00:57:12,298 --> 00:57:18,960
4891 objectness score and then transform it so
4892
4893 1224
4894 00:57:15,239 --> 00:57:22,259
4895 that it matches the object there looks
4896
4897 1225
4898 00:57:18,960 --> 00:57:24,358
4899 like there's a question so about this
4900
4901 1226
4902 00:57:22,260 --> 00:57:26,400
4903 anchor box are these actually
4904
4905 1227
4906 00:57:24,358 --> 00:57:28,338
4907 convolutional filters I was trying to
4908
4909 1228
4910 00:57:26,400 --> 00:57:30,750
4911 read the paper but it's not very clear
4912
4913 1229
4914 00:57:28,338 --> 00:57:33,869
4915 let's say you have different scales and
4916
4917 1230
4918 00:57:30,750 --> 00:57:37,190
4919 aspect ratios of these boxes do you run
4920
4921 1231
4922 00:57:33,869 --> 00:57:39,778
4923 them as convolutions on top of the image
4924
4925 1232
4926 00:57:37,190 --> 00:57:41,369
4927 because in the previous paper it says
4928
4929 1233
4930 00:57:39,778 --> 00:57:43,139
4931 you have three by three convolution and
4932
4933 1234
4934 00:57:41,369 --> 00:57:45,480
4935 now you have these anchors so what are
4936
4937 1235
4938 00:57:43,139 --> 00:57:47,639
4939 they in their actual realization yeah so
4940
4941 1236
4942 00:57:45,480 --> 00:57:51,240
4943 the realization so the anchor boxes are
4944
4945 1237
4946 00:57:47,639 --> 00:57:53,278
4947 not filters they're just
4948
4949 1238
4950 00:57:51,239 --> 00:57:55,288
4951 these prototype boxes that act as
4952
4953 1239
4954 00:57:53,278 --> 00:57:57,559
4955 references and then there's a three by
4956
4957 1240
4958 00:57:55,289 --> 00:58:00,299
4959 three filter for each one of those and
4960
4961 1241
4962 00:57:57,559 --> 00:58:03,210
4963 that three by three filter it's not
4964
4965 1242
4966 00:58:00,298 --> 00:58:05,179
4967 exactly one it's essentially one for
4968
4969 1243
4970 00:58:03,210 --> 00:58:06,500
4971 each property it needs to predict
4972
4973 1244
4974 00:58:05,179 --> 00:58:08,929
4975 so there'll be one that predicts whether
4976
4977 1245
4978 00:58:06,500 --> 00:58:10,730
4979 it's an object or not an object and then
4980
4981 1246
4982 00:58:08,929 --> 00:58:12,919
4983 four that predicts geometric
4984
4985 1247
4986 00:58:10,730 --> 00:58:16,309
4987 transformations of the anchor
4988
4989 1248
4990 00:58:12,920 --> 00:58:18,548
4991 box so it's always
4992
4993 1249
4994 00:58:16,309 --> 00:58:22,069
4995 three by three filters predicting
4996
4997 1250
4998 00:58:18,548 --> 00:58:23,630
4999 properties of the anchor boxes I hope
5000
5001 1251
5002 00:58:22,068 --> 00:58:28,730
5003 that's clear okay
5004
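The anchor-box scheme from the question-and-answer above can be sketched as follows; the scales and aspect ratios are illustrative values, and the 3x3 prediction filters themselves (one objectness score plus four regression deltas per anchor) are not shown:

```python
import numpy as np

def anchors_at(cx, cy, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """K = len(scales) * len(ratios) reference boxes centered at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:
            # Keep the area at s*s while varying the aspect ratio h/w = r.
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

a = anchors_at(100, 100)
print(a.shape)              # (9, 4): K = 9 anchors per sliding-window position
areas = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
print(np.round(areas[:3]))  # the three ratios at scale 64 all share area 64*64
```

At each position the RPN then scores these K references and regresses the one(s) that match a nearby object.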
5005 1252
5006 00:58:23,630 --> 00:58:30,680
5007 all right thanks okay so that's going
5008
5009 1253
5010 00:58:28,730 --> 00:58:36,170
5011 to be the mechanism in the network which
5012
5013 1254
5014 00:58:30,679 --> 00:58:38,358
5015 provides the object proposals now the
5016
5017 1255
5018 00:58:36,170 --> 00:58:41,180
5019 fourth step and again your little road
5020
5021 1256
5022 00:58:38,358 --> 00:58:44,389
5023 map is over here in the corner with
5024
5025 1257
5026 00:58:41,179 --> 00:58:46,730
5027 increasingly shrinking size so the
5028
5029 1258
5030 00:58:44,389 --> 00:58:51,170
5031 fourth part is this region of interest
5032
5033 1259
5034 00:58:46,730 --> 00:58:54,130
5035 transformation step and so in Mask R-CNN
5036
5037 1260
5038 00:58:51,170 --> 00:58:56,990
5039 we use a new mechanism to provide this
5040
5041 1261
5042 00:58:54,130 --> 00:58:59,900
5043 transformation called ROI Align and
5044
5045 1262
5046 00:58:56,989 --> 00:59:03,288
5047 the idea of ROI Align is that it's going
5048
5049 1263
5050 00:58:59,900 --> 00:59:05,869
5051 to smoothly transform the features from
5052
5053 1264
5054 00:59:03,289 --> 00:59:09,520
5055 whatever arbitrary aspect ratio the
5056
5057 1265
5058 00:59:05,869 --> 00:59:12,769
5059 region of interest has into a fixed size
5060
5061 1266
5062 00:59:09,519 --> 00:59:16,670
5063 feature vector without doing any
5064
5065 1267
5066 00:59:12,769 --> 00:59:19,190
5067 quantization so the way that it's going
5068
5069 1268
5070 00:59:16,670 --> 00:59:21,019
5071 to do that is using a very simple
5072
5073 1269
5074 00:59:19,190 --> 00:59:24,440
5075 mechanism just bilinear interpolation
5076
5077 1270
5078 00:59:21,019 --> 00:59:28,099
5079 so if this is the region proposal here
5080
5081 1271
5082 00:59:24,440 --> 00:59:31,159
5083 then what we want to get out of this in
5084
5085 1272
5086 00:59:28,099 --> 00:59:34,970
5087 this illustrated example is a 2x2
5088
5089 1273
5090 00:59:31,159 --> 00:59:37,730
5091 grid of features and within each one of
5092
5093 1274
5094 00:59:34,969 --> 00:59:39,679
5095 the region of interest bins we're going
5096
5097 1275
5098 00:59:37,730 --> 00:59:42,440
5099 to lay down a grid of sampling points
5100
5101 1276
5102 00:59:39,679 --> 00:59:43,848
5103 here illustrated as 2x2 and those
5104
5105 1277
5106 00:59:42,440 --> 00:59:47,028
5107 sampling points are going to be used for
5108
5109 1278
5110 00:59:43,849 --> 00:59:48,680
5111 bilinear interpolation and so one case
5112
5113 1279
5114 00:59:47,028 --> 00:59:50,510
5115 of that is illustrated here where you
5116
5117 1280
5118 00:59:48,679 --> 00:59:52,098
5119 have this point interpolating the
5120
5121 1281
5122 00:59:50,510 --> 00:59:55,339
5123 features at its four nearest neighbors
5124
5125 1282
5126 00:59:52,099 --> 00:59:56,838
5127 and so via this interpolation process
5128
5129 1283
5130 00:59:55,338 --> 01:00:00,230
5131 we're going to get out a fixed size
5132
5133 1284
5134 00:59:56,838 --> 01:00:01,940
5135 feature vector now this probably sounds
5136
5137 1285
5138 01:00:00,230 --> 01:00:05,150
5139 like the obvious thing that you want to
5140
5141 1286
5142 01:00:01,940 --> 01:00:07,369
5143 do but it's actually a little bit
5144
5145 1287
5146 01:00:05,150 --> 01:00:09,829
5147 different in kind of very minor details
5148
5149 1288
5150 01:00:07,369 --> 01:00:13,490
5151 from what has been done in the past and
5152
5153 1289
5154 01:00:09,829 --> 01:00:15,559
5155 so for example in Fast R-CNN there's
5156
5157 1290
5158 01:00:13,489 --> 01:00:19,009
5159 this region of interest pool operation
5160
5161 1291
5162 01:00:15,559 --> 01:00:22,790
5163 which is very similar except for the fact
5164
5165 1292
5166 01:00:19,010 --> 01:00:24,830
5167 that it performs quantization and max
5168
5169 1293
5170 01:00:22,789 --> 01:00:27,559
5171 pooling but the max pooling is not
5172
5173 1294
5174 01:00:24,829 --> 01:00:30,769
5175 really the issue at hand here the real
5176
5177 1295
5178 01:00:27,559 --> 01:00:32,719
5179 issue is quantization so as illustrated
5180
5181 1296
5182 01:00:30,769 --> 01:00:35,389
5183 here if you start from this original
5184
5185 1297
5186 01:00:32,719 --> 01:00:37,939
5187 region of interest and then you perform
5188
5189 1298
5190 01:00:35,389 --> 01:00:40,909
5191 some quantization that snaps its
5192
5193 1299
5194 01:00:37,940 --> 01:00:44,119
5195 coordinates to the coordinate of the
5196
5197 1300
5198 01:00:40,909 --> 01:00:46,670
5199 underlying feature grid what happens is
5200
5201 1301
5202 01:00:44,119 --> 01:00:48,859
5203 that you're going to break the pixel to
5204
5205 1302
5206 01:00:46,670 --> 01:00:51,079
5207 pixel alignment between the input and
5208
5209 1303
5210 01:00:48,860 --> 01:00:53,720
5211 the output and it turns out that that
5212
5213 1304
5214 01:00:51,079 --> 01:00:56,480
5215 doesn't matter that much when it comes
5216
5217 1305
5218 01:00:53,719 --> 01:00:58,250
5219 to predicting bounding boxes but when
5220
5221 1306
5222 01:00:56,480 --> 01:01:00,769
5223 you move to tasks that require much
5224
5225 1307
5226 01:00:58,250 --> 01:01:03,199
5227 finer spatial localization such as
5228
5229 1308
5230 01:01:00,769 --> 01:01:04,820
5231 predicting object masks or predicting
5232
5233 1309
5234 01:01:03,199 --> 01:01:07,639
5235 key points like human pose estimation
5236
5237 1310
5238 01:01:04,820 --> 01:01:09,769
5239 then this type of detail actually starts
5240
5241 1311
5242 01:01:07,639 --> 01:01:11,599
5243 to matter quite a bit and we have a
5244
5245 1312
5246 01:01:09,769 --> 01:01:14,449
5247 detailed ablation breakdown in the paper
5248
5249 1313
5250 01:01:11,599 --> 01:01:16,759
5251 which shows that this tiny detail of
5252
5253 1314
5254 01:01:14,449 --> 01:01:18,409
5255 whether you quantize the coordinates or
5256
5257 1315
5258 01:01:16,760 --> 01:01:22,180
5259 not actually makes a very significant
5260
5261 1316
5262 01:01:18,409 --> 01:01:26,119
5263 difference in the final results okay so
5264
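The contrast between quantized RoI pooling and RoIAlign-style bilinear sampling can be sketched on a toy feature map; the sampling-point layout is simplified here to a single point, which is my simplification rather than the actual operator:

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a (H, W) feature map at a real-valued (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

feat = np.arange(36, dtype=float).reshape(6, 6)

# A sampling point that falls between feature-grid cells.
y, x = 2.3, 2.7

aligned = bilinear(feat, y, x)                 # RoIAlign: no quantization
snapped = feat[int(round(y)), int(round(x))]   # RoIPool-style coordinate snap

print(aligned, snapped)   # ~16.5 vs 15.0 -- snapping shifted the sample
```

The snapped value comes from a shifted grid cell, which is exactly the broken pixel-to-pixel alignment the talk describes; with interpolation the sample respects the true sub-pixel position.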
5265 1317
5266 01:01:22,179 --> 01:01:28,879
5267 now we're at the fifth and final modular
5268
5269 1318
5270 01:01:26,119 --> 01:01:31,909
5271 components of the system and this is the
5272
5273 1319
5274 01:01:28,880 --> 01:01:33,920
5275 part that's going to make predictions
5276
5277 1320
5278 01:01:31,909 --> 01:01:36,319
5279 for each region of interest that the
5280
5281 1321
5282 01:01:33,920 --> 01:01:41,150
5283 system has proposed and then transformed
5284
5285 1322
5286 01:01:36,320 --> 01:01:44,360
5287 via ROI Align and the idea here is
5288
5289 1323
5290 01:01:41,150 --> 01:01:47,030
5291 that we refer to this as the head
5292
5293 1324
5294 01:01:44,360 --> 01:01:48,829
5295 of the network and there are going to be a
5296
5297 1325
5298 01:01:47,030 --> 01:01:52,760
5299 variety of heads that perform different
5300
5301 1326
5302 01:01:48,829 --> 01:01:55,190
5303 tasks so the first two are standard from
5304
5305 1327
5306 01:01:52,760 --> 01:01:58,160
5307 fast/faster R-CNN so that's doing
5308
5309 1328
5310 01:01:55,190 --> 01:02:00,710
5311 bounding box detection essentially
5312
5313 1329
5314 01:01:58,159 --> 01:02:02,179
5315 classifying whether this box is one
5316
5317 1330
5318 01:02:00,710 --> 01:02:07,039
5319 of the categories or background and
5320
5321 1331
5322 01:02:02,179 --> 01:02:08,480
5323 then the second actually that probably
5324
5325 1332
5326 01:02:07,039 --> 01:02:11,900
5327 should have said bounding
5328
5329 1333
5330 01:02:08,480 --> 01:02:16,610
5331 box regression so the first one is doing
5332
5333 1334
5334 01:02:11,900 --> 01:02:19,610
5335 a geometric shift and scale change of the
5336
5337 1335
5338 01:02:16,610 --> 01:02:22,460
5339 box in order to try to more finely
5340
5341 1336
5342 01:02:19,610 --> 01:02:25,370
5343 localize the object the second one is
5344
5345 1337
5346 01:02:22,460 --> 01:02:27,889
5347 doing object classification saying
5348
5349 1338
5350 01:02:25,369 --> 01:02:29,210
5351 whether the proposal is one of the
5352
5353 1339
5354 01:02:27,889 --> 01:02:31,788
5355 foreground categories that we're trying
5356
5357 1340
5358 01:02:29,210 --> 01:02:35,048
5359 to detect or part of the background
5360
5361 1341
5362 01:02:31,789 --> 01:02:37,639
5363 and then in addition to those two
5364
5365 1342
5366 01:02:35,048 --> 01:02:41,978
5367 standard components we're going to add
5368
5369 1343
5370 01:02:37,639 --> 01:02:44,358
5371 in two new components so the first is
5372
5373 1344
5374 01:02:41,978 --> 01:02:46,638
5375 what Mask R-CNN is sort of all about
5376
5377 1345
5378 01:02:44,358 --> 01:02:49,458
5379 which is predicting instance level masks
5380
5381 1346
5382 01:02:46,639 --> 01:02:51,829
5383 for each object and then the second
5384
5385 1347
5386 01:02:49,458 --> 01:02:53,838
5387 which is sort of optional and wasn't
5388
5389 1348
5390 01:02:51,829 --> 01:02:56,239
5391 part of the original design
5392
5393 1349
5394 01:02:53,838 --> 01:02:57,949
5395 is that we discovered that there's
5396
5397 1350
5398 01:02:56,239 --> 01:03:00,528
5399 actually a very simple way in which the
5400
5401 1351
5402 01:02:57,949 --> 01:03:04,009
5403 same system can be used to predict human
5404
5405 1352
5406 01:03:00,528 --> 01:03:07,039
5407 pose pretty reliably so what's
5408
5409 1353
5410 01:03:04,009 --> 01:03:09,409
5411 illustrated here on the right is the
5412
5413 1354
5414 01:03:07,039 --> 01:03:12,649
5415 standard classification and bounding box
5416
5417 1355
5418 01:03:09,409 --> 01:03:16,039
5419 regression head that's used for Bastion
5420
5421 1356
5422 01:03:12,648 --> 01:03:18,048
5423 caster our CNN and then you can just
5424
5425 1357
5426 01:03:16,039 --> 01:03:19,939
5427 think that in parallel to that we're
5428
5429 1358
5430 01:03:18,048 --> 01:03:23,449
5431 adding in a new head that's going to
5432
5433 1359
5434 01:03:19,938 --> 01:03:25,998
5435 apply several convolution layers and
5436
5437 1360
5438 01:03:23,449 --> 01:03:28,249
5439 then the transpose layer to increase
5440
5441 1361
5442 01:03:25,998 --> 01:03:33,408
5443 spatial resolution in order to predict
5444
5445 1362
5446 01:03:28,248 --> 01:03:34,908
5447 instance segmentations okay so I would
5448
5449 1363
5450 01:03:33,409 --> 01:03:36,559
5451 like to talk about training but
5452
5453 1364
5454 01:03:34,909 --> 01:03:38,059
5455 unfortunately there just isn't enough
5456
5457 1365
5458 01:03:36,559 --> 01:03:41,719
5459 time right now to go into any of the
5460
5461 1366
5462 01:03:38,059 --> 01:03:43,880
5463 details so what I will say just in brief
5464
5465 1367
5466 01:03:41,719 --> 01:03:46,338
5467 summary is that the training procedure
5468
5469 1368
5470 01:03:43,880 --> 01:03:49,429
5471 is almost identical to fast and faster
5472
5473 1369
5474 01:03:46,338 --> 01:03:51,018
5475 our CNN the main difference is that
5476
5477 1370
5478 01:03:49,429 --> 01:03:53,418
5479 there are now these targets for
5480
5481 1371
5482 01:03:51,018 --> 01:03:55,248
5483 predicting masks and so I do want to
5484
5485 1372
5486 01:03:53,418 --> 01:03:56,568
5487 spend a couple of slides to show you
5488
5489 1373
5490 01:03:55,248 --> 01:03:59,868
5491 what those look like to give you a
5492
5493 1374
5494 01:03:56,568 --> 01:04:03,139
5495 little bit of intuition about that so
5496
5497 1375
5498 01:03:59,869 --> 01:04:06,079
5499 here's one image and in this image I've
5500
5501 1376
5502 01:04:03,139 --> 01:04:07,399
5503 highlighted four different regions of
5504
5505 1377
5506 01:04:06,079 --> 01:04:09,469
5507 interest that are going to be used
5508
5509 1378
5510 01:04:07,398 --> 01:04:11,808
5511 during training and then for each one of
5512
5513 1379
5514 01:04:09,469 --> 01:04:14,119
5515 those I'm showing The Associated mask
5516
5517 1380
5518 01:04:11,809 --> 01:04:16,880
5519 target that the network is being trained
5520
5521 1381
5522 01:04:14,119 --> 01:04:21,199
5523 to predict and these are represented as
5524
5525 1382
5526 01:04:16,880 --> 01:04:23,239
5527 binary 28 by 28 masks so you can see
5528
5529 1383
5530 01:04:21,199 --> 01:04:25,938
5531 that when an object is very well
5532
5533 1384
5534 01:04:23,239 --> 01:04:31,369
5535 localized by the region proposal the
5536
5537 1385
5538 01:04:25,938 --> 01:04:34,219
5539 target fills up the entire 28 by 28 grid
5540
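A rough sketch of how such a 28x28 mask target could be formed from a ground-truth mask and a region of interest; the nearest-neighbor resampling here is my simplification of the actual target construction:

```python
import numpy as np

def mask_target(gt_mask, roi, size=28):
    """gt_mask: (H, W) binary array; roi: (x0, y0, x1, y1) in pixels."""
    x0, y0, x1, y1 = roi
    # Nearest-neighbor resample of the RoI crop to a fixed size x size grid.
    ys = np.linspace(y0, y1, size, endpoint=False).astype(int)
    xs = np.linspace(x0, x1, size, endpoint=False).astype(int)
    return gt_mask[np.ix_(ys, xs)]

gt = np.zeros((140, 140), dtype=int)
gt[28:84, 28:84] = 1                          # a 56x56 square object

tight = mask_target(gt, (28, 28, 84, 84))     # RoI matches the object
loose = mask_target(gt, (28, 28, 140, 140))   # RoI twice as large

print(tight.mean(), loose.mean())             # 1.0 vs 0.25
```

A well-localized RoI fills the whole 28x28 grid with ones, while a loose RoI produces a target occupying only part of the grid, mirroring the two cases shown on the slide.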
5541 1386
5542 01:04:31,369 --> 01:04:36,318
5543 and then you can see in this other
5544
5545 1387
5546 01:04:34,219 --> 01:04:37,938
5547 example here when the region of interest
5548
5549 1388
5550 01:04:36,318 --> 01:04:39,829
5551 is not well aligned with the object
5552
5553 1389
5554 01:04:37,938 --> 01:04:42,048
5555 there's sort of an appropriate
5556
5557 1390
5558 01:04:39,829 --> 01:04:45,048
5559 transformation of the target so that it
5560
5561 1391
5562 01:04:42,048 --> 01:04:48,409
5563 only occupies a portion
5564
5565 1392
5566 01:04:45,048 --> 01:04:50,630
5567 of the proposal and this would mean that
5568
5569 1393
5570 01:04:48,409 --> 01:04:54,349
5571 the system is being trained so that even
5572
5573 1394
5574 01:04:50,630 --> 01:04:56,568
5575 if the box that it has predicted ends up
5576
5577 1395
5578 01:04:54,349 --> 01:04:59,149
5579 not being a great box for the object it
5580
5581 1396
5582 01:04:56,568 --> 01:05:01,068
5583 still should be able hopefully to
5584
5585 1397
5586 01:04:59,148 --> 01:05:06,648
5587 predict a reasonably good mask for the
5588
5589 1398
5590 01:05:01,068 --> 01:05:08,449
5591 object okay so unfortunately that's all
5592
5593 1399
5594 01:05:06,648 --> 01:05:11,868
5595 about training so now let's talk about
5596
5597 1400
5598 01:05:08,449 --> 01:05:14,749
5599 how inference works so inference proceeds
5600
5601 1401
5602 01:05:11,869 --> 01:05:17,119
5603 in two steps and essentially the first
5604
5605 1402
5606 01:05:14,748 --> 01:05:20,208
5607 step is just to perform standard faster
5608
5609 1403
5610 01:05:17,119 --> 01:05:22,009
5611 R-CNN type inference so if
5612
5613 1404
5614 01:05:20,208 --> 01:05:24,228
5615 you're familiar with that then you'll
5616
5617 1405
5618 01:05:22,009 --> 01:05:26,568
5619 follow this if not then unfortunately
5620
5621 1406
5622 01:05:24,228 --> 01:05:29,958
5623 it'll probably be a little bit too terse to
5624
5625 1407
5626 01:05:26,568 --> 01:05:31,880
5627 really understand but basically the
5628
5629 1408
5630 01:05:29,958 --> 01:05:36,018
5631 first step is to generate proposals
5632
5633 1409
5634 01:05:31,880 --> 01:05:39,709
5635 using RPN then to score those proposals
5636
5637 1410
5638 01:05:36,018 --> 01:05:42,228
5639 using the object classification head of
5640
5641 1411
5642 01:05:39,708 --> 01:05:46,038
5643 the network and also to regress the
5644
5645 1412
5646 01:05:42,228 --> 01:05:48,468
5647 refined proposals using bounding box
5648
5649 1413
5650 01:05:46,039 --> 01:05:51,979
5651 regression and then to apply non-maximum
5652
5653 1414
5654 01:05:48,469 --> 01:05:54,019
5655 suppression and take the top say 100
5656
5657 1415
5658 01:05:51,978 --> 01:05:57,288
5659 detections is what we typically do in
5660
5661 1416
5662 01:05:54,018 --> 01:05:59,618
5663 practice and now the second part of the
5664
5665 1417
5666 01:05:57,289 --> 01:06:02,719
5667 inference procedure is going to be
5668
5669 1418
5670 01:05:59,619 --> 01:06:06,439
5671 predicting masks for those top 100
5672
5673 1419
5674 01:06:02,719 --> 01:06:08,778
5675 detections and the way that this is done
5676
5677 1420
5678 01:06:06,438 --> 01:06:10,879
5679 is simply by reusing all of the features
5680
5681 1421
5682 01:06:08,778 --> 01:06:15,489
5683 that have already been computed and then
5684
5685 1422
5686 01:06:10,880 --> 01:06:19,130
5687 for each one of these refined detections
5688
5689 1423
5690 01:06:15,489 --> 01:06:22,639
5691 running the ROI transformation operation
5692
5693 1424
5694 01:06:19,130 --> 01:06:25,459
5695 RoIAlign and then running those
5696
5697 1425
5698 01:06:22,639 --> 01:06:27,048
5699 features through the mask head and doing
5700
5701 1426
5702 01:06:25,458 --> 01:06:29,898
5703 this sort of two-stage Cascade
5704
5705 1427
5706 01:06:27,048 --> 01:06:32,059
5707 classification or mask prediction has a
5708
5709 1428
5710 01:06:29,898 --> 01:06:34,818
5711 couple of advantages to it so the first
5712
5713 1429
5714 01:06:32,059 --> 01:06:38,359
5715 is that it's fast because you only have
5716
5717 1430
5718 01:06:34,818 --> 01:06:40,338
5719 to predict masks for say 100 objects
5720
5721 1431
5722 01:06:38,358 --> 01:06:42,918
5723 rather than for say a thousand objects
5724
5725 1432
5726 01:06:40,338 --> 01:06:45,199
5727 and the other is that you get slightly
5728
5729 1433
5730 01:06:42,918 --> 01:06:48,708
5731 improved accuracy because you're using
5732
5733 1434
5734 01:06:45,199 --> 01:06:50,179
5735 the refined detections rather than the
5736
5737 1435
5738 01:06:48,708 --> 01:06:52,419
5739 original proposals for doing the mask
5740
5741 1436
5742 01:06:50,179 --> 01:06:56,210
5743 prediction
5744
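The first-stage inference recipe just described (score proposals, apply non-maximum suppression, keep the top ~100 detections for the mask head) can be sketched as a small standalone function. This is a generic illustrative NMS, not the talk's actual implementation; the IoU threshold and top-k values are assumptions chosen for illustration.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5, top_k=100):
    """Greedy non-maximum suppression, then keep the top-k survivors.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Mirrors the first inference stage described in the talk: score
    proposals, suppress near-duplicates, keep ~100 detections.
    """
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0 and len(keep) < top_k:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        # IoU of the best remaining box against all other candidates
        rest = boxes[order[1:]]
        xx1 = np.maximum(boxes[i, 0], rest[:, 0])
        yy1 = np.maximum(boxes[i, 1], rest[:, 1])
        xx2 = np.minimum(boxes[i, 2], rest[:, 2])
        yy2 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavy overlaps
    return keep
```

The mask head then runs only on the boxes in `keep`, which is why the cascade is both fast and slightly more accurate than masking every raw proposal.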
5745 1437
5746 01:06:52,420 --> 01:06:58,039
5747 so one of the things that has frequently
5748
5749 1438
5750 01:06:56,210 --> 01:07:01,220
5751 come up when I talk to people is that
5752
5753 1439
5754 01:06:58,039 --> 01:07:02,719
5755 intuitively you know I was the same way
5756
5757 1440
5758 01:07:01,219 --> 01:07:05,899
5759 you kind of have a hard time believing
5760
5761 1441
5762 01:07:02,719 --> 01:07:08,059
5763 that a 28 by 28 mask prediction is going
5764
5765 1442
5766 01:07:05,900 --> 01:07:10,430
5767 to be high enough resolution to give you
5768
5769 1443
5770 01:07:08,059 --> 01:07:11,929
5771 anything reasonable looking and in fact
5772
5773 1444
5774 01:07:10,429 --> 01:07:15,230
5775 I don't have an illustration of it but
5776
5777 1445
5778 01:07:11,929 --> 01:07:17,480
5779 even 14 by 14 is quite reasonable so so
5780
5781 1446
5782 01:07:15,230 --> 01:07:20,510
5783 what I want to show here is the result
5784
5785 1447
5786 01:07:17,480 --> 01:07:24,440
5787 of the inference procedure so here's a
5788
5789 1448
5790 01:07:20,510 --> 01:07:26,599
5791 detected person here's the 28 by 28 soft
5792
5793 1449
5794 01:07:24,440 --> 01:07:30,800
5795 mask so this is before doing any sort of
5796
5797 1450
5798 01:07:26,599 --> 01:07:35,269
5799 thresholding so it has a value between
5800
5801 1451
5802 01:07:30,800 --> 01:07:38,510
5803 0 and 1 now in order to transform that
5804
5805 1452
5806 01:07:35,269 --> 01:07:40,159
5807 28 by 28 prediction into a prediction
5808
5809 1453
5810 01:07:38,510 --> 01:07:43,220
5811 that's in the coordinate space of the
5812
5813 1454
5814 01:07:40,159 --> 01:07:46,489
5815 image the first thing to do is resample
5816
5817 1455
5818 01:07:43,219 --> 01:07:49,399
5819 it so that it has the appropriate dimensions
5820
5821 1456
5822 01:07:46,489 --> 01:07:51,789
5823 so that its aspect ratio matches the
5824
5825 1457
5826 01:07:49,400 --> 01:07:55,430
5827 aspect ratio of the detected box and
5828
5829 1458
5830 01:07:51,789 --> 01:07:58,338
5831 that rescaling is done using the soft
5832
5833 1459
5834 01:07:55,429 --> 01:08:00,858
5835 mask which is important to do it that
5836
5837 1460
5838 01:07:58,338 --> 01:08:02,809
5839 way rather than binarizing the mask and
5840
5841 1461
5842 01:08:00,858 --> 01:08:06,739
5843 then rescaling which introduces
5844
5845 1462
5846 01:08:02,809 --> 01:08:09,199
5847 artifacts so after having rescaled that
5848
5849 1463
5850 01:08:06,739 --> 01:08:11,449
5851 mask the soft mask then you can simply
5852
5853 1464
5854 01:08:09,199 --> 01:08:14,809
5855 threshold it in order to get the final
5856
5857 1465
5858 01:08:11,449 --> 01:08:17,838
5859 prediction and here's another example
5860
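The resample-then-threshold order stressed above (rescale the soft mask to the box's dimensions first, binarize second) can be sketched with a hand-rolled bilinear resize. This is an illustrative sketch, not the paper's code; the function name and the 0.5 threshold are assumptions.

```python
import numpy as np

def paste_mask(soft_mask, box_w, box_h, thresh=0.5):
    """Resample a small soft mask (values in [0, 1]) to a detected box's
    size, *then* binarize -- binarizing first and resampling after
    introduces blocky artifacts. soft_mask is e.g. the 28x28 head output."""
    m, n = soft_mask.shape
    # target pixel centers mapped back into soft-mask coordinates
    ys = (np.arange(box_h) + 0.5) * m / box_h - 0.5
    xs = (np.arange(box_w) + 0.5) * n / box_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, m - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, n - 1)
    y1 = np.clip(y0 + 1, 0, m - 1)
    x1 = np.clip(x0 + 1, 0, n - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None]
    wx = np.clip(xs - x0, 0, 1)[None, :]
    # bilinear interpolation of the soft (continuous-valued) mask
    top = soft_mask[np.ix_(y0, x0)] * (1 - wx) + soft_mask[np.ix_(y0, x1)] * wx
    bot = soft_mask[np.ix_(y1, x0)] * (1 - wx) + soft_mask[np.ix_(y1, x1)] * wx
    resized = top * (1 - wy) + bot * wy
    return resized > thresh   # final binary mask in box coordinates
```

Because interpolation happens on continuous values, the final binary edge falls between the coarse grid cells instead of snapping to them.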
5861 1466
5862 01:08:14,809 --> 01:08:20,420
5863 using the same image so you see the 28
5864
5865 1467
5866 01:08:17,838 --> 01:08:23,028
5867 by 28 soft mask which contains a fair
5868
5869 1468
5870 01:08:20,420 --> 01:08:24,920
5871 amount of detail the soft resampled
5872
5873 1469
5874 01:08:23,029 --> 01:08:29,230
5875 version of it and then the final
5876
5877 1470
5878 01:08:24,920 --> 01:08:32,390
5879 prediction after thresholding and
5880
5881 1471
5882 01:08:29,229 --> 01:08:37,218
5883 then here's a final example of the bird
5884
5885 1472
5886 01:08:32,390 --> 01:08:39,410
5887 in that image being detected ok so here
5888
5889 1473
5890 01:08:37,219 --> 01:08:41,119
5891 are just a few more qualitative results
5892
5893 1474
5894 01:08:39,409 --> 01:08:44,479
5895 showing you what the output of the
5896
5897 1475
5898 01:08:41,119 --> 01:08:47,599
5899 system looks like so I like this example
5900
5901 1476
5902 01:08:44,479 --> 01:08:49,968
5903 because it shows a really nice success
5904
5905 1477
5906 01:08:47,600 --> 01:08:52,880
5907 case of the system where you have a
5908
5909 1478
5910 01:08:49,969 --> 01:08:54,859
5911 person here and the person is cut in
5912
5913 1479
5914 01:08:52,880 --> 01:08:57,798
5915 half completely by the surfboard that
5916
5917 1480
5918 01:08:54,859 --> 01:09:00,319
5919 they're holding but it's still able to
5920
5921 1481
5922 01:08:57,798 --> 01:09:03,048
5923 detect the top part in the bottom part
5924
5925 1482
5926 01:09:00,319 --> 01:09:05,210
5927 as being part of the same instance and
5928
5929 1483
5930 01:09:03,048 --> 01:09:06,528
5931 the reason that it's doing this that
5932
5933 1484
5934 01:09:05,210 --> 01:09:09,020
5935 able to do this is because it's not
5936
5937 1485
5938 01:09:06,529 --> 01:09:11,960
5939 relying on any bottom-up grouping bottom-up
5940
5941 1486
5942 01:09:09,020 --> 01:09:14,480
5943 grouping would fail in such a case
5944
5945 1487
5946 01:09:11,960 --> 01:09:16,548
5947 because it wouldn't be able to connect
5948
5949 1488
5950 01:09:14,479 --> 01:09:18,259
5951 the bottom part to the top part but
5952
5953 1489
5954 01:09:16,548 --> 01:09:23,119
5955 because it's doing sort of a more
5956
5957 1490
5958 01:09:18,259 --> 01:09:24,649
5959 holistic reasoning of what the object
5960
5961 1491
5962 01:09:23,119 --> 01:09:30,588
5963 looks like in the scene it's able to
5964
5965 1492
5966 01:09:24,649 --> 01:09:32,439
5967 actually predict the full extent so
5968
5969 1493
5970 01:09:30,588 --> 01:09:35,028
5971 here's another example where you can see
5972
5973 1494
5974 01:09:32,439 --> 01:09:37,849
5975 that it's able to deal pretty well with
5976
5977 1495
5978 01:09:35,029 --> 01:09:40,069
5979 people who are very heavily overlapping
5980
5981 1496
5982 01:09:37,850 --> 01:09:42,109
5983 each other and objects that are
5984
5985 1497
5986 01:09:40,069 --> 01:09:43,850
5987 overlapping so for example the bottle
5988
5989 1498
5990 01:09:42,109 --> 01:09:48,829
5991 that's being held in the hand of this
5992
5993 1499
5994 01:09:43,850 --> 01:09:51,079
5995 person here so now I want to just very
5996
5997 1500
5998 01:09:48,829 --> 01:09:55,909
5999 briefly describe the application of this
6000
6001 1501
6002 01:09:51,079 --> 01:09:59,238
6003 to human pose so the idea here is that
6004
6005 1502
6006 01:09:55,909 --> 01:10:03,250
6007 human pose in kind of a funny way can be
6008
6009 1503
6010 01:09:59,238 --> 01:10:03,250
6011 expressed as a mask prediction problem
6012
6013 1504
6014 01:10:03,279 --> 01:10:08,210
6015 so that the idea is that you can think
6016
6017 1505
6018 01:10:05,779 --> 01:10:09,859
6019 of each one of the say seventeen key
6020
6021 1506
6022 01:10:08,210 --> 01:10:13,609
6023 points that's used in the COCO dataset
6024
6025 1507
6026 01:10:09,859 --> 01:10:15,079
6027 as being a one-hot mask the mask is on
6028
6029 1508
6030 01:10:13,609 --> 01:10:17,420
6031 where the key point is and it's off
6032
6033 1509
6034 01:10:15,079 --> 01:10:19,579
6035 everywhere else so now you can just
6036
6037 1510
6038 01:10:17,420 --> 01:10:21,230
6039 change the mask prediction head so that
6040
6041 1511
6042 01:10:19,579 --> 01:10:22,488
6043 instead of predicting a binary
6044
6045 1512
6046 01:10:21,229 --> 01:10:26,179
6047 foreground/background
6048
6049 1513
6050 01:10:22,488 --> 01:10:28,429
6051 mask it's predicting 17 masks where each
6052
6053 1514
6054 01:10:26,180 --> 01:10:32,600
6055 one should have its Arg max at the
6056
6057 1515
6058 01:10:28,430 --> 01:10:35,539
6059 location of the key point there is one
6060
6061 1516
6062 01:10:32,600 --> 01:10:38,180
6063 small technical change that we make here
6064
6065 1517
6066 01:10:35,539 --> 01:10:41,000
6067 and that is rather than having a sigmoid
6068
6069 1518
6070 01:10:38,180 --> 01:10:43,940
6071 unit at each of the say 28 by 28 spatial
6072
6073 1519
6074 01:10:41,000 --> 01:10:46,250
6075 locations the prediction is now going to
6076
6077 1520
6078 01:10:43,939 --> 01:10:50,629
6079 for the mask is now going to be formed
6080
6081 1521
6082 01:10:46,250 --> 01:10:52,880
6083 by doing a softmax over the grid
6084
6085 1522
6086 01:10:50,630 --> 01:10:54,319
6087 of spatial locations and you can think
6088
6089 1523
6090 01:10:52,880 --> 01:10:56,779
6091 of that as sort of encoding the prior
6092
6093 1524
6094 01:10:54,319 --> 01:11:01,029
6095 that the key point is going to exist in
6096
6097 1525
6098 01:10:56,779 --> 01:11:05,029
6099 only one location within the spatial map
6100
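The head change just described, a softmax over all spatial locations of each keypoint's map rather than a per-pixel sigmoid, can be sketched as a cross-entropy loss. This is an illustrative numpy sketch under stated assumptions (the function name, grid size, and (row, col) target encoding are mine, not from the talk).

```python
import numpy as np

def keypoint_loss(logits, keypoint_rc):
    """Cross-entropy over a spatial softmax, one map per keypoint.

    logits: (num_keypoints, K, K) raw scores from the mask-style head.
    keypoint_rc: (num_keypoints, 2) integer (row, col) target locations.
    Each K*K map is softmax-normalized as one big multinomial, encoding
    the prior that a keypoint occurs at exactly one grid location."""
    n, K, _ = logits.shape
    flat = logits.reshape(n, K * K)
    flat = flat - flat.max(axis=1, keepdims=True)   # numerical stability
    log_probs = flat - np.log(np.exp(flat).sum(axis=1, keepdims=True))
    target_idx = keypoint_rc[:, 0] * K + keypoint_rc[:, 1]
    # mean negative log-probability of the true location per keypoint
    return -log_probs[np.arange(n), target_idx].mean()
```

At inference the keypoint is simply the argmax of each map, which is what the one-hot training target encourages.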
6101 1526
6102 01:11:01,029 --> 01:11:07,130
6103 so here are some examples of the sort of
6104
6105 1527
6106 01:11:05,029 --> 01:11:09,289
6107 output that that the system produces and
6108
6109 1528
6110 01:11:07,130 --> 01:11:11,090
6111 again I guess I should emphasize that
6112
6113 1529
6114 01:11:09,289 --> 01:11:14,060
6115 this is sort of the most naive way that
6116
6117 1530
6118 01:11:11,090 --> 01:11:16,670
6119 you could do key point estimation and we
6120
6121 1531
6122 01:11:14,060 --> 01:11:18,320
6123 but it ends up working quite well and I
6124
6125 1532
6126 01:11:16,670 --> 01:11:19,760
6127 think should serve as the
6128
6129 1533
6130 01:11:18,319 --> 01:11:22,130
6131 pretty reasonable baseline for trying to
6132
6133 1534
6134 01:11:19,760 --> 01:11:25,820
6135 do more sophisticated things on top of
6136
6137 1535
6138 01:11:22,130 --> 01:11:28,970
6139 this so here's the video where Mask R
6140
6141 1536
6142 01:11:25,819 --> 01:11:31,399
6143 CNN is being run frame by frame it's
6144
6145 1537
6146 01:11:28,970 --> 01:11:34,610
6147 doing so it's one model in this case
6148
6149 1538
6150 01:11:31,399 --> 01:11:38,389
6151 that's trained to do box detection it's
6152
6153 1539
6154 01:11:34,609 --> 01:11:41,059
6155 instance mask inference and also human
6156
6157 1540
6158 01:11:38,390 --> 01:11:50,270
6159 key point inference all within the same
6160
6161 1541
6162 01:11:41,060 --> 01:11:52,760
6163 model and here's another video okay so
6164
6165 1542
6166 01:11:50,270 --> 01:11:54,620
6167 now that brings me to the second part of
6168
6169 1543
6170 01:11:52,760 --> 01:11:58,460
6171 the talk which is just going to be a
6172
6173 1544
6174 01:11:54,619 --> 01:12:02,269
6175 very brief survey of deep learning
6176
6177 1545
6178 01:11:58,460 --> 01:12:04,909
6179 for object detection so there have been a
6180
6181 1546
6182 01:12:02,270 --> 01:12:07,370
6183 huge number of proposed methods over the
6184
6185 1547
6186 01:12:04,909 --> 01:12:10,220
6187 last few years and they often have very
6188
6189 1548
6190 01:12:07,369 --> 01:12:12,050
6191 different names but they're actually all
6192
6193 1549
6194 01:12:10,220 --> 01:12:15,320
6195 related to each other and sort of
6196
6197 1550
6198 01:12:12,050 --> 01:12:17,810
6199 interesting in complex ways and I want
6200
6201 1551
6202 01:12:15,319 --> 01:12:20,929
6203 to try to provide a little bit of
6204
6205 1552
6206 01:12:17,810 --> 01:12:23,090
6207 structure on this space I really like
6208
6209 1553
6210 01:12:20,930 --> 01:12:24,619
6211 mountains and so I just wanted to put in
6212
6213 1554
6214 01:12:23,090 --> 01:12:26,000
6215 a nice picture of a mountain here this
6216
6217 1555
6218 01:12:24,619 --> 01:12:30,470
6219 slide doesn't really convey any other
6220
6221 1556
6222 01:12:26,000 --> 01:12:32,000
6223 information so let's start with
6224
6225 1557
6226 01:12:30,470 --> 01:12:35,150
6227 something that's common to all of these
6228
6229 1558
6230 01:12:32,000 --> 01:12:37,130
6231 methods so so the first is that they all
6232
6233 1559
6234 01:12:35,149 --> 01:12:39,619
6235 start by modifying a classification
6236
6237 1560
6238 01:12:37,130 --> 01:12:42,829
6239 network so you saw this for the
6240
6241 1561
6242 01:12:39,619 --> 01:12:45,170
6243 particular case of Mask R-CNN but this
6244
6245 1562
6246 01:12:42,829 --> 01:12:47,899
6247 is true of all of the existing methods
6248
6249 1563
6250 01:12:45,170 --> 01:12:49,880
6251 and that's because classification
6252
6253 1564
6254 01:12:47,899 --> 01:12:52,549
6255 networks and the features they learn are
6256
6257 1565
6258 01:12:49,880 --> 01:12:54,590
6259 really the backbone or the engine that's
6260
6261 1566
6262 01:12:52,550 --> 01:13:00,199
6263 driving a lot of visual recognition
6264
6265 1567
6266 01:12:54,590 --> 01:13:01,550
6267 these days now I wanted I was searching
6268
6269 1568
6270 01:13:00,199 --> 01:13:04,039
6271 around for what I thought was the
6272
6273 1569
6274 01:13:01,550 --> 01:13:08,420
6275 highest information gain split of these
6276
6277 1570
6278 01:13:04,039 --> 01:13:14,119
6279 methods and I think that it's this idea
6280
6281 1571
6282 01:13:08,420 --> 01:13:15,470
6283 of a stage so if you split by stage and
6284
6285 1572
6286 01:13:14,119 --> 01:13:17,510
6287 I'll explain what that means in a moment
6288
6289 1573
6290 01:13:15,470 --> 01:13:21,110
6291 then you sort of get this division into
6292
6293 1574
6294 01:13:17,510 --> 01:13:24,470
6295 one set of methods like the our CNN
6296
6297 1575
6298 01:13:21,109 --> 01:13:26,659
6299 style set of methods and another set of
6300
6301 1576
6302 01:13:24,470 --> 01:13:29,150
6303 methods which are often called one stage
6304
6305 1577
6306 01:13:26,659 --> 01:13:29,899
6307 and these are approaches like OverFeat
6308
6309 1578
6310 01:13:29,149 --> 01:13:30,529
6311 Yolo
6312
6313 1579
6314 01:13:29,899 --> 01:13:33,558
6315 SSD
6316
6317 1580
6318 01:13:30,529 --> 01:13:37,158
6319 and then a new method that we
6320
6321 1581
6322 01:13:33,559 --> 01:13:39,110
6323 recently have written a paper out about
6324
6325 1582
6326 01:13:37,158 --> 01:13:41,058
6327 that will show up at ICCV and will be
6328
6329 1583
6330 01:13:39,109 --> 01:13:44,808
6331 presented on Wednesday as a poster
6332
6333 1584
6334 01:13:41,059 --> 01:13:48,199
6335 called RetinaNet so what's this idea
6336
6337 1585
6338 01:13:44,809 --> 01:13:51,349
6339 of a stage so if we sort of go back to
6340
6341 1586
6342 01:13:48,198 --> 01:13:55,759
6343 basics of object detection so if you
6344
6345 1587
6346 01:13:51,349 --> 01:13:57,770
6347 have an H by W image with N pixels if you
6348
6349 1588
6350 01:13:55,760 --> 01:13:59,480
6351 think about every rectangle in that
6352
6353 1589
6354 01:13:57,770 --> 01:14:02,869
6355 image you have order N squared which is
6356
6357 1590
6358 01:13:59,479 --> 01:14:05,089
6359 a huge number of windows and essentially
6360
6361 1591
6362 01:14:02,868 --> 01:14:06,500
6363 the detection problem we saw very
6364
6365 1592
6366 01:14:05,090 --> 01:14:09,110
6367 popular way to think about it is
6368
6369 1593
6370 01:14:06,500 --> 01:14:11,599
6371 reducing detection to classifying each
6372
6373 1594
6374 01:14:09,109 --> 01:14:14,000
6375 one of those windows this is
6376
6377 1595
6378 01:14:11,599 --> 01:14:16,880
6379 a huge output space it's very difficult
6380
6381 1596
6382 01:14:14,000 --> 01:14:18,469
6383 to deal with in practice so a lot of the
6384
6385 1597
6386 01:14:16,880 --> 01:14:20,389
6387 literature on object detection whether
6388
6389 1598
6390 01:14:18,469 --> 01:14:22,219
6391 explicitly stated or not is essentially
6392
6393 1599
6394 01:14:20,389 --> 01:14:26,770
6395 trying to figure out ways to manage this
6396
6397 1600
6398 01:14:22,219 --> 01:14:29,149
6399 computational complexity and a very
6400
6401 1601
6402 01:14:26,770 --> 01:14:31,360
6403 popular approach to this has been
6404
6405 1602
6406 01:14:29,149 --> 01:14:33,920
6407 sliding window which allows you to
6408
6409 1603
6410 01:14:31,359 --> 01:14:36,769
6411 rather than considering every possible
6412
6413 1604
6414 01:14:33,920 --> 01:14:39,618
6415 rectangle you reduce it to a discrete
6416
6417 1605
6418 01:14:36,770 --> 01:14:41,780
6419 set of aspect ratios translations and scales
6420
6421 1606
6422 01:14:39,618 --> 01:14:43,460
6423 and it can often bring you down to sort
6424
6425 1607
6426 01:14:41,779 --> 01:14:46,759
6427 of around a hundred thousand different
6428
6429 1608
6430 01:14:43,460 --> 01:14:50,868
6431 rectangles to consider but the other
6432
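The counting argument above, order N² rectangles in an image versus a sliding-window menu of roughly a hundred thousand boxes, can be made concrete with back-of-envelope arithmetic. All numbers below (stride, scale, and aspect counts) are illustrative assumptions, not taken from any particular detector.

```python
def window_counts(h, w, stride=16, num_scales=5, num_aspects=3):
    """Back-of-envelope window counts for an h x w image.

    All axis-aligned rectangles: choose two x-edges and two y-edges,
    i.e. C(w+1, 2) * C(h+1, 2), which is order n^2 in n = h * w pixels.
    A sliding-window / anchor scheme instead evaluates a fixed menu of
    scales and aspect ratios at a strided grid of positions."""
    all_rects = (w * (w + 1) // 2) * (h * (h + 1) // 2)
    anchors = (h // stride) * (w // stride) * num_scales * num_aspects
    return all_rects, anchors
```

For an 800 x 1280 image this gives on the order of 10^11 possible rectangles but only tens of thousands of anchors, which is the complexity reduction the sliding-window idea buys before any cascade is applied.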
6433 1609
6434 01:14:46,760 --> 01:14:53,570
6435 very powerful idea is this idea of the
6436
6437 1610
6438 01:14:50,868 --> 01:14:56,328
6439 Cascade in which rather than making a
6440
6441 1611
6442 01:14:53,569 --> 01:15:00,460
6443 decision all at once you have some
6444
6445 1612
6446 01:14:56,328 --> 01:15:02,719
6447 sequence of decisions which allows
6448
6449 1613
6450 01:15:00,460 --> 01:15:05,059
6451 faster testing but perhaps more
6452
6453 1614
6454 01:15:02,719 --> 01:15:07,399
6455 importantly simpler training because it
6456
6457 1615
6458 01:15:05,059 --> 01:15:09,770
6459 allows so some parts of the model to
6460
6461 1616
6462 01:15:07,399 --> 01:15:11,569
6463 focus on the easy cases and other parts
6464
6465 1617
6466 01:15:09,770 --> 01:15:14,929
6467 of the model to focus on the hard cases
6468
6469 1618
6470 01:15:11,569 --> 01:15:16,609
6471 and I think that the way that this idea
6472
6473 1619
6474 01:15:14,929 --> 01:15:18,800
6475 the cascade interacts with training is
6476
6477 1620
6478 01:15:16,609 --> 01:15:22,130
6479 actually sort of the core difference
6480
6481 1621
6482 01:15:18,800 --> 01:15:26,900
6483 between the many stage versus one stage
6484
6485 1622
6486 01:15:22,130 --> 01:15:29,328
6487 set of approaches so let's start with
6488
6489 1623
6490 01:15:26,899 --> 01:15:31,969
6491 the more than one stage group because
6492
6493 1624
6494 01:15:29,328 --> 01:15:35,328
6495 you just saw an example of that and so
6496
6497 1625
6498 01:15:31,969 --> 01:15:38,300
6499 let me illustrate how the R-CNN style
6500
6501 1626
6502 01:15:35,328 --> 01:15:40,729
6503 approach has more than one stage
6504
6505 1627
6506 01:15:38,300 --> 01:15:44,480
6507 so one of the very first things
6508
6509 1628
6510 01:15:40,729 --> 01:15:46,639
6511 that this approach involves is coming up
6512
6513 1629
6514 01:15:44,479 --> 01:15:50,419
6515 with a set of object or region proposals
6516
6517 1630
6518 01:15:46,640 --> 01:15:52,369
6519 and that off the bat is already some form
6520
6521 1631
6522 01:15:50,420 --> 01:15:54,710
6523 of drastically reducing the complexity
6524
6525 1632
6526 01:15:52,369 --> 01:15:56,750
6527 of the output space you're going from
6528
6529 1633
6530 01:15:54,710 --> 01:16:01,069
6531 you know order N squared different
6532
6533 1634
6534 01:15:56,750 --> 01:16:03,109
6535 rectangles to say 2,000 during training so
6536
6537 1635
6538 01:16:01,069 --> 01:16:06,500
6539 this is a huge reduction and that's done
6540
6541 1636
6542 01:16:03,109 --> 01:16:08,989
6543 via some mechanism either learned or via
6544
6545 1637
6546 01:16:06,500 --> 01:16:12,170
6547 bottom-up grouping and now what that
6548
6549 1638
6550 01:16:08,989 --> 01:16:14,090
6551 means is that the second part of the
6552
6553 1639
6554 01:16:12,170 --> 01:16:16,550
6555 system which is doing classification of
6556
6557 1640
6558 01:16:14,090 --> 01:16:19,340
6559 these proposals only has to focus on
6560
6561 1641
6562 01:16:16,550 --> 01:16:21,079
6563 this particular subset of the output
6564
6565 1642
6566 01:16:19,340 --> 01:16:23,150
6567 space and that's going to make the
6568
6569 1643
6570 01:16:21,079 --> 01:16:25,789
6571 training process easier it's going to
6572
6573 1644
6574 01:16:23,149 --> 01:16:28,189
6575 simplify it because it's effectively
6576
6577 1645
6578 01:16:25,789 --> 01:16:30,319
6579 taking this problem which previously was
6580
6581 1646
6582 01:16:28,189 --> 01:16:32,449
6583 extremely class-imbalanced between
6584
6585 1647
6586 01:16:30,319 --> 01:16:35,779
6587 foreground and background and making it
6588
6589 1648
6590 01:16:32,449 --> 01:16:38,269
6591 more class-balanced so now in contrast
6592
6593 1649
6594 01:16:35,779 --> 01:16:40,939
6595 the one stage approach and here I'm
6596
6597 1650
6598 01:16:38,270 --> 01:16:43,610
6599 showing a slide of YOLO as sort of an
6600
6601 1651
6602 01:16:40,939 --> 01:16:45,379
6603 illustrative example of that is going to
6604
6605 1652
6606 01:16:43,609 --> 01:16:47,839
6607 take a different approach and it's
6608
6609 1653
6610 01:16:45,380 --> 01:16:51,470
6611 basically going to just try to classify
6612
6613 1654
6614 01:16:47,840 --> 01:16:54,079
6615 the whole output space at once now there
6616
6617 1655
6618 01:16:51,470 --> 01:16:57,170
6619 are different ways of making this more
6620
6621 1656
6622 01:16:54,079 --> 01:16:59,840
6623 reasonable then dealing with millions of
6624
6625 1657
6626 01:16:57,170 --> 01:17:01,880
6627 possible boxes to classify so one is
6628
6629 1658
6630 01:16:59,840 --> 01:17:04,279
6631 sort of in the spirit of sliding window
6632
6633 1659
6634 01:17:01,880 --> 01:17:06,590
6635 and that's you come up with some sort
6636
6637 1660
6638 01:17:04,279 --> 01:17:09,920
6639 of way of reducing the size of the
6640
6641 1661
6642 01:17:06,590 --> 01:17:11,600
6643 output space so for example Yolo has
6644
6645 1662
6646 01:17:09,920 --> 01:17:15,560
6647 this grid division and that that's
6648
6649 1663
6650 01:17:11,600 --> 01:17:21,079
6651 effectively massive reduction in the
6652
6653 1664
6654 01:17:15,560 --> 01:17:23,690
6655 size of the output space so that's
6656
6657 1665
6658 01:17:21,079 --> 01:17:25,880
6659 that's sort of the main high-level
6660
6661 1666
6662 01:17:23,689 --> 01:17:28,189
6663 split that I want to communicate and now
6664
6665 1667
6666 01:17:25,880 --> 01:17:30,680
6667 I want to sort of go down one of those
6668
6669 1668
6670 01:17:28,189 --> 01:17:33,769
6671 branches the multistage detector
6672
6673 1669
6674 01:17:30,680 --> 01:17:36,020
6675 branch and provide what I think is one
6676
6677 1670
6678 01:17:33,770 --> 01:17:38,390
6679 useful way to think about the set of
6680
6681 1671
6682 01:17:36,020 --> 01:17:41,260
6683 methods that that fall down on the side
6684
6685 1672
6686 01:17:38,390 --> 01:17:45,530
6687 of the branch of the tree so
6688
6689 1673
6690 01:17:41,260 --> 01:17:49,690
6691 you can think of this type of detector
6692
6693 1674
6694 01:17:45,529 --> 01:17:52,189
6695 as being composed of two parts logically
6696
6697 1675
6698 01:17:49,689 --> 01:17:55,009
6699 so one is that there's some
6700
6701 1676
6702 01:17:52,189 --> 01:17:56,689
6703 sort of image wise computation that's
6704
6705 1677
6706 01:17:55,010 --> 01:18:00,739
6707 performed over the whole image and has
6708
6709 1678
6710 01:17:56,689 --> 01:18:05,899
6711 nothing to do with the proposal part of
6712
6713 1679
6714 01:18:00,738 --> 01:18:08,809
6715 the system and then the second part
6716
6717 1680
6718 01:18:05,899 --> 01:18:12,259
6719 is going to be region wise computation
6720
6721 1681
6722 01:18:08,810 --> 01:18:15,350
6723 and that's going to scale with the
6724
6725 1682
6726 01:18:12,260 --> 01:18:17,960
6727 number of proposals or regions that the
6728
6729 1683
6730 01:18:15,350 --> 01:18:19,670
6731 system needs to classify and you can
6732
6733 1684
6734 01:18:17,960 --> 01:18:22,399
6735 think of there being sort of a slider
6736
6737 1685
6738 01:18:19,670 --> 01:18:24,350
6739 that you can use to slide from one end
6740
6741 1686
6742 01:18:22,399 --> 01:18:27,139
6743 to the other end in terms of how much
6744
6745 1687
6746 01:18:24,350 --> 01:18:31,430
6747 computation you put into either side of
6748
6749 1688
6750 01:18:27,140 --> 01:18:32,840
6751 the split so this is an idea that we
6752
6753 1689
6754 01:18:31,430 --> 01:18:35,810
6755 described in this paper called networks
6756
6757 1690
6758 01:18:32,840 --> 01:18:38,180
6759 on convolutional feature maps and I
6760
6761 1691
6762 01:18:35,810 --> 01:18:40,070
6763 think it's a useful idea in the sense of
6764
6765 1692
6766 01:18:38,180 --> 01:18:43,340
6767 helping you organize the landscape of
6768
6769 1693
6770 01:18:40,069 --> 01:18:47,448
6771 this type of detection method so at one
6772
6773 1694
6774 01:18:43,340 --> 01:18:50,180
6775 extreme you have our CNN and it's at the
6776
6777 1695
6778 01:18:47,448 --> 01:18:53,839
6779 extreme where the image wise computation
6780
6781 1696
6782 01:18:50,180 --> 01:18:55,130
6783 is essentially subtracting the pixel
6784
6785 1697
6786 01:18:53,840 --> 01:18:56,600
6787 mean and dividing by standard deviation
6788
6789 1698
6790 01:18:55,130 --> 01:19:00,800
6791 or something like that it's almost
6792
6793 1699
6794 01:18:56,600 --> 01:19:04,460
6795 nothing and then the per-region
6796
6797 1700
6798 01:19:00,800 --> 01:19:05,480
6799 computation is the whole convolutional
6800
6801 1701
6802 01:19:04,460 --> 01:19:08,719
6803 network itself
6804
6805 1702
6806 01:19:05,479 --> 01:19:10,789
6807 applied independently to each region which
6808
6809 1703
6810 01:19:08,719 --> 01:19:12,560
6811 is obviously very expensive although
6812
6813 1704
6814 01:19:10,789 --> 01:19:14,238
6815 this might be a very good way to deal
6816
6817 1705
6818 01:19:12,560 --> 01:19:17,600
6819 with small objects for example so there
6820
6821 1706
6822 01:19:14,238 --> 01:19:20,718
6823 are trade-offs now Fast and Faster R-CNN
6824
6825 1707
6826 01:19:17,600 --> 01:19:22,250
6827 depending on specific
6828
6829 1708
6830 01:19:20,719 --> 01:19:24,980
6831 implementation details can sort of fall
6832
6833 1709
6834 01:19:22,250 --> 01:19:27,260
6835 at different points in the spectrum so
6836
6837 1710
6838 01:19:24,979 --> 01:19:28,279
6839 on one hand the sort of original version
6840
6841 1711
6842 01:19:27,260 --> 01:19:33,050
6843 of it that doesn't use this feature
6844
6845 1712
6846 01:19:28,279 --> 01:19:36,880
6847 pyramid network and may involve using
6848
6849 1713
6850 01:19:33,050 --> 01:19:38,719
6851 sort of more complex heavier
6852
6853 1714
6854 01:19:36,880 --> 01:19:44,300
6855 classification and bounding-box
6856
6857 1715
6858 01:19:38,719 --> 01:19:46,189
6859 regression head when you use FPN feature
6860
6861 1716
6862 01:19:44,300 --> 01:19:47,960
6863 pyramid network that actually enables
6864
6865 1717
6866 01:19:46,189 --> 01:19:49,279
6867 you to reduce the amount of computation
6868
6869 1718
6870 01:19:47,960 --> 01:19:51,260
6871 that you put into the head of the
6872
6873 1719
6874 01:19:49,279 --> 01:19:53,269
6875 detector and so that's sort of that
6876
6877 1720
6878 01:19:51,260 --> 01:19:56,480
6879 shifts that version of the system so
6880
6881 1721
6882 01:19:53,270 --> 01:19:58,670
6883 further down this axis here and then
6884
6885 1722
6886 01:19:56,479 --> 01:20:00,468
6887 there's also this other nice approach
6888
6889 1723
6890 01:19:58,670 --> 01:20:02,659
6891 called region-based fully convolutional
6892
6893 1724
6894 01:20:00,469 --> 01:20:04,948
6895 network which sort of represents the
6896
6897 1725
6898 01:20:02,659 --> 01:20:07,948
6899 complete other extreme from R-CNN
6900
6901 1726
6902 01:20:04,948 --> 01:20:09,928
6903 which is that essentially all of the
6904
6905 1727
6906 01:20:07,948 --> 01:20:13,829
6907 almost all the computation that's being
6908
6909 1728
6910 01:20:09,929 --> 01:20:15,690
6911 done is image wise and there's a very
6912
6913 1729
6914 01:20:13,829 --> 01:20:18,300
6915 very small amount which essentially
6916
6917 1730
6918 01:20:15,689 --> 01:20:20,939
6919 amounts to a bunch of average pooling
6920
6921 1731
6922 01:20:18,300 --> 01:20:26,969
6923 operations of computation that's done
6924
6925 1732
6926 01:20:20,939 --> 01:20:28,829
6927 per region and so now I realize that I'm
6928
6929 1733
6930 01:20:26,969 --> 01:20:30,779
6931 starting to run out of time here but I
6932
6933 1734
6934 01:20:28,829 --> 01:20:35,729
6935 want to just briefly talk about speed
6936
6937 1735
6938 01:20:30,779 --> 01:20:37,679
6939 accuracy trade-offs so so I think right
6940
6941 1736
6942 01:20:35,729 --> 01:20:40,529
6943 now there's sort of a lot of confusion
6944
6945 1737
6946 01:20:37,679 --> 01:20:41,940
6947 about speed and accuracy of the
6948
6949 1738
6950 01:20:40,529 --> 01:20:43,829
6951 different detection systems I've been
6952
6953 1739
6954 01:20:41,939 --> 01:20:46,678
6955 proposed and I think one of the reasons
6956
6957 1740
6958 01:20:43,829 --> 01:20:48,929
6959 for that is that if you just look at
6960
6961 1741
6962 01:20:46,679 --> 01:20:51,390
6963 performance on the PASCAL VOC dataset
6964
6965 1742
6966 01:20:48,929 --> 01:20:53,489
6967 and often it will be reported on the
6968
6969 1743
6970 01:20:51,390 --> 01:20:56,460
6971 2007 version which is now ten years old
6972
6973 1744
6974 01:20:53,488 --> 01:21:00,569
6975 it sort of provides an incomplete
6976
6977 1745
6978 01:20:56,460 --> 01:21:03,899
6979 picture so let me draw an analogy if
6980
6981 1746
6982 01:21:00,569 --> 01:21:05,969
6983 this was the only data set we had we
6984
6985 1747
6986 01:21:03,899 --> 01:21:07,609
6987 effectively wouldn't understand that
6988
6989 1748
6990 01:21:05,969 --> 01:21:10,289
6991 there's any difference between
6992
6993 1749
6994 01:21:07,609 --> 01:21:14,009
6995 nearest-neighbor on simple features SVMs
6996
6997 1750
6998 01:21:10,289 --> 01:21:16,738
6999 and ConvNets because they'll basically
7000
7001 1751
7002 01:21:14,010 --> 01:21:20,159
7003 work about the same on that dataset but you
7004
7005 1752
7006 01:21:16,738 --> 01:21:22,859
7007 bring in ImageNet and now that really
7008
7009 1753
7010 01:21:20,159 --> 01:21:25,319
7011 increases the degree to which you can
7012
7013 1754
7014 01:21:22,859 --> 01:21:28,559
7015 discern which methods are doing useful
7016
7017 1755
7018 01:21:25,319 --> 01:21:30,599
7019 things and which ones aren't so sort of
7020
7021 1756
7022 01:21:28,560 --> 01:21:33,719
7023 by analogy I'd like to suggest that
7024
7025 1757
7026 01:21:30,600 --> 01:21:36,300
7027 Pascal sort of is an outdated data set
7028
7029 1758
7030 01:21:33,719 --> 01:21:39,359
7031 at this point and perhaps doesn't
7032
7033 1759
7034 01:21:36,300 --> 01:21:41,850
7035 provide quite enough complexity to
7036
7037 1760
7038 01:21:39,359 --> 01:21:44,869
7039 understand which methods are
7040
7041 1761
7042 01:21:41,850 --> 01:21:47,969
7043 performing well in different regards
7044
7045 1762
7046 01:21:44,869 --> 01:21:49,979
7047 COCO is probably a slightly better
7048
7049 1763
7050 01:21:47,969 --> 01:21:53,279
7051 instrument for trying to understand this
7052
7053 1764
7054 01:21:49,979 --> 01:21:56,488
7055 because it presents a more complex data
7056
7057 1765
7058 01:21:53,279 --> 01:21:58,198
7059 set and that's perhaps better aligned
7060
7061 1766
7062 01:21:56,488 --> 01:22:01,709
7063 with different real world applications
7064
7065 1767
7066 01:21:58,198 --> 01:22:05,099
7067 that we might be considering so under
7068
7069 1768
7070 01:22:01,710 --> 01:22:08,939
7071 sort of the lens of COCO what we end up
7072
7073 1769
7074 01:22:05,100 --> 01:22:10,890
7075 seeing is that the speed and
7076
7077 1770
7078 01:22:08,939 --> 01:22:13,409
7079 accuracy of these systems is mainly
7080
7081 1771
7082 01:22:10,890 --> 01:22:15,840
7083 influenced by three factors so you have
7084
7085 1772
7086 01:22:13,409 --> 01:22:17,789
7087 the resolution of the input image the
7088
7089 1773
7090 01:22:15,840 --> 01:22:19,920
7091 complexity of the network
7092
7093 1774
7094 01:22:17,789 --> 01:22:21,300
7095 and if it's a proposal based system the
7096
7097 1775
7098 01:22:19,920 --> 01:22:23,069
7099 number of proposals that you're using
7100
7101 1776
7102 01:22:21,300 --> 01:22:25,710
7103 which is sort of an adjustable hyper
7104
7105 1777
7106 01:22:23,069 --> 01:22:28,079
7107 parameter so there's a very nice paper
7108
7109 1778
7110 01:22:25,710 --> 01:22:29,789
7111 that'll be presented at CVPR from
7112
7113 1779
7114 01:22:28,079 --> 01:22:32,579
7115 Jonathan Huang and colleagues at Google
7116
7117 1780
7118 01:22:29,789 --> 01:22:34,439
7119 that does a very nice empirical
7120
7121 1781
7122 01:22:32,579 --> 01:22:37,050
7123 evaluation of the speed accuracy
7124
7125 1782
7126 01:22:34,439 --> 01:22:38,429
7127 trade-offs and one of the reasons why
7128
7129 1783
7130 01:22:37,050 --> 01:22:40,310
7131 this is a very nice study is that they
7132
7133 1784
7134 01:22:38,430 --> 01:22:42,990
7135 do it entirely within one implementation
7136
7137 1785
7138 01:22:40,310 --> 01:22:45,090
7139 so it sort of provides the most fair
7140
7141 1786
7142 01:22:42,989 --> 01:22:46,819
7143 apples to apples comparison of different
7144
7145 1787
7146 01:22:45,090 --> 01:22:49,199
7147 methods within that implementation and
7148
7149 1788
7150 01:22:46,819 --> 01:22:52,829
7151 what they're able to produce from this
7152
7153 1789
7154 01:22:49,199 --> 01:22:55,319
7155 is this really nice sort of lay of the
7156
7157 1790
7158 01:22:52,829 --> 01:22:58,880
7159 land of three different meta
7160
7161 1791
7162 01:22:55,319 --> 01:23:01,380
7163 architectures Faster R-CNN R-FCN and SSD
7164
7165 1792
7166 01:22:58,880 --> 01:23:05,000
7167 using a variety of different backbone
7168
7169 1793
7170 01:23:01,380 --> 01:23:07,350
7171 architectures and a variety of different
7172
7173 1794
7174 01:23:05,000 --> 01:23:10,710
7175 numbers of proposals where those are
7176
7177 1795
7178 01:23:07,350 --> 01:23:12,090
7179 appropriate and I suggest that people
7180
7181 1796
7182 01:23:10,710 --> 01:23:13,529
7183 look into this in detail because it
7184
7185 1797
7186 01:23:12,090 --> 01:23:16,680
7187 really kind of helps provide a more
7188
7189 1798
7190 01:23:13,529 --> 01:23:18,659
7191 complete understanding and I just wanted
7192
7193 1799
7194 01:23:16,680 --> 01:23:20,640
7195 to add to this slide one more point that
7196
7197 1800
7198 01:23:18,659 --> 01:23:23,010
7199 wasn't in the plot it's not quite
7200
7201 1801
7202 01:23:20,640 --> 01:23:24,630
7203 comparable because it's not implemented
7204
7205 1802
7206 01:23:23,010 --> 01:23:26,190
7207 in the same framework so there there are
7208
7209 1803
7210 01:23:24,630 --> 01:23:29,579
7211 lots of small implementation details
7212
7213 1804
7214 01:23:26,189 --> 01:23:31,619
7215 that can shift the FPS that you get one
7216
7217 1805
7218 01:23:29,579 --> 01:23:34,319
7219 way or the other but this sort of this
7220
7221 1806
7222 01:23:31,619 --> 01:23:36,119
7223 illustrates effectively where YOLO
7224
7225 1807
7226 01:23:34,319 --> 01:23:39,119
7227 version 2 which will also be at
7228
7229 1808
7230 01:23:36,119 --> 01:23:40,409
7231 CVPR falls within this you can see
7232
7233 1809
7234 01:23:39,119 --> 01:23:43,800
7235 that it's doing quite well
7236
7237 1810
7238 01:23:40,409 --> 01:23:48,779
7239 in terms of this sort of regime of being
7240
7241 1811
7242 01:23:43,800 --> 01:23:51,930
7243 very fast but sort of medium accuracy so
7244
7245 1812
7246 01:23:48,779 --> 01:23:54,149
7247 just a couple more slides to conclude so
7248
7249 1813
7250 01:23:51,930 --> 01:23:57,450
7251 as Kaiming showed there's been this
7252
7253 1814
7254 01:23:54,149 --> 01:23:59,309
7255 huge improvement over the last decade in
7256
7257 1815
7258 01:23:57,449 --> 01:24:02,449
7259 terms of where we are with object
7260
7261 1816
7262 01:23:59,310 --> 01:24:05,880
7263 detection and I want to try to summarize
7264
7265 1817
7266 01:24:02,449 --> 01:24:10,050
7267 why I think this has happened in terms
7268
7269 1818
7270 01:24:05,880 --> 01:24:13,109
7271 of things that before 2012 were false
7272
7273 1819
7274 01:24:10,050 --> 01:24:15,180
7275 but now they're true and I think that
7276
7277 1820
7278 01:24:13,109 --> 01:24:16,739
7279 these are in my opinion some of the core
7280
7281 1821
7282 01:24:15,180 --> 01:24:20,250
7283 things that have led to the improvements
7284
7285 1822
7286 01:24:16,739 --> 01:24:22,800
7287 we've seen so the first may sound
7288
7289 1823
7290 01:24:20,250 --> 01:24:24,149
7291 kind of funny to some of the people
7292
7293 1824
7294 01:24:22,800 --> 01:24:25,980
7295 who very recently started in computer
7296
7297 1825
7298 01:24:24,149 --> 01:24:27,879
7299 vision but it is that we see
7300
7301 1826
7302 01:24:25,979 --> 01:24:31,869
7303 improvements with more data
7304
7305 1827
7306 01:24:27,880 --> 01:24:34,329
7307 so in 2007 this wasn't true and it was
7308
7309 1828
7310 01:24:31,869 --> 01:24:35,349
7311 actually kind of a very sad thing we
7312
7313 1829
7314 01:24:34,329 --> 01:24:38,679
7315 didn't quite know it at the time because
7316
7317 1830
7318 01:24:35,350 --> 01:24:40,750
7319 it took a bit of annotation to actually
7320
7321 1831
7322 01:24:38,679 --> 01:24:42,789
7323 understand this but Deva Ramanan has a
7324
7325 1832
7326 01:24:40,750 --> 01:24:45,189
7327 nice paper looking at this which is that
7328
7329 1833
7330 01:24:42,789 --> 01:24:47,198
7331 if you increase the amount of data there
7332
7333 1834
7334 01:24:45,189 --> 01:24:48,519
7335 was very little improvement that you
7336
7337 1835
7338 01:24:47,198 --> 01:24:51,129
7339 could actually see in the models at that
7340
7341 1836
7342 01:24:48,520 --> 01:24:52,989
7343 time this is no longer the case because
7344
7345 1837
7346 01:24:51,130 --> 01:24:55,409
7347 we have models that are actually able to
7348
7349 1838
7350 01:24:52,988 --> 01:24:58,408
7351 take advantage of that data and improve
7352
7353 1839
7354 01:24:55,408 --> 01:25:00,759
7355 the second which is very related is that
7356
7357 1840
7358 01:24:58,408 --> 01:25:03,219
7359 we see improvements when we increase
7360
7361 1841
7362 01:25:00,760 --> 01:25:05,579
7363 model capacity this also wasn't really
7364
7365 1842
7366 01:25:03,219 --> 01:25:08,198
7367 true before because we effectively would
7368
7369 1843
7370 01:25:05,579 --> 01:25:10,179
7371 increase the model complexity and
7372
7373 1844
7374 01:25:08,198 --> 01:25:13,138
7375 most likely see overfitting rather
7376
7377 1845
7378 01:25:10,179 --> 01:25:15,850
7379 than improved performance on the data
7380
7381 1846
7382 01:25:13,139 --> 01:25:18,819
7383 the third and I think that this is sort
7384
7385 1847
7386 01:25:15,850 --> 01:25:20,980
7387 of a really fundamental gain that we've
7388
7389 1848
7390 01:25:18,819 --> 01:25:23,408
7391 had in the last few years is the ability
7392
7393 1849
7394 01:25:20,979 --> 01:25:25,869
7395 to take advantage of transfer learning
7396
7397 1850
7398 01:25:23,408 --> 01:25:29,289
7399 so that's the ability to pre train on
7400
7401 1851
7402 01:25:25,869 --> 01:25:30,729
7403 ImageNet and use that information that's
7404
7405 1852
7406 01:25:29,289 --> 01:25:33,550
7407 been pulled out of that data set in
7408
7409 1853
7410 01:25:30,729 --> 01:25:36,189
7411 order to do better on detection this is
7412
7413 1854
7414 01:25:33,550 --> 01:25:39,429
7415 something that we really didn't
7416
7417 1855
7418 01:25:36,189 --> 01:25:40,569
7419 have any way of doing before and what
7420
7421 1856
7422 01:25:39,429 --> 01:25:42,789
7423 this means is that we can immediately
7424
7425 1857
7426 01:25:40,569 --> 01:25:45,069
7427 benefit from improvements to image
7428
7429 1858
7430 01:25:42,789 --> 01:25:46,979
7431 classification so every time someone
7432
7433 1859
7434 01:25:45,069 --> 01:25:49,899
7435 comes up with a new network architecture
7436
7437 1860
7438 01:25:46,979 --> 01:25:51,279
7439 well I shouldn't say every time a lot of
7440
7441 1861
7442 01:25:49,899 --> 01:25:53,519
7443 the time when people come up with one it
7444
7445 1862
7446 01:25:51,279 --> 01:25:56,019
7447 immediately translates into improved
7448
7449 1863
7450 01:25:53,520 --> 01:25:57,550
7451 performance on say COCO detection
7452
7453 1864
7454 01:25:56,020 --> 01:25:59,639
7455 which is really exciting to see so
7456
7457 1865
7458 01:25:57,550 --> 01:26:01,719
7459 there's a synergy and then another
7460
7461 1866
7462 01:25:59,639 --> 01:26:04,619
7463 important issue that's related to that
7464
7465 1867
7466 01:26:01,719 --> 01:26:07,270
7467 again is that we're now sort of
7468
7469 1868
7470 01:26:04,619 --> 01:26:08,829
7471 coalescing to this shared modeling
7472
7473 1869
7474 01:26:07,270 --> 01:26:10,870
7475 framework between a bunch of different
7476
7477 1870
7478 01:26:08,829 --> 01:26:13,329
7479 disciplines so speech natural language
7480
7481 1871
7482 01:26:10,869 --> 01:26:14,529
7483 processing computer vision and this
7484
7485 1872
7486 01:26:13,329 --> 01:26:17,079
7487 means that when there are new
7488
7489 1873
7490 01:26:14,529 --> 01:26:19,029
7491 discoveries in any of those related
7492
7493 1874
7494 01:26:17,079 --> 01:26:20,889
7495 fields there's a decent chance that
7496
7497 1875
7498 01:26:19,029 --> 01:26:22,469
7499 those might actually translate into
7500
7501 1876
7502 01:26:20,889 --> 01:26:24,730
7503 things that we can do in computer vision
7504
7505 1877
7506 01:26:22,469 --> 01:26:28,000
7507 which is really exciting
7508
7509 1878
7510 01:26:24,729 --> 01:26:30,129
7511 so in conclusion we've come a very long
7512
7513 1879
7514 01:26:28,000 --> 01:26:33,550
7515 way in the last ten years in terms of
7516
7517 1880
7518 01:26:30,130 --> 01:26:35,949
7519 object detection more recently we're
7520
7521 1881
7522 01:26:33,550 --> 01:26:36,909
7523 moving from bounding box detection to
7524
7525 1882
7526 01:26:35,948 --> 01:26:39,000
7527 more interesting and challenging
7528
7529 1883
7530 01:26:36,908 --> 01:26:41,768
7531 instance level understanding problems
7532
7533 1884
7534 01:26:39,000 --> 01:26:44,349
7535 but there's still a lot of major challenges
7536
7537 1885
7538 01:26:41,769 --> 01:26:47,559
7539 in my opinion that remain and I just
7540
7541 1886
7542 01:26:44,349 --> 01:26:51,190
7543 listed a few here that were sort of
7544
7545 1887
7546 01:26:47,559 --> 01:26:53,559
7547 at the top of my mind and you can read
7548
7549 1888
7550 01:26:51,189 --> 01:26:57,149
7551 these off the slide so with that I'd
7552
7553 1889
7554 01:26:53,559 --> 01:26:57,150
7555 like to end and take any questions
7556
7557 1890
7558 01:26:57,869 --> 01:27:04,469
7559 [Applause]
7560
7561 1891
7562 01:27:00,250 --> 01:27:06,849
7563 [Music]
7564
7565 1892
7566 01:27:04,469 --> 01:27:08,618
7567 would you still use the two-stage
7568
7569 1893
7570 01:27:06,849 --> 01:27:14,880
7571 approach if you had only one or two
7572
7573 1894
7574 01:27:08,618 --> 01:27:17,828
7575 classes to detect so I think that this
7576
7577 1895
7578 01:27:14,880 --> 01:27:20,590
7579 ultimately comes down to what you're
7580
7581 1896
7582 01:27:17,828 --> 01:27:22,479
7583 trying to do so depending on your task
7584
7585 1897
7586 01:27:20,590 --> 01:27:24,309
7587 the computational budget whether you
7588
7589 1898
7590 01:27:22,479 --> 01:27:26,919
7591 just want boxes or if you want masks or
7592
7593 1899
7594 01:27:24,309 --> 01:27:28,960
7595 key points and all of these issues are
7596
7597 1900
7598 01:27:26,920 --> 01:27:31,179
7599 going to affect what you decide to do so
7600
7601 1901
7602 01:27:28,960 --> 01:27:34,538
7603 if you just want boxes and they're just
7604
7605 1902
7606 01:27:31,179 --> 01:27:36,550
7607 two classes then you might get
7608
7609 1903
7610 01:27:34,538 --> 01:27:37,809
7611 better results for your specific
7612
7613 1904
7614 01:27:36,550 --> 01:27:44,800
7615 application with the one stage approach
7616
7617 1905
7618 01:27:37,809 --> 01:27:46,840
7619 I have two questions the first one how
7620
7621 1906
7622 01:27:44,800 --> 01:27:48,940
7623 fast is Mask R-CNN compared with
7624
7625 1907
7626 01:27:46,840 --> 01:27:51,578
7627 Faster R-CNN I mean I know that maybe
7628
7629 1908
7630 01:27:48,939 --> 01:27:53,288
7631 the branch is in parallel but really is it
7632
7633 1909
7634 01:27:51,578 --> 01:27:56,018
7635 a bottleneck for it or not and
7636
7637 1910
7638 01:27:53,288 --> 01:27:58,328
7639 the second question is you show some
7640
7641 1911
7642 01:27:56,019 --> 01:28:01,389
7643 very nice qualitative results which is
7644
7645 1912
7646 01:27:58,328 --> 01:28:03,308
7647 like the people surfing but there
7648
7649 1913
7650 01:28:01,389 --> 01:28:06,130
7651 is a very nice reflection over the water
7652
7653 1914
7654 01:28:03,309 --> 01:28:09,010
7655 why Mask R-CNN didn't detect
7656
7657 1915
7658 01:28:06,130 --> 01:28:10,769
7659 it I mean I guess it should right yeah
7660
7661 1916
7662 01:28:09,010 --> 01:28:14,079
7663 okay so for the first question
7664
7665 1917
7666 01:28:10,769 --> 01:28:16,960
7667 so the overall runtime of this system
7668
7669 1918
7670 01:28:14,078 --> 01:28:19,090
7671 again as I was trying to illustrate with
7672
7673 1919
7674 01:28:16,960 --> 01:28:21,219
7675 the trade-offs is that there is no one
7676
7677 1920
7678 01:28:19,090 --> 01:28:22,690
7679 runtime of the system you can change a
7680
7681 1921
7682 01:28:21,219 --> 01:28:24,069
7683 whole lot of different factors you can
7684
7685 1922
7686 01:28:22,689 --> 01:28:25,359
7687 change the input resolution the
7688
7689 1923
7690 01:28:24,069 --> 01:28:27,189
7691 backbone network you're using the number of
7692
7693 1924
7694 01:28:25,359 --> 01:28:30,399
7695 proposals and get different points in
7696
7697 1925
7698 01:28:27,189 --> 01:28:32,859
7699 the operating curve but a sort of state
7700
7701 1926
7702 01:28:30,399 --> 01:28:36,488
7703 of the art type system is about 200
7704
7705 1927
7706 01:28:32,859 --> 01:28:38,768
7707 milliseconds per image and it's I
7708
7709 1928
7710 01:28:36,488 --> 01:28:41,018
7711 think if I remember correctly maybe
7712
7713 1929
7714 01:28:38,769 --> 01:28:44,469
7715 about 40 milliseconds overhead compared
7716
7717 1930
7718 01:28:41,019 --> 01:28:46,090
7719 to a Faster R-CNN okay
7720
7721 1931
7722 01:28:44,469 --> 01:28:48,550
7723 yeah and then the second question so
7724
7725 1932
7726 01:28:46,090 --> 01:28:50,380
7727 this is most likely just reflecting the
7728
7729 1933
7730 01:28:48,550 --> 01:28:53,110
7731 bias so if you want to call it that in
7732
7733 1934
7734 01:28:50,380 --> 01:28:55,208
7735 the data set so the data set most likely
7736
7737 1935
7738 01:28:53,109 --> 01:28:58,658
7739 doesn't have any water reflections
7740
7741 1936
7742 01:28:55,208 --> 01:29:00,188
7743 annotated as people and if it had a lot
7744
7745 1937
7746 01:28:58,658 --> 01:29:02,379
7747 of those annotated I suspect it would
7748
7749 1938
7750 01:29:00,189 --> 01:29:04,869
7751 detect them if it only has a few it
7752
7753 1939
7754 01:29:02,380 --> 01:29:06,458
7755 probably will miss them I suspect the
7756
7757 1940
7758 01:29:04,868 --> 01:29:09,609
7759 dataset has almost none so it's
7760
7761 1941
7762 01:29:06,458 --> 01:29:15,880
7763 effectively just a representation of the
7764
7765 1942
7766 01:29:09,609 --> 01:29:19,179
7767 task that it was trained for do you plan on open
7768
7769 1943
7770 01:29:15,880 --> 01:29:22,349
7771 sourcing Mask R-CNN as some kind
7772
7773 1944
7774 01:29:19,179 --> 01:29:25,779
7775 of reference implementation and so then
7776
7777 1945
7778 01:29:22,349 --> 01:29:27,849
7779 what's your like timeframe and if no is
7780
7781 1946
7782 01:29:25,779 --> 01:29:29,800
7783 there some other like open source
7784
7785 1947
7786 01:29:27,849 --> 01:29:32,650
7787 implementation I see one in tensorflow
7788
7789 1948
7790 01:29:29,800 --> 01:29:34,510
7791 is it like you know are you confident
7792
7793 1949
7794 01:29:32,649 --> 01:29:38,170
7795 that it's like pretty much the same like
7796
7797 1950
7798 01:29:34,510 --> 01:29:39,780
7799 your reference implementation yes so we
7800
7801 1951
7802 01:29:38,170 --> 01:29:42,158
7803 do plan to open source that it's
7804
7805 1952
7806 01:29:39,779 --> 01:29:43,359
7807 tentative with the big emphasis on
7808
7809 1953
7810 01:29:42,158 --> 01:29:48,670
7811 tentative because I don't want to make
7812
7813 1954
7814 01:29:43,359 --> 01:29:50,708
7815 any promises time frame is October so
7816
7817 1955
7818 01:29:48,670 --> 01:29:51,300
7819 time is up so we will have one more
7820
7821 1956
7822 01:29:50,708 --> 01:29:54,880
7823 question
7824
7825 1957
7826 01:29:51,300 --> 01:29:57,400
7827 hi there's a paper published at last
7828
7829 1958
7830 01:29:54,880 --> 01:30:00,400
7831 year's CVPR called HyperNet
7832
7833 1959
7834 01:29:57,399 --> 01:30:03,518
7835 where basically feature maps are
7836
7837 1960
7838 01:30:00,399 --> 01:30:06,509
7839 concatenated before being fed
7840
7841 1961
7842 01:30:03,519 --> 01:30:10,659
7843 to the branches of RPN and Fast R-CNN it
7844
7845 1962
7846 01:30:06,510 --> 01:30:12,998
7847 seems plausible that HyperNet would
7848
7849 1963
7850 01:30:10,658 --> 01:30:16,359
7851 have better performance since all
7852
7853 1964
7854 01:30:12,998 --> 01:30:19,630
7855 feature maps are exploited nevertheless
7856
7857 1965
7858 01:30:16,359 --> 01:30:21,759
7859 it's not the case so what do you think
7860
7861 1966
7862 01:30:19,630 --> 01:30:24,449
7863 is the reason that Feature Pyramid
7864
7865 1967
7866 01:30:21,760 --> 01:30:27,610
7867 Network performs better than HyperNet
7868
7869 1968
7870 01:30:24,448 --> 01:30:31,328
7871 so if I recall correctly and you can
7872
7873 1969
7874 01:30:27,609 --> 01:30:34,149
7875 correct me if I'm wrong I believe that
7876
7877 1970
7878 01:30:31,328 --> 01:30:36,038
7879 in the HyperNet approach there's a
7880
7881 1971
7882 01:30:34,149 --> 01:30:38,018
7883 combination of features from a variety
7884
7885 1972
7886 01:30:36,038 --> 01:30:39,639
7887 of different scales but then the
7888
7889 1973
7890 01:30:38,019 --> 01:30:42,820
7891 predictions are made from a single
7892
7893 1974
7894 01:30:39,639 --> 01:30:46,510
7895 combined representation rather than from
7896
7897 1975
7898 01:30:42,819 --> 01:30:48,578
7899 multiple levels and you could resize the
7900
7901 1976
7902 01:30:46,510 --> 01:30:51,489
7903 concatenated spatial map into
7904
7905 1977
7906 01:30:48,578 --> 01:30:55,118
7907 different sizes right so what
7908
7909 1978
7910 01:30:51,488 --> 01:30:55,748
7911 we observed is that for region proposal
7912
7913 1979
7914 01:30:55,118 --> 01:30:57,880
7915 generation
7916
7917 1980
7918 01:30:55,748 --> 01:31:00,550
7919 having multiple levels in the future
7920
7921 1981
7922 01:30:57,880 --> 01:31:03,219
7923 pyramid was quite important
7924
7925 1982
7926 01:31:00,550 --> 01:31:05,019
7927 including having low resolution ones all
7928
7929 1983
7930 01:31:03,219 --> 01:31:06,170
7931 right sorry I mean including having high
7932
7933 1984
7934 01:31:05,019 --> 01:31:09,140
7935 resolution ones
7936
7937 1985
7938 01:31:06,170 --> 01:31:11,890
7939 for detecting small objects right so so
7940
7941 1986
7942 01:31:09,140 --> 01:31:14,360
7943 I suspect that with HyperNet
7944
7945 1987
7946 01:31:11,890 --> 01:31:16,550
7947 perhaps the sort of missing thing was
7948
7949 1988
7950 01:31:14,359 --> 01:31:21,729
7951 just like not having quite enough
7952
7953 1989
7954 01:31:16,550 --> 01:31:24,560
7955 spatial resolution but I'm not 100% sure
7956
7957 1990
7958 01:31:21,729 --> 01:31:27,079
7959 yeah I think that might be the reason
7960
7961 1991
7962 01:31:24,560 --> 01:31:29,650
7963 thank you very much oh this is the end
7964
7965 1992
7966 01:31:27,079 --> 01:31:29,649
7967 of the first
7968
7969
