Well, for visual recognition, basically I will just talk about image classification. Actually I wanted to talk about object detection, but because Ross is here you will see an even better talk about object detection, so I will just talk about classification.

In this talk I will first give some introduction to image recognition and image classification. Then I will review the convolutional neural networks of the past few years, starting from LeNet and moving to AlexNet, VGG, and GoogLeNet. Then I will review ResNet and also introduce our recent work, ResNeXt, which will be presented at this CVPR. By the way, the slides of this tutorial will be available online after the tutorial.

So in the past few years we have been witnessing a revolution of depth, and ImageNet classification is a very good benchmark for this revolution. In the first two years of this competition the models were shallow
and the accuracy was usually not very good. Then in 2012 the famous AlexNet, one of the first convolutional neural networks applied to this large-scale dataset, greatly reduced the error rate, by about 10 percent on this dataset. After two years we saw another significant improvement from VGG and GoogLeNet, which increased the depth from about 8 layers to over 20 layers. Two years ago, in 2015, the deep residual network again significantly increased the depth, to over 100 layers, and we saw another improvement in accuracy.

These deep neural networks are also the engines of visual recognition, and object detection is a very good benchmark for this. For example, on the PASCAL VOC 2007 detection benchmark, the most popular model, the HOG-based DPM, which is a shallow model, achieved only
about 34 percent mAP on this dataset. After AlexNet was introduced, together with the region-based convolutional neural network, or R-CNN, the mAP was significantly improved, by about 20 percent on this dataset. And if we replace the features, going from AlexNet to VGG to ResNet, we see another very big improvement on this challenging detection task.

So currently, ResNet and its extensions are the leading models for many popular visual recognition benchmarks, such as object detection on COCO, semantic segmentation and instance segmentation on COCO, VOC, ADE, or Cityscapes, and many other datasets. It has also been applied to extract visual features for visual reasoning, such as VQA or CLEVR, and it can be used to improve video understanding. Actually, if you search for ResNet on the ImageNet 2016
result webpage, you will see about 200 entries, and this demonstrates the prevalence of these models in computer vision.

So now let's see how a computer recognizes an image, before the prevalence of deep learning. Let's think about the simplest case: given an image, we can just represent it as pixels, and then we can train a classifier on top of these pixels. The classifier can be very simple; it can be a nearest-neighbor classifier, an SVM, or a random forest. It is unlikely for this model to work, so we can make it a little more complicated. For example, we can extract edges from the image, and this representation of edges can be somewhat invariant to factors such as color, illumination, or some noise. Following this theme, we can build more and more complicated visual features to do better visual recognition, depending on
how much variance and invariance we want to have. For example, popular features such as SIFT and HOG usually extract edges and then build local histograms of the orientations of those edges. This type of histogram can be thought of as a kind of normalization and pooling operation, locally applied to the edges. We can then apply a classifier on top of these features. One step further, we can build bag-of-words models using k-means or sparse coding, or even higher-order Fisher vectors or VLAD, on these SIFT or HOG features, and after that we still train a classifier on top. These were usually the state-of-the-art models before the prevalence of deep learning.

From this progression we can see that our models went from a simpler form to a more complicated form, and from a shallower one to a deeper one.
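The local orientation-histogram idea described above can be sketched in a few lines of Python. This is a toy illustration only; real SIFT and HOG descriptors add cells, blocks, and contrast normalization, none of which are shown here.

```python
import numpy as np

def orientation_histogram(image, bins=8):
    # Gradients along rows and columns; their angle gives the edge orientation.
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx) % np.pi  # fold directions, keep orientation only
    # Magnitude-weighted histogram of orientations, like a local pooling step.
    hist, _ = np.histogram(angle, bins=bins, range=(0, np.pi), weights=magnitude)
    return hist / (hist.sum() + 1e-8)   # normalize the descriptor

# An intensity ramp along x: all gradient energy falls in one orientation bin.
img = np.tile(np.arange(16.0), (16, 1))
print(orientation_histogram(img))
```

On this ramp essentially all the mass lands in the first orientation bin: the histogram forgets where the edges are but remembers their dominant orientation, which is exactly the invariance-versus-selectivity trade-off mentioned above.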
But what can we do if we want an even better model? This is actually the motivation for learning deep features. In traditional computer vision we design specialized components, which require domain knowledge: we want to extract something called edges, we want to extract something called histograms, or we want to extract their orientations. All of these components require specific domain knowledge of the problem at hand, and this required knowledge limits the number of components we can design. On the other hand, in order to learn deeper features, in a deep neural network we don't need these specialized components; we usually just need some generic components, nowadays called layers, such as convolutions, ReLU, normalizations, or pooling. A generic component
requires less domain knowledge, so we can just repeat these elementary layers, which means it is easy for us to create deeper neural networks without much domain knowledge. This gives us a richer solution space when going deeper, and we can still easily train these networks using end-to-end learning algorithms such as backpropagation. This is the fundamental motivation for deep learning and convolutional neural networks.

In the following I will review a few typical neural networks of the last few years. First, let's recap LeNet, which was actually developed 20 or 30 years ago. Here are the basic components of LeNet, which are still popular in modern convnets. The first key component is convolution. Convolution is actually a locally connected layer, and, more
importantly, it is a layer with spatially shared weights. In my opinion, weight sharing is the key in deep learning. In the case of convolutional neural networks we share weights across the spatial domain; in the case of recurrent neural networks we share weights temporally. By doing weight sharing we can significantly reduce the number of parameters in the model, while still keeping very good capacity. On the other hand, LeNet also has another key component: subsampling. Nowadays we still do this type of subsampling, using pooling or simply strided convolutions. In the case of LeNet, the network ends with some fully connected layers. Typically, we can think of the last layer as a linear classifier, very similar to an SVM. The entire architecture can be trained end to end by backpropagation.
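The parameter savings from spatial weight sharing can be made concrete with a quick count; the sizes below are illustrative, not taken from any specific network in the talk.

```python
# A fully connected layer mapping a 32x32x3 input to 32x32x16 outputs needs
# one weight per input-output pair:
fc_params = (32 * 32 * 3) * (32 * 32 * 16)
# A 3x3 convolution with 16 output channels reuses one 3x3x3 filter per
# output channel at every spatial position:
conv_params = 3 * 3 * 3 * 16
print(fc_params, conv_params)  # 50331648 vs 432
```

The shared-weight layer has over 100,000 times fewer parameters for the same output size, which is why weight sharing preserves capacity per parameter so well.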
All of these components are still key components in today's convnets.

In 2012 the famous AlexNet crushed the ImageNet classification challenge. AlexNet uses a LeNet-style backbone, but it has some key improvements. The first improvement is ReLU, or rectified linear units. In some sense, ReLU is one of the reasons for the recent revolution of deep learning, because it can accelerate training thanks to better gradient propagation, compared with typical activations such as tanh or sigmoid. Another key component of AlexNet is the dropout operation, which is essentially in-network ensembling. In the case of AlexNet, or later on VGG, which have very large fully connected layers with many parameters, dropout can help to reduce overfitting.
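To make the gradient-propagation point about ReLU concrete, here is a minimal numerical sketch (my own illustration, not from the talk): the sigmoid derivative never exceeds 0.25, so a deep chain of sigmoid layers multiplies gradients by many small factors, while ReLU passes gradients through unchanged wherever its input is positive.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)  # peaks at 0.25 when x == 0

x = np.linspace(-3.0, 3.0, 7)
print(sigmoid_grad(x).max())  # 0.25 at best
print(0.25 ** 20)             # a 20-layer chain of best-case factors: ~9e-13
# ReLU's derivative is exactly 1 for positive inputs, so no such shrinkage
# occurs along the active paths.
```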
It is also worth mentioning that dropout can be replaced by batch normalization in some later networks.

Another key contribution of the AlexNet design, in my opinion, is the data augmentation step. Data augmentation is a kind of label-preserving transformation: you can do random cropping, random scaling, or flipping, which can virtually create more data from the existing data on hand. This matters because even if we have 1 million images from the internet, the amount of data is still somewhat limited compared with the size of the neural network. So data augmentation is also one of the key reasons for the recent success of neural networks, and it can also help to reduce overfitting.
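A minimal sketch of the label-preserving transformations described above, assuming images are NumPy arrays of shape (height, width, channels); the crop size and flip probability are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, crop=24):
    # Random crop: take a crop x crop window at a random position.
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    # Random horizontal flip: mirrors the image, label unchanged.
    if rng.random() < 0.5:
        patch = patch[:, ::-1]
    return patch  # same label as the input image

img = np.zeros((32, 32, 3), dtype=np.uint8)
print(augment(img).shape)  # (24, 24, 3)
```

Each call can return a different view of the same image, which is how a fixed dataset virtually becomes a much larger one.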
Another milestone, in my opinion, is the VGG network. I still remember when we first saw the ImageNet 2014 result webpage and discovered that there were some networks with 16 or even 19 layers; our comment was simply that this was beyond our imagination. Actually, the VGG networks are very simple, which makes them very reusable. In my opinion, the key design of the VGG networks is modularization: they just stack a lot of 3x3 convolutions following some very simple rules. In the same stage of the network, all the layers have the same shape, and when the spatial size is reduced, the number of filters is doubled, so as to roughly keep the computation constant for each module. This is very simple: when we go deeper and deeper, we don't need to design new layers; we can just use the same template.
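The "halve the spatial size, double the filters" rule can be checked with simple arithmetic; the map and channel sizes below are illustrative, not VGG's exact configuration.

```python
def conv3x3_cost(spatial, c_in, c_out):
    # Multiply-adds of a 3x3 convolution on a spatial x spatial feature map.
    return spatial * spatial * c_in * c_out * 3 * 3

stage_a = conv3x3_cost(56, 128, 128)   # before downsampling
stage_b = conv3x3_cost(28, 256, 256)   # half the size, double the filters
print(stage_a == stage_b)  # True: the per-layer computation is unchanged
```

Halving each spatial dimension divides the cost by 4, while doubling both the input and output channels multiplies it by 4, so the two effects cancel exactly.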
When the VGG network was first published, it was trained using stage-wise training: people started from a shallower network with 11 layers and then gradually increased the depth to 13 and then 16 layers. This type of stage-wise training is not practical, because it is time-consuming and, in some sense, not end to end. Actually, the VGG network can be trained from scratch if we have a better initialization.

So next I will briefly talk about initialization techniques. Let's think about a layer whose input is x, whose weight matrix is W, and whose output is just y = Wx. If we assume that the activations are linear and that x, y, and W are independent, then from some basic math we can show that Var[y] = (n Var[w]) Var[x]: the variance of y equals the variance of x multiplied by a factor, and this factor is the number of neurons n multiplied by the variance of the weights. This is the case for one layer. If we have multiple layers, we
can see that the variance of the output of the whole network is also proportional to the variance of the input of the network, up to a factor that is the product of the factors of every single layer. This is actually the reason for the famous vanishing gradient or exploding gradient problem. In some sense it is not just about gradients: both the forward pass and the backward pass can explode in this way. If the weights of the network are slightly smaller than ideal, the signals vanish, because we are multiplying many small numbers; on the other hand, if they are slightly larger, the signals will just explode. This is why we need some careful initialization. One of the most popular methods is the so-called Xavier initialization, which was developed under
some linear assumptions. Basically, it says that we want some constant factor in the forward pass and also some constant factor in the backward pass, and a simple choice is that this factor is one for every single layer, for either the forward pass or the backward pass. This Xavier initialization is very useful; however, it was developed under the linear assumption. Actually, we can show that if the activation is ReLU, we can also derive a nice analytic result, which simply modifies the Xavier case above by a factor of 1/2. This factor is important, because if we have D layers and a factor of 2 per layer, we get an exponential effect of 2 to the power of D. And we can see that this kind of exponential effect is very prevalent in deep neural networks if we don't do things right.
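The effect of that factor of 1/2 can be checked numerically. The sketch below (my own simulation, not from the talk) propagates a random vector through a stack of random linear-plus-ReLU layers and tracks the mean square of the activations: with the Xavier-style weight variance Var(w) = 1/n the signal shrinks by roughly half per layer, while the ReLU-corrected variance Var(w) = 2/n keeps it roughly constant.

```python
import numpy as np

def mean_square_after(depth, scale, n=512, seed=0):
    # Propagate x through `depth` layers of y = relu(W x), where the entries
    # of W are drawn with variance scale / n.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    for _ in range(depth):
        w = rng.standard_normal((n, n)) * np.sqrt(scale / n)
        x = np.maximum(w @ x, 0.0)
    return float((x ** 2).mean())

print(mean_square_after(20, 1.0))  # Xavier-style 1/n: shrinks toward ~2^-20
print(mean_square_after(20, 2.0))  # MSRA-style 2/n: stays on the order of 1
```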
So basically, in my opinion, the Xavier or the MSRA initialization is required for training VGG-16 or VGG-19 from scratch, and if we want to train even deeper plain networks, for example with 20 layers, usually we have to use the MSRA initialization. Unfortunately, with this number of layers, deeper plain networks are usually not better; this is why we had to develop ResNet, which we will discuss later. On the other hand, this type of initialization is still useful: for example, if we want to do fine-tuning for detection or segmentation, where the network has some newly initialized layers, in this case I recommend trying the Xavier or MSRA initialization first. It is also worth mentioning that the mathematical result
of these two types of initialization does not directly apply to multi-branch networks such as GoogLeNet. Actually, the same derivation methodology is still applicable, so we can still do the derivation for different types of inception blocks, but usually we don't need to do this anymore, because we have batch normalization, which we will also discuss later.

Also in 2014, another very successful neural network was designed: GoogLeNet, later also called Inception. GoogLeNet is known for its very good accuracy and, at the same time, its small footprint. There are many complicated designs in GoogLeNet, and I would like to summarize them into three main points. The first property of GoogLeNet is that it is a multi-branch architecture: for example, in the original version it has a 1x1
1450 00:17:16,019 --> 00:17:21,689 1451 branch and then three by three five by 1452 1453 364 1454 00:17:18,659 --> 00:17:24,539 1455 five and pulling and another very 1456 1457 365 1458 00:17:21,689 --> 00:17:26,579 1459 interesting behavior which may be kind 1460 1461 366 1462 00:17:24,538 --> 00:17:29,640 1463 of an acident in the original kikuna is 1464 1465 367 1466 00:17:26,578 --> 00:17:32,250 1467 the usage of socket so as we can see 1468 1469 368 1470 00:17:29,640 --> 00:17:35,880 1471 there is a standalone 1x1 convolution in 1472 1473 369 1474 00:17:32,250 --> 00:17:38,700 1475 the original inception block and this 1476 1477 370 1478 00:17:35,880 --> 00:17:40,649 1479 one by one shortcut is merged into the 1480 1481 371 1482 00:17:38,700 --> 00:17:44,130 1483 other branches by condition in the 1484 1485 372 1486 00:17:40,648 --> 00:17:46,859 1487 original inception so 1488 1489 373 1490 00:17:44,130 --> 00:17:49,110 1491 we will see these one-by-one Shakur has 1492 1493 374 1494 00:17:46,859 --> 00:17:53,819 1495 been kept in almost older following 1496 1497 375 1498 00:17:49,109 --> 00:17:55,679 1499 generations of inceptions so in my 1500 1501 376 1502 00:17:53,819 --> 00:17:59,069 1503 understanding this one by one shortcut 1504 1505 377 1506 00:17:55,680 --> 00:18:03,570 1507 somehow helps optimization of this very 1508 1509 378 1510 00:17:59,069 --> 00:18:06,059 1511 complicated designs of google nets and 1512 1513 379 1514 00:18:03,569 --> 00:18:08,339 1515 also at that time google net is also 1516 1517 380 1518 00:18:06,059 --> 00:18:11,909 1519 kind of a photo net architecture 1520 1521 381 1522 00:18:08,339 --> 00:18:14,909 1523 so for each inception block there are 1524 1525 382 1526 00:18:11,910 --> 00:18:16,920 1527 some 1x1 convolutions to reduce the 1528 1529 383 1530 00:18:14,910 --> 00:18:20,580 1531 number of channels before doing the 1532 1533 384 1534 00:18:16,920 --> 00:18:22,769 1535 expensive 3x3 and 5x5 convolution so 1536 1537 385 
1538 00:18:20,579 --> 00:18:24,720 1539 this is kind of alternate representation 1540 1541 386 1542 00:18:22,769 --> 00:18:27,529 1543 in in terms of the number of channels 1544 1545 387 1546 00:18:24,720 --> 00:18:30,390 1547 and in the original Google net because 1548 1549 388 1550 00:18:27,529 --> 00:18:32,879 1551 after the 3 by 3 or 5 by 5 convolution 1552 1553 389 1554 00:18:30,390 --> 00:18:37,830 1555 the journals are calculated so usually 1556 1557 390 1558 00:18:32,880 --> 00:18:42,630 1559 they don't need to do some dimension 1560 1561 391 1562 00:18:37,829 --> 00:18:45,179 1563 increase in that case so there are many 1564 1565 392 1566 00:18:42,630 --> 00:18:47,790 1567 other versions of Google net or in 1568 1569 393 1570 00:18:45,180 --> 00:18:50,670 1571 sumption developed after that in my 1572 1573 394 1574 00:18:47,789 --> 00:18:52,589 1575 opinion all those inception templates 1576 1577 395 1578 00:18:50,670 --> 00:18:55,740 1579 still have more or less the same three 1580 1581 396 1582 00:18:52,589 --> 00:18:59,509 1583 main poverty's multiple branches socket 1584 1585 397 1586 00:18:55,740 --> 00:19:02,579 1587 and bottleneck so I believe this other 1588 1589 398 1590 00:18:59,509 --> 00:19:07,859 1591 key component in the success of Google 1592 1593 399 1594 00:19:02,579 --> 00:19:10,409 1595 net so as we mentioned before are the 1596 1597 400 1598 00:19:07,859 --> 00:19:12,479 1599 zombie or MSI initialization are not 1600 1601 401 1602 00:19:10,410 --> 00:19:18,690 1603 directly applicable for multiple branch 1604 1605 402 1606 00:19:12,480 --> 00:19:20,519 1607 network such as Google net for Google 1608 1609 403 1610 00:19:18,690 --> 00:19:22,650 1611 net can still be optimized very 1612 1613 404 1614 00:19:20,519 --> 00:19:27,629 1615 successfully thanks to the introduction 1616 1617 405 1618 00:19:22,650 --> 00:19:30,150 1619 of personalization or called B n so M 1620 1621 406 1622 00:19:27,630 --> 00:19:32,640 1623 understanding BN is 
also a milestone technique in the recent deep learning revolution. Basically, even before the recent prevalence of deep learning, people had long realized that if we want to train a neural network, we at least want to normalize the inputs by their mean and standard deviation. On the other hand, the development of the Xavier or MSRA initialization analytically normalizes the mean and standard deviation of each layer, based on some strong assumptions such as linearity or independence. Batch normalization, in contrast, is a kind of data-driven normalization of each layer, and it is done per mini-batch; this is why it is called batch normalization. Batch norm can greatly accelerate training, make the network less sensitive to initialization, and it can also help generalization because of the noise introduced into the statistics.

Here is a simple formulation of batch norm. Given the outputs of any layer, which we denote as x, we compute the mean mu and standard deviation sigma of x within the mini-batch. Then we normalize the mini-batch by subtracting the mean and dividing by the standard deviation. After that, we still learn a new scale gamma and a new shift beta, which compensate for the loss of representation power caused by the normalization. In this sense, mu and sigma are functions of the activations, while the scale gamma and the shift beta are parameters to be learned; they are just analogous to weights.

This means there are two modes for batch normalization. The first is the training mode, in which mu and sigma are functions of a batch of activations. In the case of testing, the statistics mu and sigma are pre-computed on the training set, and they are frozen and fixed. In my research experience, the difference between these two modes creates many practical problems in implementations, so I just want to note that we usually have to make sure our batch normalization usage and implementation are correct; it is the cause of bugs in many cases. Anyway, batch norm is great: it can greatly help accelerate training, and most of the time it can improve accuracy.
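The two batch-norm modes just described can be written down in a few lines. This is a minimal NumPy sketch, not the original implementation; the function names and the momentum value are my own choices for illustration:

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running_mean, running_var,
                    momentum=0.9, eps=1e-5):
    """Training mode: normalize with the statistics of this mini-batch,
    and update the running statistics used later at test time."""
    mu = x.mean(axis=0)            # per-channel mean over the batch
    var = x.var(axis=0)            # per-channel variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    # learned scale and shift restore representation power
    out = gamma * x_hat + beta
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return out, running_mean, running_var

def batchnorm_test(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Test mode: the statistics are frozen; no dependence on the batch."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```

With gamma = 1 and beta = 0, the training-mode output has zero mean and unit variance per channel; the test-mode function is deterministic, which is exactly the train/test mismatch the speaker warns about.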
So now we are ready to introduce ResNet. We have very good initializations and we have batch norm, so can we now train even deeper neural networks? Actually, we have tried it. Here we introduce the concept of a plain network, which is just repeatedly stacked 3x3 convolutional layers. We trained such networks on the small CIFAR-10 dataset, in a 20-layer version and a 56-layer version. Somewhat surprisingly, we found that the deeper version is not better than the shallower version; even worse, the training error of the deeper version is higher than the training error of the shallower version. We found this is a general phenomenon, observed on almost all datasets and for many types of models, as long as the models are plain networks.

However, this is counterintuitive in some sense. Think of a shallow model, say with 18 layers; then think of a deeper model which is a counterpart of the shallower one, say with 34 layers. The deeper model has a richer solution space, so it should not have higher training error than its shallower counterpart. We can see this with a solution by construction: say we have a well-trained shallow model; we can copy the weights from this model into the deeper model, and simply set the extra layers in the deeper model to the identity. The existence of this solution for the deeper model indicates that it should reach at least more or less the same training error as the shallower model. So the degradation problem observed in experiments indicates that there might be some optimization difficulties for the current solvers: the solvers cannot find such a solution when we create deeper and deeper models. This is the motivation for developing the deep residual network.

Think about any two or three (or any number of) consecutive layers in a plain net as a small subnet, and let H(x) be the desired mapping to be learned by this small subnet. Normally we just hope the small subnet fits this mapping directly. In residual learning, instead of directly fitting H(x), we let the small subnet fit another mapping F(x), and then let the desired mapping be the summation of F(x) and x. In this case, F(x) is a residual mapping with respect to the identity. The heuristic is that if the identity is optimal, it should be easy to just set all the weights to zero; and if the optimal mapping is close to the identity, it should also be easy to find small fluctuations around the identity. This is the basic idea of residual learning, and actually it works very well.

Here are the experiments on the CIFAR-10 dataset: on the left-hand side are the results of plain networks, and on the right-hand side the results of deep residual networks. As we can see, the deep ResNets can be trained without difficulty, and the deep ResNets have lower training error, which also carries over to the test error. This means that residual learning can help the optimization. We also tried this on the ImageNet dataset. A practical design when we go even deeper is to also introduce a bottleneck design into the residual block.
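The residual formulation and the bottleneck trade-off can be sketched with a toy NumPy example. The linear residual branch below is a stand-in for the real convolutional one, and the 256-64-256 channel sizes follow the bottleneck block described in the ResNet paper; everything else is illustrative:

```python
import numpy as np

def residual_block(x, f):
    """The subnet only has to fit the residual F(x); the output is F(x) + x."""
    return f(x) + x

# If the identity is the optimal mapping, the solver just has to drive the
# residual branch to zero -- e.g. all-zero weights in this linear stand-in.
W = np.zeros((16, 16))
x = np.ones((4, 16))
assert np.allclose(residual_block(x, lambda v: v @ W), x)

# Weight counts per block: two plain 3x3 convs on 256 channels versus the
# 1x1 (256->64), 3x3 (64->64), 1x1 (64->256) bottleneck.
plain_weights = 2 * (3 * 3 * 256 * 256)                      # 1,179,648
bottleneck_weights = 256 * 64 + 3 * 3 * 64 * 64 + 64 * 256   #    69,632
```

The bottleneck version of the block carries roughly 17x fewer weights at the same input/output width, which is what makes the very deep variants affordable.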
Instead of using two or three sequential 3x3 convolutions in each residual block, we first use a 1x1 convolution to reduce the number of channels, then we do the 3x3 convolution, and then we use another 1x1 convolution to increase the dimension back to the original. In my understanding, this bottleneck design is also helpful in addition to going deeper.

Here are the results of ResNet on the ImageNet dataset. We trained residual networks with 34 layers, 50 layers, 101 layers, and 152 layers. The deeper the model is, the lower the training and validation errors are. It is also worth mentioning that even our deepest model, with 152 layers, actually has lower time complexity (fewer FLOPs) than the earlier VGG models. This is because, if we get better representation power by going deeper, we can just use a smaller number of channels, which helps us greatly reduce the complexity.

Actually, ResNet is very popular even beyond computer vision. For example, a recent work on neural machine translation reported that they trained an 8-layer LSTM by stacking LSTM blocks, with the help of residual connections. The paper reported that usually people are not able to train LSTMs with more than four layers; with residual connections this becomes possible, and they even get some gains with up to eight layers, or maybe more. Residual connections have also been used for speech synthesis: in the paper called WaveNet, they construct a residual convolutional network on the 1D sequence. Unlike many works that use recurrent networks or LSTMs to do sequence-to-sequence learning, in this work they use
convolutional nets. The keys to their success, in my understanding, are two designs. The first is that they use dilations, which help capture long-term dependencies; the second is that they stack a lot of such layers, so the network has a very large receptive field and can capture even longer dependencies. When they stack that many layers, the residual connections are the key to their success. The same story also applies to speech recognition: in another paper, they likewise train a residual convolutional network on a 1D sequence with the help of residual connections.

Next I will talk about one extension of ResNet, which we call ResNeXt; this work will be presented at this CVPR. As I mentioned before, in my understanding there are at least three key components in the GoogleNet
or Inception design: shortcut, bottleneck, and multi-branch. In the original ResNet design we have the shortcuts and we have the bottleneck, but we didn't have the multi-branch design. ResNeXt is just a simple multi-branch component designed in the scenario of ResNet. Unlike Inception, which has a heterogeneous multi-branch building block, meaning the branches have different shapes, in ResNeXt all the branches share the same shape and the same number of filters.

The main observations in the ResNeXt paper are as follows. First, we find that concatenation and addition are interchangeable. This is a general property of deep neural networks, not limited to ResNeXt or ResNet. So if we have a uniform multi-branch network, we can use this property to convert it into group convolutions. For example, the original design of ResNeXt has many branches of the same shape. We can show that, instead of doing the addition in this block, we can insert a concatenation after the second set of layers in the block; after this concatenation, the per-branch 1x1 convolutions can be merged into a single, wider 1x1 convolution, and because of the concatenation we can also replace the preceding layer with a grouped convolutional layer, which can be implemented more efficiently than literally computing many branches.

Actually, this uniform multi-branch idea is very successful: we observed better accuracy than ResNet while manually keeping the number of FLOPs and parameters the same as the original ResNet.
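The interchangeability of addition and concatenation can be checked numerically. A 1x1 convolution acts independently at each spatial position, so it is written here as a plain matrix multiply; the cardinality and widths below are illustrative, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
branches, d, c_out = 32, 4, 256          # cardinality, per-branch width, output width

h = [rng.standard_normal((10, d)) for _ in range(branches)]     # branch features
W = [rng.standard_normal((d, c_out)) for _ in range(branches)]  # per-branch 1x1 conv

# (a) every branch ends in its own 1x1 conv; the outputs are merged by addition
added = sum(hi @ Wi for hi, Wi in zip(h, W))

# (b) concatenate the branch features, then apply one wide 1x1 conv whose
# weight matrix stacks the per-branch weights
merged = np.concatenate(h, axis=1) @ np.vstack(W)

assert np.allclose(added, merged)
```

Form (b) is what makes the grouped-convolution implementation possible: one wide layer replaces many small ones with identical output.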
00:32:23,609 2671 X XOR the number of epochs and Y XOR the 2672 2673 669 2674 00:32:20,069 --> 00:32:26,849 2675 number of top one error rate so the blue 2676 2677 670 2678 00:32:23,609 --> 00:32:30,028 2679 lines are the original ResNet and the 2680 2681 671 2682 00:32:26,849 --> 00:32:32,998 2683 red lines are the resin next model so 2684 2685 672 2686 00:32:30,028 --> 00:32:35,849 2687 the dashed lines other training error 2688 2689 673 2690 00:32:32,999 --> 00:32:38,399 2691 and the solid lines other validation 2692 2693 674 2694 00:32:35,849 --> 00:32:40,798 2695 error so we can see that when having the 2696 2697 675 2698 00:32:38,398 --> 00:32:42,868 2699 same number of lobes and parameters the 2700 2701 676 2702 00:32:40,798 --> 00:32:44,460 2703 rest next model is better than the 2704 2705 677 2706 00:32:42,868 --> 00:32:46,888 2707 original resonant model 2708 2709 678 2710 00:32:44,460 --> 00:32:49,590 2711 so actually what we learn from this is 2712 2713 679 2714 00:32:46,888 --> 00:32:53,298 2715 that we can have a better trade-off when 2716 2717 680 2718 00:32:49,589 --> 00:32:57,349 2719 we train larger and larger models and 2720 2721 681 2722 00:32:53,298 --> 00:33:00,388 2723 also this rift next model can be 2724 2725 682 2726 00:32:57,349 --> 00:33:02,368 2727 generalized very well to some other more 2728 2729 683 2730 00:33:00,388 --> 00:33:05,459 2731 complicated recognition type such as 2732 2733 684 2734 00:33:02,368 --> 00:33:09,028 2735 auto detection and instance segmentation 2736 2737 685 2738 00:33:05,460 --> 00:33:12,329 2739 so here are the results of rest next for 2740 2741 686 2742 00:33:09,028 --> 00:33:15,538 2743 mass RCN so we can see that our best 2744 2745 687 2746 00:33:12,329 --> 00:33:18,269 2747 Rest next model can improve the cocoa 2748 2749 688 2750 00:33:15,538 --> 00:33:22,170 2751 bounding box AP by 1 point 6 I also 2752 2753 689 2754 00:33:18,269 --> 00:33:25,440 2755 the math AP by 1.4 so all these results 2756 2757 
indicate that features do matter in the current visual recognition community.

There are also many architectures developed recently that are not covered in this tutorial. For example, there is Inception-ResNet, which uses Inception blocks as the transformation function, trained with the help of residual connections. On the other hand, a method called DenseNet was developed, which uses densely connected shortcuts merged by concatenation. Also at this CVPR there is a work called Xception, and later on another work called MobileNets, both of which are driven by so-called depthwise convolutions; depthwise convolutions are a kind of group convolution in which the number of groups equals the number of channels. Following this line, another recent work called ShuffleNet uses even more group and depthwise convolutions, with channel shuffle, to reduce the complexity of the models. Actually, as we can see from all these illustrations, there are still three key components in these designs: the first is the usage of shortcuts, the second is the bottleneck, and the third is multiple branching.

Finally, I would also like to mention a recent work done by Facebook AI Research and the Applied Machine Learning team: we are now able to train on the ImageNet dataset in one hour. Our setting uses 256 GPUs, and we can train an ImageNet model using a mini-batch size of 8192 with synchronized SGD. We trained a ResNet-50 on this dataset, and we have observed no loss of accuracy. As shown in the figure here, the x-axis is the mini-batch size and
the y-axis is the accuracy of 2932 2933 734 2934 00:35:42,960 --> 00:35:49,798 2935 the of the model so we can see that in a 2936 2937 735 2938 00:35:47,309 --> 00:35:55,319 2939 very wide spectrum of mini batch size 2940 2941 736 2942 00:35:49,798 --> 00:35:58,440 2943 from 64 to a k the models has more or 2944 2945 737 2946 00:35:55,318 --> 00:36:01,920 2947 less the same accuracy in this case up 2948 2949 738 2950 00:35:58,440 --> 00:36:05,460 2951 to some random variation so the key 2952 2953 739 2954 00:36:01,920 --> 00:36:08,338 2955 factors in in this algorithm are in two 2956 2957 740 2958 00:36:05,460 --> 00:36:11,429 2959 aspects the first aspect is a linear 2960 2961 741 2962 00:36:08,338 --> 00:36:14,338 2963 scaling learning rate in terms of the 2964 2965 742 2966 00:36:11,429 --> 00:36:16,019 2967 mini batch another factor is the warm-up 2968 2969 743 2970 00:36:14,338 --> 00:36:19,170 2971 of learning rates at the beginning of 2972 2973 744 2974 00:36:16,019 --> 00:36:22,048 2975 training and also in our paper we report 2976 2977 745 2978 00:36:19,170 --> 00:36:24,930 2979 some theoretical foundations of the 2980 2981 746 2982 00:36:22,048 --> 00:36:27,059 2983 usage of these two factors and also in 2984 2985 747 2986 00:36:24,929 --> 00:36:29,578 2987 my experience I found one of the most 2988 2989 748 2990 00:36:27,059 --> 00:36:32,099 2991 important factor is just to implement 2992 2993 749 2994 00:36:29,579 --> 00:36:36,869 2995 everything correctly in the case of 2996 2997 750 2998 00:36:32,099 --> 00:36:40,318 2999 multiple GPUs or multiple machines so in 3000 3001 751 3002 00:36:36,869 --> 00:36:42,568 3003 conclusion in my understanding the 3004 3005 752 3006 00:36:40,318 --> 00:36:44,599 3007 success of deep learning is the success 3008 3009 753 3010 00:36:42,568 --> 00:36:48,329 3011 of feature learning so in the case of 3012 3013 754 3014 00:36:44,599 --> 00:36:52,769 3015 visual recognition features still matter 3016 3017 755 3018 
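The two key factors just mentioned, the linear scaling rule and the gradual warm-up, can be sketched in a few lines. This is a minimal illustration assuming the reference values reported in the paper (base rate 0.1 for a mini-batch of 256, warm-up over the first 5 epochs); it is not the authors' actual training code.

```python
def learning_rate(epoch, step, steps_per_epoch, batch_size,
                  base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Linear scaling rule with a gradual warm-up.

    base_lr is the reference rate for base_batch; the target rate is
    scaled linearly with the mini-batch size, and during the first
    warmup_epochs we ramp linearly from base_lr up to that target.
    """
    target = base_lr * batch_size / base_batch   # linear scaling rule
    total_warmup = warmup_epochs * steps_per_epoch
    done = epoch * steps_per_epoch + step
    if done < total_warmup:                      # gradual warm-up phase
        return base_lr + (target - base_lr) * done / total_warmup
    return target
```

With batch size 8192 this ramps from 0.1 up to 3.2 over the warm-up and then holds the scaled rate (any later decay schedule is omitted here).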
The power of deep features can be demonstrated by the amazing visual recognition results on complicated tasks such as instance segmentation. Here I show a result of Mask R-CNN with ResNet-101, which will be covered in the next talk by Ross. So that's all, thank you.

Any questions? There is a microphone there.

Question: How do you determine the number of layers you want when you design the network, that's the first question. And the second question is, do you think the mathematics of deep learning is understandable?

Answer: Yes. So, the number of layers: before the development of deep ResNets, the number of layers was still kind of trial and error, because we didn't know when it would degrade. After the development of ResNet, in theory we can just go deeper, so now the number of layers is mainly limited by practical concerns such as memory or running time, or sometimes by the amount of data. For the second question, in my understanding there is much recent interesting work trying to explain the mathematics of deep learning, but it is still kind of an open question, so I hope we will see more results in the future.

Let's thank our speaker. Thank you.

[Music]

So our next speaker will be Ross Girshick. Ross is a research scientist at Facebook AI Research. He has done a lot of interesting and very influential work on object detection: he proposed R-CNN, and was also involved in the development of Fast R-CNN, Faster R-CNN, and, as the next step, Mask R-CNN. So today he will
talk about deep learning for instance-level detection and understanding. So let's welcome our speaker.

All right, thank you very much. I'll try to lean into this microphone so you can hear me. So I'll be talking today about deep learning for instance-level object understanding, and this is work with a bunch of collaborators at Facebook AI Research. First of all, to get started, thank you for being here. You know, there's tough competition for your attention, so I'm glad to see that the room is still full and that we won out. Now, there's a lot to cover, so let's dive right in.

So first, you've just seen this talk, and now you're all experts on deep representations, so now we can take that information and apply it to some other computer vision tasks, such as object detection and instance-level understanding. This is what object detection looked like in 2007, which is when I started my PhD. First of all, images were black and white, because this was forever ago, and, you know, we could detect things, but really object detectors were not working all that well. And this is what object detection looks like today. There are a few things you can see that are different here. One is that the scenes in which we can successfully detect objects are much more complicated than they were back in 2007. The second is that not only can we put a bounding box around objects, but we can actually now provide a lot more information about those objects; for example, we can provide detailed pixel-level instance segmentations. That's why the title of this talk is not just object detection but rather instance-level understanding: we're now moving away from simply detecting objects and putting boxes around them, to being able to find the objects and say what category they are, but also provide much richer information about what those objects are. As two examples of this: for the person category, which is a particular focus of study, we can provide keypoint pose estimation; and again for people, we can also now pretty reliably detect the activities those people are engaged in and what the target objects of those activities are.

So the outline for this talk is that I'll spend the first part going over Mask R-CNN, which is the system that produced some of the visualizations I was just showing, and then in the second part I'll try to provide a brief survey of the current
landscape of object detection using deep learning. This is really a vast area that's kind of exploded over the last few years, so it's impossible to cover all of it in any detail; I'll just hit upon a few of the points that I think are the most salient.

So let's start by specifying the task we're going to look at. We're going to look at instance segmentation, and I'll show it in contrast to a couple of other tasks. Object detection traditionally has been the task of putting boxes around the objects you're trying to detect. So here there are five people detected, and because there's a box around each one, we can say that there are exactly five instances. Now, there's this other problem called semantic segmentation, in which you don't want to use a box, because that's a very coarse representation for an object; instead, you want to try to label each pixel, so that you can much more precisely delineate which pixels are on the background versus which pixels are on an object. Compared to box-level detection this is an advance, but it's also a step backwards, because, at least when it's applied to things instead of stuff, you're no longer able to differentiate between different instances: all the person pixels get lumped together into one mass of person pixels. So instance segmentation is essentially the task that tries to take the best of both worlds from these. Not only do you get rid of this very crude box-level representation, replacing it with a much finer representation that delineates which pixels are part of the person and which are background, but unlike semantic segmentation you actually retain the notion of an instance, so you can tell, even with the segmentation, that there are still exactly five people.

So Mask R-CNN is a model developed to try to solve this task, and I'm going to start by providing an overview of how the model is structured. The first thing I want to impress upon you is that we're now in an era where we can build a whole lot of different modular components, and they often stack on top of each other in very useful and interesting ways. A typical system developed today is going to take the successful modules that were built in the last couple of years and then figure out a new way of adding some piece to that, so that it has some new, interesting, and useful property.
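As a toy illustration of the distinction (my own example, not from the talk): with semantic labels the five people collapse into a single "person" class, while instance labels keep them countable.

```python
# Toy 1-D "image": 0 = background.
# In the semantic mask every person pixel is just class 1,
# while in the instance mask each person keeps a distinct id.
semantic = [0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
instance = [0, 1, 1, 0, 2, 2, 2, 0, 3, 0, 4, 4, 0, 5]

def count_instances(mask):
    """Number of distinct non-background ids in a label mask."""
    return len(set(mask) - {0})

print(count_instances(semantic))  # semantic labels: the count is lost -> 1
print(count_instances(instance))  # instance labels: five people -> 5
```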
The way I'm going to describe this model is by walking through the sequence of modules it is built up from, and I'll try to provide sort of a bird's-eye view of how it's being built, so that you can keep the whole picture in context along the way. The first place I want to start is with the R-CNN, or region-based CNN, approach to object detection, because this is the general framework that Mask R-CNN operates in. This was originally developed with particular architecture and design choices, but here I've abstracted the concept a little bit away from those specific choices. So in general, there's some input image, and then there's some mechanism which produces object or region proposals. This could either be an external source, for example the very successful selective search algorithm, or it could be a sort of internal source, where the network that's going to be doing object detection provides the proposals itself. Now, the next part is that there's some region-of-interest, or RoI, transformation, which is going to take each proposed region and coerce it from its arbitrary shape into a fixed-size feature vector; this can actually happen at some level after many convolutional layers have been applied, and it is on these features, once the RoI transformation is performed, that the rest of the network operates. Then there are heads, obviously based on the deep neural network, that take care of classifying regions; there are others which have to do with refining the spatial location of each region; and then, what I'll focus on a little bit more, predicting masks, which is what Mask R-CNN adds. Okay, so that's the general type of approach we're taking. Now I want to go through the modules that are used to build up the system.
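The RoI transformation can be sketched as follows. This is a deliberately simplified, single-channel version of Fast R-CNN-style RoI max pooling with a 2x2 output and integer quantization; real systems use many channels and larger outputs (typically 7x7), and Mask R-CNN replaces the quantization with bilinear sampling (RoIAlign). Function and parameter names here are mine.

```python
def roi_max_pool(feature, box, stride=16, out=2):
    """Pool an RoI into a fixed out x out grid (single channel).

    feature: 2-D list (H x W) at feature-map resolution.
    box: (x0, y0, x1, y1) in image pixels, quantized onto the stride grid.
    Assumes the quantized box spans at least `out` cells in each direction.
    """
    x0, y0, x1, y1 = (int(v / stride) for v in box)  # image -> feature coords
    h, w = y1 - y0, x1 - x0
    pooled = [[0.0] * out for _ in range(out)]
    for i in range(out):                             # split the RoI into bins
        ya, yb = y0 + i * h // out, y0 + (i + 1) * h // out
        for j in range(out):
            xa, xb = x0 + j * w // out, x0 + (j + 1) * w // out
            pooled[i][j] = max(feature[y][x]
                               for y in range(ya, yb)
                               for x in range(xa, xb))
    return pooled

fmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(roi_max_pool(fmap, (0, 0, 64, 64)))  # [[6, 8], [14, 16]]
```

Whatever the size of the proposed box, the output is a fixed grid, which is what lets arbitrary regions feed a fixed-size head.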
The first module is the backbone architecture, and that's what we use to drive the whole recognition system. This could really be any successful network that's been developed in the past, or, when an even more successful one is developed a year from now, that could be dropped in as well; for example, it could be AlexNet, VGG, ResNet, or ResNeXt. Now, a couple of very basic guidelines are useful to keep in mind. The first is that it's useful to use what's often referred to as "same" padding. This is the idea that when you apply any pooling or convolutional operator, you want the spatial extent of the input to that operator to match the spatial extent of its output (up to the stride), and the reason for doing this is that it preserves an integer scale relationship between the different levels computed by the network; you'll see in a little bit why that's useful. The second is that it's nice to prefer a fully convolutional network as the backbone architecture. The reasons end up being a little bit subtle; it typically has to do with the later architectural modifications we'll make, and using a fully convolutional network often provides a greater degree of flexibility in terms of what you might want to do with that network later. And then, of course, the last point is that pre-training on ImageNet, or a similar type of dataset, is an extremely powerful mechanism for transferring the knowledge in the weights of the backbone network to another task, like object detection, where you typically have less data available.
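As a small aside on the arithmetic (my own illustration): with an odd kernel k and padding (k - 1) // 2, the output size is ceil(n / s), so stride-1 layers preserve the spatial size and stride-2 layers halve it exactly, keeping every level on an integer grid relative to the others.

```python
def same_pad_out(n, k, s):
    """Output length of a conv/pool with 'same' padding p = (k-1)//2 (odd k)."""
    p = (k - 1) // 2
    return (n + 2 * p - k) // s + 1

# A 224-pixel input is preserved by stride-1 3x3 convs and exactly
# halved by each stride-2 layer, so levels relate by powers of two.
sizes = [224]
for stride in (1, 2, 2, 2):
    sizes.append(same_pad_out(sizes[-1], 3, stride))
print(sizes)  # [224, 224, 112, 56, 28]
```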
So the first thing we're going to do, after we've selected a backbone architecture, is prepare it for detection, assuming that we've done pre-training, and this involves a little bit of minor surgery on that network. First, we take the batch normalization layers, take their test-time parameters, which are the scale and bias factors, and basically just treat those as constants, effectively removing batch norm from the network. The reason this is done is purely pragmatic, and hopefully we'll be able to avoid it someday: it's basically that, for training most of these object detection networks, you can't fit very many images on the GPU at a time, and therefore the batch norm statistics that would be computed end up not being very good for training and often lead to worse results. So this is just a simple pragmatic hack these days to avoid that issue. The second thing, because we're going to repurpose the classification network for detection, is that you need to remove the classification-specific head from that network. In the case of the ResNet illustrated here, that amounts to removing one average pooling layer and then the fully connected layer after it, which was used for the thousand-way classification on ImageNet. One thing to note, at least in the case of ResNet, is that once you've done this, you have a fully convolutional network that can take an input of any size.
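The batch-norm surgery above can be sketched as folding the test-time statistics into fixed per-channel constants; this is a toy stand-in for what frameworks usually call a frozen or affine-only batch norm layer, with made-up statistics and names of my own choosing.

```python
import math

def freeze_bn(gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold BN test-time statistics into a constant scale/bias per channel."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, running_var)]
    bias = [b - m * s for b, m, s in zip(beta, running_mean, scale)]
    return scale, bias

def frozen_bn_apply(x, scale, bias):
    """y = x * scale + bias, with no dependence on the current mini-batch."""
    return [xi * s + b for xi, s, b in zip(x, scale, bias)]

# Two channels with stored pre-training statistics (illustrative numbers):
scale, bias = freeze_bn(gamma=[1.0, 0.5], beta=[0.0, 1.0],
                        running_mean=[2.0, -1.0], running_var=[4.0, 1.0])
y = frozen_bn_apply([2.0, -1.0], scale, bias)  # inputs equal to the means
print(y)  # inputs at the running means map to ~[beta_0, beta_1] = [0.0, 1.0]
```

Because the transform is a fixed affine map, small detection batches no longer corrupt the normalization statistics.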
Now, the second thing we're going to look at is how scale-invariant object detection is realized. There are a bunch of strategies for doing this, including sort of making this second module a no-op, which is illustrated in pane (b) here. This is a strategy that's been used in Fast R-CNN, for example, which is basically just to use a single feature map from the backbone network as the basis for doing object detection; scale invariance, in this case, just ends up coming through via the region-of-interest transformation operation. Now, compatible with approach (b) is the very classic idea, illustrated in (a), of building an image pyramid and then applying whatever technique you have independently to each level of that pyramid. In practice this ends up working quite well and will usually give you a nice improvement in object detection quality; however, it ends up being quite slow, because you now have to apply your entire system to every level of the image pyramid.

So there's another approach, which I want to dig into a little bit, and that's the idea of exploiting the fact that deep convnets already inherently build a multi-scale representation inside of them. Let's look at this in a little more detail. In this illustration there's an image, and you have, say, the conv3, conv4, and conv5 feature maps computed by the network, and you could in principle take a detection model and make predictions based on each one of those levels of the network. But there are some issues that come up with this.
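To make the conv3/conv4/conv5 hierarchy concrete, here is a small sketch with my own numbers, assuming the typical ResNet-style strides of 8, 16, and 32: each deeper level has stronger features but a coarser spatial grid.

```python
def feature_map_sizes(image_size, strides=(8, 16, 32)):
    """Spatial size of each hierarchy level for a square input image."""
    return {f"conv{i}": image_size // s
            for i, s in zip(range(3, 3 + len(strides)), strides)}

print(feature_map_sizes(800))  # {'conv3': 100, 'conv4': 50, 'conv5': 25}
```

A small object a few dozen pixels wide occupies several cells on the conv3 grid but can shrink to a single cell on conv5, which is why one might want to detect small objects on the finer levels.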
The reason why you might want to do this, and why it seems like a good idea, is that you allow your detector access to a range of different scales. So for example you could detect small objects on conv3, because it has much higher spatial resolution; that should in principle allow you to extract much better features for detection than you could if you tried to detect tiny objects on conv5, which has been subsampled significantly. However, there's sort of a catch in this, which is that if you were to try to do that directly, you'd be compromising the quality of the features, because we know that the features computed up here in conv5 are going to be really good for classification, but the features down here at conv3 are not going to be so good, and in the extreme, if you went down far enough, you'd
effectively be using something that was sort of equivalent to, say, SIFT or HOG. So what we propose to do is to make a minor modification of that approach and build something called a Feature Pyramid Network; this is a paper that's at this CVPR and will be presented on Saturday. So the idea is to try to get the best of both worlds: we want to be able to use the inherent multiscale representation in the network, but we want to be able to use strong features everywhere. And I guess I didn't mention it explicitly, though you probably saw it in the slide: we also want it to be fast, by requiring only a marginal increase in the computation required to build this pyramid. So the basic idea here is that, as before, you have the standard forward pass through the
network that builds up the multiple levels of representation at different scales, but now we're going to add to that forward pass some new connections. There are going to be these lateral connections, and there are also going to be these top-down connections, and effectively what these are going to do is take the top-down strong features and propagate them to the high-resolution feature maps below, and that's going to create this auxiliary or secondary pyramid over here, which is going to have a variety of different spatial resolutions, and the features will ideally be strong across all those levels. So just to illustrate that again: the idea is that we're going to be able to have strong features everywhere in this pyramid.
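The top-down pathway just described can be sketched in a few lines. This is a simplification: nearest-neighbor upsampling and identity maps stand in for the learned 1x1 lateral convolutions and 3x3 smoothing convolutions of the actual Feature Pyramid Network, and channels are collapsed to single-channel maps:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x spatial upsampling of an (H, W) feature map."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def feature_pyramid(bottom_up):
    """bottom_up: list of (H, W) maps from fine (conv3) to coarse (conv5).
    Returns top-down maps of the same shapes, where each level is the sum
    of its lateral input and the upsampled coarser level above it."""
    top_down = [bottom_up[-1]]                         # start at the coarsest map
    for c in reversed(bottom_up[:-1]):
        top_down.append(c + upsample2x(top_down[-1]))  # lateral + top-down sum
    return list(reversed(top_down))                    # back to fine-to-coarse order
```

The point of the sketch is the propagation: semantically strong top-level features reach every finer level, which is the "strong features everywhere" property.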
I also want to note that this is an idea that seems to be very popular right now, because it seems to have effectively been invented sort of simultaneously by, I think, four or five different groups. Okay, so that's the second module in the approach. So now, the third module is going to be the mechanism that provides the region proposals for doing object detection. Now before describing this, I just want to draw your attention to what I'm trying to provide: the bird's-eye view of what we're building up as I go through each one of these steps, so you don't lose track of what's going on. So over here on the left side of the slide you have what's now this little tiny image of a ResNet, and you have, coming off of that, the feature pyramid network which I just described, which built that feature pyramid, and now what I'm going to describe is the region proposal mechanism, which is going to be applied
to each one of the levels computed by the feature pyramid network. Now, the idea of the region proposal network is that it's going to provide these object proposals for detecting objects using a sliding-window mechanism, and what it's going to do is that at each sliding-window position it's going to try to predict whether each one of K prototypical boxes centered at that position corresponds to an object. So we call these anchor boxes; they come in different aspect ratios and different scales, and the idea is that hopefully one of those aspect ratios and scales will be kind of close to the aspect ratio and scale of an object centered at that location, and then the region proposal network basically needs to say: yes, this anchor box is good.
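A minimal sketch of how such a set of K reference anchor boxes might be generated; the `base_size`, scale, and ratio values here are illustrative, not the ones used in any particular system:

```python
def generate_anchors(base_size, scales, aspect_ratios):
    """Return K = len(scales) * len(aspect_ratios) anchor boxes centered at
    the origin, as (x0, y0, x1, y1). An anchor with ratio r keeps roughly the
    same area as the square base box but has height/width ratio r."""
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for r in aspect_ratios:
            w = (area / r) ** 0.5
            h = w * r
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors

def anchors_at(cx, cy, anchors):
    """Shift the reference anchors so they are centered at (cx, cy),
    as done at every sliding-window position."""
    return [(cx + x0, cy + y0, cx + x1, cy + y1) for x0, y0, x1, y1 in anchors]
```

The same K boxes are reused at every sliding-window position; only their center changes.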
Additionally, it's going to suggest how you transform that anchor box via a small regression, so that it better localizes the object that's near it. So in practice, at each location there most likely won't be an object, but if there is an object, maybe only one of the anchor boxes at that location will be a good match to it, and the job of the RPN is to identify which one of those anchor boxes is a good match, give it a high objectness score, and then transform it so that it matches the object there. Looks like there's a question. [Audience question:] So, are these anchor boxes actually convolutional filters? I was trying to read the paper but it's not very clear. Let's say you have different scales and aspect ratios of these boxes; do you run them as convolutions on top of the image? Because the paper says you have three-by-three convolutions, and now you have these anchors, so what are they
in their actual realization? Yeah, so in the realization the anchor boxes are not filters; they're just these prototype boxes that act as references, and then there's a three-by-three filter for each one of those, and that three-by-three filter, it's not exactly one, it's essentially one for each property it needs to predict. So there'll be one that predicts whether it's an object or not an object, and then four that predict geometric transformations of the anchor box. So they're always three-by-three filters predicting properties of the anchor boxes; I hope that's clear. Okay. [Questioner:] All right, thanks. Okay, so that's going to be the mechanism in the network which provides the object proposals.
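To make those four regression outputs concrete, here is a sketch of the usual Faster R-CNN-style box parameterization for applying predicted deltas to an anchor (a sketch of the standard scheme, not code from the talk; the exact parameterization should be checked against the paper):

```python
import math

def apply_deltas(anchor, deltas):
    """Apply regression deltas (dx, dy, dw, dh) to an anchor (x0, y0, x1, y1):
    the center shift is relative to the anchor's size, and the width/height
    change is in log space, so the outputs are scale-invariant."""
    x0, y0, x1, y1 = anchor
    dx, dy, dw, dh = deltas
    w, h = x1 - x0, y1 - y0
    cx, cy = x0 + w / 2, y0 + h / 2
    cx, cy = cx + dx * w, cy + dy * h            # shift the center
    w, h = w * math.exp(dw), h * math.exp(dh)    # rescale width and height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

So per anchor the 3x3 filters emit one objectness score plus these four numbers.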
Now the fourth step, and again your little road map is over here in the corner, with increasingly shrinking size. So the fourth part is this region-of-interest transformation step, and in Mask R-CNN we use a new mechanism to provide this transformation, called RoIAlign. The idea of RoIAlign is that it's going to smoothly transform the features from whatever arbitrary aspect ratio the region of interest has into a fixed-size feature vector, without doing any quantization. The way that it's going to do that is using a very simple mechanism, just bilinear interpolation. So if this is the region proposal here, then what we want to get out of this, in this illustrated example, is a 2x2 grid of features, and within each one of the region-of-interest bins we're going to lay down a grid of sampling points, here illustrated as 2x2, and those sampling points are going to be used for
bilinear interpolation, and so one case of that is illustrated here, where you have this point interpolating the features at its four nearest neighbors. So via this interpolation process we're going to get out a fixed-size feature vector. Now, this probably sounds like the obvious thing that you want to do, but it's actually a little bit different, in kind of very minor details, from what has been done in the past. So for example in Fast R-CNN there's this region-of-interest pooling operation, which is very similar except for the fact that it performs quantization and max pooling; but the max pooling is not really the issue at hand here, the real issue is quantization.
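The sampling scheme just described can be sketched as follows, for a single-channel feature map and one sampling point per output bin, placed at the bin center (a simplification; the actual operator averages several sample points per bin and works over all channels):

```python
import numpy as np

def bilinear(feat, y, x):
    """Interpolate feat (an H x W array) at continuous location (y, x)
    from its four nearest grid neighbors."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    fy, fx = y - y0, x - x0
    top = feat[y0, x0] * (1 - fx) + feat[y0, x1] * fx
    bot = feat[y1, x0] * (1 - fx) + feat[y1, x1] * fx
    return top * (1 - fy) + bot * fy

def roi_align(feat, roi, out_size=2):
    """Sample an out_size x out_size grid of bilinearly interpolated values
    from roi = (y0, x0, y1, x1), with no coordinate quantization anywhere."""
    y0, x0, y1, x1 = roi
    bh, bw = (y1 - y0) / out_size, (x1 - x0) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear(feat, y0 + (i + 0.5) * bh, x0 + (j + 0.5) * bw)
    return out
```

Note that `roi` keeps its continuous coordinates throughout; that is exactly the snapping-to-the-grid step that RoIPool performs and RoIAlign removes.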
So as illustrated here, if you start from this original region of interest and then you perform some quantization that snaps its coordinates to the coordinates of the underlying feature grid, what happens is that you're going to break the pixel-to-pixel alignment between the input and the output. And it turns out that that doesn't matter that much when it comes to predicting bounding boxes, but when you move to tasks that require much finer spatial localization, such as predicting object masks or predicting keypoints, like human pose estimation, then this type of detail actually starts to matter quite a bit, and we have a detailed ablation breakdown in the paper which shows that this tiny detail of whether you quantize the coordinates or not actually makes a very significant difference in the final results. Okay, so now we're at the fifth and final modular component of the system, and this is the part that's going to make predictions for each region of interest that
the system has proposed and then transformed via RoIAlign. We refer to this as the head of the network, and there are going to be a variety of heads that perform different tasks. So the first two are standard from Fast and Faster R-CNN: that's doing bounding box detection, essentially classifying whether this box is one of the categories or background, and then the second... actually, that probably should have said bounding box regression. So the first one is doing a geometric shift and scale change of the box, in order to try to more finely localize the object; the second one is doing object classification, saying whether the proposal is one of the foreground categories that we're trying to detect, or part of the background. And then in addition to those two
standard components, we're going to add in two new components. The first is what Mask R-CNN is sort of all about, which is predicting instance-level masks for each object; and then the second, which is sort of optional and wasn't part of the original design, is that we discovered that there's actually a very simple way in which the same system can be used to predict human pose pretty reliably. So what's illustrated here on the right is the standard classification and bounding box regression head that's used for Fast and Faster R-CNN, and then you can just think that, in parallel to that, we're adding in a new head that's going to apply several convolution layers and then a transpose convolution layer to increase spatial resolution, in order to predict instance segmentations.
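To make the mask head's shape bookkeeping concrete, here is a sketch that just traces spatial sizes through a head of the kind described. The specific configuration (four 3x3 convs on 14x14 RoI features, one stride-2 transposed conv up to 28x28) is an assumption for illustration, not stated in the talk:

```python
def mask_head_shapes(roi_size=14, num_convs=4, num_classes=80):
    """Trace spatial sizes through a mask-head sketch: several stride-1 3x3
    convs keep the RoI feature size, then one stride-2 transposed conv
    doubles it before the per-class 1x1 mask predictor."""
    size = roi_size
    trace = []
    for _ in range(num_convs):
        trace.append(('conv3x3', size))     # stride-1, padded: size unchanged
    size *= 2                               # stride-2 transposed conv upsamples 2x
    trace.append(('deconv2x', size))
    trace.append(('mask_logits', size))     # num_classes maps of size x size
    return trace, (num_classes, size, size)
```

The transposed conv is what buys the extra spatial resolution for the mask output relative to the RoI features.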
Okay, so I would like to talk about training, but unfortunately there just isn't enough time right now to go into any of the details. So what I will say, just in brief summary, is that the training procedure is almost identical to Fast and Faster R-CNN; the main difference is that there are now these targets for predicting masks, and so I do want to spend a couple of slides to show you what those look like, to give you a little bit of intuition. So here's one image, and in this image I've highlighted four different regions of interest that are going to be used during training, and then for each one of those I'm showing the associated mask target that the network is being trained to predict; these are represented as binary 28 by 28 masks.
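A minimal sketch of how such a mask target could be produced from a full-image ground-truth mask and a proposal box, using nearest-neighbor resampling for simplicity (an illustration of the idea, not the exact implementation):

```python
import numpy as np

def mask_target(gt_mask, roi, out=28):
    """Crop the full-image binary ground-truth mask to the proposal box
    roi = (y0, x0, y1, x1) and resample it to an out x out binary target
    by sampling one ground-truth pixel per target cell (bin centers)."""
    y0, x0, y1, x1 = roi
    ys = np.clip(((np.arange(out) + 0.5) * (y1 - y0) / out + y0).astype(int),
                 0, gt_mask.shape[0] - 1)
    xs = np.clip(((np.arange(out) + 0.5) * (x1 - x0) / out + x0).astype(int),
                 0, gt_mask.shape[1] - 1)
    return gt_mask[np.ix_(ys, xs)]
```

This reproduces the behavior described on the slide: a tight proposal yields a target that fills the grid, a loose proposal yields a target occupying only part of it.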
So you can see that when an object is very well localized by the region proposal, the target fills up the entire 28 by 28 grid, and then you can see in this other example here, when the region of interest is not well aligned with the object, there's sort of an appropriate transformation of the target so that it only occupies a portion of the proposal. And this would mean that the system is being trained so that even if the box that it has predicted ends up not being a great box for the object, it still should hopefully be able to predict a reasonably good mask for the object. Okay, so unfortunately that's all about training; so now let's talk about how inference works. So inference proceeds in two steps, and essentially the first step is just to perform standard Faster R-CNN-type inference. So if you're familiar with that, then you'll follow this; if not, then unfortunately it'll probably be a little bit too terse to
really understand, but basically the first step is to generate proposals using the RPN, then to score those proposals using the object classification head of the network, and also to regress the refined proposals using bounding box regression, and then to apply non-maximum suppression and take the top, say, 100 detections, which is what we typically do in practice. And now the second part of the inference procedure is going to be predicting masks for those top 100 detections, and the way that this is done is simply by reusing all of the features that have already been computed and then, for each one of these refined detections, running the RoI transformation operation, RoIAlign, and then running those features through the mask head.
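The non-maximum suppression step in that first stage can be sketched as a greedy filter over scored boxes; the mask head then runs only on the boxes this keeps (a standard greedy-NMS sketch, not code from the system):

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(scored_boxes, thresh=0.5, top_k=100):
    """Greedy non-maximum suppression: keep boxes in descending score order,
    dropping any box that overlaps an already-kept box by more than thresh,
    and return at most top_k survivors."""
    keep = []
    for score, box in sorted(scored_boxes, reverse=True):
        if all(iou(box, kept) <= thresh for _, kept in keep):
            keep.append((score, box))
        if len(keep) == top_k:
            break
    return keep
```

Capping the survivors at around 100 is what makes the second, mask-prediction stage cheap.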
Doing this sort of two-stage cascade of prediction has a couple of advantages to it. The first is that it's fast, because you only have to predict masks for, say, 100 objects rather than for, say, a thousand objects; and the other is that you get slightly improved accuracy, because you're using the refined detections, rather than the original proposals, for doing the mask prediction. So one of the things that has frequently come up when I talk to people, and intuitively, you know, I was the same way, is that you kind of have a hard time believing that a 28 by 28 mask prediction is going to be high enough resolution to give you anything reasonable looking; and in fact, I don't have an illustration of it, but even 14 by 14 is quite reasonable. So what I want to show here is the result of the inference procedure. So here's a detected person; here's the 28 by 28 soft mask, so this is before doing any sort of
thresholding, so it has a value between 0 and 1. Now, in order to transform that 28 by 28 prediction into a prediction that's in the coordinate space of the image, the first thing to do is resample it so that its aspect ratio matches the aspect ratio of the detected box, and that rescaling is done using the soft mask. It's important to do it that way, rather than binarizing the mask and then rescaling, which introduces artifacts. So after having rescaled the soft mask, you can simply threshold it in order to get the final prediction. And here's another example using the same image: you see the 28 by 28 soft mask, which contains a fair amount of detail, the soft resampled version of it, and then the final prediction after thresholding.
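That resample-then-threshold order can be sketched as follows; nearest-neighbor resampling here stands in for whatever smooth interpolation an actual implementation would use, and the function name is illustrative:

```python
import numpy as np

def paste_mask(soft28, box_w, box_h, thresh=0.5):
    """Resample a 28x28 soft mask (values in [0, 1]) to the detected box's
    size, then binarize. The resampling operates on the soft mask; doing
    the thresholding first and resizing after would bake in blocky
    quantization artifacts."""
    ys = np.clip(((np.arange(box_h) + 0.5) * soft28.shape[0] / box_h).astype(int),
                 0, soft28.shape[0] - 1)
    xs = np.clip(((np.arange(box_w) + 0.5) * soft28.shape[1] / box_w).astype(int),
                 0, soft28.shape[1] - 1)
    resized = soft28[np.ix_(ys, xs)]
    return resized >= thresh
```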
And here's a final example, of the bird in that image being detected. OK, so here are just a few more qualitative results showing you what the output of the system looks like. I like this example because it shows a really nice success case of the system: you have a person here, and the person is cut completely in half by the surfboard that they're holding, but the system is still able to detect the top part and the bottom part as being part of the same instance. The reason it's able to do this is that it's not relying on any bottom-up grouping; bottom-up grouping would fail in such a case, because it wouldn't be able to connect the bottom part to the top part. But because it's doing a more holistic reasoning about what the object looks like in the scene, it's able to predict the full extent. Here's another example, where you can see that it's able to deal pretty well with people who are heavily overlapping each other, and with objects that are overlapping, for example the bottle that's being held in the hand of this person here.

So now I want to very briefly describe the application of this to human pose. The idea here is that human pose, in kind of a funny way, can be expressed as a mask prediction problem. You can think of each of the, say, seventeen keypoints used in the COCO dataset as being a one-hot mask: the mask is on where the keypoint is, and it's off everywhere else. So now you can just change the mask prediction head so that, instead of predicting a binary foreground/background mask, it's predicting 17 masks, where each one should have its argmax at the location of the keypoint. There is one small technical change that we make here, and that is that rather than having a sigmoid unit at each of the, say, 28-by-28 spatial locations, the prediction for the mask is now formed by doing a softmax over the grid of spatial locations. You can think of that as encoding the prior that the keypoint is going to exist in only one location within the spatial map. Here are some examples of the sort of output that the system produces, and again I should emphasize that this is the most naive way you could do keypoint estimation, but it ends up working quite well, and I think it should serve as a pretty reasonable baseline for trying to do more sophisticated things on top of this.
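The switch from per-location sigmoids to one softmax over the whole grid can be sketched as follows. This is an illustrative NumPy decoding step with names of my own choosing; it assumes a `(K, H, W)` stack of score maps, one per keypoint, and is not the talk's actual implementation.

```python
import numpy as np

def keypoints_from_logits(logits):
    """logits: (K, H, W) score maps, one per keypoint.

    A softmax over all H*W locations of each map (rather than an
    independent sigmoid at every location) encodes the prior that
    each keypoint occupies exactly one cell; the argmax of the
    resulting distribution is the predicted (y, x) location.
    """
    k, h, w = logits.shape
    flat = logits.reshape(k, -1)
    flat = flat - flat.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)       # softmax over the grid
    idx = probs.argmax(axis=1)                      # one cell per keypoint
    return np.stack([idx // w, idx % w], axis=1)    # (K, 2) of (y, x)
```

With a per-location sigmoid, nothing ties the locations of one map together; the grid-wise softmax makes the K maps compete over cells, which matches the "exactly one location per keypoint" structure of the problem.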
So here's a video where Mask R-CNN is being run frame by frame. It's one model in this case, trained to do box detection (not shown here), mask inference, and also human keypoint inference, all within the same model. And here's another video.

OK, so that brings me to the second part of the talk, which is going to be a very brief survey of deep learning for object detection. There have been a huge number of proposed methods over the last few years, and they often have very different names, but they're actually all related to each other in interesting and complex ways, and I want to try to provide a little bit of structure on this space. I really like mountains, so I just wanted to put in a nice picture of a mountain here; this slide doesn't really convey any other information.

So let's start with something that's common to all of these methods. The first point is that they all start by modifying a classification network. You saw this for the particular case of Mask R-CNN, but it's true of all of the existing methods, and that's because classification networks, and the features they learn, are really the backbone, the engine, that's driving a lot of visual recognition these days.

Now, I was searching around for what I thought was the highest-information-gain split of these methods, and I think it's this idea of a stage. If you split by stage, and I'll explain what that means in a moment, then you get a division into one set of methods, the R-CNN-style methods, and another set of methods that are often called one-stage. These are approaches like OverFeat, YOLO, SSD, and then a new method we recently wrote a paper about, called RetinaNet, which will show up at ICCV and will be presented on Wednesday as a poster.

So what's this idea of a stage? Let's go back to basics of object detection. If you have an H-by-W image with n pixels and you think about every rectangle in that image, you have order n-squared windows, which is a huge number, and a very popular way to think about the detection problem is to reduce detection to classifying each one of those windows. This is a huge output space that's very difficult to deal with in practice, so a lot of the literature on object detection, whether explicitly stated or not, is essentially trying to figure out ways to manage this computational complexity. A very popular approach has been the sliding window: rather than considering every possible rectangle, you reduce it to a discrete set of aspect ratios, translations, and scales, which can often bring you down to around a hundred thousand rectangles to consider. The other very powerful idea is the cascade, in which, rather than making a decision all at once, you have some sequence of decisions. That allows faster testing but, perhaps more importantly, simpler training, because it lets some parts of the model focus on the easy cases and other parts focus on the hard cases. And I think the way this cascade idea interacts with training is actually the core difference between the multi-stage and the one-stage sets of approaches.
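To put rough numbers on the reduction just described, we can count every axis-aligned rectangle versus a discretized sliding-window set. The stride, scale, and aspect-ratio choices below are hypothetical, chosen only to show the orders of magnitude.

```python
def all_rectangles(h, w):
    # An axis-aligned rectangle is a choice of two horizontal and two
    # vertical boundaries: O((h*w)^2) = O(n^2) in the pixel count n.
    return (h * (h + 1) // 2) * (w * (w + 1) // 2)

def sliding_window_count(h, w, num_scales, num_aspects, stride):
    # One candidate box per (grid location, scale, aspect ratio).
    return (h // stride) * (w // stride) * num_scales * num_aspects

print(all_rectangles(600, 800))                 # ~5.8e10 rectangles
print(sliding_window_count(600, 800, 5, 3, 8))  # 112,500 candidates
```

Even a modest 600-by-800 image contains tens of billions of rectangles, while a strided grid with a handful of scales and aspect ratios lands near the "around a hundred thousand" figure from the talk.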
So let's start with the multi-stage group, because you just saw an example of it, and let me illustrate how the R-CNN-style approach has more than one stage. One of the very first things this approach involves is coming up with a set of object or region proposals, and that, right off the bat, is already drastically reducing the complexity of the output space: you're going from order n-squared rectangles to, say, a couple of thousand during training. This is a huge reduction, and it's done via some mechanism that's either learned or based on bottom-up grouping. What that means is that the second part of the system, which classifies these proposals, only has to focus on this particular subset of the output space, and that makes the training process easier. It simplifies training because it takes a problem that was previously extremely class-imbalanced between foreground and background and makes it more class-balanced.

Now, in contrast, the one-stage approach, and here I'm showing a slide of YOLO as an illustrative example, takes a different route: it basically tries to classify the whole output space at once. There are different ways of making this more tractable than dealing with millions of possible boxes to classify. One is in the spirit of sliding window: you come up with some way of reducing the size of the output space. For example, YOLO has this grid division, and that's effectively a massive reduction in the size of the output space.
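As a concrete instance of that grid reduction: the original YOLO predicts a fixed number of boxes per grid cell, so the candidate set collapses from tens of thousands of windows to under a hundred. The arithmetic is trivial but worth seeing next to the sliding-window numbers.

```python
def yolo_candidate_boxes(grid_size, boxes_per_cell):
    # YOLO-style output space: each of the S*S grid cells
    # predicts a fixed number B of boxes.
    return grid_size * grid_size * boxes_per_cell

print(yolo_candidate_boxes(7, 2))  # 98 boxes for the original 7x7, B=2 setup
```

That is roughly a thousand-fold smaller output space than a typical dense sliding-window set, which is exactly the kind of reduction a one-stage method relies on to classify everything at once.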
So that's the main high-level split that I want to communicate. Now I want to go down one of those branches, the multi-stage detector branch, and provide what I think is a useful way to think about the set of methods that fall on that side of the tree. You can think of this type of detector as being logically composed of two parts. One is some sort of image-wise computation that's performed over the whole image and has nothing to do with the proposal part of the system. The second part is region-wise computation, which scales with the number of proposals, or regions, that the system needs to classify. And you can think of there being a slider that you can move from one end to the other in terms of how much computation you put into either side of the split. This is an idea we described in a paper called Networks on Convolutional Feature Maps, and I think it's useful for organizing the landscape of this type of detection method. At one extreme you have R-CNN, where the image-wise computation is essentially subtracting the pixel mean and dividing by the standard deviation, or something like that; it's almost nothing. The per-region computation is then the whole convolutional network itself, applied independently to each region, which is obviously very expensive, although it might be a very good way to deal with small objects, for example. So there are trade-offs.
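The image-wise versus region-wise slider can be made concrete with a toy cost model. All the numbers below are invented, purely to illustrate the two extremes the talk describes; they are not measurements of any real system.

```python
def detector_cost_ms(image_wise_ms, per_region_ms, num_regions):
    # Total inference cost: a fixed image-wise term plus a
    # per-region term that scales with the number of proposals.
    return image_wise_ms + per_region_ms * num_regions

# R-CNN-like extreme: almost nothing shared, the full network runs per region.
rcnn_like = detector_cost_ms(image_wise_ms=1.0, per_region_ms=50.0, num_regions=2000)
# R-FCN-like extreme: almost everything shared, per-region work is just pooling.
rfcn_like = detector_cost_ms(image_wise_ms=150.0, per_region_ms=0.05, num_regions=300)
print(rcnn_like, rfcn_like)  # 100001.0 165.0
```

The point of the model is that once per-region cost dominates, total cost is hostage to the proposal count, whereas a mostly image-wise design pays a fixed price regardless of how many regions are classified.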
Now, Fast R-CNN and Faster R-CNN, depending on specific implementation details, can fall at different points on this spectrum. On one hand, the original version, which doesn't use the feature pyramid network, may involve a more complex, heavier classification and bounding-box regression head; when you use FPN, the feature pyramid network, that actually enables you to reduce the amount of computation you put into the head of the detector, which shifts that version of the system further down this axis. And then there's this other nice approach called R-FCN, the region-based fully convolutional network, which represents the complete opposite extreme from R-CNN: essentially all of the computation is image-wise, and only a very small amount, which essentially amounts to a bunch of average pooling operations, is done per region.

So now, I realize I'm starting to run out of time, but I want to just briefly talk about speed/accuracy trade-offs. I think right now there's a lot of confusion about the speed and accuracy of the different detection systems that have been proposed, and I think one of the reasons is that if you just look at performance on the PASCAL VOC dataset, often reported on the 2007 version, which is now ten years old, it provides an incomplete picture. So let me draw an analogy.
If this were the only dataset we had, we effectively wouldn't understand that there's any difference between nearest-neighbor on simple features, SVMs, and convnets, because they'd all basically work about the same on it. But bring in ImageNet, and that really increases the degree to which you can discern which methods are doing useful things and which ones aren't. By analogy, I'd like to suggest that PASCAL is an outdated dataset at this point, and perhaps doesn't provide quite enough complexity to understand which methods are performing well in different regards. COCO is probably a better instrument for this, because it presents a more complex dataset, one that's perhaps better aligned with the different real-world applications we might be considering.

So under the lens of COCO, what we end up seeing is that the speed and accuracy of these systems are mainly influenced by three factors: the resolution of the input image, the complexity of the network, and, if it's a proposal-based system, the number of proposals you use, which is an adjustable hyperparameter. There's a very nice paper that will be presented at CVPR, from Jonathan Huang and colleagues at Google, that does a careful empirical evaluation of these speed/accuracy trade-offs. One of the reasons it's a very nice study is that they do it entirely within one implementation, so it provides the fairest apples-to-apples comparison of the different methods. What they produce from this is a really nice lay of the land of three meta-architectures, Faster R-CNN, R-FCN, and SSD, using a variety of backbone architectures and, where appropriate, a variety of proposal counts. I suggest people look into this in detail, because it really helps provide a more complete understanding. I just wanted to add one more point that wasn't in the plot. It's not quite comparable, because it's not implemented in the same framework, and there are lots of small implementation details that can shift the FPS one way or the other, but this illustrates roughly where YOLO version 2, which will also be at CVPR, falls: you can see that it's doing quite well in the regime of being very fast but of medium accuracy.

So, just a couple more slides to conclude. As Kaiming showed, there's been this huge improvement over the last decade in terms of where we are with object detection.
And I want to try to summarize why I think this has happened, in terms of things that before 2012 were false but are now true. These are, in my opinion, some of the core things that have led to the improvements we've seen. The first may sound kind of funny to people who only recently started in computer vision, but it is that we see improvements with more data. In 2007 this wasn't true, and it was actually a very sad thing. We didn't quite know it at the time, because it took a fair bit of annotation to understand, but Deva Ramanan has a nice paper looking at this, showing that if you increased the amount of data, there was very little improvement you could actually see in the models of that era. This is no longer the case, because we now have models that are able to take advantage of that data and improve. The second point, which is closely related, is that we see improvements when we increase model capacity. This also wasn't really true before, because increasing model complexity would most likely lead to overfitting rather than improved performance. The third, and I think this is a really fundamental gain of the last few years, is the ability to take advantage of transfer learning: the ability to pre-train on ImageNet and use the information pulled out of that dataset in order to do better on detection. This is something we really had no way of doing before, and what it means is that we can immediately benefit from improvements to image
--> 01:25:46,979 7431 classification so every time someone 7432 7433 1859 7434 01:25:45,069 --> 01:25:49,899 7435 comes up with a new network architecture 7436 7437 1860 7438 01:25:46,979 --> 01:25:51,279 7439 well I shouldn't say every time a lot of 7440 7441 1861 7442 01:25:49,899 --> 01:25:53,519 7443 the time that people come up with it it 7444 7445 1862 7446 01:25:51,279 --> 01:25:56,019 7447 immediately translates into improved 7448 7449 1863 7450 01:25:53,520 --> 01:25:57,550 7451 performance some say cocoa detection 7452 7453 1864 7454 01:25:56,020 --> 01:25:59,639 7455 which is really exciting to see so 7456 7457 1865 7458 01:25:57,550 --> 01:26:01,719 7459 there's a synergy and then another 7460 7461 1866 7462 01:25:59,639 --> 01:26:04,619 7463 important issue that's related to that 7464 7465 1867 7466 01:26:01,719 --> 01:26:07,270 7467 again is that we're now sort of 7468 7469 1868 7470 01:26:04,619 --> 01:26:08,829 7471 coalescing to this shared modeling 7472 7473 1869 7474 01:26:07,270 --> 01:26:10,870 7475 framework between a bunch of different 7476 7477 1870 7478 01:26:08,829 --> 01:26:13,329 7479 disciplines so speech natural language 7480 7481 1871 7482 01:26:10,869 --> 01:26:14,529 7483 processing computer vision and this 7484 7485 1872 7486 01:26:13,329 --> 01:26:17,079 7487 means that when there are new 7488 7489 1873 7490 01:26:14,529 --> 01:26:19,029 7491 discoveries in any of those related 7492 7493 1874 7494 01:26:17,079 --> 01:26:20,889 7495 fields there's a decent chance that 7496 7497 1875 7498 01:26:19,029 --> 01:26:22,469 7499 those might actually translate into 7500 7501 1876 7502 01:26:20,889 --> 01:26:24,730 7503 things that we can do in computer vision 7504 7505 1877 7506 01:26:22,469 --> 01:26:28,000 7507 which is which is really exciting 7508 7509 1878 7510 01:26:24,729 --> 01:26:30,129 7511 so in conclusion we've come a very long 7512 7513 1879 7514 01:26:28,000 --> 01:26:33,550 7515 way in the last ten years in terms of 7516 
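The transfer-learning point above (pre-train a backbone on a big source task, then train only a new head on the target task) can be sketched in a toy, framework-free way. Everything in this snippet is invented purely for illustration: the random "backbone", the logistic head, and the fake data have nothing to do with the actual ImageNet-to-detection pipeline discussed in the talk.

```python
import numpy as np

# Toy sketch of "freeze a pretrained backbone, train a new head from scratch".
rng = np.random.default_rng(0)

# Pretend this projection was learned on a large source task and is now frozen.
W_backbone = rng.normal(size=(16, 8)) / 4.0

def features(x):
    return np.maximum(x @ W_backbone, 0.0)  # frozen ReLU features

# Target-task data: labels depend on the first input coordinate.
x = rng.normal(size=(64, 16))
y = (x[:, :1] > 0).astype(float)

W_head = np.zeros((8, 1))  # the new head, trained from scratch

def loss_and_grad(W):
    f = features(x)
    p = 1.0 / (1.0 + np.exp(-f @ W))            # sigmoid head
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    grad = f.T @ (p - y) / len(x)               # logistic-loss gradient
    return loss, grad

loss0, _ = loss_and_grad(W_head)
for _ in range(300):                            # gradient descent on the head only
    _, g = loss_and_grad(W_head)
    W_head -= 0.5 * g
loss1, _ = loss_and_grad(W_head)
print(loss1 < loss0)  # True: the head learns on top of the frozen features
```

The point of the sketch is the division of labor: the backbone weights never change during target-task training, so any improvement to the (frozen) feature extractor carries over to the downstream task for free.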
So, in conclusion, we've come a very long way in the last ten years in terms of object detection. More recently, we're moving from bounding-box detection to more interesting and challenging instance-level understanding problems, but there are still a lot of major challenges, in my opinion, that remain. I just listed a few here that were sort of at the top of my mind, and you can read them off the slide. So with that I'd like to end and take any questions.

[Applause]
[Music]

Would you still use the two-stage approach if you had only one or two classes to detect?

I think this ultimately comes down to what you're trying to do. Depending on your task, there's the computational budget, whether you just want boxes or whether you want masks or keypoints, and all of these issues are going to affect what you decide to do.
So if you just want boxes and there are just two classes, then you might get better results for your specific application with the one-stage approach.

I have two questions. The first one: how fast is Mask R-CNN compared with Faster R-CNN? I know the mask branch may run in parallel, but is it a bottleneck or not? The second question: you showed some very nice qualitative results, like the people surfing, but there is a very nice reflection in the water. Why didn't Mask R-CNN detect it? I guess it should, right?

Yeah, okay. So for the first question, the overall runtime of the system: again, as I was trying to illustrate with the trade-offs, there is no one runtime of the system. You can change a whole lot of different factors.
You can change the input resolution, the backbone network you're using, the number of proposals, and get different points on the operating curve. But a sort of state-of-the-art-type system is about 200 milliseconds per image, and, if I remember correctly, maybe about 40 milliseconds of overhead compared to a Faster R-CNN.

Okay, yeah.

And then the second question: this is most likely just reflecting the bias, if you want to call it that, in the dataset. The dataset most likely doesn't have any reflections of people in the water annotated as people. If it had a lot of those annotated, I suspect it would detect them; if it only has a few, it will probably miss them. I suspect the dataset has almost none, so the model is effectively just a representation of the task it was trained for.
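The earlier point that there is "no one runtime" can be made concrete with a toy additive cost model. The functional form and every constant in the sketch below are invented for illustration; the only figures actually quoted in the talk are the rough 200 ms total and the ~40 ms mask-branch overhead.

```python
# Toy cost model for a two-stage detector's runtime (invented, not measured).
def runtime_ms(backbone_ms, per_proposal_ms, n_proposals, input_scale=1.0):
    # Backbone cost grows roughly with input area (scale squared);
    # per-RoI head cost grows linearly with the number of proposals.
    return backbone_ms * input_scale**2 + per_proposal_ms * n_proposals

# Two hypothetical operating points on the speed/accuracy curve.
fast_cfg = runtime_ms(backbone_ms=60.0, per_proposal_ms=0.05,
                      n_proposals=300, input_scale=0.8)
slow_cfg = runtime_ms(backbone_ms=60.0, per_proposal_ms=0.05,
                      n_proposals=1000, input_scale=1.25)
print(fast_cfg < slow_cfg)  # True: each knob moves you along the curve

# The two figures quoted in the talk: ~200 ms total, ~40 ms mask overhead.
mask_overhead_fraction = 40.0 / 200.0
print(mask_overhead_fraction)  # 0.2, i.e. masks add roughly 20% over Faster R-CNN
```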
Do you plan on open-sourcing Mask R-CNN, some kind of reference implementation? If so, what's your timeframe? And if not, is there some other open-source implementation? I see one in TensorFlow; are you confident that it's pretty much the same as your reference implementation?

Yes, we do plan to open-source it. That is tentative, with a big emphasis on tentative, because I don't want to make any promises. The timeframe is October.

Time is up, so we will have one more question.

Hi. There's a paper published at last year's CVPR called HyperNet, where basically feature maps are concatenated before going to the RPN and Fast R-CNN branches. It seems plausible that HyperNet would have better performance, since all the feature maps are exploited; nevertheless, that's not the case.
So what do you think is the reason that the feature pyramid network performs better than HyperNet?

So, if I recall correctly, and you can correct me if I'm wrong, I believe that in the HyperNet approach there's a combination of features from a variety of different scales, but then the predictions are made from a single combined representation rather than from multiple levels.

And you could resize the concatenated spatial map into different sizes, right?

Right, right. So what we observed is that for region proposal generation, having multiple levels in the feature pyramid was quite important, including having low-resolution ones... sorry, I mean including having high-resolution ones, for detecting small objects. So I suspect that with HyperNet, perhaps the missing thing was just not having quite enough spatial resolution, but I'm not 100% sure.

Yeah, I think that might be the reason. Thank you very much.

Oh, this is the end of the first
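The answer about predicting from multiple pyramid levels refers to how each region gets assigned to a level. The FPN paper (Lin et al., 2017) uses a simple scale-based heuristic for this; the sketch below follows that formula, with the level range P2 to P5 and the canonical 224-pixel box size taken from the paper (the helper name itself is ours).

```python
import math

# FPN's heuristic for assigning an RoI of size w x h to a pyramid level:
#   k = floor(k0 + log2(sqrt(w * h) / 224)), clamped to the available levels.
# k0 = 4 means a 224x224 box (canonical ImageNet crop size) maps to P4.
def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0))
    return max(k_min, min(k_max, k))

print(fpn_level(224, 224))  # 4: the canonical box size maps to P4
print(fpn_level(56, 56))    # 2: small objects go to the high-resolution level P2
print(fpn_level(800, 800))  # 5: large objects go to the coarse level P5
```

This is exactly why the high-resolution levels matter in the answer above: small boxes are routed to the finest pyramid level, so a model with only a single coarse combined map has no comparably detailed features to predict them from.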