Understanding GoogLeNet Model - CNN Architecture

GoogLeNet (Inception V1) is a deep convolutional neural network architecture designed for efficient image classification. Its central innovation is the Inception module, which performs multiple convolution operations (1×1, 3×3, 5×5) and max pooling in parallel and concatenates their outputs. The architecture is deep yet optimized for speed and performance, which makes it suitable for large-scale visual recognition tasks. Alongside the Inception module, it introduced architectural choices such as 1×1 convolutions and global average pooling, all aimed at increasing depth while improving computational efficiency.

Key Features of GoogLeNet

The GoogLeNet architecture differs substantially from earlier architectures such as AlexNet and ZF-Net. It combines several techniques:

1. 1×1 Convolutions

One of the core techniques employed in GoogLeNet is the use of 1×1 convolutions, primarily for dimensionality reduction. These layers help decrease the number of trainable parameters while enabling deeper and more efficient architectures.

Example comparison (a 5×5 convolution producing 48 feature maps from a 14×14×480 input volume):

  • Without 1×1 convolution: (14×14×48) × (5×5×480) ≈ 112.9M operations
  • With a 1×1 convolution first reducing 480 channels to 16: (14×14×16) × (1×1×480) + (14×14×48) × (5×5×16) ≈ 5.3M operations

This is roughly a 20× reduction in computation without compromising performance.
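
The saving is easy to verify in code. Below is a minimal PyTorch sketch (the layer sizes mirror the example above; the variable names are illustrative) comparing a direct 5×5 convolution with a 1×1 bottleneck version:

```python
import torch
import torch.nn as nn

# Direct 5x5 convolution on a 480-channel input producing 48 feature maps.
direct = nn.Conv2d(480, 48, kernel_size=5, padding=2)

# Same output shape, but a 1x1 convolution first reduces 480 -> 16 channels.
bottleneck = nn.Sequential(
    nn.Conv2d(480, 16, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 48, kernel_size=5, padding=2),
)

x = torch.randn(1, 480, 14, 14)  # 14x14 feature maps, 480 channels
print(direct(x).shape, bottleneck(x).shape)             # both [1, 48, 14, 14]
print(sum(p.numel() for p in direct.parameters()))      # ~576k parameters
print(sum(p.numel() for p in bottleneck.parameters()))  # ~27k parameters
```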

2. Global Average Pooling

In traditional architectures like AlexNet, the fully connected layers at the end introduce a large number of parameters. GoogLeNet replaces these with Global Average Pooling, which computes the average of each feature map (e.g. converting 7×7 maps to 1×1). This significantly reduces the model's parameter count and helps prevent overfitting.

Benefits:

  • Zero additional trainable parameters
  • Reduces overfitting
  • Improves top-1 accuracy by approximately 0.6%
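
A minimal PyTorch sketch of this head (the 1024-channel, 7×7 feature-map size matches GoogLeNet's final stage; the AlexNet-style comparison in the comment is illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1024, 7, 7)      # final feature maps before the classifier

gap = nn.AdaptiveAvgPool2d(1)       # averages each 7x7 map to 1x1; no weights
pooled = gap(x).flatten(1)          # shape: [1, 1024]

classifier = nn.Linear(1024, 1000)  # 1024*1000 + 1000 ≈ 1.03M parameters
print(classifier(pooled).shape)     # [1, 1000]

# An AlexNet-style head would instead flatten 1024*7*7 = 50176 values into a
# 4096-unit fully connected layer: ~205M parameters for that layer alone.
```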

3. Inception Module

The Inception module is the architectural core of GoogLeNet. It processes the input using multiple operations in parallel: 1×1, 3×3 and 5×5 convolutions along with 3×3 max pooling. In GoogLeNet, the 3×3 and 5×5 branches are preceded by 1×1 convolutions that reduce channel depth, and the pooling branch is followed by a 1×1 projection. The outputs from all paths are concatenated depth-wise, as shown in the sketch after the list below.

  • Purpose: Enables the network to capture features at multiple scales effectively.
  • Advantage: Improves representational power without dramatically increasing computation.
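
A compact PyTorch sketch of one Inception module (the channel counts are illustrative, loosely following the Inception (3a) block of the paper):

```python
import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        # Run all four branches in parallel, concatenate along channel depth.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

block = Inception(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # [1, 256, 28, 28]
```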


4. Auxiliary Classifiers

To address the vanishing gradient problem during training, GoogLeNet introduces auxiliary classifiers (intermediate branches that act as smaller classifiers). These are active only during training and help regularize the network.

Structure of Each Auxiliary Classifier:

  • Average pooling layer (5×5, stride 3)
  • 1×1 convolution (128 filters, ReLU)
  • Fully connected layer (1024 units, ReLU)
  • Dropout layer (dropout rate = 0.7)
  • Fully connected softmax layer (1000 classes)

During training, the auxiliary losses are added to the main loss with a weight of 0.3 to stabilize training.
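
A PyTorch sketch of one auxiliary classifier, following the structure listed above (the 512 input channels and 14×14 spatial size correspond to the output of Inception (4a)):

```python
import torch
import torch.nn as nn

aux_classifier = nn.Sequential(
    nn.AvgPool2d(kernel_size=5, stride=3),  # 14x14 -> 4x4
    nn.Conv2d(512, 128, kernel_size=1),     # 512 channels at Inception (4a)
    nn.ReLU(inplace=True),
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 1024),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.7),
    nn.Linear(1024, 1000),                  # logits for 1000 ImageNet classes
)

logits = aux_classifier(torch.randn(1, 512, 14, 14))
print(logits.shape)                         # [1, 1000]

# Hypothetical training step: the total loss combines the main and auxiliary
# losses, e.g. loss = main_loss + 0.3 * aux1_loss + 0.3 * aux2_loss
```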

5. Model Architecture

GoogLeNet is a 22-layer deep network (excluding pooling layers) that emphasizes computational efficiency, making it feasible to run even on hardware with limited resources. The layer-by-layer architectural details of GoogLeNet are shown below.

[Figure: layer-by-layer architecture of GoogLeNet (Inception V1)]


The architecture also contains two auxiliary classifiers connected to the outputs of the Inception (4a) and Inception (4d) modules.

[Figure: Inception V1 architecture]

Key highlights of the architecture:

  • Input Layer: Accepts a 224×224 RGB image as input.
  • Initial Convolutions and Pooling: Applies a series of standard convolutional and max pooling layers to downsample the input and extract low-level features.
  • Local Response Normalization (LRN): Normalizes the feature maps early in the network to improve generalization.
  • Inception Modules: Each module processes the input through 1×1, 3×3, and 5×5 convolutions, as well as 3×3 max pooling, all in parallel. The outputs are concatenated along the depth dimension, allowing the network to capture both fine and coarse features.
  • Auxiliary Classifiers: Smaller branches connected to intermediate layers of the network, consisting of average pooling, a 1×1 convolution, fully connected layers, and a softmax output.
  • Final Layers: Uses global average pooling (7×7) to reduce each feature map to a single value. Followed by a fully connected layer and a softmax activation to produce the final classification output.
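
For experimentation, a reference implementation of GoogLeNet ships with torchvision. A minimal usage sketch (the weights argument assumes torchvision 0.13 or newer; older versions used pretrained=True):

```python
import torch
from torchvision import models

model = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
model.eval()                      # auxiliary classifiers are inactive in eval mode

x = torch.randn(1, 3, 224, 224)   # a 224x224 RGB input, as described above
with torch.no_grad():
    logits = model(x)
print(logits.shape)               # [1, 1000]
```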

Performance and Results

  • Winner of ILSVRC 2014 in both classification and detection tasks
  • Achieved a top-5 error rate of 6.67% in image classification
  • An ensemble of six GoogLeNet models achieved 43.9% mAP (mean Average Precision) on the ImageNet detection task

[Figure: GoogLeNet classification top-5 error results]
[Figure: GoogLeNet detection performance results]
