Dubey ML Write-Up
In this article, we delve into TransUNet, a hybrid architecture that combines CNNs and Transformers.
U-Net Architecture
U-Net, a convolutional neural network (CNN) architecture, was introduced in 2015 by Olaf
Ronneberger et al. [1] and was initially designed for biomedical image segmentation. Its
name derives from its “U” shape, created by its encoder-decoder structure and skip
connections [2]. U-Net has proven to be flexible and effective, and its applications have
expanded into various fields beyond its original purpose.
Contracting Path (Encoder)
This path operates much like a typical convolutional neural network (CNN). It progressively downsamples the input image, extracting and encoding essential features at each level. This process can be visualized as a funnel, gradually reducing the spatial dimensions of the image while increasing the depth of the feature maps.
Convolution: Applies filters to the image to detect specific patterns and features.
ReLU Activation: Introduces non-linearity to enable the network to learn complex
relationships.
Max Pooling: Downsamples the feature maps to reduce their size and increase the
receptive field of subsequent layers.
As the image traverses this path, the network captures increasingly abstract and high-level
features, crucial for understanding the overall context of the image.
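To make this concrete, here is a minimal PyTorch sketch of one contracting-path stage; the module name, channel counts, and kernel sizes are illustrative assumptions rather than the exact U-Net configuration.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One contracting-path stage: two 3x3 convolutions with ReLU, then 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)   # halves the spatial resolution

    def forward(self, x):
        features = self.conv(x)       # kept for the skip connection
        downsampled = self.pool(features)
        return downsampled, features

# Example: a 1-channel 256x256 image -> 64-channel 128x128 feature map plus a skip tensor.
x = torch.randn(1, 1, 256, 256)
down, skip = EncoderBlock(1, 64)(x)
print(down.shape, skip.shape)  # torch.Size([1, 64, 128, 128]) torch.Size([1, 64, 256, 256])
```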
Expansive Path (Decoder)
This path reverses the contracting path, gradually upsampling the feature maps to reconstruct a segmentation mask of the same size as the original input image. It acts as a reverse funnel, expanding the spatial dimensions while decreasing the depth of the feature maps.
Upsampling: Increases the size of the feature maps, often using transposed
convolutions.
Concatenation: Combines the upsampled feature maps with the corresponding high-
resolution ones from the contracting path. This crucial step, facilitated by skip
connections, helps to recover spatial details lost during downsampling.
Convolution: Refines the combined feature maps to generate a precise segmentation
mask.
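A matching sketch of one expansive-path stage (again, channel counts and names are assumed for illustration), showing transposed-convolution upsampling, concatenation with the skip feature map, and refining convolutions.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One expansive-path stage: upsample, concatenate the skip connection, refine."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # 2x upsampling
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # restore spatial resolution
        x = torch.cat([x, skip], dim=1)  # recover detail from the contracting path
        return self.conv(x)

# Example: 128-channel 64x64 map + 64-channel 128x128 skip -> 64-channel 128x128 map.
x = torch.randn(1, 128, 64, 64)
skip = torch.randn(1, 64, 128, 128)
print(DecoderBlock(128, 64)(x, skip).shape)  # torch.Size([1, 64, 128, 128])
```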
Given an image x ∈ ℝ^{H × W × C} with a spatial resolution of H×W and C channels, the goal
is to predict a pixel-wise label map of the same resolution, H×W. Traditional methods often
rely on Convolutional Neural Networks (CNNs), such as U-Net, to encode the images into
high-level feature representations and then decode them back to the full spatial resolution.
In contrast, this approach introduces self-attention mechanisms into the encoder design by
leveraging Transformers [3]. This enhancement enables the model to capture richer feature
representations and better global context.
Implementation of TransUNet
The goal is to predict a pixel-wise label map (segmentation map) for an image, where
each pixel is assigned a class label. The output map has the same spatial resolution
as the input image, H×W, and the prediction is produced by a neural network.
1. Image Sequentialization
The image is split into small patches of size P×P (e.g., 16×16 pixels).
Each patch is flattened into a vector and forms a “token.” The total number of tokens is
N = H·W / P²,
where H and W are the image height and width.
Example: If the image is 256×256 and the patch size P = 16, then N = (256 · 256) / 16² = 256 tokens.
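A minimal NumPy sketch of this step, with assumed example shapes, that splits an image into non-overlapping P×P patches and confirms the token count.

```python
import numpy as np

H, W, C, P = 256, 256, 3, 16        # image size and patch size (example values)
image = np.random.rand(H, W, C)      # stand-in for a real input image

# Split the image into non-overlapping P x P patches and flatten each one into a token.
patches = image.reshape(H // P, P, W // P, P, C)   # (16, 16, 16, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)         # group the patch-grid dimensions first
tokens = patches.reshape(-1, P * P * C)            # (N, P*P*C)

N = tokens.shape[0]
print(N, (H * W) // (P * P))  # both print 256, matching the formula above
```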
2. Patch Embedding
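The write-up does not detail this step; in the standard ViT-style formulation used by TransUNet [3], each flattened patch is mapped by a trainable linear projection to a D-dimensional embedding, and learnable position embeddings are added so the sequence retains spatial order. A minimal sketch with assumed dimensions:

```python
import torch
import torch.nn as nn

N, patch_dim, D = 256, 16 * 16 * 3, 768          # tokens, flattened patch size, embedding dim (assumed)

proj = nn.Linear(patch_dim, D)                    # trainable linear projection of each flattened patch
pos_embed = nn.Parameter(torch.zeros(1, N, D))    # learnable position embeddings

tokens = torch.randn(1, N, patch_dim)             # flattened patches from the previous step
embedded = proj(tokens) + pos_embed               # (1, N, D) sequence fed to the Transformer
print(embedded.shape)                             # torch.Size([1, 256, 768])
```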
3. Transformer Encoder
The sequence of embedded tokens is passed through multiple stacked layers that constitute the
Transformer. Each layer consists of a multi-head self-attention (MSA) block and an MLP block, with layer normalization and residual connections.
(Fig. 5 shows the structure of the Transformer encoder.)
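A compact PyTorch sketch of one such layer; the hidden size, number of heads, and MLP width are assumed example values.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One encoder layer: LayerNorm -> multi-head self-attention, LayerNorm -> MLP, with residuals."""
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention with residual
        x = x + self.mlp(self.norm2(x))                    # MLP with residual
        return x

tokens = torch.randn(1, 256, 768)          # (batch, N tokens, embedding dim)
print(TransformerLayer()(tokens).shape)    # torch.Size([1, 256, 768])
```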
4. TransUNet Architecture
TransUNet improves upon a pure-Transformer encoder by combining CNNs and Transformers (see the sketch after this list):
Hybrid Encoder:
A CNN is first applied to the input image to extract intermediate feature maps at
various resolutions.
These feature maps are tokenized and processed by the Transformer.
Using CNN features ensures local (low-level) details like edges and boundaries are
preserved.
Cascaded Upsampling Decoder:
The Transformer output is reshaped back into a 2D feature map and progressively upsampled; each decoding stage applies upsampling, convolution, and ReLU activation.
Skip connections between the encoder and decoder help recover fine-grained details.
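Putting the pieces together, a schematic sketch of a TransUNet-style forward pass. The backbone, feature resolutions, and channel counts here are simplified assumptions for illustration; the paper itself uses a ResNet-50 + ViT hybrid encoder [3].

```python
import torch
import torch.nn as nn

class TransUNetSketch(nn.Module):
    """Schematic hybrid encoder-decoder: CNN features -> Transformer -> cascaded upsampling with skips."""
    def __init__(self, in_ch=3, dim=256, n_classes=9):  # n_classes=9 assumed (e.g. 8 organs + background)
        super().__init__()
        # CNN stem: two downsampling stages whose outputs also serve as skip connections.
        self.stage1 = nn.Sequential(nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU())
        # Transformer over the tokenized low-resolution feature map.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        # Cascaded upsampler: each stage upsamples, concatenates a skip, convolves with ReLU.
        self.up1 = nn.ConvTranspose2d(dim, 64, 2, stride=2)
        self.dec1 = nn.Sequential(nn.Conv2d(64 + 64, 64, 3, padding=1), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.head = nn.Conv2d(32, n_classes, 1)           # per-pixel class logits

    def forward(self, x):
        s1 = self.stage1(x)                               # (B, 64, H/2, W/2) skip
        s2 = self.stage2(s1)                              # (B, dim, H/4, W/4)
        B, C, h, w = s2.shape
        tokens = s2.flatten(2).transpose(1, 2)            # (B, h*w, dim) token sequence
        tokens = self.transformer(tokens)
        feat = tokens.transpose(1, 2).reshape(B, C, h, w) # back to a 2D feature map
        d1 = self.dec1(torch.cat([self.up1(feat), s1], dim=1))  # upsample + skip at H/2
        return self.head(self.up2(d1))                    # (B, n_classes, H, W)

print(TransUNetSketch()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 9, 224, 224])
```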
5. Experiments
The dataset includes 30 abdominal CT scans with ground-truth labels for 8 organs. The main evaluation metrics were the Dice similarity coefficient (DSC) and the Hausdorff distance (HD) [3].
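For reference, a minimal sketch of the Dice similarity coefficient for a binary mask; the smoothing term eps is an assumption added to avoid division by zero.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-6):
    """Dice similarity coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Example: two overlapping 4x4 masks (4 shared pixels out of 8 + 8).
a = np.zeros((4, 4)); a[:2, :] = 1
b = np.zeros((4, 4)); b[1:3, :] = 1
print(dice_coefficient(a, b))  # ~0.5
```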
Key Observations:
1. The hybrid CNN-Transformer encoder leverages low-level spatial details and high-level
global context, leading to better results.
2. The cascaded upsampling strategy recovers fine-grained features, improving
segmentation boundaries.
3. Superior results were observed for all organs, especially challenging ones like the
pancreas and gallbladder.