Assignment 10
ANS: To compare the cost efficiency of upgrading from a dual-CPU system to a GPU system for processing 1 million images, we need to consider both the initial hardware cost and the ongoing operational costs over a period of one year.
Performance:
Dual-CPU system: 100 images per hour
GPU system: 500 images per hour
Operational Costs:
Let’s assume the following hypothetical operational costs:
Dual-CPU system: $200 per month
GPU system: $400 per month
Step 1: Calculate Time to Process 1 Million Images
Dual-CPU System: 1,000,000 images ÷ 100 images/hour = 10,000 hours (≈ 417 days)
GPU System: 1,000,000 images ÷ 500 images/hour = 2,000 hours (≈ 83 days)
Step 2: Calculate Annual Operational Costs
Dual-CPU System: Operational Costs = 12 months × 200 USD/month = 2,400 USD
GPU System: Operational Costs = 12 months × 400 USD/month = 4,800 USD
Step 3: Cost to Process 1 Million Images
1. Dual-CPU System: 10,000 hours ≈ 13.9 months of operation, so roughly 13.9 × 200 USD ≈ 2,780 USD.
2. GPU System: 2,000 hours ≈ 2.8 months of operation, so roughly 2.8 × 400 USD ≈ 1,110 USD.
Although the GPU system costs twice as much per month, it finishes the workload five times faster, so its cost to process the 1 million images is less than half that of the dual-CPU system.
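These figures can be sanity-checked with a few lines of Python (the 720 hours per month is an assumption of round-the-clock operation):

images = 1_000_000
hours_per_month = 24 * 30  # assume 24/7 operation
for name, rate, monthly_cost in [("Dual-CPU", 100, 200), ("GPU", 500, 400)]:
    hours = images / rate                    # processing time in hours
    months = hours / hours_per_month         # equivalent months of operation
    print(f"{name}: {hours:,.0f} h ({months:.1f} months), ~${months * monthly_cost:,.0f}")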
1. Determine the Speedup: This is the ratio of the time taken with a single GPU to the
time taken with multiple GPUs.
2. Calculate the Scaling Efficiency: This is the speedup divided by the number of
GPUs.
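Neither the measured runtimes nor the GPU count are specified above, so the sketch below uses hypothetical values purely to illustrate the two formulas:

def scaling_metrics(time_one_gpu, time_n_gpus, num_gpus):
    speedup = time_one_gpu / time_n_gpus   # step 1: speedup
    efficiency = speedup / num_gpus        # step 2: scaling efficiency
    return speedup, efficiency

# Hypothetical example: 8 hours on 1 GPU vs. 2.5 hours on 4 GPUs
s, e = scaling_metrics(8.0, 2.5, 4)
print(f"Speedup: {s:.2f}x, scaling efficiency: {e:.0%}")  # 3.20x, 80%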
To determine the energy efficiency (in images per watt-hour) for both the GPU-based server
and the CPU-based server, we will follow these steps:
GPU-based server:
- Power consumption: 400 watts
- Processing time: 2 hours
- Images processed: 250,000
CPU-based server:
- Power consumption: 250 watts
- Processing time: 10 hours
- Images processed: 250,000
Energy consumption (in watt-hours) is calculated by multiplying the power consumption (in
watts) by the time (in hours).
1. GPU-based server: 400 W × 2 hours = 800 Wh
2. CPU-based server: 250 W × 10 hours = 2,500 Wh
Energy efficiency is calculated by dividing the number of images processed by the total
energy consumption.
1. GPU-based server: 250,000 images ÷ 800 Wh = 312.5 images per watt-hour
2. CPU-based server: 250,000 images ÷ 2,500 Wh = 100 images per watt-hour
The GPU-based server is therefore roughly 3.1 times more energy-efficient than the CPU-based server for this workload.
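The same calculation expressed as a short Python check:

for name, watts, hours in [("GPU", 400, 2), ("CPU", 250, 10)]:
    energy_wh = watts * hours            # energy consumption in watt-hours
    efficiency = 250_000 / energy_wh     # images per watt-hour
    print(f"{name}-based server: {energy_wh} Wh, {efficiency:.1f} images/Wh")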
Given Data
Input feature map: 64 × 64, with 64 channels
Filter size: 3 × 3 (applied across all 64 input channels)
Number of filters: 128
Stride: 1; Padding: 0
The formula for the output size of a convolutional layer is:
Output Size = (Input Size − Filter Size + 2 × Padding) / Stride + 1
Output Size = (64 − 3 + 2 × 0) / 1 + 1 = 62
Each filter is applied to a 3 × 3 region across all 64 input channels. Therefore, the number of operations per filter application is:
3 × 3 × 64 = 576 multiply-accumulate (MAC) operations
Each convolutional operation involves a multiply and an add (one MAC), so each filter application costs twice that number of FLOPs:
576 × 2 = 1,152 FLOPs
The output feature map has dimensions 62×62 and there are 128 filters. So, the total number
of output elements is:
62 × 62 × 128 = 492,032
The total number of FLOPs required to compute the output feature map is therefore:
Total FLOPs = 1,152 × 62 × 62 × 128
Step-by-step arithmetic:
1. Compute 62 × 62 = 3,844
2. Multiply by 128: 3,844 × 128 = 492,032
3. Multiply by 1,152: 492,032 × 1,152 = 566,820,864
Conclusion
The total number of floating-point operations (FLOPs) required to compute the output feature map for the given convolutional layer is 566,820,864, or roughly 0.57 GFLOPs.
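The whole calculation can be verified with a short script:

input_size, filter_size, padding, stride = 64, 3, 0, 1
in_channels, num_filters = 64, 128
out_size = (input_size - filter_size + 2 * padding) // stride + 1  # 62
macs_per_output = filter_size * filter_size * in_channels          # 576 MACs
flops_per_output = 2 * macs_per_output                             # 1,152 FLOPs
total_flops = flops_per_output * out_size * out_size * num_filters
print(f"{total_flops:,}")  # 566,820,864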
TensorFlow
Overview:
TensorFlow is an end-to-end, open-source machine learning framework developed by Google, widely used in both research and production.
GPU Acceleration:
TensorFlow provides built-in support for GPU acceleration using CUDA and cuDNN.
Users can leverage GPUs by simply installing the GPU version of TensorFlow and
setting device contexts in the code.
Key Features:
Flexible and powerful, suitable for both high-level and low-level operations.
TensorFlow Hub for reusable pre-trained models.
TensorFlow Extended (TFX) for production deployment.
Performance:
TensorFlow is highly optimized for GPUs and TPUs and scales well to distributed training, with additional gains available through graph compilation (XLA).
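For example, checking for a GPU and pinning an operation to it takes only a couple of lines (a minimal sketch using TensorFlow's standard device API):

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))  # lists any visible GPUs
with tf.device('/GPU:0'):                      # place the ops on the first GPU
    x = tf.random.normal((1024, 1024))
    y = tf.matmul(x, x)                        # executed on the GPU if present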
PyTorch
Overview:
GPU Acceleration:
PyTorch offers seamless GPU acceleration. Tensor operations are easy to move
between CPU and GPU.
It uses CUDA for GPU support and allows dynamic graph building, which can be
particularly useful for certain applications.
Key Features:
Dynamic computation graphs (define-by-run), which make debugging intuitive.
A Pythonic API that integrates naturally with NumPy and the scientific Python stack.
TorchScript for deployment and companion libraries such as torchvision.
Performance:
PyTorch is designed for flexibility and ease of use, with competitive performance,
especially in research contexts.
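For example, moving data to the GPU is a one-line operation (a minimal sketch; the tensor and sizes are illustrative):

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(1024, 1024).to(device)  # tensor now lives on the GPU if available
y = x @ x                               # computed on whichever device x is on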
Keras
Overview:
Keras is a high-level neural networks API, written in Python and capable of running
on top of TensorFlow, Theano, and other frameworks.
It is user-friendly and fast to prototype with.
GPU Acceleration:
Keras inherits GPU acceleration from its backend; with a TensorFlow backend, available GPUs are used automatically, with no code changes required.
Key Features:
Simple, consistent API designed for fast prototyping.
Modular building blocks (layers, optimizers, losses) that are easy to combine.
Built-in datasets, callbacks, and model-management utilities.
Performance:
Keras prioritizes ease of use and rapid development, with performance largely
dependent on the backend used (e.g., TensorFlow).
TensorFlow Implementation
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Load data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Define model
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

# Compile model (sparse loss because the labels are integer class indices)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train model on the GPU if one is available
with tf.device('/GPU:0'):
    model.fit(x_train, y_train, epochs=10, batch_size=32)

# Evaluate model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")
PyTorch Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
# Load data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)
# Define model
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28*28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

model = Net().cuda()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train model
for epoch in range(10):
    model.train()
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

# Evaluate model
correct = 0
total = 0
model.eval()
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.cuda(), target.cuda()
        output = model(data)
        _, predicted = torch.max(output.data, 1)
        total += target.size(0)
        correct += (predicted == target).sum().item()
print(f"Test accuracy: {correct / total:.4f}")
Keras Implementation
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Flatten
# Load data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
# Define model
model = Sequential([
Flatten(input_shape=(28, 28)),
Dense(128, activation='relu'),
Dense(10, activation='softmax')
])
# Compile model (sparse loss because the labels are integer class indices)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Train model
model.fit(x_train, y_train, epochs=10, batch_size=32)
# Evaluate model
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
print(f"Test accuracy: {test_acc}")
Comprehensive Report: Real-Time Object Detection for Autonomous
Driving Using GPU Computing
1. Introduction
Autonomous driving depends on perception systems that can locate and classify pedestrians, vehicles, traffic signs, and other obstacles within milliseconds. This report examines how GPU computing makes such real-time object detection feasible, using a simplified YOLO model as a case study.
2. Problem Description
Autonomous vehicles must process a vast amount of visual data in real time to detect and classify objects accurately. The challenge lies in the need for high-speed processing to ensure safety and reliability. Traditional CPU-based systems struggle with the computational demands of real-time object detection due to their limited parallel processing capabilities.
Challenges:
- Processing high-resolution camera streams at real-time frame rates.
- Maintaining low, predictable latency so the vehicle can react in time.
- Sustaining detection accuracy across varied lighting, weather, and traffic conditions.
- The limited parallel throughput of CPU-only systems.
3. Role of GPU Computing
GPUs provide the massive parallelism that the convolutional workloads of modern detectors require, accelerating both training and inference to real-time rates.
4. Implementation Details
We will implement a simplified version of the YOLO (You Only Look Once) model for object detection. YOLO is known for its speed and accuracy, making it suitable for real-time applications.
4.1 Data Preparation
For simplicity, we'll use a subset of a well-known object detection dataset such as COCO (Common Objects in Context).
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixels and reserve 20% of the training images for validation
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_generator = datagen.flow_from_directory(
    'data/train', target_size=(416, 416),
    batch_size=32, class_mode='categorical',
    subset='training'
)
val_generator = datagen.flow_from_directory(
    'data/train', target_size=(416, 416),
    batch_size=32, class_mode='categorical',
    subset='validation'
)
4.2 Model Definition
from tensorflow.keras.layers import Conv2D, Input, BatchNormalization, LeakyReLU
from tensorflow.keras.models import Model

def build_model(input_shape=(416, 416, 3)):
    # Minimal conv block standing in for the full (elided) YOLO backbone
    inputs = Input(shape=input_shape)
    x = LeakyReLU(alpha=0.1)(BatchNormalization()(Conv2D(32, 3, padding='same')(inputs)))
    return Model(inputs, x)
# Evaluate the model
test_loss, test_acc = model.evaluate(val_generator, verbose=2)
print(f"Validation accuracy: {test_acc}")

# Summarize performance
import matplotlib.pyplot as plt
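The plotting code itself did not survive; below is a minimal sketch of what it likely contained, assuming the (elided) training step was run as history = model.fit(...):

# Hypothetical: assumes history = model.fit(train_generator,
#                                           validation_data=val_generator, epochs=10)
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()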
Example Comparison:
Inference Speed: the original YOLO paper reports roughly 45 frames per second on a high-end GPU, comfortably within real-time requirements, whereas CPU-only inference of comparable networks typically achieves only a few frames per second.
7. Conclusion
GPU computing is what makes real-time object detection practical for autonomous driving: the same detection network that is far too slow on a CPU reaches real-time frame rates on a GPU. As detection models and GPU hardware continue to improve, so will the safety and responsiveness of autonomous perception systems.
Deliverables
The final report should include sections for introduction, problem description,
role of GPU computing, implementation details, performance evaluation,
impact analysis, and conclusion, along with appropriate references and
appendices for the source code.