UCS645 Project Report: Merge Sort
Sr Name Roll No
1 Aishwarya Jain 102203738
2 Alok Priyadashi 102203323
3 Anushka Verma 102203699
4 Samiksha Kak 102203587
MAY, 2025
Table of Contents
1 Introduction
1.1 Background
1.2 Introduction to Problem Statement
2 Problem Formulation
3 Objectives
4 Methodology
4.1 Pseudocode
4.2 Output Screenshots
5 Performance Analysis
6 Results and Discussion
1 Introduction
1.1 Background
This report explores the foundational principles and practical aspects of parallel programming, including models, algorithms, and performance metrics. It aims to provide a structured understanding of how computation can be accelerated by leveraging concurrency across multiple processors.
Figure 1: Parallel computing model showing problem decomposition
One might ask why there is such a large peak performance gap between many-threaded GPUs and multicore CPUs. The answer lies in the differences in the fundamental design philosophies between the two types of processors, as illustrated in Fig. 1. The design of a CPU is optimized for sequential code performance. The arithmetic units and operand data delivery logic are designed to minimize the effective latency of arithmetic operations at the cost of increased use of chip area and power per unit. Large last-level on-chip caches are designed to capture frequently accessed data and convert some of the long-latency memory accesses into short-latency cache accesses. Sophisticated branch prediction logic and execution control logic are used to mitigate the latency of conditional branch instructions. By reducing the latency of operations, the CPU hardware reduces the execution latency of each individual thread. However, the low-latency arithmetic units, sophisticated operand delivery logic, large cache memory, and control logic consume chip area and power that could otherwise be used to provide more arithmetic execution units and memory access channels.
This design approach is commonly referred to as latency-oriented design. The design philosophy of GPUs, on the other hand, has been shaped by the fast-growing video game industry, which exerts tremendous economic pressure for the ability to perform a massive number of floating-point calculations and memory accesses per video frame in advanced games. This demand motivates GPU vendors to look for ways to maximize the chip area and power budget dedicated to floating-point calculations and memory access throughput; this approach is accordingly known as throughput-oriented design.
2 Problem Formulation
With the rise in data-intensive applications and big data analytics, the need for high-performance computing (HPC) solutions has become crucial. Traditional CPU-based algorithms often fail to deliver the desired efficiency for large-scale computations due to limited parallelism. Sorting, being a fundamental operation in numerous applications like databases, scientific simulations, and real-time systems, becomes a natural candidate for optimization using parallel computing. This project formulates the problem of enhancing the performance of the merge sort algorithm through GPU acceleration using CUDA (Compute Unified Device Architecture).
The core objective is to analyze and compare the execution time and performance of merge sort implemented on both CPU and GPU architectures for arrays of increasing sizes. By leveraging the GPU's parallel processing capabilities, the aim is to demonstrate how execution time can be significantly reduced for large datasets. The experiment involves calculating various performance metrics such as speedup, efficiency, communication overhead, scalability, granularity, load balancing, and total overhead.
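Assuming the conventional definitions of these metrics (the report's own calculation appears in Figure 4), with T_cpu and T_gpu the measured execution times and p the number of processing elements:

    S = \frac{T_{cpu}}{T_{gpu}}, \qquad E = \frac{S}{p}, \qquad T_o = p \, T_{gpu} - T_{cpu}

where S is the speedup, E the efficiency, and T_o the total overhead. Granularity is taken as the ratio of computation time to communication time.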
The project first performs sorting on arrays of different sizes using CUDA kernels for GPU execution and standard recursive methods for the CPU. Execution times are recorded and used to compute the above metrics. The results are visualized through bar charts and line graphs to better understand the relationship between array size and performance gain. Additionally, all data is exported to an Excel file titled Performance Metrics for record-keeping and further analysis.
3 Objectives
• To implement the Merge Sort algorithm on both CPU and GPU platforms using CUDA C/C++ and evaluate their execution across varying array sizes.
• To measure and compare execution times for CPU and GPU implementations of merge sort to assess the performance benefits of GPU parallelization, using the following metrics:
– Speedup
– Efficiency
– Load Balancing
– Communication Overhead
– Scalability
– Granularity
• To analyze the effect of array size on the performance of both CPU and GPU merge sort implementations and determine the thresholds at which the GPU significantly outperforms the CPU.
• To visualize the comparative performance using bar charts and line graphs that
represent execution times and all performance parameters.
4 Methodology
1. Algorithm Selection
(a) Merge sort was chosen for its O(n log n) complexity and its parallelisability.
2. Implementation Approach
(a) Recursive divide-and-conquer
3. Performance Measurement (a timing sketch follows this list)
4. Validation
5. Testing
6. Environment
7. Output
8. Limitations
9. Future Work
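As an illustration of step 3, a minimal sketch of kernel timing with CUDA events is given below. The kernel name mergeSortKernel, its arguments, and the launch configuration are hypothetical placeholders, not the report's actual code.

#include <cuda_runtime.h>

// Hypothetical kernel declaration; the report's actual kernel is not
// reproduced here.
__global__ void mergeSortKernel(int *data, int *temp, int n, int width);

// Time one kernel launch with CUDA events; returns milliseconds.
float timeMergePass(int *d_data, int *d_temp, int n, int width,
                    int blocks, int threadsPerBlock)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    mergeSortKernel<<<blocks, threadsPerBlock>>>(d_data, d_temp, n, width);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // block the host until the kernel finishes

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}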
4.1 Pseudocode
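The pseudocode in the original report is provided as screenshots and is not reproduced here. As a stand-in, the sketch below shows a bottom-up (iterative) formulation commonly used for merge sort in CUDA, since device-side recursion is awkward; all names, the 256-thread block size, and the buffer-swapping driver are illustrative assumptions, not the report's exact implementation.

// Each thread merges one pair of adjacent sorted runs of length `width`.
__global__ void mergePass(const int *src, int *dst, int n, int width)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int left = tid * 2 * width;            // start of this thread's run pair
    if (left >= n) return;

    int mid   = min(left + width, n);      // end of the left run
    int right = min(left + 2 * width, n);  // end of the right run

    int i = left, j = mid, k = left;
    while (i < mid && j < right)           // standard two-way merge
        dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
    while (i < mid)   dst[k++] = src[i++];
    while (j < right) dst[k++] = src[j++];
}

// Host driver: double the run width each pass, ping-ponging two buffers.
void gpuMergeSort(int *d_a, int *d_b, int n)
{
    int *src = d_a, *dst = d_b;
    for (int width = 1; width < n; width *= 2) {
        int pairs   = (n + 2 * width - 1) / (2 * width);
        int threads = 256;
        int blocks  = (pairs + threads - 1) / threads;
        mergePass<<<blocks, threads>>>(src, dst, n, width);
        cudaDeviceSynchronize();
        int *tmp = src; src = dst; dst = tmp;  // swap roles for the next pass
    }
    if (src != d_a)                            // ensure the result ends in d_a
        cudaMemcpy(d_a, src, n * sizeof(int), cudaMemcpyDeviceToDevice);
}

Each pass merges runs of length width into runs of length 2*width, so roughly log2(n) kernel launches sort the whole array.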
4.2 Output Screenshots
Figure 4: Metrics calculation
Figure 5: CPU vs GPU Performance
Figure 6: Speedup
Figure 7: Efficiency
Figure 9: Communication Overhead
Figure 11: Granularity
5 Performance Analysis
• GPU implementation shows a 2-10x speedup over the CPU for large datasets (N > 10,000)
• Kernel launch overhead leaves the CPU faster for small datasets (N < 1,000)
• Memory Bottlenecks
• Algorithm behavior
• Comparative Insights
• Implementation Challenges
• Optimization Opportunities
• Practical Implications
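Relating to the Memory Bottlenecks and Communication Overhead items above, one way to isolate transfer cost from kernel time is sketched below; copyTimeMs and the buffer names in the usage note are illustrative, not the report's code.

#include <cuda_runtime.h>

// Sketch: time a single host<->device transfer with CUDA events so that
// communication overhead can be reported separately from kernel time.
float copyTimeMs(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, kind);    // synchronous with respect to host
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Usage (hypothetical buffers): communication overhead is approximately
//   copyTimeMs(d_in, h_in, bytes, cudaMemcpyHostToDevice)
// + copyTimeMs(h_out, d_out, bytes, cudaMemcpyDeviceToHost).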
These observations underscore both the promise of GPU-accelerated sorting and the engineering challenges involved in achieving optimal performance across different use cases.
6 Results and Discussion
• Key Findings
• Correctness Verification (validated on the cases below; a minimal check is sketched after this list)
– Pre-sorted arrays
– Reverse-sorted arrays
– Random distributions
– Duplicate values
• Resource Utilization
• Limitations: Performance degradation observed when:
• Comparative Analysis
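A minimal host-side check consistent with the verification cases listed above (the function name is illustrative):

#include <stdbool.h>
#include <stddef.h>

// Returns true if the array is in non-decreasing order; applied after
// sorting pre-sorted, reverse-sorted, random, and duplicate-heavy inputs.
static bool isSorted(const int *a, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        if (a[i - 1] > a[i])
            return false;
    return true;
}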