The document details an assignment on parallel distributed computing, focusing on word analysis and term frequency analysis in large text files using multi-threading. It explains the implementation of thread management, including chunk division, thread safety with mutexes, and performance optimization through thread affinity. The document also presents execution time comparisons for different threading configurations and discusses challenges faced during the implementation.
My Work
Assignment 01 Parallel Distributed Computing
Muhammad Daud Cheema
I220875
YouTube video link: https://youtu.be/CKMOJGAiAM8?si=zxIs3JCLUwmgWSOD
TASK 2: Word Analysis in a Large Text File
Solution Explanation:
The program works as follows:
1. It obtains the file size, divides the file into equal chunks, and assigns each chunk to a separate thread created with pthread_create.
2. Each thread processes its chunk, counts words, checks for vowels, and updates the global statistics (vowel word count and word frequencies).
3. To ensure thread safety, shared resources are protected with a mutex (pthread_mutex_lock and pthread_mutex_unlock).
4. The program uses pthread_setaffinity_np to bind threads to specific CPU cores. This distributes the work evenly among the cores and can improve performance by reducing context switching and thread migration.

Execution Time and Speedup Analysis:
1. Without Thread Affinity: The operating system decides how to distribute threads across the available CPU cores.
2. With Thread Affinity: Threads are explicitly bound to specific CPU cores using pthread_setaffinity_np. This configuration minimizes context switching and can improve performance on multi-core systems.

Challenges Faced and Solutions:
1. Handling large files: Processing a file larger than the system's memory requires careful handling of chunks. The solution divides the file into smaller, manageable parts, one for each thread.
2. Thread synchronization: Ensuring thread safety when updating shared resources (such as word frequencies and counts) was challenging. This was addressed by using mutexes to lock shared variables and prevent race conditions.
3. Data extraction: I first had to store all of the data in a single txt file so that it could be divided into chunks.
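The four steps of the solution explanation can be sketched roughly as follows. This is a minimal illustration and not the assignment's actual code. The names chunk_t, count_chunk and analyze are hypothetical, the text is assumed to be already loaded into a memory buffer, and a "vowel word" is assumed to mean a word containing at least one vowel. Moving each chunk boundary forward to the next whitespace, so that no word is split between two threads, is also an assumption that the report does not describe.

```c
#define _GNU_SOURCE /* must precede the includes: enables pthread_setaffinity_np, CPU_SET */
#include <assert.h>
#include <ctype.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Shared totals, protected by stats_lock. */
static long total_words = 0;
static long vowel_words = 0;
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
    const char *buf;   /* whole text, assumed to be in memory already */
    size_t begin, end; /* this thread handles buf[begin, end) */
    int core;          /* core to pin to, or -1 to let the OS decide */
} chunk_t;

static int is_vowel(int c) {
    c = tolower(c);
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}

/* Thread body: count words locally, then merge into the shared totals. */
static void *count_chunk(void *arg) {
    chunk_t *ch = arg;
    if (ch->core >= 0) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(ch->core, &set);
        /* Best effort: if pinning fails, the count still completes. */
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }
    long words = 0, with_vowel = 0;
    int in_word = 0, has_vowel = 0;
    for (size_t i = ch->begin; i < ch->end; i++) {
        unsigned char c = (unsigned char)ch->buf[i];
        if (isspace(c)) {
            if (in_word) { words++; with_vowel += has_vowel; }
            in_word = has_vowel = 0;
        } else {
            in_word = 1;
            if (is_vowel(c)) has_vowel = 1;
        }
    }
    if (in_word) { words++; with_vowel += has_vowel; }

    /* One short critical section per thread instead of one per word. */
    pthread_mutex_lock(&stats_lock);
    total_words += words;
    vowel_words += with_vowel;
    pthread_mutex_unlock(&stats_lock);
    return NULL;
}

/* Split buf into nthreads chunks and count words in parallel. */
void analyze(const char *buf, size_t len, int nthreads, int use_affinity,
             long *words_out, long *vowel_out) {
    assert(nthreads > 0 && nthreads <= 64);
    pthread_t tid[64];
    chunk_t chunks[64];
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncores < 1) ncores = 1;
    total_words = vowel_words = 0;

    size_t begin = 0;
    for (int t = 0; t < nthreads; t++) {
        size_t end = (t == nthreads - 1) ? len : len / nthreads * (t + 1);
        if (end < begin) end = begin;
        /* Advance to the next whitespace so a word is never cut in two. */
        while (end < len && !isspace((unsigned char)buf[end])) end++;
        chunks[t] = (chunk_t){ buf, begin, end,
                               use_affinity ? (int)(t % ncores) : -1 };
        if (pthread_create(&tid[t], NULL, count_chunk, &chunks[t]) != 0)
            abort();
        begin = end;
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    *words_out = total_words;
    *vowel_out = vowel_words;
}
```

A caller would pass the text buffer and its length, for example analyze(text, strlen(text), 4, 1, &words, &vowels) for four pinned threads, and build with gcc -pthread. Each thread keeps local counters and takes the mutex only once, which keeps lock contention low. Locking on every word would also be correct but would serialize most of the work.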
TASK 3: Term Frequency Analysis
Solution Explanation:
The problem involves performing Term Frequency Analysis on a large text file using multi-threading. The goal is to count the frequency of each word in the file and calculate the total number of unique words. The analysis is run both with and without thread affinity, which refers to binding threads to specific CPU cores to optimize performance.

Execution Time and Speedup Analysis:
1. Without Thread Affinity: The OS decides how to distribute threads across CPU cores.
2. With Thread Affinity: Threads are explicitly bound to cores, potentially improving performance by reducing context switching.

Challenges Faced and Solutions:
1. Handling large files: Processing a file larger than the system's memory requires careful handling of chunks. The solution divides the file into smaller, manageable parts, one for each thread.
2. Thread synchronization: Ensuring thread safety when updating shared resources (such as word frequencies and counts) was challenging. This was addressed by using mutexes to lock shared variables and prevent race conditions.
3. Data extraction: I first had to store all of the data in a single txt file so that it could be divided into chunks.

TABLE FOR BOTH PROBLEMS

Configuration          Time for 1 thread   Time for 2 threads   Time for 4 threads
T2 without Affinity    97.8823             80.6181              74.0218
T2 with Affinity       80.71               64.72                72.57
T3 without Affinity    128.762             67.77                34.534
T3 with Affinity       75.6515             45.776               40.5807
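The table gives raw execution times only, so the speedup has to be derived from them. The sketch below assumes the usual definitions: speedup is the one-thread time divided by the n-thread time, and efficiency is the speedup divided by the number of threads. The function names speedup, efficiency and report_row are hypothetical, and each row is compared against its own one-thread time.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup of an n-thread run relative to the one-thread run of the same row. */
double speedup(double t1, double tn) {
    assert(t1 > 0.0 && tn > 0.0);
    return t1 / tn;
}

/* Parallel efficiency: speedup divided by the number of threads used. */
double efficiency(double t1, double tn, int nthreads) {
    assert(nthreads > 0);
    return speedup(t1, tn) / nthreads;
}

/* Print speedup (S) and efficiency (E) for one row of the timing table. */
void report_row(const char *label, double t1, double t2, double t4) {
    printf("%-22s S(2)=%.2f E(2)=%.2f  S(4)=%.2f E(4)=%.2f\n", label,
           speedup(t1, t2), efficiency(t1, t2, 2),
           speedup(t1, t4), efficiency(t1, t4, 4));
}
```

For example, report_row("T3 without Affinity", 128.762, 67.77, 34.534) gives a speedup of about 1.90 on two threads and about 3.73 on four. The same calculation for the T2 rows gives speedups below 1.35 in every configuration. In the "T2 with Affinity" row the four-thread time (72.57) is higher than the two-thread time (64.72), so the derived speedup falls from about 1.25 to about 1.11.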