
Name: Viplavi Wade

Student Id: 13922741

Subject: Big Data Analytics

Course: MSC Advanced Computing

Date: June 24, 2024

Big Data Analytics: CineSense Project

GitHub Repo Link: https://github.com/ViplaviWade/Big-Data-Project

1.0 Introduction

CineSense is an innovative video-processing startup that extracts valuable insights from social
media video content using advanced natural language processing (NLP) and computer vision
techniques. This analysis is crucial for businesses seeking to understand their audience, improve
customer experiences, and make data-driven decisions.

The project's primary objective is to develop a Python application that efficiently downloads and
analyzes YouTube videos using parallel processing techniques. This involves tasks such as
downloading videos, extracting audio, transcribing audio to text, performing sentiment and
emotion analysis, and translating the text. The project emphasizes the use of multiprocessing,
threading, or asynchronous programming to optimize the workflow.

2.0 Tools and Technologies Used:

• Python 3 (for implementation): Python offers a wide range of libraries suited to this
project, such as pytube for downloading videos, SpeechRecognition for transcribing audio,
and spaCy and TextBlob for sentiment analysis.
• Git and GitHub: version control and repository hosting for the project, providing
visibility into contributions and tracking the changes made in each phase of the project's
development.

3.0 Implementation Phases of the Project:

3.1 Phase-1

Tasks:

1. Manually retrieve 10-15 random video URLs from YouTube. Save the URLs in a text file called
video_urls.txt, where each URL should be stored on a separate line. Consider YouTube
videos that are 2-3 minutes in duration.
2. Develop a Python script to read the URLs. Assuming you have the text file named
video_urls.txt containing the URLs of YouTube videos, load it in Python and extract the URLs
using your preferred data structure.

3. Develop a Python script to download the videos using their URLs. Test your solution by
downloading the files serially, then use parallel programming such as multiprocessing or
threading to handle the downloads; justify the strategy you choose. For testing purposes,
ensure the script downloads no more than 5 videos simultaneously to avoid YouTube
blocks. You are advised to use threads and semaphores to control the downloads. Compare
serial and parallel executions of your video download script, and discuss its time and
space complexity.

To download the videos serially or in parallel, the script prompts the user to choose the
execution mechanism: entering '1' runs the downloads serially, while entering '2' runs them
in parallel. Parallel execution is implemented with multi-threading, and a semaphore limits
the script to at most 5 simultaneous downloads to avoid YouTube blocks.
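
A minimal sketch of this download logic is shown below, assuming pytube (listed in the
tools section) is used for the actual downloads and that the URLs have already been saved
in video_urls.txt; the function and variable names are illustrative rather than the exact
ones in the repository.

import threading
import time

from pytube import YouTube  # assumed downloader, as listed in the tools section

MAX_CONCURRENT = 5                           # avoid YouTube blocks
semaphore = threading.Semaphore(MAX_CONCURRENT)

def read_urls(path="video_urls.txt"):
    """Load one URL per line into a list."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def download_video(url, output_dir="downloads"):
    """Download a single video while holding a semaphore slot."""
    with semaphore:                          # at most 5 downloads run at once
        try:
            stream = YouTube(url).streams.get_highest_resolution()
            stream.download(output_path=output_dir)
        except Exception as exc:             # unavailable video, network failure, ...
            print(f"Failed to download {url}: {exc}")

def run_serial(urls):
    for url in urls:
        download_video(url)

def run_parallel(urls):
    threads = [threading.Thread(target=download_video, args=(url,)) for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    urls = read_urls()
    choice = input("Enter 1 for serial or 2 for parallel execution: ")
    start = time.time()
    run_serial(urls) if choice == "1" else run_parallel(urls)
    print(f"Finished in {time.time() - start:.2f} seconds")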

Time and space complexity of the serial and parallel downloads.

Time Complexity:

• Serial Execution: The time complexity of serial execution is O(n). Each video
download operation is independent and sequential, leading to a linear relationship
between the number of videos (n) and the total download time. Thus, the total time
taken increases linearly with the number of videos.
• Parallel Execution: The time complexity of parallel execution is O(n/k). Assuming the
system allows downloading (k) videos concurrently without significant overhead, the
time complexity can be approximated to O(n/k), where (k) is the number of
concurrent threads. However, the actual speedup will depend on the network
bandwidth, system resources, and how well the parallelism can be achieved.

Space Complexity:

• Serial Execution: The space complexity of serial execution is O(1). The space
complexity is constant because, at any given time, only one video is being processed
and downloaded, regardless of the total number of videos.
• Parallel Execution: The space complexity of parallel execution is O(k). The space
required is proportional to the number of concurrent threads (k), since each thread
consumes memory for its stack and local variables. Additionally, up to (k) videos are
written to disk simultaneously, but the final on-disk footprint is the same as for
serial execution.

Factors that influence the actual performance of the serial and parallel execution

• Network Bandwidth: The actual performance gain from parallel downloads will
highly depend on the available network bandwidth. If the bandwidth is a bottleneck,
adding more threads might not lead to a proportional decrease in download time.
• Thread Management: Proper management of threads using semaphores and
mutexes is crucial to avoid issues like race conditions and excessive resource usage.
In my code, a semaphore is used to limit the number of concurrent downloads to
5, ensuring the system is not overwhelmed.
• Error Handling: Robust error handling is essential in both serial and parallel
execution to manage issues like unavailable videos or network failures gracefully.

Parallel execution can significantly reduce the total download time compared to serial
execution, as demonstrated by the results of my executed code: the serial execution took
43.01 seconds, whereas the parallel execution completed in 27.67 seconds. However,
this comes with increased complexity in managing concurrent threads and potential
challenges related to network bandwidth and system resources. The choice between serial
and parallel execution should weigh these factors to optimize performance effectively.
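
For reference, the measured speedup works out to 43.01 / 27.67 ≈ 1.55x, well below the
ideal 5x for 5 concurrent downloads, which is consistent with the network-bandwidth
limitation discussed above.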

4. Develop a Python script to keep a log for each download. After downloading each video,
create a logger to record which video was downloaded by which process or thread. Save the
log entries to the same file, e.g., download_log.txt. For this script, you have to use threads
and a mutex.
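
A minimal sketch of this logging pattern is given below, assuming Python's built-in
threading.Lock serves as the mutex and a plain text file as the log; the function and file
names are illustrative.

import threading

log_lock = threading.Lock()                  # mutex protecting the shared log file

def log_download(video_title, log_path="download_log.txt"):
    """Append one entry per downloaded video, recording the thread that handled it."""
    entry = f"{threading.current_thread().name} downloaded: {video_title}\n"
    with log_lock:                           # only one thread writes at a time
        with open(log_path, "a") as log_file:
            log_file.write(entry)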

5. Develop Python scripts to perform various video analysis tasks. After downloading a video,
perform the following tasks. It is preferable to develop a separate script for each
functionality. The five analysis subtasks are as follows (a consolidated sketch of these
scripts is given after this list).
I. Extract audio from a video file.
The code for extracting audio from a video file is in extract_audio.py, and the
extracted audio is saved in a folder named 'extracted_audio'.
II. Transcribe audio to text.
The transcription of audio files to text is implemented in transcribe_audio.py,
and the transcribed text files are stored in the transcribe_audio2text folder.
III. Perform sentiment analysis on a video's content, extracting its polarity and
subjectivity.
Sentiment analysis is implemented in sentiment_analysis.py and the results are
stored in the sentiment_analysis folder. In this example the analysis completed in
0.12 seconds, reporting the polarity and subjectivity of the video's transcript,
and the result is saved as a JSON file for every YouTube video.
IV. Translate the text into another language, e.g. Spanish.
The code for translating text from English into another language (Spanish by
default) is in translate_text.py. Translating the English text to Spanish took
around 4.30 seconds, and the translated text is stored in the folder named
'translations'.
V. Extract the emotions of a text.
The emotions of the text are determined in extract_emotions.py. The results are
stored as JSON files containing scores for the emotions Happy, Angry, Surprise,
Sad, and Fear, saved in a folder named extracted_emotions. This script completed
in 1.92 seconds.
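
A consolidated sketch of the five analysis scripts is shown below. It assumes the libraries
mentioned in the tools section (SpeechRecognition, TextBlob) plus moviepy for audio
extraction, deep-translator for translation, and text2emotion for the Happy/Angry/Surprise/
Sad/Fear scores; these library choices, function names, and file paths are assumptions for
illustration and may differ from the actual scripts in the repository.

import json

import speech_recognition as sr                   # assumed transcription backend
import text2emotion as te                         # assumed emotion scorer
from deep_translator import GoogleTranslator      # assumed translation backend
from moviepy.editor import VideoFileClip          # assumed for audio extraction
from textblob import TextBlob                     # sentiment: polarity, subjectivity

def extract_audio(video_path, audio_path):
    """I. Extract the audio track from a video file and save it as WAV."""
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path)
    clip.close()

def transcribe_audio(audio_path):
    """II. Transcribe a WAV file to text using Google's free recognizer."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio)

def analyse_sentiment(text, out_path):
    """III. Save polarity and subjectivity of the transcript as JSON."""
    sentiment = TextBlob(text).sentiment
    with open(out_path, "w") as f:
        json.dump({"polarity": sentiment.polarity,
                   "subjectivity": sentiment.subjectivity}, f, indent=2)

def translate_text(text, target="es"):
    """IV. Translate English text into another language (Spanish by default)."""
    return GoogleTranslator(source="en", target=target).translate(text)

def extract_emotions(text, out_path):
    """V. Save Happy/Angry/Surprise/Sad/Fear scores as JSON."""
    with open(out_path, "w") as f:
        json.dump(te.get_emotion(text), f, indent=2)

if __name__ == "__main__":
    extract_audio("downloads/video1.mp4", "extracted_audio/video1.wav")
    text = transcribe_audio("extracted_audio/video1.wav")
    analyse_sentiment(text, "sentiment_analysis/video1.json")
    print(translate_text(text))
    extract_emotions(text, "extracted_emotions/video1.json")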
