Data Analysis PHASE
A PROJECT REPORT
Submitted by
Dheeraj Singh Dhami (21BCS3113)
Manasvi Rajeev Sharma (21BCS3092)
BACHELOR OF ENGINEERING
IN
Chandigarh University
May 2023
BONAFIDE CERTIFICATE
Certified that this project report "DATA ANALYSIS USING BIG DATA TOOLS" is
the bonafide work of Dheeraj Dhami and Manasvi Sharma, who carried out
the project work under my supervision.
INTRODUCTION
We have a T-Series music video dataset, and let us assume that the client wants
to see an analysis of the overall data. The dataset is very large (potentially
billions of rows), so analyzing it with a traditional DBMS alone is not feasible.
Instead, we will use a Big Data tool such as Apache Spark to transform the data,
generate the necessary aggregated output tables, and store them in a MySQL
database. With this architecture, the UI can fetch reports and charts from MySQL
much faster than it could by querying the raw data directly. Finally, the batch
job we use to analyze the data can be automated to run daily at a fixed time.
4. Set up the environment and install all the tools required for the project.
5. Read data from a CSV file and store the data in HDFS (Hadoop Distributed
File System) in a compressed format (a short PySpark sketch follows this list).
6. Transform the raw data and build multiple tables by performing the required
aggregations.
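As a rough illustration of tasks 5 and 6, the sketch below reads the raw CSV file and writes it back to HDFS as Snappy-compressed Parquet; the file paths, HDFS address, and application name are placeholders chosen for this example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-hdfs").getOrCreate()

# Read the raw CSV file (header row assumed) from the local file system.
raw_df = (
    spark.read
         .option("header", "true")
         .csv("file:///home/hadoop/data/tseries_videos.csv")
)

# Store the data in HDFS in a compressed, columnar format.
# Snappy-compressed Parquet is used here; gzip-compressed CSV would also work.
(raw_df.write
    .option("compression", "snappy")
    .mode("overwrite")
    .parquet("hdfs://localhost:9000/warehouse/raw/tseries_videos"))
```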
Timeline
We have to install all the tools and set up the environment (if you have
already installed the required tools you can skip this task); make sure you
install all the required software in one location for simplicity.
After starting the Hadoop services, add the following environment variables so
that the pyspark command launches inside Jupyter Notebook:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Finally, you can run the pyspark command in a terminal, which should start
Spark inside a Jupyter Notebook.
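To confirm that the setup works, a quick check like the one below can be run in the notebook. This assumes the notebook was launched through the pyspark command, which pre-creates the `spark` session object.

```python
# `spark` is created automatically by the pyspark launcher.
print(spark.version)                 # Spark version in use
print(spark.sparkContext.uiWebUrl)   # URL of the Spark web UI
```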
Chapter 2
LITERATURE REVIEW / BACKGROUND STUDY:
2.1 Abstract:
The exponential growth of data has led to an increase in the volume, velocity, and
variety of data generated. Traditional data analysis tools are no longer sufficient to
handle such large data sets. Big data tools provide a solution to this challenge by
enabling analysts to process, analyze, and derive insights from massive data sets.
This research paper provides an overview of big data analytics, explores the
various big data tools available, identifies challenges faced in big data analytics,
and provides best practices for overcoming these challenges.
2.2 Introduction:
The advent of big data has created a new era of data analysis, where traditional
data analysis tools are no longer capable of handling the scale of data being
generated. Big data analytics refers to the use of advanced techniques and tools to
analyze and extract insights from large data sets. The goal of this research
paper is to examine the use of big data tools for data analysis: to explain the
importance of big data analytics, explore the various big data tools available,
identify the challenges faced in big data analytics, and provide best practices
for overcoming these challenges.
Importance of Big Data Analytics:
Big data analytics plays a significant role in enabling organizations to make
informed decisions based on insights derived from their data. It provides a
powerful tool for analyzing data, identifying patterns, trends, and insights that
would otherwise be difficult to discern. For instance, big data analytics can be used
to analyze customer behavior, identify fraud, optimize business processes, and
improve customer satisfaction. By using big data analytics, businesses can gain a
competitive edge by making informed decisions based on insights derived from
their data.
Case Study:
A case study on the use of big data analytics in the healthcare industry can provide
an insight into how big data tools can be used to extract insights from large data
sets. In the healthcare industry, big data analytics can be used to improve patient
outcomes, identify disease patterns, and optimize resource utilization. For example,
the use of big data analytics can enable healthcare providers to identify high-risk
patients, develop personalized treatment plans, and make better use of available
resources.
CHAPTER 3
DESIGN FLOW/PROCESS
Reading data from CSV files and transforming it to generate final output tables to
be stored in traditional DBMS has several key features:
1. CSV files are a widely used format for storing data, and can be easily
created and edited using spreadsheet software such as Microsoft Excel or
Google Sheets.
2. The process of reading data from CSV files is relatively simple and can
be done using a variety of programming languages, such as Python or
Java.
3. Data transformation is an essential part of this process, as CSV files often
contain unstructured or inconsistent data that needs to be cleaned and
standardized before it can be stored in a database.
4. Traditional DBMS such as MySQL, PostgreSQL, or Oracle are designed
to handle large volumes of structured data and provide advanced features
for data querying, analysis, and reporting.
However, there are some potential drawbacks and limitations to this approach, such
as:
1. CSV files may not be the best choice for storing large volumes of data, as
they can become unwieldy and difficult to manage over time.
2. The process of data transformation can be time-consuming and complex,
especially if the CSV files contain large amounts of unstructured or
inconsistent data.
3. The use of traditional DBMS can also be limiting, as these systems are
often designed for specific use cases and may not be flexible enough to
handle changing data requirements or data models.
To address these limitations and build an effective solution, the following
steps are required:
1. Install PySpark, HDFS (the Hadoop Distributed File System), and any necessary
JDBC drivers for your DBMS on your Linux machine.
2. Use PySpark to read the CSV files from HDFS. PySpark provides several APIs to
read CSV files, such as `spark.read.csv`, which loads CSV files as DataFrames.
Here's an example:
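The snippet below is a minimal sketch of this step; the HDFS URL, file path, and application name are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; in the pyspark shell or notebook it already exists as `spark`.
spark = SparkSession.builder.appName("csv-ingest").getOrCreate()

# Read a CSV file stored in HDFS into a DataFrame.
videos_df = (
    spark.read
         .option("header", "true")       # first row contains column names
         .option("inferSchema", "true")  # let Spark guess the column types
         .csv("hdfs://localhost:9000/warehouse/raw/tseries_videos.csv")
)

videos_df.printSchema()
videos_df.show(5)
```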
3. Transform the data using PySpark's DataFrame API. PySpark provides a rich set
of APIs to manipulate DataFrames: you can perform operations like filtering,
aggregation, joining, and more. Here's an example:
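The aggregation below is a sketch only; the column names (`publish_date`, `views`, `likes`) are assumed for illustration and should be replaced with the actual columns in the dataset.

```python
from pyspark.sql import functions as F

# Total views and average likes per upload year.
views_by_year = (
    videos_df
        .filter(F.col("views").isNotNull())                      # drop rows without view counts
        .withColumn("year", F.year(F.to_date("publish_date")))   # derive the upload year
        .groupBy("year")
        .agg(
            F.count("*").alias("video_count"),
            F.sum("views").alias("total_views"),
            F.avg("likes").alias("avg_likes"),
        )
        .orderBy("year")
)

views_by_year.show()
```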
4. Store the final output tables in your traditional DBMS. PySpark can write to
many popular DBMSs, such as MySQL, PostgreSQL, and Oracle, through its JDBC data
source. You can use the appropriate connector to write the DataFrames to your DBMS.
Here's an example:
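The write below is a minimal sketch, assuming a local MySQL instance; the JDBC URL, credentials, and table name are placeholders, and the MySQL Connector/J jar must be available on the Spark classpath (for example via the --jars option).

```python
# Write the aggregated table to MySQL over JDBC, replacing it on each batch run.
(views_by_year.write
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/analytics")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "views_by_year")
    .option("user", "analytics_user")
    .option("password", "analytics_password")
    .mode("overwrite")
    .save())
```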
With these steps, you can implement reading data from CSV files, transforming the
data using PySpark, and storing the final output tables in a traditional DBMS.