Data Analysis PHASE
A PROJECT REPORT
Submitted by
Dheeraj Singh Dhami (21BCS3113)
Manasvi Rajeev Sharma (21BCS3092)
BACHELOR OF ENGINEERING
IN
Chandigarh University
May 2023
BONAFIDE CERTIFICATE
Certified that this project report "DATA ANALYSIS USING BIG DATA TOOLS" is
the bonafide work of Dheeraj Dhami and Manasvi Sharma, who carried out
the project work under my supervision.
INTRODUCTION
We have a T-Series music video dataset, and let us assume that the client wants
to see an analysis of the overall data. The dataset is very large (potentially
billions of rows), so analyzing it with a traditional DBMS alone is not feasible.
Instead, we will use a Big Data tool such as Apache Spark to transform the data,
generate the necessary aggregated output tables, and store them in a MySQL
database. With this architecture, the UI can fetch reports and charts from MySQL
much faster than it could by querying the raw data directly. Finally, the batch
job we use to analyze the data can be automated to run daily at a fixed time.
4. Set up the environment and install all the tools required for the project.
5. Read data from a CSV file and store the data in HDFS (Hadoop Distributed
File System) in a compressed format (a short PySpark sketch follows this list).
6. Transform the raw data and build multiple tables by performing the required
aggregations.
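As a rough illustration of tasks 5 and 6, the sketch below reads the raw CSV file and writes it back to HDFS as Snappy-compressed Parquet; the file paths, HDFS address, and application name are placeholders chosen for this example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-hdfs").getOrCreate()

# Read the raw CSV file (header row assumed) from the local file system.
raw_df = (
    spark.read
         .option("header", "true")
         .csv("file:///home/hadoop/data/tseries_videos.csv")
)

# Store the data in HDFS in a compressed, columnar format.
# Snappy-compressed Parquet is used here; gzip-compressed CSV would also work.
(raw_df.write
    .option("compression", "snappy")
    .mode("overwrite")
    .parquet("hdfs://localhost:9000/warehouse/raw/tseries_videos"))
```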
Timeline
We have to install all the tools and set up the environment (if you have
already installed the required tools you can skip this task); make sure you
install all the required software in one location for simplicity.
After starting the Hadoop services, add the following environment variables so
that the pyspark command launches inside Jupyter Notebook:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
Finally, you can run the pyspark command in a terminal, which should start
Spark inside a Jupyter Notebook.
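To confirm that the setup works, a quick check like the one below can be run in the notebook. This assumes the notebook was launched through the pyspark command, which pre-creates the `spark` session object.

```python
# `spark` is created automatically by the pyspark launcher.
print(spark.version)                 # Spark version in use
print(spark.sparkContext.uiWebUrl)   # URL of the Spark web UI
```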
Chapter 2
LITERATURE REVIEW / BACKGROUND STUDY:
2.1 Abstract:
The exponential growth of data has led to an increase in the volume, velocity, and
variety of data generated. Traditional data analysis tools are no longer sufficient to
handle such large data sets. Big data tools provide a solution to this challenge by
enabling analysts to process, analyze, and derive insights from massive data sets.
This research paper provides an overview of big data analytics, explores the
various big data tools available, identifies challenges faced in big data analytics,
and provides best practices for overcoming these challenges.
2.2 Introduction:
The advent of big data has created a new era of data analysis, where traditional
data analysis tools are no longer capable of handling the scale of data being
generated. Big data analytics refers to the use of advanced techniques and tools to
analyze and extract insights from large data sets. The goal of this research
paper is to examine the use of big data tools for data analysis: to explain the
importance of big data analytics, explore the various big data tools available,
identify the challenges faced in big data analytics, and provide best practices
for overcoming these challenges.
Importance of Big Data Analytics:
Big data analytics plays a significant role in enabling organizations to make
informed decisions based on insights derived from their data. It provides a
powerful tool for analyzing data, identifying patterns, trends, and insights that
would otherwise be difficult to discern. For instance, big data analytics can be used
to analyze customer behavior, identify fraud, optimize business processes, and
improve customer satisfaction. By using big data analytics, businesses can gain a
competitive edge by making informed decisions based on insights derived from
their data.
Case Study:
A case study on the use of big data analytics in the healthcare industry can provide
an insight into how big data tools can be used to extract insights from large data
sets. In the healthcare industry, big data analytics can be used to improve patient
outcomes, identify disease patterns, and optimize resource utilization. For example,
the use of big data analytics can enable healthcare providers to identify high-risk
patients, develop personalized treatment plans, and make better use of available
resources.
CHAPTER 3
DESIGN FLOW/PROCESS
Reading data from CSV files and transforming it to generate final output tables to
be stored in traditional DBMS has several key features:
1. CSV files are a widely used format for storing data, and can be easily
created and edited using spreadsheet software such as Microsoft Excel or
Google Sheets.
2. The process of reading data from CSV files is relatively simple and can
be done using a variety of programming languages, such as Python or
Java.
3. Data transformation is an essential part of this process, as CSV files often
contain unstructured or inconsistent data that needs to be cleaned and
standardized before it can be stored in a database.
4. Traditional DBMS such as MySQL, PostgreSQL, or Oracle are designed
to handle large volumes of structured data and provide advanced features
for data querying, analysis, and reporting.
However, there are some potential drawbacks and limitations to this approach, such
as:
1. CSV files may not be the best choice for storing large volumes of data, as
they can become unwieldy and difficult to manage over time.
2. The process of data transformation can be time-consuming and complex,
especially if the CSV files contain large amounts of unstructured or
inconsistent data.
3. The use of traditional DBMS can also be limiting, as these systems are
often designed for specific use cases and may not be flexible enough to
handle changing data requirements or data models.
To address these limitations and build an effective solution, the following
steps are required:
1. Install PySpark, HDFS (the Hadoop Distributed File System), and any necessary
JDBC drivers for your DBMS on your Linux machine.
2. Use PySpark to read the CSV files from HDFS. PySpark provides several APIs to
read CSV files, such as `spark.read.csv`, which loads CSV files as DataFrames.
Here's an example:
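The snippet below is a minimal sketch of this step; the HDFS URL, file path, and application name are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; in the pyspark shell or notebook it already exists as `spark`.
spark = SparkSession.builder.appName("csv-ingest").getOrCreate()

# Read a CSV file stored in HDFS into a DataFrame.
videos_df = (
    spark.read
         .option("header", "true")       # first row contains column names
         .option("inferSchema", "true")  # let Spark guess the column types
         .csv("hdfs://localhost:9000/warehouse/raw/tseries_videos.csv")
)

videos_df.printSchema()
videos_df.show(5)
```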
3. Transform the data using PySpark's DataFrame API. PySpark provides a rich set
of APIs to manipulate DataFrames: you can perform operations like filtering,
aggregation, joining, and more. Here's an example:
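The aggregation below is a sketch only; the column names (`publish_date`, `views`, `likes`) are assumed for illustration and should be replaced with the actual columns in the dataset.

```python
from pyspark.sql import functions as F

# Total views and average likes per upload year.
views_by_year = (
    videos_df
        .filter(F.col("views").isNotNull())                      # drop rows without view counts
        .withColumn("year", F.year(F.to_date("publish_date")))   # derive the upload year
        .groupBy("year")
        .agg(
            F.count("*").alias("video_count"),
            F.sum("views").alias("total_views"),
            F.avg("likes").alias("avg_likes"),
        )
        .orderBy("year")
)

views_by_year.show()
```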
4. Store the final output tables in your traditional DBMS. PySpark can write to
many popular DBMSs, such as MySQL, PostgreSQL, and Oracle, through its JDBC data
source. You can use the appropriate connector to write the DataFrames to your DBMS.
Here's an example:
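The write below is a minimal sketch, assuming a local MySQL instance; the JDBC URL, credentials, and table name are placeholders, and the MySQL Connector/J jar must be available on the Spark classpath (for example via the --jars option).

```python
# Write the aggregated table to MySQL over JDBC, replacing it on each batch run.
(views_by_year.write
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/analytics")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "views_by_year")
    .option("user", "analytics_user")
    .option("password", "analytics_password")
    .mode("overwrite")
    .save())
```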
With these steps, you can implement reading data from CSV files, transforming the
data using PySpark, and storing the final output tables in a traditional DBMS.