Big Data G
Institute of Technology
Department of Software Engineering
Course Title: Fundamentals of Big Data Analytics and BI
Group Project
No. Name ID. No.
1 Kadar Abdirahman Muhumed R/1696/13
2 Nasra Ahmed Abdi R/4277/13
3 Suad Yasin Omer R/3755/13
4 Abdirasak Mustafe Mohamed R/5277/13
5 Hassan Bashir Abdikarem R/1495/13
6 Fehima Ahmed Rabi R/1197/13
7 Amin Abdi Hassen R/4321/13
8 Adnan Shukri Abib R/4736/13
9 Abdulahi Abdirahman Omer R/3114/13
10 Ali Suldan Hassan R/5570/13
1. Introduction to Hadoop and its Components
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It consists of four core components: HDFS (the Hadoop Distributed File System), YARN (resource management), MapReduce (distributed processing), and Hadoop Common (shared utilities).
Hadoop is designed to handle massive datasets that exceed the processing capacity of a single machine. It distributes data and processing tasks across a cluster of commodity hardware (standard, inexpensive servers), which brings several advantages:
- Scalability: the cluster scales horizontally by adding more nodes as data volume grows.
- Fault tolerance: data is replicated across multiple nodes, so it remains available even if some nodes fail.
- Cost-effectiveness: commodity hardware keeps big data processing affordable.
Hadoop provides a robust and scalable platform for handling and processing big data. Combined with ecosystem tools such as Hive, Pig, and HBase, it offers a comprehensive solution for a wide range of big data challenges, from data storage and retrieval to complex analysis and real-time processing.
The following pages outline the steps we took to work with the Hadoop ecosystem using Google Colab and PySpark: setting up the environment, loading a dataset, querying it with HiveQL-like syntax, and saving the results.
Steps Followed
Using Google Colab with PySpark is a practical and accessible alternative to a full-fledged Hadoop environment for small-scale tasks. The combination of HDFS-like file handling, SQL queries, and cloud storage allowed us to complete every task without dedicated Hadoop infrastructure.
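As a rough illustration of the setup step, the sketch below installs PySpark in a Colab notebook and starts a local Spark session; the application name is a placeholder rather than the exact configuration we used, and enableHiveSupport() is optional because temporary views can be queried without it.

```python
# Install PySpark inside the Colab runtime ("!" runs a shell command in the notebook).
!pip install -q pyspark

from pyspark.sql import SparkSession

# Start a local Spark session. enableHiveSupport() lets spark.sql() behave like
# a HiveQL engine; it is optional when only temporary views are queried.
spark = (
    SparkSession.builder
    .appName("hadoop-ecosystem-demo")  # placeholder application name
    .enableHiveSupport()
    .getOrCreate()
)

print("Spark version:", spark.version)
```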
The process highlights the flexibility of PySpark in simulating key Hadoop operations such as distributed
file handling and query processing. By leveraging Colab’s cloud-based resources, users can efficiently
manage datasets and execute HiveQL-like queries. Additionally, the integration with Python makes it
easier to manipulate and analyze data, offering a smooth transition between data preprocessing and big
data analytics.
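To make the loading and querying steps concrete, here is a minimal sketch that reuses the Spark session from the previous snippet; the file name sales.csv and the columns region and amount are hypothetical stand-ins for our actual dataset.

```python
# Load a CSV file into a DataFrame (path and columns are hypothetical).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("sales")

# Run a HiveQL-like aggregation through Spark SQL.
result = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""")

result.show()
```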
While Google Colab lacks native HDFS support, its ability to save, upload, and download files
compensates for this limitation in small to medium-scale tasks. For large-scale production systems,
connecting Colab to cloud services like Google Cloud Storage, Amazon S3, or Hadoop-as-a-Service
platforms can further extend its capabilities.
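As one way of working around the missing HDFS layer, the sketch below writes the query result from the previous snippet to CSV inside the Colab runtime and downloads it to the local machine; the output path is again a placeholder.

```python
import glob
from google.colab import files

# Write the result to a single CSV file; coalesce(1) merges the partitions
# so that only one part file is produced.
result.coalesce(1).write.mode("overwrite").csv("output/region_totals", header=True)

# Locate the generated part file and download it from the Colab runtime.
part_file = glob.glob("output/region_totals/part-*.csv")[0]
files.download(part_file)
```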
This project demonstrates that even without a dedicated Hadoop setup, it is possible to learn and apply
core big data concepts effectively. The combination of PySpark and Google Colab enables practical
experimentation, making it an excellent choice for educational purposes, prototyping, and small-scale
data projects.