Big Data G
Institute of Technology
Department of Software Engineering
Course Title: Fundamentals of Big Data Analytics and BI
Group Project
No. Name ID. No.
1 Kadar Abdirahman Muhumed R/1696/13
2 Nasra Ahmed Abdi R/4277/13
3 Suad Yasin Omer R/3755/13
4 Abdirasak Mustafe Mohamed R/5277/13
5 Hassan Bashir Abdikarem R/1495/13
6 Fehima Ahmed Rabi R/1197/13
7 Amin Abdi Hassen R/4321/13
8 Adnan Shukri Abib R/4736/13
9 Abdulahi Abdirahman Omer R/3114/13
10 Ali Suldan Hassan R/5570/13
1. Introduction to Hadoop and its Components
Hadoop is an open-source framework for storing and processing large datasets in a distributed computing environment. It consists of four core components: HDFS (the Hadoop Distributed File System), YARN (resource management), MapReduce (distributed processing), and Hadoop Common (shared utilities).
Hadoop is designed to handle massive datasets that exceed the processing capacity of a single machine. It distributes data and processing tasks across a cluster of commodity hardware (standard, inexpensive servers), which brings several advantages:
- Scalability: the cluster scales horizontally by adding more nodes as data volume grows.
- Fault tolerance: data is replicated across multiple nodes, so it remains available even if some nodes fail.
- Cost-effectiveness: commodity hardware keeps big data processing affordable.
Hadoop provides a robust and scalable platform for handling and processing big data. Combined with ecosystem tools such as Hive, Pig, and HBase, it offers a comprehensive solution for a wide range of big data challenges, from data storage and retrieval to complex analysis and real-time processing.
The following pages outline the steps we took to work with the Hadoop ecosystem using Google Colab and PySpark: setting up the environment, loading a dataset, querying it with HiveQL-like syntax, and saving the results.
Steps Followed
Using Google Colab with PySpark is a practical and accessible alternative to a full-fledged Hadoop environment for small-scale tasks. The combination of HDFS-like file handling, SQL queries, and cloud storage allowed us to complete every task without dedicated Hadoop infrastructure.
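As a rough illustration of the setup step, the sketch below installs PySpark in a Colab notebook and starts a local Spark session; the application name is a placeholder rather than the exact configuration we used, and enableHiveSupport() is optional because temporary views can be queried without it.

```python
# Install PySpark inside the Colab runtime ("!" runs a shell command in the notebook).
!pip install -q pyspark

from pyspark.sql import SparkSession

# Start a local Spark session. enableHiveSupport() lets spark.sql() behave like
# a HiveQL engine; it is optional when only temporary views are queried.
spark = (
    SparkSession.builder
    .appName("hadoop-ecosystem-demo")  # placeholder application name
    .enableHiveSupport()
    .getOrCreate()
)

print("Spark version:", spark.version)
```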
The process highlights the flexibility of PySpark in simulating key Hadoop operations such as distributed
file handling and query processing. By leveraging Colab’s cloud-based resources, users can efficiently
manage datasets and execute HiveQL-like queries. Additionally, the integration with Python makes it
easier to manipulate and analyze data, offering a smooth transition between data preprocessing and big
data analytics.
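To make the loading and querying steps concrete, here is a minimal sketch that reuses the Spark session from the previous snippet; the file name sales.csv and the columns region and amount are hypothetical stand-ins for our actual dataset.

```python
# Load a CSV file into a DataFrame (path and columns are hypothetical).
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("sales")

# Run a HiveQL-like aggregation through Spark SQL.
result = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""")

result.show()
```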
While Google Colab lacks native HDFS support, its ability to save, upload, and download files
compensates for this limitation in small to medium-scale tasks. For large-scale production systems,
connecting Colab to cloud services like Google Cloud Storage, Amazon S3, or Hadoop-as-a-Service
platforms can further extend its capabilities.
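As one way of working around the missing HDFS layer, the sketch below writes the query result from the previous snippet to CSV inside the Colab runtime and downloads it to the local machine; the output path is again a placeholder.

```python
import glob
from google.colab import files

# Write the result to a single CSV file; coalesce(1) merges the partitions
# so that only one part file is produced.
result.coalesce(1).write.mode("overwrite").csv("output/region_totals", header=True)

# Locate the generated part file and download it from the Colab runtime.
part_file = glob.glob("output/region_totals/part-*.csv")[0]
files.download(part_file)
```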
This project demonstrates that even without a dedicated Hadoop setup, it is possible to learn and apply
core big data concepts effectively. The combination of PySpark and Google Colab enables practical
experimentation, making it an excellent choice for educational purposes, prototyping, and small-scale
data projects.