
JIGJIGA UNIVERSITY

Institute of Technology
Department of Software Engineering
Course Title: Fundamentals of Big Data Analytics and BI
Group Project
No. Name ID. No.
1 Kadar Abdirahman Muhumed R/1696/13
2 Nasra Ahmed Abdi R/4277/13
3 Suad Yasin Omer R/3755/13
4 Abdirasak Mustafe Mohamed R/5277/13
5 Hassan Bashir Abdikarem R/1495/13
6 Fehima Ahmed Rabi R/1197/13
7 Amin Abdi Hassen R/4321/13
8 Adnan Shukri Abib R/4736/13
9 Abdulahi Abdirahman Omer R/3114/13
10 Ali Suldan Hassan R/5570/13
1. Introduction to Hadoop and its Components

Hadoop is an open-source framework for storing and processing large datasets in a distributed
computing environment. It is built around the Hadoop Distributed File System (HDFS) for storage, and its ecosystem includes the following key components:

• MapReduce: A programming model for processing and generating large datasets in parallel (a brief sketch follows this list).
• Hive: A data warehouse infrastructure built on Hadoop that provides a SQL-like query interface (HiveQL).
• Pig: A platform for analyzing large datasets using a high-level scripting language (Pig Latin).
• HBase: A NoSQL database that runs on top of HDFS for real-time read/write access.
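As a brief illustration of the MapReduce model described above, here is the classic word-count example. It is a minimal sketch written with PySpark's RDD API (the tool used later in this report) rather than native Hadoop MapReduce, and the input file name sample.txt is a hypothetical placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("sample.txt")              # one RDD element per input line
counts = (lines
          .flatMap(lambda line: line.split())  # "map" phase: emit individual words
          .map(lambda word: (word, 1))         # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b))    # "reduce" phase: sum the counts per word
print(counts.take(10))                         # first few (word, count) pairs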

Hadoop is designed to handle massive datasets that exceed the processing capacity of a single machine.
It leverages the power of distributed computing, distributing data and processing tasks across a cluster
of commodity hardware (like standard servers).

Key Characteristics of Hadoop:

• Scalability: Scales horizontally by adding more nodes to the cluster as data volume grows.
• Fault Tolerance: Data is replicated across multiple nodes, so it remains available even if some nodes fail.
• Cost-Effectiveness: Runs on inexpensive commodity hardware, which keeps large-scale data processing affordable.

Hadoop provides a robust and scalable platform for handling and processing big data. Its combination of
HDFS, MapReduce, Hive, Pig, and HBase offers a comprehensive solution for various big data challenges,
from data storage and retrieval to complex analysis and real-time processing.
The following pages outline the steps we took to complete the tasks of working with the Hadoop
ecosystem using Google Colab and PySpark. The tasks include setting up the environment, loading a
dataset, querying it using HiveQL-like syntax, and saving the results.

Steps Followed

Step 1: Setting Up PySpark on Google Colab

1. Installed PySpark by running the following command:

!pip install pyspark


2. Imported the required PySpark module:

from pyspark.sql import SparkSession


3. Created a Spark session to initiate PySpark operations:
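A minimal sketch of the session creation (the application name below is our own placeholder):

spark = SparkSession.builder.appName("BigDataGroupProject").getOrCreate()
print(spark.version)  # confirm that the session is active
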
Step 2: Loading the Dataset

1. Downloaded an open-source dataset (from Kaggle).


2. Uploaded the dataset to Google Colab:

Used the file upload option in Colab or ran:

from google.colab import files
uploaded = files.upload()
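Once uploaded, the file is read into a Spark DataFrame so it can be queried in Step 3. A minimal sketch, assuming a CSV file with a header row; the name dataset.csv is a placeholder for the actual Kaggle file:

df = spark.read.csv("dataset.csv", header=True, inferSchema=True)  # load the CSV into a DataFrame
df.printSchema()  # inspect the inferred column types
df.show(5)        # preview the first rows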


Step 3: Querying Data Using HiveQL-like Syntax
1. Executed SQL queries using Spark SQL.
2. Grouped the data by a specific column.
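A minimal sketch of both sub-steps, assuming the DataFrame df from Step 2 and a hypothetical column named category:

df.createOrReplaceTempView("dataset")  # expose the DataFrame to Spark SQL

result = spark.sql("""
    SELECT category, COUNT(*) AS row_count
    FROM dataset
    GROUP BY category
    ORDER BY row_count DESC
""")           # HiveQL-like aggregation executed through Spark SQL
result.show()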

Step 4: Saving Query Results

1. Saved the processed DataFrame back to the local filesystem:
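A minimal sketch, assuming the result DataFrame from Step 3; the output names are placeholders, and the files.download call is one way to pull the file out of the Colab session:

result.write.mode("overwrite").csv("query_results", header=True)  # directory of CSV part files

result.toPandas().to_csv("query_results.csv", index=False)  # single CSV via pandas

from google.colab import files
files.download("query_results.csv")  # download to the local machine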


Conclusion

Using Google Colab with PySpark is a practical, accessible alternative to a full-fledged Hadoop environment for small-scale ecosystem tasks. The combination of HDFS-like file handling, Spark SQL queries, and cloud storage allowed every task to be completed without dedicated Hadoop infrastructure.

The process highlights the flexibility of PySpark in simulating key Hadoop operations such as distributed
file handling and query processing. By leveraging Colab’s cloud-based resources, users can efficiently
manage datasets and execute HiveQL-like queries. Additionally, the integration with Python makes it
easier to manipulate and analyze data, offering a smooth transition between data preprocessing and big
data analytics.

While Google Colab lacks native HDFS support, its ability to save, upload, and download files
compensates for this limitation in small to medium-scale tasks. For large-scale production systems,
connecting Colab to cloud services like Google Cloud Storage, Amazon S3, or Hadoop-as-a-Service
platforms can further extend its capabilities.

This project demonstrates that even without a dedicated Hadoop setup, it is possible to learn and apply
core big data concepts effectively. The combination of PySpark and Google Colab enables practical
experimentation, making it an excellent choice for educational purposes, prototyping, and small-scale
data projects.
