
# PySpark Exam Setup and Basic Code Guide
### 1. Unzipping the File

Assuming the file is zipped, unzip it first:


```bash
cd /home/ashok/Documents
unzip <your_zip_file_name>.zip
```
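To double-check what was extracted before moving on, you can list the archive contents (standard `unzip` flag):

```bash
# List the contents of the archive without extracting again
unzip -l <your_zip_file_name>.zip
```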

### 2. Start Hadoop

Start Hadoop before launching PySpark; the HDFS read and write steps later in this guide depend on it:


```bash
start-dfs.sh
start-yarn.sh
```
Check that Hadoop is running by opening these URLs in a browser:
- HDFS NameNode UI: http://localhost:9870
- YARN ResourceManager UI: http://localhost:8088
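If the web UIs don't load, a quick terminal check is to list the running Java daemons with `jps` (part of the JDK). On a healthy single-node setup you would typically expect to see entries such as NameNode, DataNode, ResourceManager, and NodeManager:

```bash
# List running JVM processes; the daemon names above are typical for single-node Hadoop
jps
```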

### 3. Navigate to the Folder

Go to the folder where the notebook file is located:


```bash
cd /home/ashok/Documents/qpaper
```

### 4. Start PySpark Notebook

Start the PySpark Jupyter Notebook using the following command:


```bash
pysparknb
```
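Note that `pysparknb` is presumably a lab-specific alias. If it is not available, a common equivalent (a standard PySpark mechanism) is to point the PySpark driver at Jupyter via environment variables:

```bash
# Launch a Jupyter notebook backed by a PySpark session
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark
```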
### 5. Basic PySpark Code in the Notebook

Once the notebook (.ipynb) is open, run these basic snippets to confirm that everything is working.

##### a. Import Required Libraries

```python
from pyspark.sql import SparkSession
```

##### b. Initialize Spark Session

```python
spark = SparkSession.builder \
    .appName("Exam Setup") \
    .getOrCreate()

# Check the Spark version to verify the environment
print(spark.version)
```
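As an optional extra check (not in the original guide), the SparkContext exposes the master URL and application name, and the Spark UI is normally served at http://localhost:4040 while the session is alive:

```python
# Confirm which master the session connected to and under what name
print(spark.sparkContext.master)
print(spark.sparkContext.appName)
```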

##### c. Basic DataFrame Setup

Create a small DataFrame to verify that PySpark is working:


```python
# Sample Data
data = [("Ashok", 1), ("John", 2), ("Doe", 3)]

# Creating DataFrame
df = spark.createDataFrame(data, ["Name", "ID"])

# Show DataFrame
df.show()
```
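If you want explicit column types instead of letting Spark infer them, you can pass a schema. A minimal sketch using the standard `pyspark.sql.types` API:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema: Name as string, ID as integer
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("ID", IntegerType(), True),
])

df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()
```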

##### d. Reading Data from HDFS

To confirm that HDFS is reachable, list the filesystem root. Inside the notebook, prefix shell commands with `!`; in a terminal, drop the `!`:
```bash
!hdfs dfs -ls /
```
You can also read files from HDFS if needed:
```python
# Example to read from HDFS if a file is stored there
df = spark.read.csv("hdfs://localhost:9000/path/to/your/file.csv", header=True)
df.show()
```
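By default every CSV column is read as a string; adding `inferSchema=True` asks Spark to sample the data and pick types (a standard reader option, at the cost of an extra pass over the file):

```python
# Let Spark infer column types from the data
df = spark.read.csv("hdfs://localhost:9000/path/to/your/file.csv",
                    header=True, inferSchema=True)
df.printSchema()
```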

##### e. Basic DataFrame Operations

You can perform a few basic operations to manipulate data:


```python
# Show schema
df.printSchema()

# Select specific columns
df.select("Name").show()

# Filter rows
df.filter(df["ID"] > 1).show()

# Group by a column
df.groupBy("Name").count().show()
```
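A couple more operations you may need, using the standard `pyspark.sql.functions` module (column names here match the sample DataFrame above):

```python
from pyspark.sql import functions as F

# Add a derived column
df.withColumn("ID_plus_one", F.col("ID") + 1).show()

# Sort rows by ID, descending
df.orderBy(F.col("ID").desc()).show()
```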

##### f. Saving DataFrame to HDFS

If you need to save the DataFrame back to HDFS:


```python
df.write.csv("hdfs://localhost:9000/path/to/output_folder", header=True)
```
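Two writer options worth knowing (both standard API): `mode("overwrite")` replaces an existing output folder instead of failing, and `coalesce(1)` produces a single part file, which is convenient for small exam-sized outputs:

```python
# Overwrite any existing output and write a single part file
df.coalesce(1).write.mode("overwrite").csv(
    "hdfs://localhost:9000/path/to/output_folder", header=True)
```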

### 6. Saving and Renaming the Notebook

Save your progress periodically with Ctrl+S. When you finish your work (or at any point), rename the notebook as instructed:
- Click the title at the top of the notebook.
- Rename it to your roll number (e.g., '123456').

### 7. Shut Down Spark and Hadoop After Completion

Once you finish your work, stop the Spark session and the Hadoop services to free up resources. Stop Hadoop from the terminal:
```bash
stop-yarn.sh
stop-dfs.sh
```
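From within the notebook, the Spark session itself can be stopped with the standard call:

```python
# Release the notebook's Spark resources
spark.stop()
```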
