
# PySpark Exam Setup and Basic Code Guide
### 1. Unzipping the File

Assuming the file is zipped, unzip it first:


```bash
cd /home/ashok/Documents
unzip <your_zip_file_name>.zip
```

### 2. Start Hadoop

Start Hadoop first, since HDFS and YARN are required for the PySpark operations below:


```bash
start-dfs.sh
start-yarn.sh
```
Check that Hadoop is running by opening these URLs:
- HDFS: http://localhost:9870
- YARN: http://localhost:8088
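
If a browser is not convenient, a quick programmatic check is also possible from Python (a minimal sketch; it assumes the default web UI ports shown above):

```python
import urllib.request

# Ping the Hadoop web UIs to confirm both daemons are up
for name, url in [("HDFS", "http://localhost:9870"), ("YARN", "http://localhost:8088")]:
    try:
        urllib.request.urlopen(url, timeout=5)
        print(f"{name} UI is reachable at {url}")
    except OSError as exc:
        print(f"{name} UI is NOT reachable at {url}: {exc}")
```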

### 3. Navigate to the Folder

Go to the folder where the notebook file is located:


```bash
cd /home/ashok/Documents/qpaper
```

### 4. Start PySpark Notebook

Start the PySpark Jupyter Notebook using the following command (`pysparknb` is typically an alias or helper script preconfigured on the exam machine):


```bash
pysparknb
```
### 5. Basic PySpark Code in the Notebook

Once you have opened the notebook (.ipynb), run the following basic snippets to verify that everything is working.

##### a. Import Required Libraries

```python
from pyspark.sql import SparkSession
```

##### b. Initialize Spark Session

```python
spark = SparkSession.builder \
    .appName("Exam Setup") \
    .getOrCreate()

# Check the Spark version to verify the environment
print(spark.version)
```

##### c. Basic DataFrame Setup

Create a small DataFrame to verify that PySpark is working:


```python
# Sample Data
data = [("Ashok", 1), ("John", 2), ("Doe", 3)]

# Creating DataFrame
df = spark.createDataFrame(data, ["Name", "ID"])

# Show DataFrame
df.show()
```
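
If you want typed columns up front, you can also pass an explicit schema instead of letting Spark infer one (a sketch; the schema here just mirrors the sample data above):

```python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Define the schema explicitly rather than inferring it from the data
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("ID", IntegerType(), True),
])

df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()
```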

##### d. Reading Data from HDFS

To confirm HDFS is reachable, list its root directory from a notebook cell (the leading `!` tells Jupyter to run a shell command):
```bash
!hdfs dfs -ls /
```
You can also read files from HDFS if needed:
```python
# Example to read from HDFS if a file is stored there
df = spark.read.csv("hdfs://localhost:9000/path/to/your/file.csv", header=True)
df.show()
```
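
By default every CSV column is read as a string; if the file has typed columns, you can ask Spark to infer them (a sketch using the same illustrative path as above):

```python
# Let Spark infer column types instead of reading everything as strings
df = spark.read.csv(
    "hdfs://localhost:9000/path/to/your/file.csv",
    header=True,
    inferSchema=True,
)
df.printSchema()
```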

##### e. Basic DataFrame Operations

You can perform a few basic operations to manipulate data:


```python
# Show schema
df.printSchema()

# Select specific columns
df.select("Name").show()

# Filter rows
df.filter(df["ID"] > 1).show()

# Group by a column
df.groupBy("Name").count().show()
```
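
Column expressions from `pyspark.sql.functions` cover most other manipulations you are likely to need; for example (a sketch reusing the same `df`):

```python
from pyspark.sql import functions as F

# Add a derived column, then sort by ID in descending order
df.withColumn("ID_plus_10", F.col("ID") + 10) \
  .orderBy(F.desc("ID")) \
  .show()
```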

##### f. Saving DataFrame to HDFS

If you need to save the DataFrame back to HDFS:


```python
df.write.csv("hdfs://localhost:9000/path/to/output_folder", header=True)
```
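
Note that `df.write.csv` fails if the output folder already exists, so re-running a cell is a common source of errors. To overwrite instead (same illustrative path):

```python
# Overwrite the output folder if it already exists
df.write.mode("overwrite").csv(
    "hdfs://localhost:9000/path/to/output_folder",
    header=True,
)
```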

### 6. Saving and Renaming the Notebook

Periodically save your progress by pressing Ctrl+S. After completing your work or at any
point, rename the notebook as instructed:
- Click the title at the top of the notebook.
- Rename it to your roll number (e.g., '123456').

### 7. Shut Down Spark and Hadoop After Completion

Once you finish your work, stop the Spark session first, then shut down the Hadoop services to free up resources.
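
From the notebook, stop the session you created earlier:

```python
# Release the resources held by the Spark session
spark.stop()
```

Then, from a terminal, stop the Hadoop services: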
```bash
stop-yarn.sh
stop-dfs.sh
```
