# PySpark Exam Setup and Basic Code Guide
### 1. Unzipping the File
Once you have opened the notebook (.ipynb), run the following basic code to confirm that everything is working.
```python
from pyspark.sql import SparkSession
```
```python
spark = SparkSession.builder \
    .appName("Exam Setup") \
    .getOrCreate()

# Sample data to verify the session works
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]

# Create a DataFrame from the sample data
df = spark.createDataFrame(data, ["Name", "ID"])

# Show the DataFrame
df.show()
```
To verify that HDFS is running, list its root directory from within the notebook:
```bash
!hdfs dfs -ls /
```
You can also read files from HDFS if needed:
```python
# Example to read from HDFS if a file is stored there
df = spark.read.csv("hdfs://localhost:9000/path/to/your/file.csv", header=True)
df.show()
```
With the DataFrame in place, you can try basic transformations:
```python
# Filter rows where ID is greater than 1
df.filter(df["ID"] > 1).show()

# Group by a column and count occurrences
df.groupBy("Name").count().show()
```
Periodically save your progress by pressing Ctrl+S. After completing your work or at any
point, rename the notebook as instructed:
- Click the title at the top of the notebook.
- Rename it to your roll number (e.g., '123456').
Once you finish your work, stop Hadoop services to free up resources:
```bash
stop-yarn.sh
stop-dfs.sh
```