
spark5

August 20, 2024

[1]: # Install Apache Spark if not already installed


!pip install PySpark

Collecting PySpark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.3/317.3 MB 4.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) … done
Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from PySpark) (0.10.9.7)
Building wheels for collected packages: PySpark
  Building wheel for PySpark (setup.py) … done
  Created wheel for PySpark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=098cabe5072f576a6420082b780d1ceeb0d89ade653e76de07fa10971b334ad5
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built PySpark
Installing collected packages: PySpark
Successfully installed PySpark-3.5.2

[2]: # Import necessary libraries


from pyspark.sql import SparkSession

[3]: # Create a SparkSession


spark = SparkSession.builder.appName("My Pipeline").getOrCreate()

[4]: # Ingest data from a CSV file


df = spark.read.csv("/content/data.csv", header=True, inferSchema=True)

[5]: df.show()

+-----+---+-------+
| Name|Age|Country|
+-----+---+-------+
|Alice| 25|     UK|
| Jhon| 54|    USA|
| Nani| 23|  India|
|  Bob| 16|Germany|
+-----+---+-------+

[6]: # Transform data by filtering and aggregating
# (Spark matches column names case-insensitively by default,
# so "age" and "country" resolve to the "Age" and "Country" columns)


df_transformed = df.filter(df["age"] > 18).groupBy("country").count()

[7]: # Store output in a Parquet file


df_transformed.write.parquet("output.parquet")

[8]: # Show the transformed data


df_transformed.show()

+-------+-----+
|country|count|
+-------+-----+
|  India|    1|
|    USA|    1|
|     UK|    1|
+-------+-----+

[9]: # Stop the SparkSession


spark.stop()

[ ]:
