PySpark CheatSheet Edureka
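The checks below assume a SparkContext is already available as sc; the PySpark shell creates one automatically. A minimal sketch of creating it in a standalone script, where the local[*] master URL and the application name are illustrative placeholders, not values taken from this sheet:

>>> from pyspark import SparkConf, SparkContext
>>> conf = SparkConf().setMaster("local[*]").setAppName("CheatSheetApp")   # placeholder master URL and app name
>>> sc = SparkContext(conf=conf)   # entry point used by the examples below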
Once the SparkContext is initialized, it is time to check whether all the versions are correct. We also need to check the default parameters being used by the SparkContext.

>>> sc.version                   # SparkContext version
>>> sc.pythonVer                 # Python version
>>> sc.appName                   # Application name
>>> sc.applicationId             # Application ID
>>> sc.master                    # Master URL
>>> str(sc.sparkHome)            # Installed Spark path
>>> str(sc.sparkUser())          # Retrieve current SparkContext user
>>> sc.defaultParallelism        # Get default level of parallelism
>>> sc.defaultMinPartitions      # Get minimum number of partitions

count
>>> sc.parallelize([2, 3, 4]).count()
3

first
>>> sc.parallelize([2, 3, 4]).first()
2

take
>>> sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)
[2, 3]

countByValue
>>> sorted(sc.parallelize([1, 2, 1, 2, 2], 2).countByValue().items())
[(1, 2), (2, 3)]

subtract
>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]

union
>>> rdd = sc.parallelize([1, 1, 2, 3])
>>> rdd.union(rdd).collect()
[1, 1, 2, 3, 1, 1, 2, 3]

intersection
>>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
>>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
>>> rdd1.intersection(rdd2).collect()
[1, 2, 3]

cartesian
>>> rdd = sc.parallelize([1, 2])
>>> sorted(rdd.cartesian(rdd).collect())
[(1, 1), (1, 2), (2, 1), (2, 2)]

Data Loading

Creating RDDs
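RDDs are typically created either from an in-memory Python collection with parallelize or from external storage with textFile. A rough sketch, where the list contents and the data.txt path are illustrative placeholders:

>>> rdd = sc.parallelize([1, 2, 3, 4, 5])          # RDD from an in-memory collection
>>> pairs = sc.parallelize([("a", 1), ("b", 2)])   # RDD of key-value pairs
>>> lines = sc.textFile("data.txt")                # RDD from a text file, one element per line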