Databricks
4. Creating RDD
Problem statement: read a sample file that contains numbers and
display its contents using an RDD
note:
1. you can use an existing notebook or create a new one
2. the dataset we are using here is "sampleds1"
step1: create a notebook
step2: upload the dataset to the Databricks environment
go to the left sidebar and click "databricks ==> create ==> table"
upload file ==> "Drop files to upload, or click to browse"
now we can see the file under "DBFS => FileStore => tables"
now, copy the file path "/FileStore/tables/sampleds1.txt"
step3: go back to the notebook (use the back arrow) and write the
Spark code
// Observations
1. we have to do the imports first
2. from pyspark we need to import SparkConf and SparkContext
3. SparkConf ==> allows us to set the configuration Spark needs,
for example if we want to read data from an RDS instance, from an
external database, or from some other file store.
All the configuration Spark requires to access those databases or
file stores is provided through SparkConf.
4. SparkContext ==> the entry point of Spark inside the cluster. This is
where Spark actually starts using the cluster: it creates the RDD
partitions, picks up all the settings and configuration it needs to
work properly, and then starts reading the file and running the
processing.
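The import cell for this step could look like the sketch below (creating the SparkConf explicitly here is an assumption for illustration; the notes only mention the two imports):
from pyspark import SparkConf, SparkContext
# SparkConf holds whatever configuration Spark needs
# (external database / file store access, app name, etc.)
conf = SparkConf()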
// Observations
1. Next, we need to create the SparkContext; it is the entry point of the
program.
2. We will save the context in a variable called "sc"
3. what does .getOrCreate do?
==> it refers to SparkContext and asks it to either get an existing
SparkContext or create a new one.
note:
1. if we are working on a local machine/local cluster, we can simply call
SparkContext and it will create the SparkContext for us.
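Putting the observations together, the context cell would look like this (a minimal sketch; on Databricks an "sc" is usually already defined, in which case getOrCreate simply returns the existing context):
sc = SparkContext.getOrCreate(conf=conf)   # conf is the SparkConf created above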
text = sc.textFile('/FileStore/tables/sampleds1.txt')
// Observations
1. we define a variable "text" and refer to "sc" whenever we want to
create an RDD, i.e. "sc" is responsible for creating the RDD
2. this is a transformation, so nothing is actually read yet
text
// Observations
1. evaluating the variable only shows the metadata for the RDD, not its
contents
text.collect()
// Observations
1. "collect" is an action statement
2. "collect" asks Spark to run all the transformations that come before
the action
3. text.collect() ==> this is the point where the file actually starts
being processed
4. when we read a text file this way, Spark simply splits the file's data
into lines for us, one element per line
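For example, assuming "sampleds1.txt" contains the two lines "1 2 3" and "4 5 6" (hypothetical contents, just for illustration), the action returns one string per line:
text = sc.textFile('/FileStore/tables/sampleds1.txt')
text.collect()
# ==> ['1 2 3', '4 5 6']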
RDD Functions:
map()
==> used as a mapper of data from one state to another
==> it creates a new RDD
==> Syntax: rdd.map(lambda x: x.split())
note:
1. A lambda function is a small anonymous function.
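Continuing with the same hypothetical file contents as above, map with a lambda splits each line into a list of words (the variable name "words" is just for illustration):
words = text.map(lambda x: x.split())   # transformation: builds a new RDD
words.collect()
# ==> [['1', '2', '3'], ['4', '5', '6']]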
from pyspark import SparkConf, SparkContext

conf = SparkConf()                           # configuration object required by getOrCreate below
sc = SparkContext.getOrCreate(conf=conf)

# COMMAND ----------

myrdd = sc.textFile('/FileStore/tables/sample.txt')
myrdd = sc.textFile('/FileStore/tables/sample2-1.txt')   # overwrites the previous RDD

# COMMAND ----------

myrdd1 = myrdd.map(lambda x: x.split(' '))   # transformation: split each line on spaces
myrdd1.collect()

# COMMAND ----------

myrdd.collect()     # the original lines of the file

# COMMAND ----------

myrdd1.collect()    # the lines split into lists of words
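Assuming sample2-1.txt holds lines of space-separated values (its contents are not shown in these notes), the last two cells make the difference visible: myrdd.collect() returns a flat list with one string per line, while myrdd1.collect() returns a list of lists, one inner list of words per line.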