
Databricks

1. Login to Databricks Community Edition


2. Creating a Cluster
observe (the cluster specification in Community Edition):
0 Workers: 0 GB Memory, 0 Cores, 0 DBU
1 Driver: 15.3 GB Memory, 2 Cores, 1 DBU
step1: go to the left corner and click on "Compute" ==> Create Cluster
Cluster Name: mysparkcluster
Databricks runtime version: keep the default
Instance:
Availability zone: Auto
3. Creating a Notebook
step1: go to the left corner and click on "Databricks ==> Create ==>
Notebook"
step2:
Name: HelloWorld
Default Language: Python
Cluster: mysparkcluster ==> Create (a first test cell is sketched below)
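
A first cell to check that the notebook is attached to mysparkcluster could
be as simple as the line below (illustrative only):

print("Hello, World!")   # runs on the driver of the attached cluster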

4. Creating an RDD
Problem statement: read a sample file that consists of numbers and
display its contents using an RDD.
note:
1. you can use the existing notebook or create a new notebook
2. the dataset we are using here is "sampleds1"
step1: create a notebook
step2: upload the dataset to the Databricks environment
go to the left corner and click on "Databricks ==> Create ==> Table"
upload file ==> "Drop files to upload, or click to browse"
now, we can see the file under "DBFS ==> FileStore ==> tables"
NOW, copy the file path "/FileStore/tables/sampleds1.txt" (a quick check
of the path is shown below)
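
A quick way to confirm the upload landed where expected, assuming the
dbutils utilities that are available by default in a Databricks notebook:

# list the uploaded files under /FileStore/tables and confirm the path
display(dbutils.fs.ls("/FileStore/tables/"))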

step3: go to the notebook and write the Spark code (use the back arrow
to go back)

from pyspark import SparkConf, SparkContext

// Observations
1. we have to do the imports first
2. from pyspark we need to import SparkConf (the Spark configuration)
and SparkContext
3. SparkConf ==> allows us to set configurations for Spark,
for example if I want to read some data from an RDS instance, from an
external database, or from some other file store.
So, all the configurations that Spark requires for accessing those
databases / file stores are provided inside the Spark configuration
(a sketch is shown after this list).

4. SparkContext ==> the entry point of Spark inside the cluster. This is
where Spark actually starts using the cluster: it creates the RDD
partitions, applies all the settings and configuration it needs to work
properly, and then starts reading the file and doing the processing.
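
As a rough sketch of that idea, extra settings can be chained onto the same
configuration object; the two keys below are only illustrative and are not
required for this exercise:

conf = (SparkConf()
        .setAppName("Read sample file")
        # illustrative settings only - not needed for reading sampleds1.txt
        .set("spark.executor.memory", "2g")
        .set("spark.sql.shuffle.partitions", "8"))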

conf = SparkConf().setAppName("Read sample file")


// Observations
1. Let us provide the configuration for the configuration variable "conf"
now.
2. Set the "AppName" by which Spark and the cluster will
communicate. The AppName could be anything.
sc = SparkContext.getOrCreate(conf=conf)

// Observations
1. Next, we need to create the SparkContext; it is the entry point of the
program.
2. We save the context inside the variable "sc".
3. .getOrCreate ?
==> it refers to SparkContext and asks it to get an existing
SparkContext or create a new SparkContext.

note:
1. if we are working on a local machine / local cluster, we can simply
create a SparkContext directly and it will be created for us.

2. but on the Databricks environment, we need to use getOrCreate,
meaning that if we have already created a SparkContext in a previous
notebook and we create a SparkContext again, Databricks will not
allow that.
"The Community Edition of Databricks has the limitation that we can
only create a single SparkContext inside the free notebook."
so, the solution is "getOrCreate": get the SparkContext if it is already
available, or create a new one (see the sketch after this note).
if we just keep on creating SparkContexts, it will actually overload the
instance memory.
3. inside, we have to provide the configuration "(conf=conf)"
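
A small check of that behaviour, reusing the conf and the imports from the
cells above: calling getOrCreate twice hands back the same context instead
of creating a second one.

sc = SparkContext.getOrCreate(conf=conf)
sc_again = SparkContext.getOrCreate(conf=conf)
print(sc is sc_again)   # True - the existing SparkContext is reused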

text = sc.textFile('/FileStore/tables/sampleds1.txt')

// Observations
1. let us define a variable "text", and use "sc" whenever we want to
create an RDD, i.e. "sc" is responsible for creating the RDD
2. this is mainly a transformation

text
// Observations
1. evaluating "text" by itself displays the metadata information for the
variable

text.collect()

// Observations
1. "collect" is an action statement
2. "collect" asks Spark to run all the transformations that come before
the action
3. text.collect() ==> it actually starts processing the file
4. when we refer to a text file like this, Spark simply splits the file's
data into lines for us

1. Spark works on lazy evaluation (demonstrated below)
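
A minimal illustration of the lazy-evaluation point, reusing the
sampleds1.txt path from above (the printed contents depend on what is in
the file):

text = sc.textFile('/FileStore/tables/sampleds1.txt')  # transformation only - nothing is read yet
lines = text.collect()                                 # action - the file is actually read here
print(lines)                                           # a Python list with one string per line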

RDD Functions:
map()
==> used as a mapper of data from one state to another
==> it will create a new RDD
==> Syntax: rdd.map(lambda x: x.split())

note:
1. A lambda function is a small anonymous function.
2. A lambda function can take any number of arguments, but can only
have one expression.

Example: multiply argument a with argument b and return the result:

x = lambda a, b : a * b
print(x(5, 6))

3. working of a mapper: we take a set of inputs and map each input to
an output (see the sketch below).
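
A short sketch of map() against the sample file, assuming sampleds1.txt
holds whitespace-separated numbers (the exact contents are not shown in
these notes):

numbers = sc.textFile('/FileStore/tables/sampleds1.txt')
# map each line (a string) to a list of integers
parsed = numbers.map(lambda line: [int(n) for n in line.split()])
parsed.collect()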
# Databricks notebook source
from pyspark import SparkConf, SparkContext

# configuration for this notebook's Spark application
conf = SparkConf().setAppName("Read File")

# reuse the existing SparkContext if one is already running
sc = SparkContext.getOrCreate(conf=conf)

# COMMAND ----------

# the second assignment overwrites the first, so myrdd reads sample2-1.txt
myrdd = sc.textFile('/FileStore/tables/sample.txt')
myrdd = sc.textFile('/FileStore/tables/sample2-1.txt')

# COMMAND ----------

# split every line on spaces, producing a new RDD of lists of words
myrdd1 = myrdd.map(lambda x: x.split(' '))
myrdd1.collect()

# COMMAND ----------

# collect the raw lines of the file
myrdd.collect()

# COMMAND ----------

# collect the split lines
myrdd1.collect()
