
Extracting Data from an API on Databricks

Ryan Chynoweth
5 min read · Feb 11, 2024

Introduction
Databricks integrates seamlessly with the application and data infrastructure of organizations. Its ability to extract data from various sources, perform transformations, and integrate with data sinks simplifies system integration and differentiates Databricks. This stands in contrast to cloud data warehouses, which rely on external functions for these tasks, increasing complexity and cost.

We will cover Databricks’ ability to consume data from external APIs and save that data to a table in Databricks Unity Catalog. We will explore two primary methods: a single-threaded approach and a distributed option for executing requests in parallel. In both scenarios we will use the Python requests library to perform these actions.

Associated code can be found on my GitHub. It is worth noting that within Databricks, I utilize Python imports to modularize the code, housing the functions responsible for calling the REST APIs in the libs directory.
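As a rough illustration, a repo layout compatible with that import style might look like the following. The folder and notebook names here are assumptions; only the libs package and the api_extract module are implied by the import used later in this post.

repo_root/
├── libs/
│   ├── __init__.py
│   └── api_extract.py      # holds the APIExtract class shown below
└── notebooks/
    └── ingest_api_data     # notebook that runs: from libs.api_extract import APIExtract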

Single-Threaded Option


The first approach involves consuming a single API endpoint at a time,
with the execution taking place on the driver. This method is ideal for
scenarios where you need to access one or just a few endpoints
periodically. To deploy this solution, engineers should consider selecting
the single-node compute option for the cluster since the code operates on
a single machine and doesn’t require additional virtual machines for
processing data.
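For reference, a single-node job cluster can be described with a spec along these lines when defining the compute programmatically. This is only a sketch: the node type and runtime version are placeholders, and field names can vary slightly by cloud and API version.

# Hedged sketch of a single-node cluster spec (values are placeholders).
single_node_cluster = {
    "spark_version": "14.3.x-scala2.12",   # example Databricks runtime
    "node_type_id": "Standard_DS3_v2",     # placeholder; cloud-specific
    "num_workers": 0,                      # no workers -> driver only
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}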

As an example, we are going to call the https://cat-fact.herokuapp.com/facts/ endpoint, which is available through Postman. See below for the example function that calls an API.

import requests
import json
from pyspark.sql.functions import udf

class APIExtract():
    """
    Class used for transformations requiring an API call.
    """

    def __init__(self):
        self.api_udf = udf(self.call_simple_rest_api)

    def call_simple_rest_api(self, url="https://cat-fact.herokuapp.com/facts/"):
        """ Example REST API call to an open API from Postman """
        # public REST API from Postman: https://documenter.getpostman.com/view/8854915
        response = requests.get(url)
        return json.loads(response.text)

In a Python notebook, we can then import this class and call the API with the following code.

from libs.api_extract import APIExtract

api_extract_client = APIExtract()

# store the response so we can convert it to a DataFrame later
data = api_extract_client.call_simple_rest_api()
data

Once executed you should have the following output.

'[
{
"status": {
"verified": true,
"sentCount": 1
},
"_id": "58e00b5f0aac31001185ed24",
"user": "58e007480aac31001185ecef",
"text": "When asked if her husband had any hobbies, Mary Todd Lincoln is s
"__v": 0,
"source": "user",
"updatedAt": "2020-08-23T20:20:01.611Z",
"type": "cat",
"createdAt": "2018-02-19T21:20:03.434Z",
"deleted": false,
"used": false
},
...
...
...
{
"status": {
"verified": true,
"sentCount": 1
},
"_id": "58e00af60aac31001185ed1d",
"user": "58e007480aac31001185ecef",
"text": "It was illegal to slay cats in ancient Egypt, in large part becau
"__v": 0,
"source": "user",
"updatedAt": "2020-09-16T20:20:04.164Z",
"type": "cat",
"createdAt": "2018-01-15T21:20:02.945Z",
"deleted": false,
"used": true
}
]'

Next, we want to save the data to a table in Unity Catalog, which can be done using the code below.

df = spark.createDataFrame(data)
df.write.saveAsTable('cat_data')
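If the notebook's default catalog and schema are not where the table should live, the same write can target a fully qualified three-level Unity Catalog name. The catalog and schema below ('main' and 'default') are placeholders; substitute your own.

# Placeholder catalog ('main') and schema ('default') names.
df.write.saveAsTable('main.default.cat_data')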

We have now extracted data from an API, converted the response to a DataFrame, then saved the DataFrame to a table in Unity Catalog.

Parallel API Calls


In some scenarios, there may be a need to collect data from multiple API endpoints concurrently, or to paginate through a single API endpoint. This can be achieved efficiently by running the requests in parallel across multiple cores.

To begin, we will need to make a couple of additions to our notebook to enable the parallel execution of API calls. First, we create a PySpark DataFrame containing the request parameters. In this example, we only have one column, url, but additional columns could be included for filtering, authentication, payloads, and other purposes.
# Create a list of dictionaries with the URL values
request_params = [
    {"url": "https://cat-fact.herokuapp.com/facts/"},
    {"url": "https://dog.ceo/api/breeds/list/all/"},
    {"url": "https://world.openpetfoodfacts.org/api/v0/product/20106836.json"},
    {"url": "https://world.openfoodfacts.org/api/v0/product/737628064502.json"},
    {"url": "https://openlibrary.org/api/books?bibkeys=ISBN:0201558025,LCCN:93005405&format=json"}
]

# Create DataFrame from the list of dictionaries
request_df = spark.createDataFrame(request_params)
request_df.show()

You should then have the following output.

+--------------------+
| url|
+--------------------+
|https://cat-fact....|
|https://dog.ceo/a...|
|https://world.ope...|
|https://world.ope...|
|https://openlibra...|
+--------------------+
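The request DataFrame here carries only a url column, but the same pattern extends to additional request parameters. Below is a minimal sketch of a variant UDF that also reads a JSON-encoded headers column; the function name and column name are assumptions for illustration, not part of the original class.

import json
import requests
from pyspark.sql.functions import udf, col

# Hypothetical variant: accept a headers column so per-endpoint
# authentication or content negotiation can ride along in the DataFrame.
def call_rest_api_with_headers(url, headers_json=None):
    headers = json.loads(headers_json) if headers_json else {}
    response = requests.get(url, headers=headers)
    return json.loads(response.text)

api_with_headers_udf = udf(call_rest_api_with_headers)

# Usage, assuming request_df also has a 'headers' column of JSON strings:
# response_df = request_df.withColumn(
#     'response', api_with_headers_udf(col('url'), col('headers'))
# )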

Now we can call the api_udf function created in our APIExtract class above.

from pyspark.sql.functions import col

response_df = request_df.withColumn('response', api_extract_client.api_udf(col('url')))
response_df.show()

You end up with the following DataFrame, which can be saved to a table using response_df.write.saveAsTable('parallel_api_calls').
+--------------------+--------------------+
| url| response|
+--------------------+--------------------+
|https://cat-fact....|[{createdAt=2018-...|
|https://dog.ceo/a...|{message={pyrenee...|
|https://world.ope...|{status_verbose=p...|
|https://world.ope...|{status_verbose=p...|
|https://openlibra...|{LCCN:93005405={p...|
+--------------------+--------------------+

Notice that all the data, regardless of endpoint, is saved to a single DataFrame and table. You will likely need to split the data into separate datasets, which you can do with the following code as a batch or streaming process from the ingestion table.

df = (spark.read
    .table('parallel_api_calls')
    .filter(col('url') == 'https://cat-fact.herokuapp.com/facts/')
)

df.write.saveAsTable('cat_facts')
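Since the ingestion table is a Delta table, the same split can also run incrementally with Structured Streaming instead of a batch read. A minimal sketch, assuming a checkpoint location of your choosing:

from pyspark.sql.functions import col

# Hedged sketch: incrementally split the ingestion table as a stream.
# The checkpoint path is a placeholder; point it at your own location.
stream_df = (spark.readStream
    .table('parallel_api_calls')
    .filter(col('url') == 'https://cat-fact.herokuapp.com/facts/')
)

(stream_df.writeStream
    .option('checkpointLocation', '/tmp/checkpoints/cat_facts')
    .trigger(availableNow=True)   # requires a recent runtime; trigger(once=True) otherwise
    .toTable('cat_facts')
)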

Running as a Task in Workflows


Users often need to extract data as part of a larger job. This can be
seamlessly integrated into Databricks workflows as a task. Below, we
present an example job that involves ingesting data from OpenWeather,
followed by the execution of dependent tasks for further processing.

The source code for this pipeline can be found here.
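As a rough illustration of what such a job could look like when defined programmatically, here is a sketch of a Jobs API 2.1 style payload with an ingestion task and a dependent transform task. The job name, task keys, notebook paths, and cluster key are placeholders, not the actual pipeline from the linked repo.

# Hedged sketch of a multi-task job definition (Jobs API 2.1 style).
# All names and paths below are placeholders.
job_spec = {
    "name": "openweather_ingest_pipeline",
    "job_clusters": [
        {"job_cluster_key": "ingest_cluster", "new_cluster": single_node_cluster}  # spec sketched earlier
    ],
    "tasks": [
        {
            "task_key": "ingest_openweather",
            "job_cluster_key": "ingest_cluster",
            "notebook_task": {"notebook_path": "/Repos/user/project/ingest_api_data"},
        },
        {
            "task_key": "transform_weather",
            "depends_on": [{"task_key": "ingest_openweather"}],
            "job_cluster_key": "ingest_cluster",
            "notebook_task": {"notebook_path": "/Repos/user/project/transform_weather"},
        },
    ],
}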

Before concluding, let’s consider the cost of extracting data from an API. Utilizing single-node compute on Databricks for extracting data from an API can prove to be highly cost-effective. While some may argue it is cheaper to use cloud function solutions, the streamlined architecture of having ingestion and transformations in a single pipeline often outweighs any marginal cost differences and allows users to more easily save data directly into a table without staging the data as files. Typically, the cost associated with extracting data from the API is insignificant compared to the overall expense of running data pipelines on entire datasets and tables.

With the provided parallel API code, I would recommend consolidating the ingestion process into a single task within Databricks. This approach allows you to leverage economies of scale, where the cost of data extraction is spread over multiple data sources and the cluster is highly utilized.

Conclusion
Consuming data from an API and saving the response to a table is extremely simple on Databricks. This can be done in a single-node manner for smaller use cases or can be distributed to run in parallel for scale.

Disclaimer: these are my own thoughts and opinions and not a reflection of my employer.
