Extracting Data From An API On Databricks

Ryan Chynoweth
5 min read · Feb 11, 2024
Introduction
Databricks integrates seamlessly with an organization's application and data
infrastructure. Its ability to extract data from various sources, perform
transformations, and write to downstream data sinks simplifies system
integration and sets Databricks apart. This stands in contrast to cloud data
warehouses, which rely on external functions for these tasks, adding
complexity and cost.
In this post we will cover consuming data from external APIs on Databricks
and saving the responses to a table in Unity Catalog. We will explore two
primary methods: a single-threaded approach and a distributed option that
executes requests in parallel. In both scenarios we will use the Python
requests library to make the calls.
import requests
import json
from pyspark.sql.functions import udf

class APIExtract():
    """
    Class used for transformations requiring an API call.
    """
    def __init__(self):
        # Register the REST call as a Spark UDF so it can be applied to a DataFrame column
        self.api_udf = udf(self.call_simple_rest_api)

    def call_simple_rest_api(self, url='https://cat-fact.herokuapp.com/facts/'):
        # Reconstructed for completeness: a simple GET that returns the raw response text;
        # the default endpoint matches the cat-facts API used in the examples below
        response = requests.get(url)
        return response.text

api_extract_client = APIExtract()
api_extract_client.call_simple_rest_api()
'[
{
"status": {
"verified": true,
"sentCount": 1
},
"_id": "58e00b5f0aac31001185ed24",
"user": "58e007480aac31001185ecef",
"text": "When asked if her husband had any hobbies, Mary Todd Lincoln is s
"__v": 0,
"source": "user",
"updatedAt": "2020-08-23T20:20:01.611Z",
"type": "cat",
"createdAt": "2018-02-19T21:20:03.434Z",
"deleted": false,
"used": false
},
...
...
...
{
"status": {
"verified": true,
"sentCount": 1
},
"_id": "58e00af60aac31001185ed1d",
"user": "58e007480aac31001185ecef",
"text": "It was illegal to slay cats in ancient Egypt, in large part becau
"__v": 0,
"source": "user",
"updatedAt": "2020-09-16T20:20:04.164Z",
"type": "cat",
"createdAt": "2018-01-15T21:20:02.945Z",
"deleted": false,
"used": true
}
]'
Next we want to save the data to a table in Unity Catalog, which can be done
with the code below.
# Parse the JSON response string into Python objects, then create a DataFrame
data = json.loads(api_extract_client.call_simple_rest_api())
df = spark.createDataFrame(data)
df.write.saveAsTable('cat_data')
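For the distributed option, we first build a DataFrame with one endpoint URL per row. Below is a minimal sketch; only the cat-facts endpoint is taken from this walkthrough, and the remaining rows stand in for whatever endpoints you need to call.

# Build a small DataFrame of endpoint URLs to fan out across the cluster
urls = [
    ('https://cat-fact.herokuapp.com/facts/',),
    # ... one row per additional endpoint
]
url_df = spark.createDataFrame(urls, ['url'])
url_df.show()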
+--------------------+
| url|
+--------------------+
|https://cat-fact....|
|https://dog.ceo/a...|
|https://world.ope...|
|https://world.ope...|
|https://openlibra...|
+--------------------+
Now I can call the api_udf function created in the APIExtract class above
against the url column, as sketched below.
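A minimal sketch of that call, assuming the url_df DataFrame from the previous step (note that in the output below the responses appear parsed into map columns, whereas this sketch keeps the raw response text):

from pyspark.sql.functions import col

# Apply the UDF to every row; the HTTP calls run in parallel across the cluster
response_df = url_df.withColumn('response', api_extract_client.api_udf(col('url')))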
You end up with the following DataFrame, which can be saved to a table
using response_df.write.saveAsTable('parallel_api_calls').
+--------------------+--------------------+
| url| response|
+--------------------+--------------------+
|https://cat-fact....|[{createdAt=2018-...|
|https://dog.ceo/a...|{message={pyrenee...|
|https://world.ope...|{status_verbose=p...|
|https://world.ope...|{status_verbose=p...|
|https://openlibra...|{LCCN:93005405={p...|
+--------------------+--------------------+
Notice that all the data, regardless of endpoint, lands in a single table and
DataFrame. You will typically want to split it into separate datasets, which
you can do with the following code as either a batch or streaming process
from the ingestion table.
from pyspark.sql.functions import col

df = (spark.read
    .table('parallel_api_calls')
    .filter(col('url') == 'https://cat-fact.herokuapp.com/facts/')
)
df.write.saveAsTable('cat_facts')
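The same split can also run as a streaming job. Below is a minimal sketch, with an illustrative checkpoint location that is not part of the original pipeline.

# Read the ingestion table as a stream and continuously split out the cat-facts rows
stream_df = (spark.readStream
    .table('parallel_api_calls')
    .filter(col('url') == 'https://cat-fact.herokuapp.com/facts/')
)

(stream_df.writeStream
    .option('checkpointLocation', '/tmp/checkpoints/cat_facts')  # illustrative path
    .trigger(availableNow=True)  # process all available data, then stop
    .toTable('cat_facts')
)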
The source code for this pipeline can be found here.
Before concluding, let's consider the cost of extracting data from an API.
Using single-node compute on Databricks for API extraction can be highly
cost-effective. While some may argue it is cheaper to use cloud function
solutions, the streamlined architecture of having ingestion and
transformations in a single pipeline often outweighs any marginal cost
difference, and it lets users save data directly into a table without first
staging it as files. Typically, the cost of extracting data from the API is
insignificant compared to the overall expense of running data pipelines on
entire datasets and tables.
Conclusion
Consuming data from an API and saving the response to a table is
extremely simple on Databricks. This can be done on a single node for
smaller use cases or distributed to run in parallel at scale.
Disclaimer: these are my own thoughts and opinions and not a reflection of my
employer