Data - DSPy
Data
For each example in your data, we typically distinguish between three types of values: the
inputs, the intermediate labels, and the final label. You can use DSPy effectively without any
intermediate or final labels, but you will need at least a few example inputs.
How can you get examples like these? If your task is extremely unusual, please invest in
preparing ~10 examples by hand. Often, depending on your metric below, you just need
inputs and not labels, so it's not that hard.
However, chances are that your task is not actually that unique. You can almost always find
somewhat adjacent datasets on, say, HuggingFace datasets or other forms of data that you
can leverage here.
If there's data whose licenses are permissive enough, we suggest you use them. Otherwise, you
can also start using/deploying/demoing your system and collect some initial data that way.
DSPy Examples are similar to Python dicts but have a few useful utilities. Your DSPy modules
will return values of the type Prediction, which is a special sub-class of Example.
When you use DSPy, you will do a lot of evaluation and optimization runs. Your individual
datapoints will be of type Example :
qa_pair = dspy.Example(question="This is a question?", answer="This is an answer.")
print(qa_pair)
print(qa_pair.question)
print(qa_pair.answer)
Output:
Examples can have any field keys and any value types, though usually values are strings.
You can now express your training set, for example, as:
In DSPy, the Example objects have a with_inputs() method, which can mark specific fields
as inputs. (The rest are just metadata or labels.)
# Single Input.
print(qa_pair.with_inputs("question"))

# Multiple Inputs; be careful about marking your labels as inputs unless you mean it.
print(qa_pair.with_inputs("question", "answer"))
Values can be accessed using the . (dot) operator. For example, you can access the value of the
key name in an object defined as Example(name="John Doe", job="sleep") through object.name .
To access or exclude certain keys, use inputs() and labels() methods to return new
Example objects containing only input or non-input keys, respectively.
# An Example with 'article' marked as the input field (illustrative values).
article_summary = dspy.Example(article="This is an article.", summary="This is a summary.").with_inputs("article")

input_key_only = article_summary.inputs()
non_input_key_only = article_summary.labels()

print("Example object with Input fields only:", input_key_only)
print("Example object with Non-Input fields only:", non_input_key_only)
Output
https://dspy-docs.vercel.app/building-blocks/4-data/ 2/4
22/10/2024, 15:36 Data - DSPy
DSPy provides a DataLoader utility for loading datasets from common formats:

from dspy.datasets import DataLoader

dl = DataLoader()
For most dataset formats, it's quite straightforward: you pass the file path to the corresponding
method for that format, and you get back a list of Example objects for the dataset:
import pandas as pd

csv_dataset = dl.from_csv(
    "sample_dataset.csv",
    fields=("instruction", "context", "response"),
    input_keys=("instruction", "context")
)

json_dataset = dl.from_json(
    "sample_dataset.json",
    fields=("instruction", "context", "response"),
    input_keys=("instruction", "context")
)

parquet_dataset = dl.from_parquet(
    "sample_dataset.parquet",
    fields=("instruction", "context", "response"),
    input_keys=("instruction", "context")
)

pandas_dataset = dl.from_pandas(
    pd.read_csv("sample_dataset.csv"),  # DataFrame
    fields=("instruction", "context", "response"),
    input_keys=("instruction", "context")
)
These are some of the formats that DataLoader can load from a file directly; under the hood,
most of these methods leverage the load_dataset method from the datasets library. When working
with text data, though, you'll often use HuggingFace datasets; to import an HF dataset as a list
of Example objects, use the from_huggingface method:
blog_alpaca = dl.from_huggingface(
    "intertwine-expel/expel-blog",
    input_keys=("title",)
)
You can access a given split of the dataset by indexing with the corresponding split name:
train_split = blog_alpaca['train']

# Since this is the only split in the dataset, we can create train and test
# splits ourselves by slicing, or by sampling 75 rows from the train split
# for testing.
testset = train_split[:75]
trainset = train_split[75:]
Everything you can do when loading a HuggingFace dataset via load_dataset works exactly the
same way via from_huggingface. This includes passing specific splits, subsplits, read
instructions, etc. For code snippets, you can refer to the cheatsheet snippets for loading from
HF.