Code2pdf 67c734433f773

The document provides Python code for loading and formatting instruction-tuning data using the Alpaca prompt structure, intended for federated fine-tuning with Flower Datasets and TRL. It includes functions for tokenization, completion-only data collation, and loading federated dataset partitions. Additionally, there is a utility function that replaces specific characters in dictionary keys.


from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

from flwr_datasets import FederatedDataset
from flwr_datasets.partitioner import IidPartitioner

FDS = None  # Cache FederatedDataset so it is only instantiated once per process

def formatting_prompts_func(example):
    """Construct standard Alpaca prompts for a batch of examples.

    See https://github.com/tatsu-lab/stanford_alpaca#data-release for the layout.
    """
    output_texts = []
    mssg = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
    )
    for i in range(len(example["instruction"])):
        text = (
            f"{mssg}\n### Instruction:\n{example['instruction'][i]}"
            f"\n### Response: {example['response'][i]}"
        )
        output_texts.append(text)
    return output_texts
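
As an illustration (not part of the original code), here is the formatter applied to a hypothetical two-example batch, column-oriented the way Hugging Face datasets passes batched examples:

batch = {
    "instruction": ["Name a primary color.", "Add 2 and 3."],
    "response": ["Red is a primary color.", "2 + 3 = 5"],
}
for prompt in formatting_prompts_func(batch):
    print(prompt, end="\n---\n")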

def get_tokenizer_and_data_collator_and_prompt_formatting(model_name: str):
    """Return the tokenizer, completion-only data collator, and prompt formatter.

    Adapted from https://huggingface.co/docs/trl/en/sft_trainer.
    """
    tokenizer = AutoTokenizer.from_pretrained(
        model_name, use_fast=True, padding_side="right"
    )
    tokenizer.pad_token = tokenizer.eos_token
    response_template_with_context = "\n### Response:"  # Alpaca response tag
    # Some tokenizers encode "\n###" differently when it follows other text,
    # so drop the first two tokens and match on the stable remainder.
    response_template_ids = tokenizer.encode(
        response_template_with_context, add_special_tokens=False
    )[2:]
    data_collator = DataCollatorForCompletionOnlyLM(
        response_template_ids, tokenizer=tokenizer
    )
    return tokenizer, data_collator, formatting_prompts_func
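
For context, a minimal sketch of how these three return values are typically wired into trl's SFTTrainer. The exact SFTTrainer signature varies across trl releases, and MODEL_NAME and client_trainset are placeholders (client_trainset would come from load_data below), so treat this as an assumption-laden example rather than the original training loop:

from transformers import AutoModelForCausalLM
from trl import SFTTrainer

MODEL_NAME = "openlm-research/open_llama_3b"  # placeholder model id

tokenizer, collator, fmt_func = get_tokenizer_and_data_collator_and_prompt_formatting(MODEL_NAME)
trainer = SFTTrainer(
    model=AutoModelForCausalLM.from_pretrained(MODEL_NAME),
    train_dataset=client_trainset,  # e.g. the partition returned by load_data
    formatting_func=fmt_func,       # builds the Alpaca prompt strings
    data_collator=collator,         # masks loss on everything before "### Response:"
    tokenizer=tokenizer,
)
trainer.train()

Pairing formatting_func with DataCollatorForCompletionOnlyLM is the pattern the trl documentation uses for training on completions only: the prompt tokens receive label -100, so loss is computed only on the response.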

def load_data(partition_id: int, num_partitions: int, dataset_name: str):
    """Load the training partition assigned to one client."""
    # Only initialize `FederatedDataset` once; subsequent calls reuse the cache.
    global FDS
    if FDS is None:
        partitioner = IidPartitioner(num_partitions=num_partitions)
        FDS = FederatedDataset(
            dataset=dataset_name,
            partitioners={"train": partitioner},
        )
    client_trainset = FDS.load_partition(partition_id, "train")
    # Alpaca-style datasets name the answer column "output"; the prompt
    # formatter above expects "response".
    client_trainset = client_trainset.rename_column("output", "response")
    return client_trainset
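
A hypothetical call, assuming ten clients and an Alpaca-style dataset on the Hugging Face Hub; the dataset id here is illustrative, not taken from the original:

trainset = load_data(partition_id=3, num_partitions=10, dataset_name="vicgalle/alpaca-gpt4")
print(trainset[0]["instruction"], trainset[0]["response"])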

def replace_keys(input_dict, match="-", target="_"):
    """Recursively replace match string with target string in dictionary keys."""
    new_dict = {}
    for key, value in input_dict.items():
        new_key = key.replace(match, target)
        if isinstance(value, dict):
            new_dict[new_key] = replace_keys(value, match, target)
        else:
            new_dict[new_key] = value
    return new_dict
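
For illustration, converting hyphenated config keys (common in TOML or YAML run configs) into valid Python identifiers; the config values below are made up:

raw_cfg = {"train": {"learning-rate": 5e-5, "per-device-batch-size": 16}}
print(replace_keys(raw_cfg))
# {'train': {'learning_rate': 5e-05, 'per_device_batch_size': 16}}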
