AI Phase2
USING PYTHON
PRESENTED BY
SIVASUBRAMANIAN TJ
1. GPT-2 TOKENIZER
2. DATA PREPROCESSING
3. CUSTOM PYTORCH DATASET
4. TRAINING ARGUMENTS
5. FINE-TUNING & ENSEMBLE METHOD
6. CONCLUSION
GPT-2 TOKENIZER
● Tokenization is a crucial step in natural language processing (NLP): it breaks text down into smaller units, or tokens, that can be analyzed and processed further. In the provided code, we tokenize text using the GPT-2 tokenizer.
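A minimal sketch of this step, assuming Hugging Face's transformers library and the base "gpt2" checkpoint (the exact loading code in the project may differ):

from transformers import GPT2Tokenizer

# Load the pretrained GPT-2 byte-pair-encoding tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 ships without a pad token; reusing EOS is a common workaround.
tokenizer.pad_token = tokenizer.eos_token

encoded = tokenizer("Hello, how are you today?")
print(encoded["input_ids"])  # token IDs
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # BPE tokens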
DATA PREPROCESSING
● This part of the code reads conversational data from a dataset file, splits it into individual messages, and tokenizes each message with the GPT-2 tokenizer.
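A hedged sketch of this preprocessing, where the file name conversations.txt, the one-message-per-line layout, and the maximum length of 128 are assumptions about the dataset rather than details from the original code:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # enable padding

# Read the raw conversational dump and split it into messages,
# dropping blank lines (assumed one message per line).
with open("conversations.txt", encoding="utf-8") as f:
    messages = [line.strip() for line in f if line.strip()]

# Tokenize every message to a fixed length, returning PyTorch tensors.
encodings = tokenizer(
    messages,
    truncation=True,
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
print(encodings["input_ids"].shape)  # (num_messages, 128)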
CUSTOM PYTORCH DATASET
● By converting our data into a PyTorch dataset, we make it compatible with PyTorch's data loaders, which makes it easy to iterate through, shuffle, and batch the data while training our GPT-2 models. This dataset feeds the training loop, ensuring the data is in a format the model can consume.
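One possible shape for this wrapper, reusing the encodings produced by the preprocessing sketch above; the class name ConversationDataset is illustrative, and for causal language modeling the labels simply mirror the input IDs:

from torch.utils.data import Dataset, DataLoader

class ConversationDataset(Dataset):
    """Wraps the tokenizer output (a dict of tensors) for training."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return self.encodings["input_ids"].size(0)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = item["input_ids"].clone()  # causal-LM targets
        return item

# The DataLoader then handles shuffling and batching for training.
dataset = ConversationDataset(encodings)
loader = DataLoader(dataset, batch_size=8, shuffle=True)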
TRAINING ARGUMENTS