Tuning LLM: Exploring the Nuances of Prompt Tuning and Fine Tuning
Large Language Models (LLMs) have revolutionized the field of artificial intelligence, showing remarkable capabilities in understanding and generating human-like text. The ability of these models to perform a wide array of tasks, from casual conversations to complex problem-solving, is largely attributed to their sophisticated training methodologies. However, the performance of LLMs can often be enhanced further through various tuning strategies.

In the previous blog, What is Prompt? What is Prompt Engineering & Prompt Tuning?, we gained a basic understanding of prompt engineering and prompt tuning. In this blog, we delve into the nuances of prompt tuning and fine-tuning, shedding light on how each method contributes to optimizing LLMs.

Prompt Tuning: Influencing Outputs with Inputs

Unlike prompt engineering, which adds task-specific instructions alongside the input text, prompt tuning introduces soft prompts. Soft prompts are task-specific instruction embeddings inserted at the model's input embedding layer to guide it toward the desired behavior, and they require only minimal training: the original weights of the model are frozen, and only the soft prompt embeddings are updated. Because soft prompts are learned continuous embeddings, they are not necessarily interpretable as natural language, yet they are fed to the model just like regular input. In short, soft prompts play a pivotal role in how the input is presented to the model; the soft prompt embedding layer is learnable, and in each training iteration the soft prompt is optimized further.

Strategies for Crafting Effective Prompts

- Clarity and Conciseness: Clear and straightforward prompts are typically more effective than vague or overly complex ones. Aim for simplicity while keeping essential details.
- Contextual Framing: Providing context can improve the model's ability to understand the desired output. Including examples or specific details can clarify the task.
- Iterative Refinement: Experiment with various prompts to determine which produces the best results. Tuning prompts involves iterative trial and error.
- Use of Instructional Language: Directly phrasing prompts as instructions, like "List the benefits of…", can more effectively guide the model's output (see the short example after this list).
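To make these strategies concrete, here is a small illustrative sketch (not from the original post) contrasting a vague prompt with a refined, instructional one; the `ask_model` call is a hypothetical placeholder for whichever completion API or client you use.

```python
# Hypothetical illustration of prompt refinement; `ask_model` is a placeholder
# for whatever completion API or client library you are using.
vague_prompt = "Tell me about prompt tuning."

refined_prompt = (
    "List three benefits of prompt tuning for large language models. "    # instructional phrasing
    "Answer in bullet points of one sentence each. "                      # clarity and conciseness
    "Context: the audience is developers building an LLM-based chatbot."  # contextual framing
)

# response = ask_model(refined_prompt)   # compare against ask_model(vague_prompt) and iterate
```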
Benefits of Prompt Tuning

- Rapid Adaptation: Prompt tuning allows for quick adjustments to the model's behavior without requiring extensive retraining, making it ideal for dynamic and real-time applications.
- Cost-Effective: Since it doesn't involve large-scale data processing or lengthy training sessions, prompt tuning is a more cost-effective way to adapt models to specific tasks.
- Versatility: Prompt tuning can be applied across various domains and tasks, offering a flexible approach to customizing model outputs for different applications and contexts.
- Reduced Computational Resources: Compared to full model retraining, prompt tuning requires fewer computational resources, making it feasible on less powerful hardware.

Let's now dive into the code. The sample Python program below shows where prompt tuning with soft prompts takes place, using a summarization task over article/highlight pairs.

```python
import csv
import random
import string

import nltk
import torch
import torch.nn as nn
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Model, GPT2Config
from sklearn.model_selection import train_test_split

# Download the NLTK resources used for tokenization and stop-word removal
nltk.download('punkt')
nltk.download('stopwords')
```

First, we import the required modules.

```python
def preprocess_text(input_text):
    """
    Processes the given text by converting it to lowercase, removing punctuation
    and digits, tokenizing the words, and filtering out stop words.

    Args:
        input_text (str): Text that needs to be processed.

    Returns:
        str: The processed text.
    """
    # Convert text to lowercase
    input_text = input_text.lower()

    # Remove punctuation and digits
    input_text = input_text.translate(str.maketrans('', '', string.punctuation + string.digits))

    # Tokenize the text into words
    words = word_tokenize(input_text)

    # Remove stop words
    filtered_words = [word for word in words if word not in stopwords.words('english')]

    # Join the words back into a single string
    processed_text = ' '.join(filtered_words)

    return processed_text
```

We use this preprocessing function to lowercase the text, remove punctuation and digits, tokenize it, and filter out stop words.

```python
def tokenize_and_pad_sequences(data_pairs, max_len_article, max_len_highlights):
    """
    Tokenizes the input text data and pads the sequences to the given maximum lengths.

    Args:
        data_pairs (list): A list of tuples, each containing an article and its corresponding highlights.
        max_len_article (int): The maximum length for article sequences.
        max_len_highlights (int): The maximum length for highlight sequences.

    Returns:
        list: A list of tuples containing tokenized and padded sequences for articles and highlights.
    """
    processed_data = []

    for article, highlights in data_pairs:
        # Tokenize and convert text to indices
        article_indices = tokenizer.encode(article, add_special_tokens=True)
        highlights_indices = tokenizer.encode(highlights, add_special_tokens=True)

        # Pad the sequences to the specified maximum lengths
        padded_article = torch.tensor(
            article_indices + [tokenizer.pad_token_id] * (max_len_article - len(article_indices)))
        padded_highlights = torch.tensor(
            highlights_indices + [tokenizer.pad_token_id] * (max_len_highlights - len(highlights_indices)))

        # Ensure both tokenized sequences are non-empty
        if len(article_indices) > 0 and len(highlights_indices) > 0:
            processed_data.append((padded_article, padded_highlights))

    return processed_data
```

After preprocessing, we tokenize the input data, convert the tokens to indices using the GPT-2 tokenizer, and pad the sequences to the specified lengths.

```python
def load_data_from_csv(file_path):
    """
    Reads a CSV file and extracts columns for articles and highlights.

    Args:
        file_path (str): Path to the CSV file.

    Returns:
        list: A list of tuples containing articles and highlights.
    """
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            article = row.get('article', '')
            highlights = row.get('highlights', '')
            data.append((article, highlights))
    return data

# Load test data
test_file_path = 'test.csv'
test_data = load_data_from_csv(test_file_path)

# Load training data
train_file_path = 'train.csv'
train_data = load_data_from_csv(train_file_path)

# Load validation data
validation_file_path = 'validation.csv'
validation_data = load_data_from_csv(validation_file_path)
```

Here comes the dataset-loading part: we read the data from CSV files, extract the relevant columns ('article' and 'highlights'), and store them as tuples in separate lists for the training, test, and validation sets.

```python
def sample_one_percent(data_list, seed):
    """
    Randomly samples 1% of the data from the provided list, with a fixed seed for reproducibility.

    Args:
        data_list (list): The original data list.
        seed (int): Seed value for reproducibility.

    Returns:
        list: A list containing 1% of the sampled data.
    """
    random.seed(seed)
    sample_size = int(0.01 * len(data_list))
    return random.sample(data_list, sample_size)

# Sample 1% of the test data
one_percent_test_data = sample_one_percent(test_data, seed=14)

# Sample 1% of the training data
one_percent_train_data = sample_one_percent(train_data, seed=14)

# Sample 1% of the validation data
one_percent_val_data = sample_one_percent(validation_data, seed=14)
```

We randomly select 1% of the training, test, and validation data so that the rest of the pipeline runs quickly while building the model.

```python
# Apply preprocessing to the sampled data
processed_train_data = [(preprocess_text(article), preprocess_text(highlights))
                        for article, highlights in one_percent_train_data]
processed_test_data = [(preprocess_text(article), preprocess_text(highlights))
                       for article, highlights in one_percent_test_data]
processed_val_data = [(preprocess_text(article), preprocess_text(highlights))
                      for article, highlights in one_percent_val_data]
```

This block applies the text preprocessing (lowercasing, punctuation removal, tokenization, and stop-word removal) to the sampled training, test, and validation sets.

```python
# Initialize the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Use the end-of-sequence token as the padding token
pad_token = tokenizer.eos_token
tokenizer.add_special_tokens({'pad_token': pad_token})

# Define maximum lengths for articles and highlights
# (1021 leaves room for a few soft prompt tokens within GPT-2's 1024-token context)
max_article_len = 1021
max_highlights_len = 1024

# Apply tokenization and padding to the preprocessed datasets
tokenized_train_data = tokenize_and_pad_sequences(processed_train_data,
                                                  max_len_article=max_article_len,
                                                  max_len_highlights=max_highlights_len)
tokenized_test_data = tokenize_and_pad_sequences(processed_test_data,
                                                 max_len_article=max_article_len,
                                                 max_len_highlights=max_highlights_len)
tokenized_val_data = tokenize_and_pad_sequences(processed_val_data,
                                                max_len_article=max_article_len,
                                                max_len_highlights=max_highlights_len)
```

The GPT-2 tokenizer is loaded in this block, and its end-of-sequence token is registered as the padding token used when padding sequences.
The block then tokenizes and pads the preprocessed data to the designated maximum lengths for articles and highlights, producing the tokenized training, test, and validation datasets.

```python
# Initialize lists to store input and target IDs for the training data
input_ids_train = []
target_ids_train = []

# Maximum lengths for articles and highlights
max_article_len = 1021
max_highlights_len = 1024

# Iterate over the tokenized training data
for article_tokens, highlights_tokens in tokenized_train_data:
    # Truncate article tokens to the maximum article length
    truncated_article = article_tokens[:max_article_len]

    # Truncate highlights tokens to the maximum highlights length
    truncated_highlights = highlights_tokens[:max_highlights_len]

    # Add the truncated article tokens to the input list
    input_ids_train.append(truncated_article)

    # Add the truncated highlights tokens to the target list
    target_ids_train.append(truncated_highlights)

# Convert the training lists to PyTorch tensors
input_ids_train = torch.stack(input_ids_train)
target_ids_train = torch.stack(target_ids_train)
```

This section trims article and highlights tokens that exceed the allowed limits and converts them into PyTorch tensors to prepare the training data.

```python
input_ids_val = []
target_ids_val = []

# Iterate over the tokenized validation data
for article_tokens, highlights_tokens in tokenized_val_data:
    # Truncate article tokens to the maximum article length
    truncated_article = article_tokens[:max_article_len]

    # Truncate highlights tokens to the maximum highlights length
    truncated_highlights = highlights_tokens[:max_highlights_len]

    # Add the truncated article tokens to the validation input list
    input_ids_val.append(truncated_article)

    # Add the truncated highlights tokens to the validation target list
    target_ids_val.append(truncated_highlights)

# Convert the validation lists to PyTorch tensors
input_ids_val = torch.stack(input_ids_val)
target_ids_val = torch.stack(target_ids_val)
```

Similarly, we truncate the tokens for the validation set and create PyTorch tensors.

```python
# Load the GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
gpt2_model = GPT2LMHeadModel.from_pretrained(model_name)

# Specify the prompt sentence used to initialize the soft prompt
# (e.g. "summarize" or "summarize the following text")
prompt_sentence = "summarize"

# Tokenize the prompt sentence
prompt_ids = tokenizer.encode(prompt_sentence, return_tensors='pt')

# Derive the number of prompt tokens from the tokenized prompt and
# take the embedding size from the GPT-2 configuration
num_prompt_tokens = prompt_ids.shape[1]
embedding_dim = gpt2_model.config.hidden_size

# Obtain embeddings for the tokenized prompt sentence from the GPT-2 model
prompt_embeddings = gpt2_model.transformer.wte(prompt_ids)

# Initialize an embedding layer for the soft prompts with the prompt sentence embeddings
soft_prompt_layer = nn.Embedding(num_prompt_tokens, embedding_dim)
soft_prompt_layer.weight.data.copy_(prompt_embeddings.squeeze(0))
```

The GPT-2 model and tokenizer are loaded in this section.
This step then tokenizes the prompt sentence, retrieves its embeddings from the GPT-2 model, derives the number of prompt tokens and the embedding size, and initializes an embedding layer for the soft prompts with those embeddings.

```python
class GPT2WithPromptTuning(nn.Module):
    def __init__(self, gpt2_model, soft_prompt_embeddings):
        super(GPT2WithPromptTuning, self).__init__()
        self.gpt2_model = gpt2_model
        self.soft_prompt_embeddings = soft_prompt_embeddings

    def forward(self, input_ids, soft_prompt_ids):
        # Work with a batch dimension so single examples can be passed as 1-D tensors
        if input_ids.dim() == 1:
            input_ids = input_ids.unsqueeze(0)
        if soft_prompt_ids.dim() == 1:
            soft_prompt_ids = soft_prompt_ids.unsqueeze(0)

        # Obtain the embeddings for the input_ids from the GPT-2 model
        gpt2_embeddings = self.gpt2_model.transformer.wte(input_ids)

        # Obtain the embeddings for the soft prompts
        soft_prompt_embeds = self.soft_prompt_embeddings(soft_prompt_ids)

        # Prepend the soft prompt embeddings to the input embeddings along the sequence dimension
        embeddings = torch.cat([soft_prompt_embeds, gpt2_embeddings], dim=1)

        # Pass the concatenated embeddings through the GPT-2 model
        outputs = self.gpt2_model(inputs_embeds=embeddings)
        return outputs
```

The class above defines a GPT-2 wrapper for prompt tuning: it concatenates the soft prompt embeddings at the start of the input sequence and passes the result through the GPT-2 model.

```python
# Initialize the model
model = GPT2WithPromptTuning(gpt2_model, soft_prompt_layer)

# Freeze the GPT-2 model weights
for param in model.gpt2_model.parameters():
    param.requires_grad = False
```

Here we initialize the model with the soft_prompt_layer and freeze the GPT-2 parameters so that only the soft prompt embeddings remain trainable.

```python
# Define hyperparameters
batch_size = 8
epochs = 2
learning_rate = 2e-3
gradient_clip_value = 1.0
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the GPU if one is available
model.to(device)

# Define the optimizer (only the soft prompt embeddings are optimized) and the loss criterion
optimizer = torch.optim.AdamW(model.soft_prompt_embeddings.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index=-100)

# Indices into the soft prompt embedding layer, one per soft prompt token
soft_prompt_ids = torch.arange(num_prompt_tokens)
```

This section defines hyperparameters such as the batch size, number of epochs, learning rate, and gradient clipping value, and initializes the AdamW optimizer and the cross-entropy loss function used to train the soft prompt.

```python
# Training loop
model.train()
for epoch in range(epochs):
    # Create a tqdm progress bar for the training data
    data_iterator = tqdm(zip(input_ids_train, target_ids_train),
                         desc=f'Epoch {epoch + 1}', total=len(input_ids_train))

    for input_ids, target_ids in data_iterator:
        optimizer.zero_grad()

        # Move input and target tensors to the selected device
        input_ids, target_ids = input_ids.to(device), target_ids.to(device)

        outputs = model(input_ids, soft_prompt_ids.to(device))
        logits = outputs.logits if hasattr(outputs, "logits") else outputs.last_hidden_state

        # Align the lengths of the logits (soft prompt + article tokens) and the
        # targets (highlights tokens) before computing the loss
        seq_len = min(logits.size(1), target_ids.size(0))
        loss = criterion(logits[0, :seq_len, :], target_ids[:seq_len])

        loss.backward()

        # Gradient clipping to prevent exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), gradient_clip_value)

        optimizer.step()

        # Update the progress bar description with the current loss
        data_iterator.set_postfix(loss=loss.item())

    # Close the tqdm progress bar at the end of the epoch
    data_iterator.close()
```

This is the training loop for the prompt-tuned model. It iterates over the epochs, computes the loss for each example, applies gradient clipping to avoid exploding gradients, and uses gradient descent to update only the soft prompt embeddings.
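Once training finishes, it helps to check what the tuned soft prompt actually does. The snippet below is a minimal, illustrative sketch (not part of the original walkthrough) of greedy decoding with the wrapper defined above; the helper name `generate_with_soft_prompt` is hypothetical, and it assumes `model`, `tokenizer`, `device`, and `soft_prompt_ids` from the training code are in scope.

```python
# A minimal sketch of inference with the tuned soft prompt. The helper name is
# hypothetical; it reuses `model`, `tokenizer`, `device`, and `soft_prompt_ids`
# from the training code above.
def generate_with_soft_prompt(text, max_new_tokens=50):
    model.eval()
    input_ids = tokenizer.encode(text, return_tensors='pt').squeeze(0).to(device)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            outputs = model(input_ids, soft_prompt_ids.to(device))
            # Greedily pick the most likely next token from the last position
            next_token = outputs.logits[0, -1, :].argmax()
            if next_token.item() == tokenizer.eos_token_id:
                break
            input_ids = torch.cat([input_ids, next_token.unsqueeze(0)])
    return tokenizer.decode(input_ids, skip_special_tokens=True)

print(generate_with_soft_prompt("The government announced new measures today"))
```

Because only the soft prompt embeddings were trained, several such prompts can be kept for different tasks and swapped in front of the same frozen GPT-2 weights.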
Fine-Tuning: The Model's Weight Adjustment Approach

Fine-tuning a Large Language Model (LLM) means adjusting or updating the pre-trained model's parameters to improve its performance on a specific task or domain. This process begins with a base model that has already been trained on a huge dataset covering a wide range of knowledge and linguistic patterns. During fine-tuning, the model is trained on a task-specific dataset, and its weights are adjusted using gradient descent to minimize the loss on this new dataset. This adapts the model's knowledge so it can effectively handle the targeted task, such as summarization, question answering, code generation and debugging, or domain-specific text generation. By building on the vast knowledge gained during pre-training, fine-tuning can significantly boost the model's accuracy and relevance for particular tasks, striking a balance between general language understanding and specialization.

The Process of Fine-Tuning

- Dataset Preparation: A properly selected dataset that represents the target task is required for fine-tuning. For the model to generalize effectively, the chosen dataset should cover a variety of contexts, styles, and approaches.
- Training Procedure: The model is trained on the new dataset, often for a reduced number of epochs. This training can be supervised (where correct outputs are provided) or unsupervised (learned from the data without explicit feedback).
- Evaluation and Adjustment: After fine-tuning, the model should be evaluated on a validation set. Metrics such as accuracy, F1 score, or BLEU score can help gauge effectiveness, and further adjustments can be made based on these evaluations (a short evaluation sketch follows the training code below).

Benefits of Fine-Tuning

- Domain Adaptation: Fine-tuning allows LLMs to specialize in niche domains, such as medical terminology or legal language, by adapting their understanding to specific jargon and context.
- Improved Performance: By training on relevant data, fine-tuned models often achieve better performance than their general-purpose counterparts.
- Flexibility: Fine-tuned models can efficiently handle a variety of tasks within their specialized domain, making them versatile for real-world applications.

Let's look at a sample Python program for fine-tuning the GPT-2 LLM.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
```

Here we import the required libraries and load the wikitext-2 dataset, but you can use any dataset you prefer.

```python
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 has no padding token by default, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
```

Now we tokenize our dataset, with padding and truncation, in preparation for fine-tuning.

```python
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]

train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(eval_dataset, batch_size=8)
```

Next we create DataLoaders for training and evaluation. DataLoaders ensure efficient and effective data handling during the training and evaluation of machine learning models.
They help with batch processing, shuffling, parallel data loading, and memory management, all of which contribute to more efficient training and evaluation. (The Trainer used below builds its own DataLoaders internally, so this step mainly illustrates how the data would be batched in a custom training loop.)

```python
model = AutoModelForCausalLM.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)
```

This snippet loads a pre-trained causal language model using the AutoModelForCausalLM class from the Hugging Face Transformers library. It then sets up the training arguments, such as the output directory, evaluation strategy, learning rate, batch sizes, number of training epochs, and weight decay, using the TrainingArguments class.

```python
from transformers import DataCollatorForLanguageModeling

# For causal language modeling, the collator copies the input IDs into the labels
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)

trainer.train()

model.save_pretrained("./fine-tuned-model")
tokenizer.save_pretrained("./fine-tuned-model")
```

The code initializes a Trainer to fine-tune the loaded model with the specified training arguments and datasets (the data collator supplies the language-modeling labels), then trains the model and saves the fine-tuned model and tokenizer to a directory.
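To close the loop on the evaluation step described earlier, here is a brief, hedged sketch (not from the original post) of how the fine-tuned model might be checked: perplexity is estimated from the Trainer's evaluation loss, and a short sample is generated from the saved checkpoint. The prompt text is only an illustrative example; the paths and objects reuse those defined above.

```python
import math

# Evaluate on the validation split; eval_loss is the mean cross-entropy,
# so its exponential gives an estimate of perplexity
eval_metrics = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_metrics['eval_loss']):.2f}")

# Reload the saved checkpoint and generate a short continuation as a sanity check
tuned_model = AutoModelForCausalLM.from_pretrained("./fine-tuned-model")
tuned_tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")

inputs = tuned_tokenizer("The history of natural language processing", return_tensors="pt")
output_ids = tuned_model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tuned_tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

A falling evaluation loss (and perplexity) across epochs is a quick signal that fine-tuning is helping, before moving on to task-specific metrics such as F1 or BLEU.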
Navigating the Challenges of Prompt Tuning and Fine-Tuning

Prompt tuning faces several challenges. One significant issue is striking the right balance between specificity and generality: overly specific prompts can limit the model's versatility, while overly general prompts may not yield precise results. Additionally, prompt tuning requires a deep understanding of the model's behavior and the nuances of natural language, which can be complex and time-consuming. Ensuring consistency across different contexts and managing unexpected or biased outputs are also critical challenges that must be addressed to harness the potential of prompt tuning effectively.

Fine-tuning can deliver major improvements in model performance and accuracy, but it raises challenges of its own. Overfitting is an important problem: a model tuned heavily on its training set may perform poorly on new or unseen data. Furthermore, obtaining the right datasets can require significant time and effort.

Real-Time Applications of Prompt Tuning & Fine-Tuning

Prompt tuning is useful when users need specific kinds of responses without extensive model retraining (fine-tuning) and have limited resources. For example, in customer service chatbots, well-crafted soft prompts can streamline interactions and ensure that responses align with customer questions. In creative writing aids, prompts can inspire imaginative outputs while maintaining a consistent narrative style.

In fields where precise language and terminology are essential, such as healthcare, finance, and education, fine-tuning is frequently used. For instance, an LLM fine-tuned on medical records can help healthcare workers with report writing, patient history analysis, and prescription recommendation generation.

Scenarios like these can help you choose between prompt tuning and fine-tuning for a given real-time application.

Conclusion

Large language model tuning is not a one-size-fits-all process; the decision between fine-tuning and prompt tuning depends on the unique requirements and limitations of the given task. Every tuning technique has advantages and disadvantages of its own, so a developer must know when and how to use each one. By carefully navigating these trade-offs and utilizing LLMs both for their fundamental flexibility and for their capacity to adapt and perform well in a range of scenarios, we may be able to narrow the gap between human intentions and machine understanding.