Finetuning LLM

method for compressing prompts

The paper describes a method for compressing prompts used with large language models (LLMs), such as GPT-3.5, to reduce inference costs. The mathematical techniques and methods used in this context include:

  1. Quantization: The authors mention that some methods focus on reducing costs through quantization of model parameters. Quantization is a process that reduces the precision of numerical values in the model, which can lead to reduced memory requirements and faster computations.
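As a toy illustration (not from the paper), symmetric int8 quantization can be sketched in a few lines; `quantize_int8` and the example weights are invented for demonstration:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric int8 quantization: map floats onto the integer range [-127, 127]."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the quantized values."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# each weight is recovered to within half a quantization step (~0.0035 here)
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

The int8 array needs a quarter of the memory of the float32 weights, at the cost of a small, bounded reconstruction error.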

  2. Compression: Another approach involves compressing the model, potentially using techniques like arithmetic coding. Compression aims to reduce the size of the model, making it more efficient for inference.

  3. Instruction Tuning and Delta Tuning: Instruction tuning and delta tuning are mentioned as methods that modify model parameters during fine-tuning, adapting the model to target tasks more efficiently.

  4. Prompt Compression: The paper introduces a prompt compression system designed to generate a compressed prompt from the original prompt. The compression rate is defined, and the authors use a coarse-to-fine framework to achieve prompt compression. The budget controller dynamically allocates different compression ratios to various components in prompts.

  5. Perplexity: The paper discusses using perplexity as a measurement of how well a language model predicts a sample. It is used in the context of out-of-distribution (OoD) detection, where higher perplexity is considered indicative of unreliable predictions.
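Perplexity is straightforward to compute from per-token log-probabilities; a minimal sketch (the log-probabilities below are made up):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# a model that is certain of every token (log-prob 0) has perplexity 1;
# assigning every token probability 1/2 gives perplexity 2
print(perplexity([0.0, 0.0, 0.0]))                 # 1.0
print(round(perplexity([math.log(0.5)] * 4), 6))   # 2.0
```

Lower perplexity means the model found the text more predictable, which is why unusually high values can flag out-of-distribution inputs.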

  6. SelectiveContext: The authors mention a method called SelectiveContext, which evaluates the informativeness of lexical units by computing self-information with a small language model. This method drops less informative content for prompt compression.
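A simplified token-level sketch of this idea (SelectiveContext itself scores whole lexical units with a causal language model; the tokens and log-probabilities below are invented):

```python
import math

def self_information(logprob):
    """Self-information I(x) = -log2 p(x), in bits: rarer tokens carry more information."""
    return -logprob / math.log(2)

def drop_uninformative(tokens, logprobs, keep_ratio=0.6):
    """Keep the most informative keep_ratio fraction of tokens, preserving order."""
    info = [self_information(lp) for lp in logprobs]
    k = max(1, int(len(tokens) * keep_ratio))
    threshold = sorted(info, reverse=True)[k - 1]
    return [t for t, i in zip(tokens, info) if i >= threshold]

tokens = ["the", "cat", "sat", "on", "quantum"]
logprobs = [-0.1, -2.0, -1.5, -0.2, -6.0]    # invented small-LM log-probs
print(drop_uninformative(tokens, logprobs))  # ['cat', 'sat', 'quantum']
```

Highly predictable tokens like "the" carry little self-information and are dropped first.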

  7. Iterative Token-level Prompt Compression (ITPC): The paper introduces an iterative token-level prompt compression algorithm. This algorithm divides the target prompt into segments, computes conditional probabilities, and dynamically calculates compression thresholds to retain important information during compression.
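A heavily simplified sketch of the iterative idea (the real ITPC algorithm queries a small LM for conditional probabilities; `score_fn` below is a hypothetical stand-in):

```python
def iterative_compress(segments, score_fn, tau=0.5):
    """Process segments in order; within each, keep the highest-scoring tokens
    up to the segment budget, conditioning later segments on the compressed prefix."""
    compressed = []
    for seg in segments:
        scores = score_fn(compressed, seg)   # per-token "surprise" given the prefix
        k = max(1, int(len(seg) * tau))
        threshold = sorted(scores, reverse=True)[k - 1]
        compressed.extend(t for t, s in zip(seg, scores) if s >= threshold)
    return compressed

# toy scorer: pretend longer tokens are more surprising (a real scorer would
# use a small language model's conditional token probabilities)
toy_score = lambda prefix, seg: [len(t) for t in seg]
segs = [["a", "quick", "brown", "fox"], ["jumps", "over", "it"]]
print(iterative_compress(segs, toy_score))   # ['quick', 'brown', 'jumps']
```

The key property preserved here is the iteration: each segment's scores are computed conditioned on what has already been kept, not on the original full prompt.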

  8. Distribution Alignment: To narrow the gap between the distribution of a small language model and a black-box large model, the paper suggests using instruction tuning. This involves fine-tuning the small language model on data generated by the large language model.

These mathematical techniques and methods aim to optimize the efficiency of large language models for various tasks by reducing the size of prompts and models or by fine-tuning existing models to make them more cost-effective during inference.

Let’s break down the key formulas mentioned in the problem formulation and methodology of the paper “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models.”

Problem Formulation:

  1. Compression Rate (τ):
    • Formula: \(\tau = \frac{\tilde{L}}{L}\)
    • Explanation: The compression rate is defined as the ratio of the number of tokens in the compressed prompt \( \tilde{L} \) to the number of tokens in the original prompt \( L \). It represents the extent of compression, with smaller values indicating lower inference costs.
  2. Compression Ratio (1/τ):
    • Formula: Compression ratio = \(\frac{1}{\tau}\)
    • Explanation: The reciprocal of the compression rate gives the compression ratio, indicating how much the original prompt has been compressed. A smaller compression rate results in a larger compression ratio.
  3. Kullback-Leibler Divergence (KL):
    • Formula: \(\text{KL}\big(P(\tilde{x}_G \mid \tilde{x}),\ P(x_G \mid x)\big)\)
    • Explanation: KL divergence measures the difference between the distribution of tokens \( \tilde{x}_G \) generated from the compressed prompt \( \tilde{x} \) and the distribution of tokens \( x_G \) generated from the original prompt \( x \). The objective is to minimize this divergence, aiming for similarity between the two distributions.
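As a concrete illustration of the quantity being minimized, KL divergence between two discrete token distributions (the distributions below are invented):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # hypothetical next-token distribution under the compressed prompt
q = [0.7, 0.2, 0.1]   # hypothetical distribution under the original prompt
print(kl_divergence(p, q))                        # 0.0 -- identical distributions
print(kl_divergence([0.9, 0.1], [0.5, 0.5]) > 0)  # True -- divergence grows with mismatch
```

A divergence of zero means the compressed prompt leads the model to exactly the same output distribution as the original.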

Methodology:

4.1 Budget Controller:

  1. Demonstration Compression Rate (τdems):
    • Formula: \(\tau_{dems} = \frac{\tau L - (\tau_{ins} L_{ins} + \tau_{que} L_{que})}{L_{dems}}\)
    • Explanation: This formula derives the compression rate for the demonstrations \( \tau_{dems} \) from the overall target compression rate \( \tau \), the compression rates assigned to the instruction \( \tau_{ins} \) and the question \( \tau_{que} \), and the token lengths \( L, L_{ins}, L_{que}, L_{dems} \) of the corresponding components in the original prompt.
  2. Adjust Compression Ratios for Instructions and Questions:
    • Formula: \(\Delta \tau = \frac{\tau_{dems} L_{dems} - \tilde{L}_{dems}}{L_{ins} + L_{que}}\)
    • Explanation: After coarse-grained compression selects k demonstrations totalling \( \tilde{L}_{dems} \) tokens, this formula computes the leftover budget \( \Delta \tau \), which is reallocated to the instructions and questions so that the overall compression rate target is still met.
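A worked numeric example of the two budget formulas (all token counts and rates below are invented):

```python
# component lengths of a hypothetical prompt
L_ins, L_dems, L_que = 100, 1600, 300
L = L_ins + L_dems + L_que            # 2000 tokens in total
tau = 0.25                             # overall target: keep 500 tokens
tau_ins = tau_que = 0.5                # halve the instruction and the question

# compression rate left over for the demonstrations
tau_dems = (tau * L - (tau_ins * L_ins + tau_que * L_que)) / L_dems
print(tau_dems)     # 0.1875 -> demonstrations get a 300-token budget

# coarse-grained selection keeps whole demonstrations totalling 280 tokens,
# slightly under budget; the leftover is reallocated to instruction and question
L_dems_kept = 280
delta_tau = (tau_dems * L_dems - L_dems_kept) / (L_ins + L_que)
print(delta_tau)    # 0.05 extra budget for the instruction and question
```

Because whole demonstrations are kept or dropped, the budget is rarely used exactly, which is why the reallocation step exists.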

4.2 Iterative Token-level Prompt Compression (ITPC):

  1. Compression Threshold Calculation (γi):
    • Symbol: \( \gamma_i \)
    • Explanation: Rather than being fixed, the compression threshold \( \gamma_i \) is calculated dynamically for each segment \( s_j \) from that segment's perplexity distribution and its assigned compression ratio \( \tau_{s_j} \); tokens whose conditional perplexity falls below the threshold are dropped, so that the most important information is retained during compression.
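One simple way to realize such a dynamic threshold (an illustrative assumption, not the paper's exact procedure) is to take the perplexity value at the rank matching the segment's compression ratio:

```python
def dynamic_threshold(perplexities, tau_seg):
    """Pick gamma so that keeping tokens with perplexity >= gamma
    retains roughly a tau_seg fraction of the segment."""
    k = max(1, round(len(perplexities) * tau_seg))
    return sorted(perplexities, reverse=True)[k - 1]

ppl = [1.2, 8.0, 3.5, 0.9, 5.1, 2.2]      # invented per-token perplexities
gamma = dynamic_threshold(ppl, 0.5)
print(gamma)                               # 3.5
print([p for p in ppl if p >= gamma])      # [8.0, 3.5, 5.1]
```

Segments whose perplexities are tightly clustered get a threshold close to their median, while segments with a few highly surprising tokens keep exactly those tokens.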

4.3 Distribution Alignment:

  1. Instruction Tuning Objective Function:
    • Formula: \(\min_{\theta_{M_s}} \mathbb{E}\left[ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(x_i, y_i^{LLM}; \theta_{M_s}) \right]\)
    • Explanation: The objective function for instruction tuning minimizes the average loss \( \mathcal{L} \) of the small language model, with parameters \( \theta_{M_s} \), over pairs of instructions \( x_i \) and texts \( y_i^{LLM} \) generated by the large language model.
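The objective is simply an average loss over instruction/response pairs; a toy sketch in which `loss_fn` stands in for the small model's training loss (all names and values invented):

```python
def alignment_objective(theta, pairs, loss_fn):
    """Average the per-example loss L(x_i, y_LLM_i; theta) over the N pairs."""
    return sum(loss_fn(x, y, theta) for x, y in pairs) / len(pairs)

# toy stand-in: a scalar "parameter" and squared error against the LLM's output
loss_fn = lambda x, y, theta: (theta - y) ** 2
pairs = [("instruction 1", 1.0), ("instruction 2", 3.0)]
print(alignment_objective(2.0, pairs, loss_fn))   # 1.0
```

In practice the pairs are instructions and LLM-generated completions, and minimizing this loss pulls the small model's output distribution toward the large model's.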

These formulas collectively describe the approach taken in LLMLingua for prompt compression, involving budget allocation, token-level compression, and distribution alignment. The detailed explanations provide insights into how each formula contributes to the overall methodology.

This post is licensed under CC BY 4.0 by the author.