vLLM

Installation

You need to install the vllm library to use the vLLM integration. See the installation section below for instructions on installing vLLM for CPU or ROCm.

Load the model

Outlines supports models available via vLLM's offline batched inference interface. You can load a model using:

from outlines import models

model = models.vllm("mistralai/Mistral-7B-v0.1")

Or alternatively:

import vllm
from outlines import models

llm = vllm.LLM("mistralai/Mistral-7B-v0.1")
model = models.VLLM(llm)

Models are loaded from the Hugging Face Hub.

Device

The default installation of vLLM only allows loading models on GPU. See the installation instructions below to run models on CPU.

You can pass any parameter that you would normally pass to vllm.LLM as keyword arguments:

from outlines import models

model = models.vllm(
    "mistralai/Mistral-7B-v0.1",
    trust_remote_code=True,
    gpu_memory_utilization=0.7
)

Main parameters:

| Parameters | Type | Description | Default |
|---|---|---|---|
| tokenizer_mode | str | "auto" will use the fast tokenizer if available and "slow" will always use the slow tokenizer. | auto |
| trust_remote_code | bool | Trust remote code when downloading the model and tokenizer. | False |
| tensor_parallel_size | int | The number of GPUs to use for distributed execution with tensor parallelism. | 1 |
| dtype | str | The data type for the model weights and activations. Currently, we support float32, float16, and bfloat16. If auto, we use the torch_dtype attribute specified in the model config file. However, if the torch_dtype in the config is float32, we will use float16 instead. | auto |
| quantization | Optional[str] | The method used to quantize the model weights. Currently, we support "awq", "gptq" and "squeezellm". If None, we first check the quantization_config attribute in the model config file. If that is None, we assume the model weights are not quantized and use dtype to determine the data type of the weights. | None |
| revision | Optional[str] | The specific model version to use. It can be a branch name, a tag name, or a commit id. | None |
| tokenizer_revision | Optional[str] | The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id. | None |
| gpu_memory_utilization | float | The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high, it may cause out-of-memory (OOM) errors. | 0.9 |
| swap_space | int | The size (GiB) of CPU memory per GPU to use as swap space. This can be used for temporarily storing the states of the requests when their best_of sampling parameters are larger than 1. If all requests will have best_of=1, you can safely set this to 0. Otherwise, too small values may cause out-of-memory (OOM) errors. | 4 |
| enforce_eager | bool | Whether to enforce eager execution. If True, we will disable CUDA graph and always execute the model in eager mode. If False, we will use CUDA graph and eager execution in hybrid. | False |
| enable_lora | bool | Whether to enable loading LoRA adapters. | False |

See the vLLM code for a list of all the available parameters.
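
For instance, a model pinned to a specific revision and loaded in half precision could look like the following sketch (the revision and memory settings are illustrative):

from outlines import models

model = models.vllm(
    "mistralai/Mistral-7B-v0.1",
    dtype="float16",             # load the weights in half precision
    revision="main",             # pin a branch name, tag name or commit id
    gpu_memory_utilization=0.8,  # leave some GPU memory for other processes
)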

Use quantized models

vLLM supports AWQ, GPTQ and SqueezeLLM quantized models:

from outlines import models

model = models.vllm("TheBloke/Llama2-7b-Chat-AWQ", quantization="awq")
model = models.vllm("TheBloke/Mistral-7B-Instruct-v0.2-GPTQ", quantization="gptq")
model = models.vllm("https://huggingface.co/squeeze-ai-lab/sq-llama-30b-w4-s5", quantization="squeezellm")

Dependencies

To use AWQ models you need to install the autoawq library: pip install autoawq.

To use GPTQ models you need to install the auto-gptq and optimum libraries: pip install auto-gptq optimum.

Multi-GPU usage

To run multi-GPU inference with vLLM you need to set the tensor_parallel_size argument to the number of GPUs available when initializing the model. For instance, to run inference on 2 GPUs:

from outlines import models

model = models.vllm(
    "mistralai/Mistral-7B-v0.1"
    tensor_parallel_size=2
)

Load LoRA adapters

You can load LoRA adapters and alternate between them dynamically:

from outlines import models

model = models.vllm("facebook/opt-350m", enable_lora=True)
model.load_lora("ybelkaa/opt-350m-lora")  # Load LoRA adapter
model.load_lora(None)  # Unload LoRA adapter
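
Once an adapter is loaded, subsequent generations use it until you load another one or unload it with None. A minimal sketch (the prompt is illustrative):

from outlines import models, generate

model = models.vllm("facebook/opt-350m", enable_lora=True)
model.load_lora("ybelkada/opt-350m-lora")  # generations now use the adapter

generator = generate.text(model)
answer = generator("A prompt")

model.load_lora(None)  # back to the base model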

Generate text

In addition to the parameters described in the text generation section, you can pass an instance of SamplingParams directly to any generator via the sampling_params keyword argument:

from vllm.sampling_params import SamplingParams
from outlines import models, generate


model = models.vllm("mistralai/Mistral-7b-v0.1")
generator = generate.text(model)

params = SamplingParams(n=2, frequency_penalty=1., min_tokens=2)
answer = generator("A prompt", sampling_params=params)

This also works with generators built with generate.regex, generate.json, generate.cfg, generate.format and generate.choice.

Note

The values passed via the SamplingParams instance supersede the other arguments to the generator or the samplers.
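
For example, a generator built with generate.choice accepts the same sampling_params keyword argument (a sketch; the prompt and choices are illustrative):

from vllm.sampling_params import SamplingParams
from outlines import models, generate

model = models.vllm("mistralai/Mistral-7B-v0.1")
generator = generate.choice(model, ["Positive", "Negative"])

params = SamplingParams(seed=10, temperature=0.5)
answer = generator("Review: the food was delicious.", sampling_params=params)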

SamplingParams attributes:

| Parameters | Type | Description | Default |
|---|---|---|---|
| n | int | Number of output sequences to return for the given prompt. | 1 |
| best_of | Optional[int] | Number of output sequences that are generated from the prompt. From these best_of sequences, the top n sequences are returned. best_of must be greater than or equal to n. This is treated as the beam width when use_beam_search is True. By default, best_of is set to n. | None |
| presence_penalty | float | Float that penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. | 0.0 |
| frequency_penalty | float | Float that penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens. | 0.0 |
| repetition_penalty | float | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. | 1.0 |
| temperature | float | Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling. | 1.0 |
| top_p | float | Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | 1.0 |
| top_k | int | Integer that controls the number of top tokens to consider. Set to -1 to consider all tokens. | -1 |
| min_p | float | Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this. | 0.0 |
| seed | Optional[int] | Random seed to use for the generation. | None |
| use_beam_search | bool | Whether to use beam search instead of sampling. | False |
| length_penalty | float | Float that penalizes sequences based on their length. Used in beam search. | 1.0 |
| early_stopping | Union[bool, str] | Controls the stopping condition for beam search. It accepts the following values: True, where the generation stops as soon as there are best_of complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm). | False |
| stop | Optional[Union[str, List[str]]] | List of strings that stop the generation when they are generated. The returned output will not contain the stop strings. | None |
| stop_token_ids | Optional[List[int]] | List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens. | None |
| include_stop_str_in_output | bool | Whether to include the stop strings in output text. | False |
| ignore_eos | bool | Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. | False |
| max_tokens | int | Maximum number of tokens to generate per output sequence. | 16 |
| min_tokens | int | Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated. | 0 |
| skip_special_tokens | bool | Whether to skip special tokens in the output. | True |
| spaces_between_special_tokens | bool | Whether to add spaces between special tokens in the output. | True |
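
For instance, to make generation deterministic, bound the output length and stop at the first blank line, you could combine a few of these attributes (a sketch; the values are illustrative):

from vllm.sampling_params import SamplingParams

params = SamplingParams(
    temperature=0.0,   # greedy sampling
    max_tokens=128,    # upper bound on the output length
    stop=["\n\n"],     # the stop string is not included in the output
)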

Streaming

Warning

Streaming is not available for the offline vLLM integration.

Installation

By default, the vLLM library is installed with pre-compiled C++ and CUDA binaries and will only run on GPU:

pip install vllm
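
You can check that the installation succeeded by printing the library version (a quick sanity check):

import vllm

print(vllm.__version__)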

CPU

You need to have the gcc compiler installed on your system, and you will need to build vLLM from source. First, clone the repository:

git clone https://github.com/vllm-project/vllm.git
cd vllm

Install the Python packages needed for the installation:

pip install --upgrade pip
pip install wheel packaging ninja "setuptools>=49.4.0" numpy
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu

and finally run:

VLLM_TARGET_DEVICE=cpu python setup.py install
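
Once the CPU build is installed, models are loaded exactly as described above. A minimal sketch to check the build, using a small model such as facebook/opt-125m to keep the check fast:

from outlines import models, generate

model = models.vllm("facebook/opt-125m")
answer = generate.text(model)("A prompt")
print(answer)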

See the vLLM documentation for more details, alternative installation methods (Docker) and performance tips.

ROCm

You will need to install vLLM from source. First, install PyTorch on ROCm:

pip install torch==2.2.0.dev20231206+rocm5.7 --index-url https://download.pytorch.org/whl/nightly/rocm5.7 # tested version

You will then need to install Flash Attention for ROCm by following these instructions. You can then install xformers==0.0.23 and apply the patches needed to adapt Flash Attention for ROCm:

pip install xformers==0.0.23 --no-deps
bash patch_xformers.rocm.sh

And finally build vLLM:

cd vllm
pip install -U -r requirements-rocm.txt
python setup.py install # This may take 5-10 minutes.

See the vLLM documentation for alternative installation methods (Docker).