This document provides a simple example of how to run a large language model on the Wahab cluster. It demonstrates the process using the Mistral or Meta Llama models from Hugging Face.
To start an interactive session on a GPU node, use the following command:
salloc -c 8 -p gpu --gres gpu:1
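If you need a specific time limit, you can add one to the same request (the 4-hour value below is only an example; the partition limits listed at the end of this document still apply):
salloc -c 8 -p gpu --gres gpu:1 -t 4:00:00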
Load the necessary transformers module with:
module load transformers/4.44
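To confirm that the module loaded correctly and that the GPU is visible (assuming the suite bundles PyTorch), you can run a quick check:
crun.transformers python -c "import torch, transformers; print(transformers.__version__, torch.cuda.is_available())"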
Although not required, you can create a separate environment so that you can install additional Python packages:
crun.transformers -c -p ~/envs/llm_example_env
crun.transformers -p ~/envs/llm_example_env pip install X Y Z
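Once the environment exists, later commands in this document can be pointed at it by adding the same -p flag, for example (chat.py is the script described later in this document):
crun.transformers -p ~/envs/llm_example_env python chat.py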
If your model requires authorization, log in with:
crun.transformers huggingface-cli login --token XXXX
You can find your token at https://huggingface.co/settings/tokens.
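As an alternative to passing the token on the command line, huggingface_hub also reads it from the HF_TOKEN environment variable:
export HF_TOKEN=XXXX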
Models can be downloaded automatically when using the Auto* classes (AutoTokenizer/AutoModelFor*/pipeline) or manually with the following commands:
crun.transformers huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir Meta-Llama-3.1-8B-Instruct
crun.transformers huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 --local-dir Mistral-7B-Instruct-v0.3
Note: Some models require pre-approval (e.g., Meta Llama 3.1) or accepting a license (e.g., Mistral). Complete these steps on the Hugging Face website before downloading.
The following Python script (chat.py) instantiates a simple chatbot:
from transformers import (
    BitsAndBytesConfig,
    pipeline
)
from transformers.models.llama.modeling_llama import LlamaModel

# for the model name you can use a local path to a model directory
# or a Hugging Face model id

# using a Hugging Face model id
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# using a local path
# model_name = "Meta-Llama-3.1-8B-Instruct"

# 4-bit quantization so the model fits comfortably in GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="float16",
)

pipe = pipeline(
    "text-generation",
    model_name,
    trust_remote_code=True,
    model_kwargs=dict(
        quantization_config=bnb_config,
        low_cpu_mem_usage=True,
    )
)

messages = [
    {
        "role": "system",
        "content": "You are a helpful chatbot"
    },
]

extra_pipe_options = {}
# (optional) for Llama 3, a few additional settings are needed,
# otherwise you may see some warnings
if type(pipe.model.base_model) is LlamaModel:
    terminators = [
        pipe.tokenizer.eos_token_id,
        pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
    extra_pipe_options['eos_token_id'] = terminators
    extra_pipe_options['pad_token_id'] = pipe.tokenizer.eos_token_id

# simple chat loop: the full conversation is appended to `messages`
# on every turn so the model keeps its context
while True:
    sequences = pipe(
        messages,
        do_sample=True,
        max_new_tokens=2048,
        temperature=0.5,
        top_p=0.95,
        num_return_sequences=1,
        **extra_pipe_options,
    )
    response = sequences[-1]['generated_text'][-1]['content']
    print(response)
    prompt = input("Prompt: ")
    messages.append({
        "role": "assistant",
        "content": response,
    })
    messages.append({
        "role": "user",
        "content": prompt,
    })
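As written, the chat loop runs until you interrupt it (Ctrl-C). If you prefer a clean exit, one option (not part of the original script; the keywords are an arbitrary choice) is to replace the prompt = input("Prompt: ") line inside the loop with:
    prompt = input("Prompt: ")
    if prompt.strip().lower() in ("quit", "exit"):  # arbitrary exit keywords
        break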
Run the model and generate a response:
crun.transformers python chat.py
Output:
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:11<00:00, 2.99s/it]
I'm here to help you with any questions or topics you'd like to discuss. What's on your mind?
Prompt: hi, how are you?
I'm just a chatbot, so I don't have feelings like humans do, but I'm functioning properly and ready to help you. How about you? How's your day going so far?
Prompt:
You can also use the Transformers module from the Jupyter app in Open OnDemand, using the following settings:
Field | Value | Comment
---|---|---
Python Version | Python 3.10 |
Python Suite | Transformers 4.44 (generative ai, LLM) |
Additional Module Directory | none | or the environment you created earlier (e.g. ~/envs/llm_example_env)
Number of Cores | 8 |
Number of GPU | 1 |
Partition | gpu | or high-gpu-mem
Number of Hours | 4 | up to 24 hours for gpu, 4 hours for high-gpu-mem
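Once the notebook starts, a quick sanity check that the requested GPU is visible (again assuming PyTorch is available in the suite) is:
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU found")
From there, the same chat.py code shown earlier in this document can be run cell by cell.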
For fine-tuning, the PEFT library is already included in the transformers module. For other methods, create a new environment and install the additional packages you need, as described earlier in this document.
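As a starting point, a minimal PEFT/LoRA setup might look like the sketch below; the rank, target modules, and other hyperparameters are illustrative assumptions you would tune for your own task and model:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# wrap the base model with LoRA adapters; only the adapter weights are trained
lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed value)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only a small fraction of weights are trainable
The wrapped model can then be passed to the training workflow of your choice.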