Running the following code raises an error:

from llama import tokenizer, Llama, Dialog
checkpoint_dir = "/training-data/pakcages/llama/llama-2-7b-chat"
tokenizer_path = "/training-data/pakcages/llama/tokenizer.model"
temperature = 0.75
top_p = 0.9
max_seq_len = 128
max_gen_len = 64
max_batch_size = 4

generator = Llama.build(
    ckpt_dir=checkpoint_dir,
    tokenizer_path=tokenizer_path,
    max_seq_len=max_seq_len,
    max_batch_size=max_batch_size)

ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

It's caused by this block in the source:

if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("nccl")
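
For reference, one way around this error is to supply the env:// rendezvous variables yourself before calling Llama.build, which is what torchrun would export for a single process. A sketch (the port is an arbitrary choice, and I didn't go this route):

import os

# Pretend to be rank 0 of a single-process "cluster" so that
# torch.distributed.init_process_group("nccl") can rendezvous via env://.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")  # any free port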

So it fails to start because of the distributed setup. I tried to work around the distributed initialization, starting from its build function:

checkpoints = sorted(Path(ckpt_dir).glob("*.pth"))
ckpt_path = checkpoints[get_model_parallel_rank()]
checkpoint = torch.load(ckpt_path, map_location="cpu")

It checks whether the given checkpoint_dir contains .pth files. The 7B model ships as a single .pth file, so one process is enough. As I understand it, get_model_parallel_rank() roughly means launching one process per checkpoint file; the code comes from fairscale, the parallel-training package developed by the Facebook team:

def get_model_parallel_rank() -> int:
    """Return my rank for the model parallel group."""
    return torch.distributed.get_rank(group=get_model_parallel_group())

def get_model_parallel_group() -> torch.distributed.ProcessGroup:
    """Get the model parallel group the caller rank belongs to."""
    assert _MODEL_PARALLEL_GROUP is not None, "model parallel group is not initialized"
    return _MODEL_PARALLEL_GROUP
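
For context, _MODEL_PARALLEL_GROUP only exists after fairscale's model-parallel initialization has run; in llama's build() that happens roughly like this (paraphrased from memory, treat it as a sketch rather than a verbatim quote):

import os
import torch
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

if not torch.distributed.is_initialized():
    torch.distributed.init_process_group("nccl")
# One model-parallel rank per checkpoint shard; for the 7B model WORLD_SIZE is 1,
# so get_model_parallel_rank() would simply return 0 and pick the only .pth file.
model_parallel_size = int(os.environ.get("WORLD_SIZE", 1))
initialize_model_parallel(model_parallel_size)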

Never mind all that; the 7B model has just one weight file anyway, so load it straight from the directory:

import torch
from pathlib import Path

checkpoints = sorted(Path(checkpoint_dir).glob("*.pth"))
# Llama-2-7b model weights are distributed in a single file.
ckpt_path = checkpoints[0]
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
checkpoint = torch.load(ckpt_path, map_location=device)

Loading the 7B weights straight onto the GPU this way takes 13671 MB of VRAM.
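
If you want to read that figure from PyTorch itself rather than nvidia-smi, something like this works (the allocator's number will be a bit lower than what nvidia-smi reports, since it excludes the CUDA context):

print(f"{torch.cuda.memory_allocated() / 1024**2:.0f} MB allocated on the GPU")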

Converting with huggingface

Actually, forget rewriting it for now; let's convert the weights with huggingface instead. Install transformers; its conversion script lives at src/transformers/models/llama/convert_llama_weights_to_hf.py, where you can read the source.

The module is already there once transformers is installed, so a small wrapper script is enough:

# Thin wrapper around the CLI entry point of the conversion script.
from transformers.models.llama.convert_llama_weights_to_hf import main

if __name__ == "__main__":
    main()

That script is all it takes. Run it:

# python convert.py --help
usage: convert.py [-h] [--input_dir INPUT_DIR] [--model_size {7B,7Bf,13B,13Bf,30B,34B,65B,70B,70Bf,tokenizer_only}] [--output_dir OUTPUT_DIR] [--safe_serialization SAFE_SERIALIZATION]

options:
  -h, --help            show this help message and exit
  --input_dir INPUT_DIR
                        Location of LLaMA weights, which contains tokenizer.model and model folders
  --model_size {7B,7Bf,13B,13Bf,30B,34B,65B,70B,70Bf,tokenizer_only}
                        'f' models correspond to the finetuned versions, and are specific to the Llama2 official release. For more details on Llama2, checkout the original repo: https://huggingface.co/meta-llama
  --output_dir OUTPUT_DIR
                        Location to write HF model and tokenizer
  --safe_serialization SAFE_SERIALIZATION
                        Whether or not to save using `safetensors`.

--input_dir is the llama root directory (the one containing tokenizer.model and the model folders).

--model_size selects which model size to convert. There's a catch here: you can only pass one of the names listed above, but the model folders under the llama root are named like "llama-2-*b", while the conversion script looks for the weights under "input_dir/*B". So rename "llama-2-*b" to "*B" before running the script, as in the sketch below.
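
For the chat weights used at the top of this post, that means something like this (7Bf is the name the script expects for the fine-tuned/chat 7B release; the output path matches what I load below):

from pathlib import Path

root = Path("/training-data/pakcages/llama")
# The conversion script looks for the weights under <input_dir>/7Bf,
# so rename the "llama-2-7b-chat" folder accordingly.
(root / "llama-2-7b-chat").rename(root / "7Bf")

# Then run:
#   python convert.py --input_dir /training-data/pakcages/llama \
#                     --model_size 7Bf --output_dir llama_hf/7Bf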

The conversion itself doesn't need a GPU. Once it's done, just load the result with transformers; the 7B model takes about 26 GB of VRAM after loading.
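
The default-precision (float32) load is presumably where that 26 GB comes from; a sketch using the converted files from above:

import torch
from transformers import LlamaTokenizer, LlamaModel

tokenizer = LlamaTokenizer.from_pretrained('llama_hf/7Bf')
model = LlamaModel.from_pretrained('llama_hf/7Bf')  # default dtype is float32
model = model.to('cuda')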

Run:

total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters: {total_params}")
>>> Total number of parameters: 6607343616

So the model has 6,607,343,616 parameters in total.

A rough calculation: storing these parameters in single precision (32-bit floating point) takes 6607343616 * 32 / 8 / 1024 / 1024 / 1024 ≈ 24.6143 GB. Let's see what half precision looks like:

tokenizer = LlamaTokenizer.from_pretrained('llama_hf/7Bf')
model = LlamaModel.from_pretrained('llama_hf/7Bf', torch_dtype=torch.float16)

Putting that on the GPU takes about 13 GB of VRAM, which lines up with 6607343616 * 16 / 8 / 1024 / 1024 / 1024 ≈ 12.31 GB plus some overhead.

To push VRAM usage down further you'd need quantization; I'll write that up in a later post.

LlamaForCausalLM can be used to generate responses; the plain LlamaModel can't.
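
A minimal generation sketch with LlamaForCausalLM (the prompt is a placeholder, and the sampling parameters are just the ones from the top of this post):

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained('llama_hf/7Bf')
model = LlamaForCausalLM.from_pretrained('llama_hf/7Bf', torch_dtype=torch.float16).to('cuda')

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
with torch.no_grad():
    output = model.generate(**inputs, do_sample=True,
                            temperature=0.75, top_p=0.9, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))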
