
[llama.cpp] quantize: Quantization Example

jykim23 2024. 1. 4. 18:39

Environment

OS: Ubuntu Server 22.04

GPU: RTX 4060 Ti 16GB

Get the Code

git clone https://github.com/ggerganov/llama.cpp.git; cd llama.cpp

Build

Details: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build

make LLAMA_OPENBLAS=1
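
The OpenBLAS flag only speeds up CPU inference. Since this machine has an NVIDIA GPU, a CUDA-enabled build is usually the better choice; as a sketch (assuming the CUDA toolkit is installed), the Makefile at the time of writing accepted a cuBLAS flag:

# Optional: build with cuBLAS (CUDA) so layers can be offloaded to the GPU
make LLAMA_CUBLAS=1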


Download the Model Locally via CLI

Model download guide: https://huggingface.co/docs/huggingface_hub/ko/guides/download#download-from-the-cli

Adjust the repo ID and --local-dir path below for the model you want.

huggingface-cli download yanolja/KoSOLAR-10.7B-v0.1 --local-dir=./KoSOLAR-10.7B-v0.1
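
The same download can also be scripted from Python with huggingface_hub's snapshot_download; a minimal sketch, reusing the repo ID and directory from the CLI command above:

# Sketch: download the model repo from Python instead of the CLI
from huggingface_hub import snapshot_download

snapshot_download(repo_id="yanolja/KoSOLAR-10.7B-v0.1", local_dir="./KoSOLAR-10.7B-v0.1")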


Quantize

Details: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-data--run

# Convert the HF model to GGUF
python3 convert.py models/KoSOLAR-10.7B-v0.1/

# Quantize (q4_1 example)
./quantize ./models/KoSOLAR-10.7B-v0.1/ggml-model-f16.gguf ./models/KoSOLAR-10.7B-v0.1/ggml-model-q4_1.gguf q4_1

Quantization details: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#quantization
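
As a quick sanity check before wiring the model into Python, the quantized file can be run directly with the main binary built earlier; a sketch (the prompt and token count are just examples, and -ngl offloads layers to the GPU):

./main -m ./models/KoSOLAR-10.7B-v0.1/ggml-model-q4_1.gguf -p "안녕하세요" -n 64 -ngl 99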


Results

Use the quantized model (./models/KoSOLAR-10.7B-v0.1/ggml-model-q4_1.gguf) with llama-cpp-python and similar tools.

Python example (e.g., Streamlit):

import streamlit as st
from llama_cpp import Llama

# Load the quantized model; n_gpu_layers=-1 offloads every layer to the GPU
llm = Llama(model_path="./models/KoSOLAR-10.7B-v0.1/ggml-model-q4_1.gguf", n_gpu_layers=-1, n_ctx=4096)
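
A sketch of a complete single-file Streamlit app built around this (the caching, prompt box, and max_tokens value are assumptions for illustration, not from the original post):

import streamlit as st
from llama_cpp import Llama

# Cache the model so Streamlit does not reload it on every rerun
@st.cache_resource
def load_model():
    return Llama(model_path="./models/KoSOLAR-10.7B-v0.1/ggml-model-q4_1.gguf",
                 n_gpu_layers=-1, n_ctx=4096)

llm = load_model()
prompt = st.text_area("Prompt")
if st.button("Generate") and prompt:
    # llama-cpp-python returns an OpenAI-style completion dict
    output = llm(prompt, max_tokens=256)
    st.write(output["choices"][0]["text"])

Saved as app.py, this runs with: streamlit run app.py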