[llama.cpp] quantize: Quantization Example
jykim23
2024. 1. 4. 18:39
Environment
OS: Ubuntu Server 22.04
GPU: RTX 4060 Ti 16GB
Get the Code
git clone https://github.com/ggerganov/llama.cpp.git; cd llama.cpp
Build
Details: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#build
make LLAMA_OPENBLAS=1
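Note: the OpenBLAS build is CPU-only. To offload layers to the GPU (as the Python example at the end does with n_gpu_layers=-1), build with CUDA support instead; at the time of writing, the Makefile flag for that was:
make LLAMA_CUBLAS=1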
Download the model locally via the CLI
Model download guide: https://huggingface.co/docs/huggingface_hub/ko/guides/download#download-from-the-cli
Adjust the repository name and --local-dir to the model you want.
huggingface-cli download yanolja/KoSOLAR-10.7B-v0.1 --local-dir=./KoSOLAR-10.7B-v0.1
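If you prefer Python over the CLI, the huggingface_hub library can do the same download. A minimal sketch with snapshot_download, using the same repository and target directory as the command above:

from huggingface_hub import snapshot_download

# Same repo and local directory as the CLI command above
snapshot_download(repo_id="yanolja/KoSOLAR-10.7B-v0.1", local_dir="./KoSOLAR-10.7B-v0.1")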
Quantize
Details: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-data--run
# Convert to GGUF
python3 convert.py models/KoSOLAR-10.7B-v0.1/
# Quantize (q4_1 example)
./quantize ./models/KoSOLAR-10.7B-v0.1/ggml-model-f16.gguf ./models/KoSOLAR-10.7B-v0.1/ggml-model-q4_1.gguf q4_1
Quantization details: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#quantization
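To produce several quantization levels in one pass, a small Python script can loop over the quantize binary. This is a minimal sketch assuming the paths above; the type names (q4_0, q4_1, q5_1, q8_0) come from the llama.cpp README, so pick the ones you actually need:

import subprocess

MODEL_DIR = "./models/KoSOLAR-10.7B-v0.1"

# Run ./quantize once per desired quantization type
for qtype in ["q4_0", "q4_1", "q5_1", "q8_0"]:
    subprocess.run(
        ["./quantize",
         f"{MODEL_DIR}/ggml-model-f16.gguf",
         f"{MODEL_DIR}/ggml-model-{qtype}.gguf",
         qtype],
        check=True,
    )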
Result
Use the quantized model (./models/KoSOLAR-10.7B-v0.1/ggml-model-q4_1.gguf) with llama-cpp-python and similar tools.
Python example (e.g., with Streamlit)
import streamlit as st
from llama_cpp import Llama

# Load the quantized model; n_gpu_layers=-1 offloads all layers to the GPU
llm = Llama(model_path="./models/KoSOLAR-10.7B-v0.1/ggml-model-q4_1.gguf", n_gpu_layers=-1, n_ctx=4096)
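A fuller sketch that turns the snippet above into a tiny Streamlit app. The prompt handling is an assumption (a plain completion call with no chat template); llama-cpp-python's Llama object is callable and returns the generated text under choices[0]["text"]:

import streamlit as st
from llama_cpp import Llama

@st.cache_resource  # load the model once per session, not on every rerun
def load_llm():
    return Llama(model_path="./models/KoSOLAR-10.7B-v0.1/ggml-model-q4_1.gguf",
                 n_gpu_layers=-1, n_ctx=4096)

llm = load_llm()
prompt = st.text_input("Prompt")
if prompt:
    out = llm(prompt, max_tokens=256)  # plain completion call
    st.write(out["choices"][0]["text"])

Save it as app.py and launch with: streamlit run app.py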