# ReframeBot-Llama3.1-8B-AWQ

4-bit AWQ-quantized version of the merged ReframeBot-DPO-Llama3.1-8B, optimized for high-throughput serving with vLLM.

This model combines the base Llama 3.1 8B Instruct model with the DPO-aligned CBT adapter, then compresses the merged weights with Activation-aware Weight Quantization (AWQ) for efficient production deployment.
## Usage

### vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Nhatminh1234/ReframeBot-Llama3.1-8B-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = ["I'm feeling so overwhelmed with my thesis..."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```
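Note that `llm.generate` treats prompts as raw text, so the Llama 3.1 chat template is not applied automatically. A minimal sketch of templated prompting via the bundled tokenizer (the single user turn is illustrative):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "Nhatminh1234/ReframeBot-Llama3.1-8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, quantization="awq")

# Render the user turn with the Llama 3.1 chat template before generating.
messages = [{"role": "user", "content": "I'm feeling so overwhelmed with my thesis..."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256))
print(outputs[0].outputs[0].text)
```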
### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading AWQ checkpoints through Transformers requires the autoawq package.
model_id = "Nhatminh1234/ReframeBot-Llama3.1-8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```
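Generation then follows the standard Llama 3.1 chat flow; a short sketch continuing from the snippet above (sampling values mirror the vLLM example):

```python
# Continues from the loading snippet above (tokenizer and model already defined).
messages = [{"role": "user", "content": "I'm feeling so overwhelmed with my thesis..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```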
## Quantization Details

| Parameter | Value |
|---|---|
| Quantization Method | AWQ (Activation-aware Weight Quantization) |
| Bits | 4-bit |
| Group Size | 128 |
| Version | GEMM |
| Calibration Dataset | ReframeBot Socratic Dialogue Dataset (32 samples) |
| Hardware | NVIDIA RTX 5070 Laptop GPU (8 GB VRAM) |
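These settings correspond to a standard AutoAWQ run; a minimal sketch, assuming the merged FP16 checkpoint is available locally (the paths and calibration list are placeholders for the 32 Socratic-dialogue samples):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

merged_path = "path/to/ReframeBot-DPO-Llama3.1-8B"  # placeholder: merged FP16 checkpoint
quant_path = "ReframeBot-Llama3.1-8B-AWQ"

# Mirrors the table above: 4-bit weights, group size 128, GEMM kernels.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(merged_path)
tokenizer = AutoTokenizer.from_pretrained(merged_path)

calib_samples = ["..."]  # placeholder: 32 dialogues from the calibration dataset
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```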
## Model Pipeline

- Base Model: Llama 3.1 8B Instruct
- Stage 1 (SFT): Fine-tuned on 4.5k CBT dialogues.
- Stage 2 (DPO): Aligned with 1.4k preference pairs for empathy.
- Stage 3 (Merge): Merged the adapter into the base model (see the sketch after this list).
- Stage 4 (Quantize): AWQ 4-bit quantization for serving.
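For reference, Stage 3 typically reduces to a PEFT merge; a minimal sketch, assuming the DPO stage produced a LoRA adapter (the adapter path is a placeholder):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_path = "path/to/ReframeBot-DPO-adapter"  # placeholder: DPO-aligned CBT adapter

# Load the full-precision base, attach the adapter, and fold its weights in.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()

merged.save_pretrained("ReframeBot-DPO-Llama3.1-8B")
AutoTokenizer.from_pretrained(base_id).save_pretrained("ReframeBot-DPO-Llama3.1-8B")
```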
## Intended Use

Designed for production deployment in the ReframeBot system. Must be used with the accompanying Guardrail and RAG components for safe and accurate operation. Not a substitute for professional mental health care.
## Project

GitHub: ReframeBot