# ReframeBot-Llama3.1-8B-AWQ

4-bit AWQ-quantized version of the merged ReframeBot-DPO-Llama3.1-8B, optimized for high-throughput serving with vLLM.

This model combines the base Llama 3.1 8B Instruct model with the DPO-aligned CBT adapter, then compresses the merged weights with Activation-aware Weight Quantization (AWQ) for efficient production deployment.
## Usage

### vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Nhatminh1234/ReframeBot-Llama3.1-8B-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = ["I'm feeling so overwhelmed with my thesis..."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```
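Note that `llm.generate` treats prompts as raw text, so the Llama 3.1 chat template is not applied automatically. A minimal sketch of templated prompting via the bundled tokenizer (the single user turn is illustrative):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "Nhatminh1234/ReframeBot-Llama3.1-8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, quantization="awq")

# Render the user turn with the Llama 3.1 chat template before generating.
messages = [{"role": "user", "content": "I'm feeling so overwhelmed with my thesis..."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256))
print(outputs[0].outputs[0].text)
```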
### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading AWQ checkpoints through Transformers requires the autoawq package.
model_id = "Nhatminh1234/ReframeBot-Llama3.1-8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```
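Generation then follows the standard Llama 3.1 chat flow; a short sketch continuing from the snippet above (sampling values mirror the vLLM example):

```python
# Continues from the loading snippet above (tokenizer and model already defined).
messages = [{"role": "user", "content": "I'm feeling so overwhelmed with my thesis..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```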
## Quantization Details

| Parameter | Value |
|---|---|
| Quantization Method | AWQ (Activation-aware Weight Quantization) |
| Bits | 4-bit |
| Group Size | 128 |
| Version | GEMM |
| Calibration Dataset | ReframeBot Socratic Dialogue Dataset (32 samples) |
| Hardware | NVIDIA RTX 5070 Laptop GPU (8 GB VRAM) |
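These settings correspond to a standard AutoAWQ run; a minimal sketch, assuming the merged FP16 checkpoint is available locally (the paths and calibration list are placeholders for the 32 Socratic-dialogue samples):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

merged_path = "path/to/ReframeBot-DPO-Llama3.1-8B"  # placeholder: merged FP16 checkpoint
quant_path = "ReframeBot-Llama3.1-8B-AWQ"

# Mirrors the table above: 4-bit weights, group size 128, GEMM kernels.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(merged_path)
tokenizer = AutoTokenizer.from_pretrained(merged_path)

calib_samples = ["..."]  # placeholder: 32 dialogues from the calibration dataset
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_samples)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```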
## Model Pipeline

- Base Model: Llama 3.1 8B Instruct
- Stage 1 (SFT): Fine-tuned on 4.5k CBT dialogues.
- Stage 2 (DPO): Aligned with 1.4k preference pairs for empathy.
- Stage 3 (Merge): Merged the adapter into the base model (see the sketch after this list).
- Stage 4 (Quantize): AWQ 4-bit quantization for serving.
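For reference, Stage 3 typically reduces to a PEFT merge; a minimal sketch, assuming the DPO stage produced a LoRA adapter (the adapter path is a placeholder):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"
adapter_path = "path/to/ReframeBot-DPO-adapter"  # placeholder: DPO-aligned CBT adapter

# Load the full-precision base, attach the adapter, and fold its weights in.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()

merged.save_pretrained("ReframeBot-DPO-Llama3.1-8B")
AutoTokenizer.from_pretrained(base_id).save_pretrained("ReframeBot-DPO-Llama3.1-8B")
```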
## Intended Use

Designed for production deployment in the ReframeBot system. Must be used with the accompanying Guardrail and RAG components for safe and accurate operation. Not a substitute for professional mental health care.
## Project

GitHub: ReframeBot