# ReframeBot-Llama3.1-8B-AWQ

A 4-bit AWQ-quantized version of the merged ReframeBot-DPO-Llama3.1-8B model, optimized for high-throughput serving with vLLM.

This model combines the base Llama 3.1 8B Instruct model with the DPO-aligned CBT adapter, then compresses it using Activation-aware Weight Quantization (AWQ) for efficient production deployment.

## Usage

### vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Nhatminh1234/ReframeBot-Llama3.1-8B-AWQ", quantization="awq")
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = ["I'm feeling so overwhelmed with my thesis..."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```
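For deployment, the same checkpoint can also be exposed through vLLM's OpenAI-compatible server. The context length and port below are illustrative defaults, not values taken from the ReframeBot setup:

```shell
# Launch an OpenAI-compatible endpoint backed by the AWQ checkpoint
vllm serve Nhatminh1234/ReframeBot-Llama3.1-8B-AWQ \
    --quantization awq \
    --max-model-len 4096 \
    --port 8000
```

Any OpenAI-compatible client can then send chat completions to `http://localhost:8000/v1`.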

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Nhatminh1234/ReframeBot-Llama3.1-8B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "I'm feeling so overwhelmed with my thesis..."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Quantization Details

| Parameter | Value |
| --- | --- |
| Quantization Method | AWQ (Activation-aware Weight Quantization) |
| Bits | 4-bit |
| Group Size | 128 |
| Version | GEMM |
| Calibration Dataset | ReframeBot Socratic Dialogue Dataset (32 samples) |
| Hardware Used | NVIDIA RTX 5070 Laptop GPU (8 GB VRAM) |
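To give an intuition for the "4-bit" and "Group Size 128" settings above, the toy NumPy sketch below quantizes a weight vector group-wise to 16 integer levels with one scale and zero-point per 128-value group. This is only the generic group-wise quantization arithmetic, not the full AWQ algorithm, which additionally rescales salient channels using activation statistics from the calibration set:

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=128):
    """Toy group-wise asymmetric INT4 quantize/dequantize of a weight vector.

    Illustrative only: real AWQ also applies activation-aware channel
    scaling before quantizing.
    """
    qmax = 2**bits - 1                                  # 15 levels span each group
    w = w.reshape(-1, group_size)                       # one scale/zero per group
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero), 0, qmax)    # 4-bit integer codes
    return ((q - zero) * scale).reshape(-1)             # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
w_hat = quantize_groupwise(w)
print(np.abs(w - w_hat).max())  # small per-weight reconstruction error
```

Each group of 128 weights is represented by at most 16 distinct values plus one float scale and zero-point, which is where the roughly 4x memory reduction comes from.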

## Model Pipeline

  1. Base Model: Llama 3.1 8B Instruct
  2. Stage 1 (SFT): Fine-tuned on 4.5k CBT dialogues.
  3. Stage 2 (DPO): Aligned with 1.4k preference pairs for empathy.
  4. Stage 3 (Merge): Merged adapter into base model.
  5. Stage 4 (Quantize): AWQ 4-bit quantization for serving.
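Stage 3 above folds the adapter weights into the base model so serving needs no separate adapter. Assuming the DPO adapter is a LoRA adapter (the dimensions and names below are illustrative), the merge is just a rank-scaled matrix addition, which is what PEFT's `merge_and_unload()` performs per adapted layer:

```python
import numpy as np

# Toy LoRA merge: W_merged = W + (alpha / r) * B @ A
d, r, alpha = 8, 2, 4                    # hidden size, LoRA rank, LoRA alpha
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))              # frozen base weight
A = rng.normal(size=(r, d))              # LoRA down-projection
B = rng.normal(size=(d, r))              # LoRA up-projection

W_merged = W + (alpha / r) * B @ A       # adapter folded into the base weight

x = rng.normal(size=d)
# One matmul through W_merged equals the base path plus the adapter path:
assert np.allclose(W_merged @ x, W @ x + (alpha / r) * (B @ (A @ x)))
```

Merging before quantization matters here: AWQ quantizes the combined weights, so the adapter's effect survives compression without any runtime LoRA overhead.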

## Intended Use

Designed for production deployment in the ReframeBot system. Must be used with the accompanying Guardrail and RAG components for safe and accurate operation. Not a substitute for professional mental health care.

## Project

GitHub: ReframeBot
