MLX Quants of allura-org/Llama-3.3-70B-Joyous

MLX quants of allura-org/Llama-3.3-70B-Joyous, quantized with mlx-lm for inference on Apple Silicon.

Quants

Quant (Revision)    Bits per Weight
4.0 bpw             4.0
6.0 bpw             6.0
8.0 bpw             8.0
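
Each quant is published as its own revision of the repo, so you can download just the one you need. A minimal sketch using huggingface_hub's snapshot_download; the exact revision name here is an assumption, so check the repo's branches for the real spelling:

from huggingface_hub import snapshot_download

# Fetch a single quant revision; "4.0bpw" is an assumed branch name,
# check the repo's revisions for the exact spelling
local_path = snapshot_download(
    repo_id="allura-quants/Llama-3.3-70B-Joyous_MLX-hi",
    revision="4.0bpw",
)
print(local_path)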

Configure mlx

[uv venv] # First-time setup with uv (optional)
[uv] pip install -U mlx-lm

The uv wrapper is optional but recommended; get it with Homebrew:

brew install uv

Serve an OpenAI-compatible endpoint

[uv run] mlx_lm.server --model /path/to/weights/Llama-3.3-70B-Joyous_MLX-hi \
  --max-tokens -1 --temp 1.25 --min-p 0.05

The default URL is http://127.0.0.1:8080/v1
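
Any OpenAI-compatible client can talk to the server. A minimal sketch with the requests library; the model field is assumed to be accepted as-is, since the server has already loaded the weights:

import requests

# Chat completion against the local mlx_lm.server endpoint
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "Llama-3.3-70B-Joyous_MLX-hi",  # assumed: server answers with its loaded model
        "messages": [{"role": "user", "content": "hello"}],
        "temperature": 1.25,
        "max_tokens": 256,
    },
)
print(resp.json()["choices"][0]["message"]["content"])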

Programmatic usage

from mlx_lm import load, generate

# Load the quantized weights and tokenizer from the local path
model_path = "/path/to/weights/Llama-3.3-70B-Joyous_MLX-hi"
model, tokenizer = load(model_path)

prompt = "hello"

# Wrap the prompt in the model's chat template when one is available
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
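
The original card below recommends 1.25 temperature and 0.05 min-p. In recent mlx-lm releases, sampling settings are passed to generate as a sampler object; a sketch assuming a version that exposes make_sampler:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler  # present in recent mlx-lm releases

model, tokenizer = load("/path/to/weights/Llama-3.3-70B-Joyous_MLX-hi")

# Sampler matching the card's suggested settings: temp 1.25, min_p 0.05
sampler = make_sampler(temp=1.25, min_p=0.05)

response = generate(model, tokenizer, prompt="hello", sampler=sampler, verbose=True)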

Original model card:

Joyous 70B


One last hurrah for Llama 3.3 70B. I hope to never tune this model again. Let it die.

Joyous is a finetune of L3.3 70B designed for roleplay tasks; however (as my luck has been going recently), it turned out to be somewhat comically good at assistant tasks as well, scoring far beyond its base model in subjective assistant evals.

Merry Christmas, gooners!

Info

Use the Llama 3 chat template, obviously.

We recommend the following system prompt for assistant use cases:

You are Luna, a helpful and harmless language model by Allura.
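
Dropped into the chat-template flow from the programmatic example above, that looks like this (a sketch; only the system prompt itself comes from this card):

messages = [
    {"role": "system", "content": "You are Luna, a helpful and harmless language model by Allura."},
    {"role": "user", "content": "hello"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)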

I used 1.25 temp and 0.05 min_p while testing; your preferred samplers may differ.
