---
license: creativeml-openrail-m
tags:
- text-to-image
- stable-diffusion
- stable-diffusion-xl
- realistic
- photorealistic
- diffusers
library_name: diffusers
base_model: stabilityai/stable-diffusion-xl-base-1.0
---

# WAI REALCN (SDXL)

Photorealistic Stable Diffusion XL checkpoint released by the community as “WAI REALCN”. The model keeps the standard SDXL architecture (two CLIP text encoders, latent UNet, and VAE) and was shared on [Civitai](https://civitai.com/models/469902/wai-realcn).

## Model Summary

- Task: text-to-image generation at 1024×1024 (and downscaled resolutions).
- Architecture: SDXL with two CLIP text encoders (`CLIPTextModel` + `CLIPTextModelWithProjection`), UNet with cross-attention, and an AutoencoderKL VAE (scaling factor 0.13025).
- Scheduler: EulerDiscreteScheduler by default; other SDXL schedulers from `diffusers` also work.
- Format: Diffusers pipeline (`StableDiffusionXLPipeline`) with FP16 weights expected at load time for GPU inference.

## Recommended Use

- Photorealistic portraits and lifestyle imagery; neutral prompting works best (avoid over-stylized prompts).
- Works with standard SDXL negative prompting (e.g., “blurry, low quality, artifacts, extra limbs”).
- 1024×1024 is the native resolution; smaller sizes are fine, while larger sizes may need a separate upscaling step.

## Quickstart (Diffusers)

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the pipeline in half precision and move it to the GPU.
# Replace the placeholder repo id with the actual repository name.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "YOUR_USERNAME_HERE/deewaiREALCN",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a candid street portrait of a young adult, soft daylight, shallow depth of field, high detail"
negative_prompt = "blurry, low quality, extra fingers, distorted face"

# Generate a single image and take the first result.
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("sample.png")
```

## Files and Architecture Notes

- `model_index.json`: Declares `StableDiffusionXLPipeline` with dual tokenizers/encoders (standard SDXL design).
- `tokenizer/` & `tokenizer_2/`: Separate CLIP tokenizers matching the two text encoders; keep both to preserve padding/special-token behavior.
- `text_encoder/`: 12-layer CLIP text encoder (768 hidden size, quick GELU).
- `text_encoder_2/`: 32-layer CLIP text encoder with projection (1280 hidden size, GELU).
- `unet/`: Latent UNet with cross-attention (`sample_size`: 128 → 1024px images).
- `vae/`: AutoencoderKL with `scaling_factor: 0.13025` for latents.
- `scheduler/`: Default Euler scheduler settings.

## Prompting Tips

- Start concise: subject + setting + lighting + camera feel (e.g., “portrait, indoor window light, 85mm, f/1.8”).
- Add quality anchors sparingly (“high detail”, “natural skin”, “cinematic lighting”).
- Keep negatives short; overlong negatives can reduce fidelity.

## Safety and Limitations

- May reproduce biases or create sensitive/NSFW content; review outputs before use.
- Not intended or validated for medical, legal, or safety-critical applications.
- Respect the CreativeML Open RAIL-M license; comply with downstream use restrictions.

## More Information About the Model's Construction

It’s a **text-to-image diffusion** pipeline with the usual components:

* **tokenizer + text_encoder**: turns the prompt into embeddings
* **tokenizer_2 + text_encoder_2**: a second, larger text encoder (so it’s **dual-encoder**)
* **UNet2DConditionModel**: the main denoiser that predicts noise at each diffusion step
* **VAE (AutoencoderKL)**: converts images ↔ latent space
* **EulerDiscreteScheduler**: controls the denoising step schedule (Euler sampler)

This structure is typical of **SDXL-like** pipelines (two CLIP text encoders, big UNet, VAE, Euler scheduler).

### 2) Two text encoders (why and what “projection” means)

The pipeline has:

* **Text Encoder (12-layer, 768d, 123M params)**
  Its size matches the CLIP ViT-L/14 text tower.
* **Text Encoder 2 (32-layer, 1280d, 695M params)** with **projection**
  “WithProjection” means the encoder also applies a learned projection head to its pooled output. In SDXL, that pooled, projected embedding serves as an additional global conditioning vector.

Net: the prompt is encoded twice. The per-token hidden states from both encoders feed cross-attention, and the pooled embedding from the second encoder is added as global conditioning, which gives richer conditioning than a single encoder.

### 3) UNet config tells you image resolution and how heavy the model is

Key fields:

* **sample_size=128 (latent 4 x 128 x 128)**
  Latents are 128×128. Since SD VAEs typically downsample by **8×**, this corresponds to about **1024×1024 output resolution** (128×8 = 1024).
* **cross_attention_dim=2048**
  The UNet expects conditioning vectors of width 2048. The hidden states of the two encoders (768 and 1280) concatenate to **2048** (768 + 1280), which is a strong hint that this is SDXL-style dual conditioning.
* **block_out_channels=[320, 640, 1280]**, attention heads scale as **[5, 10, 20]**
  These are the channel widths per stage and show how attention capacity increases deeper in the network.
* **transformer_layers_per_block=[1, 2, 10]**
  The deepest blocks have many transformer layers (10), which accounts for much of the model's size.
* **params=2567.46M**
  The UNet alone has **2.57B parameters**, which is the size of the SDXL base UNet (roughly 2.6B parameters).

Also:

* **addition_embed_type=text_time** indicates SDXL-style added conditioning: the pooled text embedding is combined with embedded “time ids” (original size, crop coordinates, and target size).

### 4) VAE details (encoding/decoding)

* **latent_channels=4**: standard SD latent format.
* **scaling_factor=0.13025**: how latents are scaled when passed between UNet and VAE (Diffusers uses this internally).
* **force_upcast=True**: during decode/encode the VAE may be upcast to float32 for numerical stability (helps avoid artifacts, but costs memory).

### 5) Total size and what it implies for VRAM

Rough total params:

* Text encoders: 123M + 695M ≈ **818M**
* UNet: **2567M**
* VAE: **84M**

Total ≈ **3.47B parameters**.

Implication: this is a **very heavy** pipeline.
In fp16/bf16 the raw weights alone take roughly 7 GB (3.47B parameters × 2 bytes), and runtime activations add more on top. You typically need:

* aggressive memory tricks (attention slicing, xFormers, CPU offload), or
* a large-VRAM GPU for comfortable 1024×1024 generation.

### 6) The `torch_dtype` warning

Some recent Hugging Face library releases rename the `torch_dtype` argument to `dtype` and print a deprecation warning when the old name is used. If you see that warning, pass `dtype=...` instead of `torch_dtype=...` to whichever loader or wrapper emits it.
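
### 7) Sanity-checking the numbers

As a rough cross-check of the figures quoted in sections 3 and 5, the arithmetic can be sketched in plain Python. The parameter counts are copied from this card; the bytes-per-parameter values are the standard sizes for each dtype. The result covers weights only and ignores activations, the CUDA context, and framework overhead, so it should be read as a lower bound rather than a VRAM requirement. The names `PARAMS_M`, `weight_gib`, and `VAE_DOWNSAMPLE` are illustrative and not part of `diffusers`.

```python
# Parameter counts (in millions) as quoted in this model card.
PARAMS_M = {
    "text_encoder": 123,
    "text_encoder_2": 695,
    "unet": 2567.46,
    "vae": 84,
}

# Bytes per parameter for common dtypes.
BYTES_PER_PARAM = {"float16": 2, "bfloat16": 2, "float32": 4}


def weight_gib(params_millions: float, dtype: str) -> float:
    """Return the raw weight size in GiB for a parameter count (in millions)."""
    return params_millions * 1e6 * BYTES_PER_PARAM[dtype] / 1024**3


total_m = sum(PARAMS_M.values())
print(f"total parameters: {total_m / 1000:.2f}B")
for dtype in ("float16", "float32"):
    print(f"{dtype}: {weight_gib(total_m, dtype):.1f} GiB of weights")

# Consistency checks for the UNet config described in section 3.
VAE_DOWNSAMPLE = 8  # assumption: the usual 8x spatial downsampling of SD VAEs
pixel_size = 128 * VAE_DOWNSAMPLE  # sample_size * downsample factor
context_dim = 768 + 1280           # concatenated hidden sizes of both text encoders
print(f"native resolution: {pixel_size}x{pixel_size}")
print(f"cross_attention_dim: {context_dim}")
```

Under these assumptions the fp16 weights come to about 6.5 GiB (roughly 6.9 GB in decimal units), which is why memory-saving options or a large-VRAM GPU are suggested in section 5.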