CogView4Transformer2DModel
A Diffusion Transformer model for 2D data from CogView4.
The model can be loaded with the following code snippet.
import torch
from diffusers import CogView4Transformer2DModel

transformer = CogView4Transformer2DModel.from_pretrained("THUDM/CogView4-6B", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
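If the transformer is loaded separately like this (for example in a different dtype or from a local path), it can be handed to the text-to-image pipeline through the usual diffusers component-override pattern. The following is a minimal sketch, assuming the companion CogView4Pipeline and the standard prompt/images call convention; the prompt and output filename are placeholders.

import torch
from diffusers import CogView4Pipeline, CogView4Transformer2DModel

# Load the transformer on its own, then pass it to the pipeline so the remaining
# components (text encoder, VAE, scheduler) are pulled from the same checkpoint.
transformer = CogView4Transformer2DModel.from_pretrained(
    "THUDM/CogView4-6B", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")

# Standard text-to-image call; the prompt here is illustrative.
image = pipe(prompt="a watercolor painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")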
CogView4Transformer2DModel

class diffusers.CogView4Transformer2DModel

( patch_size: int = 2, in_channels: int = 16, out_channels: int = 16, num_layers: int = 30, attention_head_dim: int = 40, num_attention_heads: int = 64, text_embed_dim: int = 4096, time_embed_dim: int = 512, condition_dim: int = 256, pos_embed_max_size: int = 128, sample_size: int = 128, rope_axes_dim: typing.Tuple[int, int] = (256, 256) )
Parameters
- patch_size (int, defaults to 2) — The size of the patches to use in the patch embedding layer.
- in_channels (int, defaults to 16) — The number of channels in the input.
- num_layers (int, defaults to 30) — The number of layers of Transformer blocks to use.
- attention_head_dim (int, defaults to 40) — The number of channels in each head.
- num_attention_heads (int, defaults to 64) — The number of heads to use for multi-head attention.
- out_channels (int, defaults to 16) — The number of channels in the output.
- text_embed_dim (int, defaults to 4096) — Input dimension of the text embeddings from the text encoder.
- time_embed_dim (int, defaults to 512) — Output dimension of the timestep embeddings.
- condition_dim (int, defaults to 256) — The embedding dimension of the input SDXL-style resolution conditions (original_size, target_size, crop_coords).
- pos_embed_max_size (int, defaults to 128) — The maximum resolution of the positional embeddings, from which slices of shape H x W are taken and added to the input patched latents, where H and W are the latent height and width respectively. A value of 128 means that the maximum supported height and width for image generation is 128 * vae_scale_factor * patch_size => 128 * 8 * 2 => 2048.
- sample_size (int, defaults to 128) — The base resolution of the input latents. If height/width are not provided during generation, this value is used to determine the resolution as sample_size * vae_scale_factor => 128 * 8 => 1024, as shown in the example below.
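As a quick check of the resolution arithmetic in the pos_embed_max_size and sample_size descriptions, both values can be read back from the loaded model's registered config. This is a small sketch that reuses the transformer loaded in the snippet above and assumes the VAE scale factor of 8 referenced in those descriptions.

# Derive the supported image resolutions from the registered config values.
config = transformer.config
vae_scale_factor = 8  # assumed, per the parameter descriptions above

max_side = config.pos_embed_max_size * vae_scale_factor * config.patch_size  # 128 * 8 * 2 = 2048
default_side = config.sample_size * vae_scale_factor                          # 128 * 8 = 1024
print(f"maximum supported side: {max_side}px, default side: {default_side}px")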
Transformer2DModelOutput
class diffusers.models.modeling_outputs.Transformer2DModelOutput
( sample: torch.Tensor )
Parameters
- sample (torch.Tensor of shape (batch_size, num_channels, height, width), or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.
The output of Transformer2DModel.
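For a concrete feel of the container, the output behaves like other diffusers BaseOutput dataclasses: the predicted latents live in the sample attribute and can also be retrieved tuple-style. The shapes below are arbitrary placeholders, not tied to any checkpoint.

import torch
from diffusers.models.modeling_outputs import Transformer2DModelOutput

# Construct the output wrapper directly with a placeholder tensor to show the access patterns.
out = Transformer2DModelOutput(sample=torch.randn(1, 16, 64, 64))
print(out.sample.shape)         # attribute access: torch.Size([1, 16, 64, 64])
print(out.to_tuple()[0].shape)  # tuple-style access inherited from BaseOutput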