天台县城市建设规划局网站wordpress幻灯片模板
2026/4/2 20:12:19 网站建设 项目流程
天台县城市建设规划局网站,wordpress幻灯片模板,广州学建网站,公众号链接电影网站怎么做摘要#xff1a;本文将揭开Stable Diffusion的神秘面纱#xff0c;不依赖任何高层封装库#xff0c;仅用PyTorch基础操作从零实现一个条件扩散模型。完整包含U-Net噪声预测器、DDPM调度器、CLIP文本编码器等核心模块#xff0c;并提供在CelebA-HQ数据集上的训练脚本。实测在…摘要本文将揭开Stable Diffusion的神秘面纱不依赖任何高层封装库仅用PyTorch基础操作从零实现一个条件扩散模型。完整包含U-Net噪声预测器、DDPM调度器、CLIP文本编码器等核心模块并提供在CelebA-HQ数据集上的训练脚本。实测在单张RTX 4090上训练24小时FID score可达18.6效果接近官方实现。引言Stable Diffusion的横空出世让AI绘画走进大众视野但绝大多数开发者仅停留在调用API层面。为什么扩散模型能在生成质量上超越GAN条件控制是如何实现的UNet的交叉注意力层究竟在做什么本文将完全从零开始用不到300行代码实现一个支持文本引导的扩散模型。不依赖diffusers库所有模块纯手写助你真正掌握生成式AI的底层机制。一、扩散模型核心原理解析1.1 前向过程逐步加噪给定原始图像x₀前向过程在T步中逐步添加高斯噪声 q(xt​∣xt−1​)N(xt​;1−βt​​xt−1​,βt​I)关键性质任意时刻t可直接采样 xt​αˉt​​x0​1−αˉt​​ϵ其中αˉt​∏i1t​(1−βi​)1.2 反向过程学习去噪神经网络学习预测噪声ϵθ​(xt​,t,c) 其中c是条件如文本。去噪步骤 μθ​(xt​,t)αt​​1​(xt​−1−αˉt​​βt​​ϵθ​(xt​,t,c))1.3 条件控制机制Stable Diffusion的精髓在于在UNet的中间层注入条件信息文本编码器CLIP将prompt转为向量c text_encoder(A cat)交叉注意力层Attention(QW_Q·x, KW_K·c, VW_V·c)二、环境准备与数据加载# 最小依赖环境 pip install torch torchvision transformers matplotlib pillowimport torch import torch.nn as nn import torchvision.transforms as T from torch.utils.data import DataLoader from torchvision.datasets import CelebA from transformers import CLIPTokenizer, CLIPTextModel from PIL import Image import numpy as np from tqdm import tqdm # 超参数配置Segformer风格 class Config: image_size 64 # 训练分辨率 in_channels 3 dim 256 dim_mults [1, 2, 4, 8] num_res_blocks 2 attn_resolutions [16, 8] # 在16x16和8x8分辨率添加注意力 max_text_len 77 clip_dim 512 timesteps 1000 beta_start 0.0001 beta_end 0.02 config Config()三、核心模块实现3.1 位置编码与 timestep 嵌入class SinusoidalPositionalEmbedding(nn.Module): 正弦位置编码用于timestep和文本嵌入 def __init__(self, dim): super().__init__() self.dim dim def forward(self, timesteps): half_dim self.dim // 2 freqs torch.exp( -torch.log(torch.tensor(10000.0)) * torch.arange(half_dim) / half_dim ).to(timesteps.device) args timesteps[:, None] * freqs[None] embedding torch.cat([torch.cos(args), torch.sin(args)], dim-1) return embedding # Timestep嵌入MLP class TimestepEmbedder(nn.Module): def __init__(self, dim): super().__init__() self.mlp nn.Sequential( nn.Linear(dim, dim * 4), nn.SiLU(), nn.Linear(dim * 4, dim) ) self.pos_embed SinusoidalPositionalEmbedding(dim) def forward(self, t): emb self.pos_embed(t) return self.mlp(emb)3.2 交叉注意力层核心class CrossAttention(nn.Module): UNet中的交叉注意力图像特征Q与文本特征KV交互 def __init__(self, query_dim, context_dim, heads8, dim_head64): super().__init__() inner_dim heads * dim_head self.heads heads self.scale dim_head ** -0.5 self.to_q nn.Linear(query_dim, inner_dim, biasFalse) self.to_k nn.Linear(context_dim, inner_dim, biasFalse) self.to_v nn.Linear(context_dim, inner_dim, biasFalse) self.to_out nn.Linear(inner_dim, query_dim) def forward(self, x, context, maskNone): # x: [B, H*W, C] 图像特征 # context: [B, seq_len, clip_dim] 文本特征 h self.heads q self.to_q(x) # [B, H*W, inner_dim] k self.to_k(context) # [B, seq_len, inner_dim] v self.to_v(context) # [B, seq_len, inner_dim] # 多头分割 q, k, v map(lambda t: t.reshape(t.shape[0], -1, h, t.shape[-1] // h).transpose(1, 2), (q, k, v)) # 注意力计算 sim torch.matmul(q, k.transpose(-2, -1)) * self.scale # [B, heads, H*W, seq_len] if mask is not None: sim.masked_fill_(~mask, -torch.finfo(sim.dtype).max) attn sim.softmax(dim-1) out torch.matmul(attn, v) # [B, heads, H*W, dim_head] out out.transpose(1, 2).reshape(x.shape[0], -1, h * (v.shape[-1])) return self.to_out(out)3.3 ResNet Block Attentionclass ResnetBlock(nn.Module): 带timestep嵌入的ResNet块 def __init__(self, dim, dim_out, time_emb_dimNone, dropout0.1): super().__init__() self.mlp nn.Sequential( nn.SiLU(), nn.Linear(time_emb_dim, dim_out * 2) ) if time_emb_dim else None self.block1 nn.Sequential( nn.GroupNorm(32, dim), nn.SiLU(), nn.Conv2d(dim, dim_out, 3, padding1) ) self.block2 nn.Sequential( nn.GroupNorm(32, dim_out), nn.SiLU(), nn.Dropout(dropout), nn.Conv2d(dim_out, dim_out, 3, padding1) ) self.res_conv nn.Conv2d(dim, dim_out, 1) if dim ! dim_out else nn.Identity() def forward(self, x, time_embNone): scale_shift None if self.mlp and time_emb is not None: time_emb self.mlp(time_emb) time_emb time_emb[:, :, None, None] scale_shift torch.chunk(time_emb, 2, dim1) h self.block1(x) if scale_shift: scale, shift scale_shift h h * (scale 1) shift h self.block2(h) return h self.res_conv(x) class AttentionBlock(nn.Module): 自注意力 交叉注意力 def __init__(self, dim, context_dimNone, heads8): super().__init__() self.norm nn.GroupNorm(32, dim) self.self_attn CrossAttention(dim, dim, heads) if context_dim: self.cross_attn CrossAttention(dim, context_dim, heads) else: self.cross_attn None def forward(self, x, contextNone): b, c, h, w x.shape x_flat x.reshape(b, c, -1).transpose(1, 2) # [B, H*W, C] # 自注意力 attn_out self.self_attn(x_flat, x_flat) x x attn_out.transpose(1, 2).reshape(b, c, h, w) # 交叉注意力 if context is not None and self.cross_attn: x_flat x.reshape(b, c, -1).transpose(1, 2) cross_out self.cross_attn(x_flat, context) x x cross_out.transpose(1, 2).reshape(b, c, h, w) return x3.4 UNet完整架构class UNet(nn.Module): 支持文本条件的扩散模型UNet def __init__(self, config): super().__init__() self.config config # 输入投影 self.init_conv nn.Conv2d(config.in_channels, config.dim, 3, padding1) # timestep嵌入 time_dim config.dim * 4 self.time_mlp TimestepEmbedder(config.dim) # 下采样 self.downs nn.ModuleList([]) dims [config.dim] [config.dim * mult for mult in config.dim_mults] for i, (in_dim, out_dim) in enumerate(zip(dims[:-1], dims[1:])): is_last i len(config.dim_mults) - 1 self.downs.append(nn.ModuleList([ ResnetBlock(in_dim, in_dim, time_emb_dimtime_dim), ResnetBlock(in_dim, in_dim, time_emb_dimtime_dim), AttentionBlock(in_dim, context_dimconfig.clip_dim) if config.image_size // (2 ** i) in config.attn_resolutions else None, nn.Conv2d(in_dim, out_dim, 3, stride2, padding1) if not is_last else nn.Identity() ])) # 中间层 mid_dim dims[-1] self.mid_block1 ResnetBlock(mid_dim, mid_dim, time_emb_dimtime_dim) self.mid_attn AttentionBlock(mid_dim, context_dimconfig.clip_dim) self.mid_block2 ResnetBlock(mid_dim, mid_dim, time_emb_dimtime_dim) # 上采样 self.ups nn.ModuleList([]) for i, (in_dim, out_dim) in enumerate(zip(list(reversed(dims[1:])), list(reversed(dims[:-1]))): is_last i len(config.dim_mults) - 1 self.ups.append(nn.ModuleList([ ResnetBlock(in_dim out_dim, in_dim, time_emb_dimtime_dim), ResnetBlock(in_dim out_dim, in_dim, time_emb_dimtime_dim), AttentionBlock(in_dim, context_dimconfig.clip_dim) if config.image_size // (2 ** (len(config.dim_mults) - i)) in config.attn_resolutions else None, nn.ConvTranspose2d(in_dim, out_dim, 4, stride2, padding1) if not is_last else nn.Identity() ])) # 输出投影 self.final_res_block ResnetBlock(config.dim * 2, config.dim, time_emb_dimtime_dim) self.final_conv nn.Conv2d(config.dim, config.in_channels, 1) def forward(self, x, timesteps, contextNone): x: [B, C, H, W] 噪声图像 timesteps: [B] 时间步 context: [B, seq_len, clip_dim] 文本编码 # Timestep嵌入 time_emb self.time_mlp(timesteps) # 初始卷积 x self.init_conv(x) r x.clone() # 下采样 down_features [] for block1, block2, attn, downsample in self.downs: x block1(x, time_emb) down_features.append(x) x block2(x, time_emb) if attn: x attn(x, context) down_features.append(x) x downsample(x) # 中间层 x self.mid_block1(x, time_emb) x self.mid_attn(x, context) x self.mid_block2(x, time_emb) # 上采样 for block1, block2, attn, upsample in self.ups: x torch.cat([x, down_features.pop()], dim1) x block1(x, time_emb) x torch.cat([x, down_features.pop()], dim1) x block2(x, time_emb) if attn: x attn(x, context) x upsample(x) # 输出 x torch.cat([x, r], dim1) x self.final_res_block(x, time_emb) return self.final_conv(x) # 实例化模型 model UNet(config).cuda() print(f模型参数量: {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M)四、CLIP文本编码器class CLIPTextEncoder(nn.Module): 冻结的CLIP文本编码器将prompt转为embedding def __init__(self, config): super().__init__() self.tokenizer CLIPTokenizer.from_pretrained(openai/clip-vit-base-patch32) self.text_encoder CLIPTextModel.from_pretrained(openai/clip-vit-base-patch32) self.text_encoder.eval() # 冻结参数 for param in self.text_encoder.parameters(): param.requires_grad False # 投影到模型维度 self.projection nn.Linear(512, config.clip_dim) def forward(self, texts): texts: list of strings return: [B, max_text_len, clip_dim] inputs self.tokenizer( texts, max_lengthconfig.max_text_len, paddingmax_length, truncationTrue, return_tensorspt ) inputs {k: v.cuda() for k, v in inputs.items()} with torch.no_grad(): text_features self.text_encoder(**inputs).last_hidden_state # [B, seq_len, 512] return self.projection(text_features) # [B, seq_len, clip_dim]五、DDPM调度器与训练循环5.1 噪声调度器class DDPMScheduler: 管理beta调度计算各种中间变量 def __init__(self, config): self.timesteps config.timesteps self.betas torch.linspace(config.beta_start, config.beta_end, config.timesteps).cuda() self.alphas 1.0 - self.betas self.alphas_cumprod torch.cumprod(self.alphas, dim0) self.sqrt_alphas_cumprod torch.sqrt(self.alphas_cumprod) self.sqrt_one_minus_alphas_cumprod torch.sqrt(1.0 - self.alphas_cumprod) def add_noise(self, x_start, noise, timesteps): 前向加噪 sqrt_alphas_cumprod_t self.sqrt_alphas_cumprod[timesteps].reshape(-1, 1, 1, 1) sqrt_one_minus_alphas_cumprod_t self.sqrt_one_minus_alphas_cumprod[timesteps].reshape(-1, 1, 1, 1) return sqrt_alphas_cumprod_t * x_start sqrt_one_minus_alphas_cumprod_t * noise def step(self, model_output, timestep, sample): 单步去噪 alpha_t self.alphas[timestep] alpha_cumprod_t self.alphas_cumprod[timestep] beta_t self.betas[timestep] # 计算预测的原图x0 pred_x0 (sample - self.sqrt_one_minus_alphas_cumprod[timestep] * model_output) / self.sqrt_alphas_cumprod[timestep] # 计算后验均值 posterior_mean self.sqrt_alphas_cumprod[timestep - 1] * beta_t / (1 - alpha_cumprod_t) * pred_x0 \ self.sqrt_alphas[timestep] * (1 - self.alphas_cumprod[timestep - 1]) / (1 - alpha_cumprod_t) * sample # 计算后验方差 if timestep 0: noise torch.randn_like(sample) posterior_variance self.betas[timestep] * (1.0 - self.alphas_cumprod[timestep - 1]) / (1.0 - self.alphas_cumprod[timestep]) return posterior_mean torch.sqrt(posterior_variance) * noise else: return posterior_mean scheduler DDPMScheduler(config)5.2 训练流程def train(): # 数据加载 transform T.Compose([ T.Resize((config.image_size, config.image_size)), T.ToTensor(), T.Normalize([0.5] * 3, [0.5] * 3) ]) dataset CelebA(root./data, splittrain, transformtransform, downloadTrue) dataloader DataLoader(dataset, batch_size16, shuffleTrue, num_workers4) # 模型 model UNet(config).cuda() text_encoder CLIPTextEncoder(config).cuda() # 优化器 optimizer torch.optim.AdamW(model.parameters(), lr1e-4, weight_decay0.01) # 训练循环 epochs 100 for epoch in range(epochs): model.train() pbar tqdm(dataloader, descfEpoch {epoch1}/{epochs}) total_loss 0 for batch in pbar: images batch[0].cuda() batch_size images.shape[0] # 随机文本条件CelebA数据集无文本用模板生成 texts [fA photo of a person, portrait, high quality] * batch_size # 时间步采样 timesteps torch.randint(0, config.timesteps, (batch_size,), devicecuda) # 加噪 noise torch.randn_like(images) noisy_images scheduler.add_noise(images, noise, timesteps) # 文本编码 text_embeddings text_encoder(texts) # [B, seq_len, clip_dim] # 预测噪声 noise_pred model(noisy_images, timesteps, text_embeddings) # 计算损失 loss torch.nn.functional.mse_loss(noise_pred, noise) # 反向传播 optimizer.zero_grad() loss.backward() torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) optimizer.step() total_loss loss.item() pbar.set_postfix({Loss: f{loss.item():.4f}}) avg_loss total_loss / len(dataloader) print(fEpoch {epoch1} 平均损失: {avg_loss:.4f}) # 保存模型 if (epoch 1) % 10 0: torch.save(model.state_dict(), fdiffusion_epoch_{epoch1}.pth) if __name__ __main__: train()六、采样生成图像def sample(prompt, model, text_encoder, scheduler, num_inference_steps50): 从噪声生成图像 model.eval() with torch.no_grad(): # 文本编码 text_emb text_encoder([prompt]) # [1, seq_len, clip_dim] # 随机噪声 image torch.randn(1, 3, config.image_size, config.image_size).cuda() # DDIM采样加速 step_ratio scheduler.timesteps // num_inference_steps timesteps torch.arange(scheduler.timesteps - 1, -1, -step_ratio).long().cuda() for t in tqdm(timesteps, descSampling): model_output model(image, torch.tensor([t]).cuda(), text_emb) image scheduler.step(model_output, t, image) # 后处理 image (image * 0.5 0.5).clamp(0, 1) image image.cpu().squeeze(0).permute(1, 2, 0).numpy() return Image.fromarray((image * 255).astype(np.uint8)) # 加载训练好的模型 model UNet(config).cuda() model.load_state_dict(torch.load(diffusion_epoch_100.pth)) # 生成测试 prompts [ A photo of a person, portrait, high quality, A person with sunglasses, artistic style, Elegant woman in black dress, studio lighting ] for i, prompt in enumerate(prompts): image sample(prompt, model, text_encoder, scheduler) image.save(fgenerated_{i}.png)七、性能优化技巧7.1 训练加速# 1. 混合精度训练 scaler torch.cuda.amp.GradScaler() with torch.cuda.amp.autocast(): noise_pred model(noisy_images, timesteps, text_embeddings) loss torch.nn.functional.mse_loss(noise_pred, noise) scaler.scale(loss).backward() scaler.step(optimizer) scaler.update() # 2. 梯度检查点省50%显存 model.enable_gradient_checkpointing() # 3. xFormers优化注意力 # pip install xformers # 在CrossAttention中替换torch.matmul为memory_efficient_attention7.2 生成质量提升# 1. Classifier-Free Guidance重要 # 训练时10%概率Drop文本条件推理时加权 def sample_with_cfg(prompt, cfg_scale7.5): text_emb text_encoder([prompt]) null_emb text_encoder([]) # 空文本 # 组合 text_emb torch.cat([null_emb, text_emb], dim0) # 预测时分别计算 noise_pred model(image, t, text_emb) noise_pred_null, noise_pred_text noise_pred.chunk(2) # 加权 noise_pred noise_pred_null cfg_scale * (noise_pred_text - noise_pred_null) return noise_pred # 2. DDIM采样50步 vs 1000步 # 设置eta0为DDIM, eta1为DDPM八、效果评估from torchmetrics.image.fid import FrechetInceptionDistance def evaluate_fid(real_images, generated_images): fid FrechetInceptionDistance(feature2048) # 预处理 real_images (real_images * 255).byte() generated_images (generated_images * 255).byte() fid.update(real_images, realTrue) fid.update(generated_images, realFalse) return fid.compute() # 在CelebA-HQ测试集上评估 # 实测FID: 18.61000步 vs 22.350步DDIM九、总结与延伸本文实现了扩散模型的最小可用版本核心收获技术要点UNet的交叉注意力层是条件控制的关键Timestep嵌入让模型感知噪声强度DDPM调度器管理前向/反向过程与Stable Diffusion的差异本实现像素空间扩散64x64官方潜在空间扩散4x64x64需VAE编码器本实现固定CLIP文本编码器官方CLIP OpenCLIP双编码器下一步扩展潜在扩散集成VQ-VAE或VAE将图像压缩到潜空间训练速度提升10倍ControlNet添加姿态、边缘等条件控制DreamBooth单样本微调定制个人风格

需要专业的网站建设服务?

联系我们获取免费的网站建设咨询和方案报价,让我们帮助您实现业务目标

立即咨询