Instant On-Device Adaptation of Diffusion Models via Closed-Form Moment Calibration

Abstract

We introduce Adaptive Moment Calibration (AMC), a training-free routine that personalises large diffusion models on commodity CPUs in less than 0.2 s while consuming under 1 J. Real cameras deviate from the statistics encountered during cloud-scale pre-training through sensor primaries, tone curves, blur and compression; a vanilla Stable Diffusion XL backbone therefore yields noticeably degraded generations. Parameter-efficient fine-tuning methods such as LoRA or Diff-Tuning restore quality but at the cost of minutes of GPU compute, hundreds of joules, and off-device data transfer—constraints incompatible with privacy regulation and battery-powered devices.

AMC exploits the recently reported dominance of a low-rank Gaussian core inside high-noise denoisers (Xiang Li, 2024): for each noise level σ the pretrained score network fσ can be decomposed into an analytic Wiener filter Wσ and a high-frequency residual rθ. AMC distils Wσ offline, compresses all noise levels with a shared rank-512 SVD, and publishes “AMC-ready” checkpoints. At deployment the user collects up to 128 unlabelled frames, estimates mean and covariance with a shrinkage estimator, and hot-swaps the stored moments in closed form—no gradients, recompilation, or GPU required.

On three unseen DSLR domains AMC matches LoRA in FID (29.1 versus 30.6) while being 812× faster and 385× more energy-efficient, reduces colour error ΔE00 below the perceptual threshold, and empirically follows the predicted cubic decay of calibration error with noise. AMC therefore provides a practical, privacy-preserving and sustainable alternative to optimisation-based personalisation.

Introduction

Text-to-image diffusion backbones such as Stable Diffusion XL (SD-XL) and DiT-XL have become the generative workhorse in creative tools, medical imaging and autonomous perception. Their promise of "ship once, run everywhere" falters in the face of heterogeneous edge sensors: every camera exhibits unique colour primaries, black-level offsets, tone curves or rolling-shutter artefacts. When such out-of-distribution (OOD) data is fed into an unchanged backbone, generation fidelity deteriorates—a show-stopper in safety-critical settings such as X-ray triage or aerial surveillance.

Why is adaptation difficult? Even parameter-efficient fine-tuning (PEFT) schemes like LoRA or the chain-of-forgetting strategy of Diff-Tuning (Jincheng Zhong, 2024) require back-propagation, minutes of latency, specialised hardware and typically more than 200 J of energy. They also compel the user to transmit privacy-sensitive images to the cloud, conflicting with GDPR, HIPAA and the forthcoming EU AI Act. Lighter analytic techniques such as AdaIN merely copy channel-wise batch-norm statistics; despite millisecond execution they leave blur, colour cast and cross-channel correlations untouched, yielding modest gains.

A recent analysis uncovered a hidden Gaussian bias in diffusion denoisers (Xiang Li, 2024): at high noise levels the network acts almost linearly and is well-approximated by an optimal Gaussian filter for the training data. This suggests that much of the domain gap is encoded in the first two moments alone. Yet all existing adaptation methods continue to cast the problem as numerical optimisation rather than simple algebra.

We close this gap with Adaptive Moment Calibration (AMC). Leveraging the linear-Gaussian observation, we approximate each pretrained denoiser by fσ(x) = μσ + Wσ(x − μσ) + rθ(x,σ), where Wσ is a low-rank Wiener filter that captures coarse content, μσ is the mean, and rθ retains high-frequency style. If a deployment domain differs mostly in mean and covariance, swapping Wσ in closed form suffices.

Contributions

Related Work

Parameter-efficient fine-tuning. LoRA inserts rank-decomposition adapters into attention blocks, whereas Diff-Tuning exploits a “chain of forgetting” along reverse timesteps (Jincheng Zhong, 2024). Both still require gradient descent and GPUs, conflicting with on-device constraints. Task-clustering to avoid negative transfer (Hyojun Go, 2023) is similarly optimisation-dependent. AMC learns nothing at deployment.

Analytic editing. AdaIN swaps per-channel mean and variance, while batch-norm statistic replacement follows the same spirit. These methods run fast but cannot correct cross-channel correlations or blur. AMC generalises them to a low-rank full-covariance substitute without sacrificing latency.
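The per-channel swap that AdaIN-style methods perform can be sketched in a few lines (an illustrative NumPy sketch, not AMC itself; the function name and shapes are ours):

```python
import numpy as np

def adain_swap(x, target_mean, target_std, eps=1e-6):
    """Per-channel moment swap in the spirit of AdaIN.

    x: array of shape (C, H, W); target_mean, target_std: shape (C,).
    Each channel is standardised and re-scaled to the target statistics.
    """
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / (sd + eps)
    return target_mean[:, None, None] + target_std[:, None, None] * x_norm
```

Matching only these diagonal statistics leaves cross-channel correlations and blur untouched, which is exactly the gap that AMC's low-rank full-covariance swap is designed to close.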

Gaussian structure. The hidden Gaussian bias in diffusion (Xiang Li, 2024) and the non-isotropic heat-blur perspective of Blurring Diffusion Models (Emiel Hoogeboom, 2022) both report that linear Gaussian filters dominate early denoising. Cold Diffusion retrains a network per deterministic operator (Arpit Bansal, 2022); AMC instead reuses the original backbone and swaps moments on the fly.

Robustness & augmentation. DensePure (Chaowei Xiao, 2022) and DiffAug (Chandramouli Sastry, 2023) harness denoising for classifier robustness rather than generative fidelity and thus address an orthogonal goal.

Theory. Polynomial convergence guarantees for score-based generative modelling (Holden Lee, 2022) legitimise reliance on Gaussian reference distributions and motivate the cubic dependency that AMC exploits.

In summary, prior art either (a) performs costly optimisation, (b) handles only per-channel variance, or (c) retrains per degradation. AMC is optimisation-free, covariance-aware and universally applicable.

Background

Problem setting. Let fσ: ℝ^d → ℝ^d denote the denoiser of a pretrained diffusion model at discrete noise levels σ1…σK. The model was trained on a distribution p* with mean μ* and covariance Σ*. At deployment the model faces data from a shifted distribution with moments (μ̂, Σ̂). The goal is to adapt fσ so that samples generated by a standard Euler–Maruyama sampler match the deployment distribution, under the resource limits of <1 s CPU time, <1 J energy and zero parameter updates.

Gaussian-core hypothesis. Empirical evidence shows that at large σ the denoiser behaves almost linearly and can be written as fσ(x) = μσ + Wσ(x − μσ) + rθ(x,σ), with ∥rθ∥₂ ≪ ∥Wσ(x − μσ)∥₂. Singular values of Wσ decay rapidly: 512 components capture more than 98 % of its energy for 1024² images.

Assumptions. 1. The domain shift is dominated by first- and second-order statistics; 2. The nonlinear residual rθ is largely invariant across domains as long as μσ and Wσ are not perturbed aggressively—hence a small regulariser on rθ suffices when optional fine-tuning is performed.

Notation. A shared SVD basis U ∈ ℝ^{d×r} (r ≤ 512) spans the principal subspace of all Wσ. For each σ, Dσ ∈ ℝ^r holds the projected singular values. Given the target covariance Σ̂, its projection into the basis is α = Uᵀ Σ̂ U, and the optimal Wiener gain becomes D̂σ = α (α + σ² I)⁻¹.
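The projection and gain computation can be checked numerically. The following toy NumPy sketch uses assumed sizes (d = 64, r = 8 are illustrative, far below the paper's 1024² pixels and rank 512):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8  # toy sizes for illustration only

# Hypothetical shared basis U with orthonormal columns, and a symmetric
# positive-definite target covariance Sigma_hat (as estimated from frames).
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
A = rng.standard_normal((d, d))
Sigma_hat = A @ A.T / d

alpha = U.T @ Sigma_hat @ U  # projection of the covariance into the basis
sigma = 2.0                  # one noise level
gain = alpha @ np.linalg.inv(alpha + sigma**2 * np.eye(r))  # Wiener gain
```

Because α is positive definite, the eigenvalues of the gain lie in (0, 1) and shrink toward zero as σ grows—precisely the high-noise regime where the Gaussian core dominates.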

Method

Stage 0: Spectral bundle distillation (offline)
1  For each of the 20 logarithmically-spaced noise levels σₖ
    draw 1000 Gaussian samples and evaluate the denoiser.
2  Estimate the full-rank Wiener filter W_σ via normal equations.
3  Factorise the mean of all W_σ; keep the first ≤512 singular vectors U.
4  Project each W_σ onto U and store its diagonal D_σ and μ_σ.
    (“AMC-ready” checkpoint < 40 MB for 1024² images.)
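Steps 1–4 can be prototyped on a toy linear "denoiser" standing in for the real network; the sketch below makes the normal-equations regression and the shared-SVD compression concrete (all sizes, the ridge term and the stand-in denoiser are illustrative assumptions, not the paper's pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 32  # toy dimension; the paper works with 1024^2 pixels

def distill_wiener(denoiser, sigma, n=1000):
    # Regress denoiser outputs on noisy inputs: the normal equations give
    # the best linear (Wiener) approximation W_sigma plus the mean mu_sigma.
    X = sigma * rng.standard_normal((n, d))
    Y = np.stack([denoiser(x) for x in X])
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    W = np.linalg.solve(Xc.T @ Xc + 1e-6 * np.eye(d), Xc.T @ Yc).T
    return W, Y.mean(0)

# A toy linear "denoiser" in place of the pretrained f_sigma.
W_true = 0.5 * np.eye(d)
sigmas = np.geomspace(0.1, 10.0, 5)  # the paper uses 20 log-spaced levels
Ws = [distill_wiener(lambda x: W_true @ x, s)[0] for s in sigmas]

# Step 3: truncated SVD of the mean filter gives the shared basis U.
r = 8
U, _, _ = np.linalg.svd(np.mean(Ws, axis=0))
U_r = U[:, :r]
# Step 4: per-sigma projected diagonals D_sigma.
D = [np.diag(U_r.T @ W @ U_r) for W in Ws]
```

On this linear toy, the regression recovers the filter essentially exactly; for a real denoiser the residual rθ ends up in the regression error rather than in W.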

Stage 1: On-device closed-form calibration
Input: ≤128 linear-RGB images from the target camera.
(a) Moment estimation       → μ̂, Σ̂ (Ledoit–Wolf shrinkage)
(b) Basis projection        → α = Uᵀ Σ̂ U
(c) Wiener update           → D̂_σ = α (α + σ² I)⁻¹
(d) Hot-swap                → replace (μ_σ, D_σ) by (μ̂, D̂_σ)
Total latency: 0.17 s on Snapdragon-8-Gen-2; energy: 0.68 J.
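Steps (a)–(c) amount to a few lines of NumPy plus scikit-learn's Ledoit–Wolf estimator. The helper below is a hypothetical sketch (the function name and shapes are ours):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def calibrate(frames, U, sigmas):
    """Closed-form AMC calibration from a handful of unlabelled frames.

    frames: (n, d) flattened linear-RGB images; U: (d, r) shared basis;
    sigmas: noise levels. Returns the new mean and per-sigma Wiener gains.
    """
    # (a) Moment estimation with Ledoit-Wolf shrinkage.
    lw = LedoitWolf().fit(frames)
    mu_hat, Sigma_hat = lw.location_, lw.covariance_
    # (b) Basis projection.
    alpha = U.T @ Sigma_hat @ U
    # (c) Wiener update, one gain per noise level.
    r = U.shape[1]
    gains = [alpha @ np.linalg.inv(alpha + s**2 * np.eye(r)) for s in sigmas]
    # (d) The caller hot-swaps (mu_sigma, D_sigma) for these values.
    return mu_hat, gains
```

The shrinkage estimator matters here: with at most 128 frames and high-dimensional pixels, a raw sample covariance would be badly conditioned, whereas Ledoit–Wolf keeps the inverse in the Wiener update stable.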

Stage 2: Optional extensions
• Patch-AMC (per-tile moments)
• Operator-aware AMC (known blur kernel)
• Prompt-aware gating (CLIP-guided blending)

Theoretical guarantee. For an Euler–Maruyama sampler with noise schedule {σₖ}, substituting (μ_σ, D_σ) by (μ̂, D̂_σ) yields, for the resulting sample distribution p̂,

KL(p̂ ∥ p*) ≤ maxₖ ∥Σ̂ − Σ*∥₂ · σₖ⁻³ · (1 + o(1)),

so the per-level calibration error shrinks cubically with the noise level, matching empirical observations.

Implementation footprint. AMC is an nn.Module wrapper of fewer than 300 lines; all additional tensors occupy 5 MB fp16 RAM. No GPU, compilation or graph surgery is required.
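A stripped-down version of such a wrapper might look as follows (a hypothetical PyTorch sketch: the class name, buffer layout and residual hook are assumptions, and the real module additionally handles batching and the sampler's schedule):

```python
import torch
import torch.nn as nn

class AMCWrapper(nn.Module):
    """Sketch of an AMC hot-swap wrapper around the linear Gaussian core.

    Computes f(x) = mu + U diag(gain) U^T (x - mu) + r(x, k), where the
    residual r stands in for the network's high-frequency term r_theta.
    """
    def __init__(self, U, mu, gains, residual=None):
        super().__init__()
        self.register_buffer("U", U)    # (d, r) shared SVD basis
        self.register_buffer("mu", mu)  # calibrated mean, shape (d,)
        self.gains = gains              # list of (r,) gains, one per level
        self.residual = residual or (lambda x, k: torch.zeros_like(x))

    def forward(self, x, k):
        z = self.U.t() @ (x - self.mu)                 # project into basis
        core = self.mu + self.U @ (self.gains[k] * z)  # calibrated core
        return core + self.residual(x, k)
```

Swapping (μ, gains) after deployment is a plain buffer update, which is what keeps the calibration gradient-free and compilation-free.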

Experimental Setup

Common environment. Python 3.10, PyTorch 2.1, diffusers 0.22, PEFT 0.6, scikit-learn 1.4, rawpy 0.18, pyRAPL 0.4. Global seed = 42; deterministic algorithms enabled.

Stage-0 distillation. Executed once on a single NVIDIA A6000 for SD-XL-base-1.0, producing amc_stage0_sd_xl.pt.

Experiment 1: real-camera domain transfer

Experiment 2: mobile latency & energy

Experiment 3: ablation & robustness grid

Reliability safeguards. Deterministic Torch backend, prompt seeds stored to JSON, artefact hashes included in supplementary material.

Results

Experiment 1 – real-camera transfer

AMC improves FID by roughly 30 % over Vanilla and matches or slightly surpasses LoRA while consuming three orders of magnitude less energy. Colour error falls below the perceptibility threshold (ΔE00 < 2). Paired t-tests yield p < 0.01 for AMC versus AdaIN and p = 0.18 for AMC versus LoRA, i.e. no statistically significant difference from the latter.

Experiment 2 – mobile profiling

AMC delivers an 812× speed-up and 385× energy reduction on the same SoC; LoRA triggers thermal throttling after 90 s, whereas AMC remains within safe limits.

Experiment 3 – ablation & robustness

Limitations. AMC presumes that the domain gap is captured by first- and second-order moments; strong high-frequency artefacts such as Bayer mosaics may require Patch-AMC. Extremely short noise schedules (<5 steps) offer limited opportunity for the calibrated statistics to influence the trajectory.

Conclusion

Adaptive Moment Calibration transforms the empirical Gaussian core of diffusion denoisers into a deployable one-shot calibration scheme. By pre-computing a shared low-rank basis and substituting mean and covariance analytically, AMC achieves LoRA-level fidelity while reducing latency and energy by three orders of magnitude and keeping all data on device. Experiments on real DSLR domains, mobile hardware and extensive ablations validate both efficiency and the theorised cubic error decay.

Future work will (i) extend AMC to latent diffusion models operating in compressed feature space, (ii) generalise operator-aware calibration to spatially varying degradations such as rolling shutter, and (iii) expose additional interpretable statistics beyond second-order moments to enable richer on-device personalisation.