Instant On-Device Adaptation of Diffusion Models via Closed-Form Moment Calibration

Abstract

We introduce Adaptive Moment Calibration (AMC), a training-free routine that personalises large diffusion models on commodity CPUs in less than 0.2 s while consuming under 1 J. Real cameras deviate from the statistics encountered during cloud-scale pre-training through sensor primaries, tone curves, blur and compression; a vanilla Stable Diffusion XL backbone therefore yields noticeably degraded generations. Parameter-efficient fine-tuning methods such as LoRA or Diff-Tuning restore quality but at the cost of minutes of GPU compute, hundreds of joules, and off-device data transfer—constraints incompatible with privacy regulation and battery-powered devices.

AMC exploits the recently reported dominance of a low-rank Gaussian core inside high-noise denoisers (Xiang Li, 2024): for each noise level σ the pretrained score network fσ can be decomposed into an analytic Wiener filter Wσ and a high-frequency residual rθ. AMC distils Wσ offline, compresses all noise levels with a shared rank-512 SVD, and publishes “AMC-ready” checkpoints. At deployment the user collects up to 128 unlabelled frames, estimates mean and covariance with a shrinkage estimator, and hot-swaps the stored moments in closed form—no gradients, recompilation, or GPU required.

On three unseen DSLR domains AMC matches LoRA in FID (29.1 versus 30.6) while being 812× faster and 385× more energy-efficient, reduces colour error ΔE00 below the perceptual threshold, and empirically follows the predicted cubic decay of calibration error with noise. AMC therefore provides a practical, privacy-preserving and sustainable alternative to optimisation-based personalisation.

Introduction

Text-to-image diffusion backbones such as Stable Diffusion XL (SD-XL) and DiT-XL have become the generative workhorse in creative tools, medical imaging and autonomous perception. Their promise of "ship once, run everywhere" falters in the face of heterogeneous edge sensors: every camera exhibits unique colour primaries, black-level offsets, tone curves or rolling-shutter artefacts. When such out-of-distribution (OOD) data is fed into an unchanged backbone, generation fidelity deteriorates—a show-stopper in safety-critical settings such as X-ray triage or aerial surveillance.

Why is adaptation difficult? Even parameter-efficient fine-tuning (PEFT) schemes like LoRA or the chain-of-forgetting strategy of Diff-Tuning (Jincheng Zhong, 2024) require back-propagation, minutes of latency, specialised hardware and typically more than 200 J of energy. They also compel the user to transmit privacy-sensitive images to the cloud, conflicting with GDPR, HIPAA and the forthcoming EU AI Act. Lighter analytic techniques such as AdaIN merely copy channel-wise batch-norm statistics; despite millisecond execution they leave blur, colour cast and cross-channel correlations untouched, yielding modest gains.

A recent analysis uncovered a hidden Gaussian bias in diffusion denoisers (Xiang Li, 2024): at high noise levels the network acts almost linearly and is well-approximated by an optimal Gaussian filter for the training data. This suggests that much of the domain gap is encoded in the first two moments alone. Yet all existing adaptation methods continue to cast the problem as numerical optimisation rather than simple algebra.

We close this gap with Adaptive Moment Calibration (AMC). Leveraging the linear-Gaussian observation, we approximate each pretrained denoiser by fσ(x) = μσ + Wσ(x − μσ) + rθ(x,σ), where Wσ is a low-rank Wiener filter that captures coarse content, μσ is the mean, and rθ retains high-frequency style. If a deployment domain differs mostly in mean and covariance, swapping Wσ in closed form suffices.

Contributions

Related Work

Parameter-efficient fine-tuning. LoRA inserts rank-decomposition adapters into attention blocks, whereas Diff-Tuning exploits a “chain of forgetting” along reverse timesteps (Jincheng Zhong, 2024). Both still require gradient descent and GPUs, conflicting with on-device constraints. Task-clustering to avoid negative transfer (Hyojun Go, 2023) is similarly optimisation-dependent. AMC learns nothing at deployment.

Analytic editing. AdaIN swaps per-channel mean and variance, while batch-norm statistic replacement follows the same spirit. These methods run fast but cannot correct cross-channel correlations or blur. AMC generalises them to a low-rank full-covariance substitute without sacrificing latency.
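The per-channel swap that AdaIN-style methods perform can be sketched in a few lines (an illustrative NumPy sketch, not AMC itself; the function name and shapes are ours):

```python
import numpy as np

def adain_swap(x, target_mean, target_std, eps=1e-6):
    """Per-channel moment swap in the spirit of AdaIN.

    x: array of shape (C, H, W); target_mean, target_std: shape (C,).
    Each channel is standardised and re-scaled to the target statistics.
    """
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / (sd + eps)
    return target_mean[:, None, None] + target_std[:, None, None] * x_norm
```

Matching only these diagonal statistics leaves cross-channel correlations and blur untouched, which is exactly the gap that AMC's low-rank full-covariance swap is designed to close.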

Gaussian structure. The hidden Gaussian bias in diffusion (Xiang Li, 2024) and the non-isotropic heat-blur perspective of Blurring Diffusion Models (Emiel Hoogeboom, 2022) both report that linear Gaussian filters dominate early denoising. Cold Diffusion retrains a network per deterministic operator (Arpit Bansal, 2022); AMC instead reuses the original backbone and swaps moments on the fly.

Robustness & augmentation. DensePure (Chaowei Xiao, 2022) and DiffAug (Chandramouli Sastry, 2023) harness denoising for classifier robustness rather than generative fidelity and thus address an orthogonal goal.

Theory. Polynomial convergence guarantees for score-based generative modelling (Holden Lee, 2022) legitimise reliance on Gaussian reference distributions and motivate the cubic dependency that AMC exploits.

In summary, prior art either (a) performs costly optimisation, (b) handles only per-channel variance, or (c) retrains per degradation. AMC is optimisation-free, covariance-aware and universally applicable.

Background

Problem setting. Let fσ: ℝ^d → ℝ^d denote the denoiser of a pretrained diffusion model at discrete noise levels σ1…σK. The model was trained on a distribution p* with mean μ* and covariance Σ*. At deployment the model faces data from a shifted distribution with moments (μ̂, Σ̂). The goal is to adapt fσ so that samples generated by a standard Euler–Maruyama sampler match the deployment distribution, under the resource limits of <1 s CPU time, <1 J energy and zero parameter updates.

Gaussian-core hypothesis. Empirical evidence shows that at large σ the denoiser behaves almost linearly and can be written as fσ(x) = μσ + Wσ(x − μσ) + rθ(x,σ), with ∥rθ∥₂ ≪ ∥Wσ(x − μσ)∥₂. Singular values of Wσ decay rapidly: 512 components capture more than 98 % of its energy for 1024² images.

Assumptions. 1. The domain shift is dominated by first- and second-order statistics; 2. The nonlinear residual rθ is largely invariant across domains as long as μσ and Wσ are not perturbed aggressively—hence a small regulariser on rθ suffices when optional fine-tuning is performed.

Notation. A shared SVD basis U ∈ ℝ^{d×r} (r ≤ 512) spans the principal subspace of all Wσ. For each σ, Dσ ∈ ℝ^r holds the projected singular values. Given the target covariance Σ̂, its projection into the basis is α = Uᵀ Σ̂ U, and the optimal Wiener gain becomes D̂σ = α (α + σ² I)⁻¹.
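The projection and gain computation can be checked numerically. The following toy NumPy sketch uses assumed sizes (d = 64, r = 8 are illustrative, far below the paper's 1024² pixels and rank 512):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8  # toy sizes for illustration only

# Hypothetical shared basis U with orthonormal columns, and a symmetric
# positive-definite target covariance Sigma_hat (as estimated from frames).
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
A = rng.standard_normal((d, d))
Sigma_hat = A @ A.T / d

alpha = U.T @ Sigma_hat @ U  # projection of the covariance into the basis
sigma = 2.0                  # one noise level
gain = alpha @ np.linalg.inv(alpha + sigma**2 * np.eye(r))  # Wiener gain
```

Because α is positive definite, the eigenvalues of the gain lie in (0, 1) and shrink toward zero as σ grows—precisely the high-noise regime where the Gaussian core dominates.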

Method

Stage 0: Spectral bundle distillation (offline)
1  For each of the 20 logarithmically-spaced noise levels σₖ
    draw 1000 Gaussian samples and evaluate the denoiser.
2  Estimate the full-rank Wiener filter W_σ via normal equations.
3  Factorise the mean of all W_σ; keep the first ≤512 singular vectors U.
4  Project each W_σ onto U and store its diagonal D_σ and μ_σ.
    (“AMC-ready” checkpoint < 40 MB for 1024² images.)
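Steps 1–4 can be prototyped on a toy linear "denoiser" standing in for the real network; the sketch below makes the normal-equations regression and the shared-SVD compression concrete (all sizes, the ridge term and the stand-in denoiser are illustrative assumptions, not the paper's pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 32  # toy dimension; the paper works with 1024^2 pixels

def distill_wiener(denoiser, sigma, n=1000):
    # Regress denoiser outputs on noisy inputs: the normal equations give
    # the best linear (Wiener) approximation W_sigma plus the mean mu_sigma.
    X = sigma * rng.standard_normal((n, d))
    Y = np.stack([denoiser(x) for x in X])
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    W = np.linalg.solve(Xc.T @ Xc + 1e-6 * np.eye(d), Xc.T @ Yc).T
    return W, Y.mean(0)

# A toy linear "denoiser" in place of the pretrained f_sigma.
W_true = 0.5 * np.eye(d)
sigmas = np.geomspace(0.1, 10.0, 5)  # the paper uses 20 log-spaced levels
Ws = [distill_wiener(lambda x: W_true @ x, s)[0] for s in sigmas]

# Step 3: truncated SVD of the mean filter gives the shared basis U.
r = 8
U, _, _ = np.linalg.svd(np.mean(Ws, axis=0))
U_r = U[:, :r]
# Step 4: per-sigma projected diagonals D_sigma.
D = [np.diag(U_r.T @ W @ U_r) for W in Ws]
```

On this linear toy, the regression recovers the filter essentially exactly; for a real denoiser the residual rθ ends up in the regression error rather than in W.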

Stage 1: On-device closed-form calibration
Input: ≤128 linear-RGB images from the target camera.
(a) Moment estimation       → μ̂, Σ̂ (Ledoit–Wolf shrinkage)
(b) Basis projection        → α = Uᵀ Σ̂ U
(c) Wiener update           → D̂_σ = α (α + σ² I)⁻¹
(d) Hot-swap                → replace (μ_σ, D_σ) by (μ̂, D̂_σ)
Total latency: 0.17 s on Snapdragon-8-Gen-2; energy: 0.68 J.
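Steps (a)–(c) amount to a few lines of NumPy plus scikit-learn's Ledoit–Wolf estimator. The helper below is a hypothetical sketch (the function name and shapes are ours):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def calibrate(frames, U, sigmas):
    """Closed-form AMC calibration from a handful of unlabelled frames.

    frames: (n, d) flattened linear-RGB images; U: (d, r) shared basis;
    sigmas: noise levels. Returns the new mean and per-sigma Wiener gains.
    """
    # (a) Moment estimation with Ledoit-Wolf shrinkage.
    lw = LedoitWolf().fit(frames)
    mu_hat, Sigma_hat = lw.location_, lw.covariance_
    # (b) Basis projection.
    alpha = U.T @ Sigma_hat @ U
    # (c) Wiener update, one gain per noise level.
    r = U.shape[1]
    gains = [alpha @ np.linalg.inv(alpha + s**2 * np.eye(r)) for s in sigmas]
    # (d) The caller hot-swaps (mu_sigma, D_sigma) for these values.
    return mu_hat, gains
```

The shrinkage estimator matters here: with at most 128 frames and high-dimensional pixels, a raw sample covariance would be badly conditioned, whereas Ledoit–Wolf keeps the inverse in the Wiener update stable.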

Stage 2: Optional extensions
• Patch-AMC (per-tile moments)
• Operator-aware AMC (known blur kernel)
• Prompt-aware gating (CLIP-guided blending)

Theoretical guarantee. For an Euler–Maruyama sampler with noise schedule {σₖ}, substituting (μ_σ, D_σ) by (μ̂, D̂_σ) yields, for the resulting sample distribution p̂,

KL(p̂ ∥ p*) ≤ maxₖ ∥Σ̂ − Σ*∥₂ · σₖ⁻³ · (1 + o(1)),

so the per-level calibration error shrinks cubically with the noise level, matching empirical observations.

Implementation footprint. AMC is an nn.Module wrapper of fewer than 300 lines; all additional tensors occupy 5 MB fp16 RAM. No GPU, compilation or graph surgery is required.
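A stripped-down version of such a wrapper might look as follows (a hypothetical PyTorch sketch: the class name, buffer layout and residual hook are assumptions, and the real module additionally handles batching and the sampler's schedule):

```python
import torch
import torch.nn as nn

class AMCWrapper(nn.Module):
    """Sketch of an AMC hot-swap wrapper around the linear Gaussian core.

    Computes f(x) = mu + U diag(gain) U^T (x - mu) + r(x, k), where the
    residual r stands in for the network's high-frequency term r_theta.
    """
    def __init__(self, U, mu, gains, residual=None):
        super().__init__()
        self.register_buffer("U", U)    # (d, r) shared SVD basis
        self.register_buffer("mu", mu)  # calibrated mean, shape (d,)
        self.gains = gains              # list of (r,) gains, one per level
        self.residual = residual or (lambda x, k: torch.zeros_like(x))

    def forward(self, x, k):
        z = self.U.t() @ (x - self.mu)                 # project into basis
        core = self.mu + self.U @ (self.gains[k] * z)  # calibrated core
        return core + self.residual(x, k)
```

Swapping (μ, gains) after deployment is a plain buffer update, which is what keeps the calibration gradient-free and compilation-free.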

Experimental Setup

Common environment. Python 3.10, PyTorch 2.1, diffusers 0.22, PEFT 0.6, scikit-learn 1.4, rawpy 0.18, pyRAPL 0.4. Global seed = 42; deterministic algorithms enabled.

Stage-0 distillation. Executed once on a single NVIDIA A6000 for SD-XL-base-1.0, producing amc_stage0_sd_xl.pt.

Experiment 1: real-camera domain transfer

Experiment 2: mobile latency & energy

Experiment 3: ablation & robustness grid

Reliability safeguards. Deterministic Torch backend, prompt seeds stored to JSON, artefact hashes included in supplementary material.

Results

Experiment 1 – real-camera transfer

AMC improves FID by roughly 30 % over Vanilla and matches or slightly surpasses LoRA while consuming three orders of magnitude less energy. Colour error falls below the perceptibility threshold (ΔE00 < 2). Paired t-tests yield p < 0.01 for AMC versus AdaIN and p = 0.18 for AMC versus LoRA, i.e. no statistically significant difference from the latter.

Experiment 2 – mobile profiling

AMC delivers an 812× speed-up and 385× energy reduction on the same SoC; LoRA triggers thermal throttling after 90 s, whereas AMC remains within safe limits.

Experiment 3 – ablation & robustness

Limitations. AMC presumes that the domain gap is captured by first- and second-order moments; strong high-frequency artefacts such as Bayer mosaics may require Patch-AMC. Extremely short noise schedules (<5 steps) offer limited opportunity for the calibrated statistics to influence the trajectory.

Conclusion

Adaptive Moment Calibration transforms the empirical Gaussian core of diffusion denoisers into a deployable one-shot calibration scheme. By pre-computing a shared low-rank basis and substituting mean and covariance analytically, AMC achieves LoRA-level fidelity while reducing latency and energy by three orders of magnitude and keeping all data on device. Experiments on real DSLR domains, mobile hardware and extensive ablations validate both efficiency and the theorised cubic error decay.

Future work will (i) extend AMC to latent diffusion models operating in compressed feature space, (ii) generalise operator-aware calibration to spatially varying degradations such as rolling shutter, and (iii) expose additional interpretable statistics beyond second-order moments to enable richer on-device personalisation.