Efficient and scalable non-parametric or semi-parametric regression analysis and density estimation are of crucial importance to the fields of statistics and machine learning. However, available methods are limited in their ability to handle large-scale data. We address this issue by developing a novel coreset construction for multivariate conditional transformation models (MCTMs) to enhance their scalability and training efficiency. To the best of our knowledge, these are the first coresets for semi-parametric distributional models. Our approach yields substantial data reduction via importance sampling. It ensures with high probability that the log-likelihood remains within multiplicative error bounds of (1 ± ε) and thereby maintains statistical model accuracy. Compared to conventional fully parametric models, for which coresets have been developed before, our semi-parametric approach exhibits enhanced adaptability, particularly in scenarios where complex distributions and non-linear relationships are present but not fully understood. To address the numerical problems associated with the unbounded logarithmic terms, we employ a geometric approximation based on the convex hull of the input data. This ensures feasible, stable, and accurate inference in scenarios involving large amounts of data. Numerical experiments demonstrate substantially improved computational efficiency when handling large and complex datasets, thus laying the foundation for a broad range of applications within the statistics and machine learning communities.
We introduce the first coreset construction framework for highly flexible Multivariate Conditional Transformation Models (MCTMs), bridging the gap between semi-parametric modeling for multivariate distributions and efficient data reduction techniques.
Standard sensitivity frameworks fail on the unstable, unbounded logarithmic terms in the MCTM loss. We solve this by geometrically approximating the convex hull of transformation derivatives to avoid extreme directions, paired with ℓ₂ leverage scores for the quadratic part.
We demonstrate significant computational speedups on complex, large-scale datasets (Covertype, Equity Returns), scaling high-dimensional joint distribution models while rigorously maintaining (1 ± ε) log-likelihood approximation guarantees.
The core challenge of applying coresets to MCTMs is that the negative log-likelihood contains both tractable quadratic terms and highly unstable logarithmic terms with asymptotes at zero. Standard sensitivity sampling fails for the latter because the sensitivities are unbounded. Our hybrid approach separates these concerns: we use ℓ₂ leverage scores to bound the quadratic part, and a convex hull approximation on the derivative space to prevent the optimizer from exploring infeasible extreme directions.
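The hybrid sampling idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `l2_hull_coreset` and the sampling details are illustrative assumptions, showing only the two ingredients described above, ℓ₂ leverage scores for the quadratic part and convex-hull vertices of the feature rows as a stand-in for the extreme directions that the logarithmic terms are sensitive to.

```python
# A minimal sketch (hypothetical, not the paper's implementation) of hybrid
# "l2-Hull" sampling: leverage scores bound the sensitivities of the
# quadratic loss terms, while convex-hull vertices are retained to guard
# against the unbounded logarithmic terms at extreme directions.
import numpy as np
from scipy.spatial import ConvexHull

def l2_hull_coreset(A, m, rng=None):
    """Sample a weighted coreset from an n x d feature matrix A.

    Hull vertices are always kept (weight 1); the remaining m points are
    drawn with probabilities proportional to their l2 leverage scores and
    reweighted to keep the sampled loss unbiased.
    """
    rng = np.random.default_rng(rng)
    n, d = A.shape
    # l2 leverage scores via a thin QR decomposition: squared row norms of Q.
    Q, _ = np.linalg.qr(A)
    lev = np.sum(Q**2, axis=1)            # sums to rank(A)
    # Vertices of the convex hull of the rows: the extreme points.
    hull_idx = np.unique(ConvexHull(A).vertices)
    p = lev / lev.sum()
    sample_idx = rng.choice(n, size=m, replace=True, p=p)
    weights = 1.0 / (m * p[sample_idx])   # importance-sampling weights
    idx = np.concatenate([hull_idx, sample_idx])
    w = np.concatenate([np.ones(len(hull_idx)), weights])
    return idx, w

# Usage: reduce 100,000 points in 2-D to a few hundred weighted points.
A = np.random.default_rng(0).normal(size=(100_000, 2))
idx, w = l2_hull_coreset(A, m=300, rng=1)
```

In higher dimensions an exact convex hull becomes expensive, which is why a geometric approximation of the hull, rather than the exact hull used in this sketch, is the practical choice.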
Assume a sample 𝒮 drawn via the ℓ₂-Hull method. With high probability, for any parameters (ϑ, λ) within the restricted domain D(η) = {(ϑ, λ) | ∀(i,j): ⟨ϑⱼ, a'ᵢⱼ⟩ > η}, the weighted negative log-likelihood evaluated on 𝒮 approximates the full-data negative log-likelihood up to a multiplicative factor of (1 ± ε).
The framework yields a coreset size bounded by 𝒪(c⁶J²d² ln³(cdJ) / ε²), practically decoupling optimization time from n while preserving statistical accuracy.
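The key property of the bound is its independence of n. A back-of-envelope helper makes this concrete; the leading constant `const` is a hypothetical placeholder, since the true constant is not specified here.

```python
# Evaluate the stated coreset-size bound O(c^6 J^2 d^2 ln^3(cdJ) / eps^2).
# `const` is a hypothetical leading constant; the bound does not involve n,
# so the required sample size stays fixed as the dataset grows.
import math

def coreset_size(J, d, c, eps, const=1.0):
    return math.ceil(const * c**6 * J**2 * d**2
                     * math.log(c * d * J)**3 / eps**2)
```

As expected, the size grows quadratically as ε shrinks and polynomially in the dimension d and the number of basis terms J, but never with the number of observations.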
The ℓ₂-Hull method achieves log-likelihood ratios near 1.0 and yields parameter estimates remarkably close to those obtained on the full dataset, consistently outperforming uniform subsampling.
Enables fitting of complex 10- to 20-dimensional joint distributions on datasets with hundreds of thousands of observations (Covertype) in seconds rather than hours, circumventing hardware limitations.
Reliably captures 14 different complex dependency scenarios — including heavy-tailed distributions, multimodality, and non-linear correlations — where baseline sampling methods miss critical boundary points.
By explicitly integrating extreme geometric boundaries via the convex hull, the algorithm ensures stable parameter updates and avoids the severe optimization failures seen in pure leverage-score sampling.