Efficient and scalable non-parametric or semi-parametric regression analysis and density estimation are of crucial importance to the fields of statistics and machine learning. However, available methods are limited in their ability to handle large-scale data. We address this issue by developing a novel coreset construction for multivariate conditional transformation models (MCTMs) to enhance their scalability and training efficiency. To the best of our knowledge, these are the first coresets for semi-parametric distributional models. Our approach yields substantial data reduction via importance sampling. It ensures with high probability that the log-likelihood remains within multiplicative error bounds of (1 ± ε) and thereby maintains statistical model accuracy. Compared to conventional fully parametric models, for which coresets have been developed before, our semi-parametric approach exhibits enhanced adaptability, particularly in scenarios where complex distributions and non-linear relationships are present but not fully understood. To address the numerical problems associated with the unbounded logarithmic terms, we employ a geometric approximation based on the convex hull of the input data. This ensures feasible, stable, and accurate inference in scenarios involving large amounts of data. Numerical experiments demonstrate substantially improved computational efficiency when handling large and complex datasets, thus laying the foundation for a broad range of applications within the statistics and machine learning communities.
We introduce the first coreset construction framework for highly flexible Multivariate Conditional Transformation Models (MCTMs), bridging the gap between semi-parametric modeling for multivariate distributions and efficient data reduction techniques.
Standard sensitivity frameworks fail on the unstable, unbounded logarithmic terms in the MCTM loss. We solve this by geometrically approximating the convex hull of transformation derivatives to avoid extreme directions, paired with ℓ₂ leverage scores for the quadratic part.
We demonstrate significant computational speedups on complex, large-scale datasets (Covertype, Equity Returns), scaling high-dimensional joint distribution models while rigorously maintaining (1 ± ε) log-likelihood approximation guarantees.
The core challenge of applying coresets to MCTMs is that the negative log-likelihood contains both tractable quadratic terms and highly unstable logarithmic terms with asymptotes at zero. Standard sensitivity sampling fails for the latter because the sensitivities are unbounded. Our hybrid approach separates these concerns: we use ℓ₂ leverage scores to bound the quadratic part, and a convex hull approximation on the derivative space to prevent the optimizer from exploring infeasible extreme directions.
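The hybrid sampling idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `l2_hull_coreset` and the sampling details are illustrative assumptions, showing only the two ingredients described above, ℓ₂ leverage scores for the quadratic part and convex-hull vertices of the feature rows as a stand-in for the extreme directions that the logarithmic terms are sensitive to.

```python
# A minimal sketch (hypothetical, not the paper's implementation) of hybrid
# "l2-Hull" sampling: leverage scores bound the sensitivities of the
# quadratic loss terms, while convex-hull vertices are retained to guard
# against the unbounded logarithmic terms at extreme directions.
import numpy as np
from scipy.spatial import ConvexHull

def l2_hull_coreset(A, m, rng=None):
    """Sample a weighted coreset from an n x d feature matrix A.

    Hull vertices are always kept (weight 1); the remaining m points are
    drawn with probabilities proportional to their l2 leverage scores and
    reweighted to keep the sampled loss unbiased.
    """
    rng = np.random.default_rng(rng)
    n, d = A.shape
    # l2 leverage scores via a thin QR decomposition: squared row norms of Q.
    Q, _ = np.linalg.qr(A)
    lev = np.sum(Q**2, axis=1)            # sums to rank(A)
    # Vertices of the convex hull of the rows: the extreme points.
    hull_idx = np.unique(ConvexHull(A).vertices)
    p = lev / lev.sum()
    sample_idx = rng.choice(n, size=m, replace=True, p=p)
    weights = 1.0 / (m * p[sample_idx])   # importance-sampling weights
    idx = np.concatenate([hull_idx, sample_idx])
    w = np.concatenate([np.ones(len(hull_idx)), weights])
    return idx, w

# Usage: reduce 100,000 points in 2-D to a few hundred weighted points.
A = np.random.default_rng(0).normal(size=(100_000, 2))
idx, w = l2_hull_coreset(A, m=300, rng=1)
```

In higher dimensions an exact convex hull becomes expensive, which is why a geometric approximation of the hull, rather than the exact hull used in this sketch, is the practical choice.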
Assume a sample 𝒮 drawn via the ℓ₂-Hull method. With high probability, for any parameters (ϑ, λ) within the restricted domain D(η) = {(ϑ, λ) | ∀(i,j): ⟨ϑⱼ, a'ᵢⱼ⟩ > η}, the weighted negative log-likelihood evaluated on 𝒮 approximates the full-data negative log-likelihood up to a multiplicative factor of (1 ± ε).
The framework yields a coreset size bounded by 𝒪(c⁶J²d² ln³(cdJ) / ε²), practically decoupling optimization time from n while preserving statistical accuracy.
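The key property of the bound is its independence of n. A back-of-envelope helper makes this concrete; the leading constant `const` is a hypothetical placeholder, since the true constant is not specified here.

```python
# Evaluate the stated coreset-size bound O(c^6 J^2 d^2 ln^3(cdJ) / eps^2).
# `const` is a hypothetical leading constant; the bound does not involve n,
# so the required sample size stays fixed as the dataset grows.
import math

def coreset_size(J, d, c, eps, const=1.0):
    return math.ceil(const * c**6 * J**2 * d**2
                     * math.log(c * d * J)**3 / eps**2)
```

As expected, the size grows quadratically as ε shrinks and polynomially in the dimension d and the number of basis terms J, but never with the number of observations.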
The ℓ₂-Hull method achieves log-likelihood ratios near 1.0 and yields parameter estimates remarkably close to those obtained on the full dataset, consistently outperforming uniform subsampling.
Enables fitting of complex 10- to 20-dimensional joint distributions on datasets with hundreds of thousands of observations (Covertype) in seconds rather than hours, circumventing hardware limitations.
Reliably captures 14 different complex dependency scenarios — including heavy-tailed distributions, multimodality, and non-linear correlations — where baseline sampling methods miss critical boundary points.
By explicitly integrating extreme geometric boundaries via the convex hull, the algorithm ensures stable parameter updates and avoids the severe optimization failures seen in pure leverage-score sampling.