[ENH] Add a "SupportScaler" or "SupportTransformer" #588

@joshdunnlime

Description

Is your feature request related to a problem? Please describe.
Yes! When trying to use an skpro grid search across distributions, you can run into lots of support issues if your target variable is not scaled to the smallest support range. For example, if you use XGBoostLSS with a target in (-inf, inf) and search across ["Normal", "Gamma"], the grid search will fail or return NaNs for the Gamma candidate. Scaling the target to the smallest support range, on the other hand, leads to less interpretable skpro/scipy distribution parameters for the Normal distribution in this case.
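
To make the failure mode concrete, here is a minimal sketch (the data and distribution parameters are illustrative assumptions, not taken from this report) of why a Gamma candidate cannot score targets outside its support:

import numpy as np
from scipy import stats

y = np.array([-2.0, 0.5, 3.0])  # hypothetical target with negative values

print(stats.norm().support())        # (-inf, inf): any real-valued y is fine
print(stats.gamma(a=2.0).support())  # (0.0, inf): negative y is outside support

# evaluating the Gamma log-density at y < 0 returns -inf, which is what makes
# the Gamma branch of the grid search fail or return NaNs
print(stats.gamma(a=2.0).logpdf(y))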

Describe the solution you'd like
I have implemented something locally that works rather nicely:

import numpy as np
from sklearn.base import BaseEstimator, OneToOneFeatureMixin, TransformerMixin
from sklearn.preprocessing import MinMaxScaler

full_reals = {
    "Laplace",
    "Logistic",
    "Normal",
    "SkewNormal",
    "TDistribution",
    "TruncatedNormal",
}


class SupportTransformer(TransformerMixin, BaseEstimator, OneToOneFeatureMixin):

    def __init__(self, dist=None, rtol=1e-6):
        # sklearn convention: store constructor params unchanged; the defaults
        # are placeholders so SupportTransformer() can be built bare and
        # configured via set_params in a grid search
        self.dist = dist
        self.rtol = rtol

    def _get_skpro_distr(self, distr):
        """Copied from xgblss code"""
        ...

    def _get_support(self):
        # logic to get scipy rvs which includes support
        # calls _get_skpro_distr
        return rvs.support(**sc_params)

    def fit(self, X, y=None):
        if self.dist in full_reals:
            # no fit needed
            return self

        self.support = self._get_support()

        # check whether X falls outside the distribution's support
        if np.any(X.max() >= self.support[1]) or np.any(X.min() <= self.support[0]):
            # some more implementation logic: derive finite support_lower /
            # support_upper from self.support (tightened by rtol), then rescale
            self.mms = MinMaxScaler((support_lower, support_upper))
            self.mms.fit(X)

            self.scale_ = self.mms.scale_

        return self

    def transform(self, X):
        if hasattr(self, "mms"):
            return self.mms.transform(X)
        else:
            return X

    def inverse_transform(self, X):
        if hasattr(self, "mms"):
            return self.mms.inverse_transform(X)
        else:
            return X
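
For context, the "some more implementation logic" step has to turn the (possibly semi-infinite) scipy support into a finite feature_range for MinMaxScaler. Below is a sketch of one way that could look; the variable names, the rtol margin and the Gamma parametrisation are my own assumptions, not the actual implementation:

# illustrative sketch only -- maps a semi-infinite support to a finite
# MinMaxScaler feature_range using the data range and an rtol margin
import numpy as np
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

y = np.array([[-2.0], [0.5], [3.0]])         # hypothetical target column
lower, upper = stats.gamma(a=2.0).support()  # (0.0, inf) for Gamma
rtol = 1e-3

# replace infinite bounds with the data range; nudge finite bounds inwards by
# rtol so transformed values stay strictly inside the open support
support_lower = y.min() if np.isinf(lower) else lower + rtol
support_upper = y.max() if np.isinf(upper) else upper - rtol

mms = MinMaxScaler(feature_range=(support_lower, support_upper))
y_scaled = mms.fit_transform(y)
print(y_scaled.min(), y_scaled.max())  # e.g. 0.001 and 3.0, inside (0, inf)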

Usage is as follows:

ttr = TransformedTargetRegressorProba(
    xgboostlss.XGBoostLSS(),
    SupportTransformer(),
)

param_grid = [
    {"regressor__dist": ["Normal"], "transformer__dist": ["Normal"]},
    {"regressor__dist": ["Gamma"], "transformer__dist": ["Gamma"]},
]

gscv = GridSearchCV(
    estimator=ttr,
    param_grid=param_grid,
    cv=cv,
    scoring=CRPS(),
    error_score="raise",
)

# for some -inf < y < inf
gscv.fit(X, y)

Describe alternatives you've considered
Horrible, horrible loops 😆
