Skip to content

Welcome to flexcv-earth

See our repository here: flexcv-earth

This python package provides wrapper classes for the earth function from the earth package in R. It then can be used as a sklearn estimator in python and especially in the flexcv package.

Installation

pip install flexcv_earth @ git+https://github.com/radlfabs/flexcv-earth

Additional dependencies of rpy2

The model class for the EarthRegressor is actually wrapping around rpy2 code and is using embedded R under the hood. Therefore, you should have a recent R version installed and run our install_rpackages.py script. From the command line change your directory to your flexcv-earth installation directory. This can be your folder that you created with venv. Run our python script that installs the remaining R dependencies.

cd path/to/flexcv-earth/
python -m install_rpackages

Now you have installed everything you need to use the EarthRegressorwith flexcv-earth.

Use with flexcv

You can use flexcv to perform cross validation with the EarthRegressor class. Define a model configuration with yaml as follows:

from flexcv import CrossValidation
from flexcv_earth import EarthRegressor, EarthModelPostProcessor
from flexcv.synthesizer import generate_data

X, y, _, _ = generate_data(10, 100)

yaml_config = """
EarthRegressor:
    requires_inner_cv: True
    n_trials: 200
    allows_n_jobs: False
    model: EarthRegressor
    params:
        degree: !Int
            low: 1
            high: 5
        nprune: !Int
            low: 1
            high: 300
        fast_k: !Int
            low: 0
            high: 20
        newvar_penalty: !Float
            low: 0.01
            high: 0.2
    post_processor: EarthModelPostProcessor
    add_merf: True
"""

cv = (
    CrossValidation().set_data(X, y)
    .set_models(yaml_strin=yaml_config)
)

Reference

flexcv_earth.models.EarthRegressor

Bases: BaseEstimator, RegressorMixin

Wrapper Class for Earth Regressor in R. For more Details see https://cran.r-project.org/web/packages/earth/earth.pdf.

Parameters:

Name Type Description Default
degree int

Degree of the splines. 1 for linear, 2 for quadratic, etc. (Default value = 1)

1
nprune int | None

Number of pruning steps. If None, the number of pruning steps is determined by the algorithm. (Default value = None)

None
nk int | None

Number of knots. If None, the number of knots is determined by the algorithm. The default is semi-automatically calculated from the number of predictors but may need adjusting. (Default value = None)

None
thresh float

Forward stepping threshold. (Default value = 0.001)

0.001
minspan int

Minimum number of observations between knots. (Default value = 0)

0
endspan int

Minimum number of observations before the first and after the final knot. (Default value = 0)

0
newvar_penalty float

(Default value = 0.0)

0.0
fast_k int

Maximum number of parent terms considered at each step of the forward pass. (Default value = 20)

20
fast_beta float

Fast MARS ageing coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results. (Default value = 1.0)

1.0
pmethod str

Pruning method. One of: backward none exhaustive forward seqrep cv. Default is "backward". Specify pmethod="cv" to use cross-validation to select the number of terms. This selects the number of terms that gives the maximum mean out-of-fold RSq on the fold models. Requires the nfold argument. Use "none" to retain all the terms created by the forward pass. If y has multiple columns, then only "backward" or "none" is allowed. Pruning can take a while if "exhaustive" is chosen and the model is big (more than about 30 terms). The current version of the leaps package used during pruning does not allow user interrupts (i.e., you have to kill your R session to interrupt; in Windows use the Task Manager or from the command line use taskkill). (Default value = "backward")

'backward'
Source code in flexcv_earth/models.py
class EarthRegressor(BaseEstimator, RegressorMixin):
    """Wrapper Class for Earth Regressor in R.
    For more Details see https://cran.r-project.org/web/packages/earth/earth.pdf.

    Args:
        degree (int): Degree of the splines. 1 for linear, 2 for quadratic, etc. (Default value = 1)
        nprune (int | None): Number of pruning steps. If None, the number of pruning steps is determined by the algorithm. (Default value = None)
        nk (int | None): Number of knots. If None, the number of knots is determined by the algorithm. The default is semi-automatically calculated from the number of predictors but may need adjusting. (Default value = None)
        thresh (float): Forward stepping threshold. (Default value = 0.001)
        minspan (int): Minimum number of observations between knots. (Default value = 0)
        endspan (int): Minimum number of observations before the first and after the final knot. (Default value = 0)
        newvar_penalty (float): (Default value = 0.0)
        fast_k (int): Maximum number of parent terms considered at each step of the forward pass. (Default value = 20)
        fast_beta (float): Fast MARS ageing coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results. (Default value = 1.0)
        pmethod (str): Pruning method. One of: backward none exhaustive forward seqrep cv. Default is "backward". Specify pmethod="cv" to use cross-validation to select the number of terms. This selects the number of terms that gives the maximum mean out-of-fold RSq on the fold models. Requires the nfold argument. Use "none" to retain all the terms created by the forward pass. If y has multiple columns, then only "backward" or "none" is allowed. Pruning can take a while if "exhaustive" is chosen and the model is big (more than about 30 terms). The current version of the leaps package used during pruning does not allow user interrupts (i.e., you have to kill your R session to interrupt; in Windows use the Task Manager or from the command line use taskkill). (Default value = "backward")

    """

    def __init__(
        self,
        degree: int = 1,
        nprune: int = None,
        nk: int = None,
        thresh: float = 0.001,
        minspan: int = 0,
        endspan: int = 0,
        newvar_penalty: float = 0.0,
        fast_k: int = 20,
        fast_beta: float = 1.0,
        pmethod: str = "backward",
        random_state: int = None,
    ):
        self.degree = degree
        self.nprune = nprune
        self.nk = nk
        self.thresh = thresh
        self.minspan = minspan
        self.endspan = endspan
        self.newvar_penalty = newvar_penalty
        self.fast_k = fast_k
        self.fast_beta = fast_beta
        self.pmethod = pmethod
        self.random_state = random_state

    def fit(self, X, y):
        """Fit a EARTH model to the given training data.

        Args:
          X (array-like): Features.
          y (array-like): Target values.

        Returns:
           (object): Returns self.
        """
        if np.iscomplexobj(X) or np.iscomplexobj(y):
            raise ValueError("Complex data not supported")
        # ro.r('sink(nullfile())')
        if self.random_state is not None:
            ro.r(f"set.seed({self.random_state})")

        ro.r(
            """
            library(earth)
        """
        )
        numpy2ri.activate()
        pandas2ri.activate()

        assert (
            X.shape[0] == y.shape[0]
        ), "Number of X samples must match number of y samples."

        # Convert X, y according to its type
        if isinstance(X, pd.DataFrame):
            # Convert pandas dataframe to R dataframe
            r_X = pandas2ri.py2rpy(X)
        elif isinstance(X, np.ndarray):
            r_X = numpy2ri.numpy2rpy(X)
            # Convert numpy array to R matrix
        else:
            r_X = ro.r.matrix(X, nrow=X.shape[0], ncol=X.shape[1])

        # Convert pandas Series to R vector
        r_y = ro.FloatVector(y)

        # Fit MARS regression model using earth function from the earth package
        # make nprune None in R as default
        nprune = self.nprune if self.nprune is not None else ro.r("as.null")()
        # The following has a special defaults which we dont want to overwrite with None
        nk = {"nk": self.nk} if self.nk is not None else {}
        # We have to pass newvar.penalty as a named argument because Python does not allow "." in variable names
        newvar_penalty = {"newvar.penalty": self.newvar_penalty}
        fast_k = {"fast.k": self.fast_k}
        fast_beta = {"fast.beta": self.fast_beta}

        self.model_ = ro.r.earth(
            r_X,
            r_y,
            degree=self.degree,
            nprune=nprune,
            thresh=self.thresh,
            minspan=self.minspan,
            endspan=self.endspan,
            pmethod=self.pmethod,
            **newvar_penalty,
            **fast_k,
            **fast_beta,
            **nk,
        )

        self.is_fitted_ = True
        self.var_imp_ = self.calc_variable_importance()

        del r_X
        del r_y
        numpy2ri.deactivate()
        pandas2ri.deactivate()

        gc.collect()
        ro.r("gc()")
        gc.collect()

        return self

    def predict(self, X):
        """Make predicitons using the fitted model.

        Args:
          X (array-like): Features

        Returns:
            (array-like): An array of fitted values.
        """
        if np.iscomplexobj(X):
            raise ValueError("Complex data not supported")
        ro.r(
            """
            library(earth)
        """
        )
        numpy2ri.activate()
        pandas2ri.activate()
        # input checks
        check_is_fitted(self)
        if isinstance(X, pd.DataFrame):
            # Convert pandas dataframe to R dataframe
            r_X = pandas2ri.py2rpy(X)
        elif isinstance(X, np.ndarray):
            r_X = numpy2ri.numpy2rpy(X)
            # Convert numpy array to R matrix
        else:
            r_X = ro.r.matrix(X, nrow=X.shape[0], ncol=X.shape[1])
        # assign model in R in order to predict
        # ro.r.assign("model", self.model)
        y_pred = np.asarray(ro.r["predict"](self.model_, r_X))
        # make sure that the output is a 1d array
        y_pred = y_pred.ravel()
        del r_X

        numpy2ri.deactivate()
        pandas2ri.deactivate()
        gc.collect()
        ro.r("gc()")
        gc.collect()

        return y_pred

    def __sklearn_is_fitted__(self):
        return self.is_fitted_

    def get_params(self, deep=False):
        """Returns the parameters of the model.

        Args:
          deep: bool: This argument is not used.  (Default value = False)
        Returns:
            (dict): Parameter names mapped to their values.
        """
        return {
            "degree": self.degree,
            "nprune": self.nprune,
            "nk": self.nk,
            "thresh": self.thresh,
            "minspan": self.minspan,
            "endspan": self.endspan,
            "newvar_penalty": self.newvar_penalty,
            "fast_k": self.fast_k,
            "fast_beta": self.fast_beta,
            "pmethod": self.pmethod,
            "random_state": self.random_state,
        }

    def get_rmodel(self):
        """Returns the R model object.

        Returns:
            (object): The R model object."""
        return self.model_

    def make_r_plots(self):
        """Creates plots of the model in R and saves them to disk. They are saved to disk in the `tmp_imgs` folder."""
        Path("tmp_imgs").mkdir(parents=True, exist_ok=True)
        for i in range(1, 5):
            ro.r["png"](f"tmp_imgs/mars_plot_{i}.png", width=1024, height=1024)
            ro.r["plot"](self.model_, which=i)
            ro.r["dev.off"]()

    def calc_variable_importance(self):
        """Calculates the variable importance of the model.

        Returns:
            (pandas.DataFrame): A DataFrame containing the variable importance."""
        ro.globalenv["ev"] = ro.r["evimp"](self.model_, trim=False)
        imp = ro.r("as.data.frame(unclass(ev[,c(3,4,6)]))")
        imp_df: pd.DataFrame = ro.conversion.rpy2py(imp)
        imp_df.columns = ["nsubsets", "gcv", "rss"]

        del imp
        gc.collect()
        ro.r("rm(ev)")
        ro.r("gc()")
        gc.collect()
        return imp_df

    def get_variable_importance(self, features):
        """Returns the variable importance of the model.

        Args:
          features: array-like: The feature names.

        Returns:
            (pandas.DataFrame): A DataFrame containing the variable importance.
        """
        self.var_imp_.index = features
        return self.var_imp_

flexcv_earth.models.EarthRegressor.calc_variable_importance()

Calculates the variable importance of the model.

Returns:

Type Description
DataFrame

A DataFrame containing the variable importance.

Source code in flexcv_earth/models.py
def calc_variable_importance(self):
    """Calculates the variable importance of the model.

    Returns:
        (pandas.DataFrame): A DataFrame containing the variable importance."""
    ro.globalenv["ev"] = ro.r["evimp"](self.model_, trim=False)
    imp = ro.r("as.data.frame(unclass(ev[,c(3,4,6)]))")
    imp_df: pd.DataFrame = ro.conversion.rpy2py(imp)
    imp_df.columns = ["nsubsets", "gcv", "rss"]

    del imp
    gc.collect()
    ro.r("rm(ev)")
    ro.r("gc()")
    gc.collect()
    return imp_df

flexcv_earth.models.EarthRegressor.fit(X, y)

Fit a EARTH model to the given training data.

Parameters:

Name Type Description Default
X array - like

Features.

required
y array - like

Target values.

required

Returns:

Type Description
object

Returns self.

Source code in flexcv_earth/models.py
def fit(self, X, y):
    """Fit a EARTH model to the given training data.

    Args:
      X (array-like): Features.
      y (array-like): Target values.

    Returns:
       (object): Returns self.
    """
    if np.iscomplexobj(X) or np.iscomplexobj(y):
        raise ValueError("Complex data not supported")
    # ro.r('sink(nullfile())')
    if self.random_state is not None:
        ro.r(f"set.seed({self.random_state})")

    ro.r(
        """
        library(earth)
    """
    )
    numpy2ri.activate()
    pandas2ri.activate()

    assert (
        X.shape[0] == y.shape[0]
    ), "Number of X samples must match number of y samples."

    # Convert X, y according to its type
    if isinstance(X, pd.DataFrame):
        # Convert pandas dataframe to R dataframe
        r_X = pandas2ri.py2rpy(X)
    elif isinstance(X, np.ndarray):
        r_X = numpy2ri.numpy2rpy(X)
        # Convert numpy array to R matrix
    else:
        r_X = ro.r.matrix(X, nrow=X.shape[0], ncol=X.shape[1])

    # Convert pandas Series to R vector
    r_y = ro.FloatVector(y)

    # Fit MARS regression model using earth function from the earth package
    # make nprune None in R as default
    nprune = self.nprune if self.nprune is not None else ro.r("as.null")()
    # The following has a special defaults which we dont want to overwrite with None
    nk = {"nk": self.nk} if self.nk is not None else {}
    # We have to pass newvar.penalty as a named argument because Python does not allow "." in variable names
    newvar_penalty = {"newvar.penalty": self.newvar_penalty}
    fast_k = {"fast.k": self.fast_k}
    fast_beta = {"fast.beta": self.fast_beta}

    self.model_ = ro.r.earth(
        r_X,
        r_y,
        degree=self.degree,
        nprune=nprune,
        thresh=self.thresh,
        minspan=self.minspan,
        endspan=self.endspan,
        pmethod=self.pmethod,
        **newvar_penalty,
        **fast_k,
        **fast_beta,
        **nk,
    )

    self.is_fitted_ = True
    self.var_imp_ = self.calc_variable_importance()

    del r_X
    del r_y
    numpy2ri.deactivate()
    pandas2ri.deactivate()

    gc.collect()
    ro.r("gc()")
    gc.collect()

    return self

flexcv_earth.models.EarthRegressor.get_params(deep=False)

Returns the parameters of the model.

Parameters:

Name Type Description Default
deep

bool: This argument is not used. (Default value = False)

False

Returns: (dict): Parameter names mapped to their values.

Source code in flexcv_earth/models.py
def get_params(self, deep=False):
    """Returns the parameters of the model.

    Args:
      deep: bool: This argument is not used.  (Default value = False)
    Returns:
        (dict): Parameter names mapped to their values.
    """
    return {
        "degree": self.degree,
        "nprune": self.nprune,
        "nk": self.nk,
        "thresh": self.thresh,
        "minspan": self.minspan,
        "endspan": self.endspan,
        "newvar_penalty": self.newvar_penalty,
        "fast_k": self.fast_k,
        "fast_beta": self.fast_beta,
        "pmethod": self.pmethod,
        "random_state": self.random_state,
    }

flexcv_earth.models.EarthRegressor.get_rmodel()

Returns the R model object.

Returns:

Type Description
object

The R model object.

Source code in flexcv_earth/models.py
def get_rmodel(self):
    """Returns the R model object.

    Returns:
        (object): The R model object."""
    return self.model_

flexcv_earth.models.EarthRegressor.get_variable_importance(features)

Returns the variable importance of the model.

Parameters:

Name Type Description Default
features

array-like: The feature names.

required

Returns:

Type Description
DataFrame

A DataFrame containing the variable importance.

Source code in flexcv_earth/models.py
def get_variable_importance(self, features):
    """Returns the variable importance of the model.

    Args:
      features: array-like: The feature names.

    Returns:
        (pandas.DataFrame): A DataFrame containing the variable importance.
    """
    self.var_imp_.index = features
    return self.var_imp_

flexcv_earth.models.EarthRegressor.make_r_plots()

Creates plots of the model in R and saves them to disk. They are saved to disk in the tmp_imgs folder.

Source code in flexcv_earth/models.py
def make_r_plots(self):
    """Creates plots of the model in R and saves them to disk. They are saved to disk in the `tmp_imgs` folder."""
    Path("tmp_imgs").mkdir(parents=True, exist_ok=True)
    for i in range(1, 5):
        ro.r["png"](f"tmp_imgs/mars_plot_{i}.png", width=1024, height=1024)
        ro.r["plot"](self.model_, which=i)
        ro.r["dev.off"]()

flexcv_earth.models.EarthRegressor.predict(X)

Make predicitons using the fitted model.

Parameters:

Name Type Description Default
X array - like

Features

required

Returns:

Type Description
array - like

An array of fitted values.

Source code in flexcv_earth/models.py
def predict(self, X):
    """Make predicitons using the fitted model.

    Args:
      X (array-like): Features

    Returns:
        (array-like): An array of fitted values.
    """
    if np.iscomplexobj(X):
        raise ValueError("Complex data not supported")
    ro.r(
        """
        library(earth)
    """
    )
    numpy2ri.activate()
    pandas2ri.activate()
    # input checks
    check_is_fitted(self)
    if isinstance(X, pd.DataFrame):
        # Convert pandas dataframe to R dataframe
        r_X = pandas2ri.py2rpy(X)
    elif isinstance(X, np.ndarray):
        r_X = numpy2ri.numpy2rpy(X)
        # Convert numpy array to R matrix
    else:
        r_X = ro.r.matrix(X, nrow=X.shape[0], ncol=X.shape[1])
    # assign model in R in order to predict
    # ro.r.assign("model", self.model)
    y_pred = np.asarray(ro.r["predict"](self.model_, r_X))
    # make sure that the output is a 1d array
    y_pred = y_pred.ravel()
    del r_X

    numpy2ri.deactivate()
    pandas2ri.deactivate()
    gc.collect()
    ro.r("gc()")
    gc.collect()

    return y_pred

flexcv_earth.model_postprocessing.EarthModelPostProcessor

Source code in flexcv_earth/model_postprocessing.py
class EarthModelPostProcessor():
    def __init__(self):
        super().__init__()

    def __call__(self, results_all_folds, fold_result, features, run, *args, **kwargs):
        """Postprocessing function for the MARS model.
        Logs the parameters to Neptune.
        Creates a variable importance table and logs barplots to neptune.

        Args:
            results_all_folds: A dict of results for all folds
            fold_result: A dataclass containing the results for the current fold
            features: list of features
            run: neptune run object
            *args: any additional arguments
            **kwargs: any additional keyword arguments

        Returns:
            (dict): updated results dictionary
        """
        with plt.style.context("ggplot"):
            imp_df: pd.DataFrame = fold_result.best_model.get_variable_importance(features)
            run[f"{fold_result.model_name}/FeatImportance/Table"].append(File.as_html(imp_df))
            for col in imp_df.columns:
                # plot all rows of col where col is not 0
                fig = plt.figure()
                tmp = imp_df[col]
                tmp = tmp[tmp != 0]
                try:
                    tmp.plot.barh()
                    plt.title(f"{col} Variable Importance")
                    run[f"{fold_result.model_name}/FeatImportance/"].append(fig)
                except Exception as e:
                    logger.info(f"{e}")
                    logger.info(f"Could not make barplot for {fold_result.model_name}. Continuing.")
                del fig
                plt.close()

            run[f"{fold_result.model_name}/BestParams"].append(
                pformat(fold_result.best_params)
            )

        return results_all_folds

flexcv_earth.model_postprocessing.EarthModelPostProcessor.__call__(results_all_folds, fold_result, features, run, *args, **kwargs)

Postprocessing function for the MARS model. Logs the parameters to Neptune. Creates a variable importance table and logs barplots to neptune.

Parameters:

Name Type Description Default
results_all_folds

A dict of results for all folds

required
fold_result

A dataclass containing the results for the current fold

required
features

list of features

required
run

neptune run object

required
*args

any additional arguments

()
**kwargs

any additional keyword arguments

{}

Returns:

Type Description
dict

updated results dictionary

Source code in flexcv_earth/model_postprocessing.py
def __call__(self, results_all_folds, fold_result, features, run, *args, **kwargs):
    """Postprocessing function for the MARS model.
    Logs the parameters to Neptune.
    Creates a variable importance table and logs barplots to neptune.

    Args:
        results_all_folds: A dict of results for all folds
        fold_result: A dataclass containing the results for the current fold
        features: list of features
        run: neptune run object
        *args: any additional arguments
        **kwargs: any additional keyword arguments

    Returns:
        (dict): updated results dictionary
    """
    with plt.style.context("ggplot"):
        imp_df: pd.DataFrame = fold_result.best_model.get_variable_importance(features)
        run[f"{fold_result.model_name}/FeatImportance/Table"].append(File.as_html(imp_df))
        for col in imp_df.columns:
            # plot all rows of col where col is not 0
            fig = plt.figure()
            tmp = imp_df[col]
            tmp = tmp[tmp != 0]
            try:
                tmp.plot.barh()
                plt.title(f"{col} Variable Importance")
                run[f"{fold_result.model_name}/FeatImportance/"].append(fig)
            except Exception as e:
                logger.info(f"{e}")
                logger.info(f"Could not make barplot for {fold_result.model_name}. Continuing.")
            del fig
            plt.close()

        run[f"{fold_result.model_name}/BestParams"].append(
            pformat(fold_result.best_params)
        )

    return results_all_folds