Welcome to flexcv-earth

See our repository here: flexcv-earth

This python package provides wrapper classes for the earth function from the earth package in R. It then can be used as a sklearn estimator in python and especially in the flexcv package.

Installation

pip install flexcv_earth @ git+https://github.com/radlfabs/flexcv-earth

Additional dependencies of `rpy2`

The model class for the EarthRegressor is actually wrapping around rpy2 code and is using embedded R under the hood. Therefore, you should have a recent R version installed and run our install_rpackages.py script. From the command line change your directory to your flexcv-earth installation directory. This can be your folder that you created with venv. Run our python script that installs the remaining R dependencies.

cd path/to/flexcv-earth/
python -m install_rpackages

Now you have installed everything you need to use the EarthRegressorwith flexcv-earth.

Use with flexcv

You can use flexcv to perform cross validation with the EarthRegressor class. Define a model configuration with yaml as follows:

from flexcv import CrossValidation
from flexcv_earth import EarthRegressor, EarthModelPostProcessor
from flexcv.synthesizer import generate_data

X, y, _, _ = generate_data(10, 100)

yaml_config = """
EarthRegressor:
    requires_inner_cv: True
    n_trials: 200
    allows_n_jobs: False
    model: EarthRegressor
    params:
        degree: !Int
            low: 1
            high: 5
        nprune: !Int
            low: 1
            high: 300
        fast_k: !Int
            low: 0
            high: 20
        newvar_penalty: !Float
            low: 0.01
            high: 0.2
    post_processor: EarthModelPostProcessor
    add_merf: True
"""

cv = (
    CrossValidation().set_data(X, y)
    .set_models(yaml_strin=yaml_config)
)

Reference

`flexcv_earth.models.EarthRegressor`

Bases: BaseEstimator, RegressorMixin

Wrapper Class for Earth Regressor in R. For more Details see https://cran.r-project.org/web/packages/earth/earth.pdf.

Parameters:

Name	Type	Description	Default
`degree`	`int`	Degree of the splines. 1 for linear, 2 for quadratic, etc. (Default value = 1)	`1`
`nprune`	`int \| None`	Number of pruning steps. If None, the number of pruning steps is determined by the algorithm. (Default value = None)	`None`
`nk`	`int \| None`	Number of knots. If None, the number of knots is determined by the algorithm. The default is semi-automatically calculated from the number of predictors but may need adjusting. (Default value = None)	`None`
`thresh`	`float`	Forward stepping threshold. (Default value = 0.001)	`0.001`
`minspan`	`int`	Minimum number of observations between knots. (Default value = 0)	`0`
`endspan`	`int`	Minimum number of observations before the first and after the final knot. (Default value = 0)	`0`
`newvar_penalty`	`float`	(Default value = 0.0)	`0.0`
`fast_k`	`int`	Maximum number of parent terms considered at each step of the forward pass. (Default value = 20)	`20`
`fast_beta`	`float`	Fast MARS ageing coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results. (Default value = 1.0)	`1.0`
`pmethod`	`str`	Pruning method. One of: backward none exhaustive forward seqrep cv. Default is "backward". Specify pmethod="cv" to use cross-validation to select the number of terms. This selects the number of terms that gives the maximum mean out-of-fold RSq on the fold models. Requires the nfold argument. Use "none" to retain all the terms created by the forward pass. If y has multiple columns, then only "backward" or "none" is allowed. Pruning can take a while if "exhaustive" is chosen and the model is big (more than about 30 terms). The current version of the leaps package used during pruning does not allow user interrupts (i.e., you have to kill your R session to interrupt; in Windows use the Task Manager or from the command line use taskkill). (Default value = "backward")	`'backward'`

Source code in flexcv_earth/models.py

class EarthRegressor(BaseEstimator, RegressorMixin):
    """Wrapper Class for Earth Regressor in R.
    For more Details see https://cran.r-project.org/web/packages/earth/earth.pdf.

    Args:
        degree (int): Degree of the splines. 1 for linear, 2 for quadratic, etc. (Default value = 1)
        nprune (int | None): Number of pruning steps. If None, the number of pruning steps is determined by the algorithm. (Default value = None)
        nk (int | None): Number of knots. If None, the number of knots is determined by the algorithm. The default is semi-automatically calculated from the number of predictors but may need adjusting. (Default value = None)
        thresh (float): Forward stepping threshold. (Default value = 0.001)
        minspan (int): Minimum number of observations between knots. (Default value = 0)
        endspan (int): Minimum number of observations before the first and after the final knot. (Default value = 0)
        newvar_penalty (float): (Default value = 0.0)
        fast_k (int): Maximum number of parent terms considered at each step of the forward pass. (Default value = 20)
        fast_beta (float): Fast MARS ageing coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results. (Default value = 1.0)
        pmethod (str): Pruning method. One of: backward none exhaustive forward seqrep cv. Default is "backward". Specify pmethod="cv" to use cross-validation to select the number of terms. This selects the number of terms that gives the maximum mean out-of-fold RSq on the fold models. Requires the nfold argument. Use "none" to retain all the terms created by the forward pass. If y has multiple columns, then only "backward" or "none" is allowed. Pruning can take a while if "exhaustive" is chosen and the model is big (more than about 30 terms). The current version of the leaps package used during pruning does not allow user interrupts (i.e., you have to kill your R session to interrupt; in Windows use the Task Manager or from the command line use taskkill). (Default value = "backward")

    """

    def __init__(
        self,
        degree: int = 1,
        nprune: int = None,
        nk: int = None,
        thresh: float = 0.001,
        minspan: int = 0,
        endspan: int = 0,
        newvar_penalty: float = 0.0,
        fast_k: int = 20,
        fast_beta: float = 1.0,
        pmethod: str = "backward",
        random_state: int = None,
    ):
        self.degree = degree
        self.nprune = nprune
        self.nk = nk
        self.thresh = thresh
        self.minspan = minspan
        self.endspan = endspan
        self.newvar_penalty = newvar_penalty
        self.fast_k = fast_k
        self.fast_beta = fast_beta
        self.pmethod = pmethod
        self.random_state = random_state

    def fit(self, X, y):
        """Fit a EARTH model to the given training data.

        Args:
          X (array-like): Features.
          y (array-like): Target values.

        Returns:
           (object): Returns self.
        """
        if np.iscomplexobj(X) or np.iscomplexobj(y):
            raise ValueError("Complex data not supported")
        # ro.r('sink(nullfile())')
        if self.random_state is not None:
            ro.r(f"set.seed({self.random_state})")

        ro.r(
            """
            library(earth)
        """
        )
        numpy2ri.activate()
        pandas2ri.activate()

        assert (
            X.shape[0] == y.shape[0]
        ), "Number of X samples must match number of y samples."

        # Convert X, y according to its type
        if isinstance(X, pd.DataFrame):
            # Convert pandas dataframe to R dataframe
            r_X = pandas2ri.py2rpy(X)
        elif isinstance(X, np.ndarray):
            r_X = numpy2ri.numpy2rpy(X)
            # Convert numpy array to R matrix
        else:
            r_X = ro.r.matrix(X, nrow=X.shape[0], ncol=X.shape[1])

        # Convert pandas Series to R vector
        r_y = ro.FloatVector(y)

        # Fit MARS regression model using earth function from the earth package
        # make nprune None in R as default
        nprune = self.nprune if self.nprune is not None else ro.r("as.null")()
        # The following has a special defaults which we dont want to overwrite with None
        nk = {"nk": self.nk} if self.nk is not None else {}
        # We have to pass newvar.penalty as a named argument because Python does not allow "." in variable names
        newvar_penalty = {"newvar.penalty": self.newvar_penalty}
        fast_k = {"fast.k": self.fast_k}
        fast_beta = {"fast.beta": self.fast_beta}

        self.model_ = ro.r.earth(
            r_X,
            r_y,
            degree=self.degree,
            nprune=nprune,
            thresh=self.thresh,
            minspan=self.minspan,
            endspan=self.endspan,
            pmethod=self.pmethod,
            **newvar_penalty,
            **fast_k,
            **fast_beta,
            **nk,
        )

        self.is_fitted_ = True
        self.var_imp_ = self.calc_variable_importance()

        del r_X
        del r_y
        numpy2ri.deactivate()
        pandas2ri.deactivate()

        gc.collect()
        ro.r("gc()")
        gc.collect()

        return self

    def predict(self, X):
        """Make predicitons using the fitted model.

        Args:
          X (array-like): Features

        Returns:
            (array-like): An array of fitted values.
        """
        if np.iscomplexobj(X):
            raise ValueError("Complex data not supported")
        ro.r(
            """
            library(earth)
        """
        )
        numpy2ri.activate()
        pandas2ri.activate()
        # input checks
        check_is_fitted(self)
        if isinstance(X, pd.DataFrame):
            # Convert pandas dataframe to R dataframe
            r_X = pandas2ri.py2rpy(X)
        elif isinstance(X, np.ndarray):
            r_X = numpy2ri.numpy2rpy(X)
            # Convert numpy array to R matrix
        else:
            r_X = ro.r.matrix(X, nrow=X.shape[0], ncol=X.shape[1])
        # assign model in R in order to predict
        # ro.r.assign("model", self.model)
        y_pred = np.asarray(ro.r["predict"](self.model_, r_X))
        # make sure that the output is a 1d array
        y_pred = y_pred.ravel()
        del r_X

        numpy2ri.deactivate()
        pandas2ri.deactivate()
        gc.collect()
        ro.r("gc()")
        gc.collect()

        return y_pred

    def __sklearn_is_fitted__(self):
        return self.is_fitted_

    def get_params(self, deep=False):
        """Returns the parameters of the model.

        Args:
          deep: bool: This argument is not used.  (Default value = False)
        Returns:
            (dict): Parameter names mapped to their values.
        """
        return {
            "degree": self.degree,
            "nprune": self.nprune,
            "nk": self.nk,
            "thresh": self.thresh,
            "minspan": self.minspan,
            "endspan": self.endspan,
            "newvar_penalty": self.newvar_penalty,
            "fast_k": self.fast_k,
            "fast_beta": self.fast_beta,
            "pmethod": self.pmethod,
            "random_state": self.random_state,
        }

    def get_rmodel(self):
        """Returns the R model object.

        Returns:
            (object): The R model object."""
        return self.model_

    def make_r_plots(self):
        """Creates plots of the model in R and saves them to disk. They are saved to disk in the `tmp_imgs` folder."""
        Path("tmp_imgs").mkdir(parents=True, exist_ok=True)
        for i in range(1, 5):
            ro.r["png"](f"tmp_imgs/mars_plot_{i}.png", width=1024, height=1024)
            ro.r["plot"](self.model_, which=i)
            ro.r["dev.off"]()

    def calc_variable_importance(self):
        """Calculates the variable importance of the model.

        Returns:
            (pandas.DataFrame): A DataFrame containing the variable importance."""
        ro.globalenv["ev"] = ro.r["evimp"](self.model_, trim=False)
        imp = ro.r("as.data.frame(unclass(ev[,c(3,4,6)]))")
        imp_df: pd.DataFrame = ro.conversion.rpy2py(imp)
        imp_df.columns = ["nsubsets", "gcv", "rss"]

        del imp
        gc.collect()
        ro.r("rm(ev)")
        ro.r("gc()")
        gc.collect()
        return imp_df

    def get_variable_importance(self, features):
        """Returns the variable importance of the model.

        Args:
          features: array-like: The feature names.

        Returns:
            (pandas.DataFrame): A DataFrame containing the variable importance.
        """
        self.var_imp_.index = features
        return self.var_imp_

`flexcv_earth.models.EarthRegressor.calc_variable_importance()`

Calculates the variable importance of the model.

Returns:

Type	Description
`DataFrame`	A DataFrame containing the variable importance.

Source code in flexcv_earth/models.py

def calc_variable_importance(self):
    """Calculates the variable importance of the model.

    Returns:
        (pandas.DataFrame): A DataFrame containing the variable importance."""
    ro.globalenv["ev"] = ro.r["evimp"](self.model_, trim=False)
    imp = ro.r("as.data.frame(unclass(ev[,c(3,4,6)]))")
    imp_df: pd.DataFrame = ro.conversion.rpy2py(imp)
    imp_df.columns = ["nsubsets", "gcv", "rss"]

    del imp
    gc.collect()
    ro.r("rm(ev)")
    ro.r("gc()")
    gc.collect()
    return imp_df

`flexcv_earth.models.EarthRegressor.fit(X, y)`

Fit a EARTH model to the given training data.

Parameters:

Name	Type	Description	Default
`X`	`array - like`	Features.	required
`y`	`array - like`	Target values.	required

Returns:

Type	Description
`object`	Returns self.

Source code in flexcv_earth/models.py

def fit(self, X, y):
    """Fit a EARTH model to the given training data.

    Args:
      X (array-like): Features.
      y (array-like): Target values.

    Returns:
       (object): Returns self.
    """
    if np.iscomplexobj(X) or np.iscomplexobj(y):
        raise ValueError("Complex data not supported")
    # ro.r('sink(nullfile())')
    if self.random_state is not None:
        ro.r(f"set.seed({self.random_state})")

    ro.r(
        """
        library(earth)
    """
    )
    numpy2ri.activate()
    pandas2ri.activate()

    assert (
        X.shape[0] == y.shape[0]
    ), "Number of X samples must match number of y samples."

    # Convert X, y according to its type
    if isinstance(X, pd.DataFrame):
        # Convert pandas dataframe to R dataframe
        r_X = pandas2ri.py2rpy(X)
    elif isinstance(X, np.ndarray):
        r_X = numpy2ri.numpy2rpy(X)
        # Convert numpy array to R matrix
    else:
        r_X = ro.r.matrix(X, nrow=X.shape[0], ncol=X.shape[1])

    # Convert pandas Series to R vector
    r_y = ro.FloatVector(y)

    # Fit MARS regression model using earth function from the earth package
    # make nprune None in R as default
    nprune = self.nprune if self.nprune is not None else ro.r("as.null")()
    # The following has a special defaults which we dont want to overwrite with None
    nk = {"nk": self.nk} if self.nk is not None else {}
    # We have to pass newvar.penalty as a named argument because Python does not allow "." in variable names
    newvar_penalty = {"newvar.penalty": self.newvar_penalty}
    fast_k = {"fast.k": self.fast_k}
    fast_beta = {"fast.beta": self.fast_beta}

    self.model_ = ro.r.earth(
        r_X,
        r_y,
        degree=self.degree,
        nprune=nprune,
        thresh=self.thresh,
        minspan=self.minspan,
        endspan=self.endspan,
        pmethod=self.pmethod,
        **newvar_penalty,
        **fast_k,
        **fast_beta,
        **nk,
    )

    self.is_fitted_ = True
    self.var_imp_ = self.calc_variable_importance()

    del r_X
    del r_y
    numpy2ri.deactivate()
    pandas2ri.deactivate()

    gc.collect()
    ro.r("gc()")
    gc.collect()

    return self

`flexcv_earth.models.EarthRegressor.get_params(deep=False)`

Returns the parameters of the model.

Parameters:

Name	Type	Description	Default
`deep`		bool: This argument is not used. (Default value = False)	`False`

Returns: (dict): Parameter names mapped to their values.

Source code in flexcv_earth/models.py

def get_params(self, deep=False):
    """Returns the parameters of the model.

    Args:
      deep: bool: This argument is not used.  (Default value = False)
    Returns:
        (dict): Parameter names mapped to their values.
    """
    return {
        "degree": self.degree,
        "nprune": self.nprune,
        "nk": self.nk,
        "thresh": self.thresh,
        "minspan": self.minspan,
        "endspan": self.endspan,
        "newvar_penalty": self.newvar_penalty,
        "fast_k": self.fast_k,
        "fast_beta": self.fast_beta,
        "pmethod": self.pmethod,
        "random_state": self.random_state,
    }

`flexcv_earth.models.EarthRegressor.get_rmodel()`

Returns the R model object.

Returns:

Type	Description
`object`	The R model object.

Source code in flexcv_earth/models.py

def get_rmodel(self):
    """Returns the R model object.

    Returns:
        (object): The R model object."""
    return self.model_

`flexcv_earth.models.EarthRegressor.get_variable_importance(features)`

Returns the variable importance of the model.

Parameters:

Name	Type	Description	Default
`features`		array-like: The feature names.	required

Returns:

Type	Description
`DataFrame`	A DataFrame containing the variable importance.

Source code in flexcv_earth/models.py

def get_variable_importance(self, features):
    """Returns the variable importance of the model.

    Args:
      features: array-like: The feature names.

    Returns:
        (pandas.DataFrame): A DataFrame containing the variable importance.
    """
    self.var_imp_.index = features
    return self.var_imp_

`flexcv_earth.models.EarthRegressor.make_r_plots()`

Creates plots of the model in R and saves them to disk. They are saved to disk in the tmp_imgs folder.

Source code in flexcv_earth/models.py

def make_r_plots(self):
    """Creates plots of the model in R and saves them to disk. They are saved to disk in the `tmp_imgs` folder."""
    Path("tmp_imgs").mkdir(parents=True, exist_ok=True)
    for i in range(1, 5):
        ro.r["png"](f"tmp_imgs/mars_plot_{i}.png", width=1024, height=1024)
        ro.r["plot"](self.model_, which=i)
        ro.r["dev.off"]()

`flexcv_earth.models.EarthRegressor.predict(X)`

Make predicitons using the fitted model.

Parameters:

Name	Type	Description	Default
`X`	`array - like`	Features	required

Returns:

Type	Description
`array - like`	An array of fitted values.

Source code in flexcv_earth/models.py

def predict(self, X):
    """Make predicitons using the fitted model.

    Args:
      X (array-like): Features

    Returns:
        (array-like): An array of fitted values.
    """
    if np.iscomplexobj(X):
        raise ValueError("Complex data not supported")
    ro.r(
        """
        library(earth)
    """
    )
    numpy2ri.activate()
    pandas2ri.activate()
    # input checks
    check_is_fitted(self)
    if isinstance(X, pd.DataFrame):
        # Convert pandas dataframe to R dataframe
        r_X = pandas2ri.py2rpy(X)
    elif isinstance(X, np.ndarray):
        r_X = numpy2ri.numpy2rpy(X)
        # Convert numpy array to R matrix
    else:
        r_X = ro.r.matrix(X, nrow=X.shape[0], ncol=X.shape[1])
    # assign model in R in order to predict
    # ro.r.assign("model", self.model)
    y_pred = np.asarray(ro.r["predict"](self.model_, r_X))
    # make sure that the output is a 1d array
    y_pred = y_pred.ravel()
    del r_X

    numpy2ri.deactivate()
    pandas2ri.deactivate()
    gc.collect()
    ro.r("gc()")
    gc.collect()

    return y_pred

`flexcv_earth.model_postprocessing.EarthModelPostProcessor`

Source code in flexcv_earth/model_postprocessing.py

class EarthModelPostProcessor():
    def __init__(self):
        super().__init__()

    def __call__(self, results_all_folds, fold_result, features, run, *args, **kwargs):
        """Postprocessing function for the MARS model.
        Logs the parameters to Neptune.
        Creates a variable importance table and logs barplots to neptune.

        Args:
            results_all_folds: A dict of results for all folds
            fold_result: A dataclass containing the results for the current fold
            features: list of features
            run: neptune run object
            *args: any additional arguments
            **kwargs: any additional keyword arguments

        Returns:
            (dict): updated results dictionary
        """
        with plt.style.context("ggplot"):
            imp_df: pd.DataFrame = fold_result.best_model.get_variable_importance(features)
            run[f"{fold_result.model_name}/FeatImportance/Table"].append(File.as_html(imp_df))
            for col in imp_df.columns:
                # plot all rows of col where col is not 0
                fig = plt.figure()
                tmp = imp_df[col]
                tmp = tmp[tmp != 0]
                try:
                    tmp.plot.barh()
                    plt.title(f"{col} Variable Importance")
                    run[f"{fold_result.model_name}/FeatImportance/"].append(fig)
                except Exception as e:
                    logger.info(f"{e}")
                    logger.info(f"Could not make barplot for {fold_result.model_name}. Continuing.")
                del fig
                plt.close()

            run[f"{fold_result.model_name}/BestParams"].append(
                pformat(fold_result.best_params)
            )

        return results_all_folds

`flexcv_earth.model_postprocessing.EarthModelPostProcessor.call(results_all_folds, fold_result, features, run, *args, **kwargs)`

Postprocessing function for the MARS model. Logs the parameters to Neptune. Creates a variable importance table and logs barplots to neptune.

Parameters:

Name	Description	Default
`results_all_folds`	A dict of results for all folds	required
`fold_result`	A dataclass containing the results for the current fold	required
`features`	list of features	required
`run`	neptune run object	required
`*args`	any additional arguments	`()`
`**kwargs`	any additional keyword arguments	`{}`

Returns:

Type	Description
`dict`	updated results dictionary

Source code in flexcv_earth/model_postprocessing.py

def __call__(self, results_all_folds, fold_result, features, run, *args, **kwargs):
    """Postprocessing function for the MARS model.
    Logs the parameters to Neptune.
    Creates a variable importance table and logs barplots to neptune.

    Args:
        results_all_folds: A dict of results for all folds
        fold_result: A dataclass containing the results for the current fold
        features: list of features
        run: neptune run object
        *args: any additional arguments
        **kwargs: any additional keyword arguments

    Returns:
        (dict): updated results dictionary
    """
    with plt.style.context("ggplot"):
        imp_df: pd.DataFrame = fold_result.best_model.get_variable_importance(features)
        run[f"{fold_result.model_name}/FeatImportance/Table"].append(File.as_html(imp_df))
        for col in imp_df.columns:
            # plot all rows of col where col is not 0
            fig = plt.figure()
            tmp = imp_df[col]
            tmp = tmp[tmp != 0]
            try:
                tmp.plot.barh()
                plt.title(f"{col} Variable Importance")
                run[f"{fold_result.model_name}/FeatImportance/"].append(fig)
            except Exception as e:
                logger.info(f"{e}")
                logger.info(f"Could not make barplot for {fold_result.model_name}. Continuing.")
            del fig
            plt.close()

        run[f"{fold_result.model_name}/BestParams"].append(
            pformat(fold_result.best_params)
        )

    return results_all_folds

Welcome to flexcv-earth

Installation

Additional dependencies of rpy2

Use with flexcv

Reference

flexcv_earth.models.EarthRegressor

flexcv_earth.models.EarthRegressor.calc_variable_importance()

flexcv_earth.models.EarthRegressor.fit(X, y)

flexcv_earth.models.EarthRegressor.get_params(deep=False)

flexcv_earth.models.EarthRegressor.get_rmodel()

flexcv_earth.models.EarthRegressor.get_variable_importance(features)

flexcv_earth.models.EarthRegressor.make_r_plots()

flexcv_earth.models.EarthRegressor.predict(X)

flexcv_earth.model_postprocessing.EarthModelPostProcessor

flexcv_earth.model_postprocessing.EarthModelPostProcessor.__call__(results_all_folds, fold_result, features, run, *args, **kwargs)

Additional dependencies of `rpy2`

`flexcv_earth.models.EarthRegressor`

`flexcv_earth.models.EarthRegressor.calc_variable_importance()`

`flexcv_earth.models.EarthRegressor.fit(X, y)`

`flexcv_earth.models.EarthRegressor.get_params(deep=False)`

`flexcv_earth.models.EarthRegressor.get_rmodel()`

`flexcv_earth.models.EarthRegressor.get_variable_importance(features)`

`flexcv_earth.models.EarthRegressor.make_r_plots()`

`flexcv_earth.models.EarthRegressor.predict(X)`

`flexcv_earth.model_postprocessing.EarthModelPostProcessor`

`flexcv_earth.model_postprocessing.EarthModelPostProcessor.call(results_all_folds, fold_result, features, run, *args, **kwargs)`