`flexcv.model_selection`

This module lets you customize the objective function used for the hyperparameter optimization. To support a custom objective function, the inner CV loop is implemented as follows (pseudo code):

objective_cv(
    if n_jobs == -1:
        parallel_objective(some_kind_of_scorer)
    else:
        objective(some_kind_of_scorer)
)

`flexcv.model_selection.ObjectiveScorer`

Bases: Callable[[ndarray, ndarray, ndarray, ndarray], float]

Callable class that wraps a scorer function to be used as an objective function. The scorer function must match the following signature. Instantiating the class will check the signature.

Parameters:

Name	Type	Description	Default
`y_valid`	`ndarray`	The validation target values.	required
`y_pred`	`ndarray`	The predicted target values.	required
`y_train_in`	`ndarray`	The training target values.	required
`y_pred_train`	`ndarray`	The predicted training target values.	required

Returns:

Type	Description
`float`	The objective function value.
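
As an illustration of a function that satisfies this signature, the sketch below defines a hypothetical scorer, `mae_gap_scorer`, which is not part of flexcv. It adds the absolute train/validation gap to the validation mean absolute error. The metric and the weighting are assumptions chosen only for the example.

```python
import numpy as np


def mae_gap_scorer(y_valid, y_pred, y_train_in, y_pred_train) -> float:
    """Hypothetical scorer: validation MAE plus the absolute train/validation gap."""
    mae_valid = float(np.mean(np.abs(np.asarray(y_valid) - np.asarray(y_pred))))
    mae_train = float(np.mean(np.abs(np.asarray(y_train_in) - np.asarray(y_pred_train))))
    # Penalize any difference between training and validation error.
    return mae_valid + abs(mae_valid - mae_train)


score = mae_gap_scorer(
    y_valid=np.array([1.0, 2.0, 3.0]),
    y_pred=np.array([1.0, 2.0, 6.0]),
    y_train_in=np.array([1.0, 2.0]),
    y_pred_train=np.array([1.0, 2.0]),
)
print(score)  # validation MAE is 1.0 and training MAE is 0.0, so the score is 2.0
```

Because the argument names match the expected ones, passing this function as `ObjectiveScorer(mae_gap_scorer)` should pass the signature check that runs on instantiation.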

Source code in flexcv/model_selection.py

class ObjectiveScorer(
    Callable[[np.ndarray, np.ndarray, np.ndarray, np.ndarray], float]
):
    """Callable class that wraps a scorer function to be used as an objective function.
    The scorer function must match the following signature. Instantiating the class will check the signature.

    Args:
        y_valid (np.ndarray): The validation target values.
        y_pred (np.ndarray): The predicted target values.
        y_train_in (np.ndarray): The training target values.
        y_pred_train (np.ndarray): The predicted training target values.

    Returns:
        (float): The objective function value.

    """

    def __init__(
        self, scorer: Callable[[np.ndarray, np.ndarray, np.ndarray, np.ndarray], float]
    ):
        self.scorer = scorer
        self.check_signature()

    def __call__(
        self,
        y_valid: np.ndarray,
        y_pred: np.ndarray,
        y_train_in: np.ndarray,
        y_pred_train: np.ndarray,
    ) -> float:
        return self.scorer(y_valid, y_pred, y_train_in, y_pred_train)

    def check_signature(self):
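        """Raise a ValueError if the scorer's parameter names, kinds or count differ from the expected signature."""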
        expected_args = ["y_valid", "y_pred", "y_train_in", "y_pred_train"]
        signature = inspect.signature(self.scorer)
        for arg_name, param in signature.parameters.items():
            if arg_name not in expected_args:
                raise ValueError(
                    f"Invalid argument name '{arg_name}' in scorer function signature."
                )
            if param.kind != inspect.Parameter.POSITIONAL_OR_KEYWORD:
                raise ValueError(
                    f"Invalid parameter kind '{param.kind}' in scorer function signature."
                )
        if len(signature.parameters) != len(expected_args):
            raise ValueError(
                f"Invalid number of arguments in scorer function signature. Expected {len(expected_args)}, got {len(signature.parameters)}."
            )

`flexcv.model_selection.ObjectiveScorer.check_signature()`

Source code in flexcv/model_selection.py

def check_signature(self):
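    """Raise a ValueError if the scorer's parameter names, kinds or count differ from the expected signature."""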
    expected_args = ["y_valid", "y_pred", "y_train_in", "y_pred_train"]
    signature = inspect.signature(self.scorer)
    for arg_name, param in signature.parameters.items():
        if arg_name not in expected_args:
            raise ValueError(
                f"Invalid argument name '{arg_name}' in scorer function signature."
            )
        if param.kind != inspect.Parameter.POSITIONAL_OR_KEYWORD:
            raise ValueError(
                f"Invalid parameter kind '{param.kind}' in scorer function signature."
            )
    if len(signature.parameters) != len(expected_args):
        raise ValueError(
            f"Invalid number of arguments in scorer function signature. Expected {len(expected_args)}, got {len(signature.parameters)}."
        )
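
To show which signatures the check accepts, here is a standalone sketch that repeats the same three tests (name, parameter kind, count). It returns a boolean instead of raising `ValueError`. The helper `signature_is_accepted` and the four example functions are hypothetical and exist only for this illustration.

```python
import inspect

EXPECTED_ARGS = ["y_valid", "y_pred", "y_train_in", "y_pred_train"]


def signature_is_accepted(scorer) -> bool:
    """Illustrative re-implementation of the checks; returns False where the original raises."""
    parameters = inspect.signature(scorer).parameters
    for name, param in parameters.items():
        if name not in EXPECTED_ARGS:
            return False
        if param.kind != inspect.Parameter.POSITIONAL_OR_KEYWORD:
            return False
    return len(parameters) == len(EXPECTED_ARGS)


def good(y_valid, y_pred, y_train_in, y_pred_train):
    return 0.0


def wrong_name(y_true, y_pred, y_train_in, y_pred_train):
    return 0.0


def too_few(y_valid, y_pred):
    return 0.0


def swapped(y_pred, y_valid, y_train_in, y_pred_train):
    return 0.0


results = {
    func.__name__: signature_is_accepted(func)
    for func in (good, wrong_name, too_few, swapped)
}
print(results)
```

The check validates names and count but not order, so `swapped` is accepted. Since `__call__` passes the four arrays positionally, a scorer should declare its parameters in the documented order.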

`flexcv.model_selection.custom_scorer(y_valid, y_pred, y_train_in, y_pred_train)`

Objective scorer for the hyperparameter optimization. The function calculates the mean squared error (MSE) for both the validation and training data, and then calculates a weighted sum of the MSEs and their differences. The weights and thresholds used in the calculation are defined in the function. The function returns a float value that represents the objective function value. This function is used in the hyperparameter optimization process to evaluate the performance of different models with different hyperparameters.

Parameters:

Name	Type	Description	Default
`y_valid`	`ndarray`	The validation target values	required
`y_pred`	`ndarray`	Predicted target values	required
`y_train_in`	`ndarray`	Inner training target values	required
`y_pred_train`	`ndarray`	Inner predicted target values	required

Returns:

Type	Description
`float`	The objective function value.
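
To make the weighting concrete, the sketch below applies the same weighted sum to precomputed MSE values instead of arrays. The helper name `weighted_objective` and the input numbers are hypothetical. The weights and the `target_delta` of 0.05 are taken from the source of `custom_scorer`.

```python
def weighted_objective(mse_valid: float, mse_train: float, target_delta: float = 0.05) -> float:
    """Same weighted sum as custom_scorer, applied to precomputed MSE values."""
    mse_delta = mse_train - mse_valid
    return (
        1 * mse_valid
        + 0.5 * abs(mse_delta)
        + 2 * max(0, mse_delta - target_delta)
        + 1 * max(0, -mse_delta)
    )


# Validation error above training error: mse_delta is about -0.10.
# 0.30 + 0.5 * 0.10 + 2 * 0 + 1 * 0.10 is about 0.45
score_a = weighted_objective(mse_valid=0.30, mse_train=0.20)

# Training error above validation error by more than target_delta: mse_delta is about +0.10.
# 0.20 + 0.5 * 0.10 + 2 * 0.05 + 1 * 0 is about 0.35
score_b = weighted_objective(mse_valid=0.20, mse_train=0.30)

print(round(score_a, 2), round(score_b, 2))
```

In both cases the gap between training and validation error raises the score above the plain validation MSE.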

For hyperparameter tuning (inner cv loop) we use the following hierarchy:

objective_cv(
    if n_jobs == -1:
        parallel_objective(some_kind_of_scorer)
    else:
        objective(some_kind_of_scorer)
)

Source code in flexcv/model_selection.py

def custom_scorer(y_valid, y_pred, y_train_in, y_pred_train) -> float:
    """Objective scorer for the hyperparameter optimization.
    The function calculates the mean squared error (MSE) for both the validation and training data,
    and then calculates a weighted sum of the MSEs and their differences.
    The weights and thresholds used in the calculation are defined in the function.
    The function returns a float value that represents the objective function value.
    This function is used in the hyperparameter optimization process to evaluate the performance of different models with different hyperparameters.

    Args:
      y_valid (np.ndarray): The validation target values
      y_pred (np.ndarray): Predicted target values
      y_train_in (np.ndarray): Inner training target values
      y_pred_train (np.ndarray): Inner predicted target values

    Returns:
      (float): The objective function value.

    For hyperparameter tuning (inner cv loop) we use the following hierarchy:
        ```python
        objective_cv(
            if n_jobs == -1:
                parallel_objective(some_kind_of_scorer)
            else:
                objective(some_kind_of_scorer)
        )
        ```

    """

    mse_valid = mean_squared_error(y_valid, y_pred)
    mse_train = mean_squared_error(y_train_in, y_pred_train)

    mse_delta = mse_train - mse_valid
    target_delta = 0.05

    return (
        1 * mse_valid
        + 0.5 * abs(mse_delta)
        + 2 * max(0, (mse_delta - target_delta))
        + 1 * max(0, -mse_delta)
    )

`flexcv.model_selection.objective(X_train_in, y_train_in, X_valid, y_valid, pipe, params, objective_scorer)`

Objective function for the hyperparameter optimization. Sets the parameters of the pipeline and fits it to the training data. Predicts the validation and training data and calculates the MSE for both. Then applies the objective scorer to the targets and predictions of both sets, which returns the objective function value. Returns the negated validation and training MSEs as well as the negated objective function value, because the values are intended for an Optuna study that maximizes its objective (Optuna minimizes by default, so the study must be created with `direction="maximize"`). This function is called directly from the objective_cv function when n_jobs is not -1, and through parallel_objective otherwise.

Parameters:

Name	Type	Description	Default
`X_train_in`	`DataFrame or ndarray`	The training data.	required
`y_train_in`	`DataFrame or ndarray`	The training target values.	required
`X_valid`	`DataFrame or ndarray`	The validation data.	required
`y_valid`	`DataFrame or ndarray`	The validation target values.	required
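`pipe`	`Pipeline`	The pipeline to be used for the training.	required
`params`	`dict`	The parameters to set on the pipeline via `set_params`.	required
`objective_scorer`	`ObjectiveScorer`	Callable that computes the objective function value.	required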

Returns:

Type	Description
`tuple`	A tuple containing the negative validation MSE, the negative training MSE and the negative objective function value.
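
The sign convention can be seen in a small self-contained sketch that mirrors the steps described above (configure, fit, predict, score, negate). `MeanShiftModel`, `fold_objective`, `mse` and `validation_mse_scorer` are hypothetical stand-ins for a real pipeline, for `objective`, for `mean_squared_error` and for a wrapped scorer. They are not part of flexcv.

```python
import numpy as np


class MeanShiftModel:
    """Hypothetical stand-in for a pipeline: predicts the training mean plus a shift."""

    def __init__(self):
        self.shift = 0.0
        self.mean_ = 0.0

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_ + self.shift)


def mse(y_true, y_pred) -> float:
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


def validation_mse_scorer(y_valid, y_pred, y_train_in, y_pred_train) -> float:
    return mse(y_valid, y_pred)


def fold_objective(X_train_in, y_train_in, X_valid, y_valid, pipe, params, scorer):
    """Configure, fit, predict, score, and return all three scores negated."""
    pipe.set_params(**params)
    pipe.fit(X_train_in, y_train_in)
    y_pred = pipe.predict(X_valid)
    y_pred_train = pipe.predict(X_train_in)
    score_of = scorer(y_valid, y_pred, y_train_in, y_pred_train)
    return -mse(y_valid, y_pred), -mse(y_train_in, y_pred_train), -score_of


X_train, y_train = np.zeros((4, 1)), np.array([1.0, 1.0, 3.0, 3.0])
X_valid, y_valid = np.zeros((2, 1)), np.array([2.0, 4.0])

neg_valid, neg_train, neg_of = fold_objective(
    X_train, y_train, X_valid, y_valid, MeanShiftModel(), {"shift": 0.0}, validation_mse_scorer
)
print(neg_valid, neg_train, neg_of)
```

All three returned values are less than or equal to zero, so a larger value corresponds to a smaller error.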

Inner CV pseudo code

objective_cv(
    if n_jobs == -1:
        parallel_objective(some_kind_of_scorer)
    else:
        objective(some_kind_of_scorer)
)

Source code in flexcv/model_selection.py

def objective(
    X_train_in,
    y_train_in,
    X_valid,
    y_valid,
    pipe,
    params,
    objective_scorer: ObjectiveScorer,
):
    """Objective function for the hyperparameter optimization.
    Sets the parameters of the pipeline and fits it to the training data.
    Predicts the validation data and calculates the MSE for both the validation and training data.
    Then applies the objective scorer to the targets and predictions of both sets, which returns the objective function value.
    Returns the negated validation and training MSEs as well as the negated objective function value, because the values are intended for an Optuna study that maximizes its objective.
    This function is called directly from the objective_cv function when n_jobs is not -1, and through parallel_objective otherwise.

    Args:
        X_train_in (pd.DataFrame or np.ndarray): The training data.
        y_train_in (pd.DataFrame or np.ndarray): The training target values.
        X_valid (pd.DataFrame or np.ndarray): The validation data.
        y_valid (pd.DataFrame or np.ndarray): The validation target values.
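        pipe (Pipeline): The pipeline to be used for the training.
        params (dict): The parameters to set on the pipeline via set_params.
        objective_scorer (ObjectiveScorer): Callable that computes the objective function value.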

    Returns:
        (tuple): A tuple containing the negative validation MSE, the negative training MSE and the negative objective function value.

    Inner CV pseudo code:
        ```python
        objective_cv(
            if n_jobs == -1:
                parallel_objective(some_kind_of_scorer)
            else:
                objective(some_kind_of_scorer)
        )
        ```

    """

    pipe.set_params(**params)

    pipe.fit(X_train_in, y_train_in)

    y_pred = pipe.predict(X_valid)
    y_pred_train = pipe.predict(X_train_in)

    score_valid = mean_squared_error(y_valid, y_pred)
    score_train = mean_squared_error(y_train_in, y_pred_train)
    score_of = objective_scorer(y_valid, y_pred, y_train_in, y_pred_train)

    return -score_valid, -score_train, -score_of

`flexcv.model_selection.objective_cv(trial, cross_val_split, pipe, params, X, y, run, n_jobs, objective_scorer)`

Objective function for the hyperparameter optimization with cross validation. n_jobs controls the parallelization: if n_jobs is -1, the folds are evaluated in a multiprocessing pool with one process per available CPU. For any other value, the objective function is called sequentially for each fold.

Parameters:

Name	Type	Description	Default
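`trial`	`Trial`	Optuna trial object.	required
`cross_val_split`	`function`	Function that yields (train, validation) index pairs for the cross validation split. It is called as `cross_val_split(X=X, y=y)`.	required
`pipe`	`Pipeline`	The pipeline to be used for the training.	required
`params`	`dict`	Dictionary mapping pipeline parameter names to the distributions from which the trial samples their values.	required
`X`	`DataFrame or ndarray`	Features. Rows are selected with `.iloc`, so a pandas object is expected in practice.	required
`y`	`DataFrame or ndarray`	Target. Rows are selected with `.iloc`, so a pandas object is expected in practice.	required
`run`	`run`	Neptune run object (currently unused by this function).	required
`n_jobs`	`int`	Controls whether the folds run in parallel (-1, one process per CPU) or sequentially (any other value).	required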
`objective_scorer`	`ObjectiveScorer`	Callable class that wraps a scorer function to be used as an objective function.	required

Returns:

Type	Description
`float`	The mean objective function value across folds. Note: the fold scores are averaged by default. To use the RMSE as the objective function, average the MSEs first and then take the square root.
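
The note on RMSE can be checked numerically. The sketch below uses hypothetical per-fold values in the form returned by `objective` (negative MSEs) and compares the square root of the mean MSE with the mean of the per-fold RMSEs.

```python
import math

# Hypothetical per-fold values in the form returned by objective(): negative MSEs.
neg_mse_per_fold = [-1.0, -4.0, -9.0]

# Average the MSEs across folds first, then take the square root.
mean_mse = -sum(neg_mse_per_fold) / len(neg_mse_per_fold)
rmse_from_mean_mse = math.sqrt(mean_mse)

# Averaging the per-fold RMSEs instead gives a different number.
mean_of_fold_rmse = sum(math.sqrt(-value) for value in neg_mse_per_fold) / len(neg_mse_per_fold)

print(round(rmse_from_mean_mse, 4), mean_of_fold_rmse)
```

The two quantities differ whenever the fold MSEs are not all equal, which is why the order of averaging and taking the root matters.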

Inner CV pseudo code

objective_cv(
    if n_jobs == -1:
        parallel_objective(some_kind_of_scorer)
    else:
        objective(some_kind_of_scorer)
)

Source code in flexcv/model_selection.py

def objective_cv(
    trial, cross_val_split, pipe, params, X, y, run, n_jobs, objective_scorer
):
    """Objective function for the hyperparameter optimization with cross validation.
    n_jobs controls the parallelization.
    If n_jobs is -1, the folds are evaluated in a multiprocessing pool with one process per available CPU.
    For any other value, the objective function is called sequentially for each fold.

    Args:
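      trial (optuna.trial.Trial): Optuna trial object.
      cross_val_split (function): Function that yields (train, validation) index pairs for the cross validation split. It is called as cross_val_split(X=X, y=y).
      pipe (Pipeline): The pipeline to be used for the training.
      params (dict): Dictionary mapping pipeline parameter names to the distributions from which the trial samples their values.
      X (pd.DataFrame or np.ndarray): Features. Rows are selected with .iloc, so a pandas object is expected in practice.
      y (pd.DataFrame or np.ndarray): Target. Rows are selected with .iloc, so a pandas object is expected in practice.
      run (neptune.run): Neptune run object (currently unused by this function).
      n_jobs (int): Controls whether the folds run in parallel (-1, one process per CPU) or sequentially (any other value).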
      objective_scorer (ObjectiveScorer): Callable class that wraps a scorer function to be used as an objective function.


    Returns:
      (float): The mean objective function value across folds. Note: the fold scores are averaged by default. To use the RMSE as the objective function, average the MSEs first and then take the square root.

    Inner CV pseudo code:
        ```python
        objective_cv(
            if n_jobs == -1:
                parallel_objective(some_kind_of_scorer)
            else:
                objective(some_kind_of_scorer)
        )
        ```

    """

    params_ = {
        name: trial._suggest(name, distribution)
        for name, distribution in params.items()
    }

    scores_valid = []
    scores_train = []
    scores_OF = []

    if n_jobs == -1:
        # Define the number of processes to use and create a pool
        num_processes = multiprocessing.cpu_count()
        pool = multiprocessing.Pool(processes=num_processes)

        # Map the parallel function to the cross validation split
        results = pool.starmap(
            parallel_objective,
            [
                (train_idx, valid_idx, X, y, pipe, params_, objective_scorer)
                for train_idx, valid_idx in cross_val_split(X=X, y=y)
            ],
        )
        pool.close()

        for result in results:
            scores_valid.append(result[0])
            scores_train.append(result[1])
            scores_OF.append(result[2])
    else:
        for train_idx, valid_idx in cross_val_split(X=X, y=y):
            X_train_in = X.iloc[train_idx]
            y_train_in = y.iloc[train_idx]

            X_valid = X.iloc[valid_idx]
            y_valid = y.iloc[valid_idx]

            score_valid, score_train, score_OF = objective(
                X_train_in,
                y_train_in,
                X_valid,
                y_valid,
                pipe,
                params_,
                objective_scorer,
            )

            scores_valid.append(score_valid)
            scores_train.append(score_train)
            scores_OF.append(score_OF)

    trial.set_user_attr("mean_test_score", np.mean(scores_valid))
    trial.set_user_attr("mean_train_score", np.mean(scores_train))
    trial.set_user_attr("mean_OF_score", np.mean(scores_OF))

    return np.mean(scores_OF)

`flexcv.model_selection.parallel_objective(train_idx, valid_idx, X, y, pipe, params_, objective_scorer)`

Objective function for the hyperparameter optimization to be used with multiprocessing.Pool.starmap. Receives the training and validation indices together with the full data, selects the corresponding rows and calls the objective function. Is called from the objective_cv function if n_jobs is set to -1.

Parameters:

Name	Type	Description	Default
`train_idx`	`ndarray`	The training indices.	required
`valid_idx`	`ndarray`	The validation indices.	required
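`X`	`DataFrame or ndarray`	The data. Rows are selected with `.iloc`, so a pandas object is expected in practice.	required
`y`	`DataFrame or ndarray`	The target values. Rows are selected with `.iloc`, so a pandas object is expected in practice.	required
`pipe`	`Pipeline`	The pipeline to be used for the training.	required
`params_`	`dict`	The sampled parameters to set on the pipeline.	required
`objective_scorer`	`ObjectiveScorer`	Callable that computes the objective function value.	required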

Returns:

Type	Description
`tuple`	A tuple containing the negative validation MSE, the negative training MSE and the negative objective function value, as returned by `objective`.
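
The way per-fold argument tuples are built, unpacked and collected can be sketched without spawning processes. The example below uses `itertools.starmap` as a sequential stand-in for `multiprocessing.Pool.starmap`, since both unpack each tuple into positional arguments. `fold_worker`, the toy data and the fold indices are hypothetical.

```python
from itertools import starmap


def fold_worker(train_idx, valid_idx, y):
    """Hypothetical worker: predict the training mean and return negative MSEs."""
    train = [y[i] for i in train_idx]
    valid = [y[i] for i in valid_idx]
    mean = sum(train) / len(train)
    mse_train = sum((value - mean) ** 2 for value in train) / len(train)
    mse_valid = sum((value - mean) ** 2 for value in valid) / len(valid)
    # Here the objective value is simply the validation MSE.
    return -mse_valid, -mse_train, -mse_valid


y = [1.0, 3.0, 2.0, 4.0]
folds = [([0, 1], [2, 3]), ([2, 3], [0, 1])]

# One argument tuple per fold, unpacked positionally by starmap.
args = [(train_idx, valid_idx, y) for train_idx, valid_idx in folds]
results = list(starmap(fold_worker, args))

# Transpose the per-fold tuples into three per-metric sequences.
scores_valid, scores_train, scores_of = zip(*results)
print(scores_valid, scores_train, scores_of)
```

With a real process pool the worker and all of its arguments must be picklable, which is one reason to keep the worker a module-level function.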

Inner CV pseudo code

objective_cv(
    if n_jobs == -1:
        parallel_objective(some_kind_of_scorer)
    else:
        objective(some_kind_of_scorer)
)

Source code in flexcv/model_selection.py

def parallel_objective(
    train_idx, valid_idx, X, y, pipe, params_, objective_scorer: ObjectiveScorer
):
    """Objective function for the hyperparameter optimization to be used with multiprocessing.Pool.starmap.
    Receives the training and validation indices together with the full data, selects the corresponding rows and calls the objective function.
    Is called from the objective_cv function if n_jobs is set to -1.

    Args:
        train_idx (ndarray): The training indices.
        valid_idx (ndarray): The validation indices.
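        X (pd.DataFrame or np.ndarray): The data. Rows are selected with .iloc, so a pandas object is expected in practice.
        y (pd.DataFrame or np.ndarray): The target values. Rows are selected with .iloc, so a pandas object is expected in practice.
        pipe (Pipeline): The pipeline to be used for the training.
        params_ (dict): The sampled parameters to set on the pipeline.
        objective_scorer (ObjectiveScorer): Callable that computes the objective function value.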

    Returns:
      (tuple): A tuple containing the negative validation MSE, the negative training MSE and the negative objective function value, as returned by objective.

    Inner CV pseudo code:
        ```python
        objective_cv(
            if n_jobs == -1:
                parallel_objective(some_kind_of_scorer)
            else:
                objective(some_kind_of_scorer)
        )
        ```

    """
    X_train_in = X.iloc[train_idx]
    y_train_in = y.iloc[train_idx]

    X_valid = X.iloc[valid_idx]
    y_valid = y.iloc[valid_idx]

    score_valid, score_train, score_OF = objective(
        X_train_in, y_train_in, X_valid, y_valid, pipe, params_, objective_scorer
    )

    return score_valid, score_train, score_OF