flexcv.split

To switch cross validation split methods dynamically, we need a function that returns the correct split callable, because the individual split methods take different numbers of arguments. This module also implements a custom stratified cross validation split method for continuous target variables, as well as a custom stratified group cross validation split method for continuous targets that incorporates grouping information.

flexcv.split.CrossValMethod

Bases: Enum

Enum class to assign CrossValMethods to the cross_val() function. This is useful to return the correct splitting function depending on the cross val method.

Members
  • KFOLD: Regular sklearn KFold cross validation. No grouping information is used.
  • GROUP: Regular sklearn GroupKFold cross validation. Grouping information is used.
  • STRAT: Regular sklearn StratifiedKFold cross validation. No grouping information is used.
  • STRATGROUP: Regular sklearn StratifiedGroupKFold cross validation. Grouping information is used.
  • CONTISTRAT: Stratified cross validation for continuous targets. No grouping information is used.
  • CONTISTRATGROUP: Stratified cross validation for continuous targets. Grouping information is used.
  • CONCATSTRATKFOLD: Stratified cross validation. Leaky stratification on element-wise-concatenated target and group labels.
Source code in flexcv/split.py
class CrossValMethod(Enum):
    """Enum class to assign CrossValMethods to the cross_val() function.
    This is useful to return the correct splitting function depending on the cross val method.

    Members:
        - `KFOLD`: Regular sklearn `KFold` cross validation. No grouping information is used.
        - `GROUP`: Regular sklearn `GroupKFold` cross validation. Grouping information is used.
        - `STRAT`: Regular sklearn `StratifiedKFold` cross validation. No grouping information is used.
        - `STRATGROUP`: Regular sklearn `StratifiedGroupKFold` cross validation. Grouping information is used.
        - `CONTISTRAT`: Stratified cross validation for continuous targets. No grouping information is used.
        - `CONTISTRATGROUP`: Stratified cross validation for continuous targets. Grouping information is used.
        - `CONCATSTRATKFOLD`: Stratified cross validation. Leaky stratification on element-wise-concatenated target and group labels.
    """

    KFOLD = "KFold"
    GROUP = "GroupKFold"
    STRAT = "StratifiedKFold"
    STRATGROUP = "StratifiedGroupKFold"
    CONTISTRAT = "ContinuousStratifiedKFold"
    CONTISTRATGROUP = "ContinuousStratifiedGroupKFold"
    CONCATSTRATKFOLD = "ConcatenatedStratifiedKFold"
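
A minimal sketch of how the enum is typically referenced (assuming flexcv is installed and importable); each member's value is the name of the underlying splitter:

from flexcv.split import CrossValMethod

method = CrossValMethod.STRATGROUP
print(method.name)   # -> STRATGROUP
print(method.value)  # -> StratifiedGroupKFold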

flexcv.split.make_cross_val_split(*, groups, method, n_splits=5, random_state=42)

This function creates and returns a callable cross validation splitter based on the specified method.

Parameters:

  • groups (pd.Series | None): A pd.Series containing the grouping information for the samples. Required.
  • method (CrossValMethod): A CrossValMethod enum value specifying the cross validation method to use. Required.
  • n_splits (int): Number of splits. Default: 5.
  • random_state (int): A random seed to control random processes. Default: 42.

Returns:

  • (Callable): A callable cross validation splitter based on the specified method.

Raises:

  • (TypeError): If the given method is neither a CrossValMethod member, a scikit-learn cross validator, nor an iterator of splits.

Source code in flexcv/split.py
def make_cross_val_split(
    *,
    groups: pd.Series | None,
    method: CrossValMethod,
    n_splits: int = 5,
    random_state: int = 42,
) -> Callable[..., Iterator[tuple[ndarray, ndarray]]]:
    """This function creates and returns a callable cross validation splitter based on the specified method.

    Args:
      groups (pd.Series | None): A pd.Series containing the grouping information for the samples.
      method (CrossValMethod): A CrossValMethod enum value specifying the cross validation method to use.
      n_splits (int): Number of splits (Default value = 5)
      random_state (int): A random seed to control random processes (Default value = 42)

    Returns:
      (Callable): A callable cross validation splitter based on the specified method.

    Raises:
      (TypeError): If the given method is not a valid CrossValMethod, scikit-learn cross validator, or iterator of splits.

    """

    match method:
        case CrossValMethod.KFOLD:
            kf = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
            return kf.split

        case CrossValMethod.STRAT:
            strat_skf = StratifiedKFold(
                n_splits=n_splits, random_state=random_state, shuffle=True
            )
            return strat_skf.split

        case CrossValMethod.CONTISTRAT:
            conti_skf = ContinuousStratifiedKFold(
                n_splits=n_splits, random_state=random_state, shuffle=True
            )
            return conti_skf.split

        case CrossValMethod.GROUP:
            gkf = GroupKFold(n_splits=n_splits)
            return partial(gkf.split, groups=groups)

        case CrossValMethod.STRATGROUP:
            strat_gkf = StratifiedGroupKFold(
                n_splits=n_splits, random_state=random_state, shuffle=True
            )
            return partial(strat_gkf.split, groups=groups)

        case CrossValMethod.CONTISTRATGROUP:
            conti_sgkf = ContinuousStratifiedGroupKFold(
                n_splits=n_splits, random_state=random_state, shuffle=True
            )
            return partial(conti_sgkf.split, groups=groups)

        case CrossValMethod.CONCATSTRATKFOLD:
            concat_skf = ConcatenatedStratifiedKFold(
                n_splits=n_splits, random_state=random_state, shuffle=True
            )
            return partial(concat_skf.split, groups=groups)

        case _:
            is_cross_validator = isinstance(method, BaseCrossValidator)
            is_groups_consumer = isinstance(method, GroupsConsumerMixin)

            if is_cross_validator and is_groups_consumer:
                return partial(method.split, groups=groups)

            if is_cross_validator:
                return method.split

            if isinstance(method, Iterator):
                return method

            else:
                raise TypeError("Invalid Cross Validation method given.")
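
A short usage sketch with toy data invented for illustration (assumes flexcv, numpy, pandas, and scikit-learn are installed). It requests the continuous stratified group splitter and iterates over the resulting folds:

import numpy as np
import pandas as pd
from flexcv.split import CrossValMethod, make_cross_val_split

rng = np.random.default_rng(42)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = pd.Series(rng.normal(size=100))               # continuous target
groups = pd.Series(np.repeat(np.arange(10), 10))  # 10 groups of 10 samples

split_fn = make_cross_val_split(
    groups=groups, method=CrossValMethod.CONTISTRATGROUP, n_splits=5
)
for train_idx, test_idx in split_fn(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]  # fit/evaluate here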

flexcv.split.string_to_crossvalmethod(method)

Converts a string to a CrossValMethod enum member.

Parameters:

  • method (str): The string to convert. Required.

Returns:

  • (CrossValMethod): The CrossValMethod enum value.

Raises:

  • (TypeError): If the given string does not match any CrossValMethod.

Source code in flexcv/split.py
def string_to_crossvalmethod(method: str) -> CrossValMethod:
    """Converts a string to a CrossValMethod enum member.

    Args:
      method (str): The string to convert.

    Returns:
      (CrossValMethod): The CrossValMethod enum value.

    Raises:
      (TypeError): If the given string does not match any CrossValMethod.

    """
    keys = [e.value for e in CrossValMethod]
    values = [e for e in CrossValMethod]
    method_dict = dict(zip(keys, values))

    if method in method_dict:
        return method_dict[method]
    else:
        raise TypeError("Invalid Cross Validation method given.")
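
A minimal sketch: the lookup maps a splitter name (the enum value) back to the corresponding member:

from flexcv.split import CrossValMethod, string_to_crossvalmethod

method = string_to_crossvalmethod("StratifiedGroupKFold")
assert method is CrossValMethod.STRATGROUP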

flexcv.stratification

This module implements stratification methods that can be used for regression on hierarchical data (i.e. where the target is continuous and the data is grouped).

flexcv.stratification.ConcatenatedStratifiedKFold

Bases: GroupsConsumerMixin, BaseCrossValidator

Group Concatenated Continuous Stratified k-Folds cross validator. This is a variation of StratifiedKFold that uses a concatenation of target and grouping variable.

- The target is discretized.
- Each discrete target label is cast to str and concatenated with the grouping label.
- Stratification is applied to this new, temporary concatenated target.
- This preserves both the group and target distributions in each fold, keeping them roughly equal to the input distribution.
- The procedure allows overlapping groups, which could be interpreted as data leakage in many cases.
- The population (i.e. the input data set) distribution leaks into the folds' distributions.
Source code in flexcv/stratification.py
class ConcatenatedStratifiedKFold(GroupsConsumerMixin, BaseCrossValidator):
    """Group Concatenated Continuous Stratified k-Folds cross validator.
    This is a variation of StratifiedKFold that uses a concatenation of target and grouping variable.

        - The target is discretized.
        - Each discrete target label is cast to str and concatenated with the grouping label.
        - Stratification is applied to this new, temporary concatenated target.
        - This preserves both the group and target distributions in each fold, keeping them roughly equal to the input distribution.
        - The procedure allows overlapping groups, which could be interpreted as data leakage in many cases.
        - The population (i.e. the input data set) distribution leaks into the folds' distributions.
    """

    def __init__(self, n_splits, shuffle=True, random_state=42, groups=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state
        self.groups = groups

    def split(self, X, y, groups=None):
        """Generate indices to split data into training and test set.
        Applies target discretization, row-wise concatenation with the group label, and stratification on this temporary concatenated column.

        Args:
          X (array-like): Features
          y (array-like): target
          groups (array-like): Grouping variable (Default value = None)

        Returns:
            (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
        """
        self.skf = StratifiedKFold(
            n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
        )
        assert y is not None, "y cannot be None"
        kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
        if isinstance(y, pd.Series):
            y_cat = (
                kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
            )
            y_cat = pd.Series(y_cat, index=y.index)
        else:
            y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore
        # concatenate y_cat and groups such that the stratification is done on both
        # elementwise concatenation of three arrays
        try:
            y_concat = y_cat.astype(str) + "_" + groups.astype(str)
        except UFuncTypeError:
            # Why easy when you can do it the hard way?
            y_concat = np.core.defchararray.add(
                np.core.defchararray.add(y_cat.astype(str), "_"), groups.astype(str)
            )

        return self.skf.split(X, y_concat)

    def get_n_splits(self, X=None, y=None, groups=None):
        """

        Args:
          X (array-like): Features
          y (array-like): target values. (Default value = None)
          groups (array-like): grouping values. (Default value = None)

        Returns:
         (int) : The number of splitting iterations in the cross-validator.
        """
        return self.n_splits
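
A minimal usage sketch with invented toy data (assuming flexcv, numpy, and scikit-learn are installed). Note that, as described above, a group may appear in both the training and the test set:

import numpy as np
from flexcv.stratification import ConcatenatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.normal(size=500)               # continuous target
groups = np.repeat(np.arange(2), 250)  # 2 groups; folds may contain both

cv = ConcatenatedStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(X, y, groups):
    ...  # each fold roughly preserves the joint target/group distribution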

flexcv.stratification.ConcatenatedStratifiedKFold.get_n_splits(X=None, y=None, groups=None)

Parameters:

  • X (array-like): Features. Default: None.
  • y (array-like): Target values. Default: None.
  • groups (array-like): Grouping values. Default: None.

Returns:

  • (int): The number of splitting iterations in the cross-validator.

Source code in flexcv/stratification.py
def get_n_splits(self, X=None, y=None, groups=None):
    """

    Args:
      X (array-like): Features
      y (array-like): target values. (Default value = None)
      groups (array-like): grouping values. (Default value = None)

    Returns:
     (int) : The number of splitting iterations in the cross-validator.
    """
    return self.n_splits

flexcv.stratification.ConcatenatedStratifiedKFold.split(X, y, groups=None)

Generate indices to split data into training and test set. Applies target discretization, row-wise concatenation with the group label, and stratification on this temporary concatenated column.

Parameters:

  • X (array-like): Features. Required.
  • y (array-like): Target. Required.
  • groups (array-like): Grouping variable. Default: None.

Returns:

  • (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.

Source code in flexcv/stratification.py
def split(self, X, y, groups=None):
    """Generate indices to split data into training and test set.
    Applies target discretization, row-wise concatenation with the group label, and stratification on this temporary concatenated column.

    Args:
      X (array-like): Features
      y (array-like): target
      groups (array-like): Grouping variable (Default value = None)

    Returns:
        (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
    """
    self.skf = StratifiedKFold(
        n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
    )
    assert y is not None, "y cannot be None"
    kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
    if isinstance(y, pd.Series):
        y_cat = (
            kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
        )
        y_cat = pd.Series(y_cat, index=y.index)
    else:
        y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore
    # concatenate y_cat and groups such that the stratification is done on both
    # elementwise concatenation of three arrays
    try:
        y_concat = y_cat.astype(str) + "_" + groups.astype(str)
    except UFuncTypeError:
        # Why easy when you can do it the hard way?
        y_concat = np.core.defchararray.add(
            np.core.defchararray.add(y_cat.astype(str), "_"), groups.astype(str)
        )

    return self.skf.split(X, y_concat)

flexcv.stratification.ContinuousStratifiedGroupKFold

Bases: GroupsConsumerMixin, BaseCrossValidator

Continuous Stratified Group k-Folds cross validator. This is a variation of StratifiedKFold that

- makes a temporary discretization of the target variable,
- applies stratified group k-fold based on the passed groups and the discretized target,
- does not further use this discretized target,
- tries to preserve the percentage of samples in each percentile per group, given the constraint of non-overlapping groups.

Source code in flexcv/stratification.py
class ContinuousStratifiedGroupKFold(GroupsConsumerMixin, BaseCrossValidator):
    """Continuous Stratified Group k-Folds cross validator.
    This is a variation of StratifiedKFold that
        - makes a temporary discretization of the target variable.
        - applies stratified group k-fold based on the passed groups and the discretized target.
        - does not further use this discretized target.
        - tries to preserve the percentage of samples in each percentile per group, given the constraint of non-overlapping groups.
    """

    def __init__(self, n_splits, shuffle=True, random_state=42, groups=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state
        self.groups = groups

    def split(self, X, y, groups=None):
        """Generate indices to split data into training and test set.
        The data is first grouped by groups and then split into n_splits folds. The folds are made by preserving the percentage of samples for each class.
        This is a variation of StratifiedGroupKFold that uses a custom discretization of the target variable.

        Args:
          X (array-like): Features
          y (array-like): target
          groups (array-like): Grouping/clustering variable (Default value = None)

        Returns:
            (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
        """
        self.sgkf = StratifiedGroupKFold(
            n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
        )
        assert y is not None, "y cannot be None"
        kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
        if isinstance(y, pd.Series):
            y_cat = (
                kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
            )
            y_cat = pd.Series(y_cat, index=y.index)
        else:
            y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore
        return self.sgkf.split(X, y_cat, groups)

    def get_n_splits(self, X=None, y=None, groups=None):
        """
        Returns the number of splitting iterations in the cross-validator.

        Returns:
          (int): The number of splitting iterations in the cross-validator.
        """
        return self.n_splits
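
A minimal usage sketch with invented toy data (assuming flexcv, numpy, and scikit-learn are installed). In contrast to the concatenated variant, groups never overlap between training and test indices:

import numpy as np
from flexcv.stratification import ContinuousStratifiedGroupKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)               # continuous target
groups = np.repeat(np.arange(10), 10)  # 10 non-overlapping groups

cv = ContinuousStratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(X, y, groups):
    # no group appears on both sides of the split
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])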

flexcv.stratification.ContinuousStratifiedGroupKFold.get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

Returns:

  • (int): The number of splitting iterations in the cross-validator.

Source code in flexcv/stratification.py
def get_n_splits(self, X=None, y=None, groups=None):
    """
    Returns the number of splitting iterations in the cross-validator.

    Returns:
      (int): The number of splitting iterations in the cross-validator.
    """
    return self.n_splits

flexcv.stratification.ContinuousStratifiedGroupKFold.split(X, y, groups=None)

Generate indices to split data into training and test set. The data is first grouped by groups and then split into n_splits folds. The folds are made by preserving the percentage of samples for each class. This is a variation of StratifiedGroupKFold that uses a custom discretization of the target variable.

Parameters:

  • X (array-like): Features. Required.
  • y (array-like): Target. Required.
  • groups (array-like): Grouping/clustering variable. Default: None.

Returns:

  • (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.

Source code in flexcv/stratification.py
def split(self, X, y, groups=None):
    """Generate indices to split data into training and test set.
    The data is first grouped by groups and then split into n_splits folds. The folds are made by preserving the percentage of samples for each class.
    This is a variation of StratifiedGroupKFold that uses a custom discretization of the target variable.

    Args:
      X (array-like): Features
      y (array-like): target
      groups (array-like): Grouping/clustering variable (Default value = None)

    Returns:
        (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
    """
    self.sgkf = StratifiedGroupKFold(
        n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
    )
    assert y is not None, "y cannot be None"
    kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
    if isinstance(y, pd.Series):
        y_cat = (
            kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
        )
        y_cat = pd.Series(y_cat, index=y.index)
    else:
        y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore
    return self.sgkf.split(X, y_cat, groups)

flexcv.stratification.ContinuousStratifiedKFold

Bases: BaseCrossValidator

Continuous Stratified k-Folds cross validator, i.e. it works with continuous target variables instead of multiclass targets.

This is a variation of StratifiedKFold that

- makes a copy of the target variable and discretizes it.
- applies stratified k-folds based on this discrete target to ensure equal percentile distribution across folds
- does not further use or pass this discrete target.
- does not apply grouping rules.
Source code in flexcv/stratification.py
class ContinuousStratifiedKFold(BaseCrossValidator):
    """Continuous Stratified k-Folds cross validator, i.e. it works with *continuous* target variables instead of multiclass targets.

    This is a variation of StratifiedKFold that

        - makes a copy of the target variable and discretizes it.
        - applies stratified k-folds based on this discrete target to ensure equal percentile distribution across folds
        - does not further use or pass this discrete target.
        - does not apply grouping rules.
    """

    def __init__(self, n_splits, shuffle=True, random_state=42, groups=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state
        self.groups = groups

    def split(self, X, y, groups=None):
        """Generate indices to split data into training and test set.
        The folds are made by preserving the percentage of samples for each class.
        This is a variation of StratifiedKFold that uses a custom discretization of the target variable.

        Args:
          X (array-like): Features
          y (array-like): target
          groups (array-like): Grouping variable (Default value = None)

        Returns:
            (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
        """
        self.skf = StratifiedKFold(
            n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
        )
        assert y is not None, "y cannot be None"
        kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
        if isinstance(y, pd.Series):
            y_cat = (
                kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
            )
            y_cat = pd.Series(y_cat, index=y.index)
        else:
            y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore

        return self.skf.split(X, y_cat)

    def get_n_splits(self, X=None, y=None, groups=None):
        """

        Args:
          X (array-like): Features
          y (array-like): target values. (Default value = None)
          groups (array-like): grouping values. (Default value = None)

        Returns:
         (int) : The number of splitting iterations in the cross-validator.
        """
        return self.n_splits
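
A minimal usage sketch with invented toy data (assuming flexcv, numpy, and scikit-learn are installed); no grouping information is needed:

import numpy as np
from flexcv.stratification import ContinuousStratifiedKFold

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)  # continuous target

cv = ContinuousStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(X, y):
    ...  # each fold's target distribution roughly matches the full sample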

flexcv.stratification.ContinuousStratifiedKFold.get_n_splits(X=None, y=None, groups=None)

Parameters:

  • X (array-like): Features. Default: None.
  • y (array-like): Target values. Default: None.
  • groups (array-like): Grouping values. Default: None.

Returns:

  • (int): The number of splitting iterations in the cross-validator.

Source code in flexcv/stratification.py
def get_n_splits(self, X=None, y=None, groups=None):
    """

    Args:
      X (array-like): Features
      y (array-like): target values. (Default value = None)
      groups (array-like): grouping values. (Default value = None)

    Returns:
     (int) : The number of splitting iterations in the cross-validator.
    """
    return self.n_splits

flexcv.stratification.ContinuousStratifiedKFold.split(X, y, groups=None)

Generate indices to split data into training and test set. The folds are made by preserving the percentage of samples for each class. This is a variation of StratifiedKFold that uses a custom discretization of the target variable.

Parameters:

  • X (array-like): Features. Required.
  • y (array-like): Target. Required.
  • groups (array-like): Grouping variable. Default: None.

Returns:

  • (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.

Source code in flexcv/stratification.py
def split(self, X, y, groups=None):
    """Generate indices to split data into training and test set.
    The folds are made by preserving the percentage of samples for each class.
    This is a variation of StratifiedKFold that uses a custom discretization of the target variable.

    Args:
      X (array-like): Features
      y (array-like): target
      groups (array-like): Grouping variable (Default value = None)

    Returns:
        (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
    """
    self.skf = StratifiedKFold(
        n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
    )
    assert y is not None, "y cannot be None"
    kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
    if isinstance(y, pd.Series):
        y_cat = (
            kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
        )
        y_cat = pd.Series(y_cat, index=y.index)
    else:
        y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore

    return self.skf.split(X, y_cat)