flexcv.split

To switch cross validation split methods dynamically, we need a function that returns the correct split callable, because the individual split methods take different numbers of arguments. This module also implements a custom stratified cross validation split method for continuous target variables, as well as a custom stratified group cross validation split method for continuous targets that incorporates grouping information.

flexcv.split.CrossValMethod

Bases: Enum

Enum class to assign CrossValMethods to the cross_val() function. This is useful to return the correct splitting function depending on the cross val method.

Members
  • KFOLD: Regular sklearn KFold cross validation. No grouping information is used.
  • GROUP: Regular sklearn GroupKFold cross validation. Grouping information is used.
  • STRAT: Regular sklearn StratifiedKFold cross validation. No grouping information is used.
  • STRATGROUP: Regular sklearn StratifiedGroupKFold cross validation. Grouping information is used.
  • CONTISTRAT: Stratified cross validation for continuous targets. No grouping information is used.
  • CONTISTRATGROUP: Stratified cross validation for continuous targets. Grouping information is used.
  • CONCATSTRATKFOLD: Stratified cross validation. Leaky stratification on element-wise-concatenated target and group labels.
Source code in flexcv/split.py
class CrossValMethod(Enum):
    """Enum class to assign CrossValMethods to the cross_val() function.
    This is useful to return the correct splitting function depending on the cross val method.

    Members:
        - `KFOLD`: Regular sklearn `KFold` cross validation. No grouping information is used.
        - `GROUP`: Regular sklearn `GroupKFold` cross validation. Grouping information is used.
        - `STRAT`: Regular sklearn `StratifiedKFold` cross validation. No grouping information is used.
        - `STRATGROUP`: Regular sklearn `StratifiedGroupKFold` cross validation. Grouping information is used.
        - `CONTISTRAT`: Stratified cross validation for continuous targets. No grouping information is used.
        - `CONTISTRATGROUP`: Stratified cross validation for continuous targets. Grouping information is used.
        - `CONCATSTRATKFOLD`: Stratified cross validation. Leaky stratification on element-wise-concatenated target and group labels.
    """

    KFOLD = "KFold"
    GROUP = "GroupKFold"
    STRAT = "StratifiedKFold"
    STRATGROUP = "StratifiedGroupKFold"
    CONTISTRAT = "ContinuousStratifiedKFold"
    CONTISTRATGROUP = "ContinuousStratifiedGroupKFold"
    CONCATSTRATKFOLD = "ConcatenatedStratifiedKFold"
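
A minimal sketch of how the enum is typically referenced (assuming flexcv is installed and importable); each member's value is the name of the underlying splitter:

from flexcv.split import CrossValMethod

method = CrossValMethod.STRATGROUP
print(method.name)   # -> STRATGROUP
print(method.value)  # -> StratifiedGroupKFold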

flexcv.split.make_cross_val_split(*, groups, method, n_splits=5, random_state=42)

This function creates and returns a callable cross validation splitter based on the specified method.

Parameters:

  • groups (pd.Series | None): A pd.Series containing the grouping information for the samples. Required.
  • method (CrossValMethod): A CrossValMethod enum value specifying the cross validation method to use. Required.
  • n_splits (int): Number of splits. Default: 5.
  • random_state (int): A random seed to control random processes. Default: 42.

Returns:

  • (Callable): A callable cross validation splitter based on the specified method.

Raises:

  • (TypeError): If the given method is neither a CrossValMethod member, a scikit-learn cross validator, nor an iterator of splits.

Source code in flexcv/split.py
def make_cross_val_split(
    *,
    groups: pd.Series | None,
    method: CrossValMethod,
    n_splits: int = 5,
    random_state: int = 42,
) -> Callable[..., Iterator[tuple[ndarray, ndarray]]]:
    """This function creates and returns a callable cross validation splitter based on the specified method.

    Args:
      groups (pd.Series | None): A pd.Series containing the grouping information for the samples.
      method (CrossValMethod): A CrossValMethod enum value specifying the cross validation method to use.
      n_splits (int): Number of splits (Default value = 5)
      random_state (int): A random seed to control random processes (Default value = 42)

    Returns:
      (Callable): A callable cross validation splitter based on the specified method.

    Raises:
      (TypeError): If the given method is not a valid CrossValMethod, scikit-learn cross validator, or iterator of splits.

    """

    match method:
        case CrossValMethod.KFOLD:
            kf = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
            return kf.split

        case CrossValMethod.STRAT:
            strat_skf = StratifiedKFold(
                n_splits=n_splits, random_state=random_state, shuffle=True
            )
            return strat_skf.split

        case CrossValMethod.CONTISTRAT:
            conti_skf = ContinuousStratifiedKFold(
                n_splits=n_splits, random_state=random_state, shuffle=True
            )
            return conti_skf.split

        case CrossValMethod.GROUP:
            gkf = GroupKFold(n_splits=n_splits)
            return partial(gkf.split, groups=groups)

        case CrossValMethod.STRATGROUP:
            strat_gkf = StratifiedGroupKFold(
                n_splits=n_splits, random_state=random_state, shuffle=True
            )
            return partial(strat_gkf.split, groups=groups)

        case CrossValMethod.CONTISTRATGROUP:
            conti_sgkf = ContinuousStratifiedGroupKFold(
                n_splits=n_splits, random_state=random_state, shuffle=True
            )
            return partial(conti_sgkf.split, groups=groups)

        case CrossValMethod.CONCATSTRATKFOLD:
            concat_skf = ConcatenatedStratifiedKFold(
                n_splits=n_splits, random_state=random_state, shuffle=True
            )
            return partial(concat_skf.split, groups=groups)

        case _:
            is_cross_validator = isinstance(method, BaseCrossValidator)
            is_groups_consumer = isinstance(method, GroupsConsumerMixin)

            if is_cross_validator and is_groups_consumer:
                return partial(method.split, groups=groups)

            if is_cross_validator:
                return method.split

            if isinstance(method, Iterator):
                return method

            else:
                raise TypeError("Invalid Cross Validation method given.")
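
A short usage sketch with toy data invented for illustration (assumes flexcv, numpy, pandas, and scikit-learn are installed). It requests the continuous stratified group splitter and iterates over the resulting folds:

import numpy as np
import pandas as pd
from flexcv.split import CrossValMethod, make_cross_val_split

rng = np.random.default_rng(42)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = pd.Series(rng.normal(size=100))               # continuous target
groups = pd.Series(np.repeat(np.arange(10), 10))  # 10 groups of 10 samples

split_fn = make_cross_val_split(
    groups=groups, method=CrossValMethod.CONTISTRATGROUP, n_splits=5
)
for train_idx, test_idx in split_fn(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]  # fit/evaluate here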

flexcv.split.string_to_crossvalmethod(method)

Converts a string to a CrossValMethod enum member.

Parameters:

  • method (str): The string to convert. Required.

Returns:

  • (CrossValMethod): The CrossValMethod enum value.

Raises:

  • (TypeError): If the given string does not match any CrossValMethod.

Source code in flexcv/split.py
def string_to_crossvalmethod(method: str) -> CrossValMethod:
    """Converts a string to a CrossValMethod enum member.

    Args:
      method (str): The string to convert.

    Returns:
      (CrossValMethod): The CrossValMethod enum value.

    Raises:
      (TypeError): If the given string does not match any CrossValMethod.

    """
    keys = [e.value for e in CrossValMethod]
    values = [e for e in CrossValMethod]
    method_dict = dict(zip(keys, values))

    if method in method_dict:
        return method_dict[method]
    else:
        raise TypeError("Invalid Cross Validation method given.")
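
A minimal sketch: the lookup maps a splitter name (the enum value) back to the corresponding member:

from flexcv.split import CrossValMethod, string_to_crossvalmethod

method = string_to_crossvalmethod("StratifiedGroupKFold")
assert method is CrossValMethod.STRATGROUP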

flexcv.stratification

This module implements stratification methods that can be used for regression on hierarchical data (i.e. where the target is continuous and the data is grouped).

flexcv.stratification.ConcatenatedStratifiedKFold

Bases: GroupsConsumerMixin, BaseCrossValidator

Group Concatenated Continuous Stratified k-Folds cross validator. This is a variation of StratifiedKFold that uses a concatenation of target and grouping variable.

- The target is discretized.
- Each discrete target label is cast to str and concatenated with the grouping label.
- Stratification is applied to this new, temporary concatenated target.
- This preserves both the group and target distributions in each fold, keeping them roughly equal to the input distribution.
- The procedure allows overlapping groups, which could be interpreted as data leakage in many cases.
- The population (i.e. the input data set) distribution leaks into the folds' distributions.
Source code in flexcv/stratification.py
class ConcatenatedStratifiedKFold(GroupsConsumerMixin, BaseCrossValidator):
    """Group Concatenated Continuous Stratified k-Folds cross validator.
    This is a variation of StratifiedKFold that uses a concatenation of target and grouping variable.

        - The target is discretized.
        - Each discrete target label is cast to str and concatenated with the grouping label.
        - Stratification is applied to this new, temporary concatenated target.
        - This preserves both the group and target distributions in each fold, keeping them roughly equal to the input distribution.
        - The procedure allows overlapping groups, which could be interpreted as data leakage in many cases.
        - The population (i.e. the input data set) distribution leaks into the folds' distributions.
    """

    def __init__(self, n_splits, shuffle=True, random_state=42, groups=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state
        self.groups = groups

    def split(self, X, y, groups=None):
        """Generate indices to split data into training and test set.
        Applies target discretization, row-wise concatenation with the group label, and stratification on this temporary concatenated column.

        Args:
          X (array-like): Features
          y (array-like): target
          groups (array-like): Grouping variable (Default value = None)

        Returns:
            (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
        """
        self.skf = StratifiedKFold(
            n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
        )
        assert y is not None, "y cannot be None"
        kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
        if isinstance(y, pd.Series):
            y_cat = (
                kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
            )
            y_cat = pd.Series(y_cat, index=y.index)
        else:
            y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore
        # concatenate y_cat and groups such that the stratification is done on both
        # elementwise concatenation of three arrays
        try:
            y_concat = y_cat.astype(str) + "_" + groups.astype(str)
        except UFuncTypeError:
            # Why easy when you can do it the hard way?
            y_concat = np.core.defchararray.add(
                np.core.defchararray.add(y_cat.astype(str), "_"), groups.astype(str)
            )

        return self.skf.split(X, y_concat)

    def get_n_splits(self, X=None, y=None, groups=None):
        """

        Args:
          X (array-like): Features
          y (array-like): target values. (Default value = None)
          groups (array-like): grouping values. (Default value = None)

        Returns:
         (int) : The number of splitting iterations in the cross-validator.
        """
        return self.n_splits
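
A minimal usage sketch with invented toy data (assuming flexcv, numpy, and scikit-learn are installed). Note that, as described above, a group may appear in both the training and the test set:

import numpy as np
from flexcv.stratification import ConcatenatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.normal(size=500)               # continuous target
groups = np.repeat(np.arange(2), 250)  # 2 groups; folds may contain both

cv = ConcatenatedStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(X, y, groups):
    ...  # each fold roughly preserves the joint target/group distribution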

flexcv.stratification.ConcatenatedStratifiedKFold.get_n_splits(X=None, y=None, groups=None)

Parameters:

  • X (array-like): Features. Default: None.
  • y (array-like): Target values. Default: None.
  • groups (array-like): Grouping values. Default: None.

Returns:

  • (int): The number of splitting iterations in the cross-validator.

Source code in flexcv/stratification.py
def get_n_splits(self, X=None, y=None, groups=None):
    """

    Args:
      X (array-like): Features
      y (array-like): target values. (Default value = None)
      groups (array-like): grouping values. (Default value = None)

    Returns:
     (int) : The number of splitting iterations in the cross-validator.
    """
    return self.n_splits

flexcv.stratification.ConcatenatedStratifiedKFold.split(X, y, groups=None)

Generate indices to split data into training and test set. Applies target discretization, row-wise concatenation with the group label, and stratification on this temporary concatenated column.

Parameters:

  • X (array-like): Features. Required.
  • y (array-like): Target. Required.
  • groups (array-like): Grouping variable. Default: None.

Returns:

  • (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.

Source code in flexcv/stratification.py
def split(self, X, y, groups=None):
    """Generate indices to split data into training and test set.
    Applies target discretization, row-wise concatenation with the group label, and stratification on this temporary concatenated column.

    Args:
      X (array-like): Features
      y (array-like): target
      groups (array-like): Grouping variable (Default value = None)

    Returns:
        (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
    """
    self.skf = StratifiedKFold(
        n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
    )
    assert y is not None, "y cannot be None"
    kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
    if isinstance(y, pd.Series):
        y_cat = (
            kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
        )
        y_cat = pd.Series(y_cat, index=y.index)
    else:
        y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore
    # concatenate y_cat and groups such that the stratification is done on both
    # elementwise concatenation of three arrays
    try:
        y_concat = y_cat.astype(str) + "_" + groups.astype(str)
    except UFuncTypeError:
        # Why easy when you can do it the hard way?
        y_concat = np.core.defchararray.add(
            np.core.defchararray.add(y_cat.astype(str), "_"), groups.astype(str)
        )

    return self.skf.split(X, y_concat)

flexcv.stratification.ContinuousStratifiedGroupKFold

Bases: GroupsConsumerMixin, BaseCrossValidator

Continuous Stratified Group k-Folds cross validator. This is a variation of StratifiedKFold that

- makes a temporary discretization of the target variable,
- applies stratified group k-fold based on the passed groups and the discretized target,
- does not further use this discretized target,
- tries to preserve the percentage of samples in each percentile per group, given the constraint of non-overlapping groups.

Source code in flexcv/stratification.py
class ContinuousStratifiedGroupKFold(GroupsConsumerMixin, BaseCrossValidator):
    """Continuous Stratified Group k-Folds cross validator.
    This is a variation of StratifiedKFold that
        - makes a temporary discretization of the target variable.
        - applies stratified group k-fold based on the passed groups and the discretized target.
        - does not further use this discretized target.
        - tries to preserve the percentage of samples in each percentile per group, given the constraint of non-overlapping groups.
    """

    def __init__(self, n_splits, shuffle=True, random_state=42, groups=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state
        self.groups = groups

    def split(self, X, y, groups=None):
        """Generate indices to split data into training and test set.
        The data is first grouped by groups and then split into n_splits folds. The folds are made by preserving the percentage of samples for each class.
        This is a variation of StratifiedGroupKFold that uses a custom discretization of the target variable.

        Args:
          X (array-like): Features
          y (array-like): target
          groups (array-like): Grouping/clustering variable (Default value = None)

        Returns:
            (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
        """
        self.sgkf = StratifiedGroupKFold(
            n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
        )
        assert y is not None, "y cannot be None"
        kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
        if isinstance(y, pd.Series):
            y_cat = (
                kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
            )
            y_cat = pd.Series(y_cat, index=y.index)
        else:
            y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore
        return self.sgkf.split(X, y_cat, groups)

    def get_n_splits(self, X=None, y=None, groups=None):
        """
        Returns the number of splitting iterations in the cross-validator.

        Returns:
          (int): The number of splitting iterations in the cross-validator.
        """
        return self.n_splits
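
A minimal usage sketch with invented toy data (assuming flexcv, numpy, and scikit-learn are installed). In contrast to the concatenated variant, groups never overlap between training and test indices:

import numpy as np
from flexcv.stratification import ContinuousStratifiedGroupKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)               # continuous target
groups = np.repeat(np.arange(10), 10)  # 10 non-overlapping groups

cv = ContinuousStratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(X, y, groups):
    # no group appears on both sides of the split
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])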

flexcv.stratification.ContinuousStratifiedGroupKFold.get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

Returns:

  • (int): The number of splitting iterations in the cross-validator.

Source code in flexcv/stratification.py
def get_n_splits(self, X=None, y=None, groups=None):
    """
    Returns the number of splitting iterations in the cross-validator.

    Returns:
      (int): The number of splitting iterations in the cross-validator.
    """
    return self.n_splits

flexcv.stratification.ContinuousStratifiedGroupKFold.split(X, y, groups=None)

Generate indices to split data into training and test set. The data is first grouped by groups and then split into n_splits folds. The folds are made by preserving the percentage of samples for each class. This is a variation of StratifiedGroupKFold that uses a custom discretization of the target variable.

Parameters:

  • X (array-like): Features. Required.
  • y (array-like): Target. Required.
  • groups (array-like): Grouping/clustering variable. Default: None.

Returns:

  • (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.

Source code in flexcv/stratification.py
def split(self, X, y, groups=None):
    """Generate indices to split data into training and test set.
    The data is first grouped by groups and then split into n_splits folds. The folds are made by preserving the percentage of samples for each class.
    This is a variation of StratifiedGroupKFold that uses a custom discretization of the target variable.

    Args:
      X (array-like): Features
      y (array-like): target
      groups (array-like): Grouping/clustering variable (Default value = None)

    Returns:
        (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
    """
    self.sgkf = StratifiedGroupKFold(
        n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
    )
    assert y is not None, "y cannot be None"
    kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
    if isinstance(y, pd.Series):
        y_cat = (
            kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
        )
        y_cat = pd.Series(y_cat, index=y.index)
    else:
        y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore
    return self.sgkf.split(X, y_cat, groups)

flexcv.stratification.ContinuousStratifiedKFold

Bases: BaseCrossValidator

Continuous Stratified k-Folds cross validator, i.e. it works with continuous target variables instead of multiclass targets.

This is a variation of StratifiedKFold that

- makes a copy of the target variable and discretizes it.
- applies stratified k-folds based on this discrete target to ensure equal percentile distribution across folds
- does not further use or pass this discrete target.
- does not apply grouping rules.
Source code in flexcv/stratification.py
class ContinuousStratifiedKFold(BaseCrossValidator):
    """Continuous Stratified k-Folds cross validator, i.e. it works with *continuous* target variables instead of multiclass targets.

    This is a variation of StratifiedKFold that

        - makes a copy of the target variable and discretizes it.
        - applies stratified k-folds based on this discrete target to ensure equal percentile distribution across folds
        - does not further use or pass this discrete target.
        - does not apply grouping rules.
    """

    def __init__(self, n_splits, shuffle=True, random_state=42, groups=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state
        self.groups = groups

    def split(self, X, y, groups=None):
        """Generate indices to split data into training and test set.
        The folds are made by preserving the percentage of samples for each class.
        This is a variation of StratifiedKFold that uses a custom discretization of the target variable.

        Args:
          X (array-like): Features
          y (array-like): target
          groups (array-like): Grouping variable (Default value = None)

        Returns:
            (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
        """
        self.skf = StratifiedKFold(
            n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
        )
        assert y is not None, "y cannot be None"
        kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
        if isinstance(y, pd.Series):
            y_cat = (
                kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
            )
            y_cat = pd.Series(y_cat, index=y.index)
        else:
            y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore

        return self.skf.split(X, y_cat)

    def get_n_splits(self, X=None, y=None, groups=None):
        """

        Args:
          X (array-like): Features
          y (array-like): target values. (Default value = None)
          groups (array-like): grouping values. (Default value = None)

        Returns:
         (int) : The number of splitting iterations in the cross-validator.
        """
        return self.n_splits
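
A minimal usage sketch with invented toy data (assuming flexcv, numpy, and scikit-learn are installed); no grouping information is needed:

import numpy as np
from flexcv.stratification import ContinuousStratifiedKFold

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)  # continuous target

cv = ContinuousStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(X, y):
    ...  # each fold's target distribution roughly matches the full sample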

flexcv.stratification.ContinuousStratifiedKFold.get_n_splits(X=None, y=None, groups=None)

Parameters:

  • X (array-like): Features. Default: None.
  • y (array-like): Target values. Default: None.
  • groups (array-like): Grouping values. Default: None.

Returns:

  • (int): The number of splitting iterations in the cross-validator.

Source code in flexcv/stratification.py
def get_n_splits(self, X=None, y=None, groups=None):
    """

    Args:
      X (array-like): Features
      y (array-like): target values. (Default value = None)
      groups (array-like): grouping values. (Default value = None)

    Returns:
     (int) : The number of splitting iterations in the cross-validator.
    """
    return self.n_splits

flexcv.stratification.ContinuousStratifiedKFold.split(X, y, groups=None)

Generate indices to split data into training and test set. The folds are made by preserving the percentage of samples for each class. This is a variation of StratifiedKFold that uses a custom discretization of the target variable.

Parameters:

  • X (array-like): Features. Required.
  • y (array-like): Target. Required.
  • groups (array-like): Grouping variable. Default: None.

Returns:

  • (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.

Source code in flexcv/stratification.py
def split(self, X, y, groups=None):
    """Generate indices to split data into training and test set.
    The folds are made by preserving the percentage of samples for each class.
    This is a variation of StratifiedKFold that uses a custom discretization of the target variable.

    Args:
      X (array-like): Features
      y (array-like): target
      groups (array-like): Grouping variable (Default value = None)

    Returns:
        (Iterator[tuple[ndarray, ndarray]]): Iterator over the indices of the training and test set.
    """
    self.skf = StratifiedKFold(
        n_splits=self.n_splits, shuffle=self.shuffle, random_state=self.random_state
    )
    assert y is not None, "y cannot be None"
    kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
    if isinstance(y, pd.Series):
        y_cat = (
            kbins.fit_transform(y.to_numpy().reshape(-1, 1)).flatten().astype(int)
        )
        y_cat = pd.Series(y_cat, index=y.index)
    else:
        y_cat = kbins.fit_transform(y.reshape(-1, 1)).flatten().astype(int)  # type: ignore

    return self.skf.split(X, y_cat)