flexcv.split
To switch cross validation split methods dynamically, this module provides a function that returns the correct cross validation split function for a given method. This is necessary because the split methods take different numbers of arguments. The module also implements a custom stratified cross validation split method for continuous target variables, and a custom stratified group cross validation split method for continuous target variables that additionally incorporates grouping information.
flexcv.split.CrossValMethod
Bases: Enum
Enum class to assign CrossValMethods to the cross_val() function. This is useful to return the correct splitting function depending on the cross val method.
Members
- KFOLD: Regular sklearn KFold cross validation. No grouping information is used.
- GROUP: Regular sklearn GroupKFold cross validation. Grouping information is used.
- STRAT: Regular sklearn StratifiedKFold cross validation. No grouping information is used.
- STRATGROUP: Regular sklearn StratifiedGroupKFold cross validation. Grouping information is used.
- CONTISTRAT: Stratified cross validation for continuous targets. No grouping information is used.
- CONTISTRATGROUP: Stratified cross validation for continuous targets. Grouping information is used.
- CONCATSTRATKFOLD: Stratified cross validation. Leaky stratification on element-wise-concatenated target and group labels.
Source code in flexcv/split.py
flexcv.split.make_cross_val_split(*, groups, method, n_splits=5, random_state=42)
This function creates and returns a callable cross validation splitter based on the specified method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
groups | Series \| None | A pd.Series containing the grouping information for the samples. | required |
method | CrossValMethod | A CrossValMethod enum value specifying the cross validation method to use. | required |
n_splits | int | Number of splits. | 5 |
random_state | int | A random seed to control random processes. | 42 |
Returns:
Type | Description |
---|---|
Callable | A callable cross validation splitter based on the specified method. |
Raises:
Type | Description |
---|---|
TypeError | If the given method is not one of KFOLD |
Source code in flexcv/split.py
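The idea of returning one uniform callable for methods that need different arguments can be sketched as follows. This is a minimal, dependency-free illustration and not the flexcv implementation: `SketchMethod`, `kfold_indices`, `group_kfold_indices` and `make_split_sketch` are hypothetical names, only two methods are covered, and raising `ValueError` for missing groups is an assumption of this sketch.

```python
from enum import Enum
from functools import partial


class SketchMethod(Enum):
    """Hypothetical stand-in for CrossValMethod with only two members."""
    KFOLD = "KFold"
    GROUP = "GroupKFold"


def kfold_indices(n_samples, n_splits):
    """Plain k-fold: sample i lands in test fold i % n_splits."""
    for fold in range(n_splits):
        test = [i for i in range(n_samples) if i % n_splits == fold]
        train = [i for i in range(n_samples) if i % n_splits != fold]
        yield train, test


def group_kfold_indices(n_samples, n_splits, groups):
    """Group k-fold: all samples of one group share a single test fold."""
    unique = sorted(set(groups))
    fold_of = {g: k % n_splits for k, g in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i in range(n_samples) if fold_of[groups[i]] == fold]
        train = [i for i in range(n_samples) if fold_of[groups[i]] != fold]
        yield train, test


def make_split_sketch(*, groups, method, n_splits=5):
    """Return a callable that only needs n_samples, whatever the method needs."""
    if method is SketchMethod.KFOLD:
        return partial(kfold_indices, n_splits=n_splits)
    if method is SketchMethod.GROUP:
        if groups is None:
            # Assumption of this sketch: missing groups are reported eagerly.
            raise ValueError("GROUP needs grouping information")
        return partial(group_kfold_indices, n_splits=n_splits, groups=list(groups))
    raise TypeError(f"Unsupported method: {method!r}")


groups = ["a", "a", "b", "b", "c", "c"]
splitter = make_split_sketch(groups=groups, method=SketchMethod.GROUP, n_splits=3)
folds = list(splitter(len(groups)))
```

Binding the extra arguments with `functools.partial` keeps the calling code identical for every method, which is the motivation given for the factory in the module description.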
flexcv.split.string_to_crossvalmethod(method)
Converts a string to a CrossValMethod enum member.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | str | The string to convert. | required |
Returns:
Type | Description |
---|---|
CrossValMethod | The CrossValMethod enum value. |
Raises:
Type | Description |
---|---|
TypeError | If the given string does not match any CrossValMethod. |
Source code in flexcv/split.py
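A string-to-enum conversion of this kind can be sketched with the standard library alone. The sketch assumes that strings are matched against member names; the real function may match against member values instead. `SketchMethod` and `string_to_method_sketch` are hypothetical names, and the member values are placeholders.

```python
from enum import Enum


class SketchMethod(Enum):
    """Hypothetical stand-in; the values are placeholders, not flexcv's."""
    KFOLD = 1
    GROUP = 2
    CONTISTRAT = 3


def string_to_method_sketch(method):
    """Look the string up by member name and raise TypeError on a miss."""
    try:
        return SketchMethod[method]
    except KeyError:
        # Mirror the documented behaviour: an unknown string is a TypeError.
        raise TypeError(f"{method!r} is not a valid method name") from None


selected = string_to_method_sketch("GROUP")
```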
flexcv.stratification
This module implements two stratification methods that can be used in the context of regression on hierarchical data (i.e. where the target is continuous and the data is grouped).
flexcv.stratification.ConcatenatedStratifiedKFold
Bases: GroupsConsumerMixin, BaseCrossValidator
Group Concatenated Continuous Stratified k-Folds cross validator. This is a variation of StratifiedKFold that uses a concatenation of target and grouping variable.
- The target is discretized.
- Each discrete target label is cast to str and concatenated with the grouping label.
- Stratification is applied to this new temporary concatenated target.
- This keeps both the group distribution and the target distribution in each fold roughly equal to the input distribution.
- The procedure allows overlapping groups, which could be interpreted as data leakage in many cases.
- The population distribution (i.e. that of the input data set) leaks into the folds' distributions.
Source code in flexcv/stratification.py
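The steps listed above can be illustrated with a small dependency-free sketch. It is not the class's implementation: `quantile_bins`, `concatenated_labels` and `stratified_folds` are hypothetical helpers, the equal-frequency binning and the `_` separator are assumptions, and the round-robin fold assignment is a simplification of real stratified k-fold.

```python
from collections import defaultdict


def quantile_bins(y, n_bins):
    """Give each value a bin 0..n_bins-1 by rank (equal-frequency bins)."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    bins = [0] * len(y)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(y)
    return bins


def concatenated_labels(y, groups, n_bins):
    """Join each sample's target bin and group label into one string label."""
    return [f"{b}_{g}" for b, g in zip(quantile_bins(y, n_bins), groups)]


def stratified_folds(labels, n_splits):
    """Deal the samples of each label round-robin across the folds."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    fold_of = [0] * len(labels)
    for members in by_label.values():
        for k, i in enumerate(members):
            fold_of[i] = k % n_splits
    return fold_of


y = [1.0, 9.0, 2.0, 8.0, 1.5, 9.5, 2.5, 8.5]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = concatenated_labels(y, groups, n_bins=2)
fold_of = stratified_folds(labels, n_splits=2)
# Which groups end up in each fold: both groups appear in both folds.
groups_per_fold = [
    sorted({g for g, f in zip(groups, fold_of) if f == k}) for k in range(2)
]
```

In this toy data every fold contains samples from both groups, which is the group overlap that the list above flags as potential leakage.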
flexcv.stratification.ConcatenatedStratifiedKFold.get_n_splits(X=None, y=None, groups=None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like | Features | None |
y | array-like | Target values. | None |
groups | array-like | Grouping values. | None |
Returns:
Type | Description |
---|---|
int | The number of splitting iterations in the cross-validator. |
Source code in flexcv/stratification.py
flexcv.stratification.ConcatenatedStratifiedKFold.split(X, y, groups=None)
Generate indices to split data into training and test set. Applies target discretization, row-wise concatenation with the group label, and stratification on this temporary concatenated column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like | Features | required |
y | array-like | Target | required |
groups | array-like | Grouping variable | None |
Returns:
Type | Description |
---|---|
Iterator[tuple[ndarray, ndarray]] | Iterator over the indices of the training and test set. |
Source code in flexcv/stratification.py
flexcv.stratification.ContinuousStratifiedGroupKFold
Bases: GroupsConsumerMixin, BaseCrossValidator
Continuous Stratified Group k-Folds cross validator. This is a variation of StratifiedKFold that
- makes a temporary discretization of the target variable.
- applies stratified group k-fold based on the passed groups and the discretized target.
- does not further use this discretized target.
- tries to preserve the percentage of samples in each percentile per group, given the constraint of non-overlapping groups.
Source code in flexcv/stratification.py
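To make the constraint of non-overlapping groups concrete, here is a rough stdlib-only sketch of a group-aware stratified assignment. It is an illustrative simplification, not the algorithm used by the class or by sklearn: `quantile_bins` and `stratified_group_folds` are hypothetical helpers, and the greedy cost (the range of per-bin counts across folds) is an assumption chosen for brevity.

```python
from collections import Counter, defaultdict


def quantile_bins(y, n_bins):
    """Give each value a bin 0..n_bins-1 by rank (equal-frequency bins)."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    bins = [0] * len(y)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(y)
    return bins


def stratified_group_folds(bins, groups, n_splits):
    """Greedily give each whole group to the fold where it balances bins best."""
    per_group = defaultdict(Counter)
    for b, g in zip(bins, groups):
        per_group[g][b] += 1
    all_bins = set(bins)
    fold_counts = [Counter() for _ in range(n_splits)]
    fold_of_group = {}
    # Place large groups first; ties are broken by group label for determinism.
    for g in sorted(per_group, key=lambda g: (-sum(per_group[g].values()), g)):

        def cost(k):
            # Spread of per-bin counts across folds if group g joined fold k.
            total = 0
            for b in all_bins:
                counts = [
                    fold_counts[j][b] + (per_group[g][b] if j == k else 0)
                    for j in range(n_splits)
                ]
                total += max(counts) - min(counts)
            return total

        best = min(range(n_splits), key=cost)
        fold_counts[best] += per_group[g]
        fold_of_group[g] = best
    return [fold_of_group[g] for g in groups]


y = [1, 9, 2, 8, 3, 7, 4, 6]
groups = ["a", "a", "b", "b", "c", "c", "d", "d"]
fold_of = stratified_group_folds(quantile_bins(y, 2), groups, n_splits=2)
```

Because whole groups are assigned at once, no group can appear in more than one test fold; the price is that the per-fold target distribution is only approximately balanced.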
flexcv.stratification.ContinuousStratifiedGroupKFold.get_n_splits(X=None, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator.
Returns:
Type | Description |
---|---|
int | The number of splitting iterations in the cross-validator. |
flexcv.stratification.ContinuousStratifiedGroupKFold.split(X, y, groups=None)
Generate indices to split data into training and test set. The data is first grouped by groups and then split into n_splits folds. The folds are made by preserving the percentage of samples for each class. This is a variation of StratifiedGroupKFold that uses a custom discretization of the target variable.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like | Features | required |
y | array-like | Target | required |
groups | array-like | Grouping/clustering variable | None |
Returns:
Type | Description |
---|---|
Iterator[tuple[ndarray, ndarray]] | Iterator over the indices of the training and test set. |
Source code in flexcv/stratification.py
flexcv.stratification.ContinuousStratifiedKFold
Bases: BaseCrossValidator
Continuous Stratified k-Folds cross validator, i.e. it works with continuous target variables instead of multiclass targets.
This is a variation of StratifiedKFold that
- makes a copy of the target variable and discretizes it.
- applies stratified k-folds based on this discrete target to ensure equal percentile distribution across folds
- does not further use or pass this discrete target.
- does not apply grouping rules.
Source code in flexcv/stratification.py
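The discretize-then-stratify idea described above can be sketched without any dependencies. This is a hedged illustration rather than the class's code: `quantile_bins`, `stratified_folds` and `continuous_stratified_split` are hypothetical names, the equal-frequency binning is an assumption, and the round-robin assignment stands in for a real stratified k-fold.

```python
from collections import defaultdict


def quantile_bins(y, n_bins):
    """Give each value a bin 0..n_bins-1 by rank (equal-frequency bins)."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    bins = [0] * len(y)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(y)
    return bins


def stratified_folds(labels, n_splits):
    """Deal the samples of each label round-robin across the folds."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    fold_of = [0] * len(labels)
    for members in by_label.values():
        for k, i in enumerate(members):
            fold_of[i] = k % n_splits
    return fold_of


def continuous_stratified_split(y, n_splits, n_bins):
    """Yield (train, test) index lists; the bins are used only for splitting."""
    fold_of = stratified_folds(quantile_bins(y, n_bins), n_splits)
    for fold in range(n_splits):
        test = [i for i, f in enumerate(fold_of) if f == fold]
        train = [i for i, f in enumerate(fold_of) if f != fold]
        yield train, test


y = [0.1, 5.0, 0.2, 5.1, 0.3, 5.2, 0.4, 5.3]
splits = list(continuous_stratified_split(y, n_splits=2, n_bins=2))
```

The discretized copy never leaves the splitter: callers only receive index lists and keep working with the original continuous target.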
flexcv.stratification.ContinuousStratifiedKFold.get_n_splits(X=None, y=None, groups=None)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like | Features | None |
y | array-like | Target values. | None |
groups | array-like | Grouping values. | None |
Returns:
Type | Description |
---|---|
int | The number of splitting iterations in the cross-validator. |
Source code in flexcv/stratification.py
flexcv.stratification.ContinuousStratifiedKFold.split(X, y, groups=None)
Generate indices to split data into training and test set. The folds are made by preserving the percentage of samples for each class. This is a variation of StratifiedKFold that uses a custom discretization of the target variable.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like | Features | required |
y | array-like | Target | required |
groups | array-like | Grouping variable | None |
Returns:
Type | Description |
---|---|
Iterator[tuple[ndarray, ndarray]] | Iterator over the indices of the training and test set. |
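How a caller consumes the `get_n_splits` and `split` pair documented above can be sketched with a stand-in class. `TinySplitter` is hypothetical and uses plain interleaved folds with Python lists in place of ndarrays; only the method names and signatures follow the documentation above.

```python
class TinySplitter:
    """Hypothetical stand-in exposing the same two methods as the classes above."""

    def __init__(self, n_splits=2):
        self.n_splits = n_splits

    def get_n_splits(self, X=None, y=None, groups=None):
        # The arguments are accepted for interface compatibility and ignored.
        return self.n_splits

    def split(self, X, y, groups=None):
        # Lazily yield one (train, test) pair of index lists per fold.
        n = len(X)
        for fold in range(self.n_splits):
            test = [i for i in range(n) if i % self.n_splits == fold]
            train = [i for i in range(n) if i % self.n_splits != fold]
            yield train, test


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 2.0, 3.0]
cv = TinySplitter(n_splits=2)
fold_sizes = [(len(train), len(test)) for train, test in cv.split(X, y)]
```

Since `split` is a generator, the index pairs are produced one fold at a time and the iterator is exhausted after a single pass.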