Flexible cross validation and machine learning for regression on tabular data

development
cross-validation
ML
OSS
Python
Author

Fabian Rosenthal

Published

November 1, 2023

flexcv is a Python package that implements flexible cross validation and machine learning for tabular data. It provides a range of features for comparing machine learning models on different datasets with different sets of predictors, and it lets you customize just about everything around cross validation. It supports both fixed and random effects, as well as random slopes.

flexcv originated from the need to perform nested cross validation on grouped data. While I wrote most of the code during my time as a student research assistant at Hochschule Düsseldorf, I have continued to maintain and improve the package as a private open-source project. Moreover, I decided to release it with full documentation, a project page on GitHub Pages, and continuous integration and deployment using GitHub Actions. Have a look at the project page!

The package is designed to be flexible and easy to use, with a focus on reproducibility. It is built on top of popular machine learning libraries such as scikit-learn and pandas, and is designed to work seamlessly with them.

Why would you need it? You cannot simply perform a nested group cross validation with scikit-learn. This is where flexcv comes in. It provides a simple and flexible interface for performing nested cross validation on grouped data, and supports a wide range of machine learning models and evaluation metrics. It does a whole lot of other stuff, too, like scaling in the inner and outer CV independently and providing compatibility with neptune.ai for logging.
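For comparison, here is a rough sketch of what a nested group cross validation can look like when written by hand with plain scikit-learn. It is not how flexcv works internally; the synthetic data, the Ridge model, the parameter grid and all variable names are assumptions made up for this illustration. The point is that the outer loop, the forwarding of group labels to the inner splitter, and the per-fold scaling all have to be wired together manually.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data: 100 samples, 10 features, 5 groups of 20 samples
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=0)
groups = np.repeat(np.arange(5), 20)

outer_cv = GroupKFold(n_splits=5)  # model evaluation
inner_cv = GroupKFold(n_splits=4)  # hyperparameter tuning

# The scaler sits inside the pipeline, so it is re-fitted on each training fold
pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}  # assumed grid

outer_scores = []
for train_idx, test_idx in outer_cv.split(X, y, groups):
    # No group may appear on both sides of the outer split
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

    search = GridSearchCV(
        pipeline, param_grid, cv=inner_cv, scoring="neg_mean_squared_error"
    )
    # The group labels of the training part are forwarded to the inner splitter
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])

    # Evaluate the refitted best model on the held-out group
    y_pred = search.predict(X[test_idx])
    outer_scores.append(mean_squared_error(y[test_idx], y_pred))

print(f"Mean outer-fold MSE: {np.mean(outer_scores):.3f}")
```

Every additional model, metric or split strategy adds more of this bookkeeping, which is the kind of boilerplate a dedicated interface can take over.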

This project showcases my

Poster

This conference poster was a contribution to the Jahrestagung für Akustik DAGA 2024 and is published here.

Installation

flexcv is on PyPI, so you can install it using pip:

pip install flexcv

Usage

Here is an example of how to use flexcv to perform nested cross validation on a regression problem. Let’s first load the modules and generate some sample data:

# import the class interface, data generator and model
from flexcv import CrossValidation
from flexcv.synthesizer import generate_regression
from flexcv.models import LinearModel

# make sample data
X, y, group, _ = generate_regression(10, 100, noise_level=0.01)

Now the fun part about flexcv is its class interface. You can set up a complex configuration with just a few lines of code. Here is an example of how to set up a group cross validation with a linear model:

# instantiate our cross validation class
cv = CrossValidation()

# now we can use method chaining to set up our configuration and perform the cross validation
results = (
    cv
    .set_data(X, y, group, dataset_name="ExampleData")
    .set_splits(method_outer_split="GroupKFold", method_inner_split="KFold")
    .add_model(LinearModel)
    .perform()
    .get_results()
)

# results has a summary property which returns a dataframe
# we can simply call the pandas method "to_excel"
results.summary.to_excel("my_cv_results.xlsx")

I decided to use method chaining in the core interface to make it easy to set up the configuration and perform the cross validation without having to remember a bunch of classes and functions. This approach leverages IDE hints and completion to guide the user through the process.
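If you have not come across method chaining before, the general pattern can be sketched with a toy class. The class below is purely hypothetical and is not part of flexcv; its name, methods and attributes are invented for illustration. The only trick is that every configuration method returns `self`, so the next call can be attached directly.

```python
class ChainedConfig:
    """Hypothetical toy class illustrating method chaining (not flexcv code)."""

    def __init__(self):
        self.settings = {}
        self.models = []

    def set_splits(self, outer, inner):
        # Store the configuration, then return self to allow chaining
        self.settings["outer"] = outer
        self.settings["inner"] = inner
        return self

    def add_model(self, name):
        self.models.append(name)
        return self

    def describe(self):
        # Terminal method: returns a value instead of self and ends the chain
        models = ", ".join(self.models)
        return f"{self.settings['outer']} / {self.settings['inner']} with {models}"


summary = (
    ChainedConfig()
    .set_splits(outer="GroupKFold", inner="KFold")
    .add_model("LinearModel")
    .describe()
)
print(summary)
```

Because each call returns an object whose methods the IDE knows, completion can suggest the next sensible step after every dot.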

Visit the project page to learn more. Feel free to reach out to me if you have any questions or suggestions, preferably using the issue tracker.