Flexible Cross Validation and Machine Learning for Regression on Tabular Data
Find the repository here.
flexcv is a Python package that implements flexible cross validation and machine learning for tabular data. It provides a range of features for comparing machine learning models on different datasets with different sets of predictors, and lets you customize nearly every aspect of the cross validation procedure. It supports both fixed and random effects, as well as random slopes.
Features
The flexcv package provides the following features:
- Cross-validation of model performance (generalization estimation)
- Selection of model hyperparameters using an inner cross-validation loop and state-of-the-art optimization provided by optuna.
- Customization of objective functions for optimization to select meaningful model parameters.
- Fixed and mixed effects modeling (random intercepts and slopes).
- Scaling of inner and outer cross-validation folds separately.
- Easy usage of the state-of-the-art MLOps platform neptune to track all of your experiments. Check out their website or explore our neptune project that we used for testing this package. Also check out the neptune integration guide.
- Integrates the merf package to apply correction for clustered data using the expectation maximization algorithm, supporting any sklearn BaseEstimator. Read more about that package in this blog post or go right to their repo.
- Adaptations for cross validation splits with stratification for continuous target variables.
- Easy local summary of all evaluation metrics in a single table.
- Wrapper classes for the statsmodels package to use their mixed effects models inside of a sklearn Pipeline. Read more about that package here.
- Inner cross validation implementation that lets you push groups to the inner split, e.g. to apply GroupKFold.
- Customizable ObjectiveScorer function for hyperparameter tuning that lets you make a trade-off between under- and overfitting.
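To illustrate the idea behind trading off under- and overfitting during hyperparameter tuning, here is a minimal, dependency-free sketch. It is not the ObjectiveScorer implementation of flexcv; the function name `gap_penalized_objective`, the `gap_weight` parameter, and the candidate error values are all hypothetical and chosen only for illustration.

```python
def gap_penalized_objective(train_mse, valid_mse, gap_weight=1.0):
    """Hypothetical score to minimize: validation error plus a penalty
    on the gap between training and validation error."""
    gap = abs(valid_mse - train_mse)
    return valid_mse + gap_weight * gap


# Made-up (train_mse, valid_mse) pairs for three candidate configurations.
candidates = {
    "underfit": (4.0, 4.2),   # high error everywhere
    "balanced": (1.0, 1.3),   # low error, small gap
    "overfit": (0.1, 1.2),    # lowest validation error, but large gap
}

# Plain validation error would pick the overfitting candidate ...
best_plain = min(candidates, key=lambda name: candidates[name][1])

# ... while the gap penalty prefers the candidate that generalizes consistently.
best_penalized = min(
    candidates, key=lambda name: gap_penalized_objective(*candidates[name])
)

print(best_plain, best_penalized)
```

With `gap_weight=0` the score reduces to the plain validation error, so the weight controls how strongly a large train/validation gap is punished.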
These are the core packages used under the hood in flexcv:
- sklearn - A very popular machine learning library. We use their Estimator API for models, the pipeline module, the StandardScaler, metrics and of course wrap around their cross validation split methods. Learn more here.
- Optuna - A state-of-the-art optimization package. We use it for parameter selection in the inner loop of our nested cross validation. Learn more about theoretical background and opportunities here.
- neptune - Awesome logging dashboard with lots of integrations. It works like a charm in combination with Optuna. We used it to track all of our experiments. Neptune is quite deeply integrated into flexcv. Learn more about this great library here.
- merf - Mixed Effects for Random Forests. Applies correction terms on the predictions of clustered data. Works not only with random forests but with every sklearn BaseEstimator.
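As an illustration of what wrapping sklearn's split methods can look like, the following sketch stratifies folds on a continuous target by binning it into quantiles first. This is one common approach and not necessarily how flexcv implements it; the helper name `continuous_stratified_split` and its parameters are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def continuous_stratified_split(y, n_splits=5, n_bins=5, seed=0):
    """Hypothetical helper: stratify folds on quantile bins of a continuous target."""
    y = np.asarray(y, dtype=float)
    # Interior quantile edges, e.g. the 20%, 40%, 60%, 80% quantiles for 5 bins.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    # Assign every sample the index of the quantile bin it falls into.
    bins = np.digitize(y, edges)
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # StratifiedKFold only needs the number of samples from X, so a dummy is enough.
    dummy_X = np.zeros((len(y), 1))
    return bins, list(splitter.split(dummy_X, bins))


rng = np.random.default_rng(0)
y = rng.normal(size=100)
bins, folds = continuous_stratified_split(y)

for train_idx, test_idx in folds:
    # Every test fold should cover the whole range of the target.
    print(len(test_idx), sorted(set(bins[test_idx])))
```

Each test fold then contains samples from all quantile bins, so no fold is evaluated only on unusually low or high target values.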
Why would you use flexcv?
Working with cross validation in Python usually starts with creating a sklearn pipeline. Pipelines are super useful to combine preprocessing steps with model fitting and prevent data leakage.
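A minimal sketch of this starting point, using plain sklearn on synthetic data (the data and the choice of Ridge are assumptions for illustration only): because the scaler lives inside the pipeline, it is re-fitted on the training part of each fold and never sees the held-out samples.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Scaling and model fitting are combined, so both are fitted per training fold.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(scores.mean())
```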
However, there are limitations, e.g. if you want to push the training part of your clustering variable to the inner cross validation split. For some of these features, you would have to write a lot of boilerplate code to get them working, and you would end up with a lot of code duplication.
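To show the kind of boilerplate this involves, here is a hedged sketch of a nested split written by hand with sklearn's GroupKFold. It is not flexcv code; the generator name `nested_group_splits` and the toy data are assumptions. The key step is slicing the group labels with the outer training indices before handing them to the inner splitter.

```python
import numpy as np
from sklearn.model_selection import GroupKFold


def nested_group_splits(X, y, groups, n_outer=3, n_inner=2):
    """Hypothetical helper: yield outer folds together with group-aware inner folds."""
    X, y, groups = np.asarray(X), np.asarray(y), np.asarray(groups)
    outer = GroupKFold(n_splits=n_outer)
    for train_idx, test_idx in outer.split(X, y, groups):
        inner = GroupKFold(n_splits=n_inner)
        # Only the training part of the grouping variable reaches the inner split.
        inner_groups = groups[train_idx]
        inner_folds = [
            # Map positions within the training subset back to original row indices.
            (train_idx[fit_pos], train_idx[val_pos])
            for fit_pos, val_pos in inner.split(X[train_idx], y[train_idx], inner_groups)
        ]
        yield train_idx, test_idx, inner_folds


# Toy data: 6 clusters with 2 rows each.
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.arange(12, dtype=float)
groups = np.repeat(np.arange(6), 2)

splits = list(nested_group_splits(X, y, groups))
for train_idx, test_idx, inner_folds in splits:
    print(sorted(set(groups[test_idx])), len(inner_folds))
```

No cluster appears on both sides of any outer or inner split, which is what keeps clustered observations from leaking between training and validation data.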
As soon as you want to use a linear mixed effects model, you have to use the statsmodels package, which is not compatible with the sklearn pipeline.
flexcv solves these problems and provides a lot of useful features for cross validation and machine learning on tabular data, so you can focus on your data and your models.
Earth Extension
A wrapper implementation of the Earth regression package for R exists, which you can use with flexcv. It is called flexcv-earth. It is not yet available on PyPI, but you can install it from GitHub with the command pip install git+https://github.com/radlfabs/flexcv-earth.git. You can then use the EarthModel class in your flexcv configuration by importing it from flexcv_earth. Further information is available in the documentation.
Contributions
We welcome contributions to this repository. If you have any questions, please don't hesitate to reach out or file a GitHub issue.