Title: | Robust Tuning and Training for Cross-Source Prediction |
---|---|
Description: | Provides robust parameter tuning and model training for predictive models applied across data sources where the data distribution varies slightly from source to source. This package implements three primary tuning methods: cross-validation-based internal tuning, external tuning, and the 'RobustTuneC' method. External tuning includes a conservative option where parameters are tuned internally on the training data and validated on an external dataset, providing a slightly pessimistic AUC estimate. It supports Lasso, Ridge, Random Forest, Boosting, and Support Vector Machine classifiers. Currently, only binary classification is supported. The response variable must be the first column of the dataset and a factor with exactly two levels. The tuning methods are based on the paper by Nicole Ellenbach, Anne-Laure Boulesteix, Bernd Bischl, Kristian Unger, and Roman Hornung (2021) "Improved Outcome Prediction Across Data Sources Through Robust Parameter Tuning" <doi:10.1007/s00357-020-09368-z>. |
Authors: | Yuting He [aut, cre], Nicole Ellenbach [ctb], Roman Hornung [ctb] |
Maintainer: | Yuting He <[email protected]> |
License: | GPL-3 |
Version: | 0.1.6 |
Built: | 2025-02-16 06:33:00 UTC |
Source: | https://github.com/yuting-he/robustprediction |
This package provides robust parameter tuning and predictive modeling techniques, useful for situations where prediction across different data sources is important and the data distribution varies slightly from source to source.
The 'RobustPrediction' package helps users build and tune classifiers using the 'RobustTuneC', internal, or external tuning method. The package supports the following classifiers: boosting, lasso, ridge, random forest, and support vector machine (SVM). It is intended for scenarios where parameter tuning across data sources is important.
The 'RobustPrediction' package provides comprehensive tools for robust parameter tuning and predictive modeling, particularly for cross-source prediction tasks.
The package includes functions for tuning model parameters using three methods:
- **Internal tuning**: Standard cross-validation on the training data to select the best parameters.
- **External tuning**: Parameter tuning based on an external dataset that is independent of the training data. This method has two variants controlled by the `estperf` argument:
  - **Standard external tuning (`estperf = FALSE`)**: Parameters are tuned directly using the external dataset. This is the default approach and provides a straightforward method for selecting optimal parameters based on external data.
  - **Conservative external tuning (`estperf = TRUE`)**: Internal tuning is first performed on the training data, and then the model is evaluated on the external dataset. This approach provides a more conservative (slightly pessimistic) AUC estimate, as described by Ellenbach et al. (2021). For the most accurate performance evaluation, it is recommended to use a second external dataset.
- **RobustTuneC**: A method designed to combine internal and external tuning for better performance in cross-source scenarios.
The package supports Lasso, Ridge, Random Forest, Boosting, and SVM classifiers. These models can be trained and tuned using the provided methods, and the package reports the model's AUC (Area Under the Curve) value to help users evaluate prediction performance.
It is particularly useful when the data to be predicted comes from a different source than the training data, where variability between datasets may require more robust parameter tuning techniques. The methods provided in this package may help reduce overfitting the training data distribution and improve model generalization across different data sources.
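The conservative external tuning idea described above can be sketched with standard tools. This is a minimal illustration using glmnet and pROC, not the package's own implementation; `train` and `extern` are hypothetical data frames with a two-level factor response in the first column, as the package requires:

```r
library(glmnet)  # penalized logistic regression
library(pROC)    # AUC computation

# Internal tuning: cross-validate lambda on the training data only
x_train <- as.matrix(train[, -1])
cv_fit  <- cv.glmnet(x_train, train[[1]], family = "binomial")

# Conservative evaluation: score the untouched external dataset
x_ext   <- as.matrix(extern[, -1])
probs   <- predict(cv_fit, newx = x_ext, s = "lambda.min", type = "response")
ext_auc <- auc(roc(extern[[1]], as.numeric(probs), quiet = TRUE))
```

Because lambda is chosen without looking at the external data, the resulting AUC tends to be slightly pessimistic for the cross-source setting, which is the behaviour the `estperf = TRUE` option formalizes.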
This package requires the following packages: glmnet, mboost, mlr, pROC, ranger.
Maintainer: Yuting He [email protected]
Other contributors:
Nicole Ellenbach [contributor]
Roman Hornung [contributor]
Ellenbach, N., Boulesteix, A.-L., Bischl, B., Unger, K., & Hornung, R. (2021). Improved outcome prediction across data sources through robust parameter tuning. Journal of Classification, 38, 212-231. <doi:10.1007/s00357-020-09368-z>.
# Example usage:
data(sample_data_train)
data(sample_data_extern)
res <- tuneandtrain(sample_data_train, sample_data_extern,
  tuningmethod = "robusttunec", classifier = "lasso")
This dataset, named 'sample_data_extern', is a subset of publicly available microarray data from the HG-U133PLUS2 chip. It contains expression levels of 200 genes across 50 samples, used primarily as an external validation set in robust feature selection studies. The data has been sourced from the ArrayExpress repository and has been referenced in several research articles.
sample_data_extern
A data frame with 50 observations and 201 variables, including:
Factor. The response variable.
Numeric. Expression level of gene 236694_at.
Numeric. Expression level of gene 222356_at.
Numeric. Expression level of gene 1554125_a_at.
Numeric. Expression level of gene 232823_at.
Numeric. Expression level of gene 205766_at.
Numeric. Expression level of gene 1560446_at.
Numeric. Expression level of gene 202565_s_at.
Numeric. Expression level of gene 234887_at.
Numeric. Expression level of gene 209687_at.
Numeric. Expression level of gene 221592_at.
Numeric. Expression level of gene 1570123_at.
Numeric. Expression level of gene 241368_at.
Numeric. Expression level of gene 243324_x_at.
Numeric. Expression level of gene 224046_s_at.
Numeric. Expression level of gene 202775_s_at.
Numeric. Expression level of gene 216332_at.
Numeric. Expression level of gene 1569545_at.
Numeric. Expression level of gene 205946_at.
Numeric. Expression level of gene 203547_at.
Numeric. Expression level of gene 243239_at.
Numeric. Expression level of gene 234245_at.
Numeric. Expression level of gene 210832_x_at.
Numeric. Expression level of gene 224549_x_at.
Numeric. Expression level of gene 236628_at.
Numeric. Expression level of gene 214848_at.
Numeric. Expression level of gene 1553015_a_at.
Numeric. Expression level of gene 1554199_at.
Numeric. Expression level of gene 1557636_a_at.
Numeric. Expression level of gene 1558511_s_at.
Numeric. Expression level of gene 1561713_at.
Numeric. Expression level of gene 1561883_at.
Numeric. Expression level of gene 1568720_at.
Numeric. Expression level of gene 1569168_at.
Numeric. Expression level of gene 1569443_s_at.
Numeric. Expression level of gene 1570103_at.
Numeric. Expression level of gene 200916_at.
Numeric. Expression level of gene 201554_x_at.
Numeric. Expression level of gene 202371_at.
Numeric. Expression level of gene 204481_at.
Numeric. Expression level of gene 205831_at.
Numeric. Expression level of gene 207061_at.
Numeric. Expression level of gene 207423_s_at.
Numeric. Expression level of gene 209896_s_at.
Numeric. Expression level of gene 212646_at.
Numeric. Expression level of gene 214068_at.
Numeric. Expression level of gene 217727_x_at.
Numeric. Expression level of gene 221103_s_at.
Numeric. Expression level of gene 221785_at.
Numeric. Expression level of gene 224207_x_at.
Numeric. Expression level of gene 228257_at.
Numeric. Expression level of gene 228877_at.
Numeric. Expression level of gene 231173_at.
Numeric. Expression level of gene 231328_s_at.
Numeric. Expression level of gene 231639_at.
Numeric. Expression level of gene 232221_x_at.
Numeric. Expression level of gene 232349_x_at.
Numeric. Expression level of gene 232849_at.
Numeric. Expression level of gene 233601_at.
Numeric. Expression level of gene 234403_at.
Numeric. Expression level of gene 234585_at.
Numeric. Expression level of gene 234650_at.
Numeric. Expression level of gene 234897_s_at.
Numeric. Expression level of gene 236071_at.
Numeric. Expression level of gene 236689_at.
Numeric. Expression level of gene 238551_at.
Numeric. Expression level of gene 239414_at.
Numeric. Expression level of gene 241034_at.
Numeric. Expression level of gene 241131_at.
Numeric. Expression level of gene 241897_at.
Numeric. Expression level of gene 242611_at.
Numeric. Expression level of gene 244805_at.
Numeric. Expression level of gene 244866_at.
Numeric. Expression level of gene 32259_at.
Numeric. Expression level of gene 1552264_a_at.
Numeric. Expression level of gene 1552880_at.
Numeric. Expression level of gene 1553186_x_at.
Numeric. Expression level of gene 1553372_at.
Numeric. Expression level of gene 1553438_at.
Numeric. Expression level of gene 1554299_at.
Numeric. Expression level of gene 1554362_at.
Numeric. Expression level of gene 1554491_a_at.
Numeric. Expression level of gene 1555098_a_at.
Numeric. Expression level of gene 1555990_at.
Numeric. Expression level of gene 1556034_s_at.
Numeric. Expression level of gene 1556822_s_at.
Numeric. Expression level of gene 1556824_at.
Numeric. Expression level of gene 1557278_s_at.
Numeric. Expression level of gene 1558603_at.
Numeric. Expression level of gene 1558890_at.
Numeric. Expression level of gene 1560791_at.
Numeric. Expression level of gene 1561083_at.
Numeric. Expression level of gene 1561364_at.
Numeric. Expression level of gene 1561553_at.
Numeric. Expression level of gene 1562523_at.
Numeric. Expression level of gene 1562613_at.
Numeric. Expression level of gene 1563351_at.
Numeric. Expression level of gene 1563473_at.
Numeric. Expression level of gene 1566780_at.
Numeric. Expression level of gene 1567257_at.
Numeric. Expression level of gene 1569664_at.
Numeric. Expression level of gene 1569882_at.
Numeric. Expression level of gene 1570252_at.
Numeric. Expression level of gene 201089_at.
Numeric. Expression level of gene 201261_x_at.
Numeric. Expression level of gene 202052_s_at.
Numeric. Expression level of gene 202236_s_at.
Numeric. Expression level of gene 202948_at.
Numeric. Expression level of gene 203080_s_at.
Numeric. Expression level of gene 203211_s_at.
Numeric. Expression level of gene 203218_at.
Numeric. Expression level of gene 203236_s_at.
Numeric. Expression level of gene 203347_s_at.
Numeric. Expression level of gene 203960_s_at.
Numeric. Expression level of gene 204609_at.
Numeric. Expression level of gene 204806_x_at.
Numeric. Expression level of gene 204949_at.
Numeric. Expression level of gene 204979_s_at.
Numeric. Expression level of gene 205823_at.
Numeric. Expression level of gene 205902_at.
Numeric. Expression level of gene 205967_at.
Numeric. Expression level of gene 206186_at.
Numeric. Expression level of gene 207151_at.
Numeric. Expression level of gene 207379_at.
Numeric. Expression level of gene 207440_at.
Numeric. Expression level of gene 207883_s_at.
Numeric. Expression level of gene 208277_at.
Numeric. Expression level of gene 208280_at.
Numeric. Expression level of gene 209224_s_at.
Numeric. Expression level of gene 209561_at.
Numeric. Expression level of gene 209630_s_at.
Numeric. Expression level of gene 210118_s_at.
Numeric. Expression level of gene 210342_s_at.
Numeric. Expression level of gene 211566_x_at.
Numeric. Expression level of gene 211756_at.
Numeric. Expression level of gene 212170_at.
Numeric. Expression level of gene 212494_at.
Numeric. Expression level of gene 213118_at.
Numeric. Expression level of gene 214475_x_at.
Numeric. Expression level of gene 214834_at.
Numeric. Expression level of gene 215718_s_at.
Numeric. Expression level of gene 216283_s_at.
Numeric. Expression level of gene 217206_at.
Numeric. Expression level of gene 217557_s_at.
Numeric. Expression level of gene 217577_at.
Numeric. Expression level of gene 218152_at.
Numeric. Expression level of gene 218252_at.
Numeric. Expression level of gene 219714_s_at.
Numeric. Expression level of gene 220506_at.
Numeric. Expression level of gene 220889_s_at.
Numeric. Expression level of gene 221204_s_at.
Numeric. Expression level of gene 221795_at.
Numeric. Expression level of gene 222048_at.
Numeric. Expression level of gene 223142_s_at.
Numeric. Expression level of gene 223439_at.
Numeric. Expression level of gene 223673_at.
Numeric. Expression level of gene 224363_at.
Numeric. Expression level of gene 224512_s_at.
Numeric. Expression level of gene 224690_at.
Numeric. Expression level of gene 224936_at.
Numeric. Expression level of gene 225334_at.
Numeric. Expression level of gene 225713_at.
Numeric. Expression level of gene 225839_at.
Numeric. Expression level of gene 226041_at.
Numeric. Expression level of gene 226093_at.
Numeric. Expression level of gene 226543_at.
Numeric. Expression level of gene 227695_at.
Numeric. Expression level of gene 228295_at.
Numeric. Expression level of gene 228548_at.
Numeric. Expression level of gene 229234_at.
Numeric. Expression level of gene 229658_at.
Numeric. Expression level of gene 229725_at.
Numeric. Expression level of gene 230252_at.
Numeric. Expression level of gene 230471_at.
Numeric. Expression level of gene 231149_s_at.
Numeric. Expression level of gene 231556_at.
Numeric. Expression level of gene 231754_at.
Numeric. Expression level of gene 232011_s_at.
Numeric. Expression level of gene 233030_at.
Numeric. Expression level of gene 234161_at.
Numeric. Expression level of gene 235050_at.
Numeric. Expression level of gene 235094_at.
Numeric. Expression level of gene 235278_at.
Numeric. Expression level of gene 235671_at.
Numeric. Expression level of gene 235952_at.
Numeric. Expression level of gene 236158_at.
Numeric. Expression level of gene 236181_at.
Numeric. Expression level of gene 237055_at.
Numeric. Expression level of gene 237768_x_at.
Numeric. Expression level of gene 238897_at.
Numeric. Expression level of gene 239160_at.
Numeric. Expression level of gene 239998_at.
Numeric. Expression level of gene 240254_at.
Numeric. Expression level of gene 240612_at.
Numeric. Expression level of gene 240692_at.
Numeric. Expression level of gene 240822_at.
Numeric. Expression level of gene 240842_at.
Numeric. Expression level of gene 241331_at.
Numeric. Expression level of gene 241598_at.
Numeric. Expression level of gene 241927_x_at.
Numeric. Expression level of gene 242405_at.
This dataset was extracted from a larger dataset available on ArrayExpress and is used as an external validation set for feature selection tasks and other machine learning applications in bioinformatics.
The original dataset can be found on ArrayExpress: https://www.ebi.ac.uk/arrayexpress
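Before using the data with the tuning functions, it can help to confirm it matches the package's required layout (response first, factor with exactly two levels). A quick sketch of such a check, assuming the package and dataset are installed:

```r
data(sample_data_extern)
# The package requires the response in the first column as a two-level factor
stopifnot(is.factor(sample_data_extern[[1]]),
          nlevels(sample_data_extern[[1]]) == 2)
dim(sample_data_extern)  # expected: 50 rows, 201 columns
```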
Ellenbach, N., Boulesteix, A.L., Bischl, B., et al. (2021). Improved Outcome Prediction Across Data Sources Through Robust Parameter Tuning. Journal of Classification, 38, 212–231. doi:10.1007/s00357-020-09368-z.
Hornung, R., Causeur, D., Bernau, C., Boulesteix, A.L. (2017). Improving cross-study prediction through addon batch effect adjustment or addon normalization. Bioinformatics, 33(3), 397–404. doi:10.1093/bioinformatics/btw650.
# Load the dataset
data(sample_data_extern)
# View the first few rows of the dataset
head(sample_data_extern)
# Summary of the dataset
summary(sample_data_extern)
This dataset, named 'sample_data_train', is a subset of publicly available microarray data from the HG-U133PLUS2 chip. It contains expression levels of 200 genes across 50 samples, used primarily as a training set in robust feature selection studies. The data has been sourced from the ArrayExpress repository and has been referenced in several research articles.
sample_data_train
A data frame with 50 observations and 201 variables, including:
Factor. The response variable.
Numeric. Expression level of gene 236694_at.
Numeric. Expression level of gene 222356_at.
Numeric. Expression level of gene 1554125_a_at.
Numeric. Expression level of gene 232823_at.
Numeric. Expression level of gene 205766_at.
Numeric. Expression level of gene 1560446_at.
Numeric. Expression level of gene 202565_s_at.
Numeric. Expression level of gene 234887_at.
Numeric. Expression level of gene 209687_at.
Numeric. Expression level of gene 221592_at.
Numeric. Expression level of gene 1570123_at.
Numeric. Expression level of gene 241368_at.
Numeric. Expression level of gene 243324_x_at.
Numeric. Expression level of gene 224046_s_at.
Numeric. Expression level of gene 202775_s_at.
Numeric. Expression level of gene 216332_at.
Numeric. Expression level of gene 1569545_at.
Numeric. Expression level of gene 205946_at.
Numeric. Expression level of gene 203547_at.
Numeric. Expression level of gene 243239_at.
Numeric. Expression level of gene 234245_at.
Numeric. Expression level of gene 210832_x_at.
Numeric. Expression level of gene 224549_x_at.
Numeric. Expression level of gene 236628_at.
Numeric. Expression level of gene 214848_at.
Numeric. Expression level of gene 1553015_a_at.
Numeric. Expression level of gene 1554199_at.
Numeric. Expression level of gene 1557636_a_at.
Numeric. Expression level of gene 1558511_s_at.
Numeric. Expression level of gene 1561713_at.
Numeric. Expression level of gene 1561883_at.
Numeric. Expression level of gene 1568720_at.
Numeric. Expression level of gene 1569168_at.
Numeric. Expression level of gene 1569443_s_at.
Numeric. Expression level of gene 1570103_at.
Numeric. Expression level of gene 200916_at.
Numeric. Expression level of gene 201554_x_at.
Numeric. Expression level of gene 202371_at.
Numeric. Expression level of gene 204481_at.
Numeric. Expression level of gene 205831_at.
Numeric. Expression level of gene 207061_at.
Numeric. Expression level of gene 207423_s_at.
Numeric. Expression level of gene 209896_s_at.
Numeric. Expression level of gene 212646_at.
Numeric. Expression level of gene 214068_at.
Numeric. Expression level of gene 217727_x_at.
Numeric. Expression level of gene 221103_s_at.
Numeric. Expression level of gene 221785_at.
Numeric. Expression level of gene 224207_x_at.
Numeric. Expression level of gene 228257_at.
Numeric. Expression level of gene 228877_at.
Numeric. Expression level of gene 231173_at.
Numeric. Expression level of gene 231328_s_at.
Numeric. Expression level of gene 231639_at.
Numeric. Expression level of gene 232221_x_at.
Numeric. Expression level of gene 232349_x_at.
Numeric. Expression level of gene 232849_at.
Numeric. Expression level of gene 233601_at.
Numeric. Expression level of gene 234403_at.
Numeric. Expression level of gene 234585_at.
Numeric. Expression level of gene 234650_at.
Numeric. Expression level of gene 234897_s_at.
Numeric. Expression level of gene 236071_at.
Numeric. Expression level of gene 236689_at.
Numeric. Expression level of gene 238551_at.
Numeric. Expression level of gene 239414_at.
Numeric. Expression level of gene 241034_at.
Numeric. Expression level of gene 241131_at.
Numeric. Expression level of gene 241897_at.
Numeric. Expression level of gene 242611_at.
Numeric. Expression level of gene 244805_at.
Numeric. Expression level of gene 244866_at.
Numeric. Expression level of gene 32259_at.
Numeric. Expression level of gene 1552264_a_at.
Numeric. Expression level of gene 1552880_at.
Numeric. Expression level of gene 1553186_x_at.
Numeric. Expression level of gene 1553372_at.
Numeric. Expression level of gene 1553438_at.
Numeric. Expression level of gene 1554299_at.
Numeric. Expression level of gene 1554362_at.
Numeric. Expression level of gene 1554491_a_at.
Numeric. Expression level of gene 1555098_a_at.
Numeric. Expression level of gene 1555990_at.
Numeric. Expression level of gene 1556034_s_at.
Numeric. Expression level of gene 1556822_s_at.
Numeric. Expression level of gene 1556824_at.
Numeric. Expression level of gene 1557278_s_at.
Numeric. Expression level of gene 1558603_at.
Numeric. Expression level of gene 1558890_at.
Numeric. Expression level of gene 1560791_at.
Numeric. Expression level of gene 1561083_at.
Numeric. Expression level of gene 1561364_at.
Numeric. Expression level of gene 1561553_at.
Numeric. Expression level of gene 1562523_at.
Numeric. Expression level of gene 1562613_at.
Numeric. Expression level of gene 1563351_at.
Numeric. Expression level of gene 1563473_at.
Numeric. Expression level of gene 1566780_at.
Numeric. Expression level of gene 1567257_at.
Numeric. Expression level of gene 1569664_at.
Numeric. Expression level of gene 1569882_at.
Numeric. Expression level of gene 1570252_at.
Numeric. Expression level of gene 201089_at.
Numeric. Expression level of gene 201261_x_at.
Numeric. Expression level of gene 202052_s_at.
Numeric. Expression level of gene 202236_s_at.
Numeric. Expression level of gene 202948_at.
Numeric. Expression level of gene 203080_s_at.
Numeric. Expression level of gene 203211_s_at.
Numeric. Expression level of gene 203218_at.
Numeric. Expression level of gene 203236_s_at.
Numeric. Expression level of gene 203347_s_at.
Numeric. Expression level of gene 203960_s_at.
Numeric. Expression level of gene 204609_at.
Numeric. Expression level of gene 204806_x_at.
Numeric. Expression level of gene 204949_at.
Numeric. Expression level of gene 204979_s_at.
Numeric. Expression level of gene 205823_at.
Numeric. Expression level of gene 205902_at.
Numeric. Expression level of gene 205967_at.
Numeric. Expression level of gene 206186_at.
Numeric. Expression level of gene 207151_at.
Numeric. Expression level of gene 207379_at.
Numeric. Expression level of gene 207440_at.
Numeric. Expression level of gene 207883_s_at.
Numeric. Expression level of gene 208277_at.
Numeric. Expression level of gene 208280_at.
Numeric. Expression level of gene 209224_s_at.
Numeric. Expression level of gene 209561_at.
Numeric. Expression level of gene 209630_s_at.
Numeric. Expression level of gene 210118_s_at.
Numeric. Expression level of gene 210342_s_at.
Numeric. Expression level of gene 211566_x_at.
Numeric. Expression level of gene 211756_at.
Numeric. Expression level of gene 212170_at.
Numeric. Expression level of gene 212494_at.
Numeric. Expression level of gene 213118_at.
Numeric. Expression level of gene 214475_x_at.
Numeric. Expression level of gene 214834_at.
Numeric. Expression level of gene 215718_s_at.
Numeric. Expression level of gene 216283_s_at.
Numeric. Expression level of gene 217206_at.
Numeric. Expression level of gene 217557_s_at.
Numeric. Expression level of gene 217577_at.
Numeric. Expression level of gene 218152_at.
Numeric. Expression level of gene 218252_at.
Numeric. Expression level of gene 219714_s_at.
Numeric. Expression level of gene 220506_at.
Numeric. Expression level of gene 220889_s_at.
Numeric. Expression level of gene 221204_s_at.
Numeric. Expression level of gene 221795_at.
Numeric. Expression level of gene 222048_at.
Numeric. Expression level of gene 223142_s_at.
Numeric. Expression level of gene 223439_at.
Numeric. Expression level of gene 223673_at.
Numeric. Expression level of gene 224363_at.
Numeric. Expression level of gene 224512_s_at.
Numeric. Expression level of gene 224690_at.
Numeric. Expression level of gene 224936_at.
Numeric. Expression level of gene 225334_at.
Numeric. Expression level of gene 225713_at.
Numeric. Expression level of gene 225839_at.
Numeric. Expression level of gene 226041_at.
Numeric. Expression level of gene 226093_at.
Numeric. Expression level of gene 226543_at.
Numeric. Expression level of gene 227695_at.
Numeric. Expression level of gene 228295_at.
Numeric. Expression level of gene 228548_at.
Numeric. Expression level of gene 229234_at.
Numeric. Expression level of gene 229658_at.
Numeric. Expression level of gene 229725_at.
Numeric. Expression level of gene 230252_at.
Numeric. Expression level of gene 230471_at.
Numeric. Expression level of gene 231149_s_at.
Numeric. Expression level of gene 231556_at.
Numeric. Expression level of gene 231754_at.
Numeric. Expression level of gene 232011_s_at.
Numeric. Expression level of gene 233030_at.
Numeric. Expression level of gene 234161_at.
Numeric. Expression level of gene 235050_at.
Numeric. Expression level of gene 235094_at.
Numeric. Expression level of gene 235278_at.
Numeric. Expression level of gene 235671_at.
Numeric. Expression level of gene 235952_at.
Numeric. Expression level of gene 236158_at.
Numeric. Expression level of gene 236181_at.
Numeric. Expression level of gene 237055_at.
Numeric. Expression level of gene 237768_x_at.
Numeric. Expression level of gene 238897_at.
Numeric. Expression level of gene 239160_at.
Numeric. Expression level of gene 239998_at.
Numeric. Expression level of gene 240254_at.
Numeric. Expression level of gene 240612_at.
Numeric. Expression level of gene 240692_at.
Numeric. Expression level of gene 240822_at.
Numeric. Expression level of gene 240842_at.
Numeric. Expression level of gene 241331_at.
Numeric. Expression level of gene 241598_at.
Numeric. Expression level of gene 241927_x_at.
Numeric. Expression level of gene 242405_at.
This dataset was extracted from a larger dataset available on ArrayExpress. It is used as a training set for feature selection tasks and other machine learning applications in bioinformatics.
The original dataset can be found on ArrayExpress: https://www.ebi.ac.uk/arrayexpress
Ellenbach, N., Boulesteix, A.L., Bischl, B., et al. (2021). Improved Outcome Prediction Across Data Sources Through Robust Parameter Tuning. Journal of Classification, 38, 212–231. doi:10.1007/s00357-020-09368-z.
Hornung, R., Causeur, D., Bernau, C., Boulesteix, A.L. (2017). Improving cross-study prediction through addon batch effect adjustment or addon normalization. Bioinformatics, 33(3), 397–404. doi:10.1093/bioinformatics/btw650.
# Load the dataset:
data(sample_data_train)
# Dimension of the dataset:
dim(sample_data_train)
# View the first rows of the dataset:
head(sample_data_train)
This function tunes and trains a classifier using a specified tuning method. Depending on the method chosen, the function will either perform RobustTuneC, external tuning, or internal tuning.
tuneandtrain(data, dataext = NULL, tuningmethod, classifier, ...)
data |
A data frame containing the training data. The first column should be the response variable, which must be a factor for classification tasks. The remaining columns should be the predictor variables. Ensure that the data is properly formatted, with no missing values. |
dataext |
A data frame containing the external validation data, required only for the tuning methods "robusttunec" and "ext". Similar to the 'data' argument, the first column should be the response variable (factor), and the remaining columns should be the predictors. If 'tuningmethod = "int"', this parameter is ignored. |
tuningmethod |
A character string specifying which tuning approach to use. Options are: "robusttunec" (the RobustTuneC method), "ext" (external tuning), and "int" (internal cross-validation tuning). |
classifier |
A character string specifying which classifier to use. Must correspond to one of the supported classifiers: Lasso, Ridge, Random Forest, Boosting, or SVM (the examples in this documentation use classifier = "lasso" and classifier = "ridge"). |
... |
Additional parameters to be passed to the specific tuning and training functions. These can include options such as the number of trees for Random Forest, the number of folds for cross-validation, or hyperparameters specific to the chosen classifier. |
A list containing the results of the tuning and training process, which typically includes:
- The best hyperparameters selected during the tuning process.
- The final trained model.
- Performance metrics (AUC) on the training or validation data, depending on the tuning method.
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage: Robust tuning with Ridge classifier
result_ridge <- tuneandtrain(sample_data_train, sample_data_extern,
  tuningmethod = "robusttunec", classifier = "ridge")
result_ridge$best_lambda
result_ridge$best_model
result_ridge$final_auc

# Example usage: Internal cross-validation with Lasso classifier
result_lasso <- tuneandtrain(sample_data_train, tuningmethod = "int",
  classifier = "lasso", maxit = 120000, nlambda = 200, nfolds = 5)
result_lasso$best_lambda
result_lasso$best_model
result_lasso$final_auc
result_lasso$active_set_Train
This function tunes and trains a classifier using an external validation dataset. Based on the specified classifier, the function selects and runs the appropriate tuning and training process. The external validation data is used to optimize the model's hyperparameters and improve generalization performance across datasets.
tuneandtrainExt(data, dataext, classifier, ...)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. Ensure that the data is properly formatted, with no missing values. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. The external data is used for tuning hyperparameters to avoid overfitting on the training data. |
classifier |
A character string specifying the classifier to use. Must correspond to one of the supported classifiers: Lasso, Ridge, Random Forest, Boosting, or SVM (the examples in this documentation use classifier = "lasso" and classifier = "ridge"). |
... |
Additional arguments to pass to the specific classifier function. These may include hyperparameters such as the number of trees for Random Forest, regularization parameters for Lasso/Ridge, or kernel settings for SVM. |
A list containing the results from the classifier's tuning and training process. The returned object typically includes:
- best_model: The final trained model using the best hyperparameters.
- best_hyperparams: The optimal hyperparameters found during the tuning process.
- final_auc: Performance metrics (AUC) of the final model.
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage with Lasso
result_lasso <- tuneandtrainExt(sample_data_train, sample_data_extern,
  classifier = "lasso", maxit = 120000, nlambda = 100)
result_lasso$best_lambda
result_lasso$best_model
result_lasso$final_auc
result_lasso$active_set_Train

# Example usage with Ridge
result_ridge <- tuneandtrainExt(sample_data_train, sample_data_extern,
  classifier = "ridge", maxit = 120000, nlambda = 100)
result_ridge$best_lambda
result_ridge$best_model
result_ridge$final_auc
This function tunes and trains a Boosting classifier using the mboost::glmboost function. It provides two strategies for tuning the number of boosting iterations (mstop), controlled by the estperf argument:
- When estperf = FALSE (default): hyperparameters are tuned using the external validation dataset, and the mstop value that gives the highest AUC on the external dataset is selected as the best model. No AUC value is returned in this case, since an AUC computed on the same external data used for tuning would be over-optimistic.
- When estperf = TRUE: hyperparameters are tuned internally using the training dataset. The model is then validated on the external dataset to provide a conservative (slightly pessimistic) AUC estimate.
tuneandtrainExtBoost(
  data,
  dataext,
  estperf = FALSE,
  mstop_seq = seq(5, 1000, by = 5),
  nu = 0.1
)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
estperf |
A logical value indicating whether to use internal tuning with external validation (TRUE) or external tuning (FALSE). Default is FALSE. |
mstop_seq |
A numeric vector specifying the sequence of boosting iterations to evaluate. Default is seq(5, 1000, by = 5). |
nu |
A numeric value specifying the learning rate for boosting. Default is 0.1. |
A list containing the following components:
best_mstop: The optimal number of boosting iterations determined during the tuning process.
best_model: The trained Boosting model using the selected mstop.
est_auc: The AUC value evaluated on the external dataset. This is only returned when estperf = TRUE, providing a conservative (slightly pessimistic) estimate of the model's performance.
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage with external tuning (default)
mstop_seq <- seq(50, 500, by = 50)
result <- tuneandtrainExtBoost(sample_data_train, sample_data_extern,
  mstop_seq = mstop_seq, nu = 0.1)
print(result$best_mstop)  # Optimal mstop
print(result$best_model)  # Trained Boosting model
# Note: est_auc is not returned when estperf = FALSE

# Example usage with internal tuning and external validation
result_internal <- tuneandtrainExtBoost(sample_data_train, sample_data_extern,
  estperf = TRUE, mstop_seq = mstop_seq, nu = 0.1)
print(result_internal$best_mstop)  # Optimal mstop
print(result_internal$best_model)  # Trained Boosting model
print(result_internal$est_auc)     # AUC on external validation dataset
This function tunes and trains a Lasso classifier using the glmnet package. It provides two strategies for tuning hyperparameters, controlled by the estperf argument:

When estperf = FALSE (default): hyperparameters are tuned using the external validation dataset. The lambda value that gives the highest AUC on the external dataset is selected as the best model. However, no AUC value is returned in this case, as per best practices.

When estperf = TRUE: hyperparameters are tuned internally on the training dataset. The model is then validated on the external dataset to provide a conservative (slightly pessimistic) AUC estimate.
tuneandtrainExtLasso(
  data,
  dataext,
  estperf = FALSE,
  maxit = 120000,
  nlambda = 100
)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
estperf |
A logical value indicating whether to use internal tuning with external validation (TRUE) or external tuning (FALSE). Default is FALSE. |
maxit |
An integer specifying the maximum number of iterations. Default is 120000. |
nlambda |
An integer specifying the number of lambda values to use in the Lasso model. Default is 100. |
A list containing the following components:
best_lambda: The optimal lambda value determined during the tuning process.
best_model: The trained Lasso model using the selected lambda value.
est_auc: The AUC value evaluated on the external dataset. This is only returned when estperf = TRUE, providing a conservative (slightly pessimistic) estimate of the model's performance.
active_set_Train: The number of active (non-zero) coefficients in the model trained on the training dataset.
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage with external tuning (default)
result <- tuneandtrainExtLasso(sample_data_train, sample_data_extern,
  maxit = 120000, nlambda = 100)
print(result$best_lambda)
print(result$best_model)
print(result$active_set_Train)

# Example usage with internal tuning and external validation
result_internal <- tuneandtrainExtLasso(sample_data_train, sample_data_extern,
  estperf = TRUE, maxit = 120000, nlambda = 100)
print(result_internal$best_lambda)
print(result_internal$best_model)
print(result_internal$est_auc)
print(result_internal$active_set_Train)
This function tunes and trains a Random Forest classifier using the ranger package. It provides two strategies for tuning the min.node.size parameter, controlled by the estperf argument:

When estperf = FALSE (default): hyperparameters are tuned using the external validation dataset. The min.node.size value that gives the highest AUC on the external dataset is selected as the best model. However, no AUC value is returned in this case, as per best practices.

When estperf = TRUE: hyperparameters are tuned internally on the training dataset. The model is then validated on the external dataset to provide a conservative (slightly pessimistic) AUC estimate.
tuneandtrainExtRF(data, dataext, estperf = FALSE, num.trees = 500)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
estperf |
A logical value indicating whether to use internal tuning with external validation (TRUE) or external tuning (FALSE). Default is FALSE. |
num.trees |
An integer specifying the number of trees in the Random Forest. Default is 500. |
A list containing the following components:
best_min_node_size: The optimal min.node.size value determined during the tuning process.
best_model: The trained Random Forest model using the selected min.node.size.
est_auc: The AUC value evaluated on the external dataset. This is only returned when estperf = TRUE, providing a conservative (slightly pessimistic) estimate of the model's performance.
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage with external tuning (default)
result <- tuneandtrainExtRF(sample_data_train, sample_data_extern, num.trees = 500)
print(result$best_min_node_size)  # Optimal min.node.size
print(result$best_model)          # Trained Random Forest model
# Note: est_auc is not returned when estperf = FALSE

# Example usage with internal tuning and external validation
result_internal <- tuneandtrainExtRF(sample_data_train, sample_data_extern,
  estperf = TRUE, num.trees = 500)
print(result_internal$best_min_node_size)  # Optimal min.node.size
print(result_internal$best_model)          # Trained Random Forest model
print(result_internal$est_auc)             # AUC on external validation dataset
This function tunes and trains a Ridge classifier using the glmnet package. It provides two strategies for tuning the regularization parameter lambda, controlled by the estperf argument:

When estperf = FALSE (default): hyperparameters are tuned using the external validation dataset. The lambda value that gives the highest AUC on the external dataset is selected as the best model. However, no AUC value is returned in this case, as per best practices.

When estperf = TRUE: hyperparameters are tuned internally on the training dataset. The model is then validated on the external dataset to provide a conservative (slightly pessimistic) AUC estimate.
tuneandtrainExtRidge(
  data,
  dataext,
  estperf = FALSE,
  maxit = 120000,
  nlambda = 100
)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
estperf |
A logical value indicating whether to use internal tuning with external validation (TRUE) or external tuning (FALSE). Default is FALSE. |
maxit |
An integer specifying the maximum number of iterations. Default is 120000. |
nlambda |
An integer specifying the number of lambda values to use in the Ridge model. Default is 100. |
A list containing the following components:
best_lambda: The optimal lambda value determined during the tuning process.
best_model: The trained Ridge model using the selected lambda.
est_auc: The AUC value evaluated on the external dataset. This is only returned when estperf = TRUE, providing a conservative (slightly pessimistic) estimate of the model's performance.
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage with external tuning (default)
result <- tuneandtrainExtRidge(sample_data_train, sample_data_extern,
  maxit = 120000, nlambda = 100)
print(result$best_lambda)  # Optimal lambda
print(result$best_model)   # Final trained model
# Note: est_auc is not returned when estperf = FALSE

# Example usage with internal tuning and external validation
result_internal <- tuneandtrainExtRidge(sample_data_train, sample_data_extern,
  estperf = TRUE, maxit = 120000, nlambda = 100)
print(result_internal$best_lambda)  # Optimal lambda
print(result_internal$best_model)   # Final trained model
print(result_internal$est_auc)      # AUC on external validation dataset
This function tunes and trains a Support Vector Machine (SVM) classifier using the mlr package. It provides two strategies for tuning the cost parameter, controlled by the estperf argument:

When estperf = FALSE (default): hyperparameters are tuned using the external validation dataset. The cost value that gives the highest AUC on the external dataset is selected as the best model. However, no AUC value is returned in this case, as per best practices.

When estperf = TRUE: hyperparameters are tuned internally on the training dataset. The model is then validated on the external dataset to provide a conservative (slightly pessimistic) AUC estimate.
tuneandtrainExtSVM(
  data,
  dataext,
  estperf = FALSE,
  kernel = "linear",
  cost_seq = 2^(-15:15),
  scale = FALSE
)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
estperf |
A logical value indicating whether to use internal tuning with external validation (TRUE) or external tuning (FALSE). Default is FALSE. |
kernel |
A character string specifying the kernel type to be used in the SVM. Default is "linear". |
cost_seq |
A numeric vector specifying the sequence of cost values to evaluate. Default is 2^(-15:15). |
scale |
A logical value indicating whether to scale the predictor variables. Default is FALSE. |
A list containing the following components:
best_cost: The optimal cost value determined during the tuning process.
best_model: The trained SVM model using the selected cost.
est_auc: The AUC value evaluated on the external dataset. This is only returned when estperf = TRUE, providing a conservative (slightly pessimistic) estimate of the model's performance.
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage with external tuning (default)
result <- tuneandtrainExtSVM(sample_data_train, sample_data_extern,
  kernel = "linear", cost_seq = 2^(-15:15), scale = FALSE)
print(result$best_cost)   # Optimal cost
print(result$best_model)  # Final trained model
# Note: est_auc is not returned when estperf = FALSE

# Example usage with internal tuning and external validation
result_internal <- tuneandtrainExtSVM(sample_data_train, sample_data_extern,
  estperf = TRUE, kernel = "linear", cost_seq = 2^(-15:15), scale = FALSE)
print(result_internal$best_cost)   # Optimal cost
print(result_internal$best_model)  # Final trained model
print(result_internal$est_auc)     # AUC on external validation dataset
This function tunes and trains a specified classifier using internal cross-validation. The classifier is specified by the 'classifier' argument, and the function delegates to the appropriate tuning and training function based on this choice.
tuneandtrainInt(data, classifier, ...)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
classifier |
A character string specifying the classifier to use. Must be one of 'boosting', 'rf', 'lasso', 'ridge', 'svm'. |
... |
Additional arguments to pass to the specific classifier function. |
A list containing the results from the specific classifier's tuning and training process. The list typically includes:
best_hyperparams: The best hyperparameters selected by cross-validation.
best_model: The final trained model using the selected hyperparameters.
final_auc: Cross-validation results (AUC).
# Load sample data
data(sample_data_train)

# Example usage with Lasso
result_lasso <- tuneandtrainInt(sample_data_train, classifier = "lasso",
  maxit = 120000, nlambda = 100)
result_lasso$best_lambda
result_lasso$best_model
result_lasso$final_auc
result_lasso$active_set_Train

# Example usage with Ridge
result_ridge <- tuneandtrainInt(sample_data_train, classifier = "ridge",
  maxit = 120000, nlambda = 100)
result_ridge$best_lambda
result_ridge$best_model
result_ridge$final_auc
This function tunes and trains a Boosting classifier using the mboost package. The function evaluates a sequence of boosting iterations on the training dataset using internal cross-validation and selects the best model based on the Area Under the Curve (AUC).
tuneandtrainIntBoost(data, mstop_seq = seq(5, 1000, by = 5), nu = 0.1)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
mstop_seq |
A numeric vector of boosting iterations to be evaluated. Default is a sequence from 5 to 1000 with a step of 5. |
nu |
A numeric value for the learning rate. Default is 0.1. |
This function performs K-fold cross-validation on the training dataset, where the number of boosting iterations (mstop) is tuned to maximize the AUC. The optimal number of boosting iterations is selected, and the final model is trained on the entire training dataset.
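The same selection idea can be sketched with mboost's own tools. Note this is an illustrative approximation, not the package's internal code: mboost's cvrisk() picks mstop by minimizing cross-validated risk rather than maximizing AUC as this function does. It assumes the sample_data_train layout described above (binary factor response in the first column).

```r
# Illustrative sketch only: choosing mstop for glmboost via 5-fold CV.
library(mboost)

data(sample_data_train)

# Fit a boosted logistic model over the full mstop path
fit <- glmboost(as.formula(paste(names(sample_data_train)[1], "~ .")),
                data = sample_data_train, family = Binomial(),
                control = boost_control(mstop = 1000, nu = 0.1))

# 5-fold CV over the boosting path; cvrisk() minimizes out-of-sample risk
cv <- cvrisk(fit, folds = cv(model.weights(fit), type = "kfold", B = 5))
best_mstop <- mstop(cv)

# Truncate the model at the selected number of iterations
fit_best <- fit[best_mstop]
```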
A list containing the best number of boosting iterations ('best_mstop') and the final Boosting classifier model ('best_model').
# Load sample data
data(sample_data_train)

# Example usage
mstop_seq <- seq(5, 5000, by = 5)
result <- tuneandtrainIntBoost(sample_data_train, mstop_seq, nu = 0.1)
result$best_mstop
result$best_model
This function tunes and trains a Lasso classifier using the glmnet package. The function performs internal cross-validation to evaluate a sequence of lambda (regularization) values and selects the best model based on the Area Under the Curve (AUC).
tuneandtrainIntLasso(data, maxit = 120000, nlambda = 200, nfolds = 5)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
maxit |
An integer specifying the maximum number of iterations. Default is 120000. |
nlambda |
An integer specifying the number of lambda values to use in the Lasso model. Default is 200. |
nfolds |
An integer specifying the number of folds for cross-validation. Default is 5. |
This function trains a logistic Lasso model on the training dataset using cross-validation. The lambda value that results in the highest AUC during cross-validation is chosen as the best model, and the final model is trained on the full training dataset with this optimal lambda value.
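The mechanism described above can be sketched with plain glmnet. This is an illustrative approximation, not the package's internal code, and it assumes the sample_data_train layout described under data (binary factor response in the first column):

```r
# Illustrative sketch only: selecting a Lasso lambda by cross-validated AUC.
library(glmnet)

data(sample_data_train)
x <- as.matrix(sample_data_train[, -1])
y <- sample_data_train[[1]]

# cv.glmnet with type.measure = "auc" scores each lambda by CV AUC;
# lambda.min is the value with the best CV score
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    type.measure = "auc", nfolds = 5,
                    nlambda = 200, maxit = 120000)
best_lambda <- cv_fit$lambda.min

# Refit on the full training data at the chosen lambda
final_fit <- glmnet(x, y, family = "binomial", alpha = 1,
                    lambda = best_lambda, maxit = 120000)
active_set <- sum(coef(final_fit) != 0) - 1  # non-zero coefficients, excluding intercept
```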
A list containing the best lambda value ('best_lambda'), the final trained model ('best_model'), and the number of active coefficients ('active_set_Train').
# Load sample data
data(sample_data_train)

# Example usage
result <- tuneandtrainIntLasso(sample_data_train, maxit = 120000, nlambda = 200, nfolds = 5)
result$best_lambda
result$best_model
result$active_set_Train
This function tunes and trains a Random Forest classifier using the ranger package with internal cross-validation. The function evaluates a sequence of min.node.size values on the training dataset and selects the best model based on the Area Under the Curve (AUC).
tuneandtrainIntRF(data, num.trees = 500, nfolds = 5, seed = 123)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
num.trees |
An integer specifying the number of trees in the Random Forest. Default is 500. |
nfolds |
An integer specifying the number of folds for cross-validation. Default is 5. |
seed |
An integer specifying the random seed for reproducibility. Default is 123. |
Random Forest constructs multiple decision trees and aggregates their predictions. The min.node.size parameter controls the minimum number of samples in each terminal node, affecting model complexity. This function performs cross-validation within the training dataset to evaluate the impact of different min.node.size values. The min.node.size value that results in the highest AUC is selected as the best model.
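A grid search over min.node.size can be sketched as follows. This is an illustrative approximation, not the package's internal code: it scores each candidate by out-of-bag AUC rather than the K-fold cross-validation this function uses, and the node_sizes grid is an assumed example:

```r
# Illustrative sketch only: tuning min.node.size with ranger, scored by OOB AUC.
library(ranger)
library(pROC)

data(sample_data_train)
node_sizes <- c(1, 5, 10, 20, 50)  # assumed candidate grid

aucs <- sapply(node_sizes, function(ns) {
  fit <- ranger(dependent.variable.name = names(sample_data_train)[1],
                data = sample_data_train, num.trees = 500,
                min.node.size = ns, probability = TRUE, seed = 123)
  # fit$predictions holds out-of-bag class probabilities
  oob_prob <- fit$predictions[, 2]
  as.numeric(auc(sample_data_train[[1]], oob_prob, quiet = TRUE))
})

best_min_node_size <- node_sizes[which.max(aucs)]
```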
A list containing the best 'min.node.size' value ('best_min_node_size') and the final trained model ('best_model').
# Load sample data
data(sample_data_train)

# Example usage
result <- tuneandtrainIntRF(sample_data_train, num.trees = 500, nfolds = 5, seed = 123)
result$best_min_node_size
result$best_model
This function tunes and trains a Ridge classifier using the glmnet package. The function evaluates a sequence of lambda (regularization) values using internal cross-validation and selects the best model based on the Area Under the Curve (AUC).
tuneandtrainIntRidge(
  data,
  maxit = 120000,
  nlambda = 200,
  nfolds = 5,
  seed = 123
)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
maxit |
An integer specifying the maximum number of iterations. Default is 120000. |
nlambda |
An integer specifying the number of lambda values to use in the Ridge model. Default is 200. |
nfolds |
An integer specifying the number of folds for cross-validation. Default is 5. |
seed |
An integer specifying the random seed for reproducibility. Default is 123. |
The function trains a logistic Ridge regression model on the training dataset and performs cross-validation to select the best lambda value. The lambda value that gives the highest AUC on the training dataset during cross-validation is chosen as the best model.
A list containing the best lambda value ('best_lambda') and the final trained model ('best_model').
# Load sample data
data(sample_data_train)

# Example usage
result <- tuneandtrainIntRidge(sample_data_train, maxit = 120000, nlambda = 200,
  nfolds = 5, seed = 123)
result$best_lambda
result$best_model
This function tunes and trains a Support Vector Machine (SVM) classifier using the mlr package. The function evaluates a sequence of cost values using internal cross-validation and selects the best model based on the Area Under the Curve (AUC).
tuneandtrainIntSVM(
  data,
  kernel = "linear",
  cost_seq = 2^(-15:15),
  scale = FALSE,
  nfolds = 5,
  seed = 123
)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
kernel |
A character string specifying the kernel type to be used in the SVM. Default is "linear". |
cost_seq |
A numeric vector of cost values to be evaluated. Default is '2^(-15:15)'. |
scale |
A logical indicating whether to scale the predictor variables. Default is FALSE. |
nfolds |
An integer specifying the number of folds for cross-validation. Default is 5. |
seed |
An integer specifying the random seed for reproducibility. Default is 123. |
In Support Vector Machines, the cost parameter controls the trade-off between achieving a low training error and a low testing error. This function trains an SVM model on the training dataset, performs cross-validation, and selects the cost value that results in the highest AUC. The final model is then trained using the optimal cost value, and the performance is reported based on the AUC.
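The cost grid search can be sketched with e1071 instead of mlr. This is an illustrative approximation, not the package's internal code: e1071's tune.svm() scores by cross-validated error rather than AUC, and the tuning backend differs from mlr:

```r
# Illustrative sketch only: 5-fold CV over a 2^(-15:15) cost grid with e1071.
library(e1071)

data(sample_data_train)

tuned <- tune.svm(x = sample_data_train[, -1], y = sample_data_train[[1]],
                  kernel = "linear", cost = 2^(-15:15), scale = FALSE,
                  tunecontrol = tune.control(cross = 5))
best_cost <- tuned$best.parameters$cost

# Refit on the full training data at the selected cost
final_svm <- svm(x = sample_data_train[, -1], y = sample_data_train[[1]],
                 kernel = "linear", cost = best_cost, scale = FALSE,
                 probability = TRUE)
```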
A list containing the best cost value ('best_cost') and the final trained model ('best_model').
# Load sample data
data(sample_data_train)

# Example usage
result <- tuneandtrainIntSVM(
  sample_data_train,
  kernel = "linear",
  cost_seq = 2^(-15:15),
  scale = FALSE,
  nfolds = 5,
  seed = 123
)
result$best_cost
result$best_model
This function tunes and trains a specified classifier using the "RobustTuneC" method and the provided data.
tuneandtrainRobustTuneC(data, dataext, classifier, ...)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
classifier |
A character string specifying the classifier to use. Must be one of 'boosting', 'rf', 'lasso', 'ridge', 'svm'. |
... |
Additional arguments to pass to the specific classifier function. |
A list containing the results from the specific classifier's tuning and training process. The returned object typically includes:
best_hyperparams: The best hyperparameters selected through the RobustTuneC method.
best_model: The final trained model based on the best hyperparameters.
final_auc: Performance metrics (AUC) of the final model.
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage with Lasso
result_lasso <- tuneandtrainRobustTuneC(sample_data_train, sample_data_extern,
  classifier = "lasso", maxit = 120000, nlambda = 100)
result_lasso$best_lambda
result_lasso$best_model
result_lasso$final_auc
result_lasso$active_set_Train

# Example usage with Ridge
result_ridge <- tuneandtrainRobustTuneC(sample_data_train, sample_data_extern,
  classifier = "ridge", maxit = 120000, nlambda = 100)
result_ridge$best_lambda
result_ridge$best_model
result_ridge$final_auc
This function tunes and trains a Boosting classifier using the mboost::glmboost function and the "RobustTuneC" method. The function performs K-fold cross-validation on the training dataset and evaluates a sequence of boosting iterations (mstop) based on the Area Under the Curve (AUC).
tuneandtrainRobustTuneCBoost(
  data,
  dataext,
  K = 5,
  mstop_seq = seq(5, 1000, by = 5),
  nu = 0.1
)
data |
Training data as a data frame. The first column should be the response variable. |
dataext |
External validation data as a data frame. The first column should be the response variable. |
K |
Number of folds to use in cross-validation. Default is 5. |
mstop_seq |
A sequence of boosting iterations to consider. Default is a sequence starting at 5 and increasing by 5 each time, up to 1000. |
nu |
Learning rate for the boosting algorithm. Default is 0.1. |
After cross-validation, the best mstop value is selected based on the AUC, and the final Boosting model is trained using this optimal mstop. The external validation dataset is then used to calculate the final AUC and assess the model performance.
A list containing the best number of boosting iterations ('best_mstop'), the final trained model ('best_model'), and the chosen c value ('best_c').
# Load the sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage with the sample data
mstop_seq <- seq(50, 500, by = 50)
result <- tuneandtrainRobustTuneCBoost(sample_data_train, sample_data_extern,
  mstop_seq = mstop_seq)
result$best_mstop
result$best_model
result$best_c
This function tunes and trains a Lasso classifier using the glmnet package and the "RobustTuneC" method. The function uses K-fold cross-validation to evaluate a sequence of lambda (regularization) values and selects the best model based on the Area Under the Curve (AUC).
tuneandtrainRobustTuneCLasso(
  data,
  dataext,
  K = 5,
  maxit = 120000,
  nlambda = 100
)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
K |
Number of folds to use in cross-validation. Default is 5. |
maxit |
Maximum number of iterations. Default is 120000. |
nlambda |
The number of lambda values to use for cross-validation. Default is 100. |
This function trains a logistic Lasso model using the training dataset and validates it through cross-validation. After selecting the best lambda value based on the training data, the model is then applied to an external validation dataset to compute the final AUC. The lambda value that results in the highest AUC on the external validation dataset is chosen as the best model.
A list containing the best lambda value ('best_lambda'), the final trained model ('best_model'), the number of active coefficients ('active_set_Train'), and the chosen c value ('best_c').
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage
result <- tuneandtrainRobustTuneCLasso(sample_data_train, sample_data_extern,
  K = 5, maxit = 120000, nlambda = 100)
result$best_lambda
result$best_model
result$best_c
This function tunes and trains a Random Forest classifier using the ranger package and the "RobustTuneC" method. The function uses K-fold cross-validation to evaluate different min.node.size values on the training dataset and selects the best model based on the Area Under the Curve (AUC).
tuneandtrainRobustTuneCRF(data, dataext, K = 5, num.trees = 500)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
K |
Number of folds to use in cross-validation. Default is 5. |
num.trees |
An integer specifying the number of trees to grow in the Random Forest. Default is 500. |
Random Forest constructs multiple decision trees and aggregates their predictions. The min.node.size parameter controls the minimum number of samples in each terminal node, affecting model complexity. This function evaluates the candidate min.node.size values through cross-validation and then applies the best model to an external validation dataset. The min.node.size value that results in the highest AUC on the validation dataset is selected.
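For intuition, the K-fold split behind the cross-validation step can be sketched in base R as below; the exact fold-assignment scheme used internally is an assumption here (it may, for example, stratify by class):

```r
# Assign each of n observations to one of K folds of equal size
# by shuffling a repeated 1..K label vector.
set.seed(1)
K <- 5
n <- 100
folds <- sample(rep(seq_len(K), length.out = n))
table(folds)  # 20 observations per fold
```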
A list containing the best minimum node size ('best_min_node_size'), the final trained model ('best_model'), and the chosen c value ('best_c').
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage
result <- tuneandtrainRobustTuneCRF(sample_data_train, sample_data_extern,
  K = 5, num.trees = 500)
result$best_min_node_size
result$best_model
result$best_c
This function tunes and trains a Ridge classifier using the glmnet package with the "RobustTuneC" method. The function evaluates a sequence of lambda (regularization) values using K-fold cross-validation (K specified by the user) on the training dataset and selects the best model based on the Area Under the Curve (AUC).
tuneandtrainRobustTuneCRidge(data, dataext, K = 5, maxit = 120000, nlambda = 100)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
K |
Number of folds to use in cross-validation. Default is 5. |
maxit |
Maximum number of iterations. Default is 120000. |
nlambda |
The number of lambda values to use for cross-validation. Default is 100. |
The function first performs K-fold cross-validation on the training dataset to select the best lambda value based on AUC. Then, the model is further validated on an external dataset, and the lambda value that provides the best performance on the external dataset is chosen as the final model. The Ridge regression is fitted using the selected lambda value, and the final model's performance is evaluated using AUC on the external validation dataset.
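For intuition about what the lambda penalty does, the closed-form solution of linear ridge regression is sketched below. This is an illustration only: the package fits a penalized logistic model via glmnet, which is solved iteratively, and intercept handling is omitted here.

```r
# Linear ridge coefficients: (X'X + lambda * I)^{-1} X'y.
# Larger lambda shrinks the coefficients toward zero.
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

X <- diag(2)          # toy 2x2 design matrix
y <- c(1, 2)
ridge_coef(X, y, 0)   # lambda = 0 reproduces least squares: (1, 2)
ridge_coef(X, y, 1)   # lambda = 1 shrinks the estimates: (0.5, 1)
```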
A list containing the best lambda value ('best_lambda'), the final trained model ('best_model'), and the chosen c value ('best_c').
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage
result <- tuneandtrainRobustTuneCRidge(sample_data_train, sample_data_extern,
  K = 5, maxit = 120000, nlambda = 100)
result$best_lambda
result$best_model
result$best_c
This function tunes and trains a Support Vector Machine (SVM) classifier using the "RobustTuneC" method. It performs K-fold cross-validation (with K specified by the user) to select the best model based on the Area Under the Curve (AUC) metric.
tuneandtrainRobustTuneCSVM(data, dataext, K = 5, seed = 123, kernel = "linear", cost_seq = 2^(-15:15), scale = FALSE)
data |
A data frame containing the training data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
dataext |
A data frame containing the external validation data. The first column should be the response variable (factor), and the remaining columns should be the predictor variables. |
K |
Number of folds to use in cross-validation. Default is 5. |
seed |
An integer specifying the random seed for reproducibility. Default is 123. |
kernel |
A character string specifying the kernel type to be used in the SVM. It can be "linear", "polynomial", "radial", or "sigmoid". Default is "linear". |
cost_seq |
A numeric vector of cost values to be evaluated. Default is '2^(-15:15)'. |
scale |
A logical value indicating whether to scale the predictor variables. Default is 'FALSE'. |
In Support Vector Machines, the cost parameter controls the trade-off between achieving a low training error and a low testing error. This function trains an SVM model on the training dataset, performs cross-validation to evaluate the different cost values, and selects the one that yields the highest AUC. The final model is trained using the optimal cost value, and its performance is reported using the AUC metric on the external validation dataset.
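A quick look at the default cost grid, whose values follow directly from the default '2^(-15:15)':

```r
# Default search grid: 31 cost values spaced on a log2 scale,
# from 2^-15 (about 3.05e-05) up to 2^15 (32768).
cost_seq <- 2^(-15:15)
length(cost_seq)  # 31 candidate values
range(cost_seq)
```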
A list containing the best cost value ('best_cost'), the final trained model ('best_model'), and the chosen c value ('best_c').
# Load sample data
data(sample_data_train)
data(sample_data_extern)

# Example usage
result <- tuneandtrainRobustTuneCSVM(sample_data_train, sample_data_extern,
  K = 5, seed = 123, kernel = "linear", cost_seq = 2^(-15:15), scale = FALSE)
result$best_cost
result$best_model
result$best_c