API¶
Pipeline utilities¶
Nested Cross-Validation for scikit-learn using MPI.
This package provides nested cross-validation similar to scikit-learn’s GridSearchCV but uses the Message Passing Interface (MPI) for parallel computing.
class palladio.model_assessment.ModelAssessment(estimator, cv=None, scoring=None, fit_params=None, multi_output=False, shuffle_y=False, n_jobs=1, n_splits=10, test_size=0.1, train_size=None, random_state=None, groups=None, experiments_folder=None, verbose=False)[source]¶
Cross-validation with nested parameter search for each training fold.
The data is first split into cv train and test sets. For each training set, a grid search over the specified set of parameters is performed (inner cross-validation). The set of parameters that achieves the highest average score across all inner folds is used to re-fit a model on the entire training set of the outer cross-validation loop. Finally, results on the test set of the outer loop are reported.
Parameters: estimator : object type that implements the "fit" and "predict" methods
An object of that type is instantiated for each grid point.
cv : integer or cross-validation generator, optional, default: 3
If an integer is passed, it is the number of folds. Specific cross-validation objects can also be passed; see the sklearn.cross_validation module for the list of possible objects.
scoring : string, callable or None, optional, default: None
A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). See sklearn.metrics.get_scorer for details.
fit_params : dict, optional, default: None
Parameters to pass to the fit method.
multi_output : boolean, default: False
Allow multi-output y, as for multivariate regression.
shuffle_y : bool, optional, default=False
When True, the labels y are shuffled; this is used to perform a permutation test.
n_jobs : int, optional, default: 1
The number of jobs to use for the computation. This works by computing each of the Monte Carlo runs in parallel. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. Ignored when using MPI.
n_splits : int, optional, default: 10
The number of cross-validation splits (folds/iterations).
test_size : float (default 0.1), int, or None
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size.
train_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
random_state : int or RandomState, optional, default: None
Pseudo-random number generator state used for random sampling.
groups : array-like, with shape (n_samples,), optional, default: None
Group labels for the samples used while splitting the dataset into train/test set.
experiments_folder : string, optional, default: None
The path to the folder used to save the results.
verbose : bool, optional, default: False
Print debug messages.
Attributes
scorer_ : function
Scorer function used on the held-out data to choose the best parameters for the model.
cv_results_ : dictionary
Result of the fit. The dictionary is pandas.DataFrame-able. Each row holds the results of one external split. Columns are: 'split_i', 'learn_score', 'test_score', 'cv_results_', 'ytr_pred', 'yts_pred', 'test_index', 'train_index', 'estimator'.
Example:
>>> pd.DataFrame(cv_results_)
split_i | learn_score | test_score | cv_results_         | ...
      0 |       0.987 |      0.876 | {<internal splits>} | ...
      1 |       0.846 |      0.739 | {<internal splits>} | ...
      2 |       0.956 |      0.630 | {<internal splits>} | ...
      3 |       0.964 |      0.835 | {<internal splits>} | ...
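For illustration, a minimal usage sketch (not part of the original documentation; it assumes a scikit-learn-style fit(X, y) method and that the inner grid search is supplied as a GridSearchCV estimator):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from palladio.model_assessment import ModelAssessment

X, y = make_classification(n_samples=100, n_features=20, random_state=42)

# Inner loop: grid search over the estimator's parameters.
inner = GridSearchCV(LinearSVC(), param_grid={'C': [0.1, 1, 10]}, cv=3)

# Outer loop: n_splits Monte Carlo train/test splits (assumed fit signature).
ma = ModelAssessment(inner, n_splits=10, test_size=0.1, random_state=0)
ma.fit(X, y)

# cv_results_ is pandas.DataFrame-able, one row per external split.
print(pd.DataFrame(ma.cv_results_))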
Extra tools¶
Utility functions and classes.
palladio.utils.retrieve_features(best_estimator)[source]¶
Retrieve selected features from any estimator.
If the estimator has a 'get_support' method, use it. Otherwise, if it has a 'coef_' attribute, assume it is a linear model: the selected features correspond to the indices of the coefficients != 0.
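A minimal sketch of the logic just described (a hypothetical re-implementation for illustration; retrieve_features_sketch is not part of palladio):

import numpy as np

def retrieve_features_sketch(best_estimator):
    # Feature selectors expose a boolean mask via get_support().
    if hasattr(best_estimator, 'get_support'):
        return np.nonzero(best_estimator.get_support())[0]
    # Linear models: selected features are the nonzero coefficients.
    if hasattr(best_estimator, 'coef_'):
        return np.nonzero(np.ravel(best_estimator.coef_))[0]
    raise ValueError("estimator has neither 'get_support' nor 'coef_'")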
palladio.utils.get_selected_list(grid_search, vs_analysis=True)[source]¶
Retrieve the list of selected features.
Retrieves the list of selected features, automatically identifying the type of the input object.
Returns: index : numpy.array
The indices of the selected features.
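A hedged sketch of how this might work (assumptions: grid_search is a fitted GridSearchCV, selection is read from its refit best estimator, and vs_analysis gates the lookup; the actual logic may differ):

from palladio.utils import retrieve_features

def get_selected_list_sketch(grid_search, vs_analysis=True):
    # Assumption: only variable-selection analyses carry a selected-feature list.
    if not vs_analysis:
        return None
    # Delegate to retrieve_features on the refit best estimator.
    return retrieve_features(grid_search.best_estimator_)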
palladio.utils.build_cv_results(dictionary, **results)[source]¶
Build the final cv_results_ dictionary from partial results.
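A plausible sketch of the accumulation step (a hypothetical re-implementation; the keyword names are examples only):

def build_cv_results_sketch(dictionary, **results):
    # Append each partial result (e.g. split_i, learn_score, test_score)
    # to the corresponding list in the shared cv_results_ dictionary.
    for key, value in results.items():
        dictionary.setdefault(key, []).append(value)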
palladio.utils.signatures(splits_results, frequency_threshold=0.0)[source]¶
Return (almost) nested signatures for each correlation value.
The function returns three lists, where each item refers to a signature (for increasing values of linear correlation). Each signature is ordered from the most to the least selected variable across the KCV splits results.
Parameters: splits_results : iterable
List of results from the L1L2Py module, one for each external split.
frequency_threshold : float
Only the variables selected at least as often as this threshold are included in the signature.
Returns: sign_totals : list of numpy.ndarray
Counts of the number of times each variable in the signature was selected.
sign_freqs : list of numpy.ndarray
Frequencies calculated from sign_totals.
sign_idxs : list of numpy.ndarray
Indices of the signature variables.
Examples
>>> from palladio.utils import signatures
>>> splits_results = [{'selected_list': [[True, False], [True, True]]},
...                   {'selected_list': [[True, False], [False, True]]}]
>>> sign_totals, sign_freqs, sign_idxs = signatures(splits_results)
>>> print(sign_totals)
[array([ 2.,  0.]), array([ 2.,  1.])]
>>> print(sign_freqs)
[array([ 1.,  0.]), array([ 1. ,  0.5])]
>>> print(sign_idxs)
[array([0, 1]), array([1, 0])]
palladio.utils.selection_summary(splits_results)[source]¶
Count how many times each variable was selected.
Parameters: splits_results : iterable
List of results from the L1L2Py module, one for each external split.
Returns: summary : numpy.ndarray
Selection summary: a (# mu_values) x (# variables) matrix.
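A sketch of one way to compute such a matrix (hypothetical, assuming each split result stores a boolean 'selected_list' mask as in the signatures example above):

import numpy as np

def selection_summary_sketch(splits_results):
    # Sum the (# mu_values x # variables) boolean masks over external splits.
    return np.sum([np.asarray(r['selected_list'], dtype=float)
                   for r in splits_results], axis=0)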
palladio.utils.confusion_matrix(labels, predictions)[source]¶
Calculate a confusion matrix.
From the given real and predicted labels, the function calculates a confusion matrix as a doubly nested dictionary. The external dictionary contains two keys, 'T' and 'F'. Both internal dictionaries contain a key for each class label: the ['T']['C1'] entry counts the number of correctly predicted 'C1' labels, while ['F']['C2'] counts the incorrectly predicted 'C2' labels. Note that each external key corresponds to a diagonal of the confusion matrix, and the function works only on two-class labels.
Parameters: labels : iterable
Real labels.
predictions : iterable
Predicted labels.
Returns: cm : dict
Dictionary containing the confusion matrix values.
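A minimal sketch of the structure described above (a hypothetical re-implementation, not palladio's code):

def confusion_matrix_sketch(labels, predictions):
    classes = sorted(set(labels))
    # 'T' holds correct predictions per class, 'F' incorrect ones.
    cm = {'T': {c: 0 for c in classes}, 'F': {c: 0 for c in classes}}
    for real, pred in zip(labels, predictions):
        cm['T' if real == pred else 'F'][pred] += 1
    return cm

>>> confusion_matrix_sketch(['C1', 'C1', 'C2', 'C2'], ['C1', 'C2', 'C2', 'C2'])
{'T': {'C1': 1, 'C2': 2}, 'F': {'C1': 0, 'C2': 1}}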
palladio.utils.classification_measures(confusion_matrix, positive_label=None)[source]¶
Calculate some classification measures.
Measures are calculated from a given confusion matrix (see confusion_matrix() for a detailed description of the required structure).
The positive_label argument allows specifying which label has to be considered the positive class. This is needed to calculate some measures like the F-measure and to set some aliases (e.g. precision and recall are respectively the 'predictive value' and the 'true rate' for the positive class).
If positive_label is None, the resulting dictionary will not contain all the measures. Assuming two classes 'C1' and 'C2', with 'C1' indicated as the positive (P) class, the function returns a dictionary with the following structure:
{
    'C1': {'predictive_value': --,  # TP / (TP + FP)
           'true_rate': --},        # TP / (TP + FN)
    'C2': {'predictive_value': --,  # TN / (TN + FN)
           'true_rate': --},        # TN / (TN + FP)
    'accuracy': --,                 # (TP + TN) / (TP + FP + FN + TN)
    'balanced_accuracy': --,        # 0.5 * ((TP / (TP + FN)) +
                                    #        (TN / (TN + FP)))
    'MCC': --,                      # ((TP * TN) - (FP * FN)) /
                                    #   sqrt((TP + FP) * (TP + FN) *
                                    #        (TN + FP) * (TN + FN))
    # The following only with positive_label != None
    'sensitivity': --,              # P true rate: TP / (TP + FN)
    'specificity': --,              # N true rate: TN / (TN + FP)
    'precision': --,                # P predictive value: TP / (TP + FP)
    'recall': --,                   # P true rate: TP / (TP + FN)
    'F_measure': --                 # 2. * ((Precision * Recall) /
                                    #       (Precision + Recall))
}
Parameters: confusion_matrix : dict
Confusion matrix (such as the one returned by confusion_matrix()).
positive_label : str
Positive class label.
Returns: summary : dict
Dictionary containing calculated measures.
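A hedged sketch of the two-class computation, following the 'T'/'F' structure above (a hypothetical re-implementation covering only a subset of the measures):

import math

def classification_measures_sketch(cm, positive_label):
    negative = next(c for c in cm['T'] if c != positive_label)
    tp, tn = cm['T'][positive_label], cm['T'][negative]
    fp, fn = cm['F'][positive_label], cm['F'][negative]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'accuracy': (tp + tn) / (tp + fp + fn + tn),
        'balanced_accuracy': 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
        'MCC': (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        'precision': precision,
        'recall': recall,
        'F_measure': 2. * precision * recall / (precision + recall),
    }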
palladio.utils.set_module_defaults(module, dictionary)[source]¶
Set default variables of a module, given a dictionary.
Used after loading the configuration file to set some defaults.
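A minimal sketch of what such a helper typically does (an assumption about the implementation, not a quote of it):

def set_module_defaults_sketch(module, dictionary):
    # Set each (name, value) pair as a module attribute,
    # unless the module already defines that name.
    for name, value in dictionary.items():
        if not hasattr(module, name):
            setattr(module, name, value)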
Plotting functions¶
palladio.plotting.score_plot(param_grid, results, indep_var=None, pivoting_var=None, base_folder=None, logspace=None, plot_errors=False, is_regression=False)[source]¶
Plot a 2D score (or error) plot.
Parameters: param_grid : dict
Dictionary of grid parameters for GridSearch.
results : dict
Instance of an equivalent of cv_results_, as given by ModelAssessment.
indep_var : array-like, optional, default None
List of independent variables on which the plots are based. If more than 2, a plot is made for each combination. If None, the 2 parameters with the most values in param_grid are selected.
pivoting_var : array-like, optional, default None
List of pivoting variables. For each of them, a plot is made. If unspecified, the independent variables not being plotted are fixed to their values in the best model.
base_folder : str or None, optional, default None
Folder where to save the plots.
logspace : array-like or None, optional, default None
List to specify which variable to visualise in logspace.
plot_errors : bool, optional, default False
If True, plot errors instead of scores.
is_regression : bool, optional, default False
If True and plot_errors is True, compute errors as -scores instead of 1 - scores.
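An illustrative call (assuming `ma` is a fitted ModelAssessment as in the sketch above, and that param_grid matches the inner grid search; the folder name is arbitrary):

from palladio.plotting import score_plot

param_grid = {'C': [0.1, 1, 10]}
score_plot(param_grid, ma.cv_results_, base_folder='plots')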