API

Pipeline utilities

Nested Cross-Validation for scikit-learn using MPI.

This package provides nested cross-validation similar to scikit-learn’s GridSearchCV but uses the Message Passing Interface (MPI) for parallel computing.

class palladio.model_assessment.ModelAssessment(estimator, cv=None, scoring=None, fit_params=None, multi_output=False, shuffle_y=False, n_jobs=1, n_splits=10, test_size=0.1, train_size=None, random_state=None, groups=None, experiments_folder=None, verbose=False)[source]

Cross-validation with nested parameter search for each training fold.

The data is first split into cv train and test sets. For each training set a grid search over the specified set of parameters is performed (inner cross-validation). The set of parameters that achieved the highest average score across all inner folds is used to re-fit a model on the entire training set of the outer cross-validation loop. Finally, results on the test set of the outer loop are reported.
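The procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helpers `kfold`, `fit`, `score` and `nested_cv`, the toy one-dimensional ridge estimator and the data are all hypothetical and are not part of palladio, and interleaved K-fold splits stand in for whatever splitting strategy is actually configured.

```python
def kfold(indices, k):
    """Yield (train, test) index lists for k interleaved folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def fit(X, y, idx, alpha):
    """Toy estimator: 1-D ridge regression through the origin."""
    sxy = sum(X[i] * y[i] for i in idx)
    sxx = sum(X[i] * X[i] for i in idx)
    return sxy / (sxx + alpha)


def score(w, X, y, idx):
    """Negative mean squared error, so that higher is better."""
    return -sum((y[i] - w * X[i]) ** 2 for i in idx) / len(idx)


def nested_cv(X, y, alphas, outer_k=3, inner_k=2):
    results = []
    for train, test in kfold(list(range(len(X))), outer_k):
        # Inner loop: average score of each alpha across the inner folds.
        def inner_score(alpha):
            scores = [score(fit(X, y, tr, alpha), X, y, va)
                      for tr, va in kfold(train, inner_k)]
            return sum(scores) / len(scores)

        best_alpha = max(alphas, key=inner_score)
        # Re-fit on the whole outer training set, then score on the
        # outer test set.
        w = fit(X, y, train, best_alpha)
        results.append({'best_alpha': best_alpha,
                        'learn_score': score(w, X, y, train),
                        'test_score': score(w, X, y, test)})
    return results


X = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0]
results = nested_cv(X, y, alphas=[0.0, 0.1, 1.0])
for row in results:
    print(row)
```

Each outer split selects its own best parameter, so the reported test scores are never computed on data that took part in the parameter search.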

Parameters:

estimator : object type that implements the “fit” and “predict” methods

An object of that type is instantiated for each grid point.

cv : integer or cross-validation generator, optional, default: 3

If an integer is passed, it is the number of folds. Specific cross-validation objects can also be passed; see the sklearn.cross_validation module for the list of possible objects.

scoring : string, callable or None, optional, default: None

A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y). See sklearn.metrics.get_scorer for details.

fit_params : dict, optional, default: None

Parameters to pass to the fit method.

multi_output : boolean, default: False

Allow multi-output y, as for multivariate regression.

shuffle_y : bool, optional, default=False

When True, the object is used to perform a permutation test.

n_jobs : int, optional, default: 1

The number of jobs to use for the computation. This works by computing each of the Monte Carlo runs in parallel. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. Ignored when using MPI.

n_splits: int, optional, default: 10

The number of cross-validation splits (folds/iterations).

test_size : float (default 0.1), int, or None

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is automatically set to the complement of the train size.

train_size : float, int, or None (default is None)

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int or RandomState, optional, default: None

Pseudo-random number generator state used for random sampling.

groups : array-like, with shape (n_samples,), optional, default: None

Group labels for the samples used while splitting the dataset into train/test set.

experiments_folder : string, optional, default: None

The path to the folder used to save the results.

verbose : bool, optional, default: False

Print debug messages.
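The interplay between the test_size and train_size parameters described above can be sketched as follows. The function name `resolve_split_sizes` is hypothetical, and the rounding rules (test size rounded up, train size rounded down) are an assumption borrowed from common practice rather than something this page states.

```python
import math


def resolve_split_sizes(n_samples, test_size=0.1, train_size=None):
    """Return (n_train, n_test) following the rules described above."""

    def to_count(size, rounding):
        # Floats are proportions of the dataset, ints are absolute counts.
        if isinstance(size, float):
            return int(rounding(size * n_samples))
        return size

    n_test = to_count(test_size, math.ceil)     # assumption: round up
    n_train = to_count(train_size, math.floor)  # assumption: round down
    if n_test is None and n_train is None:
        raise ValueError('test_size and train_size cannot both be None')
    if n_test is None:
        n_test = n_samples - n_train   # complement of the train size
    if n_train is None:
        n_train = n_samples - n_test   # complement of the test size
    return n_train, n_test


print(resolve_split_sizes(8, test_size=0.25))                # (6, 2)
print(resolve_split_sizes(8, test_size=None, train_size=5))  # (5, 3)
print(resolve_split_sizes(8, test_size=2, train_size=0.5))   # (4, 2)
```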

Attributes

scorer_ : function

Scorer function used on the held-out data to choose the best parameters for the model.

cv_results_ : dictionary

Result of the fit. The dictionary is pandas.DataFrame-able. Each row is the result of an external split. Columns are: ‘split_i’, ‘learn_score’, ‘test_score’, ‘cv_results_’, ‘ytr_pred’, ‘yts_pred’, ‘test_index’, ‘train_index’, ‘estimator’.

Example:

>>> pd.DataFrame(cv_results_)
split_i | learn_score | test_score | cv_results_         | ...
0       | 0.987       | 0.876      | {<internal splits>} | ...
1       | 0.846       | 0.739      | {<internal splits>} | ...
2       | 0.956       | 0.630      | {<internal splits>} | ...
3       | 0.964       | 0.835      | {<internal splits>} | ...

fit(X, y)[source]

Fit the model to the training data.

Extra tools

Utility functions and classes.

palladio.utils.save_signature(filename, selected, threshold=0.75)[source]

Save signature summary.

palladio.utils.retrieve_features(best_estimator)[source]

Retrieve selected features from any estimator.

If the estimator has a ‘get_support’ method, it is used. Otherwise, if it has a ‘coef_’ attribute, the estimator is assumed to be a linear model and the selected features correspond to the indices of the coefficients != 0.
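The dispatch logic described above can be sketched with duck typing. The classes `MaskSelector` and `SparseLinearModel` and the function `retrieve_features_sketch` are hypothetical stand-ins; the return type of the real function (here a plain list of indices) is an assumption.

```python
class MaskSelector(object):
    """Hypothetical selector exposing a boolean support mask."""

    def get_support(self):
        return [True, False, True]


class SparseLinearModel(object):
    """Hypothetical linear model with some zero coefficients."""

    coef_ = [0.0, 1.5, 0.0, -0.2]


def retrieve_features_sketch(estimator):
    # Prefer the explicit support mask when the estimator provides one.
    if hasattr(estimator, 'get_support'):
        return [i for i, kept in enumerate(estimator.get_support()) if kept]
    # Otherwise treat it as a linear model: non-zero coefficients are selected.
    if hasattr(estimator, 'coef_'):
        return [i for i, c in enumerate(estimator.coef_) if c != 0]
    raise AttributeError('estimator exposes neither get_support nor coef_')


print(retrieve_features_sketch(MaskSelector()))       # [0, 2]
print(retrieve_features_sketch(SparseLinearModel()))  # [1, 3]
```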

palladio.utils.get_selected_list(grid_search, vs_analysis=True)[source]

Retrieve the list of selected features.

Retrieves the list of selected features, automatically identifying the type of object.

Returns:

index : numpy.array

The indices of the selected features.

palladio.utils.build_cv_results(dictionary, **results)[source]

Function to build final cv_results_ dictionary with partial results.

palladio.utils.signatures(splits_results, frequency_threshold=0.0)[source]

Return (almost) nested signatures for each correlation value.

The function returns 3 lists where each item refers to a signature (for increasing value of linear correlation). Each signature is ordered from the most to the least selected variable across KCV splits results.

Parameters:

splits_results : iterable

List of results from L1L2Py module, one for each external split.

frequency_threshold : float

Only the variables selected with a frequency greater than or equal to this threshold are included in the signature.

Returns:

sign_totals : list of numpy.ndarray.

Counts the number of times each variable in the signature is selected.

sign_freqs : list of numpy.ndarray.

Frequencies calculated from sign_totals.

sign_idxs : list of numpy.ndarray.

Indexes of the signature variables.

Examples

>>> from palladio.utils import signatures
>>> splits_results = [{'selected_list':[[True, False], [True, True]]},
...                   {'selected_list':[[True, False], [False, True]]}]
>>> sign_totals, sign_freqs, sign_idxs = signatures(splits_results)
>>> print(sign_totals)
[array([ 2.,  0.]), array([ 2.,  1.])]
>>> print(sign_freqs)
[array([ 1.,  0.]), array([ 1. ,  0.5])]
>>> print(sign_idxs)
[array([0, 1]), array([1, 0])]

palladio.utils.selection_summary(splits_results)[source]

Count how many times each variable was selected.

Parameters:

splits_results : iterable

List of results from L1L2Py module, one for each external split.

Returns:

summary : numpy.ndarray

Selection summary, as a matrix of shape (number of mu values, number of variables).
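A minimal sketch of the counting, assuming each split result is a dictionary whose 'selected_list' entry holds one boolean mask per mu value (the structure used in the signatures() example). The name `selection_summary_sketch` is hypothetical and nested lists are used here in place of a numpy.ndarray.

```python
def selection_summary_sketch(splits_results):
    first = splits_results[0]['selected_list']
    n_mu, n_vars = len(first), len(first[0])
    # One row per mu value, one column per variable.
    summary = [[0] * n_vars for _ in range(n_mu)]
    for split in splits_results:
        for m, mask in enumerate(split['selected_list']):
            for v, selected in enumerate(mask):
                summary[m][v] += int(selected)
    return summary


splits_results = [{'selected_list': [[True, False], [True, True]]},
                  {'selected_list': [[True, False], [False, True]]}]
print(selection_summary_sketch(splits_results))  # [[2, 0], [1, 2]]
```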

palladio.utils.confusion_matrix(labels, predictions)[source]

Calculate a confusion matrix.

From the given real and predicted labels, the function calculates a confusion matrix as a doubly nested dictionary. The external dictionary contains two keys, 'T' and 'F'. Both internal dictionaries contain a key for each class label. The ['T']['C1'] entry counts the correctly predicted 'C1' labels, while ['F']['C2'] counts the incorrectly predicted 'C2' labels.

Note that each entry of the external dictionary ('T' or 'F') corresponds to a diagonal of the confusion matrix, and that the function works only on two-class labels.

Parameters:

labels : iterable

Real labels.

predictions : iterable

Predicted labels.

Returns:

cm : dict

Dictionary containing the confusion matrix values.
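The nested structure can be illustrated with a short sketch. The name `confusion_matrix_sketch` is hypothetical, and indexing the inner dictionaries by the predicted label is an assumption, chosen so that ['F']['C1'] holds the false positives when 'C1' is the positive class.

```python
def confusion_matrix_sketch(labels, predictions):
    classes = sorted(set(labels) | set(predictions))
    cm = {'T': {c: 0 for c in classes},
          'F': {c: 0 for c in classes}}
    for real, predicted in zip(labels, predictions):
        # 'T' collects correct predictions, 'F' the wrong ones;
        # both are indexed by the predicted label (assumption).
        cm['T' if real == predicted else 'F'][predicted] += 1
    return cm


labels = ['C1', 'C1', 'C2', 'C2', 'C2']
predictions = ['C1', 'C2', 'C2', 'C2', 'C1']
cm = confusion_matrix_sketch(labels, predictions)
print(cm['T'])  # {'C1': 1, 'C2': 2}
print(cm['F'])  # {'C1': 1, 'C2': 1}
```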

palladio.utils.classification_measures(confusion_matrix, positive_label=None)[source]

Calculate some classification measures.

Measures are calculated from a given confusion matrix (see confusion_matrix() for a detailed description of the required structure).

The positive_label argument specifies which label is to be considered the positive class. This is needed to calculate some measures, such as the F-measure, and to set some aliases (e.g. precision and recall are respectively the ‘predictive value’ and the ‘true rate’ for the positive class).

If positive_label is None, the resulting dictionary will not contain all the measures. Assuming two classes, ‘C1’ and ‘C2’, with ‘C1’ indicated as the positive (P) class, the function returns a dictionary with the following structure:

{
    'C1': {'predictive_value': --,  # TP / (TP + FP)
           'true_rate':        --}, # TP / (TP + FN)
    'C2': {'predictive_value': --,  # TN / (TN + FN)
           'true_rate':        --}, # TN / (TN + FP)
    'accuracy':          --,        # (TP + TN) / (TP + FP + FN + TN)
    'balanced_accuracy': --,        # 0.5 * ( (TP / (TP + FN)) +
                                    #         (TN / (TN + FP)) )
    'MCC':               --,        # ( (TP * TN) - (FP * FN) ) /
                                    # sqrt( (TP + FP) * (TP + FN) *
                                    #       (TN + FP) * (TN + FN) )

    # The following are present only when positive_label is not None
    'sensitivity':       --,        # P true rate: TP / (TP + FN)
    'specificity':       --,        # N true rate: TN / (TN + FP)
    'precision':         --,        # P predictive value: TP / (TP + FP)
    'recall':            --,        # P true rate: TP / (TP + FN)
    'F_measure':         --         # 2. * ( (Precision * Recall ) /
                                    #        (Precision + Recall) )
}

Parameters:

confusion_matrix : dict

Confusion matrix (as the one returned by confusion_matrix()).

positive_label : str

Positive class label.

Returns:

summary : dict

Dictionary containing calculated measures.
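The formulas listed in the structure above can be checked with a small sketch. The function name `classification_measures_sketch` is hypothetical, the confusion matrix is assumed to be indexed by predicted label, only a subset of the measures is computed, and no guard against division by zero is included.

```python
from math import sqrt


def classification_measures_sketch(cm, positive_label):
    # The negative label is whichever class is not the positive one.
    (negative_label,) = [c for c in cm['T'] if c != positive_label]
    TP, TN = float(cm['T'][positive_label]), float(cm['T'][negative_label])
    FP, FN = float(cm['F'][positive_label]), float(cm['F'][negative_label])

    precision = TP / (TP + FP)     # P predictive value
    recall = TP / (TP + FN)        # P true rate (sensitivity)
    specificity = TN / (TN + FP)   # N true rate
    return {
        'accuracy': (TP + TN) / (TP + FP + FN + TN),
        'balanced_accuracy': 0.5 * (recall + specificity),
        'MCC': ((TP * TN) - (FP * FN)) / sqrt(
            (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)),
        'sensitivity': recall,
        'specificity': specificity,
        'precision': precision,
        'recall': recall,
        'F_measure': 2. * (precision * recall) / (precision + recall),
    }


cm = {'T': {'C1': 3, 'C2': 4}, 'F': {'C1': 1, 'C2': 2}}
measures = classification_measures_sketch(cm, positive_label='C1')
print(measures['precision'], measures['recall'])
```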

palladio.utils.set_module_defaults(module, dictionary)[source]

Set default variables of a module, given a dictionary.

Used after the loading of the configuration file to set some defaults.
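One way such a helper could work is sketched below. The name `set_module_defaults_sketch` and the `config` module object are hypothetical, and the choice not to overwrite attributes that are already set is an assumption drawn from the word "defaults".

```python
import types


def set_module_defaults_sketch(module, dictionary):
    for name, value in dictionary.items():
        # Assumption: values already defined in the module take precedence.
        if not hasattr(module, name):
            setattr(module, name, value)


# Stand-in for a configuration file loaded as a module.
config = types.ModuleType('config')
config.n_splits = 5

set_module_defaults_sketch(config, {'n_splits': 10, 'verbose': False})
print(config.n_splits, config.verbose)  # 5 False
```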

palladio.utils.sec_to_timestring(seconds)[source]

Transform seconds into a formatted time string.

Parameters:

seconds : int

Seconds to be transformed.

Returns:

time : string

A well-formatted time string.
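The exact output format is not stated here. As an illustration only, the sketch below assumes an HH:MM:SS layout; the name `sec_to_timestring_sketch` is hypothetical.

```python
def sec_to_timestring_sketch(seconds):
    # Split the total into hours, minutes and seconds.
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return '{:02d}:{:02d}:{:02d}'.format(hours, minutes, secs)


print(sec_to_timestring_sketch(3725))  # 01:02:05
```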

palladio.utils.safe_run(function)[source]

Decorator that tries to run a function and prints an error if it fails.
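A decorator of this kind is commonly written as in the sketch below. The names `safe_run_sketch`, `divide` and the message format are hypothetical, and returning None after a failure is an assumption.

```python
import functools


def safe_run_sketch(function):
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        try:
            return function(*args, **kwargs)
        except Exception as exc:
            # Report the failure instead of propagating the exception.
            print('{} failed: {}'.format(function.__name__, exc))
            return None
    return wrapper


@safe_run_sketch
def divide(a, b):
    return a / b


print(divide(6, 3))
print(divide(1, 0))  # prints the error message, then None
```

Swallowing every exception keeps a long batch of analyses running, at the cost of hiding failures unless the printed messages are checked.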

Plotting functions

palladio.plotting.score_plot(param_grid, results, indep_var=None, pivoting_var=None, base_folder=None, logspace=None, plot_errors=False, is_regression=False)[source]

Plot a 2D plot of the scores (or of the errors).

Parameters:

param_grid : dict

Dictionary of grid parameters for GridSearch.

results : dict

Instance of an equivalent of cv_results_, as given by ModelAssessment.

indep_var : array-like, optional, default None

List of independent variables on which plots are based. If more than 2, a plot for each combination is made. If None, the 2 longest parameters in param_grid are selected.

pivoting_var : array-like, optional, default None

List of pivoting variables. For each of them, a plot is made. If unspecified, get the unspecified independent variable with the best model values.

base_folder : str or None, optional, default None

Folder where to save the plots.

logspace : array-like or None, optional, default None

List to specify which variable to visualise in logspace.

plot_errors : bool, optional, default False

If True, plot errors instead of scores.

is_regression : bool, optional, default False

If True and plot_errors is True, do errors = -scores instead of 1 - scores.
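The two conversions from scores to errors can be sketched as follows. The helper `scores_to_errors` is hypothetical. The regression case reflects the scikit-learn convention that loss-based scorers (for example 'neg_mean_squared_error') return negated losses, so flipping the sign recovers the error.

```python
def scores_to_errors(scores, is_regression=False):
    if is_regression:
        # Negated losses: flipping the sign gives back the error.
        return [-s for s in scores]
    # Accuracy-like scores in [0, 1]: the error is the complement.
    return [1 - s for s in scores]


print(scores_to_errors([0.75, 0.5]))                        # [0.25, 0.5]
print(scores_to_errors([-0.5, -2.0], is_regression=True))   # [0.5, 2.0]
```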