Quick start tutorial

Installation using pip

PALLADIO may be installed using standard Python tools (with administrative or sudo permissions on GNU-Linux platforms):

$ pip install palladio

or

$ easy_install palladio

We strongly suggest using Anaconda and creating a dedicated environment for your experiments.
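
For example, a dedicated environment can be created and used as follows (the environment name is arbitrary):

$ conda create -n palladio numpy scipy scikit-learn
$ conda activate palladio
$ pip install palladio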

Installation from sources

If you prefer to install PALLADIO manually, download the .zip or .tar.gz archive from http://slipguru.github.io/palladio/, then extract it and move into the root directory:

$ unzip slipguru-palladio-|release|.zip
$ cd palladio-|release|/

or:

$ tar xvf slipguru-palladio-|release|.tar.gz
$ cd palladio-|release|/

Otherwise you can clone our GitHub repository:

$ git clone https://github.com/slipguru/palladio.git

From here, you can follow the standard Python installation step:

$ python setup.py install

After PALLADIO installation, you should have access to two scripts, named with a common pd_ prefix:

$ pd_<TAB>
pd_analysis.py    pd_run.py

This tutorial assumes that you downloaded and extracted the PALLADIO source package, which contains a palladio/config_templates directory with some data files (.npy or .csv) that will be used to demonstrate PALLADIO's functionalities.

PALLADIO needs only 3 ingredients:

  • an n_samples x n_variables input matrix
  • an n_samples x 1 labels vector
  • a configuration file
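
As a sketch, a toy dataset with the expected shapes can be generated with NumPy (sizes and variable names here are purely illustrative):

import numpy as np

n_samples, n_variables = 40, 100

data = np.random.randn(n_samples, n_variables)   # n_samples x n_variables input matrix
labels = np.sign(np.random.randn(n_samples))     # n_samples x 1 labels vector (e.g. +1/-1)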

Cluster setup

Since all experiments performed during a run are independent from one another, PALLADIO has been designed specifically to work in a cluster environment. Preparing the cluster for the experiments is fairly easy: assuming a standard configuration for the nodes (a shared home folder and a Python installation which includes the standard libraries for scientific computation, namely numpy, scipy and sklearn, as well as the mpi4py library for the MPI infrastructure), it is sufficient to transfer to the cluster a folder containing the dataset (data matrix and labels) and the configuration file, and to install PALLADIO itself following the instructions above.
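
For example, assuming a hypothetical cluster reachable at cluster.example.com and a local folder my_experiment/ containing the data files and the configuration file, the transfer and installation could look like this:

$ scp -r my_experiment/ user@cluster.example.com:
$ ssh user@cluster.example.com
$ pip install --user palladio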

Configuration File

A configuration file in PALLADIO is a standard Python script. It is imported as a module and all of its code is executed. In this file the user defines all the parameters required to run a session, that is, to perform all the experiments needed to produce the final plots and reports.

In the folder palladio/config_templates you will find an example of a typical configuration file. Every configuration file has several sections which control different aspects of the procedure.

The code below contains all the information required to load the dataset which will be used in the experiments.

import os

# Data and labels files (paths are relative to this configuration file)
data_path = 'data/gedm.csv'
target_path = 'data/labels.csv'

# pandas.read_csv options
data_loading_options = {
    'delimiter': ',',
    'header': 0,
    'index_col': 0
}
target_loading_options = data_loading_options

# `datasets` refers to PALLADIO's dataset-loading utilities,
# imported at the top of the configuration file template
dataset = datasets.load_csv(
    os.path.join(os.path.dirname(__file__), data_path),
    os.path.join(os.path.dirname(__file__), target_path),
    data_loading_options=data_loading_options,
    target_loading_options=target_loading_options,
    samples_on='col')

data, labels = dataset.data, dataset.target
feature_names = dataset.feature_names

The last lines store the input data matrix in the variable data and the labels vector in labels, both of which will be accessible during the session; the names of the features are also saved at this point. Notice that it is possible to load the dataset in any desired way, as long as data ends up being an \(n \times d\) matrix and labels a vector of \(n\) elements (both np.array-like).
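
For instance, if the dataset were stored in NumPy .npy files rather than CSV, the same variables could be populated directly (the file names below are illustrative, not part of the template):

import numpy as np

data = np.load('data/data_matrix.npy')      # shape (n, d)
labels = np.load('data/labels.npy')         # shape (n,)
feature_names = ['feat_{}'.format(i) for i in range(data.shape[1])]  # or load the real names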

Next, we have the section containing settings relative to the session itself:

session_folder = 'palladio_test_session'

# The learning task, if None palladio tries to guess it
# [see sklearn.utils.multiclass.type_of_target]
learning_task = None

# The number of repetitions of 'regular' experiments
n_splits_regular = 50

# The number of repetitions of 'permutation' experiments
n_splits_permutation = 50

The most important settings are the last two, namely n_splits_regular and n_splits_permutation, which control how many repetitions of regular and permutation experiments are performed. Normally you will want to perform the same number of experiments for the two batches, but in some cases you may want to run only one of them; in that case, simply set the other variable to 0, as in the short example below.
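
For instance, to run only the regular batch and skip the permutation tests entirely, one would set:

n_splits_regular = 50
n_splits_permutation = 0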

Finally, here is the section of the configuration file where the actual variable selection and learning algorithms (and their parameters) are chosen:

import numpy as np

from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Variable selection: RFE wrapping a linear SVM
model = RFE(LinearSVC(loss='hinge'), step=0.3)

# Parameter grid explored by the grid search
param_grid = {
    'n_features_to_select': [10, 20, 50],
    'estimator__C': np.logspace(-4, 0, 5),
}

# Set the estimator to be a GridSearchCV
estimator = GridSearchCV(
    model,
    param_grid=param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=1)

# Set options for ModelAssessment
ma_options = {
    'test_size': 0.25,
    'scoring': 'accuracy',
    'n_jobs': -1,
    'n_splits': n_splits_regular
}

This is perhaps the least intuitive part of the file. Because of the way PALLADIO is designed, for each repetition of the experiment a new learning and test set are generated by resampling without replacement from the whole dataset, and then an estimator is fitted on the learning set. This is where that estimator (and its parameters) is defined.

Think of the estimator variable as the sklearn-compatible object (an estimator) which you would use to fit a training set, with the intent of validating it on a separate test set.

In this example we use an RFE algorithm (see http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) for variable selection, which internally uses a linear SVM for classification. We then wrap the RFE object in a GridSearchCV object (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), because we want to optimize the parameters of the RFE object itself, which are defined just above the declaration of the estimator variable.

The dictionary ma_options defines some more configuration options for the ModelAssessment object, which is responsible for the outer iterations (the ones where the dataset is resampled); the test_size key, for instance, determines the portion of data set aside for testing.

The only constraint on the estimator object is that it has to implement the standard methods of an sklearn estimator, that is, the fit and predict methods. Using any classification algorithm provided by the scikit-learn library automatically satisfies this requirement.
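
As a sketch of what this constraint means, even a minimal (and statistically useless) custom classifier following the scikit-learn conventions would be accepted:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator that always predicts the most frequent training label."""

    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)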

Running the experiments

Parallel jobs are created by invoking the mpirun command. The following syntax assumes that the OpenMPI implementation of MPI has been chosen for the cluster; if this is not the case, please refer to the documentation of the implementation available on your cluster for the command line options corresponding to those shown here:

$ mpirun -np N_JOBS --hostfile HOSTFILE pd_run.py path/to/config.py

Here N_JOBS determines how many parallel jobs will be spawned and distributed among the available nodes, while HOSTFILE is a file listing the addresses or names of those nodes.
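
An OpenMPI hostfile is a plain text file with one node per line, optionally specifying how many slots each node offers; the node names below are illustrative:

node01 slots=8
node02 slots=8
node03 slots=8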

Take into account that, if optimized linear algebra libraries are present on the nodes (as is safe to assume for most clusters), you should tune the number of jobs so that the cores are exploited optimally: since those libraries already parallelize operations, assigning too many slots per node is counterproductive.
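
One common way to avoid oversubscription, assuming the nodes use an OpenMP-based BLAS and OpenMPI is used as above, is to limit the number of threads available to each job by exporting OMP_NUM_THREADS through mpirun's -x option, for example:

$ mpirun -np N_JOBS --hostfile HOSTFILE -x OMP_NUM_THREADS=1 pd_run.py path/to/config.py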

Running experiments on a single machine

It is also possible to run PALLADIO experiments on a single machine, without a cluster infrastructure. The command is similar to the previous one; it is sufficient to omit the first part, which is relative to the MPI infrastructure:

$ pd_run.py path/to/config.py

Warning

Due to the large number of experiments performed, the whole procedure may take a very long time to complete; this option is therefore discouraged unless the dataset is very small (no more than 100 samples and no more than 100 features).

Results analysis

The pd_analysis.py script reads the results from all experiments and produces several plots and text files. The syntax is as follows:

$ pd_analysis.py path/to/results_dir

See Analysis for further details on the output of the analysis.