Quick start tutorial

Adenine can be installed using standard Python tools (with administrative or sudo permissions on GNU/Linux platforms):

$ pip install adenine

Installation from sources

If you prefer to install Adenine manually, download the .zip or .tar.gz archive from http://slipguru.github.io/adenine/. Then extract it and move into its root directory:

$ unzip slipguru-adenine-|release|.zip
$ cd adenine-|release|/

or:

$ tar xvf slipguru-adenine-|release|.tar.gz
$ cd adenine-|release|/

Otherwise you can clone our GitHub repository:

$ git clone https://github.com/slipguru/adenine.git

From here, you can follow the standard Python installation step:

$ python setup.py install

After installing Adenine, you should have access to two scripts that share the ade_ prefix:

$ ade_<TAB>
ade_analysis.py    ade_run.py

This tutorial assumes that you have downloaded and extracted the Adenine source package, which contains an examples/data directory with some data files (.npy or .csv) used to demonstrate Adenine's functionality.

Adenine needs only 3 ingredients:

  • n_samples x n_variables input matrix
  • n_samples x 1 output vector (optional)
  • configuration file

Input data format

Input data are assumed to be one of the following:

  • numpy arrays stored in .npy files, organized with a row for each sample and a column for each feature,
  • tabular data stored in comma-separated .csv files, with the variable names on the first row and the sample indexes in the first column,
  • toy examples available through the adenine.utils.data_source.load function.
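
As a rough illustration of the first two formats, the sketch below writes a small random matrix both as a .npy file and as a .csv file with a header row and an index column. The file names, sample labels, and feature names are made up for this example; it only shows the layout described above and does not call Adenine.

```python
import csv
import os
import tempfile

import numpy as np

# Hypothetical toy data: 6 samples (rows) x 4 features (columns).
rng = np.random.RandomState(0)
X = rng.rand(6, 4)
sample_ids = ["sample_%d" % i for i in range(X.shape[0])]
feat_names = ["feat_%d" % j for j in range(X.shape[1])]

out_dir = tempfile.mkdtemp()
npy_path = os.path.join(out_dir, "data.npy")
csv_path = os.path.join(out_dir, "data.csv")

# .npy: one row per sample, one column per feature.
np.save(npy_path, X)

# .csv: variable names on the first row, sample indexes in the first column.
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f, delimiter=",")
    writer.writerow([""] + feat_names)  # empty first cell sits above the index column
    for sid, row in zip(sample_ids, X):
        writer.writerow([sid] + row.tolist())

# Read both files back to check the layout.
X_npy = np.load(npy_path)
with open(csv_path, newline="") as f:
    rows = list(csv.reader(f))
header, body = rows[0], rows[1:]
print(X_npy.shape, len(body), len(header) - 1)
```

A file written this way corresponds to samples_on = 'rows' and data_sep = ',' in the configuration file.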

Configuration File

The Adenine configuration file is a standard Python script. It is imported as a module, so all of its code is executed. In this file the user can define all the options needed to read the data and to create the pipelines.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Configuration file for adenine."""

from adenine.utils import data_source

# --------------------------  EXPERIMENT INFO ------------------------- #
exp_tag = '_experiment'
output_root_folder = 'results'
plotting_context = 'notebook'  # one of {paper, notebook, talk, poster}
file_format = 'pdf'  # or 'png'

# ----------------------------  INPUT DATA ---------------------------- #
# Load an example dataset or specify your input data in tabular format
data_file = 'data.csv'
labels_file = 'labels.csv'  # OPTIONAL
samples_on = 'rows'  # if samples lie on columns use 'cols' or 'col'
data_sep = ','  # the data separator. e.g., ',', '\t', ' ', ...
X, y, feat_names, index = data_source.load('custom',
                                           data_file, labels_file,
                                           samples_on=samples_on,
                                           sep=data_sep)

# -----------------------  PIPELINES DEFINITION ------------------------ #
# --- Missing values imputing --- #
step0 = {'Impute': [False, {'missing_values': 'NaN',
                            'strategy': ['median',
                                         'mean',
                                         'nearest_neighbors']}]}

# --- Data preprocessing --- #
step1 = {'None': [False], 'Recenter': [False], 'Standardize': [False],
         'Normalize': [False, {'norm': ['l1', 'l2']}],
         'MinMax': [False, {'feature_range': [(0, 1), (-1, 1)]}]}

# --- Unsupervised features learning --- #
# affinity can be precomputed for SE
step2 = {'PCA': [False, {'n_components': 3}],
         'IncrementalPCA': [False],
         'RandomizedPCA': [False],
         'KernelPCA': [False, {'kernel': ['linear', 'rbf', 'poly']}],
         'Isomap': [False, {'n_neighbors': 5}],
         'LLE': [False, {'n_neighbors': 5,
                         'method': ['standard', 'modified',
                                    'hessian', 'ltsa']}],
         'SE': [False, {'affinity': ['nearest_neighbors', 'rbf']}],
         'MDS': [False, {'metric': True}],
         'tSNE': [False],
         'RBM': [False, {'n_components': 256}],
         'None': [False]
         }

# --- Clustering --- #
# affinity can be precomputed for AP, Spectral and Hierarchical
step3 = {'KMeans': [False, {'n_clusters': [3, 'auto']}],
         'AP': [False, {'preference': ['auto']}],
         'MS': [False],
         'Spectral': [False, {'n_clusters': [3, 8]}],
         'Hierarchical': [False, {'n_clusters': [3, 8],
                                  'affinity': ['manhattan', 'euclidean'],
                                  'linkage':  ['ward', 'complete', 'average']}]
         }
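
To make the structure of the step dictionaries more concrete, here is a minimal sketch that lists the pipelines a configuration of this shape could describe. It assumes that the first element of each list is an on/off switch and that a pipeline is obtained by picking one enabled method from each step. Both are assumptions made for illustration, the helper names enabled and enumerate_pipelines are hypothetical, and this is not Adenine's actual implementation. Parameter lists (such as several values for 'kernel') are not expanded here.

```python
from itertools import product


def enabled(step):
    """Return the sorted names of the methods switched on in a step dict."""
    return sorted(name for name, spec in step.items() if spec[0])


def enumerate_pipelines(*steps):
    """Combine one enabled method per step; steps with nothing enabled are skipped."""
    choices = [enabled(s) for s in steps]
    choices = [c for c in choices if c]
    if not choices:
        return []
    return list(product(*choices))


# Toy configuration in the same shape as the step dictionaries above
# (assumption: True in the first position switches a method on).
step0 = {'Impute': [True, {'missing_values': 'NaN', 'strategy': ['median']}]}
step1 = {'None': [False], 'Recenter': [True], 'Standardize': [True]}
step2 = {'PCA': [True, {'n_components': 3}],
         'Isomap': [False, {'n_neighbors': 5}]}
step3 = {'KMeans': [True, {'n_clusters': [3, 'auto']}], 'AP': [False]}

pipelines = enumerate_pipelines(step0, step1, step2, step3)
for p in pipelines:
    print(' -> '.join(p))
```

With the toy flags above, two methods are enabled in step1 and one in every other step, so the sketch lists two pipelines.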

Experiment runner

The ade_run.py script executes the full Adenine framework. It is invoked as follows:

$ ade_run.py ade_config.py

When launched, the script reads the data, then creates and runs each pipeline, saving the results in a tree-like structure rooted in the current folder.
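
Since the configuration file passed on the command line is a regular Python script, a runner can load it like any other module. The sketch below shows one way to do that with the standard library. The function name load_config and the stand-in configuration file are hypothetical, and Adenine's own loading code may differ.

```python
import importlib.util
import os
import tempfile


def load_config(path):
    """Import the Python file at `path` as a module and return it."""
    spec = importlib.util.spec_from_file_location("ade_config", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs every statement in the file
    return module


# Write a tiny stand-in configuration file for the demonstration.
cfg_path = os.path.join(tempfile.mkdtemp(), "ade_config.py")
with open(cfg_path, "w") as f:
    f.write("exp_tag = '_experiment'\n")
    f.write("output_root_folder = 'results'\n")
    f.write("step1 = {'Standardize': [True]}\n")

config = load_config(cfg_path)
print(config.exp_tag, config.output_root_folder, config.step1)
```

Every top-level name defined in the file becomes an attribute of the returned module, which is why ordinary Python statements can be used to build the settings.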

Results analysis

The ade_analysis.py script provides useful summaries and graphs from the results of the experiment. It accepts a single parameter, an existing results directory:

$ ade_analysis.py result-dir

The script produces a set of textual and graphical results. An output example obtained by one of the implemented pipelines is represented below.

[Two example output figures; images not available]

You can reproduce the example above by specifying data_source.load('circles') in the configuration file.

Example dataset

An example dataset can be downloaded here. The dataset is a random extraction of 801 samples (with dimension 20531) measuring RNA-Seq gene expression of patients affected by 5 different types of tumor: breast invasive carcinoma (BRCA), kidney renal clear cell carcinoma (KIRC), colon (COAD), lung (LUAD) and prostate adenocarcinoma (PRAD). The full dataset is maintained by The Cancer Genome Atlas Pan-Cancer Project [1], and we refer to the original repository for further details.

Reference

[1] Weinstein, John N., et al. “The cancer genome atlas pan-cancer analysis project.” Nature genetics 45.10 (2013): 1113-1120.