ADENINE (A Data ExploratioN pIpeliNE)
adenine is a machine learning and data mining Python library for exploratory data analysis.
The main structure of adenine can be summarized in the following 4 steps.
- Imputing: Does your dataset have missing entries? In the first step you can fill in the missing values by choosing among different strategies: feature-wise median, mean, or most frequent value, or k-NN imputing.
- Preprocessing: Have you ever wondered what would have changed if only your data had been preprocessed in a different way? Or whether data preprocessing is a good idea after all? adenine includes several preprocessing procedures, such as data recentering, Min-Max scaling, standardization and normalization. adenine also allows you to compare the results of the analysis made with different preprocessing strategies.
- Dimensionality Reduction: In the context of data exploration, this phase is particularly helpful for high-dimensional data. This step includes manifold learning (such as isomap, multidimensional scaling, etc.) and unsupervised feature learning (principal component analysis, kernel PCA, Bernoulli RBM, etc.) techniques.
- Clustering: This step aims at grouping data into clusters in an unsupervised manner. Several techniques such as k-means, spectral or hierarchical clustering are offered.
The final output of adenine is a compact, textual and graphical representation of the results obtained from the pipelines made with each possible combination of the algorithms selected at each step.
adenine can run on multiple cores/machines* and it is fully scikit-learn compliant.
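As a rough illustration of the four steps above, the following sketch chains one algorithm per step using plain scikit-learn classes. It is not adenine's own interface: the toy data, the chosen algorithms and every parameter value are assumptions made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): three Gaussian blobs in 5 dimensions.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 5)) for c in (0.0, 4.0, 8.0)])

# Knock out a few entries so that the imputing step has something to do.
rows = rng.randint(0, X.shape[0], size=10)
cols = rng.randint(0, X.shape[1], size=10)
X[rows, cols] = np.nan

# One algorithm per step: imputing -> preprocessing -> dimred -> clustering.
pipe = Pipeline([
    ("imputing", SimpleImputer(strategy="median")),
    ("preproc", StandardScaler()),
    ("dimred", PCA(n_components=2)),
    ("clustering", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

# fit_predict runs the three transformers and returns the cluster labels.
labels = pipe.fit_predict(X)
print(labels.shape)
```

adenine builds many such pipelines, one per combination of the selected algorithms, and then compares their results.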
User documentation
API
Pipeline utilities
adenine.core.define_pipeline.parse_imputing(key, content)
Parse the options of the imputing step.
This function parses the imputing step coded as a dictionary in the ade_config file.
Parameters: key : class or str, like {‘Impute’, ‘None’}
The type of selected imputing step. If key is a class, it must provide both a fit and a transform method.
content : dict
A dictionary containing parameters for each imputing class. Each parameter can be a list; in this case a different pipeline is created for each combination of parameters.
Returns: tpl : tuple
A tuple of the form (‘imputing_name’, imputing_obj, ‘imputing’), where imputing_obj is an sklearn transformer (i.e. it has both a fit and a transform method).
adenine.core.define_pipeline.parse_preproc(key, content)
Parse the options of the preprocessing step.
This function parses the preprocessing step coded as a dictionary in the ade_config file.
Parameters: key : class or str, like {‘None’, ‘Recenter’, ‘Standardize’, ‘Normalize’, ‘MinMax’}
The selected preprocessing algorithm. If key is a class, it must provide both a fit and a transform method.
content : dict
A dictionary containing parameters for each preprocessing class. Each parameter can be a list; in this case a different pipeline is created for each combination of parameters.
Returns: tpl : tuple
A tuple of the form (‘preproc_name’, preproc_obj, ‘preproc’), where preproc_obj is an sklearn transformer (i.e. it has both a fit and a transform method).
adenine.core.define_pipeline.parse_dimred(key, content)
Parse the options of the dimensionality reduction step.
This function does the same as parse_preproc but works on the dimensionality reduction and manifold learning options.
Parameters: key : class or str, like {‘None’, ‘PCA’, ‘KernelPCA’, ‘Isomap’, ‘LLE’, ‘SE’, ‘MDS’, ‘tSNE’, ‘RBM’}
The selected dimensionality reduction algorithm. If key is a class, it must provide both a fit and a transform method.
content : dict
A dictionary containing parameters for each dimensionality reduction class. Each parameter can be a list; in this case a different pipeline is created for each combination of parameters.
Returns: tpl : tuple
A tuple of the form (‘dimred_name’, dimred_obj, ‘dimred’), where dimred_obj is an sklearn transformer (i.e. it has both a fit and a transform method).
adenine.core.define_pipeline.parse_clustering(key, content)
Parse the options of the clustering step.
This function does the same as parse_preproc but works on the clustering options.
Parameters: key : class or str, like {‘KMeans’, ‘AP’, ‘MS’, ‘Spectral’, ‘Hierarchical’}
The selected clustering algorithm. If key is a class, it must provide a fit method.
content : dict
A dictionary containing parameters for each clustering class. Each parameter can be a list; in this case a different pipeline is created for each combination of parameters.
Returns: tpl : tuple
A tuple of the form (‘clust_name’, clust_obj, ‘clustering’), where clust_obj implements the fit method.
adenine.core.define_pipeline.parse_steps(steps, max_n_pipes=200)
Parse the steps and create the pipelines.
This function parses the steps coded as dictionaries in the ade_config files and creates an sklearn Pipeline object for each combination of imputing -> preprocessing -> dimensionality reduction -> clustering algorithms.
A typical step may take the following form:
stepX = {‘Algorithm’: [On/Off flag, {‘parameter1’: [list of params]}]}
where the On/Off flag is either True or False and ‘list of params’ lets you specify multiple parameter values. If ‘list of params’ is actually a list, one pipeline is created for each combination of parameters.
Parameters: steps : list of dictionaries
A list of (usually 4) dictionaries that contain the details of the pipelines to implement.
max_n_pipes : int, optional, default: 200
The maximum number of combinations allowed. This prevents an overly expensive computation.
Returns: pipes : list of sklearn.pipeline.Pipeline
The returned list must contain every possible combination of imputing -> preprocessing -> dimensionality reduction -> clustering algorithms (up to max_n_pipes).
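The combinatorial expansion described for parse_steps can be sketched with the standard library alone. This is an illustration of the idea, not adenine's implementation: the helper names expand_step and enumerate_pipelines are hypothetical, the sketch returns (name, params) pairs instead of sklearn Pipeline objects, and truncating the result at max_n_pipes is an assumption about how the cap is applied.

```python
from itertools import islice, product


def expand_step(step):
    """Return one (name, params) pair per enabled algorithm and parameter combination."""
    variants = []
    for name, spec in step.items():
        enabled = spec[0]
        grid = spec[1] if len(spec) > 1 else {}
        if not enabled:
            continue
        keys = sorted(grid)
        # Wrap scalar values so that every parameter is a list of candidates.
        values = [grid[k] if isinstance(grid[k], list) else [grid[k]] for k in keys]
        for combo in product(*values):
            variants.append((name, dict(zip(keys, combo))))
    return variants


def enumerate_pipelines(steps, max_n_pipes=200):
    """Combine the variants of every step, keeping at most max_n_pipes combinations."""
    per_step = [expand_step(step) for step in steps]
    # Assumption: a step with no enabled algorithm is simply skipped.
    per_step = [variants for variants in per_step if variants]
    return list(islice(product(*per_step), max_n_pipes))


# Hypothetical configuration in the stepX format described above.
step1 = {"Recenter": [True], "MinMax": [True, {"feature_range": [(0, 1), (-1, 1)]}]}
step2 = {"PCA": [True, {"n_components": 2}], "Isomap": [False]}
step3 = {"KMeans": [True, {"n_clusters": [2, 3]}]}

pipes = enumerate_pipelines([step1, step2, step3])
print(len(pipes))  # 3 preprocessing variants x 1 dimred x 2 clustering = 6
```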
adenine.core.pipelines.create(pdef)
Create scikit-learn Pipeline objects (deprecated).
This function creates a list of sklearn Pipeline objects from the input list of lists of tuples, which can be created using the adenine.core.define_pipeline module.
Parameters: pdef : list of list of tuples
This argument contains the specification needed by sklearn to create a working Pipeline object.
Returns: pipes : list of sklearn.pipeline.Pipeline objects
The list of Pipelines; each of them can be fitted on, and used to transform, some data.
adenine.core.pipelines.which_level(label)
Define the step level according to the input step label [DEPRECATED].
This function returns the level (i.e. imputing, preproc, dimred, clustering, None) according to the step label provided as input.
Parameters: label : string
This is the step level as it is reported in the ade_config file.
Returns: level : {imputing, preproc, dimred, clustering, None}
The appropriate level of the input step.
adenine.core.pipelines.evaluate(level, step, X)
Transform or predict according to the input level.
This function uses the transform or the predict method on the input sklearn-like step according to its level (i.e. imputing, preproc, dimred, clustering, none).
Parameters: level : {‘imputing’, ‘preproc’, ‘dimred’, ‘clustering’, ‘None’}
The step level.
step : sklearn-like object
This might be an Imputer, or a PCA, or a KMeans (and so on...) sklearn-like object.
X : array of float, shape
The input data matrix.
Returns: res : array of float
A matrix projection in case of dimred, a label vector in case of clustering, and so on.
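The level-based dispatch described for evaluate can be sketched as follows. This is not adenine's code: evaluate_step and the two toy step classes are hypothetical, and both the labels_ fallback and the pass-through for the ‘None’ level are assumptions.

```python
def evaluate_step(level, step, X):
    """Apply an already fitted step to X according to its level (illustrative sketch)."""
    if level in ("imputing", "preproc", "dimred"):
        # Transformer-like steps return a new data matrix.
        return step.transform(X)
    if level == "clustering":
        # Assumption: use labels_ for estimators that do not implement predict.
        if hasattr(step, "predict"):
            return step.predict(X)
        return step.labels_
    # Assumption: the 'None' level passes the data through unchanged.
    return X


class Doubler:
    """Toy transformer-like step that doubles every entry."""

    def transform(self, X):
        return [[2 * value for value in row] for row in X]


class SignClusterer:
    """Toy clustering-like step that labels rows by the sign of the first feature."""

    def predict(self, X):
        return [int(row[0] > 0) for row in X]


print(evaluate_step("preproc", Doubler(), [[1, 2]]))
print(evaluate_step("clustering", SignClusterer(), [[-1], [3]]))
```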
adenine.core.pipelines.pipe_worker(pipe_id, pipe, pipes_dump, X)
Parallel pipelines execution.
Parameters: pipe_id : string
Pipeline identifier.
pipe : list of tuples
Tuple containing a label and a sklearn Pipeline object.
pipes_dump : multiprocessing.Manager.dict
Dictionary containing the results of the parallel execution.
X : array of float, shape
The input data matrix.
Adenine analyzer module.
adenine.core.analyze_results.est_clst_perf(root, data_in, labels=None, t_labels=None, model=None, metric='euclidean')
Estimate the clustering performance.
This estimates the clustering performance by means of several indexes. Results are saved in a tree-like structure in the root folder.
Parameters: root : string
The root path for the output creation.
data_in : array of float, shape
The low space embedding estimated by the dimensionality reduction and manifold learning algorithm.
labels : array of float, shape
The label assignment performed by the clustering algorithm.
t_labels : array of float, shape
The true label vector; None if missing.
model : sklearn or sklearn-like object
An instance of the class that evaluates a step. In particular this must be a clustering model provided with the cluster_centers_ attribute (e.g. KMeans).
metric : string
The metric used by the clustering algorithm.
adenine.core.analyze_results.make_df_clst_perf(root)
Summarize all the clustering performance estimations.
Given the output files produced by est_clst_perf(), this function groups them together in user-friendly text and LaTeX files, and saves the two resulting files in a tree-like structure in the root folder.
Parameters: root : string
The root path for the output creation.
adenine.core.analyze_results.get_step_attributes(step, pos)
Get the attributes of the input step.
This function returns the attributes (i.e. level, name, outcome) of the input step. This comes in handy when dealing with steps with more than one parameter (e.g. KernelPCA ‘poly’ or ‘rbf’).
Parameters: step : list
A step coded by ade_run.py as [name, level, param, data_out, data_in, mdl obj, voronoi_mdl_obj]
pos : int
The position of the step inside the pipeline.
Returns: name : string
A unique name for the step (e.g. KernelPCA_rbf).
level : {imputing, preproc, dimred, clustering}
The step level.
data_out : array of float, shape
Where n_out is n_dimensions for dimensionality reduction step, or 1 for clustering.
data_in : array of float, shape
Where n_in is n_dimensions for preprocessing/imputing/dimensionality reduction step, or n_dim for clustering (because the data have already been dimensionality reduced).
param : dictionary
The parameters of the sklearn object implementing the algorithm.
mdl_obj : sklearn or sklearn-like object
This is an instance of the class that evaluates a step.
adenine.core.analyze_results.analysis_worker(elem, root, y, feat_names, index, lock)
Parallel pipelines analysis.
Parameters: elem : list
The first two elements of this list are the pipe_id and all the data of that pipeline.
root : string
The root path for the output creation.
y : array of float, shape
The label vector; None if missing.
feat_names : array of integers (or strings), shape
The feature names; a range of numbers if missing.
index : list of integers (or strings)
The sample identifiers, if provided as the first column (or row) of the input file. Otherwise it is just an incremental range of size n_samples.
lock : multiprocessing.synchronize.Lock
Obtained by multiprocessing.Lock(). Needed for optional creation of directories.
Input Data
This module is just a wrapper for some sklearn.datasets functions.
adenine.utils.data_source.generate_gauss(mu=None, std=None, n_sample=None)
Create a Gaussian dataset.
Generates a dataset with n_sample * n_class examples and n_dim dimensions.
Parameters: mu : array of float, shape
The mean of each class.
std : array of float, shape
The standard deviation of each Gaussian distribution.
n_sample : int
Number of points per class.
adenine.utils.data_source.load_custom(x_filename, y_filename, samples_on='rows', **kwargs)
Load a custom dataset.
This function loads the data matrix and the label vector, returning a single sklearn-like object dataSetObj.
Parameters: x_filename : string
The data matrix file name.
y_filename : string
The label vector file name.
samples_on : string
One of [‘row’, ‘rows’] if the samples lie on the rows of the input data matrix, or one of [‘col’, ‘cols’] if they lie on the columns.
kwargs : dict
Arguments of pandas.read_csv function.
Returns: data : sklearn.datasets.base.Bunch
An instance of the sklearn.datasets.base.Bunch class, the meaningful attributes are .data, the data matrix, and .target, the label vector.
adenine.utils.data_source.load(opt='custom', x_filename=None, y_filename=None, n_samples=0, samples_on='rows', **kwargs)
Load a specified dataset.
This function can be used to load either one of the standard scikit-learn datasets or a different dataset saved as X.npy and Y.npy in the working directory.
Parameters: opt : {‘iris’, ‘digits’, ‘diabetes’, ‘boston’, ‘circles’, ‘moons’, ‘custom’}, default: ‘custom’
Name of a predefined dataset to be loaded.
x_filename : string, default
The data matrix file name.
y_filename : string, default
The label vector file name.
n_samples : int
The number of samples to be loaded. This comes in handy when dealing with large datasets. When n_samples is less than the actual size of the dataset, this function performs a random subsampling that is stratified w.r.t. the labels (if provided).
samples_on : string
One of [‘row’, ‘rows’] if the samples lie on the rows of the input data matrix, or one of [‘col’, ‘cols’] if they lie on the columns.
data_sep : string
The data separator. For instance comma, tab, blank space, etc.
Returns: X : array of float, shape
The input data matrix.
y : array of float, shape
The label vector; np.nan if missing.
feature_names : array of integers (or strings), shape
The feature names; a range of numbers if missing.
index : list of integers (or strings)
The sample identifiers, if provided as the first column (or row) of the input file. Otherwise it is just an incremental range of size n_samples.
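The stratified subsampling mentioned for n_samples can be pictured with a small standard-library sketch. It only illustrates the idea of keeping class proportions: stratified_subsample is a hypothetical helper, the per-class rounding rule is an assumption, and adenine itself may rely on a different routine.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples, seed=0):
    """Pick about n_samples indices while preserving the class proportions of labels."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fraction = n_samples / len(labels)
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        # Assumption: round per class and keep at least one sample of each class,
        # so the total can differ slightly from n_samples.
        k = min(len(members), max(1, round(fraction * len(members))))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


labels = [0] * 60 + [1] * 40
idx = stratified_subsample(labels, n_samples=10)
print(len(idx), sum(labels[i] == 0 for i in idx))
```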
Plotting functions
Adenine plotting module.
adenine.core.plotting.silhouette(root, data_in, labels, model=None)
Generate and save the silhouette plot of data_in w.r.t. labels.
This function generates the silhouette plot showing how well the data are clustered, based on labels. The plots will be saved into the root folder in a tree-like structure.
Parameters: root : string
The root path for the output creation
data_in : array of float, shape
The low space embedding estimated by the dimensionality reduction and manifold learning algorithm.
labels : array of float, shape
The label vector. It can contain true or estimated labels.
model : sklearn or sklearn-like object
An instance of the class that evaluates a step.
adenine.core.plotting.scatter(root, data_in, labels=None, true_labels=False, model=None)
Generate the scatter plot of the dimensionality reduced data set.
This function generates the scatter plot representing the dimensionality reduced data set. The plots will be saved into the root folder in a tree-like structure.
Parameters: root : string
The root path for the output creation
data_in : array of float, shape
The low space embedding estimated by the dimensionality reduction and manifold learning algorithm.
labels : array of float, shape
The label vector. It can contain true or estimated labels.
true_labels : boolean
Indicates whether labels contains true or estimated labels.
model : sklearn or sklearn-like object
An instance of the class that evaluates a step. In particular this must be a clustering model provided with the cluster_centers_ attribute (e.g. KMeans).
adenine.core.plotting.voronoi(root, data_in, labels=None, true_labels=False, model=None)
Generate the Voronoi tessellation obtained from the clustering algorithm.
This function generates the Voronoi tessellation obtained from the clustering algorithm applied to the data projected on a two-dimensional embedding. The plots will be saved into the appropriate folder of the tree-like structure created in the root folder.
Parameters: root : string
The root path for the output creation
data_in : array of float, shape
The low space embedding estimated by the dimensionality reduction and manifold learning algorithm.
labels : array of int, shape
The result of the clustering step.
true_labels : boolean [deprecated]
Indicates whether labels contains true or estimated labels.
model : sklearn or sklearn-like object
An instance of the class that evaluates a step. In particular this must be a clustering model provided with the cluster_centers_ attribute (e.g. KMeans).
adenine.core.plotting.tree(root, data_in, labels=None, index=None, model=None)
Generate the tree structure obtained from the clustering algorithm.
This function generates the tree obtained from the clustering algorithm applied to the data. The plots will be saved into the appropriate folder of the tree-like structure created in the root folder.
Parameters: root : string
The root path for the output creation
data_in : array of float, shape
The low space embedding estimated by the dimensionality reduction and manifold learning algorithm.
labels : array of int, shape
The result of the clustering step.
index : list of integers (or strings)
The sample identifiers, if provided as the first column (or row) of the input file. Otherwise it is just an incremental range of size n_samples.
model : sklearn or sklearn-like object
An instance of the class that evaluates a step. In particular this must be a clustering model provided with the cluster_centers_ attribute (e.g. KMeans).
adenine.core.plotting.dendrogram(root, data_in, labels=None, index=None, model=None, n_max=150)
Generate and save the dendrogram obtained from the clustering algorithm.
This function generates the dendrogram obtained from the clustering algorithm applied to the data. The plots will be saved into the appropriate folder of the tree-like structure created in the root folder. The row colors of the heatmap are either the true or the estimated data labels.
Parameters: root : string
The root path for the output creation
data_in : array of float, shape
The low space embedding estimated by the dimensionality reduction and manifold learning algorithm.
labels : array of int, shape
The result of the clustering step.
index : list of integers (or strings)
The sample identifiers, if provided as the first column (or row) of the input file. Otherwise it is just an incremental range of size n_samples.
model : sklearn or sklearn-like object
An instance of the class that evaluates a step. In particular this must be a clustering model provided with the cluster_centers_ attribute (e.g. KMeans).
n_max : int, (INACTIVE)
The maximum number of samples to include in the dendrogram. When the number of samples is bigger than n_max, only n_max samples randomly extracted from the dataset are represented. The random extraction is performed using sklearn.model_selection.StratifiedShuffleSplit (or sklearn.cross_validation.StratifiedShuffleSplit for legacy reasons).
adenine.core.plotting.pcmagnitude(root, points, title='', ylabel='')
Plot the trend of principal components magnitude.
Parameters: root : string
The root path for the output creation.
points : array of float, shape
This could be the explained variance ratio or the eigenvalues of the centered matrix, according to the PCA algorithm of choice, respectively PCA or KernelPCA.
title : string
Plot title.
ylabel : string
Y-axis label.
adenine.core.plotting.eigs(root, affinity, n_clusters=0, title='', ylabel='', normalised=True, n_components=20, filename=None, ylim='auto', rw=False)
Plot the eigenvalues of the Laplacian associated with the data affinity matrix.
Parameters: root : string
The root path for the output creation.
affinity : array of float, shape
The affinity matrix.
n_clusters : int, optional
The number of clusters.
title : string, optional
Plot title.
ylabel : string, optional
Y-axis label.
normalised : boolean, optional, default True
Choose whether to normalise the Laplacian matrix.
n_components : int, optional, default 20
Number of components to show in the plot.
filename : None or str, optional, default None
If not None, overrides default filename for saving the plot.
ylim : ‘auto’, None, tuple or list, optional, default ‘auto’
If ‘auto’, choose the highest eigenvalue for the height of the plot. If None, plt.ylim is not called (matplotlib default is used). Otherwise, specify manually the desired ylim.
rw : boolean, optional, default False
Normalise the Laplacian matrix according to the random-walk point of view. This should be better suited to unclear data distributions.
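To make the normalised and rw options more concrete, here is a minimal NumPy sketch of the three usual graph Laplacians and their eigenvalues. It uses the textbook definitions and is not taken from adenine: laplacian_eigenvalues is a hypothetical helper, every node is assumed to have non-zero degree, and letting rw take precedence over normalised is an assumption.

```python
import numpy as np


def laplacian_eigenvalues(affinity, normalised=True, rw=False):
    """Return the sorted eigenvalues of a graph Laplacian built from an affinity matrix."""
    A = np.asarray(affinity, dtype=float)
    degree = A.sum(axis=1)
    L = np.diag(degree) - A  # unnormalised Laplacian: L = D - A

    if rw:
        # Random-walk normalisation: L_rw = D^{-1} L (not symmetric in general).
        L_rw = L / degree[:, np.newaxis]
        return np.sort(np.linalg.eigvals(L_rw).real)
    if normalised:
        # Symmetric normalisation: L_sym = D^{-1/2} L D^{-1/2}.
        d_inv_sqrt = 1.0 / np.sqrt(degree)
        L = d_inv_sqrt[:, np.newaxis] * L * d_inv_sqrt[np.newaxis, :]
    return np.linalg.eigvalsh(L)


# Two disconnected pairs of points: the graph has two connected components.
affinity = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
print(laplacian_eigenvalues(affinity))
```

The multiplicity of the zero eigenvalue equals the number of connected components of the affinity graph, which is why a plot of the leading eigenvalues can hint at a sensible number of clusters.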
Extra tools
adenine.utils.extra.values_iterator(dictionary)
Add support for python2 or 3 dictionary iterators.
adenine.utils.extra.items_iterator(dictionary)
Add support for python2 or 3 dictionary iterators.
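A common way to write such Python 2/3 compatible dictionary iterators is sketched below. The names iter_values and iter_items are hypothetical, and adenine's own helpers may be implemented differently.

```python
def iter_values(dictionary):
    """Iterate over the values of a dictionary on both Python 2 and Python 3."""
    try:
        return dictionary.itervalues()  # Python 2 only
    except AttributeError:
        return iter(dictionary.values())  # Python 3


def iter_items(dictionary):
    """Iterate over the (key, value) pairs of a dictionary on both Python 2 and Python 3."""
    try:
        return dictionary.iteritems()  # Python 2 only
    except AttributeError:
        return iter(dictionary.items())  # Python 3


params = {"n_clusters": 3, "n_init": 10}
print(sorted(iter_values(params)))
print(sorted(iter_items(params)))
```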
adenine.utils.extra.modified_cartesian(*args, **kwargs)
Modified Cartesian product.
This takes two (or more) lists and returns their Cartesian product. If one of the two lists is empty, this function returns the non-empty one.
Parameters: *args : lists, length
The group of input lists.
Returns: cp : list
The Cartesian Product of the two (or more) nonempty input lists.
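The behaviour described above can be sketched with itertools.product by dropping empty lists before taking the product. modified_cartesian_sketch is a hypothetical stand-in, and the exact shape of the returned elements (lists of length one when a single list survives) is an assumption rather than adenine's documented output.

```python
from itertools import product


def modified_cartesian_sketch(*lists):
    """Cartesian product of the non-empty input lists (illustrative sketch)."""
    non_empty = [lst for lst in lists if lst]
    if not non_empty:
        return []
    return [list(combo) for combo in product(*non_empty)]


# The empty list is ignored instead of making the whole product empty.
print(modified_cartesian_sketch([1, 2], [], ["a"]))
```

With the plain itertools.product, a single empty input would make the whole product empty, which is the case this variant avoids.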
adenine.utils.extra.make_time_flag()
Generate a time flag.
This function simply generates a time flag using the current time.
Returns: timeFlag : string
A unique time flag.
adenine.utils.extra.sec_to_time(seconds)
Transform seconds into a formatted time string.
Parameters: seconds : int
Seconds to be transformed.
Returns: time : string
A well formatted time string.
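One plausible way to format a number of seconds is shown below. The HH:MM:SS layout and the name format_seconds are assumptions; the string actually produced by sec_to_time may look different.

```python
def format_seconds(seconds):
    """Format a number of seconds as HH:MM:SS (assumed layout)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)


print(format_seconds(3725))  # 1 hour, 2 minutes and 5 seconds -> 01:02:05
```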
adenine.utils.extra.ensure_symmetry(X)
Ensure matrix symmetry.
Parameters: X : numpy.ndarray
Input matrix of precomputed pairwise distances.
Returns: new_X : numpy.ndarray
Symmetric distance matrix. Values are averaged.
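Averaging a square matrix with its transpose is the standard way to obtain a symmetric matrix whose entries are the mean of each mirrored pair. The sketch below shows that operation with NumPy; symmetrize is a hypothetical name and the code is an illustration, not adenine's source.

```python
import numpy as np


def symmetrize(X):
    """Return (X + X^T) / 2, the symmetric matrix averaging each mirrored pair."""
    X = np.asarray(X, dtype=float)
    return (X + X.T) / 2.0


# A slightly asymmetric matrix of pairwise distances.
D = np.array([
    [0.0, 1.0, 4.0],
    [3.0, 0.0, 2.0],
    [2.0, 6.0, 0.0],
])
S = symmetrize(D)
print(S)
```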