Key features of irMF 11.0

Some of the features of irMF below are marked with an asterisk (*): to the best of our knowledge, they are available only in irMF.

  • Use fixed left or right factoring vectors: this option is extremely useful in situations such as:

    • Validating an NMF model on a validation dataset, using fixed right factoring vectors (e.g. expression profiles found in a training dataset) to assess classification results against existing groups, e.g. disease groups.

    • Imposing a sparsity constraint on the right factoring vectors while preserving the left factoring vectors. This requires fixing the left factoring vectors in a first step, as NMF “Lasso” does (we will come to this option shortly).
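As an illustration of the first use case, fixing the right factoring vectors H and estimating the left factoring vectors on new data reduces to one non-negative least squares problem per sample. Below is a minimal sketch using NumPy and SciPy; the helper name `project_on_fixed_H` is ours, not irMF's.

```python
# Hypothetical sketch: apply right factoring vectors H, learned on a training
# dataset, to a validation dataset V_val by solving one non-negative least
# squares problem per sample. The helper name is illustrative, not irMF's.
import numpy as np
from scipy.optimize import nnls

def project_on_fixed_H(V_val, H):
    """Estimate W >= 0 such that V_val is approximately W @ H, with H fixed."""
    n_samples, rank = V_val.shape[0], H.shape[0]
    W = np.zeros((n_samples, rank))
    for i in range(n_samples):
        # row i: min_w ||H.T @ w - V_val[i]||  subject to  w >= 0
        W[i], _ = nnls(H.T, V_val[i])
    return W

rng = np.random.default_rng(0)
H = rng.random((3, 20))          # fixed profiles (rank 3, 20 variables)
V_val = rng.random((10, 3)) @ H  # noiseless validation data
W = project_on_fixed_H(V_val, H)
print(np.allclose(W @ H, V_val, atol=1e-6))
```

The rows of W obtained this way can then be compared with existing groups on the validation dataset.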

  • Calculate leverages (*): NMF loadings depend heavily on the chosen scaling system, which is arbitrary, e.g. L1, L2, or max element of a factoring vector = 1. For this reason, loadings cannot easily be compared, e.g. to assess the specific contribution of a sample or variable to a particular factoring vector. Other packages need to rely on additional methods applied to the loadings, e.g. K-means, to perform clustering tasks (and K-means is itself sensitive to the chosen scaling system!). Leverages are standardized values between 0 and 1 that can be compared directly, e.g. to cluster samples or variables, as irMF does, and they are very robust to any change in scaling system. Leverage computation is described in detail in our posneg paper.
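The exact leverage formula is given in the posneg paper; the simplified analogue below only illustrates why such a statistic can be made robust to rescaling: normalizing the columns of W first cancels the arbitrary scaling factors that leave the product W @ H unchanged.

```python
# Illustrative sketch of a leverage-style statistic (a simplified analogue,
# not necessarily the exact formula from the posneg paper). Normalizing the
# columns of W makes the result invariant to the rescaling
# W[:, k] *= a_k, H[k, :] /= a_k, which leaves W @ H unchanged.
import numpy as np

def leverages(W):
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)  # unit-norm columns
    S = Wn ** 2
    return S / S.sum(axis=1, keepdims=True)  # each row sums to 1, values in [0, 1]

rng = np.random.default_rng(1)
W = rng.random((6, 3))
L = leverages(W)
scales = np.array([0.5, 2.0, 7.0])
print(np.allclose(L, leverages(W * scales)))  # invariant to column rescaling
```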

  • Permutation tests (*): Assess the association of NMF clusters, and of the within-cluster ordering, with existing groups.
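A minimal permutation-test sketch (illustrative only, not irMF's exact procedure): the observed agreement between NMF clusters and known groups is compared with its null distribution, obtained by shuffling the group labels.

```python
# Illustrative permutation test of cluster/group association (not irMF's
# exact statistic): count sample pairs placed together in both partitions,
# then compare with the same count under shuffled group labels.
import numpy as np

def permutation_pvalue(clusters, groups, n_perm=999, seed=0):
    rng = np.random.default_rng(seed)
    def stat(g):
        same_c = clusters[:, None] == clusters[None, :]
        same_g = g[:, None] == g[None, :]
        return (same_c & same_g).sum()
    observed = stat(groups)
    null = [stat(rng.permutation(groups)) for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (n_perm + 1)

clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
groups   = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # perfect agreement
print(permutation_pvalue(clusters, groups) < 0.05)
```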

  • NMF “Lasso” (*): This is a key feature for selecting variables without affecting the structure of the original matrix: the key patterns retrieved by NMF on the original matrix are perfectly conserved after variable selection. NMF “Lasso” is a regularized form of the projected gradient algorithm, but the penalty is set automatically to achieve an objective defined as a percentage of retained variables. This percentage is typically validated by assessing the conservation of the NMF patterns.
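The principle behind automatic penalty selection can be sketched as follows. For the non-negative L1-penalized problem min over h ≥ 0 of ||v_j − W h||² + λ·Σh, with W fixed, the KKT conditions show that variable j is dropped (h = 0) exactly when λ ≥ 2·max(Wᵀ v_j). Choosing λ as a quantile of these per-variable thresholds therefore retains the requested percentage of variables; a regularized solve at that λ then produces the sparse right factoring vectors. A hedged NumPy sketch of the threshold argument (not irMF's actual code):

```python
# Illustrative sketch, not irMF's actual algorithm. For
#   min_h ||V[:, j] - W @ h||^2 + lam * h.sum(),  h >= 0,
# the KKT conditions give h = 0 exactly when lam >= 2 * max(W.T @ V[:, j]),
# so a quantile of these thresholds hits a target retention rate.
import numpy as np

rng = np.random.default_rng(3)
W = rng.random((30, 4))                          # fixed left factoring vectors
V = (W @ rng.random((4, 50))) * rng.random(50)   # variables of varying magnitude
thresholds = 2 * (W.T @ V).max(axis=0)           # per-variable "death" penalty
target = 0.4                                     # retain 40% of the variables
lam = np.quantile(thresholds, 1 - target)        # automatic penalty choice
retained = (thresholds > lam).mean()
print(retained)                                  # fraction of retained variables
```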

  • Blind deconvolution (*): This feature is particularly useful for deconvoluting mixture samples without relying on known signatures, e.g. marker genes. It is based on a convex variant of NMF (see below).

  • NMF Priors (*): Prior factoring vectors can be used to fit known biological processes. Within each prior component, only the non-zero elements are updated, allowing straightforward interpretation of which markers are activated or deactivated within a particular biological process.
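One simple way to realize this behavior (an illustration, not necessarily irMF's implementation) exploits the fact that multiplicative NMF updates preserve zeros: encoding each known process as the sparsity pattern of the initial right factoring vectors guarantees that only its non-zero markers are ever updated.

```python
# Sketch of prior-constrained updates (illustrative): multiplicative updates
# preserve zeros, so a process encoded as the sparsity pattern of the initial
# H keeps its zero markers at zero while the non-zero markers are refined.
import numpy as np

rng = np.random.default_rng(4)
V = rng.random((20, 12))
prior = np.zeros((2, 12))
prior[0, :6] = 1.0   # process 1 involves markers 0-5 (hypothetical)
prior[1, 6:] = 1.0   # process 2 involves markers 6-11 (hypothetical)

W = rng.random((20, 2))
H = prior * rng.random((2, 12))  # start from the prior's support
for _ in range(200):             # standard multiplicative updates
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)

print((H[prior == 0] == 0).all())  # the prior's zero pattern is intact
```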

  • NMF convex variants: Convex variants were originally proposed by Ding et al. (2006). They are very useful to:

    • Naturally provide sparse factoring vectors, i.e. without relying on regularization penalties, which are difficult to set in an unsupervised context (penalties are usually optimized through cross-validation schemes). A convex constraint on either the left or the right factoring vectors leads to a form of relaxed K-means and is useful for clustering samples or variables.

    • Perform blind deconvolution of samples.

  • NMF kernel variants: Take NMF beyond linear modeling, just as SVM does. This option was also proposed by Ding et al. (2006) and can prove extremely useful for clustering samples effectively when other methods fail.

  • NMF robust bootstrap approach (*): Assess the stability of factoring vectors without relying on multiple random initializations, which are computationally very expensive. To the best of our knowledge, all other packages rely on multiple random initializations. This robust approach also allows inconsistent variables to be excluded automatically, e.g. those with high leverage but poor stability.

  • Advanced screeplots to assess the NMF rank (*): Variance, volume, stability and specificity screeplots can prove very useful. Other packages offer only a stability screeplot based on multiple random initializations, as proposed by Brunet et al. (2004).

  • Robust SVD (*): Can prove very useful for detecting and imputing outlier cells, or for imputing missing values, as proposed by Liu et al. (2003).
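A simplified sketch of SVD-based imputation in this spirit (plain truncated SVD, without the robust weighting scheme of Liu et al.): fill the missing cells, then alternate low-rank reconstruction and refilling.

```python
# Simplified sketch of SVD-based missing value imputation (illustrative;
# the robust weighting of Liu et al. 2003 is omitted): fill missing cells
# with column means, then alternate low-rank reconstruction and refilling.
import numpy as np

def svd_impute(X, rank, n_iter=100):
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X[missing] = (U[:, :rank] * s[:rank] @ Vt[:rank])[missing]
    return X

rng = np.random.default_rng(5)
truth = rng.random((40, 3)) @ rng.random((3, 15))   # exact rank-3 matrix
X = truth.copy()
holes = rng.choice(X.size, size=20, replace=False)  # punch 20 missing cells
X.flat[holes] = np.nan
mask = np.isnan(X)

imputed = svd_impute(X, rank=3)
mean_fill = np.take(np.nanmean(X, axis=0), np.nonzero(mask)[1])
err_mean = np.abs(mean_fill - truth[mask]).mean()   # naive baseline error
err_svd = np.abs(imputed[mask] - truth[mask]).mean()
print(err_svd < 0.1 * err_mean)
```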

  • NMF “combined cell plot” (*): The combined cell plot is a key feature of irMF that challenges the classical bi-clustering heatmaps obtained through hierarchical clustering: clusters and indicators of the stability and leverage of each sample or variable are displayed in the same visualization.

  • Non-Negative Tensor Factorization: This approach is extremely useful for analyzing longitudinal datasets with numerous variables, e.g. sensors used to control an industrial process, biological markers in a time-course experiment, or text-mining of emails over a period of time (e.g. a legal investigation looking for patterns and themes that may evolve with time). A sparse version based on Hoyer’s approach is also available.

  • Multiblock: Allows multiple blocks of variables to be studied simultaneously.

irMF algorithms

Like other packages, irMF provides a variety of efficient algorithms: projected gradient, hierarchical ALS, regularized ALS, and affine NMF. Initialization is based on NNDSVD or user-supplied trial vectors. However, in our experience, the key to a successful NMF analysis is not the algorithm itself but rather the chosen approach: the normalization scheme, the chosen rank, the focus on specific subsets, the choice of appropriate variants, and the correct interpretation and use of the NMF factoring vectors.
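As a generic illustration of these points, using scikit-learn's NMF rather than irMF's own engine (names and steps are illustrative):

```python
# Generic workflow illustration with scikit-learn's NMF (not irMF's engine):
# normalize the data, choose a rank, fit with NNDSVD initialization, then
# work with the factoring vectors themselves.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

rng = np.random.default_rng(6)
V = rng.random((50, 30))              # samples x variables
Vn = normalize(V, norm="l2", axis=0)  # the normalization scheme matters
model = NMF(n_components=4, init="nndsvd", max_iter=500)  # rank chosen upfront
W = model.fit_transform(Vn)           # left factoring vectors
H = model.components_                 # right factoring vectors
print(W.shape, H.shape)
```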

irMF and the JMP environment

Beyond all these unique functionalities, irMF benefits from the rich JMP environment, which provides exceptional tools for exploring data. All irMF sessions are fully logged and can be recalled in non-interactive mode. The core calculation code is written in Python 3 and can easily be integrated with any other scripting language. Table portions are an important component of irMF, allowing all irMF analyses performed on sample and/or variable subsets to be stored in the original table.

irMF and Python

irMF uses our package NMTF (GitHub - paulfogel/NMTF: Non-Negative Matrix and Tensor Factorizations), which includes a scikit-learn-like interface allowing seamless integration with any Python pipeline.

irMF versus other NMF packages

  • Matlab, Julia, Python and R NMF packages cover parts of the irMF feature set described here (but not all of it). irMF can be seen as an attempt to integrate the best of NMF and NTF, while also offering unique functionalities.

  • Matlab, Python and Julia are all powerful languages, so extending existing functions into custom ones is relatively easy. The R NMF package relies on C++ code to ensure fast calculations, which makes customization much harder.