KNNImputer imputes missing values in microarray data as described in Troyanskaya et al. 2001. Given an input PCL, for each gene with missing values, a configurable number of nearest neighbors (by a configurable similarity measure) is found, and each missing value is replaced with a weighted average of the corresponding values in those neighbors. KNNImputer can optionally remove genes with too many missing values to impute.
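
The procedure described above can be illustrated with a minimal Python sketch. This is not KNNImputer's actual implementation: the function names, the use of None for missing values, and the inverse-distance weighting are all assumptions made for illustration, and only Euclidean distance is shown.

```python
import math

def euclidean(a, b):
    """Euclidean distance over positions present in both vectors (None = missing)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # no shared conditions, so the genes cannot be compared
    # Normalize by the number of shared positions so sparse overlaps stay comparable.
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

def knn_impute(rows, k=10):
    """Return a copy of rows with missing values filled from the k nearest genes."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other genes that have a value in column j.
            candidates = []
            for n, other in enumerate(rows):
                if n == i or other[j] is None:
                    continue
                d = euclidean(row, other)
                if math.isfinite(d):
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # nothing to impute from; leave the value missing
            # Assumed weighting: inverse distance (epsilon avoids division by zero).
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            imputed[i][j] = total / sum(weights)
    return imputed
```

Distances are always computed from the original rows, so a value imputed for one gene never influences the imputation of another.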


Basic Usage

 KNNImputer -i <data.pcl> -o <imputed.pcl>

Replace missing values in the microarray data data.pcl based on their nearest neighbors, remove any genes with too many missing values, and save the result in imputed.pcl.
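
When KNNImputer is run from a script, the command line can be assembled programmatically. The sketch below is a hypothetical Python helper (the function names are not part of Sleipnir) that uses only the -i, -o, -k, -d, and -m flags documented on this page, and it assumes the KNNImputer executable is on the PATH.

```python
import subprocess

def knnimputer_command(input_pcl, output_pcl, neighbors=None, distance=None, missing=None):
    """Build an argument list for KNNImputer; unset options keep the tool's defaults."""
    cmd = ["KNNImputer", "-i", input_pcl, "-o", output_pcl]
    if neighbors is not None:
        cmd += ["-k", str(neighbors)]   # number of nearest neighbors
    if distance is not None:
        cmd += ["-d", distance]         # similarity measure, e.g. "euclidean"
    if missing is not None:
        cmd += ["-m", str(missing)]     # fraction of conditions that must be present
    return cmd

def run_knnimputer(*args, **kwargs):
    """Run KNNImputer and raise CalledProcessError if it exits with an error."""
    subprocess.run(knnimputer_command(*args, **kwargs), check=True)

if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without Sleipnir installed.
    print(" ".join(knnimputer_command("data.pcl", "imputed.pcl", neighbors=15, missing=0.5)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.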

Detailed Usage

package "KNNImputer"
version "1.0"
purpose "More modern version of KNNImpute."

section "Main"
option  "input"         i   "Input PCL file"
                            string  typestr="filename"
option  "output"        o   "Output PCL file"
                            string  typestr="filename"

section "Genes/Neighbors"
option  "neighbors"     k   "Nearest neighbors to use"
                            int default="10"
option  "distance"      d   "Similarity measure"
                            values="euclidean","pearson","kendalls","kolm-smir","spearman","pearnorm","hypergeom"  default="euclidean"
option  "missing"       m   "Fraction of conditions which must be present"
                            double  default="0.7"

section "Miscellaneous"
option  "genes"         g   "Gene inclusion file"
                            string  typestr="filename"
option  "weights"       w   "Input weights file"
                            string  typestr="filename"
option  "autocorrelate" a   "Autocorrelate distances"
                            flag    off

section "Optional"
option  "skip"          s   "Columns to skip in input PCL"
                            int default="2"
option  "limit"         l   "Gene count limit for caching"
                            int default="-1"
option  "verbosity"     v   "Message verbosity"
                            int default="5"

Flag | Default | Type | Description
--- | --- | --- | ---
-i | stdin | PCL text file | Input PCL file in which missing values are to be imputed.
-o | stdout | PCL text file | Output PCL file in which missing values have been replaced and genes with too many missing values have been removed.
-k | 10 | Integer | Number of neighbors to use for each missing value imputation.
-d | euclidean | euclidean, pearson, kendalls, kolm-smir, spearman, pearnorm, or hypergeom | Similarity measure to use for finding nearest neighbors. The default (Euclidean distance) is highly recommended.
-m | 0.7 | Double | Fraction of a gene's expression vector that must be present; genes with fewer than this fraction of non-missing values are removed from the output. For example, in a PCL with 10 columns, genes with more than three missing values would be removed by default.
-g | None | Gene text file | If given, only genes in the given gene set are included in the output.
-w | None | PCL text file | If given, a PCL file with dimensions equal to those of the data given with -i. The values in the cells of the weights PCL represent the relative weight given to each gene/experiment pair. If no weights file is given, all weights default to 1.
-a | off | Flag | If on, autocorrelate similarity scores (find the maximum similarity score over all possible lags of the two vectors; see Sleipnir::CMeasureAutocorrelate).
-s | 2 | Integer | Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL.
-l | -1 | Integer | Maximum number of genes in input file before in-memory score caching is disabled. If -1, caching is never performed. Caching greatly speeds up processing, but can consume large amounts of memory for inputs with many genes (rows).
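
The interaction between -s and -m can be made concrete with a short Python sketch. It is an illustration only, not Sleipnir code: the function names are hypothetical, and it assumes a tab-delimited PCL in which a missing value appears as an empty cell.

```python
def parse_pcl_row(line, skip=2):
    """Split one PCL data row into its gene ID and a list of values (None = missing)."""
    fields = line.rstrip("\n").split("\t")
    gene_id = fields[0]
    values = []
    # Data starts after the ID column plus `skip` annotation columns (the -s option).
    for cell in fields[1 + skip:]:
        cell = cell.strip()
        values.append(float(cell) if cell else None)
    return gene_id, values

def has_enough_data(values, min_fraction=0.7):
    """Apply the -m rule: keep a gene only if enough of its conditions are present."""
    if not values:
        return False
    present = sum(v is not None for v in values)
    return present / len(values) >= min_fraction

if __name__ == "__main__":
    gene, values = parse_pcl_row("YAL001C\tTFC3\t1\t0.5\t\t-1.2\n")
    print(gene, values, has_enough_data(values))
```

With the default threshold of 0.7 and 10 data columns, a gene with three missing values (7 of 10 present) is kept, while a gene with four missing values (6 of 10 present) is dropped, which matches the example given for -m.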