KNNImputer imputes missing values in microarray data as described in Troyanskaya et al. (2001). Given an input PCL, for each gene with missing values, it finds a configurable number of nearest neighbors (by a configurable similarity measure) and replaces each missing value with a weighted average of the corresponding value in those neighbors. KNNImputer can optionally remove genes with too many missing values to impute.
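The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not KNNImputer's actual implementation: the function names `euclidean` and `knn_impute` are hypothetical, missing values are represented as `None`, and the inverse-distance weighting is an assumption, since the text above only says that a weighted average is used.

```python
import math


def euclidean(a, b):
    """Euclidean distance over positions present in both vectors (None = missing)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # no shared conditions, so the rows cannot be compared
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))


def knn_impute(rows, k=10):
    """Return a copy of rows with each missing value (None) replaced by a
    weighted average of the same column in the k nearest rows."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows that have a value in column j.
            candidates = [
                (euclidean(row, other), other[j])
                for n, other in enumerate(rows)
                if n != i and other[j] is not None
            ]
            candidates = [c for c in candidates if math.isfinite(c[0])]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # nothing to impute from; leave the value missing
            # Assumed weighting: inverse distance, with a small epsilon so
            # that identical rows (distance 0) do not divide by zero.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            imputed[i][j] = total / sum(weights)
    return imputed
```

For example, calling `knn_impute([[1.0, 2.0, None], [1.0, 2.0, 3.0], [5.0, 6.0, 7.0]], k=1)` fills the missing value in the first row from the second row, which is its closest neighbor over the two shared columns.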
KNNImputer -i <data.pcl> -o <imputed.pcl>
Replace missing values in the microarray data data.pcl based on their nearest neighbors, remove any genes with too many missing values, and save the result in imputed.pcl.
package "KNNImputer"
version "1.0"
purpose "More modern version of KNNImpute."

section "Main"
option "input" i "Input PCL file" string typestr="filename"
option "output" o "Output PCL file" string typestr="filename"

section "Genes/Neighbors"
option "neighbors" k "Nearest neighbors to use" int default="10"
option "distance" d "Similarity measure" values="pearson","euclidean","kendalls","kolm-smir","spearman","pearnorm","hypergeom" default="euclidean"
option "missing" m "Fraction of conditions which must be present" double default="0.7"

section "Miscellaneous"
option "genes" g "Gene inclusion file" string typestr="filename"
option "weights" w "Input weights file" string typestr="filename"
option "autocorrelate" a "Autocorrelate distances" flag off

section "Optional"
option "skip" s "Columns to skip in input PCL" int default="2"
option "limit" l "Gene count limit for caching" int default="-1"
option "verbosity" v "Message verbosity" int default="5"
|Flag||Default||Type||Description|
|-i||stdin||PCL text file||Input PCL file in which missing values are to be imputed.|
|-o||stdout||PCL text file||Output PCL file in which missing values have been replaced and genes with too many missing values have been removed.|
|-k||10||Integer||Number of neighbors to use for each missing value imputation.|
|-d||euclidean||euclidean, pearson, kendalls, kolm-smir, spearman, pearnorm, or hypergeom||Similarity measure to use for finding nearest neighbors. The default (Euclidean distance) is highly recommended.|
|-m||0.7||Double||Fraction of a gene's expression vector that must be present; genes with a smaller fraction of non-missing values are removed from the output. For example, in a PCL with 10 data columns, genes with more than three missing values would be removed by default.|
|-g||None||Gene text file||If given, only genes in the given gene set are included in the output.|
|-w||None||PCL text file||If given, a PCL file with dimensions equal to the data given with -i.|
|-a||off||Flag||If on, autocorrelate similarity scores (find the maximum similarity score over all possible lags of the two vectors; see Sleipnir::CMeasureAutocorrelate).|
|-s||2||Integer||Number of columns to skip between the initial ID column and the first experimental (data) column in the input PCL.|
|-l||-1||Integer||Maximum number of genes in input file before in-memory score caching is disabled. If -1, caching is never performed. Caching greatly speeds up processing, but can consume large amounts of memory for inputs with many genes (rows).|
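
The effect of the -m threshold in the table above can be illustrated with a short Python sketch. It is an illustration only, assuming missing values are represented as `None` and each row holds only the data columns; the names `filter_genes` and `min_present` are hypothetical and are not part of KNNImputer.

```python
def filter_genes(rows, min_present=0.7):
    """Keep rows whose fraction of non-missing values is at least min_present."""
    kept = []
    for row in rows:
        # Count the conditions for which this gene has a value.
        present = sum(1 for v in row if v is not None)
        if row and present / len(row) >= min_present:
            kept.append(row)
    return kept
```

With the default of 0.7 and 10 data columns, a gene with three missing values (7 of 10 present) is kept, while a gene with four missing values (6 of 10 present) is dropped, matching the example given for -m.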