- NAME
- DESCRIPTION
- HOW TO USE THIS PROGRAM?
- Comparing the scores of the matrix sites to the theoretical distribution
- Assessing matrix sites with a Leave-One-Out (LOO) procedure
- AUTHORS
- CATEGORY
- USAGE
- OPTIONS
- SEE ALSO
- WISH LIST

matrix-quality

Evaluate the quality of a Position-Specific Scoring Matrix (PSSM), by comparing score distributions obtained with this matrix in various sequence sets.

The most classical use of the program is to compare score distributions between "positive" sequences (e.g. true binding sites for the considered transcription factor) and "negative" sequences (e.g. intergenic sequences between convergently transcribed genes).

The typical positive set is a collection of sites that have been shown (with experimental methods) to bind the transcription factor of interest.

A particular case of positive control is to estimate the distribution of scores of the sites that served to build the matrix. This however introduces a bias (over-estimation of the scores), since the matrix is used to score the sites on which it was "trained". This bias can be circumvented by applying a cross-validation.

An important evaluation bias (and a frequent trap in published
articles) results from over-fitting the matrix to the positive set,
which happens when the same sites are used both to build the PSSM and
to evaluate it. To avoid this bias, *matrix-quality* supports two
modes of cross-validation (CV):

1. Leave-one-out (LOO)
2. k-fold cross-validation (kfold)

The cross-validation can only be performed when the matrix is specified in a format that includes both the matrix and the sites (sequences) that were used to build this matrix. This is the case for matrices in MEME, consensus, transfac and MotifSampler formats.

In k-fold cross-validation mode, the set of input sequences (matrix site sequences) is partitioned into k randomly selected subsets of approximately equal size (the number of sites is not always an exact multiple of k).

The program then iterates over the testing sets in the following way. At each iteration, one subset serves as the testing set: all the sites that are not part of this testing set are used as training sites to build a partial matrix, and the testing sites are then scored with this partial matrix.
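The partitioning and iteration described above can be sketched as follows. This is a minimal Python illustration, not the code used by *matrix-quality*: the function names (`kfold_partition`, `count_matrix`, `kfold_rounds`) and the fixed random seed are assumptions made for the example, the partial matrix is reduced to per-column residue counts, and the scoring of the testing sites is left out.

```python
import random
from collections import Counter


def kfold_partition(sites, k, seed=0):
    """Shuffle the sites and deal them into k subsets of near-equal size."""
    shuffled = list(sites)
    random.Random(seed).shuffle(shuffled)
    # Taking every k-th element gives subset sizes that differ by at most one.
    return [shuffled[i::k] for i in range(k)]


def count_matrix(sites):
    """Per-column residue counts for aligned sites of equal length."""
    width = len(sites[0])
    return [Counter(site[pos] for site in sites) for pos in range(width)]


def kfold_rounds(sites, k, seed=0):
    """Yield (partial_matrix, testing_sites) for each of the k rounds (k >= 2)."""
    folds = kfold_partition(sites, k, seed)
    for i, testing in enumerate(folds):
        # Training sites: everything that is not in the current testing set.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield count_matrix(training), testing


# Hypothetical aligned sites, for illustration only.
example_sites = ["TTGACA", "TTGACT", "TTTACA", "TAGACA", "TTGATA",
                 "CTGACA", "TTGCCA", "TTGACG", "ATGACA", "TTGAAA"]
for matrix, testing in kfold_rounds(example_sites, k=3):
    print(len(testing), "testing sites,", sum(matrix[0].values()), "training sites")
```

With 10 sites and k=3, the subsets cannot all have the same size, which illustrates the remark that the number of sites is not always an exact multiple of k.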

In LOO cross-validation mode, one sequence (the "left-out sequence") is temporarily discarded from the positive set, and the remaining sequences are used to build a matrix, which is then used to score the left out sequence. The process iterates over all the sequences of the positive set.

If the left-out sequence has one or more "twins" (identical sites) in the positive set, they are also temporarily excluded from the positive set and not included in the matrix used to score the left-out sequence.
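The exclusion of twins can be illustrated with a short Python sketch. The function name `loo_rounds` and the example sites are hypothetical; the sketch only shows which sites remain available for building the partial matrix at each iteration, under the assumption that twins are detected by exact string identity.

```python
def loo_rounds(sites):
    """Yield (left_out_site, training_sites) for each site in turn.

    The training set excludes the left-out site and all of its twins,
    i.e. every site whose sequence is identical to the left-out one.
    """
    for left_out in sites:
        training = [site for site in sites if site != left_out]
        yield left_out, training


# Hypothetical positive set containing one pair of twins.
example_sites = ["TTGACA", "TTGACA", "TTTACA", "TAGACT"]
for left_out, training in loo_rounds(example_sites):
    print(left_out, "scored with a matrix built from", training)
```

When "TTGACA" is left out, both copies are removed, so the partial matrix is built from the two remaining sites only.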

The LOO is actually a particular case of k-fold cross-validation, where k equals the total number of sites used to build the original matrix. The LOO is particularly suited to matrices built from a very small number of sites (e.g. matrices built from a handful of well-documented sites, as usually found in transcription factor databases).

Conversely, the k-fold cross-validation is useful to save computing time for matrices built from large collections of sites (e.g. thousands of sites resulting from ChIP-seq experiments).

It is sometimes difficult to find a good negative set, i.e. a collection of sequences which supposedly do not contain any binding site for the transcription factor of interest.

One possibility is to select a random set of genome fragments
(e.g. use *random-genes* to select promoters of 100 randomly selected
genes). However, some of these randomly selected sequences might
contain effective binding sites for the transcription factor.

Another possibility is to generate artificial sequences according to
some background model (using *random-seq*), but there is always a risk
that the model is an over-simplification of the real sequences.

Yet another approach to perform the negative test is to scan biological sequences (e.g. upstream regions of 100 randomly picked genes) with column-permuted matrices. The advantage of this approach is that the sequences are realistic, but the permuted matrices hopefully do not correspond to any actual motif, and their empirical distribution observed in the test sequences is thus supposed to fit the theoretical distribution.

This approach may however pose a problem in the specific case of low-complexity motifs (e.g. CCGCCC, AATTTT), since many permutations will give motifs that are similar, if not identical, to the original motif.
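The low-complexity issue can be quantified with a small Python sketch. It is only an illustration under simplifying assumptions: each matrix column is represented by a single consensus letter, two columns are considered interchangeable only if they are strictly identical, and the function names (`permute_columns`, `unchanged_fraction`) are hypothetical. It does not account for permutations that are merely similar to the original motif.

```python
import random
from collections import Counter
from fractions import Fraction
from math import factorial


def permute_columns(columns, seed=None):
    """Return the matrix columns in a random order (column permutation)."""
    permuted = list(columns)
    random.Random(seed).shuffle(permuted)
    return permuted


def unchanged_fraction(columns):
    """Exact fraction of column orderings that reproduce the original motif.

    Identical columns can be swapped without changing the motif, so the
    fraction is the product of the factorials of the column multiplicities
    divided by the factorial of the number of columns.
    """
    same = 1
    for multiplicity in Counter(columns).values():
        same *= factorial(multiplicity)
    return Fraction(same, factorial(len(columns)))


# Each letter stands for one matrix column (simplifying assumption).
for motif in ("CCGCCC", "AATTTT", "TTGACA"):
    print(motif, unchanged_fraction(motif))
```

Under these assumptions, one permutation out of 6 reproduces CCGCCC exactly and one out of 15 reproduces AATTTT, against one out of 180 for a more complex motif such as TTGACA.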

Let us be frank: this program can do many things, but requires a bit of expertise. A good strategy to get familiar with its multiple results is to start by running the simplest possible analysis, and then progressively add the more advanced tasks.

We propose hereafter a step-by-step procedure, in which tasks are progressively added.

We assume here that the user has a PSSM in a format that
includes both the matrix and the aligned sites used to compute the
matrix (e.g. MEME format). Beware, the sites actually incorporated in
the matrix may differ from the collection of sites used as input for
the matrix-building program. For instance, if you use MEME (with the
option -mod zoops) to build a matrix from a collection of annotated TFBS,
some sites may be incorporated in the matrix, and some others
skipped. We use hereafter the expression **"matrix sites"** to refer to
the sites used in the alignment from which the residue frequencies of
the matrix were computed.

    matrix-quality -v 1 -ms my_matrix.meme -matrix_format meme \
      -no_cv -perm matrix_sites 0 -bgfile my_background.txt \
      -o my_matrix_quality

This will produce the simplest possible analysis: computing the score distribution of the matrix sites, and comparing it to the theoretical distribution.

Beware: the score distribution of the matrix sites is over-optimistic. Indeed, those are the very sites that were used to build the matrix. Each site partly contributed to the matrix scores (weights) that will serve to score it. There is thus a problem of over-fitting: we train a matrix with some data, and we evaluate the matrix with the same data.

To circumvent the problem of over-fitting mentioned above, we
need to perform the Leave-One-Out (LOO) procedure. Actually,
*matrix-quality* automatically runs the leave-one-out test by
default. The reason why it was not done in the previous section is
that we used the option -no_cv, for the sole purpose of
illustrating the problem of over-fitting. We will now run
*matrix-quality* in the normal way, without disabling the LOO
procedure.

    matrix-quality -v 1 -ms my_matrix.meme -matrix_format meme \
      -perm matrix_sites 0 -bgfile my_background.txt \
      -o my_matrix_quality

The result distributions now contain 3 curves:

**theory**-
The theoretical distribution of scores, computed according to the background model;

**matrix_sites**-
The score distribution of the matrix sites (which is biased by the fact that these sites were used to build the matrix).

**matrix_sites_cv**-
This is the distribution of scores for the matrix sites, evaluated with the LOO procedure.

**Jacques van Helden <Jacques.van-Helden[at]univ-amu.fr>**

**Alejandra Medina-Rivera <amedina[at]liigh.unam.mx> (CCG, UNAM, Mexico)**

**Morgane Thomas-Chollier <morgane[at]bigre.ulb.ac.be>**

**sequences**

**pattern matching**

**PSSM**

**evaluation**

matrix-quality [-i inputfile] [-o outputfile] [-v]

**-v #**-
Level of verbosity (detail in the warning messages during execution)

**-h**-
Display full help message

**-dry**-
Dry run: print the commands but do not execute them.

**-help**-
Same as -h

**-m matrix_file**-
Matrix file. If the file includes several matrices, it will only take the first one.

**-ms matrix_sites**-
File containing both a matrix and its sites. The sites are then used as positive sequence set, and labelled as "matrix_sites" in the distribution tables and graphs.

The option -ms is only valid with the file formats which contain both the matrix and its sites (e.g. consensus, MotifSampler, meme, infogibbs and transfac). The format of the matrix+site file can be specified with the option '-matrix_format'.

If the matrix and its sites are only available in separate files, an equivalent effect can be obtained by combining the options "-m my_matrix.tab" and "-seq matrix_sites site_sequences.fasta". Note, however, that the LOO test is not performed in that case.

If *matrix-scan-quick* is available on the machine, this program will be used instead of matrix-scan. For *matrix-scan-quick*, the matrix must be in infogibbs or tab format. If the file includes several matrices, only the first one is used.

**-matrix_format matrix_format**-
Format of the matrix file.

**-seq seq_type seq_file**-
File containing a sequence set of a given type. The first argument indicates the type of the sequences (which will appear in the legend of the plots), and the second argument the file name.

**-scanopt seq_type "option1 option2 ..."**-
Sequence set-specific options for matrix-scan. These options are added at the end of the matrix-scan command for scanning the specified sequence set.

**-no_cv**-
Do not apply the leave-one-out (LOO) test on the matrix site sequences.

**-kfold k**-
k-fold cross-validation.

Divide the matrix sites into k chunks for cross-validation. The chunks are sampled randomly.

**-noperm**-
Skip the matrix permutation step. This option is mainly used for debugging, or to run the last steps (comparison + graph generation) without re-running the time-consuming scanning steps.

**-noscan**-
Skip the matrix-scan step. This option is mainly used for debugging, or to run the last steps (comparison + graph generation) without re-running the time-consuming scanning steps.

**-nocompa**-
Skip the step of comparisons between distributions. This option is mainly used for debugging, or to run the last steps (comparison + graph generation) without re-running the time-consuming scanning steps.

**-nograph**-
Skip the step of drawing comparison graphs.

**-noicon**-
Do not generate the small graphs (icons) used for the galleries in the indexes.

**-export_hits**-
Return matrix-scan scores in addition to the distribution of scores. Beware: this option can produce very large files and use a lot of disk space.

**-perm seq_type #**-
Number of permutations for a specific set (default 0).

**-perm_sep**-
Calculate the distributions for each permuted matrix separately. This provides an estimate of the variability between permutations, but the resulting graph is less readable, because of the multiplicity of curves.

**Note:** the option to merge permutations (*-perm_merged*) has been disabled since we switched from matrix-scan to matrix-scan-quick. The option *-perm_sep* is thus currently the only mode of presentation. We still need to implement the merging of the distributions in order to re-enable the option -perm_merged (see wish list).

**-seq_format sequence_format**-
Sequence format.

**-pseudo pseudo_counts**-
Pseudo-counts. The pseudo-count reflects the possibility that residues that were not (yet) observed in the model might however be valid for future observations. The pseudo-count is used to compute the corrected residue frequencies.

**-th_prior background_file**-
Background model to be used to calculate the theoretical distribution of the matrix scores. The theoretical distribution is calculated with *matrix-distrib*.

**-bg_format background_file**-
Format for the background model file.

Supported formats: all the input formats supported by convert-background-model.

**-decimals #**-
Number of decimals for computing weight scores (default 2). This argument is passed to *matrix-scan* and *matrix-distrib*.

**-o output_prefix**-
Prefix of the output files. The program generates various files, and automatically adds a specific suffix to each output file.

*pos_scores*-
Scores of the positive sequence set.

**-graph_option 'option1 options2 ...'**-
Specify options that will be passed to the program *XYgraph* for generating the distributions and the ROC curves.

Beware: if an option must be followed by a value (e.g. -xsize 1000), you have to enclose the option and its value in quotes.

Example -graph_option '-size 800 -title "LexA matrix" -bg blue'

This option can be used iteratively on a command line.

Example -graph_option '-xsize 1000' -graph_option '-title "LexA matrix"'

**-roc_ref**-
Reference distribution for the ROC curve.

**-roc_option 'option1 options2 ...'**-
Specify options that will be passed to the program *XYgraph* for generating the ROC curves (not the distribution curves).

Beware: if an option must be followed by a value (e.g. -xsize 1000), you have to enclose the option and its value in quotes.

Example -roc_option '-ygstep1 0.1 -ygstep2 0.02'

This option can be used iteratively on a command line.

Example -roc_option '-ygstep1 0.1' -roc_option '-ygstep2 0.02'

**-distrib_option 'option1 options2 ...'**-
Specify options that will be passed to the program *XYgraph* for generating the distribution curves (not the ROC curves).

Beware: if an option must be followed by a value (e.g. -xsize 1000), you have to enclose the option and its value in quotes.

Example -distrib_option '-xmin -35 -xmax 20'

**-img_format**-
Image format for the plots (ROC curve, score profiles, ...). To display the supported formats, type the following command: XYgraph -h.

Multiple image formats can be specified either by using iteratively the option, or by separating them by commas.

Example: -img_format png,pdf

**-logo_format**-
Image format for the sequence logos.

Multiple image formats can be specified either by using iteratively the option, or by separating them by commas.

Example: -logo_format png,pdf

**-nwd**-
This option calculates the normalized weight difference (NWD) for the score distribution of the specified sequence set (Medina-Rivera, et al. 2010). At each frequency value (y-axis), we calculate the weight difference (WD), defined as the difference between the observed weight score (Ws) in the set of all upstream non-coding sequences and the expected Ws in the theoretical distribution of the PSSM for a given P-value.

The WD can be visualized as the horizontal distance between the distribution curves. As larger matrices allow higher scores, we divide the difference by the matrix width to obtain the normalized weight difference.

Usage: -nwd seq_type
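As an illustration of the definition above, the following Python sketch computes a normalized weight difference at a single P-value. It is not the implementation used by *matrix-quality*: the function names, the representation of each distribution as a list of (score, frequency) pairs, the absence of interpolation between tabulated scores, and the numeric values are all assumptions made for the example.

```python
def score_at_pvalue(distribution, p_value):
    """Highest score whose frequency is at least p_value.

    `distribution` is a list of (score, frequency) pairs, where the
    frequency is the fraction of positions scoring at least `score`
    (inverse cumulative distribution).
    """
    eligible = [score for score, freq in distribution if freq >= p_value]
    if not eligible:
        raise ValueError("no score reaches the requested frequency")
    return max(eligible)


def normalized_weight_difference(observed, theoretical, p_value, matrix_width):
    """Weight difference (observed minus expected score) divided by matrix width."""
    wd = score_at_pvalue(observed, p_value) - score_at_pvalue(theoretical, p_value)
    return wd / matrix_width


# Made-up distributions, for illustration only.
theoretical = [(0, 1.0), (2, 0.1), (4, 0.01), (6, 0.001)]
observed = [(0, 1.0), (2, 0.2), (4, 0.05), (6, 0.01), (8, 0.001)]
print(normalized_weight_difference(observed, theoretical, p_value=0.01, matrix_width=8))
```

In this made-up example, the observed curve reaches the frequency 0.01 at score 6 and the theoretical curve at score 4, so the horizontal distance is 2, which is then divided by the matrix width.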

**-archive**-
Compress the result directory into a zip archive of the same name (with suffix .zip).

## Title for html

**-html_title**-
Title for the HTML page.

**-task tasks**-
Specify one or several tasks to be run. If this option is not specified, all the tasks are run.

Note that some tasks depend on other ones. This option should thus be used with caution, by experienced users only.

Supported tasks:

**scan**-
Scan sequences with matrix-scan

**theor**-
Calculate the theoretical distribution

**loo**-
Leave-one-out test on the matrix sites

**theor_cv**-
Calculate the theoretical distribution of loo partial matrices

**permute**-
Scan sequences with permuted matrices

**compare**-
Compare distributions between the various input files

**graphs**-
Draw the graphs with distrib comparisons

**synthesis**-
Generate an HTML file with a synthetic report, which displays the main graphs (distribution curves and ROC curve) and provides links to the result files.

In order to be correctly indexed, the graphs have to be generated in png format.

**nwd**-
Calculate the normalized weight difference (NWD) between the theoretical distribution and the score distribution of a specified sequence type.

**Background model**-
*matrix-quality* requires a background model, which will be passed to *matrix-distrib* and *matrix-scan*. This background model can be specified with the same options as for *matrix-scan*.

**Other options**-
All the other options are automatically passed to *matrix-scan*, in order to specify the scanning parameters (strands, background model, ...).

Note that the option '-return' of matrix-scan cannot be used here, because matrix-quality specifies the return fields required for its statistics.

If the option '-bgfile' is specified, the specified background model will be used to calculate the theoretical distribution of the matrix. If another type of background model is specified for matrix-scan ('-bginput' or '-window'), use the '-th_prior' option to specify the background model to be used for the calculation of the theoretical distribution of the matrix.

**matrix-scan**-
Called by *matrix-quality* for scanning the different sets (positive, negative) with the input matrix.

**matrix-distrib**-
Called by *matrix-quality* for computing the theoretical distribution of scores.

**convert-matrix**-
Called by *matrix-quality* to generate column-permuted matrices.

**-perm_merged**-
Merge the permutations in order to obtain a more robust distribution of the permuted matrices. The figure is more readable than with the option -perm_sep (default), but does not reflect the variability between the different permutations.

**-th_prior**-
File in oligo-analysis format.

This option should probably be removed, so that the user has to specify the background file with the option -bgfile. To check.