Tool Parameters

What would you like to look at?
What would you like to have reported?
Read Reformatting Options
Output format
Note on BAM output format: The tool will generate coordinate-sorted BAM, i.e., may change the order of reads compared to the input. For BAM input, select 'Same as input' to produce BAM output with the read order retained.
Use a reference sequence
Reference data as fasta(.gz). Required for SAM input without @SQ headers and useful/required for writing CRAM output (see help).

Additional Options

Help

What it does

Samtools view can:

  1. convert between alignment formats (SAM, BAM, CRAM)
  2. filter and subsample alignments according to user-specified criteria
  3. count the reads in the input dataset or those retained after filtering and subsampling
  4. obtain just the header of the input in any supported format

In addition, the tool has (limited) options to modify read records during conversion and/or filtering by:

  • stripping them of user-specified tags
  • collapsing backward CIGAR operations if they are specified in their CIGAR fields

With default settings, the tool generates a BAM dataset with the header and reads found in the input dataset (which can be in SAM, BAM, or CRAM format).

Alignment format conversion

By changing the Output format, you can convert an input dataset to another format. Inputs of type SAM, BAM, and CRAM are accepted and can be converted to any of these formats by selecting the appropriate output format. Alternatively, the tool can report alignment counts instead of reads.

The tool allows you to specify a reference sequence. This is required for SAM input with missing @SQ headers (which include sequence names, lengths, MD5 checksums, etc.) and is useful (and sometimes necessary) for CRAM input and output.

The following describes how the CRAM format uses the reference sequence. CRAM is (primarily) a reference-based compressed format, i.e., only sequence differences between aligned reads and the reference are stored. As a consequence, the reference that was used during read mapping is needed to interpret the alignment records (a checksum stored in the CRAM file is used to verify that only the correct reference sequence can be used). This allows for more space-efficient storage than the BAM format, but such a CRAM file is not usable without its reference. It is also possible, however, to use CRAM without a reference, with the disadvantage that the read sequences then get stored explicitly (as in SAM and BAM).

The Galaxy tool currently generates only CRAM without reference sequence.

For reference-based CRAM input, the correct reference sequence needs to be specified.
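
The checksum check mentioned above can be illustrated with a minimal Python sketch. It assumes the checksum is the MD5 described for the @SQ M5 field in the SAM specification (characters outside printable, non-space ASCII removed, the rest upper-cased). The function names reference_md5 and reference_matches are hypothetical and are not part of samtools or this tool.

```python
import hashlib


def reference_md5(sequence: str) -> str:
    """MD5 of a reference sequence after M5-style normalisation (assumed)."""
    # Keep only printable, non-space ASCII (codes 33..126), then upper-case.
    cleaned = "".join(c for c in sequence if 33 <= ord(c) <= 126).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


def reference_matches(sequence: str, expected_md5: str) -> bool:
    """True if the sequence hashes to the checksum recorded for it."""
    return reference_md5(sequence) == expected_md5.lower()
```

Under this normalisation, line breaks and letter case in the FASTA file do not affect the checksum, while a single changed base does.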

Filtering alignments

If you ask for A filtered/subsampled selection of reads, the tool will allow you to specify filter conditions and/or to choose a subsampling strategy, and the output will contain one of the following depending on your choice under What would you like to have reported?:

  • All reads retained after filtering and subsampling
  • Reads dropped during filtering and subsampling

If instead you want to split the input reads based on your criteria and obtain two datasets, one with the retained and one with the dropped reads, check the Produce extra dataset with dropped/retained reads? option.

Filtering by regions

You may specify one or more space-separated region specifications after the input filename to restrict output to only those alignments which overlap the specified region(s). Use of region specifications requires a coordinate-sorted and indexed input file (in BAM or CRAM format).

Regions can be specified as: RNAME[:STARTPOS[-ENDPOS]] and all position coordinates are 1-based.

When multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.

Examples of region specifications:

chr1
Output all alignments mapped to the reference sequence named 'chr1' (i.e. @SQ SN:chr1).
chr2:1000000
The region on chr2 beginning at base position 1,000,000 and ending at the end of the chromosome.
chr3:1000-2000
The 1001bp region on chr3 beginning at base position 1,000 and ending at base position 2,000 (including both end positions).
*
Output the unmapped reads at the end of the file. (This does not include any unmapped reads placed on a reference sequence alongside their mapped mates.)
.
Output all alignments. (Mostly unnecessary as not specifying a region at all has the same effect.)
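
As a rough illustration of the region syntax above, the following Python sketch parses RNAME[:STARTPOS[-ENDPOS]] and tests whether an alignment overlaps a region. It is not how samtools itself parses regions: parse_region and overlaps are hypothetical helpers, the special specifications '*' and '.' are not handled, and reference names are assumed not to contain ':'.

```python
from typing import Optional, Tuple

Region = Tuple[str, Optional[int], Optional[int]]


def parse_region(spec: str) -> Region:
    """Split 'RNAME[:STARTPOS[-ENDPOS]]' into (name, start, end).

    A missing start or end is returned as None (open-ended).
    """
    name, colon, coords = spec.partition(":")
    if not colon:
        return name, None, None
    start_text, dash, end_text = coords.partition("-")
    end = int(end_text) if dash else None
    return name, int(start_text), end


def overlaps(region: Region, rname: str, pos: int, endpos: int) -> bool:
    """True if an alignment covering pos..endpos (1-based, inclusive) overlaps."""
    name, start, end = region
    if name != rname:
        return False
    if start is not None and endpos < start:
        return False
    if end is not None and pos > end:
        return False
    return True


# An alignment that overlaps two requested regions is matched twice.
regions = [parse_region("chr3:1000-2000"), parse_region("chr3:1500-3000")]
hits = sum(overlaps(r, "chr3", 1900, 2050) for r in regions)  # 2
```

The final two lines show why an alignment can appear more than once in the output when several overlapping regions are requested.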

Filtering by quality

This filters on the MAPQ column of the SAM format, which estimates how likely it is that the alignment is placed correctly. Note that aligners do not follow a consistent definition when computing MAPQ, so the same threshold may not mean the same thing for different aligners.
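
For orientation, the SAM specification defines MAPQ as -10 log10 of the probability that the mapping position is wrong, with 255 meaning that no mapping quality is available. The Python sketch below applies that definition. The function name is hypothetical, and because aligners deviate from the definition, the result should be read as a nominal value only.

```python
from typing import Optional


def mapq_to_error_probability(mapq: int) -> Optional[float]:
    """Nominal probability that the mapping position is wrong."""
    if mapq == 255:  # 255 means "mapping quality not available"
        return None
    return 10 ** (-mapq / 10)
```

Under this definition, a minimum MAPQ of 30 nominally keeps alignments with at most a 1 in 1000 chance of being misplaced.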

Filtering by Tag

This filter selects reads based on tool- or user-specific tags, e.g., XS:i:-18, the alignment score tag of bowtie. To filter for a specific value of a tag, use the format STR1:STR2, e.g., XS:-18 to keep reads with an alignment score of -18. You can also give just STR1 without the value STR2, in which case the filter selects all reads that carry the tag STR1, e.g., XS.
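
The STR1:STR2 matching described above might be sketched in Python as follows. This is an illustration only: parse_tag_filter and record_matches_tag are hypothetical helpers, the record is assumed to be a tab-separated SAM line, and values are compared as plain strings, which is an assumption rather than a description of how samtools compares them.

```python
from typing import Optional, Tuple


def parse_tag_filter(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'STR1[:STR2]' into (tag, value); value is None if omitted."""
    tag, colon, value = spec.partition(":")
    return tag, (value if colon else None)


def record_matches_tag(sam_line: str, spec: str) -> bool:
    """True if the SAM record carries the tag (and value, if one is given)."""
    tag, wanted = parse_tag_filter(spec)
    # Optional fields start at column 12, each written as TAG:TYPE:VALUE.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        name, _type, value = field.split(":", 2)
        if name == tag and (wanted is None or value == wanted):
            return True
    return False
```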

Filtering by Expression

Filter expressions are used as an on-the-fly checking of incoming SAM, BAM or CRAM records, discarding records that do not match the specified expression.

The language used is primarily C style, but with a few differences in the precedence rules for bit operators and the inclusion of regular expression matching.

The operator precedence, from strongest binding to weakest, is

Grouping        (, )             E.g. "(1+2)*3"
Values:         literals, vars   Numbers, strings and variables
Unary ops:      +, -, !, ~       E.g. -10, +10, !10 (not), ~5 (bit not)
Math ops:       *, /, %          Multiply, division and (integer) modulo
Math ops:       +, -             Addition / subtraction
Bit-wise:       &                Integer AND
Bit-wise:       ^                Integer XOR
Bit-wise:       |                Integer OR
Conditionals:   >, >=, <, <=
Equality:       ==, !=, =~, !~   =~ and !~ match regular expressions
Boolean:        &&, ||           Logical AND / OR
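
The practical consequence of the table above is that bit-wise operators bind more tightly than comparisons, whereas in C the equality operator binds more tightly than &. The short Python sketch below contrasts the two readings of the text "flag & 4 == 4". The flag value is a made-up example, and the C reading is emulated with explicit parentheses.

```python
flag = 5  # hypothetical FLAG value with bits 1 and 4 set

# Filter-expression reading: the bit-wise AND is applied first.
filter_style = (flag & 4) == 4

# C reading of the same text: the equality is applied first,
# so the expression becomes flag & 1.
c_style = flag & int(4 == 4)
```

Here filter_style is True while c_style is 1, so an expression copied from C code without parentheses may select different reads.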

Expressions are computed using floating point mathematics, so "10 / 4" evaluates to 2.5 rather than 2. Numbers may be written as integers in decimal or "0x" plus hexadecimal, and as floating point with or without exponents. However, operations that require integers first do an implicit type conversion, so "7.9 % 5" is 2 and "7.9 & 4.1" is equivalent to "7 & 4", which is 4. Strings are always specified using double quotes. To get a double quote in a string, use backslash. Similarly, a double backslash is used to get a literal backslash. For example ab\"c\\d is the string ab"c\d.
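
The numeric rules above can be mimicked in Python, shown here only to make the implicit conversion concrete. The helper names are hypothetical, and only non-negative operands are considered.

```python
def filter_modulo(a: float, b: float) -> int:
    """Integer modulo after truncating both operands, as described above."""
    return int(a) % int(b)


def filter_bit_and(a: float, b: float) -> int:
    """Bit-wise AND after truncating both operands."""
    return int(a) & int(b)


division = 10 / 4                    # 2.5: division stays floating point
remainder = filter_modulo(7.9, 5)    # 2, i.e. 7 % 5
masked = filter_bit_and(7.9, 4.1)    # 4, i.e. 7 & 4
```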

Comparison operators are evaluated as a match being 1 and a mismatch being 0, thus "(2 > 1) + (3 < 5)" evaluates as 2. All comparisons involving undefined (null) values are deemed to be false.

The variables are where the file format specifics are accessed from the expression. The variables correspond to SAM fields, for example to find paired alignments with high mapping quality and a very large insert size, we may use the expression "mapq >= 30 && (tlen >= 100000 || tlen <= -100000)". Valid variable names and their data types are:

endpos               int            Alignment end position (1-based)
flag                 int            Combined FLAG field
flag.paired          int            Single bit, 0 or 1
flag.proper_pair     int            Single bit, 0 or 2
flag.unmap           int            Single bit, 0 or 4
flag.munmap          int            Single bit, 0 or 8
flag.reverse         int            Single bit, 0 or 16
flag.mreverse        int            Single bit, 0 or 32
flag.read1           int            Single bit, 0 or 64
flag.read2           int            Single bit, 0 or 128
flag.secondary       int            Single bit, 0 or 256
flag.qcfail          int            Single bit, 0 or 512
flag.dup             int            Single bit, 0 or 1024
flag.supplementary   int            Single bit, 0 or 2048
hclen                int            Number of hard-clipped bases
library              string         Library (LB header via RG)
mapq                 int            Mapping quality
mpos                 int            Synonym for pnext
mrefid               int            Mate reference number (0 based)
mrname               string         Synonym for rnext
ncigar               int            Number of cigar operations
pnext                int            Mate's alignment position (1-based)
pos                  int            Alignment position (1-based)
qlen                 int            Alignment length: no. query bases
qname                string         Query name
qual                 string         Quality values (raw, 0 based)
refid                int            Integer reference number (0 based)
rlen                 int            Alignment length: no. reference bases
rname                string         Reference name
rnext                string         Mate's reference name
sclen                int            Number of soft-clipped bases
seq                  string         Sequence
tlen                 int            Template length (insert size)
[XX]                 int / string   XX tag value

Flags are returned either as the whole flag value or by checking for a single bit. Hence the filter expression flag.dup is equivalent to flag & 1024.
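
The relationship between the combined FLAG value and the single-bit variables can be sketched in Python as follows. The bit values are taken from the variable table above; flag_field and set_flags are hypothetical helper names.

```python
FLAG_BITS = {
    "paired": 1,
    "proper_pair": 2,
    "unmap": 4,
    "munmap": 8,
    "reverse": 16,
    "mreverse": 32,
    "read1": 64,
    "read2": 128,
    "secondary": 256,
    "qcfail": 512,
    "dup": 1024,
    "supplementary": 2048,
}


def flag_field(flag: int, name: str) -> int:
    """Mimic 'flag.<name>': 0 if the bit is unset, else the bit's own value."""
    return flag & FLAG_BITS[name]


def set_flags(flag: int) -> list:
    """Names of all bits that are set in the combined FLAG value."""
    return [name for name, bit in FLAG_BITS.items() if flag & bit]
```

For example, flag_field(1107, "dup") returns 1024, and set_flags(1107) lists paired, proper_pair, reverse, read1 and dup.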

"qlen" and "rlen" are measured using the CIGAR string to count the number of query (sequence) and reference bases consumed. Note "qlen" may not exactly match the length of the "seq" field if the sequence is "*".

"sclen" and "hclen" are the number of soft and hard-clipped bases respectively. The formula "qlen-sclen" gives the number of sequence bases used in the alignment, distinguishing between global alignment and local alignment length.

"endpos" is the (1-based inclusive) position of the rightmost mapped base of the read, as measured using the CIGAR string, and for mapped reads is equivalent to "pos+rlen-1". For unmapped reads, it is the same as "pos".

Reference names may be matched either by their string forms ("rname" and "mrname") or as the Nth @SQ line (counting from zero) as stored in BAM using "refid" and "mrefid" respectively.
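
The CIGAR-derived quantities described above ("qlen", "rlen", "sclen", "hclen" and "endpos") can be reproduced with a short Python sketch. It assumes the standard SAM CIGAR operators, with M, I, S, = and X consuming query bases and M, D, N, = and X consuming reference bases. The function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # operators that consume query bases
REF_OPS = set("MDN=X")    # operators that consume reference bases


def cigar_lengths(cigar: str) -> dict:
    """Return qlen, rlen, sclen and hclen for a CIGAR string."""
    totals = {"qlen": 0, "rlen": 0, "sclen": 0, "hclen": 0}
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in QUERY_OPS:
            totals["qlen"] += n
        if op in REF_OPS:
            totals["rlen"] += n
        if op == "S":
            totals["sclen"] += n
        if op == "H":
            totals["hclen"] += n
    return totals


def endpos(pos: int, cigar: str) -> int:
    """Rightmost mapped base (1-based, inclusive); pos itself if nothing maps."""
    rlen = cigar_lengths(cigar)["rlen"]
    return pos + rlen - 1 if rlen else pos
```

For the CIGAR string 5S10M2D3I20M5H this gives qlen 38, rlen 32, sclen 5 and hclen 5, so "qlen-sclen" is 33 and an alignment starting at position 100 ends at position 131.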

Auxiliary tags are described in square brackets and these expand to either integer or string as defined by the tag itself (XX:Z:string or XX:i:int). For example [NM]>=10 can be used to look for alignments with many mismatches and [RG]=~"grp[ABC]-" will match the read-group string.

If no comparison is used with an auxiliary tag it is taken simply to be a test for the existence of that tag. So [NM] will return any record containing an NM tag, even if that tag is zero (NM:i:0). In htslib <= 1.15 negating this with ![NM] gave misleading results as it was true if the tag did not exist or did exist but was zero. Now this is strictly does-not-exist. An explicit exists([NM]) and !exists([NM]) function has also been added to make this intention clear.

Similarly in htslib <= 1.15 using [NM]!=0 was true both when the tag existed and was not zero as well as when the tag did not exist. From 1.16 onwards all comparison operators are only true for tags that exist, so [NM]!=0 works as expected.

Some simple functions are available to operate on strings. These treat the strings as arrays of bytes, permitting their length, minimum, maximum and average values to be computed. These are useful for processing Quality Scores.

length(x)   Length of the string (excluding nul char)
min(x)      Minimum byte value in the string
max(x)      Maximum byte value in the string
avg(x)      Average byte value in the string

Note that "avg" is a floating point value and it may be NAN for empty strings. This means that "avg(qual)" does not produce an error for records that have both seq and qual of "*". NAN values will fail any conditional checks, so e.g. "avg(qual) > 20" works and will not report these records. NAN also fails all equality, < and > comparisons, and returns zero when given as an argument to the exists function. It can be negated with !x in which case it becomes true.
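
The behaviour of the string functions on quality values, including the NAN case, can be imitated in Python. In this sketch the raw (0-based) quality values are held in a bytes object, and avg is a hypothetical stand-in for the filter function of the same name.

```python
import math


def avg(values: bytes) -> float:
    """Average byte value; NaN for an empty string, as described above."""
    return sum(values) / len(values) if values else math.nan


quals = bytes([30, 32, 12, 38])  # made-up raw quality values
summary = (len(quals), min(quals), max(quals), avg(quals))  # (4, 12, 38, 28.0)

empty_avg = avg(b"")           # NaN for a record whose qual is "*"
passes = empty_avg > 20        # False: NaN fails conditional checks
equal_to_itself = empty_avg == empty_avg  # False: NaN fails equality too
```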

Functions that operate on both strings and numerics:

exists(x)      True if the value exists (or is explicitly true).
default(x,d)   Value x if it exists or d if not.
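
The tag semantics described for htslib 1.16 and later, together with exists and default, can be paraphrased in Python. Here a record's auxiliary tags are modelled as a dictionary, which is an assumption made purely for illustration, and all three helper names are hypothetical.

```python
def exists(tags: dict, name: str) -> bool:
    """True if the tag is present, even when its value is zero."""
    return name in tags


def default(tags: dict, name: str, fallback):
    """The tag's value if it is present, otherwise the fallback."""
    return tags.get(name, fallback)


def tag_not_equal(tags: dict, name: str, value) -> bool:
    """Like '[XX] != value': only true for tags that actually exist."""
    return name in tags and tags[name] != value
```

With tags = {"NM": 0}, exists(tags, "NM") is True although the value is zero, default(tags, "AS", -1) falls back to -1, and tag_not_equal returns False for a missing tag.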

Functions that apply only to numeric values:

sqrt(x)     Square root of x
log(x)      Natural logarithm of x
pow(x, y)   Power function, x to the power of y
exp(x)      Base-e exponential, equivalent to pow(e,x)
