Each entry, or model, in Dfam is represented on the website by a single page. To link stably to an entry page, use the Dfam accession, so that the URL looks like: http://dfam.org/entry/DFXXXXXXX. Each entry page is divided into five tabs, found below the family identifier and accession.
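As an illustration of the stable link format, the following minimal sketch builds an entry URL from an accession. The function name `entry_url` and the seven-digit check are assumptions inferred from the DFXXXXXXX pattern above, not part of any Dfam API.

```python
import re

# Assumption: an accession is "DF" followed by exactly seven digits,
# as suggested by the DFXXXXXXX pattern in the text.
ACCESSION_PATTERN = re.compile(r"DF\d{7}")


def entry_url(accession: str) -> str:
    """Return the stable entry-page URL for a Dfam accession (hypothetical helper)."""
    if not ACCESSION_PATTERN.fullmatch(accession):
        raise ValueError(f"not a Dfam accession: {accession!r}")
    return f"http://dfam.org/entry/{accession}"


print(entry_url("DF0001061"))  # prints http://dfam.org/entry/DF0001061
```

Linking by accession rather than by family name keeps the link valid even if the family identifier is later revised.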
A one-line description of the entry.
This section includes a brief description of the entry. Where appropriate, the description is cross-referenced to the references in the section below. External contributors may add to the annotation of an entry by submitting text through the annotation submission form, accessed via the pen symbol at the top of this block, or indirectly by editing the Wikipedia article. If a relevant Wikipedia article is missing, please contact us.
A simple classification of the entry, consisting of three tiers: Type, Class and Superfamily. There can be a curated comment associated with this classification, often referring to the confidence of the classification.
This table contains the number of matches found between the model and dfamseq at two different thresholds: the gathering and trusted cut-offs. Note that a particular subsequence matching one model may also match other profile HMMs in the database, resulting in what we term 'redundant profile hits' (RPHs). Such a subsequence contributes to the redundant counts for those other entries as well. For non-redundant counts, a sequence is counted only for the model deemed to be the best fit to the sequence (though in a small number of cases, more than one model may be counted because of the conservative nature of the RPH filter).
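The difference between the two counts can be sketched as follows. This is a simplified illustration only: it assumes each hit has been reduced to a hypothetical `(region_id, model, score)` tuple and approximates "best fit" by the highest score, whereas the actual RPH filter is more conservative and can keep more than one model per region. The scores are made-up toy values.

```python
from collections import Counter


def count_hits(hits):
    """Count redundant and non-redundant hits per model.

    hits: iterable of (region_id, model, score) tuples (hypothetical format).
    Returns (redundant, non_redundant) Counters keyed by model name.
    """
    redundant = Counter()
    best = {}  # region_id -> (score, model) of the highest-scoring hit so far
    for region, model, score in hits:
        redundant[model] += 1  # every hit counts towards the redundant total
        if region not in best or score > best[region][0]:
            best[region] = (score, model)
    # Each region is credited only to its best-scoring model.
    non_redundant = Counter(model for _, model in best.values())
    return redundant, non_redundant


# Toy data: one region is matched by two models, another by one model only.
hits = [
    ("chr1:100-400", "AluY", 250.0),
    ("chr1:100-400", "SVA_A", 90.0),
    ("chr2:500-900", "SVA_A", 310.0),
]
redundant, non_redundant = count_hits(hits)
print(redundant["SVA_A"], non_redundant["SVA_A"])  # prints 2 1
```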
External database links
Pertinent database cross references.
The table in this section provides details of how the profile HMM was generated, along with the curated thresholds and expected false discovery rates. All profile HMMs are built from a set of aligned representative sequences, termed the seed alignment. The source of this seed alignment is shown in the top right corner of the table, and the number of sequences in the seed alignment is shown in the table. For each model, three thresholds are defined. The gathering threshold is appropriate for masking sequences that match the model's specificity. It is set using knowledge of the approximate copy number to yield high sensitivity with low false discovery rates, and it is the threshold used when defining the model's expected false discovery rate. When annotating sequences that fall outside the model's specificity, the more stringent trusted cut-off should be used. The noise cut-off indicates the score of the highest-scoring match that falls below the gathering threshold.
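The relationship between the three thresholds can be sketched in a few lines. The sketch assumes that scores are bit scores where higher is better, that the trusted cut-off is at least as high as the gathering threshold, and that a score equal to a threshold passes it; the function names and the numbers are illustrative, not taken from Dfam.

```python
def noise_cutoff(scores, gathering_threshold):
    """Return the highest score that falls below the gathering threshold, or None."""
    below = [s for s in scores if s < gathering_threshold]
    return max(below) if below else None


def classify(score, gathering_threshold, trusted_cutoff):
    """Label a hit score relative to the two curated thresholds."""
    if score >= trusted_cutoff:
        return "trusted"
    if score >= gathering_threshold:
        return "gathering"
    return "below threshold"


# Toy scores with a gathering threshold of 20.0 and a trusted cut-off of 30.0.
scores = [12.0, 18.5, 25.0, 40.0]
print(noise_cutoff(scores, 20.0))                   # prints 18.5
print([classify(s, 20.0, 30.0) for s in scores])
```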
A representation of the per-position residue and indel conservation of the HMM for the entry. Each position in the model is represented by a stack of letters, with stack height indicating the information content of the position. The rate and expected length of insertions after each position are shown in the fields below each stack. The logo can be zoomed to show more or less of the model as desired. It is also possible to center the logo by entering a column number in the field provided. Below the logo, the consensus sequence derived from the HMM (which should correspond to the tallest letter in each per-position stack of the logo) is also available.
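To make "information content" concrete, the sketch below computes the relative entropy of one position's residue distribution against a background. It assumes a uniform background over the four nucleotides unless one is supplied; the logo on the site may use a different background or letter-height scheme, so treat this as an illustration of the concept rather than the site's exact calculation.

```python
import math


def information_content(probs, background=None):
    """Relative entropy, in bits, of a residue distribution against a background.

    probs: dict mapping residue -> probability at one model position.
    background: dict with the same keys; uniform if omitted (an assumption).
    """
    if background is None:
        background = {base: 1 / len(probs) for base in probs}
    return sum(
        p * math.log2(p / background[base])
        for base, p in probs.items()
        if p > 0  # a zero-probability residue contributes nothing
    )


def consensus_residue(probs):
    """Return the most probable residue, i.e. the tallest letter in the stack."""
    return max(probs, key=probs.get)


conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(information_content(conserved))  # prints 2.0
print(information_content(uniform))    # prints 0.0
```

A fully conserved nucleotide position therefore reaches the maximum stack height of 2 bits under a uniform background, while an uninformative position has height 0.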
False Coverage Plot
This plot shows how matches to an artificial benchmark sequence not containing any TE insertions are distributed across the model. This plot helps identify model regions that may be responsible for generating false positive hits, for example due to low-complexity or simple repeat characteristics.
The figure below shows an example False Coverage Plot, for MER4-int. It shows a spike of 33 false hits with E-value better than 1, covering a window around position 800 of the model, and a small tail of false hits out to position 1100. Only the few hits signified by the purple part of the graph have E-value better than 1e-2.
Non-Redundant Coverage, Conservation, and Inserts
The Non-Redundant Coverage, Conservation, and Inserts plot shows, for hits above a variety of thresholds, (1) the distribution of hits along the model, (2) the position-specific levels of conservation of those hits, and (3) the position-specific rates of insertion among those hits. For a selected threshold, the purple line shows, for each model position, the fraction of all hits that have a match to that position, considering only RPH-filtered hits (hits for which this model is deemed to fit the sequence better than any other Dfam model). Among RPH-filtered hits, the green line shows, for each position, the average percent identity for a window of length 7 around the position. The grey line shows the number of insertions among those hits. In the threshold selection box, the number in parentheses shows the number of hits meeting the given threshold.
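The window-of-7 averaging behind the green line can be sketched as below. The sketch assumes the window is centered on each position and simply truncated at the ends of the model; how the site treats the edges is not stated, and the function name is illustrative.

```python
def windowed_mean(values, window=7):
    """Average each position with its neighbours in a centered window.

    values: per-position percent identities.
    Windows are truncated at the model ends (an assumption).
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# A single well-conserved position is spread across its window of 7.
print(windowed_mean([0, 0, 0, 70, 0, 0, 0])[3])  # prints 10.0
```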
The figure below shows an example of the plot for the model Kanga1, representing the 387 RPH-filtered matches with E-value better than 1e-4. The purple line shows that hits tend not to be full-length (the middle section is covered by only a few percent of hits), and that only about 40% of hits are aligned to the 3' end, where coverage is greatest. The green line shows that sequence conservation is on average around 70%, but with some variability. The grey line shows that inserts are generally rare, but at model position 1569, the roughly 153 hits covering that position (about 40% of 387) are followed by 25 inserts.
The Non-Redundant Coverage plot shows the distribution across the model for all above-threshold hits for which this model is deemed to fit the sequence better than any other Dfam model.
The figure below shows the non-redundant coverage for the model Kanga1. This example plot shows a common signal for DNA transposons, with the interior portion of the model covered by fewer instances than the termini, since non-autonomous TEs can suffer various degrees of internal deletion, yet must retain critical terminal features. Many of the 5' terminal hits fall between the gathering threshold E-value of 15 and trusted cutoff E-value of 0.0002, leading to a terminal light green bulge on the left side of the plot.
This plot is much like the Non-Redundant Coverage plot, but instead of showing distribution across the model for only those matches that are best hit by this model, it shows the distribution for all hits to the model, even those better hit by some other model.
The figure below shows the redundant coverage for SVA_A. SVAs carry two reverse-complemented Alu fragments, one from positions 60 to 315 of the SVA model, and another shorter fragment from position 315 to 400. The first of these regions hits most of the Alu instances in the human genome, leading to the very large spike of over 1 million above-threshold hits covering that region, even though there are only a few thousand SVA copies. These Alu instances are better hit by one of the Alu models, and thus do not show up in the non-redundant coverage plot or related hit counts.
Plot showing how many sequences in the seed alignment represent each position in the model.
Seed Whisker Plot
This plot shows the distribution of the seed alignment sequences across the length of the model. Each horizontal line represents one seed sequence, and the range of that line shows the range of model positions covered by the sequence.
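The per-position seed counts and the whisker ranges are two views of the same data: given the range of model positions covered by each seed sequence, the number of sequences representing each position follows directly. A minimal sketch, assuming hypothetical 1-based inclusive `(start, end)` ranges and ignoring any gaps inside a sequence's range:

```python
def seed_coverage(ranges, model_length):
    """Count how many seed sequences cover each model position.

    ranges: (start, end) model positions per seed sequence, 1-based inclusive.
    Returns a list where element i is the count for model position i + 1.
    """
    # Difference array: +1 where a range starts, -1 just after it ends.
    delta = [0] * (model_length + 1)
    for start, end in ranges:
        delta[start - 1] += 1
        delta[end] -= 1
    counts, running = [], 0
    for change in delta[:model_length]:
        running += change
        counts.append(running)
    return counts


# Three toy seed sequences on a model of length 10.
print(seed_coverage([(1, 5), (3, 8), (8, 10)], 10))
# prints [1, 1, 2, 2, 2, 1, 1, 2, 1, 1]
```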
The number of genomic matches found for many TE entries means that it is difficult to provide all matches via a web interface or as a multiple sequence alignment. The karyotype ideogram shows the distribution of hits in Human for the entry as a heat-map-style representation. Each color represents a binned range of counts over a 1 Mb window. Each color band is clickable, with the hits corresponding to that location loaded below the karyotype ideogram. The table contains the match score and E-value, and the positions in the model and in the sequence. The alignment between the model and sequence can be obtained by expanding the row using the > symbol found at the beginning of each row. In the alignment, the model line presents the consensus sequence for aligned states in the model, colored according to the match line. The PP line represents the posterior probability, or degree of confidence in each aligned residue (for example, with '*' meaning highest confidence, and low numbers indicating low confidence), with corresponding grey scale coloration of the Query sequence.
By default, the hits counted here are only those in which this model was determined to better explain the sequence than any other model. To see the distribution of all hits to the model (including those preferred by other models), press "K", or click the "Toggle Karyotype bands" button.
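The 1 Mb binning that underlies the heat map can be sketched as follows. The hit format `(chromosome, start, end)` and the choice to assign each hit to the bin containing its midpoint are assumptions made for illustration; the site may assign hits to bins differently.

```python
from collections import Counter

BIN_SIZE = 1_000_000  # 1 Mb, as described above


def bin_hits(hits, bin_size=BIN_SIZE):
    """Count hits per (chromosome, bin index).

    hits: iterable of (chromosome, start, end) tuples (hypothetical format).
    Each hit is assigned to the bin containing its midpoint (an assumption).
    """
    counts = Counter()
    for chrom, start, end in hits:
        midpoint = (start + end) // 2
        counts[(chrom, midpoint // bin_size)] += 1
    return counts


# Toy coordinates: two hits fall in the 15-16 Mb bin, one in the 16-17 Mb bin.
toy_hits = [
    ("chr21", 15_000_100, 15_000_400),
    ("chr21", 15_900_000, 15_900_250),
    ("chr21", 16_000_010, 16_000_300),
]
binned = bin_hits(toy_hits)
print(binned[("chr21", 15)], binned[("chr21", 16)])  # prints 2 1
```

Each bin count would then be mapped to a color according to the range of counts it falls in.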
The example below shows the non-uniform distribution of MIRs across the human genome. Large patches of white in the hit distribution ideogram indicate regions with no instances of the model; in this case, these are heterochromatic regions that are particularly difficult to sequence (represented by Ns in the genome sequence), as can be seen by toggling the karyotype bands. Below the karyotype ideogram are the hits from a region on chromosome 21, with one hit expanded to show the alignment of that hit to the MIR model.
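The PP line of an expanded alignment can be read programmatically. The sketch below assumes the alignments follow HMMER's usual posterior probability encoding, in which a digit d stands for roughly d × 10% (give or take 5%) and '*' for 95-100%; the page itself states only that '*' is the highest confidence and low numbers are low confidence, so the exact ranges are an assumption.

```python
# Assumed encoding: '0' is 0-5%, '1' is 5-15%, ..., '9' is 85-95%, '*' is 95-100%.
PP_RANGES = {str(d): (max(0, 10 * d - 5), 10 * d + 5) for d in range(10)}
PP_RANGES["*"] = (95, 100)


def decode_pp(pp_line):
    """Map each PP character to a (low, high) percent range.

    Characters outside the encoding (such as '.') map to None.
    """
    return [PP_RANGES.get(ch) for ch in pp_line]


print(decode_pp("*9.0"))  # prints [(95, 100), (85, 95), None, (0, 5)]
```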
The Relationship tab provides a representation of the similarities between TE entries. Consensus sequences were produced for all models using the HMMER3 tool hmmemit. These sequences were then searched with all models using nhmmer; a hit with an E-value better than 1e-5 is taken to support a relationship.
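The thresholding step can be sketched as a simple filter over search results. The sketch assumes the nhmmer output has already been parsed into hypothetical `(query_model, target_consensus, evalue)` tuples, and that a model's hit to its own consensus is ignored; neither detail is specified above, and the family names and E-values are placeholders.

```python
def related_entries(hits, max_evalue=1e-5):
    """Return the unordered pairs of entries supported by at least one hit.

    hits: iterable of (query_model, target_consensus, evalue) tuples
    (hypothetical, already parsed from search output).
    """
    pairs = set()
    for query, target, evalue in hits:
        # Skip self-hits (an assumption) and hits that miss the threshold.
        if query != target and evalue < max_evalue:
            pairs.add(frozenset((query, target)))
    return pairs


# Placeholder names and made-up E-values.
toy_hits = [
    ("famA", "famB", 1e-30),  # strong hit: supports a relationship
    ("famA", "famA", 0.0),    # self-hit: ignored
    ("famA", "famC", 0.5),    # weak hit: does not pass the threshold
]
print(related_entries(toy_hits) == {frozenset(("famA", "famB"))})  # prints True
```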
The image below shows part of the Relationship tab for the Ricksha_c (DF0001061) entry. In this case, the Ricksha TE carries part of an ERVL and its LTR, MLT2. Simple glyphs are used to represent these relationships, as well as similarities to other Ricksha Dfam entries. The case of reverse-complement similarity is shown using a purple glyph with inverted orientation. Each glyph is shown with the accompanying percent identity between the entry consensus sequences, match E-value, and percent shared coverage (length). The list of related entries can be sorted by any of these fields.
From this tab, you can download the seed alignment and profile HMM for the entry. You can also download two files containing tabular hit lists: (1) matches to this model after removing redundant hits to other models ("Non-redundant"), and (2) all matches above threshold, including those with better scoring matches to other models ("All hits").