          DFAM : Multiple alignment, consensus sequences, and
                 profile HMMs of repetitive DNA

                            RELEASE 3.6
--------------------------------------------------------------------


1. INTRODUCTION

Dfam is a collection of conserved DNA element sequence alignments,
consensus sequences, hidden Markov models (HMMs) and complete genome
annotations. Dfam has full, curated libraries for five model organisms
(Human, Mouse, Zebrafish, Worm and Fly) as well as curated
lineage-specific libraries for 242 mammals and a growing number of
newly sequenced species. In addition, Dfam now stores uncurated
(de-novo produced) families for over 781 additional species.


2. LOCATIONS

Dfam is available on the web at:

    https://dfam.org/


3. STATISTICS

Dfam 3.6 consists of 732,993 families across 1109 taxa. Dfam families
include retrotransposons, DNA transposons, interspersed repeats of
unknown origin, and a number of non-TE entries used to annotate
satellites or to avoid annotating noncoding RNA genes as TEs. The
distribution of these constituent family types in each annotated
taxonomic group is given below:

Curated Families (DF Records)
-----------------------------
Taxa                                                   Retrotransposons  DNA Transposons   Other
-------                                                ----------------  ---------------   -----
Mammalia                                                           8477             1659     125
Sauropsida [Reptiles and Birds]                                     179               16       7
Actinopterygii [Bony Fishes]                                       1077              769      48
Nematoda                                                             65              100      23
Insecta                                                            1530             1017    3596
Myxogastri [Slime Molds]                                              3                0       0
Echinodermata [Starfishes, Sea Urchins/Cucumbers]                     2                0       0
Hemichordata [Acorn Worms, etc]                                       4                0       0
Cyclostomata [Jawless Vertebrates]                                    1                0       0
Panarthropoda [Arthropods, Velvet Worms, Water Bears]                11                0       0
Spiralia [Segmented Worms, Arrow Worms, Mollusks]                    18                0       0
Cnidaria [Sea Anemones and Corals]                                   11                0       0
Ctenophora [Comb Jellies]                                             2                0       0

Uncurated Families (DR Records)
-------------------------------
Taxa                                                   Retrotransposons  DNA Transposons   Other
-------                                                ----------------  ---------------   -----
Mammalia                                                          95061            18924   11829
Sauropsida [Reptiles and Birds]                                  122930            43682   64406
Amphibia                                                           4451             4813    6814
Actinopterygii [Bony Fishes]                                      70668           141149  108984
Chondrichthyes [Cartilaginous Fishes]                              5655              786    3434
Petromyzontiformes [Lampreys]                                       521              243     178
Viridiplantae [Green Plants]                                       1684              780    1138


4. CONSTRUCTION OF DFAM

Dfam utilizes public genome assemblies and species-specific GARLIC [4]
artificial benchmark sequences for each genome.

Sequence alignments for the five model organisms were built using
RepBase consensus sequences (http://www.girinst.org/repbase/) in a
collaboration with GIRI. For these families the seed alignments were
generated using RepeatMasker alignments of the consensus sequences
against the current UCSC assemblies (http://genome.ucsc.edu/ - hg38,
ce10, dm6, mm10 and danRer10). For each family, annotated instances
were transitively aligned based on mutual alignment to the RepBase
consensus sequence. Hidden Markov models (HMMs) were constructed from
the sequence alignment using the HMMER3 tool hmmbuild, and each model
was then searched against Dfamseq using the HMMER3 tool nhmmer, with
hit metadata (sequence location, score, etc) captured for
distribution.

As RepBase is a closed database, newer families in Dfam are based on
seed alignments generated by additional methods, including: direct
output from de-novo tools such as RepeatModeler and REPET, subfamily
detection programs such as Coseg, and manual clustering and alignment
of genomic sequences.


5a. DESCRIPTION OF CHANGES FROM RELEASE 3.5 TO 3.6

Four new community-submitted libraries of manually curated TE families:
  * 3360 families in the Rice weevil (Sitophilus oryzae)
  * 22 SINE families from Moth species
  * 120 Penelope-classified families
  * 41 families from the Telomere-to-telomere (T2T) Consortium's
    complete assembly of the CHM13 human genome.
    22 "composite" repetitive families have been excluded from this
    release due to the inclusion of genes or other non-TE content,
    which would lead to misleading annotation without additional
    context; these can be found in Table S2 of the paper:
    https://www.science.org/doi/10.1126/science.abk3112

New EBI produced libraries
  * In collaboration with Fergal Martin and Denye Ogeh at the European
    Bioinformatics Institute, we processed and imported RepeatModeler
    runs on 444 additional species (440,543 families). An additional
    extension and re-classification step was run on each, and final
    consensus sequences and HMMs were produced. Relationship data is
    not available for these uncurated imports at this time.


5b. DESCRIPTION OF CHANGES FROM RELEASE 3.4 TO 3.5

Zoonomia Full Genome Annotation Sets
  * Full genome annotations for 224 additional Zoonomia project species
    are now available for exploring or downloading from the website.
    These assemblies have been annotated with the lineage-specific TE
    libraries developed by David Ray's lab with further curation by
    Arian Smit. The annotations also include the ancestral families
    previously imported as part of our human and mouse libraries.
  * The Zoonomia project multi-species alignment and ancestral genome
    reconstructions are being used to generate high-quality ancestral
    TE libraries for mammals. The first set of 271 LTR families found
    in Platyrrhini has been added to this release.

HMM Consensus Base and Divergence calculations
  * In this release we have taken another step towards linking the
    profile HMM and consensus models for each family. Traditionally,
    the pHMM files are generated with an auxiliary consensus symbol for
    each node or position within the profile. This symbol is chosen
    based on the highest scoring match state for the node and is used
    only to generate a model sequence in alignment output. The
    consensus sequence developed for each family in Dfam uses the seed
    alignment data to determine the consensus residue.
    In this release, we have updated the pHMMs with the seed alignment
    consensus symbols and recalculated the alignments and average
    sequence divergence values for all families.

Seed alignments
  * For each family in Dfam we store a multiple sequence alignment
    composed of a canonical set of TE copies (seed alignments). In
    previous releases of Dfam, we stored these alignments using a
    semi-compressed A3M format to manage storage space. This format is
    lossy in some circumstances. For instance, the relative alignment
    of residues within an insertion gap is not preserved. In this
    release, we upgraded Dfam's seed alignment storage to utilize COMSA
    (Debudaj-Grabysz et al 2019), a lossless multiple sequence
    alignment compression format. This change resulted in a 40 fold
    improvement in storage space for this datatype.


6. DESCRIPTION OF RELEASE FILES

relnotes.txt                - This file.
userman.txt                 - A fuller description of Dfam fields.

families/
  Dfam.hmm.gz               - Dfam HMMs in an HMM library, searchable
                              with the nhmmer program.
  Dfam.embl.gz              - Dfam consensi in an EMBL library,
                              searchable with RepeatMasker 4.1.0 and
                              earlier.
  Dfam.h5.gz                - Dfam HMMs and consensi in the FamDB file
                              format, for RepeatMasker 4.1.1 and later.
  Dfam.curatedonly.hmm.gz   - Dfam HMMs for curated families only (DF
                              accessions).
  Dfam.curatedonly.embl.gz  - Dfam consensi in EMBL format for curated
                              families only (DF accessions).
  Dfam.curatedonly.h5.gz    - Dfam HMMs and consensi in FamDB format
                              for curated families only (DF
                              accessions).

annotations//
  .hits                     - TSV list of all matches found in the
                              given assembly that score above the GA
                              threshold.
                              e.g. hg38.hits.gz
  .nrph.hits                - TSV list of all non-redundant matches
                              found in the given assembly that score
                              above the GA threshold.
                              e.g. hg38.nrph.hits.gz

infrastructure/
  dfamscan.pl               - A basic search tool to query Dfam
                              families against an input sequence.
  schema/                   - Directory with printable ERD diagrams
                              for Dfam.
  db-exports/               - Partial export of the mysql database.
                              We are no longer generating the
                              per-assembly annotation databases in
                              mysql format by default, but will
                              generate them upon request
                              (help@dfam.org).

apidocs/                    - Dfam REST API documentation (HTML)


7. DESCRIPTION OF FIELDS

See userman.txt for a more detailed description of each field.

Compulsory fields:
------------------
   AC   Accession number:       Accession number in form DFxxxxxxx.
   ID   Identification:         One word name for entry.
   DE   Definition:             Short description of entry.
   AU   Author:                 Authors of the entry.
   SE   Source of seed:         The source suggesting the seed members
                                belong to one entry.
   GA   Gathering method:       Score used for sequences within the
                                clade specified by MS.
   TC   Trusted Cutoff:         Score used for sequences outside the
                                clade specified by MS.
   NC   Noise Cutoff:           Smaller cutoff than GA; not used in
                                Dfam.
   FR   False Discovery Rate:   Target FDR used to set GA.
   BM   Build method
   SM   Internal search method
   MS   Model specificity:      TaxID and TaxName, based on NCBI
                                taxonomy.
   CT   Classification tags:    Repeat Type, Class, and Superfamily.
   SQ   Sequence:               Number of sequences in alignment.
   //                           End of alignment.

Optional fields:
----------------
   DC   Database Comment:       Comment about database reference.
   DR   Database Reference:     Reference to external database.
   RC   Reference Comment:      Comment about literature reference.
   RN   Reference Number:       Reference Number.
   RM   Reference Medline:      Eight digit medline UI number.
   RT   Reference Title:        Reference Title.
   RA   Reference Author:       Reference Author.
   RL   Reference Location:     Journal location.
   PI   Previous identifier:    Record of all previous ID lines.
   CC   Comment:                Comments.
   WK   Wikipedia Reference:    Reference to Wikipedia.
   SN   Synonym:                A widely accepted alternative name for
                                the model.
   CN   Classification Note:    A free text comment about the model
                                classification.


8. REFERENCES

1. The Dfam Community Resource of Transposable Element Families,
   Sequence Models, and Genome Annotations.
   Storer, J., Hubley, R., Rosen, J., Wheeler, T., & Smit, A. F.
   Mobile DNA. 2021; 12: 2.
   https://doi.org/10.1186/s13100-020-00230-y

2. The Dfam Database of Repetitive DNA Families.
   Robert Hubley, Robert D. Finn, Jody Clements, Sean R. Eddy,
   Thomas A. Jones, Weidong Bao, Arian F.A. Smit, Travis J. Wheeler.
   Nucleic Acids Research 44.D1 (2016): D81-D89.

3. Dfam: a Database of Repetitive DNA Based on Profile Hidden Markov
   Models.
   Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J,
   Smit AFA, Finn RD.
   Nucl. Acids Res. (2013) Database Issue 41:D70-82.
   doi: 10.1093/nar/gks1265

4. Realistic artificial DNA sequences as negative controls for
   computational genomics.
   Caballero J, Smit AF, Hood L, Glusman G.
   Nucl. Acids Res. 2014
   doi: 10.1093/nar/gku356


9. THE DFAM CONSORTIUM

Dfam is maintained by a consortium of researchers. You can contact the
Dfam consortium at:

    help@dfam.org

The current/past members of the Dfam consortium are:

    Arian F. A. Smit, Robert Hubley,
    Jessica Storer and Jeb Rosen:      Institute for Systems Biology, USA
    Travis J. Wheeler:                 University of Montana, USA
    Robert D. Finn:                    EMBL, UK
    Jody Clements:                     Janelia Farm Research Campus, USA
    Sean R. Eddy, Thomas A. Jones:     Harvard University, USA
    Jerzy Jurka:                       Genetic Information Research
                                       Institute, USA


11. ACKNOWLEDGEMENTS

A.F.A.S., R.M.H., J.R. and T.J.W. were supported by a grant from the
National Institutes of Health (NHGRI grant #U24-HG010136). R.D.F.,
J.C., S.R.E., T.A.J., and T.J.W. received institutional support from
HHMI Janelia Farm Research Campus. J.J. was supported by grants from
the National Library of Medicine, National Institutes of Health
(P41LM006252-12). A.F.A.S. and R.H. were supported by a grant from the
National Institutes of Health (R01 HG002939).


12. COPYRIGHT NOTICE

Dfam - A database of conserved DNA element alignments and HMMs
Copyright (C) 2015-2022 The Dfam consortium.

This database is free; you can redistribute it and/or modify it as you
wish, under the terms of the CC0 1.0 license, a 'no copyright' license:

The Dfam consortium has dedicated the work to the public domain,
waiving all rights to the work worldwide under copyright law, including
all related and neighboring rights, to the extent allowed by law.

You can copy, modify, distribute and perform the work, even for
commercial purposes, all without asking permission. See Other
Information below.

Other Information

  o In no way are the patent or trademark rights of any person affected
    by CC0, nor are the rights that other persons may have in the work
    or in how the work is used, such as publicity or privacy rights.

  o Unless expressly stated otherwise, the Dfam consortium makes no
    warranties about the work, and disclaims liability for all uses of
    the work, to the fullest extent permitted by applicable law.

  o When using or citing the work, you should not imply endorsement by
    the Dfam consortium.

You may also obtain a copy of the CC0 license here:

    http://creativecommons.org/publicdomain/zero/1.0/legalcode

___________________
The Dfam Consortium
2022