Dfam

The Dfam community resource of transposable element families,
sequence models, and genome annotations.

RELEASE 4.0
-------------------------------------------------------------------------------

1. INTRODUCTION

Dfam is a collection of conserved DNA element sequence alignments,
consensus sequences, hidden Markov models (HMMs) and complete genome
annotations. Dfam contains curated libraries for five model organisms:
Human, Mouse, Zebrafish, Worm and Fly as well as lineage-specific
libraries for 242 mammals and a growing number of newly sequenced
species. In addition, Dfam stores uncurated (de novo produced) libraries
for over 4,300 additional species.

2. LOCATIONS

Dfam is available on the web at:

    https://dfam.org/

3. STATISTICS

Dfam 4.0 consists of 7.6 million families across 6,549 taxa. Dfam
families include retrotransposons, DNA transposons, interspersed repeats
of unknown origin, and a number of non-TE entries used to annotate
satellites or to avoid annotating noncoding RNA genes as TEs.
The distribution of these constituent family types in each annotated
species is given below:

Curated Families (DF Records)
------------------------------
Taxa                                    Retrotransposons  DNA Transposons    Other
--------------------------------------  ----------------  ---------------  -------
Mammalia                                            8864             1744      177
Sauropsida [Reptiles and Birds]                      179               16        7
Actinopterygii [Bony Fishes]                        1541              968      136
Amphibia                                              68                0        1
Insecta                                             1872             1079     3864
Viridiplantae [Green Plants]                         262               38        0
Fungi                                                102              121      144
Other Metazoa                                       3417             1468     4551
Protists                                               3                0        0

Uncurated Families (DR Records)
--------------------------------
Taxa                                    Retrotransposons  DNA Transposons    Other
--------------------------------------  ----------------  ---------------  -------
Mammalia                                          200058            40316    64079
Sauropsida [Reptiles and Birds]                   208523            72391   216396
Chondrichthyes [Cartilaginous Fishes]              13402             2031    18944
Actinopterygii [Bony Fishes]                      208512           381117   646306
Petromyzontiformes [lampreys]                       2876              997     1778
Amphibia                                           17565            13212    75990
Insecta                                           488810           415413  1940051
Viridiplantae [Green Plants]                      102323            52170  1093341
Fungi                                              49036            32383   378612
Other Metazoa                                      62271            44132   672435
Protists                                            4646             3076    73021

4. CONSTRUCTION OF DFAM

Dfam utilizes public genome assemblies and species-specific GARLIC [4]
artificial benchmark sequences for a subset of curated family species.

Sequence alignments for the five model organisms were built using
RepBase consensus sequences (http://www.girinst.org/repbase/) in a
collaboration with GIRI. For these families, the seed alignments were
derived from RepeatMasker alignments produced by searching the consensus
sequences against the current UCSC assemblies (http://genome.ucsc.edu/ -
hg38, ce10, dm6, mm10 and danRer10). For each family, annotated
instances were transitively aligned based on mutual alignment to the
RepBase consensus sequence.
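
As an illustration of the transitive alignment step described above,
the following is a minimal sketch and not the Dfam pipeline itself. It
assumes each instance comes with a pairwise alignment to the consensus
(two gapped rows plus a 0-based consensus start position), and it simply
drops insertions relative to the consensus. The function names and the
input layout are hypothetical.

```python
def project_onto_consensus(cons_row, inst_row, cons_start, cons_len):
    """Project one pairwise alignment onto consensus coordinates.

    cons_row / inst_row: gapped rows of equal length ('-' marks a gap).
    cons_start: 0-based consensus position where the alignment begins.
    Returns a string of length cons_len; '.' marks uncovered positions.
    """
    if len(cons_row) != len(inst_row):
        raise ValueError("aligned rows must have equal length")
    row = ["."] * cons_len
    pos = cons_start
    for cons_char, inst_char in zip(cons_row, inst_row):
        if cons_char == "-":
            # Insertion relative to the consensus: dropped in this sketch.
            continue
        if pos >= cons_len:
            raise ValueError("alignment runs past the end of the consensus")
        # Match, mismatch, or deletion ('-') at this consensus position.
        row[pos] = inst_char
        pos += 1
    return "".join(row)


def transitive_alignment(consensus, pairwise):
    """Build a multiple alignment from (start, cons_row, inst_row) tuples."""
    return [
        project_onto_consensus(cons_row, inst_row, start, len(consensus))
        for start, cons_row, inst_row in pairwise
    ]


if __name__ == "__main__":
    consensus = "ACGTACGT"
    pairwise = [
        (0, "ACGTACGT", "ACGAAC-T"),  # full-length copy with one deletion
        (2, "GT-AC", "GTTAC"),        # fragment with one inserted base
    ]
    for row in transitive_alignment(consensus, pairwise):
        print(row)
```

Because every instance is placed in the coordinates of the same
consensus, the rows line up with each other without any direct
instance-to-instance alignment. A production seed alignment would also
need to retain insertions rather than discard them.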
Hidden Markov models (HMMs) were constructed from the sequence alignment
using the HMMER3 tool hmmbuild, and each model was then searched against
Dfamseq using the HMMER3 tool nhmmer, with hit metadata (sequence
location, score, etc.) captured for distribution.

As RepBase is a closed database, newer families in Dfam are based on
seed alignments generated by additional methods, including de novo tools
such as RepeatModeler and REPET, subfamily detection programs such as
Coseg, and manual clustering and alignment of genomic sequences.

5a. DESCRIPTION OF CHANGES FROM RELEASE 3.9 TO 4.0

This release includes:

  - Over 276,000 families from 4,200 species of Fungi submitted by
    Tobias Baril
  - Timea, an ancient primate ERV family, submitted by Martin Frith
  - Two tick libraries (Ixodes scapularis and Rhipicephalus microplus),
    submitted by David A. Ray
  - 69 ERV families curated from 29 amphibian genomes, submitted by
    Hayley Beth Free
  - 1,704 new RepeatModeler libraries submitted by Leanne Haggerty (EBI)
  - 75 Vertebrate libraries provided by Hiram Clawson (UCSC, VGP)
  - 3 Fungi libraries (Dendrothele nivosa, Heterobasidion annosum, and
    Stereum hirsutum) submitted by Landen Gozashti
  - Drosophila guayllabambae library submitted by Keyla López
  - Corrections to Drosophila melanogaster library:
      - DF003894124 [Osvaldo8a_DM-LTR] was a duplicate of DF000001635
        [Osvaldo8b_DM-LTR]; the DF003894124 entry was corrected.
      - DF000001712 [TART-B1] had a poor seed alignment, due to the lack
        of seeds in dm6. This was corrected using
        https://www.ncbi.nlm.nih.gov/nuccore/U14101.1.
      - DF003894158/DF003894159 were renamed from Micropia to Spoink,
        and the seed alignment for DF003894158, which was a duplicate
        of DF003894159, was corrected.
      - Deleted DF000001717 [TLD2].
      - Deleted DF000001643 [HETRP_DM].

Component Based FamDB

The 3.0.0 release of FamDB represents a major architectural overhaul
centered on a new component-based file format that replaces the previous
taxonomy-only partition scheme. Families are now organized across four
distinct component types representing the most common use cases: curated
consensus (cc), curated HMM (ch), uncurated consensus (uc), and
uncurated HMM (uh). Each component is independently partitioned across
the taxonomy tree. A key benefit of this design is that users only need
to download the components relevant to their work. For the most common
use case, running RepeatMasker with curated consensus sequences, only
the cc component partitions are required, and at current Dfam release
sizes these fit in a single partition file, making it more manageable to
download and maintain local copies. The famdb.py tool has been updated
throughout to reflect this structure: a new check subcommand reports
exactly which component partition files are required to satisfy a given
species query (including ancestor partitions) and whether each is
locally installed or missing. FamDB has been separated from the
RepeatMasker distribution and is now a standalone package, making it
easier for other tools to integrate with it.

Dfam API and Website Improvements

Since the last release the Dfam website has undergone a major
improvement in the visualization of transposable element (TE) family
data through the introduction of the Dfam TE Browser. The TE Browser is
a genome-browser-like interface that integrates seed alignments, protein
features, and family relationship data previously presented through
separate visualizations. Built using the IGV.js visualization framework
and inspired in part by the UCSC Repeat Browser, the TE Browser supports
dynamic zooming, scale-aware rendering, and publication-quality image
export.
The release also expands relationship visualizations to all Dfam
families and introduces new tandem repeat and self-alignment tracks,
while the standalone DfamTEBrowser project now allows researchers to
generate similar visualizations for custom TE libraries locally.

5b. DESCRIPTION OF CHANGES FROM RELEASE 3.8 TO 3.9

Due to the recent interest and submission of libraries for Drosophila
species, this release primarily focused on a thorough curation of the
Dfam Drosophila melanogaster library. Special attention was given to
reconciling families between RepBase, Flybase, and the Berkeley
Drosophila Genome Project (BDGP) with new submissions from Gabriel Rech,
Josefa Gonzales, and Sara Signor.

This release also incorporates de novo libraries from two major
initiatives: the Vertebrate Genomes Project and a eukaryote-wide study
led by Landen Gozashti. Additionally, numerous individual genome and
family submissions from the research community have been included.

Library Submissions:

  - An updated Drosophila melanogaster library representing detailed
    curation of new and existing families. This is the combined effort
    of Arian Smit, Clement Goubert, Gabriel Rech, Josefa Gonzales and
    Sara Signor.
  - 515 Eukaryote libraries provided by Landen Gozashti
  - 153 Vertebrate libraries provided by Hiram Clawson (UCSC, VGP)
  - Several highly curated primate families submitted by Martin Frith
  - Silverleaf Whitefly (Bemisia tabaci) library submitted by
    Juan Paolo Sicat
  - Atlantic Salmon (Salmo salar) partial library submitted by
    Oystein Monsen
  - Waterlily Leafcutter Moth (Elophila obliteralis) submitted by
    Jacqueline Heckenhauer
  - The Gulf toadfish (Opsanus beta) submitted by Nicholas Kron
  - Castor Oil Plant (Ricinus communis) submitted by Kong Lin
  - Ascomycete Fungi (Zymoseptoria tritici) submitted by Tobias Baril
  - Oribatid Soil Mite (Oppia nitens) submitted by Derek Smith
  - Tanaka's Snailfish (Liparis tanakae) submitted by Nathan Rives
  - SINEs from 20 Bivalve species submitted by Jacopo Martelossi
  - Bengal Slow Loris (Nycticebus bengalensis) submitted by
    Charles Michie

6. DESCRIPTION OF RELEASE FILES

relnotes.txt  - This file.
userman.txt   - A full description of Dfam data formats.

families/
  Dfam-#.hmm.gz               - Dfam HMMs in an HMM library, searchable
                                with the nhmmer program. Due to the large
                                size, the file has been split into 10
                                files for easier downloading.
  Dfam-1.embl.gz              - Dfam consensi in an EMBL library,
                                searchable with RepeatMasker 4.1.0 and
                                earlier.
  Dfam-curated_only-1.hmm.gz  - Dfam HMMs for curated (DF accessions
                                only) families only.
  Dfam-curated_only-1.embl.gz - Dfam consensi for curated (DF accessions
                                only) families in Dfam in EMBL format.

families/FamDB/
  dfam40_*.h5.gz              - FamDB H5 partitions for use with the
                                FamDB tool
                                (https://github.com/Dfam-consortium/FamDB),
                                and with RepeatMasker. This format stores
                                the family HMMs, consensus sequences, and
                                related taxonomy data in one file format,
                                broken up by curation status, model type,
                                and taxonomic groups.

infrastructure/
  class_ns.tsv                - Release specific TE class names, and
  name_ns.tsv                   family names for use with the
                                Dfam-curator tools.
  TEClasses.tsv               - Downloadable TE class names, and
                                descriptions for use with spreadsheet
                                programs.

apidocs/                      - Dfam REST API documentation (HTML)

7. DESCRIPTION OF FIELDS

See userman.txt for a more detailed description of each field.

Compulsory fields:
------------------
  AC   Accession number:       Accession number in form DFxxxxxxx.
  ID   Identification:         One word name for entry.
  DE   Definition:             Short description of entry.
  AU   Author:                 Authors of the entry.
  SE   Source of seed:         The source suggesting the seed members
                               belong to one entry.
  GA   Gathering method:       Score used for sequences within the clade
                               specified by MS.
  TC   Trusted Cutoff:         Score used for sequences outside the clade
                               specified by MS.
  NC   Noise Cutoff:           Smaller cutoff than GA; not used in Dfam.
  FR   False Discovery Rate:   Target FDR used to set GA.
  BM   Build method
  SM   Internal search method
  MS   Model specificity:      TaxID and TaxName, based on NCBI taxonomy.
  CT   Classification tags:    Repeat Type, Class, and Superfamily.
  SQ   Sequence:               Number of sequences in alignment.
  //                           End of alignment.

Optional fields:
----------------
  DC   Database Comment:       Comment about database reference.
  DR   Database Reference:     Reference to external database.
  RC   Reference Comment:      Comment about literature reference.
  RN   Reference Number:       Reference Number.
  RM   Reference Medline:      Eight digit medline UI number.
  RT   Reference Title:        Reference Title.
  RA   Reference Author:       Reference Author.
  RL   Reference Location:     Journal location.
  PI   Previous identifier:    Record of all previous ID lines.
  CC   Comment:                Comments.
  WK   Wikipedia Reference:    Reference to Wikipedia.
  SN   Synonym:                A widely accepted alternative name for the
                               model.
  CN   Classification Note:    A free text comment about the model
                               classification.

8. REFERENCES

  1. The Dfam Community Resource of Transposable Element Families,
     Sequence Models, and Genome Annotations.
     Storer, J., Hubley, R., Rosen, J., Wheeler, T., & Smit, A. F.
     Mobile DNA. 2021; 12: 2.
     https://doi.org/10.1186/s13100-020-00230-y

  2. The Dfam Database of Repetitive DNA Families
     Robert Hubley, Robert D. Finn, Jody Clements, Sean R. Eddy,
     Thomas A. Jones, Weidong Bao, Arian F.A. Smit, Travis J. Wheeler
     Nucleic acids research 44.D1 (2016): D81-D89.

  3. Dfam: a Database of Repetitive DNA Based on Profile Hidden Markov
     Models
     Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J,
     Smit AFA, Finn RD
     Nucl. Acids Res. (2013) Database Issue 41:D70-82.
     doi: 10.1093/nar/gks1265

  4. Realistic artificial DNA sequences as negative controls for
     computational genomics.
     Caballero J, Smit AF, Hood L, Glusman G.
     Nucl. Acids Res. 2014
     doi: 10.1093/nar/gku356

9. THE DFAM CONSORTIUM

Dfam is maintained by a consortium of researchers. You can contact the
Dfam consortium at:

    help@dfam.org

The current/former members of the Dfam consortium are:

  Arian F. A. Smit, Robert Hubley, Anthony Gray:
      Institute for Systems Biology, USA
  Jessica Storer:
      University of Connecticut, USA
  Travis J. Wheeler, Clement Goubert, Jack Roddy:
      University of Arizona, USA
  Jeb Rosen:
      OneRail, USA
  Robert D. Finn:
      EBI, UK
  Jody Clements:
      Janelia Farm Research Campus, USA
  Sean R. Eddy, Thomas A. Jones:
      Harvard University, USA
  Jerzy Jurka:
      Genetic Information Research Institute, USA

10. ACKNOWLEDGEMENTS

A.F.A.S., R.M.H., A.G., J.R. and T.J.W. were supported by a grant from
the National Institutes of Health (NHGRI grant #U24-HG010136). R.D.F.,
J.C., S.R.E., T.A.J., and T.J.W. received institutional support from
HHMI Janelia Farm Research Campus. J.J. was supported by grants from the
National Library of Medicine, National Institutes of Health
(P41LM006252-12). A.F.A.S. and R.H. were supported by a grant from the
National Institutes of Health (R01 HG002939).

11. COPYRIGHT NOTICE

Dfam - A database of conserved DNA element alignments and HMMs
Copyright (C) 2015-2026 The Dfam consortium.

This database is free; you can redistribute it and/or modify it as you
wish, under the terms of the CC0 1.0 license, a 'no copyright' license:

The Dfam consortium has dedicated the work to the public domain, waiving
all rights to the work worldwide under copyright law, including all
related and neighboring rights, to the extent allowed by law.

You can copy, modify, distribute and perform the work, even for
commercial purposes, all without asking permission. See Other
Information below.

Other Information

  o In no way are the patent or trademark rights of any person affected
    by CC0, nor are the rights that other persons may have in the work
    or in how the work is used, such as publicity or privacy rights.

  o Unless expressly stated otherwise, the Dfam consortium makes no
    warranties about the work, and disclaims liability for all uses of
    the work, to the fullest extent permitted by applicable law.

  o When using or citing the work, you should not imply endorsement by
    the Dfam consortium.

You may also obtain a copy of the CC0 license here:

    http://creativecommons.org/publicdomain/zero/1.0/legalcode

___________________
The Dfam Consortium
2026