 ____   __
|  _ \ / _| __ _ _ __ ___
/\| | | |_ / _` | '_ ` _ \
\/|_| |  _| (_| | | | | | |
/\___/|_|  \__,_|_| |_| |_|

The Dfam community resource of transposable element families,
sequence models, and genome annotations.

RELEASE 3.8
-------------------------------------------------------------------------------

1. INTRODUCTION

Dfam is a collection of conserved DNA element sequence alignments, consensus
sequences, hidden Markov models (HMMs) and complete genome annotations. Dfam
has full, curated libraries for five model organisms: Human, Mouse, Zebrafish,
Worm and Fly, as well as curated lineage-specific libraries for 242 mammals and
a growing number of newly sequenced species. In addition, Dfam now stores
uncurated (de-novo produced) families for over 2000 additional species.

2. LOCATIONS

Dfam is available on the web at:

  https://dfam.org/

3. STATISTICS

Dfam 3.8 consists of 3.6 million families across 2437 taxa. Dfam families
include retrotransposons, DNA transposons, interspersed repeats of unknown
origin, and a number of non-TE entries used to annotate satellites or to avoid
annotating noncoding RNA genes as TEs.
The distribution of these constituent family types in each annotated species
is given below:

Curated Families (DF Records)
-----------------------------
Taxa                                                   Retrotransposons  DNA Transposons  Other
-------                                                ----------------  ---------------  -----
Mammalia                                               8856              1744             178
Sauropsida [Reptiles and Birds]                        179               16               7
Actinopterygii [Bony Fishes]                           1503              904              82
Nematoda                                               1114              395              1750
Insecta                                                1707              1066             3861
Myxogastri [Slime Molds]                               3                 0                0
Echinodermata [Starfishes, Sea Urchins/Cucumbers]      2                 0                0
Hemichordata [Acorn Worms, etc]                        4                 0                0
Cyclostomata [Jawless Vertebrates]                     1                 0                0
Panarthropoda [Arthropods, Velvet Worms, Water Bears]  11                0                0
Spiralia [Segmented Worms, Arrow Worms, Mollusks]      863               762              0
Cnidaria [Sea Anemones and Corals]                     11                0                0
Ctenophora [Comb Jellies]                              2                 0                0

Uncurated Families (DR Records)
-------------------------------
Taxa                                                   Retrotransposons  DNA Transposons  Other
-------                                                ----------------  ---------------  -----
Mammalia                                               174347            33995            21836
Sauropsida [Reptiles and Birds]                        182974            62878            105226
Amphibia                                               11683             9462             25570
Actinopterygii [Bony Fishes]                           168120            321399           278811
Chondrichthyes [Cartilaginous Fishes]                  6785              934              4772
Petromyzontiformes [Lampreys]                          2876              997              1778
Viridiplantae [Green Plants]                           63188             41168            72253
Echinodermata [Starfishes, Sea Urchins/Cucumbers]      1896              929              6429
Panarthropoda [Arthropods, Velvet Worms, Water Bears]  5197              5440             19036
Spiralia [Segmented Worms, Arrow Worms, Mollusks]      9284              6547             27997
Cnidaria [Sea Anemones and Corals]                     1495              2127             4064

4. CONSTRUCTION OF DFAM

Dfam utilizes public genome assemblies and species-specific GARLIC [4]
artificial benchmark sequences for each genome. Sequence alignments for the
five model organisms were built using RepBase consensus sequences
(http://www.girinst.org/repbase/) in a collaboration with GIRI. For these
families the seed alignments were generated using RepeatMasker alignments of
the consensus sequences against the current UCSC assemblies
(http://genome.ucsc.edu/ - hg38, ce10, dm6, mm10 and danRer10).
For each family, annotated instances were transitively aligned based on mutual
alignment to the Repbase consensus sequence. Hidden Markov models (HMMs) were
constructed from the sequence alignment using the HMMER3 tool hmmbuild, and
each model was then searched against Dfamseq using the HMMER3 tool nhmmer, with
hit metadata (sequence location, score, etc.) captured for distribution.

As RepBase is a closed database, newer families in Dfam are based on seed
alignments generated using additional methods, including de-novo tools such as
RepeatModeler and REPET, subfamily detection programs such as Coseg, and manual
clustering and alignment of genomic sequences.

5a. DESCRIPTION OF CHANGES FROM RELEASE 3.7 TO 3.8

The development of Dfam 3.8 was focused on two priorities: inclusion of a large
set of community data submissions, and a refresh of the technology underpinning
the website, API, and offline data formats.

Library Submissions:

  - 71 plant genome libraries provided by Alexandre Paschoal
  - Walnut (Juglans mandshurica) library submitted by Chen Yidan
  - Coffee (Coffea eugenioides) submitted by Romain Guyot
  - Septoria linocola (fungal plant pathogen) library submitted by
    Adeline Simon
  - Deer Mouse (Peromyscus maniculatus) library submitted by Lande Gozashti
  - Eastern hap (Astatotilapia calliptera) library submitted by Pio Sierra
  - Stalk-eyed Fly (Teleopsis dalmanni) library submitted by
    Josephine Reinhardt
  - Libraries for 27 clam species submitted by Andrea Luchetti
  - 2 nematode libraries (Heligmosomoides bakeri and Heligmosomoides
    polygyrus) submitted by Isaac Martinez-Ugalde
  - Plasmodium falciparum (Protozoan parasite) library submitted by
    Samuel Ortion
  - Other updates and additions: HERV17 and related families by Martin Frith
    and Arian Smit

Accession Length Changes:

  * Due to the fast growth of Dfam, we have found it necessary to increase the
    Dfam accession number from 7 digits to 9 digits.
    Existing accessions have been converted by prefixing the existing
    accession number with '00', e.g. DF0000001 becomes DF000000001.

  * The API, website, and export files (FamDB, HMMs, EMBL etc.) have all been
    updated to the new accession format.

Partitioned Data Export:

  * The FamDB HDF5 database format now supports database partitioning by
    taxonomic groups. This allows users to download only the portion(s) of
    Dfam that they need to conduct their work while still providing all the
    features of the famdb.py query tool. At a minimum the root partition
    (partition 0) must be downloaded; any number of additional partitions may
    also be present. The files are available for download from
    http://dfam.org/releases/Dfam_3.8/families/FamDB with the following
    taxonomic coverage:

    Partition 0 [dfam38_full.0.h5]: root
        Mammalia, Amoebozoa, Bacteria, Choanoflagellata, Rhodophyta,
        Haptista, Metamonada, Fungi, Sar, Placozoa, Ctenophora, Filasterea,
        Spiralia, Discoba, Cnidaria, Porifera, Viruses

    Partition 1 [dfam38_full.1.h5]: Obtectomera

    Partition 2 [dfam38_full.2.h5]: Euteleosteomorpha

    Partition 3 [dfam38_full.3.h5]: Sarcopterygii
        Sauropsida, Coelacanthimorpha, Amphibia, Dipnomorpha

    Partition 4 [dfam38_full.4.h5]: Diptera

    Partition 5 [dfam38_full.5.h5]: Viridiplantae

    Partition 6 [dfam38_full.6.h5]: Deuterostomia
        Chondrichthyes, Hemichordata, Cladistia, Holostei, Tunicata,
        Cephalochordata, Cyclostomata, Osteoglossocephala, Otomorpha,
        Elopocephalai, Echinodermata, Chondrostei

    Partition 7 [dfam38_full.7.h5]: Hymenoptera

    Partition 8 [dfam38_full.8.h5]: Ecdysozoa
        Nematoda, Gelechioidea, Yponomeutoidea, Incurvarioidea, Chelicerata,
        Collembola, Polyneoptera, Tineoidea, Apoditrysia, Monocondylia,
        Strepsiptera, Palaeoptera, Neuropterida, Crustacea, Coleoptera,
        Siphonaptera, Trichoptera, Paraneoptera, Myriapoda, Scalidophora

    This new format is compatible with the FamDB tools v1.0.1 and
    RepeatMasker 4.1.6.
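
As a rough illustration of the two data-format changes described above, the
Python sketch below widens a 7-digit accession to the 9-digit form and lists
the partition files needed for a chosen set of partitions. The helper names
(upgrade_accession, partition_urls) are hypothetical and are not part of Dfam
or the FamDB tools. The sketch also assumes that DR accessions are widened the
same way as DF accessions, that an optional ".version" suffix is carried over
unchanged, and that each partition file sits directly under the FamDB
directory named above; none of these details are stated in these notes.

```python
import re

# Download directory quoted in the release notes above.
FAMDB_BASE_URL = "http://dfam.org/releases/Dfam_3.8/families/FamDB"

# Partition number -> root taxon, copied from the partition list above.
PARTITION_ROOTS = {
    0: "root",
    1: "Obtectomera",
    2: "Euteleosteomorpha",
    3: "Sarcopterygii",
    4: "Diptera",
    5: "Viridiplantae",
    6: "Deuterostomia",
    7: "Hymenoptera",
    8: "Ecdysozoa",
}

# Two-letter prefix, 7 digits, optional ".version" suffix.
# Assumption: DR (uncurated) accessions widen the same way as DF ones.
_LEGACY = re.compile(r"^(D[FR])(\d{7})(\.\d+)?$")
_CURRENT = re.compile(r"^D[FR]\d{9}(\.\d+)?$")


def upgrade_accession(accession):
    """Return the 9-digit form of an accession by prefixing '00'."""
    if _CURRENT.match(accession):
        return accession  # already in the new format
    match = _LEGACY.match(accession)
    if match is None:
        raise ValueError(f"not a Dfam accession: {accession!r}")
    prefix, digits, version = match.groups()
    return f"{prefix}00{digits}{version or ''}"


def partition_urls(partitions):
    """List download URLs for the root partition plus the requested ones."""
    # Partition 0 (root) is always required, so add it unconditionally.
    wanted = sorted({0, *partitions})
    unknown = [p for p in wanted if p not in PARTITION_ROOTS]
    if unknown:
        raise ValueError(f"unknown partition(s): {unknown}")
    # Assumption: files are served directly under FAMDB_BASE_URL.
    return [f"{FAMDB_BASE_URL}/dfam38_full.{p}.h5" for p in wanted]


if __name__ == "__main__":
    print(upgrade_accession("DF0000001"))
    for url in partition_urls([5]):  # root plus Viridiplantae
        print(url)
```

Always adding partition 0 mirrors the requirement that the root partition be
present alongside any additional partitions.
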

Dfam API and Website Improvements:

  * Several API endpoints have had their response times improved, most
    notably the `/families` endpoint. This endpoint is often queried for
    large amounts of data, so efficiency of the code has been a priority. In
    this release, we have added a new caching system to handle large queries
    that would otherwise have timed out. In these cases the user may wait for
    the request to complete or come back at a later time to retrieve the
    results of the search.

  * The tech stack for Dfam.org was updated. The frontend was updated from
    Angular 11 to Angular 16 with the accompanying style changes. The API was
    migrated from Swagger to OpenAPI 3.0.

5b. DESCRIPTION OF CHANGES FROM RELEASE 3.6 TO 3.7

Dfam 3.7 represents the largest expansion of Dfam to date, doubling the total
families and species covered. In addition, we received and processed many
submissions from the community, outlined below:

Curated Family Submissions:

  * Libraries for six bat species: Molossus molossus, Myotis myotis,
    Phyllostomus discolor, Pipistrellus kuhlii, Rousettus aegyptiacus, and
    Rhinolophus ferrumequinum provided by David Ray & Kevin Sullivan.

    Jebb, D., Huang, Z., Pippel, M. et al. Six reference-quality genomes
    reveal evolution of bat adaptations. Nature 583, 578–584 (2020).
    https://doi.org/10.1038/s41586-020-2486-3

  * Mosquito (Anopheles coluzzii) library curated from seven individuals of
    the same species. Provided by Ano Carlos Vargas-Chavez and
    Josefa Gonzales.

    Vargas-Chavez C, Longo Pendy NM, Nsango SE, Aguilera L, Ayala D,
    González J. Transposable element variants and their potential adaptive
    impact in urban populations of the malaria vector Anopheles coluzzii.
    Genome Res. 2022 Jan;32(1):189-202. doi: 10.1101/gr.275761.121.
    Epub 2021 Dec 29. PMID: 34965939

  * LTR7 subfamilies: Three updated families and fifteen new LTR7 HERVH
    endogenous retroviruses. Provided by Thomas A Carter and Cedric Feschotte
    with additional curation by Arian Smit.

    Carter Thomas A, Singh Manvendra, Dumbovic Gabrijela, Chobirko Jason D,
    Rinn John L, Feschotte Cedric: Mosaic cis-regulatory evolution drives
    transcriptional partitioning of HERVH endogenous retrovirus in the human
    embryo. PMID:35179489

Raw Family Submissions:

  * RepeatModeler generated libraries for 771 new assemblies provided by
    Fergal Martin and Denye Ogeh (European Bioinformatics Institute).

  * Twenty Orders of Insects: 601 assemblies provided by John S. Sproul,
    Scott Hotaling, Jacqueline Heckenhauer, Ashlyn Powell,
    Amanda M. Larracuente, Joanna L. Kelley, Steffen U. Pauls, and
    Paul B. Frandsen.

    Sproul, J.S., Hotaling, S., Heckenhauer, J., Powell, A., Larracuente,
    A.M., Kelley, J.L., Pauls, S.U., Frandsen, P.B. (2022). Repetitive
    elements in the era of biodiversity genomics: insights from 600+ insect
    genomes. bioRxiv 2022.06.02.494618;
    doi: https://doi.org/10.1101/2022.06.02.494618

  * Three Gesneriaceae Species: Primulina huaijiensis (cave plant),
    Streptocarpus rexii (Cape Primrose), and Dorcoceras hygrometricum
    provided by Kanae Nishii.

    Nishii, K., Hart, M., Kelso, N., Barber, S., Chen, Y. Y., Thomson, M.,
    Trivedi, U., Twyford, A. D., & Möller, M. (2022). The first genome for
    the Cape Primrose Streptocarpus rexii (Gesneriaceae), a model plant for
    studying meristem-driven shoot diversity. Plant direct, 6(4), e388.
    https://doi.org/10.1002/pld3.388

  * Taro (Colocasia esculenta) provided by Renee Bellinger.

    Bellinger, M. R., Paudel, R., Starnes, S., Kambic, L., Kantar, M. B.,
    Wolfgruber, T., Lamour, K., Geib, S., Sim, S., Miyasaka, S. C.,
    Helmkampf, M., & Shintaku, M. (2020). Taro Genome Assembly and Linkage
    Map Reveal QTLs for Resistance to Taro Leaf Blight. G3 (Bethesda, Md.),
    10(8), 2763–2775. https://doi.org/10.1534/g3.120.401367

  * Water flea (Daphnia pulicaria) provided by Matthew Wersebe.

    Wersebe, M. J., Sherman, R. E., Jeyasingh, P. D., & Weider, L. J. (2022).
    The roles of recombination and selection in shaping genomic divergence in
    an incipient ecological species complex.
    Molecular ecology, 10.1111/mec.16383. Advance online publication.
    https://doi.org/10.1111/mec.16383

Relationship Data:

  The sequence similarity between families in Dfam has historically been
  updated between releases and displayed in the Family Relationships section
  of the individual family pages. The massive growth of Dfam has made this
  analysis intractable with our current resources. We are not updating this
  information at this time; however, we will maintain the previous results
  while we work out a new system to restore this capability.

FamDB:

  The latest FamDB (*.h5) files produced for Dfam 3.7 use a new format of the
  FamDB standard (v0.5). This change improves performance for generating
  these large files. These files are compatible with most versions of
  RepeatMasker; however, versions prior to RepeatMasker 4.1.5 require
  updating the FamDB tool ("famdb.py") shipped with the release. For more
  information please visit the FamDB site:
  https://github.com/Dfam-consortium/FamDB

Database Exports:

  As the database has become too large to reasonably download over the
  network, we are no longer providing the mysqldump output format with each
  new release. It may still be requested by mail by providing a transfer
  device to us. For more information please contact help@dfam.org.

6. DESCRIPTION OF RELEASE FILES

relnotes.txt                - This file.
userman.txt                 - A fuller description of Dfam fields.

families/
  Dfam.hmm.gz               - Dfam HMMs in an HMM library, searchable with
                              the nhmmer program.
  Dfam.embl.gz              - Dfam consensi in an EMBL library, usable with
                              RepeatMasker 4.1.0 and earlier.
  Dfam.h5.gz                - Dfam HMMs and consensi in the FamDB file
                              format, for RepeatMasker 4.1.1 and later.
  Dfam.curatedonly.hmm.gz   - Dfam HMMs for curated (DF accessions only)
                              families.
  Dfam.curatedonly.embl.gz  - Dfam consensi for curated (DF accessions only)
                              families in EMBL format.
  Dfam.curatedonly.h5.gz    - Dfam HMMs and consensi in FamDB format for
                              curated (DF accessions only) families.

annotations/<assembly>/
  <assembly>.hits           - TSV list of all matches found in the given
                              assembly that score above the GA threshold,
                              e.g. hg38.hits.gz
  <assembly>.nrph.hits      - TSV list of all non-redundant matches found in
                              the given assembly that score above the GA
                              threshold, e.g. hg38.nrph.hits.gz

infrastructure/
  dfamscan.pl               - A basic search tool to query Dfam families
                              against an input sequence.
  schema/                   - Directory with printable ERD diagrams for Dfam.
  apidocs/                  - Dfam REST API documentation (HTML).

7. DESCRIPTION OF FIELDS

See userman.txt for a more detailed description of each field.

Compulsory fields:
------------------
AC   Accession number:      Accession number in form DFxxxxxxxxx.
ID   Identification:        One word name for entry.
DE   Definition:            Short description of entry.
AU   Author:                Authors of the entry.
SE   Source of seed:        The source suggesting the seed members belong to
                            one entry.
GA   Gathering method:      Score used for sequences within the clade
                            specified by MS.
TC   Trusted Cutoff:        Score used for sequences outside the clade
                            specified by MS.
NC   Noise Cutoff:          Smaller cutoff than GA; not used in Dfam.
FR   False Discovery Rate:  Target FDR used to set GA.
BM   Build method
SM   Internal search method
MS   Model specificity:     TaxID and TaxName, based on NCBI taxonomy.
CT   Classification tags:   Repeat Type, Class, and Superfamily.
SQ   Sequence:              Number of sequences in alignment.
//                          End of alignment.

Optional fields:
----------------
DC   Database Comment:      Comment about database reference.
DR   Database Reference:    Reference to external database.
RC   Reference Comment:     Comment about literature reference.
RN   Reference Number:      Reference Number.
RM   Reference Medline:     Eight-digit Medline UI number.
RT   Reference Title:       Reference Title.
RA   Reference Author:      Reference Author.
RL   Reference Location:    Journal location.
PI   Previous identifier:   Record of all previous ID lines.
CC   Comment:               Comments.
WK   Wikipedia Reference:   Reference to Wikipedia.
SN   Synonym:               A widely accepted alternative name for the model.
CN   Classification Note:   A free text comment about the model
                            classification.

8. REFERENCES

1. The Dfam Community Resource of Transposable Element Families, Sequence
   Models, and Genome Annotations.
   Storer, J., Hubley, R., Rosen, J., Wheeler, T., & Smit, A. F.
   Mobile DNA. 2021; 12: 2. https://doi.org/10.1186/s13100-020-00230-y

2. The Dfam Database of Repetitive DNA Families.
   Robert Hubley, Robert D. Finn, Jody Clements, Sean R. Eddy,
   Thomas A. Jones, Weidong Bao, Arian F.A. Smit, Travis J. Wheeler
   Nucleic acids research 44.D1 (2016): D81-D89.

3. Dfam: a Database of Repetitive DNA Based on Profile Hidden Markov Models.
   Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, Smit AFA,
   Finn RD
   Nucl. Acids Res. (2013) Database Issue 41:D70-82.
   doi: 10.1093/nar/gks1265

4. Realistic artificial DNA sequences as negative controls for computational
   genomics.
   Caballero J, Smit AF, Hood L, Glusman G.
   Nucl. Acids Res. 2014 doi: 10.1093/nar/gku356

9. THE DFAM CONSORTIUM

Dfam is maintained by a consortium of researchers. You can contact the Dfam
consortium at:

  help@dfam.org

The current/past members of the Dfam consortium are:

  Arian F. A. Smit, Robert Hubley, Anthony Gray:
      Institute for Systems Biology, USA
  Jessica Storer:
      University of Connecticut, USA
  Travis J. Wheeler, Clement Goubert, Jack Roddy:
      University of Arizona, USA
  Jeb Rosen:
      OneRail, USA
  Robert D. Finn:
      EBI, UK
  Jody Clements:
      Janelia Farm Research Campus, USA
  Sean R. Eddy, Thomas A. Jones:
      Harvard University, USA
  Jerzy Jurka:
      Genetic Information Research Institute, USA

11. ACKNOWLEDGEMENTS

A.F.A.S., R.M.H., A.G., J.R. and T.J.W. were supported by a grant from the
National Institutes of Health (NHGRI grant #U24-HG010136). R.D.F., J.C.,
S.R.E., T.A.J., and T.J.W. received institutional support from HHMI Janelia
Farm Research Campus. J.J. was supported by grants from the National Library
of Medicine, National Institutes of Health (P41LM006252-12).
A.F.A.S. and R.H. were supported by a grant from the National Institutes of
Health (RO1 HG002939).

12. COPYRIGHT NOTICE

    Dfam - A database of conserved DNA element alignments and HMMs
    Copyright (C) 2015-2023 The Dfam consortium.

    This database is free; you can redistribute it and/or modify it
    as you wish, under the terms of the CC0 1.0 license, a
    'no copyright' license:

    The Dfam consortium has dedicated the work to the public domain, waiving
    all rights to the work worldwide under copyright law, including all
    related and neighboring rights, to the extent allowed by law.

    You can copy, modify, distribute and perform the work, even for
    commercial purposes, all without asking permission.
    See Other Information below.

    Other Information

    o In no way are the patent or trademark rights of any person affected by
      CC0, nor are the rights that other persons may have in the work or in
      how the work is used, such as publicity or privacy rights.
    o Unless expressly stated otherwise, the Dfam consortium makes no
      warranties about the work, and disclaims liability for all uses of the
      work, to the fullest extent permitted by applicable law.
    o When using or citing the work, you should not imply endorsement by the
      Dfam consortium.

    You may also obtain a copy of the CC0 license here:
    http://creativecommons.org/publicdomain/zero/1.0/legalcode

___________________
The Dfam Consortium
2023