- Research
- Open access
- Published:
Chromosome-scale scaffolds of the fungus gnat genome reveal multi-Mb-scale chromosome-folding interactions, centromeric enrichments of retrotransposons, and candidate telomere sequences
BMC Genomics volume 26, Article number: 443 (2025)
Abstract
Background
The lower Dipteran fungus gnat, Bradysia (aka Sciara) coprophila, has compelling chromosome biology. Paternal chromosomes are eliminated during male meiosis I and both maternal X sister chromatids are retained in male meiosis II. Embryos start with three copies of the X chromosome, but 1–2 copies are eliminated from somatic cells as part of sex determination, and one is eliminated in the germline to restore diploidy. In addition, there is gene amplification in larval polytene chromosomes, and the X polytene chromosome folds back on itself mediated by extremely long-range interactions between three loci. These developmentally normal events present opportunities to study chromosome behaviors that are unusual in other systems. Moreover, little is known about the centromeric and telomeric sequences of lower Dipterans in general, and there are recent claims of horizontally-transferred genes in fungus gnats. Overall, there is a pressing need to learn more about the fungus gnat chromosome sequences.
Results
We produced the first chromosome-scale models of the X and autosomal chromosomes where each somatic chromosome is represented by a single scaffold. Extensive analysis supports the chromosome identity and structural accuracy of the scaffolds, demonstrating they are co-linear with historical polytene maps, consistent with evolutionary expectations, and have accurate centromere positions, chromosome lengths, and copy numbers. The positions of alleged horizontally-transferred genes in the nuclear chromosomes were broadly confirmed by genomic analyses of the chromosome scaffolds using Hi-C and single-molecule long-read datasets. The chromosomal context of repeats shows family-specific biases, such as retrotransposons correlated with the centromeres. Moreover, scaffold termini were enriched with arrays of retrotransposon-related sequence as well as nucleosome-length (~ 175 bp) satellite repeats. Finally, the Hi-C data captured Mb-scale physical interactions on the X chromosome that are seen in polytene spreads, and we characterize these interesting “fold-back regions” at the sequence level for the first time.
Conclusions
The chromosome scaffolds were shown to be of exceptional quality, including loci harboring horizontally-transferred genes. Repeat analyses demonstrate family-specific biases and telomere repeat candidates. Hi-C analyses revealed the sequences of ultra-long-range interactions on the X chromosome. The chromosome-scale scaffolds pave the way for further studies of the unusual chromosome movements in Bradysia coprophila.
Background
The lower Dipteran dark-winged fungus gnat, Bradysia (Sciara) coprophila, is an important model system for studying chromosome biology due to its dynamic genome. Most notable is its chromosome cycle (Fig. 1 A-B). Rather than having either a diploid or haploid set of the same chromosomes (X, II, III, IV, and L) in every cell, the chromosome constitution varies based on whether it is somatic or germline, male or female, and early embryo or late embryo [1, 2]. With respect to somatic or germline, there are chromosomes referred to as the L chromosomes that are only found in the germline [1, 2]. Regarding male or female, half of all females, but no males, contain a variant of the X chromosome called the X’ (X prime) [1, 2]. Concerning embryo stage, embryos start out with three copies of the X chromosome, two of which are paternally derived, and with a variable number of the germline-limited L chromosomes, but later embryos have only 1–2 X chromosomes and no L chromosomes in somatic cells. The L chromosomes are eliminated from all somatic nuclei not set aside for the germline in the 5 th- 6 th nuclear division [1, 2]. The difference in X chromosome copy number arises because embryos from XX mothers are destined to be male and eliminate two paternal X chromosomes from somatic nuclei in the 7 th- 9 th nuclear division whereas embryos produced by X’X mothers are fated to be female and eliminate only one paternal X [1, 2]. Accordingly, sex is determined by the chromosome constitution of the mother in this species, which lacks a Y chromosome. Later in development, both the male and female germline restore diploidy for the X by eliminating one paternal copy, and also eliminate one or more L chromosomes to prevent their accumulation [1, 2]. Thereafter, the meiotic events in oogenesis appear to proceed normally whereas those in spermatogenesis do not. In meiosis I of spermatogenesis, all paternal somatic chromosomes are eliminated in a bud of cytoplasm while the L chromosomes appear to escape this imprinting effect and are retained with the maternal set of somatic chromosomes at the single pole of a naturally occurring monopolar spindle [1, 2]. In meiosis II, both maternally-derived sister chromatids of the X chromosome undergo developmentally programmed nondisjunction and are retained at one pole of a bipolar spindle [1, 2]. Rather than four products, these unusual events lead to one product of male meiosis: a sperm that contains only maternally derived somatic chromosomes (X, II, III, and IV), that is diploid for X, haploid for the autosomes (II, III, and IV), and variable for L chromosomes. In addition to the unusual chromosome cycle, Bradysia coprophila larval salivary glands have highly polytene chromosomes with over 8000 copies in each nucleus [3], and developmentally-programmed gene amplification [4, 5]. Finally, there is mounting evidence that this genome may be a great model for studying horizontal gene transfer (HGT) in insects [6,7,8]. Overall, fungus gnats present many opportunities to study chromosome biology across development and evolution.
The fungus gnat chromosome cycle and chromosome-scale scaffolding flowchart. A The chromosome behaviors across the somatic cell lineage of males and females starting from the initial zygotic genome. In early embryogenesis, L chromosomes are eliminated first from all nuclei destined for the somatic cell lineage. Then either one or two paternal X chromosomes (Xp) are eliminated. Male-producing females produce offspring that eliminate two Xp and become males. Female-producing females contain a variant of the X chromosome, X’ (X prime), and produce offspring that eliminate one Xp and become male-producing and female-producing females depending on whether they inherit the maternal X or X’. The two types of females can be distinguished by whether they have wavy or straight wings. L* represents the presence of two L chromosomes, L1 and L2 [6]. The subscripts “M” and “P” indicate maternal and paternal origin, respectively. The reference genome, Bcop_v1, was made from male embryos, and represents the male somatic genome (highlighted in yellow) box. B The chromosome dynamics across the germline lineage for males and females. Male meiosis is non-Mendelian, and results in a single product through paternal genome elimination and maternal X-non-disjunction. Female meiosis is Mendelian. Purple chromosomes represent that their origin can be from any parent at random. C A flowchart and summary of updating the reference genome, Bcop_v1, with Hi-C. Bcop_v1 is a long read assembly, incorporating both PacBio and Nanopore data, that was scaffolded with optical maps from BioNano Genomics [9]. Bcop_v2 is the result of correcting and scaffolding Bcop_v1 with Hi-C data. The numbers in parentheses denotes how many contigs and scaffolds are in the given component of the assembly
Previously, we published a highly contiguous assembly of the somatic chromosomes (X, II, III, and IV) named Bcop_v1 [9]. We targeted the somatic genome in males (Fig. 1A) as it presented the lowest complexity version of the genome, lacking both the L and X’ chromosomes, which have since been the focus of other studies [6, 10]. Specifically, we sequenced genomic DNA from male embryos using long-read (PacBio RS II SMRT and Oxford Nanopore Technologies MinION) and short-read (Illumina) technologies [9]. As part of that process, we assembled the sequencing data ~ 100 different ways, and used many evaluations to determine the best assembly and polishing pipelines for our datasets [9]. Two assemblies with the best evaluation scores were scaffolded with optical maps from BioNano Genomics, then subject to further evaluations [9]. The chosen BioNano-scaffolded assembly, Bcop_v1 (NCBI GenBank GCA_014529535.1, WGS VSDI01), had a contig NG50 of ~ 2.4 Mb and scaffold NG50 of ~ 8.2 Mb (Table 1) [9]. Since, in Bradysia coprophila, late male embryos have a single copy of the X, but are diploid for the autosomes, it was possible to classify each contig as either X-linked or autosomal using sequencing depth [9]. Moreover, using the known chromosomal locations of a handful of sequences, we were able to anchor ~ 28–33% of the autosomal sequences into chromosomes II, III, and IV [9]. Overall, nearly 50% of the expected genome size was anchored into chromosomes based on read depth and known sequence locations [9]. Nevertheless, chromosome assignments are not available for the majority of contigs in Bcop_v1. Additionally, Bcop_v1 lacks any information regarding the ordering and orientation of the contigs along chromosome sequences. To better facilitate studies into gene amplification and the unusual chromosome dynamics in Bradysia coprophila (Fig. 1A-B), our goal was to develop better models of the somatic chromosome sequences. We approached this by using chromosome conformation capture with deep sequencing (Hi-C) [11] to scaffold the chromosomes as has been successfully done by others [12,13,14] (Fig. 1C). Specifically, Hi-C is powerful at putting chromosome sequences together from a collection of sub-chromosomal sequences because of three major attributes of in vivo chromosome interactions captured by Hi-C data: (i) the interactions arbitrarily yield Mb-scale and chromosome-scale information without the need for sequencing ultra-long DNA molecules, (ii) intra-chromosomal interactions are much more frequent than inter-chromosomal interactions, making it possible to use interaction frequencies to cluster contigs into ‘chromosome groups’, and (iii) interaction frequencies decay with distance, making it possible to use them to order and orient the contigs in a chromosome group in the order and orientations they most likely appear along a chromosome [11,12,13,14].
Overall, we present here the first chromosome-scale genome assembly for Bradysia coprophila, specifically reporting the first full-length scaffolds for the somatic chromosomes (X, II, III, IV). In addition, we lifted over the annotations from the former genome assembly, allowing studies in progress by multiple groups to seamlessly transition to the updated genome assembly. Through several approaches, we demonstrate the chromosome identity and high structural accuracy of the chromosome-scale scaffolds, including through inspecting Hi-C interaction frequencies, correlation with polytene maps, how repeats are distributed, read depth, evolutionary expectations, and special features of the X chromosome. Moreover, using multiple genomic datasets, we demonstrate that recently-discovered “alien” genes [7] in the gene annotation sets are genuine parts of the chromosome-scale scaffolds, ruling out the possibility that they are associated with assembly errors from contaminating DNA sources. The termini of the chromosome-scale scaffolds suggest that telomeres may be composed of Long Complex Terminal Tandem Repeat (LCTTR) sequences as seen in the non-biting midge Chironomus and the mosquito Anopheles gambiae [15, 16], but also arrays of retrotransposon-related sequences. In addition, the Hi-C maps show three loci, separated by > 10 Mb each, with high frequency interactions in vivo, potentially illuminating the loci of the three “fold-back regions” seen to physically interact in images of the polytene X chromosome. The Hi-C data, the chromosome-scale genome sequence, its gene annotations, and insights produced here will be immediately useful to the growing research community interested in the unique biology of Bradysia coprophila as well as the broader research community interested in Dipteran evolution and comparative genomics.
Results
Hi-C guided assembly correction and chromosome-scale scaffolding of the reference genome
To update the previous reference genome, Bcop_v1 [9] (described in the Introduction and Table 1), we generated Hi-C data from whole male pupae, which was used both (i) to break contigs at putatively misjoined regions (Fig. S1A-C) to produce a corrected input assembly (called “Bcop_v1_corrected”) and (ii) to scaffold the resulting corrected assembly to produce an updated assembly with chromosome-scale scaffolds called “Bcop_v2” (Fig. 1C and 2A). Of the 205 contigs in the input assembly, there were 22 contigs with 46 putative misjoins (examples in Supplemental Fig. S1B-C), and the resulting corrected assembly had 297 contigs (Fig. 1C). The majority of mis-join signals flanked a minority of the regions joined together previously by optical mapping (Supplemental Fig. S1). After Hi-C scaffolding and post-processing, ~ 297 Mb of the ~ 299 Mb input was placed into four primary chromosome-scale scaffolds, each spanning 58–71 Mb, and ~ 1.8 Mb remained in 52 unplaced contigs (Table 1; Fig. 1C). There were also 539 “associated contigs” summing to ~ 10.8 Mb from the input assembly held out during scaffolding and added back afterward, totaling 591 unplaced contigs (Fig. 1C). The four primary chromosome scaffolds make up ~ 96% of Bcop_v2 with lengths of 58.3, 70.5, 71.0, and 97.1 Mb (Tables 1 and 2). In Bcop_v2, gaps from the original BioNano scaffolding process (for Bcop_v1) ranged from 25 bp to ~ 662 kb with a gap N50 of ~ 101.5 kb. Hi-C scaffolding introduced 241 gaps, all arbitrarily set to 100 bp, 5 of which are on a short ~ 363 kb scaffold. In total, there are 367 Hi-C and/or optical map gaps across Bcop_v2, 360 of which are within the four chromosome-scale scaffolds.
Chromosome identity, orientation, and assessment of the chromosome-scale scaffolds. A Interaction frequencies are visualized across Bcop_v1_corrected (Bcop_v1 after Hi-C guided correction), which was the input to Hi-C scaffolding. B Interaction frequencies visualized across the final version of Bcop_v2, after polytene orientation and with associated/unplaced contigs. Squares are drawn around boundaries of the four chromosome-scale scaffolds. C-F Anchoring, polytene map orientation, scaffold structural verification, and loci of interest on the chromosome-scale scaffolds for chromosomes (C) X, (D) II, (E) III, and (F) IV. The chromosome identity and specificity of each scaffold, their consistent orientations with polytene maps, and structural verifications were obtained using 10 unique “anchor” sequences with known chromosomal addresses mapped previously using in situ hybridization [17,18,19,20,21,22,23,24,25,26,27,28,29]. Centromeric sequences identified the expected locations of centromeres [22, 30,31,32] on acrocentric (X, II, and III) and meta-centric (IV) chromosomes, which were further supported by the density of a probe sequence (F4 from ScRTE) shown previously to be pericentromeric in all four chromosomes [32]. The scaffold for X also shows the locations found for the fold-back regions and long paracentric inversion breakpoints on the X’. Each chromosome-scale scaffold had at least one known unique sequence location anchoring it to a specific chromosome. There was no conflicting evidence with regard to chromosome identity nor with regard to the order in which sub-chromosomal sequences mapped along the scaffolds. Polytene chromosome maps were reproduced from plates 1, 2 and 3 of Gabrusewycz-Garcia [31] with permission from Springer Nature under permission number: 5490830617309. G Chromosome scaffold sizes are consistent with chromosome length expectations from Crouse [30]. H Chromosome scaffolds are consistent with read depth expectations on the X versus autosomes in males and females. Overall, this evidence suggests that Hi-C accurately ordered and oriented the contigs into chromosome-scale scaffolds corresponding to the expected four somatic chromosomes
With regard to scaffolding accuracy, we interrogated ‘chromosome specificity’, the degree to which contigs in a given scaffold are likely to come from the same chromosome, and ‘structural accuracy’, the degree to which contigs are joined together in the correct order and orientation with respect to the chromosome. The Hi-C signal across Bcop_v2 shows that these metrics were optimized according to expected interaction frequency patterns (Fig. 2A, B, Supplemental Fig. S2). In Bcop_v2, (i) the chromosome-sized scaffolds define regions of much higher interaction frequencies, even far off the diagonal, compared to interactions with other scaffolds, (ii) the lowest interaction frequencies correspond to inter-scaffold contacts, (iii) the strongest interaction frequencies were along the diagonal within the chromosome scaffolds and had no disruptions between neighboring contigs, and (iv) off-diagonal interaction frequencies rapidly decayed with distance (Fig. 2B, Supplemental Fig. S2). Overall, the Hi-C signal is consistent with chromosome scaffolds that have both high chromosome specificity and high structural accuracy.
The chromosomal identities of the chromosome-scale scaffolds were obtained using “anchor” sequences for 10 known unique chromosomal locations [17,18,19,20,21,22,23,24,25,26,27,28,29] by mapping them to the scaffolds (Fig. 2C-F; Tables 2 and 3; see Methods). The structural accuracy was further tested by analyzing the order of the anchor sequences across the scaffolds (Fig. 2C-F). All anchor sequences from a given chromosome mapped to the same chromosome-scale scaffold, and there were no conflicts of chromosome identity among the multiple anchor sequences (Fig. 2C-F). When multiple anchors were present for a given chromosome, they all mapped in the expected order along the same scaffold (Fig. 2C-F). These results further confirm that (i) each chromosome-scale scaffold corresponds to one and only one chromosome, (ii) each scaffold corresponds to a different chromosome than the three other scaffolds, and (iii) the order of contigs along the chromosome-scale scaffolds is highly accurate.
Chromosome identities and structural accuracy were further supported by comparing known centromere locations in polytene chromosome maps [22, 30,31,32] to locations identified in the chromosome-scale scaffolds with known centromere-associated sequences (Fig. 2C-F) [32]. Specifically, chromosomes X, II, and III are acrocentric with centromeres very close to the beginning of these chromosomes at polytene zones X/1 A, II/1 C, and III/1 C. In agreement, the centromere mapped within the first few megabases of the corresponding 58–71 Mb scaffolds (Fig. 2C-F). In contrast, chromosome IV is metacentric with a centrally-located centromere at polytene zone IV/9B, where zones span from IV/1 A to IV/20 C. As expected, the centromere mapped to a medial region, at positions spanning ~ 44.7–45.0 Mb, of the 97.1 Mb chromosome IV scaffold (Fig. 2C-F). Importantly, centromeric positions were concordant in the order of all anchor sequence loci along the polytene maps (Fig. 2C-F). The positional information of the centromeres and other anchors enabled us to orient the chromosome scaffold sequences in the same direction as the locus numbers that have been assigned on polytene maps for all four chromosomes [31] (Fig. 2C-F).
The chromosome-scale scaffold lengths are consistent with prior chromosome length expectations. In early studies of these chromosomes by H.V. Crouse [30], chromosome II was predicted to be shortest, chromosome IV was established as the longest, and chromosomes X and III were expected to be of equivalent lengths. Crouse’s early expectations are confirmed by the chromosome-scale scaffolds (Table 2; Fig. 2G), of which II is shortest (58.3 Mb), IV is the longest (97.1 Mb), and X and III are approximately the same length (70.5 and 71.0 Mb, respectively). Moreover, the chromosome lengths reflected by the scaffolds are concordant with the expected sizes of each chromosome using the number of polytene sections assigned by Crouse [30] and the expected genome size from DNA content measurements [33] (Table 2). The total length (226 Mb) of the autosomal scaffolds (II, III, IV) was also concordant with these predictions (214–231 Mb) (Table 2). Note that these early chromosome length expectations [30] were later revised such that X and II were both shortest and of equivalent length [31], but this under-estimate of the X was anticipated by Crouse [30] and is now refuted by the chromosome-scale scaffolds that support Crouse’s earlier length expectations (Table 2; Fig. 2G).
The chromosome identity of the X chromosome scaffold and of the autosomal nature of the other scaffolds (II, III, and IV) are also confirmed by read depth analysis. Females are diploid for both the autosomes and the X chromosome. Males are diploid for the autosomes, but haploid for X. Thus, read depth for females along the X and autosomes should be equivalent whereas read depth along the X from males should be half that of autosomes. These read depth expectations are borne out by multiple genomic technologies (Fig. 2H), including paired-end Illumina reads for both types of adult female and adult males [10], long reads from Nanopore and PacBio for male embryos [9]; and BioNano Genomic optical maps from male pupae [9]. The diploid-level read depth in both males and females across the scaffolds for chromosomes II, III, and IV confirms that they are strictly composed of autosomal contig sequences (Fig. 2H). Similarly, the haploid-level read depth in all male samples, but diploid level in female samples, across the entirety of the X chromosome scaffold, confirms not only the chromosome identity of the scaffold as X, but that all the contigs within it are indeed X-linked (Fig. 2H).
Gene annotations for the chromosome-scale assembly (Bcop_v2)
The gene annotation sets generated for the previous version of this genome (Bcop_v1) have already been demonstrated to score highly in ‘completeness’ as determined by the number of Dipteran BUSCOs (97%) found within them. For this reason, we used a liftover procedure to precisely map their coordinates in the chromosome-scale assembly, Bcop_v2 (Table 3). The updated annotation showed consistent results when analyzing the resulting gene, transcript, and protein sequences (Table 4). For example, only 2 genes with 4 transcripts were not mapped to Bcop_v2, 99.87% of lifted-over genes had identical protein sequences for all transcript isoforms, and 99.91% had identical protein sequences for all or at least a subset of transcript isoforms (Table 4). The lifted over gene models were also highly concordant with exon–intron structures defined by RNA-seq coverage (Supplemental Fig. S3). Overall, the lift-over process successfully transferred the annotation information from Bcop_v1 to Bcop_v2.
The updated gene locations enabled us to determine that sites of recently-reported “alien” genes are well-supported by diverse genomic evidence, and that they are not contaminants in the assembly. Specifically, a recent analysis of Bcop_v1 gene annotations led to the conclusion that there has been horizontal gene transfer (HGT) of 28 “alien” P450 genes, 17 from springtails, 10 from mites, and 1 from fungi [7]. Another group found five terpene synthase genes that appear to have originated by HGT from mites [8]. Since claims of HGT in genome sequences has sometimes later been found to be assembly artifacts from contaminating DNA sources [34], we interrogated these locations in the genome. All were found within the chromosome-scale scaffolds. Thus, the “alien” gene locations survived the Hi-C guided correction and scaffolding steps. They did not give rise to disrupted or otherwise unexpected Hi-C or sequencing depth signals across a variety of genomic datasets from different technologies and samples (Supplemental Fig. S4). Finally, there exist Hi-C interactions, long reads, and optical maps that physically connect the “alien” genes to highly conserved Dipteran genes (data not shown), indicating they were in the same nucleus and on the same DNA molecule. In summary, we confirm that the so-called “alien” gene sequences found by others in the B. coprophila Bcop_v1 gene sets are well-supported as being part of the chromosome sequences in Bcop_v2, and are not contaminants in this assembly.
Gene sets on chromosome scaffolds fit expectations of chromosome similarity with other species as a function of evolutionary distance
Over evolutionary time, chromosomes tend to undergo rearrangements that include translocations. Through these rearrangements, genes get shuffled around the genome, and more gene shuffling is likely to occur with increasing evolutionary distance. When defining chromosomes as sets of genes, two closely related species are likely to have very similar chromosomes and therefore highly intersecting chromosomal gene sets. In contrast, distantly related species are likely to have smaller intersections between chromosomal gene sets, approaching “random” proportions of shared genes between arbitrarily selected chromosomes from each species. Seeing this trend with respect to B. coprophila genes on Bcop_v2 chromosome-scale scaffolds would further support the accuracy of the Hi-C scaffolding process.
To test the evolutionary prediction described above, the chromosomal groupings of B. coprophila genes (chromosomal gene sets) was compared with that of four other fly (order Diptera) species with known evolutionary relationships (Fig. 3A) and which all had chromosome-scale genome assemblies [35,36,37,38]. The most distantly related species selected was Drosophila melanogaster from the so-called “higher Diptera” (sub-order Brachycera). In contrast, B. coprophila and the three remaining species are all part of the “lower Diptera” (sub-order Nematocera), which diverged from the higher Diptera ~ 200 MYA [39]. The closest related species to B. coprophila was Pseudolycoriella hygida, which is within the same family (Sciaridae) commonly known as black fungus gnats or dark-winged fungus gnats. P. hygida had a chromosome-scale genome release in 2023 [38] after the release of Bcop_v2. The remaining two Nematocerans are mosquitoes (Culicidae family): the yellow fever mosquito (Aedes aegypti) and the African malaria mosquito (Anopheles gambiae). The proteomes of all fives species were compared to determine orthologs as well as to construct a proteome-wide rooted species tree, which recapitulated the known evolutionary relationships (Fig. 3B).
Interchromosomal gene shuffling increases with evolutionary distance, further supporting integrity of B. coprophila chromosome-scale scaffolds. A Known taxonomic and phylogenetic information regarding B. coprophila and four other Dipteran species. With respect to B. coprophila, the species are presented here from left to right in order of the most closely related to most distantly related. The color scheme from left to right (dark purple to light yellow-green) reflects the entropy scores between B. coprophila and these species as in C-E. The species are presented in this order in each sub-figure below as well. B Species tree produced by OrthoFinder [84, 85] using the proteomes of these five species reproduces the expected relationships amongst these Dipteran flies. C Scatter plots of SCO locations along the chromosomes of each of the five species (x-axis) plotted against B. coprophila (y-axis). Inter-chromosomal SCO scattering increases with evolutionary distance as expected. Blue dots represent SCO locations. Colored boxes along x- and y- axes represent chromosomes. The chromosome box color scheme reflects the entropy score color scheme in D-E. D The pairwise min–max normalized entropy scores for B. coprophila computed against all five species. The entropy scores increase with evolutionary distance as expected. Although the entropy scores approach a fully random state with evolutionary distance from B. coprophila, pairwise relationships of chromosomal SCO groupings still display a bit of non-randomness even in the most distantly related fly species tested (D. melanogaster). E Matrix of all pairwise min–max normalized entropy (MMNE) scores among the five species presented as a hierarchically-clustered heatmap. The dendrogram from pairwise entropy scores reproduces the expected relationships among these Dipteran flies. Altogether, as the SCO dispersion and entropy of chromosomal gene sets follows evolutionary expectations, the location of genes in the chromosome-scale scaffolds we report here for B. coprophila are non-random and suggest the contigs were grouped correctly by chromosome
To test if the degree of gene shuffling around the chromosomes of these five species also reflected their known phylogenetic relationships, the genomic locations of “Single copy orthologs” (SCOs) were visually compared across all five species (Fig. 3C, Supplemental Fig. S5) with respect to their locations within B. coprophila chromosomes (Bcop_v2). The five species had 4,297 SCOs. The chromosomal addresses of SCOs were compared between each species relative to B. coprophila in order to (i) map B. coprophila chromosomes to their most likely counterparts in each of the other species (Table 5) and (ii) visualize as dot plots (Fig. 3C). B. coprophila was plotted against itself to demonstrate the base case of what the absence of gene shuffling looks like (Fig. 3C). Visually, the amount of gene shuffling increased with evolutionary distance as expected. While there is gene shuffling even between B. coprophila and the closest related species tested (P. hygida), it is mostly intra-chromosomal and this pair demonstrates the highest amount of conservation of SCOs from given B. coprophila chromosomes to their single corresponding chromosomes in the other species (Fig. 3C; Table 5, Supplemental Fig. S5). In contrast, compared to B. coprophila, the most distantly related species (D. melanogaster) visually has the most gene shuffling across chromosomes, and the mosquito species have intermediate levels, as expected (Fig. 3C; Table 5). These analyses also point to how the chromosomes of mosquitoes relate to those of the gnats (Fig. 3C; Table 5, Supplemental Fig. S5). For example, in Aedes aegypti, chromosome 1 corresponds to the X chromosomes of both fungus gnats with at least three major rearrangements (Supplemental Fig. S5), the left and right arms of Aedes chromosome 2 correspond to III and IV in B. coprophila, respectively (A and B in P. hygida), and the left and right arms of Aedes chromosome 3 correspond to chromosome II and the left end of chromosome IV (C and A in P. hygida). Overall, visual inspection of the pairwise SCO dot plots makes a strong case that the B. coprophila gene sets defined by the chromosome-scale scaffolds (Bcop_v2) follow evolutionary expectations.
The amount of “gene shuffling” was also quantified between all pairs of the five species using the single summary metric of “entropy”. Briefly, entropy can be thought of as a measure of “mixed-up-ness” that can quantify systems as they evolve from a non-random to a random state [40]. Here, it is used to measure the amount of inter-chromosomal gene shuffling between pairs of species, and ultimately as a function of evolutionary distance from B. coprophila. A normalized entropy metric that ranges between 0 and 1, called Min–Max Normalized Entropy (MMNE), was computed for each pairwise comparison of two species. MMNE shows the observed entropy relative to the minimum and maximum entropy estimates for that pairwise species comparison given the number of shared SCOs, chromosomes, and associated probabilities (see Methods; see example SCO probabilities in Table 6). The initial non-random lowest entropy state (when MMNE is 0) is simply a comparison to self (Fig. 3D). Entropy is expected to increase (MMNE approaches 1) with more distantly related species that had larger periods of time for inter-chromosomal rearrangements to shuffle genes around, which is what is observed with increasing evolutionary distance from B. coprophila (Fig. 3D). Furthermore, performing clustering on the matrix of all pairwise inter-species entropy scores reproduces the species groupings expected by the known evolutionary relationships (Fig. 3E), putting the fungus gnat family members (B. coprophila and P. hygida) together within the larger Nematoceran cluster that also included the pair of mosquitoes, all of which were on a separate branch from the single Brachyceran, D. melanogaster. The results of these comparative genomic evaluations further bolster the confidence we have in the Hi-C directed process that yielded Bcop_v2 chromosome-scale scaffolds.
Repeat locations show family-specific biases towards peri-centromeric and sub-telomeric regions
In general, repeats and transposons are often enriched near centromeric, peri-centromeric, telomeric- and sub-telomeric regions in Dipteran chromosomes [35, 36, 41]. To test this expectation, we analyzed the distribution of repeats across the B. coprophila chromosome-scale scaffolds. Centromeric positions are defined as described above and peri-centromeric positions are defined as the much broader repeat-rich regions flanking each side. The telomeric and sub-telomeric regions are defined here simply as the regions near and including the scaffold termini, which have been shown to correspond to sub-telomeres and telomeres in other Hi-C-based scaffolds of insect assemblies [42, 43]. If present, telomeric repeats are expected only in the terminal-most 10–50 kb whereas sub-telomeres are immediately-adjacent to telomeres and are much broader repeat-rich regions.
First, the entire genome assembly was mapped against itself (or individual chromosome scaffolds against themselves) to create dot plots that naively show all pairwise regions that share high sequence similarity with each other (Fig. 4A). The regions around and including centromeric positions and scaffold termini were very obvious in these plots, with highly repetitive natures locally within each chromosome scaffold and also with high similarity among the corresponding regions across all chromosome scaffolds. The first 10 Mb of the left terminus of chromosome X is composed of two adjacent repeat blocks. The first repeat block at the beginning of X corresponds to the left telomeric/sub-telomeric zone, rDNA, and centromeric repeats among other centromere-proximal repeats (Fig. 4A). To the right of the centromere but still on the left side of the chromosome, the second repeat block contains two adjacent loci of interest that are discussed in a subsequent section (centromere-proximal X’ breakpoint and the first “fold-back repeat” locus), and is more similar to the repeat block in the final 5–8 Mb of the right terminus of the scaffold (distal telomeric side) than it is to the left terminal block (Fig. 4A). The chromosome III scaffold has a similar structure on its left terminus. It starts with a 10–12 Mb, centromere-containing, high density repeat block followed on its right by a second smaller repeat block that is more closely related to the terminal ~ 5 Mb repeat block at the distal telomeric end on the right terminus of the scaffold than the centromeric end to the left terminus (Fig. 4A). Unlike X and III, the scaffold for chromosome II has a single large (~ 8 Mb) repeat block at its centromeric end (left terminus) that has high similarity with the terminal ~ 8 Mb at its distal telomeric end on its right side (Fig. 4A). The repeat densities of the chromosome II repeat blocks seem lighter than the other chromosomes in this method of analysis, but not necessarily so in other analyses (Fig. 4A-B). The metacentric chromosome IV scaffold has a repeat block at both termini, each ~ 8 Mb, as well as a repeat block in the middle of ~ 10 Mb surrounding the approximated centromere position in the scaffold (Fig. 4A). Finally, the sequences in the “associated assembly”, which are not placed in the chromosome-scale scaffolds, are enriched for matches within the centromere-proximal repeat blocks, but also contain matches to rDNA and rDNA-proximal repeats, and matches to terminal repeat blocks to a lesser extent (Fig. 4A). Overall, the repeat structures of the scaffolds help confirm their centromere regions and follow expectations of repetitive structures associated with scaffold termini corresponding to telomeric and/or sub-telomeric regions.
Repeat regions, centromere locations, and terminal repeats across the B. coprophila chromosome-scale scaffolds. A Repeat densities in 100 kb bins across each chromosome, and locations of known locus markers. Sccr = Sciara coprophila centromeric repeat [32]. F4 = probe sequence for ScRTE [32]. hAT-Tag1 = a DNA transposon sub-family. RTE = a LINE sub-family. F4, RTE, LINE, and LTR are retrotransposon families. Helitron elements are DNA transposons thought to amplify through a rolling-circle mechanism. Unknown = repeats that were modeled but not classified as a known repeat class. All Repeats = all in comprehensive repeat library. Conserved Arthropod = matches to known Arthropod Repeats (excluding B. coprophila). Tandem Repeats were found with Tandem Repeats Finder [94]. Gaps = assembly gaps. (B) Dot plots corresponding to (i) the entirety of Bcop_v2, including all primary chromosome scaffolds as well as the associated contigs, and (ii) each individual chromosome mapped against itself. The locations of various loci markers are shown in the margins, are color-coded as shown in the legend, and are extended to axes. Gap locations are shown as white rectangles in the grey margins. Polytene chromosome maps were reproduced from plates 1, 2 and 3 of Gabrusewycz-Garcia [31] with permission from Springer Nature under permission number: 5490830617309. (C) Legend to help interpret D-J. D-G Pictorial models of the left termini for scaffolds corresponding to (D) chromosome II (II-L), (E) chromosome III (III-L), (F) chromosome IV (IV-L), and (G) chromosome X (X-L). H-I Pictorial models of the right termini for scaffolds corresponding to (H) chromosome X, and (I) chromosomes II, III, and IV. (J) Pictorial models of satellite sequences in centromeric and pericentromeric regions of each chromosome scaffold. Models represent the approximate positions and lengths of RTRS and satellite arrays. RTRS copy number is shown. Satellite copy number can be approximated by the total length divided by the satellite unit length
To further analyze the distribution of repeats in the chromosome-scale scaffolds, major repeat families were modeled, annotated, classified when possible, and their relative densities visualized along the linear scaffold sequences (Fig. 4B). Tandem repeats were also mapped independently. Both tandem repeats and the summation of all modeled repeat families are indeed enriched around defined centromere positions and scaffold termini (Table 7; Fig. 4B). When inspecting different transposon clades, their densities near centromeres and scaffold termini had different biases. Retrotransposable element (Class I) orders, LTR, and LINE, and super-families therein (e.g. LINE/RTE) were generally more enriched around centromeric positions than near scaffold termini (Fig. 4B). Their densities are higher at the centromeric (left) ends of the acrocentric chromosomes (X, II, and III) compared to the right ends, and higher at the medial centromere position of IV compared to its termini (Fig. 4B). These retrotransposon orders and super-families have relatively little or no enrichment in terminal regions, which have similar densities compared to the rest of the chromosome lengths (Fig. 4B). LTR elements are particularly enriched near the centromere of chromosome III (Fig. 4B). Sequence labeled as LINE was most abundant in general (Table 7). It was previously shown experimentally that a probe called “F4”, which is a sub-sequence from a specific LINE/RTE-like retro-transposable element called ScRTE, is enriched with proximity to all centromeres [32]. The scaffolds reflected this as well. Both the entire LINE/RTE super-family as well as the specific F4 probe sequence were found enriched with proximity to the centromere positions defined in the scaffolds (Fig. 2C-F, Fig. 4B).
DNA (Class II) transposons have a different bias, the majority seeming to have higher relative densities at least in some telomeric and/or sub-telomeric regions than near centromeres (Fig. 4B). For example, see the non-centromeric ends (right termini) of the scaffolds for chromosomes X and II (Fig. 4B). A sub-family called DNA/hAT-Tag1 is a great example of a family whose highest densities are in terminal regions compared to other regions of the chromosome scaffolds, including centromeres (Fig. 4B). Nevertheless, although DNA elements may be more biased in the termini of X and II than their centromeres, DNA transposons are also found near all centromeric and peri-centromeric regions, illustrated well by the medial centromere of chromosome IV (Fig. 4B). Moreover, the bias of DNA transposons to these repeat-rich regions compared to gene-rich intervening regions is not as high as the retrotransposon bias is to peri-centromeric regions for other transposon families (Fig. 4B). Finally, the “Rolling Circle” Helitron order is distributed such that its highest densities appear similar in both terminal and peri-centromeric regions, and the overall weight of Helitron elements is very concentrated in these regions compared to gene-rich areas of the chromosome arms (Fig. 4B).
Repeats that could be recognized and classified as belonging to a particular repeat or transposon family tend to be enriched with proximity to centromeres and termini (Fig. 4B). This is also true for locations that match conserved classified repeats from across other arthropods (Fig. 4B). Thus, the regions associated with termini and around centromeres are enriched for repeats that are conserved enough to classify into families and/or be most similar to repeats from other species. However, the majority of repeat families in the repeat library trained de novo on the B. coprophila genome could not be classified according to known repeat families in the same automated fashion by the repeat modeling and annotation programs (Table 7), and are referred to as ‘unclassified repeats’. While the density of unclassified repeats is still moderately higher near centromeres and telomeres, the bias is much lower, and the unclassified repeats are much more spread out across the rest of the chromosome arms (Fig. 4B). It is possible some of these unclassified repeat families actually correspond to gene families or repeat families, newly arisen or otherwise, that are just not included as possible classification terms in the programs used. Unclassified repeats may also consist of highly degraded copies of known repeat families that were simply not recognized, which would indicate that repeat sequences are either better maintained near centromeres and near chromosome ends, or younger in those regions.
Chromosome-scale scaffold termini highlight arrays of retrotransposon-related and satellite sequences as candidate telomeric repeats
Although tandem repeats of short (5–10 bp) sequences that are maintained by telomerase reverse transcriptase (TERT) are the norms for eukaryotic telomeres, the telomeres of Dipterans demonstrate notable exceptions to this with alternative systems such as telomere-specific retrotransposons (TSRTs) and Long Complex Terminal Tandem Repeats (LCTTRs) [15, 16, 42, 43]. The telomeres of lower Dipterans, of which the fungus gnat (B. coprophila) is a member, are least explored. The chromosome-scale scaffolds in Bcop_v2 allowed us to closely inspect the sequence at each terminus to hunt for candidate telomere repeat sequences, as has been done recently for other insects [42, 43]. As expected for Dipterans, we found no evidence of the TERT gene in the Bradysia coprophila genome or proteome, and no evidence for short (5–10 bp) tandem repeats at the scaffold termini, including the canonical insect pentamer (TTAGG) [16]. Although tandem repeats in the 14–22 bp range might make up the telomeres of a closely related species [16], we did not find evidence of tandem repeats in this size range near the scaffold termini either. What we did find at the scaffold termini were (i) arrays of retrotransposon-related sequences (RTRS) with unit lengths of 1.7–3.3 kb, (ii) arrays of LCTTRs (or satellites) with unit lengths of 161–191 bp, and (iii) examples of medium-length tandem repeats in the 51–56 bp range located further into the sub-telomeres. These are briefly described below.
Retrotransposon-related sequences (RTRS) were enriched near the left termini of autosomal scaffolds (Fig. 4C-F – Pink triangles) and both termini of the X scaffold (Fig. 4G-H – Purple triangles). The left termini of scaffolds corresponding to chromosomes II, III, and IV are composed of arrays of RTRS with unit lengths of 1.7–3.3 kb. Although the RTRS repeat units have different lengths, they share ~ 1.2 kb of sequence, and all three were classified in the repeat annotation as non-LTR LINE elements related to RTE-BovB, CRE, and/or CR1. This is reminiscent of Drosophila, which have non-LTR elements for telomeres. However, the RTRS arrays found at the very ends of the left termini of II and III point away from the centromere, which differs from Drosophila. On the left terminus of II, there are ~ 8 copies of a 1.7 kb RTRS spanning the first ~ 14 kb, and on III, there are ~ 3–4 copies of a 3.3 kb RTRS spanning the first ~ 12 kb (Fig. 4D-E). Each unit in these arrays has a single location with BLAST hits to reverse transcriptase (RT) proteins. After the RTRS array on III, the repeat annotation shows a centromere-facing LINE (RTE-ORTE). The left terminus of IV has a single left-facing LINE-related sequence and corresponding BLAST hit to RT proteins ~ 10 kb into the scaffold. This is then followed by 13–14 copies of a centromere-facing 2.5 kb RTRS unit, although this unit does not have the region corresponding to RT protein BLAST hits despite being related to the RTRS units on II and III that do (Fig. 4F). In contrast to the left termini of the autosomal scaffolds, the right termini do not have RTRS arrays, at least not so close to the termini (F ig. 4I). The right terminus of II has 1–2 partial hits to the RTRS units in the sub-telomeres. The right terminus of III has two related hits: the very final ~ 100 bp matches the RTRS units, and there is a centromere-facing RT protein hit to a region labeled as LINE/CR1 ~ 10 kb from the end. Finally, ~ 90–115 kb from the right terminus of IV are two sub-telomeric arrays of 2 and 6 centromere-facing RTRS units highly similar to those at the left terminus of IV. Of note to Sciarid researchers, the 1.7–3.3 kb RTRS units do not appear to be related to the very abundant non-LTR ScRTE sequence and its corresponding F4 probe studied by Escribá et al. [32]. ScRTE was not found within 10 kb of any terminus, typically having a first instance 60–440 kb away from termini, much farther inward than the RTRS arrays described above. The left and right termini of the scaffold corresponding to the X chromosome also had retrotransposon-related sequence, but different from that on the autosomes. Both sides of the X scaffold start with sequence classified as a centromere-facing LTR/Pao elements, and both are followed immediately by terminus-facing DNA transposons (Fig. 4G-H). LTR elements would be surprising candidates for telomere sequences compared to Drosophila, which depends on the mechanism associated with non-LTR elements for telomere maintenance [16], and may indicate that the X telomeres are not well-represented on this scaffold.
Long Complex Terminal Tandem Repeats (LCTTRs or satellite repeats) were also enriched near the right termini of autosomal scaffolds (F ig. 4I; Table 8). Specifically, the right termini of scaffolds corresponding to chromosomes II and III end with ~ 28 copies of a 175 bp satellite (F ig. 4I – Orange triangles; Table 8) and ~ 28 copies of a 161 bp satellite (F ig. 4I – Red triangles; Table 8), respectively. The most striking example is the right terminal ~ 40.3 kb of IV, which contains 229 copies of a 176 bp satellite (F ig. 4I – Blue triangles; Table 8). Each satellite species is very close to the LCTTR unit length seen in midges of 176 bp. Although each right terminus ends with a different satellite, each terminus or sub-telomeric region also highlights one or more of the others and/or of two other prevalent satellites with unit lengths of 191 bp and 56 bp (F ig. 4I – Brown and light purple triangles, respectively; Table 8). For example, the right terminus of II also has an array of the 191 bp (brown) satellite, the right-terminal sub-telomeric regions of all chromosome scaffolds have arrays of the 56 bp (purple) satellite, and arrays of the 161 bp (red) satellite are found in the right-side sub-telomeric regions of scaffolds for X, II, and IV. The scaffold for chromosome IV shows the richest examples of satellite arrays in its sub-telomeric regions, both the left and right termini of which have arrays of four or more of these satellite species. The left side of IV in particular is super-enriched for the 175 bp (orange) satellite that was seen at the right terminus of II. Interestingly, all RTRS units above have 1–4 partial hits to the 176 bp (blue) satellite sequence that was found in a long array concluding the right terminus of IV.
Centromeric and pericentromeric regions on the chromosome-scale scaffolds are also enriched for the termini-associated satellites (Fig. 4J). For example, the 161 bp (red) satellite was found in the centromeric regions of all three acrocentric chromosome scaffolds: X, II, and III. In fact, the 161 bp (red) satellite was tightly associated with arrays of the 143–155 bp “Sciara coprophila centromeric repeat” (Sccr) sequence [32] (Fig. 4J, Yellow triangles), on the X scaffold, and also turned out to be part of the Sccr-linked repeat family that helped identify centromeric positions in scaffolds, especially for chromosome II (Fig. 4B; “Sccr family”). The centromeric region of II did not have strong representation from Sccr (possibly an assembly artifact), but had 47 copies of the Sccr-associated 161 bp (red) satellite and multiple arrays of the 175 bp (orange) and 176 bp (blue) satellites. The only metacentric scaffold, IV, had massive arrays of Sccr, but was depleted for termini-associated satellites, although it had 131 tandem copies of the 191 bp (brown) satellite in the pericentromeric region ~ 75 kb away from the closest Sccr repeats.
Known biological features of the X chromosome revealed by Hi-C data
There are three regions along the X chromosome where it folds back on itself such that these loci physically interact, leaving the X looking like a loop with its ends knotted up together on polytene spreads [22, 23, 30, 31, 44] as depicted in Fig. 5 (Fig. 5A, B) and earlier reports [22, 23, 30, 32, 44]. These regions have been referred to as “repeats on the X” [30] or “repeat regions R1, R2, and R3” [22, 44]. The richest descriptions of them simply refer to their banding pattern when polytenized as “three three-band non-adjacent pattern repeats” [22]. However, whether these regions truly are repeats of each other at the level of DNA sequence or whether they even contain repetitive DNA sequences is yet unknown. Therefore, we refer to the three “repeat” loci on the X here simply as the “Fold-Back Regions” (FBRs 1–3). Each locus is reported to be composed of three polytene bands, which indicates they correspond to DNA lengths of tens to hundreds of kb, assuming similar band lengths as Drosophila [45]. The Fold-Back Regions have been mapped cytogenetically to the zones within the X polytene map (Fig. 5C) [22, 23, 31, 44]. Slightly downstream from the centromere (part of polytene sub-zone X/1 A) is FBR1 in sub-zone X/1 C, which is followed within a relatively short distance by FBR2 (sub-zone X/4 A), and a subsequent longer distance by FBR3 (sub-zone X/12 C), which is closest to the distal (non-centromeric) end of the chromosome (sub-zone X/14 C).
Putative Fold-Back Regions (FBRs) and X’ Breakpoints further confirm accuracy of the X scaffold. A, B Polytene chromosome spreads demonstrating how the X and X’ chromosomes are usually found: folded back on itself with specific loci physically interacting. C Pictorial representation of what is known about the X and X’ chromosomes, FBRs, and inversion breakpoints. D Hi-C map from male pupae data showing three interacting loci and how they appear to correspond to the expected locations of FBRs along polytene maps. E Hi-C map from X’X adult female data [10] that shows the FBRs as well as two additional loci that correspond to the long paracentric inversion breakpoints. F An intermediate scatter plot heat map visualizing part of the long-range interaction peak cluster calling pipeline with squares demonstrating long-range interactions found by this method. G IGV Trace of the X chromosome with special attention to the locations identified as FBRs and breakpoints (top 3 tracks), as defined by our long-range Hi-C interaction peak detection method, and how they relate to repeat-rich and gene-rich regions. H Dot plot of extracted sequences corresponding to FBRs and breakpoint regions, showing sequence matches within and between each sequence. Blue dots represent matches on the same strand. Red dots represent matches on opposite strands. Polytene chromosome maps in C-E were reproduced from plates 1, 2 and 3 of Gabrusewycz-Garcia [31] with permission from Springer Nature under permission number: 5490830617309
There is a variant of the X chromosome called the X’ that is found only in female-producing females [1, 2, 10, 22, 44, 46,47,48,49], which also folds back on itself via the fold-back region interactions. The X’ has long been known to differ from the X by a long paracentric inversion that reverses the order of the fold-back regions such that FBR3 is centromere-proximal and FBR1 is most distal (Fig. 5C). An example X’X polytene folding back on itself via FBR interactions is specifically shown in Fig. 5B; note the lack of synapsis between the X and X’ homologs across most of the chromosome length corresponding to the long inversion compared to the XX polytene chromosome in Fig. 5A. The breakpoints that correspond to the long paracentric inversion have been mapped cytologically on the polytene zones upstream of FBR1 and downstream of FBR3 (Fig. 5C) [22, 44], and more recently have been explored bioinformatically using sequencing data from X’X females and the Bcop_v2 X chromosome scaffold reported here [10]. Comparing images of the X’ breakpoints and FBRs identified on polytene chromosomes [22] with polytene maps [31], we estimate that (i) the 5’ breakpoint is downstream of both the centromere and the so-called dark “landmark” band [22, 32] in an interband quite close to FBR1 (X/1 C), likely at the end of polytene sub-zone X/1B or the beginning of X/1 C, and (ii) the 3’ breakpoint occurs downstream of FBR3 (X/12 C), likely at the end of X/13 C or beginning of X/14 A (Fig. 5C).
If the Fold-Back Regions physically interact in the X chromosomes of male pupae (for which we collected Hi-C data), then they should appear on the Hi-C map of the X chromosome scaffold as three off-diagonal dots with high frequency long-range interactions. Indeed, the Hi-C map of the X chromosome shows dots corresponding to the frequent interactions of three distal loci in three-dimensional space (Fig. 5D). Moreover, Hi-C data from X’X adult females [10] shows the same three high frequency long-range interactions along the X scaffold, but also has two additional apparently-long-distance interactions corresponding to the breakpoints of the long inversion (Fig. 5E). Importantly, the locations of and nucleotide distances between the centromere, FBRs 1–3, and the X’ breakpoints follow expectations based on the polytene maps (Fig. 5D-E, Table 9). FBR1 is expected to be within ~ 3.4 Mb (~ 2 sub-zones) of the centromere and up to ~ 839 kb (~ 1/2 sub-zone) downstream of the centromere-proximal X’ breakpoint. Indeed, putative FBR1 is ~ 3.49 Mb and ~ 620 kb downstream from the centromere and inversion breakpoint, respectively (Table 9). Moreover, FBR1 and FBR2 are expected to be separated by ~ 11.8 Mb (~ 7 sub-zones), and FBR2 and FBR3 are expected to be separated by 43.6 Mb (~ 26 sub-zones). The observed distances are 10.3 Mb and 42.7 Mb, respectively (Table 9). The 3’ breakpoint is expected to be ~ 5 Mb downstream (~ 3 sub-zones) of FBR3, and was found to be 4.65 Mb downstream. Finally, FBR3 was predicted to be ~ 10.1 Mb upstream of the distal end of chromosome X, and the observed scaffold location is 12.3 Mb upstream (Table 9). In all cases, the observed distances and locations were exceptionally close to projections based on known polytene map distances and locations. Thus, it is extremely likely that the three distal loci that physically interact at high frequencies according to Hi-C data are indeed the three Fold-Back Regions, which have not been previously known at the sequence level. The Hi-C interactions in male pupae and female adults indicate that the physical interactions of the X chromosome seen in polytene chromosome spreads are not in vitro artifacts, nor are the interactions limited to polytenes in larval salivary glands, as they now have been observed in vivo across different tissues and life stages in both sexes (male pupae and female adults). Overall, the FBR and X’ breakpoint locations further confirm the identity and structural accuracy of the X chromosome scaffold.
The three Fold-Back Regions on the X chromosome have been described as repeats in cytogenetic studies [22, 23, 30, 44]. However, the sequences were previously not known, and whether they are composed of repeats has not been well-established. Given the likelihood that the three loci on the X with long-range Hi-C interactions correspond to the three FBRs, their scaffold locations and sequences were extracted and analyzed (Supplemental Tables S1- S5). FBR scaffold locations were discovered by calling long-range Hi-C interaction peaks, after first filtering out short interactions less than 5 Mb (Fig. 5F-G; see Methods). The length of each FBR was in the ~ 103 kb to ~ 1.1 Mb range with interaction frequency summits in the 34–69 kb range, depending on how boundaries were defined (see Methods). To maximize the chance of detecting long stretches of similarity among these regions, we used the longest estimates. When directly comparing the extracted sequences of these three loci by BLAST, there were stretches of similarity found in all three (Fig. 5H). Although some matches reached the kb range, most stretches were short (< 200 bp), and most or all were associated with repeats found elsewhere in the genome. The FBRs did not have shared repeat families specific to these regions. Looking at the distribution of repeats across the X chromosome, it is plausible that the first locus, putative FBR1, is embedded in a highly repetitive region, but less likely for the two downstream loci (putative FBRs 2 and 3), which appear to be in relatively unique regions (Fig. 4A, B, Fig. 5G). When comparing the entire X chromosome to itself using dot plots, if these regions had high similarity the length of up to three polytene bands, there would be off-diagonal “dots” on the dot plots corresponding to the locus pairs analogous to that seen in the Hi-C interaction frequency plots. However, such long stretches of similarity are not seen in those off-diagonal locations (Fig. 4B). Overall, the evidence suggests that although the three FBR loci interact, it is not necessarily due to vast stretches of sequence homology or repeats that would span 1–3 polytene bands (tens to hundreds of kb). However, this does not exclude the possibility that one or a few of the shorter stretches shared between the regions are important aspects of their physical interactions. In contrast to FBRs 2 and 3 being in relatively repeat-poor regions of the X chromosome, the X’ inversion breakpoints are both buried in repetitive domains corresponding to peri-centromeric and sub-telomeric regions, which do share more similarity with each other (Fig. 4A, B, Fig. 5G). The 5’ breakpoint and nearby putative FBR1 are embedded in a repeat block seen on the X scaffold dot plot near the centromeric (left) end of the chromosome that has high similarity with the right terminal repeat block (Fig. 4A, B, Fig. 5G). The presence of the FBRs on both the X and X’ suggest the FBRs and their interactions preceded the inversion, and it is possible that the evolution of the long paracentric inversion on the X’ was mediated by the systematic fold-back interactions of FBR1 and FBR3 bringing these peri-centromeric and sub-telomeric repeat blocks close together.
In addition to repeats, there are also genes within the boundaries of all three putative FBR loci (as well as for the X’ inversion breakpoints). Thus, it is also possible that the long-range fold-back interactions may be involved in regulation of a gene or genes in the fold-back regions, or that these regions each have representatives from a shared gene family. To maximize the sensitivity of finding a gene family with members in all three regions, we focused again on the broader region boundaries, not the narrowest estimates. Depending on the gene annotation set and FBR boundaries used, there are 85–203 genes across all three FBRs with 35–78 in FBR1, 29–80 in FBR2, and 21–45 in FBR3 (Supplemental Tables S1- S3). Regarding whether there are homologous genes in these regions, some of the stretches of nucleotide similarity shared by all three FBRs do overlap gene boundaries, but almost exclusively in introns, with very few exon hits and no coding sequence (CDS) overlaps. In agreement, there appears to be no homologous protein family represented in all three regions when analyzing protein sequence similarity with BLASTP. The regions of similarity at the nucleotide level within introns are largely from transposons and other repeats. Nevertheless, we cannot exclude the possibility that some stretches of similarity in the introns or near genes in general are regulatory elements shared by the FBRs. It is also possible that FBR-specific sequences, not found elsewhere in the genome, are important. There are millions of FBR-specific kmers of various sizes (15–80 bp) not found elsewhere in the genome. These kmers correspond to on the order 0.5–1 Mb of sequence from hundreds to thousands of sub-regions of the FBRs up to hundreds of bp in length, which also lack BLAST hits elsewhere in the genome. Approximately 70% of this sequence is within genes and a partially-overlapping ~ 16–17% corresponds to highly diverged repeats, but ~ 23.5% is intergenic and not labeled as a repeat. Thus, there may be intergenic (and genic) FBR-specific nucleotide elements that are found in one or more of the FBRs, but not elsewhere in the genome.
We explored the nature of the genes within the broadly defined FBR regions as well as those genes closest to the interaction frequency summits (Supplemental Tables S1- S5). Each had a mixture of protein-coding and long non-coding RNA (lncRNA) genes. Although the FBRs were not enriched for lncRNA genes compared to other regions of similar size on the X scaffold, a lncRNA gene was one of the closest genes to the summit of FBR1. The genes across the three FBRs were not enriched for any particular GO terms. The other genes closest to the FBR1 summit were similar to a transcriptional activator (mushroom body large-type Kenyon cell-specific protein 1) and a G-protein-coupled receptor (GPCR; 5-hydroxytryptamine receptor 1-like). The summit of FBR2 was close to three uncharacterized genes and a gene similar to “upstream activation factor subunit spp27” that may be a transcriptional activator of RNA Pol I; and the summit of FBR3 was close to a gene predicted to be involved in actin depolymerization (F-actin-monooxygenase Mical), a cysteine hydrolase (carboxymethylenebutenolidase), among other genes. Finally, the long paracentric inversion breakpoint interaction summits were near several genes, including a protein phosphatase PP2 A subunit and a histone-lysine N-methyltransferase, both of which may be interesting with respect to the mechanism of differential X chromosome elimination in embryogenesis.
In addition to the FBRs, there are other, smaller Mb-range interaction dots in the Hi-C maps of the X chromosome, especially in the X’X female Hi-C map (Fig. 5D-E, Supplemental Fig. S2, S6, S7). However, many of the smaller dots seen in X chromosome Hi-C maps from X’X data are likely from structural differences, including a different transposition landscape, between the X and X’ chromosomes as recently studied [10]. Indeed, many of the same “long-range interaction” dots appear even when treating genomic DNA sequencing data from X’X female adults as if it were also Hi-C data (Supplemental Fig. S6 A). In support, those dots are missing from the X chromosome Hi-C map for males, which lack the X’ (Supplemental Fig. S6B). The size of the dots corresponding to structural differences between the X and X’ are tiny compared to interacting regions corresponding to the FBRs, suggesting that true Mb-range interactions on the X chromosome (or FBRs specifically) involve large regions whereas structural differences only affect small regions of the map. Therefore, it is interesting that the long paracentric inversion breakpoints also cover large regions comparable to the FBRs, suggesting those regions not only represent a known structural difference between X and X’, but also that the regions around those breakpoints interact similar to FBRs. After removing long-range interaction dots in the X’X data that correspond to structural differences, FBRs, and the long paracentric inversion breakpoints, there remained seven long-range interactions of interest spanning 8–55 Mb (Supplemental Fig. S7 A, Supplemental Table S6), four of which were also in the male Hi-C map (Supplemental Fig. S7B). Finally, we also identified a 12 Mb long-range interaction on the left side of chromosome IV (Supplemental Fig. S2D, Supplemental Tables S6 and S14). The genes that were within the regions of highest interaction frequencies for the pairs of loci corresponding to each of the eight dots contain mixtures of protein-coding and lncRNA genes (Supplemental Tables S6-S14). Genes from the X dots include histone deactylases, transcription factors, transporters, acting and actin-binding proteins, insulin-related peptides, and a broad swath of other types of genes (Supplemental Tables S7-S13). One gene that jumped out to us was related to the sex determination protein “fruitless” (“fru”), which was in “X-dot- 4” (Supplemental Figure S7, Supplemental Table S9), a long-range interaction that was only present in the X’X female Hi-C map, not in the male Hi-C map (Supplemental Figure S7).
Discussion
Bradysia coprophila, a dark-winged fungus gnat from the lower Diptera (Nematocera), has interesting chromosomal biology that includes developmentally-programmed examples of elimination, non-disjunction, polytenization, intrachromosomal amplification, and maternally-acting sex chromosomes [1, 2]. It also presents opportunities for studying sex chromosome evolution [10, 50] and horizontal gene transfer [6,7,8]. A crucial part of the toolbox for studying these phenomena is a genome sequence. The previous reference genome sequence for Bradysia coprophila, Bcop_v1, was a high quality, high contiguity assembly of the somatic genome (chromosomes X, II, III, IV) produced from PacBio RS II and Oxford Nanopore MinION long read data from male embryos, and subsequently scaffolded with optical maps of male pupal DNA molecules from BioNano Genomics [9]. Bcop_v1 was produced through an extensive evaluation process comparing nearly 100 assemblies [9]. Moreover, as Bradysia coprophila males have only one copy of the X chromosome, but two copies of autosomes, all contigs in Bcop_v1 were classified as X-linked or autosomal [9]. Finally, two gene annotation sets were constructed for Bcop_v1: we produced one using Maker2 [9, 51, 52] and NCBI produced one for RefSeq [9, 53]. However, Bcop_v1 was not chromosome-scale and mostly not anchored into chromosomes, limiting its usefulness in studying the evolution and behaviors of Bradysia coprophila chromosomes.
Our goal in this project was to improve the reference genome assembly for Bradysia coprophila to better-facilitate studies of its interesting chromosome behaviors and evolution. To do so, we upgraded Bcop_v1 [9] to chromosome-scale scaffolds (Bcop_v2) using Hi-C technology. The updated assembly of the Bradysia coprophila somatic genome (Bcop_v2) represents each chromosome (X, II, III, IV) as its own scaffold for the first time, and is oriented in the same direction as polytene maps produced in the 1960 s [31] to be consistent with historical research. The chromosomal identities of the scaffolds were established and supported several ways, including multiple anchor sequences, chromosome lengths, depth of coverage, centromere positions, and other idiosyncrasies of the X chromosome. The accuracy of the chromosomal structure of the scaffolds was also supported by multiple approaches, including the concordance of anchor sequence locations along the scaffolds with their locations along polytene maps, the relative positioning of centromeres and telomeres, the increasing enrichment of repeats with proximity to centromeric and telomeric regions, and other landmarks on the X chromosome (FBRs and X’ breakpoints). As studies are ongoing across multiple research groups with the gene annotation sets produced for Bcop_v1, we lifted over these gene annotation sets directly onto the chromosome-scale scaffolds so researchers can seamlessly transition ongoing studies from Bcop_v1 to Bcop_v2. The chromosomal specificity of the genes within each chromosome scaffold was further supported by comparative genomics, which showed that the chromosomes of more closely related species were far more similar in terms of gene sets than more distantly related species, both visually and using entropy statistics, the latter of which also reproduces expected phylogenetic species groupings. The totality of evidence overwhelmingly suggests that Bcop_v2 contains high quality scaffolds of the four somatic B. coprophila chromosomes (X, II, III, IV).
The rich history of studying B. coprophila chromosomes with respect to polytene map zones [31] has been a major benefit in producing and interpreting these chromosome-scale scaffolds. For example, previous in situ hybridization studies that identified the chromosomal addresses of cloned DNA sequences [17,18,19,20,21,22,23,24,25,26,27,28,29] has allowed us to identify the chromosomal identity of each chromosome-scale scaffold, orient them according to the polytene maps, test their structural accuracy, map their centromeric regions, and even to study known features of the X corresponding Fold-Back Regions and breakpoints of a long paracentric inversion on a variant of the X chromosome called the X’ (X prime). The locations of polytene zones on the chromosome scaffolds will help integrate information from polytene chromosome studies into modern genomics analyses. In the future, a systematic set of FISH probes informed by the scaffolds in combination with using Hi-C to identify the boundaries of bands and interbands seen in polytenes, which have been associated with topologically associated domains (TADs) in Drosophila polytene chromosomes using Hi-C data [54, 55], could further anchor the scaffolds into the polytene zone landscape of each chromosome.
As Bcop_v2 is derived from scaffolding Bcop_v1, there were no significant changes to conclusions about the types and amounts of repeats reported previously [9]. However, the repeats in Bcop_v1 are scattered across hundreds of unanchored contig sequences whereas the scaffolds of Bcop_v2 allow us to see where the repeats are located within the chromosomes. We found recognizable repeat families concentrated around peri-centromeric and sub-telomeric regions with different propensities. Retrotransposon orders in general, and especially those classified as LINE (non-LTR), were very enriched in pericentromeric regions, including the LINE/RTE -family as shown to be enriched around centromeres previously for a specific ScRTE family [32]. Despite a major bias towards centromeric regions, retrotransposon related sequences and reverse transcriptase BLAST hits are found throughout scaffold terminal regions as well, including within 10 kb of most termini. Retroelements have also been shown to be enriched in functional Drosophila centromeres [56], and to make up and play an important role in maintaining Drosophila telomeres [15, 16]. Helitron elements were enriched in both terminal and centromeric regions. Other DNA transposon orders were biased more towards scaffold terminal regions, although they are likely sub-telomeric as they do not appear as terminal sequences.
Dipterans have unique telomere systems and have lost both the short tandem repeats and the telomerase gene, TERT, characteristic of most other eukaryotes [16]. The telomeres of Drosophila, a “higher” Dipteran, are composed of telomere-specific retrotransposons (TSRTs) that point towards the centromere and are maintained by their reverse transcriptase genes [15, 16, 42, 43]. The telomeres of lower Dipterans are less well understood. Rhynchosciara americana appears to have 14–22 bp long tandem repeats, which might be maintained by a TSRT related to Drosophila TSRTs [16]. Non-biting midges (Chironomidae) have"Long Complex Terminal Tandem Repeats"(LCTTRs) with a unit length of 176 bp in one species and ~ 350 bp in another species, although the latter appears to have evolved from a simpler 175 bp repeat [15, 16]. The long repeats in midges might be maintained by homologous recombination or gene conversion, although long complementary RNAs have been observed, suggesting an RNA intermediate as with telomerase and TSRTs [15, 16]. A mosquito has even longer repeat lengths of ~ 820 bp in its telomeres, and at least one higher Dipteran, Drosophila tristis, has LCTTRs of similar length to midges (181 bp) [15]. In at least some midges, and in some insects more generally, telomeres can differ from each other, and some telomeric repeats may also be seen in centromeric regions [15, 16, 42, 43].
We checked the scaffold termini for signatures of known telomeric systems. As with all Diptera, Telomerase Reverse Transcriptase (TERT) was not found in the B. coprophila genome sequence nor proteome, nor was the associated hexamer/pentamer (TTAGGG/TTAGG) or other variant found at scaffold ends. Non-LTR retrotransposons (HeT-A, TART, TAHRE) make up and maintain Drosophila telomeres [15, 16], and TART copies were also found near the chromosome ends of a more closely related species, Rhynchosciara americana [16]. Although all these transposons are possibly too far diverged to find matches in the terminal regions of B. coprophila chromosome scaffolds (Bcop_v2), reverse transcriptase genes can certainly be found in the sub-telomeric regions, even within the first 10 kb. Moreover, the left termini of the B. coprophila autosomal scaffolds all seem to have arrays of retrotransposon-related sequences (RTRS) with units in the 1.7–3.3 kb range and both termini of the X scaffold were associated with retrotransposons in the repeat annotation. Rhynchosciara was also shown to have a “special type of short tandem repeats” (stSTRs) of 14–22 bp [16]. Given their close evolutionary relationship, B. coprophila might be expected to have the same system. However, we did not find short repeats at the B. coprophila chromosome scaffold termini, although tandem arrays of short repeats may be found further inward in the sub-telomeric regions. It is also possible that B. coprophila has Long Complex Terminal Tandem Repeats (LCTTRs) like those found in Chironomidae and Culicidae [15, 16]. Indeed, we found a set of satellite sequences with unit lengths of 161–176 bp associated with the right termini of autosomal scaffolds. These terminal and sub-terminal satellites were also seen in peri-centromeric regions as was also seen in some midges [15]. We note that the repeat lengths of these satellites are similar to nucleosome spacing lengths. For example, the length of DNA per nucleosome in Drosophila is ~ 175 bp [57], which includes the linker region, and two of these candidate repeats are exactly 175–176 bp. Interestingly, all 1.7–3.3 kb RTRS units above have 1–4 partial hits to the 176 bp satellite sequence, suggesting an ancient relationship between these satellites and transposons. This is also reminiscent of the cycling between retroelements and satellite DNAs found in the centromeres and telomeres across several Drosophila species [58]. Although it is possible that some of the scaffold termini may be sub-telomeric, as opposed to telomeric, our current model for B. coprophila telomeres based on the termini of these chromosome-scale scaffolds appears to include arrays of both RTRS and LCTTR satellites where autosomal centromere-proximal (left) termini are more correlated with RTRS arrays and centromere-distal (right) termini are more correlated with LCTTR arrays. This may be similar to other insects where different telomere sequences have been found on different chromosomes and even on the opposite sides of the same chromosome [15, 16, 42, 43].
The X chromosome scaffold has already benefited a study of the X chromosome variant, X-prime (X’), which differs from the X by a large paracentric inversion [10]. The X and X’ are thought to recombine inefficiently due to this inversion, and analyzing the divergence inside the inverted region suggested (i) the inversion is recent, perhaps originating within the last 0.5 million years, and (ii) there are internal blocks with different divergence rates [10]. There are also interesting biological signals in the Hi-C maps that can be explored further. The notable example here are the three dots on the X chromosome Hi-C map that appear in the expected pattern of the three fold-back regions (FBRs) identified by Crouse [22, 23, 30, 44]. Now that we have identified the fold-back region sequences within the scaffolds, it will be possible to begin defining what mediates their interactions and what function is served by doing so. Our work definitively shows that these interactions happen not just in spreads of larval polytene chromosomes, but in vivo in whole male pupae and female adults. The circular fold-back structure is not seen in spreads of mitotic chromosomes where the X is a rod. Therefore, the in vivo Hi-C interaction signal is likely not from mitotic chromosomes. However, in addition to signal from polytene chromosomes, the in vivo signal from whole pupae and adults could reflect fold-back interactions in interphase chromosomes of diploid cells. Indeed, polytene chromosomes are thought to be analogous to the interphase state.
The X chromosome foldbacks seen in Sciarid polytene chromosomes might be a cytological manifestation of long-range loop interactions that have been described in other systems, such as mosquitoes and Drosophila [59, 60]. For example, mosquitos typically have loops spanning less than 1 Mb, but certain mosquito species have up to six giant loops spanning loci that are up to 31 Mb [59]. Moreover, one of the giant loops is on the mosquito X chromosome [59]. It is tempting to speculate that this might be evolutionarily related to the interactions among the three foldback regions (FBRs) seen on the B. coprophila X and X’ chromosomes where FBR1 and FBR2 are separated by 10.3 Mb and FBR2 and FBR3 are separated by 42.7 Mb (Table 9). In Drosophila, meta-domains reflect interactions between TADs that are separated by up to 50 Mb [60]. Therefore, long range interactions as described here for the B. coprophila foldback regions on the X and X’ chromosome may have counterparts in other lower (eg, mosquito) and higher (eg, Drosophila) dipteran insects. Overall, it will be exciting to further elucidate in future studies what sequences and proteins mediate the long-range fold-back interactions, what function the long-range interactions serve, whether these regions and interactions are involved in other aspects of X chromosome biology, and if these interactions play a role in facilitating structural rearrangements on the X and X’ chromosomes.
The chromosome-level nature of Bcop_v2 will allow researchers to design experiments and analyze data in a chromosome-specific way, the X chromosome being a shining example as described above. The location and sequences of the telomeres and centromeres are now more obvious and open to future investigation. The B. coprophila chromosome scaffolds will allow proper studies of synteny and chromosome evolution across closely and distantly related flies, and will be a valuable data point for investigations into ancestral karyotypes given the rearrangements and translocations of syntenic blocks across Dipteran evolution, as has been done for mosquitoes [61, 62]. Finally, Bcop_v2 will also be useful to researchers working with fragmented genome assemblies of closely related species by letting them clustering and ordering the contigs into pseudo-chromosome scaffolds.
Conclusions
The chromosome behaviors found across the life-cycle of the dark-winged fungus gnat, Bradysia (Sciara) coprophila, present unique opportunities to gain new insights into chromosome biology and evolution. However, despite the emphasis on chromosome biology in this system, it lacked full-length sequence models of the chromosomes. To rectify this, a high quality Mb-scale genome assembly created from single-molecule long-read datasets from Oxford Nanopore, PacBio, and BioNano Genomics [9] was further corrected and scaffolded to the chromosome-scale using Hi-C data from male pupae. The chromosome scaffolds were demonstrated to be of remarkable quality, and allowed us to test and explore several aspects of fungus gnat chromosomes. Regions along the scaffolds containing alleged horizontally-transferred genes have the same level of evidence and support as regions containing highly conserved Dipteran genes, supporting the possibility of HGT and ruling out artifacts from contamination. LINE-related retrotransposons are enriched in centromeric and pericentromeric regions whereas DNA transposons exhibit a sub-telomeric bias. The telomeres of fungus gnat chromosomes appear to harbor long complex terminal tandem repeats (LCTTRs) as has been seen in midges, but also may involve arrays of retrotransposon-related sequences. Finally, the physical interactions between three loci on the X chromosome seen on polytene chromosome spreads is not limited to larval salivary gland polytenes nor is it an artifact of the fixation and spreading procedures for polytene chromosome preparations as these physical interactions are also detected in vivo by Hi-C of whole male pupae and whole female adults. The regions of these three loci were identified along the X scaffold, and their sequences analyzed for the first time. Although these loci were long described in cytological studies as “three three-band repeats”, genomic analyses suggest the three regions are not simply three long paralogous sequences. They do not share the same genes, but may share short sequences. The chromosome-scale genome sequence, its gene annotations, its associated data, and the insights produced in this manuscript will be immediately useful and of major importance to the growing research community interested in the unique chromosome biology of Bradysia coprophila, and more generally to those interested in genome assembly, chromosome biology, Dipteran evolution, horizontal gene transfer, insect telomeres, and long-range chromosome interactions.
Methods
Male pupae collection
The dark-winged fungus gnat, Bradysia coprophila, was previously referred to as Sciara coprophila in chromosomal and molecular biology research papers since the 1920 s [1, 2, 63], and is also known by other names such as Sciara tilicola and Sciara amoena [2]. In this study, the fungus gnats were from the HoLo2 line maintained in the International Sciara Stock Center at Brown University (https://sites.brown.edu/sciara/). B. coprophila females are monogenic, meaning they have either only male or only female offspring, which is determined by whether they harbor a variant of the X chromosome (X’) or not. Specifically, X’X females are female producers and XX females are male producers. The Holo2 line has a phenotypic marker gene on the X’ chromosome called Wavy [1, 48]. Female producers (X’X) have the Wavy wing phenotype whereas male producers (XX) have wild-type straight wings. To collect only male pupae, crosses between straight-winged females (XX) and males (XO) were used to obtain strictly male progeny. Specifically, late-stage larvae from male-only matings were transferred from mating vials to agar petri plates where food and debris were removed from them, and where they entered pupation asynchronously over 2–4 days. The mixed-stage pupae were transferred to Eppendorf tubes (90–100 per tube), washed twice with 2X SSC, and then were flash frozen in liquid nitrogen and stored at − 80˚C until needed. Approximately 1,000 frozen male pupae collected across 11 tubes were shipped to Phase Genomics (Seattle, WA) on dry ice where they used the material as needed to prepare a Phase Genomics Proximo Hi-C sequencing library.
Hi-C data collection
The frozen male pupae were ground into a fine powder. Chromatin conformation capture data was then generated using a Phase Genomics (Seattle, WA) Proximo Hi-C 2.0 Kit, which is a commercially available version of the Hi-C protocol [11]. Following the manufacturer's instructions, intact cells from two samples of finely ground frozen male pupae were crosslinked using a formaldehyde solution, digested using the Sau3 A1 (^GATC) restriction enzyme, and proximity-ligated with biotinylated nucleotides. This creates chimeric molecules made from DNA fragments that came close together in vivo, but that may not be close together along the genome sequence. As instructed by the manufacturer's protocol, the chimeric molecules were pulled down with streptavidin beads and processed into an Illumina-compatible sequencing library. Sequencing was performed on an Illumina NovaSeq, generating a total of 115.2 million 2 × 101 bp paired-end reads.
Hi-C assembly correction
The input assembly to Hi-C correction was Bcop_v1 [9], a mega-base scale, optical-map-scaffolded long-read assembly. Although Bcop_v1 consists of both scaffolds (two or more contigs joined by optical maps) and singleton contigs, here and throughout, the term “contigs” is used generically for both types of sequences from Bcop_v1, especially in the context as input to Hi-C guided contig correction and scaffolding. The Hi-C data was processed, as described below, according to Phase Genomics recommendations [64]. To produce a corrected assembly, the reads were aligned to the 205 primary contigs of Bcop_v1 [9] using BWA-MEM [65, 66] with the − 5SP and -t 8 options specified, and all other options default. The output was piped into SAMBLASTER [67] to flag PCR duplicates (to later exclude from analyses) and subsequently piped into SAMtools [68] using “-F 2304” to remove non-primary and secondary alignments. Juicebox [69,70,71] was used to manually identify and break contigs at putative mis-joined regions based on disruptions in the expected pattern from Hi-C alignments along the contigs [11, 69,70,71]. Such putatively mis-joined regions were cut out and marked as “debris”. The cut-out debris ranged from ~ 1.5–50 kb. Regarding the number of mis-joins detected and corrected (see Results), we leaned toward over-correction rather than under-correction at this step since the Phase Genomics Proximo Hi-C scaffolding in the next step can “repair” these breaks by joining the sub-contigs back together if breaking them apart was in error.
Hi-C assembly scaffolding
The paired-end Hi-C reads were then aligned to Bcop_v1_corrected using the same alignment procedure as above, and these alignments were the input to the Phase Genomics'Proximo Hi-C genome scaffolding platform, which was used to create chromosome-scale scaffolds from the Bcop_v1_corrected as described in Bickhart et al. [13]. As in the LACHESIS method [14], the Proximo scaffolding process computes an interaction frequency matrix from the aligned Hi-C read pairs that is normalized by the number of DPNII restriction sites (^GATC) on each contig. Proximo then constructs scaffolds by optimizing the expected interaction frequencies and other statistical patterns in Hi-C data. Using a brute-force approach, approximately 20,000 separate Proximo runs were performed to optimize the number of scaffolds and the concordance with the observed Hi-C data. Finally, Juicebox [69] was again used to manually inspect for mis-join signals corresponding to scaffolding errors. The associated contigs from Bcop_v1 that were not included during the Hi-C scaffolding process were added back. Note that we also tried including the Bcop_v1 “associated contigs” (haplotigs, etc.) during the Hi-C scaffolding process. In doing so, we found that although the resulting scaffolds were extremely similar to those produced from using only the primary contigs, it resulted in erroneously stitching together primary and associated contigs previously identified as haplotigs rather than keeping them separate. Therefore, the results of using only the primary assembly were preferred.
At the end of the scaffolding and correction process, there were 58 separate sequences not placed into the four chromosome-scale scaffolds (“unplaced”): 57 unscaffolded input contigs and a short (363 kb) Hi-C scaffold of 6 input contigs. Of the 57 unscaffolded input contigs, 46 were marked as “debris” (the regions cut out during the Hi-C correction step above). The “debris” was largely comprised of optical-map gap sequences from the input assembly. Six unscaffolded debris sequences ranging from ~ 25–50 kb that had 100% N content were removed from the assembly (leaving 52 unplaced sequences). Furthermore, 24 unscaffolded sequences had leading or trailing NNNNN-content ranging from 1,489—36,461 bp, making up ~ 6—96% of the unscaffolded sequences in which they resided, that was trimmed off. Only ~ 19.5 kb of internal N content remained in the cleaned-up debris regions after removal and trimming of N-gap sequence. The primary assembly for Bcop_v2 consists of the four chromosome-scale scaffolds whereas the associated assembly for Bcop_v2 consists of the 52 cleaned-up unplaced sequences as well as all the “associated contigs” from Bcop_v1. For simplicity, even though the associated assembly of Bcop_v2 contains a scaffolded sequence, they are all referred to as ‘associated contigs’ throughout. In Bcop_v2, the retained BioNano Genomics optical map gap sequences are represented as capital N’s in the assembly and are of variable estimated lengths whereas the Hi-C gap sequences are represented as lower-case n’s and are always arbitrarily set to 100 bp in length.
There were 539 “associated contigs” from Bcop_v1 [9], largely comprised of haplotigs and repeat variants, that were not included during the Hi-C scaffolding process, which summed to 10.8 Mb and were ~ 20 kb on average. As a semi-final step in producing the updated reference genome, Bcop_v2, these 539 contigs were also added back to the assembly. Moving forward, the four chromosome-scale scaffolds are referred to as the “primary assembly” for Bcop_v2 whereas the “associated assembly” for Bcop_v2 is composed of the 52 unplaced input contigs and the 539 “associated contigs” from Bcop_v1, totaling 591 “associated contigs” for Bcop_v2. Most of the associated assembly is comprised of haplotigs and repeat variants. Researchers interested mainly in the four chromosome-scale scaffolds can opt to use or ignore the associated assembly.
Anchoring and orienting the chromosome-scale scaffolds
Sequences with known chromosomal locations based on previous in situ hybridization work [17,18,19,20,21,22,23,24,25,26,27,28,29] were used with BLAST [72] to identify the corresponding chromosome for each scaffold (Figs. 2 and 3, Tables 2 and 3). Centromere positions in all chromosome scaffolds were approximated using the centromere-associated repeat sequence, Sccr (Sciara coprophila centromeric repeat), that is known to hybridize to the centromeres of all four somatic chromosomes [32], and an associated repeat family that contains Sccr. For localization of centromeric and pericentromeric sequences within the chromosome-scale scaffolds, the short Sccr sequences [32] were used to identify Sccr-enriched Repeat Families in our custom de novo repeat libraries learned by RepeatModeler on the previous version of this genome (Bcop_v1) [9]. Sccr and Sccr-associated Repeat Families were mapped to Bcop_v2 using RepeatMasker [73]. To visualize the coverage and lengths of Sccr-associated alignments in IGV [74], and to identify the most likely centromere positions in the scaffolds, the lengths of all Sccr-repeat-family-associated alignments over a given position were summed for each position in the genome. The proportion of bases covered by Sccr-related or F4-related sequence in 100 kb bins was also computed with BEDtools [75] and visualized in IGV [74]. Approximated centromere positions within the scaffolds were used to compare to known centromere positions within the polytene chromosomes [22, 30,31,32]. The anchor sequences and centromere positions were used to orient the chromosome-scale scaffold sequences to be concordant with conventional polytene maps [31]. Polytene map orientation was the final processing step for Bcop_v2, which was deposited to NCBI (GenBank GCA_014529535.2; WGS VSDI02; release date 2023–01–04). For researchers interested in how the contigs in Bcop_v1 correspond to Bcop_v2, we have also deposited AGP and BED files mapping these two assemblies together to: https://github.com/JohnUrban/Bcop_v2.
Other Hi-C analyses on Bcop_v2
After correction, scaffolding, post-processing, cleaning, anchoring, polytene orientation, and otherwise all finishing steps to produce the final updated reference genome (Bcop_v2), the paired-end Hi-C reads were realigned to Bcop_v2 and interaction frequencies were computed and visualized similar to the above pipelines for correction and scaffolding: the output of mapping reads with BWA-MEM (− 5SP -t 16; version 0.7.17-r1198-dirty) [65, 66] was piped into SAMBLASTER (–removeDups; version 0.1.26) [67] to mark and remove duplicates, and further piped into SAMtools “view” (-F 2316; version 1.15.1 using htslib 1.15.1) [68] to filter out alignments that were marked as supplemental or secondary and/or where one or both mates did not map. BEDtools “bamtobed” (-bedpe; version v2.30.0) [75] was used to convert BAM records to “BEDPE” format. AWK was used to convert BEDPE format to Juicer’s “extra short” format, from which a HIC file (“.hic”) was created using JuicerTools (min. MAPQ of 10; v2.20) [71]. The HIC file was visualized in JuiceBox (v2.15) [69]. Note that the procedure described above for aligning, filtering, and visualizing Hi-C data was used to create all Hi-C map figures of Bcop_v1, Bcop_v1_corrected, and Bcop_v2 that appear in this manuscript.
JuiceBox [69] visualization aided manual inspection of the Hi-C signal across Bcop_v2 corresponding to individual chromosomes and individual contig joins within chromosomes. BED files for optical map gaps were used to visualize their locations and help interpret drops in Hi-C signal and depth of coverage on the Hi-C maps. BED files of Bcop_v1_corrected contig locations across Bcop_v2 were used to visualize them as squares across individual chromosome-scale scaffolds for manual inspection of the Hi-C-guided contig joins. To visualize and manually inspect contig joins in IGV [74], we extracted just the interaction frequencies between the start or end of a selected contig (left-most or right-most 1 kb or 10 kb of the contig) with 1 kb (or 10 kb) bins corresponding to the rest of the chromosome it resided on. Briefly, to do so we extracted “target-linked mates”, which are Hi-C paired-end reads that had at least one of the mates mapped within the targeted bin. Target-linked mate locations (interactions) in BEDPE format were visualized as arcs or links in IGV, which all emanate from the target region bin. However, although visualizing the links this way helps see all regions that are connected to the target bin by paired reads, it vastly misrepresents the interaction frequencies. Therefore, to visualize interaction frequencies across the scaffold, BEDtools was used to count the number of target-linked mates in bins that were constructed on the contig-bearing scaffold starting from the start/stop boundaries of the target region. Since interaction frequency decays with distance, log10 was also used (after adding a pseudocount of 1 to all bins) to better visualize long-distance interactions, such that a log10 score of 0 represented 0 interactions.
Testing coverage expectations of the X chromosome compared to autosomes
Sequencing and mapping coverage across the four chromosome-scale scaffolds was visualized using a variety of orthogonal genomic datasets produced with different technologies and different life stages from previous studies [9, 10]. In sum, we used Oxford Nanopore MinION and PacBio long reads from embryonic gDNA [9], BioNano Genomics long-range optical maps from male pupae gDNA [9], and Illumina paired-end reads from male and female adult gDNA [10]. Paired-end short reads were mapped as in Baird et al. [10]. All long reads were mapped with Minimap2 with technology-specific settings [76]. The optical maps were aligned with Maligner [77] with the following considerations: (i) the Bcop_v2 reference genome sequence was converted to ‘in silico optical maps’ (make_insilico_map with CACGAG recognition sequence); (ii) the in silico Bcop_v2 maps and raw optical map data were “smoothed” using smooth_maps_file with “-m 1500” and “-m 600”, respectively; and (iii) smoothed optical maps were aligned to the smoothed in silico Bcop_v2 maps with maligner_dp (“–max-alignments 1” and otherwise default parameters). Genomic DNA sequencing coverage was computed for each dataset using BEDtools [75] and converted to bigWig using UCSC Kent Utilities [78] for visualization in IGV [74]. Coverage from long reads was computed after filtering for a minimum read length of 5 kb and a minimum MAPQ of 40. Coverage from BioNano optical maps was computed after filtering for molecules with a minimum length of 150 kb that aligned to Bcop_v2 with a maximum Maligner M-score of − 10.
Gene annotations for the chromosome-scale assembly (Bcop_v2)
There are presently two high quality gene annotation sets produced using the previous genome assembly (Bcop_v1). One was constructed by us [9] using Maker2 [52], which can be found at USDA Ag Data Commons [51]. The other was made by the NCBI Eukaryotic Genome Annotation Pipeline for RefSeq, is called NCBI Bradysia coprophila Annotation Release 100, and can be found at NCBI [53]. Both were previously shown to contain ~ 97% of expected Dipteran BUSCOs [9, 79]. While the FASTA files containing the transcript and protein sequences are in no need of updating, the GFF files that detail the coordinates of the gene models, including their exons, introns, CDS, and UTRs, on Bcop_v1 are not useful for the updated assembly, Bcop_v2. Since Bcop_v2 was produced by stitching together the Bcop_v1 sequences, a simple “liftover” procedure, where features from an assembly can be exactly mapped to an updated version, was adequate for creating new GFF files detailing the coordinates of the genes on Bcop_v2. To lift-over gene annotation sets originally produced on Bcop_v1, we used the program designed for this task, LiftOff [80] (version v1.6.3) with Minimap2 [76] (version 2.24-r1122). Along with the Bcop_v1 and Bcop_v2 assembly FASTA files and the Bcop_v1 GFF file, the following parameters were given to LiftOff: -p 16 -u unmapped.txt -polish -flank 0.25 -a 0.9 -s 0.9. The “polished” and “unpolished” outputs were identical for the Maker2 annotation, and were identical except for one gene for NCBI, which was extended 115 bp. Manual inspection showed the “unpolished” version to be correct in terms of matching Bcop_v1. Therefore, the “unpolished” versions are provided. GFF files for both gene annotation sets were deposited to: https://github.com/JohnUrban/Bcop_v2.
To compare the lifted-over gene models to RNA-seq evidence, RNA-seq datasets were first mapped genome-wide to Bcop_v2 without reference to any gene models using STAR [81]. Resulting BAM files were indexed with SAMTools [68]. Genome-wide coverage was computed with BEDtools [75] using the genomecov (-split -bg -ibam) and sortBed functions to produce a bedGraph that was converted to bigWig using bedGraphToBigWig from UCSC Kent Utilities [78]. The gene model GFF files produced by LiftOff [80] and the bigWigs for RNA-seq coverage profiles across Bcop_v2 were viewed in IGV [74].
To compare male and female expression of lifted-over gene models grouped by chromosome, we used published RNA-seq data for males and females across various life stages [9]. Transcript level quantification of all genes in the NCBI annotation set [53] lifted over to Bcop_v2 was carried out by RSEM (v1.3.1) [82] coupled with STAR (version 2.7.10a_alpha_220818) [81] using the “rsem-calculate-expression” pipeline. The RSEM-calculated expected counts were used with EdgeR [83] for between-sample normalization and differential expression analysis. Plots of EdgeR log2 fold-change results were made in Python.
Using comparative genomics and evolutionary expectations to test the groupings of Bcop_v1 contigs into chromosome-scale scaffolds (Bcop_v2):
We identified orthologous relationships across all proteins within the proteomes of the five species using OrthoFinder [84, 85], which also estimated the proteome-wide rooted species tree. For B. coprophila, we used NCBI Bradysia coprophila Annotation Release 100 protein set for Bcop_v1 (NCBI GenBank GCA_014529535.1; RefSeq GCF_014529535.1) [53] and the corresponding gene coordinates lifted over to Bcop_v2 described above. For the other species, we used the protein sets and genomic GTF files associated with the following NCBI accessions: Drosophila melanogaster (NCBI GenBank GCA_000001215.4; RefSeq GCF_000001215.4) [36]; Pseudolycoriella hygida (NCBI GenBank GCA_029228625.1) [38]; yellow fever mosquito (Aedes aegypti; NCBI GenBank GCA_002204515.1; RefSeq GCF_002204515.2) [37]; African malaria mosquito (Anopheles gambiae; NCBI GenBank GCA_000005575.1; RefSeq GCF_000005575.2) [35]. The proteome-wide rooted species tree was visualized using icy tree [86] by providing the Newick code returned by OrthoFinder [84, 85]: “(D. mel:0.243236,((A. gam:0.26685,A. aeg:0.227034)0.954721:0.17008,(P. hyg:0.177818,B. cop:0.123002)0.969863:0.240218)1:0.243236);”.
OrthoFinder [84, 85] creates distinct “Ortho Groups” containing all orthologs for a given gene from all given species. Thus, one can use the genes within an Ortho Group to compare the genomic locations of the “same” gene across all species. For simplicity and higher specificity, we used only the subset of Ortho Groups that are composed of a single ortholog from each and all of the five species, otherwise known as Single Copy Orthologs (SCOs). Thus, using the genomic location of a single gene from each species in an orthogroup of SCOs, we can make dot plots (or scatter plots), that show the location of the “same” gene in one species’ genome on the X-axis and another species’ genome on the Y-axis. Such dot plots were made in R using the assembalign.dotplot function from Lave [9, 87] (https://github.com/JohnUrban/lave). Note that we used “conserved fly genes” in other analyses, which were B. coprophila genes that had one or more orthologs in all five Dipteran species, as opposed to only SCOs (which have one and only one copy in all five species). Genespace [88] was also used with OrthoFinder [84, 85] and MCScanX [89] to determine syntenic and visualize syntenic blocks.
We quantified the amount of “gene shuffling” using entropy, implementing this analysis in Python3. We call this approach EGGS (Entropy of Gene Group Shuffling), of which code to help reproduce these entropy calculations can be found https://github.com/JohnUrban/Bcop_v2. Briefly, entropy can be thought of as a measure of uncertainty, information, disorder, or “mixed-up-ness”, and is often used to study systems that evolve from a highly ordered to a highly disordered state or from a non-random to a random state [40]. Mathematically, entropy is defined as − 1 times the sum of p*log(p) for all probabilities, p, in a set of probabilities, P. That is, − 1*sum(p*log(p)). When log base 2 is used, entropy is expressed in “bits”. For “gene shuffling”, the initial non-random (lowest entropy) state corresponds to how genes are grouped on chromosomes within a given species. As genes get shuffled around into different chromosomal groupings across evolutionary time, they approach randomized (high entropy) states with respect to some target species (e.g., B. coprophila). To calculate the entropy between any two species, we used only SCOs that appeared on the chromosome-scale scaffolds of each species, which is to say we excluded those that appeared on unplaced contigs. We counted the number of SCOs that appeared on each pair of inter-species chromosomes (i.e. chromosome “i” from species A and chromosome “j” from species B), added a pseudo-count of 0.1 to the SCO count of each inter-species pair, then divided each inter-species pair SCO count by the sum of SCO counts across all pairs of inter-species chromosomes (including pseudo-counts). Note that, for D. melanogaster and A. gambiae, the SCO counts from the left and right arms of chromosomes 2 and 3 (2L, 2R, 3L, 3R) were summed to get the full counts for those chromosomes (2L + 2R, 3L + 3R). For each pair of species, this gave a table of joint probabilities of a SCO appearing on chromosome “i” from species A and chromosome “j” from species B (see Table 6 for an example). The inter-species entropy was computed from these joint probabilities using the entropy equation: − 1*sum(p*log2(p)), where p is the matrix of joint probabilities. However, as each pairwise species comparison has a different lowest possible entropy state (the non-random initial state) and a different highest possible (fully random) entropy state given their SCO counts per chromosome, we needed to use min–max normalization ((X-min)/(max–min)) in order to compare entropy values across pairwise inter-species comparisons. To calculate the minimum (initial) entropy state for each species within the context of a pairwise comparison, we marginalized (i.e. summed) over the joint probabilities to get the marginal probabilities that a SCO came from each chromosome for a given species. The entropy of the marginal probabilities of both species was then computed separately, using the equation above, and averaged together to get the lowest entropy value the joint probabilities could take. When comparing a species to itself, the average simply returns the entropy of the marginal probabilities for that species. To calculate the maximum (fully random) entropy state between two species, the marginal probabilities for each chromosome of a given species were multiplied by the marginal probabilities for each chromosome of the other species. This yielded a table of fully-random joint probabilities where the probability of a SCO appearing on chromosome “i” from species A and chromosome “j” from species B was simply the product of the marginal probability of a SCO appearing on chromosome “i” from species A and the marginal probability of a SCO appearing on chromosome “j” from species B. For the given chromosomal SCO counts for each species, this produces the joint probabilities expected at random, and maximum entropy is computed on it as above: − 1*sum(p*log2(p)), where p is the matrix of random joint probabilities. The Seaborn library [90] in Python3 was used to visualize the min–max normalized entropy scores calculated against B. coprophila as a bar plot and all pairwise entropy scores as a clustered-heatmap.
Testing gene locations of “alien” P450 genes within the chromosome-scale scaffolds using a variety of genomic datasets
If the “alien” cytochrome P450 (“CYP”) genes reported recently [7] are contained within the nuclear chromosomes, then they should exhibit expected Hi-C interaction frequencies [11] within the scaffolds. If they are external contamination erroneously stitched into the scaffold sequences, there should be severe drops in the Hi-C interaction frequency signal with these genes and the rest of the scaffold they reside in. To obtain all Hi-C interactions between a given “alien” P450 gene and all other sites in the genome, the same procedure described above for contig ends was used, here using Hi-C paired-end reads that had at least one of the mates mapped within the gene. For each “alien” P450 gene, the closest highly conserved fly gene that had a similar length (within 1% difference) was also interrogated the same way. Interactions and interaction frequencies in 100 kb bins were computed and visualized in IGV similar as described above, but with important differences. The number of interactions between a gene and the rest of the genome, and whether interactions are sparse or dense, is a function of gene length. To make valid comparisons with nearby conserved fly genes, gene length was controlled for in two ways. The first was to use nearby conserved fly genes of similar size (within 1%), as stated above. This helped match the observed interaction sparsity or density. The second was to normalize counts to each gene’s length, then multiply by 10,000 such that the magnitude of counts for all genes corresponded to a gene length of 10 kb, which was chosen since the “alien” gene lengths spanned from ~ 1.8–7.8 kb.
Read depth can also differ significantly at sites of contamination, especially across different biological samples. Some samples may show low or no sequencing depth over sites where contamination was integrated into a genome assembly. Other samples with an abundance of the contaminant may show spikes in read depth there. It is unlikely for such sites of integrated contamination to have no disruptions in the depth signal across multiple samples. To investigate the DNA-sequencing coverage over the regions bearing “alien” P450 genes, we queried the sequencing depth (computed as above) of a diverse set of biological samples, including different sexes and developmental stages [9, 10], using datasets from orthogonal technologies, including Illumina-paired end reads [10], confidently-mapped (MAPQ ≥ 40) long (≥ 5 kb) Nanopore and PacBio reads [9], and stringently-mapped ultra-long (≥ 150 kb) optical maps from BioNano Genomics [9]. See above “coverage” Methods subsection for details.
Repeat annotations for the chromosome-scale assembly (Bcop_v2)
RepeatMasker [73] was used to re-annotate the repeats across Bcop_v2 using comprehensive repeat libraries as done previously for Bcop_v1 [9], rather than using a lift-over process. As previously, the comprehensive repeat library contained (i) species-specific repeat libraries learned with RepeatModeler [91], (ii) previously known repeat sequences from Bradysia coprophila found at NCBI prior to Bcop_v1 [9], and (iii) all Arthropod repeats in the RepeatMasker Combined Database, which included Dfam_Consensus- 2018/10/26 [92] and RepBase- 2018/10/26 [93]. The corresponding repeat annotation GFF file for Bcop_v2 was deposited to: https://github.com/JohnUrban/Bcop_v2. RepeatMasker [73] was also used to map just the arthropod repeats alone. TandemRepeatFinder (TRF) [94] was used independently to map and visualize just tandem repeats in the genome (trf ${FA} 2 7 7 80 10 50 2000 -f -d -m). Repeat densities in 100 kb bins across the genome were computed using BEDtools [75]. Briefly, either all repeats in the RepeatMasker GFF or targeted repeats (e.g. a given family like LINEs) were extracted and converted to BED; the proportion of bases corresponding to those repeat intervals within 100 kb windows was calculated using the “coverage” BEDtool [75]. Repeat density bedGraphs were visualized in IGV [74]. Retrotransposon-related sequences and satellite sequences were found by inspecting the repeat annotation and tandem repeat finder results in the first 10 kb (or up to 100 kb) of the scaffold termini using IGV [74], and looking at dot plots of the first 10–100 kb. Their distributions across the scaffolds were further explored by BLASTing [72] them back to the assembly.
Chromosome-wide and genome-wide dot plots were made to visualize the repeat structure across the scaffolds. Briefly, Minimap2 [76] was used to map chromosomes against themselves or the entire genome against itself using parameters optimized for dot plots (-PD -k19 -w19 -m200 -t8). The PAF output files were visualized as dot plots in R using the assembalign.dotplot function from Lave [9, 87], which presents the pairwise-aligned query and target intervals as line segments. BED files for the locations of anchor sequence midpoints ± 500 kb, estimated centromere midpoints ± 500 kb, and estimated locations of Fold-Back Regions 1–3 and the X’ long paracentric inversion breakpoints were provided to annotate the plot margins for reference points.
Extracting and analyzing locus sequences corresponding to long-range Hi-C interactions on the X chromosome
The locations of the loci on Fold-Back Regions on the X chromosome that appear as three off-diagonal dots corresponding to long-range interacting loci on the Hi-C maps, as well as the “main breakpoints” corresponding to a long paracentric inversion with the X’ chromosome variant, were learned by analyzing the paired Hi-C reads that mapped to the X with long distances in between them. Two datasets were explored: male (XO) pupae data from this study as well as female X’X Hi-C data from a recent analysis of the X’ chromosome [10]. We used our own method called SCOPE (Scatter Clusters Of Paired Ends), of which the specific code for the analyses in this paper can be found at https://github.com/JohnUrban/Bcop_v2. Briefly, paired-end Hi-C reads (in Juicer’s Extra Short format [71]) that mapped to the X chromosome using the above pipeline were read into R where map locations were further processed to filter for paired reads with (i) inter-mate distances of at least 1 Mb, and (ii) minimum mapping qualities (MAPQ) of 10, although a range of MAPQs were explored, including no filtering, and gave similar end results. The 2D Binned Kernel Density Estimate (from the ‘KernSmooth’ package) was then taken using a 1024 × 1024 grid (~ 68.9 kb bins) with 250 kb smoothing bandwidths on both the x- and y- axes corresponding to where mate1 and mate2 mapped across the X. During optimization and exploratory analyses, kernel-smoothed 2D grids were visualized using levelplot from the ‘lattice’ package. Peaks in the Z (heat) dimension of the 2D grid were called by identifying x,y-coordinates where the Z dimension was above the 99.8 th percentile across the entire grid. In other words, Z-peaks corresponded to pairs of distant loci (> 1 Mb) with the highest numbers of Hi-C reads connecting them. The x,y-coordinates of peaks in the 2D grid were then hierarchically clustered, and the tree was cut to identify k = 12 peak clusters on the 2D grid. The median, minimum, and maximum x- and y- coordinates of each cluster were obtained to help define the centers and the boundaries of interacting pairs of loci on the X chromosome, as was the coordinates of the peak summit bin (the bin with the highest interaction frequency in a peak cluster). The peak clusters were filtered for those whose median x-coordinates were separated by 5 Mb from the median y-coordinates. In other words, these were loci pairs separated by at least 5 Mb. Three of the four remaining peak clusters separated by at least 5 Mb corresponded to the interactions between putative FBRs 1 and 2, putative FBRs 1 and 3, and putative FBRs 2 and 3. The fourth remaining peak cluster corresponded to the breakpoint dots for the long paracentric inversion with the X’. The peak clusters for loci separated by less than 5 Mb may correspond to (i) other smaller structural variations between the X’ and X, and (ii) other shorter Mb-range interactions of distal loci. The regions corresponding to the three FBRs and the two main X’ breakpoints were visualized on Hi-C maps using JuiceBox [69]. Note that using inter-mate distances of 5 Mb instead of 1 Mb and k = 4 clusters gave the same end results without needing further filtering. With these parameters peak clusters ranged from 551 kb to 1.1 Mb (8–16 bins) and had summit bins of 68.9 kb. We also identified parameters for much narrower estimates of FBR locations using MAPQ of 10, intermate distance minimum of 5 Mb, a 2048 × 2048 grid (~ 34.4 kb bins) with a smoothing bandwidth of 50 kb, a peak quantile threshold of 0.9999, and k = 4 peak clusters, which produced four interaction peak clusters in the ~ 103–551 kb range (3–16 bins) with summit bin estimates of ~ 34.4 kb.
The nucleotide distances between the mid-points of the FBR regions, X’ breakpoints, and centromeres observed on the X scaffold were compared to expected nucleotide distances given corresponding polytene map distances in proportion to the length of the X chromosome scaffold. The X polytene map is organized into 14 zones of similar lengths, each with 3 subzones (A, B, and C) for a total of 42 sub-zones. Assuming nucleotide lengths of zones are also similar, the ~ 70.5 Mb X-scaffold would indicate zone lengths of ~ 5 Mb and sub-zone lengths of ~ 1.68 Mb, which we used to compute expected lengths. The FBR coordinates were used to define both the sequences of the FBRs as well as the Genomic Complement (the genome sequence excluding the FBRs). FBR and Genomic Complement sequences were extracted using BEDtools [75]. JellyFish [95] was used for kmer operations on both sets of sequences. BLAST [72] was used to compare FBR sequences to each other and to the Genomic Complement sequences. FBR dot plots were made as described for the entire genome and individual chromosomes above. BEDtools [75] was used to extract and analyze genes and repeats in the FBRs. The FBR coordinates were also used to annotate the X chromosome dot plot, IGV traces showing repeats, and Hi-C interaction frequency maps.
The coordinates of loci pairs corresponding to other interaction dots on the X chromosome (and chromosome IV) were identified as described for FBRs above using the SCOPE (Scatter Clusters Of Paired Ends) approach. Specifically, in the X’X Hi-C data we used a MAPQ of 10, intermate distance minimum of 2 Mb, a 1024 × 1024 grid with a smoothing bandwidth of 100 kb, a peak quantile threshold of 0.995, with k = 50 peak clusters, and cluster filtering for interactions > 5 Mb. To remove dots that not likely to be from true interactions, we also used paired-end reads from X’X adult female genomic DNA (gDNA) control samples from a recent study [10] with a similar approach to find “long-distance interactions”, which in this case correspond to a background of structural differences between the X and X’ chromosomes. Specifically, for the X’X gDNA control, reads were mapped as described for Hi-C data above, and long-range interactions were detected as described above, but using a minimum MAPQ of 10, intermate distance minimum of 1 Mb, a 1024 × 1024 grid with a smoothing bandwidth of 250 kb, a peak quantile threshold of 0.999, with k = 50 peak clusters. Structural variants between the X and X’ chromosomes were additionally identified with Smoove/Lumpy [96, 97] for the X’X gDNA control samples exactly as described previously [10]. Bedtools [75] was used to filter out interaction regions identified in the X’X Hi-C data that were also found in the X’X gDNA control or that corresponded to SV breakpoints that were found with Smoove/Lumpy. Juicebox was used to visualize the Hi-C maps, and to annotate them with the structural variants found by both methods. The dot on chromosome IV was identified using a MAPQ of 10, intermate distance minimum of 5 Mb, a 2048 × 2048 grid with a smoothing bandwidth of 100 kb, a peak quantile threshold of 0.99995, and with k = 1 peak cluster. Code for these analyses can also be found at https://github.com/JohnUrban/Bcop_v2. Where relevant, genes found in an extracted region or regions were assigned GO terms and tested for enrichment exactly as described previously [98].
Data availability
This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the accession VSDI00000000. The version described in this paper, referred to throughout as Bcop_v2, is version VSDI02000000 (deposited 2022; released Jan 4, 2023). This project is associated with BioProjects PRJNA291918 and PRJNA672144; and BioSamples SAMN12533751 and SAMN20343824. Male Pupa Hi-C data was deposited to SRA: SRR23335771. Lifted over gene model GFFs, repeat annotation GFF, B. coprophila-specific repeat libraries, as well as AGP and BED files mapping Bcop_v1 to Bcop_v2 were deposited to a github repository: https://github.com/JohnUrban/Bcop_v2. The RNA-seq samples from B. coprophila from different life stages used to analyze dosage compensation and test gene models are from Urban et al. (2021) [9] and associated with BioProject PRJNA748150. The RNA-seq samples from the radiation study used to test gene models and visualize exon coverage are from Urban et al. [98] and associated with BioProject PRJNA928089. The long reads from Oxford Nanopore and PacBio, and optical maps from BioNano Genomics are from Urban et al. (2021) [9] and associated with BioProject PRJNA291918, and the Illumina reads from male and female adults as well as X’X Hi-C data are from Baird et al. [10] and associated with BioProject PRJNA953429.
Abbreviations
- AGP:
-
A golden path (file format)
- Bcop:
-
Bradysia coprophila
- bp:
-
Base pairs
- BED:
-
Browser Extensible Data (file format)
- BUSCO:
-
Benchmarking Universal Single-Copy Orthologs
- CDS:
-
Coding sequence
- FBR:
-
Foldback region
- GB:
-
Gigabase pairs
- GFF:
-
General Feature Format (file format)
- GTF:
-
General Transfer Format (file format)
- Hi-C:
-
High dimensional Chromosome conformation capture
- i5k:
-
Initiative to sequence 5000 insect (and other arthropod) genomes (http://i5k.github.io/about)
- kb:
-
Kilobase pairs
- LCTTR:
-
Long complex terminal tandem repeats
- LINE:
-
Long interspersed nuclear element
- LTR:
-
Long terminal repeats
- Mb:
-
Megabase pairs
- MMNE:
-
Min–Max Normalized Entropy
- NCBI:
-
National Center for Biotechnology Information
- PacBio:
-
Pacific Biosciences
- PCR:
-
Polymerase chain reaction
- RTE:
-
Retrotransposable element
- RTRS:
-
Retrotransposon related sequence
- Sccr:
-
Sciara coprophila Centromeric repeat
- ScRTE:
-
Sciara coprophila Retrotransposable element as defined in Escribá et al.
- SMRT:
-
Single-molecule real-time
- ONT:
-
Oxford Nanopore Technologies
References
Gerbi SA. Unusual chromosome movements in sciarid flies. In: Hennig W, editor. Results and problems in cell differentiation. Springer, Berlin: Heidelberg; 1986. p. 71–104.
Gerbi SA. Non-random chromosome segregation and chromosome eliminations in the fly Bradysia (Sciara). Chromosom Res. 2022;30:273–88.
Rasch EM. DNA cytophotometry of salivary gland nuclei and other tissue systems in dipteran larvae. In: Wied GL, Bahr GF, editors. In Introduction to Quantitative Cytochemistry. New York: Academic Press; 1970. p. 357–97.
Gerbi SA, Strezoska Z, Waggener JM. Initiation of DNA replication in multicellular eukaryotes. J Struct Biol. 2002;140:17–30.
Rasch EM. Two-wavelength cytophotometry of Sciara salivary gland chromosomes. In: Wied GL, Bahr GF, editors. Introduction to Quantitative Cytochemistry. New York: Academic Press; 1970. p. 335–55.
Hodson CN, Jaron KS, Gerbi S, Ross L. Gene-rich germline-restricted chromosomes in black-winged fungus gnats evolved through hybridization. PLOS Biol. 2022;20:e3001559.
Feyereisen R, Urban JM, Nelson DR. Aliens in the CYPome of the black fungus gnat. Bradysia coprophila Insect Biochem Mol Biol. 2023;159:103965.
Chen X, Urban JM, Wurlitzer J, Wei X, Han J, O’Connor SE, et al. Canonical terpene synthases in arthropods: Intraphylum gene transfer. Proc Nat Acad Sci. 2024;121:e2413007121.
Urban JM, Foulk MS, Bliss JE, Coleman CM, Lu N, Mazloom R, et al. High contiguity de novo genome assembly and DNA modification analyses for the fungus fly, Sciara coprophila, using single-molecule sequencing. BMC Genomics. 2021;22:1–23.
Baird RB, Urban JM, Mongue AJ, Jaron KS, Hodson CN, Grewoldt M, et al. Recent Evolution of a Maternally Acting Sex-Determining Supergene in a Fly with Single-Sex Broods. Mol Biol Evol. 2023;40.
Lieberman-Aiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–93.
Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159:1665–80.
Bickhart DM, Rosen BD, Koren S, Sayre BL, Hastie AR, Chan S, et al. Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome. Nat Genet. 2017;49:643–50.
Burton JN, Adey A, Patwardhan RP, Qiu R, Kitzman JO, Shendure J. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol. 2013;31:1119–25.
Kamnert I, López CC, Rosén M, Edström JE. Telomeres terminating with long complex tandem repeats. Hereditas. 1997;127:175–80.
Kuznetsova V, Grozeva S, Gokhman V. Telomere structure in insects: A review. J Zool Syst Evol Res. 2020;58:127–58.
Pardue M Lou, Gerbi SA, Eckhardt RA, Gall JG. Cytological localization of DNA complementary to ribosomal RNA in polytene chromosomes of Diptera. Chromosoma. 1970;29:268-90.
Gerbi SA. Localization and characterization of the ribosomal RNA cistrons in Sciara coprophila. J Mol Biol. 1971;58:499–511.
Foulk MS, Liang C, Wu N, Blitzblau HG, Smith H, Alam D, et al. Ecdysone induces transcription and amplification in Sciara coprophila DNA puff II/9A. Dev Biol. 2006;299:151–63.
Greciano PG, Ruiz MF, Kremer L, Goday C. Two new chromodomain-containing proteins that associate with heterochromatin in Sciara coprophila chromosomes. Chromosoma. 2009;118:361–76.
Foulk MS, Waggener JM, Johnson JM, Yamamoto Y, Liew GM, Urnov FD, et al. Isolation and characterization of the ecdysone receptor and its heterodimeric partner ultraspiracle through development in Sciara coprophila. Chromosoma. 2013;122:103–19.
Crouse HV. X heterochromatin subdivision and cytogenetic analysis in Sciara coprophila (Diptera, Sciaridae): Centromere localization. Chromosoma. 1977;63:39–55.
Crouse HV, Gerbi SA, Liang CM, Magnus L, Mercer IM. Localization of ribosomal DNA within the proximal X heterochromatin of Sciara coprophila (Diptera, Sciaridae). Chromosoma. 1977;64:305–18.
Kerrebrock AW, Srivastava R, Gerbi SA. Isolation and characterization of ribosomal DNA variants from Sciara coprophila. J Mol Biol. 1989;210:1–13.
DiBartolomeis SM, Gerbi SA. Molecular characterization of DNA puff II/9A genes in Sciara coprophila. J Mol Biol. 1989;210:531–40.
Bienz-Tadmor B, Smith HS, Gerbi SA. The promoter of DNA puff gene II/9-1 of Sciara coprophila is inducible by ecdysone in late prepupal salivary glands of Drosophila melanogaster. Cell Regul. 1991;2:875–88.
Wu N, Liang C, DiBartolomeis SM, Smith HS, Gerbi SA. Developmental progression of DNA puffs in Sciara coprophila: amplification and transcription. Dev Biol. 1993;160:73–84.
Mok EH, Smith HS, DiBartolomeis SM, Kerrebrock AW, Rothschild LJ, Lange TS, et al. Maintenance of the DNA puff expanded state is independent of active replication and transcription. Chromosoma. 2001;110:186–96.
Urnov FD, Liang C, Blitzblau HG, Smith HS, Gerbi SA. A DNase I hypersensitive site flanks an origin of DNA replication and amplification in Sciara. Chromosoma. 2002;111:291–303.
Crouse HV. Translocations in Sciara: their bearing on chromosome behavior and sex determination. Univ Missouri Res Bull. 1943;379:1–75.
Gabrusewycz-Garcia N. Cytological and autoradiographic studies in Sciara coprophila salivary gland chromosomes. Chromosoma. 1964;15:312–44.
Escribá MC, Greciano PG, Méndez-Lago M, De Pablos B, Trifonov VA, Ferguson-Smith MA, et al. Molecular and cytological characterization of repetitive DNA sequences from the centromeric heterochromatin of Sciara coprophila. Chromosoma. 2011;120:387–97.
Rasch EM. Genome size and determination of DNA content of the X chromosomes, autosomes, and germ line-limited chromosomes of Sciara coprophila. J Morphol. 2006;267:1316–25.
Koutsovoulos G, Kumar S, Laetsch DR, Stevens L, Daub J, Conlon C, et al. No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proc Natl Acad Sci U S A. 2016;113:5053–8.
Sharakhova M V., Hammond MP, Lobo NF, Krzywinski J, Unger MF, Hillenmeyer ME, et al. Update of the Anopheles gambiae pest genome assembly. Genome Biol. 2007;8.
Hoskins RA, Carlson JW, Wan KH, Park S, Mendez I, Galle SE, et al. The release 6 reference sequence of the drosophila melanogaster genome. Genome Res. 2015;25:445–58.
Matthews BJ, Dudchenko O, Kingan SB, Koren S, Antoshechkin I, Crawford JE, et al. Improved reference genome of Aedes aegypti informs arbovirus vector control. Nature. 2018;563:501–7.
Trinca V, Carli S, Uliana JVC, Garbelotti CV, Mendes da Silva M, Kunes V, et al. Biocatalytic potential of Pseudolycoriella CAZymes (Sciaroidea, Diptera) in degrading plant and fungal cell wall polysaccharides. iScience. 2023;26:106449.
Wiegmann BM, Trautwein MD, Winkler IS, Barr NB, Kim JW, Lambkin C, et al. Episodic radiations in the fly tree of life. Proc Natl Acad Sci U S A. 2011;108:5690–5.
Seitz W, Kirwan AD. Mixed-Up-Ness or Entropy? Entropy. 2022;24:1090.
Smith CD, Edgar RC, Yandell MD, Smith DR, Celniker SE, Myers EW, et al. Improved repeat identification and masking in Dipterans. Gene. 2007;389:1–9.
Lukhtanov VA. Diversity and evolution of telomere and subtelomere DNA sequences in insects. bioRxiv. 2022;:2022.04.08.487650.
Lukhtanov VA, Pazhenkova EA. Diversity and evolution of telomeric motifs and telomere DNA organization in insects. Biol J Linn Soc. 2023;:1–20.
Crouse HV. X heterochromatin subdivision and cytogenetic analysis in Sciara coprophila (diptera, sciaridae): The Controlling Element. Chromosoma. 1979;74:219–39.
Zhimulev IF, Belyaeva ES, Semeshin VF, Koryakov DE, Demakov SA, Demakova OV, et al. Polytene Chromosomes: 70 Years of Genetic Research. Int Rev Cytol. 2004;241:203–75.
Metz CW. Sex Determination in Sciara. Am Nat. 1929;63:487–96.
Metz CWW, Schmuck MLL. Studies on sex determination and the sex chromosome mechanism in Sciara. Genetics. 1931;16:225–53.
Metz CW, Smith HB. Further observations on the nature of the X-Prime (X’) Chromosome in Sciara. Proc Natl Acad Sci U S A. 1931;17:195–8.
Metz CWW. Chromosome behavior, inheritance and sex determination in Sciara. Am Nat. 1938;72:485–520.
Baird RB, Mongue AJ, Ross L. Why put all your eggs in one basket? Evolutionary perspectives on the origins of monogenic reproduction. Hered 2023 1312. 2023;131:87–95.
Urban JM. Bradysia coprophila genome annotations Bcop_v1.0. USDA Ag Data Commons. 2021. https://doiorg.publicaciones.saludcastillayleon.es/10.15482/USDA.ADC/1522618.
Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 2011;12:491.
NCBI. Bradysia coprophila Annotation Release 100. NCBI Eukaryotic Genome Annotation Pipeline. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Bradysia_coprophila/100/.
Eagen KP, Hartl TA, Kornberg RD. Stable chromosome condensation revealed by chromosome conformation capture. Cell. 2015;163:934–46.
Spradling AC. Polytene Chromosome Structure and Somatic Genome Instability. Cold Spring Harb Symp Quant Biol. 2017;:033670.
Chang CH, Chavan A, Palladino J, Wei X, Martins NMC, Santinello B, et al. Islands of retroelements are major components of Drosophila centromeres. PLOS Biol. 2019;17:e3000241.
Mavrich TN, Jiang C, Ioshikhes IP, Li X, Venters BJ, Zanton SJ, et al. Nucleosome organization in the Drosophila genome. Nature. 2008;453:358–62.
Courret C, Hemmer LW, Wei X, Patel PD, Chabot BJ, Fuda NJ, et al. Turnover of retroelements and satellite DNA drives centromere reorganization over short evolutionary timescales in Drosophila. PLOS Biol. 2024;22:e3002911.
Lukyanchikova V, Nuriddinov M, Belokopytova P, Taskina A, Liang J, Reijnders MJMF, et al. Anopheles mosquitoes reveal new principles of 3D genome organization in insects. Nat Commun 2022 131. 2022;13:1–22.
Mohana G, Dorier J, Li X, Mouginot M, Smith RC, Malek H, et al. Chromosome-level organization of the regulatory genome in the Drosophila nervous system. Cell. 2023;186:3826-3844.e26.
Morinaga G, Balcazar D, Badolo A, Iyaloo D, Tantely L, Mouillaud T, et al. From macro to micro: De novo genomes of Aedes mosquitoes enable comparative genomics among close and distant relatives. bioRxiv. 2025;:2025.01.13.632753.
Ryazansky SS, Chen C, Potters M, Naumenko AN, Lukyanchikova V, Masri RA, et al. The chromosome-scale genome assembly for the West Nile vector Culex quinquefasciatus uncovers patterns of genome evolution in mosquitoes. BMC Biol 2024 221. 2024;22:1–27.
Metz CW. CHROMOSOMES AND SEX IN SCIARA. Science. 1925;61:212–4.
PhaseGenomics. Aligning and QCing Phase Genomics Hi-C Data. Phase Genomics, Seattle, WA. https://phasegenomics.github.io/2019/09/19/hic-alignment-and-qc.html.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–60.
Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–95.
Faust GG, Hall IM. SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics. 2014;30:2503–5.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.
Durand NC, Robinson JT, Shamim MS, Machol I, Mesirov JP, Lander ES, et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst. 2016;3:99.
Dudchenko O, Shamim MS, Batra SS, Durand NC, Musial NT, Mostofa R, et al. The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000. bioRxiv. 2018;:254797.
Durand NC, Shamim MS, Machol I, Rao SSP, Huntley MH, Lander ES, et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst. 2016;3:95–8.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
Smit A, Hubley R, Green P. RepeatMasker Open-4.0. 2013;:http://www.repeatmasker.org.
Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14:178–92.
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.
Li H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.
Mendelowitz LM, Schwartz DC, Pop M. Maligner: a fast ordered restriction map aligner. Bioinformatics. 2015;32:1016–22.
Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics. 2010;26:2204–7.
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva E V, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31.
Shumate A, Salzberg SL. Liftoff: accurate mapping of gene annotations. Bioinformatics. 2021;37:1639–43.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.
Li B, Dewey CN. RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.
Emms DM, Kelly S. OrthoFinder: Phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:1–14.
Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16:1–14.
Vaughan TG. IcyTree: rapid browser-based visualization for phylogenetic trees and networks. Bioinformatics. 2017;33:2392–4.
Urban JM. Lave: A collection of functions in R for genomic tinkering, processing, and plotting. https://github.com/JohnUrban/lave.
Lovell JT, Sreedasyam A, Schranz ME, Wilson M, Carlson JW, Harkess A, et al. GENESPACE tracks regions of interest and gene copy number variation across multiple genomes. Elife. 2022;11.
Wang Y, Tang H, Debarry JD, Tan X, Li J, Wang X, et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 2012;40:e49–e49.
Waskom ML. seaborn: statistical data visualization. J Open Source Softw. 2021;6:3021.
Smit A, Hubley R. RepeatModeler Open-1.0. 2008;:http://www.repeatmasker.org.
Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, et al. The Dfam database of repetitive DNA families. Nucleic Acids Res. 2016;44:D81–9.
Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 2015;6:11.
Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27:573–80.
Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–70.
Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 2014;15:R84.
Pedersen BS, R. L, Quilan A. smoove: structural-variant calling and genotyping with existing tools. Available from: https://github.com/brentp/smoove. 2020.
Urban JM, Bateman JR, Garza KR, Borden J, Jain J, Brown A, et al. Bradysia (Sciara) coprophila larvae up-regulate DNA repair pathways and down-regulate developmental regulators in response to ionizing radiation. Genetics. 2023. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/GENETICS/IYAD208.
Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012;22:557–67.
Acknowledgements
We thank Jacob Bliss who helped with Bradysia coprophila maintenance during data collection. We thank our correspondents at Phase Genomics, including Hayley Mangelson, Kayla Young, Shawn Sullivan, Brian Fan, Andrew Wiser, and Ivan Liachko. We acknowledge and thank Rob Baird and Laura Ross for being early adopters of Bcop_v2 for collaborative studies of the X’ chromosome. We thank Jennifer Urban for proof-reading the manuscript. We thank Ava Filiss in the Gerbi Laboratory for the X and X' polytene chromosome images. We thank all members of the Spradling Laboratory for feedback during lab meetings, and especially Robert Levis for discussions on telomeres. Polytene chromosome maps in figures 3, 7, and 8 were reproduced from plates 1, 2 and 3 of Gabrusewycz-Garcia (1964) [31] with permission from Springer Nature under permission number: 5490830617309. We thank bioRxiv for hosting our 2022 preprint that preceded this paper (https://doiorg.publicaciones.saludcastillayleon.es/https://doiorg.publicaciones.saludcastillayleon.es/10.1101/2022.11.03.515061; version 1, Nov 2022; version 2, Dec 2024).
Funding
Financial support has come from National Institutes of Health (NIH) NIH/GM121455 to SAG. ACS is an Investigator of the Howard Hughes Medical Institute (HHMI) in the HHMI unit at the Carnegie Institute for Science (CIS). JMU is an HHMI/CIS Research Associate under ACS.
Author information
Authors and Affiliations
Contributions
JMU, SAG, and ACS conceived the project. SAG collected male pupae, coordinated with Phase Genomics, and helped with chromosome maps and map positions. Hi-C library preparation, sequencing, and scaffolding was performed as a service provided by Phase Genomics. JMU performed the original Bcop_v1 assembly, and corresponded with Phase Genomics during the optimization process for Bcop_v2 scaffolding. JMU performed all subsequent analyses and operations, including subsequent Hi-C data processing, analysis, exploration and visualizations in all figures, chromosome identification, scaffold orientation, assembly evaluations, gene annotation lift-overs, comparative genomics and entropy analyses, HGT gene validations, repeat annotation and analyses, scaffold termini characterization, and fold-back repeat sequence extraction and characterization; JMU developed the ideas, analyses, and python/R code for dot plots (Lave), for entropy/MMNE analyses (EGGS), and for the long-range Hi-C interaction peak detection method to discover FBR and breakpoint locations (SCOPE); JMU made all figures and tables. ACS helped conceive of analyses, figures, and tables. All authors participated intellectually in the development and execution of this project. All authors wrote, read, and edited the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Urban, J.M., Gerbi, S.A. & Spradling, A.C. Chromosome-scale scaffolds of the fungus gnat genome reveal multi-Mb-scale chromosome-folding interactions, centromeric enrichments of retrotransposons, and candidate telomere sequences. BMC Genomics 26, 443 (2025). https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12864-025-11573-2
Received:
Accepted:
Published:
DOI: https://doiorg.publicaciones.saludcastillayleon.es/10.1186/s12864-025-11573-2