Fig. 2 | BMC Genomics

Fig. 2

From: High-fidelity (repeat) consensus sequences from short reads using combined read clustering and assembly

Automated repeat assembly from RepeatExplorer2 (RE2)-derived superclusters (SCLs) leads to long, continuous and accurate repeat consensus sequences. a Impact of read input: Comparison of repeat assembly quality measures when using different input reads for the MEGAHIT/SPAdes assembly workflow. There are three ways to collect input reads for the assembly workflow: (i) reads in fasta format can be used directly from the superclusters (SCL reads; fa); (ii) reads in fastq format can be selected from the original read data using the reads listed in the supercluster (SCL reads; fq); or (iii) reads in fastq format can be selected from the original data based on their similarity to the supercluster contigs (contig-based; fq). Each colour represents a supercluster; the 100 largest superclusters are displayed, with colour on a continuous scale as outlined in Fig. 2b. Black diamonds show mean values. b Impact of the new MEGAHIT/SPAdes workflow on consensus length: For the 100 most abundant superclusters (SCLs), we compared the combined length of the assembled consensuses (NODEs; right-facing bars) and RE2-contigs (left-facing bars) using the ten best-scoring contigs/NODEs of each supercluster. The central blue bar indicates the length of the sequences shared between NODEs and RE2-contigs, and the depth of the blue shade indicates their mean sequence identity. If the NODE assembly produces longer consensuses, this is indicated by a red dot, whereas a superior RE2 assembly is marked by a green dot. c The accuracy of the generated consensus is illustrated by an in-depth view into a selected repeat family, an Angela-type retrotransposon, represented by supercluster 8 (SCL8): Dotplot comparison of the ten highest-scoring NODEs and RE2-contigs from supercluster 8 (SCL008), together with an ONT long read containing an actual Angela copy and a reference sequence for a more detailed sequence-wise comparison.
The shading refers to the longest common subsequence (LCS), with darker grey indicating a longer sequence overlap. The forward LCS is used for shading in the upper part (above the main diagonal) and the reverse LCS in the lower part (below the main diagonal). The ending “_r” indicates sequences that are displayed as reverse complement.
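
Input option (ii) in panel a, selecting fastq records from the original read data by the read names listed in a supercluster, can be sketched roughly as follows. This is a minimal illustration and not the authors' workflow: the helper names (`parse_fastq`, `read_name`, `select_fastq_by_id`) are hypothetical, and it assumes plain four-line fastq records whose read name is the first whitespace-delimited token of the header line.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# One fastq record: (header, sequence, separator, quality)
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield four-line fastq records; assumes no line-wrapped sequences."""
    block: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line and not block:
            continue  # skip blank lines between records
        block.append(line)
        if len(block) == 4:
            yield (block[0], block[1], block[2], block[3])
            block = []
    if block:
        raise ValueError("truncated fastq record at end of input")


def read_name(header: str) -> str:
    """Return the read name: first token of the header without the '@'."""
    parts = header[1:].split()
    return parts[0] if parts else ""


def select_fastq_by_id(lines: Iterable[str], wanted: Set[str]) -> List[FastqRecord]:
    """Keep only records whose read name is in `wanted` (assumed SCL read list)."""
    return [rec for rec in parse_fastq(lines) if read_name(rec[0]) in wanted]


if __name__ == "__main__":
    # Toy data standing in for the original read file and a supercluster read list.
    fastq_lines = [
        "@r1 extra\n", "ACGT\n", "+\n", "IIII\n",
        "@r2\n", "GGCC\n", "+\n", "JJJJ\n",
        "@r3/1\n", "TTAA\n", "+\n", "KKKK\n",
    ]
    scl_reads = {"r1", "r3/1"}
    for header, seq, sep, qual in select_fastq_by_id(fastq_lines, scl_reads):
        print(header, seq, sep, qual, sep="\n")
```

In practice the read names in a clustering output may carry suffixes or prefixes that differ from the original fastq headers, so the name-matching rule in `read_name` would need to be adapted to the actual data.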
