Comparative analysis of chloroplast genomes and phylogenetic relationships of different pitaya cultivars

Zheng, Enting; Yisilam, Gulbar; Li, Chuanning; Jiao, Fangfang; Ling, Yulan; Lu, Shuhua; Wang, Qiuyan; Tian, Xinmin

doi:10.1186/s12864-025-11581-2

Research
Open access
Published: 09 May 2025

BMC Genomics volume 26, Article number: 463 (2025)

Abstract

Background

Pitaya is an important tropical fruit that is highly favoured by consumers for its flavour and juiciness. Its peel and pulp contain a large amount of betacyanin, a natural food-colouring agent. However, few studies have focused on pitaya chloroplast (cp) genomes.

Results

To explore the genetic differences and phylogenetic relationships among the cp genomes of six pitaya cultivars, we assembled and annotated these genomes and performed a comparative genomic analysis. The cp genomes of the six cultivars exhibited a typical circular structure, ranging in length from 133,146 to 133,617 bp, with a GC content of 36.4%. Each cp genome was annotated with 123 genes, including 80 protein-coding genes, 38 tRNA genes, four rRNA genes, and one pseudogene (ycf68). Six mutational hotspot regions (trnF-GAA-rbcL, trnM-CAU-accD, rpl20-psbB, accD, rpl22, and ycf1) were detected, which could serve as potential molecular markers for population genetics and molecular phylogeny studies. Phylogenetic analysis showed that the pitaya cultivars clustered into a single branch in the phylogenetic tree of the Cactaceae family. Furthermore, the observed phylogenetic patterns suggest a complex genetic basis for colour variation among pitaya cultivars.

Conclusions

The findings of this study expand our understanding of the cp genome of pitaya and the phylogenetic relationships among different cultivars. The genomic data obtained provide important information for the breeding and genetic improvement of pitaya.

Introduction

Pitaya is the fruit of a group of climbing plants belonging to the genera Selenicereus and Hylocereus in the family Cactaceae. These plants originated in tropical and subtropical Central America [1]. Recently, owing to its rich nutritional value and pharmacological effects, the cultivation of pitaya has gradually spread from the Americas to tropical and subtropical countries such as Vietnam and China [2,3,4]. Based on the colour of the fruit peel and pulp, pitaya can be classified into three main types: Selenicereus monacanthus with red peel and pulp, Selenicereus undatus with red peel and white pulp, and Selenicereus megalanthus with yellow peel and white pulp [5]. Pitaya is rich in a variety of nutrients such as betaine, polyphenols, flavonoids, and anthocyanins [6, 7], which are important for the treatment of a variety of diseases such as diabetes, cardiovascular disease, and cancer [8,9,10]. Furthermore, pitaya peel has great potential in the food industry, for example in food packaging and coatings [11]. Current research on pitaya focuses on cultivation techniques [12, 13], nutrient composition [6, 14], pests and diseases [15, 16], and medical value [9, 10]. Despite the potential economic value and health benefits of pitaya, chloroplast (cp) genomics research on pitaya still lags behind that of some traditional fruits. Plant breeding is gradually moving toward the 4.0 era, in which big data and artificial intelligence will provide more efficient and accurate breeding methods [17, 18]. Adequate genomic data are an important foundation for realising “smart breeding”, so more abundant genomic data, including cp genomes, are still needed to build a more systematic database [19].

The species classification of the genera Hylocereus and Selenicereus is controversial. Britton and Rose separated these two genera based on morphological differences [20]. However, because most fruits of plants in both genera are edible, natural hybridisation between the genera occurs. Some hybrid individuals possess characteristics of both Hylocereus and Selenicereus, presenting significant taxonomic difficulties; therefore, researchers have suggested merging these two genera [21]. Korotkova et al. [22] conducted a molecular phylogenetic analysis based on four plastid regions, and their results support transferring all species of Hylocereus to Selenicereus. The chloroplast is an important organelle for photosynthesis in plants, and comparative analysis of plant cp genomes is important for studying evolutionary relationships between relatives and for species identification [23, 24]. In recent years, only two articles have been published on the complete cp genome of pitaya, reporting the cp genome sequences of five Selenicereus species and determining the taxonomic positions of these plants in Cactaceae [25, 26]. Adding complete cp genome information for different pitaya species can help expand the database of pitaya genomes and allow better exploration of genome evolution and phylogenetic relationships within Selenicereus.

In this study, we analysed the sequence structure of the cp genomes of six pitaya cultivars and performed comparative genomic and phylogenetic studies to investigate the genetic differences and relationships among them. The findings contribute to our understanding of chloroplast genomes and evolution within the genus Selenicereus and related species in Cactaceae, while providing valuable genomic resources that could contribute to future pitaya breeding programs. The highly variable sites and SSR sites identified through screening could serve as molecular markers for marker-assisted selection in cultivar development. Meanwhile, comparative genetic analysis of cultivars may enable targeted screening of superior germplasm and optimize hybridization strategies for trait improvement.

Materials and methods

Plant materials

Samples of six pitaya cultivars were provided by the Guangxi Institute of Botany: Selenicereus megalanthus 'Yanwoguo' and Selenicereus megalanthus 'Wucihuanglong' (yellow peel, white pulp); Selenicereus monacanthus 'Sijihong' and Selenicereus monacanthus 'Jingduyihao' (red peel, red pulp); and Selenicereus undatus 'Putongbairou' and Selenicereus undatus 'Baishuijing' (red peel, white pulp). Voucher specimens (voucher numbers: GZ202302401 – GZ202302406) were identified by Shuhua Lu and deposited at the herbarium of the Guangxi Institute of Botany, Guilin, China.

DNA extraction and sequencing

Total genomic DNA was extracted from the stems of fresh plants. After DNA extraction, samples from the six pitaya cultivars were sent to the Anhui Double Helix Gene Technology Company (Anhui, China) for genomic library construction and Illumina sequencing. Illumina paired-end libraries (150 bp read length) were generated in a single lane on an Illumina HiSeq2500 (2500 Illumina Way, San Diego, USA), and the paired-end raw reads were processed using Trimmomatic v0.39 [27] to remove adapters and low-quality reads to produce high-quality clean data.

Chloroplast assembly and annotation

The genome assembly and annotation processes were conducted using well-established bioinformatics tools for all six pitaya cultivars. The cp genome of the red-peel, white-pulp pitaya Selenicereus undatus (Haw.) D. R. Hunt (GB: NC.053698) was downloaded from the National Center for Biotechnology Information (NCBI: https://www.ncbi.nlm.nih.gov/) as a reference sequence, and the high-quality clean reads were assembled using GetOrganelle v1.7.5 [28] with the parameters -R 15 -k 21,45,65,85,105,127 -F embplant_pt. The assembled sequences were preliminarily annotated using GeSeq [29] and CPGAVAS2 [30]. The annotation results were manually corrected using Geneious Prime v2024.0.5 [31] to obtain the complete cp genomes. Finally, circular maps of the cp genomes of the six pitaya cultivars were drawn using OGDRAW v1.3.1 (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html) [32]. All fully annotated cp genome sequences were uploaded to the NCBI GenBank database under the accession numbers PQ824054 – PQ824059.

Structural characterization of the chloroplast genome

The total length and guanine-cytosine (GC) content of the cp genome, large single-copy region (LSC), inverted repeat regions (IRs) and small single-copy region (SSC), and gene composition were analysed using Geneious Prime v2024.0.5.
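As an illustration of the GC-content statistic reported per region, the following is a minimal Python sketch. It is not the Geneious Prime procedure used in the study; the function names and the region coordinates passed in are hypothetical, and ambiguous bases are simply ignored here as an assumption.

```python
def gc_content(seq):
    """Return GC content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100.0 * gc / total


def region_gc(genome, regions):
    """GC content per named region.

    `regions` maps a region name (e.g. "LSC") to a (start, end) tuple,
    0-based and end-exclusive. The coordinates are supplied by the caller.
    """
    return {name: gc_content(genome[start:end])
            for name, (start, end) in regions.items()}
```

For example, `region_gc(genome, {"LSC": (0, 68076)})` would report the GC percentage of the first 68,076 bases of a hypothetical `genome` string.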

Repeat sequence analysis

The online tool MISA-web (https://webblast.ipk-gatersleben.de/misa/) was used to identify simple sequence repeats (SSRs) in the target sequences [33], with the minimum repeat numbers for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide repeats set to 10, 5, 4, 3, 3, and 3, respectively.
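The thresholds above can be illustrated with a small Python sketch that scans for perfect tandem repeats. This is not MISA itself: it only finds perfect repeats, does not report compound SSRs, and the function names are hypothetical. Motifs that are themselves repetitions of a shorter unit (for example "ATAT") are skipped so that a repeat is counted once under its shortest motif.

```python
# Minimum repeat counts per motif length, as stated in the text
# (mono- to hexa-nucleotide).
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit (e.g. 'ATAT')."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples; start is 0-based."""
    seq = seq.upper()
    hits = []
    for k, threshold in sorted(min_repeats.items()):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            # Skip motifs with ambiguous bases or a shorter underlying unit.
            if set(motif) - set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            n = 1
            while seq[i + n * k:i + (n + 1) * k] == motif:
                n += 1
            if n >= threshold:
                hits.append((i, motif, n))
                i += n * k  # continue after the repeat
            else:
                i += 1
    return sorted(hits)
```

Counting the hits by motif length would give a per-class tally comparable in spirit to the SSR-type counts reported in the Results.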

Dispersed repetitive sequences of the cp genomes, including forward, reverse, complementary, and palindromic repeats, were identified using REPuter (https://bibiserv.cebitec.uni-bielefeld.de/reputer/) [34]. The parameters were set as follows: a minimum repeat length of 30 bp, a Hamming distance of 3, and a maximum of 5000 computed repeats. Finally, the number of each type of dispersed repeat and their distribution across the cp genomes were obtained.

Analysis of IRs contraction and expansion

The boundaries of the LSC, IRb, SSC, and IRa regions in the cp genomes of the six pitaya cultivars were visualised using the CPJSdraw tool [35] to analyse the contraction and expansion of the four regions in different cultivars, as well as the relationship between the boundary regions and the adjacent genes.

Genomic variation analysis

To explore sequence variation among the cp genomes of the pitaya cultivars, the mVISTA program (http://genome.lbl.gov/vista/index.shtml) was used in Shuffle-LAGAN mode to compare sequence similarity, with Selenicereus megalanthus (GB: NC.087625.1) as the reference [36].

Nucleotide diversity analysis

To assess the nucleotide diversity (Pi) among the cp genomes of the six pitaya cultivars, the six sequences were aligned using the MAFFT function in Geneious Prime v2024.0.5. Genome-wide nucleotide diversity was then computed by sliding-window analysis in DnaSP v6.12.03, with a window length of 600 bp and a step size of 200 bp [37].
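The sliding-window calculation can be sketched in Python as follows. This is a simplified illustration, not the DnaSP implementation: Pi is taken as the mean number of pairwise differences per site, and alignment columns containing gaps or ambiguous bases are excluded, which is an assumption made here. The function names are hypothetical; the default window and step match the values given above.

```python
from itertools import combinations


def window_pi(alignment, start, end):
    """Mean pairwise differences per site over columns start..end (end-exclusive)."""
    columns = [
        col for col in zip(*(seq[start:end].upper() for seq in alignment))
        if all(base in "ACGT" for base in col)  # drop gapped/ambiguous columns
    ]
    if not columns:
        return 0.0
    n = len(alignment)
    n_pairs = n * (n - 1) // 2
    diffs = sum(
        sum(a != b for a, b in combinations(col, 2)) for col in columns
    )
    return diffs / n_pairs / len(columns)


def sliding_pi(alignment, window=600, step=200):
    """Return (start, end, pi) for each window along an aligned set of sequences."""
    length = len(alignment[0])
    if len(alignment) < 2 or any(len(seq) != length for seq in alignment):
        raise ValueError("need at least two aligned sequences of equal length")
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = min(start + window, length)
        results.append((start, end, window_pi(alignment, start, end)))
    return results
```

Windows whose Pi exceeds a chosen cut-off can then be listed with a simple filter such as `[w for w in sliding_pi(aln) if w[2] >= 0.015]`, where `aln` is a hypothetical list of aligned sequences.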

Analysis of the codon usage bias

The coding sequences (CDSs) in the cp genomes of the six pitaya cultivars were extracted using PhyloSuite v1.2.3 [38]. Following a previously described method [39], only one copy of each duplicated gene was retained, genes shorter than 300 bp were eliminated, and genes with ATG as the start codon and TAA, TAG, or TGA as the stop codon were selected. Finally, 45 gene sequences were used for codon usage bias analysis, and these 45 CDSs were concatenated into one sequence for relative synonymous codon usage (RSCU) analysis. The codon usage frequency and RSCU values of the different pitaya cultivars were calculated using CodonW v1.4.2 [40]. Finally, the results were visualised using TBtools v2.154 [41].
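The CDS selection criteria described above can be expressed as a short Python sketch. This is an illustrative filter rather than the PhyloSuite workflow used in the study; the function names and the (gene, sequence) input format are hypothetical, and treating the first occurrence of a gene name as the copy to retain is an assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def keep_cds(seq, min_length=300):
    """True if the CDS is at least min_length bp, starts with ATG, and ends with a stop codon."""
    seq = seq.upper()
    return (
        len(seq) >= min_length
        and seq.startswith("ATG")
        and seq[-3:] in STOP_CODONS
    )


def filter_cds(records, min_length=300):
    """Filter (gene_name, sequence) pairs, keeping only the first copy of each gene."""
    seen = set()
    kept = {}
    for gene, seq in records:
        if gene in seen:
            continue  # later copies of a duplicated gene are ignored
        seen.add(gene)
        if keep_cds(seq, min_length):
            kept[gene] = seq.upper()
    return kept
```

The retained sequences can then be concatenated, for example with `"".join(filter_cds(records).values())`, before computing codon statistics.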

Phylogenetic analysis

To construct a phylogenetic tree of Cactaceae, we downloaded 40 cp genomes of Cactaceae plants from the NCBI and combined them with the six pitaya cp genomes assembled in this study and one outgroup from Portulacaceae (Portulaca oleracea, GB: NC.036236). The 45 CDSs shared by these 47 cp genomes were selected to construct the maximum likelihood (ML) and Bayesian inference (BI) trees. Phylogenetic analysis of the genus Selenicereus was based on the six pitaya cultivars used in the current study and the four complete cp genomes of pitaya available in the NCBI database. Two data matrices (complete cp genomes and 63 shared CDSs) were selected for ML and BI analyses, with Carnegiea gigantea (GB: NC.027618) as the outgroup. The GenBank numbers and classifications of the cp genome sequences downloaded from the NCBI database are shown in Table S1.

Shared single-copy CDSs were identified and extracted using PhyloSuite v1.2.3. Before constructing the phylogenetic trees, both the extracted shared CDSs and the complete cp genome sequences were aligned using MAFFT v7.505 [42] in PhyloSuite v1.2.3. The aligned nucleotide sequences were concatenated, and poorly aligned regions were trimmed using Gblocks [43]. ML trees were inferred from the optimised data using IQ-TREE v2.2.2.6 [44] with 1000 bootstrap (BS) replicates, and ModelFinder v2.2.0 [45] was used to determine the best-fit substitution model.
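The step of selecting shared CDSs and joining their alignments into one matrix can be sketched in Python as follows. This is a simplified illustration, not the PhyloSuite procedure; the function name and the nested-dictionary input format are hypothetical, and trimming with Gblocks is not modelled.

```python
def concatenate_shared(gene_alignments):
    """Concatenate per-gene alignments for genes present in every taxon.

    `gene_alignments` maps gene name -> {taxon: aligned sequence}.
    Returns (supermatrix, partitions), where partitions lists
    (gene, first_column, last_column) with 1-based inclusive columns.
    """
    if not gene_alignments:
        raise ValueError("no alignments supplied")
    all_taxa = sorted(set().union(*(set(aln) for aln in gene_alignments.values())))
    # Keep only genes for which every taxon has a sequence.
    shared = sorted(g for g, aln in gene_alignments.items() if set(aln) == set(all_taxa))
    supermatrix = {taxon: "" for taxon in all_taxa}
    partitions = []
    offset = 0
    for gene in shared:
        aln = gene_alignments[gene]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences for %s are not aligned to equal length" % gene)
        length = lengths.pop()
        for taxon in all_taxa:
            supermatrix[taxon] += aln[taxon]
        partitions.append((gene, offset + 1, offset + length))
        offset += length
    return supermatrix, partitions
```

The returned partition boundaries are convenient if a partitioned model is later wanted, although the text does not state whether partitioning was used.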

BI analysis was performed using the Markov chain Monte Carlo method in MrBayes v3.2.7a [46], with the best-fit substitution model determined using ModelFinder. The analysis was run for 200,000 generations, sampling a tree every 1000 generations. The first 20% of the trees were discarded as burn-in, and the remaining trees were used to generate consensus trees. Finally, the phylogenetic trees were visualised using FigTree v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree).
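To make the sampling arithmetic concrete, the following Python sketch counts how many trees a single run of this length would yield and how many remain after burn-in. The function name is hypothetical, and two details are assumptions rather than statements from the text: whether the starting tree at generation 0 is counted as a sample, and that the burn-in count is rounded down.

```python
def mcmc_tree_counts(generations, sample_every, burnin_fraction, include_start=True):
    """Return (sampled, discarded, retained) tree counts for one MCMC run."""
    sampled = generations // sample_every + (1 if include_start else 0)
    discarded = int(sampled * burnin_fraction)  # rounded down (assumption)
    return sampled, discarded, sampled - discarded


# Settings from the text: 200,000 generations, sampling every 1000, 20% burn-in.
print(mcmc_tree_counts(200_000, 1_000, 0.20))
print(mcmc_tree_counts(200_000, 1_000, 0.20, include_start=False))
```

Under either counting convention, roughly 160 trees per run contribute to the consensus.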

Results

Characteristics of the chloroplast genome

In this study, the complete cp genomes of six pitaya cultivars were assembled; each has a typical quadripartite structure comprising an LSC region, two IR regions (IRa and IRb), and an SSC region. Because gene content, order, and orientation are consistent, a single cp genome map is used to represent all six cultivars (Fig. 1). The complete cp genomes ranged in size from 133,146 to 133,617 bp. The total GC content of the cp genome was 36.4% in all six cultivars. The size of the LSC region ranged from 68,076 to 68,528 bp, with a GC content ranging from 36.2% to 36.3%. The size of the SSC region ranged from 21,716 to 21,808 bp, with a GC content ranging from 39.6% to 39.7%. The size of the IR region ranged from 21,677 to 21,806 bp, with a GC content ranging from 34.9% to 35.0%. Among the samples examined, S. megalanthus 'Wucihuanglong' had the largest cp genome, while S. monacanthus 'Jingduyihao' had the smallest (Table S2).

In each of the six pitaya cultivar genomes, we identified 123 genes, including 80 protein-coding genes (PCGs), 38 tRNA genes, four rRNA genes, and one pseudogene (ycf68) (Table S2). The gene structures and contents of the six pitaya cp genomes are highly conserved. The number of genes in the six pitaya cultivars was counted, and 19 duplicated genes were identified in the IR regions, including nine tRNA genes and 10 PCGs (rps16, atpA, atpF, psbA, psbI, psbK, clpP, matK, ycf1, and ycf2). The identified genes could be categorized into four groups according to their functions: the first group was photosynthesis-related genes totalling 35 types; the second group was self-replication-related genes totalling 57 types; the third group was other genes totalling 6 types; the fourth group was unknown genes totalling 6 types (Table 1).

Table 1 Gene composition of chloroplast genome of six pitaya cultivars

Analyses of simple sequence repeats and dispersed repeats

Overall, 66–69 SSRs were identified in the cp genomes of the six pitaya cultivars. S. megalanthus 'Yanwoguo' had the highest number of SSRs (69), followed by S. undatus 'Putongbairou' with 67. The remaining cultivars contained 66 SSRs each. All six pitaya cultivars contained mono-, di-, tri-, tetra-, and penta-nucleotide repeats. Hexanucleotide repeats were identified in all cultivars except the two red-peel, white-pulp cultivars (S. undatus 'Putongbairou' and S. undatus 'Baishuijing'). Among the identified SSRs, mononucleotide repeats were the most numerous, followed by di-, tetra-, tri-, penta-, and hexa-nucleotide repeats (Fig. 2A). Mononucleotide repeats consisting of A/T motifs were the most prevalent, accounting for 73.75%, followed by dinucleotide repeats based on AT/AT motifs, which accounted for 13.5% (Fig. 2B). Further statistics revealed that, in these six pitaya cultivars, most SSRs were distributed in the LSC region, followed by the IR regions, with the fewest in the SSC region (Fig. 2C).

In this study, we analysed dispersed repetitive sequences of at least 30 bp in all samples. In total, 1,097 dispersed repeats were found in the cp genomes of the six pitaya cultivars. S. undatus 'Putongbairou' had the highest number of dispersed repeats (214), whereas S. monacanthus 'Jingduyihao' had the lowest (161). Four categories of dispersed repeats were identified in all six cultivars: forward, reverse, palindromic, and complementary. In all six cultivars, forward repeats were the most numerous (96–152) and complementary repeats the least, with only one identified in each cultivar (Fig. 2D). Further statistical analysis revealed that most of the identified dispersed repeats were less than 50 bp in length (98–151), followed by those of 50–99 bp (30–43) (Fig. 2E).

IRs contraction and expansion

In this study, we compared the contraction and expansion of the IR/SC boundaries in the cp genomes of six pitaya cultivars. The analysis revealed a high degree of similarity among the six cultivars. Both the SSC and IR regions of S. megalanthus 'Yanwoguo' were expanded compared with those of the other five cultivars, with lengths of 21,808 bp and 21,806 bp, respectively (Fig. 3). In all six cultivars, both copies of the ycf1 gene span the IR/SSC border regions. The ycf1 gene at the IRb/SSC border extended from the IRb region into the SSC region by 4,076 to 4,127 bp, whereas the ycf1 gene at the SSC/IRa border extended 45 bp into the SSC region in every cultivar.

Comparative analysis of chloroplast genomes

In this study, the published cp genome of Selenicereus megalanthus (GB: NC.087625) was used as a reference, and the mVISTA online tool was used to conduct a genome-wide comparative analysis of the cp genomes of the six pitaya cultivars. The six cultivars were similar to the reference sequence in gene structure and gene order, and the variant sites were mainly found in the LSC region, followed by the SSC region, with no obvious variation in the IR regions. The protein-coding genes with the most pronounced variation were accD, rps18, rpl22, rps19, and ycf1, with the highest degree of sequence variability found in accD. More sequence variation was found in the non-coding regions than in the protein-coding regions; for example, the atpH-atpI, trnF(GAA)-rbcL, trnM(CAU)-accD, trnL(CAA)-ycf1, ndhD-ccsA, ycf1-trnL(CAA), and trnL(CAA)-ycf2 intergenic regions all showed variability (Fig. 4A).

To elucidate the level of sequence variation, the Pi values among the cp genome sequences of the six pitaya cultivars were calculated using DnaSP. The Pi values ranged from 0.00000 to 0.08511, with an average of 0.00363, and the maximum peak appeared in the accD gene (Fig. 4B). The LSC region had the highest average nucleotide diversity (Pi = 0.00481), followed by the SSC (Pi = 0.00405) and IR regions (Pi = 0.00159). The highly variable sites were mostly distributed in the LSC and SSC regions, whereas the IR regions were more conserved, which was consistent with the results of the genome-wide comparative analysis. Six highly variable sites (Pi ≥ 0.015) were identified, namely trnF-GAA-rbcL, trnM-CAU-accD, accD, rpl20-psbB, rpl22, and ycf1.

Codon usage bias analysis

In this study, six pitaya cp genomes were analysed for codon usage, frequency, and preference. The PCGs longer than 300 bp were encoded by a total of 19,189 (S. monacanthus 'Jingduyihao') to 19,350 (S. megalanthus 'Wucihuanglong') codons, including stop codons. The total number of codons did not change significantly and the types of codons were consistent with the types of amino acids. Leucine (Leu: 1898–1932 codons) was the most abundant amino acid, whereas cysteine (Cys: 230–243 codons) was the least abundant (Fig. 5A and Table S3). The RSCU analysis showed that, among the 64 codons identified, RSCU values ranged from 0.33 to 1.88 (Fig. 5B). Among the six analysed sequences, UUA and CUC, both encoding leucine, showed the largest and smallest RSCU values, respectively. The RSCU value of 1.00 for Met and Trp indicated no bias in methionine and tryptophan codons. The number of high-frequency codons (RSCU > 1) was 31, of which 29 ended in A or U. Furthermore, all three stop codons (UAA, UAG, and UGA) were present. The RSCU value of UAA was > 1, suggesting that UAA was the preferred stop codon in the analysed sequences.
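For readers unfamiliar with RSCU, the following Python sketch shows how the value is commonly defined: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. This is an illustration, not the CodonW implementation; the function name is hypothetical, the standard codon assignments are assumed, and stop codons are treated as one synonymous family, which is an assumption made so that they receive RSCU values as in the text.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_sequences):
    """Return {codon: RSCU} for codons of amino acids observed in the input CDSs."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skip codons with ambiguous bases
                counts[codon] += 1
    values = {}
    for aa in set(AMINO):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        expected = total / len(family)  # equal usage within the family
        for codon in family:
            values[codon] = counts[codon] / expected
    return values
```

Because methionine and tryptophan each have a single codon, this definition always gives them an RSCU of exactly 1, which is why those values carry no information about bias.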