For citation purposes: Beccuti M, Carrara M, Cordero F, Donatelli S, Calogero RA. The structure of state-of-art gene fusion-finder algorithms. OA Bioinformatics 2013 Aug 01;1(1):2.

Review

 
Genome Bioinformatics

The structure of state-of-art gene fusion-finder algorithms

M Beccuti(1), M Carrara(2), F Cordero(1), S Donatelli(1), RA Calogero(2)*
 

Authors affiliations

(1) Department of Computer Science, University of Torino, Torino, Italy

(2) Department of Biotechnology and Health Sciences, University of Torino, Torino, Italy

* Corresponding author Email: raffaele.calogero@unito.it

Abstract

Introduction

Fusion genes, also known as chimeras, play important roles in tumorigenesis and cancer progression, which makes them crucial in the search for biomarkers and therapeutic targets. High-throughput sequencing technologies combined with sophisticated bioinformatics tools might facilitate the discovery of such aberrations. A significant number of bioinformatics algorithms have been developed to detect fusion genes, and their detection strategies vary considerably. In this review, we inspect the strategies of 18 fusion-finder algorithms to understand how these tools call chimeras.

Materials and methods

In this review, we considered 18 tools which, to the best of our knowledge, are the current state-of-the-art chimera detection tools.

Results

The considered tools can be classified according to their alignment strategies into four different macro-groups as follows: whole paired-end, paired-end + fragmentation, direct fragmentation and statistical read distribution. The first two techniques require paired-end reads because they exploit encompassing reads during the first alignment phase, while the last two can be applied to both read formats.

Conclusion

There is still some work to be done in the area of chimera detection, especially concerning the definition of common benchmarks and increased specificity.

Introduction

The joining of deoxyribonucleic acid (DNA) of two genes, by translocation or inversion, gives rise to gene fusions, resulting either in hybrid proteins, also known as chimeras or fusion products, or in the deregulation of the transcription of one gene by the cis-regulatory elements (enhancers) of another. Gene fusions are an important class of cancer aberrations, as in the case of the breakpoint cluster region-Abelson murine leukaemia (BCR-ABL) fusion, found in nearly all chronic myeloid leukaemia patients[1]. Chimeras can be categorised into the following two classes: intergenic and transgenic fusion transcripts[2], as reported in Figure 1 (A) and (B), respectively. Intergenic fusion transcripts refer to a splicing event between adjacent genes on the same chromosome, while transgenic fusion transcripts originate from splicing events involving exons of two genes located on different chromosomes.

Figure 1. The fusion products are classified into two categories: intergenic fusion transcripts, given by the joining of exons (En) of two genes on the same chromosome (panel A), and transgenic fusion transcripts, obtained by the combination of exons of two genes on different chromosomes (panel B).

High-throughput sequencing technologies facilitate the characterisation of the aberrant background of human cancers[3,4], driving modern medicine towards the development of personalised treatment of cancer patients.

Recently, many computational approaches for the detection of chimeras have been developed, taking advantage of the remarkable throughput of the new ribonucleic acid (RNA)-Seq technologies[5]. A recent review by Wang et al.[6] listed a total of 23 different fusion detection tools published between 2009 and 2012. Wang and co-workers also considered 24 papers, published during the same period, involving the identification of fusion products in cancer. They reported that those 24 papers took advantage of only two of the 23 listed tools, namely FusionSeq and deFuse, which were used to detect fusions in one and two papers, respectively.

It seems quite odd that most of these biologically important works did not make use of the available fusion detection tools. Carrara and co-workers[7] recently compared the behaviour of eight fusion-detection tools (FusionHunter, FusionMap, FusionFinder, MapSplice, deFuse, Bellerophontes, ChimeraScan and TopHat-Fusion). They highlighted that these tools are able to detect chimeras, but that the number of false positives contaminating the results makes the validation of true fusions a real challenge, and that further improvement in fusion-finder algorithms is essential. In this review, we examine the structure of 18 published fusion-detection algorithms to identify critical steps in the chimera-call procedure that might need further refinement.

Materials and methods

In this review, we consider the following 18 tools: Bellerophontes[8], BreakFusion[9], BreakPointer[10], ChimeraScan[11], deFuse[12], EBARDenovo[13], EricScript[14], FusionAnalyser[15], FusionFinder[16], FusionHunter[17], FusionMap[18], FusionSeq[19], LifeScope[20], MapSplice[21], ShortFuse[22], SnowShoes-FTD[23], SOAPFuse[24] and TopHat-Fusion[25], which, to the best of our knowledge, are the current state-of-the-art chimera detection tools. Before describing these fusion-finder algorithms, we recall some terms used in the rest of this review. RNA-Seq experiments generate a huge set of short reads, which can be in two forms, namely, single-end or paired-end. In the former case, the sequencer reads each DNA fragment from one end only; in the latter case, the fragment is sequenced from both ends, on the forward and reverse strands, giving rise to a couple of mates, called a paired-end read.

Results

During the identification of the fusion boundaries (i.e., the nucleotide coordinates corresponding to the breakpoint in each of the two genes involved in the fusion), it is possible to classify each read as either spanning or encompassing, as shown in Figure 2. Spanning reads, derived from single-end or paired-end experiments, overlap the fusion boundary. Encompassing reads, which require a paired-end format, harbour the fusion boundary between the two mates, so that each mate maps on a different gene of the fused gene couple. All gene fusion-finder algorithms are composed of two phases. First, a mapping step is required to align the reads with respect to the reference specified by the algorithm. Second, a set of filters based on several biological or technical indications is applied to reduce the set of putative fusion products. The considered tools can be classified according to their alignment strategies into four different macrogroups as follows: whole paired-end, paired-end + fragmentation, direct fragmentation and statistical read distribution, as summarised in Table 1. The first two techniques require paired-end reads because they exploit encompassing reads during the first alignment phase, while the last two techniques can be applied to both read formats.

Figure 2. Representation of spanning and encompassing reads. Encompassing reads occur only in paired-end data, where the two mates (continuous lines) harbour the fusion boundary between them. Spanning reads occur both in single-end data, where the read overlaps the fusion boundary, and in paired-end data, where only one mate (in blue) covers the fusion boundary.
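The distinction between spanning and encompassing reads can be illustrated with a minimal sketch. It assumes that reads are already placed on a putative fused transcript as half-open intervals and that the junction is a single coordinate; the function names are hypothetical and are not taken from any of the reviewed tools.

```python
from typing import Optional, Tuple

# Half-open interval [start, end) on a putative fused transcript (assumed model).
Interval = Tuple[int, int]


def covers_junction(read: Interval, junction: int) -> bool:
    """True if the read has bases on both sides of the junction coordinate."""
    start, end = read
    return start < junction < end


def classify_support(mate1: Interval, mate2: Optional[Interval], junction: int) -> str:
    """Classify a single-end read (mate2 is None) or a read pair."""
    # A read or mate that overlaps the junction itself is a spanning read.
    if covers_junction(mate1, junction) or (
        mate2 is not None and covers_junction(mate2, junction)
    ):
        return "spanning"
    # A pair whose mates lie on opposite sides of the junction is encompassing.
    if mate2 is not None:
        left, right = sorted([mate1, mate2])
        if left[1] <= junction <= right[0]:
            return "encompassing"
    return "none"
```

With a junction at position 100, a single-end read covering 90 to 140 would be classified as spanning, while a pair with mates at 10 to 60 and 150 to 200 would be classified as encompassing.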

Table 1

Classification of fusion-finder algorithms according to their alignment strategies.

The whole paired-end approach consists of two mapping steps. In the first step, the reads are aligned to a reference using mapping tools, such as Bowtie[26] and Burrows-Wheeler Alignment (BWA)[27], allowing a limited number of mismatches. This step can give rise to some ‘discordant alignments’, which occur when both mates have a unique alignment in the reference, but some features do not match the assumptions of the paired-end design. For example, the mate orientations are not correct or the distance between the mates does not match the one expected from the experimental design. The discordant alignments are used to generate a set of putative fusion products, which will be used as a reference for the second alignment step, where the unmapped reads are rescued. The resulting putative fusions are passed as inputs to a filtering step. The tools falling into this category are BreakFusion, EricScript, deFuse, FusionAnalyser, FusionHunter, FusionSeq, ShortFuse, SOAPFuse and SnowShoes-FTD.
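The notion of discordant alignment can be sketched as follows. This is only an illustration under stated assumptions: the library is assumed to produce forward/reverse oriented mates, the maximum fragment length of 500 nucleotides is an arbitrary placeholder, and mates aligned to different chromosomes are also treated as discordant. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MateAlignment:
    chrom: str
    start: int   # leftmost aligned position
    end: int     # one past the rightmost aligned position
    strand: str  # "+" or "-"


def is_discordant(m1: MateAlignment, m2: MateAlignment, max_fragment: int = 500) -> bool:
    """Flag a uniquely aligned pair that violates the paired-end design."""
    # Assumption: mates on different chromosomes cannot come from one fragment.
    if m1.chrom != m2.chrom:
        return True
    left, right = (m1, m2) if m1.start <= m2.start else (m2, m1)
    # Assumption: a proper pair has the left mate on "+" and the right on "-".
    if not (left.strand == "+" and right.strand == "-"):
        return True
    # The distance between the mates must be compatible with the library.
    return right.end - left.start > max_fragment
```

Pairs flagged in this way would be the ones used to build the set of putative fusion products for the second alignment step.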

The paired-end + fragmentation approach proposes a first mapping phase similar to the one used in the whole paired-end case. Then, a second alignment phase, which exploits read fragmentation, is performed. In detail, the unmapped reads are fragmented and remapped with respect to the reference, depending on the tool requirements. Note that, by using a fragmentation approach, these algorithms are able to detect a higher number of junction-spanning reads, which simplifies the detection of the fusion junctions. Again, the putative fusions are passed as input to the filter step. Bellerophontes, ChimeraScan, LifeScope and TopHat-Fusion are part of this category.
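Read fragmentation can be sketched as below. The fixed fragment length of 25 nucleotides and the handling of the read tail are assumptions made for illustration; each reviewed tool defines its own fragmentation scheme, and the function name is hypothetical.

```python
from typing import List, Tuple


def fragment_read(seq: str, frag_len: int = 25) -> List[Tuple[int, str]]:
    """Split a read into fixed-length fragments, returned with their offsets.

    If the read length is not a multiple of frag_len, a final fragment is
    taken from the end of the read so that every base is covered.
    """
    if frag_len <= 0:
        raise ValueError("frag_len must be positive")
    if len(seq) <= frag_len:
        return [(0, seq)]
    frags = [
        (i, seq[i:i + frag_len])
        for i in range(0, len(seq) - frag_len + 1, frag_len)
    ]
    if len(seq) % frag_len:
        frags.append((len(seq) - frag_len, seq[-frag_len:]))
    return frags
```

Fragments of the same read that map to two different genes would then point to a candidate junction located between them.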

Unlike the previous categories, the direct fragmentation approach immediately generates a set of fragmented reads, which are used in the mapping phase on the reference genome. This approach exploits the strength of fragmentation in detecting fusion junctions. The algorithms of this category also propose a set of filters useful to select the real fusion products. The tools that can be classified in this macrogroup are EBARDenovo, FusionFinder, FusionMap and MapSplice.

Finally, the statistical read distribution approach identifies putative fusion products by exploiting both local non-uniform read distribution and mapping signatures that contain misalignments at the boundaries of insertions/deletions or of more complex structural variants. Then, each putative fusion product is validated using the unmapped reads derived in the previous step. Only the putative fusion products associated with a number of reads greater than a threshold are selected as inputs to the filtering step. The only tool following this approach is BreakPointer.

All tools implement a final filter step to reduce and to validate the discovered fusion products. Table 2 summarises the set of filters used by each fusion-finder algorithm, described as follows:

Table 2

Consideration of filters implemented by each fusion-finder algorithm.

Paired-end information filters verify the correct distance between the tags of a pair to validate the alignment on a fusion. This distance depends on the protocols used for the library preparation, and these tools can either take this information as input or infer it from the first alignment step. In both cases, reads mapping on the putative fusions at an excessive distance are filtered out. The tools including filters of this class are Bellerophontes, ChimeraScan, deFuse, FusionFinder and SOAPFuse.
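As an illustration of a paired-end information filter, the sketch below infers a distance limit from the concordant pairs of the first alignment step and discards pairs exceeding it. The rule "mean plus three standard deviations" is an assumption chosen for the example, not the criterion of any specific tool, and the function names are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def infer_distance_limit(concordant_distances: Sequence[float], n_sd: float = 3.0) -> float:
    """Estimate the largest acceptable mate distance from concordant pairs."""
    return mean(concordant_distances) + n_sd * stdev(concordant_distances)


def filter_by_mate_distance(pair_distances: Sequence[float], limit: float) -> List[float]:
    """Keep only the pairs whose mate distance on the fusion is within the limit."""
    return [d for d in pair_distances if d <= limit]
```

When the library fragment size is known from the protocol, the limit could be passed directly instead of being inferred.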

Anchor length filters use the concept of ‘anchor length’ (i.e., the number of nucleotides overlapping each side of a fusion junction) to evaluate the quality of junction-spanning reads associated with a fusion junction. Junction-spanning reads having at least one of the two anchor lengths below a threshold are interpreted as possible artefacts caused by mismatches or sequence similarity, and are removed. FusionHunter, ChimeraScan and TopHat-Fusion take advantage of this class of filters.
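The anchor length check can be sketched as follows, assuming that a junction-spanning read is described by its half-open interval on the putative fusion and that the junction is a single coordinate. The threshold of 10 nucleotides is an arbitrary placeholder and the function names are hypothetical.

```python
from typing import Tuple


def anchor_lengths(read_start: int, read_end: int, junction: int) -> Tuple[int, int]:
    """Number of read bases on the left and on the right of the junction."""
    if not read_start < junction < read_end:
        raise ValueError("the read does not span the junction")
    return junction - read_start, read_end - junction


def passes_anchor_filter(read_start: int, read_end: int, junction: int,
                         min_anchor: int = 10) -> bool:
    """Reject reads in which at least one anchor is shorter than min_anchor."""
    left, right = anchor_lengths(read_start, read_end, junction)
    return min(left, right) >= min_anchor
```

A read covering positions 60 to 110 across a junction at 100 has anchors of 40 and 10 nucleotides and is kept, whereas a read covering 60 to 104 has a right anchor of only 4 nucleotides and is removed.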

Read-through transcript filters try to discover and remove RNA molecules formed by exons of adjacent genes, usually generated when the gene end is not recognised during the RNA elongation phase. Bellerophontes, FusionHunter, FusionAnalyser, FusionFinder, FusionMap, SnowShoes-FTD and TopHat-Fusion are the tools that use this class of filters.
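A read-through check can be sketched as below. The criteria used here (same chromosome, same strand and a gap between the two genes not larger than a threshold) and the value of the threshold are assumptions made for illustration; the reviewed tools may define adjacency differently. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int
    end: int


def is_read_through_candidate(g1: Gene, g2: Gene, max_gap: int = 100_000) -> bool:
    """True if a fusion between g1 and g2 may be a read-through transcript."""
    # Read-through requires both genes to be transcribed from the same strand
    # of the same chromosome.
    if g1.chrom != g2.chrom or g1.strand != g2.strand:
        return False
    upstream, downstream = sorted((g1, g2), key=lambda g: g.start)
    # Assumption: genes closer than max_gap are considered adjacent
    # (overlapping genes give a negative gap and are also flagged).
    return downstream.start - upstream.end <= max_gap
```

Candidate fusions flagged by such a check would be removed, or at least reported separately from the other putative fusions.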

Junction spanning reads filters remove all the fusion products not supported by a number of spanning reads greater than a specified threshold. This class of filters is found in Bellerophontes, FusionHunter, FusionMap, ShortFuse, SnowShoes-FTD, SOAPFuse and TopHat-Fusion.

Polymerase chain reaction artefact filters try to discover and remove all duplicated reads generated by the polymerase chain reaction (PCR) amplification process, by the identification of clusters of reads of the same length with an identical alignment on the reference. Bellerophontes, FusionHunter, BreakPointer, EricScript, FusionMap and ShortFuse have an implementation of this class of filters.
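The identification of clusters of reads with the same length and an identical alignment can be sketched as follows. Alignments are assumed to be simple tuples (read identifier, chromosome, start, length), and keeping the first read of each cluster is an arbitrary choice made for the example.

```python
from typing import List, Set, Tuple

# (read_id, chrom, start, length): a deliberately simplified alignment record.
Alignment = Tuple[str, str, int, int]


def remove_pcr_duplicates(alignments: List[Alignment]) -> List[Alignment]:
    """Keep one read per cluster of same-length reads with identical alignment."""
    seen: Set[Tuple[str, int, int]] = set()
    kept: List[Alignment] = []
    for aln in alignments:
        _, chrom, start, length = aln
        key = (chrom, start, length)
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept
```

In a paired-end setting, the key would reasonably include the position of the second mate as well.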

Homology filters remove all the putative fusions having a high number of reads on homologous or repetitive regions, which can lead to multiple alignments. The tools implementing this class of filters are EricScript, FusionAnalyser, FusionFinder, FusionSeq, SnowShoes-FTD, SOAPFuse and TopHat-Fusion.

Scoring filters compute a quality score for each fusion, based on different metrics (e.g., entropy, base quality, etc.), and discard all the candidates whose score is lower than a threshold. BreakPointer, EricScript, FusionMap, FusionSeq and ShortFuse are the tools that use a specific scoring method to filter putative fusion products.
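As an example of a score based on entropy, the sketch below computes the Shannon entropy of the nucleotide composition of a sequence and discards low-complexity candidates. Which sequence is scored (here, a sequence around the junction), the threshold of 1.5 bits and the function names are all assumptions made for illustration; each tool defines its own metrics.

```python
from collections import Counter
from math import log2
from typing import List, Tuple


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the nucleotide composition of seq."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq.upper())
    n = len(seq)
    return sum((c / n) * log2(n / c) for c in counts.values())


def filter_by_score(candidates: List[Tuple[str, str]], min_entropy: float = 1.5) -> List[str]:
    """Return the names of the candidates whose junction sequence is complex enough.

    candidates is a list of (fusion_name, junction_sequence) pairs.
    """
    return [name for name, seq in candidates if shannon_entropy(seq) >= min_entropy]
```

A homopolymer such as AAAAAAAA has an entropy of 0 bits and is discarded, while a sequence with the four nucleotides equally represented reaches the maximum of 2 bits.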

Read quality filters act on the available reads, removing all the paired reads with a quality score below a threshold, thus reducing the possibility of ambiguous alignments due to low sequencing quality. Read quality filters are used in FusionMap, FusionSeq, LifeScope and SnowShoes-FTD.

Encompassing reads filters remove all putative fusion products with a number of encompassing read pairs below a threshold. The three tools including this filtering step are Bellerophontes, FusionAnalyser and SnowShoes-FTD.

Blacklist filters remove fusions comprising genes present in a list of non-interesting regions, which can be either user-defined or fixed, depending on the tool. FusionAnalyser, FusionMap and FusionSeq include a blacklisting feature.

Statistics filters analyse the read distribution on the putative fusion products and evaluate it statistically against the general read distribution on the genome, to decide whether the fusion should be filtered out or not. BreakFusion, EricScript, MapSplice and ShortFuse include a statistical evaluation of the fusion products.

In addition, we identified a total of eleven tool-specific filters, which do not fall into the previous categories. Bellerophontes includes a step that removes ambiguous reads. EricScript has a filtering step based on the sequence homology of the fusion junction. FusionFinder removes putative fusion products containing antisense sequences. FusionSeq requires the expression of the putative fusion to be comparable with the general expression obtained from the sequencing. LifeScope introduces a graph, called Junction Evidence Graph, to represent the fusion products and their junctions and to evaluate the confidence level of each called fusion. MapSplice filters out products that do not contain canonical junctions and also removes products with introns of unusual length. ShortFuse removes all the reads that align on transcripts for spliceosome components. SnowShoes-FTD checks, and possibly filters, fusion products on the basis of the orientation of the genes involved, and also removes fusions with an excessive number of putative junction points. SOAPFuse adds a trimming step on the reads that fail to align, in an attempt to rescue them.

Discussion

This review offers an overview of the main fusion-finder algorithms published in the literature. In all the algorithms inspected, the first step relies on mapping algorithms, e.g., Bowtie and BWA. The objective of this step is to select reads that support putative fusion events, i.e., discordant reads whose mapping is coherent with the known gene annotation. A set of filters based on several biological or technical features follows the mapping step. Applying these filters reduces the set of putative fusion products to those that could be real fusion products.

At the present time, there is no complete evaluation of all the tools on the same dataset. Only partial comparisons are available, typically in the papers proposing a new tool, against a subset of the algorithms considered in this review. Some of these evaluations even contradict one another, which is not surprising given the lack of commonly available benchmarks. For example, EricScript offers a comparison of its performance in terms of central processing unit (CPU) time and area under the curve, a measure that estimates the ability of each algorithm to discriminate between true and false positives. The comparisons have been done on a synthetic dataset, against ChimeraScan, deFuse, FusionMap and ShortFuse, and on real datasets, against deFuse. SOAPFuse also offers a detailed comparison and a real dataset. Its authors compare the algorithm with deFuse, TopHat-Fusion, FusionHunter, ChimeraScan and SnowShoes-FTD in terms of CPU time, memory usage and detection of known fusions.
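The area under the curve mentioned above can be computed, for a tool that assigns a score to each called fusion, as the probability that a randomly chosen true fusion receives a higher score than a randomly chosen false one. The sketch below uses this pairwise formulation; it is a generic illustration and does not reproduce the evaluation procedure of any of the cited papers.

```python
from typing import Sequence


def auc(true_scores: Sequence[float], false_scores: Sequence[float]) -> float:
    """Area under the ROC curve from the scores of true and false fusions.

    Each (true, false) pair counts 1 if the true fusion scores higher,
    0.5 in case of a tie and 0 otherwise.
    """
    if not true_scores or not false_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for t in true_scores:
        for f in false_scores:
            if t > f:
                wins += 1.0
            elif t == f:
                wins += 0.5
    return wins / (len(true_scores) * len(false_scores))
```

A value of 1.0 corresponds to a perfect separation of true and false fusions, whereas 0.5 corresponds to a random ranking.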

In the work done by Carrara et al.[7], we observed a very high number of false positives for all the examined tools. Unfortunately, the specificity of the tools is not reported in most of the original papers, but it is certainly a factor that may hinder the applicability of the tools in many contexts.

Conclusion

The different performances of fusion-finder algorithms could be attributed to the filters that each tool applies. A good choice would probably be to provide a clearer separation between the alignment and filter phases, to allow a modular use of these tools. Thus, there is still some work to be done in the area of chimera detection, especially concerning the definition of common benchmarks and increased specificity.

Abbreviations list

BWA, Burrows-Wheeler Alignment; CPU, central processing unit; DNA, deoxyribonucleic acid; RNA, ribonucleic acid.

Acknowledgments

This study was funded by grants from the Epigenomics Flagship Project EPIGEN, MIUR-CNR; FP7-Health- 2012-Innovation-1 NGS-PTL Grant no. 306242.

Authors Contribution

All authors contributed to the conception, design, and preparation of the manuscript, as well as read and approved the final manuscript.

Competing interests

None declared.

Conflict of Interests

None declared.

A.M.E

All authors abide by the Association for Medical Ethics (AME) ethical rules of disclosure.

References

  • 1. Rowley JD. Letter: a new consistent chromosomal abnormality in chronic myelogenous leukaemia identified by quinacrine fluorescence and giemsa staining. Nature 1973 Jun;243(5405):290-3.
  • 2. Magrangeas F, Pitiot G, Dubois S, Bragado-Nilsson E, Chérel M, Jobert S. Cotranscription and intergenic splicing of human galactose-1-phosphate uridylyltransferase and interleukin-11 receptor alpha-chain genes generate a fusion mRNA in normal cells. Implication for the production of multidomain proteins during evolution. J Biol Chem 1998 Jun;273(26):16005-10.
  • 3. Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X. Transcriptome sequencing to detect gene fusions in cancer. Nature 2009 Mar;458(7234):97-101.
  • 4. Maher CA, Palanisamy N, Brenner JC, Cao X, Kalyana-Sundaram S, Luo S. Chimeric transcript discovery by paired-end transcriptome sequencing. Proc Natl Acad Sci U S A 2009 Jul;106(30):12353-8.
  • 5. Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 2011 Feb;12(2):87-98.
  • 6. Wang Q, Xia J, Jia P, Pao W, Zhao Z. Application of next generation sequencing to human gene fusion detection: computational tools, features and perspectives. Brief Bioinform 2013 Jul;14(4):506-19.
  • 7. Carrara M, Beccuti M, Lazzarato F, Cavallo F, Cordero F, Donatelli S. State-of-the-art fusion-finder algorithms sensitivity and specificity. Biomed Res Int 2013 Feb;2013(2013):340620.
  • 8. Abate F, Acquaviva A, Paciello G, Foti C, Ficarra E, Ferrarini A. Bellerophontes: an RNA-Seq data analysis framework for chimeric transcripts discovery based on accurate fusion model. Bioinformatics 2012 Aug;28(16):2114-21.
  • 9. Chen K, Wallis JW, Kandoth C, Kalicki-Veizer JM, Mungall KL, Mungall AJ. BreakFusion: targeted assembly-based identification of gene fusions in whole transcriptome paired-end sequencing data. Bioinformatics 2012 Jul;28(14):1923-4.
  • 10. Sun R, Love MI, Zemojtel T, Emde AK, Chung HR, Vingron M. Breakpointer: using local mapping artifacts to support sequence breakpoint discovery from single-end reads. Bioinformatics 2012 Apr;28(7):1024-5.
  • 11. Iyer MK, Chinnaiyan AM, Maher CA. ChimeraScan: a tool for identifying chimeric transcription in sequencing data. Bioinformatics 2011 Oct;27(20):2903-4.
  • 12. McPherson A, Hormozdiari F, Zayed A, Giuliany R, Ha G, Sun MG. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput Biol 2011 May;7(5):e1001138.
  • 13. Chu HT, Hsiao WW, Chen JC, Yeh TJ, Tsai MH, Lin H. EBARDenovo: highly accurate de novo assembly of RNA-Seq with efficient chimera-detection. Bioinformatics 2013;29(8):1004-10.
  • 14. Benelli M, Pescucci C, Marseglia G, Severgnini M, Torricelli F, Magi A. Discovering chimeric transcripts in paired-end RNA-seq data by using EricScript. Bioinformatics 2012 Dec;28(24):3232-9.
  • 15. Piazza R, Pirola A, Spinelli R, Valletta S, Redaelli S, Magistroni V. FusionAnalyser: a new graphical, event-driven tool for fusion rearrangements discovery. Nucleic Acids Res 2012 Sep;40(16):e123.
  • 16. Francis RW, Thompson-Wickin K, Carter KW, Anderson D, Kees UR, Beesley AH. FusionFinder: a software tool to identify expressed gene fusion candidates from RNA-Seq data. PLoS One 2012 Jun;7(6):e39987.
  • 17. Li Y, Chien J, Smith DI, Ma J. FusionHunter: identifying fusion transcripts in cancer using paired-end RNA-seq. Bioinformatics 2011 Jun;27(12):1708-10.
  • 18. Ge H, Liu K, Juan T, Fang F, Newman M, Hoeck W. FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution. Bioinformatics 2011 Jul;27(14):1922-8.
  • 19. Sboner A, Habegger L, Pflueger D, Terry S, Chen DZ, Rozowsky JS. FusionSeq: a modular framework for finding gene fusions by analyzing paired-end RNA-sequencing data. Genome Biol 2010 Oct;11(10):R104.
  • 20. Sakarya O, Breu H, Radovich M, Chen Y, Wang YN, Barbacioru C. RNA-Seq mapping and detection of gene fusions with a suffix array algorithm. PLoS Comput Biol 2012 Apr;8(4):e1002464.
  • 21. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Res 2010 Oct;38(18):e178.
  • 22. Kinsella M, Harismendy O, Nakano M, Frazer KA, Bafna V. Sensitive gene fusion detection using ambiguously mapping RNA-Seq read pairs. Bioinformatics 2011 Apr;27(8):1068-75.
  • 23. Asmann YW, Hossain A, Necela BM, Middha S, Kalari KR, Sun Z. A novel bioinformatics pipeline for identification and characterization of fusion transcripts in breast cancer and normal cell lines. Nucleic Acids Res 2011 Aug;39(15):e100.
  • 24. Jia W, Qiu K, He M, Song P, Zhou Q, Zhou F. SOAPFuse: an algorithm for identifying fusion transcripts from paired-end RNA-seq data. Genome Biol 2013 Feb;14(2):R12.
  • 25. Kim D, Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol 2011 Aug;12(8):R72.
  • 26. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009 Mar;10(3):R25.
  • 27. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009 Jul;25(14):1754-60.
Licensee to OAPL (UK) 2013. Creative Commons Attribution License (CC-BY)

Table 1. Classification of fusion-finder algorithms according to their alignment strategies.

Tool name        Macrogroup
Bellerophontes   Paired-end + Fragmentation
BreakFusion      Whole paired-end
BreakPointer     Statistical information exploiting
ChimeraScan      Paired-end + Fragmentation
deFuse           Whole paired-end
EBARDenovo       Direct fragmentation
EricScript       Whole paired-end
FusionAnalyser   Whole paired-end
FusionFinder     Direct fragmentation
FusionHunter     Whole paired-end
FusionMap        Direct fragmentation
FusionSeq        Whole paired-end
LifeScope        Paired-end + Fragmentation
MapSplice        Direct fragmentation
ShortFuse        Whole paired-end
SnowShoes-FTD    Whole paired-end
SOAPFuse         Whole paired-end
TopHat-Fusion    Paired-end + Fragmentation

Table 2. Consideration of filters implemented by each fusion-finder algorithm.

Tool name Paired-end information Anchor length Read-through transcripts Junction spanning reads PCR artifact Homology Scoring Reads quality Encompassing reads Black list Statistics Additional filters
Bellerophontes X X X X X Ambiguous reads
BreakFusion X
BreakPointer X X
ChimeraScan X X
deFuse X
EBARDenovo
EricScript X X X X Junction homology
FusionAnalyser X X X X
FusionFinder X X X Antisense
FusionHunter X X X X
FusionMap X X X X X X
FusionSeq X X X X Comparison chimera expression with general expression
LifeScope X Junction evidence graph
MapSplice X Canonical junctions; introns length
ShortFuse X X X X Reads from Spliceosome components
SnowShoes-FTD X X X X X Fusion genes orientation; excessive putative junction points
SOAPFuse X X X Read trimming
TopHat-Fusion X X X

PCR, polymerase chain reaction.
