The work allows us to deepen our knowledge of the formation of the organism and its diseases.

VALENCIA, 11 Apr. (EUROPA PRESS) –

A team from the Institute of Integrative Systems Biology (UV-CSIC) has published in ‘Nature Methods’ software it developed to analyze data obtained by long-read sequencing of the genome. The system makes it possible to discover new RNA molecules and assign them a function in the creation of tissues, which “deepens the knowledge of the formation of the organism and its diseases.”

The researchers behind the work point out that the complexity of an organism emerges from its genome, the book whose DNA contains the instructions for life. The method for reading this book, sequencing, has evolved toward reading increasingly longer fragments of the genome.

In this field, a research group led by the Institute of Integrative Systems Biology (I2SysBio), a joint center of the University of Valencia (UV) and the Higher Council for Scientific Research (CSIC), has improved its own computer program, which can discover new transcripts (RNA molecules used to synthesize proteins and create tissues) from data produced by long-read sequencing instruments and assign them a function in the formation of the organism.

Long-read sequencing is the third generation of genome sequencing methods. Compared to short-read sequencing, which analyzes fragments of about 200 nucleotides, long-read methods can obtain reads 100 times longer, leaving fewer gaps in the genome information to be filled in with bioinformatics tools. This was one of the reasons why Nature Methods itself named it ‘Method of the Year 2022’.

A few years earlier, in 2018, researcher Ana Conesa, then at the University of Florida, developed a computer program called SQANTI to analyze the information obtained with these long-read methods. Now, her research team at I2SysBio has published a substantial improvement to this software, which can be used freely with the major commercial long-read sequencing systems, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT).

“Long-read techniques better analyze the complexity of human transcripts and the transcriptome,” says Conesa. The transcriptome identifies the portion of the genome that is read in each cell to give rise to tissues and organs. A single gene can give rise to a great diversity of transcripts through small changes in the structure of the RNA it encodes, and with them to proteins with different cellular functions. “Short-read sequencing cannot solve this puzzle. Long-read sequencing better reconstructs the functional complexity of the human transcriptome, and this is key to studying certain diseases, especially neurological diseases and cancer,” says the CSIC researcher in a statement.

The version published now, SQANTI3, solves some earlier problems caused by RNA degradation and introduces notable improvements. The program is capable of discovering new transcripts that were not in the genome databases used by these computer programs. Furthermore, through artificial intelligence techniques, the software can assign functional information to each new transcript, “something essential to understanding the functional complexity of the organism and its diseases,” highlights Conesa.

To develop this computer program, the team used the I2SysBio Garnatxa computing cluster, which has 15 computing nodes offering 950 parallel computing threads. In addition, the Gene Expression Genomics group led by Ana Conesa at I2SysBio participates in Elixir, one of the strategic infrastructures of the European Strategy Forum on Research Infrastructures (ESFRI), which allows life sciences laboratories across Europe to share and store their data.

The University of Florida and Pacific Biosciences collaborated on the development of SQANTI3.