Too much information: better ways to manage data
Humanity is generating ever-increasing amounts of data with genome sequencing and internet use, faster than our computers can handle. An EU-funded project is designing storage and analysis solutions which can help optimise transport networks and advance research into diseases and personalised medicine.
The emergence of cheaper, better technology in the field of molecular biology is enabling the sequencing of the genomes of millions of individual humans and other animal species. This generates enormous volumes of data, and the ability to analyse it is critical to understanding biological organisms and treating diseases.
Similarly, the volume of information on the web is increasing as, either consciously or subconsciously, we generate it in our daily lives through clicks, likes, searches, downloads, uploads, and even our mere connection to the web. The sheer volume of all this data poses challenges to the current computational storage, management and indexing systems.
The EU-funded BIRDS project is working on the problem by designing new data-compressing structures, indexes and algorithms that provide better storage, processing and querying of large volumes of data. One approach takes advantage of the repetitiveness of, and patterns in, the data.
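To give a rough feel for how repetitiveness can be turned into compression, here is a minimal toy sketch in Python. It is an illustration only, not the project's own method: it assumes a naive Burrows-Wheeler transform (a well-known text rearrangement that tends to group identical characters together) followed by run-length encoding. The function names and the sample sequences are hypothetical.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform built from sorted rotations.

    Only suitable for toy inputs; practical tools use far more
    efficient constructions.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse each run of equal characters into a (char, length) pair."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def run_length_decode(runs: list[tuple[str, int]]) -> str:
    """Inverse of run_length_encode."""
    return "".join(ch * n for ch, n in runs)


if __name__ == "__main__":
    repetitive = "ACGT" * 50  # hypothetical, highly repetitive sequence
    runs = run_length_encode(bwt(repetitive))
    print(len(repetitive), "characters stored as", len(runs), "runs")
```

The more repetitive the input, the fewer runs the transformed text tends to contain, which is the property that indexes for repetitive collections try to exploit.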
Some members of the project team have obtained important results for indexing highly repetitive texts, such as genomic databases, which contain information on an organism’s DNA.
‘We expect these results will have a great impact on current bioinformatic software, such as Bowtie (a software package that can align and analyse sequences in bioinformatics), and on building indexes for large genomic databases efficiently,’ says project coordinator Susana Ladra of the University of A Coruña, Spain.
‘This can revolutionise the field of bioinformatics and help, for example, with the discovery of rare diseases.’
In the case of the analysis of biological sequences, new algorithms can be used to identify mutations and gene rearrangements present in cancer genomes, which can be essential for understanding the disease and developing targeted therapeutics, says Ladra.
The project is focusing on three lines of research: algorithms for sequence analysis, compression and indexing techniques for repetitive data, and data structures and algorithms for network analysis.
More efficient transport
Researchers from several partner institutions are working together on storing large amounts of data with spatio-temporal information, such as the positions of moving objects like boats and planes, and then locating a specific object among all this data. These challenges can be solved using structures which compress the data and keep an index enabling access to the information without having to decompress everything.
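The idea of reading one value without decompressing everything can be sketched as follows. This is a simplified illustration under stated assumptions, not the structure used by the project: it assumes a single coordinate of one moving object, stored as small differences between consecutive readings, with an absolute value kept every few readings to act as the index. The class name `SampledDeltaSeries` and the parameter `sample_every` are hypothetical.

```python
class SampledDeltaSeries:
    """Toy delta-encoded series with sampled absolute values as an index.

    Python lists do not actually save memory here; a real structure
    would pack the small deltas into a few bits each.
    """

    def __init__(self, values: list[int], sample_every: int = 4) -> None:
        if sample_every < 1:
            raise ValueError("sample_every must be at least 1")
        self.k = sample_every
        self.n = len(values)
        # Absolute values kept at every k-th position: the "index".
        self.samples = values[::sample_every]
        # deltas[j] holds values[j + 1] - values[j]; these are small
        # when the object moves smoothly.
        self.deltas = [values[j + 1] - values[j] for j in range(len(values) - 1)]

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, i: int) -> int:
        if not 0 <= i < self.n:
            raise IndexError(i)
        block = i // self.k
        start = block * self.k
        value = self.samples[block]
        # Decode at most k - 1 deltas, never the whole series.
        for j in range(start, i):
            value += self.deltas[j]
        return value


if __name__ == "__main__":
    positions = [100, 102, 105, 105, 104, 110, 111, 115, 120]
    series = SampledDeltaSeries(positions, sample_every=4)
    print(series[6])  # decoded from the sample at position 4
```

Choosing a smaller `sample_every` makes lookups faster but stores more absolute values, which is the usual trade-off between space and access time in this kind of design.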
Another project team is working on real transportation problems, with a focus on the public transport systems in Santiago, Chile, and Madrid, Spain. Using compressed representations for journeys over previously known networks, they are seeking solutions for vehicle route planning, addressing the needs expressed by some private companies.
The idea of using compact data structures arose when researchers noted the similarity between most object or trajectory movements in transportation systems, making this information highly compressible.
Having studied the different kinds of queries that can be solved with classic spatio-temporal indexes, they designed compact data structures to solve these queries efficiently while using less space. Extra information and new functions can be added to these data structures, depending on the requirements.
‘Results on trajectories can be commercialised and used by airlines, for example, to know how they can optimise routes, where they can save fuel, or which flights had problems on their routes,’ says Ladra.
Information on vehicle trajectories can also be used to track ships, detect fishing in a prohibited area, determine which routes are more popular, or pinpoint those that can be improved.
The project was funded through the Marie Skłodowska-Curie Research and Innovation Staff Exchange (RISE).
One project goal is to increase the number of new researchers attracted to this field on an international scale and improve the education of PhD candidates and postdoctoral researchers. ‘We expect better research will be carried out in Europe thanks to RISE funding,’ concludes Ladra.
The BIRDS project involves seven research institutions: the University of Melbourne, Australia; the University of Chile and the University of Concepción, Chile; the University of Helsinki, Finland; Kyushu University, Japan; Instituto de Engenharia de Sistemas e Computadores, Investigação e Desenvolvimento em Lisboa, Portugal; and the University of A Coruña, Spain, alongside the Spanish SME Enxenio SL.