--- _id: '7181' abstract: - lang: eng text: Multiple sequence alignments (MSAs) are used for structural [1,2] and evolutionary predictions [1,2], but the complexity of aligning large datasets requires the use of approximate solutions [3], including the progressive algorithm [4]. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf-to-root, based on a guide-tree. Their accuracy declines substantially as the number of sequences is scaled up [5]. We introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. Our regressive algorithm works the other way around to the progressive algorithm and begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. Our approach will enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project, which comprises 1.5 million eukaryotic genomes [6].
article_processing_charge: No article_type: original author: - first_name: Edgar full_name: Garriga, Edgar last_name: Garriga - first_name: Paolo full_name: Di Tommaso, Paolo last_name: Di Tommaso - first_name: Cedrik full_name: Magis, Cedrik last_name: Magis - first_name: Ionas full_name: Erb, Ionas last_name: Erb - first_name: Leila full_name: Mansouri, Leila last_name: Mansouri - first_name: Athanasios full_name: Baltzis, Athanasios last_name: Baltzis - first_name: Hafid full_name: Laayouni, Hafid last_name: Laayouni - first_name: Fyodor full_name: Kondrashov, Fyodor id: 44FDEF62-F248-11E8-B48F-1D18A9856A87 last_name: Kondrashov orcid: 0000-0001-8243-4694 - first_name: Evan full_name: Floden, Evan last_name: Floden - first_name: Cedric full_name: Notredame, Cedric last_name: Notredame citation: ama: Garriga E, Di Tommaso P, Magis C, et al. Large multiple sequence alignments with a root-to-leaf regressive method. Nature Biotechnology. 2019;37(12):1466-1470. doi:10.1038/s41587-019-0333-6 apa: Garriga, E., Di Tommaso, P., Magis, C., Erb, I., Mansouri, L., Baltzis, A., … Notredame, C. (2019). Large multiple sequence alignments with a root-to-leaf regressive method. Nature Biotechnology. Springer Nature. https://doi.org/10.1038/s41587-019-0333-6 chicago: Garriga, Edgar, Paolo Di Tommaso, Cedrik Magis, Ionas Erb, Leila Mansouri, Athanasios Baltzis, Hafid Laayouni, Fyodor Kondrashov, Evan Floden, and Cedric Notredame. “Large Multiple Sequence Alignments with a Root-to-Leaf Regressive Method.” Nature Biotechnology. Springer Nature, 2019. https://doi.org/10.1038/s41587-019-0333-6. ieee: E. Garriga et al., “Large multiple sequence alignments with a root-to-leaf regressive method,” Nature Biotechnology, vol. 37, no. 12. Springer Nature, pp. 1466–1470, 2019. ista: Garriga E, Di Tommaso P, Magis C, Erb I, Mansouri L, Baltzis A, Laayouni H, Kondrashov F, Floden E, Notredame C. 2019. Large multiple sequence alignments with a root-to-leaf regressive method. 
Nature Biotechnology. 37(12), 1466–1470. mla: Garriga, Edgar, et al. “Large Multiple Sequence Alignments with a Root-to-Leaf Regressive Method.” Nature Biotechnology, vol. 37, no. 12, Springer Nature, 2019, pp. 1466–70, doi:10.1038/s41587-019-0333-6. short: E. Garriga, P. Di Tommaso, C. Magis, I. Erb, L. Mansouri, A. Baltzis, H. Laayouni, F. Kondrashov, E. Floden, C. Notredame, Nature Biotechnology 37 (2019) 1466–1470. date_created: 2019-12-15T23:00:43Z date_published: 2019-12-01T00:00:00Z date_updated: 2023-09-06T14:32:52Z day: '01' department: - _id: FyKo doi: 10.1038/s41587-019-0333-6 ec_funded: 1 external_id: isi: - '000500748900021' pmid: - '31792410' intvolume: ' 37' isi: 1 issue: '12' language: - iso: eng main_file_link: - open_access: '1' url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6894943/ month: '12' oa: 1 oa_version: Submitted Version page: 1466-1470 pmid: 1 project: - _id: 26580278-B435-11E9-9278-68D0E5697425 call_identifier: H2020 grant_number: '771209' name: Characterizing the fitness landscape on population and global scales publication: Nature Biotechnology publication_identifier: eissn: - '15461696' issn: - '10870156' publication_status: published publisher: Springer Nature quality_controlled: '1' related_material: record: - id: '13059' relation: research_data status: public scopus_import: '1' status: public title: Large multiple sequence alignments with a root-to-leaf regressive method type: journal_article user_id: c635000d-4b10-11ee-a964-aac5a93f6ac1 volume: 37 year: '2019' ...