Google AI Releases DeepPolisher: A New Deep Learning Tool that Improves the Accuracy of Genome Assemblies by Precisely Correcting Base-Level Errors

Google AI, in collaboration with the UC Santa Cruz Genomics Institute, has released DeepPolisher, a deep learning tool that improves the accuracy of genome assemblies by correcting base-level errors. The tool has already been applied to the Human Pangenome Reference, where it substantially reduced assembly errors.

The Challenge of Accurate Genome Assembly

A reference genome serves as a crucial foundation for understanding genetic diversity, heredity, disease mechanisms, and evolutionary biology. Although advancements in sequencing technologies from companies like Illumina and Pacific Biosciences have improved sequencing accuracy and throughput, assembling an error-free human genome, which consists of over 3 billion nucleotides, remains a significant challenge. Even a minor per-base error rate can lead to thousands of errors, obscuring important genetic variations or misleading downstream analyses.
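To make the scale concrete, the expected number of errors is simply the assembly length multiplied by the per-base error rate. The short Python sketch below shows this arithmetic; the error rates are illustrative assumptions chosen for the example, not figures reported for DeepPolisher.

```python
def expected_errors(genome_length: int, per_base_error_rate: float) -> float:
    """Expected number of erroneous bases in an assembly of the given length."""
    return genome_length * per_base_error_rate


HAPLOID_HUMAN_GENOME = 3_000_000_000  # roughly 3 billion nucleotides

# Illustrative (assumed) per-base error rates: one error per 100 thousand,
# per 1 million, and per 10 million bases.
for rate in (1e-5, 1e-6, 1e-7):
    errors = expected_errors(HAPLOID_HUMAN_GENOME, rate)
    print(f"error rate {rate:.0e}: about {errors:,.0f} expected errors")
```

Even at one error per million bases, a 3-billion-base assembly would still carry about 3,000 errors.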

What Is DeepPolisher?

DeepPolisher is an open-source, transformer-based tool for correcting errors in assembled genomes, building on advances from DeepConsensus. It targets insertion and deletion (indel) errors in particular, because these can shift reading frames and cause important genes or regulatory elements to be missed during annotation.

Technology: Encoder-only transformer, adapting proven techniques from natural language processing for genomic applications.

Training Data: Trained on a human cell line extensively characterized by NIST and NHGRI, whose genome sequence is estimated to be about 99.99999% accurate, or approximately 300–1,000 errors across its 6 billion bases.

How Does It Work? (Technical Overview)

DeepPolisher operates through the following steps:

Input Alignment: Takes aligned PacBio HiFi reads against a haplotype-resolved genome assembly as input.
Error Site Detection: Scans the assembly in 25 kb windows to identify candidate error sites where read evidence deviates from the assembly.
Data Encoding: For each window containing potential errors (<100 bp), it creates a multi-channel tensor representation of read alignment features such as base, base quality, mapping quality, and match/mismatch status.
Model Inference: Feeds these tensors into the transformer, predicting corrected sequences for the identified regions.
Output Correction: Outputs the proposed corrections in VCF format, which are then applied to the assembly with tools such as bcftools to produce a polished, highly accurate sequence.
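The error site detection and data encoding steps can be illustrated with a toy sketch. The Python code below is not DeepPolisher's implementation: it assumes gapless alignments and substitutions only, and every name in it (AlignedRead, candidate_sites, encode_window, the min_fraction threshold, the channel order) is hypothetical and chosen for illustration. It flags positions where most covering reads disagree with the assembly, then builds a small channels x reads x positions array with the four features listed above.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    """Hypothetical, simplified read: gapless alignment to the assembly."""
    start: int   # 0-based assembly position of the first aligned base
    bases: str   # read bases
    quals: list  # per-base Phred qualities, same length as bases
    mapq: int    # mapping quality of the whole read


def candidate_sites(assembly, reads, min_fraction=0.5):
    """Positions where at least min_fraction of covering reads disagree."""
    depth = [0] * len(assembly)
    mismatches = [0] * len(assembly)
    for read in reads:
        for offset, base in enumerate(read.bases):
            pos = read.start + offset
            if 0 <= pos < len(assembly):
                depth[pos] += 1
                mismatches[pos] += base != assembly[pos]
    return [p for p in range(len(assembly))
            if depth[p] and mismatches[p] / depth[p] >= min_fraction]


BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}  # 0 means "no base here"


def encode_window(assembly, reads, start, end):
    """Nested list indexed [channel][read][column].

    Channels (assumed order): base, base quality, mapping quality, mismatch.
    """
    start, end = max(start, 0), min(end, len(assembly))
    width = end - start
    tensor = [[[0] * width for _ in reads] for _ in range(4)]
    for row, read in enumerate(reads):
        for offset, base in enumerate(read.bases):
            pos = read.start + offset
            if start <= pos < end:
                col = pos - start
                tensor[0][row][col] = BASE_CODE.get(base, 0)
                tensor[1][row][col] = read.quals[offset]
                tensor[2][row][col] = read.mapq
                tensor[3][row][col] = int(base != assembly[pos])
    return tensor


assembly = "ACGTACGT"
reads = [
    AlignedRead(0, "ACGAAC", [30] * 6, 60),
    AlignedRead(2, "GAACGT", [30] * 6, 60),
    AlignedRead(1, "CGTAC", [20] * 5, 10),
]
sites = candidate_sites(assembly, reads)
window = encode_window(assembly, reads, 2, 5)
print(sites)
print(window[3])  # mismatch channel, one row per read
```

In this toy input, two of the three reads covering position 3 carry an A where the assembly has a T, so that position is the only candidate site. A real polisher also has to represent insertions and deletions, which this sketch deliberately leaves out.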

Performance and Impact

DeepPolisher demonstrates substantial improvements in genome assembly accuracy:

Total error reduction: ~50%
Indel error reduction: >70%
Error rates: Achieves as low as one base error per 500,000 assembled bases in real-world deployment with the Human Pangenome Reference Consortium (HPRC).
Genomic Q-score improvement: Raises assembly quality from Q66.7 to Q70.1 on average, which corresponds to roughly one error per 10 million nucleotides.
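Q-scores follow the standard Phred definition, Q = -10 * log10(p), where p is the per-base error probability. As a quick check on the figures above, the following Python sketch converts between Q-scores and error rates; the function names are illustrative, not part of any DeepPolisher API.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Per-base error probability for a Phred-scaled quality score."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Phred-scaled quality score for a per-base error probability."""
    return -10 * math.log10(p)


def bases_per_error(q: float) -> float:
    """Average number of bases per error at a given Q-score."""
    return 10 ** (q / 10)


before, after = 66.7, 70.1
reduction = 1 - phred_to_error_rate(after) / phred_to_error_rate(before)

print(f"Q{before}: one error per {bases_per_error(before):,.0f} bases")
print(f"Q{after}: one error per {bases_per_error(after):,.0f} bases")
print(f"relative error reduction: {reduction:.1%}")
```

Under this definition, Q66.7 is roughly one error per 4.7 million bases and Q70.1 roughly one per 10.2 million, a reduction of a little over half, in line with the ~50% total error reduction reported.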

Every sample tested by HPRC showed measurable improvement. These gains make derived references more reliable, such as the Human Pangenome Reference, whose latest release expanded its data fivefold and used DeepPolisher to significantly reduce errors.

Deployment and Applications

DeepPolisher has been integrated into major genomic projects, including HPRC’s second data release, which provides high-accuracy reference assemblies for 232 individuals, ensuring broad ancestral diversity in genomic references. The tool is open-source and available on GitHub, complete with case studies and Dockerized workflows for polishing assemblies built from PacBio HiFi reads with assemblers such as hifiasm.

While initially focused on human genomes, the methodology and approach of DeepPolisher are adaptable for other organisms and sequencing platforms, promoting accuracy across the genomics community.

Practical Workflow Example

A typical workflow using DeepPolisher could involve:

Input: hifiasm diploid assembly and PacBio HiFi reads, phase-aligned using the PHARAOH pipeline.
Running: Dockerized commands for image creation, inference, and correction application.
Output: Separate VCF files for maternal and paternal assemblies, polished FASTAs after the bcftools consensus step.
Assessment: Using benchmarking tools (e.g., dipcall, hap.py) to quantify improvements in error rates and variant accuracy.
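The correction step in the workflow above amounts to rewriting the assembly at the positions listed in the VCF. The Python sketch below illustrates that idea on an in-memory sequence; it is a conceptual stand-in for the bcftools consensus step, not a replacement for it, and apply_edits and its (pos, ref, alt) tuples are hypothetical simplifications of real VCF records. Edits are applied from right to left so that an indel never shifts the coordinates of edits still to be applied.

```python
def apply_edits(sequence, edits):
    """Apply VCF-style edits to a sequence and return the polished result.

    edits: iterable of (pos, ref, alt) with pos 1-based, as in VCF.
    Raises ValueError if a REF allele does not match the sequence or if
    two edits overlap.
    """
    polished = sequence
    previous_start = len(sequence)  # 0-based start of the edit applied last
    for pos, ref, alt in sorted(edits, key=lambda e: e[0], reverse=True):
        start = pos - 1
        end = start + len(ref)
        if end > previous_start:
            raise ValueError(f"edit at position {pos} overlaps another edit")
        if polished[start:end] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        polished = polished[:start] + alt + polished[end:]
        previous_start = start
    return polished


contig = "ACGTTTGCA"
edits = [
    (3, "GT", "G"),  # deletion: removes one T from the TTT run
    (8, "C", "A"),   # single-base substitution
]
polished = apply_edits(contig, edits)
print(polished)
```

Checking each REF allele against the sequence before editing guards against applying a VCF to the wrong assembly or haplotype, which matters when maternal and paternal corrections are kept in separate files.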

Conclusion and Future Directions

DeepPolisher is a significant advance in genome polishing, markedly reducing error rates and enabling higher resolution for functional genomics, rare variant discovery, and clinical applications. By addressing the remaining obstacles to near-perfect genome assemblies, it supports more accurate diagnosis and population-level genetic studies, and it lays the groundwork for next-generation reference projects that will benefit biomedical research and medicine.
