The EGI Data Browser is a catalogue of genetic variants identified in 169 probands who were ascertained and sequenced because of a diagnosis of epilepsy. The database includes single nucleotide variants (SNVs) and insertion and deletion variants (indels).
Enrollment sites affiliated with EGI include Columbia University, Boston Children’s Hospital, The Children’s Hospital of Philadelphia (CHOP), Children’s Hospital Colorado, Children’s National, Ann & Robert H. Lurie Children’s Hospital of Chicago, NYU Langone Medical Center, Royal College of Surgeons in Ireland, University of California San Francisco, University of Iowa Children’s Hospital, and The University of Melbourne.
The members of the EGI Steering Committee are:
The majority of data were generated by clinical laboratories (Ambry Genetics, Baylor Genetics, Centogene, Children’s Hospital of Philadelphia, Claritas Genomics, Emory Genetics Laboratory, GeneDx, Iowa Institute of Human Genetics, Laboratory of Personalized Genomic Medicine at Columbia University, Medgenome, The Center for Advanced Studies Research and Development in Sardinia, The Epilepsy Research Centre at the University of Melbourne, University of California Los Angeles, University of Chicago, and Yale University DNA Diagnostics Laboratory) and transferred to Columbia for a centralized analysis. A small set were sequenced on a research basis by the Broad Institute or by the Institute for Genomic Medicine at Columbia University Medical Center.
The Illumina lane-level FASTQ files were aligned to the Human Reference Genome (NCBI Build 37) using the Burrows-Wheeler Alignment Tool (BWA). Picard software was used to remove duplicate reads and process the resulting lane-level SAM files, producing a sample-level BAM file that was used for variant calling. GATK was used to recalibrate base quality scores, realign around indels, and call variants. For EGIdb, variants were required to have a quality score (QUAL) of at least 30, a quality by depth score of at least 2, a mapping quality score of at least 40, a genotype quality (GQ) score of at least 20, a read position rank sum score greater than -10, a Fisher strand bias score under 200, and at least 10x coverage. Additionally, variants were restricted according to VQSR tranche (calculated using the known SNV sites from HapMap v3.3, dbSNP, and the Omni chip array from the 1000 Genomes Project): the cutoffs were a tranche of 99.9% for SNVs and 99% for indels. Variants are flagged in the “Genotype Confidence” field if they were determined to be sequencing, batch-specific, or kit-specific artifacts, or Hardy-Weinberg equilibrium (HWE) violations.
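The hard-filter thresholds listed above can be sketched as a single predicate. This is only an illustration, not the pipeline's actual code: the `VariantCall` class and its field names are hypothetical, every annotation is assumed to be present for each call, and the VQSR tranche and artifact checks are not covered.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical container for the per-variant annotations named in the text."""
    qual: float               # variant quality score (QUAL)
    qd: float                 # quality by depth
    mq: float                 # mapping quality
    gq: float                 # genotype quality (GQ)
    read_pos_rank_sum: float  # read position rank sum
    fs: float                 # Fisher strand bias
    depth: int                # read-depth coverage at the site


def passes_hard_filters(v: VariantCall) -> bool:
    """Return True if the call meets every hard-filter threshold described above."""
    return (
        v.qual >= 30
        and v.qd >= 2
        and v.mq >= 40
        and v.gq >= 20
        and v.read_pos_rank_sum > -10
        and v.fs < 200
        and v.depth >= 10
    )
```

A call that fails any single threshold, for example one with 9x coverage, is rejected even if every other annotation passes.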
Variant calls were restricted to coordinates within the Consensus Coding Sequence (CCDS) release 14, extended by two base pairs on each side of every protein-coding exon. All variants were annotated to Ensembl 73 using the Variant Effect Predictor (VEP). In the summary information, only the single most damaging predicted effect is reported; however, the effect of a variant on all transcripts can be viewed on the variant-level page.
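The two-base-pair flank can be illustrated with a small interval check. This is a sketch under stated assumptions: the function name is hypothetical, and positions and exon boundaries are assumed to be 1-based and inclusive, which the text does not specify.

```python
FLANK_BP = 2  # base pairs added on each side of a protein-coding exon, per the text


def in_padded_exon(pos: int, exon_start: int, exon_end: int, flank: int = FLANK_BP) -> bool:
    """Return True if pos lies within the exon extended by `flank` bp on both sides.

    Coordinates are assumed to be 1-based and inclusive (an assumption for this sketch).
    """
    return exon_start - flank <= pos <= exon_end + flank
```

For an exon spanning positions 100 to 200, a variant at position 98 or 202 would be retained under these assumptions, while one at 97 or 203 would not.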
Coverage information for carrier and non-carrier sites is summarized as the percentage of 169 sequenced probands ascertained for an epileptic encephalopathy that had at least 3x, 10x, 20x and 201x read-depth coverage at the site.
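As a rough sketch of how such a per-site summary could be computed, the function below takes one read depth per proband at a single site and reports the percentage of probands at or above each threshold. The function name and input layout are hypothetical; the default thresholds are simply those listed in the text.

```python
from typing import Dict, Sequence

# Depth thresholds as listed in the text above.
THRESHOLDS = (3, 10, 20, 201)


def coverage_summary(depths: Sequence[int],
                     thresholds: Sequence[int] = THRESHOLDS) -> Dict[int, float]:
    """Percentage of probands whose read depth at a site meets each threshold.

    `depths` holds one read-depth value per sequenced proband (hypothetical layout).
    """
    n = len(depths)
    if n == 0:
        raise ValueError("at least one proband depth is required")
    return {t: 100.0 * sum(d >= t for d in depths) / n for t in thresholds}
```

With four hypothetical probands covered at 0x, 5x, 15x, and 25x, the summary would be 75% at 3x, 50% at 10x, 25% at 20x, and 0% at 201x.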
Nick Ren, Joshua Bridgers