The ALS Data Browser is a catalogue of genetic variants identified in 2,800 Caucasian patients who were recruited and sequenced following a diagnosis of amyotrophic lateral sclerosis (ALS). Approximately 93.5% of these cases are sporadic. The database includes single nucleotide variants (SNVs) and insertion and deletion variants (indels). Funding for this study was provided by Biogen Idec.
DNA sequencing was performed by the Duke Center for Human Genome Variation (now the Institute for Genomic Medicine), McGill University, Stanford University, and HudsonAlpha. Samples were either exome sequenced using the Agilent All Exon (37 Mb, 50 Mb or 65 Mb) or the Nimblegen SeqCap EZ V2.0 or 3.0 Exome Enrichment kit, or whole-genome sequenced, using Illumina GAIIx or HiSeq 2000 or 2500 sequencers according to standard protocols.
The Illumina lane-level FASTQ files were aligned to the Human Reference Genome (NCBI Build 37) using the Burrows-Wheeler Alignment Tool (BWA). Picard software was used to remove duplicate reads and process the resulting lane-level SAM files, producing a sample-level BAM file that was used for variant calling. GATK was used to recalibrate base quality scores, realign around indels, and call variants. The Duke and McGill/Stanford variants were required to have a quality score (QUAL) of at least 30, a quality by depth score of at least 2, a mapping quality score of at least 40, a genotype quality (GQ) score of at least 20, a read position rank sum score greater than -10, and at least 10x coverage. Additionally, variants were restricted according to VQSR tranche (calculated using the known SNV sites from HapMap v3.3, dbSNP, and the Omni chip array from the 1000 Genomes Project): the cutoffs were a tranche of 99.9% for SNVs and 99% for indels. Variants are flagged in the “Genotype Confidence” field if they were determined to be sequencing, batch-specific or kit-specific artifacts or HWE violations, or if they were marked by EVS as failures.
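The hard thresholds listed above can be read as a single pass/fail test per variant call. The following is a minimal sketch of that reading, not the pipeline actually used: the VariantCall record and its field names are hypothetical, and the VQSR tranche cutoffs and artifact flags are not modelled.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical record holding the per-call metrics named in the text."""
    qual: float               # variant quality score (QUAL)
    qd: float                 # quality by depth
    mq: float                 # mapping quality
    gq: float                 # genotype quality (GQ)
    read_pos_rank_sum: float  # read position rank sum
    depth: int                # read-depth coverage at the site


def passes_hard_filters(v: VariantCall) -> bool:
    """Return True if the call meets every threshold stated in the text."""
    return (
        v.qual >= 30
        and v.qd >= 2
        and v.mq >= 40
        and v.gq >= 20
        and v.read_pos_rank_sum > -10  # strictly greater, as stated
        and v.depth >= 10
    )
```

A call failing any one criterion is rejected; for example, a call with 9x coverage fails even if every other metric is high.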
Variant calls were restricted to coordinates within the Consensus Coding Sequence (CCDS) release 14, extended by up to 10 base pairs on each side of a protein-coding exon. All variants were annotated to Ensembl 73 using the Variant Effect Predictor (VEP). For the summary information, only the single most damaging variant effect prediction is reported; however, the effect of a variant on all transcripts can be found on the variant-level page.
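The restriction to exons plus flanking bases can be illustrated with a simple interval check. This is only a sketch under stated assumptions: exon coordinates are taken to be 1-based and inclusive, a fixed 10 bp flank is applied, and the function name and inputs are hypothetical rather than part of the browser's pipeline.

```python
def in_padded_exon(pos, exons, flank=10):
    """Return True if pos lies within any exon extended by `flank` bp per side.

    exons: iterable of (start, end) pairs on one chromosome, assumed to be
    1-based inclusive coordinates (an assumption, not stated in the text).
    """
    return any(start - flank <= pos <= end + flank for start, end in exons)
```

For an exon spanning positions 100 to 200, a variant at position 90 or 210 would be retained under this reading, while one at 89 or 211 would not.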
Coverage information for carrier and non-carrier sites is summarized as the percentage of the 2,800 sequenced patients with ALS who had at least 3x, 10x, 20x and 201x read-depth coverage at the site.
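The coverage summary amounts to counting, for each threshold, how many samples reach that depth at a site. A minimal sketch follows, assuming the per-sample read depths at one site are available as a list of integers; the function name is hypothetical, and the default thresholds simply repeat the values listed in the text.

```python
def coverage_summary(depths, thresholds=(3, 10, 20, 201)):
    """Percentage of samples with read depth >= each threshold at one site.

    depths: per-sample read depths at the site (one integer per sample).
    Returns a dict mapping each threshold to a percentage in [0, 100].
    """
    if not depths:
        raise ValueError("at least one sample depth is required")
    n = len(depths)
    return {t: 100.0 * sum(d >= t for d in depths) / n for t in thresholds}
```

With four samples at depths 0, 5, 12 and 25, this gives 75% at 3x, 50% at 10x, 25% at 20x and 0% at 201x.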