gwas-database: Query NHGRI-EBI GWAS Catalog for SNP-trait associations. Search variants by rs ID, disease/trait, gene, retrieve p-values and summary statistics, for genetic epidemiology and polygenic risk scores.
bunx add-skill davila7/claude-code-templates -s gwas-database
GWAS Catalog Database
Overview
The GWAS Catalog is a comprehensive repository of published genome-wide association studies maintained by the National Human Genome Research Institute (NHGRI) and the European Bioinformatics Institute (EBI). The catalog contains curated SNP-trait associations from thousands of GWAS publications, including genetic variants, associated traits and diseases, p-values, effect sizes, and full summary statistics for many studies.
When to Use This Skill
This skill should be used when queries involve:
Genetic variant associations: Finding SNPs associated with diseases or traits
SNP lookups: Retrieving information about specific genetic variants (rs IDs)
Trait/disease searches: Discovering genetic associations for phenotypes
Gene associations: Finding variants in or near specific genes
GWAS summary statistics: Accessing complete genome-wide association data
Study metadata: Retrieving publication and cohort information
Population genetics: Exploring ancestry-specific associations
Polygenic risk scores: Identifying variants for risk prediction models
Functional genomics: Understanding variant effects and genomic context
Systematic reviews: Comprehensive literature synthesis of genetic associations
Core Capabilities
1. Understanding GWAS Catalog Data Structure
The GWAS Catalog is organized around four core entities:
Studies: GWAS publications with metadata (PMID, author, cohort details)
Associations: SNP-trait associations with statistical evidence (the Catalog curates associations with p < 1×10⁻⁵; the conventional genome-wide significance threshold is p ≤ 5×10⁻⁸)
Variants: Genetic markers (SNPs) with genomic coordinates and alleles
Traits: Phenotypes and diseases (mapped to EFO ontology terms)
Key identifiers:
Study accessions: GCST IDs (e.g., GCST001234)
Variant IDs: rs numbers (e.g., rs7903146) or variant_id format
Trait IDs: EFO terms (e.g., EFO_0001360 for type 2 diabetes)
Gene symbols: HGNC approved names (e.g., TCF7L2)
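Because each API endpoint expects a particular identifier type, it can help to check the shape of an identifier before building a URL. The following is a minimal sketch: the regular expressions are inferred from the example identifiers above and are assumptions, not an official grammar, and `classify_identifier` is a hypothetical helper name. Gene symbols have no reliable pattern, so they fall through to "unknown".

```python
import re

# Patterns inferred from the examples in this document (assumptions, not an official spec).
ID_PATTERNS = {
    "study": re.compile(r"GCST\d+"),    # e.g. GCST001234
    "variant": re.compile(r"rs\d+"),    # e.g. rs7903146
    "trait": re.compile(r"EFO_\d+"),    # e.g. EFO_0001360
}


def classify_identifier(value: str) -> str:
    """Return 'study', 'variant', 'trait', or 'unknown' for an identifier string."""
    text = value.strip()
    for kind, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(text):
            return kind
    return "unknown"


if __name__ == "__main__":
    for example in ["GCST001234", "rs7903146", "EFO_0001360", "TCF7L2"]:
        print(example, "->", classify_identifier(example))
```

Traits mapped to ontologies other than EFO would also be reported as "unknown" by this sketch.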
2. Web Interface Searches
Returns all trait associations for this SNP.
type 2 diabetes
Parkinson disease
body mass index
Returns all associated genetic variants.
Returns variants in or near the gene region.
Returns variants in the specified genomic interval.
PMID:20581827
Author: McCarthy MI
GCST001234
Returns study details and all reported associations.
3. REST API Access
The GWAS Catalog provides two REST APIs for programmatic access:
GWAS Catalog API: https://www.ebi.ac.uk/gwas/rest/api
Summary Statistics API: https://www.ebi.ac.uk/gwas/summary-statistics/api
Studies endpoint - /studies/{accessionID}
import requests
# Get a specific study
url = "https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001795"
response = requests.get(url, headers={"Content-Type": "application/json"})
study = response.json()
Associations endpoint - /associations
# Find associations for a variant
variant = "rs7903146"
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{variant}/associations"
params = {"projection": "associationBySnp"}
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
associations = response.json()
Variants endpoint - /singleNucleotidePolymorphisms/{rsID}
# Get variant details
url = "https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/rs7903146"
response = requests.get(url, headers={"Content-Type": "application/json"})
variant_info = response.json()
Traits endpoint - /efoTraits/{efoID}
# Get trait information
url = "https://www.ebi.ac.uk/gwas/rest/api/efoTraits/EFO_0001360"
response = requests.get(url, headers={"Content-Type": "application/json"})
trait_info = response.json()
4. Query Examples and Patterns
Example 1: Find all associations for a disease
import requests
trait = "EFO_0001360" # Type 2 diabetes
base_url = "https://www.ebi.ac.uk/gwas/rest/api"
# Query associations for this trait
url = f"{base_url}/efoTraits/{trait}/associations"
response = requests.get(url, headers={"Content-Type": "application/json"})
associations = response.json()
# Process results
for assoc in associations.get('_embedded', {}).get('associations', []):
    variant = assoc.get('rsId')
    pvalue = assoc.get('pvalue')
    risk_allele = assoc.get('strongestAllele')
    print(f"{variant}: p={pvalue}, risk allele={risk_allele}")
Example 2: Get variant information and all trait associations
import requests
variant = "rs7903146"
base_url = "https://www.ebi.ac.uk/gwas/rest/api"
# Get variant details
url = f"{base_url}/singleNucleotidePolymorphisms/{variant}"
response = requests.get(url, headers={"Content-Type": "application/json"})
variant_data = response.json()
# Get all associations for this variant
url = f"{base_url}/singleNucleotidePolymorphisms/{variant}/associations"
params = {"projection": "associationBySnp"}
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
associations = response.json()
# Extract trait names and p-values
for assoc in associations.get('_embedded', {}).get('associations', []):
    trait = assoc.get('efoTrait')
    pvalue = assoc.get('pvalue')
    print(f"Trait: {trait}, p-value: {pvalue}")
Example 3: Access summary statistics
import requests
# Query summary statistics API
base_url = "https://www.ebi.ac.uk/gwas/summary-statistics/api"
# Find associations by trait with p-value threshold
trait = "EFO_0001360" # Type 2 diabetes
p_upper = "0.000000001" # p < 1e-9
url = f"{base_url}/traits/{trait}/associations"
params = {
    "p_upper": p_upper,
    "size": 100  # Number of results
}
response = requests.get(url, params=params)
results = response.json()
# Process genome-wide significant hits
for hit in results.get('_embedded', {}).get('associations', []):
    variant_id = hit.get('variant_id')
    chromosome = hit.get('chromosome')
    position = hit.get('base_pair_location')
    pvalue = hit.get('p_value')
    print(f"{chromosome}:{position} ({variant_id}): p={pvalue}")
Example 4: Query by chromosomal region
import requests
# Find variants in a specific genomic region
chromosome = "10"
start_pos = 114000000
end_pos = 115000000
base_url = "https://www.ebi.ac.uk/gwas/rest/api"
url = f"{base_url}/singleNucleotidePolymorphisms/search/findByChromBpLocationRange"
params = {
    "chrom": chromosome,
    "bpStart": start_pos,
    "bpEnd": end_pos
}
response = requests.get(url, params=params, headers={"Content-Type": "application/json"})
variants_in_region = response.json()
5. Working with Summary Statistics
The GWAS Catalog hosts full summary statistics for many studies, providing access to all tested variants (not just genome-wide significant hits).
Summary Statistics API Features:
Filter by chromosome, position, p-value
Query specific variants across studies
Retrieve effect sizes and allele frequencies
Access harmonized and standardized data
Example: Download summary statistics for a study
import requests
import gzip
# Get available summary statistics
base_url = "https://www.ebi.ac.uk/gwas/summary-statistics/api"
url = f"{base_url}/studies/GCST001234"
response = requests.get(url)
study_info = response.json()
# Download link is provided in the response
# Alternatively, use FTP:
# ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCSTXXXXXX/
6. Data Integration and Cross-referencing
The GWAS Catalog provides links to external resources:
Ensembl: Gene annotations and variant consequences
dbSNP: Variant identifiers and population frequencies
gnomAD: Population allele frequencies
Open Targets: Target-disease associations
PGS Catalog: Polygenic risk scores
UCSC Genome Browser: Genomic context
EFO (Experimental Factor Ontology): Standardized trait terms
OMIM: Disease gene relationships
Disease Ontology: Disease hierarchies
Following Links in API Responses:
import requests
# API responses include _links for related resources
response = requests.get("https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001234")
study = response.json()
# Follow link to associations
associations_url = study['_links']['associations']['href']
associations_response = requests.get(associations_url)
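Indexing `study['_links']['associations']['href']` directly raises `KeyError` when a relation is absent, and HAL-style links may be URI templates (for example ending in `{?projection}`). A small sketch of a more defensive lookup follows; `link_href` is a hypothetical helper, and the assumption that templated links are flagged with `"templated": true` follows the HAL convention rather than anything stated in this document.

```python
import re
from typing import Optional


def link_href(resource: dict, rel: str) -> Optional[str]:
    """Return the URL for a link relation, or None if the relation is missing.

    Template placeholders such as {?projection} are stripped so the
    result can be passed straight to an HTTP client.
    """
    link = resource.get("_links", {}).get(rel)
    if not link or "href" not in link:
        return None
    href = link["href"]
    if link.get("templated"):
        href = re.sub(r"\{[^}]*\}", "", href)
    return href


if __name__ == "__main__":
    # Hand-written sample shaped like a HAL response (not real API output).
    study = {
        "_links": {
            "associations": {
                "href": "https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001234/associations{?projection}",
                "templated": True,
            }
        }
    }
    print(link_href(study, "associations"))
    print(link_href(study, "missing_relation"))
```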
Query Workflows
Workflow 1: Exploring Genetic Associations for a Disease
Identify the trait using EFO terms or free text:
Search web interface for disease name
Note the EFO ID (e.g., EFO_0001360 for type 2 diabetes)
Query associations via API:
url = f"https://www.ebi.ac.uk/gwas/rest/api/efoTraits/{efo_id}/associations"
Filter by significance and population:
Check p-values (genome-wide significant: p ≤ 5×10⁻⁸)
Review ancestry information in study metadata
Filter by sample size or discovery/replication status
Extract variant details:
rs IDs for each association
Effect alleles and directions
Effect sizes (odds ratios, beta coefficients)
Population allele frequencies
Cross-reference with other databases:
Look up variant consequences in Ensembl
Check population frequencies in gnomAD
Explore gene function and pathways
Workflow 2: Investigating a Specific Genetic Variant
Query the variant:
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}"
Retrieve all trait associations:
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/{rs_id}/associations"
Analyze pleiotropy:
Identify all traits associated with this variant
Review effect directions across traits
Look for shared biological pathways
Check genomic context:
Determine nearby genes
Identify if variant is in coding/regulatory regions
Review linkage disequilibrium with other variants
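The pleiotropy step in this workflow amounts to grouping a variant's associations by trait. A minimal sketch is shown below; it assumes each association record exposes the `efoTrait` and `pvalue` fields used elsewhere in this document, and `traits_by_best_pvalue` is a hypothetical helper name.

```python
from typing import Dict, Iterable, List, Tuple


def traits_by_best_pvalue(associations: Iterable[dict]) -> List[Tuple[str, float]]:
    """Collapse association records to one entry per trait (smallest p-value),
    sorted from strongest to weakest evidence."""
    best: Dict[str, float] = {}
    for assoc in associations:
        trait = assoc.get("efoTrait")
        pvalue = assoc.get("pvalue")
        if trait is None or pvalue is None:
            continue  # skip incomplete records
        p = float(pvalue)
        if trait not in best or p < best[trait]:
            best[trait] = p
    return sorted(best.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Illustrative records only; the values are made up.
    sample = [
        {"efoTrait": "type 2 diabetes", "pvalue": 1e-30},
        {"efoTrait": "type 2 diabetes", "pvalue": "1e-50"},
        {"efoTrait": "body mass index", "pvalue": 1e-9},
    ]
    for trait, p in traits_by_best_pvalue(sample):
        print(f"{trait}: best p={p:.1e}")
```

A variant that appears under several unrelated traits in this summary is a candidate for closer review of effect directions.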
Workflow 3: Gene-Centric Association Analysis
Search by gene symbol in web interface or:
url = f"https://www.ebi.ac.uk/gwas/rest/api/singleNucleotidePolymorphisms/search/findByGene"
params = {"geneName": gene_symbol}
Retrieve variants in gene region:
Get chromosomal coordinates for gene
Query variants in region
Include promoter and regulatory regions (extend boundaries)
Analyze association patterns:
Identify traits associated with variants in this gene
Look for consistent associations across studies
Review effect sizes and directions
Functional interpretation:
Determine variant consequences (missense, regulatory, etc.)
Check expression QTL (eQTL) data
Review pathway and network context
Workflow 4: Systematic Review of Genetic Evidence
Define research question:
Specific trait or disease of interest
Population considerations
Study design requirements
Comprehensive variant extraction:
Query all associations for trait
Set significance threshold
Note discovery and replication studies
Quality assessment:
Review study sample sizes
Check for population diversity
Assess heterogeneity across studies
Identify potential biases
Data synthesis:
Aggregate associations across studies
Perform meta-analysis if applicable
Create summary tables
Generate Manhattan or forest plots
Export and documentation:
Download full association data
Export summary statistics if needed
Document search strategy and date
Create reproducible analysis scripts
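For the meta-analysis step in the data synthesis stage above, one common choice is a fixed-effect inverse-variance model: each study's effect is weighted by 1/SE², the pooled effect is the weighted mean, and the pooled standard error is sqrt(1/Σw). The sketch below assumes the per-study betas are on the same scale and aligned to the same effect allele, and that the studies are independent; `fixed_effect_meta` is a hypothetical helper, not part of the GWAS Catalog.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_meta(betas: Sequence[float], ses: Sequence[float]) -> Tuple[float, float, float]:
    """Inverse-variance fixed-effect meta-analysis.

    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    if len(betas) != len(ses) or len(betas) == 0:
        raise ValueError("betas and ses must be non-empty and of equal length")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, p_value


if __name__ == "__main__":
    # Made-up effect estimates from two hypothetical studies.
    beta, se, p = fixed_effect_meta([0.1, 0.3], [0.1, 0.1])
    print(f"pooled beta={beta:.3f}, se={se:.4f}, p={p:.4f}")
```

When heterogeneity across studies is substantial, a random-effects model may be more appropriate than this sketch.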
Workflow 5: Accessing and Analyzing Summary Statistics
Identify studies with summary statistics:
Browse summary statistics portal
Check FTP directory listings
Query API for available studies
Download summary statistics:
# Via FTP
wget ftp://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCSTXXXXXX/harmonised/GCSTXXXXXX-harmonised.tsv.gz
Query via API for specific variants:
url = f"https://www.ebi.ac.uk/gwas/summary-statistics/api/chromosomes/{chrom}/associations"
params = {"start": start_pos, "end": end_pos}
Process and analyze:
Filter by p-value thresholds
Extract effect sizes and confidence intervals
Perform downstream analyses (fine-mapping, colocalization, etc.)
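The filtering and confidence-interval steps can be sketched with the standard library alone. The column names `variant_id` and `p_value` match the fields shown in the summary statistics example earlier; `beta` and `standard_error` are assumed column names, and the 95% interval uses a normal approximation (beta ± 1.96 × SE). `significant_rows` is a hypothetical helper.

```python
import csv
from typing import Dict, Iterable, Iterator


def significant_rows(lines: Iterable[str], p_threshold: float = 5e-8) -> Iterator[Dict[str, object]]:
    """Yield rows of a tab-separated summary statistics file with p <= p_threshold.

    Rows with missing or non-numeric values (e.g. 'NA') are skipped.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        try:
            p = float(row["p_value"])
            beta = float(row["beta"])
            se = float(row["standard_error"])
        except (KeyError, TypeError, ValueError):
            continue
        if p <= p_threshold:
            yield {
                "variant_id": row.get("variant_id"),
                "p_value": p,
                "beta": beta,
                "ci_lower": beta - 1.96 * se,
                "ci_upper": beta + 1.96 * se,
            }


if __name__ == "__main__":
    # Tiny in-memory example with made-up values.
    demo = [
        "variant_id\tp_value\tbeta\tstandard_error",
        "rs1\t1e-9\t0.2\t0.05",
        "rs2\t0.01\t0.1\t0.05",
    ]
    for hit in significant_rows(demo):
        print(hit)
```

For a downloaded file, the same function can be fed from `gzip.open(path, "rt")`, which streams the file line by line instead of loading it into memory.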
Response Formats and Data Fields
Key Fields in Association Records:
rsId: Variant identifier (rs number)
strongestAllele: Risk allele for the association
pvalue: Association p-value
pvalueText: P-value as text (may include inequality)
orPerCopyNum: Odds ratio per copy of the risk allele (for binary traits)
betaNum: Effect size (for quantitative traits)
betaUnit: Unit of measurement for beta
range: Confidence interval
efoTrait: Associated trait name
mappedLabel: EFO-mapped trait term
accessionId: GCST study identifier
pubmedId: PubMed ID
author: First author
publicationDate: Publication date
ancestryInitial: Discovery population ancestry
ancestryReplication: Replication population ancestry
sampleSize: Total sample size
Pagination:
Results are paginated (default 20 items per page). Navigate using:
size parameter: Number of results per page
page parameter: Page number (0-indexed)
_links in response: URLs for next/previous pages
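Following the `next` link, rather than incrementing a page counter, can be sketched as a small generator. The HTTP call is passed in as a function so the loop is easy to test offline; `iter_pages` and the `fetch` callable are hypothetical names, and the sketch assumes each page carries a `_links.next.href` entry until the last page.

```python
from typing import Callable, Iterator, Optional


def iter_pages(first_url: str, fetch: Callable[[str], dict], max_pages: int = 1000) -> Iterator[dict]:
    """Yield successive response pages by following _links.next.href.

    `fetch` takes a URL and returns the decoded JSON body as a dict.
    `max_pages` is a safety cap against unexpected link cycles.
    """
    url: Optional[str] = first_url
    for _ in range(max_pages):
        if url is None:
            return
        page = fetch(url)
        yield page
        url = page.get("_links", {}).get("next", {}).get("href")


if __name__ == "__main__":
    # Offline demonstration with two fake pages keyed by URL.
    fake_pages = {
        "page0": {"_embedded": {"associations": ["a", "b"]}, "_links": {"next": {"href": "page1"}}},
        "page1": {"_embedded": {"associations": ["c"]}, "_links": {}},
    }
    for page in iter_pages("page0", fake_pages.__getitem__):
        print(page["_embedded"]["associations"])
```

With `requests`, a suitable `fetch` could be `lambda u: requests.get(u, timeout=30).json()`.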
Best Practices
Query Strategy
Start with web interface to identify relevant EFO terms and study accessions
Use API for bulk data extraction and automated analyses
Implement pagination handling for large result sets
Cache API responses to minimize redundant requests
Data Interpretation
Always check p-value thresholds (genome-wide: 5×10⁻⁸)
Review ancestry information for population applicability
Consider sample size when assessing evidence strength
Check for replication across independent studies
Be aware of winner's curse in effect size estimates
Rate Limiting and Ethics
Respect API usage guidelines (no excessive requests)
Use summary statistics downloads for genome-wide analyses
Implement appropriate delays between API calls
Cache results locally when performing iterative analyses
Cite the GWAS Catalog in publications
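The caching and delay recommendations above can be combined in one small wrapper. This is a sketch under stated assumptions: `ThrottledCache` is a hypothetical class, the cache is in-memory only, and the 0.1 second default spacing is an illustrative value rather than a documented limit.

```python
import time
from typing import Any, Callable, Dict, Optional


class ThrottledCache:
    """Cache responses by URL and space out uncached requests."""

    def __init__(self, fetch: Callable[[str], Any], min_interval: float = 0.1) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._cache: Dict[str, Any] = {}
        self._last_call: Optional[float] = None

    def get(self, url: str) -> Any:
        # Serve repeated requests from memory without touching the network.
        if url in self._cache:
            return self._cache[url]
        # Wait until at least min_interval has passed since the previous fetch.
        if self._last_call is not None:
            wait = self._min_interval - (time.monotonic() - self._last_call)
            if wait > 0:
                time.sleep(wait)
        result = self._fetch(url)
        self._last_call = time.monotonic()
        self._cache[url] = result
        return result


if __name__ == "__main__":
    calls = []

    def fake_fetch(url: str) -> dict:
        calls.append(url)
        return {"url": url}

    client = ThrottledCache(fake_fetch, min_interval=0.05)
    client.get("a")
    client.get("a")  # served from cache
    client.get("b")
    print(calls)
```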
Data Quality Considerations
GWAS Catalog curates published associations (may contain inconsistencies)
Effect sizes reported as published (may need harmonization)
Some studies report conditional or joint associations
Check for study overlap when combining results
Be aware of ascertainment and selection biases
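Since effect sizes are reported as published, odds ratios and betas often need to be put on a common scale before they are combined. One way to do this is to convert an odds ratio and its 95% confidence interval to a log-odds effect and standard error: beta = ln(OR) and SE = (ln(upper) − ln(lower)) / (2 × 1.96). The sketch assumes the interval is a 95% interval that is symmetric on the log scale; `log_odds_from_or` is a hypothetical helper.

```python
import math
from typing import Tuple


def log_odds_from_or(odds_ratio: float, ci_lower: float, ci_upper: float,
                     z: float = 1.959964) -> Tuple[float, float]:
    """Convert an odds ratio with a confidence interval to (beta, standard_error).

    `z` is the normal quantile for the interval width (default: 95%).
    """
    if min(odds_ratio, ci_lower, ci_upper) <= 0:
        raise ValueError("odds ratio and interval bounds must be positive")
    if ci_upper <= ci_lower:
        raise ValueError("ci_upper must be greater than ci_lower")
    beta = math.log(odds_ratio)
    standard_error = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * z)
    return beta, standard_error


if __name__ == "__main__":
    # Made-up odds ratio and interval for illustration.
    beta, se = log_odds_from_or(1.5, 1.2, 1.875)
    print(f"beta={beta:.4f}, se={se:.4f}")
```

This conversion does not resolve which allele the effect refers to; allele alignment has to be checked separately.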
Python Integration Example
Complete workflow for querying and analyzing GWAS data:
import requests
import pandas as pd
from time import sleep
def query_gwas_catalog(trait_id, p_threshold=5e-8):
    """
    Query GWAS Catalog for trait associations

    Args:
        trait_id: EFO trait identifier (e.g., 'EFO_0001360')
        p_threshold: P-value threshold for filtering
    Returns:
        pandas DataFrame with association results
    """
    base_url = "https://www.ebi.ac.uk/gwas/rest/api"
    url = f"{base_url}/efoTraits/{trait_id}/associations"
    headers = {"Content-Type": "application/json"}
    results = []
    page = 0
    while True:
        params = {"page": page, "size": 100}
        response = requests.get(url, params=params, headers=headers)
        # Stop on any non-success status (including running past the last page)
        if response.status_code != 200:
            break
        data = response.json()
        associations = data.get('_embedded', {}).get('associations', [])
        if not associations:
            break
        for assoc in associations:
            pvalue = assoc.get('pvalue')
            # Compare against None so that a p-value of 0.0 is not dropped
            if pvalue is not None and float(pvalue) <= p_threshold:
                results.append({
                    'variant': assoc.get('rsId'),
                    'pvalue': pvalue,
                    'risk_allele': assoc.get('strongestAllele'),
                    'or_beta': assoc.get('orPerCopyNum') or assoc.get('betaNum'),
                    'trait': assoc.get('efoTrait'),
                    'pubmed_id': assoc.get('pubmedId')
                })
        page += 1
        sleep(0.1)  # Rate limiting
    return pd.DataFrame(results)
# Example usage
df = query_gwas_catalog('EFO_0001360') # Type 2 diabetes
print(df.head())
print(f"\nTotal associations: {len(df)}")
print(f"Unique variants: {df['variant'].nunique()}")
Resources
references/api_reference.md
Comprehensive API documentation including:
Detailed endpoint specifications for both APIs
Complete list of query parameters and filters
Response format specifications and field descriptions
Advanced query examples and patterns
Error handling and troubleshooting
Integration with external databases
Consult this reference when:
Constructing complex API queries
Understanding response structures
Implementing pagination or batch operations
Troubleshooting API errors
Exploring advanced filtering options
Training Materials
The GWAS Catalog team provides workshop materials:
Important Notes
Data Updates
The GWAS Catalog is updated regularly with new publications
Re-run queries periodically for comprehensive coverage
Summary statistics are added as studies release data
EFO mappings may be updated over time
Citation Requirements
When using GWAS Catalog data, cite:
Sollis E, et al. (2023) The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Research. PMID: 37953337
Include access date and version when available
Cite original studies when discussing specific findings
Limitations
Not all GWAS publications are included (curation criteria apply)
Full summary statistics available for subset of studies
Effect sizes may require harmonization across studies
Population diversity is growing but historically limited
Some associations represent conditional or joint effects
Data Access
Web interface: Free, no registration required
REST APIs: Free, no API key needed
FTP downloads: Open access
Rate limiting applies to API (be respectful)
Additional Resources