Mouse Embryo (Step by Step)

This tutorial guides you through running gsMap step by step, with user-defined parameters and resources, granting greater flexibility and control over the analysis. This mode is suited for users who require detailed customization of their pipeline. For a faster, one-command execution, please see the Quick Mode tutorial.

Preparation

Please ensure you have installed the gsMap. This tutorial guides you through using gsMap in a step-by-step manner.

1. Download dependencies

gsMap requires specific reference files:

  • Gene transfer format (GTF) file, for gene coordinates on the genome.

  • LD reference panel (PLINK bfile), for computing LD scores.

  • SNP weight file, to adjust correlations between SNP-trait association statistics.

  • Homologous gene transformations file (optional), to map genes between species.

  • Enhancer-gene mapping file (optional), for linking SNPs to genes based on enhancer annotations.

To download the resources:

wget https://yanglab.westlake.edu.cn/data/gsMap/gsMap_resource.tar.gz
tar -xvzf gsMap_resource.tar.gz

Directory structure:

tree -L 2

gsMap_resource
    ├── genome_annotation
       ├── enhancer
       └── gtf
    ├── homologs
       ├── macaque_human_homologs.txt
       └── mouse_human_homologs.txt
    ├── LD_Reference_Panel
       └── 1000G_EUR_Phase3_plink
    ├── LDSC_resource
       ├── hapmap3_snps
       └── weights_hm3_no_hla
    └── quick_mode
        ├── baseline
        ├── SNP_gene_pair
        └── snp_gene_weight_matrix.h5ad

If you want to use your own reference files, please ensure that the genome build versions (e.g., Hg37 or Hg38) are consistent between the GTF file and the LD reference panel.

2. Download example data

wget https://yanglab.westlake.edu.cn/data/gsMap/gsMap_example_data.tar.gz
tar -xvzf gsMap_example_data.tar.gz

Directory structure:

tree -L 2

gsMap_example_data/
├── GWAS
│   ├── BCX2_MCHC_EA_GWAMA.sumstats.gz
│   ├── GIANT_EUR_Height_2022_Nature.sumstats.gz
│   ├── gwas_config.yaml
│   └── IQ_NG_2018.sumstats.gz
└── ST
    └── E16.5_E1S1.MOSTA.h5ad

Running gsMap

1. find latent representations (optional)

Objective: Learn latent representations for spots. The latent embedding learned from this step will be store in the AnnData object obsm field under the key latent_GVAE.

Note

The --workdir parameter specifies the working directory for gsMap, where all outputs will be saved.

Execution: required memory: ~60G (120K cells)

gsmap run_find_latent_representations \
    --workdir './example/Mouse_Embryo' \
    --sample_name 'E16.5_E1S1.MOSTA' \
    --input_hdf5_path 'gsMap_example_data/ST/E16.5_E1S1.MOSTA.h5ad' \
    --annotation 'annotation' \
    --data_layer 'count'

2. generate gene specificity scores

Objective: Identify homogeneous spots for each spot based on their latent representations specified by --latent_representation, and then generate gene specificity scores (GSS) for each spot by aggregating information from its homogeneous spots.

Note

If your ST data is not from a human species but you want to map human GWAS data to it, please provide a homologous transformation file to convert gene names. The first column should list gene names from the ST data species, and the second column from the GWAS data species.

Execution: required memory: ~45G (120K cells)

gsmap run_latent_to_gene \
    --workdir './example/Mouse_Embryo' \
    --sample_name 'E16.5_E1S1.MOSTA' \
    --annotation 'annotation' \
    --latent_representation 'latent_GVAE' \
    --num_neighbour 51 \
    --num_neighbour_spatial 201 \
    --homolog_file 'gsMap_resource/homologs/mouse_human_homologs.txt'

3. generate ldscore

Objective: Assign gene specificity scores (GSS) to SNPs and compute the stratified LD score.

Execution: required memory: ~40G

Three SNP to gene linking strategies are available:

This strategy uses TSS to assign GSS to SNPs. The –gene_window_size parameter defines the window size around the gene body for this assignment. If a SNP falls within the window of multiple genes, the GSS from the nearest gene will be used.

for CHROM in {1..22}
do
    gsmap run_generate_ldscore \
        --workdir './example/Mouse_Embryo' \
        --sample_name 'E16.5_E1S1.MOSTA' \
        --chrom $CHROM \
        --bfile_root 'gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC' \
        --keep_snp_root 'gsMap_resource/LDSC_resource/hapmap3_snps/hm' \
        --gtf_annotation_file 'gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf' \
        --gene_window_size 50000
done

This strategy uses enhancer-gene linking to assign GSS to SNPs. When a SNP maps to multiple enhancers, the GSS for the SNP is determined by the --snp_multiple_enhancer_strategy parameter. By default, this is set to max_mkscore, which assigns the SNP the maximum GSS among the enhancers it maps to. Another option is nearest_TSS.

for CHROM in {1..22}
do
    gsmap run_generate_ldscore \
        --workdir './example/Mouse_Embryo' \
        --sample_name 'E16.5_E1S1.MOSTA' \
        --chrom $CHROM \
        --bfile_root 'gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC' \
        --keep_snp_root 'gsMap_resource/LDSC_resource/hapmap3_snps/hm' \
        --gtf_annotation_file 'gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf' \
        --enhancer_annotation_file 'gsMap_resource/genome_annotation/enhancer/by_tissue/ALL/ABC_roadmap_merged.bed' \
        --snp_multiple_enhancer_strategy 'max_mkscore' \
        --gene_window_enhancer_priority 'enhancer_only'
done

This strategy uses both TSS and enhancer-gene linking to assign GSS to SNPs. If a SNP maps to both a gene TSS window and an enhancer linked to a different gene, the --gene_window_enhancer_priority parameter decides which gene the SNP is assigned to. The options are gene_window_first or enhancer_first.

for CHROM in {1..22}
do
    gsmap run_generate_ldscore \
        --workdir './example/Mouse_Embryo' \
        --sample_name 'E16.5_E1S1.MOSTA' \
        --chrom $CHROM \
        --bfile_root 'gsMap_resource/LD_Reference_Panel/1000G_EUR_Phase3_plink/1000G.EUR.QC' \
        --keep_snp_root 'gsMap_resource/LDSC_resource/hapmap3_snps/hm' \
        --gtf_annotation_file 'gsMap_resource/genome_annotation/gtf/gencode.v46lift37.basic.annotation.gtf' \
        --gene_window_size 50000 \
        --enhancer_annotation_file 'gsMap_resource/genome_annotation/enhancer/by_tissue/ALL/ABC_roadmap_merged.bed' \
        --snp_multiple_enhancer_strategy 'max_mkscore' \
        --gene_window_enhancer_priority 'gene_window_first'
done

Caution

If you run out of memory during this step or the next, you can reduce the --spots_per_chunk parameter to a smaller value. Generally, 40GB of memory is required when --spots_per_chunk is set to 1000.

4. spatial ldsc

Objective: Run spatial LDSC to associate spots with traits.

Execution: required memory: ~40G

gsmap run_spatial_ldsc \
    --workdir './example/Mouse_Embryo' \
    --sample_name 'E16.5_E1S1.MOSTA' \
    --trait_name 'IQ' \
    --sumstats_file 'gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz' \
    --w_file 'gsMap_resource/LDSC_resource/weights_hm3_no_hla/weights.' \
    --num_processes 4

5. cauchy combination (optional)

Objective: Aggregate P values of individual spots within specific spatial regions (cell types) to evaluate the association of these regions (cell types) with the trait.

Execution: required memory: ~12G

gsmap run_cauchy_combination \
    --workdir './example/Mouse_Embryo' \
    --sample_name 'E16.5_E1S1.MOSTA' \
    --trait_name 'IQ' \
    --annotation 'annotation'

6. report generation (optional)

Objective: Generate gsMap reports, including visualizations of mapping results and diagnostic plots.

Note

The default genes for visualization are the top 50 genes whose GSS shows the highest correlation with the -log10 p-values of the trait-cell associations. To select specific genes for visualization, use the --selected_genes parameter.

Execution: required memory: ~60G

gsmap run_report \
    --workdir './example/Mouse_Embryo' \
    --sample_name 'E16.5_E1S1.MOSTA' \
    --trait_name 'IQ' \
    --annotation 'annotation' \
    --sumstats_file 'gsMap_example_data/GWAS/IQ_NG_2018.sumstats.gz' \
    --top_corr_genes 50