SNP file inputs and QC

I have downloaded a bed file of switchgrass SNP data. How do I use this with the switchgrassGWAS R package?

First, check that you have downloaded three files with identical file names and three different file type extensions: “.bed”, “.bim”, and “.fam”. These three files need to be in the same folder for this to work.

You can then do this within R with:

# Install bigsnpr 
install.packages("bigsnpr")

# Load bigsnpr package into your R environment
library(bigsnpr)  

# Create the "bigSNP" object from the bed file
rdsfile <- snp_readBed("path/to/your/bedfile.bed", backingfile = "path/to/new/bigSNPfile")

# Attach the new "bigSNP" object to the R session.
snp <- snp_attach(paste0("path/to/new/bigSNPfile", ".rds"))

# You're ready to rock!

In the snp_readBed function, you will need to replace “path/to/your/bedfile.bed” with the path to your .bed file, and you should replace “path/to/new/bigSNPfile” with your desired output file name and path to that file. Don’t add any extension to specify the file type; that will be done by bigsnpr.

You can also try doing this with the example file that comes with the switchgrassGWAS R package; see the commands to do that here

I have a vcf file of switchgrass SNP data. How do I use this with the switchgrassGWAS R package?

You will need to convert this vcf (or vcf.gz) file to a bed file, then follow the instructions in the previous section.

One way to do this is using the command line program plink.

On the command line:

plink --vcf path/to/your/vcf/or/vcf.gz --const-fid Pvirgatum --allow-extra-chr --make-bed --out outputfilename

You will need to replace path/to/your/vcf/or/vcf.gz with the path to your vcf or vcf.gz file, or just the vcf or vcf.gz filename if you are running the command from the same directory as the file is in.

You will need to replace outputfilename with your desired output file name. Don’t add any extension to specify the file type; that will be done by plink.

Here’s an example command.

plink --vcf Pvirgatum_V5_GWAS_758g_33M.vcf.gz --const-fid Pvirgatum --allow-extra-chr --make-bed --out Pvirgatum_V5_GWAS_758g_33M

Alternatively, if your file is in vcf format (but not vcf.gz format), you can do this from within R using the following commands:

library(bigsnpr)
prefix <- "Pvirgatum_V5_GWAS_prefix"  #  specify your file prefix (with no extension)

# Download plink to use it from within R
plink <- download_plink()

# Create a bedfile from the vcf file (with some switchgrass-specific extra options)
bedfile <- snp_plinkQC(plink.path = plink, prefix.in = prefix,
                       file.type = "--vcf", 
                       extra.options = "--allow-no-sex --allow-extra-chr --chr Chr01K Chr01N Chr02K Chr02N Chr03K Chr03N Chr04K Chr04N Chr05K Chr05N Chr06K Chr06N Chr07K Chr07N Chr08K Chr08N Chr09K Chr09N")

# Create the "bigSNP" object from the bed file
rdsfile <- snp_readBed(bedfile, backingfile = prefix)

Great idea! I recommend subsetting your SNP file to just the individuals you have phenotypes for, then removing SNPs that don’t meet specific QC thresholds. I’d recommend removing SNPs that have a minor allele count of less than 25, that are missing at a frequency of >10%, and/or that are not in Hardy-Weinberg equilibrium.

Why remove individuals?

If you have lots of phenotypes, there’s no need remove individuals on a phenotype-by-phenotype basis. However, if you have many phenotypes but no phenotypes for a particular individual, you gain no statistical signal by including that individual in your genome-wide association. If you include enough individuals with no phenotypes, that may introduce statistical artifacts and noise in SNPs that have low minor allele counts or high missingness in your phenotyped individual set (see next section).

Why remove SNPs?

These SNP datasets are very large, and a single genome-wide association in switchgrass will involve millions of tests, with an unknown ratio of statistical signal to statistical noise. SNPs with low minor allele counts or high missingness - either genotypically or phenotypically - are more likely to be statistical ‘bad actors’ with high levels of statistical noise; removing them often dramatically improves the ratio of statistical signal to noise in the resulting association set.

You can do this QC within R with:

library(bigsnpr)
library(dplyr)

# Load your current bigSNP file into the R session.
snp <- snp_attach("path/to/your/snpfile.rds")

# Load your phenotypes into the R session.
data(pvdiv_phenotypes) # load your own phenotypes, of course

# Filter to individuals that you have at least one phenotype for
# In this example, we only keep non-NA individuals for the bio5 phenotype
my_ind <- pvdiv_phenotypes %>%
    dplyr::filter(!is.na(bio5)) 

# Make numeric vector to select the individuals to keep using bigsnpr
ind_selector <- which(snp$fam$sample.ID %in% my_ind$PLANT_ID)
 # Choose subset name
subset_name <- "Pvirgatum_V5_GWAS_phenotyped_subset"

# Subset SNP files to phenotyped individuals
snpfile_subset <- subset(snp, ind.row = ind_selector)
# Attach new SNP subset file for QC
snp_subset <- snp_attach(snpfile_subset)
# Save SNP subset file as a bed file for QC
snp_writeBed(snp_subset, bedfile = paste0(subset_name, ".bed"))

# Choose minor allele count cutoff. If your dataset is small, use a minor allele frequency cutoff of 5% instead.
mac <- round(25/length(selector), digits = 2) # what should the MAF be?
if(mac > 0.05){
  mac <- 0.05
}
# Download plink to use it from within R
plink <- download_plink()

# Do QC within R using plink, with some switchgrass-specific settings
bedfile <- snp_plinkQC(plink, prefix.in = subset_name,
                       maf = mac, geno = 0.1, hwe = 1e-50, 
                       prefix.out = paste0(subset_name, "_maf", mac),
                       extra.options = "--allow-no-sex --allow-extra-chr --chr Chr01K Chr01N Chr02K Chr02N Chr03K Chr03N Chr04K Chr04N Chr05K Chr05N Chr06K Chr06N Chr07K Chr07N Chr08K Chr08N Chr09K Chr09N")

# Create a new "bigSNP" object from the bed file
rdsfile <- snp_readBed(bedfile, backingfile = "path/to/new/bigSNPfile")

# You're done, but may want to remove the intermediate bigSNP and bed files from the directory where you were working.

I have downloaded a bed file of switchgrass SNP data. How do I use this with the switchgrassGWAS R package?

I have a vcf file of switchgrass SNP data. How do I use this with the switchgrassGWAS R package?

I’d like to do some quality control (QC) on my SNPs before proceeding. What do you recommend?

Why remove individuals?

Why remove SNPs?