When you run GWAS using this package, you now have two options for saving the GWAS results - as RDS files and as file-backed matrix (FBM) files.

You can specify this option in pvdiv_standard_gwas() with the option savetype. savetype can be “rds”, “fbm”, or “both”.

Why use RDS files?

If you are running few GWAS and want to save each GWAS output separately, you can use rds files. These files completely describe the GWAS results without referring back to the original SNP file, using the columns CHR (chromosome) & POS (physical position). They contain the effect estimate estim for the alternate allele (allele 2 in the SNP file), the standard error around this effect std_err, and the -log10pvalue and FDR adjusted p-values log10p and FDR_adj from the genome-wide association.

Pros of RDS files

These files are standalone & contain all the information you need to analyze a GWAS result.

These rds files can be used with the function pvdiv_bigsnp2mashr to run mash, as well as with pvdiv_table_topsnps (by specifying type = “bigsnp”), pvdiv_qqplot, and pvdiv_manhattan.

Cons of RDS files

If you are running many GWAS, these files can become numerous, take up substantial disk space, and become hard to keep track of. Individual files make it more difficult to compare results across multiple GWAS.

Why use FBM files?

If you are running many GWAS, you may want a compressed format, a reduced number of files, and/or additional ways to compare GWAS outputs. To do this, you can use FBM files, which save all GWAS results in a total of 4 files. These files all start with the prefix ‘gwas_effects_’. Two are .csv files; these contain the associated metadata for the GWAS and the column names for the fbm. Two are an .rds & .bk format & contain the actual file-backed matrix results; these can be very large.

For each phenotype used for GWAS, the gwas_effects files contain the effect estimate estim for the alternate allele (allele 2 in the SNP file), the standard error around this effect std_err, and the -log10pvalue log10p from the genome-wide association.

Pros of FBM files

FBM files are memory and computation-efficient tools to analyze big matrices that are stored on disc rather than in your computer’s memory (see here for more details ). Briefly, they have faster I/O, won’t overwhelm your computer’s RAM, and function runtime speeds up ~10-200x.

These fbm files can be used with the functions pvdiv_fbm_upset_df() and pvdiv_fbm_qq(). They can also be used in the snpdiver R package function dive_effects2mash() to run mash.

Cons of FBM files

These files need to be used along with the original snp file used to run the GWAS - thus, you’ll need to keep track of six files in total, the .rds and .bk file for the SNP file and the four files for the gwas results.

These files do not work with the function pvdiv_bigsnp2mashr to run mash.

If you are interested in only a subset of GWAS results saved in a fbm, you will have to create a vector of row numbers using the associated metadata to refer to that data subset.

Using FBM gwas results

The remainder of this package demonstrates some utilities for filebacked matrix files within the bigsnpr and switchgrassGWAS R packages.

Get genotype file

# get the example bedfile from the package switchgrassGWAS
bedfile <- system.file("extdata", "example.bed", package = "switchgrassGWAS")

Set up SNP and phenotype data frames.

# Load packages bigsnpr and switchgrassGWAS
library(switchgrassGWAS)
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
library(bigsnpr)
#> Loading required package: bigstatsr

# Read from bed/bim/fam to create the new files that bigsnpr uses.
# Let's put them in an temporary directory for this demo.
tmpfile <- tempfile()
snp_readBed(bedfile, backingfile = tmpfile)
#> [1] "/tmp/RtmpgAXKt5/file2f365fdb9ef7.rds"

# Attach the "bigSNP" object to the R session.
snp_example <- snp_attach(paste0(tmpfile, ".rds"))
# What does the bigSNP object look like?
str(snp_example, max.level = 2, strict.width = "cut")
#> List of 3
#>  $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 16 fields
#>   ..and 26 methods, of which 12 are  possibly relevant:
#>   ..  add_columns, as.FBM, bm, bm.desc, check_dimensions,
#>   ..  check_write_permissions, copy#envRefClass, initialize, initialize#FBM,
#>   ..  save, show#envRefClass, show#FBM
#>  $ fam      :'data.frame':   630 obs. of  6 variables:
#>   ..$ family.ID  : chr [1:630] "Pvirgatum" "Pvirgatum" "Pvirgatum" "Pvirgatum"..
#>   ..$ sample.ID  : chr [1:630] "J181.A" "J250.C" "J251.C" "J352.A" ...
#>   ..$ paternal.ID: int [1:630] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ maternal.ID: int [1:630] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ sex        : int [1:630] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ affection  : int [1:630] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ...
#>  $ map      :'data.frame':   3600 obs. of  6 variables:
#>   ..$ chromosome  : chr [1:3600] "Chr01K" "Chr01K" "Chr01K" "Chr01K" ...
#>   ..$ marker.ID   : chr [1:3600] "Chr01K_105542" "Chr01K_798650" "Chr01K_1147"..
#>   ..$ genetic.dist: int [1:3600] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..$ physical.pos: int [1:3600] 105542 798650 1147074 1719551 1915851 1965986..
#>   ..$ allele1     : chr [1:3600] "A" "C" "T" "T" ...
#>   ..$ allele2     : chr [1:3600] "G" "G" "C" "G" ...
#>  - attr(*, "class")= chr "bigSNP"

# Load the pvdiv phenotypes into the R session.
data(pvdiv_phenotypes) 
# Make an example dataframe of one phenotype where the first column is PLANT_ID.
# This "phenotype", 'GWAS_CT', is the number of times a plant successfully 
# clonally replicated to plant in the common gardens in 2018.
four_phenotype <- pvdiv_phenotypes %>%
  dplyr::select(PLANT_ID, GWAS_CT, matches("BIOMASS"))

Run genome-wide association (GWAS)

# Save the output to a temporary directory for this demo.
tempdir <- tempdir()

pvdiv_standard_gwas(snp = snp_example, df = four_phenotype, 
                    type = "linear", outputdir = tempdir, 
                    savegwas = TRUE, savetype = "fbm", 
                    saveplots = FALSE,
                    saveannos = FALSE, ncores = 1)
#> 'lambdagc' is TRUE, so lambda_GC will be used to find the best population structure correction using the covariance matrix.
#> 'savetype' is 'fbm', so the gwas results will be saved to disk as a combined filebacked matrix (fbm) file.
#> Covariance matrix (covar) was not supplied - this will be generated using pvdiv_autoSVD().
#> 'saveoutput' is FALSE, so the svd will not be saved to the working directory.
#> Now starting GWAS pipeline for GWAS_CT.
#> Now determining lambda_GC for GWAS models with 16 sets of PCs. This will take some time.
#> saveoutput is FALSE, so lambda_GC values won't be saved to a csv.
#> Finished Lambda_GC calculation for GWAS_CT using 0 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 1 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 2 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 3 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 4 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 5 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 6 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 7 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 8 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 9 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 10 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 11 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 12 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 13 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 14 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 15 PCs.
#> Finished phenotype 1: GWAS_CT
#> Now running GWAS with the best population structure correction.
#> [1] "saveoutput is FALSE so GWAS object will not be saved to disk."
#> Now starting GWAS pipeline for CLMB_BIOMASS.
#> Now determining lambda_GC for GWAS models with 16 sets of PCs. This will take some time.
#> saveoutput is FALSE, so lambda_GC values won't be saved to a csv.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 0 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 1 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 2 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 3 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 4 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 5 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 6 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 7 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 8 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 9 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 10 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 11 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 12 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 13 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 14 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 15 PCs.
#> Finished phenotype 1: CLMB_BIOMASS
#> Now running GWAS with the best population structure correction.
#> [1] "saveoutput is FALSE so GWAS object will not be saved to disk."
#> Now starting GWAS pipeline for KBSM_BIOMASS.
#> Now determining lambda_GC for GWAS models with 16 sets of PCs. This will take some time.
#> saveoutput is FALSE, so lambda_GC values won't be saved to a csv.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 0 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 1 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 2 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 3 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 4 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 5 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 6 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 7 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 8 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 9 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 10 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 11 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 12 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 13 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 14 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 15 PCs.
#> Finished phenotype 1: KBSM_BIOMASS
#> Now running GWAS with the best population structure correction.
#> [1] "saveoutput is FALSE so GWAS object will not be saved to disk."
#> Now starting GWAS pipeline for PKLE_BIOMASS.
#> Now determining lambda_GC for GWAS models with 16 sets of PCs. This will take some time.
#> saveoutput is FALSE, so lambda_GC values won't be saved to a csv.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 0 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 1 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 2 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 3 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 4 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 5 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 6 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 7 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 8 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 9 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 10 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 11 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 12 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 13 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 14 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 15 PCs.
#> Finished phenotype 1: PKLE_BIOMASS
#> Now running GWAS with the best population structure correction.
#> [1] "saveoutput is FALSE so GWAS object will not be saved to disk."

This command will save a set of gwas results to a temporary directory, which will have a randomly generated name like: /tmp/RtmpgAXKt5.

Let’s load these files into R.

gwas <- big_attach(file.path(tempdir, "gwas_effects_0M_SNPs.rds"))
gwas_metadata <- read.csv(file.path(tempdir, "gwas_effects_0M_SNPs_associated_metadata.csv"))
gwas_colnames <- read.csv(file.path(tempdir, "gwas_effects_0M_SNPs_column_names.csv"))

Let’s look at the metadata associated with the GWAS. This includes information about the kind of GWAS that was conducted. In order, the columns indicate, what the phenotype name was, what type of model was run, how many SNPs (in M) were used, how many PCs were used for population structure correction, how many phenotyped individuals were there, how many phenotypic levels were there, what was the genomic inflation factor (lambda_GC), and given the number of SNPs, what was the bonferroni threshold for the phenotype.

kable(gwas_metadata)
phe type nsnp npcs nphe nlev lambda_GC bonferroni
GWAS_CT linear 0_M 15 629 16 1.12682 4.857333
CLMB_BIOMASS linear 0_M 5 495 490 1.02958 4.857333
KBSM_BIOMASS linear 0_M 15 512 512 1.07005 4.857333
PKLE_BIOMASS linear 0_M 10 612 606 1.03844 4.857333

Let’s look at the column names of the GWAS result. These are in the same order as the associated metadata, and each phenotype has three columns: Effect (estimated effect size of allele2), SE (standard error about this effect size), and log10p (-log10p of the association between SNP & phenotype).

kable(gwas_colnames)
colnames_fbm
GWAS_CT_Effect
GWAS_CT_SE
GWAS_CT_log10p
CLMB_BIOMASS_Effect
CLMB_BIOMASS_SE
CLMB_BIOMASS_log10p
KBSM_BIOMASS_Effect
KBSM_BIOMASS_SE
KBSM_BIOMASS_log10p
PKLE_BIOMASS_Effect
PKLE_BIOMASS_SE
PKLE_BIOMASS_log10p

Two-GWAS Comparisons

Let’s analyze overlaps in significance in the four GWAS. Is there a relationship between the -log10pvalues in the three BIOMASS GWAS? If so, this could indicate that SNPs have a similar effect on BIOMASS across the three different environments.

Use the row numbers from the associated metadata to compare -log10p values for row 2 (CLMB_BIOMASS) with rows 3 & 4 (biomass at KBSM & PKLE). Set the threshold above which to plot the p-value at zero (default is 5), because there are not many SNPs in this toy example.

pvdiv_fbm_qq(effects = gwas, metadata = gwas_metadata, e_row = 2, o_row = 3:4, 
             outputdir = tempdir, thr = 0)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

This function will again save plots to the tempdir. These plots should look like this:

CLMB_vs_KBSM

and this:

CLMB_vs_PKLE

If these GWAS results were perfectly related, all the hexagons would fall on the red dashed 1:1 line. The blue line indicates the actual relationship between the p-values in the two GWAS. We can see that there is a modest positive relationship between biomass in these environments, but it’s by no means strong. This indicates the presence of genotype-by-environment interactions between these environments.

Upset plots

Now, let’s compare results for more than one GWAS. The way I recommend doing this is with an Upset plot. Upset plots are well suited to the quantitative analysis of data when there are more than three sets. They shine when you want to look at all combinations of how sets interact - for example, when you’re interested in sets of SNPs significant in different combinations of GWAS.

Here’s some great documentation on how to code an Upset plot in R.

Let’s make a dataframe to go into an Upset plot and visualize it in R.

There are three required arguments:

  • the first argument is expected to be the gwas results fbm
  • the second argument is expected to be the snp fbm used to generate the gwas results
  • the third argument is expected to be the metadata associated with the GWAS results.

We can also optionally specify a -log10p threshold with ‘thr’. By default, that threshold is 7. A good target is to set this above the noise floor for all of the GWAS you are comparing. For our toy example, we set this artificially low, to 2.

upset_df <- pvdiv_fbm_upset_df(effects = gwas, snp = snp_example, 
                               metadata = gwas_metadata, thr = 2, ncores = 1)
kable(head(upset_df))
CHR POS GWAS_CT CLMB_BIOMASS KBSM_BIOMASS PKLE_BIOMASS
Chr01K 1147074 0 0 0 1
Chr01K 4761900 0 0 0 1
Chr01K 5505667 0 0 0 1
Chr01K 7103018 0 0 0 1
Chr01K 7475083 0 0 0 1
Chr01K 8046966 0 0 1 1

This dataframe has CHR & POS from the SNP file, and then an additional column for each phenotype. If that SNP is not significant at the -log10p value threshold for that phenotype (here, thr = 2), then that row & column combination will have a ‘0’. If that SNP is significant for the phenotype, that value will be ‘1’.

Now, let’s make an upset plot using the R package ComplexUpset.

There are two required arguments:

  • the first argument is expected to be a dataframe with both group indicator variables and covariates,
  • the second argument specifies a list with names of column which indicate the group membership.
library(ComplexUpset)
upset_plot <- upset(upset_df, intersect = colnames(upset_df)[-(1:2)], 
                    name = 'phenotype', min_size = 2)
upset_plot                    

upset_ex In this Upset plot, we can see that there are many more SNPs significant in one or more biomass phenotype than there are significant for the phenotype ‘GWAS_CT’ and any biomass phenotype. Though, overall, most SNPs are significant in only one of the four conditions.

Sometimes this could be because different SNPs in the same gene or genomic region are coming up as highly associated with the different phenotypes. To test this, we can summarize our upset_df slightly based on position, and make a new Upset plot.

Median LD in switchgrass is between 20 and 25kb. Normally this is the largest genomic region I would make. However, for this toy example, we’ll make larger genomic regions of 1Mb.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

upset_df_1Mb <- upset_df %>%
  mutate(POS_binned = floor(POS/1000000)*1000000) %>%
  select(-POS) %>%
  select(CHR, POS_binned, everything()) %>%
  group_by(CHR, POS_binned) %>%
  summarise(across(everything(), ~ max(.x)))
#> `summarise()` has grouped output by 'CHR'. You can override using the `.groups` argument.
upset_1Mb <- upset(upset_df_1Mb, intersect = colnames(upset_df_1Mb)[-(1:2)], 
                    name = 'phenotype', min_size = 2) + 
  labs(title = "Significant at -log10p > 2 in 1Mb intervals")
upset_1Mb  

upset_ex

This binning by genomic interval does result in more genomic regions with significant effects on multiple phenotypes. It doesn’t change the ordering of the kinds of intersections very much, with the exception that the three-way intersection between the three biomass phenotypes is now the least common intersection (with 4 1Mb regions containing significant associations for all three phenotypes).

What if we want to learn more about these 4 1Mb regions? We can go back to our upset dataframes to extract these regions.

sel_reg_biomass <- upset_df_1Mb %>%
  filter(across(CLMB_BIOMASS:PKLE_BIOMASS, ~ .x == 1)) %>%
  mutate(Sel_region = paste(CHR, POS_binned, sep = "_"))

sel_snp_biomass <- upset_df %>%
  mutate(POS_binned = floor(POS/1000000)*1000000,
         Sel_region = paste(CHR, POS_binned, sep = "_")) %>%
  filter(Sel_region %in% sel_reg_biomass$Sel_region)

kable(sel_snp_biomass)
CHR POS GWAS_CT CLMB_BIOMASS KBSM_BIOMASS PKLE_BIOMASS POS_binned Sel_region
Chr03K 2197885 0 1 1 1 2.0e+06 Chr03K_2e+06
Chr05K 12177803 0 1 1 1 1.2e+07 Chr05K_1.2e+07
Chr09K 45587560 0 1 1 1 4.5e+07 Chr09K_4.5e+07
Chr09N 11325811 0 0 0 1 1.1e+07 Chr09N_1.1e+07
Chr09N 11967533 0 1 1 0 1.1e+07 Chr09N_1.1e+07

If we were interested enough in these regions, we could also use this table to find annotations using pvdiv_table_topsnps().

sel_snp_annos <- sel_snp_biomass %>%
  mutate(start = POS - 10000,
         end = POS + 10000)  # get the 20kb surrounding each position, POS, in thr_dfB

library(AnnotationDbi)
library(VariantAnnotation)
txdb <- loadDb(file = "../../pvdiv-genome/Pvirgatum_516_v5.1.gene.txdb.sqlite")

sel_snp_annos <- pvdiv_table_topsnps(df = sel_snp_annos, type = "table", txdb = txdb)
kable(sel_snp_annos)
CHR region_start region_end width region_strand Annotation QUERYID TXID Gene ID POS GWAS_CT CLMB_BIOMASS KBSM_BIOMASS PKLE_BIOMASS POS_binned Sel_region gene_start gene_end gene_width gene_strand pacId transcriptName peptideName Pfam Panther Categories KOG ec KO GO Arabidopsis thaliana homolog A. thaliana gene name A. thaliana gene annotation Rice homolog Rice gene name Rice gene annotation d2_start d2_end distance from gene
Chr03K 2187885 2207885 20001 * intergenic 1 NA NA 2197885 0 1 1 1 2.0e+06 Chr03K_2e+06 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Chr05K 12167803 12187803 20001 + spliceSite 2 60175 Pavir.5KG164407 12177803 0 1 1 1 1.2e+07 Chr05K_1.2e+07 12176848 12177263 416 + 41543573 Pavir.5KG164407.1 Pavir.5KG164407.1.p NA NA NA NA NA NA NA NA NA NA NA NA 955 540 540
Chr05K 12167803 12187803 20001 + spliceSite 2 60176 Pavir.5KG164807 12177803 0 1 1 1 1.2e+07 Chr05K_1.2e+07 12185636 12186204 569 + 41549658 Pavir.5KG164807.1 Pavir.5KG164807.1.p NA NA NA NA NA NA NA NA NA NA NA NA -7833 -8401 7833
Chr09K 45577560 45597560 20001 - intron 3 103241 Pavir.8KG390800 45587560 0 1 1 1 4.5e+07 Chr09K_4.5e+07 NA NA NA NA 41644684 Pavir.8KG390800.1 Pavir.8KG390800.1.p PF03810 PTHR10997,PTHR10997:SF29 NA NA NA GO:0008536,GO:0006886 AT3G17340.1 NA ARM repeat superfamily protein LOC_Os11g47320.1 NA protein transporter, putative, expressed NA NA NA
Chr09K 45577560 45597560 20001 + spliceSite 3 110376 Pavir.9KG334200 45587560 0 1 1 1 4.5e+07 Chr09K_4.5e+07 45586592 45589853 3262 + 41604637 Pavir.9KG334200.1 Pavir.9KG334200.1.p PF00249 PTHR24078,PTHR24078:SF195 NA NA NA NA AT5G47390.1 NA myb-like transcription factor family protein LOC_Os10g41200.1 NA MYB family transcription factor, putative, expressed 968 -2293 968
Chr09K 45577560 45597560 20001 + spliceSite 3 116062 Pavir.9KG334400 45587560 0 1 1 1 4.5e+07 Chr09K_4.5e+07 45576944 45582938 5995 - 41606480 Pavir.9KG334400.4 Pavir.9KG334400.4.p PF00582,PF04564,PF00069 PTHR27003,PTHR27003:SF29 KOG1187 2.7.11.1 NA GO:0006950,GO:0016567,GO:0004842,GO:0006468,GO:0005524,GO:0004672 AT4G25160.1 NA U-box domain-containing protein kinase family protein LOC_Os10g41220.1 NA protein kinase family protein, putative, expressed 10616 4622 4622
Chr09K 45577560 45597560 20001 + spliceSite 3 116063 Pavir.9KG334400 45587560 0 1 1 1 4.5e+07 Chr09K_4.5e+07 45576944 45582938 5995 - 41606480 Pavir.9KG334400.4 Pavir.9KG334400.4.p PF00582,PF04564,PF00069 PTHR27003,PTHR27003:SF29 KOG1187 2.7.11.1 NA GO:0006950,GO:0016567,GO:0004842,GO:0006468,GO:0005524,GO:0004672 AT4G25160.1 NA U-box domain-containing protein kinase family protein LOC_Os10g41220.1 NA protein kinase family protein, putative, expressed 10616 4622 4622
Chr09K 45577560 45597560 20001 + spliceSite 3 116064 Pavir.9KG334100 45587560 0 1 1 1 4.5e+07 Chr09K_4.5e+07 45590557 45594099 3543 - 41607551 Pavir.9KG334100.1 Pavir.9KG334100.1.p PF07690 PTHR23500,PTHR23500:SF179 KOG0254 NA NA GO:0055085,GO:0016021 AT5G61520.1 NA Major facilitator superfamily protein LOC_Os10g41190.1 NA transporter family protein, putative, expressed -2997 -6539 2997
Chr09K 45577560 45597560 20001 + spliceSite 3 116065 Pavir.9KG334100 45587560 0 1 1 1 4.5e+07 Chr09K_4.5e+07 45590557 45594099 3543 - 41607551 Pavir.9KG334100.1 Pavir.9KG334100.1.p PF07690 PTHR23500,PTHR23500:SF179 KOG0254 NA NA GO:0055085,GO:0016021 AT5G61520.1 NA Major facilitator superfamily protein LOC_Os10g41190.1 NA transporter family protein, putative, expressed -2997 -6539 2997
Chr09K 45577560 45597560 20001 + spliceSite 3 116066 Pavir.9KG334100 45587560 0 1 1 1 4.5e+07 Chr09K_4.5e+07 45590557 45594099 3543 - 41607551 Pavir.9KG334100.1 Pavir.9KG334100.1.p PF07690 PTHR23500,PTHR23500:SF179 KOG0254 NA NA GO:0055085,GO:0016021 AT5G61520.1 NA Major facilitator superfamily protein LOC_Os10g41190.1 NA transporter family protein, putative, expressed -2997 -6539 2997
Chr09N 11315811 11335811 20001 - intron 4 117033 Pavir.9KG583700 11325811 0 0 0 1 1.1e+07 Chr09N_1.1e+07 NA NA NA NA 41598633 Pavir.9KG583700.1 Pavir.9KG583700.1.p PF07716 PTHR22952,PTHR22952:SF198 NA NA NA GO:0043565,GO:0006355,GO:0003700 AT3G62420.1 ATBZIP53,BZIP53 basic region/leucine zipper motif 53 LOC_Os03g19375.1 NA expressed protein NA NA NA
Chr09N 11315811 11335811 20001 + spliceSite 4 124936 Pavir.9NG142800 11325811 0 0 0 1 1.1e+07 Chr09N_1.1e+07 11329238 11331352 2115 - 41552177 Pavir.9NG142800.1 Pavir.9NG142800.1.p PF11250 PTHR33155,PTHR33155:SF1 NA NA NA NA AT5G22090.2 NA Protein of unknown function (DUF3049) LOC_Os03g49830.1 NA expressed protein -3427 -5541 3427
Chr09N 11315811 11335811 20001 + spliceSite 4 124937 Pavir.9NG142800 11325811 0 0 0 1 1.1e+07 Chr09N_1.1e+07 11329238 11331352 2115 - 41552177 Pavir.9NG142800.1 Pavir.9NG142800.1.p PF11250 PTHR33155,PTHR33155:SF1 NA NA NA NA AT5G22090.2 NA Protein of unknown function (DUF3049) LOC_Os03g49830.1 NA expressed protein -3427 -5541 3427
Chr09N 11957533 11977533 20001 + intron 5 111994 Pavir.9KG622000 11967533 0 1 1 0 1.1e+07 Chr09N_1.1e+07 NA NA NA NA 41599909 Pavir.9KG622000.1 Pavir.9KG622000.1.p PF01588,PF09334 PTHR11946,PTHR11946:SF78 NA 6.1.1.10 K01874 GO:0000049,GO:0006418,GO:0005524,GO:0004812,GO:0000166 AT4G13780.1 NA methionine–tRNA ligase, putative / methionyl-tRNA synthetase, putative / MetRS, putative LOC_Os06g31210.1 NA methionyl-tRNA synthetase, putative, expressed NA NA NA
Chr09N 11957533 11977533 20001 + spliceSite 5 119578 Pavir.9NG147300 11967533 0 1 1 0 1.1e+07 Chr09N_1.1e+07 11964735 11967667 2933 + 41554068 Pavir.9NG147300.1 Pavir.9NG147300.1.p PF00227 PTHR11599,PTHR11599:SF6 KOG0177 3.4.25.1 K02734 GO:0051603,GO:0005839,GO:0004298 AT3G22630.1 PBD1,PRCGB 20S proteasome beta subunit D1 LOC_Os03g48930.1 NA peptidase, T1 family, putative, expressed 2798 -134 134
Chr09N 11957533 11977533 20001 + spliceSite 5 124997 Pavir.9NG147373 11967533 0 1 1 0 1.1e+07 Chr09N_1.1e+07 11967554 11968429 876 - 41561168 Pavir.9NG147373.1 Pavir.9NG147373.1.p PF14368 PTHR33044,PTHR33044:SF14 NA NA NA NA AT2G48130.1 NA Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin superfamily protein LOC_Os07g07860.1 NA LTPL76 - Protease inhibitor/seed storage/LTP family protein precursor, expressed -21 -896 21
Chr09N 11957533 11977533 20001 + spliceSite 5 124998 Pavir.9NG147500 11967533 0 1 1 0 1.1e+07 Chr09N_1.1e+07 11968815 11969741 927 - 41554575 Pavir.9NG147500.1 Pavir.9NG147500.1.p PF14368 PTHR33044,PTHR33044:SF14 NA NA NA NA AT3G22600.1 NA Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin superfamily protein LOC_Os03g57970.1 NA LTPL73 - Protease inhibitor/seed storage/LTP family protein precursor, expressed -1282 -2208 1282
Chr09N 11957533 11977533 20001 + spliceSite 5 124999 Pavir.9NG147573 11967533 0 1 1 0 1.1e+07 Chr09N_1.1e+07 11970419 11974568 4150 - 41553090 Pavir.9NG147573.1 Pavir.9NG147573.1.p PF07466 PTHR33975,PTHR33975:SF2 NA NA NA NA AT1G54520.1 NA NA LOC_Os03g48920.1 NA DUF1517 domain containing protein, putative, expressed -2886 -7035 2886