fbm.Rmd
When you run GWAS using this package, you now have two options for saving the GWAS results - as RDS files and as file-backed matrix (FBM) files.
You can specify this option in pvdiv_standard_gwas()
with the option savetype
. savetype
can be “rds”, “fbm”, or “both”.
If you are running few GWAS and want to save each GWAS output separately, you can use rds files. These files completely describe the GWAS results without referring back to the original SNP file, using the columns CHR (chromosome) & POS (physical position). They contain the effect estimate estim
for the alternate allele (allele 2 in the SNP file), the standard error around this effect std_err
, and the -log10pvalue and FDR adjusted p-values log10p
and FDR_adj
from the genome-wide association.
If you are running many GWAS, you may want a compressed format, a reduced number of files, and/or additional ways to compare GWAS outputs. To do this, you can use FBM files, which save all GWAS results in a total of 4 files. These files all start with the prefix ‘gwas_effects_’. Two are .csv files; these contain the associated metadata for the GWAS and the column names for the fbm. Two are an .rds & .bk format & contain the actual file-backed matrix results; these can be very large.
For each phenotype used for GWAS, the gwas_effects files contain the effect estimate estim
for the alternate allele (allele 2 in the SNP file), the standard error around this effect std_err
, and the -log10pvalue log10p
from the genome-wide association.
FBM files are memory and computation-efficient tools to analyze big matrices that are stored on disc rather than in your computer’s memory (see here for more details ). Briefly, they have faster I/O, won’t overwhelm your computer’s RAM, and function runtime speeds up ~10-200x.
These fbm files can be used with the functions pvdiv_fbm_upset_df() and pvdiv_fbm_qq(). They can also be used in the snpdiver R package function dive_effects2mash() to run mash.
These files need to be used along with the original snp file used to run the GWAS - thus, you’ll need to keep track of six files in total, the .rds and .bk file for the SNP file and the four files for the gwas results.
These files do not work with the function pvdiv_bigsnp2mashr to run mash.
If you are interested in only a subset of GWAS results saved in a fbm, you will have to create a vector of row numbers using the associated metadata to refer to that data subset.
The remainder of this package demonstrates some utilities for filebacked matrix files within the bigsnpr and switchgrassGWAS R packages.
# get the example bedfile from the package switchgrassGWAS
bedfile <- system.file("extdata", "example.bed", package = "switchgrassGWAS")
# Load packages bigsnpr and switchgrassGWAS
library(switchgrassGWAS)
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
library(bigsnpr)
#> Loading required package: bigstatsr
# Read from bed/bim/fam to create the new files that bigsnpr uses.
# Let's put them in an temporary directory for this demo.
tmpfile <- tempfile()
snp_readBed(bedfile, backingfile = tmpfile)
#> [1] "/tmp/RtmpgAXKt5/file2f365fdb9ef7.rds"
# Attach the "bigSNP" object to the R session.
snp_example <- snp_attach(paste0(tmpfile, ".rds"))
# What does the bigSNP object look like?
str(snp_example, max.level = 2, strict.width = "cut")
#> List of 3
#> $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 16 fields
#> ..and 26 methods, of which 12 are possibly relevant:
#> .. add_columns, as.FBM, bm, bm.desc, check_dimensions,
#> .. check_write_permissions, copy#envRefClass, initialize, initialize#FBM,
#> .. save, show#envRefClass, show#FBM
#> $ fam :'data.frame': 630 obs. of 6 variables:
#> ..$ family.ID : chr [1:630] "Pvirgatum" "Pvirgatum" "Pvirgatum" "Pvirgatum"..
#> ..$ sample.ID : chr [1:630] "J181.A" "J250.C" "J251.C" "J352.A" ...
#> ..$ paternal.ID: int [1:630] 0 0 0 0 0 0 0 0 0 0 ...
#> ..$ maternal.ID: int [1:630] 0 0 0 0 0 0 0 0 0 0 ...
#> ..$ sex : int [1:630] 0 0 0 0 0 0 0 0 0 0 ...
#> ..$ affection : int [1:630] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ...
#> $ map :'data.frame': 3600 obs. of 6 variables:
#> ..$ chromosome : chr [1:3600] "Chr01K" "Chr01K" "Chr01K" "Chr01K" ...
#> ..$ marker.ID : chr [1:3600] "Chr01K_105542" "Chr01K_798650" "Chr01K_1147"..
#> ..$ genetic.dist: int [1:3600] 0 0 0 0 0 0 0 0 0 0 ...
#> ..$ physical.pos: int [1:3600] 105542 798650 1147074 1719551 1915851 1965986..
#> ..$ allele1 : chr [1:3600] "A" "C" "T" "T" ...
#> ..$ allele2 : chr [1:3600] "G" "G" "C" "G" ...
#> - attr(*, "class")= chr "bigSNP"
# Load the pvdiv phenotypes into the R session.
data(pvdiv_phenotypes)
# Make an example dataframe of one phenotype where the first column is PLANT_ID.
# This "phenotype", 'GWAS_CT', is the number of times a plant successfully
# clonally replicated to plant in the common gardens in 2018.
four_phenotype <- pvdiv_phenotypes %>%
dplyr::select(PLANT_ID, GWAS_CT, matches("BIOMASS"))
# Save the output to a temporary directory for this demo.
tempdir <- tempdir()
pvdiv_standard_gwas(snp = snp_example, df = four_phenotype,
type = "linear", outputdir = tempdir,
savegwas = TRUE, savetype = "fbm",
saveplots = FALSE,
saveannos = FALSE, ncores = 1)
#> 'lambdagc' is TRUE, so lambda_GC will be used to find the best population structure correction using the covariance matrix.
#> 'savetype' is 'fbm', so the gwas results will be saved to disk as a combined filebacked matrix (fbm) file.
#> Covariance matrix (covar) was not supplied - this will be generated using pvdiv_autoSVD().
#> 'saveoutput' is FALSE, so the svd will not be saved to the working directory.
#> Now starting GWAS pipeline for GWAS_CT.
#> Now determining lambda_GC for GWAS models with 16 sets of PCs. This will take some time.
#> saveoutput is FALSE, so lambda_GC values won't be saved to a csv.
#> Finished Lambda_GC calculation for GWAS_CT using 0 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 1 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 2 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 3 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 4 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 5 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 6 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 7 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 8 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 9 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 10 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 11 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 12 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 13 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 14 PCs.
#> Finished Lambda_GC calculation for GWAS_CT using 15 PCs.
#> Finished phenotype 1: GWAS_CT
#> Now running GWAS with the best population structure correction.
#> [1] "saveoutput is FALSE so GWAS object will not be saved to disk."
#> Now starting GWAS pipeline for CLMB_BIOMASS.
#> Now determining lambda_GC for GWAS models with 16 sets of PCs. This will take some time.
#> saveoutput is FALSE, so lambda_GC values won't be saved to a csv.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 0 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 1 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 2 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 3 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 4 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 5 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 6 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 7 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 8 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 9 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 10 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 11 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 12 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 13 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 14 PCs.
#> Finished Lambda_GC calculation for CLMB_BIOMASS using 15 PCs.
#> Finished phenotype 1: CLMB_BIOMASS
#> Now running GWAS with the best population structure correction.
#> [1] "saveoutput is FALSE so GWAS object will not be saved to disk."
#> Now starting GWAS pipeline for KBSM_BIOMASS.
#> Now determining lambda_GC for GWAS models with 16 sets of PCs. This will take some time.
#> saveoutput is FALSE, so lambda_GC values won't be saved to a csv.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 0 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 1 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 2 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 3 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 4 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 5 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 6 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 7 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 8 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 9 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 10 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 11 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 12 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 13 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 14 PCs.
#> Finished Lambda_GC calculation for KBSM_BIOMASS using 15 PCs.
#> Finished phenotype 1: KBSM_BIOMASS
#> Now running GWAS with the best population structure correction.
#> [1] "saveoutput is FALSE so GWAS object will not be saved to disk."
#> Now starting GWAS pipeline for PKLE_BIOMASS.
#> Now determining lambda_GC for GWAS models with 16 sets of PCs. This will take some time.
#> saveoutput is FALSE, so lambda_GC values won't be saved to a csv.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 0 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 1 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 2 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 3 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 4 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 5 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 6 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 7 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 8 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 9 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 10 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 11 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 12 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 13 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 14 PCs.
#> Finished Lambda_GC calculation for PKLE_BIOMASS using 15 PCs.
#> Finished phenotype 1: PKLE_BIOMASS
#> Now running GWAS with the best population structure correction.
#> [1] "saveoutput is FALSE so GWAS object will not be saved to disk."
This command will save a set of gwas results to a temporary directory, which will have a randomly generated name like: /tmp/RtmpgAXKt5.
Let’s load these files into R.
gwas <- big_attach(file.path(tempdir, "gwas_effects_0M_SNPs.rds"))
gwas_metadata <- read.csv(file.path(tempdir, "gwas_effects_0M_SNPs_associated_metadata.csv"))
gwas_colnames <- read.csv(file.path(tempdir, "gwas_effects_0M_SNPs_column_names.csv"))
Let’s look at the metadata associated with the GWAS. This includes information about the kind of GWAS that was conducted. In order, the columns indicate, what the phenotype name was, what type of model was run, how many SNPs (in M) were used, how many PCs were used for population structure correction, how many phenotyped individuals were there, how many phenotypic levels were there, what was the genomic inflation factor (lambda_GC), and given the number of SNPs, what was the bonferroni threshold for the phenotype.
kable(gwas_metadata)
phe | type | nsnp | npcs | nphe | nlev | lambda_GC | bonferroni |
---|---|---|---|---|---|---|---|
GWAS_CT | linear | 0_M | 15 | 629 | 16 | 1.12682 | 4.857333 |
CLMB_BIOMASS | linear | 0_M | 5 | 495 | 490 | 1.02958 | 4.857333 |
KBSM_BIOMASS | linear | 0_M | 15 | 512 | 512 | 1.07005 | 4.857333 |
PKLE_BIOMASS | linear | 0_M | 10 | 612 | 606 | 1.03844 | 4.857333 |
Let’s look at the column names of the GWAS result. These are in the same order as the associated metadata, and each phenotype has three columns: Effect (estimated effect size of allele2), SE (standard error about this effect size), and log10p (-log10p of the association between SNP & phenotype).
kable(gwas_colnames)
colnames_fbm |
---|
GWAS_CT_Effect |
GWAS_CT_SE |
GWAS_CT_log10p |
CLMB_BIOMASS_Effect |
CLMB_BIOMASS_SE |
CLMB_BIOMASS_log10p |
KBSM_BIOMASS_Effect |
KBSM_BIOMASS_SE |
KBSM_BIOMASS_log10p |
PKLE_BIOMASS_Effect |
PKLE_BIOMASS_SE |
PKLE_BIOMASS_log10p |
Let’s analyze overlaps in significance in the four GWAS. Is there a relationship between the -log10pvalues in the three BIOMASS GWAS? If so, this could indicate that SNPs have a similar effect on BIOMASS across the three different environments.
Use the row numbers from the associated metadata to compare -log10p values for row 2 (CLMB_BIOMASS) with rows 3 & 4 (biomass at KBSM & PKLE). Set the threshold above which to plot the p-value at zero (default is 5), because there are not many SNPs in this toy example.
pvdiv_fbm_qq(effects = gwas, metadata = gwas_metadata, e_row = 2, o_row = 3:4,
outputdir = tempdir, thr = 0)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
This function will again save plots to the tempdir. These plots should look like this:
and this:
If these GWAS results were perfectly related, all the hexagons would fall on the red dashed 1:1 line. The blue line indicates the actual relationship between the p-values in the two GWAS. We can see that there is a modest positive relationship between biomass in these environments, but it’s by no means strong. This indicates the presence of genotype-by-environment interactions between these environments.
Now, let’s compare results for more than one GWAS. The way I recommend doing this is with an Upset plot. Upset plots are well suited to the quantitative analysis of data when there are more than three sets. They shine when you want to look at all combinations of how sets interact - for example, when you’re interested in sets of SNPs significant in different combinations of GWAS.
Here’s some great documentation on how to code an Upset plot in R.
Let’s make a dataframe to go into an Upset plot and visualize it in R.
There are three required arguments:
We can also optionally specify a -log10p threshold with ‘thr’. By default, that threshold is 7. A good target is to set this above the noise floor for all of the GWAS you are comparing. For our toy example, we set this artificially low, to 2.
upset_df <- pvdiv_fbm_upset_df(effects = gwas, snp = snp_example,
metadata = gwas_metadata, thr = 2, ncores = 1)
kable(head(upset_df))
CHR | POS | GWAS_CT | CLMB_BIOMASS | KBSM_BIOMASS | PKLE_BIOMASS |
---|---|---|---|---|---|
Chr01K | 1147074 | 0 | 0 | 0 | 1 |
Chr01K | 4761900 | 0 | 0 | 0 | 1 |
Chr01K | 5505667 | 0 | 0 | 0 | 1 |
Chr01K | 7103018 | 0 | 0 | 0 | 1 |
Chr01K | 7475083 | 0 | 0 | 0 | 1 |
Chr01K | 8046966 | 0 | 0 | 1 | 1 |
This dataframe has CHR & POS from the SNP file, and then an additional column for each phenotype. If that SNP is not significant at the -log10p value threshold for that phenotype (here, thr = 2
), then that row & column combination will have a ‘0’. If that SNP is significant for the phenotype, that value will be ‘1’.
Now, let’s make an upset plot using the R package ComplexUpset.
There are two required arguments:
library(ComplexUpset)
upset_plot <- upset(upset_df, intersect = colnames(upset_df)[-(1:2)],
name = 'phenotype', min_size = 2)
upset_plot
In this Upset plot, we can see that there are many more SNPs significant in one or more biomass phenotype than there are significant for the phenotype ‘GWAS_CT’ and any biomass phenotype. Though, overall, most SNPs are significant in only one of the four conditions.
Sometimes this could be because different SNPs in the same gene or genomic region are coming up as highly associated with the different phenotypes. To test this, we can summarize our upset_df slightly based on position, and make a new Upset plot.
Median LD in switchgrass is between 20 and 25kb. Normally this is the largest genomic region I would make. However, for this toy example, we’ll make larger genomic regions of 1Mb.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
upset_df_1Mb <- upset_df %>%
mutate(POS_binned = floor(POS/1000000)*1000000) %>%
select(-POS) %>%
select(CHR, POS_binned, everything()) %>%
group_by(CHR, POS_binned) %>%
summarise(across(everything(), ~ max(.x)))
#> `summarise()` has grouped output by 'CHR'. You can override using the `.groups` argument.
upset_1Mb <- upset(upset_df_1Mb, intersect = colnames(upset_df_1Mb)[-(1:2)],
name = 'phenotype', min_size = 2) +
labs(title = "Significant at -log10p > 2 in 1Mb intervals")
upset_1Mb
This binning by genomic interval does result in more genomic regions with significant effects on multiple phenotypes. It doesn’t change the ordering of the kinds of intersections very much, with the exception that the three-way intersection between the three biomass phenotypes is now the least common intersection (with 4 1Mb regions containing significant associations for all three phenotypes).
What if we want to learn more about these 4 1Mb regions? We can go back to our upset dataframes to extract these regions.
sel_reg_biomass <- upset_df_1Mb %>%
filter(across(CLMB_BIOMASS:PKLE_BIOMASS, ~ .x == 1)) %>%
mutate(Sel_region = paste(CHR, POS_binned, sep = "_"))
sel_snp_biomass <- upset_df %>%
mutate(POS_binned = floor(POS/1000000)*1000000,
Sel_region = paste(CHR, POS_binned, sep = "_")) %>%
filter(Sel_region %in% sel_reg_biomass$Sel_region)
kable(sel_snp_biomass)
CHR | POS | GWAS_CT | CLMB_BIOMASS | KBSM_BIOMASS | PKLE_BIOMASS | POS_binned | Sel_region |
---|---|---|---|---|---|---|---|
Chr03K | 2197885 | 0 | 1 | 1 | 1 | 2.0e+06 | Chr03K_2e+06 |
Chr05K | 12177803 | 0 | 1 | 1 | 1 | 1.2e+07 | Chr05K_1.2e+07 |
Chr09K | 45587560 | 0 | 1 | 1 | 1 | 4.5e+07 | Chr09K_4.5e+07 |
Chr09N | 11325811 | 0 | 0 | 0 | 1 | 1.1e+07 | Chr09N_1.1e+07 |
Chr09N | 11967533 | 0 | 1 | 1 | 0 | 1.1e+07 | Chr09N_1.1e+07 |
If we were interested enough in these regions, we could also use this table to find annotations using pvdiv_table_topsnps()
.
sel_snp_annos <- sel_snp_biomass %>%
mutate(start = POS - 10000,
end = POS + 10000) # get the 20kb surrounding each position, POS, in thr_dfB
library(AnnotationDbi)
library(VariantAnnotation)
txdb <- loadDb(file = "../../pvdiv-genome/Pvirgatum_516_v5.1.gene.txdb.sqlite")
sel_snp_annos <- pvdiv_table_topsnps(df = sel_snp_annos, type = "table", txdb = txdb)
kable(sel_snp_annos)
CHR | region_start | region_end | width | region_strand | Annotation | QUERYID | TXID | Gene ID | POS | GWAS_CT | CLMB_BIOMASS | KBSM_BIOMASS | PKLE_BIOMASS | POS_binned | Sel_region | gene_start | gene_end | gene_width | gene_strand | pacId | transcriptName | peptideName | Pfam | Panther Categories | KOG | ec | KO | GO | Arabidopsis thaliana homolog | A. thaliana gene name | A. thaliana gene annotation | Rice homolog | Rice gene name | Rice gene annotation | d2_start | d2_end | distance from gene |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Chr03K | 2187885 | 2207885 | 20001 | * | intergenic | 1 | NA | NA | 2197885 | 0 | 1 | 1 | 1 | 2.0e+06 | Chr03K_2e+06 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Chr05K | 12167803 | 12187803 | 20001 | + | spliceSite | 2 | 60175 | Pavir.5KG164407 | 12177803 | 0 | 1 | 1 | 1 | 1.2e+07 | Chr05K_1.2e+07 | 12176848 | 12177263 | 416 | + | 41543573 | Pavir.5KG164407.1 | Pavir.5KG164407.1.p | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 955 | 540 | 540 |
Chr05K | 12167803 | 12187803 | 20001 | + | spliceSite | 2 | 60176 | Pavir.5KG164807 | 12177803 | 0 | 1 | 1 | 1 | 1.2e+07 | Chr05K_1.2e+07 | 12185636 | 12186204 | 569 | + | 41549658 | Pavir.5KG164807.1 | Pavir.5KG164807.1.p | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | -7833 | -8401 | 7833 |
Chr09K | 45577560 | 45597560 | 20001 | - | intron | 3 | 103241 | Pavir.8KG390800 | 45587560 | 0 | 1 | 1 | 1 | 4.5e+07 | Chr09K_4.5e+07 | NA | NA | NA | NA | 41644684 | Pavir.8KG390800.1 | Pavir.8KG390800.1.p | PF03810 | PTHR10997,PTHR10997:SF29 | NA | NA | NA | GO:0008536,GO:0006886 | AT3G17340.1 | NA | ARM repeat superfamily protein | LOC_Os11g47320.1 | NA | protein transporter, putative, expressed | NA | NA | NA |
Chr09K | 45577560 | 45597560 | 20001 | + | spliceSite | 3 | 110376 | Pavir.9KG334200 | 45587560 | 0 | 1 | 1 | 1 | 4.5e+07 | Chr09K_4.5e+07 | 45586592 | 45589853 | 3262 | + | 41604637 | Pavir.9KG334200.1 | Pavir.9KG334200.1.p | PF00249 | PTHR24078,PTHR24078:SF195 | NA | NA | NA | NA | AT5G47390.1 | NA | myb-like transcription factor family protein | LOC_Os10g41200.1 | NA | MYB family transcription factor, putative, expressed | 968 | -2293 | 968 |
Chr09K | 45577560 | 45597560 | 20001 | + | spliceSite | 3 | 116062 | Pavir.9KG334400 | 45587560 | 0 | 1 | 1 | 1 | 4.5e+07 | Chr09K_4.5e+07 | 45576944 | 45582938 | 5995 | - | 41606480 | Pavir.9KG334400.4 | Pavir.9KG334400.4.p | PF00582,PF04564,PF00069 | PTHR27003,PTHR27003:SF29 | KOG1187 | 2.7.11.1 | NA | GO:0006950,GO:0016567,GO:0004842,GO:0006468,GO:0005524,GO:0004672 | AT4G25160.1 | NA | U-box domain-containing protein kinase family protein | LOC_Os10g41220.1 | NA | protein kinase family protein, putative, expressed | 10616 | 4622 | 4622 |
Chr09K | 45577560 | 45597560 | 20001 | + | spliceSite | 3 | 116063 | Pavir.9KG334400 | 45587560 | 0 | 1 | 1 | 1 | 4.5e+07 | Chr09K_4.5e+07 | 45576944 | 45582938 | 5995 | - | 41606480 | Pavir.9KG334400.4 | Pavir.9KG334400.4.p | PF00582,PF04564,PF00069 | PTHR27003,PTHR27003:SF29 | KOG1187 | 2.7.11.1 | NA | GO:0006950,GO:0016567,GO:0004842,GO:0006468,GO:0005524,GO:0004672 | AT4G25160.1 | NA | U-box domain-containing protein kinase family protein | LOC_Os10g41220.1 | NA | protein kinase family protein, putative, expressed | 10616 | 4622 | 4622 |
Chr09K | 45577560 | 45597560 | 20001 | + | spliceSite | 3 | 116064 | Pavir.9KG334100 | 45587560 | 0 | 1 | 1 | 1 | 4.5e+07 | Chr09K_4.5e+07 | 45590557 | 45594099 | 3543 | - | 41607551 | Pavir.9KG334100.1 | Pavir.9KG334100.1.p | PF07690 | PTHR23500,PTHR23500:SF179 | KOG0254 | NA | NA | GO:0055085,GO:0016021 | AT5G61520.1 | NA | Major facilitator superfamily protein | LOC_Os10g41190.1 | NA | transporter family protein, putative, expressed | -2997 | -6539 | 2997 |
Chr09K | 45577560 | 45597560 | 20001 | + | spliceSite | 3 | 116065 | Pavir.9KG334100 | 45587560 | 0 | 1 | 1 | 1 | 4.5e+07 | Chr09K_4.5e+07 | 45590557 | 45594099 | 3543 | - | 41607551 | Pavir.9KG334100.1 | Pavir.9KG334100.1.p | PF07690 | PTHR23500,PTHR23500:SF179 | KOG0254 | NA | NA | GO:0055085,GO:0016021 | AT5G61520.1 | NA | Major facilitator superfamily protein | LOC_Os10g41190.1 | NA | transporter family protein, putative, expressed | -2997 | -6539 | 2997 |
Chr09K | 45577560 | 45597560 | 20001 | + | spliceSite | 3 | 116066 | Pavir.9KG334100 | 45587560 | 0 | 1 | 1 | 1 | 4.5e+07 | Chr09K_4.5e+07 | 45590557 | 45594099 | 3543 | - | 41607551 | Pavir.9KG334100.1 | Pavir.9KG334100.1.p | PF07690 | PTHR23500,PTHR23500:SF179 | KOG0254 | NA | NA | GO:0055085,GO:0016021 | AT5G61520.1 | NA | Major facilitator superfamily protein | LOC_Os10g41190.1 | NA | transporter family protein, putative, expressed | -2997 | -6539 | 2997 |
Chr09N | 11315811 | 11335811 | 20001 | - | intron | 4 | 117033 | Pavir.9KG583700 | 11325811 | 0 | 0 | 0 | 1 | 1.1e+07 | Chr09N_1.1e+07 | NA | NA | NA | NA | 41598633 | Pavir.9KG583700.1 | Pavir.9KG583700.1.p | PF07716 | PTHR22952,PTHR22952:SF198 | NA | NA | NA | GO:0043565,GO:0006355,GO:0003700 | AT3G62420.1 | ATBZIP53,BZIP53 | basic region/leucine zipper motif 53 | LOC_Os03g19375.1 | NA | expressed protein | NA | NA | NA |
Chr09N | 11315811 | 11335811 | 20001 | + | spliceSite | 4 | 124936 | Pavir.9NG142800 | 11325811 | 0 | 0 | 0 | 1 | 1.1e+07 | Chr09N_1.1e+07 | 11329238 | 11331352 | 2115 | - | 41552177 | Pavir.9NG142800.1 | Pavir.9NG142800.1.p | PF11250 | PTHR33155,PTHR33155:SF1 | NA | NA | NA | NA | AT5G22090.2 | NA | Protein of unknown function (DUF3049) | LOC_Os03g49830.1 | NA | expressed protein | -3427 | -5541 | 3427 |
Chr09N | 11315811 | 11335811 | 20001 | + | spliceSite | 4 | 124937 | Pavir.9NG142800 | 11325811 | 0 | 0 | 0 | 1 | 1.1e+07 | Chr09N_1.1e+07 | 11329238 | 11331352 | 2115 | - | 41552177 | Pavir.9NG142800.1 | Pavir.9NG142800.1.p | PF11250 | PTHR33155,PTHR33155:SF1 | NA | NA | NA | NA | AT5G22090.2 | NA | Protein of unknown function (DUF3049) | LOC_Os03g49830.1 | NA | expressed protein | -3427 | -5541 | 3427 |
Chr09N | 11957533 | 11977533 | 20001 | + | intron | 5 | 111994 | Pavir.9KG622000 | 11967533 | 0 | 1 | 1 | 0 | 1.1e+07 | Chr09N_1.1e+07 | NA | NA | NA | NA | 41599909 | Pavir.9KG622000.1 | Pavir.9KG622000.1.p | PF01588,PF09334 | PTHR11946,PTHR11946:SF78 | NA | 6.1.1.10 | K01874 | GO:0000049,GO:0006418,GO:0005524,GO:0004812,GO:0000166 | AT4G13780.1 | NA | methionine–tRNA ligase, putative / methionyl-tRNA synthetase, putative / MetRS, putative | LOC_Os06g31210.1 | NA | methionyl-tRNA synthetase, putative, expressed | NA | NA | NA |
Chr09N | 11957533 | 11977533 | 20001 | + | spliceSite | 5 | 119578 | Pavir.9NG147300 | 11967533 | 0 | 1 | 1 | 0 | 1.1e+07 | Chr09N_1.1e+07 | 11964735 | 11967667 | 2933 | + | 41554068 | Pavir.9NG147300.1 | Pavir.9NG147300.1.p | PF00227 | PTHR11599,PTHR11599:SF6 | KOG0177 | 3.4.25.1 | K02734 | GO:0051603,GO:0005839,GO:0004298 | AT3G22630.1 | PBD1,PRCGB | 20S proteasome beta subunit D1 | LOC_Os03g48930.1 | NA | peptidase, T1 family, putative, expressed | 2798 | -134 | 134 |
Chr09N | 11957533 | 11977533 | 20001 | + | spliceSite | 5 | 124997 | Pavir.9NG147373 | 11967533 | 0 | 1 | 1 | 0 | 1.1e+07 | Chr09N_1.1e+07 | 11967554 | 11968429 | 876 | - | 41561168 | Pavir.9NG147373.1 | Pavir.9NG147373.1.p | PF14368 | PTHR33044,PTHR33044:SF14 | NA | NA | NA | NA | AT2G48130.1 | NA | Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin superfamily protein | LOC_Os07g07860.1 | NA | LTPL76 - Protease inhibitor/seed storage/LTP family protein precursor, expressed | -21 | -896 | 21 |
Chr09N | 11957533 | 11977533 | 20001 | + | spliceSite | 5 | 124998 | Pavir.9NG147500 | 11967533 | 0 | 1 | 1 | 0 | 1.1e+07 | Chr09N_1.1e+07 | 11968815 | 11969741 | 927 | - | 41554575 | Pavir.9NG147500.1 | Pavir.9NG147500.1.p | PF14368 | PTHR33044,PTHR33044:SF14 | NA | NA | NA | NA | AT3G22600.1 | NA | Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin superfamily protein | LOC_Os03g57970.1 | NA | LTPL73 - Protease inhibitor/seed storage/LTP family protein precursor, expressed | -1282 | -2208 | 1282 |
Chr09N | 11957533 | 11977533 | 20001 | + | spliceSite | 5 | 124999 | Pavir.9NG147573 | 11967533 | 0 | 1 | 1 | 0 | 1.1e+07 | Chr09N_1.1e+07 | 11970419 | 11974568 | 4150 | - | 41553090 | Pavir.9NG147573.1 | Pavir.9NG147573.1.p | PF07466 | PTHR33975,PTHR33975:SF2 | NA | NA | NA | NA | AT1G54520.1 | NA | NA | LOC_Os03g48920.1 | NA | DUF1517 domain containing protein, putative, expressed | -2886 | -7035 | 2886 |