Alice MacQueen - Using OLGA to simulate T-cell receptor sequence diversity to assess genomic assay biases

T-cell receptor (TCR) proteins are heterodimers, made up of one \(\alpha\) chain (TCRA) and one \(\beta\) chain (TCRB). Each chain has several hypervariable regions, the most diverse of which is known as the CDR3. In TCRB sequences, the CDR3 is generated by recombination between three genomic segments: a V-, a D-, and a J-gene. Up to \(10^{15}\) distinct TCRs could, in principle, be formed by V-D-J recombination. This is an extremely large number: for scale, \(10^{15}\) meters encompasses our solar system out to the Oort cloud. About \(4\times10^{11}\) T cells circulate in an adult human, and these carry about \(10^{10}\) unique receptor sequences (Lythe et al. 2016). On the same scale, \(10^{11}\) meters is about five to six weeks of Earth’s orbit, and \(10^{10}\) meters is about four days.
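The distance comparisons above can be checked with a few lines of arithmetic. This is a rough sketch in Python that assumes a circular orbit with a radius of one astronomical unit and a 365.25-day year; the constant and function names are my own.

```python
import math

AU_M = 1.495978707e11   # one astronomical unit, in meters
YEAR_DAYS = 365.25      # assumed length of one orbit, in days

# Circumference of an (assumed circular) orbit of radius 1 AU
orbit_m = 2 * math.pi * AU_M
m_per_day = orbit_m / YEAR_DAYS


def days_of_orbit(distance_m):
    """Days Earth needs to travel distance_m along its orbit."""
    return distance_m / m_per_day


def meters_to_au(distance_m):
    """Convert a distance in meters to astronomical units."""
    return distance_m / AU_M


for exponent in (10, 11):
    days = days_of_orbit(10.0 ** exponent)
    print(f"1e{exponent} m is about {days:.1f} days of Earth's orbit")
print(f"1e15 m is about {meters_to_au(1e15):.0f} AU from the Sun")
```

Under these assumptions, \(10^{11}\) meters comes out at roughly 39 days of orbit and \(10^{10}\) meters at roughly 4 days.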

A V-, D-, and J- gene recombine to form the CDR3 region of each unique T-cell receptor sequence (“clonotype”).

Adaptive’s immunosequencing platform has several assays that sequence these hypervariable CDR3 regions. However, these assays differ in several important sequencing features, most importantly in the set of multiplexed PCR primers used to amplify the CDR3 region. Primer changes can alter cross-priming, which can cause consistent, recurrent sequence and gene annotation biases between assay versions. Unless these biases are understood and corrected, they can cause strong false signals in downstream machine learning approaches such as disease modeling.

To measure and correct assay version-specific biases, I first simulated realistic CDR3 regions of varying lengths, then joined these simulated sequences to assay version-specific V-gene sequences. Fisher’s exact tests on the counts of V- and J-gene annotations in each assay version were sufficient to reveal biases between versions, after which I could explore workflow changes that reduced them.
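To make the testing step concrete, here is a minimal sketch of a two-sided Fisher’s exact test on a 2x2 table of annotation counts, written in plain Python. It is an illustration, not the workflow used for the real assays: the function name, the assay labels, and all counts are hypothetical. The table asks whether one V-gene is annotated at a different rate in two assay versions that were given the same simulated sequences.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over every table that is no more probable than the observed one
    p_value = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(p_value, 1.0)


# Hypothetical counts, invented for illustration only:
# rows are assay versions, columns are "annotated as this V-gene" / "other".
assay_v1 = (120, 9880)
assay_v2 = (180, 9820)

p = fisher_exact_two_sided(*assay_v1, *assay_v2)
print(f"p = {p:.2e}")
```

With one test per gene, the resulting p-values would normally need a multiple-testing correction before calling any single gene biased.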

The first step in this process is to simulate the CDR3 regions of the TCR sequences. To do this, I use the Python package OLGA, which generates CDR3 sequences from a generative model of V(D)J recombination.

Installing OLGA

Run pip install olga in your terminal to install OLGA.

Once OLGA is installed it can be run from the command line. The following command generates 500K sequences (-n 5e5) using the default generative model settings for human TCRB sequences (--humanTRB) and saves them in the working directory to a file called simulated_tcrb.tsv (-o simulated_tcrb.tsv).

alice@genegenie ~ % olga-generate_sequences --humanTRB -n 5e5 -o simulated_tcrb.tsv

Starting sequence generation... 
100000 sequences generated in 3.90 seconds. Estimated time remaining: 15.60 seconds.
200000 sequences generated in 7.77 seconds. Estimated time remaining: 11.65 seconds.
300000 sequences generated in 11.80 seconds. Estimated time remaining: 7.87 seconds.
400000 sequences generated in 15.66 seconds. Estimated time remaining: 3.92 seconds.
500000 sequences generated in 19.51 seconds. Estimated time remaining: 0.00 seconds.
Completed generating all 500000 sequences in 19.51 seconds.

Let’s load the CDR3 sequences OLGA generated into R and look at some basic visualizations of the data.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(knitr)
# library(here)

tcrb <- read_delim("simulated_tcrb.tsv",
  delim = "\t",
  col_names = c("CDR3_nt", "CDR3_aa", "V_gene", "J_gene"), 
  col_types = "cccc"
  )

The default table that OLGA generates has four columns: the CDR3 nucleotide & amino acid sequences, and the V- and J-gene selected in V(D)J recombination to generate this CDR3 sequence.

tcrb |>
  head(n = 15) |>
  kable()

CDR3_nt	CDR3_aa	V_gene	J_gene
TGTGCCAGCAGGGTTGGGACCCTAGGAAACACCATATATTTT	CASRVGTLGNTIYF	TRBV6-3	TRBJ1-3
TGTGCCAGCAGCGGTCTAGACAGGCGGGGTCATAATTCACCCCTCCACTTT	CASSGLDRRGHNSPLHF	TRBV9	TRBJ1-6
TGTGCCAGTAGTGAAGGGACTAGCGGGCCCTCCGGGGAGCTGTTTTTT	CASSEGTSGPSGELFF	TRBV19	TRBJ2-2
TGTGCCAGCTTTATGGAGCGCTACGAGCAGTACTTC	CASFMERYEQYF	TRBV7-6	TRBJ2-7
TGCAGTGCGCCGGGAGGGGGGAACACTGAAGCTTTCTTT	CSAPGGGNTEAFF	TRBV20-1	TRBJ1-1
TGTGCCGGAAAGCCGGGACAGCGCAAGGGGCGCTCTGGAAACACCATATATTTT	CAGKPGQRKGRSGNTIYF	TRBV19	TRBJ1-3
TGTGCCAGCAGCTGGGGGACCGGGGAGCTGTTTTTT	CASSWGTGELFF	TRBV11-2	TRBJ2-2
TGCAGTGCTAGAGATAAATCTGGACAGGGGTGTAGCAATCAGCCCCAGCATTTT	CSARDKSGQGCSNQPQHF	TRBV20-1	TRBJ1-5
TGCAGTGCCCCTTACTATAACACTGAAGCTTTCTTT	CSAPYYNTEAFF	TRBV20-1	TRBJ1-1
TGTGCCAGCAGCCAAGATGGACAGAGCTCTGGAAACACCATATATTTT	CASSQDGQSSGNTIYF	TRBV4-2	TRBJ1-3
TGCGCCAGCAGCTTGCCTACGCGGGAGGGCCAAGAGACCCAGTACTTC	CASSLPTREGQETQYF	TRBV5-1	TRBJ2-5
TGTAGAAATTCCCTCATGAAAAACATTCAGTACTTC	CRNSLMKNIQYF	TRBV6-5	TRBJ2-4
TGTGCCACGCTCCCAGGTGGGCAGCCCCAGCATTTT	CATLPGGQPQHF	TRBV12-3	TRBJ1-5
TGCGCCAGCAGCTTGACCCCCACTAGCGGGAGTATGAGGAACACCGGGGAGCTGTTTTTT	CASSLTPTSGSMRNTGELFF	TRBV5-1	TRBJ2-2
TGTGCCAGCAGCATACTGCAGAACGCCCAGCATTTT	CASSILQNAQHF	TRBV3-1	TRBJ1-5
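One quick sanity check on this table is that each nucleotide CDR3 translates to the amino acid sequence in the next column. The sketch below does this in Python with the standard genetic code, using three rows copied from the table above; the `translate` helper is my own and is not part of OLGA.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cdr3_nt):
    """Translate an in-frame nucleotide CDR3 into amino acids."""
    if len(cdr3_nt) % 3 != 0:
        raise ValueError("CDR3 length is not a multiple of 3")
    return "".join(
        CODON_TABLE[cdr3_nt[i:i + 3]] for i in range(0, len(cdr3_nt), 3)
    )


# (CDR3_nt, CDR3_aa) pairs copied from the table above
rows = [
    ("TGTGCCAGCAGGGTTGGGACCCTAGGAAACACCATATATTTT", "CASRVGTLGNTIYF"),
    ("TGTGCCAGCTTTATGGAGCGCTACGAGCAGTACTTC", "CASFMERYEQYF"),
    ("TGCAGTGCCCCTTACTATAACACTGAAGCTTTCTTT", "CSAPYYNTEAFF"),
]

for cdr3_nt, cdr3_aa in rows:
    assert translate(cdr3_nt) == cdr3_aa
    print(cdr3_nt, "->", translate(cdr3_nt))
```

Every row shown here has a length that is a multiple of three, so each CDR3 is in frame.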

What are the lengths of these nucleotide sequences, and how many have each combination of V- and J-gene?

tcrb <- tcrb |> 
  mutate(CDR3_nt_length = nchar(CDR3_nt)) 

tcrb |> 
  ggplot(aes(x = CDR3_nt_length)) +
  geom_histogram(binwidth = 3) +
  theme_classic(base_size = 12) +
  labs(x = "CDR3 length (nt)", y = "Count")

CDR3 sequences from 24 to 75 nucleotides in length are represented in this dataset.

tcrb |> 
  ggplot(aes(x = J_gene, y = CDR3_nt_length)) + 
  geom_count() + 
  scale_size_area(max_size = 9) +
  theme_classic(base_size = 12) +
  theme(axis.text.x = element_text(hjust = 1, vjust = 1, angle = 45)) +
  labs(x = "J-gene", y = "CDR3 length (nt)")

Some J-genes are used much more often than others in OLGA’s generative model, such as TRBJ2-7. Many produce a similar range of CDR3 lengths, at least by eye; TRBJ2-6 seems to produce noticeably longer CDR3 sequences on average.

tcrb |> 
  ggplot(aes(x = V_gene, y = CDR3_nt_length)) + 
  geom_count() + 
  scale_size_area(max_size = 6) +
  theme_classic(base_size = 12) +
  theme(axis.text.x = element_text(hjust = 1, vjust = 1, angle = 45)) +
  labs(x = "V-gene", y = "CDR3 length (nt)")

Finally, there is quite a range in the number of CDR3 sequences generated using each V-gene in OLGA’s generative model. V-genes like TRBV16 are rarely used to make functional CDR3 sequences, while V-genes like TRBV20-1 and TRBV7-9 are very commonly used.
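The plots above show V- and J-gene usage separately. To count how many sequences carry each V- and J-gene combination, a simple tally is enough. The sketch below does this in Python on five rows copied from the table earlier in this post; the `tally_vj` helper is my own, and it assumes the four-column, tab-separated, header-free layout described above.

```python
import csv
from collections import Counter


def tally_vj(tsv_lines):
    """Count (V_gene, J_gene) pairs in OLGA-style tab-separated lines."""
    counts = Counter()
    for cdr3_nt, cdr3_aa, v_gene, j_gene in csv.reader(tsv_lines, delimiter="\t"):
        counts[(v_gene, j_gene)] += 1
    return counts


# Five rows copied from the table of simulated sequences shown earlier
sample = [
    "TGCAGTGCGCCGGGAGGGGGGAACACTGAAGCTTTCTTT\tCSAPGGGNTEAFF\tTRBV20-1\tTRBJ1-1",
    "TGCAGTGCTAGAGATAAATCTGGACAGGGGTGTAGCAATCAGCCCCAGCATTTT\tCSARDKSGQGCSNQPQHF\tTRBV20-1\tTRBJ1-5",
    "TGCAGTGCCCCTTACTATAACACTGAAGCTTTCTTT\tCSAPYYNTEAFF\tTRBV20-1\tTRBJ1-1",
    "TGTGCCAGTAGTGAAGGGACTAGCGGGCCCTCCGGGGAGCTGTTTTTT\tCASSEGTSGPSGELFF\tTRBV19\tTRBJ2-2",
    "TGTGCCGGAAAGCCGGGACAGCGCAAGGGGCGCTCTGGAAACACCATATATTTT\tCAGKPGQRKGRSGNTIYF\tTRBV19\tTRBJ1-3",
]

vj_counts = tally_vj(sample)
for (v_gene, j_gene), n in vj_counts.most_common():
    print(v_gene, j_gene, n)
```

For the full dataset, the same function should work on an open file handle, for example one opened on simulated_tcrb.tsv with `newline=""`, since `csv.reader` accepts any iterable of lines.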

References

Lythe, Grant, Robin E. Callard, Rollo L. Hoare, and Carmen Molina-París. 2016. “How Many TCR Clonotypes Does a Body Maintain?” Journal of Theoretical Biology 389 (January): 214–24. https://doi.org/10.1016/j.jtbi.2015.10.016.

Citation

BibTeX citation:

@online{macqueen2024,
  author = {MacQueen, Alice},
  title = {Using {OLGA} to Simulate {T-cell} Receptor Sequence Diversity
    to Assess Genomic Assay Biases},
  date = {2024-06-03},
  url = {https://alice-macqueen.github.io/posts/2024-06-03-olga/},
  langid = {en}
}

For attribution, please cite this work as:

MacQueen, Alice. 2024. “Using OLGA to Simulate T-Cell Receptor Sequence Diversity to Assess Genomic Assay Biases.” June 3, 2024. https://alice-macqueen.github.io/posts/2024-06-03-olga/.