Skip to content

Read Alignment

Alignment Process

Alignment of bisulfite sequencing reads differs from alignment of non-bisulfite data in two key ways:

  1. Reference cytosines are sequenced as either thymine or cytosine following bisulfite conversion.
  2. The reference sense and anti-sense strands are no longer complementary following bisulfite conversion.

BSBolt utilizes a modified version of BWA MEM tailored for alignment of bisulfite sequencing data, that follows the general workflow below.

  1. Input reads are In silico bisulfite converted, modifying unconverted methylatable bases.
    1. When aligning undirectional reads, two substitution patterns per read are possible. DNA originating from either reference strand will have a cytosine to thymine conversion pattern while its PCR product will have a guanine to adenine conversion pattern. To assess which conversion pattern is most appropriate the read with the fewer conversion substitutions relative to the read length is preferentially aligned. So, a read with a cytosine to thymine conversion pattern should contain fewer cytosine bases than guanine bases. If there are a high number of both cytosine and guanine bases alignments for both conversion patterns are assessed and the better alignment is output.
  2. Converted reads are mapped to a bisulfite alignment index that contains both the sense and anti-sense bisulfite strands relative to the reference.
  3. Read alignments are assessed for mapping quality and bisulfite uniqueness.
    1. Bisulfite sequencing meta-data is the generated by comparing the read alignment to an unconverted reference sequence.

BSB Align Commands

bsbolt Align -F1 {fastq1} -DB {bsbolt db} -O {output}

-h, --help    show this help message and exit

Input / Output Options:
  -F1 File     path to fastq 1
  -F2 File     path to fastq 2 [null]
  -O File      output Prefix
  -OS          output to stdout [False]
  -OT Int      number of bam output threads [1]
  -DB File     path to bsbolt database
  -R Str       read group header line such as '@RG ID:foo SM:bar' [null]
  -H Str       insert STR to header if it starts with @; or insert lines in FILE [null]
  -XA Int,Int  if there are <INT hits with score >80 percent of the max score, output all in XA [100,200]
  -DR Float    drop ratio for alternative hits reported in XA tag, 
               for best bisulfite alignment performance set at or above default [0.95]
  -p           smart pairing (ignoring in2.fq)
Scoring Options
  -A Int       score for a sequence match, which scales options -TdBOELU unless overridden [1]
  -B Int       penalty for a mismatch [4]
  -INDEL Int   gap open penalties for deletions and insertions [6,6]
  -E Int       gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
  -L Int,Int   penalty for 5'- and 3'-end clipping [30,30]
  -U Int       penalty for an unpaired read pair [17]
Bisulfite Options
  -UN          library undirectional, ie. consider PCR products of bisulfite converted DNA
  -CP Float    CH conversion proportion threshold [0.5]
  -CT Int      number of CH sites needed to assess read conversion
  -SP Float    substitution threshold for read bisulfite conversion patterns (ie C2T, G2A) [0.1]
               for undirectional libraries the substitution pattern with the fewer number of
               substitutions relative to the total read length (if < threshold) is aligned preferentially
Algorithm Options
  -t Int       number of bwa threads [1]
  -k Int       minimum seed length [19]
  -w Int       band width for banded alignment [100]
  -d Int       off-diagonal X drop off [100]
  -r Float     look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
  -y Int       seed occurrence for the 3rd round seeding [20]
  -c Int       skip seeds with more than INT occurrences [500]
  -D Float     drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
  -W Int       discard a chain if seeded bases shorter than INT [0]
  -m Int       perform at most INT rounds of mate rescues for each read [50]  -S           skip mate rescue
  -P           skip pairing; mate rescue performed unless -S also in use
  -j           ignore ALT contigs
  -T Int       minimum score to output [80], set based on read length
  -M           mark shorter split hits as secondary
  -I Fl,Fl,Int,Int    specify the mean, standard deviation (10 percent of the mean if absent),
                      max (4 sigma from the mean if absent) and min of the insert size distribution.
                      FR orientation only. [inferred]

Example Commands

Paired End Alignment

# Paired End Alignment Using Default Commands
python3 -m bsbolt Align -DB ~/tests/TestData/BSB_Test_DB -F1 ~/tests/TestSimulations/BSB_pe_meth_1.fastq -F2 ~/tests/TestSimulations/BSB_pe_meth_2.fastq -O ~/tests/BSB_pe_test

Single End Alignment

# Single End Alignment Using Default Commands
python3 -m bsbolt Align -DB ~/tests/TestData/BSB_Test_DB -F1 ~/tests/TestSimulations/BSB_pe_meth_1.fastq -O ~/tests/BSB_pe_test