Read Alignment
Alignment Process
Alignment of bisulfite sequencing reads differs from alignment of non-bisulfite data in two key ways:
- Reference cytosines are sequenced as either thymine or cytosine following bisulfite conversion.
- The reference sense and anti-sense strands are no longer complementary following bisulfite conversion.
BSBolt utilizes a modified version of BWA MEM tailored for alignment of bisulfite sequencing data, that follows the general workflow below.
- Input reads are In silico bisulfite converted, modifying unconverted methylatable bases.
- When aligning undirectional reads, two substitution patterns per read are possible. DNA originating from either reference strand will have a cytosine to thymine conversion pattern while its PCR product will have a guanine to adenine conversion pattern. To assess which conversion pattern is most appropriate the read with the fewer conversion substitutions relative to the read length is preferentially aligned. So, a read with a cytosine to thymine conversion pattern should contain fewer cytosine bases than guanine bases. If there are a high number of both cytosine and guanine bases alignments for both conversion patterns are assessed and the better alignment is output.
- Converted reads are mapped to a bisulfite alignment index that contains both the sense and anti-sense bisulfite strands relative to the reference.
- Read alignments are assessed for mapping quality and bisulfite uniqueness.
- Bisulfite sequencing meta-data is the generated by comparing the read alignment to an unconverted reference sequence.
BSB Align Commands
bsbolt Align -F1 {fastq1} -DB {bsbolt db} -O {output}
-h, --help show this help message and exit
Input / Output Options:
-F1 File path to fastq 1
-F2 File path to fastq 2 [null]
-O File output Prefix
-OS output to stdout [False]
-OT Int number of bam output threads [1]
-DB File path to bsbolt database
-R Str read group header line such as '@RG ID:foo SM:bar' [null]
-H Str insert STR to header if it starts with @; or insert lines in FILE [null]
-XA Int,Int if there are <INT hits with score >80 percent of the max score, output all in XA [100,200]
-DR Float drop ratio for alternative hits reported in XA tag,
for best bisulfite alignment performance set at or above default [0.95]
-p smart pairing (ignoring in2.fq)
Scoring Options
-A Int score for a sequence match, which scales options -TdBOELU unless overridden [1]
-B Int penalty for a mismatch [4]
-INDEL Int gap open penalties for deletions and insertions [6,6]
-E Int gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]
-L Int,Int penalty for 5'- and 3'-end clipping [30,30]
-U Int penalty for an unpaired read pair [17]
Bisulfite Options
-UN library undirectional, ie. consider PCR products of bisulfite converted DNA
-CP Float CH conversion proportion threshold [0.5]
-CT Int number of CH sites needed to assess read conversion
-SP Float substitution threshold for read bisulfite conversion patterns (ie C2T, G2A) [0.1]
for undirectional libraries the substitution pattern with the fewer number of
substitutions relative to the total read length (if < threshold) is aligned preferentially
Algorithm Options
-t Int number of bwa threads [1]
-k Int minimum seed length [19]
-w Int band width for banded alignment [100]
-d Int off-diagonal X drop off [100]
-r Float look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
-y Int seed occurrence for the 3rd round seeding [20]
-c Int skip seeds with more than INT occurrences [500]
-D Float drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]
-W Int discard a chain if seeded bases shorter than INT [0]
-m Int perform at most INT rounds of mate rescues for each read [50] -S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
-j ignore ALT contigs
-T Int minimum score to output [80], set based on read length
-M mark shorter split hits as secondary
-I Fl,Fl,Int,Int specify the mean, standard deviation (10 percent of the mean if absent),
max (4 sigma from the mean if absent) and min of the insert size distribution.
FR orientation only. [inferred]
Example Commands
Paired End Alignment
# Paired End Alignment Using Default Commands
python3 -m bsbolt Align -DB ~/tests/TestData/BSB_Test_DB -F1 ~/tests/TestSimulations/BSB_pe_meth_1.fastq -F2 ~/tests/TestSimulations/BSB_pe_meth_2.fastq -O ~/tests/BSB_pe_test
Single End Alignment
# Single End Alignment Using Default Commands
python3 -m bsbolt Align -DB ~/tests/TestData/BSB_Test_DB -F1 ~/tests/TestSimulations/BSB_pe_meth_1.fastq -O ~/tests/BSB_pe_test