Variant Calling

class bsbolt.Variant.CallRegionVariation(input_file=None, genome_database=None, ignore_overlap=True, ignore_orphans=True, min_read_depth=10, max_read_depth=8000, contig=None, min_base_quality=10, return_queue=None, min_mapping_quality=10, start=None, stop=None)

Params:

  • input_file (str): str to input bam/sam file
  • genome_database (str): str to genome directory
  • ignore_overlap (bool): ignore overlapping signal, calls methylation using the higher quality base for reads with overlapping alignments, [True]
  • ignore_oprhans (bool): ignore orphaned reads (not properly paired), [True]
  • max_read_depth (int): maximum read depth for pileup, [8000]
  • contig (str): contig to process
  • start (int): contig start position
  • stop (int): contig stop position
  • min_base_quality (int): minimum quality for a base to be reported for methylation calling, [10]
  • return_queue (Queue.queue): results are added to a queue in batches for multi-threaded access
  • cg_only (bool): only return CG sites to queue, [False]
  • min_mapping_quality (int): minimum mapping quality for an alignment to be used for methylation calling, [10]

Attributes:

  • self.chunk_size (int): Number of sites in chunk added to queue, [10000]
call_contig(self, chrom_seq)

Iterates through bam pileup, calling methylation values if the reference nucleotide is a C or G. Pileup reads are buffered and accessed as needed. Process sites are appended to list and returned in chunks.

Reads flagged as:

  • pcr duplicates (1024), read unmapped (4), read fails platform/vendor quality checks (512) are ignored
call_variation(self)

Run methylation call for contig

check_cg(nucleotide, fivemer)

Params:

  • nucleotide (str): 1 nucleotide
  • fivemer (str): 5 nucleotides

Returns:

  • cg (bool): True if CG site
get_reference_sequence(path)

load serialized reference file from path

bsbolt.Variant.generate_log_likelihood_matrix(error_rates=array([0.005]), methylation_rate=0.01, conversion_rate=0.99)

Generates log likelihood matrix for Bayesian bisulfite sequencing variant caller. The likelihoods are dependent on the expected methylation level for a particular site. The likelihood of observing one of 4 possible bases is a function of the genotype ($n=10$ for a diploid organism) and the strand ($n=2$). The probability of observing a cytosine / thymine for sense reads and guanine / adenine for anti-sense reads is dependent on the methylation rate, the conversion rate, and the base call error rate.

Params:

  • error_rates (np.ndarray): base call error probabilities
  • methylation_rate (float): methylation rate for matrix generation
  • conversion_rate (float): bisulfite conversion efficiency

Returns:

  • llm (np.ndarray): $8\times10$matrix of read base log likelihoods for sense and anti-sense bisulfite converted reads. *Rows (+A, +T, +C, +G, -A, -T, -C, -G), Columns (AA, AT, AC, AG, TT, TC, TG, CC, CG, GG)
bsbolt.Variant.predict_bayes_genotype(base_calls, log_likelihood_matrix, priors=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))

Provides genotype calls given an array of base counts at a site of interest and a generated log likelihood matrix. Weighted priors may also be provided, defaults to 1.0 for all genotypes. The genotype with the maximum probability is returned.

Params:

  • base_calls (np.ndarray): counts of base calls in order (+A, +T, +C, +G, -A, -T, -C, -G)
  • log_likelihood_matrix (np.ndarray): $8\times10$matrix of read base log likelihoods for sense and anti-sense bisulfite converted reads. *Rows (+A, +T, +C, +G, -A, -T, -C, -G), Columns (AA, AT, AC, AG, TT, TC, TG, CC, CG, GG)
  • priors (np.ndarray): expected diploid genotype frequency, defaults to 1.0 for all genotypes

Returns:

  • genotype call: Tuple[float, float, float, str]: genotype probability, 1 - genotype probability, genotype score (log_{10}(Max(genotype probability)/ Max -1(genotype probability )), genotype
class bsbolt.Variant.ProcessVarContigs(input_file=None, genome_database=None, output_prefix=None, ignore_overlap=True, text_output=False, min_read_depth=10, max_read_depth=8000, threads=1, verbose=True, min_base_quality=10, min_mapping_quality=10, ignore_orphans=False, bed_output=True, vcf_output=False, output_reference_calls=False, call_region_bed=None, min_pval=0.001)

Multi-threaded contig processing wrapper. Passes thread safe queue to workers and outputs values as CGmap file.

Params:

  • input_file (str): str to input bam/sam file
  • genome_database (str): str to genome directory
  • output_prefix (str): output prefix for CGmap
  • ignore_overlap (bool): ignore overlapping reads, [True]
  • text_output (bool): output compressed or plain text files, [False]
  • min_read_depth (int): default = minimum read depth to call methylation values, [1]
  • max_read_depth (int): maximum read depth for pileup, [8000]
  • threads (int): , if one watcher and processing on same thread else separated, [1]
  • min_base_quality (int): minimum base quality for base to be considered, [10]
  • min_mapping_quality (int): minimum mapping quality for an alignment to be considered, [10]
  • verbose (bool): Verbose processing, [False]
  • cg_only (bool): only return CG sites to queue, [False]
  • ignore_oprhans (bool): ignore orphaned reads (not properly paired), [True]

Usage:

python process_values = ProcessContigs(**kwargs) process_values.process_contigs()

collect_stats(self, variant_call)

Collect global methylation statistics

format_bed(variant_call)

bed line formatting

format_vcf(variant_call)

VCF line formatting

get_contigs

get list of contigs in input file, threads set across contigs

get_output_objects
print_stats(self)

Print global methylation statistics

process_contigs(self)

Launches a processing pool to call methylation values across the input file contigs

unpack_var_call(variant_call)
variant_calling_error(self, error)

Raise if exception thrown in methylation calling process

watch_pool(self)

Watch self.return_dict and process return methylation values. Contigs are processed in order so buffer can become large if first contig is large, ie. Human Chr1

write_line(self, output_object, line)

Outputs line, and encodes if necessary

Params:

  • output_object (TextIO/GZipIO): output object
  • line (str): formatted line to write
write_output(self, variant_calls)

Give a list of methylation call dicts, output formatted line

Params:

  • methylation_lines (list): list of dict containing methylation call information
class bsbolt.Variant.CallVariant(error_rates=array([0.001]), conversion_rate=0.995, lower_methylation_rate=0.01, upper_methylation_rate=0.9)
call_variant(self, base_calls, priors=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]), cg_site=False)