Variant Calling
bsbolt.Variant.CallRegionVariation
(input_file=None, genome_database=None, ignore_overlap=True, ignore_orphans=True, min_read_depth=10, max_read_depth=8000, contig=None, min_base_quality=10, return_queue=None, min_mapping_quality=10, start=None, stop=None)Params:
- input_file (str): str to input bam/sam file
- genome_database (str): str to genome directory
- ignore_overlap (bool): ignore overlapping signal, calls methylation using the higher quality base for reads with overlapping alignments, [True]
- ignore_oprhans (bool): ignore orphaned reads (not properly paired), [True]
- max_read_depth (int): maximum read depth for pileup, [8000]
- contig (str): contig to process
- start (int): contig start position
- stop (int): contig stop position
- min_base_quality (int): minimum quality for a base to be reported for methylation calling, [10]
- return_queue (Queue.queue): results are added to a queue in batches for multi-threaded access
- cg_only (bool): only return CG sites to queue, [False]
- min_mapping_quality (int): minimum mapping quality for an alignment to be used for methylation calling, [10]
Attributes:
- self.chunk_size (int): Number of sites in chunk added to queue, [10000]
call_contig
(self, chrom_seq)Iterates through bam pileup, calling methylation values if the reference nucleotide is a C or G. Pileup reads are buffered and accessed as needed. Process sites are appended to list and returned in chunks.
Reads flagged as:
- pcr duplicates (1024), read unmapped (4), read fails platform/vendor quality checks (512) are ignored
call_variation
(self)Run methylation call for contig
check_cg
(nucleotide, fivemer)Params:
- nucleotide (str): 1 nucleotide
- fivemer (str): 5 nucleotides
Returns:
- cg (bool): True if CG site
get_reference_sequence
(path)load serialized reference file from path
bsbolt.Variant.generate_log_likelihood_matrix
(error_rates=array([0.005]), methylation_rate=0.01, conversion_rate=0.99)Generates log likelihood matrix for Bayesian bisulfite sequencing variant caller. The likelihoods are dependent on the expected methylation level for a particular site. The likelihood of observing one of 4 possible bases is a function of the genotype ($n=10$ for a diploid organism) and the strand ($n=2$). The probability of observing a cytosine / thymine for sense reads and guanine / adenine for anti-sense reads is dependent on the methylation rate, the conversion rate, and the base call error rate.
Params:
- error_rates (np.ndarray): base call error probabilities
- methylation_rate (float): methylation rate for matrix generation
- conversion_rate (float): bisulfite conversion efficiency
Returns:
- llm (np.ndarray): $8\times10$matrix of read base log likelihoods for sense and anti-sense bisulfite converted reads. *Rows (+A, +T, +C, +G, -A, -T, -C, -G), Columns (AA, AT, AC, AG, TT, TC, TG, CC, CG, GG)
bsbolt.Variant.predict_bayes_genotype
(base_calls, log_likelihood_matrix, priors=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]))Provides genotype calls given an array of base counts at a site of interest and a generated log likelihood matrix. Weighted priors may also be provided, defaults to 1.0 for all genotypes. The genotype with the maximum probability is returned.
Params:
- base_calls (np.ndarray): counts of base calls in order (+A, +T, +C, +G, -A, -T, -C, -G)
- log_likelihood_matrix (np.ndarray): $8\times10$matrix of read base log likelihoods for sense and anti-sense bisulfite converted reads. *Rows (+A, +T, +C, +G, -A, -T, -C, -G), Columns (AA, AT, AC, AG, TT, TC, TG, CC, CG, GG)
- priors (np.ndarray): expected diploid genotype frequency, defaults to 1.0 for all genotypes
Returns:
- genotype call: Tuple[float, float, float, str]: genotype probability, 1 - genotype probability, genotype score (log_{10}(Max(genotype probability)/ Max -1(genotype probability )), genotype
bsbolt.Variant.ProcessVarContigs
(input_file=None, genome_database=None, output_prefix=None, ignore_overlap=True, text_output=False, min_read_depth=10, max_read_depth=8000, threads=1, verbose=True, min_base_quality=10, min_mapping_quality=10, ignore_orphans=False, bed_output=True, vcf_output=False, output_reference_calls=False, call_region_bed=None, min_pval=0.001)Multi-threaded contig processing wrapper. Passes thread safe queue to workers and outputs values as CGmap file.
Params:
- input_file (str): str to input bam/sam file
- genome_database (str): str to genome directory
- output_prefix (str): output prefix for CGmap
- ignore_overlap (bool): ignore overlapping reads, [True]
- text_output (bool): output compressed or plain text files, [False]
- min_read_depth (int): default = minimum read depth to call methylation values, [1]
- max_read_depth (int): maximum read depth for pileup, [8000]
- threads (int): , if one watcher and processing on same thread else separated, [1]
- min_base_quality (int): minimum base quality for base to be considered, [10]
- min_mapping_quality (int): minimum mapping quality for an alignment to be considered, [10]
- verbose (bool): Verbose processing, [False]
- cg_only (bool): only return CG sites to queue, [False]
- ignore_oprhans (bool): ignore orphaned reads (not properly paired), [True]
Usage:
python
process_values = ProcessContigs(**kwargs)
process_values.process_contigs()
collect_stats
(self, variant_call)Collect global methylation statistics
format_bed
(variant_call)bed line formatting
format_vcf
(variant_call)VCF line formatting
get_contigs
get list of contigs in input file, threads set across contigs
get_output_objects
print_stats
(self)Print global methylation statistics
process_contigs
(self)Launches a processing pool to call methylation values across the input file contigs
unpack_var_call
(variant_call)variant_calling_error
(self, error)Raise if exception thrown in methylation calling process
watch_pool
(self)Watch self.return_dict and process return methylation values. Contigs are processed in order so buffer can become large if first contig is large, ie. Human Chr1
write_line
(self, output_object, line)Outputs line, and encodes if necessary
Params:
- output_object (TextIO/GZipIO): output object
- line (str): formatted line to write
write_output
(self, variant_calls)Give a list of methylation call dicts, output formatted line
Params:
- methylation_lines (list): list of dict containing methylation call information
bsbolt.Variant.CallVariant
(error_rates=array([0.001]), conversion_rate=0.995, lower_methylation_rate=0.01, upper_methylation_rate=0.9)call_variant
(self, base_calls, priors=array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]), cg_site=False)