Methylation Calling
bsbolt.CallMethylation.CallMethylationValues
(input_file=None, genome_database=None, ignore_overlap=True, remove_ccgg=False, ignore_orphans=True, max_read_depth=8000, contig=None, min_base_quality=10, return_queue=None, cg_only=False, min_mapping_quality=10)Params:
- input_file (str): str to input bam/sam file
- genome_database (str): str to genome directory
- ignore_overlap (bool): ignore overlapping signal, calls methylation using the higher quality base for reads with overlapping alignments, [True]
- remove_ccgg (bool): don't call CCGG sequences, [False]
- ignore_oprhans (bool): ignore orphaned reads (not properly paired), [True]
- max_read_depth (int): maximum read depth for pileup, [8000]
- contig (str): contig to process
- min_base_quality (int): minimum quality for a base to be reported for methylation calling, [10]
- return_queue (Queue.queue): results are added to a queue in batches for multi-threaded access
- cg_only (bool): only return CG sites to queue, [False]
- min_mapping_quality (int): minimum mapping quality for an alignment to be used for methylation calling, [10]
Attributes:
- self.chunk_size (int): Number of sites in chunk added to queue, [10000]
call_contig
(self, chrom_seq)Iterates through bam pileup, calling methylation values if the reference nucleotide is a C or G. Pileup reads are buffered and accessed as needed. Process sites are appended to list and returned in chunks.
Reads flagged as:
- pcr duplicates (1024), read unmapped (4), read fails platform/vendor quality checks (512) are ignored
call_methylation
(self)Run methylation call for contig
check_ccgg
(self, sequence)checks if sequence is == to CCGG
get_context
(self, nucleotide, fivemer)Params:
- nucleotide (str): 1 nucleotide
- fivemer (str): 5 nucleotides
Returns:
- context (str): methylation context
- subcontext (str): methylation subcontext
get_context_tables
Returns:
- context_tables (dict): base methylation context relative to sense strand
get_methylation_call
(nucleotide, base_counts)Methylation for a C relative to the sense strand of the reference can only be called using watson reads, and G with crick reads
Param:
- nucleotide (str): reference nucleotide
- base_counts (collections.Counter): watson nucleotides are Uppercase and crick nucleotides lowercase
Returns:
- meth_line (dict): formatted methylation calls for output
get_reference_sequence
(path)load serialized reference file from path
bsbolt.CallMethylation.CallMethylationVector
(input_file=None, genome_database=None, contig=None, min_base_quality=0, min_mapping_quality=0, return_queue=None, cg_only=False, start=None, end=None, filter_duplicates=True)Experimental:
return vector of methylated sites
call_contig
(self, chrom_seq)Iterates through bam reads call methylation along vectors. When a read overlaps a site where methylation is called the site with a higher quality is taken. If overlapping sites with the same quality are observed the first observed site is reported. If an overlapping site is reported as a mismatch only the site with a methylation call is reported (this should be extremely rare but is observed in the test cases)
call_methylation
(self)call_vector
(self, aligned_read, positions, reference_nuc)clean_overlap
(methylation_calls)get_mate_flags
get_methylation_call
(reference_nucleotide, base_call)Methylation for a C relative to the sense strand of the reference can only be called using watson reads, and G with crick reads Arguments nucleotide (str): reference nucleotide base_call (collections.Counter): watson nucleotides are Uppercase and crick nucleotides lowercase Returns: methylation call dictionary
get_reference_sequence
(path)load serialized reference file from path
process_methylation_vector
(self, aligned_read, methylation_calls, strand, methylation_vectors)bsbolt.CallMethylation.ProcessContigs
(input_file=None, genome_database=None, output_prefix=None, ignore_overlap=True, text_output=False, remove_ccgg=False, min_read_depth=10, max_read_depth=8000, threads=1, verbose=True, min_base_quality=10, min_mapping_quality=10, ATCGmap=False, cg_only=True, ignore_orphans=False, bedgraph_output=False)Multi-threaded contig processing wrapper. Passes thread safe queue to workers and outputs values as CGmap file.
Params:
- input_file (str): str to input bam/sam file
- genome_database (str): str to genome directory
- output_prefix (str): output prefix for CGmap
- ignore_overlap (bool): ignore overlapping reads, [True]
- text_output (bool): output compressed or plain text files, [False]
- remove_ccgg (bool): don't call CCGG sequences, [False]
- min_read_depth (int): default = minimum read depth to call methylation values, [1]
- max_read_depth (int): maximum read depth for pileup, [8000]
- threads (int): , if one watcher and processing on same thread else separated, [1]
- min_base_quality (int): minimum base quality for base to be considered, [10]
- min_mapping_quality (int): minimum mapping quality for an alignment to be considered, [10]
- verbose (bool): Verbose processing, [False]
- cg_only (bool): only return CG sites to queue, [False]
- ignore_oprhans (bool): ignore orphaned reads (not properly paired), [True]
- bedgraph_output (bool): output bedgraph format inplace of CGmap, [False]
Usage:
python
process_values = ProcessContigs(*kwargs)
process_values.process_contigs()
collect_stats
(self, meth_line)Collect global methylation statistics
format_atcg
(meth_line)ATCGmap line formatting
format_bedgraph
(meth_line)Bedgraph line formatting
format_cgmap
(meth_line)CGmap line formatting
get_contigs
get list of contigs in input file, threads set across contigs
get_output_objects
methylation_process_error
(self, error)Raise if exception thrown in methylation calling process
print_stats
(self)Print global methylation statistics
process_contigs
(self)Launches a processing pool to call methylation values across the input file contigs
unpack_meth_line
(meth_line)watch_pool
(self)Watch self.return_dict and process return methylation values. Contigs are processed in order so buffer can become large if first contig is large, ie. Human Chr1
write_line
(self, output_object, line)Outputs line, and encodes if necessary
Params:
- output_object (TextIO/GZipIO): output object
- line (str): formatted line to write
write_output
(self, methylation_lines)Give a list of methylation call dicts, output formatted line
Params:
- methylation_lines (list): list of dict containing methylation call information