Methylation Calling

class bsbolt.CallMethylation.CallMethylationValues(input_file=None, genome_database=None, ignore_overlap=True, remove_ccgg=False, ignore_orphans=True, max_read_depth=8000, contig=None, min_base_quality=10, return_queue=None, cg_only=False, min_mapping_quality=10)

Params:

  • input_file (str): str to input bam/sam file
  • genome_database (str): str to genome directory
  • ignore_overlap (bool): ignore overlapping signal, calls methylation using the higher quality base for reads with overlapping alignments, [True]
  • remove_ccgg (bool): don't call CCGG sequences, [False]
  • ignore_oprhans (bool): ignore orphaned reads (not properly paired), [True]
  • max_read_depth (int): maximum read depth for pileup, [8000]
  • contig (str): contig to process
  • min_base_quality (int): minimum quality for a base to be reported for methylation calling, [10]
  • return_queue (Queue.queue): results are added to a queue in batches for multi-threaded access
  • cg_only (bool): only return CG sites to queue, [False]
  • min_mapping_quality (int): minimum mapping quality for an alignment to be used for methylation calling, [10]

Attributes:

  • self.chunk_size (int): Number of sites in chunk added to queue, [10000]
call_contig(self, chrom_seq)

Iterates through bam pileup, calling methylation values if the reference nucleotide is a C or G. Pileup reads are buffered and accessed as needed. Process sites are appended to list and returned in chunks.

Reads flagged as:

  • pcr duplicates (1024), read unmapped (4), read fails platform/vendor quality checks (512) are ignored
call_methylation(self)

Run methylation call for contig

check_ccgg(self, sequence)

checks if sequence is == to CCGG

get_context(self, nucleotide, fivemer)

Params:

  • nucleotide (str): 1 nucleotide
  • fivemer (str): 5 nucleotides

Returns:

  • context (str): methylation context
  • subcontext (str): methylation subcontext
get_context_tables

Returns:

  • context_tables (dict): base methylation context relative to sense strand
get_methylation_call(nucleotide, base_counts)

Methylation for a C relative to the sense strand of the reference can only be called using watson reads, and G with crick reads

Param:

  • nucleotide (str): reference nucleotide
  • base_counts (collections.Counter): watson nucleotides are Uppercase and crick nucleotides lowercase

Returns:

  • meth_line (dict): formatted methylation calls for output
get_reference_sequence(path)

load serialized reference file from path

class bsbolt.CallMethylation.CallMethylationVector(input_file=None, genome_database=None, contig=None, min_base_quality=0, min_mapping_quality=0, return_queue=None, cg_only=False, start=None, end=None, filter_duplicates=True)

Experimental:

return vector of methylated sites

call_contig(self, chrom_seq)

Iterates through bam reads call methylation along vectors. When a read overlaps a site where methylation is called the site with a higher quality is taken. If overlapping sites with the same quality are observed the first observed site is reported. If an overlapping site is reported as a mismatch only the site with a methylation call is reported (this should be extremely rare but is observed in the test cases)

call_methylation(self)
call_vector(self, aligned_read, positions, reference_nuc)
clean_overlap(methylation_calls)
get_mate_flags
get_methylation_call(reference_nucleotide, base_call)

Methylation for a C relative to the sense strand of the reference can only be called using watson reads, and G with crick reads Arguments nucleotide (str): reference nucleotide base_call (collections.Counter): watson nucleotides are Uppercase and crick nucleotides lowercase Returns: methylation call dictionary

get_reference_sequence(path)

load serialized reference file from path

process_methylation_vector(self, aligned_read, methylation_calls, strand, methylation_vectors)
class bsbolt.CallMethylation.ProcessContigs(input_file=None, genome_database=None, output_prefix=None, ignore_overlap=True, text_output=False, remove_ccgg=False, min_read_depth=10, max_read_depth=8000, threads=1, verbose=True, min_base_quality=10, min_mapping_quality=10, ATCGmap=False, cg_only=True, ignore_orphans=False, bedgraph_output=False)

Multi-threaded contig processing wrapper. Passes thread safe queue to workers and outputs values as CGmap file.

Params:

  • input_file (str): str to input bam/sam file
  • genome_database (str): str to genome directory
  • output_prefix (str): output prefix for CGmap
  • ignore_overlap (bool): ignore overlapping reads, [True]
  • text_output (bool): output compressed or plain text files, [False]
  • remove_ccgg (bool): don't call CCGG sequences, [False]
  • min_read_depth (int): default = minimum read depth to call methylation values, [1]
  • max_read_depth (int): maximum read depth for pileup, [8000]
  • threads (int): , if one watcher and processing on same thread else separated, [1]
  • min_base_quality (int): minimum base quality for base to be considered, [10]
  • min_mapping_quality (int): minimum mapping quality for an alignment to be considered, [10]
  • verbose (bool): Verbose processing, [False]
  • cg_only (bool): only return CG sites to queue, [False]
  • ignore_oprhans (bool): ignore orphaned reads (not properly paired), [True]
  • bedgraph_output (bool): output bedgraph format inplace of CGmap, [False]

Usage:

python process_values = ProcessContigs(*kwargs) process_values.process_contigs()

collect_stats(self, meth_line)

Collect global methylation statistics

format_atcg(meth_line)

ATCGmap line formatting

format_bedgraph(meth_line)

Bedgraph line formatting

format_cgmap(meth_line)

CGmap line formatting

get_contigs

get list of contigs in input file, threads set across contigs

get_output_objects
methylation_process_error(self, error)

Raise if exception thrown in methylation calling process

print_stats(self)

Print global methylation statistics

process_contigs(self)

Launches a processing pool to call methylation values across the input file contigs

unpack_meth_line(meth_line)
watch_pool(self)

Watch self.return_dict and process return methylation values. Contigs are processed in order so buffer can become large if first contig is large, ie. Human Chr1

write_line(self, output_object, line)

Outputs line, and encodes if necessary

Params:

  • output_object (TextIO/GZipIO): output object
  • line (str): formatted line to write
write_output(self, methylation_lines)

Give a list of methylation call dicts, output formatted line

Params:

  • methylation_lines (list): list of dict containing methylation call information