Simulation

class bsbolt.Simulate.SimulateMethylatedReads(reference_file=None, sim_output=None, sequencing_error=0.005, mutation_rate=0.001, mutation_indel_fraction=0.15, indel_extension_probability=0.15, random_seed=-1, paired_end=False, read_length=100, read_depth=20, undirectional=False, methylation_reference=None, cgmap=None, ambiguous_base_cutoff=0.05, haplotype_mode=False, pe_fragment_size=400, insert_deviation=25, mean_insert_size=100, collect_ch_sites=True, collect_sim_stats=False, verbose=True, overwrite_db=False)

Bisulifite read simulation class. The class works as follows:

  1. WGSIM (forked and modified version) is called to simulate paired end Illumina reads
    • If run in single end mode, the number of read simulated is double to get desired coverage and the second read isn't process as a bisulfite read or output.
  2. Methylation values are set for all methylatable bases (Cytosine and Guanine relative to the reference)
    • Values can be set randomly or taken from a reference file (bsbolt simulation database or CGmap file)
  3. Reads are bisulfite converted and output

Params:

  • reference_file (str): path to reference fasta file.
  • sim_output (str): output path
  • sequencing_error (float): simulated sequencing error rate, [0.005]
  • mutation_rate (float): simulated mutation error rate, [0.0010]
  • mutation_indel_fraction (float): fraction of mutations that are INDELs, [0.15]
  • indel_extension_probability (float): probability INDEL length will be extended, [0.15]
  • random_seed (int): random seed for mutation and sequencing error generation, [-1]
  • paired_end (bool): simulate paired end bisulfite sequencing data. [False]
  • read_length (int): length of simulated reads, [100]
  • read_depth (int): average read depth over simulated contigs, [20]
  • undirectional (bool): simulate undirectional (PCR product of Watson and Crick strands), [False]
  • methylation_reference (str): path to previously generated bsbolt reference directory
  • cgmap (str): path to CGmap file to use as methylation reference
  • ambiguous_base_cutoff (float): reference segments where the proportion of ambiguous bases, - or N, greater than threshold will be skipped, [0.05]
  • haplotype_mode (bool): simulate only homozygous variants, [False]
  • pe_fragment_size (int): maximum fragment size, [400]
  • insert_deviation (int): standard deviation of simulated insert sizes, [25]
  • mean_insert_size (int): mean insert size, [100]
  • collect_ch_sites (bool): simulated and collect CH methylation sites, [True]
  • collect_sim_stats (bool): output simulated bases for all collected methylatable bases, [False]
  • verbose (bool): verbose output, [True]
  • overwrite_db (bool): overwrite previously generated bsbolt simulated database, [False]

Usage:

simulation = SimulateMethylatedReads(**kwargs)
simulation.run_simulation()
get_methylation_reference(self, contig, variant_data=False)

Set variant methylation if variant data provided and return methylation profile else return methylation profile

Params:

  • contig (str): contig id
  • variant_data (dict): simulated variant information

Returns:

  • contig_profile (Dict[str, float]): methylation reference values
get_output_objects

Return io object for fastq writing

handle_match(self, seq_base, ref_base, pos)
output_reference(self)

Output reference methylation values used for simulation

output_sim_reads(self, sim_data, sub_base, ref_strand)

Write simulated bisulfite reads

process_read_group(self, sim_data, stranded_capture=False)

Set read methylation values, randomly assign reads to Watson or Crick strand

random_roll(proportion_positive=0.1)

Return true is random < proportion_positive, used to set individual read methylation

run_simulation(self)

Simulated bisulfite sequencing reads

set_base_methylation(self, methyl_position)

Perform random roll to set methylated bases. Update reference values if self.collect_sim_stats=True

set_read_methylation(self, read, sub_base='C')

Set methylation according to sim value, variants can be methylation but not sequencing errors

set_variant_methylation(self, meth_pos, seq_base, variant_type='X')
simulate_methylated_reads(self)

Read processing loop