Simulation
class
bsbolt.Simulate.SimulateMethylatedReads
(reference_file=None, sim_output=None, sequencing_error=0.005, mutation_rate=0.001, mutation_indel_fraction=0.15, indel_extension_probability=0.15, random_seed=-1, paired_end=False, read_length=100, read_depth=20, undirectional=False, methylation_reference=None, cgmap=None, ambiguous_base_cutoff=0.05, haplotype_mode=False, pe_fragment_size=400, insert_deviation=25, mean_insert_size=100, collect_ch_sites=True, collect_sim_stats=False, verbose=True, overwrite_db=False)Bisulifite read simulation class. The class works as follows:
- WGSIM (forked and modified version) is called to simulate paired end Illumina reads
- If run in single end mode, the number of read simulated is double to get desired coverage and the second read isn't process as a bisulfite read or output.
- Methylation values are set for all methylatable bases (Cytosine and Guanine relative to the reference)
- Values can be set randomly or taken from a reference file (bsbolt simulation database or CGmap file)
- Reads are bisulfite converted and output
Params:
- reference_file (str): path to reference fasta file.
- sim_output (str): output path
- sequencing_error (float): simulated sequencing error rate, [0.005]
- mutation_rate (float): simulated mutation error rate, [0.0010]
- mutation_indel_fraction (float): fraction of mutations that are INDELs, [0.15]
- indel_extension_probability (float): probability INDEL length will be extended, [0.15]
- random_seed (int): random seed for mutation and sequencing error generation, [-1]
- paired_end (bool): simulate paired end bisulfite sequencing data. [False]
- read_length (int): length of simulated reads, [100]
- read_depth (int): average read depth over simulated contigs, [20]
- undirectional (bool): simulate undirectional (PCR product of Watson and Crick strands), [False]
- methylation_reference (str): path to previously generated bsbolt reference directory
- cgmap (str): path to CGmap file to use as methylation reference
- ambiguous_base_cutoff (float): reference segments where the proportion of ambiguous bases, - or N, greater than threshold will be skipped, [0.05]
- haplotype_mode (bool): simulate only homozygous variants, [False]
- pe_fragment_size (int): maximum fragment size, [400]
- insert_deviation (int): standard deviation of simulated insert sizes, [25]
- mean_insert_size (int): mean insert size, [100]
- collect_ch_sites (bool): simulated and collect CH methylation sites, [True]
- collect_sim_stats (bool): output simulated bases for all collected methylatable bases, [False]
- verbose (bool): verbose output, [True]
- overwrite_db (bool): overwrite previously generated bsbolt simulated database, [False]
Usage:
simulation = SimulateMethylatedReads(**kwargs)
simulation.run_simulation()
get_methylation_reference
(self, contig, variant_data=False)Set variant methylation if variant data provided and return methylation profile else return methylation profile
Params:
- contig (str): contig id
- variant_data (dict): simulated variant information
Returns:
- contig_profile (Dict[str, float]): methylation reference values
get_output_objects
Return io object for fastq writing
handle_match
(self, seq_base, ref_base, pos)output_reference
(self)Output reference methylation values used for simulation
output_sim_reads
(self, sim_data, sub_base, ref_strand)Write simulated bisulfite reads
process_read_group
(self, sim_data, stranded_capture=False)Set read methylation values, randomly assign reads to Watson or Crick strand
random_roll
(proportion_positive=0.1)Return true is random < proportion_positive, used to set individual read methylation
run_simulation
(self)Simulated bisulfite sequencing reads
set_base_methylation
(self, methyl_position)Perform random roll to set methylated bases. Update reference values if self.collect_sim_stats=True
set_read_methylation
(self, read, sub_base='C')Set methylation according to sim value, variants can be methylation but not sequencing errors
set_variant_methylation
(self, meth_pos, seq_base, variant_type='X')simulate_methylated_reads
(self)Read processing loop