Loading Data for momi3

The only required user specified data for any function in momi3 is the number of haploid samples in each population and the observed frequency spectrum.

Tskit Tree Sequence Data

One can simulate tree sequence data with tskit package.


  demo = msp.Demography()
  demo.add_population(initial_size=5000, name="anc")
  demo.add_population(initial_size=5000, name="P0")
  demo.add_population(initial_size=5000, name="P1")
  demo.set_symmetric_migration_rate(populations=("P0", "P1"), rate=0.0001)
  demo.add_population_split(time=1000, derived=["P0", "P1"], ancestral="anc")
  
  sample_size = 10 # we simulate 10 diploids
  samples = {"P0": sample_size, "P1": sample_size}
  ts = msp.sim_mutations(
      msp.sim_ancestry(
          samples=samples, demography=demo,
          recombination_rate=1e-8, sequence_length=1e8, random_seed=12
      ),
      rate=1e-8, random_seed=13
  )
  
  # msprime simulates diploids so we need the multiplier of 2 to count the number of haploids
  afs_samples = {"P0": sample_size * 2, "P1": sample_size * 2} 
  afs = ts.allele_frequency_spectrum(
      sample_sets=[ts.samples([1]), ts.samples([2])],
      span_normalise=False,
      polarised=True
  )

For any tskit tree sequence object, one can call on allele_frequency_spectrum to retrieve the observed frequency spectrum afs. In momi3, the number of haploids (afs_samples) and observed frequency spectrum (afs) are all the data necessary to run all of its core functions. Please note that setting polarised=False results in tskit returning a folded SFS that is in “lower triangular” form. This form is not compatabile with momi3. To create a folded SFS compatable with momi3, please see section below and use the function joint_sfs_from_vcz with fold=True.

VCZ and VCF files

Suppose we use the previous example and construct a VCF file to use.

vcf_path = "./example.vcf"
with open(vcf_path, "w") as vcf_file:
    ts.write_vcf(vcf_file)

To obtain the site frequency spectrum. One can call on joint_sfs_from_vcz using a VCZ file path or dataset. Note that all non-segregating sites will be ignored when constructing the site frequency spectrum.

from demestats.fit.util import joint_sfs_from_vcz
import sgkit as sg
import bio2zarr.vcf as v2z

# If you already have a vcz file or a vcz dataset you may skip this
vcz_path = "./example.vcz"
v2z.convert([vcf_path], vcz_path)

# you have to specify the index of which haplotypes you wish to use for construct the SFS
# you are picking out the columns/samples of vcf/vcz file you wish to include
pops = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],   # population 1
    [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],   # population 2
]

# you can either provide the path to the vcz or pass in a vcz dataset directly
# e.g. ds = sg.load_dataset(vcz_path) and pass in ds instead of vcz_path
unfolded_jsfs = joint_sfs_from_vcz(vcz_path, pops, ploidy = 2, fold=False) 
# to get folded sfs, set fold=True
# ploidy = 2 because we are working with diploids