Tutorials

There are two main ways of accessing the resources of the stdpopsim package that will be detailed in this tutorial. The first is via the command line interface (CLI). This is useful if you want to do a straightforward run of the models in the stdpopsim catalog. The other way to access the stdpopsim resources is via the python API. This is a bit more complicated, but allows for more advanced tasks. This tutorial will walk through both ways of using the stdpopsim package as well as a few examples of producing and processing output.

To simulate genomes using stdpopsim, we need to make several choices about what will be simulated:

  1. which species

  2. which contig (i.e., how much of what chromosome)

  3. which genetic map

  4. which demographic model

  5. how many samples from each population of the model

  6. which simulation engine to use

These choices are nested: the species determines what contigs, genetic maps, and demographic models are available, and the demographic model determines how many populations can be sampled from. Currently, the two choices of simulation engine are msprime and SLiM. Below are examples of making these choices in various contexts, using both CLI and python interfaces.

Running stdpopsim with the command-line interface (CLI)

In order to use the stdpopsim CLI the stdpopsim package must be installed (see Installation). The CLI provides access to the catalog of models that have already been implemented by stdpopsim.

A first simulation

As a first step, we’ll use the CLI built-in help to build up to a realistic coalescent simulation of some copies of human chromosome 22 with the HapMap genetic map, and a published demographic model.

Choose a species and a contig

The first step for using the CLI is to select the species that you are interested in simulating data for. In order to see which species are available run

$ stdpopsim --help
usage: stdpopsim [-h] [-V] [-v | -q] [-c CACHE_DIR] [-e {msprime,slim}]
                 [--msprime-model {hudson,dtwf,smc,smc_prime}]
                 [--msprime-change-model T MODEL] [--slim-path PATH]
                 [--slim-script] [--slim-scaling-factor Q] [--slim-burn-in X]
                 {AraTha,CanFam,DroMel,EscCol,HomSap,PonAbe,download-genetic-maps}
                 ...

Command line interface for stdpopsim.

positional arguments:
  {AraTha,CanFam,DroMel,EscCol,HomSap,PonAbe,download-genetic-maps}
    AraTha              Run simulations for Arabidopsis thaliana.
    CanFam              Run simulations for Canis familiaris.
    DroMel              Run simulations for Drosophila melanogaster.
    EscCol              Run simulations for Escherichia coli.
    HomSap              Run simulations for Homo sapiens.
    PonAbe              Run simulations for Pongo abelii.
    download-genetic-maps
                        Download genetic maps

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -v, --verbose         Increase logging verbosity (can use be used multiple
                        times).
  -q, --quiet           Do not write any non-essential messages
  -c CACHE_DIR, --cache-dir CACHE_DIR
                        Set the cache directory to the specified value. Note
                        that this can also be set using the environment
                        variable STDPOPSIM_CACHE. If both the environment
                        variable and this option are set, the option takes
                        precedence. Default: /home/docs/.cache/stdpopsim
  -e {msprime,slim}, --engine {msprime,slim}
                        Specify a simulation engine.

msprime specific parameters:
  --msprime-model {hudson,dtwf,smc,smc_prime}
                        Specify the simulation model used by msprime. See
                        msprime API documentation for details.
  --msprime-change-model T MODEL
                        Change to the specified simulation MODEL at generation
                        T. This option may provided multiple times.

SLiM specific parameters:
  --slim-path PATH      Full path to `slim' executable.
  --slim-script         Write script to stdout and exit without running SLiM.
  --slim-scaling-factor Q
                        Rescale model parameters by Q to speed up simulation.
                        See SLiM manual: `5.5 Rescaling population sizes to
                        improve simulation performance`. [default=1].
  --slim-burn-in X      Length of the burn-in phase, in units of N generations
                        [default=10].

This shows the species currently supported by stdpopsim. This means that stdpopsim knows various traits of these species including chromosome size and recombination rates. Once we’ve selected a species, in this case humans, we can look at the help again as follows.

$ stdpopsim HomSap --help
usage: stdpopsim HomSap [-h] [--help-models [HELP_MODELS]] [-b BIBTEX_FILE]
                        [--help-genetic-maps [HELP_GENETIC_MAPS]] [-D] [-g]
                        [-c] [-l LENGTH_MULTIPLIER] [-s SEED] [-d] [-o OUTPUT]
                        samples [samples ...]

Run simulations for Homo sapiens using up-to-date genome information,
genetic maps and simulation models from the literature. NOTE: By
default, the tskit '.trees' binary file is written to stdout,so you
should either redirect this to a file or use the '--output' option to
specify a filename.

Default population parameters for Homo sapiens:
Generation time: 30
Population size: 10000
Mutation rate: 1.2899999999999998e-08
Recombination rate: 1.2051790226570845e-08

positional arguments:
  samples               The number of samples to draw from each population. At
                        least two samples must be specified. The number of
...

For conciseness we do not show all the output here but this time you should see a different output which shows options for performing the simulation itself and the species default parameters. This includes selecting the demographic model, chromosome, recombination map, and number of samples.

The most basic simulation we can run is to simulate two (haploid) genomes - i.e., two samples - using the species’ defaults as seen in the species help (stdpopsim HomSap --help). These defaults include constant size population and a uniform recombination map. To save time we will specify that the simulation use chromosome 22, using the -c option. We also specify that the resulting tree-sequence formatted output should be written to the file foo.ts with the -o option. For more information on how to use tree-sequence files see tskit .

$ stdpopsim HomSap -c chr22 -o foo.ts 2

Warning

It’s important to remember to either redirect the output of stdpopsim to file or to use the -o/--output option. If you do not, the binary output may mess up your terminal session.

Choose a model and a sampling scheme

Next, suppose we want to use a specific demographic model. We look up the available models using the --help-models flag (here, truncated for space):

$ stdpopsim HomSap --help-models

All simulation models for Homo sapiens

OutOfAfrica_3G09: Three population out-of-Africa
     The three population Out-of-Africa model from Gutenkunst et al.
    2009. It describes the ancestral human population in Africa, the
    out of Africa event, and the subsequent European-Asian population
    split. Model parameters are the maximum likelihood values of the
    various parameters given in Table 1 of Gutenkunst et al.

    Populations:
        YRI: 1000 Genomes YRI (Yorubans)
        CEU: 1000 Genomes CEU (Utah Residents (CEPH) with Northern and Western European Ancestry
        CHB: 1000 Genomes CHB (Han Chinese in Beijing, China)

OutOfAfrica_2T12: Two population out-of-Africa
     The model is derived from the Tennesen et al. analysis of the
    jSFS from European Americans and African Americans. It describes
    the ancestral human population in Africa, the out of Africa event,
    and two distinct periods of subsequent European population growth
    over the past 23kya. Model parameters are taken from Fig. S5 in Fu
    et al.

    Populations:
        AFR: African Americans
        EUR: European Americans

Africa_1T12: African population
     The model is a simplification of the two population Tennesen et
    al. model with the European-American population removed so that we
...

This gives all of the possible demographic models we could simulate. We choose the two population out-of-Africa model from Tennesen et al. (2012) . By looking at the model help we find that the name for this model is OutOfAfrica_2T12 and that we can specify it using the --demographic-model or -d option. We choose to draw two samples from the “African American” population and three samples from the “European American” population. To increase simulation speed we can also choose to simulate a sequence that is a fraction of the length of the specified chromosome using the -l option (e.g. 5%). This is just specifying a sequence length, not actually selecting a subset of the chromosome to sequence and as such cannot be used with anything other than a uniform recombination map. The command now looks like this:

$ stdpopsim HomSap -c chr22 -l 0.05 -o foo.ts -d OutOfAfrica_2T12 2 3

Note that the number of samples from each population are simply specified as two numbers at the end of the command. There must be two numbers because the model has two populations that we can sample from The order of those numbers is the same as the order specified in the model documentation. In this case, 2 3 means we are simulating two African American samples and three European American samples.

Now we want to add an empirical recombination map to make the simulation more realistic. We can look up the available recombination maps using the --help-genetic-maps flag (here, truncated for space):

$ stdpopsim HomSap --help-genetic-maps

All genetic maps for Homo sapiens

HapMapII_GRCh37
     This genetic map is from the Phase II Hapmap project and based on
    3.1 million genotyped SNPs from 270 individuals across four
    populations (YRI, CEU, CHB and JPT). Genome wide recombination
    rates were estimated using LDHat. This version of the HapMap
    genetic map was lifted over to GRCh37 (and adjusted in regions
    where the genome assembly had rearranged) for use in the 1000
    Genomes project. Please see the README file on the 1000 Genomes
    download site for details of these adjustments.

DeCodeSexAveraged_GRCh36
     This genetic map is from the deCode study of recombination events
...

In this case we choose the HapMapII_GRCh37 map. Empirical recombination maps cannot be used with length multipliers so we have to remove the -l option. (NOTE: this may a minute or so to run).

$ stdpopsim HomSap -g HapMapII_GRCh37 -c chr22 -o foo.ts -d OutOfAfrica_2T12 2 3

For reproducibility we can also choose set the seed for the simulator using the -s flag.

$ stdpopsim HomSap -s 1046 -g HapMapII_GRCh37 -c chr22 -o foo.ts -d OutOfAfrica_2T12 2 3

On running these commands, the CLI also outputs the relevant citations for both the simulator used and the resources used for simulation scenario.

Convert output to VCF

The output from a stdpopsim simulation is a tree sequence, a compact and efficient format for storing both genealogies and genome sequence. Some examples of analyzing tree sequences are given below. If desired, these can be converted to VCF on the command line if the tskit package is installed, with the tskit vcf command:

$ tskit vcf foo.ts > foo.vcf

For this small example (only five samples), the file sizes are similar, but the tree sequence is slightly larger (it does carry a good bit more information about the trees, after all). However, if we up the sample sizes to 2000 and 3000 (the simulation is still pretty quick) the tree sequence is twenty-three times smaller:

$ stdpopsim HomSap -s 1046 -g HapMapII_GRCh37 -c chr22 -o foo.ts -d OutOfAfrica_2T12 2000 3000
$ tskit vcf foo.ts > foo.vcf
$ ls -lth foo.*
-rw-r--r-- 1 peter peter 3139M Apr  3 10:40 foo.vcf
-rw-r--r-- 1 peter peter  136M Apr  3 10:39 foo.ts

Zipping the files (using the tszip package) reduces this difference quite a lot, but increases time required for processing:

$ tskit vcf foo.ts | gzip -c > foo.vcf.gz
$ tszip foo.ts
$ ls -lth foo.*
-rw-r--r-- 1 peter peter  31M Apr  3 10:51 foo.ts.tsz
-rw-r--r-- 1 peter peter  72M Apr  3 10:50 foo.vcf.gz

Using the SLiM simulation engine

The default “simulation engine” - i.e., the program that actually does the simulating - is msprime, a coalescent simulator. However, it is also possible to swap this out for SLiM, a forwards-time, individual-based simulator.

Specifying the engine

Using SLiM is as easy as passing the --engine/-e flag (we didn’t do this above, so it used the default engine, msprime). For instance, to use SLiM to simulate the same 5% chunk of chromosome 22 under the OutOfAfrica_2T12 model as above, we would just run:

$ stdpopsim -e slim HomSap -c chr22 -l 0.05 -o foo.ts -d OutOfAfrica_2T12 2 4

Here we’ve changed the sample sizes to be even: SLiM simulates diploid individuals, but sample sizes are in numbers of chromosomes, so if you ask for an odd number, it will be rounded up to an even number. But: this simulation can take quite a while to run, so before you try that command out, read on!

The scaling factor

However, even with only 5% of a chromosome, that is a pretty big simulation, due to the large number of individuals (unlike msprime, SLiM must actually simulate all the individuals in the population even just to get a few samples). To make it run fast enough for a tutorial, we can specify a scaling factor, described in more detail below (see The scaling factor), using the --slim-scaling-factor option. Unlike the previous command, this one should run very fast:

$ stdpopsim -e slim --slim-scaling-factor 10 HomSap \
$    -c chr22 -l 0.05 -o foo.ts -d OutOfAfrica_2T12 2 4

(Indeed, this example runs in less than a minute, but without setting the scaling factor, leaving at its default of 1.0, it takes on the same computer about 20 minutes.) Briefly, what this does is reduces all the population sizes by a “scaling factor” (here set to 10), and rescales time by the same factor, thus increasing mutation, recombination, and population growth rates. A model with selection would need to have selection coefficients multiplied by the factor as well. This results in a model that is equivalent in many senses - the same rate of genetic drift, the same expected decay of linkage disequilibrium - but generally runs much faster because there are fewer individuals to keep track of. In practice, rescaling seems to produce indistinguishable results in much shorter times at many parameter values. However, the user should be aware that in principle, the results are not equivalent, possibly in subtle and hard-to-understand ways. This is particularly true in simulations with large amounts of selection. See the SLiM manual and/or Urrichio & Hernandez (2014) for more discussion.

Debugging output from SLiM

Next we’ll look at running a different model with SLiM, but getting some sanity check output along the way.

Choose a species: Drosophila melanogaster

Perusing the Catalog <sec_catalog>, we see that to simulate copies of chromosome arm 2L from Drosophila melanogaster individuals with the demographic model inferred by Sheehan & Song (2016), using SLiM with a (very large!) scaling factor of 1000, we could run

$ stdpopsim -e slim --slim-scaling-factor 1000 DroMel \
$     -c chr2L -l 0.05 -o foo.ts -d African3Epoch_1S16 100

The scaling factor of 1000 makes this model run very quickly, but should also make you very nervous. What actually is being simulated here? We can at least find out what the actual population sizes are in the SLiM simulation by asking the simulation to be more verbose. Prepending the -v flag will request that SLiM print out information every time a demographic event occurs (helpfully, this also gives us an idea of how quickly the simulation is going): This also outputs a fair bit of other debugging output, which can be turned off with --quiet:

$ stdpopsim -v -e slim --slim-scaling-factor 1000 DroMel \
$     -c chr2L -l 0.05 -o foo.ts -d African3Epoch_1S16 100 --quiet

Trimming down the output somewhat, we get:

2020-04-03 22:05:52,160 [1098] DEBUG stdpopsim.species: Making flat chromosome 0.05 * chr2L
2020-04-03 22:05:52,161 [1098] INFO stdpopsim.cli: Running simulation model African3Epoch_1S16 for DroMel on Contig(length=1.2E+06, recombination_rate=8.4E-09, mutation_rate=5.5E-09, genetic_map=None) with 100 samples using slim.

1: sim.addSubpop(0, 652);
1: Starting burn-in...
6521: {dbg(self.source); p0.setSubpopulationSize(145);}
8521: {dbg(self.source); p0.setSubpopulationSize(544);}
8721: {dbg(self.source); inds=p0.sampleIndividuals(50); sim.treeSeqRememberIndividuals(inds);}
8721: {dbg(self.source); end();}

This tells us that after rescaling by a factor of 1000, the population sizes in the three epochs are 652, 145, and 544 individuals, respectively. No wonder it runs so quickly! At the end, fifty (diploid) individuals are sampled, to get us our requested 100 genomes. These numbers are not obviously completely wrong, as would be for instance if we had population sizes of 1 or 2 individuals. However, extensive testing would need to be done to find out if data produced with such an extreme scaling factor actually resembles the data that would be produced without rescaling.

Running stdpopsim with the Python interface (API)

Nearly all the functionality of stdpopsim is available through the CLI, but for complex situations it may be desirable to use python. Furthermore, downstream analysis may happen in python, using the tskit tools for working with tree sequences. In order to use the stdpopsim API the stdpopsim package must be installed (see Installation).

Running a published model

The first example uses a mostly default genome with a published demographic model.

Pick a species and a contig

First, we will pick a species (humans), contig (chromosome 22), and genetic map (a flat map, the default):

import stdpopsim
species = stdpopsim.get_species("HomSap")
contig = species.get_contig("chr22") # default is a flat genetic map

Choose a demographic model

Now, we need a demographic model. In stdpopsim there are two types of model: ones taken to match the demographic history reported in published papers, and “generic” models. First, we’ll simulate samples from a published model of human prehistory. Let’s see what demographic models are available for humans:

for x in species.demographic_models:
   print(x.id)

# OutOfAfrica_3G09
# OutOfAfrica_2T12
# Africa_1T12
# AmericanAdmixture_4B11
# OutOfAfricaArchaicAdmixture_5R19
# Zigzag_1S14
# AncientEurasia_9K19
# PapuansOutOfAfrica_10J19

These models are described in detail in the Catalog. We’ll look at the first model, “OutOfAfrica_3G09”, from Gutenkunst et al (2009). First we need to know how many populations it has, and what they are:

model = species.get_demographic_model('OutOfAfrica_3G09')
print(model.num_populations)
# 3
print(model.num_sampling_populations)
# 3
print([pop.id for pop in model.populations])
# ['YRI', 'CEU', 'CHB']

This model has 3 populations, named YRI, CEU, CHB, and all three can be sampled from. The number of “sampling” populations could be smaller than the number of populations, since some models have ancient populations which are currently not allowed to be sampled from - but that is not the case in this model.

Choose a sampling scheme and simulate

The final ingredient we need before simulating is a specification of the number of samples from each population. We’ll simulate 10 samples each from YRI and CHB, and zero from CEU, using msprime as the simulation engine:

samples = model.get_samples(10, 0, 10)
engine = stdpopsim.get_engine('msprime')
ts = engine.simulate(model, contig, samples)
print(ts.num_sites)
# 88063

And that’s it! It’s that easy! We now have a tree sequence describing the history and genotypes of 20 genomes, between which there are 88,063 variant sites. (We didn’t set the random seed, though, so you’ll get a somewhat different number.)

Let’s look at the metadata for the resulting simulation, to make sure that we’ve got what we want. The metadata for the populations is stored in json, so we use the json module to easily parse it:

import json
ts.num_samples
# 20
for k, pop in enumerate(ts.populations()):
   popdata = json.loads(pop.metadata)
   print(f"The tree sequence has {len(ts.samples(k))} samples from "
         f"population {k}, which is {popdata['id']}.")

# The tree sequence has 10 samples from population 0, which is YRI.
# The tree sequence has 0 samples from population 1, which is CEU.
# The tree sequence has 10 samples from population 2, which is CHB.

Running a generic model

Next, we will simulate using a “generic” model, with piecewise constant population size. This time, we will simulate a small portion of human chromosome 2, again with a flat recombination map, using the current best estimate of the human effective population size from the Catalog.

Choose a species

Although the model is generic, we still need a species, to get contig information. Again, we’ll use Homo sapiens, which has the id “HomSap”. (But, you could use any species from the Catalog!)

import stdpopsim
species = stdpopsim.get_species("HomSap")

Choose a contig and recombination map

Next, we set the contig information, to 5% of chromosome 2, about 12Mb. Again, you could use a fraction of any of the chromosomes listed in the Catalog, keeping in mind that larger contigs will take longer to simulate.

contig = species.get_contig("chr2", length_multiplier=0.05)
print(contig.recombination_map.get_sequence_length())
# 12159968.65

The “sequence length” is the length in base pairs. It is not an integer because msprime works in continuous genomic coordinates. Again, since we didn’t specify a genetic map, by default a “flat” map of constant recombination rate is used.

Set up the generic model

Next, we set the model to be the generic piecewise constant size model, using the predefined human effective population size (see Catalog). Since we are providing only one effective population size, the model is a single population of constant over all time.

model = stdpopsim.PiecewiseConstantSize(species.population_size)

Each species has a “default” population size, species.population_size, which for humans is 10,000.

Choose a sampling scheme, and simulate

Next, we set the number of samples and set the simulation engine. In this case we will simulate genomes of 10 samples using the simulation engine msprime. But, you can go crazy with the sample size! msprime is great at simulating large samples!

samples = model.get_samples(10)
engine = stdpopsim.get_engine('msprime')

Finally, we simulate the model with the contig length and number of samples we defined above. The simulation results are recorded in a tree sequence object (tskit.TreeSequence).

ts = engine.simulate(model, contig, samples)

Sanity check the tree sequence output

Now, we do some simple checks that our simulation worked with tskit.

print(ts.num_samples)
# 10
print(ts.num_populations)
# 1
print(ts.num_mutations)
# 18187
print(ts.num_trees)
# 12203

As expected, there are 10 samples in one population. We can also see that it takes 12203 distinct genealogical trees across these 5Mb of sequence, on which there were 18187 mutations (since we are not using a seed here, the number of mutations and trees will be slightly different for you). Try running the simulation again, and notice that the number of samples and populations stays the same, while the number of mutations and trees changes.

Output to VCF

In addition to working directly with the simulated tree sequence, we can also output other common formats used for population genetics analyses. We can use tskit to convert the tree sequence to a vcf file called “foo.vcf”. See the tskit documentation (tskit.TreeSequence.write_vcf()) for more information.

with open("foo.vcf", "w") as vcf_file:
   ts.write_vcf(vcf_file, contig_id='2')

Taking a look at the vcf file, we see something like this:

##fileformat=VCFv4.2
##source=tskit 0.2.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=2,length=12159969>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM      POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  tsk_0   tsk_1   tsk_2   tsk_3   tsk_4   tsk_5   tsk_6   tsk_7   tsk_8   tsk_9
2   96      .       0       1       .       PASS    .       GT      0       0       1       0       1       0       0       0       1       0
2   129     .       0       1       .       PASS    .       GT      0       0       0       0       0       0       0       0       1       0
2   436     .       0       1       .       PASS    .       GT      0       0       0       0       0       1       0       0       0       0
2   466     .       0       1       .       PASS    .       GT      0       0       1       0       1       0       0       0       0       0
2   558     .       0       1       .       PASS    .       GT      0       0       0       0       0       0       0       0       1       0
2   992     .       0       1       .       PASS    .       GT      1       1       0       1       0       1       1       1       0       1

Using the SLiM engine

Above, we used the coalescent simulator msprime as the simulation engine, which is in fact the default. However, stdpopsim also has the ability to produce simulations with SLiM, a forwards-time, individual-based simulator. Using SLiM provides us with a few more options. You may also want to install the pyslim package to extract the additional SLiM-specific information in the tree sequences that are produced.

An example simulation

The stdpopsim tool is designed so that different simulation engines are more or less exchangeable, so that to run an equivalent simulation with SLiM instead of msprime only requires specifying SLiM as the simulation engine. Here is a simple example.

Choose the species, contig, and recombination map

First, let’s set up a simulation of 10% of human chromosome 22 with a flat recombination map, drawing 200 samples from the Tennesen et al (2012) model of African history, Africa_1T12. Since SLiM must simulate the entire population, sample size does not affect the run time of the simulation, only the size of the output tree sequence (and, since the tree sequence format scales well with sample size, it doesn’t affect this very much either).

import stdpopsim
species = stdpopsim.get_species("HomSap")
contig = species.get_contig("chr22", length_multiplier=0.1)
# default is a flat genetic map
model = species.get_demographic_model("Africa_1T12")
samples = model.get_samples(200)
Choose the simulation engine

This time, we choose the SLiM engine, but otherwise, things work pretty much just as before.

engine = stdpopsim.get_engine("slim")
ts = engine.simulate(model, contig, samples, slim_scaling_factor=10)

(Note: you have to have SLiM installed for this to work, and if it isn’t installed in your PATH, so that you can run it by just typing slim on the command line, then you will need to specify the slim_path argument to simulate.) To get an example that runs quickly, we have set the scaling factor, described in more detail above (sec_slim_scaling_factor),

Other SLiM options

Besides rescaling, there are a few additional options specific to the SLiM engine, discussed here.

The SLiM burn-in

Another option specific to the SLiM engine is slim_burn_in: the amount of time before the first demographic model change that SLiM begins simulating for, in units of N generations, where N is the population size at the first demographic model change. By default, this is set to 10, which is fairly safe. History before this period is simulated with an msprime coalescent simulation, called “recapitation” because it attaches tops to any trees that have not yet coalesced. For instance, the Africa_1T12 model (Tennesen et al 2012) we used above has three distinct epochs:

import stdpopsim
species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("Africa_1T12")
model.get_demography_debugger().print_history()

# =============================
# Epoch: 0 -- 204.6 generations
# =============================
#      start     end      growth_rate |     0
#    -------- --------       -------- | --------
# 0 |4.32e+05 1.45e+04         0.0166 |     0

# Events @ generation 204.6
#    - Population parameter change for 0: initial_size -> 14474 growth_rate -> 0
# ==================================
# Epoch: 204.6 -- 5920.0 generations
# ==================================
#      start     end      growth_rate |     0
#    -------- --------       -------- | --------
# 0 |1.45e+04 1.45e+04              0 |     0

# Events @ generation 5920.0
#    - Population parameter change for 0: initial_size -> 7310
# ================================
# Epoch: 5920.0 -- inf generations
# ================================
#      start     end      growth_rate |     0
#    -------- --------       -------- | --------
# 0 |7.31e+03 7.31e+03              0 |     0

Since the longest-ago epoch begins at 5,920 generations ago with a population size of 7,310, if we set slim_burn_in=0.1, then we’d run the SLiM simulation starting at 5,920 + 731 = 6,651 generations ago, and anything longer ago than that would be simulated with a msprime coalescent simulation.

To simulate 200 samples of all of human chromosome 22 in this way, with the HapMapII_GRCh37 genetic map, we’d do the following (again setting slim_scaling_factor to keep this example reasonably-sized):

contig = species.get_contig("chr22", genetic_map="HapMapII_GRCh37")
samples = model.get_samples(200)
engine = stdpopsim.get_engine("slim")
ts = engine.simulate(model, contig, samples, slim_burn_in=0.1, slim_scaling_factor=10)
Outputting the SLiM script

One final option that could be useful is that you can ask stdpopsim to output the SLiM model code directly, without actually running the model. You could then edit the code, to add other features not implemented in stdpopsim. To do this, set slim_script=True (which prints the script to stdout; here we capture it in a file):

from contextlib import redirect_stdout
with open('script.slim', 'w') as f:
    with redirect_stdout(f):
        ts = engine.simulate(model, contig, samples, slim_script=True,
                             verbosity=2, slim_scaling_factor=10)

The resulting script is big - 18,122 lines - because it has the actual HapMapII_GRCh37 genetic map for chromosome 22 included, as text. To use it, you will at least want to edit it to save the tree sequence to a reasonable location - searching for the string trees_file you’ll find that the SLiM script currently saves the output to a temporary file. So, for instance, after changing

defineConstant("trees_file", "/tmp/tmp4hyf8ugn.ts");

to

defineConstant("trees_file", "foo.trees");

we could then run the simulation in SLiM’s GUI, to do more detailed investigation, or we could just run it on the command line:

$ slim script.slim

If you go this route, you need to do a few postprocessing steps to the tree sequence that stdpopsim usually does. Happily, these are made available through a single python function, engine.recap_and_rescale. Back in python, we could do this by

import stdpopsim, pyslim
ts = pyslim.load("foo.trees")
ts = engine.recap_and_rescale(ts, model, contig, samples,
                              slim_scaling_factor=10)
ts.dump("foo_recap.trees")

The final line saves the tree sequence, now ready for analysis, out again as foo_recap.trees.

The function engine.recap_and_rescale is doing three things. The first, and most essential step, is undoing the rescaling of time that the slim_scaling_factor has introduced. Next is “recapitation”, for which the rationale and method is described in detail in the pyslim documentation. The third (and least crucial) step is to simplify the tree sequence. If as above we ask for 200 samples from a population whose final size is 1,450 individuals (after rescaling), then in fact the tree sequence returned by SLiM contains the entire genomes and genealogies of all 1,450 individuals, but stdpopsim throws away all the information that is extraneous to the requested 100 (diploid) individuals, using a procedure called simplification. Having the extra individuals is not as wasteful as you might think, because the size of the tree sequence grows very slowly with the number of samples. However, for many analyses you will probably want to extract samples of realistic size for real data. Again, methods to do this are discussed in the pyslim documentation.

Example analyses with stdpopsim

Calculating genetic divergence

Next we’ll give an example of computing some summaries of the simulation output. The tskit documentation has details on many more statistics that you can compute using the tree sequences. We will simulate some samples of human chromosomes from different populations, and then estimate the genetic divergence between each population pair.

1. Simulating the dataset

First, let’s use the --help-models option to see the selection of demographic models available to us:

$ stdpopsim HomSap --help-models

All simulation models for Homo sapiens

OutOfAfrica_3G09: Three population out-of-Africa
     The three population Out-of-Africa model from Gutenkunst et al.
    2009. It describes the ancestral human population in Africa, the
    out of Africa event, and the subsequent European-Asian population
    split. Model parameters are the maximum likelihood values of the
    various parameters given in Table 1 of Gutenkunst et al.

    Populations:
        YRI: 1000 Genomes YRI (Yorubans)
        CEU: 1000 Genomes CEU (Utah Residents (CEPH) with Northern and Western European Ancestry
        CHB: 1000 Genomes CHB (Han Chinese in Beijing, China)

OutOfAfrica_2T12: Two population out-of-Africa
     The model is derived from the Tennesen et al. analysis of the
    jSFS from European Americans and African Americans. It describes
    the ancestral human population in Africa, the out of Africa event,
    and two distinct periods of subsequent European population growth
...

This prints detailed information about all of the available models to the terminal. In this tutorial, we will use the model of African-American admixture from 2011 Browning et al. From the help output (or the Catalog), we can see that this model has id AmericanAdmixture_4B11, and allows samples to be drawn from 4 contemporary populations representing African, European, Asian and African-American groups.

Using the --help-genetic-maps option, we can also see what recombination maps are available:

$ stdpopsim HomSap --help-genetic-maps

All genetic maps for Homo sapiens

HapMapII_GRCh37
     This genetic map is from the Phase II Hapmap project and based on
    3.1 million genotyped SNPs from 270 individuals across four
    populations (YRI, CEU, CHB and JPT). Genome wide recombination
    rates were estimated using LDHat. This version of the HapMap
    genetic map was lifted over to GRCh37 (and adjusted in regions
    where the genome assembly had rearranged) for use in the 1000
    Genomes project. Please see the README file on the 1000 Genomes
    download site for details of these adjustments.

DeCodeSexAveraged_GRCh36
     This genetic map is from the deCode study of recombination events
    in 15,257 parent-offspring pairs from Iceland. 289,658 phased
    autosomal SNPs were used to call recombinations within these
    families, and recombination rates computed from the density of
    these events. This is the combined male and female (sex averaged)
    map. See https://www.decode.com/addendum/ for more details.

PyrhoACB_GRCh37
     This genetic map was inferred using individuals from the ACB
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoASW_GRCh37
     This genetic map was inferred using individuals from the ASW
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoBEB_GRCh37
     This genetic map was inferred using individuals from the BEB
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoCDX_GRCh37
     This genetic map was inferred using individuals from the CDX
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoCEU_GRCh37
     This genetic map was inferred using individuals from the CEU
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoCHB_GRCh37
     This genetic map was inferred using individuals from the CHB
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoCHS_GRCh37
     This genetic map was inferred using individuals from the CHS
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoCLM_GRCh37
     This genetic map was inferred using individuals from the CLM
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoESN_GRCh37
     This genetic map was inferred using individuals from the ESN
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoFIN_GRCh37
     This genetic map was inferred using individuals from the FIN
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoGBR_GRCh37
     This genetic map was inferred using individuals from the GBR
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoGIH_GRCh37
     This genetic map was inferred using individuals from the GIH
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoGWD_GRCh37
     This genetic map was inferred using individuals from the GWD
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoIBS_GRCh37
     This genetic map was inferred using individuals from the IBS
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoITU_GRCh37
     This genetic map was inferred using individuals from the ITU
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoJPT_GRCh37
     This genetic map was inferred using individuals from the JPT
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoKHV_GRCh37
     This genetic map was inferred using individuals from the KHV
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoLWK_GRCh37
     This genetic map was inferred using individuals from the LWK
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoMSL_GRCh37
     This genetic map was inferred using individuals from the MSL
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoMXL_GRCh37
     This genetic map was inferred using individuals from the MXL
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoPEL_GRCh37
     This genetic map was inferred using individuals from the PEL
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoPJL_GRCh37
     This genetic map was inferred using individuals from the PJL
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoPUR_GRCh37
     This genetic map was inferred using individuals from the PUR
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoSTU_GRCh37
     This genetic map was inferred using individuals from the STU
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoTSI_GRCh37
     This genetic map was inferred using individuals from the TSI
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

PyrhoYRI_GRCh37
     This genetic map was inferred using individuals from the YRI
    population from Phase 3 of the 1000 Genomes Project. Rates were
    estimated using pyrho (https://github.com/popgenmethods/pyrho)
    while using population-specific population size history estimates
    obtained from smc++ (https://github.com/popgenmethods/smcpp).
    Genetic maps are only available for the 22 autosomes. See
    https://doi.org/10.1126/sciadv.aaw9206 for more details.

Let’s go with HapMapII_GRCh37. The next command simulates 4 samples of chromosome 1 from each of the four populations, and saves the output to a file called afr-america-chr1.trees. For the purposes of this tutorial, we’ll also specify a random seed using the -s option. To check that we have set up the simulation correctly, we may first wish to perform a dry run using the -D option. This will print information about the simulation to the terminal:

$ stdpopsim HomSap -c chr1 -o   afr-america-chr1.trees -s 13 -g HapMapII_GRCh37 -d AmericanAdmixture_4B11 4 4 4 4 -D
Simulation information:
    Engine: msprime (0.7.4)
    Model id: AmericanAdmixture_4B11
    Model desciption: American admixture
    Seed: 13
    Population: number_samples (sampling_time_generations):
        AFR: 4 (0)
        EUR: 4 (0)
        ASIA: 4 (0)
        ADMIX: 4 (0)
Contig Description:
    Contig length: 249218992.0
    Mean recombination rate: 1.1487055306951004e-08
    Mean mutation rate: 1.29e-08
    Genetic map: HapMapII_GRCh37

If you use this simulation in published work, please cite:
[stdpopsim]
Adrion et al., 2020: https://doi.org/10.7554/eLife.54967
[simulation engine]
Kelleher et al., 2016: https://doi.org/10.1371/journal.pcbi.1004842
[genome assembly]
The Genome Sequencing Consortium, 2001: http://dx.doi.org/10.1038/35057062
[mutation rate]
Tian, Browning, and Browning, 2019: https://doi.org/10.1016/j.ajhg.2019.09.012
[genetic map, recombination rate]
The International HapMap Consortium, 2007: https://doi.org/10.1038/nature06258
[demographic model]
Browning et al., 2011: http://dx.doi.org/10.1371/journal.pgen.1007385

Once we’re sure, we can remove the -D flag to run the simulation. (Note: This took around 8 minutes to run on a laptop.)

$ stdpopsim HomSap -c chr1 -o afr-america-chr1.trees -s 13 -g HapMapII_GRCh37 -d AmericanAdmixture_4B11 4 4 4 4

2. Calculating divergences

We should now have a file called afr-america-chr1.trees. Our work with stdpopsim is done; we’ll now switch to a Python console and import the tskit package to load and analyse this simulated tree sequence file.

import tskit
ts = tskit.load("afr-america-chr1.trees")

Recall that genetic divergence (often denoted \(d_{xy}\)) between two populations is the mean density per nucleotide of sequence differences between two randomly sampled chromosomes, one from each population (and averaged over pairs of chromosomes). Genetic diversity of a population (often denoted \(\pi\)) is the same quantity, but with both chromosomes sampled from the same population. These quantities can be computed directly from our sample using tskit’s tskit.TreeSequence.divergence().

By looking at the documentation for this method, we can see that we’ll need two inputs: sample_sets and indexes. In our case, we want sample_sets to give the list of sample chromosomes (nodes) from each separate population. We can obtain the necessary list of lists like this:

sample_list = []
for pop in range(0, ts.num_populations):
    sample_list.append(ts.samples(pop).tolist())

print(sample_list)

# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

Note that the samples with node IDs 0 - 3 are from population 0, samples with node IDs 4 - 7 are from population 1 and so on. (Also, the .tolist() in the code above is not necessary; it is only there to make the output simpler.)

The next argument, indexes should give the pairs of integer indexes corresponding to the sample sets that we wish to compute divergence between. For instance, the tuple (0, 2) will compute the divergence between sample set 0 and sample set 2 (so, in our case, population 0 and population 2). We can quickly get all the pairs of indexes as follows:

inds = []
for i in range(0, ts.num_populations):
   for j in range(i, ts.num_populations):
      inds.append((i, j))

print(inds)
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]

We are now ready to calculate the genetic divergences.

divs = ts.divergence(sample_sets=sample_list, indexes=inds)
print(divs)
# array([0.00035424, 0.0003687 , 0.00036707, 0.0003705 , 0.00026696,
#        0.00029148, 0.00029008, 0.00025767, 0.0002701 , 0.00028184])

As a sanity check, this demographic model has population sizes of around \(N_e = 10^4\), and the mutation rate is \(\mu = 1.29 \times 10^{-8}\) (shown in the output of stdpopsim, or found in python with contig.mutation_rate), so we expect divergence values to be of order of magnitude \(2 N_e \mu = 0.000254\), slightly higher because of population structure.

3. Plotting the divergences

The output lists the divergences of all population pairs that are specified in indexes, in the same order. However, instead of simply printing these values to the console, it might be nicer to create a heatmap of the values. Here is some (more advanced) code that does this. It relies on the numpy, seaborn and matplotlib packages.

import numpy as np
import seaborn
import matplotlib.pyplot as plt
div_matrix = np.zeros((ts.num_populations, ts.num_populations))
for pair in range(0, len(inds)):
    pop0, pop1 = inds[pair]
    div_matrix[pop0, pop1] = divs[pair]
    div_matrix[pop1, pop0] = divs[pair]
seaborn.heatmap(div_matrix, vmin=0, vmax=0.0005, square=True)
ax = plt.subplot()
plt.title("Genetic divergence")
plt.xlabel("Populations", fontweight="bold")
plt.ylabel("Populations", fontweight="bold")
ax.set_xticks([0,1,2,3], minor=True)
ax.set_xticklabels(['AFR', 'EUR', 'ASI', 'ADM'], minor=False)
ax.tick_params(which='minor', length=0)
ax.set_yticks([0,1,2,3], minor=True)
ax.set_yticklabels(['AFR', 'EUR', 'ASI', 'ADM'], minor=False)
ax.tick_params(which='minor', length=0)
Heatmap of divergence values.

These values make sense given the model of demography we have specified: the highest divergence estimates were obtained when African samples (AFR) were compared with samples from other populations, and the lowest divergence estimates were obtained when Asian (ASI) samples were compared with themselves. However, the overwhelming sameness of the sample chromosomes is also evident: on average, any two sample chromosomes differ at less than 0.04% of positions, regardless of the populations they come from.

Calculating the allele frequency spectrum

Next, we will simulate some samples of chromosomes from different populations of a non-human (finally!), Arabidopsis thaliana, and analyse the allele frequency spectrum (AFS) for each population (also called the “site frequency spectrum, or SFS).

1. Simulating the dataset

This time, we will use the stdpopsim.IsolationWithMigration() model. Since this is a generic model that can be used for any species, we must use the Python interface for this simulation. See our Python tutorial for an introduction to this interface.

We begin by importing stdpopsim into a Python environment and specifying our desired species, Arabidopsis thaliana. From the Catalog, we can see that this species has the ID AraTha:

import stdpopsim
species = stdpopsim.get_species("AraTha")

After skimming the Catalog to see our options, we’ll specify our desired chromosome chr4 and recombination map SalomeAveraged_TAIR7.

contig = species.get_contig("chr4", genetic_map="SalomeAveraged_TAIR7")

From the API description, we can see that the stdpopsim.IsolationWithMigration() model allows us to sample from a pair of populations that diverged from a common ancestral population. We’ll specify that the effective population size of the ancestral population was 5000, that the population sizes of the two modern populations are 4000 and 1000, that the populations diverged 1000 generations ago, and that rates of migration since the split between the populations are both zero.

model = stdpopsim.IsolationWithMigration(NA=5000, N1=4000, N2=1000, T=1000, M12=0, M21=0)

We’ll simulate 10 chromosomes from each of the populations using the msprime engine.

samples = model.get_samples(10, 10)
engine = stdpopsim.get_engine("msprime")

Finally, we’ll run a simulation using the objects we’ve created and store the outputted dataset in an object called ts. For the purposes of this tutorial, we’ll also run this simulation using a random seed:

ts = engine.simulate(model, contig, samples, seed=13)

2. Calculating the AFS

Recall that the allele frequency spectrum (AFS) summarises the distribution of allele frequencies in a given sample. At each site, there is an ancestral and (sometimes more than one) derived allele, and each allele is observed in the sample with some frequency. Each entry in the AFS corresponds to a particular sample frequency, and records the total number of derived alleles with that frequency. We can calculate the AFS directly from our tree sequence using the tskit.TreeSequence.allele_frequency_spectrum() method.

Since we wish to find the AFS separately for each of our two populations, we will first need to know which samples correspond to each population. The tskit.TreeSequence.samples() method in tskit allows us to find the IDs of samples from each population:

pop_samples = [ts.samples(0), ts.samples(1)]
print(pop_samples)
# [array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32),
#  array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=int32)]

We are now ready to calculate the AFS. Since our dataset was generated using the default msprime simulation engine, we know that it has exactly one derived allele at any polymorphic site. We also know what the derived and ancestral states are. We can therefore calculate the polarised AFS using tskit’s tskit.TreeSequence.allele_frequency_spectrum() method:

sfs0 = ts.allele_frequency_spectrum(sample_sets=[pop_samples[0]],
                                    polarised=True, span_normalise=False)
print(sfs0)
# [1603. 2523. 1259.  918.  598.  471.  434.  367.  343.  265.  136.]

The output lists the number of derived alleles that are found in 0, 1, 2, … of the given samples. Since each of our populations have 10 samples each, there are 11 numbers. The first number, 1603, is the number of derived alleles found in the tree sequence but not found in that population at all (they are present because they are found in the other population). The second, 2523, is the number of singletons, and so forth. The final number, 136, is the number of derived alleles in the tree sequence found in all ten samples from this population. Since an msprime simulation only contains information about polymorphic alleles, these must be alleles fixed in this population but still polymorphic in the other.

Here is the AFS for the other population:

sfs1 = ts.allele_frequency_spectrum(sample_sets=[pop_samples[1]],
                                    polarised=True, span_normalise=False)
print(sfs1)
# [3755.  972.  744.  558.  476.  502.  409.  320.  306.  271.  604.]

The somewhat mysterious polarised=True option indicates that we wish to calculate the AFS for derived alleles only, without “folding” the spectrum, and the span_normalise=False option disables tskit’s default behaviour of dividing by the sequence length. See tskit’s documentation for more information on these options.

We will do further analysis in the next section, but you might first wish to convince yourself that this output makes sense to you. You might also wish to check that the total sum of sites is the sum of the AFS entries:

sum(sfs0), sum(sfs1), ts.num_sites
# (8917.0, 8917.0, 8917)

3. Plotting the AFS

Here is some more advanced code that compares the estimated AFS from each population. It relies on the matplotlib and numpy packages. We will scale each AFS by the number of mutated sites in the corresponding sample set.

import matplotlib.pyplot as plt
import numpy as np
bar_width = 0.4
r1 = np.arange(0,11) - 0.2
r2 = [x + bar_width for x in r1]
ax = plt.subplot()
plt.bar(x = r1, height = sfs0/ts.num_sites, width=bar_width, label='pop0')
plt.bar(x = r2, height = sfs1/ts.num_sites, width=bar_width, label='pop1')
plt.xlabel("Allele count", fontweight="bold")
plt.ylabel("Proportion of mutated sites in sample", fontweight="bold")
ax.set_xticks(np.arange(0,11))
ax.legend()
plt.plot()
AFS plots.

This figure shows substantial differences in the allele frequency spectrum between the two populations, most notably a larger number of singletons in population 0 and a larger number of fixed and absent alleles in population 1. This makes sense given the demography we have specified: population 1 has had a much more extreme population size reduction.