# Workshop Notes: Models and Inference in Population Genetics IV: Fragmentation-coalescence and related models

**Last modified:**Post views:

The workshop is held in the University of Warwick, by department of Statistics, on 15th - 19th April 2024.

Stochastic coalescence and fragmentation models respectively describe how blocks of mass randomly join together and break apart over time according certain rules of random evolution. These models are important in fields including physical chemistry, ecology, and population genetics – where coalescence and fragmentation underlie widely studied genealogical processes. The aims of this workshop are to bring together researchers in probability and statistics working in all aspects of fragmentation, coalescence, genealogy, and genetic inference.

I attended the workshop as I wanted to learn about such models and seek possibility of applying them in my research of virus evolution.

This workshop contains two mini-courses and a series of talks.

## Mini-course 1: “Stochastic models of genealogies in spatially structured populations” - by Amandine Véber (CNRS, Université Paris Cité)

The spatial structure of a population impacts the shape of the genealogy of a sample of individuals taken from this population. In this course, we shall focus on populations that are distributed over a continuous space. Using variations of the spatial Lambda-Fleming-Viot process, introduced by Alison Etheridge and Nick Barton in 2008, we shall consider different scenarii: a neutral case where the population is demographically stable and all alleles confer the same reproductive potential on their carriers, a demographically stable population in which one allele has a slight reproductive advantage over the others, and an expanding population with equally fit individuals. We shall describe the main characteristics of the ancestral lines (or potential ancestral lines) over appropriate space and time scales.

Aim: Model and understand the evolution in continuous spatial structure.

- (I fell like such model can be applied in studying the quasispecies dynamics in a virus population wihtin a host, aiding with spatial sequencing.)

## Mini-course 2: “Coalescent theory and coagulation equations” - by Emmanuel Schertzer (University of Vienna)

In this mini-course, I will explore the link between coagulation equations and several classical models from population genetics. I will review some well-known results on the speed of coming down for Λ-coalescents [1] and show how to leverage those asymptotics to extract some information about the lower part of the frequency spectrum (SFS). This methodology is based on an analytical approach relating combinatorial quantities (e.g., branch length of genealogical trees) to solutions of coagulation equations. In the second part of the course, I will introduce two classes of coalescent models where the same general methodology can be applied: (1) structured coalescents, (2) Xi-coalescents [2] and (3) nested exchangeable coalescents (trees within trees) [3]. The lastter class of models is motivated by models from phylogenetics where gene lineages are constrained inside a species tree. I will show a De-Finetti representation analogous to Pitman’s theorem for Λ-coalescents [3] and then focus on a simple example known as the nested Kingman coalescent [4][5]. Despite its simplicity, the latter has a rich mathematical structure. In particular, the asymptotics of the SFS is related to solutions of a transport-coagulation PDE. This equation exhibits an interesting phase transition that can be understood in terms of a stochastic representation. Biological implications of this result will be discussed. [1] The Lambda-coalescent speed of coming down from infinity (Annals of Proba 2010) [2] Asymptotics of the frequency spectrum for general Dirichlet Xi-coalescents (EJP 2023) [3] Trees within trees: simple nested coalescents (EJP 2018) [4] Coagulation-transport equations and the nested coalescents (PTRF, 2020) [5] The nested Kingman coalescent: Speed of coming down from infinity (Annals of Appl. Proba 2019) a recent pre-print [6] The Yule- Nested Coalescent: Distribution of the Number of Lineages and recent (yet unpublished) joint work with my Florin Boenkost.

In this mini-course, we aim at investigating the SFS of large class of coalescents:

- Lambda-coalescent (Smoluchowsky <-> CSBP)
- Multi-type coalescent (multitype-Smoluchowsky <-> multitype CSBP)
- Nested coalescent (transport-coagulation PDE <-> branching CSBP)

Assuming that the size of our sample is large. The lower (or intermediate) part of the SFS can be approximated the solution of a coagulation PDE.

## Talks

### Adrián González Casanova: Asymmetric Frequency Processes

Asymmetric Frequency Processes emerge as a natural extension of two-type Fleming-Viot processes, manifesting in the study of competitive dynamics among two distinct groups of individuals employing different reproductive mechanisms. This talk entails their derivation from continuous state branching processes, alongside exploring select properties, notably moment duality. Finally, we examine the multidimensional version of these processes and their relation with multitype Λ-Coalescents.

- Asymmetric processes:
- Gillespie I (1974); Should all eggs be placed in one bag? The Wright Fisher Gillespie diffusion.
- Gillespie II (1975); different conditioning.
- Etheridge and March;

- Multi-type Λ-coalescents:
- (I feel like this can be used for modelling multi-strain evolution in a virus)

### Alison Etheridge: Forwards and backwards in spatially heterogeneous populations

We introduce a broad class of individual based models that might describe how spatially heterogeneous populations live, die, and reproduce. Our primary interest is in understanding how genetic ancestry spreads across geography when looking back through time in these populations. A novelty is that by explicitly splitting reproduction into two phases (production of juveniles and their maturation) we produce a framework that not only captures models which when suitably scaled converge to classical reaction diffusion equations, but also ones with nonlinear diffusion that exhibit quite different behaviour. This is joint work with Tom Kurtz (Madison), Peter Ralph (Oregon) and Ian Letter and Terence Tsui (Oxford).

A model based on “interacting branching processes”.

(**Alison Etheridge** seems to be a very experienced speaker, her talk is clear and easy to follow, she first explain the framework as a whole and proceed to the details. I should learn from her. Also the topic is very interesting, similar with the previous talk, I feel like this can be applied in studying the spatial dynamics of virus evolution within host.)

### Sam Johnston: Ancestral Reproductive Bias in Branching Processes

Consider a branching process whose reproduction law is homogeneous. Sampling a single cell uniformly from the population at a time T > 0 and looking along the sampled cell’s ancestral lineage, we find that the reproduction law and rate along this lineage is heterogeneous and differs from the reproduction in the population at large. This is due an `inspection paradox’: cells with a larger number of offspring are more likely to have one of their descendants sampled by virtue of their prolificity. We explore this inspection paradox (as well as other inspection paradoxes in the broader probability literature) and link the resulting bias to recent observations in genetic data.

(He mentioned an example of inspection bias: the marathon paradox: when you run in a marathon, you always see people superseding you or lagging behind you, but you never see people running at the same pace as you. This is because you are always in the middle of the group. This is a kind of inspection bias.)

### Asger Hobolth: Matrix-analytical for infinite sites model

(The speaker changed the topic.)

### Aurélien Tellier: Inference of ecological and life-history trains from full genome polymorphism data: tales of success and limitations

While most inference methods using full-genome data can be applied to all possible kind of species, the underlying assumptions are often sexual reproduction in each generation and non-overlapping generations. However, in many plants, invertebrates, fungi and other taxa, those assumptions are often violated due to different ecological and life history traits, such as self-fertilization, long term dormant structures (seed or egg-banking) or large variance in offspring production. Furthermore, the resolution of past inference decreases when there is a lack of SNPs in the data. I will present here three new developments of the Sequentially Markovian Coalescent (SMC) and Deep Learning (DL) methods based on Graph Neural Networks (GNN) allowing us to 1) infer seed banking / dormancy or selfing rates and their change in time, 2) infer the variance in offspring production and regions under positive selection along the genome, and 3) integrate epigenetic (methylation) markers to improve the inference of past events.

- Sequentially Markovian Coalescent (SMC): emission probability, transition probability.
- Models DNA methylation.
- GNNcoal: Graph Neural Networks for coalescent inference.

### Simon Harris: Genealogies of samples from stochastic population models

Consider some population evolving stochastically in time. Conditional on the population surviving until some large time T, take a sample of particles from those alive. What does the ancestral tree drawn out by this sample look like? Some special cases were known, e.g. Durrett (1978), O’Connell (1991), but we will discuss an approach behind some more recent advances for Bienyame-Galton-Watson (BGW) processes conditioned to survive. In near-critical and in critical varying environment settings, the same universal limiting sample genealogy always appears up to some deterministic time change (which depends only on the mean and variance of the offspring distributions). This genealogical tree has the same binary tree topology as a Kingman coalescent, but where the coalescent (or split) times can be represented as a mixture of IID times - this very roughly interpreted as a mixture of time changed ‘slowed down’ Kingman coalescents. In contrast, in critical infinite variance offspring settings, we find that more complex universal limiting sample genealogies emerge that exhibit multiple-mergers, these being driven by massive birth events within the underlying population. Our key tool is a change of measure involving k distinguished particles, also known as spines. Some ongoing work and open problems may also be mentioned. This talk is based on work in collaboration with M.Roberts (Bath), S.Johnston (KCL) in AAP (2020), with J.C.Pardo (CIMAT), S.Johnston in AOP (2024), and with S.Palau (UNAM), J.C.Pardo (2022+). Acknowledgements: This research is supported by New Zealand Royal Society Te Apārangi Marsden funding (Harris), PAPIIT UNAM funding (Palau), and CONAHCyT funding (Pardo).

### Juan Carlos Pardo: On the speed of coming down from infinity for subcritical branching processes with pairwise interactions

In this talk, we investigate the phenomenon of coming-down from infinity for (sub)critical cooperative branching processes with pairwise interactions under suit- able conditions. A process in this class behaves as a pure branching process with the difference that competition and cooperation events between pairs of individuals are also allowed. In particular, we are interested in the speed of BPI-processes when their initial population is very large, as well as in their second order fluctuations. This is a joint work with Gabriel Berzunza.

### Martina Favero: Sampling probabilities, diffusions, ancestral graphs, and duality under strong selection

Wright–Fisher diffusions and their dual ancestral graphs occupy a central role in the study of allele frequency change and genealogical structure, and they provide expressions, explicit in some special cases but generally implicit, for the sampling probability, a crucial quantity in inference. Under a finite-allele mutation model, with possibly parent-dependent mutation, we consider the asymptotic regime where the selective advantage of one allele grows to infinity, while the other parameters remain fixed. In this regime, we show that the Wright–Fisher diffusion can be approximated either by a Gaussian process or by independent continuous-state branching processes with immigration. While the first process becomes degenerate at stationarity, the latter do not and provide a simple, analytic approximation for the leading term of the sampling probability. We then characterise all remaining terms using another approach based on a recursion formula. Finally, we study the asymptotic behaviour of the block-counting process of the conditional ancestral selection graph and establish an asymptotic duality relationship between this and the diffusion. This is joint work with Paul Jenkins.

### Ellen Baake: The Markov embedding problem, multiple coupon collection, and finite-sites mutation with dependencies

The embedding problem of Markov transition matrices into Markov semigroups is a classic problem that regained a lot of impetus and activities in the last few years. We consider it here for a generalisation of the classical coupon collection process, which is equivalent to a finite-sites mutation model with dependencies. We obtain explicit conditions for the resulting discrete-time Markov chain to be representable as the semigroup of a continuous-time process. In genetic terms, this amounts to comparing the mutation process resulting from replication errors in discrete generations with mutations occurring continuously during the life of an individual, via radiation, thermal fluctuations etc. This is joint work with Michael Baake.

### Apolline Louvet: Modelling populations expanding in a spatial continuum

Understanding the emergence of genetic diversity patterns in expanding populations is of longstanding interest in population genetics. In this talk, I will introduce a model that can be used to gain some insight on the evolution of genetic diversity patterns at the front edge of an expanding population. This model, called the ∞-parent spatial Λ-Fleming Viot process (or ∞-parent SLFV), is characterized by an “event-based” reproduction dynamics that makes it possible to control local reproduction rates and to study populations living in unbounded regions. I will present what is currently known of the growth properties of this process, and what are the implications of these results in terms of genetic diversity at the front edge. Based on a joint work with Amandine Véber (MAP5, Univ. Paris Cité) and Matt Roberts (Univ. Bath).

(*Expanding*, simulation work, biological aspect)
(I got photos of some slides)

### Matteo Ruggiero: Filtering Poisson-Dirichlet diffusions

We tackle inference, in a hidden Markov model framework, for the trajectory of a one- or two- parameter Poisson-Dirichlet diffusion, which evolve in the infinite-dimensional ordered simplex and model the continuous random evolution of an infinite vector of ranked frequencies. The problem is particularly challenging in that the symmetry induced by ranking the frequencies makes the collected observations take the form of unlabeled partitions, where group types are not specified, and computing the likelihood requires integrating over all possible probabilistic generation of the observed partition given the underlying state of the random measure. Furthermore, the time-marginal model is not conjugate, as conditioning on further sets of data gives rise to mixtures of nonparametric laws whose cardinality grows exponentially, making the statistical problem virtually intractable. In this setting, we provide recursive formulae for the signal filtering distribution, i.e., the conditional law of the diffusion state given past and present observations, which are finite mixtures whose weights are given by the laws of certain coagulated partitions. We devise suitable approximation schemes that allow to efficiently implement the filter given partition-valued data, and similarly tackle the marginal smoothing problem, given also future data. Joint work with Marco Dalla Pria (University of Torino) and Dario Spanò (University of Warwick).

(HMM, MCMC)

### Jere Koskela: Simple criteria for consistent Bayesian tree reconstruction

Reconstructing an unobserved tree from a Markov process run along its edges, and observed only at its leaves, is a central problem in phylogenetics. Frequentist results are available for determining when tree reconstruction can be done consistently as the number of i.i.d. replicates of the Markov process at the leaves increases. They all rely on unnatural technical assumptions, typically to do with either discretising or bounding branch lengths from above and below. Moreover, many of the most popular algorithms for tree reconstruction have a Bayesian character, to which frequentist guarantees are not directly applicable. I will show that adopting a Bayesian nonparametric perspective yields consistency for a broad class of tree models with no discretisation or boundedness assumptions, at a rate of convergence which matches frequentist results. This is joint work with Alisa Kirichenko and Luke Kelly.

The speaker try to mathematically prove that MCMC methods are valid in some cases for tree reconstruction.

## Poster session and casual discussion

### Bayesian inference of basic reproduction number

This work is also done by a PhD student from University of Warwick, supervised by Jere Koskela. They are working on Bayesian inference of basic reproduction number, making use of ** information from phylogenetic trees**. I should sent a follow-up email to the speaker to ask for the details.

### MCMC of tree models

I met a friend who is a final year PhD student in University of Warwick, supervised by Jere Koskela, working on MCMC of tree models, inferring the tree time and tree topology using MCMC methods, which is a similar topic of the BEAST software. Should keep in touch with him.

## Some keywords to follow up on

- Kingman
- Lambda-coalescent
- PDE
- coaguation
- species tree / gene tree, species / lineages
- SLFV
- CSBP
- Martingale

** Tags: **
Population genetics

** Categories: **
Course notes

## Comments