Ep 118: Dog in the machine (with Andy Kern)

How should biologists deal with the massive amounts of population genetic data that are now routinely available? Will AIs make biologists obsolete?

In this episode, we talk with Andy Kern, an Associate Professor of Biology at the University of Oregon. Andy has spent much of his career applying machine learning methods in population genetics. We talk with him about the fundamental questions that population genetics aims to answer and about older theoretical and empirical approaches. We then turn to the promise of machine learning methods, which are increasingly being used to estimate population genetic structure, patterns of migration, and the geographic origins of trafficked samples. These methods are powerful because they can leverage high dimensional genomic data. Andy also talks about the implications of AI and machine learning for the future of biology research.
Cover photo: Keating Shahmehri

  • Art Woods 0:00

    Hey Big Biology listeners. We're moving into a more intense period of fundraising over the next few months. So you'll notice a couple of changes to our show.

    Cameron Ghalambor 0:08

    First, behind the scenes, we're applying to foundations and other funding sources that will help keep Big Biology on the air.

    Art Woods 0:15

    So if you know of any good sources of support for scicomm podcasts, please share them with us. You can reach us at info at big biology dot org.

    Cameron Ghalambor 0:23

    Second, please consider making a donation either through our website or through Patreon.

    Art Woods 0:29

    If you donate through our website, go to www dot big biology.org and click on the "About" link. It's super easy to make a safe and secure donation using your credit card.

    Cameron Ghalambor 0:39

    While you are there, you should also click on the shop link, where you'll find all the super cool Big Biology art on T-shirts, coffee mugs, and other stuff.

    Art Woods 0:48

    And lastly, you can go to patreon.com/big bio to become a patron for just a few dollars per month.

    Cameron Ghalambor 0:55

    Patrons get cool insider stuff like access to behind the scenes audio and extras from our guests about their lives, their hobbies, and their careers.

    Art Woods 1:03

    Now onto the intro for today's show.

    Cameron Ghalambor 1:11

    If you follow the history of evolutionary biology, you will often hear reference to The Modern Synthesis.

    Art Woods 1:18

    It certainly comes up frequently on the show, both for what it has accomplished and what it has not.

    Cameron Ghalambor 1:23

    One of its major accomplishments has been the development of a robust mathematical theory to describe changes in genetic variation within and between populations over time.

    Art Woods 1:33

    This mathematical foundation allows population genetic theory to explore how different evolutionary processes, meaning things like genetic drift, gene flow, mutation and natural selection, how those things act on genetic variation.

    Cameron Ghalambor 1:47

    Interestingly, modern molecular genetics has provided a flood of data for testing ideas that come out of population genetics theory. But that flood has also started to force changes in how biologists build those models in the first place.

    Art Woods 2:02

    Whereas most of the basic models focused on a single locus with two alleles, analyses now routinely consider variation across whole genomes.

    Cameron Ghalambor 2:11

    It can take a lot of computational power to run even simplified traditional models. So when you think about studies like the 1000 Genomes Project, which contains data on millions of variants across many different human genomes, you quickly reach the limits of what traditional approaches can handle.

    Art Woods 2:27

    One new approach, which we talked about in the show today is machine learning.

    Cameron Ghalambor 2:33

    Machines can learn? Like the Terminator movies?

    Art Woods 2:36

    Not quite. You can think of machine learning as a set of methods for constructing algorithms to extract information from data or to map data onto outcomes.

    Cameron Ghalambor 2:46

    Hmm like facial recognition?

    Art Woods 2:48

    Yeah that's a good example. Facial recognition software typically uses machine learning in which an algorithm is first trained on a large data set of known facial images, and then is tasked with identifying unknown individuals based on the learned patterns.

    Cameron Ghalambor 3:03

    Training models with labeled data is a powerful form of machine learning called supervised learning. For example, if you train a model on known protein structures and their underlying amino acid sequences, then that model can predict a structure for any new amino acid sequence, even if it's not always right about the protein structure.

    Art Woods 3:23

    Our guest today is Andy Kern, who is the Evergreen Associate Professor of Biology at the University of Oregon, and he has been at the forefront of using machine learning tools to study population genetics.

    Cameron Ghalambor 3:35

    We talk with Andy about traditional approaches to population genetics, and how newly available machine learning methods can be applied to understanding patterns of genetic variation.

    Art Woods 3:45

    Unlike traditional approaches that might require lots of simplifying assumptions and that struggle computationally to deal with large datasets, machine learning techniques only require a set of known inputs and outputs to learn the mapping between them.

    Cameron Ghalambor 3:57

    And this approach can easily be scaled up to high dimensional genomic data. For example, you can train models that analyze thousands of markers from across the genome to identify where a sample came from.

    Art Woods 4:13

    We also talk about current and future uses of machine learning, like, is there a future in which you upload genetic data from your project and then work with an AI to analyze it?

    Cameron Ghalambor 4:24

    And whether that might make scientists obsolete? Wait, maybe we are talking about Terminator?

    Art Woods 4:30

    I'm Art Woods.

    Cameron Ghalambor 4:32

    And I'm Cameron Ghalambor.

    Art Woods 4:33

    And you're listening to Big Biology.

    Cameron Ghalambor 4:35

    Andy Kern, thanks so much for joining us today on Big Biology.

    Andy Kern 4:47

    Thanks, pleasure to be here.

    Cameron Ghalambor 4:49

    So we're really looking forward to talking to you about your research in population genetics, and specifically the application of cutting edge computational methods, as well as your perspectives on the current state of population genetic theory and where the field is going. And I think to start off, I want to say that I think I first met you probably over 20 years ago, when you were at UC Davis, and that was through our mutual friend John McKay. And I have vivid memories of talking to you about your research on Drosophila while your dog, who I think was named Fidel, was wrestling with John McKay's dog, named Rudy. I don't know if you remember that.

    Andy Kern 5:32

    I remember very well. It was Thanksgiving.

    Cameron Ghalambor 5:35

    Yes, yes. Good. Alright, so I'm curious how you became interested in population genetics, specifically?

    Andy Kern 5:37

    Oh, yeah. So I owe my whole career in population genetics to wandering into a class one day. I was an undergraduate at Brown University. And Brown had this really great thing where, for the first, like two weeks of term, there was a so-called "shopping period," where you didn't have to sign up for a class, you could just like shop around. And I thought I was going to be a physics major, I had started down the physics track. And I had kind of fallen out of love with what I was doing there, and I was looking around for other things to study. And I hopped into this class that was actually being taught by one of my academic mentors, now, David Rand. And David was at the front of the room talking about, it was an evolutionary biology class, and he was talking about evolution, and genetics, and math. And I knew a little bit about genetics, I knew a bunch about math, and it just all kind of clicked. I was like, wow, this is fascinating. And I took that class, and I got to spend some time with David, and yeah it really changed the course of things. I started shortly thereafter working in David's lab, I took more evolution classes. Yeah. And I really fell in love with this sort of intersection of evolution, mathematical modeling, and genetics. So that's how I got started in pop gen.

    Cameron Ghalambor 7:16

    Yeah. So if we kind of scale back a little bit here, what exactly are the goals of population genetics? What is it that we're trying to actually explain? So in a general sense, we're maybe interested in explaining patterns of genetic variation over time and space. And this maybe is a bit technical, but, like, if I can lay out sort of the foundation that we're dealing with here. So we have some population genetic data that we get these days primarily, say, by sequencing individuals, and we have loci or genetic markers, like single nucleotide polymorphisms, or SNPs, with two alleles that individuals inherit, you know, one from each parent. And then, given some assumptions, we are trying to understand, you know, is the variation that we observe consistent with our expectations based on, you know, a certain population size, a certain mutation rate? You know, for example, can this variation just be explained by neutral processes, like genetic drift? Or is there something different going on, more different than you would expect based on neutrality, like natural selection? And so is that more or less kind of the starting foundation that a lot of the historic and current theory is based on?

    Andy Kern 8:44

    Yeah, absolutely. I mean, I think you already nailed it. You know, population genetics is, as a field, fascinated with the origin and maintenance of variation, genetic variation and phenotypic variation. So like, where do all these differences come from? You know, on the podcast, right, the three of us, we look different one from another. Okay, Cam, you and I look a little bit similar. But, you know,

    Art Woods 9:09

    I mean, no hair either.

    Andy Kern 9:11

    Oh nice, so there you go. So like, you know, where do all these differences come from? What maintains differences in populations? And so these are fundamentally evolutionary questions, right, like, over time, we believe that genetic variation changes, that is, the alleles that segregate in populations are changing, the frequencies of alleles in populations are changing. And we believe that over time, that leads to differences that accrue amongst species, right among, you know, sort of higher level taxonomic categories among all of the biological world that we see.

    Andy Kern 9:54

    You know, to me, one of the motivating things here is like, if you look at all of life on Earth, like there's a palm tree behind me, right? That palm tree and I are related. We go back, say, a billion years, and that palm tree and I share an ancestor. And from that ancestor, right, I'm walking around, that palm tree is green and soaking up sunshine, how does that happen? So, to me, the biological diversity that we see is the motivating observation behind population genetics. And then at a very small scale, population genetics zooms in and says, "Okay, how can we deal with this observation? Well, we'll start by trying to figure out the process." So what population genetics essentially boils down to, right, is describing what the process is at a very small scale, at the over-generations kind of scale. What are the forces that act? You know, you already mentioned genetic drift. I don't want to get technical about it, but there are a bunch of different evolutionary forces, things like natural selection, migration between populations or migration across a landscape. And these things all stack up and interact to shape genetic variation that we observe today, much in the same way that physical forces that operate, say, on a plane that's flying stack up, right. You have a plane, it's going through the air, there's friction against the wings, there's gravity pulling it down, there's lift that's generated by the airfoil and-

    Art Woods 11:39

    Right, and what's the balance of all those things?

    Andy Kern 11:41

    Exactly. All of these forces are interacting, and the plane's flying through the air. And so what population geneticists are interested in doing is, like, breaking those things down for a population.

    Art Woods 11:52

    Yeah, totally. Would you say that the average population genetics model, if there is such a thing, could be thought of as a kind of null model? So I think about this in relation to, say, the Hardy Weinberg equilibrium, right, which is kind of a null model about the distribution of genotypes in a population in the absence of all these different forces of evolution. Does that characterize these pop gen models sort of altogether? And you're looking for deviations from what they predict?

    Andy Kern 12:20

    Well, you know, I would say historically, that was very true. I think these days, you know, what we aspire to is having descriptive models that we can actually do parameter estimation out of. So, you know, historically, a lot of population genetics was about: here's a null model, here's some data, do I deviate from the null model? If so, I can reject my null model. A hypothesis-testing kind of framework. And I would say that these days, our models are a lot richer. And what we're really interested in doing is estimating parameters out of our model. For instance, how much migration is happening between these populations? How many individuals, each generation, are going from Oregon to California, in my poppy population? So I would say we're moving, hopefully, away from null models and more towards rich descriptive models, where we're trying to nail down what are the forces, and what are the strengths of those forces?

    Cameron Ghalambor 13:26

    So Andy, you mentioned the importance of the interplay between theory and empiricism in the goal of parameter estimation. So, if we can estimate certain parameters, how much of the variation that we observe in a population are we able to actually, like, accurately explain?

    Andy Kern 13:50

    Yeah, so that's an excellent question, Cam. There's noise in everything we do, just because populations are finite. Evolution is noisy. That is, there's a lot of stochasticity. And, you know, I'm not really sure how one could, or maybe I should say, I've never really thought about answering that question. But what we could do is we can definitely do, like, goodness-of-fit kinds of things. And I would say, you know, depending on what you're interested in, so for instance, if we're interested in, you know, the size of a population over time, I would say we can do a pretty good job of that. You know, we're accurate to, like, a factor of two over, and this is a weird measurement for the audience, but over, like, the last N generations, where N is, like, the population size. So that's pretty good. What that means is that we can get a sense of historical population sizes of organisms, and do what I would think is a pretty darn good job. Now, that might not be pleasing to some people, because I said, like, a factor of two, right? You know, there could be 100 whales, there could be 200 whales. But part of that is the nature of evolution, that evolution is itself a noisy process.

    Cameron Ghalambor 15:46

    Maybe we can transition to talking a little bit about machine learning now. So for the past several years, you've been leveraging machine learning. But before we talk about that, can you first kind of describe the traditional approach to working with and testing population genetic data and how well it matches expectations from theory?

    Andy Kern 16:06

    Yeah, so the traditional approach is very much a first principles mathematical approach, where one devises a generative mathematical model, much in the tradition of, say, Wright or Fisher, where what we try to do is describe, for a given set of forces, the average allele frequency change that we expect. And having done that, we can sort of play this forward over time and try to get a probability distribution. So it's like a real traditional kind of mathematical first principles approach to things. And I should say that those methods were really, really illustrative and really valuable. But they were limited, often, to single-locus descriptions, so like an individual portion of the genome, and hit their limit of utility, in some sense, when genomes started coming out. Once we had this ability to sequence huge amounts of DNA, we were left with making sense of observations at the chromosome scale.

    Art Woods 17:22

    So this flood of information that was too much, huh?

    Andy Kern 17:25

    Yeah, and to be, like, slightly technical for a second, essentially, what it meant was, we were faced with trying to write down probabilistic models for multiple loci. And that's really, really hard. Fundamentally, it's really, really hard. So one of the paths forward then that became quite popular was to use simulations to gain some kind of intuition or some kind of comparator, in much the same way that a null mathematical model would be a comparator, to the observations that we're getting. And so, in the 90s and early aughts, we really saw this sort of rise of simulation as being a powerful tool for making sense of the data that people were collecting.

    Art Woods 18:39

    So that makes sense, I think. If we had to put a little bit of flesh on those bones, and talk about that mapping process in a population genetic context. So what are you mapping to what? And what's the process by which you develop that map?

    Andy Kern 18:58

    So let me go back to this thing that I said a second ago. So, tracing historically, in like the 90s and early aughts, we had this real rise of simulation in population genetics: to deal with the complexity of the data that people were collecting, people started turning to simulation. And so what we increasingly needed was a way to use our simulations to learn about the populations. It wasn't enough to say, look, you know, one way we could do this, right, is this traditional thing, Art, that you were bringing up, this null model thinking. So we could do, like, simulations and say, okay, these simulations represent a null distribution, right? I get these different outcomes, I do a hundred simulations, do my data fall within those hundred simulations? If not, I could reject that model. Well, that's okay. Right, that was good. But could we do better? And the way that we wanted to do better, specifically, was we would like to use our simulations to estimate parameters from the model. Like, let's do simulations of, we'll do a really simple thing, I have two populations. And I want to estimate how much migration happens between them. Well, can I do that using simulation? Well, we could, okay, and people started devising different ways to do this.

    Andy Kern 20:34

    And I was doing my PhD around this time, when simulation was becoming really popular. And one of the things that I saw going on in the computer science world was sort of this machine learning thing taking off. And it occurred to me and others simultaneously that these machine learning tools could probably be used to make that link between simulation and data, that we could actually take our simulations and use them to train machine learning methods to estimate parameters that we were interested in. So that's part of the connection, I would say. Part of the connection comes through this idea of what I'll call simulation-based inference, using simulations to make sense of data. That's only part of it, I would say, because what we call machine learning these days is so broad. Simultaneously, one of the other things that was really, really, to my mind, powerful about the sort of machine learning mentality is that, for certain questions, we could do it without a model, using machine learning. We could just focus on data, and take a very sort of empirical, data-centric look at population genetic data and make sense of things.

    Andy Kern 22:13

    An example of this sort of data-centric approach: we developed a recent method that aims to predict where an individual is from, from their genotype, you know, from their DNA, essentially. And the idea there is that we don't have any kind of model. In fact, what we have is a large collection of observations. We have, say, a collection of individuals that we know where they're from, and we've sequenced their genome or some portion of their genome. And we can train a modern machine learning method to predict where a new individual, where we don't know where they're from, is from. This idea has, like, utility, you know; for instance, customs officers are interested in knowing where-

    Art Woods 23:17

    The smuggled lizard is from

    Andy Kern 23:19

    Yeah, where did this piece of ebony come from? Did it come from the Brazilian population that shouldn't be harvested? Or was this farmed? So that's a different flavor of this kind of thing.
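The genotype-to-location idea Andy describes can be caricatured with a nearest-neighbour sketch. The real methods are trained neural networks; this stand-in, and all of the genotypes and coordinates below, are invented purely for illustration:

```python
# Caricature of predicting an individual's origin from its genotype.
# Real methods are far more sophisticated; this k-nearest-neighbour
# stand-in and all data below are invented.
def predict_location(panel, genotype, k=3):
    """Average the coordinates of the k genetically closest samples."""
    def mismatch(g1, g2):
        return sum(a != b for a, b in zip(g1, g2))
    nearest = sorted(panel, key=lambda ex: mismatch(ex[0], genotype))[:k]
    lat = sum(coord[0] for _, coord in nearest) / k
    lon = sum(coord[1] for _, coord in nearest) / k
    return lat, lon

# reference panel: genotypes (0/1/2 allele counts) with known origins
panel = [
    ([0, 0, 1, 2], (44.0, -123.1)),   # "Oregon-like" samples
    ([0, 1, 1, 2], (44.2, -123.0)),
    ([0, 0, 2, 2], (43.9, -123.2)),
    ([2, 2, 0, 0], (-3.1, -60.0)),    # "Amazon-like" samples
    ([2, 1, 0, 0], (-3.0, -60.2)),
    ([2, 2, 1, 0], (-3.2, -59.9)),
]
lat, lon = predict_location(panel, [0, 0, 1, 2])  # sample of unknown origin
```

A confiscated sample whose genotype resembles the Oregon reference individuals gets placed near Oregon; no generative model of the population is involved, only labelled observations.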

    Art Woods 23:33

    Let me just play the dumb guy here for a minute. Like I understand. I think a lot of these things that you're saying, but if I sort of step back and say to myself, Okay, what is machine learning in this context? There's something I'm still not grasping. So if you can say it, for the uninitiated about what's the process and what's actually getting built under the hood, when you do machine learning?

    Andy Kern 23:57

    I think, you know, one of the things that's troublesome these days is like people throw around AI and machine learning as if it's magic.

    Art Woods 24:07

    Let's demystify, yeah.

    Andy Kern 24:09

    Right. Yeah, there's zero magic. All we're doing is we're taking a mathematical technique that allows us to take some inputs and connect them with outputs. So for this example that I just gave, the inputs are DNA variation, like DNA sequences, and the output is, like, longitude and latitude. The machine learning aspect is that we have an algorithmic tool, our machine learning model, that can, in a way that doesn't depend on process, connect our DNA sequences to location.

    Art Woods 24:52

    And when you say connect, you mean that there's like an entire sort of web of almost neural-like connections that are made inside software that are weighting inputs of different kinds and creating outputs. And the whole idea of constructing a model like that is to tune those outputs so that they very accurately predict the data outputs from the data inputs, right? So that's what it means to train up a model like this?

    Andy Kern 25:19

    That's exactly it. So that's absolutely accurate. And, you know, for listeners, if that's daunting, we could just think about fitting a line, right? So like in ninth grade, folks learned y equals ax plus b, right? And so y would be, you know, where you are on the y axis, right? x is where you are on the x axis. And then we've got two parameters: a, the slope, and b, right, our intercept. And so given some data, where we know x and we know y, we can fit a line. We can fit a straight line to some data, and we can get a and b, right? So we get an equation, y equals ax plus b. And then if you give me a new data point, x, okay, so maybe our model is: how big is my dog, given its weight? You know, how tall is my dog, given its weight, right? If you give me a new dog, and you only give me its weight, I can predict how tall it is. And so that's literally the kind of thing that's happening under the hood. The models are more sophisticated than y equals ax plus b, but that's essentially all that's happening.
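Andy's line-fitting analogy can be sketched in a few lines of Python. This is a minimal illustration of train-then-predict; the dog weights and heights below are invented numbers, not real measurements:

```python
# Andy's toy model: fit y = a*x + b by ordinary least squares.
# The dog weights (kg) and heights (cm) are invented numbers.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

weights = [5, 10, 20, 30, 40]    # x: dog weight in kg (training data)
heights = [25, 33, 48, 62, 78]   # y: dog height in cm (training labels)
a, b = fit_line(weights, heights)

def predict(x):
    """Predict height (cm) for a new dog of weight x (kg)."""
    return a * x + b
```

Given a new dog's weight, `predict` returns its estimated height, which is the same pattern, learn parameters from labelled data, then predict for new inputs, that the more sophisticated models follow.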

    Cameron Ghalambor 26:41

    One component of machine learning, or at least one flavor of machine learning, is what's referred to as "supervised machine learning." My understanding of that is that you have these inputs and outputs, but you also have to train the model for what the goal is, what it's supposed to be looking for. So like in the context of population genetics, if you wanted to have a supervised machine learning approach, and you wanted to train this algorithm to find, for example, signatures of selection across the genome. In that context, would you feed in sequence data where, like, here's where you have background levels of FST that don't vary, and then here's where you have differentiation occurring that's greater than that, so it looks different? That's the signature of selection. That's what I want this algorithm to go find, going through, like, you know, terabytes of data to find all those cases for me. Is that a good description?

    Andy Kern 27:55

    Yeah, absolutely. So you know, just going back to this toy example, because I think it's pretty approachable. So this dog example, like, we're gonna predict dog height based on dog weight. We can do that if we start with a collection of dogs of known height and known weight. The whole idea of supervised machine learning is that we start with this idea of a training set, labeled examples that connect our input with a labeled output, okay. Another example that I always give to classrooms is, you know, training a computer to detect apples versus oranges, right. It starts with a labeled kind of training set. And given this labeled training set, where I have dogs of known height and known weight, I train up some algorithm such that if I have a dog whose weight measurement I have, but I don't know their height, I can predict it. We contrast that with so-called unsupervised learning. In unsupervised learning, there aren't any labels to begin with. The tasks that we use for unsupervised learning are kind of different. You know, generally with unsupervised learning, we're thinking about things like clustering. And for supervised learning, we're talking about, like, classification or regression kinds of approaches.
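The supervised setup Andy describes, a labelled training set plus a predictor, can be sketched with a toy nearest-neighbour classifier for his apples-versus-oranges example. The feature values (a hypothetical diameter and redness score) and labels are invented for illustration:

```python
# Minimal supervised learning: a labelled training set plus a
# 1-nearest-neighbour predictor. The fruit measurements
# ([diameter_cm, redness]) and labels are invented.
def nearest_neighbor(train, query):
    """Return the label of the training example closest to query."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    features, label = min(train, key=lambda ex: sq_dist(ex[0], query))
    return label

training_set = [                 # the labels are the "supervision"
    ([7.0, 0.9], "apple"),
    ([7.5, 0.8], "apple"),
    ([8.0, 0.2], "orange"),
    ([8.5, 0.1], "orange"),
]
label = nearest_neighbor(training_set, [7.2, 0.85])  # a new, unlabelled fruit
```

A new fruit gets the label of its closest labelled neighbour; clustering the same features without any labels would be the unsupervised counterpart.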

    Art Woods 29:14

    So like in unsupervised you would say to the machine learning algorithm, how many groups are there in this data?

    Andy Kern 29:20

    Yeah, exactly, so that's a classic case of unsupervised learning. Things like ChatGPT, large language models that, like, almost everyone is familiar with, so it's probably worth talking about, right? Those are examples of something called self-supervised learning. Essentially, it's a supervised learning method, but what we're trying to do is we have, like, sentences, we mask out words or parts of words. And we try to train a method that can predict the masked-out word, okay. And when training has proceeded enough that it does a good job of that, well, these methods essentially can learn language at that point.
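The masking idea can be caricatured in a few lines: hide a word and predict it from its context. Real language models learn rich representations; here "learning" is just counting which word most often follows a given word in a tiny invented corpus:

```python
# Self-supervision in miniature: hide a word and learn to predict it
# from context. "Learning" here is counting which word most often
# follows a given word in a tiny invented corpus.
from collections import Counter, defaultdict

corpus = ("the dog chased the cat . the dog ate the bone . "
          "the cat chased the mouse .").split()

follows = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    follows[prev][word] += 1   # tally continuations: the "training"

def predict_masked(prev_word):
    """Guess a masked word from the word right before it."""
    return follows[prev_word].most_common(1)[0][0]
```

No human-written labels are needed: the text itself supplies them, which is exactly what makes the approach "self"-supervised.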

    Cameron Ghalambor 30:05

    So for, you know, training a model to, say, distinguish between apples and oranges, or height and mass, that seems pretty intuitive to me. But when you start getting into more complex situations, like if you're trying to understand the processes acting on genetic variation, I guess one thing that I'm curious about is: is it possible that you could train an algorithm to look for something, but actually have, like, bad assumptions? Or, like, train it to do the wrong thing?

    Andy Kern 30:44

    Yes, that's a great question, Cam. So this connects back to what we were talking about earlier with simulation-based inference. And part of why I went down this route was there are processes in population genetics that are very, very hard to model from first principles. Like we already mentioned, it's hard to describe from first principles what chromosomal-level variation will look like, or what it will look like under, say, a process with natural selection versus without. And this is stuff that I've worked on in the past, like if we're interested in trying to detect where in the genome natural selection has operated, given chromosome-scale data, it's hard to come up with mathematical models that would describe that. So what do you do? What we realized we could do is use simulations of the process as a standard for our training set. And essentially, the idea is that simulations provide us labels. We know what process we're using with our simulated data. And so, you know, we can simulate DNA sequences under process A and process B, and then train up a method that, when given DNA sequences from process A or process B, can differentiate them. So that's the idea.
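The simulate-then-train loop can be sketched as follows. Everything here is a deliberate caricature: "process A" and "process B" are stand-in noise models with invented parameters, not real population genetic simulations, and the "classifier" is just a threshold on a single summary statistic:

```python
# Simulation-based inference in caricature: when a likelihood is hard
# to write down, simulations can supply labelled training data.
# Process A/B are stand-in noise models with invented parameters.
import random
random.seed(0)

def simulate(process):
    """Fake summary statistic of a genomic window under one process."""
    base = 0.3 if process == "A" else 0.7   # invented class means
    return base + random.gauss(0, 0.05)

# the simulations themselves supply the labels for the training set
train = [(simulate(p), p) for p in "AB" * 500]

def class_mean(label):
    values = [x for x, p in train if p == label]
    return sum(values) / len(values)

# "training": place a decision threshold midway between class means
threshold = (class_mean("A") + class_mean("B")) / 2

def classify(x):
    return "A" if x < threshold else "B"
```

The key point Andy makes survives the caricature: the classifier is only as good as the simulations, so if the simulated processes don't resemble the real data, the labels, and hence the inferences, can mislead.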

    Art Woods 32:19

    That's really cool. So it's basically like, you know, you understand it, if your simulations produce a training data set that allows the algorithm to actually identify the real thing.

    Andy Kern 32:29

    Yes, exactly.

    Art Woods 32:30

    Huh, that's really neat.

    Andy Kern 32:32

    To my mind, this was really important, because we took a little bit of a leap in saying that we'll use simulation to train up these methods. And as Cam pointed out already, if your assumptions are faulty somehow, you can be misled. So what that means, essentially, is in the simulation-based inference world, okay, our simulations have to be adequate for us to get the ground truth out, to get the right answer out. Now, what I like about that is that there's again this sort of cycle. We can say, okay, do our simulations look like the data we observe? If not, let's go back to simulation. And so we're constantly probing what the processes that generate the data are, now through simulation. And that's quite appealing. So I should say, you know, if this sounds like a step too far: when we come up with mathematical models, we're also depending on simulation. Like, when we write down a mathematical model, to figure out if our math works, what we do is we do simulations. Okay, and we say, "Oh, does our math match our simulations? Okay, it does. Now, let's take our math out for a spin." So it's closely connected to that endeavor as well.

    Cameron Ghalambor 36:05

    So Andy, another question I'm kind of curious about is, like, I have a sample that I want to know where it came from. And I guess in this case, your sample size is one. But in the background, you've done simulations, you've looked at all the existing genetic variation and how it's distributed spatially. So your power to be able to then say where this sample came from, with some fairly, you know, predictable accuracy, isn't so much dependent on having a hundred individuals from that site. It's more about having that background of having trained on the data to know, you know, where to match it to.

    Andy Kern 38:04

    Okay, so in some of, like, the population genetic applications, what we found is that often we do a lot better with small samples. So we have a recent method. A postdoc that I work with, named Chris Smith, has really been doing a lot of great work at developing new methods for estimating parameters from spatial population genetics, like how do individuals move across landscapes, kind of thing. And Chris has now developed three related methods in collaboration with me and Peter Ralph at the University of Oregon. And in each case, these methods perform really well at small sample sizes in a way that competing traditional model-based methods do not. So for those particular applications, what you're saying is absolutely true. The machine learning methods are doing a better job at small sample sizes.

    Art Woods 39:50

    Hmm, nice.

    Cameron Ghalambor 39:06

    Yeah, well, and I have a real vested interest in this, because I actually have been looking at some of these papers. In our case, we have a location for an individual, a bird, and we want to know, based on a spatial pedigree, where its most likely natal territory is, where it was born. But the problem is that the males in the population tend not to move very far, while females move a lot. So if you knew the father, you could do a pretty good job of estimating where it came from, but if you only knew the mother, not so much. So could you train Locator or one of these programs to account for differences in dispersal distance between males and females to do a better job?

    Andy Kern 39:57

    One of the things that I'm really interested in these days, where I think we're going to be able to make a lot of progress, is heterogeneity across landscapes. Of course, individuals move across different portions of the landscape at different rates, right? If there's a mountain range, salamanders might have a hard time moving over it, whereas if they're on a plain, it might be a lot easier.

    Art Woods 40:06

    So this is like resistance landscape kind of approaches?

    Andy Kern 40:31

    Yeah, exactly. People have used this idea of resistance landscapes to look at heterogeneity. And we're about to release a new method based on machine learning that really does quite a nice job now of estimating this kind of heterogeneity.

    Cameron Ghalambor 40:47

    Do you do this in like a GIS kind of context, where you're looking at a layer within a sort of spatially explicit map?

    Andy Kern 40:57

    Yeah, we think of that as estimating maps. So what we're trying to do currently, and again this is work that Chris Smith has been leading, is estimate maps of things like dispersal and, simultaneously, maps of things like density. So, where are individuals on a landscape? The input here is: you give me genotypes from individuals across the landscape, even a few individuals across the landscape. And the output is: I'll give you a map of where the population is dense and where it's not, where individuals are free to move and where they're more restricted in their movement, generation to generation.

    Andy Kern 41:42

    This has also led, simultaneously, to a lot of conservation avenues. One of the things we've been trying to do, which a graduate student in the group, Gilia Patterson, has been leading, is essentially spatially explicit close-kin mark-recapture. So one way to estimate how many individuals there are is mark-recapture. And over the last 10, 15 years, people have taken a genetics angle on this: maybe we can do slightly better at mark-recapture if we look at close relatives, like if we can take genetic samples and say, "Okay, this is where sibling pairs are." That tells me something about population density. And so we've been taking a machine learning approach to this idea that allows us to get essentially spatially explicit estimates of density across a landscape by combining machine learning, genetics, and georeferenced genotyping.
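The intuition behind close-kin mark-recapture, that the frequency of sibling pairs in a sample tells you about population size, can be sketched in its simplest non-spatial form. This is a hypothetical toy, not the spatially explicit machine learning method Gilia Patterson is developing: it assumes every offspring is equally likely to come from any of `N_true` mothers, so the chance that a random pair shares a mother is about 1/N.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy setup: an unknown number of mothers; each sampled offspring
# gets a mother at random, and sharing a mother is the "recapture"
N_true = 500
n_sample = 300
mothers = rng.integers(0, N_true, size=n_sample)  # mother ID per offspring

# count maternal sibling pairs in the sample
counts = np.bincount(mothers, minlength=N_true)
sib_pairs = int((counts * (counts - 1) // 2).sum())

# P(random pair are maternal sibs) ~ 1/N, so invert to estimate N
n_pairs = n_sample * (n_sample - 1) // 2
N_hat = n_pairs / sib_pairs
```

In practice the sibling pairs would be identified from genotype data rather than known mother IDs, and the spatial version asks not just how many pairs there are, but where on the landscape they fall.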

    Art Woods 42:48

    I mean, you're starting to allude to this, but if you had to look forward a few years and say, you know, what are the next big things in population genetics that machine learning could solve, what's like on the distant horizon?

    Andy Kern 43:00

    So I'll start with something that's not strictly population genetics, but is very much related, and that I'm very excited about. A colleague of mine from Berkeley, Yun Song, and others have really been leading the charge in applying large language models to DNA sequences to make sense of genomes. The idea here, essentially, is, you know, ChatGPT is an example of a large language model. These are self-supervised machine learning methods that try to make sense of strings of characters. People have been applying this kind of thinking to DNA sequences, to genomes. And, boy, what they're able to get out is really starting to get neat. Essentially, these methods are able to, in a self-supervised way, make sense of different portions of the genome, to differentiate bits of the genome. As I said, this isn't specifically pop gen, but it is evolution, it is genetics. I think one of the things this will be really exciting for is automated genome annotation. People are collecting millions and millions of genomes; hundreds of thousands of genomes are being sequenced. No one's ever going to spend the time to annotate these, to figure out which portions are coding and which are noncoding, in an experimental way. So what we need is automated ways of annotating these genomes. And boy, these large language models are really going to do a good job, I think.
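Treating DNA as language starts with tokenization. One common input representation for genomic language models (used by DNABERT-style models, for example) is overlapping k-mers; the sketch below is a generic illustration of that step, not the specific method of any group mentioned here.

```python
# turn a DNA sequence into overlapping k-mer tokens, the way a
# genomic language model might tokenize its input before training
def kmer_tokens(seq, k=6, stride=1):
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokens("ACGTACGTAC", k=4)
# a 10-base sequence yields 7 overlapping 4-mers
```

During self-supervised training, some of these tokens would be masked and the model asked to predict them from context, which is how the model learns, without labels, to distinguish different kinds of genomic sequence.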

    Art Woods 43:07

    Hmm, yeah, yeah. And what if we push that even further and ask, you know, what are the roles of AIs and machine learning going to be in biology altogether in five or ten years? Is it going to go way beyond pop gen and ecological genetics?

    Andy Kern 44:57

    Oh, yeah. So it's really funny, because I think population genetics as a field is quite conservative and sort of slow-moving in some ways. Let me say that, personally, I had a very, very hard time convincing people that machine learning had anything to add to population genetics. For many, many years, I struggled to publish my papers, to get grants. And it wasn't until machine learning got buzzy in the popular press that people in my field really turned around and started accepting the kind of work that I and others were doing. Other fields of biology are way, way ahead in terms of using this every day. For instance, our friends in protein structure prediction: machine learning has eaten that world. I think what we're going to see is a tremendous rise of AI and machine learning in synthetic biology, in genetics, in trait prediction in breeding, in quantitative genetics. You know, our quantitative geneticist friends have been on this for a very long time in their own worlds; genomic prediction has been a thing for a decade plus. That's essentially just black-boxing the genome and saying, "Okay, well, there's this genetic variation, we get these phenotypes out, we'll just turn this crank and make bigger and bigger chickens." And that's been wildly successful. So part of this, Art, is that I really, really can't stress enough that machine learning is just statistics; there's no magic. I think it will be everywhere. I think it's already everywhere, and it'll just continue to grow and grow and grow. I'm excited about the really modern kinds of techniques that are coming out, like large language modeling with deep neural networks, which of course is something that I'm really into. I think these will be exploited in all sorts of avenues soon.

    Cameron Ghalambor 47:05

    So I have kind of a funny question for you, Andy, which is: you talked about, for example, methods that could be automated to look at whole genomes and annotate them. And it seems to me that it's only a small step from that to basically automating the whole thing, where you could just upload sequence data and ask any population genetic or evolutionary kind of question based on what you've uploaded, in the same way that you would interface with, say, ChatGPT. You know, you upload your sequence data, and then it's like, "Okay, tell me the population size. Tell me which regions of the genome are under selection." How far away are we from that sort of unified, single machine learning or AI interface, like a ChatGPT, where any kind of question you'd want to ask could potentially be answered? Is that science fiction?

    Andy Kern 48:15

    I think it's a nice idea. I mean, I love this idea, like, let's chat about my VCF.

    Cameron Ghalambor 48:19

    Yeah exactly.

    Andy Kern 48:20

    That's a great idea. Look, you know, I don't think that's science fiction. I mean, supposedly this year we're going to start seeing "upload my Excel spreadsheet and chat about my data." There are beta versions of this kind of thing that are already out. And I think that's awesome. Let me ask you a slightly different question, Cam, because I think your idea is a great one and could easily be on the horizon.

    Andy Kern 48:44

    A different question that's related, that sometimes gets asked, is: do we need scientists anymore? Do we need theoretical population genetics if we have simulation and machine learning methods and this kind of thing? And my answer to that is absolutely, yes, we do. The reason is because, fundamentally, we want to gain insights, and prediction is different from understanding. There needs to be a cycle between the methods I have and the data that I'm confronted with. And when the methods are inadequate to describe the data, we need a human in the loop; we need to understand, essentially, or make sense of what our methods are telling us, so that we can say, "Oh, we don't have an adequate description. Oh, we don't understand the process." So I'm kind of an optimist, or I'd like to think of myself as an optimist. I'm really excited about the future here, because I see all of these great algorithmic advances in machine learning as just amplifying our ability to make sense of a lot of data. And what that means is that we, as practitioners of science, will be able to ask questions that we weren't able to ask before; we'll be able to understand things at scales we weren't able to before. But we absolutely need to be in the loop, and we absolutely need theory and first principles too.

    Cameron Ghalambor 50:23

    Yeah, super interesting. That's a nice, I think, way to kind of wrap things up, but we always give our guests an opportunity. You know, if there's anything else you'd like to say, or any questions that we didn't ask you, something you really wanted to talk about, before we leave?

    Andy Kern 50:27

    I just want to thank you guys for the conversation and for the opportunity to be a part of this. It's really great.

    Cameron Ghalambor 50:46

    Yeah thank you.

    Art Woods 50:48

    Yeah, thank you.

    Cameron Ghalambor 50:57

    Thanks for listening. If you like what you hear, let us know via X, Facebook, Instagram, or leave a review wherever you get your podcasts. And if you don't, we'd love to know that too. Write to us at info at big biology dot org.

    Art Woods 51:12

    Thanks to Steve Lane who manages the website and to Molly Magid for producing the episode.

    Cameron Ghalambor 51:16

    Thanks also to Dayna De La Cruz for her amazing social media work. Keating Shahmehri produces our awesome cover.

    Art Woods 51:24

    And thanks to the College of Public Health at the University of South Florida and the National Science Foundation for support.

    Cameron Ghalambor 51:29

    Music on the episode is from Podington Bear and Tieren Costello.
