
No Priors Ep. 17 | With Karan Singhal

What if AI could revolutionize healthcare with large language models? Sarah and Elad welcome Karan Singhal, Staff Software Engineer at Google Research, who specializes in medical AI and the development of Med-PaLM 2. On this episode, Karan emphasizes the importance of safety in medical AI applications and how language models like Med-PaLM 2 have the potential to augment scientific workflows and transform the standard of care. Other topics include the best workflows for AI integration, the potential impact of AI on drug discovery, how AI can serve as a physician's assistant, and how privacy-preserving machine learning and federated learning can protect patient data while pushing the boundaries of medical innovation.

00:00 - Introduction
00:22 - Google's Medical AI Development
08:57 - Medical Language Model and Med-PaLM 2 Improvements
18:18 - Safety, cost/benefit decisions, drug discovery, health information, AI applications, and AI as a physician's assistant
24:51 - Privacy Concerns - HIPAA's implications, privacy-preserving machine learning, and advances in GPT-4 and Med-PaLM 2
37:43 - Large Language Models in Healthcare and short/long-term use

Sarah Guo (host) · Karan Singhal (guest) · Elad Gil (host)
May 18, 2023 · 42m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–0:22

    Introduction

    1. SG

      Welcome to No Priors. Today, we're speaking with Karan Singhal, a researcher at Google where he is a leader on medical AI, specifically on Med-PaLM 2, where he and a team are working on a responsible path to generative AI in healthcare. Google just announced the launch of its next generation language model, PaLM 2,

  2. 0:22–8:57

    Google's Medical AI Development

    1. SG

      which has improved multilingual, reasoning, and coding capabilities, and which is behind Med-PaLM 2, so it's a great time to be speaking with Karan about everything he and his team are working on. Karan, welcome to No Priors.

    2. KS

      Hey, guys.

    3. SG

      So, you've been working in this field for a long time. Um, tell us about how you ended up working on medical AI at Google. I think, uh, I saw you built a fake news detector using AI as a 19-year-old.

    4. KS

      Yeah, that was one of my first AI projects. I really got into AI thinking about how it could be used in socially responsible ways, and around the time of the 2016 election I was thinking, um, maybe a little bit naively, that AI-based solutions could be a bit of a help for things like misinformation, and detecting that. In the longer run, I've thought about it as kind of a more naive project, and I've been thinking more about how I can help shape the trajectory of AI to be more beneficial more broadly. For me, thinking about the medical setting has been motivated largely by the fact that it's a great place to think about concerns around safety, reducing hallucination and misinformation as well, um, thinking about how we can produce medical question answers that are less likely to be harmful, and all these kinds of things. And that motivation, I think, has driven us to this point of really going for the jugular in terms of thinking about how to train these models and make them better in this setting. And so, very excited about that kind of work.

    5. SG

      Have you been working on the medical domain your entire time with Google?

    6. KS

      No. I mean, for me, this is just something I've gotten into in the last year and a half. So I've been new to it, I've been learning from an excellent team, and it's been an amazing journey so far.

    7. SG

      What else has been the most interesting in your work at Google so far?

    8. KS

      Yeah. I started out, um, working in representation learning and federated learning. Representation learning in particular is kind of the technology underlying a lot of the deep neural networks of today, including GPT-3, GPT-4, and so on. It's largely about learning representations of text, of images, of other modalities, such that you can efficiently encode them, you can learn from them in the future, and you can generalize to new texts and images and so on. Work on this really started back in the beginnings of the deep learning era, around 2013, with convolutional neural networks and scaling those up, and then Word2vec and GloVe and all these things. Since then, we've been working on technologies around self-supervised learning, um, and around doing that in a privacy-preserving way. After a couple of years of working on that at Google, I had the opportunity to quickly grow and start to lead a team. I got to the point where I was thinking, like, "Okay. I've upskilled in a lot of ways. I've gotten to the point where I can mentor many other researchers. Now's a great time to be thinking about my next thing and going for something ambitious in terms of shaping the trajectory of AI." And so, about a year and a half ago, a few of us had the idea to look at the medical setting as a setting in which these concerns are especially important, and a ripe opportunity to think about this paradigm of foundation models in medical AI. Within Google, we had the opportunity to pitch what's called a Brain Moonshot, which is kind of an internal incubator program for ambitious research projects.
A lot of cool research projects that you've heard of from Google have eventually come out of this program. So we pitched that, we got it accepted and funded, and we got the ability to get a bunch of compute and to bring other folks on board with the sponsorship of a bunch of leaders. Our first thing together was really Med-PaLM, and so that was a really amazing thing for us to be able to work on together.

    9. EG

      Can you talk a little bit about PaLM and how that's related to Med-PaLM, and what PaLM is to begin with, and then how Med-PaLM is different?

    10. KS

      Yeah. Absolutely. The original Med-PaLM work built on this model called PaLM, which stands for Pathways Language Model. This is really an infrastructure that Google has built to be able to scale up large language model training, kind of Google-wide. The first PaLM model was released in 2022. It was this 540B-parameter decoder-only transformer model, at the time the largest densely activated model, and it realized breakthrough achievements in code, in multilingual capabilities, and in reasoning. I think a lot of the work on improving benchmarks that we're seeing recently with PaLM, Med-PaLM, and GPT-4 comes down to a lot of improvements that were made during the training of PaLM. Shortly after PaLM there was the Minerva work, maybe a few months after PaLM itself, where people were able to show on STEM benchmarks this kind of zero-to-100, or at least zero-to-60, effect, where you went from random chance to solid performance across a bunch of benchmarks. That laid the foundation for a lot of the work that Jason Wei and others have done on thinking about emergent abilities in large language models. For us, that was part of the motivation for looking at multiple-choice benchmarks for Med-PaLM as well. For Med-PaLM in particular, what we did was take PaLM, this kind of general large language model trained on web-scale data, and then further align it to the medical domain.
We evaluated the base model, but given its limitations in long-form medical question answering, we also thought about things like safety, factuality, and a low likelihood of outputting an answer with bias: what do we need to do to better align that model with this domain? Really, Med-PaLM was an attempt to do that.

    11. EG

      Yeah. So basically, it sounds like you started off with PaLM, and PaLM was tested against a bunch of different types of tests, right? So you could take the MCAT, or you could take other types of ... effectively tests for professional accreditation, or for knowledge and understanding. And then it sounds like you said, "Hey, this seems really interesting, right? We're starting to get really good performance here, so can we do something that's in the medical domain specifically?" And that was Med-PaLM. So how did you do that alignment that you mentioned? Was it some form of RLHF? Was it some other form of fine-tuning? Was it how you trained the model to begin with? Like, what was the difference between Med-PaLM and PaLM?

    12. KS

      Yeah, absolutely. When we tried evaluating PaLM in the medical setting, we noticed that out of the box it was performing pretty well on multiple-choice questions, and when we took a variation of PaLM, the FLAN-PaLM model, which was again work from Jason Wei and team, an instruction-tuned model, a model that's been trained to follow instructions better, it was also able to perform quite well out of the box. This was the first model able to perform above the pass mark on the MedQA set of USMLE-style questions. But then what we noticed is that when we evaluated it on long-form medical question answering, like actually getting the model to generate a response, there were a lot of limitations, and when we compared that to clinician performance, it actually didn't do super well. Really, that was the motivation for the Med-PaLM-specific alignment, and what we did there was instruction prompt tuning, the technique we explored in the Med-PaLM paper. It's a data-efficient technique, a technique that doesn't require too much data to work, because getting labels from doctors is expensive. It took a bunch of expert demonstrations of good behavior from doctors and used those to tune the parameters of the model, in a way that's a little bit more learned than prompting but also less expensive than full fine-tuning.
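[Editor's note] A minimal sketch of the soft-prompt idea behind instruction prompt tuning, for intuition only: the base model stays frozen, and only a small matrix of prompt embeddings is trained on the expert demonstrations. All shapes, values, and names below are illustrative assumptions, not the Med-PaLM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16     # embedding width of the (frozen) base model -- toy value
prompt_len = 4   # number of learnable soft-prompt vectors
seq_len = 8      # tokens in one expert demonstration

# Embeddings produced by the frozen model for one input sequence.
frozen_embeddings = rng.normal(size=(seq_len, d_model))

# The ONLY trainable parameters: a few prompt vectors, typically
# initialized near zero or from vocabulary embeddings.
soft_prompt = np.zeros((prompt_len, d_model))

def forward(soft_prompt, token_embeddings):
    # Prepend the learned prompt vectors to the token embeddings, then run
    # the frozen network (stubbed here as a mean-pool "score").
    full_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
    return float(full_input.mean())

score = forward(soft_prompt, frozen_embeddings)
# Training would compute a loss on the model's output for each expert
# demonstration and backpropagate into `soft_prompt` only; the rest of
# the model never changes.
```

The design point is the parameter count: here 4×16 prompt values are trained instead of the full network, which is why a small number of expensive doctor-labeled examples can suffice.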

    13. EG

      And so you did that, and then I guess if you start looking at this now shift from PaLM to PaLM 2, and from Med-PaLM to Med-PaLM 2, did you basically just reproduce that same approach for Med-PaLM 2, or did you do anything different there?

    14. KS

      Yeah. And this is the work of many folks other than myself, so just to preface it with that. I think a few things have been important. One is better objectives for pre-training, using something like a mixture-of-objectives training objective. That's been crucial, and it's work that started with UL2, a paper that was also released last year. Then two other things ended up being super important. One is following the optimal scaling laws that were empirically evaluated again in this work. There have been a few works that have tried to do this, from OpenAI and DeepMind, trying to understand, in this context, what the optimal scaling laws are with respect to data and compute, and how you trade those things off. And so this paper, again, found something

  3. 8:57–18:18

    Medical Language Model and MedPaLM 2 Improvements

    1. KS

      similar to the Chinchilla paper, which was that the total amount of data being used for these models was relatively low compared to the number of parameters, and that if we wanted to add in more data, we could do so and train a better model in a more compute-efficient way, so this model also did that. Um, so that's an important improvement as well. And the third thing was improvements in the data used to train the model, especially focused on including more multilingual data and more code data, in a bunch of different coding languages as well.
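[Editor's note] A rough worked example of the data-versus-parameters trade-off being described here, using the Chinchilla heuristic of roughly 20 training tokens per parameter and the common ~6·N·D FLOPs estimate. Both ratios are approximations from the literature, not exact laws.

```python
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training tokens for a dense transformer."""
    return n_params * tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    """Common back-of-envelope estimate: ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens

# A hypothetical 70B-parameter model "wants" about 1.4 trillion tokens.
n = 70e9
d = chinchilla_optimal_tokens(n)
print(f"{d:.2e} tokens, {training_flops(n, d):.2e} training FLOPs")
```

This is the sense in which earlier large models were "under-trained": at a fixed compute budget, a smaller model fed many more tokens can beat a larger one fed few.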

    2. SG

      Uh, maybe just zooming out a little bit in- in terms of, uh, when you might apply some of these different techniques to align a model to a specific domain. Like, do you have a framework in your mind for why you might do full, like, pre-training from scratch, why you might do fine-tuning, why you might do a more efficient form of fine-tuning, and when you can just get away with prompt tuning or- or prompting? Like, how do you think about that?

    3. KS

      Yeah, this is a great question. I think it really comes down to the data that's available, both in quantity and in relevance to a particular topic. If you have an infinite supply of data that's relevant for the specific problem you're trying to solve, then probably the best thing to do is pre-train everything from scratch and do everything end-to-end, if you don't mind the compute and money as well. If you're working on a task where general pre-training data from the web confers general advantages, which could be domain knowledge, or general abilities like reasoning that are applicable across many tasks, which I think is the case for medical reasoning as well, then it makes a lot of sense to build on top of an existing model, especially if you're sensitive to things like cost or compute, which most people are these days. So given an existing pretrained model, and training a large-scale pretrained model is, I think, a big hurdle for most teams and most people, the question is: do you prompt it, do you prompt-tune it, or do you fully fine-tune it? I think that largely comes down to data. If you have three to five examples, let's say, then I would prompt it. If you have maybe 10 to 50 examples, it would be either prompt tuning or fine-tuning. Generally in that realm, prompt tuning and fine-tuning perform similarly, and I would prefer prompt tuning if you're at all sensitive to things like compute or cost.
If you care about the best performance and you have more than 100 examples, then fine-tuning is probably your best bet, and it's not as expensive as full pre-training, um, if you're starting from a pre-trained model, of course.
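[Editor's note] The heuristic above can be written down literally. The thresholds are the rough numbers from the conversation, not hard rules:

```python
def adaptation_strategy(n_examples: int, cost_sensitive: bool = True) -> str:
    """Pick an adaptation method for a pretrained model from data volume."""
    if n_examples <= 5:
        return "few-shot prompting"
    if n_examples <= 50:
        # Prompt tuning and fine-tuning perform similarly in this range;
        # prefer prompt tuning when compute or cost matters.
        return "prompt tuning" if cost_sensitive else "fine-tuning"
    # With 100+ examples, fine-tuning usually wins on raw performance.
    return "fine-tuning"

print(adaptation_strategy(3))     # few-shot prompting
print(adaptation_strategy(30))    # prompt tuning
print(adaptation_strategy(500))   # fine-tuning
```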

    4. SG

      When you thought about evaluation of this model, um, you must have been surveying the landscape for the other sort of medical, um, probably, you know, science-specific and then medical-specific models. Like, what- what's out there, and how did you guys think about eval-ing and changing eval?

    5. KS

      Yeah, absolutely. This is not the first work to explore the potential of a large language model in science or biomedicine, and so I think it's important to acknowledge all the work that's come before us. What we saw when we first came into this work and tried to understand what other models existed and what other evaluation had been done was, one, there were a few exciting works from other teams, like Galactica or BioGPT and so on, that we thought we could learn from and benefit from, and so that was a really exciting thing to see. The second thing we saw was that there was a bit of a shortage of a systematic way of evaluating these models. It didn't feel like there was a systematic way to think about automated evaluation of the clinical knowledge of these models, for example via multiple-choice benchmarks. There were a few popular benchmarks, like the MedQA benchmark, but it varied across papers which benchmarks they were studying, and in some cases we felt these benchmarks were not high quality. So that was one thing we saw. Another thing we saw, which was more acute, I think, was a lack of detailed human evaluation across many of these works. There were some steps in this direction that we were able to build on, but for the most part, a lot of these existing models didn't have detailed human evaluation for a use case like medical question answering. To us, that was a significant limitation as we think about the real-world potential of these models, because when it comes down to it, we have to make sure this actually serves humans and is beneficial to humans.
And so, for us, that was a significant motivator for the Med-PaLM work being relatively evaluation-forward and thinking carefully about human evaluation with both physicians and laypeople.
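[Editor's note] The automated, multiple-choice side of that evaluation reduces to scoring predicted answer letters against a key. The questions and the pass mark below are placeholders, not real MedQA items (roughly 60% is the USMLE-style threshold commonly cited):

```python
def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    assert len(predictions) == len(answer_key)
    return sum(p == a for p, a in zip(predictions, answer_key)) / len(answer_key)

# Hypothetical model outputs vs. a hypothetical answer key.
preds = ["A", "C", "B", "D"]
key   = ["A", "C", "D", "D"]

PASS_MARK = 0.60  # approximate USMLE-style threshold, for illustration
score = accuracy(preds, key)
print(f"accuracy = {score:.2f}, passes = {score >= PASS_MARK}")
```

The human evaluation Karan describes is precisely what this kind of scalar cannot capture: axes like safety, factuality, and potential for harm in long-form answers.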

    6. EG

      How do you think about where that bar is? Because I think it's one of those things that, um, you know, having started a medical-centric company before, on the one hand, you really wanna be cautious in terms of providing people back information that's accurate, right? When I was working actively on the operating side of Color, we spent a lot of time agonizing over ensuring that the results we provided back to patients were as accurate as possible, particularly in the context of anything that had to do with core genetic or other information. The flip side of it is, I remember I took my son to the emergency room when he was younger, and the doctor said, "I'm gonna go research this case and I'll be right back." I had to go ask him a follow-up question, and I go around the corner and he's in his cube literally googling the symptoms, right?

    7. KS

      (laughs)

    8. EG

      And so it wasn't like he had some deep, accurate source. He was just making things up, right? Effectively, right? I mean, I've seen Google results and you're kinda clicking around, and he was just clicking around. I was like, "Oh my gosh," and I, I could see the query, right? So I knew he was looking at my kid's symptoms. He had no idea, right? And so there's this bar from, hey, it needs to be incredibly accurate and correct on through to, well, the state of the art actually isn't that amazing in many circumstances. And so how do you think about the right quality bar for these sorts of things, in terms of real use application or practice?

    9. KS

      That's an amazing, great question. I think, as you said, there are two competing forces here, right? Obviously the stakes are high in the medical setting, and counterfactually you wanna make sure that the information you provide, versus the information they would have otherwise gotten, is actually high quality, so you have to be very, very careful as you think about any informational use case for these models. At the same time, I think it's useful to recognize that people are searching for health information online, and indecision is a decision as well. A large percentage, roughly 10%, of searches on the internet are for health information, and some of these are coming from physicians themselves, as you mentioned, Elad. So I think there is a responsibility to think about how to shepherd this technology carefully and safely towards that real-world impact for patient health information, and I think that is crucial as well. One thing that has been missing from our work so far is really grounded evaluations in a specific use case, in a workflow, to show that there is a benefit, both in terms of safety in the short term and in terms of long-term patient outcomes as well. That could be a health informational use case, or it could be other clinical workflows, but I think that's one thing we have to really make sure we do, and are careful about, before any kind of real-world use case here.

    10. EG

      Yeah. That makes sense. It definitely feels like in the medical world, the importance of safety is paramount, and at the same time, there's very little cost-benefit analysis being done anymore.

    11. KS

      Yeah.

    12. EG

      And so there's, you know, interviews with Janssen and other sort of giants of the industry basically saying, we need to think about the benefit side, not just the cost side or the safety side. And what you're working on I think is so important: if you think of the really big areas of societal impact, it's what you folks are doing, right? If you could provide amazing health equity globally for everyone in terms of this information, how powerful is that? That's fundamental. And maybe education is the other one, right? It feels like AI really has promise in both of these areas, and so I always worry about how you make sure this can get to market, because it's so valuable, but there are gonna be all these regulatory or safety obstacles that in some cases are merited but in some cases may actually prevent the emergence of really important applications. So I think it's awesome that you folks are working on all this and are being so thoughtful about it. How do you think about what workflows this is gonna be most useful for? If you look at a lot of the bio or biomedical AI companies, for some reason, they keep doing drug development. A, why do you think that is? 'Cause this seems like such an important part of healthcare, and probably the bigger driver of healthcare efficacy. So, A, why is everybody just going and building another protein folding model or, you know, another molecule company, and B, where do you think are the best applications of what you've been working on?

    13. KS

      Yeah, these are great questions. I think on the drug discovery front, there's a bit of a playbook here which any new company looking for some revenue in the short term can follow, and that could be a safe option. There are, for example, existing AI-augmented pipelines for doing things like, given small-molecule chemistry, predicting things like absorption or toxicity ... and it's relatively easy to see that,

  4. 18:18–24:51

    Safety, cost/benefit decisions, drug discovery, health information, AI applications, and AI as a physician's assistant.

    1. KS

      you know, some of the more modern models, if placed into these pipelines, could perform better. So there's a relatively safe bet there, and I think that probably accounts for a lot of the popularity of that as a use case. I totally agree that there is a chance to go for the jugular here in terms of health information, for example. I think this is something that is gonna be crucial, but it's also something where a lot of the big players are more risk-averse. The people who gate access to health information, or provide access to health information, are also thinking not super counterfactually about the positive benefits of things; they're thinking more about the risks, and so that is also a concern that's been slowing folks down, both at big companies and smaller companies. I think there is an opportunity to think more about that and what it could look like, and the company that gets that right, or the set of companies that get that right, will also have a seat at the conversation when it comes to policy and regulation and things like that, and so they have the chance to shape what this looks like for the future. I think that's gonna be potentially quite impactful.

    2. EG

      Yeah. It seems very exciting, 'cause if you look at healthcare, it's 20% of GDP. Pharmaceuticals are about 20% of that, and then drug development is a fraction of that, right? So really, what you folks are focused on in terms of the types of models you're building is at least, you know, 16% of GDP. Maybe it's more than that if some of the pharma stuff is more clinical decision-making around who gets a certain pharmaceutical. Do you view this as a technology that's initially a physician's assistant? Do you view it as something that helps with adjudication of medical claims and billing? There are so many places where this can insert. I'm just curious, where do you think you'll see this technology popping up first?

    3. KS

      Yeah. I think we're already starting to see it in some clinical workflows when it comes to documentation and billing. There are a lot of companies and people thinking about taking models like GPT-4 and applying them in that setting, and that is definitely gonna be something. It's also gonna be something where players like Epic are gonna be able to partner with existing models and potentially deliver real value there. I think that's very exciting, and it's something that general-domain models will be potentially quite good at as well. Where there might be more of a need for specialized models is when it comes down to higher-stakes workflows, and I think that might look, in the short term, more like a physician's assistant. So imagine, for example, an agent that can work with a radiologist, help them interpret a scan, and leverage the benefits of AI to help contextualize a patient's medical record, or any previous scans, or different angles of scans that a patient has had, to help a radiologist write a more accurate report. That's the kind of thing which I think is in the sweet spot of being feasible today, leveraging the benefits of AI in terms of taking in additional context, and potential multi-modality, and all these kinds of things, and it's also potentially in a sweet spot with respect to regulation as well. So I think that's something that could happen in the medium to short term.

    4. EG

      How do you architect a model or workflow in this context to deal with things like HIPAA or patient privacy? So, I feel like healthcare data is unique from the context of what you're allowed to do in terms of who you send it to, with what permissions from users. So is it just you have to get the right user opt-in and then it's fine, or is there extra work that you need to do in terms of blinding data or doing other things relative to the prompts or queries you're sending in?

    5. KS

      Yeah. That's a great question. I think this is something that people are just trying right now and seeing what happens, and it's kind of interesting. People are just putting patient information into GPT-4; sometimes they're redacting information, and all these kinds of things. I think the ideal way to do this, obviously, is more privacy-forward, in terms of building trust with the relevant stakeholders and all these kinds of things. A starting point is just models that are able to automatically redact very sensitive information from being sent further down a pipeline. That's very low-hanging fruit that many people can do. There's also potential for HIPAA compliance within organizations. I know some organizations working in this space are partially HIPAA-compliant, or are trying to make that claim, and I think that's something useful that we should work towards as well. In the longer run, a lot of these concerns are actually unclear in terms of how things will work out. There's a bigger question about software of unknown provenance and how that will be used and regulated in the future. There could be some kind of situation in which these things end up being very hard to scale up and apply in the real world for high-stakes settings, but I think we'll probably end up with a scenario where it becomes obvious that we need to and that we must, and that doing so will improve patient outcomes.
And then, I think, it'll be time to have a serious conversation about what regulating these models and making sure privacy concerns are mitigated looks like, and I think we have yet to have that discussion.
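[Editor's note] The "low-hanging fruit" redaction step mentioned above can be sketched with a few regular expressions. This is purely illustrative: real de-identification (HIPAA's Safe Harbor method lists 18 identifier categories) needs far more than pattern matching, and every pattern and example string below is an assumption.

```python
import re

# Toy patterns for a few obvious identifier types.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def redact(text: str) -> str:
    """Replace matches of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Pt seen 3/14/2023, callback 555-867-5309, contact jdoe@example.com."
print(redact(note))  # Pt seen [DATE], callback [PHONE], contact [EMAIL].
```

Production systems would typically layer a trained named-entity model on top of rules like these, since names, addresses, and free-text identifiers don't follow fixed formats.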

    6. EG

      Yeah. HIPAA's kind of interesting, in that it was an incredibly well-intentioned piece of legislation, but the flip side is that it's really backfired in all sorts of ways in terms of actual patient good. You see that sometimes as well in terms of what you can actually do with your own data as you sign up for a clinical trial or elsewhere, where sometimes you're constrained from accessing it. I know of one example where somebody had, um, brain cancer, a glioblastoma. He was a researcher at MIT, and he participated in a small clinical trial, and then they were unable, because of compliance, to give him his own data ... so that he could try and discover drugs against his own glioblastoma, his own brain cancer, right? So sometimes you see these very well-intentioned approaches, in the protocols on a clinical trial, or in HIPAA, or other things, that are very well-intentioned in terms of what they wanna do, but may backfire as you start to enter the modern data world. I think that legislation is now almost 30 years old, right? It was set up for a world that's very different from what we have now, in terms of the liquidity and fluency of your ability to interact with information, and patients driving their own diagnoses, and things like that. So my hope is that some of these things get rebalanced

  5. 24:51–37:43

    Privacy Concerns - HIPAA's implications, privacy-preserving machine learning, and advances in GPT-4 and Med-PaLM 2

    1. EG

      in the AI world, since it could be so valuable to things like what you're doing.

    2. SG

      I was just gonna say that is the status quo. You've also worked on the areas of privacy-preserving machine learning and federated learning. Those areas have broadly taken a backseat to, let's say, scaling and aligning these more centralized models. Do you see a place for that technology in this field?

    3. KS

      Yeah, that's a great question. As I mentioned before, the first couple years of my career were really focused on privacy-preserving machine learning and federated learning: scaling it up and coming up with new algorithms that can learn new things without sending all the data to a centralized place. In a lot of ways, that has a very natural fit with this setting, and part of my motivation when I first started working on it was bringing that expertise in. One hesitation I have is that a lot of the most impactful work in this setting is going to happen with the largest and most capable models, at least for the next few years. And one thing we're seeing is that models like GPT-4, Med-PaLM, and Med-PaLM 2 do surprisingly well without any patient health information. Med-PaLM and Med-PaLM 2, for example, are trained without any patient health information; they just take all the knowledge of PaLM and PaLM 2 and align it to behave in a certain way. So in the short term, it seems like we can get fairly far with that. In the longer run, coming back to that question of data, and how you think about training a model depending on how much data you have and how relevant it is, the ideal would be to have access to all of the data, but in a privacy-preserving way.
You know, in a way where people are in control of their data, able to revoke access to it, and able to benefit from that shared understanding of their data. That's the ideal world, but there are real-world obstacles to doing federated learning on health data, which increase the activation energy to the point where, in the next few years, I doubt the biggest advances are going to come from federated learning approaches. But there are intermediate solutions, which people sometimes refer to as federated but which maybe are not technically federated: things like trusted execution environments, or other environments in which models are running but where folks at Google don't have direct access to the data or the models. That gives you the ability to silo any patient health information, or any other quite sensitive data, from engineers or other folks at big companies or small companies.
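The federated learning idea Karan describes can be sketched in a few lines. This is a minimal toy version of federated averaging (FedAvg), where each client trains a model on its own private data and only model weights, never raw records, leave the client. The linear model, learning rate, and "hospital" data here are all illustrative assumptions, not anything from Med-PaLM's training.

```python
# Toy sketch of federated averaging: clients train locally on
# private data; the server only ever sees averaged weights.
import random

def local_step(weights, data, lr=0.1):
    """One local epoch of SGD on a client's private (x, y) pairs
    for a 1-D linear model y ~ w*x + b."""
    w, b = weights
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return (w, b)

def fed_avg(clients, rounds=200):
    """Server loop: broadcast weights, collect locally trained
    weights from each client, average them."""
    w, b = 0.0, 0.0
    for _ in range(rounds):
        updates = [local_step((w, b), data) for data in clients]
        w = sum(u[0] for u in updates) / len(updates)
        b = sum(u[1] for u in updates) / len(updates)
    return w, b

# Three hypothetical "hospitals," each holding private samples
# drawn from the same underlying relationship y = 2x + 1.
random.seed(0)
clients = [[(x, 2 * x + 1) for x in [random.random() for _ in range(20)]]
           for _ in range(3)]
w, b = fed_avg(clients)  # converges near w=2, b=1 without pooling data
```

Real federated systems add secure aggregation and differential privacy on top of this loop; the trusted-execution-environment approach Karan mentions sidesteps the algorithmic change entirely by isolating where the data and model run.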

    4. SG

      Yeah. Going back to perhaps more promising near-term areas of research: you've had this idea of building a medical assistant as a sort of laboratory for safety and alignment research. Can you talk about that?

    5. KS

      Yeah, absolutely. This is a lot of what got me thinking about the setting, especially coming in as somebody without much medical expertise. I was really thinking about what the big things are that I could do to help shape the trajectory of AI, or nudge it in a more beneficial direction, and taking AI safety seriously, in terms of both short-term and longer-term risks, was important to me. One thing I've become more convinced of over time is this: many organizations right now, Google, DeepMind, Anthropic, OpenAI, are looking at the idea of a general chat assistant, and, instead of doing alignment research in a vacuum, are looking at that setting as a way to refine these models and better align them to human values. I think there's a good chance that the medical setting, for example medical question answering, or maybe something broader, ends up being a better scenario to study concerns about technical safety and to mitigate concerns like misalignment with human values or hallucinations. This comes down to things like making sure the incentives are aligned with respect to releasing products: if any organization wants to release products in this space, it actually needs to work on these problems, more so than with something like ChatGPT. It also comes down to the stakes of the setting. Everybody feels the stakes are high enough that these issues are especially important, and there's no debate about that.
There are also more subtle technical points. One issue alignment researchers are now working on is scalable oversight: how do you give human feedback to a model when that feedback might not be well-informed, or might be unreliable, because AI capabilities are starting to reach human level? When we get to that point, things like RLHF start to fail and it becomes unclear what to do. I actually think the medical setting is a scenario in which this is already more obvious. You're already in a setting in which you need experts to evaluate answers, and one thing we're seeing with Med-PaLM 2, as we get closer to physician-level performance on medical question answering, is that it's hard to tell the difference anymore. It's hard to tell the difference between different models, and between models and physicians. When you're at that point of uninformed oversight, it becomes very tricky to think about aligning to human values, so that problem is super well-motivated in this setting, and that's something I'm very excited about.

    6. EG

      What do you think is the solution to that? Because if you look at the gaming analog, which is probably a bad analog here, once machines were better than humans at things like Go or chess, people started learning from the things the machines were doing that were unique or creative or different, where the problem-solving was very different. And if we really want this technology to be incredibly valuable for medical applications, in some cases we may end up with suggestions that will really work well but that, to your point, people may misinterpret or misunderstand. So how do you think about evaluating things when the AI is better than a person, better than an expert, at medical adjudication?

    7. KS

      Yeah, this is a really interesting question, and I don't think I have all of the answers, but there are approaches that people at Google and other organizations have been looking at. A couple of ideas here are interesting and useful. One is self-refinement, or self-critique, of these models: a model takes its own responses, gives critiques, often guided by human feedback (some of these techniques use no human feedback, and I'm not sure those are as valuable), and then uses those critiques to produce better answers. That's one line of approaches. A second line is debate. The idea is that it's easier for a human to judge a debate between two different answers than to judge an answer itself; the standard for verification is a bit lower, so via things like debate, humans can judge responses they potentially couldn't judge otherwise. Another thing people are working on is taking AIs that are less capable and using them to supervise AIs that are more capable. This is partly the motivation of RLHF as well: even though it's about human feedback, it's about training a reward model that takes human feedback into account, and from then on it's AI feedback, with your RL algorithm getting rewards from the reward model. RLAIF, or constitutional AI, builds on that idea, but there are limitations to that approach as well.
If you ask researchers across all these organizations, "Have we solved this problem? Do we know what we're supposed to do?" I think most of them would say no. It seems like a pretty consequential problem, so I'm excited for more folks to work on it.
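The self-refinement loop Karan outlines (draft, critique, revise) is simple to express in code. This is an illustrative sketch only: `ask_model` is a hypothetical stand-in for any chat-model call, and the prompt wording is an assumption, not anyone's production prompt.

```python
# Sketch of the self-refinement / self-critique loop: the model
# drafts an answer, critiques its own draft, then rewrites it.
def self_refine(question, ask_model, n_rounds=2):
    """Draft -> critique -> revise, repeated n_rounds times.
    `ask_model` is any callable mapping a prompt string to a reply."""
    answer = ask_model(f"Answer this medical question:\n{question}")
    for _ in range(n_rounds):
        critique = ask_model(
            "Critique the answer below for factual errors, missing "
            f"caveats, and unsafe advice.\nQ: {question}\nA: {answer}"
        )
        answer = ask_model(
            "Rewrite the answer to address the critique.\n"
            f"Q: {question}\nA: {answer}\nCritique: {critique}"
        )
    return answer
```

The debate approach Karan mentions next changes only the judging step, not this loop: instead of a human grading one answer, two model-generated answers argue, and the human (or a weaker model) judges the exchange.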

    8. EG

      Yeah, one thing I feel would also be generated as a side effect of all this is that you end up with these really interesting closed-loop datasets over time, which may be unique outside of an EMR or a really robust medical record system. Because if you effectively have a physician's assistant or something like it, and you then have the endpoint of what happened based on treatment, you actually have a really interesting retrospective training set for data mining.

    9. KS

      Yeah, I mean, I think that's like another opportunity for feedback for these models, which, you know, could have a huge impact on the world.

    10. EG

      Yeah, it'll actually be data-driven medicine, which sometimes happens but sometimes doesn't, so it's very exciting. I guess one more question. There's amazing potential here, and if I look at the history of medical technology: in the 1970s there was something known as the MYCIN project at Stanford, where they built an expert system, a computer program of its day that was a precursor to some of the things that eventually happened in AI. That expert system outperformed Stanford's medical staff on predicting the infectious disease somebody had. So nearly 50 years ago we had a machine that outperformed people at diagnosis, but it never got adopted, and often when I look at medical technologies there's almost an anti-adoption curve, in some cases, for the things that may be most impactful. How has the medical field embraced, or not embraced, these AI models? Is it different this time? Are people excited about it? Does it depend on the type of physician? I'm curious what the reaction has been from the medical community to date.

    11. KS

      Absolutely, that's a really great question. When we started this Brain moonshot, as we call it within Google, that was actually our motivation. These models already existed, and there was this opportunity to catalyze the medical AI community to think about them carefully, to think about the promise there, and to catalyze the AI community to think about how to resolve the remaining limitations that would prevent real-world uptake. When we started, there was much less conversation about the potential of large language models and foundation models for healthcare. Partly because of our work, and largely also because of other work that's gone on, with GPT-4 and the excitement around it, there's now much, much more conversation about how these models can be used productively in this setting, and that's really exciting. There's a lot of optimism, but also a lot of justified concern about the potential limitations of these models and how we can get over them. Personally, from giving talks to different groups and chatting with different stakeholders, I see widely held optimism about this technology and its potential, but also a little bit of fear, the kind people have seen in other domains. Programmers often feel a little bit of fear when they see GPT-4, for example. It's not necessarily a fear that jobs will be replaced in the short term; it's more a fear of: look how fast things are moving. This is nuts.
Think about just the improvement from Med-PaLM 1 to GPT-4 to Med-PaLM 2 in three months. It's absolutely crazy, and it's definitely an inflection point for AI, as you guys know. It's a good time to think about the most important problems we need to solve, rather than getting caught up in the hype wave and forgetting to solve them.

    12. SG

      Coming back to Elad's point earlier, about the actual benefits of these technologies at scale if adopted, even at human level or at some defined superhuman level: whether we can come to some sort of agreement as a democratic society about what eval looks like is really important. Because if you think about the status quo for somebody with a complex case and a median background in America, what do they know about the error matrix of their doctor, or, in a field that's also advancing in parallel to AI, about the specific rare condition they have? It's not super encouraging, right? And so in terms of leverage

  6. 37:4342:18

    Large Language Models in Healthcare and short/long term use.

    1. SG

      for a field where the status quo is not sufficient, not as a comment on the class of physicians and researchers, but in terms of the quality of care we want to be able to offer every person, it seems like we want to set a reasonable safety case, not an unlimited safety case, which I think is one of the things that has held back other mission-critical AI applications in the past. Maybe on that note, one last ask, in terms of encouraging some optimism. You're working on the state of the art in this field and thinking about the barriers to applied use. Five years from now, how do you hope we are using large language models in the medical field?

    2. KS

      Yeah, I think about this in two broad buckets, two broad types of things large language models can do in the medical field. The first is raising the standard of care very broadly. That looks like increasing access to health information, providing assistance to physicians, like the radiology example I gave earlier, and potentially clinical decision support: double-checking a doctor's decision, or quality assurance on a radiologist's report. If a radiologist is dictating a report and says, "No pleural effusion seen," but it's written down as "Pleural effusion seen," maybe an AI double-checks that and makes sure that's what was intended. Augmenting telemedicine is another short-term opportunity that's very achievable in the next five years. The other big bucket is augmenting scientific workflows. This could be a longer-term thing than five years, but there are short-term pieces too: looking at correlations across modalities in existing data to find novel biomarkers for diseases we know about, or using large language models as research assistants. There's already a lot of work on augmenting literature search with large language models, and a lot of opportunity there. That goes a bit beyond what Med-PaLM is likely going to do, but it's something that's going to be really promising for the future of AI.
Because in the long term, when things go really well with AI, it's going to be because we've solved a lot of the most pressing scientific problems of today, and that's going to be because AI augmented scientists, helped scientists, helped us figure out the things we're missing. There's a lot of potential there, so I'm also really excited about that in the long term.
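The report QA example Karan gives ("No pleural effusion seen" transcribed as "Pleural effusion seen") boils down to checking that a finding's negation status matches between the dictated draft and the written report. A real system would use an LLM or clinical NLP; the finding list and the naive "preceded by 'no'" rule below are purely illustrative assumptions.

```python
# Toy sketch of dictation-vs-report QA: flag findings whose
# negation polarity differs between the two texts.
import re

FINDINGS = ["pleural effusion", "pneumothorax", "consolidation"]

def negated(text, finding):
    """Crude heuristic: is the finding preceded by 'no' in the
    same sentence? Real negation detection is far more involved."""
    return re.search(rf"\bno\b[^.]*\b{finding}\b", text.lower()) is not None

def mismatches(dictated, written):
    """Return findings mentioned in both texts with opposite polarity."""
    out = []
    for f in FINDINGS:
        if f in dictated.lower() and f in written.lower():
            if negated(dictated, f) != negated(written, f):
                out.append(f)
    return out

flags = mismatches("No pleural effusion seen.", "Pleural effusion seen.")
```

Here `flags` surfaces the polarity flip on "pleural effusion" for a human to confirm, which matches the assistant-in-the-loop framing from earlier in the conversation: the AI double-checks, the radiologist decides.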

    3. SG

      Awesome. Wrapping up, is there anything else you think we should touch on?

    4. KS

      Yeah, absolutely. For real-world uptake of these models, there are a few large language model capabilities that in some cases already exist, but where we need to figure out the right way to do them. One is multimodality, which is something we're working on and previewed last week at I/O. Grounding in authoritative sources is important as well: thinking about how these models can use Toolformer-like approaches to, for example, query authoritative medical information like a human would, but potentially better. That's also one way of getting around the risk aversion you see in this area with respect to health information. If you can attribute information to an authoritative source, that has moved this area forward at big companies before. Where Google does that with health information today, it's largely because it can attribute things to the Mayo Clinic and other organizations, so I think that's going to be really important for moving this forward. Also, solid research into better ways to take in human feedback. The jury's still out on how best to even collect human feedback; people are still debating whether pairwise comparisons or rewrites are the best thing to do, and that's a valuable thing to think about. Another question is how you actually use that human feedback in the most valuable way, especially given all the scalable-oversight concerns you mentioned.
I think that's a significant limitation of Med-PaLM as it is today, and there are a lot of exciting things to do. A lot of these questions are foundational questions for AI more broadly, but they become more acute and more relevant in this setting.

    5. SG

      (instrumental music plays) It's been great to have you on No Priors. Thanks for doing this.

    6. EG

      Yeah, thanks so much for joining.

    7. KS

      Thanks guys.

Episode duration: 42:18
