Artificial Intelligence in Pulmonary and Critical Care: A Pro-Con Debate
Video Transcription
Everyone, thank you so much for joining us this morning. I am delighted to present what will hopefully be an invigorating discussion about the use of artificial intelligence, or as Vinny so eloquently called it earlier this morning, augmented intelligence, in both critical care settings as well as pulmonary settings. We have today a fantastic panel of discussants. We have Drew Michelson here from Wash U. We have Mike Sjoding here from the University of Michigan and Vinny Liu here from Kaiser Permanente California. I am Catherine Chen here from UT Southwestern in Dallas, Texas, and I'm going to spend hopefully the next 15 minutes or so trying to convince you why AI and ML are not quite ready for prime time with regard to early warning systems. I have no disclosures to make regarding this talk. We will be covering the role of early warning systems in both predicting and managing clinical deterioration, understanding the role of artificial intelligence in the creation of these prediction models, and understanding the future directions of artificial intelligence in early warning systems. So, I kind of want to start off with the good, right, because back in 2004 the Institute for Healthcare Improvement recommended the use of early warning systems and rapid response teams, and since then, the implementation of these kinds of things has really exploded. I'm sure everybody in this room has had an interaction, whether knowingly or not, with an early warning system, whether it was homegrown or commercially developed, and we probably could all stand up here and tell stories of how an early warning system either saved us or just annoyed us. But really, we're here to be data-driven, so let's talk about the data. What we do know is that the discriminative performance of these early warning systems is very good. I have here a table from a systematic review that summarized the area under the curve for five different early warning systems, and as you can see, there was a strong predictive ability for in-hospital mortality for four out of these five early warning systems, as well as a very strong predictive ability in all five for predicting cardiac arrest within 48 hours of the trigger. I'm not here to debate that whatsoever. What I am here to discuss is the fact that being able to predict a poor outcome is very, very different from preventing that poor outcome, so, of course, that begs the question, does EWS implementation result in a better patient outcome? In 2021, a Cochrane systematic review tried to answer exactly that question. They compiled the results of 11 different studies, involving over half a million participants, that examined EWS implementation and its effect on preventing patient deterioration in the acute inpatient setting, and for the primary outcome of hospital mortality, 10 of those studies, four RCTs and six non-RCTs, demonstrated that EWS and rapid response system implementation had little to no effect on hospital mortality. You've got the randomized controlled trials, let's see if this works here, off here to the left, and the non-randomized controlled trials here on the right, separated by that line you see. And because we know that hospital mortality is a very difficult endpoint to move, let's just look at the outcome of preventing unexpected cardiac arrest or respiratory arrest. Even there, they still didn't do that well.
No benefit in any of the three randomized controlled trials here off to the left, or the six non-randomized controlled trials here off to the right that reported on this outcome. The Cochrane review was only able to report changes in hospital length of stay for one out of the 11 studies that were included in their analysis, but there was actually an earlier systematic review from 2013 that examined the effect of EWS on length of stay, and I've summarized the results of those five studies here, and as you can see, only one right here demonstrated a statistically significant difference, a decrease in length of stay following EWS implementation. Ultimately the Cochrane review authors concluded that there was low-certainty evidence that early warning and rapid response systems make little to no difference in mortality, unexpected cardiac or respiratory arrest, and hospital length of stay. But unfortunately the story doesn't quite end there, because in the Cochrane review they summarized that EWS implementation didn't have any statistically significant benefit in reducing unplanned ICU admissions, as you can see with the forest plot that we have right here. But we all know that EWS implementation, actualization, and operationalization come at a cost, and that cost may be the actual implementation itself, but it may also come as increased ICU utilization, which an earlier systematic review did demonstrate: ICU utilization increased in three of the five studies they examined following EWS implementation. And one group of Dutch researchers actually found the same thing at their own institution when they implemented an early warning system, and so they performed a cost analysis to determine the mean cost of EWS implementation per patient day, institution-wide. They estimated extra ICU materials at almost 23,000 euros, and each extra ICU day at 1,600 euros, and using a before-after analysis they determined that there were 14 additional ICU days per 1,000 patient days. That meant that their 194 unplanned ICU days over a little more than 16,000 patient days pre-implementation went to almost 800 unplanned ICU days over a little more than 30,000 patient days post-implementation. And they calculated that the total extra ICU cost following implementation, given this change in ICU utilization, was 705,000 euros, for a mean extra cost of almost 23 euros per patient day. This is institution-wide, and 85% of that cost was the implementation of the EWS itself. But you're like, cool, you're not here to talk about EWS, you're here to talk about AI and EWS. So the reason I'm grounding this in a discussion of EWS is because, in order to understand how AI and ML can augment EWS, you first have to understand how EWS fails, right? And there is a very nice review paper, this is an adapted figure from that paper, where they did a study that characterized the reasons for failure to rescue into five overall themes, which as you can see here are governance, rapid response team, professional boundaries, clinical experience, and EWS parameters. And what I really want to focus on here are the ones at the ends, EWS parameters and governance. So under EWS parameters, they characterized subpopulation adjustments and parameter adjustments, tweaking the model to fit the patient population you're supposed to be triggering on.
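As a quick sanity check of the cost figures quoted above, here is a minimal back-of-the-envelope sketch in Python. The inputs are the rounded numbers from the talk (the exact study values may differ slightly), so the output is illustrative only.

```python
# Rough re-creation of the Dutch EWS cost arithmetic quoted above.
# All figures are the rounded numbers from the talk, so results are approximate.

pre_unplanned_icu_days = 194        # before EWS implementation
pre_patient_days = 16_000           # "a little more than 16,000" patient days
post_unplanned_icu_days = 800       # "almost 800" after implementation
post_patient_days = 30_000          # "a little more than 30,000" patient days

pre_rate = pre_unplanned_icu_days / pre_patient_days * 1000
post_rate = post_unplanned_icu_days / post_patient_days * 1000
extra_icu_days_per_1000 = post_rate - pre_rate      # roughly 14 extra ICU days per 1,000 patient days

total_extra_icu_cost = 705_000      # euros, as reported
cost_per_patient_day = total_extra_icu_cost / post_patient_days  # roughly 23 euros

print(f"Extra unplanned ICU days per 1,000 patient days: {extra_icu_days_per_1000:.1f}")
print(f"Extra ICU cost per patient day: {cost_per_patient_day:.1f} euros")
```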
And then under governance, they were able to identify things like lack of policies and protocols, lack of knowledge of those policies and protocols, lack of overall education, and staffing shortages as reasons why EWS and rapid response teams fail. So what we've demonstrated with the data I've shown is that EWS is really good at that part on the very far right, the EWS parameters: we're really good at being able to make subpopulation adjustments and parameter adjustments in order to make sure that our early warning system is properly tuned to the patients for whom we're caring. But the same things that cause EWS and rapid response teams to fail, such as the lack of policies and the staffing shortages, can actually be amplified when you implement AI and ML on top of these kinds of flawed systems. So how can AI and ML amplify harm? We know that EWS underperform when data are missing, right? Because the EWS can only act on the data it can see. But what is not often appreciated is that data missingness is nonrandom. When a patient has data missing in one area, they're more likely to have data missing in other areas, and that makes data missingness monotone. And the very missingness of that data may encode important patient information, such as interactions with healthcare systems, whether you're looking at the outpatient setting or the inpatient setting, and a variety of other reasons why these important characteristics are not being documented in the EHR, and therefore these EWS are unable to act on them. And when you create an AI or ML model off of imperfect data, those imperfections may actually be amplified. As you can see from this very nice graphic here, these algorithms are trained on biased data and therefore create biased and low-quality analyses, which results in poor patient outcomes, worsening the mistrust of healthcare, and thereby entrenching this kind of data missingness, or failure to interact with the healthcare system, even further. But this risk isn't theoretical. I'm not talking just about that nice graphic there where someone basically drew a nice circular chart, right? Because there actually are, already in use, large commercial risk prediction tools that healthcare systems use for population health management. And there was a very nice study done where they examined the bias of one of these risk prediction models, and as you can see here, along the x-axis is the algorithm risk score. Patients who are white are in yellow, patients who are black are in purple, and you can see that for the number of active chronic conditions, at every single risk score black patients had more chronic conditions. For hemoglobin A1c, with the same risk score along the x-axis, the mean hemoglobin A1c along the y-axis is higher for black patients as compared to white patients. And this is using what was theoretically unbiased data, right? There is still a statistically significant difference in the AI's ability to identify patients who, on a clinical basis, have different severities of illness. But I'm not here just to be cynical, right? There needs to be a way forward, because it's very clear that AI is going to be a very impactful tool in healthcare over the next decade or two. So I'm not here to tell you to abandon AI. We absolutely need to be adopting AI and ML in critical care, but we really need to be mindful of how we're using this tool. And knowing that these tools are going to be available, how do we assess them properly?
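To make the idea of nonrandom, monotone missingness concrete, here is a small hypothetical sketch (pandas assumed; the variables and values are invented for illustration). It flags missingness explicitly as its own feature and checks for a monotone pattern, rather than letting imputation quietly erase the information that the data were never collected.

```python
import pandas as pd

# Hypothetical inpatient snapshot: lactate and albumin tend to be missing together
# (monotone missingness), and the missingness itself tracks how the patient
# has interacted with the healthcare system.
df = pd.DataFrame({
    "heart_rate": [88, 112, 95, 130],
    "lactate":    [1.2, None, 1.8, None],
    "albumin":    [3.9, None, 3.4, None],
})

# Make the missingness explicit instead of letting it vanish during imputation.
for col in ["lactate", "albumin"]:
    df[f"{col}_missing"] = df[col].isna().astype(int)

# Simple check for a monotone pattern: whenever lactate is missing, is albumin missing too?
monotone = (df["lactate"].isna() <= df["albumin"].isna()).all()
print(df)
print("Monotone lactate -> albumin missingness:", monotone)
```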
So first, there are a variety of reporting standards that have been developed, including TRIPOD-ML, Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis for Machine Learning; MINIMAR, the Minimum Information for Medical AI Reporting; and the Prediction model Risk Of Bias ASsessment Tool, or PROBAST-AI. And these are all things that we need to be looking for ourselves when we're reading these papers, and asking and encouraging journals to have authors report on when they are publishing these models, so that we can better assess how they are actually working. As Raven mentioned in sessions earlier today, the FDA has stepped in and now has a proposed rule for artificial intelligence/machine learning-based software as a medical device. And these kinds of action plans, while they are very disruptive in our field, have to be taken in the spirit in which they are intended, which is to try and minimize bias and to maximize benefit from these tools. So my conclusion is that AI/ML-driven EWS isn't ready yet, okay? So far today, I've been able to demonstrate that EWSs are effective at predicting clinical deterioration. There's no question about that. But the implementation of them is where we fall short, because they do not significantly improve hospital mortality rates, in-hospital cardiac arrest rates, or length of stay, nor do they decrease ICU utilization. The reason for this is that the same barriers that impede meaningful EWS implementation also contribute to data missingness. And if not carefully undertaken, AI and ML implementation may actually amplify and propagate bias that is already very prominent in the healthcare field. So with that, it's ultimately incredibly important to be utterly transparent when reporting on AI/ML outcomes in order for continued success. So that's all I've got. Thank you. I'm going to turn it over to Vinny, who is going to probably dismantle my arguments one by one, as is appropriate, and please don't forget to rate our session this morning on the app. Thanks. Thank you. I am not going to dismantle your argument. I am going to support it. So you're all here on the last day of the conference, so thank you. And pro-con debates are supposed to have a vigorous back and forth. But I convinced Dr. Chen to go first, because everything she said I agree with, so I didn't think that would make a very compelling argument. And I'm going to talk a little bit more about how AI and ML tools can be used effectively to improve outcomes in healthcare. So her proposition was that EWS derived by AI do not provide meaningful clinical benefit. And I say thumbs up, checkmark to that, because AI is not in a state where it's like, if you build it, they will come, meaning that if you just plop this tool into a milieu, a complex place in which we're making clinical decisions, you're going to see success. In fact, one of our folks who oversees the deployment of our Epic system across our 12 million patients likes to say, if you build it, they won't come, because of their extensive experience building tools that someone thinks are going to be effective but which nobody wants to use. And I think in that way, I completely agree with Dr. Chen. If you think that AI on its own is going to be effective, you're kind of living in a field of dreams. And we need to actually get down to the reality of doing effective, sustainable things in healthcare. What we don't have is a shortage of builders.
And so if you saw me talk earlier this morning, I showed this slide. This is Michael Patton, who's an MD-PhD student at the University of Alabama at Birmingham. And what he did was review thousands of papers about prediction models in critical care. And we chose seven or so, seven or eight of the most common prediction targets. They're indicated by color: mortality, sepsis, shock. And early warning score is often kind of in that mortality bucket. And the lines indicate the uptake of the EHR, both in the hospital and the outpatient setting. But you can see that as of 2022, there were at least 2,500 papers in PubMed about the prediction of these eight outcomes. And there are probably far more today. So what we don't actually have is a shortage of AI tools. What we lack is the proof that they're effective. We also lack demonstrated evidence that there have been stark improvements over time. So this is the performance of machine learning in hospital mortality prediction tools. The y-axis is the common performance metric, the AUC or the C-statistic. And the different shapes indicate the type of machine learning method that was used for that study. So Michael went through every study and pulled out the relevant information. And I'm sorry, the colors are the method, the shapes are the number of patients. And what you can see is we've been at this game since 1995. And Marta Render, in the year 2000, actually using VA data, this is the triangle there to the left, produced a really high-performing model that continues to be on par with even the much fancier and in some cases, you know, AI or machine learning-built models that are coming today. So again, it's not that we lack proof of efficacy. I'm sorry, proof that the performance is good. It's that we're lacking evidence that they're efficacious or effective. And that is exactly the gap. So AI on its own, you're living in a field of dreams. What we need is to get down to reality. And, you know, I think, you know, how could we not be excited about a technology which produces this image for the first time in the world based on someone's text input? I mean, like, if you're not blown away by this, and this is probably from like a year ago, it's just amazing. And I also, I love to do this. I love to ask ChatGPT to write limericks. So I said, write a limerick about an AI pro-con debate from the CHEST conference in Hawaii. So you can all judge how well it did. At a CHEST meet in sunlit Hawaii, AI's pros and its cons did fly high. Will it help or will it harm? Ask docs with alarm, while the waves whispered answers nearby, okay? So if you can do better than this in less than 10 seconds, please, like, kick me off the stage and let's go. I mean, how could we not be excited about the potential of tools like this? And these are kind of, some people say, like toy examples, but I do think they really get to this incredible new frontier which has been unlocked. And I'll say that experts in the field, meaning computer scientists who have long been deeply skeptical about AI, have over the past year repeatedly been blown away, right? So we're not talking about people who are just trying to promote their own industry, but really people who've been at the frontier of this field, just wondering, having existential crises, you know, what is intelligence? What is my role? So I think that, again, AI alone is not sufficient, but how could we ignore this kind of emerging technology? And so that gets to, I think, some of Dr. Chen's, you know, this really nice review that she put up.
And this was the largest study here, where the number of patients in that study was 360,000. It's a very big, robust study using aspects of randomization across hospital sites. The important thing to note is that that was in 2005. Does anybody know when the iPhone was first released? 2007. So it's really hard to say this represents AI when it came before the iPhone, right? Just as an example. So I don't know if we should really use this as good evidence of whether or not things are efficacious. But we are struggling with this. There's no question. We have thousands of papers being submitted about the amazing performance of these tools, and we have just a small dribble of papers that come out that say, oh, we actually applied this at scale and proved that this actually made a difference. It is one of the reasons, and I mentioned this earlier, that we call AI at Kaiser Permanente augmented intelligence, because we focus on the people, the patients, the clinicians, and the communities, rather than the algorithms. And because we know that, again, with algorithms alone, if you build it, they won't come. People won't know how to use it. They will, you know, often they will turn it off or ignore it or just be pissed off. So I was excited when this news article came out and said Kaiser Permanente's AI approach puts patients and doctors first, because that warms my heart, right? And I think that's where we want this technology to be, doing things on behalf of the patients we care for and hopefully making our lives easier. And we have, you know, experience doing this. This is a paper that was first-authored by my now-retired colleague, Dr. Gabriel Escobar. It came out in the New England Journal of Medicine, I think, in 2021. And this was the deployment of an early warning score in our health system in Northern California. So in real time, it scans vital signs, laboratory values, and comorbidities and estimates deterioration risk. This shows the graph as we rolled it out across 19 of our hospitals in Northern California, where the lighter blue aspect is the period before activation, and then the darker blue is after activation. This is one of the evaluation strategies that health systems use when things can't be randomized at the patient level. So this is called, you know, a stepped wedge implementation. And what's really important, because both Gabriel and I would agree, is that the early warning score here is not even that fancy. It's not like a ChatGPT super machine that writes limericks. It's actually pretty standard. What was really important is this, right? So Dr. Chen mentioned the lack of protocols as one of the kind of hindrances to actually getting users to understand it and take it up. So it's kind of small, but this is the entirety of the program that's built up to respond. So from time zero to hour one, that's the circle to the left, the alert appears on the virtual nursing team dashboard. So there's a virtual first responder who's taking in alerts across the 19 hospitals. And what this person does is provide a little human filtering, you know, because there are sometimes alerts that pop up that generate concern, but the patient's being actively cared for. And I'll give you an example. When you do a procedure, the number of vital signs being measured increases in density, and patients have more physiologic abnormalities. Your heart rate, your respiratory rate go up. So they filter out some of those.
From hour one to hour three, they then call down to the local hospital for the alerts which are going to move ahead. They talk to the local RRT nurse. And then from hour three to hour six, they respond with the physician counterpart. They make an assessment. If necessary, and in many cases it is necessary, social work and palliative care get involved, because not every patient who is at risk for deterioration wants more aggressive care. Again, this is a huge part of the human aspect to this. You know, as many as 25% of our patients are not full code on hospitalization. And so we would not want a score to be released and then just start driving more aggressive and intensive care that's not aligned with patients' goals of care. And then ultimately there's some follow-up rounding. The whole design of this score is that you show up to a patient's room and they may be sitting there eating their lunch. And that's the point, right? We tell our clinicians, walk, don't run. Because we have mechanisms for detecting patients in the midst of kind of active, very rapid clinical deterioration. This is meant to predict those who aren't. I just distilled out some of our results. So we analyzed 43,000 hospitalizations in which the alert met threshold. And you can see it was met with both a decrease in the rate of ICU admission as well as in the rate of mortality within 30 days, and increased goals of care. So I would say this is evidence of the fact that when an early warning score can be put in an entire kind of surrounding context of the care of the patient, it does produce clinically meaningful outcomes. But the early warning score itself is insufficient. One of our colleagues in the machine learning space says, you know, this deployment is one drop AI, 19 drops implementation. So if you're getting into this field, or are in this field, and are not aware of that kind of ratio, if you're expecting something like the flip of that, 19 to 1 the other way, or even 50-50, recognize that this is going to look like the other types of implementation we do, whether that's, you know, low tidal volume ventilation or, you know, early sepsis management. You know, it involves boots on the ground, winning the hearts and minds of our clinicians and our multidisciplinary teams. We've done this for other things. We've done this for readmission. Here, when we deployed the score for the medium- and high-risk patients, we showed that we reduced risk-adjusted mortality and readmission, whereas in the blue line, which is the low-risk patients, it didn't. So we're actually pretty familiar with these. We have, I don't show it here, but a menu of about 30 or 40 models that are being deployed in a diversity of fields: HIV prediction and prevention, suicide prediction and prevention, early warning scores in labor and delivery, pre-op surgical risk stratification, et cetera, et cetera. But again, the tools are not enough, even as amazing as this next generation of tools is going to be. And so we are also leading an effort where we're funding five external health systems to prospectively evaluate AI/ML tools at the bedside. And so we are down to our 12 finalists. We received 120 letters of intent from health systems and universities across the country, and we'll be announcing five awardees before the end of the year, who will each be funded for $750,000 over three years to robustly evaluate the human side of taking an AI algorithm and putting it into practice.
I think what is an essential message for all of us as clinicians, and probably for many physicians, is that we do need to safeguard the use of AI, and we need to be active in this role, because again, this is not, you know, something that you drag and drop and expect to go efficiently. So here are some books that I think are very informative. If you're interested in thoughts about ChatGPT and the struggle of some scientists with that technological capability, I would recommend The AI Revolution in Medicine from Peter Lee, who's the research VP at Microsoft, and Zak Kohane, who's an AI expert in Boston, and others. If you are concerned about whether AI will become our technologic overlord, then I would read Life 3.0 by Max Tegmark from MIT, because it goes through 16 scenarios about what the future might look like between AI and humans. And then if you're interested in understanding how AI will be the basis of the fourth industrial revolution, there's Klaus Schwab, who's the founder and executive chairman of the World Economic Forum. This one is very interesting because it's not just AI. It's AI plus nanotechnology and genetics and synthetic biology and new materials science and sensors and all of these things. It's not just AI using imaging and using EHR data. It's that plus the ability to actually change the fabric of human life, the macromolecules which make us up. And so there are some really thoughtful considerations about the impacts of that on our world. So with that, thank you for your time and attention. Thank you. That was a really great talk. We're gonna switch gears a little bit and talk about the use of artificial intelligence or augmented intelligence in radiographic interpretation. If my slides load there. Okay, great. Well, I'm here to keep the trend going of talking about the cons first. So I'm gonna talk to you about why computer vision is not ready to provide a clinical benefit in radiographic augmentation. My name is Andrew Michelson and I am an assistant professor in the Division of Pulmonary and Critical Care at Washington University in St. Louis. And the only disclosure I have is that I do own some stock in NVIDIA, which makes the graphics processing units that make a lot of these tools possible. Shows you where my allegiances actually lie, I guess. So my goals for today are really to talk about the landscape of AI and ML commercialization. And I will do that very briefly, because I know my colleague, Dr. Sjoding, will talk a little bit more about that in detail. And then we'll talk about some of the barriers to wide-scale deployment. So as we kind of heard from Dr. Chen and Dr. Liu, we've seen amazing success in the use of computer vision in healthcare. And we've seen these models achieve expert-level performance in a whole variety of fields that rely on imaging: ophthalmology, dermatology, pathology, and radiology. Interestingly, I think radiology is in a unique position to adopt these technologies. And I think that's because not only are these algorithms very good at analyzing images, but they can also work in a field that's kind of primed for AI adoption. And so what do I mean by that? I mean, radiology already has an established digital workflow where it'd be easy to integrate some of these tools. And they have some of the infrastructure that we need for data to flow from system to system. They have standard file formats, a lot of distribution networks.
And so radiology really is kind of a field where I think it would be relatively, I say relatively, easy to implement some of these tools. And just as we've seen more interest in developing these tools, there's been an interest in developing commercialized tools. And the FDA actually keeps a list of all the devices that use artificial intelligence and machine learning and have subsequently gained approval. And so I went to their website, which was last updated I think in October of last year, and I just plotted the number of devices that have been approved over time. And you can see, really around 2013, maybe a little earlier than that, there was this kind of linear and then accelerating growth in device approval. And now we have about 392 devices approved. I think since this data was actually updated, there are probably a few more, maybe closer to 396. But we've seen a lot of entities starting to make commercial products based on radiographic interpretation. And what we know now is, okay, these models perform really well. They achieve expert-level performance. There's a lot of interest in bringing them to market. But for the most part, we really haven't seen market penetration. We haven't seen these devices come to the bedside. And so the question is why? And I think there are many reasons, but two of the primary reasons that I think we're struggling to overcome are data heterogeneity and data bias, in addition to implementation, but that's a different discussion. So we'll talk about each of these individually. So let's talk about heterogeneity first. So if you've ever worked with some of these computer vision models, there are two things that they really like. They like a lot of data, and they like data that's kind of homogeneous. And so what do I mean? I mean, these computer vision algorithms learn by example. So they wanna see as many examples as possible to get as good as possible, kind of like we did when we were in training. But the other part of this is that these algorithms accept data that's of a certain size and a certain shape. And if your data doesn't meet that size, you have to do some data manipulation to get it to the expected format. And so in the real world, someone can have a CT scan and it can have 42 slices or it can have 62 slices. And so you have to do some data manipulation to actually get it down to the size that the model expects. So once you do that, you can take all your data, you eliminate some, maybe you transform it, and then you train your algorithm on it. But the problem is, in the real world, as many of us know, no two images are ever gonna be the same. There's always gonna be more data heterogeneity added. Right, you get a chest X-ray and the patient will be ever so slightly rotated in a different direction. Or it'll be maybe more of a KUB than an actual chest X-ray. Or you'll get a CT scan and maybe your slices will be one millimeter in thickness, or maybe they'll be five millimeters. And all these different points of data heterogeneity make it harder for the models to perform well, especially when they haven't seen this type of data before. And I'll give you a great example of that. So there was this great study that came out earlier this year where these investigators took CT scans of patients who had idiopathic pulmonary fibrosis. And they asked a simple question: can we quantify the volume of different types of lung tissue in these CT scans?
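A minimal sketch of the preprocessing step described above, assuming NumPy and SciPy are available: scans with different slice counts are interpolated to the single fixed shape a model expects. The target shape and the example scan sizes are made up for illustration; the point is that this resampling step is exactly where real-world heterogeneity in slice thickness and acquisition gets forced into the training distribution.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_to_fixed_shape(volume: np.ndarray, target_shape=(64, 256, 256)) -> np.ndarray:
    """Interpolate a CT volume (slices, height, width) to the shape a model expects.

    A scan may arrive with 42 slices or 62 slices at varying thickness; most
    computer-vision models require one fixed input size, so we interpolate.
    """
    factors = [t / s for t, s in zip(target_shape, volume.shape)]
    return zoom(volume, factors, order=1)  # linear interpolation

# Hypothetical scans of different sizes end up with identical model-ready shapes.
scan_a = np.random.rand(42, 512, 512)
scan_b = np.random.rand(62, 512, 512)
print(resample_to_fixed_shape(scan_a).shape)  # (64, 256, 256)
print(resample_to_fixed_shape(scan_b).shape)  # (64, 256, 256)
```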
And so they tried to quantify normal lung, consolidation, ground glass, nodules, very classic elements. And this was a very high-quality study. They had a decent number of CT scans; it was around 304. They were all reviewed by two independent radiologists. State-of-the-art machine learning techniques. It was a 3D model. They really did a great job with this study. And it showed: when you looked at the results, you could see that there was a high degree of correlation between the individual radiologist reviewers and the ultimate output of the machine learning model. And for most of the things that they looked at, they saw correlation coefficients of 0.9, 0.8, 0.7. So fairly good performance. The interesting thing is they kind of anticipated that the model might see data that it wasn't necessarily trained for, and they wanted to quantify the degree of performance degradation. So what did they do? Well, they noted that most of the CT scans were either half a millimeter or one millimeter in slice thickness. So they said, well, how well are we going to do on CT scans that maybe have five millimeter thickness? So they did that sub-analysis, and they looked at how much model degradation occurred. So in ground glass opacities, they saw their model degrade, in terms of the intraclass correlation coefficient, with one being perfect, from 0.95 down to 0.095. So an order of magnitude worse. In consolidation, they saw 60% degradation. Nodules, 75% degradation. And all they did was change the slice thickness parameter that their model was evaluating. Unfortunately, that's not the only study to unveil or reveal some of these limitations. So here is another study. They used chest X-rays to identify pneumothoraces. And they did their internal validation. They had a sensitivity of about 0.84. When they did their external validation, that dropped to about 0.5. There was another study that looked at an FDA-approved device to identify fractures in the cervical spine. And they reported a sensitivity of 0.92. And when it was actually prospectively validated, they saw a sensitivity of 0.55. And so there are many other studies that kind of recapitulate this information. But it really shows us that whenever we're looking at these computer vision models, the real-world data is not gonna be similar to the training data. And whether that's the positioning of the patient, the CT scan protocol that's used, the incidence of the disease, or the incidence of the actual outcome you're looking at, it's never gonna be the same. And when we see that, we're gonna see the model act in unpredictable ways, which usually leads to decreased model performance. And it really highlights for us the need to be skeptical and to say, we need to have external validation, we need to have prospective validation, and we have to actually prove that these models are behaving in the real world as we would expect them to in silico, or when we evaluate them in the computer world. Okay, so let's shift gears. I wanna talk a little bit about bias. So I'll be honest, I am not an algorithmic bias expert, but a couple years ago, I came across this great quote in a paper that said, AI systems have increasingly achieved expert-level performance in medical imaging applications. However, there is growing concern that AI systems may reflect and amplify human bias and reduce the quality of their performance in historically underserved populations.
And so this quote actually came from this paper from Nature Medicine that was published in 2021. And these investigators created a machine learning model to look at chest X-rays and identify certain pathological findings, like pneumothorax, consolidation, things like that. They then looked at how often a report would say there's no finding when there actually was a finding, and they stratified that by certain sociodemographic groups. So this data set had about 372,000 chest X-rays, so a fairly good size. The four demographic categories they looked at were sex, age, race and ethnicity, and insurance provider. And if you look at just gender here, you can see it was fairly balanced, which is kind of unusual in the machine learning world, but it was roughly a one-to-one ratio, and we'll talk a little bit more about that in a few minutes. You can see that in the age groups, there was definitely a minority of patients who were in that zero-to-20 age group. You can see in race and ethnicity, it was a predominantly white cohort at 68%, with only 20% being black. And insurance status, you can see there, about 46% were Medicare and only 9% were Medicaid. And I'll just let you know that they had an AUROC of about 0.83, so fairly good performance. And in the key figure from their paper, they found that female patients were more likely to have an X-ray reported as normal when there was, in fact, an abnormality identified. And that's despite being in a one-to-one ratio with a large volume of data, right? And the same held true for patients who were zero to 20, who were black, and who had Medicaid insurance. And what's more is when two or more of these factors were present, it really amplified the effect. So if you look at someone who was black and in that zero-to-20 age group, their false positive rate for "no finding" was almost 75% there. So really a high impact of these two factors. So what did we learn? Well, I think bias exists in computer vision; even though you wouldn't think it, it's just an image, there is bias there. When we see it, it probably affects our model's performance. And just like we saw with race, or I'm sorry, with gender, in a one-to-one ratio with a high volume of data, we still saw bias. So the typical solution in AI is to throw more data at the problem and it'll get better, but that may not fix everything here. And I think we have to get creative in how we eliminate bias from our models. So where do we go from here? Well, I'm an optimist; as I told you, I own stock in NVIDIA. But I think we're gonna continue to improve these models. And as Dr. Liu showed, there's a lot of potential, and I think we should try to harness that potential as much as we can, but we're obligated to remove as much bias as possible. And I think we have to recognize that no model in isolation is ever gonna be perfect. And so when we implement them, we have to do it in a way that's gonna augment clinical decision-making, right? These aren't tools that are gonna supplant our ability to make decisions. They're there to help us and to improve the quality of care. And I will leave it there. Thank you, bud. Thank you. Okay, well, thanks for joining and staying around to the last talk of this session. I think it's quite telling that all of us are probably ultimately gonna say the same thing, which is that we all really see the potential, but we all are gonna ask for everyone in this space to do more. My name is Michael Sjoding. I do have a conflict of interest.
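The kind of subgroup audit described above is straightforward to run on any model's outputs. Here is a hypothetical sketch (pandas assumed; every value is invented) that computes, per demographic group, how often a study with a true finding is called "no finding", which is the underdiagnosis rate the paper stratified.

```python
import pandas as pd

# Hypothetical per-study predictions: has_finding is ground truth,
# pred_no_finding is the model calling the film normal.
df = pd.DataFrame({
    "sex":             ["F", "F", "M", "M", "F", "M", "F", "M"],
    "has_finding":     [1,   1,   1,   1,   1,   0,   1,   0],
    "pred_no_finding": [1,   0,   0,   0,   1,   1,   1,   0],
})

# Underdiagnosis rate: among studies with a real finding, the fraction called "no finding".
underdx = (
    df[df["has_finding"] == 1]
    .groupby("sex")["pred_no_finding"]
    .mean()
    .rename("underdiagnosis_rate")
)
print(underdx)  # higher values flag groups the model is more likely to miss
```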
I was involved in the development of a technology that's under license. I will not be talking about that technology today. So I'm gonna argue that computer vision does provide clinical benefit in radiographic augmentation, but I'm gonna do something that many debaters do, which is change the argument that I'm gonna make. And what I'm actually gonna argue is that computer vision can provide incremental benefit in radiographic augmentation, although I will show you that at this point, there's really little published evidence to support this. So I'd like to draw an analogy here. When we think about the level of evidence we need to convince ourselves that AI can provide radiographic augmentation, and I would argue potentially this is sort of true for everything else we've talked about today, like early warning scores, we really should think about that type of evidence in the same way that we've thought for years about the drug development pipeline. This is a well-established pipeline where a sequence of studies is performed to bring a molecule to market. So in the pre-discovery phase, we have investigations of new drug pathways and potential targets. In the pre-clinical phase, we have sort of early testing of potential candidate drugs in cells and animals. Then we move to the phase of clinical trials: phase one, where safety is tested, phase two, where dosing is tested, and phase three, where clinical effectiveness is trialed. Then, and only after that, do we get FDA approval, and following FDA approval, we get phase four studies, where we continue to monitor these drugs for effectiveness and safety. And I'd really like our field to think about AI development in the same way. And so here is my sort of version of what this would look like. So we have retrospective, non-clinical-environment studies, and we have prospective studies in a clinical environment. So in the retrospective space, the first stuff we have is research to develop new AI architectures and new models. And honestly, a lot of the hype in AI is based on just that, right? Like we have new studies sort of at this far end of the spectrum, and people are like, oh, it's gonna change medicine, right? But as many of us have already talked about, it's a long way forward. So after that, we have studies maybe comparing the AI model versus the physician, who performs more accurately, followed by, in radiology, so-called multi-reader performance studies with AI models, and I'll spend a bit more time on those because those are sort of key for AI development in radiology. And so those are all retrospective, non-clinical studies. Those are not studies where the AI model is tested in clinical practice. Then we move on to pilot studies and large randomized controlled trials, ideally patient-level randomized, but if not, cluster-randomized controlled trials, stepped-wedge trials like Dr. Liu talked about, and effectiveness studies. And then ideally, after that, we would have FDA approval and large-scale implementation. And the strongest evidence for AI benefit comes from these large-scale randomized trials and implementation effectiveness studies. Sorry, but for better or for worse, that's what we need to really convince ourselves that we have clinical benefit. But the reality is that's not what's happening in artificial intelligence for radiology. And in fact, for better or for worse, this is partly because of current FDA guidelines for approval.
So if you go to the FDA website and you learn about what's required to bring a new radiologic technology to market, the key study that needs to be performed is the so-called multi-reader performance study with AI models. And I'll talk about that again in a little bit more detail. But we don't really need to do any prospective testing in a clinical environment. In fact, it's not a requirement of current FDA guidance. And as a consequence, we just don't have many studies in this space. So this was a nice review, looking specifically at artificial intelligence systems in radiology that are commercially approved, and describing the level of evidence that supports the approval of these systems. And you can see the majority of the evidence is these so-called diagnostic accuracy effectiveness studies. These are studies that are looking, in a non-clinical environment, at whether a model can support a clinician to be more accurate. And you really do not see a lot of studies that go beyond that, because frankly, that is what is required of a company to bring a technology to market. But again, what I'd really like to show you to convince you that AI has clinical benefit is to highlight really beautiful large-scale RCTs that show the benefit. So here's what these multi-reader diagnostic accuracy studies look like. And don't get me wrong, some of them are quite impressive. And I think this is sort of the best in class. So this is a study published in Lancet Digital Health in 2021. In this study, a group took 800,000 chest X-ray images and trained a deep learning model to detect up to 127 different clinical findings. So first they did that. Then they found a group of 20 radiologists who reviewed a set of over 2,000 chest X-rays. And each of the radiologists scored all these chest X-rays for these 127 findings. Those radiologists then took a three-month break, and then they read the same X-rays again, now with the support of this AI model. And with this AI model, they could, on review, open up the film, look at the X-ray, look at the AI model's output, and see what the model was looking at. And when they did this, the radiologists' area under the receiver operating curve went from 0.71 to 0.81, an impressive difference. And their accuracy was improved in 102 out of 127 findings. You can't read all these findings, but I'm just showing you this to show you the scope of this work. It's quite impressive. And some of these findings are actually quite important, like improvement in accuracy on detection of pneumothorax, or other potentially important findings like pulmonary nodules. So multi-reader diagnostic accuracy studies are really good because they quantify the direct impact of an AI model on physician accuracy. You know, you just couldn't have done the study that I described in a clinical setting. It just wouldn't be possible. But the problems are, these studies use retrospective data, as we heard, which may not be representative of clinical practice. These studies may not fully replicate the clinical environment or the workflows. And they ultimately don't quantify the clinical impact. Because, so we saw that, you know, the radiologists' accuracy got a bit better with the support of the AI model. But does that translate into benefit for patients? So now I really wanna focus on the few randomized controlled trials in this space. I really wanna highlight sort of three that I was able to find. And there's really not much more. And so again, I'm going to really stress that that's what we need.
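For anyone less familiar with the metric, the 0.71-to-0.81 comparison above amounts to scoring the radiologists' reads against ground truth with and without AI support and computing an AUROC for each condition. A toy sketch, with invented labels and scores and scikit-learn assumed:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical finding labels for ten films (1 = finding present) and radiologist
# confidence scores read without and then with AI support (all values invented).
truth          = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
scores_unaided = [0.6, 0.5, 0.4, 0.7, 0.3, 0.6, 0.5, 0.2, 0.8, 0.4]
scores_with_ai = [0.8, 0.3, 0.6, 0.9, 0.2, 0.4, 0.7, 0.1, 0.9, 0.3]

print("AUROC unaided:", roc_auc_score(truth, scores_unaided))
print("AUROC with AI:", roc_auc_score(truth, scores_with_ai))
```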
But hopefully I'll show you three really good studies, I think, that demonstrate the potential benefit and convince you that AI really can improve clinical practice. The three studies are on mammography, on the assessment of skeletal age, and on pulmonary nodule detection. So the first study is on mammography. In this study, investigators wanted to see if an AI model could serve as a second radiologist reviewing mammography films. And you may not know this, because in the US this isn't required, but in many other countries it is an expectation that two radiologists review mammography films to improve the accuracy of the reading. And so the investigators specifically wanted to find out whether an AI model could serve as that second reviewer. So this is a patient-level randomized study, again, one of the few studies that we see in this space doing patient-level randomization. And what the groups did is that if a patient was randomized to the intervention group, there was a radiologist who read the mammogram, and then the AI model also read the mammogram. The AI model generated a score between one and 10, and if the AI model generated a score of 10, then that mammogram was reviewed by a second radiologist. And in the control group, two radiologists read the mammogram. And so in the intervention group, the cancer detection rate was 6.1 per 1,000 screened. That was in the AI group. And it was 5.1 per 1,000 in the control group. At the same time, there was no difference in false positives or false referral rates. So I think this is very promising evidence that an AI model could serve as a second reviewer for mammography. And if you think about the potential scalability of this, we could now have a system supporting radiologists as a second reviewer to increase the accuracy and the consistency of mammography at scale. The second study that I wanna highlight is on skeletal age. So again, we're probably mostly pulmonologists in this room, so we may not know what this sort of skeletal age assessment is. But in kids, when you have a potential abnormality in growth and there's a concern that there may be something going on, a pediatrician may order a plain x-ray of the hand. And based on the features of that hand x-ray, particularly looking at the growth plates, you can estimate a patient's skeletal age. And if the skeletal age is different from the patient's actual chronologic age, that can highlight an abnormality. So this is something, I hear, because my wife is a pediatrician, that is done occasionally. And how do radiologists interpret these images? What they do, actually, is they have this giant atlas of example images of hands of kids, girls or boys at different ages, and they basically look at the image that they have in front of them and try to figure out what it's closest to. And so this seemed, again, like a natural type of clinical situation where an AI model could provide support. So again, patient-level randomized. So when a patient's films were randomized to AI model support, the radiologist got to see what the AI model thought. And when they did this, the skeletal age assessments were less likely to be significantly wrong, which they defined as a difference of more than 12 months: 9.3% with the AI model versus 13% without, and interpretation time was faster, about 40 seconds faster.
So again, it's sort of incremental, but if you think about the ability to scale this, you could see this being quite powerful for improving the consistency of interpretation of radiologic films. The final one I'll just highlight, because this is CHEST and we're all sort of probably interested in this subject area, was nodule detection, where AI support improved the detection of lung nodules on chest x-rays. So in this study, again a patient-level randomized study, patients who had a chest x-ray as part of a routine health maintenance exam were randomized to either have their chest x-ray go to a PACS system where this AI support for lung nodule detection could be run, or not. And so the radiologist then had access to the AI support, which is basically what you see up on the screen. And in this study, radiologists who had access to the AI support were more likely to detect potentially actionable lung nodules on the chest x-ray compared to the control group. And they defined actionable as lung nodules that ultimately turned out to be greater than eight millimeters on a chest CT when that was performed. And another really important finding of this study is that this increase in detection of actionable lung nodules did not come at the expense of a higher rate of false referrals. So it's not that we're finding more things but also getting more false positives. So again, promising studies showing that these technologies, when implemented in the right clinical context, can improve care. So I'm gonna end again by arguing what I started out with, that computer vision can provide incremental clinical benefit in radiographic augmentation. And I wanna say one other thing about why this is ultimately gonna be incremental. Because for these systems to provide substantial clinical benefit, that would really mean that in current clinical practice, bad things are happening. And I think that can occur occasionally, but I think in clinical practice in general, people provide really good care. So we can't expect this computer vision algorithm to provide substantially better care when we're all ultimately already doing a pretty good job. But I do think that these systems can provide incremental increases in accuracy, efficacy, and consistency in delivering good care in clinical practice. So thank you very much for attending this session. It's been really fun. I hope you've enjoyed it. Are we gonna do a question or two since we have three minutes? Okay, thanks. Thank you.
Video Summary
In this video, a panel of experts discusses the use of artificial intelligence, or augmented intelligence (AI), in the field of healthcare, specifically in critical care and radiology settings. The panelists present both the potential benefits and limitations of using AI in these settings. They highlight that while AI models can achieve expert-level performance in analyzing images and predicting outcomes, there are challenges in implementing these models in real-world clinical settings. One major challenge is the heterogeneity of data, as AI models require large amounts of consistent data to perform well. Data bias is also a concern, as AI models may reflect and amplify human biases, leading to inequities in healthcare. The panelists emphasize the need for more rigorous studies and evidence to support the clinical benefit of AI in healthcare. They suggest a framework similar to the drug development pipeline, encompassing various stages of AI development, including pre-discovery, pre-clinical, clinical trials, and large-scale implementation. The panelists also stress the importance of transparency and reporting standards in AI research, as well as the need for regulatory oversight to ensure patient safety and mitigate biases in AI systems. Overall, while AI has the potential to augment healthcare, it requires careful implementation and ongoing evaluation to ensure its effectiveness and ethical use.
Meta Tag
Category
Biotechnology
Session ID
1034
Speaker
Catherine Chen
Speaker
Vincent Liu
Speaker
Andrew Michelson
Speaker
Michael Sjoding
Track
Biotechnology
Keywords
artificial intelligence
augmented intelligence
healthcare
critical care
radiology
AI models
data heterogeneity
data bias
clinical benefit
© American College of Chest Physicians®