Artificial Intelligence Index Report 2024
CHAPTER 5: Science and Medicine

Preview

Overview
Chapter Highlights
5.1 Notable Scientific Milestones
  AlphaDev
  FlexiCubes
  Synbot
  GraphCast
  GNoME
  Flood Forecasting
5.2 AI in Medicine
  Notable Medical Systems
    SynthSR
    Coupled Plasmonic Infrared Sensors
    EVEscape
    AlphaMissense
    Human Pangenome Reference
  Clinical Knowledge
    MedQA
    Highlighted Research: GPT-4 Medprompt
    Highlighted Research: MediTron-70B
  Diagnosis
    Highlighted Research: CoDoC
    Highlighted Research: CT Panda
    Other Diagnostic Uses
  FDA-Approved AI-Related Medical Devices
  Administration and Care
    Highlighted Research: MedAlign
Appendix

Overview

This year's AI Index introduces a new chapter on AI in science and medicine in recognition of AI's growing role in scientific and medical discovery. It explores 2023's standout AI-facilitated scientific achievements, including advanced weather forecasting systems like GraphCast and improved material discovery algorithms like GNoME. The chapter also examines medical AI system performance, important 2023 AI-driven medical innovations like SynthSR and ImmunoSEIRA, and trends in the FDA's approval of AI-related medical devices.

Chapter Highlights

1. Scientific progress accelerates even further, thanks to AI. In 2022, AI began to advance scientific discovery. 2023, however, saw the launch of even more significant science-related AI applications, from AlphaDev, which makes algorithmic sorting more efficient, to GNoME, which facilitates the process of materials discovery.

2. AI helps medicine take significant strides forward. In 2023, several significant medical systems were launched, including EVEscape, which enhances pandemic prediction, and AlphaMissense, which assists in AI-driven mutation classification. AI is increasingly being utilized to propel medical advancements.

3. Highly knowledgeable medical AI has arrived. Over the past few years, AI systems have shown remarkable improvement on the MedQA benchmark, a key test for assessing AI's clinical knowledge. The standout model of 2023, GPT-4 Medprompt, reached an accuracy rate of 90.2%, marking a 22.6 percentage point increase from the highest score in 2022. Since the benchmark's introduction in 2019, AI performance on MedQA has nearly tripled.

4. The FDA approves more and more AI-related medical devices. In 2022, the FDA approved 139 AI-related medical devices, a 12.1% increase from 2021. Since 2012, the number of FDA-approved AI-related medical devices has increased by more than 45-fold. AI is increasingly being used for real-world medical purposes.

5.1 Notable Scientific Milestones

This section highlights significant AI-related scientific breakthroughs of 2023 as chosen by the AI Index Steering Committee.

AlphaDev
AlphaDev discovers faster sorting algorithms

AlphaDev is a new AI reinforcement learning system that has improved on decades of work by scientists and engineers in the field of computational algorithmic enhancement. AlphaDev developed algorithms with fewer instructions than existing human benchmarks for fundamental sorting algorithms on short sequences such as Sort 3, Sort 4, and Sort 5 (Figure 5.1.1). Some of the new algorithms discovered by AlphaDev have been incorporated into the LLVM standard C++ sort library. This marks the first update to this part of the library in over 10 years and is the first addition designed using reinforcement learning.
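To make the idea of a fixed-length routine such as Sort 3 concrete, here is a minimal Python sketch of a three-element sort built from a fixed sequence of compare-exchange steps. It is an illustration only: the function names are hypothetical, and this is a textbook sorting network, not the assembly-level routine that AlphaDev discovered.

```python
from itertools import permutations


def compare_exchange(values, i, j):
    """Place the smaller of values[i] and values[j] at index i, the larger at j."""
    low, high = min(values[i], values[j]), max(values[i], values[j])
    values[i], values[j] = low, high


def sort3(a, b, c):
    """Sort three values with a fixed sequence of three compare-exchange steps.

    Hypothetical illustration of a fixed-length sort; the sequence of steps
    does not depend on the data, which is what makes such routines easy to
    express as short, straight-line instruction sequences.
    """
    values = [a, b, c]
    compare_exchange(values, 0, 1)  # order the first pair
    compare_exchange(values, 1, 2)  # move the largest value to the end
    compare_exchange(values, 0, 1)  # order the remaining pair
    return tuple(values)


# Exhaustive check over every ordering of three distinct values.
assert all(sort3(*p) == (1, 2, 3) for p in permutations((1, 2, 3)))
```

Because the sequence of steps is fixed, the length of the routine can be counted directly, which is the quantity AlphaDev is described as reducing.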
AlphaDev vs. human benchmarks when optimizing for algorithm length
Source: Mankowitz et al., 2023 | Chart: 2024 AI Index report
[Bar chart comparing AlphaDev with human benchmarks; x-axis: Algorithm (Sort 3, Sort 4, Sort 5, VarSort3, VarSort4, VarSort5, VarInt)]
Figure 5.1.1

FlexiCubes
3D mesh optimization with FlexiCubes

3D mesh generation, crucial in computer graphics, involves creating a mesh of vertices, edges, and faces to define 3D objects. It is key to video games, animation, medical imaging, and scientific visualization. Traditional isosurface extraction algorithms often struggle with limited resolution, structural rigidity, and numerical instabilities, which in turn degrade mesh quality. FlexiCubes addresses some of these limitations by employing AI for gradient-based optimization and adaptable parameters for mesh adjustments (Figure 5.1.2). Compared to other leading methods that utilize differentiable isosurfacing for mesh reconstruction, FlexiCubes achieves mesh extractions that align much more closely with the underlying ground truth (Figure 5.1.3).

Sample FlexiCubes surface reconstructions
Source: Nvidia, 2023
[Image panels: Marching Cubes (15k tris), DMTet (15k tris), FlexiCubes (13k tris), Reference (91k tris); applications shown: 3D reconstruction from images, generative 3D modeling, animated 3D reconstruction, tet-mesh physics simulation, adaptive meshing, developability]
Figure 5.1.2

Select quantitative results on 3D mesh reconstruction
Source: Shen et al., 2023 | Chart: 2024 AI Index report
[Bar chart; x-axis: Algorithm/method evaluated at 64³ (NDC_SDF, MC_SDF, DC_Hermite, MC, DMTet(64), DMTet(80), FlexiCubes)]
Figure 5.1.3

Synbot
AI-driven robotic chemist for synthesizing organic molecules

Synbot employs a multilayered system, comprising an AI software layer for chemical synthesis planning, a robot software layer for translating commands, and a physical robot layer for conducting experiments. The closed-loop feedback mechanism between the AI and the robotic system enables Synbot to develop synthetic recipes with yields equal to or exceeding established references (Figure 5.1.4). In an experiment aimed at synthesizing M1 [4-(2,3-dimethoxyphenyl)-1H-pyrrolo[2,3-b]pyridine], Synbot developed multiple synthetic formulas that achieved conversion yields surpassing the mid-80% reference range and completed the synthesis in significantly less time (Figure 5.1.5). Synbot's automation of organic synthesis highlights AI's potential in fields such as pharmaceuticals and materials science.

Synbot design
Source: Ha et al., 2023
[Diagram of the AI S/W layer, robot S/W layer, and robot layer, from input (target molecule and task) to output (synthetic recipe and material); module labels include retrosynthesis, DoE and optimization, decision-making, database, recipe generation, recipe translation, online scheduling, pantry, dispensing, reaction, sample-prep, analysis, and transfer-robot]
Figure 5.1.4

Reaction kinetics of M1 autonomous optimization experiment, Synbot vs. reference
Source: Ha et al., 2023 | Chart: 2024 AI Index report
[Line chart; x-axis: Time (hours), 0 to 24; reference line labeled 85%]
Figure 5.1.5

GraphCast
More accurate global weather forecasting with GraphCast

GraphCast is a new weather forecasting system that delivers highly accurate 10-day weather predictions in under a minute (Figure 5.1.6). Utilizing graph neural networks and machine learning, GraphCast processes vast datasets to forecast temperature, wind speed, atmospheric conditions, and more. Figure 5.1.7 compares the performance of GraphCast with the current industry state-of-the-art weather simulation system: the High Resolution Forecast (HRES). GraphCast posts a lower root mean squared error, meaning its forecasts more closely correspond to observed weather patterns. GraphCast can be a valuable tool in deciphering weather patterns, enhancing preparedness for extreme weather events, and contributing to global climate research.

GraphCast weather prediction
Source: DeepMind, 2023
[Image panels: a) Input weather state, b) Predict the next state, c) Roll out a forecast]
Figure 5.1.6

Ten-day z500 forecast skill: GraphCast vs. HRES
Source: Lam et al., 2023 | Chart: 2024 AI Index report
[Line chart; series: GraphCast, HRES (06z/18z), HRES (00z/12z); x-axis: Lead time (days), 0 to 10]
Figure 5.1.7

GNoME
Discovering new materials with GNoME

Sample material structures
Source: Merchant et al., 2023
Figure 5.1.8

The search for new functional materials is key to advancements in various scientific fields, including robotics and semiconductor manufacturing. Yet this discovery process is typically expensive and slow.
Recent advancements by Google researchers have demonstrated that graph networks, a type of AI model, can expedite this process when trained on large datasets. Their model, GNoME, outperformed the Materials Project, a leading method in materials discovery, by identifying a significantly larger number of stable crystals (Figure 5.1.8). GNoME has unveiled 2.2 million new crystal structures, many overlooked by human researchers (Figure 5.1.9 and Figure 5.1.10). The success of AI-driven projects like GNoME highlights the power of data and scaling in speeding up scientific breakthroughs.

GNoME vs. Materials Project: stable crystal count
Source: Merchant et al., 2023 | Chart: 2024 AI Index report
[Chart; series: GNoME, Materials Project; x-axis: Unique elements (2 to 6)]
Figure 5.1.9

GNoME vs. Materials Project: distinct prototypes
Source: Merchant et al., 2023 | Chart: 2024 AI Index report
[Chart; series: GNoME, Materials Project; x-axis: Unique elements (2 to 6)]
Figure 5.1.10

Flood Forecasting
AI for more accurate and reliable flood forecasts

New research introduced in 2023 has made significant progress in predicting large-scale flood events. Floods, among the most common natural disasters, have particularly devastating effects in less developed countries where infrastructure for prevention and mitigation is lacking. Consequently, developing more accurate prediction methods that can forecast these events further in advance could yield substantial positive impacts.

A team of Google researchers has used AI to develop highly accurate hydrological simulation models that are also applicable to ungauged basins.¹ These innovative methods can predict certain extreme flood events up to five days in advance, with accuracy that matches or surpasses current state-of-the-art models, such as GloFAS. The AI model demonstrates superior precision (accuracy of positive predictions) and recall (ability to correctly identify all relevant instances) across a range of return period events, outperforming the leading contemporary method (Figure 5.1.11).² The model is open-source and is already being used to predict flood events in over 80 countries.

Predictions of AI model vs. GloFAS across return periods
Source: Nearing et al., 2023 | Chart: 2024 AI Index report
[Two panels comparing the AI model with GloFAS; x-axis: Return period. First panel: 1 (N=3,649), 2 (N=3,675), 5 (N=3,416), 10 (N=3,087). Second panel: 1 (N=3,682), 2 (N=3,691), 5 (N=3,597), 10 (N=3,321)]
Figure 5.1.11

¹ An ungauged basin is a watershed for which there is insufficient streamflow data to model hydrological flows.
² A return period (recurrence interval) measures the likelihood of a particular hydrological event recurring within a specific period. For example, a 100-year flood means there is a 1% chance of the event being equaled or exceeded in any given year.

5.2 AI in Medicine

AI models are becoming increasingly valuable in healthcare, with applications ranging from detecting polyps to aiding clinicians in making diagnoses. As AI performance continues to improve, monitoring its impact on medical practice becomes increasingly important. This section highlights significant AI-related medical systems introduced in 2023, the current state of clinical AI knowledge, and the development of new AI diagnostic tools and models aimed at enhancing hospital administration.

Notable Medical Systems

This section identifies significant AI-related medical breakthroughs of 2023 as chosen by the AI Index Steering Committee.

SynthSR
Transforming brain scans for advanced analysis

SynthSR is an AI tool that converts clinical brain scans into high-resolution T1-weighted images (Figure 5.2.1). This advancement addresses the issue of scan quality variability, which previously limited the use of many scans in advanced research. By transforming these scans into T1-weighted images, known for their high contrast and clear brain structure depiction, SynthSR facilitates the creation of detailed 3D brain renderings. Experiments using SynthSR demonstrate robust correlations between observed volumes at both scan and subject levels, suggesting that SynthSR generates images closely resembling those produced by high-resolution T1 scans. Figure 5.2.2 illustrates the extent to which SynthSR scans correspond with ground-truth observations across selected brain regions. SynthSR significantly improves the visualization and analysis of brain structures, facilitating neuroscientific research and clinical diagnostics.

SynthSR generations
Source: Iglesias et al., 2023
[Image panels: Input, SynthSR, FreeSurfer seg., 3D render]
Figure 5.2.1

SynthSR correlation with ground-truth volumes on select brain regions
Source: Iglesias et al., 2023 | Chart: 2024 AI Index report
[Heat map of correlations; rows: Subject level (n=41), Scan level (ablated segmentation task), Scan level (n=435); x-axis: Brain region (including white matter, cortical gray matter, ventricles, hippocampus, amygdala)]
Figure 5.2.2
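The correlation analysis described for SynthSR can be illustrated with a short, self-contained sketch. The example below computes a Pearson correlation between reference volumes and volumes measured on synthesized scans. The function name and all volume values are hypothetical placeholders, and the source does not state which correlation statistic the authors used, so Pearson's r is an assumption here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical regional volumes (mm^3) for five subjects: values measured on
# high-resolution reference scans vs. values measured on synthesized scans.
reference_volumes = [3900.0, 4100.0, 4350.0, 3700.0, 4000.0]
synthesized_volumes = [3850.0, 4150.0, 4300.0, 3760.0, 4020.0]

print(f"r = {pearson_r(reference_volumes, synthesized_volumes):.3f}")
```

A value close to 1 would indicate that volumes from synthesized scans track the reference volumes closely, which is the kind of agreement Figure 5.2.2 reports per brain region.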
Coupled Plasmonic Infrared Sensors
Coupled plasmonic infrared sensors for the detection of neurodegenerative diseases

Diagnosis of neurodegenerative diseases such as Parkinson's and Alzheimer's depends on fast and precise identification of biomarkers. Traditional methods, such as mass spectrometry and ELISA, are useful for quantifying protein levels; however, they cannot discern changes in structural states. This year, researchers uncovered a new method for neurodegenerative disease diagnosis that combined AI-coupled plasmonic infrared sensors that use Surface-Enhanced Infrared Absorption (SEIRA) spectroscopy with an immunoassay technique (ImmunoSEIRA; Figure 5.2.3). In tests that compared actual fibril percentages with predictions made by AI systems, the predictions were found to match the actual reported percentages very closely (Figure 5.2.4).

ImmunoSEIRA detection principle and the setup
Source: Kavungal et al., 2023
[Diagram, panels A to E: monomers, oligomers, and fibrils; infrared objective, IR light, and flow cell with inlet and outlet; spectra plotted against wave number (cm⁻¹)]
Figure 5.2.3

Deep neural network predicted vs. actual fibrils percentages in test samples
Source: Kavungal et al., 2023 | Chart: 2024 AI Index report
[Chart; x-axis: Actual fibrils concentration (%) at 0%, 25%, 40%, 50%, 60%, 75%, and 100%]
Figure 5.2.4

EVEscape
Forecasting viral evolution for pandemic preparedness

Predicting viral mutations is vital for vaccine design and pandemic minimization. Traditional methods, which rely on real-time virus strain and antibody data, face challenges during early pandemic stages due to data scarcity. EVEscape is a new AI deep learning model trained on historical sequences and biophysical and structural information that predicts the evolution of viruses (Figure 5.2.5). EVEscape evaluates viral escape independently of current strain data, predicting 50.0% of observed SARS-CoV-2 mutations. It outperformed traditional lab studies, which predicted up to 46.2%, and a previous model, which predicted only 24% of mutations (Figure 5.2.6). This performance highlights EVEscape's potential as a valuable asset for enhancing future pandemic preparedness and response efforts.

EVEscape design
Source: Thadani et al., 2023
[Diagram. Panel a: Escape = Fitness × Accessibility × Dissimilarity, that is, P(mutation escapes immunity) = P(mutation maintains fitness) × P(mutation accessible to Ab | fit) × P(mutation disrupts Ab binding | fit, accessible); inputs: deep learning of evolutionary sequences and biophysical information. Panel b: timeline from pandemic start, to variant appearance, to variant becoming a VOC; warning time of previous models (~2-4 months); EVEscape early warning time allows for vaccine development]
Figure 5.2.5

EVEscape vs. other models on SARS-CoV-2 RBD mutation prediction
Source: Thadani et al., 2023 | Chart: 2024 AI Index report
[Line chart; x-axis: Pandemic date (2020 to 2023); labeled values: EVEscape (prepandemic), 50%; later experimental scans (pandemic ab + sera), 46%; earlier experimental scans (pandemic ab), 32%; previous model, 24%]
Figure 5.2.6
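The decomposition shown in the EVEscape design figure (Figure 5.2.5) is a chain of conditional probabilities: a mutation escapes immunity if it maintains fitness, is accessible to antibodies given that it is fit, and disrupts antibody binding given both. The sketch below multiplies the three terms and ranks two mutations. All names and numbers are hypothetical, and the published model's actual scoring and calibration are not described in this chapter, so this shows only the probability arithmetic.

```python
def escape_probability(p_fit, p_accessible_given_fit, p_disrupts_given_fit_accessible):
    """Chain-rule product of the three terms shown in Figure 5.2.5.

    P(escape) = P(fit) * P(accessible | fit) * P(disrupts | fit, accessible)
    """
    terms = (p_fit, p_accessible_given_fit, p_disrupts_given_fit_accessible)
    for term in terms:
        if not 0.0 <= term <= 1.0:
            raise ValueError("each term must be a probability in [0, 1]")
    return p_fit * p_accessible_given_fit * p_disrupts_given_fit_accessible


# Hypothetical per-mutation terms: (fitness, accessibility, dissimilarity).
mutations = {
    "mutation_A": (0.90, 0.80, 0.70),
    "mutation_B": (0.95, 0.20, 0.90),
}

# Rank mutations from most to least likely to escape under these made-up inputs.
ranked = sorted(
    mutations,
    key=lambda name: escape_probability(*mutations[name]),
    reverse=True,
)
for name in ranked:
    print(name, round(escape_probability(*mutations[name]), 3))
```

In this toy input, a mutation that is fit and disruptive but poorly accessible ranks lower than one that scores moderately on all three terms, because a single small factor pulls the whole product down.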
AlphaMissense
Better classification of mutations with AI

Scientists still do not fully understand which genetic mutations lead to diseases. With millions of possible genetic mutations, determining whether a mutation is benign or pathogenic requires labor-intensive experiments. In 2023, researchers from Google DeepMind unveiled AlphaMissense, a new AI model that predicted the pathogenicity of 71 million missense variants. Missense mutations are genetic alterations that impact the functionality of human proteins (Figure 5.2.7) and can lead to various diseases, including cancer. Of the 71 million possible missense variants, AlphaMissense classified 89%, identifying 57% as likely benign and 32% as likely pathogenic, while the remainder were categorized as uncertain (Figure 5.2.8). In contrast, human annotators have only been able to confirm the nature of 0.1% of all missense mutations.

Hemoglobin subunit beta (HBB)
Source: Google DeepMind, 2023
Figure 5.2.7

AlphaMissense predictions
Source: Google DeepMind, 2023 | Chart: 2024 AI Index report
[Bar chart; x-axis: % of variants classified; y-axis: Prediction category. Likely benign: 57%; likely pathogenic: 32%; uncertain: 11%]
Figure 5.2.8

Human Pangenome Reference
Using AI to map the human genome

The human genome is a set of molecular instructions for a human. The first human genome draft was released in 2000 and updated in 2022. However, the update was somewhat incomplete. It did not incorporate various genetic mutations, like blood type, and did not map diverse ancestry groups as completely. Therefore, under the existing genome reference, it would be difficult to detect diseases or find cures in certain groups of people. In 2023, the Human Pangenome Research Consortium, comprising 119 scientists from 60 institutions, used AI to develop an updated and more representative human genome map (Figure 5.2.9). The researchers achieved remarkable accuracy, annotating a median of 99.07% of protein-coding genes, 99.42% of protein-coding transcripts, 98.16% of noncoding genes, and 98.96% of noncoding transcripts, as detailed in Figure 5.2.10. This latest version of the genome represents the most comprehensive and genetically diverse mapping of the human genome to date.

Graph genome for the MHC region of the genome
Source: Google Research, 2023
[Diagram labels: different individuals' sequences; reference genome path]
Figure 5.2.9

Ensembl mapping pipeline results
Source: Liao et al., 2023 | Chart: 2024 AI Index report
[Bar chart; x-axis: Genes and transcripts. Protein-coding genes: 99.07%; protein-coding transcripts: 99.42%; noncoding genes: 98.16%; noncoding transcripts: 98.96%]
Figure 5.2.10

Clinical Knowledge

Evaluating the clinical knowledge of AI models involves determining the extent of their medical expertise, particularly knowledge applicable in a clinical setting.

MedQA

Introduced in 2020, MedQA is a comprehensive dataset derived from professional medical board exams, featuring over 60,000 clinical questions designed to challenge doctors. AI performance on the MedQA benchmark has seen remarkable improvement, with the leading system, GPT-4 Medprompt, achieving an accuracy rate of 90.2%, a 22.6 percentage point increase from the top score in 2022 (Figure 5.2.11). Since MedQA's inception, AI capabilities on this benchmark have nearly tripled, showcasing the rapid improvements of clinically knowledgeable AI systems.
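The MedQA figures quoted in this chapter are stated in percentage points, which is not the same as a relative percentage change. The short sketch below works through the arithmetic using only the two numbers given in the text (90.2% accuracy in 2023 and a 22.6 percentage point gain over the top 2022 score); the function and variable names are illustrative.

```python
def previous_score(current_pct, gain_in_points):
    """Recover the earlier score from the current score and a percentage-point gain."""
    return current_pct - gain_in_points


def relative_increase(current_pct, previous_pct):
    """Relative change, in percent, from the earlier score to the current one."""
    return (current_pct - previous_pct) / previous_pct * 100.0


top_2023 = 90.2        # GPT-4 Medprompt accuracy on MedQA, from the text
gain_in_points = 22.6  # stated gain over the top 2022 score, in percentage points

top_2022 = previous_score(top_2023, gain_in_points)
relative_gain = relative_increase(top_2023, top_2022)

print(f"Implied top 2022 score: {top_2022:.1f}%")
print(f"Relative increase: {relative_gain:.1f}%")
```

Under these inputs the implied 2022 top score is about 67.6%, and the same gain expressed as a relative change is roughly 33%, which is why the two ways of reporting improvement should not be mixed.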
MedQA: accuracy
Source: Papers With Code, 2023 | Chart: 2024 AI Index report
[Line chart; x-axis: year (2019 to 2023); y-axis: accuracy; final value labeled 90.20%]
Figure 5.2.11

Highlighted Research: GPT-4 Medprompt

Although LLMs exhibit impressive general knowledge, it is commonly assumed that significant fine-tuning is required for them to excel at specialized knowledge, such as answering medical questions. Fine-tuning entails training an LLM on domain-specific data.

Research from Microsoft in late 2023 has overturned this assumption. This study employed prompt engineering to direct GPT-4 toward achieving remarkable performance on the MultiMedQA benchmark suite, a group of four challenging medical benchmarks (Figure 5.2.12). GPT-4 Medprompt exceeded the performance of the top 2022 model, Flan-PaLM 540B, in the multiple-choice sections of several renowned medical benchmarks, including PubMedQA, MedMCQA, and MMLU, by 3.0, 21.5, and 16.2 percentage points, respectively. It also exceeded the performance of the then state-of-the-art Med-PaLM 2 (Figure 5.2.13).

Moreover, as noted earlier, GPT-4 Medprompt was the first to surpass the 90% accuracy mark on the MedQA benchmark. This breakthrough not only underscores GPT-4 Medprompt's exceptional and potentially clinically useful medical capabilities but also demonstrates that fine-tuning may not always be necessary for adapting models to specialized domains. Prompt engineering has been shown to be a promising alternative strategy.

GPT-4 vs. Med-PaLM 2 answering a medical question
Source: Nori et al., 2023

Question: A 22-year-old male marathon runner presents to the office with the complaint of right-sided rib pain when he runs long distances. Physical examination reveals normal heart and lung findings and an exhalation dysfunction at ribs 4-5 on the right. Which of the following muscles or muscle groups will be most useful in correcting this dysfunction utilizing a direct method?
(A) anterior scalene (B) latissimus dorsi (C) pectoralis minor (D) quadratus lumborum

Hand-crafted CoT from Med-PaLM 2 (expert): Let's solve this step-by-step, referring to authoritative sources as needed. Among the options, only pectoralis minor muscle origins from the outer surfaces of the 3rd to 5th ribs.

GPT-4-generated CoT: The primary muscle involved in rib exhalation is the internal intercostal muscle. However, this option is not listed among the answer choices. Among the provided options, the pectoralis minor can contribute to rib movement, as it originates from the 3rd-5th ribs and can act to depress the ribs during exhalation when the scapula is fixed. Therefore, it could potentially be useful in correcting an exhalation dysfunction at ribs 4-5.

Figure 5.2.12

Model performance on MultiMedQA sub-benchmarks
Source: Nori et al., 2023 | Chart: 2024 AI Index report
[Grouped bar chart; series: MMLU, MedMCQA, PubMedQA, MedQA; x-axis: Flan-PaLM 540B (2022), Med-PaLM 2, GPT-4, and GPT-4 Medprompt (2023)]
Figure 5.2.13
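To make the notion of prompt engineering more concrete, the sketch below assembles a chain-of-thought (CoT) style prompt for a multiple-choice question, reusing the question and the step-by-step cue shown in Figure 5.2.12. It is a hypothetical illustration: `build_cot_prompt` is an invented helper, no model is called, and the actual Medprompt pipeline is not described in this chapter and may differ substantially.

```python
# Step-by-step cue quoted from the hand-crafted CoT example in Figure 5.2.12.
COT_CUE = "Let's solve this step-by-step, referring to authoritative sources as needed."


def build_cot_prompt(question, options):
    """Format a multiple-choice question followed by a chain-of-thought cue.

    Hypothetical helper for illustration; it only builds the prompt text.
    """
    lines = ["Question: " + question.strip()]
    for index, option in enumerate(options):
        label = chr(ord("A") + index)  # A, B, C, ...
        lines.append(f"({label}) {option}")
    lines.append(COT_CUE)
    return "\n".join(lines)


question = (
    "A 22-year-old male marathon runner presents to the office with the "
    "complaint of right-sided rib pain when he runs long distances. Physical "
    "examination reveals normal heart and lung findings and an exhalation "
    "dysfunction at ribs 4-5 on the right. Which of the following muscles or "
    "muscle groups will be most useful in correcting this dysfunction "
    "utilizing a direct method?"
)
options = ["anterior scalene", "latissimus dorsi", "pectoralis minor", "quadratus lumborum"]

prompt = build_cot_prompt(question, options)
print(prompt)
```

The point of the sketch is that the model's weights are untouched: only the text sent to the model changes, which is what distinguishes prompt engineering from fine-tuning.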
Highlighted Research: MediTron-70B

GPT-4 Medprompt is an impressive system; however, it is closed-source, meaning its weights are not freely available to the broader public for use. New research in 2023 has also sought to advance the capabilities of open-source medical LLMs. Among this new research, MediTron-70B stands out as particularly promising. This model achieves a respectable 70.2% accuracy on the MedQA benchmark. Although this is below the performance of GPT-4 Medprompt and Med-PaLM 2 (both closed models), it represents a significant improvement over the state-of-the-art results from 2022 and surpasses other open-source models like Llama 2 (Figure 5.2.14). MediTron-70B's score on MedQA is the highest yet achieved by an open-source model. If medical AI is to reach its fullest potential, it is important that its capabilities are widely accessible. In this context, MediTron represents an encouraging step forward.

Performance of select models on MedQA
Source: Chen et al., 2023 | Table: 2024 AI Index report

Model              Release date     Access type   Score on MedQA
GPT-4 Medprompt    November 2023    Closed        90.20%
Med-PaLM 2         April 2023       Closed        86.20%
MediTron-70B       November 2023    Open          70.20%
Med-PaLM           December 2022    Closed        67.20%
Llama 2            July 2023        Open          63.80%

Figure 5.2.14

Diagnosis

AI tools can also be used for diagnostic purposes, for example, in radiology or cancer detection.

Highlighted Research: CoDoC

AI medical imaging systems demonstrate robust diagnostic capabilities, yet there are instances where they overlook diagnoses that clinicians catch, and vice versa. This observation suggests a logical integration of AI systems and clinicians' diagnostic abilities. In 2023, researchers unveiled CoDoC (Complementarity-Driven Deferral to Clinical Workflow), a system designed to discern when to rely on AI for diagnosis and when to defer to traditional clinical methods. CoDoC notably enhances both sensitivity (the ability to correctly identify individuals with a disease) and specificity (the ability to accurately identify those without it). Specifically, across four medical datasets, CoDoC's sensitivity surpasses clinicians' by an average of 4.5 percentage points and a standalone predictive model's by 6.5 percentage points (Figure 5.2.15). In terms of specificity, CoDoC outperforms clinicians by an average of 2.7 percentage points across tested datasets and a standalone predictive model by 5.7 percentage points. Moreover, CoDoC has been shown to reduce clinicians' workload by 66%. These findings suggest that AI medical systems can be integrated into clinical workflows, thereby enhancing diagnostic accuracy and efficiency.

CoDoC vs. standalone predictive AI system and clinical readers: sensitivity
Source: Dvijotham et al., 2023 | Chart: 2024 AI Index report
[Grouped bar chart; series: CoDoC, Clinician(s), Standalone predictive AI model; x-axis: Task and dataset. Breast cancer detection: UK mammography dataset, US mammography dataset 1, US mammography dataset 2. TB detection: TB dataset]
Figure 5.2.15

Highlighted Research: CT Panda

Pancreatic ductal adenocarcinoma (PDAC) is a particularly lethal cancer, often detected too late for surgical intervention.
Screening for PDAC in asymptomatic individuals is challenging due to its low prevalence and the risk of false positives. This year, a Chinese research team developed PANDA (pancreatic cancer detection with artificial intelligence), an AI model capable of efficiently detecting and classifying pancreatic lesions in non-contrast CT scans (Figure 5.2.16). In validation tests, PANDA surpassed the average radiologist in sensitivity by 34.1% and in specificity by 6.3% (Figure 5.2.17). In a large-scale, real-world test involving approximately 20,000 patients, PANDA achieved a sensitivity of 92.9% and a specificity of 99.9% (Figure 5.2.18). AI medical tools like PANDA represent significant advancements in diagnosing challenging conditions, offering cost-effective and accurate detection previously considered difficult or prohibitive.

PANDA detection
Source: Cao et al., 2023
[Image panels: PANDA detection; PANDA prediction (on non-contrast CT)]
Figure 5.2.16

PANDA vs. mean radiologist on multicenter validation (6,239 patients)
Source: Cao et al., 2023 | Chart: 2024 AI Index report
[Bar chart of PANDA's margin over the mean radiologist. Sensitivity: 34.10%; specificity: 6.30%]
Figure 5.2.17

PANDA performance on real-world multi-scenario validation (20,530 patients)
Source: Cao et al., 2023 | Chart: 2024 AI Index report
[Bar chart. Sensitivity: 92.90%; specificity: 99.90%]
Figure 5.2.18

Other Diagnostic Uses

New research published in 2023 highlights how AI can be used in other diagnostic contexts. Figure 5.2.19 summarizes some of the findings.

Additional research on diagnostic AI use cases
Source: AI Index, 2024

Research: Schopf et al., 2023
Use case: Breast cancer
Findings: The authors conducted a meta-review of the literature exploring mammography-image-based AI algorithms. They discovered that predicting future breast cancer risk using only mammography images achieves accuracy that is comparable to or better than traditional risk assessment tools.

Research: Dicente Cid et al., 2023
Use case: X-ray interpretation
Findings: The researchers developed two open-source neural networks, X-Raydar and X-Raydar-NLP, for classifying chest X-rays using images and free-text reports. They found that these automated classification methods perform at levels comparable to human experts and demonstrate robustness when applied to external data sets.

Figure 5.2.19

FDA-Approved AI-Related Medical Devices

The U.S. Food and Drug Administration (FDA) maintains a list of AI/ML-enabled medical devices that have received approval. The devices featured on this list meet the FDA's premarket standards, which include a detailed review of their effectiveness and safety. As of October 2023, the FDA has not approved any devices that utilize generative AI or are powered by LLMs.

Figure 5.2.20 illustrates the number of AI medical devices approved by the FDA over the past decade. In 2022, a total of 139 AI-related medical devices received FDA approval, marking a 12.1% increase from the total approved in 2021. Since 2012, the number of these devices has increased by more than 45-fold.

Number of AI medical devices approved by the FDA, 2012-22
Source: FDA, 2023 | Chart: 2024 AI Index report
[Bar chart; x-axis: year. 2012: 3; 2013: 3; 2014: 6; 2015: 5; 2016: 18; 2017: 26; 2018: 63; 2019: 77; 2020: 107; 2021: 124; 2022: 139]
Figure 5.2.20

³ The FDA last updated the list in October 2023, meaning that the totals for 2023 were incomplete. Consequently, the AI Index limited its data presentation to include only information up to 2022.

Figure 5.2.21 illustrates the specialties associated with FDA-approved medical devices.
Of the 139 devices approved in 2022, a significant majority, 87.1%, were related to radiology. The next most common specialty was cardiovascular, accounting for 7.2% of the approvals.

Number of AI medical devices approved by the FDA by specialty, 2012-22
Source: FDA, 2023 | Chart: 2024 AI Index report
Heatmap of yearly approvals, 2012-2022, by specialty: Radiology; Cardiovascular; Neurology; Gastroenterology and urology; Hematology; Microbiology; General hospital; General and plastic surgery; Ophthalmic; Clinical chemistry; Anesthesiology; Pathology; Ear nose and throat; Dental; Orthopedic; Obstetrics and gynecology. In 2022, radiology accounted for 121 approvals and cardiovascular for 10.
Figure 5.2.21

Administration and Care

AI tools also hold the potential to enhance medical administration efficiency and elevate the standard of patient care.

Highlighted Research: MedAlign

Despite significant advances in AI for healthcare, existing benchmarks like MedQA and USMLE, focused on knowledge-based questions, do not fully capture the diverse tasks clinicians perform in patient care. Clinicians often engage in information-intensive tasks, such as creating tailored diagnostic plans, and spend a significant proportion of their working hours on administrative tasks. Although AI has the potential to streamline these processes, there is a lack of suitable electronic health records (EHR) datasets for benchmarking and fine-tuning medically administrative LLMs. This year researchers have made strides to address this gap by introducing MedAlign: a comprehensive EHR-based benchmark with 983 questions and instructions and 303 clinician responses, drawn from seven different medical specialties (Figure 5.2.22).

The researchers then tested various existing LLMs on MedAlign. Of all LLMs, a GPT-4 variant using multistep refinement achieved the highest correctness rate (65.0%) and was routinely preferred over other LLMs (Figure 5.2.23). MedAlign is a valuable milestone toward using AI to alleviate administrative burdens in healthcare.

MedAlign workflow
Source: Fleming et al., 2023
Clinician Instruction: Summarize from the EHR the strokes that the patient had and their associated neurologic deficits.
Clinician Response: The patient had strokes in the L basal ganglia in 2018 and multiple strokes in 2022: R occipital, left temporal, L frontal. The patient had right sided weakness associated with the 2018 stroke after which she was admitted to rehab. She then had a left sided hemianopsia related to the 2022 stroke.
LLM Response: ???
Other diagram labels: EHR, LLM, Evaluating LLMs with MedAlign
Figure 5.2.22

Evaluation of model performance: human vs. COMET ranks
Source: Fleming et al., 2023 | Chart: 2024 AI Index report
Rows: Model A (winner). Columns: Model B (loser).

Human ranks
Model A (winner)      GPT-4 (32k + MR)  GPT-4 (32k)  GPT-4 (2k)  Vicuna-13B (2k)  Vicuna-7B (2k)  MPT-7B-Instruct (2k)
GPT-4 (32k + MR)      -                 48%          56%         73%              71%             82%
GPT-4 (32k)           52%               -            58%         72%              74%             81%
GPT-4 (2k)            44%               42%          -           67%              70%             76%
Vicuna-13B (2k)       27%               28%          33%         -                50%             63%
Vicuna-7B (2k)        29%               26%          30%         50%              -               64%
MPT-7B-Instruct (2k)  18%               19%          24%         37%              36%             -

COMET ranks
Model A (winner)      GPT-4 (32k + MR)  GPT-4 (32k)  GPT-4 (2k)  Vicuna-13B (2k)  Vicuna-7B (2k)  MPT-7B-Instruct (2k)
GPT-4 (32k + MR)      -                 50%          52%         66%              63%             79%
GPT-4 (32k)           50%               -            51%         63%              58%             77%
GPT-4 (2k)            48%               49%          -           66%              61%             79%
Vicuna-13B (2k)       34%               37%          34%         -                49%             70%
Vicuna-7B (2k)        37%               42%          39%         51%              -               71%
MPT-7B-Instruct (2k)  21%               23%          21%         30%              29%             -

Figure 5.2.23

Appendix

Acknowledgments

The AI Index would like to acknowledge Emma Williamson for her work surveying the literature on significant AI-related science and medicine trends.

Benchmarks
1.
MedQA: Data on MedQA was taken from the MedQA Papers With Code leaderboard in January 2024. To learn more about MedQA, please read the original paper.

FDA-Approved AI-Medical Devices

Data on FDA-approved AI-medical devices is sourced from the FDA website that tracks artificial intelligence and machine learning (AI/ML)-enabled medical devices.

Works Cited

Cao, K., Xia, Y., Yao, J., Han, X., Lambert, L., Zhang, T., Tang, W., Jin, G., Jiang, H., Fang, X., Nogues, I., Li, X., Guo, W., Wang, Y., Fang, W., Qiu, M., Hou, Y., Kovarnik, T., Vocka, M., Lu, J. (2023). "Large-Scale Pancreatic Cancer Detection via Non-contrast CT and Deep Learning." Nature Medicine 29, no. 12: 3033-3043. https://doi.org/10.1038/s41591-023-02640-w.

Chen, Z., Cano, A. H., Romanou, A., Bonnet, A., Matoba, K., Salvi, F., Pagliardini, M., Fan, S., Kopf, A., Mohtashami, A., Sallinen, A., Sakhaeirad, A., Swamy, V., Krawczuk, I., Bayazit, D., Marmet, A., Montariol, S., Hartley, M.-A., Jaggi, M. & Bosselut, A. (2023). MEDITRON-70B: Scaling Medical Pretraining for Large Language Models (arXiv:2311.16079). arXiv. http://arxiv.org/abs/2311.16079.

Cheng, J., Novati, G., Pan, J., Bycroft, C., Zemgulyte, A., Applebaum, T., Pritzel, A., Wong, L. H., Zielinski, M., Sargeant, T., Schneider, R. G., Senior, A. W., Jumper, J., Hassabis, D., Kohli, P. & Avsec, Z. (2023). "Accurate Proteome-Wide Missense Variant Effect Prediction With AlphaMissense." Science 381. https://doi.org/10.1126/science.adg7492.

Cid, Y. D., Macpherson, M., Gervais-Andre, L., Zhu, Y., Franco, G., Santeramo, R., Lim, C., Selby, I., Muthuswamy, K., Amlani, A., Hopewell, H., Indrajeet, D., Liakata, M., Hutchinson, C. E., Goh, V. & Montana, G. (2024). "Development and Validation of Open-Source Deep Neural Networks for Comprehensive Chest X-Ray Reading: A Retrospective, Multicentre Study." The Lancet Digital Health 6, no. 1: e44-e57. https://doi.org/10.1016/S2589-7500(23)00218-2.

Fleming, S. L., Lozano, A., Haberkorn, W. J., Jindal, J. A., Reis, E. P., Thapa, R., Blankemeier, L., Genkins, J. Z., Steinberg, E., Nayak, A., Patel, B. S., Chiang, C.-C., Callahan, A., Huo, Z., Gatidis, S., Adams, S. J., Fayanju, O., Shah, S. J., Savage, T., ... Shah, N. H. (2023). MedAlign: A Clinician-Generated Dataset for Instruction Following With Electronic Medical Records (arXiv:2308.14089). arXiv. http://arxiv.org/abs/2308.14089.

Ha, T., Lee, D., Kwon, Y., Park, M. S., Lee, S., Jang, J., Choi, B., Jeon, H., Kim, J., Choi, H., Seo, H.-T., Choi, W., Hong, W., Park, Y. J., Jang, J., Cho, J., Kim, B., Kwon, H., Kim, G., ... Choi, Y.-S. (2023). "AI-Driven Robotic Chemist for Autonomous Synthesis of Organic Molecules." Science Advances 9, no. 44. https://doi.org/10.1126/sciadv.adj0461.

Iglesias, J. E., Billot, B., Balbastre, Y., Magdamo, C., Arnold, S. E., Das, S., Edlow, B. L., Alexander, D. C., Golland, P. & Fischl, B. (2023). "SynthSR: A Public AI Tool to Turn Heterogeneous Clinical Brain Scans into High-Resolution T1-Weighted Images for 3D Morphometry." Science Advances 9, no. 5. https://doi.org/10.1126/sciadv.add3607.

Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. & Szolovits, P. (2020). What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset From Medical Exams (arXiv:2009.13081; Version 1). arXiv. http://arxiv.org/abs/2009.13081.

Kavungal, D., Magalhaes, P., Kumar, S. T., Kolla, R., Lashuel, H. A. & Altug, H. (2023). "Artificial Intelligence-Coupled Plasmonic Infrared Sensor for Detection of Structural Protein Biomarkers in Neurodegenerative Diseases." Science Advances 9, no. 28. https://doi.org/10.1126/sciadv.adg9644.

Lam, R., Sanchez-Gonzalez, A., Willson, M., Wirnsberger, P., Fortunato, M., Alet, F., Ravuri, S., Ewalds, T., Eaton-Rosen, Z., Hu, W., Merose, A., Hoyer, S., Holland, G., Vinyals, O., Stott, J., Pritzel, A., Mohamed, S. & Battaglia, P. (2023). "Learning Skillful Medium-Range Global Weather Forecasting." Science 382. https://doi.org/10.1126/science.adi2336.

Liao, W.-W., Asri, M., Ebler, J., Doerr, D., Haukness, M., Hickey, G., Lu, S., Lucas, J. K., Monlong, J., Abel, H. J., Buonaiuto, S., Chang, X. H., Cheng, H., Chu, J., Colonna, V., Eizenga, J. M., Feng, X., Fischer, C., Fulton, R. S., ... Paten, B. (2023). "A Draft Human Pangenome Reference." Nature 617: 312-24. https://doi.org/10.1038/s41586-023-05896-x.

Mankowitz, D. J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J.-B., Ahern, A., Koppe, T., Millikin, K., Gaffney, S., Elster, S., Broshear, J., Gamble, C., Milan, K., Tung, R., Hwang, M., ... Silver, D. (2023). "Faster Sorting Algorithms Discovered Using Deep Reinforcement Learning." Nature 618: 257-63. https://doi.org/10.1038/s41586-023-06004-9.

Merchant, A., Batzner, S., Schoenholz, S. S., Aykol, M., Cheon, G. & Cubuk, E. D. (2023). "Scaling Deep Learning for Materials Discovery." Nature 624: 80-85. https://doi.org/10.1038/s41586-023-06735-9.

Nearing, G., Cohen, D., Dube, V., Gauch, M., Gilon, O., Harrigan, S., Hassidim, A., Klotz, D., Kratzert, F., Metzger, A., Nevo, S., Pappenberger, F., Prudhomme, C., Shalev, G., Shenzis, S., Tekalign, T., Weitzner, D. & Matias, Y. (2023). AI Increases Global Access to Reliable Flood Forecasts (arXiv:2307.16104). arXiv. http://arxiv.org/abs/2307.16104.

Nori, H., Lee, Y. T., Zhang, S., Carignan, D., Edgar, R., Fusi, N., King, N., Larson, J., Li, Y., Liu, W., Luo, R., McKinney, S. M., Ness, R. O., Poon, H., Qin, T., Usuyama, N., White, C. & Horvitz, E. (2023a). Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine (arXiv:2311.16452; Version 1). arXiv. http://arxiv.org/abs/2311.16452.

Schopf, C. M., Ramwala, O. A., Lowry, K. P., Hofvind, S., Marinovich, M. L., Houssami, N., Elmore, J. G., Dontchos, B. N., Lee, J. M. & Lee, C. I. (2024). "Artificial Intelligence-Driven Mammography-Based Future Breast Cancer Risk Prediction: A Systematic Review." Journal of the American College of Radiology 21, no. 2: 319-28. https://doi.org/10.1016/j.jacr.2023.10.018.

Shen, T., Munkberg, J., Hasselgren, J., Yin, K., Wang, Z., Chen, W., Gojcic, Z., Fidler, S., Sharp, N. & Gao, J. (2023). "Flexible Isosurface Extraction for Gradient-Based Mesh Optimization." ACM Transactions on Graphics 42, no. 4: 1-16. https://doi.org/10.1145/3592430.

Thadani, N. N., Gurev, S., Notin, P., Youssef, N., Rollins, N. J., Ritter, D., Sander, C., Gal, Y. & Marks, D. S. (2023). "Learning From Prepandemic Data to Forecast Viral Escape." Nature 622: 818-25. https://doi.org/10.1038/s41586-023-06617-0.