📄 Page
1
Andrew Nguyen
Hands-On Healthcare Data
Taming the Complexity of Real-World Data
📄 Page
2
DATA SCIENCE

“This book captures the complexity of healthcare data that impacts decisions in patient care, brings new scientific discoveries, and improves the industry as a whole. You’ll learn best practices and new possibilities to collect, transform, and analyze healthcare data.”
Łukasz Kaczmarek, Medical Informatics Architect

Hands-On Healthcare Data

US $79.99 CAN $99.99
ISBN: 978-1-098-11292-9
Twitter: @oreillymedia
linkedin.com/company/oreilly-media
youtube.com/oreillymedia

Healthcare is the next frontier for data science. Using the latest in machine learning, deep learning, and natural language processing, you’ll be able to solve healthcare’s most pressing problems: reducing cost of care, ensuring patients get the best treatment, and increasing accessibility for the underserved. But first, you have to learn how to access and make sense of all that data.

This book provides pragmatic and hands-on solutions for working with healthcare data, from data extraction to cleaning and harmonization to feature engineering. Author Andrew Nguyen covers specific ML and deep learning examples with a focus on producing high-quality data. You’ll discover how graph technologies help you connect disparate data sources so you can solve healthcare’s most challenging problems using advanced analytics.
You’ll learn about:

• Different types of healthcare data: electronic health records, clinical registries and trials, digital health tools, and claims data
• Challenges of working with healthcare data, especially when trying to aggregate data from multiple sources
• Current options for extracting structured data from clinical text
• How to make trade-offs when using tools and frameworks for normalizing structured healthcare data
• How to harmonize healthcare data using terminologies, ontologies, and mappings and crosswalks

Andrew Nguyen is a principal medical informatics architect at one of the largest biopharma companies in the world, where he designs scalable solutions to harmonize real-world healthcare data sources for machine learning and advanced analytics. He’s worked at the intersection of healthcare data and machine learning for a variety of organizations, from academia to startups, for over a decade. Andrew holds a PhD in biological and medical informatics from UCSF and a BS in electrical and computer engineering from UCSD.
📄 Page
3
Andrew Nguyen
Hands-On Healthcare Data
Taming the Complexity of Real-World Data
Boston Farnham Sebastopol Tokyo Beijing
📄 Page
4
978-1-098-11292-9 [LSI] Hands-On Healthcare Data by Andrew Nguyen Copyright © 2022 Andrew Nguyen. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Acquisitions Editor: Michelle Smith Development Editor: Melissa Potter Production Editor: Christopher Faucher Copyeditor: Kim Wimpsett Proofreader: James Fraleigh Indexer: nSight, Inc. Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Kate Dullea August 2022: First Edition Revision History for the First Edition 2022-08-10: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781098112929 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hands-On Healthcare Data, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
📄 Page
5
Table of Contents

Foreword vii
Preface ix

1. Introduction to Healthcare Data 1
    The Enterprise Mindset 2
    The Complexity of Healthcare Data 4
    Sources of Healthcare Data 5
    Electronic Health Records 5
    Claims Data 9
    Clinical/Disease Registries 11
    Clinical Trials Data 12
    Data Collection and How That Affects Data Scientists 12
    Prospective studies 13
    Retrospective studies 14
    Conclusion 16

2. Technical Introduction 17
    Basic Introduction to Docker and Containers 18
    Installing and Testing Docker 18
    Conceptual Introduction to Databases 19
    ACID Compliance 20
    OLTP Systems 21
    OLAP Systems 22
    SQL Versus NoSQL 23
    SQL Databases 24
📄 Page
6
    (Labeled) Property Graph Databases 28
    Hypergraph Databases 33
    Resource Description Framework Databases 36
    Conclusion 39

3. Standardized Vocabularies in Healthcare 41
    Controlled Vocabularies, Terminologies, and Ontologies 42
    Key Considerations 44
    Pre-coordination Versus Post-coordination 47
    Case Study Example: EHR Data 49
    Common Terminologies 51
    CPT 51
    ICD-9 and ICD-10 52
    LOINC 52
    RxNorm 53
    SNOMED CT 53
    Key Takeaways 54
    Using the Unified Medical Language System 55
    Some Basic Definitions 56
    Concept Orientation 58
    Working with the UMLS 58
    UMLS and Relational Databases 62
    Preprocessing the UMLS 65
    UMLS and Property Graph Databases 67
    UMLS and Hypergraph Databases 71
    Review of the UMLS 77
    Conclusion 77

4. Deep Dive: Electronic Health Records Data 79
    Publicly Accessible Data 79
    Medical Information Mart for Intensive Care 80
    Synthea 87
    Data Models 89
    Goals 89
    Examples of Data Models 91
    Case Study: Medications 96
    The Medication Harmonization Problem 97
    Technical Deep Dive 99
    Connecting to the UMLS 119
    Difficulties Normalizing Structured Medical Data 120
    Conclusion 120
📄 Page
7
5. Deep Dive: Claims Data 123
    Publicly Accessible Data: SynPUF 124
    Data Models 125
    Choosing a Data Model 126
    Combining Claims and EHR Data 128
    Case Study: Combining Diagnoses and Medications 134
    OMOP Versus Graphs 135
    Considerations When Combining Different Sources of Healthcare Data 136
    Conclusion 140

6. Machine Learning and Analytics 143
    A Primer on Machine Learning 144
    What Is Feature Engineering? 144
    Graph-Based Deep Learning 146
    Extracting Data as a Table 147
    To SQL or Not to SQL 148
    Querying OMOP Data 151
    From Graphs to Dataframes 153
    Why Add the Complexity of Graphs? 155
    Machine Learning and Feature Engineering with Graphs 157
    Graph Embeddings 162
    node2vec 162
    cui2vec 164
    med2vec 166
    snomed2vec 166
    Some Final Thoughts About Embeddings 168
    Making the Case for Graph-Based Analysis 169
    Conclusion 170

7. Trends in Healthcare Analytics 173
    Federated Learning and Federated Analytics 174
    How Does Federated Learning Work? 175
    Why Federated Analytics/Learning? 176
    The Data Harmonization Challenge in a Federated Context 178
    Graphs and Federated Approaches 182
    Natural Language Processing 184
    Concept Extraction 185
    Beyond Concept Extraction 190
    Clinical NLP Tools 191
    Commercial Clinical NLP Solutions 197
📄 Page
8
    Key Differences Between Clinical NLP and Other Applications of NLP 198
    Conclusion 200

8. Graphs, Harmonization, and Some Final Thoughts 201
    Other Types of Healthcare RWD 204
    Data Normalization and Harmonization 205
    Merging Datasets 206
    Bridging IT and the Business 207
    It’s a Human, Not Technical, Problem 211
    Graphs Can Be Part of the Solution 214
    Graphs Are Not a Silver Bullet 215
    Conclusion 216

Index 219
📄 Page
9
Foreword

It is an exciting time to get involved in healthcare data. There is an emergence of advanced technologies, from MLOps to Cloud Computing, that enable new ways to harness machine learning and AI to improve the human condition.

For example, a recent JAMA study showed that neurocognitive decline is associated with a person’s decrease in walking speed at age forty-five compared to their pace when they were younger. Imagine being able to send guidance to a 25, 35, or 45-year-old that would improve their future self through the use of machine learning for healthcare. This scenario isn’t science fiction; it is possible today.

Similarly, radiology could benefit from advanced computer vision technologies that guide the expert practitioner to look at specific imaging results more closely and automatically identify rare conditions. Primary care physicians could link all available health records automatically to find a complete lineage of medical treatment and improve patient outcomes.

These exciting possibilities are on the horizon, but require new ways to work. The guidance in Andrew’s book will get us started on the path to this new world. Healthcare organizations can standardize the creation and processing of healthcare records, thus streamlining the processing of existing and new data, and driving down costs due to efficiency improvements and better treatment outcomes.

Several chapters discuss the intricate details of electronic health records data and the associated vocabularies, terminologies, and ontologies. These detailed resources are invaluable for healthcare professionals working to improve how they work with their organization’s data.

Further chapters dive into sophisticated technical approaches to solving challenging problems in healthcare. These solutions include coverage of topics like graph-based deep learning, commercial clinical NLP solutions, and data harmonization. Andrew covers these topics from a theoretical standpoint, as well as with hands-on examples with working code.
📄 Page
10
Andrew’s experience with healthcare data and medical informatics makes him the perfect person to propel these discussions. With this book, Andrew gives the reader an excellent opportunity to be part of the drive to revolutionize healthcare and turn what seems like science fiction into our new reality.

Noah Gift
Executive in Residence, Duke Master in Interdisciplinary Data Science (MIDS)
📄 Page
11
Preface

A few years ago, I was at the Google Faculty Institute, where I met Noah Gift during one of the lunch breaks. We got to talking about academia and education, and many of the challenges and opportunities we saw when it came to empowering people to become experts in data. Whether this was data engineering, data science, or even the more basic aspects of programming, we both saw the potential for fundamentally changing how knowledge is disseminated.

It was shortly after this conversation that Noah floated the idea of writing a book. While I had considered this previously, it was a fleeting thought and not something I had seriously considered. I filed the conversation away in the back of my mind and figured it could be the focus of my sabbatical (I was still in academia at the time).

A year later, everything was flipped upside down by COVID-19 and the world’s response. Despite having just received tenure and promotion, I decided to leave academia and return to industry, rolling up my sleeves and getting back into the thick of it. I was a few months into my first project (building a clinicogenomic database that pulled data from a handful of hospitals) when I started to see opportunities to help educate our teams on how we could improve our approach to dealing with the complexities of electronic health record (EHR) data.

By then, we were deep into the pandemic and all riding the roller-coaster of repeated loosening and tightening of the many COVID restrictions. Every day, I saw news articles and reports that were making a desperate attempt to draw conclusions from all of the data and anecdotes about the number of infections, mortality rates, false positives/negatives, and so forth.

As someone who had been working with healthcare data for years, I found it very challenging to listen to data scientists, epidemiologists, public health professionals, and even lay people draw conclusions and make serious decisions based on what I knew was very dirty and faulty data.
📄 Page
12
It also did not help that COVID-19 became a highly charged and political topic, with people trying to fit the data to preconceived notions, embodying the quote:

[People] use statistics as a drunken man uses lamp-posts, for support rather than for illumination.¹

¹ For detailed discussion on the attribution of this quote, please see https://oreil.ly/sqMGO.

I saw a tremendous opportunity to help people better understand the nuances and complexities of working with data that were collected outside of clinical studies and trials. Healthcare data reflects the underlying complexity of the delivery of care as well as our ever-evolving understanding of biology, physiology, pathophysiology, and interventions. Whether you are a data scientist or healthcare professional, this book will provide you with a data-centric perspective of various facets of healthcare.

It can be difficult to develop the appropriate skills, knowledge, and experience for tackling healthcare data, particularly for those not embedded within medical centers/health systems, public and private payers, or other organizations handling deep patient-level data. My goal in writing this book is to help bridge this gap, particularly for those who are new to healthcare data. This includes data scientists from other industries and even healthcare professionals who are not familiar with analyzing EHR data. This book also will be useful for epidemiologists, biostatisticians, and data scientists/analysts who have worked with cleaned and processed data, but have not been a part of the data-wrangling process itself.

If you’re reading this book, you are interested in working with data and passionate about solving problems in healthcare. However, you might be coming from a more technical, computer science, or data science background. Or, you might be an epidemiologist, researcher, or clinician with domain expertise and training but who is relatively new to working with data at this level.

If you have a technical background, this book will give you a crash course on many of the key learnings from the field of medical informatics over the past several decades. The intent is to help you get up and running more quickly and effectively than if you were to figure it out on your own. I have seen many excellent data engineers and data scientists work their way through one challenge after another, only to have reinvented something that hospital informatics teams have refined over the years. Not only did they reinvent the wheel, they reinvented a square wheel.

If you have a healthcare background, you are used to working with healthcare data but typically from narrow and specific perspectives. As a clinician, you interact with EHRs and other clinical information systems transactionally while caring for patients. As an epidemiologist or clinical researcher, you may have relied on your data and
📄 Page
13
informatics teams to clean and process your data. This book will help you take a step back so you can see the bigger picture and how we can and need to incorporate your knowledge and experience into the data-wrangling process.

The topics we will discuss in this book truly span both technical and domain topics. To be successful with healthcare data, particularly “real-world data” (as we call it in biotech and pharma), you need to have a foundational understanding of both sets of topics. This book bounces between qualitative discussions of healthcare data and technical walkthroughs. Depending on your background and interest, you might be drawn to some chapters more than others. However, my hope is that you come away from this book with a new perspective and common understanding of the challenges and potential solutions, regardless of your professional background.

As you will see, I also have a deep interest in graphs and graph databases and firmly believe that they are a necessary (but not sufficient) part of our overall solution to leveraging healthcare data at scale. I’ve taken the liberty of highlighting how many of our challenges can be mitigated or solved using graph databases (versus SQL).

I debated how deep to go into the code examples: too deep and I might lose those with less computer or data science experience; too shallow and you might be left wondering, “That’s it?” I tried to strike a balance by walking through a narrow use case, followed by examples of several different approaches.

It is impossible to give you a recipe that is universally applicable. There are far too many nuances from one use case to the next. So, my goal was to provide explanations in the context of a use case with the hope and intention that you might adapt this to your own situations and scenarios. The associated GitLab repository contains examples with more depth. I find examples are always good to get the creative juices flowing.

As you think about the ideas in the book or review the code examples, I urge you to always ask yourself:

• How might I adapt this to my use case?
• How is my use case similar or different?
• What would I need to adapt or change in order to make this work for me?

Success with healthcare (real-world) data requires that we be creative with how we frame our use case and how we apply different processes and technology. There is simply no one-size-fits-all solution. So, if you build upon the approaches in this book, please contribute examples back to the repository to help other readers.

I hope you enjoy the journey!
📄 Page
14
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://gitlab.com/hands-on-healthcare-data.

If you have a technical question or a problem using the code examples, please send an email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not
📄 Page
15
need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hands-On Healthcare Data by Andrew Nguyen (O’Reilly). Copyright 2022 Andrew Nguyen, 978-1-098-11292-9.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/healthcare-data.
📄 Page
16
Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.
Follow us on Twitter: https://twitter.com/oreillymedia.
Watch us on YouTube: https://youtube.com/oreillymedia.

Acknowledgments

Bố, Mẹ
    First and foremost, I want to acknowledge all of the sacrifices you have made over the years as you prioritized our education above all else. In doing so, you helped nurture a certain curiosity that has made today possible. From Stuart Hall to Urban to UCSD to UCSF, you made sure that I never had to worry about anything other than learning what I wanted to. This freedom allowed me to challenge myself and explore my passions (namely, computers and medicine) and carve out my niche in the world.

Day
    You have always helped me remember that life is about balance: too much or too little of anything can be detrimental. I know I don’t say it enough, but I am grateful for your perspective and influence, without which I would not have been able to enjoy the process of writing as much as I have.

Of course, there are many others who have been instrumental along the way, without whom I would not have gotten to this point:

Brenna Rowe
    As I think back to the late nights writing my dissertation, you always supported my intellectual curiosity and constant thirst for learning and experimenting, helping lay the foundation that led to this book.

David Avrin
    Thank you for introducing me to medical informatics and spending the time to help a young high school student discover an entirely new world. And, most importantly, thank you for pulling me back into medical informatics after a brief foray into the world of software engineering.

Lukasz Kaczmarek
    Thank you for tolerating my monologues and squirrel brain, and for helping me refine my thinking around medical informatics, IT, software, databases, architecture, and how best to communicate complex ideas.
📄 Page
17
Noah Gift
    I remember our first conversation about rethinking the idea of teaching and academia. What started as a random lunch conversation has blossomed into my first book and I am forever grateful. Looking forward to more fun in the future!

Yao Sun
    I always enjoyed our weekly conversations because I knew you would understand my crazy ideas enough to ask challenging and thought-provoking questions. I wouldn’t be where I am without your support and guidance along the way.

Of course, thank you to all those who reviewed my drafts and provided helpful comments and feedback: Ed Mitchell, Huanmei Wu, and Tim McLerran.

And finally, thank you, Melissa Potter and Chris Faucher, for your support, guidance, and tolerance as I stumbled my way through my first book!
📄 Page
19
CHAPTER 1
Introduction to Healthcare Data

Healthcare data is an exciting vertical for data science, and there are many opportunities to have real impact, whether from a clinical or technical perspective. For patients and clinicians, there is the alluring promise of truly personalized care where patients get the right treatment at the right time, tailored to their genetics, environment, beliefs, and lifestyle, each requiring effective integration, harmonization, and analysis of highly complex data. For data scientists and computer scientists, there are many open problems in natural language processing, graphs, the semantic web, and databases, among many others.

Additionally, there are “frontier” problems that arise from the combination of a specific technology with the nuances and complexities of healthcare. For example, there is nothing about healthcare data itself or data science that requires “regulatory-grade” reproducibility. Data scientists know how to use version control tools such as Git, and IT people know how to create database snapshots and use Docker containers. However, with regulatory bodies such as the US Food and Drug Administration (FDA) or the European Medicines Agency (EMA), there are specific requirements to track and store metadata and other artifacts to “prove” the results of the analysis, including reproducibility. Similarly, there is increasing desire and pressure among academics to ensure the reproducibility of studies and to share negative results. How to address these challenges at scale remains an open problem.

Despite the excitement for working with various types of healthcare data, there are still many misconceptions. Those with extensive experience working in enterprise environments tend to underestimate the complexity, often comparing real-world healthcare data projects to enterprise integrations. This is not to say that a typical enterprise data project is simple or easy. One of the major differences is the relation of how and why the data was captured relative to the actual work being done.
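The reproducibility requirement described earlier in this chapter (tracking metadata and artifacts so that an analysis can be re-run and verified) can be made a little more concrete with a small sketch. The Python example below records a manifest for a single analysis run: a hash of the input data, an identifier for the code version, and the parameters used. Everything here is illustrative and is not taken from any regulatory guidance: the function names `build_manifest` and `same_inputs`, the manifest fields, and the sample data are all hypothetical, and a real regulatory-grade process would likely need far more than this.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(dataset: bytes, code_version: str, parameters: dict) -> dict:
    """Collect a minimal set of metadata describing one analysis run.

    Assumption: `code_version` is supplied by the caller, for example a
    Git commit hash or a container image digest. The field names are
    hypothetical and chosen only for this illustration.
    """
    return {
        "dataset_sha256": sha256_of_bytes(dataset),
        "code_version": code_version,
        "parameters": parameters,
        "python_version": platform.python_version(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }


def same_inputs(a: dict, b: dict) -> bool:
    """Treat two runs as comparable only if data, code, and parameters match."""
    keys = ("dataset_sha256", "code_version", "parameters")
    return all(a[k] == b[k] for k in keys)


if __name__ == "__main__":
    # Made-up sample data standing in for a real extract.
    data = b"patient_id,hba1c\n1,6.9\n2,7.4\n"
    manifest = build_manifest(
        data, code_version="abc1234", parameters={"threshold": 7.0}
    )
    print(json.dumps(manifest, indent=2, sort_keys=True))
```

Storing a manifest like this alongside each result makes it possible to check later whether two runs really used the same data, code, and settings. Hashing the data, rather than trusting a filename or timestamp, is the design choice that matters most here, since healthcare extracts are often refreshed in place.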
📄 Page
20
In nearly every industry, the use of data today is a function of engineered systems. In other words, most data is generated by software systems rather than collected and entered by a human. For example, in advertising/marketing analytics, the data is generated by websites that track clicks and impressions.

In this chapter, we will walk through some of the nuances and complexities of healthcare data. Much of this complexity is a reflection of the delivery of healthcare itself: it is just really complicated! For those with a traditional IT background or who have worked in large companies dealing with complex data issues, we will start with a short discussion of the enterprise mindset and how you might frame healthcare data. After this, we will dive into a broader view of the complexities of healthcare data. Once this foundation has been set, you will get a broad overview of common sources of healthcare data.

The Enterprise Mindset

The data science industry has had many successes, both from companies applying data science and from those creating new data science methods. When leveraging and using data science, most organizations have the benefit of following the traditional enterprise mindset. Information and data architects within the organization can sit down together, discuss the various sources of data and intended use cases, and then craft an overarching information model and architecture.

Part of this process typically involves getting various stakeholders together into a single room to agree on how best to define individual nuggets of data or information. Until recently, this has been the approach that most companies have taken when trying to build data warehouses. The challenge in healthcare is that the sources of data operate in disconnected silos. When a patient enters the healthcare system, they typically do so via their primary care physician, urgent care, or the emergency department.
Naturally, one might say we should start here in order to create the information model that will be used to represent healthcare data. After all, nearly everything else flows downstream from the moment a patient makes an appointment or shows up at urgent care. Insurance companies or governments will need to reimburse hospitals for providing care; physicians will prescribe medications and companion diagnostics from the biopharma industry.

So, information architects can start by defining the idea of a patient and all of the associated data elements, such as demographics, medical history, and medication prescription history. However, as we start to look at the healthcare industry overall, there are already potential issues even when defining the “simple” idea of a patient. How does an insurance company think of patients? At least in the United States, insurance companies typically think of people as covered lives, not patients. While some may be