Building Knowledge Graphs: A Practitioner’s Guide

Authors: Jesús Barrasa, Jim Webber


Incredibly useful, knowledge graphs help organizations keep track of medical research, cybersecurity threat intelligence, GDPR compliance, web user engagement, and much more. They do so by storing interlinked descriptions of entities—objects, events, situations, or abstract concepts—and encoding the underlying information. How do you create a knowledge graph? And how do you move it from theory into production? Using hands-on examples, this practical book shows data scientists and data engineers how to build their own knowledge graphs. Authors Jesús Barrasa and Jim Webber from Neo4j illustrate common patterns for building knowledge graphs that solve many of today’s pressing knowledge management problems. You’ll quickly discover how these graphs become increasingly useful as you add data and augment them with algorithms and machine learning. - Learn the organizing principles necessa


📄 Text Preview (First 20 pages)


📄 Page 3

Quickly uncover hidden relationships and patterns across billions of data connections

📄 Page 4

Building Knowledge Graphs
A Practitioner’s Guide

Jesús Barrasa and Jim Webber

Beijing, Boston, Farnham, Sebastopol, Tokyo

📄 Page 5

978-1-098-12711-4
[LSI]

Building Knowledge Graphs
by Jesús Barrasa and Jim Webber

Copyright © 2023 Neo4j Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nicole Butterfield
Development Editor: Corbin Collins
Production Editor: Jonathon Owen
Copyeditor: Shannon Turlington
Proofreader: Piper Editorial Consulting, LLC
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

June 2023: First Edition

Revision History for the First Edition
2023-06-21: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Building Knowledge Graphs, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Neo4j. See our statement of editorial independence.

📄 Page 6

Table of Contents

Preface

1. Introducing Knowledge Graphs
   What Are Graphs?
   The Motivation for Knowledge Graphs
   Knowledge Graphs: A Definition
   Summary

2. Organizing Principles for Building Knowledge Graphs
   Organizing Principles of a Knowledge Graph
   Plain Old Graphs
   Richer Graph Models
   Knowledge Graphs Using Taxonomies for Hierarchy
   Knowledge Graphs Using Ontologies for Multilevel Relationships
   Which Is the Best Organizing Principle for Your Knowledge Graph?
   Organizing Principles: Standards Versus Create Your Own
   Creating Your Own Organizing Principle
   Essential Characteristics of a Knowledge Graph
   Summary

3. Graph Databases
   The Cypher Query Language
   Creating Data in a Knowledge Graph
   Avoiding Duplicates When Enriching a Knowledge Graph
   Graph Local Queries
   Graph Global Queries
   Calling Functions and Procedures
   Supporting Tools for Writing Knowledge Graph Queries

📄 Page 7

   Neo4j Internals
   Query Processing
   ACID Transactions
   Summary

4. Loading Knowledge Graph Data
   Loading Data with the Neo4j Data Importer
   Online Bulk Data Loading with LOAD CSV
   Initial Bulk Load
   Summary

5. Integrating Knowledge Graphs with Information Systems
   Towards a Data Fabric
   The Database Driver
   Graph Federation with Composite Databases
   Server-Side Procedures
   Data Virtualization with Neo4j APOC
   Custom Functions and Procedures
   Complementary Tools and Techniques
   GraphQL
   Kafka Connect Plug-In
   Neo4j Spark Connector
   Apache Hop for ETL
   Summary

6. Enriching Knowledge Graphs with Data Science
   Why Graph Algorithms?
   Different Classes of Graph Algorithms
   Graph Data Science Operations
   Experimenting with Graph Data Science
   Production Considerations
   Enriching the Knowledge Graph
   Summary

7. Graph-Native Machine Learning
   Machine Learning in a Nutshell
   Topological Machine Learning
   Graph-Native Machine Learning Pipelines
   Recommending Complementary Actors
   Summary

📄 Page 8

8. Mapping Data with Metadata Knowledge Graphs
   The Challenge of Distributed Data Stewardship
   Datasets Connected to Data Platforms
   Tasks and Data Pipelines
   Data Sinks
   Metadata Graph Example
   Querying the Metadata Graph Model
   Using Relationships to Connect Data and Metadata
   Summary

9. Identity Knowledge Graphs
   Knowing Your Customer
   When Does the Problem Appear?
   Graph-Based Entity Resolution Step by Step
   Data Preparation
   Entity Matching
   Build/Update a Persisted Record of Master Entities
   Working with Unstructured Data
   Summary

10. Pattern Detection Knowledge Graphs
   Fraud Detection
   First-Party Fraud
   Uncovering Fraud from Data
   Fraud Rings
   Innocent Bystanders
   Operationalizing the Fraud Detection Knowledge Graph
   Skills Matching
   Organizational Knowledge Graph
   Skills Knowledge Graph
   Expertise Knowledge Graph
   Individual Career Growth
   Organizational Planning
   Predicting Organizational Performance
   Summary

11. Dependency Knowledge Graphs
   Dependencies as a Graph
   Advanced Graph Dependency Modeling
   Qualified Dependencies
   Semantics of Multidependency
   Impact Propagation with Cypher

📄 Page 9

   Validating a Dependency Knowledge Graph
   Validation 1: No Cycles
   Validation 2: Aggregate Multidependencies Add Up to the Expected Total
   Complex Dependency Processing
   Single-Point-of-Failure Analysis
   Root Cause Analysis
   Summary

12. Semantic Search and Similarity
   Search over Unstructured Data
   From Strings to Things: Annotating Documents with Entities
   Navigating the Connections: Document Similarity for Recommendations
   The Cold Start Problem
   Making the Annotation Semantic with an Organizing Principle
   Summary

13. Talking to Your Knowledge Graph
   Question Answering: Natural Language as a Source of Facts for a Knowledge Graph
   Using Natural Language Query with a Knowledge Graph
   Natural Language Generation from Knowledge Graphs
   Annotating the Knowledge Graph’s Organizing Principle to Drive Natural Language Generation
   Working with Lexical Databases
   Graph-Based Semantic Similarity
   Path Similarity
   Leacock-Chodorow Similarity
   Wu and Palmer Similarity
   Summary

14. From Knowledge Graphs to Knowledge Lakes
   Conventional Knowledge Graph Deployments
   From Knowledge Graphs to Knowledge Lakes
   Looking to the Future

Index

📄 Page 10

Preface

Graph databases and graph data science have reached a significant level of adoption. They have been extensively used for a range of discrete use cases like logistics, recommendations, and fraud detection. But there is a bigger emerging trend to arrange data in a deliberate manner that enables insight at scale across functional silos. The technology underpinning this trend is known as a knowledge graph.

The forces behind the trend are clear: organizations are no longer suffering from data scarcity. In fact, in an era when big data seems to be a solved problem (at least from a storage point of view), many organizations are practically drowning in data. Industry anecdotes of many thousands of relational tables per day being ingested into a data lake abound, but with an abundance of data there comes the unexpected challenge of what to do with it.

This is where knowledge graphs help. A knowledge graph is a purposeful arrangement of data such that information is put in context and insight is readily available. Individual records are placed in an associative network of relationships that provide rich semantic connectivity and context. That network of relationships (a graph) is an incredibly intuitive way of representing useful knowledge. Data that might have originally existed to serve a fraud-detection use case can be repurposed seamlessly within the knowledge graph to provide data for recommending financial products (or vice versa). And from there it is straightforward to connect other data to support other vertical use cases or horizontal analysis.

Importantly, while the term knowledge graph has only come to prominence in industry relatively recently, knowledge graph systems have been in existence for some time. This book tries to distill our experience of understanding knowledge graphs deployed in real systems by organizations around the world. It addresses the emerging trend of building systems on knowledge graphs as well as thinking about knowledge graphs as a general-purpose underlay for the enterprise. It also addresses the contemporary intersection of knowledge graphs and artificial intelligence (AI), where knowledge

📄 Page 11

graphs provide high-quality features for machine learning, are themselves enriched by AI, and can even tame the hallucinatory nature of large language models (LLMs).

While this is our most in-depth and unapologetically technical book on the topic, this isn’t our first time writing about knowledge graphs. In fact, in the book Knowledge Graphs: Data in Context for Responsive Businesses (O’Reilly), we highlighted the business benefits of knowledge graph adoption aimed at an audience of CIOs and CDOs. But this book goes much deeper technically, and it contains enough implementation detail for a range of tools, patterns, and practices so that you can build your own knowledge graphs with confidence. We hope that what you learn here will propel you to your first successful knowledge graph project and beyond!

Who This Book Is For

This is a technical book, aimed at computing professionals (typically software engineers, system architects, and technical managers) who want to understand both the potential of knowledge graphs and how to go about implementing them. While no prior experience with knowledge graphs (or graphs in the general sense) is required, readers will get the most out of the book if they are modestly comfortable with database concepts like queries and have some programming experience.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
   Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
   Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
   Shows commands or other text that should be typed literally by the user.

Constant width italic
   Shows text that should be replaced with user-supplied values or by values determined by context.

📄 Page 12

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

📄 Page 13

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/building-knowledge-graphs.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://www.youtube.com/oreillymedia

Acknowledgments

We are grateful to all those who helped us while we were writing this book: the staff at O’Reilly, especially Corbin Collins, whose efforts kept our prose properly Americanized, and our Neo4j colleagues, especially Maya Natarajan and Deb Cameron, who worked with us on early drafts. We are grateful to Dr. Nicola Vitucci, who provided deep technical expertise and patient guidance on our prose. We would also like to thank Nigel Small, who provided detailed technical feedback on idiomatic Python graph data science. Finally, we would like to thank the technical reviewers from O’Reilly, Max de Marzi and Janit Anjaria, for their enthusiastic and detailed feedback.

📄 Page 14

CHAPTER 1
Introducing Knowledge Graphs

We’re overwhelmed by data. It’s everywhere and being collected at a fantastic rate and stored at substantial cost. But we’re not necessarily getting value from that data, though there is significant value in it, if only we could understand it.

All is not lost. Over the last decade, a new category of technology based on graphs has moved from obscurity to prominence. Graphs have come to underpin everything from consumer-facing systems like navigation and social networks to critical infrastructure like supply chains and power grids.

These important graph use cases have reached a common conclusion: applying knowledge in context is the most powerful tool that most businesses have. A set of patterns and practices called knowledge graphs has been emerging to help understand data in context, where the context is represented as a graph of connected data items. With knowledge graphs, there is hope that you can distill business value from data. This book is an attempt to show how it can be done.

Knowledge graphs are useful because they provide contextualized understanding of data. Context derives from the layer of metadata (graph topology and other features) that provides rules for structure and interpretation. This book shows how the connected context provided by knowledge graphs enables you to extract greater value from existing data, drive automation and process optimization, improve predictions, and support an agile response to changing business environments.

We wrote this book for technology professionals who are interested in building and operating knowledge graphs within their businesses. In a way, it’s a sequel to or expansion of O’Reilly’s Knowledge Graphs: Data in Context for Responsive Businesses, which we had the pleasure of writing in 2021. That report was intended for chief information officers (CIOs) and chief data officers (CDOs), helping them to

📄 Page 15

understand the benefits of knowledge graphs. This time we’re aiming at data and software professionals who build sophisticated information systems.

The book is arranged in two parts. The first part deals with graph fundamentals, including graph databases, query languages, data wrangling, and graph data science. It teaches technical practitioners the fundamental tooling needed to understand the second part of this book, which tackles significant knowledge graph use cases and how to implement them, complete with code examples and system architectures.

For data and software professionals, this book provides an on-ramp to the world of knowledge graphs and a language for discussing their implementation with your peers and management. It also gives deep examples for how to build and use knowledge graphs, from graph basics all the way to graph machine learning.

For CIOs or CDOs, this book may still be useful since it provides a good overview of knowledge graphs and how they are delivered. You can skim the earlier chapters and code examples if that’s not your thing, but you’ll still be able to understand what your practitioner colleagues are doing and why.

This chapter explains the background and motivation for knowledge graphs. Here we’ll introduce the notions of graphs and graph data and start to show how to build smarter systems with knowledge graphs.

What Are Graphs?

Knowledge graphs are a type of graph, so it’s important to have a basic understanding of graphs. Graphs are simple structures that use nodes (or vertices) connected by relationships (or edges) to create high-fidelity models of a domain. To avoid any confusion, the graphs in this book have nothing to do with visualizing data as histograms or plotting a function, which are called charts, as shown in Figure 1-1.

📄 Page 16

Figure 1-1. Graphs versus charts

The graphs in this book are sometimes referred to as networks. They are a simple but powerful way of showing how things connect. Graphs are not new. In fact, graph theory was invented by Swiss mathematician Leonhard Euler in the 18th century to help compute the minimum distance that the emperor of Prussia had to walk to see the town of Königsberg (modern-day Kaliningrad) by ensuring that each of its seven bridges was crossed only once, as shown in Figure 1-2.

📄 Page 17

Figure 1-2. A graphical representation of Königsberg and its seven bridges across the Pregel River

Euler’s insight was that the problem shown in Figure 1-2 could be reduced to a logical form, stripping out all the noise of the real world and concentrating solely on how things are connected. He was able to demonstrate that the problem didn’t need to involve bridges, islands, or emperors. He proved that in fact the physical geography of Königsberg was completely irrelevant.

To us, Euler’s approach is similar to modern software development. The inherent noise of the real world is stripped away so that a more valuable logical representation (the software) remains. For software professionals, this feels comfortingly familiar.

You can use the superimposed graph in Figure 1-2 to reason about routes for walking around Königsberg without having to put on your walking boots and try it for real. In fact, Euler proved that the emperor could not walk the whole town crossing each bridge only once. Such a walk requires that no more than two land masses (nodes) have an odd number of connecting bridges (relationships), because every land mass other than the start and end of the walk must be left as many times as it is entered. All four land masses in Königsberg had an odd number of bridges, so no such route (path) is possible.
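Euler's argument can be checked mechanically by counting bridges per land mass. The following is a minimal Python sketch, not code from the book: the land-mass names (`north`, `south`, `island`, `east`) and the function names are illustrative assumptions, and the bridge list follows the standard historical layout of the seven bridges. It relies on the well-known rule that a connected graph has a walk using every edge exactly once only if zero or two nodes have an odd number of edges.

```python
from collections import Counter

# Each tuple is one bridge (relationship) between two land masses (nodes).
# The names are illustrative; the layout is the classic seven-bridge one.
BRIDGES = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("east", "island"),
    ("north", "east"),
    ("south", "east"),
]


def degrees(edges):
    """Count how many bridge ends touch each land mass."""
    count = Counter()
    for a, b in edges:
        count[a] += 1
        count[b] += 1
    return count


def has_eulerian_path(edges):
    """Return True if a walk crossing every edge exactly once can exist.

    Assumes the graph is connected (true for Königsberg), so only the
    number of odd-degree nodes needs checking: it must be 0 or 2.
    """
    odd_nodes = [node for node, d in degrees(edges).items() if d % 2 == 1]
    return len(odd_nodes) in (0, 2)


print(dict(degrees(BRIDGES)))
print(has_eulerian_path(BRIDGES))
```

With this bridge list every land mass has an odd degree, so the check fails. Adding a hypothetical eighth bridge between the two river banks would leave only two odd-degree land masses, and the check would then pass.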

📄 Page 18

Building on Euler’s work, mathematicians have studied various graph models, all variations on the theme of nodes connected by relationships. Some models allow relationships to be directed, where they have an explicit start and end node, while some have undirected relationships connecting nodes. Some models, like hypergraphs, allow relationships to connect multiple nodes.

In theory, there’s no single best graph model to choose (though you can usually transform from one model to another). But there are better or worse models in practice, especially for building computer systems. In this book we’ve chosen the labeled property graph model as the foundation. It’s a popular model that is simple for software and data professionals to understand. It is expressive enough to represent even the most complicated domains and is information rich (unlike the graphs beloved of mathematicians).

The Property Graph Model

The property graph model is the most popular model for modern graph databases. Correspondingly, it’s a common basis for creating knowledge graphs. It consists of the following elements:

Nodes representing entities in the domain
• Nodes can contain zero or more properties, which are key-value pairs representing entity data such as price or date of birth.
• Nodes can have zero or more labels, which declare the node’s purpose in the graph, such as representing a Customer or a Product.

Relationships representing how entities interrelate
• Relationships have a type, such as BOUGHT, FOLLOWS, or LIKES.
• Relationships have a direction, going from one node to another (or back to the same node).
• Relationships can contain zero or more properties, which are key-value pairs representing some characteristic of the link, such as a timestamp or distance.
• Relationships never dangle: there are always a start node and an end node (which can be the same node).

You can use these primitives (nodes, labels, relationships, and properties) along with rules to assemble sophisticated, high-fidelity graph data models with relative ease. Figure 1-3 shows a small social graph, but compared to the example in Figure 1-2, this graph holds much more information.
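To make the four primitives concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not how a graph database stores data and not an API from the book: the class names `Node`, `Relationship`, and `PropertyGraph`, their methods, and the sample values (`Ada`, `Notebook`, the price and timestamp) are all hypothetical. The `Customer`, `Product`, and `BOUGHT` names come from the examples in the list above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(eq=False)  # nodes are compared by identity, not by value
class Node:
    labels: set = field(default_factory=set)          # zero or more labels
    properties: dict = field(default_factory=dict)    # zero or more key-value pairs


@dataclass(eq=False)
class Relationship:
    type: str      # exactly one type, such as BOUGHT
    start: Node    # direction runs from start ...
    end: Node      # ... to end (may be the same node)
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self) -> None:
        self.nodes: list = []
        self.relationships: list = []

    def add_node(self, *labels: str, **properties: Any) -> Node:
        node = Node(set(labels), dict(properties))
        self.nodes.append(node)
        return node

    def add_relationship(self, start: Node, rel_type: str, end: Node,
                         **properties: Any) -> Relationship:
        # Relationships never dangle: both endpoints must be in the graph.
        for endpoint in (start, end):
            if not any(endpoint is n for n in self.nodes):
                raise ValueError("relationship would dangle")
        rel = Relationship(rel_type, start, end, dict(properties))
        self.relationships.append(rel)
        return rel

    def outgoing(self, node: Node, rel_type: Optional[str] = None) -> list:
        """Relationships leaving `node`, optionally filtered by type."""
        return [r for r in self.relationships
                if r.start is node and (rel_type is None or r.type == rel_type)]


# Illustrative data only.
graph = PropertyGraph()
customer = graph.add_node("Customer", name="Ada")
product = graph.add_node("Product", name="Notebook", price=9.99)
bought = graph.add_relationship(customer, "BOUGHT", product,
                                timestamp="2023-06-21")

for rel in graph.outgoing(customer, "BOUGHT"):
    print(rel.start.properties["name"], rel.type, rel.end.properties["name"])
```

The sketch keeps labels and properties optional on every node and relationship, and rejects a relationship whose endpoints are not in the graph, which mirrors the "never dangle" rule.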

📄 Page 19

Figure 1-3. A graph representing people, their friendships, and their locations

In Figure 1-3 each node has a label that represents its role in the graph. Some nodes are labeled Person and some are labeled Place, representing people and places respectively. Stored inside those nodes are properties. For example, one node has name:'Rosa' and gender:'f' that you can interpret as being a female person called Rosa. Note that the Karl and Fred nodes have slightly different properties on them, which is perfectly fine, as it accommodates the messiness of the real world.

The property graph model does not enforce any schemas based on labels or relationship types. It’s deliberately intended to be flexible and accommodating to help develop high-fidelity models. If you really need to ensure that nodes with certain labels have certain properties, you can apply constraints to the label to ensure those properties exist, are unique, and so on. You can view this not as schema or schemaless, but schema-ish. The idea is to use schema-like constraints just where you need them rather than eagerly locking down the whole data model. Those schema-like constraints, like everything else in a graph data model, can change over time. Real-world data is often uneven and incomplete, so your knowledge graphs should reflect this reality.

Between the nodes in Figure 1-3 you see relationships. The relationships are richer than in Figure 1-2 since they have a type, a direction, and optional properties. Relationships cannot “dangle”; they always have a start node and an end node, even if the start and end nodes are the same.

For instance, there is a Person node with the property name:'Rosa' that has an outgoing LIVES_IN relationship with the property since: 2020 to the Place node with the property city:'Berlin'. You can read this as “Rosa has lived in Berlin since 2020,” based on the direction of the relationship. You can also see that Fred is a

📄 Page 20

FRIEND of Karl and that Karl is a FRIEND of Fred. Rosa and Karl are also friends, but Rosa and Fred are not.

In the property graph model, there are no limits on the number of nodes or the number or type of relationships that interconnect them. Some nodes are densely connected while others are sparsely connected. All that matters is that the model matches the problem domain. Similarly, some nodes have lots of properties, while some have few or none. Some relationships have lots of properties, but many have none at all. This is all perfectly normal for knowledge graphs.

It’s easy to see how the graph in Figure 1-3 can answer questions about friendships and who lives where. Extending the model to include other data like hobbies, publications, or jobs is also straightforward: just keep adding nodes and relationships to match your problem domain. Creating large, complex graphs with millions or billions of connections isn’t a problem for modern graph databases and graph-processing software, so building even very large knowledge graphs is achievable.

Graph data models can comfortably represent complex networks of relationships in a way that is both human readable and machine friendly. Graphs might seem very technical at first, but they are created from very simple primitives, making them very accessible in practice. In fact the combination of a simple data model and the ease of algorithmic processing to discover connections, patterns, and features is what has made graphs so popular. It’s a powerful combination you will also exploit in your knowledge graphs.

The Motivation for Knowledge Graphs

Interest in knowledge graphs has exploded, with a myriad of research papers, solutions, analyst reports, groups, and conferences on this topic. Knowledge graphs have become so popular partly because graph technology has accelerated in recent years but also because there is strong demand to make sense of data.
External factors have undoubtedly accelerated knowledge graphs to greater prominence. Stresses from the COVID-19 pandemic and fallout from geopolitics have strained some organizations to the point of breaking. Decision making has never needed to be more rapid. At the same time, businesses are hampered by the lack of timely and accurate insight. Now businesses are reconfiguring their operations and processes to flex rapidly.

As historical knowledge ages faster and is invalidated by market dynamics, many organizations need new ways of capturing, analyzing, and learning from data. Businesses need rapid insights and recommendations, from customer experience and patient outcomes to product innovation, fraud detection, and automation. They need contextualized data to generate knowledge.
