Graph Data Science with Neo4j (Estelle Scifo)

BIRMINGHAM—MUMBAI Graph Data Science with Neo4j Copyright © 2023 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. Publishing Product Manager: Ali Abidi Senior Editor: Nathanya Dias Technical Editor: Rahul Limbachiya Copy Editor: Safis Editing Project Coordinator: Farheen Fathima Proofreader: Safis Editing Indexer: Hemangini Bari Production Designer: Shankar Kalbhor Marketing Coordinator: Vinishka Kalra First published: January 2023

Production reference: 1310123

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-80461-274-3

www.packtpub.com

Contributors

About the author

Estelle Scifo is a Neo4j Certified Professional and Neo4j Graph Data Science certified user. She is currently a machine learning engineer at GraphAware where she builds Neo4j-related solutions to make customers happy with graphs. Before that, she worked in several fields, starting out with research in particle physics, during which she worked at CERN on uncovering Higgs boson properties. She received her PhD in 2014 from the Laboratoire de l’Accélérateur Linéaire (Orsay, France). Continuing her career in industry, she worked in real estate, mobility, and logistics for almost 10 years. In the Neo4j community, she is known as the creator of neomap, a map visualization application for data stored in Neo4j. She also regularly gives talks at conferences such as NODES and PyCon. Her domain expertise and deep insight into the perspective of a beginner’s needs make her an excellent teacher.

There is only one name on the cover, but a book is not the work of one person. I would like to thank everyone involved in making this book a reality. Beyond everyone at Packt, the reviewers did an incredible job of suggesting some very relevant improvements. Thank you, all!

I hope this book will inspire you as much as other books of this genre have inspired me.

About the reviewers

Dr. David Gurzick is the founding chair of the George B. Delaplaine Jr. School of Business and an associate professor of management science at Hood College. He has a BS in computer science from Frostburg State University, an M.S. in computer science from Hood College, a PhD in information systems from the University of Maryland, Baltimore County, and is a graduate of Harvard’s business analytics program. As a child of the internet, he grew up on AOL and programmed his way through dot.com. He now helps merge technology and business strategy to enable innovation and accelerate commercial success as the lead data scientist at Genitive.ai and as a director of the Frederick Innovative Technology Center, Inc (FITCI).

Sean William Grant is a product and analytics professional with over 20 years of experience in technology and data analysis. His experience ranges from geospatial intelligence with the United States Marine Corps, product management within the aviation and autonomy space, to implementing advanced analytics and data science within organizations. He is a graph data science and network analytics enthusiast who frequently gives presentations and workshops on connected data. He has also been a technical advisor to several early-stage start-ups. Sean is passionate about data and technology, and how it can elevate our understanding of ourselves.

Jose Ernesto Echeverria has worked with all kinds of databases, from relational databases in the 1990s to non-SQL databases in the 2010s. He considers graph databases to be the best fit for solving real-world problems, given their strong capability for modeling and adaptability to change. As a polyglot programmer, he has used languages such as Java, Ruby, and R and tools such as Jupyter with Neo4j in order to solve data management problems for multinational corporations. A long-time advocate of data science, he expects this long-awaited book to cover the proper techniques and approach the intersections of this discipline, as well as help readers to discover the possibilities of graph databases. When not working, he enjoys spending time with friends and family.

Table of Contents

Preface

Part 1 – Creating Graph Data in Neo4j

1 Introducing and Installing Neo4j
    Technical requirements
    What is a graph database?
    Databases
    Graph database
    Finding or creating a graph database
    A note about the graph dataset’s format
    Modeling your data as a graph
    Neo4j in the graph databases landscape
    Neo4j ecosystem
    Setting up Neo4j
    Downloading and starting Neo4j Desktop
    Creating our first Neo4j database
    Creating a database in the cloud – Neo4j Aura
    Inserting data into Neo4j with Cypher, the Neo4j query language
    Extracting data from Neo4j with Cypher pattern matching
    Summary
    Further reading
    Exercises

2 Importing Data into Neo4j to Build a Knowledge Graph
    Technical requirements
    Importing CSV data into Neo4j with Cypher
    Discovering the Netflix dataset
    Defining the graph schema
    Importing data
    Introducing the APOC library to deal with JSON data
    Browsing the dataset
    Getting to know and installing the APOC plugin
    Loading data
    Dealing with temporal data
    Discovering the Wikidata public knowledge graph
    Data format
    Query language – SPARQL
    Enriching our graph with Wikidata information
    Loading data into Neo4j for one person
    Importing data for all people
    Dealing with spatial data in Neo4j
    Importing data in the cloud
    Summary
    Further reading
    Exercises

Part 2 – Exploring and Characterizing Graph Data with Neo4j

3 Characterizing a Graph Dataset
    Technical requirements
    Characterizing a graph from its node and edge properties
    Link direction
    Link weight
    Node type
    Computing the graph degree distribution
    Definition of a node’s degree
    Computing the node degree with Cypher
    Visualizing the degree distribution with NeoDash
    Installing and using the Neo4j Python driver
    Counting node labels and relationship types in Python
    Building the degree distribution of a graph
    Improved degree distribution
    Learning about other characterizing metrics
    Triangle count
    Clustering coefficient
    Summary
    Further reading
    Exercises

4 Using Graph Algorithms to Characterize a Graph Dataset
    Technical requirements
    Digging into the Neo4j GDS library
    GDS content
    Installing the GDS library with Neo4j Desktop
    GDS project workflow
    Projecting a graph for use by GDS
    Native projections
    Cypher projections
    Computing a node’s degree with GDS
    stream mode
    The YIELD keyword
    write mode
    mutate mode
    Algorithm configuration
    Other centrality metrics
    Understanding a graph’s structure by looking for communities
    Number of components
    Modularity and the Louvain algorithm
    Summary
    Further reading

5 Visualizing Graph Data
    Technical requirements
    The complexity of graph data visualization
    Physical networks
    General case
    Visualizing a small graph with networkx and matplotlib
    Visualizing a graph with known coordinates
    Visualizing a graph with unknown coordinates
    Configuring object display
    Discovering the Neo4j Bloom graph application
    What is Bloom?
    Bloom installation
    Selecting data with Neo4j Bloom
    Configuring the scene in Bloom
    Visualizing large graphs with Gephi
    Installing Gephi and its required plugin
    Using APOC Extended to synchronize Neo4j and Gephi
    Configuring the view in Gephi
    Summary
    Further reading
    Exercises

Part 3 – Making Predictions on a Graph

6 Building a Machine Learning Model with Graph Features
    Technical requirements
    Introducing the GDS Python client
    GDS Python principles
    Input and output types
    Creating a projected graph from Python
    Running GDS algorithms from Python and extracting data in a dataframe
    write mode
    stream mode
    Dropping the projected graph
    Using features from graph algorithms in a scikit-learn pipeline
    Machine learning tasks with graphs
    Our task
    Computing features
    Extracting and visualizing data
    Building the model
    Summary
    Further reading
    Exercise

7 Automatically Extracting Features with Graph Embeddings for Machine Learning
    Technical requirements
    Introducing graph embedding algorithms
    Defining embeddings
    Graph embedding classification
    Using a transductive graph embedding algorithm
    Understanding the Node2Vec algorithm
    Using Node2Vec with GDS
    Training an inductive embedding algorithm
    Understanding GraphSAGE
    Introducing the GDS model catalog
    Training GraphSAGE with GDS
    Computing new node representations
    Summary
    Further reading
    Exercises

8 Building a GDS Pipeline for Node Classification Model Training
    Technical requirements
    The GDS pipelines
    What is a pipeline?
    Building and training a pipeline
    Creating the pipeline and choosing the features
    Setting the pipeline configuration
    Training the pipeline
    Making predictions
    Computing the confusion matrix
    Using embedding features
    Choosing the graph embedding algorithm to use
    Training using Node2Vec
    Training using GraphSAGE
    Summary
    Further reading
    Exercise

9 Predicting Future Edges
    Technical requirements
    Introducing the LP problem
    LP examples
    LP with the Netflix dataset
    Framing an LP problem
    LP features
    Topological features
    Features based on node properties
    Building an LP pipeline with the GDS
    Creating and configuring the pipeline
    Pipeline training and testing
    Summary
    Further reading

10 Writing Your Custom Graph Algorithms with the Pregel API in Java
    Technical requirements
    Introducing the Pregel API
    GDS’s features
    The Pregel API
    Implementing the PageRank algorithm
    The PageRank algorithm
    Simple Python implementation
    Pregel Java implementation
    Implementing the tolerance-stopping criteria
    Testing our code
    Test for the PageRank class
    Test for the PageRankTol class
    Using our algorithm from Cypher
    Adding annotations
    Building the JAR file
    Updating the Neo4j configuration
    Testing our procedure
    Summary
    Further reading
    Exercises

Index

Other Books You May Enjoy

Preface

Data science today is a core component of many companies and organizations, which take advantage of its predictive power to improve their products or better understand their customers. It is an ever-evolving field, still undergoing intense research. One of the most active research areas is graph data science (GDS), or how representing data as a connected network can improve models. Among the different tools on the market to work with graphs, Neo4j, a graph database, is popular among developers for its ability to build simple and evolving data models and query data easily with Cypher. For a few years now, it has also stood out as a leader in graph analytics, especially since the release of the first version of its GDS library, which allows you to run graph algorithms on data stored in Neo4j, even at a large scale.

This book is designed to guide you through the field of GDS, always using Neo4j and its GDS library as the main tool. By the end of this book, you will be able to run your own GDS model on a graph dataset you created, and you will even be able to pass the Neo4j Data Science certification to prove your new skills to the world.

Who this book is for

This book is for people who are curious about graphs and how this data structure can be useful in data science. It can serve both data scientists who are learning about graphs and Neo4j developers who want to get into data science. The book assumes minimal data science knowledge (classification, training sets, confusion matrices) and some experience with Python and its related data science toolkit (pandas, matplotlib, and scikit-learn).

What this book covers

Chapter 1, Introducing and Installing Neo4j, introduces the basic principles of graph databases and gives instructions on how to set up Neo4j locally, create your first graph, and write your first Cypher queries.

Chapter 2, Importing Data into Neo4j to Build a Knowledge Graph, guides you through loading data into Neo4j from different formats (CSV, JSON, and an HTTP API). This is where you will build the dataset that will be used throughout this book.

Chapter 3, Characterizing a Graph Dataset, introduces some key metrics to differentiate one graph dataset from another.

Chapter 4, Using Graph Algorithms to Characterize a Graph Dataset, goes deeper into understanding a graph dataset by using graph algorithms. This is the chapter where you will start to use the Neo4j GDS plugin.

Chapter 5, Visualizing Graph Data, delves into graph data visualization by drawing nodes and edges, starting from static representations and moving on to dynamic ones.

Chapter 6, Building a Machine Learning Model with Graph Features, talks about machine learning model training using scikit-learn. This is where we will first use the GDS Python client.

Chapter 7, Automatically Extracting Features with Graph Embeddings for Machine Learning, introduces the concept of node embedding, with practical examples using the Neo4j GDS library.

Chapter 8, Building a GDS Pipeline for Node Classification Model Training, introduces the topic of node classification within GDS without involving a third-party tool.

Chapter 9, Predicting Future Edges, gives a short introduction to the topic of link prediction, a graph-specific machine learning task.

Chapter 10, Writing Your Custom Graph Algorithms with the Pregel API in Java, covers the exciting topic of building an extension for the GDS plugin.

To get the most out of this book

You will need access to a Neo4j instance. Options and installation instructions are given in Chapter 1, Introducing and Installing Neo4j.
We will also make intensive use of Python and the following packages: pandas, scikit-learn, networkx, and graphdatascience. The code was tested with Python 3.10 but should work with newer versions, assuming no breaking change is made in its dependencies. Python code is provided as a Jupyter notebook, so you’ll need Jupyter Server installed and running to go through it.

For the very last chapter, a Java JDK will also be required. The code was tested with OpenJDK 11.

Software/hardware covered in the book    Operating system requirements
Neo4j 5.x                                Windows, macOS, or Linux
Python 3.10                              Windows, macOS, or Linux
Jupyter                                  Windows, macOS, or Linux
OpenJDK 11                               Windows, macOS, or Linux

You will also need to install Neo4j plugins: APOC and GDS. Installation instructions for Neo4j Desktop are given in the relevant chapters. However, if you are not using a local Neo4j instance, please refer to the following pages for installation instructions, especially regarding version compatibilities:

APOC: https://neo4j.com/docs/apoc/current/installation/
GDS: https://neo4j.com/docs/graph-data-science/current/installation/

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Graph-Data-Science-with-Neo4j. If there’s an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system.”

A block of code is set as follows:

    CREATE (:Movie {
        id: line.show_id,
        title: line.title,
        releaseYear: line.release_year
    })

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    LOAD CSV WITH HEADERS
    FROM 'file:///netflix/netflix_titles.csv' AS line
    WITH split(line.director, ",") AS directors_list
    UNWIND directors_list AS director_name
    CREATE (:Person {name: trim(director_name)})

Any command-line input or output is written as follows:

    $ mkdir css
    $ cd css

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “Select System info from the Administration panel.”

TIPS OR IMPORTANT NOTES
Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.
