Author: Khaled El Emam, Lucy Mosquera, Richard Hoptroff

ISBN: 1492072745
Publisher: O'Reilly Media
Publish Year: 2020
Language: English
Pages: 166
File Format: PDF
File Size: 11.3 MB
Practical Synthetic Data Generation
Balancing Privacy and the Broad Availability of Data

Khaled El Emam, Lucy Mosquera, and Richard Hoptroff

Boston Farnham Sebastopol Tokyo Beijing
978-1-492-07274-4
[LSI]

Practical Synthetic Data Generation
by Khaled El Emam, Lucy Mosquera, and Richard Hoptroff

Copyright © 2020 K Sharp Technology Inc., Lucy Mosquera, and Richard Hoptroff. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jonathan Hassell
Development Editor: Corbin Collins
Production Editor: Christopher Faucher
Copyeditor: Piper Editorial
Proofreader: JM Olejarz
Indexer: Potomac Indexing, LLC
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Jenny Bergman

May 2020: First Edition

Revision History for the First Edition
2020-05-19: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492072744 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Practical Synthetic Data Generation, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents

Preface

1. Introducing Synthetic Data Generation
  Defining Synthetic Data
  Synthesis from Real Data
  Synthesis Without Real Data
  Synthesis and Utility
  The Benefits of Synthetic Data
  Efficient Access to Data
  Enabling Better Analytics
  Synthetic Data as a Proxy
  Learning to Trust Synthetic Data
  Synthetic Data Case Studies
  Manufacturing and Distribution
  Healthcare
  Financial Services
  Transportation
  Summary

2. Implementing Data Synthesis
  When to Synthesize
  Identifiability Spectrum
  Trade-Offs in Selecting PETs to Enable Data Access
  Decision Criteria
  PETs Considered
  Decision Framework
  Examples of Applying the Decision Framework
  Data Synthesis Projects
  Data Synthesis Steps
  Data Preparation
  The Data Synthesis Pipeline
  Synthesis Program Management
  Summary

3. Getting Started: Distribution Fitting
  Framing Data
  How Data Is Distributed
  Fitting Distributions to Real Data
  Generating Synthetic Data from a Distribution
  Measuring How Well Synthetic Data Fits a Distribution
  The Overfitting Dilemma
  A Little Light Weeding
  Summary

4. Evaluating Synthetic Data Utility
  Synthetic Data Utility Framework: Replication of Analysis
  Synthetic Data Utility Framework: Utility Metrics
  Comparing Univariate Distributions
  Comparing Bivariate Statistics
  Comparing Multivariate Prediction Models
  Distinguishability
  Summary

5. Methods for Synthesizing Data
  Generating Synthetic Data from Theory
  Sampling from a Multivariate Normal Distribution
  Inducing Correlations with Specified Marginal Distributions
  Copulas with Known Marginal Distributions
  Generating Realistic Synthetic Data
  Fitting Real Data to Known Distributions
  Using Machine Learning to Fit the Distributions
  Hybrid Synthetic Data
  Machine Learning Methods
  Deep Learning Methods
  Synthesizing Sequences
  Summary

6. Identity Disclosure in Synthetic Data
  Types of Disclosure
  Identity Disclosure
  Learning Something New
  Attribute Disclosure
  Inferential Disclosure
  Meaningful Identity Disclosure
  Defining Information Gain
  Bringing It All Together
  Unique Matches
  How Privacy Law Impacts the Creation and Use of Synthetic Data
  Issues Under the GDPR
  Issues Under the CCPA
  Issues Under HIPAA
  Article 29 Working Party Opinion
  Summary

7. Practical Data Synthesis
  Managing Data Complexity
  For Every Pre-Processing Step There Is a Post-Processing Step
  Field Types
  The Need for Rules
  Not All Fields Have to Be Synthesized
  Synthesizing Dates
  Synthesizing Geography
  Lookup Fields and Tables
  Missing Data and Other Data Characteristics
  Partial Synthesis
  Organizing Data Synthesis
  Computing Capacity
  A Toolbox of Techniques
  Synthesizing Cohorts Versus Full Datasets
  Continuous Data Feeds
  Privacy Assurance as Certification
  Performing Validation Studies to Get Buy-In
  Motivated Intruder Tests
  Who Owns Synthetic Data?
  Conclusions

Index
Preface

Interest in synthetic data has been growing rapidly over the last few years. This interest has been driven by two simultaneous trends. The first is the demand for large amounts of data to train and build artificial intelligence and machine learning (AIML) models. The second is recent work that has demonstrated effective methods for generating high-quality synthetic data. Both have resulted in the recognition that synthetic data can solve some difficult problems quite effectively, especially within the AIML community. Companies like NVIDIA, IBM, and Alphabet, as well as agencies such as the US Census Bureau, have adopted different types of data synthesis methodologies to support model building, application development, and data dissemination.

This book provides you with a gentle introduction to methods for the following: generating synthetic data, evaluating the data that has been synthesized, understanding the privacy implications of synthetic data, and implementing synthetic data within your organization. We show how synthetic data can accelerate AIML projects. Some of the problems that can be tackled by having synthetic data would be too costly or dangerous to solve using more traditional methods (e.g., training models controlling autonomous vehicles), or simply cannot be done otherwise. We also explain how to assess the privacy risks from synthetic data, even though they tend to be minimal if synthesis is done properly.

While we want this book to be an introduction, we also want it to be applied. Therefore, we will discuss some of the issues that will be encountered with real data, not curated or cleaned data. Real data is complex and messy, and data synthesis needs to be able to work within that context.

Our intended audience is analytics leaders who are responsible for enabling AIML model development and application within their organizations, as well as data scientists who want to learn how data synthesis can be a useful tool for their work.

We will use examples of different types of data synthesis to illustrate the broad applicability of this approach. Our main focus here is on the synthesis of structured data.
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
  Indicates new terms, URLs, email addresses, filenames, and file extensions.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  O’Reilly Media, Inc.
  1005 Gravenstein Highway North
  Sebastopol, CA 95472
  800-998-9938 (in the United States or Canada)
  707-829-0515 (international or local)
  707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/practical-synthetic-data-generation.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia
Acknowledgments

The preparation of this book benefited from a series of interviews with subject matter experts. I would like to thank the following individuals for making themselves available to discuss their experiences and thoughts on the synthetic data market and technology: Fernanda Foertter, Jim Karkanias, Alexei Pozdnoukhov, Rev Lebaradian, John Ashley, Rob Csonger, and Simson Garfinkel.

Rob Csonger and his team provided the content for the section on autonomous vehicles. Mike Hintze from Hintze Law LLC prepared the legal analysis in the identity disclosure chapter.

We wish to thank Janice Branson for reviewing earlier versions of the manuscript.

Our clients and collaborators, who often give us challenging problems, have been key to driving our innovations in the methods of data synthesis and the implementation of the technology in practice.
CHAPTER 1
Introducing Synthetic Data Generation

We start this chapter by explaining what synthetic data is and its benefits. Artificial intelligence and machine learning (AIML) projects run in various industries, and the use cases that we include in this chapter are intended to give a flavor of the broad applications of data synthesis. We define an AIML project quite broadly as well, to include, for example, the development of software applications that have AIML components.

Defining Synthetic Data

At a conceptual level, synthetic data is not real data, but data that has been generated from real data and that has the same statistical properties as the real data. This means that if an analyst works with a synthetic dataset, they should get analysis results similar to what they would get with real data. The degree to which a synthetic dataset is an accurate proxy for real data is a measure of utility. We refer to the process of generating synthetic data as synthesis.

Data in this context can mean different things. For example, data can be structured data, as one would see in a relational database. Data can also be unstructured text, such as doctors’ notes, transcripts of conversations or online interactions by email or chat. Furthermore, images, videos, audio, and virtual environments are types of data that can be synthesized. Using machine learning, it is possible to create realistic pictures of people who do not exist in the real world.

There are three types of synthetic data. The first type is generated from actual/real datasets, the second type does not use real data, and the third type is a hybrid of these two. Let’s examine them here.
Synthesis from Real Data

The first type of synthetic data is synthesized from real datasets. This means that the analyst has some real datasets and then builds a model to capture the distributions and structure of that real data. Here structure means the multivariate relationships and interactions in the data. Once the model is built, the synthetic data is sampled or generated from that model. If the model is a good representation of the real data, then the synthetic data will have statistical properties similar to those of the real data.

This is illustrated in Figure 1-1. Here we fit the data to a generative model first. This captures the relationships in the data. We then use that model to generate synthetic data. So the synthetic data is produced from the fitted model.

Figure 1-1. The conceptual process of data synthesis

For example, a data science group specializing in understanding customer behaviors would need large amounts of data to build its models. But because of privacy or other concerns, the process for accessing that customer data is slow and does not provide good enough data on account of extensive masking and redaction of information. Instead, a synthetic version of the production datasets can be provided to the analysts to build their models with. The synthesized data will have fewer constraints on its use and will allow them to progress more rapidly.

Synthesis Without Real Data

The second type of synthetic data is not generated from real data. It is created by using existing models or the analyst’s background knowledge.

These existing models can be statistical models of a process (developed through surveys or other data collection mechanisms) or they can be simulations. Simulations can be, for instance, gaming engines that create simulated (and synthetic) images of scenes or objects, or they can be simulation engines that generate shopper data with particular characteristics (say, age and gender) for people who walk past a store at different times of the day.

Background knowledge can be, for example, knowledge of how a financial market behaves that comes from textbook descriptions or the movements of stock prices under various historical conditions. It can also be knowledge of the statistical distribution of human traffic in a store based on years of experience. In such a case, it is relatively straightforward to create a model and sample from background knowledge to generate synthetic data. If the analyst’s knowledge of the process is accurate, then the synthetic data will behave in a manner that is consistent with real-world data. Of course, the use of background knowledge works only when the analyst truly understands the phenomenon of interest.

As a final example, when a process is new or not well understood by the analyst, and there is no real historical data to use, then an analyst can make some simple assumptions about the distributions and correlations among the variables involved in the process. For example, the analyst can make a simplifying assumption that the variables have normal distributions and “medium” correlations among them, and create data that way. This type of data will likely not have the same properties as real data but can still be useful for some purposes, such as debugging an R data analysis program, or some types of performance testing of software applications.

Synthesis and Utility

For some use cases, having high utility will matter quite a bit. In other cases, medium or even low utility may be acceptable. For example, if the objective is to build AIML models to predict customer behavior and make marketing decisions based on that, then high utility will be important. On the other hand, if the objective is to see if your software can handle a large volume of transactions, then the data utility expectations will be considerably lower.
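As a minimal sketch of the "simple assumptions" scenario described above (no real data, only normal distributions with "medium" correlations), the following Python draws three correlated normal variables. The function names, the choice of 0.5 as a "medium" correlation, and the example means and standard deviations are illustrative assumptions, not values taken from the book. Only the standard library is used.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular factor L with L * L^T == matrix (positive definite input)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def synthesize_from_assumptions(n_rows, means, sds, rho, seed=0):
    """Draw rows of normal variables that all share the pairwise correlation rho."""
    rng = random.Random(seed)
    k = len(means)
    # Assumed correlation matrix: 1 on the diagonal, rho everywhere else.
    corr = [[1.0 if i == j else rho for j in range(k)] for i in range(k)]
    lower = cholesky(corr)
    rows = []
    for _ in range(n_rows):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]  # independent standard normals
        mixed = [sum(lower[i][j] * z[j] for j in range(i + 1)) for i in range(k)]
        rows.append([means[i] + sds[i] * mixed[i] for i in range(k)])
    return rows


def pearson(xs, ys):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical variables: age, basket value, and a standardized score.
rows = synthesize_from_assumptions(5000, [50.0, 100.0, 0.0], [10.0, 20.0, 1.0], rho=0.5, seed=1)
columns = list(zip(*rows))
print("correlation of first two columns:", round(pearson(columns[0], columns[1]), 3))
```

Data produced this way carries only the assumed structure, so it suits tasks such as debugging or load testing rather than analyses that need high utility.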
Therefore, understanding what data, models, simulators, and knowledge exist, as well as the requirements for data utility, will drive the specific approach for generating the synthetic data. A summary of the synthetic data types is given in Table 1-1.

Table 1-1. Different types of data synthesis with their utility implications

Type of synthetic data | Utility
Generated from real nonpublic datasets | Can be quite high
Generated from real public data | Can be high, although there are limitations because public data tends to be de-identified or aggregated
Generated from an existing model of a process, which can also be represented in a simulation engine | Will depend on the fidelity of the existing generating model
Based on analyst knowledge | Will depend on how well the analyst knows the domain and the complexity of the phenomenon
Generated from generic assumptions not specific to the phenomenon | Will likely be low

Now that you have seen the different types of synthetic data, let’s look at the benefits of data synthesis overall and of some of these data types specifically.

The Benefits of Synthetic Data

We will highlight two important benefits of data synthesis: providing more efficient access to data and enabling better analytics. Let’s examine each of these in turn.

Efficient Access to Data

Data access is critical to AIML projects. The data is needed to train and validate models. More broadly, data is also needed for evaluating AIML technologies that have been developed by others, as well as for testing AIML software applications or applications that incorporate AIML models.

Typically, data is collected for a particular purpose with the consent of the individual, for example, for participating in a webinar or a clinical research study. If you want to use that same data for a different purpose, such as to build a model to predict what kind of person is likely to sign up for a webinar or to participate in a clinical study, then that is considered a secondary purpose.

Access to data for secondary purposes, such as analysis, is becoming problematic. The Government Accountability Office [1] and the McKinsey Global Institute [2] both note that accessing data for building and testing AIML models is a challenge for their adoption more broadly.

A Deloitte analysis concluded that data-access issues are ranked in the top three challenges faced by companies when implementing AI. [3] At the same time, the public is getting uneasy about how its data is used and shared, and privacy laws are becoming stricter. A recent survey by O’Reilly highlighted the privacy concerns of companies adopting machine learning models, with more than half of companies experienced with AIML checking for privacy issues. [4]

Contemporary privacy regulations, such as the US Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR) in Europe, require a legal basis to use personal data for a secondary purpose. An example of that legal basis would be additional consent or authorization from individuals before their data can be used. In many cases this is not practical and can introduce bias into the data because consenters and nonconsenters differ on important characteristics. [5]

Given the difficulty of accessing data, sometimes analysts try to just use open source or public datasets. These can be a good starting point, but they lack diversity and are often not well matched to the problems that the models are intended to solve. Furthermore, open data may lack sufficient heterogeneity for robust training of models. For example, open data may not capture rare cases well enough.

Data synthesis can give the analyst, rather efficiently and at scale, realistic data to work with. Synthetic data would not be considered identifiable personal data. Therefore, privacy regulations would not apply and additional consent to use the data for secondary purposes would not be necessary. [6]

[1] US Government Accountability Office, “Artificial Intelligence: Emerging Opportunities, Challenges, and Implications for Policy and Research” (March 2018) https://www.gao.gov/products/GAO-18-644T.
[2] McKinsey Global Institute, “Artificial intelligence: The next digital frontier?”, June 2017. https://oreil.ly/pFMkl.
[3] Deloitte Insights, “State of AI in the Enterprise, 2nd Edition” 2018. https://oreil.ly/EiD6T.
[4] Ben Lorica and Paco Nathan, The State of Machine Learning Adoption in the Enterprise (Sebastopol: O’Reilly, 2018).
[5] Khaled El Emam et al., “A Review of Evidence on Consent Bias in Research,” The American Journal of Bioethics 13, no. 4 (2013): 42–44.
[6] Other governance mechanisms would generally be needed, and we cover these later in the book.

Enabling Better Analytics

A use case where synthesis can be applied is when real data does not exist: for example, if the analyst is trying to model something completely new, and the creation or collection of a real dataset from scratch would be cost-prohibitive or impractical. Synthesized data can also cover edge or rare cases that are difficult, impractical, or unethical to collect in the real world.

Sometimes real data exists but is not labeled. Labeling a large number of examples for supervised learning tasks can be time-consuming, and manual labeling is error-prone. Again, synthetic labeled data can be generated to accelerate model development.
The synthesis process can ensure high accuracy in the labeling.

Analysts can use the synthetic data models to validate their assumptions and demonstrate the kind of results that can be obtained with their models. In this way the synthetic data can be used in an exploratory manner. Knowing that they have interesting and useful results, the analysts can then go through the more complex process of getting the real data (either raw or de-identified) to build the final versions of their models.

For example, if an analyst is a researcher, they can use their exploratory models on synthetic data to then apply for funding to get access to the real data, which may require a full protocol and multiple levels of approvals. In such an instance, efforts with the synthetic data that do not produce good models or actionable results would still be beneficial, because they will redirect the researchers to try something else, rather than trying to access the real data for a potentially futile analysis.

Another scenario in which synthetic data can be valuable is when the synthetic data is used to train an initial model before the real data is accessible. Then when the analyst gets the real data, they can use the trained model as a starting point for training with the real data. This can significantly expedite the convergence of the real data model (hence reducing compute time) and can potentially result in a more accurate model. This is an example of using synthetic data for transfer learning.

The benefits of synthetic data can be dramatic: it can make impossible projects doable, significantly accelerate AIML initiatives, or result in material improvement in the outcomes of AIML projects.

Synthetic Data as a Proxy

If the utility of the synthetic data is high enough, analysts are able to get results with the synthetic data that are similar to what they would have with the real data. In such a case, the synthetic data plays the role of a proxy for the real data. Increasingly, there are more use cases where this scenario is playing out: as synthesis methods improve over time, this proxy outcome is going to become more common.

We have seen that synthetic data can play a key role in solving a series of practical problems. One of the critical factors for the adoption of data synthesis, however, is trust in the generated data. It has long been recognized that high data utility will be needed for the broad adoption of data synthesis methods. [7] This is the topic we turn to next.

[7] Jerome P. Reiter, “New Approaches to Data Dissemination: A Glimpse into the Future (?),” CHANCE 17, no. 3 (June 2004): 11–15.
Learning to Trust Synthetic Data

Initial interest in synthetic data started in the early 1990s with proposals to use multiple imputation methods to generate synthetic data. Imputation in general is the class of methods used to deal with missing data by using realistic data to replace the missing values. Missing data can occur, for example, in a survey in which some respondents do not complete a questionnaire.

Accurate imputed data requires the analyst to build a model of the phenomenon of interest using the available data and then use that model to estimate what the imputed value should be. To build a valid model the analyst needs to know how the data will eventually be used.
With multiple imputation you create multiple imputed values to capture the uncertainty in these estimated values. This results in multiple imputed datasets. There are specific techniques that can be used to combine the analysis that is repeated in each imputed dataset to get a final set of analysis results. This process can work reasonably well if you know in advance how the data will be used.

In the context of using imputation for data synthesis, the real data is augmented with synthetic data using the same type of imputation techniques. In such a case, the real data is used to build an imputation model that is then used to synthesize new data.

The challenge is that if your imputation models are different from the eventual models that will be built with the synthetic data, then the imputed values may not be very reflective of the real values, and this will introduce errors in the data. This risk of building the wrong model has led to historic caution in the application of synthetic data.

More recently, statistical machine learning models have been used for data synthesis. The advantage of these models is that they can capture the distributions and complex relationships among the variables quite well. In effect, they discover the underlying model in the data rather than requiring that model to be prespecified by the analyst. And now with deep learning data synthesis, these models can be quite accurate because they can capture much of the signal in the data, even subtle signals.

Therefore, we are getting closer to the point where the generative models available today produce datasets that are becoming quite good proxies for real data. But there are also ways to assess the utility of synthetic data more objectively.

For example, we can compare the analysis results from synthetic data with the analysis results from the real data.
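The comparison just described can be sketched in a few lines of Python: fit a generative model to real data, sample a synthetic dataset from it, and run the same analysis on both. Everything here is an illustrative assumption rather than the authors' method: the "real" data is simulated as a stand-in, the generative model is a deliberately simple bivariate normal, the analysis is a one-variable regression slope, and all function names are hypothetical.

```python
import math
import random


def fit_bivariate_normal(xs, ys):
    """Estimate means, standard deviations, and correlation from paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return {
        "mean_x": mean_x, "sd_x": math.sqrt(var_x),
        "mean_y": mean_y, "sd_y": math.sqrt(var_y),
        "rho": cov / math.sqrt(var_x * var_y),
    }


def sample_bivariate_normal(model, n_rows, rng):
    """Generate synthetic (x, y) pairs from the fitted model."""
    rho = model["rho"]
    xs, ys = [], []
    for _ in range(n_rows):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(model["mean_x"] + model["sd_x"] * z1)
        # Mix z1 and z2 so that x and y have correlation rho.
        ys.append(model["mean_y"] + model["sd_y"] * (rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return xs, ys


def ols_slope(xs, ys):
    """Slope of the least-squares regression of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(42)

# Stand-in for a real dataset (simulated here purely for illustration).
real_x = [rng.gauss(40.0, 10.0) for _ in range(2000)]
real_y = [2.0 * x + rng.gauss(0.0, 15.0) for x in real_x]

model = fit_bivariate_normal(real_x, real_y)                   # fit the generative model
synth_x, synth_y = sample_bivariate_normal(model, 2000, rng)   # sample synthetic data

slope_real = ols_slope(real_x, real_y)
slope_synth = ols_slope(synth_x, synth_y)
print("slope on real data:     ", round(slope_real, 3))
print("slope on synthetic data:", round(slope_synth, 3))
```

A small gap between the two slopes suggests the synthetic data is a usable proxy for this particular analysis. It says nothing about analyses the simple model cannot represent, such as nonlinear relationships.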
If we do not know what analysis will be performed on the synthetic data, then a range of possible analyses can be tried based on known uses of that data. Or an “all models” evaluation can be performed, in which all possible models are built from the real and synthetic datasets and compared.

Synthetic data can also be used to increase the heterogeneity of a training dataset to result in a more robust AIML model. For example, edge cases in which data does not exist or is difficult to collect can be synthesized and included in the training dataset. In that case, the utility of the synthetic data is measured by the gain in robustness of the AIML models.

The US Census Bureau has, at the time of writing, decided to leverage synthetic data for one of the most heavily used public datasets, the 2020 decennial census data. For its tabular data disseminations, it will create a synthetic dataset from the collected individual-level census data and then produce the public tabulations from that synthetic dataset. A mixture of formal and nonformal methods will be used in the synthesis process. [8]

This, arguably, demonstrates the large-scale adoption of data synthesis for one of the most critical and heavily used datasets available today.

Beyond the census, data synthesis is being used in a number of industries, as we illustrate later in this chapter.

[8] Aref N. Dajani et al., “The Modernization of Statistical Disclosure Limitation at the U.S. Census Bureau” (paper presented at the Census Scientific Advisory Committee meeting, Suitland, MD, March 2017).

Synthetic Data Case Studies

While the technical concepts behind the generation of synthetic data have been around for a few decades, their practical use has picked up only recently. One reason is that this type of data solves some challenging problems that were quite hard to solve before, or solves them in a more cost-effective way. All of these problems pertain to data access: sometimes it is just hard to get access to real data.

This section presents a few application examples from various industries. These examples are not intended to be exhaustive but rather to be illustrative. Also, the same problem may exist in multiple industries (for example, getting realistic data for software testing is a common problem that data synthesis can solve), so the applications of synthetic data to solve that problem will be relevant in each of those industries. That we discuss software testing, say, under only one heading does not mean that it would not be relevant under another.

The first industry that we examine is manufacturing and distribution. We then give examples from healthcare, financial services, and transportation. The industry examples span the types of synthetic data we’ve discussed, from generating structured data from real individual-level and aggregate data, to using simulation engines to generate large volumes of synthetic data.