
Tidy Modeling with R
A Framework for Modeling in the Tidyverse
Max Kuhn and Julia Silge

Tidy Modeling with R
by Max Kuhn and Julia Silge

Copyright © 2022 Max Kuhn and Julia Silge. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Michelle Smith
Development Editor: Rita Fernando
Production Editor: Beth Kelly
Copyeditor: Piper Editorial Consulting, LLC
Proofreader: Tom Sullivan
Indexer: Potomac Indexing, LLC
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

July 2022: First Edition

Revision History for the First Edition
2022-07-12: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492096481 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Tidy Modeling with R, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-09648-1
[LSI]

Dedication

To Amy: When you read this, know that I love you more today than every day before.
- M.K.

To Robert: Happy 20 years of choosing each other.
- J.S.

Preface

Welcome to Tidy Modeling with R! This book is a guide to using a collection of software in the R programming language for model building called tidymodels, and it has two main goals:

First and foremost, this book provides a practical introduction to how to use these specific R packages to create models. We focus on a dialect of R called the tidyverse that is designed with a consistent, human-centered philosophy and demonstrate how the tidyverse and the tidymodels packages can be used to produce high-quality statistical and machine learning models.

Second, this book will show you how to develop good methodology and statistical practices. Whenever possible, our software, documentation, and other materials attempt to prevent common pitfalls.

In Chapter 1, we outline a taxonomy for models and highlight what good software for modeling is like. The ideas and syntax of the tidyverse, which we introduce (or review) in Chapter 2, are the basis for the tidymodels approach to these challenges of methodology and practice. Chapter 3 provides a quick tour of conventional base R modeling functions and summarizes the unmet needs in that area.

After that, this book is separated into parts, starting with the basics of modeling with tidy data principles. Chapters 4–9 introduce an example data set on house prices and demonstrate how to use the fundamental tidymodels packages: recipes, parsnip, workflows, yardstick, and others.

The next part of the book moves forward with more details on the process of creating an effective model. Chapters 10–15 focus on creating good estimates of performance as well as tuning model hyperparameters.

Finally, the last section of this book, Chapters 16–21, covers other important topics for model building. We discuss more advanced feature engineering approaches like dimensionality reduction and encoding high-cardinality predictors, as well as how to answer questions about why a model makes certain predictions and when to trust your model predictions.

We do not assume that readers have extensive experience in model building and statistics. Some statistical knowledge is required, such as random sampling, variance, correlation, basic linear regression, and other topics that are usually found in a basic undergraduate statistics or data analysis course. We do assume that the reader is at least slightly familiar with dplyr, ggplot2, and the %>% “pipe” operator in R, and is interested in applying these tools to modeling. For users who don’t yet have this background R knowledge, we recommend books such as R for Data Science by Wickham and Grolemund (2016). Investigating and analyzing data is an important part of any model process.

This book is not intended to be a comprehensive reference on modeling techniques; we suggest other resources to learn more about the statistical methods themselves. For general background on the most common type of model, the linear model, we suggest Fox (2008). For predictive models, Kuhn and Johnson (2013) and Kuhn and Johnson (2020) are good resources. For machine learning methods, Goodfellow, Bengio, and Courville (2016) is an excellent (but formal) source of information. In some cases, we do describe the models we use in some detail, but in a way that is less mathematical, and hopefully more intuitive.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

TIP
This element signifies a tip or suggestion.

NOTE
This element signifies a general note.

WARNING
This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/tidymodels/TMwR.

This book was written with RStudio using bookdown (Xie 2016). We generated all plots in this book using ggplot2 and its black and white theme (theme_bw()). An online version of this book is available and will continue to evolve after publication of the physical book.

If you have a technical question or a problem using the code examples, please email bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Tidy Modeling with R by Max Kuhn and Julia Silge (O’Reilly). Copyright 2022 Max Kuhn and Julia Silge, 978-1-492-09648-1.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

This version of the book was built with: R version 4.1.3 (2022-03-10), pandoc version 2.17.1.1, and the following packages:

    applicable (0.0.1.2, CRAN)
    av (0.7.0, CRAN)
    baguette (0.2.0, CRAN)
    beans (0.1.0, CRAN)
    bestNormalize (1.8.2, CRAN)
    bookdown (0.25, CRAN)
    broom (0.7.12, CRAN)
    censored (0.0.0.9000, GitHub)
    corrplot (0.92, CRAN)
    corrr (0.4.3, CRAN)
    Cubist (0.4.0, CRAN)
    DALEXtra (2.1.1, CRAN)
    dials (0.1.1, CRAN)
    dimRed (0.2.5, CRAN)
    discrim (0.2.0, CRAN)
    doMC (1.3.8, CRAN)
    dplyr (1.0.8, CRAN)
    earth (5.3.1, CRAN)
    embed (0.1.5, CRAN)
    fastICA (1.2-3, CRAN)
    finetune (0.2.0, CRAN)
    forcats (0.5.1, CRAN)
    ggforce (0.3.3, CRAN)
    ggplot2 (3.3.5, CRAN)
    glmnet (4.1-3, CRAN)
    gridExtra (2.3, CRAN)
    infer (1.0.0, CRAN)
    kableExtra (1.3.4, CRAN)
    kernlab (0.9-30, CRAN)
    kknn (1.3.1, CRAN)
    klaR (1.7-0, CRAN)
    knitr (1.38, CRAN)
    learntidymodels (0.0.0.9001, GitHub)
    lime (0.5.2, CRAN)
    lme4 (1.1-29, CRAN)
    lubridate (1.8.0, CRAN)
    mda (0.5-2, CRAN)
    mixOmics (6.18.1, Bioconductor)
    modeldata (0.1.1, CRAN)
    multilevelmod (0.1.0, CRAN)
    nlme (3.1-157, CRAN)
    nnet (7.3-17, CRAN)
    parsnip (0.2.1.9001, GitHub)
    patchwork (1.1.1, CRAN)
    pillar (1.7.0, CRAN)
    poissonreg (0.2.0, CRAN)
    prettyunits (1.1.1, CRAN)
    probably (0.0.6, CRAN)
    pscl (1.5.5, CRAN)
    purrr (0.3.4, CRAN)
    ranger (0.13.1, CRAN)
    recipes (0.2.0, CRAN)
    rlang (1.0.2, CRAN)
    rmarkdown (2.13, CRAN)
    rpart (4.1.16, CRAN)
    rsample (0.1.1, CRAN)
    rstanarm (2.21.3, CRAN)
    rules (0.2.0, CRAN)
    sessioninfo (1.2.2, CRAN)
    stacks (0.2.2, CRAN)
    stringr (1.4.0, CRAN)
    svglite (2.1.0, CRAN)
    text2vec (0.6, CRAN)
    textrecipes (0.5.1.9000, GitHub)
    themis (0.2.0, CRAN)
    tibble (3.1.6, CRAN)
    tidymodels (0.2.0, CRAN)
    tidyposterior (0.1.0, CRAN)
    tidyverse (1.3.1, CRAN)
    tune (0.2.0, CRAN)
    uwot (0.1.11, CRAN)
    workflows (0.2.6, CRAN)
    workflowsets (0.2.1, CRAN)
    xgboost (1.5.2.1, CRAN)
    yardstick (0.0.9, CRAN)

O’Reilly Online Learning

NOTE
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

    O’Reilly Media, Inc.
    1005 Gravenstein Highway North
    Sebastopol, CA 95472
    800-998-9938 (in the United States or Canada)
    707-829-0515 (international or local)
    707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/tidy-modeling-r.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.
Follow us on Twitter: https://twitter.com/oreillymedia.
Watch us on YouTube: https://youtube.com/oreillymedia.

Acknowledgments

We are so thankful for the contributions, help, and perspectives of people who have supported us in this project. There are several we would like to thank in particular.

We would like to thank our RStudio colleagues on the tidymodels team (Davis Vaughan, Hannah Frick, Emil Hvitfeldt, and Simon Couch) as well as the rest of our coworkers on the RStudio open source team. Thank you to Desirée De Leon for the site design of the online work. We would also like to thank our technical reviewers, Chelsea Parlett-Pelleriti and Dan Simpson, for their detailed, insightful feedback that substantively improved this book, as well as our editors, Nicole Taché and Rita Fernando, for their perspective and guidance during the process of writing and publishing.

This book was written in the open, and multiple people contributed via pull requests or issues. Special thanks go to the 38 people who contributed via GitHub pull requests (in alphabetical order by username): Aris Paschalidis (@arisp99), Brad Hill (@bradisbrad), Bryce Roney (@bryceroney), Cedric Batailler (@cedricbatailler), Ildikó Czeller (@czeildi), David Kane (@davidkane9), @DavZim, @DCharIAA, Emil Hvitfeldt (@EmilHvitfeldt), Emilio (@emilopezcano), Fgazzelloni (@Fgazzelloni), Hannah Frick (@hfrick), Hlynur (@hlynurhallgrims), Howard Baek (@howardbaek), Jae Yeon Kim (@jaeyk), Jonathan D. Trattner (@jdtrat), Jeffrey Girard (@jmgirard), John W. Pickering (@JohnPickering), Jon Harmon (@jonthegeek), Joseph B. Rickert (@joseph-rickert), Maximilian Rohde (@maxdrohde), Michael Grund (@michaelgrund), @MikeJohnPage, Mine Cetinkaya-Rundel (@mine-cetinkaya-rundel), Mohammed Hamdy (@mmhamdy), @nattalides, Y. Yu (@PursuitOfDataScience), Riaz Hedayati (@riazhedayati), Rob Wiederstein (@RobWiederstein), Scott (@scottyd22), Simon Schölzel (@simonschoe), Simon Sayz (@tagasimon), @thrkng, Tanner Stauss (@tmstauss), Tony ElHabr (@tonyelhabr), Dmitry Zotikov (@x1o), Xiaochi (@xiaochi-liu), and Zach Bogart (@zachbogart).

Part I. Introduction

Chapter 1. Software for Modeling

Models are mathematical tools that can describe a system and capture relationships in the data given to them. Models can be used for various purposes, including predicting future events, determining if there is a difference between several groups, aiding map-based visualization, discovering novel patterns in the data that could be further investigated, and more. The utility of a model hinges on its ability to be reductive, or to reduce complex relationships to simpler terms. The primary influences in the data can be captured mathematically in a useful way, such as in a relationship that can be expressed as an equation.

Since the beginning of the 21st century, mathematical models have become ubiquitous in our daily lives, in both obvious and subtle ways. A typical day for many people might involve checking the weather to see when might be a good time to walk the dog, ordering a product from a website, typing a text message to a friend and having it autocorrected, and checking email. In each of these instances, there is a good chance that some type of model was involved. In some cases, the contribution of the model might be easily perceived (“You might also be interested in purchasing product X”) while in other cases, the impact could be the absence of something (e.g., spam email). Models are used to choose clothing that a customer might like, to identify a molecule that should be evaluated as a drug candidate, and might even be the mechanism that a nefarious company uses to avoid the discovery of cars that overpollute. For better or worse, models are here to stay.

NOTE
There are two reasons that models permeate our lives today:

    - An abundance of software exists to create models
    - It has become easier to capture and store data, as well as make it accessible

This book focuses largely on software. It is obviously critical that software produces the correct relationships to represent the data. For the most part, determining mathematical correctness is possible, but the reliable creation of appropriate models requires more. In this chapter, we outline considerations for building or choosing modeling software, the purposes of models, and where modeling sits in the broader data analysis process.

Fundamentals for Modeling Software

It is important that the modeling software you use is easy to operate properly. The user interface should not be so poorly designed that the user would not know that they used it inappropriately. For example, Baggerly and Coombes (2009) report myriad problems in the data analyses from a high-profile computational biology publication. One of the issues was related to how the users were required to add the names of the model inputs. The software user interface made it easy to offset the column names of the data from the actual data columns. This resulted in the wrong genes being identified as important for treating cancer patients and eventually contributed to the termination of several clinical trials (Carlson 2012).

If we need high-quality models, software must facilitate proper usage. Abrams (2003) describes an interesting principle to guide us:

    The Pit of Success: in stark contrast to a summit, a peak, or a journey across a desert to find victory through many trials and surprises, we want our customers to simply fall into winning practices by using our platform and frameworks.

Data analysis and modeling software should espouse this idea.

Second, modeling software should promote good scientific methodology. When working with complex predictive models, it can be easy to unknowingly commit errors related to logical fallacies or inappropriate assumptions. Many machine learning models are so adept at discovering patterns that they can effortlessly find empirical patterns in the data that fail to reproduce later. Some of these methodological errors are insidious in that the issue can go undetected until a later time when new data that contain the true result are obtained.

WARNING
As our models have become more powerful and complex, it has also become easier to commit latent errors.

This same principle also applies to programming. Whenever possible, the software should be able to protect users from committing mistakes. Software should make it easy for users to do the right thing.

These two aspects of model development (ease of proper use and good methodological practice) are crucial. Since tools for creating models are easily accessible and models can have such a profound impact, many more people are creating them. In terms of technical expertise and training, creators’ backgrounds will vary. It is important that their tools be robust to the user’s experience. Tools should be powerful enough to create high-performance models, but, on the other hand, should be easy to use appropriately. This book describes a suite of software for modeling that has been designed with these characteristics in mind.

The software is based on the R programming language (R Core Team 2014). R has been designed especially for data analysis and modeling. It is an implementation of the S language (with lexical scoping rules adapted from Scheme and Lisp), which was created in the 1970s to “turn ideas into software, quickly and faithfully” (Chambers 1998).

R is open source and free. It is a powerful programming language that can be used for many different purposes but specializes in data analysis, modeling, visualization, and machine learning. R is easily extensible; it has a vast ecosystem of packages, mostly user-contributed modules that focus on a specific theme, such as modeling, visualization, and so on.

One collection of packages is called the tidyverse (Wickham et al. 2019). The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. Several of these design philosophies are directly informed by the aspects of software for modeling described in this chapter. If you’ve never used the tidyverse packages, Chapter 2 contains a review of basic concepts. Within the tidyverse, the subset of packages specifically focused on modeling are referred to as the tidymodels packages. This book is a practical guide for conducting modeling using the tidyverse and tidymodels packages. It shows how to use a set of packages, each with its own specific purpose, together to create high-quality models.

Types of Models

Before proceeding, let’s describe a taxonomy for types of models, grouped by purpose. This taxonomy informs both how a model is used and many aspects of how the model may be created or evaluated. While this list is not exhaustive, most models fall into at least one of these categories: descriptive, inferential, or predictive.

Descriptive Models

The purpose of a descriptive model is to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data.

For example, large-scale measurements of RNA have been possible for some time using microarrays. Early laboratory methods placed a biological sample on a small microchip. Very small locations on the chip can measure
