AI Projects in PyTorch: Hands-On Projects in Vision, Text, and Generative Models
Siddhesh Prashant Chaubal

AI Projects in PyTorch: Hands-On Projects in Vision, Text, and Generative Models

ISBN-13 (pbk): 979-8-8688-2116-5
ISBN-13 (electronic): 979-8-8688-2117-2
https://doi.org/10.1007/979-8-8688-2117-2

Copyright © 2025 by Siddhesh Prashant Chaubal

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Celestin Suresh John
Coordinating Editor: Gryffin Winkler

Cover designed by eStudioCalamar
Cover image by freepik (freepik.com)

Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 New York Plaza, New York, NY 10004. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a Delaware LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail booktranslations@springernature.com; for reprint, paperback, or audio rights, please e-mail bookpermissions@springernature.com.

Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at http://www.apress.com/bulk-sales.

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub (https://github.com/Apress). For more detailed information, please visit https://www.apress.com/gp/services/source-code.

If disposing of this product, please recycle the paper

Siddhesh Prashant Chaubal
Thane, Maharashtra, India

Dedicated to my amazing wife, Sayali.

About the Author

Dr. Siddhesh Prashant Chaubal has dedicated his career to building and studying intelligent systems, from cutting-edge research in artificial intelligence to large-scale machine learning platforms powering real-world applications. Currently, he works as a Research Scientist at Dream11 in Mumbai. In earlier roles, he has served as a Staff Engineer at Qualcomm India and an Applied Scientist at Amazon in Seattle. He holds a B.Tech. in Computer Science from IIT Bombay and a PhD from the University of Texas at Austin, where his research explored theoretical aspects of computer science and machine learning. His work has been published in leading international conferences such as CIKM and MFCS. When not immersed in AI, he enjoys reading, playing chess, or listening to music.

About the Technical Reviewer

Shibsankar is currently working as a Senior Data Scientist at Microsoft. He has 10+ years of experience working in IT, where he has led several data science initiatives, and in 2019, he was recognized as one of the top 40 data scientists in India. His core strength is in GenAI, deep learning, NLP, and graph neural networks. Currently, he is focusing on his research on AI agents and knowledge graphs. He has experience working in the domains of foundational research, FinTech, and ecommerce. Before Microsoft, he worked at Optum, Walmart, Envestnet, Microsoft Research, and Capgemini. He pursued a master's from the Indian Institute of Technology, Bangalore.

Acknowledgments

I am grateful to Santanu Pattanayak, my manager at Qualcomm, who first introduced me to Apress. His passion for learning and writing has been a true inspiration and a significant influence on my own journey.

My sincere thanks also go to the entire editing and production team at Apress for their invaluable guidance and support. In particular, I am grateful to Celestin, Shibsankar, Nirmal, and the rest of the team, whose efforts have been instrumental in shaping this book into its final form.

I am profoundly thankful to my parents, who have been a source of strength and encouragement throughout my life. Above all, I would like to express my deepest gratitude to my wife, whose encouragement, support, and patience have been indispensable to the completion of this book. She has also been kind enough to prepare many of the illustrations of this book, which add clarity and support to the explanations.

Introduction

This book is primarily meant as a hands-on, project-based segue into artificial intelligence for software engineers. It also serves as a guide to mastering PyTorch, which is one of the most popular frameworks for deep learning. The initial chapters cover the fundamentals of machine learning and PyTorch. The book then moves into different domains of AI, namely, computer vision, natural language processing, audio classification, and recommender systems. Each domain is brought to life through one or more end-to-end projects.

Who Will Benefit from This Book

This book is primarily meant as an introduction to AI and its different domains for readers familiar with Python. As such, anyone with an intermediate grasp of Python who is interested in venturing into AI will benefit from reading this book. Specifically, software engineers and other professionals who use Python in their work are one of the primary audiences. Curious students and enthusiasts can also benefit, discovering practical ways to begin their journey in AI.

How This Book Is Organized

The book starts with two background chapters that build the foundations for all the projects to follow. Chapter 1 gives a primer on AI and machine learning, assuming no prior knowledge. Chapter 2 dives into the nitty-gritty of PyTorch, with several programming illustrations and exercises at the end of the chapter. The subsequent chapters each contain one or more hands-on projects, typically beginning with a section that explains the necessary background of the domain before moving into the project.

Computer Vision

Chapter 3 introduces the field of computer vision with a project on image classification using convolutional neural networks (CNNs).

Natural Language Processing

Chapters 4–6 focus on natural language processing (NLP). Chapter 4 begins with an introduction to NLP along with common preprocessing steps such as tokenization, numericalization, padding, and truncation. It also explains the evolution of different modeling approaches, from RNNs to transformers, before moving into a text classification project using various strategies. Chapter 5 introduces modern NLP, teaching various aspects of the Hugging Face ecosystem, and tackling four different NLP tasks with pretrained models, including fine-tuning. Chapter 6 builds a transformer-based language model for storytelling.

Audio Classification

Chapter 7 introduces the audio processing domain, guiding the reader with a project in audio classification.

Recommender Systems

Chapter 8 covers the foundations of recommender systems and includes a hands-on project.

Multimodal Models

Chapter 9 walks the reader through an image captioning project using a multimodal model that combines vision and NLP.

How to Read This Book

This book is meant as a practical guide, so I strongly recommend running all the code in each chapter step by step. Treat the chapters like a lab notebook: experiment with different parameter settings, try variations, comment out parts of the code to see what breaks, and follow your curiosity. The more actively you explore, the more deeply you will master these ideas.

The first two chapters are foundational and warrant careful study. If you are new to PyTorch, be sure to complete all the exercises at the end of Chapter 2. If you are already comfortable with ML or PyTorch, you may skim them. The remaining chapters can generally be read in any order, though I recommend starting with Chapter 3 before going further, as it introduces additional practical PyTorch and ML concepts.

CHAPTER 1

Introduction to Machine Learning

© Siddhesh Prashant Chaubal 2025
S. P. Chaubal, AI Projects in PyTorch, https://doi.org/10.1007/979-8-8688-2117-2_1

Machine learning is a rapidly expanding field, both in academic research, with thousands of papers published every month, and in the development of revolutionary tech products like ChatGPT and Veo. This chapter explains the basics of machine learning, enabling you to understand the fundamentals behind these advanced technologies, some of which we will discuss in the later chapters.

The chapter begins with a brief introduction to artificial intelligence (AI) and machine learning (ML). Next, we take up a simple example to build intuition for the basics of machine learning, where we introduce linear regression and neural network algorithms. We then explain some of the ML concepts in depth, including practical subtleties like data collection, preprocessing, and feature engineering. We then move on to model training, explaining the mechanics of the gradient descent algorithm in detail. Finally, we conclude this chapter by explaining the ideas of model overfitting and underfitting.

AI and ML

Artificial intelligence (AI) has been a major topic of discussion across both social media and scientific forums over the past decade. AI encompasses all endeavors toward emulating human behaviors in machines, whether in self-driving cars, factory robots, or AI chatbots like ChatGPT. For developers, AI often takes the form of software that demonstrates humanlike cognitive abilities. The two main domains of AI where major strides have been made recently are computer vision and natural language processing. At a high level, the former seeks to emulate the functionalities of the human eye, while the latter develops a computer's comprehension of human languages. Most of the projects in this book will belong to one of these two domains of AI (or both). We will be working with data-driven algorithmic approaches to these problems, commonly referred to as machine learning (ML) in the literature, and henceforth, we will use these two terms (AI and ML) interchangeably.

Machine Learning with an Example

A House Price Prediction Problem

Let us say that a one-bedroom house in a certain neighborhood sells for 100K, a two-bedroom house sells for 200K, and a three-bedroom house sells for 300K. Now, as an ML expert, you are asked to predict the price of a four-bedroom house in that same locality (with no additional info). What would you say?

You would not need to write a Python program to predict the price of 400K for this house. In fact, a fourth grader could answer this question without batting an eyelid. But what methodology does our brain really use to come up with this answer? This question is worth asking because knowing the method lets us predict the price even when the numbers are less obvious.

Let us make it less obvious: say the price of a one-bedroom was 120K, that of a two-bedroom was 220K, and that of a three-bedroom was 320K. A little more thought will tell you that a similar method works after subtracting a constant value of 20K from each price. So, you will still be able to correctly predict 420K without using a pencil.

Linear Regression

Now let us make it even more interesting (see Figure 1-1): say the prices were 160K, 300K, and 440K for houses with one, two, and three bedrooms, respectively. Now, how would you predict the price of a four-bedroom house? One technique you could use is to plot these values on the y-axis, with the number of bedrooms on the x-axis, and fit a straight line passing through all these points. If no such line exists (i.e., the points are not collinear), then draw the line that passes as close as possible to these points.
This technique of predicting the price from the number of bedrooms, assuming a linear relationship between the two, is called linear regression in machine learning. It is the same technique that you used implicitly in the first two examples.
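The line-fitting idea can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the book's code: the helper name `fit_line` and the variable names are hypothetical, and it uses the standard closed-form least-squares formulas for a single input feature. It is applied to the first two price lists to check that the "obvious" answers are what a fitted line predicts.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


bedrooms = [1, 2, 3]

# First example: 100K, 200K, 300K (prices in thousands).
slope_a, intercept_a = fit_line(bedrooms, [100, 200, 300])
prediction_a = slope_a * 4 + intercept_a  # price of a four-bedroom house

# Second example: 120K, 220K, 320K (prices in thousands).
slope_b, intercept_b = fit_line(bedrooms, [120, 220, 320])
prediction_b = slope_b * 4 + intercept_b

print(f"Example 1: slope={slope_a}, intercept={intercept_a}, 4-bedroom={prediction_a}K")
print(f"Example 2: slope={slope_b}, intercept={intercept_b}, 4-bedroom={prediction_b}K")
```

In both cases the fitted slope is 100 (thousand per bedroom); only the intercept differs (0 versus 20), which matches the "subtract a constant 20K" reasoning used in the second example.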

Figure 1-1. Linear regression

Mathematically, fitting a line is equivalent to choosing values for the variables m (slope) and c (intercept) in the equation y = mx + c. Here, x is the number of bedrooms in the house, and y is the price of the house (in thousands). For the three houses, we have these three equations:

160 = m + c
300 = 2m + c
440 = 3m + c

Subtracting the first equation from the second gives m = 140, and substituting back gives c = 20; the third equation is also satisfied, since 3 * 140 + 20 = 440. So, the answer to our original question comes out to be 4 * 140 + 20 = 580K, as you can also verify from the line we plotted.

In this case, we had three equations in two variables, and we got lucky to find values of m and c that satisfy all three perfectly (a system with more equations than variables could very easily have had no solution). However, in general, we cannot rely on luck every time, and therefore, we try to fit a line as closely as possible (as opposed to an exact fit, which may not be possible). More precisely, we aim to minimize the error as we select the parameters m and c. For linear regression, we usually use the mean squared error (MSE), defined as

\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]

(Equation 1)
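Equation 1 can be checked numerically on the three houses. The sketch below is illustrative only: the function name `mean_squared_error` and the second candidate line (m = 100, c = 20) are hypothetical choices, not taken from the book. It assumes that y_i is the actual price of house i and that ŷ_i = m * x_i + c is the price the line predicts for it.

```python
def mean_squared_error(m, c, xs, ys):
    """Average squared gap between the actual values and the line's predictions."""
    n = len(xs)
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / n


bedrooms = [1, 2, 3]
prices_k = [160, 300, 440]  # prices in thousands

# The line found by solving the three equations passes through every point.
exact_fit_error = mean_squared_error(140, 20, bedrooms, prices_k)

# A hypothetical worse candidate, shown only for comparison.
worse_fit_error = mean_squared_error(100, 20, bedrooms, prices_k)

# Prediction for a four-bedroom house using the exact-fit line.
four_bedroom_price_k = 140 * 4 + 20

print("MSE for m=140, c=20:", exact_fit_error)
print("MSE for m=100, c=20:", worse_fit_error)
print("Predicted four-bedroom price (K):", four_bedroom_price_k)
```

The exact-fit line has zero error, while any other choice of m and c gives a strictly positive MSE on these points; that comparison is what "minimizing the error" refers to.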
