📄 Page
3
Practical Machine Learning with Spark
Uncover Apache Spark’s Scalable Performance with High-Quality Algorithms Across NLP, Computer Vision and ML
Gourav Gupta
Dr. Manish Gupta
Dr. Inder Singh Gupta
www.bpbonline.com
📄 Page
4
FIRST EDITION 2022
Copyright © BPB Publications, India
ISBN: 978-93-91392-086

All Rights Reserved. No part of this publication may be reproduced, distributed or transmitted in any form or by any means or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored and executed in a computer system but cannot be reproduced by means of publication, photocopy, recording, or by any electronic and mechanical means.

LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY
The information contained in this book is true and correct to the best of the author’s and publisher’s knowledge. The author has made every effort to ensure the accuracy of these publications, but the publisher cannot be held responsible for any loss or damage arising from any information in this book. All trademarks referred to in the book are acknowledged as properties of their respective owners, but BPB Publications cannot guarantee the accuracy of this information.

www.bpbonline.com
📄 Page
5
Dedicated to Our Parents
📄 Page
6
About the Authors
Gourav Gupta is a data specialist with 5+ years of experience in Big Data, Artificial Intelligence, Deep Learning, Internet of Things and Digital Twin. He has worked on several interdisciplinary real-time projects that combine multiple digital technologies. His expertise is in architectural optimization and technical solutioning for Big Data, AI, Computer Vision, and Internet of Things. He also enjoys writing research articles and serves as a reviewer with Springer Journal.
https://www.linkedin.com/in/gourav-g-8929a560/
📄 Page
7
Dr. Manish Gupta is a 21st century researcher, innovator, and entrepreneur. He completed his Ph.D. at the reputed Jawaharlal Nehru University, India. Presently, he is working at the Department of Radiology, Perelman School of Medicine, University of Pennsylvania (UPENN), Philadelphia, USA. Before joining UPENN, Dr. Gupta worked at the Gwangju Institute of Science and Technology, Gwangju, South Korea. In addition, he is a founding member and Chief Research Advisor of a digital healthcare startup (Arogya Pandit Private Limited) in India. He has filed a patent and published several research articles in well-reputed SCI journals and international conferences/book chapters. His research interests are low-cost biosensor development, development and optimization of pulse sequences using MRI, and tumor classification using Machine Learning and Deep Learning with MRI. He is also working on several projects related to Big Data integration with Artificial Intelligence and Internet of Things. Dr. Gupta also loves to write poems and technical blogs.
https://www.linkedin.com/in/manish-gupta-ph-d-9544ba60/

Professor (Dr.) Inder Singh Gupta is a seismologist, statistician, mathematical modeler, and Data Science expert. He has 37+ years of rich experience in research, teaching, and serving as Principal Supervisor for many government-funded projects, along with numerous research publications in reputed international journals and conferences. He is also the author of many undergraduate and postgraduate mathematics books. He has retired from JVMGRR(PG) College, India, and currently serves as Chief Executive Officer of a digital healthcare startup (ArogyaPandit Private Limited, India (arogyapandit.com)).
https://www.linkedin.com/in/dr-i-s-gupta-87aa2120/
📄 Page
8
About the Reviewers
Kiran Raja is a Faculty Member with the Department of Computer Science at the Norwegian University of Science and Technology (NTNU), Norway. He received his PhD degree in Computer Science from NTNU in 2016. He was/is participating in the EU projects FP7-INGRESS, H2020-SOTAMD, H2020-iMARS, and other national projects. During his participation in the SOTAMD and iMARS projects at NTNU, he has worked on different problems in morphing attacks from both generation and detection perspectives. He is a member of the European Association of Biometrics (EAB) and chairs the Academic Special Interest Group at EAB. He also advises various national agencies in Norway on making biometric systems secure. His recent research focuses on attacks and defenses on biometric systems using statistical pattern recognition, image processing, and machine learning. He has authored several papers in his field of interest and serves as a reviewer for several journals and conferences. He also serves as program chair for the BIOSIG conference and is a member of the editorial board of various journals.

Er. Nidhi Gupta has 9 years of extensive experience in troubleshooting and testing advanced analytics applications deployed on on-premise and cloud-based architectures. Currently, she is associated with the Department of Treasury and Finance under the Australian Government as a “Senior Test Analyst”, where she uses a range of tools such as Selenium, Talend, Jenkins, the AWS stack, Cucumber, RestAssured, Robotic Process Automation (RPA), Protractor, and JMeter (with Python, PySpark, Java, and TypeScript) to execute manual and automated test cases. She has also been responsible for delivering Machine Learning and Big Data based projects without defects. Apart from being a technocrat, she loves travelling and trekking with loved ones in her leisure time. She can be reached at nidhigupt8190@gmail.com, nidhi.gupta@arogyapandit.com, or linkedin.com/in/nidhi-gupta-957458bb.
📄 Page
10
Acknowledgements
It gives me profound happiness to deliver this book to all my readers across the globe who work in the domain of advanced analytics and intelligence. In this book, I have tried my best to explain all the essential information for extending the adaptability of distributed processing towards Big Data and Artificial Intelligence.

First and foremost, special thanks to my mother, Mrs. Varsha Gupta, for providing the ideal atmosphere while I wrote the book chapters. I would also like to thank the co-authors of this book, Dr. I.S. Gupta and Dr. Manish Gupta, for their helpful and valuable guidance. This book would not have been possible without the encouragement of my brother-in-law, Er. Manish Gupta, my younger brother, Sourav Gupta, and other family members.

Finally, I would like to thank Mr. Nrip Jain and the entire BPB team for providing the opportunity to write this book. I also cannot thank the reviewers, Dr. Kiran Raja and Er. Nidhi Gupta, enough for improving the standard and quality of this book. I trust that the content of this book will hold the reader's interest.
- Gourav Gupta

In the last two decades, we have continually witnessed tremendous growth in digital data coming from numerous digital platforms. To handle this massive amount of data, advanced analytics and intelligence techniques are steadily gaining popularity among the data science community across the globe. The present book is a sincere attempt to bring all analytics techniques under one umbrella for the convenience of readers. It is my great privilege to introduce this book to data analysts and the science community. This book can help bridge the gap between the academic community and corporate researchers.

No words can articulate my infinite indebtedness to a loving family whose unending love always gave me the moral strength to complete this book within the scheduled time frame.
I owe an enormous debt of gratitude to my co-authors for countless technical discussions and also for their erudition.
📄 Page
11
I owe an immeasurable debt to both reviewers for their active support, which kept me from feeling let down during the finalisation of this book. I appreciate the efforts of both in steering my endeavours in the right direction.

In the end, needless to say, without the active support of the entire BPB family, this would have remained an unfulfilled dream.
- Dr. Manish Gupta

In the era of automation, it has become necessary to keep the public informed about upcoming advancements in machine learning and deep learning. It is quite difficult to achieve more precision with fewer computations when training and testing an intelligent system without applying statistical methods and mathematical concepts. In my 40 years of teaching and research experience, I have taught and delivered numerous international and national lectures on statistical methods, numerical methods, and operational research methods for solving tedious problems in seismology, particularly the theoretical treatment of wave propagation in solids. As a co-founder and director at ArogyaPandit Private Limited, India, I help and teach my data science team the core and advanced mathematical functions and calculations in AI.

I also express my gratitude to my supervisor, Professor Dr. Sarva Jit Singh (former head of the mathematics department, MDU, Rohtak, India), for his blessing and support throughout my professional life. I would like to thank my wife and family members for their cooperation. I also thank the reviewers, Mrs. Nidhi Gupta and Dr. Kiran Raja, for improving the book's contents and technical refinement. Finally, I would like to thank BPB Publications for providing this opportunity.
- Dr. I.S. Gupta
📄 Page
12
Preface
Since 1964, when work on automation and machine intelligence began, machine learning (ML) has developed steadily, and its applications have made tremendous progress during the last two decades. Still, there is considerable scope for improvement in making fast and accurate decisions. The aim of the present book is to make readers aware of the day-to-day activities that ML applications built on Apache Spark make smarter and more comfortable.

Initially, ML relied on a single, standalone processing framework to solve critical problems. Because of standalone processing, training and testing models usually takes more time and requires more resources. The problem also becomes more complex and time-consuming for big data (high feature dimensionality and large data volume). Therefore, a promising in-memory analytics layer such as Apache Spark is needed to handle and train heavy intelligence models in an optimised manner. Two well-known distributed frameworks are Apache Hadoop and Apache Spark. Due to some limitations in Hadoop, most MNCs later adopted Apache Spark.

This book contains comprehensive and lucid details, from scratch to production-level implementation of a distributed framework, which readers will find useful. Readers will also learn to transition easily from conceptual scenarios to practical implementation and learn about the various components of ML pipelines using Apache Spark. A GitHub link is also provided in this book so that readers can try the practical material using the codebases.

Chapter 1 introduces ML and its varied real-time applications across domains. It also gives a concise discussion of derived technologies such as Neural Network (NN) and Deep Learning (DL) in connection with ML applications. The path from the evolution of ML to its future scope is also covered in detail.

Chapter 2 deals with issues including handling, storing, and processing large volumes of data by leveraging the Distributed Framework (DF). The installation and configuration of Apache Spark on on-premises systems,
📄 Page
13
Apache Spark on cloud-based systems, Python, DBeaver, Code Editors, and PowerBI are also discussed in depth in this chapter.

Chapter 3 covers the various ways to read and manipulate heterogeneous data formats, a detailed explanation of the architecture, optimization, and interactive monitoring of Spark jobs through Apache Livy. Workflow creation through Apache Oozie and other tools for creating a unified pipeline are also covered in this chapter.

Chapter 4 presents in-depth knowledge of the various components of ML pipelines, actions, and transformations for building a unified ML pipeline using Apache Spark. This chapter also explains the SparkML methods for training and testing an intelligence model on actual data.

Chapter 5 deals with distributed processing-based supervised learning along with its implementation. Regression and classification performance metrics for checking the performance of a model are also discussed.

Chapter 6 highlights the use of unsupervised learning methods for clustering random samples to understand hidden patterns in the data, find outliers, and so on. The implementation of each learning method is given in this chapter.

Chapter 7 deals with the evolution of Natural Language Processing (NLP) and its distributed processing using the Spark NLP library, along with its future scope. Topic modelling, text classification, and sentiment analysis are also discussed in detail.

Chapter 8 is concerned with the recommendation engine and its distributed processing-based operation. Its uses for recommending products, services, and information are also covered.

Chapter 9 discusses the use of DL to improve computational performance and thereby reduce time consumption and cost. The evolution of DL, an explanation of its components, and advancements in DL are also discussed in this chapter.

Chapter 10 gives comprehensive details regarding the evolution of Computer Vision (CV) and its related libraries, core components, data augmentation, and applications. CV enhancements are also discussed, along with their practical implementation in real-time CV-based pipelines.
📄 Page
14
Code Bundle and Coloured Images
Please follow the link to download the Code Bundle and the Coloured Images of the book:
https://rebrand.ly/lrsgks7

The code bundle for the book is also hosted on GitHub at https://github.com/bpbpublications/Practical-Machine-Learning-with-Spark. In case there's an update to the code, it will be updated on the existing GitHub repository.

We have code bundles from our rich catalogue of books and videos available at https://github.com/bpbpublications. Check them out!

Errata
We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and to provide an engaging reading experience to our subscribers. Our readers are our mirrors, and we use their inputs to reflect and improve upon human errors, if any, that may have occurred during the publishing processes involved. To let us maintain the quality and help us reach out to any readers who might be having difficulties due to any unforeseen errors, please write to us at:
errata@bpbonline.com

Your support, suggestions and feedback are highly appreciated by the BPB Publications’ Family.

Did you know that BPB offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.bpbonline.com and as a print book
📄 Page
15
customer, you are entitled to a discount on the eBook copy. Get in touch with us at: business@bpbonline.com for more details. At www.bpbonline.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.
📄 Page
16
Piracy
If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at business@bpbonline.com with a link to the material.

If you are interested in becoming an author
If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit www.bpbonline.com. We have worked with thousands of developers and tech professionals, just like you, to help them share their insights with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Reviews
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions. We at BPB can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about BPB, please visit www.bpbonline.com.
📄 Page
17
Table of Contents

1. Introduction to Machine Learning
    Introduction
    Structure
    Objectives
    Evolution of Machine Learning
    Fundamentals and Definition of Machine Learning
    Types of Machine Learning
    Learning of Models Based on the First Criteria
    Supervised Learning (SL)
    Unsupervised Learning (USL)
    Reinforcement Learning (RL)
    Hybrid Learning Problem (HLP)
    Learning of Models Based on Second Criteria (Batch Mode Learning and Online Mode Learning)
    Batch Learning
    Online Learning
    Applications of Machine Learning
    Recommendation Engine
    Financial Services
    Social Media
    Face Recognition
    Healthcare
    Sentiment Analysis
    Video Surveillance
    Future Scope of Machine Learning
    A New Trail of Intelligence Augmentation (IA)
    Edge Computing with ML
    Quantum Computing with ML
    Improved Cognitive Services
    Robotics
    Machine Learning in Space Exploration
    Self-driving Cars and Autonomous Transportation
    Enhanced Healthcare using AI
📄 Page
18
    Conclusion

2. Apache Spark Environment Setup and Configuration
    Introduction
    Structure
    Objectives
    Laconic View on Apache Spark
    Apache Spark Installation using Hortonworks Sandbox
    VMware Workstation Player Installation
    ClouderaVM Installation for HDP
    Apache Hadoop and Apache Spark Setup on Amazon Web Services (AWS)
    AWS Account Credentials and Amazon EC2 Creation
    PuTTY and PuTTYgen Software for Generating a .ppk file from a .pem and Accessing the Amazon EC2 Instance Through a Public IP Address
    Apache Ambari Installation on Amazon EC2
    Disabling the iptables
    Installation of Apache Ambari Repository and Hadoop Services on Amazon EC2
    Python Editors for the Spark Programming Framework
    Sublime Editor
    PySpark or Python Codebase Syncing from a Server to a Local Directory and Vice Versa
    Jupyter Notebook
    Microsoft PowerBI Installation for Data Visualization
    DBeaver Installation for Accessing the Data from the Persistence Layer
    Apache Spark Installation on Google Colab
    Conclusion

3. Apache Spark
    Introduction
    Structure
    Objectives
    Need of Apache Spark
    Evolution of Apache Spark
    Apache Spark Components
📄 Page
19
    Architecture of Apache Spark
    Resilient Distributed Dataset (RDD)
    Direct Acyclic Graph (DAG) in Spark
    Lazy Evaluation
    DataFrames
    Datasets
    Accumulator and Broadcast
    Accumulator
    Broadcast
    Apache Spark Optimization and its Techniques
    Memory Storage Levels: Cache and Persist
    Spark Submit
    Spark Monitoring
    Apache Livy: An Easy Interaction With a Spark Cluster Over a REST Interface
    Job Scheduling
    Spark RDD Operations: Transformation and Action
    Data Ingestion in Apache Spark
    Application of Apache Spark
    Conclusion

4. Apache Spark MLlib
    Introduction
    Structure
    Objectives
    Spark MLlib Algorithms
    Classification Category
    Regression Category
    Clustering Category
    ML Components/Pipelines
    DataFrame
    Transformer
    Estimator
    Pipeline
    Parameter
    CrossValidator
    Evaluator
📄 Page
20
    Spark MLlib’s Datatypes
    Local Vector
    Sparse Vector
    DenseVector
    LabelPoint
    Local Matrix
    Distributed Matrix
    Extracting, Transforming, and Selecting Features
    Term Frequency-Inverse Document Frequency (TF-IDF)
    Word2Vec
    CountVectorizer
    FeatureHasher
    Feature Transformers
    Tokenizer
    StopWordsRemover
    N-Gram
    Binarizer
    Principal Component Analysis (PCA)
    Polynomial Expansion
    Discrete Cosine Transform (DCT)
    StringIndexer
    IndexToString
    VectorIndexer
    Normalizer
    StandardScaler
    MinMaxScaler
    MaxAbsScaler
    Bucketizer
    ElementwiseProduct
    SQLTransformer
    VectorAssembler
    VectorSizeHint
    Quantile Discretizer (QD)
    Imputer
    Feature Selectors
    VectorSlicer
    ChiSqSelector