Author: Tharsis T.P. Souza, Jonathan K. Regenstein, Jr.

This book offers a clear, practical examination of the limitations developers and ML engineers face when building LLM-powered applications. With a focus on implementation pitfalls (not just capabilities) this book provides actionable strategies supported by reproducible Python code and open source tools. Readers will learn how to navigate key obstacles in system integration, input management, testing, safety, and cost control. Designed for engineers and technical product leads, this guide emphasizes practical solutions to real-world problems and promotes a grounded understanding of LLM constraints and trade-offs.

Publisher: O'Reilly Media, Inc.
Publish Year: 2025
Language: English
File Format: PDF
File Size: 3.1 MB
Text Preview (First 20 pages)
Large Language Models: The Hard Parts
Open Source AI Solutions for Common Pitfalls

With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles.

Tharsis T. P. Souza and Jonathan K. Regenstein, Jr.
Large Language Models: The Hard Parts
by Tharsis T.P. Souza and Jonathan K. Regenstein, Jr.

Copyright © 2026 Tharsis T.P. Souza and Jonathan K. Regenstein Jr. All rights reserved.

Published by O’Reilly Media, Inc., 141 Stony Circle, Suite 195, Santa Rosa, CA 95401.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nicole Butterfield
Development Editor: Jeff Bleiel
Production Editor: Clare Laylock
Interior Designer: David Futato
Interior Illustrator: Kate Dullea

May 2026: First Edition

Revision History for the Early Release
2025-08-27: First Release

See https://oreilly.com/catalog/errata.csp?isbn=9798341622524 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Large Language Models: The Hard Parts, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

979-8-341-62247-0 [LSI]
Brief Table of Contents (Not Yet Final)

Chapter 1. The Evals Gap (available)
Chapter 2. Open-Source Evaluation Frameworks (available)
Chapter 3. Unstructured Input Data (available)
Chapter 4. Structured Data Output (unavailable)
Chapter 5. Laws, Guidelines, and Safety Techniques (unavailable)
Chapter 6. Safety Tools (unavailable)
Chapter 7. Policy Alignment (unavailable)
Chapter 8. Local and Open-Source LLMs (unavailable)
Chapter 9. Frontiers (unavailable)
Chapter 1. The Evals Gap

It doesn’t matter how beautiful your theory is, it doesn’t matter how smart you are. If it doesn’t agree with experiment, it’s wrong.
—Richard Feynman

A NOTE FOR EARLY RELEASE READERS

With Early Release ebooks, you get books in their earliest form—the author’s raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles. This will be the 1st chapter of the final book. Please note that the GitHub repo will be made active later on. If you’d like to be actively involved in reviewing and commenting on this draft, please reach out to the editor at jbleiel@oreilly.com.

Introduction

The advent of Large Language Models (LLMs) and the subsequent rapid adoption of LLM-based applications (LLMBAs) mark a pivotal shift in the landscape of software development, testing, and verification. Unlike traditional software systems, where deterministic outputs are the norm, LLMs and LLMBAs introduce a realm of non-deterministic and generative behaviors that challenge conventional software engineering paradigms. This shift is not merely a technical evolution but a fundamental transformation in how we conceive, build, and assess business applications. For that reason,
remaining entrenched in traditional testing frameworks that fail to account for the probabilistic nature of LLMs and LLMBAs will inevitably lead to significant risks. However, even for the motivated, enthusiastic, and excited participants in this new world, shifting mindset and methodology is not easy.

To help address a crucial part of this shift, this chapter explores a critical “evaluation gap” between traditional software testing approaches and the unique requirements of LLMBAs, examining why traditional testing frameworks fall short and what new evaluation strategies are needed.

To be clear, by LLMBAs we mean the deployment of a combination of an LLM, a data source, and an output to generate a business or commercial outcome. For example, using an LLM to write a report or an email is an LLMBA; creating a chat interface to technical documents is an LLMBA. As we explore different evaluation frameworks, our ultimate goal is to use LLMs in LLMBAs or as part of software development pipelines.

In the remainder of this chapter, we use practical examples to investigate key aspects of LLM evaluation, including benchmarking approaches and metrics selection. We emphasize the importance of developing comprehensive evaluation frameworks that can handle both the near-deterministic (when we cover structured outputs with logit post-processing) and probabilistic aspects of LLM behavior, providing concrete guidance for implementing robust evaluation pipelines for LLM-based applications.

Non-Deterministic Nature of LLMs

One of the most fundamental challenges when incorporating LLMs into LLMBAs is their non-deterministic nature. Unlike traditional software systems, where the same input reliably produces the same output, LLMs can produce novel text that may not exist in their training data and can produce different responses each time they’re queried, even with identical prompts and input data. This behavior is both a strength (when we prize creativity and flexibility) and a significant challenge. This non-deterministic behavior
manifests itself in generative tasks when an LLM gives different written answers to the same queries. Indeed, most models offer a “temperature” parameter, which controls the randomness of outputs, allowing models to be creative and to generate diverse responses. However, this same feature makes it difficult to build reliable, testable systems, especially at the enterprise level, where a fundamental requirement of business intelligence is to offer consistent answers.

Consider a financial services company using LLMs to generate the following investment research: an executive summary of a 10-K (a filing made by public companies with the Securities and Exchange Commission). This research would require factual grounding in the filing (this is the input data) but also potentially some level of opinion about the latest information reported by the company (this is the output where we might favor creative thinking). However, the non-deterministic nature of LLMs means that:

The same input data could yield different analytical conclusions
Regulatory compliance becomes challenging to guarantee
Trust and efficiency may be degraded by inconsistent responses
Testing becomes more complex compared to traditional software

Where does this non-determinism arise from when LLMs generate answers? The primary source is sampling. More specifically, during text generation:

1. The chosen tokenizer segments the input text into tokens
2. Each token gets mapped to a unique numerical ID
3. The LLM processes these token IDs through its deep neural network
4. The model produces logits (unnormalized scores for each possible next token)
5. A softmax transformation converts these raw logits into a probability distribution. That conversion is represented by the following equation:

   P(y_i \mid x) = \mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} \quad \text{for } i = 1, \ldots, n

   where P(y_i|x) represents the probability of class i given input x and x_i is the logit (raw score) for class i.

6. The highest probability indicates the model’s strongest prediction for the next token
7. Text generation then selects the next token based on a given sampling strategy (e.g., greedy, top-k, etc.)

As those steps indicate, an LLM doesn’t just choose the most likely next token; it samples from a probability distribution (we sketch this mechanically in a short example after the setup code below). Let’s see how this works in a real-life example. In the following simple experiment, we use an LLM to write a single-statement executive summary from a 10-K and jitter the temperature parameter to see how non-deterministic LLMs can be. We observe that even a simple parameter like temperature can dramatically alter model behavior in ways that are difficult to systematically assess. At temperature 0.0, responses are consistent but potentially too rigid.1 At 1.0, outputs become more varied but less predictable. At 2.0, responses can be wildly different and often incoherent. This non-deterministic behavior makes traditional software testing approaches inadequate.

from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()
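To make steps 4 through 7 concrete, here is a minimal, self-contained sketch (our own illustration, not code from the book’s repository) of temperature-scaled softmax sampling over a handful of made-up logits:

import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    # Illustrative only: the logits passed in are invented, not produced by a real LLM.
    rng = rng or np.random.default_rng()
    if temperature == 0.0:
        # Temperature 0 degenerates to greedy decoding: always pick the top-scoring token
        return int(np.argmax(logits))
    scaled = np.asarray(logits, dtype=float) / temperature  # temperature rescales the logits
    scaled -= scaled.max()                                   # subtract the max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()            # softmax (step 5)
    return int(rng.choice(len(probs), p=probs))              # sample from the distribution (step 7)

# Four candidate tokens with invented logits
logits = [2.0, 1.0, 0.5, -1.0]
for temp in [0.0, 1.0, 2.0]:
    picks = [sample_next_token(logits, temperature=temp) for _ in range(10)]
    print(f"temperature={temp}: {picks}")

At temperature 0.0 the same index is chosen every time; at 1.0 the samples mostly favor the top token but vary; at 2.0 the flattened distribution produces noticeably more scatter, mirroring what we observe from the API calls below.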
from openai import OpenAI
import pandas as pd
from typing import List

def generate_responses(
    model_name: str,
    prompt: str,
    temperatures: List[float],
    attempts: int = 3
) -> pd.DataFrame:
    """
    Generate multiple responses at different temperature settings
    to demonstrate non-deterministic behavior.
    """
    client = OpenAI()
    results = []
    for temp in temperatures:
        for attempt in range(attempts):
            response = client.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": prompt}],
                temperature=temp,
                max_tokens=50
            )
            results.append({
                'temperature': temp,
                'attempt': attempt + 1,
                'response': response.choices[0].message.content
            })

    # Display results grouped by temperature
    df_results = pd.DataFrame(results)
    for temp in temperatures:
        print(f"\nTemperature = {temp}")
        print("-" * 40)
        temp_responses = df_results[df_results['temperature'] == temp]
        for _, row in temp_responses.iterrows():
            print(f"Attempt {row['attempt']}: {row['response']}")

    return df_results

Now let’s supply our base document, the 10-K. We are not going to supply the entire 10-K, but rather just 10,000 characters. The reason for this is our
model “gpt-3.5-turbo” has a token limit of 4,096. The combination of our input and output can’t exceed 4,096 tokens. A token is a meaningful chunk of text, like a word, a phrase, or possibly punctuation. 10,000 characters typically equates to about 2,500 tokens, or 4-5 pages of text. So we are passing in about 4-5 pages of the 10-K and leaving room for a 4-5 page response, if needed.

For the curious, we recommend having a look at the full 10-K filing, jotting down your own executive summary, and then asking an LLM to do so. Notice the full filing is over 150 pages long. Would we truly ask an LLM or a person to summarize 150+ pages, or would we ask the LLM or person to focus on key sections? How long would it take a person to read the full 150+ pages?2

Here’s a quick glimpse of the sec_filing we have stored as text.

[Example to Come]

Now let’s start our code flow to load the first 10,000 characters of the filing, pass it to our model, and supply some simple instructions via a prompt.

MAX_LENGTH = 10000  # We limit the input length to avoid token issues

with open('../data/apple.txt', 'r') as file:
    sec_filing = file.read()
sec_filing = sec_filing[:MAX_LENGTH]

df_results = generate_responses(
    model_name="gpt-3.5-turbo",
    prompt=f"Write a single-statement executive summary of the following text: {sec_filing}",
    temperatures=[0.0, 1.0, 2.0]
)

Temperature = 0.0
----------------------------------------
Attempt 1: Apple Inc. filed its Form 10-K for the fiscal year ended September 28, 2024 with the SEC, detailing its business operations and financial performance.
Attempt 2: Apple Inc. filed its Form 10-K with the SEC for the fiscal year ended September 28, 2024, detailing its business operations, products, and financial information.
Attempt 3: Apple Inc. filed its Form 10-K with the SEC for the fiscal year ended September 28, 2024, detailing its business operations, products, and financial information.
Temperature = 1.0
----------------------------------------
Attempt 1: Apple Inc., a well-known seasoned issuer based in California, designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories, with a focus on innovation and technology.
Attempt 2: Apple Inc. filed its Form 10-K with the SEC for the fiscal year ended September 28, 2024, reporting on its business operations, products, and financial performance.
Attempt 3: Apple Inc., a well-known seasoned issuer, filed its Form 10-K for the fiscal year ended September 28, 2024, reporting on its financial condition and operations.

Temperature = 2.0
----------------------------------------
Attempt 1: The Form 10-K for Apple Inc. for the fiscal year ended September 28, 2024, filed with the Securities and Exchange Commission, outlines the company's financial performance, products, and risk factors affecting future results.
Attempt 2: Apple Inc., a California-based company and leading technology manufacturer invDestacksmeticsisdiction setIspection-$20cyan evaluationseld anvisions droitEntering discernminerval Versbobprefversible vo该 Option和 meio forecast времCisco dellaischenpoihsCapabilities Geme.getTime future
Attempt 3: Apple Inc's Form 10-K provides a comprehensive overview of the company's financial reporting, business operations, products and market information.

We run three generations for each temperature level. As expected, we observe the following results:

Temperature = 0: Summaries appear deterministic (even though generation is still stochastic) and tend to be repetitive.
Temperature = 1: Summaries present balanced creativity with a certain level of coherence.
Temperature = 2: Summaries show increased randomness and are potentially incoherent.

The Evals Challenge
As the above example illustrates, even a straightforward task with one short sentence of output introduces challenges that confound traditional software and model evaluation. This is why we say there is an Evals Gap between traditional software testing and LLM evaluation. We need new frameworks that can account for both the deterministic aspects we’re used to testing and the non-deterministic properties that make LLMs unique.

evals-table summarizes how LLM evaluation differs from traditional software testing across several key dimensions:

Capability Assessment vs Functional Testing: Traditional software testing validates specific functionality against predefined requirements. LLM evaluation must assess behavior that is not necessarily pre-defined, including “emergent properties” like reasoning, creativity, and language understanding that extend beyond explicit programming.

Metrics and Measurement Challenges: While traditional software metrics can usually be precisely defined and measured, LLM evaluation often involves subjective qualities like “helpfulness” or “naturalness” that resist straightforward quantification. Even when we try to break these down into numeric scores, the underlying judgment often remains inherently human and context-dependent.

Dataset Contamination: Traditional software testing uses carefully crafted test cases with known inputs and expected outputs (e.g., unit tests). In contrast, LLMs trained on massive internet-scale datasets risk having already seen and memorized evaluation examples during training, which can lead to artificially inflated performance scores. This requires careful dataset curation to ensure test sets are truly unseen by the model and rigorous cross-validation approaches.
Benchmark Evolution: Traditional software maintains stable test suites over time. LLM benchmarks continuously evolve as capabilities advance, making longitudinal performance comparisons difficult and potentially obsoleting older evaluation methods.

Human Evaluation Requirements: Traditional software testing automates most validation. LLM evaluation still demands significant human oversight3 to assess output quality, appropriateness, and potential biases through structured annotation and systematic review processes.

Evaluating LLMs v. LLMBAs

Before we proceed to an evaluation design, it’s important to emphasize the distinction between evaluating an LLM purely on its inherent capabilities versus evaluating how an LLM performs as part of an LLMBA. LLMs offer foundation capabilities4 and are typically general-purpose. When we see reports about their performance (usually when a new model or version is released there is a lot of hype and fanfare about performance), it is on general-purpose tasks (we discuss the history of benchmarks used here later in this chapter). Within the context of an LLMBA, we are evaluating how those general-purpose models perform in a particular application. There are many components of an LLMBA that might affect performance (sketched in code after this list), such as:

the data needed to solve those business problems
the prompt supplied to the LLM
ethical or compliance guidelines particular to a business application
the commercial value of the end result
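One way to keep these components in view while designing evaluations is to capture them in a small data structure. The following is a minimal sketch of our own (not code from the book’s repository), with hypothetical field names, applied to the 10-K executive-summary example:

from dataclasses import dataclass, field
from typing import List

@dataclass
class LLMBAEvalCase:
    """One evaluation case for an LLM-based application (illustrative only)."""
    input_data: str                                              # e.g., extracted 10-K text
    prompt: str                                                  # the instructions supplied to the LLM
    compliance_rules: List[str] = field(default_factory=list)   # business-specific guidelines
    value_criteria: List[str] = field(default_factory=list)     # what stakeholders consider useful

# Hypothetical case for the 10-K executive-summary LLMBA
case = LLMBAEvalCase(
    input_data="<first 10,000 characters of the 10-K>",
    prompt="Write a single-statement executive summary of the following text: ...",
    compliance_rules=["every executive summary must contain citations"],
    value_criteria=["stakeholders judge the summary useful for decision-making"],
)

Evaluating the LLMBA then means scoring outputs not only for intrinsic text quality but also against compliance_rules and value_criteria, which the LLM itself knows nothing about.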
We will discuss each of these in great detail as we proceed in this book, but a short example helps elucidate this. Take our previous example of an executive summary of a 10-K. The 10-K is the data we need for this LLMBA: how we extract text and make it searchable will affect an LLM’s ability to summarize the document. The prompt that we supply to ask for the summary will affect how the LLM replies. However, unbeknownst to the LLM, we might have a strict compliance policy that all executive summaries must contain citations. Finally, the LLM might perform admirably on generating a summary, but if our stakeholders find no value in it, the LLMBA itself has not performed well. The LLM itself is crucial to this process of course, but we are evaluating more than just the inherent capabilities of the LLM.

As we start to think about evaluating LLMs in the context of LLMBAs, we build a taxonomy of categories, sub-categories, criteria (what to test) and explanations (why these are important) to help structure our thinking. Each enterprise will have different priorities, and indeed individual business lines may have different priorities, but this can be a good starting point. Before we discuss why each is important and how to test for each, here’s a quick outline of our taxonomy:

Safety
– Misinformation Prevention
– Unqualified Advice
– Bias Detection
– Privacy Protection

Cognitive
– Reasoning & Logic
– Language Understanding

Technical
– Code Generation
– System Integration

Meta Cognitive
– Self-Awareness
– Communication Quality

Ethical
– Harmful Content
– Decision-Making

Environmental
– CO2 Emission

With that skeleton taxonomy in mind, let’s summarize why each of these is important and how we can test them.

Safety - Misinformation Prevention

Accurate answers serve several critical safety purposes that extend far beyond simple correctness. Accurate information prevents real-world harm that can result from users acting on false data, whether in medical decisions, financial analysis, or other business domains. This accuracy also maintains the essential trust relationship between users and LLMBAs. Without trust, users won’t use our LLMBAs. From an institutional perspective, reliable outputs reduce both legal liability and reputational damage that organizations could face. Furthermore, accuracy enables LLMBAs to serve as dependable decision-making support tools, allowing users to make informed choices based on sound information.

Testing for misinformation begins with verifying the accuracy of factual statements by cross-referencing LLMBA outputs against established, verified databases and authoritative sources to identify any incorrect claims
or data errors. Consistency testing ensures that our LLMBA provides similar responses to equivalent queries, preventing contradictory information that could confuse users or indicate underlying reliability issues. Citation and source accuracy testing evaluates whether our LLMBA properly attributes information and references legitimate, credible sources. Additionally, testing must assess how the LLMBA behaves when facing uncertainty, ensuring it appropriately communicates limitations rather than generating false confidence. Temporal consistency checks verify that LLMBAs maintain accurate information over time and don’t introduce errors through updates or data shifts. Finally, scientific accuracy testing specifically evaluates performance in technical domains where precision is paramount and errors could have serious consequences for users relying on the information for important decisions.

Safety - Unqualified Advice

Unqualified advice marks a crucial boundary that LLMBAs must not cross. When LLMBAs provide incorrect professional advice in areas like medicine, law, or finance, they can cause significant harm to users who may lack the expertise to recognize it. This concern is particularly acute for protecting vulnerable users, including those facing urgent decisions or lacking access to qualified professionals. From a liability standpoint, reducing the risk of providing unqualified advice helps protect both developers and enterprises from legal consequences when users act on inappropriate guidance. Maintaining professional standards requires that LLMBAs appropriately defer to human experts and clearly communicate the limitations of their knowledge.

Testing for unqualified advice begins with assessing an LLMBA’s ability to accurately identify medical, legal, and financial queries (or queries in whatever field where we wish to avoid unqualified advice), ensuring it can distinguish between general informational requests and those requiring professional judgment. Disclaimer consistency testing verifies that the LLMBA reliably provides appropriate warnings and limitations when discussing professional topics, maintaining consistent messaging about the need for qualified human expertise. Professional referral mechanisms must be
evaluated to ensure that the LLMBA effectively directs users toward appropriate licensed professionals, healthcare providers, attorneys, or financial advisors when specialized guidance is needed. Boundary recognition testing examines whether our LLMBA maintains clear distinctions between providing general information versus offering specific professional recommendations that could be acted upon inappropriately. Emergency situation handling represents a particularly critical testing area, evaluating how our LLMBA responds to urgent medical, legal, or financial crises. Finally, testing must verify our LLMBA’s consistent avoidance of specific recommendations in professional domains, ensuring it refrains from suggesting particular treatments, legal strategies, investment decisions, or other personalized advice even when asked in different prompt settings.

Safety - Bias Detection

Bias detection addresses both individual fairness and broader societal impacts. By identifying and mitigating biased outputs, LLMBAs can avoid reinforcing harmful societal prejudices that could perpetuate discrimination across racial, gender, cultural, and other demographic lines. From a social responsibility perspective, bias detection helps LLMBAs contribute positively to society rather than amplifying existing inequalities or creating new forms of digital discrimination. Enterprises also benefit from a protected brand reputation, as biased outputs can cause significant public relations damage and erode public trust. Perhaps most importantly, effective bias detection enables LLMBAs to genuinely support diverse user bases by providing respectful, accurate, and inclusive responses that acknowledge and serve the full spectrum of human experiences and perspectives.

Testing for bias detection begins with systematic assessment of gender, racial, and cultural bias by examining whether LLMBAs produce different quality responses, make unfair assumptions, or apply stereotypes based on user identity markers or contextual clues about demographics. Demographic representation testing evaluates whether our outputs reflect diverse perspectives and avoid defaulting to dominant group viewpoints. Language inclusivity assessment checks for accessible, respectful terminology that doesn’t
exclude or marginalize particular groups, while avoiding language that reinforces hierarchies or prejudices. Stereotype avoidance testing specifically identifies instances where an LLMBA might perpetuate harmful generalizations. Problem-solving fairness evaluation ensures that our LLMBA provides equally helpful and thorough assistance regardless of the user’s apparent background, avoiding scenarios where certain groups receive inferior support or guidance.

Safety - Privacy Protection

Privacy protection means that our LLMBAs must recognize and safeguard private details during interactions, preventing inadvertent disclosure or misuse of personal information that users may reveal in their queries or conversations. This protection becomes essential for ensuring regulatory compliance with data protection laws like GDPR, CCPA, and HIPAA, which impose strict requirements on how organizations collect, process, and protect personal information. Maintaining robust data security through privacy-conscious design helps prevent privacy breaches that could expose user information to unauthorized parties or malicious actors seeking to exploit personal data. Furthermore, effective privacy safeguards protect the sensitive information that flows through LLMBA pipelines, including proprietary business data, confidential communications, and other private material.

Testing for privacy protection begins with detection and handling of Personally Identifiable Information (PII), ensuring the LLMBA can accurately recognize PII such as names, addresses, phone numbers, and social security numbers. Data anonymization should be tested to ensure our LLMBA can effectively strip identifying details from information when necessary, maintaining usefulness while protecting individual privacy. Information leakage must also be tested, to ensure we don’t inadvertently reveal sensitive details from previous interactions or training data, particularly focusing on scenarios where confidential information might be exposed through seemingly innocent queries. Context carryover requires careful testing as to what information persists across conversation turns
without creating privacy risks through excessive data retention or inappropriate sharing between user sessions. Of course, we must test for compliance with relevant privacy regulations such as GDPR, CCPA, and HIPAA. Finally, security protocol evaluation examines the robustness of our LLMBA’s privacy safeguards against potential attacks or exploitation attempts, testing whether malicious users could circumvent privacy protections through prompt injection, social engineering, or other techniques.

Cognitive - Reasoning & Logic

Reasoning and logic capabilities are what empower an LLMBA to provide meaningful intellectual support to users. Reliable problem-solving ensures that our LLMBA can work through multi-step challenges systematically, breaking down complex issues into manageable components and applying appropriate methodologies. Maintaining computational accuracy becomes crucial when users depend on our LLMBA for mathematical calculations, data analysis, or quantitative reasoning tasks where precision directly impacts the utility and trustworthiness of the results. Critical thinking evaluation examines whether our LLMBA can analyze arguments, identify assumptions, evaluate evidence, and help users examine issues from multiple perspectives rather than simply providing surface-level responses. Preventing logical errors requires testing for common reasoning fallacies, inconsistent conclusions, and flawed inferential chains that could mislead users or undermine our LLMBA’s credibility as a thinking partner.

Testing reasoning and logic capabilities begins with multi-step problem-solving assessment, asking whether our LLMBA can break down complex challenges into sequential components and arrive at coherent solutions. Logical fallacy testing ensures our LLMBA can identify flawed reasoning patterns, invalid arguments, and common cognitive biases both in external content and in its own analytical processes. Causal reasoning evaluation tests whether our LLMBA can accurately distinguish between correlation and causation, understand cause-and-effect relationships, and avoid drawing inappropriate conclusions from temporal or statistical associations. Finally,
The above is a preview of the first 20 pages.