GPT Under the Hood


Barry S. Stahl

Solution Architect & Developer

@bsstahl@cognitiveinheritance.com

https://CognitiveInheritance.com

Transparent Half Width Image 720x800.png

Favorite Physicists & Mathematicians

Favorite Physicists

  1. Harold "Hal" Stahl
  2. Carl Sagan
  3. Richard Feynman
  4. Marie Curie
  5. Nikola Tesla
  6. Albert Einstein
  7. Neil deGrasse Tyson
  8. Niels Bohr
  9. Galileo Galilei
  10. Michael Faraday

Other notables: Stephen Hawking, Edwin Hubble

Favorite Mathematicians

  1. Ada Lovelace
  2. Alan Turing
  3. Johannes Kepler
  4. Rene Descartes
  5. Isaac Newton
  6. Emmy Noether
  7. George Boole
  8. Blaise Pascal
  9. Johann Gauss
  10. Grace Hopper

Other notables: Daphne Koller, Grady Booch, Leonardo Fibonacci, Evelyn Berezin, Benoit Mandelbrot

Some OSS Projects I Run

  1. Liquid Victor : Media tracking and aggregation [used to assemble this presentation]
  2. Prehensile Pony-Tail : A static site generator built in C#
  3. TestHelperExtensions : A set of extension methods helpful when building unit tests
  4. Conference Scheduler : A conference schedule optimizer
  5. IntentBot : A microservices framework for creating conversational bots on top of Bot Framework
  6. LiquidNun : Library of abstractions and implementations for loosely-coupled applications
  7. Toastmasters Agenda : A C# library and website for generating agendas for Toastmasters meetings
  8. ProtoBuf Data Mapper : A C# library for mapping and transforming ProtoBuf messages

Fediverse Supporter

Logos.png

http://GiveCamp.org

GiveCamp.png

Achievement Unlocked

bss-100-achievement-unlocked-1024x250.png
 
 
 

Resume Scanning

  • Armando's Resume: 12.7% "worse"
    • More distant from a match to the job listing
    • A 95 for Jonathon's resume ≈ an 84 for Armando's
  • If Armando had 2 additional years of experience
    • His score only increases by ≈ 1 point to 85
TwoResumes_800x800.jpg

Neural Networks

  • Machine Learning
    • Make predictions based on input
    • Model is tuned via "Training"
  • Training
    • Minimize the error in the model
Training a Neural Network 800x800.jpg
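
To make "training" concrete, here is a minimal sketch (not from the presentation) that tunes a single weight by gradient descent: each pass makes predictions, measures the error, and nudges the weight to reduce it. The data, learning rate, and iteration count are made up for illustration.

```python
import numpy as np

# Toy data: the "right" relationship is y = 3x, but the model must learn that.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 6.0, 9.0, 12.0])

w = 0.0                # single trainable weight, starts out wrong
learning_rate = 0.01   # illustrative value

for epoch in range(200):
    y_pred = w * x                     # make predictions based on input
    error = y_pred - y
    loss = np.mean(error ** 2)         # mean squared error to minimize
    grad = np.mean(2 * error * x)      # gradient of the loss with respect to w
    w -= learning_rate * grad          # nudge w in the direction that reduces the error

print(f"learned w = {w:.3f}, loss = {loss:.6f}")   # w approaches 3.0
```

Real neural networks repeat this same loop over millions of weights at once, but the idea is unchanged: adjust the parameters to minimize the error in the model.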

Transformer Architectures

A neural network architecture that has enabled the recent advancements in NLP

  • Based on the 2017 Google paper Attention Is All You Need by Vaswani et al.
  • Key Breakthrough: The Self-Attention Mechanism
    • Focus on most relevant words
    • Add critical context even at a distance
  • Enables Parallel Processing
    • Process all words in a sentence simultaneously
    • Unlike traditional RNNs that process data sequentially
Self Attention 800x800.jpg

Transformer (Simplified)

Transformer Architectures - 938x800.png

Agenda

Goal: Build intuition for when and why these models may be effectively applied by understanding how they work

  • Generative Transformer Models
    • Tokenization
    • Embedding
    • Attention
    • Transformer Blocks
  • Use-Case Analysis
    • What they don't work for
    • What these tools do well
    • How we can reduce risk
    • Ethical Considerations
Generative Transformer Models 800x800.jpg

Tokenization

Transformer Architectures - Tokenization - 938x800.png

Tokenization

Convert UTF-8 text containing words, word parts, or characters into an equivalent numeric representation

Unicode        Token   Comments
hello          15339   Common words
Don            8161    Common names
[space]Qu      3489    Common sequences
               26130   Foreign characters
█████          93429   Redaction characters
/>[newline]    10381   Symbols
Cosmic Library 800x800.jpg

GPT Tokenization

GPT-3.5 and GPT-4 use the cl100k_base tokenization model

  • Defines 100,256 tokens
    • Word
    • Word part
    • Character combination
    • Character
    • Character part
  • Efficiently represent language numerically
    • Minimize token usage
    • More compact than Unicode
    • More flexible than ASCII
    • Large vocabulary
      • All Unicode characters available
    • The most-common combinations
Cosmic Library Book 800x800.jpg
 

Exploring Tokenization

Reference Implementation - AI Demos on GitHub

  • Examples from the code
    • "Hello, World! How are you today? 🌍"
      • 9906, 11, 4435, 0, 2650, 527, 499, 3432, 30, 11410, 234, 235
    • "こんにちは、世界!お元気ですか?"
      • 90115, 5486, 3574, 244, 98220, 6447, 33334, 24186, 95221, 38641, 32149, 11571
    • "Hola, mundo! ¿Cómo estás hoy? 🇪🇸"
      • 69112, 11, 29452, 0, 29386, 96997, 1826, 7206, 49841, 30, 11410, 229, 103, 9468, 229, 116
AI Demos - Tokenization.png
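
The reference implementation is on GitHub and is not reproduced here, but token IDs like the ones above can be generated with OpenAI's tiktoken library and the cl100k_base encoding. A minimal sketch, assuming tiktoken is installed:

```python
import tiktoken

# Load the cl100k_base encoding used in the examples above.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, World! How are you today? 🌍",
             "こんにちは、世界!お元気ですか?",
             "Hola, mundo! ¿Cómo estás hoy? 🇪🇸"]:
    token_ids = enc.encode(text)        # text -> list of integer token IDs
    round_trip = enc.decode(token_ids)  # token IDs -> original text (lossless)
    print(f"{len(token_ids):3d} tokens: {token_ids}")
    assert round_trip == text
```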

Embedding

Transformer Architectures - Embedding - 938x800.png

Embeddings

  • A point in multi-dimensional space
  • Mathematical representation of a word or phrase
  • Encode both semantic and contextual information

  • Model: text-embedding-ada-002
  • Vectors normalized to unit length
  • Uses 1536 dimensions
Embeddings - Cosmic Desert 800x800.jpg
  Ram - Just Statements.png
  Ram - With Clusters.png
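
As a sketch of how such vectors are obtained in practice, the call below requests an ada-002 embedding via the openai Python package. This assumes the package is installed and an OPENAI_API_KEY environment variable is set; it is not code from the presentation.

```python
from openai import OpenAI   # assumes the `openai` package and an OPENAI_API_KEY env var

client = OpenAI()

# Request an embedding for a short piece of text.
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="I'm getting PC ram",
)

vector = response.data[0].embedding          # a point in 1536-dimensional space
print(len(vector))                           # 1536 dimensions for ada-002
print(sum(v * v for v in vector) ** 0.5)     # ~1.0: vectors are normalized to unit length
```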

Cosine Similarity & Distance

Relate vectors based on the angle between them

  • Cosine Similarity ranges from -1 to 1, where:

    • +1 indicates that the vectors represent similar semantics & context
    • 0 indicates that the vectors are orthogonal (no similarity)
    • -1 indicates that the vectors have opposing semantics & context
  • Cosine Distance is defined as 1 - cosine similarity where:

    • 0 = Synonymous
    • 1 = Orthogonal
    • 2 = Antonymous

Note: For normalized vectors, cosine similarity is the same as the dot-product

Cosine Unit Circle - Enhanced.jpg
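
These definitions translate directly into a few lines of NumPy. The vectors below are tiny 2-D examples (not real embeddings) chosen to hit the 0 / 1 / 2 distance cases:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: ranges from -1 to +1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity: 0 = synonymous, 1 = orthogonal, 2 = antonymous."""
    return 1.0 - cosine_similarity(a, b)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])    # orthogonal to a
c = np.array([-2.0, 0.0])   # points in the opposite direction from a

print(cosine_distance(a, a))   # 0.0 -> same direction
print(cosine_distance(a, b))   # 1.0 -> orthogonal
print(cosine_distance(a, c))   # 2.0 -> opposing

# For unit-length vectors the normalization is a no-op,
# so cosine similarity reduces to the dot product.
```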

Cosine Distance

Cosine Distance 989x600.png

Cosine Distance

Angles2.svg

Embedding Distance

Feature Example
Synonym "Happy" is closer to "Joyful" than to "Sad"
Language "The Queen" is very close to "La Reina"
Idiom "He kicked the bucket" is closer to "He died" than to "He kicked the ball"
Sarcasm "Well, look who's on time" is closer to "Actually Late" than "Actually Early"
Homonym "Bark" (dog sound) is closer to "Howl" than to "Bark" (tree layer)
Collocation "Fast food" is closer to "Junk food" than to "Fast car"
Proverb "The early bird catches the worm" is closer to "Success comes to those who prepare well and put in effort" than to "A bird in the hand is worth two in the bush"
Metaphor "Time is money" is closer to "Don't waste your time" than to "Time flies"
Simile "He is as brave as a lion" is closer to "He is very courageous" than to "He is a lion"

Usage of Embeddings

Embeddings can be used directly, or as an input to other models

  • Direct Usage
    • Measuring Semantic Distance
      • Quantify similarity between pieces of text
      • Useful for tasks like semantic search
    • Clustering for Pattern Discovery
      • Discover groupings in the data
      • Useful when categorizing user comments or other textual characteristics
  • Indirect Usage
    • Feature Vectors for Machine Learning
      • Inputs to traditional ML models
    • Input to Transformer Attention Mechanisms
      • Dynamically adjusted by attention blocks
      • Powers text generation tasks
Cosmic Desert under the Milky Way 800x800.jpg
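
A minimal sketch of the direct-usage case, semantic search: rank documents by the cosine similarity of their embeddings to a query embedding. The embed() call in the usage comment is a hypothetical placeholder for an embedding model such as text-embedding-ada-002.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_search(query_vector, documents, document_vectors, top_k=3):
    """Rank documents by how semantically close their embeddings are to the query."""
    scores = [cosine_similarity(query_vector, v) for v in document_vectors]
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]

# Usage (embed() is a placeholder for a real embedding call):
# results = semantic_search(embed("refund policy"), docs, [embed(d) for d in docs])
```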

Attention Blocks

Transformer Architectures - Attention - 938x800.png

"I'm getting PC ram"

Attention mechanisms consider relationships between words, helping the model emphasize relevant context, even if far away.

  • Pre-Attention
    • Does "Ram" represent a sheep, truck or memory?
    • The model has a general understanding of the word without context
  • Post-Attention
    • The "Ram" vector shifts towards "Computer Memory"
    • Context has been added
Ram Graph 2 800x800.png

Positional Encoding

Sine Waves 1280x720.png
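
The sine-wave image illustrates the kind of position-dependent pattern added to each token's embedding. Below is a sketch of the sinusoidal scheme from Attention Is All You Need; it is one common approach, not necessarily the exact encoding used by any particular GPT model.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, np.newaxis]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]             # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)     # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions get sine waves
    pe[:, 1::2] = np.cos(angles)    # odd dimensions get cosines
    return pe

# Each row is added to the corresponding token's embedding,
# so the model can tell the first word from the tenth word.
print(sinusoidal_positional_encoding(seq_len=8, d_model=16).shape)   # (8, 16)
```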

Attention Heads

The three matrices learned during training of the attention mechanism

  • Query (Q)
    • Can be thought of as a question about the current token
  • Key (K)
    • How well the token "attends to" the Query
  • Value (V)
    • How much the embedding should be moved towards each other token

When these three matrices are applied to the positionally encoded input, the model adds the appropriate context to each token.

Query-Key-Value 800x800.jpg
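
The Q, K, and V projections can be sketched as scaled dot-product attention for a single head. This is a minimal NumPy illustration with random weights, not the presentation's code; real models use many heads, masking, and parameters learned in training.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax: turn attention scores into weights that sum to 1."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv) -> np.ndarray:
    """Scaled dot-product attention for a single head.

    X          : (tokens, d_model) positionally encoded input embeddings
    Wq, Wk, Wv : the three learned matrices (Query, Key, Value)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to every other token
    weights = softmax(scores)                  # attention weights per token
    return weights @ V                         # context-adjusted representation of each token

# Tiny random example: 5 tokens, 8-dimensional embeddings and head.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8)
```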

Transformer Blocks

Transformer Architectures - Features - 938x800.png

Feature Layers

The Feature layers allow the Transformer to understand and generate complex language patterns through deep learning from the structure of the input data

  • Each layer acts as a mini neural network
  • In the Encoder:
    • Complex Feature Extraction from the input data
  • In the Decoder
    • Refines the model's predictions by processing the encoder's output alongside previous decoder outputs

By stacking multiple Feature layers between Attention layers, GPT models achieve deeper learning capabilities, allowing for more nuanced understanding and generation of text

Perceptron 800x800.jpg
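
A minimal sketch of one such "mini neural network": the position-wise feed-forward block found in transformer layers, written in NumPy with illustrative sizes. Layer normalization and dropout are omitted for brevity, and the weights here are random rather than learned.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, x)

def feed_forward_block(x, W1, b1, W2, b2) -> np.ndarray:
    """Expand each token's vector, apply ReLU, project back, then add the residual."""
    hidden = relu(x @ W1 + b1)   # expand to a wider hidden dimension
    out = hidden @ W2 + b2       # project back to the model dimension
    return x + out               # residual connection around the block

# Tiny example: 5 tokens, model dimension 8, hidden dimension 32.
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
print(feed_forward_block(x, W1, b1, W2, b2).shape)   # (5, 8)
```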

Transformer (Simplified)

Transformer Architectures - 938x800.png
 

Embeddings Search

Transformer Architectures - Embedding Only - 938x800.png

Cosine Distances

Country                                                   D(USA, Country)
United Kingdom                                            0.11981422
United Kingdom of Great Britain and Northern Ireland      0.16426033
England                                                   0.16678649
Argentine Republic                                        0.18766826
República Argentina                                       0.20281911
Country Distances.png

Embeddings Differences

  • Language
    • English (different dialects) vs Spanish
  • Historical Ties
    • USA and UK - Political (colonial), economic and cultural ties
  • Cultural Similarities
    • USA and UK - shared media, literature, entertainment, and educational exchanges including TV, movies, music, and news
  • Political Systems
    • USA and UK - Democracies w/ common roots
  • Other differences
    • Economic Ties
    • Geopolitical Alliances
    • Educational Systems
    • Legal Systems
Embedding Differences - USA UK Argentina - 800x800.jpg
 

Resume Summarization

  • System: As a resume scanner specializing in software engineers, focus solely on summarizing their key skills and experience durations, omitting demographics like age and location.

  • User: [Submits appropriate resume]

  • Assistant:

- Expertise in Java, J2EE, Spring Framework, Hibernate
- Experience with front-end technologies such as HTML, CSS, JavaScript, and AngularJS
- Proficient in SQL databases and ORM technologies
- Familiar with containerization and orchestration tools like Docker and Kubernetes
- Strong understanding of software design patterns and best practices
- Excellent problem-solving abilities and attention to detail
- Effective team player with strong communication skills

**Experience:**
- Senior Java Developer at XYZ Tech Solutions (June 2016 – Present, 5 years)
- Java Software Engineer at Innovative Software Co. (May 2012 – May 2016, 4 years)

**Certifications:**
- Oracle Certified Professional, Java SE 8 Programmer
- Certified ScrumMaster (CSM)

**Education:**
- Bachelor of Science in Computer Science, University of Technical Excellence (Graduated: May 2012)
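
The System/User/Assistant exchange above corresponds to a chat-style API call. A minimal sketch using the openai Python package follows; the model name and the resume placeholder are assumptions for illustration, not details from the presentation.

```python
from openai import OpenAI   # assumes the `openai` package and an OPENAI_API_KEY env var

client = OpenAI()

resume_text = "..."   # the submitted resume goes here

response = client.chat.completions.create(
    model="gpt-4o",   # any chat-capable model; illustrative choice
    messages=[
        {"role": "system", "content": (
            "As a resume scanner specializing in software engineers, focus solely on "
            "summarizing their key skills and experience durations, omitting "
            "demographics like age and location.")},
        {"role": "user", "content": resume_text},
    ],
)

print(response.choices[0].message.content)   # the Assistant summary shown above
```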

Resume Clustering

  • Clustering
    • Unsupervised machine learning technique
    • Clusters form around centroids (the geometric center)
    • Data points are grouped (clustered) based on similarity
    • Minimize the error (distance from centroid)
  • Resume Clustering
    • Identify a group of relevant resumes
    • Not for ranking, just filtering
    • Cluster the resumes, the GPT summaries, or both
    • Still not perfect
k-means results.png
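
One way to implement the clustering step is k-means over the resume embeddings, for example with scikit-learn. The sketch below uses random vectors as stand-ins for real embeddings; the cluster count and the "reference resume" choice are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# resume_vectors: one embedding per resume (e.g., 1536-dim ada-002 vectors).
# Random data stands in for real embeddings in this sketch.
rng = np.random.default_rng(42)
resume_vectors = rng.normal(size=(200, 1536))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(resume_vectors)   # cluster assignment for each resume

# Filter, don't rank: keep the resumes that landed in the same cluster
# as a reference resume (here, arbitrarily, the first one).
target_cluster = labels[0]
relevant = np.where(labels == target_cluster)[0]
print(f"{len(relevant)} resumes share cluster {target_cluster}")
```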

Limitations of Attention

  • Long Sequence Processing
    • Attention uses O(n²) memory
    • Some models may fragment or chunk the input
    • Be careful with attention over long input
  • Local Context Priority
    • Transformers excel at global dependencies
    • Less efficient focusing on local context
  • Irrelevant Data
    • Attention may overvalue irrelevant inputs
    • Especially in noisy, unstructured, or sparse data
  • Resource Constraints
    • Resource-constrained devices may not be able to handle Attention
  • Low-Resource Domains
    • Transformers require a large training corpus
    • Limited data yields poor attention performance
Limits of Attention 800x800.jpg

Ethical Concerns

  • Model Creation
    • Copyright violations
    • Energy Consumption
  • Model Usage
    • Bias and Fairness
    • Privacy Concerns
    • Integrity
    • Accountability and Transparency
Ethical Concerns 800x800.jpg

Model Creation Concerns

  • Copyright Infringement: Models were probably trained on copyrighted materials
    • Consumer Risk: Models removed or nerfed
    • Artist Risk: Theft of copyrighted materials undermines the rights of creators
  • Carbon Footprint: Training requires huge resources
    • Energy Usage: Massive compute
    • Water Usage: Cooling data centers
    • Sustainability: Rapid advancement
Model Creation Concerns 800x800.jpg

The model is biased

  • Not handling the bias makes bias a feature of your app
    • Understand what fairness means in your context
    • Understand the impact of being wrong
  • Take fairness very seriously
    • Prevent all predictable biases
    • Monitor for unpredictable biases
bias logo - large.jpg

Model Answers May Be

  • Incomplete
  • Poorly phrased
  • Outright wrong
No Takesies backsies.png

Model Privacy Concerns

  • Massive amounts of private data were likely gathered
    • What are the odds that all of it was caught?
    • Even anonymized data may be de-anonymized
  • Take great care in what you send to the Model
    • See Microsoft AI Recall for a recent example
Microsoft Copilot 800x800.png

Accountability & Transparency

Should this really be done using a stochastic model?

  • Why did the model produce the output it did?
  • Who is responsible for the answer?
  • What is the impact of the answer?
    • What if the answer was wrong?
    • What if the answer was evil?
    • What if humans suffer as a result?
A Computer Can Never Be Held Accountable 800x800.png

Safe Chatbot Interactions

  • Use the model to verify it isn't a jailbreak attempt
  • Have the model predict the best answer to the query
  • Get embedding from predicted answers
  • Compare this embedding with a set of pre-defined answer embeddings
  • If a close match is found within the pre-defined answers, use it
  • If no suitable pre-defined answer is found, request additional information from the user
Safe Chatbot 800x800.png
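
A rough sketch of the matching step described above, assuming NumPy: take the embedding of the model's predicted answer, find the closest pre-defined answer embedding, and only use it when the cosine distance falls under a threshold. The threshold value and the fallback wording are illustrative assumptions.

```python
import numpy as np

APPROVAL_THRESHOLD = 0.15   # maximum allowed cosine distance (illustrative value)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_response(predicted_vector, approved_answers, approved_vectors):
    """Return a pre-defined answer if one is close enough, else ask for more information."""
    distances = [cosine_distance(predicted_vector, v) for v in approved_vectors]
    best = int(np.argmin(distances))
    if distances[best] <= APPROVAL_THRESHOLD:
        return approved_answers[best]    # safe, pre-vetted response
    return "Could you tell me a bit more about what you need?"   # fall back to clarification
```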

When Should AI be Used?

  • When all possible biases have been accounted for
  • When all sensitive data has been removed, tokenized or encrypted
  • When the stochastic nature of responses has been accounted for
    • A wrong answer is no worse than no answer
    • Outputs have been fully constrained
    • A human is in-the-loop to fix the inevitable errors
When to use AI 800x800.jpg

Resources

GPT Under the Covers - Slide Deck - QR Code.png

Discriminative vs Generative

  • Discriminative Models
    • Conditional Probability: Predict label Y based on input X
    • Identify the boundaries between data groups
    • Great for classification and tokenization
    • Examples: BERT, Random Forest
  • Generative Models
    • Joint Probability: Model the probability of label Y occurring together with input X
    • Predict new data instances
    • Great for predicting text, creating audio & video
    • Examples: GPT-4o, PixelRNN/PixelCNN
Conditional vs Joint Probability 800x800.jpg

Softmax Activation Function

Softmax(z)_i = e^(z_i) / Σ_j e^(z_j)

  • Converts a vector into a probability distribution
  • Output probabilities range from 0 to 1
  • Output probabilities sum up to 1
  • Where
    • z is the input vector
    • i is the class being predicted
    • j runs over all classes
Softmax 800x800.jpg
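
A quick NumPy illustration of these properties; the subtraction of the maximum is a standard numerical-stability trick and does not change the mathematical result.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Softmax(z)_i = e^(z_i) / Σ_j e^(z_j), computed stably."""
    e = np.exp(z - z.max())   # subtracting the max avoids overflow
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)          # each value lies between 0 and 1
print(probs.sum())    # the values sum to 1
```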

ReLU Activation Function

f(x) = max(0, x)

  • ReLU (Rectified Linear Unit)
    • Outputs the input directly if it is positive
    • Otherwise, it outputs zero
  • Common activation function used in neural networks
    • Very fast and efficient
    • Helps avoid the vanishing gradient problem
    • Training often converges faster vs sigmoid or tanh
ReLU 800x800.jpg

Vector Databases

  • Store/retrieve high-dimensional vectors
  • Values are retrieved using similarity searches
  • Leverage data-structures such as K-D Trees
  • Examples
    • Azure AI Search
    • Redis
    • Qdrant
    • Pinecone
    • Chroma
VectorDB-650x650.png