GPT Under the Hood


Barry S. Stahl

Solution Architect & Developer

@bsstahl@cognitiveinheritance.com

https://CognitiveInheritance.com

Transparent Half Width Image 720x800.png

Favorite Physicists & Mathematicians

Favorite Physicists

  1. Harold "Hal" Stahl
  2. Carl Sagan
  3. Richard Feynman
  4. Marie Curie
  5. Nikola Tesla
  6. Albert Einstein
  7. Neil deGrasse Tyson
  8. Niels Bohr
  9. Galileo Galilei
  10. Michael Faraday

Other notables: Stephen Hawking, Edwin Hubble

Favorite Mathematicians

  1. Ada Lovelace
  2. Alan Turing
  3. Johannes Kepler
  4. Rene Descartes
  5. Isaac Newton
  6. Emmy Noether
  7. George Boole
  8. Blaise Pascal
  9. Johann Gauss
  10. Grace Hopper

Other notables: Daphne Koller, Grady Booch, Leonardo Fibonacci, Evelyn Berezin, Benoit Mandelbrot

Some OSS Projects I Run

  1. Liquid Victor : Media tracking and aggregation [used to assemble this presentation]
  2. Prehensile Pony-Tail : A static site generator built in C#
  3. TestHelperExtensions : A set of extension methods helpful when building unit tests
  4. Conference Scheduler : A conference schedule optimizer
  5. IntentBot : A microservices framework for creating conversational bots on top of Bot Framework
  6. LiquidNun : Library of abstractions and implementations for loosely-coupled applications
  7. Toastmasters Agenda : A C# library and website for generating agendas for Toastmasters meetings
  8. ProtoBuf Data Mapper : A C# library for mapping and transforming ProtoBuf messages

Fediverse Supporter

Logos.png

http://GiveCamp.org

GiveCamp.png

Achievement Unlocked

bss-100-achievement-unlocked-1024x250.png
 
 
 

Resume Scanning

  • Armando's Resume: 12.7% "worse"
    • More distant from a match to the job listing
    • A 95 for Jonathon's resume ≈ an 84 for Armando's
  • If Armando had 2 additional years of experience
    • His score only increases by ≈ 1 point to 85
TwoResumes_800x800.jpg

Neural Networks

  • Machine Learning
    • Make predictions based on input
    • Model is tuned via "Training"
  • Training
    • Minimize the error in the model
Training a Neural Network 800x800.jpg
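
To make "training" concrete, here is a minimal sketch (not from the presentation) that tunes a single weight by gradient descent: each pass makes predictions, measures the error, and nudges the weight to reduce it. The data, learning rate, and iteration count are made up for illustration.

```python
import numpy as np

# Toy data: the "right" relationship is y = 3x, but the model must learn that.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 6.0, 9.0, 12.0])

w = 0.0                # single trainable weight, starts out wrong
learning_rate = 0.01   # illustrative value

for epoch in range(200):
    y_pred = w * x                     # make predictions based on input
    error = y_pred - y
    loss = np.mean(error ** 2)         # mean squared error to minimize
    grad = np.mean(2 * error * x)      # gradient of the loss with respect to w
    w -= learning_rate * grad          # nudge w in the direction that reduces the error

print(f"learned w = {w:.3f}, loss = {loss:.6f}")   # w approaches 3.0
```

Real neural networks repeat this same loop over millions of weights at once, but the idea is unchanged: adjust the parameters to minimize the error in the model.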

Transformer Architectures

A neural network architecture that has enabled the recent advancements in NLP

  • Based on the 2017 Google paper Attention Is All You Need by Vaswani et al.
  • Key Breakthrough: The Self-Attention Mechanism
    • Focus on most relevant words
    • Add critical context even at a distance
  • Enables Parallel Processing
    • Process all words in a sentence simultaneously
    • Unlike traditional RNNs that process data sequentially
Self Attention 800x800.jpg

Transformer (Simplified)

Transformer Architectures - 938x800.png

Agenda

Goal: Build intuition for when and why these models may be effectively applied by understanding how they work

  • Generative Transformer Models
    • Tokenization
    • Embedding
    • Attention
    • Transformer Blocks
  • Use-Case Analysis
    • What they don't work for
    • What these tools do well
    • How we can reduce risk
    • Ethical Considerations
Generative Transformer Models 800x800.jpg

Tokenization

Transformer Architectures - Tokenization - 938x800.png

Tokenization

Convert UTF-8 text containing words, word parts, or characters into an equivalent numeric representation

Unicode        Token   Comments
hello          15339   Common words
Don            8161    Common names
[space]Qu      3489    Common sequences
               26130   Foreign characters
█████          93429   Redaction characters
/>[newline]    10381   Symbols
Cosmic Library 800x800.jpg

GPT Tokenization

GPT-3.5 and GPT-4 use the cl100k_base tokenization model

  • Defines 100,256 tokens
    • Word
    • Word part
    • Character combination
    • Character
    • Character part
  • Efficiently represent language numerically
    • Minimize token usage
    • More compact than Unicode
    • More flexible than ASCII
    • Large vocabulary
      • All Unicode characters available
    • The most-common combinations
Cosmic Library Book 800x800.jpg
 

Exploring Tokenization

Reference Implementation - AI Demos on GitHub

  • Examples from the code
    • "Hello, World! How are you today? 🌍"
      • 9906, 11, 4435, 0, 2650, 527, 499, 3432, 30, 11410, 234, 235
    • "こんにちは、世界!お元気ですか?"
      • 90115, 5486, 3574, 244, 98220, 6447, 33334, 24186, 95221, 38641, 32149, 11571
    • "Hola, mundo! ¿Cómo estás hoy? 🇪🇸"
      • 69112, 11, 29452, 0, 29386, 96997, 1826, 7206, 49841, 30, 11410, 229, 103, 9468, 229, 116
AI Demos - Tokenization.png
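
The reference implementation is on GitHub and is not reproduced here, but token IDs like the ones above can be generated with OpenAI's tiktoken library and the cl100k_base encoding. A minimal sketch, assuming tiktoken is installed:

```python
import tiktoken

# Load the cl100k_base encoding used in the examples above.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, World! How are you today? 🌍",
             "こんにちは、世界!お元気ですか?",
             "Hola, mundo! ¿Cómo estás hoy? 🇪🇸"]:
    token_ids = enc.encode(text)        # text -> list of integer token IDs
    round_trip = enc.decode(token_ids)  # token IDs -> original text (lossless)
    print(f"{len(token_ids):3d} tokens: {token_ids}")
    assert round_trip == text
```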

Embedding

Transformer Architectures - Embedding - 938x800.png

Embeddings

  • A point in multi-dimensional space
  • Mathematical representation of a word or phrase
  • Encode both semantic and contextual information

  • Model: text-embedding-ada-002
  • Vectors normalized to unit length
  • Uses 1536 dimensions
Embeddings - Cosmic Desert 800x800.jpg
  Ram - Just Statements.png
  Ram - With Clusters.png
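
As a sketch of how such vectors are obtained in practice, the call below requests an ada-002 embedding via the openai Python package. This assumes the package is installed and an OPENAI_API_KEY environment variable is set; it is not code from the presentation.

```python
from openai import OpenAI   # assumes the `openai` package and an OPENAI_API_KEY env var

client = OpenAI()

# Request an embedding for a short piece of text.
response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="I'm getting PC ram",
)

vector = response.data[0].embedding          # a point in 1536-dimensional space
print(len(vector))                           # 1536 dimensions for ada-002
print(sum(v * v for v in vector) ** 0.5)     # ~1.0: vectors are normalized to unit length
```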

Cosine Similarity & Distance

Relate vectors based on the angle between them

  • Cosine Similarity ranges from -1 to 1, where:

    • +1 indicates that the vectors represent similar semantics & context
    • 0 indicates that the vectors are orthogonal (no similarity)
    • -1 indicates that the vectors have opposing semantics & context
  • Cosine Distance is defined as 1 - cosine similarity where:

    • 0 = Synonymous
    • 1 = Orthogonal
    • 2 = Antonymous

Note: For normalized vectors, cosine similarity is the same as the dot-product

Cosine Unit Circle - Enhanced.jpg
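
These definitions translate directly into a few lines of NumPy. The vectors below are tiny 2-D examples (not real embeddings) chosen to hit the 0 / 1 / 2 distance cases:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: ranges from -1 to +1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity: 0 = synonymous, 1 = orthogonal, 2 = antonymous."""
    return 1.0 - cosine_similarity(a, b)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])    # orthogonal to a
c = np.array([-2.0, 0.0])   # points in the opposite direction from a

print(cosine_distance(a, a))   # 0.0 -> same direction
print(cosine_distance(a, b))   # 1.0 -> orthogonal
print(cosine_distance(a, c))   # 2.0 -> opposing

# For unit-length vectors the normalization is a no-op,
# so cosine similarity reduces to the dot product.
```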

Cosine Distance

Cosine Distance 989x600.png

Cosine Distance

Angles2.svg

Embedding Distance

Feature Example
Synonym "Happy" is closer to "Joyful" than to "Sad"
Language "The Queen" is very close to "La Reina"
Idiom "He kicked the bucket" is closer to "He died" than to "He kicked the ball"
Sarcasm "Well, look who's on time" is closer to "Actually Late" than "Actually Early"
Homonym "Bark" (dog sound) is closer to "Howl" than to "Bark" (tree layer)
Collocation "Fast food" is closer to "Junk food" than to "Fast car"
Proverb "The early bird catches the worm" is closer to "Success comes to those who prepare well and put in effort" than to "A bird in the hand is worth two in the bush"
Metaphor "Time is money" is closer to "Don't waste your time" than to "Time flies"
Simile "He is as brave as a lion" is closer to "He is very courageous" than to "He is a lion"

Usage of Embeddings

Embeddings can be used directly, or as an input to other models

  • Direct Usage
    • Measuring Semantic Distance
      • Quantify similarity between pieces of text
      • Useful for tasks like semantic search
    • Clustering for Pattern Discovery
      • Discover groupings in the data
      • Useful when categorizing user comments or other textual characteristics
  • Indirect Usage
    • Feature Vectors for Machine Learning
      • Inputs to traditional ML models
    • Input to Transformer Attention Mechanisms
      • Dynamically adjusted by attention blocks
      • Powers text generation tasks
Cosmic Desert under the Milky Way 800x800.jpg
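
A minimal sketch of the direct-usage case, semantic search: rank documents by the cosine similarity of their embeddings to a query embedding. The embed() call in the usage comment is a hypothetical placeholder for an embedding model such as text-embedding-ada-002.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_search(query_vector, documents, document_vectors, top_k=3):
    """Rank documents by how semantically close their embeddings are to the query."""
    scores = [cosine_similarity(query_vector, v) for v in document_vectors]
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]

# Usage (embed() is a placeholder for a real embedding call):
# results = semantic_search(embed("refund policy"), docs, [embed(d) for d in docs])
```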

Attention Blocks

Transformer Architectures - Attention - 938x800.png

"I'm getting PC ram"

Attention mechanisms consider relationships between words, helping the model emphasize relevant context, even if far away.

  • Pre-Attention
    • Does "Ram" represent a sheep, truck or memory?
    • The model has a general understanding of the word without context
  • Post-Attention
    • The "Ram" vector shifts towards "Computer Memory"
    • Context has been added
Ram Graph 2 800x800.png

Positional Encoding

Sine Waves 1280x720.png
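
The sine-wave image illustrates the kind of position-dependent pattern added to each token's embedding. Below is a sketch of the sinusoidal scheme from Attention Is All You Need; it is one common approach, not necessarily the exact encoding used by any particular GPT model.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    positions = np.arange(seq_len)[:, np.newaxis]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]             # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)     # one frequency per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions get sine waves
    pe[:, 1::2] = np.cos(angles)    # odd dimensions get cosines
    return pe

# Each row is added to the corresponding token's embedding,
# so the model can tell the first word from the tenth word.
print(sinusoidal_positional_encoding(seq_len=8, d_model=16).shape)   # (8, 16)
```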

Attention Heads

The three matrices learned during training of the attention mechanism

  • Query (Q)
    • Can be thought of as a question about the current token
  • Key (K)
    • How well the token "attends to" the Query
  • Value (V)
    • How much the embedding should be moved towards each other token

When these three matrices are applied to the positionally encoded input, the model adds the appropriate context to each token.

Query-Key-Value 800x800.jpg
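
The Q, K, and V projections can be sketched as scaled dot-product attention for a single head. This is a minimal NumPy illustration with random weights, not the presentation's code; real models use many heads, masking, and parameters learned in training.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax: turn attention scores into weights that sum to 1."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv) -> np.ndarray:
    """Scaled dot-product attention for a single head.

    X          : (tokens, d_model) positionally encoded input embeddings
    Wq, Wk, Wv : the three learned matrices (Query, Key, Value)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token attends to every other token
    weights = softmax(scores)                  # attention weights per token
    return weights @ V                         # context-adjusted representation of each token

# Tiny random example: 5 tokens, 8-dimensional embeddings and head.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)     # (5, 8)
```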

Transformer Blocks

Transformer Architectures - Features - 938x800.png

Feature Layers

The Feature layers allow the Transformer to understand and generate complex language patterns through deep learning from the structure of the input data

  • Each layer acts as a mini neural network
  • In the Encoder:
    • Complex Feature Extraction from the input data
  • In the Decoder
    • Refines the model's predictions by processing the encoder's output alongside previous decoder outputs

By stacking multiple Feature layers between Attention layers, GPT models achieve deeper learning capabilities, allowing for more nuanced understanding and generation of text

Perceptron 800x800.jpg
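
A minimal sketch of one such "mini neural network": the position-wise feed-forward block found in transformer layers, written in NumPy with illustrative sizes. Layer normalization and dropout are omitted for brevity, and the weights here are random rather than learned.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, x)

def feed_forward_block(x, W1, b1, W2, b2) -> np.ndarray:
    """Expand each token's vector, apply ReLU, project back, then add the residual."""
    hidden = relu(x @ W1 + b1)   # expand to a wider hidden dimension
    out = hidden @ W2 + b2       # project back to the model dimension
    return x + out               # residual connection around the block

# Tiny example: 5 tokens, model dimension 8, hidden dimension 32.
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))
W1, b1 = rng.normal(size=(8, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 8)), np.zeros(8)
print(feed_forward_block(x, W1, b1, W2, b2).shape)   # (5, 8)
```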

Transformer (Simplified)

Transformer Architectures - 938x800.png
 

Embeddings Search

Transformer Architectures - Embedding Only - 938x800.png

Cosine Distances

Country                                                   D(USA, Country)
United Kingdom                                            0.11981422
United Kingdom of Great Britain and Northern Ireland      0.16426033
England                                                   0.16678649
Argentine Republic                                        0.18766826
República Argentina                                       0.20281911
Country Distances.png

Embeddings Differences

  • Language
    • English (different dialects) vs Spanish
  • Historical Ties
    • USA and UK - Political (colonial), economic and cultural ties
  • Cultural Similarities
    • USA and UK - shared media, literature, entertainment, and educational exchanges including TV, movies, music, and news
  • Political Systems
    • USA and UK - Democracies w/ common roots
  • Other differences
    • Economic Ties
    • Geopolitical Alliances
    • Educational Systems
    • Legal Systems
Embedding Differences - USA UK Argentina - 800x800.jpg
 

Resume Summarization

  • System: As a resume scanner specializing in software engineers, focus solely on summarizing their key skills and experience durations, omitting demographics like age and location.

  • User: [Submits appropriate resume]

  • Assistant:

- Expertise in Java, J2EE, Spring Framework, Hibernate
- Experience with front-end technologies such as HTML, CSS, JavaScript, and AngularJS
- Proficient in SQL databases and ORM technologies
- Familiar with containerization and orchestration tools like Docker and Kubernetes
- Strong understanding of software design patterns and best practices
- Excellent problem-solving abilities and attention to detail
- Effective team player with strong communication skills

**Experience:**
- Senior Java Developer at XYZ Tech Solutions (June 2016 – Present, 5 years)
- Java Software Engineer at Innovative Software Co. (May 2012 – May 2016, 4 years)

**Certifications:**
- Oracle Certified Professional, Java SE 8 Programmer
- Certified ScrumMaster (CSM)

**Education:**
- Bachelor of Science in Computer Science, University of Technical Excellence (Graduated: May 2012)
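
The System/User/Assistant exchange above corresponds to a chat-style API call. A minimal sketch using the openai Python package follows; the model name and the resume placeholder are assumptions for illustration, not details from the presentation.

```python
from openai import OpenAI   # assumes the `openai` package and an OPENAI_API_KEY env var

client = OpenAI()

resume_text = "..."   # the submitted resume goes here

response = client.chat.completions.create(
    model="gpt-4o",   # any chat-capable model; illustrative choice
    messages=[
        {"role": "system", "content": (
            "As a resume scanner specializing in software engineers, focus solely on "
            "summarizing their key skills and experience durations, omitting "
            "demographics like age and location.")},
        {"role": "user", "content": resume_text},
    ],
)

print(response.choices[0].message.content)   # the Assistant summary shown above
```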

Resume Clustering

  • Clustering
    • Unsupervised machine learning technique
    • Clusters form around centroids (the geometric center)
    • Data points are grouped (clustered) based on similarity
    • Minimize the error (distance from centroid)
  • Resume Clustering
    • Identify a group of relevant resumes
    • Not for ranking, just filtering
    • Cluster the resumes, the GPT summaries, or both
    • Still not perfect
k-means results.png
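
One way to implement the clustering step is k-means over the resume embeddings, for example with scikit-learn. The sketch below uses random vectors as stand-ins for real embeddings; the cluster count and the "reference resume" choice are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# resume_vectors: one embedding per resume (e.g., 1536-dim ada-002 vectors).
# Random data stands in for real embeddings in this sketch.
rng = np.random.default_rng(42)
resume_vectors = rng.normal(size=(200, 1536))

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(resume_vectors)   # cluster assignment for each resume

# Filter, don't rank: keep the resumes that landed in the same cluster
# as a reference resume (here, arbitrarily, the first one).
target_cluster = labels[0]
relevant = np.where(labels == target_cluster)[0]
print(f"{len(relevant)} resumes share cluster {target_cluster}")
```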

Limitations of Attention

  • Long Sequence Processing
    • Attention uses O(n²) memory
    • Some models may fragment or chunk the input
    • Be careful with attention over long input
  • Local Context Priority
    • Transformers excel at global dependencies
    • Less efficient focusing on local context
  • Irrelevant Data
    • Attention may overvalue irrelevant inputs
    • Especially in noisy, unstructured, or sparse data
  • Resource Constraints
    • Resource-constrained devices may not be able to handle Attention
  • Low-Resource Domains
    • Transformers require a large training corpus
    • Limited data yields poor attention performance
Limits of Attention 800x800.jpg

Ethical Concerns

  • Model Creation
    • Copyright violations
    • Energy Consumption
  • Model Usage
    • Bias and Fairness
    • Privacy Concerns
    • Integrity
    • Accountability and Transparency
Ethical Concerns 800x800.jpg

Model Creation Concerns

  • Copyright Infringement: Models were probably trained on copyrighted materials
    • Consumer Risk: Models removed or nerfed
    • Artist Risk: Theft of copyrighted materials undermines the rights of creators
  • Carbon Footprint: Training requires huge resources
    • Energy Usage: Massive compute
    • Water Usage: Cooling data centers
    • Sustainability: Rapid advancement
Model Creation Concerns 800x800.jpg

The model is biased

  • Not handling the bias makes bias a feature of your app
    • Understand what fairness means in your context
    • Understand the impact of being wrong
  • Take fairness very seriously
    • Prevent all predictable biases
    • Monitor for unpredictable biases
bias logo - large.jpg

Model Answers May Be

  • Incomplete
  • Poorly phrased
  • Outright wrong
No Takesies backsies.png

Model Privacy Concerns

  • Massive amounts of private data were likely gathered
    • What are the odds that all of it was caught?
    • Even anonymized data may be de-anonymized
  • Take great care in what you send to the Model
    • See Microsoft AI Recall for a recent example
Microsoft Copilot 800x800.png

Accountability & Transparency

Should this really be done using a stochastic model?

  • Why did the model produce the output it did?
  • Who is responsible for the answer?
  • What is the impact of the answer?
    • What if the answer was wrong?
    • What if the answer was evil?
    • What if humans suffer as a result?
A Computer Can Never Be Held Accountable 800x800.png

Safe Chatbot Interactions

  • Use the model to verify it isn't a jailbreak attempt
  • Have the model predict the best answer to the query
  • Get embedding from predicted answers
  • Compare this embedding with a set of pre-defined answer embeddings
  • If a close match is found within the pre-defined answers, use it
  • If no suitable pre-defined answer is found, request additional information from the user
Safe Chatbot 800x800.png
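
A rough sketch of the matching step described above, assuming NumPy: take the embedding of the model's predicted answer, find the closest pre-defined answer embedding, and only use it when the cosine distance falls under a threshold. The threshold value and the fallback wording are illustrative assumptions.

```python
import numpy as np

APPROVAL_THRESHOLD = 0.15   # maximum allowed cosine distance (illustrative value)

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_response(predicted_vector, approved_answers, approved_vectors):
    """Return a pre-defined answer if one is close enough, else ask for more information."""
    distances = [cosine_distance(predicted_vector, v) for v in approved_vectors]
    best = int(np.argmin(distances))
    if distances[best] <= APPROVAL_THRESHOLD:
        return approved_answers[best]    # safe, pre-vetted response
    return "Could you tell me a bit more about what you need?"   # fall back to clarification
```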

When Should AI be Used?

  • When all possible biases have been accounted for
  • When all sensitive data has been removed, tokenized or encrypted
  • When the stochastic nature of responses has been accounted for
    • A wrong answer is no worse than no answer
    • Outputs have been fully constrained
    • A human is in-the-loop to fix the inevitable errors
When to use AI 800x800.jpg

Resources

GPT Under the Covers - Slide Deck - QR Code.png

Discriminative vs Generative

  • Discriminative Models
    • Conditional Probability: Predict label Y based on input X
    • Identify the boundaries between data groups
    • Great for classification and tokenization
    • Examples: BERT, Random Forest
  • Generative Models
    • Joint Probability: Model the probability of label Y occurring together with input X
    • Predict new data instances
    • Great for predicting text, creating audio & video
    • Examples: GPT-4o, PixelRNN/PixelCNN
Conditional vs Joint Probability 800x800.jpg

Softmax Activation Function

Softmax(z)_i = e^(z_i) / Σ_j e^(z_j)

  • Converts a vector into a probability distribution
  • Output probabilities range from 0 to 1
  • Output probabilities sum up to 1
  • Where
    • z is the input vector
    • i is the class being predicted
    • j runs over all classes
Softmax 800x800.jpg
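
A quick NumPy illustration of these properties; the subtraction of the maximum is a standard numerical-stability trick and does not change the mathematical result.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Softmax(z)_i = e^(z_i) / Σ_j e^(z_j), computed stably."""
    e = np.exp(z - z.max())   # subtracting the max avoids overflow
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)          # each value lies between 0 and 1
print(probs.sum())    # the values sum to 1
```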

ReLU Activation Function

f(x) = max(0, x)

  • ReLU (Rectified Linear Unit)
    • Outputs the input directly if it is positive
    • Otherwise, it outputs zero
  • Common activation function used in neural networks
    • Very fast and efficient
    • Helps avoid the vanishing gradient problem
    • Training often converges faster vs sigmoid or tanh
ReLU 800x800.jpg

Vector Databases

  • Store/retrieve high-dimensional vectors
  • Values are retrieved using similarity searches
  • Leverage data-structures such as K-D Trees
  • Examples
    • Azure AI Search
    • Redis
    • Qdrant
    • Pinecone
    • Chroma
VectorDB-650x650.png