Amazing Algorithms
For Solving Problems in Software
Barry S. Stahl
Principal Engineer - AZNerds.net
Favorite Physicists & Mathematicians
Favorite Physicists
Harold "Hal" Stahl
Carl Sagan
Richard Feynman
Marie Curie
Nikola Tesla
Albert Einstein
Neil Degrasse Tyson
Niels Bohr
Galileo Galilei
Michael Faraday
Other notables : Stephen Hawking, Edwin Hubble, Leonard Susskind, Christiaan Huygens
Favorite Mathematicians
Ada Lovelace
Alan Turing
Johannes Kepler
Rene Descartes
Isaac Newton
Emmy Noether
George Boole
Blaise Pascal
Johann Gauss
Grace Hopper
Other notables : Daphne Koller, Grady Booch, Leonardo Fibonacci, Evelyn Berezin, Benoit Mandelbrot
Fediverse Supporter
It is now more important than ever that we create networks that are not controlled by corporate entities or petulant oligarchs
I am proud to be one of the founders of AZGiveCamp.
GiveCamps are weekend events where we put a bunch of software creators in a room with a bunch of great local charities, and through the course of the weekend, we create software to help them further their mission
http://GiveCamp.org
July 25, 2022 was my 100th public talk. Thank you for allowing me to participate in your community over these past few decades!
Achievement Unlocked
Agenda
08:00 - 08:15 : Intro and Modeling
08:15 - 08:45 : Dynamic Programming
08:45 - 09:15 : Genetic Algorithms
09:15 - 09:45 : K-Means Clustering
09:45 - 10:00 : Break
10:00 - 10:15 : Clustering Wrap-up
10:15 - 10:45 : Best-Path Algorithms
10:45 - 11:15 : Cost-Minimization Algorithms
11:15 - 11:45 : Training Neural Networks
11:45 - 12:00 : Q&A and Wrap-up
Bonus (if time allows) : LP & MIP
Simulates A Rational Actor
Attempts to make the best possible decision based on the model and the data
Model - Our description of the problem
Data - Our understanding of the state of the domain
Models are the Blueprint
The best coder in the world can't solve a problem if they don't understand it.
The best algorithm in the world can't solve a problem if it isn't modeled well.
The model is how we describe the problem to the system
The success of a problem-solving algorithm is often determined before the algorithm even runs -- by how the problem is modeled
Modeling
A traveling salesman problem with only 20 cities has 20! (2x1018 ) possible routes. A brute-force solution that checks each route would take over 77 million years to complete, assuming it could evaluate one route per microsecond.
Modeling defines the problem space
What’s included & what’s ignored
Poor models lead to irrelevant solutions
Modeling shapes the solution strategy
Logical models » OOP or Rules Engines
Graph models » traversal algorithms
Constraint models » LP or MIP
Probabilistic models » Neural/Bayesian Networks
Modeling can determine tractability
Some formulations make problems NP-hard
Others reveal polynomial-time solutions
Examples
Conference Scheduling with LP vs MIP
Traveling Salesman with OOP
Workshop Goals
Describe these algorithms
Show models that work well with each type of problem
Describe sample implementations for your use
Provide intuition as to why they work as they do
Knapsack Problems
A family of combinatorial optimization problems
Goal
Select a subset of items from a larger set
Without exceeding resource constraints
Such that total value is maximized
Memoization
Store the results of function calls for reuse when the same inputs occur again
Goal: Compute the nth Fibonacci number
Naïve Recursion
Recomputes each subproblem multiple times
O(2n ) Problem - Exponential time
With Memoization
Store results of F(k)
Reuse stored values
O(n) Problem - Linear Time
We'll start by solving this toy problem using CP, then, once we understand how CP works, we'll explore why it is a good fit for solving knapsack problems.
Canonical Knapsack Problem
Dynamic Programming
A mathematical optimization technique
Best for solving problems that can be described recursively
Breaks the problem into smaller sub-problems
Solves each small problem only once
Guarantees an optimal solution
Examples: Scheduling, Routing
This will become very important later
Optimization Problems
Find the best values for a set of decision variables that
Satisfies all constraints
Maximizes the features we want
Minimizes the features we don’t want
Chutes and Ladders - Greedy Algorithm
Start at the 1st space on the board
While we haven't reached the last space
Add the space to the current path
If the current space has a ladder
Add the end of the ladder to the path
Make that space the current space
Move to the next space on the board
Return the completed path through the board
Dynamic Programming - Step 1
DetermineDistance(s,d)
Call DetermineDistance(space[0], 0)
If shorter than previously found
Set DistanceFromStart property
If this space starts a chute or ladder
DetermineDistance(endOfNavigation, distanceFromStart+1)
If not at the end of board
DetermineDistance(nextSpace, distanceFromStart+1)
Return all spaces with their DistanceFromStart property populated
Dynamic Programming - Step 2
Start at the last space on the board
While we are still on the board
Add the current space to the path
Set the current space to a space that is a neighbor of the current space and whose DistanceFromStart is space.DistanceFromStart-1
Reverse the direction of the path
Summary - Dynamic Programming
Populate the cache
Simple calculations
Build from the ground-up
Use the cached data to determine the answer(s)
Work backwards through the cache
Guaranteed Optimality
A fully populated cache means all options are explored
Works in any number of dimensions
Works with any graph (nodes & edges)
Works best when
Problems can be described recursively
1 or more axis are limited in scope
Use Case: Game Strategy
Find the best strategy in a multi-player board game
Variant of Chutes & Ladders
Player decides whether or not to take a chute or ladder
Extremely large Solution Space
Impossible to simply try all options
Other Model Types
Shortest Path Algorithm
Guarantees shortest path but not necessarily best strategy
Mixed-Integer Programming Model
Requires a mathematical fitness function (can't use a simulation)
Neural Network
Requires a very large training data corpus
Genetic Algorithms
Find Solutions by Simulating Darwinian Evolution
Each candidate solution is defined by its properties (chromosomes)
A fitness function is used to determine which solutions "survive"
A simulation may be used as the fitness function
Surviving solutions may mutate and evolve other solutions
Optimality is Never Guaranteed
Optimal solution may never be found
i.e. If a optimality requires a large combination of elements
Optimum may be found but not recognized
If conditions make that solution non-optimal at that moment
i.e. If a key feature is not used during the simulation
Minimize the liklihood by increasing simulation size
Determine Our Game Strategy
We need a strategy that will tell us where we
should move to next, given:
Our location on the board
The result of our spin
The location of our opponents (opt)
Our player number (opt)
Example: If we are on square 45 and spin a:
1 - 46 (no choice)
2 - 47 (no choice)
3 - 48 or 26
4 - 49 or 27
5 - 50, 11 or 28
6 - 51, 12, 29 or 84
Step 1 - Define DNA
A chromosome represents all of the possible decisions that can be made
Each gene represents a decision
Ideally a gene fully describes a single game state
If no decision is possible for a state, there is no gene for that state
An allele is an option for that decision
If there are 3 options for a given state there will be 3 possible alleles for that gene
Step 2 - Setup Parameters
Simulations per Generation
Controls how often the solutions evolve
More simulations make it more likely the best solution will survive
Generation Count
Controls simulation length
Optionally look for convergence
When does improvement stop?
When do significant changes stop during crossover?
Error (misspelling) rate
Controls how frequently genes change allele's during the evolutionary process
Higher rates cause greater variations from generation to generation
Selection strategy
Defines which solutions survive and mutate
Survive intact - Remain unchanged to next generation
Survive mutated - Modified versions survive to next generations
Survive recombined - Crossover between two solutions survives
Options for Experimentation
Implement in your preferred language
C# implementation is on GitHub
Augment to include what place the player is in
i.e. Might make different decisions if started 1st vs 6th
Augment to include opponent locations
i.e. Might make different decisions if opponents are close to the end
Define the DNA of another game
Recommendation - stick with simpler games to start
Avoid games like chess with massive complexity
I want to give you a non-game example. This is an actual problem I am reasearching.
Image Created by Microsoft Designer 2026-01-08
Problem 2 - Semantic Distance
Embeddings
There are many other embedding models, some are optimized for different purposes. You should consider what model might work best for your use-case. This presentation uses the same model used by GPT-3 and GPT-4.
A point in multi-dimensional space
Mathematical representation of a word or phrase
Encode both semantic and contextual information
Model: text-embedding-ada-002
Vectors normalized to unit length
Use 1536 dimensions
I used a T-SNE (T-distributed Stochastic Neighbor Embedding) transformation to perform this dimensionality reduction
The key to understanding embeddings is to remember that they are a way to represent a word or phrase in a multi-dimensional space. The distance between two embeddings can be used to determine how similar the words or phrases they represent are to each other.
Cosine Similarity & Distance
Since GPT vectors are normalized to unit length, Cosine similarity and dot-product calculations will produce the same result.
Relate vectors based on the angle between them
Cosine Similarity ranges from -1 to 1, where:
+1 indicates that the vectors represent similar semantics & context
0 indicates that the vectors are orthogonal (no similarity)
-1 indicates that the vectors have opposing semantics & context
Cosine Distance is defined as 1 - cosine similarity where:
0 = Synonymous
1 = Orthogonal
2 = Antonymous
Note: For normalized vectors, cosine similarity is the same as the dot-product
I added additional context here: i.e. "Sheep" => "Bighorn Sheep", to make the distances more visible.
Cosine Distance
Embeddings encode many facets of a word or phrase, including language spoken, idioms, sarcasm, and figurative language
Embedding Distance
Feature
Example
Synonym
"Happy" is closer to "Joyful" than to "Sad"
Language
"The Queen" is very close to "La Reina"
Idiom
"He kicked the bucket" is closer to "He died" than to "He kicked the ball"
Sarcasm
"Well, look who's on time" is closer to "Actually Late" than "Actually Early"
Homonym
"Bark" (dog sound) is closer to "Howl" than to "Bark" (tree layer)
Collocation
"Fast food" is closer to "Junk food" than to "Fast car"
Proverb
"The early bird catches the worm" is closer to "Success comes to those who prepare well and put in effort" than to "A bird in the hand is worth two in the bush"
Metaphor
"Time is money" is closer to "Don't waste your time" than to "Time flies"
Simile
"He is as brave as a lion" is closer to "He is very courageous" than to "He is a lion"
Modeling Exercise
Design a system to find words/phrases that are far apart (semantically different) in the embedding space?
Assumptions
No db of embeddings available to search
Limited understanding of the features of the embedding space
i.e. changing language changes nearly every dimension
What type of models might work?
Can we do it logically?
Can we brute-force it?
Are there hybrid approaches?
What are the features of each model?
How do we execute the model(s)?
Possible Genes
Categorical
Part of Speech (noun, verb, etc)
Register (formal, slang, technical, etc)
Morphology (root, compound, phrase)
Animacy (inanimate, animate, agentic, collaborative)
Scientific Discipline (biology, physical, life/health, etc)
Continuous
Polarity (negative -> positive)
Idiomacity (literal -> idiomatic)
Concreteness (abstract -> concrete)
Other Options
If the phrase is "Artificial Intelligence", which has a Scientific Discipline of "Software Engineering", we might want to try to compare it to a word in the "Philosophy" discipline, such as "Mindfully".
Mapping Strategies
We need to find what should change in order to identify a word distant from the current one
Part of Speech: Noun -> Adverb
Discipline: Software Engineering -> Philosophy
Polarity: Neutral -> Highly Positive
Linguistic Agent - System Prompt
You are a simulation of a great linguist. You classify words/phrases within the following categories:
Part of Speech (noun, verb, etc)
Register (formal, slang, technical, etc)
Scientific Discipline (biology, physical, life/health, etc)
Morphology (root, compound, phrase)
Animacy (inanimate, animate, agentic, collaborative)
Polarity (highly negative, negative, neutral, positive, highly positive)
Idiomacity (highly literal, literal, idiomatic, highly idiomatic)
Concreteness (highly abstract, abstract, concrete, highly concrete)
Your job is to identify words/phrases that meet the requested characteristics. You respond only with the requested word or phrase in lower case.
Linguistic Agent - User Prompt
Give me a word or phrase with the following characteristics:
Primary part of speech is verb
Register is technical
Morphology is phrase
Animacy is collaborative
Scientific Discipline is biology
Polarity is highly negative
Idiomacity idiomatic
Concreteness highly concrete
and is not in the following list: "prime genius", "engineer a cooperative symbiosis".
Be sure to only respond with the selected word or phrase, no ceremony.
Commonalities
We discussed 2 problems:
Game strategy - what's the best move given game state?
Semantic Distance - what word should I test next?
What do they have in common? What makes them good candidates for genetic algorithms?
Summary
Genetic Algorithms "Learn" by simulating Darwinian Evolution
The "DNA" represents the decisions that can be made
One chromosome per condition (game state)
One gene per option (choices from a state)
Several parameters control how solutions evolve
Simulations per Generation - how often the solutions change
Fewer simulations make it more likely a less-fit solution will survive over a more-fit solution
Error (misspelling) rate - how often each chromosome changes between generations
Higher rates cause greater variations from generation to generation
Selection strategy - Which solutions survive and mutate
Survive intact - Remain unchanged to next generation
Survive mutated - Modified versions survive to next generations
Survive recombined - Crossover between two solutions survives
Pheromones
Greater Usage => More Pheromones
Shorter Path Length => More Pheromones Remain
More Pheromones => Higher Probability of Usage
Ant Colony Optimization
Neighborhood Search
Start with random paths
Explore neighboring paths based on probability
More pheromone = Higher probability of use
Can get stuck in a local minima
Tweaks
Number of ants
Pheromones to dispense
Pheromone dissipation rate
Features
Less sensitive to scale
Optimality not guaranteed
Types of Bees
Active Workers Forage on their known path and on a neighboring path Scouts Forage on a random path Inactive Workers Wait for information from other bees
Bee Colony Optimization
Multiple Search types
Active Workers perform neighborhood search
Scouts perform random search
Best known path propogates
Tweaks
Number of each type of bee
Probability of persuasion
Visit limit
Features
Less Sensitive to Scale
Optimality not guaranteed
Firefly Optimization
initialize n fireflies to random positions
loop maxEpochs times
for each firefly i
for each firefly j
if intensity(i) < intensity(j)
compute attractiveness
move firefly(i) toward firefly(j)
update firefly(i) intensity
end for
end for
sort fireflies
end loop
return best position found
Firefly Optimization
Neighborhood Search
Search the area toward best known solution
Closer and Brighter: More Attractive
Tweaks
Number of fireflies
Range of possible values
How fast fireflies move
Features
Works better for linear problems
Optimality not guaranteed
Can suffer from sparseness problems
Sparseness - There is a lot of matter in the universe, but it is concentrated in relatively small clusters across very large expanses of space.
Amoeba Optimization
initialize the amoeba with n (size) locations
loop maxEpochs times
calculate new possible solutions
contracted - midway between centroid and worst
reflected - contracted point reflected across centroid
expanded - beyond reflected point by a constant factor
if any solution is better than the current
replace worst value with best value from new solution
else
shrink (multiple contract) all lesser nodes toward the best
increment epoch count
end loop
return best position found
Amoeba Optimization
Neighborhood Search
Start with random locations
Explore neighbors based on amoeba movement
Surround the solution, then contract to it
Tweaks
Size of the amoeba
Number of executions
Features
Optimality no guaranteed
Can suffer from sparseness problems
Gradient Descent
Optimization algorithm used to minimize the cost function of a model
This data has obviously been cleaned. There are no major outliers and data appears fairly consistent.
In this case, it represents the distance travelled on a train between stations at various points in time.
Linear Model
Input = Independent Variable
Output = Dependent Variable
Weight and Bias are Parameters of the Model
MSE = Mean Squared Errror
Cost Function (Loss Function) determines how well the predictions match reality
Component
Details
Input Variable
X
Output Variable
Y
Weight Parameter
M
Bias Parameter
B
Linear Equation
Y=mX+b
ML's Linear Linchpin
The power of AI is based primarily on 8th grade mathematics!
We'll discuss more about linear-separability when we get to model design.
Y = mX + B
Every neuron gets its value from a linear transformation
Multiple inputs result in a sum of the linear transformations
The sum of a linear transformation is linear
Only linearly-separable functions can be modeled without a non-linear activation
Model Parameters
The goal of the training process is to find the values of our parameters that best predict the output
Error Function
It may be difficult to see, but the bias does also contribute to the error, just by far less that the weight.
This makes sense because the weight is multiplied by the input, the bias is just added.
Think about what happens if the slope is perpendicular to the expected line.
The error for every point will be large, except for the area near where the lines cross.
X-Axis: The value of the weight (m)
Y-Axis: The value of the bias (b)
Z-Axis: The size of the error
The weight (m) often has a greater effect on the error than the bias (b)
Training the Model
We could use any optimization algorithm to train the model. In fact, I have a demo where I use Amoeba Optimization to train a model. However, the most common optimization algorithm used in training neural networks is Gradient Descent.
Objective: Minimize the error
Error: The difference between the predicted value and the actual value
MSE: Mean Squared Error - Avg of the squared differences between predicted and actual
Means: Gradient Descent Optimization
Nearly any Optimization algorithm can be used
Gradient Descent is the most common/suited
Linear Regression Demos
Demo 1 - C# code demonstrating a linear regression using train timings
Demo 2 - Spreadsheet modeleling the discrete inputs/outputs of an AND gate
Demo 3 - Spreadsheet modeleling an inverse-proportionality relationship
Mathematics
The use of Linear Algebra and matrix multiplication in machine learning models serves as a mathematical optimization that enhances computational efficiency and scalability. This approach allows for parallel processing of multiple variables, making it faster and more efficient than traditional loop-based methods. The same results can be achieved, however, through traditional methods, as long as you have a few million years to train a large model.
...we've invented a fantastic array of tricks and gimmicks for putting together the numbers, without actually doing it. We don't actually [apply Y = Mx+B for every neuron] We do it by the tricks of mathematics, and that's all. So, we're not going to worry about that. You don't have to know about [Linear Algebra]. All you have to know is what it is, tricky ways of doing something which would be laborious otherwise.
With apologies to Professor Feynman, who was talking about the tricks of Calculus as applied
to Physics, not the tricks of Linear Algebra as applied to Machine Learning.
Of course, there are simpler ways to do this type of linear regression. We are using this example to demonstrate the process of training a model, not to show the simplest way to solve this problem.
Railroad Times Model
Linear Regression Model
Predict the unknown values in a linear equation
Given X (time), predict Y (location)
Find the best values for m and b
For our discussions today, we will be focused only on the Sigmoid activation function as one of the most common and most flexible activations. You should experiment with many to determine which works better for your problem.
ML Demo
Using Amoeba Optimzation to Train an ML Model
More Bio-Inspired
Roach Infestation Optimization
Particle Swarm Optimization
Multi-Swarm (Birds) Optimization
Bacterial Foraging Optimization
Bonus Content
Linear Programming (LP) & Mixed-Integer Programming (MIP)
Use Case: Production Targets
Products
Small Vase
1 oz. clay, 1 oz. glaze
Sells for $3.00 each
Large Vase
4 oz. clay, 2 oz. glaze
Sells for $9.00 each
Inventory
Clay - 24 oz.
Glaze - 16 oz.
16oz glaze (1 oz ea) = 16 small
24oz clay (4 oz ea) = 6 large
71 feasible values
What if X and Y are MUCH LARGER?
Solution Space
Zeroing out parameters gives us intercepts so we can draw the lines
Constraint Equations
Clay Constraint
Glaze Constraint
Linear and Mixed-Integer Programming
Define Constraints that the Solution Must Satisfy
Use these constraints to limit the search space
More constraints = smaller search space
Solutions that exist are found faster
May not find a feasible solution
Add an Objective Function to Improve Solutions
All constraints must be satisfied first
Objectives are a goal relative to an equation
i.e. Maximize revenue where r = 3X + 9Y