Crafting AI

A Developer's Guide to Machine Learning


Barry S. Stahl

Solution Architect & Developer

@bsstahl@cognitiveinheritance.com

https://CognitiveInheritance.com

Transparent Half Width Image 720x800.png

Favorite Physicists & Mathematicians

Favorite Physicists

  1. Harold "Hal" Stahl
  2. Carl Sagan
  3. Richard Feynman
  4. Marie Curie
  5. Nikola Tesla
  6. Albert Einstein
  7. Neil deGrasse Tyson
  8. Niels Bohr
  9. Galileo Galilei
  10. Michael Faraday

Other notables: Stephen Hawking, Edwin Hubble, Leonard Susskind, Christiaan Huygens

Favorite Mathematicians

  1. Ada Lovelace
  2. Alan Turing
  3. Johannes Kepler
  4. Rene Descartes
  5. Isaac Newton
  6. Emmy Noether
  7. George Boole
  8. Blaise Pascal
  9. Johann Gauss
  10. Grace Hopper

Other notables: Daphne Koller, Grady Booch, Leonardo Fibonacci, Evelyn Berezin, Benoit Mandelbrot

Some OSS Projects I Run

  1. Liquid Victor : Media tracking and aggregation [used to assemble this presentation]
  2. Prehensile Pony-Tail : A static site generator built in C#
  3. TestHelperExtensions : A set of extension methods helpful when building unit tests
  4. Conference Scheduler : A conference schedule optimizer
  5. IntentBot : A microservices framework for creating conversational bots on top of Bot Framework
  6. LiquidNun : Library of abstractions and implementations for loosely-coupled applications
  7. Toastmasters Agenda : A C# library and website for generating agendas for Toastmasters meetings
  8. ProtoBuf Data Mapper : A C# library for mapping and transforming ProtoBuf messages

Fediverse Supporter

Logos.png

http://GiveCamp.org

GiveCamp.png

Achievement Unlocked

bss-100-achievement-unlocked-1024x250.png

Minimizing Error

training_animation.gif

Simple Linear Model

AI Models - Single Input Linear - 1280x720.png

2-Input Linear Model

AI Models - 2 Input Linear - 1280x720.png

Multiple-Input Linear Model

AI Models - Multiple Input Linear - 1280x720.png

Multi-Layer Perceptron

AI Models - MultiLayer Perceptron - 1280x720.png

Crafting AI Models

  • Designing the Model Structure
    • Define Inputs and Outputs
    • Determine Layers and Neurons
    • Select Activation Function
  • Training the Model
    • Cost Function (Mean Squared Error)
    • Gradient Descent Optimization
    • Adjust Hyperparameters (e.g. learning rate)
  • These are the critical concepts for all ML tooling
    • TensorFlow, Keras, ML.NET, etc.
    • Master these concepts to harness any ML tool suite effectively

Linear Model

  • Input Variable: X
  • Output Variable: Y
  • Weight Parameter: m
  • Bias Parameter: b
  • Linear Equation: Y = mX + b
AI Models - Single Input Linear - 1280x720.png

ML's Linear Linchpin

Y = mX + b

  • Every neuron gets its value from a linear transformation

  • Multiple inputs result in a sum of the linear transformations

    • The sum of linear transformations is linear
  • Only linearly-separable functions can be modeled without a non-linear activation

AI Models - Annotated Single Input Linear - 640x360.png
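
As a minimal C# sketch of this idea (the method name is illustrative, not from the demo code), every neuron's raw value is just a weighted sum of its inputs plus a bias:

  // Raw (pre-activation) value of a single neuron: M1*X1 + M2*X2 + ... + Mn*Xn + B
  double NeuronValue(double[] inputs, double[] weights, double bias)
  {
      double sum = bias;
      for (int i = 0; i < inputs.Length; i++)
          sum += weights[i] * inputs[i];
      return sum;   // still a linear function of the inputs
  }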

Mathematics

...we've invented a fantastic array of tricks and gimmicks for putting together the numbers, without actually doing it. We don't actually [apply Y = mX + b for every neuron]. We do it by the tricks of mathematics, and that's all. So, we're not going to worry about that. You don't have to know about [Linear Algebra]. All you have to know is what it is, tricky ways of doing something which would be laborious otherwise.

  • With apologies to Professor Feynman, who was talking about the tricks of Calculus as applied to Physics, not the tricks of Linear Algebra as applied to Machine Learning.
Feynman - QED - 600x400.png

Model Parameters

  • Parameters: Internal values learned during training

    • Define the relationship between input and output

    • Adjusted to minimize prediction error

  • In Linear Regression: Y = mX + b

    • m: Slope (Weight) - The influence of X on Y

    • b: Intercept (Bias) - Shifts the output vertically

  • In more complex models:

    • Parameter counts include the weights and biases across all layers

    • Can capture non-linear relationships

    • Parameter counts are a proxy for complexity

    • GPT-4 reportedly uses roughly 1.8 trillion parameters

Model Parameters.png
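
As a rough illustration of counting parameters (the helper below is hypothetical, not the demo code), a fully-connected layer contributes one weight per input/output pair plus one bias per output neuron:

  // Parameters in one fully-connected layer: weights plus biases
  int LayerParameterCount(int inputs, int outputs) => (inputs * outputs) + outputs;

  // Example: a 16-3-1 network like the multi-layer voting model later in this deck
  int totalParameters = LayerParameterCount(16, 3) + LayerParameterCount(3, 1);   // 51 + 4 = 55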

Error Function

  • X-Axis: The value of the weight (m)
  • Y-Axis: The value of the bias (b)
  • Z-Axis: The size of the error

The weight (m) often has a greater effect on the error than the bias (b)

Linear Regression - Error Function Plot - Smaller.png

Training the Model

  • Objective: Minimize the error
    • Error: The difference between the predicted value and the actual value
    • MSE: Mean Squared Error - Average of the squared differences between predicted and actual values
  • Means: Gradient Descent Optimization
    • Nearly any Optimization algorithm can be used
    • Gradient Descent is the most common and best-suited
training_animation.gif
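
A small C# sketch of the cost calculation (the method name is illustrative): MSE averages the squared differences between predicted and actual values.

  // Mean Squared Error over a set of predictions
  double MeanSquaredError(double[] predicted, double[] actual)
  {
      double sum = 0.0;
      for (int i = 0; i < predicted.Length; i++)
      {
          double error = predicted[i] - actual[i];
          sum += error * error;
      }
      return sum / predicted.Length;
  }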

Gradient Descent

Optimization algorithm used to minimize the cost function of a model

  • Shifts the parameters in the opposite direction of the cost gradient

    • The 1st derivative of the cost function

    • Represents the slope of that function

  • Hyperparameters include:

    • Number of iterations

    • Learning rate

    • Convergence criteria

  • Stochastic Gradient Descent

    • Updates parameters using a subset of data each cycle
AI Models - Gradient Descent - 400x800.png
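
A hedged sketch of one gradient-descent training loop for the single-input model Y = mX + b (the method name, learning rate, and iteration count are made-up example values, not the demo code):

  // Adjust m and b in the opposite direction of the MSE gradient
  void TrainLinearModel(double[] x, double[] y, ref double m, ref double b,
                        double learningRate = 0.01, int iterations = 1000)
  {
      int n = x.Length;
      for (int iter = 0; iter < iterations; iter++)
      {
          double gradM = 0.0, gradB = 0.0;
          for (int i = 0; i < n; i++)
          {
              double error = (m * x[i] + b) - y[i];   // predicted - actual
              gradM += 2.0 * error * x[i] / n;        // dMSE/dm
              gradB += 2.0 * error / n;               // dMSE/db
          }
          m -= learningRate * gradM;                  // step against the gradient
          b -= learningRate * gradB;
      }
  }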

Linear Regression Demos

Linear Model Demos 800x800.jpg

Railroad Times Model

AI Models - Single Input Linear - 1280x720.png

Train & Test Cycle

Train and Test Cycle.png

Voting Dataset

Congressional Voting Dataset 1280x720.png

Voting Data

Excel - Voting Dataset 1280x720.png

Voting Model

AI Models - SingleLayer Voter Model - 1280x720.png

Linear Separability

LinearSeparability.png

Linear Separability

LinearSeparability-WithLine.png

Not Linearly Separable

NotLinearlySeparable.png

Activation

Functions applied to each neuron's output to introduce non-linearity

  • The sum of linear functions is linear

    • i.e. multiple variables in a layer
    • Y = M1*X1 + M2*X2 + ... + Mn*Xn + B
    • Hyperplane in n-dimensional space
  • Composition of linear functions is linear

    • i.e. multiple layers
    • Y = M1*(M2*X1 + ... + Mn*Xn + B2) + B1
    • Hyperplane in n-dimensional space
  • Composition of linear and non-linear is non-linear

    • i.e. Activation functions
    • Y = f(M1*X1 + ... + Mn*Xn + B), with a non-linear f (e.g. sigmoid)
Activation.png
  Sigmoid Function.png
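
A brief C# sketch of how the sigmoid shown above wraps the neuron's linear value (names are illustrative, not the demo code):

  // Sigmoid squashes any real value into (0, 1), introducing non-linearity
  double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

  // Neuron output = Sigmoid(M1*X1 + M2*X2 + ... + Mn*Xn + B)
  double NeuronOutput(double[] inputs, double[] weights, double bias)
  {
      double sum = bias;
      for (int i = 0; i < inputs.Length; i++)
          sum += weights[i] * inputs[i];
      return Sigmoid(sum);   // composition of linear and non-linear is non-linear
  }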

Feature Analysis

  • Weights represent the influence of an input feature

  • The 13th weight is the smallest - least influential

    • Least partisan: Superfund Right to Sue
  • The 4th weight is the largest - most influential

    • Most partisan: Physician Fee Freeze
Trained Model Parameters - Voting Model - 800x800 - With Highlights.png

Feature Engineering

The process of identifying, selecting, and combining key attributes of a problem to help a model make better predictions

  • Choice and quality of features are critical
    • Directly impact the model's ability to learn and make accurate predictions
  • Hierarchical layers enable complex feature extraction, capturing non-linear relationships
  • Techniques like Dimensionality Reduction help reduce the number of features while preserving key information
    • Example: Combine length and width to create a single feature representing area
  • Requires experimentation and iteration to find the best feature set for a given problem
Necker_cube_with_background.png
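
A toy C# illustration of the length/width example above (the record type and names are hypothetical):

  // Two raw, correlated measurements on each sample
  record Parcel(double Length, double Width);

  // Engineered feature: one value (area) replaces the two raw inputs
  double AreaFeature(Parcel p) => p.Length * p.Width;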

Features of the Voting Model

  • Current Model

    • Inputs: 16 Features

    • Output: 1 Neuron

    • Architecture: Linear Perceptron

  • Hypothesis: Existence of factions within parties

    • A single-layer model may not capture this non-linear relationship
AI Models - SingleLayer Voter Model - 1280x720.png

Enhancing the Voting Model

  • Solution: Hidden Layer

  • Expected Outcome

    • Better insight into party dynamics

    • Improved accuracy

  • New Model

    • Inputs: 16 Features

    • Hidden Layer: 3 Neurons

    • Output: 1 Neuron

    • Architecture: Multi-layer Perceptron

AI Models - MultiLayer Voter Model - 1280x720.png
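
A hedged sketch of the new model's feed-forward pass, assuming 16 vote inputs, 3 sigmoid hidden neurons, and 1 sigmoid output neuron (array shapes and names are illustrative, not the demo code):

  // Forward pass for a 16-3-1 multi-layer perceptron
  // hiddenWeights: [3,16], hiddenBiases: [3], outputWeights: [3], outputBias: scalar
  double Predict(double[] votes, double[,] hiddenWeights, double[] hiddenBiases,
                 double[] outputWeights, double outputBias)
  {
      double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

      var hidden = new double[hiddenBiases.Length];
      for (int h = 0; h < hidden.Length; h++)
      {
          double sum = hiddenBiases[h];
          for (int i = 0; i < votes.Length; i++)
              sum += hiddenWeights[h, i] * votes[i];
          hidden[h] = Sigmoid(sum);             // hidden-layer activations
      }

      double output = outputBias;
      for (int h = 0; h < hidden.Length; h++)
          output += outputWeights[h] * hidden[h];
      return Sigmoid(output);                   // e.g. likelihood of one party affiliation
  }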

BackPropagation

A mechanism for training multi-layer perceptrons by determining how much each hidden-layer neuron contributes to the output error (see the sketch after this list)

  • Create a feed-forward prediction

  • Calculate error at the output layer

  • Compute error gradient using the chain rule

  • Update Parameters using gradients

    • Start from output layer, move backwards through network.

    • Adjustments to weights are weighted by the input values

    • Adjustments to biases are not weighted

  • No need to recalculate gradient after each update during an iteration
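
A minimal sketch of that backward step for the one-hidden-layer sigmoid network above, using squared error (names, shapes, and the choice of a sigmoid output are assumptions, not the demo code):

  // Backward pass for one training example of the 16-3-1 network
  // prediction and hidden come from the forward pass; target is the actual label
  void BackPropagate(double[] votes, double[] hidden, double prediction, double target,
                     double[,] hiddenWeights, double[] hiddenBiases,
                     double[] outputWeights, ref double outputBias, double learningRate)
  {
      // Output layer: error times the derivative of the sigmoid
      double outputDelta = (prediction - target) * prediction * (1 - prediction);

      // Hidden layer: chain rule pushes the output delta back through outputWeights
      var hiddenDeltas = new double[hidden.Length];
      for (int h = 0; h < hidden.Length; h++)
          hiddenDeltas[h] = outputDelta * outputWeights[h] * hidden[h] * (1 - hidden[h]);

      // Update output-layer parameters (weight updates are scaled by the hidden activations)
      for (int h = 0; h < hidden.Length; h++)
          outputWeights[h] -= learningRate * outputDelta * hidden[h];
      outputBias -= learningRate * outputDelta;

      // Update hidden-layer parameters (weight updates are scaled by the input votes)
      for (int h = 0; h < hidden.Length; h++)
      {
          for (int i = 0; i < votes.Length; i++)
              hiddenWeights[h, i] -= learningRate * hiddenDeltas[h] * votes[i];
          hiddenBiases[h] -= learningRate * hiddenDeltas[h];
      }
  }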


Resources

Crafting AI - QR Code.png

Chain Rule of Calculus

To find the derivative of a composite function h(x) = f(g(x)), you take the derivative of the outer function f with respect to the inner function g, and multiply it by the derivative of the inner function g with respect to x

  • If h(x) = f(g(x)), then:

    • dh/dx = df/dg * dg/dx
  • Describes how changes in x affect the output h by accounting for how x influences g and how g influences f

  • Enables calculation of gradients for each layer by propagating errors backward through the network

  • Essential for training deep networks, as it helps adjust weights and biases to minimize prediction error
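
A small numeric sanity check of the rule in C#, using an assumed composite h(x) = Sigmoid(m*x + b): the analytic product df/dg * dg/dx should match a finite-difference estimate.

  double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

  double m = 1.5, b = -0.5, x = 2.0;
  double g = (m * x) + b;                               // inner function g(x)
  double dfdg = Sigmoid(g) * (1 - Sigmoid(g));          // derivative of the outer (sigmoid) function
  double dgdx = m;                                      // derivative of the inner (linear) function
  double analytic = dfdg * dgdx;                        // chain rule: dh/dx = df/dg * dg/dx

  double eps = 1e-6;
  double numeric = (Sigmoid(m * (x + eps) + b) - Sigmoid(m * (x - eps) + b)) / (2 * eps);
  // analytic and numeric agree to several decimal places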

Overfitting

When a neural network learns the training data too well, capturing noise rather than the underlying pattern

  • Produces a model that performs well on training data but poorly on other data
  • Can happen when the model is too complex relative to the dataset
    • Excessive parameters allow it to fit even noise within the data
  • Solutions include:
    • Regularization: Add a penalty to the loss function
      • L1 (Lasso): Performs feature selection by setting some coefficients to zero
      • L2 (Ridge): Disperses the weights across all features
    • Dropout: Randomly drop neurons during training to force the network to learn more robust features
    • Early stopping: Stop training when performance on a validation set starts to degrade
    • Data augmentation: Diversify the training set by applying transformations to the input data
    • Weight initialization: Better initialization to avoid configurations that lead to overfitting
    • Batch normalization: Normalize inputs on mini-batches to reduce fluctuations
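
As one hedged example of the regularization bullet above, an L2 (Ridge) penalty adds a term proportional to the weight itself to the gradient step of the earlier linear-model sketch (the method name and parameters are illustrative):

  // One L2-regularized (Ridge) gradient step for Y = mX + b
  void RegularizedStep(double[] x, double[] y, ref double m, ref double b,
                       double learningRate, double lambda)
  {
      double gradM = 0.0, gradB = 0.0;
      for (int i = 0; i < x.Length; i++)
      {
          double error = (m * x[i] + b) - y[i];
          gradM += 2.0 * error * x[i] / x.Length;
          gradB += 2.0 * error / x.Length;
      }
      gradM += 2.0 * lambda * m;        // the penalty discourages large weights
      m -= learningRate * gradM;
      b -= learningRate * gradB;        // the bias is typically left unpenalized
  }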

Xavier/Glorot Initialization

Helps prevent gradients from becoming too small or large, aiding convergence

  // Xavier/Glorot initialization for better gradient flow
  int inputWeightCount = inputCount * hiddenLayerNodes;
  int totalWeightCount = inputWeightCount + hiddenLayerNodes;
  var weightScale = Math.Sqrt(2.0 / inputWeightCount);
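  // Allocate the weight array and fill it with small random values in [-weightScale, +weightScale]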
  startingWeights = new double[totalWeightCount];
  for (int i = 0; i < startingWeights.Length; i++)
    startingWeights[i] = _random.GetRandomDouble(-weightScale, weightScale);
  • Balances the signal through layers
  • Prevents activation function saturation
  • Ideal for symmetric activations (e.g., sigmoid or hyperbolic tangent)