Crafting AI

A Developer's Guide to Machine Learning


Barry S. Stahl

Solution Architect & Developer

@bsstahl@cognitiveinheritance.com

https://CognitiveInheritance.com

Transparent Half Width Image 720x800.png

Favorite Physicists & Mathematicians

Favorite Physicists

  1. Harold "Hal" Stahl
  2. Carl Sagan
  3. Richard Feynman
  4. Marie Curie
  5. Nikola Tesla
  6. Albert Einstein
  7. Neil deGrasse Tyson
  8. Niels Bohr
  9. Galileo Galilei
  10. Michael Faraday

Other notables: Stephen Hawking, Edwin Hubble, Leonard Susskind, Christiaan Huygens

Favorite Mathematicians

  1. Ada Lovelace
  2. Alan Turing
  3. Johannes Kepler
  4. Rene Descartes
  5. Isaac Newton
  6. Emmy Noether
  7. George Boole
  8. Blaise Pascal
  9. Johann Gauss
  10. Grace Hopper

Other notables: Daphne Koller, Grady Booch, Leonardo Fibonacci, Evelyn Berezin, Benoit Mandelbrot

Some OSS Projects I Run

  1. Liquid Victor : Media tracking and aggregation [used to assemble this presentation]
  2. Prehensile Pony-Tail : A static site generator built in C#
  3. TestHelperExtensions : A set of extension methods helpful when building unit tests
  4. Conference Scheduler : A conference schedule optimizer
  5. IntentBot : A microservices framework for creating conversational bots on top of Bot Framework
  6. LiquidNun : Library of abstractions and implementations for loosely-coupled applications
  7. Toastmasters Agenda : A C# library and website for generating agendas for Toastmasters meetings
  8. ProtoBuf Data Mapper : A C# library for mapping and transforming ProtoBuf messages

Fediverse Supporter

Logos.png

http://GiveCamp.org

GiveCamp.png

Achievement Unlocked

bss-100-achievement-unlocked-1024x250.png

Minimizing Error

training_animation.gif

Simple Linear Model

AI Models - Single Input Linear - 1280x720.png

2-Input Linear Model

AI Models - 2 Input Linear - 1280x720.png

Multiple-Input Linear Model

AI Models - Multiple Input Linear - 1280x720.png

Multi-Layer Perceptron

AI Models - MultiLayer Perceptron - 1280x720.png

Crafting AI Models

  • Designing the Model Structure
    • Define Inputs and Outputs
    • Determine Layers and Neurons
    • Select Activation Function
  • Training the Model
    • Cost Function (Mean Squared Error)
    • Gradient Descent Optimization
    • Adjust Hyperparameters (e.g. learning rate)
  • These are the critical concepts for all ML tooling
    • TensorFlow, Keras, ML.NET, etc.
    • Master these concepts to harness any ML tool suite effectively

Linear Model

  • Input Variable: X
  • Output Variable: Y
  • Weight Parameter: m
  • Bias Parameter: b
  • Linear Equation: Y = mX + b
AI Models - Single Input Linear - 1280x720.png

ML's Linear Linchpin

Y = mX + b

  • Every neuron gets its value from a linear transformation

  • Multiple inputs result in a sum of the linear transformations

    • The sum of linear transformations is linear
  • Only linearly-separable functions can be modeled without a non-linear activation

AI Models - Annotated Single Input Linear - 640x360.png
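
As a minimal C# sketch of this idea (the method name is illustrative, not from the demo code), every neuron's raw value is just a weighted sum of its inputs plus a bias:

  // Raw (pre-activation) value of a single neuron: M1*X1 + M2*X2 + ... + Mn*Xn + B
  double NeuronValue(double[] inputs, double[] weights, double bias)
  {
      double sum = bias;
      for (int i = 0; i < inputs.Length; i++)
          sum += weights[i] * inputs[i];
      return sum;   // still a linear function of the inputs
  }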

Mathematics

...we've invented a fantastic array of tricks and gimmicks for putting together the numbers, without actually doing it. We don't actually [apply Y = mX + b for every neuron]. We do it by the tricks of mathematics, and that's all. So, we're not going to worry about that. You don't have to know about [Linear Algebra]. All you have to know is what it is, tricky ways of doing something which would be laborious otherwise.

  • With apologies to Professor Feynman, who was talking about the tricks of Calculus as applied to Physics, not the tricks of Linear Algebra as applied to Machine Learning.
Feynman - QED - 600x400.png

Model Parameters

  • Parameters: Internal values learned during training

    • Define the relationship between input and output

    • Adjusted to minimize prediction error

  • In Linear Regression: Y = mX + b

    • m: Slope (Weight) - The influence of X on Y

    • b: Intercept (Bias) - Shifts the output vertically

  • In more complex models:

    • Parameter counts include the weights and biases across all layers

    • Can capture non-linear relationships

    • Parameter counts are a proxy for complexity

    • GPT-4 reportedly uses roughly 1.8 trillion parameters

Model Parameters.png
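
As a rough illustration of counting parameters (the helper below is hypothetical, not the demo code), a fully-connected layer contributes one weight per input/output pair plus one bias per output neuron:

  // Parameters in one fully-connected layer: weights plus biases
  int LayerParameterCount(int inputs, int outputs) => (inputs * outputs) + outputs;

  // Example: a 16-3-1 network like the multi-layer voting model later in this deck
  int totalParameters = LayerParameterCount(16, 3) + LayerParameterCount(3, 1);   // 51 + 4 = 55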

Error Function

  • X-Axis: The value of the weight (m)
  • Y-Axis: The value of the bias (b)
  • Z-Axis: The size of the error

The weight (m) often has a greater effect on the error than the bias (b)

Linear Regression - Error Function Plot - Smaller.png

Training the Model

  • Objective: Minimize the error
    • Error: The difference between the predicted value and the actual value
    • MSE: Mean Squared Error - Average of the squared differences between predicted and actual values
  • Means: Gradient Descent Optimization
    • Nearly any Optimization algorithm can be used
    • Gradient Descent is the most common and best-suited
training_animation.gif
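
A small C# sketch of the cost calculation (the method name is illustrative): MSE averages the squared differences between predicted and actual values.

  // Mean Squared Error over a set of predictions
  double MeanSquaredError(double[] predicted, double[] actual)
  {
      double sum = 0.0;
      for (int i = 0; i < predicted.Length; i++)
      {
          double error = predicted[i] - actual[i];
          sum += error * error;
      }
      return sum / predicted.Length;
  }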

Gradient Descent

Optimization algorithm used to minimize the cost function of a model

  • Shifts the parameters in the opposite direction of the cost gradient

    • The 1st derivative of the cost function

    • Represents the slope of that function

  • Hyperparameters include:

    • Number of iterations

    • Learning rate

    • Convergence criteria

  • Stochastic Gradient Descent

    • Updates parameters using a subset of data each cycle
AI Models - Gradient Descent - 400x800.png
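
A hedged sketch of one gradient-descent training loop for the single-input model Y = mX + b (the method name, learning rate, and iteration count are made-up example values, not the demo code):

  // Adjust m and b in the opposite direction of the MSE gradient
  void TrainLinearModel(double[] x, double[] y, ref double m, ref double b,
                        double learningRate = 0.01, int iterations = 1000)
  {
      int n = x.Length;
      for (int iter = 0; iter < iterations; iter++)
      {
          double gradM = 0.0, gradB = 0.0;
          for (int i = 0; i < n; i++)
          {
              double error = (m * x[i] + b) - y[i];   // predicted - actual
              gradM += 2.0 * error * x[i] / n;        // dMSE/dm
              gradB += 2.0 * error / n;               // dMSE/db
          }
          m -= learningRate * gradM;                  // step against the gradient
          b -= learningRate * gradB;
      }
  }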

Linear Regression Demos

Linear Model Demos 800x800.jpg

Railroad Times Model

AI Models - Single Input Linear - 1280x720.png

Train & Test Cycle

Train and Test Cycle.png

Voting Dataset

Congressional Voting Dataset 1280x720.png

Voting Data

Excel - Voting Dataset 1280x720.png

Voting Model

AI Models - SingleLayer Voter Model - 1280x720.png

Linear Separability

LinearSeparability.png

Linear Separability

LinearSeparability-WithLine.png

Not Linearly Separable

NotLinearlySeparable.png

Activation

Functions applied to each neuron's output to introduce non-linearity

  • The sum of linear functions is linear

    • i.e. multiple variables in a layer
    • Y = M1*X1 + M2*X2 + ... + Mn*Xn + B
    • Hyperplane in n-dimensional space
  • Composition of linear functions is linear

    • i.e. multiple layers
    • Y = M1*(M2*X1 + ... + Mn*Xn + B2) + B1
    • Hyperplane in n-dimensional space
  • Composition of linear and non-linear is non-linear

    • i.e. Activation functions
    • Y = f(M1*X1 + ... + Mn*Xn + B), with a non-linear f (e.g. sigmoid)
Activation.png
  Sigmoid Function.png
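
A brief C# sketch of how the sigmoid shown above wraps the neuron's linear value (names are illustrative, not the demo code):

  // Sigmoid squashes any real value into (0, 1), introducing non-linearity
  double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

  // Neuron output = Sigmoid(M1*X1 + M2*X2 + ... + Mn*Xn + B)
  double NeuronOutput(double[] inputs, double[] weights, double bias)
  {
      double sum = bias;
      for (int i = 0; i < inputs.Length; i++)
          sum += weights[i] * inputs[i];
      return Sigmoid(sum);   // composition of linear and non-linear is non-linear
  }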

Feature Analysis

  • Weights represent the influence of an input feature

  • The 13th weight is the smallest - least influential

    • Least partisan: Superfund Right to Sue
  • The 4th weight is the largest - most influential

    • Most partisan: Physician Fee Freeze
Trained Model Parameters - Voting Model - 800x800 - With Highlights.png

Feature Engineering

The process of identifying, selecting, and combining key attributes of a problem to help a model make better predictions

  • Choice and quality of features are critical
    • Directly impact the model's ability to learn and make accurate predictions
  • Hierarchical layers enable complex feature extraction, capturing non-linear relationships
  • Techniques like Dimensionality Reduction help reduce the number of features while preserving key information
    • Example: Combine length and width to create a single feature representing area
  • Requires experimentation and iteration to find the best feature set for a given problem
Necker_cube_with_background.png
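
A toy C# illustration of the length/width example above (the record type and names are hypothetical):

  // Two raw, correlated measurements on each sample
  record Parcel(double Length, double Width);

  // Engineered feature: one value (area) replaces the two raw inputs
  double AreaFeature(Parcel p) => p.Length * p.Width;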

Features of the Voting Model

  • Current Model

    • Inputs: 16 Features

    • Output: 1 Neuron

    • Architecture: Linear Perceptron

  • Hypothesis: Existence of factions within parties

    • A single-layer model may not capture this non-linear relationship
AI Models - SingleLayer Voter Model - 1280x720.png

Enhancing the Voting Model

  • Solution: Hidden Layer

  • Expected Outcome

    • Better insight into party dynamics

    • Improved accuracy

  • New Model

    • Inputs: 16 Features

    • Hidden Layer: 3 Neurons

    • Output: 1 Neuron

    • Architecture: Multi-layer Perceptron

AI Models - MultiLayer Voter Model - 1280x720.png
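
A hedged sketch of the new model's feed-forward pass, assuming 16 vote inputs, 3 sigmoid hidden neurons, and 1 sigmoid output neuron (array shapes and names are illustrative, not the demo code):

  // Forward pass for a 16-3-1 multi-layer perceptron
  // hiddenWeights: [3,16], hiddenBiases: [3], outputWeights: [3], outputBias: scalar
  double Predict(double[] votes, double[,] hiddenWeights, double[] hiddenBiases,
                 double[] outputWeights, double outputBias)
  {
      double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

      var hidden = new double[hiddenBiases.Length];
      for (int h = 0; h < hidden.Length; h++)
      {
          double sum = hiddenBiases[h];
          for (int i = 0; i < votes.Length; i++)
              sum += hiddenWeights[h, i] * votes[i];
          hidden[h] = Sigmoid(sum);             // hidden-layer activations
      }

      double output = outputBias;
      for (int h = 0; h < hidden.Length; h++)
          output += outputWeights[h] * hidden[h];
      return Sigmoid(output);                   // e.g. likelihood of one party affiliation
  }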

BackPropagation

A mechanism for training multi-layer perceptrons by determining how much each hidden-layer neuron contributes to the output error (see the sketch after this list)

  • Create a feed-forward prediction

  • Calculate error at the output layer

  • Compute error gradient using the chain rule

  • Update Parameters using gradients

    • Start from output layer, move backwards through network.

    • Adjustments to weights are weighted by the input values

    • Adjustments to biases are not weighted

  • No need to recalculate gradient after each update during an iteration
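
A minimal sketch of that backward step for the one-hidden-layer sigmoid network above, using squared error (names, shapes, and the choice of a sigmoid output are assumptions, not the demo code):

  // Backward pass for one training example of the 16-3-1 network
  // prediction and hidden come from the forward pass; target is the actual label
  void BackPropagate(double[] votes, double[] hidden, double prediction, double target,
                     double[,] hiddenWeights, double[] hiddenBiases,
                     double[] outputWeights, ref double outputBias, double learningRate)
  {
      // Output layer: error times the derivative of the sigmoid
      double outputDelta = (prediction - target) * prediction * (1 - prediction);

      // Hidden layer: chain rule pushes the output delta back through outputWeights
      var hiddenDeltas = new double[hidden.Length];
      for (int h = 0; h < hidden.Length; h++)
          hiddenDeltas[h] = outputDelta * outputWeights[h] * hidden[h] * (1 - hidden[h]);

      // Update output-layer parameters (weight updates are scaled by the hidden activations)
      for (int h = 0; h < hidden.Length; h++)
          outputWeights[h] -= learningRate * outputDelta * hidden[h];
      outputBias -= learningRate * outputDelta;

      // Update hidden-layer parameters (weight updates are scaled by the input votes)
      for (int h = 0; h < hidden.Length; h++)
      {
          for (int i = 0; i < votes.Length; i++)
              hiddenWeights[h, i] -= learningRate * hiddenDeltas[h] * votes[i];
          hiddenBiases[h] -= learningRate * hiddenDeltas[h];
      }
  }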


Resources

Crafting AI - QR Code.png

Chain Rule of Calculus

To find the derivative of a composite function h(x) = f(g(x)), you take the derivative of the outer function f with respect to the inner function g, and multiply it by the derivative of the inner function g with respect to x

  • If h(x) = f(g(x)), then:

    • dh/dx = df/dg * dg/dx
  • Describes how changes in x affect the output h by accounting for how x influences g and how g influences f

  • Enables calculation of gradients for each layer by propagating errors backward through the network

  • Essential for training deep networks, as it helps adjust weights and biases to minimize prediction error
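
A small numeric sanity check of the rule in C#, using an assumed composite h(x) = Sigmoid(m*x + b): the analytic product df/dg * dg/dx should match a finite-difference estimate.

  double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));

  double m = 1.5, b = -0.5, x = 2.0;
  double g = (m * x) + b;                               // inner function g(x)
  double dfdg = Sigmoid(g) * (1 - Sigmoid(g));          // derivative of the outer (sigmoid) function
  double dgdx = m;                                      // derivative of the inner (linear) function
  double analytic = dfdg * dgdx;                        // chain rule: dh/dx = df/dg * dg/dx

  double eps = 1e-6;
  double numeric = (Sigmoid(m * (x + eps) + b) - Sigmoid(m * (x - eps) + b)) / (2 * eps);
  // analytic and numeric agree to several decimal places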

Overfitting

When a neural network learns the training data too well, capturing noise rather than the underlying pattern

  • Produces a model that performs well on training data but poorly on other data
  • Can happen when the model is too complex relative to the dataset
    • Excessive parameters allow it to fit even noise within the data
  • Solutions include:
    • Regularization: Add a penalty to the loss function
      • L1 (Lasso): Performs feature selection by setting some coefficients to zero
      • L2 (Ridge): Disperses the weights across all features
    • Dropout: Randomly drop neurons during training to force the network to learn more robust features
    • Early stopping: Stop training when performance on a validation set starts to degrade
    • Data augmentation: Diversify the training set by applying transformations to the input data
    • Weight initialization: Better initialization to avoid configurations that lead to overfitting
    • Batch normalization: Normalize inputs on mini-batches to reduce fluctuations
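
As one hedged example of the regularization bullet above, an L2 (Ridge) penalty adds a term proportional to the weight itself to the gradient step of the earlier linear-model sketch (the method name and parameters are illustrative):

  // One L2-regularized (Ridge) gradient step for Y = mX + b
  void RegularizedStep(double[] x, double[] y, ref double m, ref double b,
                       double learningRate, double lambda)
  {
      double gradM = 0.0, gradB = 0.0;
      for (int i = 0; i < x.Length; i++)
      {
          double error = (m * x[i] + b) - y[i];
          gradM += 2.0 * error * x[i] / x.Length;
          gradB += 2.0 * error / x.Length;
      }
      gradM += 2.0 * lambda * m;        // the penalty discourages large weights
      m -= learningRate * gradM;
      b -= learningRate * gradB;        // the bias is typically left unpenalized
  }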

Xavier/Glorot Initialization

Helps prevent gradients from becoming too small or large, aiding convergence

  // Xavier/Glorot initialization for better gradient flow
  int inputWeightCount = inputCount * hiddenLayerNodes;
  int totalWeightCount = inputWeightCount + hiddenLayerNodes;
  var weightScale = Math.Sqrt(2.0 / inputWeightCount);
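  // Allocate the weight array and fill it with small random values in [-weightScale, +weightScale]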
  startingWeights = new double[totalWeightCount];
  for (int i = 0; i < startingWeights.Length; i++)
    startingWeights[i] = _random.GetRandomDouble(-weightScale, weightScale);
  • Balances the signal through layers
  • Prevents activation function saturation
  • Ideal for symmetric activations (e.g., sigmoid or hyperbolic tangent)