# Automatic Differentiation

keywords: Reasoning, Big Data, AI, Math, Prolog

I have not completed this article. The work remaining is the description of the "Big Ideas" of AD, appetite whetting, references, and I am still debugging the Prolog code for Reverse-Mode Automatic Differentiation. I think what I have so far may be of interest. Stay tuned, I expect to find some time to complete this in January.

**Automatic Differentiation** ("AD") has been discovered and re-discovered as an implementation technique for numerical processing but is best known now as a key part of neural net / deep learning *backprop* algorithms.

The oft repeated trope that AD is not symbolic nor numeric differentiation is unhelpful. I think the contrasts are true but distract from the important borrowing of methods from both symbolic and numeric differentiation. The introductions to AD should be re-structured to emphasize this and my presentation here is in that spirit.

This article does not attempt to review the literature or to be a complete self-standing tutorial about AD; Other sources, including some that are listed in the References section below do that well. Here I assume that you are acquainted with AD.

My purpose here is to walk though examples of forward- and backward-mode automatic differentiation and provide a sketch of how AD might be implemented in Prolog. Prolog is not an ideal language for numerical processing but the algorithms for AD take on a beauty in Prolog that is lost in most languages, including Julia.

## The Big Ideas

The importance of evaluation of derivatives to machine learning and optimization algorithms cannot be exaggerated. Implementers and users of such algorithms are assumed to be familiar with the concepts of directional derivatives, gradients, Hessians and Jacobian matrixes but may be in the dark about how these are actually computed.

XXXXXXXXXXXXXXXXXXXXXXXX – Fix this – XXXXXXXXXXXXXXXXXXXXXXXX

The following should be the Reals (problem in PDF version)

The definition of a partial derivative using a central difference is:

Numerical differentiation is easy to understand and corresponds to everyone's first exposure to ideas about taking limits and approximations computed by truncating Taylor series (known in differential geometry as "jets".) For example, a center difference approximation of a first order derivative has the familiar form:

where the truncation error is second-order in $h$ but as $h \to 0$ round-off error increases. The problem may become acute as dimensionality increases.

Continuing our review, the directional derivative is the rate of change (or slope of the tangent) of $f(\mathbf{x})$ at a point $a$ in the direction $\mathbf{e}$ which may be approximated using the above or may be directly evaluated by a symbolic form of the derivative of $f$. As we all experienced in introductory Calculus class, taking and evaluating derivatives may be tedious but was merely the application of rules that were learned by rote.

The gradient is related to the directional derivative. Given a differentiable scalar function $f:\mathbb{R}^n \to \mathbb{R}$ the gradient is a vector field e.g., a vector-valued function $\nabla f:\mathbb{R}^n \to \mathbb{R}^n$ which can be evaluated for a point $p$ as

The differential operator del or $\nabla$ takes different forms in different coordinate systems. In a Cartesian coordinate system with an Euclidean metric we have

Much of machine learning and optimization boils down to minimizing (or maximizing) an objective function in an iterative descent procedure.

The Jacobian operator is a generalization of the derivative operator to the vector-valued functions. As we have seen earlier, a vector-valued function is a mapping from $f:\mathbb{R}^n \to \mathbb{R}^m$, hence, now instead of having a scalar value of the function $f$, we will have a mapping $[x_1,x_2,\dotsc,x_n] \to [f_1,f_2,\dotsc,f_n]$. Thus, we now need the rate of change of each component of $f$ with respect to each component of the input variable $x$, this is exactly what is captured by a matrix called Jacobian matrix $J$

Hessian

The gradient is the first order derivative of a multivariate function. To find the second order derivative of a multivariate function, we define a matrix called a Hessian matrix given by

Dual variables

Clifford - totally remarkable (aside)

Functional code + source level transformation => function + derivative computed simultaneously

f: R^n -> R^m

Jacobian-vector products Directional derivatives

## Symbolic Algebra and Differentiation

Symbolic differentiation is based on the exact rules that you learned in Calculus class. Prolog allows these to be represented with great fidelity. I provide a few below as an illustration.

Rule | Prolog | English equivalent |
---|---|---|

$\small\frac{dx}{dx} = 1$ | d(X,X,1). | "the derivative of X with respect to itself is 1. |

$\small\frac{dC}{dx} = 0$ | d(C,X,0) :- number(C). | "the derivative of a constant C with respect to X is 0 |

$\small{\frac{d(A+B)}{dx} = \frac{d(A)}{dx}+\frac{d(B)}{dx}}$ | d(U+V,X,A+B) :- d(U,X,A), d(V,X,B). | "the derivative of the sum is the sum of the derivatives" |

$\small\frac{d(A \cdot B)}{dx}\tiny =A \small \frac{d(B)}{dx}\tiny +B \small\frac{d(A)}{dx}$ | d(U*V,X,B*U+A*V) :- d(U,X,A), d(V,X,B). | "the derivative of a product is the first * the derivative of the second plus the second * the derivative of the first" |

$\small\frac{d(\sin(A))}{dx} = \scriptsize \cos(A)\small\frac{d(A)}{dx}$ | d(sin(T),X,R*cos(T)) :- d(T,X,R). | "the derivative of the sin is the cos - using the chain rule" |

Below is an example of how Prolog can be used for symbolic differentiation. The algebraic simplification approach is not very effective for complex expressions. The correct approach is to transform the expression tree formed by precedence and associativity of operators into a flatter structure of (for example) sum and product lists. Then strategies for gathering of like terms and re-writing can explore a space of equivalent representations that satisfices some quality characteristic such as minimizing execution cost.

A particularly interesting system for algebraic manipulation is PRESS: PRolog Equation Solving System. For a book length treatment, see B. Silver **Meta-Level Inference**.

```
/*****************************************************************************
Symbolic Algebra and Differentiation System in Arity/Prolog32
*****************************************************************************/
/*
Some Rules of Partial Differentiation
diff(F, X, DF) The derivative of F w.r.t. X is DF
*/
:- mode diff(+, +, -/+):det.
diff(X, X, 1) :- !.
diff(F, X, 0) :- notOccurs(F, X) , !.
diff(-F, X, -DF) :-
!,
diff(F, X, DF).
diff(sin(F), X, cos(F) * DF) :-
!,
diff(F, X, DF).
diff(cos(F), X, -sin(F) * DF) :-
!,
diff(F, X, DF).
diff(tan(F), X, (cos(F) ^ -2) * DF) :-
!,
diff(F, X, DF).
diff(exp(F), X, exp(F)*DF) :-
!,
diff(F, X, DF).
diff(ln(F), X, DF / F) :- % where X > 0
!,
diff(F, X, DF).
diff(sigmoid(F), X, DF) :- % rewrite and recurse
!,
diff(1 / (1 - exp(-F)), X, DF).
diff(F + G, X, DF + DG) :-
!,
diff(F, X, DF),
diff(G, X, DG).
diff(F - G, X, DF - DG) :-
!,
diff(F, X, DF),
diff(G, X, DG).
diff(N * F, X, N * DF) :-
number(N),
!,
diff(F, X, DF).
diff(F * N, X, N * DF) :-
number(N),
!,
diff(F, X, DF).
diff(F * G, X, F * DG + G * DF) :-
!,
diff(F, X, DF),
diff(G, X, DG).
diff(F / G, X, (G * DF - F * DG)/ G^2) :- % where G is nonzero
!,
diff(F, X, DF),
diff(G, X, DG).
diff(F ^ G, X, G * (F ^ (G - 1)) * DF) :- % power rule, special case
number(G),
!,
diff(F, X, DF).
diff(F ^ G, X, (F ^ G) * (DF * G / F + DG * ln(F))) :-
!,
diff(F, X, DF),
diff(G, X, DG).
/*
notOccurs(F, X)
true iff X does not appear in F
*/
:- mode notOccurs(+,+):det.
notOccurs(X, X) :- !, fail.
notOccurs(A, _) :- atomic(A), !.
notOccurs([F|T], X) :-
!,
notOccurs(F, X),
notOccurs(T, X).
notOccurs(F, X) :-
F =.. [_|Args],
notOccurs(Args, X).
/*
simplify(F, G)
G is (an) algebraicly simplified version of F
Note that this is really a very poor version of simplification. It is
provided here to be suggestive rather than effective. Sums and products
should be represented as sorted lists enabling better application of
re-write rules.
Better systems implement very elaborate in strategies for
term gathering and simplification. See for example,
PRESS: PRolog Equation Solving System (https://github.com/maths/PRESS)
B. Silver Meta-Level Inference (www.amazon.com//dp/0444879005)
*/
:- mode simplify(+,-/+):det.
simplify(X, Y) :-
Y is X,
number(Y),
!.
simplify(X, X) :-
atomic(X),
!.
simplify(F, G) :-
F =.. [Op|Argsf],
simplifyL(Argsf, Argsg),
G0 =.. [Op|Argsg],
simplifyRules(G0, G1),
simplify1(G0, G1, G).
:- mode simplify1(+,+,-/+):det.
simplify1(G, G, G) :- !. % no need to recurse
simplify1(_, G1, G) :-
simplify(G1, G).
:- mode simplifyL(+,-/+):det.
simplifyL([F|Tf], [G|Tg]) :-
!,
simplify(F, G),
simplifyL(Tf, Tg).
simplifyL([], []).
:- mode simplifyRules(+,-/+):det.
simplifyRules(-(-X), X) :- !.
simplifyRules((-X), -1 * X) :- !.
simplifyRules(0 + X, X) :- !.
simplifyRules(X + 0, X) :- !.
simplifyRules((X + N), (N + X)) :- % number first in mixed sum
number(N),
!.
simplifyRules((N1+X) + N2, N3 + X) :-
N3 is N1 + N2,
number(N3),
!.
simplifyRules(N1 + (N2+X), N3 + X) :-
N3 is N1 + N2,
number(N3),
!.
simplifyRules((N1+X) + (N2+Y), N3 + (X+Y)) :-
N3 is N1 + N2,
number(N3),
!.
simplifyRules((N+X) + (Y+Z), N + (X+Y+Z)) :-
number(N),
!.
simplifyRules((X+Y) + (N+Z), N + (X+Y+Z)) :-
number(N),
!.
simplifyRules(X+(Y+Z), X+Y+Z) :-
!.
simplifyRules(0 - X, -X) :- !.
simplifyRules(X - 0, X) :- !.
simplifyRules(X - N, N1 + X) :-
N1 is -N,
number(N1),
!.
simplifyRules(X - X, 0) :- !.
simplifyRules(X - Y, X + '-'(Y)) :- !.
simplifyRules(0 * X, 0) :- !.
simplifyRules(X * 0, 0) :- !.
simplifyRules(1 * X, X) :- !.
simplifyRules(X * 1, X) :- !.
simplifyRules((X * N), (N * X)) :- % number first in mixed product
number(N),
!.
simplifyRules(X + (N + Y), N + (X + Y)) :-
number(N),
!.
simplifyRules((N1*X) * N2, N3 * X) :-
N3 is N1 * N2,
number(N3),
!.
simplifyRules(N1 * (N2*X), N3 * X) :-
N3 is N1 * N2,
number(N3),
!.
simplifyRules((N1*X) * (N2*Y), N3 * (X*Y)) :-
N3 is N1 * N2,
number(N3),
!.
simplifyRules((N*X) * (Y*Z), N * (X*Y*Z)) :-
number(N),
!.
simplifyRules((X*Y) * (N*Z), N * (X*Y*Z)) :-
number(N),
!.
simplifyRules(X*(Y*Z), X*Y*Z) :-
!.
simplifyRules(N * '-'(X), N1 * X) :-
N1 is - N,
number(N1),
!.
simplifyRules(N * (X+Y), (N * X)+ (N * Y)) :-
number(N),
!.
simplifyRules(X * (N+Y), (N * X)+ (X * Y)) :-
number(N),
!.
simplifyRules((N1 * X) + (N2 * X), N3 * X) :-
N3 is N1 + N2,
number(N3),
!.
simplifyRules((W + X) * (Y + Z), W*Y + W*Z + X*Y + Y*Z) :-
!.
simplifyRules(X / 0, err) :- !.
simplifyRules(X / X, 1) :- !.
simplifyRules(0 / X, 0) :- !.
simplifyRules(X / 1, X) :- !.
simplifyRules(X / (Y / Z), (X * Z) / Y) :- !.
simplifyRules(X ^ 1, X) :- !.
simplifyRules((X + Y) ^ 2, (X + Y) * (X + Y)) :- !.
simplifyRules(X*X, X^2) :-
atomic(X),
!.
simplifyRules(X * X^N, X^N1) :-
atomic(X),
N1 is N + 1,
number(N1),
!.
simplifyRules(X^N * X, X^N1) :-
atomic(X),
N1 is N + 1,
number(N1),
!.
simplifyRules(X^N1 * X^N2, X^N3) :-
atomic(X),
N3 is N1 + N2,
number(N3),
!.
simplifyRules(X^N * Y, Y * X^N) :-
Y \= A^B,
!.
simplifyRules(sigmoid(X), 1/(1+exp(-X))) :- !.
simplifyRules(X , X).
/*
evaluate(F, Bind, V)
V is the computed value of F, using bindings Bind
Bind is a list of elements bind(Var, Value)
*/
:- mode evaluate(+,+,-/+):det.
evaluate(F, Bind, V) :-
evaluateBind(F, Bind, G),
simplify(G, V).
evaluateBind(F, Bind, F) :-
number(F),
!.
evaluateBind(F, Bind, V) :-
atomic(F),
!,
(
memberChk(bind(F, V), Bind),
!
;
V = F
).
evaluateBind(F, Bind, V) :-
F =.. [Op|ArgsF],
evaluateBindL(ArgsF, Bind, ArgsV),
V =.. [Op|ArgsV].
:- mode evaluateBindL(+,+,-/+):det.
evaluateBindL([F|Ft], Bind, [G|Gt]) :-
!,
evaluateBind(F, Bind, G),
evaluateBindL(Ft, Bind, Gt).
evaluateBindL([], Bind, []).
memberChk(X, [X|_]) :- !.
memberChk(X, [_|T]) :-
memberChk(X, T).
/*** Examples
?- diff(x, x, D), simplify(D, SD).
D = 1
SD = 1
yes
?- diff(4*x*(1-x), x, D), simplify(D, SD).
D = 4 * x * (0 - 1) + (1 - x) * (4 * 1)
SD = 4 + -8 * x
yes
?- diff(16*x*(1-x)*(1-2*x)^2, x, D), simplify(D, SD).
D = 16 * x * (1 - x) * (2 * (1 - 2 * x) ^ (2 - 1) * (0 - 2 * 1)) + (1 - 2 * x) ^
2 * (16 * x * (0 - 1) + (1 - x) * (16 * 1))
SD = 16 + -64 * x + 64 * x ^ 2 + (128 * x + -128 * x ^ 2) * x + -32 * x + (-32 +
64 * x) * x + (-32 + 64 * x + (64 + -128 * x) * x) * x
yes
***/
```

## Forward-Mode Automatic Differentiation

```
/*****************************************************************************
Forward-Mode Automatic Differentiation System in Arity/Prolog32
X = Dual numbers are represented as CONSes, e.g. [x|dx]
Compiled operations are represented as lists with elements representing
a single computational step for both an expression and its derivative:
s(a, da, x + y, dx + dy)
New variables are in the form v_NNN where NNN ranges from 000 to 999.
A new differential variable name is formed by prependinding 'd' to its
corresponding variable name.
*****************************************************************************/
/*
Some Rules of Partial Differentiation
fmadDiff(E, DE) The derivative of E w.r.t. X is DE
Note that these rules are non-recursive, except for some cases of
re-writing, such as sigmoid(E).
*/
fmadDiff(E, 0) :-
X is E,
number(X),
!.
fmadDiff(-Var, -DVar) :-
!,
dVar(Var, DVar).
fmadDiff(sin(Var), cos(Var) * DVar) :-
!,
dVar(Var, DVar).
fmadDiff(cos(F), -sin(Var) * DVar) :-
!,
dVar(Var, DVar).
fmadDiff(tan(Var), (cos(Var) ^ -2) * DVar) :-
!,
dVar(Var, DVar).
fmadDiff(exp(Var), exp(Var)*DVar) :-
!,
dVar(Var, DVar).
fmadDiff(ln(Var), DVar / Var) :- % where X > 0
!,
dVar(Var, DVar).
fmadDiff(sigmoid(Var), Diff) :- % rewrite and recurse
!,
fmadDiff(1 / (1 - exp(-Var)), Diff).
fmadDiff(F + G, DF + DG) :-
!,
dVar(F, DF),
dVar(G, DG).
fmadDiff(F - G, DF - DG) :-
!,
dVar(F, DF),
dVar(G, DG).
fmadDiff(N * F, N * DF) :-
number(N),
!,
dVar(F, DF).
fmadDiff(F * N, N * DF) :-
number(N),
!,
dVar(F, DF).
fmadDiff(F * G, F * DG + G * DF) :-
!,
dVar(F, DF),
dVar(G, DG).
fmadDiff(F / G, (G * DF - F * DG)/ G^2) :- % where G is nonzero
!,
dVar(F, DF),
dVar(G, DG).
fmadDiff(F ^ G, G * (F ^ (G - 1)) * DF) :- % power rule, special case
number(G),
!,
dVar(F, DF).
fmadDiff(F ^ G, (F ^ G) * (DF * G / F + DG * ln(F))) :-
!,
dVar(F, DF),
dVar(G, DG).
/*
Forward-mode AD compilation
compile(E, InDuals, AllDuals, OutDual, EPlan, DEPlan)
compileRHS(E, InDuals0, InDuals, AllDuals0, AllDuals, OutDual,
EPlanA, EPlanZ, DEPlanA, DEPlanZ, N, N1)
E - Expression compiled by recursive descent
InDuals0, InDuals - Accumulate Dual Input variables
AllDuals0, AllDuals - Accumulate all Dual variables (inp + intemed. vars)
OutDual - Output Dual value or variable of RHS compilation
EPlanA, EPlanZ - Difference list for Expression operations
DEPlanA, DEPlanZ - Difference list for Differential Expression operations
N, N1 - Counter for intermediate variable name generation
*/
:- mode compile(+,-/+,-/+,-/+,-/+,-/+):det.
compile(LHS=RHS, InDuals, AllDuals, OutDual, EPlan, DEPlan) :-
% could simplify RHS here
compileRHS(RHS, [], InDuals0, [], AllDuals0, OutDual0, EPlan0, [],
DEPlan0, [], 0, _, Simple),
dVar(LHS, DLHS),
OutDual = [LHS|DLHS],
(
Simple = true, % trivial RHS (dual numbers or vars)
!,
InDuals = InDuals0,
AllDuals = InDuals, % reset to InDuals
OutDual0 = [E|DE],
EPlan = [LHS=E],
DEPlan = [DLHS=DE]
;
removeOutDual(AllDuals0, OutDual0, AllDuals1),
sort(InDuals0, InDuals),
sort(AllDuals1, AllDuals),
substLHS(EPlan0, LHS, EPlan1),
substLHS(DEPlan0, DLHS, DEPlan1),
% could optimize EPlan and DPlan here (simplification + peephole)
EPlan = EPlan1,
DEPlan = DEPlan1
).
:- mode compileRHS(+,+,-/+,+,-/+,-/+,?,?,?,?,+,-/+,-/+):det.
compileRHS(E, InDuals, InDuals, AllDuals, AllDuals, OutDual, EPlan, EPlan,
DEPlan, DEPlan, N, N, true) :-
X is E,
number(X),
!,
OutDual = [X|0].
compileRHS(Var, InDuals0, InDuals, AllDuals0, AllDuals, OutDual, EPlan, EPlan,
DEPlan, DEPlan, N, N1, true) :-
atomic(Var),
!,
(
memberChk([Var|DVar], InDuals0),
!,
% variable already seen (must be InDual)
InDuals = InDuals0,
AllDuals = AllDuals0
;
% variable seen for 1st time (must be InDual)
dVar(Var, DVar),
InDuals = [OutDual|InDuals0],
AllDuals = [OutDual|AllDuals0]
),
OutDual = [Var|DVar],
N1 = N.
compileRHS(E0, InDuals0, InDuals, AllDuals0, [OutDual|AllDuals], OutDual,
EPlanA, EPlanZ, DEPlanA, DEPlanZ, N, N2, fail) :-
E0 =.. [Op|ArgsE],
compileRHSL(ArgsE, InDuals0, InDuals, AllDuals0, AllDuals, OutDuals,
EPlanA, [OutVar=E|EPlanZ],
DEPlanA, [DOutVar=DE|DEPlanZ], N, N1),
duals2vars(OutDuals, Vars),
E =.. [Op|Vars],
fmadDiff(E, DE),
genVar(N1, OutVar, N2),
dVar(OutVar, DOutVar),
OutDual = [OutVar|DOutVar].
:- mode compileRHSL(+,+,-/+,+,-/+,-/+,?,?,?,?,+,-/+):det.
compileRHSL([E|T], InDuals0, InDuals, AllDuals0, AllDuals, [OutDual|TOD],
EPlanA, EPlanZ, DEPlanA, DEPlanZ, N, N2) :-
!,
compileRHS(E, InDuals0, InDuals1, AllDuals0, AllDuals1, OutDual,
EPlanA, EPlanB, DEPlanA, DEPlanB, N, N1, _),
compileRHSL(T, InDuals1, InDuals, AllDuals1, AllDuals, TOD, EPlanB, EPlanZ,
DEPlanB, DEPlanZ, N1, N2).
compileRHSL([], InDuals, InDuals, AllDuals, AllDuals, [], EPlan, EPlan,
DEPlan, DEPlan, N, N).
:- mode genVar(+,-/+,-/+):det.
genVar(N, Var, N1) :-
int_text(N, Nt),
string_length(Nt, L),
L1 is 3 - L,
substring('000', 0, L1, S),
concat(['v_',S,Nt], Var),
inc(N, N1).
:- mode dVar(+,-/+):det.
dVar(N, N) :-
number(N),
!.
dVar(Var, DVar) :-
concat('d', Var, DVar).
:- mode memberChk(+,+):det.
memberChk(X, [X|_]) :- !.
memberChk(X, [_|T]) :-
memberChk(X, T).
:- mode duals2vars(+,+):det.
duals2vars([[Var|DVar]|T], [Var|Tv]) :-
!,
duals2vars(T, Tv).
duals2vars([], []).
:- mode removeOutDual(+,+,-/+):det.
removeOutDual([], OutDuals, []) :- !.
removeOutDual([OutDual|T], OutDual, T) :- !.
removeOutDual([H|T], OutDual, [H|T1]) :-
removeOutDual(T, OutDual, T1).
:- mode substLHS(+,+,-/+):det.
substLHS([_=E], Var, [Var=E]) :- !.
substLHS([H|T], Var, [H|T1]) :-
substLHS(T, Var, T1).
/*** Examples
?- compile(y=3, InDuals, AllDuals, OutDual, EPlan, DEPlan).
InDuals = []
AllDuals = []
OutDual = [ y|dy ]
EPlan = [ y = 3 ]
DEPlan = [ dy = 0 ]
yes
?- compile(y=x, InDuals, AllDuals, OutDual, EPlan, DEPlan).
InDuals = [ [ x|dx ] ]
AllDuals = [ [ x|dx ] ]
OutDual = [ y|dy ]
EPlan = [ y = x ]
DEPlan = [ dy = dx ]
yes
?- compile(y=3*x, InDuals, AllDuals, OutDual, EPlan, DEPlan).
InDuals = [ [ x|dx ] ]
AllDuals = [ [ x|dx ] ]
OutDual = [ y|dy ]
EPlan = [ y = 3 * x ]
DEPlan = [ dy = 3 * dx ]
yes
?- compile(y=3*x+sin(w), InDuals, AllDuals, OutDual, EPlan, DEPlan).
InDuals = [ [ w|dw ],[ x|dx ] ]
AllDuals = [ [ v_000|dv_000 ],[ v_001|dv_001 ],[ w|dw ],[ x|dx ] ]
OutDual = [ y|dy ]
EPlan = [ v_000 = 3 * x,v_001 = sin(w),y = v_000 + v_001 ]
DEPlan = [ dv_000 = 3 * dx,dv_001 = cos(w) * dw,dy = dv_000 + dv_001 ]
yes
***/
```