
Idea: Backward propagation #23

Closed
dramforever opened this issue Dec 12, 2017 · 28 comments

Comments

@dramforever

We can hopefully get backward propagation using

newtype F a b = F (a -> ( b :* (b -> a) ))

We can add that to the two existing ADs. I'm not sure how to make it Closed (i.e., give it exponentials), though; it seems that you can't.
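For what it's worth, composition for such a type can be sketched directly. A minimal, hedged illustration (my own, not code from this repo) of how id and (.) would work: forward passes compose forwards, and the backward maps compose in reverse order.

```haskell
import Prelude hiding (id, (.))
import Control.Category

-- Sketch of the proposed type: forward result paired with a backward map.
newtype F a b = F (a -> (b, b -> a))

instance Category F where
  id = F (\a -> (a, \da -> da))
  F g . F f = F (\a ->
    let (b, f') = f a   -- run f forward, capturing its pullback f'
        (c, g') = g b   -- run g forward, capturing its pullback g'
    in (c, f' . g'))    -- backward maps compose in reverse order
```

For example, with double = F (\x -> (2*x, \d -> 2*d)) and square = F (\x -> (x*x, \d -> 2*x*d)), running square . double at 3 yields 36 forward, and pulling a gradient of 1 back gives 24, the derivative of (2x)^2 at 3.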

@conal

conal commented Dec 12, 2017

Thanks for the suggestion. A few thoughts:

  • As I understand it, backprop is an algorithm for differentiation, which I'm doing already. Do you have in mind any possible benefits of backprop vs categories of differentiable functions I have already?
  • We might add a simple general construction for the opposite of a category, Op k. Then maybe use GD (Op k) where k could be functions or linear maps.
  • Could a backward-propagating type of differentiable functions be a cartesian category?

@dramforever

dramforever commented Dec 13, 2017

Separate responses

  1. We can compute the gradient or even the Jacobian matrix of the function efficiently because we only need one evaluation of the differential. (By contrast, the current one requires one evaluation for each input variable to compute the gradient.) This (hopefully!) allows us to, for example, construct and train neural networks elegantly, which is a killer application of backprop.

  2. I think linear maps will not be useful here, as the Op of a linear map is just a transposed matrix. As for other possible ks, I'm not sure.

  3. I haven't found a nice way to. Ekmett's ad library implements one by observable sharing and carefully controlled mutation, but then we would lose the ability to, for example, integrate efficient matrix libraries into the computation.

Sorry, I accidentally clicked the 'close and comment' button while writing 😉

@alpmestan

but then we would lose the ability to, for example, integrate efficient matrix libraries into the computation.

You can circumvent this by performing (reverse, in this case) AD on an AST and interpreting/compiling this AST to efficient implementations of the AST's operations (e.g BLAS & friends on CPU, CUBLAS on GPU).

@dramforever

Clarification: Say you want to map tanh over each element of a vector. That's nice, but what would the accompanying jacobian matrix (AD.hs) be? A huge matrix with zero everywhere except on the diagonal. With my proposed formulation? It's just a map backwards.
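To make the contrast concrete, here is a hedged sketch of that elementwise pullback (mapTanh is my hypothetical name, not anything in the repo): instead of materializing an n-by-n diagonal Jacobian, the backward pass is just another elementwise traversal.

```haskell
-- Sketch: elementwise tanh together with its pullback.
-- Since tanh'(x) = 1 - tanh(x)^2, the backward pass scales each incoming
-- gradient by the matching diagonal entry -- no n-by-n matrix is built.
mapTanh :: [Double] -> ([Double], [Double] -> [Double])
mapTanh xs = (ys, back)
  where
    ys      = map tanh xs
    back ds = zipWith (\y d -> (1 - y * y) * d) ys ds
```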

@conal

conal commented Dec 13, 2017

Thanks for the replies.

We can compute the gradient or even the jacobian matrix of the function efficiently because we only need one evaluation of the differential. (By contrast, the current one requires one evaluation for each input variable to compute the gradient.) This (hopefully!) allows us to, for example, construct and train neural networks elegantly, which is a killer application of backprop.

Efficient evaluation of gradients and Jacobians is important to me as well. By my "current one", do you mean GAD, AD, ADFun, or something else? What makes you think that it requires multiple passes? And what do you mean by an "input variable" in the context of categorical differentiation?

I think linear maps will not be useful here, as the Op of a linear map is just a transposed matrix. As for other possible ks, I'm not sure.

Just as the derivative of a function gives local linear approximations, I was guessing that you had in mind constructing local linear approximations of the inverses of given functions---equivalently, inverses of local linear approximations of the original functions.
(I'm speaking sloppily here. The original result together with the derivative gives a local affine approximation, while the derivative itself is a local linear approximation to a function closely related to the original.)

Could a backward-propagating type of differentiable functions be a cartesian category?

I haven't found a nice way to. Ekmett's ad library implements one by observable sharing and carefully controlled mutation, but then we would lose the ability to, for example, integrate efficient matrix libraries into the computation.

I wouldn't give up too easily, and I wouldn't explore the question in terms of mutation. One of the benefits of compiling to categories (CtC) over shallow and deep embeddings is that sharing is easily observed (as mentioned in section 13 of the CtC paper).

Given your F definition, I think you can easily define id and (.), while the cartesian operations seem trickier. Duality (Op) suggests an answer, but I doubt it's the one we'd want.

@conal

conal commented Dec 13, 2017

but then we would lose the ability to, for example, integrate efficient matrix libraries into the computation.

You can circumvent this by performing (reverse, in this case) AD on an AST and interpreting/compiling this AST to efficient implementations of the AST's operations (e.g BLAS & friends on CPU, CUBLAS on GPU).

I think we could indeed use concat/CtC to generate code that works well with optimized linear algebra libraries, whether the generated code is Haskell, C, CUDA, etc.

@conal

conal commented Dec 13, 2017

Clarification: Say you want to map tanh over each element of a vector. That's nice, but what would the accompanying jacobian matrix (AD.hs) be? A huge matrix with zero everywhere except on the diagonal. With my proposed formulation? It's just a map backwards.

Agreed. A dense "matrix" (composed functor) representation would be mostly zero. With the function representation in ADFun, fmap becomes another fmap and a bit of rearranging.

@conal

conal commented Dec 13, 2017

@dramforever Getting to what I think is the heart of your suggestion, do you have a specification (correctness definition) in mind for conversion from a -> b to F a b? A simple, precise specification should make other questions clear and probably easy to answer correctly.

As an example of what I mean by a specification, consider mapping a -> b to a -> b :* (a :-* b), where a :-* b is the type of linear maps from a to b (vector spaces sharing a common scalar field). Add a newtype wrapper. The specification is combining a function f with its derivative:

newtype AD a b = AD (a -> b :* (a :-* b))

ad :: (a -> b) -> AD a b
ad f == AD (f &&& deriv f)

Now require that ad is a homomorphism for many algebraic interfaces, including category, cartesian category, and many others (Num, Floating, etc). Solving these homomorphism equations results in a collection of correct-by-construction instances. In the CtC framework, these instances then become a translation of Haskell code, and the translation satisfies the given spec, even though ad is not computable as stated, i.e., as operating on denotations/values (functions, in this case) rather than on notations/expressions.
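As an illustration of solving such a homomorphism equation (a sketch with plain Haskell functions standing in for the linear maps a :-* b; this is not the concat code), requiring ad (g . f) == ad g . ad f forces the chain rule into the definition of composition:

```haskell
-- Sketch: AD with plain functions standing in for linear maps (a :-* b).
newtype AD a b = AD (a -> (b, a -> b))

-- Composition, as forced by the homomorphism equation for 'ad':
-- the derivative of the composite at a is the composite of the derivatives,
-- with g's derivative taken at f a (the chain rule).
compAD :: AD b c -> AD a b -> AD a c
compAD (AD g) (AD f) = AD (\a ->
  let (b, f') = f a
      (c, g') = g b
  in (c, g' . f'))
```

Crucially, the pair carries f's value along so that g's derivative is sampled at the right point; this is exactly why pairing a function with its derivative restores compositionality.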

@dramforever

Oh, wait a bit. By 'cartesian' you mean products? In that case we can just define first, assoc, etc. for F as if they were lenses/arrows. dup forks the input and adds the two gradients together. How's that?

I hope to get a chance to get down to the code some time...
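That description can be written down directly. A minimal sketch (exlF, exrF, and dupF are hypothetical names, and I assume gradients can be added and zeroed via a Num constraint):

```haskell
-- Sketch of product-style operations for the backward type.
newtype F a b = F (a -> (b, b -> a))

exlF :: Num b => F (a, b) a
exlF = F (\(a, _) -> (a, \da -> (da, 0)))   -- dropped component gets zero gradient

exrF :: Num a => F (a, b) b
exrF = F (\(_, b) -> (b, \db -> (0, db)))

dupF :: Num a => F a (a, a)
dupF = F (\a -> ((a, a), \(d1, d2) -> d1 + d2))  -- fan-in: add the two gradients
```

The addition in dupF is the fan-in mentioned above: a value used twice receives the sum of the gradients flowing back along both uses.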

@dramforever

The idea is: instead of storing the Jacobian d out_j / d in_i as a matrix (AD) or as an ordinary forward function (ADFun), we store it as a function that carries gradients backwards, i.e. dF / d out_j -> dF / d in_i for some F. Say F is our network loss; then we can compute the whole gradient by applying the final differential to 1, and because dF/dF = 1, we get the whole dF / d in_i gradient in one go. The ADFun formulation can't do this in one go, and AD is inefficient in cases like the map tanh mentioned above.

I haven't really thought about the 'inverse' thing, but I don't think that's what I have in mind.

@dramforever

My bad; cartesian means 'has products'. In that case it's easy to make it cartesian, as described above.

Do you have any idea on having observable sharing in a cartesian closed category (w/ exponentials)? I feel like I can't.

@conal

conal commented Dec 14, 2017

By 'cartesian' you mean products? In that case we can just define first assoc etc Fs as if they were lenses/arrows. dup forks the input and adds the two gradients together. How's that?

Yes, products. In concat, you'd define an instance of ProductCat, defining exl, exr, and (&&&) (and optionally some other methods).

I encourage you to give a specification so that you can prove or derive/calculate your definitions (for all instances, not just ProductCat). (Code without a specification is an answer without a question --- "not even wrong".)

@conal

conal commented Dec 14, 2017

My bad; cartesian means 'has products'. In that case it's easy to make it cartesian as described above

In any case, your F definition is simple, and the needed exl, exr, and (&&&) all have simple types, so it should be easy to try giving definitions that type-check and that you can check against a precise & simple specification.

Do you have any idea on having observable sharing in a cartesian closed category (w/ exponentials)? I feel like I can't.

In a purely denotative/functional setting, sharing is deeply problematic at run time but trivially easy at compile time. CtC happens at compile time. In the categorical formulation, sharing is especially simple, because it's exactly captured in (&&&) (equivalently, dup).

@dramforever

Okay! Now if we don't need ClosedCat this will do:

The specification is:

newtype BP a b = BP (a -> (b, (b -> a)))

bp :: (a -> b) -> BP a b
bp (\x -> y) = BP ((\x -> y) &&& (\x (dF/dy) -> (dF/dx)))

where a and b are vector spaces with the same scalar type (like your AD). x is the input vector and y is the output vector. F is some final scalar result, dF/dx and dF/dy are the gradients of F wrt x and y.

For example

bp (fmap sin) = BP (fmap sin &&& (\x ds -> zipWith (*) ds (fmap cos x)))

@conal

conal commented Dec 15, 2017

Okay. Maybe we're getting closer. Can you make this specification more precise? And as a transformation on functions rather than (what I guess is) some sort of syntactic construct (a lambda expression)? For instance, a derivative of a function (not syntax) of type a -> b is a function from a to linear maps from a to b (i.e., having type a -> (a :-* b)) that satisfies a certain limit property. There's no need to mention variable names (syntax).

It's unclear to me what operation you mean by "\ (dF/dy) -> (dF/dx)" and how/where the parameter x gets used. I know you're using conventional notation here, but I don't think it's well-defined, just customary.

Maybe what you're getting at has to do with the inverse of the derivative of the given function at a given point, something like

bp f = BP (f &&& inverse . deriv f)

Of course, inverses are often undefined, so maybe this isn't it. I still think it's about duality, starting with the usual coproduct for vector spaces (direct sum, represented by a cartesian product) and then using the opposite category, reversing arrows and swapping products and coproducts. The current concat implementation doesn't have per-category associated products, coproducts, exponentials, booleans, etc, so I can't yet express this construction with the elegance and generality I'd like, but we'll get there.

@conal

conal commented Dec 15, 2017

Revisiting my previous comment: rather than inverses, this seems to be about duality, starting with the usual coproduct for vector spaces (direct sum, represented by a cartesian product) and then using the opposite category, reversing arrows and swapping products and coproducts. The current concat implementation doesn't have per-category associated products, coproducts, exponentials, booleans, etc, so I can't yet express this construction with the elegance and generality I'd like, but we'll get there. Concretely:

newtype BP s a b = BP (a -> (b, ((b :-* s) -> (a :-* s))))

where b :-* s is the dual space for b, which is isomorphic to b for finite-dimensional vector spaces.
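Concretely, taking s ~ Double and representing the linear functionals b :-* s as plain Haskell functions b -> s, the refined type and its composition might look like this (my sketch, not the concat definitions):

```haskell
-- Sketch: the duality-flavored refinement, with linear functionals
-- (b :-* s) represented as plain functions b -> s.
newtype BPd s a b = BPd (a -> (b, (b -> s) -> (a -> s)))

-- Composition threads functionals backwards through the chain.
compBPd :: BPd s b c -> BPd s a b -> BPd s a c
compBPd (BPd g) (BPd f) = BPd (\a ->
  let (b, f') = f a
      (c, g') = g b
  in (c, f' . g'))
```

Starting a reverse pass then amounts to feeding in the identity functional (dF/dF = 1) at the output end.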

@dramforever

dramforever commented Dec 16, 2017

I wouldn't say it's an inverse. I think 'dual' is more appropriate.

I used lambdas because I was expressing bp as the solution of a functional equation (a differential equation, if you insist).

This is getting tricky I admit. Do you understand my example?

bp (fmap sin) = BP (fmap sin &&& (\x ds -> zipWith (*) ds (fmap cos x)))

Let's write d instead of ∂ for partial derivatives so that it's easier to type.

Say we have [a, b, c, d], and when we fmap sin over it we get [p, q, r, s]. bp (fmap sin) does not need to know this, but say the final result is F = 7p + 8q + 5r + 9s. 'Intuitively', dF/dx means: treat F as a function of the one variable x and take the derivative of that. Let's ignore [a, b, c, d] and take the gradient of F with respect to [p, q, r, s]: of course, dF/dp = 7, dF/dq = 8, etc. And p depends only on a, so dp/da = sin'(a) = cos a. Then apply the chain rule to get dF/da = (dF/dp) * (dp/da) = 7 cos a. Indeed, F = 7 sin a + 8 sin b + 5 sin c + 9 sin d, so the answer is correct.

Or maybe I could relate my approach to yours. Suppose x and y are vectors and x_i, y_j are their components, and write dF/dx for a gradient. A function taking an argument x and producing y (hence \x -> y!) has an associated matrix with elements A_{i,j} = dy_i / dx_j. Indeed, that's what you are currently using in AD. And here's the thing: we can turn that matrix into a function. Yes, it's a linear function, but it's represented by a (say) Haskell function rather than by a matrix, so we can compose the pieces easily while allowing each one to optimize itself, as in the gradient case. And instead of running it forwards (ADFun), let's run it backwards, sending dF/dy to dF/dx. Then we just tell the final function that dF/dF = 1, and it goes back through the chain of functions to produce the gradient dF/dx.

Phew...

@dramforever

Observation: Write J(f) for the Jacobian of f, then J(f . g) = J(f) J(g), where juxtaposition means matrix multiplication, and . is function composition.
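A quick numeric sanity check of this identity, with lists of rows standing in for matrices (a self-contained sketch, nothing from the repo): for f(x,y) = (x*y, x+y) and g(u,v) = (u+v, u*v), multiplying the hand-written Jacobians reproduces the Jacobian of the composite.

```haskell
import Data.List (transpose)

-- Naive matrix product on lists of rows.
matMul :: [[Double]] -> [[Double]] -> [[Double]]
matMul a b = [[sum (zipWith (*) row col) | col <- transpose b] | row <- a]

-- Hand-written Jacobians at a point:
-- g(u,v) = (u+v, u*v)  and  f(x,y) = (x*y, x+y)
jG, jF :: Double -> Double -> [[Double]]
jG u v = [[1, 1], [v, u]]
jF x y = [[y, x], [1, 1]]

-- Chain rule (note: jF is sampled at g(u,v), not at (u,v)):
jFG :: Double -> Double -> [[Double]]
jFG u v = jF (u + v) (u * v) `matMul` jG u v
```

At (u,v) = (2,3) this gives [[21,16],[4,3]], matching the partial derivatives of the composite (u+v)uv and u+v+uv computed by hand.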

@MarisaKirisame

Somewhat related paper: Backprop as Functor: A compositional perspective on supervised learning

@conal

conal commented Dec 19, 2017

Observation: Write J(f) for the Jacobian of f, then J(f . g) = J(f) J(g), where juxtaposition means matrix multiplication, and . is function composition.

Note that matrix "multiplication" is just an implementation of linear map composition, so the property is that the derivative of the composition is the composition of the derivatives. Almost. Really it's deriv (g . f) x == deriv g (f x) . deriv f x. Thus differentiation is not quite functorial/compositional, since deriv (g . f) depends not just on deriv g and deriv f, but also f itself. To regain functoriality/composability, calculate both the function and its derivative. More explanation in Beautiful differentiation and Compiling to categories.

@conal

conal commented Dec 22, 2017

I've long thought that forward mode and reverse modes of AD (FAD & RAD) are both special cases of the same algorithm (the one in the CtC paper and this repo) corresponding to right- and left-associating function compositions, respectively. Then this discussion suggested to me that maybe RAD is about duality/transposition. Now I see that they're the same thing. Thanks for the stimulating conversation, @dramforever!

@dramforever

I'm glad that we were able to figure it out. And by the way, do you (or anybody else) think that we can compute, say, the Hessian? I've been trying to devise a method but have had no luck.

@conal

conal commented Jan 17, 2018

@dramforever Check out the talk The simple essence of automatic differentiation, to see where I went with this exploration.

I think the Hessian can be done simply enough, but it's not a gradient computation, so it's not clear to me whether to use forward or reverse or mixed mode. Might be a good application of the optimal re-association mentioned on slide 22.

@conal conal closed this as completed Jan 17, 2018
@conal

conal commented Oct 1, 2018

I'm very sorry for failing to acknowledge you in the camera-ready version that appears in the ICFP proceedings. Authors were asked to exclude acknowledgments in paper submissions in order to support the anonymous review process. For most of the time between paper acceptance and submission of the final camera-ready version, I was away from home helping to deal with a family medical crisis. During this period, my concentration and clarity of thought were much reduced, and I failed to correct the omission.

I just added an acknowledgment on the paper's web page. I would be happy to replace your github handle with your IRL name if you want to share it with me, or I can leave the handle as is. When I hear your preference, I'll also update the extended version of the paper on arXiv.

Again, I'm grateful for this conversation, which spurred me to think more carefully about reverse-mode automatic differentiation (RAD) and its specialization to scalar-valued functions (as in backpropagation). The essential insights turned out to be a pair of isomorphisms for linear maps: (a) the Yoneda embedding and (b) vector space duality (relating a vector space to the space of linear maps from that space to its own scalar field). The Yoneda embedding (Cont in the paper) with the generalized matrix category leads to general RAD, while vector space duality allows optimizing this Yoneda/continuation/RAD representation to Dual (⥅) a b, which is essentially b -> a restricted to types having addition and zero. The algorithms follow from solving a collection of homomorphism equations involving these isomorphisms. See theorems 4 and 5 and their proofs (in the extended version).

@dramforever

Thank you for the acknowledgement. As for my name, you can refer to me using my real name Wang Ruikang.

To be honest, I had not expected the idea to come this far. Without you and this conversation, the idea would have remained just 'I thought of it on the subway and it's cool'. So I would also like to thank you for taking the time to polish this immature idea, and going so far as to fit it neatly within the larger picture of categorical constructions.

@dramforever

Also, I was dealing with other things during most of this thread, so I never really got down to putting the idea into code. Looking back, I certainly did not communicate what I was thinking well. I was expecting this to be forgotten, but you took it seriously. So thanks a lot for your patience and interest.

@conal

conal commented Oct 3, 2018

Thanks for sharing your name. I've updated the paper's web page and added the acknowledgement to the extended version of the paper on arXiv.

I suspect that there are many more variations of AD to be found.

Best wishes!

@dramforever

That's nice. Thanks!
