**tl;dr: I explain the mathematical background of a joke-explanation of monads. Contains lots of math and a hasty introduction to category theory.**

There is a running gag in the programming community, that newcomers will often be confused by the concept of monads (which is how sequential computations are modeled in purely functional languages) and getting the explanation "it is simple, really: Monads are just monoids in the category of endofunctors". This is not meant as an actual explanation, but rather to poke a bit of fun at the habit of functional programmers to give quite abstract and theoretical explanations at times, that are not all that helpful.

However, given my background in mathematics, I decided that I wanted to actually approach Haskell from this point of view: I am interested in how it uses math to model programming and also to, after several years of doing mostly engineering focused programming work, flex my math muscles again - as there is quite a bit of interesting math behind these concepts.

The quote is from a pretty popular book about category theory and is, in full:

All told, a monad in \(X\) is just a monoid in the category of endofunctors of \(X\), with product \(\times\) replaced by composition of endofunctors and unit set by the identity endofunctor.

This, of course, is an explanation of the *mathematical* concept of monads,
not meant for programmers. Most explanations of the quote that I found either
assumed quite a bit of knowledge in Haskell or took a lot of liberties with the
mathematical concepts (and relied a lot on "squinting") or both. This write up
is my attempt, to walk through all the concepts needed to explain monads as a
mathematical concept and how it relates to Haskell - with as little squinting
as possible.

Of course, there are a couple of disclaimers, I should start with:

- This is not the best way to understand what monads are, if you are actually interested in using them to program. In fact, it is literally the worst way. I would recommend this intro, which takes a much more practical approach.
- This is not the best way to understand how category theory works, if you are actually interested in learning mathematics. In fact, it is literally the worst way. I would recommend the book the quote is from, it's quite good (but assumes a math audience).
- I haven't done mathematics in years. I also don't know much Haskell either. So I might be getting a bunch of stuff wrong to varying degrees. I'm sure I will hear all about it :)
- Even if I would
*understand*everything correctly, there are still a lot of details, mostly of technical nature, I had to omit, to keep this "short". Not that it is short.

Originally, I intended this to be the ultimate explanation, which would teach Haskellers category theory, mathematicians Haskell and people who know neither both. Unsurprisingly, this is not what this is, at all. It ended up mostly a write up to assure myself that I understood the path myself. If anything, you can treat this as a kind of "reading companion": If you want to understand this topic of the intersection between category theory and functional programming, this post can lead you through the correct terms to search for and give you a good idea what to focus on, in the respective Wikipedia articles.

With all that out of the way, let's begin.

##### Categories

In mathematics, a category is (roughly) a collection of objects and a
collection of arrows between them. There is not a lot of meaning behind these,
but it will probably help you to think of objects as sets and arrows as
mappings. Every arrow goes from an object (the *domain*) to an object (the
*codomain*) and we write an arrow as \(f:X\to Y\), where \(f\) is the name
of the arrow, \(X\) is the domain and \(Y\) is the codomain. Just like with
mappings, there can be many arrows between any given pair of objects - or there
may be none.

We do need *some* restrictions: First, we require a specific *identity* arrow
\(\mathrm{id}:X\to X\) attached to every object \(X\), which has \(X\) as
both domain and codomain. Secondly, we require (some) arrows to be
*composable*. That is if we have two arrows \(f:X\to Y,g:Y\to Z\) - so,
whenever the domain of \(g\) is the codomain of \(f\) - there should also
be a composed arrow¹
\(g\circ f: X\to Z\), that shares the domain with \(f\) and the codomain with
\(g\).

Furthermore, the identity arrows must act as a *unit* for composition, that is,
for every arrow \(f\) we require \(\mathrm{id}\circ f = f = f
\circ\mathrm{id}\). We also require composition to be *associative*, that is
\((f\circ g)\circ h = f\circ(g\circ h)\) (whenever all compositions exist)².

When we talk about a category, we often draw diagrams like this:

They show some of the objects and arrows from the category in a compact way.
This particular diagram indicates that there are four objects and four arrows
involved, with obvious domains and codomains. We only draw a subset of the
objects and arrows, that is interesting for the point we are trying to make -
for example, above diagram could also contain, of course, identity arrows and
compositions \(p\circ f\) and \(q\circ g\)), but we didn't draw them. In
a square like this, we can take two paths from \(X\) to \(W\). If these
paths are identical (that is, \(p\circ f = q\circ g\), we say that the
square *commutes*. A *commutative* diagram is a diagram, in which any square
commutes, that is, it does not matter which path we take from any object to
another. Most of the time, when we draw a diagram, we intend it to be
commutative.

So, to summarize, to define a mathematical category, we need to:

- Specify what our objects are
- Specify what our arrows are, where each arrow starts and ends at a certain object
- This collection of arrows need to include an arrow \(\mathrm{id}_X\) for every object \(X\), which starts and ends at \(X\)
- And we need to be able to glue together arrows \(f:X\to Y\) and \(g:Y\to Z\) to an arrow \(g\circ f: X\to Z\)

In Haskell, we work on the category **Hask**, which consists of:

- The objects are
*types*:`Int`

is an object,`String`

is an object but also`Int | String`

,`String -> Int`

and any other complicated type you can think of. - The arrows are
*functions*:`f :: a -> b`

is a function taking an`a`

as an input and returning a`b`

and is represented by an arrow`f`

, which has`a`

as its domain and`b`

as its codomain. So, for example,`length :: String -> Int`

would start at the type`String`

and end at`Int`

. - Haskell has a function
`id :: a -> a`

which gives us the identity arrow for any type`a`

. - We can compose functions with the operator
`(.) :: (b -> c) -> (a -> b) -> (a -> c)`

. Note, that this follows the swapped notation of \(\circ\), where the input type of the left function is the output type of the right function.

In general, category theory is concerned with the *relationship between*
categories, whereas in functional programming, we usually only deal with this
one category. This turns out to be both a blessing and a curse: It means that
our object of study is much simpler, but it also means, that it is sometimes
hard to see how to apply the general concepts to the limited environment of
functional programming.

##### Monoids

Understanding categories puts us in the position to understand *monoids*. A
monoid is the generalized structure underlying concepts like the natural
numbers: We can *add* two natural numbers, but we can't (in general) *subtract*
them, as there are no negative numbers. We also have the number \(0\), which,
when added to any number, does nothing - it acts as a *unit* for addition. And
we also observe, that addition is *associative*, that is, when doing a bunch of
additions, the order we do them in doesn't matter.

The same properties also apply to other constructs. For example, if we take all maps from a given set to itself, they can be composed and that composition is associative and there is a unit element (the identity map).

This provides us with the following elements to define a monoid:

- A set \(M\)
- An operation \(\star\colon M\times M\to M\), which "adds" together two elements to make a new one
- We need a special unit element \(u\in M\), which acts neutrally when added to any other element, that is \(m\star u=m=u\star m\)
- The operation needs to be associative, that is we always require \(m\star(n\star k)=(m\star n)\star k\)

There is another way to frame this, which is closer in line with category theory.
If we take \(1 := \{0\}\) to be a 1-element set, we can see that the
elements of \(M\) are in a one-to-one correspondence to functions \(1\to M\):
Every such function chooses an element of \(M\) (the image of \(0\)) and
every element \(m\in M\) fixes such a function, by using \(f(0) := m\).
Thus, instead of saying "we need a special element of \(M\)", we can also
choose a special *function* \(\eta: 1\to M\). And instead of talking about an
"operation", we can talk about a function \(\mu: M\times M\to M\). Which
means, we can define a monoid via a commutative diagram like so:

\(\pi_1\) and \(\pi_2\) here, are the functions that project to the first or second component of a cross product respectively (that is \(\pi_1(a, b) := a, \pi_2(a, b) := b\)) and e.g. \(\mathrm{id}\times\eta\) is the map that applies \(\mathrm{id}\) to the first component of a cross-product and \(\eta\) to the second: \(\mathrm{id}\times\eta(m, 0) = (m, \eta(0))\).

There are four sub-diagrams here:

- The first diagram just says, that we need an arrow \(\eta:1\to M\). This chooses a unit element for us.
- Likewise, the second diagram just says, that we need an arrow \(\mu:M\times M\to M\). This is the operation.
- The third diagram tells us that the chosen by \(\eta\) should be a unit for \(\mu\). The commutativity of the left square tells us, that it should be right-neutral, that is \[ \forall m\in M: m = \pi_1(m, 0) = \mu(\mathrm{id}\times\eta(m, 0)) = \mu(m, \eta(0)) \] and the commutativity of the right square tells us, that it should be left-neutral, that is \[ \forall m\in M: m = \pi_2(0,m) = \mu(\eta\times\mathrm{id}(0, m)) = \mu(\eta(0), m) \]

Thus, the first diagram is saying that the element chosen by \(\eta\) should act like a unit. For example, the left square says

\[\pi_1(m,0) = \mu((\mathrm{id}\times\eta)(m,0)) = \mu(m,\eta(0))\]

Now, writing \(\mu(m,n) = m\star n\) and \(\eta(0) = u\), this is equivalent to saying \(m = u\star m\).

The second diagram is saying that \(\mu\) should be associative: The top arrow combines the first two elements, the left arrow combines the second two. The right and bottom arrows then combine the result with the remaining element respectively, so commutativity of that square means the familiar \(m\star (n\star k) = (m\star n)\star k\).

Haskell has the concept of a monoid too. While it's not really relevant to the discussion, it might be enlightening to see, how it's modeled. A monoid in Haskell is a type-class with two (required) methods:

```
class Monoid a where
mempty :: a
mappend :: a -> a -> a
```

Now, this gives us the operation (`mappend`

) and the unit (`a`

), but where are
the requirements of associativity and the unit acting neutrally? The Haskell
type system is unable to codify these requirements, so they are instead given
as a "law", that is, any implementation of a monoid is supposed to have these
properties, to be manually checked by the programmer:

`mappend mempty x = x`

(the unit is left-neutral)`mappend x mempty = x`

(the unit is right-neutral)`mappend x (mappend y z) = mappend (mappend x y) z`

(the operation is associative)

##### Functors

I mentioned that category theory investigates the relationship between
categories - but so far, everything we've seen only happens inside a single
category. Functors are, how we relate categories to each other. Given two
categories \(\mathcal{B}\) and \(\mathcal{C}\), a *functor*
\(F:\mathcal{B}\to \mathcal{C}\) assigns to every object \(X\) of
\(\mathcal{B}\), an object \(F(X)\) of \(\mathcal{C}\). It also assigns
to every arrow \(f:X\to Y\) in \(\mathcal{B}\) a corresponding arrow
\(F(f): F(X)\to F(Y)\) in \(\mathcal{C}\)³. So, a functor transfers arrows from one category
to another, preserving domain and codomain. To actually preserve the
structure, we also need it to preserve the extra requirements of a category,
identities and composition. So we need, in total:

- An object map, \(F:O_\mathcal{B} \to O_\mathcal{C}\)
- An arrow map, \(F:A_\mathcal{B}\to A_\mathcal{C}\), which preserves start and end object, that is the image of an arrow \(X\to Y\) starts at \(F(X)\) and ends at \(F(Y)\)
- The arrow map has to preserve identities, that is \(F(\mathrm{id}_X) = \mathrm{id}_{F(X)}\)
- The arrow map has to preserve composition, that is \(F(g\circ f) = F(g)\circ F(f)\).

A trivial example of a functor is the *identity functor* (which we will call
\(I\)), which assigns each object to itself and each arrow to itself - that
is, it doesn't change the category at all.

A simple example is the construction of the *free monoid*, which maps from the
category of sets to the category of monoids. The Free monoid \(S^*\) on a
set \(S\) is the set of all finite length strings of elements of \(S\),
with concatenation as the operation and the empty string as the unit. Our
object map then assigns to each set \(S\) its free monoid \(S^*\). And our
arrow map assigns to each function \(f:S\to T\) the function \(f^*:S^*\to
T^*\), that applies \(f\) to each element of the input string.

There is an interesting side note here: Mathematicians love to abstract.
Categories arose from the observation, that in many branches of mathematics we
are researching some class of objects with some associated structure and those
maps between them, that preserve this structure. It turns out that category
theory is a branch of mathematics that is researching the objects of
categories, with some associated structure (identity arrows and composition)
and maps (functors) between them, that preserve that structure. So it seems
obvious that we should be able to view categories *as objects of a category*,
with functors as arrows. Functors can be composed (in the obvious way) and
every category has an identity functor, that just maps every object and arrow
to itself.

Now, in Haskell, Functors are again a type class:

```
class Functor f where
fmap :: (a -> b) -> (f a -> f b)
```

This looks like our arrow map: It assigns to each function `g :: a -> b`

a
function `fmap g :: f a -> f b`

. The object map is implicit: When we write `f a`

,
we are referring to a new type, that depends on `a`

- so we "map" `a`

to `f a`

⁴.

Again, there are additional requirements the type system of Haskell can not capture. So we provide them as laws the programmer has to check manually:

`fmap id == id`

(preserves identities)`fmap (f . g) == fmap f . fmap g`

(preserves composition)

There is one thing to note here: As mentioned, in Haskell we only really deal
with one category, the category of types. That means that a functor always maps
from the category of types to *itself*. In mathematics, we call such a functor,
that maps a category to itself, an *endofunctor*. So we can tell, that in
Haskell, every functor is automatically an endofunctor.

##### Natural transformations

We now understand categories and we understand functors. We also understand,
that we can look at something like the category of categories. But the
definition of a monad given to us talks about the *category of endofunctors*.
So we seem to have to step up yet another level in the abstraction hierarchy
and somehow build this category. As objects, we'd like to have endofunctors -
and arrows will be *natural transformations*, which take one functor to
another, while preserving its internal structure (the mapping of arrows). If
that sounds complicated and abstract, that's because it is.

We need two functors \(F,G:\mathcal{B}\to \mathcal{C}\) of the same "kind"
(that is, mapping to and from the same categories). A natural transformation
\(\eta:F\dot\to G\) assigns an arrow⁵ \(\eta_X: F(X)\to G(X)\) (called a *component* of
\(\eta\)) to every object in \(\mathcal{B}\). So a component \(\eta_X\)
describes, how we can translate the action of \(F\) on \(X\) into the
action of \(G\) on \(X\) - i.e. how to translate their object maps. We also
have to talk about the translation of the arrow maps. For that, we observe that
for any arrow \(f:X\to Y\) in \(\mathcal{B}\), we get four new arrows in
\(\mathcal{C}\):

For a natural transformation, we require the resulting square to commute.

So, to recap: To create a natural transformation, we need

- Two functors \(F,G:\mathcal{B}\to\mathcal{C}\)
- For every object \(X\) in \(\mathcal{B}\), an arrow \(\eta_X: F(X)\to G(X)\)
- The components need to be compatible with the arrow maps of the functors: \(\eta_Y\circ F(f) = G(f)\circ \eta_X\).

In Haskell, we can define a natural transformation like so:

```
class (Functor f, Functor g) => Transformation f g where
eta :: f a -> g a
```

`f`

and `g`

are functors and a natural transformation from `f`

to `g`

provides
a map `f a -> g a`

for every type `a`

. Again, the requirement of compatibility
with the actions of the functors is not expressible as a type signature, but we
can require it as a law:

`eta (fmap fn a) = fmap fn (eta a)`

##### Monads

This, finally, puts us in the position to define monads. Let's look at our quote above:

All told, a monad in \(X\) is just a monoid in the category of endofunctors of \(X\), with product \(\times\) replaced by composition of endofunctors and unit set by the identity endofunctor.

It should be clear, how we can *compose* endofunctors. But it is important,
that this is a different view of these things than if we'd look at the category
of categories - there, objects are categories and functors are arrows, while
here, objects are *functors* and arrows are natural transformations. That
shows, how composition of functors can take the role of the cross-product of
sets: In a set-category, the cross product makes a new set out of two other
set. In the category of endofunctors, composition makes a new endofunctor out
of two other endofunctors.

When we defined monoids diagrammatically, we also needed a cross product of mappings, that is, given a map \(f:X_1\to Y_1\) and a map \(g:X_2\to Y_2\), we needed the map \(f\times g: X_1\times X_2\to Y_1\times Y_2\), which operated on the individual constituents. If we want to replace the cross product with composition of endofunctors, we need an equivalent for natural transformations. That is, given two natural transformations \(\eta:F\to G\) and \(\epsilon:J\to K\), we want to construct a natural transformation \(\eta\epsilon:J\circ F\to K\circ G\). This diagram illustrates how we get there (working on components):

As we can see, we can build an arrow \(\epsilon_{G(X)}\circ J(\eta_X):
J(F(X)) \to K(G(X))\), which we can use as the components of our natural
transformation \(\eta\epsilon:J\circ F\to K\circ G\). This construction is
called the *horizontal composition* of natural transformations. We should
verify that this is indeed a natural transformation - for now, let's just
accept that it follows from the naturality of \(\eta\) and \(\epsilon\).

Lastly, there is an obvious natural transformation taking a functor to itself; each component being just the identity arrow. We call that natural transformation \(\iota\), staying with the convention of using Greek letters for natural transformations.

With this, we can redraw the diagram we used to define monoids above, the replacements indicated by the quotes:

The vertical arrows in the middle diagram now simply apply the composition of functors, using that the identity functor is a unit.

These diagrams encode these conditions on our natural transformations⁶:

- \(\mu\circ\eta\iota = \mu = \iota\eta\circ\mu\), that is \(\eta\) serves as a unit
- \(\mu\circ\mu\iota = \mu\circ\iota\mu\), that is \(\mu\) is associative

To recap, a monad, in category theory, is

- An endofunctor \(M\)
- A natural transformation \(\eta: I\to M\), which serves as an identity for horizontal composition.
- A natural transformation \(\mu: M\circ M\to M\), which is associative in respect to horizontal composition.

Now, let's see, how this maps to Haskell monads.

First, what is the identity functor in Haskell? As we pointed out above, the
object function of functors is implicit, when we write `f a`

instead of `a`

. As
such, the identity functor is simply `a`

- i.e. we map any type to itself.
`fmap`

of that functor would thus also just be the identity
`fmap :: (a -> a) -> (a -> a)`

.

So, what would our natural transformation \(\eta\) look like? As we said, a
natural transformation between two functors is just a map `f a -> g a`

. So (if
we call our endofunctor `m`

) the identity transformation of our monoid is
`eta :: a -> m a`

mapping the identity functor to `m`

. We also need our monoidal operation, which
should map `m`

applied twice to `m`

:
`mu :: m (m a) -> m a`

.

Now, Haskellers write `return`

instead of `eta`

and write `join`

instead of
`mu`

, giving us the type class⁷

```
class (Functor m) => Monad where
return :: a -> m a
join :: m (m a) -> m a
```

As a last note, it is worth pointing out that you usually won't implement
`join`

, but instead a different function, called "monadic bind":

```
(>>=) :: m a -> (a -> m b) -> m b
```

The reason is, that this more closely maps to what monads are actually *used*
for in functional programming. But we can move between `join`

and `>>=`

via

```
(>>=) :: m a -> (a -> m b) -> m b
v >>= f = join ((fmap f) v)
join :: m (m a) -> m a
join v = v >>= id
```

##### Conclusion

This certainly was a bit of a long ride. It took me *much* longer than
anticipated both to understand all the steps necessary and to write them down.
I hope you found it helpful and I hope I didn't make too many, too glaring
mistakes. If so (either), feel free to let me know on
Twitter,
reddit
or Hacker News - but please
remember to be kind :)

*I want to thank Tim Adler and
mxf+ for proof-reading this absurdly long post
and for making many helpful suggestions for improvements*

[1] It is often confusing to people, that the way the arrows point in the notation and the order they are written seems to contradict each other: When writing \(f:X\to Y\) and \(g:Y\to Z\) you might reasonably expect their composite to work like \(f\circ g: X\to Z\), that is, you glue together the arrows in the order you are writing them.

The fact that we are not doing that is a completely justified criticism,
that is due to a historical accident - we write function application from
right to left, that is we write \(f(x)\), for applying \(f\) to \(x\).
Accordingly, we write \(g(f(x))\), when applying \(g\) to the result of
applying \(f\) to \(x\). And we chose to have the composite-notation be
consistent with *that*, instead of the arrow-notation.

I chose to just eat the unfortunate confusion, as it turns out Haskell is doing exactly the same thing, so swapping things around would just increase the confusion.

Sorry. ⬆

[2] Keep in mind that this is a different notion from the
ones for monoids, which we come to a bit later: While the formulas seem the
same and the identities look like a unit, the difference is that only certain
arrows can be composed, not all. And that there are many identity arrows, not
just one. However, if we would have only *one* object, it would have to be the
domain and codomain of every arrow and there would be exactly one identity
arrow. In that case, the notions *would* be the same and indeed, "a category
with exactly one object" is yet another way to define monoids.
⬆

[3] It is customary, to use the same name for the object and arrow map, even though that may seem confusing. A slight justification of that would be, that the object map is already given by the arrow map anyway: If \(F\) is the arrow map, we can define the object map as \(X\mapsto \mathrm{dom}(F(\mathrm{id}_X))\). So, given that they are always occurring together and you can make one from the other, we tend to just drop the distinction and save some symbols.

What was that? Oh, you thought Mathematicians where precise? Ha! ⬆

[4] It is important to note, that this is not really a
*function*. Functions operate on values of a given type. But here, we are
operating on *types* and Haskell has no concept of a "type of types" built in
that a function could operate on. There are constructs operating on types to
construct new types, like `data`

, `type`

, `newtype`

or even `deriving`

. But
they are special syntactical constructs that exist outside of the realm of
functions.

This is one of the things that was tripping me up for a while: I was trying to figure out, how I would map types to other types in Haskell or even talk about the object map. But the most useful answer is "you don't". ⬆

[5] An important note here, is that the \(\eta_X\) are
*arrows*. Where the object map of a functor is just a general association which
could look anything we like, the components of a natural transformation need to
preserve the internal structure of the category we are working in.
⬆

[6] You will often see these conditions written differently, namely written e.g. \(\mu M\) instead of \(\mu\iota\). You can treat that as a notational shorthand, it really means the same thing. ⬆

[7] There is a technicality here, that Haskell also has an intermediate between functor and monad called "applicative". As I understand it, this does not have a clear category theoretical analogue. I'm not sure why it exits, but I believe it has been added into the hierarchy after the fact. ⬆