Why do we need stochastic calculus?

July 22, 2014

It’s a good question. Probe further down the line and you begin to ask why we even need rigorous mathematics. A simple motivating example is enough to see why we need a calculus that extends to random phenomena.

We know with certainty that the sun will shine in the morning and the moon will glow at night. All we need to do is look out of the window and we’ll know which event occurs: we have total information. What happens when we don’t have so much information? Consider an atom, unobservable to the human eye. The uncertainty principle tells us we cannot have full information about this atom. Furthermore, the act of observing the atom perturbs its movement a little.

In a classical, information-rich scenario, the instantaneous change of some phenomenon is given via a classical mechanical framework

change in system = deterministic function of system

or in mathematical symbols,

dX(t) = a(t,X(t)) dt

We now add some perturbation \mathcal{N}, caused by our observing the particle or by a lack of precision in the measuring apparatus and so on, to get the framework

change in system =  deterministic function of system + random perturbation

 or in mathematical symbols,

dX(t) = a(t,X(t)) dt + \mathcal{N}.

As this perturbation is not known a priori (if it were, the system would be deterministic), we call it random!

Classical calculus does not deal with anything ‘random’, and that is why a new form of calculus is needed: one for systems that are random over time, i.e. stochastic systems, i.e. stochastic processes, hence the name stochastic calculus.

If we are not careful, we may have a new form of calculus that leads to nonsensical results, which is why we must be rigorous.

Below is a plot of a particle in integer time-steps with different perturbations \mathcal{N}.
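Such a particle can be sketched in a few lines. The drift a(t, x) = -x/2, the unit time-step and the starting point below are my own choices, not from the post; the point is only that swapping the perturbation \mathcal{N} changes the character of the path:

```python
import random

def simulate(a, noise, x0=0.0, steps=50, dt=1.0, seed=0):
    """Simulate X_{t+1} = X_t + a(t, X_t)*dt + noise(rng) in integer time-steps."""
    rng = random.Random(seed)
    x, path = x0, [x0]
    for t in range(steps):
        x = x + a(t, x) * dt + noise(rng)
        path.append(x)
    return path

# Hypothetical mean-reverting drift (an assumption for illustration):
drift = lambda t, x: -0.5 * x

no_noise = lambda rng: 0.0                     # deterministic system
gaussian = lambda rng: rng.gauss(0.0, 1.0)     # Normal perturbation
uniform = lambda rng: rng.uniform(-1.0, 1.0)   # bounded perturbation

deterministic = simulate(drift, no_noise, x0=4.0)
noisy = simulate(drift, gaussian, x0=4.0)
```

With the perturbation switched off the path is fully determined by the drift; with it switched on every run (every seed) gives a different path, which is exactly the ‘random over time’ behaviour stochastic calculus is built for.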


Best Textbooks

April 11, 2014

Here is a list of the best textbooks I have read on the following subjects:

  • mathematics
  • statistics
  • machine learning
  • economics
  • computer science


Mathematics

These are essential.

A First Course in Real Analysis by M.H. Protter and C.B. Morrey.


Explains all the basic concepts of real analysis. Avoids the subtle delicacies of the actual construction of the continuum: there is little talk of (say) Dedekind cuts, as opposed to rigorous demonstrations of the properties of the real numbers. Naturally there is little talk of deeper concepts such as Zorn’s lemma or posets. But a good understanding of this book is essential in nearly every area of science that needs to use an epsilon sign. A caveat for books on real analysis: if they do not give a construction of the real numbers or at least explain one, it is likely that the book will end up cheating at some point, i.e. using ideas that it is trying to prove.

A good alternative is Introduction to Real Analysis by M. Stoll.

An Introduction to Complex Analysis by R.P. Agarwal, K. Perera, S. Pinelas.


Most complex analysis books are ‘ruined’ by giving a good theoretical introduction and then wandering off into some very applied problems and the theory specific to them. A lot of people just want to understand complex numbers. This book does that well and is organised as a series of ‘lectures’. Bonus points for its ending: it explains the Julia and Mandelbrot sets. A lot of people regard the Mandelbrot set as the most complicated mathematical object. The theory here is relatively light and introductory.

The Elements of Integration and Lebesgue Measure, R.G. Bartle.


Quite simply the best introduction to measure theory. If you have a good understanding of real analysis (a nice way of saying it is a prerequisite), you will find this book a very enjoyable read. Bartle provides a fantastic structure and concepts are not rushed.

A good alternative is Measure Theory by P. Halmos, although that is a more complicated book. The book Introduction to Real Analysis by M. Stoll provides a concise introduction to the Lebesgue measure and the Lebesgue integral, which can be studied in a short period of time. An extreme alternative is Measure Theory by J. Doob, which is a considerably more complicated read because it states several ‘rare’ results in measure theory.

A Course in Functional Analysis, J. Conway.


Simply fantastic. Takes a long time to read but it is worth it. Conway makes functional analysis exciting. At times it is like a story and Conway is full of quips. A good alternative is Real and Functional Analysis by S. Lang.

Probability: Theory and Examples, R. Durrett.


Provides a measure-theoretic treatment of probability. Nothing else to say: it treats probability very well, though it is possible to find books that go far deeper. A good alternative is Probability Theory: Independence, Interchangeability, Martingales by Y.S. Chow and H. Teicher, which is deeper but frustrating to read if you do not have a good measure-theoretic background. For an easier read there is Probability Models by S. Ross, which has very little measure theory.

Stochastic Differential Equations: An Introduction with Applications by B. Oksendal


The only book on stochastic differential equations that you will ever need. Simple and concise.

Linear Algebra, S. Lang.


Perhaps even more important than analysis: linear algebra is used everywhere. Statistics and computer science are essentially subsets of linear algebra. Lang gives a great introduction. There are many alternatives!

Groups and Symmetry by M.A Armstrong.

A fun read and good preparation for Galois theory or abstract algebra. Sometimes it does not get to the action quickly enough, but that is only because a lot of explanation is needed to understand the concept of symmetry, especially when trying to visualise it. Something fun to do whilst you read this book: get jelly beans and short wooden sticks and you can build the three-dimensional shapes yourself to see the rotational symmetries and so on!

Galois Theory by D.A Cox.

Galois theory is beautiful but not easy to get into! All the more remarkable that Galois managed to discover most of it by the age of 21. The book has a great introduction and fantastic exercises; Cox helps you solve them with well-selected hints. Some books are essentially just lecture notes glued together; this is clearly a book, and a good read. There are many alternatives, but a lot of them either miss the point or make things too complex (by doing everything over the complex numbers).


Statistics

From a mathematician’s point of view (that is, one of rigour, clear explanation, a lack of ambiguity, …) a lot of statistics books are to be avoided. Elementary statistics books make many severe mistakes. Some people do not care for the mathematics, but some ideas expressed in elementary statistics books are simply not true, and sometimes completely false in practice. The best books are the ones that give concise, quick explanations and define what they use. Most waffle on.

Theory of Statistics by M. J. Schervish.

A theoretical treatment of statistics. Prerequisites are measure theory, some real and functional analysis, and linear algebra. Explains everything extremely well. Good exercises too. Good point: it avoids the mistakes that nearly all elementary statistics books make. How? By not being elementary. Bad point: quite expensive. A good alternative is Theory of Statistics by J.E. Gentle, which has the same idea in mind, but is perhaps not as good.

Mathematical Statistics and Data Analysis by J. A. Rice.

I think this is the best elementary statistics textbook around. Explains everything reasonably well. I don’t think any measure theory is needed. A lot of graduate courses in non-mathematical (or non-theoretical) statistics are based on this.

Statistical Inference by G. Casella, R.L. Berger

This book explains concepts such as point estimation, Bayesian inference, regression, maximum likelihood estimation, etc, in a fairly rigorous manner. Not very theoretical but explains everything on inference, even a touch of machine learning and mathematics. Fantastic exercises.

Time Series: Theory and Methods by P.J. Brockwell, R.A. Davis.

This is the only book I have ever found that provides a rigorous treatment of time series. I try to avoid all elementary time series books as they waffle on, are misleading and sometimes incorrect. Some knowledge of real analysis and measure theory is needed. If anyone can find a good alternative to this (there are many books that treat continuous-time series, but they are just books on stochastic differential equations), please let me know.

Machine Learning

Combine statistics with mathematics and you get machine learning. Programming is the language spoken in machine learning: mathematics sets the rules and statistics is the arena in which the game is played.

Kernel Methods for Pattern Analysis by J. Shawe-Taylor, N. Cristianini.

Best book I have read on kernels. Gives a good introduction to kernels; some functional analysis is needed. Provides many applications of kernels. Shawe-Taylor also has a video series on kernel methods if you search around on Google.

Support Vector Machines by I. Steinwart, A. Christmann.

Provides a rigorous (meaning measure-theoretic) description of support vector machines and of a lot of other supervised learning techniques. Again, completely fantastic. Combine it with the previous book and you will see why kernels are so essential in so many different applications. The problem is the prerequisites: (1) experience with data, (2) good mathematical analysis and (3) statistical experience. Without (1) the book will seem ‘undirected’ or irrelevant in a lot of situations, without (3) it will be hard to apply a lot of the techniques, and without (2) you will not understand anything!

There are many more books on machine learning but the ones presented here are, in my view, the best.


Someone wise once said that “worthy problems prove their worthiness by fighting back”. Textbooks are like that: they give you the rules early on and make them explicit (definitions, not bolded words in the middle of nowhere surrounded by hand-waving clutter), they give you instructions (theorems) and tell you what to do (exercises), and ultimately you challenge yourself.

Cauchy’s idea of probability

December 16, 2013

We get used to the idea that as our sample size increases, our model becomes more reliable.

We all ‘know’ that the sample average of most distributions is asymptotically Normal (by the central limit theorem) and that the sample average gets closer to the population mean and such.

This corresponds to a specific type of randomness, let us call it ‘mild’ randomness, as in a sense nothing wild is going on – although a random (stochastic) process is underneath everything, with more data comes convergence and more reliability to our claims. Our models come from this – linear regression and so on. They are not exact and they accept being approximations, but they still do a decent job.

However, what if they were totally wrong?

Cauchy has a different idea of probability and as a result a different idea of randomness, call it ‘wild’ randomness.

Cauchy’s idea

Cauchy’s idea is as follows.

Consider an archer who is blindfolded and has a bow and arrow. He is to shoot at a target located on a wall that is infinite in height and length. We are to measure how far his arrow lands from the target. We assume that he always hits the wall, somewhere.

For example, if he hits the target, we record 0, as he is exactly 0 units away from the target.

We can formulate a probability distribution based on this example; the idealised assumptions (a blindfolded archer, an infinite wall) are there only to make the derivation clean, and no generality is lost.

Deriving Cauchy’s distribution

We can represent the idea about the archer above as a right-angled triangle, as seen below.


This makes sense when we look at it: we are at some location (labelled archer), say level with the centre of the target (labelled target), some distance away (the side adjacent to the angle \theta). Then our actual shot (labelled actual shot) is just a line segment from the target. Wherever the arrow lands (above the target, to the right, to the left, below the target, etc.), without loss of generality we can represent it as a right-angled triangle, assuming that our shot hits the wall.

Then say we are interested in how far the actual shot is from the target (the line segment actual shot - target) and how far we ourselves (the archer) are from the target (the line segment archer - target). This can be represented by trigonometry and the angle \theta can be computed. We have

\tan \theta = \frac{\text{line segment between actual shot and target}}{\text{line segment between archer and target}}.

Let the ratio of the line segment between actual shot and target to the line segment between archer and target be called y, which is fine as it is just a real number. Then we have

\tan \theta = y.

To measure how far the arrow lands from the target, we are interested in how \theta varies with the ratio y. This makes sense: the ratio y is the quantity we can measure, so we ask how the angle responds to it. We take the inverse tangent to get

\theta = \arctan y.

Varying \theta with respect to y corresponds to our problem: how the angle changes with the ratio answers the question of how far away we are from the target. This is just the derivative! The derivative of \theta with respect to y is

\frac{d}{dy} \, \theta = \frac{1}{1+y^2}.

This derivative defines our distribution: the term 1/(1+y^2) is, up to a normalising constant, the probability density function of the random variable \Theta that measures how far away we are from the target, with y taking all real values, which corresponds to us asking: are we y away from the target?

We then normalise by integrating

\frac{d}{dy} \, \theta = \frac{1}{1+y^2}

over all real values of y and finding the constant that makes the integral equal to 1. The constant is 1/\pi, and the probability density function is

f_{\Theta}(y) = \frac{1}{\pi}\frac{1}{1+y^2},

where y is any real number.

This is the Cauchy distribution.
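A quick numerical sanity check (trapezoid-rule integration over a wide interval, my own choice of interval) that the constant 1/\pi makes the density integrate to 1, and that the corresponding distribution function is F(y) = 1/2 + \arctan(y)/\pi:

```python
import math

def cauchy_pdf(y):
    """Density of the Cauchy distribution derived above."""
    return 1.0 / (math.pi * (1.0 + y * y))

# The antiderivative of the unnormalised density 1/(1+y^2) is arctan(y),
# so integrating over the whole real line gives pi; hence the constant 1/pi.
# Trapezoid rule over [-1000, 1000] (the tails beyond contribute ~2/(1000*pi)):
a, b, n = -1000.0, 1000.0, 200_000
h = (b - a) / n
total = 0.5 * (cauchy_pdf(a) + cauchy_pdf(b))
total += sum(cauchy_pdf(a + i * h) for i in range(1, n))
total *= h

# The distribution function: F(0) = 1/2 and F(1) = 3/4, since arctan(1) = pi/4.
F = lambda y: 0.5 + math.atan(y) / math.pi
```

The integral over a finite interval comes out just under 1 because of the heavy tails, which is the first hint that this distribution behaves unlike the Normal.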

‘Wild’ randomness

Consider a real life process that conforms to ‘mild’ randomness – the heights of humans, for example.

If I collect the heights of, say, five humans, their average may not be close to the population average. As I collect more heights I should get closer to the average, assuming that I am picking people randomly and not based on geographical location or other factors.

I get an expected value \mu of what the height should be. I also get an idea of how far away from the mean I expect to be – this is the variance \sigma ^2.

Do these ideas hold for the blindfolded archer? Well… not really.

We can have a sequence of shots that are close to the target, but if the archer’s next shot is miles away, all that ‘work’ is wiped out, in the sense that the average of the previous shots, once this shot is included, will be totally different.

Although I have a ‘bare’ idea of what I expect to get (0 units away from the target), the shot can go anywhere. This type of randomness is far more wild: I am not building on my earlier, smaller samples. I do not have any expectation of how far away my shots will be, nor do I know how far I fluctuate from my expectation (which I do not know in the first place). This corresponds to me not knowing \mu or \sigma ^2.

Formally, this can be shown: neither the expectation nor the variance of \Theta is finite.

The Central Limit Theorem and the Law of Large Numbers do not hold here. Taking the sample average and using it to infer information about this distribution is useless, because the next shot can change all of what we are working with. With the Normal distribution, this is not the case.
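This ‘wild’ behaviour is easy to see numerically. Below is a sketch (the sample size and seed are my own choices) comparing running sample means of Cauchy draws with Normal draws, using the inverse-transform fact that \tan(\pi(U - 1/2)) is standard Cauchy when U is uniform on (0,1):

```python
import math
import random

rng = random.Random(42)

def cauchy_shot():
    # Inverse-transform sampling: if U ~ Uniform(0,1) then
    # tan(pi*(U - 1/2)) has the standard Cauchy distribution.
    return math.tan(math.pi * (rng.random() - 0.5))

def running_means(draws):
    """Sequence of sample means after 1, 2, ..., n draws."""
    means, total = [], 0.0
    for n, x in enumerate(draws, start=1):
        total += x
        means.append(total / n)
    return means

n = 100_000
cauchy_draws = [cauchy_shot() for _ in range(n)]
cauchy_means = running_means(cauchy_draws)
normal_means = running_means(rng.gauss(0.0, 1.0) for _ in range(n))
```

Plot the two sequences and the difference is stark: the Normal running mean settles down near 0 (‘mild’ randomness), while the Cauchy running mean keeps getting knocked around by occasional enormous shots; in fact the sample mean of Cauchy draws is itself standard Cauchy, so it never converges.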

Difference between ‘mild’ and ‘wild’ randomness

Perhaps the difference between these two types of randomness (mild and wild) can be seen in the plots. Consider the plot below of the probability density function of the Cauchy distribution for y between -10 and 10 in 0.00001 intervals (which is good enough to get a sense of what the distribution looks like).


It does not look so different from the plot of the probability density function of a (standard) Normal distribution, which is plotted below.


The Cauchy distribution has heavier tails: they do not dip as quickly as those of the Normal distribution. This corresponds to the chance of an arrow landing incredibly far from the target being much more significant under Cauchy’s distribution than under the Normal, which also makes sense. Yet the Cauchy distribution has no (finite) expected value, no (finite) variance, and various intuitions about it fail: say goodbye to the Central Limit Theorem and the Law of Large Numbers.

An Introduction to Bayesian Statistics

October 3, 2013

Consider a coin toss.

We “know” that the probability of getting a heads (and tails) is \frac{1}{2}. We know that coin tosses are independent of each other.

In the language of probability, a coin toss is a Bernoulli random variable with parameter p=\frac{1}{2} of getting heads (or tails).

The probability of a heads (or tails) is very simple. Let 1 be the outcome for heads and 0 the outcome for tails. Then the coin toss is a random variable X with probability (mass) function

\mathbb{P}(X=x) = \left(\tfrac{1}{2}\right)^x \left(1-\tfrac{1}{2}\right)^{1-x} = \frac{1}{2}, \quad x \in \lbrace 0,1 \rbrace

Now suppose the probability of getting a heads or tails is no longer symmetric (fair), i.e. we have \frac{1}{2} \mapsto \rho. The probability mass function is now

\mathbb{P}(X=x) = {\rho}^x(1-\rho)^{1-x}, \quad x \in \lbrace 0,1 \rbrace.

It seems that we are finished.

Actually, we have seen everything from a frequentist statistician’s view. A Bayesian statistician looks at this very differently. It is the difference between someone who views probability objectively and someone who views it subjectively.

How do we know the probability of getting heads is \rho?

Instead of accepting the probability mass function as it is, we attach another probability to it: the probability of the probability of getting a heads, say \mathbb{P}(\rho).

This answers the question: why do we have to assume the probability of getting a heads is \frac{1}{2}?

We no longer do. We also no longer assume it is some fixed probability \rho. The joint probability of the outcome x and the parameter \rho now becomes

\mathbb{P}(X=x, \rho) = {\rho}^x(1-\rho)^{1-x}\, \, \mathbb{P}(\rho), \quad x \in \lbrace 0,1 \rbrace.

What value can \mathbb{P}(\rho) take? This is the difference in our thinking. The parameter \rho is no longer taken as a constant, but is assumed to have a distribution. We say this is the prior (before) distribution.

Then the distribution of \rho after observing the random variable X is, by definition, the posterior (after) distribution.

We present the connection between this inference and our usual (frequentist) inference.

Distribution of \mathbb{P}(\rho)

Assume the prior density of the probability of getting a heads is equal to one. This means that the probability of heads being equal to \frac{1}{2} is as likely as the probability of heads being equal to 0 or 1 or \frac{3}{4} or any other value between zero and one.

We are assuming that the probability of getting heads follows the continuous uniform distribution on [0,1]. We have

\mathbb{P}(\rho) = 1, for 0 \leq \rho \leq 1.

The mass function is then just as before: with this flat prior we recover the same Bernoulli distribution. Challenge: what happens as we change the prior distribution \mathbb{P}(\rho)?
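As a sketch of the challenge, one can approximate the posterior of \rho on a grid for any prior. The data below (7 heads in 10 tosses) and the grid size are my own assumptions; with the uniform prior the posterior peaks at 0.7:

```python
def posterior_grid(heads, tails, prior=lambda p: 1.0, m=10_001):
    """Grid approximation to the posterior density of rho given the data."""
    grid = [i / (m - 1) for i in range(m)]
    # Unnormalised posterior: Bernoulli likelihood times the prior density.
    unnorm = [(p ** heads) * ((1 - p) ** tails) * prior(p) for p in grid]
    z = sum(unnorm) / (m - 1)  # crude normalising integral
    return grid, [u / z for u in unnorm]

# Hypothetical data: 7 heads and 3 tails under the uniform prior P(rho) = 1.
grid, dens = posterior_grid(heads=7, tails=3)
mode = grid[max(range(len(dens)), key=dens.__getitem__)]
post_mean = sum(p * d for p, d in zip(grid, dens)) / (len(grid) - 1)
```

Swapping in a different `prior` function answers the challenge directly: a prior concentrated near \frac{1}{2} pulls the posterior back towards a fair coin, while the flat prior lets the data speak for itself.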


Birthday Problem

September 23, 2013

A pregnant woman is to give birth.

We assume the probability of boy or girl is equal (it is 1/2) and births are independent of each other.

Sometimes assumptions are unrealistic but the ones we have imposed are very realistic. It would be hard to make sense of the question if we didn’t include our assumptions.

Which of the following events is more likely to happen?

Event A: She gives birth to 2 boys and 2 girls.

Event B: She gives birth to 3 boys and 1 girl or 3 girls and 1 boy.

Method 1:

Consider the sample space: let 1 be the birth of a boy and 0 the birth of a girl. There are eight possible sequences of four births beginning with the birth of a boy:

1111, 1110, 1101, 1100, 1011, 1010, 1001, 1000
By symmetry we also have eight sequences beginning with the birth of a girl, so there are sixteen sequences in total. Our sample space consists of the union of these two sets and has cardinality sixteen (obviously).

For clarification, here are the eight sequences beginning with the birth of a girl:

0111, 0110, 0101, 0100, 0011, 0010, 0001, 0000
The probability P_1 of giving birth to two boys and two girls is computed by considering the set of sequences that satisfy our event and taking its cardinality. The sequences

1100, 1010, 1001, 0110, 0101, 0011

satisfy our event, and by the addition principle the probability P_1 is 6/16 = 3/8.

The second probability P_2 is computed in exactly the same way: the sequences

1110, 1101, 1011, 0111, 1000, 0100, 0010, 0001

give P_2 = 8/16 = 1/2.

Therefore event B is more likely.

Method 2:

Let X be the random variable counting the number of boys (or girls) born in four Bernoulli trials. X is Binomially distributed with parameters n=4 and p=1/2. The probabilities of events A and B follow directly.

Event A:

P(X=2) = \binom{4}{2} (1/2)^4 = 3/8

Event B:

P(X=3) + P(X=1) = 2P(X=3) = 2 \binom{4}{3} (1/2)^4 = 1/2.

Event B is more likely.

In fact both methods are equivalent as they are doing the same thing.
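Both methods can be checked by brute-force enumeration of the sixteen sequences; a minimal sketch:

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely birth sequences (1 = boy, 0 = girl).
outcomes = list(product([0, 1], repeat=4))
assert len(outcomes) == 16

# Event A: exactly 2 boys and 2 girls.
p_A = Fraction(sum(1 for s in outcomes if sum(s) == 2), 16)
# Event B: 3 boys and 1 girl, or 3 girls and 1 boy.
p_B = Fraction(sum(1 for s in outcomes if sum(s) in (1, 3)), 16)
```

This gives p_A = 3/8 and p_B = 1/2, agreeing with both methods above.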

Utility Function and Stochastic Dominance

July 21, 2013

Consider the following motivating question:

A person wants to invest some of their wealth into something. How do they know what they will get back?

There are several thoughts to consider:

  • How much wealth do they have?
  • Why do they want to invest?
  • Depending on what the “something” is, should the expectation change?
  • How can we be certain we get back what we expect?

To answer this question, we use utility theory and stochastic dominance.

Utility Theory

Suppose you spend some cash on ice cream on a hot day. You eat the ice cream and enjoy the cold taste. This is the utility of your action.

Formally, utility is the satisfaction or welfare gained by an economic agent from the consumption of a good or an investment opportunity.

Some people prefer chocolate ice cream to vanilla ice cream. How can we distinguish different utilities? By assigning a numerical value to a utility, which measures the level of utility derived from a given level of wealth. This is the utility value.


Whether or not we include the raspberries may give a very different utility value…

What about any value of wealth W? Say we have some function U(W) of W such that this function U assigns utility values U(W) to given wealth W. This is a utility function.

Here are some thoughts that arise:

  • We are using wealth, not (necessarily) money. Then the utility gained from a wealth W is not measured in money (such as £, $, €) but as a function of wealth W.
  • I bought ice cream because I enjoy the taste. I expect to enjoy the taste. Investors make decisions based on their expectations. With certainty I expect the ice cream not to taste like spaghetti and cheese. This is a probability: a measure of what is possible in our situation. Buying vanilla ice cream and getting spaghetti-and-cheese ice cream should have a low probability of happening (although it may be high if we are in an ice cream shop that sells the wrong flavours on purpose). So what we expect can be explained by the following theorem.

Expected Utility Theorem

An investor makes decisions based on maximising the expected utility \mathbb{E}[ U(W)] (so I want the best possible taste from the ice cream) under his beliefs about the probability of different outcomes.

We now know how investors think: they want to maximise their expected satisfaction. What if two ice creams give the same expected utility value? Then the investor is indifferent between the two ice creams.
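As a toy illustration of maximising \mathbb{E}[U(W)], here is a sketch comparing two hypothetical gambles with the same expected wealth under a logarithmic utility function (the numbers and the choice of utility are my own assumptions, not from the post):

```python
import math

def expected_utility(outcomes, utility):
    """E[U(W)] for a discrete distribution given as (wealth, probability) pairs."""
    return sum(p * utility(w) for w, p in outcomes)

# A common risk-averse utility function (an assumption for illustration):
u = math.log

# Two hypothetical gambles, both with expected wealth 100:
safe = [(100, 1.0)]
risky = [(50, 0.5), (150, 0.5)]

eu_safe = expected_utility(safe, u)
eu_risky = expected_utility(risky, u)
```

Even though the two gambles have the same expected wealth, the concave (risk-averse) utility gives the certain outcome the higher expected utility, so this investor picks the safe option.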

Stochastic Dominance

The motivating question can be seen as how an investor makes a decision about which assets to purchase when returns are random (we will drop the ice cream language for now). Using utility theory means we have to know the investor’s utility function.

Suppose this function is not completely specified. Then how can we make decisions regarding which assets to pick? By using stochastic dominance. As we have lost some information (by not knowing the utility function) we have to know some other information.

To do this, instead of looking at wealth, we look at the possible returns.

Suppose we have some assets A_1, A_2, \dots, A_n and for each asset there is information on the possible returns. There is also a probability distribution for each asset: the chance of getting each return. If the probability of getting the highest possible return is one (which can be restated as the return occurring ‘almost surely’) then (potentially) this is a great asset to pick: it is not random and we know what we get.

Some thoughts arise:

  • How do we know what the returns are? When paying rent, taking loans, mortgages and so on, the returns are specifically stated. But this is not always the case, it is possible for the returns to change depending on how many people invest in that return or for some other reason.
  • How do we know the probability distribution of each asset? In reality we do not: we estimate it (and how do we do this?).

We know that if you have the choice between paying for some object or receiving the same object for free, you would take it for free. Stochastic dominance expresses this idea through probability. If (in some manner) the asset A_2 is at least as likely as the asset A_1 to exceed each possible level of return, then clearly asset A_2 should be favoured: we have a higher chance of getting something back on the returns. The asset A_2 is then stochastically dominant over the asset A_1.
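A sketch of this comparison for two hypothetical discrete assets, using the standard first-order condition that A_2 dominates A_1 when P(A_2 \leq x) \leq P(A_1 \leq x) at every return level x (the numbers are made up):

```python
def first_order_dominates(a, b, support):
    """True if asset `a` first-order stochastically dominates `b`:
    at every return level x, P(a <= x) <= P(b <= x)."""
    def cdf(dist, x):
        return sum(p for r, p in dist if r <= x)
    return all(cdf(a, x) <= cdf(b, x) for x in support)

# Hypothetical (return, probability) pairs for two assets:
A1 = [(0, 0.5), (5, 0.3), (10, 0.2)]
A2 = [(0, 0.2), (5, 0.3), (10, 0.5)]
support = [0, 5, 10]
```

Here A_2 shifts probability from the low return towards the high return, so its distribution function sits below A_1’s everywhere: every investor who prefers more to less favours A_2, regardless of the exact utility function.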

This informal (I have not defined many things properly) post gives us a reasonable answer to our question. This area is known as financial economics and is crucial to the modern world. Consider the following connections:

  • Statistics is used to estimate the probability distributions of the returns.
  • Economics is used to explain what investors want and how they behave.
  • Psychology is used to understand which assets an investor would pick – the risky ones or ones with a safe return. This is important for insurance – companies would like to know what their clients are like.
  • Mathematics is the language that formulates all of this – a quick look at this stochastic dominance page shows how all of this can be explained by mathematics.

This is what actuarial science is. It is very basic – yes, but it gives an introduction to how modern day financial problems are tackled.

Why “Mean” And “Variance” Are So Important

July 6, 2013

Suppose you are at a football (soccer) match X between teams A and B. You are interested in a gamble (it is assumed you want to earn money in the easiest way whilst adhering to the law), so you are offered a chance to gamble on the situation X.

It is known that team A is better than team B, with the additional information that team A is very volatile and team B is very consistent: B sticks to the same level most of the time whereas A can bounce up and down.

Which team would you gamble on to win the match?

What do you expect in return for your gamble?

Both choices pose a risk: A is better, so you should pick A; but B is more consistent and A is volatile, so you should pick B.

Given an action, the possibility of an undesirable outcome is precisely the risk of the situation.

What we expect in return for our investment (say R) is the expected return \mathbb{E}[R]. This is known as the mean of situation X.

It is important because this value is what we get in return: a mean that is not desirable implies no gamble and hence no investment. Here the mean is used as a measure of the success of an investment.

Assume we have expected returns for team A as \mathbb{E}[R_A] and for team B as \mathbb{E}[R_B].

We know that team B is far more likely to achieve the expected return \mathbb{E}[R_B]. As A is more volatile, it is hard to trust (or perhaps even accept) \mathbb{E}[R_A].

So how do we determine how likely each team is to match its expected return?

By looking at the variance \mathbb{E}[(R-\mathbb{E}[R])^2] of X. This measures how far we expect to be from the expectation, in a least-squares sense.

Then, once we have the variance of each team, the team with the lower variance is clearly more desirable, as it is more likely to achieve its expected return, which we would like to be as high as possible.
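As a toy check of this reasoning, here is a sketch with made-up return distributions for the two teams (the numbers are my own, chosen so that A has the higher mean but also the higher variance, matching the story above):

```python
def mean(dist):
    """Expected return of a discrete (return, probability) distribution."""
    return sum(p * r for r, p in dist)

def variance(dist):
    """Expected squared deviation from the mean."""
    m = mean(dist)
    return sum(p * (r - m) ** 2 for r, p in dist)

# Hypothetical returns from backing each team:
team_A = [(-10, 0.4), (30, 0.6)]   # better on average, but volatile
team_B = [(5, 0.5), (15, 0.5)]     # consistent

mA, vA = mean(team_A), variance(team_A)
mB, vB = mean(team_B), variance(team_B)
```

Team A offers the higher mean return while team B offers the lower variance, which is exactly the trade-off the gambler faces: neither number alone settles the choice.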

We translate these problems into mathematics through probability theory, which lets us quantify the chances of each situation happening and define the expected return and the variance precisely.

We use statistics to find estimates of what we think the expected return and variance are, as they are usually unknown.

We use utility theory to understand the behaviour an investor should exhibit when facing these choices: if he sees two teams that give the same return, so that they are equally good, and assuming no further information (such as one team being more consistent than the other), he should bet on either team without any worry, as they give the same return.

The behaviour should be consistent. The behaviour is explained by economics and is translated into mathematics by probability theory and statistics.

This is the kind of thing actuaries do. Everything explained above can be seen as actuarial science. It is a simple example, but it shows why the mean and variance are important and the motivations behind actuarial science.

