## Cauchy’s idea of probability

December 16, 2013

We get used to the idea that as our sample size increases, our model becomes more reliable.

We all ‘know’ that the sample average of most distributions is asymptotically Normal (by the central limit theorem) and that the sample average gets closer to the population mean (by the law of large numbers).

This corresponds to a specific type of randomness – let us call it ‘mild’ randomness – as in a sense nothing wild is going on: although a random (stochastic) process underlies everything, with more data comes convergence and more reliability in our claims. Our models come from this – linear regression and so on. They are not exact and they accept being approximations, but they still do a decent job.

However, what if they were totally wrong?

Cauchy had a different idea of probability and, as a result, a different idea of randomness – call it ‘wild’ randomness.

Cauchy’s idea

Cauchy’s idea is as follows.

Consider a blindfolded archer with a bow and arrow. He is to shoot at a target located on a wall that is infinite in height and length. We are to measure how far his arrow lands from the target. We assume that he always hits the wall, somewhere.

For example, if he hits the target, we record $0$, as he is exactly $0$ units away from the target.

We can formulate a probability distribution from this example; the assumptions are only there to fix the picture and cost us no generality.

Deriving Cauchy’s distribution

We can represent the idea about the archer above as a right angled triangle, as seen below.

This makes sense when we look at it: we are at some location (labelled archer), say centred on the target (labelled target) and some distance (adjacent to the angle $\theta$) away. Then our actual shot (labelled actual shot) is just a line segment from the target. Wherever the arrow lands – above the target, to the right, left, below the target, etc – without loss of generality we can represent it as a right angled triangle, assuming that our shot hits the wall.

Then say we are interested in how far the shot is from the target (the line segment actual shot – target) and how far we ourselves (the archer) are from the target (the line segment archer – target). This can be represented by trigonometry and the angle $\theta$ can be computed. We have

$\tan \theta = \frac{\text{line segment between actual shot and target}}{\text{line segment between archer and target}}$.

Let the ratio between the line segment between actual shot and target and the line segment between archer and target be called $y$, which is fine as it is just a real number. Then we have

$\tan \theta = y$.

To measure how far his arrow lands from the target, we are interested in how $\theta$ varies with the ratio $y$. This makes sense – the ratio is the quantity that records the miss, and the angle is what determines it. We take the inverse tangent to get

$\theta = \arctan y$.

Varying $\theta$ with respect to $y$ corresponds to our problem – how we change the ratio answers the question of how far away we are from the target. This is just the derivative! The derivative of $\theta$ with respect to $y$ is

$\frac{d}{dy} \, \theta = \frac{1}{1+y^2}$.

This derivative defines our distribution – up to a normalising constant, the term $1/(1+y^2)$ is the probability density function of the random variable $\Theta$ that measures how far away we are from the target, with $y$ taking all real values. This corresponds to us asking: are we $y$ units away from the target?

We then get the probability density function by integrating

$\frac{d}{dy} \, \theta = \frac{1}{1+y^2}$

over all real values of $y$ (the integral equals $\pi$) and finding a constant to make this integral equal to $1$. The constant is $1/\pi$ and we have the probability density function to be

$f_{\Theta}(y) = \frac{1}{\pi}\frac{1}{1+y^2}$,

where $y$ is a real number.

This is the Cauchy distribution.
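As a quick sanity check on the derivation above, we can verify numerically that $\frac{1}{\pi}\frac{1}{1+y^2}$ really is a probability density, i.e. that it integrates to $1$ over the real line. This is a minimal sketch: a trapezoidal rule on a large symmetric interval, plus the exact tail mass beyond the cutoff (the Cauchy tails are too heavy to simply truncate and ignore).

```python
import math

# Check numerically that f(y) = 1 / (pi * (1 + y^2)) integrates to 1.
# Trapezoidal rule on [-a, a], plus the exact tail mass beyond the
# cutoff: P(|Y| > a) = 1 - (2 / pi) * arctan(a).

def density(y):
    return 1.0 / (math.pi * (1.0 + y * y))

a, n = 1000.0, 200_000
h = 2 * a / n
total = 0.5 * (density(-a) + density(a))
for i in range(1, n):
    total += density(-a + i * h)
total *= h

tail = 1 - (2 / math.pi) * math.atan(a)  # exact mass outside [-a, a]
print(round(total + tail, 6))  # 1.0
```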

‘Wild’ randomness

Consider a real life process that conforms to ‘mild’ randomness – the heights of humans, for example.

If I collect heights of say five humans, it may not be close to the average. As I collect more heights I should get closer to the average, assuming that I am picking people randomly and not based on geographical location and other factors.

I get an expected value $\mu$ of what the height should be. I also get an idea of how far away from the mean I expect to be – this is the variance $\sigma ^2$.

Do these ideas hold for the blindfolded archer? Well… not really.

We can have a sequence of shots that are close to the target but if the archer’s next shot is miles away, all that ‘work’ is wiped out in the sense that the average from those previous shots, now considering this shot, will be totally different.

Although I have a ‘bare’ idea on what I expect to get – $0$ units away from the target, I can go anywhere. This type of randomness is far more wild – I am not building on my earlier, smaller samples. I do not have any expectation of how far away my shots will be, nor do I know how far I fluctuate from my expectation (which I do not know in the first place). This corresponds to me not knowing $\mu$ or $\sigma ^2$.

Formally, the expectation of $\Theta$ does not exist (the defining integral diverges), and consequently neither does its variance.

The Central Limit Theorem and the Law of Large Numbers do not hold here. Taking the sample average and using it to infer information about this distribution is useless, because the next shot can change all of what we are working with. With the Normal distribution, this is not the case.
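We can watch this failure of the Law of Large Numbers in a quick simulation. The sketch below draws standard Cauchy samples via the archer construction itself (a uniform angle, then its tangent) and compares the running sample means against those of a standard Normal; the sample sizes and seed are arbitrary choices.

```python
import math
import random

random.seed(42)

def running_means(draw, n):
    """Sample means after 1, 2, ..., n draws from draw()."""
    total, means = 0.0, []
    for i in range(1, n + 1):
        total += draw()
        means.append(total / i)
    return means

# A standard Cauchy draw via the archer construction above: pick a
# uniform angle theta in (-pi/2, pi/2), then the hit is y = tan(theta).
def cauchy_draw():
    return math.tan(math.pi * (random.random() - 0.5))

n = 100_000
normal = running_means(lambda: random.gauss(0, 1), n)
cauchy = running_means(cauchy_draw, n)

# The Normal running mean settles near 0 (law of large numbers);
# the Cauchy running mean never settles down, however long we sample.
print(max(abs(m) for m in normal[n // 2:]))  # stays small
print(max(abs(m) for m in cauchy[n // 2:]))  # typically far larger
```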

Difference between ‘mild’ and ‘wild’ randomness

Perhaps the difference between these types of randomness (mild and wild) can be seen in the plots. Consider the plot below of the probability density function of the Cauchy distribution for $y$ between $-10$ and $10$ in $0.00001$ intervals (which is good enough to get a measure of what the distribution looks like).

It does not look so much different to the plot of a probability density function of a (standard) Normal distribution, which is plotted below.

The Cauchy distribution has heavier tails – they do not dip as quickly as those of the Normal distribution. This corresponds to an arrow landing extremely far from the target being far more probable under Cauchy’s distribution than under the Normal – this also makes sense. Yet the Cauchy distribution has no (finite) expected value, no (finite) variance, and much of our intuition about it fails – say goodbye to the Central Limit Theorem and the Law of Large Numbers.

## An Introduction to Bayesian Statistics

October 3, 2013

Consider a coin toss.

We “know” that the probability of getting a heads (and tails) is $\frac{1}{2}$. We know that coin tosses are independent of each other.

In the language of probability, a coin toss is a Bernoulli random variable with parameter $p=\frac{1}{2}$ of getting heads (or tails).

The probability of a heads (or tails) is very simple. Let $1$ be the outcome for heads and $0$ be the outcome for tails. Then we have the coin toss to be a random variable $X$ with probability (mass) function

$\mathbb{P}(X=x) = \left(\frac{1}{2}\right)^x\left(1-\frac{1}{2}\right)^{1-x} = \frac{1}{2}$, for $x \in \lbrace 0,1 \rbrace$.

Then suppose the probability of getting a heads or tails is no longer symmetric (or fair), i.e we have $\frac{1}{2} \mapsto \rho$. The probability mass function is now

$\mathbb{P}(X=x) = {\rho}^x(1-\rho)^{1-x}$, for $x \in \lbrace 0,1 \rbrace$.

It seems that we are finished.

Actually we have seen everything from a frequentist (statistician’s) view. A Bayesian statistician looks at this very differently. It is the difference between someone who views probability with objectivity and someone who views probability with subjectivity.

How do we know the probability of getting heads is $\rho$?

Instead of accepting the probability mass function as it is, we attach another probability to it: the probability of the probability of getting a heads, say $\mathbb{P}(\rho)$.

This answers the question: why do we have to assume the probability of getting a heads is $\frac{1}{2}$?

We no longer do. We also no longer assume it is the probability $\rho$. Our probability mass function now becomes

$\mathbb{P}(X=x) = {\rho}^x(1-\rho)^{1-x}\, \, \mathbb{P}(\rho)$, for $x \in \lbrace 0,1 \rbrace$.

What value can $\mathbb{P}(\rho)$ take? This is the difference in our thinking. The parameter $\rho$ is no longer taken as a constant, but is assumed to have a distribution. We say this is the prior (before) distribution.

Then the distribution of $\rho$ after observing the random variable $X$ is, by definition, the posterior (after) distribution.

We present the connection between this inference and our usual (frequentist) inference.

Distribution of $\mathbb{P}(\rho)$

Assume $\mathbb{P}(\rho)$ is equal to one for every value of $\rho$. This means heads having probability $\frac{1}{2}$ is just as likely as it having probability $0$ or $1$ or $\frac{3}{4}$ or any other value between zero and one.

We are assuming that the probability of getting heads follows the continuous uniform distribution on $[0,1]$. We have

$\mathbb{P}(\rho) = 1$, for $0 \leq \rho \leq 1$.

The mass function is just as before. With no extra information, we recover the same Bernoulli distribution. Challenge: what happens as we change the prior distribution $\mathbb{P}(\rho)$?
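To make the challenge concrete, here is a small sketch of what happens once we observe data under the uniform prior. It uses the standard Beta-Bernoulli result (a fact beyond this post, stated hedged here): a uniform prior is Beta(1, 1), and after observing $h$ heads in $n$ tosses the posterior is Beta(1 + h, 1 + n − h), whose mean is $(h+1)/(n+2)$ – Laplace’s “rule of succession”. The toss counts below are made up for illustration.

```python
from fractions import Fraction

# Uniform prior on rho = Beta(1, 1). After h heads in n tosses, the
# posterior is Beta(1 + h, 1 + n - h); its mean is (h + 1) / (n + 2).

def posterior_mean(h, n):
    return Fraction(h + 1, n + 2)

print(posterior_mean(7, 10))      # 2/3
print(posterior_mean(700, 1000))  # 701/1002
```

Note how more data pulls the posterior mean toward the observed frequency of heads – the Bayesian estimate and the frequentist one agree in the limit.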

## Birthday Problem

September 23, 2013

A pregnant woman is to give birth.

We assume the probability of boy or girl is equal (it is $1/2$) and births are independent of each other.

Sometimes assumptions are unrealistic but the ones we have imposed are very realistic. It would be hard to make sense of the question if we didn’t include our assumptions.

Which of the following events is more likely to happen?

Event A: She gives birth to 2 boys and 2 girls.

Event B: She gives birth to 3 boys and 1 girl or 3 girls and 1 boy.

Method 1:

Consider the sample space: let $1$ denote the birth of a boy and $0$ the birth of a girl, so each outcome is a sequence of four digits. Here are the eight sequences that begin with a girl:

0000
0001
0010
0100
0011
0101
0110
0111

By symmetry there are also eight sequences that begin with a boy, for a total of sixteen. Our sample space is the union of these two sets and has cardinality sixteen (obviously).

For clarification, here are the eight sequences that begin with a boy:

1111
1110
1101
1011
1100
1010
1001
1000

The probability $P_1$ of giving birth to two boys and two girls is found by collecting the sequences that satisfy our event and taking the cardinality of that set. The sequences

0011
1100
0110
1001
0101
1010

satisfy our event, and by the addition principle the probability is $P_1 = 6/16 = 3/8$.

The second probability $P_2$ is found in exactly the same way, and the sequences

0001
0010
0100
1110
1101
1011
1000
0111

give us $P_2 = 8/16 = 1/2$.

Therefore event $B$ is more likely.

Method 2:

Let $X$ be the random variable counting the number of boys born in four Bernoulli trials. Then $X$ is Binomially distributed with parameters $n=4$ and $p=1/2$, and the probabilities of events $A$ and $B$ follow directly.

Event A:

$P(X=2) = \binom{4}{2} (1/2)^2 (1/2)^2 = \binom{4}{2} (1/2)^4 = 3/8$

Event B:

$2P(X=3)=2P(X=1)=2 \binom{4}{3} (1/2)^4 =2 \binom{4}{1} (1/2)^4 = 1/2$.

Event $B$ is more likely.

In fact both methods are equivalent as they are doing the same thing.
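Both methods can be checked by brute force. The sketch below enumerates all $2^4 = 16$ equally likely birth sequences and counts the outcomes in each event directly, recovering $3/8$ and $1/2$.

```python
from fractions import Fraction
from itertools import product

# Enumerate all 2^4 = 16 equally likely birth sequences
# (1 = boy, 0 = girl) and count the outcomes in each event.
sequences = list(product([0, 1], repeat=4))

# Event A: exactly two boys. Event B: exactly one or three boys.
p_A = Fraction(sum(1 for s in sequences if sum(s) == 2), len(sequences))
p_B = Fraction(sum(1 for s in sequences if sum(s) in (1, 3)), len(sequences))

print(p_A)        # 3/8
print(p_B)        # 1/2
print(p_B > p_A)  # True
```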

## Utility Function and Stochastic Dominance

July 21, 2013

Consider the following motivating question:

A person wants to invest some of their wealth into something. How do they know what they will get back?

There are several thoughts to consider:

• How much wealth do they have?
• Why do they want to invest?
• Depending on what the “something” is, should the expectation change?
• How can we be certain we get back what we expect?

To answer this question, we use utility theory and stochastic dominance.

Utility Theory

Suppose you spend some cash on ice cream on a hot day. You eat the ice cream and enjoy the cold taste. This is the utility of your action.

Formally, utility is the satisfaction or welfare gained by an economic agent from the consumption of a good or an investment opportunity.

Some people prefer chocolate ice cream to vanilla ice cream. How can we distinguish different utilities? By assigning a numerical value to a utility, which measures the level of utility derived from a given level of wealth. This is the utility value.

Whether or not we include the raspberries may give a very different utility value…

What about any value of wealth $W$? Say we have some function $U(W)$ of $W$ such that this function $U$ assigns utility values $U(W)$ to given wealth $W$. This is a utility function.

Here are some thoughts that arise:

• We are using wealth, not (necessarily) money. Then the utility gained from a wealth $W$ is not measured in money (such as £, \$, €) but as a function of wealth $W$.
• I bought ice cream because I enjoy the taste. I expect to enjoy the taste. Investors make decisions based on their expectations. With certainty I expect the ice cream not to taste like spaghetti and cheese. This is a probability: a measure of what is possible in our situation. Buying vanilla ice cream and getting spaghetti and cheese ice cream should have a low probability of happening (although it may be high if we are in an ice cream shop that sells the wrong flavours on purpose). So what we expect can be explained by the following theorem.

Expected Utility Theorem

An investor makes decisions based on maximising the expected utility $\mathbb{E}[ U(W)]$ (so I want the best possible taste from the ice cream) under his beliefs about the probability of different outcomes.

We now know how investors think. They want to maximise their expectation of satisfaction. What if two ice creams give the same expected utility value? Then the two ice creams are indifferent.

Stochastic Dominance

The motivating question can be seen as how an investor makes a decision about which assets to purchase when returns are random (we will drop the ice cream language for now). Using utility theory means we have to know the investor’s utility function.

Suppose this function is not completely specified. Then how can we make decisions regarding which assets to pick? By using stochastic dominance. As we have lost some information (by not knowing the utility function) we have to know some other information.

To do this, instead of looking at wealth, we look at the possible returns.

Suppose we have some assets $A_1, A_2, \dots, A_n$ and for each asset there is information on the possible returns. There is also a probability distribution for each asset – the chance of getting each return. If the probability of getting the highest possible return is one (which can be restated as “almost surely”) then (potentially) this is a great asset to pick – it is not random and we know what we get.

Some thoughts arise:

• How do we know what the returns are? When paying rent, taking loans, mortgages and so on, the returns are specifically stated. But this is not always the case, it is possible for the returns to change depending on how many people invest in that return or for some other reason.
• How do we know the probability distribution of each asset? In reality we do not – we estimate it (and how do we do this?).

We know that if you have the choice between paying for some object or receiving the same object for free, you would take it for free. Stochastic dominance expresses this idea through probability. If, for every level of return, asset $A_2$ has at least as high a chance as asset $A_1$ of exceeding that return, then clearly asset $A_2$ should be favoured: we have a higher chance of getting something back on the returns. The asset $A_2$ is then stochastically dominant over asset $A_1$.
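This comparison can be sketched in a few lines. The check below implements first-order stochastic dominance for two assets on a shared grid of returns: $A_2$ dominates $A_1$ if $A_2$’s cumulative distribution function never sits above $A_1$’s. The return levels and probabilities are entirely made up for illustration.

```python
from fractions import Fraction as F

returns = [0, 5, 10, 15]                    # shared grid of returns
p1 = [F(4, 10), F(3, 10), F(2, 10), F(1, 10)]  # asset A1 (made up)
p2 = [F(1, 10), F(2, 10), F(3, 10), F(4, 10)]  # asset A2 (made up)

def cdf(p):
    out, acc = [], F(0)
    for q in p:
        acc += q
        out.append(acc)
    return out

F1, F2 = cdf(p1), cdf(p2)
# A2 first-order dominates A1 if A2's CDF is never above A1's: at
# every return level, A2 is at least as likely to exceed it.
dominates = all(f2 <= f1 for f1, f2 in zip(F1, F2))
print(dominates)  # True
```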

This informal (I have not defined many things properly) post gives us a reasonable answer to our question. This area is known as financial economics and is crucial to the modern world. Consider the following connections:

• Statistics is used to estimate the probability distributions of the returns.
• Economics is used to explain what investors want and how they behave.
• Psychology is used to understand which assets an investor would pick – the risky ones or ones with a safe return. This is important for insurance – companies would like to know what their clients are like.
• Mathematics is the language that formulates all of this – a quick look at this stochastic dominance page shows how all of this can be explained by mathematics.

This is what actuarial science is. It is very basic – yes, but it gives an introduction to how modern day financial problems are tackled.

## Why “Mean” And “Variance” Are So Important

July 6, 2013

Suppose you are at a football (soccer) match $X$ between teams $A$ and $B$. You are interested in a gamble (it is assumed you want to earn money in the easiest way whilst adhering to the law), so you are offered a chance to gamble on the situation $X$.

It is known that team $A$ is better than team $B$, with the additional information that team $A$ is very volatile and team $B$ is very consistent: $B$ sticks to the same level most of the time, whereas $A$ can bounce up and down.

Which team would you gamble on to win the match?

What do you expect in return for your gamble?

Both choices pose a risk: $A$ is better, so you should pick $A$; but $B$ is more consistent and $A$ is volatile, so you should pick $B$.

Given an action, the possible outcome being undesirable is precisely the risk of the situation.

What we expect in return for our investment (say $R$) is the expected return $\mathbb{E}[R]$. This is known as the mean of situation $X$.

It is important as this value is what we get in return – a mean that is not desirable implies no gamble hence no investment. The mean is used as a measure of seeing success in an investment in this case.

Assume we have expected returns for team $A$ as $\mathbb{E}[R_A]$ and for team $B$ as $\mathbb{E}[R_B]$.

We know that team $B$ is far more likely to achieve the expected return $\mathbb{E}[R_B]$. As $A$ is more volatile, it is hard to trust (or perhaps even accept) $\mathbb{E}[R_A]$.

So how do we determine how likely each team is to match their expected return?

By looking at the variance $\mathbb{E}[(R-\mathbb{E}[R])^2]$ of $X$. This measures how far we expect to be from the expectation, as a mean squared deviation.

Then when we have the variance of each team, clearly the team with the lower variance is more desirable as they are more likely to achieve their expected return, which we would like to be as high as possible.
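A toy calculation makes the point. The return distributions below are made-up numbers chosen so the two teams have the same expected return but very different variances, mirroring volatile $A$ versus consistent $B$.

```python
from fractions import Fraction as F

# Hypothetical return distributions for a bet on each team (all
# numbers made up). Team A is volatile: a big win or a big loss.
# Team B is consistent: two modest outcomes close together.
team_A = {10: F(1, 2), -8: F(1, 2)}   # return -> probability
team_B = {2: F(1, 2), 0: F(1, 2)}

def mean(dist):
    return sum(r * p for r, p in dist.items())

def variance(dist):
    m = mean(dist)
    return sum((r - m) ** 2 * p for r, p in dist.items())

print(mean(team_A), variance(team_A))  # 1 81
print(mean(team_B), variance(team_B))  # 1 1
```

Same mean, wildly different variances – which is exactly why the mean alone is not enough to choose a gamble.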

We translate these problems into mathematics through probability theory, which lets us quantify the chance of each situation happening and define the expected return and variance precisely.

We use statistics to find estimates of what we think the expected return and variance are, as they are usually unknown.

We use utility theory to understand the behaviour an investor should exhibit when facing these choices: if he sees two teams that give the same expected return, and assuming no further information (such as one team being more consistent than the other), he should bet on either team without any worry, as they give the same return.

The behaviour should be consistent. The behaviour is explained by economics and is translated into mathematics by probability theory and statistics.

This is the kind of thing actuaries do. Everything explained above can be seen as actuarial science. It is a simple example, but it shows why mean and variance are important and the motivations behind actuarial science.

## Branch Cuts Explained in Ballerina Turns

March 19, 2013

Suppose you are standing up and you look at a green wall. Also, you are a ballerina or just feel like doing a full 360 degree (or $2\pi$) turn. Naturally, after your turn, you should see the same green wall, yes?

The answer (generally) is no. It all depends on what you can see! It also depends on how you are looking.

Suppose that the wall takes two colours: green and blue. They are definitely not the same!

This should appear the same after a turn…

Mathematics can be bizarre and sometimes things that do not appear to be the same actually… are. Suppose we are not in such a universe. Green and blue are definitely different for us.

We get a blue wall by using blue paint to paint the wall blue. Similarly we get a green wall by using green paint.

Then we can say dependent on what paint we used, we get a different output.

What if in our seemingly normal mathematical universe we use green paint to paint the wall green, look at the wall standing up, then do a full turn and the wall is now blue?

And further suppose that this can happen an infinite amount of times?!

But we did not use blue paint!!

Painting something is a single act. You do something and you get one (and only one) thing out in return, which makes sense.

But looking back at what you paint is not the same! Because you can paint with two different colours, when you look back there are two possible colours! You could call this a multi-act.

Also suppose there is not a single way to just look. Maybe we can look with squinted eyes or with just one eye, the other closed, or enclosed: pirate style!

Now this is crucial: how we look could affect what colour we see.

Suppose there is some way of looking that, if we make the look, then go to the wall and “look” back at the look, we can only see one thing: the look.

This appears silly, of course that’s true. But also suppose we can look, go to the wall and “look” back and see the same and perhaps another look. You could call this a 2-look.

Generalise it further and you have a multi-look. A capability to look back at your projection and see something else.

Then if you can look forward, do a turn, say a 1-turn, generalised to multi-turns, you should always see a green wall provided that you painted it green initially.

But if the “look” that you are using is a multi-look, clearly when you look back, which if you think about it… is just the same as doing a turn, then you don’t have the green wall. You have a blue wall.

This is precisely what a branch point is: the “look” that you choose that after a “turn” gives you a different colour to the wall.

To fix this, we could ban certain looks (so all multi looks) and focus on the single looks: these are all continuous since no matter how many turns we do, we have the same green paint always appearing.

Then if consider the multi looks, we make these continuous by fixing your turns: where you start and where you finish. You should still do a full $2 \pi$ turn either way. You just change the position where you do it.

This means if we are in the room and the paints are at some position in the room, so the green paint at position A and blue paint at position B, consider the distance between them, call it C.

This C will be removed: from blue a turn gives us green and likewise from green to blue. Formally, we take a cut in the axis between the points A and B.

This is a branch cut.

This part is removed as travelling in this small part of the considered axis gives us a discontinuity! Everything else stays!

Now suppose we could transfer this to a different room with different paint colours and different representations of what gives us the same or different (so in our considered case, the representation is looking at the wall).

This is analytic continuation.

This may not generally work: other rooms and other representations may have a specific structure which mean what is discontinuous for one is not discontinuous for the other.

Then suppose when in some specific section (a contour) of the room, everything is well defined and after every single turn we have the same green wall.

Suppose we are a different section and given the “nice” behaviour of the first section, if we can somehow exhibit the same behaviour, we have connected two sections.

Then we may do this for the whole room and hence when we look and turn, we always see a green wall.

This is the Monodromy Theorem.

Then we understand the whole room: after our ballerina turns, anywhere in the room, we always see green.

Essentially, we have just explained a bunch of useful (and closely related) concepts and theorems in complex analysis.

We weren’t being very formal – far from it, but we have the idea in our mind.

Formally, we looked at multivalued functions, which arise as the inverses of single-valued functions.

We see what happens as we travel $\theta$ through a full period: the multivalued function returns a different value at some specific points (the branch points).

Then we fix angles and remove parts of the imaginary or real axis: a branch cut.
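A quick numerical sketch of this: track the principal square root while a point makes one full turn around the origin. The principal branch (as implemented by `cmath.sqrt`) cuts along the negative real axis, so the tracked values jump there instead of varying smoothly – the “blue wall” appearing mid-turn.

```python
import cmath
import math

# Walk z = e^{i*theta} once around the unit circle (a full 2*pi
# "ballerina turn") and track the principal square root. The values
# vary smoothly except where theta crosses pi (the branch cut on the
# negative real axis): there sqrt(z) flips from about +1j to -1j.
n = 1000
values = [cmath.sqrt(cmath.exp(1j * 2 * math.pi * k / n))
          for k in range(n + 1)]

# Smooth steps have size ~pi/n; the step across the cut has size ~2.
biggest_jump = max(abs(b - a) for a, b in zip(values, values[1:]))
print(round(biggest_jump, 3))  # ~2.0
```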

Whether this cut holds in a different region and whether there is any difference to how we look at the wall or the colour of the wall is analytic continuation.

An example of analytic continuation…

Finally, the Monodromy Theorem characterises this with singular points and analyticity.

## Partitions: The Most Powerful Tool In Mathematics

March 8, 2013

If you have a big object and cannot understand it, what do you do? You break the big object into little objects, try to make sense of them and then build up back to the big object again.

These little objects are partitions of the big object.

Let us see how this tool has been used in some areas of mathematics:

Integration (Analysis)

The first and original integral, the Riemann integral, uses partitions of an interval and then forms specific sums (Riemann sums) over these partitions, which must satisfy certain criteria for integrability.

The result? The integral, integration in general, and of course the Fundamental Theorem of Calculus.
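The partition idea here can be sketched in a few lines: chop $[0,1]$ into $n$ equal pieces, sum $f$ at the left endpoint of each piece times its width, and watch the sums converge as the partition refines. For $f(x) = x^2$ the exact integral, via the Fundamental Theorem of Calculus, is $1/3$.

```python
# Left Riemann sums on [a, b] with n equal subintervals.
def left_riemann_sum(f, a, b, n):
    h = (b - a) / n
    return sum(f(a + i * h) for i in range(n)) * h

f = lambda x: x * x
for n in (10, 100, 1000, 10000):
    # The sums approach 1/3 as the partition gets finer.
    print(n, left_riemann_sum(f, 0.0, 1.0, n))
```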

Group Theory (Algebra)

Suppose you have a group. Represent this group as a big box. The things inside this big box clearly make it what it is: the big box.

Partition these things inside into sets and if they have some structure (subgroups), collect all of them and consider the divisibility of the cardinality of the partitions.

What is inside one box may not be what is inside another box (in group talk: a left coset is not necessarily equal to the corresponding right coset).

The result? Lagrange’s theorem (there are other results, but the sheer beauty of that one should be enough).

Graph Theory (Algebra)

Suppose you have a graph. Partition the graph into something smaller: a collection of subgraphs. If this collection is a disjoint union of n-partite graphs, we can understand the graph’s structure.

A specific structure will allow us to understand if the graph is planar and to resolve key, real-life problems such as the utilities problem.

Stochastic Processes (Probability)

Suppose you are interested in some event $B$ contained in a sample space $\Omega$. But for $B$ to occur you must “be” in some place: one of the states $A_0, A_1, A_2, \dots$ that partition $\Omega$. The point is, you must be in exactly one of them. You then partition the chance of event $B$ occurring as a sum: the chance of $B$ occurring given that you are in $A_i$, multiplied by the chance of being in $A_i$, summed over all $i$:

$\mathbb{P}(B) = \sum_{i} \mathbb{P}(B \mid A_i)\, \mathbb{P}(A_i)$.

This is the Law Of Total Probability.

The result: Chapman-Kolmogorov equation.

Actually this equation can be represented in matrix form, which then allows us to prove big theorems for Markov chains and (in general) Markov processes.
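Here is a small sketch of the matrix form for a two-state Markov chain with a made-up transition matrix $P$. Partitioning on the intermediate state (the law of total probability) says the two-step transition probabilities are given by the matrix product $P \cdot P$ – the Chapman-Kolmogorov equation.

```python
from fractions import Fraction as F

# A made-up two-state transition matrix: P[i][j] is the chance of
# moving from state i to state j in one step.
P = [[F(9, 10), F(1, 10)],
     [F(1, 2),  F(1, 2)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P2 = matmul(P, P)  # two-step transition probabilities

# Two-step probability 0 -> 0, partitioned over the intermediate
# state m (the law of total probability): sum of P[0][m] * P[m][0].
direct = P[0][0] * P[0][0] + P[0][1] * P[1][0]
print(P2[0][0], direct)  # 43/50 43/50
```

The matrix entry and the hand-partitioned sum agree, which is exactly what Chapman-Kolmogorov asserts.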

The result: Markov chains, Ergodic theorem

Even more, the whole of probability theory arises from this.

Introductory stochastic processes classes look at how some initial understanding of probability (and other basic analysis modules) allows us to understand the world: reliability theory (how likely is it that something will break down?), queueing theory (what is the chance of waiting for a specific time period in a given queue at, say, the supermarket before you reach the till?) and so on.

Graduate classes then use the idea of partitions and breaking things down so much that it links nicely with the “rigorous” definition of probability: measure theory.

These are just some applications; the explanations here are not detailed enough to fully show why partitions are so crucial across the various fields of mathematics.

The point: if you do not understand something, break it down into what you can (or will) understand. Then collect these little pieces together and see what you get.