probability, probability definition, uniform deviate in R

Basic Statistics with R

Probability

From the point of view of statistics, the experiments we do, or the systems we observe come under two categories: deterministic and stochastic.

The deterministic system results in predictable outcomes. For example, in the absence of an appreciable air resistance, we can almost predict the trajectory of a projectile thrown at an angle with a known velocity using the equations of Newtonian Mechanics. Here the system obeys the known laws.

There are experiments which have two or more possible outcomes, and the experiment may result in any one of those possible outcomes in a random manner. They are called stochastic systems. There are no deterministic equations or laws governing these systems and they can be studied only with the help of statistics. For example, when we throw a dice and look for the number that shows up, any one of the six possibilities can occur and we cannot predict it. We can only say that if the dice is perfect, all the six numbers have equal chance for showing up! This is what makes the dice throwing a random experiment as against the projectile motion, which is deterministic.

Random experiment, outcome space and events

A random experiment will result in one or more possible outcomes. A set of all possible outcomes of a random experiment is called its outcome space .

For example, an experiment involving a single coin toss has two possibe outcomes namely Head(H) or Tail(T). We represent the outcome space as a set in set theory notation: \[\small{S = \{H, T\} }\]

Similarly, in a random experiment that involves the throw of a single dice, the outcome space has six possible results of a dice throw: \[\small{ S = \{1,2,3,4,5,6\} }\]

A subset of possible events in an outcome space is called an event. Thus, the set \[ \small{A = \{2,4,6\} }\] is an event(subset) of the above mentioned outcome space of dice throw that contains possible outcomes with even value. The set \[ \small{B = \{1,2,3\} }\] is another event(subset) of dice throw that has outcomes whose values are less than four.

For a given outcome space, it may be possible to define many events.

When a random experiment results in an outcome which is a member of subset 'A', we say that the event A has occured . Thus, from the above definition of event A and B, if a dice throw results in 4, we say "event A has occured". If it results in 3, we say "event B has occured". If the result is 2, we can say "event A and event B have occured", since the outcome 2 is a member of both the sets.

The concept of probability

The probability that an event A occurs denotes the chance or likelihood of the event occuring, and is denoted by a symbol p(A). The value of p(A) is in the range 0 to 1. A value of p(A)=1 denotes that the event A will occur certainly, and p(A)=0 denotes that the event A will definitely not occur. A value of p(A)=0.5 tells us that the event A has equal chance of occuring or not occuring.

There are two ways in which we can approach the probability of an event A.

The first way involves looking at the outcome space and count the number of possible ways in which an event A can occur. This is same as the number elements in the set A. We define the probability of occurance of event A as, \[ \small{p(A) = \frac{number~of~outcomes~favourable~to~event~A}{total~number~of~possible~outcomes} }\]

Suppose we want to compute the probability of getting a value more than 4 in a dice throw. Out of six possible outcomes {1,2,3,4,5,6}, two outcomes, namely 5 and 6 satisfy this event. Therefore, we say \[\small{ p(>4) = \frac{number~of~outcomes~that~result~in~>4}{total~number~of~possible~outcomes} = \frac{2}{6} }\]

As a second example, we define an experiment in which we throw a pair of coins and note down the result in both of them. If first coin shows up Head(H) and second shows up Tail(T), we call this outcome "HT". This way, there are four possible outcomes in this random experiment: \[ \small{S = \{HH, HT, TH, TT\} } \]

Now we ask the question : If we toss a pair of coins, what is the probability of getting at least one head in the pair?. When we look at the above outcome space, we realize that out of the 4 elements in the space, three of them namely \[ \small{ \{HH, HT, TH \} }\] result in at least one head. Therefore, we write, for a two coin toss, \[ \small{ p(at~least~one~head ) = \frac{number~of~outcomes~that~result~in~at~least~on~head}{total~number~of~possible~outcomes} = \frac{3}{4} }\]

We can also approach the probability from the experimental side. Suppose an event A has a probability of occurance p(A). We perform the experiment once and note down whether it resulted in event A or not. We then repeat this experiment N times ( N trials), and count the number of trials n(A) that resulted in the event A. For example, we throw a coing n times and note down the number of times it resulted in head(H).

The ratio \(\small{ \displaystyle\frac{n(A)}{N} }\) is the relative frequency of occurance of event A. This ratio is unstable for small number of trials N, but stabilizes towards a number for very large N.

This number, \( \small{p(A) = \displaystyle\frac{n(A)}{N} } \) for very large N is called the probability of occurance of event A.

How many trials we need before the experimentally observed ratio is equal to the probability obtained before in terms of favourable events?. ie., how many coin tosses are needed to get a p(H) exactly equal to 0.5?. It turns out that we need infinite number of tosses!!

In the same manner, if we throw a dice very very large number of times, the probability of getting each one of the 6 outcomes will approach the number \( \small{\displaystyle\frac{1}{6}} \) as the number of throws approaches infinity!

R Scripts

Simulating the coin toss experiment in R

We can realistically simulate the coin toss experiment using the runif() function in R which generates set of random numbers with a uniform probability between 0 and 1. These random numbers are called "uniform deviates".

We can simulate a single coin toss by generating a uniform deviate between 0 and 1. If the value is below 0.5, we say that the outcome is head (H). On the other hand, if the value is above 0.5, we say that the outcome is tail (T).

If we generate a large number of uniform deviates, half of the numbers generated will be below 0.5 (Heads) and the ramaining half will be above 0.5 (Tails)

We first generate N deviates. We separate those numbers with value below 0.5 and count them as "Head". Similarly, we separate those numbers that are above 0.5 and count them as "Tail". The ratio of number of heads to total number N gives the p(H), the probability of getting head. We repeat this for increasing N like N=10,100,10000, 100000,etc. As N increases, p(H) should approach the expected probability (called "apriory probability") of 0.5.



#####         script  "coin_toss.r"    #################################


## Simulating a coins toss using unform deviates from runif()

 ## number of trials in 6 experiments.
 ntrials = c(10,100,1000, 10000, 100000, 1000000)

 ## print the column nmaes
 print(paste("trials  ", "heads ", "tails ", "P(head)"))

 ## find p(H) for each experment with N trials (ie., N coin tosses)
 for(N in ntrials)
 {
   ## generate N uniform deviates between 0 and 1 and return them as a vector
   nx = runif(N)

   ## If the number is less than or equals 0.5, call it Head(H)
   nheads = length(subset(nx, nx <= 0.5))

   ## If the number if greater than 0.5, call it Tail(H)
   ntails = length(subset(nx, nx > 0.5))

   ## compute the probblity of head
   prob_head = nheads/N

   print(paste(nheads+ntails,"      ", nheads, "      ", ntails,"      ", prob_head))

 }

Run the above code file in R with "source(cointoss.r)" in command line. Results similar to the following results are printed: (It won't be same numbers everytime due to randomness).


[1] "trials   heads  tails  P(head)"
[1] "10        4        6        0.4"
[1] "100        49        51        0.49"
[1] "1000        512        488        0.512"
[1] "10000        4949        5051        0.4949"
[1] "100000        50191        49809        0.50191"
[1] "1000000        500186        499814        0.500186"

From the printout above, we see that as N increases, p(H) stabilizes arounf 0.5, as expected. The p(H) will tend to 0.5 as N tends to infinitly large number of events.