2.3 Probability

2.3.1 Random variables

2.3.1.1 Probability spaces

Before beginning any serious study of data science, it is critical to understand the basics of the random processes which created the data itself. This will allow us to understand the statistics we create as well as when those statistics are likely to fail.

Essentially we define a probability space as a collection of events \(\{E\}\) and their associated probabilities \(\text{Pr}[E]\).

A random variable \(X\) is a function which outputs different numerical values based on the events that happen in the probability space. For instance, when a die is rolled, whichever side faces up is labeled with a number \(1\) to \(6\). Thus \(X\) could be that number and the sample space, \(\Omega\), would be the integers \(1, 2, 3, 4, 5,\) and \(6\). The sample space could be a set of discrete numbers (like \(1\) to \(6\)) or it could be a continuous set like the real numbers \((-\infty,\infty)\) for something like a normal random variable.

2.3.1.2 Density functions

When it comes to random variables, we have four main categories: discrete, continuous, mixed, and singular. Discrete random variables have probability mass functions. Continuous random variables have probability density functions. Mixed random variables are combinations of discrete and continuous random variables and will be discussed in ??. We will not be concerned with singular random variables as they are mathematical curiosities, but not directly applicable to data science at the time of writing.

The probability density function (PDF) is a function which describes where the concentration of probability for the random variable lies: \[\text{Pr}[a < X\leq b] = \int_a^bf(x)dx\] Hence, the function must integrate to one \(\int_{-\infty}^\infty f(x)dx=1\) so that the total probability is one. The cumulative distribution function (CDF) is the integral of the PDF: \[F(t)=\int_{-\infty}^t f(x)dx=\text{Pr}[X\leq t]\]
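
As a quick sanity check, we can verify these properties numerically with R's built-in integrate() function. The density below is a toy example of my own choosing (f(x) = 2x on [0, 1]), not one from the text:

f <- function(x) 2 * x            # a toy density on [0, 1]
integrate(f, 0, 1)$value          # total probability: should be 1
integrate(f, 0.25, 0.75)$value    # Pr[0.25 < X <= 0.75] = 0.75^2 - 0.25^2 = 0.5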

For discrete random variables, there is no probability density function. Instead, we have a probability mass function (PMF) which behaves similarly: \[\text{Pr}[a\leq X\leq b] = \sum_{x=a}^b f(x)\] Just like the PDF must integrate to one, the PMF must sum to one, i.e., \(\sum_{k\in \Omega}f(k)=1\). The CDF for a discrete random variable is: \[F(t)=\sum_{x\leq t} f(x)=\text{Pr}[X\leq t]\]

As we have mentioned, a distribution may or may not have a PDF or a PMF. That said, all random variables must have a CDF. The CDF is guaranteed, but the PDF or PMF is optional.

2.3.1.3 Distribution moments

The first two moments of a distribution are its mean and variance. The mean or expected value of any continuous variable is \[\text{E}[X]=\int_{-\infty}^\infty x\;f(x)\;dx\] The expected value or mean is a measure of the center of the distribution. It is an average of the random variable weighted by the probability density function. If the variable is discrete, then \(\text{E}[X]=\sum_{x\in\Omega} x\;f(x)\). The expected value of any function of a random variable is:

\[\text{E}[g(X)]=\int_{-\infty}^\infty g(x)\;f(x)\;dx\] This means that the expected value of any linear function of a random variable can be found as

\[\text{E}[aX+b]=\int_{-\infty}^\infty (a X + b)\;f(x)\;dx=a\text{E}[X]+b\]

If we have two random variables (we will discuss joint probability later), then

\[\text{E}[X+Y]=\text{E}[X]+\text{E}[Y]\]

The variance of a random variable represents the spread of that variable. A higher variance means a larger spread. Variance is defined as \[\text{Var}[X]=\text{E}[(X-\text{E}[X])^2]=\text{E}[X^2]-\text{E}[X]^2\] Take special note that while the expected value passes through linear functions, it does not pass through quadratics or other non-linear functions. Otherwise the variance would always be zero, because \(\text{E}[X^2]\) would equal \(\text{E}[X]^2\).
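
We can see both facts in a quick simulation. Returning to the die example from earlier (the sample size is arbitrary):

set.seed(1)
x <- sample(1:6, 1e5, replace = TRUE)   # many rolls of a fair die
mean(x)                  # close to E[X] = 3.5
mean(x^2)                # close to E[X^2] = 91/6, not E[X]^2 = 12.25
mean(x^2) - mean(x)^2    # close to Var[X] = 35/12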

The standard deviation is the square root of the variance:

\[\text{SD}[X]=\sqrt{\text{Var}[X]}\]

We often use the standard deviation because it has the same units as the original data. The variance is in squared units: if the original data were measured in meters, the variance would be in square meters. This is awkward because square meters only make sense if we are talking about area, and the problem may have nothing to do with area. Taking the square root puts the spread back in the units of the original data, so the standard deviation is usually easier to interpret.

It is possible for the moments of a distribution to not exist. For instance, if the integral of \(x^2\;f(x)\;dx\) from \(-\infty\) to \(\infty\) diverges, then the distribution is said to have no variance. Such a variable is so spread out that its variance is not even defined. This kind of variable is extremely uncertain and is one way to model risky phenomena. Most random variables have a mean and a variance, and for the most part we are going to assume that our variables have these defined. When this assumption is not true, we are going to have to change our approach.
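
The Cauchy distribution is a standard example of this (my choice of example; the text does not name one). Its sample averages never settle down no matter how much data we collect:

set.seed(1)
x <- rcauchy(1e5)                       # a distribution with no mean or variance
sapply(c(100, 1000, 10000, 100000),
       function(n) mean(x[1:n]))        # running averages keep jumping around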

2.3.1.4 Higher moments

2.3.1.5 Common distributions

2.3.1.5.1 Uniform distribution

The two most important distributions we are going to use for this class are the uniform distribution and the normal distribution. The uniform distribution is where the probability is evenly distributed in the range of the output. To indicate that a variable has a uniform distribution we write \(X \sim U[a,b]\). This indicates that \(X\) has a uniform \([a,b]\) distribution. This distribution has two “parameters,” \(a\) and \(b\). Distributions can have any number of parameters. Here \(a\) represents the lower bound of the distribution and \(b\) represents the upper bound, so the range is from \(a\) to \(b\). The PDF for this distribution is: \[ f(x) = \begin{cases} \frac{1}{b-a},& \text{if } x\in [a,b]\\ 0, & \text{otherwise} \end{cases} \]

The PDF is defined to be zero outside of the interval \([a,b]\) where our variable can fall. Inside the interval \([a,b]\), the density is flat, i.e., a constant value. The following figure is a plot of the uniform distribution’s PDF:
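
A minimal sketch of how to draw such a plot in R, with \(a=2\) and \(b=6\) chosen purely for illustration:

curve(dunif(x, min = 2, max = 6), from = 0, to = 8, n = 1001,
      xlab = "x", ylab = "f(x)", main = "Uniform [2, 6] PDF")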

The expected value of the uniform distribution is \[\text{E}[Y]=\int_a^b y\;f(y)\;dy=\int_a^b \frac{y}{b-a}dy=\frac{1}{2}\frac{b^2-a^2}{b-a}\]

or \(\text{E}[Y]=\frac{a+b}{2}\). This is intuitive because the mean value should be right in the center of the distribution or directly between \(a\) and \(b\). The variance of \(Y\) can be calculated similarly and is \(\text{Var}[Y]=\frac{(b-a)^2}{12}\).
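
A quick simulation check (parameter values made up for illustration) agrees with both formulas:

a <- 2; b <- 6
y <- runif(1e6, min = a, max = b)
c(mean(y), (a + b) / 2)         # simulated vs. theoretical mean
c(var(y),  (b - a)^2 / 12)      # simulated vs. theoretical variance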

2.3.1.5.2 Normal distribution

The normal distribution is another continuous distribution which is hugely important for our work. We denote that a variable \(Y\) is normally distributed by \(Y \sim \mathcal{N}(\mu,\sigma^2)\). The PDF for the normal distribution is: \[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp \Big[ -\frac{(x-\mu)^2}{2\sigma^2}\Big]\] The PDF of the normal distribution has the following shape:
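
A minimal sketch of how to draw this curve in R (using the standard normal purely for illustration):

curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 4,
      xlab = "x", ylab = "f(x)", main = "Normal PDF")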

The normal distribution has two parameters, \(\mu\) and \(\sigma^2\). The normal distribution is symmetric around \(\mu\), so all the common measures of centrality agree that \(\mu\) is at the center of the distribution. Hence, we can show that \[\text{E}[Y]=\int_{-\infty}^{\infty} y\;f(y)\;dy=\mu\] The derivation of this is omitted from this book not because it is difficult calculus (it isn’t) but because we are going to use this value often as a reference. We know what the mean of normal numbers is. We don’t need to calculate it to be able to use it.

The \(\sigma^2\) parameter represents the spread of the distribution. When \(\sigma^2\) is large, the distribution has a wider shape. When \(\sigma^2\) is smaller, the distribution is more concentrated around \(\mu\). The variance of the normal distribution is \(\sigma^2\), i.e., \(\text{Var}[Y]=\sigma^2\). Thus we can say that the normal distribution is parameterized by its mean and variance, which is to say that if we know those two numbers, we know the particular distribution uniquely.

A special case of the normal distribution is the standard normal distribution which is defined where \(\mu=0\) and \(\sigma^2=1\). This specific normal distribution is “standardized” meaning that it has mean zero and variance one.
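
In R, probabilities and quantiles for the standard normal come from pnorm() and qnorm(). A few illustrative values:

pnorm(1.96)             # Pr[Z <= 1.96], about 0.975
pnorm(1) - pnorm(-1)    # Pr[-1 < Z <= 1], about 0.683
qnorm(0.975)            # the inverse: the z-value with 97.5% below it, about 1.96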

2.3.1.5.3 Log-normal distribution

\[\ln Y \sim \mathcal{N}(\mu,\sigma^2)\] \[f(x) = \frac{1}{x\sqrt{2\pi\sigma^2}}\exp{\Big( -\frac{(\ln x-\mu)^2}{2\sigma^2}\Big)}\] \[\text{E}[Y]=\exp\Big(\mu+\frac{\sigma^2}{2}\Big)\] \[\text{Var}[Y]=\big[\exp(\sigma^2)-1\big]\exp\big(2\mu+\sigma^2\big)\]
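
A simulation sketch (parameter values made up) matches these moment formulas:

mu <- 0; s2 <- 0.5
y <- rlnorm(1e6, meanlog = mu, sdlog = sqrt(s2))
c(mean(y), exp(mu + s2 / 2))                    # simulated vs. theoretical mean
c(var(y),  (exp(s2) - 1) * exp(2 * mu + s2))    # simulated vs. theoretical variance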

2.3.1.5.4 Binomial distribution

\[Y \sim \mathcal{B}(n,p)\] \[f(k) = \binom{n}{k} p^k(1-p)^{n-k} \; \text{for} \; k=0,1,\dots,n\] \[\text{E}[Y]=np\] \[\text{Var}[Y]=np(1-p)\]
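
In R the PMF is dbinom(). A short check with made-up n and p:

n <- 10; p <- 0.3
sum((0:n) * dbinom(0:n, size = n, prob = p))   # E[Y] = np = 3
var(rbinom(1e6, size = n, prob = p))           # close to np(1-p) = 2.1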

2.3.1.5.5 Poisson distribution

\[Y\sim \text{Poisson}(\lambda)\] \[f(k)=\frac{\lambda^k e^{-\lambda}}{k!} \; \text{for} \; k=0,1,\dots\] \[\text{E}[Y]=\lambda\] \[\text{Var}[Y]=\lambda\]
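
In R the PMF is dpois(). A short check with a made-up lambda:

lam <- 4
y <- rpois(1e6, lambda = lam)
c(mean(y), var(y))          # both close to lambda = 4
dpois(0:3, lambda = lam)    # PMF values f(0), ..., f(3)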

2.3.1.5.6 Gamma distribution

\[Y\sim \text{Gamma}(\alpha,\beta)\] \[f(x)=\frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)} \; \text{for} \; x\in[0,\infty)\] \[\text{E}[Y]=\frac{\alpha}{\beta}\] \[\text{Var}[Y]=\frac{\alpha}{\beta^2}\]
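
In R the density is dgamma(); its rate argument matches \(\beta\) here (a scale parameterization is also available). A quick check with made-up parameters:

a <- 3; b <- 2
y <- rgamma(1e6, shape = a, rate = b)
c(mean(y), a / b)        # close to alpha/beta = 1.5
c(var(y),  a / b^2)      # close to alpha/beta^2 = 0.75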

2.3.1.5.7 Exponential distribution

\[Y\sim \text{Exponential}(\lambda)\] \[f(x)=\lambda e^{-\lambda x} \; \text{for} \; x\in[0,\infty)\] \[\text{E}[Y]=\frac{1}{\lambda}\] \[\text{Var}[Y]=\frac{1}{\lambda^2}\]
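
In R the density is dexp(). A quick check with a made-up rate:

lam <- 2
y <- rexp(1e6, rate = lam)
c(mean(y), 1 / lam)      # close to 1/lambda = 0.5
c(var(y),  1 / lam^2)    # close to 1/lambda^2 = 0.25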

2.3.1.5.8 Student t-distribution

\[Y\sim t_r\] \[f(x)=\frac{\Gamma(\frac{r+1}{2})}{\sqrt{r\pi}\,\Gamma(\frac{r}{2})}\Big(1+\frac{x^2}{r}\Big)^{-\frac{r+1}{2}}\] \[\text{E}[Y]=0 \;\text{if}\; r>1\] \[\text{Var}[Y]=r/(r-2) \;\text{if}\; r>2\]
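
In R the density is dt() with df degrees of freedom. A quick check with a made-up r:

r <- 10
y <- rt(1e6, df = r)
c(mean(y), var(y), r / (r - 2))   # mean near 0, variance near r/(r-2) = 1.25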

2.3.1.5.9 Chi-squared distribution

\[Y\sim \chi^2_r\] \[Y=\sum_{i=1}^r X_i^2 \; \text{where}\; X_i\sim \mathcal{N}(0,1)\] \[f(x)=\begin{cases} \frac{x^{r/2-1}\exp(-x/2)}{2^{r/2}\Gamma(r/2)},& \text{if } x>0\\ 0, & \text{otherwise} \end{cases}\] \[\text{E}[Y]=r\] \[\text{Var}[Y]=2r\]
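
A quick simulation of the definition (a sum of squared standard normals) with a made-up r:

set.seed(1)
r <- 5
y <- colSums(matrix(rnorm(r * 1e5), nrow = r)^2)   # 1e5 draws of a chi-squared(r)
c(mean(y), var(y))                                 # close to r = 5 and 2r = 10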

2.3.1.5.10 Bernoulli random walk distribution

\[Y_t=\sum_{i=1}^t X_i\;\text{where}\; X_i\sim 2\mathcal{B}(1,0.5)-1\]

\[f(k)=\begin{cases} \frac{1}{2^t}\binom{t}{\frac{t+k}{2}},& \text{if } t+k \;\text{is even and}\; |k|\leq t\\ 0, & \text{otherwise} \end{cases}\]

\[\text{E}[Y_t]=0\] \[\text{Var}[Y_t]=\sum_{i=1}^t \text{Var}[X_i] = t\]
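
A small simulation sketch of such a walk (the number of steps is arbitrary):

set.seed(1)
t_max <- 100
steps <- sample(c(-1, 1), t_max, replace = TRUE)   # each X_i = 2*Bernoulli(0.5) - 1
y     <- cumsum(steps)                             # Y_1, ..., Y_t
plot(y, type = "l", xlab = "t", ylab = "Y_t")
var(replicate(1000, sum(sample(c(-1, 1), t_max, replace = TRUE))))   # near t_max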

2.3.1.6 Sample statistics

When we have a sample of data, we are going to think of the data as multiple observations of the same variable. For instance, we might have a sample \(x_1,\dots,x_n\), which we index as \(x_i\) where \(i=1,\dots,n\). From this sample, we can calculate the sample mean. The sample mean is different from the mean of random variables we described above. That was a population mean, or what we would expect from the random variable if we knew the distribution already. The sample mean can be found by averaging the points that we observe, or:

\[\overline{x}=n^{-1}\sum\nolimits_{i=1}^n x_i\]

In the statistics section, we will discuss the distribution of the sample mean. Similar to the sample mean, we can calculate the sample variance which we will call \(s^2\):

\[s^2=(n-1)^{-1}\sum\nolimits_{i=1}^n (x_i-\overline{x})^2\]

Taking the square root of \(s^2\) gives us \(s\) which is the sample standard deviation.
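
In R, mean(), var(), and sd() compute these sample statistics directly. As a quick illustration using one column of the churn data used throughout this chapter:

x <- churn$length    # any numeric sample would do
mean(x)              # sample mean
var(x)               # sample variance s^2 (divides by n - 1)
sd(x)                # sample standard deviation s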

2.3.1.7 Order statistics

2.3.2 Joint distributions

When we have more than one variable in our data, the interaction between these variables can be modeled with joint probability. This is a complex subject, but the results can be very interesting. For two continuous variables, there is a joint probability density function or joint PDF: \(f(x,y)\). This is a function of two variables. It takes two inputs \(x\) and \(y\) and outputs one probability density such that: \[\int_{\Omega_x}\int_{\Omega_y}f(x,y)\;dy\;dx=1\]

We can visualize a joint distribution as follows:
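
A rough sketch of such a visualization in R, using two independent standard normals purely for illustration:

x <- seq(-3, 3, length.out = 60)
y <- seq(-3, 3, length.out = 60)
z <- outer(x, y, function(a, b) dnorm(a) * dnorm(b))   # joint density f(x, y)
contour(x, y, z, xlab = "x", ylab = "y")               # persp(x, y, z) also works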

In general, we can have up to \(r\) variables in a joint distribution, \(f(x_1,\dots,x_r)\). Obviously this is more complicated and cannot be visualized in a three dimensional world.

2.3.2.1 Marginal distributions

Every joint distribution is going to have marginals. Marginals are basically the individual PDFs for each of the variables in a joint density. So if we have two variables, \(x\) and \(y\), then we are going to have two marginal distributions, one for \(x\) and one for \(y\): \[f_x(x)=\int_{\Omega_y}f(x,y)\;dy\] \[f_y(y)=\int_{\Omega_x}f(x,y)\;dx\]

As you can see, we get back to the marginal by integrating over the other variable. This means that we are ignoring the other variables when we think about the marginal density of \(y\).

2.3.2.2 Conditional distributions

If we knew the value of \(x\), then we could use that information to predict \(y\). This is the principle behind conditional distributions. In the world where \(x\) is known, where should we expect \(y\) to fall? That means that the conditional probability is a function of the known \(x\) value. We define the conditional probability of \(y\) given \(x\) as follows:

\[f_{y|x}(y|x)=\frac{f(x,y)}{f_x(x)}\]

Notice that this conditional density must integrate to one: \[\int_{\Omega_y} f_{y|x}(y|x)\;dy=1\]

We can also turn this around and find the conditional probability of \(x\) given \(y\): \[f_{x|y}(x|y)=\frac{f(x,y)}{f_y(y)}\]

This leads us into Bayes’ theorem.

2.3.2.3 Bayes’ theorem

Bayes’ theorem is fundamentally a consequence of the definition of conditional distributions. We know that \(f_{x|y}(x|y)f_y(y)=f(x,y)\) and we know \(f_{y|x}(y|x)f_x(x)=f(x,y)\). Therefore, we can write \(f_{x|y}(x|y)f_y(y)=f_{y|x}(y|x)f_x(x)\) or

Theorem 1 (Bayes’ theorem) Given two random variables \(x\) and \(y\) with joint probability \(f(x,y)\), \[f_{x|y}(x|y)=\frac{f_{y|x}(y|x)f_x(x)}{f_y(y)}\]

One primary application of this idea is for understanding the probability of having a disease given a positive test: \[\text{Pr}[\text{disease|test positive}]=\frac{\text{Pr}[\text{positive test|disease}]\text{Pr}[\text{disease}]}{\text{Pr}[\text{positive test}]}\] or in other words, the probability of having the disease given that you tested positive is the probability of testing positive given that you have the disease, times the probability of having the disease in general, divided by the probability of having a positive test. This allows us to update the probability that you have the disease if we know that you have a positive test result. This is called Bayesian updating.

Consider the following example: Let’s suppose that the probability of testing positive if you have the disease is 1. Also suppose that the false positive rate is \(1/100\). Let’s also suppose that the prevalence of the disease in the overall population is \(1/1000\). Now today you take the test and get a positive result. Importantly, I’m assuming that there’s absolutely no other reason to believe that you have this disease. If there were some other symptom, then you shouldn’t be compared to the average person in the population, but rather to other people with the symptom, and that would change the situation entirely. The prior probability of having the disease is \(1/1000\). The probability of testing positive for the disease given that you have the disease is 1. But the probability of a positive test overall is \(\text{Pr}[\text{positive test}]=\text{Pr}[\text{positive test|disease}]\text{Pr}[\text{disease}]+\text{Pr}[\text{positive test|no disease}]\text{Pr}[\text{no disease}]=1\times 1/1000+1/100\times 999/1000=0.011\). Therefore, the probability that you have the disease is \(\text{Pr}[\text{disease|positive test}]=0.091\), or a little less than \(10\%\). Even though you tested positive, because the false negative rate is \(0\) and the false positive rate is only \(1/100\), you’re probably fine. This example is why good doctors won’t test you for diseases without some other indication that you have the illness.
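
The arithmetic in this example is easy to reproduce in R:

sens <- 1            # Pr[positive test | disease]
fpr  <- 1 / 100      # Pr[positive test | no disease]
prev <- 1 / 1000     # Pr[disease]
p_pos <- sens * prev + fpr * (1 - prev)   # Pr[positive test], about 0.011
sens * prev / p_pos                       # Pr[disease | positive test], about 0.091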

2.3.2.4 Covariance and correlation

The covariance is defined to be similar to the variance. Essentially we are looking at the connection between two random variables. When \(X\) has a large deviation from its mean, what is the average deviation in \(Y\)?
Definition 1 (Covariance) \[\text{Cov}[X,Y]=\text{E}[(X-\text{E}[X])(Y-\text{E}[Y])]\]
Correlation is defined to be the covariance divided by the product of the standard deviations of the two variables:
Definition 2 (Correlation) \[\text{Corr}[X,Y]=\frac{\text{Cov}[X,Y]}{\text{SD}[X]\text{SD}[Y]}\]

We can calculate the sample covariance and sample correlation as follows: \[s_{x,y}=(n-1)^{-1}\sum\nolimits_{i=1}^n (x_i-\overline x)(y_i-\overline y)\] \[r_{x,y}=\frac{s_{x,y}}{s_x s_y}\]

To calculate the sample covariance and sample correlation in R, we can use the cov() and cor() functions:

cov(churn$length,churn$cs_calls)     # Sample covariance
## [1] -0.1988526
cor(churn$length,churn$cs_calls)     # Sample correlation
## [1] -0.003795939

2.3.2.5 Independence

Independence is an extremely important concept with a definition that is not particularly intuitive:

Definition 3 (Independence) Two random variables are said to be independent if their joint distribution can be written as the product of their marginal distributions: \[f(x,y)=f_x(x)f_y(y)\]

This condition is extremely powerful when true. When it is, any expected value involving the two variables can be calculated by taking the expected value of the \(X\) part and the \(Y\) part separately, i.e., \(\text E[g(X)h(Y)]=\text E[g(X)]\,\text E[h(Y)]\). This implies, among other things, that: \[\text{Cov}[X,Y]=\text E[(X- \text E[X])(Y-\text E[Y])]=(\text E[X-\text E[X]])(\text E[Y-\text E[Y]])=0\] This means that whenever we have independent variables, they must be uncorrelated. In other words, if we see a correlation, that implies some kind of dependence. We will discuss circumstances later where this is not true, but in the classic statistical framework, this is the implication. Independence is an extremely powerful assumption with many unintuitive consequences. We will be discussing these consequences in this course. When independence is not present, our standard models can result in interesting failures. We will be discussing those failures as well.
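
A quick simulation illustrates the point: two variables generated separately (and therefore independent) have a sample correlation near zero:

set.seed(1)
x <- rnorm(1e5)
y <- rnorm(1e5)   # generated separately from x, so independent of it
cor(x, y)         # close to zero, as independence implies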

2.3.2.6 Independent identically distributed

When we are working on data, our standard model is going to be that the data is independent and identically distributed (iid) with some mean (\(\mu\)) and some variance (\(\sigma^2\)), or \[x_i \sim iid(\mu,\sigma^2) \;\text{for}\;i=1,\dots,n\]

What this assumption says is that each row of our data should be completely unrelated to its neighbors and that every row represents a draw from the same distribution. This means that if the marginal PDF of \(x_1\) is \(f(x_1)\), then the joint PDF can be written: \[f(x_1,\dots,x_n)=\prod_{i=1}^n f(x_i).\]

This is very convenient mathematically. Also note that the assumption implies that the distribution from which the data is drawn has some finite mean and variance, even if we have no idea what the mean, variance, and distribution of the data are.
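
As a tiny illustration, with a made-up sample treated as iid standard normal draws, the joint density is just the product of the marginal densities:

x <- c(0.2, -1.1, 0.7)                      # a made-up iid N(0, 1) sample
prod(dnorm(x))                              # joint density f(x1, ..., xn)
dnorm(x[1]) * dnorm(x[2]) * dnorm(x[3])     # the same product written out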

2.3.2.7 Matrix moments

\[\mathbf{A} = \begin{bmatrix}a_{ij}\end{bmatrix}\] \[\text{E}[\mathbf{A}] = \begin{bmatrix}\text{E}[a_{ij}]\end{bmatrix}\] \[\text{Var}[x]=\text{E}[(x-\mu)(x-\mu)'] \;\text{where}\; \mu=\text{E}[x]\]

2.3.2.8 Multivariate normal distribution

The distribution featured above was a multivariate normal distribution. If you know matrix algebra, then we can express the multivariate normal distribution as follows: \[x \sim \mathcal{N}_r(\mu,\Sigma)\]

Here \(x\) and \(\mu\) are \(r\times1\) vectors and \(\Sigma\) is an \(r\times r\) matrix. Using these, we can define the multivariate normal PDF as follows: \[f(x)=(2\pi)^{-r/2}|\Sigma|^{-1/2}\exp\Big[-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\Big].\]

Once we have covered covariance and matrix moments, we can explain the intuition behind the \(\mu\) and \(\Sigma\) parameters. Effectively, \(\mu\) is the mean of the distribution, and \(\Sigma\) is the variance-covariance matrix of the \(x\) variables.
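
To make the notation concrete, the following sketch evaluates this PDF directly from the formula in base R, with a made-up mean vector and covariance matrix:

dmvn <- function(x, mu, Sigma) {
  r <- length(mu)
  d <- x - mu
  as.numeric((2 * pi)^(-r / 2) * det(Sigma)^(-1 / 2) *
               exp(-0.5 * t(d) %*% solve(Sigma) %*% d))
}
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
dmvn(c(0, 0), mu, Sigma)    # density evaluated at the mean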

2.3.2.9 Conditional normal distribution

\[x_1|x_2\sim \mathcal{N}_{r_1}(\overline\mu,\overline\Sigma)\] \[\overline\mu=\mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\] \[\overline\Sigma=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\]

2.3.2.10 Cumulants

2.3.3 Group-by

Most questions people typically have about data can be answered with summary statistics alone. Suppose we want to know how our sales change seasonally. To answer the question, we just have to group our sales by the month (or even quarter) and add up the sales numbers. Do this for each of the last 3 years and, with 12 numbers per year, we can easily see how things change month-to-month. This information could be displayed in a basic figure or even in a simple table. Aggregating data like this is known as a “group by” statement, following notation from SQL. Effectively we are grouping by the month and then calculating a sum or mean of the variables in question. Being able to think and write clever group-by statements to answer data questions is an art that gets better with practice.

2.3.3.1 Referencing data.tables

The general notation for a data.table is dt[row,column,by=group] so if we want to call the first row:

churn[1,]
##    length intl_plan vm_plan cs_calls churn  bill  mins calls
## 1:    128         0       1        1     0 75.56 717.2   303

And if we want the first column:

churn[,1]
##       length
##    1:    128
##    2:    107
##    3:    137
##    4:     84
##    5:     75
##   ---       
## 3329:    192
## 3330:     68
## 3331:     28
## 3332:    184
## 3333:     74

And if we want the first element overall:

churn[1,1]
##    length
## 1:    128

In the row area, we can subset the data set where a particular condition is true. So if we wanted only the people who called customer service more than 7 times, we would type churn[cs_calls>7,] which produces:

##    length intl_plan vm_plan cs_calls churn  bill  mins calls
## 1:    152         1       1        9     1 77.68 770.3   297
## 2:     64         0       1        8     0 64.67 617.4   303
## 3:    103         0       0        8     1 52.73 551.3   363
## 4:     53         0       0        9     1 63.10 598.7   366

If we want to then find the mean of churn for the people with more than 7 customer service calls:

churn[cs_calls>7,mean(churn)]
## [1] 0.75

To group by a particular variable we use two commas in the call with by= set to the variable we wish to group by. Here, I am going to group by customer service calls and find the mean of churn for people who called customer service 0 times, 1 time, 2 times, etc:

churn[,mean(churn),by=cs_calls]
##     cs_calls        V1
##  1:        1 0.1033023
##  2:        0 0.1319943
##  3:        2 0.1146245
##  4:        3 0.1025641
##  5:        4 0.4578313
##  6:        5 0.6060606
##  7:        7 0.5555556
##  8:        9 1.0000000
##  9:        6 0.6363636
## 10:        8 0.5000000

To find out how many people are in each group, we can use the .N variable:

churn[,.N,by=cs_calls]
##     cs_calls    N
##  1:        1 1181
##  2:        0  697
##  3:        2  759
##  4:        3  429
##  5:        4  166
##  6:        5   66
##  7:        7    9
##  8:        9    2
##  9:        6   22
## 10:        8    2

Often it is helpful to compute two or more values for each group. The .() function can be used to combine multiple expressions in a data.table:

churn[,.(.N,mean(churn)),by=cs_calls]  
##     cs_calls    N        V2
##  1:        1 1181 0.1033023
##  2:        0  697 0.1319943
##  3:        2  759 0.1146245
##  4:        3  429 0.1025641
##  5:        4  166 0.4578313
##  6:        5   66 0.6060606
##  7:        7    9 0.5555556
##  8:        9    2 1.0000000
##  9:        6   22 0.6363636
## 10:        8    2 0.5000000

Because the numbers that come out are not ordered, we might want to impose an order on the rows by sorting:

churn[order(cs_calls),.(.N,mean(churn)),by=cs_calls]
##     cs_calls    N        V2
##  1:        0  697 0.1319943
##  2:        1 1181 0.1033023
##  3:        2  759 0.1146245
##  4:        3  429 0.1025641
##  5:        4  166 0.4578313
##  6:        5   66 0.6060606
##  7:        6   22 0.6363636
##  8:        7    9 0.5555556
##  9:        8    2 0.5000000
## 10:        9    2 1.0000000

It is unfortunate that data.table operations like this replace the column names with default names (V1, V2, and so on). To fix this we can save our result as a new variable and then rename the columns:

outp <- churn[order(cs_calls),.(.N,mean(length),mean(bill),mean(churn)),by=cs_calls]
names(outp) <- c('cs_calls','N','length','bill','churn')
outp
##     cs_calls    N    length     bill     churn
##  1:        0  697 101.30273 59.90430 0.1319943
##  2:        1 1181 101.77985 59.46177 0.1033023
##  3:        2  759  99.22530 58.89424 0.1146245
##  4:        3  429 101.43357 59.83657 0.1025641
##  5:        4  166 102.66265 60.22337 0.4578313
##  6:        5   66 102.56061 58.14697 0.6060606
##  7:        6   22  90.18182 54.23636 0.6363636
##  8:        7    9 116.11111 56.84444 0.5555556
##  9:        8    2  83.50000 58.70000 0.5000000
## 10:        9    2 102.50000 70.39000 1.0000000
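
One alternative worth knowing: if we name the expressions inside .(), data.table uses those names and no renaming is needed. A sketch with the same columns:

churn[order(cs_calls),
      .(N = .N, length = mean(length), bill = mean(bill), churn = mean(churn)),
      by = cs_calls]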

Because this process is tedious and we often just want the mean of the variables, we can use the lapply function applied to each column of .SD (the “Subset of Data” for each group).

churn[,lapply(.SD,mean),by=cs_calls]
Table 1: Variable means grouped by customer service calls

cs_calls     length  intl_plan    vm_plan      churn      bill      mins     calls
       1  101.77985  0.0948349  0.2912786  0.1033023  59.46177  592.1865  306.2159
       0  101.30273  0.1190818  0.2769010  0.1319943  59.90430  595.4574  304.9484
       2   99.22530  0.0816864  0.2832675  0.1146245  58.89424  588.2964  305.5599
       3  101.43357  0.0885781  0.2284382  0.1025641  59.83657  593.7618  302.2331
       4  102.66265  0.1265060  0.2530120  0.4578313  60.22337  595.6663  305.3434
       5  102.56061  0.0909091  0.2575758  0.6060606  58.14697  575.9788  303.1364
       7  116.11111  0.0000000  0.2222222  0.5555556  56.84444  586.5111  310.8889
       9  102.50000  0.5000000  0.5000000  1.0000000  70.39000  684.5000  331.5000
       6   90.18182  0.0000000  0.4090909  0.6363636  54.23636  560.3182  292.4091
       8   83.50000  0.0000000  0.5000000  0.5000000  58.70000  584.3500  333.0000

This is a fairly advanced function, so don’t be discouraged if this seems complicated. We will discuss the lapply function later. For now you just need to know that it can be used to calculate the mean across many variables in a data.table. We can also apply the variance (var) or standard deviation (sd) function instead:

churn[,lapply(.SD,sd),by=cs_calls]
Table 2: Variable standard deviations grouped by customer service calls

cs_calls    length  intl_plan    vm_plan      churn       bill       mins     calls
       1  39.95407  0.2931109  0.4545441  0.3044822  10.474178   90.39080  34.73712
       0  39.51951  0.3241173  0.4477885  0.3387276  10.870004   94.17884  34.87746
       2  41.26656  0.2740670  0.4508823  0.3187783  10.178406   86.36005  33.80545
       3  39.56490  0.2844655  0.4203166  0.3037429  10.311986   88.24242  34.19470
       4  35.08130  0.3334246  0.4360532  0.4997261  10.896284   90.46204  35.15329
       5  36.14081  0.2896827  0.4406501  0.4923660  11.463548   93.87850  34.24702
       7  23.53956  0.0000000  0.4409586  0.5270463   8.525984   86.41013  29.71298
       9  70.00357  0.7071068  0.7071068  0.0000000  10.309617  121.33952  48.79037
       6  46.41596  0.0000000  0.5032363  0.4923660   7.964978   62.79750  25.06118
       8  27.57716  0.0000000  0.7071068  0.7071068   8.442855   46.73976  42.42641

Here we are seeing the standard deviation of each variable for each of the groups based on number of customer service calls.

2.3.3.2 Group-bys with continuous variables

Sometimes we want to perform group-by statements, but our data is continuous. When we group by a continuous variable, we get the following result:

churn[order(length),.(.N,mean(churn)),by=length]
##      length N    V2
##   1:      1 8 0.125
##   2:      2 1 1.000
##   3:      3 5 0.000
##   4:      4 1 0.000
##   5:      5 1 0.000
##  ---               
## 208:    221 1 0.000
## 209:    224 2 0.500
## 210:    225 2 0.500
## 211:    232 1 0.000
## 212:    243 1 0.000

You can see here that length takes too many values and so the outcome isn’t clear. One possible solution to this is just to reverse the process:

churn[order(churn),.(.N,mean(length)),by=churn]
##    churn    N       V2
## 1:     0 2850 100.7937
## 2:     1  483 102.6646

While this works here, it will not work in cases where our target variable (e.g., churn$churn) is also continuous. To deal with this we are going to “bin” the length variable. First we need to define the bins. Consider the following sequence command:

seq(0,250,25)
##  [1]   0  25  50  75 100 125 150 175 200 225 250

This is going to be used to define the bins. We are going to cut() the length variable using these bins (include.lowest is just a parameter which specifies whether the lowest bin should be closed, so that values equal to the lowest break, zero here, are included):

churn$length_bins <- cut(churn$length,seq(0,250,25),include.lowest=TRUE)
table(churn$length_bins)
## 
##    [0,25]   (25,50]   (50,75]  (75,100] (100,125] (125,150] (150,175] 
##        99       242       548       775       787       526       239 
## (175,200] (200,225] (225,250] 
##        91        24         2

Now you can see that our data has been put into the bins we wanted, and we can group by:

churn[order(length_bins),lapply(.SD,mean),by=length_bins]

Table 3: Variable means grouped by length of service

length_bins     length  intl_plan    vm_plan  cs_calls      churn      bill      mins     calls
     [0,25]   14.77778  0.0808081  0.2727273  1.676768  0.1515152  59.97808  598.4828  297.0707
    (25,50]   39.73967  0.0661157  0.2479339  1.487603  0.1198347  59.40798  587.9479  305.6157
    (50,75]   64.45803  0.0930657  0.2846715  1.500000  0.1350365  59.64038  594.5024  304.7409
   (75,100]   88.69806  0.0993548  0.2683871  1.596129  0.1496774  58.95072  589.8564  305.0697
  (100,125]  112.85896  0.0889454  0.2858958  1.668361  0.1537484  59.49868  591.8094  304.1474
  (125,150]  136.85741  0.1311787  0.3098859  1.484791  0.1406844  59.69844  593.4677  308.0304
  (150,175]  161.17573  0.0920502  0.2217573  1.489540  0.1506276  60.33498  594.3397  306.9874
  (175,200]  184.96703  0.0989011  0.2857143  1.395604  0.1318681  58.13451  580.3571  303.3297
  (200,225]  211.25000  0.0416667  0.1666667  1.833333  0.2500000  59.47750  601.1167  303.6250
  (225,250]  237.50000  0.0000000  0.0000000  1.500000  0.0000000  48.56500  510.8000  289.0000