Back to questions

Given two vectors, represented as lists X and Y, return the Pearson Correlation Coefficient.

p.s. this is the same as question 9.8 in Ace the Data Science Interview.

p.p.s. AQR is a competitive hedge fund so they test their Quants & Data Science on both coding & math/stats skills, so they expect you to know the Pearson correlation coefficient formula.

But, if you don't know the formula, don't worry – just use the hints below to learn more about Pearson correlation 👇

Correlation is a statistical measure used to assess the degree of similarity between two sets of vectors in a multi-dimensional space.

Here's the formula for the Pearson correlation coefficient: $Corr[X, Y ] = \frac{Cov[X, Y ]}{σ[X]*σ[Y ]}$

The numerator is the covariance of X and Y, and the denominator is the product of the standard deviation of X and the standard deviation of Y.

To find the covariance of vector X and Y, use this formula:

${Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$

In this formula:

- $n$ is the number of data points in the vectors.
- $X_i$ and $Y_i$ are individual data points in vectors $X$ and $Y$, respectively.
- $\bar{X}$ and $\bar{Y}$ represent the means (averages) of vectors $X$ and $Y$, respectively.

We'll first start by creating two helper functions for calculating mean and standard deviation.

Using these helper functions, here's the full solution:

Calculating both the mean and standard deviation is O(N) runtime (since both take a single sum across all N elements).

Therefore, the correlation runtime is O(N), since we calculate a few means and standard deviations, as well as iterate over all N elements once through.

The space complexity is O(N) to keep track of N elements in the correlation to be processed.

Python