uncertainty: Admission Bias?

Update: I realized there was a coding error in my previous version, and I have corrected it. The result is now not as dramatic, but I think it is still striking. I mistakenly plotted the probability of admission conditional on wealth, when what we are interested in is the implied share of the student population at each wealth level. I apologize for my mistake.

Motivation:

A friend of mine emailed me a note she wrote on Harvard's admissions. The main finding is that the admitted students come disproportionately from wealthy families, suggesting that, despite the need-blind policy, there is still a selection bias towards the rich. For example, the author found:

About 150 students come from family with less than \$20,000 income, that is about 2.2\% in Harvard’s 6,700 undergraduate population. In the whole country, there are about 20% of families with household income below \$20,000 (Census, 2012)..... 40\% of Harvard graduates presumably come from families with annual income higher than [200,000]. But in the US, only 4.5\% families make more than \$200,000. 40\% of Harvard’s students come from the top 4.5\% of America’s family income spectrum.

While I do agree that certain selection criteria, such as so-called leadership and versatility, are unfair to the poor, I think we need to dig deeper into the selection process and the data to get a better sense. This phenomenon could be the result of a fair admission process with no bias against the poor. What do I mean by this?

Theoretical Framework:

Let me illustrate with a model.

Let $Y_i$ be the relevant performance of an individual for the admission process. We have
\[
Y_i=\beta_x X_i+\beta_z Z_i+e_i
\]
where $X_i$ is talent, $Z_i$ is work ethic, and $e_i$ is pure luck.
Now we consider the family wealth of that student, $T_i$:
\[
T_i=\alpha_x X'_i+\epsilon'_i
\]
where $X'_i$ is the talent of the parents, and $\epsilon'_i$ is all the other factors. We would expect an inter-generational correlation of talents. We could decompose $X_i$ into the inherited part ($E(X_i|X'_i)$) and the innovation part $\xi'_i$. For simplicity, however, we will decompose it in the following way:
\[
X_i=E(X_i|T_i)+\xi_i=\delta T_i+\xi_i
\]
This means that talent is not independent of family wealth simply because
1) talent has positive inter-generational correlation;
2) higher talent means higher wealth level in expectation.
Now we can write the performance in the following way
\[
Y_i=\beta T_i+e'_i
\]
Note that $\beta=\beta_x\cdot \delta$ and, substituting $X_i=\delta T_i+\xi_i$ into the performance equation, $e'_i=\beta_x \xi_i+\beta_z Z_i+e_i$.

A calibration and result

Let us do some calibration. We rescale so that $\beta=1$, and let $e'_i\sim N(0,3)$ and $T_i\sim N(0,1)$, where the second argument is the variance.

Technical fuss: I know wealth does not come from a Normal distribution, but consider a function acting on wealth that makes the output normally distributed. For example, the log-normal distribution is considered a good approximation for the distribution of wealth, so instead of saying $T_i$ is wealth, we could say $T_i$ models the log of wealth, which is then normally distributed. If you work through the argument, you will find that none of this matters.

So the rest is a Bayesian exercise. Conditional on being admitted to Harvard, $Y_i>\text{cutoff}$, what is the conditional distribution of $T_i$? I calculated the cutoff and did the corresponding calculation (don't worry, I show the code and how I calculated the cutoff at the end, for transparency). Here is what I find.

The conditional distribution is shifted far to the right: conditional on admission through this fair process, we would expect most of the students to come from rich families. Here is the distribution:

The plot above shows the percentage of students admitted as a function of income. The y-axis is the percentage of admitted students relative to the total college student population; to find the percentage among admits, multiply the number by 1000. Note that the unconditional $T_i$ comes from $N(0,1)$. Notice the near absence of admits in the lower income groups, until about 1 standard deviation below the mean.
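In symbols, and assuming the calibration above ($T_i\sim N(0,1)$, $e'_i$ with variance 3, an admission rate of 0.1%), the plotted curve can be written as the joint density of wealth and admission. The notation $\phi$, $\Phi$ (standard normal density and distribution function), $c$ (the cutoff) and $g$ is introduced here for illustration:
\[
g(t)=\phi(t)\left[1-\Phi\left(\frac{c-t}{\sqrt{3}}\right)\right]
\]
Dividing by the admission rate gives the density of wealth among admits,
\[
f(t\mid Y_i>c)=\frac{g(t)}{P(Y_i>c)}=\frac{g(t)}{0.001},
\]
which is where the factor of 1000 comes from.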

Another graph. What is the probability of admission in each income group?

As you can see, as family income goes up, the chance of getting admitted increases monotonically. According to our model, this correlation is not causal (family wealth does not cause better performance); it is a result of the centrality of talent.

OK, let me give you a number in the end. What is the predicted percentage of students coming from the top 4.5% of the income distribution? 49.4238%
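For readers without Mathematica, here is a minimal Python sketch of the same calculation, using only the standard library. It is a cross-check under the calibration stated above, not the code used for this post; the names `admit_prob` and `top_share`, the midpoint-rule integration, and the upper integration limit of 10 are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Calibration taken from the text: T ~ N(0, 1), e' with variance 3, top 0.1% admitted.
ADMIT_FRAC = 1e-3
TOP_FRAC = 0.045

wealth_dist = NormalDist(0.0, 1.0)
noise_dist = NormalDist(0.0, sqrt(3.0))
# Y = T + e' has variance 1 + 3 = 4, i.e. standard deviation 2.
cutoff = NormalDist(0.0, 2.0).inv_cdf(1.0 - ADMIT_FRAC)


def admit_prob(t):
    """P(Y > cutoff | T = t): chance of admission at scaled wealth t."""
    return 1.0 - noise_dist.cdf(cutoff - t)


def top_share(steps=20000, upper=10.0):
    """P(T > threshold | Y > cutoff), by midpoint-rule integration (illustrative helper)."""
    threshold = wealth_dist.inv_cdf(1.0 - TOP_FRAC)
    h = (upper - threshold) / steps
    total = 0.0
    for i in range(steps):
        t = threshold + (i + 0.5) * h
        # Joint density of wealth t and admission.
        total += wealth_dist.pdf(t) * admit_prob(t)
    return total * h / ADMIT_FRAC


if __name__ == "__main__":
    for t in (-1.0, 0.0, 1.0, 2.0, 3.0):
        print(f"T = {t:+.0f} sd: admission probability {admit_prob(t):.6f}")
    print(f"share of admits from the top 4.5%: {top_share():.4f}")
```

The first loop corresponds to the second graph (admission probability by wealth), and `top_share()` corresponds to the number quoted above.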

Discussion

What is the point? The point is not that the education system is fair. The point is that we need to be more careful in our diagnosis. Let me give some far-fetched general comments, not validated by the work I did in this post. The problem might very well lie in the inequality in elementary and secondary education, and if that is the case, targeting the higher education admission process is unlikely to be effective. Efforts like affirmative action that aim to change the result without addressing the root of the problem are not only doomed to fail but also likely to create new problems.

I am of the minority view that in empirical work we need to think harder about the structure of the problem, as those tiny details matter for policy. We need to know the industry we are talking about, understand its structure, and use data to give as precise a picture of the industry as possible. We can gain a lot from reduced-form estimation, but in some applications, unless we force ourselves to be clear about the data generating process and the structure, and link them with data, the policy implications can be very limited.

Robustness Check and Technical addendum:

How did I choose the parameters for the random variables? Well, I did some exercises to find a reasonable range for the parameters. As it turned out, they tell similar stories, so I chose parameters that are simple for expository purposes. I hate black boxes, so I will disclose the workings, which lead to a different set of parameters (slightly more complicated), and show you that the result is basically the same.

So for the first equation

\[
Y_i=\beta_x X_i+\beta_z Z_i+e_i
\]
I set the $\beta$'s to 1 and let all three variables be iid $N(0,4)$. That is, talent accounts for only 1/3 of the variance in performance. (I am using variance 4 to reduce fractions later in the calculations.)

Then, for family wealth,
\[
T_i=\alpha_x X'_i+\alpha_\epsilon \epsilon'_i
\]
I set the $\alpha$'s to $\frac{1}{\sqrt{2}}$, and both random variables come from $N(0,4)$. That is, talent accounts for only half of the variation in wealth.
Finally, for inherited talent, I assume that only half of the variance in talent is shared across generations (a correlation of $\frac{1}{\sqrt{2}}$).
\[
X'_i=\frac{1}{\sqrt{2}} X_i+\frac{1}{\sqrt{2}}\xi_i
\]
The weird scaling is to make sure that the variance of talent stays the same across generations.
If you work through the algebra, you will find the following regression function:
\[
X_i=\frac{1}{2} T_i+\frac{\sqrt{3}}{2}\eta_i
\]
where $\eta_i\sim N(0,4)$ is a residual uncorrelated with $T_i$ (it is not the $\epsilon'_i$ of the wealth equation, which is itself part of $T_i$). Substituting in, we have
\[
Y_i=\frac{1}{2}T_i+\frac{\sqrt{3}}{2}\eta_i+Z_i+e_i
\]
We can pack the last three terms into one error with variance 11, and treat $\frac{1}{2}T_i$ as scaled wealth $\sim N(0,1)$. In essence we have
\[
Y_i=\tilde{T}_i+\sqrt{11}\tilde{\epsilon}_i
\]
where both random variables are iid standard normal.
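For readers who want the intermediate steps, here is a sketch of the algebra, using only the parameter choices above (all primitive variables independent with variance 4). Since wealth loads on parental talent with coefficient $\frac{1}{\sqrt{2}}$, and parental talent loads on the child's talent with coefficient $\frac{1}{\sqrt{2}}$,
\[
\mathrm{Cov}(X_i,T_i)=\frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{2}}\cdot\mathrm{Var}(X_i)=2,\qquad \mathrm{Var}(T_i)=\frac{1}{2}\cdot 4+\frac{1}{2}\cdot 4=4,
\]
so the regression slope is $\mathrm{Cov}(X_i,T_i)/\mathrm{Var}(T_i)=\frac{1}{2}$. The residual of that regression has variance
\[
\mathrm{Var}(X_i)-\left(\frac{1}{2}\right)^2\mathrm{Var}(T_i)=4-1=3=\left(\frac{\sqrt{3}}{2}\right)^2\cdot 4,
\]
which is the coefficient $\frac{\sqrt{3}}{2}$ on a variance-4 term. Adding the variances of $Z_i$ and $e_i$ gives the packed error variance $3+4+4=11$, and $\mathrm{Var}\left(\frac{1}{2}T_i\right)=\frac{1}{4}\cdot 4=1$.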
What is the predicted percentage of students coming from the top 4.5% of the income distribution? 22.56%.
Feel free to try different numbers.
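One way to try different numbers while checking the algebra is a small simulation of the model in this section. The Python below is a sketch, not the Mathematica used for this post; the function name `simulate`, the sample size, and the seed are assumptions made for illustration. It draws the primitive variables, builds $X'_i$, $T_i$ and $Y_i$ as defined above, and estimates the regression slope of $X_i$ on $T_i$, the residual variance, and the variance of $Y_i-\frac{1}{2}T_i$.

```python
import random
from math import sqrt


def simulate(n=100_000, seed=0):
    """Simulate the robustness-check model; return (slope, residual variance, error variance)."""
    rng = random.Random(seed)
    sd = 2.0             # every primitive variable is N(0, 4), i.e. standard deviation 2
    a = 1.0 / sqrt(2.0)  # the common 1/sqrt(2) coefficient
    xs, ts, ys = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, sd)                           # child's talent X
        x_parent = a * x + a * rng.gauss(0.0, sd)        # parents' talent X'
        t = a * x_parent + a * rng.gauss(0.0, sd)        # family wealth T
        y = x + rng.gauss(0.0, sd) + rng.gauss(0.0, sd)  # performance Y = X + Z + e
        xs.append(x)
        ts.append(t)
        ys.append(y)

    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    cov_xt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts)) / n
    var_t = sum((t - mean_t) ** 2 for t in ts) / n
    slope = cov_xt / var_t  # regression slope of X on T

    # Variance of what is left of X after removing slope * T.
    resid_var = sum(((x - mean_x) - slope * (t - mean_t)) ** 2
                    for x, t in zip(xs, ts)) / n

    # Variance of the packed error Y - T/2.
    errs = [y - 0.5 * t for y, t in zip(ys, ts)]
    mean_err = sum(errs) / n
    err_var = sum((v - mean_err) ** 2 for v in errs) / n
    return slope, resid_var, err_var


if __name__ == "__main__":
    slope, resid_var, err_var = simulate()
    print(f"slope of X on T:      {slope:.3f}")
    print(f"residual variance:    {resid_var:.3f}")
    print(f"variance of Y - T/2:  {err_var:.3f}")
```

Up to sampling noise, the three estimates should land near the values derived in this section (1/2, 3 and 11). Changing the coefficients or standard deviations inside `simulate` shows how the reduced form moves.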

Appendix:

So to make this transparent, I will show you what I did.

How do I calculate the cutoff? The number of students enrolled in Harvard is 6700, which means 6700/4 students each year. In the US, about 25 m students enroll in colleges each year. Of course there are other great universities, like Princeton, Williams (yes!), Yale, Stanford. So let me be generous, and say Harvard gets the top 0.1%. With this we can find the 99.9th percentile of $Y$, which is about 6.18. The rest is coding it up (I used Mathematica):
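(* Y = T + e' has variance 1 + 3 = 4, so its standard deviation is 2 *)
cutoff = Quantile[NormalDistribution[0, 2], 1 - 10^(-3)];
N[cutoff]   (* 6.18046 *)

(* Joint density of wealth T = x and admission Y > cutoff; *)
(* the NumericQ pattern keeps NIntegrate from being called on a symbolic x *)
F[x_?NumericQ] := NIntegrate[
  PDF[NormalDistribution[0, 1], x] PDF[NormalDistribution[0, Sqrt[3]], y - x],
  {y, cutoff, Infinity}]

Plot[F[x], {x, 0, 5}]

(* Wealth threshold for the top 4.5% of the income distribution *)
wealth = Quantile[NormalDistribution[], 0.955]   (* 1.6954 *)

(* Share of admits from the top 4.5%: P(T > wealth | Y > cutoff) *)
ratio = NIntegrate[F[x], {x, wealth, Infinity}]/0.001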


Friday, November 15, 2013
