Endogeneity - John Antonakis

http://www.youtube.com/watch?v=dLuTjoYmfXs

Attachments

Causality_and_endogeneity_final.pdf

Causal_Claims.pdf

2slsdata.xls.zip

[Transcript]

[00:05] Hello, my name is John Antonakis. I am a professor of Organisational Behavior at the University of Lausanne.

[00:15] Today I am going to talk about a topic that many researchers don’t know about or prefer to avoid: endogeneity. It sounds bad, and it is bad. It’s like a deadly virus that threatens the viability of models that make causal claims regarding the relationship between an independent variable and a dependent variable.

[00:39] Who needs to know about endogeneity? Researchers, students, but most important: policy makers and practitioners. Endogeneity is always lurking around in the background. Therefore it is important to know what it is, why it is deadly for research, and how to deal with it.

[01:02] We are bombarded all the time with apparent causal claims that have implications for practice. For example, choosing a certain strategy or control system in a company that apparently predicts the company’s performance; good leader-member relations, or LMX, which apparently predict reduced turnover intentions on the part of subordinates or increased satisfaction; having more women on management boards, which apparently predicts profitability of companies.

[01:33] Management researchers and consultants often make such claims; but are these findings actually valid? How can we know if that is the case? As you will see at the end of this podcast, very often, such claims–if they haven’t been observed in certain conditions–will usually be false.

[01:52] To better understand the problem of endogeneity, imagine a philosopher, who is taken out on a field, and is supposed to observe a phenomenon that is going to occur 50 times. Her goal is to piece together what she sees and provide a theoretical explanation of what she has observed. So, the philosopher walks out on the field, and she stares out all around her, she hears a few birds, the wind rustling, nothing much. Suddenly, a disc streaks across the sky. She hears a loud crack, and the disc shatters into smithereens. Puzzled at this occurrence she looks again.

[02:22] Suddenly she sees a streak in the sky; the disc appears, she hears the loud crack, and the disc shatters again. This happens several times. Most of the time, when she hears a loud crack, the disc shatters. After thinking about this phenomenon for quite some time, the philosopher comes up with a conclusion: A theoretical explanation which stems from the loud crack. She comes to this conclusion after having observed 50 trials, and she is really convinced that it must be the sound that is destroying the disc. Let’s take a look at the data that she gathered.

[03:29] So, here is the data that shows exactly what the philosopher saw and what she recorded. Over here, there was no loud crack. As you can see, out of 19 trials the disc remained intact and it was never destroyed. Over here is when she heard the loud crack. On 29 trials, where the loud crack occurred, the disc was shattered and on two trials, where the loud crack occurred, the disc did not shatter.

[04:03] As you can see, there is a very strong relationship between two variables. That is, “hearing the sound” and “shattering.” When the sound is present the disc usually shatters. In fact, the probability is extremely high. And, when the sound is not there, it is almost 100% certain that the disc does not shatter.

[04:24] We can actually estimate the relationship between these two variables. This is actually called a phi coefficient, and it is 0.92. It is almost perfect. Therefore, what can we conclude? When the sound is present it is highly likely that the disc shatters; when the sound is not present it is highly likely that it doesn’t.
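The phi coefficient quoted here can be checked from the counts reported at [03:29]. What follows is a minimal sketch in Python; the function name `phi_coefficient` and the way the table is laid out are choices made for this illustration and do not come from the podcast.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]] (hypothetical helper)."""
    numerator = a * d - b * c
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return numerator / denominator

# Rows: no loud crack / loud crack. Columns: disc intact / disc shattered.
no_crack_intact, no_crack_shattered = 19, 0
crack_intact, crack_shattered = 2, 29

phi = phi_coefficient(no_crack_intact, no_crack_shattered,
                      crack_intact, crack_shattered)
print(round(phi, 2))  # 0.92
```

The four counts sum to the 50 trials the philosopher observed, and the result agrees with the 0.92 stated in the talk.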

[04:42] Now, here is a question for you: Can the philosopher actually conclude that the sound is making the disc shatter? The observed, and I highlight “the observed” correlation is very strong and it’s statistically significant. That is, this relationship is very reliable and it is not due to chance. But, does it actually reflect the true relationship between the sound and the disc shattering? It seems like the crack causes the disc to shatter. Let’s talk about the variables in terms of “x”, the cause, and “y”, the apparent outcome.

[05:20] Assume the following causal diagram where “x”, in fact, causes “y”, and there is something that causes “y” too that we don’t observe: We call this a disturbance term, the “e” term over here (or, perhaps, unmeasured causes). As you can see, the reason why this exists is because we don’t perfectly predict when the disc will shatter as a function of the sound. We have a little bit of error and that was shown in the bar graph; those are the misses that we have.

[05:50] The problem with this causal specification is that “x” here is actually not exogenous. It depends on something. And if this something is not accounted for in the model the relationship that we will estimate between “x” and “y” will be, in fact, very biased. Now, here comes the big problem: endogeneity. What causes “x” might also cause “y”. That is, “u” and “e”, these unknown causes, may be correlated or may be due to the same variable. And this variable is what we call “an omitted cause.”

[06:29] When adding “z” in the model, we account for what causes both “x” and “y”. In fact, the relationship between “x” and “y” is non-existent. It is naught, zero. To better understand why the relation between “x” and “y” is actually zero, I estimated a multivariate regression where:

1. “z” predicts “x” (how loud the sound was), using a linear model

2. “z” predicts “y” (whether the disc shattered), using a linear probability model (estimated with OLS)

3. and, the disturbances of “x” and “y” (“u” and “e”) are correlated

The residual correlation between “x” and “y” is actually zero when we account for what causes “x” and “y.” For detailed notes and to download the data please refer to the following link on my webpage: http://www.hec.unil.ch/jantonakis/disk.xls

[07:43] So, this is the true model: “z” causes both “x” and “y”. “x” is how loud the crack was that the philosopher heard, which is caused by a gunshot, the omitted variable “z”. What she hears is also caused by “u”, an unmeasured cause, perhaps background noise, which perturbed a little bit what she heard. The disc shattering is caused by “z” as well and by “e,” a random, unmeasured cause. It could have been the wind when the shooter was shooting, which disturbed the direction of the bullets, which is why they missed. So this was a random cause, which is unmeasured in the model and uncorrelated with “z”.
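The common-cause structure described above can be imitated with a small simulation. This is only a sketch under assumed numbers: the firing probability (0.6), the noise level (0.3) and the hit rate (0.93) are invented for illustration and are not taken from the disk.xls data. It shows that “x” and “y” correlate strongly on their own, yet the correlation between their residuals is close to zero once “z” is accounted for.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

def residuals(v, z):
    """Residuals of v after an OLS regression on z (with intercept)."""
    mv, mz = mean(v), mean(z)
    slope = (sum((zi - mz) * (vi - mv) for zi, vi in zip(z, v))
             / sum((zi - mz) ** 2 for zi in z))
    return [vi - mv - slope * (zi - mz) for vi, zi in zip(v, z)]

rng = random.Random(1)
n_trials = 5000  # far more trials than the philosopher saw, to reduce noise

# z: was a shot fired? (assumed probability)
z = [1.0 if rng.random() < 0.6 else 0.0 for _ in range(n_trials)]
# x: loudness heard = shot plus background noise "u" (assumed scale)
x = [zi + rng.gauss(0, 0.3) for zi in z]
# y: disc shatters only if a shot was fired and it did not miss ("e")
y = [1.0 if zi == 1.0 and rng.random() < 0.93 else 0.0 for zi in z]

naive = corr(x, y)                                 # strong, but spurious
partial = corr(residuals(x, z), residuals(y, z))   # close to zero

print(f"correlation of x and y ignoring z: {naive:.2f}")
print(f"residual correlation given z:      {partial:.2f}")
```

The second number is the simulated counterpart of the residual correlation mentioned in the talk: once the gunshot is in the model, the sound tells us nothing further about whether the disc shatters.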

[08:32] Going back to the philosopher in the field: Despite her good intentions to try and model this phenomenon correctly, what she did was wrong. There is no causal relation between the sound and the disc shattering. Both are caused by a gunshot that she did not know of. There we can see the shooter. And it is the shooter who is, in fact, destroying the disc, which is launched on this side by a disc-launcher.

[09:03] When faced with endogeneity problems of this kind, the problem is that the relation we observe could be positive, could be negative, could be non-significant. In fact, we don’t know what the true relationship is when we omit important causes.

[09:25] What causes endogeneity? There are three major reasons: The first one, which we just saw, is omitted variables. And these come in many different kinds and forms. I just showed you one example where we omitted a common cause, but there are many different kinds of omitted variable bias. For example, omitting fixed effects. Many researchers, especially those who use what are called HLM models, or Hierarchical Linear Models, estimate these models using random effects or random coefficients, without checking whether the level one variables correlate with fixed or constant effects that are due to the higher level entity.

[10:07] A second interesting case of omitted variable bias is where we have omitted selection. In this case, there is a choice that has been made by the entity we are observing. For example, women who choose to work or who choose not to work.

How can we estimate the relationship between education and how much a woman earns if we can only observe women who are working? We need to also observe the counterfactual. What would women who are not working have earned had they chosen to work? So, we need to model this endogenous choice. This comes in many different forms. For example, leaders may choose to attend a leadership training program, or not. Firms may choose to export, or not. Certain companies may choose to use a certain strategy, or not. This choice is endogenous. It cannot be used to predict anything. It’s just like the gunshot noise.

[11:09] The third major cause is what is called simultaneity. By simultaneity we mean that the model that we have, “x” predicting “y”, could work one way; in other words, “x” is a cause of “y”. However, “y” is also a cause of “x”. So we have a backward causal loop going from “y” to “x”. As you can see, “x” is caused by “y” and an omitted cause that we don’t observe, and “y” is caused by “x” and also by an omitted cause that we don’t observe. The problem we have when we estimate a relationship between “x” and “y”, and simultaneity might actually be driving the result, is that the coefficient that we estimate is wrong because it consists of two coefficients. So we might observe a coefficient of .50 but that doesn’t mean anything. One could be going in a positive direction; another could be going in a negative direction.
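A short simulation can make the “two coefficients in one” point concrete. It is a sketch under assumed values: an effect of “x” on “y” of 0.5 and a feedback effect of “y” on “x” of -0.5, both chosen only for illustration. With these particular values the two directions cancel, so a simple regression of “y” on “x” returns a slope near zero even though the assumed causal effect is 0.5.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(7)
b_xy = 0.5    # assumed effect of x on y
b_yx = -0.5   # assumed feedback effect of y on x
n_obs = 20000

xs, ys = [], []
for _ in range(n_obs):
    u = rng.gauss(0, 1)  # unobserved cause of x
    e = rng.gauss(0, 1)  # unobserved cause of y
    # Solve the two simultaneous equations
    #   x = b_yx * y + u   and   y = b_xy * x + e
    # for x and y (the reduced form).
    denom = 1 - b_xy * b_yx
    xs.append((u + b_yx * e) / denom)
    ys.append((e + b_xy * u) / denom)

observed = ols_slope(xs, ys)
print(f"assumed effect of x on y: {b_xy}")
print(f"observed OLS slope:       {observed:.3f}")
```

Other choices of the two coefficients would give a different observed slope, positive or negative, which is the sense in which the observed coefficient cannot be interpreted on its own.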

[12:09] For example, leaders change their leadership style as a function of follower performance. If a follower has bad performance, the leader might use a negative leadership style, one that gives negative feedback. If the follower has good performance, the leader might adjust their style as a function of that good performance. Thus, what we observe is actually not trustworthy. It consists of two components. Very often researchers say, “well, the observed correlation that we have might be because “x” is causing “y”, or because “y” is causing “x””. No, that is wrong. These researchers don’t get it. The correlation we observe is not true, is not correct, because one effect could be positive, the other could be negative, and these could be of different signs and different magnitudes. In fact, what the researchers are doing is correlating the sound of the gunshot with an outcome.

[13:04] There are many other causes of endogeneity, including measurement error, which is a special case where “x” is actually exogenous but is not precisely, or reliably, measured. There is also what is called common method variance. Many researchers don’t realize how bad common method variance can be. But, in fact, if there is a problem of common method variance, we just don’t know what the true relation can be. For example, suppose I ask you to rate your boss on a certain leadership style and then I ask you if you like your boss; or I might ask the questions in the other order: “Do you like your boss?” Say you say “yes”. “Are you very impressed with the leadership style of your boss?” “Is your boss a good leader?” Well, you’re more likely to say “yes” given the first response. Now, the way I asked the question, by saying “do you like your boss?”, was very blatant. Questions may be asked in more indirect ways but they may be driven by a lurking variable, “z”, which is driving the relationship between “x” and “y”, and we just don’t know what the true relationship will be if we have an omitted cause, that is, if we have endogeneity.
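The measurement error case mentioned at the start of this passage can also be sketched in a few lines. The numbers are assumptions made for the illustration: a true slope of 1.0 and measurement noise with the same variance as the true “x”. Under those assumptions the slope estimated from the noisy measure is pulled toward zero, to roughly half its true size.

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(5)
n_obs = 20000
true_slope = 1.0  # assumed for illustration

x_true = [rng.gauss(0, 1) for _ in range(n_obs)]         # exogenous x
y = [true_slope * xi + rng.gauss(0, 0.5) for xi in x_true]
x_measured = [xi + rng.gauss(0, 1) for xi in x_true]     # unreliable measure

slope_with_true_x = ols_slope(x_true, y)
slope_with_noisy_x = ols_slope(x_measured, y)

print(f"slope using the true x:     {slope_with_true_x:.2f}")
print(f"slope using the measured x: {slope_with_noisy_x:.2f}")
```

The second slope is biased even though “x” itself was generated independently of everything else, which is why measurement error is described as a special case.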

[14:16] How do we get rid of endogeneity? Very simple. The fail-safe way to do it is with an experiment. In an experiment “x” is exogenously manipulated. In other words, it varies randomly. Because it varies randomly, it will not correlate with anything in the dependent variable that we haven’t measured.

[14:40] Let me give you an example. Suppose we want to try the efficacy of a treatment, whatever that may be, leadership training, a medicine, what have you. So we have a sample of individuals, let’s say 100, and what we will do is we will use some kind of random mechanism to assign these people to a treatment group and a control group. We might have alternative treatments, that doesn’t matter. Suppose we have a treatment group and a control group. Because we are randomly assigning the people to either the treatment group or the control group, the groups at the beginning are exactly the same on any observed or unobserved variable; and that is very critical, because, if the groups are exactly the same on every observed or unobserved variable, if there is a difference in them after we administer the treatment, that difference can be due to only one thing: That is, the treatment. Because there is nothing else that might possibly explain why we observe a difference in the groups. The groups were exactly the same in the beginning. So, the strength of the experimental design is that we can observe what is called the counterfactual. What would the treated group have experienced had it not received the treatment? That’s what we observe in the untreated group. We can make a valid causal claim with an experiment.

[16:00] Remember, with an experiment we can be sure that: (1) no omitted causes correlate with the treatment, (2) the groups are the same in the beginning, (3) we can observe the counterfactual (at the group level), and (4) causal claims are valid.
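The contrast between an endogenous choice and random assignment can be illustrated with simulated data. This is a sketch, not an analysis from the podcast: the treatment effect of 1.0, the unobserved “ability” variable and its weight of 2.0 are all assumptions. When people select themselves into the treatment, the difference in group means mixes the treatment effect with the unobserved cause; under random assignment the same comparison comes out close to the assumed effect.

```python
import random

def mean(v):
    return sum(v) / len(v)

def difference_in_means(treated_flags, outcomes):
    """Mean outcome of treated units minus mean outcome of controls."""
    treated = [y for t, y in zip(treated_flags, outcomes) if t]
    control = [y for t, y in zip(treated_flags, outcomes) if not t]
    return mean(treated) - mean(control)

rng = random.Random(11)
n_people = 20000
true_effect = 1.0  # assumed treatment effect

# Unobserved cause of the outcome (hypothetical "ability").
ability = [rng.gauss(0, 1) for _ in range(n_people)]

def outcome(treated, a):
    return true_effect * treated + 2.0 * a + rng.gauss(0, 1)

# Endogenous choice: people with higher ability tend to opt in.
chose = [1 if a + rng.gauss(0, 1) > 0 else 0 for a in ability]
selected_diff = difference_in_means(
    chose, [outcome(t, a) for t, a in zip(chose, ability)])

# Experiment: a coin flip decides, so treatment is unrelated to ability.
assigned = [1 if rng.random() < 0.5 else 0 for _ in range(n_people)]
randomized_diff = difference_in_means(
    assigned, [outcome(t, a) for t, a in zip(assigned, ability)])

print(f"assumed effect:               {true_effect}")
print(f"self-selected group contrast: {selected_diff:.2f}")
print(f"randomized group contrast:    {randomized_diff:.2f}")
```

In a finite sample the randomized groups are the same only on average, so the randomized contrast is close to, rather than exactly equal to, the assumed effect.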

[16:19] Experimental methods are one way to deal with endogeneity. There are other ways, a bit more complex, that borrow a lot from econometrics.

[16:32] I would like to depict the relationship between “x” and “y” using Ballentines. This example is taken from Kennedy’s “Introduction to econometrics” book. Imagine, here we have the two variables: “y” and “x”, and the overlap that they have, where it intersects, is actually the percentage of, say, overlap, or variance that is shared between the two variables; and that is actually what we want to estimate when we estimate an ANOVA model or a regression model: Assume that is a slope, beta. Of course, “y” depends on “x”, but “y” depends on other variables that we have not measured. I’m just going to add one in here: “m”.

[17:17] Now, because we have exogenously manipulated “x”, “x” does not overlap with “m” at all. It is independent of it, or it is orthogonal to it. Therefore, what we estimate in terms of “x” predicting “y”, the slope coefficient, is actually consistent. By consistent we mean that it converges to the true value as the sample size increases.

[17:43] But even with experiments we can have endogeneity problems. Unbeknown to some experimentalists–I’ve noticed this very often in psychology–sometimes we actually might measure two dependent variables in an experiment, “y1” and “y2”, and we want to estimate the causal effect of “y1” on “y2” as a function of “x”.

[18:06] So, for example, subjects were randomly assigned to a treatment–we call this “x”–and they were then measured on “y1” and “y2”. Now, suppose that “y1” and “y2” share a common cause, which is possible because they were measured at relatively the same time. Suppose they were exposed (the participants), to a certain leader, and perhaps they liked the leader or not because of what the leader looked like. That’s got nothing to do with the treatment that was administered. Therefore, if one tries to estimate the causal effect of “y1” on “y2”, there is an endogeneity problem again between “y1” and “y2”. And this endogeneity problem needs to be acknowledged.

[18:50] By acknowledged we mean that the causal structure of the data must be modelled correctly. That is, “x” causes “y1” and “y2”, but there is a common cause linking “y1” to “y2” that must be modelled. This correlation between the two disturbances must actually be modelled in the estimation procedure. Very often they do not do that, and if it is not done, the correlation then that is estimated between “y1” and “y2” will actually be mis-estimated: It will be wrong.

[19:29] The solution to this problem is very simple. It is to use the two-stage least squares estimator, 2SLS. In this case, “x” is known as the exogenous variable or the instrument, which is used to help identify the causal effect of “y1” on “y2”. How are we going to do this? We will find the portion of variability that “x” and “y1” share that overlaps with “y2”. However, we must model this causal structure correctly by correlating cross-equation disturbances.

[20:04] Going back to our Ballentines–so you can understand the nature of the problem–suppose we wish to estimate the relationship between “y1” and “y2”. The causal relationship, that is. Unfortunately, “y1” and “y2” share a common cause, which is “q”. As you can see, the portion where “q” overlaps with “y1” and “y2” is what is going to cause the endogeneity problem. This must be correctly acknowledged in the estimator. Now, if we just estimate this relationship between “y1” and “y2”, as you will see the overlapping area consists of a true component, but it also consists of an error component, and that is where the three circles overlap. That portion of the variance in the yellow circle is going to be incorrectly estimated if we use what is called the normal OLS, or Ordinary Least Squares estimator, or maybe even Maximum Likelihood; it doesn’t matter which estimator we use, but if we don’t acknowledge the correct causal structure and find an instrument that is exogenous to the system of variables we cannot identify the causal effect of “y1” on “y2”.

[21:22] The instrument in this case is “x”. As you can see, “x” overlaps both with “y1” and “y2”. Because “x” is exogenous it does not overlap at all with the omitted common cause: “q”, or any other causes and I have not put them all in the model. We are just isolating “q” just to demonstrate the point. So, what the estimator is going to do is it’s going to look at the share of overlapping variance that “x” has with “y1” and “y2” to estimate the relationship of “y1” with “y2”. Even though it’s only using a smaller portion of the overlap of the “y1” with “y2” it will still estimate it consistently. In other words, what coefficient we will find will be correctly estimated even though we use less information.

[22:10] This portion of the variance is what I call in the picture “y1” hat. “y1” hat is the predicted value of “y1” that is due to “x”. Now, this predicted value has a very special property. It does not overlap with “q” at all, as you can see in the diagram. This is the actual two-stage least squares estimate, which can be estimated using 2SLS or maximum likelihood. I will show you in a minute how.

[22:42] What’s important to do is to correlate the two disturbances, the disturbances of “y1” and “y2”, that is. You have to acknowledge in this estimator that “y1” and “y2” are endogenous and that they potentially share a common cause. The correlation between these two disturbances, as I indicate in the Ballentines, is actually what the Hausman test estimates.

[23:07] So, let me show you how to estimate this correctly, but first we will start off with how it is estimated incorrectly.

[23:18] Now usually what is done in these cases is that “y1” is used as a predictor of “y2”. The problem is that “y1” actually correlates with the disturbance in “y2”. In other words, this correlation is not “0”. Beta one, the relationship between “y1” and “y2”, will in fact be inconsistent in its estimation, it won’t be correct.

[23:45] Now the correct way to estimate this model is to actually use the instruments to predict “y1” first and then to use the predicted value of “y1” to predict “y2”. To do that we must correlate the disturbances of the two equations. In other words, the disturbances of “y1” and “y2”. So, this disturbance correlation I call psi 1.

Now if this correlation, psi 1, is not zero, and it is estimated, we will kill endogeneity.

[24:15] If psi 1 is not zero and we don’t estimate it we’re going to have a big problem, [24:24] and this big problem is just like what we did before: [24:28] if I can go back to the previous figure, it’s as if we never used the instruments to predict “y1” and then used the predicted value of “y1” to predict “y2”.

[24:35] In other words, constraining that covariance between the two disturbances to zero is going to give exactly the same, and inconsistent, wrong estimate: As if we never had “z” or “q” in the model.

[24:51] And this is how many researchers go about testing such causal models. [24:56] They have two endogenous variables, they may have exogenous variables from an experiment, ideally, but they don’t use that in the correct way to estimate the relationship between “y1” and “y2”. If the two variables, that are endogenous, share a common cause, this must be acknowledged in the estimator.

[25:14] Let me demonstrate a specific case with simulated data. This data is available on my website (2slsdata.xls.zip), and I encourage you to download it and play around with it in different programs to see whether you can obtain the same estimates that I do.

[25:30] So, suppose the model that generated the data, in fact the true model underlying the relationships between the two variables, is depicted as such. We have “x” that causes “y” but we have “q” that causes both “x” and “y”. We also have two instruments: “m” and “n” that are exogenous. They don’t correlate with “u”, with “e” and indeed with “q”.

So, what we’re trying to estimate is the causal effect of “x” on “y” and this causal effect is supposed to be -.30.

[26:09] So, we can estimate this model correctly even if we don’t include “q” in the model as long as [26:16] we correlate the two cross-equation disturbances. As you can see from the simulated data (and here the data set is quite large, I have a sample size of 10,000 observations) we see that the 2SLS estimator recovers the true parameter almost precisely. It estimates it to be -.29. Remember that the true value was -.30.

[26:42] Now, if this is estimated the “usual” way, the OLS way, where this correlation is not acknowledged, it’s as if we were doing two separate OLS equations. [26:54] So even though you estimate the system of equations simultaneously, [26:59] if you do not correlate the cross-equation disturbances, it’s as if you had estimated two separate equations with nothing at all linking them.

[27:10] In this case, when we regress “y” on “x”, we obtain an estimate of .03. Remember the true value was -.30. So we are way off in what we have estimated.

[27:23] Try this yourselves. Simply take “y” and regress it on “x” [27:27] and you will get an estimate of .03. That is the observed relationship. This is completely wrong.
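For readers without the data file at hand, the comparison can be imitated with freshly simulated data. The sketch below is not the author’s data set and will not reproduce the -.29 and .03 reported above: only the true effect of -0.30, the two instruments “m” and “n” and the omitted cause “q” come from the talk, while every other coefficient is an assumption. The two stages are written out by hand in plain Python rather than with an econometrics package.

```python
import random

def center(v):
    mu = sum(v) / len(v)
    return [vi - mu for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

rng = random.Random(42)
n_obs = 10000
true_effect = -0.30  # effect of x on y, as stated in the talk

m, n, x, y = [], [], [], []
for _ in range(n_obs):
    mi, ni = rng.gauss(0, 1), rng.gauss(0, 1)   # exogenous instruments
    q = rng.gauss(0, 1)                         # omitted common cause
    u, e = rng.gauss(0, 1), rng.gauss(0, 1)     # disturbances
    xi = 0.5 * mi + 0.5 * ni + q + u            # assumed coefficients
    yi = true_effect * xi + q + e               # q is left out of the model
    m.append(mi); n.append(ni); x.append(xi); y.append(yi)

# Centering every variable removes the intercepts from both regressions.
m, n, x, y = center(m), center(n), center(x), center(y)

# Naive OLS of y on x: biased because q drives both x and y.
ols = dot(x, y) / dot(x, x)

# Stage 1: regress x on m and n (2x2 normal equations, Cramer's rule).
smm, snn, smn = dot(m, m), dot(n, n), dot(m, n)
smx, snx = dot(m, x), dot(n, x)
det = smm * snn - smn * smn
b_m = (smx * snn - snx * smn) / det
b_n = (snx * smm - smx * smn) / det
x_hat = [b_m * mi + b_n * ni for mi, ni in zip(m, n)]

# Stage 2: regress y on the part of x that the instruments predict.
tsls = dot(x_hat, y) / dot(x_hat, x_hat)

print(f"true effect: {true_effect}")
print(f"OLS slope:   {ols:.2f}")
print(f"2SLS slope:  {tsls:.2f}")
```

Doing the second stage by hand gives the right point estimate but not the right standard errors, so a dedicated 2SLS routine is the safer choice for actual inference.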

So you may use all the fancy big hammers and structural equation modelling programs [27:37], but if the model, firstly, does not have exogenous variables to identify the effect of “x” on “y”, you cannot get the correct estimates. The second problem is, if you don’t acknowledge the endogeneity between the two endogenous variables you are not going to get correct estimates.

[27:57] Now, here is the million Swiss franc question (and for the viewers in the United States, be assured that this is a lot of money, it’s actually 1.2 million bucks). Where do we get instruments from [28:06]? In an experimental design it’s very easy to get an instrument: that’s the variable (or variables) that was exogenously manipulated, and ideally you will have more instruments than you have endogenous regressors so that you can estimate what is called an overidentification statistic, which tests whether the structure that you have in the data, the causal structure, is actually valid. So, what that does, in fact, is compare the model that you have [28:34] with what it observes in the data. It is just as if you were comparing an architectural plan of a house with what was actually built to see how close it was, so the closer the model is to the data the more likely the model generated the data. So, it is good to have instruments [28:52], to have at least a couple more than what we have in terms of endogenous regressors.
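One common overidentification statistic is the Sargan statistic: the sample size times the R-squared from regressing the 2SLS residuals on the instruments. The talk does not say which statistic is meant, so treating it as Sargan is an assumption, as are all coefficients in the sketch below. With two instruments and one endogenous regressor the statistic is compared with a chi-square distribution with one degree of freedom (5% critical value about 3.84). The sketch computes it once with two valid instruments and once when the instrument “n” also affects “y” directly.

```python
import random

def center(v):
    mu = sum(v) / len(v)
    return [vi - mu for vi in v]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def fitted_on_instruments(m, n, v):
    """Fitted values from regressing centered v on centered m and n."""
    smm, snn, smn = dot(m, m), dot(n, n), dot(m, n)
    smv, snv = dot(m, v), dot(n, v)
    det = smm * snn - smn * smn
    b_m = (smv * snn - snv * smn) / det
    b_n = (snv * smm - smv * smn) / det
    return [b_m * mi + b_n * ni for mi, ni in zip(m, n)]

def sargan_statistic(direct_effect_of_n, seed, n_obs=5000):
    """Simulate data, run 2SLS by hand, return n_obs * R-squared."""
    rng = random.Random(seed)
    m, n, x, y = [], [], [], []
    for _ in range(n_obs):
        mi, ni, q = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        xi = 0.5 * mi + 0.5 * ni + q + rng.gauss(0, 1)
        # If direct_effect_of_n is not zero, n is not a valid instrument.
        yi = -0.3 * xi + q + direct_effect_of_n * ni + rng.gauss(0, 1)
        m.append(mi); n.append(ni); x.append(xi); y.append(yi)
    m, n, x, y = (center(v) for v in (m, n, x, y))
    x_hat = fitted_on_instruments(m, n, x)
    beta = dot(x_hat, y) / dot(x_hat, x_hat)          # 2SLS slope
    resid = [yi - beta * xi for xi, yi in zip(x, y)]  # uses actual x
    explained = fitted_on_instruments(m, n, resid)
    r_squared = dot(explained, explained) / dot(resid, resid)
    return n_obs * r_squared

valid_stat = sargan_statistic(0.0, seed=3)
invalid_stat = sargan_statistic(0.5, seed=3)
print(f"valid instruments:   {valid_stat:.1f}")
print(f"invalid instrument:  {invalid_stat:.1f}")
```

A small statistic means the instruments leave nothing systematic in the residuals, which is the “plan matches the house” situation; a large one signals that the assumed causal structure does not fit the data.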

[29:01] Where do we get instruments if we haven’t done an experiment? There are many creative ways that you can go about getting instruments. Economists have identified many, many interesting ways to do it, when you want to, for example, estimate the effect of firms on performance, or country level variables on country level performance or on firms. [29:20] They can be geographic instruments, they can be distance instruments, they can be vectors of malaria…many, many different ways. In psychology, for example, one can use the IQ of a leader or any kind of fixed or constant effect that is genetically determined.

[29:36] So we talk about this in the paper, published in the Leadership Quarterly, and the title is: “On making causal claims.” So if you’re interested to find out more about it, please refer to the paper:

Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21(6), 1086-1120. http://www.hec.unil.ch/jantonakis/Causal_Claims.pdf

[If you wish, refer to the following “prequel” paper, which is really a more basic introduction to endogeneity]:

Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (submitted). Causality and endogeneity: Problems and solutions. In D.V. Day (Ed.), The Oxford Handbook of Leadership and Organizations. http://www.hec.unil.ch/jantonakis/Causality_and_endogeneity_final.pdf

[29:51] To summarize, if “x” is not exogenous, its relation to “y” is suspect, and it has to be corrected using some kind of corrective technique, which will kill endogeneity. There are many, many cases of this in the literature, and in the paper to which I referred, published in the Leadership Quarterly [30:10], we found that even in very, very good journals, in top journals in management and in applied psychology, estimates were severely compromised by the use of incorrect modelling procedures.

[30:22] So, recall, we cannot regress “y”, satisfaction of followers, or what have you, on LMX, leader-member exchange: LMX is endogenous. We cannot use [the] Hierarchical linear modelling estimator, which looks at random effects, when level one variables could correlate with the fixed effects (the fixed effects are an omitted cause). We cannot regress company performance on an endogenous choice, for example using a certain control strategy or not using it, because the choice is endogenous; it has to be modelled correctly.

[30:53] And that is exactly what James Heckman won the Nobel prize for in 2000: with the procedure that is named for him, the Heckman two-step, he found a way to correct for this endogeneity and reproduce a true counterfactual, just like in an experimental design.

[31:09] Thank you for taking the time to listen to this University of Lausanne podcast. [31:15] If you are interested to find out more information about endogeneity and how to correct for it, please refer to the following paper. The paper is available on my website or if you wish you may e-mail me and I’ll be very glad to give it to you.

[Or just click here]

Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21(6), 1086-1120. Causal_Claims.pdf

[31:27] Before closing, make sure to think about these supposed causal effects that someone is trying to convince you of. [31:26] Was the claim made in the context of an experimental design? If not, is it possible that there are omitted causes that haven’t been correctly modelled? Were instruments used to ensure that the causal effect of an endogenous regressor on a dependent variable can be identified?

[31:50] If there is any cause for doubt, don’t trust the results of the study that has published them.

[31:55] Remember, endogeneity is like a disease, it must be stomped out in every one of its forms. [32:03] It is neither ethical nor economical to base policies or practices on procedures that might not work.

Thank you for listening to this podcast.