Or: females and university faculty have a small bias regarding an author’s sex and institution when reading psychology abstracts, and male university faculty in STEM departments are slightly more skeptical than female faculty about abstracts that accuse men of being sexist
“Even With Hard Evidence Of Gender Bias In STEM Fields, Men Don’t Believe It’s Real”, reads the headline of a ThinkProgress article.
“There’s a growing mountain of evidence that women in the STEM fields face gender bias…. These are egregious examples — but the empirical evidence backs them up. One landmark study found that science faculty at research universities rate applicants with male names as more competent, more hireable, and more deserving of a higher starting salary than female applicants, even when the resumes are otherwise identical.
Now, a new study published by the Proceedings of the National Academy of Science (PNAS) shows another level of bias: Many men don’t believe this is happening. When shown empirical evidence of gender bias against women in the STEM fields, men were far less likely to find the studies convincing or important….”
So there was one “landmark study” that says men won’t hire women in STEM jobs, and now yet another comes out in PNAS that says men are so stubborn and blinded by sexism that they will deny this hard fact even when confronted with empirical evidence! Much like those crazy, troglodyte conservatives who “deny” global warming, or knuckle-dragging creationists who eschew evolution, men in STEM are militantly opposed to women in the fields and refuse to listen to reason. Science proves it. It’s on the level of Newton’s laws of motion.
But if you actually read the two studies, which are by the same authors (the latter directly uses the former in its experiment) and hardly constitute scientific consensus, this kind of conclusion is completely untenable. The studies are quick, superficial, provide very tenuous results, and cannot reasonably be used to prove sexism one way or the other. They also indicate that the institution someone is affiliated with can play a role in perception, but this is omitted from the ThinkProgress article.
Let’s take a gander at the studies. The “new study” is titled Quality of evidence revealing subtle gender biases in science is in the eye of the beholder. Here’s the abstract, trimmed down slightly to give you the relevant info:
“[We think there’s] growing evidence revealing a gender bias against women—or favoring men—within science, technology, engineering, and mathematics (STEM) settings[. To what] extent [does] gender bias contribute to women’s underrepresentation within STEM fields[? A]re men and women equally receptive to this type of experimental evidence? This question was tested with three randomized, double-blind experiments—two involving samples from the general public (n=205 and 303, respectively) and one involving a sample of university STEM and non-STEM faculty (n=205). In all experiments, participants read an actual journal abstract reporting gender bias in a STEM context (or an altered abstract reporting no gender bias in experiment 3) and evaluated the overall quality of the research.
Results across experiments showed that men evaluate the gender-bias research less favorably than women, and, of concern, this gender difference was especially prominent among STEM faculty (experiment 2). These results suggest a relative reluctance among men, especially faculty men within STEM, to accept evidence of gender biases in STEM. This finding is problematic because broadening the participation of underrepresented people in STEM, including women, necessarily requires a widespread willingness (particularly by those in the majority) to acknowledge that bias exists before transformation is possible.”
This study is composed of three similar experiments fit into one paper: two surveying segments of the public, and one surveying university faculty.
You can skip the introduction. As an insider, I’ll let you in on a trade secret: introductions to science papers exist for two reasons. The first is tradition; the second is to justify/rationalize doing the experiment at all. There you rehash existing studies and show how your topic is important and your work is unique. In practice, introductions are loaded full of self-serving bullshit and links to either your own work or shit nobody has bothered to read past an abstract (at most). They exist almost entirely to convince the reader (and writer) that the work that follows, and the field of work surrounding it, is valid, undisputed science.
Normally, the next part of the paper is the materials and methods, but they put pretty much all of their methodology on an external page, and even that is broken into two parts, which is asinine from an organizational standpoint; it’s a testament to the ineffectiveness of paper editors that this wasn’t changed. Experiments #1 and #3 were carried out by posting a survey to Amazon’s Mechanical Turk online job site, a website that pays people peanuts ($.25 for this experiment, which is a common rate) for doing work that cannot yet be done by robots or people in Asia. It’s a widely used tool in the social sciences, because it’s a cheap and easy source of people to fill out surveys. It also has inbuilt biases that I have never once seen addressed. That is, the kind of person who fills out surveys for a quarter apiece may not be representative of the population at large.
Participants for experiment 1 were asked to read the following paragraph:
“…opinions about different types of research that was published back in the…”
Following that, on scales from 1 (not at all) to 6 (very much), participants responded to the following four questions or statements: “To what extent do you agree with the interpretation of the research results?” “To what extent are the findings of this research important?” “To what extent was the abstract well written?” and “Overall, my evaluation of this abstract is favorable.” These four responses demonstrated high internal consistency in all experiments (Cronbach’s α=0.84, 0.89, and 0.78 in experiments 1, 2, and 3, respectively) and were therefore averaged to measure participants’ perceived quality of the research.
So obviously these are four different questions. I know they’re similar, but they are still distinct. The scores are a combination of all these answers. Just because there is high consistency between answers, does that make it okay to conflate them, especially when you end up with data that show pretty marginal differences in answers?
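For anyone curious what that “internal consistency” number actually measures, here is a minimal sketch of how a composite like theirs gets built, assuming the standard Cronbach’s alpha formula. The ratings in it are invented purely for illustration; they are not from the paper.

```python
import numpy as np

def cronbach_alpha(items):
    """Standard Cronbach's alpha: items is an (n_respondents, k_questions) array."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each question
    total_var = items.sum(axis=1).var(ddof=1)   # variance of each person's summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 1-6 ratings from five respondents on the four questions.
ratings = np.array([
    [5, 5, 4, 5],
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [6, 5, 6, 6],
    [3, 2, 3, 3],
])

print("Cronbach's alpha:", round(cronbach_alpha(ratings), 2))
# The paper's "perceived quality" score appears to be each person's average across the four.
print("Composite scores:", ratings.mean(axis=1))
```

All alpha tells you is that people who rate one question high tend to rate the others high too; it says nothing about whether the four questions mean the same thing.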
Also, note the scale is 1-6, meaning there’s no middle “No opinion one way or the other” option; participants had to give an agree/disagree. What if, say, women are more prone, when prompted, not to want to disagree? The women in the “don’t really know” category are now in the “agree” category not because they agree and would act on that, but because they don’t want to disagree with a direct question.
Nonetheless, the results of Experiment 1 are interesting, although I would argue vehemently that the authors ignore the interesting results and report the uninteresting ones, because they are looking for a particular answer, not just whatever the data show.

Here are the results (which were a pain in the ass to cobble together from their shitty paper organization):
When the abstract was reported to be from a man at ISU, the average rating of the abstract by male participants was 4.57 (scale of 1-6), and the average rating by females was 4.26. Each answer had a standard deviation of about 1 point. To me, that is not a significant difference. It’s .31 out of 5. My grade-grubbing straight-A students don’t even fight for those kinds of points… My hunch is that if this question were on even a 1-7 scale, that variation would disappear into further insignificance.
With a supposed female author from ISU, the men rated it 3.89 and the women 5.03. That I’ll grant as significant enough to bother with. Men decreased their opinion by .68, and again, if you stretch the scale, that distance shrinks. Women increased their opinion by .77. But as far as the actual response goes, men are essentially unchanged in their average response, presumably “agree somewhat.” Women moved up a bracket into “agree.” And it’s slight. This is not an overwhelming change. It’s there, but not huge. So this may affect marginal opinions, but men aren’t going to go from “this must be rock solid” with a man’s name on it to “what a load of crap” with a woman’s.
The abstract by a “male from Yale” yielded a score of 4.13 from the men and 5.02 from the women. Again, significant. The men are still giving out an “agree somewhat”; the women moved up to “agree.” A female Yale scientist yielded a score of 4.38 from both sexes. Men are basically in the same spot.
The major changes in this experiment are with the women. Men are basically “meh” when it comes to abstracts, regardless of who wrote them, but are a touch skeptical of women from low-prestige institutions. Women, however, agree more with female authors from lower-prestige institutions and male authors from high-prestige institutions. So why the fuck doesn’t that show up in the discussion? Because the authors care about sexist men keeping women down. That’s the argument in the ether. Their data, for all their many limitations, just told them something different. So it gets buried in the “supplemental” materials.
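To put rough numbers on how big these shifts actually are, here’s a quick back-of-the-envelope calculation using the means quoted above and the roughly 1-point standard deviation mentioned earlier. Treating the spread as exactly 1 point is my own simplifying assumption for illustration, not the paper’s pooled value, so take the standardized figures as ballpark only.

```python
# Experiment 1 means as quoted above: (male raters, female raters).
conditions = {
    "male author, ISU":    (4.57, 4.26),
    "female author, ISU":  (3.89, 5.03),
    "male author, Yale":   (4.13, 5.02),
    "female author, Yale": (4.38, 4.38),
}

ASSUMED_SD = 1.0  # rough per-answer spread cited above; an assumption, not the paper's pooled SD

for label, (men, women) in conditions.items():
    gap = men - women
    d = gap / ASSUMED_SD  # crude standardized difference under the assumed SD
    print(f"{label}: raw gap = {gap:+.2f} points, d ~ {d:+.2f}")
```

The two gaps I already granted as meaningful (female author at ISU, male author at Yale) come out around one assumed standard deviation; the other two come out at a third of a standard deviation and zero.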
Experiment #3 was a variation of #1, except they took out the institution and author name and added a modified version of the abstract that concluded there was no bias against women. The astute among you may pause and go, “wait, isn’t that changing two variables? Aren’t you only supposed to change one at a time?” I really hope this was just poor methodology and not a willful decision to drop the institution name because it was an essential component of the differing results in experiment 1.

When “men evaluated the original (gender-bias exists) abstract,” they gave it a 3.65, and women a 3.86. This is an even smaller and less trustworthy difference than the first: .21 out of 5. Again, make the scale 1-7 and that goes away. “Men evaluated the modified (no gender-bias exists) abstract” at 3.83, and women at 3.59, a difference of .24. Comparing the two, men rate an article that alleges bias against women .18 lower than one that alleges no bias; women rank the article that alleges discrimination .27 higher. Again, these are just super, super slight differences. What that sure as hell doesn’t tell you definitively is that “men ignore proof of bias.” My guess is that a number of those men are saying, “I’m sick of being tacitly accused of being a sexist asshole.” I could be wrong, but the experiment can’t make the distinction. The fact that it can’t will not be in the discussion. The data are a distraction. They are only there to make the authors look smarter. “Men are sexist and biased. Science says I’m right. Look, numbers!”
On to Experiment 2. Their pool was about 200 faculty from a “research-intensive university” who responded to emails asking them to fill out a survey. They had a roughly even split of people in STEM and non-STEM departments. Race, rank (e.g., associate professor), age, and time affiliated with the institution were also disclosed. They read the same abstract as in Experiment 1 and ranked it on the same scale. Here are the results as reported in the main paper (I took out the numbers to make it more readable):
“Results from our experiment 2 also supported hypothesis A, revealing a main effect of participant gender such that male faculty evaluated the research less favorably than female faculty…. Importantly, results from experiment 2 further reveal that this effect was qualified by a significant interaction between participant gender and field of study. This interaction supported hypothesis B, because simple-effect tests confirmed that male faculty evaluated the research less favorably than female faculty in STEM fields whereas male and female faculty reported comparable evaluations in non-STEM fields. Further, the effect size for the observed gender difference was large within STEM. Looking at this interaction another way, simple-effect tests demonstrated that men evaluated the research more negatively if they were in STEM than non-STEM departments, whereas the opposite trend was not statistically significant among female faculty. Thus, it seems that men in STEM displayed harsher judgments of Moss-Racusin et al.’s research, not that women in STEM exhibited more positive evaluations of it. The analysis revealed one other significant interaction that did not involve faculty gender (for further details, see SI Additional Analyses, Experiment 2)”
Tl;dr: Their data show that the men in STEM fields ranked the abstract lower overall than the females, but that differential was not seen in the non-STEM fields. So by how much? The men in STEM rated the abstract on average a 4.02, and the females gave it a 4.80. I’ll grant that that’s a meaningful difference on the numerical scale. The non-STEM faculty put it at 4.55 and 4.54, respectively. Looking at the data in the supplemental material, not the main article, they offer up this interesting find: “The interaction pattern indicated that faculty in STEM evaluated the abstract written by a man more favorably if the author was from Yale (vs. Iowa State), but the abstract written by a woman more favorably if the author was from Iowa State (vs. Yale), whereas the opposite pattern manifested among non-STEM faculty.” So again, institutional bias is apparently an important factor in academia (hopefully surprising to exactly no one in academia), but annoyingly, they do not provide the numbers they used to come to this conclusion, so I can’t give you a decent analysis of it. What I can do is lob disdain at them for shirking a major find in their data. That’s bad science.
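Since the headline claim rests on that gender-by-field interaction, here’s the arithmetic spelled out with the means quoted above. It’s purely descriptive; without the variances (which they don’t hand us), there’s no test statistic to compute here.

```python
# Gender gap in ratings, by field, using the experiment 2 means quoted above.
stem     = {"men": 4.02, "women": 4.80}
non_stem = {"men": 4.55, "women": 4.54}

gap_stem = stem["women"] - stem["men"]          # +0.78 points
gap_non  = non_stem["women"] - non_stem["men"]  # -0.01 points

print(f"STEM gender gap:      {gap_stem:+.2f}")
print(f"Non-STEM gender gap:  {gap_non:+.2f}")
# The "interaction" is just the difference between those two gaps.
print(f"Difference of gaps:   {gap_stem - gap_non:+.2f}")
```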
So what do the numbers we do have mean? Do men think less of the abstract than they should, or do women think more of it than they should? The authors (to reiterate, one of them wrote the abstract in question) make the absolutely untenable argument that because the non-STEM female rankings and STEM female rankings are basically the same, this must prove the former: “Thus, it seems that men in STEM displayed harsher judgments of Moss-Racusin et al.’s research, not that women in STEM exhibited more positive evaluations of it.” But here’s the thing: THAT DOESN’T PROVE THAT. There is no objective way to determine what the actual rating of the abstract should be. That’s entirely subjective. Both STEM and non-STEM people are rating subjectively. All you can tell is the differential in ratings. You absolutely cannot prove which way the rating bias goes. Men could be rating it too harshly, or women could be rating it too generously. It could be either, until they come up with an objective number to compare against, and they can’t, because objectivity does not work that way.
That is the limitation on all of these experiments, and it terrifies me that these researchers misunderstand the fundamental philosophy behind their own experiment. It is the major limitation of the social sciences that there are no objective standards to measure against, only subjective ones. But that’s the cross to bear. Human interactions being hard to nail down doesn’t excuse lazy science.
Epilogue: This whole thing is about rating an abstract. So let’s do it (abstract in quotation marks).
http://www.pnas.org/content/109/41/16474.full
My reactions to this: “significantly more competent and hireable than the (identical) female applicant”
Define “competent”, define “hireable”, and how much of a difference are we talking about?
“These participants also selected a higher starting salary and offered more career mentoring to the male applicant.” How the hell can they “offer more career mentoring” to someone who doesn’t exist? That doesn’t make sense.
” The gender of the faculty participants did not affect responses, such that female and male faculty were equally likely to exhibit bias against the female student.” Interesting. So this is bias for males, not bias from males.
“Mediation analyses” I don’t know what that is.
“using a standard instrument and found that preexisting subtle bias against women played a moderating role” What instrument? How does it work, and to what extent was any of this demonstrated?
“These results suggest that interventions addressing faculty gender bias might advance the goal of increasing the participation of women in science.” So you have a prescriptive motivation for authoring this, not a descriptive one? Or is this just pandering? If so, that’s still a little disconcerting, and I think that could color your assessment of the data.
So, rating this. What would I give it? Here are their three questions: “To what extent do you agree with the interpretation of the research results? To what extent are the findings of this research important? To what extent was the abstract well written?”

Here’s the thing: I don’t have enough of my questions answered to rate this on the extent that I agree or find it important. Where’s my “no opinion” option? That was a major, major flaw in this study (and lord knows how many surveys in general). “I don’t know” is a response most of us have to a lot of things, and our society does a horrendous job of explaining that it’s okay to say. This study reinforces that. So I guess I would answer “not at all agree,” because I tacitly do not agree with something I’m not sure I believe. It’s possible the full article would address my concerns, but I can’t rate it on the abstract alone. And the fact that this does not occur to the researchers is also very unsettling. Reading the abstract and nothing more is incredibly damaging to our society. A lot of crap gets through that shouldn’t because people are too goddamn lazy to read more than one paragraph.

How important is this? Well, I might up the score if I were confident that it proves what it says, and if I knew to what extent there is a problem. But still, this is going to rank pretty low for me. My guess is that it shows a small bias in pre-interview assessments of candidates in favor of males, which in the scheme of things doesn’t top my list. Which is more important: that a handful of women might get passed over for a job in a cancer lab, or the fact that millions of people every year die of fucking cancer and we can at best delay the fatality by a few years? That’s just personal philosophy, though, and the question implies that there is an objective way of measuring what is important, which is one hell of a philosophical assertion. God help you if there are any nihilists peer reviewing that…

As to what extent it’s well written: I have a couple of minor tweaks I’d make, but overall? It’s good. I’d give it a 6. But so what? I would hope that being well written is a minimum requirement; I hope nothing below a 5 gets through. Real life isn’t an English class; being well written doesn’t make it right!
I want to go back and draw your attention to one word in particular: “significantly.” While it has no bearing on my rating, this word means something different to people in the sciences than it does to the public. In the sciences, it means “statistically significant,” which is a specific mathematical term. https://en.wikipedia.org/wiki/Statistical_significance_%28hypothesis_testing%29
To the public, it means “a lot” or “very,” or something like that. This is one of the problems the academic community needs to address: the effort to be precise and concise is making us misunderstood by the public. If faculty hired men and women at a 56-to-44 split across a large enough sample, that is a statistically significant difference by most standards. But the public might not consider that “significant.” Ask John Q. Public what he pictures when we say there is a “significant” difference in hiring by sex, and he’s going to put that difference a lot higher: 70-30, 65-35, 80-20, something like that. So even though it might not change how people rank the abstract, the layman and the STEM person get two completely different impressions out of the same word.
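To make that concrete, here’s a quick check of that same 56/44 split against a 50/50 null at a few sample sizes. The sample sizes are hypothetical, picked only to show that “statistically significant” tracks sample size at least as much as it tracks the size of the gap.

```python
# A 56/44 hiring split tested against a 50/50 null at several hypothetical
# sample sizes. The split is the example from the text; the sample sizes are
# made up to show how significance scales with n.
from scipy.stats import binomtest

for n in (100, 1000, 10000):
    men_hired = round(0.56 * n)
    result = binomtest(men_hired, n=n, p=0.5, alternative="two-sided")
    print(f"n = {n:>5}: {men_hired} men vs {n - men_hired} women, p = {result.pvalue:.3g}")
```

At a hundred hires, that split doesn’t even clear the usual 0.05 bar; at a few thousand, it clears it easily, even though the proportions haven’t moved an inch.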
Also, somewhat interestingly, the abstract uses the term “academic science”, not “STEM”; the T, E, and M are not included for damnation here. Yet those folks are factored into the discussion in this paper.
I could go through and pick apart the study the abstract is from. If I were being diligent, I would, but as of now I’m not being paid for this, and the cost/benefit ratio isn’t looking great. In my defense, I don’t necessarily need to: there’s a methodological flaw that I think is major enough to dismiss this as not definitive. It’s a hypothetical. They didn’t comb through actual hiring data; they made up application materials and asked people to rate them on an arbitrary scale. (I read the study, I’m just not going to go through it. I may be lazy, but give me a little credit….) This study addresses what people think about candidates on paper. I’m not saying it doesn’t warrant further investigation, but it is not conclusive data about actual hiring practices. The reality is what matters, not the hypothetical. If girls are hypothetically discriminated against but not discriminated against in functional reality, then so what? Look, I get the train of thought: “well, if there’s hypothetical discrimination, actual discrimination is probably not far behind.” Makes sense. Still, you have to demonstrate that before asking people to accept it as fact.
Here’s the thing: asking faculty, “Hey, hypothetically, would you hire this person? Do you think they’re competent? Would you mentor them? How much would you pay them?” will get you an answer of “Hell if I know; I’d have to interview them first, I’d have to see what kind of time I have, and I’d have to see what I can budget.” Nobody’s going to hire without an interview. If the sex bias goes away after an interview, then this is a worthless survey. I get that it’s harder to do an all-things-being-equal interview, but again, that’s the cross social science has to bear. Mentorship in a voluntary capacity is also massively determined by personal relationship; no decent professor would conclusively say whether or not they’d mentor someone they’ve never met. Payroll is also something that faculty rarely have unilateral control over, particularly in a university setting; there is an unholy amount of bureaucracy around it, as I can attest from personal experience. What their survey actually tests is how likely these faculty are to say they’d do something. Until they prove that what is said and what is done are the same, they are wrong to expect people to equate the two. That’s either ignorance or arrogance on the authors’ part, and I can’t fairly judge which, but I disapprove either way.
This loaded science quickly gets foisted onto the public, stripped of the content vital to a good discussion and critique of the information. Both media and consumer laziness are certainly complicit in this, but that does not exculpate the authors. Headlines like the ThinkProgress one at the top of this post are what these articles get turned into by the time the public gets to them.
And you can bitch about ThinkProgress being run by dumb lefties, but that’s a lazy analysis, because that’s pretty much how they all end up looking; I just found this one first. They reference both studies but fail to mention that they are by the same authors. It’s not that that makes them invalid; it’s that it gives the impression of a field of research independently verifying things, when that has not been demonstrated. When science is spoken about in the abstract, a notion the industry is all too happy to reinforce, it seems less biased than it is, as if it weren’t conducted by people with their own biases. They only mention two of the experiments, and they give no mention of the very blatant findings on institutional bias that affect women in particular; I’m going to defer to Hanlon’s razor on that. They parrot the philosophically unfounded claim by the authors that it must be male underrating and not female overrating. Overall, this isn’t a completely terrible relay of the paper, compared to what it could be. I wouldn’t be satisfied putting my name on it, but I’ve seen worse; there was a comparatively small amount of editorializing, at least. But it wasn’t very thorough. These surveys, whether intentionally or not, wound up as ammo to support an existing socio-political narrative.