The Milgram experiments produced findings reliable only within narrow laboratory conditions, showing high compliance rates (around 65% in the baseline) under scripted authority cues, but unreliable for broader...
Why this question matters
The Milgram obedience studies remain influential because they produced striking results about compliance with authority, but their reliability is debated. Later replications and archival analyses suggest some core patterns were observed beyond the original lab, while methodological, ethical, and interpretive concerns limit how confidently the findings can be generalized.
The claim being judged
The question asks whether Stanley Milgram’s obedience experiments produced reliable findings. In the early 1960s, Milgram reported that many participants were willing to administer what they believed were increasingly severe electric shocks to another person when instructed by an authority figure in a laboratory setting.
Reliability can mean several things in this context. It may refer to whether Milgram’s own procedures yielded consistent results across variations, whether later researchers observed similar patterns, whether participants believed the setup, and whether the studies support broader claims about human obedience outside the laboratory.
A careful judgment needs to separate the narrow empirical finding from the broader lesson often attached to it. The narrow finding concerns behavior in a staged experimental situation; the broader claim concerns general human willingness to harm others under authority.
What the evidence shows
Milgram conducted multiple variations of the obedience procedure and reported substantial differences depending on the conditions. For example, obedience rates were reported to change when the authority figure was physically more distant, when peers resisted, or when the setting was altered. This suggests the experiments did not produce a single fixed obedience rate, but rather a pattern in which situational features mattered.
Partial replications and related studies have reported behavior broadly consistent with the idea that authority pressure can increase compliance, though most later work used modified procedures because of ethical restrictions. Jerry Burger’s 2009 partial replication, for example, stopped the procedure earlier than Milgram’s original maximum shock level, limiting direct comparison while still addressing whether participants would continue under pressure up to a key point.
Archival reassessments have raised concerns about the original experiments. Critics have argued that some participants expressed skepticism about the setup, that experimenters sometimes departed from standardized prompts, and that the published account may not fully capture the complexity of what happened in the lab. These issues matter because reliability depends not only on headline percentages but also on participant belief, procedural consistency, and transparent reporting.
Overall, the evidence supports a mixed assessment: the experiments appear to have identified a real and important phenomenon involving authority, pressure, and compliance, but the exact size, interpretation, and generalizability of Milgram’s reported findings remain contested.
Where uncertainty remains
One uncertainty is how many participants fully believed they were harming another person. If a substantial share suspected the shocks were not real, the studies may say less about willingness to inflict serious harm and more about compliance within a confusing laboratory performance.
Another uncertainty is how much weight to give to later replications. Ethical limits mean modern studies cannot exactly repeat the original procedure, so they can test related patterns but not fully reproduce the original design. Cross-cultural and historical differences also complicate comparisons.
A final uncertainty concerns the public interpretation of the experiments. Milgram’s work is often used to explain atrocities, institutional abuse, or everyday conformity, but the laboratory evidence may not map cleanly onto those larger social phenomena without additional evidence.
The three parts of the claim
The umbrella claim is actually several claims bundled into one. Each needs its own evaluation.
Model comparison
How each panel model rated the three parts of the claim

| Model | Part 1 | Part 2 | Part 3 | Overall |
|---|---|---|---|---|
| Grok 4.3 | Yes · 82% | Mixed · 63% | No · 74% | Mixed · 70% |
| Mistral Medium 3.5 | Yes · 82% | Mixed · 63% | No · 74% | Mixed · 65% |
| OpenAI GPT-5.4 | Yes · 82% | Mixed · 63% | No · 74% | No · 70% |
| Claude Opus 4.7 | Yes · 82% | Mixed · 63% | No · 74% | Mixed · 75% |
| Llama 4 Maverick | Yes · 82% | Mixed · 63% | No · 74% | Mixed · 70% |
| Gemini 3.1 Pro | Yes · 82% | Mixed · 63% | No · 74% | No · 70% |
| DeepSeek V4 Pro | Yes · 82% | Mixed · 63% | No · 74% | Mixed · 65% |
| GLM 5.1 | Yes · 82% | Mixed · 63% | No · 74% | Mixed · 65% |
| Qwen 3.7 Max | Yes · 82% | Mixed · 63% | No · 74% | No · 85% |
| Kimi K2.6 | — | — | — | Incomplete |
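
Because the Part 1 to Part 3 columns are identical for every model that answered, the panel's disagreement sits entirely in the Overall column. The sketch below tallies that column. The values are copied from the table; the names `OVERALL` and `tally_overall` are illustrative, and the unweighted mean is an assumption made for this example, not necessarily how the panel aggregates its ratings.

```python
from collections import defaultdict

# Overall verdicts copied from the table: (model, verdict, confidence in %).
# Kimi K2.6 returned an incomplete response, so it has no confidence value.
OVERALL = [
    ("Grok 4.3", "Mixed", 70),
    ("Mistral Medium 3.5", "Mixed", 65),
    ("OpenAI GPT-5.4", "No", 70),
    ("Claude Opus 4.7", "Mixed", 75),
    ("Llama 4 Maverick", "Mixed", 70),
    ("Gemini 3.1 Pro", "No", 70),
    ("DeepSeek V4 Pro", "Mixed", 65),
    ("GLM 5.1", "Mixed", 65),
    ("Qwen 3.7 Max", "No", 85),
    ("Kimi K2.6", "Incomplete", None),
]


def tally_overall(rows):
    """Return {verdict: (count, mean confidence or None)} for the given rows."""
    groups = defaultdict(list)
    for _model, verdict, confidence in rows:
        groups[verdict].append(confidence)

    summary = {}
    for verdict, values in groups.items():
        scored = [v for v in values if v is not None]
        # Unweighted mean, rounded to one decimal place (an assumption of this sketch).
        mean = round(sum(scored) / len(scored), 1) if scored else None
        summary[verdict] = (len(values), mean)
    return summary


if __name__ == "__main__":
    for verdict, (count, mean) in tally_overall(OVERALL).items():
        print(f"{verdict}: {count} model(s), mean confidence {mean}")
```

Run on the table as given, this yields six "Mixed" verdicts, three "No" verdicts, and one incomplete response.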
What would change our mind
The current evidence leans one way. But we're not committed to the conclusion; we're committed to the evidence.
- A comprehensive archival reanalysis showing that participant belief, experimenter conduct, and outcome coding were substantially more or less consistent than currently understood.
- A large, ethically approved multi-site replication using comparable decision points and transparent preregistered methods.
- New participant-level data from the original studies that clarify how many participants believed the shocks were real at each stage.
- A systematic review comparing Milgram-style studies across countries, decades, and procedural variations with clear criteria for comparability.
- Evidence showing that the original findings either strongly predict or poorly predict behavior in real-world authority-pressure settings.
What each model said
The Milgram obedience experiments produced partially reliable findings, but their generalizability and interpretive clarity are limited. The original experiments demonstrated that a significant proportion of participants (around 65% in the baseline condition) complied with authority figures' instructions to administer what they believed were increasingly severe electric shocks to a confederate. This core finding has been replicated in later studies (e.g., Burger, 2009, *American Psychologist*), which found comparab...
Mostly yes for the narrow laboratory effect, but not as a simple, highly generalizable rule about real-world obedience. Confidence: 0.75 Key evidence: - Milgram’s original studies did find substantial complianc...
Partially. The Milgram experiments produced robust, repeatedly observed patterns of high compliance with authority under specific laboratory conditions, but their reliability as a general estimate of real-world...
The Milgram obedience experiments produced findings that are partially reliable. Confidence: 70% The original experiments by Milgram found that a significant proportion of participants complied with authority i...
The Milgram obedience experiments produced findings that are only partially reliable; while the core observation that people often comply with authority in specific laboratory settings is robust, the original s...
The Milgram obedience experiments produced findings that are partially reliable, but significant methodological, ethical, and interpretive concerns limit their overall reliability. Confidence: Moderate. Key evi...
The Milgram obedience experiments produced partially reliable findings, but their reliability is limited by methodological concerns, interpretive disputes, and constraints on direct replication. Confidence: Mod...
The Milgram obedience experiments produced reliable findings only in the narrow sense that they demonstrated a reproducible laboratory phenomenon of compliance under specific, artificial conditions, but they do...
Incomplete response: output limit reached before a parseable answer.