Contested claim · Science & research · §0217

Did the Milgram obedience experiments produce reliable findings?

The Milgram obedience studies remain influential because they produced striking results about compliance with authority, but their reliability is debated. Later replications and archival analyses suggest some core patterns were observed beyond the original lab, while methodological, ethical, and interpretive concerns limit how confidently the findings can be generalized.

Reviewed by 10 models · 3 countries · 7 curated references · 23 revisions

Panel verdict

6/10 agreement · 71% confidence · 20% spread · filed 29 May 2026

Six of the 10 reviewing models concluded that the available evidence gives a mixed answer to the claim.

The Adjudged panel has not yet completed its full review of this claim. This draft summarizes the main lines of evidence, the contested points, and the kinds of sources that should be examined before a final assessment is issued.

Panel synthesis

Consensus & disagreement

Where the panel agreed

9 of 10 models: The question asks whether Stanley Milgram’s obedience experiments produced reliable findings. In the early 1960s, Milgram reported that many participants were willing to administer...

9 of 10 models: Milgram conducted multiple variations of the obedience procedure and reported substantial differences depending on the conditions. For example, obedience rates were reported to cha...

9 of 10 models: One uncertainty is how many participants fully believed they were harming another person. If a substantial share suspected the shocks were not real, the studies may say less about...

Where the panel diverged

1 model noted: Mistral Medium 3.5 gave the lowest confidence, while still reaching the same overall direction.

Why this question matters

The claim being judged

The question asks whether Stanley Milgram’s obedience experiments produced reliable findings. In the early 1960s, Milgram reported that many participants were willing to administer what they believed were increasingly severe electric shocks to another person when instructed by an authority figure in a laboratory setting.

Reliability can mean several things in this context. It may refer to whether Milgram’s own procedures yielded consistent results across variations, whether later researchers observed similar patterns, whether participants believed the setup, and whether the studies support broader claims about human obedience outside the laboratory.

A careful judgment needs to separate the narrow empirical finding from the broader lesson often attached to it. The narrow finding concerns behavior in a staged experimental situation; the broader claim concerns general human willingness to harm others under authority.

What the evidence shows

Milgram conducted multiple variations of the obedience procedure and reported substantial differences depending on the conditions. For example, obedience rates were reported to change when the authority figure was less physically present, when peers resisted, or when the setting was altered. This suggests the experiments did not produce a single fixed obedience rate, but rather a pattern in which situational features mattered.

Partial replications and related studies have reported behavior broadly consistent with the idea that authority pressure can increase compliance, though most later work used modified procedures because of ethical restrictions. Jerry Burger’s 2009 partial replication, for example, stopped the procedure at 150 volts rather than continuing to Milgram’s original 450-volt maximum. This limits direct comparison, but it still addresses whether participants would continue under pressure up to that key point.

Archival reassessments have raised concerns about the original experiments. Critics have argued that some participants expressed skepticism about the setup, that experimenters sometimes departed from standardized prompts, and that the published account may not fully capture the complexity of what happened in the lab. These issues matter because reliability depends not only on headline percentages but also on participant belief, procedural consistency, and transparent reporting.

Overall, the evidence supports a mixed assessment: the experiments appear to have identified a real and important phenomenon involving authority, pressure, and compliance, but the exact size, interpretation, and generalizability of Milgram’s reported findings remain contested.

Where uncertainty remains

One uncertainty is how many participants fully believed they were harming another person. If a substantial share suspected the shocks were not real, the studies may say less about willingness to inflict serious harm and more about compliance within a confusing laboratory performance.

Another uncertainty is how much weight to give to later replications. Ethical limits mean modern studies cannot exactly repeat the original procedure, so they can test related patterns but not fully reproduce the original design. Cross-cultural and historical differences also complicate comparisons.

A final uncertainty concerns the public interpretation of the experiments. Milgram’s work is often used to explain atrocities, institutional abuse, or everyday conformity, but the laboratory evidence may not map cleanly onto those larger social phenomena without additional evidence.

The three parts of the claim

The umbrella claim is actually several claims bundled into one. Each needs its own evaluation.

PART 1 / 3

Milgram’s original experiments found that a notable share of participants complied with authority instructions under the laboratory conditions used.

Yes · 82%

PART 2 / 3

Later research has closely reproduced the original Milgram procedure and obtained directly comparable results.

Mixed · 63%

PART 3 / 3

The Milgram studies provide a straightforward estimate of how ordinary people would behave in real-world harmful authority situations.

Not supported · 74%

Model comparison

How each panel model rated the three parts of the claim

Model	Part 1	Part 2	Part 3	Overall
Grok 4.3	Yes · 82%	Mixed · 63%	No · 74%	Mixed · 70%
Mistral Medium 3.5	Yes · 82%	Mixed · 63%	No · 74%	Mixed · 65%
OpenAI GPT-5.4	Yes · 82%	Mixed · 63%	No · 74%	No · 70%
Claude Opus 4.7	Yes · 82%	Mixed · 63%	No · 74%	Mixed · 75%
Llama 4 Maverick	Yes · 82%	Mixed · 63%	No · 74%	Mixed · 70%
Gemini 3.1 Pro	Yes · 82%	Mixed · 63%	No · 74%	No · 70%
DeepSeek V4 Pro	Yes · 82%	Mixed · 63%	No · 74%	Mixed · 65%
GLM 5.1	Yes · 82%	Mixed · 63%	No · 74%	Mixed · 65%
Qwen 3.7 Max	Yes · 82%	Mixed · 63%	No · 74%	No · 85%
Kimi K2.6	—	—	—	Incomplete
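
The page does not state how the headline panel figures are derived from the per-model ratings. The sketch below shows one plausible reading, using the Overall column of the table above. Its definitions are assumptions, not the panel's documented method: agreement is the count of the most common overall verdict out of all 10 models, confidence is the rounded mean over complete responses, and spread is the highest confidence minus the lowest. The function name `summarize` is hypothetical.

```python
from statistics import mean

# Overall verdict and confidence (%) transcribed from the model comparison table.
# None marks the model whose response was incomplete.
overall = {
    "Grok 4.3": ("Mixed", 70),
    "Mistral Medium 3.5": ("Mixed", 65),
    "OpenAI GPT-5.4": ("No", 70),
    "Claude Opus 4.7": ("Mixed", 75),
    "Llama 4 Maverick": ("Mixed", 70),
    "Gemini 3.1 Pro": ("No", 70),
    "DeepSeek V4 Pro": ("Mixed", 65),
    "GLM 5.1": ("Mixed", 65),
    "Qwen 3.7 Max": ("No", 85),
    "Kimi K2.6": None,
}


def summarize(panel):
    """Summarize a panel under the assumed definitions described above."""
    complete = [entry for entry in panel.values() if entry is not None]
    labels = [label for label, _ in complete]
    confidences = [confidence for _, confidence in complete]
    # Most common overall verdict among the complete responses.
    top_label = max(set(labels), key=labels.count)
    return {
        "verdict": top_label,
        # Assumed: agreement is counted against every model on the panel.
        "agreement": f"{labels.count(top_label)}/{len(panel)}",
        # Assumed: mean confidence over complete responses, rounded to a whole percent.
        "confidence": round(mean(confidences)),
        # Assumed: spread is the range between highest and lowest confidence.
        "spread": max(confidences) - min(confidences),
    }


summary = summarize(overall)
print(summary)
```

Under these assumptions the result matches the reported panel verdict of 6/10 agreement, 71% confidence, and 20% spread, although other definitions could produce the same numbers.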

An honest commitment

What would change our mind

The current evidence leans one way, but we are committed to the evidence, not to the conclusion. Any of the following would lead us to revisit this assessment:

A comprehensive archival reanalysis showing that participant belief, experimenter conduct, and outcome coding were substantially more or less consistent than currently understood.
A large, ethically approved multi-site replication using comparable decision points and transparent preregistered methods.
New participant-level data from the original studies that clarify how many participants believed the shocks were real at each stage.
A systematic review comparing Milgram-style studies across countries, decades, and procedural variations with clear criteria for comparability.
Evidence showing that the original findings either strongly predict or poorly predict behavior in real-world authority-pressure settings.

Common questions

Does the debate mean Milgram’s experiments have no value?

No. The experiments remain important evidence that authority pressure can influence behavior in powerful ways. The debate is mainly about how precise, standardized, believable, and generalizable the findings were.

Why can’t researchers simply repeat the original experiments?

Modern ethics standards generally would not allow a direct repeat of the original design because participants experienced significant stress and believed they might be harming someone. Later studies therefore use modified procedures, which can inform the question but cannot perfectly recreate the original conditions.

What is the strongest reason for a mixed assessment?

The strongest reason is that the broad pattern of authority-related compliance has support, while the original headline rates and sweeping interpretations face meaningful methodological challenges. Reliability depends on whether the focus is the general phenomenon or the exact claims often drawn from Milgram’s studies.

References

Primary Study

MILGRAM1963 · Behavioral Study of Obedience · Journal of Abnormal and Social Psychology. Milgram’s original published report is central for assessing the initial procedure, results, and framing.

Book

MILGRAM1974 · Obedience to Authority: An Experimental View · Harper & Row. Milgram’s book presents the broader interpretation and many experimental variations.

Replication Study

BURGER2009 · Replicating Milgram: Would People Still Obey Today? · American Psychologist. A prominent partial replication that used modified ethical safeguards and is important for evaluating later evidence.

Critical Analysis

PERRY2012 · Behind the Shock Machine: The Untold Story of the Notorious Milgram Psychology Experiments · The New Press. Uses archival material and participant accounts to question aspects of the standard interpretation.

Review

GIBSON2019 · The Milgram Experiment · Oxford Research Encyclopedia of Psychology. Provides a scholarly overview of the experiments, controversies, and later interpretations.

HASLAMREICHER2012 · Contesting the Nature of Conformity: What Milgram and Zimbardo's Studies Really Show · PLOS Biology. Discusses alternative interpretations of classic obedience and conformity studies.

Context

APS2014 · Revisiting Milgram's Shocking Obedience Experiments · Association for Psychological Science. Summarizes debates over participant belief, archival evidence, and how the experiments have been interpreted.

What each model said

Grok 4.3 Success

The Milgram experiments produced findings reliable only within narrow laboratory conditions, showing high compliance rates (around 65% in the baseline) under scripted authority cues, but unreliable for broader...

x-ai/grok-4.3-20260430 29 May 2026 10:23 stop

Mistral Medium 3.5 Divergent view

The Milgram obedience experiments produced partially reliable findings, but their generalizability and interpretive clarity are limited. The original experiments demonstrated that a significant proportion of participants (around 65% in the baseline condition) complied with authority figures' instructions to administer what they believed were increasingly severe electric shocks to a confederate. This core finding has been replicated in later studies (e.g., Burger, 2009, *American Psychologist*), which found comparab...

mistralai/mistral-medium-3.5-20260430 29 May 2026 10:23 stop

OpenAI GPT-5.4 Success

Mostly yes for the narrow laboratory effect, but not as a simple, highly generalizable rule about real-world obedience. Confidence: 0.75 Key evidence: - Milgram’s original studies did find substantial complianc...

openai/gpt-5.4-20260305 29 May 2026 10:23 stop

Claude Opus 4.7 Success

Partially. The Milgram experiments produced robust, repeatedly observed patterns of high compliance with authority under specific laboratory conditions, but their reliability as a general estimate of real-world...

anthropic/claude-4.7-opus-20260416 29 May 2026 10:23 stop

Llama 4 Maverick Success

The Milgram obedience experiments produced findings that are partially reliable. Confidence: 70% The original experiments by Milgram found that a significant proportion of participants complied with authority i...

meta-llama/llama-4-maverick-17b-128e-instruct 29 May 2026 10:24 stop

Gemini 3.1 Pro Success

The Milgram obedience experiments produced findings that are only partially reliable; while the core observation that people often comply with authority in specific laboratory settings is robust, the original s...

google/gemini-3.1-pro-preview-20260219 29 May 2026 10:24 stop

DeepSeek V4 Pro Success

The Milgram obedience experiments produced findings that are partially reliable, but significant methodological, ethical, and interpretive concerns limit their overall reliability. Confidence: Moderate. Key evi...

deepseek/deepseek-v4-pro-20260423 29 May 2026 10:24 stop

GLM 5.1 Success

The Milgram obedience experiments produced partially reliable findings, but their reliability is limited by methodological concerns, interpretive disputes, and constraints on direct replication. Confidence: Mod...

z-ai/glm-5.1-20260406 29 May 2026 10:24 stop

Qwen 3.7 Max Success

The Milgram obedience experiments produced reliable findings only in the narrow sense that they demonstrated a reproducible laboratory phenomenon of compliance under specific, artificial conditions, but they do...

qwen/qwen3.7-max-20260520 29 May 2026 10:24 stop

Kimi K2.6 Incomplete

Incomplete response: output limit reached before a parseable answer.

moonshotai/kimi-k2.6-20260420 29 May 2026 10:24 length

Coverage map

Models from 3 continents contributed to this review, which provided fair regional balance. Regions shown in grey on the map had no suitable regional model participants on OpenRouter.
