Simple Definition

Weight of evidence refers to a systematic approach that scientists use to evaluate the totality of the scientific evidence and determine whether it supports a particular conclusion. Weight of evidence decisions do not involve a simple tally of positive and negative studies; rather, they rely on expert scientific judgment to assess, review and integrate all of the results into a meaningful conclusion.

Advanced Definition

When scientific studies yield conflicting outcomes, the gold standard for scientists is a weight of evidence approach: a scientifically rigorous review of all of the available data before making a determination.

It is similar in many respects to what happens in a criminal trial, where a judge or jury weighs the credible proof on one side of a dispute against the credible proof on the other.

Evaluating causal criteria that link a chemical to a specified disease or adverse condition is surprisingly complex. This process often involves integrating data from many studies that differ in terms of the strength of their designs, the quality of their underlying data and even in the types of diseases and adverse conditions that are examined. Many scientific topics are also fraught with conflicting findings, making it difficult for even the informed reader to determine what the truth may be.

A weight of evidence approach considers all of the scientific evidence relevant to a particular issue. This stands in stark contrast to a “strength of evidence” evaluation – a less desirable approach that considers only a subset of the evidence, such as only those studies that have found a positive link between exposure to a chemical and a disease or adverse condition. For example, historically, the National Toxicology Program’s regulatory classification of chemical agents as carcinogens has relied upon strength of evidence; that is, the degree of positive evidence from perhaps only a single study showing a statistically significant result.

By contrast, weight of evidence considerations integrate all experimental findings, as well as “mechanism of action” information – all the evidence, positive and negative, that bears on relevance to humans and on a causal determination. However, caution is needed when assessing the human data available within the overall weight of evidence process. One should evaluate whether a determination of the strength of the epidemiological evidence has clearly identified the degree to which the observed link may be explained by other factors, including inherent weaknesses in the study design or execution, as well as other potential explanatory factors.

Expanded Definition

In making a weight of evidence evaluation, scientists consider the quality of each relevant study, as well as how all of the individual pieces of data fit together to create a holistic picture. There are well-established criteria for assessing the reliability of data derived from testing in laboratory animals (e.g., the Klimisch score). Similar criteria exist for evaluating the quality of human observational studies (i.e., epidemiology).

More than 50 years ago, the well-respected English epidemiologist and statistician Sir Austin Bradford Hill proposed a set of criteria for evaluating how data from different studies fit together to support a causal determination. These criteria continue to be debated and refined, but have largely stood the test of time. No single criterion is sufficient to establish a causal relationship between a chemical and a disease, and only the criterion of temporality must necessarily be satisfied. Usually, a group of scientific experts with broad and deep expertise convenes to conduct such a weight of evidence evaluation. The criteria are summarized below:

  1. Strength of association. The stronger the association between the suspect chemical and the disease, the more likely it is to reflect a causal relationship. Hill used the example of chimney sweeps, who died of scrotal cancer at rates roughly 200 times those of the general population. This is an extreme example, but as a rule of thumb, a reported association weaker than two to three times the normal experience deserves greater scrutiny.
  2. Consistency (replicability). To the points made above, a single study reporting a finding should be treated skeptically; it is critical that the finding be independently replicated by other investigators. Ideally, nearly every study should support an association before causation is inferred. Hill used the example of cigarettes as a cause of lung cancer, where numerous studies of different designs nearly all demonstrated a strong association.
  3. Specificity. This is arguably the criterion which has been most debated, since it has subsequently been demonstrated that many chronic diseases can have multiple causes and some substances (e.g., asbestos) can cause multiple health effects. Nevertheless, it can be stated in short, “if specificity exists, we may be able to draw conclusions without hesitation; if it is not apparent, we are not thereby necessarily left sitting irresolutely on the fence.”
  4. Temporality. The cause must precede the effect in time. This is the only absolutely essential criterion that must be met, and it seems self-evident. However, many published studies employ a type of design in which samples to measure exposure to a chemical and disease are made at the same point in time, making it impossible to determine which came first. For example, many studies that measure chemicals in the blood or in other biological specimens and then relate the levels found to current disease states or other physiological parameters suffer from this particular shortcoming.
  5. Biological gradient (Dose-Response). The frequency and/or intensity of the biological response should increase with the size of the dose of exposure to the chemical. The presence of a dose-response relationship certainly strengthens the possibility of cause and effect. Conversely, the absence of one should be considered weaker evidence of a causal link, since this alone cannot completely rule out an association, as the doses tested may have been below the threshold necessary to cause an effect or there may have been gross errors made in measuring exposures.
  6. Plausibility. The effect must be biologically plausible. There should be a theoretical basis for postulating a causal association, and it should not violate well-established laws of nature. On the other hand, the association being reported may be one that is new to science or medicine, so it should not be dismissed too lightly. However, the more novel it is, the more we need to see evidence of independent replication of the findings before accepting them as real.
  7. Coherence. Any inference of cause-and-effect should not seriously conflict with the generally known facts of the natural history and biology of the disease. An example is the “hygiene hypothesis” as a cause of some autoimmune disorders and allergies, since it is consistent with trends in developed countries of both fewer childhood infections and increased prevalence of autoimmune disorders and allergies.
  8. Experiment. Findings from well-designed experiments (such as controlled laboratory, animal model, or clinical trial studies), in which subjects are randomly and blindly allocated to exposure and contemporary control groups and significant variables are held stable to prevent them from interfering with the results, should be accorded more weight than findings from studies that do not employ such methods. Also, if the disease or biological effect can be shown experimentally to be reversed or prevented by withdrawing the exposure, this strengthens an interpretation of causality.
  9. Analogy. When a chemical is suspected of causing an effect, then other similar or analogous chemicals should also be considered and identified as a possible cause or otherwise eliminated from the investigation.
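Two of the criteria above lend themselves to simple arithmetic checks that often appear in such evaluations. The sketch below, using invented counts rather than data from any real study, computes a risk ratio for the strength of an association (criterion 1) and screens dose groups for a monotonic gradient (criterion 5); the function names and numbers are purely illustrative.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Relative risk: the disease rate among the exposed divided by the
    disease rate among the unexposed."""
    exposed_rate = exposed_cases / exposed_total
    unexposed_rate = unexposed_cases / unexposed_total
    return exposed_rate / unexposed_rate

def monotonic_trend(rates_by_dose):
    """True if disease rates never decrease from one dose group to the
    next higher one, i.e. a (non-strict) biological gradient."""
    return all(a <= b for a, b in zip(rates_by_dose, rates_by_dose[1:]))

# Hypothetical cohort: 30 cases among 1,000 exposed vs. 10 among 1,000 unexposed.
rr = risk_ratio(30, 1000, 10, 1000)  # about 3.0, above the two-to-three rule of thumb

# Hypothetical disease rates in dose groups ordered from lowest to highest dose.
gradient = monotonic_trend([0.010, 0.015, 0.022, 0.030])  # rates rise with dose
```

Numbers like these inform, but never replace, expert judgment: a risk ratio of 3 still has to survive scrutiny for confounding, bias, and the remaining Hill criteria before a causal interpretation is warranted.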

Hill offered these guidelines as aids to help find answers to the fundamental questions: Is there any other way of explaining the set of facts before us? Are there any other answers equal or more likely than cause and effect? Hill also cautioned against waiting for full and compelling evidence of causation before taking action if the circumstances warranted, stating that “All scientific work is incomplete, whether it be observational or experimental. All scientific work is liable to be upset or modified by advancing knowledge. That does not confer upon us a freedom to ignore the knowledge we already have, or to postpone the action that it appears to demand at a given time.”

Thus, the decision to take action must consider the full weight of evidence available, as well as the severity of the consequences.