Prof Omar Hasan Kasule, Sr.



The causal triangle: Cause is defined epidemiologically with preventive possibilities in mind. There is little benefit in studying and describing causes that cannot be prevented or changed. The concept of the causal triangle (environment, host, and disease) has been used for many years to simplify epidemiological reasoning. The environment includes factors that actually cause or facilitate disease occurrence. The human host could be protected from contact with the cause of disease or could be made able to resist the disease. The disease condition is the result of the interplay between the environment and the host. There are 2-way interactions among the three factors: agent & disease, agent & host, and host & disease, where the agent is the causal factor in the environment. The interruption of disease causation usually involves one of the three factors.


Causative and preventive causes: Causes may be defined as causative or preventive. A causative cause leads directly to disease by its presence and action. A preventive cause leads to disease by its absence.


Data on causes: Data on causes can be obtained from animal or human experiments/observations. In animal populations each agent is assessed in a controlled experimental study. In human populations, 3 types of studies can be undertaken: clinical observations, observational epidemiology, and experimental studies.


Types of association: A disease and a putative risk factor can be described as statistically associated or not statistically associated. When there is no statistical association we say that the disease is independent of the risk factor. Statistical association, or dependence of the disease on the risk factor, can be causal or non-causal. In a causal association the risk factor is involved in disease causation either directly or indirectly by operating through another related factor. A non-causal association is only apparent and is not biologically real. For example, the presence of xanthomata is associated with coronary heart disease but one is not a biological cause of the other. The joint occurrence arises because both are associated with high cholesterol: high cholesterol is associated with xanthomata and is also associated with coronary heart disease. The two associations are independent of each other.


Models of disease-agent association: There are several possible models of disease causation: (a) one disease may have 2 or more co-factors, (b) one disease may have 2 quite different independent causes, and (c) one cause may lead to 2 different diseases.


Disease risk: Disease risk is a probability. Risk has 2 dimensions: individual & statistical. For a single individual the risk can be 0 (no disease) or 1 (disease). Statistical probability or risk is an empirical frequentist probability based on the experiences of many people and lies between 0 and 1.


Risk factors and risk indicators: Risk factors are established causes. A risk factor is a factor that is known empirically to be involved in disease causation. Risk indicators are likely to be causes but are not yet confirmed. A risk indicator may itself not be involved in disease causation but may be a pointer to the risk.


Description of causes: Epidemiology is the study of the occurrence relation. This is the relation between the determinant (risk factor) and the outcome (disease). Several co-factors modulate this relationship. The view of causality as pure determinism is not practical; modified determinism fits better with observed data. A risk factor's action in disease causation is affected by many other factors that operate with it. A risk factor or cause is described as sufficient when its mere presence will trigger the disease concerned. In practice a sufficient cause refers to a constellation of 2 or more risk factors since most diseases are multi-causal. One disease normally has more than 1 sufficient cause. Some risk factors are present in all sufficient causes of the disease. These are referred to as necessary causes. No disease will result in the absence of a necessary cause. Cigarette smoking is not a necessary cause of lung cancer because some non-smokers get lung cancer. The tubercle bacillus is not a sufficient cause of TB because not every infected person gets the disease. Causes may be weak or strong depending on their contribution to the risk. Causes may interact either cooperatively in disease causation (synergy) or against one another (antagonism). We now know that the causal chain or causal pathway is multi-stage. Initiation of the pathogenic process is by the main risk factor. Co-factors acting as promoters play a role in completing the process to the stage of clinical disease. Some co-factors may be antagonistic, preventing the eventual occurrence of disease. It is not necessary to know all the steps/components of the causal pathway before undertaking preventive action. It is enough to intervene against only one necessary cause/factor along the pathogenic pathway to stop the disease process.


Criteria of causality: The first attempts to describe criteria of causality were made by Robert Koch. He listed the following 3 criteria for bacterial disease: the organism must be constantly present in cases of the disease, the organism must be isolated in culture media, and the disease must be reproduced in a susceptible animal. Those criteria have been modified considerably as knowledge has grown. We now know that criteria of causality are either essential criteria or back-up criteria. The essential causal criteria are four: specificity, strength, time sequence, and biological plausibility. Both the cause and the disease have to be defined as specifically as possible. The strength of association is assessed by studying how disease risk increases/decreases with increase/decrease of the putative risk factor. The time sequence must be correct, i.e. the cause must operate before occurrence of the disease. Biological plausibility requires that the causal relationship be understandable at the cellular level. The back-up causal criteria are five: dose-effect relationship, repetition, consistency, evidence from intervention, and experimental evidence. Under the dose-effect criterion, the more of the putative cause, the higher the risk of the disease. Under the repetition criterion, the association between the 'cause' and the disease is found in different groups and at different times. The cause-disease relationship must be consistent with existing knowledge. Intervention based on the putative cause should lead to decreased disease occurrence. Additional support for the cause-disease relationship can be obtained from animal or in-vitro experiments. Sometimes a cause-disease relationship is accepted by default when there is no other explanation. The case for the putative cause may be strengthened by finding no other explanation for the disease.




The term misclassification bias covers information bias, detection bias, protopathic bias, and the Hawthorne effect. Misclassification is inaccurate assignment of exposure or disease status. There are 2 types of misclassification: random or non-differential misclassification and non-random or differential misclassification.


Random or non-differential misclassification arises, for example, when there are mistakes in exposure assignment but those mistakes are not affected by knowledge of disease status. It could also be random mistakes in assignment of disease status made without knowledge of exposure status. Random variation is normal and its magnitude is assessed by the width of the 95% confidence interval of the effect measure. A wide interval indicates lack of precision. Random misclassification under-estimates the effect measure but does not create a spurious association. It decreases the magnitude of the association, the chi-square statistic, and the magnitude of the effect measure, the OR. In other words the study finds a valid relationship but under-estimates its magnitude. A more figurative explanation is that it dampens the OR, i.e. the OR tends toward the null value. With complete random misclassification, an extreme condition, OR = 1.0 and any association that may exist is masked completely.
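The attenuation described in this paragraph can be illustrated with a small numerical sketch. The 2x2 counts and the error rates (sensitivity 0.8 and specificity 0.9 for exposure measurement) are hypothetical values chosen for illustration, and the function names are not from any library. The same error rates are applied to cases and controls, which is what makes the misclassification non-differential.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed and unexposed cases; c, d = exposed and unexposed controls."""
    return (a * d) / (b * c)


def misclassify_exposure(exposed, unexposed, sensitivity, specificity):
    """Expected observed counts when exposure is measured with error."""
    observed_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    observed_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return observed_exposed, observed_unexposed


# Hypothetical true counts (exposed, unexposed)
true_cases = (60, 40)
true_controls = (30, 70)
true_or = odds_ratio(*true_cases, *true_controls)

# Non-differential: identical error rates in cases and controls
a, b = misclassify_exposure(*true_cases, sensitivity=0.8, specificity=0.9)
c, d = misclassify_exposure(*true_controls, sensitivity=0.8, specificity=0.9)
observed_or = odds_ratio(a, b, c, d)

# Extreme case: exposure assignment is no better than chance
a0, b0 = misclassify_exposure(*true_cases, sensitivity=0.5, specificity=0.5)
c0, d0 = misclassify_exposure(*true_controls, sensitivity=0.5, specificity=0.5)
null_or = odds_ratio(a0, b0, c0, d0)

print(f"true OR = {true_or:.2f}, observed OR = {observed_or:.2f}, chance-level OR = {null_or:.2f}")
```

With these assumed figures the observed OR lies between 1 and the true OR, and when both error rates are 0.5 the OR collapses to 1.0.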


Non-random or differential misclassification is a systematic error that biases the effect measure either toward the null value or away from it. It arises, for example, when there is systematic over-reporting of a disease in the exposed compared to the unexposed. Unlike random misclassification, differential misclassification can push the OR away from the null value. A positive association may become negative and a negative association may become positive.



Information bias is systematic incorrect measurement of response. It reflects a difference in data collection between cases and controls. It could be due to 6 reasons: questionnaire defects, observer errors, respondent errors, instrument errors, diagnostic errors, and exposure mis-specification.


Questionnaire defects arise from ambiguous or inappropriate questions or ambiguous definitions of variables. The design of the questionnaire may also be visually confusing, leading to errors.


Observer errors, also called observer or interviewer bias, arise from misunderstanding procedures, making mistakes in recording, or systematic differences among interviewers (time of the interview, place of the interview, and the manner and duration of the interview).

Respondent errors arise due to non-response, misunderstanding of questions, faulty recall, or lack of interest. Respondent errors may be due to response bias (recall bias) or unacceptability bias. In response bias, cases recall exposure better than controls. In unacceptability bias, information on matters considered shameful is not accurate because the respondent or the interviewer may not be comfortable with certain questions.


Instrument error is due to faulty calibration of measuring instruments, contaminated reagents, incorrect dilutions, or inaccurate diagnostic tests. Instrument error involves both sensitivity and specificity, and a trade-off has to be made since high sensitivity is associated with low specificity and high specificity is associated with low sensitivity. A measurement error has 2 components: systematic error and random error. It is the systematic error that is the source of bias. Errors in measurement of 2 variables may be dependent or independent. If the magnitude and direction of error in one variable affect the magnitude and direction of error in the other, then the errors are said to be dependent and will result in bias. On the other hand, if the errors in one variable do not affect those in the other, the two are said to be independent, with less potential for bias. The direction of bias due to independent non-differential errors is predictable.


Diagnostic accuracy bias arises as errors of clinical assessment. The bias may be due to background and previous experience of the examining physician. It may also be due to sheer clinical incompetence.


Exposure mis-specification arises when there are mistakes in exposure classification. It commonly occurs when a surrogate is used for an exposure and the surrogate or proxy variable does not adequately represent what is being measured. For example, job title may be used as a surrogate for a specific hazardous exposure at the work-place, but it does not follow that all those with that job title actually handled the hazardous material.



Detection bias arises when disease or exposure is sought more vigorously in some groups than in others. In case-control studies, knowledge of the diagnosis of the cases may lead to a more vigorous search for exposure information than is the case for controls. In follow-up studies, the search for disease may be more intense in the exposed than in the unexposed.


Detection bias is more likely to arise unconsciously in the clinician or interviewer if they are not blinded to the exposure or the disease respectively.



Protopathic bias arises when early signs of disease cause a change in behaviour with regard to the risk factor. The following three examples illustrate it. Persons with early signs of lung cancer may stop smoking, so a study could find a spurious negative association between smoking and lung cancer. Early pancreatic cancer may lead to abdominal discomfort, and the subject may increase coffee consumption because of anxiety over the discomfort, so a study could find a spurious positive association between pancreatic cancer and coffee consumption. Physicians normally refuse to prescribe oral contraceptives for women with breast lumps. If those lumps are an early stage in the development of breast cancer, a study could find a spurious negative association between breast cancer and use of oral contraceptives.



The Hawthorne effect, as the term is used in these notes, is the healthy worker effect. (Strictly, the Hawthorne effect refers to subjects changing their behaviour because they know they are being observed.) The healthy worker effect arises when health assessment of workers reveals that they are healthier than the general population, which could lead to a spurious conclusion that the work-place promotes good health. The reality is that recruitment and job termination are selective. Only healthy people are employed, and some factories even administer health questionnaires or carry out health examinations to ascertain this. Those employees whose health begins to falter either resign from the job, are terminated, or fall sick and die. Thus an unhealthy work environment may appear to have healthy workers due to the healthy worker effect.




Selection bias arises when subjects included in the study differ in a systematic way from those not selected.

There are 8 different types of selection bias that will be described: exposure bias, detection bias, referral bias, the Berkson fallacy, non-response bias, the Neyman fallacy, susceptibility bias, and follow-up bias.



Exposure bias arises when selection into the study is different for the exposed and the unexposed.



Detection bias arises when exposure status influences the chances of being included in the study. Asymptomatic diseases may be searched for more vigorously because of knowledge of exposure status. Thus more cases are recruited from the exposed than would otherwise be the situation.



Referral bias arises when there is selective referral. For example, cases referred to big medical centers may be more complicated than those that are not referred. Bias arises because they may also have different pathogenesis and different risk factors. Thus a study based on cases at a referral center may give misleading results about the etiology of a disease.



The Berkson fallacy is an example of selection bias in hospital-based studies described by Pearl in 1929 and Berkson in 1946. Berkson described a spurious negative association between TB and cancer of the lung in a hospital autopsy series. He found that non-cancer autopsies had relatively more TB lesions than cancer autopsies; this could lead to a misleading conclusion that TB was protective against cancer. The actual reason for the apparent observation was that the TB cases that were admitted to the hospital and autopsied did not represent the situation in the community. The non-cancer cases autopsied in the hospital were not representative of the incidence of TB in the general community. TB cases in the general community may not need hospital admission for treatment and, even if admitted, few would end up at autopsy because TB is not a very fatal condition. Selection bias of the Berkson type does not occur when the control group is selected from several diagnostic groups.



Non-response bias arises when those who respond to the invitation to enter the study differ in a systematic way from those who do not respond. Non-response could be due to physicians or hospitals denying or limiting access to particular patient records, or to refusal of subjects to cooperate.



The Neyman fallacy arises when the risk factor is related to prognosis (survival). This will bias prevalence studies. The relation between gender and colo-rectal cancer illustrates this type of bias. The cancer is more common in males but females have overall longer survival and life expectancy. A simple prevalence study will find a spuriously higher proportion of colon cancer in females. This problem is avoided by using only incident cases.
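A rough numerical sketch of this mechanism follows. It relies on the standard steady-state approximation that prevalence is about incidence multiplied by mean disease duration, which is an assumption added here and not stated in these notes. The incidence and survival figures are invented for illustration only.

```python
def steady_state_prevalence(incidence_per_100k, mean_duration_years):
    """Approximate prevalent cases per 100,000 as incidence x mean duration."""
    return incidence_per_100k * mean_duration_years


# Hypothetical figures: incidence per 100,000 per year and mean survival in years
male_incidence, male_duration = 60.0, 4.0
female_incidence, female_duration = 40.0, 7.0

male_prevalence = steady_state_prevalence(male_incidence, male_duration)
female_prevalence = steady_state_prevalence(female_incidence, female_duration)

# Above 1: incident cases show the higher risk in males
incidence_ratio = male_incidence / female_incidence
# Below 1: longer female survival reverses the picture among prevalent cases
prevalence_ratio = male_prevalence / female_prevalence

print(f"incidence ratio (M/F):  {incidence_ratio:.2f}")
print(f"prevalence ratio (M/F): {prevalence_ratio:.2f}")
```

Under these assumed numbers the male-to-female ratio is above 1 for incident cases but below 1 for prevalent cases, which is the reversal that a prevalence study would report.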



Susceptibility is a very interesting source of selection bias. Some persons are more susceptible to certain diseases for reasons indirectly related to the risk factor under study. Longevity of an individual is determined by the longevity of the parents. Those with short-lived parents may lead a hedonistic life-style resulting from parental deprivation. The hedonistic life-style is blamed for their short life when the actual determinant was the short life of the parents. Type A people may smoke or eat high-fat diets. The resulting ischaemic heart disease (IHD) is blamed on the smoking and diet (an external factor) when the actual cause is the type A personality (the internal factor).



Follow-up bias arises in prolonged follow-up studies when loss to follow-up is related to the exposure.




Confounding bias arises when the disease-exposure relationship is disturbed by an extraneous factor called the confounding variable. The confounding variable is not actually involved in the exposure-disease relationship. It is however predictive of disease but is unequally distributed between exposure groups. Being related both to the disease and the risk factor, the confounding variable could lead to a spurious apparent relation between disease and exposure.


The effect of confounding depends on 3 factors: the causal relation between the confounding factor and the disease, the relation between the confounding factor and the exposure, and the prevalence of the confounding factor. Confounding is stronger when these 3 factors increase.



Example #1: alcohol consumption confounds the relation between smoking and lung cancer. There is an indirect relation between alcohol consumption and cancer of the lung. We observe that those who have lung cancer also consume alcohol. This is because of the non-causal relation between alcohol consumption and cigarette smoking. The two are part of the same lifestyle and tend to occur together. The direct causal relationship between cigarette smoking and lung cancer could be distorted in a study in which alcohol consumption is not balanced between the smoking and non-smoking exposure groups. A negative relationship between cigarette smoking and lung cancer will be seen if study subjects are selected predominantly from the non-smoking population.


Example #2: HSV-2 infection confounds the relation between HPV infection and cervical cancer. HPV infection has a direct causal relation to cervical cancer. The relation between HSV-2 infection and cervical cancer is not established. However, HSV-2 infection and HPV infection are usually found together, both being sexually transmitted diseases. Thus a study among predominantly HSV-2 infected subjects will lead to a distorted relation between HPV infection and cervical cancer.


Example #3: Age confounds the relation between hypertension (HT) and ischaemic heart disease (IHD). The relation between age and hypertension is not fully established. Age and IHD have an indirect relation. The direct causal relationship between HT and IHD can be distorted if a study is carried out among predominantly young persons with low blood pressure measurements.


Example #4: Age confounds the relation between place of residence and mortality. Old age is related to place of residence because the elderly tend to move into retirement communities in areas of the country with more favourable climates or better geriatric services. Old age and death are related. There is a doubtful relationship between death and place of residence, based on the assumption that air and water pollution cause higher mortality. Thus a study among the elderly will distort the relation between mortality and place of residence.


Example #5: Smoking confounds the relation between peptic ulcer (PU) and lung cancer. There is an observed relation between lung cancer and PU. Lung cancer is associated with smoking. PU is related to smoking.



This type of bias arises when a wrong statistical model is used. For example, using parametric methods on data that do not satisfy parametric assumptions biases the findings.



Research funding bias

Publication bias




Control of misclassification can be by prevention at the design stage or by adjustment at the data analysis stage. Misclassification bias can be prevented in the study design by avoiding all the sources explained above. Double-blind techniques can decrease observer and respondent bias because neither the observer nor the respondent knows the disease or the exposure status. In structured interviews, all observers interview in the same way, which decreases interviewer bias.


Treatment of misclassification bias after the study uses 2 approaches: the probabilistic approach and measurement of inter-rater variation. The probabilistic approach uses known misclassification error rates to adjust the numbers of subjects in the various categories.
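One way the probabilistic approach could look in practice is sketched below for a binary exposure. It assumes that the sensitivity and specificity of the exposure measurement are known, for instance from a validation study, and are the same in cases and controls. The counts, error rates, and function names are hypothetical.

```python
def correct_exposed_count(observed_exposed, total, sensitivity, specificity):
    """Back-calculate the expected true number exposed from the observed number.

    observed = true_exposed * sensitivity + (total - true_exposed) * (1 - specificity),
    solved for true_exposed. Requires sensitivity + specificity > 1.
    """
    if sensitivity + specificity <= 1:
        raise ValueError("sensitivity + specificity must exceed 1")
    return (observed_exposed - (1 - specificity) * total) / (sensitivity + specificity - 1)


def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed and unexposed cases; c, d = exposed and unexposed controls."""
    return (a * d) / (b * c)


sens, spec = 0.8, 0.9  # assumed known error rates

# Observed: 52 of 100 cases and 31 of 100 controls classified as exposed
cases_exposed = correct_exposed_count(52, 100, sens, spec)
controls_exposed = correct_exposed_count(31, 100, sens, spec)

observed_or = odds_ratio(52, 48, 31, 69)
corrected_or = odds_ratio(cases_exposed, 100 - cases_exposed,
                          controls_exposed, 100 - controls_exposed)

print(f"observed OR = {observed_or:.2f}, corrected OR = {corrected_or:.2f}")
```

Corrected counts can be non-integer, or even negative when the assumed error rates are inconsistent with the data, so the result is an adjustment under the stated error rates and not recovered truth.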


Measurement of inter-rater variation can provide information on interviewer bias. This can be done by using 2 raters per subject and then making a comparison of discordant responses. Alternatively, the same interviewer may undertake a repeat interview of the same subject and the discordances are assessed.
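Agreement between 2 raters on a yes/no item can be summarised as sketched below. These notes do not name a particular statistic; Cohen's kappa is used here as one common chance-corrected choice, and the counts are hypothetical.

```python
def cohen_kappa(both_yes, r1_yes_r2_no, r1_no_r2_yes, both_no):
    """Chance-corrected agreement between 2 raters on a binary item."""
    n = both_yes + r1_yes_r2_no + r1_no_r2_yes + both_no
    observed_agreement = (both_yes + both_no) / n
    r1_yes = (both_yes + r1_yes_r2_no) / n  # proportion rated 'yes' by rater 1
    r2_yes = (both_yes + r1_no_r2_yes) / n  # proportion rated 'yes' by rater 2
    expected_agreement = r1_yes * r2_yes + (1 - r1_yes) * (1 - r2_yes)
    return (observed_agreement - expected_agreement) / (1 - expected_agreement)


# Hypothetical ratings of 100 subjects; the 2 middle counts are the discordant responses
both_yes, r1_only, r2_only, both_no = 40, 10, 5, 45

discordant = r1_only + r2_only
kappa = cohen_kappa(both_yes, r1_only, r2_only, both_no)

print(f"discordant responses = {discordant}, kappa = {kappa:.2f}")
```

A marked imbalance between the 2 discordant cells would suggest that one rater systematically records more positives than the other.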



Prevention: study design should avoid the causes of selection bias that have been mentioned.

Treatment: there are no easy methods for adjustment for the effect of selection bias once it has occurred.




Prevention of confounding at the design stage, by eliminating the effect of the confounding factor, can be achieved using 3 strategies: pair-matching, stratification, and randomisation. Care must be taken to deal only with true confounders. Adjusting for non-confounders reduces the precision of the study.


Matching can be individual (pair) matching or frequency matching. Pair-matching is best for small samples. It is used in both case-control and follow-up studies. It aims at validity by controlling confounding; it is not concerned with the precision of effect estimates. Matching can take any of 3 forms: one-to-one matching, one-to-many matching, and use of several matching groups. Frequency matching is complicated and is rarely used. In frequency matching, we must make sure that the confounding factor has the same proportion in cases and controls. Matching has the advantage of controlling for several confounders at the same time. The disadvantages of matching are: the study is long and complicated, matching can lead to excessive costs, the matching variable cannot be studied in the same study, it is not possible to match on more than a few variables, and overmatching can occur.


Stratification is best for large samples. Stratification is by level or stratum of the putative/suspected confounding factor (CF). A difference between the crude effect measure and the stratum-specific effect measures indicates confounding.

Randomisation in clinical trials eliminates confounding to a certain extent, but it does not eliminate all confounding. It works by randomly distributing confounding factors to the 2 comparison groups in such a way that they balance and cancel each other's effects.



Non-multivariate adjustment can be by stratified analysis using the Mantel-Haenszel (MH) procedure or by standardisation.


Standardisation is adjustment for age, sex, race, socio-economic status (SES), residence, and exposure to risk factors (RF) using a standard population. The standard population is constituted in 3 ways: using one of the comparison groups, combining both groups, and using a national/international reference population. In direct standardisation the study age-specific rates are applied to the standard population. In indirect standardisation the standard age-specific rates are applied to the study population. An alternative to using a standard population is to use weighting factors derived from the variances of the differences between age-specific rates. Standardised rate ratios can be computed from the standardised rates. The MH odds ratio is a type of directly standardised ratio. The standardised mortality ratio (SMR) is a type of indirectly standardised rate ratio.
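The two calculations can be sketched as follows for 3 age groups. All populations, deaths, and rates are hypothetical numbers chosen so that the study population is older than the standard population.

```python
# Hypothetical data for 3 age groups (young, middle, old)
study_population = [1000, 2000, 3000]
study_deaths = [2, 10, 60]
standard_population = [3000, 2000, 1000]
standard_rates = [0.001, 0.004, 0.015]

study_rates = [deaths / n for deaths, n in zip(study_deaths, study_population)]
crude_rate = sum(study_deaths) / sum(study_population)

# Direct standardisation: study age-specific rates applied to the standard population
expected_in_standard = sum(rate * n for rate, n in zip(study_rates, standard_population))
directly_standardised_rate = expected_in_standard / sum(standard_population)

# Indirect standardisation: standard age-specific rates applied to the study population
expected_in_study = sum(rate * n for rate, n in zip(standard_rates, study_population))
smr = sum(study_deaths) / expected_in_study

print(f"crude rate                 = {crude_rate:.4f}")
print(f"directly standardised rate = {directly_standardised_rate:.4f}")
print(f"SMR                        = {smr:.2f}")
```

In this made-up example the directly standardised rate is half the crude rate, because the crude rate was inflated by the older age structure of the study population.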


Stratified analysis (MH procedure) uses different formulas for independent and paired data. There are special procedures for multiple matching. You start by inspecting the 2x2 tables for each stratum of the CF and decide whether adjustment is needed. If the stratum-specific ORs are close to the crude OR, there is no need for adjustment. If they differ from the crude OR, adjustment is needed. The MH procedure combines data from several strata to give a summary MH odds ratio. The 95% CI for the MH odds ratio can be computed using 3 procedures: Woolf's procedure, the log-based procedure, and Miettinen's test-based procedure.
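A minimal sketch of the MH summary odds ratio for independent (unmatched) data follows. It computes the crude OR, the stratum-specific ORs, and the MH OR for 2 hypothetical strata of a confounding factor; confidence intervals are not computed. The tuple layout (a, b, c, d) is a convention adopted for this example.

```python
def odds_ratio(a, b, c, d):
    """OR for a 2x2 table: a, b = exposed and unexposed cases; c, d = exposed and unexposed controls."""
    return (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """MH summary OR: sum of a*d/n over strata divided by sum of b*c/n over strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Hypothetical strata: confounding factor absent, confounding factor present
strata = [(10, 50, 10, 100), (80, 20, 40, 20)]

stratum_ors = [odds_ratio(*table) for table in strata]

# Crude table: the strata collapsed cell by cell
crude_table = [sum(cells) for cells in zip(*strata)]
crude_or = odds_ratio(*crude_table)

mh_or = mantel_haenszel_or(strata)

print(f"stratum ORs = {stratum_ors}")
print(f"crude OR = {crude_or:.2f}, MH OR = {mh_or:.2f}")
```

Here both stratum ORs equal 2 while the crude OR is about 3.1, and the MH OR recovers the common within-stratum value.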



Treatment of confounding at the analysis stage can also use multivariate adjustment procedures: multiple linear regression, the linear discriminant function, and multiple logistic regression.


Multiple linear regression is used for both continuous and categorical/discrete data. The regression model is $y = a + b_1x_1 + b_2x_2 + b_3x_3 + \dots + e$ where $y$ is the dependent variable and the $x$'s are the independent or predictor variables. Each $x$ can be continuous or categorical/discrete. The following assumptions are made for validity of the regression model: linearity, non-interaction, independence, homoscedasticity, and normality. The linearity assumption states that for every pair of $x_1$ and $x_2$, the mean of $y$ lies on a flat plane. The non-interaction assumption states that the effect of a change in $x_1$ on $y$ is independent of the level of $x_2$. The independence assumption requires that $y$ for an individual gives no information on any other individual. The homoscedasticity assumption requires that for every pair of $x_1$ and $x_2$, the variance of $y$ is constant. The normality assumption requires that for every pair of $x_1$ and $x_2$, $y$ is normally distributed.


The linear discriminant function is closely related to multiple linear regression. It is modelled as z-scores that are computed for each individual subject. The z-score is a linear combination of the $x$ variables: $z = b_1x_1 + b_2x_2 + b_3x_3 + \dots$. Individuals are classified into groups according to their z-score. The values of $b$ are the same for all individuals; they define a weighted sum of the $x$'s. The $x$'s that discriminate more are given more weight than those that discriminate less.


Multiple logistic regression is used for categorical data from case-control, follow-up, and cross-sectional studies. The regression model is $\ln[p/(1-p)] = a + b_1x_1 + b_2x_2 + b_3x_3 + \dots$ where $p$ is the probability that the binary outcome $y$ (0 or 1) equals 1. The exponent of a regression coefficient, $e^{b}$, is interpreted as an OR. The unconditional logistic model is used for independent/unmatched data. The conditional logistic model is used for matched data. Significance testing on regression coefficients is carried out using the t-test. The standardised regression coefficient, $b \times$ SD of $x$, identifies the $x$ with the most effect on $y$. A fourth procedure is analysis of covariance (ANCOVA).
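The link between a logistic regression coefficient and the OR can be checked numerically. The sketch assumes the usual logit form of the model, in which the log odds of disease are a linear function of the predictors, with a single binary exposure. The intercept and coefficient are hypothetical values, not estimates from any data.

```python
import math


def logistic(z):
    """Inverse of the logit: converts log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


# Hypothetical intercept and coefficient for one binary exposure x (0 or 1)
a = -2.0
b = math.log(3.0)

p_unexposed = logistic(a)      # x = 0
p_exposed = logistic(a + b)    # x = 1

# OR comparing exposed with unexposed, computed from the model probabilities
or_from_model = odds(p_exposed) / odds(p_unexposed)

print(f"exp(b) = {math.exp(b):.3f}, OR from probabilities = {or_from_model:.3f}")
```

The two printed values agree, which is why the exponent of the coefficient, and not the coefficient itself, is read as the OR.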



Interaction is said to exist if the OR relating disease to risk factor varies at different levels of the CF. The synonyms for interaction are synergism and effect modification. Synergy can be detected when the RR in the presence of 2 factors is more than the sum of the RRs measured independently for each of the 2 factors. Interaction can be conceptualised at 4 levels.

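Types of interaction: statistical (additive and multiplicative), biologic, public health, and decision making. The test for interaction uses a chi-square statistic with $n-1$ degrees of freedom, $\chi^2_{n-1} = \sum_{i=1}^{n} (\ln OR_i - \ln OR)^2 / \mathrm{Var}(\ln OR_i)$, where $i$ indexes the strata, $n$ is the number of strata, $OR_i$ is the OR in stratum $i$, and $OR$ is the overall OR.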
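The test statistic can be computed as sketched below. Two choices are assumptions of this example because the notes do not specify them: the variance of ln OR in each stratum is taken as Woolf's 1/a + 1/b + 1/c + 1/d, and the overall ln OR is taken as the inverse-variance weighted mean of the stratum values. The 2x2 tables are hypothetical.

```python
import math


def log_or_and_variance(a, b, c, d):
    """ln(OR) and its Woolf variance 1/a + 1/b + 1/c + 1/d for one 2x2 table."""
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def heterogeneity_chi_square(strata):
    """Chi-square statistic for variation of the OR across strata, with its df."""
    estimates = [log_or_and_variance(*table) for table in strata]
    weights = [1 / variance for _, variance in estimates]
    # Assumption: overall ln(OR) is the inverse-variance weighted mean
    overall = sum(w * ln_or for w, (ln_or, _) in zip(weights, estimates)) / sum(weights)
    statistic = sum((ln_or - overall) ** 2 / variance for ln_or, variance in estimates)
    return statistic, len(strata) - 1


# Hypothetical strata, (a, b, c, d) = exposed cases, unexposed cases,
# exposed controls, unexposed controls; stratum ORs are 2.25 and 6
strata = [(20, 80, 10, 90), (60, 40, 20, 80)]
statistic, df = heterogeneity_chi_square(strata)

# With df = 1 only, the upper tail of the chi-square equals erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(statistic / 2))

print(f"chi-square = {statistic:.2f} on {df} df, p = {p_value:.3f}")
```

For more than 2 strata the p-value would need a general chi-square distribution function, which the Python standard library does not provide.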



Professor Omar Hasan Kasule Sr. October 2000