
ISLAMIC MEDICAL EDUCATION RESOURCES 04

0405-INTRODUCTION TO MULTIVARIATE ANALYSIS

Paper presented at a workshop on data analysis using SPSS held at the Kulliyah of Medicine, International Islamic University Kuantan MALAYSIA on 25th May 2004 by Prof Dr Omar Hasan Kasule, Sr. MB ChB (MUK), MPH & DrPH (Harvard)

VARIABLES

Understanding variables and their properties is essential to understanding statistical analysis. A constant has only one unvarying value under all circumstances, for example π and c (the speed of light). A random variable can be qualitative (descriptive with no intrinsic numerical value) or quantitative (with intrinsic numerical value). Qualitative variables can be nominal (no specific order of magnitude), ordinal (specific order) or ranked. A random quantitative variable results when numerical values are assigned to results of measurement or counting. It is called a discrete random variable if the assignment is based on counting. It is called a continuous random variable if the numerical assignment is based on measurement. A continuous random variable can be expressed as fractions and decimals. A discrete random variable can only be expressed as whole numbers. The choice of statistical technique depends on the type of variable. Many mistakes in data analysis arise from not knowing the difference between discrete and continuous variables and consequently applying the wrong statistical technique.

 

PRELIMINARIES OF DATA ANALYSIS

Simple manual inspection of the data is needed before applying sophisticated statistical tests. Indiscriminate application of the tests to data leads to wrong or misleading conclusions. Acquiring familiarity with the data by simple manual inspection can help identify outliers, assess the normality of the data distribution, and identify commonsense relationships among variables that could alert the investigator to errors in computer analysis.
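The kind of preliminary inspection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the blood pressure readings are invented, the function name summarize is hypothetical, and the 1.5 x IQR fences used to flag outliers are one common convention, not a rule taken from the workshop.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics and values outside the 1.5 x IQR fences."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three quartile cut points (Q1, median, Q3)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "median": median,
        "sd": statistics.stdev(ordered),
        "outliers": outliers,
    }

# Hypothetical systolic blood pressure readings (mmHg); 250 looks like an entry error
readings = [118, 122, 125, 130, 121, 119, 127, 124, 250, 123]
summary = summarize(readings)
print(summary)
```

A mean that lies well above the median, as it does here, is a simple warning that the distribution is skewed or contains an outlier, and that the flagged value should be checked against the source record before any test is run.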

 

Data analysis is essentially the construction and testing of hypotheses. Two procedures are employed in statistical analysis. The test for association is done first. The assessment of the effect measures is done after finding an association. Effect measures are useless in situations in which tests for association are negative. The tests for association commonly employed are the t test, the chi-square test, the linear correlation coefficient, and the linear regression coefficient. The effect measures commonly employed are the odds ratio, the risk ratio, and the rate difference. Measures of trend can discover relationships that are not picked up by association and effect measures.
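As a rough illustration of the effect measures named above, the sketch below computes them from a 2x2 table of counts. The counts and the function name effect_measures are hypothetical. Because the example uses counts rather than person-time, it reports a risk difference, which is assumed here as the count-based analogue of the rate difference.

```python
def effect_measures(a, b, c, d):
    """Effect measures for a 2x2 table.

    a = exposed with disease,   b = exposed without disease,
    c = unexposed with disease, d = unexposed without disease.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "risk_ratio": risk_exposed / risk_unexposed,
        "risk_difference": risk_exposed - risk_unexposed,
        "odds_ratio": (a * d) / (b * c),
    }

# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed develop disease
measures = effect_measures(a=30, b=70, c=10, d=90)
print(measures)
```

With a disease this common the odds ratio is noticeably larger than the risk ratio; the two agree closely only when the outcome is rare.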

 

TYPES OF ANALYSIS

Univariate analysis is testing a hypothesis about one mean or one proportion. The t test is used to test hypotheses about a single sample mean. The chi-square test is used to test hypotheses about a single sample proportion. Univariate testing answers the question whether the given mean or proportion is significantly different from a hypothesized value, often zero.
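A minimal sketch of the single-sample t test follows. The haemoglobin values and the reference mean of 13.0 are invented for illustration, and one_sample_t is a hypothetical helper. It returns only the test statistic and degrees of freedom; the p-value would be read from a t table or obtained from statistical software.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean uses the sample standard deviation (n - 1 divisor)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1

# Hypothetical haemoglobin values (g/dL) tested against an assumed reference mean
sample = [12.1, 13.4, 11.8, 12.9, 12.5, 13.0, 12.2, 12.7]
t_stat, df = one_sample_t(sample, mu0=13.0)
print(round(t_stat, 3), df)
```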

 

Bivariate analysis tests the hypothesis that two means or two proportions are significantly different from one another. The choice of the statistical test for association in bivariate analysis is made according to Table #1.
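For the comparison of two proportions, the Pearson chi-square statistic for a 2x2 table can be computed directly. The sketch below assumes hypothetical counts and omits the continuity correction; chi_square_2x2 is an illustrative name, not a function from SPSS or any library.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom, no continuity correction).

    Rows are the two groups being compared; columns are outcome present / absent.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Hypothetical counts: 30/100 with the outcome in group 1, 10/100 in group 2
chi2 = chi_square_2x2(30, 70, 10, 90)
print(chi2)
```

The statistic is compared with the chi-square distribution on 1 degree of freedom, for which the 5% critical value is 3.84.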

 

Multivariate analysis in its commonest form is essentially bivariate analysis with adjustment for extraneous variables that confuse (or confound) the bivariate relation. The choice of the statistical test of association for multivariate analysis is made according to Table #2.

 

STATISTICAL MODELS IN DATA ANALYSIS

Observations or raw data have to be fitted to a specific statistical model. Once the model is fitted it can be used for prediction. There are basically three types of models: probability models, likelihood models, and regression models. The probability model is deterministic and stochastic. Probability models commonly used in statistical analysis are the binomial and the normal distributions. The likelihood model derives the maximum likelihood estimator from the data. The maximum likelihood estimate, MLE, is the most likely value of the parameter given the data and is derived iteratively. The regression model may be a Poisson regression model or a binomial logistic regression model. The model allows modeling the interaction among confounders and the interaction between the exposure and the confounders. It can be used to explore additive and synergistic relations.
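To show what deriving a maximum likelihood estimate by repeated approximation looks like, the sketch below applies Newton-Raphson steps to a binomial log-likelihood. The data (7 successes in 20 trials), the starting value, and the function name binomial_mle are assumptions made for illustration. The binomial case is chosen because the answer is known in closed form (successes / trials), so the numerical result can be checked.

```python
def binomial_mle(successes, trials, start=0.2, tol=1e-10, max_iter=50):
    """Newton-Raphson search for the p that maximizes the binomial log-likelihood.

    Assumes 0 < successes < trials so that the maximum lies strictly inside (0, 1).
    """
    failures = trials - successes
    p = start
    for _ in range(max_iter):
        score = successes / p - failures / (1 - p)                 # first derivative
        curvature = -successes / p ** 2 - failures / (1 - p) ** 2  # second derivative
        step = score / curvature
        p -= step
        if abs(step) < tol:
            break
    return p

# Hypothetical data: 7 successes in 20 trials
p_hat = binomial_mle(successes=7, trials=20)
print(p_hat)
```

Regression models such as logistic regression have no closed-form solution, which is why statistical packages fit them with this kind of stepwise search.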

 

NON REGRESSION MULTIVARIATE ANALYSIS (STRATIFIED ANALYSIS)

Stratified analysis has two main purposes: to study effect modification / interaction (variation of effect measures by stratum) and to control bias (confounding bias and other types of bias). It usually starts with an examination of stratum-specific effect measures. If there is variation by stratum (heterogeneity), no further analysis is undertaken and the final results are reported as stratum-specific measures. If there is homogeneity of effect measures across strata, a summary estimate is computed. Heterogeneity is identified using the chi-square test for homogeneity. The summary estimate may be a chi-square or an odds ratio computed using the Mantel-Haenszel procedure.
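The summary odds ratio from the Mantel-Haenszel procedure can be sketched as follows. The two strata of counts are invented, and the function names are hypothetical. The sketch covers only the summary estimate and the stratum-specific odds ratios; it does not implement the chi-square test for homogeneity.

```python
def stratum_odds_ratio(a, b, c, d):
    """Odds ratio for one stratum: a, b = exposed cases / non-cases; c, d = unexposed."""
    return (a * d) / (b * c)

def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio over a list of (a, b, c, d) tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Hypothetical strata, for example two age groups
strata = [(10, 20, 5, 40), (15, 10, 10, 25)]
stratum_ors = [stratum_odds_ratio(*table) for table in strata]
summary_or = mantel_haenszel_or(strata)
print(stratum_ors, round(summary_or, 3))
```

The stratum-specific odds ratios here are similar, which is the situation in which a single summary estimate is reasonable. The summary always lies between the smallest and largest stratum-specific values because it is a weighted average of them.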

 

REGRESSION MULTIVARIATE ANALYSIS

Multivariate models solve two problems that arise when stratified analysis is used. Stratified analysis breaks down when data are sparse, with very low numbers in some strata. Stratified analysis would be very cumbersome if it were used for more than 3 variables. There are three main types of multivariate models: the linear model, the logistic model, and the proportional hazards model. The linear model is $E(Y) = b_0 + \sum_{i=1}^{k} b_i x_i$, where $k$ is the number of independent variables. The binary logistic model is of the form $\ln\{p/(1-p)\} = b_0 + \sum_{i=1}^{k} b_i x_i$. The proportional hazards regression relates hazard at a given time to risk factors such that $y_i = \ln\{h_i(t)/h_0(t)\} = b_1 x_{1i} + b_2 x_{2i} + \dots$. The coefficients of proportional hazards regression are interpreted like coefficients of logistic regression.
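The way logistic coefficients are read can be illustrated with a short sketch. The coefficient values and covariates below are invented, the function names are hypothetical, and the model is assumed to include an intercept b0 as the linear model does. The sketch converts a linear predictor into a probability and a single coefficient into an odds ratio.

```python
import math

def logistic_probability(b0, coefficients, covariates):
    """Probability p implied by ln(p / (1 - p)) = b0 + sum of b_i * x_i."""
    linear_predictor = b0 + sum(b * x for b, x in zip(coefficients, covariates))
    return 1.0 / (1.0 + math.exp(-linear_predictor))

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in the corresponding covariate."""
    return math.exp(coefficient)

# Hypothetical fitted model: intercept, exposure (0/1), age in years
b0, coefficients = -2.0, [0.7, 0.03]
p = logistic_probability(b0, coefficients, covariates=[1, 50])
exposure_or = odds_ratio(coefficients[0])
print(round(p, 4), round(exposure_or, 4))
```

Exponentiating a proportional hazards coefficient in the same way gives a hazard ratio rather than an odds ratio, which is the sense in which the two sets of coefficients are interpreted alike.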

 


TABLE #1:

CHOICE OF STATISTICAL TECHNIQUE FOR BIVARIATE ANALYSIS[i]

 

First variable | Second variable              | Test
Continuous     | Dichotomous, unpaired        | 2-sample t test
Continuous     | Dichotomous, paired          | Paired t test (1-sample t test after taking differences for each pair)
Continuous     | Nominal (more than 2 groups) | 1-way ANOVA
Continuous     | Continuous                   | Linear correlation (Pearson) or linear regression
Ordinal        | Dichotomous, unpaired        | Mann-Whitney U test or chi-square test for linear trend
Ordinal        | Dichotomous, paired          | Wilcoxon test
Ordinal        | Ordinal                      | Spearman correlation or Kendall correlation
Ordinal        | Continuous                   | Categorize the continuous variable and use Spearman correlation, Kendall correlation or the chi-square test
Dichotomous    | Dichotomous, unpaired        | Chi-square test or Fisher exact probability test
Dichotomous    | Dichotomous, paired          | McNemar chi-square test
Dichotomous    | Nominal                      | Chi-square test
Nominal        | Nominal                      | Chi-square test
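
Table #1 can also be written as a small lookup, which makes the selection rule explicit. This is only a sketch: the dictionary mirrors the table entries, the lowercase key spellings and the function name choose_bivariate_test are conventions assumed for this example, and the nominal second variable paired with a continuous first variable is taken to mean more than two groups.

```python
# Table #1 as a lookup keyed by (first variable, second variable)
BIVARIATE_TESTS = {
    ("continuous", "dichotomous, unpaired"): "2-sample t test",
    ("continuous", "dichotomous, paired"): "Paired t test",
    ("continuous", "nominal"): "1-way ANOVA",
    ("continuous", "continuous"): "Linear correlation (Pearson) or linear regression",
    ("ordinal", "dichotomous, unpaired"): "Mann-Whitney U test or chi-square test for linear trend",
    ("ordinal", "dichotomous, paired"): "Wilcoxon test",
    ("ordinal", "ordinal"): "Spearman correlation or Kendall correlation",
    ("ordinal", "continuous"): "Categorize the continuous variable, then Spearman correlation, Kendall correlation or chi-square test",
    ("dichotomous", "dichotomous, unpaired"): "Chi-square test or Fisher exact probability test",
    ("dichotomous", "dichotomous, paired"): "McNemar chi-square test",
    ("dichotomous", "nominal"): "Chi-square test",
    ("nominal", "nominal"): "Chi-square test",
}

def choose_bivariate_test(first, second):
    """Return the Table #1 entry for a pair of variable types."""
    key = (first.strip().lower(), second.strip().lower())
    if key not in BIVARIATE_TESTS:
        raise ValueError(f"No entry in Table #1 for {key}")
    return BIVARIATE_TESTS[key]

print(choose_bivariate_test("Continuous", "Dichotomous, unpaired"))
```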

 

TABLE #2:

CHOICE OF STATISTICAL TECHNIQUE FOR MULTIVARIATE ANALYSIS[i]

 

Dependent variable         | Independent variables                 | Test
Continuous                 | All categorical                       | ANOVA (analysis of variance)
Continuous                 | Mixture of categorical and continuous | ANCOVA (analysis of covariance)
Continuous                 | All continuous                        | Multiple linear regression
Dichotomous                | All categorical                       | Multiple logistic regression or log-linear analysis
Dichotomous                | Mixture of categorical and continuous | Logistic regression
Time-dependent dichotomous | Mixture of categorical and continuous | Cox's proportional hazards model
Dichotomous                | All continuous                        | Logistic regression or discriminant function analysis
Nominal                    | All categorical                       | Log-linear analysis
Nominal                    | Mixture of categorical and continuous | Group the continuous and perform log-linear analysis
Nominal                    | All continuous                        | Discriminant function analysis or categorize the continuous and perform log-linear analysis

NB: Categorical includes nominal, ordinal and dichotomous
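
Using Table #2 involves one extra step compared with Table #1: the independent variables must first be classed as all categorical, all continuous, or a mixture. The sketch below makes that step explicit. The key spellings and the function names independent_mix and choose_multivariate_test are assumptions made for this example; the table entries themselves are taken from Table #2.

```python
# Per the NB under Table #2, categorical covers nominal, ordinal and dichotomous
CATEGORICAL = {"nominal", "ordinal", "dichotomous"}

MULTIVARIATE_TESTS = {
    ("continuous", "all categorical"): "ANOVA",
    ("continuous", "mixture"): "ANCOVA",
    ("continuous", "all continuous"): "Multiple linear regression",
    ("dichotomous", "all categorical"): "Multiple logistic regression or log-linear analysis",
    ("dichotomous", "mixture"): "Logistic regression",
    ("dichotomous", "all continuous"): "Logistic regression or discriminant function analysis",
    ("time-dependent dichotomous", "mixture"): "Cox's proportional hazards model",
    ("nominal", "all categorical"): "Log-linear analysis",
    ("nominal", "mixture"): "Group the continuous and perform log-linear analysis",
    ("nominal", "all continuous"): "Discriminant function analysis or categorize the continuous and perform log-linear analysis",
}

def independent_mix(types):
    """Classify a list of independent variable types as in the middle column of Table #2."""
    kinds = set()
    for variable_type in types:
        name = variable_type.strip().lower()
        if name in CATEGORICAL:
            kinds.add("categorical")
        elif name == "continuous":
            kinds.add("continuous")
        else:
            raise ValueError(f"Unknown variable type: {variable_type!r}")
    if not kinds:
        raise ValueError("At least one independent variable is required")
    if kinds == {"categorical"}:
        return "all categorical"
    if kinds == {"continuous"}:
        return "all continuous"
    return "mixture"

def choose_multivariate_test(dependent, independent_types):
    """Return the Table #2 entry for a dependent variable and its independent variables."""
    key = (dependent.strip().lower(), independent_mix(independent_types))
    if key not in MULTIVARIATE_TESTS:
        raise ValueError(f"No entry in Table #2 for {key}")
    return MULTIVARIATE_TESTS[key]

print(choose_multivariate_test("Dichotomous", ["continuous", "nominal"]))
```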


[i] Jekel et al. Epidemiology, Biostatistics, and Preventive Medicine. WB Saunders, page 175.

 

Omar Hasan Kasule, Sr. May 2004