There are two types of populations: study and target populations. The population may be large or small.
Sampling starts by defining a sampling frame (list of individuals to be sampled). Then a specific method
is used to select the sample from the defined population.
The sampling units are the people or objects to be sampled. A sampling frame can be looked at as the enumeration
of the population by sampling units.
Sampling can be of 3 types: convenience, quota, and random sampling. A random sample is the best and most
scientific of all samples. The defining characteristic of random sampling is that any element has an equal chance of being
selected since selection is purely by chance.
Simple random sampling is the simplest type of random sampling. In simple random sampling, all units in
the population have an equal chance of being selected into the sample. Simple random sampling eliminates personal bias. This
is because, unlike the situation in convenience or quota sampling, the researcher has no way of pre-determining that a particular
member of the population will be included in the sample.
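As a minimal sketch of simple random sampling, the Python fragment below draws units without replacement from a sampling frame. The frame of 100 labelled units, the sample size, and the seed are hypothetical values chosen only for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement; every unit has the same chance."""
    rng = random.Random(seed)  # seeded only so the illustration is repeatable
    return rng.sample(list(frame), n)

# Hypothetical sampling frame of 100 units
frame = [f"unit-{i:03d}" for i in range(1, 101)]
sample = simple_random_sample(frame, 10, seed=1)
```

Because the computer makes the selection, the researcher cannot pre-determine which members enter the sample.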
Stratified random sampling is a type of sampling in which the whole population is divided into groups called
strata. A pre-determined proportion or fraction of each stratum is randomly selected into the sample. Selection is carried
out separately in each stratum using random selection.
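The stratified procedure can be sketched as follows, assuming the same sampling fraction is applied in every stratum. The stratum names, sizes, and the 10% fraction are hypothetical.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Select the same fraction at random, separately within each stratum."""
    rng = random.Random(seed)
    chosen = {}
    for name, units in strata.items():
        k = round(len(units) * fraction)  # number to draw from this stratum
        chosen[name] = rng.sample(units, k)
    return chosen

# Hypothetical population of 100 units split into two strata
strata = {"urban": list(range(0, 60)), "rural": list(range(60, 100))}
chosen = stratified_sample(strata, 0.10, seed=1)
```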
Systematic random sampling is a type of sampling used when there is an ordered list, i.e. the population is arranged
in some definite and known order. The decision can then be made to include in the sample every nth unit, where n may be any
number. The first unit is selected at random and selection then proceeds according to the pre-defined pattern. Systematic sampling
can be as accurate as simple random sampling. This type of sampling will be invalid if there is a natural periodic pattern in
the list that repeats every n units.
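A minimal sketch of systematic sampling: a random start is chosen within the first interval and every nth unit is taken after it. The ordered frame of 100 units and the interval of 10 are hypothetical.

```python
import random

def systematic_sample(ordered_frame, interval, seed=None):
    """Pick a random start in the first interval, then every interval-th unit."""
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random position 0 .. interval-1
    return ordered_frame[start::interval]

# Hypothetical ordered list of 100 units
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, seed=1)
```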
Multi-stage random sampling is a random sample selected in 2 or more stages. This is done for example when
a random sample is selected from each of the 2 gender categories, male and female. Then random samples are selected from each
age category of each gender category. The resulting sample has the advantage of being balanced with respect to gender and
age. Multi-stage sampling is cheaper than simple random sampling. It is convenient when the complete sampling frame is not
known. It has the great advantage of ensuring balanced representation of the groups that may not occur with simple random
sampling.
Cluster sampling is easy and cheap but less precise. Instead of using individuals as sampling units, groups
of individuals (clusters) are used. For example, instead of sampling individuals, households may be sampled. Clusters are normally
selected as natural sub-groupings of the population. A random sample of clusters is selected and sampling is carried out within
each cluster. Cluster sampling proceeds by selecting geographical units like districts or zip codes. Then a house is selected
at random in each unit. A cluster of given size is then formed around the index house. Sophisticated methods for this selection
have been developed. For example the researcher may walk in a straight line in a pre-determined direction while counting until
a pre-determined number of houses is counted. These houses together with the index house will then constitute the cluster.
Similar clusters are formed in the other zip codes and members of the households are interviewed as study subjects.
The qualitative scale is used for attribute or categorical variables. These variables have no intrinsic
numerical value. They arise as a result of classification. Qualitative variables are of three types: nominal, ordinal, and
ranked.
The quantitative scale is used for variables that arise as a result of measurement or counts. Quantitative
variables have an intrinsic numerical value. Quantitative variables are of two types: numerical continuous and numerical
discrete.
The nominal scale is unordered. Ordering is impossible even if desired because there is no natural ordering
of the categories.
There is a natural ordering in the ordinal scale. The order between the groups is pre-determined.
In the ranked scale, observations are arrayed in order of magnitude, either ascending or descending.
The numerical continuous scale is a result of measurement of length, weight, speed, volume, time, or a combination
of these. Continuous data cannot be recorded as exact values. Responses can assume any value, including decimals, with no restrictions at
all. Any point on the scale is possible; the only limitation is accuracy of measurement. Further mathematical manipulations
are limited by degree of accuracy of the measurements. Readings on this scale are not always perfectly accurate because of the inevitable
rounding off error.
The numerical discrete scale is a result of counting. It is a numerical scale that uses only whole positive
integers. There is no continuum of values. No value is permissible between any two integers. The numerical discrete scale,
unlike the numerical continuous, is exact because it is a count of whole numbers.
C. DATA COLLECTION
Sources of data: general population and household census, vital statistics, routinely-collected data, epidemiological
studies, and special surveys.
Data collection processes must be clearly defined in a written protocol which is the operational document
of the study. Data collection is usually by questionnaire. The protocol should include the initial version of the questionnaire.
This can be updated and improved after the pilot study. If a paper questionnaire is used, data transfer into the electronic
form will be necessary. The need for this could be obviated by direct on-line entry of data.
The objectives of the data collection must be defined clearly. Operational decisions and planning depend
on the definition of objectives. It is wrong to collect more data than what is necessary to satisfy the objectives. It is
also wrong to collect data just in case it may turn out to be useful.
The study population is identified. The method of sampling and the size of the sample are determined.
Staff to be used must be trained. The training should go beyond telling them what they will do. They must
have sufficient understanding of the study that they can detect serious mistakes and deviations. A pilot study to test methods
and procedures should be carried out. However well a study is planned, things could go wrong once field work starts.
A pilot study helps detect and correct such pitfalls.
A quality control program must be part of the protocol from the beginning.
Utmost care should be taken in preparing the study questionnaire. A start is made by reviewing questionnaires of
The following should be observed in selecting questionnaire items: (a) Clarity: the wording of the questionnaire
items should leave no room for ambiguity. (b) Comprehensibility: the words must be easy to understand. Technical jargon must be avoided.
Questions should not be double-barreled. (c) Value-laden words and expressions should not be used. (d) Leading questions should
be avoided. Wording should be neutral, neither positive nor negative. (e) The responses must be scaled appropriately.
The following design and structure features must be observed: (a) The questionnaire should be designed for easy
reading. (b) The logical sequence of questions must be proper. (c) Skip patterns should be worked out carefully and exhaustively.
The reliability and validity of the questionnaire should be tested during the pilot study.
Before administering a questionnaire the investigator should be aware of some ethical issues. Informed consent
must be obtained. The information provided could be subpoenaed by a court of law and the investigator cannot refuse to release
it. In the course of the interview the investigator may get information that requires taking life-saving measures. Taking
these measures will however compromise the confidentiality. Such a situation may arise in case of an interviewee who informs
the interviewer that he is planning to commit suicide later that day. Such information may have to be conveyed immediately
to the authorities concerned.
In a face-to-face interview, the interviewer reads out questions to the interviewee and completes the questionnaire.
In the method of questionnaire administration by mail, a questionnaire is mailed to the respondent's address. The respondent
completes and returns the questionnaire in a pre-addressed and stamped envelope.
D. DATA MANAGEMENT
DEFINITION OF TERMINOLOGY
A field/attribute/variable/variate is the characteristic measured for
each member, e.g. name and weight.
A value/element is the actual measurement or count, like 5 cm or 10 kg.
A record/observation is a collection of all variables belonging to one individual.
A file is a collection of records.
A data-base is a collection of files.
A data dictionary is an explanation or index of the data.
Data-base life-cycle: A data base goes through a life cycle of its
own. Data is collected and is stored. New data has to be collected to update the old one.
A census is a study in which all values of a defined finite population are obtained.
DATA CODING, ENTRY, STORAGE, & RETRIEVAL
Coding: Self-coding or pre-coded questionnaires are preferable to those
requiring coding after data collection. Errors and inconsistencies could be introduced into the data during manual coding.
A good pre-coded questionnaire can be produced after piloting the study.
Data entry: Both random and non-random errors could occur in data entry.
The following methods can be used to detect such errors: (a) double entry techniques in which 2 data entry clerks enter the
same data and a check is made by computer on items on which they differ. The items are then re-checked in the questionnaires
and reconciliation is carried out. This method is based on the assumption that the probability of 2 persons making the same
random error on entering an item of information is very low. The items on which the 2 agree are therefore likely to be valid.
(b) The data entered in the computer could be checked manually against the original questionnaire. (c) Interactive data entry
is becoming popular. It enables detection and correction of logical and entry errors immediately. The data entry
program can be set up to detect entries that have unacceptable values or are logically inconsistent.
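The double entry check in (a) can be sketched as a comparison of two independently keyed copies of the same records. The record layout (a dictionary keyed by record number) and the field names are assumptions made for this illustration.

```python
def double_entry_mismatches(entry_a, entry_b):
    """List (record_id, field, value_a, value_b) wherever two entries differ."""
    mismatches = []
    for record_id in sorted(entry_a.keys() & entry_b.keys()):
        rec_a, rec_b = entry_a[record_id], entry_b[record_id]
        for field in sorted(rec_a.keys() | rec_b.keys()):
            if rec_a.get(field) != rec_b.get(field):
                mismatches.append((record_id, field, rec_a.get(field), rec_b.get(field)))
    return mismatches

# Hypothetical data keyed in by two clerks from the same questionnaires
clerk_1 = {1: {"age": 34, "weight_kg": 70.5}, 2: {"age": 51, "weight_kg": 82.0}}
clerk_2 = {1: {"age": 34, "weight_kg": 70.5}, 2: {"age": 15, "weight_kg": 82.0}}
to_recheck = double_entry_mismatches(clerk_1, clerk_2)
```

Only the items returned in `to_recheck` need to be looked up again in the original questionnaires.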
Data editing: This is the process of correcting data collection and
data entry errors. The data is 'cleaned' using logical and statistical checks. Range checks are used to detect entries whose
values are outside what is expected; for example child height of 5 meters is clearly wrong. Consistency checks enable identifying
errors such as recording presence of an enlarged prostate in a female. Among the functions of data editing is to make sure
that all values are at the same level of precision (number of decimal places). This makes computations consistent and decreases
rounding off errors.
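A range check and a consistency check of the kind described above might look like the sketch below. The field names and the plausible height limits (0.3 to 2.5 meters) are assumptions for illustration, not values taken from any study protocol.

```python
def edit_checks(record):
    """Return a list of problems found in one record (empty list if clean)."""
    problems = []
    # Range check: assumed plausible limits for height in meters
    height = record.get("height_m")
    if height is not None and not (0.3 <= height <= 2.5):
        problems.append("height_m out of range")
    # Consistency check: an enlarged prostate cannot be recorded for a female
    if record.get("sex") == "F" and record.get("enlarged_prostate"):
        problems.append("enlarged_prostate inconsistent with sex")
    return problems

bad = edit_checks({"sex": "F", "height_m": 5.0, "enlarged_prostate": True})
good = edit_checks({"sex": "M", "height_m": 1.7, "enlarged_prostate": True})
```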
Validation checks: Data errors can be detected at the stage of data
entry or data editing. More advanced validity checks can be carried out using methods such as: (a) logical checks (b) statistical
checks involving actual plotting of data on a graph and visually detecting outlying values.
Data transformation: This is the process of creating new derived variables
preliminary to analysis. The transformations may be simple using ordinary arithmetical operators or more complex using mathematical
Coding and entry errors
Rounding-off / significant figures
Questions with multiple valid responses
E. DATA PRESENTATION
Objective: summarize data for presentation (parsimony) while preserving
a complete picture. Some information is inevitably lost by grouping.
Data classes: The suitable number of classes is 10-20. Too few intervals
mask details of the distribution. The following are desirable characteristics of classes: mutually exclusive, intervals equal
in width, and intervals continuous throughout the distribution. Class limits (also called class boundaries) are of 2 types:
true and tabulated. The true are more accurate but cannot be easily tabulated. True class limits should conform to data accuracy
(decimals & rounding off). The class mid-points are used in drawing line graphs. The upper class limit (UCL) and the lower
class limit (LCL)
Objective: a table can present a lot of data in logical groupings and
for 2 or more variables for visual inspection.
Type of information presented in tables (a) Frequency and frequency
% (b) Relative (proportional) frequency and relative frequency % (c) Cumulative
frequency and relative cumulative frequency %. The relative cumulative frequency % can be used to compare distributions.
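The quantities listed in (a) to (c) can be computed together. The sketch below builds one row per distinct value; the ten data values are hypothetical.

```python
from collections import Counter

def frequency_table(values):
    """Frequency, relative frequency %, cumulative frequency and cumulative %."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append({
            "value": value,
            "freq": counts[value],
            "rel_pct": 100 * counts[value] / total,
            "cum_freq": cumulative,
            "cum_pct": 100 * cumulative / total,
        })
    return rows

table = frequency_table([1, 2, 2, 3, 3, 3, 3, 4, 4, 5])
```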
Characteristics of an ideal table: Ideal tables are simple, easy to read, and correctly scaled. The layout of the table should make it easy to read and understand
the numerical information. The table must be able to stand on its own, i.e. be understandable without reference to the text. The
table must have a title/heading that indicates its contents. Labeling must be complete and accurate: title, rows and
columns, marginal and grand totals, and units of measurement. The field labels are in the margins of the table while the
numerical data is in the cells that are in the body of the table. Footnotes may be used to explain the table.
Vertical Line graph
Bar diagram: There are 2 types of bar diagram: the bar chart and the histogram.
In the bar chart there are spaces between the bars. In the histogram the bars are lying side by side. The bar chart is best
for discrete, nominal or ordinal data. The histogram is best for continuous data. A vertical line graph could also be considered
a type of bar diagram. Bar charts and histograms are generated from frequency tables discussed above. They can show more than
one variable. Using modern computer technology, they can be constructed as 3-dimensional figures. The area of the bar represents
frequency, i.e. the area of the bar is proportional to the frequency. If the class intervals are equal, the height of the bar
represents frequency. If the class intervals are unequal the height of the bar does not represent frequency. Frequency can
only be computed from the area of the bar. We talk of frequency density (frequency/class width) for histograms with unequal
class widths and for which the frequency density is used as the height of the bars. As the class intervals get smaller the
histogram takes on the shape of a smooth distribution curve. The special advantages of the
bar diagram: (a) it is more intuitive for the non-specialist (b) the area of the bar represents frequency.
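For unequal class widths, the bar height is the frequency density (frequency/class width), so that the bar area recovers the frequency. A small sketch with three hypothetical classes:

```python
def frequency_density(classes):
    """classes: list of (lower, upper, frequency); returns the bar heights."""
    return [freq / (upper - lower) for lower, upper, freq in classes]

# Hypothetical classes; the last one is three times as wide as the others
classes = [(0, 10, 20), (10, 20, 30), (20, 50, 30)]
heights = frequency_density(classes)

# Area of each bar (height x width) gives back the class frequency
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]
```

The second and third classes have the same frequency, yet the wider class gets a bar one third as high.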
Line graphs/frequency polygon: A frequency polygon is the plot of the
frequency against the mid-point of the class interval. The points are joined by straight lines to produce a frequency polygon.
If smoothed, a frequency curve is produced. The line graph can be used to show the following: frequency polygon, cumulative
frequency curve, cumulative frequency % (also called the ogive), and moving averages. The line graph has two axes: the abscissa
or x-axis is horizontal. The ordinate or y-axis is vertical. It is possible to plot several variables on the same graph. The
graph can be smoothed manually or by computer. Time series and changes over time are shown easily by the line graph. Trends,
cyclic and non-cyclic, are easy to represent. Line graphs are sometimes superior to other methods of indicating trend such
as moving averages and linear regression. The frequency polygon has the following advantages: (a) It shows trends better (b)
It is best for continuous data (c) A comparative frequency polygon can show more than 1 distribution on the same page. The
cumulative frequency curve has the additional advantage of being used to show and compare different sets of data because their
respective medians, quartiles, percentiles, can be read off directly from the curve.
Scatter diagram / scatter-gram
Pie chart / pie diagram: These diagrams show relative frequency % converted
into angles of a circle (called sector angles). The area of each sector is proportional to the frequency. The pie chart is used
to compare data sets in 2 ways: (a) the size of the circle indicates overall total; thus a small data set will have a small
pie chart. (b) The sector area or angle shows the distribution within a particular data set.
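The conversion of frequencies into sector angles is a simple proportion of 360 degrees. The category labels and counts below are hypothetical.

```python
def sector_angles(frequencies):
    """Convert category frequencies into pie chart sector angles in degrees."""
    total = sum(frequencies.values())
    return {label: 360 * freq / total for label, freq in frequencies.items()}

angles = sector_angles({"A": 50, "B": 30, "C": 20})
```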
Stem and leaf: the actual values are written in place of the bars of
a histogram. The stem and leaf gives a good idea of the shape of the distribution. It is easy to pick up the minimum and maximum
values. Two variables can be shown on a stem and leaf. The modal class is easy to identify. The class intervals must be equal.
A key must be provided.
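A minimal sketch of a stem and leaf display, assuming non-negative whole numbers with the tens as the stem and the units as the leaf (key: 1 | 2 means 12). The data values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text line per stem; the leaves are the sorted unit digits."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Every stem between the minimum and maximum is shown, so classes stay equal
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

plot = stem_and_leaf([12, 15, 21, 23, 23, 38, 41])
```

The longest line marks the modal class, and the first and last leaves give the minimum and maximum values.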
Maps: different variables or values of one variable can be indicated
by use of different shading, cross-hatching, dotting, and colors.
SHAPES OF DIAGRAMS
Uni-modal is most common in biology. Bi-modal has 2 peaks that are not necessarily of the same height.
More than 2 peaks is unusual.
A perfectly symmetrical curve is bell-shaped and is centered on the mean.
Skew to right (+ve skew) is more common. It shows that there are extreme values on the right of the center
pulling the mean to the right thus mode<median<mean. Skew to left (-ve skew) is less common. It shows that
there are extreme values to the left of the center, pulling the mean to the left thus mean<median<mode.
Poor labeling of scales
Distortion of scales
Omitting zero/origin. If plotting from zero is not possible for reasons of space, a broken line should be used
to show discontinuity of the scale.
F. RATES & PROPORTIONS
Rates are events in a given population over a defined time period. A rate has 3 components: numerator, denominator,
and time. The numerator of a rate is included in its denominator. Incidence is a type of rate. It describes a moving and dynamic
picture of disease.
Crude Rates: Crude rates are un-weighted and can be misleading. Direct comparison of crude rates in 2 populations is not
valid when the populations differ in composition. No valid inference based on crude rates is possible because of confounding. Simpson's paradox, due to confounding,
arises when the conclusion based on crude rates contradicts that based on specific rates.
Specific rates: The following types of specific rates are commonly used: age-specific, sex-specific, place-specific,
race-specific, and cause-specific rates.
Adjusted/standardized rates: Adjustment for age, sex, or any other factor to remove confounding and allow comparison.
Proportions are used for enumeration. A proportion is the number of events divided by the number in the population at risk. It has
only 2 components: the numerator and the denominator. The numerator is included in denominator. The time period is not defined
but is somehow assumed. Prevalence of disease is a proportion. It describes a still/stationary picture of disease.
Like rates, proportions can be crude, specific, and standardized.
Definition: Standardization is a statistical technique that involves adjustment of a rate or a proportion for 1 or
2 confounding factors. There are 2 types of standardisation: direct and indirect. Both involve the same principles but use
Advantages of standardisation: (a) a single summary index is easier to compare across populations than several
specific rates (b) comparison of specific rates may not be valid when some strata have too few subjects to be reliable (c)
specific rates may not be available, especially for occupational studies.
Standardize for what? To compare different populations with varying age, sex, SES, and ethnic distribution.
Direct standardization: Direct standardization is used when age-specific rates are available. The population rates
are applied to the age distribution of a standard population to compute the standardized rate.
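Direct standardization amounts to a weighted average of the study population's age-specific rates, with the standard population's age distribution as weights. The age bands, rates, and standard population counts below are hypothetical.

```python
def directly_standardized_rate(study_rates, standard_population):
    """Apply the study's age-specific rates to a standard age distribution."""
    total = sum(standard_population.values())
    expected_events = sum(
        study_rates[age] * standard_population[age] for age in standard_population
    )
    return expected_events / total

# Hypothetical age-specific rates (events per person per year)
rates = {"0-39": 0.001, "40-64": 0.005, "65+": 0.020}
# Hypothetical standard population
standard = {"0-39": 60000, "40-64": 30000, "65+": 10000}
dsr = directly_standardized_rate(rates, standard)  # about 4.1 per 1000
```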
Indirect standardization: Indirect standardization is used when age-specific rates are not available. The rates
of the standard population are applied to the age distribution of the study sample to compute the expected number of events, which gives the observed/expected ratio.
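Indirect standardization can be sketched in the same way: the standard rates give an expected number of events for the study sample, and the observed count is divided by it. All numbers below are hypothetical.

```python
def observed_expected_ratio(observed_events, standard_rates, study_population):
    """Observed events divided by events expected under the standard rates."""
    expected = sum(
        standard_rates[age] * study_population[age] for age in study_population
    )
    return observed_events / expected

# Hypothetical standard rates and study sample age distribution
standard_rates = {"0-39": 0.001, "40-64": 0.005, "65+": 0.020}
study_population = {"0-39": 2000, "40-64": 1000, "65+": 500}
ratio = observed_expected_ratio(34, standard_rates, study_population)
```

Here 17 events are expected and 34 were observed, so the ratio is about 2.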
Source of the standard population: (a) Combine the 2 populations (b) Use one of the populations (c) Use national
population (d) Use world population.
Definition: The arithmetic mean is defined as the sum of the observations'
values divided by the total number of observations. The arithmetic mean reflects the impact of all observations. The arithmetic
mean is popularly called the average. It is just one of several types of averages. Two other parameters related to the arithmetic
mean are the robust (trimmed) mean and the mid-range. The robust arithmetic mean is the arithmetic mean of the remaining observations
when a fixed percentage of the smallest and largest observations are eliminated. The mid-range is the arithmetic mean of the
values of the smallest and the largest observations.
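The three measures can be compared on one small data set. In this sketch the trimming fraction is applied to each end of the ranked data, and the data values (with one extreme observation) are hypothetical.

```python
def arithmetic_mean(values):
    return sum(values) / len(values)

def trimmed_mean(values, fraction):
    """Mean after dropping the given fraction of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number dropped at each end
    return arithmetic_mean(ordered[k:len(ordered) - k])

def mid_range(values):
    return (min(values) + max(values)) / 2

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 95]
mean_all = arithmetic_mean(data)        # pulled upward by the value 95
mean_trimmed = trimmed_mean(data, 0.1)  # drops 2 and 95
mid = mid_range(data)
```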
The mode for a set of values is defined as the value of the commonest, most frequent, or most popular observation.
The median is defined as the middle observation in a ranked series such that 50% are above and 50% are below. If
the number of observations is odd, the middle observation is the median. If the number of observations is even the arithmetic
mean of the 2 middle observations is the median. The median is the ½(n+1)th observation.
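The odd and even cases of the median, and the mode, can be sketched as below. Returning every tied value for the mode is a choice made for this illustration.

```python
from collections import Counter

def median(values):
    """Middle value of the ranked data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def modes(values):
    """All values that occur with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)
```

For example, `median([7, 1, 3])` is 3, while `median([1, 2, 3, 4])` is 2.5, the mean of the 2nd and 3rd observations.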
H. STANDARD DEVIATION
The standard deviation is the square root of the variance described above. The standard deviation is the most frequently
used measure of variation.
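A minimal sketch of the computation, assuming the sample form of the variance with n-1 in the denominator (the text does not say which form is intended). The data values are hypothetical.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean divided by n - 1 (assumed form)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """The standard deviation is the square root of the variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(data)  # 32 / 7
sd = sample_sd(data)
```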
I. TESTING OF MEANS
THE T TEST STATISTIC
The student t-test is the most commonly used test statistic for inference on continuous numerical data. Because
of its wide-spread use it is often misused. The commonest transgression is to use it on count data.
The t-test is used for inference on one sample mean or two sample means. The t-test must fulfill the following
conditions of validity: (a) The samples compared must be normally distributed
(b) the variances of samples compared must be approximately equal.
The t-test is used uniformly for sample sizes below 60. It is also used for sample sizes above this if the population
standard deviation is not known. The t-test is based on the assumption that for small samples the shape of the distribution
is flatter at the peak and is more elongated at the tails. As the sample size increases, the shape of the distribution tends
towards the Gaussian and the z-test is used.
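As an illustration for two independent sample means, the sketch below computes the pooled-variance t statistic, which relies on the equal-variance condition stated above. It returns only the statistic and its degrees of freedom; the p-value would be read from t tables. The two samples are hypothetical.

```python
import math

def pooled_t_statistic(sample_1, sample_2):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1, mean_2 = sum(sample_1) / n1, sum(sample_2) / n2
    ss_1 = sum((x - mean_1) ** 2 for x in sample_1)
    ss_2 = sum((x - mean_2) ** 2 for x in sample_2)
    df = n1 + n2 - 2
    pooled_variance = (ss_1 + ss_2) / df
    standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    return (mean_1 - mean_2) / standard_error, df

t, df = pooled_t_statistic([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
```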
The F-test is a generalized test used in inference on 3 or more sample means.
The procedures of the F-statistic are also generally called analysis of variance, ANOVA.
ANOVA studies how the mean varies by group.
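A sketch of the one-way ANOVA F statistic, computed as the between-group mean square divided by the within-group mean square. The three groups are hypothetical, and the result would still have to be compared with F tables.

```python
def one_way_f_statistic(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

f, df_b, df_w = one_way_f_statistic([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```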
J. TESTING PROPORTIONS
CHI-SQUARE STATISTIC FOR 2 x 2 TABLES
The chi-square is computed from the data using appropriate formulas, and its distribution takes various shapes depending on the degrees of freedom.
There is no chi-square test for only one proportion.
Most chi-square testing involves two proportions and a 2 x 2 contingency table is used.
The contingency table lay-out and the formulas are different for paired and independent
The chi-square for paired data is called the McNemar chi-square.
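The two statistics can be sketched with the usual short-cut formulas, here without continuity correction (an assumption; the text does not specify). For independent samples the cells a, b, c, d are the four counts of the 2 x 2 table; for paired data only the two discordant cells b and c enter the formula. The counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square for a 2 x 2 table of independent samples (no correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def mcnemar_chi_square(b, c):
    """McNemar chi-square for paired data, from the discordant pairs b and c."""
    return (b - c) ** 2 / (b + c)

independent = chi_square_2x2(10, 20, 20, 10)
paired = mcnemar_chi_square(15, 5)
```

Both statistics have 1 degree of freedom.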
K. LINEAR CORRELATION
Data analysis usually begins with preliminary exploration using linear correlation. A correlation matrix is used
to explore for pairs of variables likely to be associated. Then more sophisticated methods are applied to define the relationships
Correlation describes the relation between 2 variables about the same person or object with no prior evidence of
inter-dependence. Both variables are random. Correlation indicates only association. The association is not necessarily causative.
Correlation analysis has the following objectives: (a) describe the relation between x and y (b) predict y if x
is known and vice versa (c) study trends (d) study the effect of a third factor like age on the relation between x and y.
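As an illustration, the Pearson product-moment correlation coefficient, the usual measure of linear correlation, can be computed as below. The paired observations are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired observations x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

r_up = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])    # perfect positive relation
r_down = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])  # perfect negative relation
```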
L. REGRESSION
The mathematical
model of simple linear regression is shown in the regression equation/regression function/regression line: y=a + bx where
y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and x is the independent/predictor
variable. Both a and b are in a strict sense regression coefficients but the term is usually reserved for b only.
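A minimal sketch of estimating a and b, assuming the ordinary least-squares method (the text does not name the fitting method). The data points are hypothetical and lie exactly on y = 1 + 2x.

```python
def fit_line(x, y):
    """Least-squares estimates of intercept a and slope b in y = a + bx."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx           # slope / regression coefficient
    a = mean_y - b * mean_x   # intercept
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted = a + b * 5  # predicted y for x = 5
```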