By Prof Dr. Omar Hasan Kasule, Sr. on 19th September 2000


There are two types of populations: study and target populations. The population may be large or small.


Sampling starts by defining a sampling frame (list of individuals to be sampled). Then a specific method is used to select the sample from the defined population.

The sampling units are the people or objects to be sampled. A sampling frame can be looked at as the enumeration of the population by sampling units.


Sampling can be of 3 types: convenience, quota, and random sampling. A random sample is the best and most scientific of all samples. The defining characteristic of random sampling is that every element has an equal chance of being selected since selection is purely by chance.


Simple random sampling is the simplest type of random sampling. In simple random sampling, all units in the population are at equal risk of being selected into the sample. The simple random sampling eliminates personal bias. This is because unlike the situation in convenience or quota sampling, the researcher has no way of pre-determining that a particular member of the population will be included in the sample.
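As a minimal sketch of simple random sampling, the Python snippet below draws a sample without replacement from a sampling frame. The frame of ID numbers 1 to 100, the sample size of 10, and the function name `simple_random_sample` are illustrative assumptions, not part of the original notes.

```python
import random

def simple_random_sample(frame, k, seed=None):
    """Draw k distinct units from the sampling frame, each equally likely."""
    rng = random.Random(seed)  # a seed only makes the illustration repeatable
    return rng.sample(list(frame), k)

# Hypothetical sampling frame: ID numbers 1..100
frame = range(1, 101)
sample = simple_random_sample(frame, 10, seed=1)
print(sample)
```

Because the selection is made by the random number generator, the researcher cannot decide in advance which member of the frame will be included.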


Stratified random sampling is a type of sampling in which the whole population is divided into groups called strata. A pre-determined proportion or fraction of each stratum is randomly selected into the sample. Selection is carried out separately in each stratum using random selection.
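The procedure can be sketched in Python as follows. The two strata, the 10% sampling fraction, and the function name `stratified_sample` are assumptions made only for illustration.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Select the same fraction of units, at random, separately in each stratum."""
    rng = random.Random(seed)
    chosen = {}
    for name, units in strata.items():
        k = round(len(units) * fraction)  # pre-determined fraction of the stratum
        chosen[name] = rng.sample(units, k)
    return chosen

# Hypothetical population divided into two strata of unit IDs
strata = {"male": list(range(0, 60)), "female": list(range(100, 140))}
chosen = stratified_sample(strata, 0.10, seed=1)
print({name: len(units) for name, units in chosen.items()})
```

With these made-up strata the sample contains 6 units from the first stratum and 4 from the second, so each stratum is represented in proportion to its size.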


Systematic random sampling is used when there is an ordered list, i.e. the population is arranged in some definite and known order. The decision can then be made to include in the sample every nth unit, where n may be any number. The first unit is selected at random and selection then proceeds according to the pre-defined pattern. Systematic sampling can be as accurate as simple random sampling. This type of sampling will be invalid if the list has a natural periodic pattern that repeats every n units.
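A minimal Python sketch of systematic sampling is given below. The ordered list of 50 units, the interval of 5, and the function name `systematic_sample` are illustrative assumptions.

```python
import random

def systematic_sample(ordered_units, n, seed=None):
    """Pick a random start among the first n units, then take every nth unit."""
    rng = random.Random(seed)
    start = rng.randrange(n)  # index 0 .. n-1, chosen at random
    return ordered_units[start::n]

# Hypothetical ordered list of 50 units
units = list(range(1, 51))
sample = systematic_sample(units, 5, seed=2)
print(sample)
```

Only the starting point is random; every later unit is fixed by the interval, which is why a periodic pattern in the list with the same period as the interval would bias the sample.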


Multi-stage random sampling is a random sample selected in 2 or more stages. This is done for example when a random sample is selected from each of the 2 gender categories, male and female. Then random samples are selected from each age category of each gender category. The resulting sample has the advantage of being balanced with respect to gender and age. Multi-stage sampling is cheaper than simple random sampling. It is convenient when the complete sampling frame is not known. It has the great advantage of ensuring balanced representation of the groups that may not occur with simple random sampling.


Cluster sampling is easy, cheap but less precise. Instead of using individuals as sampling units, groups of individuals (clusters) are used. For example instead of sampling individuals, households may be sampled. Clusters are normally selected as natural sub-groupings of the population. A random sample of clusters is selected and sampling is carried out within each cluster. Cluster sampling proceeds by selecting geographical units like districts or zip codes. Then a house is selected at random in each unit. A cluster of given size is then formed around the index house. Sophisticated methods for this selection have been developed. For example the researcher may walk in a straight line in a pre-determined direction while counting until a pre-determined number of houses is counted. These houses together with the index house will then constitute the cluster. Similar clusters are formed in the other zip codes and members of the households are interviewed as study subjects.



The qualitative scale is used for attribute or categorical variables. These variables have no intrinsic numerical value. They arise as a result of classification. Qualitative variables are of three types: nominal, ordinal, and ranked.   


The quantitative scale is used for variables that arise as a result of measurement or counts. Quantitative variables have an intrinsic numerical value. Quantitative variables are of two types: numerical continuous and numerical discrete.


The nominal scale is unordered. Ordering is impossible even if desired because there is no natural ordering of the categories.


There is a natural ordering in the ordinal scale. The order between the groups is pre-determined.


In the ranked scale, observations are arrayed in order of magnitude, either ascending or descending.


The numerical continuous scale is a result of measurement of length, weight, speed, volume, time, or a combination of these. Continuous data can not take exact values. Responses assume any value including decimals with no restrictions at all. Any point on the scale is possible; the only limitation is accuracy of measurement. Further mathematical manipulations are limited by degree of accuracy of the measurements. Readings on this scale are not always perfectly accurate because of the inevitable rounding off error.


The numerical discrete scale is a result of counting. It is a numerical scale that uses only whole numbers. There is no continuum of values. No value is permissible between any two integers. The numerical discrete scale, unlike the numerical continuous, is exact because it is a count of whole numbers.




Sources of data: general population and household census, vital statistics, routinely-collected data, epidemiological studies, and special surveys.

Data collection processes must be clearly defined in a written protocol which is the operational document of the study. Data collection is usually by questionnaire. The protocol should include the initial version of the questionnaire. This can be updated and improved after the pilot study. If a paper questionnaire is used data transfer into the electronic form will be necessary. The need for this could be obviated by direct on-line entry of data.


The objectives of the data collection must be defined clearly. Operational decisions and planning depend on the definition of objectives. It is wrong to collect more data than what is necessary to satisfy the objectives. It is also wrong to collect data just in case it may turn out to be useful.


The study population is identified. The method of sampling and the size of the sample are determined.


Staff to be used must be trained. The training should go beyond telling them what they will do. They must have sufficient understanding of the study that they can detect serious mistakes and deviations. A pilot study to test methods and procedures should be carried out. However well a study is planned, things could go wrong once field work starts.


A pilot study helps detect and correct such pitfalls.


A quality control program must be part of the protocol from the beginning



Utmost care should be taken in preparing the study questionnaire. A start is made by reviewing questionnaires of previous studies.


The following should be observed in selecting questionnaire items: (a) Clarity: The wording of the questionnaire items should leave no room for ambiguity (b) comprehensibility: the words must be easy. Technical jargon must be avoided. Questions should not be double barreled. (c) Value-laden words and expressions should not be used. (d) Leading questions should be avoided. Wording should not be positive or negative. (e) The responses must be scaled appropriately.


The following design and structure features must be observed: (a) The questionnaire should be designed for easy reading. (b) The logical sequence of questions must be proper. (c) Skip patterns should be worked out carefully and exhaustively.


The reliability and validity of the questionnaire should be tested during the pilot study.

Before administering a questionnaire the investigator should be aware of some ethical issues. Informed consent must be obtained. The information provided could be subpoenaed by a court of law and the investigator cannot refuse to release it. In the course of the interview the investigator may get information that requires taking life-saving measures. Taking these measures will however compromise confidentiality. Such a situation may arise in the case of an interviewee who informs the interviewer that he is planning to commit suicide later that day. Such information may have to be conveyed immediately to the authorities concerned.


In a face-to-face interview, the interviewer reads out questions to the interviewee and completes the questionnaire. In the method of questionnaire administration by mail, a questionnaire is mailed to the respondent's address. The respondent completes and returns the questionnaire in a pre-addressed and stamped envelope.




A field/attribute/variable/variate is the characteristic measured for each member, e.g. name and weight.


A value/element is the actual measurement or count like 5 cm, 10kg.


A record/observation is a collection of all variables belonging to one individual.


A file is a collection of records.


A data-base is a collection of files.


A data dictionary is an explanation or index of the data.


Data-base life-cycle: A data base goes through a life cycle of its own. Data is collected and is stored. New data has to be collected to update the old one.


In a census, all values of a defined finite population are obtained, i.e. the totality.



Coding: Self-coding or pre-coded questionnaires are preferable to those requiring coding after data collection. Errors and inconsistencies could be introduced into the data during manual coding. A good pre-coded questionnaire can be produced after piloting the study.


Data entry: Both random and non-random errors could occur in data entry. The following methods can be used to detect such errors: (a) double entry techniques in which 2 data entry clerks enter the same data and a check is made by computer on items on which they differ. The items are then re-checked in the questionnaires and reconciliation is carried out. This method is based on the assumption that the probability of 2 persons making the same random error on entering an item of information is very low. The items on which the 2 agree are therefore likely to be valid. (b) The data entered in the computer could be checked manually against the original questionnaire. (c) Interactive data entry is becoming popular. It enables detection and correction of logical and entry errors immediately. The computer data entry program could be programmed to detect entries with unacceptable values or that are logically inconsistent
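The double entry check in (a) can be sketched in Python as follows. The record layout (a dictionary keyed by record number) and the function name `double_entry_discrepancies` are hypothetical choices made for illustration.

```python
def double_entry_discrepancies(entry_a, entry_b):
    """List (record_id, field, value_a, value_b) wherever the two entries differ."""
    diffs = []
    for record_id in sorted(entry_a.keys() | entry_b.keys()):
        rec_a = entry_a.get(record_id, {})
        rec_b = entry_b.get(record_id, {})
        for field in sorted(rec_a.keys() | rec_b.keys()):
            if rec_a.get(field) != rec_b.get(field):
                diffs.append((record_id, field, rec_a.get(field), rec_b.get(field)))
    return diffs

# Made-up data keyed in by two clerks from the same questionnaires
clerk_1 = {1: {"age": 34, "weight_kg": 70}, 2: {"age": 51, "weight_kg": 82}}
clerk_2 = {1: {"age": 34, "weight_kg": 70}, 2: {"age": 15, "weight_kg": 82}}
to_recheck = double_entry_discrepancies(clerk_1, clerk_2)
print(to_recheck)
```

Only the items returned by the function need to be looked up again in the original questionnaires for reconciliation.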



Data storage



Data editing: This is the process of correcting data collection and data entry errors. The data is 'cleaned' using logical and statistical checks. Range checks are used to detect entries whose values are outside what is expected; for example child height of 5 meters is clearly wrong. Consistency checks enable identifying errors such as recording presence of an enlarged prostate in a female. Among the functions of data editing is to make sure that all values are at the same level of precision (number of decimal places). This makes computations consistent and decreases rounding off errors.
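A minimal sketch of a range check and a consistency check in Python is shown below. The field names and the acceptable height range of 0.3 to 2.5 meters are assumptions chosen for illustration; a real study would take its limits from the protocol.

```python
def edit_checks(record):
    """Return a list of problems found in one record."""
    problems = []
    # Range check: the limits below are illustrative assumptions
    height = record.get("height_m")
    if height is not None and not (0.3 <= height <= 2.5):
        problems.append("height_m out of range")
    # Consistency check: two fields that contradict each other
    if record.get("sex") == "F" and record.get("enlarged_prostate"):
        problems.append("enlarged_prostate inconsistent with sex")
    return problems

bad = {"height_m": 5.0, "sex": "F", "enlarged_prostate": True}
good = {"height_m": 1.1, "sex": "M", "enlarged_prostate": False}
print(edit_checks(bad))
print(edit_checks(good))
```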


Validation checks: Data errors can be detected at the stage of data entry or data editing. More advanced validity checks can be carried out using two methods: (a) logical checks (b) statistical checks involving actual plotting of data on a graph and visually detecting outlying values.


Data transformation: This is the process of creating new derived variables preliminary to analysis. The transformations may be simple using ordinary arithmetical operators or more complex using mathematical transformations.



Missing data

Coding and entry errors


Irregular patterns

Digit preference


Rounding-off / significant figures

Questions with multiple valid responses

Record duplication




Objective: summarize data for presentation (parsimony) while preserving a complete picture. Some information is inevitably lost by grouping.


Data classes: The suitable number of classes is 10-20. Too few intervals mask details of the distribution. The following are desirable characteristics of classes: mutually exclusive, intervals equal in width, and intervals continuous throughout the distribution. Class limits (also called class boundaries) are of 2 types: true and tabulated. The true limits are more accurate but cannot be easily tabulated. True class limits should conform to data accuracy (decimals and rounding off). The class mid-points are used in drawing line graphs. The upper class limit (UCL) and the lower class limit (LCL)




Objective: a table can present a lot of data in logical groupings and for 2 or more variables for visual inspection.


Type of information presented in tables: (a) Frequency and frequency % (b) Relative (proportional) frequency and relative frequency % (c) Cumulative frequency and relative cumulative frequency %. The relative cumulative frequency % can be used to compare distributions.
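The quantities listed above can be computed with a short Python sketch. The function name `frequency_table` and the ten example values are illustrative assumptions.

```python
from collections import Counter

def frequency_table(values):
    """Frequency, relative frequency %, and cumulative columns for each distinct value."""
    counts = Counter(values)
    total = len(values)
    rows, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append({
            "value": value,
            "freq": counts[value],
            "rel_freq_pct": 100 * counts[value] / total,
            "cum_freq": cumulative,
            "cum_rel_freq_pct": 100 * cumulative / total,
        })
    return rows

table = frequency_table([1, 2, 2, 3, 3, 3, 4, 4, 4, 4])
for row in table:
    print(row)
```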


Characteristics of an ideal table: Ideal tables are simple, easy to read, and correctly scaled. The layout of the table should make it easy to read and understand the numerical information. The table must be able to stand on its own, i.e. be understandable without reference to the text. The table must have a title/heading that indicates its contents. Labeling must be complete and accurate: title, rows and columns, marginal and grand totals, and units of measurement. The field labels are in the margins of the table while the numerical data is in the cells that are in the body of the table. Footnotes may be used to explain the table.



Dot plot


Vertical Line graph


Bar diagram: There are 2 types of bar diagram: the bar chart and the histogram. In the bar chart there are spaces between the bars. In the histogram the bars lie side by side. The bar chart is best for discrete, nominal or ordinal data. The histogram is best for continuous data. A vertical line graph could also be considered a type of bar diagram. Bar charts and histograms are generated from the frequency tables discussed above. They can show more than one variable. Using modern computer technology, they can be constructed as 3-dimensional figures. The area of the bar represents frequency, i.e. the area of the bar is proportional to the frequency. If the class intervals are equal, the height of the bar represents frequency. If the class intervals are unequal, the height of the bar does not represent frequency; frequency can only be computed from the area of the bar. For histograms with unequal class widths, the frequency density (frequency/class width) is used as the height of the bars. As the class intervals get smaller the histogram takes on the shape of a distribution. The special advantages of the bar diagram: (a) it is more intuitive for the non-specialist (b) the area of the bar represents frequency.
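The relation between frequency, class width, and bar height for unequal class intervals can be illustrated with a small Python sketch. The three classes and the function name `frequency_density` are made up for illustration.

```python
def frequency_density(classes):
    """classes: list of (lower_limit, upper_limit, frequency) tuples."""
    return [(lo, hi, freq / (hi - lo)) for lo, hi, freq in classes]

# Hypothetical classes; the third is three times as wide as the others
classes = [(0, 10, 20), (10, 20, 30), (20, 50, 30)]
densities = frequency_density(classes)
for lo, hi, height in densities:
    # Bar area (height x width) recovers the class frequency
    print(lo, hi, height, height * (hi - lo))
```

The second and third classes have the same frequency, but the wider class gets a bar one third as high, so the areas of the two bars are equal.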


Line graphs/frequency polygon: A frequency polygon is the plot of the frequency against the mid-point of the class interval. The points are joined by straight lines to produce a frequency polygon. If smoothed, a frequency curve is produced. The line graph can be used to show the following: frequency polygon, cumulative frequency curve, cumulative frequency % (also called the ogive), and moving averages. The line graph has two axes: the abscissa or x-axis is horizontal. The ordinate or y-axis is vertical. It is possible to plot several variables on the same graph. The graph can be smoothened manually or by computer. Time series and changes over time are shown easily by the line graph. Trends, cyclic and non-cyclic are easy to represent. Line graphs are sometimes superior to other methods of indicating trend such as moving averages and linear regression. The frequency polygon has the following advantages: (a) It shows trends better (b) It is best for continuous data (c) A comparative frequency polygon can show more than 1 distribution on the same page. The cumulative frequency curve has the additional advantage of being used to show and compare different sets of data because their respective medians, quartiles, percentiles, can be read off directly from the curve.


Scatter diagram / scatter-gram


Pie chart / pie diagram: These diagrams show relative frequency % converted into angles of circle (called sector angle). The area of each sector is proportional to the frequency. The pie chart is used to compare data sets in 2 ways: (a) the size of the circle indicates overall total; thus a small data set will have a small pie chart. (b) The sector area or angle shows the distribution within a particular data set.
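Converting frequencies into sector angles is simple arithmetic, sketched here in Python. The category names, counts, and the function name `sector_angles` are illustrative assumptions.

```python
def sector_angles(frequencies):
    """Convert each category's frequency into its angle in degrees out of 360."""
    total = sum(frequencies.values())
    return {category: 360 * freq / total for category, freq in frequencies.items()}

angles = sector_angles({"A": 50, "B": 30, "C": 20})
print(angles)
```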




Stem and leaf: the actual values are written in the histogram and not the bars. The stem and leaf gives a good idea of the shape of the distribution. It is easy to pick up the minimum and maximum values. Two variables can be shown on a stem and leaf. The modal class is easy to identify. The class intervals must be equal. A key must be provided.


Maps: different variables or values of one variable can be indicated by use of different shading, cross-hatching, dotting, and colors.



Uni-modal is most common in biology. Bi-modal has 2 peaks that are not necessarily of the same height. More than bimodal is unusual.


A perfectly symmetrical curve is bell-shaped and is centered on the mean.


Skew to right (+ve skew) is more common. It shows that there are extreme values on the right of the center pulling the mean to the right, thus mode<median<mean. Skew to left (-ve skew) is less common. It shows that there are extreme values to the left of the center.



Poor labeling of scales

Distortion of scales

Omitting zero/origin. If plotting from zero is not possible for reasons of space, a broken line should be used to show discontinuity of the scale.




Rates are events in a given population over a defined time period. A rate has 3 components: numerator, denominator, and time. The numerator of a rate is included in its denominator. Incidence is a type of rate. It describes a moving and dynamic picture of disease.


Crude Rates: Crude rates are un-weighted and can be misleading. Direct comparison of crude rates in 2 populations is not valid. No valid inference based on crude rates is possible because of confounding. Simpson's paradox, due to confounding, arises when the conclusion based on crude rates contradicts that based on specific rates.

Specific rates: The following types of specific rates are commonly used: age-specific, sex-specific, place-specific, race-specific, and cause-specific rates.

Adjusted/standardized rates: Adjustment is made for age, sex, or any other factor to remove confounding and allow comparison across populations.



Proportions are used for enumeration. A proportion is the number of events in a given population at risk. It has only 2 components: the numerator and the denominator. The numerator is included in denominator. The time period is not defined but is somehow assumed. Prevalence of disease is a proportion. It describes a still/stationary picture of disease.


Like rates, proportions can be crude, specific, and standardized.



Definition: Standardization is a statistical technique that involves adjustment of a rate or a proportion for 1 or 2 confounding factors. There are 2 types of standardization: direct and indirect. Both involve the same principles but use different weights.


Advantages of standardization: (a) a single summary index is easier to compare across populations than several specific rates (b) comparison of specific rates may not be valid when some strata have too few subjects to be reliable (c) specific rates may not be available especially for occupational studies.


Standardize for what? To compare different populations with varying age, sex, SES, and ethnic distribution.


Direct standardization: Direct standardization is used when age-specific rates are available. The population rates are applied to the age distribution of a standard population to compute the standardized rate.
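A minimal Python sketch of the direct method follows. The two age groups, the rates, the standard population counts, and the function name `directly_standardized_rate` are all made-up values used only to show the arithmetic.

```python
def directly_standardized_rate(study_rates, standard_population):
    """Weight the study population's age-specific rates by the standard age distribution."""
    total = sum(standard_population.values())
    expected_events = sum(
        study_rates[age] * standard_population[age] for age in standard_population
    )
    return expected_events / total

# Hypothetical age-specific rates (events per person per year) in the study population
study_rates = {"young": 0.001, "old": 0.010}
# Hypothetical standard population
standard_population = {"young": 8000, "old": 2000}
rate = directly_standardized_rate(study_rates, standard_population)
print(rate)  # about 0.0028, i.e. 2.8 per 1000
```

Applying the same standard population to a second study population gives a second standardized rate that can be compared with the first without confounding by age.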


Indirect standardization: Indirect standardization is used when age-specific rates are not available. The rates of the standard population are applied to the age distribution of the study sample to compute the observed/expected ratio.
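The observed/expected ratio can be sketched in Python as below. The standard rates, study population counts, observed number of events, and the function name `observed_expected_ratio` are hypothetical values chosen for illustration.

```python
def observed_expected_ratio(observed_events, standard_rates, study_population):
    """Ratio of observed events to those expected under the standard population's rates."""
    expected_events = sum(
        standard_rates[age] * study_population[age] for age in study_population
    )
    return observed_events / expected_events

# Hypothetical age-specific rates in the standard population
standard_rates = {"young": 0.002, "old": 0.020}
# Hypothetical age distribution of the study sample
study_population = {"young": 1000, "old": 500}
ratio = observed_expected_ratio(18, standard_rates, study_population)
print(ratio)  # about 1.5: more events observed than expected
```

Here 12 events would be expected if the study sample experienced the standard rates, so 18 observed events give a ratio of about 1.5.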


Source of the standard population: (a) Combine the 2 populations (b) Use one of the populations (c) Use national population (d) Use world population



Definition: The arithmetic mean is defined as the sum of the observations' values divided by the total number of observations. The arithmetic mean reflects the impact of all observations. The arithmetic mean is popularly called the average. It is just one of several types of averages. Two other parameters related to the arithmetic mean are the robust (trimmed) mean and the mid-range. The robust arithmetic mean is the arithmetic mean of the remaining observations when a fixed percentage of the smallest and largest observations are eliminated. The mid-range is the arithmetic mean of the values of the smallest and the largest observations.
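The three quantities defined above can be compared on one small data set. The ten values, the 10% trimming fraction, and the function names are assumptions chosen for illustration.

```python
def arithmetic_mean(values):
    return sum(values) / len(values)

def trimmed_mean(values, fraction):
    """Mean after dropping the given fraction of observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number dropped at each end
    return arithmetic_mean(ordered[k:len(ordered) - k])

def mid_range(values):
    return (min(values) + max(values)) / 2

# Made-up data with one extreme value (95)
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 95]
print(arithmetic_mean(data))     # 15.0
print(trimmed_mean(data, 0.10))  # 6.625
print(mid_range(data))           # 48.5
```

The single extreme value pulls the arithmetic mean and especially the mid-range upward, while the trimmed mean stays close to the bulk of the observations.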



The mode for a set of values is defined as the value of the commonest, most frequent, or most popular observation.



The median is defined as the middle observation in a ranked series such that 50% are above and 50% are below. If the number of observations is odd, the middle observation is the median. If the number of observations is even, the arithmetic mean of the 2 middle observations is the median. The median is the ½(n+1)th observation.
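The odd and even cases can be written as a short Python function. The function name `median` and the example values are illustrative.

```python
def median(values):
    """Middle observation of the ranked series; mean of the 2 middle ones if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([7, 1, 5]))     # 5
print(median([4, 1, 3, 2]))  # 2.5
```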



The standard deviation is the square root of the variance described above. The standard deviation is the most frequently used measure of variation.




Student's t-test is the most commonly used test statistic for inference on continuous numerical data. Because of its wide-spread use it is often misused. The commonest transgression is to use it on count data.


The t-test is used for inference on one sample mean or two sample means. The t-test must fulfill the following conditions of validity: (a) The samples compared must be normally distributed (b) the variances of samples compared must be approximately equal.
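As a sketch of the two-sample case, the Python function below computes the standard pooled-variance t statistic, which relies on the equal-variance condition in (b). The notes do not give a formula, so the choice of the pooled form, the function name `pooled_t_statistic`, and the two small samples are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(sample_1, sample_2):
    """Two-sample t statistic with pooled variance, and its degrees of freedom."""
    n1, n2 = len(sample_1), len(sample_2)
    pooled_var = (
        (n1 - 1) * variance(sample_1) + (n2 - 1) * variance(sample_2)
    ) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(sample_1) - mean(sample_2)) / standard_error
    return t, n1 + n2 - 2

# Made-up continuous measurements in two groups
t, df = pooled_t_statistic([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(t, df)
```

The computed t is then compared with the t distribution on the returned degrees of freedom to obtain a p-value.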


The t-test is used uniformly for sample sizes below 60. It is also used for sample sizes above this if the population standard deviation is not known. The t-test is based on the assumption that for small samples the shape of the distribution is flatter at the peak and is more elongated at the tails. As the sample size increases, the shape of the distribution tends towards the Gaussian and the z-test is used.



The F-test is a generalized test used in inference on 3 or more sample means. The procedures of the F-statistic are also generally called analysis of variance, ANOVA.


ANOVA studies how the mean varies by group.




The chi-square is computed from the data using appropriate formulas and takes various shapes


There is no chi-square test for only one proportion.


Most chi-square testing involves two proportions and a 2 x 2 contingency table is used.


The contingency table lay-out and the formulas are different for paired and independent data.



The chi-square for paired data is called the McNemar chi-square.
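A minimal sketch of the McNemar statistic without continuity correction is given below. It uses only the discordant pairs of the paired 2 x 2 table; the symbols b and c, the counts, and the function name `mcnemar_chi_square` are assumptions for illustration.

```python
def mcnemar_chi_square(b, c):
    """McNemar chi-square (1 degree of freedom, no continuity correction).

    b: pairs positive on the first measurement only
    c: pairs positive on the second measurement only
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the statistic is undefined")
    return (b - c) ** 2 / (b + c)

# Hypothetical paired data with 15 and 5 discordant pairs
chi_square = mcnemar_chi_square(15, 5)
print(chi_square)  # 5.0
```

Concordant pairs do not enter the formula because they carry no information about a difference between the two paired proportions.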



Data analysis usually begins with preliminary exploration using linear correlation. A correlation matrix is used to explore for pairs of variables likely to be associated. Then more sophisticated methods are applied to define the relationships further.


Correlation describes the relation between 2 variables about the same person or object with no prior evidence of inter-dependence. Both variables are random. Correlation indicates only association. The association is not necessarily causative.


Correlation analysis has the following objectives: (a) describe the relation between x and y (b) predict y if x is known and vice versa (c) study trends (d) study the effect of a third factor like age on the relation between x and y.



The mathematical model of simple linear regression is shown in the regression equation/regression function/regression line: y=a + bx where y is the dependent/response variable, a is the intercept, b is the slope/regression coefficient, and x is the independent/predictor variable. Both a and b are in a strict sense regression coefficients but the term is usually reserved for b only.
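One common way to estimate a and b from data is ordinary least squares, sketched here in Python. The notes do not name an estimation method, so least squares, the function name `fit_line`, and the four data points are assumptions for illustration.

```python
def fit_line(x, y):
    """Least-squares estimates of the intercept a and slope b in y = a + bx."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope / regression coefficient
    a = mean_y - b * mean_x    # intercept
    return a, b

# Made-up points lying exactly on y = 1 + 2x
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 1.0 2.0
```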

©Professor Omar Hasan Kasule Sr, September 2000