Professor Omar Hasan Kasule Sr.

Learning Objectives:

Types, sources, uses, and limitations of health data

Questionnaires: design & 4 methods of administration (advantages and disadvantages of each)

Methods of data presentation, strengths, and weaknesses

Data presentation: grouping (class intervals, class limits), tabulations (freq, freq%, cum freq, cum freq%), bar diagrams (bar chart & histogram, area of the bar = frequency), line graphs

Shapes of frequency polygons: symmetry, skewness, kurtosis.


Key Words and Terms:

Axis: x axis or abscissa

Axis: y axis or ordinate

Bar diagram: bar chart

Bar diagram: histogram


Class interval

Cumulative frequency

Curve fitting

Data compression

Data display

Data editing

Data encryption

Data interpretation

Data processing

Data protection

Data recovery

Data reduction

Data retrieval

Data storage

Data structures

Data transmission

Data value

Data, grouped

Database design

Database management system

Database searching

Disease reporting

Exact limits

Forensic statistics

Frequency percent

Frequency curves

Frequency distribution

Frequency polygon

Graph theory

Graphic methods

Grouping error

Medical record linkage





Public health records


Relative frequency

Score, grouped

Score=data value

Scores, raw

Shape of distribution

Smoothing of a graph

Statistical diagrams

Statistical forms


Vital events

Vital statistics

Zero point of a graph



Unit Outline:

7.1 Sources of data

7.1.1 General population & household census

7.1.2 Vital statistics

7.1.3 Routinely-collected data

7.1.4 Epidemiological studies

7.1.5 Special surveys


7.2 Data collection

7.2.1 Preliminary measures

7.2.2 Face-to-face questionnaire administration

7.2.3 Questionnaire administration by telephone

7.2.4 Questionnaire administration by mail

7.2.5 Computer-administered questionnaire


7.3 Data management

7.3.1 Introduction

7.3.2 Data coding, entry, storage, & retrieval

7.3.3 Data processing

7.3.4 Data problems



7.4 Data presentation

7.4.1 Data grouping

7.4.2 Data tabulation

7.4.3 Diagrams

7.4.4 Shapes of distributions

7.4.5 Misleading diagrams


7.5 Health data-bases









Timing: Most countries hold decennial censuses, once every 10 years. For logistical reasons some delay beyond 10 years can occur, but it should not be too long because a long delay makes comparison with previous decennial periods difficult. The date of the census must be selected carefully to ensure that as many people as possible will be at home on census day. A census cannot be held during holidays or during seasons of the year when people tend to travel out of their areas of normal residence.


Estimates: Inter-censual estimates are made every year in the period between 2 censuses. These estimates are based on the data of the previous census, supplemented by additional information: death rates, immigration, emigration, and vital statistics of birth and death. Post-censual estimates are based on vital statistics or on analysis of special sample data. The estimates are sometimes proven very inaccurate by the next national census, due to trends that were not included in the estimation. Demographic models are notoriously unreliable; an actual count of all persons is therefore always needed.


Reliability: Governments allocate a lot of resources to ensure that census information is reliable. Despite this, some mistakes still occur. Some households or individuals are missed. Incomplete or inaccurate census forms may be submitted. Some persons may be counted twice: in their place of usual residence and in their residence on census day. Sampling techniques are used to compute the level of reliability of the census results.


Scope: The census covers demographic, social, economic, and health information. Each government department has its own data needs, and the census organization has to strike a balance among the competing needs of several stakeholders; otherwise the census becomes unwieldy, collecting too much information in an attempt to satisfy everybody. The information collected changes from census to census depending on needs. Some information items are kept unchanged so that trends can be assessed over decades.


Organization: A census is a major national undertaking. There is a central office at the national level with regional and district organizations under it. The smallest organizational unit is the local enumeration area usually covered by one enumerator. Guidelines have to be written for each census. Data and management forms have to be prepared. Census managers and enumerators have to be trained. A pilot census has to be carried out to test census procedures. A quality control program must be included in the census planning. Extensive use of the computer makes the census an easier undertaking than it was decades ago when everything was done manually.


Methods of data collection: (a) direct enumeration when the enumerator visits a household. Information is obtained by interviewing members of the household. Some information is obtained by direct observation. (b) Mailed questionnaire to each household. The questionnaire may be mailed back before census day or is collected from each household on census day.


Use/misuse: Census data provides valuable information for economic, social, and health planning. It provides population data that can be used as denominators for health rates used in public health. Census data can be falsified for political reasons (allocation of parliamentary seats) or economic reasons (allocation of the national budget). Some countries are reluctant to hold national censuses for fear of revelation of their demographic weaknesses. In a multi-ethnic country, a census may reveal that some ethnic groups have more or less political or economic clout than what their numbers dictate. Among the methods used to falsify are: deliberate omission of certain persons or enumeration areas, entering wrong information on the census forms, and recording non-existing phantom names and places.


Sources of errors: (a) Counting: normally the total count is not very far from accurate. Some subjects are counted twice whereas others are not counted at all; there is a tendency for these errors to balance out. (b) Age is often under-estimated. The correct approach is to record age at last birthday. (c) Occupational information is notoriously incomplete and inaccurate due to faulty recall, especially for those who changed jobs frequently.


Description of the population: Population composition is described by sex, race/ethnic group, place of birth, urban/rural distribution, marital conditions, socio-economic indicators (literacy, home ownership, occupation).  There are 2 approaches to description of place of residence: In the de jure census persons are counted in the area of their normal residence. In the de facto census persons are counted in the area they were found on the census night. The population may increase, decrease, or stay stationary. Population increase is due to births and immigration. Population decrease is due to death or emigration.



Definition: Vital events are births, deaths, marriages and divorces, and some disease conditions. Collection of vital statistics was initially motivated by the administrative need to keep a record of vital events of legal importance. It was only later, with the growth of the public health discipline, that the uses of this data were understood. Even now a lot of the available data is not fully analyzed or utilized to understand public health phenomena.


Coverage: Most countries have legislation requiring mandatory reporting of vital events. However, the effectiveness and efficiency of registration vary. The items of information reportable vary by country, and even within the same country by jurisdiction, as do the coverage and reliability of vital event reporting. Established market economies generally have good coverage. Poor developing countries do not have the resources, manpower and finance, to maintain reliable systems of vital data collection, and have no strong enforcement mechanisms to ensure full registration. Data processing and report generation are also a problem.


Errors in vital statistics: Vital data may be inaccurate; the usual causes of inaccuracy are misclassification and incomplete information. Reporting of births may not be complete where non-institutional deliveries are common. Uncomplicated home deliveries may not be reported. In cases of extra-marital births, the parents may prefer not to report the information. Deaths at home may not be reported. Institutional deaths may not be reported in the jurisdiction of usual residence, since people may die away from their place of usual residence. Reporting of marriages and divorces has its own problems. There are many registrars, who may be civilian government marriage offices or religious authorities, and the data from these various sources may not be centralized. Many marriages and divorces are informal and are never registered anywhere. Although reporting of specific morbidity data is mandatory, many physicians, especially in the private sector, are reluctant to report, resulting in incomplete information.

Uses of vital statistics: Data on vital events is used for legal purposes, population estimates, and health planning. The following are legal purposes fulfilled by vital events registration: establishing citizenship, payment of social welfare benefits, property or inheritance rights, establishment of paternity and legal financial support for offspring. The population distribution data is used for the following purposes: marketing, planning infra-structural developments (roads, schools, water, sewage, shops, recreational facilities), military planning, planning social security for workers and their dependents. The health planning functions are: planning number of hospital beds, planning of other health facilities, planning for health manpower, planning health insurance, emergency preparedness, and health budget allocations.



The following institutions routinely collect data about their clientele:

Medical facilities: Hospitals, health centers, and other health facilities have limited coverage because they collect data on a small segment of the population that comes to them. The following types of data are available: diseases and their treatment, deaths and their causes, health expenditure, and the demographic character of the catchment area.


Life and health Insurance companies: Unlike health care organizations, insurance companies collect background data bearing on the risk indicators of various disease conditions. They also record health events like surgery because of their impact on premiums.


Institutions: The following institutions collect routine information from their members: military, police, prisons, schools, and factories. Their coverage is limited only to their members. They have the advantage of pre-screening their recruits to make sure they are healthy; they thus have baseline and follow-up data. They have their own medical facilities where records of all members are kept. Their record keeping is efficient because they have strict measures to prevent misuse of health services and absconding from duty on the basis of illness.


Disease registries: Starting with cancer, the number of specific disease registries has grown phenomenally. There are registries for congenital anomalies, genetic anomalies, blood dyscrasias, etc. Another tendency is the emergence of support groups or support networks that enable people suffering from the same disease to stay in touch. Pharmacies also maintain individual records of prescriptions. Some pharmacy networks share data over a wide territory. Thus a lot of useful information is available in many databases.


Government administrative records: These usually relate to public financial assistance or disability. They may not be medically accurate because those who make them are not medically trained. Data collection instruments are not designed with public health in mind, with the result that a lot of health-related data is unusable.


Churches: In Europe, churches used to collect and record vital events. This was easy because churches performed marriages, baptisms, and burials thus covering the vital events of the human life cycle. These days with many people becoming non-practicing Christians these records are no longer complete or representative.



Epidemiological studies are undertaken for a specific purpose. They are of limited coverage. They are based on small samples not necessarily representative of the whole population. Well-funded population-based epidemiological studies may involve several thousand participants and produce much data of public health importance. They elicit specific information relating to disease manifestations and exposure to risk factors.

Observational studies are of 3 types: cross-sectional, case-control, and follow-up/cohort studies.


Experimental studies involving humans have ethical problems and are therefore not popular. Usually they are community intervention studies.

Epidemiological surveillance involves large populations and records of many important events.



Special surveys are studies with coverage of the population larger than that of epidemiological studies. Many are based on national samples.


Health surveys cover symptoms, signs, health-related behavior, treatment, and expenditure.


Nutritional surveys: Cover dietary intake (quality and quantity), anthropometric and biochemical measures of nutritional status.


Socio-demographic surveys: These cover age, gender, dependency, contraceptive practice, family structure, employment status, etc.





Data collection processes must be clearly defined in a written protocol, which is the operational document of the study. Data collection is usually by questionnaire. The protocol should include the initial version of the questionnaire; this can be updated and improved after the pilot study. If a paper questionnaire is used, data transfer into electronic form will be necessary. The need for this can be obviated by direct on-line entry of data.


The objectives of the data collection must be defined clearly. Operational decisions and planning depend on the definition of objectives. It is wrong to collect more data than what is necessary to satisfy the objectives. It is also wrong to collect data just in case it may turn out to be useful.


The study population is identified. The method of sampling and the size of the sample are determined.


Staff must be trained. The training should go beyond telling them what they will do; they must have sufficient understanding of the study to detect serious mistakes and deviations. A pilot study to test methods and procedures should be carried out: however well a study is planned, things can go wrong once field work starts.


A pilot study helps detect and correct such pitfalls.

A quality control program must be part of the protocol from the beginning



Utmost care should be taken in preparing the study questionnaire. A start is made by reviewing questionnaires of previous studies.


The following should be observed in selecting questionnaire items: (a) Clarity: the wording of the questionnaire items should leave no room for ambiguity. (b) Comprehensibility: the words must be easy; technical jargon must be avoided; questions should not be double-barreled. (c) Value-laden words and expressions should not be used. (d) Leading questions should be avoided; the wording should not suggest a positive or negative answer. (e) The responses must be scaled appropriately.


The following design and structure features must be observed: (a) The questionnaire should be designed for easy reading. (b) The logical sequence of questions must be proper. (c) Skip patterns should be worked out carefully and exhaustively.


The reliability and validity of the questionnaire should be tested during the pilot study.

Before administering a questionnaire the investigator should be aware of some ethical issues. Informed consent must be obtained. The information provided could be subpoenaed by a court of law, and the investigator cannot refuse to release it. In the course of the interview the investigator may get information that requires taking life-saving measures; taking these measures will, however, compromise confidentiality. Such a situation may arise in the case of an interviewee who informs the interviewer that he is planning to commit suicide later that day. Such information may have to be conveyed immediately to the authorities concerned.



In a face-to-face interview, the interviewer reads out questions to the interviewee and completes the questionnaire.


This method of data collection has the following advantages: (a) The interviewer can establish the identity of the respondent; in a mailed questionnaire the answers may come from a person other than the intended respondent. (b) There are fewer item non-responses because of the presence of the interviewer, who will encourage and may coax the respondent to answer all items. (c) The interviewer can clarify items that the respondent does not understand or is likely to misunderstand. (d) There is flexibility in the sequence of the items. (e) Open-ended questions are possible. (f) Items irrelevant to the particular interviewee can be dropped, thus saving time.


Face-to-face questionnaire administration also has disadvantages: (a) It costs more in terms of time and money. The interviewer has to travel, search for, and spend time with the respondent. (b) a prior appointment is needed to ensure that the respondent will be available at the place and time of the proposed interview. (c) Personal chemistry may not work well. The interviewee may resent the interviewer on the basis of gender, ethnicity, or any other personal and behavioral characteristic. (d) The presence of the interviewer may influence interviewee responses in a subtle way. The interviewee may try to give responses that he thinks are acceptable to the interviewer on the basis of the interviewer's gender, race, SES, and suggestive questioning.



Questionnaire administration by telephone has the following advantages: (a) Considerable savings in time and money; it is possible to conduct a nation-wide survey sitting in one office. (b) Fewer item non-responses because of the personal contact involved. (c) Skip patterns can be followed to save time. (d) Difficult questions can be explained.


The disadvantages of questionnaire administration by telephone are: (a) Selection bias may operate when the study sample includes only those who have telephones and the telephone numbers are listed. The problem of unlisted numbers can be overcome by use of random digit dialing. (b) Selection bias may arise due to the day and time of day that the telephone call is placed. Office workers will be missed in early morning calls. Workers on night shifts will be missed in evening calls.  (c) It is not possible to be sure whether the person at the other end of the line is the actual intended respondent.


Telephone interviews can be improved by use of computers. A computer-assisted telephone interview can make the process quicker when the interviewer is prompted by the computer. The computer will work out the skip patterns and will alert the interviewer to responses that are inappropriate or contradictory.



In the method of questionnaire administration by mail, a questionnaire is mailed to the respondent's address. The respondent completes and returns the questionnaire in a pre-addressed and stamped envelope.


Questionnaire administration by mail has 2 main advantages: (a) it is the cheapest method of data collection (b) There is no bias due to interviewer involvement. The disadvantages are: (a) low overall response (b) Higher item non-response (c) Delays in returning the questionnaire.


The following measures are undertaken to increase response to mailed questionnaires: (a) sending the questionnaire with a personalized cover letter (b) promising a token of appreciation for return of the questionnaire (c) making the questionnaire anonymous by not including any information on the returned questionnaire that can be used to identify a particular individual (d) providing a self-addressed and stamped envelope for the response (e) using pre-coded questionnaires so that all the respondent has to do is select responses (f) following up by letter with those who delay in returning the questionnaires.



The advantages of computer interview are: (a) It frees the interviewer's time. (b) There are no transcription errors because information is entered on-line. (c) No items are missed because the computer will not allow the respondent to move to the next item before answering the previous one. (d) The respondent can give more honest responses when facing an anonymous computer than when faced by a human interviewer.


The disadvantage of computer-administered questionnaires is that the respondent does not have the opportunity to vary the order of questions to his convenience.





A field/attribute/variable/variate is the characteristic measured for each member, e.g. name and weight.


A value/element is the actual measurement or count, e.g. 5 cm or 10 kg.


A record/observation is the collection of all variables belonging to one individual.


A file is a collection of records.


A data-base is a collection of files.


A data dictionary is an explanation or index of the data.


Data-base life-cycle: A data-base goes through a life cycle of its own. Data is collected and stored; new data then has to be collected to update the old.


In a census, all values of a defined finite population are obtained, i.e. the totality.



Data models can take any of three shapes: relational, hierarchical, and network. In a relational model, data is held in tables and any file or data element can be accessed directly (random access). A hierarchical data set is organized in layers: folders opened at the beginning contain other folders that cannot be seen from the outside. In a network model, a record can be linked to more than one parent record.



Coding: Self-coding or pre-coded questionnaires are preferable to those requiring coding after data collection. Errors and inconsistencies could be introduced into the data during manual coding. A good pre-coded questionnaire can be produced after piloting the study.


Data entry: Both random and non-random errors could occur in data entry. The following methods can be used to detect such errors: (a) double entry techniques in which 2 data entry clerks enter the same data and a check is made by computer on items on which they differ. The items are then re-checked in the questionnaires and reconciliation is carried out. This method is based on the assumption that the probability of 2 persons making the same random error on entering an item of information is very low. The items on which the 2 agree are therefore likely to be valid. (b) The data entered in the computer could be checked manually against the original questionnaire. (c) Interactive data entry is becoming popular. It enables detection and correction of logical and entry errors immediately. The computer data entry program could be programmed to detect entries with unacceptable values or that are logically inconsistent
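The double-entry technique described above can be sketched as a simple comparison of two independently keyed data sets; the record structure and field names here are purely illustrative.

```python
# Double-entry verification sketch: flag every (record, field) pair on
# which two independent data entry clerks disagree, so those items can
# be reconciled against the original questionnaires. Assumes both
# clerks entered the same records and fields.

def double_entry_discrepancies(entry1, entry2):
    """Return (record_id, field) pairs where the two entries differ."""
    discrepancies = []
    for rec_id in entry1:
        for field in entry1[rec_id]:
            if entry1[rec_id][field] != entry2[rec_id][field]:
                discrepancies.append((rec_id, field))
    return discrepancies

clerk_a = {1: {"age": 34, "weight_kg": 70}, 2: {"age": 51, "weight_kg": 82}}
clerk_b = {1: {"age": 34, "weight_kg": 70}, 2: {"age": 15, "weight_kg": 82}}

print(double_entry_discrepancies(clerk_a, clerk_b))  # [(2, 'age')]
```

The method rests on the assumption stated in the text: two clerks are unlikely to make the same random error on the same item, so matching entries are taken as valid.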


Data storage


Data transmission



Data editing: This is the process of correcting data collection and data entry errors. The data is 'cleaned' using logical and statistical checks. Range checks are used to detect entries whose values are outside what is expected; for example, a child height of 5 meters is clearly wrong. Consistency checks enable identifying errors such as recording presence of an enlarged prostate in a female. Among the functions of data editing is to make sure that all values are at the same level of precision (number of decimal places); this makes computations consistent and decreases rounding-off errors.
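The two kinds of edit checks above, range and consistency, can be illustrated with a minimal sketch; the field names and the plausible-height bounds are assumptions for illustration only.

```python
# Data editing sketch: a range check (value outside plausible bounds)
# and a consistency check (a logically impossible combination, e.g. an
# enlarged prostate recorded for a female). Bounds are illustrative.

def edit_checks(record):
    errors = []
    # Range check: a recorded height of 5 metres is clearly wrong.
    if not (0.3 <= record["height_m"] <= 2.5):
        errors.append("height out of range")
    # Consistency check across two fields.
    if record["sex"] == "F" and record.get("enlarged_prostate"):
        errors.append("prostate finding inconsistent with sex")
    return errors

bad = {"height_m": 5.0, "sex": "F", "enlarged_prostate": True}
print(edit_checks(bad))  # both checks fire
```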


Validation checks: Data errors can be detected at the stage of data entry or data editing. More advanced validity checks can be carried out using three methods: (a) logical checks (b) statistical checks involving actual plotting of data on a graph and visually detecting outlying values (c) using more advanced methods of robust estimation. Some data involves interviewers making an assessment and assigning a rating to each respondent on one or several questionnaire items. There is a possibility of biased rating. To overcome this 2 raters are used to interview each respondent and the kappa measure of inter-rater agreement is used. The working is set out as shown in the table below:


                Rater #2: +    Rater #2: -    Total
Rater #1: +     a              b              a+b
Rater #1: -     c              d              c+d
Total           a+c            b+d            n
Kappa = (Lo - Le) / (1 - Le)

Lo = (a + d) / n

Le = [(a+c)(a+b) + (b+d)(c+d)] / n^2

The maximum value of kappa is 1.0
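The kappa calculation can be sketched directly from the 2x2 agreement table, with a and d as the cells in which the two raters agree:

```python
# Cohen's kappa for a 2x2 inter-rater table:
#   observed agreement  Po = (a + d) / n
#   expected agreement  Pe = [(a+c)(a+b) + (b+d)(c+d)] / n^2
#   kappa = (Po - Pe) / (1 - Pe)

def cohen_kappa(a, b, c, d):
    n = a + b + c + d
    p_o = (a + d) / n
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Perfect agreement (no off-diagonal cells) gives the maximum, 1.0.
print(cohen_kappa(40, 0, 0, 60))  # 1.0
# Partial agreement gives a value between chance (0) and 1.
print(round(cohen_kappa(20, 5, 10, 15), 3))
```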


Data transformation: This is the process of creating new derived variables preliminary to analysis. The transformations may be simple, using ordinary arithmetical operators, or more complex, using mathematical transformations. New variables may be generated by: (a) carrying out mathematical operations on the old variables such as division or multiplication (b) combining 2 or more variables to generate a new one by addition, subtraction, multiplication, or division. New variables can also be generated by using mathematical transformations of variables for the purposes of: stabilizing variances, linearizing relations, normalizing distributions (making them conform to the Gaussian distribution), or presenting data in a more acceptable scale of measurement. Four types of mathematical transformations are carried out on count or measurement data: logarithmic, trigonometric, power, and z-transformations. Both the natural (Napierian, base e) and common (base 10) logarithmic transformations can be used. Trigonometric transformation involves replacing each data value by its sine, cosine, or tangent. Power transformations can take any of three types: the exponential transformation, the square root transformation, and the reciprocal transformation. Data can also be expressed in terms of the z-score, which is the difference between the data value and the group mean divided by the group standard deviation. The probit and logit transformations are applied to data expressed as proportions.
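Several of the transformations named above can be sketched on a small illustrative sample:

```python
import math

# Common data transformations: logarithmic (base e and base 10),
# square root, reciprocal, and the z-score (value minus group mean,
# divided by the group standard deviation). Sample values are
# illustrative only.

values = [2.0, 4.0, 8.0]
log_e = [math.log(x) for x in values]     # natural (Napierian, base e) log
log_10 = [math.log10(x) for x in values]  # common (base 10) log
sqrt_t = [math.sqrt(x) for x in values]   # square root transformation
recip = [1.0 / x for x in values]         # reciprocal transformation

mean = sum(values) / len(values)
sd = (sum((x - mean) ** 2 for x in values) / (len(values) - 1)) ** 0.5
z_scores = [(x - mean) / sd for x in values]
print([round(z, 3) for z in z_scores])
```

Note that by construction the z-scores of a sample sum to zero, which is one way the transformation "normalizes" the scale.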



Missing data: Missing data can arise from data collection when no response was recorded at all or from data entry when the value was dropped accidentally. Missing data due to data entry errors is easy to correct. It is more difficult to go back and collect the missing data from the respondents and analysis may have to proceed with some data missing. It is better to have a code for missing data than to leave the field blank.


Coding and entry errors




Irregular patterns


Digit preference




Rounding-off / significant figures


Questions with multiple valid responses


Record duplication






Objective: summarize data for presentation (parsimony) while preserving a complete picture. Some information is inevitably lost by grouping.


Data classes: The suitable number of classes is 10-20. Too few intervals mask details of the distribution. The following are desirable characteristics of classes: mutually exclusive, equal in width, and continuous throughout the distribution. Class limits (also called class boundaries) are of 2 types: true and tabulated. The true limits are more accurate but cannot be easily tabulated. True class limits should conform to data accuracy (decimals & rounding off). The class mid-points are used in drawing line graphs. Each class is defined by its lower class limit (LCL) and upper class limit (UCL).
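Grouping raw scores into equal-width classes can be sketched as follows; the class width and starting point are illustrative choices.

```python
# Group raw scores into equal-width class intervals. Each class runs
# from its lower class limit (inclusive) to its upper class limit
# (exclusive), so the classes are mutually exclusive and continuous.

def group_scores(scores, lower, width, n_classes):
    """Return a list of (lower_limit, upper_limit, frequency) tuples."""
    classes = []
    for i in range(n_classes):
        lcl = lower + i * width   # lower class limit
        ucl = lcl + width         # upper class limit (exclusive)
        freq = sum(1 for s in scores if lcl <= s < ucl)
        classes.append((lcl, ucl, freq))
    return classes

scores = [12, 15, 17, 21, 22, 22, 25, 28, 31, 34]
for lcl, ucl, f in group_scores(scores, 10, 5, 5):
    print(f"{lcl}-{ucl - 1}: {f}")
```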


Dichotomy/trichotomy: Data grouping is sometimes achieved by dividing the data into 2 groups (dichotomy), 3 groups (trichotomy), or 4 or more groups (polychotomy).


Grouping errors: The bigger the class interval, the bigger the error. The error arises when the distribution of scores about the mid-point is not uniform.



Objective: a table can present a lot of data in logical groupings and for 2 or more variables for visual inspection.


Types of information presented in tables: (a) frequency and frequency % (b) relative (proportional) frequency and relative frequency % (c) cumulative frequency and relative cumulative frequency %. The relative cumulative frequency % can be used to compare distributions.
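The four tabulated columns can be derived from the class frequencies alone, as this sketch shows (the input frequencies are illustrative):

```python
# Build the tabulation columns from a list of class frequencies:
# frequency, frequency %, cumulative frequency, cumulative frequency %.

def frequency_table(freqs):
    n = sum(freqs)
    cum = 0
    rows = []
    for f in freqs:
        cum += f                     # running (cumulative) frequency
        rows.append((f, 100.0 * f / n, cum, 100.0 * cum / n))
    return rows

for f, pct, cf, cpct in frequency_table([2, 5, 8, 4, 1]):
    print(f, round(pct, 1), cf, round(cpct, 1))
```

The last row always shows the total n and a cumulative 100%, a quick check on the arithmetic.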


Characteristics of an ideal table: Ideal tables are simple, easy to read, and correctly scaled. The layout of the table should make it easy to read and understand the numerical information. The table must be able to stand on its own, ie understandable without reference to the text. The table must have a title/heading that indicates its contents. Labeling must be complete and accurate: title, rows & columns, marginal & grand totals, and units of measurement. The field labels are in the margins of the table while the numerical data is in the cells in the body of the table. Footnotes may be used to explain the table.


Configurations: Contingency tables (2x2, 2xk, rxc), others


Cumulative frequency is the total frequency up to a particular class, ie how many observations lie below the given upper limit.


Relative frequency % curves are used to compare distributions with unequal numbers of observations.



Objective: The purpose of diagrams is to present a visual picture of the data. The diagram can reveal patterns in the data that are not obvious from numerical presentations. An ideal diagram must be self-explanatory ie able to be understood without reference to the text. It must be simple and not be crowded with too much information.


Dot plot


Vertical Line graph


Bar diagram: There are 2 types of bar diagram: the bar chart and the histogram. In the bar chart there are spaces between the bars; in the histogram the bars lie side by side. The bar chart is best for discrete, nominal, or ordinal data. The histogram is best for continuous data. A vertical line graph could also be considered a type of bar diagram. Bar charts and histograms are generated from frequency tables discussed above. They can show more than one variable, and using modern computer technology they can be drawn as three-dimensional figures. The area of the bar represents frequency, ie the area of the bar is proportional to the frequency. If the class intervals are equal, the height of the bar represents frequency. If the class intervals are unequal, the height of the bar does not represent frequency; frequency can only be computed from the area of the bar. We talk of frequency density (frequency/class width) for histograms with unequal class widths, for which the frequency density is used as the height of the bars. As the class intervals get smaller, the histogram takes on the shape of the frequency distribution. The special advantages of the bar diagram are: (a) it is more intuitive for the non-specialist (b) the area of the bar represents frequency.
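The rule that bar area, not height, carries the frequency can be made concrete with a small sketch; the class boundaries and frequencies are illustrative.

```python
# For histograms with unequal class widths, the bar height is the
# frequency density (frequency / class width), so that
# area = height * width recovers the frequency.

def frequency_densities(classes):
    """classes: list of (lower_limit, upper_limit, frequency)."""
    out = []
    for lcl, ucl, freq in classes:
        width = ucl - lcl
        out.append(freq / width)
    return out

# Middle class is twice as wide but has twice the frequency,
# so all three bars end up the same height.
classes = [(0, 10, 5), (10, 30, 10), (30, 40, 5)]
heights = frequency_densities(classes)
print(heights)  # [0.5, 0.5, 0.5]
# area = height * width recovers each frequency:
print([h * (u - l) for h, (l, u, f) in zip(heights, classes)])
```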


Line graphs/frequency polygon: A frequency polygon is the plot of the frequency against the mid-point of the class interval. The points are joined by straight lines to produce a frequency polygon; if smoothed, a frequency curve is produced. The line graph can be used to show the following: the frequency polygon, the cumulative frequency curve, the cumulative frequency % curve (also called the ogive), and moving averages. The line graph has two axes: the abscissa or x-axis is horizontal; the ordinate or y-axis is vertical. It is possible to plot several variables on the same graph. The graph can be smoothed manually or by computer. Time series and changes over time are shown easily by the line graph. Trends, cyclic and non-cyclic, are easy to represent. Line graphs are sometimes superior to other methods of indicating trend such as moving averages and linear regression. The frequency polygon has the following advantages: (a) it shows trends better (b) it is best for continuous data (c) a comparative frequency polygon can show more than 1 distribution on the same page. The cumulative frequency curve has the additional advantage that it can be used to show and compare different sets of data, because their respective medians, quartiles, and percentiles can be read off directly from the curve.


Scatter diagram / scatter-gram


Pie chart / pie diagram: These diagrams show relative frequency % converted into angles of a circle (called sector angles). The area of each sector is proportional to the frequency. The pie chart can be used to compare data sets in two ways: (a) the size of the circle indicates the overall total, so a small data set has a small pie chart; (b) the sector area or angle shows the distribution within a particular data set.

[Figure: three pie charts of increasing size representing data sets with totals F1, F2, and F3.]

The variation in overall size (area) is in the proportions F1 : F2 : F3, so the variation in radii is in the proportions sqrt (F1) : sqrt (F2) : sqrt (F3).
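Both rules can be checked in a short sketch: sector angles within one pie, and comparative radii for three hypothetical totals F1, F2, F3 (all figures illustrative):

```python
import math

# Sector angles for one data set: relative frequency % converted to degrees.
parts = {"A": 50, "B": 30, "C": 20}
total = sum(parts.values())
sector_angles = {k: 360 * v / total for k, v in parts.items()}

# Comparative pie charts: circle AREAS in proportion to the totals,
# so the radii are in the ratio sqrt(F1) : sqrt(F2) : sqrt(F3).
totals = [100, 400, 900]                 # F1, F2, F3 (hypothetical)
radii = [math.sqrt(t) for t in totals]
```

Here a data set nine times larger (900 vs 100) gets a radius only three times larger, because doubling a radius quadruples the area.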




Stem and leaf: the actual data values are written in place of the bars of a histogram. The stem and leaf gives a good idea of the shape of the distribution. It is easy to pick out the minimum and maximum values, and the modal class is easy to identify. Two variables can be shown on a stem and leaf. The class intervals must be equal, and a key must be provided.
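A stem-and-leaf display is easy to build by hand or in a few lines of code. A sketch with illustrative data, using the tens digit as the stem and the units digit as the leaf:

```python
# Stem-and-leaf display: stem = tens digit, leaf = units digit.
# Data are illustrative; the key (2|3 = 23) must always be shown.
data = [23, 25, 27, 31, 31, 34, 38, 42, 45, 45, 46, 51]

stems = {}
for x in sorted(data):
    stems.setdefault(x // 10, []).append(x % 10)

display = []
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    display.append(f"{stem} | {leaves}")

print("Key: 2|3 = 23")
print("\n".join(display))
```

The row with the most leaves is the modal class, and the first and last leaves give the minimum and maximum at a glance.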


Maps: different variables or values of one variable can be indicated by use of different shading, cross-hatching, dotting, and colors.




Graphs and bar charts described above generate distributions of various shapes. The shapes can be described in terms of modality, symmetry, skewness, kurtosis (based on the 4th moment), and the distance of the first and third quartiles from the mean.



A unimodal distribution is the most common in biology. A bimodal distribution has two peaks that are not necessarily of the same height. Distributions with more than two modes are unusual.



A perfectly symmetrical curve is bell-shaped and is centered on the mean.



Skew to the right (+ve skew) is more common. It shows that there are extreme values on the right of the center pulling the mean to the right, thus mode < median < mean.


Skew to the left (-ve skew) is less common. It shows that there are extreme values to the left of the center pulling the mean to the left, thus mean < median < mode.


Skewness is measured by Pearson's coefficient of skewness, which is (mean - mode) / standard deviation, approximately equal to 3(mean - median) / standard deviation. The following relations are derived from Pearson's coefficient of skewness: if mean > mode the skew is positive; if mean < mode the skew is negative; if mean = mode the distribution is symmetrical.
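Both forms of Pearson's coefficient can be computed with the standard library. A sketch on a small, illustrative, positively skewed sample:

```python
import statistics

# Pearson's coefficient of skewness in both forms (illustrative data).
data = [1, 2, 2, 2, 3, 3, 4, 5, 9]

mean = statistics.mean(data)       # pulled right by the extreme value 9
median = statistics.median(data)
mode = statistics.mode(data)
sd = statistics.stdev(data)

mode_skew = (mean - mode) / sd           # (mean - mode) / sd
median_skew = 3 * (mean - median) / sd   # 3(mean - median) / sd
```

Here mode (2) < median (3) < mean (about 3.44), so both forms of the coefficient come out positive, as expected for a right skew.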


Skewness can also be measured by the quartile coefficient of skewness, which is {(Q3 - Q2) - (Q2 - Q1)} / {Q3 - Q1}. In a symmetrical distribution the coefficient is 0, since (Q3 - Q2) = (Q2 - Q1). In a positive skew the coefficient is greater than zero, since (Q3 - Q2) > (Q2 - Q1). In a negative skew the coefficient is less than zero, because (Q3 - Q2) < (Q2 - Q1). The coefficient always lies between -1 and +1.
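A sketch of the quartile coefficient on an illustrative positively skewed sample, using the standard library's quantile function:

```python
import statistics

# Quartile coefficient of skewness: {(Q3 - Q2) - (Q2 - Q1)} / (Q3 - Q1).
# Data are illustrative and positively skewed.
data = [1, 2, 2, 3, 3, 4, 6, 9, 15]

q1, q2, q3 = statistics.quantiles(data, n=4)   # Q1, median, Q3
coeff = ((q3 - q2) - (q2 - q1)) / (q3 - q1)    # bounded between -1 and +1
```

Because the upper half of the data is stretched out (Q3 - Q2 exceeds Q2 - Q1), the coefficient comes out positive.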




Note that in moderately skewed distributions the median lies between the mean and the mode, and the following relation holds approximately: mean - mode = 3 (mean - median).



Leptokurtosis = a narrow sharp peak; platykurtosis = a wide flat hump. Kurtosis is based on the 4th moment about the mean: kurtosis = Σ(x - mean)^4 / (n s^4), where s is the standard deviation. A normal distribution has a kurtosis of 3.
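Kurtosis as the standardized 4th moment can be computed directly from the moment definitions. A sketch on illustrative data, using the population forms of the moments:

```python
import statistics

# Kurtosis = m4 / m2^2, the 4th moment about the mean divided by the
# squared variance (population forms; data are illustrative).
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = statistics.fmean(data)
m2 = sum((x - mean) ** 2 for x in data) / n   # 2nd moment (variance)
m4 = sum((x - mean) ** 4 for x in data) / n   # 4th moment about the mean
kurtosis = m4 / m2 ** 2
```

Values below 3 indicate a flatter (platykurtic) shape than the normal curve; values above 3 indicate a sharper (leptokurtic) peak.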



Some shapes are common and have special names and special significance: (a) the normal curve; (b) the S-curve, also called the ogive, which is the curve of the cumulative frequency %; (c) the reverse J-curve, seen in exponential decay; and (d) the uniform distribution, as in the presentation of cases per month of a disease that has no seasonal pattern.


The normal distribution is a probability density. When relative frequency is plotted, a curve with the shape of the normal distribution is generated. If the scales of the graph are transformed such that the total area under the curve is 1.0, a normal probability density is generated.


The following tests for normality can be carried out: (a) plotting a cumulative frequency curve, which should yield a gentle S-curve if the data are normal; (b) plotting the cumulative frequency % on normal probability paper, which yields a straight line for normally distributed data; (c) plotting the ordered data against normal scores (a normal plot), which also yields a straight line for normal data.



Poor labeling of scales


Distortion of scales


Omitting zero/origin. If plotting from zero is not possible for reasons of space, a broken line should be used to show discontinuity of the scale.

Professor Omar Hasan Kasule October 2000