Biostatistics Notes
Statistics:
A field of study concerned with the collection, organization, summarization, and analysis of data; and the drawing of inferences about a body of data when only a part of the data is observed.
Biostatistics
Definitions:
 It can be defined as the application of the mathematical tools used in statistics to the fields of biological sciences and medicine.
 Biostatistics is the branch of statistics responsible for the proper interpretation of scientific data generated in the biology, public health and other health sciences (i.e., the biomedical sciences).
 It is the branch of statistics concerned with mathematical facts and data related to biological events.
Role of biostatistics
 Identify and develop treatments for disease and estimate their effects.
 Identify risk factors for diseases.
 Design, monitor, analyze, interpret, and report results of clinical studies.
 Develop statistical methodologies to address questions arising from medical/public health data.
 Locate, define and measure the extent of disease.
 Improve the health of individuals and the community.
Types of Statistics
Descriptive Statistics
Statistical techniques used to organize, summarize and describe a particular set of measurements. It includes the construction of graphs, charts, and tables and the calculation of various descriptive measures such as averages, measures of variation, and percentiles.
Example: a census is descriptive statistics of a population. The information gathered concerning age, gender, race, income, etc. is compiled to describe the population at a given point in time.
Inferential Statistics
Inferential statistics use data gathered from a sample to make inferences about the larger population from which the sample was drawn.
OR
Inferential Statistics consists of methods for drawing and measuring the reliability of conclusions about a population based on information obtained from a sample of the population
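The contrast between the two types can be sketched in a few lines of Python; the blood-pressure figures below are invented purely for illustration:

```python
import math
import statistics

# Hypothetical sample: systolic blood pressure (mmHg) of 10 patients
sample = [118, 125, 130, 122, 128, 135, 120, 126, 132, 124]

# Descriptive statistics: organize and summarize the sample itself
mean = statistics.mean(sample)
sd = statistics.stdev(sample)

# Inferential statistics: use the sample to say something about the
# larger population, here a rough 95% confidence interval for the
# population mean (normal approximation, z = 1.96)
se = sd / math.sqrt(len(sample))
ci = (mean - 1.96 * se, mean + 1.96 * se)

print(mean)                            # 126.0
print(round(sd, 2))                    # 5.35
print(tuple(round(x, 1) for x in ci))  # (122.7, 129.3)
```

The mean and standard deviation only describe the 10 patients measured; the confidence interval is the inferential step, a statement about the population they were drawn from.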
Application of Biostatistics:
In Nursing:
 Biostatistics is said to be the tool of all health sciences and is called the “language of research”, because research findings rest on biostatistical techniques.
 With a knowledge of biostatistics, nurses and health care workers may be trained in the skilled application of statistical methods to the solution of problems encountered in public health and medicine.
 In nursing, biostatistics is an essential tool for determining the effectiveness of nursing procedures, based on records of clinical trials collected on such a scale and in such a form that valid conclusions can be drawn.
 Nurses and health workers gain a better understanding of nursing, health care and medical research journals if they have a knowledge of biostatistical methods and techniques.
 They collaborate with scientists in nearly every area related to health and have made major contributions to our understanding of AIDS, cancer, and immunology, as well as other areas.
 Further, they spend a considerable amount of time developing and evaluating the statistical methodology used in those projects.
 Biostatistics may prepare health worker/ nursing graduates for work in a wide variety of challenging positions in government, N.G.O’s, international organizations (WHO/UNICEF) and education.
 Health worker/ nursing graduates have found careers involving teaching, research, and consulting in such fields as medicine, public health, life sciences, and survey research.
 It forces the researcher to be definite and exact in his procedures and techniques.
 It enables the researcher to predict “how much” of a thing will happen under conditions he knows and has measured.
 It helps determine the time interval at which a patient should be given a medicine or any nursing action performed.
In Anatomy and Physiology
 To define what is normal or healthy in a population.
 To find the limits of normality in variables such as weight and pulse rate etc. in a population.
 To find the difference between means and proportions of normal at two places or in different periods.
 To find the correlation between two variables X and Y such as height and weight.
In Pharmacology
 To find the action of a drug.
 To compare the action of two different drugs or two successive dosages of the same drug.
 To find the relative potency of a new drug with respect to a standard drug.
In medicine
 To compare the efficacy of a particular drug, operation or line of treatment
 To find an association between two attributes such as cancer and smoking.
 To identify signs and symptoms of a disease or syndrome, e.g. cough in typhoid is found only by chance, whereas fever is found in almost every case. The proportional incidence of one symptom or another indicates whether it is a characteristic feature of the disease or not.
 To test usefulness of sera and vaccines in the field.
Example: percentage of attacks or deaths among the vaccinated subjects is compared with that among the unvaccinated ones to find whether the difference observed is statistically significant.
 Design and analysis of clinical trials in medicine
 By learning the methods in biostatistics a student learns to critically evaluate articles published in medical and dental journals or papers read in medical and dental conferences.
 To understand the basic methods of observation in clinical practice and research.
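The vaccine example above (comparing attack rates among vaccinated and unvaccinated subjects) can be sketched as a two-proportion z-test; the counts below are hypothetical, chosen only to illustrate the calculation:

```python
import math

# Hypothetical field-trial counts (illustrative only, not real data)
attacks_vacc, n_vacc = 10, 1000        # attacks among the vaccinated
attacks_unvacc, n_unvacc = 40, 1000    # attacks among the unvaccinated

p1 = attacks_vacc / n_vacc
p2 = attacks_unvacc / n_unvacc
p_pool = (attacks_vacc + attacks_unvacc) / (n_vacc + n_unvacc)

# Two-proportion z-test: is the observed difference in attack rates
# larger than chance alone would plausibly produce?
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_vacc + 1 / n_unvacc))
z = (p2 - p1) / se

print(round(z, 2))  # |z| > 1.96 -> significant at the 5% level
```

With these invented counts z is about 4.3, well beyond 1.96, so the difference would be judged statistically significant rather than a chance occurrence.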
In Clinical Medicine
 Documentation of medical history of diseases.
 Planning and conduct of clinical studies.
 Evaluating the merits of different procedures.
 In providing methods for definition of ‘normal’ and ‘abnormal’.
In Preventive Medicine
 To provide the magnitude of any health problem in the community.
 To find out the basic factors underlying ill health.
 To evaluate the health programs introduced in the community (success/failure).
 To introduce and promote health legislation.
In Community Medicine and Public Health
 To evaluate the efficacy of sera and vaccines in the field.
 In epidemiological studies the role of causative factors is statistically tested.
 To test whether the difference between two populations is real or a chance occurrence.
 To study the correlation between attributes in the same population.
 To identify the leading cause of disease or death.
 To measure the morbidity and mortality.
 To evaluate achievements of public health programs.
 To fix priorities in public health programs.
 To help promote health legislation and create administrative standards for oral health.
 It helps in compilation of data, drawing conclusions and making recommendations.
In Genetics
 Statistics and Human Genetics are twin subjects, having grown with the century together, and there are many connections between the two.
 Some fundamental statistical concepts, in particular the Analysis of Variance, first arose in human genetics, while statistical and probabilistic methods are now central to many questions in human genetics.
In Environmental Science
Environmental statistics covers
 Baseline studies to document the present state of an environment to provide background in case of unknown changes in the future.
 Targeted studies to describe the likely impact of changes being planned or of accidental occurrences.
 Regular monitoring to attempt to detect changes in the environment.
In Nutrition
 Nutritionists now have advanced methodologies for the analysis of DNA, RNA, proteins and low-molecular-weight metabolites, as well as access to bioinformatics databases.
 Biostatistics, which can be defined as the process of making scientific inferences from data that contain variability, has historically played an integral role in advancing nutritional sciences.
 Currently, in the era of systems biology statistics has become an increasingly important tool to quantitatively analyze information about biological macromolecules.
 Appropriate statistical analyses are expected to make an important contribution to solving major nutritionassociated problems in humans and animals (including obesity, diabetes, cardiovascular disease, cancer, ageing, and intrauterine growth retardation).
In Dental Science:
 To find the statistical difference between means of two groups. Ex: Mean plaque scores of two groups.
 To assess the state of oral health in the community and to determine the availability and utilization of dental care facilities.
 To indicate the basic factors underlying the state of oral health by diagnosing the community and find solutions to such problems.
 To determine success or failure of specific oral health care programs or to evaluate the program action.
 To promote oral health legislation and in creating administrative standards for oral health care delivery.
Application and Uses of Biostatistics as Figures
 Health and vital statistics are essential tools in demography, public health, medical practice and community services.
 Recording of vital events in birth and death registers and diseases in hospitals is like book keeping of the community, describing the incidence or prevalence of diseases, defects or deaths in a defined population.
 Such events properly recorded form the eyes and ears of a public health or medical administrator.
 What are the leading causes of death?
 What are the important causes of sickness?
 Is a particular disease rising or falling in severity and prevalence?
Logical Reasoning:
Logical reasoning is the process that uses arguments, statements, premises and axioms to determine whether a statement is true or false.
Inductive reasoning:
It is the process of developing generalizations from specific observations. Inductive reasoning makes broad generalizations from specific observations. Even if all of the premises are true in a statement, inductive reasoning allows for the conclusion to be false.
Example: “Harold is a grandfather. Harold is bald. Therefore, all grandfathers are bald.” The conclusion does not follow logically from the statements.
Deductive reasoning:
Deduction is a method for applying a general rule (major premise) in specific situations (minor premise) from which conclusions can be drawn (general to specific). In deductive reasoning no new information is provided; it only rearranges what is already known into a new statement or conclusion.
Example:
Major premise: All humans are mortal
Minor premise: Socrates is human
Conclusion: Socrates is mortal
 Inductive reasoning has its place in the scientific method. Scientists use it to form hypotheses and theories. Deductive reasoning allows them to apply the theories to specific situations.
Abductive Reasoning
Another form of reasoning is abductive reasoning. It is based on making and testing hypotheses using the best information available. It often entails making an educated guess after observing a phenomenon for which there is no clear explanation. Abductive reasoning is useful for forming hypotheses to be tested. Abductive reasoning is often used by doctors who make a diagnosis based on test results and by jurors who make decisions based on the evidence presented to them.
Abductive reasoning is the third form of logical reasoning and is somewhat similar to inductive reasoning, since the conclusions drawn are based on probabilities. In abductive reasoning it is presumed that the most plausible conclusion is also the correct one.
Example:
Major premise: The jar is filled with yellow marbles
Minor premise: I have a yellow marble in my hand
Conclusion: The yellow marble was taken out of the jar
The abductive reasoning example shows that the conclusion might seem obvious; however, it is based purely on the most plausible reasoning. This type of logical reasoning is mostly used within the field of science and research.
MEASUREMENT:
It may be defined as the assignment of numbers to objects or events according to a set of rules.
Scale of Measurements
Scales of measurement refer to ways in which variables/numbers are defined and categorized.
Each scale of measurement has certain properties which in turn determine the appropriateness for use of certain statistical analyses.
Types of Scale of measurements:
The four scales of measurement are:
 Nominal
 Ordinal
 Interval
 Ratio
Nominal:
The lowest measurement scale is the nominal scale. As the name implies, it consists of “naming” observations or classifying them into various mutually exclusive and collectively exhaustive categories. The categories have no basis for ordering.
Example:
 diagnostic categories
 sex of the participant
 classification based on discrete characteristics (e.g., hair color)
 Group affiliation (e.g., Republican, Democrat, Boy Scout, etc.)
 the town people live in
 a person’s name
 an arbitrary identification, including identification numbers that are arbitrary
 menu items selected
 any yes/no distinctions
 most forms of classification (species of animals or type of tree)
 location of damage in the brain
Ordinal:
Whenever observations are not only different from category to category but can be ranked according to some criterion, they are said to be measured on an ordinal scale. However, we have no way of knowing how different the categories are from one another.
Example:
 any rank ordering
 class ranks
 Socioeconomic status as low, medium, or high.
 Pain; mild, moderate, severe
Interval:
Interval scales are very similar to standard numbering scales except that they do not have a true zero. That means that the distance between successive numbers is equal, but that the number zero does NOT mean that there is none of the property being measured.
Example:
Temperature measured in degrees Fahrenheit or Celsius. The unit of measurement is the degree, and the point of comparison is the arbitrarily chosen “zero degrees,” which does not indicate a lack of heat.
Ratio:
Ratio scales are the easiest to understand because they are numbers as we usually think of them. The distances between adjacent numbers are equal on a ratio scale and the score of zero on the ratio scale means that there is none of whatever is being measured. Most ratio scales are counts of things.
The ratio scale of measurement is similar to the interval scale in that it also represents quantity and has equality of units. However, this scale also has an absolute zero (no numbers exist below zero). Very often, physical measures will represent ratio data (for example, height and weight). If one is measuring the length of a piece of wood in centimeters, there is quantity, equal units, and that measure cannot go below zero centimeters. A negative length is not possible.
 time to complete a task
 number of responses given in a specified time period
 weight of an object
 size of an object
 number of objects detected
 number of errors made in a specified time period
 proportion of responses in a specified category
Comparison of scales of measurement:
Scale     Indicates Difference  Indicates Direction of Difference  Indicates Amount of Difference  Absolute Zero
Nominal   X
Ordinal   X                     X
Interval  X                     X                                  X
Ratio     X                     X                                  X                               X
Parametric & Nonparametric Statistics:
Interval and Ratio data are sometimes referred to as parametric and nominal and Ordinal data are referred to as nonparametric. Parametric means that it meets certain requirements with respect to parameters of the population (for example, the data will be normal–the distribution parallels the normal or bell curve). In addition, it means that numbers can be added, subtracted, multiplied, and divided. Parametric data are analyzed using statistical techniques identified as Parametric Statistics. As a rule, there are more statistical technique options for the analysis of parametric data and parametric statistics are considered more powerful than nonparametric statistics. Nonparametric data are lacking those same parameters and cannot be added, subtracted, multiplied, and divided. For example, it does not make sense to add Social Security numbers to get a third person. Nonparametric data are analyzed by using Nonparametric Statistics.
PRELIMINARY CONCEPTS:
DATA
The information given in quantitative or qualitative form regarding a particular characteristic is called data. It is the raw material of statistics.
We may define data as facts and figures. Figures result from the process of counting or from taking a measurement.
For example:
When a hospital administrator counts the number of patients (counting).
When a nurse weighs a patient (measurement)
Types of data:
 Primary Data
 Secondary Data
Primary data:
The data which are collected directly from the field of enquiry for a specific purpose. These are raw data, in original form, collected directly from the population.
Secondary data:
If the data are collected by some other agency, or have passed through other hands, they are called secondary data. (OR)
The data presented in an arranged (particular) form so as to serve one’s purpose are called secondary data.
Investigator:
The person who collects the data is known as investigator. He/ she must be:
 Intelligent, reliable and responsible.
 Properly trained and polite.
 Experienced, tactful and well known about the object he/she is dealing with.
Characteristics of Data:
 Quantitative/Measurable/Variables
  Discrete
  Continuous
 Qualitative/Non-measurable/Attributes
  Nominal
  Ordinal
Data set:
The data collected for a particular purpose is called data set.
Outlier:
An observation point that is distant from other observations in a given set of data.
An outlier is an observation whose value, x, either exceeds the value of the third quartile by a magnitude greater than 1.5(IQR) or is less than the value of the first quartile by a magnitude greater than 1.5(IQR).
That is, an observation with x > Q3 + 1.5(IQR) or x < Q1 − 1.5(IQR) is called an outlier.
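The 1.5(IQR) rule above can be sketched in Python; the data set is invented for illustration, and note that `statistics.quantiles` uses the exclusive method by default, so other quartile conventions may flag slightly different values:

```python
import statistics

def iqr_outliers(data):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(data))  # [102]
```

Here 102 lies far above Q3 + 1.5(IQR), so it is the only value flagged as an outlier.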
Variable:
It is a characteristic that can take different values for the elements in the data set.
Random Variable:
When the values obtained arise as a result of chance factors, so that they cannot be exactly predicted in advance, the variable is called a random variable.
Types of variables;
Variable are of two types depending upon the form of the characteristics.
 Quantitative Variables:
A quantitative variable is one whose characteristic of interest is measurable or can be expressed numerically.
Example: age, weight, height, income, length etc.
Types of Quantitative Variables:
Quantitative variables are of two types:
 Discrete Variables (Discontinuous/meristic)
A discrete variable is characterized by gaps or interruptions in the values that it can assume. These gaps or interruptions indicate the absence of values between particular values that the variable can assume.
These are quantities which can be measured only in whole integral values; a discrete variable does not take fractional values. They assume a finite or countable number of possible values and are usually obtained by counting.
Example:
 The number of daily admissions to a general hospital.
 Number of students in a class.
 Number of patients in a ward.
The data which are described by discrete variables are called discrete data.
 Continuous variable:
These are quantities which can take any value in a specified range, both integral and fractional. They assume an infinite number of possible values and are usually obtained by measurement.
Example: Height, weight etc.
The data described by continuous variables are called continuous data.
 Qualitative variable:
These are non-measurable characteristics which cannot be expressed numerically in terms of some unit; they are also known as attributes. A qualitative variable is one whose values are non-numerical.
Example: color, sex, intelligence, Religion, Nationality, Illiteracy etc.
Types of Qualitative variable:
 Nominal Variable:
A categorical measurement expressed not in terms of numbers, but rather by means of a natural language description; there is no natural ordering of the categories.
The data which are described by nominal variables are called nominal data.
Examples: gender, race, religion etc.
 Ordinal Variable:
A categorical measurement expressed not in terms of numbers, but rather by means of a natural language description, where the categories are ordered. The distance between these categories cannot be measured.
Population:
The collection of all observations (elements) relating to a characteristic is called statistical population or simply population.
(OR)
The collection of all individuals or items under consideration in a statistical study.
Populations may be finite or infinite.
 Finite:
If a population consists of fixed number of values; it is said to be finite.
Example: Number of days in a week.
 Infinite:
If a population consists of an endless succession of values, it is said to be infinite
Example: Number of animals in ocean.
Parameter:
Numerical descriptive measures corresponding to populations are called parameters.
Target Population:
The target population is the population about which one wishes to make an inference.
Sample:
It is a relatively small group of selected number of individuals or objects drawn from a particular population and is used to throw light on the population characteristics.
(OR)
The observed sets of measurements that are subsets of a corresponding population
Statistics
Numerical descriptive measures corresponding to samples are called statistics.
Random Sample:
It is a sample chosen in a very specific way and has been selected in such a way that every element in the population has an equal opportunity of being included in the sample.
Statistical Error:
The extent to which the observed value of a quantity exceeds the true value.
Error = Observed Value – True Value
Types of Statistical Error:
Statistical error may be classified as
 Biased error: it arises due to personal prejudices or bias of the investigator or informant.
 Unbiased error: it enters into statistical enquiry due to chance causes.
Array:
The presentation of data in ascending order of magnitude is called array.
PRESENTATION OF DATA OR INFORMATION
Data obtained by the investigator is irregularly documented and is unorganized. This unorganized data is called raw data. It is organized in a specific sequence and is presented in such a way as to make it easily understandable.
Classification:
It is the process of arranging the raw data under different categories or classes according to some common characteristics possessed by an individual member.
Examples:
Patients in hospitals are classified according to disease.
Presentation of Statistical Data:
 Textual presentation
 Tabular presentation
 Graphical presentation
Textual presentation:
 Numerical data presented in a descriptive form are called textual presentation.
 It is lengthy, and some words may repeat several times in the text.
 It becomes difficult to grasp salient points in a textual presentation.
Tabular presentation:
 The logical and systematic presentation of numerical data in rows and columns designed to simplify the presentation and facilitate comparison is termed as tabulation.
 Tabulation is thus a form of presenting quantitative data in condensed and concise form so that the numerical figures are capable of easy and quick reception by the eyes.
 It is more convenient than textual presentation.
Parts of a Table
 Table number: A table should be numbered for easy identification and reference in future. The table number may be given either in the centre or side of the table but above the top of the title of the table. If the number of columns in a table is large, then these can also be numbered so that easy reference to these is possible.
 Title of the table: Each table must have a brief, self-explanatory, and complete title which can
 Indicate the nature of data contained.
 Explain the locality (i.e., geographical or physical) of data covered.
 Indicate the time (or period) of data obtained.
 Contain the source of the data to indicate the authority for the data, as a means of verification and as a reference. The source is always placed below the table.
 Caption and stubs: The headings for columns and rows are called caption and stub, respectively. They must be clear and concise.
 Body: The body of the table should contain the numerical information. The numerical information is arranged according to the descriptions given for each column and row.
 Prefatory or head note: If needed, a prefatory note is given just below the title for its further description in a prominent type. It is usually enclosed in brackets and is about the unit of measurement.
 Footnotes: Anything written below the table is called a footnote. It is written to further clarify either the title captions or stubs. For example, if the data described in the table pertain to profits earned by a company, then the footnote may define whether it is profit before tax or after tax. There are various ways of identifying footnotes:
 Numbering footnotes consecutively with small number 1, 2, 3, …, or letters a, b, c, …, or star *, **, …
 Sometimes symbols like @ or $ are also used to identify footnotes.
 Source notes: The source note is given at the end of the table, indicating the source from which the information has been taken. It includes information about the compiling agency, publication, etc.
A blank model table is given below:
—THE TITLE—
—Prefatory Notes—
—Box Head—  
—Row Captions—  —Column Captions—  
—Stub Entries—  —The Body— 
Foot Notes…
Source Notes…
Types of tabulation:
There are two types of tabulation:
 Simple tabulation: it contains data in respect of one characteristic only.
 Complex tabulation: it contains data on more than one characteristic simultaneously.
Example:
Simple tabulation: No. of students in three classes of B.S.N
Name of Class  No. of students 
B.S.N I  42 
B.S.N II  48 
B.S.N III  50 
Complex tabulation: No. of students in three classes of B.S.N
Name of Class  No. of students (Male)  No. of students (Female)  Total
B.S.N I        12                      30                        42
B.S.N II       08                      40                        48
B.S.N III      05                      45                        50
Contingency Table
A contingency table is an arrangement of data in a two-way classification. The data are sorted into cells, and the count for each cell is reported. The contingency table involves two factors (or variables), and a common question concerning such tables is whether the data indicate that the two variables are independent or dependent.
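The independence question for a contingency table is commonly examined with a chi-square test; a minimal sketch with made-up smoking/disease counts:

```python
# Hypothetical 2x2 contingency table: smoking status vs. disease
#                 diseased   not diseased
# smokers             30          70
# non-smokers         10          90
table = [[30, 70], [10, 90]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected,
# where "expected" assumes the two variables are independent
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (observed - expected) ** 2 / expected

print(chi2)  # 12.5; compare with 3.84 (df = 1, 5% level)
```

For these invented counts the statistic far exceeds the 5% critical value, so the data would suggest the two variables are dependent.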
Observation:
The values of variable obtained by observations are termed as observed values or observation.
Frequency:
The frequency (f) of a particular observation is the number of times the observation occurs in the data.
Frequency distribution:
Frequency distribution is a statistical table which shows the values of variable arranged in order of magnitude either individually or in groups and also the corresponding frequencies side by side.
Types of frequency distribution:
Frequency distribution tables can be used for both categorical and numeric variables. Continuous variables should only be used with class intervals. A frequency distribution is a summary of how often different scores occur within a sample of scores.
Frequency Distribution
 Qualitative Frequency Distribution
 Quantitative Frequency Distribution
  Simple/Ungrouped Frequency Distribution (Range ≤ 20 digits)
  Grouped Frequency Distribution (Range > 20 digits)
Frequency distribution table of Non-Measurable (Qualitative) Data
Example:
Suppose that you are collecting data on the blood groups of college students. After conducting a survey of 30 of your classmates, you are left with the following set of observations: A, A, A, O, AB, B, AB, AB, AB, O, O, O, B, A, B, AB, A, B, AB, O, A, B, AB, AB, B, AB, A, A, A, AB
In order to make sense of this information, you need to find a way to organize the data. A frequency distribution is commonly used to categorize information so that it can be interpreted quickly in a visual way. In our example above, the blood groups serve as the categories and the occurrences of each group are then tallied.
Example of a Frequency Distribution
Blood Group  Tally Marks      Frequency
O            │││││            5
A            │││││ ││││       9
B            │││││ │          6
AB           │││││ │││││      10
Total                         30
Frequency distribution table of Measurable (Quantitative) Data.
Let’s suppose that you are collecting data on how many hours of sleep college students get each night. After conducting a survey of 30 of your classmates, you are left with the following set of scores:
7, 5, 8, 9, 4, 10, 7, 9, 9, 6, 5, 11, 6, 5, 9, 10, 8, 6, 9, 7, 9, 8, 4, 7, 8, 7, 6, 10, 4, 8
In order to make sense of this information, you need to find a way to organize the data. A frequency distribution is commonly used to categorize information so that it can be interpreted quickly in a visual way. In our example above, the number of hours of sleep serves as the categories and the occurrences of each number are then tallied.
Example of a Frequency Distribution
Hours of Sleep  Tally Marks  Frequency
4               │││          3
5               │││          3
6               ││││         4
7               │││││        5
8               │││││        5
9               │││││ │      6
10              │││          3
11              │            1
Total                        30
Constructing a Simple frequency distribution table
 Construct a table with three columns.
 Write all observations in ascending order in the first column.
 Select the first item and see which observation it falls under; draw a small tally mark (/) against it in the second column and also tick (✓) the concerned item. Continue this way until the last item is ticked. If an element is reported many times, mark a separate tally for each occurrence.
 Tallies are marked in sets of five; the fifth tally in each set is marked across the other four, i.e. ////.
 Count the number of tally marks for each observation and write it in the frequency column.
Example:
A survey was taken on Maple Avenue. In each of 20 homes, people were asked how many cars were registered to their households. The results were recorded as follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Frequency table for the number of cars registered in each household
Number of cars (x)  Tally     Frequency (f)
0                   ││││      4
1                   │││││ │   6
2                   │││││     5
3                   │││       3
4                   ││        2
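The tallying steps above can be sketched with Python's `collections.Counter`, using the same Maple Avenue data; the `/` tallies here are a crude stand-in for hand-drawn marks:

```python
from collections import Counter

# Cars registered per household (the Maple Avenue data above)
cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

freq = Counter(cars)                 # value -> frequency
for value in sorted(freq):
    tally = "/" * freq[value]        # crude tally marks
    print(f"{value}  {tally:<6} {freq[value]}")
```

The counts produced (4, 6, 5, 3, 2) match the frequency column of the table, and they total 20, the number of homes surveyed.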
Grouped Frequency Distribution
Terms:
Class interval
The size, width or length of a class interval is the difference between the upper (or lower) limits of any two consecutive classes. It is denoted by ‘h’.
If a variable takes a large number of values (Range > 20), then it is easier to present and handle the data by grouping the values into class intervals. Continuous variables are more likely to be presented in class intervals, while discrete variables may or may not be grouped. The class intervals should be contiguous and non-overlapping, so that each value in the set of observations can be placed in one, and only one, of the intervals.
Frequency:
The frequency of a class interval is the number of observations that occur in a particular predefined interval.
Endpoint:
The endpoints of a class interval are the lowest and highest values that a variable can take.
Class width:
Class width is the difference between the lower endpoint of an interval and the lower endpoint of the next interval. It is denoted by ‘w’.
OR
It is the range or length of a class interval or difference between the upper and lower class boundaries.
Number of classes:
There is no hard and fast rule for finding the exact number of classes. A commonly followed rule of thumb states that there should be no fewer than five intervals and no more than 15. If there are fewer than five intervals, the data have been summarized too much and the information they contain has been lost. If there are more than 15 intervals, the data have not been summarized enough. It is also important to make sure that the class intervals are mutually exclusive.
Sturges’ formula
Those who need more specific guidance in deciding how many class intervals to employ may use a formula given by Sturges (1). This formula gives k = 1 + 3.322 log10(n), where k stands for the number of class intervals and n is the number of values in the data set under consideration. The answer obtained by applying Sturges’ rule should not be regarded as final, but should be considered as a guide only.
The number of class intervals specified by the rule should be increased or decreased for convenience and clear presentation.
No. of classes (k) = 1 + 3.322 log10(n)
Range = Max Value – Min Value
Size of Class (h) = Range / No. of classes
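The three formulas above can be sketched in Python; the data values below are hypothetical, chosen only to illustrate the arithmetic, and the rounding choices (round for k, ceil for h) are one reasonable convention, not the only one.

```python
import math

# Hypothetical observations for illustration.
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

n = len(data)                            # number of observations
k = round(1 + 3.322 * math.log10(n))     # Sturges' rule: k = 1 + 3.322 log10(n)
data_range = max(data) - min(data)       # Range = Max Value - Min Value
h = math.ceil(data_range / k)            # Size of class = Range / No. of classes

print(k, data_range, h)
```

As Sturges’s rule itself warns, k should then be adjusted up or down for convenient presentation.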
Class limits:
The two numbers used to specify the limits of a class interval for the purpose of tallying the original observations into the various classes, are called class limits.
 The smallest of the pair is known as lower class limit i.e. The smaller number in each class is the lower class limit (l_{1})
 The largest of the pair is called upper class limit. i.e. the larger number is the upper class limits (l_{2}) of the class.
Class mark or mid – point of a class
 It is the mid-value of a class or class interval, exactly at the middle of the class or class interval.
 It lies half way between the class limits or between the class boundaries.
 It is used as representative value of the class interval for the calculation of mean, standard deviation, mean deviation etc.
 It is the average of the lower and upper class limits.
Class mark = (Lower class limit + Upper class limit) / 2
Midpoint (x) = (l_{1} + l_{2}) / 2
Class boundaries (or exact class limits)
These are the precise points separating the class from adjoining classes. A class boundary is always located midway between the upper limit of the class and lower limit of the next higher class.
Construction of class boundaries:
Steps in the construction of class boundaries from the class limits are;
 Find the difference between the lower limit of a class and the upper limit of the preceding class, denoted by d.
 Subtract d/2 from lower limit of the class to get lower boundary of that class and add d/2 to the upper limit to get upper boundary of the class.
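The two steps above can be sketched in Python. The inclusive class limits below are taken from the histogram example later in these notes (30–39, 40–49, …), used here only as illustration.

```python
# Inclusive class limits (lower, upper) for each class.
limits = [(30, 39), (40, 49), (50, 59), (60, 69), (70, 79), (80, 89)]

# Step 1: d = lower limit of a class minus the upper limit of the
# preceding class (here 40 - 39 = 1).
d = limits[1][0] - limits[0][1]

# Step 2: subtract d/2 from each lower limit and add d/2 to each
# upper limit to get the class boundaries.
boundaries = [(lo - d / 2, up + d / 2) for lo, up in limits]
print(boundaries)
```

The first pair comes out as (29.5, 39.5), matching the class boundaries shown in the histogram table below.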
Percentage Frequency:
It represents the relative percentage of total cases in any class interval. It is obtained by dividing the number of cases in the class interval by the total number of cases and then multiplying by 100.
Percentage frequency of class = (Frequency of the class / Total Frequency) × 100
Relative Frequency:
 It is the ratio of the frequency of the class to the total frequency.
 It is not expressed in percentage.
 Relative frequencies are used to compare two or more frequency distributions or two or more items in the same frequency distribution.
Relative frequency = Frequency of the class / Total Frequency
Cumulative Relative Frequency:
 Cumulative frequency corresponding to a class is the sum of all the frequencies up to and including that class.
 It is obtained by adding the frequency of that class to the frequencies of all the previous classes.
 It gives the proportion of individuals having a measurement less than or equal to the upper boundary of the class interval.
Frequency Density:
Frequency density of a class or class interval is its frequency per unit width. It shows the concentration of frequency in a class.
It is used in drawing histogram when the classes are of unequal width.
Frequency density = Class frequency / Width of the class
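The three ratios defined above can be sketched together in Python, using the frequencies from the histogram table later in these notes (class width 10, boundaries 29.5–39.5, 39.5–49.5, …) purely as illustration.

```python
# Frequencies of six classes of equal width 10.
freqs = [11, 46, 70, 45, 16, 1]
width = 10
total = sum(freqs)                              # total frequency

rel = [f / total for f in freqs]                # relative frequency (ratio)
pct = [f / total * 100 for f in freqs]          # percentage frequency
density = [f / width for f in freqs]            # frequency per unit width

print(total, rel[2], density[2])
```

Note that the relative frequencies always sum to 1, and the percentage frequencies to 100.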
Construction of Grouped Frequency Distribution:
 Construct a table with three columns.
 Determine the range, i.e. the difference b/w the highest and the lowest observation.
 Decide about the number of classes or the length of class interval (h), using the working rule:
Number of classes = range / h
 Number of classes should be b/w 5 and 15.
 Determine the starting point, and the remaining class limits. If several values of the variable are to be included in one class, the class limits should be designated in terms of “this amount to that amount”. Thus, if h is 5 we have to start with one of the values 0, 5, 10, 15, …, and if h is 3, we have to start with one of the values 0, 3, 6, 9, 12, …, etc.
 Distribute the data into appropriate classes by Tally method.
Select the first item and see in which class it falls, draw a small tally mark (/) against that class and also tick off the concerned item. Continue this way until the last item is ticked. If some element is reported many times, or some elements fall in the same class, mark a separate tally mark for each.
These tallies are marked in sets of five; the fifth tally in each set is marked across the other four. i.e. ////
 Count the number of tally marks for each class and write it in the frequency column.
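The tallying procedure above can be sketched in Python; the raw observations and the class limits here are hypothetical, for illustration only.

```python
# Hypothetical raw observations.
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]
# Inclusive class limits (lower, upper); contiguous, non-overlapping.
classes = [(40, 59), (60, 79), (80, 99), (100, 119)]

freq = {c: 0 for c in classes}
for x in data:
    for lo, up in classes:
        if lo <= x <= up:       # each value falls in one, and only one, class
            freq[(lo, up)] += 1
            break

print(freq)
```

The frequencies must add back up to the number of observations, which is a quick check on the tally.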
GRAPHICAL PRESENTATION OF DATA
The presentation of quantitative data by graphs and charts is termed graphical presentation.
It gives the reader a nice overview of the essential features of the data. Graphs are designed to give an intuitive feeling of the data at a glance.
Therefore graphs:
 Should be self-explanatory
 Must have title
 Must have labeled axes
 Should mention unit of observation
 Should be simple & clean
Advantages of Graph Representation
 It is easy to read
 It is easy to understand by all.
 It shows relationship between two or more sets of observations.
 It is universally applicable
 It is attractive in representation
 It helps in proper estimation, evaluation, and interpretation of the characteristics of items and individuals
 It has more lasting effect on brain
 It simplifies complex data
 It indicates trend, and therefore, helps in forecasting.
Disadvantages of Graph Representation
 It is time consuming.
 Finer details may be lost during preparation
 It represents only approximate values.
Graphical Presentation of Statistical data:
 Grouped and ungrouped data may be presented as:
Line Graphs
 A line chart or line graph is a type of chart which displays information as a series of data points called ‘markers’ connected by straight line segments.
 These are drawn on plane paper by plotting the data concerning one variable on the horizontal x-axis (abscissa) and the other variable on the y-axis (ordinate), which intersect at a point called the origin.
 With the help of such graphs the effect of one variable upon another variable during an experimental study may be clearly demonstrated.
 According to the data, for corresponding X, Y values (in pairs) we find a point on the graph paper. The points thus generated are then joined by straight line segments successively. The figure thus formed is called a line diagram or graph.
Example
In the experimental sciences, data collected from experiments are often visualized by a graph. For example, if one were to collect data on the speed of a body at certain points in time, one could visualize the data by a data table such as the following:
Elapsed Time (s)  Speed (m s^{−1}) 
0  0 
1  3 
2  7 
3  12 
4  20 
5  30 
6  45 
Graph of Speed Vs Time
Bar Diagram:
 A bar diagram is a graph on which the data are represented in the form of bar and it is useful in comparing qualitative or quantitative data of discrete type.
 It consists of a number of equally spaced rectangular bars of equal width originating from a horizontal base line (x-axis).
 The length of the bar is proportional to the value it represents. It should be seen that the bars are neither too short nor too long.
 They are shaded or coloured suitably.
 The bars may be vertical or horizontal in a bar diagram. If the bars are placed horizontally, it is called a horizontal bar diagram; when bars are placed vertically, it is called a vertical bar diagram.
 It is used with discrete qualitative variables and provides a visual comparison of figures.
Types of Bar Diagram
There are three types of bar diagram
 Simple bar diagram
 Multiple or grouped bar diagram
 Component or subdivided bar diagram.
Simple bar chart:
Represent one type of data (variable).
Example:
Following is an example of bar chart which shows educational status of certain area.
Multiple Bar charts:
Such charts are useful for direct comparison between two or more sets of data. The technique of drawing such a chart is same as that of a single bar chart with a difference that each set of data is represented in different shades or colors on the same scale. An index explaining shades or colors must be given.
Example:
Draw a multiple bar chart to represent the import and export of Canada (values in $) for the years 1991 to 1995.
Years  Imports  Exports 
1991  7930  4260 
1992  8850  5225 
1993  9780  6150 
1994  11720  7340 
1995  12150  8145 
Multiple bar chart showing the imports and exports of Canada from 1991 – 1995.
Component bar chart:
Subdivided or component bar chart is used to represent data in which the total magnitude is divided into different parts or components.
In this diagram, first we make simple bars for each class taking the total magnitude in that class, and then divide these simple bars into parts in the ratio of the various components. This type of diagram shows the variation in different components within each class as well as between different classes. Different shades or colours are used to distinguish the various components and a key should be given with the diagram. It is also known as a stacked chart.
Example:
The table below shows the quantity in hundred kgs of Wheat, Barley and Oats produced on a certain farm during the years 1991 to 1994.
Years  Wheat  Barley  Oats  Total 
1991  34  18  27  79 
1992  43  14  24  81 
1993  43  16  27  86 
1994  45  13  34  92 
Pie Chart
 It is a circular graph whose area is subdivided into sectors by radii in such a way that the areas of the sectors are proportional to the angles at the centre.
 The area of the circle represents the total value and the different sectors of the circle represent the different parts.
 It is generally used for comparing the relation between the various components of a value and between components and the total value.
 The data are expressed as percentages. Each component is expressed as a percentage of the total value.
Working procedure:
 Plot a circle of an appropriate size. The total angle of a circle is 360^{o}.
 Convert the given value of the components of an item in percentage of the total value of the item.
Angle of sector = (Value of component / Total value of item) × 360
 In the pie chart the largest sector remains at the top and the others follow in sequence running clockwise.
 Measure with a protractor the points on the circle representing the size of each sector. Label each sector for identification.
Example:
A family’s weekly expenditure on its house mortgage (finance), food and fuel is as follows: Draw pie chart:
Expense  $ 
Mortgage  300 
Food  225 
Fuel  75 
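The working procedure above reduces to converting each component into an angle out of 360°; a minimal Python sketch of that conversion for the family-expenditure table:

```python
# Weekly expenses from the example table.
expenses = {"Mortgage": 300, "Food": 225, "Fuel": 75}
total = sum(expenses.values())

# Angle of sector = (value of component / total value) * 360 degrees.
angles = {item: value / total * 360 for item, value in expenses.items()}
print(angles)
```

The angles necessarily sum to 360°, which is a useful check before drawing the sectors with a protractor.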
Histogram
It is the most common form of diagrammatic representation of a grouped frequency distribution of both continuous and discontinuous type, in which the frequencies are represented in the form of bars. The area, and more especially the height, of each rectangle is proportional to the frequency.
Working Procedure:
 Convert the data from inclusive to exclusive series. (Make class boundaries if classes do not coincide; discontinuous class intervals.)
 Take class intervals (class boundaries) and plot them on the x-axis.
 Take two extra class intervals, one below and one above the given grouped intervals.
 Plot separate rectangles for each class interval. The base of each rectangle is the width of the class interval and the height is the respective frequency of that class.
 Frequencies are plotted on the y-axis.
Age  Class Boundaries  Frequency 
30–39  29.5–39.5  11 
40–49  39.5–49.5  46 
50–59  49.5–59.5  70 
60–69  59.5–69.5  45 
70–79  69.5–79.5  16 
80–89  79.5–89.5  1 
Frequency Polygon:
It is an area diagram represented in the form of a curve, obtained by joining the middle points of the tops of the rectangles in a histogram, or by joining the midpoints of the class intervals at the heights of their frequencies by straight lines.
Cumulative Frequency Polygon (Ogive)
The graphical representation of a cumulative frequency distribution, in which the cumulative frequencies are plotted against the corresponding class boundaries and the successive points are joined by straight lines, is known as an ogive or cumulative frequency polygon.
Working procedure:
 The upper limits of the classes are represented along the x-axis.
 The cumulative frequency of a particular class is taken along the y-axis.
Class interval  Class Boundaries  f  c.f 
151 – 155  150.5 – 155.5  8  8 
156 – 160  155.5 – 160.5  7  15 
161 – 165  160.5 – 165.5  15  30 
166 – 170  165.5 – 170.5  9  39 
171 – 175  170.5 – 175.5  9  48 
176 – 180  175.5 – 180.5  2  50 
 The points corresponding to cumulative frequency at each upper limit of the classes are joined by a free hand curve.
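The cumulative frequencies plotted in an ogive are just running totals of the class frequencies; a minimal Python sketch using the table above (f = 8, 7, 15, 9, 9, 2):

```python
# Class frequencies and upper class boundaries from the table.
freqs = [8, 7, 15, 9, 9, 2]
upper_boundaries = [155.5, 160.5, 165.5, 170.5, 175.5, 180.5]

cf, running = [], 0
for f in freqs:              # add each class frequency to the running total
    running += f
    cf.append(running)

# Each (upper boundary, c.f) pair is one point on the ogive.
print(list(zip(upper_boundaries, cf)))
```

The last cumulative frequency equals the total frequency (here 50), which is a quick consistency check.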
Stem-and-Leaf Displays
A stem-and-leaf display bears a strong resemblance to a histogram and serves the same purpose. It provides information regarding the range of the data set, shows the location of the highest concentration of measurements, and reveals the presence or absence of symmetry. An advantage of the stem-and-leaf display over the histogram is the fact that it preserves the information contained in the individual measurements.
Another advantage of stem-and-leaf displays is the fact that they can be constructed during the tallying process, so the intermediate step of preparing an ordered array is eliminated.
Working procedure:
 To construct a stem-and-leaf display we partition each measurement into two parts.
 The first part is called the stem, and the second part is called the leaf.
 The stem consists of one or more of the initial digits of the measurement, and the leaf is composed of one or more of the remaining digits.
 The stems form an ordered column with the smallest stem at the top and the largest at the bottom. We include in the stem column all stems within the range of the data even when a measurement with that stem is not in the data set.
 The rows of the display contain the leaves, ordered and listed to the right of their respective stems.
 The stems are separated from their leaves by a vertical line.
Example:
The following example illustrates the construction of a stem-and-leaf display.
44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106
Stem Leaves
4 4, 6, 7, 9
5
6 3, 4, 6, 8, 8
7 2, 2, 5, 6
8 1, 4, 8
9
10 6
Key: 63=63
Leaf unit: 1.0
Stem unit: 10.0
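The working procedure above can be sketched in Python for the same data, with the stem as the tens digit and the leaf as the units digit (stem unit 10, leaf unit 1), including empty stems as the procedure requires:

```python
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

# Include every stem within the range of the data, even empty ones.
stems = range(min(data) // 10, max(data) // 10 + 1)
display = {s: [] for s in stems}
for x in sorted(data):
    display[x // 10].append(x % 10)   # stem = tens digit, leaf = units digit

for s in stems:
    print(s, "|", " ".join(str(leaf) for leaf in display[s]))
```

Stems 5 and 9 print with no leaves, exactly as in the display above.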
Box-and-Whisker Plots
A useful visual device for communicating the information contained in a data set is the box-and-whisker plot. The construction of a box-and-whisker plot (sometimes called, simply, a box plot) makes use of the quartiles of a data set and may be accomplished by following these five steps:
 Represent the variable of interest on the horizontal axis.
 Draw a box in the space above the horizontal axis in such a way that the left end of the box aligns with the first quartile Q_{1} and the right end of the box aligns with the third quartile Q_{3}
 Divide the box into two parts by a vertical line that aligns with the median
 Draw a horizontal line called a whisker from the left end of the box to a point that aligns with the smallest measurement in the data set.
 Draw another horizontal line, or whisker, from the right end of the box to a point that aligns with the largest measurement in the data set.
Examination of a box-and-whisker plot for a set of data reveals information regarding the amount of spread, location of concentration, and symmetry of the data.
Example:
The following example illustrates the construction of a box-and-whisker plot.
The smallest and largest measurements are 14.6 and 44, respectively.
First quartile Q_{1} = 27.25, the median Q_{2} = 31.1 and the third quartile Q_{3} = 33.525.
Measure of Central Tendency
Central tendency or central position or statistical averages reflects the central point or the most characteristic value of a set of measurements. The measure of central tendency describes the one score that best represents the entire distribution.
(OR)
A single figure that describes the entire series of observations with their varying sizes, occupying a central position.
The most common measures of central tendency are
 Mean
 Median
 Mode
Characteristics of Central Tendency:
 It should be rigidly defined
 An average should be properly defined so that it has one and only one interpretation.
 The average should not depend on the personal prejudice and bias of the investigator.
 It should be based on all items.
 It should be easily understood.
 It should not be unduly affected by extreme values.
 It should be least affected by the fluctuation of the sampling.
 It should be easy to interpret.
 It should be easily subjected to further mathematical calculations.
Measure of Central Tendency:
If n ≤ 15 → Direct Method
If n > 15 → Frequency Distribution Method:
Simple/Ungrouped Frequency Distribution (Range ≤ 20 digits)
Grouped Frequency Distribution (Range > 20 digits)
Mean:
It is defined as a value which is obtained by dividing the sum of all the values by the number of observations. Thus the arithmetic mean of a set of values x_{1}, x_{2}, x_{3}, x_{4}, . . . x_{n} is denoted by x̄ (read as “x bar”) and is calculated as:
x̄ = ∑x / n (Direct Method)
Where sign ∑ stands for the sum and “n” is the number of observations.
Example:
The grades of a student in five examinations were 67, 75, 81, 87, 90 find the arithmetic mean of grades.
Solution:
x̄ = (67 + 75 + 81 + 87 + 90) / 5
= 400 / 5
Here, x̄ = 80
Thus, the mean grade is 80.
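The direct method above is a one-line computation; a minimal Python sketch for the grades example:

```python
# Grades of the student in five examinations.
grades = [67, 75, 81, 87, 90]

mean = sum(grades) / len(grades)   # x-bar = sum of x / n
print(mean)
```

Here sum(grades) is 400 and n is 5, so the mean grade comes out as 80.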
Method of Finding Mean
If x_{1}, x_{2}, x_{3}, x_{4}, ….x_{n} are the values of different observations and f_{1}, f_{2}, f_{3}, f_{4}, ….f_{n} are their frequencies, then,
x̄ = ∑f_{i}x_{i} / ∑f_{i}
Or, A.M. = (f_{1}x_{1} + f_{2}x_{2} + … + f_{n}x_{n}) / (f_{1} + f_{2} + … + f_{n})
Example 2. The number of children of 80 families in a village are given below:
No. of Children/Family  1  2  3  4  5  6 
No. of Families  8  10  10  25  20  7 
Calculate mean.
Solution: let x_{i} represent the number of children per family and f_{i} represent the number of families. The calculations are presented in the following table:
No. of Children/Family (x_{i})  No. of Families (f_{i})  f_{i}x_{i} 
1  8  8 
2  10  20 
3  10  30 
4  25  100 
5  20  100 
6  7  42 
n=∑f_{i} =80  ∑f_{i}x_{i} = 300 
Thus x̄ = ∑f_{i}x_{i} / ∑f_{i} = 300 / 80 = 3.75
Methods of Finding Arithmetic mean for Grouped Data
Let x_{1}, x_{2}, x_{3}, x_{4}, . . . x_{n} be midpoints of the class intervals with corresponding frequencies f_{1}, f_{2}, f_{3}, f_{4}, ….f_{n}. Then the arithmetic mean is obtained by dividing the sum of the products of “f” and “x” by the total of all frequencies.
Thus:
A.M. = x̄ = ∑fx / ∑f
Example:
Given below are the heights of (in inches) of 200 students. Find A.M.
Height (inches)  30–35  35–40  40–45  45–50  50–55  55–60 
No. of Students  28  32  36  46  36  22 
Solution:
Height (Inches)  Mid points (x)  Frequency (f)  fx 
30–35  32.5  28  910 
35–40  37.5  32  1200 
40–45  42.5  36  1530 
45–50  47.5  46  2185 
50–55  52.5  36  1890 
55–60  57.5  22  1265 
Total:  —  ∑f = 200  ∑fx = 8980 
x̄ = ∑fx / ∑f = 8980 / 200 = 44.90 (inches).
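The grouped-data mean above can be sketched in Python: each class midpoint is weighted by its frequency, exactly as in the fx column of the table.

```python
# Midpoints and frequencies from the height table.
midpoints = [32.5, 37.5, 42.5, 47.5, 52.5, 57.5]
freqs = [28, 32, 36, 46, 36, 22]

fx = sum(f * x for f, x in zip(freqs, midpoints))   # sum of fx
n = sum(freqs)                                      # sum of f
mean = fx / n                                       # x-bar = sum fx / sum f
print(mean)
```

This reproduces ∑fx = 8980, ∑f = 200 and a mean height of 44.9 inches.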
Example: Given below are the weights (in kgs) of 100 students. Find Mean Weight:
Weight  70–74  75–79  80–84  85–89  90–94 
No. of Students  10  24  46  12  8 
Solution:
Weight (Kg)  MidPoints (x)  Frequency (f)  fx 
70 – 74  72  10  720 
75 – 79  77  24  1848 
80 – 84  82  46  3772 
85 – 89  87  12  1044 
90 – 94  92  8  736 
Total:  —  ∑f = 100  ∑fx = 8120 
x̄ = ∑fx / ∑f = 8120 / 100 = 81.20
Here, the mean weight is 81.20 kgs.
Merits of Mean
 It has the simplest average formula which is easily understandable and easy to compute.
 It is so rigidly defined by mathematical formula that everyone gets same result for single problem.
 Its calculation is based on all the observations.
 It is least affected by sampling fluctuations.
 It is a typical i.e. it balances the value at either side.
 It is the best measure to compare two or more series.(data)
 Mean is calculated on value and does not depend upon any position.
 Mathematical centre of a distribution
 Good for interval & ratio scale
 Does not ignore any information
 Inferential statistics is based on mathematical properties of the mean.
 It is based on all the observations.
 It is easy to calculate and simple to understand.
 It is relatively stable and amenable to mathematical treatment.
Demerits of Mean
 It cannot be calculated if all the values are not known.
 The extreme values have a greater effect on it.
 It cannot be determined for the qualitative data.
 It may not exist in data.
Median:
It is the middle most point or the central value of the variable in a set of observation when observations are arranged in either order of their magnitudes.
It is the value in a series, which divides the series into two equal parts, one consisting of all values less and the other all values greater than it.
Median for Ungrouped data
Median of “n” observations, x_{1}, x_{2}, x_{3},…x_{n} can be obtained as follows:
 When “n” is an odd number,
Median = ((n + 1) / 2)^{th} observation
 When “n” is an even number,
Median is the average of the (n/2)^{th} and ((n/2) + 1)^{th} observations.
Or
Simply use the ((n + 1) / 2)^{th} observation; it will be the average of the two middle observations.
The median for the discrete frequency distribution can be obtained as above, Using a cumulative frequency distribution.
Problem
Find the median of the following data:
12, 2, 16, 8, 14, 10, 6
Step 1: Organize the data, or arrange the numbers from smallest to largest.
2, 6, 8, 10, 12, 14, 16
Step 2: count number of observation in data (n)
.n = 7
Step 3: Since the number of data values is odd, the median will be found in the ((n + 1) / 2)^{th} position.
Median term (m) = (n + 1) / 2 = (7 + 1) / 2 = 8 / 2 = 4^{th} value
Step 4: In this case, the median is the value that is found in the fourth position of the organized data, therefore
Median = 10
Problem
Median for even data:
Find the median of the following data:
7, 9, 3, 4, 11, 1, 8, 6, 1, 4
Step 1: Organize the data, or arrange the numbers from smallest to largest.
1, 1, 3, 4, 4, 6, 7, 8, 9, 11
Step 2: Since the number of data values is even, the median will be the mean of the values found before and after the (n + 1) / 2 = 5.5^{th} position.
Step 3: The number found before the 5.5^{th} position is 4 and the number found after it is 6. Now, you need to find their mean value.
1, 1, 3, 4, 4, 6, 7, 8, 9, 11
Median = (4 + 6) / 2 = 5
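Both cases of the ungrouped median can be sketched as one small Python function, applied to the two data sets above:

```python
def median(values):
    s = sorted(values)                       # step 1: arrange the data
    n = len(s)
    if n % 2 == 1:                           # odd n: the ((n + 1)/2)th value
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2   # even n: mean of the two middle values

print(median([12, 2, 16, 8, 14, 10, 6]))        # odd example above
print(median([7, 9, 3, 4, 11, 1, 8, 6, 1, 4]))  # even example above
```

For the odd example this picks the 4th ordered value (10); for the even example it averages 4 and 6 to give 5.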
Example:
The following are the runs made by a batsman in 7 matches:
8, 12, 18, 13, 16, 5, 20.Find the median.
Solution: Writing the runs in ascending order.
5, 8, 12, 13, 16, 18, 20
As n = 7,
Median = ((n + 1) / 2)^{th} item = ((7 + 1) / 2)^{th} = 4^{th} item.
Hence, the Median is 13 runs.
Example:
Following are the marks (out of 100) obtained by 10 students in English:
23, 15, 35, 41, 48, 5, 8, 9, 11, 51. Find the median mark.
Solution: arranging the marks in ascending order. The marks are:
5, 8, 9, 11, 15, 23, 35, 41, 48, 51
As n = 10,
So, median = average of the (n/2)^{th} and ((n/2) + 1)^{th} items, i.e. of the 5^{th} and 6^{th} items.
Or, Median = (15 + 23) / 2 = 38 / 2 = 19 marks.
Alternative Method:
Median term (m) = ((n + 1) / 2)^{th} value
= (10 + 1) / 2
= 11 / 2 = 5.5^{th} value
5, 8, 9, 11, 15, 23, 35, 41, 48, 51
(M1 = 15, M2 = 23)
Median = (M1 + M2) / 2
Median = (15 + 23) / 2 = 19
Median for Grouped data
It is obtained by the following formula:
Median = l_{1} + ((l_{2} – l_{1}) / f) × (m – C)
Where, l_{1} = lower class limit of median class.
l_{2} = upper class limit of median class
f = frequency of median class.
m = n/2 or (n + 1)/2
C = cumulative frequency preceding the median class.
n = total frequency, i.e. ∑f.
Example:
Find the median height of 200 students in given data
Solution:
Class interval  Frequency (f)  C.F 
30–35  28  28 
35–40  32  28+32=60 
40–45  36  60+36=96 
45–50  46  96+46=142 
50–55  36  142+36=178 
55–60  22  178+22=200 (n) 
Median term (m) = (n + 1) / 2 = (200 + 1) / 2 = 100.5
As the 100.5^{th} item lies in (45–50), it is the median class with l_{1} = 45, l_{2} = 50, f = 46, C = 96
Median = l_{1} + ((l_{2} – l_{1}) / f) × (m – C)
Median = 45 + ((50 – 45) / 46) × (100.5 – 96)
= 45 + 22.5 / 46
= 45 + 0.489
= 45.489
Thus, median height is 45.489 inches.
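The grouped-median formula in the worked example above translates directly into Python, using m = (n + 1)/2 as the example does:

```python
# Median class (45-50) of the 200-student height table.
l1, l2 = 45, 50    # lower and upper limits of the median class
f = 46             # frequency of the median class
C = 96             # cumulative frequency preceding the median class
n = 200            # total frequency

m = (n + 1) / 2                          # 100.5th item
median = l1 + (l2 - l1) / f * (m - C)    # 45 + (5/46)(100.5 - 96)
print(round(median, 3))
```

Rounded to three decimals this reproduces the median height of 45.489 inches.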
2^{nd} Method:
Median = l + (w / f) × ((n / 2) – c)
Where, l = lower class boundary of median class.
w = width of median class.
f = frequency of median class.
n = total frequency, i.e. ∑f.
c = cumulative frequency preceding the median class.
Example:
Following are the weights in kgs of 100 students. Find the median weight.
Weights (kgs)  70–74  75–79  80–84  85–89  90–94 
No. of students  10  24  46  12  8 
Solution: As class boundaries are not given so, first of all we make class boundaries by using procedure.
Weight (kgs)  No. of students  Class boundaries  C.F 
70–74  10  69.5–74.5  10 
75–79  24  74.5–79.5  34 
80–84  46  79.5–84.5  80 
85–89  12  84.5–89.5  92 
90–94  8  89.5–94.5  100 
Median term = n / 2 = 100 / 2 = 50
As the 50^{th} item lies in (79.5–84.5), it is the median class with l = 79.5, w = 5, f = 46, c = 34
Median = l + (w / f) × ((n / 2) – c), we find
Median = 79.5 + (5 / 46) × (50 – 34)
= 79.5 + 80 / 46 = 79.5 + 1.74 = 81.24
Hence, median weight is 81.24 kg.
Merits of Median:
 It is easily understood, although it is not as popular as the mean.
 It is not influenced or affected by variation in the magnitude of the extreme items.
 The value of the median can be graphically ascertained by ogives.
 It is the best measure for qualitative data such as beauty, intelligence etc.
 The median indicates the value of the middle item in the distribution, i.e. the middle-most item is the median.
 It can be determined even by inspection in many cases.
 Good with ordinal data
 Easier to compute than the mean
Demerits of Median:
 For the calculation of the median, data must be arranged.
 The median, being a positional average, does not depend on each and every observation.
 It is not subject to algebraic treatment.
 The median is more affected or influenced by sampling fluctuations than the arithmetic mean.
 May not exist in data.
 It is not rigorously defined.
 It does not use values of all observations.
Mode:
Mode is considered as the value in a series which occurs most frequently (has the highest frequency)
The mode of distribution is the value at the point around which the items tend to be most heavily concentrated. It may be regarded as the most typical value.
 The word modal is often used when referring to the mode of a data set.
 If a data set has only one value that occurs most often, the set is called unimodal.
 A data set that has two values that occur with the same greatest frequency is referred to as bimodal.
 When a set of data has more than two values that occur with the same greatest frequency, the set is called multimodal.
Mode for Ungrouped data
Example 1. The grades of Jamal in eight monthly tests were 75, 76, 80, 80, 82, 82, 82, 85.Find the mode of his grades.
Solution: As 82 is repeated more than any other number, so clearly mode is 82.
Example 2. Ten students were asked about the number of questions they have solved out of 20 questions, last week. Records were 13, 14, 15, 11, 16, 10, 19, 20, 18, 17. Find the modes.
Solution: it is obvious that the data contain no mode, as none of the numbers is repeated. Sometimes data contains several modes.
If x = 10, 15, 15, 15, 20, 20, 20, 25 then the data contains two modes i.e. 15 and 20.
Mode for grouped data
Mode for the grouped data can be calculated by the following formula:
Mode = l_{1} + ((f_{m} – f_{1}) / ((f_{m} – f_{1}) + (f_{m} – f_{2}))) × h
(OR)
Mode = l_{1} + ((f_{m} – f_{1}) / (2f_{m} – f_{1} – f_{2})) × h
(OR)
Mode = l_{1} + ((f_{m} – f_{1}) / ((f_{m} – f_{1}) + (f_{m} – f_{2}))) × (l_{2} – l_{1})
l_{1}= lower limit (class boundary) of the modal class.
l_{2} = upper limit of the modal class
f_{m}= frequency of the modal class
f_{1}= frequency associated with the class preceding the modal class.
f_{2} = frequency associated with the class following the modal class
h = l_{2} – l_{1} (size of modal class)
The class with highest frequency is called the “Modal Class”.
Example 3. Find the mode for the heights of 200 students in given data
Height (inches)  Frequency 
30–35  28 
35–40  32 
40–45  36 (f_{1}) 
45–50  46 (f_{m}) 
50–55  36 (f_{2}) 
55–60  22 
∑f=200 
Solution:
Mode = l_{1} + ((f_{m} – f_{1}) / ((f_{m} – f_{1}) + (f_{m} – f_{2}))) × h
Mode = 45 + ((46 – 36) / ((46 – 36) + (46 – 36))) × 5
Mode = 45 + (10 / 20) × 5
Mode = 45 + 2.5
Mode = 47.5
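The grouped-mode formula in the example above can be sketched in Python with the same values from the height table:

```python
# Modal class 45-50 (highest frequency, 46).
l1 = 45                    # lower limit of the modal class
fm, f1, f2 = 46, 36, 36    # modal, preceding and following frequencies
h = 5                      # size of the modal class

mode = l1 + (fm - f1) / ((fm - f1) + (fm - f2)) * h
print(mode)
```

The two differences are both 10, so the mode lands exactly halfway into the modal class, at 47.5.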
Merits of Mode:
 It can be obtained by inspection.
 It is not affected by extreme values.
 This average can be calculated from open end classes.
 The score comes from the data set
 Good for nominal data
 Good when there are two ‘typical‘ scores
 Easiest to compute and understand
 It can be used to describe qualitative phenomenon
 The value of mode can also be found graphically.
Demerits of Mode
 Mode has no significance unless a large number of observations are available.
 It cannot be treated algebraically.
 It is a peculiar measure of central tendency.
 For the calculation of mode, the data must be arranged in the form of frequency distribution.
 It is not a rigidly defined measure.
 Ignores most of the information in a distribution
 Small samples may not have a mode.
 It is not based on all the observations.
Empirical Relationship b/w Mean, Median and Mode:
For a moderately skewed distribution, the three averages are connected by the approximate relation Mode ≈ 3 Median – 2 Mean.
Skewness:
Data distributions may be classified on the basis of whether they are symmetric or asymmetric. If a distribution is symmetric, the left half of its graph (histogram or frequency polygon) will be a mirror image of its right half. When the left half and right half of the graph of a distribution are not mirror images of each other, the distribution is asymmetric.
If the graph (histogram or frequency polygon) of a distribution is asymmetric, the distribution is said to be skewed. The mean, median and mode do not fall in the middle of the distribution.
Types of Skewness
 Positive skewness: If a distribution is not symmetric because its graph extends further to the right than to the left, that is, if it has a long tail to the right, we say that the distribution is skewed to the right or is positively skewed. In a positively skewed distribution Mean > Median > Mode. The positive skewness indicates that the mean is influenced more than the median and mode by the few extremely high values. A positively skewed distribution has a positive skewness value because the mean is greater than the mode.
 Negative skewness: If a distribution is not symmetric because its graph extends further to the left than to the right, that is, if it has a long tail to the left, we say that the distribution is skewed to the left or is negatively skewed. In a negatively skewed distribution Mean < Median < Mode. A negatively skewed distribution has a negative skewness value because the mean is less than the mode.
KURTOSIS
Kurtosis is a measure of the degree to which a distribution is “peaked” or flat in comparison to a normal distribution whose graph is characterized by a bellshaped appearance.
Measures of Dispersion
This term is used commonly to mean scatter, Deviation, Fluctuation, Spread or variability of data.
The degree to which the individual values of the variate scatter away from the average or the central value, is called a dispersion.
Types of Measures of Dispersions:
 Absolute Measures of Dispersion: The measures of dispersion which are expressed in terms of original units of a data are termed as Absolute Measures.
 Relative Measures of Dispersion: Relative measures of dispersion, also known as coefficients of dispersion, are obtained as ratios or percentages. These are pure numbers independent of the units of measurement and are used to compare two or more sets of data values.
Absolute Measures
 Range
 Quartile Deviation
 Mean Deviation
 Standard Deviation
Relative Measure
 Coefficient of Range
 Coefficient of Quartile Deviation
 Coefficient of mean Deviation
 Coefficient of Variation.
The Range:
1. The range is the simplest measure of dispersion. It is defined as the difference between the largest value and the smallest value in the data: Range = Largest value – Smallest value.
2. For grouped data, the range is defined as the difference between the upper class boundary (UCB) of the highest class and the lower class boundary (LCB) of the lowest class.
MERITS OF RANGE:
 Easiest to calculate and simplest to understand.
 Gives a quick answer.
DEMERITS OF RANGE:
 It gives a rough answer.
 It is not based on all observations.
 It changes from one sample to the next in a population.
 It can’t be calculated in open-end distributions.
 It is affected by sampling fluctuations.
 It gives no indication how the values within the two extremes are distributed
Quartile Deviation (QD):
1. It is also known as the Semi-Interquartile Range. The range is a poor measure of dispersion where extremely large values are present. The quartile deviation is defined as half of the difference between the third and the first quartiles:
QD = (Q_{3} – Q_{1}) / 2
InterQuartile Range
The difference between third and first quartiles is called the ‘InterQuartile Range’.
IQR = Q_{3} – Q_{1}
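The range, IQR and QD defined above can be sketched together in Python on a small hypothetical data set; `statistics.quantiles` (Python 3.8+) is used here with its default exclusive method, which is one of several quartile conventions.

```python
import statistics

# Hypothetical data set, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7]

data_range = max(data) - min(data)             # Range = largest - smallest
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles Q1, Q2, Q3
iqr = q3 - q1                                  # Inter-Quartile Range
qd = iqr / 2                                   # Quartile Deviation (semi-IQR)

print(data_range, iqr, qd)
```

Unlike the range, the IQR and QD ignore the extreme quarter of values on each side, which is why they are preferred when extreme values are present.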
Mean Deviation (MD):
1. The MD is defined as the average of the absolute deviations of the values from an average:
MD = ∑|x – x̄| / n
It is also known as Mean Absolute Deviation.
2. MD from the median is expressed as follows:
MD = ∑|x – Median| / n
3. For grouped data:
MD = ∑f |x – x̄| / ∑f
 The MD is simple to understand and to interpret.
 It is affected by the value of every observation.
 It is less affected by extreme values than the standard deviation.
 It is not suited to further mathematical treatment. It is, therefore, not as logical as convenient measure of dispersion as the SD.
The Variance:
 The mean of all squared deviations from the mean is called the variance.
 (Sample variance = S^{2}; population variance = σ^{2}, sigma squared, i.e. the standard deviation squared.) A high variance means most scores are far away from the mean; a low variance indicates most scores cluster tightly about the mean.
Formula:
S^{2} = ∑(X – X̄)^{2}/(n – 1) (OR) σ^{2} = ∑(X – µ)^{2}/N
Calculating variance: Heart rate of certain patient is 80, 84, 80, 72, 76, 88, 84, 80, 78, & 78. Calculate variance for this data.
Solution:
Step 1:
Find mean of this data
Mean = ∑X/n = 800/10 = 80
Step 2:
Draw two columns, ‘X’ and the deviation about the mean (X – X̄). In column ‘X’ put all the values of X, and in (X – X̄) subtract the mean X̄ from each ‘X’ value.
Step 3:
Draw another column of (X – X̄)^{2}, in which put the square of each deviation about the mean.
X  (X – X̄) Deviation about mean  (X – X̄)^{2} Square of deviation about mean 
80  80 – 80 = 0  0 
84  84 – 80 = 4  16 
80  80 – 80 = 0  0 
72  72 – 80 = –8  64 
76  76 – 80 = –4  16 
88  88 – 80 = 8  64 
84  84 – 80 = 4  16 
80  80 – 80 = 0  0 
78  78 – 80 = –2  4 
78  78 – 80 = –2  4 
∑X = 800, X̄ = 80  ∑(X – X̄) = 0 (summation of deviations about the mean is always zero)  ∑(X – X̄)^{2} = 184 
Step 4
Apply formula and put following values
∑(X )^{ 2}= 184
n = 10
Variance = 184/(10 – 1) = 184/9
Variance = 20.44
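The worked example can be checked in a few lines of Python; statistics.variance uses the same n – 1 divisor:

```python
from statistics import mean, variance

rates = [80, 84, 80, 72, 76, 88, 84, 80, 78, 78]
xbar = mean(rates)                        # step 1: mean = 80
ss = sum((x - xbar) ** 2 for x in rates)  # steps 2-3: sum of squared deviations
s2 = ss / (len(rates) - 1)                # step 4: divide by n - 1
print(xbar, ss, round(s2, 2))             # 80 184 20.44
```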
Standard Deviation
 The SD is defined as the positive Square root of the mean of the squared deviations of the values from their mean.
 The square root of the variance.
 It measures the spread of data around the mean. One standard deviation includes 68% of the values in a sample population and two standard deviations include 95% of the values & 3 standard deviations include 99.7% of the values
 The SD is affected by the value of every observation.
 In general, it is less affected by fluctuations of sampling than the other measures of dispersion.
 It has a definite mathematical meaning and is perfectly adaptable to algebraic treatment.
Formula:
S = √[∑(X – X̄)^{2}/(n – 1)] (OR) σ = √[∑(X – µ)^{2}/N]
Calculating Standard Deviation (we use same example): Heart rate of certain patient is 80, 84, 80, 72, 76, 88, 84, 80, 78, & 78. Calculate standard deviation for this data.
SOLUTION:
Step 1: Find mean of this data
Mean = ∑X/n = 800/10 = 80
Step 2:
Draw two columns, ‘X’ and the deviation about the mean (X – X̄). In column ‘X’ put all the values of X, and in (X – X̄) subtract the mean X̄ from each ‘X’ value.
Step 3:
Draw another column of (X – X̄)^{2}, in which put the square of each deviation about the mean.
X  (X – X̄) Deviation about mean  (X – X̄)^{2} Square of deviation about mean 
80  80 – 80 = 0  0 
84  84 – 80 = 4  16 
80  80 – 80 = 0  0 
72  72 – 80 = –8  64 
76  76 – 80 = –4  16 
88  88 – 80 = 8  64 
84  84 – 80 = 4  16 
80  80 – 80 = 0  0 
78  78 – 80 = –2  4 
78  78 – 80 = –2  4 
∑X = 800, X̄ = 80  ∑(X – X̄) = 0 (summation of deviations about the mean is always zero)  ∑(X – X̄)^{2} = 184 
Step 4
Apply the formula and put in the following values:
∑(X – X̄)^{2} = 184
n = 10
S = √(184/9) = √20.44 = 4.52
MERITS AND DEMERITS OF STD. DEVIATION
 Std. Dev. summarizes the deviation of a large distribution from mean in one figure used as a unit of variation.
 It indicates whether the variation of an individual from the mean is real or by chance.
 Std. Dev. helps in finding the suitable size of sample for valid conclusions.
 It helps in calculating the Standard error.
DEMERITS
 It gives greater weight to extreme values, and the process of squaring deviations and then taking the square root involves lengthy calculations.
Relative measure of dispersion:
(a) Coefficient of Variation,
(b) Coefficient of Dispersion,
(c) Quartile Coefficient of Dispersion, and
(d) Mean Coefficient of Dispersion.
Coefficient of Variation (CV):
1. The coefficient of variation was introduced by Karl Pearson. The CV expresses the SD as a percentage of the arithmetic mean:
CV = (S/X̄) × 100 ————— for sample data
CV = (σ/µ) × 100 ————— for population data
 It is frequently used in comparing dispersion of two or more series. It is also used as a criterion of consistent performance, the smaller the CV the more consistent is the performance.
 The disadvantage of the CV is that it fails to be useful when the mean is close to zero.
 It is sometimes also referred to as ‘coefficient of standard deviation’.
 It is used to determine the stability or consistency of data.
 The higher the CV, the higher the instability or variability in the data, and vice versa.
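A minimal sketch of the CV, using the heart-rate data from the variance example and the sample SD:

```python
from statistics import mean, stdev

rates = [80, 84, 80, 72, 76, 88, 84, 80, 78, 78]
cv = stdev(rates) / mean(rates) * 100   # SD as a percentage of the mean
print(round(cv, 2))   # 5.65
```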
Coefficient of Dispersion (CD):
If X_{m} and X_{n} are respectively the maximum and the minimum values in a set of data, then the coefficient of dispersion is defined as:
CD = (X_{m} – X_{n})/(X_{m} + X_{n})
Coefficient of Quartile Deviation (CQD):
1. If Q_{1} and Q_{3} are given for a set of data, then (Q_{1} + Q_{3})/2 is a measure of central tendency or average of the data. The measure of relative dispersion for quartile deviation is then expressed as follows:
CQD = (Q_{3} – Q_{1})/(Q_{3} + Q_{1})
CQD may also be expressed in percentage.
Mean Coefficient of Dispersion (CMD):
The relative measure for mean deviation is the ‘mean coefficient of dispersion’ or ‘coefficient of mean deviation’:
CMD = MD/X̄ ——————– for arithmetic mean
CMD = MD/Median ——————– for median
Percentiles and Quartiles
The mean and median are special cases of a family of parameters known as location parameters. These descriptive measures are called location parameters because they can be used to designate certain positions on the horizontal axis when the distribution of a variable is graphed.
Percentile:
 Percentiles are numerical values that divide an ordered data set into 100 groups of values, with at most 1% of the data values in each group. There can be at most 99 percentiles in a data set.
 A percentile is a measure that tells us what percent of the total frequency scored at or below that measure.
Percentile corresponding to a given data value: The percentile in a data set corresponding to a specific data value X is obtained by using the following formula:
Percentile = (Number of values below X + 0.5) / (Total number of values in the data set) × 100
Example: Calculate percentile for value 12 from the following data
13 11 10 13 11 10 8 12 9 9 8 9
Solution:
Step # 01: Arrange data values in ascending order from smallest to largest
S. No  1  2  3  4  5  6  7  8  9  10  11  12 
Observations or values  8  8  9  9  9  10  10  11  11  12  13  13 
Step # 02: The number of values below 12 is 9 and total number in the data set is 12
Step # 03: Use percentile formula
Percentile for 12 = (9 + 0.5)/12 × 100 = 79.17%
It means the value 12 corresponds to the 79^{th} percentile.
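The percentile-rank formula above translates directly into code (percentile_rank is an illustrative helper name, not a library function):

```python
data = [13, 11, 10, 13, 11, 10, 8, 12, 9, 9, 8, 9]

def percentile_rank(values, x):
    # (number of values below x + 0.5) / n * 100
    below = sum(1 for v in values if v < x)
    return (below + 0.5) / len(values) * 100

p = percentile_rank(data, 12)
print(round(p, 2))   # 79.17
```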
Example2: Find out 25^{th} percentile for the following data
6 12 18 12 13 8 13 11
10 16 13 11 10 10 2 14
SOLUTION
Step # 01: Arrange data values in ascending order from smallest to largest
S. No  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16 
Observations or values  2  6  8  10  10  10  11  11  12  12  13  13  13  14  16  18 
Step # 02: Calculate the position of the percentile (n × k/100). Here n = number of observations = 16 and k (percentile) = 25.
Therefore position = (16 × 25)/100 = 4
Therefore, 25^{th} percentile will be the average of values located at the 4^{th} and 5^{th} position in the ordered set. Here values for 4^{th} and 5^{th} correspond to the value of 10 each.
Thus, P_{25} (= P_{k}) = (10 + 10)/2 = 10
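The position method from Example 2 can be sketched as follows (percentile is an illustrative helper; it implements only the rule used in these notes: average the two neighbouring values when n·k/100 is a whole number, otherwise take the next value up):

```python
data = sorted([6, 12, 18, 12, 13, 8, 13, 11, 10, 16, 13, 11, 10, 10, 2, 14])

def percentile(values, k):
    pos = len(values) * k / 100
    if pos.is_integer():
        i = int(pos)
        return (values[i - 1] + values[i]) / 2   # average of pos-th and next value
    return values[int(pos)]                      # next value up otherwise

p25 = percentile(data, 25)
print(p25)   # 10.0
```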
Quartiles
These are measures of position which divide the data into four equal parts when the data is arranged in ascending or descending order. The quartiles are denoted by Q.
Quartiles  Formula for Ungrouped Data  Formula for Grouped Data 
Q_{1} = First Quartile, below which the first 25% of the observations are present  Value at position (n + 1)/4 in the ordered data  Q_{1} = l + h/f (n/4 – C) 
Q_{2} = Second Quartile, below which the first 50% of the observations are present. It can easily be located as the median value.  Value at position 2(n + 1)/4  Q_{2} = l + h/f (2n/4 – C) 
Q_{3} = Third Quartile, below which the first 75% of the observations are present  Value at position 3(n + 1)/4  Q_{3} = l + h/f (3n/4 – C) 
Symbol Key: l = lower class boundary of the quartile class, h = class width, f = frequency of the quartile class, C = cumulative frequency of the class preceding the quartile class, n = number of observations.
PROBABILITY
Probability:
Probability is used to measure the ‘likelihood’ or ‘chances’ of certain events (prespecified outcomes) of an experiment.
If an event can occur in N mutually exclusive and equally likely ways, and if m of these possess a trait E, the probability of the occurrence of E is expressed as:
P(E) = m/N = Number of favourable cases / Total number of outcomes (sample space)
Characteristics of probability:
 It is usually expressed by the symbol ‘P’
 It ranges from 0 to 1
 When P = 0, it means there is no chance of happening or impossible.
 If P = 1, it means the chances of an event happening is 100%.
 The total sum of probabilities of all the possible outcomes in a sample space is always equal to one (1).
 If the probability of occurrence is p(O) = A, then the probability of non-occurrence is 1 – A.
Terminology
Random Experiment:
Any natural phenomenon yielding some result will be termed a random experiment when it is not possible to predict which particular result will turn out.
An Outcome:
The results of an experiment in all their possible forms are said to be the outcomes of that experiment. e.g. when you toss a coin once, you either get head or tail.
A trial:
This refers to an activity of carrying out an experiment, like tossing a coin or rolling a die.
Sample Space:
A set of All possible outcomes of a probability experiment.
Example 1: In tossing a coin, the outcomes are either Head (H) or tail (T) i.e. there are only two possible outcomes in tossing a coin. The chances of obtaining a head or a tail are equal. It can be solved as follow;
n(s) = 2 ways
S = {H, T}
Example 2: what is sample space when single dice is rolled?
n(s) = 6 ways
S = {1, 2, 3, 4, 5, 6}
A Simple Event
In an experimental probability, an event with only one outcome is called a simple event.
Compound Events
When two or more events occur in connection with each other, then their simultaneous occurrence is called a compound event.
Mutually exhaustive:
Events are said to be mutually exhaustive if, taken together, they cover all possible outcomes of an experiment, i.e. at least one of them must occur in any trial.
Mutually exclusive:
Two events are said to be mutually exclusive if they cannot occur simultaneously.
Example: tossing a coin, the events head and tail are mutually exclusive because if the outcome is head then the possibilities of getting a tail in the same trial is ruled out.
Equally likely events:
Events are said to be equally likely if there is no reason to expect any one in preference to other.
Example: in a single cast of a fair die each of the events 1, 2, 3, 4, 5, 6 is equally likely to occur.
Favourable case:
The cases which ensure the occurrence of an event are said to be favourable to the events.
Independent event:
When the experiments are conducted in such a way that the occurrence of an event in one trial does not have any effect on the occurrence of the other events at a subsequent experiment, then the events are said to be independent.
Example:
If we draw a card from a pack of cards and again draw a second card from the pack after replacing the first card drawn, the second draw is known as independent of the first.
Dependent Event:
When the experiments are conducted in such a way that the occurrence of an event in one trial does have some effect on the occurrence of other events in a subsequent experiment, then the events are said to be dependent events.
Example:
If we draw a card from a pack and again draw a card from the rest of pack of cards (containing 51 cards) then the second draw is dependent on the first.
Conditional Probability:
The probability of happening of an event A, when it is known that B has already happened, is called conditional probability of A and is denoted by P (A/B) i.e.
 P(A/B) = conditional probability of A given that B has already occurred.
 P (B/A) = conditional probability of B given that A has already occurred.
Types of Probability:
The Classical or mathematical:
Probability is the ratio of the number of favorable cases as compared to the total likely cases.
The probability of non-occurrence of the same event is given by {1 – p(occurrence)}.
The probability of occurrence plus non-occurrence is equal to one.
If the probability of occurrence is p(O) and the probability of non-occurrence is p(O′), then p(O) + p(O′) = 1.
Statistical or Empirical
Empirical probability arises when frequency distributions are used. For example:
Observation ( X)  0  1  2  3  4 
Frequency ( f)  3  7  10  16  11 
The probability of the observation X = 2 is given by the formula: P(X = 2) = f/∑f = 10/47.
RULES OF PROBABILITY
Addition Rule
 Rule 1: When two events A and B are mutually exclusive, then probability of any one of them is equal to the sum of the probabilities of the happening of the separate events;
Mathematically:
P (A or B) =P (A) +P (B)
Example: When a die is rolled, find the probability of getting a 3 or a 5.
Solution: P (3) =1/6 and P (5) =1/6.
Therefore P (3 or 5) = P (3) + P (5) = 1/6+1/6 =2/6=1/3.
2) Rule 2: If A and B are two events that are NOT mutually exclusive, then
P (A or B) = P(A) + P(B) – P(A and B), where P(A and B) is the probability of the outcomes that events A and B have in common.
Given two events A and B, the probability that event A, or event B, or both occur is equal to the probability that event A occurs, plus the probability that event B occurs, minus the probability that the events occur simultaneously.
Example: When a card is drawn from a pack of 52 cards, find the probability that the card is a 10 or a heart.
Solution: P (10) = 4/52 and P (heart) =13/52
P (10 and heart) = 1/52
P (A or B) = P (A) + P (B) – P (A and B) = 4/52 + 13/52 – 1/52 = 16/52 = 4/13.
Multiplication Rule
 Rule 1: For two independent events A and B, then
P (A and B) = P (A) x P (B).
Example: Determine the probability of obtaining a 5 on a die and a tail on a coin in one throw.
Solution: P (5) =1/6 and P (T) =1/2.
P (5 and T) = P (5) x P (T) = 1/6 x ½= 1/12.
 Rule 2: When two events are dependent, the probability of both events occurring is P (A and B) = P (A) × P (B|A), where P (B|A) is the probability that event B occurs given that event A has already occurred.
Example: Find the probability of obtaining two Aces from a pack of 52 cards without replacement.
Solution: P (Ace) = 4/52 and P (second Ace with NO replacement) = 3/51
Therefore P (Ace and Ace) = P (Ace) x P (Second Ace) = 4/52 x 3/51 = 1/221
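Exact fractions make the dependent-event example easy to verify:

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)    # 4 aces in a full deck
p_second_ace = Fraction(3, 51)   # 3 aces left among the remaining 51 cards
p_both = p_first_ace * p_second_ace
print(p_both)   # 1/221
```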
Construct sample space, when two dice are rolled
n(s) = n_{1} x n_{2} = 6 x 6 = 36
(1,1)  (2,1)  (3,1)  (4,1)  (5,1)  (6,1) 
(1,2)  (2, 2)  (3, 2)  (4, 2)  (5, 2)  (6, 2) 
(1, 3)  (2, 3)  (3, 3)  (4, 3)  (5, 3)  (6, 3) 
(1, 4)  (2, 4)  (3, 4)  (4, 4)  (5, 4)  (6, 4) 
(1, 5)  (2, 5)  (3, 5)  (4, 5)  (5, 5)  (6, 5) 
(1, 6)  (2, 6)  (3, 6)  (4, 6)  (5, 6)  (6, 6) 
EXAMPLE OF FINDING PROBABILITY OF AN EVENT
If 3 coins are tossed together, construct a tree diagram & find the followings;
a) Event showing No head b) Event showing 01 head,
c) Event showing 02 heads d) Event showing 03 heads
n (s) = n_{1} x n_{2} x n_{3}
= 2 x 2 x2 = 8
 Event showing no head = P(X = 0)
Answer: TTT, 1/8 = 0.125
 Event showing 01 head = P(X = 1)
Answer: HTT, THT, TTH 3/8 = 0.375
 Event showing 02 heads = P(X = 2)
Answer: HHT, HTH, THH 3/8 = 0.375
 Event showing 03 heads = P(X = 3)
Answer: HHH 1/8 = 0.125
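The tree diagram can be replaced by enumerating the sample space; this sketch counts heads over all 8 equally likely outcomes:

```python
from itertools import product

outcomes = ["".join(t) for t in product("HT", repeat=3)]   # 8 outcomes: HHH ... TTT
n = len(outcomes)

probs = {k: sum(1 for o in outcomes if o.count("H") == k) / n for k in range(4)}
print(probs)   # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```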
Complementary Events
Complementary events happen when there are only two possible outcomes, like getting a job or not getting a job. In other words, the complement of an event is the exact opposite: the event not happening.
The probability of not occurrence of an event.
The probability of an event A is equal to 1 minus the probability of its complement, which is written as Ā and
P (Ā) = 1 – P (A)
CONDITIONAL PROBABILITY &SCREENING TESTS
Sensitivity, Specificity, and Predictive Value Positive and Negative
In the health sciences field a widely used application of probability laws and concepts is found in the evaluation of screening tests and diagnostic criteria. Of interest to clinicians is an enhanced ability to correctly predict the presence or absence of a particular disease from knowledge of test results (positive or negative) and/or the status of presenting symptoms (present or absent). Also of interest is information regarding the likelihood of positive and negative test results and the likelihood of the presence or absence of a particular symptom in patients with and without a particular disease.
In consideration of screening tests, one must be aware of the fact that they are not always infallible. That is, a testing procedure may yield a false positive or a false negative.
False Positive:
A false positive results when a test indicates a positive status when the true status is negative.
False Negative:
A false negative results when a test indicates a negative status when the true status is positive.
Sensitivity:
The sensitivity of a test (or symptom) is the probability of a positive test result (or presence of the symptom) given the presence of the disease.
Specificity:
The specificity of a test (or symptom) is the probability of a negative test result (or absence of the symptom) given the absence of the disease.
Predictive value positive:
The predictive value positive of a screening test (or symptom) is the probability that a subject has the disease given that the subject has a positive screening test result (or has the symptom).
Predictive value negative:
The predictive value negative of a screening test (or symptom) is the probability that a subject does not have the disease, given that the subject has a negative screening test result (or does not have the symptom).
Summary of formulae:
Sensitivity = a/(a + c); Specificity = d/(b + d); Predictive value positive = a/(a + b); Predictive value negative = d/(c + d)
Symbols: a = true positives, b = false positives, c = false negatives, d = true negatives.
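With the four definitions above, the measures follow directly from a 2×2 table of screening-test counts. The counts below are made-up numbers purely for illustration:

```python
# 2x2 screening-test table (illustrative counts, not real data)
tp, fp = 90, 30     # test positive: disease present / disease absent
fn, tn = 10, 870    # test negative: disease present / disease absent

sensitivity = tp / (tp + fn)   # P(test + | disease present)
specificity = tn / (tn + fp)   # P(test - | disease absent)
ppv = tp / (tp + fp)           # predictive value positive
npv = tn / (tn + fn)           # predictive value negative

print(sensitivity, round(specificity, 3), ppv, round(npv, 3))
```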
COUNTING RULES
1) FACTORIALS (number of ways)
The result of multiplying a sequence of descending natural numbers down to 1. It is denoted by “!”
Examples:
4! = 4 × 3 × 2 × 1 = 24
7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5040
Remember : 0! = 1
General Method:
n! = n (n – 1) (n – 2) (n – 3) ……….. (3)(2)(1)
2) PERMUTATION RULES
All possible arrangements of a collection of things, where the order is important in a subset.
The same items in a different order count as a distinct arrangement. The number of permutations of r objects taken from n distinct objects is:
^{n}P_{r} = n!/(n – r)!
Examples
 COMBINATIONS
The order of the objects in a subset is immaterial; the same objects in a different order do not count as a distinct selection. The number of combinations of r objects taken from n distinct objects is:
^{n}C_{r} = n!/[r!(n – r)!]
Examples:
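Python's math module (3.8+) provides all three counting functions directly:

```python
import math

print(math.factorial(7))   # 7! = 5040
print(math.perm(5, 2))     # 5P2 = 5!/(5 - 2)! = 20 ordered arrangements
print(math.comb(5, 2))     # 5C2 = 5!/(2!(5 - 2)!) = 10 unordered selections
```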
Binomial distribution:
Binomial distribution is a probability distribution which is obtained when the probability ‘p’ of the happening of an event is the same in all the trials and there are only two possible outcomes in each trial.
Conditions:
 Each trial results in one of two possible, mutually exclusive, outcomes. One of the possible outcomes is denoted (arbitrarily) as a success, and the other is denoted a failure.
 The probability of a success, denoted by p, remains constant from trial to trial. The probability of a failure (1 – p) is denoted by q.
 The trials are independent; that is, the outcome of any particular trial is not affected by the outcome of any other trial.
 Parameter should be available; (n & p) are parameters.
Formula:
b (X: n, p) = ^{n}C_{x} p^{x} q^{n – x } (OR) f (x) = ^{n}C_{x} p^{x} q^{n – x}
Where
X = Random variable
n = Number of Trials
p = Probability of Success
q = Probability of Failure
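The binomial formula can be sketched directly; the numbers (3 successes in 5 fair coin tosses) are illustrative:

```python
import math

def binom_pmf(x, n, p):
    # b(x; n, p) = nCx * p^x * q^(n - x), with q = 1 - p
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

print(binom_pmf(3, 5, 0.5))   # 0.3125
```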
NORMAL DISTRIBUTION
Definitions:
 The normal distribution is a pattern for the distribution of a set of data which follows a bell-shaped curve.
 A theoretical frequency distribution for a set of variable data, usually represented by a bellshaped curve symmetrical about the mean
The formula for this distribution was first published by Abraham De Moivre (1667–1754) on November 12, 1733. Many other mathematicians figure prominently in the history of the normal distribution, including Carl Friedrich Gauss (1777–1855).The distribution is frequently called the Gaussian distribution in recognition of his contributions.
The normal density is given by
f(x) = (1/(σ√2π)) e^{–(x – µ)²/2σ²}, –∞ < x < ∞
 ‘π’ and ‘e’ (Euler’s number) are the familiar constants, 3.14159 and 2.71828 respectively.
 The two parameters of the distribution are ‘µ’, the mean, and ‘σ’, the standard deviation.
Properties of Normal Distribution:
 Total area under a normal distribution curve is equal to 1.00
 Mean, median and mode all have same values (mean = median = mode) and located at the centre of the distribution.
 A normal distribution curve is bell shaped, symmetric around the mean and skewness is ‘0’ zero.
 A normal distribution curve is unimodal. (it has only one mode)
 Normal distributions are denser in the center and less dense in the tails.
 All normal curves are positive for all x. That is, f(x) > 0 for all x.
 The tails of the curve get closer and closer to the x-axis as they move away from the mean, but never touch the x-axis.
 Continuous for all values of X between –∞ and ∞, so that each conceivable interval of real numbers has a probability other than zero.
 –∞ < X < ∞
 68% of the values fall within ±1 SD of the mean, 95% of values fall within ±2 SD of the mean, 99.7% of values fall within ±3 SD of the mean.
 The normal distribution is completely determined by the parameters ‘µ’and ‘σ’. Different values of µ shift the graph of the distribution along the xaxis. Whereas Different values of σ determine the degree of flatness or peakedness of the graph of the distribution. µ is often referred to as a location parameter and σ is often referred to as a shape parameter.
Why is the normal distribution useful?
 Many things actually are normally distributed, or very close to it. For example, height and intelligence are approximately normally distributed; measurement errors also often have a normal distribution
 The normal distribution is easy to work with mathematically. In many practical cases, the methods developed using normal theory work quite well even when the distribution is not normal.
 There is a very strong connection between the size of a sample N and the extent to which a sampling distribution approaches the normal form. Many sampling distributions based on large N can be approximated by the normal distribution even though the population distribution itself is definitely not normal.
The Standard Normal Distribution
The normal distribution can be transformed into the standard normal distribution (or unit normal distribution), which has a mean of 0 and a standard deviation of 1. It is obtained from the following equation by creating a new random variable:
z = (x – µ)/σ
The equation for the standard normal distribution is written
f(z) = (1/√2π) e^{–z²/2}, –∞ < z < ∞
The z-transformation will prove to be useful in the examples and applications that follow. This value of ‘z’ denotes, for a value of a random variable, the number of standard deviations that the value falls above (+z) or below (–z) the mean, which in this case is 0.
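Instead of a printed z-table, the standard normal CDF can be computed from the error function in Python's math module; this reproduces the table value 0.8413 used in the examples below:

```python
import math

def phi(z):
    # Standard normal CDF: P(Z < z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(phi(1), 4))   # 0.8413
```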
RANDOM VARIABLE:
Any numerical quantity with specific characteristics, having probability in the background
(OR)
Numerical quantity which has a specific probability. It is represented by ‘X’.
General Procedure
As you might suspect from the formula for the normal density function, it would be difficult and tedious to do the calculus every time we had a new set of parameters for µ and σ. So instead, we usually work with the standardized normal distribution, where µ = 0 and σ = 1, i.e. N (0,1). That is, rather than directly solve a problem involving a normally distributed variable X with mean µ and standard deviation σ, an indirect approach is used.
 We first convert the problem into an equivalent one dealing with a normal variable measured in standardized deviation units, called a standardized normal variable. To do this, if X ∼ N (µ, σ^{2}), then Z = (X – µ)/σ ∼ N (0, 1).
 A table of standardized normal values can then be used to obtain an answer in terms of the converted problem.
 The interpretation of Z values is straightforward. Since σ = 1, if Z = 2, the corresponding X value is exactly 2 standard deviations above the mean. If Z = –1, the corresponding X value is one standard deviation below the mean. If Z = 0, X = the mean, i.e. µ.
Example of a zscore calculation: Suppose that patients’ heart rate follow a normal distribution with a mean of 72 & standard deviation of 8 b/ min. Find the probabilities if;
 Heart Rate is Greater Than 80 Or P(X > 80)
P(X > 80)
Data
X = 80
µ = 72
σ = 8
Z = (80 – 72)/8 = 8/8
Z = 1
P (Z > 1)
P (Z > 1) = 1 – P (Z < 1)
P (Z > 1) = 1 – 0.8413
P (Z > 1) = 0.1587
 Heart Rate is Lesser Than 90 Or P(X < 90)
Data
X = 90
µ = 72
σ = 8
P(X < 90)
Z = (90 – 72)/8 = 18/8
Z = 2.25
P (Z < 2.25)
P (Z < 2.25) = 0.9878
 Heart Rate is Between 75and 95 Or P(75 <X < 95)
Data
X_{1} = 75
X_{2} = 95
µ = 72
σ = 8
Z_{1} = (X_{1} – µ)/σ = (75 – 72)/8 = 3/8 = 0.37
Z_{2} = (X_{2} – µ)/σ = (95 – 72)/8 = 23/8 = 2.87
P (0.37 < Z < 2.87)
P (0.37 < Z < 2.87) = P (Z < 2.87) – P (Z < 0.37)
P (Z < 2.87) = 0.9979
P (Z < 0.37) = 0.6443
P (0.37 < Z < 2.87) = 0.9979 – 0.6443 = 0.3536
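All three heart-rate probabilities can be recomputed with the same CDF sketch; note the last value differs slightly from the table answer (0.3536) because the notes round z to two decimals before looking it up:

```python
import math

def phi(z):   # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 72, 8
p_gt_80 = 1 - phi((80 - mu) / sigma)                       # P(X > 80)
p_lt_90 = phi((90 - mu) / sigma)                           # P(X < 90)
p_75_95 = phi((95 - mu) / sigma) - phi((75 - mu) / sigma)  # P(75 < X < 95)

print(round(p_gt_80, 4), round(p_lt_90, 4), round(p_75_95, 4))
# 0.1587 0.9878 0.3518
```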
SAMPLING:
A set of data or elements drawn from a larger population and analyzed to estimate the characteristics of that population is called sample. And the process of selecting a sample from a population is called sampling.
OR
Procedure by which some members of a given population are selected as representatives of the entire population
TYPES OF SAMPLING
There are two types of sampling
 Probability sampling
 Nonprobability sampling
 Probability Sampling:
A sampling technique in which each member of the population has an equal chance of being chosen is called probability sampling.
There are four types of probability sampling
 Simple random sampling
 Systematic sampling
 Stratified sampling
 Cluster sampling
 Simple Random Sampling
A probability sampling technique in which, each person in the population has an equal chance of being chosen for the sample and every collection of persons of the same size has an equal chance of becoming the actual sample.
 Systematic Sampling
A sample constructed by selecting every kth element in the sampling frame.
Number the units in the population from 1 to N. Decide on the sample size n that you want or need. Compute the interval size k = N/n. Randomly select an integer between 1 and k, then take every kth unit.
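The procedure above can be sketched as follows (systematic_sample is an illustrative helper, assuming N is an exact multiple of n):

```python
import random

def systematic_sample(population, n):
    k = len(population) // n         # interval size k = N / n
    start = random.randrange(k)      # random start within the first interval
    return population[start::k][:n]  # every kth unit from the start

random.seed(0)               # fixed seed just to make the sketch repeatable
frame = list(range(1, 101))  # sampling frame numbered 1..100
sample = systematic_sample(frame, 10)
print(sample)                # 10 units, each k = 10 apart
```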
 Stratified Random Sampling.
Is obtained by separating the population elements into non overlapping groups, called strata, and then selecting a simple random sample from each stratum.
 Cluster Sampling.
A simple random sample in which each sampling unit is a collection, or cluster, of elements. For example, an investigator wishing to study students might first sample groups or clusters of students, such as classes, and then select the final sample of students from among the clusters. Also called area sampling.
 NonProbability Sampling
Nonprobability sampling is a sampling technique where the samples are gathered in a process that does not give all the individuals in the population equal chances of being selected.
It decreases a sample’s representativeness of a population.
Type of Nonprobability sampling
Following are the common types of nonprobability sampling:
 Convenience sampling
 Quota Sampling
 Purposive/ judgmental sampling
 Network/ snowball Sampling
 Convenience Sampling:
The members of the population are chosen based on their relative ease of access. Such samples are biased because researchers may unconsciously approach some kinds of respondents and avoid others.
 Quota Sampling
It is the nonprobability version of stratified sampling. Like stratified sampling, the researcher first identifies the stratums and their proportions as they are represented in the population. Then convenience or judgment sampling is used to select the required number of subjects from each stratum. This differs from stratified sampling, where the stratums are filled by random sampling.
 Purposive Sampling.
It is a common nonprobability method. The researcher uses his or her own judgment about which respondents to choose, and picks those who best meets the purposes of the study.
 Snowball Sampling
It is a special nonprobability method used when the desired sample characteristic is rare. It may be extremely difficult or cost prohibitive to locate respondents in these situations. Snowball sampling relies on referrals from initial subjects to generate additional subjects. While this technique can dramatically lower search costs, it comes at the expense of introducing bias because the technique itself reduces the likelihood that the sample will represent a good cross section from the population.
INFERENTIAL STATISTICS
Statistical inference is the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample drawn from that population. It consists of two techniques:
 Estimation of parameters
 Hypothesis testing
ESTIMATION OF PARAMETERS
The process of estimation entails calculating, from the data of a sample, some statistic that is offered as an approximation of the corresponding parameter of the population from which the sample was drawn.
Parameter estimation is used to estimate a single parameter, like a mean.
There are two types of estimates
 Point Estimates
 Interval Estimates (Confidence Interval).
POINT ESTIMATES
A point estimate is a single numerical value used to estimate the corresponding population parameter.
For example: the sample mean x̄ is a point estimate of the population mean µ; the sample variance S^{2} is a point estimate of the population variance σ^{2}. These are point estimates: a single-valued guess of the parametric value.
A good estimator must satisfy three conditions:
 Unbiased: The expected value of the estimator must be equal to the mean of the parameter
 Consistent: The value of the estimator approaches the value of the parameter as the sample size increases
 Relatively Efficient: The estimator has the smallest variance of all estimators which could be used
CONFIDENCE INTERVAL (Interval Estimates)
An interval estimate consists of two numerical values defining a range of values that, with a specified degree of confidence, most likely includes the parameter being estimated.
Interval estimation of a parameter is more useful because it indicates a range of values within which the parameter has a specified probability of lying. With interval estimation, researchers construct a confidence interval around the estimate; the upper and lower limits are called confidence limits.
Interval estimates provide a range of values for a parameter value, within which we have a stated degree of confidence that the parameter lies. A numeric range, based on a statistic and its sampling distribution that contains the population parameter of interest with a specified probability.
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data
Calculating confidence interval when n ≥ 30 (Single Population Mean)
Example: A random sample of size 64 with mean 25 & Standard Deviation 4 is taken from a normal population. Construct 95 % confidence interval
We use the following formula to find the confidence interval when n ≥ 30:
x̄ ± z_{α/2} (σ/√n)
Data
x̄ = 25
σ = 4
n = 64
25 ± (4/√64) × 1.96
25 ± (4/8) × 1.96
25 ± 0.5 × 1.96
25 ± 0.98
25 – 0.98 ≤ µ ≤ 25 + 0.98
24.02 ≤ µ ≤ 25.98
We are 95% confident that population mean (µ) will have value between 24.02 & 25.98
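The z-interval example checks out in code:

```python
import math

xbar, sigma, n = 25, 4, 64
z = 1.96                                  # two-sided critical value for 95%
margin = z * sigma / math.sqrt(n)
lower, upper = xbar - margin, xbar + margin
print(round(lower, 2), round(upper, 2))   # 24.02 25.98
```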
Calculating confidence interval when n < 30 (Single Population Mean)
Example: A random sample of size 9 with mean 25 & Standard Deviation 4 is taken from a normal population. Construct 95 % confidence interval
We use the following formula to find the confidence interval when n < 30:
x̄ ± t_{α/2, df} (S/√n)
Data
x̄ = 25
S = 4
n = 9
α/2 = 0.025
df = n – 1 (9 1 = 8)
t_{α/2,df} = 2.306
25 ± 4/√9 x 2.306
25 ± 4/3 x 2.306
25 ± 1.33 x 2.306
25 ± 3.07
25 – 3.07 ≤ µ ≤ 25 + 3.07
21.93 ≤ µ ≤ 28.07
We are 95% confident that population mean (µ) will have value between 21.93 & 28.07
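The t-interval example can be verified the same way; the critical value t(0.025, 8) = 2.306 is taken from the notes, since the Python standard library has no t-distribution quantile function:

```python
import math

xbar, s, n = 25, 4, 9
t = 2.306                                 # t critical value, df = 8 (from the notes)
margin = t * s / math.sqrt(n)
lower, upper = xbar - margin, xbar + margin
print(round(lower, 2), round(upper, 2))   # 21.93 28.07
```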
Hypothesis:
A hypothesis may be defined simply as a statement about one or more populations. It is frequently concerned with the parameters of the populations about which the statement is made.
Types of Hypotheses
Researchers are concerned with two types of hypotheses
 Research hypotheses
The research hypothesis is the conjecture or supposition that motivates the research. It may be the result of years of observation on the part of the researcher.
 Statistical hypotheses
Statistical hypotheses are hypotheses that are stated in such a way that they may be evaluated by appropriate statistical techniques.
Types of statistical Hypothesis
There are two statistical hypotheses involved in hypothesis testing, and these should be stated explicitly.
 Null Hypothesis:
The null hypothesis is the hypothesis to be tested. It is designated by the symbol H_{0}. The null hypothesis is sometimes referred to as a hypothesis of no difference, since it is a statement of agreement with (or no difference from) conditions presumed to be true in the population of interest.
In general, the null hypothesis is set up for the express purpose of being discredited. Consequently, the complement of the conclusion that the researcher is seeking to reach becomes the statement of the null hypothesis. In the testing process the null hypothesis either is rejected or is not rejected. If the null hypothesis is not rejected, we will say that the data on which the test is based do not provide sufficient evidence to cause rejection. If the testing procedure leads to rejection, we will say that the data at hand are not compatible with the null hypothesis, but are supportive of some other hypothesis.
 Alternative Hypothesis
The alternative hypothesis is a statement of what we will believe is true if our sample data cause us to reject the null hypothesis. Usually the alternative hypothesis and the research hypothesis are the same, and in fact the two terms are used interchangeably. We shall designate the alternative hypothesis by the symbol H_{A} or H_{1}.
LEVEL OF SIGNIFICANCE
The level of significance is a probability and, in fact, is the probability of rejecting a true null hypothesis. The level of significance specifies the area under the curve of the distribution of the test statistic that is above the values on the horizontal axis constituting the rejection region. It is denoted by ‘α’.
Types of Error
In the context of testing of hypotheses, there are basically two types of errors:
 TYPE I Error
 TYPE II Error
Type I Error
 A type I error, also known as an error of the first kind, occurs when the null hypothesis (H_{0}) is true, but is rejected.
 A type I error may be compared with a so called false positive.
 The rate of the type I error is called the size of the test and denoted by the Greek letter α (alpha).
 It usually equals the significance level of a test.
 If type I error is fixed at 5 %, it means that there are about 5 chances in 100 that we will reject H_{0} when H_{0} is true.
Type II Error
 Type II error, also known as an error of the second kind, occurs when the null hypothesis is false, but erroneously fails to be rejected.
 Type II error means accepting the hypothesis which should have been rejected.
 A Type II error is committed when we fail to believe a truth.
 A type II error occurs when one rejects the alternative hypothesis (fails to reject the null hypothesis) when the alternative hypothesis is true.
 The rate of the type II error is denoted by the Greek letter β (beta) and is related to the power of a test (which equals 1 – β).
In tabular form the two errors can be presented as follows:
                                  Null hypothesis (H_{0}) is true    Null hypothesis (H_{0}) is false
Reject null hypothesis            Type I error                       Correct outcome
Fail to reject null hypothesis    Correct outcome                    Type II error
Graphical depiction of the relation between Type I and Type II errors
Reducing Type I Errors
 Prescriptive testing is used to increase the level of confidence, which in turn reduces Type I errors: the chances of making a Type I error fall as the level of confidence rises.
Reducing Type II Errors
 Descriptive testing is used to better describe the test condition and acceptance criteria, which in turn reduces Type II errors. This increases the number of times we reject the null hypothesis – with a resulting increase in the number of Type I errors (rejecting H_{0} when it was really true and should not have been rejected).
 Therefore, reducing one type of error comes at the expense of increasing the other type of error! The same means cannot reduce both types of errors simultaneously.
Power of Test:
Statistical power is defined as the probability of rejecting the null hypothesis while the alternative hypothesis is true.
Power = P(reject H_{0} | H_{1} is true)
= 1 – P(type II error)
= 1 – β
That is, the power of a hypothesis test is the probability that it will reject when it’s supposed to.
(Figure: distributions of the test statistic under H_{0} and under H_{1}; the area of the H_{1} distribution falling in the rejection region is the power.)
Factors that affect statistical power include
 The sample size
 The specification of the parameter(s) in the null and alternative hypothesis, i.e. how far they are from each other, the precision or uncertainty the researcher allows for the study (generally the confidence or significance level)
 The distribution of the parameter to be estimated. For example, if a researcher knows that the statistics in the study follow a Z or standard normal distribution, there are two parameters to estimate: the population mean (μ) and the population variance (σ^{2}). Most of the time, the researcher knows one of the parameters and needs to estimate the other. If that is not the case, some other distribution may be used; for example, if the researcher does not know the population variance, he/she can estimate it using the sample variance, which leads to using a t distribution.
Application:
In research, statistical power is generally calculated for two purposes.
 It can be calculated before data collection based on information from previous research to decide the sample size needed for the study.
 It can also be calculated after data analysis. It usually happens when the result turns out to be nonsignificant. In this case, statistical power is calculated to verify whether the nonsignificant result is due to really no relation in the sample or due to a lack of statistical power.
Relation with sample size:
Statistical power is positively correlated with the sample size, which means that, given the level of the other factors, a larger sample size gives greater power. However, researchers also face a distinction between statistical difference and scientific difference. Although a larger sample size enables researchers to find a smaller difference statistically significant, that difference may not be large enough to be scientifically meaningful. Therefore, it is recommended that researchers have an idea of what they would expect to be a scientifically meaningful difference before doing a power analysis to determine the actual sample size needed.
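To illustrate the power/sample-size relationship, the following sketch uses a normal-approximation formula for the power of a two-sided one-sample z-test; the helper function and the example numbers (shift of 1 unit, σ = 4) are our own assumptions, not from the text:

```python
import math
from scipy.stats import norm

def power_one_sample_z(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test when the true
    mean differs from the hypothesized mean by `delta`."""
    z_crit = norm.ppf(1 - alpha / 2)
    shift = delta * math.sqrt(n) / sigma
    # probability that the test statistic lands in either rejection region
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

# power grows with sample size for a fixed effect (delta = 1, sigma = 4)
for n in (16, 64, 144):
    print(n, round(power_one_sample_z(1, 4, n), 3))
```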
HYPOTHESIS TESTING
Statistical hypothesis testing provides objective criteria for deciding whether hypotheses are supported by empirical evidence.
The purpose of hypothesis testing is to aid the clinician, researcher, or administrator in reaching a conclusion concerning a population by examining a sample from that population.
STEPS IN STATISTICAL HYPOTHESIS TESTING
Step # 01: State the Null hypothesis and Alternative hypothesis.
The alternative hypothesis represents what the researcher is trying to prove. The null hypothesis represents the negation of what the researcher is trying to prove.
Step # 02: State the significance level, α (0.01, 0.05, or 0.1), for the test
The significance level is the probability of making a Type I error. A Type I Error is a decision in favor of the alternative hypothesis when, in fact, the null hypothesis is true.
Type II Error is a decision to fail to reject the null hypothesis when, in fact, the null hypothesis is false.
Step # 03: State the test statistic that will be used to conduct the hypothesis test
The appropriate test statistic for different kinds of hypothesis tests (i.e. t-test, z-test, ANOVA, Chi-square, etc.) is stated in this step.
Step # 04: Computation/ calculation of test statistic
The test statistics for different kinds of hypothesis tests (i.e. t-test, z-test, ANOVA, Chi-square, etc.) are computed in this step.
Step # 05: Find Critical Value or Rejection (critical) Region of the test
Use the value of α (0.01, 0.05, or 0.1) from Step # 02 and the distribution of the test statistics from Step # 03.
Step # 06: Conclusion (Making statistical decision and interpretation of results)
If the calculated value of the test statistic falls in the rejection (critical) region, the null hypothesis is rejected; if it falls in the acceptance (noncritical) region, the null hypothesis is not rejected, i.e. it is accepted.
Note: If we conclude on the basis of the p-value, we compare the calculated p-value to the chosen level of significance. If the p-value is less than α, the null hypothesis is rejected and the alternative is affirmed. If the p-value is greater than α, the null hypothesis is not rejected.
If the decision is to reject, the statement of the conclusion should read as follows: “We reject H_{0} at the _______ level of significance. There is sufficient evidence to conclude that (statement of alternative hypothesis).”
If the decision is to fail to reject, the statement of the conclusion should read as follows: “We fail to reject H_{0} at the _______ level of significance. There is not sufficient evidence to conclude that (statement of alternative hypothesis).”
Rules for Stating Statistical Hypotheses
When hypotheses are stated, an indication of equality (either =, ≤ or ≥) must appear in the null hypothesis.
Example:
We want to answer the question: Can we conclude that a certain population mean is not 50? The null hypothesis is
H_{o} : µ = 50
And the alternative is
H_{A} : µ ≠ 50
Suppose we want to know if we can conclude that the population mean is greater than 50. Our hypotheses are
H_{o}: µ ≤ 50
H_{A}: µ > 50
If we want to know if we can conclude that the population mean is less than 50, the hypotheses are
H_{o} : µ ≥ 50
H_{A}: µ < 50
We may state the following rules of thumb for deciding what statement goes in the null hypothesis and what statement goes in the alternative hypothesis:
 What you hope or expect to be able to conclude as a result of the test usually should be placed in the alternative hypothesis.
 The null hypothesis should contain a statement of equality, either =, ≤ or ≥.
 The null hypothesis is the hypothesis that is tested.
 The null and alternative hypotheses are complementary. That is, the two together exhaust all possibilities regarding the value that the hypothesized parameter can assume.
T TEST
The t-test is used to test hypotheses about μ when the population standard deviation is unknown; the sample size can be small (n < 30).
The distribution is symmetrical, bellshaped, and similar to the normal but more spread out.
Calculating one sample ttest
Example: A random sample of size 16 with mean 25 and standard deviation 5 is taken from a normal population. Test at 5% LOS that:
H_{o}: µ = 22
H_{A}: µ ≠ 22
SOLUTION
Step # 01: State the Null hypothesis and Alternative hypothesis.
H_{o}: µ = 22
H_{A}: µ ≠ 22
Step # 02: State the significance level
α = 0.05 or 5% Level of Significance
Step # 03: State the test statistic (n < 30)
t-test statistic: t = (x̄ – µ)/(S/√n)
Step # 04: Computation/ calculation of test statistic
Data
x̄ = 25
µ = 22
S = 5
n = 16
t_{calculated} = (25 – 22)/(5/√16) = 3/1.25 = 2.4
Step # 05: Find Critical Value or Rejection (critical) Region
For the critical value we find α/2 and v = n – 1, and on the basis of these we read the critical value from the t-distribution table.
Critical value = t_{α/2, v}, with v = 16 – 1 = 15
= t_{(0.025, 15)}
t_{tabulated} = ± 2.131
t _{calculated} = 2.4
Step # 06: Conclusion: Since t_{calculated} = 2.4 falls in the region of rejection, we reject H_{o} at the 5% level of significance. There is sufficient evidence to conclude that the population mean is not equal to 22.
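The six steps above can be condensed into a short sketch from the summary statistics; scipy supplies the t critical value, and the helper name is our own:

```python
import math
from scipy import stats

def one_sample_t(mean, mu0, s, n, alpha=0.05):
    """Two-sided one-sample t-test from summary statistics."""
    t_calc = (mean - mu0) / (s / math.sqrt(n))          # (x̄ - µ0)/(S/√n)
    t_crit = stats.t.ppf(1 - alpha / 2, n - 1)          # t(α/2, n-1)
    reject = abs(t_calc) > t_crit                        # rejection-region rule
    return t_calc, t_crit, reject

# the example above: x̄ = 25, µ0 = 22, S = 5, n = 16
t_calc, t_crit, reject = one_sample_t(25, 22, 5, 16)
print(round(t_calc, 2), round(t_crit, 3), reject)   # 2.4 2.131 True
```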
Z TEST
 The z-test is applied when the distribution is normal and the population standard deviation σ is known, or when the sample size n is large (n ≥ 30) with unknown σ (taking S as an estimator of σ).
 The z-test is used to test hypotheses about μ when the population standard deviation is known and the population distribution is normal, or the sample size is large (n ≥ 30).
Calculating one sample ztest
Example: A random sample of size 49 with mean 32 is taken from a normal population whose standard deviation is 4. Test at 5% LOS that:
H_{o}: µ = 25
H_{A}: µ ≠ 25
SOLUTION
Step # 01: H_{o}: µ = 25
H_{A}: µ ≠ 25
Step # 02: α = 0.05
Step # 03: Since σ is known and n ≥ 30, we apply the z-test statistic: z = (x̄ – µ)/(σ/√n)
Step # 04: Calculation of test statistic
Data
x̄ = 32
µ = 25
σ = 4
n = 49
Z_{calculated} = (32 – 25)/(4/√49) = 7/0.571 = 12.25
Step # 05: Find Critical Value or Rejection (critical) Region
Critical Value (5%) (2tail) = ±1.96
Z_{calculated} = 12.25
Step # 06: Conclusion: Since Z_{calculated} = 12.25 falls in the region of rejection, we reject H_{o} at the 5% level of significance. There is sufficient evidence to conclude that the population mean is not equal to 25.
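The same pattern works for the z-test; scipy supplies the normal critical value, and the helper name is our own:

```python
import math
from scipy.stats import norm

def one_sample_z(mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample z-test (population sigma known)."""
    z_calc = (mean - mu0) / (sigma / math.sqrt(n))   # (x̄ - µ0)/(σ/√n)
    z_crit = norm.ppf(1 - alpha / 2)                 # 1.96 for alpha = 0.05
    return z_calc, z_crit, abs(z_calc) > z_crit

# the example above: x̄ = 32, µ0 = 25, σ = 4, n = 49
z_calc, z_crit, reject = one_sample_z(32, 25, 4, 49)
print(round(z_calc, 2), round(z_crit, 2), reject)   # 12.25 1.96 True
```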
CHI-SQUARE
A statistic which measures the discrepancy (difference) between k observed frequencies f_{o}1, f_{o}2, …, f_{o}k and the corresponding expected frequencies f_{e}1, f_{e}2, …, f_{e}k:
χ² = Σ (f_{o} – f_{e})² / f_{e}
The chi-square is useful in making statistical inferences about categorical data in which there are two or more categories.
Characteristics
 Every χ2 distribution extends indefinitely to the right from 0.
 Every χ2 distribution has only one (right sided) tail.
 As df increases, the χ2 curves get more bell-shaped and approach the normal curve in appearance (but remember that a chi-square curve starts at 0, not at – ∞).
Calculating ChiSquare
Example 1: A U.S. census determined four categories of doctors practicing in different areas:
Specialty  %  Probability 
General Practice  18%  0.18 
Medical  33.9 %  0.339 
Surgical  27 %  0.27 
Others  21.1 %  0.211 
Total  100 %  1.000 
A researcher conducted a test after 5 years to check these data for changes; 500 doctors were selected and asked their specialty. The results were:
Specialty  frequency 
General Practice  80 
Medical  162 
Surgical  156 
Others  102 
Total  500 
Hypothesis testing:
Step # 01:
Null Hypothesis (H_{o}):
There is no difference in the specialty distribution; (or) the current specialty distribution of U.S. physicians is the same as declared in the census.
Alternative Hypothesis (H_{A}):
There is a difference in the specialty distribution of U.S. doctors; (or) the current specialty distribution of U.S. physicians is different from that declared in the census.
Step 02: Level of Significance
α = 0.05
Step # 03: Chi-square test statistic: χ² = Σ (f_{o} – f_{e})² / f_{e}
Step # 04:
Statistical Calculation
fe (80) = 18 % x 500 = 90
fe (162) = 33.9 % x 500 = 169.5
fe (156) = 27 % x 500 = 135
fe (102) = 21.1 % x 500 = 105.5
S # (n)  Specialty  f_{o}  f_{e}  (f_{o} – f_{e})  (f_{o} – f_{e})^{2}  (f_{o} – f_{e})^{2}/f_{e} 
1  General Practice  80  90  –10  100  1.11 
2  Medical  162  169.5  –7.5  56.25  0.33 
3  Surgical  156  135  21  441  3.27 
4  Others  102  105.5  –3.5  12.25  0.12 
Total          4.83 
χ²_{cal} = Σ (f_{o} – f_{e})²/f_{e} = 4.83
Step # 05:
Find the critical region using the χ²-distribution table:
χ²_{tab} = χ²_{(α, d.f)} = χ²_{(0.05, 3)} = 7.815
(d.f = n – 1 = 4 – 1 = 3, where n is the number of categories)
Step # 06:
Conclusion: Since the χ²_{cal} value lies in the region of acceptance, we accept H_{O} and reject H_{A}. There is no difference in specialty distribution among U.S. doctors.
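Assuming scipy is available, scipy.stats.chisquare reproduces this goodness-of-fit test; with unrounded row contributions the statistic is about 4.83 (hand-rounding each row can shift the total slightly), and the conclusion is unchanged:

```python
from scipy.stats import chisquare, chi2

observed = [80, 162, 156, 102]
probs = [0.18, 0.339, 0.27, 0.211]
expected = [p * 500 for p in probs]   # 90, 169.5, 135, 105.5

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
critical = chi2.ppf(0.95, df=len(observed) - 1)   # chi2(0.05, 3) = 7.815
print(round(stat, 2), round(critical, 3), stat > critical)   # 4.83 7.815 False
```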
Example 2: A sample of 150 chronic carriers of a certain antigen and a sample of 500 non-carriers revealed the following blood group distributions. Can one conclude from these data that the two populations from which the samples were drawn differ with respect to blood group distribution? Let α = 0.05.
Blood Group  Carriers  Noncarriers  Total 
O  72  230  302 
A  54  192  246 
B  16  63  79 
AB  8  15  23 
Total  150  500  650 
Hypothesis Testing
Step # 01: H_{O}: There is no association between Antigen and Blood Group
H_{A}: There is some association between Antigen and Blood Group
Step # 02:α = 0.05
Step # 03: Chi-square Test Statistic
Step # 04:
Calculation
f_{e }(72) = 302*150/650 = 70
f_{e }(230) = 302*500/ 650 = 232
f_{e }(54) = 246*150/650 = 57
f_{e }(192) = 246*500/650 = 189
f_{e }(16) = 79*150/650 = 18
f_{e }(63) = 79*500/650 = 61
f_{e }(8) = 23*150/650 = 5
f_{e }(15) = 23*500/650 = 18
f_{o}  f_{e}  (f_{o} – f_{e})  (f_{o} – f_{e})^{2}  (f_{o} – f_{e})^{2}/f_{e} 
72  70  2  4  0.0571 
230  232  –2  4  0.0172 
54  57  –3  9  0.1578 
192  189  3  9  0.0476 
16  18  –2  4  0.2222 
63  61  2  4  0.0655 
8  5  3  9  1.8 
15  18  –3  9  0.5 
Total      2.8674 
χ²_{cal} = Σ (f_{o} – f_{e})²/f_{e} = 2.8674
Step # 05:
Find the critical region using the χ²-distribution table:
χ²_{(α, d.f)} = χ²_{(0.05, 3)} = 7.815, where d.f = (rows – 1)(columns – 1) = (4 – 1)(2 – 1) = 3
Step # 06:
Conclusion: Since the χ²_{cal} value lies in the region of acceptance, we accept H_{O} and reject H_{A}. This means there is no association between Antigen and Blood Group.
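Assuming scipy is available, scipy.stats.chi2_contingency performs this test of independence directly. It uses unrounded expected counts, so its statistic (about 2.4) differs a little from the hand computation with whole-number expecteds, while the conclusion is the same:

```python
from scipy.stats import chi2_contingency

# rows: blood groups O, A, B, AB; columns: carriers, non-carriers
table = [[72, 230],
         [54, 192],
         [16, 63],
         [8, 15]]

stat, p_value, dof, expected = chi2_contingency(table)
# dof = (4 - 1)(2 - 1) = 3; fail to reject at alpha = 0.05
print(round(stat, 2), dof, p_value > 0.05)
```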
WHAT IS TEST OF SIGNIFICANCE? WHY IT IS NECESSARY? MENTION NAMES OF IMPORTANT TESTS.
1. Test of significance
A procedure used to establish the validity of a claim by determining whether or not the test statistic falls in the critical region. If it does, the results are referred to as significant. This test is sometimes called the hypothesis test.
The methods of inference used to support or reject claims based on sample data are known as tests of significance.
Why it is necessary
A significance test is performed;
 To determine if an observed value of a statistic differs enough from a hypothesized value of a parameter
 To draw the inference that the hypothesized value of the parameter is not the true value. The hypothesized value of the parameter is called the “null hypothesis.”
Types of test of significance
 Parametric
 t-test (one sample & two sample)
 z-test (one sample & two sample)
 F-test
 Nonparametric
 Chi-square test
 Mann-Whitney U test
 Coefficient of concordance (W)
 Median test
 Kruskal-Wallis test
 Friedman test
 Rank difference methods (Spearman rho and Kendall's tau)
P –Value:
A p-value is the probability that the computed value of a test statistic is at least as extreme as a specified value of the test statistic when the null hypothesis is true. Thus, the p-value for a test may also be defined as the smallest value of α for which the null hypothesis can be rejected.
The p value is a number that tells us how unusual our sample results are, given that the null hypothesis is true. A p value indicating that the sample results are not likely to have occurred, if the null hypothesis is true, provides justification for doubting the truth of the null hypothesis.
Test Decisions with pvalue
The decision about whether there is enough evidence to reject the null hypothesis can be made by comparing the pvalues to the value of α, the level of significance of the test.
A general rule worth remembering is:
 If the p-value is less than or equal to α, we reject the null hypothesis.
 If the p-value is greater than α, we do not reject the null hypothesis.
If p-value ≤ α  reject the null hypothesis 
If p-value > α  fail to reject the null hypothesis 
Observational Study:
An observational study is a scientific investigation in which neither the subjects under study nor any of the variables of interest are manipulated in any way.
An observational study, in other words, may be defined simply as an investigation that is not an experiment. The simplest form of observational study is one in which there are only two variables of interest. One of the variables is called the risk factor, or independent variable, and the other variable is referred to as the outcome, or dependent variable.
Risk Factor:
The term risk factor is used to designate a variable that is thought to be related to some outcome variable. The risk factor may be a suspected cause of some specific state of the outcome variable.
Types of Observational Studies
There are two basic types of observational studies, prospective studies and retrospective studies.
Prospective Study:
A prospective study is an observational study in which two random samples of subjects are selected. One sample consists of subjects who possess the risk factor, and the other sample consists of subjects who do not possess the risk factor. The subjects are followed into the future (that is, they are followed prospectively), and a record is kept on the number of subjects in each sample who, at some point in time, are classifiable into each of the categories of the outcome variable.
The data resulting from a prospective study involving two dichotomous variables can be displayed in a 2 x 2 contingency table that usually provides information regarding the number of subjects with and without the risk factor and the number who did and did not develop the outcome of interest.
Retrospective Study:
A retrospective study is the reverse of a prospective study. The samples are selected from those falling into the categories of the outcome variable. The investigator then looks back (that is, takes a retrospective look) at the subjects and determines which ones have (or had) and which ones do not have (or did not have) the risk factor.
From the data of a retrospective study we may construct a contingency table
Relative Risk:
Relative risk is the ratio of the risk of developing a disease among subjects with the risk factor to the risk of developing the disease among subjects without the risk factor.
We represent the relative risk from a prospective study symbolically as
RR = [a/(a + b)] / [c/(c + d)]
where a and b are the numbers of subjects with the risk factor who did and did not develop the outcome, and c and d are the corresponding numbers among subjects without the risk factor.
We may construct a confidence interval for RR:
100(1 – α)% CI = RR^(1 ± z_{α}/√χ²)
Where z_{α} is the two-sided z value corresponding to the chosen confidence coefficient and χ² is computed from the contingency table.
Interpretation of RR
 The value of RR may range anywhere between zero and infinity.
 A value of 1 indicates that there is no association between the status of the risk factor and the status of the dependent variable.
 A value of RR greater than 1 indicates that the risk of acquiring the disease is greater among subjects with the risk factor than among subjects without the risk factor.
 An RR value that is less than 1 indicates less risk of acquiring the disease among subjects with the risk factor than among subjects without the risk factor.
EXAMPLE
In a prospective study of pregnant women, Magann et al. (A16) collected extensive information on exercise level of low-risk pregnant working women. A group of 217 women did no voluntary or mandatory exercise during the pregnancy, while a group of 238 women exercised extensively. One outcome variable of interest was experiencing preterm labor. The results are summarized in Table
Estimate the relative risk of preterm labor when pregnant women exercise extensively.
Solution:
By the equation, RR = 1.1.
These data indicate that the risk of experiencing preterm labor when a woman exercises heavily is 1.1 times as great as it is among women who do not exercise at all.
Confidence Interval for RR
We compute the 95 percent confidence interval for RR as follows.
The lower and upper confidence limits are, respectively, 0.65 and 1.86.
Conclusion:
Since the interval includes 1, we conclude, at the .05 level of significance, that the population risk may be 1. In other words, we conclude that, in the population, there may not be an increased risk of experiencing preterm labor when a pregnant woman exercises extensively.
Odds Ratio
An odds ratio (OR) is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.
It is the appropriate measure for comparing cases and controls in a retrospective study.
Odds:
The odds for success are the ratio of the probability of success to the probability of failure.
Two odds that we can calculate from data displayed as in contingency Table of retrospective study
 The odds of being a case (having the disease) to being a control (not having the disease) among subjects with the risk factor is [a/ (a +b)] / [b/ (a + b)] = a/b
 The odds of being a case (having the disease) to being a control (not having the disease) among subjects without the risk factor is [c/(c +d)] / [d/(c + d)] = c/d
The estimate of the population odds ratio is
OR = (a/b) / (c/d) = ad/bc
We may construct a confidence interval for OR by the following method:
100(1 – α)% CI = OR^(1 ± z_{α}/√χ²)
Where z_{α} is the two-sided z value corresponding to the chosen confidence coefficient and χ² is computed from the contingency table.
Interpretation of the Odds Ratio:
In the case of a rare disease, the population odds ratio provides a good approximation to the population relative risk. Consequently, the sample odds ratio, being an estimate of the population odds ratio, provides an indirect estimate of the population relative risk in the case of a rare disease.
 The odds ratio can assume values between zero and ∞.
 A value of 1 indicates no association between the risk factor and disease status.
 A value less than 1 indicates reduced odds of the disease among subjects with the risk factor.
 A value greater than 1 indicates increased odds of having the disease among subjects in whom the risk factor is present.
EXAMPLE
Toschke et al. (A17) collected data on obesity status of children ages 5–6 years and the smoking status of the mother during the pregnancy. Table below shows 3970 subjects classified as cases or noncases of obesity and also classified according to smoking status of the mother during pregnancy (the risk factor).
We wish to compare the odds of obesity at ages 5–6 among those whose mother smoked throughout the pregnancy with the odds of obesity at age 5–6 among those whose mother did not smoke during pregnancy.
Solution
By the formula, OR = 9.62.
We see that obese children (cases) are 9.62 times as likely as nonobese children (noncases) to have had a mother who smoked throughout the pregnancy.
We compute the 95 percent confidence interval for OR as follows.
The lower and upper confidence limits for the population OR are, respectively, 7.12 and 13.00.
We conclude with 95 percent confidence that the population OR is somewhere between 7.12 and 13.00. Because the interval does not include 1, we conclude that, in the population, obese children (cases) are more likely than nonobese children (noncases) to have had a mother who smoked throughout the pregnancy.
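As a sketch, RR and OR can be computed from any 2 x 2 table; the counts below are hypothetical (the textbook tables are not reproduced here), and the CI helper uses the Woolf log-based interval rather than the text's test-based formula:

```python
import math
from scipy.stats import norm

def risk_ratio_odds_ratio(a, b, c, d):
    """a, b: exposed with/without outcome; c, d: unexposed with/without outcome."""
    rr = (a / (a + b)) / (c / (c + d))   # risk among exposed / risk among unexposed
    odds_ratio = (a * d) / (b * c)       # cross-product ratio ad/bc
    return rr, odds_ratio

def or_confidence_interval(a, b, c, d, conf=0.95):
    """Woolf (log-based) confidence interval for the odds ratio."""
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)
    z = norm.ppf(1 - (1 - conf) / 2)
    return (math.exp(math.log(or_hat) - z * se_log),
            math.exp(math.log(or_hat) + z * se_log))

# hypothetical counts, for illustration only
a, b, c, d = 30, 70, 10, 90
print(risk_ratio_odds_ratio(a, b, c, d))
print(or_confidence_interval(a, b, c, d))
```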
Application of Biostatistics:
In Nursing:
 It is said that biostatistics is the tool of all health sciences and is called as the “language of research” because the findings in research are based on biostatistical techniques.
 With the knowledge of biostatistics, nurses/health care workers may be trained in the skilled application of statistical methods to the solution of problems encountered in public health and medicine.
 In nursing biostatistics is an essential tool to determine the effectiveness of nursing procedures based on the collection of records of clinical trials devised in such a scale and such form that valid conclusions can be drawn
 Nurses/health workers have a better understanding of nursing/health care and medical research journals if they have knowledge of biostatistical methods and techniques.
 They collaborate with scientists in nearly every area related to health and have made major contributions to our understanding of AIDS, cancer, and immunology, as well as other areas.
 Further, they spend a considerable amount of time developing and evaluating the statistical methodology used in those projects.
 Biostatistics may prepare health worker/ nursing graduates for work in a wide variety of challenging positions in government, N.G.O’s, international organizations (WHO/UNICEF) and education.
 Health worker/ nursing graduates have found careers involving teaching, research, and consulting in such fields as medicine, public health, life sciences, and survey research.
 It forces the researcher to be definite and exact in his procedures and techniques.
 It enables the researcher to predict “how much” of a thing will happen under conditions he knows and has measured.
 To determine the time interval in which a patient should be given a medicine or any nursing action performed.
In Anatomy and Physiology
 To define what is normal or healthy in a population.
 To find the limits of normality in variables such as weight and pulse rate etc. in a population.
 To find the difference between means and proportions of normal at two places or in different periods.
 To find the correlation between two variables X and Y such as height and weight.
In Pharmacology
 To find the action of drug
 To compare the action of two different drugs or two successive dosages of the same drug.
 To find the relative potency of a new drug with respect to a standard drug.
In medicine
 To compare the efficacy of a particular drug, operation or line of treatment
 To find an association between two attributes such as cancer and smoking.
 To identify signs and symptoms of a disease or syndrome. i.e. Cough in typhoid is found by chance and fever is found in almost every case. The proportional incidence of one symptom or another indicates whether it is a characteristic feature of the disease or not.
 To test usefulness of sera and vaccines in the field.
Example: percentage of attacks or deaths among the vaccinated subjects is compared with that among the unvaccinated ones to find whether the difference observed is statistically significant.
 Design and analysis of clinical trials in medicine
 By learning the methods in biostatistics a student learns to critically evaluate articles published in medical and dental journals or papers read in medical and dental conferences.
 To understand the basic methods of observation in clinical practice and research.
In Clinical Medicine
 Documentation of medical history of diseases.
 Planning and conduct of clinical studies.
 Evaluating the merits of different procedures.
 In providing methods for definition of ‘normal’ and ‘abnormal’.
In Preventive Medicine
 To provide the magnitude of any health problem in the community.
 To find out the basic factors underlying the illhealth.
 To evaluate the health programs which were introduced in the community (success/failure).
 To introduce and promote health legislation.
In Community Medicine and Public Health
 To evaluate the efficacy of sera and vaccines in the field.
 In epidemiological studies, the role of causative factors is statistically tested.
 To test whether the difference between two populations is real or a chance occurrence.
 To study the correlation between attributes in the same population.
 To identify the leading cause of disease or death.
 To measure the morbidity and mortality.
 To evaluate achievements of public health programs.
 To fix priorities in public health programs.
 To help promote health legislation and create administrative standards for oral health.
 It helps in compilation of data, drawing conclusions and making recommendations.
In Genetics
 Statistics and human genetics are twin subjects, having grown together through the century, and there are many connections between the two.
 Some fundamental statistical concepts, in particular Analysis of Variance, first arose in human genetics, while statistical and probabilistic methods are now central to the analysis of many questions in human genetics.
In Environmental Science
Environmental statistics covers
 Baseline studies to document the present state of an environment to provide background in case of unknown changes in the future.
 Targeted studies to describe the likely impact of changes being planned or of accidental occurrences.
 Regular monitoring to attempt to detect changes in the environment.
In Nutrition
 Nutritionists now have advanced methodologies for the analysis of DNA, RNA, proteins, and low-molecular-weight metabolites, as well as access to bioinformatics databases.
 Biostatistics, which can be defined as the process of making scientific inferences from data that contain variability, has historically played an integral role in advancing nutritional sciences.
 Currently, in the era of systems biology, statistics has become an increasingly important tool for quantitatively analyzing information about biological macromolecules.
 Appropriate statistical analyses are expected to make an important contribution to solving major nutritionassociated problems in humans and animals (including obesity, diabetes, cardiovascular disease, cancer, ageing, and intrauterine growth retardation).
In Dental Science:
 To find the statistical difference between means of two groups. Ex: Mean plaque scores of two groups.
 To assess the state of oral health in the community and to determine the availability and utilization of dental care facilities.
 To indicate the basic factors underlying the state of oral health by diagnosing the community and finding solutions to such problems.
 To determine success or failure of specific oral health care programs or to evaluate the program action.
 To promote oral health legislation and in creating administrative standards for oral health care delivery.
Application and Uses of Biostatistics as Figures
 Health and vital statistics are essential tools in demography, public health, medical practice and community services.
 Recording of vital events in birth and death registers and diseases in hospitals is like the bookkeeping of the community, describing the incidence or prevalence of diseases, defects or deaths in a defined population.
 Such events properly recorded form the eyes and ears of a public health or medical administrator.
 What are the leading causes of death?
 What are the important causes of sickness?
 Is a particular disease rising or falling in severity and prevalence? etc.
Logical Reasoning:
Logical reasoning is the process which uses arguments, statements, premises and axioms to determine whether a statement is true or false.
Inductive reasoning:
It is the process of developing generalizations from specific observations. Inductive reasoning makes broad generalizations from specific observations. Even if all of the premises are true in a statement, inductive reasoning allows for the conclusion to be false.
Example: “Harold is a grandfather. Harold is bald. Therefore, all grandfathers are bald.” The conclusion does not follow logically from the statements.
Deductive reasoning:
Deduction is a method for applying a general rule (major premise) to specific situations (minor premise) from which conclusions can be drawn (general to specific). Deductive reasoning provides no new information; it only rearranges what is already known into a new statement or conclusion.
Example:
Major premise: All humans are mortal
Minor premise: Socrates is human
Conclusion: Socrates is mortal
 Inductive reasoning has its place in the scientific method. Scientists use it to form hypotheses and theories. Deductive reasoning allows them to apply the theories to specific situations.
Abductive Reasoning
Another form of reasoning is abductive reasoning. It is based on making and testing hypotheses using the best information available. It often entails making an educated guess after observing a phenomenon for which there is no clear explanation. Abductive reasoning is useful for forming hypotheses to be tested. Abductive reasoning is often used by doctors who make a diagnosis based on test results and by jurors who make decisions based on the evidence presented to them.
Abductive reasoning is the third form of logical reasoning and is somewhat similar to inductive reasoning, since conclusions drawn here are based on probabilities. In abductive reasoning it is presumed that the most plausible conclusion is also the correct one.
Example:
Major premise: The jar is filled with yellow marbles
Minor premise: I have a yellow marble in my hand
Conclusion: The yellow marble was taken out of the jar
The abductive reasoning example shows that the conclusion might seem obvious; however, it is based purely on the most plausible explanation. This type of logical reasoning is mostly used within the field of science and research.
MEASUREMENT:
It may be defined as the assignment of numbers to objects or events according to a set of rules.
Scale of Measurements
Scales of measurement refer to ways in which variables/numbers are defined and categorized.
Each scale of measurement has certain properties which in turn determine the appropriateness for use of certain statistical analyses.
Types of Scale of measurements:
The four scales of measurement are:
 Nominal
 Ordinal
 Interval
 Ratio
Nominal:
The lowest measurement scale is the nominal scale. As the name implies, it consists of “naming” observations or classifying them into various mutually exclusive and collectively exhaustive categories. They represent categories where there is no basis for ordering the categories.
Example:
 diagnostic categories
 sex of the participant
 classification based on discrete characteristics (e.g., hair color)
 Group affiliation (e.g., Republican, Democrat, Boy Scout, etc.)
 the town people live in
 a person’s name
 an arbitrary identification, including identification numbers that are arbitrary
 menu items selected
 any yes/no distinctions
 most forms of classification (species of animals or type of tree)
 location of damage in the brain
Ordinal:
Whenever observations are not only different from category to category but can be ranked according to some criterion, they are said to be measured on an ordinal scale. However, we have no way of knowing how different the categories are from one another.
Example:
 any rank ordering
 class ranks
 Socioeconomic status as low, medium, or high.
 Pain; mild, moderate, severe
Interval:
Interval scales are very similar to standard numbering scales except that they do not have a true zero. That means that the distance between successive numbers is equal, but that the number zero does NOT mean that there is none of the property being measured.
Example:
Temperature is usually measured (degrees Fahrenheit or Celsius). The unit of measurement is the degree, and the point of comparison is the arbitrarily chosen “zero degrees,” which does not indicate a lack of heat.
Ratio:
Ratio scales are the easiest to understand because they are numbers as we usually think of them. The distances between adjacent numbers are equal on a ratio scale and the score of zero on the ratio scale means that there is none of whatever is being measured. Most ratio scales are counts of things.
The ratio scale of measurement is similar to the interval scale in that it also represents quantity and has equality of units. However, this scale also has an absolute zero (no numbers exist below zero). Very often, physical measures will represent ratio data (for example, height and weight). If one is measuring the length of a piece of wood in centimeters, there is quantity, equal units, and that measure cannot go below zero centimeters. A negative length is not possible.
 time to complete a task
 number of responses given in a specified time period
 weight of an object
 size of an object
 number of objects detected
 number of errors made in a specified time period
 proportion of responses in a specified category
Comparison of scales of measurement:
Scale      Indicates    Indicates Direction   Indicates Amount   Absolute
           Difference   of Difference         of Difference      Zero
Nominal    X
Ordinal    X            X
Interval   X            X                     X
Ratio      X            X                     X                  X
Parametric & Nonparametric Statistics:
Interval and ratio data are sometimes referred to as parametric, and nominal and ordinal data as nonparametric. Parametric means that the data meet certain requirements with respect to parameters of the population (for example, the data are normally distributed, paralleling the normal or bell curve). In addition, it means that numbers can be added, subtracted, multiplied, and divided. Parametric data are analyzed using statistical techniques identified as parametric statistics. As a rule, there are more statistical technique options for the analysis of parametric data, and parametric statistics are considered more powerful than nonparametric statistics. Nonparametric data lack those same parameters and cannot be added, subtracted, multiplied, and divided. For example, it does not make sense to add Social Security numbers to get a third person. Nonparametric data are analyzed using nonparametric statistics.
PRELIMINARY CONCEPTS:
DATA
The information given in quantitative or qualitative form regarding a particular characteristic is called data. It is the raw material of statistics.
We may define data as facts and figures. Figures result from the process of counting or from taking a measurement.
For example:
When a hospital administrator counts the number of patients (counting).
When a nurse weighs a patient (measurement)
Types of data:
Primary Data
Secondary Data
Primary data:
The data which are collected directly from the field of enquiry for a specific purpose. These are raw data, or data in original form, collected directly from the population.
Secondary data:
If the data have been collected by, or have passed through, some other agency, they are called secondary data. (OR)
The data presented in an arranged (particular) form so as to serve one’s purpose are called secondary data.
Investigator:
The person who collects the data is known as investigator. He/ she must be:
 Intelligent, reliable and responsible.
 Properly trained and polite.
 Experienced, tactful and well known about the object he/she is dealing with.
Characteristics of Data:
 Quantitative/Measurable (Variables): Discrete, Continuous
 Qualitative/Non-measurable (Attributes): Nominal, Ordinal
Data set:
The data collected for a particular purpose is called data set.
Outlier:
An observation point that is distant from other observations in given set of data.
An outlier is an observation whose value, x, either exceeds the value of the third quartile by a magnitude greater than 1.5(IQR) or is less than the value of the first quartile by a magnitude greater than 1.5(IQR).
That is, an observation with x > Q3 + 1.5(IQR) or x < Q1 − 1.5(IQR) is called an outlier.
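The fences above can be computed directly; a minimal Python sketch (the function name and sample data are illustrative, and `statistics.quantiles` with the inclusive method is one of several quartile conventions, which can shift the fences slightly):

```python
from statistics import quantiles

def find_outliers(data):
    """Return observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # quantiles() with n=4 returns the three quartile cut points.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lower or x > upper]

# 120 lies far above the rest, so it is flagged as an outlier.
print(find_outliers([4, 5, 6, 7, 8, 9, 10, 120]))  # [120]
```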
Variable:
It is a characteristic that can take different values for the elements in the data set.
Random Variable:
When the values obtained arise as a result of chance factors, so that they cannot be exactly predicted in advance, the variable is called a random variable.
Types of variables;
Variable are of two types depending upon the form of the characteristics.
 Quantitative Variables:
It is a variable whose characteristic of interest is measurable or can be expressed numerically.
Example: age, weight, height, income, length, etc.
Types of Quantitative Variables:
Quantitative variables are of two types:
 Discrete Variables (Discontinuous/meristic)
A discrete variable is characterized by gaps or interruptions in the values that it can assume. These gaps or interruptions indicate the absence of values between particular values that the variable can assume.
These are quantities which can be measured in whole integral values. A discrete variable does not take fractional values. They assume a finite or countable number of possible values. It is usually obtained by counting.
Example:
 The number of daily admissions to a general hospital.
 Number of students in a class.
 Number of patients in a ward.
The data which are described by discrete variables are called discrete data.
 Continuous variable:
These are quantities which can take any value in specified range. Thus it can take both integral and fractional values. They assume an infinite number of possible values. It is usually obtained by measurement.
Example: Height, weight etc.
The data described by continuous variables are called continuous data.
 Qualitative variable:
These are non-measurable characteristics, which cannot be numerically expressed in terms of some unit and are also known as attributes. It is a variable whose values are non-numerical.
Example: color, sex, intelligence, Religion, Nationality, Illiteracy etc.
Types of Qualitative variable:
 Nominal Variable:
A categorical measurement expressed not in terms of numbers, but rather by means of a natural language description; there is no natural ordering of the categories.
The data which are described by nominal variables are called nominal data.
Examples: gender, race, religion etc.
 Ordinal Variable:
A categorical measurement expressed not in terms of numbers, but rather by means of a natural language description, where the categories are ordered. The distance between these categories cannot be measured.
Population:
The collection of all observations (elements) relating to a characteristic is called statistical population or simply population.
(OR)
The collection of all individuals or items under consideration in a statistical study.
Populations may be finite or infinite.
 Finite:
If a population consists of fixed number of values; it is said to be finite.
Example: Number of days in a week.
 Infinite:
If a population consists of an endless succession of values, it is said to be infinite
Example: Number of animals in ocean.
Parameter:
Numerical descriptive measures corresponding to populations are called parameters.
Target Population:
The target population is the population about which one wishes to make an inference.
Sample:
It is a relatively small group of selected number of individuals or objects drawn from a particular population and is used to throw light on the population characteristics.
(OR)
The observed sets of measurements that are subsets of a corresponding population
Statistics
Numerical descriptive measures corresponding to samples are called statistics.
Random Sample:
It is a sample chosen in a very specific way and has been selected in such a way that every element in the population has an equal opportunity of being included in the sample.
Statistical Error:
The extent to which the observed value of a quantity exceeds the true value.
Error = Observed Value – True Value
Types of Statistical Error:
Statistical error may be classified as
 Biased error: it arises due to personal prejudices or bias of the investigator or informant.
 Unbiased error: it enters into statistical enquiry due to chance causes.
Array:
The presentation of data in ascending order of magnitude is called array.
PRESENTATION OF DATA OR INFORMATION
Data obtained by the investigator is irregularly documented and is unorganized. This unorganized data is called raw data. It is organized in a specific sequence and is presented in such a way as to make it easily understandable.
Classification:
It is the process of arranging the raw data under different categories or classes according to some common characteristics possessed by an individual member.
Examples:
Patients in hospitals are classified according to disease.
Presentation of Statistical data:
Presentation of Statistical Data
Textual presentation
Tabular Presentation
Graphical Presentation
Textual presentation:
 Numerical data presented in a descriptive form are called textual presentation.
 It is lengthy, and some words may repeat several times in the text.
 It becomes difficult to grasp salient points in a textual presentation.
Tabular presentation:
 The logical and systematic presentation of numerical data in rows and columns designed to simplify the presentation and facilitate comparison is termed as tabulation.
 Tabulation is thus a form of presenting quantitative data in condensed and concise form so that the numerical figures are capable of easy and quick reception by the eyes.
 It is more convenient than textual presentation.
Parts of a Table
 Table number: A table should be numbered for easy identification and reference in future. The table number may be given either in the centre or side of the table but above the top of the title of the table. If the number of columns in a table is large, then these can also be numbered so that easy reference to these is possible.
 Title of the table: Each table must have a brief, selfexplanatory, and complete title which can
 Indicate the nature of data contained.
 Explain the locality (i.e., geographical or physical) of data covered.
 Indicate the time (or period) of data obtained.
 Contain the source of the data to indicate the authority for the data, as a means of verification and as a reference. The source is always placed below the table.
 Caption and stubs: The headings for columns and rows are called caption and stub, respectively. They must be clear and concise.
 Body: The body of the table should contain the numerical information. The numerical information is arranged according to the descriptions given for each column and row.
 Prefatory or head note: If needed, a prefatory note is given just below the title for its further description in a prominent type. It is usually enclosed in brackets and is about the unit of measurement.
 Footnotes: Anything written below the table is called a footnote. It is written to further clarify either the title captions or stubs. For example, if the data described in the table pertain to profits earned by a company, then the footnote may define whether it is profit before tax or after tax. There are various ways of identifying footnotes:
 Numbering footnotes consecutively with small numbers 1, 2, 3, …, letters a, b, c, …, or stars *, **, …
 Sometimes symbols like @ or $ are also used to identify footnotes.
 Source notes: The source note is given at the end of the table indicating the source from which the information has been taken. It includes information about the compiling agency, publication, etc.
A blank model table is given below:
—THE TITLE—
—Prefatory Notes—
+-----------------+---------------------+
|   —Box Head—    |  —Column Captions—  |
| (Row Captions)  |                     |
+-----------------+---------------------+
| —Stub Entries—  |     —The Body—      |
+-----------------+---------------------+
Foot Notes…
Source Notes…
Types of tabulation:
There are two types of tabulation:
 Simple tabulation: it contains data in respect of one characteristic only
 Complex tabulation: it contains data on more than one characteristic simultaneously.
Example:
Simple tabulation: No. of students in three classes of B.S.N
Name of Class    No. of students
B.S.N I          42
B.S.N II         48
B.S.N III        50
Complex tabulation: No. of students in three classes of B.S.N
Name of Class    No. of students      Total
                 Male     Female
B.S.N I          12       30          42
B.S.N II         08       40          48
B.S.N III        05       45          50
Contingency Table
A contingency table is an arrangement of data in a two-way classification. The data are sorted into cells, and the count for each cell is reported. The contingency table involves two factors (or variables), and a common question concerning such tables is whether the data indicate that the two variables are independent or dependent.
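The sorting of raw observations into the cells of a two-way table can be sketched in Python; the (class, sex) records below are hypothetical and chosen only to illustrate the counting step:

```python
from collections import Counter

# Hypothetical raw records: one (row category, column category)
# pair per individual observed.
records = [
    ("B.S.N I", "Male"), ("B.S.N I", "Female"), ("B.S.N II", "Female"),
    ("B.S.N I", "Female"), ("B.S.N II", "Male"), ("B.S.N III", "Female"),
]

# Each distinct (row, column) pair is one cell; Counter reports its count.
cells = Counter(records)

rows = ["B.S.N I", "B.S.N II", "B.S.N III"]
cols = ["Male", "Female"]
for r in rows:
    counts = [cells[(r, c)] for c in cols]
    print(r, counts, "row total:", sum(counts))
```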
Observation:
The values of variable obtained by observations are termed as observed values or observation.
Frequency:
The frequency (f) of a particular observation is the number of times the observation occurs in the data.
Frequency distribution:
Frequency distribution is a statistical table which shows the values of variable arranged in order of magnitude either individually or in groups and also the corresponding frequencies side by side.
Types of frequency distribution:
Frequency distribution tables can be used for both categorical and numeric variables. Continuous variables should only be used with class intervals. A frequency distribution is a summary of how often different scores occur within a sample of scores.
Frequency Distribution:
 Quantitative Frequency Distribution
   - Simple/Ungrouped Frequency Distribution (Range ≤ 20 digits)
   - Grouped Frequency Distribution (Range > 20 digits)
 Qualitative Frequency Distribution
Frequency distribution table of Non-Measurable (Qualitative) Data.
Example:
Suppose that you are collecting data on the blood groups of college students. After conducting a survey of 30 of your classmates, you are left with the following set of observations: A, A, A, O, AB, B, AB, AB, AB, O, O, O, B, A, B, AB, A, B, AB, O, A, B, AB, AB, B, AB, A, A, A, AB
In order to make sense of this information, you need to find a way to organize the data. A frequency distribution is commonly used to categorize information so that it can be interpreted quickly in a visual way. In our example above, the blood groups serve as the categories, and the occurrences of each group are then tallied.
Example of a Frequency Distribution
Blood Group    Tally Marks      Frequency
O              │││││            5
A              │││││ ││││       9
B              │││││ │          6
AB             │││││ │││││      10
Total                           30
Frequency distribution table of Measurable (Quantitative) Data.
Let’s suppose that you are collecting data on how many hours of sleep college students get each night. After conducting a survey of 30 of your classmates, you are left with the following set of scores:
7, 5, 8, 9, 4, 10, 7, 9, 9, 6, 5, 11, 6, 5, 9, 10, 8, 6, 9, 7, 9, 8, 4, 7, 8, 7, 6, 10, 4, 8
In order to make sense of this information, you need to find a way to organize the data. A frequency distribution is commonly used to categorize information so that it can be interpreted quickly in a visual way. In our example above, the number of hours of sleep each night serves as the categories, and the occurrences of each number are then tallied.
Example of a Frequency Distribution
Hours of Sleep   Tally Marks   Frequency
4                │││           3
5                │││           3
6                ││││          4
7                │││││         5
8                │││││         5
9                │││││ │       6
10               │││           3
11               │             1
Total                          30
Constructing a Simple frequency distribution table
 Construct a table with three columns.
 Write all observation in ascending order in first column
 Select the first item and see in which observation it falls, draw a small tally mark (/) against it in the second column, and also tick (✓) the concerned item. Continue this way until the last item is ticked. If some element is reported many times, mark a separate tally mark for each occurrence.
 These tallies are marked in sets of five; the fifth tally in each set is marked across the other four. i.e. ////
 Count the number of tally marks for each mark and write it in frequency column.
Example:
A survey was taken on Maple Avenue. In each of 20 homes, people were asked how many cars were registered to their households. The results were recorded as follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Frequency table for the number of cars registered in each household

Number of cars (x)   Tally      Frequency (f)
0                    ││││       4
1                    │││││ │    6
2                    │││││      5
3                    │││        3
4                    ││         2
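The tally-and-count steps above can be sketched in Python using the Maple Avenue data; `collections.Counter` does the tallying, and sorting the keys gives the ascending first column:

```python
from collections import Counter

# Car-registration data from the Maple Avenue survey.
data = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]

freq = Counter(data)          # counting replaces the manual tally marks
for value in sorted(freq):    # observations in ascending order
    print(value, "/" * freq[value], freq[value])
print("Total:", sum(freq.values()))
```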
Grouped Frequency Distribution
Terms:
Class interval
The size, width or length of a class interval is the difference between the upper (or lower) limits of any two consecutive classes. It is denoted by ‘h’.
If a variable takes a large number of values (Range > 20), then it is easier to present and handle the data by grouping the values into class intervals. Continuous variables are more likely to be presented in class intervals, while discrete variables may or may not be grouped. The class intervals should be contiguous and non-overlapping, such that each value in the set of observations can be placed in one, and only one, of the intervals.
Frequency:
The frequency of a class interval is the number of observations that occur in a particular predefined interval.
Endpoint:
The endpoints of a class interval are the lowest and highest values that a variable can take.
Class width:
Class width is the difference between the lower endpoint of an interval and the lower endpoint of the next interval. It is denoted by ‘w’.
OR
It is the range or length of a class interval or difference between the upper and lower class boundaries.
Number of classes:
There is no hard and fast rule for finding the exact number of classes. A commonly followed rule of thumb states that there should be no fewer than five intervals and no more than 15. If there are fewer than five intervals, the data have been summarized too much and the information they contain has been lost. If there are more than 15 intervals, the data have not been summarized enough. It is also important to make sure that the class intervals are mutually exclusive.
Sturges’ formula
Those who need more specific guidance in deciding how many class intervals to employ may use a formula given by Sturges (1). This formula gives k = 1 + 3.322 log₁₀(n), where k stands for the number of class intervals and n is the number of values in the data set under consideration. The answer obtained by applying Sturges’s rule should not be regarded as final, but should be considered as a guide only.
The number of class intervals specified by the rule should be increased or decreased for convenience and clear presentation.
Number of classes (k) = 1 + 3.322 log₁₀(n)
Range = Max Value − Min Value
Size of class = Range / Number of classes
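The three working rules above can be combined in a short Python sketch; the data summary (n = 60 values ranging from 12 to 95) is hypothetical, and the rounding choices are one reasonable convention rather than part of Sturges’s rule itself:

```python
import math

def sturges_classes(n):
    """Sturges' rule: k = 1 + 3.322 * log10(n), rounded to a whole number."""
    return round(1 + 3.322 * math.log10(n))

# Hypothetical data set: 60 observations between 12 and 95.
n, data_min, data_max = 60, 12, 95

k = sturges_classes(n)            # number of class intervals
data_range = data_max - data_min  # Range = Max Value - Min Value
h = math.ceil(data_range / k)     # class size, rounded up for convenience
print(k, data_range, h)           # 7 83 12
```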
Class limits:
The two numbers used to specify the limits of a class interval for the purpose of tallying the original observations into the various classes, are called class limits.
 The smallest of the pair is known as lower class limit i.e. The smaller number in each class is the lower class limit (l_{1})
 The largest of the pair is called upper class limit. i.e. the larger number is the upper class limits (l_{2}) of the class.
Class mark or mid – point of a class
 It is the midvalue of a class or class interval exactly at the middle of the class or class interval.
 It lies half way between the class limits or between the class boundaries.
 It is used as representative value of the class interval for the calculation of mean, standard deviation, mean deviation etc.
 It is the average of the lower and upper class limits.
Class mark = (Lower class limit + Upper class limit) / 2
Mid-point (x) = (l_{1} + l_{2}) / 2
Class boundaries (or exact class limits)
These are the precise points separating the class from adjoining classes. A class boundary is always located midway between the upper limit of the class and lower limit of the next higher class.
Construction of class boundaries:
Steps in the construction of class boundaries from the class limits are;
 Find the difference between the lower limit of a class and the upper limit of the preceding class, denoted by d.
 Subtract d/2 from lower limit of the class to get lower boundary of that class and add d/2 to the upper limit to get upper boundary of the class.
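The two steps can be sketched as a small Python helper (the function name and the 10–19, 20–29, 30–39 class limits are illustrative):

```python
def class_boundaries(limits):
    """Turn class limits [(l1, l2), ...] into exact class boundaries.

    d is the gap between the upper limit of one class and the lower
    limit of the next; half of d is shifted to each side.
    """
    d = limits[1][0] - limits[0][1]
    return [(lo - d / 2, hi + d / 2) for lo, hi in limits]

# Classes 10-19, 20-29, 30-39 have gap d = 1, so adjacent
# boundaries meet at 19.5 and 29.5.
print(class_boundaries([(10, 19), (20, 29), (30, 39)]))
```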
Percentage Frequency:
It represents the relative percentage of the total cases in any class interval. It is obtained by dividing the number of cases in the class interval by the total number of cases and then multiplying by 100.
Percentage frequency of class = (Frequency of the class / Total frequency) × 100
Relative Frequency:
 It is the ratio of the frequency of the class to the total frequency.
 It is not expressed in percentage.
 Relative frequencies are used to compare two or more frequency distributions or two or more items in the same frequency distribution.
Relative frequency = Frequency of the class / Total frequency
Cumulative Relative Frequency:
 Cumulative frequency corresponding to a class is the sum of all the frequency up to and including that class.
 It is obtained by adding to the frequency of that class and all the frequencies of the previous classes.
 It gives the proportion of individuals having a measurement less than or equal to the upper boundary of the class interval.
Frequency Density:
Frequency density of a class or class interval is its frequency per unit width. It shows the concentration of frequency in a class.
It is used in drawing histogram when the classes are of unequal width.
Frequency density = Class frequency / Width of the class
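The derived quantities defined above (percentage, relative, cumulative relative frequency, and frequency density) can be computed together; the three-class distribution below is hypothetical, with a deliberately wider last class to show why density matters:

```python
# Hypothetical grouped distribution: (lower boundary, upper boundary, frequency).
classes = [(0, 10, 5), (10, 20, 15), (20, 40, 10)]   # last class is twice as wide

total = sum(f for _, _, f in classes)
cum = 0.0
for lo, hi, f in classes:
    rel = f / total              # relative frequency (a proportion, not a %)
    cum += rel                   # cumulative relative frequency
    density = f / (hi - lo)      # frequency per unit of class width
    print(f"{lo}-{hi}: rel={rel:.3f} pct={rel * 100:.1f}% cum={cum:.3f} density={density:.2f}")
```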
Construction of Grouped Frequency Distribution:
 Construct a table with three columns.
 Determine the range, i.e. the difference b/w the highest and the lowest observation.
 Decide about the number of classes or the length of class interval (h), using the working rule:
Number of classes = range / h
 Number of classes should be b/w 5 and 15.
 Determine the starting point and the remaining class limits. If several values of the variable are to be included in one class, the class limits should be designated in terms of “this amount to that amount”. Thus, if h is 5, we start with one of the values 0, 5, 10, 15, …, and if h is 3, we start with one of the values 0, 3, 6, 9, 12, …, etc.
 Distribute the data into appropriate classes by Tally method.
Select the first item and see in which class it falls, draw a small tally mark (/) against that class, and also tick (✓) the concerned item. Continue this way until the last item is ticked. If some element is reported many times, or several elements fall in the same class, mark a separate tally mark for each.
These tallies are marked in sets of five; the fifth tally in each set is marked across the other four. i.e. ////
 Count the number of tally marks for each mark and write it in frequency column.
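The whole procedure can be sketched as a Python function; the function name, the choice of inclusive whole-number limits, and the sample observations are illustrative assumptions:

```python
import math

def grouped_frequency(data, h, start):
    """Tally observations into classes of width h beginning at `start`.

    Classes use inclusive limits (lower, lower + h - 1), so each
    whole-number observation falls in exactly one class.
    """
    k = math.ceil((max(data) - start + 1) / h)   # enough classes to cover the data
    table = []
    for i in range(k):
        lower = start + i * h
        upper = lower + h - 1
        f = sum(lower <= x <= upper for x in data)   # the tally step
        table.append(((lower, upper), f))
    return table

marks = [23, 5, 14, 27, 8, 19, 31, 12, 25, 9]   # hypothetical observations
print(grouped_frequency(marks, h=10, start=0))
```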
GRAPHICAL PRESENTATION OF DATA
The presentation of quantitative data by graphs and charts are termed as graphical presentation.
It gives the reader a nice overview of the essential features of the data. Graphs are designed to give an intuitive feeling of the data at a glance.
Therefore graphs:
 Should be selfexplanatory
 Must have title
 Must have labeled axis
 Should mention unit of observation
 Should be simple & clean
Advantages of Graph Representation
 It is easy to read
 It is easy to understand by all.
 It shows relationship between two or more sets of observations.
 It is universally applicable
 It is attractive in representation
 It helps in proper estimation, evaluation, and interpretation of the characteristics of items and individuals
 It has more lasting effect on brain
 It simplifies complex data
 It indicates trend, and therefore, helps in forecasting.
Disadvantages of Graph Representation
 It is time consuming.
 Finer details may be lost during preparation
 It represents only approximate values.
Graphical Presentation of Statistical data:
 Grouped and ungrouped data may be presented as:
Line Graphs
 A line chart or line graph is a type of chart which displays information as a series of data points called ‘markers’ connected by straight line segments.
 These are drawn on plane paper by plotting the data concerning one variable on the horizontal x-axis (abscissa) and the other variable on the y-axis (ordinate), which intersect at a point called the origin.
 With the help of such graphs the effect of one variable upon another variable during and experimental study may be clearly demonstrated.
 For each pair of corresponding X, Y values, we plot a point on the graph paper. The points thus generated are then joined by straight line segments successively. The figure thus formed is called a line diagram or line graph.
Example
In the experimental sciences, data collected from experiments are often visualized by a graph. For example, if one were to collect data on the speed of a body at certain points in time, one could visualize the data by a data table such as the following:
Elapsed Time (s)   Speed (m s⁻¹)
0                  0
1                  3
2                  7
3                  12
4                  20
5                  30
6                  45
Graph of Speed Vs Time
Bar Diagram:
 A bar diagram is a graph on which the data are represented in the form of bars; it is useful in comparing qualitative or quantitative data of discrete type.
 It consists of a number of equally spaced rectangular bars of equal width, originating from a horizontal base line (x-axis).
 The length of each bar is proportional to the value it represents; the bars should be neither too short nor too long.
 They are shaded or coloured suitably.
 The bars may be vertical or horizontal. If the bars are placed horizontally, it is called a horizontal bar diagram; when the bars are placed vertically, it is called a vertical bar diagram.
 It is used with discrete qualitative variables and provides a visual comparison of figures.
Types of Bar Diagram
There are three types of bar diagrams:
 Simple bar diagram
 Multiple or grouped bar diagram
 Component or subdivided bar diagram
Simple bar chart:
Represent one type of data (variable).
Example:
Following is an example of a bar chart which shows the educational status of a certain area.
Multiple Bar charts:
Such charts are useful for direct comparison between two or more sets of data. The technique of drawing such a chart is the same as that of a single bar chart, with the difference that each set of data is represented in different shades or colours on the same scale. An index (key) explaining the shades or colours must be given.
Example:
Draw a multiple bar chart to represent the import and export of Canada (values in $) for the years 1991 to 1995.
Years   Imports   Exports
1991    7930      4260
1992    8850      5225
1993    9780      6150
1994    11720     7340
1995    12150     8145
Multiple bar chart showing the imports and exports of Canada from 1991 – 1995.
Component bar chart:
A subdivided or component bar chart is used to represent data in which the total magnitude is divided into different components.
In this diagram, we first make simple bars for each class taking the total magnitude in that class, and then divide these bars into parts in the ratio of the various components. This type of diagram shows the variation in the different components within each class as well as between classes. Different shades or colours are used to distinguish the various components, and a key should be given with the diagram. It is also known as a stacked chart.
Example:
The table below shows the quantity in hundred kgs of Wheat, Barley and Oats produced on a certain farm during the years 1991 to 1994.
Years   Wheat   Barley   Oats   Total
1991    34      18       27     79
1992    43      14       24     81
1993    43      16       27     86
1994    45      13       34     92
Pie Chart
 It is a circular graph whose area is subdivided into sectors by radii in such a way that the areas of the sectors are proportional to the angles at the centre.
 The area of the circle represents the total value and the different sectors of the circle represent the different parts.
 It is generally used for comparing the relation between the various components of a value and between components and the total value.
 The data are expressed as percentages: each component is expressed as a percentage of the total value.
Working procedure:
 Plot a circle of an appropriate size. The total angle of a circle is 360°.
 Convert the given value of each component of an item into a percentage of the total value of the item, and then into an angle:

Angle of sector = (Value of component / Total value of item) × 360

 In the pie chart, the largest sector is placed at the top and the others follow in sequence running clockwise.
 Measure off, with a protractor, the points on the circle representing the size of each sector. Label each sector for identification.
Example:
A family’s weekly expenditure on its house mortgage (finance), food and fuel is as follows. Draw a pie chart:

Expense     $
Mortgage    300
Food        225
Fuel        75
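Following the working procedure above, the sector angles can be computed directly. A minimal sketch in Python, using the family’s expenses from the example:

```python
# Sketch: convert each component value into a pie-chart sector angle
# using Angle = (value of component / total value) x 360.
expenses = {"Mortgage": 300, "Food": 225, "Fuel": 75}

total = sum(expenses.values())
angles = {name: value / total * 360 for name, value in expenses.items()}

for name, angle in angles.items():
    print(f"{name}: {angle:.0f} degrees")   # Mortgage: 180, Food: 135, Fuel: 45
```

The angles always sum to 360, which is a handy check before drawing the chart.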
Histogram
It is the most common form of diagrammatic representation of a grouped frequency distribution of both continuous and discontinuous types, in which the frequencies are represented in the form of bars. The area, and more especially the height, of each rectangle is proportional to the frequency.
Working Procedure:
 Convert the data from an inclusive series into an exclusive series (make class boundaries if the classes do not coincide; discontinuous class intervals).
 Take the class intervals (class boundaries) and plot them on the x-axis.
 Take two extra class intervals, one below and one above the given grouped intervals.
 Plot a separate rectangle for each class interval. The base of each rectangle is the width of the class interval and the height is the frequency of that class.
 Frequencies are plotted on the y-axis.
Age     Class Boundaries   Frequency
30–39   29.5–39.5          11
40–49   39.5–49.5          46
50–59   49.5–59.5          70
60–69   59.5–69.5          45
70–79   69.5–79.5          16
80–89   79.5–89.5          1
Frequency Polygon:
It is an area diagram, represented in the form of a curve, obtained by joining the middle points of the tops of the rectangles in a histogram, or by joining the midpoints of the class intervals at the heights of their frequencies by straight lines.
Cumulative Frequency Polygon (Ogive)
The graphical representation of a cumulative frequency distribution, where the cumulative frequencies are plotted against the corresponding class boundaries and the successive points are joined by straight lines, is known as the ogive or cumulative frequency polygon.
Working procedure:
 The upper limits of the classes are represented along the x-axis.
 The cumulative frequency of a particular class is taken along the y-axis.
Class interval   Class Boundaries   f    c.f
151–155          150.5–155.5        8    8
156–160          155.5–160.5        7    15
161–165          160.5–165.5        15   30
166–170          165.5–170.5        9    39
171–175          170.5–175.5        9    48
176–180          175.5–180.5        2    50
 The points corresponding to the cumulative frequency at each upper class limit are joined by a freehand curve.
Stem-and-Leaf Displays
A stem-and-leaf display bears a strong resemblance to a histogram and serves the same purpose. It provides information regarding the range of the data set, shows the location of the highest concentration of measurements, and reveals the presence or absence of symmetry. An advantage of the stem-and-leaf display over the histogram is that it preserves the information contained in the individual measurements.
Another advantage of stem-and-leaf displays is that they can be constructed during the tallying process, so the intermediate step of preparing an ordered array is eliminated.
Working procedure:
 To construct a stem-and-leaf display we partition each measurement into two parts.
 The first part is called the stem, and the second part is called the leaf.
 The stem consists of one or more of the initial digits of the measurement, and the leaf is composed of one or more of the remaining digits.
 The stems form an ordered column with the smallest stem at the top and the largest at the bottom. We include in the stem column all stems within the range of the data even when a measurement with that stem is not in the data set.
 The rows of the display contain the leaves, ordered and listed to the right of their respective stems.
 The stems are separated from their leaves by a vertical line.
Example:
The following example illustrates the construction of a stem-and-leaf display.
44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106
Stem Leaves
4 4, 6, 7, 9
5
6 3, 4, 6, 8, 8
7 2, 2, 5, 6
8 1, 4, 8
9
10 6
Key: 63=63
Leaf unit: 1.0
Stem unit: 10.0
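The working procedure above can be sketched in a few lines of Python; this is a minimal illustration (stem = tens digit, leaf = units digit) using the same data, including empty stems within the range:

```python
# Sketch: build a stem-and-leaf display with stem = tens and leaf = units.
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

stems = {}
for value in sorted(data):
    stems.setdefault(value // 10, []).append(value % 10)

# Include every stem in the range, even when it has no leaves (e.g. 5 and 9).
for stem in range(min(stems), max(stems) + 1):
    leaves = ", ".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```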
Box-and-Whisker Plots
A useful visual device for communicating the information contained in a data set is the box-and-whisker plot. The construction of a box-and-whisker plot (sometimes called, simply, a box plot) makes use of the quartiles of a data set and may be accomplished by following these five steps:
 Represent the variable of interest on the horizontal axis.
 Draw a box in the space above the horizontal axis in such a way that the left end of the box aligns with the first quartile Q_{1} and the right end of the box aligns with the third quartile Q_{3}
 Divide the box into two parts by a vertical line that aligns with the median
 Draw a horizontal line called a whisker from the left end of the box to a point that aligns with the smallest measurement in the data set.
 Draw another horizontal line, or whisker, from the right end of the box to a point that aligns with the largest measurement in the data set.
Examination of a box-and-whisker plot for a set of data reveals information regarding the amount of spread, location of concentration, and symmetry of the data.
Example:
The following example illustrates the construction of a box-and-whisker plot.
The smallest and largest measurements are 14.6 and 44, respectively.
First quartile Q_{1} = 27.25, median Q_{2} = 31.1, and third quartile Q_{3} = 33.525.
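The five steps above need the minimum, the quartiles, and the maximum. A minimal sketch, using a small hypothetical data set (the original raw data are not listed) and the median-of-halves convention for quartiles; other quartile definitions differ slightly:

```python
# Sketch: five-number summary (min, Q1, median, Q3, max) for a box plot.
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def five_number_summary(values):
    s = sorted(values)
    n = len(s)
    lower = s[: n // 2]          # lower half (median excluded when n is odd)
    upper = s[(n + 1) // 2 :]    # upper half
    return min(s), median(lower), median(s), median(upper), max(s)

# Hypothetical data set for illustration:
print(five_number_summary([14.6, 20, 27, 28, 31, 31.2, 33, 34, 44]))
```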
Measure of Central Tendency
Central tendency or central position or statistical averages reflects the central point or the most characteristic value of a set of measurements. The measure of central tendency describes the one score that best represents the entire distribution.
(OR)
A single figure that describes the entire series of observations with their varying sizes, occupying a central position.
The most common measures of central tendency are
 Mean
 Median
 Mode
Characteristics of Central Tendency:
 It should be rigidly defined.
 An average should be properly defined so that it has one and only one interpretation.
 The average should not depend on the personal prejudice or bias of the investigator.
 It should be based on all items.
 It should be easily understood.
 It should not be unduly affected by extreme values.
 It should be least affected by fluctuations of sampling.
 It should be easy to interpret.
 It should be easily subjected to further mathematical calculations.
Measure of Central Tendency: choice of method
 If n ≤ 15, use the Direct Method.
 If n > 15, use the Frequency Distribution Method:
 Simple/ungrouped frequency distribution (range ≤ 20 digits)
 Grouped frequency distribution (range > 20 digits)
Mean:
It is defined as a value which is obtained by dividing the sum of all the values by the number of observations. Thus the arithmetic mean of a set of values x_{1}, x_{2}, x_{3}, . . ., x_{n} is denoted by x̄ (read as “x bar”) and is calculated as:

x̄ = (x_{1} + x_{2} + . . . + x_{n})/n = ∑x/n (Direct Method)

Where the sign ∑ stands for the sum and “n” is the number of observations.
Example:
The grades of a student in five examinations were 67, 75, 81, 87, 90. Find the arithmetic mean of the grades.
Solution:
x̄ = ∑x/n = (67 + 75 + 81 + 87 + 90)/5 = 400/5 = 80
Thus, the mean grade is 80.
Method of Finding Mean
If x_{1}, x_{2}, x_{3}, . . ., x_{n} are the values of different observations and f_{1}, f_{2}, f_{3}, . . ., f_{n} are their frequencies, then:

x̄ = A.M. = (f_{1}x_{1} + f_{2}x_{2} + . . . + f_{n}x_{n})/(f_{1} + f_{2} + . . . + f_{n}) = ∑fx/∑f
Example 2. The number of children of 80 families in a village are given below:
No. of Children/Family   1   2    3    4    5    6
No. of Families          8   10   10   25   20   7
Calculate mean.
Solution: let x_{i} represent the number of children per family and f_{i} represent the number of families. The calculations are presented in the following table:
No. of Children/Family (x_{i})   No. of Families (f_{i})   f_{i}x_{i}
1                                8                         8
2                                10                        20
3                                10                        30
4                                25                        100
5                                20                        100
6                                7                         42
                                 n = ∑f_{i} = 80           ∑f_{i}x_{i} = 300

Thus x̄ = ∑f_{i}x_{i}/∑f_{i} = 300/80 = 3.75
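The same calculation, x̄ = ∑fx/∑f, can be sketched in Python with the children-per-family data above:

```python
# Sketch: mean from a simple (ungrouped) frequency distribution.
children = [1, 2, 3, 4, 5, 6]       # values x
families = [8, 10, 10, 25, 20, 7]   # frequencies f

total_fx = sum(f * x for x, f in zip(children, families))
n = sum(families)
mean = total_fx / n
print(mean)   # 300 / 80 = 3.75
```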
Methods of Finding Arithmetic mean for Grouped Data
Let x_{1}, x_{2}, x_{3}, . . ., x_{n} be the midpoints of the class intervals with corresponding frequencies f_{1}, f_{2}, f_{3}, . . ., f_{n}. Then the arithmetic mean is obtained by dividing the sum of the products of “f” and “x” by the total of all frequencies.
Thus:

A.M. = x̄ = (f_{1}x_{1} + f_{2}x_{2} + . . . + f_{n}x_{n})/∑f = ∑fx/∑f
Example:
Given below are the heights (in inches) of 200 students. Find the A.M.

Height (inches)   30–35   35–40   40–45   45–50   50–55   55–60
No. of Students   28      32      36      46      36      22
Solution:
Height (Inches)   Mid points (x)   Frequency (f)   fx
30–35             32.5             28              910
35–40             37.5             32              1200
40–45             42.5             36              1530
45–50             47.5             46              2185
50–55             52.5             36              1890
55–60             57.5             22              1265
Total:                             ∑f = 200        ∑fx = 8980

x̄ = ∑fx/∑f = 8980/200 = 44.90 (inches)
Example: Given below are the weights (in kgs) of 100 students. Find Mean Weight:
Weight            70–74   75–79   80–84   85–89   90–94
No. of Students   10      24      46      12      8
Solution:
Weight (Kg)   Mid-Points (x)   Frequency (f)   fx
70–74         72               10              720
75–79         77               24              1848
80–84         82               46              3772
85–89         87               12              1044
90–94         92               8               736
Total:                         ∑f = 100        ∑fx = 8120

x̄ = ∑fx/∑f = 8120/100 = 81.20
Here, the mean weight is 81.20 kgs.
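A minimal Python sketch of the grouped-data method, using class midpoints and the weight data above:

```python
# Sketch: arithmetic mean for grouped data, x-bar = sum(f*x) / sum(f),
# where x is the midpoint of each class.
classes = [(70, 74), (75, 79), (80, 84), (85, 89), (90, 94)]
freqs = [10, 24, 46, 12, 8]

midpoints = [(lo + hi) / 2 for lo, hi in classes]
mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)
print(mean)   # 8120 / 100 = 81.2
```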
Merits of Mean
 It has the simplest average formula, which is easily understandable and easy to compute.
 It is so rigidly defined by mathematical formula that everyone gets the same result for a single problem.
 Its calculation is based on all the observations.
 It is least affected by sampling fluctuations.
 It is typical, i.e. it balances the values on either side.
 It is the best measure to compare two or more series (data).
 Mean is calculated on values and does not depend upon any position.
 It is the mathematical centre of a distribution.
 It is good for interval and ratio scales.
 It does not ignore any information.
 Inferential statistics is based on the mathematical properties of the mean.
 It is relatively stable and amenable to mathematical treatment.
Demerits of Mean
 It cannot be calculated if all the values are not known.
 Extreme values have a great effect on it.
 It cannot be determined for qualitative data.
 It may not exist in the data.
Median:
It is the middlemost point or the central value of the variable in a set of observations, when the observations are arranged in either order of their magnitudes.
It is the value in a series, which divides the series into two equal parts, one consisting of all values less and the other all values greater than it.
Median for Ungrouped data
The median of “n” observations, x_{1}, x_{2}, x_{3}, . . ., x_{n}, can be obtained as follows:
 When “n” is an odd number,
Median = ((n + 1)/2)th observation
 When “n” is an even number,
Median is the average of the (n/2)th and (n/2 + 1)th observations.
Or
Simply use the ((n + 1)/2)th observation; it will be the average of the two middle values.
The median for a discrete frequency distribution can be obtained as above, using a cumulative frequency distribution.
Problem
Find the median of the following data:
12, 2, 16, 8, 14, 10, 6
Step 1: Organize the data, or arrange the numbers from smallest to largest.
2, 6, 8, 10, 12, 14, 16
Step 2: Count the number of observations in the data (n).
n = 7
Step 3: Since the number of data values is odd, the median will be found in the ((n + 1)/2)th position.
Median term (m) = (7 + 1)/2 = 8/2 = 4th value
Step 4: In this case, the median is the value found in the fourth position of the organized data, therefore
Median = 10
Problem
Median for even data:
Find the median of the following data:
7, 9, 3, 4, 11, 1, 8, 6, 1, 4
Step 1: Organize the data, or arrange the numbers from smallest to largest.
1, 1, 3, 4, 4, 6, 7, 8, 9, 11
Step 2: Since the number of data values is even, the median will be the mean of the values found before and after the (n + 1)/2 = 5.5 position.
Step 3: The number found before the 5.5 position is 4 and the number found after the 5.5 position is 6. Their mean is (4 + 6)/2 = 5, so the median is 5.
1, 1, 3, 4, 4, 6, 7, 8, 9, 11
Example:
The following are the runs made by a batsman in 7 matches:
8, 12, 18, 13, 16, 5, 20.Find the median.
Solution: Writing the runs in ascending order.
5, 8, 12, 13, 16, 18, 20
As n = 7,
Median = ((n + 1)/2)th item = ((7 + 1)/2)th = 4th item.
Hence, the median is 13 runs.
Example:
Following are the marks (out of 100) obtained by 10 students in English:
23, 15, 35, 41, 48, 5, 8, 9, 11, 51. Find the median mark.
Solution: arranging the marks in ascending order. The marks are:
5, 8, 9, 11, 15, 23, 35, 41, 48, 51
As n = 10,
Median = average of the (n/2)th and (n/2 + 1)th items = average of the 5th and 6th items.
Or, Median = (15 + 23)/2 = 38/2 = 19 marks.
Alternative Method:
Median term (m) = ((n + 1)/2)th value = (10 + 1)/2 = 11/2 = 5.5th value
5, 8, 9, 11, 15, 23, 35, 41, 48, 51
The 5.5th value lies between M1 = 15 and M2 = 23.
Median = (M1 + M2)/2 = (15 + 23)/2 = 19
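Both cases (odd and even n) can be sketched as one small Python function; the two examples above serve as checks:

```python
# Sketch: median of ungrouped data for odd and even n.
def median(values):
    s = sorted(values)                  # step 1: arrange in ascending order
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                   # odd n: middle value
    return (s[mid - 1] + s[mid]) / 2    # even n: mean of the two middle values

print(median([8, 12, 18, 13, 16, 5, 20]))              # runs example: 13
print(median([23, 15, 35, 41, 48, 5, 8, 9, 11, 51]))   # marks example: 19.0
```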
Median for Grouped data
It is obtained by the following formula:

Median = l_{1} + ((l_{2} − l_{1})/f) × (m − C)

Where, l_{1} = lower class limit of the median class
l_{2} = upper class limit of the median class
f = frequency of the median class
m = n/2 or (n + 1)/2
C = cumulative frequency preceding the median class
n = total frequency, i.e. ∑f
Example:
Find the median height of 200 students in given data
Solution:
Class interval   Frequency (f)   C.F
30–35            28              28
35–40            32              28 + 32 = 60
40–45            36              60 + 36 = 96
45–50            46              96 + 46 = 142
50–55            36              142 + 36 = 178
55–60            22              178 + 22 = 200 = n
Median term = (n + 1)/2 = 201/2 = 100.5th item
As the 100.5th item lies in (45–50), it is the median class, with l_{1} = 45, l_{2} = 50, f = 46, C = 96.
Median = l_{1} + ((l_{2} − l_{1})/f) × (m − C)
Median = 45 + (5/46) × (100.5 − 96)
= 45 + (5/46) × 4.5
= 45 + 0.489
= 45.489
Thus, the median height is 45.489 inches.
2^{nd} Method:
Median = l + (w/f) × (n/2 − c)

Where, l = lower class boundary of the median class
w = width of the median class
f = frequency of the median class
n = total frequency, i.e. ∑f
c = cumulative frequency preceding the median class
Example:
Following are the weights in kgs of 100 students. Find the median weight.
Weights (kgs)     70–74   75–79   80–84   85–89   90–94
No. of students   10      24      46      12      8
Solution: As class boundaries are not given, we first construct them using the usual procedure.
Weight (kgs)   No. of students   Class boundaries   C.F
70–74          10                69.5–74.5          10
75–79          24                74.5–79.5          34
80–84          46                79.5–84.5          80
85–89          12                84.5–89.5          92
90–94          8                 89.5–94.5          100
Median term = n/2 = 100/2 = 50th item
As the 50th item lies in (79.5–84.5), it is the median class, with l = 79.5, w = 5, f = 46, c = 34.
Using Median = l + (w/f) × (n/2 − c), we find
Median = 79.5 + (5/46) × (50 − 34)
= 79.5 + 1.74
= 81.24
Hence, the median weight is 81.24 kg.
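The class-boundary method can be sketched as a small Python function; it reproduces the median weight of 81.24 kg:

```python
# Sketch: median for grouped data, Median = l + (w / f) * (n/2 - c).
def grouped_median(boundaries, freqs):
    n = sum(freqs)
    cum = 0                                  # cumulative frequency so far
    for (lo, hi), f in zip(boundaries, freqs):
        if cum + f >= n / 2:                 # this is the median class
            return lo + (hi - lo) / f * (n / 2 - cum)
        cum += f

boundaries = [(69.5, 74.5), (74.5, 79.5), (79.5, 84.5),
              (84.5, 89.5), (89.5, 94.5)]
freqs = [10, 24, 46, 12, 8]
print(round(grouped_median(boundaries, freqs), 2))   # 79.5 + (5/46)*(50-34) = 81.24
```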
Merits of Median:
 It is easily understood, although it is not as popular as the mean.
 It is not influenced or affected by variation in magnitude or by extreme items.
 The value of the median can be ascertained graphically from ogives.
 It is the best measure for qualitative data such as beauty, intelligence, etc.
 The median indicates the value of the middle item in the distribution, i.e. the middlemost item is the median.
 It can be determined even by inspection in many cases.
 It is good with ordinal data.
 It is easier to compute than the mean.
Demerits of Median:
 For the calculation of the median, the data must be arranged.
 The median, being a positional average, does not depend on each and every observation.
 It is not subject to algebraic treatment.
 The median is more affected by sampling fluctuations than the arithmetic mean.
 It may not exist in the data.
 It is not rigorously defined.
 It does not use the values of all observations.
Mode:
Mode is considered as the value in a series which occurs most frequently (has the highest frequency)
The mode of distribution is the value at the point around which the items tend to be most heavily concentrated. It may be regarded as the most typical value.
 The word modal is often used when referring to the mode of a data set.
 If a data set has only one value that occurs most often, the set is called unimodal.
 A data set that has two values that occur with the same greatest frequency is referred to as bimodal.
 When a set of data has more than two values that occur with the same greatest frequency, the set is called multimodal.
Mode for Ungrouped data
Example 1. The grades of Jamal in eight monthly tests were 75, 76, 80, 80, 82, 82, 82, 85. Find the mode of his grades.
Solution: As 82 is repeated more than any other number, so clearly mode is 82.
Example 2. Ten students were asked about the number of questions they solved, out of 20 questions, last week. The records were 13, 14, 15, 11, 16, 10, 19, 20, 18, 17. Find the mode.
Solution: It is obvious that the data contain no mode, as none of the numbers is repeated. Sometimes data contain several modes.
If x = 10, 15, 15, 15, 20, 20, 20, 25 then the data contains two modes i.e. 15 and 20.
Mode for grouped data
The mode for grouped data can be calculated by the following formula:

Mode = l_{1} + ((f_{m} − f_{1})/((f_{m} − f_{1}) + (f_{m} − f_{2}))) × h
(OR)
Mode = l_{1} + ((f_{m} − f_{1})/(2f_{m} − f_{1} − f_{2})) × (l_{2} − l_{1})

Where:
l_{1} = lower limit (class boundary) of the modal class
l_{2} = upper limit of the modal class
f_{m} = frequency of the modal class
f_{1} = frequency of the class preceding the modal class
f_{2} = frequency of the class following the modal class
h = l_{2} − l_{1} (size of the modal class)
The class with the highest frequency is called the “Modal Class”.
Example 3. Find the mode for the heights of 200 students in given data
Height (inches)   Frequency
30–35             28
35–40             32
40–45             36 (f_{1})
45–50             46 (f_{m})
50–55             36 (f_{2})
55–60             22
                  ∑f = 200
Solution:
Mode = l_{1} + ((f_{m} − f_{1})/((f_{m} − f_{1}) + (f_{m} − f_{2}))) × h
Mode = 45 + ((46 − 36)/((46 − 36) + (46 − 36))) × 5
Mode = 45 + (10/(10 + 10)) × 5
Mode = 45 + (10/20) × 5
Mode = 45 + 2.5
Mode = 47.5
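A minimal Python sketch of the grouped-data mode formula; it assumes the modal class is neither the first nor the last class, as in the example above:

```python
# Sketch: mode for grouped data,
# Mode = l1 + (fm - f1) / ((fm - f1) + (fm - f2)) * h.
def grouped_mode(classes, freqs):
    i = freqs.index(max(freqs))     # index of the modal class
    lo, hi = classes[i]
    f_m, f1, f2 = freqs[i], freqs[i - 1], freqs[i + 1]  # assumes interior class
    h = hi - lo
    return lo + (f_m - f1) / ((f_m - f1) + (f_m - f2)) * h

classes = [(30, 35), (35, 40), (40, 45), (45, 50), (50, 55), (55, 60)]
freqs = [28, 32, 36, 46, 36, 22]
print(grouped_mode(classes, freqs))   # 45 + (10/20)*5 = 47.5
```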
Merits of Mode:
 It can be obtained by inspection.
 It is not affected by extreme values.
 This average can be calculated from open end classes.
 The score comes from the data set
 Good for nominal data
 Good when there are two ‘typical‘ scores
 Easiest to compute and understand
 It can be used to describe qualitative phenomenon
 The value of mode can also be found graphically.
Demerits of Mode
 The mode has no significance unless a large number of observations is available.
 It cannot be treated algebraically.
 It is a peculiar measure of central tendency.
 For the calculation of the mode, the data must be arranged in the form of a frequency distribution.
 It is not a rigidly defined measure.
 It ignores most of the information in a distribution.
 Small samples may not have a mode.
 It is not based on all the observations.
Empirical Relationship b/w Mean, Median and Mode
For a moderately skewed distribution: Mode ≈ 3 Median − 2 Mean.
Skewness:
Data distributions may be classified on the basis of whether they are symmetric or asymmetric. If a distribution is symmetric, the left half of its graph (histogram or frequency polygon) will be a mirror image of its right half. When the left half and right half of the graph of a distribution are not mirror images of each other, the distribution is asymmetric.
If the graph (histogram or frequency polygon) of a distribution is asymmetric, the distribution is said to be skewed. The mean, median and mode do not fall in the middle of the distribution.
Types of Skewness
 Positive skewness: If a distribution is not symmetric because its graph extends further to the right than to the left, that is, if it has a long tail to the right, we say that the distribution is skewed to the right or is positively skewed. In a positively skewed distribution, Mean > Median > Mode. The positive skewness indicates that the mean is more influenced than the median and mode by the few extremely high values. A positively skewed distribution has a positive skewness value because the mean is greater than the mode.
 Negative skewness: If a distribution is not symmetric because its graph extends further to the left than to the right, that is, if it has a long tail to the left, we say that the distribution is skewed to the left or is negatively skewed. In a negatively skewed distribution, Mean < Median < Mode. A negatively skewed distribution has a negative skewness value because the mean is less than the mode.
KURTOSIS
Kurtosis is a measure of the degree to which a distribution is “peaked” or flat in comparison to a normal distribution whose graph is characterized by a bellshaped appearance.

Measures of Dispersion
This term is used commonly to mean scatter, Deviation, Fluctuation, Spread or variability of data.
The degree to which the individual values of the variate scatter away from the average or the central value, is called a dispersion.
Types of Measures of Dispersions:
 Absolute Measures of Dispersion: The measures of dispersion which are expressed in terms of original units of a data are termed as Absolute Measures.
 Relative Measures of Dispersion: Relative measures of dispersion, also known as coefficients of dispersion, are obtained as ratios or percentages. These are pure numbers, independent of the units of measurement, and are used to compare two or more sets of data values.
Absolute Measures
 Range
 Quartile Deviation
 Mean Deviation
 Standard Deviation
Relative Measure
 Coefficient of Range
 Coefficient of Quartile Deviation
 Coefficient of mean Deviation
 Coefficient of Variation.
The Range:
1. The range is the simplest measure of dispersion. It is defined as the difference between the largest value and the smallest value in the data: Range = X_{max} − X_{min}.
2. For grouped data, the range is defined as the difference between the upper class boundary (UCB) of the highest class and the lower class boundary (LCB) of the lowest class.
MERITS OF RANGE:
 Easiest to calculate and simplest to understand.
 Gives a quick answer.
DEMERITS OF RANGE:
 It gives a rough answer.
 It is not based on all observations.
 It changes from one sample to the next in a population.
 It can’t be calculated in openend distributions.
 It is affected by sampling fluctuations.
 It gives no indication how the values within the two extremes are distributed
Quartile Deviation (QD):
1. It is also known as the Semi-Interquartile Range. The range is a poor measure of dispersion where extremely large values are present. The quartile deviation is defined as half of the difference between the third and the first quartiles:
QD = (Q_{3} − Q_{1})/2
InterQuartile Range
The difference between third and first quartiles is called the ‘InterQuartile Range’.
IQR = Q_{3} – Q_{1}
Mean Deviation (MD):
1. The MD is defined as the average of the absolute deviations of the values from an average:
MD = ∑|x − x̄|/n
It is also known as the Mean Absolute Deviation.
2. MD from the median is expressed as follows:
MD = ∑|x − median|/n
3. For grouped data:
MD = ∑f|x − x̄|/∑f
 The MD is simple to understand and to interpret.
 It is affected by the value of every observation.
 It is less affected by extreme values than the standard deviation.
 It is not suited to further mathematical treatment. It is, therefore, not as logical or convenient a measure of dispersion as the SD.
The Variance:
 The mean of all squared deviations from the mean is called the variance.
 (Sample variance = S^{2}; population variance = σ^{2}, sigma squared, i.e. the standard deviation squared.) A high variance means most scores are far away from the mean; a low variance indicates most scores cluster tightly about the mean.
Formula:
S^{2} = ∑(X − x̄)^{2}/(n − 1)
Calculating variance: The heart rate of a certain patient is 80, 84, 80, 72, 76, 88, 84, 80, 78, and 78. Calculate the variance for this data.
Solution:
Step 1:
Find the mean of the data:
x̄ = ∑X/n = 800/10 = 80
Step 2:
Draw two columns, ‘X’ and deviation about the mean (X − x̄). In column ‘X’ put all values of X, and in (X − x̄) subtract x̄ from each ‘X’ value.
Step 3:
Draw another column of (X − x̄)^{2}, in which put the square of each deviation about the mean.
X    (X − x̄)         (X − x̄)^{2}
80   80 − 80 = 0     0
84   84 − 80 = 4     16
80   80 − 80 = 0     0
72   72 − 80 = −8    64
76   76 − 80 = −4    16
88   88 − 80 = 8     64
84   84 − 80 = 4     16
80   80 − 80 = 0     0
78   78 − 80 = −2    4
78   78 − 80 = −2    4
∑X = 800, x̄ = 80    ∑(X − x̄) = 0 (the sum of deviations about the mean is always zero)    ∑(X − x̄)^{2} = 184
Step 4
Apply the formula and put in the following values:
∑(X − x̄)^{2} = 184
n = 10
Variance = 184/(10 − 1) = 184/9
Variance = 20.44
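The four steps above can be sketched directly in Python with the same heart-rate data:

```python
# Sketch: sample variance S^2 = sum((X - mean)^2) / (n - 1).
rates = [80, 84, 80, 72, 76, 88, 84, 80, 78, 78]

mean = sum(rates) / len(rates)                    # step 1: mean = 80
squared_devs = [(x - mean) ** 2 for x in rates]   # steps 2-3
variance = sum(squared_devs) / (len(rates) - 1)   # step 4: 184 / 9
print(round(variance, 2))   # 20.44
```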
Standard Deviation
 The SD is defined as the positive Square root of the mean of the squared deviations of the values from their mean.
 The square root of the variance.
 It measures the spread of data around the mean. In a normal distribution, one standard deviation includes about 68% of the values in a sample population, two standard deviations include about 95% of the values, and three standard deviations include about 99.7% of the values.
 The SD is affected by the value of every observation.
 In general, it is less affected by fluctuations of sampling than the other measures of dispersion.
 It has a definite mathematical meaning and is perfectly adaptable to algebraic treatment.
Formula:
S = √(∑(X − x̄)^{2}/(n − 1))
Calculating Standard Deviation (we use the same example): The heart rate of a certain patient is 80, 84, 80, 72, 76, 88, 84, 80, 78, and 78. Calculate the standard deviation for this data.
SOLUTION:
Step 1: Find the mean of the data:
x̄ = ∑X/n = 800/10 = 80
Step 2:
Draw two columns, ‘X’ and deviation about the mean (X − x̄). In column ‘X’ put all values of X, and in (X − x̄) subtract x̄ from each ‘X’ value.
Step 3:
Draw another column of (X − x̄)^{2}, in which put the square of each deviation about the mean. The working table is identical to the one in the variance example above, giving ∑(X − x̄)^{2} = 184.
Step 4
Apply the formula and put in the following values:
∑(X − x̄)^{2} = 184
n = 10
S = √(184/9) = √20.44 = 4.52
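As a sketch, the standard deviation is just the square root of the sample variance for the same heart-rate data:

```python
# Sketch: sample standard deviation S = sqrt(sum((X - mean)^2) / (n - 1)).
import math

rates = [80, 84, 80, 72, 76, 88, 84, 80, 78, 78]
mean = sum(rates) / len(rates)
sd = math.sqrt(sum((x - mean) ** 2 for x in rates) / (len(rates) - 1))
print(round(sd, 2))   # sqrt(184 / 9) = 4.52
```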
MERITS AND DEMERITS OF STD. DEVIATION
 The Std. Dev. summarizes the deviation of a large distribution from the mean in one figure, used as a unit of variation.
 It indicates whether the variation of an individual from the mean is real or by chance.
 The Std. Dev. helps in finding the suitable sample size for valid conclusions.
 It helps in calculating the standard error.
DEMERITS
 It gives greater weight to extreme values. The process of squaring deviations and then taking the square root involves lengthy calculations.
Relative measure of dispersion:
(a) Coefficient of Variation,
(b) Coefficient of Dispersion,
(c) Quartile Coefficient of Dispersion, and
(d) Mean Coefficient of Dispersion.
Coefficient of Variation (CV):
1. The coefficient of variation was introduced by Karl Pearson. The CV expresses the SD as a percentage of the AM:
CV = (S/x̄) × 100 ————— for sample data
CV = (σ/μ) × 100 ————— for population data
 It is frequently used in comparing dispersion of two or more series. It is also used as a criterion of consistent performance, the smaller the CV the more consistent is the performance.
 The disadvantage of the CV is that it fails to be useful when the mean is close to zero.
 It is sometimes also referred to as ‘coefficient of standard deviation’.
 It is used to determine the stability or consistency of a data.
 The higher the CV, the higher is instability or variability in data, and vice versa.
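A minimal sketch comparing the consistency of two series with the CV; series B is hypothetical data invented for illustration:

```python
# Sketch: coefficient of variation CV = (S / mean) * 100; the series
# with the smaller CV is the more consistent one.
import statistics

series_a = [80, 84, 80, 72, 76, 88, 84, 80, 78, 78]    # heart-rate data above
series_b = [60, 95, 70, 100, 55, 90, 85, 65, 75, 105]  # hypothetical series

for name, data in [("A", series_a), ("B", series_b)]:
    cv = statistics.stdev(data) / statistics.mean(data) * 100
    print(f"Series {name}: CV = {cv:.1f}%")
```

Both series have the same mean (80), so the comparison isolates the effect of their different spreads.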
Coefficient of Dispersion (CD):
If X_{m} and X_{n} are respectively the maximum and the minimum values in a set of data, then the coefficient of dispersion is defined as:
CD = (X_{m} − X_{n})/(X_{m} + X_{n})
Coefficient of Quartile Deviation (CQD):
1. If Q_{1} and Q_{3} are given for a set of data, then (Q_{1} + Q_{3})/2 is a measure of central tendency or average of the data. The measure of relative dispersion for quartile deviation is then expressed as follows:
CQD = (Q_{3} − Q_{1})/(Q_{3} + Q_{1})
CQD may also be expressed in percentage.
Mean Coefficient of Dispersion (CMD):
The relative measure for mean deviation is the ‘mean coefficient of dispersion’ or ‘coefficient of mean deviation’:
CMD = MD/x̄ ——————– for the arithmetic mean
CMD = MD/median ——————– for the median
Percentiles and Quartiles
The mean and median are special cases of a family of parameters known as location parameters. These descriptive measures are called location parameters because they can be used to designate certain positions on the horizontal axis when the distribution of a variable is graphed.
Percentile:
 Percentiles are numerical values that divide an ordered data set into 100 groups of values, with at most 1% of the data values in each group. There can be at most 99 percentiles in a data set.
 A percentile is a measure that tells us what percent of the total frequency scored at or below that measure.
Percentiles corresponding to a given data value: The percentile in a set corresponding to a specific data value is obtained by using the following formula:

Percentile = (Number of values below X + 0.5)/(Total number of values in data set) × 100
Example: Calculate percentile for value 12 from the following data
13 11 10 13 11 10 8 12 9 9 8 9
Solution:
Step # 01: Arrange the data values in ascending order, from smallest to largest:

S. No         1   2   3   4   5   6    7    8    9    10   11   12
Observation   8   8   9   9   9   10   10   11   11   12   13   13
Step # 02: The number of values below 12 is 9 and total number in the data set is 12
Step # 03: Use percentile formula
Percentile for 12 = (9 + 0.5)/12 × 100 = 79.17%
This means the value 12 corresponds to the 79th percentile.
Example2: Find out 25^{th} percentile for the following data
6 12 18 12 13 8 13 11
10 16 13 11 10 10 2 14
SOLUTION
Step # 01: Arrange the data values in ascending order, from smallest to largest:

S. No         1   2   3   4    5    6    7    8    9    10   11   12   13   14   15   16
Observation   2   6   8   10   10   10   11   11   12   12   13   13   13   14   16   18
Step # 02: Calculate the position of the percentile (n × k/100). Here n = number of observations = 16 and k (percentile) = 25.

Position = (16 × 25)/100 = 4
Therefore, 25^{th} percentile will be the average of values located at the 4^{th} and 5^{th} position in the ordered set. Here values for 4^{th} and 5^{th} correspond to the value of 10 each.
P_{25} (= P_{k}) = (10 + 10)/2 = 10
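The position method for a given k-th percentile can likewise be sketched in Python (the function name percentile is illustrative; it follows the rule above: average two adjacent values when the position n × k/100 is a whole number, otherwise round the position up):

```python
import math

def percentile(data, k):
    """k-th percentile by the position method: pos = n * k / 100."""
    ordered = sorted(data)
    pos = len(ordered) * k / 100
    if pos == int(pos):                     # whole number: average pos and pos + 1
        i = int(pos)
        return (ordered[i - 1] + ordered[i]) / 2
    return ordered[math.ceil(pos) - 1]      # otherwise take the next position up

data = [6, 12, 18, 12, 13, 8, 13, 11, 10, 16, 13, 11, 10, 10, 2, 14]
print(percentile(data, 25))  # 10.0
```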
Quartiles
These are measures of position which divide the data into four equal parts when the data is arranged in ascending or descending order. The quartiles are denoted by Q.
Q_{1} = First Quartile, below which the first 25% of the observations are present
Q_{2} = Second Quartile, below which the first 50% of the observations are present. It can easily be located as the median value.
Q_{3} = Third Quartile, below which the first 75% of the observations are present
PROBABILITY
Probability:
Probability is used to measure the ‘likelihood’ or ‘chances’ of certain events (prespecified outcomes) of an experiment.
If an event can occur in N mutually exclusive and equally likely ways, and if m of these possess a trait E, the probability of the occurrence of E is expressed as:
P(E) = m / N = (Number of favourable cases) / (Total number of outcomes in the sample space)
Characteristics of probability:
 It is usually expressed by the symbol ‘P’
 It ranges from 0 to 1
 When P = 0, the event cannot happen; it is impossible.
 If P = 1, the event is certain to happen (100% chance).
 The total sum of probabilities of all the possible outcomes in a sample space is always equal to one (1).
 If the probability of occurrence is p(O) = A, then the probability of non-occurrence is 1 – A.
Terminology
Random Experiment:
Any natural phenomenon yielding some result is termed a random experiment when it is not possible to predict which particular result will turn out.
An Outcome:
Any possible result of an experiment is called an outcome of that experiment. e.g. when you toss a coin once, you get either a head or a tail.
A trial:
This refers to the act of carrying out an experiment, like tossing a coin or rolling a die.
Sample Space:
The set of all possible outcomes of a probability experiment.
Example 1: In tossing a coin, the outcomes are either Head (H) or tail (T) i.e. there are only two possible outcomes in tossing a coin. The chances of obtaining a head or a tail are equal. It can be solved as follow;
n(s) = 2 ways
S = {H, T}
Example 2: What is the sample space when a single die is rolled?
n(s) = 6 ways
S = {1, 2, 3, 4, 5, 6}
A Simple Event
In probability, an event with only one outcome is called a simple event.
Compound Events
When two or more events occur in connection with each other, then their simultaneous occurrence is called a compound event.
Mutually exhaustive:
A set of events is said to be mutually (collectively) exhaustive if together they include all possible outcomes of the experiment, i.e. at least one of them must occur in every trial.
Mutually exclusive:
Two events are said to be mutually exclusive if they cannot occur simultaneously.
Example: tossing a coin, the events head and tail are mutually exclusive because if the outcome is head then the possibilities of getting a tail in the same trial is ruled out.
Equally likely events:
Events are said to be equally likely if there is no reason to expect any one of them in preference to the others.
Example: in a single cast of a fair die each of the events 1, 2, 3, 4, 5, 6 is equally likely to occur.
Favourable case:
The cases which ensure the occurrence of an event are said to be favourable to the events.
Independent event:
When the experiments are conducted in such a way that the occurrence of an event in one trial does not have any effect on the occurrence of the other events at a subsequent experiment, then the events are said to be independent.
Example:
If we draw a card from a pack of cards and then draw a second card after replacing the first card drawn, the second draw is known as independent of the first.
Dependent Event:
When the experiments are conducted in such a way that the occurrence of an event in one trial does have some effect on the occurrence of the other events at a subsequent experiment, then the events are said to be dependent events.
Example:
If we draw a card from a pack and again draw a card from the rest of pack of cards (containing 51 cards) then the second draw is dependent on the first.
Conditional Probability:
The probability of the happening of an event A, when it is known that B has already happened, is called the conditional probability of A and is denoted by P (A/B) i.e.
 P (A/B) = conditional probability of A given that B has already occurred.
 P (B/A) = conditional probability of B given that A has already occurred.
Types of Probability:
The Classical or mathematical:
Probability is the ratio of the number of favourable cases to the total number of equally likely cases.
The probability of non-occurrence of the same event is given by 1 – p(occurrence).
The probability of occurrence plus non-occurrence is equal to one:
If p(O) is the probability of occurrence and p(O′) the probability of non-occurrence, then p(O) + p(O′) = 1.
Statistical or Empirical
Empirical probability arises when frequency distributions are used. For example:

Observation (X): 0 1 2 3 4
Frequency (f): 3 7 10 16 11

The probability of observation X = 2 is given by its relative frequency: P(X = 2) = f/Σf = 10/47 ≈ 0.21
RULES OF PROBABILITY
Addition Rule
 Rule 1: When two events A and B are mutually exclusive, then probability of any one of them is equal to the sum of the probabilities of the happening of the separate events;
Mathematically:
P (A or B) =P (A) +P (B)
Example: When a die or dice is rolled, find the probability of getting a 3 or 5.
Solution: P (3) =1/6 and P (5) =1/6.
Therefore P (3 or 5) = P (3) + P (5) = 1/6+1/6 =2/6=1/3.
 Rule 2: If A and B are two events that are NOT mutually exclusive, then
P (A or B) = P(A) + P(B) – P(A and B), where P(A and B) is the probability of the outcomes that events A and B have in common.
Given two events A and B, the probability that event A, or event B, or both occur is equal to the probability that event A occurs, plus the probability that event B occurs, minus the probability that the events occur simultaneously.
Example: When a card is drawn from a pack of 52 cards, find the probability that the card is a 10 or a heart.
Solution: P (10) = 4/52 and P (heart) =13/52
P (10 and heart) = 1/52
P (A or B) = P (A) + P (B) – P (A and B) = 4/52 + 13/52 – 1/52 = 16/52 = 4/13
Multiplication Rule
 Rule 1: For two independent events A and B, then
P (A and B) = P (A) x P (B).
Example: Determine the probability of obtaining a 5 on a die and a tail on a coin in one throw.
Solution: P (5) =1/6 and P (T) =1/2.
P (5 and T) = P (5) x P (T) = 1/6 x ½= 1/12.
 Rule 2: When two events are dependent, the probability of both events occurring is P (A and B) = P (A) x P (B/A), where P (B/A) is the probability that event B occurs given that event A has already occurred.
Example: Find the probability of obtaining two Aces from a pack of 52 cards without replacement.
Solution: P (first Ace) = 4/52 and P (second Ace if NO replacement) = 3/51
Therefore P (Ace and Ace) = P (first Ace) x P (second Ace) = 4/52 x 3/51 = 1/221
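The four worked examples above can be verified with exact fractions in Python:

```python
from fractions import Fraction

# Addition rule, mutually exclusive: 3 or 5 on one die
p3, p5 = Fraction(1, 6), Fraction(1, 6)
print(p3 + p5)                              # 1/3

# Addition rule, not mutually exclusive: a 10 or a heart from 52 cards
p10, pheart, pboth = Fraction(4, 52), Fraction(13, 52), Fraction(1, 52)
print(p10 + pheart - pboth)                 # 4/13 (= 16/52)

# Multiplication rule, independent: 5 on a die and a tail on a coin
print(Fraction(1, 6) * Fraction(1, 2))      # 1/12

# Multiplication rule, dependent: two Aces without replacement
print(Fraction(4, 52) * Fraction(3, 51))    # 1/221
```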
Construct sample space, when two dice are rolled
n(s) = n_{1} x n_{2} = 6 x 6 = 36
(1,1) (2,1) (3,1) (4,1) (5,1) (6,1)
(1,2) (2,2) (3,2) (4,2) (5,2) (6,2)
(1,3) (2,3) (3,3) (4,3) (5,3) (6,3)
(1,4) (2,4) (3,4) (4,4) (5,4) (6,4)
(1,5) (2,5) (3,5) (4,5) (5,5) (6,5)
(1,6) (2,6) (3,6) (4,6) (5,6) (6,6)
EXAMPLE OF FINDING PROBABILITY OF AN EVENT
If 3 coins are tossed together, construct a tree diagram & find the followings;
a) Event showing No head b) Event showing 01 head,
c) Event showing 02 heads d) Event showing 03 heads
n (s) = n_{1} x n_{2} x n_{3}
= 2 x 2 x2 = 8
 Event showing no head = P(X = 0): TTT, so 1/8 = 0.125
 Event showing 01 head = P(X = 1): HTT, THT, TTH, so 3/8 = 0.375
 Event showing 02 heads = P(X = 2): HHT, HTH, THH, so 3/8 = 0.375
 Event showing 03 heads = P(X = 3): HHH, so 1/8 = 0.125
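Instead of a tree diagram, the sample space for three coins can be enumerated in Python and each event counted; this reproduces 1/8, 3/8, 3/8, 1/8 for 0 to 3 heads:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product("HT", repeat=3))    # sample space, n(S) = 2 x 2 x 2 = 8

for k in range(4):                          # P(X = k) for k heads
    count = sum(1 for o in outcomes if o.count("H") == k)
    p = Fraction(count, len(outcomes))
    print(k, p, float(p))
```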
Complementary Events
Complementary events arise when there are only two possible outcomes, like getting a job or not getting a job. The complement of an event is the exact opposite: the event not happening.
The probability of the non-occurrence of an event.
The probability of an event A is equal to 1 minus the probability of its complement, which is written as Ā and
P (Ā) = 1 – P (A)
CONDITIONAL PROBABILITY &SCREENING TESTS
Sensitivity, Specificity, and Predictive Value Positive and Negative
In the health sciences field a widely used application of probability laws and concepts is found in the evaluation of screening tests and diagnostic criteria. Of interest to clinicians is an enhanced ability to correctly predict the presence or absence of a particular disease from knowledge of test results (positive or negative) and/or the status of presenting symptoms (present or absent). Also of interest is information regarding the likelihood of positive and negative test results and the likelihood of the presence or absence of a particular symptom in patients with and without a particular disease.
In consideration of screening tests, one must be aware of the fact that they are not infallible. That is, a testing procedure may yield a false positive or a false negative.
False Positive:
A false positive results when a test indicates a positive status when the true status is negative.
False Negative:
A false negative results when a test indicates a negative status when the true status is positive.
Sensitivity:
The sensitivity of a test (or symptom) is the probability of a positive test result (or presence of the symptom) given the presence of the disease.
Specificity:
The specificity of a test (or symptom) is the probability of a negative test result (or absence of the symptom) given the absence of the disease.
Predictive value positive:
The predictive value positive of a screening test (or symptom) is the probability that a subject has the disease given that the subject has a positive screening test result (or has the symptom).
Predictive value negative:
The predictive value negative of a screening test (or symptom) is the probability that a subject does not have the disease, given that the subject has a negative screening test result (or does not have the symptom).
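These four measures are usually computed from a 2×2 table of test result against true disease status. A minimal sketch in Python, using hypothetical counts (the values of a, b, c, d below are illustrative, not from the text):

```python
# Hypothetical 2x2 screening-test counts (illustrative values):
# a = true positives, b = false positives, c = false negatives, d = true negatives
a, b, c, d = 90, 30, 10, 870

sensitivity = a / (a + c)      # P(test positive | disease present)
specificity = d / (b + d)      # P(test negative | disease absent)
ppv = a / (a + b)              # predictive value positive
npv = d / (c + d)              # predictive value negative

print(sensitivity, specificity, ppv, round(npv, 3))
```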
Summary of formulae:
Symbols
COUNTING RULES
1) FACTORIALS (number of ways)
The result of multiplying a sequence of descending natural numbers down to 1. It is denoted by “!”
Examples:
4! = 4 × 3 × 2 × 1 = 24
7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5040
Remember : 0! = 1
General Method:
n! = n (n – 1) (n – 2) (n – 3) … 3 × 2 × 1, with 0! = 1
2) PERMUTATION RULES
All possible arrangements of a collection of things, where the order is important. Rearranging the same items in a different order counts as a different permutation. The number of arrangements of r objects chosen from n is ^{n}P_{r} = n!/(n – r)!
Examples
 COMBINATIONS
The order of the objects in a subset is immaterial; the same objects in a different arrangement do not count as a new combination. The number of ways of choosing r objects from n is ^{n}C_{r} = n!/[r!(n – r)!]
Examples:
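Python's standard library covers all three counting rules directly (math.perm and math.comb require Python 3.8+). The factorial values follow the examples above; the n = 5, r = 2 permutation and combination values are illustrative:

```python
import math

print(math.factorial(4))   # 4! = 24
print(math.factorial(7))   # 7! = 5040
print(math.perm(5, 2))     # permutations: 5!/(5 - 2)! = 20
print(math.comb(5, 2))     # combinations: 5!/(2!(5 - 2)!) = 10
```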
Binomial distribution:
Binomial distribution is a probability distribution which is obtained when the probability ‘p’ of the happening of an event is the same in all the trials and there are only two possible outcomes in each trial.
Conditions:
 Each trial results in one of two possible, mutually exclusive, outcomes. One of the possible outcomes is denoted (arbitrarily) as a success, and the other is denoted a failure.
 The probability of a success, denoted by p, remains constant from trial to trial. The probability of a failure (1 – p) is denoted by q.
 The trials are independent; that is, the outcome of any particular trial is not affected by the outcome of any other trial.
 The parameters (n & p) should be available.
Formula:
b (X: n, p) = ^{n}C_{x} p^{x} q^{n – x } (OR) f (x) = ^{n}C_{x} p^{x} q^{n – x}
Where
X = Random variable
n = Number of Trials
p = Probability of Success
q = Probability of Failure
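A minimal sketch of this formula in Python (the function name binom_pmf and the coin example are illustrative):

```python
from math import comb

def binom_pmf(x, n, p):
    """b(x; n, p) = nCx * p**x * q**(n - x), with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# e.g. probability of exactly 2 heads in 3 tosses of a fair coin
print(binom_pmf(2, 3, 0.5))  # 0.375, matching the 3/8 from the coin example
```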
NORMAL DISTRIBUTION
Definitions:
 The normal distribution is a pattern for the distribution of a set of data which follows a bell-shaped curve.
 A theoretical frequency distribution for a set of variable data, usually represented by a bell-shaped curve symmetrical about the mean.
The formula for this distribution was first published by Abraham De Moivre (1667–1754) on November 12, 1733. Many other mathematicians figure prominently in the history of the normal distribution, including Carl Friedrich Gauss (1777–1855).The distribution is frequently called the Gaussian distribution in recognition of his contributions.
The normal density is given by
f(x) = (1/(σ√(2π))) e^{–(x – µ)²/(2σ²)}, –∞ < x < ∞
 ‘π’ and ‘e’ (Euler’s constant) are the familiar constants, 3.14159 and 2.71828 respectively.
 The two parameters of the distribution are ‘µ’, the mean, and ‘σ’, the standard deviation.
Properties of Normal Distribution:
 Total area under a normal distribution curve is equal to 1.00
 Mean, median and mode all have same values (mean = median = mode) and located at the centre of the distribution.
 A normal distribution curve is bell-shaped, symmetric around the mean, and its skewness is zero (‘0’).
 A normal distribution curve is unimodal. (it has only one mode)
 Normal distributions are denser in the center and less dense in the tails.
 All normal curves are positive for all x. That is, f(x) > 0 for all x.
 The tails of the curve get closer and closer to the x-axis as they move away from the mean but never touch the x-axis.
 Continuous for all values of X between –∞ and +∞, so that each conceivable interval of real numbers has a probability other than zero.
 –∞ < X < +∞
 68% of the values fall within ±1 SD of the mean, 95% of values fall within ±2 SD of the mean, 99.7% of values fall within ±3 SD of the mean.
 The normal distribution is completely determined by the parameters ‘µ’and ‘σ’. Different values of µ shift the graph of the distribution along the xaxis. Whereas Different values of σ determine the degree of flatness or peakedness of the graph of the distribution. µ is often referred to as a location parameter and σ is often referred to as a shape parameter.
Why is the normal distribution useful?
 Many things actually are normally distributed, or very close to it. For example, height and intelligence are approximately normally distributed; measurement errors also often have a normal distribution
 The normal distribution is easy to work with mathematically. In many practical cases, the methods developed using normal theory work quite well even when the distribution is not normal.
 There is a very strong connection between the size of a sample N and the extent to which a sampling distribution approaches the normal form. Many sampling distributions based on large N can be approximated by the normal distribution even though the population distribution itself is definitely not normal.
The Standard Normal Distribution
Fisher and Yates tabulated a modified form of the normal distribution, known as the standard normal distribution or unit normal distribution because it has a mean of 0 and a standard deviation of 1. It may be obtained from the following equation by creating a new random variable.
z = (x – µ)/σ
The equation for the standard normal distribution is written
f(z) = (1/√(2π)) e^{–z²/2}, –∞ < z < ∞
The z-transformation will prove to be useful in the examples and applications that follow. This value of ‘z’ denotes, for a value of a random variable, the number of standard deviations that value falls above (+z) or below (–z) the mean, which in this case is 0.
RANDOM VARIABLE:
Any numerical quantity with specific characteristics, having a probability in the background
(OR)
A numerical quantity which has a specific probability. It is represented by ‘X’.
General Procedure
As you might suspect from the formula for the normal density function, it would be difficult and tedious to do the calculus every time we had a new set of parameters for µ and σ. So instead, we usually work with the standardized normal distribution, where µ = 0 and σ = 1, i.e. N (0,1). That is, rather than directly solve a problem involving a normally distributed variable X with mean µ and standard deviation σ, an indirect approach is used.
 We first convert the problem into an equivalent one dealing with a normal variable measured in standardized deviation units, called a standardized normal variable. To do this, if X ∼ N (µ, σ^{2}), then
 A table of standardized normal values can then be used to obtain an answer in terms of the converted problem.
 The interpretation of Z values is straightforward. Since σ = 1, if Z = 2, the corresponding X value is exactly 2 standard deviations above the mean. If Z = 1, the corresponding X value is one standard deviation below the mean. If Z = 0, X = the mean, i.e. µ.
Example of a zscore calculation: Suppose that patients’ heart rate follow a normal distribution with a mean of 72 & standard deviation of 8 b/ min. Find the probabilities if;
 Heart Rate is Greater Than 80 Or P(X > 80)
P(X > 80)
Data
X = 80
µ = 72
σ = 8
Z = (80 – 72)/8 = 8/8 = 1
P (Z > 1) = 1 – P (Z < 1) = 1 – 0.8413 = 0.1587
 Heart Rate is Lesser Than 90 Or P(X < 90)
Data
X = 90
µ = 72
σ = 8
P(X < 90)
Z = (90 – 72)/8 = 18/8 = 2.25
P (Z < 2.25) = 0.9878
 Heart Rate is Between 75and 95 Or P(75 <X < 95)
Data
X_{1} = 75
X_{2} = 95
µ = 72
σ = 8
Z_{1} = (X_{1} – µ)/σ = (75 – 72)/8 = 3/8 = 0.37
Z_{2} = (X_{2} – µ)/σ = (95 – 72)/8 = 23/8 = 2.87
P (0.37 < Z < 2.87) = P (Z < 2.87) – P (Z < 0.37) = 0.9979 – 0.6443 = 0.3536
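All three heart-rate probabilities can be checked with the standard library's statistics.NormalDist (Python 3.8+). Note that keeping the z values unrounded (0.375 and 2.875 instead of 0.37 and 2.87) gives about 0.352 for the third probability, slightly different from the table-based 0.3536:

```python
from statistics import NormalDist

hr = NormalDist(mu=72, sigma=8)             # heart rate ~ N(72, 8)

print(round(1 - hr.cdf(80), 4))             # P(X > 80) = 0.1587
print(round(hr.cdf(90), 4))                 # P(X < 90) = 0.9878
print(round(hr.cdf(95) - hr.cdf(75), 4))    # P(75 < X < 95), about 0.352
```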
SAMPLING:
A set of data or elements drawn from a larger population and analyzed to estimate the characteristics of that population is called sample. And the process of selecting a sample from a population is called sampling.
OR
Procedure by which some members of a given population are selected as representatives of the entire population
TYPES OF SAMPLING
There are two types of sampling
 Probability sampling
 Nonprobability sampling
 Probability Sampling:
A sampling technique in which each member of the population has an equal chance of being chosen is called probability sampling.
There are four types of probability sampling
 Simple random sampling
 Systematic sampling
 Stratified sampling
 Cluster sampling
 Simple Random Sampling
A probability sampling technique in which, each person in the population has an equal chance of being chosen for the sample and every collection of persons of the same size has an equal chance of becoming the actual sample.
 Systematic Sampling
A sample constructed by selecting every kth element in the sampling frame.
Number the units in the population from 1 to N; decide on the n (sample size) that you want or need; compute the interval size k = N/n; randomly select an integer between 1 and k; then take every kth unit.
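A sketch of those steps in Python (the frame of N = 100 units and sample size n = 10 are illustrative):

```python
import random

def systematic_sample(frame, n):
    """Systematic sampling: interval k = N // n, random start within the
    first k units, then every k-th unit thereafter."""
    k = len(frame) // n
    start = random.randint(0, k - 1)       # random start (0-based) within 1..k
    return frame[start::k][:n]

frame = list(range(1, 101))                # sampling frame numbered 1..N
sample = systematic_sample(frame, 10)      # k = 100 // 10 = 10
print(sample)
```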
 Stratified Random Sampling.
Is obtained by separating the population elements into non overlapping groups, called strata, and then selecting a simple random sample from each stratum.
 Cluster Sampling.
A simple random sample in which each sampling unit is a collection, or cluster, of elements. For example, an investigator wishing to study students might first sample groups or clusters of students, such as classes, and then select the final sample of students from among clusters. Also called area sampling.
 NonProbability Sampling
Nonprobability sampling is a sampling technique where the samples are gathered in a process that does not give all the individuals in the population equal chances of being selected.
It decreases a sample’s representativeness of a population.
Type of Nonprobability sampling
Following are the common types of nonprobability sampling:
 Convenience sampling
 Quota Sampling
 Purposive/ judgmental sampling
 Network/ snowball Sampling
 Convenience Sampling:
The members of the population are chosen based on their relative ease of access. Such samples are biased because researchers may unconsciously approach some kinds of respondents and avoid others.
 Quota Sampling
It is the nonprobability version of stratified sampling. Like stratified sampling, the researcher first identifies the stratums and their proportions as they are represented in the population. Then convenience or judgment sampling is used to select the required number of subjects from each stratum. This differs from stratified sampling, where the stratums are filled by random sampling.
 Purposive Sampling.
It is a common nonprobability method. The researcher uses his or her own judgment about which respondents to choose, and picks those who best meet the purposes of the study.
 Snowball Sampling
It is a special nonprobability method used when the desired sample characteristic is rare. It may be extremely difficult or cost prohibitive to locate respondents in these situations. Snowball sampling relies on referrals from initial subjects to generate additional subjects. While this technique can dramatically lower search costs, it comes at the expense of introducing bias because the technique itself reduces the likelihood that the sample will represent a good cross section from the population.
INFERENTIAL STATISTICS
Statistical inference is the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample drawn from that population. It consists of two techniques:
 Estimation of parameters
 Hypothesis testing
ESTIMATION OF PARAMETERS
The process of estimation entails calculating, from the data of a sample, some statistic that is offered as an approximation of the corresponding parameter of the population from which the sample was drawn.
Parameter estimation is used to estimate a single parameter, like a mean.
There are two types of estimates
 Point Estimates
 Interval Estimates (Confidence Interval).
POINT ESTIMATES
A point estimate is a single numerical value used to estimate the corresponding population parameter.
For example: the sample mean x̄ is a point estimate of the population mean μ; the sample variance S^{2} is a point estimate of the population variance σ^{2}. These are point estimates: a single-valued guess of the parametric value.
A good estimator must satisfy three conditions:
 Unbiased: The expected value of the estimator must be equal to the mean of the parameter
 Consistent: The value of the estimator approaches the value of the parameter as the sample size increases
 Relatively Efficient: The estimator has the smallest variance of all estimators which could be used
CONFIDENCE INTERVAL (Interval Estimates)
An interval estimate consists of two numerical values defining a range of values that, with a specified degree of confidence, most likely includes the parameter being estimated.
Interval estimation of a parameter is more useful because it indicates a range of values within which the parameter has a specified probability of lying. With interval estimation, researchers construct a confidence interval around the estimate; the upper and lower limits are called confidence limits.
Interval estimates provide a range of values for a parameter value, within which we have a stated degree of confidence that the parameter lies. A numeric range, based on a statistic and its sampling distribution that contains the population parameter of interest with a specified probability.
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data
Calculating confidence interval when n ≥ 30 (Single Population Mean)
Example: A random sample of size 64 with mean 25 & Standard Deviation 4 is taken from a normal population. Construct 95 % confidence interval
We use the following formula to solve the confidence interval when n ≥ 30:
x̄ ± z_{α/2} (σ/√n)
Data
x̄ = 25
σ = 4
n = 64
25 ± (4/√64) × 1.96
25 ± (4/8) × 1.96
25 ± 0.5 × 1.96
25 ± 0.98
25 – 0.98 ≤ µ ≤ 25 + 0.98
24.02 ≤ µ ≤ 25.98
We are 95% confident that population mean (µ) will have value between 24.02 & 25.98
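The large-sample interval above is a one-line computation; a sketch in Python:

```python
import math

xbar, sigma, n = 25, 4, 64
z = 1.96                                   # z value for 95% confidence
margin = z * sigma / math.sqrt(n)          # 1.96 * 4/8 = 0.98
print(round(xbar - margin, 2), round(xbar + margin, 2))  # 24.02 25.98
```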
Calculating confidence interval when n < 30 (Single Population Mean)
Example: A random sample of size 9 with mean 25 & Standard Deviation 4 is taken from a normal population. Construct 95 % confidence interval
We use the following formula to solve the confidence interval when n < 30:
x̄ ± t_{α/2,df} (S/√n)
Data
x̄ = 25
S = 4
n = 9
α/2 = 0.025
df = n – 1 = 9 – 1 = 8
t_{α/2,df} = 2.306
25 ± 4/√9 x 2.306
25 ± 4/3 x 2.306
25 ± 1.33 x 2.306
25 ± 3.07
25 – 3.07 ≤ µ ≤ 25 + 3.07
21.93 ≤ µ ≤ 28.07
We are 95% confident that population mean (µ) will have value between 21.93 & 28.07
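The small-sample interval can be sketched the same way, with the tabulated t value in place of z:

```python
import math

xbar, s, n = 25, 4, 9
t = 2.306                                  # t value for alpha/2 = 0.025, df = 8
margin = t * s / math.sqrt(n)              # 2.306 * 4/3 = 3.07 (approx.)
print(round(xbar - margin, 2), round(xbar + margin, 2))  # 21.93 28.07
```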
Hypothesis:
A hypothesis may be defined simply as a statement about one or more populations. It is frequently concerned with the parameters of the populations about which the statement is made.
Types of Hypotheses
Researchers are concerned with two types of hypotheses
 Research hypotheses
The research hypothesis is the conjecture or supposition that motivates the research. It may be the result of years of observation on the part of the researcher.
 Statistical hypotheses
Statistical hypotheses are hypotheses that are stated in such a way that they may be evaluated by appropriate statistical techniques.
Types of statistical Hypothesis
There are two statistical hypotheses involved in hypothesis testing, and these should be stated explicitly.
 Null Hypothesis:
The null hypothesis is the hypothesis to be tested. It is designated by the symbol H_{o.} The null hypothesis is sometimes referred to as a hypothesis of no difference, since it is a statement of agreement with (or no difference from) conditions presumed to be true in the population of interest.
In general, the null hypothesis is set up for the express purpose of being discredited. Consequently, the complement of the conclusion that the researcher is seeking to reach becomes the statement of the null hypothesis. In the testing process the null hypothesis either is rejected or is not rejected. If the null hypothesis is not rejected, we will say that the data on which the test is based do not provide sufficient evidence to cause rejection. If the testing procedure leads to rejection, we will say that the data at hand are not compatible with the null hypothesis, but are supportive of some other hypothesis.
 Alternative Hypothesis
The alternative hypothesis is a statement of what we will believe is true if our sample data cause us to reject the null hypothesis. Usually the alternative hypothesis and the research hypothesis are the same, and in fact the two terms are used interchangeably. We shall designate the alternative hypothesis by the symbol H_{A} or H_{1}.
LEVEL OF SIGNIFICANCE
The level of significance is a probability and, in fact, is the probability of rejecting a true null hypothesis. The level of significance specifies the area under the curve of the distribution of the test statistic that is above the values on the horizontal axis constituting the rejection region. It is denoted by ‘α’.
Types of Error
In the context of testing of hypotheses, there are basically two types of errors:
 TYPE I Error
 TYPE II Error
Type I Error
 A type I error, also known as an error of the first kind, occurs when the null hypothesis (H_{0}) is true, but is rejected.
 A type I error may be compared with a so-called false positive.
 The rate of the type I error is called the size of the test and denoted by the Greek letter α (alpha).
 It usually equals the significance level of a test.
 If type I error is fixed at 5 %, it means that there are about 5 chances in 100 that we will reject H_{0} when H_{0} is true.
Type II Error
 Type II error, also known as an error of the second kind, occurs when the null hypothesis is false, but erroneously fails to be rejected.
 Type II error means accepting the hypothesis which should have been rejected.
 A Type II error is committed when we fail to believe a truth.
 A type II error occurs when one rejects the alternative hypothesis (fails to reject the null hypothesis) when the alternative hypothesis is true.
 The rate of the type II error is denoted by the Greek letter β (beta) and is related to the power of a test (which equals 1 – β).
In tabular form the two errors can be presented as follows:

                                 Null hypothesis (H_{0}) is true    Null hypothesis (H_{0}) is false
Reject null hypothesis           Type I error                       Correct outcome
Fail to reject null hypothesis   Correct outcome                    Type II error
Graphical depiction of the relation between Type I and Type II errors
What are the differences between Type 1 errors and Type 2 errors?
Type 1 Error: rejecting the null hypothesis when it is true (a false positive); its rate is the significance level α.
Type 2 Error: failing to reject the null hypothesis when it is false (a false negative); its rate is β, and power = 1 – β.
Reducing Type I Errors
 Prescriptive testing is used to increase the level of confidence, which in turn reduces Type I errors. The chances of making a Type I error are reduced by increasing the level of confidence.
Reducing Type II Errors
 Descriptive testing is used to better describe the test condition and acceptance criteria, which in turn reduces Type II errors. This increases the number of times we reject the null hypothesis, with a resulting increase in the number of Type I errors (rejecting H_{0} when it was really true and should not have been rejected).
 Therefore, reducing one type of error comes at the expense of increasing the other type of error! The same means cannot reduce both types of errors simultaneously.
Power of Test:
Statistical power is defined as the probability of rejecting the null hypothesis when the alternative hypothesis is true.
Power = P(reject H_{0}  H_{1} is true)
= 1 – P(type II error)
= 1 – β
That is, the power of a hypothesis test is the probability that it will reject when it’s supposed to.
[Figure: sampling distributions of the test statistic under H_{0} and H_{1}; the power is the area under the H_{1} curve falling in the rejection region.]
Factors that affect statistical power include
 The sample size
 The specification of the parameter(s) in the null and alternative hypothesis, i.e. how far they are from each other, the precision or uncertainty the researcher allows for the study (generally the confidence or significance level)
 The distribution of the parameter to be estimated. For example, if a researcher knows that the statistics in the study follow a Z or standard normal distribution, there are two parameters to estimate: the population mean (μ) and the population variance (σ^{2}). Most of the time, the researcher knows one of the parameters and needs to estimate the other. If that is not the case, some other distribution may be used; for example, if the researcher does not know the population variance, he/she can estimate it using the sample variance, which leads to using a t distribution.
Application:
In research, statistical power is generally calculated for two purposes.
 It can be calculated before data collection based on information from previous research to decide the sample size needed for the study.
 It can also be calculated after data analysis. It usually happens when the result turns out to be nonsignificant. In this case, statistical power is calculated to verify whether the nonsignificant result is due to really no relation in the sample or due to a lack of statistical power.
Relation with sample size:
Statistical power is positively correlated with the sample size: given the levels of the other factors, a larger sample size gives greater power. However, researchers must also distinguish between statistical difference and scientific difference. Although a larger sample size enables researchers to find a smaller difference statistically significant, that difference may not be large enough to be scientifically meaningful. Therefore, it is recommended that researchers decide what they would consider a scientifically meaningful difference before doing a power analysis to determine the actual sample size needed.
HYPOTHESIS TESTING
Statistical hypothesis testing provides objective criteria for deciding whether hypotheses are supported by empirical evidence.
The purpose of hypothesis testing is to aid the clinician, researcher, or administrator in reaching a conclusion concerning a population by examining a sample from that population.
STEPS IN STATISTICAL HYPOTHESIS TESTING
Step # 01: State the Null hypothesis and Alternative hypothesis.
The alternative hypothesis represents what the researcher is trying to prove. The null hypothesis represents the negation of what the researcher is trying to prove.
Step # 02: State the significance level, α (0.01, 0.05, or 0.1), for the test
The significance level is the probability of making a Type I error. A Type I Error is a decision in favor of the alternative hypothesis when, in fact, the null hypothesis is true.
Type II Error is a decision to fail to reject the null hypothesis when, in fact, the null hypothesis is false.
Step # 03: State the test statistic that will be used to conduct the hypothesis test
The appropriate test statistic for the hypothesis test (e.g. t-test, z-test, ANOVA, Chi-square) is stated in this step.
Step # 04: Computation/ calculation of test statistic
The chosen test statistic (e.g. t-test, z-test, ANOVA, Chi-square) is computed in this step.
Step # 05: Find Critical Value or Rejection (critical) Region of the test
Use the value of α (0.01, 0.05, or 0.1) from Step # 02 and the distribution of the test statistics from Step # 03.
Step # 06: Conclusion (Making statistical decision and interpretation of results)
If the calculated value of the test statistic falls in the rejection (critical) region, the null hypothesis is rejected; if it falls in the acceptance (noncritical) region, the null hypothesis is not rejected, i.e. it is accepted.
Note: If we conclude on the basis of the p-value, we compare the calculated p-value to the chosen level of significance. If the p-value is less than α, the null hypothesis is rejected and the alternative is affirmed. If the p-value is greater than α, the null hypothesis is not rejected.
If the decision is to reject, the statement of the conclusion should read as follows: “we reject at the _______ level of significance. There is sufficient evidence to conclude that (statement of alternative hypothesis.)”
If the decision is to fail to reject, the statement of the conclusion should read as follows: “we fail to reject at the _______ level of significance. There is not sufficient evidence to conclude that (statement of alternative hypothesis.)”
Rules for Stating Statistical Hypotheses
When hypotheses are stated, an indication of equality (either = ,≤ or ≥ ) must appear in the null hypothesis.
Example:
We want to answer the question: Can we conclude that a certain population mean is not 50? The null hypothesis is
H_{o} : µ = 50
And the alternative is
H_{A} : µ ≠ 50
Suppose we want to know if we can conclude that the population mean is greater than
50. Our hypotheses are
H_{o}: µ ≤ 50
H_{A}: µ > 50
If we want to know if we can conclude that the population mean is less than 50, the hypotheses are
H_{o} : µ ≥ 50
H_{A}: µ < 50
We may state the following rules of thumb for deciding what statement goes in the null hypothesis and what statement goes in the alternative hypothesis:
 What you hope or expect to be able to conclude as a result of the test usually should be placed in the alternative hypothesis.
 The null hypothesis should contain a statement of equality, either = ,≤ or ≥.
 The null hypothesis is the hypothesis that is tested.
 The null and alternative hypotheses are complementary. That is, the two together exhaust all possibilities regarding the value that the hypothesized parameter can assume.
T TEST
The t-test is used to test hypotheses about μ when the population standard deviation is unknown, and the sample size may be small (n < 30).
The t distribution is symmetrical, bell-shaped, and similar to the normal distribution but more spread out.
Calculating one sample ttest
Example: A random sample of size 16 with mean 25 and standard deviation 5 is taken from a normal population. Test at the 5% LOS that:
H_{o}: µ = 22
H_{A}: µ ≠ 22
SOLUTION
Step # 01: State the Null hypothesis and Alternative hypothesis.
H_{o}: µ = 22
H_{A}: µ ≠ 22
Step # 02: State the significance level
α = 0.05 or 5% Level of Significance
Step # 03: State the test statistic (n<30)
t-test statistic: t = (x̄ − µ) / (S/√n)
Step # 04: Computation/ calculation of test statistic
Data
x̄ = 25
µ = 22
S = 5
n = 16
t _{calculated} = 2.4
Step # 05: Find Critical Value or Rejection (critical) Region
For the critical value we find α/2 and the degrees of freedom v = n − 1, then read the critical value from the t-distribution table.
Critical value = α/2 (v = 16 − 1)
= 0.05/2 (v = 15)
= (0.025, 15)
t _{tabulated }= ± 2.131
t _{calculated} = 2.4
Step # 06: Conclusion: Since t _{calculated} = 2.4 falls in the region of rejection therefore we reject at the 5% level of significance. There is sufficient evidence to conclude that Population mean is not equal to 22.
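The six steps of this worked t-test can be checked with a short script (a sketch assuming scipy is available):

```python
from math import sqrt
from scipy.stats import t

# Sample values from the example: n = 16, mean 25, S = 5, H0: mu = 22
n, xbar, mu0, s = 16, 25, 22, 5

t_calc = (xbar - mu0) / (s / sqrt(n))   # (25 - 22) / (5/4) = 2.4
t_tab = t.ppf(1 - 0.05 / 2, df=n - 1)   # two-sided critical value, 15 d.f., about 2.131

print(abs(t_calc) > t_tab)  # True -> reject H0
```

Since 2.4 exceeds the tabulated 2.131, the script reproduces the rejection of H_{o}.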
Z TEST
• The Z-test is applied when the distribution is normal and the population standard deviation σ is known, or when the sample size n is large (n ≥ 30) and σ is unknown (taking S as an estimator of σ).
• The Z-test is used to test hypotheses about μ when the population standard deviation is known and the population distribution is normal, or when the sample size is large (n ≥ 30).
Calculating one sample ztest
Example: A random sample of size 49 with mean 32 is taken from a normal population whose standard deviation is 4. Test at the 5% LOS that:
H_{o}: µ = 25
H_{A}: µ ≠ 25
SOLUTION
Step # 01: H_{o}: µ = 25
H_{A}: µ ≠ 25
Step # 02: α = 0.05
Step # 03: Since σ is known (and n ≥ 30), we apply the z-test statistic: Z = (x̄ − µ) / (σ/√n)
Step # 04: Calculation of test statistic
Data
x̄ = 32
µ = 25
σ = 4
n = 49
Z_{calculated} = (32 − 25) / (4/√49) = 12.25
Step # 05: Find Critical Value or Rejection (critical) Region
Critical Value (5%) (2tail) = ±1.96
Z_{calculated} = 12.25
Step # 06: Conclusion: Since Z_{calculated} = 12.25 falls in the region of rejection, we reject at the 5% level of significance. There is sufficient evidence to conclude that the population mean is not equal to 25.
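The z-test example can be verified the same way (again assuming scipy is available); note that (32 − 25)/(4/√49) works out to 12.25:

```python
from math import sqrt
from scipy.stats import norm

# Sample values from the example: n = 49, mean 32, sigma = 4, H0: mu = 25
n, xbar, mu0, sigma = 49, 32, 25, 4

z_calc = (xbar - mu0) / (sigma / sqrt(n))  # 7 / (4/7) = 12.25
z_tab = norm.ppf(1 - 0.05 / 2)             # about 1.96 for a two-sided 5% test

print(z_calc > z_tab)  # True -> reject H0
```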
CHISQUARE
A statistic which measures the discrepancy (difference) between k observed frequencies f_{o}1, f_{o}2, …, f_{o}k and the corresponding expected frequencies f_{e}1, f_{e}2, …, f_{e}k.
The chi-square test is useful in making statistical inferences about categorical data in which there are two or more categories.
Characteristics
• Every χ2 distribution extends indefinitely to the right from 0.
• Every χ2 distribution has only one (right-sided) tail.
• As df increases, the χ2 curves get more bell-shaped and approach the normal curve in appearance (but remember that a chi-square curve starts at 0, not at −∞).
Calculating ChiSquare
Example 1: A U.S. census determined that doctors practiced in four specialty areas in the following proportions:
Specialty          |    %    | Probability
General Practice   |   18%   |   0.18
Medical            |  33.9%  |   0.339
Surgical           |   27%   |   0.27
Others             |  21.1%  |   0.211
Total              |  100%   |   1.000
A researcher conducted a survey after 5 years to check this data for changes. He selected 500 doctors and asked their specialty. The results were:
Specialty          | Frequency
General Practice   |    80
Medical            |   162
Surgical           |   156
Others             |   102
Total              |   500
Hypothesis testing:
Step # 01:
Null Hypothesis (H_{o}):
There is no difference in specialty distribution; (or) the current specialty distribution of U.S. physicians is the same as that declared in the census.
Alternative Hypothesis (H_{A}):
There is a difference in the specialty distribution of U.S. doctors; (or) the current specialty distribution of U.S. physicians differs from that declared in the census.
Step 02: Level of Significance
α = 0.05
Step # 03: Chi-square test statistic: χ^{2} = Σ (f_{o} − f_{e})^{2} / f_{e}
Step # 04:
Statistical Calculation
f_{e}(80) = 18% × 500 = 90
f_{e}(162) = 33.9% × 500 = 169.5
f_{e}(156) = 27% × 500 = 135
f_{e}(102) = 21.1% × 500 = 105.5
S # | Specialty        | f_{o} | f_{e}  | (f_{o} – f_{e}) | (f_{o} – f_{e})^{2} | (f_{o} – f_{e})^{2}/f_{e}
1   | General Practice |  80   |  90    |   −10           |   100               |  1.11
2   | Medical          | 162   | 169.5  |   −7.5          |   56.25             |  0.33
3   | Surgical         | 156   | 135    |    21           |   441               |  3.26
4   | Others           | 102   | 105.5  |   −3.5          |   12.25             |  0.116
    |                  |       |        |                 |   Total             |  4.816
χ^{2}_{cal} = Σ (f_{o} − f_{e})^{2} / f_{e} = 4.816
Step # 05:
Find the critical region using the χ^{2} (chi-square) distribution table
χ^{2}_{tab} = χ^{2}(α, d.f.) = χ^{2}(0.05, 3) = 7.815
(d.f. = n − 1 = 4 − 1 = 3, where n is the number of categories)
Step # 06:
Conclusion: Since the χ^{2}_{cal} value (4.816 < 7.815) lies in the region of acceptance, we accept H_{O} and reject H_{A}. There is no difference in specialty distribution among U.S. doctors.
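The goodness-of-fit example can be reproduced with scipy (a sketch assuming scipy is available). Carrying full precision in each term gives a statistic of about 4.826 rather than the 4.816 obtained from the rounded terms above; the conclusion is unchanged:

```python
from scipy.stats import chisquare, chi2

observed = [80, 162, 156, 102]
expected = [0.18 * 500, 0.339 * 500, 0.27 * 500, 0.211 * 500]  # 90, 169.5, 135, 105.5

stat, p = chisquare(observed, f_exp=expected)      # full-precision statistic ~ 4.826
critical = chi2.ppf(1 - 0.05, df=len(observed) - 1)  # 7.815 for 3 d.f.

print(stat < critical)  # True -> fail to reject H0
```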
Example 2: A sample of 150 chronic carriers of a certain antigen and a sample of 500 noncarriers revealed the following blood group distributions. Can one conclude from these data that the two populations from which the samples were drawn differ with respect to blood group distribution? Let α = 0.05.
Blood Group | Carriers | Noncarriers | Total
O           |    72    |     230     |  302
A           |    54    |     192     |  246
B           |    16    |      63     |   79
AB          |     8    |      15     |   23
Total       |   150    |     500     |  650
Hypothesis Testing
Step # 01: H_{O}: There is no association between antigen status and blood group
H_{A}: There is some association between antigen status and blood group
Step # 02:α = 0.05
Step # 03: Chi-square test statistic: χ^{2} = Σ (f_{o} − f_{e})^{2} / f_{e}
Step # 04:
Calculation
f_{e}(72) = 302 × 150/650 = 70
f_{e}(230) = 302 × 500/650 = 232
f_{e}(54) = 246 × 150/650 = 57
f_{e}(192) = 246 × 500/650 = 189
f_{e}(16) = 79 × 150/650 = 18
f_{e}(63) = 79 × 500/650 = 61
f_{e}(8) = 23 × 150/650 = 5
f_{e}(15) = 23 × 500/650 = 18
f_{o} | f_{e} | (f_{o} – f_{e}) | (f_{o} – f_{e})^{2} | (f_{o} – f_{e})^{2}/f_{e}
  72  |  70   |    2            |   4                 | 0.0571
 230  | 232   |   −2            |   4                 | 0.0172
  54  |  57   |   −3            |   9                 | 0.1578
 192  | 189   |    3            |   9                 | 0.0476
  16  |  18   |   −2            |   4                 | 0.2222
  63  |  61   |    2            |   4                 | 0.0655
   8  |   5   |    3            |   9                 | 1.8
  15  |  18   |   −3            |   9                 | 0.5
      |       |                 |   Total             | 2.8674
χ^{2}_{cal} = Σ (f_{o} − f_{e})^{2} / f_{e} = 2.8674
Step # 05:
Find the critical region using the χ^{2} (chi-square) distribution table
χ^{2}_{tab} = χ^{2}(α, d.f.) = χ^{2}(0.05, 3) = 7.815, where d.f. = (r − 1)(c − 1) = (4 − 1)(2 − 1) = 3
Step # 06:
Conclusion: Since the χ^{2}_{cal} value (2.8674 < 7.815) lies in the region of acceptance, we accept H_{O} and reject H_{A}. This means there is no association between antigen status and blood group.
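The independence test can likewise be run with scipy's chi2_contingency (a sketch assuming scipy is available). scipy keeps the expected counts unrounded, so its statistic (about 2.41) is a little smaller than the hand value 2.8674 computed from rounded expected counts; the conclusion is the same:

```python
from scipy.stats import chi2_contingency

# Observed counts: rows = blood groups O, A, B, AB; columns = carriers, noncarriers
table = [[72, 230],
         [54, 192],
         [16, 63],
         [8, 15]]

stat, p, dof, expected = chi2_contingency(table)

print(dof)           # (4 - 1)(2 - 1) = 3
print(stat < 7.815)  # True -> fail to reject H0
```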
WHAT IS TEST OF SIGNIFICANCE? WHY IT IS NECESSARY? MENTION NAMES OF IMPORTANT TESTS.
1. Test of significance
A procedure used to establish the validity of a claim by determining whether or not the test statistic falls in the critical region. If it does, the results are referred to as significant. This test is sometimes called the hypothesis test.
The methods of inference used to support or reject claims based on sample data are known as tests of significance.
Why it is necessary
A significance test is performed;
• To determine whether an observed value of a statistic differs enough from a hypothesized value of a parameter
• To draw the inference that the hypothesized value of the parameter is not the true value. The hypothesized value of the parameter is called the “null hypothesis.”
Types of test of significance
• Parametric
  • t-test (one sample & two sample)
  • z-test (one sample & two sample)
  • F-test
• Nonparametric
  • Chi-square test
  • Mann-Whitney U test
  • Coefficient of concordance (W)
  • Median test
  • Kruskal-Wallis test
  • Friedman test
  • Rank difference methods (Spearman's rho and Kendall's tau)
P –Value:
A p-value is the probability that the computed value of a test statistic is at least as extreme as the observed value of the test statistic when the null hypothesis is true.
Equivalently, the p-value for a test may be defined as the smallest value of α for which the null hypothesis can be rejected.
The p value is a number that tells us how unusual our sample results are, given that the null hypothesis is true. A p value indicating that the sample results are not likely to have occurred, if the null hypothesis is true, provides justification for doubting the truth of the null hypothesis.
Test Decisions with pvalue
The decision about whether there is enough evidence to reject the null hypothesis can be made by comparing the pvalues to the value of α, the level of significance of the test.
A general rule worth remembering is:
• If the p-value is less than or equal to α, we reject the null hypothesis.
• If the p-value is greater than α, we do not reject the null hypothesis.
If p-value ≤ α → reject the null hypothesis
If p-value > α → fail to reject the null hypothesis
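The decision rule in the table above can be written as a small helper function (a minimal sketch):

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule: reject H0 when p <= alpha."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

print(decide(0.03))  # reject the null hypothesis
print(decide(0.20))  # fail to reject the null hypothesis
```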
Observational Study:
An observational study is a scientific investigation in which neither the subjects under study nor any of the variables of interest are manipulated in any way.
An observational study, in other words, may be defined simply as an investigation that is not an experiment. The simplest form of observational study is one in which there are only two variables of interest. One of the variables is called the risk factor, or independent variable, and the other variable is referred to as the outcome, or dependent variable.
Risk Factor:
The term risk factor is used to designate a variable that is thought to be related to some outcome variable. The risk factor may be a suspected cause of some specific state of the outcome variable.
Types of Observational Studies
There are two basic types of observational studies, prospective studies and retrospective studies.
Prospective Study:
A prospective study is an observational study in which two random samples of subjects are selected. One sample consists of subjects who possess the risk factor, and the other sample consists of subjects who do not possess the risk factor. The subjects are followed into the future (that is, they are followed prospectively), and a record is kept on the number of subjects in each sample who, at some point in time, are classifiable into each of the categories of the outcome variable.
The data resulting from a prospective study involving two dichotomous variables can be displayed in a 2 × 2 contingency table that usually provides information regarding the number of subjects with and without the risk factor and the number who did and did not develop the outcome of interest.
Retrospective Study:
A retrospective study is the reverse of a prospective study. The samples are selected from those falling into the categories of the outcome variable. The investigator then looks back (that is, takes a retrospective look) at the subjects and determines which ones have (or had) and which ones do not have (or did not have) the risk factor.
From the data of a retrospective study we may construct a contingency table
Relative Risk:
Relative risk is the ratio of the risk of developing a disease among subjects with the risk factor to the risk of developing the disease among subjects without the risk factor.
We represent the relative risk from a prospective study symbolically as
RR̂ = [a/(a + b)] / [c/(c + d)]
where a and b are the numbers of subjects with the risk factor who do and do not develop the disease, and c and d are the corresponding numbers among subjects without the risk factor.
We may construct a confidence interval for RR
100(1 − α)% CI = RR̂^{1 ± z_{α}/√X²}
where z_{α} is the two-sided z value corresponding to the chosen confidence coefficient and X^{2} is computed from the 2 × 2 table by the chi-square equation.
Interpretation of RR
 The value of RR may range anywhere between zero and infinity.
 A value of 1 indicates that there is no association between the status of the risk factor and the status of the dependent variable.
 A value of RR greater than 1 indicates that the risk of acquiring the disease is greater among subjects with the risk factor than among subjects without the risk factor.
 An RR value that is less than 1 indicates less risk of acquiring the disease among subjects with the risk factor than among subjects without the risk factor.
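The RR computation can be sketched with the usual 2 × 2 layout, where a and b are exposed subjects with and without the disease and c and d are the unexposed counterparts. The counts below are hypothetical, not from any study cited here:

```python
def relative_risk(a, b, c, d):
    """RR = risk among exposed / risk among unexposed."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

# Hypothetical prospective study: 30/100 exposed and 15/100 unexposed develop disease
print(relative_risk(30, 70, 15, 85))  # 2.0 -> exposed risk is twice the unexposed risk
```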
EXAMPLE
In a prospective study of pregnant women, Magann et al. (A16) collected extensive information on exercise level of lowrisk pregnant working women. A group of 217 women did no voluntary or mandatory exercise during the pregnancy, while a group of
238 women exercised extensively. One outcome variable of interest was experiencing preterm labor. The results are summarized in Table
Estimate the relative risk of preterm labor when pregnant women exercise extensively.
Solution:
By Equation, RR̂ = 1.1
These data indicate that the risk of experiencing preterm labor when a woman exercises heavily is 1.1 times as great as it is among women who do not exercise at all.
Confidence Interval for RR
We compute the 95 percent confidence interval for RR as follows.
The lower and upper confidence limits are, respectively
Lower limit = 0.65 and upper limit = 1.86
Conclusion:
Since the interval includes 1, we conclude, at the .05 level of significance, that the population risk may be 1. In other words, we conclude that, in the population, there may not be an increased risk of experiencing preterm labor when a pregnant woman exercises extensively.
Odds Ratio
An odds ratio (OR) is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure.
It is the appropriate measure for comparing cases and controls in a retrospective study.
Odds:
The odds for success are the ratio of the probability of success to the probability of failure.
Two odds that we can calculate from data displayed as in contingency Table of retrospective study
 The odds of being a case (having the disease) to being a control (not having the disease) among subjects with the risk factor is [a/ (a +b)] / [b/ (a + b)] = a/b
 The odds of being a case (having the disease) to being a control (not having the disease) among subjects without the risk factor is [c/(c +d)] / [d/(c + d)] = c/d
The estimate of the population odds ratio is
OR̂ = (a/b) / (c/d) = ad/bc
We may construct a confidence interval for OR by the following method:
100(1 − α)% CI = OR̂^{1 ± z_{α}/√X²}
where z_{α} is the two-sided z value corresponding to the chosen confidence coefficient and X^{2} is computed from the 2 × 2 table by the chi-square equation.
Interpretation of the Odds Ratio:
In the case of a rare disease, the population odds ratio provides a good approximation to the population relative risk. Consequently, the sample odds ratio, being an estimate of the population odds ratio, provides an indirect estimate of the population relative risk in the case of a rare disease.
 The odds ratio can assume values between zero and ∞.
 A value of 1 indicates no association between the risk factor and disease status.
 A value less than 1 indicates reduced odds of the disease among subjects with the risk factor.
 A value greater than 1 indicates increased odds of having the disease among subjects in whom the risk factor is present.
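A sketch of the OR computation from a 2 × 2 retrospective table (a = exposed cases, b = exposed controls, c = unexposed cases, d = unexposed controls); the counts below are hypothetical:

```python
def odds_ratio(a, b, c, d):
    """OR = (a/b) / (c/d) = ad / bc for a 2 x 2 retrospective table."""
    return (a * d) / (b * c)

# Hypothetical case-control study: 40 exposed cases, 10 exposed controls,
# 60 unexposed cases, 90 unexposed controls
print(odds_ratio(40, 10, 60, 90))  # 6.0 -> cases have 6 times the odds of exposure
```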
EXAMPLE
Toschke et al. (A17) collected data on obesity status of children ages 5–6 years and the smoking status of the mother during the pregnancy. Table below shows 3970 subjects classified as cases or noncases of obesity and also classified according to smoking status of the mother during pregnancy (the risk factor).
We wish to compare the odds of obesity at ages 5–6 among those whose mother smoked throughout the pregnancy with the odds of obesity at age 5–6 among those whose mother did not smoke during pregnancy.
Solution
By formula: OR̂ = 9.62
We see that obese children (cases) are 9.62 times as likely as nonobese children (noncases) to have had a mother who smoked throughout the pregnancy.
We compute the 95 percent confidence interval for OR as follows.
The lower and upper confidence limits for the population OR, respectively, are
Lower limit = 7.12 and upper limit = 13.00
We conclude with 95 percent confidence that the population OR is somewhere between 7.12 and 13.00. Because the interval does not include 1, we conclude that, in the population, obese children (cases) are more likely than nonobese children (noncases) to have had a mother who smoked throughout the pregnancy.