Statistical analysis of data
Summary
Statistical analysis is one of the principal tools employed in epidemiology, which is primarily concerned with the study of health and disease in populations. Statistics is the science of collecting, analyzing, and interpreting data, and a good epidemiological study depends on statistical methods being employed correctly. At the same time, flaws in study design can affect statistics and lead to incorrect conclusions. Descriptive statistics measure, describe, and summarize features of a collection of data/sample without making inferences that go beyond the scope of that collection/sample. Common measures of descriptive statistics are those of central tendency and dispersion. Measures of central tendency describe the central distribution of data and include the mode, median, and mean. Measures of dispersion describe how data is distributed and include range, quartiles, variance, and deviation. The counterpart of descriptive statistics, inferential statistics, relies on data to make inferences that do go beyond the scope of the data collected and the sample from which it was obtained. Inferential statistics involves parameters such as sensitivity, specificity, positive/negative predictive values, confidence intervals, and hypothesis testing.
The values used to describe features of a sample or data set are called variables. Variables can be independent, in the sense that they are not dependent on other variables and can thus be manipulated by the researcher for the purpose of a study (e.g., administration of a certain drug), or dependent, in the sense that their value depends on another variable and, thus, cannot be manipulated by the researcher (e.g., a condition caused by a certain drug). Variables can furthermore be categorized qualitatively in categorical terms (e.g., eye color, sex, race) and quantitatively in numerical terms (e.g., age, weight, temperature).
The evaluation of diagnostic tests before approval for clinical practice is another important area of epidemiological study. It relies on inferential statistics to draw conclusions from sample groups that can be applied to the general population. See also types of epidemiological studies.
Descriptive and inferential statistics
 Descriptive statistics: analysis of a sample group conducted in order to measure, describe, and summarize the data collected, but not to make inferences that go beyond the scope of that sample group; employs measures of central tendency (mode, median, and mean) and measures of dispersion (range, quartiles, variance, and deviation)
 Inferential statistics: analysis of a sample group conducted in order to make inferences that go beyond the sample group.
Measures of central tendency and outliers
Measures of central tendency
 Definition: measures to describe a common, typical value of a data set (e.g., clustering of data at a specific value)
 The type of measure used depends on the sample size.
Measure  Definition  Example 

Mean 


Median 


Mode 


Outlier
 Definition: a data point/observation that is distant from other data points/observations in a data set
 Problem

Approach
 Using a trimmed mean: calculate the mean by discarding extreme values in a data set and using the remaining values
 Use the median or mode: useful for asymmetrical data; these measures are not affected by extreme values because they are based on ranks of data (median) or the most commonly occurring value (mode) rather than the average score of all values
 Removing outliers can also distort the interpretation of data. It should be done with caution and with a view to reflecting the respective data set.
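The effect of an outlier on the measures of central tendency can be illustrated with a minimal sketch. The data set and the helper `trimmed_mean` are hypothetical and chosen only for illustration; the other functions come from Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical lengths of hospital stay in days; 40 is an outlier.
stays = [2, 3, 3, 4, 5, 6, 40]


def trimmed_mean(values, k):
    """Mean after discarding the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


print("mean:", mean(stays))                     # pulled upward by the outlier
print("median:", median(stays))                 # based on rank, not affected
print("mode:", mode(stays))                     # most frequent value
print("trimmed mean:", trimmed_mean(stays, 1))  # drops 2 and 40
```

In this sketch the mean is more than twice the median, while the trimmed mean stays close to the median.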
Measures of dispersion
 Definition: measures the extent to which the distribution is stretched out
Definition  Description  

Range (statistics) 


Interquartile range 


Variance (statistics) 


Standard deviation (SD) 


Percentiles 


Quartile 
 
Standard error of the mean 


Variables

Definition: measured values of population attributes or a value subject to change
 General population: the group from which the units of observation are drawn (e.g., all the inhabitants of Braunschweig, all the patients in a hospital)
 Unit of observation: the individual who is the subject of the study (e.g., an inhabitant of a region, a patient)
 Attribute: a characteristic of the unit of observation (e.g., gender, patient satisfaction)

Attribute value (e.g., male/female, satisfied/dissatisfied)
 Variables can be qualitative (e.g., male/female) or quantitative (e.g., temperature: 10°C, 20°C) in nature.
 Quantitative variables can be discrete or nondiscrete (continuous) variables (see “Probability” below).

Types
 Independent variable: a variable that is not dependent on other variables and can thus be manipulated by the researcher for the purpose of a study
 Dependent variable: a variable with a value that depends on another variable and therefore cannot be manipulated by the researcher

Types of quantitative variables
 Discrete variable: variables that can only assume whole number values
 Continuous variable (nondiscrete variable): variables that can assume any real number value
 Categorical variable (nominal variable): variables that have a finite number of categories that may not have an intrinsic logical order

Variable scales
 Definition: types of measurement scales; categorized as categorical scales and metric scales

Categorical scale (qualitative)
 The distance (interval) between two categories is undefined.
 Includes the nominal scale and ordinal scale

Metric scale (quantitative)
 The distance between two categories is defined and the data can be ranked.
 Includes the interval scale and ratio scale
Types  Characteristics  Measure of central tendency  Measure of dispersion  Statistical analysis  Data illustration 

Nominal scale 



 
Ordinal scale 


 
Interval scale 



 
Ratio scale 

References:^{[1]}^{[2]}
Distribution and graphical representation of data
Normal distribution (Bell curve, Gaussian distribution)
 Normal distributions differ according to their mean and variance, but share the following characteristics:
 The same basic shape; the following assumptions about the data distribution can be made:
 68% of the data falls within 1 SD of the mean.
 95% of the data falls within 2 SD of the mean.
 99.7% of the data falls within 3 SD of the mean.
 Symmetry (i.e., a symmetrical bell curve)
 Total area under the curve = 1
 All measures of central tendency are equal (mean = median = mode)
 Standard normal distribution (Z distribution): A normal distribution with a mean of 0 and standard deviation of 1
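The 68%, 95%, and 99.7% figures above can be checked numerically. The following is a small sketch that uses `NormalDist` from Python's standard `statistics` module; the helper name `share_within` is an assumption made for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The exact proportions are slightly different from the rounded rule of thumb (for example, about 95.45% rather than 95% within 2 SD).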
Nonnormal distributions
Description  Meaning  

Bimodal distribution 


Positively skewed distribution 


Negatively skewed distribution 


Standard normal value (Z-score; Z-value; standard normalized score)
 Enables the comparison of populations with different means and standard deviations
 Standard normal value = (value - population mean) / standard deviation
 A means of expressing data scores (e.g., height in centimeters or meters) in the same metric (specifically, in terms of units of standard deviation for the population)
 Determines how many standard deviations an observation is above or below the mean
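A short sketch of the Z-score calculation follows. The height values are hypothetical; the point is that the same observation yields the same Z-score whether it is recorded in centimeters or in meters.

```python
def z_score(value, population_mean, population_sd):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - population_mean) / population_sd


# Hypothetical population: mean height 170 cm, SD 10 cm.
z_in_cm = z_score(180, 170, 10)
# The same observation expressed in meters.
z_in_m = z_score(1.80, 1.70, 0.10)

print(z_in_cm, z_in_m)  # both are approximately 1, i.e., 1 SD above the mean
```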
Recommended measures according to distribution
Distribution  Measure of central tendency  Measure of spread 

Normal (symmetrical)  Mean, median, mode  Standard deviation 
Skewed (asymmetrical)  Median  Range or interquartile range 
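Why the interquartile range is preferred for skewed data can be seen in a minimal sketch. The sample is hypothetical, and `statistics.quantiles` is used with its default settings, which is only one of several conventions for computing quartiles.

```python
from math import sqrt
from statistics import mean, median, quantiles, stdev

# Hypothetical, positively skewed sample (one very large value).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

value_range = max(data) - min(data)   # range
q1, q2, q3 = quantiles(data, n=4)     # quartile cut points (q2 is the median)
iqr = q3 - q1                         # interquartile range
sd = stdev(data)                      # sample standard deviation
sem = sd / sqrt(len(data))            # standard error of the mean

print("mean:", mean(data), "median:", median(data))
print("range:", value_range, "IQR:", iqr)
print("SD:", round(sd, 2), "SEM:", round(sem, 2))
```

The single extreme value inflates the mean, range, and standard deviation, whereas the median and interquartile range barely react to it.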
Data illustration
Categorical data

Frequency table
 Presents data values for each category in a table
 Illustrates which values in a data set appear frequently

Pie chart
 Describes the frequency of categories in a circular graph divided into slices, with each slice representing a categorical proportion
 Useful for depicting a small number of categories and large differences between them

Bar graph
 Describes the frequency of categories in bars separated from each other; the height/length of each bar represents a categorical proportion
 Useful for depicting many categories of information (compared to a pie chart)
 Frequency can be expressed in absolute or relative terms.
Continuous data

Histogram
 A histogram is similar to a bar graph but displays data on a metric scale.
 The data is grouped into intervals that are plotted on the x-axis.
 Useful for depicting continuous data
 Similar to a bar chart, but differs in the following ways:
 Used for continuous data
 The bars can be shown touching each other to illustrate continuous data.
 Bars cannot be reordered.

Box plot
 Quartiles and median are used to display numerical data in the form of a box.
 Useful for depicting continuous data
 Shows the following important characteristics of data:
 Minimum and maximum values
 First and third quartiles
 Interquartile range
 Median
 Easily shows measures of central tendency, range, symmetry, and outliers at a glance

Scatter plot
 A graph used to display values for (typically) two variables of data, plotted on the horizontal (x-axis) and vertical (y-axis) axes using Cartesian coordinates, which represent individual data values
 Helps to establish correlations between dependent and independent variables
 Helps to determine whether a relationship between data sets is linear or nonlinear
Hypothesis testing and probability
Hypothesis testing
 Two mutually exclusive hypotheses (the null hypothesis and the alternative hypothesis) are formulated.
 Null hypothesis (H_{0}): the assumption that there is no relationship between two measured variables (e.g., the exposure and the outcome) or no significant difference between two studied populations; statistical tests are used to either reject or fail to reject this hypothesis

Alternative hypothesis (H_{1}): the assumption that there is a relationship between two measured variables (e.g., the exposure and the outcome) or a significant difference between two studied populations. This hypothesis is formulated as a counterpart to the null hypothesis; it is supported when statistical tests reject the null hypothesis

Types
 Directional alternative hypothesis (one-tailed): specifies the direction of a tested relationship
 Nondirectional alternative hypothesis (two-tailed): only states that a difference exists in a tested relationship (does not specify the direction)

Interpretation

Type 1 error: The null hypothesis is rejected when it is actually true, and, consequently, the alternative hypothesis is accepted, although the observed effect is actually due to chance.

Significance level (type 1 error rate)
 The probability of a type 1 error; denoted with “α”
 The significance level is determined by the principal investigator before the study is conducted.
 For medical/epidemiological studies, the significance level α is usually set to 0.05.

Type 2 error: The null hypothesis is not rejected when it is actually false, and, consequently, the alternative hypothesis is rejected even though the observed effect did not occur due to chance.
 Type 2 error rate: the probability of a type 2 error; denoted by “β”

Statistical power (1 - β)
 The probability of correctly rejecting the null hypothesis, i.e., the ability to detect a difference between two groups when there truly is a difference
 Complementary to the type 2 error rate (power = 1 - β)
 Positively correlates with the sample size and the magnitude of the association of interest (e.g., increasing the sample size of a study would increase its statistical power)
 By convention, most studies aim to achieve 80% statistical power.

P-value: the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true (i.e., that there is no relationship between two measured variables or no difference between two studied populations)
 Calculated with statistical tests
 If the p-value is equal to or less than a predetermined significance level (usually set at 0.05), the association is considered statistically significant.

Test result  Null hypothesis (H_{0}) is true  Null hypothesis (H_{0}) is false

Statistical test does not reject H_{0}  Correct decision (1 - α)  Type 2 error (β)
Statistical test rejects H_{0}  Type 1 error (α)  Correct decision; power (1 - β)
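The relationship between sample size, the type 2 error rate, and power can be sketched numerically. The example below assumes a one-sided Z-test for a mean with a known standard deviation, which is a simplification chosen for illustration and not a method prescribed by this article; the function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(effect, sd, n, alpha=0.05):
    """Power (1 - beta) of a one-sided Z-test, assuming a known SD.

    effect: true difference between the means under H1
    sd:     known population standard deviation
    n:      sample size
    alpha:  significance level (type 1 error rate)
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha)
    beta = standard_normal.cdf(z_critical - effect * sqrt(n) / sd)
    return 1 - beta


# Hypothetical study: true difference 5, SD 10.
for n in (10, 20, 50):
    print(n, round(power_one_sided_z(effect=5, sd=10, n=n), 3))
```

Under these assumptions, power rises with the sample size, and with no true effect the rejection rate equals α.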
Probability

Description

Probability of an event occurring (P)
 Describes the degree of certainty that a particular event will take place
 P = number of favorable outcomes/total number of possible outcomes

Probability of an event not occurring (Q)
 The degree of certainty that a particular event will not take place
 Q = number of unfavorable outcomes/total number of possible outcomes OR 1 - P

Use
 Probabilities of independent events occurring together are combined by multiplication: P(A and B) = P(A) × P(B)
 Probabilities of mutually exclusive events are combined by addition: P(A or B) = P(A) + P(B)
 Probabilities of events that are NOT mutually exclusive are combined by adding the probability of each event and then subtracting the probability of both events occurring together: P(A or B) = P(A) + P(B) - P(A and B)
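The three rules for combining probabilities can be checked by counting outcomes. The sketch below uses a fair six-sided die as a hypothetical example; `probability` is an illustrative helper, not a library function.

```python
from fractions import Fraction
from itertools import product


def probability(event, outcomes):
    """P = number of favorable outcomes / total number of possible outcomes."""
    outcomes = list(outcomes)
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))


die = range(1, 7)
two_dice = list(product(die, die))

# Independent events: a six on the first die AND a six on the second die.
p_six = probability(lambda roll: roll == 6, die)
p_double_six = probability(lambda pair: pair == (6, 6), two_dice)

# Mutually exclusive events: rolling a 1 OR a 2.
p_one = probability(lambda roll: roll == 1, die)
p_two = probability(lambda roll: roll == 2, die)
p_one_or_two = probability(lambda roll: roll in (1, 2), die)

# Events that are not mutually exclusive: even OR greater than 3.
p_even = probability(lambda roll: roll % 2 == 0, die)
p_high = probability(lambda roll: roll > 3, die)
p_both = probability(lambda roll: roll % 2 == 0 and roll > 3, die)
p_even_or_high = probability(lambda roll: roll % 2 == 0 or roll > 3, die)

print(p_double_six, p_one_or_two, p_even_or_high)
```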
The actual probability of an event is not the same as the observed frequency of an event!
The observed relative frequency can deviate greatly from the true probability of an event, especially when the outcome is only measured a few times. However, as the number of measurements increases, the observed relative frequency approaches the true probability of the event!
 Probability of independent events: The probability of event “A” is not contingent upon the probability of event “B” and vice versa.

Conditional probability (P(A|B)): the probability of event “A” occurring given that event “B” has occurred

P(A|B) = P(A and B) / P(B)
 P(B) = probability of event “B”
 P(A and B) = probability of events “A” and “B” occurring simultaneously

E.g., the probability of lung cancer in a smoker (“A” → lung cancer; “B” → smoking)
 The underlying condition is that the individual is a smoker → P(B) = probability of being a smoker = number of smokers / total population
 P(A and B) = probability of simultaneously being a smoker and having lung cancer = number of smokers with lung cancer / total population
 Therefore, P(A|B) = the probability of lung cancer arising in a smoker = P(A and B)/P(B) = number of smokers with lung cancer / number of smokers
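The lung cancer example can be worked through with numbers. All counts below are invented for illustration only and do not reflect real epidemiological data.

```python
from fractions import Fraction

# Hypothetical counts (not real data).
total_population = 1000
smokers = 200
smokers_with_lung_cancer = 20

p_b = Fraction(smokers, total_population)                         # P(B): being a smoker
p_a_and_b = Fraction(smokers_with_lung_cancer, total_population)  # P(A and B)

# Conditional probability: lung cancer given that the individual smokes.
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)  # the total population cancels out of the ratio
```

The result equals the number of smokers with lung cancer divided by the number of smokers, as stated in the last step of the example.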

Multiplication rule
 P(A and B) = probability of events “A” and “B” occurring simultaneously
 The multiplication rule is obtained by rearranging the formula for conditional probability. → P(A and B) = P(B) × P(A|B)
 The multiplication rule can be applied to a decision tree in order to calculate the probability of one of the branches (a particular outcome)
 E.g., pedigree charts for genetic analysis

Bayes' theorem
 Bayes' theorem is used to calculate conditional probabilities.

Bayes' theorem describes the relationship between P(A|B) and P(B|A):

P(A|B) = (P(B|A) × P(A)) / P(B), where:
 P(B) = (P(B|A) × P(A)) + (P(B|C) × P(C)); where “C” = “A not occurring”
 Therefore, P(A|B) = (P(B|A) × P(A)) / ((P(B|A) × P(A)) + (P(B|C) × P(C)))

E.g., positive predictive value (PPV) of a test
 P(A|B) = probability that an individual with a positive test (event “B”) actually has the disease (event “A”) = PPV
 P(B|A) = probability that an individual with the disease (event “A”) will test positive (event “B”) = sensitivity
 P(A) = probability that an individual has the disease = disease prevalence
 P(B) = probability of a test being positive; although the probability of a positive test result cannot be directly estimated, it can be calculated using the formula: P(B) = (P(B|A) × P(A)) + (P(B|C) × P(C)), where “C” = “A not occurring”
 P(A|B) = (P(B|A) × P(A)) / P(B) = (P(B|A) × P(A)) / ((P(B|A) × P(A)) + (P(B|C) × P(C))) = PPV = (sensitivity × prevalence) / ((sensitivity × prevalence) + ((1 - specificity) × (1 - prevalence)))
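The PPV formula derived from Bayes' theorem can be expressed as a short function. This is a sketch with hypothetical test characteristics; the function name is an assumption made for this example.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(disease | positive test), calculated with Bayes' theorem."""
    true_positives = sensitivity * prevalence                # P(B|A) x P(A)
    false_positives = (1 - specificity) * (1 - prevalence)   # P(B|C) x P(C)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: sensitivity 90%, specificity 80%.
ppv_common = positive_predictive_value(0.9, 0.8, prevalence=0.10)
ppv_rare = positive_predictive_value(0.9, 0.8, prevalence=0.01)

print(round(ppv_common, 3), round(ppv_rare, 3))
```

With the same sensitivity and specificity, the PPV falls as the prevalence falls, because false positives then make up a larger share of all positive results.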

Confidence interval
 Overview: Confidence intervals provide a way to estimate a population measurement from a sample measurement.
 Definition: the range of values that is highly likely to contain the true population measurement

Z-scores for confidence intervals for normally distributed data (see Z-score)
 Z-score for a 95% confidence interval = 1.96
 Z-score for a 97.5% confidence interval = 2.24
 Z-score for a 99% confidence interval = 2.58

Formula: sample measurement (e.g., mean) ± Z-score × standard error of the mean; requires the following values:
 Confidence level (usually fixed at 95% )
 Sample measurement

Standard error of the mean, which requires the:
 Sample size
 Standard deviation
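The confidence interval formula can be sketched as follows. The sample is hypothetical, and the Z-score is used even though the sample is small, which is a simplification made for this example (a t-distribution would normally be more appropriate for small samples).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_for_level(level):
    """Two-sided Z-score for a confidence level, e.g., 0.95 -> about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def confidence_interval(sample, level=0.95):
    """Sample mean +/- Z-score x standard error of the mean."""
    sem = stdev(sample) / sqrt(len(sample))
    margin = z_for_level(level) * sem
    return mean(sample) - margin, mean(sample) + margin


# Hypothetical systolic blood pressure readings (mm Hg).
readings = [118, 122, 125, 130, 121, 127, 119, 124, 126, 128]

print(confidence_interval(readings, 0.95))
print(confidence_interval(readings, 0.99))  # a higher level gives a wider interval
```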

Interpretation
 Overlapping confidence intervals between two groups usually signify that there is no statistically significant difference (overlap alone is not a formal test).
 Nonoverlapping confidence intervals between two groups signify that there is a statistically significant difference.

If the confidence interval includes the null hypothesis, the result is not significant and the null hypothesis cannot be rejected.
 If the 95% confidence interval of relative risk or odds ratio includes 1.0, the result is not significant and the null hypothesis cannot be rejected.
 If the 95% confidence interval of a difference between the means of two variables includes 0, the result is not significant and the null hypothesis cannot be rejected.
 A 95% confidence interval that does not include the null hypothesis corresponds to a p-value of < 0.05
 A 99% confidence interval that does not include the null hypothesis corresponds to a p-value of < 0.01
Statistical tests
Statistical significance vs. clinical significance

Significance (epidemiology): the statistical probability that a result did not occur by chance alone
 Statistical significance: describes a true statistical outcome (i.e., that is determined by statistical tests) that has not occurred by chance
 Clinical significance (epidemiology): describes an important change in a patient's clinical condition, which may or may not be due to an intervention introduced during a clinical study
 Statistical and clinical significance do not necessarily correlate.
Correlation and regression
Correlation
 Definition: a measure of the linear statistical relationship between continuous variables

Interpretation: A correlation coefficient measures the strength (i.e., the degree) and direction (i.e., a positive or negative relationship) of a linear relationship (correlation does not imply causality!)
 Direction of relationship: can be positive or negative; identified by a plus or minus sign, respectively
 Strength of relationship
 Perfect relationship: two variables are perfectly linear and the correlation coefficient is +1 or -1
 No linear relationship: correlation coefficient is 0
 See Spearman's correlation coefficient and Pearson's correlation coefficient.
Regression (epidemiology)
 Definition: the process of developing a mathematical relationship between the dependent variable (the outcome; “y”) and one or more independent variables (the exposure; “x”)

Linear regression: a type of regression in which the dependent variable is continuous

Simple linear regression
 1 independent variable is analyzed
 If “y” has a linear relationship with an independent variable “x”, a graph plotting this relationship takes the form of a straight line (called regression line).
 In the case of simple linear regression, the equation of the regression line is: y = mx + b, with “m” representing the slope of the regression line, “y” the dependent variable, “x” the independent variable, and “b” the y-intercept (the value of y where the line crosses the y-axis)
 Multiple linear regression: >1 independent variable is analyzed
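The slope “m” and y-intercept “b” of a simple linear regression line can be estimated with the least-squares method. The following is a minimal sketch with hypothetical data; least squares is the usual fitting method, although this article does not name one.

```python
from statistics import mean


def fit_line(x_values, y_values):
    """Least-squares estimates of slope m and intercept b for y = mx + b."""
    x_mean, y_mean = mean(x_values), mean(y_values)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values))
    s_xx = sum((x - x_mean) ** 2 for x in x_values)
    m = s_xy / s_xx
    b = y_mean - m * x_mean
    return m, b


# Hypothetical data: dose (x) and response (y) lying exactly on a line.
dose = [1, 2, 3, 4, 5]
response = [3, 5, 7, 9, 11]

m, b = fit_line(dose, response)
print(f"y = {m}x + {b}")
```

For real data the points scatter around the line, and the fitted line is the one that minimizes the sum of squared vertical distances to the points.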

Logistic regression: a type of regression in which the dependent variable is categorical
 Simple logistic regression: 1 independent variable is analyzed
 Multiple logistic regression: >1 independent variable is analyzed
Parametric tests
 Definition: tests used to evaluate statistically significant differences between groups when the study sample has a normal distribution and the sample size is large

Types

Pearson correlation coefficient (r)
 Compares interval level variables
 Calculates the estimated strength and direction of a relationship between two variables

Interpretation
 r is always a value between -1 and +1.
 A positive r-value = a positive correlation
 A negative r-value = a negative correlation
 The closer the absolute value of r is to 1, the stronger the correlation between the compared variables.
 The coefficient of determination = r^{2} (the coefficient may be affected by extreme values)
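The Pearson correlation coefficient can be computed directly from its definition. The sketch below uses hypothetical paired observations and an illustrative helper named `pearson_r`.

```python
from math import sqrt
from statistics import mean


def pearson_r(x_values, y_values):
    """Pearson correlation coefficient of two equally long sequences."""
    x_mean, y_mean = mean(x_values), mean(y_values)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values))
    s_xx = sum((x - x_mean) ** 2 for x in x_values)
    s_yy = sum((y - y_mean) ** 2 for y in y_values)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4), round(r ** 2, 4))  # r and r squared
```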

T-test
 Calculates the difference between the means of two samples or between a sample and the population, especially when samples are small and/or the population distribution is not known

Used to determine the confidence intervals of a t-distribution
 T-distribution: a family of distributions used when the standard deviation is unknown and/or the sample size is small

Types

One-sample t-test
 Calculates whether a sample mean differs from the population mean (μ_{0})
 Prerequisite: normal distribution (the variance is known and depends on the degrees of freedom)
 Formula: t-value = ((sample mean - population mean) / standard deviation) × √n

Interpretation
 The t-value can be classified according to a table that lists t-values and their corresponding quantiles based on the number of degrees of freedom (df) and the significance level (α value).
 t < tabular value of t_{df}(1 - α/2) → null hypothesis cannot be rejected
 t > tabular value of t_{df}(1 - α/2) → null hypothesis should be rejected
 Alternatively, one may calculate the confidence intervals of the sample observations and check if the population mean (μ_{0}) falls within the range given by the confidence intervals.
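The one-sample t-test can be sketched with a hypothetical sample of 10 measurements. The critical value 2.262 is the tabulated two-sided value for df = 9 and α = 0.05; the data and names are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, population_mean):
    """t-value = ((sample mean - population mean) / SD) x sqrt(n)."""
    return (mean(sample) - population_mean) / stdev(sample) * sqrt(len(sample))


# Hypothetical measurements; the population mean under H0 is 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.7, 5.9, 5.2]
t_value = one_sample_t(sample, population_mean=5.0)

# Tabulated critical value for df = n - 1 = 9, two-sided alpha = 0.05.
T_CRITICAL_DF9 = 2.262
reject_null = abs(t_value) > T_CRITICAL_DF9

print(round(t_value, 3), reject_null)
```

Because the absolute t-value exceeds the tabulated value in this sketch, the null hypothesis would be rejected at the 5% level.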

Two-sample t-test
 Calculates whether the means of two groups differ from one another

Prerequisites
 Both sample groups are drawn from the same population and have the same (but unknown) variance.
 The difference between the observations in the two groups approximately follows a normal distribution.
 Formula: t-value = (mean difference between the two samples / standard deviation) × √n
 Interpretation: The t-value is compared with a table of t-values in order to determine whether the difference is statistically significant (similar to the one-sample t-test described above).

Types

Unpaired t-test (independent samples t-test)
 Two different groups are sampled at the same time
 The difference between the means of a continuous outcome variable of these 2 groups is compared
 The null hypothesis is that the mean of these two groups is equal; a statistically significant difference rejects the null hypothesis

Paired t-test (dependent samples t-test)
 The same group is sampled at two different times
 The difference between the means of a continuous outcome variable of this group is compared
 The null hypothesis is that the group mean is equal at these two different times; a statistically significant difference rejects the null hypothesis
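A paired t-test can be sketched as a one-sample t-test on the within-subject differences. The before and after values are hypothetical, and 2.776 is the tabulated two-sided critical value for df = 4 and α = 0.05.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """t-value for paired samples, based on the per-subject differences."""
    differences = [a - b for a, b in zip(first, second)]
    return mean(differences) / stdev(differences) * sqrt(len(differences))


# Hypothetical systolic blood pressure of the same 5 patients at two times.
before = [140, 150, 135, 160, 145]
after = [135, 142, 136, 150, 140]

t_paired = paired_t(before, after)

# Tabulated critical value for df = n - 1 = 4, two-sided alpha = 0.05.
T_CRITICAL_DF4 = 2.776
significant = abs(t_paired) > T_CRITICAL_DF4

print(round(t_paired, 3), significant)
```

An unpaired t-test on the same numbers would ignore the pairing and compare the two group means directly, which generally gives a different t-value.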

Analysis of variance (ANOVA)

Determines whether there is a statistically significant difference between ≥ 3 independent groups by comparing their means (an extension of the t-test)

One-way analysis of variance
 Assesses 1 variable (e.g., the mean height of women in clinics A, B, and C at a given point in time; the variable is height)
 The aim is to determine whether there is an effect of different independent variables on a dependent variable.
 Two-way analysis of variance: assesses 2 variables (e.g., the mean height of women and the mean height of men in clinics A, B, and C at a point in time; the variables are gender and height)
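The test statistic of a one-way analysis of variance (the F-value) compares the variability between group means with the variability within groups. The sketch below uses hypothetical measurements from three clinics; 5.14 is the tabulated critical F-value for 2 and 6 degrees of freedom at α = 0.05.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical measurements from clinics A, B, and C.
clinic_a = [1, 2, 3]
clinic_b = [2, 3, 4]
clinic_c = [5, 6, 7]

f_value = one_way_anova_f([clinic_a, clinic_b, clinic_c])

F_CRITICAL_2_6 = 5.14  # tabulated value for df = (2, 6), alpha = 0.05
print(round(f_value, 2), f_value > F_CRITICAL_2_6)
```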

Nonparametric tests
 Definition: tests used to evaluate the statistically significant difference between groups when the sample has a nonnormal distribution and the sample size is small

Types

Spearman correlation coefficient
 Calculates the relationship between two variables according to their rank
 Compares ordinal level variables

Interpretation
 Extreme values have a minimal effect on Spearman's coefficient.
 Not precise because not all information from the data set is used.
 See correlation.
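Spearman's coefficient can be sketched with the rank-difference formula, which is valid only when there are no tied values. The data and helper names are hypothetical.

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes there are no ties."""
    ordered = sorted(values)
    return [ordered.index(value) + 1 for value in values]


def spearman_rho(x_values, y_values):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d = rank difference."""
    n = len(x_values)
    d_squared = sum(
        (rx - ry) ** 2 for rx, ry in zip(ranks(x_values), ranks(y_values))
    )
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical paired observations.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

rho = spearman_rho(x, y)

# Replacing the largest x by an extreme value leaves the ranks unchanged.
rho_with_outlier = spearman_rho([10, 20, 30, 40, 5000], y)

print(rho, rho_with_outlier)
```

Because only ranks enter the calculation, the extreme value has no effect on the coefficient in this sketch.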

Mann-Whitney U test
 Compares ordinal, interval, or ratio scales
 Calculates whether two independently chosen samples originate from the same population and have identical distributions and/or medians

Wilcoxon test (rank sum and signed rank)
 Rank sum test: compares the means between groups of different sizes
 Signed rank test: compares the means between pairs of scores that can be matched; substitute for the one-sample t-test when a preintervention measure is compared with a posttreatment measure and the null hypothesis is that the treatment has no effect

Kruskal-Wallis H test
 Extension of the Mann-Whitney U test
 Compares multiple groups by testing the null hypothesis (that there is no median difference between at least two groups)
 Binomial test: examines whether the observed frequency of an event with binary outcomes (e.g., heads/tails, dead/alive) is statistically probable or not

Categorical tests
 Definition: tests used to evaluate the statistically significant difference between groups with categorical variables

Types

Chi-square test (χ^{2} test)
 Calculates the difference between the frequencies in a sample
 Aims to determine how likely outcomes are to occur due to chance (used in cross-sectional studies)
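The chi-square statistic compares observed frequencies with the frequencies expected if there were no association. The following sketch uses a hypothetical 2 × 2 table; 3.841 is the tabulated critical value for 1 degree of freedom at α = 0.05.

```python
def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows = exposed / unexposed, columns = disease / no disease.
observed_table = [
    [30, 20],
    [10, 40],
]

chi2 = chi_square_statistic(observed_table)

CHI2_CRITICAL_DF1 = 3.841  # tabulated value for df = 1, alpha = 0.05
print(round(chi2, 3), chi2 > CHI2_CRITICAL_DF1)
```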

Fisher's exact test
 Also calculates the difference between the frequencies in a sample; but, unlike a chi-square test, is used when the study sample is small
 Also aims to determine how likely it was the outcomes occurred due to chance
