
Statistical analysis of data

Abstract

Statistical analysis is one of the principal tools employed in epidemiology, which is primarily concerned with the study of health and disease in populations. Statistics is the science of collecting, analyzing, and interpreting data, and a good epidemiological study depends on statistical methods being employed correctly. At the same time, flaws in study design can affect statistics and lead to incorrect conclusions. Descriptive statistics measure, describe, and summarize features of a collection of data/sample without making inferences that go beyond the scope of that collection/sample. Common descriptive statistics are measures of central tendency and dispersion. Measures of central tendency describe the center of a data distribution and include the mode, median, and mean. Measures of dispersion describe how data is distributed and include the range, quartiles, variance, and standard deviation. The counterpart of descriptive statistics, inferential statistics, uses data to make inferences that do go beyond the scope of the data collected and the sample from which it was obtained. Inferential statistics involves parameters such as sensitivity, specificity, positive/negative predictive values, confidence intervals, and hypothesis testing.

The values used to describe features of a sample or data set are called variables. Variables can be independent, in the sense that they do not depend on other variables and can thus be manipulated by the researcher for the purpose of a study (e.g., administration of a certain drug), or dependent, in the sense that their value depends on another variable and thus cannot be manipulated by the researcher (e.g., a condition caused by a certain drug). Variables can furthermore be qualitative, i.e., expressed in categorical terms (e.g., eye color, sex, race), or quantitative, i.e., expressed in numerical terms (e.g., age, weight, temperature).

The evaluation of diagnostic tests before approval for clinical practice is another important area of epidemiological study. It relies on inferential statistics to draw conclusions from sample groups that can be applied to the general population. See also types of epidemiological studies.

Descriptive and inferential statistics

Measures of central tendency and outliers

Measures of central tendency

  • Definition: measures to describe a common, typical value of a data set (e.g., clustering of data at a specific value)
  • The most appropriate measure depends on the distribution of the data (see “Recommended measures according to distribution” below).
Measure Definition Example
Mean
  • The arithmetic average of the data set
  • Limitations: affected by extreme values (outliers)
  • The sum of all values divided by the number of values in the data set (e.g., for the data set 3, 6, 11, 14, 16, 19, the mean is 69/6 = 11.5)
Median
  • The middle value of the data set that has been arranged in order of magnitude; it divides the upper half of the data set from the lower half
  • Not strongly affected by outliers or skewed data
  • Odd number of values: 3, 6, 11, 16, 19. The median value is the middle value = 11.
  • Even number of values: 2, 3, 5, 7, 9, 10. The median value is the average of the two middle values = (5+7)/2 = 6.
Mode
  • The most common value in a data set
  • Most resistant to outliers
  • In a data set with the values “3, 6, 6, 11, 11, 11, 2, 2,” the mode = 11.
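The three measures can be computed directly; below is a minimal sketch using Python's standard-library statistics module, applied to the data set from the mode example above.

```python
# Measures of central tendency for the hypothetical data set above.
import statistics

data = [3, 6, 6, 11, 11, 11, 2, 2]

print(statistics.mean(data))    # arithmetic average -> 6.5
print(statistics.median(data))  # middle value of the sorted data -> 6
print(statistics.mode(data))    # most common value -> 11
```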

Outlier

  • Definition: a data point/observation that is distant from other data points/observations in a data set
  • Problem
    • It is important to identify outliers because they can indicate errors in measurement or statistical anomalies.
    • The mean is easily influenced by outliers.
  • Approach
    • Use a trimmed mean: calculate the mean after discarding extreme values in the data set and using only the remaining values (see the sketch below)
    • Use the median or mode: useful for asymmetrical data; these measures are not affected by extreme values because they are based on the ranks of data (median) or the most commonly occurring value (mode) rather than the average of all values
    • Removing outliers can also distort the interpretation of data; it should be done with caution and with a view to accurately reflecting the respective data set.
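As a sketch of the trimmed-mean approach, the function below discards a fixed fraction of the smallest and largest values before averaging; the 15% cut and the data set (with 250 as an artificial outlier) are illustrative assumptions, not values from the text.

```python
import statistics

def trimmed_mean(values, proportion_to_cut=0.1):
    """Mean after dropping the given fraction of values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion_to_cut)  # values to drop per tail
    trimmed = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(trimmed)

data = [3, 6, 11, 14, 16, 19, 250]     # 250 is an outlier
print(statistics.mean(data))           # ~45.6, pulled up by the outlier
print(trimmed_mean(data, 0.15))        # drops 3 and 250 -> 13.2
```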

Measures of dispersion

  • Definition: measures the extent to which the distribution is stretched out
Measure Definition Example
Range
  • The difference between the largest and smallest value in a data set
  • Sensitive to extreme data values; helps to identify an unusually wide or narrow data range, which may occur with data entry errors (e.g., data that actually belongs to another study population)
  • In the data set “27, 3, 4, 9,” the range is 24 (i.e., 27-3).
Interquartile range
  • The range between the first and third quartiles, i.e., the difference between the 75th and 25th percentiles
  • Less influenced by extreme data values (outliers)
Variance
  • The average of the squared deviations from the mean
  • Represented by σ² (“sigma squared”)
  • Population variance: σ² = (sum of squared deviations from the mean) / N, where N is the total number of observations
  • Sample variance: s² is calculated by subtracting the mean from each value, squaring and summing the differences, and dividing the total by n − 1
Standard deviation (SD)
  • The square root of the variance
  • Describes the variability or dispersion of data
  • Represented by σ (“sigma”)
  • Calculated as the square root of the variance: first calculate the mean, subtract it from each value, square and sum the differences, divide by N (population) or n − 1 (sample), and take the square root
  • In a normal distribution
    • ≈ 68% of values lie within 1 SD of the mean.
    • ≈ 95% of values lie within 2 SD of the mean.
    • ≈ 99.7% of values lie within 3 SD of the mean.
Percentiles
  • Division of the population data set into 100 equal parts; a percentile is the value below which a given percentage of observations falls.
  • Percentiles are usually used to help evaluate children's growth.
  • For example: if a child's weight is in the 25th percentile for their age, the child is heavier than 25% of children in the same age group but lighter than 75% of them.
Quartile
  • One quarter of a data set
  • Each quartile includes 25% of the population data set.
    • First quartile (= lower quartile): 25% of all values are smaller than this value.
    • Third quartile (= upper quartile): 75% of all values are smaller than this value.
Standard error of the mean
  • The estimated deviation of the sample mean from the true population mean: SEM = SD / √n
  • Influenced by the standard deviation (a greater SD increases the SEM) and the sample size (a smaller sample size increases the SEM)
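The dispersion measures above can be computed as in the following sketch; numpy is assumed to be available, and the data set is illustrative. The ddof argument switches between the population formula (divide by N) and the sample formula (divide by n − 1).

```python
import numpy as np

data = np.array([27, 3, 4, 9, 12, 15])

data_range = data.max() - data.min()          # range
q1, q3 = np.percentile(data, [25, 75])        # first and third quartiles
iqr = q3 - q1                                 # interquartile range
var_sample = data.var(ddof=1)                 # sample variance (n - 1)
sd_sample = data.std(ddof=1)                  # sample standard deviation
sem = sd_sample / np.sqrt(len(data))          # standard error of the mean

print(data_range, iqr, var_sample, sd_sample, sem)
```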

Variables

  • Definition: measured values of population attributes or a value subject to change
    • General population: the group from which the units of observation are drawn (e.g., all the inhabitants of Braunschweig, all the patients in a hospital)
    • Unit of observation: the individual who is the subject of the study (e.g., an inhabitant of a region, a patient)
    • Attribute: a characteristic of the unit of observation (e.g., gender, patient satisfaction)
    • Attribute value: the value an attribute takes (e.g., male/female, satisfied/dissatisfied)
      • Variables can be qualitative (e.g., male/female) or quantitative (e.g., temperature: 10°C, 20°C) in nature.
      • Quantitative variables can be discrete or non-discrete (continuous) variables (see “Probability” below).
  • Types
    • Independent variable: a variable that is not dependent on other variables and can thus be manipulated by the researcher for the purpose of a study
    • Dependent variable: a variable with a value that depends on another variable and therefore cannot be manipulated by the researcher
  • Types of quantitative variables: discrete or continuous (see “Probability” below)
  • Variable scales
Scale Characteristics Measure of central tendency Measure of dispersion Statistical analysis
Nominal scale
  • Data cannot be ranked
  • Measure of central tendency: mode
  • Frequency: absolute and relative frequency
  • Measure of dispersion: not applicable
  • Statistical analysis: nonparametric tests (e.g., Mann-Whitney U test)
Ordinal scale
  • Data can be ranked
Interval scale
  • There is no natural zero point.
Ratio scale
  • There is a natural zero point.

References:[1][2]

Distribution and graphical representation of data

Normal distribution (Bell curve, Gaussian distribution)

Nonnormal distributions

Description Meaning
Bimodal distribution
  • Data distribution with two peaks (each peak is a mode)
  • Suggests two subgroups within the study population
Positively skewed distribution
  • Data set that is skewed to the right (long tail toward higher values)
  • Mean > median > mode
Negatively skewed distribution
  • Data set that is skewed to the left (long tail toward lower values)
  • Mean < median < mode

Standard normal value (Z-score; Z-value; standard normalized score)

  • Enables the comparison of values from populations with different means and standard deviations
    • Standard normal value: z = (value − population mean) / standard deviation, i.e., z = (x − μ) / σ
    • A means of expressing data scores (e.g., height in centimeters or meters) in the same metric, specifically in units of standard deviations of the population
    • Indicates how many standard deviations an observation lies above or below the mean
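A one-line function suffices to compute the z-score; the height example below (mean 175 cm, SD 7.5 cm) is an illustrative assumption.

```python
def z_score(value, population_mean, population_sd):
    # How many SDs the value lies above (+) or below (-) the mean
    return (value - population_mean) / population_sd

# e.g., a height of 190 cm in a population with mean 175 cm and SD 7.5 cm
print(z_score(190, 175, 7.5))  # 2.0 -> two SDs above the mean
```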

Recommended measures according to distribution

Distribution Measure of central tendency Measure of spread
Normal (symmetrical) Mean, median, mode Standard deviation
Skewed (asymmetrical) Median Range or interquartile range

Data illustration

Categorical data

  • Frequency table
    • Presents data values for each category in a table
    • Illustrates which values in a data set appear frequently
  • Pie chart
    • Describes the frequency of categories in a circular graph divided into slices, with each slice representing a categorical proportion
    • Useful for depicting a small number of categories and large differences between them
  • Bar graph
    • Describes the frequency of categories in bars separated from each other; the height/length of each bar represents a categorical proportion
    • Useful for depicting many categories of information (compared to a pie chart)
    • Frequency can be expressed in absolute or relative terms.

Continuous data

  • Histogram
    • Similar to a bar graph, but displays data on a metric scale
    • The data is grouped into intervals that are plotted on the x-axis.
    • Useful for depicting continuous data
    • Differs from a bar chart in the following ways:
      • Used for continuous rather than categorical data
      • The bars touch each other to illustrate the continuous scale.
      • Bars cannot be reordered.
  • Box plot
  • Scatter plot
    • A graph that displays values for (typically) two variables, plotted as individual data points on the horizontal (x) and vertical (y) axes using Cartesian coordinates
    • Helps to establish correlations between dependent and independent variables
    • Helps to determine whether a relationship between data sets is linear or nonlinear
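The following sketch draws the two chart types for continuous data described above; matplotlib and numpy are assumed to be available, and the data are randomly generated for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(loc=170, scale=10, size=200)    # e.g., heights in cm
y = 0.5 * x + rng.normal(scale=5, size=200)    # a linearly related variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=15)      # histogram: continuous data grouped into intervals
ax1.set_title("Histogram")
ax2.scatter(x, y, s=10)   # scatter plot: paired values as individual points
ax2.set_title("Scatter plot")
plt.show()
```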

Hypothesis testing and probability

Hypothesis testing

  • Two mutually exclusive hypotheses (the null hypothesis and the alternative hypothesis) are formulated.
    • Null hypothesis (H0): the assumption that there is no relationship between two measured variables (e.g., the exposure and the outcome) or no significant difference between two studied populations; statistical tests are used to either reject or fail to reject this hypothesis
    • Alternative hypothesis (H1): the assumption that there is a relationship between two measured variables (e.g., the exposure and the outcome) or a significant difference between two studied populations; formulated as the counterpart to the null hypothesis, it is accepted when statistical tests reject the null hypothesis
      • Types
        • Directional alternative hypothesis (one-tailed): specifies the direction of a tested relationship
        • Non-directional alternative hypothesis (two-tailed): only states that a difference exists in a tested relationship (does not specify the direction)
  • Interpretation
Null hypothesis (H0) is true Null hypothesis (H0) is false
Statistical test does not reject H0 Correct decision (1 − α) Type 2 error (β)
Statistical test rejects H0 Type 1 error (α) Correct decision (power, 1 − β)
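The table above can be illustrated by simulation, as in the following sketch (numpy and scipy are assumed to be available; the effect size and sample size are illustrative): repeated two-sample t-tests reject a true H0 at roughly the rate α, and reject a false H0 at a rate equal to the empirical power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, trials, n = 0.05, 2000, 30

def rejection_rate(effect):
    """Fraction of simulated experiments in which H0 is rejected."""
    rejections = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)       # effect = 0 means H0 is true
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / trials

print(rejection_rate(0.0))  # H0 true: ~0.05 = type 1 error rate (alpha)
print(rejection_rate(0.8))  # H0 false: rejection rate = empirical power
```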

Probability

  • Description
    • Probability of an occurring event (P)
      • Describes the degree of certainty that a particular event will take place
      • P = number of favorable outcomes/total number of possible outcomes
    • Probability of an event not occurring (Q)
      • The degree of certainty that a particular event will not take place
      • Q = number of unfavorable outcomes / total number of possible outcomes, or Q = 1 − P
  • Use
    • The probability that multiple independent events all occur is calculated by multiplying their individual probabilities.
    • The probability that one of several mutually exclusive events occurs is calculated by adding their individual probabilities.
    • For events that are not mutually exclusive, the probability that at least one occurs is calculated by adding the individual probabilities and subtracting the probability of the events occurring together.

The actual probability of an event is not the same as the observed frequency of an event!

The observed relative frequency can deviate greatly from the true probability of an event, especially when the outcome is only measured a few times. However, as the number of measurements increases, the observed relative frequency approaches the true probability of the event!
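A short coin-flip simulation illustrates this; the true probability of 0.5 and the sample sizes are illustrative assumptions.

```python
import random

random.seed(42)
for n in (10, 100, 10_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)  # relative frequency approaches 0.5 as n grows
```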

  • Bayes' theorem
    • Bayes' theorem is used to calculate conditional probabilities.
    • Bayes' theorem describes the relationship between P(A|B) and P(B|A):
      • P(A|B) = (P(B|A) × P(A)) / P(B), where:
        • P(B) = (P(B|A) × P(A)) + (P(B|C) × P(C)), where “C” = “A not occurring”
        • Therefore, P(A|B) = (P(B|A) × P(A)) / ((P(B|A) × P(A)) + (P(B|C) × P(C)))
      • E.g., positive predictive value (PPV) of a test
        • P(A|B) = probability that an individual with a positive test (event “B”) actually has the disease (event “A”) = PPV
        • P(B|A) = probability that an individual with the disease (event “A”) will test positive (event “B”) = sensitivity
        • P(A) = probability that an individual has the disease = disease prevalence
        • P(B) = probability of a test being positive; although it cannot be estimated directly, it can be calculated as P(B) = (P(B|A) × P(A)) + (P(B|C) × P(C)), where “C” = “A not occurring”
        • P(A|B) = (P(B|A) × P(A)) / P(B) = (P(B|A) × P(A)) / ((P(B|A) × P(A)) + (P(B|C) × P(C))), i.e., PPV = (sensitivity × prevalence) / ((sensitivity × prevalence) + ((1 − specificity) × (1 − prevalence)))
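The PPV formula derived above can be turned into a small function; the sensitivity, specificity, and prevalence values below are illustrative assumptions.

```python
def ppv(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence               # P(B|A) x P(A)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(B|C) x P(C)
    return true_pos / (true_pos + false_pos)          # P(A|B)

# e.g., a fairly accurate test for a rare disease still has a low PPV
print(ppv(sensitivity=0.9, specificity=0.95, prevalence=0.01))  # ~0.154
```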

Confidence interval

Statistical tests

Statistical significance vs. clinical significance

  • Significance: the statistical probability that a result did not occur by chance alone
    • Statistical significance: describes a true statistical outcome (i.e., one determined by statistical tests) that has not occurred by chance
    • Clinical significance: describes an important change in a patient's clinical condition, which may or may not be due to an intervention introduced during a clinical study
  • Statistical and clinical significance do not necessarily correlate.

Correlation and regression

Correlation

Regression

Parametric tests

  • Definition: tests used to evaluate statistically significant differences between groups when the study sample has a normal distribution and the sample size is large
  • Types
    • Pearson correlation coefficient (r)
      • Compares interval level variables
      • Calculates the estimated strength and direction of a relationship between two variables
      • Interpretation
        • r is always a value between -1 and 1.
        • A positive r-value = a positive correlation
        • A negative r-value = negative correlation
        • The closer the r-value is to 1, the stronger the correlation between the compared variables.
        • The coefficient of determination = r² (the coefficient may be affected by extreme values)
    • T-test
      • Calculates the difference between the means of two samples, or between a sample mean and a population mean, especially when samples are small and/or the population distribution is not known
      • Used to determine the confidence intervals of a t-distribution
      • Types
        • One sample t-test
          • Calculates whether a sample mean differs from the population mean (μ0)
          • Prerequisite: approximately normally distributed data (the shape of the t-distribution depends on the degrees of freedom)
          • Formula: t = ((sample mean − population mean) / sample standard deviation) × √n
          • Interpretation
            • The t-value can be interpreted using a table that lists t-values and their corresponding quantiles based on the number of degrees of freedom (df) and the significance level (α value).
              • |t| < tabular value of t_df(1 − α/2) → the null hypothesis cannot be rejected
              • |t| > tabular value of t_df(1 − α/2) → the null hypothesis should be rejected
            • Alternatively, one may calculate the confidence intervals of the sample observations and check if the population mean (μ0) falls within the range given by the confidence intervals.
        • Two sample t-test
    • Analysis of variance (ANOVA)
      • Calculates the statistically significant difference between ≥ 3 independent groups by comparing their means (an extension of the t-test)
        • One-way analysis of variance
          • Assesses 1 variable (e.g., the mean height of women in clinics A, B, and C at a given point in time; the variable is height)
          • The aim is to determine whether there is an effect of different independent variables on a dependent variable.
        • Two-way analysis of variance: assesses 2 variables (e.g., the mean height of women and the mean height of men in clinics A, B, and C at a point in time; the variables are gender and height)
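The parametric tests above are available in scipy.stats; the following sketch (scipy and numpy assumed to be available, data randomly generated) shows typical calls.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(loc=170, scale=10, size=30)  # e.g., heights in clinic A
b = rng.normal(loc=175, scale=10, size=30)  # clinic B
c = rng.normal(loc=180, scale=10, size=30)  # clinic C

r, p_r = stats.pearsonr(a, b)                # Pearson correlation coefficient
t1, p1 = stats.ttest_1samp(a, popmean=172)   # one-sample t-test vs mu0 = 172
t2, p2 = stats.ttest_ind(a, b)               # two-sample t-test
f, p_f = stats.f_oneway(a, b, c)             # one-way ANOVA across 3 groups
print(p_r, p1, p2, p_f)
```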

Non-parametric tests

  • Definition: tests used to evaluate the statistically significant difference between groups when the sample has a non-normal distribution and the sample size is small
  • Types
    • Spearman correlation coefficient
      • Calculates the relationship between two variables according to their rank
      • Compares ordinal level variables
      • Interpretation
        • Extreme values have a minimal effect on the Spearman coefficient.
        • Less precise than the Pearson coefficient because not all the information in the data set is used (only the ranks)
      • See correlation.
    • Mann-Whitney U test
      • Compares ordinal, interval, or ratio scales
      • Calculates whether two independently chosen samples originate from the same population and have identical distributions and/or medians
    • Wilcoxon test (rank sum and signed rank)
      • Rank sum test: compares the means between groups of different sizes
      • Signed rank test: compares the means between pairs of scores that can be matched; substitute for the one-sample t-test when a pre-intervention measure is compared with a post-treatment measure and the null hypothesis is that the treatment has no effect
    • Kruskal-Wallis H test
      • Extension of the Mann-Whitney U test
      • Compares multiple groups by testing the null hypothesis that there is no difference in medians between the groups
    • Binomial test: examines whether the observed frequency of an event with binary outcomes (e.g., heads/tails, dead/alive) is statistically probable or not
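The nonparametric tests above also have scipy.stats counterparts; the sketch below (scipy assumed to be available, data illustrative) shows typical calls.

```python
from scipy import stats

a = [12, 15, 11, 19, 14, 13]   # e.g., pre-intervention scores
b = [22, 18, 25, 17, 21, 20]   # post-intervention scores (paired with a)
c = [30, 28, 26, 31, 29, 27]

rho, p_s = stats.spearmanr(a, b)         # Spearman rank correlation
u, p_u = stats.mannwhitneyu(a, b)        # Mann-Whitney U test
w, p_w = stats.wilcoxon(a, b)            # Wilcoxon signed-rank test (paired)
h, p_h = stats.kruskal(a, b, c)          # Kruskal-Wallis H test
res = stats.binomtest(k=7, n=10, p=0.5)  # binomial test: 7 heads in 10 flips
print(p_s, p_u, p_w, p_h, res.pvalue)
```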

Categorical tests

  • Definition: tests used to evaluate the statistically significant difference between groups with categorical variables
  • Types
    • Chi-square test (χ² test)
      • Calculates the difference between observed and expected frequencies in a sample
      • Aims to determine how likely it is that outcomes occurred due to chance (used in cross-sectional studies)
    • Fisher's exact test
      • Also calculates the difference between frequencies in a sample but, unlike the chi-square test, is used when the study sample is small
      • Also aims to determine how likely it is that the outcomes occurred due to chance
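Both categorical tests can be run on a contingency table with scipy.stats; the 2×2 table below (exposure vs. outcome counts) is an illustrative assumption.

```python
from scipy import stats

table = [[20, 10],   # exposed:   outcome yes / outcome no
         [10, 25]]   # unexposed: outcome yes / outcome no

chi2, p, dof, expected = stats.chi2_contingency(table)  # chi-square test
odds_ratio, p_fisher = stats.fisher_exact(table)        # Fisher's exact test
print(p, p_fisher)
```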