The case-only design is a powerful approach to detect interactions but should be used with caution

Dong, Rui; Wang, Gao T.; DeWan, Andrew T.; Leal, Suzanne M.

doi:10.1186/s12864-025-11318-1

Research
Open access
Published: 06 March 2025


BMC Genomics volume 26, Article number: 222 (2025)


Abstract

Background

The case-only design is a powerful approach to identify gene $\times$ gene and gene $\times$ environment interactions for complex traits. It has been demonstrated that for the case-only design to be valid the genetic and environmental factors must be independent in the population. Additionally, there is a rare disease assumption for the case-only design, but the impact of disease prevalence and other factors, e.g., size of main effects, on type I and II error rates has not been investigated.

Methods

Through theoretical and extensive simulation studies, we investigated type I error, power, and bias of the interaction term for a wide variety of disease prevalences, main and interaction effect sizes, sample sizes, and variant and environmental exposure frequencies.

Results

For diseases with prevalence $<$ 4%, the case-only design usually has well controlled type I error rates and is substantially more powerful to detect interactions than the case–control design, but for higher disease prevalences both type I and II error rates can be inflated and the estimate of interaction term biased. However, when one or both main effects are large there can be inflated type I error rate even for low disease prevalences, e.g., $<$ 1%, but if there is no or only one main effect, type I error rate is controlled regardless of the disease prevalence. Additionally, type I error rate can increase with sample size.

Conclusions

We determined the upper bound of the disease prevalence in order not to violate the rare disease assumption for the case-only design. To verify that a case-only design study does not have increased type I error rate, the bias of the interaction term should be estimated. Although the case-only design is a powerful method to detect interactions, prevalences for some complex traits are too high to implement this method without increasing type I error rates.


Background

Genome-wide association studies (GWAS) have detected associations for thousands of complex traits with millions of single-nucleotide variants (SNVs). Most GWAS are designed to detect main effects, while interactions either between genetic variation and environmental factors (G $\times$ E) or between genetic variants (G $\times$ G) may explain “missing heritability”. In the past, most studies were underpowered to detect interactions because of insufficient sample sizes; however, large biobanks that include genetic and environmental data, e.g., UK Biobank, are making it possible to detect interactions [1].

One method to detect G $\times$ E interactions, the case-only design, has received considerable attention since Piegorsch et al. proposed it [2]. It can also be applied to study G $\times$ G interactions, aka epistasis [3]. Piegorsch et al. stated that two assumptions must be met when applying the case-only design to estimate the interaction effect: (1) the disease should be rare and (2) G and E, or G and G, are independent. The second assumption of independence is well studied [4,5,6] and has been evaluated using control data [2, 7]. For the case-only design the effect of deviation from Hardy–Weinberg equilibrium (HWE) in controls was also examined [8, 9]. Additionally, it was shown that population stratification can introduce a bias [4]. However, the rare disease assumption has not been investigated. In practice the case-only design has been applied to complex traits with a wide range of disease prevalences, including breast, prostate and ovarian cancers, Crohn's disease, and rheumatoid arthritis [10,11,12,13,14]. For these studies it is not clear whether the rare disease assumption was met and whether it was appropriate to use the case-only design.

Using theoretical analysis and simulation studies, we evaluated the role of disease prevalence for the case-only design, assuming G and E are independent in the overall population. We determined that there is no set disease prevalence where type I error rates are always well controlled, since type I error rate is affected not only by disease prevalence but also by main effects and sample sizes and, to a lesser extent, by variant and environmental exposure frequencies. Generally, for diseases with a prevalence of $<$ 4% the case-only design has well controlled type I error rate. When main effects are large, e.g., odds ratio ($OR$) $>$ 5.0, the disease prevalence must be $<$ 1% to control type I error rates. Conversely, when there is no or only one main effect, the disease prevalence can be high, e.g., 20%, and type I error rates remain controlled. The estimate of the interaction bias can be used to evaluate whether type I error rate is controlled. To facilitate this evaluation, we provide the CaseOnly R code, which simulates and analyses data to assess type I error rates and statistical power in case-only designs, thereby offering researchers a robust tool for detecting interactions. The case-only design is a powerful method to detect interactions, and it is advantageous to apply it when type I error rate is controlled.

Methods

Simulation study

Data were generated for a genetic variant under dominant and additive models and for a binary environmental exposure. To evaluate type I error, the genetic variant and environmental exposure have no interaction effects. The main effects for the genetic variant ranged from ${\beta }_{G}=\ln(1.050)\approx 0.049$ to ${\beta }_{G}=\ln(3.846)\approx 1.347$ for the dominant model, and for the additive model ${\beta }_{G}=\ln(1.2)\approx 0.182$ for the carrier of one risk allele. For the environmental exposure the main effects ranged from ${\beta }_{E}=\ln(1.10)\approx 0.095$ to ${\beta }_{E}=\ln(5.00)\approx 1.609$. We also tested when both main effects were protective [${\beta }_{G}=\ln\left(0.952\right)\approx -0.049$ to ${\beta }_{G}=\ln\left(0.26\right)\approx -1.347$, ${\beta }_{E}=\ln\left(0.909\right)\approx -0.095$ to ${\beta }_{E}=\ln\left(0.20\right)\approx -1.609$], as well as when one main effect was protective and the other increased risk. We also evaluated type I error when neither the genetic nor environmental factor had a main effect and when only one factor had a main effect. Data were also generated under the alternative where there was an interaction, ${\beta }_{G\times E}=\ln(1.20)\approx 0.182$. The minor allele frequency (MAF) of the genetic variant and the frequency of the environmental exposure were varied between 0.05 and 0.50. We also varied the disease prevalence between 1 and 20%. Using a uniform distribution and a random number generator, genetic variant and exposure data were generated for the $i$th sample and a logistic regression model,

$$logit({Y}_{i}=1)={\beta }_{0}+{\beta }_{G}{G}_{i}+{\beta }_{E}{E}_{i}+{\beta }_{G\times E}{G}_{i}{E}_{i}$$

(1)

was used to assign the $i$th sample as a case or control. For the case–control design, we generated samples of 10,000 cases and 10,000 to 30,000 controls to obtain a ratio ($R$) of controls to cases of $R=1, 2$, and $3$. We also generated samples with different numbers of cases (2,500, 5,000, 10,000, 20,000, and 50,000) to evaluate how sample size impacts type I error rate for the case-only design, with disease prevalence varying from 1% to 20%.
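The data-generating step described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' CaseOnly R code: the function name `simulate_population`, the sampling of two alleles per individual under Hardy–Weinberg proportions, and the choice to set $\beta_0$ directly from a baseline risk of 4% are all assumptions made here.

```python
import numpy as np

def simulate_population(n, maf, p_e, beta0, beta_g, beta_e, beta_gxe, rng):
    """Simulate G (dominant coding), E, and disease status Y under model (1)."""
    # Two alleles per person from uniform draws (assumes Hardy-Weinberg equilibrium).
    alleles = rng.random((n, 2)) < maf
    g = alleles.any(axis=1).astype(int)       # 1 if at least one minor allele
    e = (rng.random(n) < p_e).astype(int)     # binary exposure, independent of G
    eta = beta0 + beta_g * g + beta_e * e + beta_gxe * g * e
    risk = 1.0 / (1.0 + np.exp(-eta))         # inverse logit
    y = (rng.random(n) < risk).astype(int)    # 1 = case, 0 = control
    return g, e, y

rng = np.random.default_rng(2025)
g, e, y = simulate_population(
    n=200_000, maf=0.2, p_e=0.1,
    beta0=np.log(0.04 / 0.96),                # baseline risk of 4% (assumed value)
    beta_g=np.log(1.2), beta_e=np.log(2.0), beta_gxe=0.0, rng=rng,
)
cases = y == 1
print(f"overall prevalence: {y.mean():.4f}, cases: {cases.sum()}")
```

Note that $\beta_0$ fixes the risk in unexposed non-carriers, so the overall prevalence in the simulated population is somewhat higher than the baseline risk whenever the main effects increase risk.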

For the case-only design, the following formula was applied to estimate the interaction effect ${\beta }_{G\times E}$, where $OR$ is the odds ratio:

$$\begin{aligned}{\widehat\beta}_{G\times E}&=\ln({\text{OR}}_{G\times E}\vert_{Y=1})\\&=\ln(\frac{P(G=1,E=1\vert Y=1)\times P(G=0,E=0\vert Y=1)}{P(G=1,E=0\vert Y=1)\times P(G=0,E=1\vert Y=1)})\end{aligned}$$

(2)

For the case–control design, the logistic regression model $logit({Y}_{i}=1)={\beta }_{0}+{\beta }_{G}{G}_{i}+{\beta }_{E}{E}_{i}+{\beta }_{G\times E}{G}_{i}{E}_{i}$ was used to estimate the interaction effect ${\beta }_{G\times E}$. The p-value ($P$) for both the case-only and case–control designs was computed using the Wald test for ${\beta }_{G\times E}$ in the logistic regression model.
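As a minimal sketch of formula (2), the case-only estimate can be computed from the $2\times 2$ table of G by E among cases. The function name `case_only_test` and the example counts are hypothetical. The standard error used here is the usual large-sample (Woolf) standard error of a log odds ratio, which should coincide with the Wald standard error from a logistic regression of G on E among cases when both are binary; it is an assumption that this matches the authors' implementation.

```python
import math

def case_only_test(n11, n10, n01, n00):
    """Case-only estimate of beta_GxE from counts of cases with (G=g, E=e).

    n11: G=1,E=1   n10: G=1,E=0   n01: G=0,E=1   n00: G=0,E=0
    Returns (beta_hat, standard error, two-sided Wald p-value).
    """
    if min(n11, n10, n01, n00) <= 0:
        raise ValueError("all four cell counts must be positive")
    # Formula (2): log of the G-E odds ratio among cases.
    beta_hat = math.log((n11 * n00) / (n10 * n01))
    # Large-sample standard error of a log odds ratio.
    se = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    z = beta_hat / se
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return beta_hat, se, p_value

# Hypothetical counts for 10,000 cases, used only to show the call.
beta_hat, se, p_value = case_only_test(n11=500, n10=3000, n01=700, n00=5800)
print(f"beta_hat={beta_hat:.4f}, se={se:.4f}, p={p_value:.3g}")
```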

When estimating type I error under the null of no interaction (${\beta }_{G\times E}=0$), 1,000,000 replicates were generated and analysed. Type I error was calculated for the different significance levels ($\alpha$) varying from 0.001 to 0.05. Quantile–quantile (QQ) plots were also generated. For evaluating power under the alternative [${\beta }_{G\times E}=\ln(1.2)$], 100,000 replicates were generated and analysed. Power was estimated as the proportion of replicates with $P<0.05$. For data generated under the null and alternative the distribution of ${\widehat{\beta }}_{G\times E}$ was obtained using all replicates.
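The replicate-based estimate of type I error can be illustrated with a small Monte Carlo sketch. To keep it fast, it draws the $2\times 2$ table of cases directly from a multinomial distribution with the exact cell probabilities $P(G,E\vert Y=1)$ implied by model (1), rather than simulating a whole population; this shortcut, the function names, the use of a baseline risk instead of a target prevalence, and the 20,000 replicates (far fewer than in the study) are all assumptions of this sketch.

```python
import numpy as np
from statistics import NormalDist

def case_cell_probs(beta0, beta_g, beta_e, beta_gxe, p_g, p_e):
    """P(G=g, E=e | Y=1) in the order (1,1), (1,0), (0,1), (0,0).

    Assumes G and E are independent in the population.
    """
    weights = []
    for g, e in [(1, 1), (1, 0), (0, 1), (0, 0)]:
        eta = beta0 + beta_g * g + beta_e * e + beta_gxe * g * e
        risk = 1.0 / (1.0 + np.exp(-eta))
        freq = (p_g if g else 1 - p_g) * (p_e if e else 1 - p_e)
        weights.append(freq * risk)
    weights = np.array(weights)
    return weights / weights.sum()

def rejection_rate(probs, n_cases, n_rep, alpha, rng):
    """Proportion of replicates in which the case-only Wald test rejects."""
    # Cell counts are assumed non-zero, which holds for the sizes used here.
    counts = rng.multinomial(n_cases, probs, size=n_rep).astype(float)
    n11, n10, n01, n00 = counts.T
    beta_hat = np.log(n11 * n00 / (n10 * n01))
    se = np.sqrt((1.0 / counts).sum(axis=1))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return float(np.mean(np.abs(beta_hat / se) > z_crit))

rng = np.random.default_rng(1)
p_g = 1 - (1 - 0.2) ** 2            # carrier frequency, dominant model, MAF 0.2
beta0 = np.log(0.04 / 0.96)         # baseline risk of 4% (assumed value)

# Null of no interaction, without and with strong main effects.
probs_null = case_cell_probs(beta0, 0.0, 0.0, 0.0, p_g, 0.1)
probs_strong = case_cell_probs(beta0, np.log(3.846), np.log(5.0), 0.0, p_g, 0.1)
rate_null = rejection_rate(probs_null, 10_000, 20_000, 0.05, rng)
rate_strong = rejection_rate(probs_strong, 10_000, 20_000, 0.05, rng)
print(f"no main effects: {rate_null:.4f}, strong main effects: {rate_strong:.4f}")
```

Setting `beta_gxe` to a non-zero value in `case_cell_probs` turns the same rejection rate into a power estimate.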

Theoretical and analytical studies

We developed a theoretical framework to demonstrate the relationship between the interaction effect and the $OR$ between G and E in the cases and controls. As stated in Piegorsch et al. [2], under the logistic model (1), the exponential of the interaction effect ${\beta }_{G\times E}$ approximately equals the $OR$ between G and E among the cases when the disease prevalence is sufficiently low and G and E are independent in the population. Here we explicitly write the relationship as follows:

$$exp({\beta }_{G\times E})=\frac{O{R}_{G,E}{|}_{Y=1}}{O{R}_{G,E}{|}_{Y=0}}$$

(3)

When the disease prevalence is sufficiently low, $O{R}_{G,E}{|}_{Y=0}$ is approximately the $OR$ between G and E in the population (Supplementary Methods Sections 1, 2.1, and 2.2).
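The Supplementary Methods are not reproduced here, but one way to see why relationship (3) holds is sketched below. The notation ${\pi }_{ge}=P(Y=1\vert G=g,E=e)$ is introduced only for this sketch, and independence of G and E in the population is assumed, so that the marginal frequencies $P(G=g)P(E=e)$ cancel within each odds ratio:

$$\begin{aligned}P(G=g,E=e\vert Y=1)&=\frac{P(G=g)P(E=e){\pi }_{ge}}{P(Y=1)},\qquad P(G=g,E=e\vert Y=0)=\frac{P(G=g)P(E=e)(1-{\pi }_{ge})}{P(Y=0)}\\\frac{O{R}_{G,E}{|}_{Y=1}}{O{R}_{G,E}{|}_{Y=0}}&=\frac{{\pi }_{11}{\pi }_{00}}{{\pi }_{10}{\pi }_{01}}\times \frac{(1-{\pi }_{10})(1-{\pi }_{01})}{(1-{\pi }_{11})(1-{\pi }_{00})}=\frac{\frac{{\pi }_{11}}{1-{\pi }_{11}}\cdot \frac{{\pi }_{00}}{1-{\pi }_{00}}}{\frac{{\pi }_{10}}{1-{\pi }_{10}}\cdot \frac{{\pi }_{01}}{1-{\pi }_{01}}}\\&=\exp\left[({\beta }_{0}+{\beta }_{G}+{\beta }_{E}+{\beta }_{G\times E})+{\beta }_{0}-({\beta }_{0}+{\beta }_{G})-({\beta }_{0}+{\beta }_{E})\right]=\exp({\beta }_{G\times E})\end{aligned}$$

The last line uses the fact that, under model (1), the odds $\frac{{\pi }_{ge}}{1-{\pi }_{ge}}$ equal the exponential of the linear predictor for each combination of $g$ and $e$.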

In addition, by comparing ${\widehat{\beta }}_{G\times E}$ in formula (2) and the true ${\beta }_{G\times E}$ in formula (3), we derived the analytical bias given the true ${\beta }_{G}$, ${\beta }_{E}$ and ${\beta }_{G\times E}$ and the baseline prevalence of the disease (Supplementary Methods Sect. 2.3):

$$\begin{aligned}{\text{bias}}&={\widehat\beta}_{G\times E}-\beta_{G\times E}\\&=\ln\left({\text{OR}}_{G\times E}\vert_{Y=0}\right)\\&=\ln(\frac{P(G=1,E=1\vert Y=0)\times P(G=0,E=0\vert Y=0)}{P(G=1,E=0\vert Y=0)\times P(G=0,E=1\vert Y=0)})\\&=\ln(\frac{\text{exp}(2\beta_0+\beta_G+\beta_E)+\text{exp}(\beta_0+\beta_G)+\text{exp}(\beta_0+\beta_E)+1}{\text{exp}(2\beta_0+\beta_G+\beta_E+\beta_{G\times E})+\text{exp}(\beta_0+\beta_G+\beta_E+\beta_{G\times E})+\text{exp}(\beta_0)+1})\end{aligned}$$

(4)
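Formula (4) is straightforward to evaluate numerically. The sketch below is a direct transcription of its last line; the function name `analytical_bias` is hypothetical, and $\beta_0$ is set from a baseline risk (the risk when $G=0$ and $E=0$), which is an assumption of this sketch and is not the same quantity as the overall disease prevalence.

```python
import math

def analytical_bias(beta0, beta_g, beta_e, beta_gxe=0.0):
    """Analytical bias of the case-only estimate, last line of formula (4)."""
    numerator = (
        math.exp(2 * beta0 + beta_g + beta_e)
        + math.exp(beta0 + beta_g)
        + math.exp(beta0 + beta_e)
        + 1.0
    )
    denominator = (
        math.exp(2 * beta0 + beta_g + beta_e + beta_gxe)
        + math.exp(beta0 + beta_g + beta_e + beta_gxe)
        + math.exp(beta0)
        + 1.0
    )
    return math.log(numerator / denominator)

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1.0 - p))

# Bias under the null for beta_G = ln(1.2), beta_E = ln(2) at three
# baseline risks (assumed values, not overall prevalences).
biases = {
    risk: analytical_bias(logit(risk), math.log(1.2), math.log(2.0))
    for risk in (0.01, 0.04, 0.20)
}
for risk, bias in biases.items():
    print(f"baseline risk {risk:.2f}: bias {bias:.6f}")
```

Because the baseline risk is lower than the overall prevalence when both main effects increase risk, these values are not expected to reproduce the prevalence-based biases reported in the tables exactly.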

Results

Simulation studies

Type I error—case-only and case–control designs

Simulation studies were used to evaluate type I error when testing for interactions (${\beta }_{G\times E}$) for the case-only and case–control designs. Table 1a and Supplementary Table S1 show the type I error for the case-only and case–control designs for $\alpha =0.001, 0.01$ and $0.05$ when there are main effects, i.e., ${\beta }_{G}=\ln(1.2)$ and ${\beta }_{E}=\ln(2)$ under the dominant model. Similar results were observed for the additive model (Supplementary Table S2).

Table 1 Type I error rates for case-only and case–control designs


In this setting, type I error rate generally increases with disease prevalence and is inflated when disease prevalence is $\ge$ 4%. When disease prevalence is 4%, type I error for the case-only design is 0.001098 and 0.051387 for $\alpha =0.001$ and $\alpha =0.05$, respectively. When disease prevalence is 20%, type I error rate is greatly inflated to 0.002091 and 0.070049, for $\alpha =0.001$ and $\alpha =0.05$, respectively. The QQ plots show that type I error rate for the case-only design increases with increasing disease prevalence, whereas for the case–control design the type I error rate is well controlled even when the disease prevalence is 20% (Fig. 1 and Supplementary Figure S1a). When one or both main effects are absent, the case-only design has well controlled type I error rate even when the disease prevalence is high, e.g., 20% (Table 1b-d and Supplementary Figure S1b-d).

Main effects and type I error for the case-only design

We also evaluated type I error for a variety of main effects (Table 2a). The results suggest that if one or both main effects are strong, the disease prevalence should be $<$ 4% for well controlled type I error rate. If ${\beta }_{G}$ is increased from $\ln(1.2)$ to $\ln(3.846)$ while ${\beta }_{E}$ remains $\ln(2)$ and the disease prevalence is 1%, type I error rate increases from 0.050132 to 0.056403. When both main effects are strong [${\beta }_{G}=\ln(3.846)$ and ${\beta }_{E}=\ln(5)$], type I error is very inflated even when the disease prevalence is as low as 1%, i.e., 0.141321 for $\alpha =0.05$. Interestingly, if the main effects are protective, the case-only design can be applied to diseases with higher prevalences without inflated type I error, e.g., when both main effects are strongly protective [${\beta }_{G}=\ln\left(0.26\right)$ and ${\beta }_{E}=\ln\left(0.2\right)$], type I error is 0.048999 when the disease prevalence is 5%. If the main effects are in opposite directions but both strong, type I error is still inflated even for a disease prevalence of 1% (Supplementary Table S3a). If both main effects are weak, the disease prevalence can be $>$ 4% with well controlled type I error rate, e.g., if ${\beta}_{G}=\ln\left(1.05\right)$ and ${\beta}_{E}=\ln(1.1)$, type I error is well controlled (0.049666) even when the disease prevalence is 20%.

Table 2 Impact of main effects on type I error and analytical bias for the case-only design


Exposure frequencies and type I error for the case-only design

Genetic variant and environmental exposure frequencies also affect type I error for the case-only design. Type I error rate first increases as variant and environmental exposure frequencies become greater, then decreases as the minor allele becomes the major allele or > 50% of the population are exposed. For example, when the disease prevalence is 5% and the frequency of environmental exposure is 10%, type I error is 0.049830 (MAF = 0.05), 0.052221 (MAF = 0.20), and 0.051684 (MAF = 0.50). However, the impact of MAF and the environmental exposure frequency on type I error rate is limited compared to the influence of main effects and disease prevalence (Table 3a).

Table 3 Impact of variant and environmental exposure frequencies on type I error and analytical bias for the case-only design


Sample size and type I error for the case-only design

We also evaluated the role sample size plays in type I error for the case-only design. When the disease prevalence is 4% and there are two main effects [${\beta }_{G}=\ln(1.2)$ and ${\beta }_{E}=\ln(2)$], type I error is 0.050325, 0.051155, and 0.058031 for 2,500, 10,000, and 50,000 cases, respectively, for $\alpha=0.05$. If the sample size is 20,000 the disease prevalence should be $\le$ 2% for well controlled type I error rate. For a disease prevalence of 2% and a sample size of 50,000 cases, type I error is inflated (0.051977). On the other hand, if the sample size is 2,500, type I error is still well controlled (0.050102) when the disease prevalence is 5% (Table 4).
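One way to understand why type I error grows with sample size is that the bias of ${\widehat{\beta }}_{G\times E}$ does not depend on the number of cases, while its standard error shrinks roughly as $1/\sqrt{n}$, so the Wald statistic is centred further from zero in larger samples. The sketch below uses a normal approximation to illustrate this; it is not the authors' method, and the function name `approx_type1_error`, the use of a baseline risk of 4% in place of a target prevalence, and the carrier and exposure frequencies are assumptions.

```python
import math
from statistics import NormalDist

def approx_type1_error(beta0, beta_g, beta_e, p_g, p_e, n_cases, alpha=0.05):
    """Normal approximation to the case-only rejection rate under the null."""
    # Exact P(G=g, E=e | Y=1) with no interaction and G independent of E.
    weights = {}
    for g, e in [(1, 1), (1, 0), (0, 1), (0, 0)]:
        risk = 1.0 / (1.0 + math.exp(-(beta0 + beta_g * g + beta_e * e)))
        freq = (p_g if g else 1 - p_g) * (p_e if e else 1 - p_e)
        weights[(g, e)] = freq * risk
    total = sum(weights.values())
    p = {cell: w / total for cell, w in weights.items()}
    # Large-sample value of beta_hat under the null, i.e. the bias.
    bias = math.log(p[1, 1] * p[0, 0] / (p[1, 0] * p[0, 1]))
    # Approximate standard error of a log odds ratio with n_cases cases.
    se = math.sqrt(sum(1.0 / (n_cases * prob) for prob in p.values()))
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    shift = bias / se  # where the Wald statistic is centred
    return normal.cdf(-z_crit + shift) + normal.cdf(-z_crit - shift)

beta0 = math.log(0.04 / 0.96)   # baseline risk of 4% (assumed value)
rates = {
    n: approx_type1_error(beta0, math.log(1.2), math.log(2.0), 0.36, 0.1, n)
    for n in (2_500, 10_000, 50_000)
}
for n, rate in rates.items():
    print(f"{n} cases: approximate type I error {rate:.4f}")
```

With a fixed non-zero bias, the approximate rejection rate rises above the nominal $\alpha$ as the number of cases increases; when one main effect is zero the bias vanishes and the rate stays at $\alpha$ for any sample size.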

Table 4 Impact of sample size on type I error rates for the case-only design


Statistical power—case-only and case–control designs

When there are two main effects [${\beta }_{G}=\ln(1.2)$ and ${\beta }_{E}=\ln(2)$] under the dominant model with disease prevalence $\le$ 5%, the case-only design has higher power than case–control designs even when $R=3$. Similar results were observed for the additive model (Supplementary Figure S2). When disease prevalence is 1%, the power for the case-only and case–control ($R=1$) designs is 0.93 and 0.56, respectively, and even when disease prevalence increases to 4%, the power of the case-only design (0.87) is 1.58 times that of the case–control design (0.55). The power of the case-only design drops substantially when disease prevalence increases to 10%, but it is not until the disease prevalence reaches 20% that the case–control design ($R=1$) has greater power (Fig. 2a).

When one or both main effects are absent, the power for the case-only design is also significantly higher than for the case–control design, even though the former has a much smaller sample size, e.g., N = 10,000 for case-only and N = 20,000 for case–control ($R=1$) designs. For a disease prevalence of 4%, when there is only a main genetic effect [${\beta }_{G}=\ln(1.2)$ and ${\beta }_{E}=0$], the power is 0.76 and 0.48 for the case-only and case–control designs, respectively. As the disease prevalence increases to 20% the power for the case-only design is still greater than for the case–control design but the gain in power is not as great, i.e., 0.56 vs. 0.47. The result is similar when there is only one main environmental effect [${\beta }_{G}=0$ and ${\beta }_{E}=\ln(2)$] or no main effects [${\beta }_{G}=0$ and ${\beta }_{E}=0$] (Fig. 2b-d).

MAF and the frequency of environmental exposure may also have an impact on power. For example, when MAF is fixed at 0.2 and disease prevalence is 4%, the power for the case-only design is 0.65, 0.87, and 0.97 for environmental exposure frequencies of 0.05, 0.10, and 0.20, respectively. However, when MAF increases from 0.05 to 0.20 to 0.50 with the frequency of environmental exposure fixed at 10%, under the dominant model, the power first increases from 0.53 to 0.87 and then decreases to 0.73 (Supplementary Figure S3).

Bias—analytical results

In formula (4), we derived the analytical bias of ${\widehat{\beta }}_{G\times E}$ given the baseline risk ($\frac{1}{1+{e}^{-{\beta }_{0}}}$), main effects (${\beta }_{G}$ and ${\beta }_{E}$), and true interaction effect (${\beta }_{G\times E}$). Under the null, when one or both main effects are absent, the bias of ${\widehat{\beta }}_{G\times E}$ equals 0. However, when the main effects either both increase or decrease the risk of developing disease, ${\widehat{\beta }}_{G\times E}$ underestimates the true interaction effect ${\beta }_{G\times E}$, leading to a negative bias of ${\widehat{\beta }}_{G\times E}$, and the bias is greater with higher disease prevalence. For example, when ${\beta }_{G}=\ln(1.2)$, ${\beta }_{E}=\ln(2)$ for the following disease prevalences 1%, 4%, and 20% the biases are −0.001667, −0.006330, and −0.023631, respectively. Stronger main effects also lead to greater bias of ${\widehat{\beta }}_{G\times E}$, e.g., when both main effects are strong [$\beta_{G}=\ln(3.846)$ and ${\beta }_{E}=\ln(5)$], the bias of ${\widehat{\beta }}_{G\times E}$ reaches −0.144066 even when the disease prevalence is only 4%. When main effects (${\beta }_{G}$ and ${\beta }_{E}$) are in opposite directions, ${\widehat{\beta }}_{G\times E}$ for the case-only design overestimates ${\beta }_{G\times E}$ (a positive bias), with the bias increasing as the disease prevalence increases. As long as there are two non-zero main effects, ${\widehat{\beta }}_{G\times E}$ is biased (Table 3b, Supplementary Table S3, Supplementary Figure S4 and Supplementary Sect. 2.3).

Under the alternative [${\beta }_{G\times E}=\ln(1.2)$], even when there are no main effects, the bias of ${\widehat{\beta }}_{G\times E}$ is negative, indicating that the case-only design underestimates the interaction effect. When ${\beta }_{G}=\ln(1.2)$, ${\beta }_{E}=\ln(2)$ and ${\beta }_{G\times E}=\ln(1.2)$, the bias is −0.005613, −0.021628, and −0.089011 for disease prevalences of 1%, 4%, and 20%, respectively. The bias of ${\widehat{\beta }}_{G\times E}$ is smaller when ${\beta }_{G}$ and ${\beta }_{E}$ are in opposite directions compared to when ${\beta }_{G}$ and ${\beta }_{E}$ are either both positive or negative, e.g., when ${\beta }_{G}=\ln(1.2)$, ${\beta }_{G\times E}=\ln(1.2)$, and the disease prevalence is 20%, the bias is −0.089011 when ${\beta }_{E}=\ln(2)$, but only −0.008341 when ${\beta }_{E}=\ln(0.5)$. However, the bias still increases as the disease becomes more prevalent regardless of the main effects (Supplementary Table S4 and Supplementary Figure S5).

The bias of ${\widehat{\beta }}_{G\times E}$ is closely related to type I and II error rates for the case-only design. Under the null, when there are two main effects the bias of ${\widehat{\beta }}_{G\times E}$ always leads to an increase in type I error rate, and ${\widehat{\beta }}_{G\times E}$ for the case-only design either underestimates or overestimates ${\beta }_{G\times E}$. For the case-only design, as the disease prevalence increases, the bias of ${\widehat{\beta }}_{G\times E}$ is greater, leading to higher type I error rates. The case-only design may suffer from inflated type I error rate if the absolute value of the analytical bias of ${\widehat{\beta }}_{G\times E}$ $>$ 0.006 (both main effects increase risk), $>$ 0.01 (one main effect increases and the other decreases risk), and $>$ 0.025 (both main effects are protective) (Tables 3 and 4). Under the alternative, if the bias of ${\widehat{\beta }}_{G\times E}$ is in the same direction as ${\beta }_{G\times E}$, this will boost the power of the case-only design. However, it should be noted even when ${\beta }_{G\times E}$ and the bias of ${\widehat{\beta }}_{G\times E}$ are in opposite directions, e.g., ${\beta }_{G\times E}$ is positive and the bias of ${\widehat{\beta }}_{G\times E}$ is negative, the power for the case-only design is greater than when a case–control sample is analysed for disease prevalences $<$ 15% (Fig. 2 and Supplementary Table S4).

Discussion

We evaluated the case-only design through theoretical and simulation analyses, and showed that several factors, including disease prevalence, main effects, variant and environmental exposure frequencies, and sample size, may impact the bias of the interaction term and type I and II error rates. While previous studies stated that a rare disease assumption is necessary for the case-only design, they often did not clarify what constitutes a "rare" disease. Our simulations investigated disease prevalences from 1% to 20% to assess their effects on type I error rates and on bias in estimating interaction terms. Compared with the conventional case–control design, the power of the case-only design can be substantially greater, but it should only be applied when type I error rate is controlled. Generally, under the assumption of independence between G and E, for disease prevalences $<$ 4% type I error rate of the case-only design is not inflated, but for higher disease prevalences type I error rate can be high and the estimate of the interaction effect biased. When disease prevalence is $>$ 20% the power for the case-only design can be lower than that obtained by analysing a case–control sample. When one or both main effects are absent, the disease prevalence does not impact type I error rate. However, caution is required since there may be a failure to detect main effects even when they exist. The analytical bias can be calculated using formula (4) to aid in evaluating whether the case-only design is appropriate to use. In addition, when the sample size is large ($>$ 10,000 cases), the case-only design requires a lower disease prevalence to avoid inflation of type I error rate.

Stronger main effects may lead to a greater bias of the interaction estimate and higher type I error rate. Though it is uncommon for complex traits to have a genetic risk factor with $OR$ $>$ 1.5, there are several environmental exposures that have large main effects [15,16,17]. We recommend applying the case-only design to test for interactions for complex traits with low prevalences (e.g., ovarian cancer and celiac disease), and limiting testing of interactions to variants and environmental exposures that do not have strong main effects.

Because the bias of ${\widehat{\beta }}_{G\times E}$ may be in the same or the opposite direction as ${\beta }_{G\times E}$, type I and II error rates can both increase with increasing disease prevalence. Therefore, unlike for most statistical tests, for the case-only design power can decrease as type I error rate increases.

Yang et al. claimed that no assumption about disease prevalence is required, stating that the cross-product term of the $2\times 2$ table (presence or absence of the risk allele for genes 1 and 2 among cases) measures the departure from the multiplicative joint effects of the relative risk, aka the risk ratio ($RR$) (not $OR$) [3]. The $RR$ is used for observational cohort studies that have incident cases and only approximates the $OR$ for diseases with low prevalence. Therefore, using $RR$ to define interaction is inaccurate and both ${\widehat{\beta }}_{G\times E}$ estimated by the case-only design and the “interaction” defined by $RR$ have high type I error rates when the disease is prevalent. In fact, the case-only design measures the “interaction” effect defined by $RR$ (Supplementary Methods Sect. 2.4) and they are both biased estimates of ${\beta }_{G\times E}$ (Supplementary Figures S4 and S5).

In addition to multiplicative interaction, there are other types of interactions such as additive [18] and sufficient-cause interactions [9]. Interaction on an additive scale often uses the index of relative excess risk due to interaction, which should be used with caution due to its use of the relative risk, as discussed in the paragraph above. Sufficient-cause interaction comprises a set of conditions or events that lead to a specific outcome, which is equivalent to the scenario when the main effects for G and E are both absent.

Conclusions

Although the case-only design is a powerful method to detect interactions, type I error rates can be increased in a variety of scenarios, e.g., when main effects, allele frequencies, or environmental exposure frequencies are not sufficiently low, and type I error rate should therefore be evaluated. Our research contributes to the existing literature by establishing clear guidelines on the acceptable thresholds for disease prevalence that keep type I error rates under control. The evaluation can be done by analytically examining the bias [formula (4)] and by performing simulation studies with the R code, CaseOnly, which simulates and analyses data to evaluate type I error rate. CaseOnly can also be used to evaluate power. Although the case-only design is a powerful method to detect interactions, it should be used with caution.

Data availability

The source code used to simulate and analyze data in this study is available in the CaseOnly repository (https://github.com/RuiDongDR/CaseOnly), and the simulations can be replicated by running the R script CaseOnly.R.

References

1. Wang H, Zhang F, Zeng J, Wu Y, Kemper KE, Xue A, et al. Genotype-by-environment interactions inferred from genetic effects on phenotypic variability in the UK Biobank. Sci Adv. 2019;
2. Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat Med. 1994;13:153–62.
3. Yang Q, Khoury MJ, Sun F, Flanders WD. Case-Only Design to Measure Gene-Gene Interaction. Epidemiology. 1999;10:167–70.
4. Wang L-Y, Lee W-C. Population Stratification Bias in the Case-Only Study for Gene-Environment Interactions. Am J Epidemiol. 2008;168:197–201.
5. Gatto NM. Further development of the case-only design for assessing gene-environment interaction: evaluation of and adjustment for bias. Int J Epidemiol. 2004;33:1014–24.
6. Bhattacharjee S, Wang Z, Ciampa J, Kraft P, Chanock S, Yu K, et al. Using Principal Components of Genetic Variation for Robust and Powerful Detection of Gene-Gene Interactions in Case-Control and Case-Only Studies. Am J Hum Genet. 2010;86:331–42.
7. Khoury MJ, Flanders WD. Nontraditional Epidemiologic Approaches in the Analysis of Gene Environment Interaction: Case-Control Studies with No Controls! Am J Epidemiol. 1996;144:207–13.
8. Lee W-C, Wang L-Y, Cheng KF. An easy-to-implement approach for analyzing case-control and case-only studies assuming gene-environment independence and Hardy-Weinberg equilibrium. Stat Med. 2010;29:2557–67.
9. Lee W-C. Testing for sufficient-cause gene-environment interactions under the assumptions of independence and Hardy-Weinberg equilibrium. Am J Epidemiol. 2015;182:9–16.
10. Lash TL, Bradbury BD, Wilk JB, Aschengrau A. A case-only analysis of the interaction between N-acetyltransferase 2 haplotypes and tobacco smoke in breast cancer etiology. Breast Cancer Res. 2005;7:R385.
11. Neslund-Dudas C, Levin AM, Rundle A, Beebe-Dimmer J, Bock CH, Nock NL, et al. Case-only gene–environment interaction between ALAD tagSNPs and occupational lead exposure in prostate cancer. Prostate. 2014;74:637–46.
12. Liu G, Mukherjee B, Lee S, Lee AW, Wu AH, Bandera EV, et al. Robust Tests for Additive Gene-Environment Interaction in Case-Control Studies Using Gene-Environment Independence. Am J Epidemiol. 2018;187:366–77.
13. Helbig KL, Nothnagel M, Hampe J, Balschun T, Nikolaus S, Schreiber S, et al. A case-only study of gene-environment interaction between genetic susceptibility variants in NOD2 and cigarette smoking in Crohn’s disease aetiology. BMC Med Genet. 2012;13:14.
14. Clarke GM, Pettersson FH, Morris AP. A comparison of case-only designs for detecting gene × gene interaction in rheumatoid arthritis using genome-wide case-control data in Genetic Analysis Workshop 16. BMC Proc. 2009;3:S73.
15. Tuomi T, Huuskonen MS, Virtamo M, Tossavainen A, Tammilehto L, Mattson K, et al. Relative risk of mesothelioma associated with different levels of exposure to asbestos. Scand J Work Environ Health. 1991;17:404–8.
16. Van Hylckama VA, Helmerhorst FM, Vandenbroucke JP, Doggen CJM, Rosendaal FR. The venous thrombotic risk of oral contraceptives, effects of oestrogen dose and progestogen type: results of the MEGA case-control study. BMJ. 2009;339:b2921.
17. Bloemenkamp KWM, Rosendaal FR, Büller HR, Helmerhorst FM, Colly LP, Vandenbroucke JP. Risk of Venous Thrombosis With Use of Current Low-Dose Oral Contraceptives Is Not Explained by Diagnostic Suspicion and Referral Bias. Arch Intern Med. 1999;159:65.
18. Lee W-C. Sample size calculations for additive interactions. Epidemiology. 2013;24:774.

Acknowledgements

Not applicable.

Funding

This work was supported by grant R01DC017712 to SML and ATD from the National Institute on Deafness and Other Communication Disorders (NIDCD).

Author information

Authors and Affiliations

Center for Statistical Genetics, Gertrude H. Sergievsky Center, and the Department of Neurology, Columbia University Medical Center, New York, NY, 10032, USA
Rui Dong, Gao T. Wang & Suzanne M. Leal
Department of Chronic Disease Epidemiology and Center for Perinatal, Pediatric and Environmental Epidemiology, Yale School of Public Health, 1 Church Street, New Haven, CT, 06510, USA
Andrew T. DeWan


Contributions

RD, GTW and SML designed the theoretical and simulation analysis, performed data interpretation, and RD and SML drafted the manuscript. SML directed the study’s implementation. ATD and SML obtained the funding. All authors provided critical review for important intellectual content of the manuscript and approved the final manuscript for submission.

Corresponding author

Correspondence to Suzanne M. Leal.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Material 1.

Supplementary Material 2.

Supplementary Material 3.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Dong, R., Wang, G.T., DeWan, A.T. et al. The case-only design is a powerful approach to detect interactions but should be used with caution. BMC Genomics 26, 222 (2025). https://doi.org/10.1186/s12864-025-11318-1


Received: 08 May 2024
Accepted: 03 February 2025
Published: 06 March 2025
DOI: https://doi.org/10.1186/s12864-025-11318-1
