\(*\) Correspondence should be addressed to Huining Kang (HuKang@salud.unm.edu).
We recently developed a new statistical method for analyzing differentially expressed/spliced transcript isoforms, which appropriately account for the dependence between isoforms and multiple testing corrections for the multi-dimensional structure of at both the gene-and isoform-level. We proposed a linear mixed effects model-based approach for analyzing the complex alternative RNA splicing regulation patterns detected by whole-transcriptome RNA-Sequencing technologies. This approach thoroughly characterizes and differentiates three types of genes related to alternative RNA splicing events with distinct differential expression/splicing patterns. We applied the concept of appropriately controlling for the gene-level overall false discovery rate (OFDR) in this multi-dimensional alternative RNA splicing analysis utilizing a two-step hierarchical hypothesis testing framework. In the initial screening test we identify genes that have differentially expressed or spliced isoforms; in the subsequent confirmatory testing stage we examine only the isoforms of genes that have passed the screening tests. The methodology is described in detail in the manuscript titled Two-step mixed model approach to analyzing differential alternative RNA splicing submitted to PLOS ONE. We have provided an R program file (Tools.R) that included all the functions for performing the analysis with the proposed method. The R program file can be downloaded here. We also wrote a Tutorial to show step-by-step how to use these functions in R to perform the analyses that we have proposed in our manuscript.
In the manuscript we have reported an application of our method to an RNA-sequencing study of salivary gland adenoid cystic carcinoma (ACC). We adhere to the principles of transparency in order to assure that our results are reproducible and present in this document the detailed steps and R codes for the analysis. The major steps of the analysis for this application are the same as we presented in our Tutorial except that the dataset is different. We strongly encourage readers to read the Tutorial before reading this document.
In the original study conducted by (Brayer, et. al. 2016) the RNA-sequencing was performed on 20 ACC salivary gland tumors and five normal salivary glands. We performed differential isoform expression/splicing analysis for a subset of the tumor data consisting of 8 ACC patients who are free of cancer vs. 6 patients who are not among whom have complete follow-up. Cufflinks was used to estimate and quantify the isoform abundance in FPKM. There are 2850 genes that have at least two detected isoforms. This number reduced to 2820 when we restricted to the 14 patents. We converted the data set to an R list object named data in the format described in the Tutorial. This data set is available per request.
As suggested in our Tutorial we created a folder called ACC as our working directory and under this folder we created two subfolders data and output. Aforementioned R object data is saved in a file named acc.RData in the subfolder data.
In the following (our primary analysis for ACC data) we assumed the compound symmetry with unequal variances as the covariance structure and we set the statistical significance at OFDR = 0.10 due to the exploratory nature of this study where the sample sizes and effect sizes were limited. Before starting our analysis, let’s take a quick look at the data object. The following commands will load this object into the memory and show the elements of the object.
# Load data to the memory
load(file="data/acc.RData")
# list the elements of object data
str(data)
## List of 6
## $ x : num [1:8847, 1:14] 2.62 1.46 3.52 3.71 3.98 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:8847] "ENST00000367429" "ENST00000466229" "ENST00000445516" "ENST00000487168" ...
## .. ..$ : NULL
## $ y : num [1:14] 1 1 1 2 2 1 2 2 1 2 ...
## $ gene_id: chr [1:8847] "ENSG00000000971" "ENSG00000000971" "ENSG00000001631" "ENSG00000001631" ...
## $ gene : chr [1:8847] "CFH" "CFH" "KRIT1" "KRIT1" ...
## $ isoform: chr [1:8847] "ENST00000367429" "ENST00000466229" "ENST00000445516" "ENST00000487168" ...
## $ y.lab : chr [1:2] "Free of Cancer" "Not Free of Cancer"
We see that the data format/structure is exactly the same as that of the data object in our tutorial. The total number of isoforms is 8847 and the total number of genes is as follows
length(unique(data$gene))
## [1] 2820
The function find.nL()
defined in file Tools.R
can be used to calculate the number of isoforms for each gene. It returns a numeric vector with each element being the number of isoforms of one gene and the names of the elements being gene names.
source("Tools.R")
nL <- find.nL(data=data)
The numbers of isoforms for the first 5 genes are
head(nL, n=5)
## CFH KRIT1 CD99 M6PR ICA1
## 2 2 2 5 3
The numbers of isoforms for the last 5 genes are
tail(nL, n=5)
## RPL17 STRADA RP1-178F10.3 ZNF587B RP11-18I14.10
## 5 3 2 2 2
We can summarize all the numbers as follows:
sum.nL <- table(nL)
sum.nL
## nL
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 17
## 1402 674 317 182 98 61 35 17 10 8 8 5 1 1 1
The result indicates that there are 1402 genes that have 2 isoforms for each, 674 genes that have 3 isoforms for each, and so on so forth.
As described in our Tutorial, the first step is to run the screening test\(-\)the likelihood ratio test (LRT) by calling function lrt.test.mclapply()
. We proceed this step by submitting following command:
res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", nCore=NULL, outfile="output/ACC_uneqvar_type1.RData", screen.type=1)
This was the most time-consuming part. It returned a matrix object res
which was saved in the file ACC_uneqvar_type1.RData
under the subfolder output. Now we load the object from the file and take a look.
load(file="output/ACC_uneqvar_type1.RData")
The dimension of the object is
dim(res)
## [1] 2820 6
The rows on the top are
head(res)
## lrt df nIso p.value FDR Time
## CFH 0.4919436 2 2 0.7819443 0.9163829 0.3518181
## KRIT1 1.1450537 2 2 0.5640983 0.8071390 0.4742093
## CD99 1.6574952 2 2 0.4365957 0.7445243 0.5504806
## M6PR 8.4910369 5 5 0.1311697 0.5263642 0.9330542
## ICA1 4.1329972 3 3 0.2474563 0.6249283 0.4287062
## CFLAR 5.8691039 7 7 0.5551140 0.8049110 1.1064849
and on the bottom are
tail(res)
## lrt df nIso p.value FDR Time
## RP11-156P1.3 1.260997 2 2 0.53232632 0.7919478 0.06369615
## RPL17 9.902963 5 5 0.07803189 0.4582985 0.50190425
## STRADA 3.526793 3 3 0.31730378 0.6749765 0.13856792
## RP1-178F10.3 2.487001 2 2 0.28837294 0.6575270 0.05994153
## ZNF587B 2.097044 2 2 0.35045533 0.6991489 0.29776192
## RP11-18I14.10 1.534545 2 2 0.46427766 0.7525246 0.23661232
Each row of the matrix gives the test result for each gene and the row names indicate the names (IDs) of the genes. The columns from left to right are LRT statistic (lrt
), degree of freedom (df
), Number of isoforms (nIso
), p-values of LRT (p.value
), FDR (FDR
) and time used for fitting the linear mixed effects model in second (Time
). Note that there might be missing results for some genes because the corresponding linear mixed models procedures did not converge. Let’s see how many such genes there are.
p <- which(is.na(res[, 1]))
length(p)
## [1] 37
We see that there are 37 (1.31%) genes whose linear mixed effects models did not converge. The names of these genes are
rownames(res)[p]
## [1] "SEC62" "RORA" "DOCK9" "FKBP5" "TNRC6B" "FAM208B"
## [7] "MED13" "EIF4G2" "SPARC" "CLK4" "USP34" "MED13L"
## [13] "STX16" "C3" "ATF4" "ERCC5" "IL6" "NPEPPS"
## [19] "GALNT1" "SMAD4" "LYST" "OSMR" "OGT" "UTRN"
## [25] "RPL30" "CTNNB1" "TPT1-AS1" "AHSA2" "RPL4" "PPP1R2"
## [31] "SESTD1" "DMBT1" "MB" "GNB2L1" "PRB4" "RP11-37B2.1"
## [37] "CCPG1"
Next let’s retain only those called significant at level FDR = 0.1 and assign the results in a new object called sig.res1
sig.res1 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.1, ]
sig.res1
## lrt df nIso p.value FDR Time
## GNPTAB 23.86920 5 5 2.300448e-04 0.09145925 0.4770377
## VEGFA 18.07379 2 2 1.189393e-04 0.09145925 0.2424715
## HNRNPA2B1 29.81071 6 6 4.270285e-05 0.05942102 0.8640647
## RRBP1 18.52259 3 3 3.431237e-04 0.09366467 0.1906805
## PRKAA1 19.51404 3 3 2.140158e-04 0.09145925 0.4196160
## POSTN 31.05287 5 5 9.144749e-06 0.02544984 0.3757513
## FRMD4A 22.40986 4 4 1.660719e-04 0.09145925 0.3195240
## TRIP12 28.24006 7 7 1.989194e-04 0.09145925 0.8803523
## C3orf17 21.09291 4 4 3.035165e-04 0.09366467 0.3033209
## FZD6 15.80285 2 2 3.702161e-04 0.09366467 0.2177060
## ASXL1 18.77488 3 3 3.043197e-04 0.09366467 0.2043502
We see that there is a total of 11 genes that passed the screening test at level FDR = 0.1. Let’s sort these genes by the p-values.
sig.res1[order(sig.res1[, "p.value"]), ]
## lrt df nIso p.value FDR Time
## POSTN 31.05287 5 5 9.144749e-06 0.02544984 0.3757513
## HNRNPA2B1 29.81071 6 6 4.270285e-05 0.05942102 0.8640647
## VEGFA 18.07379 2 2 1.189393e-04 0.09145925 0.2424715
## FRMD4A 22.40986 4 4 1.660719e-04 0.09145925 0.3195240
## TRIP12 28.24006 7 7 1.989194e-04 0.09145925 0.8803523
## PRKAA1 19.51404 3 3 2.140158e-04 0.09145925 0.4196160
## GNPTAB 23.86920 5 5 2.300448e-04 0.09145925 0.4770377
## C3orf17 21.09291 4 4 3.035165e-04 0.09366467 0.3033209
## ASXL1 18.77488 3 3 3.043197e-04 0.09366467 0.2043502
## RRBP1 18.52259 3 3 3.431237e-04 0.09366467 0.1906805
## FZD6 15.80285 2 2 3.702161e-04 0.09366467 0.2177060
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 11 genes that are differentially expressed using function step2.Ftest()
as follows
final.res1 <- step2.Ftest(data=data, file.step1="output/ACC_uneqvar_type1.RData", alpha=0.1, method="hochberg")
## Number of significant genes: 11
## Cutoff for significant isoform is 0.0003952569
Note that there is an argument called method in function step2.Ftest and the following functon mc.step2.Waldtest, which is the options for controlling the family-wise error rate required by our two step procedure. The options are methods of Bonferroni (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”) and Hommel (“hommel”). We use Hochberg option throughout this document. The function returns the results of the confirmatory test for all the isoforms of the above 11 genes in a data.frame object called final.res1
. Let’s take a look at the results.
dim(final.res1)
## [1] 44 7
head(final.res1)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 GNPTAB ENST00000299314 -1.424071 0.003505942 0.017529709 7.905138e-05 FALSE
## 2 GNPTAB ENST00000549165 2.320880 0.023427021 0.093708082 9.881423e-05 FALSE
## 3 GNPTAB ENST00000549194 -2.002233 0.047714461 0.097372530 1.936837e-04 FALSE
## 4 GNPTAB ENST00000552009 -2.072173 0.162558101 0.162558101 3.952569e-04 FALSE
## 5 GNPTAB ENST00000552681 -1.375782 0.048686265 0.097372530 1.976285e-04 FALSE
## 6 VEGFA ENST00000372067 4.650730 0.009558154 0.009558154 3.952569e-04 FALSE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and decison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE. The isoforms that are called significant are as follows
final.res1[final.res1$decision, ]
## gene isoform FC step2.p.value adj.p threshold decision
## 9 HNRNPA2B1 ENST00000356674 1.682557 5.628982e-05 0.0003377389 6.587615e-05 TRUE
We see that with the standard t-test option we identified 1 isoforms that are called significant.
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 11 genes that are differentially expressed using function mc.step2.Waldtest
as follows
final.res1.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/ACC_uneqvar_type1.RData", alpha=0.1, method="hochberg")
## Number of significant genes: 11
## Fitting mixed effects model. This may take a while...
## Model fitting is completed
## # Number of total genes tested: 2783
## # Number of significant genes: 11
## # Cutoff for significant isoform is 0.0003952569
## # Number of significant isoform: 12
Let’s take a look at the results returned by above command.
dim(final.res1.wald)
## [1] 44 6
head(final.res1.wald)
## gene isoform step2.p.value adj.p threshold decision
## 1 GNPTAB ENST00000299314 8.061455e-05 0.0004030728 7.905138e-05 FALSE
## 2 GNPTAB ENST00000549165 4.857991e-03 0.0194319658 9.881423e-05 FALSE
## 3 GNPTAB ENST00000549194 3.044334e-02 0.0608866723 1.976285e-04 FALSE
## 4 GNPTAB ENST00000552009 1.045219e-01 0.1045219228 3.952569e-04 FALSE
## 5 GNPTAB ENST00000552681 1.312816e-02 0.0393844689 1.317523e-04 FALSE
## 6 VEGFA ENST00000372067 8.824491e-04 0.0008824491 3.952569e-04 FALSE
tail(final.res1.wald)
## gene isoform step2.p.value adj.p threshold decision
## 39 C3orf17 ENST00000494891 6.967420e-01 6.967420e-01 0.0003952569 FALSE
## 40 FZD6 ENST00000521195 1.061691e-01 1.061691e-01 0.0003952569 FALSE
## 41 FZD6 ENST00000522566 8.117205e-08 1.623441e-07 0.0001976285 TRUE
## 42 ASXL1 ENST00000306058 1.578210e-03 4.734631e-03 0.0001317523 FALSE
## 43 ASXL1 ENST00000375689 2.046450e-02 2.046450e-02 0.0003952569 FALSE
## 44 ASXL1 ENST00000470145 4.484705e-03 8.969409e-03 0.0001976285 FALSE
The isoforms that are called significant are as follows
final.res1.wald[final.res1.wald$decision, ]
## gene isoform step2.p.value adj.p threshold decision
## 7 VEGFA ENST00000497139 7.287226e-07 1.457445e-06 1.976285e-04 TRUE
## 9 HNRNPA2B1 ENST00000356674 5.720757e-11 3.432454e-10 6.587615e-05 TRUE
## 16 RRBP1 ENST00000495501 2.488224e-05 7.464671e-05 1.317523e-04 TRUE
## 19 PRKAA1 ENST00000511248 3.032114e-08 9.096341e-08 1.317523e-04 TRUE
## 20 POSTN ENST00000379743 6.157335e-05 1.847200e-04 1.317523e-04 TRUE
## 22 POSTN ENST00000478947 4.803759e-07 1.921503e-06 9.881423e-05 TRUE
## 24 POSTN ENST00000541179 8.160369e-08 4.080184e-07 7.905138e-05 TRUE
## 25 FRMD4A ENST00000358621 9.130130e-06 2.739039e-05 1.317523e-04 TRUE
## 28 FRMD4A ENST00000492155 7.353218e-06 2.739039e-05 1.061106e-04 TRUE
## 31 TRIP12 ENST00000428959 4.191024e-08 2.933717e-07 5.646527e-05 TRUE
## 38 C3orf17 ENST00000474311 8.501880e-07 3.400752e-06 9.881423e-05 TRUE
## 41 FZD6 ENST00000522566 8.117205e-08 1.623441e-07 1.976285e-04 TRUE
We see that with the Wald test option we identified 12 isoforms that are called significant. These isoforms belong to 9 genes, which are
unique(final.res1.wald[final.res1.wald$decision, 1])
## [1] "VEGFA" "HNRNPA2B1" "RRBP1" "PRKAA1" "POSTN" "FRMD4A" "TRIP12" "C3orf17"
## [9] "FZD6"
Now we check if the isoforms identified by t-test are elements of that identified by the Wald test as follows
is.element(final.res1[final.res1$decision, 2], final.res1.wald[final.res1.wald$decision, 2])
## [1] TRUE
We see that the isoforms identified by t-test are a proper subset of that identified by the Wald test.
The procedure for the analysis based on Type 2 screening test is the same as that based on Type 1 test except that the argument screen.type
of function lrt.test.mclapply
needs to be specified as 2 rather than 1. We perform the Type 2 screening test by submitting following command:
res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", outfile="output/ACC_uneqvar_type2.RData", screen.type=2)
Let’s take a look at the results
load(file="output/ACC_uneqvar_type2.RData")
The dimension of the object is
dim(res)
## [1] 2820 6
The rows on the top are
head(res)
## lrt df nIso p.value FDR Time
## CFH 0.0005488067 1 2 0.9813100 0.9972579 0.2570431
## KRIT1 0.1836651401 1 2 0.6682422 0.9239242 0.5494285
## CD99 0.5384275999 1 2 0.4630851 0.8545718 0.6255455
## M6PR 5.5588222692 4 5 0.2346071 0.7161827 1.0294824
## ICA1 3.9085784985 2 3 0.1416651 0.6252655 0.4974437
## CFLAR 5.6094197880 6 7 0.4683317 0.8552340 1.1157057
and on the bottom are
tail(res)
## lrt df nIso p.value FDR Time
## RP11-156P1.3 0.2502671 1 2 0.6168871 0.9104418 0.07161355
## RPL17 7.2153502 4 5 0.1249362 0.5975773 0.48692489
## STRADA 3.4497661 2 3 0.1781939 0.6589856 0.14717674
## RP1-178F10.3 1.7910057 1 2 0.1808037 0.6614968 0.07882619
## ZNF587B 0.3767853 1 2 0.5393287 0.8851202 0.26278758
## RP11-18I14.10 1.5043512 1 2 0.2200031 0.6997417 0.22513223
Let’s check how many genes with missing results.
p <- which(is.na(res[, 1]))
length(p)
## [1] 39
We see that there are 39 (1.38%) genes whose linear mixed effects models did not converge. The names of these genes are
rownames(res)[p]
## [1] "UBE3C" "RORA" "TGFBR3" "HSP90AA1" "DOCK9" "FKBP5"
## [7] "RAB18" "FAM208B" "DDX6" "SPARC" "CLK4" "USP34"
## [13] "MED13L" "STX16" "C3" "ATF4" "MLLT4" "ERCC5"
## [19] "NPEPPS" "SSH2" "LYST" "OSMR" "OGT" "PSIP1"
## [25] "CTNNB1" "UGP2" "SMR3B" "AHSA2" "RPL4" "PAWR"
## [31] "PCBP1-AS1" "PPP1R2" "SESTD1" "DMBT1" "MB" "GNB2L1"
## [37] "DIO2" "RP11-37B2.1" "CCPG1"
Next let’s retain only those called significant at level FDR = 0.1 and assign the results in a new object called sig.res2
sig.res2 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.1, ]
sig.res2
## lrt df nIso p.value FDR Time
## RRBP1 18.52213 2 3 9.505402e-05 0.08582668 0.2357700
## PRKAA1 18.05272 2 3 1.201991e-04 0.08582668 0.4137421
## TRIP12 27.36960 6 7 1.234472e-04 0.08582668 0.7584684
## C3orf17 20.79593 3 4 1.160648e-04 0.08582668 0.2774065
We see that there are total 4 genes that passed the screening test at level FDR = 0.1. All these genes are a proper subset of that identified by Type 1 screening test.
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 4 genes that are differentially expressed using function step2.Ftest()
as follows
final.res2 <- step2.Ftest(data=data, file.step1="output/ACC_uneqvar_type2.RData", alpha=0.1, method="hochberg")
## Number of significant genes: 4
## Cutoff for significant isoform is 0.0001438332
dim(final.res2)
## [1] 17 7
head(final.res2)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 RRBP1 ENST00000246043 -1.517705 0.3067864458 0.6135728915 7.191658e-05 FALSE
## 2 RRBP1 ENST00000398782 1.014987 0.9703823359 0.9703823359 1.438332e-04 FALSE
## 3 RRBP1 ENST00000495501 5.886727 0.0010257833 0.0030773498 4.794438e-05 FALSE
## 4 PRKAA1 ENST00000397128 -1.234283 0.1011816625 0.2023633251 7.191658e-05 FALSE
## 5 PRKAA1 ENST00000505783 -1.007924 0.9887322647 0.9887322647 1.438332e-04 FALSE
## 6 PRKAA1 ENST00000511248 7.337378 0.0002496545 0.0007489636 4.794438e-05 FALSE
The isoforms that are called significant are as follows
final.res2[final.res2$decision, ]
## [1] gene isoform FC step2.p.value adj.p threshold
## [7] decision
## <0 rows> (or 0-length row.names)
We see that with the standard t-test option we identified 0 isoform that is called significant.
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 4 genes that are differentially expressed using function mc.step2.Waldtest
as follows
final.res2.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/ACC_uneqvar_type2.RData", alpha=0.1, method="hochberg")
## Number of significant genes: 4
## Fitting mixed effects model. This may take a while...
## Model fitting is completed
## # Number of total genes tested: 2781
## # Number of significant genes: 4
## # Cutoff for significant isoform is 0.0001438332
## # Number of significant isoform: 4
Let’s take a look at the results returned by above command.
dim(final.res2.wald)
## [1] 17 6
head(final.res2.wald)
## gene isoform step2.p.value adj.p threshold decision
## 1 RRBP1 ENST00000246043 2.571108e-01 5.142215e-01 7.191658e-05 FALSE
## 2 RRBP1 ENST00000398782 9.653002e-01 9.653002e-01 1.438332e-04 FALSE
## 3 RRBP1 ENST00000495501 2.488224e-05 7.464671e-05 4.794438e-05 TRUE
## 4 PRKAA1 ENST00000397128 5.516000e-02 1.103200e-01 7.191658e-05 FALSE
## 5 PRKAA1 ENST00000505783 9.875736e-01 9.875736e-01 1.438332e-04 FALSE
## 6 PRKAA1 ENST00000511248 3.032114e-08 9.096341e-08 4.794438e-05 TRUE
tail(final.res2.wald)
## gene isoform step2.p.value adj.p threshold decision
## 12 TRIP12 ENST00000470302 5.758099e-01 7.749497e-01 1.068722e-04 FALSE
## 13 TRIP12 ENST00000487178 5.098594e-01 7.749497e-01 9.463154e-05 FALSE
## 14 C3orf17 ENST00000383675 2.849060e-01 6.967420e-01 5.881506e-05 FALSE
## 15 C3orf17 ENST00000472705 3.840888e-01 6.967420e-01 7.929005e-05 FALSE
## 16 C3orf17 ENST00000474311 8.501880e-07 3.400752e-06 3.595829e-05 TRUE
## 17 C3orf17 ENST00000494891 6.967420e-01 6.967420e-01 1.438332e-04 FALSE
The isoforms that are called significant are as follows
final.res2.wald[final.res2.wald$decision, ]
## gene isoform step2.p.value adj.p threshold decision
## 3 RRBP1 ENST00000495501 2.488224e-05 7.464671e-05 4.794438e-05 TRUE
## 6 PRKAA1 ENST00000511248 3.032114e-08 9.096341e-08 4.794438e-05 TRUE
## 9 TRIP12 ENST00000428959 4.191024e-08 2.933717e-07 2.054759e-05 TRUE
## 16 C3orf17 ENST00000474311 8.501880e-07 3.400752e-06 3.595829e-05 TRUE
We see that with the Wald test option we identified 4 isoforms that are called significant. These isoforms belong to 4 genes, which are
unique(final.res2.wald[final.res2.wald$decision, 1])
## [1] "RRBP1" "PRKAA1" "TRIP12" "C3orf17"
We see that these isoforms are a proper subset of that obtained based on screening test 1.
First we create the plots for the 11 genes that passed the Type 1 screening test and output the results to a pdf file ACC_figure.pdf.
plot.gene.bat(data=data, gene=rownames(sig.res1), xlim=c(-0.2, 2.3), lwd=3, isoform.id.opt=1, Phenotype="Tumor Status", file="ACC_figure.pdf", pPanel=c(2, 2), width=7, height=9, y.lab=c("Free of Cancer", "Not"))
## png
## 2
Next we make the isoform plots for the nine genes that have at least one differentially expressed isoform detected.
gene <- c("C3orf17", "FRMD4A", "FZD6", "HNRNPA2B1", "POSTN", "PRKAA1", "RRBP1", "TRIP12", "VEGFA")
par(mfrow=c(3, 3))
for(k in 1:9) dat<-plot.gene(data=data, gene=gene[k], Phenotype="Tumor Status", xlim = c(-0.05, 1.6), lwd=1, isoform.id.opt=1)
In the above figure, genes FRMD4A, POSTN, and VEGFA are typical genes that passed the Type 1 screening test but the Type 2 screening test. Form the plot of the genes we see that there is roughly a parallel pattern for each gene; the isoforms the genes are all down-regulated in patients who are free of cancer and up-regulated in patients who are not. There is no evidence for the differential splicing or isoform switches for these genes. On the other hand gene TRIP12 passed both screening tests. We see an interaction pattern, where some of the isoforms are up-regulated in patients who are free of cancer and down-regulated in patients who are not while other isoforms are opposite. There might be isoform switching events for this gene.
The only thing we need to do is to replace uneqvar
with unstr
for argument VarCov.type
in functions lrt.test.mclapply
and mc.step2.Waldtest
. Since we expect that using unstructured covariance sturcture will yield more genes that are called significant because this method suffers from serious type I error inflation. We set the significance level at FDR = 0.05
The first step is to run the screening test\(-\)the likelihood ratio test (LRT) by calling function lrt.test.mclapply()
as follows. res <- lrt.test.mclapply(data=data, VarCov.type="unstr", nCore=NULL, outfile="output/ACC_unstr_type1.RData", screen.type=1)
This was the most time-consuming part. It returned a matrix object res
which was saved in the file ACC_unstr_type1.RData
under the subfolder output. Now we load the object from the file and take a look.
load(file="output/ACC_unstr_type1.RData")
The dimension of the object is
dim(res)
## [1] 2820 6
The rows on the top are
head(res)
## lrt df nIso p.value FDR Time
## CFH 0.4919436 2 2 0.78194429 0.8965698 0.2092488
## KRIT1 1.6988124 2 2 0.42766881 0.6961705 0.1414852
## CD99 1.5683330 2 2 0.45650004 0.7090252 0.1826291
## M6PR 11.5435983 5 5 0.04160597 0.2957710 1.6749692
## ICA1 8.1432489 3 3 0.04314212 0.3006032 0.5308003
## CFLAR 8.6364768 7 7 0.27981947 0.5826950 4.2954025
and on the bottom are
tail(res)
## lrt df nIso p.value FDR Time
## RP11-156P1.3 1.260997 2 2 0.5323263 0.7545544 0.06314778
## RPL17 8.561647 5 5 0.1278787 0.4372829 0.85537767
## STRADA 3.006061 3 3 0.3906916 0.6713965 0.28347087
## RP1-178F10.3 2.487001 2 2 0.2883729 0.5900335 0.05041909
## ZNF587B 2.126251 2 2 0.3453747 0.6408724 0.09667969
## RP11-18I14.10 1.179368 2 2 0.5545026 0.7673779 0.07717490
Each row of the matrix gives the test result for each gene and the row names indicate the names (IDs) of the genes. The columns from left to right are LRT statistic (lrt
), degree of freedom (df
), Number of isoforms (nIso
), p-values of LRT (p.value
), FDR (FDR
) and time used for fitting the linear mixed effects model in second (Time
). Note that there may be missing results for some genes because the corresponding linear mixed models procedures did not converge. Let’s see how many such genes there are.
p <- which(is.na(res[, 1]))
length(p)
## [1] 12
We see that there are 12 (0.43%) genes whose linear mixed effects models did not converge. The names of these genes are
rownames(res)[p]
## [1] "PABPC1" "DDX5" "EIF4G2" "PRR4" "HDLBP" "STX16" "RBM39" "DST" "HNRNPH1"
## [10] "CHD2" "DMBT1" "PCBP2"
Next let’s retain only those called significant at level FDR = 0.05 and assign the results in a new object called sig.res1.unstr
sig.res1.unstr <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
sig.res1.unstr
## lrt df nIso p.value FDR Time
## TFPI 17.55591 3 3 5.430494e-04 4.121304e-02 0.46739388
## RALA 14.86930 2 2 5.904364e-04 4.135785e-02 0.07815695
## CSDE1 40.34346 11 11 3.124477e-05 1.096691e-02 37.36043525
## DCN 30.21382 9 9 4.034372e-04 3.540162e-02 12.47401571
## MATR3 38.27614 11 11 7.026838e-05 1.351868e-02 44.02841592
## CD44 48.49410 12 12 2.563610e-06 1.799654e-03 64.03558469
## BCLAF1 30.62666 7 7 7.284432e-05 1.351868e-02 6.17670083
## CTNNA1 38.63362 11 11 6.112788e-05 1.351868e-02 38.18209171
## EPB41L2 38.91635 9 9 1.192761e-05 5.582120e-03 13.78606319
## XPO1 38.07973 11 11 7.584898e-05 1.351868e-02 36.04494047
## CTTN 23.83046 6 6 5.611140e-04 4.135785e-02 2.63145041
## RPLP0 23.59008 6 6 6.211229e-04 4.135785e-02 2.19417548
## HECTD1 34.11522 12 12 6.468887e-04 4.135785e-02 66.54357004
## SUPT16H 23.36264 4 4 1.071596e-04 1.583706e-02 0.64759159
## MYL6 36.50017 8 8 1.421863e-05 5.703700e-03 7.60929179
## RSRC2 35.54355 7 7 8.837860e-06 4.963342e-03 4.24996257
## VEGFA 18.09681 2 2 1.175786e-04 1.650804e-02 0.08749151
## LIFR 16.93216 3 3 7.298140e-04 4.269412e-02 0.28256750
## TXNIP 21.25429 5 5 7.251620e-04 4.269412e-02 1.69173169
## CD46 21.29808 5 5 7.114555e-04 4.269412e-02 1.24898028
## HNRNPA2B1 28.14309 6 6 8.830796e-05 1.458640e-02 2.24245334
## CALD1 26.13189 7 7 4.770543e-04 3.995906e-02 4.72571683
## RRBP1 16.67807 3 3 8.230742e-04 4.716719e-02 0.32654810
## SAT1 28.63596 7 7 1.685532e-04 1.972072e-02 4.78907704
## RPL36 14.73562 2 2 6.312475e-04 4.135785e-02 0.08538485
## PRKAA1 18.31728 3 3 3.783044e-04 3.426705e-02 0.27591324
## DSC2 14.10812 2 2 8.638945e-04 4.851632e-02 0.08670640
## MDM2 31.34720 7 7 5.364311e-05 1.351868e-02 5.73842359
## SEC31A 34.36643 9 9 7.702951e-05 1.351868e-02 12.84965777
## TPM1 49.23749 12 12 1.900840e-06 1.779186e-03 77.00821447
## NPC1 24.73678 4 4 5.682708e-05 1.351868e-02 0.64532685
## HDGF 20.35293 3 3 1.434237e-04 1.751016e-02 0.28810167
## PDCD4 25.94381 7 7 5.154447e-04 4.020468e-02 3.31925941
## SAR1B 19.63531 3 3 2.019939e-04 2.208973e-02 0.26707196
## C3orf17 19.31513 4 4 6.814449e-04 4.252216e-02 0.66901827
## CCNL1 27.94532 8 8 4.847572e-04 3.995906e-02 5.39370370
## FZD6 15.78972 2 2 3.726543e-04 3.426705e-02 0.09691691
## NDRG2 33.64118 10 10 2.124012e-04 2.208973e-02 25.49065089
## ASXL1 20.56654 3 3 1.295116e-04 1.721234e-02 0.23151445
## RPLP2 22.86310 4 4 1.348545e-04 1.721234e-02 0.84775782
## NPM1 56.73750 11 11 3.718940e-08 5.221392e-05 27.47847319
## ANXA2 29.06282 9 9 6.324180e-04 4.135785e-02 13.86661887
## SF3B3 17.73817 3 3 4.980652e-04 3.995906e-02 0.26519942
## TMEM63A 27.84423 6 6 1.005263e-04 1.568210e-02 1.66032028
## SUPT5H 19.13105 3 3 2.568598e-04 2.575937e-02 0.21394444
## COL27A1 17.18301 3 3 6.480574e-04 4.135785e-02 0.18095660
## STK39 19.59480 3 3 2.059333e-04 2.208973e-02 0.20194840
## SDHD 22.22055 3 3 5.868893e-05 1.351868e-02 0.17672563
## GNB2L1 68.18185 12 12 6.998713e-10 1.965239e-06 45.70985651
## CCPG1 20.76096 4 4 3.531628e-04 3.419590e-02 0.73050785
We see that there are total 50 genes that passed the screening test at level FDR = 0.05. Let’s sort these genes by the p-values.
sig.res1.unstr[order(sig.res1.unstr[, "p.value"]), ]
## lrt df nIso p.value FDR Time
## GNB2L1 68.18185 12 12 6.998713e-10 1.965239e-06 45.70985651
## NPM1 56.73750 11 11 3.718940e-08 5.221392e-05 27.47847319
## TPM1 49.23749 12 12 1.900840e-06 1.779186e-03 77.00821447
## CD44 48.49410 12 12 2.563610e-06 1.799654e-03 64.03558469
## RSRC2 35.54355 7 7 8.837860e-06 4.963342e-03 4.24996257
## EPB41L2 38.91635 9 9 1.192761e-05 5.582120e-03 13.78606319
## MYL6 36.50017 8 8 1.421863e-05 5.703700e-03 7.60929179
## CSDE1 40.34346 11 11 3.124477e-05 1.096691e-02 37.36043525
## MDM2 31.34720 7 7 5.364311e-05 1.351868e-02 5.73842359
## NPC1 24.73678 4 4 5.682708e-05 1.351868e-02 0.64532685
## SDHD 22.22055 3 3 5.868893e-05 1.351868e-02 0.17672563
## CTNNA1 38.63362 11 11 6.112788e-05 1.351868e-02 38.18209171
## MATR3 38.27614 11 11 7.026838e-05 1.351868e-02 44.02841592
## BCLAF1 30.62666 7 7 7.284432e-05 1.351868e-02 6.17670083
## XPO1 38.07973 11 11 7.584898e-05 1.351868e-02 36.04494047
## SEC31A 34.36643 9 9 7.702951e-05 1.351868e-02 12.84965777
## HNRNPA2B1 28.14309 6 6 8.830796e-05 1.458640e-02 2.24245334
## TMEM63A 27.84423 6 6 1.005263e-04 1.568210e-02 1.66032028
## SUPT16H 23.36264 4 4 1.071596e-04 1.583706e-02 0.64759159
## VEGFA 18.09681 2 2 1.175786e-04 1.650804e-02 0.08749151
## ASXL1 20.56654 3 3 1.295116e-04 1.721234e-02 0.23151445
## RPLP2 22.86310 4 4 1.348545e-04 1.721234e-02 0.84775782
## HDGF 20.35293 3 3 1.434237e-04 1.751016e-02 0.28810167
## SAT1 28.63596 7 7 1.685532e-04 1.972072e-02 4.78907704
## SAR1B 19.63531 3 3 2.019939e-04 2.208973e-02 0.26707196
## STK39 19.59480 3 3 2.059333e-04 2.208973e-02 0.20194840
## NDRG2 33.64118 10 10 2.124012e-04 2.208973e-02 25.49065089
## SUPT5H 19.13105 3 3 2.568598e-04 2.575937e-02 0.21394444
## CCPG1 20.76096 4 4 3.531628e-04 3.419590e-02 0.73050785
## FZD6 15.78972 2 2 3.726543e-04 3.426705e-02 0.09691691
## PRKAA1 18.31728 3 3 3.783044e-04 3.426705e-02 0.27591324
## DCN 30.21382 9 9 4.034372e-04 3.540162e-02 12.47401571
## CALD1 26.13189 7 7 4.770543e-04 3.995906e-02 4.72571683
## CCNL1 27.94532 8 8 4.847572e-04 3.995906e-02 5.39370370
## SF3B3 17.73817 3 3 4.980652e-04 3.995906e-02 0.26519942
## PDCD4 25.94381 7 7 5.154447e-04 4.020468e-02 3.31925941
## TFPI 17.55591 3 3 5.430494e-04 4.121304e-02 0.46739388
## CTTN 23.83046 6 6 5.611140e-04 4.135785e-02 2.63145041
## RALA 14.86930 2 2 5.904364e-04 4.135785e-02 0.07815695
## RPLP0 23.59008 6 6 6.211229e-04 4.135785e-02 2.19417548
## RPL36 14.73562 2 2 6.312475e-04 4.135785e-02 0.08538485
## ANXA2 29.06282 9 9 6.324180e-04 4.135785e-02 13.86661887
## HECTD1 34.11522 12 12 6.468887e-04 4.135785e-02 66.54357004
## COL27A1 17.18301 3 3 6.480574e-04 4.135785e-02 0.18095660
## C3orf17 19.31513 4 4 6.814449e-04 4.252216e-02 0.66901827
## CD46 21.29808 5 5 7.114555e-04 4.269412e-02 1.24898028
## TXNIP 21.25429 5 5 7.251620e-04 4.269412e-02 1.69173169
## LIFR 16.93216 3 3 7.298140e-04 4.269412e-02 0.28256750
## RRBP1 16.67807 3 3 8.230742e-04 4.716719e-02 0.32654810
## DSC2 14.10812 2 2 8.638945e-04 4.851632e-02 0.08670640
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 50 genes that are differentially expressed using function step2.Ftest()
as follows
final.res1.unstr <- step2.Ftest(data=data, file.step1="output/ACC_unstr_type1.RData", alpha=0.05, method="hochberg")
## Number of significant genes: 50
## Cutoff for significant isoform is 0.0008903134
The function returned the results of the confirmatory test for all the isoforms of the above 50 gene in a data.frame object called final.res1.unstr
. Let’s take a look at the results.
dim(final.res1.unstr)
## [1] 307 7
head(final.res1.unstr)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 TFPI ENST00000233156 1.230777 0.7349773654 0.921281355 0.0007102718 FALSE
## 2 TFPI ENST00000437725 -10.381846 0.0008045526 0.002413658 0.0002967711 FALSE
## 3 TFPI ENST00000453013 1.078937 0.9212813552 0.921281355 0.0008903134 FALSE
## 4 RALA ENST00000005257 1.703897 0.0010162639 0.002032528 0.0004451567 FALSE
## 5 RALA ENST00000466491 2.198932 0.2974163689 0.297416369 0.0008903134 FALSE
## 6 CSDE1 ENST00000261443 1.580196 0.4497199931 0.738619676 0.0005420811 FALSE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and descison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE. The isoforms that are called significant are as follows
final.res1.unstr[final.res1.unstr$decision, ]
## gene isoform FC step2.p.value adj.p threshold decision
## 146 HNRNPA2B1 ENST00000356674 1.682557 5.628982e-05 0.0003377389 0.0001483856 TRUE
## 172 PRKAA1 ENST00000511248 7.337378 2.496545e-04 0.0007489636 0.0002967711 TRUE
## 229 CCNL1 ENST00000474539 -2.308505 1.002999e-04 0.0008023989 0.0001112892 TRUE
## 233 FZD6 ENST00000522566 8.590236 3.271466e-04 0.0006542932 0.0004451567 TRUE
We see that with the standard t-test option we identified 4 isoforms that are called significant.
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 50 genes that are differentially expressed using function mc.step2.Waldtest
and save the object final.res1.wald.unstr
returned by the function in a file called ACC_unstr_type1_step2_Wald_Hoch.RData
under subfolder output
final.res1.wald.unstr <- mc.step2.Waldtest(data=data, file.step1="output/ACC_unstr_type1.RData", VarCov.type="unstr", alpha=0.05, method="hochberg", nCore=NULL)
save(final.res1.wald.unstr, file="output/ACC_unstr_type1_step2_Wald_Hoch_05.RData")
Now let’s load the object final.res1.wald.unstr
from the file and take a look.
load(file="output/ACC_unstr_type1_step2_Wald_Hoch_05.RData")
dim(final.res1.wald.unstr)
## [1] 307 6
head(final.res1.wald.unstr)
## gene isoform step2.p.value adj.p threshold decision
## 1 TFPI ENST00000233156 7.289780e-01 9.196149e-01 0.0007057507 FALSE
## 2 TFPI ENST00000437725 8.927690e-06 2.678307e-05 0.0002967711 TRUE
## 3 TFPI ENST00000453013 9.196149e-01 9.196149e-01 0.0008903134 FALSE
## 4 RALA ENST00000005257 1.642580e-05 3.285160e-05 0.0004451567 TRUE
## 5 RALA ENST00000466491 2.760240e-01 2.760240e-01 0.0008903134 FALSE
## 6 CSDE1 ENST00000261443 4.345778e-01 7.327135e-01 0.0005280515 FALSE
tail(final.res1.wald.unstr)
## gene isoform step2.p.value adj.p threshold decision
## 302 GNB2L1 ENST00000514183 5.894023e-02 6.483425e-01 8.093758e-05 FALSE
## 303 GNB2L1 ENST00000514455 1.450059e-01 9.720775e-01 1.328091e-04 FALSE
## 304 CCPG1 ENST00000310958 2.598792e-06 1.039517e-05 2.225783e-04 TRUE
## 305 CCPG1 ENST00000442196 6.949804e-01 7.982639e-01 7.751201e-04 FALSE
## 306 CCPG1 ENST00000568808 7.982639e-01 7.982639e-01 8.903134e-04 FALSE
## 307 CCPG1 ENST00000569205 5.650142e-01 7.982639e-01 6.301671e-04 FALSE
The isoforms that are called significant are as follows
final.res1.wald.unstr[final.res1.wald.unstr$decision, ]
## gene isoform step2.p.value adj.p threshold decision
## 2 TFPI ENST00000437725 8.927690e-06 2.678307e-05 2.967711e-04 TRUE
## 4 RALA ENST00000005257 1.642580e-05 3.285160e-05 4.451567e-04 TRUE
## 62 CTNNA1 ENST00000521387 7.100130e-08 7.810143e-07 8.093758e-05 TRUE
## 131 VEGFA ENST00000497139 4.516121e-06 9.032242e-06 4.451567e-04 TRUE
## 132 LIFR ENST00000263409 8.797806e-07 2.639342e-06 2.967711e-04 TRUE
## 146 HNRNPA2B1 ENST00000356674 1.320035e-09 7.920210e-09 1.483856e-04 TRUE
## 160 RRBP1 ENST00000495501 1.682240e-05 5.046719e-05 2.967711e-04 TRUE
## 172 PRKAA1 ENST00000511248 2.918382e-07 8.755147e-07 2.967711e-04 TRUE
## 174 DSC2 ENST00000280904 1.938294e-05 3.876588e-05 4.451567e-04 TRUE
## 184 SEC31A ENST00000505479 6.384223e-05 5.745801e-04 9.892371e-05 TRUE
## 205 NPC1 ENST00000590723 1.044331e-06 4.177324e-06 2.225783e-04 TRUE
## 222 C3orf17 ENST00000474311 2.111586e-06 8.446344e-06 2.225783e-04 TRUE
## 229 CCNL1 ENST00000474539 1.251432e-08 1.001146e-07 1.112892e-04 TRUE
## 233 FZD6 ENST00000522566 6.814898e-07 1.362980e-06 4.451567e-04 TRUE
## 281 SUPT5H ENST00000600818 4.554611e-06 1.366383e-05 2.967711e-04 TRUE
## 284 COL27A1 ENST00000451716 8.943306e-07 2.682992e-06 2.967711e-04 TRUE
## 286 STK39 ENST00000355999 6.958571e-07 2.087571e-06 2.967711e-04 TRUE
## 304 CCPG1 ENST00000310958 2.598792e-06 1.039517e-05 2.225783e-04 TRUE
We see that with the Wald test option we identified 18 isoforms that are called significant. These isoforms belong to 18 genes, which are
unique(final.res1.wald.unstr[final.res1.wald.unstr$decision, 1])
## [1] "TFPI" "RALA" "CTNNA1" "VEGFA" "LIFR" "HNRNPA2B1" "RRBP1" "PRKAA1"
## [9] "DSC2" "SEC31A" "NPC1" "C3orf17" "CCNL1" "FZD6" "SUPT5H" "COL27A1"
## [17] "STK39" "CCPG1"
Now we check if the isoforms identified by t-test are elements of that identified by the Wald test as follows
is.element(final.res1.unstr[final.res1.unstr$decision, 2], final.res1.wald.unstr[final.res1.wald.unstr$decision, 2])
## [1] TRUE TRUE TRUE TRUE
We see that the isoforms identified by t-test are a proper subset of that identified by the Wald test.
The procedure for the analysis based on Type 2 screening test is the same as that based on Type 1 test except that the argument screen.type
of function lrt.test.mclapply
needs to be specified as 2 rather than 1. We perform the Type 2 screening test by submitting following command:
res <- lrt.test.mclapply(data=data, VarCov.type="unstr", nCore=NULL, outfile="output/ACC_unstr_type2.RData", screen.type=2)
Let’s take a look at the results
load(file="output/ACC_unstr_type2.RData")
The dimension of the object is
dim(res)
## [1] 2820 6
The rows on the top are
head(res)
## lrt df nIso p.value FDR Time
## CFH 0.0005488067 1 2 0.98130997 0.9951395 0.1339359
## KRIT1 0.1293231950 1 2 0.71913481 0.9289952 0.2619956
## CD99 0.4625576194 1 2 0.49643109 0.8553886 0.1218066
## M6PR 7.4279294055 4 5 0.11492961 0.5745622 1.7152686
## ICA1 7.3514875354 2 3 0.02533056 0.4197706 0.5095682
## CFLAR 5.0603373219 6 7 0.53609805 0.8707348 4.5903389
and on the bottom are
tail(res)
## lrt df nIso p.value FDR Time
## RP11-156P1.3 0.2502671 1 2 0.6168871 0.8902322 0.06278110
## RPL17 6.2179248 4 5 0.1834539 0.6653167 0.56409240
## STRADA 2.8060809 2 3 0.2458483 0.7082911 0.27607942
## RP1-178F10.3 1.7910057 1 2 0.1808037 0.6634195 0.04237771
## ZNF587B 0.3452667 1 2 0.5568049 0.8728023 0.10422730
## RP11-18I14.10 1.1763892 1 2 0.2780923 0.7328160 0.09242225
Let’s check how many genes with missing results.
p <- which(is.na(res[, 1]))
length(p)
## [1] 13
We see that there are 13 (0.46%) genes whose linear mixed effects models did not converge. The names of these genes are
rownames(res)[p]
## [1] "PABPC1" "DDX5" "EIF4G2" "PRR4" "HDLBP" "STX16" "RBM39" "TPM1" "DST"
## [10] "HNRNPH1" "CHD2" "DMBT1" "PCBP2"
Next let’s retain only those called significant at level FDR = 0.05 and assign the results in a new object called sig.res2.unstr
sig.res2.unstr <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
sig.res2.unstr
## lrt df nIso p.value FDR Time
## CD44 46.14026 11 12 3.053940e-06 0.0028574695 65.6715877
## BCLAF1 30.45665 6 7 3.217994e-05 0.0225822703 5.8314250
## PRKAA1 18.07153 2 3 1.190742e-04 0.0470583239 0.2567837
## NDRG2 32.98947 9 10 1.341171e-04 0.0470583239 25.2666283
## NPM1 51.81913 10 11 1.232546e-07 0.0001799886 29.8166714
## TMEM63A 26.60506 5 6 6.808432e-05 0.0331858408 1.6199515
## SUPT5H 19.10749 2 3 7.093518e-05 0.0331858408 0.2028501
## GNB2L1 53.80250 11 12 1.282427e-07 0.0001799886 46.1234529
We see that there are total 8 genes that passed the screening test at level FDR = 0.05. Let’s check if these genes are a proper subset of that identified by Type 1 screening test.
is.element(rownames(sig.res2.unstr), rownames(sig.res2.unstr))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
All these genes are a proper subset of that identified by Type 1 screening test.
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 8 genes that are differentially expressed using function step2.Ftest()
as follows
final.res2.unstr <- step2.Ftest(data=data, file.step1="output/ACC_unstr_type2.RData", alpha=0.05, method="hochberg")
## Number of significant genes: 8
## Cutoff for significant isoform is 0.0001425009
dim(final.res2.unstr)
## [1] 64 7
head(final.res2.unstr)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 CD44 ENST00000263398 -1.523227 0.01701411 0.2041693 1.187507e-05 FALSE
## 2 CD44 ENST00000278385 1.147836 0.70541585 0.9339563 1.076307e-04 FALSE
## 3 CD44 ENST00000278386 -1.571053 0.56068426 0.9339563 8.554791e-05 FALSE
## 4 CD44 ENST00000525241 1.625292 0.51080786 0.9339563 7.793788e-05 FALSE
## 5 CD44 ENST00000525293 1.032489 0.89953813 0.9339563 1.372495e-04 FALSE
## 6 CD44 ENST00000525469 -1.175661 0.84067219 0.9339563 1.282678e-04 FALSE
The isoforms that are called significant are as follows
final.res2.unstr[final.res2.unstr$decision, ]
## [1] gene isoform FC step2.p.value adj.p threshold
## [7] decision
## <0 rows> (or 0-length row.names)
We see that with the standard t-test option we identified 0 isoform that is called significant.
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 8 genes that are differentially expressed using function mc.step2.Waldtest
and save the object final.res2.wald.unstr
returned by the function in a file called ACC_unstr_type2_step2_Wald_Hoch.RData
under subfolder output
final.res2.wald.unstr <- mc.step2.Waldtest(data=data, file.step1="output/ACC_unstr_type2.RData", VarCov.type="unstr", alpha=0.05, method="hochberg", nCore=NULL)
save(final.res2.wald.unstr, file="output/ACC_unstr_type2_step2_Wald_Hoch_05.RData")
Now let’s load the object final.res2.wald.unstr
from the file and take a look.
load(file="output/ACC_unstr_type2_step2_Wald_Hoch_05.RData")
dim(final.res2.wald.unstr)
## [1] 64 6
head(final.res2.wald.unstr)
## gene isoform step2.p.value adj.p threshold decision
## 1 CD44 ENST00000263398 0.00563439 0.06761268 1.187507e-05 FALSE
## 2 CD44 ENST00000278385 0.69863665 0.93256076 1.067559e-04 FALSE
## 3 CD44 ENST00000278386 0.54955922 0.93256076 8.397595e-05 FALSE
## 4 CD44 ENST00000525241 0.49795166 0.93256076 7.609001e-05 FALSE
## 5 CD44 ENST00000525293 0.89740197 0.93256076 1.371284e-04 FALSE
## 6 CD44 ENST00000525469 0.83722791 0.93256076 1.279335e-04 FALSE
tail(final.res2.wald.unstr)
## gene isoform step2.p.value adj.p threshold decision
## 59 GNB2L1 ENST00000508963 0.75646753 0.9720775 1.108937e-04 FALSE
## 60 GNB2L1 ENST00000509148 0.24040706 0.9720775 3.524227e-05 FALSE
## 61 GNB2L1 ENST00000511900 0.44812396 0.9720775 6.569236e-05 FALSE
## 62 GNB2L1 ENST00000513060 0.16121923 0.9720775 2.363380e-05 FALSE
## 63 GNB2L1 ENST00000514183 0.05894023 0.6483425 1.295463e-05 FALSE
## 64 GNB2L1 ENST00000514455 0.14500595 0.9720775 2.125703e-05 FALSE
The isoforms that are called significant are as follows
final.res2.wald.unstr[final.res2.wald.unstr$decision, ]
## gene isoform step2.p.value adj.p threshold decision
## 22 PRKAA1 ENST00000511248 2.918382e-07 8.755147e-07 4.75003e-05 TRUE
## 51 SUPT5H ENST00000600818 4.554611e-06 1.366383e-05 4.75003e-05 TRUE
We see that with the Wald test option we identified 2 isoforms that are called significant. These isoforms belong to 2 genes, which are
unique(final.res2.wald.unstr[final.res2.wald.unstr$decision, 1])
## [1] "PRKAA1" "SUPT5H"