\(*\) Correspondence should be addressed to Huining Kang (HuKang@salud.unm.edu).

Introduction

We recently developed a new statistical method for analyzing differentially expressed/spliced transcript isoforms, which appropriately account for the dependence between isoforms and multiple testing corrections for the multi-dimensional structure of at both the gene-and isoform-level. We proposed a linear mixed effects model-based approach for analyzing the complex alternative RNA splicing regulation patterns detected by whole-transcriptome RNA-Sequencing technologies. This approach thoroughly characterizes and differentiates three types of genes related to alternative RNA splicing events with distinct differential expression/splicing patterns. We applied the concept of appropriately controlling for the gene-level overall false discovery rate (OFDR) in this multi-dimensional alternative RNA splicing analysis utilizing a two-step hierarchical hypothesis testing framework. In the initial screening test we identify genes that have differentially expressed or spliced isoforms; in the subsequent confirmatory testing stage we examine only the isoforms of genes that have passed the screening tests. The methodology is described in detail in the manuscript titled Two-step mixed model approach to analyzing differential alternative RNA splicing submitted to PLOS ONE. We have provided an R program file (Tools.R) that included all the functions for performing the analysis with the proposed method. The R program file can be downloaded here. We also wrote a Tutorial to show step-by-step how to use these functions in R to perform the analyses that we have proposed in our manuscript.

In the manuscript we have reported an application of our method to an RNA-sequencing study of salivary gland adenoid cystic carcinoma (ACC). We adhere to the principles of transparency in order to assure that our results are reproducible and present in this document the detailed steps and R codes for the analysis. The major steps of the analysis for this application are the same as we presented in our Tutorial except that the dataset is different. We strongly encourage readers to read the Tutorial before reading this document.

In the original study conducted by (Brayer, et. al. 2016) the RNA-sequencing was performed on 20 ACC salivary gland tumors and five normal salivary glands. We performed differential isoform expression/splicing analysis for a subset of the tumor data consisting of 8 ACC patients who are free of cancer vs. 6 patients who are not among whom have complete follow-up. Cufflinks was used to estimate and quantify the isoform abundance in FPKM. There are 2850 genes that have at least two detected isoforms. This number reduced to 2820 when we restricted to the 14 patents. We converted the data set to an R list object named data in the format described in the Tutorial. This data set is available per request.

Performing Analyses Assuming Compound Symmetry with Unequal Variances for Covariance Structure

As suggested in our Tutorial we created a folder called ACC as our working directory and under this folder we created two subfolders data and output. Aforementioned R object data is saved in a file named acc.RData in the subfolder data.

In the following (our primary analysis for ACC data) we assumed the compound symmetry with unequal variances as the covariance structure and we set the statistical significance at OFDR = 0.10 due to the exploratory nature of this study where the sample sizes and effect sizes were limited. Before starting our analysis, let’s take a quick look at the data object. The following commands will load this object into the memory and show the elements of the object.

# Load data to the memory
load(file="data/acc.RData")
# list the elements of object data
str(data)

## List of 6
##  $ x      : num [1:8847, 1:14] 2.62 1.46 3.52 3.71 3.98 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8847] "ENST00000367429" "ENST00000466229" "ENST00000445516" "ENST00000487168" ...
##   .. ..$ : NULL
##  $ y      : num [1:14] 1 1 1 2 2 1 2 2 1 2 ...
##  $ gene_id: chr [1:8847] "ENSG00000000971" "ENSG00000000971" "ENSG00000001631" "ENSG00000001631" ...
##  $ gene   : chr [1:8847] "CFH" "CFH" "KRIT1" "KRIT1" ...
##  $ isoform: chr [1:8847] "ENST00000367429" "ENST00000466229" "ENST00000445516" "ENST00000487168" ...
##  $ y.lab  : chr [1:2] "Free of Cancer" "Not Free of Cancer"

We see that the data format/structure is exactly the same as that of the data object in our tutorial. The total number of isoforms is 8847 and the total number of genes is as follows

length(unique(data$gene))

## [1] 2820

The function find.nL() defined in file Tools.R can be used to calculate the number of isoforms for each gene. It returns a numeric vector with each element being the number of isoforms of one gene and the names of the elements being gene names.

source("Tools.R")
nL <- find.nL(data=data)

The numbers of isoforms for the first 5 genes are

head(nL, n=5)

##   CFH KRIT1  CD99  M6PR  ICA1 
##     2     2     2     5     3

The numbers of isoforms for the last 5 genes are

tail(nL, n=5)

##         RPL17        STRADA  RP1-178F10.3       ZNF587B RP11-18I14.10 
##             5             3             2             2             2

We can summarize all the numbers as follows:

sum.nL <- table(nL)
sum.nL

## nL
##    2    3    4    5    6    7    8    9   10   11   12   13   14   15   17 
## 1402  674  317  182   98   61   35   17   10    8    8    5    1    1    1

The result indicates that there are 1402 genes that have 2 isoforms for each, 674 genes that have 3 isoforms for each, and so on so forth.

Analysis based on Type 1 screening test

Performing step 1 test

As described in our Tutorial, the first step is to run the screening test\(-\)the likelihood ratio test (LRT) by calling function lrt.test.mclapply(). We proceed this step by submitting following command:

res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", nCore=NULL, outfile="output/ACC_uneqvar_type1.RData", screen.type=1)

This was the most time-consuming part. It returned a matrix object res which was saved in the file ACC_uneqvar_type1.RData under the subfolder output. Now we load the object from the file and take a look.

load(file="output/ACC_uneqvar_type1.RData")

The dimension of the object is

dim(res)

## [1] 2820    6

The rows on the top are

head(res)

##             lrt df nIso   p.value       FDR      Time
## CFH   0.4919436  2    2 0.7819443 0.9163829 0.3518181
## KRIT1 1.1450537  2    2 0.5640983 0.8071390 0.4742093
## CD99  1.6574952  2    2 0.4365957 0.7445243 0.5504806
## M6PR  8.4910369  5    5 0.1311697 0.5263642 0.9330542
## ICA1  4.1329972  3    3 0.2474563 0.6249283 0.4287062
## CFLAR 5.8691039  7    7 0.5551140 0.8049110 1.1064849

and on the bottom are

tail(res)

##                    lrt df nIso    p.value       FDR       Time
## RP11-156P1.3  1.260997  2    2 0.53232632 0.7919478 0.06369615
## RPL17         9.902963  5    5 0.07803189 0.4582985 0.50190425
## STRADA        3.526793  3    3 0.31730378 0.6749765 0.13856792
## RP1-178F10.3  2.487001  2    2 0.28837294 0.6575270 0.05994153
## ZNF587B       2.097044  2    2 0.35045533 0.6991489 0.29776192
## RP11-18I14.10 1.534545  2    2 0.46427766 0.7525246 0.23661232

Each row of the matrix gives the test result for each gene and the row names indicate the names (IDs) of the genes. The columns from left to right are LRT statistic (lrt), degree of freedom (df), Number of isoforms (nIso), p-values of LRT (p.value), FDR (FDR) and time used for fitting the linear mixed effects model in second (Time). Note that there might be missing results for some genes because the corresponding linear mixed models procedures did not converge. Let’s see how many such genes there are.

p <- which(is.na(res[, 1]))
length(p)

## [1] 37

We see that there are 37 (1.31%) genes whose linear mixed effects models did not converge. The names of these genes are

rownames(res)[p]

##  [1] "SEC62"       "RORA"        "DOCK9"       "FKBP5"       "TNRC6B"      "FAM208B"    
##  [7] "MED13"       "EIF4G2"      "SPARC"       "CLK4"        "USP34"       "MED13L"     
## [13] "STX16"       "C3"          "ATF4"        "ERCC5"       "IL6"         "NPEPPS"     
## [19] "GALNT1"      "SMAD4"       "LYST"        "OSMR"        "OGT"         "UTRN"       
## [25] "RPL30"       "CTNNB1"      "TPT1-AS1"    "AHSA2"       "RPL4"        "PPP1R2"     
## [31] "SESTD1"      "DMBT1"       "MB"          "GNB2L1"      "PRB4"        "RP11-37B2.1"
## [37] "CCPG1"

Next let’s retain only those called significant at level FDR = 0.1 and assign the results in a new object called sig.res1

sig.res1 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.1, ]
sig.res1

##                lrt df nIso      p.value        FDR      Time
## GNPTAB    23.86920  5    5 2.300448e-04 0.09145925 0.4770377
## VEGFA     18.07379  2    2 1.189393e-04 0.09145925 0.2424715
## HNRNPA2B1 29.81071  6    6 4.270285e-05 0.05942102 0.8640647
## RRBP1     18.52259  3    3 3.431237e-04 0.09366467 0.1906805
## PRKAA1    19.51404  3    3 2.140158e-04 0.09145925 0.4196160
## POSTN     31.05287  5    5 9.144749e-06 0.02544984 0.3757513
## FRMD4A    22.40986  4    4 1.660719e-04 0.09145925 0.3195240
## TRIP12    28.24006  7    7 1.989194e-04 0.09145925 0.8803523
## C3orf17   21.09291  4    4 3.035165e-04 0.09366467 0.3033209
## FZD6      15.80285  2    2 3.702161e-04 0.09366467 0.2177060
## ASXL1     18.77488  3    3 3.043197e-04 0.09366467 0.2043502

We see that there is a total of 11 genes that passed the screening test at level FDR = 0.1. Let’s sort these genes by the p-values.

sig.res1[order(sig.res1[, "p.value"]), ]

##                lrt df nIso      p.value        FDR      Time
## POSTN     31.05287  5    5 9.144749e-06 0.02544984 0.3757513
## HNRNPA2B1 29.81071  6    6 4.270285e-05 0.05942102 0.8640647
## VEGFA     18.07379  2    2 1.189393e-04 0.09145925 0.2424715
## FRMD4A    22.40986  4    4 1.660719e-04 0.09145925 0.3195240
## TRIP12    28.24006  7    7 1.989194e-04 0.09145925 0.8803523
## PRKAA1    19.51404  3    3 2.140158e-04 0.09145925 0.4196160
## GNPTAB    23.86920  5    5 2.300448e-04 0.09145925 0.4770377
## C3orf17   21.09291  4    4 3.035165e-04 0.09366467 0.3033209
## ASXL1     18.77488  3    3 3.043197e-04 0.09366467 0.2043502
## RRBP1     18.52259  3    3 3.431237e-04 0.09366467 0.1906805
## FZD6      15.80285  2    2 3.702161e-04 0.09366467 0.2177060

Performing step 2 test using the standard t-test

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 11 genes that are differentially expressed using function step2.Ftest() as follows

final.res1 <- step2.Ftest(data=data, file.step1="output/ACC_uneqvar_type1.RData", alpha=0.1, method="hochberg")

## Number of significant genes: 11 
## Cutoff for significant isoform is 0.0003952569

Note that there is an argument called method in function step2.Ftest and the following functon mc.step2.Waldtest, which is the options for controlling the family-wise error rate required by our two step procedure. The options are methods of Bonferroni (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”) and Hommel (“hommel”). We use Hochberg option throughout this document. The function returns the results of the confirmatory test for all the isoforms of the above 11 genes in a data.frame object called final.res1. Let’s take a look at the results.

dim(final.res1)

## [1] 44  7

head(final.res1)

##     gene         isoform        FC step2.p.value       adj.p    threshold decision
## 1 GNPTAB ENST00000299314 -1.424071   0.003505942 0.017529709 7.905138e-05    FALSE
## 2 GNPTAB ENST00000549165  2.320880   0.023427021 0.093708082 9.881423e-05    FALSE
## 3 GNPTAB ENST00000549194 -2.002233   0.047714461 0.097372530 1.936837e-04    FALSE
## 4 GNPTAB ENST00000552009 -2.072173   0.162558101 0.162558101 3.952569e-04    FALSE
## 5 GNPTAB ENST00000552681 -1.375782   0.048686265 0.097372530 1.976285e-04    FALSE
## 6  VEGFA ENST00000372067  4.650730   0.009558154 0.009558154 3.952569e-04    FALSE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and decison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE. The isoforms that are called significant are as follows

final.res1[final.res1$decision, ]

##        gene         isoform       FC step2.p.value        adj.p    threshold decision
## 9 HNRNPA2B1 ENST00000356674 1.682557  5.628982e-05 0.0003377389 6.587615e-05     TRUE

We see that with the standard t-test option we identified 1 isoforms that are called significant.

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 11 genes that are differentially expressed using function mc.step2.Waldtest as follows

final.res1.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/ACC_uneqvar_type1.RData", alpha=0.1, method="hochberg")

## Number of significant genes: 11 
## Fitting mixed effects model. This may take a while...
## Model fitting is completed
## # Number of total genes tested: 2783 
## # Number of significant genes: 11 
## # Cutoff for significant isoform is 0.0003952569 
## # Number of significant isoform: 12

Let’s take a look at the results returned by above command.

dim(final.res1.wald)

## [1] 44  6

head(final.res1.wald)

##     gene         isoform step2.p.value        adj.p    threshold decision
## 1 GNPTAB ENST00000299314  8.061455e-05 0.0004030728 7.905138e-05    FALSE
## 2 GNPTAB ENST00000549165  4.857991e-03 0.0194319658 9.881423e-05    FALSE
## 3 GNPTAB ENST00000549194  3.044334e-02 0.0608866723 1.976285e-04    FALSE
## 4 GNPTAB ENST00000552009  1.045219e-01 0.1045219228 3.952569e-04    FALSE
## 5 GNPTAB ENST00000552681  1.312816e-02 0.0393844689 1.317523e-04    FALSE
## 6  VEGFA ENST00000372067  8.824491e-04 0.0008824491 3.952569e-04    FALSE

tail(final.res1.wald)

##       gene         isoform step2.p.value        adj.p    threshold decision
## 39 C3orf17 ENST00000494891  6.967420e-01 6.967420e-01 0.0003952569    FALSE
## 40    FZD6 ENST00000521195  1.061691e-01 1.061691e-01 0.0003952569    FALSE
## 41    FZD6 ENST00000522566  8.117205e-08 1.623441e-07 0.0001976285     TRUE
## 42   ASXL1 ENST00000306058  1.578210e-03 4.734631e-03 0.0001317523    FALSE
## 43   ASXL1 ENST00000375689  2.046450e-02 2.046450e-02 0.0003952569    FALSE
## 44   ASXL1 ENST00000470145  4.484705e-03 8.969409e-03 0.0001976285    FALSE

The isoforms that are called significant are as follows

final.res1.wald[final.res1.wald$decision, ]

##         gene         isoform step2.p.value        adj.p    threshold decision
## 7      VEGFA ENST00000497139  7.287226e-07 1.457445e-06 1.976285e-04     TRUE
## 9  HNRNPA2B1 ENST00000356674  5.720757e-11 3.432454e-10 6.587615e-05     TRUE
## 16     RRBP1 ENST00000495501  2.488224e-05 7.464671e-05 1.317523e-04     TRUE
## 19    PRKAA1 ENST00000511248  3.032114e-08 9.096341e-08 1.317523e-04     TRUE
## 20     POSTN ENST00000379743  6.157335e-05 1.847200e-04 1.317523e-04     TRUE
## 22     POSTN ENST00000478947  4.803759e-07 1.921503e-06 9.881423e-05     TRUE
## 24     POSTN ENST00000541179  8.160369e-08 4.080184e-07 7.905138e-05     TRUE
## 25    FRMD4A ENST00000358621  9.130130e-06 2.739039e-05 1.317523e-04     TRUE
## 28    FRMD4A ENST00000492155  7.353218e-06 2.739039e-05 1.061106e-04     TRUE
## 31    TRIP12 ENST00000428959  4.191024e-08 2.933717e-07 5.646527e-05     TRUE
## 38   C3orf17 ENST00000474311  8.501880e-07 3.400752e-06 9.881423e-05     TRUE
## 41      FZD6 ENST00000522566  8.117205e-08 1.623441e-07 1.976285e-04     TRUE

We see that with the Wald test option we identified 12 isoforms that are called significant. These isoforms belong to 9 genes, which are

unique(final.res1.wald[final.res1.wald$decision, 1])

## [1] "VEGFA"     "HNRNPA2B1" "RRBP1"     "PRKAA1"    "POSTN"     "FRMD4A"    "TRIP12"    "C3orf17"  
## [9] "FZD6"

Now we check if the isoforms identified by t-test are elements of that identified by the Wald test as follows

is.element(final.res1[final.res1$decision, 2], final.res1.wald[final.res1.wald$decision, 2])

## [1] TRUE

We see that the isoforms identified by t-test are a proper subset of that identified by the Wald test.

Analysis based on Type 2 screening test

Performing step 1 test

The procedure for the analysis based on Type 2 screening test is the same as that based on Type 1 test except that the argument screen.type of function lrt.test.mclapply needs to be specified as 2 rather than 1. We perform the Type 2 screening test by submitting following command:

res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", outfile="output/ACC_uneqvar_type2.RData", screen.type=2)

Let’s take a look at the results

load(file="output/ACC_uneqvar_type2.RData")

The dimension of the object is

dim(res)

## [1] 2820    6

The rows on the top are

head(res)

##                lrt df nIso   p.value       FDR      Time
## CFH   0.0005488067  1    2 0.9813100 0.9972579 0.2570431
## KRIT1 0.1836651401  1    2 0.6682422 0.9239242 0.5494285
## CD99  0.5384275999  1    2 0.4630851 0.8545718 0.6255455
## M6PR  5.5588222692  4    5 0.2346071 0.7161827 1.0294824
## ICA1  3.9085784985  2    3 0.1416651 0.6252655 0.4974437
## CFLAR 5.6094197880  6    7 0.4683317 0.8552340 1.1157057

and on the bottom are

tail(res)

##                     lrt df nIso   p.value       FDR       Time
## RP11-156P1.3  0.2502671  1    2 0.6168871 0.9104418 0.07161355
## RPL17         7.2153502  4    5 0.1249362 0.5975773 0.48692489
## STRADA        3.4497661  2    3 0.1781939 0.6589856 0.14717674
## RP1-178F10.3  1.7910057  1    2 0.1808037 0.6614968 0.07882619
## ZNF587B       0.3767853  1    2 0.5393287 0.8851202 0.26278758
## RP11-18I14.10 1.5043512  1    2 0.2200031 0.6997417 0.22513223

Let’s check how many genes with missing results.

p <- which(is.na(res[, 1]))
length(p)

## [1] 39

We see that there are 39 (1.38%) genes whose linear mixed effects models did not converge. The names of these genes are

rownames(res)[p]

##  [1] "UBE3C"       "RORA"        "TGFBR3"      "HSP90AA1"    "DOCK9"       "FKBP5"      
##  [7] "RAB18"       "FAM208B"     "DDX6"        "SPARC"       "CLK4"        "USP34"      
## [13] "MED13L"      "STX16"       "C3"          "ATF4"        "MLLT4"       "ERCC5"      
## [19] "NPEPPS"      "SSH2"        "LYST"        "OSMR"        "OGT"         "PSIP1"      
## [25] "CTNNB1"      "UGP2"        "SMR3B"       "AHSA2"       "RPL4"        "PAWR"       
## [31] "PCBP1-AS1"   "PPP1R2"      "SESTD1"      "DMBT1"       "MB"          "GNB2L1"     
## [37] "DIO2"        "RP11-37B2.1" "CCPG1"

Next let’s retain only those called significant at level FDR = 0.1 and assign the results in a new object called sig.res2

sig.res2 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.1, ]
sig.res2

##              lrt df nIso      p.value        FDR      Time
## RRBP1   18.52213  2    3 9.505402e-05 0.08582668 0.2357700
## PRKAA1  18.05272  2    3 1.201991e-04 0.08582668 0.4137421
## TRIP12  27.36960  6    7 1.234472e-04 0.08582668 0.7584684
## C3orf17 20.79593  3    4 1.160648e-04 0.08582668 0.2774065

We see that there are total 4 genes that passed the screening test at level FDR = 0.1. All these genes are a proper subset of that identified by Type 1 screening test.

Performing step 2 test using the standard t-test

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 4 genes that are differentially expressed using function step2.Ftest() as follows

final.res2 <- step2.Ftest(data=data, file.step1="output/ACC_uneqvar_type2.RData", alpha=0.1, method="hochberg")

## Number of significant genes: 4 
## Cutoff for significant isoform is 0.0001438332

dim(final.res2)

## [1] 17  7

head(final.res2)

##     gene         isoform        FC step2.p.value        adj.p    threshold decision
## 1  RRBP1 ENST00000246043 -1.517705  0.3067864458 0.6135728915 7.191658e-05    FALSE
## 2  RRBP1 ENST00000398782  1.014987  0.9703823359 0.9703823359 1.438332e-04    FALSE
## 3  RRBP1 ENST00000495501  5.886727  0.0010257833 0.0030773498 4.794438e-05    FALSE
## 4 PRKAA1 ENST00000397128 -1.234283  0.1011816625 0.2023633251 7.191658e-05    FALSE
## 5 PRKAA1 ENST00000505783 -1.007924  0.9887322647 0.9887322647 1.438332e-04    FALSE
## 6 PRKAA1 ENST00000511248  7.337378  0.0002496545 0.0007489636 4.794438e-05    FALSE

The isoforms that are called significant are as follows

final.res2[final.res2$decision, ]

## [1] gene          isoform       FC            step2.p.value adj.p         threshold    
## [7] decision     
## <0 rows> (or 0-length row.names)

We see that with the standard t-test option we identified 0 isoform that is called significant.

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 4 genes that are differentially expressed using function mc.step2.Waldtest as follows

final.res2.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/ACC_uneqvar_type2.RData", alpha=0.1, method="hochberg")

## Number of significant genes: 4 
## Fitting mixed effects model. This may take a while...
## Model fitting is completed
## # Number of total genes tested: 2781 
## # Number of significant genes: 4 
## # Cutoff for significant isoform is 0.0001438332 
## # Number of significant isoform: 4

Let’s take a look at the results returned by above command.

dim(final.res2.wald)

## [1] 17  6

head(final.res2.wald)

##     gene         isoform step2.p.value        adj.p    threshold decision
## 1  RRBP1 ENST00000246043  2.571108e-01 5.142215e-01 7.191658e-05    FALSE
## 2  RRBP1 ENST00000398782  9.653002e-01 9.653002e-01 1.438332e-04    FALSE
## 3  RRBP1 ENST00000495501  2.488224e-05 7.464671e-05 4.794438e-05     TRUE
## 4 PRKAA1 ENST00000397128  5.516000e-02 1.103200e-01 7.191658e-05    FALSE
## 5 PRKAA1 ENST00000505783  9.875736e-01 9.875736e-01 1.438332e-04    FALSE
## 6 PRKAA1 ENST00000511248  3.032114e-08 9.096341e-08 4.794438e-05     TRUE

tail(final.res2.wald)

##       gene         isoform step2.p.value        adj.p    threshold decision
## 12  TRIP12 ENST00000470302  5.758099e-01 7.749497e-01 1.068722e-04    FALSE
## 13  TRIP12 ENST00000487178  5.098594e-01 7.749497e-01 9.463154e-05    FALSE
## 14 C3orf17 ENST00000383675  2.849060e-01 6.967420e-01 5.881506e-05    FALSE
## 15 C3orf17 ENST00000472705  3.840888e-01 6.967420e-01 7.929005e-05    FALSE
## 16 C3orf17 ENST00000474311  8.501880e-07 3.400752e-06 3.595829e-05     TRUE
## 17 C3orf17 ENST00000494891  6.967420e-01 6.967420e-01 1.438332e-04    FALSE

The isoforms that are called significant are as follows

final.res2.wald[final.res2.wald$decision, ]

##       gene         isoform step2.p.value        adj.p    threshold decision
## 3    RRBP1 ENST00000495501  2.488224e-05 7.464671e-05 4.794438e-05     TRUE
## 6   PRKAA1 ENST00000511248  3.032114e-08 9.096341e-08 4.794438e-05     TRUE
## 9   TRIP12 ENST00000428959  4.191024e-08 2.933717e-07 2.054759e-05     TRUE
## 16 C3orf17 ENST00000474311  8.501880e-07 3.400752e-06 3.595829e-05     TRUE

We see that with the Wald test option we identified 4 isoforms that are called significant. These isoforms belong to 4 genes, which are

unique(final.res2.wald[final.res2.wald$decision, 1])

## [1] "RRBP1"   "PRKAA1"  "TRIP12"  "C3orf17"

We see that these isoforms are a proper subset of that obtained based on screening test 1.

Plots of isoform abundances

First we create the plots for the 11 genes that passed the Type 1 screening test and output the results to a pdf file ACC_figure.pdf.

plot.gene.bat(data=data, gene=rownames(sig.res1), xlim=c(-0.2, 2.3), lwd=3, isoform.id.opt=1, Phenotype="Tumor Status", file="ACC_figure.pdf", pPanel=c(2, 2), width=7, height=9, y.lab=c("Free of Cancer", "Not"))

## png 
##   2

Next we make the isoform plots for the nine genes that have at least one differentially expressed isoform detected.

gene <- c("C3orf17", "FRMD4A", "FZD6", "HNRNPA2B1", "POSTN", "PRKAA1", "RRBP1", "TRIP12", "VEGFA")
par(mfrow=c(3, 3))
for(k in 1:9) dat<-plot.gene(data=data, gene=gene[k], Phenotype="Tumor Status", xlim = c(-0.05, 1.6), lwd=1, isoform.id.opt=1)

In the above figure, genes FRMD4A, POSTN, and VEGFA are typical genes that passed the Type 1 screening test but the Type 2 screening test. Form the plot of the genes we see that there is roughly a parallel pattern for each gene; the isoforms the genes are all down-regulated in patients who are free of cancer and up-regulated in patients who are not. There is no evidence for the differential splicing or isoform switches for these genes. On the other hand gene TRIP12 passed both screening tests. We see an interaction pattern, where some of the isoforms are up-regulated in patients who are free of cancer and down-regulated in patients who are not while other isoforms are opposite. There might be isoform switching events for this gene.

Performing Analyses Assuming Covariance Structure is Unstructured

The only thing we need to do is to replace uneqvar with unstr for argument VarCov.type in functions lrt.test.mclapply and mc.step2.Waldtest. Since we expect that using unstructured covariance sturcture will yield more genes that are called significant because this method suffers from serious type I error inflation. We set the significance level at FDR = 0.05

Analysis based on Type 1 screening test

Performing step 1 test

The first step is to run the screening test\(-\)the likelihood ratio test (LRT) by calling function lrt.test.mclapply() as follows. res <- lrt.test.mclapply(data=data, VarCov.type="unstr", nCore=NULL, outfile="output/ACC_unstr_type1.RData", screen.type=1)

This was the most time-consuming part. It returned a matrix object res which was saved in the file ACC_unstr_type1.RData under the subfolder output. Now we load the object from the file and take a look.

load(file="output/ACC_unstr_type1.RData")

The dimension of the object is

dim(res)

## [1] 2820    6

The rows on the top are

head(res)

##              lrt df nIso    p.value       FDR      Time
## CFH    0.4919436  2    2 0.78194429 0.8965698 0.2092488
## KRIT1  1.6988124  2    2 0.42766881 0.6961705 0.1414852
## CD99   1.5683330  2    2 0.45650004 0.7090252 0.1826291
## M6PR  11.5435983  5    5 0.04160597 0.2957710 1.6749692
## ICA1   8.1432489  3    3 0.04314212 0.3006032 0.5308003
## CFLAR  8.6364768  7    7 0.27981947 0.5826950 4.2954025

and on the bottom are

tail(res)

##                    lrt df nIso   p.value       FDR       Time
## RP11-156P1.3  1.260997  2    2 0.5323263 0.7545544 0.06314778
## RPL17         8.561647  5    5 0.1278787 0.4372829 0.85537767
## STRADA        3.006061  3    3 0.3906916 0.6713965 0.28347087
## RP1-178F10.3  2.487001  2    2 0.2883729 0.5900335 0.05041909
## ZNF587B       2.126251  2    2 0.3453747 0.6408724 0.09667969
## RP11-18I14.10 1.179368  2    2 0.5545026 0.7673779 0.07717490

Each row of the matrix gives the test result for each gene and the row names indicate the names (IDs) of the genes. The columns from left to right are LRT statistic (lrt), degree of freedom (df), Number of isoforms (nIso), p-values of LRT (p.value), FDR (FDR) and time used for fitting the linear mixed effects model in second (Time). Note that there may be missing results for some genes because the corresponding linear mixed models procedures did not converge. Let’s see how many such genes there are.

p <- which(is.na(res[, 1]))
length(p)

## [1] 12

We see that there are 12 (0.43%) genes whose linear mixed effects models did not converge. The names of these genes are

rownames(res)[p]

##  [1] "PABPC1"  "DDX5"    "EIF4G2"  "PRR4"    "HDLBP"   "STX16"   "RBM39"   "DST"     "HNRNPH1"
## [10] "CHD2"    "DMBT1"   "PCBP2"

Next let’s retain only those called significant at level FDR = 0.05 and assign the results in a new object called sig.res1.unstr

sig.res1.unstr <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
sig.res1.unstr

##                lrt df nIso      p.value          FDR        Time
## TFPI      17.55591  3    3 5.430494e-04 4.121304e-02  0.46739388
## RALA      14.86930  2    2 5.904364e-04 4.135785e-02  0.07815695
## CSDE1     40.34346 11   11 3.124477e-05 1.096691e-02 37.36043525
## DCN       30.21382  9    9 4.034372e-04 3.540162e-02 12.47401571
## MATR3     38.27614 11   11 7.026838e-05 1.351868e-02 44.02841592
## CD44      48.49410 12   12 2.563610e-06 1.799654e-03 64.03558469
## BCLAF1    30.62666  7    7 7.284432e-05 1.351868e-02  6.17670083
## CTNNA1    38.63362 11   11 6.112788e-05 1.351868e-02 38.18209171
## EPB41L2   38.91635  9    9 1.192761e-05 5.582120e-03 13.78606319
## XPO1      38.07973 11   11 7.584898e-05 1.351868e-02 36.04494047
## CTTN      23.83046  6    6 5.611140e-04 4.135785e-02  2.63145041
## RPLP0     23.59008  6    6 6.211229e-04 4.135785e-02  2.19417548
## HECTD1    34.11522 12   12 6.468887e-04 4.135785e-02 66.54357004
## SUPT16H   23.36264  4    4 1.071596e-04 1.583706e-02  0.64759159
## MYL6      36.50017  8    8 1.421863e-05 5.703700e-03  7.60929179
## RSRC2     35.54355  7    7 8.837860e-06 4.963342e-03  4.24996257
## VEGFA     18.09681  2    2 1.175786e-04 1.650804e-02  0.08749151
## LIFR      16.93216  3    3 7.298140e-04 4.269412e-02  0.28256750
## TXNIP     21.25429  5    5 7.251620e-04 4.269412e-02  1.69173169
## CD46      21.29808  5    5 7.114555e-04 4.269412e-02  1.24898028
## HNRNPA2B1 28.14309  6    6 8.830796e-05 1.458640e-02  2.24245334
## CALD1     26.13189  7    7 4.770543e-04 3.995906e-02  4.72571683
## RRBP1     16.67807  3    3 8.230742e-04 4.716719e-02  0.32654810
## SAT1      28.63596  7    7 1.685532e-04 1.972072e-02  4.78907704
## RPL36     14.73562  2    2 6.312475e-04 4.135785e-02  0.08538485
## PRKAA1    18.31728  3    3 3.783044e-04 3.426705e-02  0.27591324
## DSC2      14.10812  2    2 8.638945e-04 4.851632e-02  0.08670640
## MDM2      31.34720  7    7 5.364311e-05 1.351868e-02  5.73842359
## SEC31A    34.36643  9    9 7.702951e-05 1.351868e-02 12.84965777
## TPM1      49.23749 12   12 1.900840e-06 1.779186e-03 77.00821447
## NPC1      24.73678  4    4 5.682708e-05 1.351868e-02  0.64532685
## HDGF      20.35293  3    3 1.434237e-04 1.751016e-02  0.28810167
## PDCD4     25.94381  7    7 5.154447e-04 4.020468e-02  3.31925941
## SAR1B     19.63531  3    3 2.019939e-04 2.208973e-02  0.26707196
## C3orf17   19.31513  4    4 6.814449e-04 4.252216e-02  0.66901827
## CCNL1     27.94532  8    8 4.847572e-04 3.995906e-02  5.39370370
## FZD6      15.78972  2    2 3.726543e-04 3.426705e-02  0.09691691
## NDRG2     33.64118 10   10 2.124012e-04 2.208973e-02 25.49065089
## ASXL1     20.56654  3    3 1.295116e-04 1.721234e-02  0.23151445
## RPLP2     22.86310  4    4 1.348545e-04 1.721234e-02  0.84775782
## NPM1      56.73750 11   11 3.718940e-08 5.221392e-05 27.47847319
## ANXA2     29.06282  9    9 6.324180e-04 4.135785e-02 13.86661887
## SF3B3     17.73817  3    3 4.980652e-04 3.995906e-02  0.26519942
## TMEM63A   27.84423  6    6 1.005263e-04 1.568210e-02  1.66032028
## SUPT5H    19.13105  3    3 2.568598e-04 2.575937e-02  0.21394444
## COL27A1   17.18301  3    3 6.480574e-04 4.135785e-02  0.18095660
## STK39     19.59480  3    3 2.059333e-04 2.208973e-02  0.20194840
## SDHD      22.22055  3    3 5.868893e-05 1.351868e-02  0.17672563
## GNB2L1    68.18185 12   12 6.998713e-10 1.965239e-06 45.70985651
## CCPG1     20.76096  4    4 3.531628e-04 3.419590e-02  0.73050785

We see that there are total 50 genes that passed the screening test at level FDR = 0.05. Let’s sort these genes by the p-values.

sig.res1.unstr[order(sig.res1.unstr[, "p.value"]), ]

##                lrt df nIso      p.value          FDR        Time
## GNB2L1    68.18185 12   12 6.998713e-10 1.965239e-06 45.70985651
## NPM1      56.73750 11   11 3.718940e-08 5.221392e-05 27.47847319
## TPM1      49.23749 12   12 1.900840e-06 1.779186e-03 77.00821447
## CD44      48.49410 12   12 2.563610e-06 1.799654e-03 64.03558469
## RSRC2     35.54355  7    7 8.837860e-06 4.963342e-03  4.24996257
## EPB41L2   38.91635  9    9 1.192761e-05 5.582120e-03 13.78606319
## MYL6      36.50017  8    8 1.421863e-05 5.703700e-03  7.60929179
## CSDE1     40.34346 11   11 3.124477e-05 1.096691e-02 37.36043525
## MDM2      31.34720  7    7 5.364311e-05 1.351868e-02  5.73842359
## NPC1      24.73678  4    4 5.682708e-05 1.351868e-02  0.64532685
## SDHD      22.22055  3    3 5.868893e-05 1.351868e-02  0.17672563
## CTNNA1    38.63362 11   11 6.112788e-05 1.351868e-02 38.18209171
## MATR3     38.27614 11   11 7.026838e-05 1.351868e-02 44.02841592
## BCLAF1    30.62666  7    7 7.284432e-05 1.351868e-02  6.17670083
## XPO1      38.07973 11   11 7.584898e-05 1.351868e-02 36.04494047
## SEC31A    34.36643  9    9 7.702951e-05 1.351868e-02 12.84965777
## HNRNPA2B1 28.14309  6    6 8.830796e-05 1.458640e-02  2.24245334
## TMEM63A   27.84423  6    6 1.005263e-04 1.568210e-02  1.66032028
## SUPT16H   23.36264  4    4 1.071596e-04 1.583706e-02  0.64759159
## VEGFA     18.09681  2    2 1.175786e-04 1.650804e-02  0.08749151
## ASXL1     20.56654  3    3 1.295116e-04 1.721234e-02  0.23151445
## RPLP2     22.86310  4    4 1.348545e-04 1.721234e-02  0.84775782
## HDGF      20.35293  3    3 1.434237e-04 1.751016e-02  0.28810167
## SAT1      28.63596  7    7 1.685532e-04 1.972072e-02  4.78907704
## SAR1B     19.63531  3    3 2.019939e-04 2.208973e-02  0.26707196
## STK39     19.59480  3    3 2.059333e-04 2.208973e-02  0.20194840
## NDRG2     33.64118 10   10 2.124012e-04 2.208973e-02 25.49065089
## SUPT5H    19.13105  3    3 2.568598e-04 2.575937e-02  0.21394444
## CCPG1     20.76096  4    4 3.531628e-04 3.419590e-02  0.73050785
## FZD6      15.78972  2    2 3.726543e-04 3.426705e-02  0.09691691
## PRKAA1    18.31728  3    3 3.783044e-04 3.426705e-02  0.27591324
## DCN       30.21382  9    9 4.034372e-04 3.540162e-02 12.47401571
## CALD1     26.13189  7    7 4.770543e-04 3.995906e-02  4.72571683
## CCNL1     27.94532  8    8 4.847572e-04 3.995906e-02  5.39370370
## SF3B3     17.73817  3    3 4.980652e-04 3.995906e-02  0.26519942
## PDCD4     25.94381  7    7 5.154447e-04 4.020468e-02  3.31925941
## TFPI      17.55591  3    3 5.430494e-04 4.121304e-02  0.46739388
## CTTN      23.83046  6    6 5.611140e-04 4.135785e-02  2.63145041
## RALA      14.86930  2    2 5.904364e-04 4.135785e-02  0.07815695
## RPLP0     23.59008  6    6 6.211229e-04 4.135785e-02  2.19417548
## RPL36     14.73562  2    2 6.312475e-04 4.135785e-02  0.08538485
## ANXA2     29.06282  9    9 6.324180e-04 4.135785e-02 13.86661887
## HECTD1    34.11522 12   12 6.468887e-04 4.135785e-02 66.54357004
## COL27A1   17.18301  3    3 6.480574e-04 4.135785e-02  0.18095660
## C3orf17   19.31513  4    4 6.814449e-04 4.252216e-02  0.66901827
## CD46      21.29808  5    5 7.114555e-04 4.269412e-02  1.24898028
## TXNIP     21.25429  5    5 7.251620e-04 4.269412e-02  1.69173169
## LIFR      16.93216  3    3 7.298140e-04 4.269412e-02  0.28256750
## RRBP1     16.67807  3    3 8.230742e-04 4.716719e-02  0.32654810
## DSC2      14.10812  2    2 8.638945e-04 4.851632e-02  0.08670640

Performing step 2 test using the standard t-test

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 50 genes that are differentially expressed using function step2.Ftest() as follows

final.res1.unstr <- step2.Ftest(data=data, file.step1="output/ACC_unstr_type1.RData", alpha=0.05, method="hochberg")

## Number of significant genes: 50 
## Cutoff for significant isoform is 0.0008903134

The function returned the results of the confirmatory test for all the isoforms of the above 50 gene in a data.frame object called final.res1.unstr. Let’s take a look at the results.

dim(final.res1.unstr)

## [1] 307   7

head(final.res1.unstr)

##    gene         isoform         FC step2.p.value       adj.p    threshold decision
## 1  TFPI ENST00000233156   1.230777  0.7349773654 0.921281355 0.0007102718    FALSE
## 2  TFPI ENST00000437725 -10.381846  0.0008045526 0.002413658 0.0002967711    FALSE
## 3  TFPI ENST00000453013   1.078937  0.9212813552 0.921281355 0.0008903134    FALSE
## 4  RALA ENST00000005257   1.703897  0.0010162639 0.002032528 0.0004451567    FALSE
## 5  RALA ENST00000466491   2.198932  0.2974163689 0.297416369 0.0008903134    FALSE
## 6 CSDE1 ENST00000261443   1.580196  0.4497199931 0.738619676 0.0005420811    FALSE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and descison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE. The isoforms that are called significant are as follows

final.res1.unstr[final.res1.unstr$decision, ]

##          gene         isoform        FC step2.p.value        adj.p    threshold decision
## 146 HNRNPA2B1 ENST00000356674  1.682557  5.628982e-05 0.0003377389 0.0001483856     TRUE
## 172    PRKAA1 ENST00000511248  7.337378  2.496545e-04 0.0007489636 0.0002967711     TRUE
## 229     CCNL1 ENST00000474539 -2.308505  1.002999e-04 0.0008023989 0.0001112892     TRUE
## 233      FZD6 ENST00000522566  8.590236  3.271466e-04 0.0006542932 0.0004451567     TRUE

We see that with the standard t-test option we identified 4 isoforms that are called significant.

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 50 genes that are differentially expressed using function mc.step2.Waldtest and save the object final.res1.wald.unstr returned by the function in a file called ACC_unstr_type1_step2_Wald_Hoch.RData under subfolder output

final.res1.wald.unstr <- mc.step2.Waldtest(data=data, file.step1="output/ACC_unstr_type1.RData", VarCov.type="unstr", alpha=0.05, method="hochberg", nCore=NULL) save(final.res1.wald.unstr, file="output/ACC_unstr_type1_step2_Wald_Hoch_05.RData")

Now let’s load the object final.res1.wald.unstr from the file and take a look.

load(file="output/ACC_unstr_type1_step2_Wald_Hoch_05.RData")
dim(final.res1.wald.unstr)

## [1] 307   6

head(final.res1.wald.unstr)

##    gene         isoform step2.p.value        adj.p    threshold decision
## 1  TFPI ENST00000233156  7.289780e-01 9.196149e-01 0.0007057507    FALSE
## 2  TFPI ENST00000437725  8.927690e-06 2.678307e-05 0.0002967711     TRUE
## 3  TFPI ENST00000453013  9.196149e-01 9.196149e-01 0.0008903134    FALSE
## 4  RALA ENST00000005257  1.642580e-05 3.285160e-05 0.0004451567     TRUE
## 5  RALA ENST00000466491  2.760240e-01 2.760240e-01 0.0008903134    FALSE
## 6 CSDE1 ENST00000261443  4.345778e-01 7.327135e-01 0.0005280515    FALSE

tail(final.res1.wald.unstr)

##       gene         isoform step2.p.value        adj.p    threshold decision
## 302 GNB2L1 ENST00000514183  5.894023e-02 6.483425e-01 8.093758e-05    FALSE
## 303 GNB2L1 ENST00000514455  1.450059e-01 9.720775e-01 1.328091e-04    FALSE
## 304  CCPG1 ENST00000310958  2.598792e-06 1.039517e-05 2.225783e-04     TRUE
## 305  CCPG1 ENST00000442196  6.949804e-01 7.982639e-01 7.751201e-04    FALSE
## 306  CCPG1 ENST00000568808  7.982639e-01 7.982639e-01 8.903134e-04    FALSE
## 307  CCPG1 ENST00000569205  5.650142e-01 7.982639e-01 6.301671e-04    FALSE

The isoforms that are called significant are as follows

final.res1.wald.unstr[final.res1.wald.unstr$decision, ]

##          gene         isoform step2.p.value        adj.p    threshold decision
## 2        TFPI ENST00000437725  8.927690e-06 2.678307e-05 2.967711e-04     TRUE
## 4        RALA ENST00000005257  1.642580e-05 3.285160e-05 4.451567e-04     TRUE
## 62     CTNNA1 ENST00000521387  7.100130e-08 7.810143e-07 8.093758e-05     TRUE
## 131     VEGFA ENST00000497139  4.516121e-06 9.032242e-06 4.451567e-04     TRUE
## 132      LIFR ENST00000263409  8.797806e-07 2.639342e-06 2.967711e-04     TRUE
## 146 HNRNPA2B1 ENST00000356674  1.320035e-09 7.920210e-09 1.483856e-04     TRUE
## 160     RRBP1 ENST00000495501  1.682240e-05 5.046719e-05 2.967711e-04     TRUE
## 172    PRKAA1 ENST00000511248  2.918382e-07 8.755147e-07 2.967711e-04     TRUE
## 174      DSC2 ENST00000280904  1.938294e-05 3.876588e-05 4.451567e-04     TRUE
## 184    SEC31A ENST00000505479  6.384223e-05 5.745801e-04 9.892371e-05     TRUE
## 205      NPC1 ENST00000590723  1.044331e-06 4.177324e-06 2.225783e-04     TRUE
## 222   C3orf17 ENST00000474311  2.111586e-06 8.446344e-06 2.225783e-04     TRUE
## 229     CCNL1 ENST00000474539  1.251432e-08 1.001146e-07 1.112892e-04     TRUE
## 233      FZD6 ENST00000522566  6.814898e-07 1.362980e-06 4.451567e-04     TRUE
## 281    SUPT5H ENST00000600818  4.554611e-06 1.366383e-05 2.967711e-04     TRUE
## 284   COL27A1 ENST00000451716  8.943306e-07 2.682992e-06 2.967711e-04     TRUE
## 286     STK39 ENST00000355999  6.958571e-07 2.087571e-06 2.967711e-04     TRUE
## 304     CCPG1 ENST00000310958  2.598792e-06 1.039517e-05 2.225783e-04     TRUE

We see that with the Wald test option we identified 18 isoforms that are called significant. These isoforms belong to 18 genes, which are

unique(final.res1.wald.unstr[final.res1.wald.unstr$decision, 1])

##  [1] "TFPI"      "RALA"      "CTNNA1"    "VEGFA"     "LIFR"      "HNRNPA2B1" "RRBP1"     "PRKAA1"   
##  [9] "DSC2"      "SEC31A"    "NPC1"      "C3orf17"   "CCNL1"     "FZD6"      "SUPT5H"    "COL27A1"  
## [17] "STK39"     "CCPG1"

Now we check if the isoforms identified by t-test are elements of that identified by the Wald test as follows

is.element(final.res1.unstr[final.res1.unstr$decision, 2], final.res1.wald.unstr[final.res1.wald.unstr$decision, 2])

## [1] TRUE TRUE TRUE TRUE

We see that the isoforms identified by t-test are a proper subset of that identified by the Wald test.

Analysis based on Type 2 screening test

Performing step 1 test

res <- lrt.test.mclapply(data=data, VarCov.type="unstr", nCore=NULL, outfile="output/ACC_unstr_type2.RData", screen.type=2)

Let’s take a look at the results

load(file="output/ACC_unstr_type2.RData")

The dimension of the object is

dim(res)

## [1] 2820    6

The rows on the top are

head(res)

##                lrt df nIso    p.value       FDR      Time
## CFH   0.0005488067  1    2 0.98130997 0.9951395 0.1339359
## KRIT1 0.1293231950  1    2 0.71913481 0.9289952 0.2619956
## CD99  0.4625576194  1    2 0.49643109 0.8553886 0.1218066
## M6PR  7.4279294055  4    5 0.11492961 0.5745622 1.7152686
## ICA1  7.3514875354  2    3 0.02533056 0.4197706 0.5095682
## CFLAR 5.0603373219  6    7 0.53609805 0.8707348 4.5903389

and on the bottom are

tail(res)

##                     lrt df nIso   p.value       FDR       Time
## RP11-156P1.3  0.2502671  1    2 0.6168871 0.8902322 0.06278110
## RPL17         6.2179248  4    5 0.1834539 0.6653167 0.56409240
## STRADA        2.8060809  2    3 0.2458483 0.7082911 0.27607942
## RP1-178F10.3  1.7910057  1    2 0.1808037 0.6634195 0.04237771
## ZNF587B       0.3452667  1    2 0.5568049 0.8728023 0.10422730
## RP11-18I14.10 1.1763892  1    2 0.2780923 0.7328160 0.09242225

Let’s check how many genes with missing results.

p <- which(is.na(res[, 1]))
length(p)

## [1] 13

We see that there are 13 (0.46%) genes whose linear mixed effects models did not converge. The names of these genes are

rownames(res)[p]

##  [1] "PABPC1"  "DDX5"    "EIF4G2"  "PRR4"    "HDLBP"   "STX16"   "RBM39"   "TPM1"    "DST"    
## [10] "HNRNPH1" "CHD2"    "DMBT1"   "PCBP2"

Next let’s retain only those called significant at level FDR = 0.05 and assign the results in a new object called sig.res2.unstr

sig.res2.unstr <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
sig.res2.unstr

##              lrt df nIso      p.value          FDR       Time
## CD44    46.14026 11   12 3.053940e-06 0.0028574695 65.6715877
## BCLAF1  30.45665  6    7 3.217994e-05 0.0225822703  5.8314250
## PRKAA1  18.07153  2    3 1.190742e-04 0.0470583239  0.2567837
## NDRG2   32.98947  9   10 1.341171e-04 0.0470583239 25.2666283
## NPM1    51.81913 10   11 1.232546e-07 0.0001799886 29.8166714
## TMEM63A 26.60506  5    6 6.808432e-05 0.0331858408  1.6199515
## SUPT5H  19.10749  2    3 7.093518e-05 0.0331858408  0.2028501
## GNB2L1  53.80250 11   12 1.282427e-07 0.0001799886 46.1234529

We see that there are total 8 genes that passed the screening test at level FDR = 0.05. Let’s check if these genes are a proper subset of that identified by Type 1 screening test.

is.element(rownames(sig.res2.unstr), rownames(sig.res2.unstr))

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

All these genes are a proper subset of that identified by Type 1 screening test.

Performing step 2 test using the standard t-test

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 8 genes that are differentially expressed using function step2.Ftest() as follows

final.res2.unstr <- step2.Ftest(data=data, file.step1="output/ACC_unstr_type2.RData", alpha=0.05, method="hochberg")

## Number of significant genes: 8 
## Cutoff for significant isoform is 0.0001425009

dim(final.res2.unstr)

## [1] 64  7

head(final.res2.unstr)

##   gene         isoform        FC step2.p.value     adj.p    threshold decision
## 1 CD44 ENST00000263398 -1.523227    0.01701411 0.2041693 1.187507e-05    FALSE
## 2 CD44 ENST00000278385  1.147836    0.70541585 0.9339563 1.076307e-04    FALSE
## 3 CD44 ENST00000278386 -1.571053    0.56068426 0.9339563 8.554791e-05    FALSE
## 4 CD44 ENST00000525241  1.625292    0.51080786 0.9339563 7.793788e-05    FALSE
## 5 CD44 ENST00000525293  1.032489    0.89953813 0.9339563 1.372495e-04    FALSE
## 6 CD44 ENST00000525469 -1.175661    0.84067219 0.9339563 1.282678e-04    FALSE

The isoforms that are called significant are as follows

final.res2.unstr[final.res2.unstr$decision, ]

## [1] gene          isoform       FC            step2.p.value adj.p         threshold    
## [7] decision     
## <0 rows> (or 0-length row.names)

We see that with the standard t-test option we identified 0 isoform that is called significant.

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 8 genes that are differentially expressed using function mc.step2.Waldtest and save the object final.res2.wald.unstr returned by the function in a file called ACC_unstr_type2_step2_Wald_Hoch.RData under subfolder output

final.res2.wald.unstr <- mc.step2.Waldtest(data=data, file.step1="output/ACC_unstr_type2.RData", VarCov.type="unstr", alpha=0.05, method="hochberg", nCore=NULL) save(final.res2.wald.unstr, file="output/ACC_unstr_type2_step2_Wald_Hoch_05.RData")

Now let’s load the object final.res2.wald.unstr from the file and take a look.

load(file="output/ACC_unstr_type2_step2_Wald_Hoch_05.RData")
dim(final.res2.wald.unstr)

## [1] 64  6

head(final.res2.wald.unstr)

##   gene         isoform step2.p.value      adj.p    threshold decision
## 1 CD44 ENST00000263398    0.00563439 0.06761268 1.187507e-05    FALSE
## 2 CD44 ENST00000278385    0.69863665 0.93256076 1.067559e-04    FALSE
## 3 CD44 ENST00000278386    0.54955922 0.93256076 8.397595e-05    FALSE
## 4 CD44 ENST00000525241    0.49795166 0.93256076 7.609001e-05    FALSE
## 5 CD44 ENST00000525293    0.89740197 0.93256076 1.371284e-04    FALSE
## 6 CD44 ENST00000525469    0.83722791 0.93256076 1.279335e-04    FALSE

tail(final.res2.wald.unstr)

##      gene         isoform step2.p.value     adj.p    threshold decision
## 59 GNB2L1 ENST00000508963    0.75646753 0.9720775 1.108937e-04    FALSE
## 60 GNB2L1 ENST00000509148    0.24040706 0.9720775 3.524227e-05    FALSE
## 61 GNB2L1 ENST00000511900    0.44812396 0.9720775 6.569236e-05    FALSE
## 62 GNB2L1 ENST00000513060    0.16121923 0.9720775 2.363380e-05    FALSE
## 63 GNB2L1 ENST00000514183    0.05894023 0.6483425 1.295463e-05    FALSE
## 64 GNB2L1 ENST00000514455    0.14500595 0.9720775 2.125703e-05    FALSE

The isoforms that are called significant are as follows

final.res2.wald.unstr[final.res2.wald.unstr$decision, ]

##      gene         isoform step2.p.value        adj.p   threshold decision
## 22 PRKAA1 ENST00000511248  2.918382e-07 8.755147e-07 4.75003e-05     TRUE
## 51 SUPT5H ENST00000600818  4.554611e-06 1.366383e-05 4.75003e-05     TRUE

We see that with the Wald test option we identified 2 isoforms that are called significant. These isoforms belong to 2 genes, which are

unique(final.res2.wald.unstr[final.res2.wald.unstr$decision, 1])

## [1] "PRKAA1" "SUPT5H"

Application to an RNA-Sequencing Study of Adenoid Cystic Carcinoma (ACC)

Li Luo, Huining Kang*, Xichen Li, Scott A. Ness, Christine A. Stidley

6/14/2020

Introduction

Performing Analyses Assuming Compound Symmetry with Unequal Variances for Covariance Structure

Analysis based on Type 1 screening test

Performing step 1 test

Performing step 2 test using the standard t-test

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Analysis based on Type 2 screening test

Performing step 1 test

Performing step 2 test using the standard t-test

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Plots of isoform abundances

Performing Analyses Assuming Covariance Structure is Unstructured

Analysis based on Type 1 screening test

Performing step 1 test

Performing step 2 test using the standard t-test

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Analysis based on Type 2 screening test

Performing step 1 test

Performing step 2 test using the standard t-test

Performing step 2 test using the the Wald test based on the linear mixed-effects model