\(*\) Correspondence should be addressed to Huining Kang (HuKang@salud.unm.edu).

Introduction

We recently developed a new statistical method for analyzing differentially expressed/spliced transcript isoforms, which appropriately account for the dependence between isoforms and multiple testing corrections for the multi-dimensional structure of at both the gene-and isoform-level. We proposed a linear mixed effects model-based approach for analyzing the complex alternative RNA splicing regulation patterns detected by whole-transcriptome RNA-Sequencing technologies. This approach thoroughly characterizes and differentiates three types of genes related to alternative RNA splicing events with distinct differential expression/splicing patterns. We applied the concept of appropriately controlling for the gene-level overall false discovery rate (OFDR) in this multi-dimensional alternative RNA splicing analysis utilizing a two-step hierarchical hypothesis testing framework. In the initial screening test we identify genes that have differentially expressed or spliced isoforms; in the subsequent confirmatory testing stage we examine only the isoforms of genes that have passed the screening tests. The methodology is described in detail in the manuscript titled Two-step mixed model approach to analyzing differential alternative RNA splicing submitted to PLOS ONE. We also provided an R program file (Tools.R) that included all the functions for performing the analysis with the proposed method. The R program file can be downloaded here. In this tutorial we describe step-by-step how to use these functions in R to perform the analyses that we have proposed in our manuscript.

Setup for the Analysis

Before performing the analysis we suggest you create a folder as the working directory which you may name as, for example, WD and download the R file Tools.R to this folder. Under this folder we suggest you create two subfolders and name them as data and output, saving the original data file in the data subfolder and the results from calling R functions in the output subfolder.

After starting R, use R function setwd() to set the current working directory to the working directory you have created and then load the R file Tools.R by submitting command source("Tools.R"). This will also load all the necessary R packages, including nlme, aod, gtools, genefilter, and parallel. Please pre-install these packages before running this Tutorial.

# Clear memmory
rm(list=ls())
# load the R functions and R packages
source("Tools.R")

Note that the R programs will use the multicores of the computer to speed up the computation under Linux. Currently this option is not available under Windows.

Data set and Data Format

This tutorial uses a small data set saved in an R data file called data4demo.RData that is available for download here. The data set consists of a subset of 400 genes from the RNA-Sequencing study of adenoid cystic carcinoma (ACC) reported in our manuscript with the genes hand-picked for ease of computation and demonstration. The data set contains isoform abundances of the 400 genes from 14 ACC patients. Eight of the patients are free of cancer while the rest of six are not during the follow-up time. The objective of the analysis is to identify the genes that have differentially expressed/spliced isoforms between the two medical conditions as well as those specific isoforms that are differentially expressed/spliced.

The R data file data4demo.RData contains an R list object called data. The following commands will load this object into the memmory and show the elements of the object.

# load data to the memory
load(file="data/data4demo.RData")
# list the elements of object data
str(data)

## List of 6
##  $ x      : num [1:984, 1:14] 2.631 2.409 0.637 2.668 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:984] "ENST00000357337" "ENST00000474882" "ENST00000485061" "ENST00000507771" ...
##   .. ..$ : NULL
##  $ y      : num [1:14] 1 1 1 2 2 1 2 2 1 2 ...
##  $ gene_id: chr [1:984] "ENSG00000005810" "ENSG00000005810" "ENSG00000005810" "ENSG00000005884" ...
##  $ gene   : chr [1:984] "MYCBP2" "MYCBP2" "MYCBP2" "ITGA3" ...
##  $ isoform: chr [1:984] "ENST00000357337" "ENST00000474882" "ENST00000485061" "ENST00000507771" ...
##  $ y.lab  : chr [1:2] "Free of Cancer" "Not Free of Cancer"

We can see that the data object has six elements: x, y, gene_id, gene, isoform, and y.lab. The x is a \(p\) by \(n\) matrix of isoform abundance (expression) measurements (in base 2 log scale) with \(p=984\) and \(n=14\). Each row of the matrix represents the abundance measurements for each isoform and each column corresponds to one experimental sample, i.e. one patient in the current example. The y is a vector of numerical labels indicating the medical conditions of the samples, corresponding to the columns of matrix x. The gene_id is a vector of Ensembl gene IDs, the gene, a vector of gene symbols, and the isoform, a vector of isoform IDs, corresponding to the rows of matrix x. The y.lab provides descriptions of the medical conditions that the values of y represent, i.e. 1 = “Free of Cancer”, 2 = “Not Free of Cancer”. Note that elements gene_id and y.lab are not required in this tutorial.

The following command shows the number of unique genes in this data set:

length(unique(data$gene))

## [1] 400

The function find.nL() can be used to calculate the number of isoforms for each gene. It returns a numeric vector with each element being the number of isoforms of one gene and the names of the elements being gene names.

nL <- find.nL(data=data)

The numbers of isoforms for the first 5 genes are

head(nL, n=5)

##  MYCBP2   ITGA3  GGNBP2 CCDC124 ST3GAL1 
##       3       2       3       2       3

The numbers of isoforms for the last 5 genes are

tail(nL, n=5)

##        CHMP4A        LSM14A        CHURC1 RP11-356K23.1  RP1-178F10.3 
##             2             2             2             2             2

We can summarize all the numbers as follows:

table(nL)

## nL
##   2   3   4   5   6   7 
## 262 100  33   3   1   1

The result indicates that there are 262 genes that have 2 isoforms for each, 100 genes that have 3 isoforms for each, and so on so forth.

Performing Analyses

Now that we have a good understanding of the data, we are in a position to perform the analysis of the data. Our method features a two-step hypothesis testing framework. In the first step we perform a screening test to identify genes that have differentially expressed or spliced isoforms; in the second step we perform a confirmatory test to examine only the isoforms of genes that have passed the screening tests.

There are some parameters and options that we need to determine beforehand. The first is the significance level. It is determined by specifying a threshold for the overall false discovery level (OFDR). In general, we set the significance level at OFDR = 0.05. Note that the OFDR is equivalent to the false discovery rate (FDR) proposed by Benjamini and Hochberg (1995) if we perform only the screening test with a purpose to only identify genes that have differentially expressed/spliced isoforms (refer to the manuscript for details.) Hence in the related R functions the argument for it is FDR. The second is to choose between two types of screening tests. Type 1 is to identify genes that have either differentially expressed or differentially spliced isoforms. Type 2 is to identify genes that have differentially spliced isoforms. If the study has a focus on the differentially spliced isoforms especially on identifying isoform switches, then Type 2 should be used. The third is the covariance structure. There are three options: cs (compound symmetry with equal variances), uneqvar (compound symmetry with unequal variances), and unstr (unstructured), where cs corresponds to a more strict assumption while unstr corresponds to a most liberal assumption. We suggest using uneqvar in general cases and we will use it in this tutorial. The forth is the method for the step-two analysis. There are two options, the standard t-test and model based Wald test.

Analysis based on Type 1 screening test

Performing step 1 test

The first step is to run the screening test, the likelihood ratio test (LRT) by calling function lrt.test.mclapply(). This is the most time-consuming part. It could take hours or even days depending on the total number of genes, the sample size, the options on the covariance structure chosen, and the power of the machine. The five arguments of this function are data, VarCov.type, nCore, outfile, and screen.type. data is the name of data object. VarCov.type is the covariance structure with three options cs, uneqvar, and unstr as described previously. nCore is the number of cores to use and default value is the total number of the machine cores minus one. outfile is the name of the R file to save the result. screen.type is the type of the screening test. We proceed this step by submitting following command: res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", outfile="output/demo_uneqvar_type1.RData", screen.type=1)

This command returns a matrix object res which is also saved in the file demo_uneqvar_type1.RData under the subfolder output. Now we load the object from the file and take a look.

load(file="output/demo_uneqvar_type1.RData")

The dimension of the object is

dim(res)

## [1] 400   6

The rows on the top are

head(res)

##               lrt df nIso    p.value       FDR      Time
## MYCBP2  1.2786148  3    3 0.73421736 0.8838459 0.2462955
## ITGA3   1.8306987  2    2 0.40037674 0.6656263 0.4099042
## GGNBP2  7.0888065  3    3 0.06912043 0.3793880 0.2460728
## CCDC124 7.2913562  2    2 0.02610370 0.2711072 0.2867465
## ST3GAL1 4.8565803  3    3 0.18260536 0.5313799 0.2581735
## CD9     0.9654634  3    3 0.80960795 0.9125242 0.2184870

and on the bottom are

tail(res)

##                    lrt df nIso   p.value       FDR       Time
## LYN           3.134356  2    2 0.2086332 0.5313799 0.09197474
## CHMP4A        1.389586  2    2 0.4991777 0.7187187 0.14126539
## LSM14A        4.559027  2    2 0.1023340 0.4665358 0.18501544
## CHURC1        4.269844  2    2 0.1182538 0.4665358 0.14422417
## RP11-356K23.1 0.508017  2    2 0.7756852 0.8997046 0.22349191
## RP1-178F10.3  2.487001  2    2 0.2883729 0.5958529 0.10903144

Each row of the matrix gives the test result for each gene and the row names indicate the names (IDs) of the genes. The columns from left to right are LRT statistic (lrt), degree of freedom (df), Number of isoforms (nIso), p-values of LRT (p.value), FDR (FDR) and time used for fitting the linear mixed effects model in second (Time). Note that there may be missing results for some genes because the corresponding linear mixed models procedures do not converge. Let’s see if there are such genes.

p <- which(is.na(res[, 1]))
p

## DMBT1 
##   327

res[(p-1):(p+1), ]

##               lrt df nIso   p.value       FDR       Time
## FNBP1  4.18255218  3    3 0.2424146 0.5611718 0.16614842
## DMBT1          NA  3    3        NA        NA 0.16569257
## PLSCR1 0.04271755  2    2 0.9788677 0.9924697 0.08884215

We see that there is one such gene called DMBT1.

Next let’s retain only those that are called significant at level FDR = 0.05 and save the results in a new object called sig.res1

sig.res1 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
sig.res1

##                lrt df nIso      p.value         FDR      Time
## GNPTAB    23.86920  5    5 2.300448e-04 0.013112555 0.3510180
## VEGFA     18.07379  2    2 1.189393e-04 0.013112555 0.1600897
## LIFR      15.70857  3    3 1.301145e-03 0.037082633 0.1950824
## HNRNPA2B1 29.81071  6    6 4.270285e-05 0.008519219 0.7523081
## RRBP1     18.52259  3    3 3.431237e-04 0.013428748 0.1561484
## PRKAA1    19.51404  3    3 2.140158e-04 0.013112555 0.3414207
## POSTN     31.05287  5    5 9.144749e-06 0.003648755 0.2998912
## FRMD4A    22.40986  4    4 1.660719e-04 0.013112555 0.2389264
## TRIP12    28.24006  7    7 1.989194e-04 0.013112555 0.5914843
## C3orf17   21.09291  4    4 3.035165e-04 0.013428748 0.2080283
## FZD6      15.80285  2    2 3.702161e-04 0.013428748 0.1618197
## ASXL1     18.77488  3    3 3.043197e-04 0.013428748 0.1315994
## AFF1      16.76817  3    3 7.887221e-04 0.026225008 0.1432698
## DENND4A   16.05417  3    3 1.105349e-03 0.033925697 0.1352131

We see that there are total 14 genes that passed the screening test at level FDR = 0.05.

Performing step 2 test using the standard t-test

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 14 genes that are differentially expressed using function step2.Ftest() as follows

final.res1 <- step2.Ftest(data=data, file.step1="output/demo_uneqvar_type1.RData", alpha=0.05, method="holm")

## Number of significant genes: 14 
## Cutoff for significant isoform is 0.001754386

Note that there is an argument called method in function step2.Ftest and the following functon mc.step2.Waldtest, which is the option for controlling the family-wise error rate required by our two-step procedure. The options are methods of Bonferroni (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”) and Hommel (“hommel”). We use Holm option throughout the tutorial. The function returns the results of the confirmatory test for all the isoforms of the above 14 genes in a data.frame object called final.res1. Let’s take a look at the results.

dim(final.res1)

## [1] 53  7

head(final.res1)

##     gene         isoform        FC step2.p.value       adj.p    threshold decision
## 1 GNPTAB ENST00000299314 -1.424071   0.003505942 0.017529709 0.0003508772    FALSE
## 2 GNPTAB ENST00000549165  2.320880   0.023427021 0.093708082 0.0004385965    FALSE
## 3 GNPTAB ENST00000549194 -2.002233   0.047714461 0.143143384 0.0005847953    FALSE
## 4 GNPTAB ENST00000552009 -2.072173   0.162558101 0.162558101 0.0017543860    FALSE
## 5 GNPTAB ENST00000552681 -1.375782   0.048686265 0.143143384 0.0005967059    FALSE
## 6  VEGFA ENST00000372067  4.650730   0.009558154 0.009558154 0.0017543860    FALSE

The columns from left to right are gene name (gene), isoform ID (isoform), Fold Change (FC), p-value of the step two test (step2.p.value), adjusted p-value (adj.p), and threshold to determine if the step2.p.value is significant, and decison (decision). An isoform is called significant or considered differentially expressed if the decision is TRUE. The isoforms that are called significant are as follows

final.res1[final.res1$decision, ]

##         gene         isoform         FC step2.p.value        adj.p    threshold decision
## 7      VEGFA ENST00000497139   7.576606  6.257982e-04 0.0012515964 0.0008771930     TRUE
## 8       LIFR ENST00000263409  -5.235773  3.557282e-04 0.0010671847 0.0005847953     TRUE
## 12 HNRNPA2B1 ENST00000356674   1.682557  5.628982e-05 0.0003377389 0.0002923977     TRUE
## 22    PRKAA1 ENST00000511248   7.337378  2.496545e-04 0.0007489636 0.0005847953     TRUE
## 34    TRIP12 ENST00000428959 -14.647464  2.376155e-04 0.0016633087 0.0002506266     TRUE
## 44      FZD6 ENST00000522566   8.590236  3.271466e-04 0.0006542932 0.0008771930     TRUE

We see that with the standard t-test option we identified 6 isoforms that are called significant.

Performing step 2 test using the Wald test based on the linear mixed-effects model

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 14 genes that are differentially expressed using function mc.step2.Waldtest as follows

final.res1.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/demo_uneqvar_type1.RData", alpha=0.05, method="holm")

## Number of significant genes: 14 
## Fitting mixed effects model. This may take a while...
## Model fitting is completed
## # Number of total genes tested: 399 
## # Number of significant genes: 14 
## # Cutoff for significant isoform is 0.001754386 
## # Number of significant isoform: 20

Let’s take a look at the results returned by above command.

dim(final.res1.wald)

## [1] 53  6

head(final.res1.wald)

##     gene         isoform step2.p.value        adj.p    threshold decision
## 1 GNPTAB ENST00000299314  8.061455e-05 0.0004030728 0.0003508772     TRUE
## 2 GNPTAB ENST00000549165  4.857991e-03 0.0194319658 0.0004385965    FALSE
## 3 GNPTAB ENST00000549194  3.044334e-02 0.0608866723 0.0008771930    FALSE
## 4 GNPTAB ENST00000552009  1.045219e-01 0.1045219228 0.0017543860    FALSE
## 5 GNPTAB ENST00000552681  1.312816e-02 0.0393844689 0.0005847953    FALSE
## 6  VEGFA ENST00000372067  8.824491e-04 0.0008824491 0.0017543860     TRUE

tail(final.res1.wald)

##       gene         isoform step2.p.value        adj.p    threshold decision
## 48    AFF1 ENST00000307808  5.968291e-01 5.968291e-01 0.0017543860    FALSE
## 49    AFF1 ENST00000395146  1.082247e-06 3.246741e-06 0.0005847953     TRUE
## 50    AFF1 ENST00000504956  2.061851e-01 4.123703e-01 0.0008771930    FALSE
## 51 DENND4A ENST00000561863  3.688310e-04 7.376620e-04 0.0008771930     TRUE
## 52 DENND4A ENST00000562028  9.952405e-05 2.985721e-04 0.0005847953     TRUE
## 53 DENND4A ENST00000562540  6.663129e-01 6.663129e-01 0.0017543860    FALSE

The isoforms that are called significant are as follows

final.res1.wald[final.res1.wald$decision, ]

##         gene         isoform step2.p.value        adj.p    threshold decision
## 1     GNPTAB ENST00000299314  8.061455e-05 4.030728e-04 0.0003508772     TRUE
## 6      VEGFA ENST00000372067  8.824491e-04 8.824491e-04 0.0017543860     TRUE
## 7      VEGFA ENST00000497139  7.287226e-07 1.457445e-06 0.0008771930     TRUE
## 8       LIFR ENST00000263409  1.273216e-07 3.819648e-07 0.0005847953     TRUE
## 12 HNRNPA2B1 ENST00000356674  5.720757e-11 3.432454e-10 0.0002923977     TRUE
## 19     RRBP1 ENST00000495501  2.488224e-05 7.464671e-05 0.0005847953     TRUE
## 22    PRKAA1 ENST00000511248  3.032114e-08 9.096341e-08 0.0005847953     TRUE
## 23     POSTN ENST00000379743  6.157335e-05 1.847200e-04 0.0005847953     TRUE
## 25     POSTN ENST00000478947  4.803759e-07 1.921503e-06 0.0004385965     TRUE
## 26     POSTN ENST00000497145  8.679231e-04 1.735846e-03 0.0008771930     TRUE
## 27     POSTN ENST00000541179  8.160369e-08 4.080184e-07 0.0003508772     TRUE
## 28    FRMD4A ENST00000358621  9.130130e-06 2.941287e-05 0.0005445837     TRUE
## 30    FRMD4A ENST00000475325  8.193067e-04 1.638613e-03 0.0008771930     TRUE
## 31    FRMD4A ENST00000492155  7.353218e-06 2.941287e-05 0.0004385965     TRUE
## 34    TRIP12 ENST00000428959  4.191024e-08 2.933717e-07 0.0002506266     TRUE
## 41   C3orf17 ENST00000474311  8.501880e-07 3.400752e-06 0.0004385965     TRUE
## 44      FZD6 ENST00000522566  8.117205e-08 1.623441e-07 0.0008771930     TRUE
## 49      AFF1 ENST00000395146  1.082247e-06 3.246741e-06 0.0005847953     TRUE
## 51   DENND4A ENST00000561863  3.688310e-04 7.376620e-04 0.0008771930     TRUE
## 52   DENND4A ENST00000562028  9.952405e-05 2.985721e-04 0.0005847953     TRUE

We see that with the Wald test option we identified 20 isoforms that are called significant.

Now we check if the isoforms identified by t-test are elements of that identified by the Wald test as follows

is.element(final.res1[final.res1$decision, 2], final.res1.wald[final.res1.wald$decision, 2])

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

We see that the isoforms identified by t-test are a proper subset of that identified by the Wald test.

Analysis based on Type 2 screening test

Performing step 1 test

The procedure for the analysis based on Type 2 screening test is the same as that based on Type 1 test except that the argument screen.type of function lrt.test.mclapply needs to be specified as 2 rather than 1. We perform the Type 2 screening test by submitting following command:

res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", outfile="output/demo_uneqvar_type2.RData", screen.type=2)

Let’s take a look at the results

load(file="output/demo_uneqvar_type2.RData")

The dimension of the object is

dim(res)

## [1] 400   6

The rows on the top are

head(res)

##              lrt df nIso    p.value       FDR      Time
## MYCBP2  1.222056  2    3 0.54279251 0.8583272 0.4416478
## ITGA3   1.041377  1    2 0.30750133 0.7247150 0.6149364
## GGNBP2  6.229158  2    3 0.04439720 0.3838666 0.3626785
## CCDC124 5.445468  1    2 0.01961916 0.3439145 0.6217566
## ST3GAL1 3.596824  2    3 0.16556158 0.5512508 0.2834027
## CD9     0.745200  2    3 0.68894074 0.8942145 0.6414995

and on the bottom are

tail(res)

##                     lrt df nIso    p.value       FDR      Time
## LYN           3.1108992  1    2 0.07777001 0.4239160 0.2211494
## CHMP4A        0.3155459  1    2 0.57429679 0.8708392 0.2821398
## LSM14A        4.5259783  1    2 0.03338398 0.3614521 0.2982352
## CHURC1        3.9609811  1    2 0.04656655 0.3838666 0.2569530
## RP11-356K23.1 0.2177605  1    2 0.64075166 0.8797750 0.4486036
## RP1-178F10.3  1.7910057  1    2 0.18080368 0.5725450 0.1202860

Let’s check whether there are any genes with missing results.

p <- which(is.na(res[, 1]))
p

## DMBT1 
##   327

res[(p-1):(p+1), ]

##                lrt df nIso   p.value       FDR      Time
## FNBP1  1.063860794  2    3 0.5874698 0.8708392 0.4132810
## DMBT1           NA  2    3        NA        NA 0.3211651
## PLSCR1 0.003553308  1    2 0.9524665 0.9724736 0.2541683

The gene with missing result is the same as that in Type 1 screening test.

Next let’s retain only those that are called significant at level FDR = 0.05 and save the results in a new object called sig.res2

sig.res2 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
sig.res2

##                lrt df nIso      p.value        FDR      Time
## HNRNPA2B1 23.97014  5    6 2.200007e-04 0.01755605 1.9420850
## RRBP1     18.52213  2    3 9.505402e-05 0.01231386 0.1800642
## PRKAA1    18.05272  2    3 1.201991e-04 0.01231386 0.8521507
## TRIP12    27.36960  6    7 1.234472e-04 0.01231386 1.5939834
## C3orf17   20.79593  3    4 1.160648e-04 0.01231386 0.5258532

We see that there is a total of 5 genes that passed the screening test at level FDR = 0.05. All these genes are a proper subset of that identified by the Type 1 screening test.

Performing step 2 test using the standard t-test

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 5 genes that are differentially expressed using function step2.Ftest() as follows

final.res2 <- step2.Ftest(data=data, file.step1="output/demo_uneqvar_type2.RData", alpha=0.05, method="holm")

## Number of significant genes: 5 
## Cutoff for significant isoform is 0.0006265664

dim(final.res2)

## [1] 23  7

head(final.res2)

##        gene         isoform        FC step2.p.value        adj.p    threshold decision
## 1 HNRNPA2B1 ENST00000354667 -1.522879  5.324928e-01 1.0000000000 0.0003336421    FALSE
## 2 HNRNPA2B1 ENST00000356674  1.682557  5.628982e-05 0.0003377389 0.0001044277     TRUE
## 3 HNRNPA2B1 ENST00000360787  1.207453  8.030475e-01 1.0000000000 0.0005031626    FALSE
## 4 HNRNPA2B1 ENST00000463181 -1.401538  4.395106e-02 0.2197552996 0.0001253133    FALSE
## 5 HNRNPA2B1 ENST00000476233  4.267469  6.709748e-02 0.2683899222 0.0001566416    FALSE
## 6 HNRNPA2B1 ENST00000495810 -1.504948  4.740218e-01 1.0000000000 0.0002970062    FALSE

The isoforms that are called significant are as follows

final.res2[final.res2$decision, ]

##        gene         isoform       FC step2.p.value        adj.p    threshold decision
## 2 HNRNPA2B1 ENST00000356674 1.682557  5.628982e-05 0.0003377389 0.0001044277     TRUE

We see that with the standard t-test option we identified only 1 isoform that is called significant.

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Next we perform the step 2 (confirmatory) test to identify the isoforms of these 5 genes that are differentially expressed using function mc.step2.Waldtest as follows

final.res2.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/demo_uneqvar_type2.RData", alpha=0.05, method="holm")

## Number of significant genes: 5 
## Fitting mixed effects model. This may take a while...
## Model fitting is completed
## # Number of total genes tested: 399 
## # Number of significant genes: 5 
## # Cutoff for significant isoform is 0.0006265664 
## # Number of significant isoform: 5

Let’s take a look at the results returned by above command.

dim(final.res2.wald)

## [1] 23  6

head(final.res2.wald)

##        gene         isoform step2.p.value        adj.p    threshold decision
## 1 HNRNPA2B1 ENST00000354667  4.875446e-01 1.000000e+00 0.0003054791    FALSE
## 2 HNRNPA2B1 ENST00000356674  5.720757e-11 3.432454e-10 0.0001044277     TRUE
## 3 HNRNPA2B1 ENST00000360787  7.829920e-01 1.000000e+00 0.0004905965    FALSE
## 4 HNRNPA2B1 ENST00000463181  1.506111e-02 7.530553e-02 0.0001253133    FALSE
## 5 HNRNPA2B1 ENST00000476233  2.967788e-02 1.187115e-01 0.0001566416    FALSE
## 6 HNRNPA2B1 ENST00000495810  4.246485e-01 1.000000e+00 0.0002660705    FALSE

tail(final.res2.wald)

##       gene         isoform step2.p.value        adj.p    threshold decision
## 18  TRIP12 ENST00000470302  5.758099e-01 1.000000e+00 0.0003607832    FALSE
## 19  TRIP12 ENST00000487178  5.098594e-01 1.000000e+00 0.0003194608    FALSE
## 20 C3orf17 ENST00000383675  2.849060e-01 8.547179e-01 0.0002088555    FALSE
## 21 C3orf17 ENST00000472705  3.840888e-01 8.547179e-01 0.0002815633    FALSE
## 22 C3orf17 ENST00000474311  8.501880e-07 3.400752e-06 0.0001566416     TRUE
## 23 C3orf17 ENST00000494891  6.967420e-01 8.547179e-01 0.0005107593    FALSE

The isoforms that are called significant are as follows

final.res2.wald[final.res2.wald$decision, ]

##         gene         isoform step2.p.value        adj.p    threshold decision
## 2  HNRNPA2B1 ENST00000356674  5.720757e-11 3.432454e-10 1.044277e-04     TRUE
## 9      RRBP1 ENST00000495501  2.488224e-05 7.464671e-05 2.088555e-04     TRUE
## 12    PRKAA1 ENST00000511248  3.032114e-08 9.096341e-08 2.088555e-04     TRUE
## 15    TRIP12 ENST00000428959  4.191024e-08 2.933717e-07 8.950949e-05     TRUE
## 22   C3orf17 ENST00000474311  8.501880e-07 3.400752e-06 1.566416e-04     TRUE

We see that with the Wald test option we identified 5 isoforms that are called significant and the isoform identified by t-test is one of them.

Plots of isoform abundances

There are two functions plot.gene() and plot.gene.bat() in R file [Tools.R] that can be used to make plots of isoform abundances for given genes. plot.gene() plots one gene specified by the argument gene and output the figure to a given graphic device. plot.gene.bat() plots a series of genes by assigning a vector of gene names to the argument gene in the function. The following command will create the plots for the 14 genes that passed the Type 1 screening test and output the results to a pdf file demo_figure.pdf.

load(file="data/data4demo.RData")
plot.gene.bat(data=data, gene=rownames(sig.res1), xlim=c(-0.2, 2.3), lwd=3, isoform.id.opt=1, Phenotype="Tumor Status", file="demo_figure.pdf", pPanel=c(2, 2), width=7, height=9, y.lab=c("Free of Cancer", "Not"))

## png 
##   2

Next we make the isoform plots for two genes LIFR and HNRNPA2B1 using function plot.gene().

par(mfrow=c(1, 2))
dat<-plot.gene(data=data, gene="LIFR", Phenotype="Tumor Status", xlim = c(-0.05, 1.5), isoform.id.opt=1)
dat<-plot.gene(data=data, gene="HNRNPA2B1", Phenotype="Tumor Status", xlim = c(-0.05, 1.5), isoform.id.opt=1)

Gene LIFR is a typical gene that passed the Type 1 screening test but the Type 2 screening test. Form the plot of the gene we see that there is a parallel pattern; all the three isoforms of the genes are up-regulated in patients who are free of cancer and down-regulated in patients who are not. There is no evidence for the differential splicing or isoform switches. On the other hand, gene HNRNPA2B1 passed both screening tests. We see an interaction pattern, where some of the isoforms are up-regulated in patients who are free of cancer and down-regulated in patients who are not while other isoforms are opposite. There might be isoform switching events for this gene.

Conclusion

We present this tutorial as a companion to the manuscript submitted for publication to PLOS ONE. The tutorial shows step by step how to identify genes that have differentially expressed/spliced isoforms using the R functions that we provided in the R file Tools.R. When users have a need to analyze their own data using our proposed methods they only need to prepare the data in the format as describted in Section Data set and Data Format. All of the analyses can be easily done by following the example presented in the tutorial. However, some modifications may still be required to improve the algorithms. We welcome the feedback from all the users who are interested in our approach.

Two-Step Mixed Model Approach to Analyzing Differential Alternative RNA Splicing – Tutorial

Li Luo, Huining Kang*, Xichen Li, Scott A. Ness, Christine A. Stidley

6/14/2020

Introduction

Setup for the Analysis

Data set and Data Format

Performing Analyses

Analysis based on Type 1 screening test

Performing step 1 test

Performing step 2 test using the standard t-test

Performing step 2 test using the Wald test based on the linear mixed-effects model

Analysis based on Type 2 screening test

Performing step 1 test

Performing step 2 test using the standard t-test

Performing step 2 test using the the Wald test based on the linear mixed-effects model

Plots of isoform abundances

Conclusion