\(*\) Correspondence should be addressed to Huining Kang (HuKang@salud.unm.edu).
We recently developed a new statistical method for analyzing differentially expressed/spliced transcript isoforms, which appropriately account for the dependence between isoforms and multiple testing corrections for the multi-dimensional structure of at both the gene-and isoform-level. We proposed a linear mixed effects model-based approach for analyzing the complex alternative RNA splicing regulation patterns detected by whole-transcriptome RNA-Sequencing technologies. This approach thoroughly characterizes and differentiates three types of genes related to alternative RNA splicing events with distinct differential expression/splicing patterns. We applied the concept of appropriately controlling for the gene-level overall false discovery rate (OFDR) in this multi-dimensional alternative RNA splicing analysis utilizing a two-step hierarchical hypothesis testing framework. In the initial screening test we identify genes that have differentially expressed or spliced isoforms; in the subsequent confirmatory testing stage we examine only the isoforms of genes that have passed the screening tests. The methodology is described in detail in the manuscript titled Two-step mixed model approach to analyzing differential alternative RNA splicing submitted to PLOS ONE. We also provided an R program file (Tools.R) that included all the functions for performing the analysis with the proposed method. The R program file can be downloaded here. In this tutorial we describe step-by-step how to use these functions in R to perform the analyses that we have proposed in our manuscript.
Before performing the analysis we suggest you create a folder as the working directory which you may name as, for example, WD and download the R file Tools.R to this folder. Under this folder we suggest you create two subfolders and name them as data and output, saving the original data file in the data subfolder and the results from calling R functions in the output subfolder.
After starting R, use R function setwd()
to set the current working directory to the working directory you have created and then load the R file Tools.R by submitting command source("Tools.R")
. This will also load all the necessary R packages, including nlme, aod, gtools, genefilter, and parallel. Please pre-install these packages before running this Tutorial.
# Clear memmory
rm(list=ls())
# load the R functions and R packages
source("Tools.R")
Note that the R programs will use the multicores of the computer to speed up the computation under Linux. Currently this option is not available under Windows.
This tutorial uses a small data set saved in an R data file called data4demo.RData
that is available for download here. The data set consists of a subset of 400 genes from the RNA-Sequencing study of adenoid cystic carcinoma (ACC) reported in our manuscript with the genes hand-picked for ease of computation and demonstration. The data set contains isoform abundances of the 400 genes from 14 ACC patients. Eight of the patients are free of cancer while the rest of six are not during the follow-up time. The objective of the analysis is to identify the genes that have differentially expressed/spliced isoforms between the two medical conditions as well as those specific isoforms that are differentially expressed/spliced.
The R data file data4demo.RData
contains an R list object called data. The following commands will load this object into the memmory and show the elements of the object.
# load data to the memory
load(file="data/data4demo.RData")
# list the elements of object data
str(data)
## List of 6
## $ x : num [1:984, 1:14] 2.631 2.409 0.637 2.668 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:984] "ENST00000357337" "ENST00000474882" "ENST00000485061" "ENST00000507771" ...
## .. ..$ : NULL
## $ y : num [1:14] 1 1 1 2 2 1 2 2 1 2 ...
## $ gene_id: chr [1:984] "ENSG00000005810" "ENSG00000005810" "ENSG00000005810" "ENSG00000005884" ...
## $ gene : chr [1:984] "MYCBP2" "MYCBP2" "MYCBP2" "ITGA3" ...
## $ isoform: chr [1:984] "ENST00000357337" "ENST00000474882" "ENST00000485061" "ENST00000507771" ...
## $ y.lab : chr [1:2] "Free of Cancer" "Not Free of Cancer"
We can see that the data object has six elements: x
, y
, gene_id
, gene
, isoform
, and y.lab
. The x
is a \(p\) by \(n\) matrix of isoform abundance (expression) measurements (in base 2 log scale) with \(p=984\) and \(n=14\). Each row of the matrix represents the abundance measurements for each isoform and each column corresponds to one experimental sample, i.e. one patient in the current example. The y
is a vector of numerical labels indicating the medical conditions of the samples, corresponding to the columns of matrix x
. The gene_id
is a vector of Ensembl gene IDs, the gene
, a vector of gene symbols, and the isoform
, a vector of isoform IDs, corresponding to the rows of matrix x
. The y.lab
provides descriptions of the medical conditions that the values of y
represent, i.e. 1 = “Free of Cancer”, 2 = “Not Free of Cancer”. Note that elements gene_id
and y.lab
are not required in this tutorial.
The following command shows the number of unique genes in this data set:
length(unique(data$gene))
## [1] 400
The function find.nL()
can be used to calculate the number of isoforms for each gene. It returns a numeric vector with each element being the number of isoforms of one gene and the names of the elements being gene names.
nL <- find.nL(data=data)
The numbers of isoforms for the first 5 genes are
head(nL, n=5)
## MYCBP2 ITGA3 GGNBP2 CCDC124 ST3GAL1
## 3 2 3 2 3
The numbers of isoforms for the last 5 genes are
tail(nL, n=5)
## CHMP4A LSM14A CHURC1 RP11-356K23.1 RP1-178F10.3
## 2 2 2 2 2
We can summarize all the numbers as follows:
table(nL)
## nL
## 2 3 4 5 6 7
## 262 100 33 3 1 1
The result indicates that there are 262 genes that have 2 isoforms for each, 100 genes that have 3 isoforms for each, and so on so forth.
Now that we have a good understanding of the data, we are in a position to perform the analysis of the data. Our method features a two-step hypothesis testing framework. In the first step we perform a screening test to identify genes that have differentially expressed or spliced isoforms; in the second step we perform a confirmatory test to examine only the isoforms of genes that have passed the screening tests.
There are some parameters and options that we need to determine beforehand. The first is the significance level. It is determined by specifying a threshold for the overall false discovery level (OFDR). In general, we set the significance level at OFDR = 0.05. Note that the OFDR is equivalent to the false discovery rate (FDR) proposed by Benjamini and Hochberg (1995) if we perform only the screening test with a purpose to only identify genes that have differentially expressed/spliced isoforms (refer to the manuscript for details.) Hence in the related R functions the argument for it is FDR. The second is to choose between two types of screening tests. Type 1 is to identify genes that have either differentially expressed or differentially spliced isoforms. Type 2 is to identify genes that have differentially spliced isoforms. If the study has a focus on the differentially spliced isoforms especially on identifying isoform switches, then Type 2 should be used. The third is the covariance structure. There are three options: cs
(compound symmetry with equal variances), uneqvar
(compound symmetry with unequal variances), and unstr
(unstructured), where cs
corresponds to a more strict assumption while unstr
corresponds to a most liberal assumption. We suggest using uneqvar
in general cases and we will use it in this tutorial. The forth is the method for the step-two analysis. There are two options, the standard t-test and model based Wald test.
The first step is to run the screening test, the likelihood ratio test (LRT) by calling function lrt.test.mclapply()
. This is the most time-consuming part. It could take hours or even days depending on the total number of genes, the sample size, the options on the covariance structure chosen, and the power of the machine. The five arguments of this function are data
, VarCov.type
, nCore
, outfile
, and screen.type
. data
is the name of data object. VarCov.type
is the covariance structure with three options cs
, uneqvar
, and unstr
as described previously. nCore
is the number of cores to use and default value is the total number of the machine cores minus one. outfile
is the name of the R file to save the result. screen.type
is the type of the screening test. We proceed this step by submitting following command: res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", outfile="output/demo_uneqvar_type1.RData", screen.type=1)
This command returns a matrix object res
which is also saved in the file demo_uneqvar_type1.RData
under the subfolder output. Now we load the object from the file and take a look.
load(file="output/demo_uneqvar_type1.RData")
The dimension of the object is
dim(res)
## [1] 400 6
The rows on the top are
head(res)
## lrt df nIso p.value FDR Time
## MYCBP2 1.2786148 3 3 0.73421736 0.8838459 0.2462955
## ITGA3 1.8306987 2 2 0.40037674 0.6656263 0.4099042
## GGNBP2 7.0888065 3 3 0.06912043 0.3793880 0.2460728
## CCDC124 7.2913562 2 2 0.02610370 0.2711072 0.2867465
## ST3GAL1 4.8565803 3 3 0.18260536 0.5313799 0.2581735
## CD9 0.9654634 3 3 0.80960795 0.9125242 0.2184870
and on the bottom are
tail(res)
## lrt df nIso p.value FDR Time
## LYN 3.134356 2 2 0.2086332 0.5313799 0.09197474
## CHMP4A 1.389586 2 2 0.4991777 0.7187187 0.14126539
## LSM14A 4.559027 2 2 0.1023340 0.4665358 0.18501544
## CHURC1 4.269844 2 2 0.1182538 0.4665358 0.14422417
## RP11-356K23.1 0.508017 2 2 0.7756852 0.8997046 0.22349191
## RP1-178F10.3 2.487001 2 2 0.2883729 0.5958529 0.10903144
Each row of the matrix gives the test result for each gene and the row names indicate the names (IDs) of the genes. The columns from left to right are LRT statistic (lrt
), degree of freedom (df
), Number of isoforms (nIso
), p-values of LRT (p.value
), FDR (FDR
) and time used for fitting the linear mixed effects model in second (Time
). Note that there may be missing results for some genes because the corresponding linear mixed models procedures do not converge. Let’s see if there are such genes.
p <- which(is.na(res[, 1]))
p
## DMBT1
## 327
res[(p-1):(p+1), ]
## lrt df nIso p.value FDR Time
## FNBP1 4.18255218 3 3 0.2424146 0.5611718 0.16614842
## DMBT1 NA 3 3 NA NA 0.16569257
## PLSCR1 0.04271755 2 2 0.9788677 0.9924697 0.08884215
We see that there is one such gene called DMBT1.
Next let’s retain only those that are called significant at level FDR = 0.05 and save the results in a new object called sig.res1
sig.res1 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
sig.res1
## lrt df nIso p.value FDR Time
## GNPTAB 23.86920 5 5 2.300448e-04 0.013112555 0.3510180
## VEGFA 18.07379 2 2 1.189393e-04 0.013112555 0.1600897
## LIFR 15.70857 3 3 1.301145e-03 0.037082633 0.1950824
## HNRNPA2B1 29.81071 6 6 4.270285e-05 0.008519219 0.7523081
## RRBP1 18.52259 3 3 3.431237e-04 0.013428748 0.1561484
## PRKAA1 19.51404 3 3 2.140158e-04 0.013112555 0.3414207
## POSTN 31.05287 5 5 9.144749e-06 0.003648755 0.2998912
## FRMD4A 22.40986 4 4 1.660719e-04 0.013112555 0.2389264
## TRIP12 28.24006 7 7 1.989194e-04 0.013112555 0.5914843
## C3orf17 21.09291 4 4 3.035165e-04 0.013428748 0.2080283
## FZD6 15.80285 2 2 3.702161e-04 0.013428748 0.1618197
## ASXL1 18.77488 3 3 3.043197e-04 0.013428748 0.1315994
## AFF1 16.76817 3 3 7.887221e-04 0.026225008 0.1432698
## DENND4A 16.05417 3 3 1.105349e-03 0.033925697 0.1352131
We see that there are total 14 genes that passed the screening test at level FDR = 0.05.
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 14 genes that are differentially expressed using function step2.Ftest()
as follows
final.res1 <- step2.Ftest(data=data, file.step1="output/demo_uneqvar_type1.RData", alpha=0.05, method="holm")
## Number of significant genes: 14
## Cutoff for significant isoform is 0.001754386
Note that there is an argument called method in function step2.Ftest and the following functon mc.step2.Waldtest, which is the option for controlling the family-wise error rate required by our two-step procedure. The options are methods of Bonferroni (“bonferroni”), Holm (“holm”), Hochberg (“hochberg”) and Hommel (“hommel”). We use Holm option throughout the tutorial. The function returns the results of the confirmatory test for all the isoforms of the above 14 genes in a data.frame object called final.res1. Let’s take a look at the results.
dim(final.res1)
## [1] 53 7
head(final.res1)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 GNPTAB ENST00000299314 -1.424071 0.003505942 0.017529709 0.0003508772 FALSE
## 2 GNPTAB ENST00000549165 2.320880 0.023427021 0.093708082 0.0004385965 FALSE
## 3 GNPTAB ENST00000549194 -2.002233 0.047714461 0.143143384 0.0005847953 FALSE
## 4 GNPTAB ENST00000552009 -2.072173 0.162558101 0.162558101 0.0017543860 FALSE
## 5 GNPTAB ENST00000552681 -1.375782 0.048686265 0.143143384 0.0005967059 FALSE
## 6 VEGFA ENST00000372067 4.650730 0.009558154 0.009558154 0.0017543860 FALSE
The columns from left to right are gene name (gene
), isoform ID (isoform
), Fold Change (FC
), p-value of the step two test (step2.p.value
), adjusted p-value (adj.p
), and threshold to determine if the step2.p.value
is significant, and decison (decision
). An isoform is called significant or considered differentially expressed if the decision is TRUE. The isoforms that are called significant are as follows
final.res1[final.res1$decision, ]
## gene isoform FC step2.p.value adj.p threshold decision
## 7 VEGFA ENST00000497139 7.576606 6.257982e-04 0.0012515964 0.0008771930 TRUE
## 8 LIFR ENST00000263409 -5.235773 3.557282e-04 0.0010671847 0.0005847953 TRUE
## 12 HNRNPA2B1 ENST00000356674 1.682557 5.628982e-05 0.0003377389 0.0002923977 TRUE
## 22 PRKAA1 ENST00000511248 7.337378 2.496545e-04 0.0007489636 0.0005847953 TRUE
## 34 TRIP12 ENST00000428959 -14.647464 2.376155e-04 0.0016633087 0.0002506266 TRUE
## 44 FZD6 ENST00000522566 8.590236 3.271466e-04 0.0006542932 0.0008771930 TRUE
We see that with the standard t-test option we identified 6 isoforms that are called significant.
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 14 genes that are differentially expressed using function mc.step2.Waldtest
as follows
final.res1.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/demo_uneqvar_type1.RData", alpha=0.05, method="holm")
## Number of significant genes: 14
## Fitting mixed effects model. This may take a while...
## Model fitting is completed
## # Number of total genes tested: 399
## # Number of significant genes: 14
## # Cutoff for significant isoform is 0.001754386
## # Number of significant isoform: 20
Let’s take a look at the results returned by above command.
dim(final.res1.wald)
## [1] 53 6
head(final.res1.wald)
## gene isoform step2.p.value adj.p threshold decision
## 1 GNPTAB ENST00000299314 8.061455e-05 0.0004030728 0.0003508772 TRUE
## 2 GNPTAB ENST00000549165 4.857991e-03 0.0194319658 0.0004385965 FALSE
## 3 GNPTAB ENST00000549194 3.044334e-02 0.0608866723 0.0008771930 FALSE
## 4 GNPTAB ENST00000552009 1.045219e-01 0.1045219228 0.0017543860 FALSE
## 5 GNPTAB ENST00000552681 1.312816e-02 0.0393844689 0.0005847953 FALSE
## 6 VEGFA ENST00000372067 8.824491e-04 0.0008824491 0.0017543860 TRUE
tail(final.res1.wald)
## gene isoform step2.p.value adj.p threshold decision
## 48 AFF1 ENST00000307808 5.968291e-01 5.968291e-01 0.0017543860 FALSE
## 49 AFF1 ENST00000395146 1.082247e-06 3.246741e-06 0.0005847953 TRUE
## 50 AFF1 ENST00000504956 2.061851e-01 4.123703e-01 0.0008771930 FALSE
## 51 DENND4A ENST00000561863 3.688310e-04 7.376620e-04 0.0008771930 TRUE
## 52 DENND4A ENST00000562028 9.952405e-05 2.985721e-04 0.0005847953 TRUE
## 53 DENND4A ENST00000562540 6.663129e-01 6.663129e-01 0.0017543860 FALSE
The isoforms that are called significant are as follows
final.res1.wald[final.res1.wald$decision, ]
## gene isoform step2.p.value adj.p threshold decision
## 1 GNPTAB ENST00000299314 8.061455e-05 4.030728e-04 0.0003508772 TRUE
## 6 VEGFA ENST00000372067 8.824491e-04 8.824491e-04 0.0017543860 TRUE
## 7 VEGFA ENST00000497139 7.287226e-07 1.457445e-06 0.0008771930 TRUE
## 8 LIFR ENST00000263409 1.273216e-07 3.819648e-07 0.0005847953 TRUE
## 12 HNRNPA2B1 ENST00000356674 5.720757e-11 3.432454e-10 0.0002923977 TRUE
## 19 RRBP1 ENST00000495501 2.488224e-05 7.464671e-05 0.0005847953 TRUE
## 22 PRKAA1 ENST00000511248 3.032114e-08 9.096341e-08 0.0005847953 TRUE
## 23 POSTN ENST00000379743 6.157335e-05 1.847200e-04 0.0005847953 TRUE
## 25 POSTN ENST00000478947 4.803759e-07 1.921503e-06 0.0004385965 TRUE
## 26 POSTN ENST00000497145 8.679231e-04 1.735846e-03 0.0008771930 TRUE
## 27 POSTN ENST00000541179 8.160369e-08 4.080184e-07 0.0003508772 TRUE
## 28 FRMD4A ENST00000358621 9.130130e-06 2.941287e-05 0.0005445837 TRUE
## 30 FRMD4A ENST00000475325 8.193067e-04 1.638613e-03 0.0008771930 TRUE
## 31 FRMD4A ENST00000492155 7.353218e-06 2.941287e-05 0.0004385965 TRUE
## 34 TRIP12 ENST00000428959 4.191024e-08 2.933717e-07 0.0002506266 TRUE
## 41 C3orf17 ENST00000474311 8.501880e-07 3.400752e-06 0.0004385965 TRUE
## 44 FZD6 ENST00000522566 8.117205e-08 1.623441e-07 0.0008771930 TRUE
## 49 AFF1 ENST00000395146 1.082247e-06 3.246741e-06 0.0005847953 TRUE
## 51 DENND4A ENST00000561863 3.688310e-04 7.376620e-04 0.0008771930 TRUE
## 52 DENND4A ENST00000562028 9.952405e-05 2.985721e-04 0.0005847953 TRUE
We see that with the Wald test option we identified 20 isoforms that are called significant.
Now we check if the isoforms identified by t-test are elements of that identified by the Wald test as follows
is.element(final.res1[final.res1$decision, 2], final.res1.wald[final.res1.wald$decision, 2])
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
We see that the isoforms identified by t-test are a proper subset of that identified by the Wald test.
The procedure for the analysis based on Type 2 screening test is the same as that based on Type 1 test except that the argument screen.type
of function lrt.test.mclapply
needs to be specified as 2 rather than 1. We perform the Type 2 screening test by submitting following command:
res <- lrt.test.mclapply(data=data, VarCov.type="uneqvar", outfile="output/demo_uneqvar_type2.RData", screen.type=2)
Let’s take a look at the results
load(file="output/demo_uneqvar_type2.RData")
The dimension of the object is
dim(res)
## [1] 400 6
The rows on the top are
head(res)
## lrt df nIso p.value FDR Time
## MYCBP2 1.222056 2 3 0.54279251 0.8583272 0.4416478
## ITGA3 1.041377 1 2 0.30750133 0.7247150 0.6149364
## GGNBP2 6.229158 2 3 0.04439720 0.3838666 0.3626785
## CCDC124 5.445468 1 2 0.01961916 0.3439145 0.6217566
## ST3GAL1 3.596824 2 3 0.16556158 0.5512508 0.2834027
## CD9 0.745200 2 3 0.68894074 0.8942145 0.6414995
and on the bottom are
tail(res)
## lrt df nIso p.value FDR Time
## LYN 3.1108992 1 2 0.07777001 0.4239160 0.2211494
## CHMP4A 0.3155459 1 2 0.57429679 0.8708392 0.2821398
## LSM14A 4.5259783 1 2 0.03338398 0.3614521 0.2982352
## CHURC1 3.9609811 1 2 0.04656655 0.3838666 0.2569530
## RP11-356K23.1 0.2177605 1 2 0.64075166 0.8797750 0.4486036
## RP1-178F10.3 1.7910057 1 2 0.18080368 0.5725450 0.1202860
Let’s check whether there are any genes with missing results.
p <- which(is.na(res[, 1]))
p
## DMBT1
## 327
res[(p-1):(p+1), ]
## lrt df nIso p.value FDR Time
## FNBP1 1.063860794 2 3 0.5874698 0.8708392 0.4132810
## DMBT1 NA 2 3 NA NA 0.3211651
## PLSCR1 0.003553308 1 2 0.9524665 0.9724736 0.2541683
The gene with missing result is the same as that in Type 1 screening test.
Next let’s retain only those that are called significant at level FDR = 0.05 and save the results in a new object called sig.res2
sig.res2 <- res[!is.na(res[, 1]) & res[, "FDR"] <= 0.05, ]
sig.res2
## lrt df nIso p.value FDR Time
## HNRNPA2B1 23.97014 5 6 2.200007e-04 0.01755605 1.9420850
## RRBP1 18.52213 2 3 9.505402e-05 0.01231386 0.1800642
## PRKAA1 18.05272 2 3 1.201991e-04 0.01231386 0.8521507
## TRIP12 27.36960 6 7 1.234472e-04 0.01231386 1.5939834
## C3orf17 20.79593 3 4 1.160648e-04 0.01231386 0.5258532
We see that there is a total of 5 genes that passed the screening test at level FDR = 0.05. All these genes are a proper subset of that identified by the Type 1 screening test.
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 5 genes that are differentially expressed using function step2.Ftest()
as follows
final.res2 <- step2.Ftest(data=data, file.step1="output/demo_uneqvar_type2.RData", alpha=0.05, method="holm")
## Number of significant genes: 5
## Cutoff for significant isoform is 0.0006265664
dim(final.res2)
## [1] 23 7
head(final.res2)
## gene isoform FC step2.p.value adj.p threshold decision
## 1 HNRNPA2B1 ENST00000354667 -1.522879 5.324928e-01 1.0000000000 0.0003336421 FALSE
## 2 HNRNPA2B1 ENST00000356674 1.682557 5.628982e-05 0.0003377389 0.0001044277 TRUE
## 3 HNRNPA2B1 ENST00000360787 1.207453 8.030475e-01 1.0000000000 0.0005031626 FALSE
## 4 HNRNPA2B1 ENST00000463181 -1.401538 4.395106e-02 0.2197552996 0.0001253133 FALSE
## 5 HNRNPA2B1 ENST00000476233 4.267469 6.709748e-02 0.2683899222 0.0001566416 FALSE
## 6 HNRNPA2B1 ENST00000495810 -1.504948 4.740218e-01 1.0000000000 0.0002970062 FALSE
The isoforms that are called significant are as follows
final.res2[final.res2$decision, ]
## gene isoform FC step2.p.value adj.p threshold decision
## 2 HNRNPA2B1 ENST00000356674 1.682557 5.628982e-05 0.0003377389 0.0001044277 TRUE
We see that with the standard t-test option we identified only 1 isoform that is called significant.
Next we perform the step 2 (confirmatory) test to identify the isoforms of these 5 genes that are differentially expressed using function mc.step2.Waldtest
as follows
final.res2.wald <- mc.step2.Waldtest(data=data, VarCov.type="uneqvar", nCore = NULL, file.step1="output/demo_uneqvar_type2.RData", alpha=0.05, method="holm")
## Number of significant genes: 5
## Fitting mixed effects model. This may take a while...
## Model fitting is completed
## # Number of total genes tested: 399
## # Number of significant genes: 5
## # Cutoff for significant isoform is 0.0006265664
## # Number of significant isoform: 5
Let’s take a look at the results returned by above command.
dim(final.res2.wald)
## [1] 23 6
head(final.res2.wald)
## gene isoform step2.p.value adj.p threshold decision
## 1 HNRNPA2B1 ENST00000354667 4.875446e-01 1.000000e+00 0.0003054791 FALSE
## 2 HNRNPA2B1 ENST00000356674 5.720757e-11 3.432454e-10 0.0001044277 TRUE
## 3 HNRNPA2B1 ENST00000360787 7.829920e-01 1.000000e+00 0.0004905965 FALSE
## 4 HNRNPA2B1 ENST00000463181 1.506111e-02 7.530553e-02 0.0001253133 FALSE
## 5 HNRNPA2B1 ENST00000476233 2.967788e-02 1.187115e-01 0.0001566416 FALSE
## 6 HNRNPA2B1 ENST00000495810 4.246485e-01 1.000000e+00 0.0002660705 FALSE
tail(final.res2.wald)
## gene isoform step2.p.value adj.p threshold decision
## 18 TRIP12 ENST00000470302 5.758099e-01 1.000000e+00 0.0003607832 FALSE
## 19 TRIP12 ENST00000487178 5.098594e-01 1.000000e+00 0.0003194608 FALSE
## 20 C3orf17 ENST00000383675 2.849060e-01 8.547179e-01 0.0002088555 FALSE
## 21 C3orf17 ENST00000472705 3.840888e-01 8.547179e-01 0.0002815633 FALSE
## 22 C3orf17 ENST00000474311 8.501880e-07 3.400752e-06 0.0001566416 TRUE
## 23 C3orf17 ENST00000494891 6.967420e-01 8.547179e-01 0.0005107593 FALSE
The isoforms that are called significant are as follows
final.res2.wald[final.res2.wald$decision, ]
## gene isoform step2.p.value adj.p threshold decision
## 2 HNRNPA2B1 ENST00000356674 5.720757e-11 3.432454e-10 1.044277e-04 TRUE
## 9 RRBP1 ENST00000495501 2.488224e-05 7.464671e-05 2.088555e-04 TRUE
## 12 PRKAA1 ENST00000511248 3.032114e-08 9.096341e-08 2.088555e-04 TRUE
## 15 TRIP12 ENST00000428959 4.191024e-08 2.933717e-07 8.950949e-05 TRUE
## 22 C3orf17 ENST00000474311 8.501880e-07 3.400752e-06 1.566416e-04 TRUE
We see that with the Wald test option we identified 5 isoforms that are called significant and the isoform identified by t-test is one of them.
There are two functions plot.gene()
and plot.gene.bat()
in R file [Tools.R] that can be used to make plots of isoform abundances for given genes. plot.gene()
plots one gene specified by the argument gene
and output the figure to a given graphic device. plot.gene.bat()
plots a series of genes by assigning a vector of gene names to the argument gene
in the function. The following command will create the plots for the 14 genes that passed the Type 1 screening test and output the results to a pdf file demo_figure.pdf.
load(file="data/data4demo.RData")
plot.gene.bat(data=data, gene=rownames(sig.res1), xlim=c(-0.2, 2.3), lwd=3, isoform.id.opt=1, Phenotype="Tumor Status", file="demo_figure.pdf", pPanel=c(2, 2), width=7, height=9, y.lab=c("Free of Cancer", "Not"))
## png
## 2
Next we make the isoform plots for two genes LIFR and HNRNPA2B1 using function plot.gene()
.
par(mfrow=c(1, 2))
dat<-plot.gene(data=data, gene="LIFR", Phenotype="Tumor Status", xlim = c(-0.05, 1.5), isoform.id.opt=1)
dat<-plot.gene(data=data, gene="HNRNPA2B1", Phenotype="Tumor Status", xlim = c(-0.05, 1.5), isoform.id.opt=1)
Gene LIFR is a typical gene that passed the Type 1 screening test but the Type 2 screening test. Form the plot of the gene we see that there is a parallel pattern; all the three isoforms of the genes are up-regulated in patients who are free of cancer and down-regulated in patients who are not. There is no evidence for the differential splicing or isoform switches. On the other hand, gene HNRNPA2B1 passed both screening tests. We see an interaction pattern, where some of the isoforms are up-regulated in patients who are free of cancer and down-regulated in patients who are not while other isoforms are opposite. There might be isoform switching events for this gene.
We present this tutorial as a companion to the manuscript submitted for publication to PLOS ONE. The tutorial shows step by step how to identify genes that have differentially expressed/spliced isoforms using the R functions that we provided in the R file Tools.R. When users have a need to analyze their own data using our proposed methods they only need to prepare the data in the format as describted in Section Data set and Data Format. All of the analyses can be easily done by following the example presented in the tutorial. However, some modifications may still be required to improve the algorithms. We welcome the feedback from all the users who are interested in our approach.