Biostatistics: EquiGeneNorm

This text file contains an R function, “equi.gene.norm”, that normalizes oligonucleotide gene expression data using equivalently expressed genes in experiments comparing two groups of samples (for example, case-control samples). The method is described in L-X Qin and JM Satagopan (2009), “Normalization method for transcriptional studies of heterogeneous samples - Simultaneous array normalization and identification of equivalent expression,” Statistical Applications in Genetics and Molecular Biology, Volume 8, Issue 1, Article 10.
Download R function (This is a txt file.)

The input parameters are:

data.y = A three-dimensional array with dimensions (G, m, P), where G is the number of genes or probe sets, m is the number of samples, and P is the number of probes per gene or probe set.

data.x = A binary vector of length m giving the group membership of each sample, in the same order as the sample dimension of data.y.

pi.start = starting value of pi, where pi is the fraction of differentially expressed genes

lamda.hat = starting value of lambda, the fraction of overexpressed genes among those that are differentially expressed

The function can be called using the following command:
equi.gene.norm(data.y, data.x, pi.start, lamda.hat)
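As a rough illustration of the input shapes described above, the following R sketch builds simulated inputs and calls the function only if it is already defined in the session. The sizes G, m, and P, the simulated intensities, and the starting values 0.1 and 0.5 are arbitrary assumptions chosen for the example, not recommendations from the authors, and the structure of the returned object is not documented here, so the sketch simply prints it with str().

```r
# Simulated inputs matching the documented shapes (all values are made up).
set.seed(1)
G <- 100   # number of genes or probe sets (assumed for illustration)
m <- 8     # number of samples (assumed)
P <- 11    # number of probes per probe set (assumed)

# data.y: (G, m, P) array of arbitrary simulated intensities
data.y <- array(rnorm(G * m * P, mean = 8, sd = 1), dim = c(G, m, P))

# data.x: binary group membership, ordered like the sample dimension of data.y
data.x <- rep(c(0, 1), each = m / 2)

# Starting values (illustrative guesses only)
pi.start  <- 0.1   # fraction of differentially expressed genes
lamda.hat <- 0.5   # fraction of overexpressed genes among those

# Consistency checks on the inputs before calling the function
stopifnot(length(dim(data.y)) == 3,
          dim(data.y)[2] == length(data.x),
          all(data.x %in% c(0, 1)),
          pi.start > 0, pi.start < 1,
          lamda.hat > 0, lamda.hat < 1)

# Run only if the downloaded function has been loaded (e.g. with source())
if (exists("equi.gene.norm")) {
  result <- equi.gene.norm(data.y, data.x, pi.start, lamda.hat)
  str(result)
}
```

The checks before the call catch the most likely input mistake, a data.x vector whose length or ordering does not match the second dimension of data.y.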


Cancer Res. 2011 Apr 1;71(7):2697-705. Epub 2011 Feb 18.

Expression profiling of liposarcoma yields a multigene predictor of patient outcome and identifies genes that contribute to liposarcomagenesis.

Gobble RM, Qin LX, Brill ER, Angeles CV, Ugras S, O'Connor RB, Moraco NH, Decarolis PL, Antonescu C, Singer S.

Department of Surgery, Sarcoma Disease Management Program, Department of Epidemiology and Biostatistics, and Department of Pathology, Memorial Sloan Kettering Cancer Center, New York, New York.


Liposarcomas are the most common type of soft tissue sarcoma but their genetics are poorly defined. To identify genes that contribute to liposarcomagenesis and serve as prognostic candidates, we undertook expression profiling of 140 primary liposarcoma samples, which were randomly split into training set (n = 95) and test set (n = 45). A multigene predictor for distant recurrence-free survival (DRFS) was developed by the supervised principal component method. Expression levels of the 588 genes in the predictor were used to calculate a risk score for each patient. In validation of the predictor in the test set, patients with low risk score had a 3-year DRFS of 83% versus 45% for high risk score patients (P = 0.001). The HR for high versus low score, adjusted for histologic subtype, was 4.42 (95% CI, 1.26-15.55; P = 0.021). The concordance probability for risk score was 0.732. In contrast, the concordance probability for histologic subtype, which had been considered the best predictor of outcome in liposarcoma, was 0.669. Genes related to adipogenesis, DNA replication, mitosis, and spindle assembly checkpoint control were all highly represented in the multigene predictor. Three genes from the predictor, TOP2A, PTK7, and CHEK1, were found to be overexpressed in liposarcoma samples of all five subtypes and in liposarcoma cell lines. RNAi-mediated knockdown of these genes in liposarcoma cell lines reduced proliferation and invasiveness and increased apoptosis. Taken together, our findings identify genes that seem to be involved in liposarcomagenesis and have promise as therapeutic targets, and support the use of this multigene predictor to improve risk stratification for individual patients with liposarcoma. Cancer Res; 71(7); 2697-705. ©2011 AACR.

Array data are being prepared for submission to the Gene Expression Omnibus (GEO).
