

Linked to treatment resistance?

Array data from Van Essen et al. PLoS Biol. (2009). Mar 31;7(3):e73.

• TNF-α stimulated cells were compared to unstimulated cells.
• We performed a full pipeline analysis.
• As expected, our analysis showed increases in gene expression of TNF-α pathway components.

A Universal Gene Expression Data Analysis Platform Designed Using bioKepler And The R Language For Statistical Computing

Tyler Kolisnik and Mark Bieda
Department of Biochemistry and Molecular Biology, Cumming School of Medicine

Abstract

Advantages of Using Kepler

Workflow Overview

Sample Study Analyzed

Conclusions & Future Directions

Gene expression patterns determine cellular behavior and identity. Understanding these patterns can give critical insight into a wide range of issues in child and maternal health, from normal development to disease. Data analysis presents large challenges in this area. In particular, reuse of public data and cross-platform analysis are difficult for most users, severely limiting the promise of large public resources, which seem to be used mostly by expert bioinformaticians. We seek to develop computational tools that are straightforward for the novice user yet scalable, robust, extensible, modifiable and easily comprehensible for the advanced user. Our first target is Affymetrix gene expression microarrays, including classical array designs and Gene ST designs. We present user-friendly R/Bioconductor scripts and Kepler workflows that perform comprehensive analyses of these datasets and produce simple, publication-ready outputs. We also present complete pipelines for accessing public data from NCBI GEO and integrating public data with user data for comparison purposes. We outline our strategies for RNA-seq pipeline creation and cross-platform analysis. This work provides user-friendly yet sophisticated and easily modifiable tools that allow novice users to analyze their own data or public datasets with ease.

• Comprehensible
• Extensible
• Modifiable
• Distributable
• Reproducible Research

• In this paper glioblastoma-derived cancer stem cells (GBM CSCs) were analyzed and characterized into adherent and neurosphere-like samples based on growth properties.

A New Generation Of Software Tools

A bioKepler workflow for analyzing gene expression microarray data.

Workflow figure labels: Adjustable Inputs, Guiding Comments, Actual Workflow.

A heatmap of gene expression data generated by the workflow. The three leftmost columns show the gene expression results of the adherent (CD133-) glioblastoma cells. The three columns on the right show the gene expression results of cell lines having neurosphere-like growth properties (CD133+).

Samples: GSM172063, GSM172064, GSM172065, GSM172066, GSM172067, GSM170898

Comparison: CD133+/Neurosphere-like vs. CD133-/Adherent

Gene expression was assessed using Affymetrix HG-U133 Plus 2.0 arrays.

• We compared our conclusions to those of Beier et al. (2007).
• We found that our approach led to enhanced understanding of the Beier data beyond the original paper conclusions.
• Beier et al. (2007) clustered genes manually, while our workflow does this automatically.
• In accord with their analysis, our gene ontology and pathway analysis found many immune system related genes/pathways upregulated in adherent cells.
• Our gene ontology and pathway analysis sections implicated genes and pathways (DNA replication and cell-cycle pathways) which could explain why NS cells proliferate faster than AD cells. Beier et al. (2007) did not find genes and pathways related to the proliferation difference.
• Mismatch repair, nucleotide excision repair and homologous recombination were also shown to be elevated in the Neurosphere-like cells. These upregulated genes and pathways in the Neurosphere-like cells are indicative of potential resistance to traditional anti-cancer therapies.
• We found interesting patterns of up- and down-regulation of cell adhesion molecule pathways that may partially explain neurosphere vs adherent growth patterns.

In sum, these results indicate that automated gene ontology and pathway analysis can yield novel insights into a microarray dataset.

Beier D, Hau P, Proescholdt M, Lohmeier A et al. CD133(+) and CD133(-) glioblastoma-derived cancer stem cells show differential growth characteristics and molecular profiles. Cancer Res 2007 May 1;67(9):4010-5.

Workflow overview (diagram):

• Normalize Data
    • Classic Affymetrix Arrays (e.g. HGU-133 Plus 2.0)
    • High-Density Affymetrix Arrays (e.g. Gene ST Arrays)
    • Future Goals: Illumina Beadstations, Agilent Arrays
• Select Differentially Expressed Genes
• Gene Ontology
• Pathway Analysis
• Optional Other Analyses
    • Future Goals: Correspondence Analysis, Principal Component Analysis

Step 1. Normalize Data

I. Altintas, J. Wang, D. Crawl, W. Li, “Challenges and approaches for distributed workflow-driven analysis of large-scale biological data”, pp 73-78. in: Proceedings of the Workshop on Data analytics in the Cloud at EDBT/ICDT 2012 Conference, DanaC2012, 2012.

Step 2. Select Differentially Expressed Genes

GOBPID      Pvalue            OddsRatio   ExpCount    Count  Size  Term
GO:0019886  4.61792716e-13    33.6835391  0.43448184  11     92    antigen processing and presentation of exogenous peptide antigen via MHC class II
GO:0002504  7.4657333427e-13  32.0898039  0.45337236  11     96    antigen processing and presentation of peptide or polysaccharide antigen via MHC class II
GO:0019884  2.9876666450e-11  18.4672975  0.82646002  12     175   antigen processing and presentation of exogenous antigen
GO:0031295  4.082129509e-11   34.8878772  0.33530664  9      71    T cell costimulation
GO:0048002  5.7320312029e-11  17.388067   0.87368631  12     185   antigen processing and presentation of peptide antigen
GO:0060333  6.7824202747e-11  32.7646628  0.35419715  9      75    interferon-gamma-mediated signaling pathway
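The columns in this table appear to follow the usual layout of a hypergeometric over-representation test: Count is the number of selected genes annotated to a term, Size is the number of annotated genes in the gene universe, and ExpCount is the count expected by chance. The following is a minimal sketch of that calculation, written in Python for illustration rather than in the R used by the workflow. The function names are hypothetical, and the universe and selection sizes in the toy call are invented; only the ExpCount and Size values are copied from the table.

```python
from math import comb


def expected_count(size, selected, universe):
    """Expected number of selected genes annotated to a term under random sampling."""
    return size * selected / universe


def upper_tail_pvalue(count, size, selected, universe):
    """P(X >= count) when `selected` genes are drawn from `universe` genes,
    of which `size` are annotated to the term (hypergeometric distribution)."""
    total = comb(universe, selected)
    hits = sum(
        comb(size, k) * comb(universe - size, selected - k)
        for k in range(count, min(size, selected) + 1)
    )
    return hits / total


# Within one test run, ExpCount / Size should equal selected / universe for
# every term. The values below are copied from the first two table rows.
ratio_row1 = 0.43448184 / 92
ratio_row2 = 0.45337236 / 96

# Toy numbers (hypothetical): 10 genes in the universe, 3 annotated, 2 selected.
toy_expected = expected_count(size=3, selected=2, universe=10)
toy_p = upper_tail_pvalue(count=1, size=3, selected=2, universe=10)
```

The two ratios agree closely, which is consistent with both rows coming from the same selected gene list and gene universe.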

GOBPID      Pvalue      OddsRatio   ExpCount    Count  Size  Term
GO:0071804  0.0003654   13.1117958  0.33923108  4      75    cellular potassium ion transport
GO:0032411  0.00083426  18.1311741  0.18544632  3      41    positive regulation of transporter activity
GO:0032525  0.00088614  56.6590909  0.04523081  2      10    somite rostral/caudal axis specification
GO:0072001  0.00090191  5.86340104  1.12624717  6      249   renal system development
GO:2000136  0.00232198  32.3636364  0.0723693   2      16    regulation of cell proliferation involved in heart morphogenesis
GO:0043268  0.00262391  30.2040404  0.07689238  2      17    positive regulation of potassium ion transport
GO:0030182  0.0034567   2.74147465  4.96181987  12     1097  neuron differentiation
GO:0000279  0.00363382  25.1649832  0.09046162  2      20    M phase

hsa03030 DNA replication
hsa04110 Cell cycle
hsa03430 Mismatch repair
hsa03420 Nucleotide excision repair
hsa04974 Protein digestion and absorption
hsa03440 Homologous recombination

hsa04145 Phagosome
hsa04672 Intestinal immune network for IgA production
hsa04640 Hematopoietic cell lineage
hsa04612 Antigen processing and presentation
hsa04142 Lysosome
hsa04514 Cell adhesion molecules (CAMs)
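KEGG pathway identifiers such as those listed here combine a short organism code (hsa for human, mmu for mouse) with a five-digit pathway number. A minimal Python sketch of splitting an identifier into those two parts follows; the helper name and the pattern are illustrative assumptions, not part of the poster's workflow.

```python
import re

# Organism code (three or four lowercase letters) followed by five digits.
KEGG_PATHWAY_ID = re.compile(r"^([a-z]{3,4})(\d{5})$")


def split_kegg_id(pathway_id):
    """Return (organism_code, pathway_number) for an ID such as 'hsa03030'."""
    match = KEGG_PATHWAY_ID.match(pathway_id)
    if match is None:
        raise ValueError(f"not a KEGG pathway ID: {pathway_id!r}")
    return match.group(1), match.group(2)


# A few identifiers copied from the poster.
pathways = {
    "hsa03030": "DNA replication",
    "hsa04110": "Cell cycle",
    "hsa04514": "Cell adhesion molecules (CAMs)",
}

parsed = {pid: split_kegg_id(pid) for pid in pathways}
```

Keeping the organism code separate makes it straightforward to tell a human pathway (hsa) from the mouse pathway (mmu04380) used for the TNF-α dataset.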

Steps 3 & 4. Gene Ontology and Pathway Analysis

• Pipeline creation within bioKepler was fast and user-friendly, and it allowed easy integration of R scripts and R/Bioconductor packages.
• The data analysis takes approximately 8 minutes in total, and the workflow runs on Windows, Mac, and Linux.
• These workflows will allow non-expert investigators to conduct a comprehensive gene expression microarray analysis.
• When completed, these workflows will allow for fast, consistent, and thorough analyses of microarray data.
• Understanding gene expression patterns has a wide range of applications in medical research, especially in the area of child and maternal health.
• We are looking into adding support for Agilent arrays and Illumina Beadstation arrays.
• We are also looking into expanding current features and adding new features such as principal component analysis and correspondence analysis.

Output includes a table of normalized gene expression values.

Adherent Neurosphere

Output includes differentially expressed genes table and heatmap.

SYMBOL   Probeset Id   Log of Fold Change  P Value      Adjusted P Value
GIMAP1   1552316_a_at  0.7458091           0.001676504  0.001676504
RDH10    1552378_s_at  1.45989171          0.000521453  0.000521453
LACTB    1552486_s_at  -0.995149583        0.005611844  0.005611844
SGK494   1553292_s_at  1.506701912         0.000704543  0.000704543
SALL3    1553411_s_at  2.692282261         0.005047593  0.005047593
ADC      1554393_a_at  -0.869116588        0.001360285  0.001360285
ZFYVE16  1554638_at    1.028713621         0.009793249  0.009793249
LAPTM4B  1554679_a_at  0.905566923         0.007518872  0.007518872
MINA     1554774_at    -0.892993044        0.008736615  0.008736615
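As a sketch of how rows like these could be filtered and split by direction of change, the Python snippet below applies two cutoffs to the table values. It is an illustration only: the workflow itself uses limma in R, the cutoffs (absolute log fold change of at least 1, p-value of at most 0.01) are hypothetical choices, and the conversion to a linear fold change assumes base-2 logarithms.

```python
# (symbol, probeset, log fold change, p-value), copied from the table.
rows = [
    ("GIMAP1", "1552316_a_at", 0.7458091, 0.001676504),
    ("RDH10", "1552378_s_at", 1.45989171, 0.000521453),
    ("LACTB", "1552486_s_at", -0.995149583, 0.005611844),
    ("SGK494", "1553292_s_at", 1.506701912, 0.000704543),
    ("SALL3", "1553411_s_at", 2.692282261, 0.005047593),
    ("ADC", "1554393_a_at", -0.869116588, 0.001360285),
    ("ZFYVE16", "1554638_at", 1.028713621, 0.009793249),
    ("LAPTM4B", "1554679_a_at", 0.905566923, 0.007518872),
    ("MINA", "1554774_at", -0.892993044, 0.008736615),
]


def select_genes(rows, min_abs_log_fc=1.0, max_p=0.01):
    """Keep rows whose change is large enough and whose p-value is small enough."""
    return [r for r in rows if abs(r[2]) >= min_abs_log_fc and r[3] <= max_p]


selected = select_genes(rows)

# Split all rows by the sign of the log fold change.
positive = [symbol for symbol, _, log_fc, _ in rows if log_fc > 0]
negative = [symbol for symbol, _, log_fc, _ in rows if log_fc < 0]

# Linear fold change, assuming the log is base 2 (an assumption).
linear_fold = {symbol: 2 ** log_fc for symbol, _, log_fc, _ in selected}
```

Which sample group a positive log fold change refers to depends on how the contrast was defined, which the table does not state.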

SYMBOL   Probeset Id  GSM172065.CEL  GSM172064.CEL  GSM172063.CEL  GSM172067.CEL  GSM172066.CEL  GSM170898.CEL
DDR1     1007_s_at    11.54221842    11.70391222    11.81086649    12.12506653    11.46143597    11.03332603
MIR4640  1007_s_at    11.54221842    11.70391222    11.81086649    12.12506653    11.46143597    11.03332603
RFC2     1053_at      10.10999211    9.135247471    9.268206929    7.819224167    9.357807909    9.472890098
HSPA6    117_at       7.088749776    8.866329285    7.012518539    6.200621666    7.283671506    6.793308674
PAX8     121_at       7.973178785    7.994079494    7.926540186    8.20509645     7.293444913    7.960692036
GUCA1A   1255_g_at    3.960506548    3.696936849    3.890547604    3.613902203    3.487833056    3.554299006
UBA7     1294_at      8.196973776    8.465411822    8.488872769    8.032207962    7.285372501    8.79548918
MIR5193  1294_at      8.196973776    8.465411822    8.488872769    8.032207962    7.285372501    8.79548918
THRA     1316_at      5.95284798     6.224669624    5.785903152    6.058626775    5.778035831    5.958867336
PTPN21   1320_at      5.498272624    5.767174339    5.485820211    5.021762355    6.074202128    5.716295128
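In this table some probesets appear under more than one gene symbol, with identical values in every sample (for example, 1007_s_at is listed for both DDR1 and MIR4640). A minimal Python sketch of grouping symbols by probeset, so that such rows can be recognized as a single measurement, is shown below. The variable names are hypothetical, and this is an illustration rather than a step documented in the poster.

```python
from collections import defaultdict

# (symbol, probeset) pairs copied from the normalized expression table.
symbol_probeset_pairs = [
    ("DDR1", "1007_s_at"),
    ("MIR4640", "1007_s_at"),
    ("RFC2", "1053_at"),
    ("HSPA6", "117_at"),
    ("PAX8", "121_at"),
    ("GUCA1A", "1255_g_at"),
    ("UBA7", "1294_at"),
    ("MIR5193", "1294_at"),
    ("THRA", "1316_at"),
    ("PTPN21", "1320_at"),
]

# Collect every symbol annotated to each probeset.
symbols_by_probeset = defaultdict(list)
for symbol, probeset in symbol_probeset_pairs:
    symbols_by_probeset[probeset].append(symbol)

# Probesets annotated with more than one symbol.
shared = {p: s for p, s in symbols_by_probeset.items() if len(s) > 1}
```

Counting probesets rather than table rows avoids treating one measurement as two independent genes in later steps.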

A key result: Neurosphere-like GBM CSC lines proliferate faster than adherent GBM CSC lines.


Upregulated in Adherent (“AD”)

Upregulated in Neurosphere-like (“NS”)

Methods

• Data analysis pipelines were created in Kepler by using the R language for statistical programming and R/Bioconductor packages.
• The limma package was used for statistical analysis and selection of differentially expressed genes. Pathway analysis graphs are generated by linking to the KEGG database.
• Because of how Kepler works, the analysis can be run either within the Kepler environment or, if desired, directly in R using the separate R script that is generated.
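The differentially expressed genes table reports both a P Value and an Adjusted P Value, but the poster does not state which multiple-testing adjustment was applied. As an illustration only, the Benjamini-Hochberg procedure, a common choice for microarray data, can be sketched in Python as follows. The function name is hypothetical and the input p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, not taken from the poster.
raw = [0.01, 0.04, 0.03]
adjusted = benjamini_hochberg(raw)
```

In a real analysis the adjustment is computed over every probeset that was tested, not only over the genes that end up in the output table.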

Figure panels: Gene Ontology Analysis and Pathway Analysis for each group, with KEGG pathway diagrams hsa03030, hsa04514 and hsa03430.

Annotation: Linked to adherent properties?

Legend: Green indicates upregulated in Adherent. Red indicates upregulated in Neurosphere-like.

Output includes gene ontology tables, pathway tables and diagrams as shown below.

Step 5. Analysis Results

Automated pathway analysis yielded the osteoclast differentiation pathway. This pathway has TNF-α components which were upregulated as expected in these experiments. Only partial pathway shown (above).

                  Classic Affymetrix Arrays  High Density Arrays
Probes per gene   ~11                        ~27
Analysis package  affy                       oligo
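The split between the two array families can be thought of as a simple lookup from array family to analysis package. The following Python sketch of that dispatch is hypothetical: the key names and the function name are invented for illustration, while the package names and approximate probe counts come from the comparison above.

```python
# Array family -> settings, using the values given in the comparison.
ARRAY_FAMILIES = {
    "classic": {"package": "affy", "approx_probes_per_gene": 11},
    "high_density": {"package": "oligo", "approx_probes_per_gene": 27},
}


def package_for(array_family):
    """Return the name of the analysis package used for an array family."""
    try:
        return ARRAY_FAMILIES[array_family]["package"]
    except KeyError:
        raise ValueError(f"unsupported array family: {array_family!r}") from None


classic_package = package_for("classic")
high_density_package = package_for("high_density")
```

Adding a further platform, such as the Agilent or Illumina arrays named as future goals, would then amount to adding one entry to the lookup table.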

I. Classic Affymetrix Arrays

II. High Density Arrays

mmu04380