Contact Us

Instructions

Frequently Asked Questions

ETD Help

Policies and Procedures

Copyright and Patents

Access Restrictions

Search ETDs:
Advanced Search
Browse by:
Browse ProQuest
Search ProQuest

Laney Graduate School

Rollins School of Public Health

Candler School of Theology

Emory College

Emory Libraries

New ETD website is now LIVE and located here: etd.library.emory.edu

Algorithmic Approaches to Classifying Biological Networks

Bray, Margaret Justice (2015)
Dissertation (269 pages)
Committee Chair / Thesis Adviser: Hertzberg, Vicki S
Committee Members: Hanfelt, John ; McClellan, William ; Yu, Tianwei
Research Fields: Biology, Biostatistics
Keywords: graph theory; networks; classification; PPI network
Program: Laney Graduate School, Biostatistics
Permanent url: http://pid.emory.edu/ark:/25593/ps8vj

Abstract

As technology has become more advanced, the ease with which data can be collected has improved. This has left researchers with copious amounts of information, so much information that previous analytical techniques fall short. This has led to an increase in the popularity of representing data with networks. However, this data often have errors. This makes any conclusion gleamed from the analysis of a network unreliable.

One type of network for which the inaccuracies are a particular issue is the protein-protein interaction (PPI) network. Researchers would like to use these networks to detect and diagnosis diseases by identifying specific interactions. Unfortunately, the errors in the networks make this impossible. One way to fix this is classify the empirical network into a category of model graph. By doing so, we will be able to mathematically predict which interactions are legitimate, and which are not.

In this dissertation, we begin by testing the classification accuracy of five algorithms: degree distribution distance (DDD), characteristic curve (CC), relative graphlet frequency (RGF), graphlet degree distribution using arithmetic mean (GDD (A)) and using geometric mean (GDD (G)). Overall accuracies were poor, ranging from 68% for the GDD (A) down to 47% for the DDD. With accuracies this low, it is difficult to trust the classification results for an empirical network of unknown origin.

Therefore, we propose two solutions. First, we provide several modifications to both versions of the GDD. The reformulated GDD is more accurate, classifying 76% of known graphs correctly, while also performing the analysis with increased speed. Second, we present a new classification algorithm: cross scoring. This novel method works by comparing networks based on a pre-selected group of network measures. Each type of model graph is ranked by how close its measure value falls to the empirical value compared to the other model types considered. Points are awarded and the model type with the fewest points at the end of the comparisons is considered the best fit. Accuracy across twelve trials was 82.9% with a standard deviation of 0.98%. These results are an obvious improvement over the five original algorithms considered.

Table of Contents

1 Introduction to Graphs and Graph Theory 1

1.1 Review of Essential Graph Theory Elements. . . . . . . . . . . . . . . . . . 3

1.2 Network Measures ................................ 5

1.2.1 Small-World and Scale-Free....................... 9

1.2.2 Centrality................................. 13

1.3 Discussion..................................... 15

2 Introduction to the Structure of Biomolecular Networks 17

2.1 Protein-Protein Interactions........................... 17

2.2 Sacchraomyces cerevisiae Protein-Protein Interaction Network . . . . . . . . 18

2.3 Motivation for Network Classification ..................... 21

3 Introduction to Model Graphs 23

3.1 Model Graph Descriptions............................ 23

3.1.1 RandomStatic .............................. 25

3.1.2 Small-World ............................... 25

3.1.3 3D-Geometric............................... 26

3.1.4 Linear Preferential Attachment..................... 26

3.1.5 Random Growing............................. 27

3.1.6 Aging Vertex ............................... 28

3.1.7 Duplication-Mutation-Complementation and Duplication-Mutation using Random Mutation.......................... 28

3.1.8 Stickiness Model ............................. 29

3.2 Methods...................................... 30

3.3 Results....................................... 30

3.3.1 Numbers of Nodes and Edges...................... 30

3.3.2 Density .................................. 34¨

3.3.3 Proportion of Nodes in the Giant Component . . . . . . . . . . . . . 36

3.3.4 Diameter, Radius, and Average Shortest Path Length . . . . . . . . 38

3.3.5 AverageDegreeandAssortativity ................... 42

3.3.6 S-metric.................................. 45

3.3.7 Clustering................................. 46

3.3.8 Centralities: Betweenness, Closeness, Degree, and Eigenvector . . . . 49

3.4 Discussion..................................... 53

4 Measure Based Comparison of Model Graphs v Saccharomyces cerevisiae PPI Network 57

4.1 Methods...................................... 57

4.2 Results....................................... 58

4.2.1 Size Measures............................... 58

4.2.2 Distance Measures ............................ 63

4.2.3 Centrality Measures ........................... 67

4.2.4 Connection Measures........................... 69

4.2.5 Biologically Significant Measures .................... 72

4.2.6 Summary of Measure Based Comparison Broken Down by Category 74

4.3 Discussion..................................... 75

5 Introduction to Network Classification Methods 78

5.1 Network Classification Methods......................... 79

5.1.1 Relative Graphlet Frequency and Graphlet Degree Distribution . . . 79

5.1.2 Characteristic Curve........................... 82

5.1.3 Degree Distribution Distance ...................... 85

5.2 Limitations of Previous Work.......................... 86

6 Random Graph Classification 87

6.1 Methods...................................... 87

6.2 Results....................................... 90

6.3 Discussion..................................... 94

7 Model Graph Classification 96

7.1 Methods...................................... 96

7.2 Results....................................... 99

7.2.1 Filtering Out Large Graphs....................... 99

7.2.2 Degree Distribution Distance ...................... 100

7.2.3 Characteristic Curve........................... 105

7.2.4 Relative Graphlet Frequency ...................... 109

7.2.5 Graphlet Degree Distribution...................... 112

7.2.6 Comparison of Classification Accuracy Broken Down by Model Type and Method................................ 117

7.2.7 Patterns in Statistical Performance................... 119

7.2.8 Treatment of DMC and DMR Model Graphs . . . . . . . . . . . . . 121

7.3 Discussion..................................... 123

7.3.1 Strengths and Limitations........................ 126 7.3.2 Next Steps ................................ 129

8 Saccharomyces cerevisiae PPI Network Classification 130

8.1 Methods...................................... 130

8.2 Results....................................... 132

8.2.1 Degree Distribution Distance ...................... 132

8.2.2 Characteristic Curve........................... 134

8.2.3 Relative Graphlet Frequency ...................... 135

8.2.4 Graphlet Degree Distribution...................... 137

8.2.5 Kendall's W Comparison of Ranking Lists. . . . . . . . . . . . . . . 139

8.3 Discussion..................................... 140

9 Relative Graphlet Frequency

9.1 Formula Error................................... 142

9.2 Methods...................................... 143

9.3 Results....................................... 144

9.3.1 Random Graph Classification...................... 144

9.3.2 Model Graph Classification ....................... 144

9.3.3 Saccharomyces cerevisiae PPI Network Classification . . . . . . . . . 148

9.4 Conclusions.................................... 151

10 Reformulations of the Graphlet Degree Distribution 153

10.1 Graphlet Degree Distribution Issues ...................... 154

10.1.1 Geometric Mean ............................. 155

10.1.2 Contradictory Outcomes......................... 158

10.1.3 Scaling and Normalization........................ 164

10.2 Methods...................................... 165

10.2.1 Version 1 ................................. 166

10.2.2 Version 2 ................................. 166

10.2.3 Version 3 ................................. 167

10.2.4 Analysis of Performance......................... 168

10.3 Results....................................... 168

10.3.1 Model Graph Classification ....................... 168

10.3.2 Comparison of the Original Graphlet Degree Distribution to the Reformulated Versions ........................... 176

10.3.3 Saccharomyces cerevisiae PPI Network Classification . . . . . . . . . 179

10.4 Conclusions.................................... 181

11 Designing the Cross Scoring Algorithm 182

11.1 Methods...................................... 182

11.1.1 Measures of Center and Spread..................... 183

11.1.2 Nonlinear Scoring ............................ 183

11.1.3 Zeroing .................................. 184

11.1.4 Approximations.............................. 185

11.1.5 Tie Breaking ............................... 185

11.2 Data........................................ 186

11.3 Results....................................... 187

11.3.1 Mean Results............................... 188

11.3.2 Median Results.............................. 189

11.3.3 Comparison of Mean and Median Results . . . . . . . . . . . . . . . 192

11.4 Discussion..................................... 193

12 Determining the Cross Scoring Measure List 195

12.1 Methods...................................... 195

12.1.1 Macro- v Micro-Scoring ......................... 196

12.1.2 Importance-Scoring ........................... 196

12.2 Data........................................ 196

12.3 Results....................................... 198

12.3.1 Macro-lists ................................ 199

12.3.2 Micro-list ................................. 199

12.3.3 Importance-Scoring ........................... 201

12.4 Discussion..................................... 201

13 Applying the Cross Scoring Algorithm 203

13.1 Methods...................................... 203

13.2 Data........................................ 204

13.3 Results....................................... 204

13.3.1 Measure Selection ............................ 204

13.3.2 Macro-Lists................................ 206

13.3.3 Macro-Scoring Performance....................... 208

13.3.4 Micro-Scoring Performance ....................... 213

13.3.5 Macro- v Micro-Scoring Results..................... 218

13.3.6 Classification of the S. cerevisiae PPI Network . . . . . . . . . . . . 220

13.3.7 Importance-Scoring ........................... 223

13.3.8 Robustness ................................ 225

13.3.9 Comparison of All Classifiers ...................... 225

13.4 Discussion..................................... 227

13.4.1 Biological Implications and their Effect on the Cross Scoring Design 229

13.4.2 Strengths and Limitations........................ 230

14 Summary 232

14.1 Overview ..................................... 232

14.1.1 Which Model Type is the Best Fit for the Saccharomyces cerevisiae PPI Network?............................... 235

14.1.2 Do PPI Networks Exhibit Scale-Free Properties? . . . . . . . . . . . 236

14.2 Future Work ................................... 237

14.2.1 Extension of Analyses .......................... 237

14.2.2 Redesign of DMC and DMR Growth Mechanisms. . . . . . . . . . . 238

14.2.3 Does the Growth Mechanism Define the Model Graph Type? . . . . 239

A... 240

Files

application/pdf Algorithmic Approaches to Classifying Biological Networks 269 pages (6.5 MB) [Access copy of Dissertation]
Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.