
GraphWeb is a public web server for graphbased analysis of biological
networks that:

Introduction  Example  Suggestions for browsing  About this document Methods of input  Example inputs  Manage data uploads with g:PEDaM Input syntax  Edge weights explained  Gene ID conversion and orthologs Algorithms  MCL  Betweenness centrality clustering  Connected components  Strongly connected components  Biconnected components  Maximal cliques  Hub based modules  Whole graph  Network neighbourhood Options and settings  Assign more weight to smaller networks  Create global network  Remove unknown genes  Remove ambiguous genes  Keep N% of highest degree nodes  Remove edges with less than N labels  Keep N% of heaviest edges  Module filtering  Hide modules with less than N nodes  Show N largest modules  Calculate g:Profiler scores  Sort modules by functional score  Hide insignificant modules Output  Node name conversion  Comp  Number of nodes  List nodes  List edges  Zoom in  Label distribution  Score  g:Profiler annotations  Visualisations  Summary row  g:Cocoa  Searching the output IntroductionThis application enables you to input graphs and explore their structure, extracting potentially interesting subgraphs (modules) from them with various algorithms. GraphWeb is especially suited for analysing biological (such as proteinprotein interaction or gene regulation) networks. In such a case, the nodes of the graphs are expected to represent genes or proteins, so the resulting modules can be sent as a query to g:Profiler, a tool that finds significant common properties of a group of genes and visualises the results. The edges of the graphs can be directed or undirected, edge weights are also supported. Furthermore, it is possible to compare and combine data coming from different sources by using the mechanism of edge labels. Each edge in the graph can have one or more labels, and a separate edge weight can be associated with each label. It is also possible to assign a default weight to each label, which will be given to edges with that label with no weight specified. Also, edges having a greater number of labels are given greater weights (by adding together their weights for all labels). GraphWeb provides a simple syntax for representing edges (relationships between pairs of nodes), edge directions, weights and label information. After submitting the input, the program first converts gene IDs into a standard form, which allows easily combining data that uses different IDs. If you wish, genes of one species may be mapped into their orthologs in another species. After that, weights are assigned to all labels and then to edges. Then, optionally, filtering takes place, where lowweight edges and lowdegree nodes are removed, based on userspecified tresholds. You can also specify a set of nodes whose neighbourhood will be extracted from the graph. After the steps of reading the input and possibly filtering the resulting graph, we now have a final graph, which is then processed by a chosen algorithm. The algorithm returns a collection of modules. Each module is a group of nodes of the graph. The exact nature of the modules depends on the algorithm: they may be connected components, clusters, cliques, consist of neighbours of a central hub and so on. Each module is displayed on a separate line in the output table. Again, it is possible to filter some modules out (smaller ones, or those with no biological relevance detected). For each module, you can see information about the subgraph induced by the nodes of the module, and find out about the biological relevance of the module. ExampleTo get a quick introduction to GraphWeb, you can try the following simple example. Copy the following input into the big textbox at the left side of the form. Note that each line describes an edge of the graph, listed as a pair of nodes. A B A C B C E F G H E H F G C D E G B K T1 T2 Since we're not dealing with biological data in this example, you must turn off the checkbox "Hide insignificant modules" (default is on), otherwise the results will be discarded as biologically irrelevant. Choose the "connected components" algorithm and click the submit button. This is a good algorithm to begin analysing a graph with  it simply returns the separate connected parts of a graph. In the output, you will see two modules displayed, which correspond to the connected components {A,B,C,D,K} and {E,F,G,H}. Check out the links under "Visual", "Nodes" and "Edges". Also check out what "Zoom in" does. Note that one connected component, {T1,T2}, was filtered out  modules smaller than 3 are not shown by default. Decrease the parameter "Remove modules with less than 3 nodes" in the form and submit again to see this component as well. Next, look at what the following input does and try to understand the results. Pay attention to the coloured edges in the visualisation. This will show how edge labels work. Again, you have to make sure that "Hide insignificant modules" is unchecked. DS1: N1 N2 DS1: N2 N3 DS1: N3 N1 DS2: N1 N2 DS2: N2 N4 DS2: N2 N5 DS3: N4 N1 DS3: N2 N5 DS1: N3 N5 DS2: N3 N5 DS3: N3 N5 After trying and understanding this example, you might try what happens if you add the option "Remove edges with less than 2 labels". To see an example of how biological data is handled, check the box "From a file in our server" and try out some of the available data files. Suggestions for browsingJavaScript should be allowed on your browser for GraphWeb to work properly. Displaying the results page involves intensive computation and may take several minutes if the input is large. You are recommended not to navigate away from the results page if computing them took a long time, since pressing the 'back' button on your browser may cause the results to be recomputed. For this purpose, links on the results page have been set up to open in a new window or in a popup window. About this documentThis help file aims to describe all the features and functionality of GraphWeb. It is currently mostly accurate, but slightly outdated, as we are currently updating it. Some features are not fully described here yet, such as the advanced input feature. Also, some of the information may be outdated. If you find disrepancies between this document and the actual behaviour of GraphWeb, please use the contact form and notify us. Methods of inputThe graph data can be input in several ways (which can be combined), all found in the leftmost column of the input form:
The last two options are both available by selecting "Choose an existing file", in a common dropdown menu. If more than one input source is used (e.g. there is text in the textbox and an example file is selected), the data is combined by concatenation. This makes it possible to manually add some edges to an input file without modifying the file itself, or to combine your data with our example inputs. Example inputsThe example inputs are:
When choosing an example input, make sure that you select the correct organism from the organism menu as well (the input name contains the name of the organism). To download or see the example inputs, click on the link above the dropdown menu. Manage data uploads with g:PEDaMBesides uploading a dataset from your computer for analysis, it is also possible to store your data permanently on our server in your personal data folder, using the g:PEDaM file manager. The g:PEDaM interface itself, containing tools for uploading, viewing and deleting stored inputs, is located in the bottom half of the leftmost column. To create your personal storage folder, click on "Create new folder". You will be able to protect your folder with a password. Later, you can click on "Use my existing folder" to log in with your chosen password. The files that you have uploaded are available in a menu where they can be viewed and deleted. The files you have uploaded can be selected for analysation from the same menu that contains example inputs, which can be displayed by choosing the option "From a file in our server" in the first column. Input syntaxEach line of the input should either describe one edge of the graph or assign a weight to an edge label. For an edge with several labels, there should be a line for each label it has. The syntax to describe an edge is [Label:] Node1 [<><>] Node2 [[%!]EdgeWeight] where square brackets denote optional parts of the syntax and the symbol  separates alternatives. The label and node identifiers are case insensitive (so abc and ABC refer to the same object), they must not contain whitespace or the symbols <, >, and =. Whitespace is required after the colon following the label (this is because node names may contain colons), but it is optional elsewhere as long as the input remains unambiguous. Edges for which the label name is not present will be given the label DEFAULT. The arrow symbol between the edges shows the direction of the edge (onedirectional or bidirectional). If the symbol is missing, the edge is assumed to be bidirectional. The whole graph is considered undirected if all edges in the input are bidirectional and directed if there is at least one edge where < or > is used as the separator. In the case of a directed graph, all undirected edges in the input will become two separate directed edges, pointed both ways. GraphWeb also supports various methods for assigning edge weights. Weight here means the "trustworthiness" or "goodness" of the edge: the bigger the weight, the stronger the edge. The weights can be used in the MCL and betweenness centrality clustering algorithms, but also for simple edge filtering. The weights should be positive floating point numbers, written in the standard C format, with a dot (not a comma) as the decimal point. If an edge weight is not specified, the default weight of the label will be given to the edge. If an edge has multiple labels, the final weight of this edge will be the sum of weights associated to it for each label. However, an edge appearing twice with the same label is not supported and if the input contains this situation, a warning is shown and only the largest of the conflicting weights is assigned. To learn more about edge weights, see the paragraph Edge weights explained. The syntax for assigning a default weight to a label is Label = Weight Whitespace is optional. The weight of a label is used as the weight of those edges having the label that don't have a weight specified themselves. It should be a positive real number. One label should be assigned a weight only once in the input  in case of duplicate weights, all but the first one are ignored. The assignment can be made in any part of the input and takes effect for the whole graph, both for edges described before and after the assignment. Edge weights explainedAs explained above, each edge can have a weight assigned to it on the line where the edge is described. Labels can also have weights assigned with the "Label = Weight" syntax. Weights are relevant in the algorithms MCL and betweenness centrality clustering (although both work without weights as well), they can also be used to filter out weaker edges in the graph. If no weights for edges and labels are specified in the graph, the weight of each edge will simply be the number of labels it appears with. In the normal behaviour for assigning weights to edges, the weights will be normalised within each label. This means that all weights of edges with a given label are scaled into values between zero and one and then multiplied with the label weight. This means that the edge with the highest weight with this label will receive exactly the label weight (or 1 if there is no label weight). If an exclamation mark ("!") is added in front of an edge weight, this weight will not be a part of the normalisation process. Instead, the edge will just get the weight after the exclamation mark. If a precentage mark ("%") is inserted before an edge weight, the edge weight will be specified as a precentage of the label weight. For example, if the label weight is 200 and an edge with this label is given the weight \%40, the weight of the labeled edge will be 40% of 200 or 80. If an edge has no weight specified, its weight for this label will be the corresponding label weight. If a label doesn't have a weight specified in the input, its weight will normally be equal to 1, which means that edge weights will be normalised into the interval (0,1]. However, there is also a special setting, Assign more weight to smaller networks, which gives an alternative method of allocating weights to labels. With this method, labels with more edges get smaller weights and labels with less edges get larger weights. You can use it instead of choosing the weights yourself if you happen to consider your smaller datasets more trustworthy than large ones. Gene ID conversion and orthologsIf the ID of a node is a valid protein or gene ID, g:Convert is used to convert it to a standard form (Ensembl gene name). This allows the input to contain different kinds of IDs. When two node IDs in the input really represent the same gene, they are considered one node. To use this feature, the correct organism must be chosen from the organism menu. GraphWeb also supports converting the genes of one species into orthologous genes of another species. Orthologs are genes/proteins in different organisms that have the same evolutionary origin. It may be of interest to consider graphs where vertices are proteins of one organism (e.g human) and edges are between interacting human proteins, but also between those proteins whose orthologs in some other organism (e.g mouse) interact with each other. It is easily possible: just concatenate a human dataset with a mouse dataset (such as in the example dataset "IntAct human and mouse"), select human as the main organism, select the checkbox "Convert orthologs" and select mouse as the ortholog organism. All mouse genes that have human orthologs will be replaced with those orthologs in the graph. It is possible that a gene in the input has either no Ensembl ID, or has several different IDs. GraphWeb's behaviour in this case depends on the parameters Remove unknown genes and Remove ambiguous genes. The unknown or ambiguous genes are either removed from the graph or retained with their original IDs. In the output, clicking on the link Node name conversion displays information about which IDs were converted, which were unknown and which were ambiguous. AlgorithmsIn the second column of the form you can choose the method that is used to extract subgraphs from a graph. Some of these methods also need parameters to be specified. Graph clusteringMCLUses the Markov Cluster Algorithm by Stijn van Dongen to split the graph into clusters. Each node will belong to exactly one cluster. The output of the commandline program mcl implementing this algorithm is also displayed. The algorithm may run several minutes in large graphs. The parameter called "inflation" can be used to impact the granularity of clustering. Values should range between 1.1 and 5.0, with 2.0 being the default and also a good value to start with. A short overview of the algorithm follows (see the website for details). A set of nodes is considered a cluster if a random walk in the graph is likely to remain in that set. Edge weights are interpreted as probabilities of traversing that edge on one step of the random walk. For example, if a node has three outedges with weights 2, 3 and 7, the probabilities of the next step of the random walk traversing those edges are 2/11, 3/11 and 7/11, respectively. The algorithm alternates two kinds of steps: expansion and inflation. Expansion means calculating the random walk probabilities for longer walks, which is accompliced by squaring the adjacency matrix. At the inflation step, entries of the matrix are all raised to the power of the inflation parameter, which increases the difference between high and lowweighted edges. This process converges to a steady state from which the clustering is obtained. We use the MCL implementation by van Dongen. For information about the speed of this algorithm, see How fast is MCL on the MCL website (basically, it is usually linear to the number of nodes, but with quite a large constant). Betweenness centrality clusteringA shortest path from one node to another is a (directed) path with the property that the sum of the weights of its edges is the minimal possible. If the graph is unweighted, the shortest path is the one with the least number of edges. There can be more than one shortest path between a pair of edges, if several paths with the same length exist. For the simple case where every node pair has an unique shortest path between them, the betweenness centrality of an edge is defined as the number of shortest paths passing through the given edge, where paths over all node pairs are considered. In the general case, if there are n shortest paths between some pair of nodes, each path is counted 1/n times. A definition of betweenness centrality is given here. In betweenness centrality clustering, the betweenness centralities of all edges in the graph are calculated and the edge with the highest centrality is removed. This process is repeated until enough edges of the graph have been removed, at which point the connected modules of the remaining graph are taken as clusters. This is justified by the fact that edges between two distinct dense clusters are likely to have a high centrality, because shortest paths from one cluster to another will often include that edge (there are few alternatives). A full implementation of the algorithm would require finding shortest paths between all pairs of nodes as many times as many edges will be removed, which would take too much time on large graphs. To solve this problem, we have tried to approximate the centrality by not finding all shortest paths, but selecting a random sample of nodes and finding shortest paths from nodes in the sample to all nodes. We hypothesize that if the sample is large enough, the resulting clustering will approximate the full betweenness centrality clustering. Another possible optimisation is removing several edges on each iteration, which means that less iterations are needed to remove the required number of edges. You can modify three options. The first is the precentage of edges in the graph that will remain in the graph. The smaller it is, the more iterations of the clustering will be made and the more finegrained the clustering will be. The second parameter is the precentage of edges that are included in the random sample. For smaller graphs (e.g. IntAct mouse or cattle datasets) it is OK to leave this at 100%, resulting in a true (nonrandomised) betweenness centrality clustering. If the graph is bigger, the sample must be decreased. The third parameter is the number of edges removed on each iteration. If it is 1, shortest paths are recomputed after each edge is removed. The bigger it is, the faster the calculation. To avoid load on our server, the computation is immediately cancelled if it would take too much time (more than 10 billion computational steps, calculated as edges_to_remove*sample_size*num_edges/edges_removed_on_iteration). Some graphs may be just too large to cluster with this method. Basic graph algorithmsConnected componentsIn an undirected graph, two nodes belong to the same connected component if there exists a path from one to the other. In a directed graph, choosing this will find weakly connected components  connected components in the graph obtained by ignoring all edge directions. The algorithm takes O(V+E) computation time, where V is the number of nodes and E the number of edges. Strongly connected componentsIn a directed graph, two nodes u and v belong to the same strongly connected component if there are directed paths both from u to v and from v to u. Similarly to connected components in undirected graphs, this is an equivalence relation, so every node is in exactly one component. The computation time is O(V+E). On undirected graphs, this algorithm has the same effect as connected components. Biconnected componentsA graph is called biconnected if it is connected and the removal of any one node does not disconnect it. A biconnected component of a graph is any maximal biconnected subgraph (i.e. it is not possible to add nodes or edges to the subgraph without losing its biconnectedness). A graph is partitioned into biconnected components by articulation points  nodes whose removal would increase the number of connected components. Articulation points belong to more than one biconnected component, all other nodes  to exactly one component. This algorithm is meant for undirected graphs. If the graph is directed, it will be transformed into an undirected graph by ignoring all edge directions. The time complexity is O(V+E). Maximal cliquesA clique is a subgraph where each pair of nodes has an edge between them. A maximal clique is a clique that is not a part of a bigger clique. The algorithm finds maximal cliques of a graph with 4 or more nodes. The algorithm is very slow when the graph is very big and dense. Because of that, the program will stop searching for new cliques if more than 1000 are found. Also, searching for cliques is disabled for very dense graphs where the number of connected edge pairs is over 50% of total edge pairs. Node groupingHubbased modulesFor each node in the graph, a subgraph will be given containing all nodes whose (unweighted) distance from the hub is not greater than the given parameter. For example, if the parameter equals 1, each module will contain a central node with all of its neighbours. If it equals 2, neighbours of the hub's neighbours are also added, etc. Visualisations in this case show not only edges from the hub to the neighbours, but also edges between the neighbours. Whole graphReturns just one module which contains all nodes of the input graph, after filtering has been applied to it. Options and settingsLabels and edge weightsAssign more weight to smaller networksIf selected, those labels that don't have their weights specified in the input will have weights chosen so that they will be inverse proportional to the number of edges in each label (that is, the product of the number of edges in a label and the label weight will be constant). This means that labels with less edges are considered more trustworthy and edges having them get greater weights. Labels that have their weights specified in the input are not affected by this setting. The weights of the unknown labels are scaled so that the equality SUM(weight(label) * num_of_edges(label)) = Average_weight * SUM(num_of_edges(label)) holds, where both sums are over those labels that don't have their weights specified in the input. Average_weight is equal to 1 by default. This can be changed by assigning a weight to the label DEFAULT, that is the label that edges that don't have a label specified belong to. In that case, Average_weight will equal the weight of DEFAULT. For an explanation of label weights and how they affect edge weights, see Input syntax and Edge weights explained. For the purpose of this setting, the number of edges in a label is calculated after unknown and ambiguous genes have been removed, but before any other node or edge filtering is applied. Create global networkIf this is selected, labels will not be shown in the output. Instead, all edges are merged into one label and the weight of each edge will be assigned to the sum of the weights of this edge in each label. This setting also affects edge lists and visualisations, where the label structure of the input will not be displayed. In the edge lists of modules, each edge will have its weight specified separately, without the use of labels. Also, label distribution in modules is not shown with this setting. The setting may be useful if the number of labels is very large, since edges belonging to very many labels will look strange on visualisations otherwise. Note that using labels in the input may still be useful with this setting because it helps to determine edge weights. Node filteringThis section contains options that are quite different, but they all have in common the fact that they remove some nodes from the graph.
Remove unknown genes

GraphWeb 20072008 Laur Tooming & Jüri Reimand & Jaak Vilo @ BIIT Group, Institute of Computer Science, University of Tartu. 