Mining Frequent Patterns in Large Networks
The amount of data we generate and store every day has been increasing rapidly over the past many decades. In 2013 alone, based on some estimates, we generated almost 28 terabytes of data per second. By 2018, this is projected to rise to 50 terabytes per second. With the vast amount of data we produce, it has become all but impossible to manually extract any useful information out of the data. The field of datamining has emerged as an answer to this problem; it deals with automatic extraction of useful and previously unknown information from large amounts of data using techniques from statistics, artificial intelligence and databases. One of the fundamental tasks in datamining is frequent pattern mining. It involves finding patterns that frequently reoccurr (above some user specified threshold t) in the data. Traditionally, in datamining, data had consisted of a simple set of independent tuples (known as the propositional format). In real life, however, the data is often more structured and complex, with various interdependencies and relationships within the data. To represent the data as a simple set of independent tuples, can often result in significant loss of important information. Hence, of late, interest has been developing in the field in more structured and expressive forms of data. In the last decade or so, graphs have emerged as a powerful data formalism, that seem to hit the right balance between expressivity and efficiency for mining useful information from data with (inter-)dependencies/relationships. In this thesis, we develop methods to mine frequent patterns in single large graphs (networks).
Mining patterns in a single large network is computationally challenging. The main challenge is the computation of the matching operator, i.e., to match a pattern to a network, which in general is NP-Complete, for most matching operators of interest (e.g., subgraph (homo/iso/homeo)-morphism matching operators). The subgraph isomorphism is often a popular matching operator, as it is a natural choice in many applications, where it makes more sense if he matching of the pattern vertices to the network vertices is 1-to-1 (as opposed to many-to-1, as in the subgraph homomorphism matching operator). For subgraph isomorphism, classical matching algorithms that exist quickly become intractable, even for reasonably small patterns, on networks which are large or have a high average degree. Recent advances in the theory of parameterized complexity, however, have produced algorithms that can tractably compute the subgraph isomorphism matching operator, when certain parameters of the problem are bounded. Such theoretical results are often deemed impractical by both theoreticians and practitioners alike. As we demonstrate in this thesis, with intelligent optimizations and careful implementation, this need not be the case.
In this thesis we first present a novel miner for mining frequent rooted trees in large networks, based on results from parameterized complexity theory. The miner mines patterns with delay time that is only mildly exponential in the pattern size, while linear in the network size. With thorough experimental evaluation on both synthetic and real-world networks, we demonstrate the miners practical applicability, and show that we can tractably mine patterns of size up to 17.
Next, we extend our above results to a more general class of graphs called bounded treewidth (BTW) graphs. BTW is an important class of graphs that is interesting not only theoretically but also in practice, since many real-world graphs tend to have low treewidth. Based also on results from parameterized complexity theory, we develop a miner for mining frequent BTW graph patterns in large networks, and show that we can tractably mine patterns of low treewidth, of size up to 13. In addition, we observe that in practice a vast majority of real-world networks are labeled. We exploit this property of the real-world networks to further reduce the complexity bound of our matching operator. We show that, in the presence of labels, we can greatly improve our results and mine much larger patterns of size up to 23.