Publication

Soccer Analytics Meets Artificial Intelligence: Learning Value and Style from Soccer Event Stream Data

Book - Dissertation

Soccer analytics has seen an explosion of interest in the last decade. The success of data analysis in other sports has driven soccer clubs and other stakeholders in soccer to wonder if they could also deepen their understanding of the game by analyzing data and translate this deepened understanding into tangible results such as signing good players and winning matches. Consequently, more data than ever is being collected in soccer. One prominent data source is event stream data, which is collected by human annotators who watch video feeds of soccer matches through special annotation software and rigorously describe all on-the-ball actions performed on the pitch such as passes, dribbles, interceptions, tackles, and shots. While event stream data is an incredibly rich data source, gleaning useful soccer insights from it has proven to be difficult in practice. One part of the problem is soccer being a fluid sport that involves many complex interactions between players. Furthermore, soccer's low-scoring nature and susceptibility to chance make it hard to correlate player skill with match results. Another part of the problem is event stream data being hard to analyze in its raw form. Analysts typically have to deal with a number of issues such as parsing complicated data structures, adapting to vendor-specific terminologies, dealing with data sparsity, scaling to millions of data points, and incorporating domain knowledge. These issues have motivated researchers to apply techniques from the field of artificial intelligence (AI) to event stream data, as these techniques are often intended to be used semi-autonomously on large and complicated data sets. Consequently, researchers have successfully used AI techniques such as classification, reinforcement learning, pattern mining, and network analysis to address soccer analytics tasks such as estimating shot quality, rating players, and detecting tactics. 
However, existing literature on learning from event stream data with AI techniques shows a number of shortcomings. First, no efforts have been made to address the data engineering challenges of event stream data, severely obstructing the reproducibility of papers within the field. Second, no approaches exist for valuing on-the-ball actions that consider the full context in which actions are performed or recognize the value of defensive actions such as tackles and clearances. Third, existing works have not sufficiently explored how to best model the locations and directions of actions when capturing the playing style of teams and players. Most approaches that attempt to capture playing style either rudimentarily divide the pitch into zones or ignore the spatial component of event stream data altogether.
This dissertation makes three main contributions to the field of soccer analytics that attempt to address these shortcomings. First, to better represent event stream data, we construct a new language that simplifies and unifies the data of different event stream data vendors, alleviating many data engineering challenges and encouraging the reproducibility of soccer analytics research. Second, we propose a framework for assigning values to on-the-ball actions that, compared to simpler metrics and possession-based approaches, considers a more complete view of the context in which actions occur. Our framework uses a simple and elegant formula that formalizes the intuition that all actions in a match are performed with the intention of increasing the chance of scoring a goal and/or decreasing the chance of conceding a goal. The latter point is what allows our framework to recognize the value of defensive actions. Third, we introduce a number of approaches that express the playing style of teams and players based on where on the pitch they perform certain types of actions. Our approaches improve over earlier work by modelling the spatial component of event stream data in a data-driven manner using decomposition techniques such as non-negative matrix factorization and mixture models.
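The intuition behind the action-valuing framework can be illustrated with a minimal sketch. The abstract does not state the formula itself, so the Python below is only one plausible reading: an action is credited with the change it causes in the acting team's chance of scoring, minus the change it causes in that team's chance of conceding. The names `GameState` and `action_value` are hypothetical, the probabilities are assumed to come from some predictive model that is not shown, and the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    """Hypothetical container for model estimates at one moment of a match."""
    p_score: float    # assumed estimate: chance that the acting team scores
    p_concede: float  # assumed estimate: chance that the acting team concedes


def action_value(before: GameState, after: GameState) -> float:
    """Value an action by how it shifts scoring and conceding chances.

    This is an illustrative reading of the intuition in the abstract,
    not the dissertation's exact formula.
    """
    offensive = after.p_score - before.p_score          # raising the scoring chance is good
    defensive = -(after.p_concede - before.p_concede)   # lowering the conceding chance is good
    return offensive + defensive


# Invented numbers, not taken from the dissertation.
through_ball = action_value(GameState(0.02, 0.01), GameState(0.08, 0.01))
clearance = action_value(GameState(0.01, 0.15), GameState(0.01, 0.03))
```

In this sketch the clearance earns a positive value even though it does not change the scoring chance at all, which is how the conceding term lets defensive actions be credited.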
Publication year: 2020
Accessibility: Open