Supervisor: Anna Scaife
Project description: The radio morphology of Double Radio sources associated with Active Galactic Nuclei (DRAGNs) is typically determined by examining the distribution of high and low surface brightness regions in the synchrotron-emitting relativistic jets associated with these systems. This morphology, in conjunction with the radio luminosity of each source, is the basis of the well-known Fanaroff-Riley (FR) radio galaxy classification scheme. Traditionally, this classification has been done via visual inspection, a method made feasible by the modest sample sizes of historic radio surveys such as NVSS, SUMSS & FIRST, all of which catalogued ~100,000 objects. However, the catalogs produced by the next generation of radio telescopes are anticipated to be much larger. The Australia SKA Pathfinder (ASKAP) Evolutionary Map of the Universe survey expects to produce a catalog of ~70 million sources. Among the objects in this catalog, around 7 million extended radio sources are likely to require visual inspection. Beyond ASKAP, the Square Kilometre Array proposes to observe all FR Is and FR IIs up to z~4.
The expected volume of data from new radio surveys has motivated the expanded use of semi-automatic and automatic object classification algorithms, including the use of convolutional neural networks (CNNs). Initial studies have applied CNNs to FR morphology classification and suggest that complex radio source structures can be identified and classified according to their morphology using this method. Indeed, in astronomy more widely, CNNs are becoming increasingly widely used: to identify galaxy clusters and filaments, to detect fast radio bursts, to recognize strong gravitational lenses and to classify supernovae. Whilst these studies have demonstrated the successful application of CNNs to astronomical classification problems, there remain three key issues that currently inhibit the systematic use of CNNs in astronomy: (1) a paucity of labelled data for training; (2) a lack of computational power; and (3) a limited understanding of the biases introduced in the classification by observational and astrophysical selection biases in the training data. The first of these issues may be solved to some extent through the use of citizen science projects such as Galaxy Zoo; however, it is unlikely that astronomy will ever reach the volume of clean training data required for the deep learning architectures that are implemented commercially: paradoxically, although we have too much data, we don't have enough. The second can be addressed financially through the expansion of computational resources and a transition to GPU-based processing. The third issue is the subject of this proposal. Although the benefits of using automated classifiers for astronomy have been recognized, the inherent biases in their operation have yet to be investigated and addressed. Similar issues are increasingly evident in the business and everyday applications of machine learning, where social biases are emerging on a range of levels.
Within scientific applications, biases in the classifier will also have an impact; here they are caused by the presence of observational and astrophysical selection biases in the training data provided to the algorithms (resolution, luminosity selection, redshift evolution, etc.). Identifying and accounting for these biases is essential to enable the robust application of machine learning in astrophysics, just as it has been historically for more general statistical analysis. However, for image-based machine learning classification, the effect of bias is not trivial to disentangle from the back-propagation optimization method employed by multi-layered CNNs.
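As a purely illustrative sketch of how a luminosity selection can distort a training set, the toy example below draws a balanced two-class population and applies a hard luminosity cut. The class means, the scatter, the cut value and all function names are hypothetical assumptions chosen only so that one class is brighter on average than the other; they are not taken from any survey or from the project itself.

```python
import random


def simulate_population(n_per_class, seed=0):
    """Draw a toy population of (label, log10 luminosity) pairs.

    The means (24.5 and 26.0) and scatter (0.7 dex) are illustrative
    assumptions, not measured values.
    """
    rng = random.Random(seed)
    population = []
    for label, mean in (("FRI", 24.5), ("FRII", 26.0)):
        for _ in range(n_per_class):
            population.append((label, rng.gauss(mean, 0.7)))
    return population


def apply_luminosity_cut(sample, log_lum_min):
    """Keep only sources at or above the luminosity threshold."""
    return [(label, lum) for label, lum in sample if lum >= log_lum_min]


def class_fraction(sample, label):
    """Fraction of the sample that carries the given class label."""
    if not sample:
        raise ValueError("cannot compute a class fraction of an empty sample")
    return sum(1 for lab, _ in sample if lab == label) / len(sample)


# Balanced parent population: 5000 sources per class.
population = simulate_population(5000)

# Hypothetical selection: only sources brighter than log10(L) = 25.5 survive.
selected = apply_luminosity_cut(population, 25.5)

frac_before = class_fraction(population, "FRII")
frac_after = class_fraction(selected, "FRII")

print(f"FRII fraction before cut: {frac_before:.2f}")
print(f"FRII fraction after cut:  {frac_after:.2f}")
```

In this toy setting the cut turns a balanced parent population into a training sample dominated by the brighter class, which is the kind of selection-induced imbalance a classifier could then absorb as an implicit prior.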
In this project the student will explore the effect of different observational biases on CNN-based classifiers, focusing on the problem of radio galaxy classification for large radio surveys. By artificially introducing biases into training data gathered from archival surveys (NVSS, FIRST & SUMSS), they will assess the impact of different observational selection biases and examine methods of mitigating them and/or quantitatively predicting their effect: (i) by treating them as a specific form of class imbalance in the training data, and (ii) by introducing a so-called "Bayesian layer" after the fully-connected layers of the classifier in order to implement probabilistic back-propagation, equivalent to imposing a parameterised prior corresponding to the expected underlying population. This structure will allow us to integrate parameter estimation for redshift-dependent galaxy evolution models directly into our classifier and marginalise over multiple underlying source population models in order to minimise bias. We will then extend this approach to the SKA data challenges (e.g. https://astronomers.skatelescope.org/ska-science-data-challenge-1/) in order to assess the applicability of networks trained on existing archival survey data to more sensitive observations. In doing this we will adapt our architecture to implement joint object detection and classification. The student will then extend this purely morphological radio galaxy classification to incorporate catalogue-based multi-wavelength data as well as radio images. This multi-messenger classification will require us to determine methods for handling multiple selection biases from different input data-sets within the same classifier. Rather than addressing these multiple biases directly, the student may implement a multi-task network, which will allow them to marginalise over multiple different representations to minimise overfitting/bias from any individual dataset.
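The idea of treating a selection bias as a form of class imbalance, as in approach (i) above, can be illustrated with a standard post-hoc prior correction. The sketch below is not the proposed "Bayesian layer" and is not taken from the project; it is a simpler, commonly used adjustment, assuming the classifier outputs class posteriors that reflect the class fractions of its (biased) training set. The function name and all numerical priors are hypothetical.

```python
def reweight_posterior(posterior, train_prior, target_prior):
    """Adjust class posteriors for a change in class prior.

    Each posterior p(c | x) is scaled by target_prior[c] / train_prior[c]
    and the result is renormalised to sum to one. This assumes that only
    the class fractions differ between the training and target populations.
    """
    weighted = {
        c: posterior[c] * target_prior[c] / train_prior[c] for c in posterior
    }
    total = sum(weighted.values())
    return {c: w / total for c, w in weighted.items()}


# Hypothetical class fractions in a selection-biased training set.
train_prior = {"FRI": 0.1, "FRII": 0.9}

# Hypothetical class fractions expected in the underlying population.
target_prior = {"FRI": 0.5, "FRII": 0.5}

# Hypothetical classifier output for a single source.
posterior = {"FRI": 0.3, "FRII": 0.7}

corrected = reweight_posterior(posterior, train_prior, target_prior)
for label, prob in corrected.items():
    print(f"{label}: {prob:.3f}")
```

With these assumed numbers the source, initially assigned to the over-represented class, is reassigned to the other class once the training-set imbalance is divided out. A learned, parameterised prior of the kind described in approach (ii) would in effect replace the fixed `target_prior` with quantities inferred during training.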