Projects
Current projects
Here are some projects I have been working on recently.
LLMs for primary care clinical data wrangling and information extraction
Large Language Models (LLMs) offer the potential to streamline many data science tasks, including data cleaning, data transformation, information extraction and analysis. In the clinical domain this is challenging: clinical data is highly specialised, and sensitivity and privacy concerns mean that sending data to external LLM providers is prohibited. In this project I am fine-tuning open-source LLMs using data that is specific to the primary care setting. The goal is to improve the efficiency of some of the more laborious tasks in clinical data science, saving substantial time for data scientists and analysts so that they can focus on more interesting analytical tasks.
Algorithmic fairness in machine learning models for disease surveillance
Machine learning models are known to exhibit biases. This project involved developing tools for assessing and mitigating bias in a disease surveillance platform that uses streaming data from hospital emergency departments. The tools provide a rigorous methodology for ensuring that decisions made using the algorithms do not disproportionately harm particular groups of people.
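As a rough illustration of the kind of check a bias assessment might include, the sketch below compares false positive rates across groups. It is not the code used in the platform: the function names, the choice of metric and the toy records are all assumptions made for this example.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Per-group false positive rate.

    `records` is an iterable of (group, y_true, y_pred) tuples with
    binary labels (hypothetical input format for this sketch).
    """
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 0:
            negatives[group] += 1
            if y_pred == 1:
                false_pos[group] += 1
    # Only groups with at least one true negative appear in `negatives`.
    return {g: false_pos[g] / negatives[g] for g in negatives}


def max_disparity(rates):
    """Largest gap between any two groups' rates (0.0 if no groups)."""
    values = list(rates.values())
    return max(values) - min(values) if values else 0.0


# Toy data, invented for illustration only.
records = [
    ("A", 0, 0), ("A", 0, 1), ("A", 1, 1), ("A", 0, 0),
    ("B", 0, 1), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),
]
rates = false_positive_rates(records)
gap = max_disparity(rates)
print(rates, gap)
```

A large gap suggests that one group receives more false alerts than another. In practice several metrics would usually be examined together, since no single measure captures every notion of fairness.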
Cancer risk prediction models and electronic decision support
Early detection of cancer is an important factor in reducing mortality. However, many patients who are later diagnosed with cancer first present with so-called non-specific symptoms, which make clinical decision making difficult. This project involved developing machine learning models to identify non-specific symptoms in primary care patients’ electronic health records, and then implementing the resulting algorithms in a clinical decision support tool. This tool is currently being trialled in general practices and will allow clinicians to better identify higher-risk patients, benefiting both clinicians and their patients.
Data linkage
Part of my work has involved helping to develop linked data infrastructure to support research into cancer services by bringing together data from primary care, hospitals and cancer registries. This has been a big team effort, and the result is the first such resource in Australia. It enables researchers and policy makers to see a more complete picture of how cancer patients interact with the health system, in order to improve cancer outcomes. A recent paper describing one of these resources can be found here.
Some previous projects
Geolocation of high-risk COVID-19 cases
During the pandemic, the Victorian Department of Health required a system that could automatically extract location information from contact tracing notes and identify COVID-19 cases at high-risk locations. I led a team that developed a scalable record linkage tool for this purpose. This tool was used for COVID-19 reporting, analysis and research throughout the pandemic and saved substantial time that would otherwise have been spent manually reviewing data.
Automated tools for geospatial data collection
In my work at Gapmaps, I developed a system for automatically extracting and transforming geospatial data on business locations, replacing the existing manual approach and saving substantial time and money for the business.
Open source projects
Whereabouts
whereabouts is an open-source Python package for geocoding. It allows organisations to geocode thousands of addresses per second, within their own environments and without any cost, bypassing the need for commercial APIs. The geocoding algorithm (based on this paper) is implemented in SQL using DuckDB.
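To give a feel for what geocoding against a local reference table involves, here is a minimal pure-Python sketch that matches a free-text address to the closest reference address using character trigram overlap. It is neither the whereabouts API nor the algorithm from the paper: the function names, the similarity measure, and the addresses and coordinates are all made up for illustration.

```python
def trigrams(text):
    """Set of character 3-grams of an upper-cased, whitespace-normalised string."""
    cleaned = " ".join(text.upper().replace(",", " ").split())
    return {cleaned[i:i + 3] for i in range(len(cleaned) - 2)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(query, reference):
    """Return (score, address, coords) for the most similar reference address.

    `reference` maps address strings to (lat, lon) tuples and must be non-empty.
    """
    q = trigrams(query)
    scored = [
        (jaccard(q, trigrams(address)), address, coords)
        for address, coords in reference.items()
    ]
    return max(scored, key=lambda item: item[0])


# Hypothetical reference data; the coordinates are placeholders, not real locations.
REFERENCE = {
    "12 SMITH ST FITZROY VIC 3065": (-37.80, 144.98),
    "12 SMITH RD FRANKSTON VIC 3199": (-38.14, 145.12),
    "120 SWANSTON ST MELBOURNE VIC 3000": (-37.81, 144.96),
}

score, address, coords = best_match("12 smith st, fitzroy", REFERENCE)
print(address, coords, round(score, 2))
```

Scoring every reference address for every query, as this sketch does, would not scale to a national address file. Reaching thousands of addresses per second requires an indexed approach such as the SQL implementation described above.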