CV

My background

I studied maths and physics, completing my PhD in 2015. Since then I have been solving data science problems in business, government and academia, with a recent focus on clinical data science.

My current projects apply machine learning and statistical techniques to improve cancer outcomes, using linked primary care data. This area is complex in terms of both the data and the domain knowledge required, and it often involves working with many different stakeholders.

My technical skill set ranges from data cleaning and analysis (Pandas, Polars, DuckDB, Arrow) to machine learning model development (PyTorch, Keras), data visualisation (Seaborn, ggplot), deployment (Azure) and, recently, fine-tuning LLMs to support data science workflows.

A large part of my current and previous roles involves communicating across domains, for example working with clinicians to refine ideas so that they can be translated into data science problems. This is challenging but also rewarding, and a good source of ideas for applications, tools and open problems.

Recent achievements

  • Developed and implemented an algorithm for cancer risk prediction in primary care patients with non-specific symptoms, helping clinicians to better identify high-risk patients.
  • Helped to develop the first Australian primary care data linkage platform, enabling research across the entire patient continuum of care.
  • Delivered a consultancy project with the Victorian Department of Health to ensure that disease surveillance algorithms are rigorously analysed for bias and fairness, in order to mitigate potential harms against particular cohorts of patients.
  • Worked with a broad range of stakeholders including clinicians, statisticians, and health professionals to develop data science capability.

Some previous achievements

  • Led the development of a COVID-19 geolocation tool for identifying cases at high-risk locations, enabling research, reporting and analysis throughout the pandemic.
  • Developed an open-source Python package for scalable geocoding, allowing organisations to geocode thousands of addresses per second within their own environments and at no cost, bypassing the need for commercial APIs.

Work history

Skill Set

Data science:

  • Python: Pandas, Polars, NumPy, scikit-learn, Keras, PyTorch, spaCy, Requests, Jupyter Notebook
  • R: Arrow, Tidyverse
  • SQL: PostgreSQL, DuckDB
  • Algorithms: Clustering (K-means, Spectral Clustering), Random Forests, LDA, Word Embeddings, Linear and Logistic Regression, Neural Networks, Record Linkage
  • Visualisation:
    • Python: Seaborn, Matplotlib, Plotly Express
    • Tableau
    • R: ggplot

Data engineering:

  • Databricks, Azure: development and deployment of models

Software development:

Education

  • PhD thesis, Discretely holomorphic observables in statistical mechanics, University of Melbourne
  • Bachelor of Science (First Class Honours), Major in Mathematical Physics
  • Bachelor of Arts, Major in Spanish and Latin American Studies

Recent publications

2024

  • Replication of a diagnostic accuracy study for cancer risk in primary care patients with unintended weight loss, Lee A, et al., 2024, Under review
  • Development and validation of a phenotyping algorithm for primary care patients with unintended weight loss, Lee A, et al., 2024, Under review
  • Patient Preferences for Investigation of Cancer Symptoms in Australian General Practice: A Discrete Choice Experiment, Brent Venning, Alison Pearce, Richard De Abreu Lourenco, Rebekah Hall, Rebecca Bergin, Alex Lee, Keith Donohoe, Jon Emery, British Journal of General Practice, Feb 2024

2023

  • Data Resource Profile: Victorian Comprehensive Cancer Centre Data Connect, Lee A, McCarthy D, Bergin R, et al., International Journal of Epidemiology, Volume 52, Issue 6, December 2023, Pages e292–e300
  • Factors affecting patient decisions to undergo testing for cancer symptoms: an exploratory qualitative study in Australian general practice, Brent Venning, Rebecca Bergin, Alison Pearce, Alex Lee, Jon D Emery, BJGP Open, Mar 2023

Recent presentations

  • 2024: Illuminating the cancer continuum of care through large-scale primary care data linkage, Health Services Research Conference, Brisbane, Australia
  • 2024: Replication of a diagnostic accuracy study for primary care patients with unintended weight loss, Cancer in Primary Care Network Conference, Melbourne, Australia
  • 2024: Research capabilities of linked data, VCCC Data Connect Showcase, Melbourne, Australia
  • 2023: What’s the big deal with large language models?, Primary Care Cancer Team PARROT seminar
  • 2023: Fast, accurate, open-source geocoding in Python, PyConAU, Adelaide, Australia
  • 2023: VCCC Data Connect launch, Melbourne, Australia
  • 2022: Comparing primary care patients with unintended weight loss between Australia and the UK, Oxford University, United Kingdom
  • 2022: Machine learning for detection of upper GI cancers, Making Digital Health Real seminar, University of Melbourne

Teaching

  • Currently co-designing and teaching a course on Data Science for Health Professionals

Supervision

  • Currently co-supervising two PhD students
  • Co-supervised one Masters student
  • Mentored a PhD student and research assistant in clinical data science