CV
My background
I studied maths and physics, completing my PhD in 2015. Since then I have been solving data science problems in business, government and academia, with a recent focus on clinical data science.
My current projects involve applying machine learning and statistical techniques to improve cancer outcomes, using linked primary care data. This is a complex area in terms of the data and domain knowledge required, and it often involves working with many different stakeholders.
My technical skill set ranges from data cleaning and analysis (Pandas, Polars, DuckDB, Arrow) to machine learning model development (PyTorch, Keras), data visualisation (Seaborn, ggplot), deployment (Azure) and, recently, fine-tuning LLMs to facilitate data science workflows.
A large part of my current and previous roles involves communicating across domains, for example working with clinicians to refine ideas so that they can be translated into data science problems. This is challenging but also rewarding, and a good source of ideas for applications, tools and open problems.
Recent achievements
- Developed and implemented an algorithm for cancer risk prediction in primary care patients with non-specific symptoms, assisting clinicians to better identify high-risk patients.
- Helped to develop the first Australian primary care data linkage platform, enabling research across the entire patient continuum of care.
- Delivered a consultancy project with the Victorian Department of Health to ensure that disease surveillance algorithms are rigorously analysed for bias and fairness, in order to mitigate potential harms against particular cohorts of patients.
- Worked with a broad range of stakeholders including clinicians, statisticians, and health professionals to develop data science capability.
Some previous achievements
- Led the development of a COVID-19 geolocation tool for identifying cases at high-risk locations, enabling research, reporting and analysis throughout the pandemic.
- Developed an open-source Python package for scalable geocoding, allowing organisations to geocode thousands of addresses per second within their own environments and at no cost, bypassing the need for commercial APIs.
Work history
- 2025 - present: Machine Learning Engineer, Beyond Blue
- 2021 - 2025: Data Scientist and Researcher, Victorian Comprehensive Cancer Centre Data Connect
- 2018 - 2021: Data Scientist, Victorian Centre for Data Insights
- 2016 - 2018: Signal Processing and Machine Learning Scientist, DST Group
- 2016 - 2018: Data Analyst and Programmer, GapMaps
Skill Set
Data science:
- Python: Pandas, Polars, NumPy, scikit-learn, Keras, PyTorch, spaCy, Requests, Jupyter Notebook
- R: Arrow, Tidyverse
- SQL: PostgreSQL, DuckDB
- Algorithms: Clustering (K-means, Spectral Clustering), Random Forests, LDA, Word Embeddings, Linear and Logistic Regression, Neural Networks, Record Linkage
- Visualisation: Python (Seaborn, Matplotlib, Plotly Express), R (ggplot), Tableau
Data engineering:
- Databricks, Azure: development and deployment of models
Software development:
- Packages, code and notebooks as part of data science projects
- Development of the whereabouts package for scalable geocoding and record linkage in Python, presented at PyConAU 2023 in Adelaide. Jointly awarded the Venables Award for Open Source Software, 2024.
- GitHub / GitLab
Education
- PhD thesis, Discretely holomorphic observables in statistical mechanics, University of Melbourne
- Bachelor of Science (First Class Honours), Major in Mathematical Physics
- Bachelor of Arts, Major in Spanish and Latin American Studies
Recent publications
2024
- Replication of a diagnostic accuracy study for cancer risk in primary care patients with unintended weight loss, Lee A, et al., 2024, Under review
- Development and validation of a phenotyping algorithm for primary care patients with unintended weight loss, Lee A, et al., 2024, Under review
- Patient Preferences for Investigation of Cancer Symptoms in Australian General Practice: A Discrete Choice Experiment, Brent Venning, Alison Pearce, Richard De Abreu Lourenco, Rebekah Hall, Rebecca Bergin, Alex Lee, Keith Donohoe, Jon Emery, British Journal of General Practice, Feb 2024
2023
- Data Resource Profile: Victorian Comprehensive Cancer Centre Data Connect, Lee A, McCarthy D, Bergin R, et al., International Journal of Epidemiology, Volume 52, Issue 6, December 2023, Pages e292–e300
- Factors affecting patient decisions to undergo testing for cancer symptoms: an exploratory qualitative study in Australian general practice, Brent Venning, Rebecca Bergin, Alison Pearce, Alex Lee, Jon D Emery, BJGP Open, Mar 2023
Recent presentations
- 2024: Illuminating the cancer continuum of care through large-scale primary care data linkage, Health Services Research Conference, Brisbane, Australia
- 2024: Replication of a diagnostic accuracy study for primary care patients with unintended weight loss, Cancer in Primary Care Network Conference, Melbourne, Australia
- 2024: Research capabilities of linked data, VCCC Data Connect Showcase, Melbourne, Australia
- 2023: What’s the big deal with large language models?, Primary Care Cancer Team PARROT seminar
- 2023: Fast, accurate, open-source geocoding in Python, PyConAU, Adelaide, Australia
- 2023: VCCC Data Connect launch, Melbourne, Australia
- 2022: Comparing primary care patients with unintended weight loss between Australia and the UK, Oxford University, United Kingdom
- 2022: Machine learning for detection of upper GI cancers, Making Digital Health Real seminar, University of Melbourne
Teaching
- Currently co-designing and teaching a course on Data Science for Health Professionals
Supervision
- Currently co-supervising two PhD students
- Co-supervised one Masters student
- Mentored a PhD student and research assistant in clinical data science