Research
We enable researchers in the Department of Surgical Oncology to access new data and align existing databases with our centralized repository of electronic health record (EHR)-based data. This standardizes our data for research and quality improvement projects. For MD Anderson faculty and researchers only: To make a data request, please complete the form.
1. Natural Language Processing (NLP) and Large Language Models (LLMs): Vast amounts of unstructured data are generated in our EHRs, including clinical notes, pathology reports, radiology reports, and operative notes. Manually and retrospectively abstracting key clinical endpoints from these texts is labor-intensive, time-consuming, and error-prone. We are developing and evaluating NLP models, including transformer-based LLMs, to automate the extraction of critical oncologic clinical variables from unstructured documents. Our LLM projects also explore the generation of synthetic clinical data, scalable phenotyping tools, and applications for clinical documentation support. These tools aim to enhance the quality, completeness, and usability of research data while easing the documentation burden on clinicians.
2. Predictive Model Generation: We develop predictive models to improve oncologic care. Using structured and unstructured data, including outputs from NLP pipelines, we are training machine learning and deep learning models to predict important clinical outcomes such as disease recurrence, response to treatment, and surgical complications. Our ultimate goal is to integrate these predictive tools into clinical workflows to support decision-making and personalize treatment strategies.
3. Synthetic Data Generation: Access to large, comprehensive datasets is critical for research and the development of predictive models. However, access to relevant clinical datasets is often limited by patient privacy concerns, data silos, and a lack of high-quality data, particularly in rare cancer research. We are using existing data to develop and implement advanced techniques for synthetic data generation. We hope to use this methodology to overcome data scarcity, facilitate data sharing, test hypotheses, and simulate clinical trials.
4. Automated Cancer Data Initiative (ACDI): Unstructured data entry into EHRs creates significant challenges when collecting phenotypic or clinical endpoints for clinical research. Retrospective data collection is limited by a lack of standardization and by significant bias. Manual data collection is also time-consuming and prone to errors. We are implementing a comprehensive structured data entry system that allows us to collect clinical data prospectively and ensure high-quality data for research and clinical decision-making.
If you are interested in collaborating on any research projects, please contact us.