Shel Burkes, PhD
Principal Applied Scientist
ML Research
Health Sensing
Computational Biology
Research Profile

PhD scientist specializing in machine learning, clustering analysis, and physiological signal characterization. I develop novel measurement frameworks and composite scoring systems from complex, multi-modal health and biological datasets — including original metrics where no formal quantification previously existed. Experienced in taking research from exploratory concept to validated methodology across consumer health sensing, biological sequence analysis, and applied generative AI.

Technical Skills
Time Series Analysis, Clustering Specialist, Novel Metric Development, Composite Scoring Systems, Physiological Signal Characterization, Python, PyTorch, TensorFlow, Scikit-learn, Deep Learning, Statistical Modeling, Experimental Design, Hypothesis Testing, Signal Processing, PCA · UMAP · K-means, Model Evaluation, Dimensionality Reduction, Bioinformatics, SQL, MongoDB, AWS Athena · SageMaker, Databricks · Snowflake, Git · Bash, Tableau · Plotly Dash
Experience
Boehringer Ingelheim
Sep 2025 — Present
Principal Applied Scientist & Senior Data Scientist (Generative AI)
  • Prototyped and directed development of generative AI and LLM-based solutions, building foundational implementations in Python, PyTorch, and TensorFlow before partnering with engineering for production delivery.
  • Designed an LLM-based misinformation detection pipeline using Databricks and Snowflake — delivered a production-ready PoC in 60 days with automated content classification, clustering, and trend detection.
  • Built a financial potential modeling tool to estimate purchasing capacity across major customer portfolios, enabling commercial teams to refine targeting and prioritize high-value outreach.
  • Navigated regulated data-access processes in a pharmaceutical environment, ensuring all modeling activities complied with internal governance and regulatory standards.
  • Mentored junior data scientists on modeling techniques, code structure, and best practices for scalable, maintainable AI development.
FitSkin
Sep 2021 — Sep 2025
Senior Data Scientist
Led original research into the computational quantification of physiological skin properties, developing novel metrics and deep learning systems to measure attributes — including radiance, tone, and inflammation — that previously existed only as qualitative descriptors.
  • Developed lambent, a novel composite scoring system quantifying skin radiance from multi-modal physiological features, the first formalized computational measure of this property (Pearson r = 0.64 · 96.8% extreme-class accuracy).
  • Built iridis, a perceptual skin tone classification system developed on 2M+ diverse images, incorporating hue variation and human color perception validated with color scientists under D65-calibrated imaging conditions. Clustering methodology recovered structure consistent with Fitzpatrick and ITA scales while extending to 40+ perceptual categories — capturing skin tone variation that existing clinical scales do not represent.
  • Developed argus, a CNN-based anomaly detection system for large clinical imaging databases, deployed via AWS SageMaker.
  • Worked with real-time capacitive sensor data capturing physiological skin states, building longitudinal profiling systems modeling individual baselines over time — directly analogous to continuous health sensing from wearable devices.
  • Investigated computational quantification of hyperpigmentation, redness/inflammation, and hair follicle structure — characterizing previously unmeasured biological attributes from sensor and image data.
  • Led A/B testing and causal inference projects using rigorous experimental protocols to isolate causal effects and inform product decisions across diverse populations.
Syngenta
Sep 2021 — Sep 2022
Data Scientist
  • Built predictive models on haplotype biological datasets to optimize trait selection based on environmental and genetic factors.
  • Designed and deployed scalable bioinformatics pipelines for trait-based prediction and optimization, integrating results into relational databases.
  • Led launch of a decision-making analytics platform as Product Owner using Agile and Scrum methodologies.
NC Research Campus
Sep 2020 — Sep 2021
Postdoctoral Researcher
  • Designed analytical pipelines for large-scale next-generation sequencing (NGS) datasets in Python within UNIX and cloud/HPC environments.
  • Led computational biology projects applying machine learning to NGS datasets for protein sequence analysis and peptide/protein structure annotation.
UNC Charlotte
Jan 2017 — Sep 2020
Graduate Researcher & Teaching Assistant
  • Designed real-time analytical pipelines for large-scale genomic data including A. sativa genome annotation using Illumina and PacBio sequencing.
  • Instructed courses and advised students on bioinformatics tools and data analysis methodologies.
Selected Projects
lambent.
Radiance Quantification · 2025–
Novel composite scoring system quantifying skin radiance — the first formalized computational measure of a property previously assessed only qualitatively. Validated against expert grading using an optimized XGBoost ensemble.
Pearson r = 0.64 · 96.8% extreme-class accuracy
iridis.
Skin Tone Classification · 2024–
Perceptual classification system using K-means and Vision Transformers (ViT) on 2M+ diverse skin images. Addresses inclusivity gaps in existing scales with applications in personalized health sensing.
topos.
Stability-First Discovery Framework · 2025–
Formal framework using biological LLMs to validate and compare clustering methodologies — enabling rigorous go/no-go decisions in research pipelines and earlier resource reallocation.
argus.
Clinical Imaging QC · AWS SageMaker · 2022–
CNN-based anomaly detection system for large clinical imaging databases. Deployed via AWS SageMaker with client-facing SDK for real-time outlier identification.
RepBox.
BMC Bioinformatics · Published 2023 · doi:10.1186/s12859-023-05419-5
Bioinformatics pipeline for identification and classification of novel repetitive genomic elements. Demonstrated a 7% increase in detected repetitive elements and greater diversity of identified types across the A. sativa genome.
Peer-reviewed · BMC Bioinformatics 2023
Education
PhD
Data Science & Bioinformatics
UNC Charlotte
MS
Data Science & Bioinformatics
UNC Charlotte
BS
Biology
UNC Charlotte