Accelerating dataset generation for machine learning using large language models: a pharmaceutical additive manufacturing case

Creating high-quality datasets for training machine learning models in specialized domains like pharmaceutical research is often constrained by the manual effort required to extract and compute critical parameters from heterogeneous literature. A novel deep prompt-engineering framework was developed to transform GPT-4 into a robust tool for automated and accelerated generation of structured datasets. Using a multi-set prompt strategy, GPT-4 analysed 70 full-text articles from literature on pharmaceutical inkjet printing to extract and compute 22 domain-relevant variables. These variables were organized into three main parameter groups: (i) printing parameters, (ii) rheological properties, and (iii) drug dose parameters, which were analysed using dedicated prompts. The outputs were benchmarked against a human-curated dataset compiled over four months by four domain experts, previously used to train machine learning models for predicting inkjet printability.

Partager

Besoin d’informations complémentaires ou envie d’échanger 
avec nos experts ?
Contactez-nous dès maintenant.

Restez informé et ne manquez rien de nos nouveautés.