GPT-2 Based Medical Dialogue System

A medical question-answering system built on the GPT-2 language model, fine-tuned on a large corpus of doctor-patient dialogues. The system supports multi-turn conversations and provides both command-line and web-based interfaces for interaction.


Features

Specialized for the medical domain: trained on more than 30,000 real doctor-patient conversations
Fine-tuned from pre-trained GPT-2 for coherent and context-aware response generation
Supports multi-turn dialogue with configurable history length
Provides two deployment options: an interactive command-line tool and a Flask-based web service
Includes a complete preprocessing, training, and inference pipeline

Quick Start

Environment Requirements

Python ≥ 3.6
PyTorch ≥ 1.7.0
Transformers ≥ 4.2.0

Install dependencies:

pip install -r requirements.txt

Data Preparation

Place the training and validation text files in the data directory:

data/
├── medical_train.txt
└── medical_valid.txt

Preprocess the data (if not already done):

python data_preprocess/preprocess.py

Training

python train.py --pretrained_model gpt2-medium

Training parameters such as batch size, learning rate, and number of epochs can be adjusted in parameter_config.py.
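
As a rough illustration of what such a configuration module could contain, here is a minimal sketch. The class name `ParameterConfig`, every field name, and every numeric value are assumptions made for this example; only the file paths come from the directory listing in this README. Check the real parameter_config.py before relying on any of them.

```python
from dataclasses import dataclass


@dataclass
class ParameterConfig:
    """Hypothetical layout; the real parameter_config.py may differ."""

    # Paths taken from this README's directory listing.
    train_path: str = "data/medical_train.txt"
    valid_path: str = "data/medical_valid.txt"
    save_model_path: str = "save_model"
    # Illustrative values only, not the project's actual defaults.
    batch_size: int = 8
    lr: float = 2.6e-5
    epochs: int = 4
    max_history_len: int = 3  # number of past utterances kept as context


if __name__ == "__main__":
    # Override one setting for a single experiment without editing the defaults.
    config = ParameterConfig(batch_size=16)
    print(config)
```

Keeping hyperparameters and paths in one object like this makes it easy to log the exact settings used for a run.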

Inference

Command-line interaction:
```
python interact.py
```
Web interface:
```
python flask_predict.py
```
Then visit http://localhost:5000 in your browser.

Project Structure

Gpt2_Chatbot/
├── data/                   # Training and validation data
├── data_preprocess/        # Data preprocessing scripts
│   ├── preprocess.py
│   ├── dataset.py
│   └── dataloader.py
├── save_model/             # Trained model checkpoints
├── train.py                # Training script
├── interact.py             # Command-line inference
├── flask_predict.py        # Web service
├── app.py                  # Flask application
└── parameter_config.py     # Hyperparameters and paths

Model Architecture

The system pairs GPT2LMHeadModel with a BertTokenizerFast tokenizer and uses its [CLS] and [SEP] special tokens to mark the start of a dialogue and the end of each turn. Input sequences are formatted as:

[CLS] utterance1 [SEP] utterance2 [SEP] ...
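
The format above, combined with the configurable history length, can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the toy `VOCAB`, the whitespace `encode` helper, and the names `build_input_ids` and `max_history_len` are all hypothetical, and the real system would obtain token ids from BertTokenizerFast instead.

```python
from typing import Dict, List

# Toy vocabulary standing in for the real tokenizer's vocab (assumption).
VOCAB: Dict[str, int] = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "hello": 1, "doctor": 2, "i": 3, "have": 4, "a": 5, "headache": 6,
}


def encode(text: str) -> List[int]:
    """Whitespace tokenizer used only for illustration."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]


def build_input_ids(history: List[str], max_history_len: int = 3) -> List[int]:
    """Format the most recent utterances as [CLS] u1 [SEP] u2 [SEP] ..."""
    if max_history_len < 1:
        raise ValueError("max_history_len must be at least 1")
    input_ids = [VOCAB["[CLS]"]]
    # Keep only the last max_history_len utterances as context.
    for utterance in history[-max_history_len:]:
        input_ids.extend(encode(utterance))
        input_ids.append(VOCAB["[SEP]"])
    return input_ids


if __name__ == "__main__":
    print(build_input_ids(["hello doctor", "i have a headache"]))
```

Truncating from the left keeps the most recent turns, which usually matter most for the next reply, while bounding the sequence length fed to the model.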

Generation employs top-k sampling with a repetition penalty to produce fluent and relevant responses.
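
To make the decoding step concrete, here is a minimal pure-Python sketch of top-k sampling with a repetition penalty. It is an assumption about how such a step can be written, not the project's actual implementation: the function names, the default `k` and `penalty` values, and the use of plain lists instead of tensors are all chosen for illustration. The penalty follows a commonly used rule in which positive logits are divided by the penalty and negative logits are multiplied by it.

```python
import math
import random
from typing import List, Optional, Sequence


def apply_repetition_penalty(logits: Sequence[float], generated: Sequence[int],
                             penalty: float = 1.2) -> List[float]:
    """Lower the score of every token id that was already generated."""
    out = list(logits)
    for token_id in set(generated):
        # Dividing a positive logit or multiplying a negative one both reduce it.
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out


def top_k_sample(logits: Sequence[float], k: int = 8,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token id from the k highest-scoring candidates."""
    rng = rng or random.Random()
    k = max(1, min(k, len(logits)))
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = logits[top_ids[0]]
    # Softmax restricted to the top-k ids; subtracting the max keeps exp() stable.
    weights = [math.exp(logits[i] - max_logit) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0, -3.0]
    penalized = apply_repetition_penalty(logits, generated=[0], penalty=2.0)
    print(penalized, top_k_sample(penalized, k=2, rng=random.Random(0)))
```

A smaller `k` makes replies more conservative, while a larger penalty discourages the model from repeating tokens it has already produced.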

Training Metrics

| Metric | Training Set | Validation Set |
| --- | --- | --- |
| Accuracy | 92.3% | 88.7% |
| Perplexity (PPL) | 15.2 | 18.6 |
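
For readers unfamiliar with perplexity, the sketch below shows the standard definition: the exponential of the mean per-token negative log-likelihood. It assumes the reported PPL values follow this definition with natural logarithms, which this README does not state explicitly; the function names are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """exp of the mean per-token negative log-likelihood (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


def mean_loss_from_ppl(ppl: float) -> float:
    """Inverse mapping: the mean cross-entropy implied by a perplexity value."""
    return math.log(ppl)


if __name__ == "__main__":
    # Perplexities reported in the metrics table.
    for name, ppl in [("train", 15.2), ("valid", 18.6)]:
        print(f"{name}: PPL {ppl} <-> mean loss {mean_loss_from_ppl(ppl):.2f} nats")
```

Under this assumption, a lower perplexity corresponds directly to a lower average cross-entropy loss, so the two can be compared across training logs.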

Example Dialogue

User: What auxiliary treatments are available for Parkinson's plus syndrome?

System: Recommended approaches include:

Rehabilitation training (e.g., balance exercises)
Daily living guidance (fall prevention measures)
Low-frequency repetitive transcranial magnetic stimulation