The Limits of Fine-Tuning Large Language Models for Question Answering
Large language models (LLMs) like GPT-3 have shown impressive capabilities in question answering when fine-tuned on domain-specific data. However, this approach has inherent limitations.
Fine-tuning teaches the LLM new patterns and tasks, but needs to impart proper understanding or knowledge about the world. The model may learn to answer questions about medical diagnoses, but it needs to gain an understanding of biology or pathology. This can lead to issues like confabulation or hallucination when the model generates plausible-sounding but incorrect answers.
Fine-tuning also requires large datasets of question-answer pairs in the target domain, which can be time-consuming and expensive to collect and annotate. Retraining is needed whenever new documents are added to the knowledge corpus. This lack of scalability limits the flexibility of fine-tuned LLMs.
A better approach may be semantic search over a corpus of documents. Here, documents are indexed based on embedding-based semantic similarity. User questions are converted to queries using an LLM, and the most relevant passages retrieved. These passages can be summarized and synthesized into an answer by the LLM.
This architecture separates knowledge (in the document corpus) from reasoning (the LLM). It also scales seamlessly as new documents are added without expensive retraining. The system behaves much like a human researcher - searching a library to find promising books, skimming them for relevant facts, and compiling an answer.
Future progress will require architectures beyond simple scaling to equip LLMs with more robust reasoning, causal understanding, and epistemological abilities. Responsible disclosure practices around model capabilities will also be essential to ensure benefits outweigh risks.
In summary, while fine-tuning has uses in adapting LLMs for new domains, a combination of techniques, including semantic search, better architectures, and transparency around capabilities, will likely be needed to enable robust and trustworthy question-answering.