For the past 18 months, I've been running SajuGPT, an AI chatbot focused on the specific domain of "Saju" (traditional Korean fortune-telling and cosmology). When we started, neither our investors nor I were certain it would work. From the outset, we made a conscious decision to avoid relying on large API-based models like ChatGPT. Instead, we focused on fine-tuning smaller models (under 10 billion parameters). Our strategy wasn't to attract a few high-paying users, but to build a large, engaged base of free users.
This journey involved numerous experiments with various sub-10B models. The key takeaway? A well-tuned smaller model can achieve user satisfaction comparable to, or even exceeding, that of large language models (LLMs) within its specific niche. We proved this without significant advertising spend, relying almost entirely on viral growth and word of mouth.
The following chart illustrates the growth in our site's visibility on Google over time (note: the specific Y-axis values are intentionally omitted):
Currently, searching for "사주" (Saju) on Google Korea places our service on the first page, near prominent resources like the Namuwiki entry for "Saju Palja." This organic growth validates our approach.
Interestingly, we made very few changes to the basic menu structure or features after the initial launch. This was partly due to resource constraints, but primarily because we wanted to test a hypothesis: could we drive service improvement solely by enhancing the core model's performance?
Fine-tuning itself isn't the most complex task. The real challenge, especially in the early days with few users, was evaluating whether a given round of tuning actually helped. Gathering meaningful user feedback was slow, and what little we collected was often unreliable.
Once we achieved a critical mass of users, we shifted our evaluation focus. Standard benchmarks proved almost useless for our specific goals. Instead, user session duration became our most crucial metric.
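For concreteness, here's a minimal sketch of how a metric like this can be computed: group each user's message timestamps into sessions separated by an inactivity gap, then average the session lengths. The 30-minute gap and the input shape are assumptions for illustration, not a description of our production pipeline.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=30)  # assumed cutoff: a new session starts after 30 min of silence

def avg_session_duration(timestamps: list[datetime]) -> float:
    """Average session length in seconds for one user's message timestamps."""
    if not timestamps:
        return 0.0
    timestamps = sorted(timestamps)
    durations, start, prev = [], timestamps[0], timestamps[0]
    for ts in timestamps[1:]:
        if ts - prev > GAP:  # inactivity gap exceeded: close the current session
            durations.append((prev - start).total_seconds())
            start = ts
        prev = ts
    durations.append((prev - start).total_seconds())  # close the final session
    return sum(durations) / len(durations)
```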
Our process became straightforward: deploy a newly tuned model and closely monitor the average conversation length of active users. If the session duration didn't increase, we discarded that model iteration, regardless of benchmark scores. Conversely, when session duration did increase, we consistently observed a significant spike in the overall user base one to two months later. This pattern held true repeatedly.
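In code, that go/no-go gate can be sketched roughly as follows. The one-sided Mann-Whitney U test is my illustrative choice here (session durations are heavily skewed, so a rank test is safer than comparing raw means); it is not a claim about the exact statistics behind our decisions.

```python
# A sketch of the deployment gate, assuming per-session durations (in seconds)
# collected for the baseline model and the candidate model.
from scipy import stats

def model_passes_gate(baseline: list[float], candidate: list[float],
                      alpha: float = 0.05) -> bool:
    """True if the candidate's session durations are significantly longer."""
    result = stats.mannwhitneyu(candidate, baseline, alternative="greater")
    return result.pvalue < alpha

# Illustrative numbers only: keep the new model iteration only if users
# demonstrably talk with it for longer.
baseline = [180, 240, 95, 310, 150, 205, 120, 275]
candidate = [285, 355, 160, 430, 245, 315, 200, 390]
print(model_passes_gate(baseline, candidate))
```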
Our experience hammered home a crucial lesson: for fine-tuning, the dataset is more critical than the base model. Once we had accumulated a sufficiently large and high-quality dataset specific to Saju, we found that different base models (within the <10B parameter range) could achieve similar performance levels after tuning.
This isn't to say the base model is irrelevant, of course. A better base model provides a better starting point. However, swapping base models is relatively easy compared to the painstaking effort of building and refining a domain-specific dataset. Therefore, the dataset provides significantly more leverage in the long run.
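To make "swapping the base model" concrete, here is a minimal LoRA fine-tuning sketch using Hugging Face transformers, peft, and datasets. The model name, file path, field names, and hyperparameters are all illustrative assumptions, not our actual setup; the point is that the base model is a single line, while the curated dataset is the asset that persists across experiments.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# The base model is one swappable string; everything below it stays fixed.
BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"  # any sub-10B causal LM would do

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:  # some base models ship without a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# LoRA keeps each run cheap enough to repeat whenever we try a new base model.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# The curated domain dataset is the constant across experiments
# (hypothetical file and field names).
dataset = load_dataset("json", data_files="saju_sft.jsonl", split="train")

def tokenize(example):
    text = example["prompt"] + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="saju-lora",
                           per_device_train_batch_size=2,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=dataset,
    # mlm=False makes the collator emit causal-LM labels from the inputs.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```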
We also confirmed that relying solely on large, general-purpose models like ChatGPT or Claude can be disadvantageous for highly specialized domains. Knowledge areas like Saju and Myeongrihak (a related discipline) are often poorly represented even in massive models. Prompt engineering can only go so far when the underlying model lacks sufficient foundational knowledge. If the target domain knowledge isn't adequately present in the model itself, fine-tuning appears to be the most viable, and perhaps only, path to high performance.
In summary, our journey with SajuGPT highlights several key points for AI engineers working on specialized applications:
- Niche AI is Viable: Don't underestimate the potential of focused AI services.
- Small Models Can Win: Fine-tuned smaller models can outperform LLMs in specific domains when data and tuning are done right.
- Data is Paramount: Invest heavily in curating high-quality, domain-specific datasets. This is likely your biggest lever.
- Benchmarks Aren't Everything: Real user behavior metrics (like session duration) can be far more indicative of model quality and user satisfaction in practice.
- Embrace the Iteration: The process is a continuous loop: gather user interaction data -> refine/create datasets -> fine-tune -> deploy -> monitor user behavior -> repeat.
Building SajuGPT has been a testament to the power of focused effort, domain-specific data, and listening to user behavior over chasing generic benchmarks or relying solely on the largest available models. For engineers building specialized AI, these principles might offer a more effective path to success.