How we boosted SQL query accuracy by 33% with LLMs


An LLM-based SQL generator seems like a straightforward product with obvious value. It can function as a standalone platform or as a tool within a larger agent-based system. Fortunately, modern models can generate code with varying degrees of success.

But is it that simple? Can we just hook up a powerful model’s API, grant it database access, ask a question, and replace a human analyst? Of course not. An analyst’s job involves far more than just writing and executing SQL queries.

Let’s focus on the SQL generation part and explore why even that is a complex challenge.

  • No company will pipe sensitive data into an external API from OpenAI or Anthropic. It’s a non-starter.

  • If external APIs are off the table, you need to self-host a model. Doing this efficiently, especially without a GPU, is expensive and requires infrastructure that most companies don’t have.

  • Open-source LLMs often struggle with complex, poorly structured contexts or specific SQL dialects, like PostgreSQL 17.

  • Collecting data and training an LLM for a niche SQL dialect is expensive in terms of time, money, and computational resources.

  • While you can use grammars for guided decoding with local models, developing and debugging them is a non-trivial task.

These problems are significant. Let’s explore how we can tackle them.

What makes a model great at generating SQL?

In our field, the proof is in the pudding. Let’s examine a few established approaches that have delivered excellent results on industry benchmarks.

CHASE-SQL

The creators of CHASE-SQL argue that using SQL generation techniques in isolation is inefficient. Each method has unique strengths and weaknesses. Their approach? Use everything.

If benchmark accuracy is the primary goal, trumping accessibility and speed, why not leverage the entire arsenal of available tools?

CHASE-SQL solves a task using several parallel pipelines. The output from each is fed into a selector model, which picks the best final answer. Each pipeline typically involves advanced prompt engineering with a large model (like Gemini or Claude) and a self-reflection step to correct syntax errors.
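As a rough illustration (not the authors' code), the control flow looks like this; the pipeline and selector callables are hypothetical placeholders for the prompt-engineered components described above:

```python
from typing import Callable, List

# Hypothetical interfaces: a pipeline maps (question, schema) -> candidate SQL,
# a selector maps (question, schema, candidates) -> index of the preferred one.
Pipeline = Callable[[str, str], str]
Selector = Callable[[str, str, List[str]], int]

def chase_sql_answer(question: str, schema: str,
                     pipelines: List[Pipeline], selector: Selector) -> str:
    # Each pipeline (e.g. divide-and-conquer CoT, query-plan CoT, few-shot)
    # produces its own candidate, typically followed by a self-reflection pass.
    candidates = [generate(question, schema) for generate in pipelines]
    # The selector model compares the candidates and returns the best index.
    return candidates[selector(question, schema, candidates)]
```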

SQLFuse

SQLFuse is a landmark in the Text-to-SQL field, combining the best of prompt engineering, multi-module pipelines, and fine-tuning.

The model has four main components (sketched in the snippet after the list):

  1. Schema Mining. Handling prompts and creating context for the model, including tables, columns, and constraints like primary and foreign keys.

  2. Schema Linking. Identifies a meaningful subset of columns and tables for SQL generation, implemented using an LLM trained specifically for this task with SFT (Supervised Fine-Tuning).

  3. SQL Generation. The core model, which receives the accumulated context and the user’s question. It’s fine-tuned on the Spider benchmark’s training dataset.

  4. SQL Critic. A validation module that reviews the generated SQL. It’s equipped with examples of good and bad practices, similar question-answer pairs, and other a priori information to refine the output.
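A minimal sketch of how these four stages might chain together, assuming llm is any callable that maps a prompt string to a completion string (the prompts are illustrative, not SQLFuse's actual ones):

```python
from typing import Callable

def text_to_sql(question: str, schema_dump: str, llm: Callable[[str], str]) -> str:
    # 1. Schema Mining: the raw context (tables, columns, PK/FK constraints)
    #    is assumed to be extracted upstream and passed in as schema_dump.
    # 2. Schema Linking: keep only the tables and columns relevant to the question.
    linked = llm(f"Schema:\n{schema_dump}\n\nQuestion: {question}\n"
                 "List only the tables and columns needed to answer it.")
    # 3. SQL Generation: the core (fine-tuned) model receives the pruned context.
    sql = llm(f"Relevant schema:\n{linked}\n\nQuestion: {question}\n"
              "Write a single SQL query.")
    # 4. SQL Critic: a reviewing pass that may return a corrected query.
    return llm(f"Question: {question}\nCandidate SQL:\n{sql}\n"
               "Fix any mistakes and return the final query only.")
```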

SkyRL-SQL

This approach uses Reinforcement Learning (RL) to train a base model to solve tasks in multiple steps (Multi-Turn RL). It’s the closest conceptual parallel to the method we describe in this article.

Ultimately, combining model fine-tuning, sophisticated context creation, and multi-stage pipelines is what makes SQL generation a stable and promising field.

How to benchmark SQL generators

Evaluating SQL generators “by eye” is not a serious option. The standard is to use well-designed, community-vetted benchmarks. For Text-to-SQL, the two most accepted are Spider and BIRD.

Spider

Spider is a classic, cross-domain benchmark that tests a model’s ability to generalize to new database schemas and generate complex SQL (e.g., queries with JOIN, nested subqueries, and GROUP BY/HAVING clauses).

Key metrics (illustrated in the snippet below):

  • Exact Match (EM). The generated SQL is identical to the reference query.

  • Execution Accuracy (EX). The execution result of the generated SQL matches the result of the reference query.
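Here is a simplified comparison of the two metrics, assuming SQLite databases as Spider uses; the official evaluators are far more careful (component-wise matching for EM, value normalization for EX), so treat this as a sketch of the idea only:

```python
import sqlite3
from collections import Counter

def exact_match(pred_sql: str, gold_sql: str) -> bool:
    # Naive EM: compare whitespace- and case-normalized query text.
    normalize = lambda s: " ".join(s.lower().split())
    return normalize(pred_sql) == normalize(gold_sql)

def execution_accuracy(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    # EX: execute both queries and compare the result multisets.
    with sqlite3.connect(db_path) as conn:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
        except sqlite3.Error:
            return False  # a query that fails to execute scores 0
        gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(pred_rows) == Counter(gold_rows)
```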

Benchmark examples:

| Question | Answer |
|---|---|
| How many heads of the departments are older than 56? | SELECT count(*) FROM head WHERE age > 56 |
| List the name, born state, and age of the heads of departments ordered by age. | SELECT name, born_state, age FROM head ORDER BY age |
| List the creation year, name, and budget of each department. | SELECT creation, name, budget_in_billions FROM department |
| What is the maximum and minimum budget of the departments? | SELECT max(budget_in_billions), min(budget_in_billions) FROM department |
| What is the average number of employees of departments whose rank is between 10 and 15? | SELECT avg(num_employees) FROM department WHERE ranking BETWEEN 10 AND 15 |

BIRD

BIRD is a large-scale benchmark designed to reflect real-world scenarios, with a strong focus on correct value extraction from tables.

Metrics:

  • Execution Accuracy (EX). As in Spider, the execution result of the generated SQL must match the result of the reference query.

  • Valid Efficiency Score (VES). Weights execution-accurate queries by how efficiently they run relative to the reference query.

Benchmark examples:

| Question | Answer |
|---|---|
| What is the ratio of customers who pay in EUR vs. CZK? | SELECT CAST(SUM(CASE WHEN Currency='EUR' THEN 1 ELSE 0 END) AS DOUBLE)/SUM(CASE WHEN Currency='CZK' THEN 1 ELSE 0 END) FROM customers |
| In 2012, who had the least consumption in LAM? | SELECT T1.CustomerID FROM customers AS T1 INNER JOIN yearmonth AS T2 ON T1.CustomerID=T2.CustomerID WHERE T1.Segment='LAM' AND SUBSTR(T2.Date,1,4)='2012' GROUP BY T1.CustomerID ORDER BY SUM(T2.Consumption) ASC LIMIT 1 |
| Average monthly consumption of SME customers in 2013? | SELECT AVG(T2.Consumption)/12 FROM customers AS T1 INNER JOIN yearmonth AS T2 ON T1.CustomerID=T2.CustomerID WHERE SUBSTR(T2.Date,1,4)='2013' AND T1.Segment='SME' |
| Difference in gas consumption between CZK and EUR customers in 2012? | SELECT SUM(CASE WHEN T1.Currency='CZK' THEN T2.Consumption ELSE 0 END)-SUM(CASE WHEN T1.Currency='EUR' THEN T2.Consumption ELSE 0 END) FROM customers AS T1 INNER JOIN yearmonth AS T2 ON T1.CustomerID=T2.CustomerID WHERE SUBSTR(T2.Date,1,4)='2012' |
| Which year recorded the most gas consumption paid in CZK? | SELECT SUBSTR(T2.Date,1,4) FROM customers AS T1 INNER JOIN yearmonth AS T2 ON T1.CustomerID=T2.CustomerID WHERE T1.Currency='CZK' GROUP BY SUBSTR(T2.Date,1,4) ORDER BY SUM(T2.Consumption) DESC LIMIT 1 |

How to fine-tune reasoning models

Why classic SFT fails

Supervised Fine-Tuning (SFT) trains a model on labeled (input, output) pairs to minimize a loss function like cross-entropy. In essence, the model learns to imitate correct answers.

For tasks requiring long reasoning (SQL, coding, mathematics), SFT hits two major limits:

  1. No ready-made reasoning datasets. Open datasets either lack chain-of-thought (CoT) or have short, noisy, inconsistent explanations.

  2. No alignment with end goals. SFT optimizes token plausibility, whereas practical metrics are execution correctness (EX for SQL), user preferences, etc.

As a result, the model writes better, but doesn’t necessarily solve better.

Generating a reasoning dataset for SFT

To bridge this gap, practitioners often “reverse-mine” the CoT. You give a powerful model a problem and the correct answer, then ask it to reconstruct a plausible reasoning process that leads to that answer.

This involves generating multiple reasoning paths, automatically verifying them, selecting the best ones, normalizing the style, and packaging them into an (input -> reasoning + answer) format. This new dataset teaches a student model both the format and the logic of reasoning, not just the final output.
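A hedged sketch of that loop; teacher, extract_sql, and same_result are hypothetical callables (the last one could be the execution check shown earlier), not part of any specific library:

```python
def mine_reasoning(question, schema, gold_sql, teacher, extract_sql, same_result,
                   n_attempts=4):
    prompt = (f"Schema:\n{schema}\n\nQuestion: {question}\n"
              f"Correct SQL:\n{gold_sql}\n\n"
              "Explain step by step how to arrive at this query, "
              "then restate the final query.")
    for _ in range(n_attempts):
        reasoning = teacher(prompt)
        candidate = extract_sql(reasoning)
        # Keep only chains whose final query still reproduces the gold result.
        if candidate and same_result(candidate, gold_sql):
            return {"input": f"{schema}\n\n{question}", "output": reasoning}
    return None  # no verified chain found: the example is dropped
```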

While this works, it has three major drawbacks:

  1. Expensive and time-consuming. A strong teacher model costs both money and time, and verification isn’t free either — setting up environments, executing queries, and regenerating failed examples all take resources.

  2. Not always effective. Sometimes the teacher reconstructs a reasoning chain that looks elegant but is logically inconsistent. After filtering, fewer high-quality examples remain than desired.

  3. Hard to avoid leaking prior knowledge from the prompt. For reverse-mining to work at all, prompts usually include hints — the answer or solution structure. This prior knowledge ends up in the SFT dataset, which can make SFT largely pointless.

The power of RL

Reinforcement Learning (RL) treats a model as a policy that selects actions to maximize an expected reward.

The three core components are the state (the model’s context), the action (the next token or sequence), and the reward. An agent interacts with an environment, tries different actions, receives feedback (reward), and updates its policy to maximize future rewards.

For LLMs, a reward can be anything from a passing unit test to a positive user rating. The key difference from SFT is that RL directly optimizes for the target metric, not for token likelihood. If we want high execution accuracy (EX), we reward the model for it.
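For SQL, that reward can be as simple as the sketch below, which reuses the execution_accuracy helper sketched in the Spider section; the small format bonus is an illustrative shaping choice, not a fixed recipe:

```python
def sql_reward(pred_sql: str, gold_sql: str, db_path: str) -> float:
    reward = 0.0
    # Tiny shaping term: at least a syntactically plausible SELECT was produced.
    if pred_sql.strip().lower().startswith("select"):
        reward += 0.1
    # Main signal: the generated query returns the same result set as the gold one.
    if execution_accuracy(pred_sql, gold_sql, db_path):
        reward += 1.0
    return reward
```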

GRPO

GRPO (Group Relative Policy Optimization) is a PPO variant for RL fine-tuning, where training is performed on groups of multiple samples for the same query.

The crucial point is that the policy is optimized not based on absolute rewards, but on relative advantages within the group of responses. This reduces gradient variance and eliminates the need for a Value Model (see PPO for details).

For each query q, we sample G answers \{o_i\}_{i=1}^G from the old policy \pi_{\theta_{\mathrm{old}}}, calculate their rewards \{R_i\}, and convert them into relative advantages:

A_i = R_i - \frac{1}{G}\sum_{j=1}^{G} R_j

The probability ratio is calculated at the token level:

r_{i,t}(\theta) =\frac{\pi_{\theta}\!\big(o_{i,t}\mid q,\,o_{i,<t}\big)}     {\pi_{\theta_{\mathrm{old}}}\!\big(o_{i,t}\mid q,\,o_{i,<t}\big)}.

The PPO-style objective function is:

\mathcal{J}(\theta)=\mathbb{E}_{q\sim P(q),\,\{o_i\}\sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r_{i,t}(\theta)\,A_i,\;\operatorname{clip}\!\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,A_i\Big)\right]\!.
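A toy PyTorch sketch of this objective for a single query, following the formulas above (advantages are mean-centered within the group, ratios and clipping are per token); the tensor shapes and mask convention are assumptions of the sketch:

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, mask, eps=0.2):
    # logp_new, logp_old, mask: (G, T) token log-probs / padding mask; rewards: (G,)
    adv = rewards - rewards.mean()                        # group-relative advantage A_i
    ratio = torch.exp(logp_new - logp_old)                # token-level ratios r_{i,t}
    unclipped = ratio * adv[:, None]
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv[:, None]
    per_token = torch.min(unclipped, clipped) * mask
    # Average over each answer's tokens, then over the group; negate to get a loss.
    per_seq = per_token.sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return -per_seq.mean()
```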

GSPO

GSPO (Group Sequence Policy Optimization) is a newer RL approach, similar to GRPO, but its objective function’s gradient depends on the entire sequence, not just individual tokens.

This is achieved with a small change to the formula:

\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}_{q\sim P(q),\, \{o_i\}_{i=1}^{G}\sim \pi_{\theta_{\mathrm{old}}}(O\mid q)} \Bigg[   \frac{1}{G}\sum_{i=1}^{G} \min\!\Big(       s_i(\theta)A_i,\,       \operatorname{clip}\!\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)A_i   \Big) \Bigg]

where s_i(\theta) = \left(     \frac{\pi_{\theta}(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}   \right)^{1/\lvert o_i\rvert} = \exp\!\left(     \frac{1}{\lvert o_i\rvert}     \sum_{t=1}^{\lvert o_i\rvert}     \log\frac{\pi_{\theta}(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}   \right)

This modification leads to more stable training and faster convergence, especially for long reasoning chains and Mixture-of-Experts (MoE) models.
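Against the GRPO sketch above, the only change is that the ratio is computed once per sequence as a length-normalized (geometric-mean) ratio, so clipping acts on whole answers; the clipping range eps is a hyperparameter and is typically chosen differently from the token-level case:

```python
import torch

def gspo_loss(logp_new, logp_old, rewards, mask, eps):
    adv = rewards - rewards.mean()                                      # (G,)
    # s_i = exp( mean over real tokens of [log pi_new - log pi_old] )
    seq_ratio = torch.exp(((logp_new - logp_old) * mask).sum(dim=1)
                          / mask.sum(dim=1).clamp(min=1))               # (G,)
    unclipped = seq_ratio * adv
    clipped = torch.clamp(seq_ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```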

You can find more details on GRPO vs. GSPO here.

GGPO

We can enhance this process by adding a GBNF grammar to the sampling parameters in vLLM, as I’ve described previously. This prevents syntactically incorrect SQL from being generated, though it does require a suitable grammar.

We call this combined framework GGPO (Guided Grammar Policy Optimization).
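Here is a hedged sketch of constraining rollouts with a grammar in vLLM. The guided decoding API has moved between vLLM releases, so this assumes a version exposing GuidedDecodingParams with a grammar field, and the tiny grammar is purely illustrative: a real SQL grammar is far larger and may need adapting to the backend's grammar dialect.

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Illustrative toy grammar, not a usable SQL grammar.
TOY_SQL_GRAMMAR = r"""
root  ::= "SELECT " ident " FROM " ident
ident ::= [a-zA-Z_]+
"""

llm = LLM(model="Qwen/Qwen3-0.6B")
params = SamplingParams(
    temperature=0.7, top_p=0.9, top_k=20, max_tokens=128,
    guided_decoding=GuidedDecodingParams(grammar=TOY_SQL_GRAMMAR),
)
out = llm.generate(["Question: how many departments are there?\nSQL:"], params)
print(out[0].outputs[0].text)
```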

The dataset

We prepared two custom datasets for this project:

  • a training set derived from BIRD, adapted to PostgreSQL;

  • a synthetic set of question-query pairs.

Both datasets were filtered to ensure all samples work correctly on a PostgreSQL database. This involved two checks (a sketch of the second one follows the list):

  1. Databases for both BIRD and the synthetic set can be created and populated without errors.

  2. All SQL queries execute correctly and return a non-empty result.
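A minimal version of the second check, assuming psycopg2 and a scratch PostgreSQL instance (the DSN is a placeholder):

```python
import psycopg2

def query_returns_rows(dsn: str, sql: str) -> bool:
    try:
        with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchone() is not None   # must return a non-empty result
    except psycopg2.Error:
        return False                            # failing queries are dropped

# samples = [s for s in samples if query_returns_rows(DSN, s["sql"])]
```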

We also have a demo dataset with reasoning content for preliminary SFT, but its quality is still under review: BIRD-reasoning-postgresql.

Results

Technical details (a hedged configuration sketch follows the list):

  • Training was done on an A100 80GB GPU over three days.

  • Framework — TRL (GRPO with sequence-level importance sampling, which effectively implements GSPO).

  • Group size — 10 generations per sample.

  • LoRA. Rank=32, α=64.

  • Rollouts generated with vLLM.

  • Sampling — temperature=0.7, top_p=0.9, top_k=20.

  • Batch size — 20.
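A hedged sketch of this setup with TRL and PEFT; the field names follow recent TRL releases (importance_sampling_level="sequence" is the switch that makes the update GSPO-style), and the reward function and dataset are placeholders rather than our actual code:

```python
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def reward_fn(completions, **kwargs):
    # Placeholder: the real reward executes the generated SQL and scores EX.
    return [0.0 for _ in completions]

train_dataset = Dataset.from_dict({"prompt": ["How many departments are there?"]})

config = GRPOConfig(
    output_dir="qwen3-0.6b-sql-gspo",
    per_device_train_batch_size=20,
    num_generations=10,                    # group size G
    importance_sampling_level="sequence",  # sequence-level ratios (GSPO)
    temperature=0.7, top_p=0.9, top_k=20,
    use_vllm=True,                         # rollouts served by vLLM
)
peft_config = LoraConfig(r=32, lora_alpha=64, task_type="CAUSAL_LM")

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    reward_funcs=reward_fn,
    args=config,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```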

On our validation set, we saw an ~11% increase in the target metric compared to the base Qwen3-0.6B model. This is a strong result, especially since the first training epoch was only 28% complete, indicating significant potential for further gains.

You can try the model yourself here.

BIRD benchmark performance

How does our GSPO fine-tuned model (FT) compare to the base model Qwen3-0.6B (BASE) on the BIRD benchmark’s mini-dev set?

We ran 500 benchmark examples 30 times on both models, with temperature=0.3 and top_p=0.9. We measured EX across three categories: simple (148 examples), moderate (250 examples), and challenging (102 examples).

Here are the results:

| Category | BASE (Mean ± Std) | FT (Mean ± Std) | Diff | Relative % | 95% CI | p-value | Holm-adj p | Hedges' g | Result |
|---|---|---|---|---|---|---|---|---|---|
| simple (N=148) | 24.055 ± 1.964 | 23.807 ± 2.021 | −0.248 | −1.03% | [−1.239; 0.744] | 0.63647 | 0.88056 | −0.123 | not significant |
| moderate (N=250) | 6.627 ± 1.006 | 6.053 ± 0.865 | −0.573 | −8.65% | [−1.040; −0.120] | 0.02310 | 0.06930 | −0.603 | not significant |
| challenging (N=102) | 2.744 ± 1.322 | 3.659 ± 1.285 | +0.915 | +33.33% | [0.261; 1.568] | 0.01120 | 0.04480 | +0.692 | significant |
| total (N=500) | 10.993 ± 0.797 | 10.820 ± 0.846 | −0.173 | −1.58% | [−0.587; 0.234] | 0.44028 | 0.88056 | −0.208 | not significant |

The results are fascinating. The BASE model is already quite capable, and our FT model didn’t surpass its total score. However, the category breakdown tells a different story. We achieved a massive +33% relative increase in execution accuracy on challenging queries.

Because the challenging set is the smallest, this significant gain barely moved the total score. We did see a slight, though not statistically significant, performance drop on moderate queries after applying the Holm-Bonferroni correction for multiple comparisons.

Our primary statistical method was a permutation test with Welch’s t-statistic, which is robust to unequal variances and doesn’t require assumptions about data distribution. We also used Hedges’ g to estimate effect size, where a value of ~0.8 indicates a large effect. Our +0.692 on challenging queries is a strong signal.
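For reference, here is a compact sketch of that procedure; base and ft would be the 30 per-run EX scores for one difficulty category, and the number of permutations is an arbitrary choice:

```python
import numpy as np

def welch_t(a, b):
    # Welch's t-statistic: robust to unequal variances between the two samples.
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))

def permutation_pvalue(a, b, n_perm=100_000, seed=0):
    # Two-sided permutation test using Welch's t as the test statistic.
    rng = np.random.default_rng(seed)
    observed = abs(welch_t(a, b))
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed
    return (hits + 1) / (n_perm + 1)

def hedges_g(base, ft):
    # Cohen's d with pooled std, times the small-sample correction factor.
    n1, n2 = len(base), len(ft)
    sp = np.sqrt(((n1-1)*base.var(ddof=1) + (n2-1)*ft.var(ddof=1)) / (n1+n2-2))
    return (ft.mean() - base.mean()) / sp * (1 - 3 / (4*(n1+n2) - 9))
```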

Conclusion: our fine-tuning reliably boosted the model’s performance on the most difficult SQL queries. The slight dip on medium-difficulty tasks was not statistically significant.

Next steps

To further improve performance, you can:

  1. Implement a two-stage fine-tuning process: preliminary SFT on our reasoning dataset, followed by GSPO.

  2. Calibrate the training dataset for a more uniform distribution of difficulty.

  3. Increase the LoRA rank.

  4. Extend the training time to cover more of the training set.

  5. Integrate a guided decoding grammar covering our training and validation sets.


