---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
library_name: peft
tags:
- text-to-sql
- postgis
- spatial-sql
- city-information-modeling
- cim
- fine-tuned
- qlora
- peft
datasets:
- taherdoust/ai4cimdb
language:
- en
pipeline_tag: text-generation
---

# Llama 3.1 8B - CIM Spatial SQL (Q2SQL)

Fine-tuned Llama 3.1 8B Instruct model for generating PostGIS spatial SQL queries for City Information Modeling (CIM) databases.

## Model Details

### Model Description

- **Developed by:** Ali Taherdoust (Politecnico di Torino)
- **Model type:** Causal Language Model (Decoder-only)
- **Language(s) (NLP):** English
- **License:** Llama 3.1 Community License
- **Finetuned from model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

### Model Sources

- **Repository:** [GitHub Repository](https://github.com/taherdoust/txt2ssql)
- **Training Dataset:** [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) (176K samples, 88K training)
- **Base Model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

## Uses

### Direct Use

This model is designed to generate PostGIS spatial SQL queries from natural language questions for City Information Modeling (CIM) databases. It can be used to:

- Convert natural language questions to PostGIS SQL queries
- Query spatial urban databases (buildings, census, raster, network data)
- Support non-technical users in accessing CIM databases

### Downstream Use

- Integration into the CIM Wizard framework for natural language database interfaces
- Educational tools for teaching spatial SQL
- Research on text-to-SQL for specialized domains

### Out-of-Scope Use

- General-purpose SQL generation (not trained on standard SQL databases)
- Non-spatial SQL queries (optimized for PostGIS)
- Other natural language tasks (specialized for SQL generation)

## Bias, Risks, and Limitations

- **Domain-Specific**: Trained specifically on CIM/PostGIS queries; may not perform well on standard SQL databases.
- **Spatial Functions**: Requires the PostGIS extension; queries may fail on non-PostGIS databases.
- **Schema Dependency**: Trained on a specific CIM schema (cim_vector, cim_census, cim_raster, cim_network); performance may degrade on different schemas.
- **Language**: Trained on English questions only.
- **Accuracy**: 82-90% execution accuracy; generated SQL should be validated before production use.

### Recommendations

Users should:

- Validate generated SQL queries before execution (a minimal sketch follows this list)
- Test on their specific database schema
- Use in controlled environments with proper error handling
- Review complex queries manually
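The first recommendation is easy to automate: PostgreSQL's `EXPLAIN` plans a statement without executing it, so generated SQL can be checked against the live schema before it touches any data. Below is a minimal validation sketch assuming a `psycopg2` connection (`pip install psycopg2-binary`); the connection string and the table and column names are illustrative, not part of this model card.

```python
import psycopg2

def validate_sql(conn, sql: str) -> tuple[bool, str]:
    """Check that `sql` parses and plans on the target database.

    EXPLAIN (without ANALYZE) only builds a query plan, so nothing
    is read or written even if the statement is wrong or harmful.
    """
    try:
        with conn.cursor() as cur:
            cur.execute("EXPLAIN " + sql)
        return True, "OK"
    except psycopg2.Error as exc:
        return False, str(exc)
    finally:
        conn.rollback()  # discard any transaction state left by the attempt

# Illustrative connection; adjust to your environment
conn = psycopg2.connect("dbname=cim_wizard user=cim")

# A representative model output for "Find all buildings within 500 meters
# of the city center" (hypothetical table/column names, Turin coordinates)
candidate_sql = """
SELECT b.building_id
FROM cim_vector.building AS b
WHERE ST_DWithin(
    b.geometry::geography,
    ST_SetSRID(ST_MakePoint(7.6869, 45.0703), 4326)::geography,
    500
);
"""

ok, message = validate_sql(conn, candidate_sql)
print(ok, message)
```

Note that `EXPLAIN` catches syntax and schema errors but not semantic mistakes (a query can plan cleanly and still answer the wrong question), so manual review of complex queries remains necessary.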
## How to Get Started with the Model

### Installation

```bash
pip install transformers peft accelerate bitsandbytes
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Base model and fine-tuned adapter identifiers
base_model = "meta-llama/Llama-3.1-8B-Instruct"
model_name = "taherdoust/llama31-8b-cim-q2sql"

# 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load fine-tuned adapter
model = PeftModel.from_pretrained(model, model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate SQL from a question
question = "Find all buildings within 500 meters of the city center"
prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in PostGIS spatial SQL for City Information Modeling (CIM).
Your task is to generate precise PostGIS spatial SQL queries for the CIM Wizard database.

Database Schema:
- cim_vector: Building geometries, project scenarios, grid infrastructure
- cim_census: Italian census demographic data (ISTAT 2011)
- cim_raster: DTM/DSM raster data
- cim_network: Electrical grid network data

Generate only the SQL query without explanations.<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)

# Decode only the newly generated tokens so the prompt and special
# tokens are not included in the returned SQL
generated = outputs[0][inputs["input_ids"].shape[-1]:]
sql = tokenizer.decode(generated, skip_special_tokens=True)
print(sql)
```

## Training Details

### Training Data

- **Dataset**: taherdoust/ai4cimdb
- **Training Samples**: 88,480
- **Validation Samples**: 18,960
- **Test Samples**: 18,960
- **Total Dataset Size**: 176,837 samples (126,400 after curation)
- **Data Quality**: 99.7% quality acceptance rate, 99.57% NoErr rate

### Training Procedure

- **Method**: QLoRA (Quantized Low-Rank Adaptation)
- **Quantization**: 4-bit NF4
- **LoRA Rank**: 16
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0.1
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
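For reference, the adapter settings listed above map directly onto a `peft.LoraConfig`. The sketch below reconstructs that configuration from the stated values; `bias="none"` and `task_type="CAUSAL_LM"` are assumed typical defaults, and the authoritative training script lives in the GitHub repository.

```python
from peft import LoraConfig

# Reconstruction of the adapter configuration from the values above;
# not the verbatim training script.
lora_config = LoraConfig(
    r=16,               # LoRA rank
    lora_alpha=32,      # LoRA alpha (scaling)
    lora_dropout=0.1,   # LoRA dropout
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    bias="none",            # assumed default
    task_type="CAUSAL_LM",  # assumed default for decoder-only LMs
)
```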
#### Training Hyperparameters

- **Training regime**: bf16 mixed precision
- **Epochs**: 1
- **Batch Size**: 2 per device
- **Gradient Accumulation Steps**: 8 (effective batch size: 16)
- **Learning Rate**: 1.5e-4
- **Learning Rate Scheduler**: Cosine with warmup
- **Warmup Ratio**: 0.1
- **Weight Decay**: 0.01
- **Max Sequence Length**: 2048 tokens
- **Optimizer**: paged_adamw_8bit

#### Speeds, Sizes, Times

- **Training Time**: ~51 hours (on an NVIDIA RTX 3090)
- **Training Steps**: 5,530
- **Model Size**: ~150 MB (LoRA adapter) + ~15 GB (base model, quantized)
- **Trainable Parameters**: 0.46% of total parameters
- **Final Training Loss**: 0.1308
- **Final Evaluation Loss**: 0.0972

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- **Test Set**: 18,960 samples from taherdoust/ai4cimdb
- **Evaluation Benchmark**: 100 samples (weighted stratified by difficulty)
- **Database**: CIM Wizard Integrated (PostgreSQL 15 + PostGIS 3.4)

#### Metrics

- **Execution Accuracy (EX)**: Percentage of queries that execute without errors
- **Exact Match (EM)**: Percentage of queries identical to the ground truth
- **Valid Efficiency Score (VES)**: Percentage of queries returning correct results

### Results

#### Summary

The model achieves strong performance on PostGIS spatial SQL generation:

- **Execution Accuracy (EX)**: 82-90% (first-shot)
- **Exact Match (EM)**: 35-45%
- **Spatial Functions**: 85-90% accuracy (ST_Within, ST_Intersects, ST_Distance, etc.)
- **Multi-Schema Queries**: 82-90% accuracy (cim_vector, cim_census, cim_raster joins)
- **Domain Terminology**: 85-90% accuracy (SEZ2011, TABULA, E8-E16 census codes)

## Technical Specifications

### Model Architecture and Objective

- **Base Architecture**: Llama 3.1 (Transformer decoder)
- **Parameters**: 8 billion (base) + 16M (LoRA adapter)
- **Context Window**: 128K tokens
- **Attention**: Grouped-Query Attention (GQA)
- **Position Encoding**: RoPE (Rotary Position Embeddings)

### Compute Infrastructure

#### Hardware

- **GPU**: NVIDIA RTX 3090 (24 GB VRAM)
- **Training Framework**: Hugging Face Transformers + PEFT
- **Quantization**: BitsAndBytes 4-bit NF4

#### Software

- **Framework Versions**:
  - PEFT: 0.12.0
  - Transformers: Latest
  - PyTorch: 2.x with CUDA 12.1

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type**: NVIDIA RTX 3090
- **Hours used**: ~51 hours
- **Cloud Provider**: Local GPU server (ipazia)
- **Compute Region**: Italy
- **Carbon Emitted**: Estimated ~2-3 kg CO2 (based on RTX 3090 power consumption)

## Citation

**BibTeX:**

```bibtex
@misc{llama31-8b-cim-q2sql,
  author = {Taherdoust, Ali},
  title = {Llama 3.1 8B Fine-tuned for CIM Spatial SQL Generation},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/taherdoust/llama31-8b-cim-q2sql}},
  note = {Fine-tuned for PostGIS spatial SQL queries in City Information Modeling}
}
```

**APA:**

Taherdoust, A. (2025). *Llama 3.1 8B Fine-tuned for CIM Spatial SQL Generation*. HuggingFace. https://huggingface.co/taherdoust/llama31-8b-cim-q2sql

## Model Card Authors

Ali Taherdoust (Politecnico di Torino)

## Model Card Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/taherdoust/txt2ssql) or contact: taherdoust@gmail.com

## Acknowledgments

- Base model: Meta Llama 3.1
- Training dataset: Generated through a three-stage pipeline (rule-based templates, CTGAN synthesis, GPT-4o-mini augmentation)
- Infrastructure: Politecnico di Torino GPU computing resources