---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
library_name: peft
tags:
- text-to-sql
- postgis
- spatial-sql
- city-information-modeling
- cim
- fine-tuned
- qlora
- peft
datasets:
- taherdoust/ai4cimdb
language:
- en
pipeline_tag: text-generation
---

# Llama 3.1 8B - CIM Spatial SQL (Q2SQL)

Fine-tuned Llama 3.1 8B Instruct model for generating PostGIS spatial SQL queries for City Information Modeling (CIM) databases.

## Model Details

### Model Description

- **Developed by:** Ali Taherdoust (Politecnico di Torino)
- **Model type:** Causal Language Model (Decoder-only)
- **Language(s) (NLP):** English
- **License:** Llama 3.1 Community License
- **Finetuned from model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

### Model Sources

- **Repository:** [GitHub Repository](https://github.com/taherdoust/txt2ssql)
- **Training Dataset:** [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) (176K samples, 88K training)
- **Base Model:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

## Uses

### Direct Use

This model is designed to generate PostGIS spatial SQL queries from natural language questions for City Information Modeling (CIM) databases. It can be used to:

- Convert natural language questions to PostGIS SQL queries
- Query spatial urban databases (buildings, census, raster, network data)
- Support non-technical users in accessing CIM databases

### Downstream Use

- Integration into the CIM Wizard framework for natural language database interfaces
- Educational tools for teaching spatial SQL
- Research on text-to-SQL for specialized domains

### Out-of-Scope Use

- General-purpose SQL generation (not trained on standard SQL databases)
- Non-spatial SQL queries (optimized for PostGIS)
- Other natural language tasks (specialized for SQL generation)

## Bias, Risks, and Limitations

- **Domain-Specific**: Trained specifically on CIM/PostGIS queries; may not perform well on standard SQL databases.
- **Spatial Functions**: Requires the PostGIS extension; queries may fail on non-PostGIS databases.
- **Schema Dependency**: Trained on a specific CIM schema (cim_vector, cim_census, cim_raster, cim_network); performance may degrade on different schemas.
- **Language**: Trained on English questions only.
- **Accuracy**: 82-90% execution accuracy; generated SQL should be validated before production use.

### Recommendations

Users should:

- Validate generated SQL queries before execution (a minimal sketch follows this list)
- Test on their specific database schema
- Use in controlled environments with proper error handling
- Review complex queries manually
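The first recommendation is easy to automate: PostgreSQL's `EXPLAIN` plans a statement without executing it, so generated SQL can be checked against the live schema before it touches any data. Below is a minimal validation sketch assuming a `psycopg2` connection (`pip install psycopg2-binary`); the connection string and the table and column names are illustrative, not part of this model card.

```python
import psycopg2

def validate_sql(conn, sql: str) -> tuple[bool, str]:
    """Check that `sql` parses and plans on the target database.

    EXPLAIN (without ANALYZE) only builds a query plan, so nothing
    is read or written even if the statement is wrong or harmful.
    """
    try:
        with conn.cursor() as cur:
            cur.execute("EXPLAIN " + sql)
        return True, "OK"
    except psycopg2.Error as exc:
        return False, str(exc)
    finally:
        conn.rollback()  # discard any transaction state left by the attempt

# Illustrative connection; adjust to your environment
conn = psycopg2.connect("dbname=cim_wizard user=cim")

# A representative model output for "Find all buildings within 500 meters
# of the city center" (hypothetical table/column names, Turin coordinates)
candidate_sql = """
SELECT b.building_id
FROM cim_vector.building AS b
WHERE ST_DWithin(
    b.geometry::geography,
    ST_SetSRID(ST_MakePoint(7.6869, 45.0703), 4326)::geography,
    500
);
"""

ok, message = validate_sql(conn, candidate_sql)
print(ok, message)
```

Note that `EXPLAIN` catches syntax and schema errors but not semantic mistakes (a query can plan cleanly and still answer the wrong question), so manual review of complex queries remains necessary.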
## How to Get Started with the Model

### Installation

```bash
pip install transformers peft accelerate bitsandbytes
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# Base model and fine-tuned adapter identifiers
base_model = "meta-llama/Llama-3.1-8B-Instruct"
model_name = "taherdoust/llama31-8b-cim-q2sql"

# 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load fine-tuned adapter
model = PeftModel.from_pretrained(model, model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate SQL from a question
question = "Find all buildings within 500 meters of the city center"
prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert in PostGIS spatial SQL for City Information Modeling (CIM).
Your task is to generate precise PostGIS spatial SQL queries for the CIM Wizard database.

Database Schema:
- cim_vector: Building geometries, project scenarios, grid infrastructure
- cim_census: Italian census demographic data (ISTAT 2011)
- cim_raster: DTM/DSM raster data
- cim_network: Electrical grid network data

Generate only the SQL query without explanations.<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)

# Decode only the newly generated tokens so the prompt and special
# tokens are not included in the returned SQL
generated = outputs[0][inputs["input_ids"].shape[-1]:]
sql = tokenizer.decode(generated, skip_special_tokens=True)
print(sql)
```

## Training Details

### Training Data

- **Dataset**: taherdoust/ai4cimdb
- **Training Samples**: 88,480
- **Validation Samples**: 18,960
- **Test Samples**: 18,960
- **Total Dataset Size**: 176,837 samples (126,400 after curation)
- **Data Quality**: 99.7% quality acceptance rate, 99.57% NoErr rate

### Training Procedure

- **Method**: QLoRA (Quantized Low-Rank Adaptation)
- **Quantization**: 4-bit NF4
- **LoRA Rank**: 16
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0.1
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
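For reference, the adapter settings listed above map directly onto a `peft.LoraConfig`. The sketch below reconstructs that configuration from the stated values; `bias="none"` and `task_type="CAUSAL_LM"` are assumed typical defaults, and the authoritative training script lives in the GitHub repository.

```python
from peft import LoraConfig

# Reconstruction of the adapter configuration from the values above;
# not the verbatim training script.
lora_config = LoraConfig(
    r=16,               # LoRA rank
    lora_alpha=32,      # LoRA alpha (scaling)
    lora_dropout=0.1,   # LoRA dropout
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    bias="none",            # assumed default
    task_type="CAUSAL_LM",  # assumed default for decoder-only LMs
)
```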
#### Training Hyperparameters

- **Training regime**: bf16 mixed precision
- **Epochs**: 1
- **Batch Size**: 2 per device
- **Gradient Accumulation Steps**: 8 (effective batch size: 16)
- **Learning Rate**: 1.5e-4
- **Learning Rate Scheduler**: Cosine with warmup
- **Warmup Ratio**: 0.1
- **Weight Decay**: 0.01
- **Max Sequence Length**: 2048 tokens
- **Optimizer**: paged_adamw_8bit

#### Speeds, Sizes, Times

- **Training Time**: ~51 hours (on an NVIDIA RTX 3090)
- **Training Steps**: 5,530
- **Model Size**: ~150 MB (LoRA adapter) + ~15 GB (base model, quantized)
- **Trainable Parameters**: 0.46% of total parameters
- **Final Training Loss**: 0.1308
- **Final Evaluation Loss**: 0.0972

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- **Test Set**: 18,960 samples from taherdoust/ai4cimdb
- **Evaluation Benchmark**: 100 samples (weighted stratified by difficulty)
- **Database**: CIM Wizard Integrated (PostgreSQL 15 + PostGIS 3.4)

#### Metrics

- **Execution Accuracy (EX)**: Percentage of queries that execute without errors
- **Exact Match (EM)**: Percentage of queries identical to the ground truth
- **Valid Efficiency Score (VES)**: Percentage of queries returning correct results

### Results

#### Summary

The model achieves strong performance on PostGIS spatial SQL generation:

- **Execution Accuracy (EX)**: 82-90% (first-shot)
- **Exact Match (EM)**: 35-45%
- **Spatial Functions**: 85-90% accuracy (ST_Within, ST_Intersects, ST_Distance, etc.)
- **Multi-Schema Queries**: 82-90% accuracy (cim_vector, cim_census, cim_raster joins)
- **Domain Terminology**: 85-90% accuracy (SEZ2011, TABULA, E8-E16 census codes)

## Technical Specifications

### Model Architecture and Objective

- **Base Architecture**: Llama 3.1 (Transformer decoder)
- **Parameters**: 8 billion (base) + 16M (LoRA adapter)
- **Context Window**: 128K tokens
- **Attention**: Grouped-Query Attention (GQA)
- **Position Encoding**: RoPE (Rotary Position Embeddings)

### Compute Infrastructure

#### Hardware

- **GPU**: NVIDIA RTX 3090 (24 GB VRAM)
- **Training Framework**: Hugging Face Transformers + PEFT
- **Quantization**: BitsAndBytes 4-bit NF4

#### Software

- **Framework Versions**:
  - PEFT: 0.12.0
  - Transformers: Latest
  - PyTorch: 2.x with CUDA 12.1

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type**: NVIDIA RTX 3090
- **Hours used**: ~51 hours
- **Cloud Provider**: Local GPU server (ipazia)
- **Compute Region**: Italy
- **Carbon Emitted**: Estimated ~2-3 kg CO2 (based on RTX 3090 power consumption)

## Citation

**BibTeX:**

```bibtex
@misc{llama31-8b-cim-q2sql,
  author = {Taherdoust, Ali},
  title = {Llama 3.1 8B Fine-tuned for CIM Spatial SQL Generation},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/taherdoust/llama31-8b-cim-q2sql}},
  note = {Fine-tuned for PostGIS spatial SQL queries in City Information Modeling}
}
```

**APA:**

Taherdoust, A. (2025). *Llama 3.1 8B Fine-tuned for CIM Spatial SQL Generation*. HuggingFace. https://huggingface.co/taherdoust/llama31-8b-cim-q2sql

## Model Card Authors

Ali Taherdoust (Politecnico di Torino)

## Model Card Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/taherdoust/txt2ssql) or contact: taherdoust@gmail.com

## Acknowledgments

- Base model: Meta Llama 3.1
- Training dataset: Generated through a three-stage pipeline (rule-based templates, CTGAN synthesis, GPT-4o-mini augmentation)
- Infrastructure: Politecnico di Torino GPU computing resources