# Project Idea: Fine-tuning TinyLLaMA for Enhanced Jac MTLLM Performance
This project focuses on fine-tuning the TinyLLaMA model to significantly improve its ability to handle Jac's MTLLM `by <llm>` calls. The primary goal is to enhance its proficiency in accurately processing structured inputs (like Jac objects passed as context) and generating well-formed, type-consistent structured outputs (such as Jac object instances or other typed data) directly at `by <llm>` call sites.
## Problem Statement
While MTLLM provides a powerful bridge for integrating Large Language Models into Jac programs, smaller, more accessible models like TinyLLaMA might not natively excel at:

- **Interpreting Complex Jac Structures:** Difficulty understanding the schema and content of custom Jac objects or complex data structures provided as context (e.g., via `incl_info`) to `by <llm>` calls.
- **Strict Adherence to Output Typing:** Challenges in consistently generating outputs that strictly conform to Jac's precise type hints (e.g., `list[MyObject]`, `dict[str, int]`) or custom object definitions (e.g., ensuring all required fields of an `obj MyData` are present and correctly typed in the generated output).
- **Nuanced Understanding of Semstrings:** Suboptimal interpretation of semantic strings (semstrings) when they are intended to guide the generation of specifically structured data rather than free-form text.
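To make this gap concrete, the plain-Python sketch below shows the shape of a single fine-tuning example that the rest of this proposal revolves around: a prompt built from a call site's signature, semstring, and context, paired with the ideal typed completion. The ability name, prompt layout, and field values are hypothetical illustrations, not drawn from any existing dataset.

```python
# Illustrative only: one (prompt, ideal_completion) pair for a call site whose
# return type is the example object `obj Result { has status: bool; has data: dict; }`.
pair = {
    "prompt": (
        "Ability: summarize_run(log: str) -> Result\n"
        "Semstring: 'Summarize the run log into a Result object.'\n"
        "Context: log = '3 rows loaded, 0 errors'\n"
        "Output (single Jac object literal):"
    ),
    # What the fine-tuned model should emit: one well-typed Jac instantiation,
    # with every required field present, rather than free-form prose.
    "completion": "Result(status=True, data={'rows': 3, 'errors': 0})",
}
```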
## Proposed Solution & Jac's Role
The core of this project is to create a specialized fine-tuning dataset and a robust evaluation process for TinyLLaMA, leveraging Jac's capabilities:
- **Dataset Generation using Jac:**
    - **Corpus Creation:** Assemble a diverse corpus of Jac code examples that utilize the `by <llm>` feature. This should include:
        - Various input data types passed as context: primitive types, lists, dictionaries, and instances of custom Jac objects.
        - A wide range of output type hints: simple types (`str`, `int`), collections (`list[str]`), and complex custom Jac objects (e.g., `obj Result { has status: bool; has data: dict; }`).
        - Examples demonstrating the use of `incl_info` for contextual data passing.
        - Scenarios where semstrings are used to guide the LLM in generating structured output.
    - **Automated Input/Output Pair Extraction:**
        - Develop Jac walkers and scripts to parse the Jac code corpus.
        - These tools would identify `by <llm>` call sites and automatically extract, or help formulate, the (prompt, ideal_completion) pairs for fine-tuning.
        - The "prompt" would encapsulate the function/ability signature, relevant context (including serialized Jac objects), and semstrings.
        - The "ideal_completion" would be the string representation of the perfectly structured and typed Jac output (e.g., a valid Jac object instantiation string such as `MyData(field1='value', field2=123)`).
- **Fine-tuning TinyLLaMA:**
    - Employ standard LLM fine-tuning libraries and techniques (e.g., Hugging Face `transformers`, PEFT/LoRA) with the specialized dataset generated in the previous step (a minimal training sketch follows this list).
    - The objective is to train TinyLLaMA to recognize the patterns of Jac MTLLM calls and learn to generate outputs that are syntactically and semantically valid within the Jac ecosystem.
- **Evaluation Framework in Jac:**
    - Construct a Jac-based evaluation harness.
    - This harness will consist of Jac programs that invoke the fine-tuned TinyLLaMA via `by <fine_tuned_tiny_llm>`.
    - The harness will programmatically:
        - Call abilities that use the fine-tuned model.
        - Receive the (potentially structured) output.
        - Validate the output against expected Jac types and structures using Jac's runtime type checking and object introspection capabilities (a Python analogue of these checks is sketched after this list).
        - Measure accuracy in terms of structural integrity, type correctness, and semantic plausibility.
- **Integration as an MTLLM Backend:**
    - Adapt the fine-tuned TinyLLaMA model to be seamlessly integrated as a custom backend within the `jac-mtllm` plugin system, likely by creating a new class inheriting from `BaseLLM`.
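As a concrete illustration of the fine-tuning step above, here is a minimal parameter-efficient (LoRA) training sketch. It assumes the prompt/completion pairs have been exported to a JSONL file named `jac_mtllm_pairs.jsonl`; the checkpoint name, hyperparameters, and file paths are illustrative choices, not requirements of the project.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed TinyLLaMA variant
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Parameter-efficient fine-tuning: LoRA adapters on the attention projections.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

def to_features(example):
    # Train on prompt + ideal completion as one causal-LM sequence.
    text = example["prompt"] + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

dataset = load_dataset("json", data_files="jac_mtllm_pairs.jsonl", split="train")
dataset = dataset.map(to_features, remove_columns=dataset.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="tinyllama-jac-mtllm", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```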
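The evaluation harness itself would be written in Jac, but the structural checks it performs can be prototyped in Python while the dataset is being assembled. The sketch below is one such analogue: it accepts a completion only if it parses as a single instantiation of the expected object with all required, correctly typed fields. The `Result` schema reuses the example object from earlier; everything else is illustrative.

```python
import ast

# Expected schema for the example object: obj Result { has status: bool; has data: dict; }
EXPECTED = {"name": "Result", "fields": {"status": bool, "data": dict}}

def completion_conforms(completion: str, expected: dict) -> bool:
    """Return True only if the completion is a single well-formed instantiation
    of the expected object with every required field present and correctly typed."""
    try:
        node = ast.parse(completion, mode="eval").body
    except SyntaxError:
        return False
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id == expected["name"]):
        return False
    kwargs = {kw.arg: kw.value for kw in node.keywords}
    if set(kwargs) != set(expected["fields"]):
        return False  # missing or extra fields
    for field, typ in expected["fields"].items():
        try:
            value = ast.literal_eval(kwargs[field])
        except (ValueError, SyntaxError):
            return False  # field value is not a plain literal
        if not isinstance(value, typ):
            return False
    return True

print(completion_conforms("Result(status=True, data={'rows': 3})", EXPECTED))  # True
print(completion_conforms("The run succeeded with 3 rows.", EXPECTED))         # False
```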
## Benefits
- **Accessible and Local MTLLM:** Empowers developers to use MTLLM features with a small, efficient model that can run locally, reducing dependency on large, cloud-based LLMs.
- **Improved Reliability for Small Models:** Significantly enhances the reliability and predictability of MTLLM when used with smaller models by making them more adept at Jac's structured data paradigms.
- **Cost-Effective AI Solutions:** Lowers or eliminates API costs for many GenAI tasks that can be effectively handled by a fine-tuned local model.
- **Enhanced Privacy:** Facilitates on-device processing for applications dealing with sensitive code or data.
- **Community Enablement:** Provides the Jac community with a powerful, optimized small model for local MTLLM experimentation and development.
## High-Level Project Steps
- **Setup & Tooling:** Prepare the development environment for Jac programming and LLM fine-tuning (Python, PyTorch, Hugging Face libraries, etc.).
- **Dataset Design and Scoping:** Define the scope and variety of Jac `by <llm>` patterns, data structures (input/output), and specific tasks to be included in the fine-tuning dataset.
- **Jac-Powered Dataset Generation:**
    - Implement Jac walkers/scripts to automatically or semi-automatically generate fine-tuning data (prompt-completion pairs) from the Jac code corpus.
- **TinyLLaMA Fine-Tuning Execution:**
    - Select a suitable pre-trained TinyLLaMA variant.
    - Conduct the fine-tuning process using the generated dataset, experimenting with different strategies (e.g., full fine-tuning vs. parameter-efficient methods like LoRA).
- **Jac Evaluation Harness Implementation:**
    - Develop Jac archetypes and test suites to programmatically assess the fine-tuned model's performance on structured data generation tasks.
- **MTLLM Backend Integration:** Package the fine-tuned model as an easily usable backend for `jac-mtllm` (a rough backend stub is sketched after this list).
- **Testing, Iteration, and Documentation:**
    - Thoroughly test the integrated fine-tuned model across diverse scenarios.
    - Iteratively refine the dataset and fine-tuning process based on evaluation results.
    - Provide clear documentation on how to set up and use the fine-tuned TinyLLaMA with Jac MTLLM.
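For the backend-integration step, a rough Python stub is sketched below. Inheriting from `BaseLLM` is taken from the project description above; the import path (`mtllm.llms.base`), the constructor behavior, and the `__infer__` hook name are assumptions modeled on how other `jac-mtllm` backends appear to be organized, so the actual interface should be confirmed against the plugin's source.

```python
# Assumption-laden stub of a local TinyLLaMA backend for jac-mtllm.
# ASSUMPTIONS: the BaseLLM import path and the __infer__ hook mirror existing
# mtllm backends; verify against the plugin's source before relying on this.
from transformers import AutoModelForCausalLM, AutoTokenizer
from mtllm.llms.base import BaseLLM  # assumed module path


class FineTunedTinyLlama(BaseLLM):
    """Serves the fine-tuned checkpoint produced earlier in this project."""

    def __init__(self, checkpoint: str = "tinyllama-jac-mtllm", **kwargs) -> None:
        super().__init__(**kwargs)  # assumed to accept standard MTLLM options
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        self.model = AutoModelForCausalLM.from_pretrained(checkpoint)

    def __infer__(self, meaning_in: str, **kwargs) -> str:
        # Generate the structured completion for the MTLLM-built prompt and
        # return only the newly generated text (prompt tokens sliced off).
        inputs = self.tokenizer(meaning_in, return_tensors="pt")
        output = self.model.generate(
            **inputs, max_new_tokens=kwargs.get("max_new_tokens", 256)
        )
        return self.tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
```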
This project would be a significant contribution to the Jac ecosystem, making its advanced AI-integration features more accessible, efficient, and reliable, especially for developers preferring or requiring local model execution.