P3

Project Idea: Fine-tuning TinyLLaMA for Enhanced Jac MTLLM Performance#

This project focuses on fine-tuning the TinyLLaMA model to significantly improve its ability to handle Jac's MTLLM by <llm> calls. The primary goal is to enhance its proficiency in accurately processing structured inputs (like Jac objects passed as context) and generating well-formed, type-consistent structured outputs (such as Jac object instances or other typed data) directly at by <llm> call sites.

Problem Statement#

While MTLLM provides a powerful bridge to integrate Large Language Models within Jac programs, smaller, more accessible models like TinyLLaMA might not natively excel at:

Interpreting Complex Jac Structures: Difficulty in understanding the schema and content of custom Jac objects or complex data structures provided as context (e.g., via incl_info) to by <llm> calls.
Strict Adherence to Output Typing: Challenges in consistently generating outputs that strictly conform to Jac's precise type hints (e.g., list[MyObject], dict[str, int]) or custom object definitions (e.g., ensuring all required fields of a obj MyData are present and correctly typed in the generated output).
Nuanced Understanding of Semstrings: Suboptimal interpretation of semantic strings (semstrings) when they are intended to guide the generation of specifically structured data rather than free-form text.

Proposed Solution & Jac's Role#

The core of this project is to create a specialized fine-tuning dataset and a robust evaluation process for TinyLLaMA, leveraging Jac's capabilities:

Dataset Generation using Jac:
- Corpus Creation: Assemble a diverse corpus of Jac code examples that utilize the by <llm> feature. This should include:
  - Various input data types passed as context: primitive types, lists, dictionaries, and instances of custom Jac objects.
  - A wide range of output type hints: simple types (str, int), collections (list[str]), and complex custom Jac objects (e.g., obj Result { has status: bool; has data: dict; }).
  - Examples demonstrating the use of incl_info for contextual data passing.
  - Scenarios where semstrings are used to guide the LLM in generating structured output.
- Automated Input/Output Pair Extraction:
  - Develop Jac walkers and scripts to parse the Jac code corpus.
  - These tools would identify by <llm> call sites and automatically extract or help formulate the (prompt, ideal_completion) pairs for fine-tuning.
  - The "prompt" would encapsulate the function/ability signature, relevant context (including serialized Jac objects), and semstrings.
  - The "ideal_completion" would be the string representation of the perfectly structured and typed Jac output (e.g., a valid Jac object instantiation string MyData(field1='value', field2=123)).
Fine-tuning TinyLLaMA:
- Employ standard LLM fine-tuning libraries and techniques (e.g., Hugging Face transformers, PEFT/LoRA) with the specialized dataset generated in the previous step.
- The objective is to train TinyLLaMA to recognize the patterns of Jac MTLLM calls and learn to generate outputs that are syntactically and semantically valid within the Jac ecosystem.
Evaluation Framework in Jac:
- Construct a Jac-based evaluation harness.
- This harness will consist of Jac programs that invoke the fine-tuned TinyLLaMA via by <fine_tuned_tiny_llm>.
- The harness will programmatically:
  - Call abilities that use the fine-tuned model.
  - Receive the (potentially structured) output.
  - Validate the output against expected Jac types and structures using Jac's runtime type checking and object introspection capabilities.
  - Measure accuracy in terms of structural integrity, type correctness, and semantic plausibility.
Integration as an MTLLM Backend:
- Adapt the fine-tuned TinyLLaMA model to be seamlessly integrated as a custom backend within the jac-mtllm plugin system, likely by creating a new class inheriting from BaseLLM.

Benefits#

Accessible and Local MTLLM: Empowers developers to use MTLLM features with a small, efficient model that can run locally, reducing dependency on large, cloud-based LLMs.
Improved Reliability for Small Models: Significantly enhances the reliability and predictability of MTLLM when used with smaller models by making them more adept at Jac's structured data paradigms.
Cost-Effective AI Solutions: Lowers or eliminates API costs for many GenAI tasks that can be effectively handled by a fine-tuned local model.
Enhanced Privacy: Facilitates on-device processing for applications dealing with sensitive code or data.
Community Enablement: Provides the Jac community with a powerful, optimized small model for local MTLLM experimentation and development.

High-Level Project Steps#

Setup & Tooling: Prepare the development environment for Jac programming and LLM fine-tuning (Python, PyTorch, Hugging Face libraries, etc.).
Dataset Design and Scoping: Define the scope and variety of Jac by <llm> patterns, data structures (input/output), and specific tasks to be included in the fine-tuning dataset.
Jac-Powered Dataset Generation:
- Implement Jac walkers/scripts to automatically or semi-automatically generate fine-tuning data (prompt-completion pairs) from the Jac code corpus.
TinyLLaMA Fine-Tuning Execution:
- Select a suitable pre-trained TinyLLaMA variant.
- Conduct the fine-tuning process using the generated dataset, experimenting with different strategies (e.g., full fine-tuning vs. parameter-efficient methods like LoRA).
Jac Evaluation Harness Implementation:
- Develop Jac archetypes and test suites to programmatically assess the fine-tuned model's performance on structured data generation tasks.
MTLLM Backend Integration: Package the fine-tuned model as an easily usable backend for jac-mtllm.
Testing, Iteration, and Documentation:
- Thoroughly test the integrated fine-tuned model across diverse scenarios.
- Iteratively refine the dataset and fine-tuning process based on evaluation results.
- Provide clear documentation on how to set up and use the fine-tuned TinyLLaMA with Jac MTLLM.

This project would be a significant contribution to the Jac ecosystem, making its advanced AI-integration features more accessible, efficient, and reliable, especially for developers preferring or requiring local model execution.