### **Strategic Recommendations for Initial Conversion**

1. **Xenova/llama2.c-stories15M** 1: This model is the strongest primary candidate for the TinyRustLM ecosystem. It uses the standard modern Llama architecture, with Root Mean Square Normalization (RMSNorm), Rotary Positional Embeddings (RoPE), and SwiGLU activation functions, and so serves as a robust baseline.3 At 15.2 million parameters, it leaves substantial headroom in the 33.5 MB budget in both the Q8\_0 format (about 16.1 MB) and the Q4\_0 format (about 8.5 MB), which makes it well suited to continuous integration, adapter experiments, and WebAssembly (WASM) smoke tests.  
2. **MultivexAI/Aurelius-Llama-v2.5-10M-Large** 5: This candidate targets the compute and memory-bandwidth bottlenecks of single-threaded WebAssembly execution. By shrinking the vocabulary to 1,536 tokens and employing Grouped-Query Attention (GQA) at a 3:1 ratio 5, it reduces the memory needed for the embedding matrix and the final logit projection, as well as the key-value cache traffic during generation. It is the leading choice for extremely memory-constrained execution environments.  
3. **EleutherAI/pythia-14m** 6: As a representative of the GPT-NeoX family, this model introduces a different architectural topology, specifically parallel attention and multi-layer perceptron (MLP) residuals. Crucially, the Pythia scaling suite provides identically trained model configurations that differ only in their random seed.7 This allows rigorous validation of the custom Rust/WASM floating-point execution engine against known reference outputs, ensuring that the hand-written tensor mathematics does not drift from established baselines.

## **The Zero-Dependency Browser-Local AI Paradigm**

Deploying autoregressive language models directly in client-side browser environments requires a radical departure from traditional cloud-based, high-performance computing architectures. For a browser-local AI runtime written in Rust and compiled to WebAssembly (WASM), hardware and software constraints become the primary drivers of architectural selection and feasibility. The target ecosystem, TinyRustLM, operates under austere conditions: it does not rely on third-party machine learning libraries (such as libtorch or TensorFlow Lite), external inference frameworks, Python-only tokenizers, or remote application programming interfaces (APIs). Consequently, all mathematical operations, memory allocations, string parsing, and tensor manipulations must be self-contained, driven by local .slm artifacts, and executed within the strict memory envelope of the browser sandbox.  
WebAssembly fundamentally operates within a flat, linear memory model. While the theoretical addressing limit of a standard 32-bit WASM implementation is four gigabytes, practical allocation ceilings within consumer browser environments—such as the V8 engine in Google Chrome or the SpiderMonkey engine in Mozilla Firefox—often restrict contiguous buffer allocations and active heap sizes to significantly lower thresholds. Browsers routinely enforce aggressive garbage collection and memory eviction policies to ensure that foreground web applications do not compromise the stability of the host operating system. To navigate this hostile environment, the TinyRustLM specification designates a strict browser selector model byte budget of approximately 33.5 megabytes. This budget is not an arbitrary limitation; it is precisely calibrated to ensure that the downloading, parsing, and instantiation of the neural network weights occur with imperceptible latency, preventing the main thread or Web Worker from stalling.  
This strict 33.5 megabyte limitation places a hard ceiling on the parameter mass that can be instantiated dynamically. Standard consumer language models, which routinely measure parameter counts in the billions and sizes in the gigabytes, are structurally incompatible with this paradigm. The search for viable candidates must therefore pivot toward microscopic decoders that exhibit exceptionally high representational density, standard architectural configurations that can be hard-coded into the Rust inference loops, and minimal topological complexity. Furthermore, these models must possess permissive open-source licenses, specifically MIT or Apache 2.0, allowing for frictionless redistribution within client-side application bundles without triggering catastrophic legal or provenance liabilities.

## **WASM Memory Boundaries and Quantization Mathematics**

To bridge the gap between complex neural network topologies and the rigid 33.5 megabyte limitation, sophisticated offline weight quantization is an absolute necessity. The TinyRustLM runtime specifically delineates support for three primary tensor data paths: f32 (32-bit floating-point), q8\_0 (8-bit symmetric block quantization), and q4\_0 (4-bit symmetric block quantization). Understanding the mathematical realities of these formats dictates exactly which language models can transition from theoretical candidates to functional browser deployments.  
The baseline f32 format represents the uncompressed, full-precision standard for machine learning tensors. It requires exactly four bytes of storage per parameter. Applied to the 33.5 megabyte budget, this puts the ceiling for an unquantized model at roughly 8.4 million parameters (33.5 MB ÷ 4 bytes). Because the vast majority of useful language models exceed this threshold, f32 execution within TinyRustLM is largely reserved for rudimentary logic tests, isolated continuous integration routines, and the very smallest of the architectural scaling artifacts.  
To achieve functional utility, the runtime relies on the q8\_0 format. This quantization scheme utilizes a blocked architecture, wherein thirty-two consecutive parameters are grouped together into a singular cohesive unit. Within each block, the offline converter identifies the absolute maximum parameter value and computes a shared 16-bit floating-point (f16) scaling factor. The individual weights are then uniformly quantized into 8-bit signed integers. When factoring in the storage space required for the 8-bit integers alongside the overhead of the shared f16 scaling value, the effective memory footprint drops dramatically to approximately 1.0625 bytes per parameter. By applying this format, the 33.5 megabyte browser budget suddenly accommodates neural networks scaling up to approximately 31.5 million parameters. This opens the door to a wide array of highly capable, specialized models.  
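The blocked scheme described above can be sketched in a few lines of Rust. This is a minimal illustration rather than the actual TinyRustLM converter: the names `BlockQ8`, `quantize_block`, and `dequantize_block` are hypothetical, the scale is computed as the block's largest magnitude divided by 127 (an assumption modeled on common q8\_0 implementations), and the scale is held as f32 for readability where the serialized format would store f16.

```rust
const BLOCK: usize = 32;

/// One q8_0-style block: a shared scale plus 32 signed 8-bit weights.
/// Assumption: the scale is held as f32 for readability; the on-disk
/// format described in the text would store it as f16.
struct BlockQ8 {
    scale: f32,
    quants: [i8; BLOCK],
}

fn quantize_block(weights: &[f32; BLOCK]) -> BlockQ8 {
    // The largest magnitude in the block determines the shared scale.
    let amax = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = amax / 127.0;
    // An all-zero block gets scale 0 and all-zero quants.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0i8; BLOCK];
    for (q, w) in quants.iter_mut().zip(weights.iter()) {
        *q = (w * inv).round().max(-127.0).min(127.0) as i8;
    }
    BlockQ8 { scale, quants }
}

fn dequantize_block(block: &BlockQ8) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, q) in out.iter_mut().zip(block.quants.iter()) {
        *o = f32::from(*q) * block.scale;
    }
    out
}

fn main() {
    let mut weights = [0.0f32; BLOCK];
    for (i, w) in weights.iter_mut().enumerate() {
        *w = (i as f32 - 16.0) / 16.0;
    }
    let block = quantize_block(&weights);
    let restored = dequantize_block(&block);
    println!("scale = {}, {} -> {}", block.scale, weights[0], restored[0]);
}
```

With this rounding rule, each restored weight differs from the original by at most half of the block's scale, which is why a single outlier in a block degrades the precision of its 31 neighbors.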
When a model exceeds the roughly 31.5 million parameter ceiling of q8\_0, the runtime must deploy the q4\_0 format. This scheme mirrors the block structure of the 8-bit variant but compresses the integers down to 4-bit nibbles, packing two parameters into every byte of storage. With the same 32-element blocks and a shared f16 scale, the footprint falls to roughly 0.5625 bytes per parameter (16 bytes of nibbles plus 2 bytes of scale per block). Under the q4\_0 schema, the 33.5 megabyte budget can host models nearing 59.5 million parameters, bringing nearly all of the requested candidates into viability.  
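The per-format ceilings quoted above follow from simple arithmetic, which the sketch below reproduces. It assumes the budget is exactly 33.5 × 10⁶ bytes and counts only weight storage (no tokenizer or header overhead); the `Format` enum and function names are illustrative, not part of the TinyRustLM specification.

```rust
/// Tensor storage formats discussed in the text.
#[derive(Clone, Copy, Debug)]
enum Format {
    F32,
    Q8, // q8_0: 32 one-byte weights + 2-byte f16 scale = 34 bytes per block
    Q4, // q4_0: 16 bytes of nibbles + 2-byte f16 scale = 18 bytes per block
}

impl Format {
    fn bytes_per_param(self) -> f64 {
        match self {
            Format::F32 => 4.0,
            Format::Q8 => 34.0 / 32.0, // 1.0625
            Format::Q4 => 18.0 / 32.0, // 0.5625
        }
    }
}

/// Assumption: the 33.5 MB selector budget is taken as 33.5e6 bytes.
const BUDGET_BYTES: f64 = 33.5e6;

fn footprint_bytes(params: u64, format: Format) -> f64 {
    params as f64 * format.bytes_per_param()
}

fn max_params(format: Format) -> u64 {
    (BUDGET_BYTES / format.bytes_per_param()).floor() as u64
}

fn fits(params: u64, format: Format) -> bool {
    footprint_bytes(params, format) <= BUDGET_BYTES
}

fn main() {
    for format in [Format::F32, Format::Q8, Format::Q4] {
        println!("{:?}: up to {} parameters", format, max_params(format));
    }
    println!("42M in q8_0 fits: {}", fits(42_000_000, Format::Q8));
    println!("42M in q4_0 fits: {}", fits(42_000_000, Format::Q4));
}
```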
However, executing quantized tensors within WebAssembly introduces real computational costs. Unlike some high-performance graphics processing units (GPUs), standard WASM environments have no native instructions for 4-bit integer arithmetic. Consequently, the Rust runtime must iterate over the packed arrays, applying bitwise masks and shift operations to extract the high and low nibbles before widening them to 8-bit or 32-bit values for arithmetic accumulation. Although the WebAssembly fixed-width SIMD extension, available in modern browsers, provides v128 values that process 128 bits of data at a time, the dequantization overhead remains significant. For a tiny model runtime, this expansion step often shifts the primary bottleneck from memory bandwidth to central processing unit (CPU) instruction throughput. Therefore, architectural selection must weigh the absolute memory size of the weights against the arithmetic complexity of the runtime execution loop.
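The mask-and-shift extraction can be sketched as follows. The packing order is an assumption modeled on common q4\_0 implementations (byte j holds weight j in its low nibble and weight j + 16 in its high nibble, each stored with an offset of 8); the actual .slm layout may differ, and `BlockQ4`, `pack_block`, and `dequantize_block` are hypothetical names. The scale is again held as f32 for simplicity.

```rust
const BLOCK: usize = 32;

/// One q4_0-style block: a shared scale plus 16 bytes of packed nibbles.
/// Assumed layout: byte j = (weight j + 8) | ((weight j+16 + 8) << 4).
struct BlockQ4 {
    scale: f32,
    packed: [u8; BLOCK / 2],
}

/// Packs 32 signed values, each of which must lie in -8..=7.
fn pack_block(values: &[i8; BLOCK], scale: f32) -> BlockQ4 {
    let mut packed = [0u8; BLOCK / 2];
    for j in 0..BLOCK / 2 {
        let lo = ((values[j] + 8) as u8) & 0x0F;
        let hi = ((values[j + BLOCK / 2] + 8) as u8) & 0x0F;
        packed[j] = lo | (hi << 4);
    }
    BlockQ4 { scale, packed }
}

fn dequantize_block(block: &BlockQ4) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (j, &byte) in block.packed.iter().enumerate() {
        let lo = i32::from(byte & 0x0F) - 8; // mask out the low nibble
        let hi = i32::from(byte >> 4) - 8; // shift down the high nibble
        out[j] = lo as f32 * block.scale;
        out[j + BLOCK / 2] = hi as f32 * block.scale;
    }
    out
}

fn main() {
    let mut values = [0i8; BLOCK];
    for (i, v) in values.iter_mut().enumerate() {
        *v = (i % 16) as i8 - 8;
    }
    let block = pack_block(&values, 0.5);
    let restored = dequantize_block(&block);
    println!("first byte = {:#04x}, first weight = {}", block.packed[0], restored[0]);
}
```

Every weight costs a mask or shift, a subtraction, an integer-to-float conversion, and a multiply before it can take part in a dot product, which is the instruction overhead the paragraph above refers to.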

## **Deep Architectural Profiling: The Llama Family Lineage**

The Llama architecture has definitively established itself as the modern de facto standard for autoregressive text generation. It systematically replaced traditional Layer Normalization with Root Mean Square Normalization (RMSNorm) to reduce computational overhead, abandoned absolute positional embeddings in favor of Rotary Positional Embeddings (RoPE) to enhance context extrapolation, and adopted the SwiGLU activation function to maximize non-linear expressivity. Because the TinyRustLM engine must hard-code these operational loops without external libraries, thoroughly mapping the structural realities of Llama-based candidates is paramount.
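Two of the three Llama building blocks named above are small enough to show in full. The sketch below gives RMSNorm and the elementwise SwiGLU gate as they are commonly defined; the function names and the slice-based signatures are illustrative assumptions, and the surrounding matrix multiplications are omitted.

```rust
/// RMSNorm: scale each element by the reciprocal root-mean-square of the
/// vector, then by a learned per-channel weight. No mean subtraction.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// SiLU (swish) activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU gate: silu(gate) * up, applied elementwise to the outputs of the
/// two parallel feed-forward input projections.
fn swiglu(gate: &[f32], up: &[f32]) -> Vec<f32> {
    gate.iter().zip(up).map(|(g, u)| silu(*g) * u).collect()
}

fn main() {
    let normed = rms_norm(&[3.0, 4.0], &[1.0, 1.0], 1e-5);
    let gated = swiglu(&[0.0, 1.0], &[5.0, 2.0]);
    println!("normed = {:?}, gated = {:?}", normed, gated);
}
```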

### **Xenova/llama2.c-stories15M**

Derived directly from the educational llama2.c framework, this model represents an ideal synthesis of modern architectural standards and extreme parameter compression.1 Boasting 15.2 million parameters, the network was trained specifically on the synthetically generated TinyStories dataset, allowing it to produce highly coherent English narratives despite its diminutive size.4 The internal topology operates with a hidden dimension of 288, distributed across six transformer layers and managed by six distinct attention heads.3  
The most critical dynamic for the .slm converter to navigate involves the model's vocabulary. The network utilizes a massive 32,000-token dictionary.3 In an unquantized f32 state, the input embedding matrix (32,000 \* 288\) requires over 9.2 million parameters, translating to approximately 36.8 megabytes of linear memory.9 This massive matrix threatens to dwarf the active transformer layers completely. To mitigate this catastrophic memory pressure, the architecture aggressively employs weight tying (tie\_word\_embeddings: true).10 Under this paradigm, the input embedding matrix is completely reused as the final linear projection layer prior to the ultimate softmax operation. The TinyRustLM converter must explicitly recognize this architectural shortcut and construct the .slm memory map to utilize shared pointers, ensuring that the massive 36.8-megabyte matrix is not physically duplicated within the serialization format.  
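Weight tying amounts to using one table for two jobs. The following sketch, with the hypothetical type `TiedEmbedding`, stores a single row-major `[vocab, dim]` buffer and uses it both for the input lookup and for the final logit projection; how the real .slm memory map expresses this sharing is not specified in the source.

```rust
/// A single embedding table shared by the input lookup and the output
/// projection. Stored once, row-major: `vocab` rows of `dim` floats.
struct TiedEmbedding {
    vocab: usize,
    dim: usize,
    table: Vec<f32>,
}

impl TiedEmbedding {
    /// Input side: return the embedding row for a token id.
    fn embed(&self, token: usize) -> &[f32] {
        &self.table[token * self.dim..(token + 1) * self.dim]
    }

    /// Output side: logits[t] = dot(table[t], hidden), reusing the same rows.
    fn logits(&self, hidden: &[f32]) -> Vec<f32> {
        (0..self.vocab)
            .map(|t| self.embed(t).iter().zip(hidden).map(|(w, h)| w * h).sum::<f32>())
            .collect()
    }
}

fn main() {
    let emb = TiedEmbedding {
        vocab: 3,
        dim: 2,
        table: vec![1.0, 0.0, 0.0, 1.0, 1.0, 1.0],
    };
    println!("row 2 = {:?}", emb.embed(2));
    println!("logits = {:?}", emb.logits(&[2.0, 3.0]));
}
```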
At 15.2 million parameters, the model is well suited to the runtime. It requires approximately 16.1 megabytes in the q8\_0 format and only 8.5 megabytes in the q4\_0 format, resting comfortably within the 33.5 megabyte limit. It serves as the foundational baseline for verifying that the Rust runtime's Llama execution path is correct, specifically the SwiGLU gating mechanism and the pairwise rotation math (equivalent to complex-number multiplication) at the heart of RoPE.
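The rotation that RoPE applies to each attention head can be sketched as below. This version assumes the interleaved pairing used by llama2.c, in which adjacent elements (2i, 2i+1) form one rotation pair, and a base of 10,000; both are assumptions about the checkpoint rather than facts stated in this report, and `rope_interleaved` is a hypothetical name.

```rust
/// Rotate one attention head's query or key vector in place.
/// Assumption: interleaved layout, so elements (i, i + 1) for even i form a
/// pair that is rotated by angle pos * theta^(-i / head_dim).
fn rope_interleaved(head: &mut [f32], pos: usize, theta: f32) {
    let d = head.len();
    for i in (0..d).step_by(2) {
        let freq = theta.powf(-(i as f32) / d as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (x0, x1) = (head[i], head[i + 1]);
        // Same as multiplying the complex number x0 + i*x1 by e^(i*angle).
        head[i] = x0 * cos - x1 * sin;
        head[i + 1] = x0 * sin + x1 * cos;
    }
}

fn main() {
    let mut q = vec![1.0f32, 0.0, 1.0, 0.0];
    rope_interleaved(&mut q, 3, 10_000.0);
    println!("rotated = {:?}", q);
}
```

Because each pair undergoes a pure rotation, the vector's length is unchanged, which gives a cheap invariant to assert in smoke tests.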

### **MultivexAI/Aurelius-Llama-v2.5-10M-Large**

The Aurelius 10M Large model represents a masterclass in aggressive, targeted optimization designed explicitly for minimal-footprint, edge-device execution.5 Operating with 9.92 million parameters 5, the model expands the hidden dimension to 384, utilizes six transformer layers, and incorporates a massive intermediate MLP size of 1,008.5  
The architecture differentiates itself through an extreme compression of the vocabulary size down to a mere 1,536 tokens.5 By enforcing this severe lexical restriction, the embedding matrix is reduced to just 589,824 parameters. This structural decision allows the overwhelming majority of the 9.9 million parameter budget to be allocated directly into the active transformer layers, resulting in vastly higher representational density. Crucially, this mitigates the mathematical phenomenon known as the "softmax bottleneck," a pervasive flaw in tiny language models where low-dimensional hidden states fail to express enough variance to select accurately from massive vocabulary pools.5  
Furthermore, the Aurelius architecture leverages Grouped-Query Attention (GQA) utilizing a 3:1 ratio—specifically, six query heads sharing access to just two key and two value heads.5 Within a highly constrained WASM environment, GQA provides a massive architectural advantage. During the sequential, token-by-token generation phase, the memory bandwidth required to load the Key and Value caches from linear memory is reduced by a factor of three when compared to standard Multi-Head Attention (MHA).  
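The head sharing and the resulting cache saving can be made concrete with two small functions. The head dimension of 64 used in the example is derived here as 384 ÷ 6 and is an assumption, as are the function names and the sequence length of 256.

```rust
/// Map a query head to the key/value head it shares under GQA.
/// With 6 query heads and 2 KV heads, heads 0..=2 use KV head 0 and
/// heads 3..=5 use KV head 1.
fn kv_head_for(q_head: usize, n_q_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_q_heads % n_kv_heads == 0, "query heads must divide evenly");
    q_head / (n_q_heads / n_kv_heads)
}

/// Number of f32 values held in the key and value caches combined.
fn kv_cache_floats(n_layers: usize, seq_len: usize, n_kv_heads: usize, head_dim: usize) -> usize {
    2 * n_layers * seq_len * n_kv_heads * head_dim
}

fn main() {
    for q in 0..6 {
        println!("query head {} -> kv head {}", q, kv_head_for(q, 6, 2));
    }
    // Assumed shapes: 6 layers, 256-token context, head_dim = 384 / 6 = 64.
    let mha = kv_cache_floats(6, 256, 6, 64);
    let gqa = kv_cache_floats(6, 256, 2, 64);
    println!("MHA cache = {} floats, GQA cache = {} floats", mha, gqa);
}
```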
At 9.9 million parameters, the model presents an interesting deployment scenario: loaded entirely unquantized in f32 precision, it occupies 39.6 megabytes, which exceeds the 33.5 megabyte selector budget by about 6 megabytes and is therefore viable only for local testing outside that budget.5 When quantized to q8\_0, it requires just 10.5 megabytes. It serves as an exceptional candidate for stress-testing GQA logic parsing and cache allocation algorithms within the .slm converter.

### **SupraLabs/StorySupra-10M**

Functioning as a topological intermediary, the StorySupra-10M model operates with 12.58 million parameters.11 It strikes a balance by utilizing a hidden size of 256, distributed across an increased depth of eight hidden layers and managed by eight attention heads.11  
The model relies on a mid-sized vocabulary of 8,192 tokens 12, offering a compromise between the extreme compression seen in the Aurelius architecture and the bloated dictionaries typical of standard models. Distributed via the Safetensors format 13, the weight extraction path for the .slm packaging script is entirely standard and frictionless. StorySupra-10M provides a vital testing surface for ensuring the runtime correctly scales to deeper layer counts (eight versus six) without overflowing the Rust execution stack or inducing unexpected cache miss penalties during the feed-forward traversals.

### **Xenova/llama2.c-stories42M**

Serving as the larger sibling to the 15 million parameter variant, this architecture pushes the boundaries of the TinyRustLM constraints. It scales the hidden dimension up to 512 and expands the layer depth to eight.14  
With exactly 42.0 million parameters 2, the unquantized f32 footprint reaches a prohibitive 168 megabytes, completely disqualifying it from uncompressed deployment. Even when subjected to q8\_0 quantization, the model requires approximately 44.6 megabytes, exceeding the rigid 33.5 megabyte selector budget. Therefore, this model is only functionally viable within the target ecosystem when serialized utilizing the most aggressive q4\_0 format, which reduces the payload to 23.6 megabytes.14 As such, this model is specifically valuable not as a primary deployment target, but as an extreme stress test designed to evaluate the q4\_0 execution pipeline, the dynamic bit-shifting dequantization routines, and the overall stability of WASM heap allocations under maximum memory pressure.

## **Deep Architectural Profiling: The GPT-NeoX Family Lineage**

The GPT-NeoX architecture introduces a series of structural permutations that are distinctly separate from the Llama lineage. Most notably, the architecture retains the use of standard Layer Normalization rather than adopting RMSNorm, and it introduces the parallel execution of the Attention and Feed-Forward Neural Network (MLP) blocks. In standard sequential architectures (like GPT-2 or Llama), the output of the attention mechanism is added to the residual stream before the MLP subsequently processes that updated stream. In the GPT-NeoX topology, both the Attention block and the MLP process the identical pre-layer-norm input simultaneously, and their respective outputs are summed together afterward. This parallel topological design drastically alters memory access patterns and can potentially improve data throughput in memory-constrained WebAssembly executions.
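The difference between the two residual topologies is easiest to see side by side. In this sketch the attention and MLP sub-blocks are passed in as closures, and normalization is assumed to be folded into them; the function names are illustrative.

```rust
fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// GPT-2 / Llama style: the MLP reads the stream after attention was added.
fn sequential_block<A, M>(x: &[f32], attn: A, mlp: M) -> Vec<f32>
where
    A: Fn(&[f32]) -> Vec<f32>,
    M: Fn(&[f32]) -> Vec<f32>,
{
    let h = add(x, &attn(x));
    add(&h, &mlp(&h))
}

/// GPT-NeoX style: attention and MLP both read the same block input, and
/// their outputs are summed into the residual stream together.
fn parallel_block<A, M>(x: &[f32], attn: A, mlp: M) -> Vec<f32>
where
    A: Fn(&[f32]) -> Vec<f32>,
    M: Fn(&[f32]) -> Vec<f32>,
{
    let a = attn(x);
    let m = mlp(x);
    add(&add(x, &a), &m)
}

fn main() {
    // Toy stand-ins: "attention" is the identity, "MLP" squares each element.
    let x = [1.0f32, 2.0];
    let seq = sequential_block(&x, |v: &[f32]| v.to_vec(), |v: &[f32]| v.iter().map(|a| a * a).collect());
    let par = parallel_block(&x, |v: &[f32]| v.to_vec(), |v: &[f32]| v.iter().map(|a| a * a).collect());
    println!("sequential = {:?}, parallel = {:?}", seq, par);
}
```

In the parallel form the MLP has no data dependency on the attention output, so a runtime is free to evaluate the two sub-blocks in either order while the block input is still resident in cache.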

### **EleutherAI/pythia-14m**

Originally engineered and trained at the explicit request of interpretability researchers attempting to scale sparse autoencoders, the Pythia-14M model utilizes a hidden dimension of 128, six layers, and four attention heads.6  
The defining characteristic of this model is its extreme lexical imbalance. The architecture shares the tokenizer used by the much larger GPT-NeoX-20B model, with a vocabulary of 50,304 tokens.6 Because the hidden dimension is so narrow (128), the parameter distribution is heavily skewed toward the edges of the network. The embedding layer alone constitutes 6.44 million parameters (50,304 × 128), roughly forty-six percent of the model's total 14.1 million parameters. Together with the equally sized unembedding matrix, about ninety-one percent of the parameters sit in static lookup and projection tables, leaving merely 1.2 million parameters for the attention and MLP layers.6  
For a browser-local WASM runtime, this imbalance creates a highly unique computational profile. The dense matrix multiplications occurring within the six transformer layers execute with near-instantaneous speed due to their microscopic size. Conversely, the final token generation step—a massive linear projection multiplying the hidden state against 50,304 individual tokens to calculate logits—demands the overwhelming majority of the CPU cycles.  
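A rough multiply-accumulate count makes this imbalance concrete. The sketch below assumes an MLP width of four times the hidden size and ignores biases, normalization, and the attention-score computation over the context; those simplifications, and the function names, are assumptions made for illustration.

```rust
const VOCAB: u64 = 50_304;
const HIDDEN: u64 = 128;
const LAYERS: u64 = 6;
/// Assumption: the MLP intermediate width is 4x the hidden size.
const MLP_HIDDEN: u64 = 4 * HIDDEN;

/// Weights in one embedding (or unembedding) matrix.
fn embedding_params() -> u64 {
    VOCAB * HIDDEN
}

/// Weights in the transformer layers' matrix multiplications.
fn layer_matmul_params() -> u64 {
    let attention = 4 * HIDDEN * HIDDEN; // Q, K, V, and output projections
    let mlp = 2 * HIDDEN * MLP_HIDDEN; // up and down projections
    LAYERS * (attention + mlp)
}

/// Share of per-token weight multiply-accumulates spent on the logits,
/// counting one multiply-accumulate per weight.
fn logit_share() -> f64 {
    let logits = embedding_params() as f64;
    logits / (logits + layer_matmul_params() as f64)
}

fn main() {
    println!("embedding matrix: {} parameters", embedding_params());
    println!("layer matmuls:    {} parameters", layer_matmul_params());
    println!("logit projection share: {:.1}%", 100.0 * logit_share());
}
```

Under these assumptions the final projection accounts for roughly 85 percent of the weight multiply-accumulates per generated token, so it is the first place to look when optimizing this model's inner loop.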
The most vital utility of the Pythia-14M model stems from its rigorous deterministic properties. The Pythia scaling suite is entirely unique in that the researchers provide identically trained model configurations that vary only by their initial random seed (for example, seed5 and seed8).7 By processing identical prompts through these distinct statistical variations within the TinyRustLM architecture, developers can definitively verify that the localized floating-point calculations, the activation functions, and the normalization loops are executing with flawless mathematical fidelity when compared against canonical PyTorch outputs.

### **EleutherAI/pythia-31m**

Scaling the hidden dimension to 256 and expanding the attention mechanism to eight heads, the 31 million parameter variant provides a significantly higher-capacity testing ground.16  
While the base models within the Pythia suite are strictly autoregressive next-token predictors, the 31 million parameter envelope possesses enough representational mass to support structured instruction tuning. Fine-tuned variants, most notably the Felladrin/Pythia-31M-Chat-v1 iteration, explicitly leverage advanced training regimes including Direct Preference Optimization (DPO) and Supervised Fine-Tuning (SFT).16 Furthermore, this specific variant incorporates structural chat templates natively (specifically the ChatML format).16  
At exactly 31.0 million parameters, the q8\_0 implementation of this model requires 32.9 megabytes of storage, allowing it to sit precisely at the absolute upper boundary of the 33.5 megabyte selector budget.16 Converting this chat-optimized variant into the .slm format offers the TinyRustLM ecosystem an unparalleled opportunity to run highly deterministic conversational demonstrations, validate prompt-template parsing logic, and demonstrate the viability of structured human-AI interactions entirely offline within the browser sandbox.

## **Legacy Architectures: Profiling GPT-2 and GPT-Neo**

While modern topologies dominate current research, verifying a generalized inference engine requires the parsing and execution of legacy architectural decisions. These legacy models serve as critical edge-case evaluations for the .slm offline converter and the WASM execution loop.

### **Crusadersk/tiny-gpt2**

Designed explicitly as a benchmarking artifact for scaling studies, this 25.0 million parameter model implements the pure, vanilla GPT-2 architecture.18 It utilizes a simplified structure consisting of just three transformer layers and a hidden dimension of 384.18  
The primary utility of this model lies in its deviation from modern positional strategies. It completely lacks complex relative positional frameworks like RoPE or Attention with Linear Biases (ALiBi). Instead, it relies on legacy, absolute learned positional embeddings. It is critical for the TinyRustLM converter to successfully identify, parse, and inject these absolute position tensors into the .slm package to guarantee comprehensive architectural support. At 25 million parameters, the model easily and comfortably fits within the 8-bit quantization budget (requiring roughly 26.5 megabytes), rendering it highly accessible for widespread testing.18

### **roneneldan/TinyStories-1M**

Representing the original proof-of-concept for microscopic narrative generation, this model operates on the GPT-Neo architecture, a slight structural modification of GPT-2.19 It is vital to note that the "1M" designation is misleading; while the non-embedding parameters total slightly over one million, the 50,257-token vocabulary embedding matrix pushes the true parameter count to approximately 3.7 million.20  
Despite its historical relevance, this model presents significant integration blockers for the TinyRustLM pipeline. The official repository provides only legacy PyTorch binaries (pytorch\_model.bin) and omits modern serialization standards entirely.21 Consequently, developers must first run a local Python script that re-exports the weights with the Hugging Face safetensors library before the Rust-based .slm converter can ingest the weight matrices safely. While its microscopic size (a mere 14.8 megabytes in uncompressed f32 format) is attractive, its deployment viability is inferior to the modern, Safetensors-native Llama variants.

## **Tokenizer Dynamics in Zero-Dependency WASM Environments**

A major and highly restrictive constraint imposed by the TinyRustLM project specifications is the absolute prohibition of third-party machine learning libraries and custom Python-only tokenizers. The localized runtime must execute the entirety of the tokenization and detokenization phases natively in pure Rust. This stringent requirement forces a deep evaluation of how vocabulary dictionaries are structured, parsed, and executed within a browser sandbox.  
Models derived from the Pythia (GPT-NeoX) lineage 6 and the GPT-2 lineage 18 rely heavily on standard Byte-Pair Encoding (BPE). Under this paradigm, the runtime must dynamically deserialize a static JSON dictionary containing exact string-to-integer mappings, alongside a massive, heavily populated list of sequential BPE merge rules. For an architecture containing a vocabulary size of 50,304 (such as the Pythia variants), the raw tokenizer.json and merges.txt files can collectively consume over two megabytes of raw text data.23 Parsing these sprawling JSON objects and subsequently constructing the internal Rust HashMaps or Trie data structures for over 50,000 unique entries during the WASM initialization phase introduces highly measurable latency, potentially blocking the main browser thread.  
Furthermore, string manipulation, regular expression (regex) parsing, and pre-tokenization logic implemented in Rust/WASM can incur profound performance penalties. The tokenizer split rules dictate exactly how punctuation, whitespace, capitalization, and special Unicode characters are isolated prior to the application of the BPE merge sequences. If the custom Rust runtime fails to replicate the exact regex splitting behavior of the original Python implementation (for instance, the notoriously complex GPT-2 regex splitting rules), the resulting token IDs will drift out of alignment. This misalignment inevitably causes the model to generate sub-optimal, hallucinated, or completely incoherent text.  
The architectural analysis highlights the advantage of low-vocabulary models. Candidates like the Aurelius-Llama-v2.5-10M-Large (featuring just 1,536 tokens) 5 and the StorySupra-10M (featuring 8,192 tokens) 11 are inherently better suited to memory-constrained browser environments. The drastic reduction in vocabulary size translates directly into minimal tokenizer state overhead. A 1,536-entry BPE lookup table has roughly 33 times fewer entries than a 50,304-entry one, so it initializes far faster within WebAssembly, shrinks the memory footprint of the runtime's internal string tables, and leaves fewer merge rules to apply when processing user inputs.  
To achieve true zero-dependency deployment, the offline .slm conversion script must abandon standalone tokenizer files entirely. The script must be engineered to accurately deserialize the source tokenizers, serialize the mappings into a dense, binary format, and bundle the tokenizer metadata directly into the header of the compiled .slm binary payload. This architectural decision completely bypasses the need for separate network requests for tokenizer.json blobs and ensures total encapsulation of the model logic.
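One possible shape for such an embedded vocabulary section is a length-prefixed byte table, sketched below. The layout (a little-endian u32 entry count followed by a u16 length and raw bytes per token, with the token id implied by position) is a hypothetical illustration and not the actual .slm header format; merge rules and special-token metadata are left out.

```rust
/// Serialize a vocabulary as: u32 count, then (u16 length, bytes) per token.
/// The token id is the entry's position in the list.
fn encode_vocab(tokens: &[Vec<u8>]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(tokens.len() as u32).to_le_bytes());
    for token in tokens {
        out.extend_from_slice(&(token.len() as u16).to_le_bytes());
        out.extend_from_slice(token);
    }
    out
}

/// Parse the table back; returns None if the buffer is truncated.
fn decode_vocab(data: &[u8]) -> Option<Vec<Vec<u8>>> {
    let head = data.get(0..4)?;
    let count = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;
    let mut pos = 4;
    let mut tokens = Vec::new();
    for _ in 0..count {
        let len_bytes = data.get(pos..pos + 2)?;
        let len = u16::from_le_bytes([len_bytes[0], len_bytes[1]]) as usize;
        pos += 2;
        tokens.push(data.get(pos..pos + len)?.to_vec());
        pos += len;
    }
    Some(tokens)
}

fn main() {
    let vocab = vec![b"a".to_vec(), b"th".to_vec(), "é".as_bytes().to_vec()];
    let blob = encode_vocab(&vocab);
    println!("{} tokens -> {} bytes", vocab.len(), blob.len());
    println!("round trip ok: {}", decode_vocab(&blob) == Some(vocab));
}
```

Tokens are kept as raw bytes rather than `String` values because byte-level BPE vocabularies can contain entries that are not valid UTF-8 on their own.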

## **Tensor-Shape and Offline Converter Implementation Requirements**

Converting these disparate neural architectures into the highly customized, unified .slm artifact format requires meticulous handling of tensor shapes and memory strides. Native PyTorch checkpoints, standardized Safetensors files, and various downstream inference backends often store contiguous weights utilizing drastically different dimensional layouts. The offline Rust converter must aggressively normalize these arrays prior to quantization to prevent catastrophic runtime panics or silent mathematical failures within the WebAssembly environment.  
**1\. Interleaved versus Concatenated RoPE Weights:** Llama-based candidates, such as the llama2.c-stories15M and Aurelius variants 3, use Rotary Positional Embeddings to encode token position. Depending on the framework that produced the checkpoint, the Query and Key projection weights (q\_proj and k\_proj) may order the rotary dimensions within each attention head in one of two ways: interleaved, where adjacent dimensions (2i and 2i+1) form a rotation pair, as in the original Llama and llama2.c code; or concatenated, where dimension i is paired with dimension i + d/2 for a head of size d, as in checkpoints converted for Hugging Face Transformers. Both layouts have identical tensor shapes, so the offline TinyRustLM converter must determine the convention from the checkpoint's source format and permute the rows into a single standardized layout before executing the quantization pass. Failure to resolve this discrepancy will cause the WASM runtime to rotate the wrong dimension pairs during inference, destroying the model's ability to maintain coherent context.  
**2\. GQA Key-Value Broadcasting Strategies:** When processing the Aurelius-Llama-v2.5 architecture 5, the Grouped-Query Attention mechanism dictates that six independent Query heads must simultaneously share access to just two Key and two Value heads. The Rust execution loop must be engineered to support a physical memory stride that allows multiple distinct query dot-products to reference the identical Key-Value cache block concurrently. If the WASM runtime only supports standard Multi-Head Attention algorithms, attempting to execute the Aurelius model will require significant algorithmic refactoring to avoid duplicating the cache allocations in memory, which would entirely negate the efficiency gains of the GQA topology.  
**3\. QKV Tensor Fusion and Splitting:** GPT-NeoX architectures, such as the Pythia suite 6, typically fuse the Query, Key, and Value projection matrices into a single contiguous query\_key\_value tensor. This is a common optimization designed to minimize the number of independent kernel launches required on high-performance GPUs. Conversely, standard Llama models generally serialize these transformations as three separate q\_proj, k\_proj, and v\_proj matrices. To keep the linear algebra backend in Rust simple, universal, and maintainable, the offline .slm converter must automatically slice fused QKV matrices into their three components during the packaging phase. The slicing must follow the fused tensor's actual row order, which in GPT-NeoX checkpoints is typically grouped per attention head rather than stored as three contiguous thirds. This ensures that the WASM runtime can execute a single, uniform loop regardless of the originating architectural family.
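The splitting step in the third requirement can be sketched as follows. The code assumes the row order used by the Hugging Face GPT-NeoX implementation, where the fused weight is grouped per head as [q, k, v] slabs of `head_dim` rows each; that layout should be verified against the actual checkpoint, and `SplitQkv` and `split_fused_qkv` are hypothetical names.

```rust
/// Separate Q, K, and V weight matrices, each row-major [n_heads * head_dim, hidden].
struct SplitQkv {
    q: Vec<f32>,
    k: Vec<f32>,
    v: Vec<f32>,
}

/// Split a fused row-major [n_heads * 3 * head_dim, hidden] weight.
/// Assumed row order: head 0 q rows, head 0 k rows, head 0 v rows, head 1 q rows, ...
fn split_fused_qkv(fused: &[f32], n_heads: usize, head_dim: usize, hidden: usize) -> SplitQkv {
    assert_eq!(fused.len(), 3 * n_heads * head_dim * hidden, "unexpected fused size");
    let slab = head_dim * hidden; // floats in one head's q (or k, or v) rows
    let mut out = SplitQkv { q: Vec::new(), k: Vec::new(), v: Vec::new() };
    for head in 0..n_heads {
        let base = head * 3 * slab;
        out.q.extend_from_slice(&fused[base..base + slab]);
        out.k.extend_from_slice(&fused[base + slab..base + 2 * slab]);
        out.v.extend_from_slice(&fused[base + 2 * slab..base + 3 * slab]);
    }
    out
}

fn main() {
    // Two heads, head_dim 1, hidden 2: twelve floats numbered in storage order.
    let fused: Vec<f32> = (0..12).map(|i| i as f32).collect();
    let parts = split_fused_qkv(&fused, 2, 1, 2);
    println!("q = {:?}\nk = {:?}\nv = {:?}", parts.q, parts.k, parts.v);
}
```

A naive split into three contiguous thirds would silently mix query, key, and value rows under this layout, which is the kind of silent mathematical failure the section warns about.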

## **Strategic Deployment Scenarios for TinyRustLM**

The incredibly specific technical constraints defining the TinyRustLM project—zero reliance on third-party stacks, the complete absence of server-side reliance, and extremely tight byte budgets—align perfectly with a series of highly specialized deployment scenarios. By leveraging the unique attributes of the identified model candidates, developers can unlock complex local interactions previously deemed impossible in browser environments.

### **1\. Deterministic Smoke Testing and Verification**

The continuous integration and continuous deployment (CI/CD) pipelines that maintain the TinyRustLM engine must verify that the from-scratch implementations of standard activation functions (including GELU and SwiGLU) and normalization layers (including RMSNorm and LayerNorm) 24 reproduce the canonical PyTorch baselines to within a tight floating-point tolerance. The EleutherAI/pythia-14m model is well suited to this validation. By instantiating the pythia-14m-seed5 and pythia-14m-seed8 variants 7 in their uncompressed f32 formats (about 56 megabytes each, which exceeds the browser selector budget but fits comfortably in memory during local development testing), the test suite can check the engine against more than one set of reference weights. Avoiding quantization entirely during these tests keeps integer rounding noise out of the regression results, so any remaining discrepancy points to a defect in the runtime itself.
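A regression check of this kind usually compares logits with a tolerance, since f32 results from two different runtimes rarely agree bit for bit when operations are summed in a different order. The helpers below are a minimal sketch; the function names and the example tolerances are assumptions, and the reference values would come from a PyTorch run exported ahead of time.

```rust
/// Largest elementwise absolute difference between two logit vectors.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "logit vectors must have equal length");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

/// True when every element satisfies |actual - reference| <= atol + rtol * |reference|.
fn all_close(actual: &[f32], reference: &[f32], atol: f32, rtol: f32) -> bool {
    actual.len() == reference.len()
        && actual
            .iter()
            .zip(reference)
            .all(|(a, r)| (a - r).abs() <= atol + rtol * r.abs())
}

/// Index of the largest logit, i.e. the greedy next-token choice.
fn argmax(v: &[f32]) -> Option<usize> {
    let mut best: Option<usize> = None;
    for (i, x) in v.iter().enumerate() {
        if best.map_or(true, |b| *x > v[b]) {
            best = Some(i);
        }
    }
    best
}

fn main() {
    // Hypothetical values standing in for runtime output and a stored reference.
    let runtime = [0.10f32, 2.50001, -1.0];
    let reference = [0.10f32, 2.5, -1.0];
    println!("max diff = {}", max_abs_diff(&runtime, &reference));
    println!("close    = {}", all_close(&runtime, &reference, 1e-4, 1e-4));
    println!("same greedy token = {}", argmax(&runtime) == argmax(&reference));
}
```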

### **2\. Microscopic Conversational Demonstrations**

While microscopic language models lack the capacity for deep reasoning, complex logical deduction, or extensive factual recall, they remain capable of convincing structural mimicry. Fine-tuned variants, such as the Felladrin/Pythia-31M-Chat-v1 16, provide an immediate and compelling showcase for browser-local AI capabilities. Packaged in the q8\_0 format (occupying approximately 32.9 megabytes) 16, this model can process standard system, user, and assistant templates. It demonstrates the engine's ability to run multi-turn, interactive chat loops entirely disconnected from any cloud service, with no network round trip between the user's input and the model's response.

### **3\. Model-Breeding and Dynamic Adapter Experiments**

Tiny models represent unparalleled computational sandboxes for testing experimental parameter injection algorithms. "Model breeding" involves permuting specific transformer layers, blending distinct architectural checkpoints, or dynamically overriding weights at runtime. The 15 million parameter Llama architecture (llama2.c-stories15M) 2 serves as an absolutely ideal substrate for testing Low-Rank Adaptation (LoRA) implementations directly within the customized Rust engine.  
A standard rank-8 LoRA adaptation applied to a hidden dimension of 288 (the size used by the llama2.c-stories15M model) requires very little auxiliary parameter allocation. For each adapted projection, the $A$ matrix ($8 \times 288$) and the $B$ matrix ($288 \times 8$) together total just 4,608 parameters. Adapting both the query and value projections therefore costs 9,216 parameters per layer, and across the model's six transformer layers the entire adapter payload amounts to 55,296 parameters (roughly 221 kilobytes when serialized in f32). This small payload allows the TinyRustLM platform to test dynamic adapter hot-swapping: downloading specialized behavior patches of roughly 200 kilobytes to the browser over standard network connections and injecting them into the base model's computation graph during live inference. This approach allows wide behavioral flexibility without ever altering the underlying, cached .slm base weights.
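The adapter arithmetic and the runtime injection step can be sketched together. The code follows the usual LoRA formulation, in which the adapted output is the base output plus (alpha / rank) · B(Ax); the function names, the row-major layouts, and the `alpha` scaling parameter are assumptions for illustration.

```rust
/// Parameters in one LoRA pair: A is [rank, d_in], B is [d_out, rank].
fn lora_params(d_in: usize, d_out: usize, rank: usize) -> usize {
    rank * d_in + d_out * rank
}

/// Low-rank correction added to a projection's output: (alpha / rank) * B (A x).
/// `a` is row-major [rank, d_in]; `b` is row-major [d_out, rank].
fn lora_delta(x: &[f32], a: &[f32], b: &[f32], rank: usize, alpha: f32) -> Vec<f32> {
    let d_in = x.len();
    let d_out = b.len() / rank;
    // Project down to `rank` values.
    let ax: Vec<f32> = (0..rank)
        .map(|r| a[r * d_in..(r + 1) * d_in].iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
        .collect();
    // Project back up and apply the scaling factor.
    let scale = alpha / rank as f32;
    (0..d_out)
        .map(|o| scale * b[o * rank..(o + 1) * rank].iter().zip(&ax).map(|(w, v)| w * v).sum::<f32>())
        .collect()
}

fn main() {
    let per_projection = lora_params(288, 288, 8);
    println!("per adapted projection: {} parameters", per_projection);
    for projections_per_layer in [1usize, 2] {
        let total = 6 * projections_per_layer * per_projection;
        println!(
            "{} projection(s) per layer, 6 layers: {} parameters, {} bytes in f32",
            projections_per_layer,
            total,
            total * 4
        );
    }
    println!("delta = {:?}", lora_delta(&[1.0, 2.0], &[1.0, 1.0], &[2.0, 3.0], 1, 1.0));
}
```

Because the correction is computed separately and added to the base projection's output, the cached base weights never need to be modified or re-quantized when an adapter is swapped.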

### **4\. Comprehensive Tokenizer and Runtime Coverage**

Evaluating how gracefully the WebAssembly engine handles edge cases requires a diverse, structured test matrix. Integrating the Aurelius-Llama-v2.5-10M-Large model forces the engine to handle Grouped-Query Attention (GQA) and a highly compressed token dictionary.5 Deploying the legacy tiny-gpt2 model exercises the absolute positional embedding path and ensures the engine can function without RMSNorm.18 Running these varied architectures alongside the standardized Llama implementations helps ensure that both the offline .slm conversion script and the WASM execution loop are robust across architectures, which guards against long-term architectural lock-in.
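
One way to picture this coverage matrix is as a small set of architecture descriptors that select code paths in the engine. The Rust sketch below is illustrative only: `ArchSpec`, `Norm`, and `Position` are hypothetical names, the head counts are placeholder values chosen to show a 3:1 GQA ratio rather than the real model configurations, and LayerNorm is assumed as the normalization used on the non-RMSNorm path.

```rust
/// Normalization variants the engine would need to cover.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Norm {
    RmsNorm,
    LayerNorm,
}

/// Positional encoding variants.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Position {
    Rotary,
    LearnedAbsolute,
}

/// Hypothetical per-model descriptor read by the runtime.
struct ArchSpec {
    name: &'static str,
    norm: Norm,
    position: Position,
    n_heads: usize,
    n_kv_heads: usize,
}

impl ArchSpec {
    /// Key/value head read by query head `q`.
    /// Under GQA, consecutive query heads share one KV head.
    fn kv_head_for(&self, q: usize) -> usize {
        assert!(q < self.n_heads && self.n_heads % self.n_kv_heads == 0);
        q / (self.n_heads / self.n_kv_heads)
    }
}

/// RMSNorm: scale by the reciprocal root mean square, no mean subtraction.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv * w).collect()
}

/// LayerNorm: subtract the mean, divide by the standard deviation, then scale and shift.
fn layer_norm(x: &[f32], weight: &[f32], bias: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(weight.iter().zip(bias))
        .map(|(v, (w, b))| (v - mean) * inv * w + b)
        .collect()
}

fn main() {
    // Placeholder head counts; real values come from each model's config.
    let specs = [
        ArchSpec { name: "llama-style", norm: Norm::RmsNorm, position: Position::Rotary, n_heads: 6, n_kv_heads: 6 },
        ArchSpec { name: "gqa-style", norm: Norm::RmsNorm, position: Position::Rotary, n_heads: 6, n_kv_heads: 2 },
        ArchSpec { name: "gpt2-style", norm: Norm::LayerNorm, position: Position::LearnedAbsolute, n_heads: 2, n_kv_heads: 2 },
    ];
    let x = [3.0f32, 4.0];
    for s in &specs {
        let normed = match s.norm {
            Norm::RmsNorm => rms_norm(&x, &[1.0, 1.0], 1e-5),
            Norm::LayerNorm => layer_norm(&x, &[1.0, 1.0], &[0.0, 0.0], 1e-5),
        };
        let kv_map: Vec<usize> = (0..s.n_heads).map(|q| s.kv_head_for(q)).collect();
        println!("{}: {:?}, {:?}, kv map {:?}, normed {:?}", s.name, s.norm, s.position, kv_map, normed);
    }
}
```

Dispatching on an explicit descriptor keeps each architectural difference in one place, so adding a further model family means adding an enum variant and its kernel rather than forking the execution loop.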

## **Conclusions on Viability and Implementation**

Building and maintaining a robust, standalone Rust/WASM inference runtime demands careful curation of model artifacts. The hard constraints of browser-based linear memory, specifically the 33.5 megabyte selector budget, preclude the deployment of standard, billion-parameter open-source models. However, the maturing ecosystem of high-density, sub-50M parameter language models provides a viable and immediate pathway for fully localized, zero-dependency execution.  
The architectural analysis indicates that the **Llama architecture**, instantiated as the **Xenova/llama2.c-stories15M** candidate 1, is the most stable and most strongly recommended primary target. Its dimensions fit the target quantization formats, and it requires no topological anomalies, proprietary modifications, or algorithmic hacks to execute within the Rust runtime.  
For advanced testing designed to push the engine to its limits, the **MultivexAI/Aurelius-Llama-v2.5-10M-Large** model 5 addresses WASM computational bottlenecks through its minimized vocabulary and aggressive use of Grouped-Query Attention. The **EleutherAI/pythia** series offers a broad testing ground for validating parallel attention and MLP residual pathways and for deterministic validation against reference outputs.6 By focusing the offline .slm converter's optimization logic on these specific architectural variations, developers can target a stable, lightweight, and fully offline-capable execution environment, fulfilling the core technical vision of secure, zero-dependency browser-local artificial intelligence.
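
The budget argument can be checked with a rough size estimate. The Rust sketch below assumes a GGML-style block layout (32 weights per block, with a 2-byte scale plus 32 bytes for Q8\_0 or 16 bytes for Q4\_0); whether the .slm format uses exactly this layout is an assumption, as is reading the 33.5 megabyte budget as decimal megabytes. It counts weight payload only and ignores tokenizer data, metadata, and any tensors kept in higher precision.

```rust
/// Budget in bytes, reading "33.5 MB" as decimal megabytes (assumption).
const BUDGET_BYTES: u64 = 33_500_000;
/// Assumed Q8_0 block: 2-byte scale + 32 one-byte weights.
const Q8_0_BLOCK: u64 = 34;
/// Assumed Q4_0 block: 2-byte scale + 16 bytes holding 32 four-bit weights.
const Q4_0_BLOCK: u64 = 18;

/// Bytes needed to store `params` weights in blocks of 32, rounding up to whole blocks.
fn quantized_bytes(params: u64, block_bytes: u64) -> u64 {
    let blocks = (params + 31) / 32;
    blocks * block_bytes
}

fn main() {
    let models = [("15.2M parameters", 15_200_000u64), ("1B parameters", 1_000_000_000u64)];
    for (name, params) in models {
        for (fmt, block) in [("Q8_0", Q8_0_BLOCK), ("Q4_0", Q4_0_BLOCK)] {
            let bytes = quantized_bytes(params, block);
            let verdict = if bytes <= BUDGET_BYTES { "fits" } else { "exceeds budget" };
            println!("{name} in {fmt}: {bytes} bytes ({verdict})");
        }
    }
}
```

Under these assumptions a 15.2-million-parameter model needs about 16.2 MB in Q8\_0 and about 8.6 MB in Q4\_0, while a billion-parameter model exceeds the budget by more than an order of magnitude even at 4 bits per weight.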

#### **Works cited**

1. mgoin/llama2.c-stories15M-ds \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/mgoin/llama2.c-stories15M-ds](https://huggingface.co/mgoin/llama2.c-stories15M-ds)  
2. GitHub \- karpathy/llama2.c: Inference Llama 2 in one file of pure C, accessed June 29, 2026, [https://github.com/karpathy/llama2.c](https://github.com/karpathy/llama2.c)  
3. config.json · Xenova/llama2.c-stories15M at ... \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/Xenova/llama2.c-stories15M/blob/9397aa7180a7b88fa20a0197da2a0601973373e9/config.json](https://huggingface.co/Xenova/llama2.c-stories15M/blob/9397aa7180a7b88fa20a0197da2a0601973373e9/config.json)  
4. softmax1/llama2.c-tinystories: Inference Llama 2 in one file of pure C \- GitHub, accessed June 29, 2026, [https://github.com/softmax1/llama2.c-tinystories](https://github.com/softmax1/llama2.c-tinystories)  
5. MultivexAI/Aurelius-Llama-v2.5-10M-Large \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/MultivexAI/Aurelius-Llama-v2.5-10M-Large](https://huggingface.co/MultivexAI/Aurelius-Llama-v2.5-10M-Large)  
6. EleutherAI/pythia-14m · Hugging Face, accessed June 29, 2026, [https://huggingface.co/EleutherAI/pythia-14m](https://huggingface.co/EleutherAI/pythia-14m)  
7. config.json · EleutherAI/pythia-14m-seed5 at main \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/EleutherAI/pythia-14m-seed5/blob/main/config.json](https://huggingface.co/EleutherAI/pythia-14m-seed5/blob/main/config.json)  
8. Add model files from step143000 to main branch · EleutherAI/pythia, accessed June 29, 2026, [https://huggingface.co/EleutherAI/pythia-14m-seed8/commit/b23e2bff35693ed8ac2a91109e589eb54f06fd35](https://huggingface.co/EleutherAI/pythia-14m-seed8/commit/b23e2bff35693ed8ac2a91109e589eb54f06fd35)  
9. llama2, accessed June 29, 2026, [https://www.marble.onl/posts/llama2.html](https://www.marble.onl/posts/llama2.html)  
10. Incorrect parameter counts for 15M, 42M, 110M models? · Issue \#378 · karpathy/llama2.c, accessed June 29, 2026, [https://github.com/karpathy/llama2.c/issues/378](https://github.com/karpathy/llama2.c/issues/378)  
11. README.md · SupraLabs/StorySupra-10M at main \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/SupraLabs/StorySupra-10M/blob/main/README.md](https://huggingface.co/SupraLabs/StorySupra-10M/blob/main/README.md)  
12. Upload 9 files · SupraLabs/StorySupra-10M at 366d265, accessed June 29, 2026, [https://huggingface.co/SupraLabs/StorySupra-10M/commit/366d2657efec37ff73755f0c74988220e9d97421](https://huggingface.co/SupraLabs/StorySupra-10M/commit/366d2657efec37ff73755f0c74988220e9d97421)  
13. model.safetensors · SupraLabs/StorySupra-10M at main \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/SupraLabs/StorySupra-10M/blob/main/model.safetensors](https://huggingface.co/SupraLabs/StorySupra-10M/blob/main/model.safetensors)  
14. mradermacher/llama2.c-stories42M-GGUF at main \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/mradermacher/llama2.c-stories42M-GGUF/blob/main/llama2.c-stories42M.IQ4\_XS.gguf](https://huggingface.co/mradermacher/llama2.c-stories42M-GGUF/blob/main/llama2.c-stories42M.IQ4_XS.gguf)  
15. EleutherAI/pythia-31m-seed8 at 862f153c60ef1c1b5d4176452da80f76b226b0a2 \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/EleutherAI/pythia-31m-seed8/tree/862f153c60ef1c1b5d4176452da80f76b226b0a2](https://huggingface.co/EleutherAI/pythia-31m-seed8/tree/862f153c60ef1c1b5d4176452da80f76b226b0a2)  
16. Felladrin/Pythia-31M-Chat-v1 · Hugging Face, accessed June 29, 2026, [https://huggingface.co/Felladrin/Pythia-31M-Chat-v1](https://huggingface.co/Felladrin/Pythia-31M-Chat-v1)  
17. RichardErkhov/Felladrin\_-\_Pythia-31M-Chat-v1-8bits \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/RichardErkhov/Felladrin\_-\_Pythia-31M-Chat-v1-8bits](https://huggingface.co/RichardErkhov/Felladrin_-_Pythia-31M-Chat-v1-8bits)  
18. Crusadersk/tiny-gpt2 \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/Crusadersk/tiny-gpt2](https://huggingface.co/Crusadersk/tiny-gpt2)  
19. TinyStories: How Small Can Language Models Be and Still Speak Coherent English? \- ar5iv, accessed June 29, 2026, [https://ar5iv.labs.arxiv.org/html/2305.07759](https://ar5iv.labs.arxiv.org/html/2305.07759)  
20. roneneldan/TinyStories-1M · Actual number of parameters? \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/roneneldan/TinyStories-1M/discussions/5](https://huggingface.co/roneneldan/TinyStories-1M/discussions/5)  
21. roneneldan/TinyStories-33M at main \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/roneneldan/TinyStories-33M/tree/main](https://huggingface.co/roneneldan/TinyStories-33M/tree/main)  
22. Generate completion \- Datafund App, accessed June 29, 2026, [https://app.datafund.io/viewentry/3](https://app.datafund.io/viewentry/3)  
23. sshleifer/tiny-gpt2 at 7343b21fbb20c9705baef296fb74055c983b4d7d \- Hugging Face, accessed June 29, 2026, [https://huggingface.co/sshleifer/tiny-gpt2/tree/7343b21fbb20c9705baef296fb74055c983b4d7d](https://huggingface.co/sshleifer/tiny-gpt2/tree/7343b21fbb20c9705baef296fb74055c983b4d7d)  
24. litgpt/litgpt/config.py at main · Lightning-AI/litgpt \- GitHub, accessed June 29, 2026, [https://github.com/Lightning-AI/litgpt/blob/main/litgpt/config.py](https://github.com/Lightning-AI/litgpt/blob/main/litgpt/config.py)

[image1]: <data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAA8AAAAcCAYAAAC+lOV/AAAAz0lEQVR4XmNgGOyAG4hZoZhkQLZmYSA+BcS+UEwSyADi/wxkaJYF4pNA/A+Iy6GYKMAIxJVAHAHEDxlI1KwNxO0MENtBmhdCMUHAwgDRCDKAF4gPM9BLsy0DxL8gf/MA8QEkDOJjBZxQvBSI84E4BIhjgPgGAxGaA6C4hwGiEYYPAfFdKBaHq0YCoJQ0DYpBbGTQygAJcRCWRJMD+60MiCOhGB2A4vcTFOsjS/gB8TsGSBKEKXCEyoFcsBuIf0HlQfgVEJdA5SnTPApGAe0AAGr2OAmQWWDHAAAAAElFTkSuQmCC>

[image2]: <data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEIAAAAZCAYAAACFHfjcAAAC60lEQVR4Xu2WS6hPURjFlxDyKs+JuN4kUV4pA0SRlBgQMjCSJDFRpCskRYkiBrgKeaQM5JmuRwYmyhQlcxPFwHut/7f3Pfvsu8//nDv4UzqrfoOz93f2XvvbT6BWrVol6kvmkvmOQfnqLql8kWMi6Z2vbkhlqvNxRW21WqGPyh7qRMAS8IjsJjcdX8l20iuI20iukQ2OA+Q5GRfEjCTXYW1tcjwke5Bvq9WKfYQekj6Gw4Jmum8feIj8JMtd+VRyFrZyQm0mV0kf963/lmTVDamPJ2RhVN5KxT5CD0kfc8gX8pgMDMq1nL6TU+57NblPBnRFmOaR27B/tfyU1BW5CEvsZbIyKo+lNkbHhZFGoPtkxEr5CD0kfeino2QL8ktGA9T28IkIEzbGoT14hGxzMfr/IvlGtrp6MY3cI6NcXJGGkg7YVk1pFTmB8kSkfIQeynzktAP5raGlv5/8hnUgXpO9yBubTN67uHeOZ2RKENNMSsYN2IHttdZxEuVJ8Ip99MRDQ22Ot+Q48h2PhW2PHw518hIW76XsaxY+uXqfuJ2uroqUjCuwZPgE9CQJUuwj9FDqQwZ0e4h25DueTl6QGWSI4wysk07YFtOS1NI9SPqRpY6P5BdsUFUlL9qGl2DnUnw2NVPKR+ih1EedCNigT8POBqHGtIQGu3odmioPpZj1sOWnw1XJ6kT3k38Y7OrSctc/VSSzF2CHnn/kVVXKR+ih0IcKdyG7OXyQBqd91Z/chV2hsZSoB7Drdhns2kq94jSwW8jeG820DjYpmhx/Xvgzo4qKfHgPSR8atGb6FTkfocb8MmqHzU58YE1Cdi2NJ2/IglyE9aFl6q/ZZgqT4KVkiHOoloyUj9BD0scs8hnZCR+iB5VmWvLX2h1kT2ytlqdktouRdNd/gD1tfVwHOYbuSYw1gexDcZw8HIY9qsoU+wg9FLVfWcpqG1nj0MMn1ai20mLYVhLxmfG3FPr4Vx5q1apVq1at/1V/APfnqmfYI2fWAAAAAElFTkSuQmCC>

[image3]: <data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABEAAAAbCAYAAACa9mScAAABAElEQVR4Xu3SsUtCURgF8E9QMDBJCSRycJEIsr9At6BJh3QIHJudHR2ccpLG9nBpbXGQh47NjUE6OAjiFpSDnst34nUv3O0txTvw48H97rs8zrsi/zq39OgxgDPyJpJDjugaPqEHJ1SEPmypyXe8uRHdWHXWzWFzCiBjTZ08wDsUnPUL2NAIkvZYY042AngWe1NKtJcFVX7NrJzTGp6gRXcwhnvIkjcN+oK6hKWWRT9/CiXyxnTh6+MUPuCF0taUOYQZuX2Y5OAV3ujYHmsiOeSnUKPrzExqol0NKWGPNW0Jb6N7yS5Fe5pAnqxcif73b9jRkmvGis8OHPCdOHH+cPbel0SGPZuB/AAAAABJRU5ErkJggg==>

[image4]: <data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAEIAAAAZCAYAAACFHfjcAAADRElEQVR4Xu2XWahOURiGPyFkKjMpjlnEhSlFDjkiQ8KF4UYUklxwIcPFOeFCUYoyZioZI1GGhKS4UIZyIwo3LiRRbpThffrW9q9/7384R45c7Leezr/XXnvvtd7vW99axyxXrlxV1FIMFJMCHYpv/zO1FbViZPhdTl3FFNFHtEjdS9RajBDjrfK7ipQbIXUX58QGsSxwS2y08h9pDs0X18y/e1d8ErOKepi1EpvEAbFI7BOXRee4kzRYXBWrxErxSCwo6lFC28XUVBuO3xETU+3NpaHipugVronmefHBPKqAakWDFQLEX8ZfH64R0T9obkaiQeKhqAlkxBIg+jNT7XzglGUjklZ70TPdmFI384lV0lzxUxyN2shM2lYHENmw34ozlUjHz/UWj82XV6KO4ooYHciIFx4X38QK81oBw8R10aPQtaRIyZNibPqGNDuwx6obweCZ4IyojQmmjZgnfph/s5NoJ06LunAfEdx75tlEO3OcJs6Y94eSIoVem3/0VeC+GBJ3qiDMII0pSomYxN5ANRNKicFjTHppMAlMYKyfxROxJPSPRSGlxtDvhbhghWVXVmQA2fDR/EEgQ9aHe40RZhAZzEhMwIA/MQFNMDdhnfkk44mOEs/Fd/OxEoQu0X2EYfXiixVMm2PZd/0WjaRug2hjnkLwzjwFq1baSJhxW5ywCunXCPUTz8Ryyw56uvk3iG5f83Wf1BZ2FODbZ813C66Xiq+BMYGMciOChpsXlnTlJ9XYPkn39GDKCdOOmRffuF40RZh5ybzIIpYWOxOwLXJmiHc4lu5m8dbcGMAszhBxMNg+X4otgYx4iO2z1EmSiV00d7WaFpofbhh4XC+aYgjP7jIfUyLGkMA2/MCyEWXyHJgGBNhhKKhpYcDOQEY15oWH4hSLLGC5rEm1l1JsQiLMOBRojBk8u9s8Cw9HcD0uQEDYAsmAOEsnh/Zka+QQ+FT0j/rwfs5FdYGSIg3fmB+xFwdwlOhUq/pEYKuV7ocZsMM8mpXEwS3ZsWLeWyHSKIn+EfMj9jZxI7QnwiR2G7ZNjtgczFgqa8O9iks9+WeHEx6ka8b/JOoCtQ0j+FtuiycIHNCA37ly5cqVK1euv6dfmk6jAuVPfK4AAAAASUVORK5CYII=>