IBM is continuing to evolve its Granite family of LLMs at a rapid pace. Hot on the heels of announcing the Granite 3.2 models, it has released the Granite 3.3 series, which brings several significant new advances, including IBM's first speech-to-text model, improvements in AI reasoning performance and a new "fill-in-the-middle" capability that's particularly helpful for AI-powered coding use cases.
The new Granite 3.3 series comprises the Granite Speech 3.3 8B speech-to-text (STT) model, the Granite 3.3 8B Instruct model (the LLM that serves as the speech model's foundation) and the smaller 2B Instruct version. Granite 3.3 8B Base and Granite 3.3 2B Base models are also available for developers to fine-tune.
High-Performing Speech-To-Text
Granite Speech 3.3 8B is aimed at enterprise applications that need to process speech inputs and is optimized for automatic speech recognition (ASR) and automatic speech translation (AST) tasks. It outperforms leading competitors on transcription tests (with a lower error rate) and matches GPT-4o on translation tests, providing automated translation from English to French, Spanish, Italian, German, Portuguese, Japanese, Mandarin and more.
Unlike some speech models that combine speech and text in a single pass, Granite Speech 3.3 uses a modular two-step architecture. So, if you want to ask the model questions about an audio file, you first pass the audio file to the speech encoder module for transcription and then query the transcribed text using the underlying Granite 3.3 8B Instruct model. This separation ensures that the model's performance on text queries is equivalent to that of the underlying Granite 3.3 8B Instruct, avoiding the reduction in text-based performance that often comes with multimodal models.
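To make the two-step flow concrete, here is a minimal sketch using Hugging Face transformers pipelines: transcribe first, then query the transcript with the text model. The model repo IDs and the assumption that the speech model works with the standard ASR pipeline are mine, not an official IBM recipe.

```python
# Minimal two-step sketch: transcribe the audio, then query the transcript.
# Model repo ids and ASR-pipeline compatibility are assumptions to verify
# against the ibm-granite organization on Hugging Face.
from transformers import pipeline

# Step 1: the speech encoder module transcribes the audio file.
asr = pipeline(
    "automatic-speech-recognition",
    model="ibm-granite/granite-speech-3.3-8b",  # assumed repo id
)
transcript = asr("meeting.wav")["text"]

# Step 2: the underlying Instruct model answers questions about the text.
chat = pipeline("text-generation", model="ibm-granite/granite-3.3-8b-instruct")
prompt = f"Summarize the key decisions in this transcript:\n\n{transcript}"
print(chat(prompt, max_new_tokens=200)[0]["generated_text"])
```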
Another important advantage of Granite Speech 3.3 is its ability to accept and process audio files up to 20 minutes long (handled in 1-minute chunks for optimal accuracy). It's not restricted to a 30-second window as is standard on the likes of OpenAI's Whisper speech model, which requires longer audio files to be cut into 30-second chunks, introducing inaccuracies where the cuts are made.
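As an illustration of why chunking matters, here is a naive fixed-window splitter of the kind a Whisper-style workflow would need. The boundary logic is a simplifying assumption, not IBM's implementation, and it shows why arbitrary cuts can land mid-word.

```python
# Hypothetical helper: split a long recording into fixed-length chunks.
# Naive fixed-window cuts like these can land mid-word, which is where the
# transcription inaccuracies mentioned above creep in.
import soundfile as sf

def fixed_window_chunks(path, chunk_seconds=60):
    audio, sr = sf.read(path)           # waveform samples and sample rate
    step = chunk_seconds * sr           # samples per chunk
    return [audio[i:i + step] for i in range(0, len(audio), step)]

chunks = fixed_window_chunks("long_call.wav")  # ~20 chunks for a 20-minute file
```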
Improved Reasoning With Reinforced Learning
In my last article, I discussed the emergence of AI reasoning models, highlighting that IBM had introduced reasoning capabilities in its Granite 3.2 series. Now, with the Granite 3.3 Instruct model, IBM has quickly taken things further by achieving improvements on industry benchmarks for reasoning.
The new model has been fine-tuned with multi-stage reinforcement learning using Thought Preference Optimization (TPO) and Group Relative Policy Optimization (GRPO), enabling it to make gains on highly technical benchmarks conventionally associated with "reasoning" capabilities. I understand that IBM has been able to lean partly on synthetic data for model training, which is likely to have contributed to the reasoning improvements.
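For readers unfamiliar with GRPO, its core idea is to score each sampled completion against its own group of completions for the same prompt, rather than against a learned value function. A toy sketch of that normalization step (not IBM's training code) looks like this:

```python
# GRPO's group-relative advantage: each completion's reward is normalized
# against the mean and standard deviation of its own group, so a completion
# only earns positive advantage by beating its siblings for the same prompt.
import numpy as np

rewards = np.array([0.2, 0.9, 0.4, 0.7])  # rewards for 4 sampled completions
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # above-average completions get positive advantage
```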
Fill-In-The-Middle For AI Coding
The Granite 3.3 Instruct models' new fill-in-the-middle (FIM) capabilities are particularly helpful for tasks ranging from "code repair and error correction to refactoring, quickly generating boilerplate code and enabling the insertion of function arguments or docstrings", according to IBM.
Most text generation LLMs are designed to move forward, from left to right, predicting the next token in a sequence based on information from the preceding tokens. While this has proved a powerful way to deliver on a variety of generative tasks, it falls short on tasks that require predicting the correct tokens based on those that come both before and after, i.e. those that need to "fill in the middle".
With Granite 3.3, IBM has adapted the LLM by redesigning its training to accept structured inputs with "<fim_prefix>", "<fim_suffix>" and "<fim_middle>" tokens, "tricking" it into predicting tokens in the middle "using its intrinsic left-to-right prediction ability".
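Here is a sketch of what such a structured input might look like, based on the token names above; the exact concatenation order Granite expects is an assumption worth verifying against IBM's documentation.

```python
# Build a fill-in-the-middle prompt: the model sees the code before and after
# the gap, then generates the missing middle left to right.
prefix = "def fahrenheit_to_celsius(f):\n"
suffix = "\n    return c\n"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
# Expected completion, e.g.: "    c = (f - 32) * 5 / 9"
```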
New RAG-Specific LoRAs
To enhance existing applications built on Granite, IBM is also releasing a suite of retrieval augmented generation (RAG)-focused LoRA adapters for Granite 3.2 models; versions for the Granite 3.3 Instruct models and future Granite models will follow.
Low-rank adaptation (LoRA) adapters are a faster, cheaper and more efficient way of fine-tuning LLMs. Essentially, they allow you to fine-tune only a small subset of the base model's weights (~0.1%) rather than the whole model. You can think of an adapter as a plug-in module that provides the required expertise and specialized capabilities to your model.
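A minimal sketch of this idea with the Hugging Face peft library, assuming illustrative rank and target-module choices rather than IBM's actual configuration:

```python
# LoRA fine-tuning sketch: wrap the base model so that only small low-rank
# adapter matrices are trainable. Rank and target modules are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.3-8b-instruct")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```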
The five new LoRAs provide support for the following (a sketch of plugging one in follows the list):
- RAG Hallucination Detection (which provides a "faithfulness score" to measure alignment between the responses and source documents)
- RAG Query Rewrite (which rewrites queries that require context from earlier in an AI interaction so that they produce more accurate responses)
- RAG Citation Generation (which provides a citation for each sentence of the model's response to a specific query, as long as that response was informed by external sources)
- RAG Answerability Prediction (which confirms if a query is "answerable" or "unanswerable" using the relevant documents available after the retrieval stage of RAG)
- Uncertainty Prediction (which provides a score that measures the extent to which the model's responses are supported by information contained within its training data)
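Plugging one of these adapters into the base model could look something like the following with peft; the adapter repo ID below is hypothetical, so check the ibm-granite organization on Hugging Face for the published names.

```python
# Attach a pretrained RAG adapter to the Granite 3.2 base Instruct model.
# The adapter id below is hypothetical, for illustration only.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.2-8b-instruct")
rag_model = PeftModel.from_pretrained(
    base, "ibm-granite/rag-hallucination-detection-lora"  # hypothetical id
)
```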
Alongside these conventional adapters, IBM Research has also developed a series of activated LoRAs (or aLoRAs). These are a new, experimental kind of low-rank adapter that cuts the LLM's inference costs and memory requirements, as well as making it easier and quicker to switch between adapters. For example, IBM researchers estimate that an activated LoRA can accomplish individual tasks 20 to 30 times faster than a traditional LoRA.
IBM Granite: Trusted, Transparent and Built for Enterprise
Looking at the recent evolution of Granite, it feels like IBM's strategy is to focus on consolidating multiple use cases - speech, reasoning, coding and so on - out of the box into a unified model, rather than building narrower, specialized models. This is likely to suit organizations that are looking for broad functionality from a single model family.
The other huge advantage for enterprise users is IBM's commitment to being fully transparent about the data that it uses to train its models, so there are no hidden risks. The training data goes through a rigorous end-to-end process that includes filtering out copyrighted material, poor-quality data, data with privacy protections and more. It also has to pass through Granite Guardian HAP-38M, IBM's open-source toxicity filter that detects hate, abuse and profanity with 92% accuracy. For additional reassurance, IBM also indemnifies clients against third-party IP claims on IBM-developed foundation models such as Granite.
With Granite 3.3, IBM underlines its commitment to developing trusted, continuously advancing, open-source AI for the enterprise.
This blog was originally published on the IBM Community.