Choose your words wisely: The essential role of tokenization in clinical AI

Written by Clinithink | Jan 16, 2025 4:25:35 PM

In our daily lives, finding the clearest way to communicate generally comes down to choosing the right words. When we speak or write, we don’t just use words randomly; we select them, often from a wide range of synonyms, and then put them together into phrases and sentences we think will be understood. In this way, every utterance is the product of a complex system we usually take for granted. If we choose our words without forethought, our message loses clarity and effect.

To explore this idea further, consider words to be like a child’s barrel of plastic block toys. When the child assembles these building blocks, the possibilities are endless for what they might create. The blocks can vary in terms of color, length, width, and so on, but each one shares the possibility of being connected to others. The child can make the blocks stack up or extend outward until they form a simple or complex construct like a house, a rocket, or a car.

Similarly, we can assemble language into constructs. In the case of language, the blocks are the words, which we connect to become phrases, sentences, paragraphs, and documents, all helping the author to create a simple or complex concept, whether it is the text on a greeting card, a page in an instruction manual—or a medical records document.

In this way, language can capture ideas and meaning precisely enough to be interpreted and understood by a reader or listener, whether human or AI. Humans learn to read and understand language as a part of our thinking. An AI system, by contrast, processes text through a series of steps that first break documents down into progressively smaller parts: paragraphs, sentences, phrases, words, and parts of words. The units that the model takes as input are referred to as tokens, and the process of splitting text into them is called tokenization.

How AI systems tokenize their input into words

The concept of tokenization has been around for a long time in information processing and has been used in many ways by many people, from cybersecurity managers (for whom a token is a stand-in for a sensitive value) to programmers who compile software code, perform spelling checks on documents, and compress file contents into portable formats. Most of these activities use tokenization to parse input data into individual units such as words, enabling the people or programs performing the work to accomplish their purpose.

In the case of an AI system, and more specifically a large language model (LLM), a token is a unit of the input that the model consumes and processes. The full set of tokens the model recognizes is its vocabulary, and the system preprocesses any input it is given by mapping that input onto the vocabulary. The input can be text, images, or audio files, but for our purposes, we’ll focus on text. How are text tokens created? An AI system can take many approaches. It can identify tokens by typographic boundaries such as blank spaces, punctuation, and carriage returns, or it can identify them against a reference such as a set of rules, a dictionary, or a more detailed ontology of terms.
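As a rough illustration of these two approaches, the Python sketch below first splits text on typographic boundaries and then merges adjacent words that appear in a small term list. The function names and the two-entry `LEXICON` are made up for this example; a real system would draw on a far larger dictionary or ontology.

```python
import re

def split_on_boundaries(text: str) -> list[str]:
    """Split on whitespace and emit each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

def match_against_lexicon(words: list[str], lexicon: set[str], max_len: int = 3) -> list[str]:
    """Greedy longest match: merge adjacent words that form a known lexicon term."""
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then fall back to shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n]).lower()
            if n == 1:
                tokens.append(words[i])      # no multi-word term starts here
                i += 1
                break
            if candidate in lexicon:
                tokens.append(candidate)     # keep the multi-word term as one token
                i += n
                break
    return tokens

# Hypothetical lexicon, for illustration only.
LEXICON = {"chest pain", "shortness of breath"}

words = split_on_boundaries("Patient reports chest pain, no shortness of breath.")
tokens = match_against_lexicon(words, LEXICON)
print(words)
print(tokens)
```

The boundary split alone yields ten tokens for this sentence; the lexicon pass reduces them to seven by keeping "chest pain" and "shortness of breath" intact as single units.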

In general, we can consider a token to approximately match a single, whole word (in our case, an English-language word), so that there is usually one token per word. Some methods, however, such as byte pair encoding, work at the sub-word level. Starting from individual characters, byte pair encoding churns through its training data and repeatedly merges the most frequently occurring adjacent pair of symbols, stopping when the vocabulary reaches a set size limit. The resulting tokens become the items we look for when we parse any large, stratified text volume (known as a corpus in language studies). Byte pair encoding is especially useful for breaking down unusual words, and a vocabulary of around 50,000 tokens gives an AI system a good balance between wide recognition of words and storage.
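To make byte pair encoding more concrete, here is a minimal Python sketch of the merge-learning loop, assuming a toy corpus of four clinical terms and only two merges. The helper names (`learn_bpe`, `merge_pair`, `encode`) are illustrative and do not come from any particular library; production tokenizers add details such as byte-level handling and word-boundary markers.

```python
from collections import Counter

def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges

def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy corpus, for illustration only.
corpus = ["hepatitis", "gastritis", "arthritis", "dermatitis"]
merges = learn_bpe(corpus, num_merges=2)
print(merges)
print(encode("colitis", merges))
```

On this corpus the two learned merges build the sub-word "tis", so a word that never appeared in training, such as "colitis", still ends in a token the model has seen before.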

Optimizing tokenization for clinical natural language processing

At Clinithink, our CLiX platform uses a combination of tokenization approaches, after which it infuses the tokenized input with everything it has learned from real-world volumes of clinical narrative—including, significantly, the specialized and nuanced contextual text hidden in doctors’ notes and other unstructured data.

CLiX starts by receiving UTF-8 encoded text, expressed as sentences and phrases and contained in one or more unstructured text files, from an upstream data source such as an electronic medical record system or data warehouse. CLiX then processes these text files, matching each tokenized word against a known set of tokens derived from ontologies of clinical text. This comprehensive method of tokenization allows the AI to find acronyms that need to be expanded (for example, MRI expands to Magnetic Resonance Imaging), contractions and abbreviations that must be expanded to reveal their true meaning (for example, “didn’t” expands to “did not”, a critical capability for preventing false positives based on negation), and phrases that are irregular or inconsistent across thousands of human authors and many thousands of free-text records.
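The expansion step can be pictured with a short Python sketch. This is not the CLiX implementation: the two lookup tables contain only the examples mentioned above, and the `normalize` function is a hypothetical name chosen for this illustration.

```python
import re

# Hypothetical lookup tables holding only the examples from the text.
ACRONYMS = {"MRI": "Magnetic Resonance Imaging"}
CONTRACTIONS = {"didn't": "did not"}

# Words (optionally with an internal apostrophe), numbers, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?|\d+|[^\w\s]")

def normalize(text: str) -> list[str]:
    """Tokenize text, expanding known acronyms and contractions."""
    expanded = []
    for token in TOKEN_PATTERN.findall(text):
        key = token.replace("’", "'")          # treat curly and straight apostrophes alike
        if key in ACRONYMS:                    # acronyms are matched case-sensitively
            expanded.extend(ACRONYMS[key].split())
        elif key.lower() in CONTRACTIONS:
            expanded.extend(CONTRACTIONS[key.lower()].split())
        else:
            expanded.append(token)
    return expanded

tokens = normalize("Patient didn’t tolerate the MRI.")
print(tokens)
print("negation cue present:", "not" in tokens)
```

Once “didn’t” has become “did” and “not”, a downstream rule that looks for the cue word “not” can recognize the negation, which is much harder when the cue is buried inside a contraction.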

All of this in concert achieves the goal of producing a standardized interpretation of the text, imparting a level of consistency throughout the AI processing, because it proceeds from language and grammar rules that are applied against tokens. This consistency, and the ability to achieve it reliably at scale, gives clinical natural language processing (CNLP) the boost it needs to help data scientists arrive at clinical insights faster, and with more clarity.

Tokenization also allows our AI to manage the text so that it can perform more advanced functions such as:

  • Spelling checks using computational linguistic techniques such as edit distance and “sounds like” algorithms.
  • Word root understanding through lemmatization, the grouping together of inflected word forms so they can be analyzed as a single entity.
  • Ranking of words by relevance, from low to high.
  • Understanding of synonymous meanings to enable proper interpretation.
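To illustrate the first item in the list, the Python sketch below computes the classic Levenshtein edit distance and uses it to suggest corrections from a small vocabulary. The vocabulary, the `suggest` helper, and the distance cutoff of 2 are assumptions made for this example, not details of any specific product.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))         # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,               # delete char_a
                current[j - 1] + 1,            # insert char_b
                previous[j - 1] + cost,        # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def suggest(word: str, vocabulary: list[str], max_distance: int = 2) -> list[str]:
    """Return vocabulary terms within max_distance edits, closest first."""
    scored = sorted((edit_distance(word.lower(), term), term) for term in vocabulary)
    return [term for distance, term in scored if distance <= max_distance]

# Hypothetical vocabulary, for illustration only.
VOCABULARY = ["diabetes", "dialysis", "diuretic"]
print(edit_distance("diabetis", "diabetes"))
print(suggest("diabetis", VOCABULARY))
```

A misspelling such as "diabetis" is one substitution away from "diabetes" but three edits away from the other two terms, so only the intended word passes the cutoff.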

Our AI model can also use tokenization to tag parts of speech (nouns, verbs, adjectives, adverbs, etc.) and use that grammatical information to derive semantic meaning. Combined with document section headings, this lets it create discrete segments of text, such as sentences and paragraphs, that carry their own context. This capability further enhances the AI system’s ability to derive meaning, temporality, family relationships, and other important context that improves the accuracy of its output.

Similar to an LLM, the CLiX engine understands the importance of attention mechanisms, which we’ll discuss in a future blog post. Using tokenization and attention, CLiX finds a working set window that is dynamically sized to ensure all the information relevant to the current context has been considered, down to the optimal level of granularity. The matching performed between the text and clinical concepts uses familiar techniques such as cosine similarity scoring, and CLiX applies a configurable threshold to those scores so that results can be made more precise in specific contexts.
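For readers unfamiliar with cosine similarity, the Python sketch below scores a piece of text against a few candidate concepts using simple word-count vectors and keeps only the matches above a threshold. The concept table, the function names, and the bag-of-words representation are all assumptions made for illustration; the post does not describe how CLiX represents text or concepts internally.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                             # an empty vector matches nothing
    return dot / (norm_a * norm_b)

def match_concepts(tokens: list[str], concepts: dict[str, list[str]],
                   threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return (concept, score) pairs scoring at or above threshold, best first."""
    query = Counter(tokens)
    scored = [(name, cosine_similarity(query, Counter(terms)))
              for name, terms in concepts.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical concept table, for illustration only.
CONCEPTS = {
    "chest pain": ["chest", "pain"],
    "back pain": ["back", "pain"],
}

tokens = ["acute", "chest", "pain"]
print(match_concepts(tokens, CONCEPTS, threshold=0.5))
print(match_concepts(tokens, CONCEPTS, threshold=0.4))
```

With the threshold at 0.5 only "chest pain" is returned, while lowering it to 0.4 also admits the weaker "back pain" match. Raising the threshold trades recall for precision, which is the kind of tuning a configurable threshold makes possible.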

This powerful process is at the heart of our primary focus: Enabling better, faster AI insights from clinical data. At Clinithink, we are neither generative nor general AI; we are, very specifically, clinical AI. By focusing on this goal and committing our technology to the best methods, we seek to help life sciences companies and hospital systems achieve more from their data and use AI insights to drive innovation, ingenuity, better outcomes, and business growth.

Discover how the unique capabilities and innovations of our technology can unlock valuable insights and improve healthcare outcomes and efficiency for your organization. We invite you to read our latest white paper, “Not All Healthcare AI is Created Equal”.