Computer systems have always relied on encoding information to function. From Morse code to binary encoding to the analog-to-digital encoding that enables telephone communications, encoding is embedded in our everyday technologies.
The AI revolution brought new dimensions to encoding. In 2018, Google introduced BERT, transforming how machines understand language. Today, three primary types of Language Model are in widespread use.
All these models encode their inputs into tokens using various approaches.
However, at Clinithink, we've developed a fundamentally different approach.
While conventional natural language processing (NLP) systems and LLMs tokenize and vectorize language, we use a clinical terminology, specifically SNOMED CT, to structure and populate our token library. This encoding process is what differentiates our approach.
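To make the idea concrete, here is a toy sketch (not Clinithink's implementation) of what concept-based encoding looks like: a token library maps clinical phrases to SNOMED CT concept identifiers, and encoding is a greedy longest-match lookup over the input text. The miniature library and the matching strategy are illustrative assumptions; the concept IDs shown are the standard SNOMED CT codes for these concepts.

```python
# Hypothetical miniature token library mapping phrases to SNOMED CT concept
# IDs. A production library would hold hundreds of thousands of entries.
TOKEN_LIBRARY = {
    "viral pneumonia": "75570004",
    "pneumonia": "233604007",
    "infectious disease": "40733004",
}

def encode(text: str) -> list[tuple[str, str]]:
    """Greedy longest-match encoding of text into (phrase, concept_id) pairs."""
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, so that "viral pneumonia"
        # wins over the shorter match "pneumonia".
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in TOKEN_LIBRARY:
                out.append((phrase, TOKEN_LIBRARY[phrase]))
                i = j
                break
        else:
            i += 1  # no concept covers this word; skip it
    return out

print(encode("patient presents with viral pneumonia"))
# [('viral pneumonia', '75570004')]
```

The output is a sequence of concept identifiers rather than subword tokens, which is what allows downstream processing to reason over clinical meaning instead of surface text.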
This approach achieves remarkable efficiency: our benchmark is one million documents per hour on a 16-core server running 32 parallel tasks (threads).
The key to our speed lies in SNOMED CT's acyclic taxonomic structure, which allows us to quickly find and categorize information in a hierarchical manner. For example, viral pneumonia is classified beneath a chain of progressively more general concepts, such as pneumonia and, above it, disease of the lung.
SNOMED CT is a poly-hierarchy (sometimes called multi-nodal), meaning concepts can have multiple parents. For instance, viral pneumonia is both a pneumonia and an infectious disease.
“Acyclic” means the structure contains no loops: following “is a” relationships always leads upward to more general concepts and never circles back. A pictorial rendition showing the “poly” nature of the structure is shown below (viral pneumonia “is a” infectious disease).
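The structure described above can be sketched as a directed acyclic graph of “is a” edges. The concept names below follow the text's example (viral pneumonia has two parents); the intermediate concepts are simplified for illustration, and this is a sketch of the idea rather than Clinithink's actual implementation:

```python
# "is a" poly-hierarchy: each concept maps to its list of parent concepts.
# Viral pneumonia has two parents, reflecting SNOMED CT's poly-hierarchy.
IS_A = {
    "viral pneumonia": ["pneumonia", "infectious disease"],
    "pneumonia": ["lung disease"],
    "lung disease": ["disorder"],
    "infectious disease": ["disorder"],
    "disorder": [],
}

def ancestors(concept: str) -> set[str]:
    """All concepts reachable via 'is a' edges.

    Because the graph is acyclic, this walk is guaranteed to terminate.
    """
    seen: set[str] = set()
    stack = [concept]
    while stack:
        for parent in IS_A[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Viral pneumonia is subsumed by both branches of the hierarchy:
print(sorted(ancestors("viral pneumonia")))
# ['disorder', 'infectious disease', 'lung disease', 'pneumonia']
```

A query for “all infectious diseases” or “all lung diseases” would therefore both retrieve documents encoded with the viral pneumonia concept, which is exactly what the poly-hierarchy buys you.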