Computer systems have always relied on encoding information to function. From Morse code to binary encoding to the analog-to-digital encoding that enables telephone communications, encoding is embedded in our everyday technologies.
The AI revolution brought new dimensions to encoding. In 2018, Google introduced BERT, transforming how machines understand language. Today, three primary types of Language Model are in widespread use.
All these models encode their inputs into tokens using various approaches.
However, at Clinithink, we've developed a fundamentally different approach.
While conventional natural language processing (NLP) systems and LLMs tokenize and vectorize language, we use a clinical terminology, specifically SNOMED CT, to structure and populate our token library. This encoding process is what differentiates our approach.
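To make the idea concrete, here is a toy sketch (not Clinithink's implementation) of what concept-based encoding looks like: a token library maps clinical phrases to SNOMED CT concept identifiers, and encoding is a greedy longest-match lookup over the input text. The miniature library and the matching strategy are illustrative assumptions; the concept IDs shown are the standard SNOMED CT codes for these concepts.

```python
# Hypothetical miniature token library mapping phrases to SNOMED CT concept
# IDs. A production library would hold hundreds of thousands of entries.
TOKEN_LIBRARY = {
    "viral pneumonia": "75570004",
    "pneumonia": "233604007",
    "infectious disease": "40733004",
}

def encode(text: str) -> list[tuple[str, str]]:
    """Greedy longest-match encoding of text into (phrase, concept_id) pairs."""
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, so that "viral pneumonia"
        # wins over the shorter match "pneumonia".
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in TOKEN_LIBRARY:
                out.append((phrase, TOKEN_LIBRARY[phrase]))
                i = j
                break
        else:
            i += 1  # no concept covers this word; skip it
    return out

print(encode("patient presents with viral pneumonia"))
# [('viral pneumonia', '75570004')]
```

The output is a sequence of concept identifiers rather than subword tokens, which is what allows downstream processing to reason over clinical meaning instead of surface text.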
This approach achieves remarkable efficiency: our benchmark is one million documents per hour on a 16-core server running 32 parallel tasks (threads).
The key to our speed lies in SNOMED CT's acyclic taxonomic structure, which allows us to quickly find and categorize information in a hierarchical manner. For example, viral pneumonia is classified beneath a chain of progressively more general concepts, such as pneumonia and, above it, disease of the lung.
SNOMED CT is a poly-hierarchy (sometimes called multi-nodal), meaning concepts can have multiple parents. For instance, viral pneumonia is both a pneumonia and an infectious disease.
“Acyclic” means the structure contains no loops: following “is a” relationships always leads upward to more general concepts and never circles back. A pictorial rendition showing the “poly” nature of the structure is shown below (viral pneumonia “is a” infectious disease).
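The structure described above can be sketched as a directed acyclic graph of “is a” edges. The concept names below follow the text's example (viral pneumonia has two parents); the intermediate concepts are simplified for illustration, and this is a sketch of the idea rather than Clinithink's actual implementation:

```python
# "is a" poly-hierarchy: each concept maps to its list of parent concepts.
# Viral pneumonia has two parents, reflecting SNOMED CT's poly-hierarchy.
IS_A = {
    "viral pneumonia": ["pneumonia", "infectious disease"],
    "pneumonia": ["lung disease"],
    "lung disease": ["disorder"],
    "infectious disease": ["disorder"],
    "disorder": [],
}

def ancestors(concept: str) -> set[str]:
    """All concepts reachable via 'is a' edges.

    Because the graph is acyclic, this walk is guaranteed to terminate.
    """
    seen: set[str] = set()
    stack = [concept]
    while stack:
        for parent in IS_A[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Viral pneumonia is subsumed by both branches of the hierarchy:
print(sorted(ancestors("viral pneumonia")))
# ['disorder', 'infectious disease', 'lung disease', 'pneumonia']
```

A query for “all infectious diseases” or “all lung diseases” would therefore both retrieve documents encoded with the viral pneumonia concept, which is exactly what the poly-hierarchy buys you.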