BluBracket Uncovers Trojan Source Unicode (Bidirectional Algorithm) Vulnerabilities

In this era of fast code deployment and non-stop design-to-deploy cycles, systemic code vulnerabilities can be devastating because of the speed at which code is shared via Git repositories.

The shift-left movement has made developers aware of cybersecurity hygiene and best practices, and it has sought to give them more responsibility and control for ensuring their code is secure. Several AppSec tools fall into categories such as SAST and DAST; however, they do not address all of the challenges faced in specific steps of the CI/CD journey. Despite the plethora of discrete security tools available, developers can inadvertently become super spreaders of vulnerabilities that are too well hidden to be observed.

Trojan Source: CVE-2021-42574

Such is the case with a recently identified vulnerability discovered at the University of Cambridge, which the research team has named “Trojan Source”. Trojan Source was notable enough to quickly make its way into the NVD (National Vulnerability Database) at NIST, where it was assigned the identifier CVE-2021-42574.

Here is the current detail in the listing: “An issue was discovered in the Bidirectional Algorithm in the Unicode Specification through 14.0. It permits the visual reordering of characters via control sequences, which can be used to craft source code that renders different logic than the logical ordering of tokens ingested by compilers and interpreters. Adversaries can leverage this to encode source code for compilers accepting Unicode such that targeted vulnerabilities are introduced invisibly to human reviewers.”

The Unicode Standard

It takes a little peeling back of the layers to understand this bug, which affects most software development environments and development tools. Unicode, formally the Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world’s writing systems; it is what enables computers and programs to process, render, and display information across almost all known languages. The standard, which is maintained by the Unicode Consortium, defines 144,697 characters covering 159 modern and historic scripts, as well as symbols, emoji, and non-visual control and formatting codes.

Explaining the Vulnerability

Unicode defines a bidirectional algorithm (the Bidi Algorithm) that automates the display ordering of characters for languages that read left to right (such as most Latin-script languages) as well as languages that read right to left (such as Arabic).

To accommodate the many exceptions that arise when scripts are mixed, Unicode implemented an override feature that allows these exceptions to be codified for mixed scripts and directional conflicts in processing. This capability in the algorithm, referred to as Bidi Override, allows control characters to swap the display ordering of groups of characters so that directional conflicts can be resolved programmatically.
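To make the control characters concrete, here is a minimal Python sketch that lists the bidirectional formatting characters usually discussed alongside Trojan Source, using only the standard `unicodedata` module. The constant `BIDI_CONTROLS` and the helper `describe_bidi_controls` are names chosen for this illustration; they do not come from the Unicode Standard or from any BluBracket product.

```python
import unicodedata

# Explicit bidirectional formatting characters (embeddings, overrides,
# isolates and their terminators). Written as escapes so that this file
# itself contains no invisible characters.
BIDI_CONTROLS = "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"


def describe_bidi_controls():
    """Return (code point, bidi class, official name) for each control."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.name(ch))
        for ch in BIDI_CONTROLS
    ]


for code_point, bidi_class, name in describe_bidi_controls():
    print(code_point, bidi_class, name)
```

None of these characters has a visible glyph, which is why a line containing them can look unremarkable in an editor or diff view.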

Ross Anderson, Professor at Cambridge University and co-author of the paper that first described the vulnerability, explains it as follows: “by placing Bidi override characters exclusively within comments and strings, we can smuggle them into source code in a manner that most compilers will accept. Our key insight is that we can reorder source code characters in such a way that the resulting display order also represents syntactically valid source code.”

Anderson goes on to add, “Bringing all this together, we arrive at a novel supply-chain attack on source code. By injecting Unicode Bidi override characters into comments and strings, an adversary can produce syntactically-valid source code in most modern languages for which the display order of characters presents logic that diverges from the real logic. In effect, we anagram program A into program B.”

In most cases a human code reviewer would not look closely at the commented sections of the code and would completely miss that changes to the source were being introduced in this manner. These artifacts also spread easily, because copying and pasting code carries the invisible characters along with it and can effectively disseminate the malicious logic.
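One way a reviewer can see the logical order that a compiler or interpreter actually receives is to replace each invisible control with a visible tag. The sketch below does that in Python; the `reveal` helper and the sample line are hypothetical and written only for illustration, not taken from a real incident.

```python
import unicodedata

# Bidirectional formatting characters, written as escapes on purpose.
BIDI_CONTROLS = frozenset("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")


def reveal(line: str) -> str:
    """Replace each bidi control with a visible tag such as <RLO>."""
    return "".join(
        f"<{unicodedata.bidirectional(ch)}>" if ch in BIDI_CONTROLS else ch
        for ch in line
    )


# Hypothetical line: in many editors the trailing text can render as if it
# were a comment outside the string, yet it is all inside the string literal.
suspicious = 'if access_level != "user\u202E \u2066// Check if admin\u2069 \u2066":'
print(reveal(suspicious))
```

With the tags made visible, it becomes clear that the text resembling a comment is part of the string being compared.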

In an article on Trojan Source published by KrebsOnSecurity on November 1st, Matthew Green, an associate professor at the Johns Hopkins Information Security Institute, is quoted as saying, “There are no defenses to it, and now that people know about it they might start exploiting it.” He added, “Hopefully compiler and code editor developers will patch this quickly! But since some people don’t update their development tools regularly there will be some risk for a while at least.”

How BluBracket Identifies and Mitigates the Trojan Source/Bidi Algorithm Threat

BluBracket can check for hard-coded secrets, patterns, and the presence of certain content strings in commit histories across a whole range of repositories, both internal and external. Additionally, developers and security engineers can discover who has access to code and take steps to limit the number of owners and collaborators in order to enhance security. Once a risk is remediated, BluBracket can verify that it is no longer present and automatically improve the risk score of the impacted repos.

Using its secrets detection, BluBracket can identify when Bidi override control sequences are being used and raise an alert.
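The general idea behind this kind of detection can be sketched in a few lines of Python. This is an illustrative scanner only, not BluBracket's implementation; the function name `find_bidi_controls` and the sample text are assumptions made for the example.

```python
import unicodedata

# Bidirectional formatting characters, written as escapes on purpose.
BIDI_CONTROLS = frozenset("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")


def find_bidi_controls(text: str):
    """Return (line number, column, character name) for every bidi control."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col, unicodedata.name(ch)))
    return findings


sample = 'a = 1\nb = "x\u202Ey"  # note\n'
for finding in find_bidi_controls(sample):
    print(finding)
```

Reporting the line and column lets a reviewer jump straight to the affected text, which matters because the character itself cannot be seen.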

In addition to detecting the presence or use of Unicode bidirectional strings in code, BluBracket also implements Unicode discovery as part of the standard pull request flow. This can also be combined with configuration checks related to IaC (infrastructure as code) to detect potential infrastructure vulnerabilities directly within the development workflow. This eliminates having to handle security incidents post-deployment.

When BluBracket identifies bidirectional characters it generates an alert for developers and highlights the segment in code as a risk. Developers can choose to bypass the alert and continue with the use of Unicode characters whether in code or in commented sections.

In a post-merge validation scenario, BluBracket can create the alert and send immediate notifications through enterprise tools such as PagerDuty and ServiceNow. Alternatively, preventative controls can be configured to block commit or merge requests from proceeding.
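As a rough illustration of what a preventative control does, the Python sketch below returns a non-zero status when any of the given files contains a bidirectional control character. The function `check_files` is a hypothetical helper, and wiring it into a commit hook or CI job is an assumption about how a team might use it; it does not describe how BluBracket's controls are built.

```python
import sys
from pathlib import Path

# Bidirectional formatting characters, written as escapes on purpose.
BIDI_CONTROLS = frozenset("\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069")


def check_files(paths):
    """Return 1 if any file contains a bidi control character, else 0."""
    status = 0
    for path in paths:
        # Undecodable bytes are replaced so that one bad file cannot abort the scan.
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            if any(ch in BIDI_CONTROLS for ch in line):
                print(f"{path}:{line_no}: bidirectional control character found",
                      file=sys.stderr)
                status = 1
    return status
```

A hook script could pass the changed file paths to `check_files` and exit with the returned status, so that a non-zero result stops the commit or merge.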

For more information on how the BluBracket Code Security Solution can enhance your AppSec Program and integrate security directly within your developers’ workflow, or to arrange a free trial of the solution please visit

About BluBracket

BluBracket is the most complete code security solution that enables developers to effectively identify and remediate risks within their software development environment across code repositories, infrastructure as code and cloud environments. With BluBracket, organizations can shift left by enabling developers to address security at the very start of the development lifecycle.
