Preventing Unicode Spoofing in Rust: Safely Handling Confusable Identifiers and Text

By Alessandro D.P. · 3 min read · June 23, 2026

Why Unicode spoofing matters

Unicode spoofing is the practice of crafting text that appears trustworthy but is actually deceptive. Common examples include:

Homoglyph attacks: replacing Latin characters with visually similar characters from other scripts, such as Latin a (U+0061) with Cyrillic а (U+0430)
Mixed-script identifiers: combining scripts in a way that looks legitimate but is hard to detect
Normalization tricks: using different Unicode representations that compare unequal unless normalized
Invisible characters: inserting zero-width or format characters into usernames, tokens, or file names

In Rust applications, this becomes relevant when you accept:

usernames and display names
email local parts
organization names
tags, labels, and project names
security-sensitive identifiers used in access control or routing

If your application treats text as a security boundary, you need more than == and trim().
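To make that concrete, here is a minimal sketch using only the standard library. The helper `naive_equal` is a hypothetical name introduced for illustration; it stands in for the "trim and compare" check many applications start with.

```rust
/// A deliberately naive comparison: trim whitespace, then compare bytes.
fn naive_equal(a: &str, b: &str) -> bool {
    a.trim() == b.trim()
}

fn main() {
    // Ordinary whitespace is handled as expected.
    assert!(naive_equal(" admin ", "admin"));

    // Cyrillic а (U+0430) in place of Latin a: looks alike, compares unequal.
    assert!(!naive_equal("admin", "\u{0430}dmin"));

    // ZERO WIDTH SPACE (U+200B) is a format character, not whitespace,
    // so trim() leaves it in place and the strings still differ.
    assert!(!naive_equal("admin", "admin\u{200B}"));
    assert_eq!("admin\u{200B}".trim().len(), "admin".len() + 3);
}
```

Both look-alike strings would be accepted as new, distinct usernames by a uniqueness check built on this comparison, even though users cannot tell them apart from the original.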

The security goal: compare what users see, not just code points

A secure Unicode handling strategy usually has three layers:

Normalize text into a canonical form
Reject or restrict dangerous characters such as control and format characters
Detect confusable or mixed-script input where appropriate

The right policy depends on the field. A display name can be permissive. A username or tenant slug should be much stricter.

Common policy choices

Input type	Recommended policy	Reason
Display name	Normalize and filter control characters	Preserve international text while avoiding invisible abuse
Username	Restrict to a safe profile or script set	Prevent impersonation and routing ambiguity
Email address	Normalize domain, validate local part carefully	Domain spoofing is especially risky
Internal identifiers	Prefer ASCII unless there is a strong requirement	Simplifies comparisons and logging

Start with normalization

Unicode has multiple ways to represent the same visible text. For example, é can be a single code point (U+00E9) or an e followed by a combining acute accent (U+0301). If you compare raw strings, they may not match even though they look identical.

In Rust, the unicode-normalization crate provides standard normalization forms such as NFC and NFKC.

use unicode_normalization::UnicodeNormalization;

fn normalize_for_comparison(input: &str) -> String {
    input.nfc().collect()
}

fn main() {
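    // Escapes keep both literals unambiguous even if an editor or tool
    // silently normalizes the source file.
    let a = "e\u{301}"; // e + combining acute accent
    let b = "\u{e9}"; // precomposed é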

    assert_ne!(a, b);
    assert_eq!(normalize_for_comparison(a), normalize_for_comparison(b));
}

Which normalization form should you use?

NFC: good default for preserving appearance while canonicalizing equivalent text
NFKC: stronger compatibility normalization; folds compatibility variants such as fullwidth letters and ligatures into their plain forms, so some visually distinct characters become equal
NFD/NFKD: useful for analysis, not usually ideal for storage or comparison

For security-sensitive identifiers, NFKC is often useful because it reduces compatibility variants, but it also discards distinctions that can carry meaning (for example, superscript ² becomes a plain 2). Use it only when your product requirements allow it.

A practical rule:

Use NFC for general text storage and comparison
Use NFKC for restricted identifiers when you want to reduce spoofing surface
Document the choice clearly so behavior is predictable
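As a rough illustration of what "reducing spoofing surface" means, the sketch below hand-folds one small class of compatibility variants: fullwidth ASCII. The function `fold_fullwidth` is a hypothetical helper written for this example and is not part of any crate; in real code the NFKC form from unicode-normalization covers this mapping along with many others.

```rust
/// Folds fullwidth ASCII variants (U+FF01..=U+FF5E) to their ASCII
/// counterparts. Illustration only: this is one small class of the
/// compatibility mappings that NFKC applies, not a replacement for it.
fn fold_fullwidth(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // The fullwidth block mirrors ASCII at a fixed offset of 0xFEE0.
            '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
            _ => c,
        })
        .collect()
}

fn main() {
    // "admin" written with fullwidth letters.
    let fullwidth = "\u{FF41}\u{FF44}\u{FF4D}\u{FF49}\u{FF4E}";

    // Without compatibility folding the two names are distinct strings.
    assert_ne!(fullwidth, "admin");

    // After folding they collide, which is what a uniqueness check needs.
    assert_eq!(fold_fullwidth(fullwidth), "admin");
}
```

NFC alone leaves the fullwidth spelling untouched, so under an NFC-only policy both names could be registered side by side.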

Reject invisible and control characters

Many spoofing attacks rely on characters that do not render clearly, such as:

bidirectional controls
zero-width joiners and non-joiners
other Unicode format characters
ASCII control characters

These characters can make a username look like one thing in the UI and another in the database or logs.

A simple validation pass can reject them:

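use unicode_normalization::UnicodeNormalization;

fn contains_dangerous_characters(input: &str) -> bool {
    input.chars().any(|c| {
        c.is_control()
            || matches!(
                c,
                '\u{200B}' // zero-width space
                    | '\u{200C}' // zero-width non-joiner
                    | '\u{200D}' // zero-width joiner
                    | '\u{200E}' // left-to-right mark
                    | '\u{200F}' // right-to-left mark
                    | '\u{202A}'..='\u{202E}' // bidi embeddings, overrides, and pop
                    | '\u{2066}' // left-to-right isolate
                    | '\u{2067}' // right-to-left isolate
                    | '\u{2068}' // first strong isolate
                    | '\u{2069}' // pop directional isolate
                    | '\u{FEFF}' // zero-width no-break space (byte order mark)
            )
    })
}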

fn validate_username(input: &str) -> Result<(), &'static str> {
    let normalized = input.nfkc().collect::<String>();

    if contains_dangerous_characters(&normalized) {
        return Err("username contains disallowed characters");
    }

    if normalized.is_empty() {
        return Err("username cannot be empty");
    }

    Ok(())
}

This example is intentionally strict. In a real application, you may want to allow some formatting characters in display names but not in identifiers.
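One way to express that difference is to make the character policy depend on the field. The sketch below uses only the standard library; `Field`, `is_joiner`, `is_format_or_control`, and `first_disallowed` are hypothetical names, and the character list is a hand-picked assumption rather than the full Unicode format category. It lets display names keep the zero-width joiner and non-joiner, which emoji sequences and some scripts rely on, while identifiers reject them.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Field {
    Identifier,
    DisplayName,
}

/// Zero-width non-joiner and joiner.
fn is_joiner(c: char) -> bool {
    matches!(c, '\u{200C}' | '\u{200D}')
}

/// Controls plus a hand-picked set of invisible and bidi characters.
fn is_format_or_control(c: char) -> bool {
    c.is_control()
        || matches!(
            c,
            '\u{200B}'..='\u{200F}'       // zero-width space, joiners, LRM, RLM
                | '\u{202A}'..='\u{202E}' // bidi embeddings and overrides
                | '\u{2066}'..='\u{2069}' // bidi isolates
                | '\u{FEFF}'              // zero-width no-break space
        )
}

/// Returns the byte offset and value of the first rejected character.
fn first_disallowed(input: &str, field: Field) -> Option<(usize, char)> {
    input.char_indices().find(|&(_, c)| {
        if field == Field::DisplayName && is_joiner(c) {
            return false; // joiners are tolerated in display names only
        }
        is_format_or_control(c)
    })
}

fn main() {
    // Two emoji joined by ZWJ: acceptable as a display name, not as an identifier.
    let joined = "\u{1F468}\u{200D}\u{1F469}";
    assert_eq!(first_disallowed(joined, Field::DisplayName), None);
    assert_eq!(first_disallowed(joined, Field::Identifier), Some((4, '\u{200D}')));

    // A right-to-left override is rejected everywhere.
    assert_eq!(
        first_disallowed("bob\u{202E}", Field::DisplayName),
        Some((3, '\u{202E}'))
    );
}
```

Returning the offset of the offending character makes rejections easier to log and explain to users than a bare boolean.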

Detect mixed-script identifiers

Mixed-script text is not always malicious, but it is a major warning sign in security-sensitive fields. A username like раураl may look like paypal while using Cyrillic characters.

The unicode-script crate can help identify which scripts appear in a string.

use unicode_script::{Script, UnicodeScript};
use std::collections::HashSet;

fn scripts_used(input: &str) -> HashSet<Script> {
    input
        .chars()
        .filter_map(|c| {
            let script = c.script();
            if script == Script::Common || script == Script::Inherited {
                None
            } else {
                Some(script)
            }
        })
        .collect()
}

fn is_mixed_script(input: &str) -> bool {
    scripts_used(input).len() > 1
}

fn main() {
    assert!(!is_mixed_script("alice"));
    assert!(is_mixed_script("раураl"));
}

When to block mixed scripts

Mixed-script blocking is appropriate for:

usernames
tenant names
invite codes
security labels
internal resource names

It is usually too aggressive for:

free-form comments
display names
multilingual content fields

A better policy is often “allow one script per identifier, plus common punctuation and digits.” That preserves usability while reducing spoofing risk.
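A minimal sketch of such a policy follows. To stay dependency-free it classifies characters by coarse code point ranges, which is an assumption made for this example and only approximates real script data; production code would more likely take the script from a crate such as unicode-script. The names `CoarseScript`, `Class`, `classify`, and `single_script` are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CoarseScript {
    Latin,
    Greek,
    Cyrillic,
}

enum Class {
    Script(CoarseScript),
    Neutral,     // digits and the punctuation this policy allows
    Unsupported, // anything the policy does not recognize
}

/// Approximate classification by block ranges (not exact script data).
fn classify(c: char) -> Class {
    match c {
        '0'..='9' | '_' | '-' | '.' => Class::Neutral,
        'a'..='z' | 'A'..='Z' => Class::Script(CoarseScript::Latin),
        '\u{00C0}'..='\u{00D6}' | '\u{00D8}'..='\u{00F6}' | '\u{00F8}'..='\u{024F}' => {
            Class::Script(CoarseScript::Latin)
        }
        '\u{0370}'..='\u{03FF}' => Class::Script(CoarseScript::Greek),
        '\u{0400}'..='\u{052F}' => Class::Script(CoarseScript::Cyrillic),
        _ => Class::Unsupported,
    }
}

/// Ok(script) if at most one script is used; Err(c) names the first
/// character that is unsupported or that introduces a second script.
fn single_script(input: &str) -> Result<Option<CoarseScript>, char> {
    let mut seen: Option<CoarseScript> = None;
    for c in input.chars() {
        match classify(c) {
            Class::Neutral => {}
            Class::Unsupported => return Err(c),
            Class::Script(s) => match seen {
                None => seen = Some(s),
                Some(prev) if prev == s => {}
                Some(_) => return Err(c),
            },
        }
    }
    Ok(seen)
}

fn main() {
    assert_eq!(single_script("alice-01"), Ok(Some(CoarseScript::Latin)));
    // Five Cyrillic letters followed by a Latin l.
    let spoof = "\u{0440}\u{0430}\u{0443}\u{0440}\u{0430}l";
    assert_eq!(single_script(spoof), Err('l'));
}
```

A fully Cyrillic or fully Greek identifier passes this check, so international users are not locked out; only the mixture is refused.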

Build a safe identifier validator

A practical validator should combine normalization, character filtering, and script policy. The following example accepts a restricted username format:

normalized with NFKC
ASCII letters, digits, underscore, and hyphen only
no control or format characters
length-limited

use unicode_normalization::UnicodeNormalization;

#[derive(Debug, PartialEq)]
enum UsernameError {
    Empty,
    TooLong,
    InvalidCharacter,
    DangerousCharacter,
}

fn validate_username(input: &str) -> Result<String, UsernameError> {
    let normalized: String = input.nfkc().collect();

    if normalized.is_empty() {
        return Err(UsernameError::Empty);
    }

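    // Count characters, not bytes, so that non-ASCII input is reported as
    // InvalidCharacter below rather than as TooLong.
    if normalized.chars().count() > 32 {
        return Err(UsernameError::TooLong);
    }

    // is_control() covers control characters only. Format characters such as
    // U+200B are not controls; the ASCII allow-list below rejects them.
    if normalized.chars().any(|c| c.is_control()) {
        return Err(UsernameError::DangerousCharacter);
    }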

    if !normalized
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
    {
        return Err(UsernameError::InvalidCharacter);
    }

    Ok(normalized)
}

fn main() {
    assert_eq!(validate_username("alice_01"), Ok("alice_01".to_string()));
    assert!(validate_username("аlice").is_err()); // Cyrillic a
    assert!(validate_username("admin\u{200B}").is_err()); // zero-width space
}

This is a strong policy because it intentionally avoids Unicode in usernames. That is often the safest choice for authentication and authorization identifiers.

Why ASCII-only can be the right answer

For security-critical identifiers, ASCII-only has real advantages:

simpler comparisons
fewer normalization edge cases
easier logging and auditing
less risk of visual impersonation
better compatibility across systems

If your product must support internationalized usernames, consider storing:

a canonical identifier for lookup and authorization
a display name for user-facing presentation

That separation avoids mixing security logic with presentation logic.
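The split can be sketched with a small in-memory directory. Everything here is hypothetical and chosen for illustration: the `Account` and `Directory` types, the `canonicalize` helper, and in particular the decision to lowercase the canonical form, which the validator shown earlier does not do.

```rust
use std::collections::HashMap;

struct Account {
    canonical_username: String, // used for lookup and authorization
    display_name: String,       // shown in the UI, never used as a key
}

/// Assumed policy: ASCII letters, digits, '_' and '-', folded to lowercase.
fn canonicalize(username: &str) -> Option<String> {
    let valid = !username.is_empty()
        && username
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    valid.then(|| username.to_ascii_lowercase())
}

#[derive(Default)]
struct Directory {
    by_canonical: HashMap<String, Account>,
}

impl Directory {
    fn register(&mut self, username: &str, display_name: &str) -> Result<(), &'static str> {
        let canonical = canonicalize(username).ok_or("invalid username")?;
        if self.by_canonical.contains_key(&canonical) {
            return Err("username already taken");
        }
        let account = Account {
            canonical_username: canonical.clone(),
            display_name: display_name.to_string(),
        };
        self.by_canonical.insert(canonical, account);
        Ok(())
    }

    fn find(&self, username: &str) -> Option<&Account> {
        self.by_canonical.get(&canonicalize(username)?)
    }
}

fn main() {
    let mut dir = Directory::default();
    dir.register("Alice_01", "Alice \u{2728}").unwrap();

    // A second registration that differs only by case collides.
    assert!(dir.register("alice_01", "Someone else").is_err());

    let found = dir.find("ALICE_01").expect("lookup uses the canonical form");
    println!("{} -> {}", found.canonical_username, found.display_name);
}
```

The display name is free to contain any text that passes the display-name policy, because nothing looks it up or authorizes against it.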

Handle email addresses carefully

Email addresses are a common source of Unicode confusion. The local part may be case-sensitive in theory, but many systems treat it as case-insensitive. Domains may contain Unicode via IDNA, which can introduce punycode and homograph concerns.

A safe approach is:

normalize and validate the domain
convert the domain to ASCII using IDNA rules
keep the local part policy explicit
never use the raw email string as a security key without canonicalization

For example, visually similar domains may resolve differently:

example.com
examp1e.com
ехample.com where the first letter is Cyrillic

If your application uses email for account recovery or login, always compare a canonical form rather than the raw input. Also, display the canonicalized domain carefully in the UI so users can verify it.
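The sketch below shows one possible shape for that canonical form, using only the standard library. It rests on two stated assumptions: the local part is kept exactly as entered (an explicit policy choice, not a recommendation), and non-ASCII domains are rejected outright instead of being converted, because the IDNA ToASCII step needs a dedicated library and is omitted here. `EmailError` and `canonical_email` are hypothetical names, and this is not a full address validator.

```rust
#[derive(Debug, PartialEq)]
enum EmailError {
    MissingAt,
    EmptyLocalPart,
    EmptyDomain,
    NonAsciiDomain,
}

/// Produces a canonical key for an email address:
/// local part unchanged, ASCII domain lowercased.
fn canonical_email(input: &str) -> Result<String, EmailError> {
    // Split at the last '@' so the domain is always the final component.
    let (local, domain) = input.rsplit_once('@').ok_or(EmailError::MissingAt)?;

    if local.is_empty() {
        return Err(EmailError::EmptyLocalPart);
    }
    if domain.is_empty() {
        return Err(EmailError::EmptyDomain);
    }
    // Assumption: a real service would run IDNA conversion here instead.
    if !domain.is_ascii() {
        return Err(EmailError::NonAsciiDomain);
    }

    Ok(format!("{}@{}", local, domain.to_ascii_lowercase()))
}

fn main() {
    assert_eq!(
        canonical_email("Alice@Example.COM"),
        Ok("Alice@example.com".to_string())
    );
    // Domain whose first letter is Cyrillic е (U+0435).
    assert_eq!(
        canonical_email("bob@\u{0435}xample.com"),
        Err(EmailError::NonAsciiDomain)
    );
}
```

Whatever policy you choose, the same function should produce the key at signup, login, and account recovery so the three paths cannot disagree.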

Avoid security decisions based on display text

A common mistake is to use a visible label as an authorization key. For example:

“project name” becomes the database key
“organization display name” controls access
“role label” is used in policy checks

That is dangerous because display text can be changed, spoofed, or normalized into collisions.

Instead:

use opaque internal IDs for authorization
keep display names separate
validate display names for safety, but do not trust them for identity
store both the original and canonical forms if needed

Good design pattern

Purpose	Use	Example
Authorization	Opaque ID	`org_7f3a9c`
Lookup	Canonical identifier	normalized username
Presentation	Display string	user-chosen name
Audit	Canonical + original	both forms recorded safely

This separation prevents a class of bugs where visually similar text accidentally maps to the wrong account or resource.
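A small sketch of the pattern, with hypothetical types (`OrgId`, `AccessControl`) and made-up IDs: two organizations have display names that render almost identically, yet the access check never consults those names.

```rust
use std::collections::{HashMap, HashSet};

/// Opaque identifier; the only value authorization ever compares.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct OrgId(String);

#[derive(Default)]
struct AccessControl {
    display_names: HashMap<OrgId, String>, // presentation only
    members: HashMap<OrgId, HashSet<String>>, // org -> member user IDs
}

impl AccessControl {
    fn add_org(&mut self, id: &str, display_name: &str) -> OrgId {
        let org = OrgId(id.to_string());
        self.display_names.insert(org.clone(), display_name.to_string());
        self.members.entry(org.clone()).or_default();
        org
    }

    fn add_member(&mut self, org: &OrgId, user_id: &str) {
        self.members
            .entry(org.clone())
            .or_default()
            .insert(user_id.to_string());
    }

    /// Decides access from IDs alone; display names play no part.
    fn can_access(&self, user_id: &str, org: &OrgId) -> bool {
        self.members.get(org).map_or(false, |m| m.contains(user_id))
    }
}

fn main() {
    let mut acl = AccessControl::default();
    let real = acl.add_org("org_7f3a9c", "Acme");
    // Look-alike name: Cyrillic А, с, and е in place of Latin A, c, e.
    let fake = acl.add_org("org_19be02", "\u{0410}\u{0441}m\u{0435}");

    acl.add_member(&real, "user_42");

    assert!(acl.can_access("user_42", &real));
    assert!(!acl.can_access("user_42", &fake));
}
```

Because the look-alike organization has its own ID, creating it grants nothing; at worst it is a presentation problem that the display-name policy can address separately.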

Test with adversarial cases

Unicode security bugs are easy to miss in normal testing. Add explicit cases for:

mixed scripts
combining marks
zero-width characters
bidirectional controls
normalization collisions

A good test suite should include examples like:

paypal vs раураl
e\u{301} vs é
admin\u{200B}
strings with leading or trailing format characters

Property-based testing can also help. Generate random Unicode strings and assert that your validator either rejects them or canonicalizes them predictably.
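The idea can be sketched without any test framework. The example below is a simplified stand-in: `validate` is an assumed ASCII-only validator that lowercases its output, `XorShift` is a tiny hand-written generator used only to keep the example dependency-free, and `check_properties` is a hypothetical driver. A dedicated property-testing crate would normally supply the generation and shrinking. The property checked is that any accepted input canonicalizes to ASCII and that validating the canonical form again returns it unchanged.

```rust
/// Minimal xorshift64 generator (seed must be non-zero).
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Assumed validator: 1 to 32 ASCII letters, digits, '_' or '-', lowercased.
fn validate(input: &str) -> Option<String> {
    let count = input.chars().count();
    let ok = (1..=32).contains(&count)
        && input
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    ok.then(|| input.to_ascii_lowercase())
}

/// Random string of up to 11 characters, biased toward printable ASCII
/// so that some inputs are accepted.
fn random_string(rng: &mut XorShift) -> String {
    let len = (rng.next_u64() % 12) as usize;
    (0..len)
        .filter_map(|_| {
            let cp = if rng.next_u64() % 4 == 0 {
                rng.next_u64() % 0x3000 // includes combining marks, bidi controls
            } else {
                0x20 + rng.next_u64() % 0x5F
            };
            char::from_u32(cp as u32)
        })
        .collect()
}

/// True if every generated input is rejected or canonicalized stably.
fn check_properties(seed: u64, cases: usize) -> bool {
    let mut rng = XorShift(seed);
    (0..cases).all(|_| {
        let input = random_string(&mut rng);
        match validate(&input) {
            None => true,
            Some(canonical) => {
                canonical.is_ascii()
                    && validate(&canonical).as_deref() == Some(canonical.as_str())
            }
        }
    })
}

fn main() {
    assert!(check_properties(0x9E37_79B9_7F4A_7C15, 10_000));
}
```

The stability property matters in practice: if re-validating a stored value could change it, lookups against previously stored keys would start to miss.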

Practical testing checklist

verify normalization is stable
verify dangerous characters are rejected
verify mixed-script policy behaves as expected
verify length checks occur after normalization
verify storage and lookup use the same canonical form

Length checks are especially important: normalization can change the number of bytes and sometimes the number of characters. Always validate on the canonical form you actually store.
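A short standard-library sketch shows how far the counts can drift. The strings are written in both forms by hand, since the normalization call itself is outside this example; `lengths` and `within_limit` are hypothetical helpers. The ligature case relies on the fact that NFKC maps ﬁ (U+FB01) to the two letters "fi".

```rust
/// Returns (UTF-8 byte length, number of code points).
fn lengths(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Length limit expressed in code points of the canonical form.
fn within_limit(canonical: &str, max_chars: usize) -> bool {
    canonical.chars().count() <= max_chars
}

fn main() {
    // NFC shrinks: e + combining acute becomes one precomposed code point.
    assert_eq!(lengths("e\u{0301}"), (3, 2));
    assert_eq!(lengths("\u{00E9}"), (2, 1));

    // NFKC can grow the character count: the ligature becomes "fi".
    assert_eq!(lengths("\u{FB01}"), (3, 1));
    assert_eq!(lengths("fi"), (2, 2));

    // A limit checked on the raw input can disagree with the stored value.
    assert!(within_limit("\u{FB01}", 1));
    assert!(!within_limit("fi", 1));
}
```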

Recommended implementation strategy

For most Rust services, a layered approach works best:

Define the field’s security role: identifier, display name, email, tag, or free text
Choose a normalization form: NFC for general text, NFKC for restricted identifiers
Apply a character policy: reject controls, format characters, and invisible characters
Apply a script policy if needed: block mixed scripts for usernames and tenant names
Store canonical and original forms separately: canonical for lookup, original for display if safe
Test with spoofing examples: include known confusables and edge cases

This approach keeps your application usable while reducing the risk of impersonation and text-based ambiguity.