Why Unicode spoofing matters

Unicode spoofing is the practice of crafting text that appears trustworthy but is actually deceptive. Common examples include:

  • Homoglyph attacks: replacing Latin characters with visually similar characters from other scripts, such as Latin a (U+0061) with Cyrillic а (U+0430)
  • Mixed-script identifiers: combining scripts in a way that looks legitimate but is hard to detect
  • Normalization tricks: using different Unicode representations that compare unequal unless normalized
  • Invisible characters: inserting zero-width or format characters into usernames, tokens, or file names

In Rust applications, this becomes relevant when you accept:

  • usernames and display names
  • email local parts
  • organization names
  • tags, labels, and project names
  • security-sensitive identifiers used in access control or routing

If your application treats text as a security boundary, you need more than == and trim().
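
To make that concrete, here is a minimal sketch using only the standard library. The helper names code_points and naive_equal are hypothetical and exist only for illustration.

```rust
/// Lists the code points of a string as "U+XXXX" labels so that
/// look-alike characters can be told apart in logs or tests.
/// (Hypothetical helper, not part of any library.)
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// The naive comparison: exact equality after trimming whitespace.
/// Spoofed input can defeat this because look-alike strings differ
/// at the code point level.
fn naive_equal(a: &str, b: &str) -> bool {
    a.trim() == b.trim()
}
```

With these helpers, naive_equal("admin", "admin\u{200B}") is false even though both strings usually render as "admin": U+200B is not whitespace as far as trim() is concerned, so the two values would be treated as different accounts that look the same on screen.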


The security goal: compare what users see, not just code points

A secure Unicode handling strategy usually has three layers:

  1. Normalize text into a canonical form
  2. Reject or restrict dangerous characters such as control and format characters
  3. Detect confusable or mixed-script input where appropriate

The right policy depends on the field. A display name can be permissive. A username or tenant slug should be much stricter.

Common policy choices

| Input type | Recommended policy | Reason |
|---|---|---|
| Display name | Normalize and filter control characters | Preserve international text while avoiding invisible abuse |
| Username | Restrict to a safe profile or script set | Prevent impersonation and routing ambiguity |
| Email address | Normalize domain, validate local part carefully | Domain spoofing is especially risky |
| Internal identifiers | Prefer ASCII unless there is a strong requirement | Simplifies comparisons and logging |

Start with normalization

Unicode has multiple ways to represent the same visible text. For example, é can be a single code point (U+00E9) or an e followed by a combining acute accent (U+0301). If you compare raw strings, they may not match even though they look identical.

In Rust, the unicode-normalization crate provides standard normalization forms such as NFC and NFKC.

use unicode_normalization::UnicodeNormalization;

fn normalize_for_comparison(input: &str) -> String {
    input.nfc().collect()
}

fn main() {
    let a = "e\u{301}"; // e + combining acute accent (decomposed)
    let b = "\u{E9}"; // é as a single precomposed code point

    assert_ne!(a, b);
    assert_eq!(normalize_for_comparison(a), normalize_for_comparison(b));
}

Which normalization form should you use?

  • NFC: good default for preserving appearance while canonicalizing equivalent text
  • NFKC: stronger compatibility normalization; can fold some visually distinct characters together
  • NFD/NFKD: useful for analysis, not usually ideal for storage or comparison

For security-sensitive identifiers, NFKC is often useful because it reduces compatibility variants, but it also erases distinctions that can carry meaning, such as superscripts, ligatures, and fullwidth forms. Use it only when your product requirements allow it.

A practical rule:

  • Use NFC for general text storage and comparison
  • Use NFKC for restricted identifiers when you want to reduce spoofing surface
  • Document the choice clearly so behavior is predictable

Reject invisible and control characters

Many spoofing attacks rely on characters that do not render clearly, such as:

  • bidirectional controls
  • zero-width joiners and non-joiners
  • other Unicode format characters
  • ASCII control characters

These characters can make a username look like one thing in the UI and another in the database or logs.

A simple validation pass can reject them:

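use unicode_normalization::UnicodeNormalization;

/// Returns true if the input contains a control character or one of the
/// invisible or bidirectional format characters listed below.
/// The list is illustrative, not exhaustive.
fn contains_dangerous_characters(input: &str) -> bool {
    input.chars().any(|c| {
        // char::is_control covers only category Cc (ASCII and C1 controls),
        // so format characters (category Cf) are matched explicitly.
        c.is_control()
            || matches!(
                c,
                '\u{200B}' // zero-width space
                    | '\u{200C}' // zero-width non-joiner
                    | '\u{200D}' // zero-width joiner
                    | '\u{200E}' // left-to-right mark
                    | '\u{200F}' // right-to-left mark
                    | '\u{202A}'..='\u{202E}' // embeddings, overrides, pop directional formatting
                    | '\u{2060}' // word joiner
                    | '\u{2066}'..='\u{2069}' // isolates and pop directional isolate
                    | '\u{FEFF}' // zero-width no-break space (BOM)
            )
    })
}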

fn validate_username(input: &str) -> Result<(), &'static str> {
    let normalized = input.nfkc().collect::<String>();

    if contains_dangerous_characters(&normalized) {
        return Err("username contains disallowed characters");
    }

    if normalized.is_empty() {
        return Err("username cannot be empty");
    }

    Ok(())
}

This example is intentionally strict. In a real application, you may want to allow some formatting characters in display names but not in identifiers.
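
As a contrast, a more permissive display-name check might look like the sketch below. It assumes a policy (not stated above, so treat it as one possible choice) in which zero-width joiners stay allowed because emoji sequences and some scripts need them, while control characters and the bidirectional controls are still rejected. Both function names are hypothetical.

```rust
/// True for the characters that carry the Unicode Bidi_Control property.
fn is_bidi_control(c: char) -> bool {
    matches!(
        c,
        '\u{061C}' // Arabic letter mark
            | '\u{200E}' // left-to-right mark
            | '\u{200F}' // right-to-left mark
            | '\u{202A}'..='\u{202E}' // embeddings, overrides, pop directional formatting
            | '\u{2066}'..='\u{2069}' // isolates and pop directional isolate
    )
}

/// Permissive display-name policy for this sketch: non-blank, no control
/// characters, no bidirectional controls. Joiners such as U+200D pass.
fn display_name_is_acceptable(input: &str) -> bool {
    !input.trim().is_empty()
        && !input.chars().any(|c| c.is_control() || is_bidi_control(c))
}
```

Under this policy an emoji sequence joined with U+200D is accepted, while a name containing U+202E (right-to-left override) is rejected.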


Detect mixed-script identifiers

Mixed-script text is not always malicious, but it is a major warning sign in security-sensitive fields. A username like раураl may look like paypal while using Cyrillic characters.

The unicode-script crate can help identify which scripts appear in a string.

use unicode_script::{Script, UnicodeScript};
use std::collections::HashSet;

fn scripts_used(input: &str) -> HashSet<Script> {
    input
        .chars()
        .filter_map(|c| {
            let script = c.script();
            if script == Script::Common || script == Script::Inherited {
                None
            } else {
                Some(script)
            }
        })
        .collect()
}

fn is_mixed_script(input: &str) -> bool {
    scripts_used(input).len() > 1
}

fn main() {
    assert!(!is_mixed_script("alice"));
    assert!(is_mixed_script("раураl"));
}

When to block mixed scripts

Mixed-script blocking is appropriate for:

  • usernames
  • tenant names
  • invite codes
  • security labels
  • internal resource names

It is usually too aggressive for:

  • free-form comments
  • display names
  • multilingual content fields

A better policy is often “allow one script per identifier, plus common punctuation and digits.” That preserves usability while reducing spoofing risk.
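
The "one script per identifier" idea can be sketched without any crate by classifying characters into a few coarse buckets. This is a simplification made for illustration: the ranges below follow Unicode blocks rather than the real Script property, and the CoarseScript type and both functions are hypothetical. A production check would more likely build on the unicode-script crate shown earlier.

```rust
use std::collections::HashSet;

/// A deliberately coarse script classification (an assumption of this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum CoarseScript {
    Latin,
    Greek,
    Cyrillic,
    Other,
}

/// Returns None for characters treated as script-neutral here
/// (ASCII digits, '_', '-', '.'), standing in for "common punctuation and digits".
fn coarse_script(c: char) -> Option<CoarseScript> {
    match c {
        '0'..='9' | '_' | '-' | '.' => None,
        'a'..='z' | 'A'..='Z' => Some(CoarseScript::Latin),
        // Latin-1 letters and Latin Extended-A/B, skipping the signs at U+00D7 and U+00F7.
        '\u{00C0}'..='\u{00D6}' | '\u{00D8}'..='\u{00F6}' | '\u{00F8}'..='\u{024F}' => {
            Some(CoarseScript::Latin)
        }
        // Block-based approximations of Greek and Cyrillic.
        '\u{0370}'..='\u{03FF}' => Some(CoarseScript::Greek),
        '\u{0400}'..='\u{04FF}' => Some(CoarseScript::Cyrillic),
        _ => Some(CoarseScript::Other),
    }
}

/// Accepts identifiers whose letters all fall in one known bucket.
/// Emptiness and length are assumed to be checked elsewhere.
fn is_single_known_script(input: &str) -> bool {
    let scripts: HashSet<CoarseScript> = input.chars().filter_map(coarse_script).collect();
    scripts.len() <= 1 && !scripts.contains(&CoarseScript::Other)
}
```

An all-Latin or all-Cyrillic identifier passes, while a string that mixes the two, or that contains anything outside the three known buckets (including invisible format characters), is rejected.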


Build a safe identifier validator

A practical validator should combine normalization, character filtering, and script policy. The following example accepts a restricted username format:

  • normalized with NFKC
  • ASCII letters, digits, underscore, and hyphen only
  • no control or format characters
  • length-limited

use unicode_normalization::UnicodeNormalization;

#[derive(Debug, PartialEq)]
enum UsernameError {
    Empty,
    TooLong,
    InvalidCharacter,
    DangerousCharacter,
}

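fn validate_username(input: &str) -> Result<String, UsernameError> {
    // Normalize first so every later check sees the form that is stored.
    let normalized: String = input.nfkc().collect();

    if normalized.is_empty() {
        return Err(UsernameError::Empty);
    }

    // len() counts bytes. For an accepted (ASCII-only) username this equals
    // the character count; non-ASCII input may hit the limit sooner.
    if normalized.len() > 32 {
        return Err(UsernameError::TooLong);
    }

    // is_control() covers only category Cc, so common invisible and
    // bidirectional format characters are matched explicitly.
    if normalized.chars().any(|c| {
        c.is_control()
            || matches!(
                c,
                '\u{200B}'..='\u{200F}'
                    | '\u{202A}'..='\u{202E}'
                    | '\u{2060}'..='\u{2069}'
                    | '\u{FEFF}'
            )
    }) {
        return Err(UsernameError::DangerousCharacter);
    }

    // Allowlist: anything outside ASCII letters, digits, '_' and '-' is rejected.
    if !normalized
        .chars()
        .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
    {
        return Err(UsernameError::InvalidCharacter);
    }

    Ok(normalized)
}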

fn main() {
    assert_eq!(validate_username("alice_01"), Ok("alice_01".to_string()));
    assert!(validate_username("аlice").is_err()); // Cyrillic a
    assert!(validate_username("admin\u{200B}").is_err()); // zero-width space
}

This is a strong policy because it intentionally avoids Unicode in usernames. That is often the safest choice for authentication and authorization identifiers.

Why ASCII-only can be the right answer

For security-critical identifiers, ASCII-only has real advantages:

  • simpler comparisons
  • fewer normalization edge cases
  • easier logging and auditing
  • less risk of visual impersonation
  • better compatibility across systems

If your product must support internationalized usernames, consider storing:

  • a canonical identifier for lookup and authorization
  • a display name for user-facing presentation

That separation avoids mixing security logic with presentation logic.


Handle email addresses carefully

Email addresses are a common source of Unicode confusion. The local part may be case-sensitive in theory, but many systems treat it as case-insensitive. Domains may contain Unicode via IDNA, which can introduce punycode and homograph concerns.

A safe approach is:

  • normalize and validate the domain
  • convert the domain to ASCII using IDNA rules
  • keep the local part policy explicit
  • never use the raw email string as a security key without canonicalization

For example, these visually similar domains may resolve differently:

  • example.com
  • examp1e.com
  • ехample.com where the first letter is Cyrillic

If your application uses email for account recovery or login, always compare a canonical form rather than the raw input. Also, display the canonicalized domain carefully in the UI so users can verify it.
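
A narrow version of this canonicalization can be sketched with the standard library alone. The sketch assumes the domain has already been converted to its ASCII (Punycode) form, since the IDNA conversion itself needs a dedicated library and is out of scope here. The local-part policy (printable ASCII, case preserved) is likewise an assumption, and canonicalize_email is a hypothetical name.

```rust
/// Splits an address at the last '@' and returns a canonical form, or None
/// if the address does not fit this sketch's deliberately narrow policy.
fn canonicalize_email(input: &str) -> Option<String> {
    let (local, domain) = input.rsplit_once('@')?;

    // Explicit local-part policy (an assumption): non-empty printable ASCII,
    // no additional '@', case preserved.
    let local_ok = !local.is_empty()
        && local.chars().all(|c| c.is_ascii_graphic() && c != '@');

    // Domain policy: dot-separated, non-empty labels made of ASCII letters,
    // digits, and hyphens. Unicode domains must be converted to ASCII first.
    let domain_ok = !domain.is_empty()
        && domain.split('.').all(|label| {
            !label.is_empty()
                && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
        });

    if local_ok && domain_ok {
        // Domain names are case-insensitive, so lowercase only that part.
        Some(format!("{}@{}", local, domain.to_ascii_lowercase()))
    } else {
        None
    }
}
```

Rejecting a raw Unicode domain outright, rather than guessing at a conversion, keeps the comparison key unambiguous. Whether to lowercase the local part as well is a product decision that should be made explicitly.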


Avoid security decisions based on display text

A common mistake is to use a visible label as an authorization key. For example:

  • “project name” becomes the database key
  • “organization display name” controls access
  • “role label” is used in policy checks

That is dangerous because display text can be changed, spoofed, or normalized into collisions.

Instead:

  • use opaque internal IDs for authorization
  • keep display names separate
  • validate display names for safety, but do not trust them for identity
  • store both the original and canonical forms if needed

Good design pattern

| Purpose | Use | Example |
|---|---|---|
| Authorization | Opaque ID | org_7f3a9c |
| Lookup | Canonical identifier | normalized username |
| Presentation | Display string | user-chosen name |
| Audit | Canonical + original | both forms recorded safely |

This separation prevents a class of bugs where visually similar text accidentally maps to the wrong account or resource.
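
One way to picture this separation is the sketch below. All type and field names (OrgId, Organization, Directory) are hypothetical; the point is only that the access check never reads the display name.

```rust
use std::collections::{HashMap, HashSet};

/// Opaque identifier used for authorization decisions.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct OrgId(String);

/// Presentation data only; never consulted for access control.
struct Organization {
    display_name: String,
}

struct Directory {
    orgs: HashMap<OrgId, Organization>,
    /// Membership keyed by (opaque org ID, canonical username).
    members: HashSet<(OrgId, String)>,
}

impl Directory {
    /// Authorization looks only at the opaque ID and the canonical username.
    fn can_access(&self, org: &OrgId, canonical_username: &str) -> bool {
        self.members
            .contains(&(org.clone(), canonical_username.to_string()))
    }

    /// The display name is fetched separately, for rendering only.
    fn display_name(&self, org: &OrgId) -> Option<&str> {
        self.orgs.get(org).map(|o| o.display_name.as_str())
    }
}
```

With this layout, two organizations whose display names look identical (for example, one starting with Latin A and one with Cyrillic А) remain distinct for authorization because their IDs differ.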


Test with adversarial cases

Unicode security bugs are easy to miss in normal testing. Add explicit cases for:

  • mixed scripts
  • combining marks
  • zero-width characters
  • bidirectional controls
  • normalization collisions

A good test suite should include examples like:

  • paypal vs раураl
  • e\u{301} vs é
  • admin\u{200B}
  • strings with leading or trailing format characters

Property-based testing can also help. Generate random Unicode strings and assert that your validator either rejects them or canonicalizes them predictably.
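
The adversarial examples above can be kept in a small table and run against whatever validator you ship. In the sketch below, is_valid_username is a stand-in ASCII-only check without normalization (an assumption made so the example needs no crates), and adversarial_cases is a hypothetical helper.

```rust
/// Stand-in validator for this sketch: ASCII letters, digits, '_' and '-',
/// between 1 and 32 bytes. A real validator would normalize first.
fn is_valid_username(s: &str) -> bool {
    !s.is_empty()
        && s.len() <= 32
        && s.chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
}

/// Each entry pairs an input with whether the validator should accept it.
fn adversarial_cases() -> Vec<(&'static str, bool)> {
    vec![
        ("paypal", true),
        // Cyrillic look-alike of "paypal" with a Latin final letter.
        ("\u{0440}\u{0430}\u{0443}\u{0440}\u{0430}l", false),
        // Decomposed and precomposed forms of the same accented letter.
        ("e\u{301}", false),
        ("\u{E9}", false),
        // Trailing zero-width space.
        ("admin\u{200B}", false),
        // Leading right-to-left override.
        ("\u{202E}admin", false),
    ]
}
```

A test can then loop over adversarial_cases() and assert that is_valid_username(input) matches the expected flag, so adding a newly discovered spoofing string is a one-line change.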

Practical testing checklist

  • verify normalization is stable
  • verify dangerous characters are rejected
  • verify mixed-script policy behaves as expected
  • verify length checks occur after normalization
  • verify storage and lookup use the same canonical form

Length checks are especially important: normalization can change the number of bytes and sometimes the number of characters. Always validate on the canonical form you actually store.
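
A quick way to see the effect is to measure the two forms of é from earlier. The helper name byte_and_char_len is hypothetical, and the sketch uses only the standard library, so the normalized forms are written out by hand rather than computed.

```rust
/// Returns (UTF-8 byte length, number of `char`s) for a string.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Length limit applied to the canonical form, counted in `char`s.
fn within_limit(canonical: &str, max_chars: usize) -> bool {
    canonical.chars().count() <= max_chars
}
```

Here byte_and_char_len("\u{E9}") gives (2, 1) while byte_and_char_len("e\u{301}") gives (3, 2), so the same visible text can pass or fail a limit depending on which form is measured.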


Recommended implementation strategy

For most Rust services, a layered approach works best:

  1. Define the field’s security role
     • identifier, display name, email, tag, or free text
  2. Choose a normalization form
     • NFC for general text, NFKC for restricted identifiers
  3. Apply character policy
     • reject controls, format characters, and invisible characters
  4. Apply script policy if needed
     • block mixed scripts for usernames and tenant names
  5. Store canonical and original forms separately
     • canonical for lookup, original for display if safe
  6. Test with spoofing examples
     • include known confusables and edge cases

This approach keeps your application usable while reducing the risk of impersonation and text-based ambiguity.
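
The layering can be sketched as a small pipeline in which each layer is passed in as a function. This is one possible shape, not a prescribed design: FieldError, StoredText, and canonicalize_field are hypothetical names, and the normalization step is injected so the sketch stays standard-library only. A real caller would likely pass a closure built on the unicode-normalization crate.

```rust
#[derive(Debug, PartialEq)]
enum FieldError {
    Empty,
    DisallowedCharacter(char),
}

/// Canonical form for lookup, original form kept for display.
#[derive(Debug, PartialEq)]
struct StoredText {
    canonical: String,
    original: String,
}

/// Runs the layers in order: normalize, then apply the character policy
/// to the normalized text, then keep both forms.
fn canonicalize_field(
    input: &str,
    normalize: impl Fn(&str) -> String,
    is_allowed: impl Fn(char) -> bool,
) -> Result<StoredText, FieldError> {
    let canonical = normalize(input);

    if canonical.is_empty() {
        return Err(FieldError::Empty);
    }

    // Report the first character that violates the policy.
    if let Some(bad) = canonical.chars().find(|&c| !is_allowed(c)) {
        return Err(FieldError::DisallowedCharacter(bad));
    }

    Ok(StoredText {
        canonical,
        original: input.to_string(),
    })
}
```

Because every check runs on the output of normalize, the stored canonical form is exactly what was validated, which is the property the checklist above asks tests to verify.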

