
Preventing Unicode Spoofing in Rust: Safely Handling Confusable Identifiers and Text
Why Unicode spoofing matters
Unicode spoofing is the practice of crafting text that appears trustworthy but is actually deceptive. Common examples include:
- Homoglyph attacks: replacing Latin characters with visually similar characters from other scripts, such as Latin a with Cyrillic а
- Mixed-script identifiers: combining scripts in a way that looks legitimate but is hard to detect
- Normalization tricks: using different Unicode representations that compare unequal unless normalized
- Invisible characters: inserting zero-width or format characters into usernames, tokens, or file names
In Rust applications, this becomes relevant when you accept:
- usernames and display names
- email local parts
- organization names
- tags, labels, and project names
- security-sensitive identifiers used in access control or routing
If your application treats text as a security boundary, you need more than == and trim().
The security goal: compare what users see, not just code points
A secure Unicode handling strategy usually has three layers:
- Normalize text into a canonical form
- Reject or restrict dangerous characters such as control and format characters
- Detect confusable or mixed-script input where appropriate
The right policy depends on the field. A display name can be permissive. A username or tenant slug should be much stricter.
Common policy choices
| Input type | Recommended policy | Reason |
|---|---|---|
| Display name | Normalize and filter control characters | Preserve international text while avoiding invisible abuse |
| Username | Restrict to a safe profile or script set | Prevent impersonation and routing ambiguity |
| Email address | Normalize domain, validate local part carefully | Domain spoofing is especially risky |
| Internal identifiers | Prefer ASCII unless there is a strong requirement | Simplifies comparisons and logging |
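
One way to keep these per-field choices explicit is to encode them as data. The sketch below is only an illustration: `FieldKind`, `Normalization`, `TextPolicy`, and `policy_for` are hypothetical names, and the flag values are assumptions that mirror the table rather than a standard. Email is omitted because it needs separate handling for the local part and the domain.

```rust
/// Field kinds from the policy table. Email is left out because it needs
/// domain-specific handling rather than a single set of flags.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FieldKind {
    DisplayName,
    Username,
    InternalIdentifier,
}

/// Normalization form to apply before validation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Normalization {
    Nfc,
    Nfkc,
}

/// Hypothetical policy knobs; the values below are assumptions, not a standard.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TextPolicy {
    normalization: Normalization,
    ascii_only: bool,
    reject_invisible: bool,
    single_script: bool,
}

/// Maps each field kind to the policy suggested by the table.
fn policy_for(kind: FieldKind) -> TextPolicy {
    match kind {
        FieldKind::DisplayName => TextPolicy {
            normalization: Normalization::Nfc,
            ascii_only: false,
            reject_invisible: true,
            single_script: false,
        },
        FieldKind::Username => TextPolicy {
            normalization: Normalization::Nfkc,
            ascii_only: false,
            reject_invisible: true,
            single_script: true,
        },
        FieldKind::InternalIdentifier => TextPolicy {
            normalization: Normalization::Nfkc,
            ascii_only: true,
            reject_invisible: true,
            single_script: true,
        },
    }
}
```

A validator can then take a `TextPolicy` as input, so the strictness of each field is visible in one place instead of being scattered across handlers.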
Start with normalization
Unicode has multiple ways to represent the same visible text. For example, é can be a single code point or an e plus combining accent. If you compare raw strings, they may not match even though they look identical.
In Rust, the unicode-normalization crate provides standard normalization forms such as NFC and NFKC.

    use unicode_normalization::UnicodeNormalization;

    /// Canonicalizes text to NFC so equivalent sequences compare equal.
    fn normalize_for_comparison(input: &str) -> String {
        input.nfc().collect()
    }

    fn main() {
        let a = "e\u{301}"; // e + combining acute accent
        let b = "\u{e9}"; // precomposed é, written as an escape so editors cannot alter it
        assert_ne!(a, b);
        assert_eq!(normalize_for_comparison(a), normalize_for_comparison(b));
    }

Which normalization form should you use?
- NFC: good default for preserving appearance while canonicalizing equivalent text
- NFKC: stronger compatibility normalization; can fold some visually distinct characters together
- NFD/NFKD: useful for analysis, not usually ideal for storage or comparison
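For security-sensitive identifiers, NFKC is often useful because it folds compatibility variants, such as fullwidth letters and ligatures, into their plain equivalents, but it can also change meaning in some languages. It does not resolve cross-script homoglyphs: Cyrillic а is still distinct from Latin a after NFKC, so normalization alone is not a spoofing defense. Use it only when your product requirements allow it.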
A practical rule:
- Use NFC for general text storage and comparison
- Use NFKC for restricted identifiers when you want to reduce spoofing surface
- Document the choice clearly so behavior is predictable
Reject invisible and control characters
Many spoofing attacks rely on characters that do not render clearly, such as:
- bidirectional controls
- zero-width joiners and non-joiners
- other Unicode format characters
- ASCII control characters
These characters can make a username look like one thing in the UI and another in the database or logs.
A simple validation pass can reject them:

    use unicode_normalization::UnicodeNormalization;

    /// Returns true for control characters and for common invisible or
    /// bidirectional format characters. The list is not exhaustive.
    fn contains_dangerous_characters(input: &str) -> bool {
        input.chars().any(|c| {
            c.is_control()
                || matches!(
                    c,
                    '\u{200B}' // zero-width space
                        | '\u{200C}' // zero-width non-joiner
                        | '\u{200D}' // zero-width joiner
                        | '\u{200E}' // left-to-right mark
                        | '\u{200F}' // right-to-left mark
                        | '\u{202A}'..='\u{202E}' // bidi embeddings and overrides
                        | '\u{2066}'..='\u{2069}' // bidi isolates
                        | '\u{FEFF}' // zero-width no-break space (BOM)
                )
        })
    }

    fn validate_username(input: &str) -> Result<(), &'static str> {
        let normalized = input.nfkc().collect::<String>();
        if contains_dangerous_characters(&normalized) {
            return Err("username contains disallowed characters");
        }
        if normalized.is_empty() {
            return Err("username cannot be empty");
        }
        Ok(())
    }

This example is intentionally strict. In a real application, you may want to allow some formatting characters in display names but not in identifiers.
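
Because these characters are invisible in most UIs and terminals, it can also help to make them visible when logging rejected input. The following is a minimal std-only sketch; `is_invisible` and `escape_invisible` are hypothetical helper names, and the character ranges are an illustrative subset rather than a complete list.

```rust
/// Returns true for characters that usually render as nothing.
/// The ranges are an illustrative subset, not a complete list.
fn is_invisible(c: char) -> bool {
    c.is_control()
        || matches!(
            c,
            '\u{200B}'..='\u{200F}' // zero-width characters and directional marks
                | '\u{202A}'..='\u{202E}' // bidi embeddings and overrides
                | '\u{2060}'..='\u{2064}' // word joiner and invisible operators
                | '\u{2066}'..='\u{2069}' // bidi isolates
                | '\u{FEFF}' // zero-width no-break space (BOM)
        )
}

/// Renders invisible characters as `\u{...}` escapes so they show up in logs.
fn escape_invisible(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if is_invisible(c) {
            // escape_unicode yields the characters of the `\u{NNNN}` form.
            out.extend(c.escape_unicode());
        } else {
            out.push(c);
        }
    }
    out
}
```

With this helper, `escape_invisible("admin\u{200B}")` produces the text `admin\u{200b}`, which makes the trailing zero-width space obvious in an audit log.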
Detect mixed-script identifiers
Mixed-script text is not always malicious, but it is a major warning sign in security-sensitive fields. A username like раураl may look like paypal while using Cyrillic characters.
The unicode-script crate can help identify which scripts appear in a string.

    use std::collections::HashSet;
    use unicode_script::{Script, UnicodeScript};

    /// Collects the scripts used in the input, ignoring Common and Inherited
    /// (digits, punctuation, combining marks), which appear in every script.
    fn scripts_used(input: &str) -> HashSet<Script> {
        input
            .chars()
            .filter_map(|c| {
                let script = c.script();
                if script == Script::Common || script == Script::Inherited {
                    None
                } else {
                    Some(script)
                }
            })
            .collect()
    }

    fn is_mixed_script(input: &str) -> bool {
        scripts_used(input).len() > 1
    }

    fn main() {
        assert!(!is_mixed_script("alice"));
        assert!(is_mixed_script("раураl")); // Cyrillic letters plus a Latin l
    }

When to block mixed scripts
Mixed-script blocking is appropriate for:
- usernames
- tenant names
- invite codes
- security labels
- internal resource names
It is usually too aggressive for:
- free-form comments
- display names
- multilingual content fields
A better policy is often “allow one script per identifier, plus common punctuation and digits.” That preserves usability while reducing spoofing risk.
Build a safe identifier validator
A practical validator should combine normalization, character filtering, and script policy. The following example accepts a restricted username format:
- normalized with NFKC
- ASCII letters, digits, underscore, and hyphen only
- no control or format characters
- length-limited

    use unicode_normalization::UnicodeNormalization;

    #[derive(Debug, PartialEq)]
    enum UsernameError {
        Empty,
        TooLong,
        InvalidCharacter,
        DangerousCharacter,
    }

    fn validate_username(input: &str) -> Result<String, UsernameError> {
        let normalized: String = input.nfkc().collect();
        if normalized.is_empty() {
            return Err(UsernameError::Empty);
        }
        // Byte length; for the ASCII-only names accepted below this equals
        // the character count.
        if normalized.len() > 32 {
            return Err(UsernameError::TooLong);
        }
        if normalized.chars().any(|c| c.is_control()) {
            return Err(UsernameError::DangerousCharacter);
        }
        // The allow-list also rejects format characters such as U+200B,
        // which are reported as InvalidCharacter.
        if !normalized
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
        {
            return Err(UsernameError::InvalidCharacter);
        }
        Ok(normalized)
    }

    fn main() {
        assert_eq!(validate_username("alice_01"), Ok("alice_01".to_string()));
        assert!(validate_username("аlice").is_err()); // Cyrillic а
        assert!(validate_username("admin\u{200B}").is_err()); // zero-width space
    }

This is a strong policy because it intentionally avoids Unicode in usernames. That is often the safest choice for authentication and authorization identifiers.
Why ASCII-only can be the right answer
For security-critical identifiers, ASCII-only has real advantages:
- simpler comparisons
- fewer normalization edge cases
- easier logging and auditing
- less risk of visual impersonation
- better compatibility across systems
If your product must support internationalized usernames, consider storing:
- a canonical identifier for lookup and authorization
- a display name for user-facing presentation
That separation avoids mixing security logic with presentation logic.
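
As a rough illustration of that split, the sketch below keys a user directory on the canonical identifier and stores the display name only as data. All names here (`User`, `UserDirectory`, `canonicalize_username`) are hypothetical, and the canonicalization rule (ASCII allow-list plus lowercasing) is an assumption made for the example, not a requirement.

```rust
use std::collections::HashMap;

/// A user record that keeps the lookup key apart from presentation text.
#[derive(Debug, Clone, PartialEq)]
struct User {
    /// Canonical identifier used for lookup and authorization.
    canonical_username: String,
    /// Free-form text shown in the UI; never used as a key.
    display_name: String,
}

/// Hypothetical rule: ASCII letters, digits, `_` and `-`, folded to lowercase.
fn canonicalize_username(input: &str) -> Option<String> {
    let ok = !input.is_empty()
        && input
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    if ok {
        Some(input.to_ascii_lowercase())
    } else {
        None
    }
}

#[derive(Default)]
struct UserDirectory {
    by_canonical: HashMap<String, User>,
}

impl UserDirectory {
    /// Registers a user; fails if the name is invalid or already taken.
    fn register(&mut self, username: &str, display_name: &str) -> Result<(), &'static str> {
        let key = canonicalize_username(username).ok_or("invalid username")?;
        if self.by_canonical.contains_key(&key) {
            return Err("username already taken");
        }
        let user = User {
            canonical_username: key.clone(),
            display_name: display_name.to_string(),
        };
        self.by_canonical.insert(key, user);
        Ok(())
    }

    /// Looks a user up by any spelling that canonicalizes to the same key.
    fn find(&self, username: &str) -> Option<&User> {
        let key = canonicalize_username(username)?;
        self.by_canonical.get(&key)
    }
}
```

Here a display name can contain any script or emoji without affecting lookups, while `Alice` and `alice` collide on the same key instead of creating two look-alike accounts.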
Handle email addresses carefully
Email addresses are a common source of Unicode confusion. The local part may be case-sensitive in theory, but many systems treat it as case-insensitive. Domains may contain Unicode via IDNA, which can introduce Punycode and homograph concerns.
A safe approach is:
- normalize and validate the domain
- convert the domain to ASCII using IDNA rules
- keep the local part policy explicit
- never use the raw email string as a security key without canonicalization
For example, visually similar domains may resolve differently:
- example.com
- examp1e.com
- ехample.com, where the first letter is Cyrillic
If your application uses email for account recovery or login, always compare a canonical form rather than the raw input. Also, display the canonicalized domain carefully in the UI so users can verify it.
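
The steps above might be sketched as follows. This is a std-only outline under a stated assumption: the domain has already been converted to its ASCII form by a dedicated IDNA implementation, which is not shown, so any non-ASCII domain is simply rejected. `EmailError`, `canonicalize_email`, and `has_punycode_label` are hypothetical names, and keeping the local part verbatim is one explicit policy choice among several.

```rust
#[derive(Debug, PartialEq)]
enum EmailError {
    MissingAt,
    EmptyPart,
    NonAsciiDomain,
}

/// Builds the canonical form used as a key: the local part is kept verbatim
/// and the domain is lowercased. Assumes IDNA-to-ASCII conversion happened
/// earlier; this sketch rejects domains that are still non-ASCII.
fn canonicalize_email(input: &str) -> Result<String, EmailError> {
    // Split on the last '@' so a quoted local part containing '@' stays intact.
    let (local, domain) = input.rsplit_once('@').ok_or(EmailError::MissingAt)?;
    if local.is_empty() || domain.is_empty() {
        return Err(EmailError::EmptyPart);
    }
    if !domain.is_ascii() {
        return Err(EmailError::NonAsciiDomain);
    }
    Ok(format!("{}@{}", local, domain.to_ascii_lowercase()))
}

/// True if any domain label is Punycode-encoded ("xn--" prefix), which is a
/// hint that the UI should display the domain with extra care.
fn has_punycode_label(domain: &str) -> bool {
    domain
        .split('.')
        .any(|label| label.to_ascii_lowercase().starts_with("xn--"))
}
```

Rejecting non-ASCII domains at this layer forces every caller through the IDNA step first, so the stored key is always the ASCII form.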
Avoid security decisions based on display text
A common mistake is to use a visible label as an authorization key. For example:
- “project name” becomes the database key
- “organization display name” controls access
- “role label” is used in policy checks
That is dangerous because display text can be changed, spoofed, or normalized into collisions.
Instead:
- use opaque internal IDs for authorization
- keep display names separate
- validate display names for safety, but do not trust them for identity
- store both the original and canonical forms if needed
Good design pattern
| Purpose | Use | Example |
|---|---|---|
| Authorization | Opaque ID | org_7f3a9c |
| Lookup | Canonical identifier | normalized username |
| Presentation | Display string | user-chosen name |
| Audit | Canonical + original | both forms recorded safely |
This separation prevents a class of bugs where visually similar text accidentally maps to the wrong account or resource.
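
A small sketch of the pattern, using hypothetical types (`OrgId`, `Org`, `AccessControl`): the authorization check consults only the opaque ID and the canonical user identifier, so two organizations with the same display name cannot be confused.

```rust
use std::collections::{HashMap, HashSet};

/// Opaque organization ID such as "org_7f3a9c"; never derived from display text.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct OrgId(String);

struct Org {
    /// Presentation only; may change, collide, or contain any script.
    display_name: String,
    /// Canonical user identifiers that may access this organization.
    members: HashSet<String>,
}

struct AccessControl {
    orgs: HashMap<OrgId, Org>,
}

impl AccessControl {
    /// Authorization looks at the opaque ID and canonical user only.
    fn can_access(&self, org: &OrgId, canonical_user: &str) -> bool {
        self.orgs
            .get(org)
            .map_or(false, |o| o.members.contains(canonical_user))
    }

    /// Display names are read for presentation and never used for decisions.
    fn label(&self, org: &OrgId) -> Option<&str> {
        self.orgs.get(org).map(|o| o.display_name.as_str())
    }
}
```

Renaming an organization in this design changes only what `label` returns; it has no effect on who passes `can_access`.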
Test with adversarial cases
Unicode security bugs are easy to miss in normal testing. Add explicit cases for:
- mixed scripts
- combining marks
- zero-width characters
- bidirectional controls
- normalization collisions
A good test suite should include examples like:
- paypal vs раураl
- e\u{301} vs é
- admin\u{200B}
- strings with leading or trailing format characters
Property-based testing can also help. Generate random Unicode strings and assert that your validator either rejects them or canonicalizes them predictably.
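
The idea can be tried without extra dependencies. The sketch below is an assumption-heavy stand-in: `validate` is a simplified ASCII-only validator written for this example, and `XorShift` is a tiny hand-rolled generator used in place of a real property-testing crate such as proptest. It checks two properties on random input: accepted output contains only visible ASCII, and validating the output again changes nothing.

```rust
/// Stand-in validator for this sketch: 1 to 32 ASCII letters, digits,
/// `_` or `-`, folded to lowercase.
fn validate(input: &str) -> Option<String> {
    let ok = !input.is_empty()
        && input.len() <= 32
        && input
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    if ok {
        Some(input.to_ascii_lowercase())
    } else {
        None
    }
}

/// Minimal xorshift64 generator so the sketch needs no external crates.
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }

    /// Random string mixing printable ASCII with arbitrary Unicode scalars.
    fn string(&mut self, max_len: u64) -> String {
        let len = self.next_u64() % (max_len + 1);
        (0..len)
            .map(|_| {
                let r = self.next_u64();
                if r % 2 == 0 {
                    // Printable ASCII ('!'..='~') so that some inputs pass.
                    char::from(b'!' + ((r >> 8) % 94) as u8)
                } else {
                    // from_u32 returns None for surrogates; use U+FFFD instead.
                    char::from_u32(((r >> 8) % 0x11_0000) as u32).unwrap_or('\u{FFFD}')
                }
            })
            .collect()
    }
}

/// Runs the property checks on `cases` random inputs and returns how many
/// inputs the validator accepted. Panics if a property is violated.
fn check_properties(seed: u64, cases: usize) -> usize {
    let mut rng = XorShift(seed);
    let mut accepted = 0;
    for _ in 0..cases {
        let input = rng.string(8);
        if let Some(canonical) = validate(&input) {
            accepted += 1;
            // Accepted output must be visible ASCII only.
            assert!(canonical.chars().all(|c| c.is_ascii_graphic()), "{:?}", input);
            // Canonicalization must be stable under repeated validation.
            assert_eq!(validate(&canonical).as_deref(), Some(canonical.as_str()));
        }
    }
    accepted
}
```

A fixed seed keeps failures reproducible; a dedicated property-testing crate would add shrinking and better input distributions.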
Practical testing checklist
- verify normalization is stable
- verify dangerous characters are rejected
- verify mixed-script policy behaves as expected
- verify length checks occur after normalization
- verify storage and lookup use the same canonical form
Length checks are especially important: normalization can change the number of bytes and sometimes the number of characters. Always validate on the canonical form you actually store.
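
The difference is easy to see with the two spellings of é used earlier. The helpers below are hypothetical and std-only; `within_limit` assumes its argument has already been normalized by the caller.

```rust
/// Byte length and character count of a string, for comparing representations.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Length limit applied to the characters of the canonical form.
/// Assumes `canonical` has already been normalized by the caller.
fn within_limit(canonical: &str, max_chars: usize) -> bool {
    canonical.chars().count() <= max_chars
}

/// Shows that the same visible text has different lengths before and
/// after composition.
fn demo() {
    // Decomposed: 'e' (1 byte) + U+0301 (2 bytes) = 3 bytes, 2 characters.
    assert_eq!(byte_and_char_len("e\u{301}"), (3, 2));
    // Precomposed U+00E9: 2 bytes, 1 character.
    assert_eq!(byte_and_char_len("\u{e9}"), (2, 1));
}
```

A limit of one character therefore accepts the precomposed form and rejects the decomposed one, which is why the check belongs after normalization.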
Recommended implementation strategy
For most Rust services, a layered approach works best:
- Define the field’s security role
  - identifier, display name, email, tag, or free text
- Choose a normalization form
  - NFC for general text, NFKC for restricted identifiers
- Apply character policy
  - reject controls, format characters, and invisible characters
- Apply script policy if needed
  - block mixed scripts for usernames and tenant names
- Store canonical and original forms separately
  - canonical for lookup, original for display if safe
- Test with spoofing examples
  - include known confusables and edge cases
This approach keeps your application usable while reducing the risk of impersonation and text-based ambiguity.
