
URL WTFs

Property based testing leading to unexpected URL discoveries

Updated
• 9 min read

I'm a web-focused developer with a passion for exploring new ideas and hopefully sharing more of them through this blog 😃

While working on an application that accepts user-provided URLs, I wanted to test a wide range of URLs to figure out which ones to accept and which to reject. Plenty of security vulnerabilities can hide behind functionality as simple as fetching a user-provided URL.

Validating URLs is one of those cases that is perfect for property-based testing. I happened to use fast-check, which has a pretty comprehensive page on why you would want property-based testing in the first place. Please read it.

The idea is that you don't want to just test a few hardcoded values, get some basic tests passing, and call it a day. You want to exercise your application logic against a wide range of possibly valid data and ensure it behaves correctly, especially with examples you wouldn't think to check. Property-based tests are an elegant approach to this problem: libraries like fast-check take care of generating valid values, in this case URLs, feeding them to your tests, and checking that the code behaves as expected.

A property-based test using fast-check looks like this:

import fc from 'fast-check';
import { validateUrl } from '../server/services/FetcherUtils.res.mjs';

describe("URL validation", () => {
  it("should accept valid http and https URLs with domain names", async () => {
    const validUrl = fc.webUrl({
      validSchemes: ['http', 'https'],
      withFragments: true,
      withQueryParameters: true,
    });

    fc.assert(
      fc.property(validUrl, (url) => {
        const result = validateUrl(url);
        return result.valid === true && result.error === undefined;
      }),
      { numRuns: 20 }
    );
  });

  it("should reject URLs with invalid protocols", async () => {
    const invalidProtocolUrl = fc.webUrl({
      validSchemes: ['ftp', 'file', 'javascript', 'data', 'mailto', 'ws', 'wss'],
      withFragments: true,
      withQueryParameters: true,
    });

    fc.assert(
      fc.property(invalidProtocolUrl, (url) => {
        const result = validateUrl(url);
        return (
          result.valid === false &&
          result.error === "URL must use http or https protocol"
        );
      }),
      { numRuns: 20 }
    );
  });
});

The idea should be clear: generate different kinds of URLs, both valid and invalid, and ensure your code handles them appropriately.

The functions used to generate different kinds of data should be mostly self-explanatory. In this example we use fc.webUrl to generate different kinds of URLs, specifying the schemes we want to allow.

As an aside, the function being tested is imported from a compiled ReScript file. This is one of the great advantages of ReScript: it gives you all the benefits of a type-safe language, but it compiles to plain JS, so there is zero lock-in. You can even delete it and keep the compiled JS files. For this case, the nice part is that you can use all the amazing testing tools and libraries already available in JS-land to test functions written in ReScript 🤩

Detour: What is a valid URL?

I came to this structure after finding out that my original code was not restrictive enough. The property tests were generating quite strange URLs that kept being accepted as valid. This right here is one of the strengths of using property-based tests. So what was going on?

This is the JavaScript-compiled version of the code I started out with to accept and fetch URLs, basically deferring to the already existing URL interface in JavaScript.

It looks like an innocent and simple function, but don't be fooled: there are a lot of dangers lurking underneath. Read on for all the issues I found in this simple function.

function fetchUrl(userInput) {
  const url = new URL(userInput);
  return fetch(url);
}

For full transparency, I asked an AI agent to document why some of the weird URLs generated by the property tests were being accepted. I thought it best to write it down somewhere easy to reference so that I don't need to rediscover this in the future.

JavaScript URL Constructor Quirks

The JavaScript URL constructor (based on the WHATWG URL Standard) accepts many URL formats that may be surprising or unexpected. This document catalogs these behaviors to help developers understand URL validation edge cases.

Key Takeaway: The URL constructor is very permissive and accepts many strings that don't look like traditional URLs. Always validate the parsed components (protocol, hostname, etc.) after construction.
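To make that concrete, here is a minimal sketch of what such post-construction validation can look like. It mirrors the `{ valid, error }` result shape that the property tests above expect from `validateUrl`, but it is an illustrative JavaScript version, not the app's actual ReScript implementation:

```javascript
// Illustrative sketch only -- mirrors the { valid, error } result shape used
// in the property tests, not the app's real ReScript implementation.
function validateUrl(input) {
  let url;
  try {
    url = new URL(input); // throws TypeError on unparseable input
  } catch {
    return { valid: false, error: "Invalid URL" };
  }
  // The constructor is very permissive, so check the parsed components explicitly.
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return { valid: false, error: "URL must use http or https protocol" };
  }
  if (!url.hostname) {
    return { valid: false, error: "Hostname required" };
  }
  return { valid: true, error: undefined };
}
```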

Unexpected Formats Accepted

1. Single-Letter Schemes

The URL constructor accepts any single letter followed by a colon as a valid URL scheme.

new URL('A: ')
// ✅ Valid
// protocol: "a:"
// hostname: ""
// href: "a:"

new URL('Z:')
// ✅ Valid
// protocol: "z:"
// hostname: ""
// href: "z:"

new URL('x:anything')
// ✅ Valid
// protocol: "x:"
// hostname: ""
// pathname: "anything"

Why: Per the URL spec, any string matching the pattern [a-zA-Z][a-zA-Z0-9+.-]*: is a valid scheme.

Implication: Must explicitly check for allowed protocols (http/https) after parsing.

2. Schemes Without Authority

URLs don't require the // authority component.

new URL('mailto:user@example.com')
// ✅ Valid
// protocol: "mailto:"
// pathname: "user@example.com"

new URL('data:text/plain,Hello')
// ✅ Valid
// protocol: "data:"
// pathname: "text/plain,Hello"

new URL('javascript:alert(1)')
// ✅ Valid
// protocol: "javascript:"
// pathname: "alert(1)"

Why: Many URL schemes (mailto, data, javascript, etc.) don't use the // authority syntax.

Implication: Dangerous for security - must validate protocol to prevent XSS and other attacks.

3. Single Slash After Protocol

URLs with a single slash after the protocol are valid.

new URL('http:/example.com')
// ✅ Valid
// protocol: "http:"
// hostname: "example.com"
// pathname: "/"

new URL('https:/path/to/resource')
// ✅ Valid
// protocol: "https:"
// hostname: "path"
// pathname: "/to/resource"

Why: For special schemes like http and https, the parser forgives a missing slash and still parses an authority, so http:/example.com normalizes to http://example.com/.

Implication: May parse differently than expected - the hostname might not be what you think.

4. Empty Authority

Only some schemes allow an empty authority (no hostname).

new URL('http://')
// ❌ Throws TypeError: Invalid URL

new URL('http:///')
// ❌ Throws TypeError: Invalid URL (extra slashes are ignored, and http requires a host)

new URL('file:///')
// ✅ Valid
// protocol: "file:"
// hostname: ""
// pathname: "/"

Why: The spec allows empty hostnames only for certain schemes such as file; special schemes like http and https require a non-empty host.

Implication: Must still check that hostname is not empty - non-special schemes parse successfully without one.

5. Whitespace Handling

Leading and trailing whitespace is trimmed, but internal whitespace is only an error in the hostname.

new URL('  http://example.com  ')
// ✅ Valid (whitespace trimmed)
// href: "http://example.com/"

new URL('http://example .com')
// ❌ Throws TypeError: Invalid URL (space in the hostname)

new URL('http://example.com/path with spaces')
// ✅ Valid (spaces in the path are percent-encoded)
// pathname: "/path%20with%20spaces"

Why: The spec trims leading and trailing ASCII whitespace from the input; a space is forbidden in the host but simply percent-encoded in the path.

Implication: Whitespace in the hostname is invalid, whitespace in the path is silently encoded, and leading/trailing whitespace is silently removed.
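One related quirk the examples above don't show (easy to verify in Node or a browser console): ASCII tab and newline characters are stripped from anywhere in the input before parsing, not just from the ends.

```javascript
// Tabs and newlines are removed from the whole input before parsing,
// even from the middle of the hostname or path.
const sneaky = new URL('http://exam\tple.com/pa\nth');
console.log(sneaky.hostname); // "example.com"
console.log(sneaky.pathname); // "/path"
```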

6. Case Insensitivity

Schemes and hostnames are case-insensitive and normalized to lowercase.

new URL('HTTP://EXAMPLE.COM')
// ✅ Valid
// protocol: "http:"
// hostname: "example.com"
// href: "http://example.com/"

new URL('HtTp://ExAmPlE.cOm')
// ✅ Valid
// protocol: "http:"
// hostname: "example.com"

Why: DNS and URL schemes are case-insensitive per spec.

Implication: Always compare protocols and hostnames in lowercase.

7. Numeric Hostnames

Hostnames can be numeric (interpreted as IP addresses).

new URL('http://127.0.0.1')
// ✅ Valid
// hostname: "127.0.0.1"

new URL('http://2130706433')
// ✅ Valid (decimal IP representation)
// hostname: "127.0.0.1" (normalized to dotted-decimal)

new URL('http://0x7f000001')
// ✅ Valid (hexadecimal IP)
// hostname: "127.0.0.1" (normalized to dotted-decimal)

Why: The spec parses various IPv4 formats (decimal, hexadecimal, octal) and serializes them all back to dotted-decimal.

Implication: Must validate that the hostname is not an IP address if you want to require domain names.
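Helpfully, because the parser serializes alternate IPv4 spellings back to dotted-decimal, a single check on the parsed hostname covers them all. A quick sketch (worth re-verifying in your own runtime):

```javascript
// The URL parser normalizes decimal and hex IPv4 spellings to dotted-decimal,
// so one regex on the parsed hostname catches every spelling.
function looksLikeIpv4(hostname) {
  return /^\d+\.\d+\.\d+\.\d+$/.test(hostname);
}

console.log(looksLikeIpv4(new URL('http://127.0.0.1').hostname));   // true
console.log(looksLikeIpv4(new URL('http://2130706433').hostname));  // true
console.log(looksLikeIpv4(new URL('http://0x7f000001').hostname));  // true
console.log(looksLikeIpv4(new URL('http://example.com').hostname)); // false
```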

8. IPv6 Addresses

IPv6 addresses must be enclosed in brackets.

new URL('http://[::1]')
// ✅ Valid
// hostname: "[::1]"

new URL('http://[2001:db8::1]')
// ✅ Valid
// hostname: "[2001:db8::1]"

new URL('http://::1')
// ❌ Throws TypeError: Invalid URL

new URL('http://2001:db8::1')
// ❌ Throws TypeError: Invalid URL

Why: Brackets are required to disambiguate colons in IPv6 from port numbers.

Implication: IPv6 addresses without brackets are invalid.

9. Port Numbers

Port numbers can be specified but are optional.

new URL('http://example.com:8080')
// ✅ Valid
// port: "8080"

new URL('http://example.com:')
// ✅ Valid (empty port)
// port: ""

new URL('http://example.com:abc')
// ❌ Throws TypeError: Invalid URL

Why: Ports must be numeric or empty.

Implication: Empty port is valid but non-numeric ports are not.

10. Userinfo in URLs

URLs can contain username and password (deprecated for security).

new URL('http://user:pass@example.com')
// ✅ Valid
// username: "user"
// password: "pass"
// hostname: "example.com"

new URL('http://user@example.com')
// ✅ Valid
// username: "user"
// hostname: "example.com"

Why: The spec supports userinfo for backward compatibility.

Implication: Security risk - passwords in URLs are visible in logs, history, etc.
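If you do need to accept such URLs but want to keep credentials out of logs and storage, the URL property setters make stripping them straightforward. A small sketch:

```javascript
// Remove embedded credentials from a URL before logging or persisting it.
function stripCredentials(input) {
  const url = new URL(input);
  url.username = '';
  url.password = '';
  return url.href;
}

console.log(stripCredentials('http://user:pass@example.com/path'));
// "http://example.com/path"
```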

11. Fragment and Query Handling

Fragments (#) and queries (?) are always valid.

new URL('http://example.com#fragment')
// ✅ Valid
// hash: "#fragment"

new URL('http://example.com?query=value')
// ✅ Valid
// search: "?query=value"

new URL('http://example.com?#')
// ✅ Valid
// search: "" (an empty query serializes to the empty string)
// hash: "" (so does an empty fragment)
// href: "http://example.com/?#"

Why: Fragments and queries are standard URL components.

Implication: They survive parsing even when empty, so validate the whole URL, not just the domain.

12. Relative URLs Require Base

Relative URLs need a base URL to resolve against.

new URL('/path/to/resource')
// ❌ Throws TypeError: Invalid URL

new URL('/path/to/resource', 'http://example.com')
// ✅ Valid
// href: "http://example.com/path/to/resource"

new URL('//example.com/path')
// ❌ Throws TypeError: Invalid URL

new URL('//example.com/path', 'http://base.com')
// ✅ Valid (protocol-relative URL)
// href: "http://example.com/path"

Why: Relative URLs need context to be resolved.

Implication: Must provide base URL for relative URLs. (This was absolutely new for me)

Security Implications

XSS (Cross-Site Scripting)

JavaScript URLs can execute code:

new URL('javascript:alert(1)')
// ✅ Valid but DANGEROUS
// protocol: "javascript:"

// ❌ DANGEROUS - Could execute JavaScript
element.href = userInput;

// ✅ SAFE - Validates protocol
const url = new URL(userInput);
if (url.protocol !== 'http:' && url.protocol !== 'https:') {
  throw new Error('Invalid protocol');
}
element.href = url.href;

Resolutions

I ended up with a better way to parse and validate a URL before blindly fetching it, especially when it comes from external sources.

// ❌ DANGEROUS - Accepts many unexpected formats
function fetchUrl(userInput) {
  const url = new URL(userInput);
  return fetch(url);
}

// Basic SSRF protection:
// avoid requests to unintended destinations, e.g. internal resources
function isValidDomainName(hostname) {
  // Check it's not an IP address
  if (/^\d+\.\d+\.\d+\.\d+$/.test(hostname)) {
    return false;
  }

  // Check it's not IPv6 (in brackets)
  if (hostname.startsWith('[') && hostname.endsWith(']')) {
    return false;
  }

  // Check it contains at least one dot (has TLD)
  if (!hostname.includes('.')) {
    return false;
  }

  // Check it's not localhost
  if (hostname === 'localhost') {
    return false;
  }

  return true;
}

// ✅ SAFE - Validates protocol and hostname
function fetchUrl(userInput) {
  const url = new URL(userInput);

  // ✅ Only allow known-safe protocols
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    throw new Error('Invalid protocol');
  }

  // Check hostname is not empty
  if (!url.hostname) {
    throw new Error('Hostname required');
  }

  // Validate not IP / loopback / IPv6
  if (!isValidDomainName(url.hostname)) {
    throw new Error('Hostname must be a valid domain name');
  }

  return fetch(url);
}

Conclusion

As a developer working on web applications, knowing all the kinds of URLs that exist and count as valid should have been common knowledge for me. Unfortunately, many of these strange-but-valid URLs were new to me, and I was learning about most of the security vulnerabilities hiding behind a simple URL for the first time. For example, I had no idea about server-side request forgery (SSRF), but fortunately I am now much wiser.

Adopting property-based testing helped me discover some serious security holes in my application, and I hope these discoveries and examples help others avoid similar issues when developing their own web applications.