Engineering

How we built a disposable email dataset

Josselin Liebe

When it comes to blocking fake accounts, the first step is usually the same: verify the email. But to catch a disposable email, you first need to know which domains are disposable. Public lists on GitHub cover the obvious cases, but they are rarely up to date. New services appear every week, with fresh domains that fly under the radar.

We decided to build our own dataset at WebAPI. Here is how.

Why fake accounts are a real business problem

Fake accounts are not just a technical issue. They have a direct cost on the business.

  • CRM noise: Every fake account created with a disposable email lands in the sales pipeline. Sales teams waste time qualifying leads that do not exist (HubSpot, Salesforce, etc.), conversion metrics get skewed, and the signal-to-noise ratio in the CRM degrades. For a SaaS with thousands of signups per month, that can mean dozens of wasted hours.

  • Free trial and resource abuse: Disposable emails let a single user create accounts in a loop to abuse free trials indefinitely, stack up free credits, or bypass quotas. For an API product like ours, that is resource consumption with zero conversion behind it.

  • Polluted product metrics: Fake accounts artificially inflate signup numbers and distort activation (e.g. a Mixpanel or Metabase dashboard), retention, and conversion rates. You cannot make reliable product decisions with cohorts full of ghost accounts.

  • Reputation and deliverability risks: Sending onboarding emails to disposable addresses is sending into the void. Bounces pile up, deliverability drops, and your sending domain reputation takes a hit with email providers.

The problem: providers that move fast

Temporary email services do not all work the same way. Some expose a fixed list of domains in a dropdown menu. Others generate random addresses on every visit, rotating through domains. A few inject their domains dynamically via JavaScript, invisible in the raw HTML.

Building a reliable dataset means adapting to each of these behaviors.

Step 1: multi-strategy collection

We target 20+ disposable email providers. For each one, we deploy the extraction strategy best suited to how it works.

Standard HTTP requests: For sites that expose their domains directly in the HTML (dropdown menus, lists, pre-filled input fields), a simple HTTP call is enough. We parse the DOM and extract the domains.
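As a minimal sketch of this static-HTML strategy (the domain names, sample markup, and the `extract_domains` helper are illustrative, not our production code):

```python
import re
from html.parser import HTMLParser

class DomainExtractor(HTMLParser):
    """Collects candidate domains from <option> entries and pre-filled <input> fields."""

    def __init__(self):
        super().__init__()
        self.domains = set()
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "option":
            self._in_option = True
            if attrs.get("value"):
                self._collect(attrs["value"])
        elif tag == "input" and attrs.get("value"):
            self._collect(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False

    def handle_data(self, data):
        if self._in_option:
            self._collect(data)

    def _collect(self, text):
        # Accept bare domains ("tempmail.example") and full addresses ("a@tempmail.example")
        candidate = text.strip().lower().split("@")[-1]
        if re.fullmatch(r"[a-z0-9-]+(?:\.[a-z0-9-]+)+", candidate):
            self.domains.add(candidate)

def extract_domains(html: str) -> set[str]:
    parser = DomainExtractor()
    parser.feed(html)
    return parser.domains
```

In practice the HTML would come from a plain HTTP GET; feeding the response body to `extract_domains` covers both the dropdown and the pre-filled input cases in one pass.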

Headless browsers: Many modern providers load their content via JavaScript. The static HTML contains nothing useful. We use headless browsers that execute JS, wait for the full render, then give us access to the final DOM.

Screenshots + OCR: Some services go further: the domain does not exist anywhere in the DOM and is rendered only inside a canvas or an image. In these cases, we take a screenshot and extract text via OCR.

Direct API calls: When a service exposes an API (public or semi-public) to generate addresses, we use it directly.
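The four strategies can be tied together with a simple dispatch table. A sketch, where the provider names and strategy keys are invented for illustration:

```python
# Hypothetical provider registry: each entry names the extraction strategy
# best suited to how that provider exposes its domains.
PROVIDERS = {
    "tempmail.example": "http",        # domains visible in static HTML
    "quickinbox.example": "headless",  # domains injected via JavaScript
    "canvasmail.example": "ocr",       # domains rendered inside a canvas
    "apimail.example": "api",          # semi-public address-generation API
}

def collect_all(strategies: dict) -> set[str]:
    """Run the matching extractor for every provider and merge the results.

    `strategies` maps a strategy key to a callable taking a provider name
    and returning the set of domains it found.
    """
    domains = set()
    for provider, strategy in PROVIDERS.items():
        extractor = strategies[strategy]
        domains |= extractor(provider)
    return domains
```

Keeping the strategy choice in data rather than code makes it cheap to reclassify a provider when it changes how it serves its domains.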

Step 2: smart domain extraction

Getting the HTML or screenshot is not enough. We still need to extract domains reliably. We apply multiple extraction layers in sequence:

  1. Input fields: we scan <input> elements for pre-filled email addresses and extract the domain
  2. Dropdown menus: <select> elements often contain the full list of available domains
  3. Links and raw text: as a last resort, we search for @domain.tld patterns across all visible text
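The last-resort layer boils down to a single pattern scan over visible text. A sketch; the pattern is deliberately permissive, because the filtering step downstream is what keeps the noise out:

```python
import re

# Layer 3: find "@domain.tld" patterns anywhere in visible text.
EMAIL_DOMAIN = re.compile(r"@([a-z0-9-]+(?:\.[a-z0-9-]+)+)", re.IGNORECASE)

def domains_from_text(text: str) -> set[str]:
    """Extract every domain that appears after an '@' in raw text."""
    return {domain.lower() for domain in EMAIL_DOMAIN.findall(text)}
```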

Step 3: false positive filtering

This is arguably the most critical step. Aggressive extraction produces noise: domains that are not actually disposable. Without filtering, we risk blocking legitimate users, something we cannot afford.

Our pipeline applies several filters:

  • Exclusion list: major legitimate domains (gmail.com, outlook.com, etc.) and infrastructure domains (googleapis.com, cloudflare.com, etc.) are systematically excluded
  • Structural validation: we verify that the domain has a valid format, a recognized TLD, and does not look like a static asset (.js, .css, .png)
  • DNS verification: we validate the presence of MX records to confirm the domain can actually receive email
  • Confidence scoring: each domain is tagged with every source that reported it. A domain seen across multiple independent providers is almost certainly disposable

This multi-layered filtering lets us maintain a very low false positive rate, even with aggressive extraction.
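The filters above can be sketched as follows. The exclusion list is truncated and the confidence threshold is invented for illustration; the DNS/MX check is only noted in a comment, since it needs a resolver library and live network access:

```python
import re

EXCLUDED = {"gmail.com", "outlook.com", "googleapis.com", "cloudflare.com"}
ASSET_EXTENSIONS = (".js", ".css", ".png")
DOMAIN_RE = re.compile(r"[a-z0-9-]+(?:\.[a-z0-9-]+)+")

def keep_domain(domain: str) -> bool:
    """Apply the exclusion-list and structural-validation filters."""
    domain = domain.strip().lower()
    if domain in EXCLUDED:                  # known legitimate / infrastructure domain
        return False
    if domain.endswith(ASSET_EXTENSIONS):   # looks like a static asset, not a domain
        return False
    if not DOMAIN_RE.fullmatch(domain):     # malformed domain
        return False
    # DNS verification would go here: resolve MX records (e.g. with dnspython)
    # and reject domains that cannot receive email.
    return True

def confidence(sources: set[str]) -> float:
    """More independent sources reporting a domain -> higher confidence, capped at 1.0."""
    return min(1.0, len(sources) / 3)
```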

Step 4: deduplication and indexing

The same domains often appear across multiple providers. Before enriching our database, we merge results: each domain is normalized and deduplicated, and we keep the full list of its sources.
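A minimal sketch of that merge step (provider names are invented; normalization here is just lowercasing and trimming a trailing dot):

```python
from collections import defaultdict

def merge_results(runs: list[tuple[str, set[str]]]) -> dict[str, list[str]]:
    """Merge per-provider results into {normalized domain: sorted list of sources}."""
    sources = defaultdict(set)
    for provider, domains in runs:
        for domain in domains:
            normalized = domain.strip().lower().rstrip(".")
            sources[normalized].add(provider)
    return {domain: sorted(provs) for domain, provs in sorted(sources.items())}
```

Once merged, a plain set of the keys gives O(1) membership tests, which is what makes sub-millisecond lookups realistic once the data is indexed.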

The result is indexed in a search engine optimized for real-time queries. When our API receives a validation request, the lookup against this dataset takes less than one millisecond.

Step 5: monitoring and continuous updates

The disposable email landscape changes constantly. Domains disappear, new ones appear. We run this pipeline automatically and on a regular schedule to capture these changes.

Each run produces a report: number of providers processed, domains discovered, any errors. If a provider changes its interface or blocks our requests, we are alerted immediately and adapt our strategy.

The numbers

Today, our dataset contains over 380,000 verified disposable domains from 20+ different providers. It directly powers the disposable field in our Email Validation API.

What this means for you

When you call GET /v1/intelligence/email with an address, the disposable field is checked against this dataset in real time. Not a static list downloaded six months ago, but a living dataset, updated every day by automated pipelines.

import requests

# Query the Email Validation API; the disposable field is resolved
# against the dataset described above.
response = requests.get(
    "https://api.veille.io/v1/intelligence/email",
    params={"query": "user@guerrillamail.com"},
    headers={"x-api-key": "YOUR_API_KEY"},
)
response.raise_for_status()
data = response.json()

if data["disposable"]:
    print(f"Disposable email detected (risk: {data['risk_score']})")

Blocking disposable emails removes the first lever attackers use to create fake accounts at scale. And it starts with a dataset that holds up.