
How Spam Filters Work

Email Concepts Encyclopedia
ELI5: Imagine a bouncer at a club. Before you even reach the door, they check if you are on the banned list (blocklists). At the door, they check your ID (authentication). Inside, they look at how you are dressed and what you are saying (content filtering). And if the regulars keep complaining about you, you get thrown out regardless of everything else (engagement signals). Spam filters work in layers, each one catching what the previous one missed.

The filtering pipeline that every email passes through — from connection-time checks to machine-learning classifiers — and how major providers decide what reaches the inbox.

The Filtering Pipeline

Spam filtering is not a single check. It is a multi-stage pipeline that evaluates a message at every phase of the SMTP transaction and after delivery. Each stage can reject, defer, or flag a message. The stages roughly follow this order:

  1. Connection-time checks — IP reputation, blocklists, rate limiting
  2. Envelope checks — Sender verification, recipient validation
  3. Authentication checks — SPF, DKIM, DMARC evaluation
  4. Header analysis — Structural validation, consistency checks
  5. Content analysis — Body scanning, URL checking, attachment inspection
  6. Reputation scoring — Sender reputation weighted against all signals
  7. Machine learning classification — Bayesian and neural network models
  8. Post-delivery signals — Engagement, user actions, complaint feedback

Modern spam filters at providers like Gmail and Outlook run most of these in parallel, producing a composite score that determines inbox placement. But understanding them as a pipeline helps explain how each layer contributes.

Stage 1: Connection-Time Checks

Before a single byte of email content is transmitted, the receiving server evaluates the connecting IP address.

# Connection from a blocklisted IP
550 5.7.1 Service unavailable; client [198.51.100.42] blocked
using zen.spamhaus.org

Connection-time checks are the most cost-effective filter. Rejecting at connection saves the server from processing the entire message.
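A DNS blocklist check like the one behind the rejection above is usually an ordinary DNS query: the IPv4 octets are reversed and the list's zone is appended. The following is a minimal sketch under stated assumptions, not any provider's implementation. It handles IPv4 only, the function names are illustrative, and it treats any successful resolution as "listed", which is a simplification (some lists use specific return codes to signal query errors or which sub-list matched).

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str = "zen.spamhaus.org") -> bool:
    """Return True if the query name resolves, False if the lookup fails."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # NXDOMAIN (not listed) and resolver errors both end up here.
        return False


print(dnsbl_query_name("198.51.100.42"))  # 42.100.51.198.zen.spamhaus.org
```

Only `dnsbl_query_name` is deterministic. `is_listed` needs network access, and its answer depends on the resolver you query through.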

Stage 2: Envelope Checks

During the SMTP envelope phase (MAIL FROM and RCPT TO), additional checks run:

Stage 3: Authentication Checks

Once the message content arrives, the server evaluates email authentication:

Authentication results are recorded in the Authentication-Results header:

Authentication-Results: mx.google.com;
dkim=pass header.i=@example.com header.s=mtg;
spf=pass (google.com: 198.51.100.42 is permitted) smtp.mailfrom=example.com;
dmarc=pass (p=REJECT) header.from=example.com

Authentication is a prerequisite, not a guarantee. Passing SPF, DKIM, and DMARC does not mean your message reaches the inbox, because spammers can set up valid authentication too. But failing authentication is a strong negative signal that will almost certainly get your message routed to spam or rejected outright.
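To check these results programmatically, the header value can be reduced to a method-to-result mapping. This is a simplified sketch rather than a full parser for the Authentication-Results grammar: it assumes comments are not nested, and if a method appears more than once (for example, several DKIM signatures) it keeps only the last result. The names `auth_results` and `is_authenticated` are illustrative.

```python
import re

HEADER = (
    "mx.google.com; "
    "dkim=pass header.i=@example.com header.s=mtg; "
    "spf=pass (google.com: 198.51.100.42 is permitted) smtp.mailfrom=example.com; "
    "dmarc=pass (p=REJECT) header.from=example.com"
)


def auth_results(value: str) -> dict[str, str]:
    """Map each method (spf, dkim, dmarc) to its result keyword."""
    without_comments = re.sub(r"\([^()]*\)", "", value)  # drop "(...)" comments
    results = {}
    # The first ";"-separated part is the server identifier, not a result.
    for clause in without_comments.split(";")[1:]:
        match = re.match(r"\s*([\w-]+)\s*=\s*(\w+)", clause)
        if match:
            results[match.group(1).lower()] = match.group(2).lower()
    return results


def is_authenticated(results: dict[str, str]) -> bool:
    """Treat a DMARC pass as the minimum bar (an assumption of this sketch)."""
    return results.get("dmarc") == "pass"


print(auth_results(HEADER))
```

For the sample header, every method maps to `pass`. A `False` from `is_authenticated` is a reason to investigate, while a `True` only means the prerequisite is met.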

Stage 4: Header Analysis

Spam filters inspect message headers for anomalies:

Stage 5: Content Analysis

Content analysis examines the message body, HTML structure, and attachments.

Text and HTML analysis

URL and link analysis

Attachment analysis

Stage 6: Reputation Scoring

All of the above signals feed into a reputation model. This is where IP and domain reputation have their greatest impact.

Reputation acts as a multiplier. A sender with excellent reputation gets the benefit of the doubt — borderline content is delivered to the inbox. A sender with poor reputation gets no benefit of the doubt — even clean content may be filtered. This is why reputation is often more important than content.
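The multiplier idea can be made concrete with a toy model. Everything numeric here is hypothetical: the 0-to-1 reputation scale, the scaling formula, and the threshold of 5.0 are chosen only to illustrate the effect, and real providers do not publish their scoring.

```python
def placement(content_score: float, reputation: float, threshold: float = 5.0) -> str:
    """Decide placement from a content score and a sender reputation.

    content_score: higher means more spam-like.
    reputation: 0.0 (poor) to 1.0 (excellent).
    """
    if not 0.0 <= reputation <= 1.0:
        raise ValueError("reputation must be between 0.0 and 1.0")
    # Hypothetical scaling: excellent reputation halves the score,
    # poor reputation doubles it.
    multiplier = 2.0 - 1.5 * reputation
    return "spam" if content_score * multiplier >= threshold else "inbox"


print(placement(6.0, 1.0))  # borderline content, excellent sender
print(placement(3.0, 0.0))  # clean content, poor sender
```

In this model the borderline message from the excellent sender scores 3.0 and is delivered, while the clean message from the poor sender scores 6.0 and is filtered.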

Providers weigh signals differently:

Stage 7: Machine Learning Classification

Modern spam filters use machine learning models trained on billions of messages.

Bayesian filtering

The foundational technique. A Bayesian filter calculates the probability that a message is spam based on the frequency of its words (tokens) in known-spam versus known-ham corpora. If the word "invoice" appears in 80% of ham and 5% of spam, it is a strong ham signal. If "unsubscribe" appears alongside "Congratulations! You won!" the combined probability shifts toward spam.

Bayesian filters are adaptive — they learn from new messages. When a user marks a message as spam, the filter updates its probability tables. This per-user learning is why the same message might be filtered as spam for one user and delivered to the inbox for another.
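The arithmetic behind a Bayesian filter is short enough to sketch. The version below assumes equal priors for spam and ham and treats tokens as independent (the "naive" assumption); both are simplifications, and the function names are illustrative. Per-token probabilities must lie strictly between 0 and 1, so practical filters clamp them first.

```python
import math


def token_spam_probability(p_token_given_spam: float, p_token_given_ham: float) -> float:
    """P(spam | token) by Bayes' rule, assuming equal priors for spam and ham."""
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)


def combined_spam_probability(token_probs: list[float]) -> float:
    """Combine per-token probabilities under the naive independence assumption."""
    # Work in log space so many small factors do not underflow.
    log_spam = sum(math.log(p) for p in token_probs)
    log_ham = sum(math.log(1.0 - p) for p in token_probs)
    logit = log_spam - log_ham
    # Numerically stable logistic function: never exponentiates a large positive number.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    return math.exp(logit) / (1.0 + math.exp(logit))


# "invoice": seen in 5% of spam and 80% of ham, as in the example above.
p_invoice = token_spam_probability(0.05, 0.80)
print(round(p_invoice, 3))  # 0.059

# 0.9 is a hypothetical probability for a strongly spammy phrase.
print(round(combined_spam_probability([p_invoice, 0.9]), 2))  # 0.36
```

On its own, "invoice" gives a spam probability of about 0.059. Paired with a token at 0.9, the combined estimate rises to about 0.36, which shows how a single strong token shifts the result.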

Neural network models

Major providers now use deep learning models that go far beyond individual word frequencies. These models evaluate:

Google, for example, reports that its filters block more than 99.9% of spam before it reaches any inbox, with a false-positive rate below 0.05%. Results like these depend on large-scale machine learning.

Stage 8: Post-Delivery Signals

Filtering does not stop when the message hits the inbox. Post-delivery signals continuously refine placement:

Engagement-based filtering creates a feedback loop: if your early messages to a new subscriber are not opened, future messages are more likely to be filtered. This is why IP warming advice always says to start with your most engaged recipients.

How Major Providers Differ

Gmail

Gmail's filtering is the most sophisticated and the most engagement-driven. Key characteristics:

Outlook.com / Microsoft 365

Yahoo / AOL

Spam Filter Testing and Debugging

When your messages land in spam, you need a systematic approach to diagnose the cause.

Reading filter headers

Most spam filters add headers to the message that reveal their verdict. Send a test message to yourself and inspect the raw headers:

# Gmail adds these headers (visible in "Show original"):
X-Gm-Message-State: [internal state data]
X-Google-DKIM-Signature: [Google's own signature]
Authentication-Results: mx.google.com;
spf=pass ... dkim=pass ... dmarc=pass

# Microsoft adds:
X-Microsoft-Antispam: BCL:0;
X-MS-Exchange-Organization-SCL: 1
# SCL (Spam Confidence Level): -1=filtering bypassed, 0-1=not spam, 5-6=spam, 7-9=high-confidence spam

# SpamAssassin (open source, widely used) adds:
X-Spam-Status: No, score=-1.2 required=5.0
tests=DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,SPF_PASS,
RCVD_IN_DNSWL_LOW autolearn=ham

These headers show which tests were applied and how they scored, to the extent each filter exposes that detail. The Authentication-Results header is standardized; the spam-score headers are filter-specific.
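When you collect many test messages, parsing the verdict header is easier than reading it by eye. The sketch below handles the SpamAssassin X-Spam-Status value shown above. It assumes the `score`, `required`, and `tests` fields are present in that form, and the function name `parse_spam_status` is illustrative.

```python
import re

HEADER = (
    "No, score=-1.2 required=5.0 "
    "tests=DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,SPF_PASS, "
    "RCVD_IN_DNSWL_LOW autolearn=ham"
)


def parse_spam_status(value: str) -> dict:
    """Extract verdict, score, threshold, and rule names from X-Spam-Status."""
    # Folded headers leave whitespace after commas; remove it so the
    # tests list becomes one unbroken token.
    unfolded = re.sub(r",\s+", ",", value)
    verdict = unfolded.split(",", 1)[0].strip()
    fields = dict(re.findall(r"(\w+)=(\S+)", unfolded))
    return {
        "is_spam": verdict.lower() == "yes",
        "score": float(fields["score"]),
        "required": float(fields["required"]),
        "tests": [t for t in fields.get("tests", "").split(",") if t],
    }


status = parse_spam_status(HEADER)
print(status["score"], status["required"], status["tests"])
```

Comparing `score` against `required` shows how much margin a message has, and `tests` lists the rules to look up when the score is too high.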

Seed testing

Send test messages to accounts at multiple providers (Gmail, Outlook, Yahoo, corporate servers) and check whether they land in the inbox or spam. Do this before every major campaign or infrastructure change. Several third-party services automate this with panels of test addresses across dozens of providers.

Isolating the variable

If a message lands in spam, change one variable at a time to identify the trigger:

What Can Go Wrong

Legitimate email filtered as spam

Your transactional emails (password resets, order confirmations) land in spam because your marketing emails on the same domain tanked your domain reputation. The fix: consider separating transactional and marketing email onto different subdomains so reputation damage from marketing does not affect critical transactional delivery.

Content triggers on legitimate content

Your invoice email contains the word "payment" plus an attachment plus a link — all legitimate, but the combination scores high. The fix: ensure strong authentication and reputation so that content signals are evaluated in the context of a trusted sender.

Engagement death spiral

You send to a large list of inactive subscribers. Few open your email. The low engagement rate causes providers to move subsequent messages to spam. Even fewer people see them. Open rates drop further. More messages go to spam. The fix: run re-engagement campaigns for subscribers who are drifting toward inactivity, and regularly prune those who do not respond.

URL blocklisting

A domain linked in your emails gets blocklisted (perhaps your tracking domain, or a shared link shortener). Every email containing that link is now flagged. The fix: use your own domain for tracking links, monitor link reputation, and avoid shared URL shorteners in email.
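A pre-send check can list every link host in a message that is not under your own domain, so shared shorteners and third-party tracking hosts are easy to spot. This is a minimal sketch using only the standard library. The names `LinkHosts` and `third_party_hosts` and the example hosts are hypothetical, and it inspects only `<a href>` links, not images or plain-text URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkHosts(HTMLParser):
    """Collect the hostname of every <a href> in an HTML body."""

    def __init__(self) -> None:
        super().__init__()
        self.hosts: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        host = urlsplit(href).hostname  # None for mailto: and relative links
        if host:
            self.hosts.add(host.lower())


def third_party_hosts(html: str, own_domain: str) -> set[str]:
    """Return link hosts that are neither own_domain nor one of its subdomains."""
    parser = LinkHosts()
    parser.feed(html)
    return {
        h for h in parser.hosts
        if h != own_domain and not h.endswith("." + own_domain)
    }


BODY = (
    '<a href="https://click.example.com/t/1">Track</a> '
    '<a href="https://short.example.net/abc">Short link</a>'
)
print(third_party_hosts(BODY, "example.com"))  # {'short.example.net'}
```

Each host the check returns is a domain whose reputation you do not control but that can still affect your delivery.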
