lib/Email/Abuse/Investigator.pm

Structural Coverage (Approximate)

TER1 (Statement): 85.80%
TER2 (Branch): 72.05%
TER3 (LCSAJ): 96.2% (230/239)
Approximate LCSAJ segments: 663

LCSAJ Legend

โ— Covered โ€” this LCSAJ path was executed during testing.

โ— Not covered โ€” this LCSAJ path was never executed. These are the paths to focus on.

Multiple dots on a line indicate that multiple control-flow paths begin at that line. Hovering over any dot shows:

        start โ†’ end โ†’ jump
        

Uncovered paths show [NOT COVERED] in the tooltip.

Mutant Testing Legend

Survived (tests missed this) Killed (tests detected this) No mutation
    1: package Email::Abuse::Investigator;
    2: 
    3: use strict;
    4: use warnings;
    5: use autodie qw(:all);
    6: use Time::Piece;
    7: 
    8: =head1 NAME
    9: 
   10: Email::Abuse::Investigator - Analyse spam email to identify originating hosts,
   11: hosted URLs, and suspicious domains
   12: 
   13: =head1 VERSION
   14: 
   15: Version 0.08
   16: 
   17: =cut
   18: 
   19: our $VERSION = '0.08';
   20: 
   21: =head1 SYNOPSIS
   22: 
   23:     use Email::Abuse::Investigator;
   24: 
   25:     my $analyser = Email::Abuse::Investigator->new( verbose => 1 );
   26:     $analyser->parse_email($raw_email_text);
   27: 
   28:     # Originating IP and its network owner
   29:     my $origin = $analyser->originating_ip();
   30: 
   31:     # All HTTP/HTTPS URLs found in the body
   32:     my @urls  = $analyser->embedded_urls();
   33: 
   34:     # All domains extracted from mailto: links and bare addresses in the body
   35:     my @mdoms = $analyser->mailto_domains();
   36: 
   37:     # All domains mentioned anywhere (union of the above)
   38:     my @adoms = $analyser->all_domains();
   39: 
   40:     # Full printable report
   41:     print $analyser->report();
   42: 
   43: =head1 DESCRIPTION
   44: 
   45: C<Email::Abuse::Investigator> examines the raw source of a spam/phishing e-mail
   46: and answers the questions manual abuse investigators ask:
   47: 
   48: =over 4
   49: 
   50: =item 1. Where did the message really come from?
   51: 
   52: Walks the C<Received:> chain, skips private/trusted IPs, and identifies the
   53: first external hop.  Enriches with rDNS, WHOIS/RDAP org name and abuse
   54: contact.  Both IPv4 and IPv6 addresses are supported.
   55: 
   56: =item 2. Who hosts the advertised web sites?
   57: 
   58: Extracts every C<http://> and C<https://> URL from both plain-text and HTML
   59: parts, resolves each hostname to an IP, and looks up the network owner.
   60: 
   61: =item 3. Who owns the reply-to / contact domains?
   62: 
   63: Extracts domains from C<mailto:> links, bare e-mail addresses in the body,
   64: the C<From:>/C<Reply-To:>/C<Sender:>/C<Return-Path:> headers, C<DKIM-Signature: d=>
   65: (the signing domain), C<List-Unsubscribe:> (the ESP or bulk-sender domain), and the
   66: C<Message-ID:> domain.  For each unique domain it gathers:
   67: 
   68: =over 8
   69: 
   70: =item * Domain registrar and registrant (WHOIS)
   71: 
   72: =item * Web-hosting IP and network owner (A record -> RDAP)
   73: 
   74: =item * Mail-hosting IP and network owner (MX record -> RDAP)
   75: 
   76: =item * DNS nameserver operator (NS record -> RDAP)
   77: 
   78: =item * Whether the domain was recently registered (potential flag)
   79: 
   80: =back
   81: 
   82: =back
   83: 
   84: =head1 REQUIRED MODULES
   85: 
   86: The following modules are mandatory:
   87: 
   88:     Readonly::Values::Months
   89:     Socket              (core since Perl 5)
   90:     IO::Socket::INET    (core since Perl 5)
   91:     MIME::QuotedPrint   (core since Perl 5.8)
   92:     MIME::Base64        (core since Perl 5.8)
   93: 
   94: The following are optional but strongly recommended:
   95: 
   96:     Net::DNS            -- enables MX, NS, AAAA record lookups
   97:     LWP::UserAgent      -- enables RDAP (faster and richer than raw WHOIS)
   98:     HTML::LinkExtor     -- enables structural HTML link extraction
   99:     CHI                 -- enables cross-message IP/domain result caching
  100:     IO::Socket::IP      -- enables IPv6 WHOIS connections
  101: 
  102: =cut
  103: 
  104: use Carp qw(croak carp);
  105: use IO::Select;
  106: use IO::Socket::INET;
  107: use MIME::QuotedPrint qw( decode_qp );
  108: use MIME::Base64 qw( decode_base64 );
  109: use Object::Configure;
  110: use Params::Get;
  111: use Params::Validate::Strict;
  112: use Readonly;
  113: use Readonly::Values::Months;
  114: use Socket qw( inet_aton inet_ntoa AF_INET );
  115: 
  116: # -----------------------------------------------------------------------
  117: # Optional modules -- gracefully degraded when absent
  118: # -----------------------------------------------------------------------
  119: 
  120: # Net::DNS enables MX, NS, AAAA lookups; falls back to gethostbyname
  121: my $HAS_NET_DNS;
  122: 
  123: # LWP::UserAgent enables RDAP queries; falls back to raw WHOIS
  124: my $HAS_LWP;
  125: my $HAS_CONN_CACHE;
  126: 
  127: # HTML::LinkExtor enables structural HTML link extraction
  128: my $HAS_HTML_LINKEXTOR;
  129: 
  130: # CHI enables a persistent cross-message cache for IP/domain data
  131: my $HAS_CHI;
  132: 
  133: # IO::Socket::IP provides dual-stack (IPv4+IPv6) socket support
  134: my $HAS_IO_SOCKET_IP;
  135: 
  136: # Domain::PublicSuffix enables accurate eTLD+1 normalisation
  137: my $HAS_PUBLIC_SUFFIX;
  138: 
  139: # AnyEvent::DNS enables parallel DNS queries
  140: my $HAS_ANYEVENT_DNS;
  141: 
  142: BEGIN {
  143: 	$HAS_NET_DNS       = eval { require Net::DNS;           1 };
  144: 	$HAS_LWP           = eval { require LWP::UserAgent;     1 };
  145: 	$HAS_CONN_CACHE    = eval { require LWP::ConnCache;     1 };
  146: 	$HAS_HTML_LINKEXTOR= eval { require HTML::LinkExtor;    1 };
  147: 	$HAS_CHI           = eval { require CHI;                1 };
  148: 	$HAS_IO_SOCKET_IP  = eval { require IO::Socket::IP;     1 };
  149: 	$HAS_PUBLIC_SUFFIX = eval { require Domain::PublicSuffix; 1 };
  150: 	$HAS_ANYEVENT_DNS  = eval { require AnyEvent::DNS;      1 };
  151: }
  152: 
  153: # -----------------------------------------------------------------------
  154: # Constants -- all magic numbers and strings live here
  155: # -----------------------------------------------------------------------
  156: 
  157: # WHOIS protocol port (IANA-assigned)
  158: Readonly::Scalar my $WHOIS_PORT        => 43;
  159: 
  160: # Bytes to read per sysread() call from a WHOIS socket
  161: Readonly::Scalar my $WHOIS_READ_CHUNK  => 4096;
  162: 
  163: # Maximum WHOIS response bytes stored in whois_raw (keep reports compact)
  164: Readonly::Scalar my $WHOIS_RAW_MAX     => 2048;
  165: 
  166: # Maximum multipart nesting depth (recursion guard -- RFC 2046 has no limit
  167: # but we cap it to prevent stack exhaustion on crafted messages)
  168: Readonly::Scalar my $MAX_MULTIPART_DEPTH => 20;
  169: 
  170: # Number of days before registration that triggers recently_registered flag
  171: Readonly::Scalar my $RECENT_REG_DAYS   => 180;
  172: 
  173: # Number of days ahead of expiry that triggers domain_expires_soon flag
  174: Readonly::Scalar my $EXPIRY_WARN_DAYS  => 30;
  175: 
  176: # Seconds in a day -- used in date arithmetic throughout
  177: Readonly::Scalar my $SECS_PER_DAY      => 86400;
  178: 
  179: # Suspicious date window: dates outside +/- 7 days raise a flag
  180: Readonly::Scalar my $DATE_SKEW_DAYS    => 7;
  181: 
  182: # Maximum positive timezone offset in minutes (+14:00 = Line Islands)
  183: Readonly::Scalar my $TZ_MAX_POS_MINS   => 840;
  184: 
  185: # Maximum negative timezone offset in minutes (-12:00 = Baker Island)
  186: Readonly::Scalar my $TZ_MAX_NEG_MINS   => 720;
  187: 
  188: # High-risk score threshold
  189: Readonly::Scalar my $SCORE_HIGH        => 9;
  190: 
  191: # Medium-risk score threshold
  192: Readonly::Scalar my $SCORE_MEDIUM      => 5;
  193: 
  194: # Low-risk score threshold
  195: Readonly::Scalar my $SCORE_LOW         => 2;
  196: 
  197: # Flag severity weights (contribute to the numeric risk score)
  198: Readonly::Hash my %FLAG_WEIGHT => (
  199: 	HIGH   => 3,
  200: 	MEDIUM => 2,
  201: 	LOW    => 1,
  202: 	INFO   => 0,
  203: );
  204: 
  205: # Maximum merged-role display string length before summarisation kicks in
  206: Readonly::Scalar my $ROLE_MAX_LEN      => 80;
  207: 
  208: # CHI cache TTL in seconds (1 hour -- IP allocations change slowly)
  209: Readonly::Scalar my $CACHE_TTL_SECS    => 3600;
  210: 
  211: # Default constructor timeout for network operations (seconds)
  212: Readonly::Scalar my $DEFAULT_TIMEOUT   => 10;
  213: 
  214: # Maximum role string length before truncation
  215: Readonly::Scalar my $ROLE_WRAP_LEN     => 66;
  216: 
  217: # -----------------------------------------------------------------------
  218: # Private ranges -- IPs that are never actionable abuse targets
  219: # -----------------------------------------------------------------------
  220: 
  221: # Both IPv4 and IPv6 private/reserved ranges.  Each entry is a compiled
  222: # regex; _is_private() iterates over them and returns true on first match.
  223: my @PRIVATE_RANGES = (
  224: 	# IPv4 ranges
  225: 	qr/^0\./,                         # 0.0.0.0/8  this-network (RFC 1122)
  226: 	qr/^127\./,                       # 127.0.0.0/8 loopback
  227: 	qr/^10\./,                        # 10.0.0.0/8  RFC 1918
  228: 	qr/^192\.168\./,                  # 192.168.0.0/16 RFC 1918
  229: 	qr/^172\.(?:1[6-9]|2\d|3[01])\./, # 172.16.0.0/12  RFC 1918
  230: 	qr/^169\.254\./,                  # 169.254.0.0/16 link-local
  231: 	qr/^100\.(?:6[4-9]|[7-9]\d|1(?:[01]\d|2[0-7]))\./,  # 100.64.0.0/10 CGN (RFC 6598)
  232: 	qr/^192\.0\.0\./,                 # 192.0.0.0/24  IETF protocol (RFC 6890)
  233: 	qr/^192\.0\.2\./,                 # 192.0.2.0/24  TEST-NET-1 (RFC 5737)
  234: 	qr/^198\.51\.100\./,              # 198.51.100.0/24 TEST-NET-2 (RFC 5737)
  235: 	qr/^203\.0\.113\./,               # 203.0.113.0/24 TEST-NET-3 (RFC 5737)
  236: 	qr/^255\./,                       # 255.0.0.0/8 broadcast
  237: 	# IPv6 ranges
  238: 	qr/^::1$/,                         # IPv6 loopback
  239: 	qr/^fe80:/i,                       # IPv6 link-local (fe80::/10)
  240: 	qr/^fc/i,                          # IPv6 ULA fc00::/7
  241: 	qr/^fd/i,                          # IPv6 ULA fd00::/8
  242: 	qr/^2001:db8:/i,                   # IPv6 documentation range (RFC 3849)
  243: 	qr/^64:ff9b:/i,                    # IPv6 NAT64 well-known prefix
  244: );
  245: 
  246: # Priority-ordered patterns for extracting IPs from Received: headers.
  247: # Covers bracketed IPv4, bracketed IPv6, parenthesised address, and bare dotted-quad.
  248: my @RECEIVED_IP_RE = (
  249: 	qr/\[\s*([\d.]+)\s*\]/,                          # [1.2.3.4]
  250: 	qr/\[\s*([0-9a-fA-F:]+)\s*\]/,                  # [IPv6 address]
  251: 	qr/\(\s*[\w.-]*\s*\[?\s*([\d.]+)\s*\]?\s*\)/,   # (hostname [1.2.3.4])
  252: 	qr/from\s+[\w.-]+\s+([\d.]+)/,                  # from hostname addr
  253: 	qr/([\d]{1,3}\.[\d]{1,3}\.[\d]{1,3}\.[\d]{1,3})/, # bare dotted-quad fallback
  254: );
  255: 
  256: # -----------------------------------------------------------------------
  257: # Default configuration -- overridable via Object::Configure
  258: # -----------------------------------------------------------------------
  259: 
  260: # Object::Configure may overlay
  261: # values from a file before new() uses them.  Use Readonly for constants
  262: # that should never be overridden at runtime.
  263: 
  264: # -----------------------------------------------------------------------
  265: # Trusted domains (infrastructure -- never report these as abuse targets)
  266: # Can be overrideen at runtime by Object::Configure
  267: # -----------------------------------------------------------------------
  268: 
  269: my %TRUSTED_DOMAINS = map { $_ => 1 } qw(
  270: 	gmail.com googlemail.com yahoo.com outlook.com hotmail.com
  271: 	google.com microsoft.com apple.com amazon.com
  272: 	googlegroups.com groups.google.com
  273: 	w3.org
  274: 	fedex.com ups.com dhl.com usps.com royalmail.com
  275: );
  276: 
  277: # -----------------------------------------------------------------------
  278: # URL shortener domains (real destination is hidden behind these)
  279: # -----------------------------------------------------------------------
  280: 
  281: my %URL_SHORTENERS = map { $_ => 1 } qw(
  282: 	bit.ly      bitly.com   tinyurl.com  t.co        ow.ly
  283: 	goo.gl      is.gd       buff.ly      ift.tt       dlvr.it
  284: 	short.link  rebrand.ly  tiny.cc      cutt.ly      rb.gy
  285: 	shorturl.at bl.ink      smarturl.it  yourls.org   clicky.me
  286: 	snip.ly     adf.ly      bc.vc        lnkd.in      fb.me
  287: 	youtu.be
  288: );
  289: 
  290: # -----------------------------------------------------------------------
  291: # Well-known provider abuse contacts
  292: # Can be overrideen at runtime by Object::Configure
  293: # -----------------------------------------------------------------------
  294: 
  295: # Curated table of provider abuse contacts.  Entries with only a 'form'
  296: # key (no 'email') require web-form submission; abuse_contacts() suppresses
  297: # email addresses for those providers and form_contacts() surfaces them.
  298: my %PROVIDER_ABUSE = (
  299: 	# Google / Gmail
  300: 	'google.com'        => { email => 'abuse@google.com',      note => 'Also report Gmail accounts via https://support.google.com/mail/contact/abuse' },
  301: 	'gmail.com'         => { email => 'abuse@google.com',      note => 'Report Gmail spam via https://support.google.com/mail/contact/abuse' },
  302: 	'googlemail.com'    => { email => 'abuse@google.com',      note => 'Report via https://support.google.com/mail/contact/abuse' },
  303: 	'1e100.net'         => { email => 'abuse@google.com',      note => 'Google infrastructure' },
  304: 	'blogspot.com'      => { email => 'abuse@google.com',      note => 'Blogger/Blogspot -- report via https://support.google.com/blogger/answer/76315' },
  305: 	'blogger.com'       => { email => 'abuse@google.com',      note => 'Blogger platform abuse' },
  306: 	'sites.google.com'  => { email => 'abuse@google.com',      note => 'Google Sites hosted content' },
  307: 	# Microsoft
  308: 	'microsoft.com'     => { email => 'abuse@microsoft.com',   note => 'Also report via https://www.microsoft.com/en-us/wdsi/support/report-unsafe-site' },
  309: 	'outlook.com'       => { email => 'abuse@microsoft.com',   note => 'Report Outlook spam: https://support.microsoft.com/en-us/office/report-phishing' },
  310: 	'hotmail.com'       => { email => 'abuse@microsoft.com',   note => 'Report via https://support.microsoft.com/en-us/office/report-phishing' },
  311: 	'live.com'          => { email => 'abuse@microsoft.com',   note => 'Microsoft consumer mail' },
  312: 	'office365.com'     => { email => 'abuse@microsoft.com',   note => 'Microsoft 365 infrastructure' },
  313: 	'protection.outlook.com' => { email => 'abuse@microsoft.com', note => 'Microsoft EOP gateway' },
  314: 	# Yahoo
  315: 	'yahoo.com'         => { email => 'abuse@yahoo-inc.com',   note => 'Also use https://io.help.yahoo.com/contact/index' },
  316: 	'yahoo.co.uk'       => { email => 'abuse@yahoo-inc.com',   note => 'Yahoo UK' },
  317: 	# Apple
  318: 	'apple.com'         => { email => 'reportphishing@apple.com', note => 'iCloud / Apple Mail abuse' },
  319: 	'icloud.com'        => { email => 'reportphishing@apple.com', note => 'iCloud abuse' },
  320: 	'me.com'            => { email => 'reportphishing@apple.com', note => 'Apple legacy mail' },
  321: 	# Amazon / AWS
  322: 	'amazon.com'        => { email => 'abuse@amazonaws.com',   note => 'Also https://aws.amazon.com/forms/report-abuse' },
  323: 	'amazonaws.com'     => { email => 'abuse@amazonaws.com',   note => 'AWS abuse form: https://aws.amazon.com/forms/report-abuse' },
  324: 	'amazonses.com'     => { email => 'abuse@amazonaws.com',   note => 'Amazon SES sending infrastructure' },
  325: 	# Cloudflare
  326: 	'cloudflare.com'    => { email => 'abuse@cloudflare.com',  note => 'Report via https://www.cloudflare.com/abuse/' },
  327: 	# Fastly / Akamai
  328: 	'fastly.net'        => { email => 'abuse@fastly.com',      note => 'Fastly CDN' },
  329: 	'akamai.com'        => { email => 'abuse@akamai.com',      note => 'Akamai CDN' },
  330: 	'akamaitechnologies.com' => { email => 'abuse@akamai.com', note => 'Akamai CDN' },
  331: 	# Namecheap
  332: 	'namecheap.com'     => { email => 'abuse@namecheap.com',   note => 'Registrar abuse' },
  333: 	# GoDaddy -- web form only; email bounces
  334: 	'godaddy.com'       => {
  335: 		form        => 'https://supportcenter.godaddy.com/AbuseReport',
  336: 		form_paste  => 'Select the abuse type (spam, phishing, malware etc). '
  337: 		             . 'Enter the domain name in the Domain field. '
  338: 		             . 'Paste the originating IP, risk flags, and the relevant '
  339: 		             . 'Received: headers from the report below.',
  340: 		form_upload => 'Take a screenshot of the report as a .png or .jpg, '
  341: 		             . 'or export it as a .pdf.',
  342: 		note        => 'Registrar/host -- email reports not monitored, use web form',
  343: 	},
  344: 	# SendGrid / Twilio
  345: 	'sendgrid.net'      => { email => 'abuse@sendgrid.com',    note => 'ESP -- include full headers' },
  346: 	'sendgrid.com'      => { email => 'abuse@sendgrid.com',    note => 'ESP -- include full headers' },
  347: 	# Mailchimp / Mandrill
  348: 	'mailchimp.com'     => { email => 'abuse@mailchimp.com',   note => 'ESP abuse' },
  349: 	'mandrillapp.com'   => { email => 'abuse@mailchimp.com',   note => 'Mandrill transactional ESP' },
  350: 	# OVH
  351: 	'ovh.net'           => { email => 'abuse@ovh.net',         note => 'OVH hosting' },
  352: 	'ovh.com'           => { email => 'abuse@ovh.com',         note => 'OVH hosting' },
  353: 	# Hetzner
  354: 	'hetzner.com'       => { email => 'abuse@hetzner.com',     note => 'Hetzner hosting' },
  355: 	# Digital Ocean
  356: 	'digitalocean.com'  => { email => 'abuse@digitalocean.com',note => 'DO abuse form: https://www.digitalocean.com/company/contact/#abuse' },
  357: 	# Linode / Akamai
  358: 	'linode.com'        => { email => 'abuse@linode.com',      note => 'Linode/Akamai Cloud' },
  359: 	# Constant Contact
  360: 	'constantcontact.com' => { email => 'abuse@constantcontact.com', note => 'ESP abuse' },
  361: 	'r.constantcontact.com' => { email => 'abuse@constantcontact.com', note => 'Constant Contact sending infrastructure' },
  362: 	# HubSpot
  363: 	'hubspot.com'         => { email => 'abuse@hubspot.com',       note => 'ESP abuse' },
  364: 	'hs-analytics.net'    => { email => 'abuse@hubspot.com',       note => 'HubSpot analytics infrastructure' },
  365: 	# Campaign Monitor
  366: 	'createsend.com'      => { email => 'abuse@campaignmonitor.com', note => 'Campaign Monitor ESP' },
  367: 	'cmail20.com'         => { email => 'abuse@campaignmonitor.com', note => 'Campaign Monitor sending infrastructure' },
  368: 	# Klaviyo
  369: 	'klaviyo.com'         => { email => 'abuse@klaviyo.com',       note => 'ESP abuse' },
  370: 	# Brevo (formerly Sendinblue)
  371: 	'sendinblue.com'      => { email => 'abuse@sendinblue.com',    note => 'ESP abuse' },
  372: 	'brevo.com'           => { email => 'abuse@brevo.com',         note => 'ESP abuse' },
  373: 	# Mailgun
  374: 	'mailgun.com'         => { email => 'abuse@mailgun.com',       note => 'ESP abuse' },
  375: 	'mailgun.org'         => { email => 'abuse@mailgun.com',       note => 'Mailgun sending infrastructure' },
  376: 	# Postmark
  377: 	'postmarkapp.com'     => { email => 'abuse@postmarkapp.com',   note => 'ESP abuse' },
  378: 	# WordPress.com
  379: 	'wordpress.com'       => { email => 'abuse@wordpress.com',     note => 'WordPress.com hosted blog -- report via https://en.wordpress.com/abuse/' },
  380: 	'wp.com'              => { email => 'abuse@wordpress.com',     note => 'WordPress.com short domain' },
  381: 	# Substack
  382: 	'substack.com'        => { email => 'abuse@substack.com',      note => 'Substack newsletter platform abuse' },
  383: 	# ActiveCampaign
  384: 	'activecampaign.com'  => { email => 'abuse@activecampaign.com', note => 'ActiveCampaign ESP' },
  385: 	'ac-tinker.com'       => { email => 'abuse@activecampaign.com', note => 'ActiveCampaign tracking infrastructure' },
  386: 	# Salesforce Marketing Cloud
  387: 	'salesforce.com'      => { email => 'abuse@salesforce.com',    note => 'Salesforce Marketing Cloud / ExactTarget ESP' },
  388: 	'mc.salesforce.com'   => { email => 'abuse@salesforce.com',    note => 'Salesforce Marketing Cloud sending infrastructure' },
  389: 	'exacttarget.com'     => { email => 'abuse@salesforce.com',    note => 'ExactTarget / Salesforce Marketing Cloud ESP' },
  390: 	'et.exacttarget.com'  => { email => 'abuse@salesforce.com',    note => 'ExactTarget sending infrastructure' },
  391: 	# Vultr
  392: 	'vultr.com'           => { email => 'abuse@vultr.com',         note => 'Vultr hosting' },
  393: 	# Contabo
  394: 	'contabo.com'         => { email => 'abuse@contabo.com',       note => 'Contabo hosting' },
  395: 	# Leaseweb
  396: 	'leaseweb.com'        => { email => 'abuse@leaseweb.com',      note => 'Leaseweb hosting' },
  397: 	# M247
  398: 	'm247.com'            => { email => 'abuse@m247.com',          note => 'M247 hosting' },
  399: 	# MarkMonitor -- web form only
  400: 	'markmonitor.com'       => {
  401: 		form        => 'https://corp.markmonitor.com/domain/ui/abuse-report',
  402: 		form_paste  => 'Complete all fields including the domain name and your '
  403: 		             . 'description of the abuse.  Paste the originating IP, '
  404: 		             . 'risk flags, and the relevant Received: headers from the '
  405: 		             . 'report below.',
  406: 		form_upload => 'Take a screenshot of the report as a .png or .jpg, or export it as a .pdf.  MarkMonitor does not accept .eml files.',
  407: 		note        => 'Brand-protection registrar -- email reports not processed',
  408: 	},
  409: 	# URL shortener operators
  410: 	'is.gd'             => { email => 'abuse@is.gd',           note => 'URL shortener -- report via https://is.gd/contact.php' },
  411: 	'bitly.com'         => { email => 'abuse@bitly.com',        note => 'URL shortener abuse' },
  412: 	'bit.ly'            => { email => 'abuse@bitly.com',        note => 'URL shortener abuse' },
  413: 	'tinyurl.com'       => { email => 'abuse@tinyurl.com',      note => 'URL shortener abuse' },
  414: 	'ow.ly'             => { email => 'abuse@hootsuite.com',    note => 'Hootsuite URL shortener' },
  415: 	'buff.ly'           => { email => 'abuse@buffer.com',       note => 'Buffer URL shortener' },
  416: 	'rb.gy'             => { email => 'abuse@rb.gy',            note => 'URL shortener abuse' },
  417: 	'cutt.ly'           => { email => 'abuse@cutt.ly',          note => 'URL shortener abuse' },
  418: 	'shorturl.at'       => { email => 'abuse@shorturl.at',      note => 'URL shortener abuse' },
  419: 	# Dynadot -- web form only
  420: 	'dynadot.com'           => {
  421: 		form        => 'https://www.dynadot.com/report-abuse',
  422: 		form_paste  => 'Complete all fields including the domain name and your '
  423: 		             . 'description of the abuse.  Paste the originating IP, '
  424: 		             . 'risk flags, and the relevant Received: headers from the '
  425: 		             . 'report below.',
  426: 		form_upload => 'Take a screenshot of the report as a .png or .jpg, '
  427: 		             . 'or export it as a .pdf.',
  428: 		note        => 'Registrar -- email reports not monitored, use web form',
  429: 	},
  430: 	# Global Domain Group -- web form only
  431: 	'globaldomaingroup.com' => {
  432: 		form        => 'https://globaldomaingroup.com/report-abuse',
  433: 		form_paste  => 'Complete all fields including the domain name and your '
  434: 		             . 'description of the abuse.  Paste the originating IP, '
  435: 		             . 'risk flags, and the relevant Received: headers from the '
  436: 		             . 'report below.',
  437: 		form_upload => 'Attach the original spam message as an .eml file.',
  438: 		note        => 'Registrar -- email reports explicitly not accepted',
  439: 	},
  440: 	# TPG / Internode (Australia)
  441: 	'tpgi.com.au'       => { email => 'abuse@tpg.com.au',      note => 'TPG Telecom Australia' },
  442: 	'tpg.com.au'        => { email => 'abuse@tpg.com.au',      note => 'TPG Telecom Australia' },
  443: 	'internode.on.net'  => { email => 'abuse@internode.on.net',note => 'Internode Australia' },
  444: );
  445: 
  446: # -----------------------------------------------------------------------
  447: # Constructor
  448: # -----------------------------------------------------------------------
  449: 
  450: =head1 METHODS
  451: 
  452: =head2 new( %options )
  453: 
  454: Constructs and returns a new C<Email::Abuse::Investigator> analyser object.  The
  455: object is stateless until C<parse_email()> is called; all analysis results
  456: are stored on the object and retrieved via the public accessor methods
  457: documented below.
  458: 
  459: A single object may be reused for multiple emails by calling C<parse_email()>
  460: again: all per-message cached state from the previous message is discarded
  461: automatically.  Cross-message IP and domain lookup results are retained
  462: in a shared CHI cache (if C<CHI> is installed) to avoid redundant network
  463: queries across messages processed in the same process.
  464: 
  465: =head3 Usage
  466: 
  467:     # Minimal -- all options take safe defaults
  468:     my $analyser = Email::Abuse::Investigator->new();
  469: 
  470:     # With options
  471:     my $analyser = Email::Abuse::Investigator->new(
  472:         timeout        => 15,
  473:         trusted_relays => ['203.0.113.0/24', '10.0.0.0/8'],
  474:         verbose        => 0,
  475:     );
  476: 
  477:     $analyser->parse_email($raw_rfc2822_text);
  478:     my $origin   = $analyser->originating_ip();
  479:     my @urls     = $analyser->embedded_urls();
  480:     my @domains  = $analyser->mailto_domains();
  481:     my $risk     = $analyser->risk_assessment();
  482:     my @contacts = $analyser->abuse_contacts();
  483:     print $analyser->report();
  484: 
  485: =head3 Arguments
  486: 
  487: All arguments are optional named parameters passed as a flat key-value list.
  488: 
  489: =over 4
  490: 
  491: =item C<timeout> (integer, default 10)
  492: 
  493: Maximum seconds to wait for any single network operation.  Set to 0 to
  494: disable timeouts (not recommended for production use).
  495: 
  496: =item C<trusted_relays> (arrayref of strings, default [])
  497: 
  498: IP addresses or CIDR blocks to skip during Received: chain analysis.
  499: Each element may be an exact IPv4 address (C<'192.0.2.1'>) or a CIDR
  500: block (C<'192.0.2.0/24'>).
  501: 
  502: =item C<verbose> (boolean, default 0)
  503: 
  504: When true, diagnostic messages are written to STDERR.
  505: 
  506: =back
  507: 
  508: =head3 Returns
  509: 
  510: A blessed C<Email::Abuse::Investigator> object.  No network I/O is performed
  511: during construction.
  512: 
  513: =head3 Side Effects
  514: 
  515: If C<CHI> is installed, a shared in-memory cache is initialised (or
  516: re-used if a cache was already created by a prior call to C<new()>).
  517: This cache persists for the lifetime of the process.
  518: 
  519: =head3 Notes
  520: 
  521: =over 4
  522: 
  523: =item *
  524: 
  525: Unknown option keys are silently ignored.
  526: 
  527: =item *
  528: 
  529: The object is not thread-safe.  Use a separate object per thread.
  530: 
  531: =item *
  532: 
  533: WHOIS read timeouts use C<IO::Select> rather than C<alarm()>, so they
  534: work correctly on Windows and in threaded Perl interpreters.
  535: 
  536: =back
  537: 
  538: =head3 API Specification
  539: 
  540: =head4 Input
  541: 
  542:     {
  543:         timeout => {
  544:             type     => 'integer',
  545:             optional => 1,
  546:             min      => 0,
  547:             default  => 10,
  548:         },
  549:         trusted_relays => {
  550:             type          => 'arrayref',
  551:             element_type  => 'string',
  552:             optional      => 1,
  553:             default       => [],
  554:         },
  555:         verbose => {
  556:             type     => 'boolean',
  557:             optional => 1,
  558:             default  => 0,
  559:         },
  560:     }
  561: 
  562: =head4 Output
  563: 
  564:     {
  565:         type => 'Email::Abuse::Investigator',
  566:         isa  => 'Email::Abuse::Investigator',
  567:     }
  568: 
  569: =head3 FORMAL SPECIFICATION
  570: 
  571:     -- Z notation (simplified)
  572:     new == [
  573:       timeout        : N;
  574:       trusted_relays : seq STRING;
  575:       verbose        : BOOL;
  576:       _raw           : STRING;
  577:       _headers       : seq (STRING x STRING);
  578:       _origin?       : IP_INFO | undefined;
  579:       _urls?         : seq URL_INFO | undefined;
  580:       _risk?         : RISK_INFO | undefined
  581:     ]
  582:     pre: timeout >= 0
  583:     post: self.timeout = params.timeout /\ self._raw = ''
  584: 
  585: =cut
  586: 
  587: # Class-level cross-message CHI cache (shared across all instances).
  588: # Populated lazily on first call to new() when CHI is available.
  589: my $_cache;
  590: 
  591: sub new {
โ—592 โ†’ 619 โ†’ 628โ—592 โ†’ 619 โ†’ 0  592: 	my $class = shift;
  593: 
  594: 	# Accept hash or hashref arguments uniformly
  595: 	my $params = Params::Validate::Strict::validate_strict({
  596: 		args => Params::Get::get_params(undef, \@_) || {},
  597: 		schema => {
  598: 			timeout => {
  599: 				'type'     => 'integer',
  600: 				'optional' => 1,
  601: 				'min'      => 0,
  602: 			},
  603: 			trusted_relays => {
  604: 				'type'         => 'arrayref',
  605: 				'element_type' => 'string',
  606: 				'optional'     => 1,
  607: 			},
  608: 			verbose => {
  609: 				'type'     => 'boolean',
  610: 				'optional' => 1,
  611: 			},
  612: 		},
  613: 	});
  614: 
  615: 	# Merge in any file-based configuration via Object::Configure
  616: 	$params = Object::Configure::configure($class, $params);
  617: 
  618: 	# Initialise the cross-message CHI cache on first construction
  619: 	if ($HAS_CHI && !$_cache) {

Mutants (Total: 1, Killed: 1, Survived: 0)

620: $_cache = CHI->new( 621: driver => 'Memory', 622: global => 1, 623: expires_in => $CACHE_TTL_SECS, 624: ); 625: } 626: 627: # Build and bless the object with default slot values โ—[NOT COVERED] 628 โ†’ 628 โ†’ 0 628: return bless { 629: timeout => $DEFAULT_TIMEOUT, 630: trusted_relays => [], 631: verbose => 0, 632: _raw => '', 633: _headers => [], 634: _body_plain => '', 635: _body_html => '', 636: _received => [], 637: _origin => undef, 638: _urls => undef, # lazy-computed by embedded_urls() 639: _mailto_domains=> undef, # lazy-computed by mailto_domains() 640: _domain_info => {}, # per-message domain analysis cache 641: _sending_sw => [], # X-Mailer / X-PHP-Originating-Script etc. 642: _rcvd_tracking => [], # per-hop tracking IDs from Received: headers 643: %{$params}, # Override the defaults with Object:Configure and the values passed in 644: }, $class; 645: } 646: 647: # ----------------------------------------------------------------------- 648: # Public: parse 649: # ----------------------------------------------------------------------- 650: 651: =head2 parse_email( $text ) 652: 653: Feeds a raw RFC 2822 email message to the analyser and prepares it for 654: subsequent interrogation. This is the only method that must be called 655: before any other public method. 656: 657: If the same object is used for a second message, calling C<parse_email()> 658: again completely replaces all per-message state from the first message. 659: The cross-message CHI cache is B<not> flushed; IP and domain lookups 660: cached from prior messages are retained. 661: 662: =head3 Usage 663: 664: my $raw = do { local $/; <STDIN> }; 665: $analyser->parse_email($raw); 666: 667: # Scalar reference (avoids copying large messages) 668: $analyser->parse_email(\$raw); 669: 670: # Chained 671: my $analyser = Email::Abuse::Investigator->new()->parse_email($raw); 672: 673: =head3 Arguments 674: 675: =over 4 676: 677: =item C<$text> (scalar or scalar reference, required) 678: 679: Complete raw RFC 2822 email message, including all headers and the body. 680: Both LF-only and CRLF line endings are accepted. 681: 682: =back 683: 684: =head3 Returns 685: 686: The object itself (C<$self>), enabling method chaining. 687: 688: =head3 Side Effects 689: 690: Parses headers, decodes the body (quoted-printable, base64, multipart), 691: extracts sending-software fingerprints, and populates per-hop tracking 692: data. All previously computed lazy results are discarded. 693: 694: =head3 Notes 695: 696: =over 4 697: 698: =item * 699: 700: If C<$text> is empty or contains no header/body separator, all public 701: methods will return empty/safe values. 702: 703: =item * 704: 705: Decoding errors in base64 or quoted-printable payloads are silenced; raw 706: bytes are used in place of correct output to prevent exceptions. 707: 708: =back 709: 710: =head3 API Specification 711: 712: =head4 Input 713: 714: [ 715: { 716: type => 'scalar | scalarref', 717: }, 718: ] 719: 720: =head4 Output 721: 722: { 723: type => 'Email::Abuse::Investigator', 724: isa => 'Email::Abuse::Investigator', 725: } 726: 727: =head3 FORMAL SPECIFICATION 728: 729: -- Z notation 730: parse_email == [ 731: Delta Email::Abuse::Investigator; 732: text? : STRING | ref STRING 733: ] 734: pre: defined text? 735: post: self._raw = deref(text?) /\ 736: self._origin = undefined /\ 737: self._urls = undefined /\ 738: self._risk = undefined 739: 740: =cut 741: 742: # TODO: Allow a Mail::Message object to be passed in 743: sub parse_email { 744: my $self = shift; 745: 746: # Accept both positional string and named 'text' argument 747: my $args = Params::Get::get_params('text', \@_); 748: my $text = $args->{text}; 749: 750: # Dereference if a scalar reference was supplied 751: $text = $$text if ref $text eq 'SCALAR'; 752: 753: # Sanitise: strip control characters that could affect terminal output. 754: # Keep \t (tabs in headers), \n (line endings), \r (CRLF mail format). 755: $text =~ s/[^\x09\x0A\x0D\x20-\x7E\x80-\xFF]//g if defined $text; 756: 757: # Store the sanitised raw text for later reproduction in reports 758: $self->{_raw} = $text // ''; 759: 760: # Invalidate all per-message lazy caches 761: $self->{_origin} = undef; 762: $self->{_urls} = undef; 763: $self->{_mailto_domains} = undef; 764: $self->{_domain_info} = {}; 765: $self->{_risk} = undef; 766: $self->{_auth_results} = undef; 767: $self->{_sending_sw} = []; 768: $self->{_rcvd_tracking} = []; 769: 770: # Perform synchronous header/body parsing (no network I/O) 771: $self->_split_message($text) if defined $text && $text =~ /\S/; 772: return $self;

Mutants (Total: 2, Killed: 0, Survived: 2)
773: } 774: 775: # ----------------------------------------------------------------------- 776: # Public: originating host 777: # ----------------------------------------------------------------------- 778: 779: =head2 originating_ip() 780: 781: Identifies the IP address of the machine that originally injected the 782: message into the mail system by walking the C<Received:> chain, skipping 783: private/trusted hops, and enriching the first external hop with rDNS, 784: WHOIS/RDAP organisation name, abuse contact, and country code. 785: 786: Both IPv4 and IPv6 addresses are extracted and evaluated. 787: 788: The result is cached; subsequent calls return the same hashref without 789: repeating network I/O. 790: 791: =head3 Usage 792: 793: my $orig = $analyser->originating_ip(); 794: if (defined $orig) { 795: printf "Origin: %s (%s)\n", $orig->{ip}, $orig->{rdns}; 796: printf "Owner: %s\n", $orig->{org}; 797: } 798: 799: =head3 Arguments 800: 801: None. C<parse_email()> must have been called first. 802: 803: =head3 Returns 804: 805: A hashref with keys C<ip>, C<rdns>, C<org>, C<abuse>, C<confidence>, 806: C<note>, and C<country> (may be undef). Returns C<undef> if no suitable 807: originating IP can be determined. 808: 809: =head3 Side Effects 810: 811: On first call: one PTR lookup and one RDAP/WHOIS query. Results are cached 812: in the object and in the cross-message CHI cache (if available). 813: 814: =head3 Notes 815: 816: Only the first (oldest) external IP in the chain is reported. See 817: C<received_trail()> for the full chain. 818: 819: =head3 API Specification 820: 821: =head4 Input 822: 823: [] 824: 825: =head4 Output 826: 827: { 828: type => 'hashref | undef', 829: keys => { 830: ip => { type => 'scalar', regex => qr/[\d.:a-fA-F]/ }, 831: rdns => { type => 'scalar' }, 832: org => { type => 'scalar' }, 833: abuse => { type => 'scalar' }, 834: confidence => { type => 'scalar', regex => qr/^(?:high|medium|low)$/ }, 835: note => { type => 'scalar' }, 836: country => { type => 'scalar', optional => 1 }, 837: }, 838: } 839: 840: =head3 FORMAL SPECIFICATION 841: 842: -- Z notation 843: originating_ip == [ 844: Xi Email::Abuse::Investigator; 845: result! : IP_INFO | undefined 846: ] 847: pre: self._raw /= '' 848: post: result! = self._origin /\ 849: (result! /= undefined => result!.ip in EXTERNAL_IPS) 850: 851: =cut 852: 853: sub originating_ip { 854: my $self = $_[0]; 855: 856: # Return the cached result if we already have it 857: $self->{_origin} //= $self->_find_origin(); 858: return $self->{_origin};
Mutants (Total: 2, Killed: 0, Survived: 2)
859: } 860: 861: # ----------------------------------------------------------------------- 862: # Public: HTTP/HTTPS URLs 863: # ----------------------------------------------------------------------- 864: 865: =head2 embedded_urls() 866: 867: Extracts every HTTP and HTTPS URL from the message body and enriches each 868: one with the hosting IP address, network organisation name, abuse contact, 869: and country code. Both IPv4 and IPv6 host addresses are supported. 870: 871: URL extraction runs across both plain-text and HTML body parts. DNS 872: lookups for each unique hostname are optionally parallelised via 873: C<AnyEvent::DNS> if that module is installed. 874: 875: The result is cached; subsequent calls return the same list without 876: repeating network I/O. 877: 878: =head3 Usage 879: 880: my @urls = $analyser->embedded_urls(); 881: for my $u (@urls) { 882: printf "URL: %s host: %s org: %s\n", 883: $u->{url}, $u->{host}, $u->{org}; 884: } 885: 886: =head3 Arguments 887: 888: None. C<parse_email()> must have been called first. 889: 890: =head3 Returns 891: 892: A list of hashrefs, one per unique URL, in first-seen order. Returns an 893: empty list if no HTTP/HTTPS URLs are present. Each hashref has keys 894: C<url>, C<host>, C<ip>, C<org>, C<abuse>, C<country>. 895: 896: =head3 Side Effects 897: 898: Per unique hostname: one A/AAAA lookup and one RDAP/WHOIS query. Results 899: are cached in the object and in the cross-message CHI cache. 900: 901: =head3 Notes 902: 903: Only C<http://> and C<https://> URLs are extracted. URL shortener hosts 904: are included in the returned list (they are flagged by C<risk_assessment()>). 905: 906: =head3 API Specification 907: 908: =head4 Input 909: 910: [] 911: 912: =head4 Output 913: 914: ( 915: { 916: type => 'hashref', 917: keys => { 918: url => { type => 'scalar', regex => qr{^https?://}i }, 919: host => { type => 'scalar' }, 920: ip => { type => 'scalar' }, 921: org => { type => 'scalar' }, 922: abuse => { type => 'scalar' }, 923: country => { type => 'scalar', optional => 1 }, 924: }, 925: }, 926: ... 927: ) 928: 929: =head3 FORMAL SPECIFICATION 930: 931: -- Z notation 932: embedded_urls == [ 933: Xi Email::Abuse::Investigator; 934: result! : seq URL_INFO 935: ] 936: pre: self._raw /= '' 937: post: result! = self._urls /\ 938: forall u : result! @ u.url =~ m{^https?://}i 939: 940: =cut 941: 942: sub embedded_urls { 943: my $self = $_[0]; 944: 945: $self->{_urls} //= $self->_extract_and_resolve_urls(); 946: return @{ $self->{_urls} }; 947: } 948: 949: # ----------------------------------------------------------------------- 950: # Public: mailto / reply-to / from domains 951: # ----------------------------------------------------------------------- 952: 953: =head2 mailto_domains() 954: 955: Identifies every domain associated with the message as a contact, reply, 956: or delivery address, then runs a full intelligence pipeline on each one 957: (A record, MX, NS, WHOIS) to determine hosting and registration details. 958: 959: The result is cached; subsequent calls return the same list without 960: repeating network I/O. 961: 962: =head3 Usage 963: 964: my @domains = $analyser->mailto_domains(); 965: for my $d (@domains) { 966: printf "Domain: %s registrar: %s\n", 967: $d->{domain}, $d->{registrar} // 'unknown'; 968: } 969: 970: =head3 Arguments 971: 972: None. C<parse_email()> must have been called first. 973: 974: =head3 Returns 975: 976: A list of hashrefs, one per unique domain. See the main POD for the full 977: set of possible keys. Returns an empty list if no qualifying domains are 978: found. 979: 980: =head3 Side Effects 981: 982: Per unique domain: up to three A lookups, one MX lookup, one NS lookup, 983: and two WHOIS queries. Results are cached in the object and in the 984: cross-message CHI cache. 985: 986: =head3 Notes 987: 988: MX and NS lookups require C<Net::DNS>. Without it those keys are absent 989: from every returned hashref. 990: 991: =head3 API Specification 992: 993: =head4 Input 994: 995: [] 996: 997: =head4 Output 998: 999: ( 1000: { 1001: type => 'hashref', 1002: keys => { 1003: domain => { type => 'scalar' }, 1004: source => { type => 'scalar' }, 1005: # All other keys optional -- see main POD 1006: }, 1007: }, 1008: ... 1009: ) 1010: 1011: =head3 FORMAL SPECIFICATION 1012: 1013: -- Z notation 1014: mailto_domains == [ 1015: Xi Email::Abuse::Investigator; 1016: result! : seq DOMAIN_INFO 1017: ] 1018: pre: self._raw /= '' 1019: post: result! = self._mailto_domains /\ 1020: forall d : result! @ d.domain =~ /\.[a-zA-Z]{2,}$/ 1021: 1022: =cut 1023: 1024: sub mailto_domains { 1025: my $self = $_[0]; 1026: 1027: $self->{_mailto_domains} //= $self->_extract_and_analyse_domains(); 1028: return @{ $self->{_mailto_domains} }; 1029: } 1030: 1031: =head2 all_domains() 1032: 1033: Returns the deduplicated union of every registrable domain seen anywhere 1034: in the message -- URL hosts from C<embedded_urls()> and contact domains 1035: from C<mailto_domains()> -- normalised to eTLD+1 form. 1036: 1037: Triggers C<embedded_urls()> and C<mailto_domains()> lazily. 1038: 1039: =head3 Usage 1040: 1041: my @domains = $analyser->all_domains(); 1042: print "$_\n" for @domains; 1043: 1044: =head3 Arguments 1045: 1046: None. 1047: 1048: =head3 Returns 1049: 1050: A list of plain strings (registrable domain names), lower-cased, no 1051: duplicates, in first-seen order. 1052: 1053: =head3 Side Effects 1054: 1055: Triggers C<embedded_urls()> and C<mailto_domains()> if not already cached. 1056: 1057: =head3 Notes 1058: 1059: Normalisation to eTLD+1 uses C<Domain::PublicSuffix> if installed, falling 1060: back to a built-in heuristic otherwise. 1061: 1062: =head3 API Specification 1063: 1064: =head4 Input 1065: 1066: [] 1067: 1068: =head4 Output 1069: 1070: ( 1071: { type => 'scalar', regex => qr/^[a-z0-9][a-z0-9.-]*\.[a-z]{2,}$/ }, 1072: ... 1073: ) 1074: 1075: =head3 FORMAL SPECIFICATION 1076: 1077: -- Z notation 1078: all_domains == [ 1079: Xi Email::Abuse::Investigator; 1080: result! : seq STRING 1081: ] 1082: post: result! = deduplicate( 1083: map(_registrable, url_hosts union mailto_domains) 1084: ) 1085: 1086: =cut 1087: 1088: sub all_domains { โ—1089 โ†’ 1093 โ†’ 1099โ—1089 โ†’ 1093 โ†’ 0 1089: my $self = $_[0]; 1090: my (%seen, @out); 1091: 1092: # Collect registrable domains from URL hosts first 1093: for my $u ($self->embedded_urls()) { 1094: my $dom = _registrable($u->{host}); 1095: push @out, $dom if $dom && !$seen{$dom}++; 1096: } 1097: 1098: # Then from contact domains (normalise subdomains to registrable parent) โ—1099 โ†’ 1099 โ†’ 1103โ—1099 โ†’ 1099 โ†’ 0 1099: for my $d ($self->mailto_domains()) { 1100: my $dom = _registrable($d->{domain}) // $d->{domain}; 1101: push @out, $dom if $dom && !$seen{$dom}++; 1102: } โ—1103 โ†’ 1103 โ†’ 0 1103: return @out;
Mutants (Total: 2, Killed: 0, Survived: 2)
1104: } 1105: 1106: =head2 unresolved_contacts() 1107: 1108: Returns a list of domains and URL hosts found in the message for which no 1109: abuse contact could be determined. Useful for surfacing parties that may 1110: warrant manual investigation. 1111: 1112: =head3 Usage 1113: 1114: my @unresolved = $analyser->unresolved_contacts(); 1115: for my $u (@unresolved) { 1116: printf "Unresolved: %s (%s) via %s\n", 1117: $u->{domain}, $u->{type}, $u->{source}; 1118: } 1119: 1120: =head3 Arguments 1121: 1122: None. 1123: 1124: =head3 Returns 1125: 1126: A list of hashrefs, each with keys C<domain>, C<type> (C<'url_host'> or 1127: C<'domain'>), and C<source> (where the domain was found). 1128: 1129: =head3 Side Effects 1130: 1131: Triggers C<embedded_urls()>, C<mailto_domains()>, C<abuse_contacts()>, 1132: and C<form_contacts()> if not already cached. 1133: 1134: =head3 Notes 1135: 1136: Domains sourced only from spoofable sending headers (C<From:>, 1137: C<Return-Path:>, C<Sender:>) are excluded. 1138: 1139: =head3 API Specification 1140: 1141: =head4 Input 1142: 1143: [] 1144: 1145: =head4 Output 1146: 1147: ( 1148: { 1149: type => 'hashref', 1150: keys => { 1151: domain => { type => 'scalar' }, 1152: type => { type => 'scalar', regex => qr/^(?:url_host|domain)$/ }, 1153: source => { type => 'scalar' }, 1154: }, 1155: }, 1156: ... 1157: ) 1158: 1159: =head3 FORMAL SPECIFICATION 1160: 1161: -- Z notation 1162: unresolved_contacts == [ 1163: Xi Email::Abuse::Investigator; 1164: result! : seq UNRESOLVED_INFO 1165: ] 1166: post: forall u : result! @ 1167: u.domain not_in covered_domains(abuse_contacts, form_contacts) 1168: 1169: =cut 1170: 1171: sub unresolved_contacts { โ—1172 โ†’ 1176 โ†’ 1186โ—1172 โ†’ 1176 โ†’ 0 1172: my $self = $_[0]; 1173: 1174: # Build a set of domains already covered by email or form contacts 1175: my %covered; 1176: for my $c ($self->abuse_contacts(), $self->form_contacts()) { 1177: my $dom = $c->{form_domain}; 1178: unless ($dom) {

Mutants (Total: 1, Killed: 1, Survived: 0)

1179: # Extract domain from abuse email address 1180: ($dom) = ($c->{address} // '') =~ /\@([\w.-]+)/; 1181: } 1182: $covered{lc $dom}++ if $dom; 1183: } 1184: 1185: # Also mark URL hosts that already have a resolved abuse address โ—1186 โ†’ 1186 โ†’ 1191โ—1186 โ†’ 1186 โ†’ 0 1186: for my $u ($self->embedded_urls()) { 1187: (my $bare = lc $u->{host}) =~ s/^www\.//; 1188: $covered{$bare}++ if $u->{abuse} && $u->{abuse} ne '(unknown)'; 1189: } 1190: โ—1191 โ†’ 1194 โ†’ 1206โ—1191 โ†’ 1194 โ†’ 0 1191: my (@out, %seen); 1192: 1193: # Check URL hosts first 1194: for my $u ($self->embedded_urls()) { 1195: (my $bare = lc $u->{host}) =~ s/^www\.//; 1196: next if $covered{$bare}; 1197: next if $seen{"url:$bare"}++; 1198: push @out, { 1199: domain => $u->{host}, 1200: type => 'url_host', 1201: source => 'URL in body', 1202: }; 1203: } 1204: 1205: # Then check contact domains, skipping spoofable-header-only sources โ—1206 โ†’ 1206 โ†’ 1219โ—1206 โ†’ 1206 โ†’ 0 1206: for my $d ($self->mailto_domains()) { 1207: my $dom = $d->{domain}; 1208: my $source = $d->{source} // ''; 1209: next if $source =~ /^(?:From:|Return-Path:|Sender:) header$/; 1210: next if $covered{lc $dom}; 1211: next if $seen{"dom:$dom"}++; 1212: push @out, { 1213: domain => $dom, 1214: type => 'domain', 1215: source => $source, 1216: }; 1217: } 1218: โ—1219 โ†’ 1219 โ†’ 0 1219: return @out;

Mutants (Total: 2, Killed: 0, Survived: 2)
1220: } 1221: 1222: # ----------------------------------------------------------------------- 1223: # Public: sending software fingerprint 1224: # ----------------------------------------------------------------------- 1225: 1226: =head2 sending_software() 1227: 1228: Returns information extracted from headers that identify the software or 1229: server-side infrastructure used to compose or inject the message. Headers 1230: such as C<X-PHP-Originating-Script> reveal the exact PHP script and Unix 1231: account responsible on shared-hosting platforms. 1232: 1233: Data is extracted during C<parse_email()> with no network I/O. 1234: 1235: =head3 Usage 1236: 1237: my @sw = $analyser->sending_software(); 1238: for my $s (@sw) { 1239: printf "%-30s : %s\n", $s->{header}, $s->{value}; 1240: } 1241: 1242: =head3 Arguments 1243: 1244: None. C<parse_email()> must have been called first. 1245: 1246: =head3 Returns 1247: 1248: A list of hashrefs in alphabetical header-name order. Returns an empty 1249: list if none of the watched headers are present. Each hashref has keys 1250: C<header>, C<value>, and C<note>. 1251: 1252: =head3 Side Effects 1253: 1254: None. Data is pre-collected during C<parse_email()>. 1255: 1256: =head3 Notes 1257: 1258: Header names are lower-cased. Header values are stored verbatim. 1259: 1260: =head3 API Specification 1261: 1262: =head4 Input 1263: 1264: [] 1265: 1266: =head4 Output 1267: 1268: ( 1269: { 1270: type => 'hashref', 1271: keys => { 1272: header => { type => 'scalar' }, 1273: value => { type => 'scalar' }, 1274: note => { type => 'scalar' }, 1275: }, 1276: }, 1277: ... 1278: ) 1279: 1280: =head3 FORMAL SPECIFICATION 1281: 1282: -- Z notation 1283: sending_software == [ 1284: Xi Email::Abuse::Investigator; 1285: result! : seq SW_INFO 1286: ] 1287: post: result! = self._sending_sw 1288: 1289: =cut 1290: 1291: sub sending_software { 1292: my $self = $_[0]; 1293: 1294: return @{ $self->{_sending_sw} }; 1295: } 1296: 1297: # ----------------------------------------------------------------------- 1298: # Public: per-hop tracking IDs 1299: # ----------------------------------------------------------------------- 1300: 1301: =head2 received_trail() 1302: 1303: Returns per-hop tracking data extracted from the C<Received:> header chain: 1304: the IP address, envelope recipient address, and server session ID for each 1305: relay. ISP postmasters use these identifiers to locate the SMTP session in 1306: their logs. 1307: 1308: =head3 Usage 1309: 1310: my @trail = $analyser->received_trail(); 1311: for my $hop (@trail) { 1312: printf "IP: %s ID: %s\n", 1313: $hop->{ip} // '?', $hop->{id} // '?'; 1314: } 1315: 1316: =head3 Arguments 1317: 1318: None. C<parse_email()> must have been called first. 1319: 1320: =head3 Returns 1321: 1322: A list of hashrefs in oldest-first order. Returns an empty list if no 1323: C<Received:> headers are present or none yielded extractable data. Each 1324: hashref has keys C<received>, C<ip> (may be undef), C<for> (may be undef), 1325: C<id> (may be undef). 1326: 1327: =head3 Side Effects 1328: 1329: None. Data is pre-collected during C<parse_email()>. 1330: 1331: =head3 Notes 1332: 1333: Private IPs are NOT filtered here; all IPs including RFC 1918 addresses 1334: are returned as found. Filtering is applied only by C<originating_ip()>. 1335: 1336: =head3 API Specification 1337: 1338: =head4 Input 1339: 1340: [] 1341: 1342: =head4 Output 1343: 1344: ( 1345: { 1346: type => 'hashref', 1347: keys => { 1348: received => { type => 'scalar' }, 1349: ip => { type => 'scalar', optional => 1 }, 1350: for => { type => 'scalar', optional => 1 }, 1351: id => { type => 'scalar', optional => 1 }, 1352: }, 1353: }, 1354: ... 1355: ) 1356: 1357: =head3 FORMAL SPECIFICATION 1358: 1359: -- Z notation 1360: received_trail == [ 1361: Xi Email::Abuse::Investigator; 1362: result! : seq HOP_INFO 1363: ] 1364: post: result! = self._rcvd_tracking 1365: 1366: =cut 1367: 1368: sub received_trail { 1369: my $self = $_[0]; 1370: 1371: return @{ $self->{_rcvd_tracking} }; 1372: } 1373: 1374: # ----------------------------------------------------------------------- 1375: # Public: risk assessment 1376: # ----------------------------------------------------------------------- 1377: 1378: =head2 risk_assessment() 1379: 1380: Evaluates the message against heuristic checks and returns an overall risk 1381: level, a weighted numeric score, and a list of every specific red flag. 1382: 1383: The assessment covers five categories: originating IP, email authentication, 1384: Date: header validity, identity/header consistency, and URL/domain properties. 1385: 1386: The result is cached; subsequent calls return the same hashref without 1387: repeating any analysis. 1388: 1389: =head3 Usage 1390: 1391: my $risk = $analyser->risk_assessment(); 1392: printf "Risk: %s (score: %d)\n", $risk->{level}, $risk->{score}; 1393: for my $f (@{ $risk->{flags} }) { 1394: printf " [%s] %s\n", $f->{severity}, $f->{detail}; 1395: } 1396: 1397: =head3 Arguments 1398: 1399: None. C<parse_email()> must have been called first. 1400: 1401: =head3 Returns 1402: 1403: A hashref with keys C<level> (HIGH/MEDIUM/LOW/INFO), C<score> (integer), 1404: and C<flags> (arrayref of hashrefs with C<severity>, C<flag>, C<detail>). 1405: 1406: =head3 Side Effects 1407: 1408: Triggers C<originating_ip()>, C<embedded_urls()>, and C<mailto_domains()> 1409: if not already cached. 1410: 1411: =head3 Notes 1412: 1413: Scores: HIGH >= 9, MEDIUM >= 5, LOW >= 2, INFO < 2. 1414: Flag weights: HIGH=3, MEDIUM=2, LOW=1, INFO=0. 1415: 1416: =head3 API Specification 1417: 1418: =head4 Input 1419: 1420: [] 1421: 1422: =head4 Output 1423: 1424: { 1425: type => 'hashref', 1426: keys => { 1427: level => { type => 'scalar', regex => qr/^(?:HIGH|MEDIUM|LOW|INFO)$/ }, 1428: score => { type => 'scalar', regex => qr/^\d+$/ }, 1429: flags => { type => 'arrayref' }, 1430: }, 1431: } 1432: 1433: =head3 FORMAL SPECIFICATION 1434: 1435: -- Z notation 1436: risk_assessment == [ 1437: Xi Email::Abuse::Investigator; 1438: result! : RISK_INFO 1439: ] 1440: post: result!.score = sum({ w(f.severity) | f in result!.flags }) /\ 1441: result!.level = classify(result!.score) 1442: where: 1443: w(HIGH) = 3; w(MEDIUM) = 2; w(LOW) = 1; w(INFO) = 0 1444: classify(s) = HIGH if s >= 9 1445: | MEDIUM if s >= 5 1446: | LOW if s >= 2 1447: | INFO otherwise 1448: 1449: =cut 1450: 1451: sub risk_assessment { โ—1452 โ†’ 1468 โ†’ 1501โ—1452 โ†’ 1468 โ†’ 0 1452: my $self = $_[0]; 1453: 1454: return $self->{_risk} if $self->{_risk};

Mutants (Total: 2, Killed: 2, Survived: 0)

1455: 1456: my (@flags, $score); 1457: $score = 0; 1458: 1459: # Closure to record a flag and accumulate its weight 1460: my $flag = sub { 1461: my ($severity, $name, $detail) = @_; 1462: $score += $FLAG_WEIGHT{$severity} // 1; 1463: push @flags, { severity => $severity, flag => $name, detail => $detail }; 1464: }; 1465: 1466: # --- Group 1: Originating IP checks --- 1467: my $orig = $self->originating_ip(); 1468: if ($orig) {

Mutants (Total: 1, Killed: 0, Survived: 1)
1469: # Residential / broadband rDNS patterns suggest a compromised host 1470: if ($orig->{rdns} && $orig->{rdns} =~ /
Mutants (Total: 1, Killed: 0, Survived: 1)
1471: \d+[-_.]\d+[-_.]\d+[-_.]\d+ # dotted-quad in rDNS 1472: | (?:dsl|adsl|cable|broad|dial|dynamic|dhcp|ppp| 1473: residential|cust|home|pool|client|user| 1474: static\d|host\d) 1475: /xi) { 1476: $flag->('HIGH', 'residential_sending_ip', 1477: "Sending IP $orig->{ip} rDNS '$orig->{rdns}' looks like a broadband/residential line, not a legitimate mail server"); 1478: } 1479: 1480: # Absence of rDNS is a strong spam indicator 1481: if (!$orig->{rdns} || $orig->{rdns} eq '(no reverse DNS)') {
Mutants (Total: 1, Killed: 0, Survived: 1)
1482: $flag->('HIGH', 'no_reverse_dns', 1483: "Sending IP $orig->{ip} has no reverse DNS -- legitimate mail servers always have rDNS"); 1484: } 1485: 1486: # Low-confidence origin means the IP came from an unverifiable header 1487: if ($orig->{confidence} eq 'low') {
Mutants (Total: 1, Killed: 0, Survived: 1)
1488: $flag->('MEDIUM', 'low_confidence_origin', 1489: "Originating IP taken from unverified header ($orig->{note})"); 1490: } 1491: 1492: # Statistically high-volume spam countries (informational only) 1493: if ($orig->{country} && $orig->{country} =~ /^(?:CN|RU|NG|VN|IN|PK|BD)$/) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1494: $flag->('INFO', 'high_spam_country', 1495: 'Sending IP is in ' . _country_name($orig->{country}) . 1496: " ($orig->{country}) -- statistically high spam volume country"); 1497: } 1498: } 1499: 1500: # --- Group 2: Email authentication checks --- โ—1501 โ†’ 1502 โ†’ 1514โ—1501 โ†’ 1502 โ†’ 0 1501: my $auth = $self->_parse_auth_results_cached(); 1502: if (defined $auth->{spf}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1503: if ($auth->{spf} =~ /^fail/i) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1504: $flag->('HIGH', 'spf_fail', 1505: "SPF result: $auth->{spf} -- sending IP not authorised by domain's SPF record"); 1506: } elsif ($auth->{spf} =~ /^softfail/i) { 1507: $flag->('MEDIUM', 'spf_softfail', 1508: "SPF result: softfail (~all) -- sending IP not explicitly authorised"); 1509: } elsif ($auth->{spf} !~ /^pass/i) { 1510: $flag->('HIGH', 'spf_fail', 1511: "SPF result: $auth->{spf} -- sending IP not authorised"); 1512: } 1513: } โ—1514 โ†’ 1514 โ†’ 1518โ—1514 โ†’ 1514 โ†’ 0 1514: if (defined $auth->{dkim} && $auth->{dkim} !~ /^pass/i) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1515: $flag->('HIGH', 'dkim_fail', 1516: "DKIM result: $auth->{dkim} -- message signature invalid or absent"); 1517: } โ—1518 โ†’ 1518 โ†’ 1523โ—1518 โ†’ 1518 โ†’ 0 1518: if (defined $auth->{dmarc} && $auth->{dmarc} !~ /^pass/i) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1519: $flag->('HIGH', 'dmarc_fail', "DMARC result: $auth->{dmarc}"); 1520: } 1521: 1522: # DKIM signing domain vs From: domain mismatch check โ—1523 โ†’ 1523 โ†’ 1545โ—1523 โ†’ 1523 โ†’ 0 1523: if ($auth->{dkim_domain}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1524: my ($from_domain) = ($self->_header_value('from') // '') =~ /\@([\w.-]+)/; 1525: if ($from_domain) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1526: my $reg_dkim = _registrable($auth->{dkim_domain}) // $auth->{dkim_domain}; 1527: my $reg_from = _registrable(lc $from_domain) // lc $from_domain; 1528: if ($reg_dkim ne $reg_from) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1529: # Passing DKIM with a different domain is normal for ESPs 1530: if ($auth->{dkim} && $auth->{dkim} =~ /^pass/i) {

Mutants (Total: 1, Killed: 1, Survived: 0)

1531: $flag->('INFO', 'dkim_domain_mismatch', 1532: "DKIM signed by '$auth->{dkim_domain}' but From: domain is '$from_domain'" 1533: . ' -- message sent via third-party sender (normal for bulk/ESP mail)'); 1534: } else { 1535: # Failing DKIM plus mismatched domain is more suspicious 1536: $flag->('MEDIUM', 'dkim_domain_mismatch', 1537: "DKIM signed by '$auth->{dkim_domain}' but From: domain is '$from_domain'" 1538: . ' and DKIM did not pass -- possible impersonation'); 1539: } 1540: } 1541: } 1542: } 1543: 1544: # --- Group 3: Date: header checks --- โ—1545 โ†’ 1546 โ†’ 1579โ—1545 โ†’ 1546 โ†’ 0 1545: my $date_raw = $self->_header_value('date'); 1546: if (!$date_raw || $date_raw !~ /\S/) {

Mutants (Total: 1, Killed: 1, Survived: 0)

1547: $flag->('MEDIUM', 'missing_date', 1548: 'No Date: header -- violates RFC 5322; common in spam'); 1549: } else { 1550: # Check for an implausible timezone offset (outside real-world bounds) 1551: if ($date_raw =~ /([+-])(\d{2})(\d{2})\s*$/) {

Mutants (Total: 1, Killed: 0, Survived: 1)
1552: my ($sign, $hh, $mm) = ($1, $2, $3); 1553: my $offset_mins = $hh * 60 + $mm; 1554: my $implausible = $mm >= 60
Mutants (Total: 3, Killed: 0, Survived: 3)
1555: || ($sign eq '+' && $offset_mins > $TZ_MAX_POS_MINS)
Mutants (Total: 3, Killed: 0, Survived: 3)
1556: || ($sign eq '-' && $offset_mins > $TZ_MAX_NEG_MINS);
Mutants (Total: 3, Killed: 0, Survived: 3)
1557: if ($implausible) {

Mutants (Total: 1, Killed: 1, Survived: 0)

1558: $flag->('MEDIUM', 'implausible_timezone', 1559: "Date: '$date_raw' contains an implausible timezone offset " 1560: . "($sign$hh$mm) -- header is likely forged"); 1561: } 1562: } 1563: 1564: # Check for dates more than DATE_SKEW_DAYS outside the analysis window 1565: my $date_epoch = _parse_rfc2822_date($date_raw); 1566: if (defined $date_epoch) {

Mutants (Total: 1, Killed: 0, Survived: 1)
1567: my $delta = time() - $date_epoch; 1568: if ($delta > $DATE_SKEW_DAYS * $SECS_PER_DAY) {
Mutants (Total: 4, Killed: 0, Survived: 4)
1569: $flag->('LOW', 'suspicious_date', 1570: "Date: '$date_raw' is more than $DATE_SKEW_DAYS days in the past"); 1571: } elsif ($delta < -($DATE_SKEW_DAYS * $SECS_PER_DAY)) {
Mutants (Total: 3, Killed: 0, Survived: 3)
1572: $flag->('LOW', 'suspicious_date', 1573: "Date: '$date_raw' is more than $DATE_SKEW_DAYS days in the future"); 1574: } 1575: } 1576: } 1577: 1578: # --- Group 4: Header identity checks --- โ—1579 โ†’ 1583 โ†’ 1599โ—1579 โ†’ 1583 โ†’ 0 1579: my $from_raw = $self->_header_value('from') // ''; 1580: my $from_decoded = $self->_decode_mime_words($from_raw); 1581: 1582: # Display-name domain spoofing: "PayPal paypal.com" <phish@evil.example> 1583: if ($from_decoded =~ /^"?([^"<]+?)"?\s*<([^>]+)>/) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1584: my ($display, $addr) = ($1, $2); 1585: while ($display =~ /\b([\w-]+\.(?:com|net|org|io|co|uk|au|gov|edu))\b/gi) { 1586: my $disp_domain = lc $1; 1587: my ($addr_domain) = $addr =~ /\@([\w.-]+)/; 1588: $addr_domain = lc($addr_domain // ''); 1589: my $reg_disp = _registrable($disp_domain); 1590: my $reg_addr = _registrable($addr_domain); 1591: if ($reg_disp && $reg_addr && $reg_disp ne $reg_addr) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1592: $flag->('HIGH', 'display_name_domain_spoof', 1593: "From: display name mentions '$disp_domain' but actual address is <$addr>"); 1594: } 1595: } 1596: } 1597: 1598: # Free webmail sender flag (no corporate infrastructure) โ—1599 โ†’ 1599 โ†’ 1606โ—1599 โ†’ 1599 โ†’ 0 1599: if ($from_raw =~ /\@(gmail|yahoo|hotmail|outlook|live|aol|protonmail|yandex)\./i

Mutants (Total: 1, Killed: 1, Survived: 0)

1600: || $from_raw =~ /\@mail\.ru(?:[\s>]|$)/i) { 1601: $flag->('MEDIUM', 'free_webmail_sender', 1602: "Message sent from free webmail address ($from_raw)"); 1603: } 1604: 1605: # Reply-To differs from From: -- replies harvested by different address โ—1606 โ†’ 1607 โ†’ 1617โ—1606 โ†’ 1607 โ†’ 0 1606: my $reply_to = $self->_header_value('reply-to'); 1607: if ($reply_to) {

Mutants (Total: 1, Killed: 0, Survived: 1)
1608: my ($from_addr) = $from_raw =~ /([\w.+%-]+\@[\w.-]+)/; 1609: my ($reply_addr) = $reply_to =~ /([\w.+%-]+\@[\w.-]+)/; 1610: if ($from_addr && $reply_addr && lc($from_addr) ne lc($reply_addr)) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1611: $flag->('MEDIUM', 'reply_to_differs_from_from', 1612: "Reply-To ($reply_addr) differs from From: ($from_addr)"); 1613: } 1614: } 1615: 1616: # Undisclosed or absent To: header โ—1617 โ†’ 1618 โ†’ 1624โ—1617 โ†’ 1618 โ†’ 0 1617: my $to = $self->_header_value('to') // ''; 1618: if ($to =~ /undisclosed|:;/ || $to eq '') {
Mutants (Total: 1, Killed: 0, Survived: 1)
1619: $flag->('MEDIUM', 'undisclosed_recipients', 1620: "To: header is '$to' -- message was bulk-sent with hidden recipient list"); 1621: } 1622: 1623: # MIME-encoded Subject (potential filter evasion) โ—1624 โ†’ 1625 โ†’ 1632โ—1624 โ†’ 1625 โ†’ 0 1624: my $subj_raw = $self->_header_value('subject') // ''; 1625: if ($subj_raw =~ /=\?[^?]+\?[BQ]\?/i) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1626: $flag->('LOW', 'encoded_subject', 1627: "Subject line is MIME-encoded: '$subj_raw' (decoded: '" 1628: . $self->_decode_mime_words($subj_raw) . "')"); 1629: } 1630: 1631: # --- Group 5: URL and domain checks --- โ—1632 โ†’ 1633 โ†’ 1653โ—1632 โ†’ 1633 โ†’ 0 1632: my (%shortener_seen, %url_host_seen); 1633: for my $u ($self->embedded_urls()) { 1634: # Skip trusted infrastructure -- these are not spam indicators 1635: my $bare = lc $u->{host}; 1636: $bare =~ s/^www\.//; 1637: next if $self->{trusted_domains}->{$bare}; 1638: next if $TRUSTED_DOMAINS{$bare}; 1639: 1640: # URL shortener hides real destination 1641: if(($URL_SHORTENERS{$bare} || $self->{url_shorteners}->{$bare}) && !$shortener_seen{$bare}++) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1642: $flag->('MEDIUM', 'url_shortener', 1643: "$u->{host} is a URL shortener -- the real destination is hidden"); 1644: } 1645: # Plain HTTP provides no encryption 1646: if ($u->{url} =~ m{^http://}i && !$url_host_seen{ $u->{host} }++) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1647: $flag->('LOW', 'http_not_https', 1648: "$u->{host} linked over plain HTTP -- no encryption"); 1649: } 1650: } 1651: 1652: # Domain-level checks against contact/reply domains โ—1653 โ†’ 1653 โ†’ 1689โ—1653 โ†’ 1653 โ†’ 0 1653: for my $d ($self->mailto_domains()) { 1654: # Recently registered domain is a common phishing indicator 1655: if ($d->{recently_registered}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1656: $flag->('HIGH', 'recently_registered_domain', 1657: "$d->{domain} was registered $d->{registered} (less than ${\$RECENT_REG_DAYS} days ago)"); 1658: } 1659: 1660: # Domain expiry checks 1661: if ($d->{expires}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1662: if(my $exp = $self->_parse_date_to_epoch($d->{expires})) {

Mutants (Total: 1, Killed: 1, Survived: 0)

1663: my $now = time(); 1664: my $remaining = $exp - $now; 1665: if ($remaining > 0 && $remaining < $EXPIRY_WARN_DAYS * $SECS_PER_DAY) {

Mutants (Total: 7, Killed: 0, Survived: 7)
1666: $flag->('HIGH', 'domain_expires_soon', 1667: "$d->{domain} expires $d->{expires} -- may be a throwaway domain"); 1668: } elsif ($remaining <= 0) {
Mutants (Total: 3, Killed: 0, Survived: 3)
1669: $flag->('HIGH', 'domain_expired', 1670: "$d->{domain} expired $d->{expires} -- domain has lapsed"); 1671: } 1672: } 1673: } 1674: 1675: # Lookalike domain check (brand name in a non-brand domain) 1676: for my $brand (qw(paypal apple google amazon microsoft netflix ebay 1677: instagram facebook twitter linkedin bankofamerica 1678: wellsfargo chase barclays hsbc lloyds santander)) { 1679: if ($d->{domain} =~ /\Q$brand\E/i &&

Mutants (Total: 1, Killed: 1, Survived: 0)

1680: $d->{domain} !~ /^\Q$brand\E\.(?:com|co\.uk|net|org)$/) { 1681: $flag->('HIGH', 'lookalike_domain', 1682: "$d->{domain} contains brand name '$brand' but is not the real domain -- possible phishing"); 1683: last; 1684: } 1685: } 1686: } 1687: 1688: # Determine overall risk level from accumulated score โ—1689 โ†’ 1695 โ†’ 0 1689: my $level = $score >= $SCORE_HIGH ? 'HIGH'

Mutants (Total: 3, Killed: 0, Survived: 3)
1690: : $score >= $SCORE_MEDIUM ? 'MEDIUM'
Mutants (Total: 3, Killed: 0, Survived: 3)
1691: : $score >= $SCORE_LOW ? 'LOW'
Mutants (Total: 3, Killed: 0, Survived: 3)
1692: : 'INFO'; 1693: 1694: $self->{_risk} = { level => $level, score => $score, flags => \@flags }; 1695: return $self->{_risk};

Mutants (Total: 2, Killed: 2, Survived: 0)

1696: } 1697: 1698: # ----------------------------------------------------------------------- 1699: # Public: abuse report text 1700: # ----------------------------------------------------------------------- 1701: 1702: =head2 abuse_report_text() 1703: 1704: Produces a compact, plain-text string suitable for sending as the body of 1705: an abuse report email. It summarises risk level, red flags, originating IP, 1706: abuse contacts, and original message headers. The message body is omitted 1707: to keep the report concise. 1708: 1709: Use C<abuse_contacts()> to get the recipient addresses and this method for 1710: the body text. 1711: 1712: =head3 Usage 1713: 1714: my $text = $analyser->abuse_report_text(); 1715: my @contacts = $analyser->abuse_contacts(); 1716: for my $c (@contacts) { 1717: send_email(to => $c->{address}, body => $text); 1718: } 1719: 1720: =head3 Arguments 1721: 1722: None. C<parse_email()> must have been called first. 1723: 1724: =head3 Returns 1725: 1726: A plain scalar string, newline-terminated, Unix line endings. Never empty 1727: or undef. 1728: 1729: =head3 Side Effects 1730: 1731: Calls C<risk_assessment()>, C<originating_ip()>, and C<abuse_contacts()> 1732: if not already cached. 1733: 1734: =head3 Notes 1735: 1736: Output text is sanitised: control characters that could affect terminal or 1737: HTML rendering are stripped from all user-derived content before inclusion. 1738: 1739: =head3 API Specification 1740: 1741: =head4 Input 1742: 1743: [] 1744: 1745: =head4 Output 1746: 1747: { type => 'scalar' } 1748: 1749: =head3 FORMAL SPECIFICATION 1750: 1751: -- Z notation 1752: abuse_report_text == [ 1753: Xi Email::Abuse::Investigator; 1754: result! : STRING 1755: ] 1756: post: result! /= '' /\ result! ends_with '\n' 1757: 1758: =cut 1759: 1760: sub abuse_report_text { โ—1761 โ†’ 1773 โ†’ 1782โ—1761 โ†’ 1773 โ†’ 0 1761: my $self = $_[0]; 1762: my @out; 1763: 1764: push @out, 'This is an automated abuse report generated by Email::Abuse::Investigator.', 1765: 'Please investigate the following spam/phishing message.', 1766: ''; 1767: 1768: my $risk = $self->risk_assessment(); 1769: push @out, "RISK LEVEL: $risk->{level} (score: $risk->{score})", 1770: ''; 1771: 1772: # List each red flag with its severity prefix 1773: if (@{ $risk->{flags} }) {

Mutants (Total: 1, Killed: 0, Survived: 1)
1774: push @out, 'RED FLAGS IDENTIFIED:'; 1775: for my $f (@{ $risk->{flags} }) { 1776: push @out, " [$f->{severity}] " . _sanitise_output($f->{detail}); 1777: } 1778: push @out, ''; 1779: } 1780: 1781: # Originating IP summary block โ—1782 โ†’ 1783 โ†’ 1790โ—1782 โ†’ 1783 โ†’ 0 1782: my $orig = $self->originating_ip(); 1783: if ($orig) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1784: push @out, 'ORIGINATING IP: ' . _sanitise_output("$orig->{ip} ($orig->{rdns})"), 1785: 'NETWORK OWNER: ' . _sanitise_output($orig->{org}), 1786: ''; 1787: } 1788: 1789: # Email abuse contacts โ—1790 โ†’ 1791 โ†’ 1798โ—1790 โ†’ 1791 โ†’ 0 1790: my @contacts = $self->abuse_contacts(); 1791: if (@contacts) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1792: push @out, 'ABUSE CONTACTS:'; 1793: push @out, ' ' . _sanitise_output("$_->{address} ($_->{role})") for @contacts; 1794: push @out, ''; 1795: } 1796: 1797: # Web-form contacts (providers that reject email) โ—1798 โ†’ 1798 โ†’ 1812โ—1798 โ†’ 1798 โ†’ 0 1798: if(my @form_cs = $self->form_contacts()) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1799: push @out, 'WEB-FORM REPORTS REQUIRED:', 1800: ' The following parties do not accept email -- submit manually:'; 1801: for my $c (@form_cs) { 1802: push @out, " [$c->{role}]", 1803: ' Form : ' . _sanitise_output($c->{form}); 1804: push @out, ' Domain : ' . _sanitise_output($c->{form_domain}) if $c->{form_domain}; 1805: push @out, ' Paste : ' . _sanitise_output($c->{form_paste}) if $c->{form_paste}; 1806: push @out, ' Upload : ' . _sanitise_output($c->{form_upload}) if $c->{form_upload}; 1807: } 1808: push @out, ''; 1809: } 1810: 1811: # Separator and raw headers (body excluded for brevity) โ—1812 โ†’ 1816 โ†’ 1819โ—1812 โ†’ 1816 โ†’ 0 1812: push @out, '-' x 72, 1813: 'ORIGINAL MESSAGE HEADERS:', 1814: '-' x 72; 1815: 1816: for my $h (@{ $self->{_headers} }) { 1817: push @out, _sanitise_output("$h->{name}: $h->{value}"); 1818: } โ—1819 โ†’ 1821 โ†’ 0 1819: push @out, ''; 1820: 1821: return join("\n", @out); 1822: } 1823: 1824: # ----------------------------------------------------------------------- 1825: # Public: abuse contacts 1826: # ----------------------------------------------------------------------- 1827: 1828: =head2 abuse_contacts() 1829: 1830: Collates the complete set of parties that should receive an abuse report: 1831: the sending ISP, URL host operators, contact domain web/mail/DNS/registrar 1832: contacts, account providers identified from key headers, the DKIM signer, 1833: and the ESP identified via List-Unsubscribe. 1834: 1835: Addresses are deduplicated globally; if the same address is found via 1836: multiple routes, a single entry is kept and role strings are merged. 1837: 1838: =head3 Usage 1839: 1840: my @contacts = $analyser->abuse_contacts(); 1841: my @addrs = map { $_->{address} } @contacts; 1842: 1843: =head3 Arguments 1844: 1845: None. C<parse_email()> must have been called first. 1846: 1847: =head3 Returns 1848: 1849: A list of hashrefs, one per unique abuse address, in discovery order. 1850: Each hashref has keys C<role>, C<roles> (arrayref), C<address>, C<note>, 1851: C<via>. Returns an empty list if no contacts can be determined. 1852: 1853: =head3 Side Effects 1854: 1855: Triggers C<originating_ip()>, C<embedded_urls()>, and C<mailto_domains()> 1856: if not already cached. 1857: 1858: =head3 Notes 1859: 1860: The result is not independently cached; each call recomputes the contact 1861: list from the cached results of the underlying methods. 1862: 1863: =head3 API Specification 1864: 1865: =head4 Input 1866: 1867: [] 1868: 1869: =head4 Output 1870: 1871: ( 1872: { 1873: type => 'hashref', 1874: keys => { 1875: role => { type => 'scalar' }, 1876: roles => { type => 'arrayref' }, 1877: address => { type => 'scalar', regex => qr/\@/ }, 1878: note => { type => 'scalar' }, 1879: via => { type => 'scalar', regex => qr/^(?:provider-table|ip-whois|domain-whois)$/ }, 1880: }, 1881: }, 1882: ... 1883: ) 1884: 1885: =head3 FORMAL SPECIFICATION 1886: 1887: -- Z notation 1888: abuse_contacts == [ 1889: Xi Email::Abuse::Investigator; 1890: result! : seq CONTACT_INFO 1891: ] 1892: post: forall c : result! @ c.address contains '@' /\ 1893: forall c1, c2 : result! @ c1 /= c2 => c1.address /= c2.address 1894: 1895: =cut 1896: 1897: sub abuse_contacts { โ—1898 โ†’ 1951 โ†’ 1972โ—1898 โ†’ 1951 โ†’ 0 1898: my $self = $_[0]; 1899: 1900: my (@contacts, %seen_idx); 1901: 1902: # Inner closure: add one contact entry, merging roles for duplicate addresses 1903: my $add = sub { 1904: my (%args) = @_; 1905: my $addr = lc($args{address} // ''); 1906: return unless $addr && $addr =~ /\@/; 1907: 1908: # Suppress addresses belonging to form-only providers (no email accepted) 1909: if ($addr =~ /\@([\w.-]+)$/) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1910: my $dom = $1; 1911: my $pa = $self->_provider_abuse_for_host($dom); 1912: return if $pa && $pa->{form} && !$pa->{email}; 1913: } 1914: 1915: if (exists $seen_idx{$addr}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1916: # Merge the new role into the existing entry 1917: my $entry = $contacts[ $seen_idx{$addr} ]; 1918: push @{ $entry->{roles} }, $args{role}; 1919: 1920: # Collapse repeated role labels to avoid unreadable strings 1921: my (%role_counts, @ordered_roles); 1922: for my $r (@{ $entry->{roles} }) { 1923: push @ordered_roles, $r unless $role_counts{$r}++; 1924: } 1925: my @display = map { 1926: $role_counts{$_} > 1 ? "$_ (x$role_counts{$_})" : $_
Mutants (Total: 3, Killed: 0, Survived: 3)
1927: } @ordered_roles; 1928: my $joined = join(' and ', @display); 1929: 1930: # Summarise if the merged string is too long to read 1931: if (length($joined) > $ROLE_MAX_LEN) {
Mutants (Total: 4, Killed: 1, Survived: 3)
1932: my @short = map { 1933: (my $s = $_) =~ s/[:(\d].*//; 1934: $s =~ s/\s+$//; 1935: $s; 1936: } @display; 1937: $joined = scalar(@display) . ' routes: ' . join(', ', @short); 1938: } 1939: $entry->{role} = $joined; 1940: return; 1941: } 1942: 1943: # First time seeing this address -- record and store 1944: $seen_idx{$addr} = scalar @contacts; 1945: $args{roles} = [ $args{role} ]; 1946: push @contacts, \%args; 1947: }; 1948: 1949: # Route 1 -- Sending ISP (originating IP) 1950: my $orig = $self->originating_ip(); 1951: if ($orig) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1952: my $pa = $self->_provider_abuse_for_ip($orig->{ip}, $orig->{rdns}); 1953: if ($pa) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1954: $add->( 1955: role => 'Sending ISP', 1956: address => $pa->{email}, 1957: note => "$orig->{ip} ($orig->{rdns}) -- $pa->{note}", 1958: via => 'provider-table', 1959: ); 1960: } 1961: if ($orig->{abuse} && $orig->{abuse} ne '(unknown)') {
Mutants (Total: 1, Killed: 0, Survived: 1)
1962: $add->( 1963: role => 'Sending ISP', 1964: address => $orig->{abuse}, 1965: note => "Network owner of originating IP $orig->{ip} ($orig->{org})", 1966: via => 'ip-whois', 1967: ); 1968: } 1969: } 1970: 1971: # Route 2 -- URL hosts โ—1972 โ†’ 1973 โ†’ 2000โ—1972 โ†’ 1973 โ†’ 0 1972: my %url_host_seen; 1973: for my $u ($self->embedded_urls()) { 1974: next if $url_host_seen{ $u->{host} }++; 1975: my $bare_host = lc $u->{host}; 1976: $bare_host =~ s/^www\.//; 1977: # Skip trusted infrastructure (Google, W3C, etc.) 1978: next if $self->{trusted_domains}->{$bare_host}; 1979: next if $TRUSTED_DOMAINS{$bare_host}; 1980: my $pa = $self->_provider_abuse_for_host($u->{host}); 1981: if ($pa) {
Mutants (Total: 1, Killed: 0, Survived: 1)
1982: $add->( 1983: role => "URL host: $u->{host}", 1984: address => $pa->{email}, 1985: note => "$u->{host} -- $pa->{note}", 1986: via => 'provider-table', 1987: ); 1988: } 1989: if ($u->{abuse} && $u->{abuse} ne '(unknown)') {
Mutants (Total: 1, Killed: 0, Survived: 1)
1990: $add->( 1991: role => "URL host: $u->{host}", 1992: address => $u->{abuse}, 1993: note => "Hosting $u->{host} ($u->{ip}, $u->{org})", 1994: via => 'ip-whois', 1995: ); 1996: } 1997: } 1998: 1999: # Route 3 -- Contact domain hosting and registration โ—2000 โ†’ 2000 โ†’ 2067โ—2000 โ†’ 2000 โ†’ 0 2000: for my $d ($self->mailto_domains()) { 2001: my $dom = $d->{domain}; 2002: 2003: # Web host contact 2004: if ($d->{web_abuse}) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2005: my $pa = $self->_provider_abuse_for_host($dom); 2006: if ($pa) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2007: $add->(role => "Web host of $dom", address => $pa->{email}, 2008: note => $pa->{note}, via => 'provider-table'); 2009: } 2010: $add->( 2011: role => "Web host of $dom", 2012: address => $d->{web_abuse}, 2013: note => sprintf('Hosting %s (%s, %s)', 2014: $dom // '(unknown domain)', 2015: $d->{web_ip} // '(unknown IP)', 2016: $d->{web_org} // '(unknown org)'), 2017: via => 'ip-whois', 2018: ); 2019: } 2020: 2021: # MX (mail host) contact 2022: if ($d->{mx_abuse}) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2023: $add->( 2024: role => "Mail host (MX) for $dom", 2025: address => $d->{mx_abuse}, 2026: note => sprintf('MX %s (%s, %s)', 2027: $d->{mx_host} // '(unknown host)', 2028: $d->{mx_ip} // '(unknown IP)', 2029: $d->{mx_org} // '(unknown org)'), 2030: via => 'ip-whois', 2031: ); 2032: } 2033: 2034: # NS (DNS host) contact 2035: if ($d->{ns_abuse}) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2036: $add->( 2037: role => "DNS host (NS) for $dom", 2038: address => $d->{ns_abuse}, 2039: note => sprintf('NS %s (%s, %s)', 2040: $d->{ns_host} // '(unknown host)', 2041: $d->{ns_ip} // '(unknown IP)', 2042: $d->{ns_org} // '(unknown org)'), 2043: via => 'ip-whois', 2044: ); 2045: } 2046: 2047: # Domain registrar (skip if domain only seen in spoofable headers) 2048: if ($d->{registrar_abuse}) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2049: my $spoofable_only = 2050: $d->{source} =~ /^(?:From:|Return-Path:|Sender:) header$/ && 2051: !scalar(grep { 2052: $_->{host} && 2053: _registrable($_->{host}) eq (_registrable($dom) // $dom) 2054: } $self->embedded_urls()); 2055: unless ($spoofable_only) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2056: $add->( 2057: role => "Domain registrar for $dom", 2058: address => $d->{registrar_abuse}, 2059: note => 'Registrar: ' . ($d->{registrar} // '(unknown)'), 2060: via => 'domain-whois', 2061: ); 2062: } 2063: } 2064: } 2065: 2066: # Route 4 -- From:/Reply-To:/Return-Path:/Sender: account provider โ—2067 โ†’ 2067 โ†’ 2092โ—2067 โ†’ 2067 โ†’ 0 2067: for my $hname (qw(from reply-to return-path sender)) { 2068: my $val = $self->_header_value($hname) // next; 2069: 2070: # Extract addr-spec from angle-bracket form to avoid display-name @-signs 2071: my $addr_spec = ($val =~ /<([^>]*)>\s*$/) ? $1 : $val; 2072: my ($addr_domain) = $addr_spec =~ /\@([\w.-]+)/; 2073: next unless $addr_domain; 2074: 2075: # Skip SRS-rewritten forwarder addresses (not the real sender) 2076: next if $addr_spec =~ /\+SRS[0-9]?=/i; 2077: 2078: my $pa = $self->_provider_abuse_for_host($addr_domain); 2079: if ($pa) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2080: my $role_addr = $addr_spec =~ /\@/ ? $addr_spec : $val; 2081: $role_addr =~ s/^\s+|\s+$//g; 2082: $add->( 2083: role => "Account provider ($hname: $role_addr)", 2084: address => $pa->{email}, 2085: note => $pa->{note}, 2086: via => 'provider-table', 2087: ); 2088: } 2089: } 2090: 2091: # Route 5 -- DKIM signing organisation โ—2092 โ†’ 2093 โ†’ 2106โ—2092 โ†’ 2093 โ†’ 0 2092: my $auth = $self->_parse_auth_results_cached(); 2093: if ($auth->{dkim_domain}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2094: my $pa = $self->_provider_abuse_for_host($auth->{dkim_domain}); 2095: if ($pa) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2096: $add->( 2097: role => "DKIM signer: $auth->{dkim_domain}", 2098: address => $pa->{email}, 2099: note => $pa->{note}, 2100: via => 'provider-table', 2101: ); 2102: } 2103: } 2104: 2105: # Route 6 -- List-Unsubscribe ESP domain โ—2106 โ†’ 2107 โ†’ 2130โ—2106 โ†’ 2107 โ†’ 0 2106: my $unsub = $self->_header_value('list-unsubscribe'); 2107: if ($unsub) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2108: my @unsub_domains; 2109: while ($unsub =~ m{https?://([^/:?\s>]+)}gi) { 2110: push @unsub_domains, lc $1; 2111: } 2112: while ($unsub =~ m{mailto:[^@\s>]+\@([\w.-]+)}gi) { 2113: push @unsub_domains, lc $1; 2114: } 2115: my %unsub_seen; 2116: for my $dom (grep { !$unsub_seen{$_}++ } @unsub_domains) { 2117: my $pa = $self->_provider_abuse_for_host($dom); 2118: if ($pa) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2119: $add->( 2120: role => "ESP / bulk sender (List-Unsubscribe: $dom)", 2121: address => $pa->{email}, 2122: note => "$pa->{note} -- responsible for this bulk delivery", 2123: via => 'provider-table', 2124: ); 2125: } 2126: } 2127: } 2128: 2129: # Route 7 -- Reply addresses embedded in the message body โ—2130 โ†’ 2132 โ†’ 2146โ—2130 โ†’ 2132 โ†’ 0 2130: my %body_addr_seen; 2131: my $combined_body = $self->{_body_plain} . "\n" . $self->{_body_html}; 2132: for my $addr_dom ($self->_domains_from_text($combined_body)) { 2133: next if $body_addr_seen{$addr_dom}++; 2134: my $pa = $self->_provider_abuse_for_host($addr_dom); 2135: next unless $pa && $pa->{email}; 2136: my ($example_addr) = $combined_body =~ /(\S+\@\Q$addr_dom\E)/i; 2137: $example_addr //= "\@$addr_dom"; 2138: $add->( 2139: role => "Reply address in body ($example_addr)", 2140: address => $pa->{email}, 2141: note => $pa->{note}, 2142: via => 'provider-table', 2143: ); 2144: } 2145: โ—2146 โ†’ 2146 โ†’ 0 2146: return @contacts;
Mutants (Total: 2, Killed: 0, Survived: 2)
2147: } 2148: 2149: # ----------------------------------------------------------------------- 2150: # Public: form contacts (providers that require web-form submission) 2151: # ----------------------------------------------------------------------- 2152: 2153: =head2 form_contacts() 2154: 2155: Returns the list of parties that require abuse reports via a web form 2156: rather than email. These are providers whose C<%PROVIDER_ABUSE> entry 2157: has a C<form> key. Each hashref includes the form URL, paste 2158: instructions, upload instructions, and the discovery role. 2159: 2160: =head3 Usage 2161: 2162: my @forms = $analyser->form_contacts(); 2163: for my $c (@forms) { 2164: printf "Open: %s\n", $c->{form}; 2165: } 2166: 2167: =head3 Arguments 2168: 2169: None. C<parse_email()> must have been called first. 2170: 2171: =head3 Returns 2172: 2173: A list of hashrefs, one per unique form contact. Each hashref has keys 2174: C<form>, C<role>, C<note>, C<form_paste> (optional), C<form_upload> 2175: (optional), and C<via>. Returns an empty list if no form contacts are found. 2176: 2177: =head3 Side Effects 2178: 2179: Triggers C<originating_ip()>, C<embedded_urls()>, and C<mailto_domains()> 2180: if not already cached. 2181: 2182: =head3 Notes 2183: 2184: Deduplication is by form URL. 2185: 2186: =head3 API Specification 2187: 2188: =head4 Input 2189: 2190: [] 2191: 2192: =head4 Output 2193: 2194: ( 2195: { 2196: type => 'hashref', 2197: keys => { 2198: form => { type => 'scalar', regex => qr{^https?://} }, 2199: role => { type => 'scalar' }, 2200: note => { type => 'scalar' }, 2201: form_paste => { type => 'scalar', optional => 1 }, 2202: form_upload => { type => 'scalar', optional => 1 }, 2203: via => { type => 'scalar' }, 2204: }, 2205: }, 2206: ... 2207: ) 2208: 2209: =head3 FORMAL SPECIFICATION 2210: 2211: -- Z notation 2212: form_contacts == [ 2213: Xi Email::Abuse::Investigator; 2214: result! : seq FORM_CONTACT_INFO 2215: ] 2216: post: forall c : result! @ c.form =~ m{^https?://} /\ 2217: forall c1, c2 : result! @ c1 /= c2 => c1.form /= c2.form 2218: 2219: =cut 2220: 2221: sub form_contacts { โ—2222 โ†’ 2237 โ†’ 2252โ—2222 โ†’ 2237 โ†’ 0 2222: my $self = $_[0]; 2223: 2224: my (@contacts, %seen); 2225: 2226: # Inner closure: add one form-contact entry, deduplicating by form URL 2227: my $add = sub { 2228: my (%args) = @_; 2229: my $form = $args{form} // ''; 2230: return unless $form; 2231: return if $seen{$form}++; 2232: push @contacts, \%args; 2233: }; 2234: 2235: # Route 1 -- Sending ISP 2236: my $orig = $self->originating_ip(); 2237: if ($orig) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2238: my $pa = $self->_provider_abuse_for_ip($orig->{ip}, $orig->{rdns}); 2239: if ($pa && $pa->{form}) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2240: $add->( 2241: role => 'Sending ISP', 2242: form => $pa->{form}, 2243: note => $pa->{note} // '', 2244: form_paste => $pa->{form_paste} // '', 2245: form_upload => $pa->{form_upload} // '', 2246: via => 'provider-table', 2247: ); 2248: } 2249: } 2250: 2251: # Route 2 -- URL hosts โ—2252 โ†’ 2253 โ†’ 2270โ—2252 โ†’ 2253 โ†’ 0 2252: my %url_host_seen; 2253: for my $u ($self->embedded_urls()) { 2254: next if $url_host_seen{ $u->{host} }++; 2255: my $pa = $self->_provider_abuse_for_host($u->{host}); 2256: if ($pa && $pa->{form}) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2257: $add->( 2258: role => "URL host: $u->{host}", 2259: form => $pa->{form}, 2260: form_domain => $u->{host}, 2261: note => $pa->{note} // '', 2262: form_paste => $pa->{form_paste} // '', 2263: form_upload => $pa->{form_upload} // '', 2264: via => 'provider-table', 2265: ); 2266: } 2267: } 2268: 2269: # Route 3 -- Contact domains (web host + registrar) โ—2270 โ†’ 2270 โ†’ 2304โ—2270 โ†’ 2270 โ†’ 0 2270: for my $d ($self->mailto_domains()) { 2271: my $dom = $d->{domain}; 2272: my $pa = $self->_provider_abuse_for_host($dom); 2273: if ($pa && $pa->{form}) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2274: $add->( 2275: role => "Web host of $dom", 2276: form => $pa->{form}, 2277: form_domain => $dom, 2278: note => $pa->{note} // '', 2279: form_paste => $pa->{form_paste} // '', 2280: form_upload => $pa->{form_upload} // '', 2281: via => 'provider-table', 2282: ); 2283: } 2284: 2285: # Registrar identified via WHOIS -- check for form-only registrar 2286: if ($d->{registrar_abuse} && $d->{registrar_abuse} =~ /\@([\w.-]+)/) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2287: my $reg_domain = lc $1; 2288: my $rpa = $self->_provider_abuse_for_host($reg_domain); 2289: if ($rpa && $rpa->{form}) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2290: $add->( 2291: role => "Domain registrar for $dom (web form only)", 2292: form => $rpa->{form}, 2293: form_domain => $dom, 2294: note => $rpa->{note} // '', 2295: form_paste => $rpa->{form_paste} // '', 2296: form_upload => $rpa->{form_upload} // '', 2297: via => 'provider-table', 2298: ); 2299: } 2300: } 2301: } 2302: 2303: # Route 4 -- Account provider headers โ—2304 โ†’ 2304 โ†’ 2327โ—2304 โ†’ 2304 โ†’ 0 2304: for my $hname (qw(from reply-to return-path sender)) { 2305: my $val = $self->_header_value($hname) // next; 2306: my $addr_spec = ($val =~ /<([^>]*)>\s*$/) ? $1 : $val; 2307: my ($addr_domain) = $addr_spec =~ /\@([\w.-]+)/; 2308: next unless $addr_domain; 2309: # Skip SRS forwarder rewrite addresses 2310: next if $addr_spec =~ /\+SRS[0-9]?=/i; 2311: my $pa = $self->_provider_abuse_for_host($addr_domain); 2312: if ($pa && $pa->{form}) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2313: my $role_addr = $addr_spec =~ /@/ ? $addr_spec : $val; 2314: $role_addr =~ s/^\s+|\s+$//g; 2315: $add->( 2316: role => "Account provider ($hname: $role_addr)", 2317: form => $pa->{form}, 2318: note => $pa->{note} // '', 2319: form_paste => $pa->{form_paste} // '', 2320: form_upload => $pa->{form_upload} // '', 2321: via => 'provider-table', 2322: ); 2323: } 2324: } 2325: 2326: # Route 5 -- DKIM signer โ—2327 โ†’ 2328 โ†’ 2343โ—2327 โ†’ 2328 โ†’ 0 2327: my $auth = $self->_parse_auth_results_cached(); 2328: if ($auth->{dkim_domain}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2329: my $pa = $self->_provider_abuse_for_host($auth->{dkim_domain}); 2330: if ($pa && $pa->{form}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2331: $add->( 2332: role => "DKIM signer: $auth->{dkim_domain}", 2333: form => $pa->{form}, 2334: note => $pa->{note} // '', 2335: form_paste => $pa->{form_paste} // '', 2336: form_upload => $pa->{form_upload} // '', 2337: via => 'provider-table', 2338: ); 2339: } 2340: } 2341: 2342: # Route 6 -- List-Unsubscribe ESP domains โ—2343 โ†’ 2344 โ†’ 2364โ—2343 โ†’ 2344 โ†’ 0 2343: my $unsub = $self->_header_value('list-unsubscribe'); 2344: if ($unsub) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2345: my @unsub_domains; 2346: while ($unsub =~ m{https?://([^/:?\s>]+)}gi) { push @unsub_domains, lc $1 } 2347: while ($unsub =~ m{mailto:[^@\s>]+\@([\w.-]+)}gi) { push @unsub_domains, lc $1 } 2348: my %useen; 2349: for my $dom (grep { !$useen{$_}++ } @unsub_domains) { 2350: my $pa = $self->_provider_abuse_for_host($dom); 2351: if ($pa && $pa->{form}) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2352: $add->( 2353: role => "ESP / bulk sender (List-Unsubscribe: $dom)", 2354: form => $pa->{form}, 2355: note => $pa->{note} // '', 2356: form_paste => $pa->{form_paste} // '', 2357: form_upload => $pa->{form_upload} // '', 2358: via => 'provider-table', 2359: ); 2360: } 2361: } 2362: } 2363: โ—2364 โ†’ 2364 โ†’ 0 2364: return @contacts;

Mutants (Total: 2, Killed: 0, Survived: 2)
2365: } 2366: 2367: # ----------------------------------------------------------------------- 2368: # Public: full analyst report 2369: # ----------------------------------------------------------------------- 2370: 2371: =head2 report() 2372: 2373: Produces a comprehensive, analyst-facing plain-text report covering all 2374: findings: envelope fields, risk assessment, originating host, sending 2375: software, received chain tracking IDs, embedded URLs, contact domain 2376: intelligence, and recommended abuse contacts. 2377: 2378: Use C<report()> for human review or ticketing systems. Use 2379: C<abuse_report_text()> for sending to ISP abuse desks. 2380: 2381: =head3 Usage 2382: 2383: print $analyser->report(); 2384: 2385: open my $fh, '>', 'report.txt' or croak "Cannot open: $!"; 2386: print $fh $analyser->report(); 2387: close $fh; 2388: 2389: =head3 Arguments 2390: 2391: None. C<parse_email()> must have been called first. 2392: 2393: =head3 Returns 2394: 2395: A plain scalar string, newline-terminated, Unix line endings. Never empty 2396: or undef. 2397: 2398: =head3 Side Effects 2399: 2400: Triggers all analysis methods if not already cached. 2401: 2402: =head3 Notes 2403: 2404: The report is idempotent: calling it multiple times on the same object 2405: always returns an identical string. All user-derived content is sanitised 2406: before output. 2407: 2408: =head3 API Specification 2409: 2410: =head4 Input 2411: 2412: [] 2413: 2414: =head4 Output 2415: 2416: { type => 'scalar' } 2417: 2418: =head3 FORMAL SPECIFICATION 2419: 2420: -- Z notation 2421: report == [ 2422: Xi Email::Abuse::Investigator; 2423: result! : STRING 2424: ] 2425: post: result! /= '' /\ result! ends_with '\n' 2426: 2427: =cut 2428: 2429: sub report { โ—2430 โ†’ 2441 โ†’ 2449โ—2430 โ†’ 2441 โ†’ 0 2430: my $self = $_[0]; 2431: 2432: my @out; 2433: 2434: # Banner header 2435: push @out, '=' x 72; 2436: push @out, " Email::Abuse::Investigator Report (v$VERSION)"; 2437: push @out, '=' x 72; 2438: push @out, ''; 2439: 2440: # Envelope summary -- decode MIME encoded-words for readability 2441: for my $f (qw(from reply-to return-path subject date message-id)) { 2442: my $v = $self->_header_value($f); 2443: next unless defined $v; 2444: my $decoded = $self->_decode_mime_words($v); 2445: my $label = ucfirst($f); 2446: push @out, sprintf(' %-14s : %s', $label, 2447: _sanitise_output($decoded ne $v ? "$decoded [encoded: $v]" : $v)); 2448: } โ—2449 โ†’ 2454 โ†’ 2461โ—2449 โ†’ 2454 โ†’ 0 2449: push @out, ''; 2450: 2451: # Risk assessment section 2452: my $risk = $self->risk_assessment(); 2453: push @out, "[ RISK ASSESSMENT: $risk->{level} (score: $risk->{score}) ]"; 2454: if (@{ $risk->{flags} }) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2455: for my $f (@{ $risk->{flags} }) { 2456: push @out, " [$f->{severity}] " . _sanitise_output($f->{detail}); 2457: } 2458: } else { 2459: push @out, ' (no specific red flags detected)'; 2460: } โ—2461 โ†’ 2466 โ†’ 2477โ—2461 โ†’ 2466 โ†’ 0 2461: push @out, ''; 2462: 2463: # Originating host section 2464: push @out, '[ ORIGINATING HOST ]'; 2465: my $orig = $self->originating_ip(); 2466: if ($orig) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2467: push @out, ' IP : ' . _sanitise_output($orig->{ip}); 2468: push @out, ' Reverse DNS : ' . _sanitise_output($orig->{rdns}) if $orig->{rdns}; 2469: push @out, ' Country : ' . _sanitise_output($orig->{country}) if $orig->{country}; 2470: push @out, ' Organisation : ' . _sanitise_output($orig->{org}) if $orig->{org}; 2471: push @out, ' Abuse addr : ' . _sanitise_output($orig->{abuse}) if $orig->{abuse}; 2472: push @out, " Confidence : $orig->{confidence}"; 2473: push @out, ' Note : ' . _sanitise_output($orig->{note}) if $orig->{note}; 2474: } else { 2475: push @out, ' (could not determine originating IP)'; 2476: } โ—2477 โ†’ 2481 โ†’ 2491โ—2477 โ†’ 2481 โ†’ 0 2477: push @out, ''; 2478: 2479: # Sending software section (omitted if none found) 2480: my @sw = $self->sending_software(); 2481: if (@sw) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2482: push @out, '[ SENDING SOFTWARE / INFRASTRUCTURE CLUES ]'; 2483: for my $s (@sw) { 2484: push @out, sprintf(' %-14s : %s', $s->{header}, _sanitise_output($s->{value})); 2485: push @out, " Note : $s->{note}"; 2486: push @out, ''; 2487: } 2488: } 2489: 2490: # Received chain tracking IDs (only hops with id or for are shown) โ—2491 โ†’ 2493 โ†’ 2506โ—2491 โ†’ 2493 โ†’ 0 2491: my @trail = grep { defined $_->{id} || defined $_->{for} } 2492: $self->received_trail(); 2493: if (@trail) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2494: push @out, '[ RECEIVED CHAIN TRACKING IDs ]'; 2495: push @out, ' (Supply these to the relevant ISP abuse team to trace the session)'; 2496: push @out, ''; 2497: for my $hop (@trail) { 2498: push @out, ' IP : ' . (_sanitise_output($hop->{ip}) // '(unknown)'); 2499: push @out, ' Envelope for : ' . _sanitise_output($hop->{for}) if $hop->{for}; 2500: push @out, ' Server ID : ' . _sanitise_output($hop->{id}) if $hop->{id}; 2501: push @out, ''; 2502: } 2503: } 2504: 2505: # Embedded URLs section -- grouped by hostname โ—2506 โ†’ 2508 โ†’ 2551โ—2506 โ†’ 2508 โ†’ 0 2506: push @out, '[ EMBEDDED HTTP/HTTPS URLs ]'; 2507: my @urls = $self->embedded_urls(); 2508: if (@urls) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2509: my (%host_order, %host_meta, %host_paths); 2510: my $seq = 0; 2511: for my $u (@urls) { 2512: my $h = $u->{host}; 2513: unless (exists $host_order{$h}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2514: $host_order{$h} = $seq++; 2515: $host_meta{$h} = { 2516: ip => $u->{ip}, 2517: org => $u->{org}, 2518: abuse => $u->{abuse}, 2519: country => $u->{country}, 2520: }; 2521: } 2522: push @{ $host_paths{$h} }, $u->{url}; 2523: } 2524: 2525: # Output each host group in first-seen order 2526: for my $h (sort { $host_order{$a} <=> $host_order{$b} } keys %host_order) { 2527: my $m = $host_meta{$h}; 2528: my $bare = lc $h; $bare =~ s/^www\.//; 2529: push @out, ' Host : ' . _sanitise_output($h) . 2530: (($URL_SHORTENERS{$bare} || $self->{url_shorteners}->{$bare}) 2531: ? ' *** URL SHORTENER -- real destination hidden ***' : ''); 2532: push @out, ' IP : ' . _sanitise_output($m->{ip}) if $m->{ip}; 2533: push @out, ' Country : ' . _sanitise_output($m->{country}) if $m->{country}; 2534: push @out, ' Organisation : ' . _sanitise_output($m->{org}) if $m->{org}; 2535: push @out, ' Abuse addr : ' . _sanitise_output($m->{abuse}) if $m->{abuse}; 2536: my @paths = @{ $host_paths{$h} }; 2537: if (@paths == 1) {
Mutants (Total: 2, Killed: 0, Survived: 2)
2538: push @out, ' URL : ' . _sanitise_output($paths[0]); 2539: } else { 2540: push @out, ' URLs (' . scalar(@paths) . ') :'; 2541: push @out, ' ' . _sanitise_output($_) for @paths; 2542: } 2543: push @out, ''; 2544: } 2545: } else { 2546: push @out, ' (none found)'; 2547: push @out, ''; 2548: } 2549: 2550: # Contact / reply-to domains section โ—2551 โ†’ 2553 โ†’ 2593โ—2551 โ†’ 2553 โ†’ 0 2551: push @out, '[ CONTACT / REPLY-TO DOMAINS ]'; 2552: my @mdoms = $self->mailto_domains(); 2553: if (@mdoms) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2554: for my $d (@mdoms) { 2555: push @out, ' Domain : ' . _sanitise_output($d->{domain}); 2556: push @out, ' Found in : ' . _sanitise_output($d->{source}); 2557: if ($d->{recently_registered}) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2558: push @out, ' *** WARNING: RECENTLY REGISTERED - possible phishing domain ***'; 2559: } 2560: push @out, ' Registered : ' . $d->{registered} if $d->{registered}; 2561: push @out, ' Expires : ' . $d->{expires} if $d->{expires}; 2562: push @out, ' Registrar : ' . _sanitise_output($d->{registrar}) if $d->{registrar}; 2563: push @out, ' Reg. abuse : ' . _sanitise_output($d->{registrar_abuse}) if $d->{registrar_abuse}; 2564: if ($d->{web_ip}) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2565: push @out, ' Web host IP : ' . _sanitise_output($d->{web_ip}); 2566: push @out, ' Web host org : ' . _sanitise_output($d->{web_org}) if $d->{web_org}; 2567: push @out, ' Web abuse : ' . _sanitise_output($d->{web_abuse}) if $d->{web_abuse}; 2568: } else { 2569: push @out, ' Web host : (no A record / unreachable)'; 2570: } 2571: if ($d->{mx_host}) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2572: push @out, ' MX host : ' . _sanitise_output($d->{mx_host}); 2573: push @out, ' MX IP : ' . _sanitise_output($d->{mx_ip}) if $d->{mx_ip}; 2574: push @out, ' MX org : ' . _sanitise_output($d->{mx_org}) if $d->{mx_org}; 2575: push @out, ' MX abuse : ' . _sanitise_output($d->{mx_abuse}) if $d->{mx_abuse}; 2576: } else { 2577: push @out, ' MX host : (none found)'; 2578: } 2579: if ($d->{ns_host}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2580: push @out, ' NS host : ' . _sanitise_output($d->{ns_host}); 2581: push @out, ' NS IP : ' . _sanitise_output($d->{ns_ip}) if $d->{ns_ip}; 2582: push @out, ' NS org : ' . _sanitise_output($d->{ns_org}) if $d->{ns_org}; 2583: push @out, ' NS abuse : ' . _sanitise_output($d->{ns_abuse}) if $d->{ns_abuse}; 2584: } 2585: push @out, ''; 2586: } 2587: } else { 2588: push @out, ' (none found)'; 2589: push @out, ''; 2590: } 2591: 2592: # Abuse contacts summary โ—2593 โ†’ 2595 โ†’ 2609โ—2593 โ†’ 2595 โ†’ 0 2593: push @out, '[ WHERE TO SEND ABUSE REPORTS ]'; 2594: my @contacts = $self->abuse_contacts(); 2595: if (@contacts) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2596: for my $c (@contacts) { 2597: push @out, ' Role : ' . _sanitise_output($c->{role}); 2598: push @out, ' Send to : ' . _sanitise_output($c->{address}); 2599: push @out, ' Note : ' . _sanitise_output($c->{note}) if $c->{note}; 2600: push @out, " Discovered : $c->{via}"; 2601: push @out, ''; 2602: } 2603: } else { 2604: push @out, ' (no abuse contacts could be determined)'; 2605: push @out, ''; 2606: } 2607: 2608: # Web-form contacts (providers that require manual form submission) โ—2609 โ†’ 2610 โ†’ 2642โ—2609 โ†’ 2610 โ†’ 0 2609: my @form_cs = $self->form_contacts(); 2610: if (@form_cs) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2611: push @out, '[ WHERE TO FILE WEB-FORM REPORTS ]'; 2612: push @out, ' The following parties require manual submission via a web form.'; 2613: push @out, ' Open each URL in a browser, then follow the instructions below it.'; 2614: push @out, ''; 2615: for my $c (@form_cs) { 2616: push @out, ' Role : ' . _sanitise_output($c->{role}); 2617: push @out, ' Form URL : ' . _sanitise_output($c->{form}); 2618: push @out, ' Domain/URL : ' . _sanitise_output($c->{form_domain}) if $c->{form_domain}; 2619: push @out, ' Note : ' . _sanitise_output($c->{note}) if $c->{note}; 2620: if ($c->{form_paste}) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2621: # Word-wrap the paste hint at ROLE_WRAP_LEN characters 2622: my $hint = $c->{form_paste}; 2623: my @words = split /\s+/, $hint; 2624: my (@lines, $line); 2625: for my $w (@words) { 2626: if (defined $line && length("$line $w") > $ROLE_WRAP_LEN) {

Mutants (Total: 4, Killed: 0, Survived: 4)
2627: push @lines, $line; 2628: $line = $w; 2629: } else { 2630: $line = defined $line ? "$line $w" : $w; 2631: } 2632: } 2633: push @lines, $line if defined $line; 2634: push @out, ' Paste : ' . shift @lines if @lines; 2635: push @out, ' ' . $_ for @lines; 2636: } 2637: push @out, ' Upload : ' . _sanitise_output($c->{form_upload}) if $c->{form_upload}; 2638: push @out, ''; 2639: } 2640: } 2641: โ—2642 โ†’ 2643 โ†’ 0 2642: push @out, '=' x 72; 2643: return join("\n", @out) . "\n"; 2644: } 2645: 2646: # ----------------------------------------------------------------------- 2647: # Private: output sanitisation 2648: # ----------------------------------------------------------------------- 2649: 2650: # _sanitise_output( $str ) -> $str 2651: # 2652: # Purpose: 2653: # Strip control characters that could affect terminal rendering or HTML 2654: # injection from any string that will appear in a report or abuse email. 2655: # Preserves printable ASCII, high-bytes (for UTF-8 content), tabs, and 2656: # line endings. 2657: # 2658: # Entry criteria: 2659: # $str -- a defined or undef scalar. 2660: # 2661: # Exit status: 2662: # Returns the sanitised string, or the empty string if $str is undef. 2663: # 2664: # Notes: 2665: # Only strips C0 control characters below 0x20 (except \t) and the DEL 2666: # character (0x7F). High bytes (0x80-0xFF) are preserved because they 2667: # form valid UTF-8 multi-byte sequences in headers and body text. 2668: 2669: sub _sanitise_output { 2670: my ($str) = @_; 2671: return '' unless defined $str;
Mutants (Total: 2, Killed: 0, Survived: 2)
2672: # Remove C0 controls (except tab) and DEL 2673: $str =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]//g; 2674: return $str;
Mutants (Total: 2, Killed: 0, Survived: 2)
2675: } 2676: 2677: # ----------------------------------------------------------------------- 2678: # Private: message parsing 2679: # ----------------------------------------------------------------------- 2680: 2681: # _split_message( $text ) 2682: # 2683: # Purpose: 2684: # Split a raw RFC 2822 email into headers and body, parse all headers, 2685: # decode the body (including multipart), extract sending-software 2686: # fingerprints, and populate per-hop tracking data. 2687: # 2688: # Entry criteria: 2689: # $text -- defined scalar, already dereferenced by parse_email(). 2690: # $self->{_sending_sw} and $self->{_rcvd_tracking} reset to [] by caller. 2691: # 2692: # Exit status: 2693: # Returns undef silently if the header block is empty/whitespace-only. 2694: # Otherwise all results are communicated via side effects on $self. 2695: # 2696: # Side effects: 2697: # Populates _headers, _received, _body_plain, _body_html, _sending_sw, 2698: # and _rcvd_tracking. 2699: # 2700: # Notes: 2701: # Delegates to _decode_multipart() for multipart/* content types. 2702: # Lines not matching the header pattern are silently discarded. 2703: # Boundary extraction uses a simple regex; missing boundary causes the 2704: # body to be skipped silently. 2705: 2706: sub _split_message { โ—2707 โ†’ 2720 โ†’ 2725โ—2707 โ†’ 2720 โ†’ 0 2707: my ($self, $text) = @_; 2708: 2709: # Split at the first blank line (RFC 2822 header/body separator) 2710: my ($header_block, $body_raw) = split /\r?\n\r?\n/, $text, 2; 2711: 2712: return unless defined $header_block && $header_block =~ /\S/; 2713: $body_raw //= ''; 2714: 2715: # Unfold RFC 2822 continuation lines (s2.2.3) 2716: $header_block =~ s/\r?\n([ \t]+)/ $1/g; 2717: 2718: # Parse each header line into a { name, value } pair 2719: my @headers; 2720: for my $line (split /\r?\n/, $header_block) { 2721: if ($line =~ /^([\w-]+)\s*:\s*(.*)/) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2722: push @headers, { name => lc($1), value => $2 }; 2723: } 2724: } โ—2725 โ†’ 2740 โ†’ 2750โ—2725 โ†’ 2740 โ†’ 0 2725: $self->{_headers} = \@headers; 2726: 2727: # Collect all Received: header values (most-recent first, as in message) 2728: $self->{_received} = [ 2729: map { $_->{value} } 2730: grep { $_->{name} eq 'received' } @headers 2731: ]; 2732: 2733: # Determine content type and transfer encoding from top-level headers 2734: my ($ct_h) = grep { $_->{name} eq 'content-type' } @headers; 2735: my ($cte_h) = grep { $_->{name} eq 'content-transfer-encoding' } @headers; 2736: my $ct = defined $ct_h ? $ct_h->{value} : ''; 2737: my $cte = defined $cte_h ? $cte_h->{value} : ''; 2738: 2739: # Decode multipart or single-part body as appropriate 2740: if ($ct =~ /multipart/i) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2741: my ($boundary) = $ct =~ /boundary="?([^";]+)"?/i; 2742: # Pass depth=0 to enforce the MAX_MULTIPART_DEPTH recursion guard 2743: $self->_decode_multipart($body_raw, $boundary, 0) if $boundary; 2744: } else { 2745: my $decoded = $self->_decode_body($body_raw, $cte); 2746: if ($ct =~ /html/i) { $self->{_body_html} = $decoded }

Mutants (Total: 1, Killed: 1, Survived: 0)

2747: else { $self->{_body_plain} = $decoded } 2748: } 2749: โ—2750 โ†’ 2764 โ†’ 2776โ—2750 โ†’ 2764 โ†’ 0 2750: $self->_debug(sprintf 'Parsed %d headers, %d Received lines', 2751: scalar @headers, scalar @{ $self->{_received} }); 2752: 2753: # --- Sending software fingerprints --- 2754: # These headers identify the mailer or shared-hosting script that sent 2755: # the message; invaluable for shared-hosting abuse reports. 2756: my %sw_notes = ( 2757: 'x-php-originating-script' => 'PHP script on shared hosting -- report to hosting abuse team', 2758: 'x-source' => 'Source file on shared hosting -- report to hosting abuse team', 2759: 'x-source-host' => 'Sending hostname injected by shared hosting provider', 2760: 'x-source-args' => 'Command-line args injected by shared hosting provider', 2761: 'x-mailer' => 'Email client or bulk-mailer identifier', 2762: 'user-agent' => 'Email client identifier', 2763: ); 2764: for my $sw_hdr (sort keys %sw_notes) { 2765: my ($h) = grep { $_->{name} eq $sw_hdr } @headers; 2766: next unless $h; 2767: push @{ $self->{_sending_sw} }, { 2768: header => $sw_hdr, 2769: value => $h->{value}, 2770: note => $sw_notes{$sw_hdr}, 2771: }; 2772: } 2773: 2774: # --- Per-hop tracking IDs from Received: chain --- 2775: # Walk oldest-first (reverse) so _rcvd_tracking is oldest-first โ—2776 โ†’ 2776 โ†’ 0 2776: for my $rcvd (reverse @{ $self->{_received} }) { 2777: my $ip = $self->_extract_ip_from_received($rcvd); 2778: my ($for_addr) = $rcvd =~ /\bfor\s+<?([^\s>]+\@[\w.-]+\.[\w]+)>?/i; 2779: my ($srv_id) = $rcvd =~ /\bid\s+([\w.-]+)/i; 2780: # Skip hops with no actionable tracking data 2781: next unless defined $ip || defined $for_addr || defined $srv_id; 2782: push @{ $self->{_rcvd_tracking} }, { 2783: received => $rcvd, 2784: ip => $ip, 2785: for => $for_addr, 2786: id => $srv_id, 2787: }; 2788: } 2789: } 2790: 2791: # _decode_multipart( $body, $boundary, $depth ) 2792: # 2793: # Purpose: 2794: # Recursively split a MIME multipart body on its boundary and decode each 2795: # text/plain and text/html part. Nested multipart/* containers are 2796: # recursed into up to MAX_MULTIPART_DEPTH levels deep. 2797: # 2798: # Entry criteria: 2799: # $body -- the raw body text of the multipart container. 2800: # $boundary -- the boundary string from the Content-Type header. 2801: # $depth -- current recursion depth (starts at 0 from _split_message). 2802: # 2803: # Exit status: 2804: # Returns undef if $depth >= MAX_MULTIPART_DEPTH (recursion guard). 2805: # Otherwise all results via side effects. 2806: # 2807: # Side effects: 2808: # Appends decoded text to $self->{_body_plain} and $self->{_body_html}. 2809: # 2810: # Notes: 2811: # Whitespace-only MIME segments between boundaries are silently skipped. 2812: # Decoding errors are silenced; raw bytes are used as fallback. 2813: 2814: sub _decode_multipart { โ—2815 โ†’ 2820 โ†’ 2827โ—2815 โ†’ 2820 โ†’ 0 2815: my ($self, $body, $boundary, $depth) = @_; 2816: $depth //= 0; 2817: 2818: # Enforce the recursion depth limit to prevent stack exhaustion on 2819: # pathological crafted messages with deeply nested multipart structures. 2820: if ($depth >= $MAX_MULTIPART_DEPTH) {

Mutants (Total: 4, Killed: 4, Survived: 0)

2821: Carp::carp 'Email::Abuse::Investigator: multipart nesting depth limit', 2822: "($MAX_MULTIPART_DEPTH) exceeded; stopping recursion"; 2823: return; 2824: } 2825: 2826: # Split on the boundary marker; the (?:--)? suffix handles closing boundary โ—2827 โ†’ 2829 โ†’ 0 2827: my @parts = split /--\Q$boundary\E(?:--)?/, $body; 2828: 2829: for my $part (@parts) { 2830: # Skip whitespace-only segments between boundaries 2831: next unless $part =~ /\S/; 2832: 2833: $part =~ s/^\r?\n//; 2834: 2835: # Each MIME part has its own headers separated from body by a blank line 2836: my ($phdr_block, $pbody) = split /\r?\n\r?\n/, $part, 2; 2837: next unless defined $pbody; 2838: 2839: # Unfold continuation header lines within this part 2840: $phdr_block =~ s/\r?\n([ \t]+)/ $1/g; 2841: 2842: # Parse this part's headers into a simple hash 2843: my %phdr; 2844: for my $line (split /\r?\n/, $phdr_block) { 2845: $phdr{ lc($1) } = $2 if $line =~ /^([\w-]+)\s*:\s*(.*)/; 2846: } 2847: 2848: my $pct = $phdr{'content-type'} // ''; 2849: my $pcte = $phdr{'content-transfer-encoding'} // ''; 2850: 2851: # Nested multipart/* must be recursed into; without this URLs in 2852: # multipart/alternative inside multipart/mixed would be missed. 2853: if ($pct =~ /multipart/i) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2854: my ($inner_boundary) = $pct =~ /boundary\s*=\s*"?([^";]+)"?/i; 2855: if ($inner_boundary) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2856: $inner_boundary =~ s/\s+$//; 2857: # Increment depth counter for the recursion guard 2858: $self->_decode_multipart($pbody, $inner_boundary, $depth + 1); 2859: } 2860: next; 2861: } 2862: 2863: # Decode transfer encoding and accumulate by content type 2864: my $decoded = $self->_decode_body($pbody, $pcte); 2865: if ($pct =~ /text\/html/i) { $self->{_body_html} .= $decoded }

Mutants (Total: 1, Killed: 1, Survived: 0)

2866: elsif ($pct =~ /text/i || !$pct) { $self->{_body_plain} .= $decoded } 2867: } 2868: } 2869: 2870: # _decode_body( $body, $cte ) -> string 2871: # 2872: # Purpose: 2873: # Decode a MIME body part according to its Content-Transfer-Encoding. 2874: # 2875: # Entry criteria: 2876: # $body -- raw body string (may be undef). 2877: # $cte -- Content-Transfer-Encoding value string (may be undef). 2878: # 2879: # Exit status: 2880: # Returns the decoded string, or the original string if the encoding is 2881: # 7bit/8bit/binary or unrecognised. 2882: # 2883: # Notes: 2884: # decode_qp and decode_base64 are imported from MIME:: modules; errors 2885: # from malformed content are silenced by the eval wrappers they provide. 2886: 2887: sub _decode_body { 2888: my ($self, $body, $cte) = @_; 2889: $cte //= ''; 2890: return decode_qp($body) if $cte =~ /quoted-printable/i; 2891: return decode_base64($body) if $cte =~ /base64/i; 2892: return $body // '';

Mutants (Total: 2, Killed: 0, Survived: 2)
2893: } 2894: 2895: # ----------------------------------------------------------------------- 2896: # Private: Received-chain -> originating IP 2897: # ----------------------------------------------------------------------- 2898: 2899: # _find_origin() 2900: # 2901: # Purpose: 2902: # Walk the Received: chain (oldest-first) to find the first external IP, 2903: # or fall back to X-Originating-IP. Enrich with rDNS and WHOIS. 2904: # 2905: # Entry criteria: 2906: # $self->{_received} populated by _split_message(). 2907: # $self->{trusted_relays} set by new(). 2908: # 2909: # Exit status: 2910: # Returns { ip, rdns, org, abuse, country, confidence, note } on success. 2911: # Returns undef if no usable IP can be identified. 2912: # 2913: # Side effects: 2914: # Network I/O via _enrich_ip(): one PTR lookup, one RDAP/WHOIS query. 2915: # Results are also stored in the CHI cross-message cache if available. 2916: # 2917: # Notes: 2918: # confidence 'high' = 2+ distinct external IPs; 2919: # 'medium' = exactly one external IP; 2920: # 'low' = taken from X-Originating-IP. 2921: 2922: sub _find_origin { โ—2923 โ†’ 2928 โ†’ 2936โ—2923 โ†’ 2928 โ†’ 0 2923: my $self = $_[0]; 2924: 2925: my @candidates; 2926: 2927: # Walk oldest-first (reverse) to collect external IPs 2928: for my $hdr (reverse @{ $self->{_received} }) { 2929: my $ip = $self->_extract_ip_from_received($hdr) // next; 2930: next if $self->_is_private($ip); 2931: next if $self->_is_trusted($ip); 2932: push @candidates, $ip; 2933: } 2934: 2935: # Fall back to X-Originating-IP if no external IPs in Received: chain โ—2936 โ†’ 2936 โ†’ 2948โ—2936 โ†’ 2936 โ†’ 0 2936: unless (@candidates) {
Mutants (Total: 1, Killed: 0, Survived: 1)
2937: my $xoip = $self->_header_value('x-originating-ip'); 2938: if ($xoip) {

Mutants (Total: 1, Killed: 1, Survived: 0)

2939: $xoip =~ s/[\[\]\s]//g; 2940: return $self->_enrich_ip($xoip, 'low',

Mutants (Total: 2, Killed: 0, Survived: 2)
2941: 'Taken from X-Originating-IP (webmail, unverified)') 2942: unless $self->_is_private($xoip); 2943: } 2944: return undef;
Mutants (Total: 2, Killed: 0, Survived: 2)
2945: } 2946: 2947: # Report the oldest (first) external IP; confidence depends on count โ—2948 โ†’ 2948 โ†’ 0 2948: return $self->_enrich_ip(
Mutants (Total: 2, Killed: 0, Survived: 2)
2949: $candidates[0], 2950: @candidates > 1 ? 'high' : 'medium',

Mutants (Total: 3, Killed: 3, Survived: 0)

2951: 'First external hop in Received: chain', 2952: ); 2953: } 2954: 2955: # _extract_ip_from_received( $hdr ) -> ipv4_or_ipv6_string | undef 2956: # 2957: # Purpose: 2958: # Extract the most-significant IP address from a raw Received: header 2959: # value, trying patterns in priority order. Supports both IPv4 dotted- 2960: # quad and IPv6 bracket notation. 2961: # 2962: # Entry criteria: 2963: # $hdr -- a defined Received: header value string. 2964: # 2965: # Exit status: 2966: # Returns the IP string on success, undef if no IP can be extracted. 2967: # 2968: # Notes: 2969: # IPv4 addresses are validated (all octets <= 255). 2970: # IPv6 addresses are returned as-is if they contain colons. 2971: 2972: sub _extract_ip_from_received { โ—2973 โ†’ 2974 โ†’ 2987โ—2973 โ†’ 2974 โ†’ 0 2973: my ($self, $hdr) = @_; 2974: for my $re (@RECEIVED_IP_RE) { 2975: if ($hdr =~ $re) {

Mutants (Total: 1, Killed: 0, Survived: 1)
2976: my $ip = $1; 2977: 2978: # Accept IPv6 addresses (contain colons) without further validation 2979: return $ip if $ip =~ /:/;
Mutants (Total: 2, Killed: 0, Survived: 2)
2980: 2981: # Validate IPv4 format and octet range 2982: next unless $ip =~ /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/; 2983: next if grep { $_ > 255 } split /\./, $ip;

Mutants (Total: 3, Killed: 3, Survived: 0)

2984: return $ip;

Mutants (Total: 2, Killed: 0, Survived: 2)
2985: } 2986: } โ—2987 โ†’ 2987 โ†’ 0 2987: return undef;
Mutants (Total: 2, Killed: 0, Survived: 2)
2988: } 2989: 2990: # _is_private( $ip ) -> bool 2991: # 2992: # Purpose: 2993: # Test whether an IP address falls in any private, reserved, or special- 2994: # use range (IPv4 or IPv6) that should never be reported as a spam origin. 2995: # 2996: # Entry criteria: 2997: # $ip -- a scalar IP string (IPv4 or IPv6); may be undef. 2998: # 2999: # Exit status: 3000: # Returns 1 (true) if the IP is private/reserved, 0 (false) otherwise. 3001: # Returns 1 for undef or empty strings. 3002: # 3003: # Notes: 3004: # Uses the module-level @PRIVATE_RANGES array of pre-compiled regexes. 3005: # Covers all ranges listed in RFC 1122, 1918, 5737, 6598, and RFC 4193. 3006: 3007: sub _is_private { โ—3008 โ†’ 3010 โ†’ 3011โ—3008 โ†’ 3010 โ†’ 0 3008: my ($self, $ip) = @_; 3009: return 1 unless defined $ip && $ip ne '';

Mutants (Total: 2, Killed: 2, Survived: 0)

3010: for my $re (@PRIVATE_RANGES) { return 1 if $ip =~ $re }

Mutants (Total: 2, Killed: 0, Survived: 2)
โ—3011 โ†’ 3011 โ†’ 0 3011: return 0;

Mutants (Total: 2, Killed: 2, Survived: 0)

3012: } 3013: 3014: # _is_trusted( $ip ) -> bool 3015: # 3016: # Purpose: 3017: # Test whether an IP address matches any entry in the caller-supplied 3018: # trusted_relays list (exact IP or CIDR block). 3019: # 3020: # Entry criteria: 3021: # $ip -- a defined IPv4 address string. 3022: # $self->{trusted_relays} -- arrayref of exact IPs or CIDR strings. 3023: # 3024: # Exit status: 3025: # Returns 1 (true) if the IP matches any trusted relay, 0 otherwise. 3026: 3027: sub _is_trusted { โ—3028 โ†’ 3029 โ†’ 3032โ—3028 โ†’ 3029 โ†’ 0 3028: my ($self, $ip) = @_; 3029: for my $cidr (@{ $self->{trusted_relays} }) { 3030: return 1 if $self->_ip_in_cidr($ip, $cidr);

Mutants (Total: 2, Killed: 0, Survived: 2)
3031: } โ—3032 โ†’ 3032 โ†’ 0 3032: return 0;

Mutants (Total: 2, Killed: 2, Survived: 0)

3033: } 3034: 3035: # ----------------------------------------------------------------------- 3036: # Private: HTTP/HTTPS URL extraction and resolution 3037: # ----------------------------------------------------------------------- 3038: 3039: # _extract_and_resolve_urls() -> arrayref of url hashrefs 3040: # 3041: # Purpose: 3042: # Extract all HTTP/HTTPS URLs from the decoded body, resolve each unique 3043: # hostname to an IP, and enrich with WHOIS/RDAP data. Optionally uses 3044: # AnyEvent::DNS to parallelise the DNS resolution step. 3045: # 3046: # Entry criteria: 3047: # $self->{_body_plain} and $self->{_body_html} populated by _split_message(). 3048: # 3049: # Exit status: 3050: # Returns an arrayref of url hashrefs (possibly empty). 3051: # 3052: # Side effects: 3053: # Network I/O per unique hostname: one A/AAAA lookup, one RDAP/WHOIS. 3054: # Results stored in the CHI cross-message cache if available. 3055: 3056: sub _extract_and_resolve_urls { โ—3057 โ†’ 3067 โ†’ 3073โ—3057 โ†’ 3067 โ†’ 0 3057: my $self = $_[0]; 3058: my (%url_seen, %host_cache); 3059: my @results; 3060: my $combined = $self->{_body_plain} . "\n" . $self->{_body_html}; 3061: 3062: # Collect unique URLs from body 3063: my @urls = grep { !$url_seen{$_}++ } $self->_extract_http_urls($combined); 3064: 3065: # Extract unique hostnames for parallel DNS resolution 3066: my %hostname_needed; 3067: for my $url (@urls) { 3068: my ($host) = $url =~ m{https?://([^/:?\s#]+)}i; 3069: $hostname_needed{$host}++ if $host; 3070: } 3071: 3072: # Parallelise DNS lookups if AnyEvent::DNS is available โ—3073 โ†’ 3073 โ†’ 3078โ—3073 โ†’ 3073 โ†’ 0 3073: if ($HAS_ANYEVENT_DNS && scalar(keys %hostname_needed) > 1) {

Mutants (Total: 4, Killed: 0, Survived: 4)
3074: $self->_parallel_resolve_hosts(\%hostname_needed, \%host_cache); 3075: } 3076: 3077: # Process each URL: resolve hostname and WHOIS-enrich โ—3078 โ†’ 3078 โ†’ 3116โ—3078 โ†’ 3078 โ†’ 0 3078: for my $url (@urls) { 3079: my ($host) = $url =~ m{https?://([^/:?\s#]+)}i; 3080: next unless $host; 3081: 3082: # Resolve and WHOIS once per unique hostname, then cache the result 3083: unless (exists $host_cache{$host}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3084: # Check the cross-message CHI cache first 3085: my $cached = $_cache ? $_cache->get("url:$host") : undef; 3086: if ($cached) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3087: $host_cache{$host} = $cached; 3088: } else { 3089: my $ip = $self->_resolve_host($host) // '(unresolved)'; 3090: my $whois = $ip ne '(unresolved)' 3091: ? $self->_whois_ip($ip) 3092: : {}; 3093: 3094: # Fall back to domain WHOIS if IP lookup returned nothing 3095: if (!$whois->{abuse}) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3096: my $reg = _registrable($host) // $host; 3097: my $dw = $self->_parse_domain_whois_abuse($reg); 3098: $whois = $dw if $dw->{abuse}; 3099: } 3100: 3101: my $entry = { 3102: ip => $ip, 3103: org => $whois->{org} // '(unknown)', 3104: abuse => $whois->{abuse} // '(unknown)', 3105: country => $whois->{country} // undef, 3106: }; 3107: $host_cache{$host} = $entry; 3108: 3109: # Store in cross-message cache for reuse across messages 3110: $_cache->set("url:$host", $entry) if $_cache; 3111: } 3112: } 3113: 3114: push @results, { url => $url, host => $host, %{ $host_cache{$host} } }; 3115: } โ—3116 โ†’ 3116 โ†’ 0 3116: return \@results; 3117: } 3118: 3119: # _parallel_resolve_hosts( \%hostnames, \%cache ) 3120: # 3121: # Purpose: 3122: # Resolve multiple hostnames to IPs in parallel using AnyEvent::DNS. 3123: # Populates the cache with resolved IPs so the sequential loop in 3124: # _extract_and_resolve_urls() can skip the DNS step for pre-resolved hosts. 3125: # 3126: # Entry criteria: 3127: # $hostnames_ref -- hashref keyed by hostname (values ignored). 3128: # $cache_ref -- hashref to populate with { ip => '...' } results. 3129: # AnyEvent::DNS must be installed ($HAS_ANYEVENT_DNS is true). 3130: # 3131: # Exit status: 3132: # Returns undef; all results written to %$cache_ref via side effects. 3133: # 3134: # Notes: 3135: # Errors (NXDOMAIN, timeout) are silently swallowed; the sequential 3136: # resolution loop will return '(unresolved)' for those hosts. 3137: 3138: sub _parallel_resolve_hosts { โ—3139 โ†’ 3146 โ†’ 3163โ—3139 โ†’ 3146 โ†’ 0 3139: my ($self, $hostnames_ref, $cache_ref) = @_; 3140: return unless $HAS_ANYEVENT_DNS; 3141: 3142: # Build an AnyEvent condvar to wait for all lookups to complete 3143: my $cv = AnyEvent->condvar; 3144: my $pending = scalar keys %$hostnames_ref; 3145: 3146: for my $host (keys %$hostnames_ref) { 3147: # Fire an async A (and AAAA) query for each hostname 3148: AnyEvent::DNS::resolve( 3149: $host, 'A', 3150: sub { 3151: my @answers = @_; 3152: if (@answers) {

Mutants (Total: 1, Killed: 0, Survived: 1)
3153: # Cache the first A record result 3154: $cache_ref->{$host} = { ip => $answers[0][4] }; 3155: } 3156: # Decrement the pending counter; signal when all done 3157: $cv->send if --$pending <= 0;
Mutants (Total: 3, Killed: 0, Survived: 3)
3158: }, 3159: ); 3160: } 3161: 3162: # Block until all DNS queries complete (subject to AnyEvent's own timeouts) โ—[NOT COVERED] 3163 โ†’ 3163 โ†’ 0 3163: $cv->recv; 3164: } 3165: 3166: # _extract_http_urls( $body ) -> list of url strings 3167: # 3168: # Purpose: 3169: # Extract all HTTP and HTTPS URLs from a body string, using both 3170: # structural HTML parsing (if HTML::LinkExtor is available) and a 3171: # plain-text regex pass. Deduplicates and strips trailing punctuation. 3172: # 3173: # Entry criteria: 3174: # $body -- combined plain+HTML body string. 3175: # 3176: # Exit status: 3177: # Returns a list of URL strings (possibly empty), deduplicated. 3178: 3179: sub _extract_http_urls { โ—3180 โ†’ 3184 โ†’ 3201โ—3180 โ†’ 3184 โ†’ 0 3180: my ($self, $body) = @_; 3181: my @urls; 3182: 3183: # Structural HTML link extraction (handles quoted attributes correctly) 3184: if ($HAS_HTML_LINKEXTOR) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3185: my $p = HTML::LinkExtor->new(sub { 3186: my ($tag, %attrs) = @_; 3187: for my $attr (qw(href src action)) { 3188: my $val = $attrs{$attr} // ''; 3189: if ($val =~ m{^https?://}i) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3190: push @urls, $val; 3191: } elsif ($val =~ m{^//[\w.-]}) { 3192: # Protocol-relative -- assume https 3193: push @urls, 'https:' . $val; 3194: } 3195: } 3196: }); 3197: $p->parse($body); 3198: } 3199: 3200: # Plain-text regex pass for bare URLs not in HTML attributes โ—3201 โ†’ 3201 โ†’ 3206โ—3201 โ†’ 3201 โ†’ 0 3201: while ($body =~ m{(https?://[^\s<>"'\)\]]+)}gi) { 3202: push @urls, $1; 3203: } 3204: 3205: # Protocol-relative URLs not caught above โ—3206 โ†’ 3206 โ†’ 3211โ—3206 โ†’ 3206 โ†’ 0 3206: while ($body =~ m{(?:^|[\s"'=])(//[\w.-][^\s<>"'\)\]]*)}gim) { 3207: push @urls, 'https:' . $1; 3208: } 3209: 3210: # Deduplicate and strip trailing punctuation โ—3211 โ†’ 3214 โ†’ 0 3211: my %seen; 3212: my @all = grep { !$seen{$_}++ } @urls; 3213: s/[.,;:!?\)>\]]+$// for @all; 3214: return @all;
Mutants (Total: 2, Killed: 0, Survived: 2)
3215: } 3216: 3217: # ----------------------------------------------------------------------- 3218: # Private: domain extraction and full analysis 3219: # ----------------------------------------------------------------------- 3220: 3221: # _extract_and_analyse_domains() -> arrayref of domain hashrefs 3222: # 3223: # Purpose: 3224: # Collect all non-infrastructure contact domains from headers and body, 3225: # run the full domain intelligence pipeline on each, and return an arrayref 3226: # suitable for storage in $self->{_mailto_domains}. 3227: # 3228: # Entry criteria: 3229: # _split_message() must have been called. 3230: # 3231: # Exit status: 3232: # Always returns an arrayref; never undef. 3233: # 3234: # Side effects: 3235: # Network I/O per domain via _analyse_domain(). 3236: # Results stored in $self->{_domain_info} and CHI cache. 3237: 3238: sub _extract_and_analyse_domains { โ—3239 โ†’ 3244 โ†’ 3254โ—3239 โ†’ 3244 โ†’ 0 3239: my $self = $_[0]; 3240: my (%seen, @domains_with_source); 3241: 3242: # Build a set of recipient domains to exclude (victims, not senders) 3243: my %recipient_domains; 3244: for my $hname (qw(to cc)) { 3245: my $val = $self->_header_value($hname) // next; 3246: for my $dom ($self->_domains_from_text($val)) { 3247: my $reg = _registrable($dom) // $dom; 3248: $recipient_domains{$dom}++; 3249: $recipient_domains{$reg}++; 3250: } 3251: } 3252: 3253: # Also exclude domains from Received: "for" envelope recipients โ—3254 โ†’ 3254 โ†’ 3263โ—3254 โ†’ 3254 โ†’ 0 3254: for my $hop (@{ $self->{_rcvd_tracking} }) { 3255: next unless $hop->{for} && $hop->{for} =~ /\@([\w.-]+)/; 3256: my $dom = lc $1; 3257: my $reg = _registrable($dom) // $dom; 3258: $recipient_domains{$dom}++; 3259: $recipient_domains{$reg}++; 3260: } 3261: 3262: # Inner closure: record a domain if it passes all filters โ—3263 โ†’ 3285 โ†’ 3292โ—3263 โ†’ 3285 โ†’ 0 3263: my $record = sub { 3264: my ($dom, $source) = @_; 3265: $dom = lc $dom; 3266: $dom =~ s/\.$//; 3267: next if $self->{trusted_domains}->{$dom}; 3268: return if $TRUSTED_DOMAINS{$dom}; 3269: return if $recipient_domains{$dom}; 3270: return if $recipient_domains{ _registrable($dom) // $dom }; 3271: # Discard non-routable hostnames (single-label, pseudo-TLDs, etc.) 3272: return unless $dom =~ /\.[a-zA-Z]{2,}$/; 3273: return if $dom =~ /\.(?:local|internal|lan|localdomain|arpa)$/i; 3274: return if $seen{$dom}++; 3275: push @domains_with_source, { domain => $dom, source => $source }; 3276: }; 3277: 3278: # Collect from standard sender/reply headers 3279: my %header_sources = ( 3280: 'from' => 'From: header', 3281: 'reply-to' => 'Reply-To: header', 3282: 'return-path' => 'Return-Path: header', 3283: 'sender' => 'Sender: header', 3284: ); 3285: for my $hname (sort keys %header_sources) { 3286: my $val = $self->_header_value($hname) // next; 3287: $record->($_, $header_sources{$hname}) 3288: for $self->_domains_from_text($val); 3289: } 3290: 3291: # Message-ID domain often reveals the real bulk-sending platform โ—3292 โ†’ 3293 โ†’ 3301โ—3292 โ†’ 3293 โ†’ 0 3292: my $mid = $self->_header_value('message-id'); 3293: if ($mid && $mid =~ /\@([\w.-]+)/) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3294: my $mid_dom = lc $1; 3295: my $mid_reg = _registrable($mid_dom) // $mid_dom; 3296: $record->($mid_dom, 'Message-ID: header') 3297: unless $TRUSTED_DOMAINS{$mid_dom} || $TRUSTED_DOMAINS{$mid_reg} || $self->{trusted_domains}->{$mid_dom} || $self->{trusted_domains}->{$mid_reg}; 3298: } 3299: 3300: # DKIM signing domain(s) -- the organisation that vouches for the message โ—3301 โ†’ 3302 โ†’ 3307โ—3301 โ†’ 3302 โ†’ 0 3301: my $auth = $self->_parse_auth_results_cached(); 3302: for my $dkim_d (@{ $auth->{dkim_domains} // [] }) { 3303: $record->($dkim_d, 'DKIM-Signature: d= (signing domain)'); 3304: } 3305: 3306: # List-Unsubscribe identifies the ESP or bulk sender โ—3307 โ†’ 3308 โ†’ 3318โ—3307 โ†’ 3308 โ†’ 0 3307: my $unsub = $self->_header_value('list-unsubscribe'); 3308: if ($unsub) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3309: while ($unsub =~ m{https?://([^/:?\s>]+)}gi) { 3310: $record->(lc $1, 'List-Unsubscribe: header'); 3311: } 3312: while ($unsub =~ m{mailto:[^@\s>]+\@([\w.-]+)}gi) { 3313: $record->(lc $1, 'List-Unsubscribe: header'); 3314: } 3315: } 3316: 3317: # Body email addresses (mailto: and bare user@domain forms) โ—3318 โ†’ 3324 โ†’ 3328โ—3318 โ†’ 3324 โ†’ 0 3318: my $combined = $self->{_body_plain} . "\n" . $self->{_body_html}; 3319: $record->($_, 'email address / mailto in body') 3320: for $self->_domains_from_text($combined); 3321: 3322: # Run the full intelligence pipeline on each collected domain 3323: my @results; 3324: for my $entry (@domains_with_source) { 3325: my $info = $self->_analyse_domain($entry->{domain}); 3326: push @results, { %$entry, %$info }; 3327: } โ—3328 โ†’ 3328 โ†’ 0 3328: return \@results; 3329: } 3330: 3331: # _domains_from_text( $text ) -> list of domain strings 3332: # 3333: # Purpose: 3334: # Extract unique domain names from mailto: links and bare user@domain 3335: # addresses in a block of text. 3336: # 3337: # Entry criteria: 3338: # $text -- a defined scalar of decoded body or header text. 3339: # 3340: # Exit status: 3341: # Returns a list of lower-cased domain strings (possibly empty). 3342: 3343: sub _domains_from_text { โ—3344 โ†’ 3348 โ†’ 3354โ—3344 โ†’ 3348 โ†’ 0 3344: my ($self, $text) = @_; 3345: my (%seen, @out); 3346: 3347: # mailto: links (including HTML-entity-encoded @ signs from QP) 3348: while ($text =~ /mailto:(?:[^@\s<>"]+)@([\w.-]+)/gi) { 3349: my $dom = lc $1; $dom =~ s/\.$//; 3350: push @out, $dom unless $seen{$dom}++; 3351: } 3352: 3353: # Bare user@domain patterns โ—3354 โ†’ 3354 โ†’ 3358โ—3354 โ†’ 3354 โ†’ 0 3354: while ($text =~ /\b[\w.+%-]+@([\w.-]+\.[a-zA-Z]{2,})\b/g) { 3355: my $dom = lc $1; $dom =~ s/\.$//; 3356: push @out, $dom unless $seen{$dom}++; 3357: } โ—3358 โ†’ 3358 โ†’ 0 3358: return @out;

Mutants (Total: 2, Killed: 0, Survived: 2)
3359: } 3360: 3361: # _analyse_domain( $domain ) -> hashref 3362: # 3363: # Purpose: 3364: # Run the complete intelligence pipeline for a single domain: A record 3365: # (web hosting), MX record (mail hosting), NS record (DNS hosting), 3366: # and WHOIS (registrar, creation/expiry dates, abuse contact). 3367: # Each IP is enriched via RDAP/WHOIS. Results are cached per domain 3368: # in $self->{_domain_info} and in the CHI cross-message cache. 3369: # 3370: # Entry criteria: 3371: # $domain -- lower-cased, no trailing dot, not in TRUSTED_DOMAINS. 3372: # $self->{timeout} used for all network operations. 3373: # 3374: # Exit status: 3375: # Always returns a hashref reference; never undef; may be empty ({}). 3376: # Possible keys: web_ip, web_org, web_abuse, mx_host, mx_ip, mx_org, 3377: # mx_abuse, ns_host, ns_ip, ns_org, ns_abuse, registrar, 3378: # registrar_abuse, registered, expires, recently_registered, whois_raw. 3379: # 3380: # Side effects: 3381: # Network I/O; writes result to $self->{_domain_info}{$domain} and CHI. 3382: # 3383: # Notes: 3384: # MX/NS lookups require Net::DNS; absent without it. 3385: # recently_registered is set to 1 (not 0) when the threshold is met. 3386: # whois_raw is truncated to WHOIS_RAW_MAX bytes. 3387: 3388: sub _analyse_domain { โ—3389 โ†’ 3396 โ†’ 3404โ—3389 โ†’ 3396 โ†’ 0 3389: my ($self, $domain) = @_; 3390: 3391: # Return the per-message cached result if already analysed 3392: return $self->{_domain_info}{$domain}
Mutants (Total: 2, Killed: 0, Survived: 2)
3393: if $self->{_domain_info}{$domain}; 3394: 3395: # Check the cross-message CHI cache before hitting the network 3396: if ($_cache) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3397: my $cached = $_cache->get("dom:$domain"); 3398: if ($cached) {

Mutants (Total: 1, Killed: 0, Survived: 1)
3399: $self->{_domain_info}{$domain} = $cached; 3400: return $cached;
Mutants (Total: 2, Killed: 0, Survived: 2)
3401: } 3402: } 3403: โ—3404 โ†’ 3409 โ†’ 3417โ—3404 โ†’ 3409 โ†’ 0 3404: $self->_debug("Analysing domain: $domain"); 3405: my %info; 3406: 3407: # --- A record -> web hosting IP --- 3408: my $web_ip = $self->_resolve_host($domain); 3409: if ($web_ip) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3410: $info{web_ip} = $web_ip; 3411: my $w = $self->_whois_ip($web_ip); 3412: $info{web_org} = $w->{org} if $w->{org}; 3413: $info{web_abuse} = $w->{abuse} if $w->{abuse}; 3414: } 3415: 3416: # MX and NS lookups require Net::DNS โ—3417 โ†’ 3417 โ†’ 3460โ—3417 โ†’ 3417 โ†’ 0 3417: if ($HAS_NET_DNS) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3418: my $res = Net::DNS::Resolver->new( 3419: tcp_timeout => $self->{timeout}, 3420: udp_timeout => $self->{timeout}, 3421: ); 3422: 3423: # --- MX record -> mail hosting --- 3424: my $mxq = $res->search($domain, 'MX'); 3425: if ($mxq) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3426: my ($best) = sort { $a->preference <=> $b->preference } 3427: grep { $_->type eq 'MX' } $mxq->answer; 3428: if ($best) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3429: (my $mx_host = lc $best->exchange) =~ s/\.$//; 3430: $info{mx_host} = $mx_host; 3431: my $mx_ip = $self->_resolve_host($mx_host); 3432: if ($mx_ip) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3433: $info{mx_ip} = $mx_ip; 3434: my $mw = $self->_whois_ip($mx_ip); 3435: $info{mx_org} = $mw->{org} if $mw->{org}; 3436: $info{mx_abuse} = $mw->{abuse} if $mw->{abuse}; 3437: } 3438: } 3439: } 3440: 3441: # --- NS record -> DNS hosting --- 3442: my $nsq = $res->search($domain, 'NS'); 3443: if ($nsq) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3444: my ($first) = grep { $_->type eq 'NS' } $nsq->answer; 3445: if ($first) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3446: (my $ns_host = lc $first->nsdname) =~ s/\.$//; 3447: $info{ns_host} = $ns_host; 3448: my $ns_ip = $self->_resolve_host($ns_host); 3449: if ($ns_ip) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3450: $info{ns_ip} = $ns_ip; 3451: my $nw = $self->_whois_ip($ns_ip); 3452: $info{ns_org} = $nw->{org} if $nw->{org}; 3453: $info{ns_abuse} = $nw->{abuse} if $nw->{abuse}; 3454: } 3455: } 3456: } 3457: } 3458: 3459: # --- Domain WHOIS -> registrar + dates --- โ—3460 โ†’ 3461 โ†’ 3513โ—3460 โ†’ 3461 โ†’ 0 3460: my $domain_whois = $self->_domain_whois($domain); 3461: if ($domain_whois) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3462: # Truncate raw WHOIS for storage but parse structured fields from full text 3463: $info{whois_raw} = substr($domain_whois, 0, $WHOIS_RAW_MAX); 3464: 3465: # Registrar name 3466: if ($domain_whois =~ /Registrar:\s*(.+)/i) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3467: ($info{registrar} = $1) =~ s/\s+$//; 3468: } 3469: 3470: # Registrar abuse contact email (try multiple field names) 3471: for my $pat ( 3472: qr/Registrar Abuse Contact Email:\s*(\S+@\S+)/i, 3473: qr/Abuse Contact Email:\s*(\S+@\S+)/i, 3474: qr/abuse-contact:\s*(\S+@\S+)/i, 3475: ) { 3476: if (!$info{registrar_abuse} && $domain_whois =~ $pat) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3477: ($info{registrar_abuse} = $1) =~ s/\s+$//; 3478: } 3479: } 3480: 3481: # Domain creation date (multiple registrar field name variations) 3482: for my $pat ( 3483: qr/Creation Date:\s*(\S+)/i, 3484: qr/Created(?:\s+On)?:\s*(\S+)/i, 3485: qr/Registration Time:\s*(\S+)/i, 3486: qr/^registered:\s*(\S+)/im, 3487: ) { 3488: if (!$info{registered} && $domain_whois =~ $pat) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3489: ($info{registered} = $1) =~ s/[TZ].*//; 3490: } 3491: } 3492: 3493: # Domain expiry date 3494: for my $pat ( 3495: qr/Registry Expiry Date:\s*(\S+)/i, 3496: qr/Expir(?:y|ation)(?: Date)?:\s*(\S+)/i, 3497: qr/paid-till:\s*(\S+)/i, 3498: ) { 3499: if (!$info{expires} && $domain_whois =~ $pat) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3500: ($info{expires} = $1) =~ s/[TZ].*//; 3501: } 3502: } 3503: 3504: # Flag recently-registered domains (< RECENT_REG_DAYS old) 3505: if ($info{registered}) {

Mutants (Total: 1, Killed: 0, Survived: 1)
3506: my $epoch = $self->_parse_date_to_epoch($info{registered}); 3507: $info{recently_registered} = 1 3508: if $epoch && (time() - $epoch) < $RECENT_REG_DAYS * $SECS_PER_DAY;
Mutants (Total: 3, Killed: 0, Survived: 3)
3509: } 3510: } 3511: 3512: # Store in per-message and cross-message caches โ—3513 โ†’ 3516 โ†’ 0 3513: $self->{_domain_info}{$domain} = \%info; 3514: $_cache->set("dom:$domain", \%info) if $_cache; 3515: 3516: return \%info; 3517: } 3518: 3519: # ----------------------------------------------------------------------- 3520: # Private: DNS helpers 3521: # ----------------------------------------------------------------------- 3522: 3523: # _resolve_host( $host ) -> ip_string | undef 3524: # 3525: # Purpose: 3526: # Resolve a hostname to an IPv4 (or IPv6) address. Uses Net::DNS for 3527: # both A and AAAA queries when available; falls back to inet_aton for 3528: # pure IPv4 resolution. 3529: # 3530: # Entry criteria: 3531: # $host -- hostname string or already-numeric IP. 3532: # 3533: # Exit status: 3534: # Returns the first resolved IP string, or undef on failure. 3535: # 3536: # Notes: 3537: # When the input is already a dotted-quad IPv4 it is returned immediately. 3538: # AAAA records are tried if the A query fails and Net::DNS is available. 3539: 3540: sub _resolve_host { โ—3541 โ†’ 3545 โ†’ 3550โ—3541 โ†’ 3545 โ†’ 0 3541: my ($self, $host) = @_; 3542: return $host if $host =~ /^\d{1,3}(?:\.\d{1,3}){3}$/;
Mutants (Total: 2, Killed: 0, Survived: 2)
3543: 3544: # Check the CHI cache before hitting DNS 3545: if ($_cache) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3546: my $cached_ip = $_cache->get("resolve:$host"); 3547: return $cached_ip if defined $cached_ip;
Mutants (Total: 2, Killed: 0, Survived: 2)
3548: } 3549: โ—3550 โ†’ 3552 โ†’ 3581โ—3550 โ†’ 3552 โ†’ 0 3550: my $ip; 3551: 3552: if ($HAS_NET_DNS) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3553: my $res = Net::DNS::Resolver->new( 3554: tcp_timeout => $self->{timeout}, 3555: udp_timeout => $self->{timeout}, 3556: ); 3557: 3558: # Try A record first, then AAAA for IPv6 3559: for my $type (qw(A AAAA)) { 3560: my $query = $res->search($host, $type); 3561: if ($query) {

Mutants (Total: 1, Killed: 0, Survived: 1)
3562: for my $rr ($query->answer) { 3563: if ($rr->type eq 'A') {

Mutants (Total: 1, Killed: 1, Survived: 0)

3564: $ip = $rr->address; 3565: last; 3566: } elsif ($rr->type eq 'AAAA') { 3567: $ip = $rr->address; 3568: last; 3569: } 3570: } 3571: } 3572: last if defined $ip; 3573: } 3574: } else { 3575: # Fallback: gethostbyname (IPv4 only) 3576: my $packed = eval { inet_aton($host) }; 3577: $ip = $packed ? inet_ntoa($packed) : undef; 3578: } 3579: 3580: # Cache the result (including undef as '' to avoid repeated failed lookups) โ—3581 โ†’ 3581 โ†’ 3585โ—3581 โ†’ 3581 โ†’ 0 3581: if ($_cache) {

Mutants (Total: 1, Killed: 0, Survived: 1)
3582: $_cache->set("resolve:$host", $ip // ''); 3583: } 3584: โ—3585 โ†’ 3585 โ†’ 0 3585: return $ip;
Mutants (Total: 2, Killed: 0, Survived: 2)
3586: } 3587: 3588: # _reverse_dns( $ip ) -> hostname | undef 3589: # 3590: # Purpose: 3591: # Perform a PTR (reverse DNS) lookup for an IP address. Supports both 3592: # IPv4 and IPv6 via Net::DNS when available; falls back to gethostbyaddr. 3593: # 3594: # Entry criteria: 3595: # $ip -- a defined IPv4 or IPv6 address string. 3596: # 3597: # Exit status: 3598: # Returns the PTR hostname string, or undef if no record exists. 3599: 3600: sub _reverse_dns { โ—3601 โ†’ 3604 โ†’ 3616โ—3601 โ†’ 3604 โ†’ 0 3601: my ($self, $ip) = @_; 3602: return undef unless $ip;
Mutants (Total: 2, Killed: 0, Survived: 2)
3603: 3604: if ($HAS_NET_DNS) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3605: my $res = Net::DNS::Resolver->new(tcp_timeout => $self->{timeout}); 3606: my $query = $res->search($ip, 'PTR'); 3607: if ($query) {

Mutants (Total: 1, Killed: 0, Survived: 1)
3608: for my $rr ($query->answer) { 3609: return $rr->ptrdname if $rr->type eq 'PTR';

Mutants (Total: 2, Killed: 2, Survived: 0)

3610: } 3611: } 3612: return undef;

Mutants (Total: 2, Killed: 0, Survived: 2)
3613: } 3614: 3615: # Fallback for IPv4 only โ—3616 โ†’ 3616 โ†’ 0 3616: return scalar gethostbyaddr(inet_aton($ip), AF_INET); 3617: } 3618: 3619: # ----------------------------------------------------------------------- 3620: # Private: WHOIS / RDAP 3621: # ----------------------------------------------------------------------- 3622: 3623: # _whois_ip( $ip ) -> hashref 3624: # 3625: # Purpose: 3626: # Enrich an IP address with organisation name, abuse contact, and country 3627: # code. Tries RDAP first (if LWP is available), then falls back to raw 3628: # WHOIS via IANA referral. Results are cached in CHI if available. 3629: # 3630: # Entry criteria: 3631: # $ip -- a defined IPv4 or IPv6 address string. 3632: # 3633: # Exit status: 3634: # Returns { org, abuse, country } hashref; keys absent when unknown. 3635: 3636: sub _whois_ip { โ—3637 โ†’ 3640 โ†’ 3645โ—3637 โ†’ 3640 โ†’ 0 3637: my ($self, $ip) = @_; 3638: 3639: # Check CHI cache before going to the network 3640: if ($_cache) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3641: my $cached = $_cache->get("whois_ip:$ip"); 3642: return $cached if $cached;

Mutants (Total: 2, Killed: 2, Survived: 0)

3643: } 3644: โ—3645 โ†’ 3648 โ†’ 3658โ—3645 โ†’ 3648 โ†’ 0 3645: my $result = $HAS_LWP ? $self->_rdap_lookup($ip) : {}; 3646: 3647: # Fall back to raw WHOIS if RDAP returned no organisation 3648: unless ($result->{org}) {

Mutants (Total: 1, Killed: 0, Survived: 1)
3649: my $raw = $self->_raw_whois($ip, 'whois.iana.org'); 3650: if ($raw) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3651: my ($ref) = $raw =~ /whois:\s*([\w.-]+)/i; 3652: my $detail = $ref ? $self->_raw_whois($ip, $ref) : $raw; 3653: $result = $self->_parse_whois_text($detail) if $detail; 3654: } 3655: } 3656: 3657: # Cache the enrichment result โ—3658 โ†’ 3660 โ†’ 0 3658: $_cache->set("whois_ip:$ip", $result) if $_cache && $result; 3659: 3660: return $result;

Mutants (Total: 2, Killed: 0, Survived: 2)
3661: } 3662: 3663: # _domain_whois( $domain ) -> raw_whois_string | undef 3664: # 3665: # Purpose: 3666: # Perform a two-step WHOIS lookup for a domain: first ask IANA for the 3667: # TLD's authoritative WHOIS server, then query that server. 3668: # 3669: # Entry criteria: 3670: # $domain -- a lower-cased domain name string. 3671: # 3672: # Exit status: 3673: # Returns the raw WHOIS response string, or undef on failure. 3674: 3675: sub _domain_whois { 3676: my ($self, $domain) = @_; 3677: my $iana = $self->_raw_whois($domain, 'whois.iana.org') // return undef; 3678: my ($server) = $iana =~ /whois:\s*([\w.-]+)/i; 3679: return undef unless $server;
Mutants (Total: 2, Killed: 0, Survived: 2)
3680: return $self->_raw_whois($domain, $server);
Mutants (Total: 2, Killed: 0, Survived: 2)
3681: } 3682: 3683: # _parse_domain_whois_abuse( $domain ) -> hashref 3684: # 3685: # Purpose: 3686: # Lightweight domain WHOIS lookup to extract only registrar name and 3687: # abuse contact. Used as a fallback in _extract_and_resolve_urls() when 3688: # a URL host cannot be resolved to an IP. 3689: # 3690: # Entry criteria: 3691: # $domain -- a registrable domain name string. 3692: # 3693: # Exit status: 3694: # Returns { org, abuse } hashref; empty hashref on failure. 3695: 3696: sub _parse_domain_whois_abuse { โ—3697 โ†’ 3700 โ†’ 3704โ—3697 โ†’ 3700 โ†’ 0 3697: my ($self, $domain) = @_; 3698: my $raw = $self->_domain_whois($domain) // return {}; 3699: my %info; 3700: if ($raw =~ /Registrar:\s*(.+)/i) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3701: ($info{org} = $1) =~ s/\s+$//; 3702: } 3703: # Try multiple field name patterns for the abuse email โ—3704 โ†’ 3704 โ†’ 3713โ—3704 โ†’ 3704 โ†’ 0 3704: for my $pat ( 3705: qr/Registrar Abuse Contact Email:\s*(\S+\@\S+)/i, 3706: qr/Abuse Contact Email:\s*(\S+\@\S+)/i, 3707: qr/abuse-contact:\s*(\S+\@\S+)/i, 3708: ) { 3709: if (!$info{abuse} && $raw =~ $pat) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3710: ($info{abuse} = $1) =~ s/\s+$//; 3711: } 3712: } โ—3713 โ†’ 3713 โ†’ 0 3713: return \%info; 3714: } 3715: 3716: # _rdap_lookup( $ip ) -> hashref 3717: # 3718: # Purpose: 3719: # Query the ARIN RDAP API for IP block ownership information. RDAP is 3720: # preferred over raw WHOIS because it returns structured JSON. 3721: # 3722: # Entry criteria: 3723: # $ip -- a defined IPv4 or IPv6 address string. 3724: # LWP::UserAgent must be installed. 3725: # 3726: # Exit status: 3727: # Returns { org, abuse, country } hashref; empty hashref on failure. 3728: 3729: sub _rdap_lookup { โ—[NOT COVERED] 3730 โ†’ 3734 โ†’ 3752โ—[NOT COVERED] 3730 โ†’ 3734 โ†’ 0 3730: my ($self, $ip) = @_; 3731: return {} unless $HAS_LWP; 3732: 3733: my $ua = $self->{ua}; 3734: if(!defined($ua)) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3735: $ua = LWP::UserAgent->new( 3736: timeout => $self->{timeout}, 3737: agent => "Email-Abuse-Investigator/$VERSION", 3738: ); 3739: 3740: if($HAS_CONN_CACHE) {

Mutants (Total: 1, Killed: 0, Survived: 1)
3741: my $conn_cache = LWP::ConnCache->new(); 3742: $conn_cache->total_capacity(10); 3743: $ua->conn_cache($conn_cache); 3744: } 3745: 3746: $ua->env_proxy(1); 3747: $self->{ua} = $ua; 3748: } 3749: 3750: # Use the ARIN RDAP endpoint; it covers the ARIN region and redirects 3751: # for RIPE/APNIC/LACNIC/AfriNIC allocations. โ—[NOT COVERED] 3752 โ†’ 3763 โ†’ 3770โ—[NOT COVERED] 3752 โ†’ 3763 โ†’ 0 3752: my $res = eval { $ua->get("https://rdap.arin.net/registry/ip/$ip") }; 3753: return {} unless $res && $res->is_success(); 3754: 3755: my $j = $res->decoded_content(); 3756: my %info; 3757: 3758: # Extract organisation name from the JSON response 3759: $info{org} = $1 if $j =~ /"name"\s*:\s*"([^"]+)"/; 3760: $info{handle} = $1 if $j =~ /"handle"\s*:\s*"([^"]+)"/; 3761: 3762: # Extract abuse email from the vcardArray contact block 3763: if ($j =~ /"abuse".*?"email"\s*:\s*"([^"]+)"/s) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3764: $info{abuse} = $1; 3765: } elsif ($j =~ /"email"\s*:\s*"([^@"]+@[^"]+)"/) { 3766: $info{abuse} = $1; 3767: } 3768: 3769: # Country code from the network's country field โ—[NOT COVERED] 3770 โ†’ 3772 โ†’ 0 3770: $info{country} = $1 if $j =~ /"country"\s*:\s*"([A-Z]{2})"/; 3771: 3772: return \%info; 3773: } 3774: 3775: # _raw_whois( $query, $server ) -> string | undef 3776: # 3777: # Purpose: 3778: # Open a TCP connection to a WHOIS server on port 43, send the query, 3779: # and return the full response as a string. Uses IO::Select for read 3780: # timeouts so that alarm() is never needed (alarm() is unreliable on 3781: # Windows and in threaded Perl). Supports IPv6 WHOIS servers via 3782: # IO::Socket::IP when that module is available. 3783: # 3784: # Entry criteria: 3785: # $query -- the domain name or IP to query (defined, non-empty). 3786: # $server -- the WHOIS server hostname (default: 'whois.iana.org'). 3787: # $self->{timeout} -- seconds used for connect and per-read waits. 3788: # 3789: # Exit status: 3790: # Returns the raw WHOIS response string, or undef on connection/write failure. 3791: # 3792: # Notes: 3793: # Uses IO::Socket::IP (dual-stack) when available, falling back to 3794: # IO::Socket::INET (IPv4 only) otherwise. The IO::Select loop reads 3795: # until the server closes the connection or the per-read timeout expires. 3796: 3797: sub _raw_whois { โ—3798 โ†’ 3826 โ†’ 3838โ—3798 โ†’ 3826 โ†’ 0 3798: my ($self, $query, $server) = @_; 3799: $server //= 'whois.iana.org'; 3800: $self->_debug("WHOIS $server -> $query"); 3801: 3802: # Choose the socket class based on what is installed. 3803: # IO::Socket::IP supports both IPv4 and IPv6 WHOIS servers. 3804: my $sock_class = $HAS_IO_SOCKET_IP ? 'IO::Socket::IP' : 'IO::Socket::INET'; 3805: 3806: # Attempt TCP connection to port 43 on the WHOIS server 3807: my $sock = eval { 3808: $sock_class->new( 3809: PeerAddr => $server, 3810: PeerPort => $WHOIS_PORT, 3811: Proto => 'tcp', 3812: Timeout => $self->{timeout}, 3813: ); 3814: }; 3815: return undef unless $sock;

Mutants (Total: 2, Killed: 0, Survived: 2)
3816: 3817: # Send the WHOIS query in wire format (CRLF-terminated per RFC 3912) 3818: $sock->print("$query\r\n") or do { $sock->close(); return undef };
Mutants (Total: 2, Killed: 0, Survived: 2)
3819: 3820: # Use IO::Select to implement per-read timeouts without alarm() 3821: my $sel = IO::Select->new($sock); 3822: my $response = ''; 3823: my $buf = ''; 3824: 3825: # Read until EOF (server closes) or timeout 3826: while ($sel->can_read($self->{timeout})) { 3827: # Wrap in eval to catch 'Connection reset by peer' thrown by Fatal/autodie 3828: my $n = eval { sysread($sock, $buf, $WHOIS_READ_CHUNK) }; 3829: 3830: if ($@ || !defined $n || $n <= 0) {
Mutants (Total: 4, Killed: 1, Survived: 3)
3831: $self->_debug("WHOIS read failed: $@") if $@; 3832: last; 3833: } 3834: last unless defined $n && $n > 0;
Mutants (Total: 3, Killed: 0, Survived: 3)
3835: $response .= $buf; 3836: } 3837: โ—3838 โ†’ 3839 โ†’ 0 3838: $sock->close(); 3839: return $response || undef;
Mutants (Total: 2, Killed: 0, Survived: 2)
3840: } 3841: 3842: # _parse_whois_text( $text ) -> hashref 3843: # 3844: # Purpose: 3845: # Parse a raw WHOIS IP block response to extract organisation name, 3846: # abuse contact email, and country code. 3847: # 3848: # Entry criteria: 3849: # $text -- a defined WHOIS response string. 3850: # 3851: # Exit status: 3852: # Returns { org, abuse, country } hashref; keys absent when not found. 3853: 3854: sub _parse_whois_text { โ—3855 โ†’ 3860 โ†’ 3870โ—3855 โ†’ 3860 โ†’ 0 3855: my ($self, $text) = @_; 3856: return {} unless $text; 3857: my %info; 3858: 3859: # Try multiple field names for the organisation name 3860: for my $pat ( 3861: qr/^OrgName:\s*(.+)/mi, qr/^org-name:\s*(.+)/mi, 3862: qr/^owner:\s*(.+)/mi, qr/^descr:\s*(.+)/mi, 3863: ) { 3864: if (!$info{org} && $text =~ $pat) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3865: ($info{org} = $1) =~ s/\s+$//; 3866: } 3867: } 3868: 3869: # Try multiple field names for the abuse email โ—3870 โ†’ 3870 โ†’ 3880โ—3870 โ†’ 3870 โ†’ 0 3870: for my $pat ( 3871: qr/OrgAbuseEmail:\s*(\S+@\S+)/mi, 3872: qr/abuse-mailbox:\s*(\S+@\S+)/mi, 3873: ) { 3874: if (!$info{abuse} && $text =~ $pat) {

Mutants (Total: 1, Killed: 0, Survived: 1)
3875: ($info{abuse} = $1) =~ s/\s+$//; 3876: } 3877: } 3878: 3879: # Last-resort: any abuse@ address in the response โ—3880 โ†’ 3883 โ†’ 3886โ—3880 โ†’ 3883 โ†’ 0 3880: $info{abuse} //= $1 if $text =~ /(abuse\@[\w.-]+)/i; 3881: 3882: # Country code (case-insensitive match, normalised to uppercase) 3883: if ($text =~ /^country:\s*([A-Za-z]{2})\s*$/m) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3884: $info{country} = uc $1; 3885: } โ—3886 โ†’ 3886 โ†’ 0 3886: return \%info; 3887: } 3888: 3889: # ----------------------------------------------------------------------- 3890: # Private: authentication results parsing 3891: # ----------------------------------------------------------------------- 3892: 3893: # _parse_auth_results_cached() -> hashref 3894: # 3895: # Purpose: 3896: # Parse the Authentication-Results: header(s) from the message once, 3897: # cache the result, and return it. Extracts SPF, DKIM, DMARC, ARC 3898: # results and the DKIM signing domain(s). 3899: # 3900: # Entry criteria: 3901: # $self->{_headers} populated by _split_message(). 3902: # 3903: # Exit status: 3904: # Returns { spf, dkim, dmarc, arc, dkim_domain, dkim_domains } hashref. 3905: # Keys absent when the corresponding header or field is not present. 3906: 3907: sub _parse_auth_results_cached { โ—3908 โ†’ 3927 โ†’ 3933โ—3908 โ†’ 3927 โ†’ 0 3908: my $self = $_[0]; 3909: return $self->{_auth_results} if $self->{_auth_results};

Mutants (Total: 2, Killed: 2, Survived: 0)

3910: 3911: my %auth; 3912: 3913: # Concatenate all Authentication-Results: header values 3914: my $raw = join('; ', 3915: map { $_->{value} } 3916: grep { $_->{name} eq 'authentication-results' } 3917: @{ $self->{_headers} } 3918: ); 3919: 3920: # Extract individual authentication mechanism results 3921: $auth{spf} = $1 if $raw =~ /\bspf=(\S+)/i; 3922: $auth{dkim} = $1 if $raw =~ /\bdkim=(\S+)/i; 3923: $auth{dmarc} = $1 if $raw =~ /\bdmarc=(\S+)/i; 3924: $auth{arc} = $1 if $raw =~ /\barc=(\S+)/i; 3925: 3926: # Strip trailing punctuation captured by the greedy \S+ 3927: for my $k (qw(spf dkim dmarc arc)) { 3928: $auth{$k} =~ s/[;,\s]+$// if defined $auth{$k}; 3929: } 3930: 3931: # Extract DKIM signing domains from all DKIM-Signature: d= tags. 3932: # Prefer the first domain that matches the provider table (identifies ESP). โ—3933 โ†’ 3934 โ†’ 3940โ—3933 โ†’ 3934 โ†’ 0 3933: my @dkim_domains; 3934: for my $h (grep { $_->{name} eq 'dkim-signature' } @{ $self->{_headers} }) { 3935: if ($h->{value} =~ /\bd=([^;,\s]+)/) {

Mutants (Total: 1, Killed: 1, Survived: 0)

3936: push @dkim_domains, lc $1; 3937: } 3938: } 3939: โ—3940 โ†’ 3940 โ†’ 3953โ—3940 โ†’ 3940 โ†’ 0 3940: if (@dkim_domains) {

Mutants (Total: 1, Killed: 0, Survived: 1)
3941: # Check if any signing domain matches a known provider 3942: my $preferred; 3943: for my $d (@dkim_domains) { 3944: if ($self->_provider_abuse_for_host($d)) {
Mutants (Total: 1, Killed: 0, Survived: 1)
3945: $preferred = $d; 3946: last; 3947: } 3948: } 3949: $auth{dkim_domain} = $preferred // $dkim_domains[0]; 3950: $auth{dkim_domains} = \@dkim_domains; 3951: } 3952: โ—3953 โ†’ 3954 โ†’ 0 3953: $self->{_auth_results} = \%auth; 3954: return \%auth; 3955: } 3956: 3957: # ----------------------------------------------------------------------- 3958: # Private: provider-table lookups 3959: # ----------------------------------------------------------------------- 3960: 3961: # _provider_abuse_for_host( $host ) -> hashref | undef 3962: # 3963: # Purpose: 3964: # Look up a hostname (and each of its parent domains, stripping one label 3965: # at a time from the left) in the %PROVIDER_ABUSE table. 3966: # 3967: # Entry criteria: 3968: # $host -- a defined hostname or domain string. 3969: # 3970: # Exit status: 3971: # Returns the %PROVIDER_ABUSE entry hashref on match, undef otherwise. 3972: 3973: sub _provider_abuse_for_host { โ—3974 โ†’ 3977 โ†’ 3982โ—3974 โ†’ 3977 โ†’ 0 3974: my ($self, $host) = @_; 3975: $host = lc $host; 3976: # Strip successive subdomains until we find a match or exhaust labels 3977: while ($host =~ /\./) { 3978: return $self->{provider_abuse}->{$host} if $self->{provider_abuse}->{$host};
Mutants (Total: 2, Killed: 0, Survived: 2)
3979: return $PROVIDER_ABUSE{$host} if $PROVIDER_ABUSE{$host}; 3980: $host =~ s/^[^.]+\.//; 3981: } โ—3982 โ†’ 3982 โ†’ 0 3982: return undef;

Mutants (Total: 2, Killed: 2, Survived: 0)

3983: } 3984: 3985: # _provider_abuse_for_ip( $ip, $rdns ) -> hashref | undef 3986: # 3987: # Purpose: 3988: # Look up an IP's reverse-DNS hostname in the %PROVIDER_ABUSE table to 3989: # identify well-known provider networks by rDNS pattern. 3990: # 3991: # Entry criteria: 3992: # $ip -- IPv4 or IPv6 address string (used as fallback if $rdns absent). 3993: # $rdns -- optional rDNS hostname string. 3994: # 3995: # Exit status: 3996: # Returns the %PROVIDER_ABUSE entry on match, undef otherwise. 3997: 3998: sub _provider_abuse_for_ip { 3999: my ($self, $ip, $rdns) = @_; 4000: return $self->_provider_abuse_for_host($rdns) if $rdns;

Mutants (Total: 2, Killed: 2, Survived: 0)

4001: return undef;

Mutants (Total: 2, Killed: 2, Survived: 0)

4002: } 4003: 4004: # ----------------------------------------------------------------------- 4005: # Private: eTLD+1 normalisation 4006: # ----------------------------------------------------------------------- 4007: 4008: # _registrable( $host ) -> string | undef 4009: # 4010: # Purpose: 4011: # Return the registrable eTLD+1 form of a hostname. Uses 4012: # Domain::PublicSuffix when installed for accurate results; falls back 4013: # to a built-in heuristic for the common two-letter ccTLD+2 pattern. 4014: # 4015: # Entry criteria: 4016: # $host -- a hostname string (may include subdomains). 4017: # 4018: # Exit status: 4019: # Returns the registrable domain string, or undef for single-label 4020: # hostnames (e.g. 'localhost'). 4021: # 4022: # Notes: 4023: # The heuristic handles co.uk, com.au, net.jp, org.nz etc. but not 4024: # uncommon second-level delegations like ltd.uk or plc.uk. 4025: 4026: sub _registrable { โ—4027 โ†’ 4031 โ†’ 4038โ—4027 โ†’ 4031 โ†’ 0 4027: my ($host) = @_; 4028: return undef unless $host && $host =~ /\./;

Mutants (Total: 2, Killed: 2, Survived: 0)

4029: 4030: # Use Domain::PublicSuffix for accurate PSL-based normalisation 4031: if ($HAS_PUBLIC_SUFFIX) {

Mutants (Total: 1, Killed: 0, Survived: 1)
4032: my $psl = Domain::PublicSuffix->new(); 4033: my $root = $psl->get_root_domain(lc $host); 4034: return $root if $root;
Mutants (Total: 2, Killed: 0, Survived: 2)
4035: } 4036: 4037: # Built-in heuristic fallback โ—4038 โ†’ 4042 โ†’ 4046โ—4038 โ†’ 4042 โ†’ 0 4038: my @labels = split /\./, lc $host; 4039: return $host if @labels <= 2;
Mutants (Total: 5, Killed: 0, Survived: 5)
4040: 4041: # Detect common ccTLD second-level patterns (e.g. co.uk, com.au) 4042: if ($labels[-1] =~ /^[a-z]{2}$/ &&
Mutants (Total: 1, Killed: 0, Survived: 1)
4043: $labels[-2] =~ /^(?:co|com|net|org|gov|edu|ac|me)$/) { 4044: return join('.', @labels[-3..-1]); 4045: } โ—4046 โ†’ 4046 โ†’ 0 4046: return join('.', @labels[-2..-1]); 4047: } 4048: 4049: # ----------------------------------------------------------------------- 4050: # Private: utilities 4051: # ----------------------------------------------------------------------- 4052: 4053: # _enrich_ip( $ip, $confidence, $note ) -> origin hashref 4054: # 4055: # Purpose: 4056: # Perform rDNS and WHOIS/RDAP for a single IP and package the results 4057: # into the standard origin hashref returned by originating_ip(). 4058: # 4059: # Entry criteria: 4060: # $ip -- a defined, non-private IPv4 or IPv6 address string. 4061: # $confidence -- 'high', 'medium', or 'low'. 4062: # $note -- human-readable explanation of why this IP was chosen. 4063: # 4064: # Exit status: 4065: # Returns { ip, rdns, org, abuse, country, confidence, note } hashref. 4066: 4067: sub _enrich_ip { 4068: my ($self, $ip, $confidence, $note) = @_; 4069: my $rdns = $self->_reverse_dns($ip); 4070: my $whois = $self->_whois_ip($ip); 4071: return { 4072: ip => $ip, 4073: rdns => $rdns // '(no reverse DNS)', 4074: org => $whois->{org} // '(unknown)', 4075: abuse => $whois->{abuse} // '(unknown)', 4076: country => $whois->{country} // undef, 4077: confidence => $confidence, 4078: note => $note, 4079: }; 4080: } 4081: 4082: # _header_value( $name ) -> value_string | undef 4083: # 4084: # Purpose: 4085: # Return the value of the first header matching the given lower-cased 4086: # header name. 4087: # 4088: # Entry criteria: 4089: # $name -- a lower-cased header name string. 4090: # $self->{_headers} populated by _split_message(). 4091: # 4092: # Exit status: 4093: # Returns the value string, or undef if the header is not present. 4094: 4095: sub _header_value { โ—4096 โ†’ 4097 โ†’ 4100โ—4096 โ†’ 4097 โ†’ 0 4096: my ($self, $name) = @_; 4097: for my $h (@{ $self->{_headers} }) { 4098: return $h->{value} if $h->{name} eq lc($name);

Mutants (Total: 2, Killed: 2, Survived: 0)

4099: } โ—4100 โ†’ 4100 โ†’ 0 4100: return undef;

Mutants (Total: 2, Killed: 0, Survived: 2)
4101: } 4102: 4103: # _ip_in_cidr( $ip, $cidr ) -> bool 4104: # 4105: # Purpose: 4106: # Test whether an IPv4 address falls within a CIDR block or is an exact 4107: # match (when $cidr contains no '/' separator). 4108: # 4109: # Entry criteria: 4110: # $ip -- a defined dotted-quad IPv4 address string. 4111: # $cidr -- a CIDR string like '10.0.0.0/8' or an exact IP. 4112: # 4113: # Exit status: 4114: # Returns 1 (true) if the IP is within the CIDR block, 0 otherwise. 4115: 4116: sub _ip_in_cidr { 4117: my ($self, $ip, $cidr) = @_; 4118: return $ip eq $cidr unless $cidr =~ m{/};
Mutants (Total: 2, Killed: 0, Survived: 2)
4119: my ($net_addr, $prefix) = split m{/}, $cidr; 4120: return 0 unless defined $prefix && $prefix =~ /^\d+$/ && $prefix <= 32;
Mutants (Total: 5, Killed: 0, Survived: 5)
4121: 4122: # Compute the network mask and compare masked network addresses 4123: my $mask = ~0 << (32 - $prefix); 4124: my $net_n = unpack 'N', (inet_aton($net_addr) // return 0); 4125: my $ip_n = unpack 'N', (inet_aton($ip) // return 0); 4126: return ($ip_n & $mask) == ($net_n & $mask);
Mutants (Total: 1, Killed: 0, Survived: 1)
4127: } 4128: 4129: # _decode_mime_words( $str ) -> decoded_string 4130: # 4131: # Purpose: 4132: # Decode MIME encoded-words (=?charset?B/Q?...?=) in a header value 4133: # string for human-readable display in reports. 4134: # 4135: # Entry criteria: 4136: # $str -- a defined header value string; may be undef. 4137: # 4138: # Exit status: 4139: # Returns the decoded string, or '' if $str is undef. 4140: 4141: sub _decode_mime_words { 4142: my ($self, $str) = @_; 4143: return '' unless defined $str;
Mutants (Total: 2, Killed: 0, Survived: 2)
4144: # Replace each encoded-word with its decoded equivalent 4145: $str =~ s/=\?([^?]+)\?([BbQq])\?([^?]*)\?=/_decode_ew($1,$2,$3)/ge; 4146: return $str;

Mutants (Total: 2, Killed: 2, Survived: 0)

4147: } 4148: 4149: # _decode_ew( $charset, $enc, $text ) -> decoded_bytes 4150: # 4151: # Purpose: 4152: # Decode a single MIME encoded-word component (base64 or quoted-printable). 4153: # 4154: # Notes: 4155: # Non-UTF-8 charsets return raw bytes; good enough for display-name spoof 4156: # detection which only needs ASCII matching. 4157: 4158: sub _decode_ew { โ—4159 โ†’ 4161 โ†’ 4168โ—4159 โ†’ 4161 โ†’ 0 4159: my ($charset, $enc, $text) = @_; 4160: my $raw; 4161: if (uc($enc) eq 'B') {

Mutants (Total: 1, Killed: 0, Survived: 1)
4162: $raw = decode_base64($text); 4163: } else { 4164: # Quoted-printable encoded-word uses underscore for space 4165: $text =~ s/_/ /g; 4166: $raw = decode_qp($text); 4167: } โ—4168 โ†’ 4168 โ†’ 0 4168: return $raw;
Mutants (Total: 2, Killed: 0, Survived: 2)
4169: } 4170: 4171: # _parse_date_to_epoch( $str ) -> epoch_int | undef 4172: # 4173: # Purpose: 4174: # Parse common WHOIS date strings to a Unix epoch integer. 4175: # Handles YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ, and DD-Mon-YYYY formats. 4176: # 4177: # Entry criteria: 4178: # $str -- a defined date string; may be undef. 4179: # 4180: # Exit status: 4181: # Returns epoch integer on success, undef if the string cannot be parsed. 4182: 4183: sub _parse_date_to_epoch { โ—4184 โ†’ 4191 โ†’ 4207โ—4184 โ†’ 4191 โ†’ 0 4184: my ($self, $str) = @_; 4185: return undef unless $str;
Mutants (Total: 2, Killed: 0, Survived: 2)
4186: 4187: # Clean the string of trailing whitespace/newlines 4188: $str =~ s/^\s+|\s+$//g; 4189: 4190: # Guard Regex: Validates the strict YYYY-MM-DDThh:mm:ssZ format 4191: if ($str =~ /^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z$/) {
Mutants (Total: 1, Killed: 0, Survived: 1)
4192: # Parse the string 4193: # We use 'strptime' to create a Time::Piece object. 4194: # The 'Z' indicates UTC (Zulu time). 4195: my $epoch = eval { 4196: my $t = Time::Piece->strptime($1, '%Y-%m-%dT%H:%M:%S'); 4197: 4198: # Return seconds since the epoch 4199: # Time::Piece handles the timezone offset internally when calling ->epoch 4200: 4201: # strptime returns a local time object. 4202: # We must subtract the local timezone offset to get the true UTC epoch. 4203: return $t->epoch - $t->tzoffset->seconds;
Mutants (Total: 2, Killed: 0, Survived: 2)
4204: }; 4205: return $epoch if defined $epoch;

Mutants (Total: 2, Killed: 2, Survived: 0)

4206: } โ—4207 โ†’ 4209 โ†’ 4213โ—4207 โ†’ 4209 โ†’ 0 4207: my ($y, $m, $d); 4208: 4209: if ($str =~ /^(\d{4})-(\d{2})-(\d{2})/) { ($y,$m,$d)=($1,$2,$3) }

Mutants (Total: 1, Killed: 0, Survived: 1)
4210: elsif ($str =~ /^(\d{2})-([A-Za-z]{3})-(\d{4})/) { ($d,$m,$y)=($1,$Readonly::Values::Months::months{lc$2}//0,$3) } 4211: elsif ($str =~ /^(\d{2})\/(\d{2})\/(\d{4})/) { ($m,$d,$y)=($1,$2,$3) } 4212: โ—4213 โ†’ 4215 โ†’ 4219โ—4213 โ†’ 4215 โ†’ 0 4213: return undef unless $y && $m && $d;
Mutants (Total: 2, Killed: 0, Survived: 2)
4214: 4215: if (eval { require Time::Local; 1 }) {
Mutants (Total: 1, Killed: 0, Survived: 1)
4216: return eval { Time::Local::timegm(0,0,0,$d,$m-1,$y-1900) }; 4217: } 4218: # Approximate fallback without Time::Local โ—[NOT COVERED] 4219 โ†’ 4219 โ†’ 0 4219: return ($y-1970)*365.25*$SECS_PER_DAY + ($m-1)*30.5*$SECS_PER_DAY + ($d-1)*$SECS_PER_DAY; 4220: } 4221: 4222: # _parse_rfc2822_date( $str ) -> epoch_int | undef 4223: # 4224: # Purpose: 4225: # Parse an RFC 2822 Date: header value to a Unix epoch integer. 4226: # Timezone offsets are intentionally ignored; the function returns a 4227: # UTC-equivalent value. For the 7-day suspicious_date window the 4228: # maximum error is ~14 hours, well within the tolerance. 4229: # 4230: # Entry criteria: 4231: # $str -- a defined Date: header value string. 4232: # 4233: # Exit status: 4234: # Returns epoch integer on success, undef if the string cannot be parsed. 4235: 4236: sub _parse_rfc2822_date { โ—4237 โ†’ 4241 โ†’ 4249โ—4237 โ†’ 4241 โ†’ 0 4237: my ($str) = @_; 4238: return undef unless $str;

Mutants (Total: 2, Killed: 2, Survived: 0)

4239: 4240: # Match: DD Mon YYYY HH:MM:SS (timezone offset ignored) 4241: if ($str =~ /(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})\s+(\d{2}):(\d{2}):(\d{2})/) {

Mutants (Total: 1, Killed: 0, Survived: 1)
4242: my ($d, $m, $y, $H, $M, $S) = 4243: ($1, $Readonly::Values::Months::months{ lc $2 } // 0, $3, $4, $5, $6); 4244: return undef unless $m;
Mutants (Total: 2, Killed: 0, Survived: 2)
4245: if (eval { require Time::Local; 1 }) {
Mutants (Total: 1, Killed: 0, Survived: 1)
4246: return eval { Time::Local::timegm($S, $M, $H, $d, $m - 1, $y - 1900) }; 4247: } 4248: } โ—[NOT COVERED] 4249 โ†’ 4249 โ†’ 0 4249: return undef;
Mutants (Total: 2, Killed: 0, Survived: 2)
4250: } 4251: 4252: # _country_name( $cc ) -> country_name_string 4253: # 4254: # Purpose: 4255: # Return a human-readable country name for a two-letter ISO 3166-1 4256: # alpha-2 country code. Only the small set of statistically high-volume 4257: # spam-originating countries is covered; other codes are returned as-is. 4258: # 4259: # Entry criteria: 4260: # $cc -- a two-letter uppercase country code string. 4261: # 4262: # Exit status: 4263: # Returns the country name string, or the code itself if not in the table. 4264: 4265: sub _country_name { 4266: my ($cc) = @_; 4267: my %names = ( 4268: CN => 'China', RU => 'Russia', NG => 'Nigeria', 4269: VN => 'Vietnam', IN => 'India', PK => 'Pakistan', 4270: BD => 'Bangladesh', 4271: ); 4272: return $names{$cc} // $cc; 4273: } 4274: 4275: # _debug( $msg ) 4276: # 4277: # Purpose: 4278: # Write a diagnostic message to STDERR when verbose mode is enabled. 4279: # 4280: # Entry criteria: 4281: # $msg -- a defined message string. 4282: # 4283: # Notes: 4284: # Messages are prefixed with the class name for easy grepping. 4285: 4286: sub _debug { โ—4287 โ†’ 4289 โ†’ 0 4287: my ($self, $msg) = @_; 4288: 4289: if($self->{verbose}) {
Mutants (Total: 1, Killed: 0, Survived: 1)
4290: if(my $logger = $self->{logger}) { # May have been set in Object::Configure
Mutants (Total: 1, Killed: 0, Survived: 1)
4291: $logger->debug("[Email::Abuse::Investigator] $msg"); 4292: } else { 4293: print STDERR "[Email::Abuse::Investigator] $msg\n"; 4294: } 4295: } 4296: } 4297: 4298: 1; 4299: 4300: __END__ 4301: 4302: =head1 ALGORITHM: DOMAIN INTELLIGENCE PIPELINE 4303: 4304: For each unique non-infrastructure domain found in the email, the module 4305: runs the following pipeline: 4306: 4307: Domain name 4308: | 4309: +-- A/AAAA record --> web hosting IP --> RDAP --> org + abuse contact 4310: | 4311: +-- MX record --> mail server hostname --> A --> RDAP --> org + abuse 4312: | 4313: +-- NS record --> nameserver hostname --> A --> RDAP --> org + abuse 4314: | 4315: +-- WHOIS (TLD whois server via IANA referral) 4316: +-- Registrar name + abuse contact 4317: +-- Creation date (-> recently-registered flag if < 180 days) 4318: +-- Expiry date (-> expires-soon or expired flags) 4319: 4320: Domains are collected from: 4321: 4322: From:/Reply-To:/Sender:/Return-Path: headers 4323: DKIM-Signature: d= (signing domain) 4324: List-Unsubscribe: (ESP / bulk sender domain) 4325: Message-ID: (often reveals real sending platform) 4326: mailto: links and bare addresses in the body 4327: 4328: =head1 CACHING 4329: 4330: Two levels of caching are used: 4331: 4332: =over 4 4333: 4334: =item Per-message cache (C<$self-E<gt>{_domain_info}>) 4335: 4336: Stores domain analysis results for the lifetime of one C<parse_email()> 4337: call. Invalidated by each call to C<parse_email()>. 4338: 4339: =item Cross-message cache (CHI Memory driver, if C<CHI> is installed) 4340: 4341: Stores IP WHOIS, DNS resolution, and domain analysis results across all 4342: messages processed by the same process. TTL is one hour. Prevents 4343: redundant WHOIS queries for infrastructure that appears in multiple 4344: messages in the same run (e.g. a sending ISP seen in 500 spam messages). 4345: 4346: =back 4347: 4348: =head1 IPV6 SUPPORT 4349: 4350: IPv6 addresses are extracted from C<Received:> headers using bracketed 4351: notation (C<[2001:db8::1]>). They are tested against the private range 4352: list (which covers ::1, fe80::/10, fc00::/7, fd00::/8, and the 4353: documentation range 2001:db8::/32) and passed through C<_whois_ip()> and 4354: C<_rdap_lookup()> in the same way as IPv4 addresses. 4355: 4356: C<_resolve_host()> attempts both A and AAAA lookups when C<Net::DNS> is 4357: installed. C<_raw_whois()> uses C<IO::Socket::IP> for dual-stack WHOIS 4358: connections when that module is installed. 4359: 4360: =head1 SEE ALSO 4361: 4362: =over 4 4363: 4364: =item * L<Configure an Object at Runtime|Object::Configure> 4365: 4366: The provider_abuse, trusted_domains and url_shorteners tables can all be overridden at runtime 4367: 4368: =item * L<Test Dashboard|https://nigelhorne.github.io/Email-Abuse-Investigator/coverage/> 4369: 4370: =item * L<ARIN RDAP|https://rdap.arin.net/> 4371: 4372: =item * L<Net::DNS>, L<LWP::UserAgent>, L<HTML::LinkExtor> 4373: 4374: =item * L<CHI>, L<AnyEvent::DNS>, L<IO::Socket::IP>, L<Domain::PublicSuffix> 4375: 4376: =back 4377: 4378: =head1 REPOSITORY 4379: 4380: L<https://github.com/nigelhorne/Email-Abuse-Investigator> 4381: 4382: =head1 SUPPORT 4383: 4384: This module is provided as-is without any warranty. 4385: 4386: Please report any bugs or feature requests to C<bug-email-abuse-investigator at rt.cpan.org>, 4387: or through the web interface at 4388: L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Email-Abuse-Investigator>. 4389: I will be notified, and then you'll 4390: automatically be notified of progress on your bug as I make changes. 4391: 4392: You can find documentation for this module with the perldoc command. 4393: 4394: perldoc Email::Abuse::Investigator 4395: 4396: You can also look for information at: 4397: 4398: =over 4 4399: 4400: =item * MetaCPAN 4401: 4402: L<https://metacpan.org/dist/Email-Abuse-Investigator> 4403: 4404: =item * RT: CPAN's request tracker 4405: 4406: L<https://rt.cpan.org/NoAuth/Bugs.html?Dist=Email-Abuse-Investigator> 4407: 4408: =item * CPAN Testers' Matrix 4409: 4410: L<http://matrix.cpantesters.org/?dist=Email-Abuse-Investigator> 4411: 4412: =item * CPAN Testers Dependencies 4413: 4414: L<http://deps.cpantesters.org/?module=Email-Abuse-Investigator> 4415: 4416: =back 4417: 4418: =head1 LICENCE AND COPYRIGHT 4419: 4420: Copyright 2026 Nigel Horne. 4421: 4422: Usage is subject to GPL2 licence terms. 4423: If you use it, 4424: please let me know. 4425: 4426: =cut 4427: 4428: 1;