August 7, 2024 · By Aisha Mensah

How Accurate Is AI HS Code Extraction? A Field Test on 500 Shipments

We ran Tradevynt's extraction engine against 500 real commercial invoices and measured classification accuracy at each level of the HS code. Here's what failed and why.

When we first started building Tradevynt's HS code extraction layer, one question came up constantly in early conversations with freight forwarders: "What's the accuracy rate?" The honest answer was that we didn't fully know yet — and more importantly, "accuracy" means very different things depending on where you're measuring it in the HS code hierarchy.

A few months ago we ran a structured test on 500 real commercial invoices sourced from forwarders handling consumer electronics, industrial machinery, and chemical intermediates — three product categories where misclassification risk is highest. What we found reshaped how we think about the problem.

How We Structured the Test

The Harmonized System is a six-digit code with three nested levels: 2-digit Chapter, 4-digit Heading, and 6-digit Subheading. Countries then add their own national extensions (8-digit in the EU's CN nomenclature, 10-digit under the HTSUS for US imports). Accuracy looks very different at each level.
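To make the nesting concrete, here is a minimal sketch of how a code breaks down into its prefixes. The function name `hs_levels` and the output keys are our own illustrative choices, not part of any standard or of Tradevynt's API, and the sample code string is just an example input.

```python
def hs_levels(code: str) -> dict:
    """Split an HS-based code into its hierarchical prefixes.

    Accepts dotted or plain input; any non-digit character is ignored.
    """
    digits = "".join(ch for ch in code if ch.isdigit())
    if len(digits) not in (6, 8, 10):
        raise ValueError(
            f"expected 6, 8, or 10 digits, got {len(digits)} in {code!r}"
        )
    levels = {
        "chapter": digits[:2],      # 2-digit Chapter
        "heading": digits[:4],      # 4-digit Heading
        "subheading": digits[:6],   # 6-digit Subheading (end of the international HS)
    }
    # National extensions are only present on longer codes.
    if len(digits) >= 8:
        levels["national_8"] = digits[:8]
    if len(digits) == 10:
        levels["national_10"] = digits
    return levels


print(hs_levels("8471.30.0100"))
```

Each level is a prefix of the next, which is why a classification can be right at the Chapter or Heading level and still wrong further down.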

We evaluated extractions against broker-confirmed classifications, treated as ground truth. Each invoice had between one and seven line items, and we evaluated each line independently. The 500 invoices produced 1,847 individual line-item classifications.

We measured accuracy at four levels: Chapter (2-digit), Heading (4-digit), Subheading (6-digit), and full 10-digit HTSUS. We also tracked a fifth metric — "directional accuracy" — which we define as cases where the extracted code landed in the wrong subheading but carried the same duty rate and PGA requirements as the correct code. For many practical purposes, a directionally accurate result doesn't cause a problem.
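The scoring described above can be sketched as a prefix comparison. This is an illustration under assumptions rather than our evaluation harness: the `LineItem` fields, the `same_treatment` flag (standing in for "same duty rate and PGA requirements"), and the choice to divide the directional count by all line items are hypothetical, and the sample codes are placeholders rather than verified tariff lines.

```python
from dataclasses import dataclass

# Prefix length compared at each of the four measured levels.
DEPTHS = {"chapter": 2, "heading": 4, "subheading": 6, "htsus": 10}


@dataclass(frozen=True)
class LineItem:
    predicted: str                # extracted 10-digit code, digits only
    confirmed: str                # broker-confirmed 10-digit code, digits only
    same_treatment: bool = False  # assumed flag: same duty rate and PGA requirements


def accuracy_by_depth(items):
    """Share of line items whose predicted code matches at each prefix length."""
    if not items:
        raise ValueError("no line items to score")
    return {
        name: sum(i.predicted[:n] == i.confirmed[:n] for i in items) / len(items)
        for name, n in DEPTHS.items()
    }


def directional_share(items):
    """Share of line items that miss the 6-digit subheading but carry the same treatment."""
    if not items:
        raise ValueError("no line items to score")
    hits = sum(
        1
        for i in items
        if i.predicted[:6] != i.confirmed[:6] and i.same_treatment
    )
    return hits / len(items)


sample = [
    LineItem("8471300100", "8471300100"),                       # exact match
    LineItem("8471410150", "8473301180"),                       # right chapter, wrong heading
    LineItem("8471300100", "8471410150", same_treatment=True),  # wrong subheading, same treatment
    LineItem("3926909985", "7326908688"),                       # wrong chapter
]

print(accuracy_by_depth(sample))
print(directional_share(sample))
```

Because each level is a prefix of the next, accuracy can only stay flat or fall as the depth increases.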

What the Numbers Showed

At the Chapter level, accuracy was 97.4%. That sounds good, and for simple products it genuinely is — consumer electronics in Chapter 85, plastics in Chapter 39, steel in Chapter 72 are usually unambiguous. The invoice description alone gets you there reliably.

At the 4-digit Heading level, accuracy dropped to 91.2%. This is where product specificity starts mattering. The difference between 8471 (computers) and 8473 (parts for computers) depends on whether an item is a complete functional unit or a component — and commercial invoice descriptions frequently don't make that distinction explicit.

At the 6-digit Subheading level, accuracy was 83.7%. This is the level most relevant for duty rate determination under most trade agreements, and it's where the extraction engine has to reason about physical properties that often aren't in the document: fiber composition percentages in textiles, alloying elements in steel, active ingredient concentrations in chemicals.

At the full 10-digit HTSUS level, accuracy was 78.1%. The gap between 6-digit and 10-digit comes almost entirely from statistical splits: distinctions between 10-digit codes that carry identical duty rates but different statistical reporting requirements. These matter for compliance but rarely affect duty calculation.

Where It Failed and Why

The failures clustered into four identifiable patterns.

Vague product descriptions. About 40% of failures involved invoices where the product description was too generic to support subheading classification. Examples included "electronic components" for a batch of SMD capacitors, "industrial equipment" for a hydraulic pump assembly, and "textile goods" for a specific woven fabric. The extraction engine correctly flagged low confidence on most of these rather than hallucinating a code. That behavior is intentional, and we think it's the right tradeoff.

Cross-chapter ambiguity. Roughly 22% of failures involved products that sit at a chapter boundary. A shipment of steel tube fittings, for example, can fall under Chapter 73 (iron/steel articles) or Chapter 84 (machinery parts) depending on whether the end use qualifies the items as machinery components. Without end-use context from the importer, the extraction engine can't reliably resolve these.

Multi-component assemblies. About 18% of failures came from assemblies where the correct classification depends on analysis under General Rule of Interpretation (GRI) 3(b), specifically which component "gives the article its essential character." A power supply with an integrated control board and mounting chassis isn't straightforwardly Chapter 85 or Chapter 84; it depends on the primary function. This is genuinely hard for any classification system, automated or human.

Trade-name products. The remaining 20% of failures involved products identified by trade name only, with no technical description. "Kapton tape" on an invoice with no further detail has to be resolved via trade name lookup before classification is even possible. We've since added a trade name resolution layer that addresses most of this category.

What This Means in Practice

An 83.7% accuracy rate at the 6-digit level sounds imperfect — and it is, if you're imagining a system that operates without human review. But that's not how Tradevynt is designed to work.

Every extraction result carries a confidence score. In the 500-invoice test, 71% of line items came back at confidence 0.90 or above. On that subset, 6-digit accuracy was 96.1%. The low-confidence flags direct broker attention to exactly the cases that need it, rather than asking brokers to re-examine every line item on every document.
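Those figures also imply how the low-confidence remainder performed, which we did not report directly. The back-of-the-envelope sketch below assumes the 83.7% overall figure, the 71% share, and the 96.1% subset figure all refer to the same 1,847 line items; the function name is ours, and the result is derived, not measured.

```python
def implied_low_conf_accuracy(overall: float, high_share: float, high_acc: float) -> float:
    """Solve overall = high_share * high_acc + (1 - high_share) * low_acc for low_acc."""
    low_share = 1.0 - high_share
    if not 0.0 < low_share <= 1.0:
        raise ValueError("high_share must be in [0, 1)")
    return (overall - high_share * high_acc) / low_share


# Figures from the test: 83.7% overall at 6 digits, 71% of line items
# at confidence >= 0.90, and 96.1% accuracy on that subset.
low_acc = implied_low_conf_accuracy(overall=0.837, high_share=0.71, high_acc=0.961)
print(f"Implied 6-digit accuracy below 0.90 confidence: {low_acc:.1%}")
```

Under that assumption, the flagged 29% of line items come out at roughly 53% accuracy at the 6-digit level, which is why routing them to a broker matters more than the headline figure suggests.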

A mid-size forwarder processing around 300 entries per month was running about 2.5 hours of classification review per day before integrating document extraction. After routing only low-confidence extractions to their customs team, that dropped to roughly 45 minutes — concentrated on the genuinely ambiguous cases rather than tedious re-entry of descriptions that were already clear.

We're not claiming this removes the need for licensed customs brokers. HS code classification at the subheading level is a legal determination with duty and compliance consequences. What the extraction layer does is eliminate the manual work for cases where the document contains sufficient information to reach a reliable classification — which, in a well-described shipment, is most of them.

The Depth-Accuracy Tradeoff Is a Design Choice

One thing the test clarified internally: the system should not always try to return a 10-digit HTSUS code. For a vaguely described product, returning a confident-looking 10-digit code is more dangerous than returning a 6-digit code with a low-confidence flag at the subheading level.

We've moved toward a tiered output model: full 10-digit when confidence supports it, 6-digit with an explicit "subheading review needed" flag when it doesn't. Brokers told us this is actually more useful than a forced full code — it pre-locates the classification question in the schedule without pretending the extraction system knows something it doesn't.
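A minimal sketch of the tiered model, under assumptions: the 0.90 cutoff is borrowed from the confidence band discussed earlier and is not necessarily the production threshold, a single confidence score drives the decision, and `TieredResult` and `tiered_output` are illustrative names rather than Tradevynt's API.

```python
from dataclasses import dataclass
from typing import Optional

FULL_CODE_THRESHOLD = 0.90  # assumed cutoff, for illustration only


@dataclass(frozen=True)
class TieredResult:
    code: str                   # digits only, either 10 or 6 long
    depth: int                  # number of digits returned
    review_flag: Optional[str]  # None when the full code is returned


def tiered_output(code_10: str, confidence: float,
                  threshold: float = FULL_CODE_THRESHOLD) -> TieredResult:
    """Return the full 10-digit code only when confidence supports it."""
    digits = "".join(ch for ch in code_10 if ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"expected a 10-digit code, got {code_10!r}")
    if confidence >= threshold:
        return TieredResult(code=digits, depth=10, review_flag=None)
    # Fall back to the 6-digit prefix and say explicitly why.
    return TieredResult(code=digits[:6], depth=6,
                        review_flag="subheading review needed")


print(tiered_output("8471.30.0100", confidence=0.95))
print(tiered_output("8471.30.0100", confidence=0.62))
```

The point of the design is that the truncated result still narrows the search to one subheading while making the uncertainty visible to the broker.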

The honest picture of automated HS extraction is this: it handles routine volume reliably, surfaces complexity clearly, and compresses the time brokers spend on data entry. The judgment calls at chapter boundaries and for ambiguous assemblies remain exactly where they should be — with a licensed professional who can ask the importer the right questions.